Ask HN: Why is there no high quality method for voice control of a PC?

90 points by 4midori 4 years ago | 119 comments

Like many people who have spent decades behind a keyboard, RSI (Repetitive Stress Injury) prevents me from writing code and doing graphic design through the usual keyboard and mouse inputs.

So I have turned to a complex and highly unreliable software stack that provides both voice-to-text, and clumsy but limited control of Microsoft Windows, Chrome, etc. This includes Dragon Voice-to-text, Voice Computer, and Talon, plus a browser extension and heavy customization.

Users of Dragon will acknowledge that: a) The software is a creaky dumpster fire built on archaic code b) There is no viable alternative on the market

My question is: *how is it that no one has built something better?* The market is huge, and the Natural Language Processing of "OK Google" and Siri are quite refined at this point.

References:

Dragon: https://www.nuance.com/dragon.html

Voice Computer: https://voicecomputer.com/

Talon: https://talonvoice.com/

Klaster_1 4 years ago |

Is there a high quality voice control for anything? I only ever tried Google Assistant and it mostly always can't comprehend queries beyond "timer 10 minutes", like putting a water boiler on schedule.

reaperducer 4 years ago | |

I think it depends on how you define "high quality."

To me, I'll take a restricted set of actions if they work very reliably.

This is what I had back in the 80's with a Covox Voicemaster plugged into the joystick port of my Commodore 64. It could only understand a few phrases, but I could define those phrases, and it almost always worked.

If you define "high quality" as being able to respond to a seemingly infinite number of queries, but only understanding and replying correctly occasionally, then Siri is closer to what you want.

dublin 4 years ago | | |

Totally agree. I too had the Covox voice recognizer on an SX-64 doing voice-controlled x10 (and other) home automation in the '80s. The amazing thing is that despite the fact that an RPi 4 has more power than a Cray had back then, modern voice recognition really isn't much better than it was then. (Although it was pretty speaker-dependent...)

I have a handful of Echo Dots and Shows in places I don't mind the security risk, and they are maddeningly incompetent at doing anything in the real world other than telling the weather and acting as a voice-controlled radio (their main use...)

It would be interesting to go back to the Covox approach and rebuild it for today's tech from the ground up (shouldn't need the hardware anymore...), as it worked surprisingly well on computers that had less resources than many (most?) of today's microcontrollers...

ugjka 4 years ago | |

I don't remember the last time I had such rage as when I tried to voice search for "French Horn Rebellion - This Moment"[1] on my Android TV because I didn't want to type it in with the remote. I'm also not a native English speaker.

[1] https://www.youtube.com/watch?v=4khlVbakV_Q

herbst 4 years ago | | |

I feel you. My TV seems to understand whatever it wants. But for whatever reason the talk button is super present on the remote.

imglorp 4 years ago | |

GA is also tightly limited by business constraints.

  "Send a slack to my wife" -> "Sorry, who do you want to text?"

Multi-fail.

ahelwer 4 years ago |

I've heard cursorless (https://github.com/cursorless-dev/cursorless-talon) is good but have never tried it. Syntax-aware voice navigation of code, powered by tree-sitter queries!

I also have a friend who is a gifted programmer who lost his ability to type about a decade ago; he has put together an open-source software stack to help: http://www.cs.columbia.edu/~dwk/

Of course this doesn't really answer your question. But it's a hard problem, and you're basically forced to become a power user to reliably interact with your PC.

pokeyrule 4 years ago | |

creator of Cursorless here. Happy to answer any questions

jiehong 4 years ago | | |

Seriously cool project!

This reminds me of easy motion for vim or ace-jump for emacs.

Do you think it would be possible to have an on-demand contextual hat decoration?

Like you say “show hats words” and only words get decorated with hats and you pick one. It would allow you to maybe show hats only on square brackets or only on function arguments, etc. I find the number of hats with colors a little bit hard to distinguish; if they were contextual, they would require fewer or no colors.

Do you map voice commands to keyboard shortcuts available in vs code, or directly via the apis? (Not sure if there is a difference in the end).

Now I wish for a Cursorless plugin on the IntelliJ platform.

maxore44 4 years ago | |

I am a cursorless user myself. Using dictation software for programming is actually relatively fast when you get used to it, but editing code (which is how most of us spend the majority of our time) can be pretty slow. Cursorless was a huge productivity booster for me. It got me to switch from Emacs to VS Code which is saying something.

alexhwoods 4 years ago | | |

Same. Have to disagree though. I've been reintroducing a keyboard here and there, and whenever I have to do something in the VSCode editor, I get frustrated with the speed and end up going back to Cursorless.

I think it's a lot faster than keyboard / mouse, mostly because of how little moving of the cursor you have to do.

Could be I was slow to begin with, not super efficient with vim or emacs.

Also, "editing" is the fastest part for me, due to "bring" and "change". So little movement.

jbellis 4 years ago | | |

What distinction are you trying to make between "programming" (fast) and "editing code" (slow) ?

PaulHoule 4 years ago |

Google and Siri are good at what they do. They aren't good at other things, such as dictation.

I see the big problem in voice interaction is that a human being will ask you questions to clarify what you said if they don't understand and current systems don't even try. (Actually the search paradigm lets you do some refinement, "Ok Google" works amazingly well on Android TV.)

Superhuman accuracy at dictation doesn't translate to a useful ability to understand text. You're doing great if you only garble 1 out of 20 words. Some errors are inconsequential, but if it garbles every other sentence then you are going to feel 0% understood.

dataangel 4 years ago |

I have an RSI and I've been coding by voice exclusively for about 7 years. I used a system built on top of Dragon for most of that and in the last year switched to Talon.

I think there are multiple reasons:

* The obvious market is dictation of natural language, but this isn't what you want for voice control. If you try to use long descriptive phrases as your command language everything takes forever. So instead you end up making your own mini command language where all of your common actions are a single syllable, but now it's no longer the English or other natural language that users already know. So now your product has a substantial learning curve, just like learning a new keyboard layout.

* Everything other than Talon has terrible latency. Most existing speech recognition engines were not designed with the kind of latency you want for quick one-syllable commands.

* In order for it to be really effective you need the cooperation of applications (this is why I've written extensive emacs integration). Some tools like Windows Speech Recognition try to hook in at the UI layer in order to figure out what text is in dialog boxes and such, but in practice they seem to do a pretty terrible job. Windows Speech Recognition has a very hard time consistently understanding what links you are trying to get it to click on, for example. There's also a long tail of applications that just do their own custom UI rendering inside a blank canvas where no hook is possible.

* Good speech recognition, even if not specifically targeting computer voice control, is a genuinely hard research problem, and standard benchmarks for accuracy are misleading. You see "95% accuracy" and you are like, wow, that's a high percentage, computers almost have this speech recognition thing solved, and then you think about it harder and you go, wait a minute, that's one mistake every 20 words! Maybe you are still impressed, but then you have to take into account that when the computer does the wrong thing you'll need to issue more commands in order to correct it, which are also likely to be misinterpreted. When you make a typo with a keyboard the mistakes rarely cascade, you just hit backspace.
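To put rough numbers on that last point, here is a minimal sketch. It assumes word errors are independent, that a misrecognized command must be undone before retrying, and that a misheard "undo" is harmless and simply gets repeated. Those are simplifying assumptions for illustration, not measurements of any real engine. Under them, a 20-word utterance at 95% per-word accuracy comes out fully clean only about 36% of the time.

```python
def clean_utterance_probability(word_accuracy: float, words: int) -> float:
    """Chance that every word is recognized, assuming independent word errors."""
    return word_accuracy ** words


def expected_utterances_with_undo(command_accuracy: float) -> float:
    """Expected utterances to land one command when each miss needs an undo.

    Model (an assumption): speak the command; on a miss, repeat "undo" until
    it is recognized (1/p tries on average), then start over.
    Solving E = 1 + (1 - p) * (1/p + E) gives E = 1/p + (1 - p)/p**2.
    """
    p = command_accuracy
    return 1.0 / p + (1.0 - p) / p ** 2


if __name__ == "__main__":
    for words in (1, 5, 10, 20):
        p = clean_utterance_probability(0.95, words)
        print(f"{words:>2} words at 95%/word: {p:.1%} chance of no errors")
    for acc in (0.99, 0.95, 0.80):
        print(f"{acc:.0%} per command: {expected_utterances_with_undo(acc):.2f} utterances each")
```

The undo term is what separates this from a keyboard typo: the correction goes through the same unreliable channel as the original command.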

floatingatoll 4 years ago |

Siri can’t understand “set a timer” more often than 3 in 4 tries for me, and any sentence with more than four words will have one error in it no matter what. I envy you the accuracy your voice assistants offer you, but for me, voice control makes me want to snap my phone in half from frustration at how terrible it is. I still can’t remember why I have a reminder set with the name “2910”, which is the transcription of my spoken English sentence at the time. So at the very least, I imagine the holdup is that voice control failure conditions are miserably bad, when it fails; and, “Delete this sentence” -> “Formatting C:\” misunderstandings are too easy in modern OSes still. (Windows still offers “Format” as a primary context menu choice on the boot hard drive!)

newsbinator 4 years ago | |

I use "wake me up in x minutes". Siri always understands that 100%.

So to set a kitchen timer: "wake me up in 11 minutes"

kbenson 4 years ago |

For anyone interested in this topic, you might be interested in this tech talk[1] from Emily Shea. In it she demos a tech stack similar to what's mentioned here, to fairly good effect. It does appear that it required a lot of tweaking on her part and is optimized for the writing of code, and I'm not sure how well it functions for more general contexts.

1: https://www.youtube.com/watch?v=YKuRkGkf5HU

WorldMaker 4 years ago |

The last time I tried Dragon it was just a fancier (bloatware) UI built directly on top of Windows Voice Recognition (and IMO not adding much value on top of it): https://support.microsoft.com/en-us/windows/use-voice-recogn...

Windows Voice Recognition has been around forever (out of the box since XP); its UI is "serviceable" but not great. (It was slightly better when Cortana was briefly "out of the box" in Windows 10, but has reverted some since.) But I don't think you need to pay for Dragon (or its high memory consumption) if you don't mind taking the time to learn the quirks of Windows Voice Recognition directly. Most of Dragon's quirks are Windows' quirks anyway, papered over with a UI that makes it seem like they are adding value.

Also yeah, one of the answers to "how is it that no one has built something better?" is: Well, Microsoft tried with Cortana, got a huge blowback that "no one" wanted Cortana on their PCs, and gave up.

Stevvo 4 years ago | |

Not sure where you got that idea. Dragon predates Windows and has always used its own models.

It works very well for some people; many have written books with it.

calchris42 4 years ago |

Wow, so many replies that boil down to “because typing is better, you should just type”.

This is fairly insulting as RSI’s are very much a real thing.

Does this community also think that wheelchair ramps should never be invested in because stairs are clearly superior?

I’d rather see the brain power in this community focused on solutions. Keyboard + mouse have lasted so long because they work surprisingly well, but I hope there is a day that we dream up something better that does not require slowly giving ourselves carpal tunnel.

daanzu 4 years ago |

I have been coding entirely by voice for approximately 10 years now (by hand long before that). Most of that time I have been using the Dragonfly (https://github.com/dictation-toolbox/dragonfly) library to construct my own customized voice coding system. The library is highly flexible and open source, allowing you to easily customize everything to suit what you need to be productive. It is perhaps the power user analogue to Dragon Naturally Speaking. With it, you can certainly be highly productive coding by voice. However, it does require work to set up and customize to suit you, so it isn't really for the "general population" of computer users to just sit down and use. With regard to accuracy of speech recognition, being open allows you (with sufficient motivation) to train a custom acoustic speech model that recognizes your voice specifically extremely well.

Regarding the software packages you referenced: Yes, Dragon is trash that I want nothing to do with, because of its inefficient interface, its complete inability to accurately understand my voice, and its generally shoddy software quality. Voice Computer (which I hadn't seen before) is therefore eliminated as well, though it doesn't look terrible as a front end to Dragon to better use the OS GUI-accessibility info. Many people like Talon, but I demand something open, which I can modify to suit my needs.

Background: I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend usable by Dragonfly, itself entirely by voice. There's also a community of voice coders using Dragonfly and other tools that build on top of it, such as Caster (https://github.com/dictation-toolbox/Caster).

phkahler 4 years ago |

I would like to see Linux lead here. Have a standard voice interface where a voice-to-text process feeds a stream of text to the DE, which can then forward it to the active application (as text). I want this to be a separate "voice" stream so it is not confused with the keyboard. This would allow the eventual creation of a voice assistant at the system level, but also allow individual applications to adopt voice commands starting now. IMHO this should be like version 1 of the concept and it should last a while until we figure out what all is possible and which use-cases need a design change.

Simple dictation could be done at the DE level, where the VtoT stream would be diverted to the keyboard input of the active app. It could also be done at the app level, but this is one feature I think belongs a level up so it can be used by non-voice enabled apps.
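A minimal sketch of that routing idea, to make it concrete. Every name here (`App`, `VoiceRouter`, `voice_handler`) is hypothetical and does not correspond to any existing desktop environment API. It only shows the two paths described above: voice-aware apps receive a separate voice stream, and everything else gets DE-level dictation delivered as ordinary keyboard input.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class App:
    """A hypothetical application known to the desktop environment."""
    name: str
    # Voice-aware apps register a handler; legacy apps leave this as None.
    voice_handler: Optional[Callable[[str], None]] = None
    # Stands in for the app's normal keyboard input queue.
    typed: List[str] = field(default_factory=list)

    def receive_keys(self, text: str) -> None:
        self.typed.append(text)


class VoiceRouter:
    """Forwards recognized text to the focused app, separately from the keyboard."""

    def __init__(self) -> None:
        self.focused: Optional[App] = None

    def focus(self, app: App) -> None:
        self.focused = app

    def on_utterance(self, text: str) -> str:
        """Route one recognized utterance and report which path it took."""
        app = self.focused
        if app is None:
            return "dropped"
        if app.voice_handler is not None:
            app.voice_handler(text)  # separate voice stream for voice-aware apps
            return "voice"
        app.receive_keys(text)       # fallback: dictation as keyboard input
        return "dictation"
```

Keeping the fallback in the router rather than in each app is what lets non-voice-enabled apps benefit without any changes.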

bool3max 4 years ago | |

Who do you expect to actually work on this? Billion dollar companies can't get voice controls right. FOSS DEs struggle with keyboard/mouse input, let alone voice.

laserbeam 4 years ago |

The problem is, even if you do build amazing speech to text, it will be slower and less expressive than a keyboard + pointing device (mouse, touch, pen).

For keyboards, you lose positional logic (wasd in games). You lose shortcuts. You lose control over capitalization and formatting. You lose punctuation. You lose non-text input (code, dictating code sounds like a horrible pain). You lose function keys. And, of course, you lose speed (think of instant things you do with shortcut keys, like alt tab). Not to mention that you lose the ability to work in silence.

Make the recognition quality gorgeous, and it will still be a less flexible product than what we use today. It has value for accessibility, but people will likely choose keyboards over dictation based on UX alone.

mikob 4 years ago | |

We can choose what's best for the task at hand. In the same way most people don't use the mouse to click an on-screen keyboard, most people won't use voice control to type WASD in-game.

Dictation, for instance, is an easy-win for voice input. Clicking buttons can be more convenient with voice when we're talking to Smart TVs or, perhaps, if our hands have pizza grease all over them and we don't want to touch the keyboard.

GeeJay 4 years ago |

Voice control of ordinary computer navigation and of program writing and testing has gone nowhere in 30 years. https://dilbert.com/strip/1994-04-24

MattGaiser 4 years ago |

Part of the problem is that context is something AI is bad at and instructions are highly context dependent.

https://www.youtube.com/watch?v=FN2RM-CHkuI

boomka 4 years ago |

There are some tools. I think the reason they will never become widespread or high quality is that voice is just not a great medium for conveying that type of info in the first place. If I type a sentence and then decide to make a correction it is very difficult to explain in words but very quick to click and retype. If I want to position my window somewhere, I wouldn't even want to start thinking about how to explain it, I would just click and drag. And so on and so forth. This limits any potential markets for such tools greatly, so there is little economic incentive to develop them into anything truly high quality.

nosianu 4 years ago |

Voice input is good for high level tasks and goals, requiring a high level comprehension.

For detailed work though the more direct method of translating movements is far more efficient.

When you can describe an abstract end goal voice is great. When you have to actually do all the individual steps towards some high level goal then it's like telling a newbie programmer through some high level database optimization. You only use voice because your main goal here is to teach someone. If the PC could be taught that way, then voice would be in demand for such tasks too.

sapiol 4 years ago | |

Offtopic: Hi, I saw your comments in some older thread about chelation (Cutler Protocol). I too am from germany and have some questions about your chelation protocol. Unfortunately, I can't reply anymore on that other thread. Can you contact me at 1u3_2d227vh7iadt@byom.de ?

sapiol 4 years ago | |

Offtopic: Hi, again. That didn't work as there is a 30m time-limit on byom.de. Can you please contact me again, but this time here: D-8ynpb9p087ukef2v@maildrop.cc

nosianu 4 years ago | | |

Just get a regular account instead of those public services. Some random name at gmx.de for example. With the "d-" mail never appears, without it it works but the mail body was "undefined" when I checked what that service showed. I deleted it again.

I also have such a throwaway-but-real account in my "About" under my user name here, just added it. Should have done that anyway, you just reminded me that I should.

6gvONxR4sf7o 4 years ago |

Once automatic speech recognition (ASR) gets closer to bullet-proof, I expect this to become a huge thing, but right now, it seems like you're getting better error rates than typical.

Any input method where you frequently have to repeat yourself and undo things won't get mainstream. I'd bet people's mainstream tolerance for errors would have to be like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.
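As a back-of-the-envelope check on that tolerance, here is a small sketch that converts "one error per N minutes" into a per-command accuracy. The command rate is an assumption (10 commands per minute is picked purely for illustration), and errors are treated as evenly spread.

```python
def required_accuracy(commands_per_minute: float, minutes_per_error: float) -> float:
    """Per-command accuracy needed to average one error every `minutes_per_error` minutes."""
    commands_per_error = commands_per_minute * minutes_per_error
    if commands_per_error <= 0:
        raise ValueError("rate and interval must both be positive")
    return 1.0 - 1.0 / commands_per_error


if __name__ == "__main__":
    rate = 10  # assumed commands per minute, for illustration only
    for minutes in (5, 10):
        acc = required_accuracy(rate, minutes)
        print(f"one error per {minutes} min at {rate} cmd/min -> {acc:.1%} per command")
```

At that assumed rate the bar lands at 98% to 99% per command, and it rises as the user issues commands faster.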

wmf 4 years ago | |

The problem is that bulletproof speech recognition will only be available as a cloud service and maybe only wrapped in a Siri-style "assistant" UI. You probably won't be able to use it to replace things like Dragon.

viro 4 years ago |

... because it's not a good way to control a computer?

adolph 4 years ago | |

> ... because it's not a good way to control a computer?

This comment speaks to a perception problem for aural methods. The state of the mainstream art doesn't seem much past Forstall's demo of 10 years ago. [0] Are generations of people accustomed to WIMP UI able to wrap their heads around a much smaller interaction set? [1]

Gentner and Nielsen's work described in "The Anti-Mac Interface" [2] speaks to some of the differences people will have to mentally bridge such as:

  Mac | Anti-Mac
  Direct Manipulation | Delegation
  See and Point | Describe and Command
  WYSIWYG | Represent Meaning
  User Control | Shared Control
  Feedback and Dialog | System Handles Details
  Forgiveness | Model User Actions

0. https://www.youtube.com/watch?v=SpGJNPShzRc

1. https://en.wikipedia.org/wiki/Post-WIMP

2. https://web.archive.org/web/20120904231532/http://www.useit....

falcolas 4 years ago | |

Why not? "Open hacker news", "go back", "Play liked playlist in Spotify"

Seems fairly reasonable. It need not be the only way, but not having to use my mouse to do stupidly simple tasks wouldn't break my heart.

quartesixte 4 years ago | | |

The problem I think comes from a gap between distribution of levels of efficiency for computer-human interfaces.

Take “Open Hacker News” for example. One user might Click Browser > Open bookmarks tab > “Hacker News”.

Another, having set up a series of hotkeys, will go (on a windows machine, taskbar set for Browser pinned in position 1):

Win+1 > Ctrl+3

That is incredibly fast, much faster than saying it.

My guess is that much of the software engineering world is either users who can do the first very quickly or don’t find it cumbersome, or users who set up hotkeys like the latter and will outrace the speed of human speech on any given day. Thus the problem gets little attention.

viro 4 years ago | | |

My first guess would be "Open hacker news" requires clear audible speech. While the KB method just requires pressing 'h' and 'enter'. Also, non-cloud speech recognition just recently got decent.

mertd 4 years ago |

As far as human computer interfaces go, keyboard and mouse probably win comfortably in both bandwidth and latency against speech to text in almost all tasks. The former also requires less physical effort and creates less noise for others. My guess is that this shrinks the demand for good quality voice HCI significantly and those who really need it end up being overlooked.

mikob 4 years ago | |

You're limiting your thinking to the paradigm of visual interfaces paired with a mouse and keyboard. When all you have is a hammer...

Here's some examples where bandwidth and latency wins with speech:

1. "Play here comes the sun" vs. opening spotify, waiting, clicking the search box, typing here comes the sun, pressing enter, waiting, scanning the page and clicking the right song.

2. "Send email to John asking him if he would like to Play golf" vs. opening Gmail, waiting, clicking compose, start typing john, click the right email, tab to subject... etc.

There are cases where keyboard and mouse input is better... e.g. editing text, graphics production and editing, etc.. But certainly not in "almost all tasks" as you say. I think speech is the 3rd big computer interface that complements the mouse and keyboard and will make computers more productive and convenient for everyone regardless if you have a disability.

warrenm 4 years ago | | |

> 2. "Send email to John asking him if he would like to Play golf"

Which John? Which of that John's contact points you have saved?

..and why don't you have the keyboard shortcuts for those actions committed to muscle memory by now?

D13Fd 4 years ago | |

Agreed. And it’s not just less noise, there is a privacy component to it. I don’t really feel like broadcasting what I am doing to anyone within earshot.

6gvONxR4sf7o 4 years ago | |

Keyboard and mouse certainly don't beat voice for bandwidth (assuming error-free ASR, which doesn't exist today).

falcolas 4 years ago | | |

This guy's not wrong. You can speak clearly and comfortably at 250 words per minute. Most folks will type at less than half that.

Even shortcuts (which peer comments are relying upon) aren't all that fast - they require additional selection movement with the keyboard or mouse before they can be used.

warrenm 4 years ago | | |

Sure they do: `cp file1 file2`

Vs properly enunciating "Kah-Pee f-i-l-e-1 to f-i-l-e-2"

liveoneggs 4 years ago |

GUIs are unsuitable to anything other than the mouse + keyboard. They are the outputs of their respective inputs.

You need dedicated software built on a hypothetical V(oice)UI to get anything decent.

Otherwise your best bet is to find a mouse/trackball/trackpad/pointerstick/touch-screen/pen that doesn't injure you and use speech-to-text in simple text editors.

SlogMaverick 4 years ago |

I've kept this bookmarked for when this eventually happens to me. https://arstechnica.com/gaming/2019/04/coding-without-a-keys...

maxwelljoslyn 4 years ago |

I'm in the same boat as you, OP. Talon has proven a lifesaver ... or at least it promises to be one. I'm still getting used to it.

My finding, for text dictation (not code), is that even halfway decent dictation, such as is available on iPhone, still needs much post-dictation editing. I feel that the biggest impact to be made in this area is superior capabilities for this editing phase.

I summarized and wrote up my thoughts as a grant proposal for Scott Alexander's recent "micro grants" project. Get in touch (email in my profile) if you want to read that, or if you'd like to talk about dictation, voice control, voice coding, and editing operations -- or just get some moral support.

maxore44 4 years ago | |

I have to plug the cursorless vs code extension to speed up code editing. It was a game changer for me.

ipnon 4 years ago |

Speech models today can mine the entire corpus of published conversation and return the most likely response to a given statement. That's not how we converse. Every relationship you have is a little model in your brain that we call a person's "personality." Every one talks differently, has different frames of reference, uses different codes of language, different assumptions. Cutting edge speech models work perfectly for the perfectly average speaker, but that person does not exist! The farther we stray from the mean, the more alienating these speech models become.

kbenson 4 years ago | |

> Cutting edge speech models work perfectly for the perfectly average speaker, but that person does not exist!

This is a well known pitfall in statistics, I would think, given there are extremely famous stories about this exact issue causing deaths. In the 1950's, the air force was trying to figure out why their pilots were dying, and determined it was because their cockpit designs, which were built for "average" pilots, were a poor fit for almost every real world pilot.[1]

1: https://www.thestar.com/news/insight/2016/01/16/when-us-air-...

cptaj 4 years ago | |

I have massive issues with speech recognition software. It doesn't work for me in either english or spanish. Statements like "google and siri are so advanced now" feel like people are collectively pranking me.

That said, I too have wondered why we don't have speech control for computers or at least appliances.

You don't need to parse all language. Just a standard set of primitives like you'd find on a remote should be way easier to recognize and can even be selected for their ease of parsing. Simple things like on, off, next, back, louder, etc.
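A rough sketch of the closed-vocabulary idea, using only Python's standard library. It snaps a noisy transcript onto the nearest allowed primitive and rejects anything that is not close enough. This is a stand-in for constraining the recognizer itself, which is where the real gain would come from. The word list extends the examples above with a few assumed extras (`quieter`, `play`, `pause`), and the 0.6 cutoff is an arbitrary choice.

```python
import difflib
from typing import Optional

# "on", "off", "next", "back", "louder" come from the comment above;
# the rest are assumed additions for illustration.
PRIMITIVES = ["on", "off", "next", "back", "louder", "quieter", "play", "pause"]


def match_primitive(heard: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest primitive to a noisy transcript, or None if none is close."""
    word = heard.strip().lower()
    matches = difflib.get_close_matches(word, PRIMITIVES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for heard in ("Louder", "lowder", "nex", "xylophone"):
        print(repr(heard), "->", match_primitive(heard))
```

Returning `None` instead of guessing matters here: with a small command set, doing nothing is usually a safer failure than doing the wrong thing.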

ipnon 4 years ago | | |

An interesting project: Automatically convert a terminal command's `--help` page to a speech model. Run that over $PATH, then you never have to type again!
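A first step toward that could look like the sketch below: pull the long options out of some help text and turn each into a speakable phrase. It assumes GNU-style `--long-option` flags, and the function name and sample help text are made up for illustration. Real `--help` output varies a lot, so this is only a starting point, not a full speech model.

```python
import re
from typing import Dict

# Matches GNU-style long options such as --dry-run or --color (assumed format).
LONG_FLAG = re.compile(r"(?<![\w-])--([a-z][a-z0-9]*(?:-[a-z0-9]+)*)")


def spoken_forms(help_text: str) -> Dict[str, str]:
    """Map a speakable phrase ("dry run") to the flag it stands for ("--dry-run")."""
    phrases: Dict[str, str] = {}
    for match in LONG_FLAG.finditer(help_text):
        name = match.group(1)
        # Keep the first occurrence if a flag is mentioned more than once.
        phrases.setdefault(name.replace("-", " "), "--" + name)
    return phrases


if __name__ == "__main__":
    # Hypothetical help text; in practice this would be captured from a real command.
    sample = """Usage: tool [OPTIONS] FILE
      -n, --dry-run      show what would be done
      -v, --verbose      print more output
          --color=WHEN   colorize the output
    """
    for phrase, flag in spoken_forms(sample).items():
        print(f"say {phrase!r} -> {flag}")
```

The resulting phrase list could then serve as the restricted vocabulary for a command grammar, one per executable found on `$PATH`.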

smorgusofborg 4 years ago |

If I had to program with audio, I would make a steno dictionary with a theory that results in a pronunciation that is sufficiently different from normal language and then speak it instead of chord it.

The complexity of doing that is IMO a good explanation of why commercial audio recognition is worthless to someone who programs a computer instead of interacts with humans over a computer.

http://plover.stenoknight.com/2013/03/using-plover-for-pytho...

mikob 4 years ago |

I too noticed that Dragon is trash (2.2/5 rating on the Chrome Webstore, yikes). I've been working on one that's purpose-built for the web. Most software today is moving towards the web, so that's where we narrowly focus. It works everywhere (including HN, Reddit, YouTube, Gmail... even Duolingo).

You can DL it here: https://chrome.google.com/webstore/detail/lipsurf-voice-cont...

alexhwoods 4 years ago |

Talon + Cursorless.

People have built the tools you're talking about. They're Talon and Cursorless.

I think you'd be shocked if you saw how productive some people in the Talon community are. Be sure to join the community Slack.

twright 4 years ago |

Have you looked at Talon[1] for programming and system control? I used it for a few months last year and while the first two weeks were difficult I was able to nail down a workflow that really suited me. After another few weeks I felt as comfortable and capable working with it as I did a keyboard and mouse. (Cannot attest to its capabilities on Windows)

[1] https://talonvoice.com/

simonblack 4 years ago |

JUST IMAGINE THE SCENARIO:

You have just been fired and as the security boys are escorting you to the door, you call out, loud enough to be heard in all the cubicles -

"Computer! Format all drives!"

OR MAYBE THIS OTHER SCENARIO:

The guy in the next cubicle has a loud voice and while he is commanding his own computer to "Exit the file without saving" you find that the work you have carefully constructed over the last four hours is suddenly thrown away too.

newusertoday 4 years ago |

I tried using talonvoice but the recognition engine failed to understand a lot of words. I then searched for the pronunciation of those words on Google, and talonvoice detected them correctly. In the end I learned to pronounce the words in American English so that talonvoice can understand them ;-). Not what I was hoping for; I wanted to teach the computer to recognize my voice, not the other way around.

daanzu 4 years ago | |

With an open system/engine, you can train your own personal speech model. For kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), you can do so without all that much difficulty, although the process/documentation could certainly use improvement.

I bootstrapped my personal speech model by retaining the commands from me using WSR. My voice is quite abnormal, and it took only 10 hours of speech data to train a model orders of magnitude more accurate than any generic model I've ever used. And of course, I retain much of my usage now with Kaldi, so my model improves more and more over time. A virtuous flywheel!

miguel-muniz 4 years ago |

Has anyone here used Apple's built in Voice Control[1] feature in MacOS? I imagine having something built in the OS is better than third party software, but I haven't used any so I don't know

[1] https://support.apple.com/en-us/HT210539

rileyphone 4 years ago |

I’ve been messing around with https://github.com/ideasman42/nerd-dictation which, with the big model, gives surprisingly accurate local detections. Definitely more diy/hacker focused than actually being a solution though.

browningstreet 4 years ago |

Alexa can't even hear the 3 things I say to it every single day with any accuracy.

But, it seems like all voice control development keeps getting bought up by the Big 3, so it's not likely to have any significant breakthroughs independent of what Apple, Google and Amazon think voice control is good for.

doesnotexist 4 years ago |

Have you read this blog post by Josh W. Comeau outlining his experience with Talon for a developer workflow?

https://www.joshwcomeau.com/blog/hands-free-coding/

1vuio0pswjnm7 4 years ago |

Surprised that something like Talon + RPi has not tapped into the "smart speaker" market.

cpach 4 years ago |

Just a thought: Have you tried Dasher…?

It’s an alternative input method. Might be worth giving a try.

https://www.inference.org.uk/dasher/DasherSummary2.html

kleer001 4 years ago | |

oooh, I always wanted that. Sad it doesn't look like it's advanced much.

philonoist 4 years ago |

For those who need immediate help for RSI, use ->

Voice Finger by Cozendy [$9.99]

Lenovo Voice Control from msstore [free]

Amazon Alexa from msstore [free]

"Win Key + h" for the inbuilt text box dictation [inbuilt]

serenade.ai [$$]

I don't have an exact answer to you OP but I hope someone builds a helpful one for you.

daviddever23box 4 years ago |

Arguably, one might wish to create an audible, non-linguistic shorthand for positional control that would allow for higher efficiency when, say, retouching an image within Photoshop, but without the use of hands.

singularity2001 4 years ago |

Nuance started to completely monopolize the market 20 years ago and has crippled a lot of innovation in that space. It's still a minefield of ugly patents for any commercial competitors.

mleonhard 4 years ago |

Switching to a tenting split keyboard (Goldtouch V2) and vertical mouse (Evoluent) reduced my RSI. Strength training (Les Mills Body Pump) is what finally solved it. Have you tried those?

wizzerking 4 years ago |

The major problem when using voice to control a machine is tremors in the voice as the work day proceeds, when the person is stressed, or when the person is experiencing health issues. All these situations will change the timbre and, in some cases, the intonation, such as which syllable gets the emphasis. Now top that off with accents, like a Hispanic person's, or regional slang. Deep learning kits like https://github.com/FreddieAbad/Voice-Recognition-using-Deep-... are making headway but are still far from general voice recognition.

Avatars 4 years ago |

"Why is there no high quality method for voice control of a PC?" For the same reason there's no standardized encryption for everyone's comms. Or, 'Why is there no software for GPS that works on PCs that is easy and readily available?'. Same reason.

There is a voice assistant app for Android that uses Vosk called Dicio (on F-Droid). Storage is cheap and easy. Processing power is there even in cheap 3rd world phones. I personally detest typing and would love to talk to my devices without any 3rd party nonsense requirements. Truly there is none because the powers that be do not want everyone thinking they are in control, essentially of anything.

Arubis 4 years ago |

Absent regulation or other incentives to nudge the market otherwise, the overwhelming majority of software is and will be written to use existing input methods--i.e. adding a new input method isn't the core competency of a team creating the world's best todo list app.

With that precondition, any voice-to-control layer on the desktop is in the tough situation of translating between voice input and a piece of software that was designed without voice input in mind.

Google and Siri, etc., aren't as beholden to the desktop/browser interface paradigm, so they don't have to perform this interface translation.

walls 4 years ago |

A friend of mine uses VoiceAttack in a few VR games and it seems to work decently for triggering actions. Not sure if it's any good at transcription though.

wnolens 4 years ago |

That sounds exhausting.

"Open this program"

"Minimize"

"Focus on this text input"

..dictate..

"switch to command mode"

"save and close"

I'd rather just: "click click tab type ctrl-S"

falcolas 4 years ago | |

The actions are a bit more like

"move mouse to this 100x100 pixel square, click twice within 100ms"

"move mouse to this 20x20 pixel square, click once"

"Move mouse to this 100x1000 pixel rectangle, click once"

"Type text at 1/5 (1/2 if you're particularly fast) of the speaking rate"

"Move your pinkie (your weakest finger) to 'ctrl' and click, move your index to 's' and click, release both, verify it worked with a visual cue", then either another 20x20 mouse maneuver, or "move your thumb to 'alt', and your index finger to 'f4' (assuming you have access to the function keys), click and release"

Moving a mouse to a very specific spot on a screen is a relatively slow - and hard if you have any motor control issues - task.

wnolens 4 years ago | | |

You're right. But OP didn't understand why it doesn't exist because "market is so big."

I presumed they meant more than the extreme edge of RSI sufferers. So I ran the thought experiment.

I've had a mild RSI. The solution was to get a fancy ergo mouse/keyboard/desk/chair and retrain myself. I've even seen a guy use a joystick instead of a mouse.

sp332 4 years ago | |

The post is about people who physically can't do that.

warrenm 4 years ago | | |

OP claims the market is "huge"

It's not

It's tiny (at best)

Tiny markets don't tend to get much attention

cf 4 years ago | |

Most of these systems entail developing a shorthand. For operations you expect to do a lot you assign one-syllable commands.
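As a rough illustration of that kind of shorthand, here is a minimal Python sketch. The words, key bindings, and function name are all made up for the example; they are not taken from Talon, Dragon, or any other tool mentioned in this thread. It only maps recognized words to key chords and leaves the actual speech recognition and key injection out.

```python
from typing import Dict, List, Tuple

# Hypothetical shorthand table: one-syllable spoken words mapped to key chords.
# These bindings are assumptions for illustration only.
SHORTHAND: Dict[str, Tuple[str, ...]] = {
    "save": ("ctrl", "s"),
    "shut": ("alt", "f4"),
    "next": ("tab",),
    "back": ("shift", "tab"),
}


def expand(utterance: str) -> List[Tuple[str, ...]]:
    """Translate a space-separated utterance into key chords.

    Words that are not in the shorthand table are silently skipped.
    """
    return [SHORTHAND[word] for word in utterance.lower().split() if word in SHORTHAND]
```

With this table, saying "save shut" would expand to ctrl+s followed by alt+f4. Frequent operations get the shortest words, so the spoken cost per action stays low.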

marginalia_nu 4 years ago | | |

Let's not create a false dichotomy between mouse control and voice control. There are other alternatives that are arguably less handicapping than being reduced to voice commands.

amelius 4 years ago |

Because Google is keeping all their crowdsourced voice data secret.

danShumway 4 years ago |

I'll give an answer in a slightly separate direction: there aren't engines that are both good enough and open enough to hook into that Open Source communities can build around them.

There are two ways that new software gets built: either the market is big enough and accessible enough that commercial software gets built, or the software is easy enough to build that hobbyists enter the space and solve their own problems. For example, the commercial market for keyboard-driven interfaces is also quite small, but we still have stuff like Sway. But a good keyboard-driven interface is easier to build than speech recognition.

I've been curious about this area for a while, but my understanding is voice-to-text Open Source solutions are still kind of primitive for general text transcribing. The libraries aren't very fun to work with, they're often embedded Python/Java "stuff", and the accuracy isn't great if you advance past the level of text transcription. Additionally, controlling computers and hooking into X or Wayland feels a bit hacky.

That being said, I'll push back on people who are saying that no one would want to control an interface this way. The success of systems like Alexa/Siri/Google is pretty definitive proof to me that (all their weaknesses aside) there is a market for voice interfaces. But the ties between that market and the desktop are not strong, and the ecosystem isn't open enough to really build on in that direction.

I suspect that until efforts like Mozilla's open speech datasets pick up more steam and become competitive (if they ever do), it's going to be kind of slow going to find solutions because it's not immediately obvious how to enter the market, either as a commercial company or as an Open Source dev. But maybe I'm wrong and I just haven't researched it enough and the area is totally ripe for disruption. Maybe people with RSI would tolerate something like clipping a Bluetooth mic to their lapel, and that would boost accuracy. Maybe there's another way to approach entering code that isn't just straight text recognition, possibly combining it with some kind of AST or code analysis that makes it easier to guess what people are saying.

In any case, I don't think the problem is that people don't want to talk to their computers. Personally I don't like using voice assistants, but they are very popular, in no small part because of the voice part. So maybe there is an evolution of desktop UI controls that could become really popular, or at least competitive with entrenched solutions for people with limited mobility or RSI. But it would require someone to introduce some kind of actual UX innovation into the space, or to find a way of getting over the moat around good recognition and OS integration.
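To make the "combine recognition with code analysis" idea a bit more concrete, here is a minimal Python sketch under some loud assumptions: the recognizer is assumed to return several candidate transcriptions in order of confidence, the target language is Python, and the function names are hypothetical. It simply prefers the first candidate that parses, using the standard library's `ast.parse`. A real system would need far more than a syntax check.

```python
import ast
from typing import List


def parses(candidate: str) -> bool:
    """Return True if the candidate text is syntactically valid Python."""
    try:
        ast.parse(candidate)
        return True
    except SyntaxError:
        return False


def pick_candidate(candidates: List[str]) -> str:
    """Pick the first candidate that parses as Python.

    Candidates are assumed to be ordered by recognizer confidence, so if
    none of them parse we fall back to the recognizer's top guess.
    """
    for candidate in candidates:
        if parses(candidate):
            return candidate
    return candidates[0]
```

For example, given the candidates "four i in range(10): pass" and "for i in range(10): pass", the second one is chosen because only it is valid syntax, even though the recognizer ranked it lower.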

warrenm 4 years ago |

>The market is huge

Apparently ... it's not

Or, rather, it's not YET "huge"

Sure - half the planet is online, but they're speaking myriad languages in more combinations of enunciation, dialect, and accent than is probably even calculable

>the Natural Language Processing of "OK Google" and Siri are quite refined at this point

Totally different to ask for today's weather and to tell a computer what to do - just like it's totally different to hit your favorite search engine and type "what is Pluto's orbit" and to write the search engine that goes off and does what you asked (and even when it does go off and do it, it still returns multiple (often conflicting) results - which leads to the whole problem of identifying authority online (something I wrote about 15+ years ago https://antipaucity.com/2006/10/23/authority-issues-online/#...))

It's also worlds different to be able to respond to variations on a theme of maybe a couple hundred search keywords (is it even that many?) and the literally unlimited number of commands people issue to their computing devices every day. Let's even say Siri is That Good™ - you've got a MacBook, iPhone, and iPad on your desk ...which one should respond when you say, "Hey, Siri"? Why that one vs this one? Do you have to start every command with the name of the device? Maybe that's not so hard at home (maybe), but get into corporate environments with naming conventions like H5GG71WLD? ... or dozens/scores/hundreds of people within listening distance of everyone's microphones getting triggered by other conversations in the room, conference calls, your cubemates' inability to attenuate their voices and aim only at their laptop when talking ...

It's a nightmare to think about - practically, let alone computationally

Most people look at the example of, say, Star Trek for voice commands to "the computer". Ever notice the computer only responds when the script demands it? Geordi shouting in Engineering commands to his team or panicked messages to the bridge are never misinterpreted by the computer as commands to it

That's mighty convenient - and not at all representative of anything resembling a reality we can create [yet]

Maybe in another few decades or centuries ... but I'd wager probably not

Another consideration: speaking is very slow compared to a click, tap, or typing a few characters at a prompt. Why would you want to intentionally make your human-to-device interactions more clumsy and error-prone?

sleepingadmin 4 years ago |

It certainly exists, and I have set this up for various blind people who make do. Unfortunately I don't recall what it was exactly, but they bought it and all that.

The thing about voice is how weak it is. Even if you've trained it well and you speak well, which I don't, it won't be as good as a keyboard.

Putting work into voice like this for productivity is pointless. Any effort is best placed in brain computer interfaces. Hopefully not ones that require surgery, like Neuralink is doing. More of a headset, like Valve and OpenBCI are doing.

Let's just wear a headset and work; keyboards can just be there in case you need them.

vasco 4 years ago | |

Agree completely on your points about brain computer interfaces and voice not being worth investing in outside of supporting people with accessibility issues, unfortunately.

It's also super weird to speak to a computer. Typing, touching or thinking are all fine, but somehow sitting in a room talking to a machine is a bit weird, even though it's not weird if I'm on a call. I can't explain it. Are there others with similar experience?