>The market is huge
Apparently ... it's not
Or, rather, it's not YET "huge"
Sure - half the planet is online, but they're speaking myriad languages in more combinations of enunciation, dialect, and accent than is probably even calculable
>the Natural Language Processing of "OK Google" and Siri are quite refined at this point
Totally different to ask for today's weather and to tell a computer what to do - just like it's totally different to hit your favorite search engine and type "what is Pluto's orbit" and to write the search engine that goes off and does what you asked (and even when it does go off and do it, it still returns multiple (often conflicting) results - which leads to the whole problem of identifying authority online (something I wrote about 15+ years ago https://antipaucity.com/2006/10/23/authority-issues-online/#...))
It's also worlds different to be able to respond to variations on a theme of maybe a couple hundred search keywords (is it even that many?) and the literally unlimited number of commands people issue to their computing devices every day. Let's even say Siri is That Good™ - you've got a MacBook, iPhone, and iPad on your desk ...which one should respond when you say, "Hey, Siri"? Why that one vs this one? Do you have to start every command with the name of the device? Maybe that's not so hard at home (maybe), but get into corporate environments with naming conventions like H5GG71WLD? ... or dozens/scores/hundreds of people within listening distance of everyone's microphones getting triggered by other conversations in the room, conference calls, your cubemates' inability to attenuate their voices and aim only at their laptop when talking ...
It's a nightmare to think about - practically, let alone computationally
Most people look at the example of, say, Star Trek for voice commands to "the computer". Ever notice the computer only responds when the script demands it? Geordi shouting in Engineering commands to his team or panicked messages to the bridge are never misinterpreted by the computer as commands to it
That's mighty convenient - and not at all representative of anything resembling a reality we can create [yet]
Maybe in another few decades or centuries ... but I'd wager probably not
Another consideration: speaking is very slow compared to a click, tap, or typing a few characters at a prompt. Why would you want to intentionally make your human-to-device interactions more clumsy and error-prone?