Voice Recognition and Text to Speech in Python (ggulati.wordpress.com)
https://github.com/dannguyen/watson-word-watcher
One of the great things about it is the word-level timestamp and confidence data it returns. Here are a few supercuts I've made from the presidential primary debates:
https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...
It's not perfect by any means, but the granular results give you a place to start from. Here's a supercut of cuss words from a well-known episode of The Wire. Watson heard only 59 such words, even though one scene alone contains 30+ F-bombs:
https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be
The service is free for the first 1000 minutes each month.
But you are saying you performed speech recognition on the full video then edited it according to where the words you targeted were found. I liked the bomb/terrorist one, the others didn't seem to be "saying" anything.
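The workflow described here (transcribe the whole video, then cut around the hits) can be sketched roughly as follows. This is only an illustration under assumptions: the `(word, start, end)` tuple layout, the `find_clips` helper, and the padding and merge values are all hypothetical, the sample data is made up, and the words are assumed to be sorted by start time.

```python
def find_clips(words, targets, pad=0.25, merge_gap=0.5):
    """Return merged (start, end) clip ranges around each target word.

    words   -- iterable of (word, start_sec, end_sec), sorted by start time
               (hypothetical layout, not any particular API's format)
    targets -- set of lowercase words to search for
    pad     -- seconds of context kept on each side of a hit
    """
    clips = []
    for word, start, end in words:
        if word.lower() not in targets:
            continue
        lo, hi = max(0.0, start - pad), end + pad
        if clips and lo - clips[-1][1] <= merge_gap:
            # Close enough to the previous hit: extend that clip instead.
            clips[-1] = (clips[-1][0], max(clips[-1][1], hi))
        else:
            clips.append((lo, hi))
    return clips


# Made-up sample data, not real recognizer output.
sample = [
    ("the", 0.0, 0.2),
    ("bomb", 0.2, 0.6),
    ("was", 0.6, 0.8),
    ("a", 0.8, 0.9),
    ("bomb", 0.9, 1.3),
    ("again", 5.0, 5.4),
    ("terrorist", 9.0, 9.8),
]
clips = find_clips(sample, {"bomb", "terrorist"})
print(clips)
```

The resulting ranges would then be handed to whatever video tool does the actual cutting; that step is not shown.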
The important takeaway is that the Watson API parses a stream of spoken audio (other services, such as Microsoft's Oxford, work only on 10-second chunks, i.e. they are optimized for user commands) and tokenizes it. What you get is a timestamp for when each recognized word appears, as well as a confidence level and, if you ask for them, alternatives. Other speech-transcription options don't always provide this. I don't think PocketSphinx does, for example, nor would you get it by sending your audio to an mTurk-based transcription service.
Here's a little more detail about The Wire transcription, along with the JSON that Watson returns, and a simplified CSV version of it:
https://github.com/dannguyen/watson-word-watcher/tree/master...
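For a rough idea of what going from the JSON to a simplified CSV can look like, here is a minimal sketch. It is not the code from the linked repo. The sample response is made up, and the field names (`results`, `alternatives`, `timestamps`, `word_confidence`) are as I recall them from the Watson Speech to Text response format, so check them against real output. `flatten_words` and `to_csv` are hypothetical helpers.

```python
import csv
import io
import json

# Made-up sample shaped like a Watson Speech to Text response.
# Field names are an assumption to verify against real API output.
sample_json = """
{
  "results": [
    {
      "final": true,
      "alternatives": [
        {
          "transcript": "hello world ",
          "confidence": 0.91,
          "timestamps": [["hello", 0.10, 0.45], ["world", 0.50, 0.98]],
          "word_confidence": [["hello", 0.97], ["world", 0.85]]
        }
      ]
    }
  ]
}
"""


def flatten_words(response):
    """Yield one dict per word from the top alternative of each result."""
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]
        timestamps = best.get("timestamps", [])
        confidences = best.get("word_confidence", [])
        for i, (word, start, end) in enumerate(timestamps):
            # Confidence is optional; leave it empty when it was not requested.
            conf = confidences[i][1] if i < len(confidences) else None
            yield {"word": word, "start": start, "end": end, "confidence": conf}


def to_csv(rows):
    """Serialize the flattened rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["word", "start", "end", "confidence"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = list(flatten_words(json.loads(sample_json)))
csv_text = to_csv(rows)
print(csv_text)
```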
http://developer.att.com/apis/speech
Twilio has one that also requires payment:
https://www.twilio.com/docs/api/rest/transcription
It limits input audio to 2 minutes. And I would have to guess that its model is specifically tuned to phone messages, i.e. one speaker, relatively clear and focused audio, and certain probabilities of phrases.
Apparently Kaldi is a lot better, but good luck setting it up!
[0] https://jasperproject.github.io/
[1] https://hn.algolia.com/?query=Jasper%20Project&sort=byPopula...
Aside from circumventing lag, I can also give it some personality. I want to name it Marvin, after the robot from H2G2, so that I can say:
"Marvin, turn the TV off"
"Here I am, brain the size of a planet, and you ask me to turn off the tv. Call that job satisfaction, 'cause I don't."
I also had to 'brew install portaudio flac swig' and a bunch of other python libs. By the time it ran, 'pip freeze' returned:
altgraph==0.12
macholib==1.7
modulegraph==0.12.1
py2app==0.9
PyAudio==0.2.9
pyobjc==3.0.4
pyttsx==1.1
SpeechRecognition==3.3.0
pocketsphinx==0.0.9
My fork of the gist is here: https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90
Recognizing speech (speech-to-text) with the Python speech module
https://code.activestate.com/recipes/579115-recognizing-spee...
and
Python text-to-speech with pyttsx
https://code.activestate.com/recipes/578839-python-text-to-s...
Good stuff. I like this area.
The quality is good enough, and it is a good start for those who cannot afford to pay for Google's API.
The link to the VLC library is pretty handy.
All of those libraries have Python 2.7 versions. Actually for all of them you pip install the same library; for pyttsx, `pip install pyttsx` and ignore jpercent's update.
I'm not sure what you mean about pricing and testing for development. Are you referring to Google's services? They offer 50 reqs/day for voice recognition on a free developer API key (https://www.chromium.org/developers/how-tos/api-keys). Google Translate can also be used by gTTS; it will rate limit or block you if you send too many reqs/min or per day without an appropriately registered API key, but you could play around with it for sure.
If voice recognition is important, it might be worth investigating Sphinx more and putting in the time to tweak their English language model files. Synthesis is more difficult, though I think the Windows SAPI, OS X NSSpeechSynthesizer, and eSpeak on *nix are all "good enough." There is also a range of commercial libraries.
After trying to tweak the threshold parameters without success, I just figured I'd add a custom key-command to break the listening loop in my project.
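One way such a key-command escape hatch could be wired up is a shared stop flag that a small keyboard-watching thread sets. This is only a sketch, not the poster's code: `listen_loop`, `watch_for_quit`, `start_quit_watcher`, and the `q` key are all hypothetical, and `listen_once` stands in for whatever blocking recognizer call the project uses. The demo at the bottom simulates both the recognizer and the key press.

```python
import threading


def listen_loop(listen_once, handle, stop_event):
    """Call listen_once() repeatedly until stop_event is set.

    listen_once -- hypothetical callable returning recognized text or None
    handle      -- callable invoked with each recognized phrase
    Returns the number of phrases handled.
    """
    count = 0
    while not stop_event.is_set():
        text = listen_once()
        if text:
            handle(text)
            count += 1
    return count


def watch_for_quit(stop_event, read_line=input, quit_key="q"):
    """Block on keyboard input; set stop_event when quit_key is entered."""
    while not stop_event.is_set():
        if read_line().strip().lower() == quit_key:
            stop_event.set()


def start_quit_watcher(stop_event):
    """Run watch_for_quit in a daemon thread so it never blocks the listener."""
    thread = threading.Thread(
        target=watch_for_quit, args=(stop_event,), daemon=True
    )
    thread.start()
    return thread


# Simulated run: a fake recognizer and a scripted "key press"
# instead of a real microphone and keyboard.
stop = threading.Event()
phrases = iter(["lights on", None, "lights off"])
heard = []


def fake_listen():
    try:
        return next(phrases)
    except StopIteration:
        stop.set()  # stands in for the user pressing the quit key
        return None


handled = listen_loop(fake_listen, heard.append, stop)
print(handled, heard)
```

Note that the flag is only checked between calls, so a recognizer call that is already blocking will finish (or time out) before the loop exits.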
For simple use cases like home automation or desktop automation, I think it's a more practical approach than depending on a cloud API.
[1] https://github.com/kastnerkyle/ez-phones
[2] https://www.reddit.com/r/MachineLearning/comments/3pr4v4/are...
It's the whole "Memory is a process, not a hard drive" thing: Voice recognition as it is today is a slowly evolving graph from input data. You could in theory compress the graph and have it available offline. But it would be hard to chop it up in a way that doesn't completely bust the recognition.
Well, I guess at some point this functionality will become part of the OS. When OSX and Windows offer this, then Linux cannot stay behind, and we will see open source speech recognition libraries.
Are there any academic groups working on this topic, and do they have prototype implementations?
HOWEVER:
The only continuous dictation models available for Julius are Japanese, as it is a Japanese project. This is mainly an issue of training data. The VoxForge models are working towards releasing one for English once they get 140 hours of training data (last time I checked they were around 130); but even so the quality is likely to be far less than commercial speech recognition products, which generally have thousands of hours of training.
In terms of data, http://www.openslr.org/12/ says it has 300+ hours of speech+text from LibriVox audiobooks. Using LibriVox recordings seemed a great idea for making a freely available large dataset.
Initially, I just used the standard en-us acoustic model, the generic US English language model, and its associated phonetic dictionary. This was the baseline for judging accuracy. It was OK, but neither fast nor very accurate (likely due to my accent and speech defects). I'd say it was about 70% accurate.
Simply reducing the size of the vocabulary boosts accuracy because there is that much less chance of a mistake. It also improves recognition speed. For each of my use cases (home and desktop automation), I created a plain text file with the relevant command words. Then used their online tool [1] to generate a language model and phonetic dictionary from it.
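Preparing that plain text file can be scripted. Below is a minimal sketch that normalizes a list of command phrases into one phrase per line and lists the resulting vocabulary. `build_corpus` and the `commands.txt` file name are hypothetical, and the one-phrase-per-line layout is my assumption about what the online tool expects.

```python
def build_corpus(commands):
    """Return (corpus_text, vocabulary) for a list of command phrases.

    Phrases are lowercased, whitespace-normalized, and de-duplicated
    while keeping their original order.
    """
    seen = set()
    lines = []
    for phrase in commands:
        normalized = " ".join(phrase.lower().split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            lines.append(normalized)
    vocabulary = sorted({word for line in lines for word in line.split()})
    return "\n".join(lines) + "\n", vocabulary


commands = [
    "open editor",
    "open email",
    "Open  Browser",
    "shutdown",
    "suspend",
    "shutdown",
]
corpus_text, vocabulary = build_corpus(commands)
print(corpus_text)
print(vocabulary)

# To save it for upload (hypothetical file name):
# with open("commands.txt", "w") as fh:
#     fh.write(corpus_text)
```

A vocabulary of six words leaves the decoder far fewer ways to go wrong than a generic dictionary does, which is the point being made above.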
For the acoustic model, there are two approaches - "adapting" and "training". Training is from scratch, while adapting adapts a standard acoustic model to better match personal accent or dialect or speech defects.
I found training as described [2] rather intimidating, and never tried it out. It is likely to take a lot of time (a couple of days at least, I think, based on my adaptation experience).
Instead I "adapted" the en-us acoustic model [3]. About an hour to come up with some grammatically correct text that included all the command words and phrases I wanted. Then reading it aloud while recording using Audacity. I attempted this multiple times, fiddling around with microphone volume and gain, trying to block ambient noise (I live in a rather noisy env), redoing it, final take. Took around 8 hours altogether with breaks. Finally generating the adapted acoustic model. About an hour.
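Before spending hours recording, it may be worth checking that the script you wrote really does include every command word. A small sketch, with a hypothetical `missing_words` helper and made-up sample text:

```python
import re


def missing_words(script_text, commands):
    """Return the command words that never appear in the recording script."""
    spoken = set(re.findall(r"[a-z']+", script_text.lower()))
    needed = {word for phrase in commands for word in phrase.lower().split()}
    return sorted(needed - spoken)


# Made-up recording script and command list for illustration.
script = (
    "Please open the editor, then open my email. "
    "Suspend the machine after that."
)
commands = ["open editor", "open email", "open browser", "shutdown", "suspend"]
missing = missing_words(script, commands)
print(missing)
```

Anything reported as missing should be worked into the text before the first take.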
About 95% of the time it understands what I say. About 5% of the time, I have to repeat. Especially with phrases.
Did this on both a desktop and a Raspberry Pi. The Pi is the one managing home automation. I'm happy with it :)
[1]: http://www.speech.cs.cmu.edu/tools/lmtool-new.html
[2]: http://cmusphinx.sourceforge.net/wiki/tutorialam
[3]: http://cmusphinx.sourceforge.net/wiki/tutorialadapt
PS: Reading their documentation and searching for downloads takes more time than the actual task. They really need to improve those.
I was interested in automating transcription to text of my own reminders to myself and other such audio files, say taken on the PC or on a portable voice recorder, hence the earlier trials I did. But at the time nothing worked out well enough, IIRC.
My current desktop automation is doing command recognition. Commands like "open editor / email / browser", "shutdown", "suspend"...about 20 commands in all. 'pocketsphinx_continuous' is started as a daemon at startup and keeps listening in the background (I'm on Ubuntu).
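The glue between recognized phrases and actions can be a simple lookup table. The sketch below is not my actual setup. It assumes each recognized phrase arrives as one line of plain text (how you extract that from the recognizer's output is up to you), and the program names in `COMMANDS` as well as the `match_command` and `run_from_stream` helpers are hypothetical.

```python
# Phrase -> argv table. The program names are placeholders for illustration.
COMMANDS = {
    "open editor": ["gedit"],
    "open browser": ["firefox"],
    "suspend": ["systemctl", "suspend"],
}


def match_command(hypothesis, table=COMMANDS):
    """Return the argv list for a recognized phrase, or None if unknown."""
    key = " ".join(hypothesis.lower().split())
    return table.get(key)


def run_from_stream(lines, table=COMMANDS, execute=None):
    """Dispatch each recognized line and return the argv lists matched.

    With execute=None this is a dry run; pass a callable such as
    subprocess.Popen to actually launch the programs.
    """
    ran = []
    for line in lines:
        argv = match_command(line, table)
        if argv is None:
            continue  # ignore anything that is not a known command
        if execute is not None:
            execute(argv)
        ran.append(argv)
    return ran


# Dry run on made-up recognizer output.
ran = run_from_stream(["open editor", "something else", "  SUSPEND "])
print(ran)
```

Exact-match lookup is deliberately strict here; with only about 20 commands and a restricted language model, anything that does not match is more likely noise than a command.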
I think from a speech recognition internals point of view transcription is more complex than recognizing these short command phrases. The training or adaptation corpus would have to be much larger than what I used.
He he, the voice "shutdown" command you mention reminds me of a small assembly language routine that I used to use to reboot MSDOS PCs; it was just a single instruction to jump to the start of the BIOS (cold?) boot entry point, IIRC (JMP F000:FFF0 or something like that). Used to enter it into DOS's DEBUG.COM utility with the A command (for Assemble) and then write it out to disk as a tiny .COM file. (IOW, you did not even need an assembler to create it.)
Then you could reboot the PC just by typing:
REBOOT
at the DOS prompt.
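For the curious, the same tiny .COM file can be produced without DEBUG. This Python 3 sketch encodes a real-mode far jump (opcode 0xEA, then the offset and the segment, each little-endian) using the address as recalled above, which may not be exactly the one originally used. `far_jump` is a hypothetical helper, and the file write is left commented out.

```python
def far_jump(segment, offset):
    """Encode a real-mode x86 JMP segment:offset.

    Layout: opcode 0xEA, 16-bit offset, 16-bit segment (both little-endian).
    """
    return (
        bytes([0xEA])
        + offset.to_bytes(2, "little")
        + segment.to_bytes(2, "little")
    )


# Address as recalled in the comment above; treat it as approximate.
reboot_com = far_jump(0xF000, 0xFFF0)
print(reboot_com.hex())

# To write the 5-byte program out:
# with open("REBOOT.COM", "wb") as fh:
#     fh.write(reboot_com)
```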
Did all kinds of tricks of the trade (not just like that, many other kinds), in the earlier DOS and (more in) UNIX days ... Good fun, and useful to customers, many a time, too, including saving their bacon (aka data) multiple times (with, of course, no backups by them).