Voice Recognition and Text to Speech in Python (ggulati.wordpress.com)

192 points by ggulati 10 years ago | 50 comments

danso 10 years ago |

FWIW, IBM has a wonderful speech to text API...I've put together a repo of examples and Python code:

https://github.com/dannguyen/watson-word-watcher

One of the great things about it is the word-level timestamp and confidence data it returns...here are a few supercuts I've made from the presidential primary debates:

https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...

It's not perfect by any means, but the granular results give you a place to start from...here's a supercut of cuss words from a well-known episode of The Wire...Watson heard only 59 such words, even though one scene alone contains 30+ F-bombs:

https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be

The service is free for the first 1000 minutes each month.

pbw 10 years ago | |

It took me a while to understand what you did here. I was waiting for some kind of subtitles showing the recognition ability.

But you are saying you performed speech recognition on the full video then edited it according to where the words you targeted were found. I liked the bomb/terrorist one, the others didn't seem to be "saying" anything.

danso 10 years ago | | |

Yeah, I was a bit lazy...I could have used moviepy (which I currently use, but merely as a wrapper around ffmpeg) to add subtitles showing which word was identified...I'm hoping to make this into a command-line tool for myself to quickly transcribe things...though making supercuts is just a fun way to demonstrate the concepts.

The important takeaway is that the Watson API parses a stream of spoken audio (other services, such as Microsoft's Oxford, work only on 10-second chunks, i.e. they are optimized for user commands) and tokenizes it...what you get is a timestamp for when each recognized word appears, as well as a confidence level and alternatives if you so specify. Other speech-transcription options don't always provide this...I don't think PocketSphinx does, for example. Neither does sending your audio to an mTurk-based transcription service.

Here's a little more detail about The Wire transcription, along with the JSON that Watson returns, and a simplified CSV version of it:

https://github.com/dannguyen/watson-word-watcher/tree/master...
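
For anyone who wants to see what "JSON to simplified CSV" can look like, here is a minimal sketch. It assumes the response has the `results` / `alternatives` / `timestamps` / `word_confidence` layout; the `sample` data and the function names `flatten_words` and `to_csv` are made up for illustration and are not taken from the repo above.

```python
import csv
import io


def flatten_words(response):
    """Yield one dict per recognized word from a Watson-style response.

    Assumes each result's best alternative carries parallel lists:
    timestamps      -> [word, start_seconds, end_seconds]
    word_confidence -> [word, confidence]
    """
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]
        confidences = best.get("word_confidence", [])
        for i, (word, start, end) in enumerate(best.get("timestamps", [])):
            # Confidence may be missing if it was not requested.
            conf = confidences[i][1] if i < len(confidences) else None
            yield {"word": word, "start": start, "end": end, "confidence": conf}


def to_csv(rows):
    """Render the flattened rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["word", "start", "end", "confidence"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data, shaped like the assumed response layout.
sample = {
    "results": [
        {
            "final": True,
            "alternatives": [
                {
                    "transcript": "hello world ",
                    "timestamps": [["hello", 0.1, 0.5], ["world", 0.6, 1.0]],
                    "word_confidence": [["hello", 0.98], ["world", 0.87]],
                }
            ],
        }
    ]
}

rows = list(flatten_words(sample))
print(to_csv(rows))
```

With rows like these, cutting a supercut is mostly a matter of filtering on `word` and handing the `start`/`end` pairs to your video tool.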

th-ai 10 years ago | |

YouTube speech recognition is getting quite good, at least for talking heads in English. Are there additional top-tier APIs other than IBM's?

danso 10 years ago | | |

AT&T has its own "Watson"...but it requires signing up for a premium account, which I think involves an upfront cost:

http://developer.att.com/apis/speech

Twilio has one that also requires payment:

https://www.twilio.com/docs/api/rest/transcription

It limits input audio to 2 minutes. And I would have to guess that its model is specifically tuned to phone messages, i.e. one speaker, relatively clear and focused audio, and certain probabilities of phrases.

kleiba 10 years ago |

Kids, it's called "speech recognition". Voice recognition also exists, but it's the task of identifying a user based on his/her voice, not the task of transcribing spoken input as text.

transpy 10 years ago | |

Dad, I told you not to use my hacker news account. Log out please.

turnip1979 10 years ago | |

Are there any decent open-source projects out there (preferably with Python APIs) that do speaker recognition, or "voice recognition", reasonably well? I know this is an area of active research in academia.

jwitko 10 years ago | |

Kids?

DecoPerson 10 years ago | | |

He jests.

giancarlostoro 10 years ago |

It really would be amazing to have voice recognition software that recognizes at least enough of our language to be useful without having to reach the cloud. It is definitely a dream I hope we achieve one day. Thanks for the article; I will test it on my day off and play with it a bit.

IshKebab 10 years ago |

Don't expect this to be anything like modern "good" speech recognition. Sphinx is definitely from the '00s, when it seemed like speech recognition would never be solved.

Apparently Kaldi is a lot better, but good luck setting it up!

privong 10 years ago |

Another project along similar lines is the Jasper Project[0], which has received some HN coverage in the past several years[1]. It interfaces with many of the same speech recognition and text-to-speech libraries.

[0] https://jasperproject.github.io/

[1] https://hn.algolia.com/?query=Jasper%20Project&sort=byPopula...

squeaky-clean 10 years ago |

Very cool! I just started playing with speech recognition in Python for home automation this week. I'm controlling some WeMo switches and my PC with an Android tablet using Autovoice, and it works well as a proof of concept, but Autovoice doesn't always register commands, and the "Okay, Google" speech to text can be slow sometimes. I'd like it to take less than 5 seconds between saying "TV Off" and the TV actually turning off; with Autovoice it's anywhere from 3s to 25s depending on the lag. I also figure that with real code, I can get commands that are more flexible than Autovoice's regex.

Aside from circumventing lag, I can also give it some personality. I want to name it Marvin, after the robot from H2G2, so that I can say:

"Marvin, turn the TV off"

"Here I am, brain the size of a planet, and you ask me to turn off the tv. Call that job satisfaction, 'cause I don't."

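As a rough idea of what "more flexible than a regex" could mean, here is a small sketch that looks for a wake word, a device, and a state anywhere in the transcript, so word order matters less. The device list and the function name `parse_command` are hypothetical; nothing here comes from Autovoice or the article.

```python
import string

WAKE_WORD = "marvin"
DEVICES = {"tv", "lamp", "fan"}  # hypothetical device names
STATES = {"on", "off"}

# Translation table that deletes ASCII punctuation before splitting into words.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def parse_command(transcript):
    """Return (device, state) if the transcript addresses the wake word, else None."""
    words = transcript.lower().translate(_STRIP_PUNCTUATION).split()
    if WAKE_WORD not in words:
        return None
    # Only consider what was said after the wake word.
    words = words[words.index(WAKE_WORD) + 1:]
    device = next((w for w in words if w in DEVICES), None)
    state = next((w for w in words if w in STATES), None)
    if device is None or state is None:
        return None
    return device, state


print(parse_command("Marvin, turn the TV off"))
print(parse_command("Marvin, turn off the TV"))
```

Both phrasings resolve to the same `("tv", "off")` pair, and anything without the wake word is ignored. The sarcastic reply is left as an exercise.
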
afsina 10 years ago |

They should move from Sphinx to Kaldi and from GMM to DNN acoustic models. Instant 30% improvement.

luke-stanley 10 years ago | |

http://kaldi.sourceforge.net/about.html

turnip1979 10 years ago | | |

Does Kaldi need Windows? I only saw installation instructions for Windows. Also, I just tried PocketSphinx...it says it works on Windows and Linux. So no non-Apple or cross-platform speech rec for us Mac devs?

afsina 10 years ago | | |

They use GitHub now.

ivansavz 10 years ago |

For folks who want to try this at home on Mac OS X, you'll need to change `'sapi5'` to `'nsss'` on the line `speech_engine = pyttsx.init('sapi5')`.

I also had to `brew install portaudio flac swig` and a bunch of other Python libs. By the time it ran, `pip freeze` returned:

    altgraph==0.12
    macholib==1.7
    modulegraph==0.12.1
    py2app==0.9
    PyAudio==0.2.9
    pyobjc==3.0.4
    pyttsx==1.1
    SpeechRecognition==3.3.0
    pocketsphinx==0.0.9

My fork of the gist is here: https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90
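
If you'd rather not edit that line per machine, one option is to pick the driver name from `sys.platform`. This is only a sketch: `tts_driver_name` and `make_engine` are names I made up, and the mapping assumes the three pyttsx driver names mentioned in this thread (`sapi5`, `nsss`, `espeak`). As far as I know, calling `pyttsx.init()` with no argument also picks a default driver for the platform, so the explicit mapping is mainly useful if you want to see or override the choice.

```python
import sys


def tts_driver_name(platform=None):
    """Map a sys.platform value to a pyttsx driver name (assumed mapping)."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return "sapi5"   # Windows Speech API
    if platform == "darwin":
        return "nsss"    # OS X NSSpeechSynthesizer
    return "espeak"      # fallback for Linux and other *nix systems


def make_engine():
    """Create a pyttsx engine for the current platform."""
    import pyttsx  # imported lazily so the mapping above works without pyttsx installed
    return pyttsx.init(tts_driver_name())


print(tts_driver_name())
```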

vram22 10 years ago |

Nice work, ggulati. I did some roughly similar but more basic stuff a while ago, using the same or similar libraries (though you have researched more libs):

Recognizing speech (speech-to-text) with the Python speech module

https://code.activestate.com/recipes/579115-recognizing-spee...

and

Python text-to-speech with pyttsx

https://code.activestate.com/recipes/578839-python-text-to-s...

Good stuff. I like this area.

whizzkid 10 years ago |

Microsoft's translation API has a free tier of 1 million characters/month for text to speech, with male and female voices.

The quality is good enough, and it's a good start for those who cannot afford to pay for Google's API.

iamcreasy 10 years ago | |

Just checked, it's 2 million characters/month for free.

archiebunker 10 years ago |

Excellent post. Very interesting. I see how it works, but I'm using Python 2.7, so based on your headline I suppose it won't work for me. This is the first real lead I've seen for integrating it easily. Pricing isn't terrible if it goes to production. Too bad there is no way to test it first for development. But we're lucky to have this at all.

The link to the VLC library is pretty handy.

ggulati 10 years ago | |

Most of the stuff I found was for Python 2.7! I'll edit that into the post. My focus was on finding libraries that worked with new Python code, e.g. Python 3.5 code.

All of those libraries have Python 2.7 versions. Actually for all of them you pip install the same library; for pyttsx, `pip install pyttsx` and ignore jpercent's update.

I'm not sure what you mean about pricing and testing for development. Are you referring to Google's services? They offer 50 reqs/day for voice recognition on a free developer API key (https://www.chromium.org/developers/how-tos/api-keys). Google Translate can also be used by gTTS; it will rate limit or block you if you send too many reqs/min or per day without an appropriately registered API key, but you could play around with it for sure.

If voice recognition is important, it might be worth investigating Sphinx more and putting in the time to tweak their English language model files. Synthesis is more difficult, though I think the Windows SAPI, OSX NSSS, and ESpeak on *nix are all "good enough." There is also a range of commercial libraries.

dr_zoidberg 10 years ago | | |

I too thought it was Python 3 only before I read it. Maybe a better title would be "Coding Jarvis in Python in 2016" and then explaining in the first paragraph that this is Python 2 and 3 compatible, with your personal focus on 3?

Karlozkiller 10 years ago |

I have had a problem with using the speech_recognition library in that it does not stop listening when silence occurs.

After trying to tweak the threshold parameters without success I just figured I'd add a custom key-command to break the listening loop in my project.
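
A possible alternative to the key command is to put hard time limits on the listen call. The sketch below assumes a release of the speech_recognition library in which `Recognizer.listen()` accepts `timeout` and `phrase_time_limit` and raises `WaitTimeoutError`; I believe these arrived in versions later than 3.3.0, so check your installed version. `listen_bounded` and `demo` are hypothetical names.

```python
def listen_bounded(recognizer, source, wait_s=5, max_phrase_s=10):
    """Listen once, but never wait or record indefinitely.

    wait_s       -> give up if no speech starts within this many seconds
    max_phrase_s -> cut the recording off after this many seconds of speech
    """
    # Calibrate the energy threshold against one second of background noise,
    # which is often why silence is never detected in a noisy room.
    recognizer.adjust_for_ambient_noise(source, duration=1)
    return recognizer.listen(source, timeout=wait_s, phrase_time_limit=max_phrase_s)


def demo():
    """Example wiring; needs speech_recognition, PyAudio, and a microphone."""
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        try:
            return listen_bounded(recognizer, source)
        except sr.WaitTimeoutError:
            # Nobody spoke within wait_s seconds.
            return None
```

The phrase limit is a blunt instrument: it cuts long sentences short, but it guarantees the loop comes back even when the silence detection never fires.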

infocollector 10 years ago |

Does this work without an internet connection (once downloaded)? If yes, how big is the downloaded footprint? I still haven't gone through the webpage carefully.

akerro 10 years ago | |

There is the Sirius project, which does it; take a look:

http://sirius.clarity-lab.org/category/watch/

ggulati 10 years ago | |

If you use Sphinx for speech recognition and use pyttsx for text to speech (Windows Speech API, OSX NSSS, or ESpeak on Linux) it all works offline - see the "Jarvis's Brain" section.
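
A minimal sketch of that offline combination, assuming speech_recognition (with pocketsphinx installed) and pyttsx are available. `hear_and_repeat` and `demo` are hypothetical names and this is not the code from the post; it only shows the shape of the loop: listen, recognize locally with `recognize_sphinx`, then speak the result back.

```python
def hear_and_repeat(recognizer, source, engine):
    """Capture one phrase, transcribe it locally with Sphinx, and speak it back."""
    audio = recognizer.listen(source)
    text = recognizer.recognize_sphinx(audio)  # runs locally, no network request
    engine.say(text)
    engine.runAndWait()  # blocks until the speech has been played
    return text


def demo():
    """Example wiring; needs speech_recognition, pocketsphinx, PyAudio, and pyttsx."""
    import pyttsx
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    engine = pyttsx.init()  # default driver for the platform
    with sr.Microphone() as source:
        try:
            return hear_and_repeat(recognizer, source, engine)
        except sr.UnknownValueError:
            # Sphinx could not make sense of the audio.
            return None
```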

roel_v 10 years ago | |

No, except for the STT part using Sphinx, which is tricky to set up to be accurate enough (it seems the author of the OP didn't go that far).