Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition (github.com)

172 points by dudisbrie 9 years ago | 42 comments

gambler 9 years ago |

"Some of Deepmind's recent papers are tricky to reproduce. The Paper also omitted specific details about the implementation, and we had to fill the gaps in our own way."

So, I'm not the only one seeing this issue. It seems like many recent AI papers want to look as impressive as possible while giving you as little implementation info as possible. This bothers me, because it defeats the very purpose of research publication.

deepnotderp 9 years ago | |

This is more specific to DeepMind, actually; Facebook and others have been pretty good about publishing code.

CamperBob2 9 years ago | |

Unfortunately I think you'll find similar complaints in every scientific field. Often, results either aren't described well enough to be reproduced, they're too expensive or difficult to reproduce, or they rely on closed-source software and/or inadequately-documented hardware.

bmc7505 9 years ago |

A few weeks ago, a deep learning researcher at one of the world's leading speech groups told me off-the-record that offline, human-parity speech recognition would be "coming soon" to mobile devices. Not sure s/he realized just how soon that would be. Even though state-of-the-art ASR is really expensive to train, recognition is extremely cheap to run, even on lower-power devices. [1][2] With specialized silicon, you can do this, continuously, for free, on something like a smartwatch. You don't need to open a websocket or call an API running on some beefy server to do this; speech-to-text is now a basic commodity. Fully offline, ubiquitous speech recognition is right around the corner. With human-level speech synthesis [3], speech applications are going to get very interesting, very quickly.

[1] http://niclane.org/pubs/deepx_ipsn.pdf

[2] https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf

[3] https://github.com/ibab/tensorflow-wavenet

braindead_in 9 years ago | |

A consumer-focused, human-parity ASR service will disrupt so many industries, including mine. I run a human-powered transcription service where we transcribe files with high accuracy. I am just waiting for the day when our transcribers can work off an auto-generated transcript instead of typing it all up manually. I'll pay good money for a service where I can just send a file and get an 80-90% accurate transcript with speaker diarization.

skoocda 9 years ago | | |

We've chatted - just an update that I'm implementing diarization this weekend :)

imaginenore 9 years ago | | |

I hope you realize your business is about to go out of business. The only reason you can charge people now is because the automatic recognition sucks compared to humans.

kcorbitt 9 years ago |

This is really exciting. I previously worked at a startup that could have benefited enormously from even 90% accurate speech recognition. As of six months ago when I last looked, there were no open source speech-to-text libraries with anything approaching the performance of the proprietary work by Google, Microsoft, Baidu, etc. The closest thing was CMU Sphinx, but its accuracy was unacceptable.

Props to the author, and especially to the DeepMind researchers who published their work! I look forward to living in a world where this type of technology is ubiquitous and mostly commoditized.

bmc7505 9 years ago | |

The CMU Sphinx project as it stands is basically dead. Even though they recently implemented some sequence-to-sequence deep learning techniques for g2p [1], the core stack is still based on an ancient GMM/HMM pipeline, and current state-of-the-art projects (even open source ones) have leapfrogged it in terms of accuracy. If you're implementing offline speech recognition today, start with something like this or Kaldi-ASR [2]. It will take a bit of work to get your models running on a mobile device, but the end result will be much more usable.

[1] http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...

[2] http://kaldi-asr.org/

snadal 9 years ago | | |

We've worked with CMU Sphinx in the past too, and the advances in this area over the last few months are absolutely amazing.

A little bit off-topic, but do you know of any recent work or papers on speech recognition for language teaching? (I mean analysing and rating a speaker's accuracy, detecting incorrect pronunciation of phones, and so on.)

brandoncarl 9 years ago |

To the authors: did you try any of your own recordings? I've used my own and clips online, in WAV and other formats, at various sampling rates.

All of the results come back gibberish. The results in the training data seem just fine. Curious if you've tested the above to ensure it didn't overfit.

craigbaker 9 years ago |

Is this really speech recognition from raw waveforms? It looks like they're extracting MFCC features from the raw audio, and using that as input to the neural network. I thought that the point of WaveNet was that it took the raw waveform directly as input, unlike previous architectures which first extract spectral features such as MFCCs to use as the input.

bmc7505 9 years ago | |

Apparently, they tried to use the raw audio waveform with the original setup from the WaveNet paper but couldn't get it to train on their TitanX, so they used MFCCs instead. It's not exactly clear why this is the case.

"Second, the Paper added a mean-pooling layer after the dilated convolution layer for down-sampling. We extracted MFCC from wav files and removed the final mean-pooling layer because the original setting was impossible to run on our TitanX GPU." [1]

[1] https://github.com/buriburisuri/speech-to-text-wavenet#speec...
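For readers unfamiliar with the mean-pooling layer mentioned in the quote, here is a minimal sketch of non-overlapping mean pooling as a down-sampling step. It is not the repository's or the paper's code; the function name `mean_pool` and the window width are hypothetical, chosen only for illustration.

```python
def mean_pool(seq, width):
    """Down-sample a sequence by averaging non-overlapping windows.

    A trailing window shorter than `width` is dropped.
    `width` is a hypothetical parameter, not a value taken from the paper.
    """
    return [
        sum(seq[i:i + width]) / width
        for i in range(0, len(seq) - width + 1, width)
    ]


activations = [1, 2, 3, 4, 5, 6, 7]
pooled = mean_pool(activations, 2)
print(pooled)  # [1.5, 3.5, 5.5]
```

Pooling with width w shortens the sequence by a factor of about w, which is why it matters for memory when the input is raw audio.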

RandomInteger4 9 years ago |

How much bandwidth is consumed by voice communication, such as when speaking to someone on Skype or over the phone, vs. the same words transmitted as text?

Perhaps future communication applications can have a WaveNet on either end, which learns the voice of the person you're communicating with and then only sends text after a certain point in the conversation?

I'm coming at this from a point of ignorance though, so correct me if I've made erroneous assumptions.

dest 9 years ago | |

Text communication is much lighter (a few bytes/s vs kb/s), but you may miss the non-verbal content of voice.

RandomInteger4 9 years ago | | |

By non-verbal do you mean like ambient sound? Dogs barking, child yelling, garbage truck garbage trucking? I don't know. If they can do voice, then it might be possible to do ambient sounds if there are separate nets trained with a library of ambient sounds, tuned so that a sound is not the same every time it plays, like how, when you have tiled graphics, there are algorithms that remove the unnatural sameness from one tile to the next.

This could have interesting implications for Foley-artists of the 21st century.

How likely is it that such tech would help lower-budget companies that want to implement voice communication within their software, say for video games or similar?

Hmm, now this has me wondering what implications this has for voice acting as well.

EDIT: We can call the ambient sound symbols sent over the wire "Soundmojis" or "amojis" or "audiomojis"

kondro 9 years ago | |

Less than 8 kbps for most voice traffic. It pales in comparison to the quantity of bandwidth consumed each day on video.
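A rough back-of-the-envelope version of the voice vs. text comparison in this subthread. The 8 kbps voice figure is the one quoted above; the speaking rate (150 words per minute) and the average of 6 bytes per word (five letters plus a space, one byte each) are assumptions picked for illustration, not measurements.

```python
def text_bitrate_bps(words_per_minute=150, bytes_per_word=6):
    """Bits per second needed to send speech as plain text.

    Both defaults are assumed values for illustration only.
    """
    return words_per_minute * bytes_per_word * 8 / 60


VOICE_BPS = 8_000  # the ~8 kbps voice figure quoted in the thread

text_bps = text_bitrate_bps()
print(text_bps)              # 120.0 bits/s, i.e. 15 bytes/s
print(VOICE_BPS / text_bps)  # voice is roughly 67x heavier
```

Under these assumptions text comes out at about 15 bytes/s, which matches the "a few bytes/s vs kb/s" estimate given earlier in the subthread.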

throwaway13337 9 years ago |

This seems super useful for most speech recognition - understanding context.

It doesn't seem like the mainstream engines (Alexa, Google Voice, Siri) are context aware. Why not?

doublerebel 9 years ago | |

Context involves location, which 99% of the time those bots don't take into consideration. Context does not involve knowing everything about your email or being able to search the entire web. It's much more connected to what you just did and where and when you are doing it.

This is what I'm solving at Optik. Helping you manage the things that you care about in the place that you are, and NOT exposing your personal details to cloud computation.

EGreg 9 years ago | |

Also why can't we track emails sent from our iOS device like we can with desktop GMail plugins??

teajunky 9 years ago |

Wow, train.py contains only 83 lines of code (including a few empty lines and comments). And recognize.py is only a little bit longer, with 108 lines. Very impressive.

bra-ket 9 years ago | |

typical of machine learning, a whole lot of talking about a few lines of code

hyperbovine 9 years ago | | |

FFT is 4 lines, what is your point.
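To make the point concrete, here is a sketch of a recursive radix-2 Cooley-Tukey FFT in plain Python. It is a textbook illustration, not tuned for speed, and it assumes the input length is a power of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    # Twiddle factors applied to the odd-indexed half.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]
```

The core really is only a handful of lines, even though the idea behind it took a long time to be widely adopted.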

IshKebab 9 years ago |

Can someone explain why MFCC is used rather than allowing the neural network to learn from the raw waveform? I looked back in the literature and the intention of MFCC & PLP seems to be to remove speaker-dependent features from the audio in order to reduce the dimensionality of the input. But I thought the whole point of neural nets is that they can learn from very high dimensional inputs, no?

I had a go at implementing wave->phoneme recognition using a simple neural net and it seemed to work pretty well.
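A quick size comparison may help with the dimensionality question. This sketch counts the time steps a network would see for raw audio versus MFCC frames. All parameters (16 kHz sampling, a hop of 512 samples, 20 coefficients per frame) are assumed values for illustration, not the settings used in the linked repository, and the frame count follows the common "1 + samples // hop" convention for centered framing.

```python
def raw_samples(seconds, sample_rate=16_000):
    """Number of time steps when feeding the raw waveform."""
    return int(seconds * sample_rate)


def mfcc_frames(seconds, sample_rate=16_000, hop_length=512):
    """Number of MFCC frames under the assumed hop length."""
    return 1 + raw_samples(seconds, sample_rate) // hop_length


seconds = 5
n_mfcc = 20  # assumed coefficients per frame

print(raw_samples(seconds))           # 80000 time steps
print(mfcc_frames(seconds))           # 157 time steps
print(mfcc_frames(seconds) * n_mfcc)  # 3140 values in total
```

Under these assumptions the sequence is roughly 500 times shorter with MFCCs, which reduces the memory needed for activations. Whether that fully explains the choice made in this project is not stated in the thread.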

Karlozkiller 9 years ago |

This is exactly what I wanted for my master's thesis about half a year ago, when I wanted to use s2t with good control over the system without having to implement everything myself.

echelon 9 years ago |

Did the original WaveNet text to speech demo come with a paper or source code? (I didn't see either.) I'm interested in techniques, particularly neural network-related, to improve the quality of my Donald Trump text to speech engine [1].

Does anyone on HN do active research in this field? Could I pick your brain for a survey of the best papers (especially review papers) on the subject?

[1] http://jungle.horse

bmc7505 9 years ago | |

> Did the original WaveNet text to speech demo come with a paper or source code?

Paper, yes. [1] Source code, no.

[1] https://arxiv.org/pdf/1609.03499.pdf

londons_explore 9 years ago |

Looking at the training loss graph, it looks like training for more time would produce even better results...

Anyone want to volunteer a few weeks of GPU time to train this better?

gwern 9 years ago | |

Training loss pretty much always decreases. NNs are extremely powerful models, so they can overfit most data. What you want to see is the validation loss graph.
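A small sketch of why the validation curve is the one to watch. The loss numbers below are made up for illustration and the helper `best_epoch` is hypothetical; the idea is simply to pick the epoch where held-out loss is lowest, even though training loss keeps falling afterwards.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


# Made-up curves: training loss keeps falling, validation loss turns upward.
train_loss = [2.0, 1.2, 0.8, 0.5, 0.3, 0.2]
val_loss = [2.1, 1.4, 1.1, 1.0, 1.2, 1.5]

stop_at = best_epoch(val_loss)
print(stop_at)  # 3
```

Training past epoch 3 in this toy example would keep lowering the training loss while making held-out performance worse, which is the overfitting described above.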

mo1ok 9 years ago |

This is awesome. I was just reading the WaveNet paper and wondering how I would go about a DIY approach...

EGreg 9 years ago |

Does this require an internet connection, though? Relative to say OpenEars?

amelius 9 years ago |

Perhaps now finally Linux could get a speech recognition input device.