So, I'm not the only one seeing this issue. It seems like many recent AI papers want to look as impressive as possible, while giving you as little implementation info as possible. This bothers me, because it defeats the very purpose of research publication.
[1] http://niclane.org/pubs/deepx_ipsn.pdf
[2] https://www.ibr.cs.tu-bs.de/Cosdeo2016/talks/invitedTalk.pdf
Props to the author, and especially to the DeepMind researchers who published their work! I look forward to living in a world where this type of technology is ubiquitous and mostly commoditized.
[1] http://cmusphinx.sourceforge.net/2016/04/grapheme-to-phoneme...
A little bit off-topic, but do you know of any recent work or papers on speech recognition for language teaching? (I mean analysing and rating a speaker's accuracy, detecting incorrectly pronounced phones, and so on.)
All of the results come back gibberish. The results in the training data seem just fine. Curious if you've tested the above to ensure it didn't overfit.
"Second, the Paper added a mean-pooling layer after the dilated convolution layer for down-sampling. We extracted MFCC from wav files and removed the final mean-pooling layer because the original setting was impossible to run on our TitanX GPU." [1]
[1] https://github.com/buriburisuri/speech-to-text-wavenet#speec...
Perhaps future communication applications can have a WaveNet on either end, which learns the voice of the person you're communicating with and then only sends text after a certain point in the conversation?
I'm coming at this from a point of ignorance though, so correct me if I've made erroneous assumptions.
This could have interesting implications for Foley-artists of the 21st century.
How likely is it that such tech would help lower-budget companies that want to implement voice communication within their software, say for video games or similar?
Hmm, now this has me wondering what implications this has for voice acting as well.
EDIT: We can call the ambient sound symbols sent over the wire "Soundmojis" or "amojis" or "audiomojis"
It doesn't seem like the mainstream engines (Alexa, Google Voice, Siri) are context aware. Why not?
This is what I'm solving at Optik. Helping you manage the things that you care about in the place that you are, and NOT exposing your personal details to cloud computation.
I had a go at implementing wave->phoneme recognition using a simple neural net and it seemed to work pretty well.
Does anyone on HN do active research in this field? Could I pick your brain for a survey of the best papers (especially review papers) on the subject?
Paper, yes. [1] Source code, no.
Anyone want to volunteer a few weeks of GPU time to train this better?
What you're describing is called "speech verification". Language education is an application I'm personally very interested in, and one that almost no one discusses in the speech community (I assume because of machine translation), so if you find any research papers please let me know! I wrote a little about it: http://breandan.net/2014/02/09/the-end-of-illiteracy/
The task is actually much simpler than STT. You display some text on the screen, wait for an audio sample, then check the model's confidence that the sample matches the text. If the confidence is lower than some threshold, then you play the correct pronunciation through the speaker. The trick is doing this rapidly, so a fast local recognizer is key. I've got a little prototype on Android, and it's pretty neat for learning new words. I'd like to get it working for reading recitation, but that's a lot of work.
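For anyone who wants to see the loop spelled out, here is a minimal sketch of the check described above. It is not my prototype's code: `score_fn`, `dummy_scorer`, and the 0.7 threshold are all hypothetical stand-ins, and a real recognizer would take the place of the scorer.

```python
from typing import Callable, Dict

# Hypothetical scorer signature: (audio, expected_text) -> confidence in [0, 1].
# A real implementation would wrap a fast local recognizer here.
Scorer = Callable[[bytes, str], float]


def check_pronunciation(
    audio: bytes,
    expected_text: str,
    score_fn: Scorer,
    threshold: float = 0.7,  # assumed value; tune per recognizer
) -> Dict[str, object]:
    """Score the audio against the displayed text and decide on feedback."""
    confidence = score_fn(audio, expected_text)
    return {
        "confidence": confidence,
        # Below the threshold, the app should play the correct pronunciation.
        "play_reference": confidence < threshold,
    }


def dummy_scorer(audio: bytes, expected_text: str) -> float:
    """Stand-in scorer for illustration only: treats the 'audio' as text bytes."""
    return 0.9 if audio == expected_text.encode() else 0.2


if __name__ == "__main__":
    print(check_pronunciation(b"hello", "hello", dummy_scorer))
    print(check_pronunciation(b"helo", "hello", dummy_scorer))
```

Keeping the scorer as a plain callable means the feedback logic stays the same whichever recognizer ends up behind it.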
Actually, checking against confidence is something that we've tried to play with, but to my knowledge there is no model that lets you measure speech confidence against a specific text. Public APIs like MS ProjectOxford.ai can return a confidence, but against the "recognised" text, not against a predefined text.
Going further, this kind of approach can be very effective on words and small sentences, but I'd really love to see which specific phones the learner is failing, which can help in analysing full speaking exercises.
It works, but I am sure it should be possible to do better.
It should be possible to train a neural network to catch those special intonations, but it is IMHO substantially harder than the initial project, with uncertain results.
See also this usage in the context of ML:
There's no reason one couldn't train 5 or 10 RNNs for transcription and ensemble them. (Indeed, one cute trick this ICLR was how to get an ensemble of NNs for free so you don't have to spend 5 or 10x time training: simply lower the learning rate during training until it stops improving, save the model, then jack the learning rate way up for a while and start lowering it until it stops improving, save that model, and when finished, now you have _n_ models you can ensemble.) And computing hardware is cheaper than humans, so it will be cheaper to have 5 or 10 RNNs process an audio file than it would be to have 2 or 3 humans independently check, so the ensembling advantage is actually bigger for the NNs in this scenario.
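A rough sketch of that "ensemble for free" trick, with one simplification that should be stated up front: instead of lowering the learning rate until the loss stops improving, this uses a fixed-length cosine cycle that restarts at the maximum rate. The function names, the cycle length, and the probabilities in the example are made up for illustration; the training loop itself is left out.

```python
import math
from typing import List, Sequence


def cyclic_lr(step: int, steps_per_cycle: int, lr_max: float) -> float:
    """Cosine-annealed learning rate that jumps back to lr_max every cycle."""
    position = (step % steps_per_cycle) / steps_per_cycle
    return 0.5 * lr_max * (math.cos(math.pi * position) + 1.0)


def is_snapshot_step(step: int, steps_per_cycle: int) -> bool:
    """True on the last step of a cycle, when the rate is at its lowest."""
    return (step + 1) % steps_per_cycle == 0


def average_predictions(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Ensemble by averaging each class probability across saved snapshots."""
    n = len(per_model_probs)
    return [sum(column) / n for column in zip(*per_model_probs)]


if __name__ == "__main__":
    for step in range(20):
        marker = "  <- save snapshot" if is_snapshot_step(step, 10) else ""
        print(f"step {step:2d}  lr={cyclic_lr(step, 10, 0.1):.4f}{marker}")
    # Hypothetical class probabilities from two snapshots for one input.
    print(average_predictions([[0.8, 0.2], [0.4, 0.6]]))
```

Each saved snapshot sits near a different local minimum, which is what makes averaging them useful even though they came from a single training run.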
Humans still have the advantage of more semantic understanding, but RNNs can be trained on much larger corpuses and read all related transcripts, so even there the human advantage is not guaranteed.
In practice the ensemble model is compactly transferred into a single network. In order to do that, they train a new network to copy the outputs of the ensemble, exploiting "dark knowledge".
Recurrent Neural Network Training with Dark Knowledge Transfer - https://arxiv.org/abs/1505.04630v5
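To make the "copy the outputs of the ensemble" step concrete, here is a minimal sketch of a distillation loss in plain Python. It follows the general soft-target recipe rather than the exact setup in the linked paper: the temperature of 2.0, the function names, and the logits in the example are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(
    ensemble_logits: Sequence[Sequence[float]], temperature: float = 2.0
) -> List[float]:
    """Average the softened distributions of all ensemble members."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    return [sum(column) / len(probs) for column in zip(*probs)]


def distillation_loss(
    student_logits: Sequence[float],
    soft_targets: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """Cross-entropy between the ensemble's soft targets and the student."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student))


if __name__ == "__main__":
    # Hypothetical logits from two ensemble members for one input.
    targets = ensemble_soft_targets([[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]])
    print(targets)
    print(distillation_loss([1.75, 0.75, -0.75], targets))
```

The soft targets carry the relative probabilities the ensemble assigns to the wrong classes, which is the extra signal the student gets beyond the hard labels.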