Is Word Error Rate a Good Metric for Speech Recognition Models? (assemblyai.com)
One time I beta tested a new speech model I trained that scored very well on WER. Something like 1/2 to 1/3 as many errors as the previous model.
This new model frustrated so many users, because the _nature_ of errors was much worse than before, despite fewer overall errors. The worst characteristic of this new model was word deletions. They occurred far more often. This makes me think we should consider reporting insertion/replacement/deletion as separate % metrics (which I found some older whitepapers did!)
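To make the separate-rates idea concrete, here is a minimal sketch that reports substitution, deletion and insertion rates alongside plain WER. The function names (`align_counts`, `error_breakdown`) are made up for illustration, and the tie-breaking between equally cheap alignments is an arbitrary choice, so the split can differ slightly from what other tools report.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) from a Levenshtein
    alignment of two word lists."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            sub = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)
            # Tuples compare by cost first; ties fall back to fewer substitutions.
            dp[i][j] = min(sub, dele, ins)
    _, s, d, n = dp[-1][-1]
    return s, d, n


def error_breakdown(ref_text, hyp_text):
    """Per-type error rates, each normalized by reference length."""
    ref, hyp = ref_text.split(), hyp_text.split()
    s, d, i = align_counts(ref, hyp)
    n = max(len(ref), 1)
    return {"wer": (s + d + i) / n, "sub": s / n, "del": d / n, "ins": i / n}


print(error_breakdown("the cat sat on the mat", "the cat on a mat"))
```

Two models with the same `wer` can then be told apart by whether the errors sit mostly in `del` or in `sub`.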
We have CER (Character Error Rate), which is more granular and helps give a sense of whether entire words are wrong (CER close to WER) or mostly just single letters (CER much lower than WER).
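A rough sketch of that comparison, assuming plain whitespace tokenization and no text normalization (the helper names are illustrative, not from any particular toolkit):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    words = ref.split()
    return edit_distance(words, hyp.split()) / max(len(words), 1)


def cer(ref, hyp):
    # Spaces count as characters here; that is a simplification.
    return edit_distance(ref, hyp) / max(len(ref), 1)


ref, hyp = "i liked the show", "i like the show"
print(wer(ref, hyp), cer(ref, hyp))
```

Here one dropped letter costs a whole word in WER (1 of 4 words) but only one character in CER (1 of 16), which is the "mostly single letters" case.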
I'd welcome some ideas for new metrics, even if they only make sense for evaluating my own models against each other.
GPT2 perplexity?
Phoneme aware WER that penalizes errors more if they don't sound "alike" to the ground truth? (Because humans can in some cases read a transcription where every word is wrong, 100% WER, and still figure out by the sound of each incorrect word what the "right" words would have been)
"edge" error rate, that is, the likelihood that errors occur at the beginning / end of an utterance rather than the middle?
Some kind of word histogram, to demonstrate which specific words tend to result in errors / which words tend to be recognized well? One of the tasks I've found hardest is predicting single words in isolation. I'd love a good/standard (demographically distributed) dataset around this, e.g. 100,000 English words spoken in isolation by speakers with good accent/dialect distribution. I built a small version of this myself and I've seen WER >50% on it for many publicly available models.
More focus on accent/dialect aware evaluation datasets?
From one of my other comments here: some ways to detect error clustering? I think ideally you want errors to be randomly distributed rather than clustered on adjacent words or focused on specific parts of an utterance (e.g. tend to mess up the last word in the utterance)
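The "edge" error rate idea from the list above could be prototyped roughly like this. It leans on `difflib.SequenceMatcher` from the standard library for the alignment, which is not a true minimum-edit-distance aligner, so treat the positions as approximate. `error_positions` and `edge_error_share` are hypothetical names, and charging an insertion to the nearest reference word is my own assumption.

```python
from difflib import SequenceMatcher


def error_positions(ref_words, hyp_words):
    """Reference-word indices touched by an error, per difflib's alignment."""
    sm = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    positions = set()
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            positions.update(range(i1, i2))
        elif tag == "insert" and ref_words:
            # Assumption: charge an insertion to the nearest reference word.
            positions.add(min(i1, len(ref_words) - 1))
    return sorted(positions)


def edge_error_share(pairs, edge=1):
    """Share of all errors that fall within `edge` words of either end."""
    edge_errs = total = 0
    for ref, hyp in pairs:
        words = ref.split()
        for p in error_positions(words, hyp.split()):
            total += 1
            if p < edge or p >= len(words) - edge:
                edge_errs += 1
    return edge_errs / total if total else 0.0


pairs = [
    ("turn the lights off", "turn the lights"),      # last word dropped
    ("play some jazz music", "play sum jazz music"), # error in the middle
]
print(edge_error_share(pairs))
```

To read the number you would compare it against the share expected by chance, which is about 2 * edge / utterance length for errors spread uniformly.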
https://scholar.google.com/citations?view_op=view_citation&h...
CER is definitely more granular. There are papers that basically count Deletions, for example, as 0.5(D) when calculating WER - since they consider Deletions "less bad", but if these weights aren't standardized then WER scores will be super hard to compare.
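For illustration, a weighted WER along those lines might look like the sketch below. The 0.5 deletion weight is just the example from this comment, not a standard, and `weighted_wer` is a made-up name.

```python
def weighted_wer(ref, hyp, w_sub=1.0, w_del=0.5, w_ins=1.0):
    """Minimum weighted edit cost between word sequences, divided by
    reference length. The default weights are illustrative only."""
    r, h = ref.split(), hyp.split()
    prev = [j * w_ins for j in range(len(h) + 1)]
    for i, rw in enumerate(r, 1):
        cur = [i * w_del]
        for j, hw in enumerate(h, 1):
            diag = prev[j - 1] + (0.0 if rw == hw else w_sub)
            cur.append(min(diag, prev[j] + w_del, cur[j - 1] + w_ins))
        prev = cur
    return prev[-1] / max(len(r), 1)


ref, hyp = "the cat sat on the mat", "the cat on a mat"
print(weighted_wer(ref, hyp))             # deletions at half cost
print(weighted_wer(ref, hyp, w_del=1.0))  # plain WER
```

The same transcript scores differently under the two weightings, which is exactly why numbers are hard to compare unless the weights are reported with them.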
Personally I think some metric including some type of perplexity is the way to go.
Could we generalize the WER weighting to optimize for the domain?
Something like
weight = w1 * WER + w2 * phonetic similarity + ...
which also requires a hyperparameter search... But we are already dumping so many GPU hours here.
I assume this is already being investigated by Google, though?
I wonder if you could make that parameter trainable instead of using a hyperparameter search for it.
For phonetic similarity I've been playing with a dual objective system that could be promising.
- whole phrase intent recognition rates. Run the transcribed phrase through a classifier to identify what the phrase is asking for, and compare that to what was expected, calculating an F1 score. Keep track of phrases that score poorly: they need to be improved.
- "domain term" error rate. Identify a list of key words that are important to the domain and that must be recognized well, such as location names, products to buy, drug names, terms of art. For every transcribed utterance, measure the F1 score for those terms, and track the alternatives produced in a confusion matrix. This results in a distilled list of words the system gets wrong and what is heard instead.
- overall word error rate, to provide a general view of model performance.
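As a sketch of the "domain term" item in the list above: the version below only handles single-word terms, pools counts across utterances (micro-averaged F1), and uses `difflib` for a rough alignment when collecting confusions. All of those are simplifying assumptions, and `domain_term_report` is a hypothetical name.

```python
from collections import Counter
from difflib import SequenceMatcher


def domain_term_report(pairs, terms):
    """Micro-averaged F1 over domain terms plus (expected, heard) confusions."""
    terms = {t.lower() for t in terms}
    tp = fp = fn = 0
    confusions = Counter()
    for ref, hyp in pairs:
        r, h = ref.lower().split(), hyp.lower().split()
        rc = Counter(w for w in r if w in terms)
        hc = Counter(w for w in h if w in terms)
        hit = sum((rc & hc).values())  # terms present in both transcripts
        tp += hit
        fn += sum(rc.values()) - hit
        fp += sum(hc.values()) - hit
        opcodes = SequenceMatcher(a=r, b=h, autojunk=False).get_opcodes()
        for tag, i1, i2, j1, j2 in opcodes:
            # Only equal-length replacements give an unambiguous word pairing.
            if tag == "replace" and i2 - i1 == j2 - j1:
                for rw, hw in zip(r[i1:i2], h[j1:j2]):
                    if rw in terms:
                        confusions[(rw, hw)] += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, confusions


pairs = [
    ("refill my ibuprofen", "refill my ibuprofen"),
    ("ship it to boston", "ship it to austin"),
]
f1, confusions = domain_term_report(pairs, {"ibuprofen", "boston"})
print(f1, confusions.most_common())
```

Sorting `confusions` by count gives the distilled "what was heard instead" list.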
In other words, measure LM perplexity on the ground truth words, then on the predicted words, and minimize the difference in perplexities. Ideally with a general model like GPT2 or BERT or something that you aren't using anywhere in your actual ASR.
This may even be more tolerant of errors in the ground truth transcription than raw WER
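Roughly what I mean, as a sketch. To keep it self-contained I'm using a toy add-one-smoothed unigram model as a stand-in for GPT2; the unigram model, the tiny corpus and all the function names are assumptions for illustration only. With a real LM you would swap in its per-token log probabilities.

```python
import math
from collections import Counter


def make_unigram_lm(corpus_words):
    """Toy stand-in for a real LM: add-one smoothed unigram log-probabilities."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def log_prob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return log_prob


def perplexity(words, log_prob):
    if not words:
        raise ValueError("perplexity is undefined for an empty word list")
    return math.exp(-sum(log_prob(w) for w in words) / len(words))


def perplexity_gap(ref, hyp, log_prob):
    """Absolute difference between hypothesis and reference perplexity."""
    return abs(perplexity(hyp.split(), log_prob) - perplexity(ref.split(), log_prob))


lm = make_unigram_lm("the cat sat on the mat the dog sat".split())
print(perplexity_gap("the cat sat", "the cat sat", lm))  # identical text, zero gap
print(perplexity_gap("the cat sat", "the cat zat", lm))  # unseen word widens the gap
```

One known weakness of this sketch: a hypothesis can be wrong and still land on the same perplexity as the reference, so it probably works better next to WER than instead of it.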
Exactly. Errors with proper nouns are usually more problematic than errors with stop words, yet they're weighted equally in the WER calculation. I.e., deleting "Bob" and "but" both count as a deletion of the same degree according to WER, but we as humans know that deleting "Bob" is potentially a lot more problematic than deleting "but".
Theory being you don't want to add or remove confusing words, but common stop words are less of an issue.
I'm not sure how this interacts with a multi word replacement, where the new words together make sense but independently make no sense to the LM.
I'm wondering what the higher convolution levels could look like, if this was a CNN analyzing an image. Something between the complete Ableton/Logic export and a MIDI file. Being able to capture the "feel" of a song (or a section within a song) strikes me as an important milestone towards designing really good generative music.
I can also imagine a generalized "local error rate" which measures how far away errors tend to be from each other. If errors tend to be clustered, I would guess that's showing inability to follow some musical pattern. I think you'd want errors to appear randomly distributed rather than clustered. (This metric might make sense for speech too)
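One possible way to put numbers on that, assuming you already have a per-token (or per-note) list of error flags from some alignment. Both measures and their names are my own guesses at what a "local error rate" could be, not an established metric.

```python
def adjacent_error_share(flags):
    """Fraction of errors that have another error immediately before or after."""
    errs = [i for i, flag in enumerate(flags) if flag]
    if not errs:
        return 0.0
    err_set = set(errs)
    clustered = sum(1 for i in errs if i - 1 in err_set or i + 1 in err_set)
    return clustered / len(errs)


def mean_error_gap(flags):
    """Average distance between consecutive errors, or None with fewer than two."""
    errs = [i for i, flag in enumerate(flags) if flag]
    gaps = [b - a for a, b in zip(errs, errs[1:])]
    return sum(gaps) / len(gaps) if gaps else None


clustered = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
spread = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(adjacent_error_share(clustered), mean_error_gap(clustered))
print(adjacent_error_share(spread), mean_error_gap(spread))
```

Both sequences have the same error rate (3 of 10), but the first is fully clustered and the second is spread out, which is the distinction a plain error rate hides.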
One clever metric that Google mentioned in their early ASR papers was interesting: "WebScore". Basically, they consider a hypothesis transcription to have errors only if it produces a different top web search result. [1] WebScore and WER always seemed to track each other though.
[1] https://static.googleusercontent.com/media/research.google.c...
Sentence/command error rate (rate of 100% correct sentences/commands that don’t need any editing or re-attempting) is a decent proxy for this. It’s no silver bullet, but it more directly measures how frustrated your users will be.
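A minimal sketch of sentence error rate, assuming that case and extra whitespace should not count as errors (that normalization is my choice; stricter or looser rules are equally defensible):

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def sentence_error_rate(pairs):
    """Fraction of utterances whose transcript is not an exact match."""
    if not pairs:
        return 0.0
    wrong = sum(1 for ref, hyp in pairs if normalize(ref) != normalize(hyp))
    return wrong / len(pairs)


pairs = [
    ("open the file", "open the file"),
    ("save and quit", "Save and  quit"),  # differs only in case and spacing
    ("go to line ten", "go to line tin"),
]
print(sentence_error_rate(pairs))
```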
If you really wanted to take care of the issues in the article, you could interview a bunch of users and find what percent of them would go back and edit each kind of mistake (if 70% would have to go back and change ‘liked’ to ‘like’ then it’s 70% as bad as substituting ‘pound’ for ‘around’ which presumably every user will go back and edit).
The infuriating thing as a user is when metrics don’t map to the extra work I have to do.
"probably going to have to go back and edit" is generally not the case with my Conformer model, which allows fast paced usage like this with practice: https://twitter.com/lunixbochs/status/1378159234861264896
(and the conclusion that I need to prevent the return of RSI at all costs from now on. Don't get me wrong, I'm very thankful that talon does as well as it does. It was a job saver.)
If so, December predates Conformer, so you're talking about the sconv model, which is the model I was complaining about upthread - it was very polarizing with users, and despite the theoretical WER improvements, the errors were much more catastrophic than the model that preceded it.
In either case, I'm constantly making improvements - I'm in the middle of a retrain that fixes some of the biggest issues (such as misrecognizing some short commands as numbers), and I've done a lot of other work recently that has really polished up the experience with the existing model.
As a side rant, it turned out that simply stepping away from work for a few weeks around the holidays nearly fixed my RSI, which makes me so sad about the nature of my career whenever it crops back up.
Btw, any chance you've done any work on the `phones` or related tooling? I remember that (and editing in general) being a pain point.
sconv was especially disappointing because it looked so good on metrics during my training, but the cracks really started to show once it entered user testing. Conformer has been so much less stressful in comparison because most user complaints are about near misses (or completely ambiguous speech where the output is not _wrong_ per se if you listen to the audio) rather than catastrophic failure.
There's another interesting emergent behavior with my user base as I make improvements, which is that as I release improved models allowing users to speak faster without mistakes, some users will speak even faster until there are mistakes again.
Edit: Yep! There have been several improvements on editing, though that's more in the user script domain and my work has still been mostly on the backing tech. I'm planning on working on "first party" user scripts in the future where that stuff is more polished too.
LOL. Users will be users! That's a hilarious case study, thanks for sharing.
> Yep! There have been several improvements on editing, though that's more in the user script domain and my work has still been mostly on the backing tech. I'm planning on working on "first party" user scripts in the future where that stuff is more polished too.
That would be wonderful! If you haven't seen them, I'd suggest looking at Serenade (also ASR) and Nebo (handwriting OCR on ipad) as interesting references for editing UI. They seem to have tight integration between the recognition and editing steps, letting errors be painless to fix by exposing alternative recognitions at the click of a button or short command. It lets them make x% precision@n as convenient as x% accuracy.
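For what it's worth, the number I have in mind could be computed roughly like this, assuming the recognizer exposes a ranked n-best list per utterance. Strictly speaking this is top-n accuracy, and the function name and sample data are made up for illustration.

```python
def top_n_accuracy(examples, n):
    """Share of utterances whose reference appears among the first n hypotheses.

    `examples` is a list of (reference, ranked_hypotheses) pairs.
    """
    if not examples:
        return 0.0
    hits = sum(1 for ref, nbest in examples if ref in nbest[:n])
    return hits / len(examples)


examples = [
    ("like", ["liked", "like", "lite"]),
    ("around", ["around", "a round"]),
    ("bob", ["but", "bop"]),
]
print(top_n_accuracy(examples, 1))
print(top_n_accuracy(examples, 3))
```

If picking an alternative is one click or one short command, the top-3 number is closer to what the user actually experiences than the top-1 number.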