Toward better phone call and video transcription with new Cloud Speech-to-Text (cloudplatform.googleblog.com)
I followed the link from the blog post that said "check out the demo on our product website". Then there's a big button that says "TRY IT FREE". Good, I say. That leads me through a signup process that involves credit cards and whatnot, and then dumps me out on what I guess is the equivalent of the AWS console, not some nice audio test page.
So then I root around in the console, finally find the speech-to-text stuff, and screw around with various interfaces. None of them seems to be the right thing. Eventually I decide I must have missed something, go back to the product website, and scroll down further to find the "convert your speech to text right now" section. Great, say I.
The blog post explicitly talks about video. I want to see if it can transcribe a talk I did, so I tried uploading a file; nothing appears to happen on Firefox. I try a couple more times. I sigh heavily and switch to Chrome.
It does appear to work on Chrome, but it's entirely infuriating. I tried uploading a video file, which was over 50MB, so it refused. I then figured out how to extract the audio alone and uploaded that, at which point it complained it was over a minute. Then I find another incantation to chop my audio to a minute (which they just should have done for me, and which anyway should be explained in the interface).
Finally, I upload 60 seconds of audio. And nothing fucking happens. After all that, the thing just doesn't work. No error messages, no anything.
This is my first impression of the Google Cloud Platform, and all I hear is the squeaking of clown shoes. I'm sure the rest of it can't be this bad, but if they can't make a simple demo work, I'm unlikely to find out.
AWS just let me transcribe my MP3 in a pretty straightforward way once I'd uploaded it to an S3 bucket. The transcript is done in 2-3x real time, and the quality seems decent. It comes as a complex JSON file with confidence numbers and timestamps for every word, with alternate words when it knows it isn't sure. It's pretty neat.
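To give a sense of what working with that per-word JSON might look like, here is a rough sketch that pulls out the words the service was unsure about. The layout assumed here (`results.items`, each with `alternatives`, `confidence`, `start_time`, and a `type` of `pronunciation` or `punctuation`) is my recollection of the Amazon Transcribe output and should be checked against a real file; the sample data is hand-written for illustration, not actual output.

```python
def low_confidence_words(transcript, threshold=0.8):
    """Return (start_time, word, confidence) for words scored below `threshold`."""
    flagged = []
    for item in transcript["results"]["items"]:
        # Punctuation items carry no timestamps, so skip them.
        if item.get("type") != "pronunciation":
            continue
        best = item["alternatives"][0]
        confidence = float(best["confidence"])  # values arrive as strings
        if confidence < threshold:
            flagged.append((float(item["start_time"]), best["content"], confidence))
    return flagged


# Hand-written sample in the assumed shape; not real service output.
sample = {
    "results": {
        "items": [
            {"type": "pronunciation", "start_time": "0.10", "end_time": "0.50",
             "alternatives": [{"confidence": "0.99", "content": "squeaking"}]},
            {"type": "pronunciation", "start_time": "0.52", "end_time": "0.80",
             "alternatives": [{"confidence": "0.41", "content": "clown"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": "."}]},
        ]
    }
}

for start, word, confidence in low_confidence_words(sample):
    print(f"{start:6.2f}s  {word}  ({confidence:.2f})")
```

A list like this is a reasonable starting point for manual review: jump to each timestamp and check the word by ear.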
Google made me use a sort of query builder interface to construct an API request. The query builder did not actually match the features announced in the blog post, so I just tried going with what was there. When I eventually got a valid-looking request, it blew up because it turns out it can't parse MP3s. So then I reencoded to FLAC and uploaded that. I tried a variety of queries, but none of them worked. The one that got closest complained about a bad value for a field the query builder apparently would not let me add.
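For comparison, a request body for a short FLAC clip might look roughly like the sketch below. The field names (`config.encoding`, `config.sampleRateHertz`, `config.languageCode`, `audio.content`) follow the v1 REST `speech:recognize` method as far as I know, but treat them as assumptions to verify against the current reference. The sample rate and the placeholder bytes are made up, and the sketch does not send anything over the network.

```python
import base64
import json


def build_recognize_request(flac_bytes, sample_rate_hz=16000, language="en-US"):
    """Build a JSON-serializable body for a synchronous recognize call (assumed v1 shape)."""
    return {
        "config": {
            "encoding": "FLAC",                 # MP3 was rejected in the attempt described above
            "sampleRateHertz": sample_rate_hz,  # assumed; must match the actual file
            "languageCode": language,
        },
        "audio": {
            # Inline audio is sent as base64 text inside the JSON body.
            "content": base64.b64encode(flac_bytes).decode("ascii"),
        },
    }


# Placeholder bytes standing in for a real FLAC file.
body = build_recognize_request(b"fake-flac-bytes")
print(json.dumps(body, indent=2))
```

Building the body in code rather than in a form at least makes it easy to see exactly which fields are being sent when the service complains about one of them.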
I gave up. Squeak, squeak, squeak!
And I should add that the people I know at Google are all perfectly smart, so I don't want anybody to think I'm saying that the individual engineers who made this are dumb or bad. This seems like a giant organizational failure, where what gets built is deeply disconnected from user need and the lived user experience.
Normally when I get insight on a place where this happens, the priority is not actually delivering value, but making managers look good according to easily measured but harmful metrics, like, "Are we at competitive parity at a feature checklist level?" or "Did we launch by some made-up deadline so that a manager could claim success?"
If anybody at Google wants to send me their horror stories, please do email or DM me on Twitter. I'd love to know what the hell happened here, and I promise to keep things as confidential as you like.
Thanks for "the squeaking of clown shoes". I'll have to remember that.
And then run algorithms on these texts to classify the conversations into "potentially crime related discussions" classes.
https://www.theguardian.com/commentisfree/2013/may/04/teleph...
I agree with some of the comments regarding Google being a big co & having big co issues. But at the core of it, the team, the offering & attention to what matters is solid.
It's certainly going to open up a whole new realm of possibilities.
Interesting name change. It’s certainly more precise, but was “Speech API” really confusing people?
The focus on improving call center performance is where the money is. Plenty more vendors will enter this market.
Then there are implementations of Baidu’s DeepSpeech (PaddlePaddle: https://github.com/PaddlePaddle/DeepSpeech, or Mozilla’s version).
They just throw stuff that would otherwise be useful to the world out there in the least user-friendly way possible. And then they make a big PR push for a while talking about how great the new thing is and then they forget about it and the project languishes.
I've worked with some great, highly autonomous teams. What makes them work well is a strong emotional and informational connection to users. They do lots of user testing, so they can get inside a user's head. They try things out themselves, using what they've learned from talking to those users. And they keep an eye on production usage, because they really care that what they make delivers value and gets used.
I took a ten-minute audio segment from a two-person interview and chopped it up into shorter segments to fit under the 60-second limit, with varying overlap durations to make sure that full sentences would be included on either side of the snip. I ran a battery of tests with segments of 20s, 30s, 40s, and 50s and overlaps of 3s, 5s, and 10s. The output was essentially useless garbage, with wild differences in the transcription depending on segment length and overlap duration. In one configuration one sentence might be perfectly transcribed and the next was word salad; in another both sentences were useless salad; in another half of each sentence was right but words were missing, etc. No configuration ever yielded a useful output. Time and money spent: several hours, $$$.
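The chopping scheme described above can be sketched as a small function that computes overlapping window boundaries. This is an illustration of the general idea, not the commenter's actual script; the function name and the choice to clamp the last window to the end of the recording are my own assumptions.

```python
def overlapping_windows(total_s, window_s, overlap_s):
    """Return (start, end) times in seconds covering `total_s` with overlapping windows."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap must be non-negative and shorter than the window")
    step = window_s - overlap_s  # how far each new window advances
    windows = []
    start = 0
    while start < total_s:
        end = min(start + window_s, total_s)  # clamp the final window
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# The 12 configurations from the test battery, applied to a ten-minute (600 s) clip.
for window in (20, 30, 40, 50):
    for overlap in (3, 5, 10):
        count = len(overlapping_windows(600, window, overlap))
        print(f"window={window}s overlap={overlap}s -> {count} segments")
```

Each `(start, end)` pair could then be fed to an audio tool to cut the actual segment. Stitching the overlapping transcripts back together is a separate problem that this sketch does not address.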