Our goal with this project is to build a completely open source, state of the art turn detection model that can be used in any voice AI application. I've been experimenting with LLM voice conversations since GPT-4 was first released. (There's a previous front page Show HN about Pipecat, the open source voice AI orchestration framework I work on. [1]) It's been almost two years, and for most of that time, I've been expecting that someone would "solve" turn detection. We all built initial, pretty good 80/20 versions of turn detection on top of VAD (voice activity detection) models. And then, as an ecosystem, we kind of got stuck. A few production applications have recently started using Gemini 2.0 Flash to do context aware turn detection. [2] But because latency is ~500ms, that's a more complicated approach than using a specialized model. The team at LiveKit released an open weights model that does text-based turn detection. [3] I was really excited to see that, but I'm not super-optimistic that a text-input model will ever be good enough for this task. (A good rule of thumb in deep learning is that you should bet on end-to-end.) So ... I spent Christmas break training several little proof of concept models, and experimenting with generating synthetic audio data. So, so, so much fun. The results were promising enough that I nerd-sniped a few friends and we started working in earnest on this. The model now performs really well on a subset of turn detection tasks. Too well, really. We're overfitting on a not-terribly-broad initial data set of about 8,000 samples. Getting to this point was the initial bar we set for doing a public release and seeing if other people want to get involved in the project. There are lots of ways to contribute. [4] Medium-term goals for the project are:
If you're interested in voice AI or in audio model ML engineering, please try the model out and see what you think. I'd love to hear your thoughts and ideas.[1] https://news.ycombinator.com/item?id=40345696 [2] https://x.com/kwindla/status/1870974144831275410 [3] https://blog.livekit.io/using-a-transformer-to-improve-end-o... |