Show HN: Open-source, native audio turn detection model

126 points by kwindla 1 year ago | 28 comments

Our goal with this project is to build a completely open source, state-of-the-art turn detection model that can be used in any voice AI application.

I've been experimenting with LLM voice conversations since GPT-4 was first released. (There's a previous front page Show HN about Pipecat, the open source voice AI orchestration framework I work on. [1])

It's been almost two years, and for most of that time, I've been expecting that someone would "solve" turn detection. We all built initial, pretty good 80/20 versions of turn detection on top of VAD (voice activity detection) models. And then, as an ecosystem, we kind of got stuck.
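For readers who haven't built one, here is a minimal sketch of the kind of VAD-based turn detection described above: declare the turn over once the speaker has been silent for a fixed timeout. The function name, the 20 ms frame size, and the 800 ms timeout are all illustrative assumptions, not values from any particular VAD library or from this project.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    frames: Iterable[bool],
    frame_ms: int = 20,             # assumed frame duration
    silence_timeout_ms: int = 800,  # assumed silence threshold
) -> Optional[int]:
    """Return the index of the frame at which the turn is declared over.

    `frames` holds one VAD decision per audio frame (True = speech).
    The turn ends once speech has been heard at least once and is then
    followed by `silence_timeout_ms` of continuous silence.
    Returns None if that never happens.
    """
    needed = -(-silence_timeout_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

The weakness is visible in the parameters: a short timeout cuts people off when they pause mid-sentence, and a long timeout makes every response feel slow. The VAD signal alone carries no information about whether the speaker is actually finished.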

A few production applications have recently started using Gemini 2.0 Flash to do context-aware turn detection. [2] But because its latency is ~500ms, that's a more complicated approach than using a specialized model. The team at LiveKit released an open weights model that does text-based turn detection. [3] I was really excited to see that, but I'm not super-optimistic that a text-input model will ever be good enough for this task. (A good rule of thumb in deep learning is that you should bet on end-to-end.)

So ... I spent Christmas break training several little proof of concept models, and experimenting with generating synthetic audio data. So, so, so much fun. The results were promising enough that I nerd-sniped a few friends and we started working in earnest on this.

The model now performs really well on a subset of turn detection tasks. Too well, really. We're overfitting on a not-terribly-broad initial data set of about 8,000 samples. Getting to this point was the initial bar we set for doing a public release and seeing if other people want to get involved in the project.

There are lots of ways to contribute. [4]

Medium-term goals for the project are:

  - Support for a wide range of languages
  - Inference time of <50ms on GPU and <500ms on CPU
  - Much wider range of speech nuances captured in training data
  - A completely synthetic training data pipeline. (Maybe?)
  - Text conditioning of the model, to support "modes" like credit card, telephone number, and address entry.
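One way to check the inference-time goals in the list above is a small timing harness like the sketch below. It is not part of the project: `measure_latency_ms` and its parameters are hypothetical names, and `predict` stands in for whatever call runs the model on one audio sample.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(
    predict: Callable[[], object],
    warmup: int = 3,
    runs: int = 20,
) -> Dict[str, float]:
    """Time repeated calls to `predict` and report latency in milliseconds.

    `predict` is a stand-in for one inference call. On a GPU it should
    block until the result is actually available, otherwise only the
    time to enqueue the work is measured.
    """
    # Warm-up calls are excluded so one-time setup cost doesn't skew results.
    for _ in range(warmup):
        predict()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Comparing `median_ms` against the 50 ms (GPU) and 500 ms (CPU) targets gives a quick read on typical latency, while `max_ms` shows the worst case seen during the run.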

If you're interested in voice AI or in audio model ML engineering, please try the model out and see what you think. I'd love to hear your thoughts and ideas.

[1] https://news.ycombinator.com/item?id=40345696

[2] https://x.com/kwindla/status/1870974144831275410

[3] https://blog.livekit.io/using-a-transformer-to-improve-end-o...

[4] https://github.com/pipecat-ai/smart-turn#things-to-do

  # Training parameters
  "learning_rate": 5e-5,
  "num_epochs": 10,
  "train_batch_size": 12,
  "eval_batch_size": 32,
  "warmup_ratio": 0.2,
  "weight_decay": 0.05,

  # Evaluation parameters
  "eval_steps": 50,
  "save_steps": 50,
  "logging_steps": 5,

  # Model architecture parameters
  "num_frozen_layers": 20
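To get a feel for what these numbers mean in practice, the sketch below works out the step counts they imply. It rests on assumptions that the post does not confirm: that all of the roughly 8,000 samples are used for training, on a single device, with no gradient accumulation. The linear warmup shape is also an assumption, and any decay after warmup is not modeled.

```python
import math

config = {
    "learning_rate": 5e-5,
    "num_epochs": 10,
    "train_batch_size": 12,
    "warmup_ratio": 0.2,
    "eval_steps": 50,
}

# Assumption: the full ~8,000-sample data set is used for training.
num_train_samples = 8000

steps_per_epoch = math.ceil(num_train_samples / config["train_batch_size"])
total_steps = steps_per_epoch * config["num_epochs"]
warmup_steps = math.ceil(total_steps * config["warmup_ratio"])
num_evaluations = total_steps // config["eval_steps"]


def warmup_lr(step: int) -> float:
    """Learning rate at `step`, assuming linear warmup to the peak rate.

    After warmup the peak rate is returned unchanged; a real schedule
    would usually decay from there.
    """
    return config["learning_rate"] * min(1.0, step / warmup_steps)


print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
print(f"evaluations:     {num_evaluations}")
```

Under these assumptions, the first 20% of optimizer steps (about two epochs) are spent ramping up to the peak learning rate of 5e-5.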