Open (Apache 2.0) TTS model for streaming conversational audio in realtime (github.com)

79 points by SweetSoftPillow 220 days ago | 5 comments

ks2048 216 days ago

> Our work was heavily inspired by KyutaiTTS and Sesame

I wish they’d describe the technical details of the differences between this and other TTS they were “inspired by”.

There are so many projects like this that I will just have to assume they are vibe-coded clones made to get some publicity, unless there are more technical details.

echelon 216 days ago

Sesame is an impressive real-time conversational audio-to-audio model you can talk to on their website [1]. But it's closed source. They released some components, but nothing you could use to duplicate their work.

Sesame is what this team (and lots of teams) want to build. I know another team trying to build a real-time, local NSFW girlfriend you can talk to. They're convinced they can reach $100M ARR quickly if they crack it and make it customizable.

KyutaiTTS provides a lot of the ingredients for this work, but afaik it isn't conditioned for audio-to-audio and doesn't include any of the streaming components.

[1] https://app.sesame.com/

popalchemist 216 days ago

This is a streaming version of their previously released Dia TTS, which was an original work. You may want to recalibrate your assumption.

Neywiny 216 days ago

Not sure if it's an artifact of their streaming approach, but their intro demo has exclamation marks and question marks, and the intonation through the sentence just doesn't fit. The sentence is vocalized normally, with only the last word having that exclamation or question sound. Maybe we need that Spanish upside-down question mark at the start to help it.

woodson 216 days ago

Looks very similar to Kyutai’s models, given that it uses the same neural audio codec (Mimi), the Depformer module, etc.