I wish they’d describe the technical differences between this and the other TTS systems they were “inspired by”.
There are so many projects like this that I will just have to assume they are vibe-coded clones made to get some publicity, unless there are more technical details.
Sesame is what this team (and lots of other teams) want to build. I know another team trying to build a real-time, local NSFW girlfriend you can talk to. They're convinced they can reach $100M ARR quickly if they crack it and make it customizable.
KyutaiTTS provides a lot of the ingredients for this work, but afaik it isn't conditioned for audio-to-audio and doesn't provide any of the streaming components.