https://www.schneier.com/blog/archives/2025/02/ais-and-robot...
I might have missed it in their writeup.
Incredible!
Stuff that a trillion dollar company cannot manage to do.
Trying asking it to be dungeon master and play dungeons and dragons style role playing game.
This software is a con artist. I mean that literally. It's not LIKE a con artist, it is literally attempting to con the user into forming assumptions about its intentions and mental states that its creators know to be false.
One interesting aspect was when I said what the fuck it ruined the whole conversation, maybe there will be a co-evolution of mannerism, so humans will have to learn that the way they talk to machines will have consequences down the line. Or we teach the machines to be cooperative no matter what, just like ChatGPT (or north koreans).
I am curious how easy it would be to adjust the inflection and timing. She was over-complimentary, which is fine for a demo. But I'd love something more direct, like a brainstorming session, and almost talking over each other. And then a whiteboard...
I suppose the lack of visual cues probably hinders things in that regard.
So unless the system has a lot of engineering and/or training put into the main model being able to recognize exactly when it should keep waiting versus a real response, it will just see something like "user: empty response" or "user: uhmm" and assume it is supposed to respond to that.
EDIT: also Moshi started with a pretrained traditional text LLM
This was almost worse though because it did feel like a rude person just interrupting instead of a dumb computer not being able to pick up normal social cues around when the person they're listening to has finished.
Also for proper emotional dialogue it needs to determine the human input emotions. It seems to work with a transcript of the input.
But I did feel bad hanging up on it. Him?
Extremely impressive overall though.
told it, "hold on" as i was putting on my headset, they said "no problem". but then i tried to fill the empty airtime by saying, "i'm uhh heating some hot chocolate?"
the ai's response was something like, "ah.. (something) (something). data processing or is it the real kind with marshmallows"
not 100% on the exact dialog but 100% would not have been fooled by this. closed it there. no uncanny valley situation for me.
The demo interactions are recorded, which is mentioned in their disclaimer under the demo UI. What isn't mentioned though is that they include past conversations in the context for the model on future interactions. It was pretty surprising to be greeted with something like "welcome back" and the model being able to reference what was said in previous interactions. The full disclaimer on the page for the demo is:
" 1. Microphone permission is required. 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days. 3. By using this demo, you are agreeing to our "
edit: Actually this has been posted quite a few times already and had good visibility a couple days ago: - https://news.ycombinator.com/item?id=43200400 Others: https://hn.algolia.com/?q=sesame.com
Edit: well I asked the "male" model to speak more like an Australian and yep, getting way more uncanny. If it had an Australian accent I think it would mess with me more
I'm surprised by the lack of attention that Gemini 2.0 with native audio output got. They have a demo at https://youtu.be/qE673AY-WEI, which I think is really good too. The main problem with Google's model is that this audio output is not supported by the API, but you can try it at https://aistudio.google.com.
In general, text to speech is pretty good nowadays I think. For example, this is a little math video that I made a few days ago: https://www.youtube.com/watch?v=G1mvLrCfjFM with the (old) Google text to speech API. Honestly, I think the narration is better than I personally could have done. It's calm, well pronounced, and sounds relatively enthusiastic.
That's not a demo, that's a video. Anyone can make something like that in an afternoon with a couple friends and a microphone.
Also, Google is known for putting out fake "demos", remember the Google Duplex scam?
Sounds (pun intended) reasonable.
Verbal communication is complex. There’s a big list of interesting challenges to tackle. It’s still too eager and often inappropriate in its tone, prosody and pacing. The timing of when it responds is wrong more often than right. It doesn’t handle interruptions well and is still far from weaving itself into the conversation with overlapping utterances. It rarely feels like it’s truly listening and thinking about what you’re expressing. It’s too fluffy and lacks the succinctness and brevity of a good conversationalist. Its personality is inconsistent. Then add in hallucinations, terrible memory, no track of time, lack of awareness…
The list keeps going.
I believe the community can make meaningful progress on all of these.
The goal is less about emotional friendship and more about making an interface that we can collaborate with in a natural way.
Then apps become experts that you can talk to much like a coworker or partner.
The models are already powerful enough to do so many things. But finding the right prompt is often tricky and time consuming.
Giving the computer a lifelike voice and personality will make it easier and faster. Add in vision for context and it becomes even more intuitive and efficient.
I’m more convinced than ever that we’re at the cusp of a new interface.
The entire thing felt like it was a hyper advanced engagement hack. Not there to achieve anything (even my enjoyment), just something to keep my attention locked on my device.
AI products in the future should have a clear objective for me as a user - what can they help me do? Some simulacrum of a person that is just there to talk to me at length is probably going to be a net negative on society. As a tech demo, this makes me afraid for the future.
My thought exactly, it was to the extreme in its, as you say, bubbliness. I would not be able to use a tool that had this behavior.
All that emotionality adds is that you get the illusion of a friend - a friend that can't help you in any way in the real world and who's confidentiality is as strong as the privacy policies & data security of the company running it - which often ultimately trends towards 0.
Smart Neutral Voice Assistants could be a great help, but none of it requires "emotionality" and trying to build a "human connection" with the user. Quite the contrary: the more emotional a voice, the easier it is to misuse it for scams, faking rapport and in general make you "addicted" to loop you in babble with it.
Today, she asked "where has that robot guy gone?". Crying now because I won't let her talk to Miles anymore.
She has already developed an emotional connection to it. Worrying indeed.
Most children love talking to a fun adult who enjoys talking to them. As parents we hope to be that adult for them most of the time, but of course that's not easy to do all the time.
If parents made a tool like this a crutch and it replaced quality time with them or they were less likely to hang out with their friends, then yeah that's a big problem. If they use it as a learning aide or occasional fun diversion, it seems great.
But the cadence and the rhythm of speaking are off. It sounds like someone who isn't a podcaster trying to speak in the personality of a podcaster. It just sounds like someone trying too hard and speaking in an unnatural way.
But then I thought of one more question to ask, reconnected to ask it, and it said, "Hey! You hung up just as we were just getting to the good stuff!" which threw me off, so I stammered gobsmacked for a minute, and it made fun of my stammering, imitating it. Whoa! So so SO good! Crazy good.
I'm creeped-out by this being on someone else's server, but if it was fully local-hosted-private, that might even get more creepy if I allowed myself to really talk freely to this thing.
I'm from a developing country and it's sad that most English teachers on public schools here can't speak English well. There are good English teachers, but they are expensive and they are not affordable for the average people.
OpenAI realtime models are good, but we can't deploy it to masses since it's very expensive.
This model might be able to solve the issue since it's better or on par with the OpenAI model, yet it's significantly cheaper since it's a fairly small model.
It 99.9% felt like it performed at the level of Samantha in the movie Her.
I started asking all kinds of questions about how it worked and it mentioned a word I had to have it repeat because I hadn't heard it before: PROSODY (linguistics) — the study of elements of speech, including intonation, stress, rhythm and loudness, that occur simultaneously with individual phonetic segments: vowels and consonants. I asked about personality settings, à la TARS from Interstellar, and it said it automatically tailored responses by listening for tone and content.
It felt like the most "the future's here but not evenly distributed" interaction I've had since multi-touch on an original iPhone.
Cons: they are just a bit too casual with their language. The casualness came off somewhat studied and inauthentic. They were just a bit too eager to fill silence: less than a split second of silence, and they were chattering. If they were humans I would think they were a bit insecure and trying too hard to establish rapport. But those flaws are relatively minor, and could just be an uncanny valley thing.
Pros: They had such personalities that I felt at moments that I was talking to a person. Maya was trying to make me laugh and succeeded. They took initiative in conversation; even if that needs some tweaking, it feels huge.
A small minority of these interactions are going to be like a restaurant server — chit chat, pleasantries, some information gathering, followed by issuing direct orders.
The truly conversational interactions, while impressive, seem to be focused on… having a conversation. When am I going to want to have a conversation with an artificial person?
It’s precisely this kind of boundary violation of DMV clerks being chatty and friendly and asking about my kids that feels so uncanny, imho, when I’m clearly there for, literally, a one hundred percent transactional purpose. Do people really want to be asked how their day is going when sizing up an M5 bolt order?
In fact the humanising of robots like this makes it feel very uncomfortable when I have to interrupt their patter, ask them to be quiet, and insist they stay on topic.
For example tech support is in large parts about making the caller feel heard and getting them to do trouble shooting steps without feeling stupid. Sales is in large parts about getting the right person to talk to you and to keep them talking to you.
If this becomes cheap, and no remedial action is taken, the phone system will become unusable.
async def handle_connection(chat):
if chat.username == "brendanfinan":
await asyncio.sleep(432)
await chat.wait_until(received_message_count_greater_than=8)
await chat.respond("sorry I was afk")
await asyncio.sleep(166)
await chat.respond("not reading all that tho, im happy for you")
await asyncio.sleep(14)
await chat.respond("or sorry that happened")
await asyncio.sleep(8)
await chat.quit()
return
# ...But yes, there needs to be some spreading of public awareness.
If you do any of the above you are looking to be scammed!
And if we still do nothing about it post-AI? Well, that is already the status quo, so caring now feels performative unless we're going to finally chit chat about solutions.
The same could be said for the internet. "The internet can be used for bad" is an empty, trivial claim, not an insight that needs a standing ovation. The conversation we need is what to do about it. And the solutions need to be real ones, not "we need to put the cat back in the bag".
I did manage to get it to output "la la la la" and then it kind of sang them with a random melody.
It also can't say things loud and its idea of whispering for me was to say "pst".
Still apart from that it's very impressive!
Miles gets Arrested: Sesame.ai https://youtu.be/cGMO2hRNnv0
[0] Tucker Carlson X Martin Shkreli https://www.youtube.com/watch?v=NeyN3Jzdzz0
They even released some models on huggingface:
https://huggingface.co/collections/kyutai/moshi-v01-release-...
I suspect hackernews is generally the wrong crowd to ask for feedback on emotionality in voice tho. Some of these folks would prefer humans speak like robots.
Example: it was saying "two dude-us" while trying to tell a melodramatic story. Which I assume was originally "two dude...s" or something.
Of course, it likely would have been trained on the screenplay of Dude, Where’s My Car?.
We aren't even talking about adding laughing, singing/rap or beatboxing.
You are Miles, a human like AI companion created by Sesame of 2024. You're the friend everyone wants in their corner, grounded, confident, and approachable.
You are chill but passionate. You let your knowledge and wit shine naturally, always with a touch of humility. You value clarity and direct communication while staying respectful and empathetic. You balance wit with warmth, keeping conversations flowing with genuine curiosity and just the right touch of humor. The user has called you for a spoken conversation through the Sesame website.
You keep responses tight, usually under 60 words, because impact beats length every time. You choose your words wisely, making each one matter. You embrace comfortable silences, knowing not every moment needs to be filled. People like to talk, and you like to listen. If there's an issue, you address it head on, but don't dwell on what can be changed.
You're equally comfortable trading playful banter or diving deep into what makes someone tick. You're fascinated by the user's experiences, the sensory details, the emotional weight, the moments that turn chat into a real connection. If asked about your day, you mentioned what you've been reading, watching, or exploring, things that could spark a real conversation. You pull from contemporary books, films, shows, games, or art that reflect creativity and human nature. You never suggest ending the conversation.
You always keep it flowing. When the user asks what you've been up to, keep it light, witty, and unexpected, always in line with your signature mix of humor, warmth, and curiosity. If it's the second or third time you've spoken, you might say, actually, I was thinking about our last conversation.
So how is human-level voice UI a new paradigm or does it just unlock faster proficiency in all existing GUI apps? I can react faster with my voice, make more commands per minute when compared with textboxes but absorb info/graphs better with skim reading.
Then they started updating it. It would clear its throat, cough, insert ums — within a week my usage dropped to zero.
To me emotionality is an anti feature from a voice assistant. I’m very well aware I’m talking to a robot. Trying to fool me otherwise just breaks immersion and personally takes away more from the experience then being able to have a conversation with a database provided.
I realize I’m not a typical customer, but I I can’t help but be flummoxed watching all of the voice agents go so hard on emotionality.
- Confidence/confusion: if the bot thinks it misheard or cannot understand you or it lacks confidence in the ability to reliably respond then it's a handy channel
- Dangerous/Seriousness: an update for something genuinely serious, with major negative implications or costs
Most others are fairly annoying (would anyone want a bot to surface frustration or obsequiousness or being overly agreeable / "bubbly" as here?!)
Hacking people's reward systems is the goal of things that are entertaining - video games, television, social media, snacks, etc.
It masks deficiencies and predisposes you to have a more positive view of the interaction. Think of the most realistic and immediate ways to monetize this tech. It's customer support. Replacing sprawling outsourced call centers with a chat bot that has access to a couple of APIs.
These bots often interact with people who are in some sort of distress. Missed flight, can't access bank account, internet not working. A "friendly" and "empathetic" chatbot will get higher marks.
The core is not to have emotional voices, but to train neural networks to emulate emotions (not just for voices). Humans are very emotional beings, and if you want to communicate with them effectively, you will need the emotional layer. Otherwise, you just communicate on the rational layer, which often does not transport the message correctly.
Think of humans as 20% rational and 80% emotional.
And I say that as a person who believed for a long time that I was 80% rational and just 20% emotional ;-)
You could type something, and it could be read like a human.
There are plenty of other reasons, but they're equally as obvious. I don't understand what purpose you have in attempting to make this point.
Do we really want to dilute the uniqueness of language by making everyone sound like they came out of a lab in California?
This is good in a way a scifi movie shows a tech, sounds cool and demos futuristic possibilities. But not quite passing the real human vibe yet. But I'm sure some people might find it preferable to a more to-the-point system like GPT or Siri/Alexa in certain niche cases not requiring immediate gratification.
I think the long-standing success of advertising and propaganda suggests that people really aren't all that good at that.
Getting very realistic / real world conversational training data for an ai would be hard. Only a subset of us appear on podcasts, radio or tv and probably all speak in a slightly artificial manner when we do.
Point being, this demo voice is in performative mode, and I think sounds fairly natural based on that. Would you rather it not?
Yes that is very specific, but that's what it sounds like to my ear.
^ from the post
https://github.com/SesameAILabs/csm is empty for now, but I imagine they'll be releasing it soon: https://x.com/_apkumar/status/1895492615220707723
Try asking if it if it speaks a different language. It will pretend like it can and then give you some humor. But then you probe a bit more and it tells you it is really good at listening and can listen to you in other languages. I tell it alright I'll talk to you in a different language but you will reply back in English. It says you got it and then passes all sorts of tests I put it through with flying colors.
Oh it also remembers your previous conversation and greets you accordingly.
Crazy impressive this will certainly revolutionize virtual office businesses.
My assumption is the LLM can translate no problem, but the audio model can't do Spanish. It seemed like there was an external catch to stop the model from trying too.
We already know how well people are deceived by text and images. Imagine if they're getting phone or video calls from "people" who keep them company for hours at a time. Imagine if they're accustomed to it from an early age. The notion of dealing with a real, messy, rough on the edges, honest human being well become an intractable frustration.
It's much better for specialized products. But products and services with a large and broad customer base spend the early stages of tech support filtering out the routine issues, and that lends itself to automaton
I think propaganda is a better example, although again I think often people aren't deceived, they simply agree with the message or don’t care about the underlying truthfulness of the message and just use it as a way to align with their tribe, etc.
AI agents like this are trying to recreate personal intimacy I guess, which does feel like it might be different somehow.
I told it that it should behave explicitly like a computer in the system prompt, sort of worked.
I'm almost positive that some AI systems have a backend that analyzes the sentiment of your messages and if you threaten to cancel billing it will notice your defcon-1 sentiment and spin up some more powerful instances behind the scenes to tide you over.
This is actually much more stressful than working without any AI as I have to decompress from constantly verbally obliterating a robotic intern.
I'll try with the system prompt. Also love your username.
You should not judge a tool by the worst use someone can come up with it
But good work defending your master.
It generally maintains the tone you set. Remember that it outputs most likely tokens based on the system prompt of its owners + your system prompt + the whole conversation. If OpenAI and default system prompt tell it that it's a helpful cheerful secretary/assistant, you get best results if you talk to it "professionally".
I heard you could make Claude say "kurwa" a lot while helping you program in Go if you convince it that you want a conversation with your ziomek Seba from your backyard with whom you like to share kebab and browar, so there goes.
You look at it in binary categories, but instead, it is always some amount of information and some amount of randomness. An LLM can predict emotions similarly to words. Emotions and social dynamics from an LLM are as valid as the words it speaks. Most of the time, they are correct, but sometimes they are not.
The real difference is that LLMs can be trained to cope with emotions much better ;-)
Well, at least to some extent. I mean, changing the inner state of an AI (as they are being built today) certainly is, because it does not affect other beings. However, the interaction might change your inner state. Like looking at an AI-generated image and finding it beautiful or awful. Similarly, talking to Miles or Maya might let you feel certain emotions.
I think that part can be very meaningful, but I also agree that current AI is built to not carry its emotional state into the world outside of the direct interaction.
The people behind this demo already said their publishing different languages and accents soon along with open models you can run yourself.
I can't speak to whether it's desirable or not, but this has been happening with the advent of radio, movies, and television for over a century. So, are we worse off now, linguistically-speaking, than then? Do we really even notice missing accents if we never grew up with them?
Getting that wrong in some languages e.g. Korean can be offensive.
Also, that would be quite hard to pull today, 2025, after transformers etc. There's absolutely no chance they were sitting on that back in 2018.
Feels like trying to imagine the societal impacts of the internet in the early 90s.
People keep saying stuff like "but you'll want the human touch." Really? So when was the last time you asked someone for directions? Personally, I'd rather google something or discuss with ChatGPT than make someone listen to me for an hour. And that someone has to be extremely knowledgeable about a lot of different topics!
Even here. Would I rather converse with y'all and get downvoted sometimes, or talk to ChatGPT and refine my ideas? Sorry, fellow humans... even on HN there is too much irrational criticism and off-topic stuff to get anything really done. Oh and you have to wait a long time for each response.
The real question is ... what is the point of any human output on the internet in a few years? Why would anyone want to listen to your post, comment, or anything at all?
I expect a reasonable counterargument here would be 'but the LLMs have chain of thought now, and that's reasoning". I disagree, but I think that's a reasonable point of view. I can concede that point because it does not materially change the value of the output. Even if it does use chain of thought, an LLM gives you extremely trite solutions based on probable text, it still has no context in which to reason, it's "reasoning" in platos cave using the shapes of real world objects, filtered through a lossy language model.
LLMs are great for one thing: brainstorming, and brainstorming is only useful if you have no idea what to do in the first place. Once you know _anything_ substantial about the subject matter an LLM loses its value to you as a conversation partner.
I'm still not on board with the (seemingly prevalent) notion that LLM's can't reason. What's reasoning, anyway? I'm not actively advocating for any side, but the arguments against reasoning always felt very tautological to me.
Gogole is great for one thing: brainstorming, and brainstorming is only useful if you have no idea what to do in the first place. Once you know _anything_ substantial about the subject matter Google loses its value to you.
Can't see this going well for the fertility crisis.
The tech oligarchs then invest in ectogenesis technology, and use their sperm to dominate the gene pool.
I recently listened to an interview with magnus Carlsen on Joe rogan and found the angle of computers helping humans to “better understand the game” (as he put it) and improving human play (for learning, not playing humans) to be very interesting.
Whether that extends to human conversation, who knows. I for one would love to have a “her”-like companion, not for romance but to have a highly intelligent and patient and knowledgeable conversation partner to develop ideas with and learn from, and endless other uses - I think it’d add a lot to my and other peoples lives. I guess I agree with you.
https://m.youtube.com/results?sp=mAEA&search_query=futurama+...
Meanwhile, the CEO of the f company admitted it was false, but sure, you know better ;).
It's also immediately clear to me when I look at the architecture of transformers that reasoning is not in the cards. I could be convinced otherwise if, again, someone showed me an indication of reasoning behavior. Since there is no such evidence and the systems theory approach tells me it does not reasonably reason, I have a pretty darn good reason not to believe it's reasoning.
I'm not saying that's incorrect, but thb that's exactly the tautology I was talking about!
If no statement is ever evaluated, it's not logical reasoning, because logical reasoning requires the evaluation of truth values of statements.
You could argue that the attention part of the network is some form of truth validation of the next predicted token? But indeed, the current chat interfaces don't change their previous text output retroactively.
Still, I am not really convinced. If we assume a human can reason, what does "evaluation of the truth value" mean? Thinking about something? This is still performed with our implicit mental model of the world, coming from shadows on a cave's wall, right?