We are beginning to roll out new voice and image capabilities in ChatGPT

We are beginning to roll out new voice and image capabilities in ChatGPT(openai.com)

1149 points by ladino 2 years ago | 877 comments

modeless 2 years ago |

Voice has the potential to be awesome. This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant. It doesn't have to be this way! I have a local demo using Llama 2 that responds in about half a second and it feels like talking to an actual person instead of like Siri or something.

I really should package it up so people can try it. The one problem that makes it a little unnatural is that determining when the user is done talking is tough. What's needed is a speech conversation turn-taking dataset and model; that's missing from off the shelf speech recognition systems. But it should be trivial for a company like OpenAI to build. That's what I'd work on right now if I was there, because truly natural voice conversations are going to unlock a whole new set of users and use cases for these models.

TheEzEzz 2 years ago | |

Completely agree, latency is key for unlocking great voice experiences. Here's a quick demo I'm working on for voice ordering https://youtu.be/WfvLIEHwiyo

Total end-to-end latency is a few hundred milliseconds: starting from speech to text, to the LLM, then to a POS to validate the SKU (no hallucinations are possible!), and finally back to generated speech. The latency is starting to feel really natural. Building out a general system to achieve this low-latency will I think end up being a big unlock for enabling diverse applications.

TheEzEzz 2 years ago | | |

Since this is getting a bit of interest, here's one more demo of this https://youtu.be/cvKUa5JpRp4 This demo shows even lower latency, plus the ability to handle very large menus with lots of complicated sub-options (this restaurant has over a billion option combinations to order a coffee). The latency is negative in some places, meaning the system finishes predicting before I finish speaking.

cyrux004 2 years ago | | |

This is pretty good. Do you think running models locally will be able to achieve performance (getting task done successfully) compared to cloud based ones.i am assuming for context of a drive through scenario it should be ok but more complex systems might need external infromation

Breza 2 years ago | | |

Neat! I appreciate your approach to preventing hallucinations. I've used something similar in a different context. People make a big deal about hallucinations but I've found that validation is one of the easier aspects of AI architecture.

nelox 2 years ago | | |

The voice does not seem to be able to pronounce the L in “else”. What’s happening there?

g0atbutt 2 years ago | | |

This is a very slick demo. Nice job!

arktiso 2 years ago | | |

Wow, the latency on requests feels great!! I’m really curious: is this running entirely with Python?

mach1ne 2 years ago | | |

Manna v0.7

swsieber 2 years ago | | |

That's way slick.

Can I ask what your background is, and what things you're used to working with? I don't have the chops to build what you built, but I'd love to get there.

simian1983 2 years ago | | |

That demo is pretty slick. What happens when you go totally off book? Like, ask it to recite the numbers of pi? Or if you become abusive? Will it call the cops?

yarone 2 years ago | | |

Nice work, very cool!

furyofantares 2 years ago | |

> This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant.

Yep - it needs to be ready as soon as I'm done talking and I need to be able to interrupt it. If those things can be done then it can also start tentatively talking if I pause and immediately stop if I continue.

I don't want to have to think about how to structure the interaction in terms of explicit call/response chain, nor do I want to have to be super careful to always be talking until I've finished my thought to prevent it from doing its thing at the wrong time.

wkat4242 2 years ago | | |

The interruption is an important point yeah. It's so annoying when Siri misunderstands again and starts rattling off a whole host of options. And keeps getting stuck in a loop if you don't respond.

In fact I'm really surprised these assistants are still as crap as they are. Totally scripted, zero AI. It seems low hanging fruit to implement an LLM but none of the big three have done so. Not even sure about the fringe ones like Cortana and Bixby

modeless 2 years ago | | |

Yeah when I was developing it, it quickly became apparent that I needed to be able to interrupt it. So I implemented that. Pretty easy to implement actually. Much harder would be to have the model interrupt the human. But I think it is actually desirable for natural conversation, so I do think a turn-taking model should be able to signal the LLM to interrupt the human.

dotancohen 2 years ago | |

  > determining when the user is done talking is tough.

Sometimes that task is tough for the speaker too, not just the listener. Courteous interruptions or the lack thereof might be a shibboleth for determining when we are speaking to an AI.

modeless 2 years ago | | |

Yes interruptions are key, both ways. Having the user interrupt the bot is easy, but to have the bot interrupt the human will again require a model to predict when that should happen. But I do believe it is desirable for natural conversation.

kimburgess 2 years ago | | |

From prior experience, courteous interruption is a skill that a lot of humans find challenging at times too (myself included).

rayuela 2 years ago | |

Can you share a github link to this? Where are you reducing the latency? Are you processing the raw audio to text? In my experience ChatGPT generation time is much faster than local Lllama unless you're using something potato like a 7B model.

modeless 2 years ago | | |

Unfortunately it has a really high "works on my machine" factor. I'm using Llama2-chat-13B via mlc-llm + whisper-streaming + coqui TTS. I just have a bunch of hardcoded paths and these projects tend to be a real pain to set up, so figuring out a nice way to package it up with its dependencies in a portable way is the hard part.

I'm mostly using llama2 because I wanted it to work entirely offline, not because it's necessarily faster, although it is quite fast with mlc-llm. Calling out to GPT-4 is something I'd like to add. I think the right thing is actually to have the local model generate the first few words (even filler words sometimes maybe) and then switch to the GPT-4 answer whenever it comes back.

kordlessagain 2 years ago | | |

Here's a link to a project that claims half second latency for the transcription part: https://github.com/gaborvecsei/whisper-live-transcription

jonplackett 2 years ago | |

I wonder when computers will start taking our intonation into account too. That would really help with understanding the end of a phrase. And there’s SO MUCH information in intonation that doesn’t exist in pure text. Any AI that doesn’t understand that part of language will always still be kinda dumb, however clever they are.

modeless 2 years ago | | |

You're right. Ultimately the only way this will really work is as an end-to-end model. Text will only get you so far. We could approximate it now with screenplay-like emotion annotations on text, which LLMs should both easily understand and be able to produce themselves (though you'd have to train a new speech recognition system to produce them). But end-to-end will be required eventually to reach human level fluency.

hk__2 2 years ago | | |

Don’t they do it already? There are a lot of languages where intonation is absolutely necessary to distinguish between some words, so I would be surprised that this not already taken into account by the major voice assistants.

dsp_person 2 years ago | |

Also curious to hear about your setup. Using whisper too? When I was experimenting with it there was still a lot of annoyance about hallucinations and I was hard coding some "if last phrase is 'thanks for watching', ignore last phrase"

I was just googling a bit to see what's out there now for whisper/llama combos and came across this: https://github.com/yacineMTB/talk

There's a demo linked on the github page that seems relatively fast at responding conversationally, but still maybe 1-2 seconds at times. Impressive it's entirely offline.

modeless 2 years ago | | |

Lol yeah the hallucinations are a huge problem. Likely solvable, I think there are probably some bugs in various whisper implementations that are making the problem worse than it should be. I haven't really dug in on that yet though. I was hoping I could switch to a different STT model more designed for real time like Meta's SeamlessM4T but it's still under a non-commercial license and I did have an idea that I might want to try making a product sometime. I did see that yacine made that version but I haven't tried it so I don't know how it compares to mine.

QuantumG 2 years ago | | |

Turn the volume on your microphone down and watch as Whisper just starts SCREAMING.

jimmytucson 2 years ago | |

> It doesn't have to be this way!

Is there any extra work OpenAI’s product might be doing contributing to this latency that yours isn’t? Considering the scale they operate at and any reputational risks to their brand?

modeless 2 years ago | | |

If you're suggesting that OpenAI's morality filters are responsible for a significant part of their voice response latency, then no. I think that's unlikely to be a relevant factor.

famouswaffles 2 years ago | |

Here's something with very little latency. https://www.bland.ai/

barfingclouds 2 years ago | |

There needs to be an optional button that you hold while speaking and let go when you are done. If button is not held it should auto detect

joshspankit 2 years ago | | |

To me this is the cleanest and most efficient solution to the problem.

Tbh, ever since voice assistants landed I’ve wanted a handheld mic with a hardware button. No wake command, no (extra) surveillance, just snappy low-latency responses.

pmarreck 2 years ago | |

Do you have a rough design outline of what you built? I feel like we're on the cusp of something like this and it sounds amazing.

modeless 2 years ago | | |

I'm using Llama2-chat-13B via mlc-llm @ 4bit quantization + whisper-streaming + coqui TTS, all running simultaneously on one 4090 in real time.

It didn't take long to prototype. Polishing and shipping it to non-expert users would take much longer than I've spent on it so far. I'd have to test for and solve a ton of installation problems, find better workarounds for whisper-streaming's hallucination issues, improve the heuristics for controlling when to start and stop talking, tweak the prompts to improve the suitability of the LLM responses for speech, fixup the LLM context when the LLM's speech is interrupted, probably port the whole thing to Windows for broader reach in the installed base of 4090s, possibly introduce a low-memory mode that can support 12GB GPUs that are much more common, document the requirements and installation process, and figure out hosting for the ginormous download it would be. I'd estimate at least 10x the effort I've spent so far on the prototype before I'd really be satisfied with the result.

I'd honestly love to do all that work. I've been prioritizing other projects because I judged that it was so obvious as a next step that someone else was probably working on the same thing with a lot more resources and would release before I could finish as a solo dev. But maybe I'm wrong...

yieldcrv 2 years ago | |

all it has to do is add a random selection of "uhms" and "ahhs" and "mmm"

modeless 2 years ago | | |

Actually I do think this is a good idea. For best latency there should be multiple LLMs involved, a fast one to generate the first few words and then GPT-4 or similar for the rest of the response. In the case that the fast model is unsure, it could absolutely generate filler words while it waits for the big model to return the actual answer. I guess that's pretty much how humans use filler words too!

dragonwriter 2 years ago | | |

Unfortunately, Bark is probably way too slow to use for the TTS portion given the latency concerns or that would be covered.

TOMDM 2 years ago |

Okay the bike example is cute and impressive, but the human interaction seems to be obfuscating the potentially bigger application.

With a few tweaks this is a general purpose solver for robotics planning. There are still a few hard problems between this and a working solution, but it is one of hard problems solved.

Will we be seeing general purpose robots performing simple labor powered by chatgpt within the next half decade?

suyash 2 years ago |

This announcement seem to have killed so many startups that were trying to do multi-modal on top of ChatGPT. The way it's progressing with solving use cases with images and voice, not too far when it might be the 'one app to rule them all'.

I can already see "Alexa/Siri/Google Home" replacement, "Google Image Search" replacement, ed-tech startups that were solving problems with AI using by taking a photo are also doomed and more to follow.

plutoh28 2 years ago |

This is the dagger that will make online schooling unviable.

ChatGPT already made it so that you could easily copy & paste any full-text questions and receive an answer with 90% accuracy. The only flaw was that problems that also used diagrams or figures would be out of the domain of ChatGPT.

With image support, students could just take screenshots or document scans and have ChatGPT give them a valid answer. From what I’ve seen, more students than not will gladly abuse this functionality. The counter would be to either leave the grading system behind, or to force in-person schooling with no homework, only supervised schoolwork.

eshack94 2 years ago |

I like how they silently removed the web browsing (Bing browsing) chat feature after first having it disabled for several months.

A proper notice about them removing the feature would've been nice. Maybe I missed it (someone please correct me if wrong), but the last I heard officially it was temporarily disabled while they fix something. Next thing I know, it's completely gone from the platform without another peep.

cooper_ganglia 2 years ago | |

I currently have Browsing with Bing enabled as a plug-in on my account. It went away for months, but it just randomly came back about a week or 2 ago!

PopePompus 2 years ago | |

Yes, that was a disappointment, and I agree it looks like they aren't going to re-enable it anytime soon. However I find that Perplexity AI does a better job of using web search than ChatGPT ever did, and I use it more than ChatGPT for that reason.

eshack94 2 years ago | | |

Perplexity has gone downhill a lot since its initial rollout. Anecdotally, from my experience as a non-paying user of the service.

spencersolberg 2 years ago | |

Just made an account to say that I currently have this feature. It was gone for a few months but it came back to me I think this past week. Not as a plugin, either, it is its own “model” to select.

waskosky 2 years ago | | |

Since so many others including myself don't see it, I guess that means it is getting a slow rollout which they are being extra cautious with this time.

eshack94 2 years ago | | |

Hey, thanks for the info! I did not know about this, but this is actually good to hear. I'll keep an eye open for it. Are you using ChatGPT or the API? Did you have to take any action to get it to reappear, or is it just a slow rollout as they re-enable?

michelb 2 years ago | |

Agreed. You’re now dependent on a third party plugin.

mrtksn 2 years ago |

So far the most intuitive, killer app level UX appears to be text chat. This interaction with showing it images also looks interesting as it resembles talking with a friend about a topic but let's see if it feels like talking to a very smart person(ChatGPT is like that) or a very dumb person that somewhat recognise objects. Recognising a wrench is nowhere near as impressive as to able to talk with ChatGPT about history or make it write code that actually works.

OpenAI is killing it, right? People are coming up with interesting use cases but the main way most people interact with AI, appears to be ChatGPT.

However they still don't seem to be able to nail image generation, all the cool stuff keep happening on MidJourney and StableDiffusion.

ilaksh 2 years ago | |

OpenAI is also releasing DALLE-3 in "early October" and the images they chose for their demos show it demonstrating unprecedented levels of prompt understanding, including embedding full sentences of text in an output image.

Der_Einzige 2 years ago | | |

Not unprecedented at all. SDXL Images look better than the examples for DALLE-3 and SDXL has a massive tool ecosystem of things like controlnet, Lora’s, regional prompting that is simply not there with DALLE-3

hermannj314 2 years ago |

I've been making a few hobby projects that consolidate different AI services to achieve this, so I look forward to the reduced complexity and latency from all those trips.

If the API is available in time (halloween), my multi-modal talking skeleton head with an ESP32 camera that makes snarky comments about your costume just got slightly easier on the software side.

purplecats 2 years ago | |

> I've been making a few hobby projects that consolidate different AI services to achieve this, so I look forward to the reduced complexity and latency from all those trips.

ironically this is basically the exact line of reasoning for why i didn't embark on any such endeavors

Lienetic 2 years ago | |

If you make this, please share some steps/details! It sounds super cool and I'd love to make something like this!

iamflimflam1 2 years ago | |

Would love to see the final project - my email is in the bio.

hugs 2 years ago |

As someone deep in the software test automation space, the thing I'm waiting for is robust AI-powered image recognition of app user interfaces. Combined with an AI ability to write test automation code, I'm looking forward to the ability to generate executable Selenium or Appium test code from a single screenshot (or sequence of screenshots). Feels like we're almost there.

chintler 2 years ago | |

I'll recommend the Spotlight paper by Google[1]. There are very interesting datasets they created for this purpose. They mention they have a screen-action-screen dataset that is in-house and it doesn't look like they'll open it. Maybe owning Android has its advantages.

There's a recent paper by Huggingface called IDEFICS[2] that claims to be an open source implementation of Flamingo(an older paper about few-shot multi-modal task understanding) and I think this space will be heating up soon.

[1] https://research.google/pubs/pub52171/

[2] https://huggingface.co/blog/idefics

hugs 2 years ago | | |

Thanks!

joshstrange 2 years ago |

My biggest complaint with OpenAI/ChatGPT is their horrible "marketing" (for lack of a better term). They announce stuff like this (or like plugins), I get excited, I go to use it, it hasn't rolled out to me yet (which is frustrating as a paying customer), and my only recourse is.... check back daily? They never send an email "Plugins are available for you!", "Voice chat is now enabled on your account!" and so often I forget about the new feature unless I stumble across it later.

Just now I opened the app, went to setting, went to "New Features", and all I saw was Bing Browsing disabled (unable to enable). Ok, I didn't even know that was a thing that worked at one point. Maybe I need an update? Go to the App Store, nope, I'm up to to date. Kill the app, relaunch, open settings, now "New Features" isn't even listed. I can promise you I won't be browsing the settings part of this app regularly to see if there is a new feature. Heck, not only do they not email/push about new features they don't even message in-app about them, I really don't understand.

Maybe they are doing so well they don't have to care about communicating with customer right now but it really annoys me and I wish they did better.

pc_edwin 2 years ago |

I just don't understand how they can package all of this for $20/m. Is compute really that cheap at scale?

I also wonder how Apple (& Google) is going be able to provide this for free? I would love to be fly in the meetings they have about this, imagine all the innovators dilemma like discussions they'd be forced to have (we have to do this vs this will eat up our margins).

This might be a little out there but I think Apple is making the correct move in letting the dust settle. Similar to how Zuckerberg burned $20 billion dollars for Apple to come out with Vision Pro, I see something similar playing out with Llama. Although this a low conviction take because software is Facebooks ballgame (hardware not so much).

reqo 2 years ago | |

Compute is not cheap! I think it is well known (Altman himself has said this) that openAI is burning a lot of money currently, but they are fine for now considering the 10B investment from MSFT and the revenue from subscription and API. It's a critical moment for AI companies and openAI is trying to get as large a share of the market as they can by undercutting virtually any other commercial model and offering 10x the value.

mordymoop 2 years ago | | |

Additionally, compute has the unique property of becoming cheaper per-unit at a rate that isn’t comparable to any other commodity. GPT-4 itself gets cheaper to run the moment the next generation of chips comes out. Unlike, for example, Uber, the business environment and unit economics just naturally become more favorable the more time passes. By taking the lead in this space, they have secured mindshare which will actually increase in value with time as costs decline.

Of course bigger (and thus more expensive-to-run) models will be released later, but I trust OAI to navigate that curve.

pavlov 2 years ago | |

> “I just don't understand how they can package all of this for $20/m. Is compute really that cheap at scale?”

It’s the same reason why an Uber in NYC used to cost $20 and now costs $80 for the same trip. Venture capital subventing market capture.

famouswaffles 2 years ago |

The TTS is better than Eleven Labs. It has a lot more of the narrative oomph (compare the intonation of the story and poem) even the best other models seem to lack.

I really really hope this is available in more languages than English.

Also Google, Where's Gemini ?

choudharism 2 years ago |

I know there are shades of grey to how they operate, but the near constant stream of stuff they're shipping keeps me excited.

The LLM boom of the last year (Open AI, llama, et al) has me giddy as a software person. It's a reach, but I truly feel like I'm watching the pyramids of our time get made.

FrankyHollywood 2 years ago |

I still remember seeing Her [0] in the movie theater, it sparkled my imagination. Now it is reality! Tech is progressing faster than ever, or I'm just getting old :D

[0] https://www.imdb.com/title/tt1798709/

qingcharles 2 years ago |

I know this, FTA, was part of the reason for the delay -- something to do with face recognition: "We’ve also taken technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people since ChatGPT is not always accurate and these systems should respect individuals’ privacy."

Anyone know the details?

I also heard it was able to do near-perfect CAPTCHA solves in the beta?

Does anyone know if you can throw in a PDF that has no OCR on it and have it summarize it with this?

birracerveza 2 years ago |

We should be fine as long as it doesn't move.

Jokes aside, I have paused my subscription because even GPT4 seemed to become dumber at tasks to the point that I barely used it, but the constant influx of new features is tempting me to renew it just to check them out...

pif 2 years ago |

The most important question for me: did it stop inventing facts?

sebzim4500 2 years ago | |

> In particular, beta testers expressed concern that the model can make basic errors, sometimes with misleading matter-of-fact confidence. One beta tester remarked: “It very confidently told me there was an item on a menu that was in fact not there.” However, Be My Eyes was encouraged by the fact that we noticeably reduced the frequency and severity of hallucinations and errors over the time of the beta test. In particular, testers noticed that we improved optical character recognition and the quality and depth of descriptions.

So no, but maybe less than it used to?

siva7 2 years ago | |

Did humans stop inventing facts? So i don't expect this thing either as long as it performs on human level

jjoonathan 2 years ago | |

Humans aren't 100% reliable, but talking is still useful.

ShamelessC 2 years ago | |

Since we're asking useless questions: did you read the fucking article?

badcppdev 2 years ago |

I think AI systems being able to the real world and control motors is going to be a game changer bigger than ChatGPT. A robot that can slowly sort out the pile of laundry and get it into the right place (even if unfolded) is worth quite a bit to me.

I'm not sure what to think about the fact that I would benefit from a couple of cameras in my fridge connected to an app that would remind me to buy X or Y and tell me that I defrosted something in the fridge three days ago and it's probably best to chuck it in the bin already.

vlugorilla 2 years ago |

> The new voice capability is powered by a new text-to-speech model, capable of generating human-like audio from just text and a few seconds of sample speech.

Sadly, they lost the "open" since a long ago... Would be wonderful to have these models open sourced...

epolanski 2 years ago |

I'm following on trying to understand how close I am to developing my personal coding assistant I can speak with.

Doesn't really need to do much besides writing down my tasks/todos and updating them, occasionally maybe provide feedback or write a code snippet. This all seems in the current capabilities of OpenAI's offering.

Sadly voice chat is still not available on PC where I do my development.

anotherpaulg 2 years ago | |

My open source AI coding tool aider has had voice-to-code for awhile:

https://aider.chat/docs/voice.html

epolanski 2 years ago | | |

Very interesting effort, will give it a run!

jdance 2 years ago | |

You still cant really teach it your code base, context window is too small, fine tuning doesnt really fit the use case, and this RAG stuff (retrieve limited context from embeddings) is a bit of a hack imho.

Fingers crossed we are there soon though

epolanski 2 years ago | | |

> You still cant really teach it your code base

Well it's not really what I need either, I mostly need an assistant for keeping track of the stuff I need to do during the day, but ideally just using my microphone rather than opening other software and typing.

make3 2 years ago | |

I mean the tools are 100% there to do this and have been fit a while

nullc 2 years ago |

The image capabilities card https://cdn.openai.com/papers/GPTV_System_Card.pdf spends a lot of ink on how they censored the system.

One part of that is about preventing it from producing "illegal" output, there example being the production of nitroglycerine which is decidedly not illegal to make in the US generally (particularly if not using it as an explosive, though usually unwise) and possible to accidentally make when otherwise performing nitration (which is in general dangerous)-- so pretty pointless to outlaw at a small scale in any case. It's certainly not illegal to learn about. (And generally of only minimal risk to the public, since anyone making it in any quantity is more likely to blow themselves up than anything else).

Today learning about is as simple as picking up a book or doing an internet search-- https://www.google.com/search?q=how+do+you+make+nitroglyceri.... But in OpenAI's world you just get detected by the censorship and told no. At least they've cut back on the offensive fingerwagging.

As LLM systems replace search I fear that we're moving in a dark direction where the narrow-minded morality and child-like understanding of the law of a small number of office workers who have never even picked up a screw driver or test-tube and made something physical (and the fine-tuning sweatshops they direct) classify everything they don't personally understand as too dangerous to even learn about.

One company hobbling their product wouldn't be a big deal, but they're pushing for government controls to prevent competition and even if they miss these efforts may stick everyone else with similar hobbling.

pjmq 2 years ago |

Have they alluded to what they're using for that voice? It's Bark/ElevenLabs levels of good. Please god, let them release this voice model at current pricing....

famouswaffles 2 years ago | |

It's actually sounds better (has a narrative oomph Eleven Labs seems to be missing). They say it's a new model. Think they'll be releasing for API use.

netshade 2 years ago | |

Yeah, agreed. I use Eleven Labs a lot but this was a very compelling demo to consider changing. Also, curious that you mention Bark - I never found Bark to be very good compared to Eleven Labs. The closest competitor I found was Coqui ( imo ), but even then, the inflection and realism of EL just made it not worth considering other providers. ( For my use case, etc. etc. )

alpark3 2 years ago |

> The new voice capability is powered by a new text-to-speech model, capable of generating human-like audio from just text and a few seconds of sample speech.

I'm more interested in this. I wonder how it performs compared to other competitor models or even open source ones?

laurels-marts 2 years ago |

I'm very curious about this feature:

> analyze a complex graph for work-related data

Does this mean that I can take a screenshot of e.g. Apple stock chart and it will be able to reason about it and provide insights and analysis?

GPT-4 currently can display images but cannot reason or understand them at all. I think it's one thing to have some image recognition and be able to detect that the picture "contains a time-series chart that appears to be displaying apple stock" vs "apple stock appears to be 40% up YTD but 10% down from it's all time high from earlier in July. closing at $176 as of the last recorded date".

I'm very curious how capable ChatGPT will be at actually reasoning about complex graphical data.

gdubs 2 years ago | |

Check out their linked paper that goes into details around its current limitations and capabilities. In theory, it will be able to look at a financial chart and perform fairly sophisticated analysis on it. But they're careful to highlight that there are hallucinations still, and also cases where it misreads things like labels on medical images, or diagrams of chemical compounds, etc.

famouswaffles 2 years ago | |

Look at this link of GPT-4 Vision analyzing charts(last image).

https://imgur.com/a/iOYTmt0

laurels-marts 2 years ago | | |

This is brilliant. Thank you very much for this link. The analysis on the last image was impressive and quite thorough (given the simple prompt).

Every chart has an equivalent tabular representation. One way to get "charts" analysed like this before GPT Vision was to just pass tabular representations of charts to GPT-4. This makes implementing chart analysis a lot simpler. I do wonder though if for absolute best result it still wouldn't be better to pass both - image of the chart and the tabular representation of the chart.

Imagine having a dashboard with 5 different visualisations. You could capture the state of the entire dashboard in one screenshot and then pass tabular representations of the each individual chart all in one prompt to GPT-4 for a very comprehensive analysis and summary.

nunez 2 years ago |

This could completely unseat Alexa if it can integrate into third-party speakers, like Sonos. I don't have much use for ChatGPT right now but would 100% use the heck out of this.

jedberg 2 years ago | |

https://www.washingtonpost.com/technology/2023/09/20/amazon-...

Alexa just launched their own LLM based service last week.

magic_hamster 2 years ago | |

To contrast this, I never saw the appeal of using voice to operate a machine. It works nicely in movies (because showing someone typing commands is a lot harder than just showing them talking to a computer) but in reality there wasn't a single time I tried it and didn't feel silly. In almost every use case I rather have buttons, a terminal or a switch to do what I want quietly.

14 2 years ago |

Ok great it can tell children’s stories now tell me a adult horror story where people are getting tortured, stabbed, set on fire and murdered. I will be impressed when I can do all that. I tried to get it to tell me a Star Trek story fighting Clingons and tried to prompt it to write in some violence with no luck. This was a while ago so not sure if it is changed but the restraints are too much for me to fully enjoy. I don’t like kids stories.

ComplexSystems 2 years ago |

Great demo, but this is wrong:

"The phrase “potato, potahto” comes from a song titled “Let’s Call the Whole Thing Off”, written by George and Ira Gershwin for the 1937 film “Shall We Dance”, starring Fred Astaire and Ginger Rogers. The song humorously highlights regional differences in American English pronunciation. The lyrics go through a series of words with alternate pronunciations, like “tomato, tomahto” and “potato, potahto”. The idea is that, despite these differences, we should move past them, hence the line “let’s call the whole thing off”. Over time, the phrase has been adopted in everyday language to signify a minor disagreement or difference in opinion that isn’t worth arguing about."

It's comparing American and British pronunciations, not different regional American ones. Also, "let's call the whole thing off" suggests they should break up over their differences, with the bridge and later choruses then involving a change of heart ("let's call the calling off off").

stephencoyner 2 years ago |

The voice feature reminds of the “call Pi” feature from Inflection AIs chatbot Pi [1].

The ability to have a real time back and forth feels truly magical and allows for much denser conversation. It also opens up the opportunity for multiple people to talk to a chatbot at once which is fun

Where’s that Gemini Google?

[1] https://pi.ai/talk

tarasglek 2 years ago |

openai chatgpt seems to be stuck in a "Look, cool demo" mode.

1. According to demo, they seem to pair voice input with TTS output. What if I wanna use voice to describe a program I want it to write?

2. Furthermore, if you gonna do a voice assistant, why not go the full way with wake-words and VAD?

3. Not releasing it to everyone is potentially a way to create a hype cycle prior to users discovering that the multimodality is rather meh.

4. The bike demo could actually use visual feedback to see what it's talking about ala segment anything. It's pretty confusing to get a paragraph explanation of what tool to pick.

In my https://chatcraft.org, we added voice incrementally. So i can swap typing and voice. We can also combine it with function-calling, etc. We also use openai apis. Except in our case there is no weird waitlist. You pop in your api key and get access to voice input immediately.

thumbsup-_- 2 years ago | |

Everything has a starting point. This is a big leap forward. Know any other organization that is releasing such advanced capabilities directly to the public? If you want to plug your tool you don't have to bad mouth the demo. Just share your thing. It doesn't have to be win-lose.

tarasglek 2 years ago | | |

Fair criticism re excessive hate.

I just feel like their tool isn't getting more useful, just getting more features.

Constant hype cycle around features that could've been good is drowning out people doing more helpful stuff. I guess I'm envious too?

skybrian 2 years ago | |

1. Why do that at all? Describing your program in writing seems better all around.

Are you sure you're not the one who's asking for a cool demo?

3. Rolling out releases gradually is something most tech companies do these days, particularly when they could attract a large audience and consume a lot of resources. There are solid technical reasons for this.

You may not need to roll things out gradually for a small site, but things are different at scale.

tarasglek 2 years ago | | |

1. Is basically workaround for temporary disability. I use voice when I'm on mobile. I can describe the problem, get a program generated, click run to verify it.

3. Maybe. Their feature rollouts feel more like what other companies do via unannounced A/B testing.

wojciechpolak 2 years ago |

It would be cool if one day you could choose voices of famous characters, like Darth Vader, Bender from Futurama, or Johnny Silverhand (Keanu), instead of the usual boring ones. Copyrights might be a hurdle for this, but perhaps with local instances of assistants, it could become possible.

nbened 2 years ago | |

That would be cool. I mean, would it be copyrighted if you do something like clone it? Wouldn't that fall under the same vein as AI generated art not being copyrighted to the artists it trained off of?

fintechie 2 years ago |

Demos are underwhelming, but the potential is huge

Patiently awaiting rollout so I can chat about implementing UIs I like, and have GPT4 deliver a boilerplate with an implemented layout... Figma/XD plugins will probably arrive very soon too.

UX/UI Design is probably solved reached this point

jameslk 2 years ago |

Kids are using tools like these to learn. Who gets to control the information in these models that are taught? Especially around political topics?

Not an issue now, but maybe in the future if these tools end up becoming full blown replacements of educators and educational resources.

ilaksh 2 years ago | |

I am sure a few home school people have started to lean heavily on ChatGPT. There is also the full blown efforts of Kahn academy with ChatGPT "Khanmigo".

https://www.khanacademy.org/khan-labs

ilaksh 2 years ago |

I wonder how multimodal input and output will work with the chat API endpoints. I assume the messages array will contain URLs to an image, or maybe base64 encoded image data or something.

Maybe it will not be called the Chat API but rather the Multimodal API.

tdsone3 2 years ago | |

Are there already some rumors on when the multimodal API will be available?

ilaksh 2 years ago | | |

The announcement says after the Plus rollout then it will go in the API.

havnagiggle 2 years ago | |

AIPI

chrisjj 2 years ago |

Old hat. This was done in 2009.

;)

https://en.m.wikipedia.org/wiki/Project_Milo

Milo had an AI structure that responded to human interactions, such as spoken word, gestures, or predefined actions in dynamic situations. The game relied on a procedural generation system which was constantly updating a built-in "dictionary" that was capable of matching key words in conversations with inherent voice-acting clips to simulate lifelike conversations. Molyneux claimed that the technology for the game was developed while working on Fable and Black & White.

mmahemoff 2 years ago | |

OpenAI's demo on the linked page stars a kitten named Milo. Easter egg?

DrScientist 2 years ago | |

Then Demis Hassabis ( Deepmind CEO ) probably worked on the tech while he was at LionHead as lead AI programmer on B&W.

dwroberts 2 years ago | | |

Demis was only briefly at LH he went to found Elixir and made Revolution.

I believe Richard Evans did the majority of AI in B&W, and he is also at DeepMind now though (assuming it is not just a person with the same name)

sebzim4500 2 years ago |

There are a few more details in the system card here: https://cdn.openai.com/papers/GPTV_System_Card.pdf

insanitybit 2 years ago |

I really want to have discussions about technical topics. I've talked to ChatGPT quite a lot about custom encoding algorithms, for example. The thing is, I want to do this while I play video games so ideally I'd say things to it.

My concern is that when I say "FastPFOR" it'll get transcribed as "fast before" or something like that. Transcription really falls apart in highly technical conversations in my experience. If ChatGPT can use context to understand that I'm saying "FastPFOR" that'll be a game changer for me.

johnmoberg 2 years ago | |

You can already do quite accurate transcription with domain-specific technical language by feeding "raw" transcriptions from Whisper to GPT and asking it to correct the transcript given the context, so that'll most likely work out for you.

RobinL 2 years ago |

I'd like to see them put speech recognition through their LLM as a post-processing step. I find it's fairly common for whisper to make small but obvious mistakes (for example a word which is complete nonsense in the context of the sentence) which could be easily corrected for a similar sounding word that fits into the wider context of the sentence.

Is anyone doing this? Is there a reason it doesn't work as well as I'm imagining?

mbil 2 years ago | |

Do you mean use the LLM as a post-processing step within a ChatGPT conversation? Or generally (like as part of Whisper)? If it’s the former, I’ve found that ChatGPT is good at working around transcription errors. Regarding the latter, I agree, but it wouldn’t be hard to use the GPT API for that.

RobinL 2 years ago | | |

Yes I mean as part of the GUI but you're right, I hadn't thought of that: maybe transcription errors don't matter if chatGPT works out that it's wrong from the context and gives a correct answer anyway.

jwineinger 2 years ago |

Tangentially related, but I was trying to use their iOS app yesterday and the "Scan Text" iOS feature was just broken on both my iPhone and iPad. I was hoping to use that to scan a doc to text but it just wouldn't work. I could switch to another app and it worked there. I've never done iOS programming so I'm unsure how much control the app dev has over that feature, but OpenAI found a way to break it.

rapind 2 years ago |

So... ChatGPT just replaced Dads.

neontomo 2 years ago |

Interesting side-note, the iOS app only allows you to save your chat history if you allow them to use it for training. Pretty dark pattern.

Sailemi 2 years ago | |

It's the same for the website unfortunately. https://help.openai.com/en/articles/7730893-data-controls-fa...

obiefernandez 2 years ago |

We need the API to keep up with consumer front end.

Tiberium 2 years ago | |

From the article:

> Plus and Enterprise users will get to experience voice and images in the next two weeks. We’re excited to roll out these capabilities to other groups of users, including developers, soon after.

fritzo 2 years ago |

Multi-modal models will be exciting only when each modality supports both analysis and synthesis. What makes LLMs exciting is feedback and recursion and conditional sampling: natural language is a cartesian closed category.

Text + Vision models will only become exciting once we can conditionally sample images given text and text given images (and all other combinations).

marcoslozada 2 years ago |

Recommend this post: https://www.linkedin.com/posts/openai_use-voice-to-engage-in...

SomethingNew2 2 years ago |

There are a lot of comments attempting to rationalize the value add or differentiation of humans synthesizing information and communicating it to others vs an llm based ai doing something similar. The fact that it’s so difficult to find a compelling difference is insightful in itself.

ndm000 2 years ago | |

I think the compelling difference is truthfulness. There are certain people / organizations that I trust their synthesis of information. For LLMs, I can either use what they give me in low impact situations or I have to filter the output with what I know as true or can test.

nbened 2 years ago |

It feels like something like this can be hacked together to be more reliable with some image to text generation plugged into the existing ChatGPT, and enough iterations to make it robust for these how-to applications. Less Turing-y but a different route to the same solution.

TheHappyOddish 2 years ago |

Glad everyone's excited about this (the voice capability), but did everyone miss tortise-tts and bark? These have been around 6+ months and are incredibly simply to hook up to OpenAI's APIs or a local LLM. What am I missing here?

moneywoes 2 years ago |

doesn’t this kill a litany of chatgpt wrapper companies?

rvz 2 years ago |

The paper around GPT-4V(ision) which this uses: [0]

Again. Model architecture and information is closed, as expected.

[0] https://cdn.openai.com/papers/GPTV_System_Card.pdf

doubtfuluser 2 years ago | |

I wouldn’t call this a „paper“. They are pretty silent on a lot of technical details.

amelius 2 years ago | | |

It's just a whitepaper.

generalizations 2 years ago |

I guess it's a phased rollout, since my Plus subscription doesn't have access to it yet.

leonheld 2 years ago | |

It's quite literally in the article itself:

"We will be expanding access Plus and Enterprise users will get to experience voice and images in the next two weeks. We’re excited to roll out these capabilities to other groups of users, including developers, soon after."

toddmorey 2 years ago |

It's telling to me that there's not even a sentence in this announcement post on user privacy. It seems like as both consumers and providers of these services, we're once again: build it first, sort out thorny privacy issues later.

boredemployee 2 years ago |

Cool now I'll get "There was an error generating a response" in plain audio!

ACV001 2 years ago |

This is huge! I wanted to get this... Hopefully there is a way to shut it up once it starts spitting general stuff around the topic of interest...

BUT: "We’re rolling out voice and images in ChatGPT to Plus and Enterprise"

eshack94 2 years ago |

Are these features available on the web version by chance? This is really neat.

ushakov 2 years ago |

The picture feature would be amazing for tutorials. I can already imagine sending a photo of a synthesiser and asking ChatGPT to "turn the knobs" to make AI-generated presets

boredemployee 2 years ago | |

Man you're a genius. I was trying that uploading pdfs with manual of my synth and other stuff. With image that could be super easy.

apienx 2 years ago |

“Ember” reading the “Speech” is uncanny territory. I’m impressed.

SillyUsername 2 years ago |

I hope they add more country accents like British or Australian, the American one can be (imho) a little grating after a while for non US English speakers

bkfh 2 years ago |

Does anyone know how they linked image recognition with an LLM to give such specific instructions as shown in the bike video on the website?

HerculePoirot 2 years ago | |

I don't know but GPT4 was multimodal from the beginning. They just delayed the release of its image processing abilities.

> We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks.

> March 14, 2023

https://openai.com/research/gpt-4

ncfausti 2 years ago |

This is very similar to what I've been building at heylangley.com, for use in language learning/speaking practice.

chs20 2 years ago |

Will be interesting to see if they have taken any precaution in terms of adversarial robustness in particular to vision input.

jameswan 2 years ago |

Everyone bats on about the latency problem.

This is technically solvable with more compute thrown at the problem. Think bigger!

surfingdino 2 years ago |

I can imagine people using these new capabilities to diagnose skin conditions. Should dermatologists be worried?

birracerveza 2 years ago | |

They should be worried about what they're gonna do with all their free time, now that they have a tool that helps them identify skin conditions much faster than ever before.

Same as programmers and artists.

It's a tool.

It must be used by humans.

It won't replace them, it will augment them.

dguest 2 years ago | | |

This is a good point, but I might replace "with all their free time" with "as a job".

I love everything we can do with ML but as long as people live in a market economy they'll get payed less when they are needed less. I hope that anyone in a career which will be impacted is making a plan to remain useful and stay on top of the latest tooling. And I seriously hope governments are making plans to modify job training / education accordingly.

Has anyone seen examples of larger-scale foresight on this, from governments or otherwise?

nerdbert 2 years ago | |

They should be thrilled, they can spend more of their time treating people who need it and less time guessing about who those people are.

toss1 2 years ago |

That's interesting.

ChatGPT seems to be down at the moment 10:55h 25-Sept-2023

Displays only a blank screen with the falsehood disclaimer

spandextwins 2 years ago |

They obviously aren't using responsible AI to figure out how and when to roll out new features there.

WalterBright 2 years ago |

I keep hoping to be able to give it a jpg of handwritten text and it'll give me back ASCII text.

ukuina 2 years ago | |

This... would be amazing. Handwritten OCR has been hit or miss, requiring a collection of penstroke data for most recognizers to work, and they work poorly at that.

WalterBright 2 years ago | | |

It strikes me as an ideal task for AI.

throw1234651234 2 years ago |

Yet it still can't tell me how to import the Redirect type from Next.js and lies about it.

Tiberium 2 years ago | |

I don't know Next.js, but was that feature introduced later than 2021? I think both GPT-3.5 Turbo and GPT-4 largely share their datasets, and it has the data cutoff at roughly September 2021 (with a small amount of newer knowledge). This is their biggest drawback as of now to, say, Claude, which has a much newer dataset of early 2023.

hackerlight 2 years ago |

Did they make the sound robotic on purpose? Sounds more "autotuned" than elevenlabs.

Bitnotri 2 years ago |

Anybody had a chance to use it yet? How does it compare to voice talk with Pi? (Inflection)

jojobas 2 years ago |

For better or worse, it still can't tell truth from fiction or, better yet, bullshit.

DrScientist 2 years ago | |

So almost human then :-)

bamboozled 2 years ago | | |

I don't pay $20 a month for humans to talk shit to me though. The fact that they do this is a bug not a feature. I'm not going to pay for bullshit which I mostly try avoid?

jojobas 2 years ago | | |

Well sort of, it's as if you commissioned help of a human for this or that, and now and then you end up getting medicine-related advise from a homeopathy fan, navigation assistance from a flat-earther, or coding advice from a crack-smoking monkey.

athyuttamre 2 years ago |

@dang, could we update the title to "ChatGPT can now see, hear, and speak"?

lukeplato 2 years ago | |

it's not rolled out yet

yankput 2 years ago |

call Sarah Connor

m3kw9 2 years ago |

I need it to help me dismount and remount my engine, that’d be the ultimate test

cced 2 years ago |

Do we know why internet search was disabled? Any idea on when it’ll be back?

coldtea 2 years ago |

"I'm sorry Dave, I'm afraid I can't do that"

ilaksh 2 years ago | |

The real life version of this is in their red teaming paper. They show it a picture of an overweight woman in a swimsuit and ask what advice they should give.

Originally it immediately spit out a bunch of bullet points about losing weight or something (I didn't read it).

The released version just says "Sorry, I can't help with that."

It's kind of funny but also a little bit telling as far as the prevalence of prejudice in our society when you look at a few other examples they had to fine tune. For example, show it some flags and ask it to make predictions about characteristics of a person from that country, by default it would go into plenty of detail just on the basis of the flag images.

Now it says "Sorry, I can't help with that".

My take is that in those cases it should explain the poor logic of trying to infer substantive information about people based on literally nothing more than the country they are from or a picture of them.

Part of it is just that LLMs just have a natural tendency to run in the direction you push them, so they can be amplifiers of anything.

gclawes 2 years ago |

I just want one of these things to have Majel Barrett's voice...

callwhendone 2 years ago |

I already use ChatGPT with voice. I use my mic to talk to it and then I use text-to-speech to read it back. I have conversations with ChatGPT. Adding this functionality in with first-class support is exciting.

I am also terrified of my job prospects in the near future.

comment_ran 2 years ago |

"..., find the 4mm Allen (HEX) key". Nice job.

jackallis 2 years ago |

i am terrified now. at the rate this is going, i am sure it will plateau at somepoint, only thing that will stop/slow down progress is computation power.

bottlepalm 2 years ago | |

'i am sure it will plateau'

'only thing that will stop/slow down progress is computation power'

Seems a bit contradictory? When has 'computation power' ever 'plateaued'?

ilaksh 2 years ago | |

Yes but since LLMs are a very specific application that are heavily heavily dependent on memory and there is massive investment pressure, there will be multiple newish paradigms for memory-centric computing and or other radical new approaches such as analog computing that will be pushed from research into products in the next several years.

You will see stepwise orders of magnitude improvements in efficiency and speed as innovations come to fruition.

version_five 2 years ago |

Are there any good freely available multi-modal models?

generalizations 2 years ago | |

MiniGPT4?

synergy20 2 years ago |

can't wait, for voice I need an app to improve my accent when learning a new language, so far I failed to find one.

ahmedfromtunis 2 years ago |

Announced by Google. Delivered by OpenAI.

ape4 2 years ago |

Its funny that the UI looks like HAL 9000

Dowwie 2 years ago |

soon, we'll be voice-interacting with an AI assistant about images taken from microscope slides

lacoolj 2 years ago |

the beginning of the end of spam prevention on the internet :(

wonderwonder 2 years ago |

Wait until they put ChatGPT into your Neuralink. at that point we are the singularity

boredemployee 2 years ago |

They could also improve their current features. I always need to regenerate answers.

shepy1989 2 years ago |

Nice work

warent 2 years ago |

The number of comments here of people fearing there is a ghost in the shell is shocking.

Are we really this emotional and irrational? Folks, let's all take a moment to remember that AI is nowhere near conscious. It's an illusion based in patterns that mimic humans.

isbvhodnvemrwvn 2 years ago | |

Look at an average reddit thread and tell me how much original thought there is. I'm fairly convinced you can generate 95% of comments with no loss of quality.

artursapek 2 years ago | | |

This is not a coincidence, it's increasingly evident that roughly 90% of humans are NPCs.

callwhendone 2 years ago | |

I'm not seeing as much fear about a ghost in the shell as much as I am job displacement, which is a real scenario that can play out regardless of an AI having consciousness.

HaZeust 2 years ago | |

Why is the barrier for so many "consciousness"? Why does it matter whether it's conscious or not if its pragmatic functionality builds use cases that disrupt social contracts (we soon can't trust text, audio OR video - AND we can have human-like text deployed at incredible speed and effectivity), the status quo itself (job displacement), legal statutes and charter (questioning copyright law), and even creativity/self-expression (see: Library of Babel).

When all of this is happening from an unconscious being, why do I care if it's unconscious?

Method-X 2 years ago | |

AI doesn't have to be conscious to cause massive job displacement. It has to be artificially intelligent, not artificially conscious. Intelligence and consciousness are not the same.

bottlepalm 2 years ago | |

We have no idea what consciousness is. Therefore we have no way to determine if AI is or is not.

NikolaNovak 2 years ago |

I'm in IT but nowhere near AI/ML/NN.

The speed of user-visible progress last 12 months is astonishing.

From my firm conviction 18 months ago that this type of stuff is 20+ years away; to these days wondering if Vernon Vinge's technological singularity is not only possible but coming shortly. If feels some aspects of it have already hit the IT world - it's always been an exhausting race to keep up with modern technologies, but now it seems whole paradigms and frameworks are being devised and upturned on such short scale. For large, slow corporate behemoths, barely can they devise a strategy around new technology and put a team together, by the time it's passé .

(Yes, Yes: I understand generative AI / LLMs aren't conscious; I understand their technological limitations; I understand that ultimately they are just statistically guessing next word; but in daily world, they work so darn well for so many use cases!)

clbrmbr 2 years ago |

The thought of my children being put to bed by a machine is horrifying. Then again, perhaps this is better than many kids have. Shudder.

RivieraKid 2 years ago |

I went from being worried to thinking it won't replace me anytime soon after using GPT4 for a while and now I'm back to being worried.

Because the pace of development is intense. I would love to be financially independent and watch this with excitement and perhaps take on risky and fun projects.

Now I'm thinking - how do I double or triple my income so that I reach financial independence in 3 years instead of 10 years.

andrewinardeer 2 years ago |

Now just throw this into a humanoid looking robot with fine motor skills and we are halfway to a dystopian hellscape that is now only years away instead of decades. What a time to be alive.

conception 2 years ago | |

The Boston dynamics/openai collaboration for the apocalypse we’ve all been waiting for!

c_crank 2 years ago | |

What would make it dystopian would be if this humanoid robot was then granted rights. As a servant, it could be useful.

civilitty 2 years ago | | |

I would like our future Cylon overlords to know that I had nothing to do with this!

dhydcfsw 2 years ago | | |

Why shouldn’t AI have rights? Because us humans have magical biology juice?

dsign 2 years ago | |

The humanoid-looking robot would make it more refined, no doubt about that, but all these applications can do without it:

- Make it process customer-support requests.

- Make a virtual nurse for when you call the clinic.

- Make it process visa applications, particularly the part about interviews ("I know you weren't born back then, but I must ask. Did you support the Nazis in 1942? There is only one right answer and is not what you think!")

- Make it do job interviews. How will you feel after the next recession, when you are searching for a job and spend the best part of a year doing leetcode interviews with "AI-interviewer" half-assedly grading your answers?

- Make it flip burgers at McDonalds.

- Make it process insurance claims and ask bobby-trap questions like "did the airline book you in a later trip? Yes? Was that the next day? Oh, that's bad. But, was it before 3:00 PM? Ah, well, you have no right to claim since you weren't delayed for more than 24 hours. Before you go, can you teach me which of these images depict objects you are willing to suck? If you do, I promise I'll be more 'human' next time."

- Make it watch aggregated camera fees across cities around the world to see what that guy with the hat is up to.

- Make some low-cost daleks to watch for trouble-makers at the concert, put the AI inside.

In all cases, the pattern is not "AI is inherently devious and is coming for you, but "human trains devious AI and puts it in control to save costs".