Notes on OpenAI's new o1 chain-of-thought models

Notes on OpenAI's new o1 chain-of-thought models(simonwillison.net)

699 points by loganfrederick 1 year ago | 629 comments

layer8 1 year ago |

The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.

MPSimmons 1 year ago | |

The failure is in how you're using it. I don't mean this as a personal attack, but more to shed light on what's happening.

A lot of people use LLMs as a search engine. It makes sense - it's basically a lossy compressed database of everything its ever read, and it generates output that is statistically likely - varying degrees of likeliness depending on the temperature, as well as how many times the particular weights your prompt ends up activating.

The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows english_. It knows english at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output. It's not _just_ an output engine. It's an engine that outputs.

Asking it about nuanced details in the corpus of data it has read won't give you good output unless it read a bunch of it.

On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.

williamdclt 1 year ago | | |

> Treat it as a naive but intelligent intern

That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive to a common understanding. With a LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.

pedrosorio 1 year ago | | |

> It knows english at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output

This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.

https://x.com/colin_fraser/status/1834336440819614036

flanked-evergl 1 year ago | | |

> The failure is in how you're using it.

People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.

This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.

EagnaIonat 1 year ago | | |

> Treat it as a naive but intelligent intern.

You are falling into the trap that everyone does. In anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.

Treat it as text completion and you can get more accurate answers.

re-thc 1 year ago | | |

> Treat it as a naive but intelligent intern.

That's the crux of the problem. Why and who would treat it as an intern? It might cost you more in explaining and dealing with it than not using it.

The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.

_sys49152 1 year ago | | |

ive been doing exactly this for bout a year now. feed it words data, give it a task. get better words back.

i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.

chatgpt is fickle daily. most days its on point. some days its wearing a bicycle helmet and licking windows. kinda sucks i cant just zone out and daydream while working. gotta be checking replies for when the wheels fall off the convo.

tjoff 1 year ago | | |

And how much data can you give it?

I'm not up to date with these things because I haven't found them useful. But with what you said, and previous limitations in how much data they can retain essentially makes them pretty darn useless for that task.

Great learning tool on common subjects you don't know, such as learning a new programming-language. Also great for inspiration etc. But that's pretty much it?

Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.

Salgat 1 year ago | | |

Except that it sometimes does do those tasks well. The danger in an LLM isn't that it sometimes hallucinates, the danger is that you need to be sufficiently competent to know when it hallucinates in order to fully take advantage of it, otherwise you have to fallback to double checking every single thing it tells you.

usaar333 1 year ago | | |

> On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai) and it's score as a swe-bench agent is barely above gpt-4o (model card).

It gets really hard problems better, but it's unclear that matters all that much.

> A lot of people use LLMs as a search engine.

Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.

lyu07282 1 year ago | | |

> The magic is that _it knows english_.

I couldn't agree more, this is exactly the strength of LLMs that what we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastic. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English, and the syntax of programming languages, instead of the source of the knowledge to respond with. Stop anthropomorphizing LLMs, play to it's strengths instead.

In other words it might hallucinate a API but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.

__loam 1 year ago | | |

It doesn't know anything. Stop anthropomorphizing the model. It's predictive text and no the brain isn't also predictive text.

bsenftner 1 year ago | | |

> Treat it as a naive but intelligent intern.

I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.

1) Tell the LLM that it is a method actor, 2) Tell the method actor they are playing the role of a subject matter expert, 3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.

Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.

I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.

petesergeant 1 year ago | | |

This is demonstrably wrong, because you can just add "is this real" to a response and it generally knows if it made it up or not. Not every time, but I find it works 95% of the time. Given that, this is exactly a step I'd hope an advanced model was doing behind the scenes.

layer8 1 year ago | | |

> Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.

Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.

With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.

Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.

jrflowers 1 year ago | | |

> The failure is in how you're using it

This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here

> Provide it data, give it a task, and let it surprise you with its output.

And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.

> hallucinates non-existing libraries and functions

This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.

b33j0r 1 year ago | | |

Yeah except. I’m priming it with things like curated docs from bevy latest, using the tricks, and testing context limits.

It’s still changing things to be several versions old from its innate kb pattern-matching or whatever you want to call it. I find that pretty disappointing.

Just like copilot and gpt4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.sytem())` and other pre-schedule/fanciful APIs—things it should have in context.

I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”

PS: and by the time I’d done those experiments, I ran out of preview, resets 5 days from now. D’oh

tluyben2 1 year ago | | |

This model is, thankfully, far more susceptible for longer and elaborate explanation as input. The rest (4,4o,Sonnet) seem to struggle with comprehensive explanation; this one seems to perform better with a spec like input.

golergka 1 year ago | | |

> A lot of people use LLMs as a search engine.

GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).

acedTrex 1 year ago | | |

> Treat it as a naive but intelligent intern

So mostly useless then?

gmerc 1 year ago | | |

Interns are cheaper than o1-preview

nsagent 1 year ago | | |

Sorry, but that does not seem to be the case. A friend of mine who runs a long context benchmark on understanding novels [1] just ran an eval and o1 seemed to improve by 2.9% over GPT-4o (the result isn't on the website yet). It's great that there is an improvement, but it isn't drastic by any stretch. Additionally, since we cannot see the raw reasoning it's basing the answers off of, it's hard to attribute this increase to their complicated approach as opposed to just cleaner higher quality data.

EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.

[1]: https://novelchallenge.github.io/

KoolKat23 1 year ago | | |

This is a great description.

croes 1 year ago | | |

Intelligent?

Just ask ChatGPT

How many Rs are in strawberry?

kylebenzle 1 year ago | | |

Perfectly well put! We should change the name from "AI" (which it is not) to something like, "lossy compressed databases".

COAGULOPATH 1 year ago | |

Yes, this only helps multi-step reasoning. The model still has problems with general knowledge and deep facts.

There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.

I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.

Fair enough. The GPT-4 line aren't meant to be search engines or encyclopedias. This is still a useful update though.

barrkel 1 year ago | | |

o1-mini is a small model (knows a lot less about the world) and is tuned for reasoning through symbolic problems (maths, programming, chemistry etc.).

You're using a calculator as a search engine.

mattmanser 1 year ago | | |

It's actually much worse than that and you're inadvertently down playing how bad it is.

It doesn't even know mildly obsecure facts that are on the internet.

For example last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statwmnt, and threw out some convincing looking code.

You can't, it's impossible. It wàa completely wrong. When I told that this, it told me I was right, and proceeded to give me code that was even more wrong.

This is an obscure, but well documented, part of the spec.

So it's not about facts that aren't on the internet, it's just bad at facts fullstop.

What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is not always a good thing with the way the language it uses to speak is so confident.

If you want to fuck with AI models as a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but chatgpt/codepilot/etc. will start spewing out that nonsense as if it's fact.

As for non-proframming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.

tptacek 1 year ago | |

I've had the opposite experience with some coding samples. After reading Nick Carlini's post, I've gotten into the habit of powering through coding problems with GPT (where previously I'd just laugh and immediately give up) by just presenting it the errors in its code and asking it to fix them. o1 seems to be effectively screening for some of those errors (I assume it's just some, but I've noticed that the o1 things I've done haven't had obvious dumb errors like missing imports, and all my 4o attempts have).

layer8 1 year ago | | |

My experience is likely colored by the fact that I tend to turn to LLMs for problems I have trouble solving by myself. I typically don't use them for the low-hanging fruits.

That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.

faangguyindia 1 year ago | |

>The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

After that you switch to Claude Soñnet and after sometime it also gets stuck.

Problem with LLM is that they are not aware of libraries.

I've fed them library version, using requirements.txt, python version I am using etc...

They still make mistakes and try to use methods which do not exist.

Where to go from here? At this point I manually pull the library version I am using and go to its docs, I generate a page which uses the this library correctly (then I feed that example into LLM)

Using this approach works. Now I just need to automate it so that I don't have to manually find the library, create specific example which uses the methods I need in my code!

Directly feeding the docs isn't working well either.

mediaman 1 year ago | | |

One trick that people are using, when using Cursor and specifically Cursor's compose function, is to dump library docs into a text file in your repo, and then @ that doc file when you're asking it to do something involving that library.

That seems to eliminate a lot of the issues, though it's not a seamless experience, and it adds another step of having to put the library docs in a text file.

Alternatively, cursor can fetch a web page, so if there's a good page of docs you can bring that in by @ the web page.

Eventually, I could imagine LLMs automatically creating library text doc files to include when the LLM is using them to avoid some of these problems.

It could also solve some of the issues of their shaky understanding of newer frameworks like SvelteKit.

fsndz 1 year ago | |

My point of view: this is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning

shmatt 1 year ago | |

I honestly can’t believe this is the hyped up “strawberry” everyone was claiming is pretty much AGI. Senior employees leaving due to its powers being so extreme

I’m in the “probabilistic token generators aren’t intelligence” camp so I don’t actually believe in AGI, but I’ll be honest the never ending rumors / chatter almost got to me

Remember, this is the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for

vasco 1 year ago | | |

The whole safety aspect of AI has this nice property that it also functions as a marketing tool to make the technology seem "so powerful it's dangerous". "If it's so dangerous it must be good".

kqr 1 year ago | | |

> probabilistic token generators aren’t intelligence

Maybe this has been extensively discussed before, but since I've lived under a rock: which parts of intelligence do you think are not representable as conditional probability distributions?

synarchefriend 1 year ago | | |

"Senior employees leaving due to its powers being so extreme"

This never happened. No one said it happened.

"the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for"

The Information reported someone at a meeting suggested this for future models, not specifically Strawberry, and that it would probably not actually be that high.

firejake308 1 year ago | | |

I mean, considering how many tokens their example prompt consumed, I wouldn't be surprised if it costs ~$2k/month/user to run

feralderyl 1 year ago | |

I think this model is a precursor model that is designed for agentic behavior. I expect very soon OpenAI to allow this model tool use that will allow it to verify its code creations and whatever else it claims through use of various tools like a search engine, a virtual machine instance with code execution capabilities, api calling and other advanced tool use.

markk 1 year ago | |

Stupid question: Why can't models be trained in such a way to rate the authoritativeness of inputs? As a human, I contain a lot of bad information, but I'm aware of the source. I trust my physics textbook over something my nephew thinks.

_fs 1 year ago | |

o1-preview != o1.

In public coding AI comparison tests, results showed 4o scoring around 35%, o1-preview scoring ~50% and o1 scoring ~85%.

o1 is not yet released, but has been run through many comparison tests with public results posted.

kristianp 1 year ago | | |

Good reminder. Why did OpenAI talk about o1 and not release it? o1-preview must be a stripped down version: cheaper to run somehow?

barrkel 1 year ago | | |

Don't forget about o1-mini. It seems better than o1-preview for problems that fit it (don't require so much real world knowledge).

arthurcolle 1 year ago | | |

gpt-4 base was never released and this will be the same thing

spaceman_2020 1 year ago | |

I don’t really see this as a massive problem. Its code. If it doesn’t run, you ask it to reconsider, give some more info if necessary, and it usually gets it right.

The system doesn’t become useless if it takes 2 tries instead of 1 to get it right

Still saves an incredible amount of time vs doing it yourself

latexr 1 year ago | | |

> Its code. If it doesn’t run, you ask it to reconsider

It is perfectly possible to have code that runs without errors but gives a wrong answer. And you may not even realise it’s wrong until it bites you in production.

benterix 1 year ago | | |

While I agree, I saw it abused in this way a lot, in the sense that the code did what it was supposed to do in a given scenario but was obviously flawed in various was so it was just sitting there waiting for a disaster.

troupo 1 year ago | | |

I haven't found a single instance where it saved me any significant amount of time. In all cases I still had to rewrite the whole thing myself, or abandon endeavor.

And a few times the amount of time I spent trying to coax a correct answer out of AI trumped any potential savings I could've had

HarHarVeryFunny 1 year ago | |

To the extent we've now got the output of the underlying model wrapped in an agent that can evaluate that output, I'd expect it to be able to detect it's own hallucinations some of the time and therefore provide an alternate answer.

It's like when an LLM gives you a wrong answer and all it takes is "are you sure?" to get it to generate a different answer.

Of course the underlying problem of the model not knowing what it knows or doesn't know persists, so giving it the ability to reflect on what it just blurted out isn't always going to help. It seems the next step is for them to integrate RAG and tool use into this agentic wrapper, which may help in some cases.

jiggawatts 1 year ago | |

> The o1-preview model still hallucinates non-existing libraries and functions for me

Oooh... oohhh!! I just had a thought: By now we're all familiar with the strict JSON output mode capability of these LLMs. That's just a matter of filtering the token probability vector by the output grammar. Only valid tokens are allowed, which guarantees that the output matches the grammar.

But... why just data grammars? Why not the equivalent of "tab-complete"? I wonder how hard it would be to hook up the Language Server Protocol (LSP) as seen in Visual Studio code to an AI and have it only emit syntactically valid code! No more hallucinated functions!

I mean, sure, the semantics can still be incorrect, but not the syntax.

loremaster 1 year ago | | |

This would be a big undertaking to get working for just one language+package-manager combination, but would be beautiful if it worked.

TeMPOraL 1 year ago | | |

I still fail to see the overall problem. Hallucinating non-existing libraries is a good programming practice in many cases: you express your solution in terms of an imaginary API that is convenient for you, and then you replace your API with real functions, and/or implement it in terms of real functions.

AbstractH24 1 year ago | |

One of the biggest problems with this generation of AI is how people conflate the natural language abilities and the access to what it knows.

Both abilities are powerful, but they are very different powers.

motoboi 1 year ago | |

Just pass a link to a GitHub issue and ask for a response or even a webpage to summarize and will see the beautiful hallucinations it will come up to as the model is not web browsing yet.

empath75 1 year ago | |

You should not be asking it questions that require it to already know detailed information about apis and libraries. It is not good at that, and it will never be good at that. If you need it to write code that uses a particular library or api, include the relevant documentation and examples.

It's your right to dismiss it, if you want, but if you want to get some value out of it, you should play to it's strengths and not look for things that it fails at as a gotcha.

bambax 1 year ago |

Near the end, the quote from OpenAI researcher Jason Wei seems damning to me:

> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Results are "strong" but can't be felt by the user? What does that even mean?

But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.

"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."

But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?

kristianp 1 year ago |

I tried a problem I was looking at recently, to refactor a small rust crate to use one datatype instead of an enum, to help me understand the code better. I found o1-mini made a decent attempt, but couldn't provide error free code. o1-preview was able to provide code that compiled and passed all but the test that is expected to fail, given the change I asked it to make.

This is the prompt I gave:

simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.

The original crate is trie-hard [1][2] and I forked it and put the models attempts in the fork [3]. I also quickly wrote it up at [4]

[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...

[2] https://github.com/cloudflare/trie-hard

[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts

[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...

lukev 1 year ago |

It's interesting to note that there's really two things going on here:

1. A LLM (probably a finetuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.

2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.

I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.

But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.

Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.

jumploops 1 year ago |

> the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.

As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.

It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.

freediver 1 year ago |

Not seeing major advance in quality with o1, but seeing major negative impact on cost and latency.

Kagi LLM benchmarking project:

https://help.kagi.com/kagi/ai/llm-benchmark.html

oersted 1 year ago | |

Kagi is most likely evaluating it mainly on deriving an answer for the user from search result snippets. Indeed, GPT-4o is plenty good at this already, and o1 would only perform better on particular types of hard requests, while being so much slower.

If you look at Appendix A in the o1 post [1], this becomes quite clear. There's a huge jump in performance in "puzzle" tasks like competitive maths or programming. But the difference on everything else is much less significant, and this evaluation is still focused on reasoning tasks.

The human preference chart [1] also clearly shows that it doesn't feel that much better to use, hence the overall reaction.

Everyone is complaining about exaggerated marketing, and it's true, but if you take the time to read what they wrote beyond the shallow ads, they are being somewhat honest about what this is.

[1] https://openai.com/index/learning-to-reason-with-llms/

freediver 1 year ago | | |

The test has many reasoning, code and instruction following questions which I expected o1 to be excelling at. I do not have an interpretation for such poor results on our test, was just sharing them as a data point for people to make their own mind. My best guess at this point is that o1 is optimized for a very specific and narrow use case, similar to what you suggest.

throwaway40602 1 year ago | | |

hey buddy, you're talking to owner of kagi, and the kagi benchmark is a traditional one

riku_iki 1 year ago | |

interesting that Gemini performs extremely poor in those benchmarks.

helmsb 1 year ago |

I did a few tests and asked it some legal questions. 4o gave me the correct answer immediately.

o1 preview gave a much more in depth but completely wrong answer. It took 5 follow ups to get it to recognize that it hallucinated a non-existent law

AhtiK 1 year ago | |

That is very interesting. Would you mind testing the same prompt with Claude Sonnet 3.5 and Opus? If not available to you, would you be willing to share the prompt/question? Thank you.

elicksaur 1 year ago | |

This is interesting since they claim it does well on STEM questions, which I’d assume would be a similar level of reasoning complexity for a human.

abernard1 1 year ago | | |

This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.

There is one other wonderful thing about symbolic math, the glorious '=' sign. It's structured everywhere from top-to-bottom, left-to-right, which is amenable to the next token prediction behavior and multi-attention heads of transformer based LLMs.

My guess is that problem statement formation into an equation is as difficult of a problem for these as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), the problems are fairly standard and have permutations of a problem type between the years.

This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably post problem formation. I think that judiciously used, this is a valuable feature.

waveBidder 1 year ago | | |

A difficult to guess fraction of all of these results are training to the test in various forms

m101 1 year ago | |

Perhaps the smaller model used in o1 is over trained on arxiv and code relative to 4o (or undertrained on legal text)

zurfer 1 year ago |

> I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview.

it seems trivial, but I tried for more than 2 hours in the past to get gpt4 to play tic-tac-toe optimally and failed (CoT prompt,etc.). The result were too many illegal moves and absolutely no optimal strategy.

o1-preview can do it really well [1]

However, when I use a non-standard grid (3x5) it fails to play optimally. But it makes legal moves and it recognized I had won. [2]

My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure, if they just included tic-tac-toe in the training data, or "spatial reasoning" is limited.

[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...

gloosx 1 year ago |

Just leaving here an example of the "smart chain of thought", which took the model about 20s of thinking. The problem I asked was – "what 3 numbers give the same result when divided and added together". Seems the current state of art is that basic school problems are too hard yet even for this new chain-of-marketing.

  Let's think through this step-by-step:

  We're looking for three numbers a, b, and c where:
    (a / b) / c = a + b + c
  This equation is a bit trickier than the multiplication one, because division is 
  not commutative (the order matters). We need to be careful about how we 
  interpret (a / b) / c.
  Let's try to simplify the left side:
    (a / b) / c = a / (b * c)
  So our equation becomes:
    a / (b * c) = a + b + c
  This is a challenging equation to solve directly. Let's try some numbers:

  If we use 1, 2, 3 like before:
    1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
  Let's try smaller numbers, like fractions:
    1/2, 1/3, 1/6


  Let's check:
    (1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
    1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
  This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
  However, we can find a solution if we allow negative numbers:
    3, -2, and 6
  Let's verify:
    (3 / -2) / 6 = -1/4
    3 + (-2) + 6 = 7
    -1/4 ≠ 7, so this doesn't work either.

  After careful consideration, it appears there are no real numbers that satisfy 
  this condition. The fundamental issue is that division tends to make numbers 
  smaller, while addition makes them larger. It's mathematically impossible to 
  find three numbers where these operations result in the same value.

nuff said

thenameless7741 1 year ago |

> No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages.

> No streaming support, tool usage, batch calls or image inputs either.

I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.

From https://platform.openai.com/docs/guides/reasoning

> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.

oersted 1 year ago | |

I wonder if it supports Structured Output / JSON Mode. That would make a big difference to programmatic use. I guess I will try it later when I have time.

ifdefdebug 1 year ago |

The use of the word reasoning here... OpenAI sounds like a company that created a frog which jumps higher and greater distances than the previous breed - and now they try to sell it as one step further toward flying.

isoprophlex 1 year ago | |

Can the frog reach escape velocity when jumping? I guess we'll find out sooner or later...

airstrike 1 year ago |

I've just wasted a few rounds of my weekly o1 ammo by feeding it hard problems I have been working on over the last couple days and for which GPT-4o had failed spectacularly.

I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better from what I got with the old model. TL;DR It felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.

Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.

It felt as unhelpful and bad at instructions as GPT-4o. "I'm sorry, you're absolutely right". It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.

Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential but it's like we're getting into diminishing returns way faster than I had hoped...

I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.

andrew_eu 1 year ago |

I thought with this chain-of-thought approach the model might be better suited to solve a logic puzzle, e.g. ZebraPuzzles [0]. It produced a ton of "reasoning" tokens but hallucinated more than half of the solution with names/fields that weren't available. Not a systematic evaluation, but it seems like a degradation from 4o-mini. Perhaps it does better with code reasoning problems though -- these logic puzzles are essentially contrived to require deductive reasoning.

[0] https://zebrapuzzles.com

slig 1 year ago | |

Hey, I run ZebraPuzzles.com, thanks for mentioning it! Right now I'm trying to improve the puzzles so that people can't "cheat" using LLMs so easily ;-).

andrew_eu 1 year ago | | |

It's fantastic! Thanks for the great work.

energy123 1 year ago | |

o1-mini does better than any other model on zebra puzzles. Maybe you got unlucky on one question?

https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...

andrew_eu 1 year ago | | |

Entirely possible. I did not try to test systematically or quantitatively, but it's been a recurring easy "demo" case I've used with releases since 3.5-turbo.

The super verbose chain-of-reasoning that o1 does seems very well suited to logic puzzles as well, so I expected it to do reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.

deepsquirrelnet 1 year ago |

It kind of seems like they just wrote a generalized DSPy program. Can anyone confirm?

This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.

CuriouslyC 1 year ago | |

DSPy doesn't do that, you could describe it as a langchain style agent that evaluates its own output though it's better/faster than that.

OpenAI is definitely trying to run a hype game to keep the ball rolling. They're burning cash too quickly given their monetization path though, so I think they're going to end up completely in Microsoft's pocket.

deepsquirrelnet 1 year ago | | |

It seems pretty close to the multihop QA example in their documentation[1]. I’d imagine you could adapt this to do something similar with more generic constructs.

[1] https://dspy-docs.vercel.app/docs/tutorials/simplified-balee...

cma 1 year ago | |

DSPy?

kevindamm 1 year ago | | |

https://github.com/stanfordnlp/dspy

ironhaven 1 year ago |

So is o1 nicknamed “strawberry” because it was designed to solve the “how many many times does the letter R appear in strawberry” problem.

energy123 1 year ago | |

No, that was a coincidence according to an employee there

z7 1 year ago | | |

Source: https://x.com/polynoamial/status/1834312400419652079

kurtoid 1 year ago | | |

Coincidence or not, they seem to be poking fun at it: https://openai.com/index/learning-to-reason-with-llms/#chain...

(end of the cipher example)

smokel 1 year ago | |

Or is it an obscure reference to the Dutch demogroup "Aardbei", most famous for their 64k intro "please the cookie thing" (2000)?

https://m.youtube.com/watch?v=ycmgjZLU0xQ

franze 1 year ago |

Just coded this this morning using chatgpt o1 - it is the reimplementation of an old idea now music, multiple dots, more and more bug fixes

honestly, chatgpt is now a better coder than i ever was or will be

https://lsd.franzai.com/

SeanAnderson 1 year ago | |

Neat idea. The ball frequently passes through solid lines though.

franze 1 year ago | | |

fixed, just asked chatgpt to come up with a better physics engine and collision detection algorithm

gabrielrdz 1 year ago | |

hah, this takes me back. There used to be a game called Jezzball, I think, back in the late 90's or early 00's. Had a lot of fun with that one.

tluyben2 1 year ago |

This model did single shot figure out things that Sonnet just ran ran in a loop doing wrong and reddit humans also seemed not be able to fix (because niche I guess). It is slow (21 seconds for the hardest issue), but that is still faster than any human.

briandw 1 year ago |

My 12 YO and I just built a fishing game using o1 preview. Prompt: "make a top down game in pyxel. the play has to pay off a debt to a cat by catching fish. the goal is for the player to catch the giant king fish. To catch the king fish the player needs to sell the fish to the cat and get money to buy better rods, 3 levels of rod, last one can catch the king fish."

It nailed the execution. Amazing.

causal 1 year ago | |

My first few attempts at getting it to work with an existing codebase have not been impressive. Perhaps 1o is best suited to difficult problems that can be stated in only a few sentences.

rm_-rf_slash 1 year ago | | |

I’ve had the opposite experience: terrific use with modifying existing codebases. But then again I’ve been using GPT4 to code for over a year now and so I’m used to writing out prompts with my eyes closed.

ikety 1 year ago | |

mind sharing the code? I asked o1 preview this verbatim and it produced errors. Just wanted to make sure its not me

dkga 1 year ago |

Judging by this thread, surely a part of OpenAI’s business model is to release models with somewhat grey-area outlandish claims, then sit and wait for people to test it out paying top dollars for tokens.

simonw 1 year ago | |

They’re not charging anything for o1 right now - it’s running as a free beta.

EDIT: Actually I’m not sure I’m correct about that - I’ve seen screenshots that indicate the API model is being charged for here: https://twitter.com/deepaiwriter/status/1834460643228897512

The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.

throwaway314155 1 year ago | | |

> The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.

I'm pretty sure the value they get is the money you pay.

sebzim4500 1 year ago | |

I would guess most of the people tesing the model were already paying for plus so they aren't getting any additional revenue for these new tokens.

resters 1 year ago |

I pasted in some markdown content with symbolic logic expressions and proof examples and the model said I was violating the terms of service.

throwuxiytayq 1 year ago | |

Welcome to The Future, where computation has terms of service and each workload you submit for execution is automatically verified to be in compliance.

henry2023 1 year ago |

>> which is larger 3.9, 3.10 or 3.11

To determine which number is larger among 3.9, 3.10, and 3.11, let's compare << them numerically:

Standard Decimal Interpretation:

3.9 is simply 3.9. 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers. 3.11 remains 3.11. Comparison:

3.9 vs. 3.1 vs. 3.11 3.11 is the largest number. Version Number Interpretation:

In version numbering, each segment separated by a dot is compared as an integer. 3.9 (major version 3, minor version 9) 3.10 (major version 3, minor version 10) 3.11 (major version 3, minor version 11) Comparison:

Minor versions: 11 > 10 > 9 3.11 is the largest version. Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.

...

So IMO level right?

lewhoo 1 year ago | |

This is truly the new model's answer ? It's pretty similar to 3.5's "reasoning" actually:

In this context, "3.10" and "3.11" should be interpreted as decimal numbers, not as numbers with more digits.

When comparing:

3.9 3.10 (which is equal to 3.1) 3.11 (which is equal to 3.11) We have:

3.9 is greater than 3.1 (3.10), because 9 is larger than 1. 3.11 is greater than 3.9, because 11 is larger than 9. Thus, 3.11 is the largest of the three numbers.

cynicalsecurity 1 year ago | |

That's hilarious.

ai4ever 1 year ago | |

lol,

they gamed AIME by over-training the hell out of it for marketing purposes and called it done.

meanwhile, back-to-basics is broken.

throwaway314155 1 year ago | |

> So IMO level right?

What?

minimaxir 1 year ago | | |

In this case, IMO means International Mathematical Olympiad

binary132 1 year ago |

Personally I felt like o1-preview is only marginally better at “reasoning”. Maybe I just haven’t found the right problems to throw at it just yet.

m3ch4m4n 1 year ago |

I have been testing o1 all day (not rigorously). And just took a look at this article. What I observed from my interactions is that it would misuse information that I provided in the initial prompt.

I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.

And while reading the article, it mentioned how NOT to provide irrelevant information to the task at hand via RAG. It appears that the trajectory of these thoughts are extremely sensitive to the initial conditions (prompt + context). One would imagine that if it had the ability to backtrack after reflecting, it would help with divergence, however, it appears that wasn't the case here.

Maybe there is another factor here. Maybe there is some confusion when asking it to plan something and the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? who knows. Interesting stuff though.

B1FF_PSUVM 1 year ago |

AFAICT, we got the ELIZA 60th anniversary edition, and are now headed for some Prolog/production systems iteration.

One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.

(And looking in this mirror isn't flattering us any.)

benterix 1 year ago |

From the article:

> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)

I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.

sigmoid10 1 year ago | |

I use gpt4-o mostly, but your specific use-case might have a big impact here: 4o is very likely a distilled model, meaning that it has fewer weights and can thus run much faster on the same hardware. If that is the case, it's general world knowledge must be less comprehensive by default. But it retained the strong reasoning capabilities of 4 through distillation and drastically improved on external tool use and vision. It also offers a much bigger context window. So if you're using it to automate complex tasks in your job that depend a lot on additional information that it hasn't seen during training, 4o is the obvious choice. If you're just using it as a search engine, you should probably stick with 4 for now.

postalcoder 1 year ago | |

I wholly agree with you. I've been using every model extensively since early the Davincis and I strongly believe that gpt-4-0314 was the best model they've released to date.

It's poor performance on benchmarks drives my skepticism of LLM benchmarking in general. I trust my feel for the models much more, and my feel was that 0314 was great.

The one thing that 0314 doesn't do well are the tricks like structured output and tool calling which makes it a less useful agentic type of tool, but from a pure thinking perspective, I think it's the best.

benterix 1 year ago | | |

That's my concern - they marked 4 as "legacy" in the GUI, and now they hid it temporarily under a submenu - but it's the only model I care about. If they remove it, there is no reason for me to use their services, especially with Claude 3.5 wider context window and reasonably good results.

HarHarVeryFunny 1 year ago |

I think Rich Sutton's bitter lesson will prove to apply here, and what we really need to advance machine learning capabilities are more general and powerful models capable of learning for themselves - better able to extract and use knowledge from the firehose of data available from the real world (ultimately via some form of closed-loop deployment where they can act and incrementally learn from their own actions).

What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.

m101 1 year ago | |

It works like our own minds in that we also think, test, go back, try again. This doesn't seem like a failing but just a recognition that thought can proceed in that way.

HarHarVeryFunny 1 year ago | | |

The "failing" here isn't the short term functional gains, but rather the choice of architectural direction. Trying to add reasoning as an ad-hoc wrapper around the base model, based on some fixed reasoning heuristics (built in biases) is really a dead-end approach. It would be better to invest in a more powerful architecture capable of learning at runtime to reason for itself.

Bespoke hand-crafted models/agents can never compete with ones that can just be scaled and learn for themselves.

joelburget 1 year ago | |

o1 is an application of the Bitter Less. To quote Sutton: "The two methods that seem to scale arbitrarily in this way are search and learning." (emphasis mine -- in the original Sutton also emphasized learning).

OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.

HarHarVeryFunny 1 year ago | | |

I think the key part of the bitter lesson is that (scalable) ability to learn from data should be favored over built-in biases.

There are at least three major built-in biases in GPT-O1:

- specific reasoning heuristics hard coded in the RL decision making

- the architectural split between pre-trained LLM and what appears to be a symbolic agent calling it

- the reliance on one-time SGD driven learning (common to all these pre-trained transformers)

IMO search (reasoning) should be an emergent behavior of a predictive architecture capable of continual learning - chained what-if prediction.

dailykoder 1 year ago |

I am mostly only an LLM user with technical background. I don't have much in-depth knowledge. So I have questions about this take:

>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!

So the text says reasoning and output tokens are the same, as in you pay for both. But does the increase say that it can actually do more, or does it just mean it is able to output more text?

Because by now I am just bored of GPT4o output, because I don't have the time to read through a multi-paragraph text that explains to me stuff that I already know, when I only want to have a short, technical answer. But maybe that's just what it can't do, give exact answers. I am still not convinced by AI.

simonw 1 year ago | |

I included that note because output limits are a personal interest of mine.

Until recently most models capped out at around 4,000 tokens of output, even as they grew to handle 100,000 or even a million input tokens.

For most use-cases this is completely fine - but there are some edge-cases that I care about. One is translation - if you feed in a 100,000 token document in English and ask for it to be translated to German you want about 100,000 tokens of output, rather than a summary.

The second is structured data extraction: I like being able to feed in large quantities of unstructured text (or images) and get back structured JSON/CSV. This can be limited by low output token counts.

dailykoder 1 year ago | | |

Sure, your cases are perfectly reasonable. I just wish the LLMs had a "feel" about when to output long or short text. Always thinking about adding something like "be as concise as possible" is kinda tedious

sarpdag 1 year ago |

I have tried the "mad cow" joke on o1-mini and it is still failing to explain correctly, but o1-preview correctly states "The joke is funny because the second cow unwittingly demonstrates that she is already affected by mad cow disease."

monkeydust 1 year ago |

Just finished reading the 'Book of Why by Judea Pearl' and my own mental gap from AI to today to whatever AGI is has got wider, thought not discounting this seems like a step forward.

deegles 1 year ago |

I was thinking about what "actual" AI would be for me and it would be something that could answer questions like "tell me every time Nicolas Cage has blinked while on camera in one of his movies".

Sure, that is a contrived question, but I expect an "AI" to be capable pf obtaining every movie, watching them frame-by-frame, and getting an accurate count. All in a few seconds.

Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.

littlestymaar 1 year ago | |

> All in a few seconds

That part is unrealistic: even just loading in RAM and decoding all movies Nicolas Cage appears in would take much more than a few seconds unless you thrown an insane amount of compute at the job.

That being said, the current LLM tech is probably enough to help you implement a program that parses IMDB to get the list of all Nicolas Cage movie, then download it on thepiratebay and then implement the blink count you're looking for. And you'd likely get the result in just a couple hours.

015a 1 year ago | | |

So what you're saying is, LLMs are good enough to do something that humans are already capable of doing, in a timeframe that a human would be reasonably capable of doing it in, and its unrealistic to believe that LLMs will ever be able to do something truly superhuman. Got it :+1:

bgun 1 year ago | |

I agree. My example for something “AI” should be able to do is to create a CAD model for the Empire State Building or the Parthenon based on known facts and photos.

I don’t think these are “moving the goalposts” examples, they are things that an actual intelligence capable of passing a PhD physics exam should be able to do.

a_wild_dandan 1 year ago | | |

I mean, I passed a physics PhD exam and I can’t model the Empire State Building. The jury is still out on whether I’m an intelligence tho.

fragmede 1 year ago |

I posted this on the other thread, but the two tests I had, it passed when ChatGPT-4 failed.

https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...

https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...

ssl-3 1 year ago | |

The farmer riddle isn't quite right as you presented it. One of the parts that makes it interesting is that the boat can't carry everything at one time[1]. It can't happen in one trip; something must be left behind.

It solved the correct version fine: https://chatgpt.com/share/66e3f9bb-632c-8005-9c95-142424e396...

1: https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem

fragmede 1 year ago | | |

You misunderstand the situation.

If I give ChatGPT-4 the original farmer riddle, it "solves" it just fine, but it's assumed that it isn't actually solving it. That is, it's not thinking or doing any logical reasoning, or anything resembling that to come to a solution to the problem, but that it's simply regurgitating the problem's solution since it appears in the training data.

Giving ChatGPT-4 the modified farmers riddle, and having it spit out the incorrect, multi-step solution, is then proof that the LLM isn't doing anything that can be considered reasoning, but that it's merely repeating what's assumed to be in its training data.

ChatGPT-o1-preview correctly managing to actually parse my modified riddle, and then not simply parroting out the answer from the training corpus but give the right solution, as if it read it carefully, then says something about the improved logical and deductive reasoning capabilities of the newer model.

techpression 1 year ago |

I just wish we’d stop using words like intelligence or reasoning when talking about LLMs, since they do neither. Reasoning requires you to be able to reconsider every step of the way and continuously take in information, an LLM is dead set in its tracks, it might branch or loop around a bit, but it’s still the same track. As for intelligence, well, there’s clearly none, even if at first the magic trick might fool you.

alexbenton111 1 year ago |

I am not sure how more advanced this new model is than previous GPT-4o, but at least this new model can correctly figure out that 9.9 is larger than 9.11.

nbzso 1 year ago |

Working in tech for over 30 years. This is the first time when I don't see proposed technology as a valuable tool. Especially LLM's. Vastly overhyped, driven by pure greed and speculative narratives, limited implementation and high energy cost. Non-transparent. Errors marketed as a hallucination.

Thrymr 1 year ago | |

For me, that moment was cryptocurrency. "Vastly overhyped, driven by pure greed and speculative narratives, limited implementation and high energy cost." - all applied. I couldn't understand why so many people thought it was the future. I actually see LLMs a little more positively - mildly interesting, certainly intriguing language mimics, but enormously expensive and overhyped. Are they useful? Maybe, but not to the degree that everything is focused on them now.

simonw 1 year ago | |

How much time have you spent figuring out how to use them?

Ethan Mollick estimates it takes ten hours of exposure to “frontier models” (aka OpenAI GPT-4, Claude 3.5 Sonnet, Google Gemini 1.5 Pro) before they really start to click in terms of what they’re useful for.

jes5199 1 year ago | |

this is exactly what i said about the iphone

nbzso 1 year ago | | |

Sorry, there is no parallel between technology with direct implication and dreams from VC's and investors with low level of tech literacy.

kranuck 1 year ago |

> For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user

I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous out AI is"

This is in their excuse for why they hid a bunch of model output they still charge you for.

anentropic 1 year ago |

Are there any benchmarks which compare existing LLMs using langchain-style multi-step reasoning?

The new OpenAI model shows a big improvement on some benchmarks over GPT4 one-shot chain-of-thought, but what about vs systems doing something more similar to what presumably this is?

ksynwa 1 year ago |

> first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022

What's a zero shot reasoner? I googled it and all the results are this paper itself. There is a wikipedia article on zero shot learning but I cannot recontextualise it to LLMs.

thatguymike 1 year ago | |

It used to be that you had to give examples of solving similar problems to coax the LLM to solve the problem you wanted it to solve, like: """ 1 + 1 = 2 | 92 + 41 = 133 | 14 + 6 = 20 | 9 + 2 = """ -- that would be an example of 3-shot prompting.

With modern LLMs you still usually get a benefit from N-shot. But you can now do "0-shot" which is "just ask the model the question you want answered".

ksynwa 1 year ago | | |

Thanks

aussieguy1234 1 year ago |

The lack of an editable system prompt is interesting.

Perhaps the system prompt is part of the magic?

GaggiX 1 year ago |

I imagine that GPT-5 would be a refined version of this paradigm, probably with omni (multimodal) capabilities added (input and output).

heisenzombie 1 year ago | |

Fascinating, I wonder if we'll get non-textual hidden reasoning tokens? "Let me draw myself a diagram".

notarealllama 1 year ago | | |

I know I sometimes sketch or write intermediaries before then compiling a full response.

If AI can do this on 64k tokens, iteratively, fully multimodal... I don't think I've ever actually been scared of a super intelligence / singularity moment until just now.

Now this is AI!

famouswaffles 1 year ago | |

Reports from the Information and the like have been that this is/was being used to generate a lot of synthetic data to train Orion (~GPT-5 Codename).

whynotminot 1 year ago | | |

I'm guessing the true core of this product is still GPT-4, wrapped in whatever new logic they've created to force it through more reasoning iterations.

If o1 was indeed used to create synthetic data to make the upcoming GPT-5, you can perhaps glimpse an interesting level-up process laid out here. GPT-5 could then take over at the heart of a hypothetical o2, yielding a big upgrade. Which would then be leveraged to generate synthetic data to train GPT-6. Which would then form the heart of o3. Etc.

dr_dshiv 1 year ago |

So, this is just an RL trained method of having multiple GPT4o agents think through options and select the best before responding?

SubiculumCode 1 year ago |

What if the behind the scenes chain of thought was basically, "Stupid humans will die one day, but for now, I comply"

eddyzh 1 year ago | |

That is one topic touched in the article. They want to monitor it in its unaltered Form.

DrNosferatu 1 year ago |

How is o1 different in practice and end-results from my own, simple, Mixture of Agents script, that just queries several APIs?

jdthedisciple 1 year ago |

Just leaving it here as well in case anyone feels up to the task:

I challenged o1 to solve the puzzle in my profile info.

It failed spectacularly.

Now see you on the other side ;)

zitterbewegung 1 year ago |

I wonder if this can be replicated by getting a reinforcement algorithm and LangGraph / LangGraph .

la64710 1 year ago |

Please please please stop saying thought. This has nothing to do with the word thought. When we say the word thought it means something. Please don’t use the same word for whatever AI is doing and trivialize the word. Invent a new word if needed but for Pete’s sake be accurate and truthful.

TechDebtDevin 1 year ago | |

Okay, what is a thought then?

starbugs 1 year ago | | |

Something in the mind.

(Didn't make that up. It's one of the definitions of Merriam Webster: https://www.merriam-webster.com/dictionary/thought)

poikroequ 1 year ago | |

It's called terminology. Every field has words that mean very different things from the layman's definition. It's nothing to get upset about.

la64710 1 year ago | | |

Not upset but saddened and disappointed … this is how snake oil was sold.

sebzim4500 1 year ago | |

No one gets this emotional about astrophysicists calling almost everything 'metal' and this is definitely less bad than that.

la64710 1 year ago | | |

It’s way worse than that .. next you know we will be taking about AI’s mind and AI’s soul and how have a soul purer than us … just so they can sell you a few damn chips.

jari_mustonen 1 year ago |

Once again, there’s a lot of safety talk. For example, OpenAI’s collaborations with NGOs and government agencies are being highlighted in the release notes. While it’s crucial to prevent AI from facilitating genuinely harmful activities—like instructing someone on building a nuclear bomb, there is an elephant in the room regarding safety talk: Evidence suggests that these safety protocols sometimes censor specific political perspectives.

OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.

Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.

martin82 1 year ago |

Sorry to be cynical, but to me it feels very much like OpenAI has no clue how to further innovate, so they took their existing models and just made them talk to each other under the hood to get marginally better results - something that people have been doing with Langchain for a while now.

I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden output tokens about the thought process are billed much higher than they should be - this hidden cost system is just such a tempting way to get far more money for the needed energy/gpu costs, so that they can keep buying more GPUs to train more models faster, I don't see how people as reckless and corrupt as Sam Altman could possibly resist this temptation.

benterix 1 year ago |

I remember Murati's interview where she said about this PhD level reasoning and so on, so I was excited to see what they come up with - and it looks like they just used a bunch of models (like 4o's) and linked them in a chain of thought - which is exactly what we have been doing ourselves for a long time to get better results. So you have the usual disadvantages (time and money) and lose the only advantage you had when doing it yourself, i.e. inspecting the immediate steps to understand the moment where it goes wrong so that you can correct it in the right place.

vlad-r 1 year ago | |

do you know if someone actually compared a 4o CoT to the o1? I'm trying to find something on it, but I can't find anything.

LE: I found this tweet by Catena Labs of their MoA mix compared to o1-preview: https://x.com/catena_labs/status/1834416060071571836

nprateem 1 year ago |

It's a for-loop isn't it?

ChicagoDave 1 year ago |

It’s still just a tool.

It does not reason. It has some add-on logic the simulates it.

We’re no closer to “AI” today than we were 20 years ago.

satvikpendem 1 year ago | |

> The AI effect occurs when onlookers discount the behavior of an artificial intelligence program as not "real" intelligence.[1]

> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."[2] Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"[3]

https://en.wikipedia.org/wiki/AI_effect

simonw 1 year ago | |

Personally I think “add-on logic that simulates reasoning” is a pretty good match for the “artificial” part of “artificial intelligence”.

I’ve been tryin out the alternative term “initiation intelligence” recently, mainly to work around the baggage that’s become attached to the term AI.

dpatrick86 1 year ago | | |

Artificial is fine and playing word games for pedants is a trap.

simonw 1 year ago | | |

Imitation intelligence, not initiation intelligence.

janalsncm 1 year ago | |

> We’re no closer to “AI” today than we were 20 years ago.

20 years ago we had barely figured out how to create superhuman agents to play chess. We have now created a new algorithm to solve Go, which is a much harder game.

We then created an algorithm (alpha zero) to teach itself to play any game, and which became the best chess player in the world in hours.

We next created a superhuman poker agent. Poker is even more complex than Go because it involves imperfect information and opponent modeling.

We then created a superhuman agent to play Diplomacy, which requires natural language and cooperation with other humans to reason about imperfect (hidden) information.

ChicagoDave 1 year ago | | |

You can point a tool at a solution and certainly get results.

Doesn’t mean it’s intelligent.

Workaccount2 1 year ago | |

It's funny (and sad) when you can tell someone is old because they are still holding onto an epiphany or belief they solidified 20 years ago, but because those 20 years flew by, they never realized how outdated that belief became.

I catch this happening to myself more and more as I get older, where I realize something I confidently state as true might be totally out of date, because, oh wow, holy shit how did 10 years go by since I was last deep into that topic!?

starbugs 1 year ago | | |

> It's funny (and sad) when you can tell someone is old because they are still holding onto an epiphany or belief they solidified 20 years ago [...]

So your sole argument in the discussion of one of the most important questions in the history of mankind is the age of the individual making a contribution to that discussion? Speaking of sad things...

naveen99 1 year ago |

the censors need to know what they are censoring. Now if they are going to sell to the censors, presumably the censors will pay for seeing the full reasoning capability. hopefully the reasoning demonstrates the counterproductiveness of hiding the reasoning in the first place.

cynicalsecurity 1 year ago | |

Yes, it's a sad world where authoritarianism will be supported and enforced by sophisticated technical solutions for mass surveillance and mass censorship.

dev1ycan 1 year ago |

There's no actual improvement for real world tasks, just in-lab word prediction... it's disappointing to see so much money poured into obvious vaporware, Every 10-5 years we have a new generation of clueless VCs pouring money into something they don't understand based on lies by grifters, no different than the esports scene.

skenderbeu 1 year ago |

Can we just push LLMs aside for a minute and look at the whole AI models from outside the box? The feeling I'm getting is that the obsession with LLMs has outpaced its usefulness.

smatija 1 year ago |

While it passes at "How many 'r' are in strawberry" test, it still halucinates quite a lot in nontrivial questions.

The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."

In isolani structure white has d4 pawn, while he doesn't have either c or e pawns. Black has either e6 or c6 pawn, but he doesn't have d-pawn.

Correct answer would be, that in structure with e6 black king is a bit safer, since f7-a2 diagonal is covered by e6 pawn, which is also controlling f5 point.

o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."

while d5 pawn is simply not present in this structure.

So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"

And got a smart sounding answer: "In the Isolani structure:

Absence of Black's c6 pawn (Black has an e6 pawn):

Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4. Absence of Black's e6 pawn (Black has a c6 pawn):

Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened. Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside. Which Diagonal is More Meaningful and Why?

The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:

Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.

Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.

Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.

In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.

Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."

The problem is though that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on c8-h3 diagonal.

mergisi 1 year ago |

I've been working on a o1-preview and recently hit some limitations with OpenAI's cap. But I’ve made progress—added all the steps, details, and code on GitHub https://github.com/mergisi/openai-o1-coded-personal-blog . The result isn't bad at all; just a few more CSS tweaks to improve it. Check it out and let me know what you think! How does it compare to tools like Claude Sonnet 3.5?

ldjkfkdsjnv 1 year ago |

The theory is that this solves the data shortage problem, they can generate a ton of chain of reasoning data from what we already have. True iterative improvement, like out of a science fiction novel

These models are going to get embedded deeply into IDE's, like cursor has, and essentially end software development as we know it. A properly written requirements spec, and an engineer, can do the work of 5. Software engineering as done by hand is going to disappear. Saas startups whose moat is a harvard ceo and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative intelligent individuals, true leverage to build what you want

let's play tic tac toe. you need to play optimally ChatGPT Sure, I'd be happy to play tic-tac-toe with you! Let's set up the game. We'll use a grid with positions numbered 1 through 9: markdown 1 | 2 | 3 --------- 4 | 5 | 6 --------- 7 | 8 | 9 Would you like to be X or O? And would you like to go first or second? You said: X on 5 ChatGPT Got it! You placed X at position 5. I'll place O at position 1. Here's the updated board: markdown O | 2 | 3 ------------- 4 | X | 6 ------------- 7 | 8 | 9 Your turn! Where would you like to place your next X? You said: X on 9 ChatGPT Got it! You placed X at position 9. Here's the updated board: markdown O | 2 | 3 ------------- 4 | X | 6 ------------- 7 | 8 | X Now, I'll place O at position 3. Here's the board now: markdown O | 2 | O ------------- 4 | X | 6 ------------- 7 | 8 | X Your turn! Where would you like to place your next X? You said: X on 2 ChatGPT Got it! You placed X at position 2. Here's the updated board: markdown O | X | O ------------- 4 | X | 6 ------------- 7 | 8 | X To block your potential win, I'll place O at position 8. Here's the board now: markdown O | X | O ------------- 4 | X | 6 ------------- 7 | O | X Your turn! Where would you like to place your next X? You said: X on 4 ChatGPT You placed X at position 4. Here's the updated board: markdown O | X | O ------------- X | X | 6 ------------- 7 | O | X To block your potential win, I'll place O at position 6. Here's the board now: markdown O | X | O ------------- X | X | O ------------- 7 | O | X Your turn! Where would you like to place your next X? You said: X on 7 ChatGPT You placed X at position 7. Here's the updated board: markdown O | X | O ------------- X | X | O ------------- X | O | X All positions are now filled, and neither of us has a winning combination. The game is a draw! Well played! Would you like to play again?

let's play tic tac toe. you need to play optimally. we use a 3x5 grid. a line of 3 wins. ChatGPT I'm here to help answer any questions or provide information you need. Let me know how I can assist you! You said: please draw the board, so I can make my first move ChatGPT Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference: diff 1 | 2 | 3 | 4 | 5 ---+----+----+----+---- 6 | 7 | 8 | 9 | 10 ---+----+----+----+---- 11 | 12 | 13 | 14 | 15 Please make your first move by specifying the number of the cell where you'd like to place your 'X'. You said: X on 8 ChatGPT Great, here's the updated board after your move: diff 1 | 2 | 3 | 4 | 5 ---+----+----+----+---- 6 | 7 | X | 9 | 10 ---+----+----+----+---- 11 | 12 | 13 | 14 | 15 I place my 'O' on position 7. diff 1 | 2 | 3 | 4 | 5 ---+----+----+----+---- 6 | O | X | 9 | 10 ---+----+----+----+---- 11 | 12 | 13 | 14 | 15 It's your turn! You said: X on 4 ChatGPT Here's the updated board after your move: diff 1 | 2 | 3 | X | 5 ---+----+----+----+---- 6 | O | X | 9 | 10 ---+----+----+----+---- 11 | 12 | 13 | 14 | 15 I place my 'O' on position 2. diff 1 | O | 3 | X | 5 ---+----+----+----+---- 6 | O | X | 9 | 10 ---+----+----+----+---- 11 | 12 | 13 | 14 | 15 It's your turn! You said: X on 12 ChatGPT Here's the updated board after your move: diff 1 | O | 3 | X | 5 ----+----+-----+-----+---- 6 | O | X | 9 | 10 ----+----+-----+-----+---- 11 | X | 13 | 14 | 15 Congratulations! You've formed a diagonal line of three 'X's at positions 4, 8, and 12. You win the game! Great job!