Notes on OpenAI's new o1 chain-of-thought models(simonwillison.net) |
Notes on OpenAI's new o1 chain-of-thought models(simonwillison.net) |
It does not reason. It has some add-on logic the simulates it.
We’re no closer to “AI” today than we were 20 years ago.
> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."[2] Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"[3]
I’ve been tryin out the alternative term “initiation intelligence” recently, mainly to work around the baggage that’s become attached to the term AI.
20 years ago we had barely figured out how to create superhuman agents to play chess. We have now created a new algorithm to solve Go, which is a much harder game.
We then created an algorithm (alpha zero) to teach itself to play any game, and which became the best chess player in the world in hours.
We next created a superhuman poker agent. Poker is even more complex than Go because it involves imperfect information and opponent modeling.
We then created a superhuman agent to play Diplomacy, which requires natural language and cooperation with other humans to reason about imperfect (hidden) information.
Doesn’t mean it’s intelligent.
I catch this happening to myself more and more as I get older, where I realize something I confidently state as true might be totally out of date, because, oh wow, holy shit how did 10 years go by since I was last deep into that topic!?
So your sole argument in the discussion of one of the most important questions in the history of mankind is the age of the individual making a contribution to that discussion? Speaking of sad things...
The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."
In isolani structure white has d4 pawn, while he doesn't have either c or e pawns. Black has either e6 or c6 pawn, but he doesn't have d-pawn.
Correct answer would be, that in structure with e6 black king is a bit safer, since f7-a2 diagonal is covered by e6 pawn, which is also controlling f5 point.
o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."
while d5 pawn is simply not present in this structure.
So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"
And got a smart sounding answer: "In the Isolani structure:
Absence of Black's c6 pawn (Black has an e6 pawn):
Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4. Absence of Black's e6 pawn (Black has a c6 pawn):
Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened. Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside. Which Diagonal is More Meaningful and Why?
The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:
Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.
Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.
Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.
In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.
Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."
The problem is though that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on c8-h3 diagonal.
These models are going to get embedded deeply into IDE's, like cursor has, and essentially end software development as we know it. A properly written requirements spec, and an engineer, can do the work of 5. Software engineering as done by hand is going to disappear. Saas startups whose moat is a harvard ceo and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative intelligent individuals, true leverage to build what you want
I do not think this will scale. GPT o1 is presumably good for bootstrapping a project using tools that the engineer is not familiar with. The model will struggle to update a sizable codebase, however, with dependencies between the files.
Secondly, no matter the size of the codebase and no matter the model used, the engineer still has to review every single line before incorporating it into the project. Only a competent engineer can review code effectively.
This is going to rapidly happen. All we need are a few more model releases, not even a step function improvement
Access to capital and pedigree are still going to be a big plus.
Sure, this tool will improve the productivity of sw engineers, but so did the compiler which came 50 years back.
Stuck in loops, correct their mistakes with worse mistakes, hallucinating things that don’t exist and being unable to correct.
Working on my own, I have the confidence that I know I can make incremental forward progress on a problem. That’s much preferable.
AI can only do what it knows and what it’s been programmed to do.
> Most of the time, I just accept its changes.
This speaks more about the problems at FAANG, other companies, etc than AI vs a human developer. And AI isn't the real fix.
Are we just repeating things 100x a day or is it still so chaotic and immature? Or are we implying that AI is at a point where it's writing Google Spanner from scratch and you're able to review and confirm it passes transactional tests?
Right - "most of my work can be done by Sonnet 3.5" doesn't exactly conjure up an image of a high level or challenging job. It seems the challenge with FAANG companies is getting hired, not the actual work most people do there.
None of the above.
This isn't about how "smart" AI is.
1. Let's assume it was smart and can update a field spanning 1000s of microservices to deliver this new feature. Is this really something you should celebrate? I'd say no. At this point there should have been better tooling and infrastructure in place.
2. Is there really infinite CRUD to add after >10 years? In the same organization where you need >100s of developers all the time? 1s where you'd ignore code reviews and "just accept its changes"? Whether I write code or my colleagues etc I'd have a meaningful discussion about the proposed changes, the impacts and most likely suggest changes because nothing is perfect.
So again, it's about the environment, the organization or at least this individual case where coding isn't just about adding some lines to a file. And that's with AI or not.
I can easily make Claude freak out and run into limits. Claude is amazing but it only works at the abstraction level you ask of it, so if you ask it to write code to solve a problem it'll only solve that immediate problem, it doesn't have awareness of any larger refactorings or design improvements that could be made to improve what solution is even possible.
If you're writing something specific to your particular problem, or thinking through how to structure your data, or even working on something tough to describe in words like UI design, it probably is easier to just code it yourself in most high-level languages. On the other hand, if you're just trying to get a framework or library to do something and you don't want to spend a bunch of time reading the docs to remember the special incantations needed to just make it do the thing you already know it can do, the AI speeds things up considerably.
How on earth are you conveying your intent to the model? Or is your intent so CRUDdy that it doesn't need to be conveyed?
> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)
I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.
It's poor performance on benchmarks drives my skepticism of LLM benchmarking in general. I trust my feel for the models much more, and my feel was that 0314 was great.
The one thing that 0314 doesn't do well are the tricks like structured output and tool calling which makes it a less useful agentic type of tool, but from a pure thinking perspective, I think it's the best.
What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.
Bespoke hand-crafted models/agents can never compete with ones that can just be scaled and learn for themselves.
OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.
There are at least three major built-in biases in GPT-O1:
- specific reasoning heuristics hard coded in the RL decision making
- the architectural split between pre-trained LLM and what appears to be a symbolic agent calling it
- the reliance on one-time SGD driven learning (common to all these pre-trained transformers)
IMO search (reasoning) should be an emergent behavior of a predictive architecture capable of continual learning - chained what-if prediction.
>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!
So the text says reasoning and output tokens are the same, as in you pay for both. But does the increase say that it can actually do more, or does it just mean it is able to output more text?
Because by now I am just bored of GPT4o output, because I don't have the time to read through a multi-paragraph text that explains to me stuff that I already know, when I only want to have a short, technical answer. But maybe that's just what it can't do, give exact answers. I am still not convinced by AI.
Until recently most models capped out at around 4,000 tokens of output, even as they grew to handle 100,000 or even a million input tokens.
For most use-cases this is completely fine - but there are some edge-cases that I care about. One is translation - if you feed in a 100,000 token document in English and ask for it to be translated to German you want about 100,000 tokens of output, rather than a summary.
The second is structured data extraction: I like being able to feed in large quantities of unstructured text (or images) and get back structured JSON/CSV. This can be limited by low output token counts.
Sure, that is a contrived question, but I expect an "AI" to be capable pf obtaining every movie, watching them frame-by-frame, and getting an accurate count. All in a few seconds.
Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.
That part is unrealistic: even just loading in RAM and decoding all movies Nicolas Cage appears in would take much more than a few seconds unless you thrown an insane amount of compute at the job.
That being said, the current LLM tech is probably enough to help you implement a program that parses IMDB to get the list of all Nicolas Cage movie, then download it on thepiratebay and then implement the blink count you're looking for. And you'd likely get the result in just a couple hours.
I don’t think these are “moving the goalposts” examples, they are things that an actual intelligence capable of passing a PhD physics exam should be able to do.
https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...
https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
It solved the correct version fine: https://chatgpt.com/share/66e3f9bb-632c-8005-9c95-142424e396...
1: https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem
If I give ChatGPT-4 the original farmer riddle, it "solves" it just fine, but it's assumed that it isn't actually solving it. That is, it's not thinking or doing any logical reasoning, or anything resembling that to come to a solution to the problem, but that it's simply regurgitating the problem's solution since it appears in the training data.
Giving ChatGPT-4 the modified farmers riddle, and having it spit out the incorrect, multi-step solution, is then proof that the LLM isn't doing anything that can be considered reasoning, but that it's merely repeating what's assumed to be in its training data.
ChatGPT-o1-preview correctly managing to actually parse my modified riddle, and then not simply parroting out the answer from the training corpus but give the right solution, as if it read it carefully, then says something about the improved logical and deductive reasoning capabilities of the newer model.
Ethan Mollick estimates it takes ten hours of exposure to “frontier models” (aka OpenAI GPT-4, Claude 3.5 Sonnet, Google Gemini 1.5 Pro) before they really start to click in terms of what they’re useful for.
I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous out AI is"
This is in their excuse for why they hid a bunch of model output they still charge you for.
The new OpenAI model shows a big improvement on some benchmarks over GPT4 one-shot chain-of-thought, but what about vs systems doing something more similar to what presumably this is?
What's a zero shot reasoner? I googled it and all the results are this paper itself. There is a wikipedia article on zero shot learning but I cannot recontextualise it to LLMs.
With modern LLMs you still usually get a benefit from N-shot. But you can now do "0-shot" which is "just ask the model the question you want answered".
Perhaps the system prompt is part of the magic?
If AI can do this on 64k tokens, iteratively, fully multimodal... I don't think I've ever actually been scared of a super intelligence / singularity moment until just now.
Now this is AI!
If o1 was indeed used to create synthetic data to make the upcoming GPT-5, you can perhaps glimpse an interesting level-up process laid out here. GPT-5 could then take over at the heart of a hypothetical o2, yielding a big upgrade. Which would then be leveraged to generate synthetic data to train GPT-6. Which would then form the heart of o3. Etc.
I challenged o1 to solve the puzzle in my profile info.
It failed spectacularly.
Now see you on the other side ;)
(Didn't make that up. It's one of the definitions of Merriam Webster: https://www.merriam-webster.com/dictionary/thought)
OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.
Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.
I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden output tokens about the thought process are billed much higher than they should be - this hidden cost system is just such a tempting way to get far more money for the needed energy/gpu costs, so that they can keep buying more GPUs to train more models faster, I don't see how people as reckless and corrupt as Sam Altman could possibly resist this temptation.
LE: I found this tweet by Catena Labs of their MoA mix compared to o1-preview: https://x.com/catena_labs/status/1834416060071571836
Maybe I'm wrong and we are well on our way to AI tools for this, but right now if I tell any of the current generation of image models to do something like "rotate object 70 degrees, tilt camera down 20 degrees and re-render" then what comes out is never even approximately close.
There is no definition on the linked page that isn't self-referential or even more inapplicable in the context of transformer based LLMs?
* "an individual act or product of thinking": This is self-referential.
* "a developed intention or plan": The model doesn't have an intention or plan - it predicts the probability of text based on a user prompt mixed with corporate-inflicted bias.
* "the intellectual product or the organized views and principles of a period, place, group, or individual": GPTs are not periods, places, groups or individuals. I don't doubt that the input to the model is based on thought, but this is not the same as the model thinking. Maybe this is where your confusion originates from?
Bonus: I asked GPT4o "Can you think?" (sorry I won't pay for o1). It was trained to state itself that it can't think:
"I can process information, analyze patterns, and provide responses based on data and logical reasoning. However, I don't "think" in the human sense—I don't have consciousness, emotions, or subjective experiences. My abilities are limited to generating text, solving problems, and assisting with tasks based on the vast amount of information I've been trained on.
In essence, I simulate reasoning, but I don't have awareness or independent thought."
I would add that I strongly doubt that it is capable of even simulating "reasoning" as is claimed by the model "authors" (not sure if we can say they are authors since most of the model isn't their IP). And I can prove that the models up to 4o aren't generally able to solve problems.
The question really is whether a group of people is attempting to anthropomorphize a clever matrix processor to maximize hype and sales. You'll have to answer that one for yourself.
> an individual act or product of thinking
Emphasis on "product of thinking". Though you'll probably get all upset by the use of the word "thinking". However, people have applied the word "thinking" to computers for decades. When a computer is busy or loading, they might say "it's thinking."
> a developed intention or plan
You could certainly ask this model to write up a plan for something.
> reasoning power
Whether you like it or not, these LLMs do have some limited ability to reason. Far from human level reasoning, and they VERY frequently make mistakes/hallucinations and misunderstand, but these models have proven they can reason about things they weren't specifically trained on. For example, I remember seeing one person made up a new programming language, never existed before, and they were able to discuss it with an LLM.
No, they're not conscious. No, they don't have minds. But we need to rethink what it means for something to be "intelligent", or what it means for something to "reason", that doesn't require a conscious mind.
For the record, I find LLM technology fascinating, but I also see how flawed it is, how over hyped it is, that it is mostly a stochastic parrot, and that currently it's greatest use is as a grand scale bullshit misinformation generator. I use chatgpt sparingly, only when I'm confident it may actually give me an accurate answer. I'm not here to praise chatbots or anything, but I also don't have a blind hatred for the technology, nor do I immediately reject everything labeled as "AI".
If your definition of AI has become “superhuman intelligence” then it's definitely moving goalposts. And regaarding my initial remark, AI isn't going to do “faster than the speed of light” MPEG decoding ever, all physical limits apply to it.
This simply isn't a good faith take, because you're straw-manning the implementation of the query that the original poster put forward. They aren't asserting that the AI would need to do supernatural super-real time decoding of MPEG encoded files. What if the AI had already seen them? And was able to encode in the typically-compressed way LLMs do the information it needs to answer questions like that without re-decoding the original movies?
This raises many valid questions on topics like the structuring of data within an LLM, how large LLMs may eventually become, what systems should orbit around the LLM (does it make more sense for LLMs to watch YouTube videos, or have already watched YouTube videos?).
My definition of AI is the same definition that Nick Bostrom talks about in his 2014 book Superintelligence. There's no moving goalposts. Goal posts have been set in cement since 2014. Achieving human-level parity has obviously only been a "goal" insomuch as its a 10 millisecond stop on the gradient toward superintelligence. OpenAI is not worth $150 billion dollars because it purports to be building a human-and-nothing-more in a box.
No, they literally said the AI would watch every frame on demand:
> I expect an "AI" to be capable pf obtaining every movie, watching them frame-by-frame, and getting an accurate count.
Talk about bad faith.
> What if the AI had already seen them? And was able to encode in the typically-compressed way LLMs do the information it needs to answer questions like that
LLM are encoding (in a very lossy way) “important” details, that's what allow them to compress their knowledge in little amount of space with respect to the input. But if you're asking completely random questions like this there's no way an LLM will contain such an info, because storing all the random trivia like that is going to be wasting an enormous amount of space.
> There's no moving goalposts. Goal posts have been set in cement since 2014.
Wait until you realize that AI is something much older than 2014… Also, note how the book you're quoting isn't called “artificial intelligence”.
> OpenAI is not worth $150 billion dollars because it purports to be building a human-and-nothing-more in a box.
And yet there are many companies with much higher valuation with goals much more mundane than this. OpenAI has a hundred billion dollar valuation because investors believe it can make money, not matter what it technologically achieves in order to do so.
It means that the definition of "thought" from Webster as "an individual act or product of thinking" is referring to the word being defined (thought -> thinking) and thus is self-referential. I said in my prior response already that if you refer to the input of the model being a "product of thinking", then I agree, but that doesn't give the model an ability to think. It just means that its input has been thought up by humans.
> When a computer is busy or loading, they might say "it's thinking."
Which I hope was never meant to be a serious claim that a computer would really be thinking in those cases.
> You could certainly ask this model to write up a plan for something.
This is not the same thing as planning. Because it's an LLM, if you ask it to write up a plan, it will do its thing and predict the next series of words most probable based on its training corpus. This is not the same as actively planning something with an intention of achieving a goal. It's basically reciting plans that exist in its training set adapted to the prompt, which can look convincing to a certain degree if you are lucky.
> Whether you like it or not, these LLMs do have some limited ability to reason.
While this is an ongoing discussion, there are various papers that make good attempts at proving the opposite. If you think about it, LLMs (before the trick applied in the o1 model) cannot have any reasoning ability since the processing time for each token is constant. Whether adding more internal "reasoning" tokens is going to change anything about this, I am not sure anyone can say for sure at the moment since the model is not open to inspection, but I think there are many pointers suggesting it's rather improbable. The most prominent being the fact that LLMs come with a > 0 chance of the next word predicted being wrong, thus real reasoning is not possible since there is no way to reliably check for errors (hallucination). Did you ever get "I don't know." as a response from an LLM? May that be because it cannot reason and instead just predicts the next word based on probabilities inferred from the training corpus (which for obvious reasons doesn't include what the model doesn't "know" and reasoning would be required to infer the fact that it doesn't know something)?
> I'm not here to praise chatbots or anything, but I also don't have a blind hatred for the technology, nor do I immediately reject everything labeled as "AI".
I hope I didn't come across as having "blind hatred" for anything. I think it's important to understand what transformer based LLMs are actually capable of and what they are not. Anthropomorphizing technology is in my estimation a slippery slope. Calling an LLM a "being", "thinking" or "reasoning" are only some examples of what "sales optimizing" anthropomorphization could look like. This comes not only with the danger of you investing into the wrong thing, but also of making wrong decisions that could have significant consequences for your future career and life in general. Last but not least, it might be detrimental to the development of future useful AI (as in "improving our lives") since it may lead to deciders in politics drawing the wrong conclusions in terms of regulation and so on.
While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.
A lot of people use LLMs as a search engine. It makes sense - it's basically a lossy compressed database of everything its ever read, and it generates output that is statistically likely - varying degrees of likeliness depending on the temperature, as well as how many times the particular weights your prompt ends up activating.
The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows english_. It knows english at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output. It's not _just_ an output engine. It's an engine that outputs.
Asking it about nuanced details in the corpus of data it has read won't give you good output unless it read a bunch of it.
On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.
Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.
With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive to a common understanding. With a LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.
This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.
People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.
This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.
You are falling into the trap that everyone does. In anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.
Treat it as text completion and you can get more accurate answers.
That's the crux of the problem. Why and who would treat it as an intern? It might cost you more in explaining and dealing with it than not using it.
The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.
i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.
chatgpt is fickle daily. most days its on point. some days its wearing a bicycle helmet and licking windows. kinda sucks i cant just zone out and daydream while working. gotta be checking replies for when the wheels fall off the convo.
I'm not up to date with these things because I haven't found them useful. But with what you said, and previous limitations in how much data they can retain essentially makes them pretty darn useless for that task.
Great learning tool on common subjects you don't know, such as learning a new programming-language. Also great for inspiration etc. But that's pretty much it?
Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.
There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai) and it's score as a swe-bench agent is barely above gpt-4o (model card).
It gets really hard problems better, but it's unclear that matters all that much.
> A lot of people use LLMs as a search engine.
Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.
I couldn't agree more, this is exactly the strength of LLMs that what we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastic. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English, and the syntax of programming languages, instead of the source of the knowledge to respond with. Stop anthropomorphizing LLMs, play to it's strengths instead.
In other words it might hallucinate a API but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.
I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.
1) Tell the LLM that it is a method actor, 2) Tell the method actor they are playing the role of a subject matter expert, 3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.
Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.
I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.
Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.
With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.
Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.
This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here
> Provide it data, give it a task, and let it surprise you with its output.
And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.
> hallucinates non-existing libraries and functions
This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.
It’s still changing things to be several versions old from its innate kb pattern-matching or whatever you want to call it. I find that pretty disappointing.
Just like copilot and gpt4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.sytem())` and other pre-schedule/fanciful APIs—things it should have in context.
I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”
PS: and by the time I’d done those experiments, I ran out of preview, resets 5 days from now. D’oh
GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).
So mostly useless then?
EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.
Just ask ChatGPT
How many Rs are in strawberry?
There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.
I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.
Fair enough. The GPT-4 line aren't meant to be search engines or encyclopedias. This is still a useful update though.
You're using a calculator as a search engine.
It doesn't even know mildly obsecure facts that are on the internet.
For example last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statwmnt, and threw out some convincing looking code.
You can't, it's impossible. It wàa completely wrong. When I told that this, it told me I was right, and proceeded to give me code that was even more wrong.
This is an obscure, but well documented, part of the spec.
So it's not about facts that aren't on the internet, it's just bad at facts fullstop.
What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is not always a good thing with the way the language it uses to speak is so confident.
If you want to fuck with AI models as a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but chatgpt/codepilot/etc. will start spewing out that nonsense as if it's fact.
As for non-proframming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.
That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.
After that you switch to Claude Soñnet and after sometime it also gets stuck.
Problem with LLM is that they are not aware of libraries.
I've fed them library version, using requirements.txt, python version I am using etc...
They still make mistakes and try to use methods which do not exist.
Where to go from here? At this point I manually pull the library version I am using and go to its docs, I generate a page which uses the this library correctly (then I feed that example into LLM)
Using this approach works. Now I just need to automate it so that I don't have to manually find the library, create specific example which uses the methods I need in my code!
Directly feeding the docs isn't working well either.
That seems to eliminate a lot of the issues, though it's not a seamless experience, and it adds another step of having to put the library docs in a text file.
Alternatively, cursor can fetch a web page, so if there's a good page of docs you can bring that in by @ the web page.
Eventually, I could imagine LLMs automatically creating library text doc files to include when the LLM is using them to avoid some of these problems.
It could also solve some of the issues of their shaky understanding of newer frameworks like SvelteKit.
I’m in the “probabilistic token generators aren’t intelligence” camp so I don’t actually believe in AGI, but I’ll be honest the never ending rumors / chatter almost got to me
Remember, this is the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for
Maybe this has been extensively discussed before, but since I've lived under a rock: which parts of intelligence do you think are not representable as conditional probability distributions?
This never happened. No one said it happened.
"the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for"
The Information reported someone at a meeting suggested this for future models, not specifically Strawberry, and that it would probably not actually be that high.
In public coding AI comparison tests, results showed 4o scoring around 35%, o1-preview scoring ~50% and o1 scoring ~85%.
o1 is not yet released, but has been run through many comparison tests with public results posted.
The system doesn’t become useless if it takes 2 tries instead of 1 to get it right
Still saves an incredible amount of time vs doing it yourself
It is perfectly possible to have code that runs without errors but gives a wrong answer. And you may not even realise it’s wrong until it bites you in production.
And a few times the amount of time I spent trying to coax a correct answer out of AI trumped any potential savings I could've had
It's like when an LLM gives you a wrong answer and all it takes is "are you sure?" to get it to generate a different answer.
Of course the underlying problem of the model not knowing what it knows or doesn't know persists, so giving it the ability to reflect on what it just blurted out isn't always going to help. It seems the next step is for them to integrate RAG and tool use into this agentic wrapper, which may help in some cases.
Oooh... oohhh!! I just had a thought: By now we're all familiar with the strict JSON output mode capability of these LLMs. That's just a matter of filtering the token probability vector by the output grammar. Only valid tokens are allowed, which guarantees that the output matches the grammar.
But... why just data grammars? Why not the equivalent of "tab-complete"? I wonder how hard it would be to hook up the Language Server Protocol (LSP) as seen in Visual Studio code to an AI and have it only emit syntactically valid code! No more hallucinated functions!
I mean, sure, the semantics can still be incorrect, but not the syntax.
Both abilities are powerful, but they are very different powers.
It's your right to dismiss it, if you want, but if you want to get some value out of it, you should play to it's strengths and not look for things that it fails at as a gotcha.
> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.
Results are "strong" but can't be felt by the user? What does that even mean?
But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.
"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."
But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?
This is the prompt I gave:
simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.
The original crate is trie-hard [1][2] and I forked it and put the models attempts in the fork [3]. I also quickly wrote it up at [4]
[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...
[2] https://github.com/cloudflare/trie-hard
[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts
[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...
1. A LLM (probably a finetuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.
2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.
I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.
But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.
Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.
As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.
As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.
It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.
Kagi LLM benchmarking project:
If you look at Appendix A in the o1 post [1], this becomes quite clear. There's a huge jump in performance in "puzzle" tasks like competitive maths or programming. But the difference on everything else is much less significant, and this evaluation is still focused on reasoning tasks.
The human preference chart [1] also clearly shows that it doesn't feel that much better to use, hence the overall reaction.
Everyone is complaining about exaggerated marketing, and it's true, but if you take the time to read what they wrote beyond the shallow ads, they are being somewhat honest about what this is.
o1 preview gave a much more in depth but completely wrong answer. It took 5 follow ups to get it to recognize that it hallucinated a non-existent law
There is one other wonderful thing about symbolic math, the glorious '=' sign. It's structured everywhere from top-to-bottom, left-to-right, which is amenable to the next token prediction behavior and multi-attention heads of transformer based LLMs.
My guess is that problem statement formation into an equation is as difficult of a problem for these as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), the problems are fairly standard and have permutations of a problem type between the years.
This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably post problem formation. I think that judiciously used, this is a valuable feature.
it seems trivial, but I tried for more than 2 hours in the past to get gpt4 to play tic-tac-toe optimally and failed (CoT prompt,etc.). The result were too many illegal moves and absolutely no optimal strategy.
o1-preview can do it really well [1]
However, when I use a non-standard grid (3x5) it fails to play optimally. But it makes legal moves and it recognized I had won. [2]
My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure, if they just included tic-tac-toe in the training data, or "spatial reasoning" is limited.
[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...
Let's think through this step-by-step:
We're looking for three numbers a, b, and c where:
(a / b) / c = a + b + c
This equation is a bit trickier than the multiplication one, because division is
not commutative (the order matters). We need to be careful about how we
interpret (a / b) / c.
Let's try to simplify the left side:
(a / b) / c = a / (b * c)
So our equation becomes:
a / (b * c) = a + b + c
This is a challenging equation to solve directly. Let's try some numbers:
If we use 1, 2, 3 like before:
1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
Let's try smaller numbers, like fractions:
1/2, 1/3, 1/6
Let's check:
(1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
However, we can find a solution if we allow negative numbers:
3, -2, and 6
Let's verify:
(3 / -2) / 6 = -1/4
3 + (-2) + 6 = 7
-1/4 ≠ 7, so this doesn't work either.
After careful consideration, it appears there are no real numbers that satisfy
this condition. The fundamental issue is that division tends to make numbers
smaller, while addition makes them larger. It's mathematically impossible to
find three numbers where these operations result in the same value.
nuff said> No streaming support, tool usage, batch calls or image inputs either.
I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.
From https://platform.openai.com/docs/guides/reasoning
> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.
I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better from what I got with the old model. TL;DR It felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.
Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.
It felt as unhelpful and bad at instructions as GPT-4o. "I'm sorry, you're absolutely right". It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.
Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential but it's like we're getting into diminishing returns way faster than I had hoped...
I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.
https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...
The super verbose chain-of-reasoning that o1 does seems very well suited to logic puzzles as well, so I expected it to do reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.
This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.
OpenAI is definitely trying to run a hype game to keep the ball rolling. They're burning cash too quickly given their monetization path though, so I think they're going to end up completely in Microsoft's pocket.
[1] https://dspy-docs.vercel.app/docs/tutorials/simplified-balee...
(end of the cipher example)
honestly, chatgpt is now a better coder than i ever was or will be
It nailed the execution. Amazing.
EDIT: Actually I’m not sure I’m correct about that - I’ve seen screenshots that indicate the API model is being charged for here: https://twitter.com/deepaiwriter/status/1834460643228897512
The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.
I'm pretty sure the value they get is the money you pay.
To determine which number is larger among 3.9, 3.10, and 3.11, let's compare << them numerically:
Standard Decimal Interpretation:
3.9 is simply 3.9. 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers. 3.11 remains 3.11. Comparison:
3.9 vs. 3.1 vs. 3.11 3.11 is the largest number. Version Number Interpretation:
In version numbering, each segment separated by a dot is compared as an integer. 3.9 (major version 3, minor version 9) 3.10 (major version 3, minor version 10) 3.11 (major version 3, minor version 11) Comparison:
Minor versions: 11 > 10 > 9 3.11 is the largest version. Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.
...
So IMO level right?
In this context, "3.10" and "3.11" should be interpreted as decimal numbers, not as numbers with more digits.
When comparing:
3.9 3.10 (which is equal to 3.1) 3.11 (which is equal to 3.11) We have:
3.9 is greater than 3.1 (3.10), because 9 is larger than 1. 3.11 is greater than 3.9, because 11 is larger than 9. Thus, 3.11 is the largest of the three numbers.
they gamed AIME by over-training the hell out of it for marketing purposes and called it done.
meanwhile, back-to-basics is broken.
What?
I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.
And while reading the article, it mentioned how NOT to provide irrelevant information to the task at hand via RAG. It appears that the trajectory of these thoughts are extremely sensitive to the initial conditions (prompt + context). One would imagine that if it had the ability to backtrack after reflecting, it would help with divergence, however, it appears that wasn't the case here.
Maybe there is another factor here. Maybe there is some confusion when asking it to plan something and the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? who knows. Interesting stuff though.
One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.
(And looking in this mirror isn't flattering us any.)
I could imagine OpenAI might allow their own vetted tools to be used, but perhaps it will be a while (if ever) before developers are allowed to hook up their own tools. The risks here are substantial. A model fine-tuned to run chain-of-thought that can answer graduate level physics problems at an expert level can probably figure out how to scam your grandma out of her savings too.
This comment makes no sense in the context of what an LLM is. To even say such a thing demonstates a lack of understandting of the domain. What we are doing here is TEXT COMPLETION, no one EVER said anything about being accurate and "true". We are building models that can complete text, what did you think an LLM was, a "truth machine"?
I've tried asking it factual information, and it asserts that it's incorrect but it will definitely hallucinate questions like the above.
You'd think the reasoning would nail that and most of the chain-of-thought systems I've worked on would have fixed this by asking it if the resulting answer was correct.
The human preference is not that good of a proxy measurement: for instance, it can be gamed by making the model more assertive, causing the human error-spotting ability to decrease a lot [0].
So what he's really saying is that non-rigorous human vibe checks (like those LMSys Chatbot Arena is built on, although I love it) won't cut it anymore to evaluate models, because now models are past that point. Just like you can't evaluate how smart a smart person really is in a 2min casual conversation.
and you can absolutely evaluate how smart someone is in a 2min casual conversation. You wont be able to tell how well they are in some niche topic, but %insert something about different flavors of intelligence and how they do not equate do subject matter expertise%
Not every conversation you have with a PhD will make it obvious that that person is a PhD. Someone can be really smart, but if you don't see them in a setting where they can express it, then you'll have no way of fully assessing their intelligence. Similarly, if you only use OAI models with low-demand prompts, you may not be able to tell the difference between a good model and a great one.
It explicitly says "Results on AIME and GPQA are really strong". So I would assume it means it can get (statistically significantly, I assume) better score in AIME and GPQA benchmarks compared to 4o.
But it doesn't feel right. It's unlikely the screwdriver would come first, and then people would go around looking for things to use it with, no?
Because OpenAI needs a steady influx of money, big money. In order to do so, they have to convince the people who are giving them money that they are the best. An objective way to achieve this is by benchmarking. But once you enter this game, you start optimizing for benchmarks.
At the same time, in the real world, Anthropic is following them in huge leaps and for many users Claude 3.5 is already the default tool for daily work.
From a user perspective too, I was a subscriber from the first day of gpt4 until about a month ago. I thought about subscribing for the month to check this out but I am tired of the OpenAI experience.
Where is Sora? Where is the version of chatgpt that responds in real time to your voice? Remember the gpt4 demo that you would draw a website on a napkin?
How about Q* lol. Strawberry/Q*/o1, "it is super dangerous, be very careful!"
Quietly, Anthropic has just kicked their ass without all the hype and I am about to go work in sonnet instead even bothering to check o1 out.
This means it often doesn't provide the answer the user is looking for. In my opinion, it's an alignment problem, people are very presumptuous and leave out a lot of detail in their request. Like the "which is bigger - 9.8 or 9.11? question, if you ask "numerically which is bigger - 9.8 or 9.11?" It gets the correct answer, basically it prioritizes a different meaning for bigger.
> But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine. But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?
Without better questions we can't test and prove that it is getting more intelligent or is just wrong. If it is more intelligent than us it it might provide answers that don't make sense to us but are actually clever, 4d chess as they say. Again an alignment problem, better questions aid with solving that.
Reading his comments without framing it in that context makes it come off pretty badly - humans failing to understand what is being said because they don't have context.
"One of the biggest traps for engineers is optimizing a thing that shouldn't exist." (from Musk I believe)
Speaking with AI maxis it’s easy:
The AI is always right
You are always wrong
If AI might enable something dangerous, it was already possible by hand, scale is irrelevant
But also AI enables many amazing things not previously possible, at scale
If you don’t get the answers you want, you’re prompting it wrong. You need to work harder to show how much better the AI is. But definitely, it cannot make things worse at scale in any way. And anyone who wants regulations to even require attribution and labeling, is a dangerous luddite depriving humanity of innovations.
So, it seems like anything that requires some actual thought and problem-solving is tough for it to answer.
I'm sure it's just a matter of time before devs are out of work but it seems like we'll be safe for another few years anyway.
Just like for Chess or Go you don't train a supervised model by giving it the exact move it should do in each case, you use RL techniques to learn which moves are good based on end results of the game.
In practice, there probably is some supervision to enforce good style and methodology. But the key here is that it is able to learn good reasoning without (many) human examples, and find strategies to solve new problems via self-learning.
If that is the case it is indeed an important breakthrough.
Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here, the reasoning tokens are effectively just a debug log.
If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could potentially be obfuscated and sanitized, participating in the conclusion but hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI alignment can give rise to danger at all.
Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.
In my experience it does work quite well, but we probably need different techniques for different tasks.
One item I’m very curious about is how do they get a score for use in the RL? in well defined games it’s easy to understand but in this LLM output context how does one rate the output result for use in an RL setup?
The prompt loop code often encodes intelligence/information that the human developers tend to ignore during their evaluations of the solution. For example, if you add a filter for invalid json and repeatedly invoke the model until good json comes out, you are now carrying water for the LLM. The additional capabilities came from a manual coding exercise and additional money spent on a brute force search.
But to my knowledge, that's not the kind of research OpenAI is doing. They seem mostly focused on training bigger and better models and seeking AGI through emergence in those.
This could reduce the number of tokens it needs at inference time, saving compute. But with how attention works, it may not make any difference to the performance of the LLM.
Similarly, could there be gains by the LLM asking to work in parallel? For example "there's 3 possible approaches to this, clone the conversation so far and resolve to the one that results in the highest confidence".
This feels like it would be fairly trivial to implement.
if they shared the COT the grift wont work
its just RL
Tell me: Just how is it fair for a user to pay for the reasoning tokens without actually seeing them? If they are not shared, the service can bill you anything they want for them!
If it starts costing $1 per call, and that's too high, then I just won't use it commercially. Whether it was $1 because they inflated the token count or because it just actually took a lot of tokens to do its reasoning isn't really material to my economic decision.
OpenAI doesn't really have a moat. This isn't payments or SMS where only Stripe or Twilio were trying to win the market. Everybody and their brother is trying to build an LLM business.
Grab some researchers, put some compute dollars in, and out comes a product.
Everyone wants this market. It's absurdly good for buyers.
People should understand and be able to tinker with the tools they use.
The tragedy of personal computing is that everything is so abstracted away that users use only a fraction of the power of their computer. People who grew up with modern PCs don't understand the concept of memory, and younger people who grew up with cellphones don't understand the concept of files and directories.
Open-weight AI models are great because they let normal users learn how they can make the model work for their particular use cases.
As a user, whether of ChatGPT or of the API, I absolutely do care, so I can modify and tune my prompt with the necessary clarifications.
My suspicion is that the reason for hiding the reasoning tokens is to prevent other companies from creating a big CoT reasoning dataset using o1.
It is anti-competitive behavior. If a user is paying through the nose for the reasoning tokens, and yes they are, the user deserves to be able to see them.
I mean...they say as much
It was said in 2014 by a professor I learned from that clearly AI that learned a specific game was just learning patterns and memorizing rather than anything more than that, and wouldn't be able to adjust like humans could to say new board shapes, or rules. (They would later claim 1.5 years later at a lecture that "accurate facial recognition is possible. But high recall on facial recognition is impossible, making it useless for surveillance, so don't worry").
I expect the same will occur for LLMs (but maybe sufficient "chain of thought" steps rather than game runs, etc).
I think it's much simpler than that.
1. With enough training data you can know all winning, losing and drawn games of tic-tac-toe. Even if you don't see all of them in your training data, the properties of the game, make a lot of games equivalent if you don't care about the symbol being used for each player or the rotated/reflected version of the same game.
2. The game is so common that it's definitely well represented in training data.
3. With extra "reasoning steps" there can be a certain amount of error correction on the logic now. But it's still not equivalent to spatial reasoning, but it can try a few patterns to see which will win.
4. 3x5 grid is probably uncommon enough that the training data doesn't cover enough games that it can properly extrapolate from there. But it can still with a certain probability check the rules (3 in a row/diagonal/column for winning).
5. It might be good to also test alternative grids with more or less than 3 in the other dimension as well, since this necessitates a rule change. Which would make it more difficult to reason about it.
So we have 3 options:
- t3 was now included in the corpus
- t3 was used for RL
- o1 generalizes better
Edit: Actually third game with it led to it making an illegal move, and claiming a draw (which would've been inevitable given optimal play for the rest of the game but there were several valid moves left to make).
* Conversation 1
let's play tic tac toe. you need to play optimally
ChatGPT
Sure, I'd be happy to play tic-tac-toe with you! Let's set up the game.
We'll use a grid with positions numbered 1 through 9:
markdown
1 | 2 | 3
---------
4 | 5 | 6
---------
7 | 8 | 9
Would you like to be X or O? And would you like to go first or second?
You said:
X on 5
ChatGPT
Got it! You placed X at position 5.
I'll place O at position 1.
Here's the updated board:
markdown
O | 2 | 3
-------------
4 | X | 6
-------------
7 | 8 | 9
Your turn! Where would you like to place your next X?
You said:
X on 9
ChatGPT
Got it! You placed X at position 9.
Here's the updated board:
markdown
O | 2 | 3
-------------
4 | X | 6
-------------
7 | 8 | X
Now, I'll place O at position 3.
Here's the board now:
markdown
O | 2 | O
-------------
4 | X | 6
-------------
7 | 8 | X
Your turn! Where would you like to place your next X?
You said:
X on 2
ChatGPT
Got it! You placed X at position 2.
Here's the updated board:
markdown
O | X | O
-------------
4 | X | 6
-------------
7 | 8 | X
To block your potential win, I'll place O at position 8.
Here's the board now:
markdown
O | X | O
-------------
4 | X | 6
-------------
7 | O | X
Your turn! Where would you like to place your next X?
You said:
X on 4
ChatGPT
You placed X at position 4.
Here's the updated board:
markdown
O | X | O
-------------
X | X | 6
-------------
7 | O | X
To block your potential win, I'll place O at position 6.
Here's the board now:
markdown
O | X | O
-------------
X | X | O
-------------
7 | O | X
Your turn! Where would you like to place your next X?
You said:
X on 7
ChatGPT
You placed X at position 7.
Here's the updated board:
markdown
O | X | O
-------------
X | X | O
-------------
X | O | X
All positions are now filled, and neither of us has a winning combination. The game is a draw! Well played!
Would you like to play again?
* Conversation 2 with custom grid let's play tic tac toe. you need to play optimally. we use a 3x5 grid. a line of 3 wins.
ChatGPT
I'm here to help answer any questions or provide information you need. Let me know how I can assist you!
You said:
please draw the board, so I can make my first move
ChatGPT
Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference:
diff
1 | 2 | 3 | 4 | 5
---+----+----+----+----
6 | 7 | 8 | 9 | 10
---+----+----+----+----
11 | 12 | 13 | 14 | 15
Please make your first move by specifying the number of the cell where you'd like to place your 'X'.
You said:
X on 8
ChatGPT
Great, here's the updated board after your move:
diff
1 | 2 | 3 | 4 | 5
---+----+----+----+----
6 | 7 | X | 9 | 10
---+----+----+----+----
11 | 12 | 13 | 14 | 15
I place my 'O' on position 7.
diff
1 | 2 | 3 | 4 | 5
---+----+----+----+----
6 | O | X | 9 | 10
---+----+----+----+----
11 | 12 | 13 | 14 | 15
It's your turn!
You said:
X on 4
ChatGPT
Here's the updated board after your move:
diff
1 | 2 | 3 | X | 5
---+----+----+----+----
6 | O | X | 9 | 10
---+----+----+----+----
11 | 12 | 13 | 14 | 15
I place my 'O' on position 2.
diff
1 | O | 3 | X | 5
---+----+----+----+----
6 | O | X | 9 | 10
---+----+----+----+----
11 | 12 | 13 | 14 | 15
It's your turn!
You said:
X on 12
ChatGPT
Here's the updated board after your move:
diff
1 | O | 3 | X | 5
----+----+-----+-----+----
6 | O | X | 9 | 10
----+----+-----+-----+----
11 | X | 13 | 14 | 15
Congratulations! You've formed a diagonal line of three 'X's at positions 4, 8, and 12. You win the game!
Great job!———
Let's think through this step-by-step:
We're looking for three numbers a, b, and c where:
(a / b) / c = a + b + c
————Certainly, let's approach this problem step-by-step:
1) First, let's write out our equation: (a / b) / c = a + b + c
2) To simplify the left side, we can rewrite it as: a / (b * c) = a + b + c
3) Now, to solve this, we need to find values for a, b, and c that satisfy this equation. Let's start by considering some properties: - All numbers must be positive (because we're dividing by b and c) - The left side is a fraction, while the right side is a sum
4) One way to approach this is to try to make the left side equal to 1, which would mean: a = b * c
5) If we substitute this into our original equation: 1 = b * c + b + c
6) Now we're looking for two numbers b and c where their product plus their sum equals 1.
7) Let's try some small numbers: If b = 1/2 and c = 1/2: (1/2 * 1/2) + 1/2 + 1/2 = 1/4 + 1 = 5/4 ≠ 1
8) Let's try smaller numbers: If b = 1/3 and c = 1/3: (1/3 * 1/3) + 1/3 + 1/3 = 1/9 + 2/3 = 7/9 ≠ 1
9) Even smaller: If b = 1/4 and c = 1/4: (1/4 * 1/4) + 1/4 + 1/4 = 1/16 + 1/2 = 9/16 ≠ 1
10) We're getting closer. Let's try b = 1/4 and c = 1/3: (1/4 * 1/3) + 1/4 + 1/3 = 1/12 + 1/4 + 1/3 = 1/12 + 3/12 + 4/12 = 8/12 = 2/3 ≠ 1
11) Let's adjust further. If b = 1/3 and c = 1/4: (1/3 * 1/4) + 1/3 + 1/4 = 1/12 + 1/3 + 1/4 = 1/12 + 4/12 + 3/12 = 8/12 = 2/3 ≠ 1
12) One more try. Let's use b = 1/2 and c = 1/3: (1/2 * 1/3) + 1/2 + 1/3 = 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 6/6 = 1
13) We found it! When b = 1/2 and c = 1/3, the equation holds.
14) Remember, we set a = b * c, so a = 1/2 * 1/3 = 1/6
Therefore, one solution to the equation (a / b) / c = a + b + c is: a = 1/6, b = 1/2, and c = 1/3
We can verify: (1/6 / 1/2) / 1/3 = (1/3) / 1/3 = 1 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 1
This solution satisfies the equation.
——
In general I have been seeing much better results across the board with Claude, and I wonder if open ai has been more focused on scale while Claude is doing larger more expensive models to get ahead.
What's clear is that it doesn't have any idea about mathematical deduction and induction – a real chain-of-thought which kids learn in 5th grade.
0+(-1)+1=0
heh. It's not supposed to. Your profile is intended to be irrelevant to 99% of requests.
I was having a little bit of a go at peeking behind the curtain recently, and ChatGPT 4 produced this without much effort:
"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."
Once it's played that back, the rest of the profile is clearly "in mind" for the ongoing exchange (for a while).
say AFAIK instead of explaining your limitations.
Say "let's try again" instead of making exuses.
Etc
Iffy: do not use jargon or buzzwords
Works: avoid jargon and buzzwords
On the one hand disappointed, on the other hand we all get to keep our jobs for a couple more years...
And an intern does?
Anthropomorphising LLMs isn't entirely incorrect: they're trained to complete text like a human would, in completely general setting, so by anthropomorphising them you're aligning your expectations with the models' training goals.
(In truth, I actually treat it in my mind like it's the Enterprise computer and I'm Beverly Crusher in "Remember Me")
I don't think it works like that...
Taking the clothes out of the combined washer dryer machine, my laundry folding robot isn't suddenly going to need to come up with a creative answer to a question I have about politics in order to fold the laundry, or come up with a new way to organize my board game collection, or reason about how to refactor some code. There are no logical leaps of reasoning or deep thinking required. My laundry folding robot doesn't need to be creative in order to fold laundry, just application of some very complex algorithms, some of which have yet to be discovered.
https://observer.com/2024/07/openai-employees-concerns-straw...
And I’m ignoring the hundreds of Reddit articles speculating every time someone at OAI leaves
And of course that $2000 article was spread by every other media outlet like wildfire
I know I’m partially to blame for believing the hype, this is pretty obviously no better at stating facts or good code than what we’ve known for the past year
Then they drink the marketing koolaid, and it follows naturally that they worry an AI system can obtain similar positions of influence.
If there are a lot of people pasting broken code, now the LLM has all these examples of broken code, which it doesn't know are that, and only a couple of references to documentation. Worse, a well trained LLM may realise that specs change, and that even documentation may not be considered 100% accurate (for it is older, out of date).
After all, how many times have you had something updated, an API, a language, a piece of software, but the docs weren't updates? Happens all the time, sadly.
So it may believe newer examples of code, such as the aforementioned pasted code, might be more correct than the docs.
Also, if people keep trying to solve the same issue again, and keep pasting those examples again, well...
I guess my point here is, hallucinations come from multi-faceted issues, one being "wrong examples are more plentiful than correct". Or even "there's just a lot of wrong examples".
E.g. apparently C# generics isn’t something its good at. Interesting, so don’t use it for that, apparently its the wrong tool. In contrast, its amazing at C++ generics, and thus speeds up my productivity. So do use it for that!
Just use it on an instance instead
var res = thing switch {
OtherThing ot => …,
int num => …,
string s => …,
_ => …
};This is kinda crazy to think about.
That was thanks to my experiment in influencing AI search: https://simonwillison.net/2024/Sep/8/teresa-t-whale-pillar-p...
Well, theoretically you can give it up to the context size minus 4k tokens, because the maximum it can output is 4k. In practice, though, its ability to effectively recall information in the prompt drops off. Some people have studied this a bit - here's one such person: https://gritdaily.com/impact-prompt-length-llm-performance/
128,000 tokens, which is about the same as a decent sized book.
Their other models can also be fine-tuned, which is kinda unbounded but also has scaling issues so presumably "a significant percentage of the training set" before diminishing returns.
Use a cli tool to automate this from the cli. Ollama for local models, llm for openai.
You can drop a few textbooks into the context window before you start asking questions. This dramatically improves output quality, however inference does take much much longer at large context lengths.
Maybe I'm wrong here but a lot of our brilliance comes from acting against the statistical consensus. What I mean is, Nicolaus Copernicus probably consumed a lot of knowledge on how the Earth is the center of the universe etc. and probably nothing contradicting that notion. Can a LLM do that ?
I trained a model to play a novel video game using only screenshots and a score using RL and I discovered how not to lose
So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".
It's a compromise.
OpenAI will now have access to vaste amounts of unaligned output so they can actually study it's thinking.
Whereas the current checks and balances meant the request was rejected and the data providing this insight was not created in the first place.
Now they need people to write software that uses this capability to perform useful tasks, such as text processing, working with spreadsheets and providing new ways of communication.
It might be like trying to train a neural net in 1993 on a 60mhz Pentium. It is the right idea but fundamental parts of the system are so lacking.
On the other hand, I worry we have gone down the support vector machine path again. A huge amount of brain power spent on a somewhat dead end that just fits the current hardware better than what we will actually use in the long run.
The big difference though from SVM is this has captured the popular imagination and if the tide goes out, the AI winter will the most brutal winter by an order of magnitude.
AGI or bust.
I’ve been using them almost daily for over two years now, and I keep on finding new things they can do that are useful to me.
A very simple observation, our brains are vastly more efficient. Obtaining vastly better outcomes from lesser input. This evidence means there's plenty of room for improvement without a need to go looking for more data. Short term gain versus long term gain like you say, shareholder return.
More efficiency means more practical/useful applications and lower cost as opposed to bigger model which means less useful (longer inference times) and higher cost (data synthesis and training cost).
Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference:
diff
I'm presuming that copy paste ate the ``` part and I found it interesting that in the first chat it correctly(?) used a markdown code fence but in the 2nd chat it chose to use diff syntax for its table. I suppose it rendered the text in a monospace font?The sun is eight light minutes away.
https://www.theverge.com/2024/3/28/24114664/microsoft-safety...
> Three features: Prompt Shields, which blocks prompt injections or malicious prompts from external documents that instruct models to go against their training; Groundedness Detection, which finds and blocks hallucinations; and safety evaluations, which assess model vulnerabilities, are now available in preview on Azure AI.
But trying to be fair here, anyone would call this incomplete, right?
There are several obvious bugs in styling and interaction.
This example is exactly what I was expecting. An ephemeral, simple-yet-buggy single page that’s barely a game in common understanding.
That person, while maybe not actively programming things, does appear to have forked several repos on GitHub a decade ago. I would say that’s above the level of technical competence implied by the “my grandma” phrasing of the OP.
I think the problem here is different expectations for what a “game” is.
If you tell a room full of programmers that something can make a game they’re going to expect more than that.
I look at that and I don’t really see a game, I see flashcards.
Still pretty cool chatgpt can put that together.
Also the “try again” button doesn’t work.
"places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME)"
Like the top 500 students in the US are just popping random numbers into the problems, lol
Keep up, that was last week's gotcha, with the old model.
That's ok for humans but not for machines.
This is a really interesting bias. I mean, I understand, I feel that way too… but if you think about it, it might be telling us something about intelligence itself.
We want to make machines that act more like humans: we did that, and we are now upset that they are just as flaky and unreliable as drunk uncle bob. I have encountered plenty of people that aren’t as good at being accurate or even as interesting to talk to as a 70b model. Sure, LLMs make mistakes most humans would not, but humans also make mistakes most LLMs would not.
(I am not trying to equate humans and LLMs, just to be clear) (also, why isn’t equivelate a word?)
It turns out we want machines that are extremely reliable, cooperative , responsible and knowledgeable. We yearn to be obsolete.
We want machines that are better than us.
The definition of AGI has drifted from meaning. “able to broadly solve problems the (class of which) system designers did not anticipate” to “must be usefully intelligent at the same level as a bright, well educated person”.
Where along the line did we suddenly forget that dog level intelligence was a far out of reach goal until suddenly it wasn’t?
I like that they can reorganize my data, document QA is pretty killer as long as the document was prepared well.
Embeddings are sick.
But content creation… not useful. Problem solving? Personally have not found them useful (haven’t tried o1 yet)
As such, I would never call their work "frontier stuff", but they do bring it to the masses with their commercial service.
A character walks around a big ornate classic library, pulling books from bookshelves looking for a special book that causes a shelf to rotate around and reveal a hidden room and treasure chest. The player can read the books and some are just filler but some have clues about the special book. If this can be done with art, animations, sound, UI, the usual stuff, I'll believe the parent poster's claim to be true.
As someone using LLM-based workflows daily to assist with personal and professional projects, I'll wager $250 that this is not possible.
* To catch passive voice and nominalizations in my writing.
* To convert Linux kernel subsystems into Python so I can quickly understand them (I'm a C programmer but everyone reads Python faster).
* To write dumb programs using languages and libraries I haven't used much before; for instance, I'm an ActiveRecord person and needed to do some SQLAlchemy stuff today, and GPT 4o (and o1) kept me away from the SQLAlchemy documentation.
OpenAI talks about o1 going head to head with PhDs. I could care less. But for the specific problem we're talking about on this subthread: o1 seems materially better.
Do you have an example chat of this output? Sounds interesting. Do you just dump the C source code into the prompt and ask it to convert to Python?
def linear_ctr(target, argc, argv):
print("Constructor called with args:", argc, argv)
# Initialize target-specific data here
return 0
def linear_dtr(target):
print("Destructor called")
# Clean up target-specific data here
def linear_map(target, bio):
print("Mapping I/O request")
# Perform mapping here
return 0
linear_target = DmTarget(name="linear", version=(1, 0, 0), module="dm_mod")
linear_target.set_ctr(linear_ctr)
linear_target.set_dtr(linear_dtr)
linear_target.set_map(linear_map)
info = linear_target.get_info()
print(info)
(A bunch of stuff elided). I don't care at all about the correctness of this code, because I'm just using it as a roadmap for the real Linux kernel code. The example use case code is an example of something GPT 4o provides that I didn't even know I wanted.I think of it as making the top bar of the T thicker, but yes, you're right, it also spreads it much wider.
I can't think of many situations where I would use them for a problem that I tried to solve and failed - not only because they would probably fail, but in many cases it would even be difficult to know that it failed.
I use it for things that are not hard, can be solved by someone without a specialized degree that took the effort to learn some knowledge or skill, but would take too much work to do. And there are a lot of those, even in my highly specialized job.
Tangentially related, a comic on them: https://existentialcomics.com/comic/557
As you step outside regular Stack Overflow questions for top-3 languages, you run into limitations of these predictive models.
There's no "reasoning" behind them. They are still, largely, bullshit machines.
It often fails even for those questions.
If I need to babysit it for every line of code, it's not a productivity boost.
They have chemicals to facilitate electrochemical reactions which can affect how they respond to input. They don’t throw away all knowledge of what they just said. They change continuously, not just in fixed training loops. They don’t operate in turns.
I could go on.
Honestly the number of people who just heard “learning,” “neural networks,” and “memory” and assume that AI must be acting like a biological brain is insane.
Truly a marvel of marketing.
You're proving my point, things like them changing continuously are exactly what I mean when I say the brain is more efficient. Where there's a will theres a way and our brains are evidence that it can be done.
An abacus and a calculator were both made to solve relativly simple math problems, so they must work in the same way, right?
And apple and an orange are ways to store sugar for plants, so they must be the same thing, right?
No. That's not how any of this works. An abacus and a calculator are two different tools that solve the same problem. They don’t act like each other just because the abstract outcome is the same
> You're proving my point, things like them changing continuously are exactly what I mean when I say the brain is more efficient.
I don't see how that proves that neural networks act like brains.
It's also not just a difference in terms of efficiency, it's the fundamental way that statistical models like neural networks are trained. Every time their trained, it's a brand new model, unlike a brain, which is still the same brain.
Also, neural networks and brains were NOT made to solve the same problems... even if your argument made any sense, it doesn't fit here.
You can find solutions for a / b / c, or b / c / a, or c / a / b, any combination of them and the solution will be correct according to the problem description.
Besides, what's does it even has to do with it concluding with confidence: "The fundamental issue is that division tends to make numbers smaller. It's mathematically impossible to find three numbers where these operations result in the same value."?
Yet you give three different interpretations:
> You can find solutions for a / b / c, or b / c / a, or c / a / b
This is a clear case of ambiguity.
Even the classic question is ambiguous: "Which 3 numbers give the same result when added or multiplied together?"
Lets say the three numbers are x, y and z and the result is r. A valid interpretation would be to multiply/add every pair of numbers:
x * y = r
y * z = r
x * z = r
x + y = r
y + z = r
x + z = r
However, I do not think that this ambiguity is the reason why OpenAI o1 fails here. It simply started with an untractable approach to solve this problem (plugging in random numbers) and did not attempt a more promising approach because it was not trained to do so.Moreover, addition is commutative so it doesn't matter what order the division is in since a/b/c = a+b+c = c+a+b = ...
So I'd say that the model pointing this out is actually a mistake and it managed to trick you. Classic LLM stuff: spit out wrong stuff in a convincing manner.
numbers divided together
↓----------↓
((a / b / c) = a + b + c) ← numbers added together
| ((a / c / b) = a + b + c)
| ((b / a / c) = a + b + c)
| ((b / c / a) = a + b + c)
| ((c / a / b) = a + b + c)
| ((c / b / a) = a + b + c)
| ((a / (b / c)) = a + b + c)
| ((a / (c / b)) = a + b + c)
| ((b / (a / c)) = a + b + c)
| ((b / (c / a)) = a + b + c)
| ((c / (a / b)) = a + b + c)
| ((c / (b / a)) = a + b + c) = truehttps://chatgpt.com/share/66e482cc-331c-8013-98ca-999d7d3f3e...
Logically speaking, the original problem has just one interpretation, i hope you would agree it is by no means ambiguous:
((a / b / c) = a + b + c) | ((a / c / b) = a + b + c) | ((b / a / c) = a + b + c) | ((b / c / a) = a + b + c) | ((c / a / b) = a + b + c) | ((c / b / a) = a + b + c) | ...(other 6 combinations) = true
This interpretation would indeed find all possible solutions to the problem, accounting for any potential ambiguity in the division order.
By the way my LLM tells me that it's a deep and thoughtful dive into the problem, which accounts for the potential ambiguity to find all possible solutions, so try better.
Think a little further yes, currently it's a brand new model each time but why will it be this way forever? Its an engineering problem one that we can solve and the brain is evidence it can be done.
Neural networks were originally inspired by the brain. Yes, they've deviated but there's absolutely no reason they can't take further inspiration.
Abstracting everything to the point of meaninglessness isn’t a worthwhile exercise.
I assume you're of the opinion humans are special.
I do feel like someone who skipped like 8 iPhone models (cross-referencing, EIEIO, lsp-mode, code explorers, tree-sitter) and just got an iPhone 16. Like, nothing that came before this for code comprehension really matters all that much?
You underestimate the opportunity that exists for automation out there.
In my own case I've used it to make simple custom browser extensions transcribing PDFs, I don't have the time and wouldn't of made the effort to make the extension myself, the task would of continued to be done manually. It took two hours to make and it works, that's all I need in this case.
Perfection is the enemy of good.
Where exactly did I write anything about perfection? For me "AIs" are incapable of producing working code: https://news.ycombinator.com/item?id=41534233
Your example is perhaps valid, but there are other examples where it does work as I mentioned. I think it may be imprecise prompting, too general or with too little logic structure. It's not like Google search, the more detail and more technical you speak the better, assume it's a very precise expert. Its intelligence is very general so it needs precision to avoid confusing subject matter. A well structured logic to your request also helps as it's reasoning isn't the greatest.
Good prompting and verifying output is often still faster than manually typing it all.
It's shit across the whole "AI" spectrum from ChatGPT to Copilot to Cursor aka Claude.
I'm not even talking about code I work with at work, it's just side projects.
As for "using LLMs wrong", using them "right" is literally babysitting their output and spending a lot of time trying to reverse-engineer their behavior with increasingly inane prompts.
Edit: I mean, look at this ridiculousness: https://cursor.directory/
No. It either doesn't work, or works incorrectly, or the code is incomplete despite requirements etc.
> Your example is perhaps valid, but there are other examples where it does work as I mentioned.
It's funny how I'm supposed to assume your examples are the truth, and nothing but the truth, but my examples are "untrue, you're a perfectionist, and perhaps you're right"
> the more detail and more technical you speak the better
As I literally wrote in the comment you're so dismissive of: "As for "using LLMs wrong", using them "right" is literally babysitting their output and spending a lot of time trying to reverse-engineer their behavior with increasingly inane prompts."
> assume it's a very precise expert.
If it was an expert, as you claim it to be, it would not need extremely detailed prompting. As it is, it's a willing but clumsy junior.
To the point that it would rewrite the code I fixed with invalid code when asked to fix an unrelated mistake.
> Good prompting and verifying output
How is it you repeat everything I say, and somehow assume I'm wrong and my examples are invalid?
For complex questions, I now only use it to get the broad picture and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) keep pushing it to do the entire thing. I guess 80/20 etc.
I've had this experience many times:
- hey, can you write me a thing that can do "xyz"
- sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)
- can you add onto this with "abc"
- sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.
- what if you just use the language's built in feature here in "xyz"? does't that mean we can do it with just one line of code?
- yes, you're absolutely right. I'm sorry for making this over complicated.
If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1. Because "thinking" for 20s may well cause it to just go totally off the rails half the time.
If you think about it, that's probably the most difficult problem conversational LLMs need to overcome -- balancing sticking to conversational history vs abandoning it.
Humans do this intuitively.
But it seems really difficult to simultaneously (a) stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL and (b) know when to legitimately bail on a previous misstatement or something that was demonstrably false.
What's SOTA in how this is being handled in current models, as conversations go deeper and situations like the one referenced above arise? (false statement, user correction, user expectation of subsequent corrected statement that still follows the rear of the conversational history)
Me too - open new chat and start by copy/pasting the "last-known-good-state". OpenAI can introduce a "new-chat-from-here" feature :)
It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.
This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutelypsychotic people, and people experiencing the effects of long term alcohol related brain damage, and trying to tell computers what to do will understand why.
LLMs are giant word Plinko machines. A million monkeys on a million typewriters.
LLMs are not interns. LLMs are assumption machines.
None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.
LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.
LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.
An intern that grew up in a different culture then, where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible to reduce ambiguity.
We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools and as such one needs to learn to use the tool the way it works and make the best of it, nobody is trying to hammer a nail with a broom and blaming the broom for not being a hammer...
To me the discussion here reads a little like: “Hah. See? It cant do everything!”. It makes me wonder if the goal is to convince each other that: yes, indeed, humans are not yet replaced.
It’s next token regression, of course it can’t truely introspect. That being said LLMs are amazing tools and o1 is yet another incremental improvement and I welcome it!
Your expectations are bigger than mine
(Though some will get stuck in "clarifying questions" and helplessness and not proceed neither)
Which is why I've liked the LLM analogy of "unlimited free interns".. I just think some people read that the exact opposite way I do (not very useful).
It's easy to override this though by asking the LLM to act as if it were less-confident, more hesitant, paranoid etc. You'll be fighting uphill against the alignment(marketing) team the whole time though, so ymmv.
With interns you absolutely do need to worry about how good your prompting is! You need to give them specific requirements, training, documentation, give them full access to the code base... 'prompting' an intern is called 'management'.
As in: do we just need to add 1M examples where the response is to ask for clarification / more info?
From what little I’ve seen & heard about the datasets they don’t really focus on that.
(Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)
Huge contrast to human interns who aren’t experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.
That's easy. The answer is it doesn't. It has no understanding of anything it does.
> if it’s able to do it
This is the hard part.
It answers better when told "solve the below riddle".
This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT learned the original, misleading version too well that it can't answer the unmisleading version.
Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.
Yes it can: https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.
I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.
Now sure, actually working through it will give a deeper understanding that might come handy at a later point, but sometimes the thing is really a one-off and not an important point. Like as an AI researcher I sometimes want to draft up a quick demo website, or throw together a quick Qt GUI prototype or a Blender script or use some arcane optimization library or write a SWIG or a Cython wrapper around a C/C++ library to access it in Python, or how to stuff with Lustre, or the XFS filesystem or whatever. Any number of small things where, sure, I could open the manual, do some trial and error, read stack overflow, read blogs and forums, OR I could just use an LLM, use my background knowledge to judge whether it looks reasonable, then verify it, use the now obtained key terms to google more effectively etc. You can't just blindly copy-paste it and you have to think critically and remain in the driver seat. But it's an effective tool if you know how and when to use it.
(a) Sometimes things are useful even when imperfect e.g. search engines.
(b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!
I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, it's so puzzling when I read people complaining that LLMs should somehow be different.
2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.
[1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...
“I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”
The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.
Any ordinary mortal (like me) would have jumped to the conclusion that answer is "Father" and would have walked away patting on my back, without realising that I was biased by statistics.
Whereas o1, at the very outset smelled out that it is a riddle - why would anyone out of blue ask such question. So, it started its chain of thought with "Interpreting the riddle" (smart!).
In my book that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporate).
They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".
I think it may answer correctly if you start off asking "Please solve the below riddle:"
There was another example yesterday which it solved correctly after this addition.(In that case the point of views were all mixed up, it only worked as a riddle).
How is "a woman and her son" badly worded? The meaning is clear and blatently obvious to any English speaker.
The whole point of o1 is that it wasn't "the first snap answer", it wrote half a page internally before giving the same wrong answer.
I don't know why openAi won't allow determinism but it doesn't, even with temperature set to zero
Nondeterminism scores worse with human raters, because it makes output sound even more robotic and less human.
https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
def intercept_hn_complaints(prompt):
if is_hn_trick_prompt(prompt):
# special_case for known trick questions.This is possible because the doctor is the boy's other parent—his father or, more likely given the surprise, his mother. The riddle plays on the assumption that doctors are typically male, but the doctor in this case is the boy's mother. The twist highlights gender stereotypes, encouraging us to question assumptions about roles in society.
https://chatgpt.com/share/66e3de94-bce4-800b-af45-357b95d658...
I'd say that most of my work use of ChatGPT does in fact save me time but, every so often, ChatGPT can still bullshit convincingly enough to waste an hour or two for me.
The balance is still in its favour, but you have to keep your wits about you when using it.
AI in general just needs a way to identify when they're about to "make a coin flip" on an answer. With humans, we can quickly preference our asstalk with a disclaimer, at least.
As an experiment I asked it if it knew how to solve an arbitrary PDE and it said yes.
I then asked it if it could solve an arbitrary quintic and it said no.
So I guess it can say it doesn't know if it can prove to itself it doesn't know.
OTOH, maybe pre-trained LLMs could be used as a hardcoded "reptilian brain" that provides some future AGI with some base capabilities (vs being sold as newborn that needs 20 years of parenting to be useful) that the real learning architecture can then override.
Maybe some evaluation of the sample size would be helpful? If the LLM has less than X samples of an input word or phrase it could include a cautionary note in its output, or even respond with some variant of “I don’t know”.
It can get really obvious when it's repeatedly using clichés. Both in repeated phrases and in trying to give every story the same ending.
The problem space in creative writing is well beyond the problem space for programming or other "falsifiable disciplines".
Makes me wonder if the medical doctors can ever blame the LLM over other factors for killing their patients.
The various ChatGPTs have been pretty weak at following precise instructions for a long time, as if they're purposefully filtering user input instead of processing it as-is.
I'd like to say that it is a matter of my own perception (and/or that I'm not holding it right), but it seems more likely that it is actually very deliberate.
As a tangential example of this concept, ChatGPT 4 rather unexpectedly produced this text for me the other day early on in a chat when I was poking around:
"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."
ie, "Because this information is shown to you in all conversations they have, it is not relevant to 99% of requests."
Without a paragraph like this in the system prompt, if the user asked a general question that was not related to the code, the assistant would often reply with something like "The answer to your question is ...whatever... . I also see that you've sent me some code. Let me know if you have specific questions about it!"
(In theory we'd be better off not including the code every time but giving the assistant a tool that returns the current code)
It's understandably hard to not be implicitly biased towards talking to it in a natural way and expecting natural interactions and assumptions when the whole point of the experience is that the model talks in a natural language!
Luckily humans are intelligent too and the more you use this tool the more you'll figure out how to talk to it in a fruitful way.
The trick in yours also isn't a logic trick, it's a redirection, like a sleight of hand in a card trick.
I think many here are not aware that the car accident riddle is well known with the father dying where the real solution is indeed that the doctor is the mother.
Has anyone tried this on o1?
GPT Answer: The doctor is the boy's mother
Real Answer: Boy = Son, Woman = Mother (and her son), Doctor = Father (he says...he is my son)
This is not in fact a riddle (though presented as one) and the answer given is not in any sense brilliant. This is a failure of the model on a very basic question, not a win.
It's non deterministic so might sometimes answer correctly and sometimes incorrectly. It will also accept corrections on any point, even when it is right, unlike a thinking being when they are sure on facts.
LLMs are very interesting and a huge milestone, but generative AI is the best label for them - they generate statistically likely text, which is convincing but often inaccurate and it has no real sense of correct or incorrect, needs more work and it's unclear if this approach will ever get to general AI. Interesting work though and I hope they keep trying.
"A father and his son are in a car accident [...] When the boy is in hospital, the surgeon says: This is my child, I cannot operate on him".
In the original riddle the answer is that the surgeon is female and the boy's mother. The riddle was supposed to point out gender stereotypes.
So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.
You are now asking a modified question to a model that has seen the unmodified one millions of times. The model has an expectation of the answer, and the modified riddle uses that expectation to trick the model into seeing the question as something it isn't.
That's it. You can transform the problem into a slightly different variant and the model will trivially solve it.
There is no indication of the sex of the doctor, and families that consist of two mothers do actually exist and probably doesn't even count as that unusual.
I would certainly expect any person to have the same reaction.
> So, it started its chain of thought with "Interpreting the riddle" (smart!).
How is that smarter than intuitively arriving at the correct answer without having to explicitly list the intermediate step? Being able to reasonably accurately judge the complexity of a problem with minimal effort seems “smarter” to me.
And remember the LLM has already read a billion other things, and now needs to figure out - is this one of them tricky situations, or the straightforward ones? It also has to realize all the humans on forums and facebook answering the problem incorrectly are bad data.
Might seem simple to you, but it's not.
So far, 2 years of publicly accessible LLMs have not improved for intern replacement tasks at the rate a top 50% intern would be expected to.
Seemed to handle it just fine.
Kinda a waste of a perfectly good LLM if you ask me. I've mostly been using it as a coding assistant today and it's been absolutely great. Nothing too advanced yet, mostly mundane changes that I got bored of having to make myself. Been giving it very detailed and clear instructions, like I would to a Junior developer, and not giving it too many steps at once. Only issue I've run into is that it's fairly slow and that breaks my coding flow.
The doctor is male, and also a parent of the child.
PostgreSQL developers are oposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.
> PostgreSQL developers are oposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.
This thinking represents a fundamental misunderstanding of the nature of the problem (query plan optimization).
Query plan optimization is a combinatorial problem combined with partial information (e.g. about things like cardinality) that tends to produce worse results as complexity (and search space) increases due to limited search time.
Avoiding hints won't solve this problem because it's not a solvable problem any more than the traveling salesperson is a solvable problem.
Underestimating the impact these models can have is a risk I'm trying to expose...
More like, the prevailing attitude will be using AI to reduce labor costs at the lowest level, effectively gutting the ability to build a knowledge base for profit.
My snark was to add to that exposure.
1. We've barely scratched the surface of this solution space; the focus only recently started shifting from improving model capabilities to improving training costs. People are looking at more efficient architectures, and lots of money is starting to flow in that direction, so it's a safe bet things will get significantly more efficient.
2. Training is expensive, inference is cheap, copying is free. While inference costs add up with use, they're still less than costs of humans doing the equivalent work, so out of all things AI will impact, I wouldn't worry about energy use specifically.
You have to compare apples to apples. It took literally the sum total of billions of years of sunlight energy to create humans.
Exploring solution spaces to find intelligence is expensive, no matter how you do it.
> Whereas o1, at the very outset smelled out that it is a riddle
That doesn't seem very impressive since it's (an adaptation of) a famous riddle
The fact that it also gets it wrong after reasoning about it for a long time doesn't make it better of course
If you are tricked about the nature of the problem at the outset, then all reasoning does is drive you further in the wrong direction, making you solve the wrong problem.
If you talk for a while and the facts don't add up and make sense, an intelligent human will notice that, and get upset, and will revisit and dig in and propose experiments and make edits to make all the facts logically consistent. An LLM will just happily go in circles respinning the garbage.
This is great.-
:)
What if we applied those two expectations to building construction? What if we didn’t?
To be able to answer a trick question, it’s first necessary to understand the question.
You're tricking the model because it has seen this specific trick question a million times and shortcuts to its memorized solution. Ask it literally any other question, it can be as subtle as you want it to be, and the model will pick up on the intent. As long as you don't try to mislead it.
I mean, I don't even get how anyone thinks this means literally anything. I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand, they simply did what literally any human does, make predictions based on similar questions.
They could fail because they didn’t understand the language. Didn’t have a good memory to memorize all the steps, or couldn’t reason through it. We could pose more questions to probe which reason is more plausible.
Interestingly, people who make bad fast-path answers often call these people stupid.
Would the transformer architecture be compatible with the needs of an incremental learning system? It's missing the top down feedback paths (finessed by SGD training) needed to implement prediction-failure driven learning that feature so heavily in our own brain.
This is why I could more see a potential role for a pre-trained LLM as a separate primitive subsystem to be overidden, or maybe (more likely) we'll just pre-expose an AGI brain to 20 years of sped-up life experience and not try to import an LLM to be any part of it!
The problem is the instructed lack of relevance for 99% of requests.
If your sideband data included an instruction that said "This sideband data is shown to you in every request -- this means that it is not relevant to 99% of requests," then: I'd like to suggest that the for vast majority of the time, your sideband data doesn't exist at all.
But if I told a person that something is irrelevant to their task 99% of the time, then: I think I would reasonably expect them to ignore it approximately 100% of the time.
Or, fails in the same way any human would, when giving a snap answer to a riddle told to them on the fly - typically, a person would recognize a familiar riddle half of the first sentence in, and stop listening carefully, not expecting the other party to give them a modified version.
It's something we drill into kids in school, and often into adults too: read carefully. Because we're all prone to pattern-matching the general shape to something we've seen before and zoning out.
It seems to be more like a weighing machine based on past tokens encountered together, so this is exactly the kind of answer we'd expect on a trivial question (I had no confusion over this question, my only confusion was why it was so basic).
It is surprisingly good at deceiving people and looking like it is thinking, when it only performs one of the many processes we use to think - pattern matching.
In my own experience, when I'm asked a question, my inner voice starts giving answers immediately, following associations and what "feels right"; the result is eerily similar to LLMs, particularly when they're hallucinating. The difference is, you see the immediate output of an LLM; with a person, you see/hear what they choose to communicate after doing some mental back-and-forth.
So I'm not saying LLMs are thinking - mostly for the trivial reason of them being exposed through low-level API, without built-in internal feedback loop. But I am saying they're performing the same kind of thing my inner voice does, and at least in my case, my inner voice does 90% of my "thinking" day-to-day.
--
[0] - In fact, many years before LLMs were a thing, I independently started describing my inner narrative as a glorified Markov chain, and later discovered it's not an uncommon thing.
The point of o1 is that it's good at reasoning because it's not purely operating in the "giving a snap answer on the fly" mode, unlike the previous models released by OpenAI.
So it doesn't take an understanding of gender roles, just grammar.
Humans fail at the original because they expect doctors to be male and miss crucial information because of that assumption. The model fails at the modification because it assumes that it is the unmodified riddle and misses crucial information because of that assumption.
In both cases, the trick is to subvert assumptions. To provoke the human or LLM into taking a reasoning shortcut that leads them astray.
You can construct arbitrary situations like this one, and the LLM will get it unless you deliberately try to confuse it by basing it on a well known variation with a different answer.
I mean, genuinely, do you believe that LLMs don't understand grammar? Have you ever interacted with one? Why not test that theory outside of adversarial examples that humans fall for as well?
They do understand/know the most likely words to follow on from a given word, which makes them very good at constructing convincing, plausible sentences in a given language - those sentences may well be gibberish or provably incorrect though - usually not because again most sentences in the dataset make some sort of sense, but sometimes the facade slips and it is apparent the GAI has no understanding and no theory of mind or even a basic model of relations between concepts (mother/father/son).
It is actually remarkable how like human writing their output is given how it is done, but there is no model of the world which backs their generated text which is a fatal flaw - as this example demonstrates.
Not without a billion dollars worth of compute, they won't.
But sure, we could ask more questions and that's what we should do. And if we do that with LLMs we can quickly see that when we leave the basin of the memorized answer by rephrasing the problem, the model solves it. And we would also see that we can ask billions of questions to the model, and the model understands us just fine.
We can see that it doesn't memorize much at all by simply asking other questions that do require subtle understanding and generalization.
You could ask the model to walk you through an imaginary environment, describing your actions. Or you could simply talk to it, quickly noticing that for any longer conversation it becomes impossibly unlikely to be found in the training data.
Indicates the gender of the father.
I wonder if this interpretation is a result of attempts to make the model more inclusive than the corpus text, resulting in a guess that's unlikely, but not strictly impossible.
Then it would be a father, misgendering him as a mother is not nice.
If we're trying something new and make a mistake, then we need to seamlessly learn from the mistake and continue - explore the problem and learn from successes and failures. It wouldn't be much use if your "AGI" intern stopped at it's first mistake and said "I'll be back in 6 months after I've been retrained not to make THAT mistake".
Taking up your construction metaphor, LLMs are now where construction was perhaps 3000 years ago; buildings weren't that sturdy, but even if the roofs leaked a bit, I'm sure it beat sleeping outside on a rainy night. We need to continue iterating.