A recent experience with ChatGPT 5.5 Pro (gowers.wordpress.com)

727 points by _alternator_ 9 days ago | 535 comments

https://twitter.com/wtgowers/status/2052830948685676605

https://xcancel.com/wtgowers/status/2052830948685676605

Jweb_Guru 8 days ago |

This jives with what I've experienced in the brief time I had access to 5.5 Pro. It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.

The downside (not noted in the article, but noted by others here) is cost. It uses tokens at an insane rate, the tokens cost a lot, and using it with subagent flows that you can use to have it tackle large problems with high accuracy costs even more. It is also much "slower" for large scale problems because of context limitations -- it has to constantly rediscover context for each part of the problem, and in order to make it accurate you need to wipe its context before progressing to the next small part, or launch even more agents. For mathematical proofs like these, where the required context to understand the problem and proof besides stuff that's already available in its training set is small and the problems are considered "important" enough, this might not be a problem, but for many of the tasks I would like to use it for (ensuring correctness of code that affects large codebases, or validating subtle assumptions) it definitely is one.

So I think it will be a while before the impressive capabilities of these models really percolate into our lives as programmers, unless you're one of the lucky ones given unlimited access to 5.5 Pro.

elAhmo 7 days ago | |

> It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.

I swear that people have said the same thing with effectively every new model that came out in the last six months.

fluidcruft 7 days ago | | |

I think it's because people walk every model up to its limits and become very aware of a task they can't make work. They do a lot of work simplifying and understanding limitations at that boundary. Then an improved model comes out and they immediately toe that barrier and make swift progress. They will also notice that the new model is natively doing tricks they had done manually.

The reality is likely that everyone is hitting similar barriers and the solutions are somewhat generalizable and get added to training new models.

Eventually people will reach the new limits and the cycle repeats.

fasterik 7 days ago | | |

> I swear that people have said the same thing with effectively every new model

That is definitely true, and at the same time, we can measure progress by who is making that claim. When Timothy Gowers, a Fields Medalist, says that models are now capable of "producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me," we can be pretty confident that we are getting into seriously interesting territory.

Jweb_Guru 7 days ago | | |

Many people may have, but I certainly haven't.

hackable_sand 7 days ago | | |

They did. The scam continues.

y1n0 8 days ago | |

> This jives with what I've experienced

Just as an fyi, the word you are looking for is jibes. Jive is something else entirely.

jibe 8 days ago | | |

I'm with you!

pfdietz 7 days ago | | |

Excuse me stewardess, I speak jive.

jfaat 7 days ago | | |

I'm going to start using malapropisms so people know I didn't use an llm to write things

billfor 8 days ago | | |

Blame The Bee Gees: https://en.wikipedia.org/wiki/Jive_Talkin'

boring-human 8 days ago | | |

Cut me some slack, Jack.

bicepjai 8 days ago | | |

Interesting I did not know that I would have used jives :) thanks

refulgentis 8 days ago | | |

That ship sailed looooong ago.

idiotsecant 8 days ago | | |

The only thing worse than complaining about this is being the guy complaining about the guy complaining about this. So congratulations on being second most annoying.

Forgeties79 7 days ago | |

> can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided,

I don’t know about the rest of y’all but I find “rigidly guiding” LLMs incredibly tedious and frustrating, in the same way that seeing an error code thrown for the 40th time while troubleshooting something on my computer for two hours is frustrating. It also feels somewhat like micromanaging a direct report. I don’t find that process fun or enjoyable in the slightest and it teaches me little in the process. It’s just trading styles of work, and I guess the response to that is “some people prefer that style of work.” I just don’t like being told by the world we all have to work that way now, I guess.

Jweb_Guru 7 days ago | | |

I agree. I find it endlessly frustrating and kind of hate what programming has become. But at least for me it meets the minimum bar of "it works if you push things" now. For past models, under no circumstances could I get them to semi-reliably solve these kinds of problems correctly without giving them so many "hints" that they weren't actually saving me time. The kind of reasoning I'm talking about is stuff like "can you actually construct a trace from program start for this condition that looks locally reachable?" Past models simply cannot reliably answer such questions as soon as the control flow involves enough hops or requires tracing through enough function calls.
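To make that kind of question concrete, here is a minimal sketch of what "construct a trace from program start" means on a toy call graph. The graph and every function name in it are hypothetical, invented purely for illustration; real codebases need far more than a breadth-first search (indirect calls, conditions on each edge, and so on).

```python
from collections import deque

def find_trace(call_graph, start, target):
    """Return one call path from start to target, or None if unreachable."""
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        fn = queue.popleft()
        if fn == target:
            # Walk parent links back to the entry point, then reverse.
            path = []
            while fn is not None:
                path.append(fn)
                fn = parents[fn]
            return path[::-1]
        for callee in call_graph.get(fn, ()):
            if callee not in parents:
                parents[callee] = fn
                queue.append(callee)
    return None

# Hypothetical call graph: "flush_cache" looks locally reachable (it is called
# by "handle_error"), but nothing reachable from "main" calls "handle_error".
call_graph = {
    "main": ["parse_args", "run"],
    "run": ["load", "process"],
    "process": ["validate"],
    "handle_error": ["flush_cache"],
}

trace_to_validate = find_trace(call_graph, "main", "validate")
trace_to_flush = find_trace(call_graph, "main", "flush_cache")
print(trace_to_validate)  # ['main', 'run', 'process', 'validate']
print(trace_to_flush)     # None
```

The second query is the interesting case: the condition is reachable from its immediate caller, yet no trace from program start exists.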

pmontra 9 days ago |

It's a very long post with a mix of technical (math) and philosophical sections. Here are the most striking points to reflect upon IMHO.

> It seems to me that training beginning PhD students to do research [...] has just got harder, since one obvious way to help somebody get started is to give them a problem that looks as though it might be a relatively gentle one. If LLMs are at the point where they can solve “gentle problems”, then that is no longer an option. The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.

Training must start from the basics though. Of course everybody's training in math starts with summing small integers, which calculators have been doing without any mistakes for a long time.

The point is perhaps confirmed by another comment further down in the post

> by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at using solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders

People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired. I'm not sure if there is a similar point with math. Again from the post

> suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.

robot-wrangler 8 days ago |

A very interesting comment from Baez, I'll just quote part of it.

> Where does the value of thinking and having deep ideas come from? We need to think about this now. If it comes primarily from their scarcity – the fact that having certain ideas is hard – then indeed this value may drop precipitously when the manufacture of ideas can be automated. But if the value comes from the utility of the ideas – the benefit that the idea brings – then the story changes: perhaps creating more good ideas is actually better, not worse. Here I’m using “utility” in a broad sense, not just in the sense of what people often call applied mathematics.

> In other words, mathematicians may need to adjust to a transformation from a scarcity economy to an abundance economy.

https://gowers.wordpress.com/2026/05/08/a-recent-experience-...

mxwsn 9 days ago |

> Here’s a thought experiment: suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.

This is a cultural choice. It makes sense that in the mathematics culture we currently have, this is alien. But already, other fields, and many individuals, would disagree and say that the human did have a major achievement here. As long as human-AI collaborations are producing the best results, there is meaningful contribution by the humans, and people that are deeper experts and skilled LLM whisperers should be able to make outsized contributions. The real shoe drops when pure AI beats humans and human-AI collaboration.

ziotom78 9 days ago |

I am a physics professor and often use Gemini to check my papers. It is a formidable tool: it was able to find a clerical error (a missing imaginary unit in a complex mathematical expression) I was not able to find for days, and it often underlines connections between concepts and ideas that I overlooked.

However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.

few 9 days ago |

>So if your aim in doing mathematics is to achieve some kind of immortality, so to speak, then you should understand that that won’t necessarily be possible for much longer — not just for you, but for anybody.

This made me a little sad

MinimalAction 9 days ago |

As a graduate student, this piece made me sad. I always believed that my work speaks for itself and transcends beyond my limited time on this cosmic experience. This notion of immortality was just a small intangible bonus I hoped for when I jumped into grad school. AI is making me feel less worthy.

NotOscarWilde 9 days ago |

As a TCS assistant professor from Eastern Europe, I always am a little jealous of the biggest names in math having such an easy access to the expensive, long thinking models.

Paying for Pro from any of my current academic budgets is completely out of the realm of reality here -- all budgets tend to have restricted uses and software payments fit into very few categories. Effectively, I'd have to ask for a brand new grant and hope the grant rules allow for large software payments and I won't encounter an anti-AI reviewer; such a thing would take one year at least.

As a final nail in the coffin, I was "denied" all Claude Opus recently as part of Microsoft's clampdown on individual (and academic) use of Copilot.

(ChatGPT 5.5 Plus does not seem sufficient for any deeper investigations into new research topics, I've tried.)

Apologies for the rant.

bustermellotron 9 days ago |

I saw Tim Gowers give a talk at the AMS-MAA joint meeting in Seattle about ten years ago where he predicted that in 100 years humans would no longer be doing research mathematics. I wonder if he’s adjusted his timeline.

At the time I thought the key missing tool was a natural language search that acted like mathoverflow, where you could explain your problem or ideas as you understood them and get references to relevant literature (possibly outside your experience or vocabulary).

34qJhah 8 days ago | |

And Teichmüller thought that Germany would win WW2 and volunteered for the Eastern Front.

Being a gifted mathematician does not make you right. In fact, mathematicians have a lot of bizarre theories.

TrackerFF 8 days ago |

The vast, vast majority of students going into higher education this fall will not contribute much to science until 4-5 years down the road (should they do research). Realistically 6-7 when they're in full swing with their Ph.D.

If we look where these models were 5-7 years ago...the existential threat of the Ph.D. was not even on the radar back then. The people finishing up their doctorate now are the first that can truly leverage these tools.

Now, if these to-be researcher students feel defeated (enough to quit), or completely lean on AI models to do the work for them, we're going to have a problem. Same with the funding of those Ph.D. positions. If we move away from "funding to produce researchers" to "funding to achieve results", will money that was usually spent to fund Ph.D. students start to flow towards compute?

If we look at it a bit cynically: Some researcher will be able to pump out a lot more papers by spending money on compute, than a couple of years of training students.

Interesting times. But also so much uncertainty. I feel terrible for the students that will have to decide now what they want to do, with all this knowledge.

robot-wrangler 8 days ago | |

> Now, if these to-be researcher students feel defeated (enough to quit), or completely lean on AI models to do the work for them, we're going to have a problem. [..] If we look at it a bit cynically: Some researcher will be able to pump out a lot more papers by spending money on compute, than a couple of years of training students.

Obviously this is already happening and will accelerate. Outside of grad work, you could already just buy a degree. Certainly in the softer disciplines, you can currently just buy a phd thesis and a good publication history. If you're in industry instead of academics, you can even buy a promotion. If your employer gives an AI budget to all workers then you quietly double that budget out of your own pocket for as long as it takes to get a promotion, then stop and just enjoy a bigger paycheck.

momojo 9 days ago |

Sorry, I'm reposting a comment I made yesterday that seems fitting:

> This reminds me of Antirez's "Don't fall into the anti-AI hype". In a sentence: These foundation models are really good at optimizing these extremely high level, extremely well defined problem spaces (ie multiply matrices faster). In Antirez's case, it's "make Redis faster".

kang 8 days ago |

> The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.

5.5pro is amazing but this implication might not be true & is the core argument of this piece.

AI will prove all sorts of things - interesting, boring & incorrect.

To sort it will be the task of the PhD.

layer8 8 days ago | |

The task of a proof verifier is much simpler than the task of a proof finder (it’s basically equivalent to P vs. NP), and hence the bar for the required skills is lower. Merely verifying proofs isn’t research, and doesn’t impart research skills.
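A minimal sketch of the verify-versus-find gap, using subset sum as a stand-in (an assumption for illustration; it is not a model of checking a real mathematical proof). Checking a proposed certificate takes time linear in its size, while the naive search below may try up to 2**n subsets. Whether search is inherently harder than checking is exactly the open P vs. NP question, so the brute-force `find` only shows the gap for the obvious algorithm.

```python
from itertools import combinations

def verify(numbers, target, certificate):
    """Check a proposed solution: distinct, in-range indices summing to target."""
    return (len(set(certificate)) == len(certificate)
            and all(0 <= i < len(numbers) for i in certificate)
            and sum(numbers[i] for i in certificate) == target)

def find(numbers, target):
    """Brute-force search over all index subsets, smallest subsets first."""
    for size in range(len(numbers) + 1):
        for idx in combinations(range(len(numbers)), size):
            if verify(numbers, target, idx):
                return list(idx)
    return None

numbers = [3, 34, 4, 12, 5, 2]
solution = find(numbers, 9)          # expensive: searches many candidates
is_valid = verify(numbers, 9, solution)  # cheap: one pass over the certificate
print(solution, is_valid)
```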

kang 8 days ago | | |

Verification on its own is not research, but judgement is research.

"Hey, prove something a machine can't": sure, I can't. "Hey, say something worth proving & judge it well": ah, now I might have a few unique observations/ideas/curiosities/problems from my being human.

Imo, the feeling of intelligence or the process of originality(originativity) test for ai is subjective & is coming down to 4 paths: novel relative to a reference class, valuable within a domain, counterfactually sensitive to internal state and environment, and revisable through learning.

ianm218 8 days ago | |

Verification is generally a much lower bar than solution generation. I don’t think it’s likely sorting out the right from wrong will end up being this huge PhD level effort.

kang 8 days ago | | |

Verification & solution generation are both part of problem generation & defining the passing test - judgement.

MrDrDr 8 days ago |

> "Even though I can motivate it in retrospect, ChatGPT’s idea to use h^2-dissociated sets to control relations of order at most h feels quite ingenious. As far as I can tell, this idea is completely original."

The question that keeps bothering me is: can an LLM generate an idea that is truly novel? How would/could that actually happen? But then that leads to the question - what are we actually doing when we think?

Perhaps it's as simple as the ability to just make mistakes that matters, the same thing that powers evolution. As long as the LLM can make mistakes, it's capable of generating something genuinely novel. And it can make more mistakes much faster than we can.

iTokio 9 days ago |

On complex problems with lengthy proofs, the first step that I would have done is to ask 5.5 pro in a new, unrelated, session, to be very critical, to try to find flaws in the arguments.

And certainly not to send it to a fellow colleague to ask their opinion first.

LLMs are certainly becoming capable of coding, finding vulnerabilities, and solving mathematical problems, but we need to avoid putting their work in production, or in front of other humans, without assessing it by any possible means.

Otherwise tech leads, maintainers, experts get overwhelmed and this is how the « AI slop » fatigue begins.

To be clear I’m talking about this step:

> That preprint would have been hard for me to read, as that would have meant carefully reading Rajagopal’s paper first, but I sent it to Nathanson, who forwarded it to Rajagopal, who said he thought it looked correct.

NitpickLawyer 9 days ago | |

> but we need to avoid putting their work in production, or in front of other humans, without assessing it by any possible means.

I think this is good advice in general, maybe with an emphasis on public vs. private, friendly contact. Having 0 thought AI slop thrown at you out of the blue is rude. "could have been a prompt" indeed. But having a friend/colleague ask for a quick glance at something they know you handle well is another story for me.

If I've worked on a subject for a few years, and know the particulars in and out, I'd have no trouble skimming something that a friend or a colleague sent me. I am sparing those 5-10 minutes for the friend, not for what they sent. And for an expert in a particular domain, often 5 minutes is all it takes for a "lgtm" or "lol no".

dabinat 9 days ago |

I feel like this experiment was successful because those prompting the AI were knowledgeable enough to ask the right questions and verify the output was correct. This shows that there is still a place for expertise, even if the LLM does the actual research.

colechristensen 9 days ago | |

I feel my input to LLMs is most valuable in the initial idea, big picture design tweaks, and the vast majority of my usefulness is negative feedback. This looks wrong, you've gotten off track, you're cheating with workarounds, you're falling into a rabbithole, etc.

readgrounded 8 days ago |

Quantitative finance went through a smaller version of this in the 2010s. The apprenticeship was building a Black-Scholes pricer from scratch, then a vol surface, then a calibration loop. Sweat problems that taught you what the math meant. Then libraries got good, platforms got good, and a junior could be productive without ever reeeeeally knowing how it worked. On some level yes the finer detailed knowledge is going to be lost because it gets locked in but in some ways we do get a "higher level api" to presumably solve more difficult problems.
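For readers who never did that apprenticeship, a minimal sketch of the first rung: the textbook Black-Scholes price of a European call on a non-dividend-paying asset, using only the standard library. The function names and the sample inputs are illustrative assumptions, not anything from a particular desk or library.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(spot, strike, rate, vol, maturity):
    """European call price: S*N(d1) - K*exp(-r*T)*N(d2)."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt(maturity))
    d2 = d1 - vol * sqrt(maturity)
    return spot * norm_cdf(d1) - strike * exp(-rate * maturity) * norm_cdf(d2)

# Illustrative inputs: at-the-money, 5% rate, 20% volatility, one year.
price = bs_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
print(round(price, 4))  # 10.4506
```

The vol surface and calibration loop mentioned above build on this by inverting the same formula for `vol` across strikes and maturities.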

zkmon 8 days ago |

>> but it was definitely a non-trivial extension of those ideas, and for a PhD student to find that extension it would be necessary to invest quite a bit of time digesting Isaac’s paper

The "non-trivial" is relative to human abilities. The weights lifted by a crane are also "non-trivial". People keep getting amazed at machines' abilities. Just as a radio telescope can see things humans can't and a microscope can see detail humans can't, we need not be amazed. The sensory perception of patterns is at a different level for AI. It's a machine.

svnt 8 days ago | |

Too many people are wrapped around the ego axle thinking (assuming) their ideas are both them and somehow unique and special.

It usually takes dissolving that, often through difficult experiences, before they can see it as a machine, something that could be separated from them.

dag100 8 days ago | | |

I think the more pressing issue is that there isn't really much space left for humans in the economy if thinking can also be automated.

locknitpicker 8 days ago |

From the article:

> Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques.

This is exactly what leads me to believe that the real impact of LLMs in human history is yet to come. My work as a researcher was mostly spent on two classes of workloads: reading recently published papers to gather ideas and keep up with the state of the art, and working on a selection of ideas gathered from said papers to build my research upon. It turns out that LLMs excel at the most critical component of both workloads: parsing existing content and using it, when prompted, to generate additional content based on specific goals and constraints. I mean, papers are already a way to store and distribute context.

YeGoblynQueenne 7 days ago |

All this sounds to me like mathematicians spooking themselves with stories of how ChatGPT solved a problem, when it's mathematicians solving a problem using ChatGPT as a tool. E.g. from the twitter thread by Timothy Gowers:

>> All I did was say things like, "Yes, it would be great if you could explore that idea and see whether you can get it to work," or "Could you rewrite that argument as a LaTeX file in the style of a standard mathematical preprint?"

Yeah, so all he did was take the horse to the water and make the horse drink. The collaboration with the other two mathematicians wasn't a trivial part of the problem solving either: every time Timothy Gowers figured ChatGPT had got somewhere with its problem-solving, he stopped, asked it to render the answer in LaTeX, and sent the answer off to be verified by the other two.

The reason for that is not to be underestimated: ChatGPT can produce answers to questions you ask it for as long as you ask it to do so but it has no capability to determine whether an answer is correct or not. That's why it needs a human with domain expertise to evaluate those answers. And of course to discard wrong answers in the process, because of course the process that's described here glosses over many false starts and back-and-forths and "you're absolutely rights, here's a new version of that"'s etc. that are common experience when using LLMs for problem-solving tasks.

The existential questions that the article poses about mathematics then are easily answered by taking all of the above into account. If LLMs are a useful tool for mathematicians, then nothing changes. Mathematicians of all levels can still do their job and perhaps do it faster or better with the new tool.

If you can sic ChatGPT on a mathematics problem and it can solve it without your input, that's a different matter but that's not what's happening.

adammdaw 9 days ago |

This is certainly interesting, though I would say that, based on my understanding of how the current models work, combinatorial problems would be an area where they could be particularly successful. They are pretty good at combinatorial creativity - it's the exploratory and transformational aspects that are still pretty tricky, and I expect those would come to bear in other areas of mathematics.

nxobject 8 days ago | |

I wonder as well whether large-but-finite contexts can handle algebraic questions that require traversing up and down levels of abstraction, at least not without "thrashing".

hodgehog11 9 days ago | |

Indeed, analysis is a bit more loose in its arguments, and so I've found LLMs tend to make more mistakes there.

adaml_623 9 days ago |

"It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour"

This comment about time is very interesting to me. I know it's "just" doing mathematical proofs but the possibilities of speeding up planning, proposals and decision making in the physical world should excite people.

dpweb 7 days ago |

Don’t quite agree with the implication that if the answer to a problem is readily available (from an LLM) there is no use in the struggle to find the solution.

However I think it’s very important to approach such questions objectively, or at least self uninterested, and not as one who’s worried about one’s job or sense of self worth threatened by LLM technology.

The value is in the development of one’s own mental faculties. In math classes they tell you you have to work the problems. Even if LLMs become capable of solving entire classes of problems, and that set expands over time, the value in developing one’s ability never goes out of style.

lysecret 8 days ago |

There is a great recent episode of Latent Space about a similar topic. It’s worth a watch, even with the clickbaity thumbnail and title: https://youtu.be/9d899Ram9Bs?is=pQMoVmlWVsTNKfRK

fulafel 9 days ago |

Link to source blog post: https://gowers.wordpress.com/2026/05/08/a-recent-experience-...

dang 9 days ago | |

That's the top link (i.e. that the title is linked to), no?

fulafel 8 days ago | | |

Indeed, the body in the post made me think it was a url-less submission.

goopthink 8 days ago |

An interesting takeaway is that heretofore most of the advances have come not from “invention” but from a breadth of visibility. LLMs have been able to be “creative” because of the volume of work that they cover and can draw lines and associations between, not in discovering things that did not exist previously (though an argument can be made that something like AlphaFold was “discovering” and “intuiting” associations that were not explicit anywhere previously, uniquely found by the AI… but I’d argue back something about the bitter lesson and we’d go on for more than a few threads).

Somewhat ironic then, to not make this more explicit in an article about solving a combinatorial problem.

eranation 8 days ago |

Like coding, if you get inspired by AI for a novel idea, and can reproduce the same result independently (could code the same thing by hand) or at least understand and check every single argument (self review your code, test on your machine) and get it peer reviewed (code review, but with a real human) then I don’t see why the industry accepts the latest iteration of ChatGPT being 99% written by codex, but rejects a valid math result inspired by it.

arjie 8 days ago |

The question of where the creative input is was a big thing around Experiments in Musical Intelligence and co-composing. But it seems perhaps that it’s a transient state we needn’t spend too much effort on. The machine has repeatedly failed to disappoint. Perhaps this is as far as it gets or perhaps we will be like people in Catching Crumbs from the Table by Ted Chiang where almost all science is interpretation of papers by vastly greater intellects.

robeym 7 days ago |

I think progressing as humans is something to be proud of. I care less about who gets credit and more about what we can now do.

I also do not think this makes people less capable of solving hard problems. The bar just moves up. More people can now work on harder problems with better tools.

If the goal is credit or proving real skill, then focus on harder problems, like ones AI can't reach.

iandanforth 8 days ago |

I found the section on publishing very interesting. Even if the quality of the output is up to snuff, where should it go? arXiv doesn't allow AI-written work. The author proposes that only work that has been certified by a human should be published. However, now the field is in the same boat as software engineering, where we are facing a glut of pull requests and not enough time and people to review them.

robot-wrangler 8 days ago |

Despite this coming from an independent expert and not from OpenAI, we need to be honest that this is more like a marketing campaign than open science. I assume the progress really is valid, the experts are indeed impressed, and we have the accurate time that it takes to produce the results. What we don't have is detail about true cost, or a CoT trace, or anything like that.

The implication is: we're ready to let everyone go wild with this very soon. Ok, go wild with what exactly? How do we know that influential VIP users who might make very friendly blog posts aren't getting allocated exclusive access to a billion dollars worth of hardware when they ask questions? I mean literally giving a certain group of people temporary privileged access to like 90% of all available compute would be a completely reasonable business decision for OpenAI.

Would a reveal like that change how we think about the result? What if half that amount of cash/compute could enable some completely non-AI approach of numerical brute forcing that settles the question even if it didn't write the paper?

My other question is always whether the latest is purely using giant models or if we're now deeply into harnesses that use MCTS and such. Understandable to keep that a trade secret I guess. But IMHO we should at least get the CoT trace as a proxy for true cost, or else maybe we're just getting played to do the hype for corporate.

electriclove 8 days ago | |

Marketing campaign???

energy123 8 days ago | |

We know this because most of the Erdos proofs made by AI have been done by amateurs prompting GPT 5.* Pro, not by familiar names that OpenAI is sneaking additional compute to behind the scenes (which is too conspiratorial of an explanation for my liking regardless).

robot-wrangler 8 days ago | | |

> We know this because most of the Erdos proofs made by AI have been done by amateurs prompting GPT 5.* Pro

Where's that? The stuff I've seen is from celebrities. Were those problems as hard as this one, or the ones that Tao posts about? Regardless.. what's the argument against more transparency here to just settle this kind of thing?

> which is too conspiratorial of an explanation for my liking regardless

OpenAI is not, in fact, open. Why do they deserve the benefit of the doubt?

Regardless.. special treatment for special customers isn't conspiracy, it's SOP literally everywhere and especially if you're helping to beta test. Anyone who's ever interacted with any technical account manager has seen waived quotas, free resource allocations, etc. The quid-pro-quo is obviously that your cheap early access means you get to give talks at a conference (or make a blog post that a lot of people read and talk about).

highfrequency 8 days ago |

> LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it.

amelius 8 days ago |

Makes sense, as a mathematician basically has two powers: (1) their intuition and (2) an enormous amount of mental stamina. A mathematician builds their intuition by reading maths books. It is thus not surprising that an LLM is well equipped to take over the tasks of the mathematician.

casey2 8 days ago |

I think mathematicians like LLMs because this is the first time we have something like a computer for the kinds of math most people do, high level, hand wavy abstractions that are (relatively) easy for people to grok but hard to explain to traditional computers.

__rito__ 9 days ago |

> So maybe there should be a different repository where AI-produced results can live.

Does the author know about CAISc 2026 [0]?

[0]: https://caisc2026.github.io

einrealist 9 days ago |

"After 16 minutes and 41 seconds, it came back" ... "further 47 minutes and 39 seconds" ... "After 13 minutes and 33 seconds" ... "After 9 minutes and 12 seconds" ... "After 31 minutes and 40 seconds" ... plus other computations

Anyone spotting the issue here? What did that really cost?
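
For reference, the five quoted durations can simply be added up. This is only a rough sketch of the wall-clock total for the listed steps; it says nothing about the "other computations", the hardware involved, or the dollar cost, none of which are given in the post.

```python
# Wall-clock durations quoted from the post, as (minutes, seconds).
durations = [(16, 41), (47, 39), (13, 33), (9, 12), (31, 40)]


def total_seconds(pairs):
    """Sum (minutes, seconds) pairs into a single number of seconds."""
    return sum(m * 60 + s for m, s in pairs)


def format_hms(seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


total = total_seconds(durations)
print(format_hms(total))  # 1:58:45
```

So the quoted steps alone come to just under two hours of model time, before counting anything else.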

I am not against compute being used for scientific or other important problems. We did that before LLMs. However, the major LLM gatekeepers want to make all industries and companies dependent on their models. And, at some point, they need to charge them the actual, unsubsidized costs for the compute. In the meantime, companies restructure in the hopes that the compute costs remain cheap.

sidkshatriya 9 days ago | |

> "After 16 minutes and 41 seconds, it came back" ... "further 47 minutes and 39 seconds" ... "After 13 minutes and 33 seconds" ... "After 9 minutes and 12 seconds" ... "After 31 minutes and 40 seconds" ... plus other computations
>
> Anyone spotting the issue here? What did that really cost?

Whatever the joules... (convert to $ using your preferred benchmark price), it is a fraction of what it would take to feed and sustain a human Ph.D. for the weeks they might spend working on the same problem. The economics of LLMs are just unbeatable (sadly) when compared to us humans.

MagicMoonlight 8 days ago |

ChatGPT pro is garbage. It’ll spend 20 minutes on an answer, doing all kinds of ridiculous things like writing scripts… instead of just outputting plaintext.

And then the answer isn’t even right.

chalr 8 days ago |

There have always been attempts at settling all mathematics by using mechanized approaches. Often by mathematicians who already had made an impact and then wanted an automated approach.

The Bourbaki group was one of the first who attempted a mechanized approach (using pen and paper still of course) to set theory and were literally accused of wanting to end all mathematics. The approach was largely ignored in practice.

Gowers and a handful of others who work on computerized approaches also seem to want to end human mathematics and have sharecropper mathematics for a monthly tithe. So far they are largely ignored in practice.

zingar 8 days ago |

The post talks about LLM+human contributions being recognized in some different category from human-only. But is it possible to spot the difference between the two?

globular-toast 9 days ago |

I wish people would stop generating stuff they don't understand only to forward it to someone who does. Something about that really rubs me the wrong way.

hodgehog11 9 days ago | |

May I remind you that this is Timothy Gowers. He says he doesn't understand, but he most certainly has far greater capacity than most to detect complete junk from a maybe plausible argument. His colleague is even better able to judge this, hence why he sent it to him.

Also if he did send me complete junk, I would still parse it for multiple days to see what is there.

globular-toast 7 days ago | | |

Yeah, it doesn't make a difference for me. It's the generation part. Gowers should have sent his prompts to the colleagues, not the generated paper. That's all. I feel like it's creating obligations for others to help with the remaining 20% which always takes the most time, while you get to have all the fun of doing the first 80%.

I'm not criticising Gowers directly in this instance because he's exploring the possibilities, my disdain is towards the more general pattern I see emerging where people just send each other LLM outputs.

auggierose 8 days ago | |

Lol. If Gowers sends you a piece of math he doesn't quite understand because he thinks that you might, that is something you celebrate.

frozenseven 8 days ago | |

You are criticizing a Fields Medalist for consulting with another mathematician.

incrediblylarge 9 days ago |

A month ago my PhD supervisor told me it rips on proofs but he also said it's useless when formalising arguments in Lean - is this still the case?

vjerancrnjak 9 days ago | |

Nope. Codex formalizes much better than any tool, with the exception of Aristotle from Harmonic.

https://github.com/vjeranc/fixed-rtrt

The M3 module was formalized fully, purely from experimental data and from a nudge by earlier versions of Codex, in 15-30 minutes in a simple write/compile/fix-first-error loop. I was a bit surprised how fast it picked up the pattern, but given there was a paper from the '70s, it became clear why later.
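
A write/compile/fix-first-error loop of the kind described can be sketched roughly as follows. This is a minimal illustration, not the actual setup used for the repository: `fix_first_error_loop`, `compile_fn`, and `fix_fn` are hypothetical names, and the toy compiler and fixer below merely stand in for a real Lean build and a real model call.

```python
from typing import Callable, List, Tuple


def fix_first_error_loop(
    source: str,
    compile_fn: Callable[[str], List[str]],
    fix_fn: Callable[[str, str], str],
    max_rounds: int = 20,
) -> Tuple[str, bool]:
    """Compile, hand only the FIRST error to the fixer, and repeat.

    Returns the final source and whether it compiled without errors.
    """
    for _ in range(max_rounds):
        errors = compile_fn(source)
        if not errors:
            return source, True
        # Later errors are often knock-on effects of the first one,
        # so only the first is passed on in each round.
        source = fix_fn(source, errors[0])
    return source, not compile_fn(source)


# Toy stand-ins (hypothetical): every line containing "TODO" is an error.
def toy_compile(src: str) -> List[str]:
    return [
        f"line {i}: unresolved TODO"
        for i, line in enumerate(src.splitlines(), 1)
        if "TODO" in line
    ]


def toy_fix(src: str, error: str) -> str:
    line_no = int(error.split(":")[0].split()[1])
    lines = src.splitlines()
    lines[line_no - 1] = lines[line_no - 1].replace("TODO", "done")
    return "\n".join(lines)


final, ok = fix_first_error_loop("a TODO\nb\nc TODO", toy_compile, toy_fix)
print(ok, repr(final))
```

In a real setting `compile_fn` would presumably shell out to the build tool and parse its diagnostics, and `fix_fn` would call the model with the source and the single error message.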

ortusdux 8 days ago |

An important lesson in web/blog design - I cannot for the life of me figure out who this author is (using only the website).

tmp10423288442 8 days ago |

It's interesting that ChatGPT Pro is the real deal that can write novel physics or math papers (for certain values of novel), while Claude Pro is crap that, depending on the A/B test, may not even provide Claude Code or at the very least doesn't provide Opus. Shows how LLM naming conventions are currently a mess.

theptip 8 days ago |

> what should we do with this kind of content? Had the result been produced by a human mathematician, it would definitely have been publishable, so I think it would be wrong to describe it as AI slop. On the other hand, it seems pointless even to think about putting it in a journal, since it can be made freely available, and nobody needs “credit” for it (except that Isaac deserves plenty of credit for creating the framework on which ChatGPT could build). I understand that arXiv has a policy against accepting AI-written content, which makes good sense to me. So maybe there should be a different repository where AI-produced results can live. But various decisions would need to be made about how it was organized.

Interesting question, I guess a starting point is “moltbook”, but perhaps a better one is something like GitHub, where Lean proofs and preprints can go, and trending items can get boosted.

I also think that posting this stuff on X or Bluesky has merit, but again the existing paradigm doesn’t quite work; perhaps you can create a completely separate identity for your agent (à la Moltbook), but I think you want some sort of reputational association with the human piloting the agent, at least for now. (Maybe eventually there are enough agents critically engaging with content that “interesting” results get agent likes, and so well-piloted agents stand on their own merit.)

rklampp 8 days ago |

Gowers has always been a proponent of Lean (naturally). He receives funding from the "AI for Math" fund, which is sponsored by a fund that is a front organization for venture capitalists:

https://www.renaissancephilanthropy.org/

The "brighter future" of course is that everyone is redundant and all capital is further concentrated.

It is always Gowers, Tao and Lichtman (math.inc startup) who are pushing these technologies.

logicprog 8 days ago | |

> It is always Gowers, Tao and Lichtman (math.inc startup) who are pushing these technologies.

In your mind does this mean that they are lying, or driven by motivated reasoning and cognitive bias, or whatever you'd like to say?

Because I feel like people bring up these facts as a way to discount everything that these people are saying. But whether or not they've chosen to align themselves with AI-aligned venture capital funding, the question is really: did what they say is happening happen or not? Are these capabilities real or not?

To my mind, mathematics is pretty definitely, externally, objectively verifiable, so it would be easy to catch them in a lie. In the case of the Erdős problem that was recently solved in a novel and productive way, it wasn't even initiated by them, and the ChatGPT transcript is public for all to see. And the proof could easily be verified by other people, for instance.

In addition, I think it's unlikely that they're not explaining things as they honestly see them and also doing their due diligence to make sure that they are seeing them as close to correctly as possible, because their positions with these organizations, not to mention their entire reputation, life's work, and passion, depend on their standing in academic mathematics. If they were to give that up by falsifying these claims or not verifying them sufficiently, they would lose everything.

I think it's also worth pointing out that it is totally possible for someone to align themselves with such organizations after the fact because they agree with them instead of being bought out by such organizations. Otherwise, it would be possible to dismiss the opinion of anyone working at any NGO dedicated to being against AI and denying AI's capabilities or whatever, as well by the same logic of their salary being paid by an organization dedicated to pushing those ideas.

ionwake 8 days ago |

One thing I was wondering is: if LLMs are word completers that seemingly come up with new solutions, could this just be because stuff that was kept secret is no longer secret due to ingestion? I don't know enough about it, though.

dist-epoch 8 days ago | |

Why would you keep this particular mathematical idea secret? It's not extraordinarily important, it's not on the path to some other major result, and it doesn't seem useful in financial trading. Even the author calls it a good, reasonable problem for a PhD thesis.

CharlesLau 9 days ago |

Is the assessment system of undergraduate mathematics education no longer effective?

alpama 8 days ago |

This is scary. AI is growing faster than our knowledge. We are not prepared.

jacktu 8 days ago |

That's a real shift. The value of an open problem used to be that it was unsolved. Now an open problem needs to be unsolvable by something that can read the entire literature and try a hundred approaches in an hour.

dares2573 8 days ago |

I think the biggest advantage of ChatGPT compared to Claude is that there are fewer things outside the model itself, such as KYC, account bans, etc.

solenoid0937 8 days ago | |

This is just grossly misinformed.

OAI and Anthropic both require KYC for models of similar intelligence. They both do account bans if the classifiers fire wrong. You simply hear about it less with OAI because Codex has fewer prosumers.

tmp10423288442 8 days ago | | |

Can you name any instance of OpenAI being as trigger-happy with bans as Anthropic has been in the past few months? Codex may have fewer prosumers, but they've added a lot in that time.

zuogl 9 days ago |

The HTML generation is surprisingly good because the training corpus for markup is cleaner than most programming languages.

OsamaJaber 8 days ago |

the bottleneck isn't generation, it's verification

SubiculumCode 9 days ago |

I honestly can't say this isn't AGI anymore. AGI shouldn't be a bar so taboo that it has to be at the extreme capability in every domain. What human is?

This is as AGI as it needs to be to get my vote. And it's scary.

MrScruff 8 days ago | |

It's ASI with jagged intelligence, which is probably what it will remain for a while.

It still sounds to me like remarkable automation rather than something that's expanding the frontier of human knowledge, for now at least.

agiipullor 8 days ago | |

to quote Demis Hassabis, "these models can solve frontier problems in math, but also fail in really dumb ways at trivial questions - the car wash question".

jagged AGI

richard_chase 7 days ago |

Dude needs to step down off his pedestal before he gets knocked down.

sexylinux 8 days ago |

Unfortunately it still does create errors.

This is of enormous importance but is still being actively ignored by many professionals or dismissed as a minor issue.

Our emotional human brains are very enthusiastic about this new kind of "intelligent" product ("partner"), and we want so hard to believe that they are finally "there" that we tend to ignore how big a problem it is that LLMs carry a fundamental design flaw: they will produce errors even when we use a grotesque amount of resources to build "bigger" versions of them. The potential for errors will never go away with the current AI architecture.

This is a fundamental paradigm shift in computing. Instead of putting a lot of energy into building an architecture that will produce reliable results, we are now maximizing on a system / idea that will never give us 100% reliable results.

Basically it is just a marketing stunt. Probably the computer science guy building it knew very well that he would still need some fundamental breakthroughs to get to a real product, but the marketing guy saw that there was still potential to make a lot of money by selling a product that produces correct results only 80% of the time.

The marketing guy was right and marketing is now dominating science, but humanity will pay a big price for that.

Putting enormous amounts of money into a fundamentally flawed system that we cannot optimize to produce reliably error-free results is just stupid.

The big achievement of "classical" computing is that the results are reliably error-free. We still have some known issues, e.g. with floating-point math, bad blocks on disk, bit flips, etc., but these are observable and we can handle or avoid them. In general, "non-AI computing" was made so reliable that we can depend on it for many very important things. This did not come about by accident; it was created by a lot of people who put a lot of resources into research to achieve that result.

LLMs introduce a level of uncertainty and unreliability into computing that makes them practically useless.

Because if you have enough knowledge to verify the result and AI is only quicker at producing it, what is the point of putting so many resources into it (besides making money by re-centralizing computing, of course)? Verifying a lot of results that have been produced more quickly is still slow, so the people who are now just AI verifiers should simply produce the results themselves, which makes the whole process quicker.

AI is only of value if it can produce results about things that you or your organization do not know anything about. But these results you cannot verify, and therefore potentially wrong results can be fatal for you, your organization, and all the people affected by actions based on these wrong results.

Many people have already been killed because decision makers are not able to follow that very simple logic.

So we can still create "interesting and enjoyable results", but in the end it is a gigantic misallocation of resources of historic idiocy. It fits, of course, very well in a timeline where grifters are on top of societies around the world.

It is a fundamentally wrong path that should not be followed and scientists around the world should articulate exactly that instead of producing marketing blog posts for a system with such fatal inherent issues.

quinndupont 8 days ago |

[flagged]

vessenes 8 days ago | |

I don’t love the tone here, but I do think you get at a key question in mathematical philosophy.

Mathematicians have engaged, vigorously, on this very philosophical question for centuries - is math discovered truth, or is it more akin to building an edifice where you first define the materials, then the structure, and see where it leads?

There are lots of strong feelings on both sides. For instance: “God created the integers, the rest is the creation of man” — Kronecker, 19th century sums up one particular perspective.

To me, it’s probably a mix of both - some fantastic results in imaginary numbers show up as describing key electromagnetic effects many decades after they were first ‘discovered’ by theoretical mathematicians.

NB: My original comment led with a pejorative, which was rightly flagged.

dang 8 days ago | | |

> Who hurt you bro?

Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.

https://news.ycombinator.com/newsguidelines.html

dang 8 days ago | |

Can you please make your substantive points without fulminating, as the site guidelines request (https://news.ycombinator.com/newsguidelines.html)?

We're trying for curious conversation here, and you've clearly got something interesting to say, but when you put it this aggressively, curiosity gets fried (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...)

bambax 9 days ago |

> quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques

Creativity is connecting ideas from different domains and seeing if something from one field applies to another. I do think AI is overhyped generally; but a major benefit from AI could be that after ingesting all the existing human knowledge (something no single human can ever hope to achieve) it would "mix and connect" it and come up with novel insights.

Most published research sits ignored and unread; AI can uncover and use everything.

imiric 8 days ago | |

> Creativity is connecting ideas from different domains and see if something from one field applies to another.

That's true. The question is whether the produced pattern has any value. LLMs are incapable of determining this, and will still often hallucinate, and make random baseless claims that can convince anyone except human domain experts. And that's still a difficult challenge: a domain expert is still needed to verify the output, which in some fields is very labor intensive, especially if the subject is at the edge of human knowledge.

The second related issue is the lack of reproducibility. The same LLM given the same prompt and context can produce different results. This probability increases with more input and output tokens, and with more obscure subjects.
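
The scaling with output length can be illustrated with a back-of-the-envelope model. This is only a sketch under a strong simplifying assumption: that each sampled token independently matches the corresponding token of another run with some fixed probability. The function name and the 0.99 figure are made up for illustration and are not measurements of any real model.

```python
def prob_identical(per_token_agreement: float, num_tokens: int) -> float:
    """Chance that two sampled runs agree on every one of num_tokens tokens.

    Assumes (unrealistically) independent tokens, each matching with the
    same probability per_token_agreement.
    """
    return per_token_agreement ** num_tokens


# Even a 99% per-token agreement rate decays quickly with length.
for n in (10, 100, 1000):
    print(n, prob_identical(0.99, n))
```

Under this toy model, exact agreement between two runs becomes unlikely after a few hundred tokens, which is consistent with longer outputs being harder to reproduce verbatim.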

The tools are certainly improving, but these two issues are still major hurdles that don't get nearly as much attention as "agents", "skills", and whatever adjacent trend influencers are pushing today.

And can we please stop calling pattern matching and generation "intelligence"? This farce has gone on long enough.

agiipullor 8 days ago | | |

> And can we please stop calling pattern matching and generation "intelligence"

that's literally what an IQ test tests - abstract pattern matching. but I guess you don't like IQ tests either

slopinthebag 9 days ago |

AI generated article btw.

Maybe if you find AI to be doing stuff you find impressive, the stuff you were doing wasn't that impressive? Worth ruminating on your priors at least.

hodgehog11 9 days ago | |

This is beyond ridiculous to say considering whose blog this is.

For those that don't know, this is Timothy Gowers. He is one of the most accomplished mathematicians in the world. Like Terence Tao, he is considered one of the world leaders in mathematics and tends to have good judgement in where the field is going.

Even without that knowledge, no, this article is certainly not AI generated. It has none of the tells.

reasonableklout 9 days ago | |

What makes you think either the tweet or blog post are AI generated?

> The mathematician and the blog author are not the same person (as you seem to understand). Nathanson (the mathematician) is the one who is the expert verifier. He is the person who has the higher value and won't be fired in some hypothetical.