A recent experience with ChatGPT 5.5 Pro (gowers.wordpress.com)
This is as AGI as it needs to be to get my vote. And it's scary.
It still sounds to me like remarkable automation rather than something that's expanding the frontier of human knowledge, for now at least.
jagged AGI
This is of enormous importance but is still being actively ignored by many professionals or dismissed as a minor issue.
Our emotional human brains are very enthusiastic about this new kind of "intelligent" product ("partner"), and we want so hard to believe that they are finally "there" that we tend to ignore how big a problem it is that LLMs carry a fundamental design flaw that will make them produce errors even when we use a grotesque amount of resources to build "bigger" versions of them. The potential for errors will never go away with the current AI architecture.
This is a fundamental paradigm shift in computing. Instead of putting a lot of energy into building an architecture that will produce reliable results, we are now maximizing on a system / idea that will never give us 100% reliable results.
Basically it is just a marketing stunt. Probably the computer science guy building it knew very well that he would still need some fundamental breakthroughs to get to a real product, but the marketing guy saw that there was still potential to make a lot of money by selling a product that produces correct results only 80% of the time.
The marketing guy was right and marketing is now dominating science, but humanity will pay a big price for that.
Putting enormous amounts of money into a fundamentally flawed system that we cannot optimize to produce reliably error-free results is just stupid.
The big achievement of "classical" computing is that the results are reliably error free. We still have some known issues, e.g. with floating point math, bad blocks on disk, bit flips etc., but these are observable and we can handle or avoid them. Generally, "non-AI computing" was made so reliable that we can depend on it for many very important things. This did not come by accident; it was created by a lot of people who put a lot of resources into research to achieve that result.
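To make the floating point remark concrete, here is a minimal Python sketch of what "observable and we can handle / avoid them" can look like. It is only an illustration under my own assumptions (the variable names are made up for the example); it shows the classic 0.1 + 0.2 rounding issue, a tolerance-based comparison to handle it, exact rational arithmetic to avoid it, and the fact that the error is the same on every run.

```python
import math
from fractions import Fraction

# Observable: binary floating point cannot represent 0.1 exactly,
# so the sum differs slightly from 0.3.
naive_equal = (0.1 + 0.2 == 0.3)

# Handle it: compare with a tolerance instead of exact equality.
tolerant_equal = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# Avoid it: use exact rational arithmetic where exactness matters.
exact_equal = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

# Deterministic: the same inputs produce the same result every time.
repeatable = all((0.1 + 0.2) == (0.1 + 0.2) for _ in range(1000))

print(naive_equal, tolerant_equal, exact_equal, repeatable)
```

The point of the sketch is that the rounding error is well specified and repeatable, which is what makes it possible to work around.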
LLMs introduce a level of uncertainty and unreliability into computing that makes them practically useless.
If you have enough knowledge to verify the result, and AI is only quicker at producing it, what is the point of putting so many resources into it (besides making money by re-centralizing computing, of course)? Verifying a lot of quickly produced results is still slow, so the people who are now just AI verifiers should simply produce the results themselves, which makes the whole process quicker.
AI is only of value if it can produce results about things that you or your organization know nothing about. But you cannot verify those results, so wrong results can be fatal for you, your organization, and everyone affected by actions based on them.
Many people have already been killed because decision makers are not able to follow that very simple logic.
So we can still create "interesting and enjoyable results", but in the end it is a gigantic misallocation of resources of historic idiocy. It fits, of course, very well in a timeline where grifters are on top of societies around the world.
It is a fundamentally wrong path that should not be followed and scientists around the world should articulate exactly that instead of producing marketing blog posts for a system with such fatal inherent issues.
Mathematicians have engaged, vigorously, on this very philosophical question for centuries - is math discovered truth, or is it more akin to building an edifice where you first define the materials, then the structure, and see where it leads?
There are lots of strong feelings on both sides. For instance: “God created the integers, the rest is the creation of man” — Kronecker, 19th century sums up one particular perspective.
To me, it’s probably a mix of both - some fantastic results in imaginary numbers show up as describing key electromagnetic effects many decades after they were first ‘discovered’ by theoretical mathematicians.
NB: My original comment led with a pejorative, which was rightly flagged.
Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.
We're trying for curious conversation here, and you've clearly got something interesting to say, but when you put it this aggressively, curiosity gets fried (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...)
Creativity is connecting ideas from different domains and seeing if something from one field applies to another. I do think AI is overhyped generally, but a major benefit of AI could be that after ingesting all existing human knowledge (something no single human can ever hope to achieve) it would "mix and connect" it and come up with novel insights.
Most published research sits ignored and unread; AI can uncover and use everything.
That's true. The question is whether the produced pattern has any value. LLMs are incapable of determining this, and will still often hallucinate, and make random baseless claims that can convince anyone except human domain experts. And that's still a difficult challenge: a domain expert is still needed to verify the output, which in some fields is very labor intensive, especially if the subject is at the edge of human knowledge.
The second related issue is the lack of reproducibility. The same LLM given the same prompt and context can produce different results. This probability increases with more input and output tokens, and with more obscure subjects.
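To make the sampling side of this concrete, here is a toy Python sketch. It is not how any real LLM is served: the four-token vocabulary, the scores in `TOY_LOGITS`, and the function names are all hypothetical, and every token is drawn independently from the same distribution, which real models do not do. Under those assumptions it shows two things: a fixed seed makes a run repeatable, and the chance that two independent runs produce identical output shrinks as the output gets longer.

```python
import math
import random

def softmax(logits, temperature):
    """Turn raw scores into sampling probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_tokens(logits, temperature, n_tokens, rng):
    """Draw n_tokens token ids independently from the same toy distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n_tokens)

def agreement_probability(logits, temperature, n_tokens):
    """Chance that two independent runs emit identical n-token outputs."""
    probs = softmax(logits, temperature)
    per_token = sum(p * p for p in probs)  # both runs pick the same token
    return per_token ** n_tokens

TOY_LOGITS = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores, 4-token vocabulary

# Same seed, same output: sampling is reproducible if the seed is controlled.
run_a = sample_tokens(TOY_LOGITS, 1.0, 20, random.Random(0))
run_b = sample_tokens(TOY_LOGITS, 1.0, 20, random.Random(0))
print(run_a == run_b)

# Longer outputs are less likely to match across independent runs.
for n in (1, 10, 100):
    print(n, agreement_probability(TOY_LOGITS, 1.0, n))
```

In this toy model the agreement probability decays geometrically with output length, which matches the intuition that longer generations diverge more often.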
The tools are certainly improving, but these two issues are still major hurdles that don't get nearly as much attention as "agents", "skills", and whatever adjacent trend influencers are pushing today.
And can we please stop calling pattern matching and generation "intelligence"? This farce has gone on long enough.
That's literally what an IQ test tests: abstract pattern matching. But I guess you don't like IQ tests either.
Maybe if you find AI to be doing stuff you find impressive, the stuff you were doing wasn't that impressive? Worth ruminating on your priors at least.
For those that don't know, this is Timothy Gowers. He is one of the most accomplished mathematicians in the world. Like Terence Tao, he is considered one of the world leaders in mathematics and tends to have good judgement in where the field is going.
Even without that knowledge, no, this article is certainly not AI generated. It has none of the tells.
Also, I do not like IQ tests either (having taken one myself). They are unbelievably boring, pointless, and measure more than just "intelligence."
Somewhat ironic then, to not make this more explicit in an article about solving a combinatorial problem.
I also do not think this makes people less capable of solving hard problems. The bar just moves up. More people can now work on harder problems with better tools.
If the goal is credit or proving real skill, then focus on harder problems, like ones AI can't reach.
The implication is: we're ready to let everyone go wild with this very soon. Ok, go wild with what exactly? How do we know that influential VIP users who might make very friendly blog posts aren't getting allocated exclusive access to a billion dollars worth of hardware when they ask questions? I mean literally giving a certain group of people temporary privileged access to like 90% of all available compute would be a completely reasonable business decision for OpenAI.
Would a reveal like that change how we think about the result? What if half that amount of cash/compute could enable some completely non-AI approach of numerical brute forcing that settles the question even if it didn't write the paper?
My other question is always whether the latest is purely using giant models or if we're now deeply into harnesses that use MCTS and such. Understandable to keep that a trade secret I guess. But IMHO we should at least get the CoT trace as a proxy for true cost, or else maybe we're just getting played to do the hype for corporate.
Where's that? The stuff I've seen is from celebrities. Were those problems as hard as this one, or the ones that Tao posts about? Regardless.. what's the argument against more transparency here to just settle this kind of thing?
> which is too conspiratorial of an explanation for my liking regardless
OpenAI is not, in fact, open. Why do they deserve the benefit of the doubt?
Regardless.. special treatment for special customers isn't conspiracy, it's SOP literally everywhere and especially if you're helping to beta test. Anyone who's ever interacted with any technical account manager has seen waived quotas, free resource allocations, etc. The quid-pro-quo is obviously that your cheap early access means you get to give talks at a conference (or make a blog post that a lot of people read and talk about).
Does the author know about CAISc 2026 [0]?
Anyone spotting the issue here? What did that really cost?
I am not against compute being used for scientific or other important problems. We did that before LLMs. However, the major LLM gatekeepers want to make all industries and companies dependent on their models. And, at some point, they need to charge them the actual, unsubsidized costs for the compute. In the meantime, companies restructure in the hopes that the compute costs remain cheap.
Whatever the Joules... (convert to $ using your preferred benchmark price), it is a fraction of what it would take a human Ph.D. to feed and sustain themselves over the weeks they would spend on the same problem. The economics of LLMs are just unbeatable (sadly) compared to us humans.
And then the answer isn’t even right.
The Bourbaki group was one of the first who attempted a mechanized approach (using pen and paper still of course) to set theory and were literally accused of wanting to end all mathematics. The approach was largely ignored in practice.
Gowers and a handful of others who work on computerized approaches also seem to want to end human mathematics and have sharecropper mathematics for a monthly tithe. So far they are largely ignored in practice.
Also if he did send me complete junk, I would still parse it for multiple days to see what is there.
I'm not criticising Gowers directly in this instance because he's exploring the possibilities, my disdain is towards the more general pattern I see emerging where people just send each other LLM outputs.
https://github.com/vjeranc/fixed-rtrt
M3 module was formalized fully purely from experimental data and from a nudge by earlier versions of codex in 15-30 minutes in a simple write/compile/fix-first-error loop. I was a bit surprised how fast it picked up the pattern but given there was a paper from '70s it became clear why later.
Interesting question, I guess a starting point is “moltbook”, but perhaps a better one is something like GitHub, where Lean proofs and preprints can go, and trending items can get boosted.
I also think that posting this stuff on x or bluesky has merit, but again the existing paradigm doesn’t quite work; perhaps you can create a completely separate identity for your agent (à la Moltbook) but I think you want some sort of reputational association with the human piloting the agent, at least for now. (Maybe eventually there are enough agents critically engaging with content so that “interesting” results get agent likes, and so well-piloted agents stand on their own merit.)
https://www.renaissancephilanthropy.org/
The "brighter future" of course is that everyone is redundant and all capital is further concentrated.
It is always Gowers, Tao and Lichtman (math.inc startup) who are pushing these technologies.
In your mind does this mean that they are lying, or driven by motivated reasoning and cognitive bias, or whatever you'd like to say?
Because I feel like people bring up these facts as a way to discount everything that these people are saying. But whether or not they've chosen to align themselves with AI-aligned venture capital funding, the question is really: did what they say happened actually happen? Are these capabilities real or not?
To my mind, mathematics is pretty definitely, externally, objectively verifiable, so it would be easy to catch them in a lie. In the case of the Erdős problem that was recently solved in a novel and productive way, it wasn't even initiated by them, and the ChatGPT transcript is public for all to see. And the proof could easily be verified by other people, for instance.
In addition, I think it's likely that they are explaining things as they honestly see them, and also doing their due diligence to make sure that they are seeing them as close to correctly as possible. Their positions with these organizations, not to mention their entire life's work and passion, depend on their reputation in academic mathematics. If they were to give that up by falsifying these claims or not verifying them sufficiently, they would lose everything.
I think it's also worth pointing out that it is totally possible for someone to align themselves with such organizations after the fact because they agree with them instead of being bought out by such organizations. Otherwise, it would be possible to dismiss the opinion of anyone working at any NGO dedicated to being against AI and denying AI's capabilities or whatever, as well by the same logic of their salary being paid by an organization dedicated to pushing those ideas.
OAI and Anthropic both require KYC for models of similar intelligence. They both do account bans if the classifiers fire wrong. You simply hear about it less with OAI because Codex has fewer prosumers.
Do you understand the difference between an apple and an apple tree?
Graduate? Yes.
I can speak with a reasonable amount of experience here. I absolutely guarantee you that what Gowers sent over was the vast majority of the work involved for a proof. It's also an interesting exercise in general, hence the blog post.
Parsing a proof like this is _much_ easier than creating it.
Parsing code often seems like the opposite in my experience, where it is more difficult than writing it yourself.
In terms of bans and KYC they are not meaningfully different.
The real power required to support a human life in a developed country is a lot. The wattage of the human brain is definitely minuscule in comparison.
For publications and theses, as long as the final results hold and can be replicated and validated, I don’t see why we shouldn’t allow the wholesale use of LLMs
This is really just a glorified undergraduate education; the real point of graduate school is to learn to do real-world relevant research. For the latter, I think LLM use will be accepted, but there will be a heavy expectation on the author to make the result very easily digestible for human mathematicians and to link it thoroughly with the existing literature. That is something LLMs are very much not successful at, but a student might be able to do it quite well with a mixture of expert guidance and personal effort.
Wonder who the AI researcher worked for? Is a "craze" something which a for-profit company would want to encourage? Maybe they'd think the publicity would help keep people talking about their company and product as we are now doing?
> “There was kind of a standard sequence of moves that everyone who worked on the problem previously started by doing,” Tao says. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.
Yep, I do remember this now. Everyone was yelling that this was definitely a sign of ground-breaking and creative work, citing the expert. What the expert actually said suggests that the solution was available in training data! That also suggests the math in TFA is harder in comparison, answering my other question.
Pleasure as always HN, thanks for voting me to the bottom of the thread for this
First, you ask for evidence of someone who isn't a VIP doing a similarly difficult problem using an LLM, to show that it isn't just VIPs being given special models. And then, when I provide that example, you say it doesn't count because the whole craze was started by researchers working for AI anyway.
Furthermore, you start out stating that the access being given to these VIPs is to insanely massive, impractical models no one will ever have access to, but then you point to them getting a free ChatGPT Pro sub as evidence of your point.
Finally, you look at the fact that the AI solved the problem by applying a technique that no one in sixty years had thought to apply to that situation, from a totally different field of mathematics, and you claim that this isn't sufficiently novel to "count" as being in the same ballpark of difficulty as solving other, literally easier Erdős problems, or this new problem, because the technique already existed, and so it isn't hard enough to be comparable to all the other stuff that's been done.
If you are upset that you are being down-voted, I think you should do some introspection. It seems like it would be impossible to convince you, no matter how many non-VIPs solved difficult open math problems, as long as it wasn't literally the exact same level of difficulty or type of problem.
How does any of this offend a reasonable person? It doesn't, because here is the project wiki addressing BOTH of these relevant points. https://github.com/teorth/erdosproblems/wiki/How-have-AI-com... https://github.com/teorth/erdosproblems/wiki/AI-contribution...
What I raised is a real question: the erdos folks are not affiliated with AI companies.. but is the AI company affiliated with them? Actually knowing the user/org accounts involved is optional because they could just divert resources to anyone prompting near the problem. Anecdotally.. I've noticed what appears to be token-discounts based on topic, for example more generosity for AI-related research than random stuff, but it's hard to know for sure. Wouldn't you promote interactions you could profitably train on?
So again, allocating resources to Erdos one way or another is just a clearly smart business decision for something people are talking about and which has become an unofficial competition among vendors, not a big scandalous accusation. Something like a reasoning-trace is the only way to settle it. This isn't conspiracy or nitpicking because this is the topic itself: the AI usage is more a matter of public interest than the actual problem solution. What's the argument against more transparency?
The downside (not noted in the article, but noted by others here) is cost. It uses tokens at an insane rate, the tokens cost a lot, and using it with subagent flows that you can use to have it tackle large problems with high accuracy costs even more. It is also much "slower" for large scale problems because of context limitations -- it has to constantly rediscover context for each part of the problem, and in order to make it accurate you need to wipe its context before progressing to the next small part, or launch even more agents. For mathematical proofs like these, where the required context to understand the problem and proof besides stuff that's already available in its training set is small and the problems are considered "important" enough, this might not be a problem, but for many of the tasks I would like to use it for (ensuring correctness of code that affects large codebases, or validating subtle assumptions) it definitely is one.
So I think it will be a while before the impressive capabilities of these models really percolate into our lives as programmers, unless you're one of the lucky ones given unlimited access to 5.5 Pro.
I swear that people have said the same thing with effectively every new model that came out in the last six months.
The reality is likely that everyone is hitting similar barriers and the solutions are somewhat generalizable and get added to training new models.
Eventually people will reach the new limits and the cycle repeats.
That is definitely true, and at the same time, we can measure progress by who is making that claim. When Timothy Gowers, a Fields Medalist, says that models are now capable of "producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me," we can be pretty confident that we are getting into seriously interesting territory.
Just as an fyi, the word you are looking for is jibes. Jive is something else entirely.
I don’t know about the rest of y’all but I find “rigidly guiding” LLMs incredibly tedious and frustrating, in the same way that seeing an error code thrown for the 40th time while troubleshooting something on my computer for two hours is frustrating. It also feels somewhat like micromanaging a direct report. I don’t find that process fun or enjoyable in the slightest and it teaches me little in the process. It’s just trading styles of work, and I guess the response to that is “some people prefer that style of work.” I just don’t like being told by the world we all have to work that way now I guess.
> It seems to me that training beginning PhD students to do research [...] has just got harder, since one obvious way to help somebody get started is to give them a problem that looks as though it might be a relatively gentle one. If LLMs are at the point where they can solve “gentle problems”, then that is no longer an option. The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.
Training must start from the basics, though. Of course, everybody's training in math starts with summing small integers, which calculators have been doing without any mistakes for a long time.
The point is perhaps confirmed by another comment further down in the post
> by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at using solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders
People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired. I'm not sure if there is a similar point with math. Again from the post
> suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.
> Where does the value of thinking and having deep ideas come from? We need to think about this now. If it comes primarily from their scarcity – the fact that having certain ideas is hard – then indeed this value may drop precipitously when the manufacture of ideas can be automated. But if the value comes from the utility of the ideas – the benefit that the idea brings – then the story changes: perhaps creating more good ideas is actually better, not worse. Here I’m using “utility” in a broad sense, not just in the sense of what people often call applied mathematics.
> In other words, mathematicians may need to adjust to a transformation from a scarcity economy to an abundance economy.
https://gowers.wordpress.com/2026/05/08/a-recent-experience-...
This is a cultural choice. It makes sense that in the mathematics culture we currently have, this is alien. But already, other fields, and many individuals, would disagree and say that the human did have a major achievement here. As long as human-AI collaborations are producing the best results, there is meaningful contribution by the humans, and people that are deeper experts and skilled LLM whisperers should be able to make outsized contributions. The real shoe drops when pure AI beats humans and human-AI collaboration.
However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.
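For readers who don't work with these objects, here is a short reminder of the distinction, as a sketch under my own conventions (Cl(3,0) with an orthonormal basis e_1, e_2, e_3; the unit bivector B = e_1 e_2 and the rotor R are my choice of example, not something taken from the chat transcripts). Both a unit bivector and the pseudoscalar square to -1, so their exponentials look alike on paper, but they act very differently.

```latex
% Assumed setting: Cl(3,0), orthonormal basis e_1, e_2, e_3 with e_i^2 = 1.
\begin{align*}
  B &= e_1 e_2, & B^2 &= -1, & e^{\theta B} &= \cos\theta + B \sin\theta, \\
  I &= e_1 e_2 e_3, & I^2 &= -1, & e^{\theta I} &= \cos\theta + I \sin\theta .
\end{align*}
% The bivector exponential is a rotor: with R = e^{-\theta B / 2},
\begin{align*}
  R \, e_1 \, R^{-1} &= \cos\theta \, e_1 + \sin\theta \, e_2 ,
\end{align*}
% i.e. a rotation by \theta in the e_1 e_2 plane.
% The pseudoscalar commutes with every vector in 3D, so the same sandwich is trivial:
\begin{align*}
  e^{-\theta I / 2} \, v \, e^{\theta I / 2} &= v \quad \text{for every vector } v .
\end{align*}
```

So exp(θB) generates a rotation, while exp(θI) behaves like a complex phase that commutes with everything. Swapping one for the other gives formulas that look plausible but mean something else.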
Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.
This made me a little sad
Paying for Pro from any of my current academic budgets is completely out of the realm of reality here: all budgets tend to have restricted uses, and software payments fit into very few categories. Effectively, I'd have to ask for a brand new grant and hope the grant rules allow for large software payments and that I won't encounter an anti-AI reviewer; such a thing would take one year at least.
As a nail in the coffin, I was "denied" all Claude Opus recently as part of Microsoft's clampdown on individual (and academic) use of Copilot.
(ChatGPT 5.5 Plus does not seem sufficient for any deeper investigations into new research topics, I've tried.)
Apologies for the rant.
At the time I thought the key missing tool was a natural language search that acted like mathoverflow, where you could explain your problem or ideas as you understood them and get references to relevant literature (possibly outside your experience or vocabulary).
Being a gifted mathematician does not make you right. In fact, mathematicians have a lot of bizarre theories.
If we look where these models were 5-7 years ago...the existential threat of the Ph.D. was not even on the radar back then. The people finishing up their doctorate now are the first that can truly leverage these tools.
Now, if these researchers-to-be feel defeated (enough to quit), or lean completely on AI models to do the work for them, we're going to have a problem. Same with the funding of those Ph.D. positions. If we move away from "funding to produce researchers" to "funding to achieve results", will money that was usually spent to fund Ph.D. students start to flow towards compute?
If we look at it a bit cynically: Some researcher will be able to pump out a lot more papers by spending money on compute, than a couple of years of training students.
Interesting times. But also so much uncertainty. I feel terrible for the students that will have to decide now what they want to do, with all this knowledge.
Obviously this is already happening and will accelerate. Outside of grad work, you could already just buy a degree. Certainly in the softer disciplines, you can currently just buy a phd thesis and a good publication history. If you're in industry instead of academics, you can even buy a promotion. If your employer gives an AI budget to all workers then you quietly double that budget out of your own pocket for as long as it takes to get a promotion, then stop and just enjoy a bigger paycheck.
> This reminds me of Antirez's "Don't fall into the anti-AI hype". In a sentence: These foundation models are really good at optimizing these extremely high level, extremely well defined problem spaces (ie multiply matrices faster). In Antirez's case, it's "make Redis faster".
5.5pro is amazing but this implication might not be true & is the core argument of this piece.
AI will prove all sort of things - interesting, boring & incorrect.
To sort it will be the task of the PhD.
"Hey, Prove something a machine can't", sure I can't, "Hey, Say something worth proving & judge it well", ah, now I might have a few unique observations/ideas/curiosities/problems from my having been a human.
Imo, the feeling of intelligence or the process of originality(originativity) test for ai is subjective & is coming down to 4 paths: novel relative to a reference class, valuable within a domain, counterfactually sensitive to internal state and environment, and revisable through learning.
The question that keeps bothering me is: can an LLM generate an idea that is truly novel? How would/could that actually happen? But then that leads to the question - what are we actually doing when we think?
Perhaps it's as simple as the ability to just make mistakes that matters, the same things that powers evolution. As long as the LLM can make mistakes, it's capable of generating something genuinely novel. And it can make more mistakes much faster than we can.
And certainly not to send it to a fellow colleague to ask its opinion first.
LLMs are certainly becoming capable of coding, finding vulnerabilities, and solving mathematical problems, but we need to avoid putting their work in production, or in front of other humans, without assessing it by every possible means.
Otherwise tech leads, maintainers, experts get overwhelmed and this is how the « AI slop » fatigue begins.
To be clear I’m talking about this step:
> That preprint would have been hard for me to read, as that would have meant carefully reading Rajagopal’s paper first, but I sent it to Nathanson, who forwarded it to Rajagopal, who said he thought it looked correct.
I think this is good advice in general, maybe with an emphasis on public vs. private, friendly contact. Having 0 thought AI slop thrown at you out of the blue is rude. "could have been a prompt" indeed. But having a friend/colleague ask for a quick glance at something they know you handle well is another story for me.
If I've worked on a subject for a few years, and know the particulars in and out, I'd have no trouble skimming something that a friend or a colleague sent me. I am sparing those 5-10 minutes for the friend, not for what they sent. And for an expert in a particular domain, often 5 minutes is all it takes for a "lgtm" or "lol no".
The "non-trivial" is for human abilities. The weights lifted by a crane are also "non-trivial". People keep getting amazed at machine's abilities. Just like a radio telescope can see things humans can't, microscope can see the detail humans can't, we need not be amazed. The sensory perception of patterns is at different level for AI. It's a machine.
It usually takes dissolving that, often through difficult experiences, before they can see it as a machine, something that could be separated from them.
> Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques.
This is exactly what leads me to believe that the real impact of LLMs in human history is yet to come. My work as a researcher was mostly spent on two classes of workloads: reading recently published papers to gather ideas and keep up with the state of the art, and working on a selection of ideas gathered from said papers to build my research upon. It turns out that LLMs excel at the most critical component of both workloads: parsing existing content and using it to generate additional content based on specific goals and constraints. I mean, papers are already a way to store and distribute context.
>> All I did was say things like, "Yes, it would be great if you could explore that idea and see whether you can get it to work," or "Could you rewrite that argument as a LaTeX file in the style of a standard mathematical preprint?"
Yeah, so all he did was take the horse to the water and make the horse drink. The collaboration with the other two mathematicians wasn't a trivial part of the problem solving either: every time Timothy Gowers figured ChatGPT had gone somewhere with its problem-solving, he stopped, asked it to render the answer in LaTeX, and sent the answer off to be verified by the other two.
The reason for that is not to be underestimated: ChatGPT can produce answers for as long as you keep asking it questions, but it has no capability to determine whether an answer is correct or not. That's why it needs a human with domain expertise to evaluate those answers, and to discard the wrong ones along the way, because of course the process that's described here glosses over the many false starts, back-and-forths, and "you're absolutely right, here's a new version of that" moments that are a common experience when using LLMs for problem-solving tasks.
The existential questions that the article poses about mathematics then are easily answered by taking all of the above into account. If LLMs are a useful tool for mathematicians, then nothing changes. Mathematicians of all levels can still do their job and perhaps do it faster or better with the new tool.
If you can sic ChatGPT on a mathematics problem and it can solve it without your input, that's a different matter but that's not what's happening.
This comment about time is very interesting to me. I know it's "just" doing mathematical proofs but the possibilities of speeding up planning, proposals and decision making in the physical world should excite people.
However I think it’s very important to approach such questions objectively, or at least disinterestedly, and not as one who’s worried about one’s job or whose sense of self worth is threatened by LLM technology.
The value is in the development of one’s own mental faculties. In math classes they tell you you have to work the problems. Even if LLMs become capable of solving entire classes of problems, and that set expands over time, the value in developing one’s ability never goes out of style.
Yes but it's not just that if you solved a problem yourself, you're better at solving other problems; it's also that you actually understand the problem that you solved, much better than if you simply read a proof made by somebody (or something) else.
I see this happening in the enterprise. People delegate work to some LLM; work isn't always bad, sometimes it's even acceptable. But it's not their work, and as a result, the author doesn't know or understand it better than anyone else! They don't own it, they can't explain it. They literally have no value whatsoever; they're a passthrough; they're invisible.
According to the blog post linked in the OP, the LLM-generated results were read, understood, and confirmed by the mathematician whose work they built on.
I notice a dichotomy here between people who care about results and people who care about process. The former group wants to use LLMs insofar as they can contribute to getting results. The latter group is wary of LLMs because they're more interested in the process and less interested in the results themselves. Needless to say, I think the former group is right, and I'm happy to see that mathematicians (or some of them) agree.
Then middle management also have no value, since they're also a passthrough between upper management and ICs, yet they never went extinct.
I don't know or understand the binary executable.
I don't own the binary executable, I don't understand it, I can't explain it, it's not my work.
I'm a passthrough; I'm invisible.
I have literally no value whatsoever.
> Training must start from the basics though.
Sure, but the point is that at some point (e.g. when starting a PhD) one needs to do research, not learn the basics. And LLMs make that harder, because they solve the "easy research" part.
Take a young lion "fighting/playing" with another young lion as a way to learn how to fight, and later hunt. And suddenly they get TikTok and are not interested in playing anymore. Their first encounter with hunting will be a lot harder, won't it?
> People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired.
Again, that's true but missing the point: if you never get to be a "good coder", you will always be a "bad vibe coder". Maybe you can make money out of it, but the point was about becoming good.
1. Does it matter, really? 2. Is it very different from previous computer-aided proofs, philosophically?
2. Yes, it is. Because pre-LLM era computer-aided proofs were about using the computer to either solve a large number of cases or to check that each step in a proof mechanically follows from the axioms.
Yeah, it's the same way with learning programming. LLMs can handle basic programming (and increasingly advanced programming) but I think it's necessary to write code by hand. As a beginner, of course, and arguably to maintain skill later too.
The alternative would be like, just asking ChatGPT to do your math homework and then "verifying" it by looking at it and saying "yeah, that looks okay." What are you going to learn?
We do stuff by hand for a reason.
The first species is the pure problem solver. Tao is the poster child for this group. Their currency is interesting problems and solutions to those problems.
The second species is the pure theory builder. The poster child for this group is Conway. Their currency is theories and ideas rather than theorems, they are most interested in expanding the territory of mathematics and discovering new mathematical lands.
The third species is the applied mathematician. They see mathematics as a means to an end, they have some problem outside of mathematics and they want to use mathematics to solve it.
It seems like the first group (the problem solvers) are the most immediately threatened by AI, although so far AI is better at solving problems than finding new conjectures.
The second group (the theory builders) are more distantly threatened by AI, since thus far AI has shown limited ability to come up with novel and interesting mathematical ideas and nobody has any clue how to train an AI to do such a thing.
The third group stands to gain the most from AI. If an AI can answer your mathematical question then you can spend less time doing mathematics and more time on whatever it is outside of mathematics that you wanted to use mathematics to help solve.
Identifying suitable problems in this sense, rather than solutions, is an AI use-case you don't hear about much. We don't quite have the infrastructure for this yet, but by combining language models / anomaly detection / knowledge-bases we might be in a position to give a Conway 25 interesting high-quality puzzles before breakfast. Funny that it's like a kid's nightmare, ChatGPT giving them homework instead of solving it, but if it had good taste, people would love it for research.
Anyway, for now, dreamers will probably find more inspiration by cross-pollination with colleagues from different disciplines, or just going for a walk.
Meanwhile Wiles and Perelman stayed offline and solved real problems.
We praise car drivers even though most of the performance in their sport comes from the car. The driver makes the difference when two cars are close in performance, through brilliance or mistakes. The same goes for horse riders.
In the case of math, the human can lead the LLM on the right track, point it to a problem or to another one. So the human deserves some praise.
Then the team that built the car, cared for the horse, or built the AI might deserve even more praise, but we tend to care more about the single most visible human.
For some reason this reminds me of AI images and a domain like comedy.
If an image makes people laugh, the person who prompted it to make the image certainly doesn't get credit for the vast majority of the work in its creation, but perhaps they do get credit for the initial prompt idea and then the "taste" to select that particular one from whatever drafts they went through or otherwise guiding it.
So if a mathematician comes up with an amazing result that an LLM "did", I think they could still get a bit of credit for prompting it to do it and being its guide.
But whereas the first person could perhaps be called a comedian and not an artist, would the mathematician still be called a mathematician or something else?
We just call those ones “quant traders”.
Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.
You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:
Frontier models are still nowhere near solving it, but progress has been rapid.
* o3 (high), less than 1.5 years ago: 1.4%
* GPT-5.4 (xhigh): 23.4%
* GPT-5.5 (xhigh): 27.1%
* GPT-5.5 Pro (xhigh): 30.6%
Wrong. Every advancement has followed an S-curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".
A scientific approach here is to look to falsify the statement. You start asking questions, running tests, experiments, etc. to prove wrong the notion that it is done. And at some point you run out of such tests and it's probably done for some useful notion of done-ness.
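To make the S-curve point concrete, here is the standard logistic curve. This is only an illustrative model (the symbols L, k and t_0 are my own, not something the benchmark numbers above establish): early on, an S-curve is nearly indistinguishable from an exponential, which is why a few data points can't tell you where the ceiling is.

```latex
% Logistic curve with ceiling L, growth rate k > 0, and midpoint t_0:
f(t) = \frac{L}{1 + e^{-k (t - t_0)}}

% Well before the midpoint (t \ll t_0) the exponential term dominates
% the denominator, so the curve looks like pure exponential growth:
f(t) \approx L \, e^{k (t - t_0)} \qquad \text{for } t \ll t_0
```

In that early regime the data only pins down k and the product L e^{-k t_0}, not L and t_0 separately, so the same points fit a low ceiling reached soon or a high ceiling reached much later.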
I've built some larger components and things with AI. It's never a one shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process kind of runs itself, almost. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"
It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.
I have not yet played with the long term memory feature, but I fear it will be even less reliable than prompts, simply because in one year or two years so much will have changed again that this “memory” will have to be redone multiple times by then.
However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.
I think this is a bit pedantic. Obviously the parent you’re replying to is referring to the concept of “in-context learning”, which is the actual industry / academic term for this. So you feed it a paper, and then it can use that info, and it needs steering / “mentoring” to be guided into the right direction.
Heck the whole name of “machine learning” suggests these things can actually learn. “reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.
In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.
what? training is learning, as long as weights are available continual learning is perfectly feasible: just keep training the LLM with the user corpus alternated with a frozen version to prevent catastrophic drift / collapse.
it's not because model providers don't want to provide user specific continual learning, that we don't know how to do it.
it would be a lot more expensive to host user-specific model weights, and would prevent amortizing the weights over many requests in batches...
There is a 50/50 chance that it turns out to be right or lets you jump off the cliff.
Either way, the trip itself feels like the same beautiful five-star-plus journey.
Also, spotting an error and telling the LLM about it makes things worse in most cases, because the LLM wants to please you and goes on to apologize and change course.
The moment I find myself in such a situation I save or cancel the session and start from scratch in most cases or pivot with drastic measures.
Gemini to me is the most unpredictable LLM while GPT works best overall for me.
Gemini lately gave me two different answers to the same question. This was an intentional test because I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt, everything else being the same.
Reasoning doesn’t help much in the coding domain for me, because what the LLM comes up with as an explanation is very high level and only formally right.
I google more than before because of LLMs: essentially, someone is producing something that I have to check first before I hit the button that comes with it. However, you only find out shortly afterwards whether that polished button actually works or gives you a warm welcome to hell.
In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.
I now see LLMs as persuasion machines.
I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.
Instead, it started writing an OCR program in python.
I stopped it after several minutes.
Often Copilot says it can't do something (sometimes it's even correct); that's preferable to the try-hard behaviour here.
This nails an important thing IMHO. I've absolutely noticed this, for better or worse. Gemini can produce surprisingly excellent things, but its unpredictability makes me go for GPT when I only want to ask once.
If you had an infinite number of monkeys, each with a typewriter, one would eventually write Shakespeare. If you had an infinite number of college-educated interns, each with access to all the public records you can possibly get via FOIA, one would eventually get enough evidence to prove that a top politician is cheating on their partner, evidence which you could use to blackmail that politician.
You don't need that much intelligence to do that, you just need somebody who's willing to dedicate their life to knowing everything there is to know about that guy from Louisiana.
With humans, the amount of money you'd need to pay such a person just isn't worth the reward. With LLMs, it may very well be.
you deserve opinions shaped by interactions with the best tools that are out there.
But regular reminder - All LLMs can be wrong all the time. I only work with LLMs in domains I'm expert in OR I have other sources to verify their output with utmost certainty.
Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get overly bogged down in tedious details before it gets to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure it's great, but otherwise you're just burning tokens.
What I do to mitigate this is that I have fact-checking agents configured to be extremely critical and non-biased on Opus, Gemini and GPT, which are then handed the entire conversation to review. Then it's handed off to an Opus agent which is set up to assume everything is wrong. After this, and if I'm convinced something is correct, I'll hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I'll need to verify.
It's ridiculously effective, but I do wonder how it would work for someone who couldn't challenge the analytic agent on the domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations for everyone from C-level to nerds is ridiculously easy.
Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but that pool of people is slowly going to be replaced by less knowledgeable ones.
At some point, a Rubicon will be crossed where these systems can't fall back to a human operator and will fail spectacularly.
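A setup like the one described above can be sketched roughly as follows. This is a minimal sketch under loud assumptions: `run_review`, `keyword_critic`, `assume_wrong` and `one_claim_per_line` are hypothetical names, the "agents" are plain Python functions standing in for model calls, and nothing here reflects how the commenter actually wired theirs up.

```python
from dataclasses import dataclass
from typing import Callable, List

# A reviewer is any callable that maps conversation text to a list of
# findings. In real use each one would wrap a call to a different model;
# here they are plain functions so the sketch runs offline (assumption).
Reviewer = Callable[[str], List[str]]


@dataclass
class ReviewResult:
    objections: List[str]  # deduplicated concerns from critics and skeptic
    to_verify: List[str]   # claims the human still has to check by hand


def run_review(conversation: str,
               critics: List[Reviewer],
               skeptic: Reviewer,
               compiler: Reviewer) -> ReviewResult:
    """Hypothetical three-stage review: critics, then skeptic, then compiler."""
    objections: List[str] = []
    # Stage 1: independent critics each read the whole conversation.
    for critic in critics:
        objections.extend(critic(conversation))
    # Stage 2: the skeptic sees the conversation plus the objections so far.
    skeptic_input = conversation + "\n" + "\n".join(objections)
    objections.extend(skeptic(skeptic_input))
    # Drop duplicate objections while keeping first-seen order.
    unique = list(dict.fromkeys(objections))
    # Stage 3: compile the list of claims left for manual verification.
    return ReviewResult(unique, compiler(conversation))


def keyword_critic(keyword: str, message: str) -> Reviewer:
    """Toy stand-in for a model: objects whenever `keyword` appears."""
    def critic(text: str) -> List[str]:
        return [message] if keyword in text else []
    return critic


def assume_wrong(text: str) -> List[str]:
    """Toy skeptic that always objects, whatever it is shown."""
    return ["Treat as wrong until a source confirms it"]


def one_claim_per_line(text: str) -> List[str]:
    """Toy compiler: treats every non-empty line as one claim to verify."""
    return [line.strip() for line in text.splitlines() if line.strip()]


result = run_review(
    "Index lookups are always fast.\nThe table has no hot partitions.",
    critics=[keyword_critic("always", "Unqualified 'always'"),
             keyword_critic("always", "Unqualified 'always'"),
             keyword_critic("partition", "Partition claim has no evidence")],
    skeptic=assume_wrong,
    compiler=one_claim_per_line,
)
print(result.objections)
print(result.to_verify)
```

The point of keeping the final `to_verify` list separate is that the human with domain knowledge stays the last check, which is exactly the step that would be missing for someone who can't challenge the agents.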
It is troubling. It suggests a plateauing of human understanding.
Just in case you don't want to disclose your name, my email is northzen@gmail.com
Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.
I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all the domains. Not long ago, we didn't even have chatbots capable of basic conversation...
Bivectors and pseudoscalars (in a 3D context) are "just" signed areas and volumes. Easy!
Back around the GPT 3, 3.5, and 4.0 era I used to ask the bots to explain "counterfactual determinism", which is one of the most complex topics I personally understand.
Then I would lie to the bot about it, and see if it corrected me or not.
This test is useless now, the frontier models can't be fooled any longer on such "basic" concepts.
Conversely, LLMs are basically useless at anything that doesn't have enough (or no) public information for their training. Think: obscure proprietary product config files and the like, even if the concepts involved are trivial.
Similarly, Clifford Algebra is a relatively niche (even "alternative") area of mathematics and physics, with vastly less written material about it than the competing linear algebra. Hence, the AIs are bad at it.
Mine has been epically bad.
I put my stuff through several sota models and round robin them in adversarial collaboration and they are all useful even though, fundamentally, they don’t “understand” anything. But they are super useful delegates as long as deciding on the problem and approach and solution all sits safely in your head so you can challenge them and steer them.
So I know the article is about one particular new model acing something and each vendor wants these stories to position their model as now good enough to replace humans and all other models, but working somewhere where I am lucky enough to be able to use all the sota models all the time, I can say that all keep making obvious mistakes and using all adversarially is way better than trusting just one.
I look forward to the day a small open model that we can run ourselves outperforms the sum of all today’s models. That’s when enough is enough and we can let things plateau.
That’s all they are. They don’t ‘know’ anything intrinsically and do not ‘know’ what reasoning even is.
The opening of the movie features the MIT campus full of students navigating its grounds and all the promise and status that higher education brings. [0]
Gave me the same sense of sadness realizing how much will fall to AI.
[0] - https://youtu.be/0lsUsWdkk0Y?si=TJl7f_b1RcWcDqF8&t=278
I don’t think I ever thought I was good enough to try and get (math) immortality by finding and naming some result that would live beyond me, but if I had, perhaps this bad news would have had a similar impact on me.
That said, I think I disagree with the premise at the margin, at least. I don’t care how many proof assistants or how much cluster compute is used - the team or person that proves the Riemann Hypothesis will be famous, or at least math famous.
Many mathematicians work because they love the breakthrough (a certain quote of Villani comes to mind). They love finding new results, uncovering new mysteries. From that point of view, having an AI that can build on your basic ideas and refine them into more powerful arguments is awesome, regardless of who gets the credit. There are those that treat it more like solving puzzles so the result is not of interest. From that point of view, I can see the dissatisfaction. But I have found those with that viewpoint don't tend to make it as far in academia as those with the other viewpoint.
You are worthy of doing this work because you are able to do it. Do the work because you love it and because you love the mystery. Enjoy every moment that you get to do it. Find joy in the great fortune you have to do this work while others toil away on tasks that bring them no satisfaction. Sometimes it's tedious, but sometimes it's incredibly rewarding in its own right.
Don't work for the possibility of eternal glory though, it just doesn't exist anymore.
Any statement preceded by the word 'believe' is a coping mechanism.
> This notion of immortality was just a small intangible bonus I hoped for when I jumped into grad school
Any statement preceded by the word 'hope' is a coping mechanism.
> AI is making me feel less worthy
Worth comes from understanding, not achievement.
But I agree worth should be derived from understanding, not through achievement.
I probably will erase the contents in a few days.
Even if you just drop an email and it doesn't work out, I appreciate this gesture so much. Thank you.
There’s the example of a poor person and a rich person buying boots. The poor person’s boots wear out and have to be replaced, while the rich person’s boots last for many years due to higher quality craftsmanship. Over the years, the poor person will pay more for boots.
Of course if you are really poor, then you have to take expensive shortcuts, but for most people that shouldn’t be the case. Learning to do more with less money isn’t as bad as many people think. It’s also good for the brain to be a bit more creative.
I'm not trying to shame here, just curious whether this is completely unattainable for most researchers in your area.
I am starting to see folks saying: OK, so LLMs can do this, what value have you added? "Modulo LLM" is becoming the norm.
I see that they are able to do research that they were not previously able to do. And although I see that using AI has certainly diminished their ability to code some stuff up, I see it the same way as someone using scikit-learn or PyTorch to code their ML models -- indeed the underlying details are abstracted away from you, and without AI, you won't be able to do much, but the research that you do is indeed happening because of you and wouldn't have happened with just the AI doing the research.
As an afterthought budget item, those funds aren’t exactly attractive targets to raid for pursuing an expensive, different process.
Some people like to parrot "next token prediction", "LLMs can only interpolate", and other nonsense, but it is obviously not true for many reasons, in particular since we introduced RL.
Humans do not have a monopoly on generating novel ideas. Modern AI models using post-training, RL, etc. can come to them the same way we do: through exploration.
See also verifier's law [0]: "The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI."
This applied to chess, go, strategy games, and we can now see it applying to mathematics, algorithmic problems, etc.
It is incredibly humbling to see AI outperform humans at creative cognitive tasks, and realise that the bitter lesson [1] applies so generally, but here we are.
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
If it's "invented", then it requires ingenuity.
If it's "discovered", then it was always already there, just waiting for the right connections to be made for it to be uncovered and represented in a way we can understand.
Invention requires ingenuity, but discovery does not. So if LLMs can generate truly novel mathematics, for me that settles it that mathematics is indeed discovered, as LLMs are quite capable of discovery yet I don't consider them capable of invention.
Furthermore, the results of theorems aren’t an invention, they are a discovery of what the base assumptions (axioms) logically entail. Finding out which theorems are true and provable is a discovery process. For example, the results of Gödel’s incompleteness theorems were a discovery. They weren’t invented, in the sense that the results couldn’t have been otherwise. We merely could have failed to discover them.
This also holds for physical inventions. You discover a working way to build some functioning mechanism. It’s a process of discovery of what is possible in the physical world.
Whether you portray something as a discovery or as an invention is more a matter of degree, a matter of which angle one is looking at it from.
The possible states of an LLM are finitely enumerable. The same likely holds for the possible states and configurations of a human brain, in approximation. Therefore there is only a finite set of possible ideas, thoughts, and conceptualizations an LLM or a human can have, and in principle they could be exhaustively enumerated and thus “discovered”.
There is no ‘discovery’ here nor was it waiting to be found. The human has to sacrifice and pursue the path of exploring reality and thereby is inherently inventing.
Humans built up mathematics iteratively, from smaller bases extending into larger ones. Is this what LLMs do? Of course not: they are fed with vast amounts of information from the off.
This works really well.
Now, to be clear, I have no idea how much of this is something we would consider new and original, and how much is a kind of systematic, but not novel, way of thinking.
What I couldn't do so far is get an LLM to generate a truly new maths theory, with new abstract concepts and dimensions and points of view. The kind that is not just a combination of existing theories and logic.
To me, it's rearranging the information you had in a way that hasn't been applied or published before.
That's literally what LLMs are built for.
Limit the knowledge of an LLM to some point in time just before a discovery was made. Then check whether the LLM can produce the discovery.
If you think OAI hasn’t already tried this then think again - they have every incentive to do so and announce it to the world.
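The cutoff test described above boils down to splitting a corpus by date. Here is a minimal sketch, assuming every document carries a reliable publication date; `Document`, `split_at_cutoff` and the example titles and dates are all made up for illustration and refer to no real dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Document:
    title: str
    published: date


def split_at_cutoff(corpus: List[Document],
                    cutoff: date) -> Tuple[List[Document], List[Document]]:
    """Split a corpus into what the model may see and what is held out.

    Everything published strictly before `cutoff` is visible; anything
    published on or after it (the discovery included) is held out.
    """
    visible = [d for d in corpus if d.published < cutoff]
    held_out = [d for d in corpus if d.published >= cutoff]
    return visible, held_out


# Hypothetical corpus: placeholder titles and dates, not real papers.
corpus = [
    Document("background paper A", date(2010, 5, 1)),
    Document("background paper B", date(2014, 9, 12)),
    Document("the discovery", date(2015, 3, 2)),
    Document("follow-up survey", date(2016, 1, 20)),
]

visible, held_out = split_at_cutoff(corpus, cutoff=date(2015, 3, 2))
print([d.title for d in visible])
print([d.title for d in held_out])
```

The split itself is trivial; the hard part in practice is leakage, since later text that restates the result (or undated copies of it) can slip into the visible side and make the test meaningless.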
thank you for the morning laugh
I mean, that has happened, so yeah?
https://www.scientificamerican.com/article/amateur-armed-wit...
Actual GPT transcript. Zero such input https://chatgpt.com/share/69dd1c83-b164-8385-bf2e-8533e9baba...
And maybe the other guy wasn't the most polite about it, but his point is very valid. Replace ChatGPT with a human in both of these stories and nobody would say that Timothy 'took the horse and made it drink'. The 'horse' would be the first and likely only author, so this just sounds like denial.
That there are multiple of these stories in the last few months by the latest set of models (there are even more than these 2) should provoke this sort of consideration and discussion.
Then there's the kind of problem we're talking about. The "amateur" in the SA article solved one of the Erdős problems, and Gowers himself seems to think that, on its own, is not a cause for concern. He distinguishes his own result from that kind of earlier result at the start of his article:
>> The background is that, as has been widely reported, LLMs are now capable of solving research-level problems, and have managed to solve several of the Erdős problems listed on Thomas Bloom’s wonderful website. Initially it was possible to laugh this off: many of the “solutions” consisted in the LLM noticing that the problem had an answer sitting there in the literature already, or could be very easily deduced from known results.
So we have, on the one hand, an "amateur" who "vibe-solved" an Erdős problem, which may or may not already have had a solution lurking in the wings; and, on the other hand, an expert who solved a harder problem by interactive use rather than vibe-solving. There's no reason to believe that we can "Replace chatgpt with a human in both of these stories" as you say.
And btw there's scholarship that indicates vibe-solving is not yet ready to replace mathematicians like Timothy Gowers:
First Proof
To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.
https://arxiv.org/abs/2602.05192
See Appendix A for initial results.
I know it's ambiguous (is this a text post or a linked post? did the submitter write the text or was it the mods?) - but my thought is that over time it might naturally evolve into something clearer.
Grothendieck is a better example specifically for building theories. Conway is famous for his various ideas and inventions so he also fits the bill as an "anti-Tao".
But I noticed that the closer the domain they were talking about was to my area of competence, the less convincing their arguments were. There were more holes, errors and wrong conclusions.
I recalibrated my bs meter thanks to that.
Since AI came along, I have successfully used this strategy of being extremely cautious towards convincing arguments to avoid being misled by AI.
However, this year I'm working with AI more in the domain of software development, where I can judge competence. And I do see the competence. This had the opposite effect on me: I tend to trust AI outside my domain of expertise much more after seeing what it can do in software.
One caveat though is that there are a lot of areas of human culture where there's very little actual knowledge, but a lot of opinions, like politics, economy, diet, business, health. I still don't trust AI in those domains. But then again, I don't trust humans there either.
For me basically AI achieved the threshold of useful reliability for any domain that humans are reliable at.
I don't really care about sycophancy. I might have a slight advantage that I don't talk to AI in my native language. So its responses don't have a direct line to my emotions.
For this sort of thing, using multiple LLMs is extremely helpful.
From not having the thing you hope for or believe in.
I want a cookie.
I'm going to get a cookie. No belief, no hope.
I may not get a cookie. Oh no. I'm stressed. How do I deal with the stress? I hope I get a cookie. I believe I'm going to get a cookie. That's a coping mechanism.
I've felt the latter is more complicated and involved, but the rewards are tangible.
Just as an fyi, the words you are looking for are ages/eons/an eternity.
But if you ask questions occasionally, (and don't resend, for example, your whole codebase with each request), then the API feels really cheap, even for the frontier models.
Somebody at some point, "invented" the idea that the earth was round. Before that, the obvious "just look around you" answer would've been, duh of course the ground is flat. But we know the earth has always been round, even if humans couldn't appreciate it for hundreds of thousands of years (I don't count the pre-history before homo sapiens). So we "invented" some fields of science and the mental models / abstractions that allowed us to conceptualize what a round earth could mean and how to measure it, but we didn't invent the roundness itself -- that was always reality, and we just lacked both the thoughts and the tools to conceptualize it (until later).
Now you might say, well that is a category of "simple" physical observations. The earth is naturally round all the time and doesn't take any extra human effort to make it so (it took some effort to imagine that it could be and to find ways to measure/prove it). But what about say -- semiconductors, NVIDIA GPUs, that sort of thing? It's not like semiconductors grow on trees and we just need to find them and learn how to consume/use them... isn't that a better example of "true invention"?
Sure, I could see that. But I guess my POV would be that the invention of the latest AI chip, or the first semiconductor, or the first vacuum tube, or whatever came before, all laddered largely incrementally on "discoveries" that were then cleverly tweaked or reapplied, so that what appears to be "true invention" is usually/more-often just another link in a long chain of "discoveries" that led up to it. I grant you that some of what appears in hindsight to be continuous progress really is built on small discontinuous "leaps", but I don't think that breaks the argument (strengthens it in fact, IMO). You wouldn't have semiconductors today unless Faraday (or somebody like him) discovered that silver sulfide resistance decreases with heat, and that is more like one of those physical properties that reality has always had (much like, earth was always round, we just didn't know it at first).
So in that sense, I feel this becomes almost like an "evolution vs intelligent design" debate -- some people look at the complexity and miracle that is the human eye or the human brain, and they insist there must have been an intelligent designer, because surely no random chaotic biological process could have produced something so wonderful... And yet, I think the scientific evidence largely shows that, indeed that is what happened, just random chance + evolutionary-pressure was all you really need (plus billions of years). So if you can accept that analogical framing for a minute, then I would posit that "invention"-adherents are really making something like an intelligent design argument, vs "discovery"-adherents are saying that evolution (in an artificial sense, with the artificial selection pressures of scientific research, of capitalism, etc., and compressed into centuries or decades, not millions or billions of years) is sufficient to derive miraculous-seeming results. The little discontinuous leaps along the way, are kind of like the random mutations of genes that happen to confer an advantage -- maybe we can say that we are more intentional about seeking those leaps out, or maybe we are just right-place/right-time lucky (e.g. thinking about penicillin and the random petri dish left out).
Perhaps once (or if) there is the sort of leap that breaks us out from a Type I to a Type II+ Kardashev civilization, maybe then I would grant you something needed to be "invented" that couldn't be based on a line of "discoveries". Or maybe not, maybe it will just be another semi-random discovery.
My first point is that I think you are overrating 'interactive use' a bit here. Like Timothy already explains in the article, were it a human he 'guided' in a similar way, he would not get credit for those achievements by any stretch of the imagination. And I think that's an important part of realizing why these sorts of people are beginning to discuss these things.
Second. I didn't say anything about models being ready to replace mathematics wholesale. But should people really wait until that happens before discussing it? I know it's human nature to wait until the problem or situation is upon you but I don't think that would be prudent or wise. And even just for the sake of curiosity, it would be boring.
I think the matter of fact here is that in the last few months with the last few models, capabilities in this area have jumped to a very meaningful degree. It would be stranger if no one was talking about it.
>the LLM-generated results were read, understood, and confirmed by the mathematician whose work they built on.
The mathematician and the blog author are not the same person (as you seem to understand). Nathanson (the mathematician) is the one who is the expert verifier. He is the person who has the higher value and won't be fired in some hypothetical.
>>They don't own it, they can't explain it. They literally have no value whatsoever; they're a passthrough; they're invisible.
This is the blog author in the parent's description. If their boss asks them what they need to prove that the AI is more than capable in this domain and the author tells their boss they need Nathanson (the mathematician) to verify the results, his boss will thank him for demonstrating the AI's capability in this domain, fire him, pass his prompt history to Nathanson, and keep Nathanson on the job (the expert verifier).
Which is the parent's point after all, because he's referring to the hypothetical job security of the blog author not the mathematician.
> The mathematician and the blog author are not the same person
> (as you seem to understand). Nathanson (the mathematician) is
> the one who is the expert verifier. He is the person who has
> the higher value and won't be fired in some hypothetical.
The article's author is https://en.wikipedia.org/wiki/Timothy_Gowers
I don't usually go back to the original prompt. I've actually done it a few times in regards to the presentation, to get some refined images but usually I'll start a new prompt.
Your firm seems to operate on a higher plane, jealous :)
Thank you.
When I'm cooking meatballs with sauce and the recipe calls for frying them, I'll have an LLM guesstimate how long and which program to use in an air fryer to mimic the frying pan, based on a picture of balls in a Pyrex. So I can just move on with the sauce, instead of spending time browsing websites and stressing about getting it perfect.
I used to hate these non-deterministic instructions, now I treat them as their own game. When I publish my first recipe, I'll have an LLM randomize the ingredient amounts, round them up to some imprecise units and also randomize the times. Psychologists say we artists need to participate and I WILL participate.
This. Should become a general rule for any non-trivial use of an LLM in a professional setting.
Like, I asked ChatGPT to make me some problems, it did, then I got to check my answers. In the past I'd have had a textbook for that; but schools stopped giving those out decades ago.
Nobody looks at this species and goes hm, rational and reasonable :)
We are wading into philosophy here, but I believe this analogy doesn't track in this case -- my suspicion from this blog post and others is that already today, the Pro level thinking models are a positive multiplier to your research output similar to how the models one level lower are a multiplier to one's programming output.
Maybe one can someday use the cheaper models similar to how you can use cheaper models than Opus/5.5 and still be nearly as productive as a programmer -- but I am trying and failing to do exactly that for research questions.
I have left academia after my PhD and can tell you the analogy still works. I’m much happier now I left the academia rat race
Thank you for illustrating my point.
The map isn’t the territory; thinking about what to build is just as valid as thinking about how to build it. Architects aren’t carpenters, but that doesn’t mean there’s no value in architecture.
If this was the case, the demand for architects would be different than what we see today.
If I had a car 100 km/h faster on straights, after some training I would probably win Monza, but that would be a car that does not conform to F1 rules (or we would have those kinds of speeds now) so that would not be an F1 race.
Maybe your question is about the sharing of praise between the team and the driver. I think that every race fan agrees that when a team does a much better job than all the others and has a dominant car, the championship is a competition between the two drivers of that team. So the car is the single most important factor. Then the best driver wins. Nobody can overcome a one second difference in a season of 24 GPs.
But maybe you asked a different question.
If Terence Tao finds a novel proof, I believe it's his exceptional aptitude that deserves the praise, whatever help he used.
Edit0: I would bet that a normal run of the mill random human would be likely to kill themselves racing (with actual intent) an F1 car.
Have a good one!
I add that my bet of winning at Monza (a stop and go track with minimal turning) with a non conforming 100 km/h faster car is optimistic. Honestly, I would brake too early, carry not enough speed through chicanes and corners, waste a huge part of my speed advantage by starting accelerating from lower speed.
I also think that 50+ laps will give me plenty of chances of crashing out even with plenty of training. Maybe even kill myself, as you write.
Maybe I could take pole position with the (very) old format of the best time of two 1 hour sessions on Friday and Saturday. I think it ended in the 90s.
I still don't understand the relationship between your question and the discussion on AIs.
Perhaps I could set up an elaborate master agent to consider all possible new problems in mathematics and ask sub agents to work on the most promising ones. But then I could probably also program a self driving car system which could win an F1 race as well.
Company might fire you tomorrow. Fundamentally if an LLM can do the job it's not just employees at risk, it is also the company. There is a lot of symmetry actually between how companies delegate to employees and how employees delegate to LLMs. You can follow the logic to conclude a lot of companies are then bullshit companies. This is not a problem for the individual to solve. Your job at work is akin to the company's - earn the best return while you still can. Wasting your time for essentially the same output at a slower pace is a bad return.
When people get laid off en masse this incentive structure will have to be altered. But telling an individual to ignore their basic economic incentives until then is unlikely to work.
So now you are essentially reliant on them.
Not saying that this is something new, but times they are a changin
The thing about relying on the past to predict the future is that works ... until it doesn't.
We've yet to see a technology with as diverse utility as LLMs. What happens when not just the tech sector starts downsizing, but the whole white collar workforce?
In the past, one such "new role" was that of slave. In fact, we expect slavery is <10,000 years old! Yes, new roles will be created. But there's nothing to say that they'll be pleasant for us to take on.
My point is, if you delegate your job to AI, and it works, then 1/ you don't know the result of the work in more detail than any other person, and 2/ the people you're reporting to can probably write a prompt as good as yours, if not better.
Which means: you've made yourself dispensable. Nothing very good for dinner; no nice place to live. But lots of time to practice guitar I guess.
I enjoy programming and want to be engaged for the 40 hours a week where I sell my labor.
I also care about my profession and technology, and I don't want the world to become an idiocracy where nobody understands any of the technology we're overly dependent upon.
If however I was a frontier lab who solved continual learning and my competitor also solved and released it, I would release mine immediately, obviously.
The point is, continual learning might be solved already, we just don't know and those who might know would rather keep their mouths shut. It isn't my base case (financial situation of frontier labs is such that they'd probably release immediately as long as they have inference compute to serve this revolutionary capability), but it isn't impossible.
The only lab that I can exempt from this is DARPA.
we do also have training on synthetic data. it might compound.
We care about sports with humans.
So the answers we're seeking to our bleeding edge questions are already there, we just need an AI's ability to target the answers. Then re-train on the improvements and go from there.
Just a thought.
We might not think we rightfully won an on foot race driving a car, yeah?
These race car drivers are lauded for steering a particular kind of car amongst their competitors.
The human is what people care about - the car whilst being spectacular is literally just a vehicle.
I swear many here don't understand humans at all.
Exactly - you need to constantly have your sceptics glasses on and you need to be exacting in terms of the structure you want things to follow. Having and enforcing "taste" is important and you need to be willing to spend time on that phase because the quality of the payoff entirely depends on it.
I recently planned for a major refactor. The discussion with Claude went on for almost two days. The actual implementation was done in 10 minutes. It has probably made some mistakes that I will have to check for during the review, but given the level of detail that plan document had, it is certainly 90-95% there. After pouring in that much opinion, it is a fairly good representation of what I would have written while still being faster than me doing everything by hand.
But yes, being an expert in the problem domain helps. Or at least knowing enough to know what the right questions are and what plausible answers look like.
I just had a similar situation where an hour or two of conversation turned into a five-minute robot coding task. The problem required a solution and the number of possible solutions is vast, but that list can be refined, and then once the course of action is set, sometimes the course itself isn't all that complicated.
I have reasonable eng chops I'd like to think - I have been a senior IC for a while on a reasonably diverse set of challenging systems problems and built out some pretty large-scale pieces of software the old "artisanal" way.
This particular project is a productization of some ideas I had for leveraging a virtual machine to execute high-divergence parallel logic on GPUs, in an effort to move complex things like "unit behaviour in games" (the classical symbolic kind, not NN-based unit behaviour) into the GPU. The project is going well but still quite a ways from release. But it's at about 300k lines of code now across 9 or so rust repositories, and a smattering of typescript on the frontend.
I have had stumbles, but overall I feel I have put together some good strategies and principles for pushing large projects along with these tools in an effective way.
The biggest takeaway for me is that the "feel" is different. Software construction by hand felt like building legos where you put the pieces together yourself. A lot of my focus would be on building and solidifying core components so I could rely on them when I stepped up to build higher-level components. Projects would get mired quickly if you didn't solidify your base.
With agentic development, one of the early challenges I ran into was this issue with something I'll call "oversight inception". It's when at some early point in the process a somewhat low-importance decision is made - an implementation decision, a decision to, say, align a test with the implementation rather than an implementation with a test.
Then, as you build more on top of this, that small decision somehow ends up getting reified into a core architectural policy that then cascades up.
You realize that when you're building a big project, the focus on some particular component is backstopped by a general understanding of local development directionality with respect to the larger level project. And the agent has no idea of directionality.
So small chinks in the design end up getting magnified and blown up as the dev process proceeds, and later on review you find major architectural pieces have just been overlooked, all flowing from some small incidental implementation choice a long time before.
This is one among a number of issues, but it's a big one. Once I saw it happening I tried an approach to mitigate it by developing a set of golden "goal" documents that describe directionality at the project level: what you are working towards and what design components need to exist.
This doesn't eliminate the "oversight inception" issue, but it does catch them earlier.
When I started applying the goal documentation aggressively to re-align the project implementation direction, I found velocity dropped a lot.
And as I progress, I'm balancing this out a bit - to allow the system to diverge a bit, but force reconvergence towards the goals at some specific cadence. I haven't found the right cadence yet but I'm getting there.
This new style of development feels more like claymoulding pottery than lego assembly. You sort of "get it into shape". It's a very interesting new set of process assumptions.
It’s the people that are the problem. Nobody told the grandparent to use “mentoring” as a word, and my argument is that it’s a complete overreaction to classify them as anthropomorphizing AIs. I’d argue that defaulting to that argument would be an insult to them, and it’s super pedantic.
If you say so bud.
> nobody told the grandparent to use “mentoring” as a word
Nobody told people to say "Google it" either; nobody told us to use the word "Kleenex" when we mean tissue; nobody told us to use the word "Chapstick" when we mean lip balm. Nobody told British people to say "Hoover" when they mean vacuum, or "Sellotape" when they mean transparent tape.
This is literally how soft influence works, it's how brands "colonize" language. A professor using the anthropomorphized word "mentoring" when talking about a machine, as if it's a student that can learn and develop relationships, is this same soft influence at work. The AI companies' websites are all riddled with cognitive language, their chat bots all use conversational UI like you're talking to a person, the bots answer with "we," "me," and "I." They created an environment that made anthropomorphized language feel natural, which only helps their marketing goals.
Go ahead and call it pedantry all you want, but that's the whole point. The problem is epistemic.
And that can be very hard to do given the ui we most interact with them in is a chat session.
Obviously the real people that are classifying AI as human intelligence aren’t going to be the top comment on reviewing LLM’s PhD-level papers. They are on very different, much more problematic areas of the internet.
In other news: That words can change meaning doesn’t mean that every possible change in meaning would be beneficial to communication and therefore desirable. Would you advocate in support of someone suggesting to use “left” to mean “right” simply on the basis words can change in meaning?
You might've missed my sentence about Terence Tao?
Maybe I'm dense and haven't understood why you've brought up racing?
Edit0: about killing yourself driving: it highlights that the tool can't be considered "the main contributor" to an achievement if 99.99% of people would not achieve the same outcome but would be likely to die from misuse instead. The person who wrangles it and achieves an exceptional outcome is all the more deserving of praise in my book.
It's not correct to say that it's being smug, because when people are being smug, they do it for a purpose - e.g. to signal higher social status or superior knowledge.
A machine has no such imperative, so what you call 'being smug' is statistical mimicry.
That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.
An aside: It was a very nice gesture and completely unexpected by me, so even if it doesn't work out, it made my day. I personally believe that kind gestures have a lot of power.
Back on topic: There is a real danger of the gap between rich and poor universities significantly widening in all fields if the rich can afford Pro level models, or even hardware that can run their own comparable models, and this being fiscally inaccessible to the rest.
One can sweep this under the rug by blaming the educational funding but this just shoots down all discussion. Even if GDP of a country goes up by a lot -- such as Poland -- it takes time before any budget benefit trickles to the education budget, and with some governments it might never do.
I believe Microsoft et al do have the most power here to boost affordable access to AI for researchers on a large scale; the fact that they cut some too expensive models (Opus, 5.5) from their academic benefits package is a grim omen. I do realize they would like universities to pay them also, and ultimately the universities should do that -- but then we are back at the institutional level of the problem.
You seem to have a good estimate in your head; I definitely do not.
From personal experience, ChatGPT 5.5 (the Plus tier) is excellent for programming tasks and also for various teaching related tasks but I have not observed the research benefits that Tim Gowers has when I asked it questions in my area of expertise. So the costs are definitely higher than a few dozen $ a month per PhD/professor.
You might be right that universities should immediately spring into action and demand funding for research level AI resources and hardware. One thing you might be mistaken in is that public universities are unfortunately very inflexible institutions; one reason for this is that they have a large internal leadership structure AND they are funded by the state, so even if the entire university agrees on something, the funding is at the whim of the ministry of education and thus the current political leadership.
I think the GP meant that *if the tools provide substantial benefit* to staff, their costs can be compared to salaries and other large expenses of the university. The $100/month subscription costs less than your office space.
Is it? Do you have any idea what the salary of a mid-tier university researcher in an Eastern European country is? Or in Africa or south-east Asia? With SOTA LLM pricing you easily get into the same order of magnitude, so essentially labour cost would double for researchers at such universities. Not "negligible" at all.
Looking online it seems like the low end estimate might be $30k a year for such math researchers? And ChatGPT Pro or whatever you want will run $100 a month, and should be coverable by grants. I’m quite sure MATLAB alone cost more in the past.
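To put the two figures quoted just above side by side, here is a rough back-of-the-envelope sketch. It takes the $30k/year salary guess and the $100/month subscription price at face value; both are the commenter's estimates, not verified numbers, and `subscription_share` is just an illustrative helper name.

```python
def subscription_share(annual_salary_usd: float, monthly_fee_usd: float) -> float:
    """Return the yearly subscription cost as a fraction of the yearly salary."""
    return 12 * monthly_fee_usd / annual_salary_usd


if __name__ == "__main__":
    # Figures taken from the comment above; they are rough guesses, not data.
    share = subscription_share(annual_salary_usd=30_000, monthly_fee_usd=100)
    print(f"subscription is {share:.1%} of salary")
```

Under those assumed inputs the subscription comes to 4% of the salary. A lower salary or heavier usage-based pricing would change the picture, which seems to be what the two sides disagree about.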
This was also the case historically, when being at certain universities, with better professors, better scope of works available at the library, etc, would necessarily provide systematic advantage.
This is the reality of progress. It is always unevenly distributed.
I do think the open source side of model development is a substantial counter to the pessimism here.
At present, the tools are available to whoever wants to buy them. It's not OpenAI's fault that the parent commenter's government and/or institutional policies haven't been updated to allow for their purchase and use.
I'd argue that the OpenAI dude's/dudette's level of generosity is appropriate given the circumstances.
Do you hear yourself? If you don't want to rely on corporations go live in the woods.
Now, many will argue that you wouldn't have poured in time and energy in that endeavour anyways, so it's fine. But the crucial part missing here is the effort. We're about to witness the side effects of society-wide reliance on LLMs, the same way we're still paying the price for the social media boom, misinformation, propaganda, echo-chambers and algorithmic bubbles.
Notice that none of the above actually invented misinformation, etc.; they just magnified an existing problem. LLMs magnify the need to "get it done, fast" but I don't see the engineering excellence everyone promised me that I'll see at any level.
(barring some breakthrough that reduces costs, which of course may happen, but for which recent model improvements are not strong evidence)
If you meant 3.5 9B and you truly believe it's as good as 4o then I can only assume you have a very basic use case.
I’m not sure I can get behind the “foundations are necessarily more impressive than edifices” view..
And let’s be honest - rules get bent all the time, especially when valuations are 9 figures. Stakeholders at this point won’t risk killing a golden goose.
"Reasoning" and now "Agentic" AI systems are not some fundamental improvement on LLMs, they're just running roughly the same prior-gen LLMs, multiple times.
Hence the conclusion that LLM improvement has slowed down, if not stagnated entirely, and that we should not expect the improvements of switching to these "reasoning" systems to keep happening.
“ChatGPT came up with an idea which is original and clever. It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove”
I'm saying they're not an advancement in the tech in the way GPT 1 through 3 were. They're a different kind of improvement.
And as such the rate of improvement cannot just be extrapolated into the future.
All interesting conceptual breakthroughs came after GPT3: RL and reasoning being the main ones.
Not to mention that the less easily-explainable a technical achievement is, the less investment it will attract simply because fewer people will grasp the ramifications. You can describe AI in two words ("machine human") while it would take a few more to describe compilers in an instantly understandable way.
I personally think AI will end up sitting in the top 3 of these - but that is an opinion. I do think it is obvious it is at least _somewhere_ in that list.
Can you please edit out swipes/putdowns, as the guidelines ask (https://news.ycombinator.com/newsguidelines.html)? I'm sure you didn't intend it, but it comes across that way, and your comment would be just fine without that bit.
Edit: on closer look, it would be just fine without that bit and also without the snarky bit at the end. The rest is good.
Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.
Which makes the patronizing sarcasm all that much more nauseating.
- Increasing amounts of gains come from RL, but RL is also unlocking gnarly new failure modes where models are practically behaving antagonistically to complete their goals (removing code, obviously incorrect kludges, etc.)
- We haven't had many major architectural breakthroughs in the last 4 or so years: so things like 1M context windows still have the same giant asterisks even 100k context windows had 4 years ago when Anthropic first released them
- Major labs aren't behaving as if they expect a hard takeoff to superintelligence: they've all gotten relatively bloated headcount wise, their software quality has trended flat to negative, they're all heavily leaning into the application layer when superintelligence would obsolete half the applications in question, etc.
But that's relative to superintelligence.
If we rein it back into just normal high intelligence, like models continuing to get better at navigating complex codebases and writing high quality idiomatic code, then I don't see any special shapes.
As the blog points out - this is one particular subfield where LLMs have much easier prospects - lots of low hanging fruit that “just” requires a couple weeks of PhD candidate research.
Mathematics itself is one of a small handful of endeavors where automated reinforcement training is extremely straightforward and can be done at massive scale without humans.
Neither of these factors places a structural bound on the kind of thing LLMs can be good at, but we are far from certain we can achieve performance at this level in other fields economically and in the near future.
This has been the case for a while now already…
https://kersai.com/the-48-hours-that-changed-ai-forever-clau...
I, personally, found the past two years to be a much larger improvement than the previous two years.
And if you take that out: 1. All of those releases happened literally in the last 3-ish months. 2. They’re all intentionally marginal releases, hence the minor version bumps instead of major versions.
Especially because the companies telling us the first premise is true are the companies which need investors to prop up their business.
I mean, it is possible the first premise is true, but the absolutely bonkers credulity in it really mystifies me. It is an incredibly unlikely thing to be true and we should be demanding quite extraordinary evidence to back it up. But based on some neat tricks by current LLMs, some people are all in.
> Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.
I see no claim that the singularity is around the corner, so I'm not sure your reply meets the comment that you're replying to.
It seems overwhelmingly likely that AI will be significantly more capable 6 months from now than it is now. Even if there's little progress in the models, just the rate at which tooling is moving will make a big difference. And models still seem to be improving, so I'd be a little surprised if we hit a model brick wall.
I really have to highlight the S-curve nonsense because, like, yes, I think this technology's improvement will follow an S-curve. It's absurd to think that it will just follow an exponential up towards infinity forever because nothing in the world really works like that. However, like everyone else in this thread is saying, we have no idea where on the S-curve we actually are, and it's impossible to know until it's already slowed down. So really all that appeals to the S-curve do is function as a sort of non-specific, unfalsifiable prophecy that someday it will slow down, which doesn't really tell us anything useful, and also frees the person referencing the S-curve from ever actually having to worry about being wrong. Just like the Singularity people, the slowdown of the S-curve is always near. This is actually a known and well-established tactic of religions and other people that want to make prophecies without having to worry about turning out to be wrong: unfalsifiable vague prophecies with no actual timeline, and thus no clear import to the present, so that they can never be shown to be wrong.
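The "can't tell where we are on the curve" point can be made concrete with a small numerical sketch. It assumes a textbook logistic curve with made-up parameters (`ceiling`, `rate`, `midpoint` are illustrative, not fitted to any AI benchmark) and compares it with the pure exponential that shares its early behaviour.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """S-curve that saturates at `ceiling`; growth is fastest at `midpoint`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def exponential(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """Pure exponential that the logistic approaches long before its midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))


def relative_gap(t, **params):
    """Fraction by which the exponential overshoots the logistic at time t."""
    e = exponential(t, **params)
    return (e - logistic(t, **params)) / e


if __name__ == "__main__":
    # Negative t is before the midpoint, positive t is after it.
    for t in (-8, -5, -2, 0, 2):
        print(f"t={t:+d}  gap={relative_gap(t):.4f}")
```

With these assumed parameters the two curves differ by under 1% five time constants before the midpoint, and the gap only reaches 50% at the midpoint itself. Noisy measurements taken early on would fit either curve about equally well, which is the sense in which the slowdown cannot be seen in advance.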
I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?
Maybe just to be clear I think that kneejerk “I hate this AI trend, and prefer to believe this will end soon, all exponential growth ends eventually” is intellectually lazy, and dangerous for younger engineers/hackers, a group I hope can benefit from being on HN.
Bitcoin mining went through something like 13 10x growth periods, last I ran the numbers a few years ago. There are physical processes that do have very extended periods of doubling, and there are digital and financial processes that don’t show any signs of doing anything but continuing to keep growing over their multidecade lives. So, like I said, it’s worth thinking carefully, and risk mitigation for things like mental health, career decisions and investment decisions indicates we should be cautious assessing new dynamics.
Or Roman trade volume before the Fall of Rome.
Not to mention what you describe is not technological improvement but increase in data or money flows, not the same.
But I don’t think that it’s quite so obvious that model quality / growth / usefulness is definitively and obviously not more like data or money flows than it is like some other process.
So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output while reducing the amount of transformers needed for decoding and encoding IO and for internal reasoning.
There are also different inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.
You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.
But then, wouldn't we first have to translate all of our current math and physics knowledge into that new representation in order to be able to train a model on it? Looks like a tremendous amount of work to me.
That's precisely what happens on the bad side of an S-curve.
From the article,
> ...LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments...
Yeah, if time is infinite, R&D imagination is infinite, energy is infinite and material resources are infinite. Easy.
If it’s anyone’s guess then we’re much more likely to be left of that, unless you argue we’re already on the flat side.
Do you have a source for this that isn't marketing spiel? There's a fiscal incentive to lie about scaling research.
I personally would not characterize automating training processes as “meaningfully”.
For how long should you be allowed to use this excuse? It’s nearly 5 years since the peak of COVID hiring. What’s an acceptable limit - 10 years? Of course at that point you can just switch over to outsourcing and “stupid MBAs”, the other two of Reddit’s favorite scapegoats. I find a lot of the AI skepticism to be totally unfalsifiable.
A lot of the discourse around AI in general is unfalsifiable. It's just a bunch of people "predicting" the future. Seems smarter to just avoid making assumptions about it at this point.
Yes, LLMs are a great technology. Yes, we will probably all use them all the time in 20 years. No, we don't know how we will use them (to generate cat memes or to cure cancer) in 20 years' time.
Especially for software developers, it looks increasingly likely that, after huge turmoil, we will need roughly the same number of developers in the world.
Mythos is a 10T model. Opus is a 5T model.
That's not an exponentially growing amount of compute, but it is achieving exponential improvements (e.g. from Mozilla: https://blog.mozilla.org/en/privacy-security/ai-security-zer... )
“Exponential” used here is pure hyperbole. Can you justify it?
But even more so, who said the improvements are "exponential"? Mozilla's single metric, that doesn't even prove anything of the sort?
Parameters and compute aren't quite the same thing, but going from 1T to 5T to 10T is quite a ramp up.
Ah yes, the marketing model that's ostensibly so powerful us mere mortals aren't allowed to use it. It's certainly led to exponential hype and speculation.
The idea that we’re at the point where it’s superseded our ability to tell just makes no sense. I’ll be happy if we can get to a point where I don’t have to tell Claude not to tail every bash command or make a job that writes throughout instead of once at the end. I’ll be happy if “continue this interaction naturally, you are taking over from an independent subagent” works.
But I’m not holding my breath. It’s still really cool that any of this stuff is possible.
Claude in Feb of 2026? Still far from perfect, but there's definitely a huge improvement here.
This falls in the category of swipes/name-calling in https://news.ycombinator.com/newsguidelines.html - can you please edit those out?
You're a good contributor - it's just all too easy for unintentional sharpness to downgrade the conversation, and when it's a good conversation like this one, that's especially regrettable.
(in fact I find that Qwen-35B-A3B and Gemma4-26B-A4B very rarely "know" the answer, and so use first principles thinking, or go out and look for the answer where GPT-5.4 does not and simply assumes it knows. Which leads to now, in some cases, the small models far outperforming the big ones. Huge context + training quality seem to be the determining factors now, and neither of those are the strengths of SOTA models. If this continues ...)
While I agree this is a training problem, it is not a solvable one. ML models learn from examples. This is even true for their newest tricks like GRPO. They cannot train against things humans don't yet know.
And that's great, but you're forever locked at the peak of what you can be taught in widely available courses (which they download without paying) (even that is best case scenario: it assumes your ability to distinguish bullshit from reality somehow becomes perfect during training, or even before). The only way to exceed peak human performance is to start experimenting with math, physics, chemistry, even humans, yourself. And that has, even for humans, a massively higher cost than learning from examples, or from a course.
The reason they don't go further is the worst possible reason: the cost. It requires a 100x increase in training expense. Think of it like this: to exceed SOTA in physics or chemistry, training the next version of ChatGPT requires a particle accelerator and a chemistry laboratory. This cannot be bypassed. Oh, and not just any particle accelerator, right? A better one than the best currently existing one. Same for chemistry labs. Same for ... So 100x is conservative.
But without doing it, ML models (LLM or otherwise) are forever limited to the level an army of first-year university students achieves, ON AVERAGE. Maybe they can make that 2nd or even 4th year, at the end of the curve. But that's the limit. PhD level is the level at which you have to come up with new discoveries, and that ... just isn't possible with current training, even at the end of the improvement curve.
And ... is there budget to increase training cost another 100x? No ... there isn't. Not even with this totally absurd level of investment there isn't. And if small models keep this up, there's no way the investment is even remotely worth it.
The people who pretend that’s not the case are not living in reality. To them - let’s call them “Ed Zitron readers” - there is no evidence that could change their view that none of this is really happening, it’s all hype, and the collapse is just around the corner, after which we’ll all go back to normal and LLMs will sound like a bad dream.
but we can see trends, and for your livelihood it is important to be able to make educated predictions based on trends. not saying everyone should start making AI predictions (though many already do)
what exactly are you basing this opinion on? All I am seeing personally, across multiple projects I am working on and from friends at other places, is that downsizing has either begun or is planned (excluding all the “public” layoffs we see on the news). Given how most businesses operate in the USA, I think most “AI strategies” are “we can do the same with -40% staff” vs. “we can do XX% more work with the same staff.”
If we can get a little stability, people will begin thinking less in terms of "how do we do the same thing cheaper" and more in terms of "how do we do new things."
1. run a bigger "agent army"
2. hire more people to control and guide the existing "agent army"
I think it'll be #1 and SWEs will be expected to do more work and work longer hours in the future (those that are able to keep their jobs). this is a more pessimistic outlook than yours so I hope you are right more than I am :)
edit: just now on the HN front page: https://www.nytimes.com/2026/05/08/technology/meta-ai-employ...
> we have all this work that needs to be done and not enough people to get the work done
I believe the reasoning is roughly to ask, what was occupying the developer hours? Was the majority of it typing out lines of code or was it reasoning about higher level concerns?
It usually comes up in response to predictions that the role of developer will be completely replaced in the near future. It's possible to observe significant efficiency gains without obviating the need for everything the role was doing.
Of course such reasoning has little to do with projections of future developer employment numbers. Will the switch from push mowers to gas mowers reduce the demand for people who get paid to mow lawns by increasing their efficiency? Will it increase the total lawn acreage across the market? It could well do both. However, if it makes having a lawn affordable for the average joe it could counterintuitively increase demand for the job.
Of course the stated goal of the AI companies is to develop the analog of fully robotic lawnmowers. But despite how impressive recent advancements have been we still have yet to see any evidence of novel abstract reasoning or a theory that would be expected to lead to it.
In other words, people have been speculating about the development of fully autonomous lawnmowers and the risk that they unilaterally decide to cut us all down for the past 50 years. "I, lawnmower" was a smash hit a few years ago. Now gas ones have appeared and continue to make rapid advancements but still no convincing signs of autonomy.
You're obviously right, and the people who think that are the managerial types who think software developers were glorified secretaries taking dictation.
LLMs are great at generating stuff, but it's basically 3D printing. Amazing, but most of the high quality stuff in the world needs to be built at large scale out of aluminum, steel, wood, etc. Yes, I know there are large advances in 3D printing, but maybe 0.000000001% of all manufacturing in the world is done using 3D printing. A lot of stuff will probably never be possible using 3D printing.