AI solves International Math Olympiad problems at silver medal level(deepmind.google) |
AI solves International Math Olympiad problems at silver medal level(deepmind.google) |
> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Why not compare with students who are given 3 days to submit an answer ? /s
Not opensourcing anything.
This is a dead end on which no further research can be built.
It violates pretty much every principle of incremental improvement on which science is based. It's here just for hype, and the 300+ comments prove it.
[...]
"IMO medalists have also gone on to become notable computer scientists. The following IMO medalists have received a Nevanlinna Prize, a Knuth Prize, or a Gödel Prize; these awards recognise research in theoretical computer science."
https://en.wikipedia.org/wiki/List_of_International_Mathemat...
[1]: https://www.aeaweb.org/articles?id=10.1257/aeri.20190457
There are so many competitions that don't have any obvious practical significance. And people are still enjoying competitions where AI completely pwns humans.
Also, this is probably a good time to ask whether you won the Putnam... https://news.ycombinator.com/item?id=35079
Got to be kidding me. We are fucked
I don't really find that this impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem and you'll get there eventually.
Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).
The proofs of these problems aren't interesting. They were already known before the AI started work.
What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".
What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?
And, how much CO2 was released into earths atmosphere?
The result is (likely) net energy consumption, resulting in (likely) net CO2 emissions.
So, what was did it cost us for this achievement in AI?
EDIT TO ADD: It's fair to think that such a presser should not include answers to my questions. But, it's also fair to want that level of transparency given we are dealing with climate change.
It's that we are living in a period of time where there are very real consequences of nearly a century of unchecked CO2 due to human industry.
And AI (like crypto before it) requires considerable energy consumption. Because of which, I believe we (people who believe in AI) need to hold companies accountable by very transparently disclosing those energy costs.
Competition problems are designed to be actually solvable by contestants. In particular, the problems should be solvable using a reasonable collection of techniques and many "prep courses" will teach you many techniques, tools and algorithms and a good starting point is to throw that stuff at any given problem. So just like chess openings putting in lots of leg work will give you some good results for that part. You might very well lose in mid and late game, just like this AI might struggle with "actual problems"
It is of course still very impressive, but that is an important point.
you realize this holds true for all of math right? outside of godel incompleteness potholes every proof/theorem is a permutation of ZFC. And you can fix the potholes by just filling them in with more Cs.
Your point, however, is certainly a good one: IMO problems are an extremely narrow subset of the space of mathematical problems, which is itself not necessarily even 50% of the space of the work of a mathematician.
Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.
They explicitly stated that to achieve the current results, they had to manually translate the problem statements into formal mathematical statements:
> First, the problems were manually translated into formal mathematical language for our systems to understand.
How I understand what they're saying is that they used gemini to translate the problem statement into formal mathematical language and let DeepMath do it's magic after that initial step.
The article claims they have another model that can work without formal languages, and that it looks very promising. But they don't mention how well that model performed. Would that model also perform at silver medal level?
Also note, that if the problems are provided in a formal language, you can always find the solution in finite amount of time (provided the solution exists). You can brute-force over all possible solutions until you find the solution that proofs the statement. This may take a very long time, but it will find the solutions eventually. You will always solve all the problems and win the IMO at gold medal level. Alphaproof seems to do something similar, but takes smarter decisions which possible solutions to try and which once to skip. What would be the reason they don't achieve gold?
I know speed is just a matter of engineering, but looks like we still have a ways to go. Hold the gong...
(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)
In other words, how does AlphaProof fare on mathematical problems which aren't in the IMO style? (As such exceptions comprise most mathematical problems)
Combinatorics problems are usually simple enough that anyone can understand and try tackling it though, and the solutions in IMO are usually designed to be elegant. I don't think I've ever seen a bad combo problem before.
Your second paragraph conveyed exactly how I feel about combinatorics. Elegant and clever, on top of being understandable to even non math people.
I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).
Given how much of a dent LLM's are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.
I feel why combinatoric is harder for AI models is the same reason why LLM's are not great at reasoning anything out of distribution. LLM's are good pattern recognizers and fascinating at this point. But simple tasks like counting intersections at the Venn diagrams requires more strategy and less pattern recognition. Pure NN based models seem won't be enough to solve these problems. AI agents and RL are promising.
I don't know anything about lean but I am curious that proof of combinatorial problems can be as well represented as number theory or algebra. If combinatorial problem solutions are always closer to natural language, the failure of LLMs are expected. Or, at least we can assume it might take more time to make it better. I am making assumption in here that solutions of combinatorial problems in IMO are more human language oriented and relies on more common sense/informal logic when it compared to geometry or number theory problems.
Anyone know any details?
>because of limitations in reasoning skills and training data
One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?
The link you posted has problems with a dificulty between the first and second round that are much easier.
I took a quik look at the recent list of problems in the first and second round. I expect this new AI to get a solid 50/50 points in this test.
I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them
If you could automatically prove that your concurrency protocol is safe, or that your C program has no memory management mistakes, or that your algorithm always produces the same results as a simpler, more obviously correct but less optimized algorithm, I think that would be a huge benefit for many programmers.
Here, from what I understand, it's instead a theorem prover + LLM backing it. General proofs have a much larger search space than the 2d geometry problems you see on IMO; many former competitors disparage geometry for that reason.
> AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. … Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness.
Edit: To defend my response, the model definitely knows when it hasn't yet found a correct response, but this is categorically different from knowing that it does not know (and of course monkeys and typewriters etc., can always find a proof eventually if one exists).
For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?
This makes it natural address these problems using similar techniques, which is what this research team did. The "magic" in their solution is the use of neural nets to make good guesses about which branches of these massive search trees to explore, and make good guesses about how good any particular branch is even before they reach the end of the branch. These tricks let them (massively) reduce the effective branching factor and depth of the search trees required to produce solutions to math problems or win board games.
The thing is that most math "problems" are not solved not becasue they're hard, but because they're not interesting enough to even be discovered by humans.
That is more than half the work of solving them. Headline should read "AI solves the simple part of each IMA problem at silver medal level"
Real, exact, quotes from the top comments at 1 PM EST.
"I want to believe that the computer found it, but I can't find anything to confirm."
"Curing cancer will require new ideas"
"Maybe they used 10% of all of GCP [Google compute]"
So it failed at the first step (comprehension) and hence I think we can request a better effort next time.
(probably confused by version numbers)
IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].
This model, meanwhile, was wired up to a theorem proover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.
If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?
Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.
[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf
In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]
In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?
---
[a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
A theorem prover is probably more useful for the typical IMO problem than for the typical real research problem, but even so I'd guess that even with a reasonable amount of training most IMO contestants would not do much better for having access to a theorem prover.
Having three days would be a bigger deal, for sure. (But from "computers can't do this" to "computers can do this, but it takes days" is generally a much bigger step than from "computers can do this, but it takes days" to "computers can do this in seconds".)
Today's announcement is also not about proving Lean theorems by "dumb search". The success is about search + neural networks.
You're attacking critics for criticizing the solution that has failed, while confusing it for the solution that works to this day.
Clever forms of "dumb search" that leverage compute, DNNs, RL, self-play, and/or formal languages are not dumb at all.
I put the words "dumb search" in quotes precisely because I think critics who dismiss AI progress as such are missing the point.
We're not in disagreement :-)
And you keep shifting the goalpost on what's called "dumb" here.
It kinda had to be this way, I think. There's a something from nothing problem. Douglas Adams brilliantly starts at this point.
We don't understand something from nothing. We don't even have the language to describe it. Concepts like "complexity" are frustratingly resistant to formalization.
"There is no free will." Has recently resurged as a philosophical position... like it did in response to Newton's mechanics.
Matter from void. Life from minerals. Consciousness from electrons. Free will from deterministic components. Smart reasoning & intelligent rationalisation from dumb search, computation, DNNs and such.
I don't think this debate was supposed to be ended by anything short of empirical demonstration.
Endnote: deep blue's victory over Gary was a bunch of preprogrammed bulls--t. Rematch!
Then, should we expect less with mathematics where written language is the normal way knowledge is propagated, and where formal proofs are wanted? An important distinction here is the coupling of search (not LLM for this one), a formal math language, and theorem proving. Math intelligence may not be merely the math written corpus, but adding the formal language and theorem proving sounds pretty powerful.
All this still lacks self-directed goals. An intention. For now that's taken care of by the human asking questions.
It isn't. That's why LLMs are still terrible at basic reasoning tasks. The dataset doesn't contain the human intelligence, just the end results. Nobody is training their LLMs on brain scans.
And if they do, then what? If it is "too high" do we delay research because we need to keep the world how it is for you? What about all the other problems others face that could be solved by doubling down on compute for AI research?
First, it's keeping the world how it is for all of us, not just me.
Second, to answer you question, I think that is a decision for all of us to weigh in on, but before we can do that, we must be informed as best as we can.
Do sacrifices have to be made for the greater good? Absolutely. Do for-profit mega corporations get to make those decisions without consent from the public? No.
Right. My point is that you're attacking a position that no real critic holds. Of course we're in agreement then!
"Clever forms of dumb search are not dumb" feels a little like kicking down open doors. We were always going to agree.
I'm not so sure. I wrote my comment after seeing quite a few comments on other threads here that read like fancy versions of "this is just brute-force search over cleverly pruned trees." Search the other threads here, and you'll see what I mean.
Got a source to support your assertion that many people are okay with the effects of climate change?
https://www.scientificamerican.com/article/more-climate-laws...
> Stop trying to pretend your take is the humanitarian take.
That's a straw man. However, I cannot believe many humans are in support of an uninhabitable world.
It's great that you feel safe being so aloof, but I believe we have a responsibility in tech to turn down the AI hype valve.
The NYT is currently running a piece with the headline "Move Over, Mathematicians, Here Comes AlphaProof". People see that, and people react, and we in tech are not helping matters by carelessly making false comparisons.
Search is amazing. Protein folding searches for the lowest energy configuration. Evolution searches for ecological fit. Culture searches for progress, and science for understanding. Even placing a foot on the ground searches for dynamic equilibrium. Training a ML model searches for best parateters to fit a dataset. Search is universal. Supervised learning is just imitative, search is extrapolative.
Also the headline is fair, as I do believe that AlphaProof demonstrates an approach to mathematics that will indeed invade mathematicians workspaces. And I say that as a mathemstician.
> First, the problems were manually translated into formal mathematical language for our systems to understand.
The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?
> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
So didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.
The reason why formalization is harder than you think is that there is no way to know if you got it right. You can use Reinforcement Learning with proofs and have a clear signal from the proof checker. We don't have a way to verify formalizations the same way.
To you or me, sure. But I think the proof that it isn't for this AI system is that they didn't do it. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.
If you ask a AI (or person) to prove the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.
I'm suspicious, basically.
> While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.
However, it's unclear what initial format was given to the agents that allowed this step
To me, this sounds like Alphaproof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate Theorems, persumably in Lean. I.e. the set is {n: P(n)} for some formula or something. Then it searches for proofs.
I think if Alphaproof did not find {foo} but it was given then it would be very outrageous to claim that it solved the problem.
I am also very hyped.
> AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct.
still good progress nonetheless. won't call the system sufficient by itself tho.
Then he gave me a huge hint to the solution, after which it only took me a couple of hours to solve.
(Formalizing the solution is of course the hardest part, and might serve as a good masters dissertation I think)
But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."
Some people call this programming
Lean isn't just a formal language, it is also a theorem prover, Could the IMO participants use the nlinarith tactic? Could they use other tactics?
Of course not, they had to show their work!
Could they have mathematicians translate the problem statements into the formal language for them?
Of course not, they had to do it themselves. In "How to solve it" Polya stresses multiple times that formalizing the initial question is an important part of the process.
Then, the actual computational resources expressed in time are meaningless if one has a massive compute cloud.
I'm a bit dissatisfied with the presentation, same as with the AlphaZero comparison to an obsolete Stockfish version that has been debunked multiple times.
That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.
But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.
It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?
Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?
This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.
Hopefully more humans write types in Lean and similar systems as a much way of writing prompts.
yes, it is true, but getting to the country specific team is itself an arduous journey, and involves brutal winnowing every step of the way f.e. regional math-olympiad, and then national math-olympiad etc.
this is then followed by further trainings specifically meant for this elite bunch, and maybe further eliminations etc.
suffice it to say, that qualifying to be in a country specific team is imho a big deal. getting a gold/silver from amongst them is just plain awesome !
Source: a friend who got silver on the IMO
I found the work environment to be more important than the subject when it comes to work satisfaction. If you are working on a world changing subject with a team of assholes, you are going to have a bad time, some people really have a skill for sucking the fun out of everything, and office politics are everywhere, especially on world changing subjects.
On the other hand, you can have a most boring subject, say pushing customer data to a database, and have the time of your life: friendly team, well designed architecture, time for experimentation and sharing of knowledge, etc... I have come to appreciate the beauty of a simple thing that just works. It is so rare, maybe even more rare than scientific breakthroughs.
Now, you can also have an awesome work environment and an awesome subject, it is like hitting the jackpot... and a good reason to be envious.
Pretty much all the top AI labs are both intensely competitive and collaborative. They consist of many former IMO and IOI medalists. They don't believe in remote work, either. Even if you work at Google DeepMind, you really need to be in London for this project.
Yet no one cares. Everyone's busy watching Magnus Carlsen.
We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.
This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.
IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took the daunting task of formalizing the entirety of mathematics to themselves.
This lack of formalizing in math papers was what killed any attempt at automation, because AI researcher had to wrestle with the human element of figuring out the author's own notations, implicit knowledge, skipped proof steps...
This seems so weird to me - AGI is undefined as a term imo but why would you expect "producing something generally intelligent" (i.e. median human level intelligence) to be significantly harder than "this thing is better than Terrence Tao at maths"?
On the other hand, navigating the real world mostly consists in employing a ton of heuristics we are still kind of clueless about.
At the end of the day, we won't know before we get there, but I think my reasons are compelling enough to think what I think.
Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.
I don’t know what Lean is or how see AI’s proofs but an AI system that can explain such a question on par with the YouTuber above would be fantastic!
"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909
Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.
Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And is precisely the reason why calculators aren't allowed.
But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
Since I think you need an account to read threads now, here's a transcript:
Google DeepMind have produced a program that in a certain sense has achieved a silver-medal peformance at this year's International Mathematical Olympiad.
It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.
However, that statement needs a bit of qualifying.
The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.
If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.
Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.
However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.
So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.
That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.
Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.
It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.
But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.
The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".
However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.
I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.
Also perhaps a LLM could do the translation step that is currently manual?
While there are certainly some huge jumps in compute power, theory of data transformation and availability of data to transform, it would surprise me if computers in a 100 years do not still rely on a combination of well-defined and well-understood algorithms and AI-inspired tools that do the same thing but on a much bigger scale.
If not for any other reason, then because there are so many things where you can easily produce a great, always correct result simply by doing very precise, obvious and simple computation.
We've had computers and digital devices for a long while now, yet we still rely heavily on mechanical contraptions. Sure, we improve them with computers (eg. think brushless motors), but I don't think anyone would be surprised today about how did anyone design these same devices (hair dryers, lawn mowers, internal combustion engines...) before computers?
Also it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done in AI. Perhaps people going into trades had a point.
Math => Code => Simulation => Robots => GG
Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?
If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.
All I've learnt from this is that they used an unstated amount of computational resources just to basically brute force what a human already is capable of doing in far less time.
Time is not a very interesting dimension here, because humans don't use the same CPU as huge GPU clusters. The binary "is it able to reach a solution given enough resources?" is more interesting (for GPT/Claude the answer is a clear negative).
If these problems were important to solve, redundantly by thousands or millions of people (like the real work that most people do), far more people would put in the effort to learn how to solve these problems.
It's just a weird comparison. Contests are very artificially structured in ways that don't make sense for comparing to computers.
Now, you may say how time is not a useful dimension here, but really, this is where we are seeing a lot of these advances come from: general researchers today do get access to huge compute capability, allowing them to quickly iterate on different ideas. In a sense, they can be less smart about their use of resources and simply try things out: this does drive the innovation (compared to waiting for their turn on a supercomputer nearby).
And finally, time is essential even for humans: given a couple hundred years, they will find the proof for Fermat's last theorem, but they might not do it in 4.5h. Since we are comparing AI capabilities to humans in the article, it's very possible that increased compute will never allow AI to find novel proofs we have not come up with either. That's where the AI bit comes in: we know that brute searching through the entire possible space of proofs is still too expensive for our compute capabilities, so we need AI to emulate our "intuition" when looking for the direction to narrow down the search.
So there are really two reasons time matters: 1. getting enough of compute might still be far away (heck, prime factorization and elliptic curves are still the basis of the most of world cryptography for that reason) and 2. maybe it's not even enough to increase compute capability to make huge jumps in problem solving capabilities (iow, maybe we are reaching maximum of where the approach can take us).
In other words model which can solve any/most school / college exam problem isn't necessarily smart. It can be just a database, or, in fact, a lookup table. Smarter version can be a lookup + 1 step test. Not saying it's bad, but it doesn't scale to less formalized domains. BTW, in this case formalization was done by humans.
In case this confuses anyone: the high school students in question are not a standard sample of high school students. AFAIK, they are teams of the ~6 strongest competitive problem solving high school students from each country.
Most of DeepMind’s research is a cost-centre for the company. These press releases help justify the continued investment both to investors and to the wider public.
The effect of establishing oneself as the thought leader in a field is enormous.
For example, IBM's stock went up 15% the month after they beat Kasparov.
I'd say "welcome to the web" but this was true in 1800s newspapers as well.
By contrast, this kind of open ended formalization is something where progress used to be extremely slow and incremental. I worked in an adjacent field 5 years ago and I cannot stress enough that these results are simply out of reach for traditional automated reasoning techniques.
Real automatic theorem proving is also useful for a lot more than pure mathematics. For example, it's simple to write out an axiomatic semantics for a small programming language in lean and pose a question of the form "show that there exists a program which satisfies this specification".
If this approach scales it'll be far more important than any other ML application that has come out in the last few years.
The blog post indicates the opposite. The geometry problem in the IMO problem set was solved by AlphaGeometry 2, which is an LLM based on Google's Gemini. LLMs are considered relatively general systems. But the other three solved problems were proved by AlphaProof, which is a narrow RL system that is literally based on AlphaZero, the Go and Chess AI. Only its initial (bootstrapping) human training data (proofs) were formalized and augmented by an LLM (Gemini).
Don't dismiss search, it might be brute force but it goes beyond human level in Go and silver at IMO. Search is also what powers evolution which created us, also by a lot of brute forcing, and is at the core of scientific method (re)search.
That makes it, in principle, similar or even easier than a champsionship-level chess move, which often take more than 1 hour for a professional human (with more training than an IMO high school student) to solve.
Another interesting concern is that when posing a problem to humans, it's fine to pose an "easy" brute-forceable problem, but humans, being slow brute-searchers, need to find more clever solutions. But if you give such a problem to a computer, it can trivialize it. So to test a computer, you need to pose non- easily-brute-forceable problems, which are harder for the computer than the others, but equally difficult for the humans as the other problems are.
The issue is that to find solutions for useful problems you're often searching through highly complex and often infinite solution spaces.
You don't need to be able to prove very hard problems to do useful work. Proving just simple things is often enough. If I ask a language model to complete a task, organize some entries in a certain way, or schedule this or that, write a code that accomplishes X, the result is typically not trustworthy directly. But if the system is able to translate parts of the problem to logic and find a solution, that might make the system much more reliable.
But MCTS was always promising when married to large NNs and DeepMind/Brain were always in front on it.
I don’t know who fucked up on Gemini and it’s concerning for Alphabet shareholders that no one’s head is on a spike. In this context “too big to fail” is probably Pichai.
But only very foolish people think that Google is lying down on this. It’s Dean and Hassabis. People should have some respect.
People want always knock generative AIs for not being able to reason, and we've had automated systems that reason perfectly well for decades, but for some reason that doesn't count as AI to people.
In my understanding, proofs are usually harder to transcribe into Lean which is nobody _writes_ proofs using Lean.
What is a nlinarith?
has the example:
0 ≤ x^2 if x : ℝ
which humans simply use without proof. The IMO doesn't challenge participants to prove everything, only the main ideas.(/s)
I guess I should have pursued a PhD when I was younger.
The proof will only have value if it's meaningful to us.
Nah, as a consumer it makes no difference to me if a meat packing factory or Amazon warehouse employs 5000 or 5 people. To art, this principle is totally real, but for work, it only applies to some/most of it.
Imagine a machine doing "work" that only serves itself and other machines that does no service to humanity. It would have no economic value. In fact the whole concept of "work" only makes sense if it is assigned economic value by humans.
Actually, I was looking up Elo ratings of the top computer chess players, and learned that it is not that trivial to compare these, due to differences in hardware requirements and whatnot.
Fischer was probably the last great player who was unassisted by tools.
In the meantime, while DeepBlue beat the world chess champion, Kasparov, at chess, our best efforts at generalism - LLMs than many (not me!) think are the path to AGI - struggle to play tic tac toe.
"Better than Terrence Tao at solving certain formalized problems" (not necessarily equal to "Better that Terrence Tao at maths) isn't.
As opposed to 0.999999% of the human population who can't do it even if their life depends on it?
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
So they had three days to keep training the model, on synthetic variations of each IMO problem.
Also, there's no reason to affirm that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)
But even if they aren't as fast as humans yet this is very valuable. Both as a stepping stone, and because at a certain scale compute is much easier to scale than skilled mathematicians.
So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.
I'd guess it's the first of those.
(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)
As for the long solve times, I would guess they're related to this fascinating remark:
> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).
How many cycles of guess-and-check did it take over the course of three days to get the right answer?
If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?
> let's be real I'd be okay waiting a month for the cure to cancer.
I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.
First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
Secondly, this is now some hacky trick where Brute force and some theorem prover magic are massaged to solve a select few problems and then you'll never hear about it again. They are building a general pipeline which turns informal natural lamguage mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
> it still took 8x as long to solve the problems as our best humans did without any computer assistance.
Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".
LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.
Or if AlphaProof used more compute they could have slashed that time to 1/10 or less. It's arbitrary as long as we don't define what is the compute the AI should be entitled to use here.
maybe not, because you need to connect many complicated topics/terms/definitions together, and you don't have a way to reliably verify if formalized statement is correct.
They built automatic formalization network in this case, but didn't trust it and formalized it manually.
i understand that there are very hard questions for the olympiad. but it might be possible to learn about some recurring types of them by looking at past instances. it may not be the meta for IMO but has been for other kinds of exams.
I'm also waiting for my answer on the role of the Gemini formalizer, but reading between the lines, it looks like it was only used during training the "solver network", but not used in solving the IMO problems. If so then the hyping is greatly premature, as the hybrid formalizer/solver is the whole novelty of this, but it's not good enough to use end-to-end?
You cannot say AlphaProof learned enough to solve problems if formalization made them easier to solve in the first place! You can say that the "solver network" part learned enough to solve formalized problems better than prior training methods.
Docs: https://leanprover-community.github.io/mathlib4_docs/Mathlib...
Still, an impressive result for sure.
EDIT - I hope if that new class of bug exists that it is found. I hope that new class of bug doesn't exist.
It looks like "jealous" is more being afraid of losing something you have (most commonly e.g. a spouse's affection) to someone else, whereas "envious" is wanting what someone else has?
This is Nvidia's main area of research now.
[1] https://terrytao.wordpress.com/career-advice/theres-more-to-...
[2] https://en.m.wikipedia.org/wiki/Classification_of_finite_sim...
What I meant to convey, is that theorem proving at least is a well-defined problem, and computers have had some successes in similar-ish search problems before.
Also I don't think pure brute-force was ever used to solve any kind of interesting problem.
Chess engines make use of alpha-beta pruning plus some empirical heuristics people came up with over the years. Go engines use Monte-Carlo Tree Search with straight deep learning models node evaluation. Theorem proving, when it is solved, will certainly use some kind of neural network.
Search is different from either of those things, it's when you have a target and a collection of other things, and are trying to find the target in that collection.
- Search allows exploration of the game tree, potentially finding novel strategies.
- Learning compresses the insights gained from search into a more efficient policy.
- This compressed policy then guides future searches more effectively.
Evolution is also a form of search, and it is open-ended. AlphaProof solved IMO problems, those are chosen to be out of distribution, simple imitation can't solve them. Scientists do (re)search, they find novel insights nobody else discovered before. What I want to say is that search is on a whole different level than what neural nets do, they can only interpolate their training data, search pushes outside of the known data distribution.
It's actually a combo of search+learning that is necessary, learning is just the little brother of search, it compresses novel insights into the model. You can think of training a neural net also as search - the best parameters that would fit the training set.
I believe even Wiles in a documentary described his search for the proof of Fermat’s last theorem as groping around in a pitch black room, but once the proof was discovered it was like someone turned the lights on.
I don't think that's the case at all. The writing was already on the wall.
It's a math Language Model. Not even sure it's a Large Language Model. (Maybe shares a foundational model with an English LLM; I don't know)
It learns mathematical statements, and generates new mathematical statements, then uses search techniques to continue. Similar to Alpha Go's neural network, what makes it new and interesting is how the NN/LLM part makes smart guesses that drastically prune the search tree, before the brute-force search part.
(This is also what humans do to solve math probrems. But humans are really, really slow at brute-force search, so we really almost entirely on the NN pattern-matching analogy-making part.)
Both changes were trivial: it had one incorrect (but unnecessary) import, and it used the syntax from Lean 3 instead of Lean 4 in one lambda definition. A system that was trained harder on Lean would not make those mistakes.
The one actual error it made was in not proposing that the other direction of the "if and only if" is required. Again, I am quite confident that this formalisation failure mode is not hard to solve in a system that is, like, actually trained to do this.
Obviously formalising problems that a working mathematicican solves is dramatically harder than formalising IMO problems, and is presumably way ahead of the state of the art.
Why?
The interesting thing about math, or science, and art in general, comparing it to games like chess or go is that science gives you the freedom to continue to excel as a human while in games we have lost the game and/or the league.
Science and art are infinite and no AI can produce infinity.
You can just use another implementation nowadays.
(We don't know because we can't run AlphaGo to compare them).
It's obvious that's happening with LLMs even today to ensure they don't spew out too much bullshit or harmful content. So let's get to a point where we can trust AI as-is first, and let's talk about what's needed to achieve the next milestone after and if we get there.
And I love asking every new iteration of ChatGPT/Gemini something along the lines of "What day was yesterday if yesterday was a Thursday?" It just makes me giggle.
https://chatgpt.com/share/b657f0cd-34f9-4f5c-8ac9-c3a06c7101...
What is great about that Thursday prompt is how naked the LLM is to the reality that it knows absolutely nothing in the way we think of "to know". The bubble we are in is just awesome to behold.
Find n such that p(n) and not p(n-1).
p(n):
exists(f: state -> move) such that solves(f, n)
state: solved | illegal | (k, is_first_move in {T,F}, px in (1,2023), py in (1,2024+1), mapping from x,y to {T,F, ?})
initial_state(n): (n, T, 1, 1, {(x,y) -> ?})
move: U|D|L|R|(x in (1,2023))
power: ((a -> a), integer) -> (a->a)
power(h, 0)(x) = h(x)
power(h, k)(x) = h(power(k-1))(x)
solves(f, n): exists l such that for every board, power(make_move(board, f), l)(initial_state(n)) = solved
board: permutation of (1, 2, 3, ..., 2022+1)
make_move: (board, (state -> move)) -> (state -> state)
make_move(board, f)(solved): solved
make_move(board, f)(illegal): illegal
make_move(board, f)(s = (k, is_first_move, px, py, m))
if is_first_move = T:
if k = 0:
illegal
else if f(s) is a number:
(k, F, f(s), 1, m)
else:
illegal
else:
if f(s) is a number:
illegal
else if py = 2025:
solved
else if board(px) = py and py != 1:
(k - 1, T, px, py, m + {((px, py), T)})
else:
dy = {U:-1,D:1,L:0,R:0}(f(s))
dx = {U:0,D:0,L:-1,R:1}(f(s))
px' = px+dx
py' = py+dy
if px' < 1 or px' > 2023 or py' < 1 or py' > 2025:
illegal
(k, F, px', py', m + {((px, py), F)})Personally in the past I tried a few times to formalize some statements and sometimes I found that the mathlib libraries were pretty lacking in these more open-ended problems (I wanted to reason about lists and stuff). But it seems that I am just very bad at formalization lol.
Problems are certainly not trivial, but humans are not really putting all their effort into it either, and the few that do train for it, on average medal 50% of the time and get a silver or better 25% of the time (by design) with much less time available to do the problems.
A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human to manually solve them?
(In contrast to problems actually from "a huge database of IMO-type problems", they do have answers for these already).
Formal proofs can be mechanically checked if it's correct or not. We just don't know what's the answer. Think it as an extremely rigorous type system that typically requires really long type annotations, like annotation itself is a complex program. So if AlphaProof happens to generate a proof that passes this checker, we know that it's correct.
An incorrectly formalized problem is a different problem and a solution to any formalized problem still useful for AI training because such solutions can be mechanically checked for correctness and this does not require the hire of humans. What requires humans is the initial formalization process since that is more a language translation task which requires nuance and judgment and is difficult to mechanically verify.
While there is no perfect method, it is possible to use the agent to determine if the statement is false, has contradictory hypotheses, or a suspiciously short proof.
That said, the gap from "can't do it at all" to "can do it in 60 hours" is probably quite a bit larger than the gap from 60 hours to 1.5 hours.
They look like what a human would write if they were trying to come up with a formal proof (albeit it does some steps in a weird order).
Agreed that the absolute upper tier of chess players have trained longer and harder than most or all IMO contestants. Though I do wonder which (top-tier chess or the IMO) draws on a larger talent pool. To my understanding, a significant fraction of all high school students on Earth take some form of qualifying exam which can channel them into an IMO training program.
And as far as the being amenable to brute force (relative difficulty for humans vs. computers): it seems that chess was comparatively easier for computers, IMO problems are comparatively easier for humans, and the game of Go is somewhere in between.
It's true that IMO problems are vetted as being solvable, but that still doesn't really shed any information on how the difficulty of an IMO problem compares to the difficulty of chess play.
Another basic requirement is that valid moves / inference steps and the winning condition can be efficiently verified using some non-AI algorithm. Otherwise there would not be a reward signal for the reinforcement learning algorithm. This is different from answering most natural language questions, where the answer can't be checked trivially.
A human doesn't need to understand the proof, they just have to understand why the proof is a proof.
Yes, it was still proven. If I don't speak English but come up with a proof, I can't communicate the proof, but I have still proven (ie created proof) it.
To what extent is the proof of Fermat's Last Theorem "incomprehensible to humans" because only like a dozen people on the planet could truly understand it - I don't know.
The point of new proofs is really to learn new things about mathematics, and I'm sure we would learn something from a proof of Goldbach's conjecture.
Finally if it's not peer reviewed then its not a real proof eh.
Thinking back to Wiles' proof of FLT, it took the community several years of intense work just to verify/converge on the correct result. And that proof is ~130 pages.
So, what if the computer produced a provably correct, 4000-page proof of the Goldbach conjecture?
(Source: I am a two-time IMO silver medalist.)
If verifying a good idea is easy, then the evidence shows that the AI didn't have good ideas for the other 2 problems.
Humans are even better at this as you mention - but effectively the approach is similar. Come up with lot of ideas and see what proves it.
Eg. if it asked you to do something computationally hard (when done in a brute force way: calculators not allowed), and you put out the answer without the process using formal math knowledge, you'd get zero points. Even if it was easy to prove the answer was correct.
That's why IMO and many math exams won't take a single answer even if correct.
("Scare quotes")
Even within fairly modest finite limits, you can produce a solution space that cannot be significantly searched with the available finite matter and time available in the observable universe.
Thus, the problem with using search isn't that solution spaces can be infinite, but that finite solution spaces can be unimaginably large.
IDK, even for translation between languages outside of mathematics, missing small qualifiers that change the whole meaning of the sentence is a worrying failure mode I've seen, and with mathematical problems there are a lot more cases like that.
{Formal description of the set in question} = ?
And then Alphaproof has to find candidate descriptions of this set and prove a theorem that they are equal to the above.
I doubt they would claim to solve the problem if they provided half of the answer.
Stranger things have happened
This falls under extraordinary claims require extraordinary proof and we have seen nothing of the sort.
They clarified above that it provided the full answer though.
"We established a bridge between these two complementary spheres by fine-tuning a Gemini model to automatically translate natural language problem statements into formal statements, creating a large library of formal problems of varying difficulty."
I'm confused, is the formalization by Gemini or "manually"? Which is it?
If this were the case then the headline would be "AI solves 4/6 IMO 2024 problems", it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.
> This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
This is great, and I'm not complaining about what the team is working on, I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".
but this part currently sucks, because they didn't trust it and formalized problems manually.
I also expect that math formalized by machine will be readable by machine and hardly understandable by humans.
Most students were simply not interested. And certainly, not everybody is equally talented, but the motivation to achieve competition success is needed too — perhaps you had the latter but not enough of the former. I also believe competitive maths is entirely different from research maths (time pressure, even luck is involved for a good idea to come up quickly, etc). Since you said you were a potential bronze medal winner, it might not even be a talent issue but maybe you just had great competition and someone had the better luck in one or two tests to rank above you (better luck as in the right idea/approach came to them quicker, or the type of problem that appeared on the test suited them more). And if you are from a large country like USA, China or Russia (topping the medal table), it's going to be freaking hard to get into a team since you'll have so many worthy students (and the fact they are not always scoring only golds for their teams out of such large pools tells me that the performance is not deterministic).
As a mathematician, I am sure you'd agree you'd want to run a lot more than a dozen tests to establish statistical significance for any ranking between two people at competitive maths IMO style, esp if they are close in the first few. As an anecdote, many at my school participated in national level maths and informatics competitions (they start at school level, go on to county/city level to nation level) — other than the few "trained" competitors staying at the top, the rest of the group mostly rotated in the other spots below them regardless of the level (school/county/nation). We've actually joked amongst ourselves about who had the better intuition "this time around" for a problem or two, while still beating the rest of the country handily (we've obviously had better base level of education + decently high base talent), but not coming close to "competitors".
I, for instance, never enjoyed working through math problems and math competitions (after winning a couple of early age local ones): I've finished the equivalent of math + CS MSc while skipping classes by only learning theory (reading through axioms, theorems and proofs that seemed non-obvious) and using that to solve problems in exams. I've mostly enjoyed building things with the acquired knowledge (including my own proofs on the spot, but mostly programming), even though I understood that you build up speed with more practice (I was also lazy :)).
So, let's not trivialize solving IMO-style problems, but let's not put them on a pedestal either. Out of a very small pool of people who train for it, many score higher than AI did here, and they don't predict future theoretical math performance either. Competition performance mostly predicts competition performance, but even that with large error bars.
And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems.
While I agree that not all problems show this kind of acceleration in performance, that's typically only true if you've already spent so much time trying to solve it that you've asymptoted to the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there's so many research dollars flowing into AI precisely because the potential upside here is nowhere near realized and there's lots of research lines still left to be explored. George Hinton ended the AI winter.
If you need to solve 1000 problems in 3 days you wouldn't find the humans that can do it. So it would not be cheaper if it's not possible.
Conceptually, if you're trying to show a conjunction, then it's the other player's turn and they ask you for a proof of a particular case. If you're trying to show a disjunction then it's your turn and you're picking a case. "Forall" is a potentially infinite conjunction, "exists" is a potentially infinite disjunction.
In classical logic this collapses somewhat, but the point is that this is still a search problem of the same kind. If you want to feel this for yourself, try out some proofs in lean or coq. :)
As for humans, 100 countries send 6 students to solve these problems. It also doesn’t mean that these problems aren’t solvable by anyone else. These are just the “best 6” where best = can solve and solve most quickly. Given a three day budget, 1000 problems could reasonably be solvable and you know exactly who to tap to try to solve them. Also, while the IMO is difficult and winners tend to win other awards like Field Medals, there’s many professional mathematicians who never even bother because that type of competition isn’t interesting to them. It’s not unreasonable to expect that professional mathematicians are able to solve these problems as well if they wanted to spend 3 days on it.
But in terms of energy per solve, humans are definitely cheaper. As you note the harder part is scaling it out but so far the AI isn’t solving problems that are impossible for humans, just that given enough time it managed to perform the same task. That’s a very promising result but supremacy is slightly a ways off for now (this AI can’t win the competition for now)
Making a proof formal doesn't mean it is not understandable any more. First, certainly a machine can "understand" it now. I think with AI improving what exactly such an understanding is worth, and what you can do with it, will increase a lot.
Secondly, the inscrutable tactics-based proofs featured in Lean (I have written such proofs in Isabelle in 1996) are actually quite old-fashioned for such a "modern" prover. Isabelle has long featured human-friendly syntax, and these proofs are certainly as understandable as English text, if written with care.
What we will soon get are proof assistants which allow you to write your proofs (and definitions) in English, but which are still fully formal and checkable. That is an immense help if you are producing new knowledge, because usually nobody else looks at your work with sufficient scrutiny in the first place. If you understand it, and in addition the machine "understands" it, I think that will be the gold standard for proofs very soon.
It will also help mathematicians to clean up their act ;-)
For example, write a formal specification of a function in Dafny on Liquid Haskell and get the LLM to produce code that is formally guaranteed to be correct. Logic-based + probability-based ML.
All GOFAI ideas are still very useful.
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
Those 3 also have a lot of training data as well. Hoping Lean gets more support as it is very friendly.
As a basic learning resource focused on software engineering, there's [1]. But nothing more advanced I am aware of.
[1] The Hitchhiker's Guide to Logical Verification. https://cs.brown.edu/courses/cs1951x/static_files/main.pdf
Do you know if a system like the OP is learning from failed tests to guide future tests, or is it a truly a brute force search as if it were trying to mine bitcoin?
>We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
I read that to say the model's token weights are adjusted as it goes, so in an LLM sense it is kind of learning. It isn't reasoning through an answer in the way a human does though. Meaning, the model is still just statistically predicting what an answer may be and checking if it worked.
I wouldn't chalk that up to learning at all. An AI solving complex math doesn't even seem too impressive to me with the predictive loop approach. Computers are well adept at math, throwing enough compute hardware at it to brute force an answer isn't suprising. I'd be really impressed if it could reliably get there with a similar number of failed attempts as a human, that could indicate that it really learned and reasoned rather than rammed through a mountain of failed guesses.
I do think this language is considerably more robust than the typical programming language, which means a sound program is more likely to end up being valid/“correct”. But still, that’s a difference of degree, not kind, IMO
Then you need to implement that logic in software, and again, you can and will mistakes here. You will use the first version of that software, or another logic software, to verify that your informal thoughts why your logic implementation is correct, can be formalised and checked. You will find mistakes, and fix them, and check that your correctness proof still goes through. It is very unlikely that it won't, but if it doesn't, you fix your correctness proof. If you can indeed fix it, you are done, no mistakes remain. If you cannot, something must be wrong with your implementation, so rinse and repeat.
At the end of this, you have a logic, and a logic implementation, which doesn't contain any mistakes. Guaranteed.
Validating a math proof either terminates in a reasonable time (in which case it’s useful for training), or does not (in which case the AI should be discouraged from using that approach).
I'd be hard to know how many failed attempts the human made. Humans are constantly thinking of ideas and eliminating them quickly. Possibly to fast to count.
It sounds like AlphaProof will doggedly create full proofs for each idea.
Is what the human is doing testing attempts?
There's definitely an aspect of this that is 'airplanes, not birds.' Just because the wings don't flap doesn't mean it can't fly, though.