AI solves International Math Olympiad problems at silver medal level (deepmind.google)

1370 points by ocfnash 1 year ago | 525 comments

Smaug123 1 year ago |

So I am extremely hyped about this, but it's not clear to me how much heavy lifting this sentence is doing:

> First, the problems were manually translated into formal mathematical language for our systems to understand.

The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?

ocfnash 1 year ago | |

The computer did find the answers itself. I.e., it found "even integers" for P1, "{1,1}" for P2, and "2" for P6. It then also provided a Lean proof in each case.

freehorse 1 year ago | | |

It would make a lot of sense for the lean-code-formalisation of the problems done by the researchers fed to the AI to be provided. Not assuming bad intent in not providing them, but it would help understand better the results.

nnarek 1 year ago | | |

The formal statement of the first theorem already contains the answer to the problem: "{α : ℝ | ∃ k : ℤ, Even k ∧ α = k}" (which means the set of even real numbers). If they say that they translated the first problem into a formal definition, then it is very interesting how they initially formalized the problem without including the answer in it.
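
For readers unfamiliar with Lean, a statement of this shape could look roughly like the sketch below. It is a reconstruction under assumptions, not the file DeepMind released: the theorem name `imo2024_p1_sketch` is made up, the left-hand set is one reading of the informal problem (all real α such that n divides ⌊α⌋ + ⌊2α⌋ + … + ⌊nα⌋ for every positive integer n), the `∑ i ∈ …` notation assumes a recent Mathlib, and the proof is left as `sorry`. Only the right-hand set is taken from the statement quoted above.

```lean
import Mathlib

-- Hypothetical reconstruction of a "determine all α" statement.
-- Left side: the set described by the problem. Right side: the claimed answer.
theorem imo2024_p1_sketch :
    {α : ℝ | ∀ n : ℕ, 0 < n → (n : ℤ) ∣ ∑ i ∈ Finset.Icc 1 n, ⌊(i : ℝ) * α⌋} =
    {α : ℝ | ∃ k : ℤ, Even k ∧ α = k} := by
  sorry
```

Written this way, the right-hand set is the answer itself, so a formalization prepared before the problem is solved would have to leave that side open for whoever (or whatever) fills it in.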

Davidzheng 1 year ago | | |

Can you elaborate on how it makes guesses like this? Does it do experiments before? Is it raw LLM? Is it feedback loop based on partial progress?

summerlight 1 year ago | |

To speak generally, that translation part is much easier than the proof part. The problem with automated translation is that the translation result might be incorrect. This happens a lot when even people try formal methods by their hands, so I guess the researchers concluded that they'll have to audit every single translation regardless of using LLM or whatever tools.

thomasahle 1 year ago | | |

You'd think that, but Timothy Gowers (the famous mathematician they worked with) wrote (https://x.com/wtgowers/status/1816509817382735986)

> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

So they didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.

The reason why formalization is harder than you think is that there is no way to know if you got it right. You can use Reinforcement Learning with proofs and have a clear signal from the proof checker. We don't have a way to verify formalizations the same way.
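
A small Lean illustration of that point, as a sketch assuming a recent Lean 4 toolchain where the `decide` and `omega` tactics are available. The informal claim "(n − 1) + 1 = n for every natural number n" has an obvious transcription that is false at n = 0, because subtraction on `Nat` truncates at zero. Both examples below are accepted by the checker, and nothing in Lean says which one matches the informal intent.

```lean
-- The naive transcription breaks at n = 0: on Nat, 0 - 1 = 0, so (0 - 1) + 1 = 1.
example : ((0 : Nat) - 1) + 1 ≠ 0 := by decide

-- A faithful version has to state the positivity hypothesis explicitly.
example (n : Nat) (h : 0 < n) : (n - 1) + 1 = n := by omega
```

The proof checker gives a clear pass/fail signal for each proof, but whether the statement being proved is the right one still has to be judged by a human reading it.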

ajross 1 year ago | | |

> To speak generally, that translation part is much easier than the proof part.

To you or me, sure. But I think the proof that it isn't for this AI system is that they didn't do it. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.

If you ask an AI (or person) to prove the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.

I'm suspicious, basically.

dooglius 1 year ago | |

The linked page says

> While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.

However, it's unclear what initial format was given to the agents that allowed this step

Smaug123 1 year ago | | |

FWIW, GPT-4o transcribed a screenshot of problem 1 perfectly into LaTeX, so I don't think "munge the problem into machine-readable form" is per se a difficult part of it these days even if they did somehow take shortcuts (which it sounds like they didn't).

pclmulqdq 1 year ago | | |

So if Lean was used to find the answers, where exactly is the AI? A thin wrapper around Lean?

zerocrates 1 year ago | |

Interesting that they have a formalizer (used to create the training data) but didn't use it here. Not reliable enough?

golol 1 year ago | |

> When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean.

To me, this sounds like AlphaProof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate theorems, presumably in Lean. I.e. the set is {n: P(n)} for some formula or something. Then it searches for proofs.

I think if Alphaproof did not find {foo} but it was given then it would be very outrageous to claim that it solved the problem.

I am also very hyped.

sebzim4500 1 year ago | |

I as someone with a maths degree but who hasn't done this kind of thing for half a decade, was able to immediately guess the solution to (1) but actually proving it is much harder.

gowld 1 year ago | |

The article says

> AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct.

rldjbpin 1 year ago | |

as a noob, i feel that formalizing is a major part of solving the problem by yourself. my assessment is that once you identify certain patterns, you can solve problems by memorizing some patterns. but people like me can struggle with the first stage and solve the wrong problem.

still good progress nonetheless. won't call the system sufficient by itself tho.

SonOfLilit 1 year ago | | |

My mathematician friend said problem 5 (I think? With the monsters) seems hard to formulate, so I spent 15 minutes formulating it in pseudo-haskell.

Then he gave me a huge hint to the solution, after which it only took me a couple of hours to solve.

(Formalizing the solution is of course the hardest part, and might serve as a good masters dissertation I think)

kurthr 1 year ago | |

As is often the case, creating a well formed problem statement often takes as much knowledge (if not work) as finding the solution.

But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."

hyfgfh 1 year ago | |

> First, the problems were manually translated into formal mathematical language for our systems to understand.

Some people call this programming

allxnb 1 year ago | |

Presenting this just as "translating into formal language" omits important information.

Lean isn't just a formal language, it is also a theorem prover. Could the IMO participants use the nlinarith tactic? Could they use other tactics?

Of course not, they had to show their work!

Could they have mathematicians translate the problem statements into the formal language for them?

Of course not, they had to do it themselves. In "How to solve it" Polya stresses multiple times that formalizing the initial question is an important part of the process.

Then, the actual computational resources expressed in time are meaningless if one has a massive compute cloud.

I'm a bit dissatisfied with the presentation, same as with the AlphaZero comparison to an obsolete Stockfish version that has been debunked multiple times.

necovek 1 year ago |

This is certainly impressive, but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.

But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.

It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?

Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?
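
The arithmetic in this comment can be checked in a few lines. This is only a sketch that takes the comment's own figures at face value (half of the contestants medal, a 1:2:3 split, and a hypothetical upper bound of 5 problems at 3 days each); the variable names are made up for illustration.

```python
from fractions import Fraction

# Figures taken from the comment above: half of the contestants receive medals,
# split gold : silver : bronze = 1 : 2 : 3.
medal_share = Fraction(1, 2)
ratio = {"gold": 1, "silver": 2, "bronze": 3}
total = sum(ratio.values())

# Fraction of all contestants receiving each medal.
shares = {name: medal_share * part / total for name, part in ratio.items()}
silver_or_better = shares["gold"] + shares["silver"]

# Time budget: two 4.5 h sessions for humans versus the comment's hypothetical
# upper bound of 5 problems at 3 days each.
human_hours = Fraction(9)
machine_hours_upper_bound = 5 * 3 * 24

print(shares)
print(silver_or_better)                         # 1/4
print(machine_hours_upper_bound / human_hours)  # 40
```

So gold and silver together cover the top quarter of contestants, and the "up to 3 days each" budget is at most about 40 times the 9 hours a student gets.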

golol 1 year ago |

This is the real deal. AlphaGeometry solved a very limited set of problems with a lot of brute force search. This is a much broader method that I believe will have a great impact on the way we do mathematics. They are really implementing a self-feeding pipeline from natural language mathematics to formalized mathematics where they can train both formalization and proving. In principle this pipeline can also learn basic theory building like creating auxiliary definitions and lemmas. I really think this is the holy grail of proof assistance and will allow us to formalize most mathematics that we create very naturally. Humans will work post-rigorously and let the machine assist with filling in the details.

Ericson2314 1 year ago |

The lede is a bit buried: they're using Lean!

This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.

Hopefully more humans write types in Lean and similar systems as a much better way of writing prompts.

Smaug123 1 year ago | |

And while AlphaProof is clearly extremely impressive, it does give the computer an advantage that a human doesn't have in the IMO: nobody's going to be constructing Gröbner bases in their head, but `polyrith` is just eight characters away. I saw AlphaProof used `nlinarith`.
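
To give a feel for what such a tactic does, here is a minimal sketch assuming Mathlib is available. It is a textbook inequality, not something taken from AlphaProof's output: `nlinarith` closes the goal once it is handed the fact that a square is non-negative.

```lean
import Mathlib

-- 2ab ≤ a² + b² follows from (a - b)² ≥ 0; the hint supplies that square.
example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]
```

A human contestant would write out the expansion of (a − b)² by hand; here the tactic searches for the right combination of the supplied hypotheses automatically.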

michael_nielsen 1 year ago |

A good brief overview here from Tim Gowers (a Fields Medallist, who participated in the effort), explaining and contextualizing some of the main caveats: https://x.com/wtgowers/status/1816509803407040909

signa11 1 year ago |

> ... but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

yes, it is true, but getting to the country specific team is itself an arduous journey, and involves brutal winnowing every step of the way f.e. regional math-olympiad, and then national math-olympiad etc.

this is then followed by further trainings specifically meant for this elite bunch, and maybe further eliminations etc.

suffice it to say, that qualifying to be in a country specific team is imho a big deal. getting a gold/silver from amongst them is just plain awesome !

nb_quant 1 year ago | |

Some countries pull these kids out of school for an entire year to focus on training for it, while guaranteeing them entry into their nation's top university.

Source: a friend who got silver on the IMO

fancyfredbot 1 year ago |

I'm seriously jealous of the people getting paid to work on this. Sounds great fun and must be incredibly satisfying to move the state of the art forward like that.

GuB-42 1 year ago | |

I don't know about that. A lot of the work that should have been very satisfying turned out to be boring as hell, if not toxic, while at the same time, some apparently mundane stuff turned out to be really exciting.

I found the work environment to be more important than the subject when it comes to work satisfaction. If you are working on a world changing subject with a team of assholes, you are going to have a bad time, some people really have a skill for sucking the fun out of everything, and office politics are everywhere, especially on world changing subjects.

On the other hand, you can have a most boring subject, say pushing customer data to a database, and have the time of your life: friendly team, well designed architecture, time for experimentation and sharing of knowledge, etc... I have come to appreciate the beauty of a simple thing that just works. It is so rare, maybe even more rare than scientific breakthroughs.

Now, you can also have an awesome work environment and an awesome subject, it is like hitting the jackpot... and a good reason to be envious.

phillypham 1 year ago | | |

Awesome work environment for one person can be not ideal for another.

Pretty much all the top AI labs are both intensely competitive and collaborative. They consist of many former IMO and IOI medalists. They don't believe in remote work, either. Even if you work at Google DeepMind, you really need to be in London for this project.

lonesword 1 year ago | |

I work in this space (pretraining LLMs). It looks fancier than it really is. It does involve wrangling huge ymls and writing regular expressions at scale (ok I am oversimplifying a bit). I should be excited (and grateful) that I get to work on these things but shoddy tooling takes the joy out of work.

onemoresoop 1 year ago | |

You probably mean envious not jealous.

yalok 1 year ago | | |

I'm learning something new today. In some other languages these 2 are usually the same 1 word.

cynicalpeace 1 year ago |

Machines have been better than humans at chess for decades.

Yet no one cares. Everyone's busy watching Magnus Carlsen.

We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.

This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.

thrance 1 year ago |

Theorem proving is a single-player game with an insanely big search space, I always thought it would be solved long before AGI.

IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took the daunting task of formalizing the entirety of mathematics to themselves.

This lack of formalizing in math papers was what killed any attempt at automation, because AI researchers had to wrestle with the human element of figuring out the author's own notations, implicit knowledge, skipped proof steps...

camjw 1 year ago | |

> Theorem proving is a single-player game with an insanely big search space, I always thought it would be solved long before AGI.

This seems so weird to me - AGI is undefined as a term imo but why would you expect "producing something generally intelligent" (i.e. median human level intelligence) to be significantly harder than "this thing is better than Terence Tao at maths"?

thrance 1 year ago | | |

My intuition tells me we humans are generally very bad at math. Proving a theorem, in an ideal way, mostly involves going from point A to point B in the space of all proofs, using previous results as stepping stones. This isn't particularly a "hard" problem for computers which are able to navigate search spaces for various games much more efficiently than us (chess, go...).

On the other hand, navigating the real world mostly consists in employing a ton of heuristics we are still kind of clueless about.

At the end of the day, we won't know before we get there, but I think my reasons are compelling enough to think what I think.

zone411 1 year ago |

The best discussion is here: https://leanprover.zulipchat.com/#narrow/stream/219941-Machi...

Jun8 1 year ago |

Tangentially: I found it fascinating to follow along the solution to Problem 6: https://youtu.be/7h3gJfWnDoc (aquaesulian is a nod to the ancient name of Bath). There’s no advanced math and each step is quite simple, I’d guess at a medium 8th grader level.

Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.

I don’t know what Lean is or how to see the AI’s proofs but an AI system that can explain such a question on par with the YouTuber above would be fantastic!

nopinsight 1 year ago |

Once Gemini, the LLM, integrates with AlphaProof and AlphaGeometry 2, it might be able to reliably perform logical reasoning. If that's the case, software development might be revolutionized.

"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909

adverbly 1 year ago |

> First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.

riku_iki 1 year ago |

Example of proof from AlphaProof system: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...

SJC_Hacker 1 year ago |

The kicker with some of those math competition problems, there will be problems that reduce to finding all natural numbers for which some statement is true. These are almost always small numbers, less than 100 in most circumstances.

Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And is precisely the reason why calculators aren't allowed.

But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
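
As a concrete illustration of that kind of brute-force check, here is a sketch for this year's P1 (find all real α such that n divides ⌊α⌋ + ⌊2α⌋ + … + ⌊nα⌋ for every positive integer n). The candidate grid and the bound `n_max` are arbitrary choices made for the example, and exact rationals are used to avoid floating-point rounding.

```python
from fractions import Fraction
from math import floor

def passes(alpha, n_max):
    """True if n divides floor(alpha) + ... + floor(n*alpha) for all 1 <= n <= n_max."""
    running = 0
    for n in range(1, n_max + 1):
        running += floor(n * alpha)
        if running % n != 0:
            return False
    return True

# Hypothetical candidate grid: multiples of 1/2 in [-6, 6].
candidates = [Fraction(m, 2) for m in range(-12, 13)]
survivors = [a for a in candidates if passes(a, n_max=50)]
print(survivors)  # only the even integers on the grid survive
```

A search like this suggests the answer but justifies nothing: it covers finitely many candidates and finitely many n. For instance, α = 2 + 1/100 passes every n below 100 and only fails at n = 100, so a smaller bound would wrongly keep it.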

robinhouston 1 year ago |

Some more context is provided by Tim Gowers on Twitter [1].

Since I think you need an account to read threads now, here's a transcript:

Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.

It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.

However, that statement needs a bit of qualifying.

The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.

If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.

Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.

Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.

As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.

However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.

So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.

That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.

Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.

It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.

But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.

The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".

However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.

1. https://x.com/wtgowers/status/1816509803407040909?s=46

amarant 1 year ago |

This is quite cool! I've found logical reasoning to be one of the biggest weak points of LLMs, nice to see that an alternative approach works better! I've tried to enlist GPT to help me play an Android game called 4=10, where you solve simple math problems, and GPT was hilariously terrible at it. It would both break the rules I described, and make math mistakes, such as claiming 6*5-5+8=10

I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.

Also perhaps an LLM could do the translation step that is currently manual?
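
For contrast, a puzzle like the one described is easy for plain exhaustive search. The sketch below assumes simplified rules (use each of the four digits exactly once, with +, -, *, / and any parenthesization); the real app's rules may be stricter, and the function names are made up for the example. It also confirms that the quoted expression evaluates to 33, not 10.

```python
from fractions import Fraction
from itertools import combinations

def solve(digits, target=10):
    """Return one expression using every digit once that equals target, or None."""
    items = [(Fraction(d), str(d)) for d in digits]
    return _search(items, Fraction(target))

def _search(items, target):
    # Base case: a single value is left; it either hits the target or not.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Combine any two remaining values with any operation, then recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        options = [
            (a + b, f"({ea}+{eb})"),
            (a - b, f"({ea}-{eb})"),
            (b - a, f"({eb}-{ea})"),
            (a * b, f"({ea}*{eb})"),
        ]
        if b != 0:
            options.append((a / b, f"({ea}/{eb})"))
        if a != 0:
            options.append((b / a, f"({eb}/{ea})"))
        for value, expr in options:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None

print(6 * 5 - 5 + 8)        # 33, so the quoted claim is wrong
print(solve([6, 5, 5, 8]))  # some expression equal to 10, e.g. 5*8 - 6*5
print(solve([1, 1, 1, 1]))  # None: four 1s cannot reach 10
```

Exact fractions are used so that divisions such as 1/3 do not introduce rounding errors into the equality test.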

petters 1 year ago |

The problems were first converted into a formal language. So they were partly solved by the AI

nitrobeast 1 year ago |

Reading into the details, the system is more impressive than the title. 100% of the algebra and geometry problems were solved. The remaining problems are of combinatorial types, which ironically more closely resemble software engineering work.

gallerdude 1 year ago |

Sometimes I wonder if in 100 years, it's going to be surprising to people that computers had a use before AI...

necovek 1 year ago | |

AI is simply another form of what we've been doing since the dawn of computers: expressing real world problems in the form of computations.

While there are certainly some huge jumps in compute power, theory of data transformation and availability of data to transform, it would surprise me if computers in 100 years do not still rely on a combination of well-defined and well-understood algorithms and AI-inspired tools that do the same thing but on a much bigger scale.

If not for any other reason, then because there are so many things where you can easily produce a great, always correct result simply by doing very precise, obvious and simple computation.

We've had computers and digital devices for a long while now, yet we still rely heavily on mechanical contraptions. Sure, we improve them with computers (e.g. think brushless motors), but I don't think anyone today is surprised that people designed these same devices (hair dryers, lawn mowers, internal combustion engines...) before computers.

onemoresoop 1 year ago | |

If AI stays in the computer form though..

StefanBatory 1 year ago |

Wow, that's absolutely impressive to hear!

Also it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.

visarga 1 year ago | |

Everything that allows for cheap validation is going that way. Math, code, or things we can simulate precisely. LLM ideation + Validation is a powerful combination.

machiaweliczny 1 year ago | | |

This, I've said it many years ago.

Math => Code => Simulation => Robots => GG

_heimdall 1 year ago |

I'm still unclear whether the system used here is actually reasoning through the process of solving the problem, or brute forcing solutions with reasoning coming in during the mathematical proof of each potential solution.

Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?

HL33tibCe7 1 year ago |

This is kind of an ideal use-case for AI, because we can say with absolute certainty whether their solution is correct, completely eliminating the problem of hallucination.

majikaja 1 year ago |

It would be nice if on the page they included detailed descriptions of the proofs it came up with, more information about the capabilities of the system and insights into the training process...

If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.

cygaril 1 year ago | |

Proofs are here: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...

majikaja 1 year ago | | |

I found those, I just would have appreciated if the content of the mathematics wasn't sidelined to a separate download as if it's not important. I felt the explanation on the page was shallow, as if they just want people to accept it's a black box.

All I've learnt from this is that they used an unstated amount of computational resources just to basically brute force what a human already is capable of doing in far less time.

seydor 1 year ago |

We need to up the ante: Getting human-like performance on any task is not impressive in itself, what matters is superhuman, orders of magnitude above. These comparisons with humans in order to create impressive sounding titles are disguising the fact that we are still at the stone age of intelligence.

zhiQ 1 year ago |

Coincidentally, I just posted about how well LLMs handle adding long strings of numbers: https://userfriendly.substack.com/p/discover-how-mistral-lar...

dan_mctree 1 year ago |

I'm curious if we'll see a world where computers could solve math problems so easily, that we'll be overwhelmed by all the results and stop caring. The role of humans might change to asking the computer interesting questions that we care about.

mr_toad 1 year ago | |

The next step will be having an AI come up with the problems.

klysm 1 year ago | |

I'm not sure what stop caring really means - like stop caring about the result, or the implications?

Davidzheng 1 year ago | |

I think mathematicians will still care

0xd1r 1 year ago |

> As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise.

Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.

azeirah 1 year ago | |

I don't think that's what they mean.

They explicitly stated that to achieve the current results, they had to manually translate the problem statements into formal mathematical statements:

> First, the problems were manually translated into formal mathematical language for our systems to understand.

How I understand what they're saying is that they used Gemini to translate the problem statement into formal mathematical language and let DeepMath do its magic after that initial step.

skywhopper 1 year ago |

Except it didn’t. The problem statements were hand-encoded into a formal language by human experts, and even then only one problem was actually solved within the time limit. So, claiming the work was “silver medal” quality is outright fraudulent.

noud 1 year ago | |

I had exactly the same feeling when reading this blog. Sure, the techniques used to find the solutions are really interesting. But they claim more than they achieve. The problem statements are not available in Lean, and the time limit is 2 x 4.5 hours. Not 3 days.

The article claims they have another model that can work without formal languages, and that it looks very promising. But they don't mention how well that model performed. Would that model also perform at silver medal level?

Also note that if the problems are provided in a formal language, you can always find the solution in a finite amount of time (provided the solution exists). You can brute-force over all possible solutions until you find the solution that proves the statement. This may take a very long time, but it will find the solutions eventually. You will always solve all the problems and win the IMO at gold medal level. AlphaProof seems to do something similar, but makes smarter decisions about which possible solutions to try and which ones to skip. What would be the reason they don't achieve gold?
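
The enumeration idea can be sketched on a toy system. Everything here is invented for illustration and has nothing to do with Lean or AlphaProof: "statements" are integers, the single axiom is 1, and the two inference rules are `double` and `add3`. Breadth-first search then finds the shortest derivation of a goal if one exists within the step budget.

```python
from collections import deque

def bfs_proof(axiom, rules, goal, max_steps):
    """Enumerate derivations breadth-first; return the rule names used, or None."""
    queue = deque([(axiom, [])])
    seen = {axiom}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        if len(path) == max_steps:
            continue  # step budget exhausted on this branch
        for name, rule in rules:
            nxt = rule(state)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

# Toy inference rules (hypothetical): from x derive 2x, or derive x + 3.
RULES = [("double", lambda x: 2 * x), ("add3", lambda x: x + 3)]

proof = bfs_proof(1, RULES, 11, max_steps=6)
print(proof)  # ['add3', 'double', 'add3'], i.e. 1 -> 4 -> 8 -> 11
```

Even in this toy, the number of candidate derivations can double with every extra step, and a real proof assistant offers vastly more than two moves at each point. That is the practical gap between "finite in principle" and "finishes within three days", and it is where a learned policy for choosing which branch to try next would matter.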

1024core 1 year ago |

> The system was allowed unlimited time; for some problems it took up to three days. The students were allotted only 4.5 hours per exam.

I know speed is just a matter of engineering, but looks like we still have a ways to go. Hold the gong...

stonethrowaway 1 year ago |

It’s like bringing a rocket launcher to a fist fight but I’d like to use these math language models to find gaps in logic when people are making online arguments. It would be an excellent way to verify who has done their homework.

PaulHoule 1 year ago |

See https://en.wikipedia.org/wiki/Automated_Mathematician for an early system that seems similar in some way.

golol 1 year ago | |

This Wikipedia page makes AM kind of come across as a nonsense project whose outputs no one (besides the author) bothered to decipher.

arnabgho 1 year ago |

https://x.com/GoogleDeepMind/status/1816498082860667086

nybsjytm 1 year ago |

To what extent is the training and structure of AlphaProof tailored specifically to IMO-type problems, which typically have short solutions using combinations of a small handful of specific techniques?

(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)

Davidzheng 1 year ago | |

The last statement is largely correct (though I don't know what to make of the IMO medalists who are unable to solve actual problems; most mathematicians can't solve most open problems). But I kind of disagree with the assessment of IMO problems: the search space is huge. If it were as you say, it would be easy to search.

nybsjytm 1 year ago | | |

No, I don't mean that the search space is small. I just mean that there are special techniques which are highly relevant for IMO-type problems. It'd be interesting to know how important that knowledge was for the design and training of AlphaProof.

In other words, how does AlphaProof fare on mathematical problems which aren't in the IMO style? (As such exceptions comprise most mathematical problems)

osti 1 year ago |

So they weren't able to solve the combinatorics problem. I'm not super well versed in competition math, but combinatorics always seem to be the most interesting problems to me.

sigbottle 1 year ago | |

I mean, IMO algebra problems can require very clever insights as well, and number theory especially has some really nice proof arguments you can make. It's easier to make a bad problem of this category though because it's much easier to hide the difficulty in a bunch of computations / rote deduction, and not creative insights.

Combinatorics problems are usually simple enough that anyone can understand and try tackling it though, and the solutions in IMO are usually designed to be elegant. I don't think I've ever seen a bad combo problem before.

osti 1 year ago | | |

Oh I'm sure the other topics all do have interesting problems, but I don't have the background necessary to even tackle them.

Your second paragraph conveyed exactly how I feel about combinatorics. Elegant and clever, on top of being understandable to even non math people.

lo_fye 1 year ago |

Remember when people thought computers would never be able to beat a human Grand Master at chess? Ohhh, pre-2000 life, how I miss thee.

utopcell 1 year ago | |

not to be pedantic, but Deep Blue beat Kasparov in 1997.

quirino 1 year ago |

I honestly expected the IOI (International Olympiad of Informatics) to be "beaten" much earlier than the IMO. There's AlphaCode, of course, but on the latest update I don't think it was quite on "silver medal" level. And available LLM's are probably not even on "honourable mention" level.

I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).

Given how much of a dent LLM's are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.

[1] https://info.atcoder.jp/entry/llm-abc-rules-en

oXman038 1 year ago | |

IOI problems are closer to IMO combinatorics problems than to other IMO problem types. That might be the reason for that delay. Personally, I only like the combinatorics problems in the IMO. That's why I dropped the math track and went for the IOI instead.

I feel combinatorics is harder for AI models for the same reason LLMs are not great at reasoning about anything out of distribution. LLMs are good pattern recognizers, and fascinating at this point. But simple tasks like counting intersections in Venn diagrams require more strategy and less pattern recognition. Pure NN-based models don't seem to be enough to solve these problems. AI agents and RL are promising.

I don't know anything about Lean, but I am curious whether proofs of combinatorial problems can be represented as well as those in number theory or algebra. If solutions to combinatorial problems are always closer to natural language, the failure of LLMs is expected. Or at least we can assume it might take more time to improve. I am assuming here that solutions to combinatorial problems in the IMO are more human-language oriented and rely more on common sense/informal logic compared to geometry or number theory problems.

Davidzheng 1 year ago | | |

Are you convinced there's a "reason" AI today is worse at combo? Like I don't see enough evidence that it's not an accident.

ckcheng 1 year ago |

There doesn’t seem to be much information on how they attempted and failed to solve the combinatorial type problems.

Anyone know any details?

ckcheng 1 year ago | |

I asked around and all I got was this: https://news.ycombinator.com/item?id=41150581

pnjunction 1 year ago |

Brilliant and so encouraging!

>because of limitations in reasoning skills and training data

One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?

jerb 1 year ago |

Is the score of 28 comparable to the score of 29 here? https://www.kaggle.com/competitions/ai-mathematical-olympiad...

Davidzheng 1 year ago | |

No. I would say it is more impressive than 50/50 there. (Source: I used to do math comps back in the day sorry it's not a great source)

gus_massa 1 year ago | |

IIUC the American Math Olympiad has 3 rounds. Winning the last one is almost a guaranteed gold medal.

The link you posted has problems with a difficulty between the first and second rounds, which are much easier.

I took a quick look at the recent list of problems in the first and second rounds. I expect this new AI to get a solid 50/50 points on this test.

djaouen 1 year ago |

Is it really such a smart thing to train a non-human "entity" to beat humans at math?

brap 1 year ago |

Are all of these specialized models available for use? Like, does it have an API?

I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them

creata 1 year ago | |

> it’s hard to imagine why more than a handful of researchers would use them

If you could automatically prove that your concurrency protocol is safe, or that your C program has no memory management mistakes, or that your algorithm always produces the same results as a simpler, more obviously correct but less optimized algorithm, I think that would be a huge benefit for many programmers.

11101010001100 1 year ago |

Can anyone comment on how different the AI generated proofs are when compared to those of humans? Recent chess engines have had some 'different' ideas.

imranhou 1 year ago |

If the system took 3 days to solve a problem, how different is this approach from a brute-force attempt at the problem with educated guesses? That's not reasoning in my mind.

sigbottle 1 year ago | |

Because with AlphaGeometry it literally was just a feedback loop brute forcing over a known database of geometry axioms with an LLM to guide the guesses.

Here, from what I understand, it's instead a theorem prover + LLM backing it. General proofs have a much larger search space than the 2d geometry problems you see on IMO; many former competitors disparage geometry for that reason.

JohnPrine 1 year ago | |

it wouldn't surprise me if what we think of as intelligence is nothing more than brute force attempts at prediction with educated guesses

gowld 1 year ago |

Why is it so hard to make an AI that can translate an informally specified math problem (and Geometry isn't even so informal) into a formal representation?

sssummer 1 year ago |

Why can frontier models both achieve a silver medal in the Math Olympiad and also fail to answer "which number is bigger, 9.11 or 9.9"?

utopcell 1 year ago | |

..because not all systems are of the same quality.

quantum_state 1 year ago |

It’s as impressive as if not more than AI beating a chess master. But are we or should we be really impressed?

rowanG077 1 year ago |

Is this just google blowing up their own asses or is this actually useable with some sane license?

c0l0 1 year ago |

That's great, but does that particular model also know if/when/that it does not know?

ibash 1 year ago | |

Yes

> AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. … Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness.

diffeomorphism 1 year ago | |

While that was probably meant to be rhetorical, the answer surprisingly seems to be an extremely strong "Yes, it does". Exciting times.

foota 1 year ago | |

Never?

Edit: To defend my response, the model definitely knows when it hasn't yet found a correct response, but this is categorically different from knowing that it does not know (and of course monkeys and typewriters etc., can always find a proof eventually if one exists).

atum47 1 year ago |

Oh, the title was changed to international math Olympiad. I was reading IMO as in my opinion, haha

fovc 1 year ago |

6 months ago I predicted Algebra would be next after geometry. Nice to see that was right. I thought number theory would come before combinatorics, but this seems to have solved one of those. Excited to dig into how it was done

https://news.ycombinator.com/item?id=39037512

lumb63 1 year ago |

Can someone explain why proving and math problem solving is not a far easier problem for computers? Why does it require any “artificial intelligence” at all?

For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
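The even-number argument above can be written out as a machine-checked proof. This is only a sketch, assuming Lean 4 with Mathlib, where `Even a` unfolds to `∃ r, a = r + r`; the theorem name `even_add_even` is made up for illustration (Mathlib already ships this fact as `Even.add`).

```lean
import Mathlib

-- Hypothetical name; Mathlib's own version of this fact is `Even.add`.
theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- hx : a = x + x
  obtain ⟨y, hy⟩ := hb   -- hy : b = y + y
  -- Witness for the sum is x + y; `ring` does the rearranging.
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```

Every step here is a mechanical rule application, but a human chose which hypotheses to unpack and which witness to supply. For a statement this small there are only a few choices to make.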

What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?

psb217 1 year ago | |

In a sense, the model _is_ simply applying a finite and known set of axioms and manipulations. What makes this hard in practice is that the number of possible ways in which to perform multiple steps of this sort of axiomatic reasoning grows exponentially with the length of the shortest possible solution for a given problem. This is similar to the way in which the tree of possible futures in games like go/chess grows exponentially as one tries to plan further into the future.

This makes it natural to address these problems using similar techniques, which is what this research team did. The "magic" in their solution is the use of neural nets to make good guesses about which branches of these massive search trees to explore, and make good guesses about how good any particular branch is even before they reach the end of the branch. These tricks let them (massively) reduce the effective branching factor and depth of the search trees required to produce solutions to math problems or win board games.
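To make the branching point concrete, here is a toy sketch in Python. It is not AlphaProof's method: the "proof steps" are three arithmetic moves, and the hand-written `distance_score` function is only a stand-in (an assumption for illustration) for the learned guesses described above. All names are hypothetical.

```python
import heapq
from collections import deque

# Toy "inference rules": each move turns one number into another.
MOVES = [("+1", lambda n: n + 1), ("*2", lambda n: n * 2), ("*3", lambda n: n * 3)]


def blind_search(start, target):
    """Breadth-first search: explores every branch, shallowest first."""
    queue = deque([(start, [])])
    seen = {start}
    expanded = 0
    while queue:
        n, path = queue.popleft()
        expanded += 1
        if n == target:
            return path, expanded
        for name, move in MOVES:
            m = move(n)
            # Moves only increase n (for n >= 1), so overshooting is a dead end.
            if m <= target and m not in seen:
                seen.add(m)
                queue.append((m, path + [name]))
    return None, expanded


def guided_search(start, target, score):
    """Best-first search: always expands the state the scorer likes most."""
    heap = [(score(start, target), start, [])]
    seen = {start}
    expanded = 0
    while heap:
        _, n, path = heapq.heappop(heap)
        expanded += 1
        if n == target:
            return path, expanded
        for name, move in MOVES:
            m = move(n)
            if m <= target and m not in seen:
                seen.add(m)
                heapq.heappush(heap, (score(m, target), m, path + [name]))
    return None, expanded


def distance_score(n, target):
    """Hand-written stand-in for a learned value estimate (lower is better)."""
    return target - n


def replay(start, path):
    """Apply a list of move names to start, to check a returned path."""
    table = dict(MOVES)
    for name in path:
        start = table[name](start)
    return start


blind_path, blind_expanded = blind_search(1, 100)
guided_path, guided_expanded = guided_search(1, 100, distance_score)
print("blind: ", len(blind_path), "moves,", blind_expanded, "states expanded")
print("guided:", len(guided_path), "moves,", guided_expanded, "states expanded")
```

On this toy instance the guided search expands fewer states than the blind one, though the path it returns is longer than the shortest one. The real difficulty is that a useful scorer for proofs cannot be written by hand like this, which is where the trained network comes in.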

zone411 1 year ago | |

The problems in question require much, much more complex proofs. Try example IMO problems yourself and see if they don't require much intelligence: https://artofproblemsolving.com/wiki/index.php/IMO_Problems_.... And then keep in mind that research math is orders of magnitude more complex still.

ComplexSystems 1 year ago | |

What you're missing is that this kind of thing has arbitrarily been declared "artificial intelligence" territory. Once the ability of computers to do it has been established, it will no longer be artificial intelligence territory; at that point it'll just be another algorithm.

booleandilemma 1 year ago | |

Proofs require a certain ingenuity that computers just don't have, imo. A computer would never be able to come up with something like Cantor's diagonalization proof on its own.

Davidzheng 1 year ago | | |

Are you sure AlphaProof can't?

runeblaze 1 year ago | |

Another answer is that 3SAT and co can be seen as distilled variants of proving statements. Well, 3SAT is famously hard.

mathinaly 1 year ago |

How do they know their formalization of the informal problems into formal ones was correct?

__0x01 1 year ago |

Please could someone explain, very simply, what the training data was composed of?

m3kw9 1 year ago |

Is it one of those slowly slowly then suddenly things? I hope so

mupuff1234 1 year ago |

Can it / did it solve problems that weren't solved yet?

raincole 1 year ago | |

Technically yes. And it's easy. You can probably do it with your PC's computational power.

The thing is that most math "problems" are unsolved not because they're hard, but because they're not interesting enough to even be discovered by humans.

mupuff1234 1 year ago | | |

Yeah, I mean "interesting" problems (perhaps not fields medal interesting, but interesting enough)

amelius 1 year ago |

How long until this tech is integrated into compilers?

dmitrygr 1 year ago |

> First, the problems were manually translated into formal mathematical language

That is more than half the work of solving them. Headline should read "AI solves the simple part of each IMO problem at silver medal level"

refulgentis 1 year ago |

Goalposts at the moon, FUD at "but what if it's obviously fake?".

Real, exact, quotes from the top comments at 1 PM EST.

"I want to believe that the computer found it, but I can't find anything to confirm."

"Curing cancer will require new ideas"

"Maybe they used 10% of all of GCP [Google compute]"

gerdesj 1 year ago |

Why on earth did the "beastie" need the questions translating?

So it failed at the first step (comprehension) and hence I think we can request a better effort next time.

badrunaway 1 year ago |

This will in a few months change everything forever. Exponential growth incoming soon from Deepmind systems.

szundi 1 year ago |

Like it understands any of it

johnfn 1 year ago | |

Do you understand any of it?

utopcell 1 year ago | | |

:-)

Let us support the fellow human who seems to be just at the first of the five stages of grief: denial.

thoiwer23423 1 year ago |

And yet it thinks 3.11 is greater than 3.9

(probably confused by version numbers)
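The version-number guess is easy to illustrate: the same two strings order differently depending on whether they are read as decimals or as dotted versions. A small Python sketch (the function names are made up for illustration, and this says nothing about what actually happens inside a language model):

```python
def bigger_as_decimal(a: str, b: str) -> bool:
    """Compare the strings as ordinary decimal numbers."""
    return float(a) > float(b)


def bigger_as_version(a: str, b: str) -> bool:
    """Compare the strings as dotted version numbers, component by component."""
    def parts(s: str) -> tuple:
        return tuple(int(p) for p in s.split("."))
    return parts(a) > parts(b)


print(bigger_as_decimal("3.11", "3.9"))  # False: 3.11 < 3.90
print(bigger_as_version("3.11", "3.9"))  # True: minor version 11 > 9
```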

mik09 1 year ago |

how long before it solves the last two problems?

lolinder 1 year ago |

This is a fun result for AI, but a very disingenuous way to market it.

IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].

This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.

If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?

Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.

[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf

gyudin 1 year ago |

Haha, what a dumb tincan (c) somebody on Twitter right now :D

hendler 1 year ago |