Show HN: Another experiment with an Erdos problem and LLMs Background: I am a coder, not a mathematician, but I was quite entertained by this story: https://news.ycombinator.com/item?id=47903126 I wondered how far I could get by just choosing a random open problem and throwing it at LLMs. Disclosure: I have no idea what the problem even refers to, let alone whether or not the output is even remotely correct. My interest is purely for testing capabilities of various models, curiousity, and entertainment. Problem: https://www.erdosproblems.com/691
My approach:
I used DeepSeek in Expert mode, using the same prompt as in the linked HN submission. It thought for a very long time, but I was doing other things in the background so I didn't really time it. I pressed "Continue" twice over the space of maybe 60mins. The output says it thought for about 46mins.Once it generated a proof, I asked Opus 4.7 to review it, and then entered the review into DeepSeek which made edits, corrections and refinements. This back-and-forth continued till Opus 4.7 was reasonably happy. At that point, I called in Gemini 3.1 Pro Preview, which raised issues which Opus missed. Opus acknowledged the feedback, and then I placed its feedback into Deepseek for a final round. Essentially, what Opus says Deepseek generated was a "clean exposition of a D[avenport]-E[rdos] corollary", not a new result. In all likelihood this result may already be known (Deepseek was not allowed to use the internet for this phase), or even wrong. In "simple" terms:
In any case, these were my takeaways:- These new models seem to be surprisingly capable especially when used to in conjunction with each other, even with fairly simple prompts - I am quite impressed by Deepseek. I'm going to review its coding ability, and may even switch completely from Anthropic - This was a genuinely interesting exercise, even if I have no idea if any of it is correct or useful Some other observations: - Opus was really fast at reviewing Deepseek's output. Literally seconds - Gemini had trouble figuring out what "Erdos 691" referred to - The free version of ChatGPT of generated mostly useless output. I didn't include it. Chat links below: https://chat.deepseek.com/share/hpguvrhcxn226bi3hn https://claude.ai/share/4f3ccad1-d862-4e37-8333-8a1ebd84b38f https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... |