Leanstral 1.5: Proof Abundance for All (mistral.ai)

73 points by programLyrique 3 hours ago | 14 comments

andai 17 minutes ago |

Halfway through the article, it shows a comparison with several frontier-ish LLMs. But they're all from half a year ago. "Our new model is better than all these Chinese models from 3 generations ago" is pretty funny to me.

boulos 2 hours ago |

This is nice work, but I found the bug finding example to be weird:

> One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.

In what way would this boundary condition case be considered something that "testing [...] would typically miss"? It's certainly something that bad tests would miss or not think about, but I find that (a) careful people and (b) ML coding systems are actually really good at "oh, I should test the extreme values". Especially for things that parse user input.

I'm curious if they found other bugs that were more interesting, but found them too hard to explain quickly.
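For anyone who wants to see the failure mode concretely, here is a minimal Rust sketch of the kind of decode step the quote describes. This is a hypothetical reconstruction, not the actual datrs/varinteger source: the function names are made up, and `checked_add` / `wrapping_add` stand in for what debug and release builds do with a plain `value + 1`.

```rust
/// Hypothetical reconstruction of a zigzag decode step (NOT the real
/// datrs/varinteger code). `checked_add` surfaces the overflow as `None`,
/// which is where a debug build would panic on a plain `value + 1`.
fn unsign_checked(value: u64) -> Option<i64> {
    if value & 1 == 1 {
        // Odd encodings map to negatives: 1 -> -1, 3 -> -2, ...
        let half = value.checked_add(1)? / 2;
        Some(-(half as i64))
    } else {
        // Even encodings map to non-negatives: 0 -> 0, 2 -> 1, ...
        Some((value / 2) as i64)
    }
}

/// Same logic with wrapping addition, mimicking a release build where
/// the overflow silently wraps around to zero.
fn unsign_wrapping(value: u64) -> i64 {
    if value & 1 == 1 {
        -((value.wrapping_add(1) / 2) as i64)
    } else {
        (value / 2) as i64
    }
}

/// The usual overflow-free zigzag decode: shift out the sign bit, then
/// XOR with an all-ones mask when the sign bit was set.
fn unsign_fixed(value: u64) -> i64 {
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}
```

Under these assumptions, `unsign_checked(u64::MAX)` is `None`, `unsign_wrapping(u64::MAX)` quietly returns `0`, and `unsign_fixed(u64::MAX)` returns `i64::MIN`, which is the value that input should decode to.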

Groxx 15 minutes ago | |

particularly "and fuzzing", yea. fuzzing generally does intentionally explore boundary values, from what I've seen. for an encoding library like this, I think it's fair to say that fuzzing is a baseline expectation for any decent code, and it almost certainly would've caught this in seconds.
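To make the "boundary values" point concrete, here is a rough sketch of a round-trip check over a handful of extreme inputs. It is a hand-rolled table test, not a real fuzzer, and every name in it is hypothetical; `decode_naive` is an assumed stand-in for the buggy shape described in the quote, not the library's code.

```rust
/// Standard zigzag encode: 0 -> 0, -1 -> 1, 1 -> 2, ...
fn zigzag_encode(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

/// Assumed buggy decode shape; `checked_add` reports the overflow as `None`.
fn decode_naive(value: u64) -> Option<i64> {
    if value & 1 == 1 {
        Some(-((value.checked_add(1)? / 2) as i64))
    } else {
        Some((value / 2) as i64)
    }
}

/// Overflow-free decode, wrapped in `Some` so both decoders share a type.
fn decode_fixed(value: u64) -> Option<i64> {
    Some(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// The extremes a careful tester (or a fuzzer's seed corpus) would try first.
const BOUNDARY_INPUTS: [u64; 5] = [0, 1, 2, u64::MAX - 1, u64::MAX];

/// Returns every boundary input that does not survive decode -> encode.
fn failing_inputs(decode: fn(u64) -> Option<i64>) -> Vec<u64> {
    BOUNDARY_INPUTS
        .iter()
        .copied()
        .filter(|&v| decode(v).map(zigzag_encode) != Some(v))
        .collect()
}
```

With these definitions, `failing_inputs(decode_naive)` returns just `[u64::MAX]` and `failing_inputs(decode_fixed)` returns an empty list, so five inputs are enough to flag the bug.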

fjdjshsh 1 hour ago | |

Maybe it's not something they would "typically miss", but, as an existence proof, it's something they sometimes miss.

It does speak to the benefits of using Lean in that you don't need to be clever about the different examples you test.

Exoristos 1 hour ago | |

Yes, it's basic QA. If tests missed this kind of thing, they would be of much more limited use than we generally expect them to be. It raises questions about the authors' background.

Groxx 28 minutes ago |

>One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.

that library is: https://github.com/datrs/varinteger

it seems probably correct, as there's an identical issue filed on that repo a week before this was published: https://github.com/datrs/varinteger/issues/8 (is this a leanstral employee? they have almost no info or activity. or did leanstral perhaps just pick up this issue?)

it's a tiny, surprisingly-poorly tested, long-untouched (8y) library: https://github.com/datrs/varinteger/blob/master/tests/test.r... that has about 1k downloads per day: https://crates.io/crates/varinteger [1] which seems rather low.

I don't think I'd consider that such a smashing success that it's worth bringing up as the sole example tbh. though automated detection is certainly useful. or is this a noteworthy accomplishment for this sub-field? I haven't played with proof-writing LLMs, but given the paucity of training data I wouldn't be surprised if they're a bit rough compared to general coding.

1: https://crates.io/crates/varinteger lists it as https://github.com/mafintosh/varinteger-rs which redirects to https://github.com/datrs/varinteger , so despite looking different at a glance it does appear to be the same library

andai 24 minutes ago |

Discussed the other day:

https://news.ycombinator.com/item?id=48738938

henryrobbins00 1 hour ago |

Try out Leanstral 1.5 on the latest version of OpenATP! OpenATP is an open-source Python package and CLI for agentic automated theorem provers. It natively supports running provers locally in Docker or remotely in Modal sandboxes.

GitHub: https://github.com/henryrobbins/open-atp

Docs: https://open-atp.henryrobbins.com

mathieudombrock 17 minutes ago | |

This is an ad.

henryrobbins00 5 minutes ago | | |

Earnest question: any recommendation to not come off this way in forums?

I created this tool for my own research and have found it really helpful for benchmarking different automated theorem provers (my experience so far has been that Claude Code + Codex still outperform Leanstral). My genuine aim is to share that usefulness with others, not to self-promote!

satvikpendem 2 hours ago |

I also submitted the HuggingFace link itself here: https://news.ycombinator.com/item?id=48779902

strongly-typed 45 minutes ago |

Lean is such a wonderful language. So hyped by these releases.

nullc 2 hours ago |

It would be nice if special-purpose models provided some diverse examples of exactly the input required to get their expected performance on a mix of problem types. Maybe also a document intended for LLMs to read that advises on prompt construction.

I've found that you can get wildly different quality results from these sorts of models due to seemingly insignificant differences in prompt construction. It would be much easier to guess at what it wants if I could just see some RL transcripts -- and so the model author is in a much better position to provide initial advice.

moonset 1 hour ago |

I gave Codex with GPT-5.5 High this prompt:

    Identify bugs in [datrs/varinteger](https://github.com/datrs/varinteger) . Do NOT look at the GitHub issues, just inspect the source

It also found the bug that Leanstral 1.5 found and the authors highlighted. I think this bug wasn't especially tricky; it's just a case of too few eyeballs on this repo.

Congrats on the release regardless! Excited for the direction Lean + automated AI proofs are headed.

Disclosure: I work at OpenAI.

noperator 8 minutes ago | |

Leanstral 1.5 has 6B active parameters. How many parameters does GPT-5.5 have?

8note 8 minutes ago | |

the mechanism is what's interesting, rather than whether it could do it.

this sounds like a great tool to add to the toolbelt, as part of the "how do we handle all the code output from LLMs" problem