Comparing the maze/Sudoku results to LLMs rather than maze/Sudoku-specific AIs strikes me as blatantly dishonest. “1k Sudoku training examples” is also dishonest, since they generate about a million of them with permutations: https://news.ycombinator.com/item?id=44701264 (see also https://github.com/sapientinc/HRM/blob/main/dataset/build_su...). And they seem to have deleted the Sudoku training data! Or maybe they made it private. It used to be here: https://github.com/imone and according to the Git history[1] they moved it here: https://github.com/sapientinc but I cannot find it. It might be an innocent mistake, but I suspect they got called out for lying about “1000 samples” and are covering their tracks.
[1] https://github.com/sapientinc/HRM/commit/171e2fcde636bcb7e6c...
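For anyone wondering how a small set of Sudokus can be multiplied into a much larger one: below is a minimal sketch of validity-preserving permutations (digit relabeling, band/stack and row/column shuffles, transpose). It is an illustration only and not taken from the linked build script; `augment`, `is_valid_solution`, `BASE`, and `NUM_TRANSFORMS` are names made up for this example.

```python
import math
import random

# A known-valid solved grid built from the standard shift pattern (illustrative only).
BASE = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Count of transform combinations used below: 9! digit relabelings,
# 3! band orders * (3!)^3 row orders, the same for columns, times 2 for transpose.
# Distinct resulting grids can be fewer if the puzzle has symmetries.
NUM_TRANSFORMS = math.factorial(9) * 6**8 * 2


def _shuffled_indices(rng):
    """Shuffle the three groups of three, and the three lines inside each group."""
    groups = [0, 1, 2]
    rng.shuffle(groups)
    order = []
    for g in groups:
        inner = [0, 1, 2]
        rng.shuffle(inner)
        order.extend(3 * g + i for i in inner)
    return order


def augment(grid, rng):
    """Return a random validity-preserving transform of a 9x9 grid (0 = blank)."""
    digits = list(range(1, 10))
    rng.shuffle(digits)
    relabel = [0] + digits  # blanks stay blank
    rows = _shuffled_indices(rng)
    cols = _shuffled_indices(rng)
    out = [[relabel[grid[r][c]] for c in cols] for r in rows]
    if rng.random() < 0.5:
        out = [list(col) for col in zip(*out)]  # transpose
    return out


def is_valid_solution(grid):
    """Check that every row, column, and 3x3 box contains 1..9 exactly once."""
    want = set(range(1, 10))
    rows_ok = all(set(row) == want for row in grid)
    cols_ok = all(set(col) == want for col in zip(*grid))
    boxes_ok = all(
        {grid[br + i][bc + j] for i in range(3) for j in range(3)} == want
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok
```

Applying the same transform to a puzzle and its solution keeps the pair consistent, which is why a few of these per puzzle can inflate a dataset by orders of magnitude.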
Ah! This explains the performance.
What is the conventional wisdom on improving codegen in LLMs? Sample n solutions and verify, or run a more expensive tree search?
I have thoughts on a very elaborate add-a-function, verify, and roll back testing harness, and I wonder if this has been tried.
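To make the "sample n and verify" option concrete, here is a rough sketch of a harness that only commits a candidate function once it passes its test cases and otherwise leaves the codebase untouched. Everything here is hypothetical: `verify`, `add_function`, and the dict-as-codebase are stand-ins, and the hard-coded `candidates` list stands in for n samples from a model.

```python
def verify(source, func_name, cases):
    """Run candidate source in a scratch namespace; True if every case passes.

    NOTE: exec on model-generated code is unsafe; a real harness would sandbox this.
    """
    scratch = {}
    try:
        exec(source, scratch)
        func = scratch[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing names, and runtime failures all count as a failed sample.
        return False


def add_function(codebase, func_name, candidates, cases):
    """Commit the first candidate that verifies; otherwise leave codebase unchanged."""
    for source in candidates:
        if verify(source, func_name, cases):
            codebase[func_name] = source
            return True
    return False  # the "rollback" case: nothing was committed


# Stand-ins for n sampled completions: one wrong, one malformed, one correct.
candidates = [
    "def add(a, b):\n    return a - b\n",
    "def add(a, b)\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",
]
cases = [((1, 2), 3), ((0, 0), 0)]

codebase = {}
accepted = add_function(codebase, "add", candidates, cases)
```

A tree search would replace the flat loop over `candidates` with branching on partial programs, at a higher cost per accepted function.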
Still reading, but the benchmarks for ARC-AGI-1, ARC-AGI-2, Sudoku-Extreme (9x9), and Maze-Hard (30x30) look impressive.