Hierarchical Reasoning Model – 1k training samples SoTA reasoning v/s CoT

Hierarchical Reasoning Model – 1k training samples SoTA reasoning v/s CoT(github.com)

26 points by dreamer7 339 days ago | 6 comments

riknos314 337 days ago |

Prior thread on the paper about this: https://news.ycombinator.com/item?id=44699452

dreamer7 339 days ago |

To a casual observer, this seems like a big deal. Can knowledgeable folks comment on this work?

AIPedant 337 days ago | |

I am still reading the paper, but it is worth noting that this is not an LLM! It is closer to something like AlphaGo, trained only on ARC, Sudoku and mazes. I am skeptical that you could add a bunch of science facts and programming examples without degrading the performance on ARC / etc - frankly it’s completely unclear to me how you would make this architecture into a chatbot, period, but I haven’t thought about it very much.

Comparing the maze/Sudoku results to LLMs rather than maze/Sudoku-specific AIs strikes me as blatantly dishonest. “1k Sudoku training examples” is also dishonest, they generate about a million of them with permutations: https://news.ycombinator.com/item?id=44701264 (see also https://github.com/sapientinc/HRM/blob/main/dataset/build_su... And they seem to have deleted the Sudoku training data! Or maybe they made it private. It used to be here: https://github.com/imone and according to the Git history[1] they moved it here https://github.com/sapientinc but I cannot find it. Might be an innocent mistake; I suspect they got called out for lying about “1000 samples” and are hiding their tracks.

[1] https://github.com/sapientinc/HRM/commit/171e2fcde636bcb7e6c...

algo_trader 336 days ago | | |

> not an LLM! closer to something like AlphaGo, trained only on ARC, Sudoku and mazes.

ah! this explains the performance..

What is the conventional wisdom on improving codegen in LLMs? Sample n solutions and verify, or run a more expensive tree search?

I have thoughts on a very elaborate add-a-function-verify-and-rollback testing harness and i wonder if this has been tried

munro 337 days ago |

Link to paper here https://arxiv.org/pdf/2506.21734

Still reading, but the benchmarks for ARC-AGI-1, ARC-AGI-2, Sudoku-Extreme (9x9), and Maze-Hard (30x30) look impressive.

tough 335 days ago | |

on gh someone reproduced but paper lacks total gpu hours and their benchmark results where 10-20% lower (read on gh issue)