- Fig. 4a shows that Centaur clusters closer to humans than any other model on a cognitive benchmark (CogBench), but it also shows the parent Llama model clustering closer to humans than Claude and the OpenAI thinking models. That makes me a bit sceptical of using this measurement at all, and it reinforces the need for further comparisons.
- The fMRI stuff makes no sense and turns the paper into a propaganda stunt, IMHO.
- At the end of the paper, the comparison with an "informed" DeepSeek-R1 (not shown in data?) shows that a modern reasoning model matches Centaur's performance even without any fine-tuning.
That last point is incredibly interesting in principle, but it has nothing to do with the claims of the paper. It basically concludes that a modern reasoning model with CoT can, out of the box, outperform a "simpler" model that was specifically fine-tuned on a huge dataset of human cognitive behaviours. That is basically a bigger claim than the title itself, IMO.