VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO (arxiv.org)

61 points by timhigins 3 hours ago | 20 comments

gslepak 1 hour ago |

Note that these are Python-only results; the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

nsingh2 23 minutes ago | |

Lots of confusion about what this model is actually focused on.

It is a cheap specialist for closed-world, verifiable reasoning tasks like math, self-contained coding problems, and similar.

"Closed-world" means the needed information is already in the context. It is not a tool-using agent that can discover missing context. "Verifiable" means answers are hard to generate but easy to check.

So no open-ended research, repo-wide agent work, factual Q&A, or SVG generation. It is more of a compact reasoning module for bounded problems.
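To make "closed-world" and "verifiable" concrete, here is a minimal sketch using subset sum as a stand-in task. It is not from the paper; the function name and the example numbers are made up for illustration. Everything the checker needs is in its arguments (closed-world), and checking a proposed answer is a cheap linear pass even though finding one can take a search (verifiable).

```python
from collections import Counter

def verify_subset_sum(numbers, target, candidate):
    """Return True if `candidate` only uses elements of `numbers`
    (respecting multiplicity) and sums to `target`."""
    available = Counter(numbers)
    used = Counter(candidate)
    # Every chosen element must actually come from the given input.
    if any(count > available[value] for value, count in used.items()):
        return False
    return sum(candidate) == target

# Hypothetical problem instance: all needed information is right here.
numbers = [3, 34, 4, 12, 5, 2]
print(verify_subset_sum(numbers, 9, [4, 5]))     # a valid answer
print(verify_subset_sum(numbers, 9, [4, 4, 1]))  # uses numbers not in the input
```

A checker like this is also the kind of signal that makes such tasks easy to score automatically.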

secretslol 43 minutes ago |

Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why train models to learn everything when you can just train them how to learn and let them get on with it, from something as small as a Pi Zero with an internet connection?

numlock86 19 minutes ago | |

This has been my dream ever since. Instead of encoding "all the knowledge" into those parameters, how about just making a model that has the same size, but all (or rather most) it does is reasoning? Just give it the ability to browse the net (e.g. language specifications, documentation and best practices) and just have it do its thing. Why does my coding agent need to know the population of New York, a cheesecake recipe, or the general lifespan of an ostrich? Just give it the bare minimum knowledge to think and reason about, and let it figure out the rest.

Sadly that's not how LLMs work, since all they do is "token prediction". At least the models we have today ...

3eb7988a1663 15 minutes ago | | |

It would also reduce training costs to nearly nothing. Current methodology requires continual retraining to scoop up new facts. If you can do a one-time "this is how to think", that could conceptually work forever: just plug in a new database layer that can be queried as required.

deftio 1 hour ago |

There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? Driving a car requires being able to read, to have judgement about icy or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid-teens they have acquired the base knowledge...

Small models need enough base knowledge to be good enough, even in a seemingly narrow regime. Where is that threshold? Obviously they don't need all the obscure knowledge of a frontier model, but there is some base level which is probably more than it would first seem.

smokel 1 minute ago | |

Being able to drive a car properly also depends on having the right exploration-exploitation balance. A three-year-old is likely to explore too much in a situation where mistakes can be dangerous.

This requires not only knowledge, but also the control systems that develop with the prefrontal cortex. LLMs don't do much control yet.

aero2146 2 hours ago |

I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...

fwipsy 1 hour ago | |

I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."

> these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.

pylotlight 1 hour ago | | |

The only really essential item here is tool-calling capability, is it not? So I assume they tested for strong read/write/edit tool consistency?

realitysballs 1 hour ago | |

That’s all I needed to hear

pylotlight 1 hour ago | | |

As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?

physPop 1 hour ago | |

It's for reasoning, not generating art?

websap 1 hour ago | | |

Can you explain this a bit more?

noperator 1 hour ago |

Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
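For anyone wondering what a harness-side workaround can look like, here is a minimal sketch, not the harness described above. It assumes the model wraps a JSON object in free-form reasoning text, and it scans for the first span that parses as a JSON object. The function name and the `severity`/`line` fields are hypothetical; only the standard-library `json` module is used.

```python
import json

def extract_first_json_object(text):
    """Return the first parseable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None

# Hypothetical model output: prose, a malformed brace span, then real JSON.
raw = 'Let me think {not json} ... final answer: {"severity": "high", "line": 42} Done.'
finding = extract_first_json_object(raw)
print(finding)
```

When nothing parses, the function returns None, so the harness can retry or fall back instead of crashing on bad output.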

dummydummy1234 1 hour ago | |

Can't you just force it to do structured output via constrained generation?
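As a toy illustration of the idea behind constrained generation: at each step the decoder only picks among tokens the output format allows, no matter how much the model prefers something else. This sketch does not use any real inference library's API; `toy_scores` is a made-up stand-in for model logits, and the fixed template with a `"vulnerable"` field is an assumption for the example.

```python
import json

# Hypothetical output format: one set of allowed tokens per position.
TEMPLATE = [{"{"}, {'"vulnerable"'}, {":"}, {"true", "false"}, {"}"}]

def allowed_next(prefix):
    """Tokens the format permits after `prefix`; empty set means done."""
    return TEMPLATE[len(prefix)] if len(prefix) < len(TEMPLATE) else set()

def toy_scores(prefix):
    """Stand-in for model logits: this 'model' would rather start chatting."""
    return {"Sure": 5.0, "{": 1.0, '"vulnerable"': 0.5, ":": 0.5,
            "true": 2.0, "false": 1.5, "}": 0.1}

def constrained_greedy_decode(score_fn, allowed_fn, max_steps=16):
    """Greedy decoding where disallowed tokens are masked out at every step."""
    output = []
    for _ in range(max_steps):
        allowed = allowed_fn(output)
        if not allowed:
            break
        scores = score_fn(output)
        # Masking: only allowed tokens compete, so "Sure" can never win.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        output.append(best)
    return output

tokens = constrained_greedy_decode(toy_scores, allowed_next)
result = json.loads("".join(tokens))
print(result)
```

The trade-off is that masking guarantees the shape of the output, not its quality: the model's reasoning still has to be good enough to put the right value in the slot.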

SwellJoe 1 hour ago |

It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

nsingh2 53 minutes ago | |

The lack of tool use will hinder it a lot, I think, since bug hunting requires collecting context across a code base and stitching it together. It might be good in a narrower sense, i.e., "is there a bug in this block of code", without considering how that block interacts with the rest of the code base.

That's also more aligned with its LeetCode-style training data, where the code under test is fully in the context window. It might be interesting to have a bigger tool-use model go through the effort of collecting the context, and feeding it into this kind of model for analysis only. It becomes more of a thinking tool, instead of the orchestrator.
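A rough sketch of that split, with every name hypothetical: `collect_context` stands in for the larger tool-using model (or any code search step), and `analyze` stands in for a call to the small reasoning model. Neither is a real API. The only real logic here is packing everything into one self-contained prompt so the analysis step stays closed-world.

```python
from typing import Callable, Dict

def build_closed_world_prompt(target_name: str, snippets: Dict[str, str]) -> str:
    """Pack the target and its collected context into one self-contained prompt."""
    parts = ["Review the TARGET function for bugs. All relevant code is included below."]
    for name, source in snippets.items():
        label = "TARGET" if name == target_name else "CONTEXT"
        parts.append(f"### {label}: {name}\n{source}")
    return "\n\n".join(parts)

def analyze_with_context(
    target_name: str,
    collect_context: Callable[[str], Dict[str, str]],  # orchestrator side (hypothetical)
    analyze: Callable[[str], str],                     # small reasoning model (hypothetical)
) -> str:
    snippets = collect_context(target_name)
    prompt = build_closed_world_prompt(target_name, snippets)
    return analyze(prompt)

# Stubs so the sketch runs without any model or repository.
def fake_collector(name: str) -> Dict[str, str]:
    return {
        "parse_header": "def parse_header(buf):\n    n = read_len(buf)\n    return buf[4:4 + n]",
        "read_len": "def read_len(buf):\n    return int.from_bytes(buf[:4], 'big')",
    }

def fake_analyzer(prompt: str) -> str:
    return f"received {prompt.count('###')} code sections"

print(analyze_with_context("parse_header", fake_collector, fake_analyzer))
```

The small model never sees a tool or a file system here; it only gets a bounded block of text to reason over.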