Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”

Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue” (openpipe.ai)

199 points by kcorbitt 1 year ago | 55 comments

Imnimo 1 year ago |

>To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.

I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make reasoning illegible?)

>the 32B model’s response lengths collapsing, especially after reaching peak performance.

I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just output answers directly), how well can it do?

bradhilton 1 year ago | |

Yeah, it may help. In this paper[1], the author used a KL penalty of 0.01 for general tasks and 0.001 for mathematical tasks. I tend to think it's probably not very important unless you're trying to optimize for human preferences.

As for response length, I think the model internalizes the logic and doesn't deliberate its answers through context creation. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs. Just depends on what you're optimizing for. To encourage more general reasoning, I think a broader train and validation set would be helpful.

[1] https://arxiv.org/html/2501.03262v1

jstanley 1 year ago | |

I keep seeing people mention "illegible reasoning" but I'd be fascinated to see an example of what it actually looks like. Do you have any examples?

Apparently DeepSeek-R1 can switch between English, Chinese, and gibberish, and even the gibberish helps it think! That's fascinating, but all I can find is people saying it, nobody showing it.

Imnimo 1 year ago | | |

Here's an example of language switching:

https://gr.inc/question/although-a-few-years-ago-the-fundame...

In the dropdown set to DeepSeek-R1, switch to the LIMO model (which apparently has a high frequency of language switching).

I'm not sure about examples of gibberish or totally illegible reasoning. My guess is that since R1-Zero still had the KL penalty, it should all be somewhat legible - the KL penalty encourages the model to not move too far from what the base model would say in any given context.
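To make that intuition concrete, here is a minimal sketch of a KL penalty on a single next-token distribution. Everything in it is an illustrative assumption rather than the post's training code: the function names, the toy three-token distributions, and the choice to subtract the penalty directly from the reward. The 0.01 coefficient is just the value mentioned upthread, and real implementations typically estimate the KL term from sampled tokens rather than from full distributions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_reward(reward, policy_probs, reference_probs, beta=0.01):
    """Task reward minus beta * KL(policy || reference).

    beta is a hypothetical coefficient; the larger it is, the more the
    policy is pulled back toward what the reference (base) model would say.
    """
    return reward - beta * kl_divergence(policy_probs, reference_probs)


# Toy next-token distributions over a three-token vocabulary (made up).
reference = [0.7, 0.2, 0.1]
close_policy = [0.6, 0.3, 0.1]      # stays near the base model
drifted_policy = [0.05, 0.05, 0.9]  # has moved far from the base model

print(penalized_reward(1.0, close_policy, reference))
print(penalized_reward(1.0, drifted_policy, reference))
```

With the same task reward, the drifted policy ends up with the lower penalized reward, which is the pressure toward legible, base-model-like text described above.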

NitpickLawyer 1 year ago | | |

Don't have examples handy, but I did a round of GRPO on a 7B model and it did indeed start to switch between English, Korean and Chinese, but the reward was steadily increasing. RL doesn't care what the middle tokens are, as long as the end result gets the carrot.

I think there's still a lot to learn about reward functions (saw a team work w/ just correct output, and nothing else): whether you should reward partial success (e.g. code compiles / math outputs a result) or just the final thing (e.g. test cases pass / correct answer), and so on.

Not to mention how to get downstream signals from e2e tasks (e.g. if an "agent" navigates to the correct webpage and finds a "cookie" or something, figure out how to reward all the intermediary steps out of that single binary signal).

And there's a lot to learn in using grammars & stuff w/ RL as well. The problem there is that the libraries are pretty wonky atm, some things work, some things need work, and RL in itself is pretty slow due to having to generate, update the model and generate again.

jmmcd 1 year ago |

These puzzles probably have more in common with "Zebra puzzles" (e.g. https://www.zebrapuzzles.com/) than Cluedo (USA Clue) itself. I've been doing some one-off experiments with Zebra puzzles recently. All the reasoning models generate an enormous batch of text, trying out possibilities, backtracking, and sometimes getting confused.

From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.

But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem, and write out Python code to call a CSP solver.
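As a rough illustration of that translation step, here is a minimal sketch that encodes a made-up three-house zebra-style puzzle as constraints over permutations. It uses brute-force enumeration from the standard library as a stand-in for a real CSP solver, and the puzzle, clue set, and function name are all invented for this example.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")


def solve():
    """Return every assignment of colors and pets to houses 0..2 that fits the clues."""
    solutions = []
    for colors in permutations(COLORS):
        for pets in permutations(PETS):
            # Position (house index) of each color and pet in this candidate.
            red, green, blue = (colors.index(c) for c in COLORS)
            cat, dog, fish = (pets.index(p) for p in PETS)
            if green - red != 1:      # Clue 1: red is immediately left of green.
                continue
            if dog != blue:           # Clue 2: the dog lives in the blue house.
                continue
            if cat != green:          # Clue 3: the cat lives in the green house.
                continue
            if abs(dog - fish) == 1:  # Clue 4: the dog and the fish are not neighbors.
                continue
            solutions.append(list(zip(colors, pets)))
    return solutions


print(solve())
```

Enumeration only works because the toy puzzle is tiny; a larger puzzle would call for a dedicated solver with constraint propagation, which is the approach the comment has in mind.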

mdp2021 1 year ago | |

> But of course the best way for a model to solve a problem like this is to translate it

Which means that when you ask it (e.g.) whether A is better than B (as a Decision Support System), it should write a program to decide it instead of "guessing it" from the network.

You are stating that, since the issue is general, LLMs should write programs to produce their own outputs, instead of their standard output.

jmmcd 1 year ago | | |

> since the issue is general

I'm not sure what that means specifically. I don't agree overall. Only certain types of problems encountered by LLMs map cleanly to well-understood problems where existing solvers are perfect.

layer8 1 year ago |

GRPO = Group Relative Policy Optimization

https://arxiv.org/abs/2402.03300

Tostino 1 year ago |

I couldn't quickly find it by searching your GitHub, but what layers did you end up targeting for training? Would be interesting to see an ablation on targeting different sets of layers (train only attention layers, freeze the first 30% of the layers and train the remaining 70%, etc).

bradhilton 1 year ago | |

We trained all the parameters. Those would definitely be interesting ablations. I would also like to see how much of a performance hit we would take with PEFT methods like LoRA.

kcorbitt 1 year ago |

One of the authors here. Happy to answer any questions about our methods/results!

kiratp 1 year ago |

Unless I’m missing something this isn’t online RL. They are collecting outputs in one pass and then doing a separate offline GRPO training run on those.

The results of this paper suggest that doing what they did, but online, could return better results:

https://arxiv.org/abs/2402.04792

bradhilton 1 year ago | |

Technically yes, only if you do a gradient step with data sampled from the exact same weights is it an online step.

With our training recipe this can be easily done by accumulating the gradients across the entire batch and only doing one step with the optimizer before sampling more responses.

In our experiments, however, we found the advantages of doing multiple gradient steps outweighed any potential drift in policy.

Ultimately the online-ness of data is on a spectrum and while more online data is better, other factors may be more important.

fc417fc802 1 year ago | | |

> only if you do a gradient step with data sampled from the exact same weights is it an online step.

Bit pedantic, but an amusing thought: wouldn't that imply that asynchronous actor-critic is an offline training methodology?

bionhoward 1 year ago |

This looks impressive but I’m concerned: is it fair to “teach to the test” by fine-tuning the Qwen model with RL on the test task, while the other models in the comparison are not fine-tuned on the test task?

bradhilton 1 year ago | |

Yeah, the takeaway shouldn't be "our model is smarter," but that we were able to train weak models to be as good as or better than the best models for this specific task. Depends on what you're doing, but sometimes that is enough.

Liwink 1 year ago |

Can you please share the training cost?

bradhilton 1 year ago | |

We used about 58 hours on 4xH100s and about 19 hours on 8xH100s to get the very best result with the 32B model. We trained for about another 16 hours before finishing the run, but we could have stopped earlier after it was apparent the model was regressing. Actual dollar costs are provider dependent.
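For anyone who wants to turn those figures into a budget, here is a small sketch of the arithmetic. It covers only the two phases that led to the best result; the extra 16 hours are left out because the hardware used for them is not stated. The function names and the `price_per_gpu_hour` parameter are assumptions for illustration, and no actual rate is implied.

```python
# (hours, number of H100s) for each phase, as quoted in the comment above.
PHASES = [(58, 4), (19, 8)]


def gpu_hours(phases):
    """Total GPU-hours: wall-clock hours times GPU count, summed over phases."""
    return sum(hours * gpus for hours, gpus in phases)


def estimated_cost(phases, price_per_gpu_hour):
    """Dollar estimate for a given (provider-dependent, hypothetical) hourly rate."""
    return gpu_hours(phases) * price_per_gpu_hour


print(gpu_hours(PHASES))  # 58*4 + 19*8 = 384 H100-hours
```

Multiplying the 384 H100-hours by your provider's hourly rate via `estimated_cost` gives a rough figure for reproducing the best checkpoint.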

behnamoh 1 year ago |

This is the same team that a few months ago here on Hacker News talked about how to do fine-tuning on large language models, and then made it closed source.

machiaweliczny 1 year ago |

It would be great if some details were given about how exactly the model is penalized for going off-track.

bradhilton 1 year ago | |

The model is rewarded for accuracy. For each puzzle there are a few multiple choice questions. If it got 1 out of 4 correct, for example, its reward would be 0.25.

Then group-relative advantages are calculated. If you have 16 different responses and the average accuracy is 0.5, then you subtract that from each reward and divide by the standard deviation of the rewards. Say the standard deviation is 0.25. Then the advantage for our example would be (0.25 - 0.5) / 0.25 = -1.

The advantages are then used to increase (or decrease) the probability of sampling those tokens again. Since our example was negative, we penalize the model for underperforming with that response.
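The reward and advantage steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual training recipe: the function names are made up, the population standard deviation is an assumed choice (the comment does not say which estimator is used), and the small `eps` term is added here only to avoid dividing by zero when every response scores the same.

```python
from statistics import mean, pstdev


def accuracy_reward(answers, solutions):
    """Fraction of a puzzle's multiple choice questions answered correctly."""
    correct = sum(a == s for a, s in zip(answers, solutions))
    return correct / len(solutions)


def group_relative_advantages(rewards, eps=1e-8):
    """Subtract the group mean from each reward and divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One response that gets 1 of 4 questions right earns a reward of 0.25.
reward = accuracy_reward(["A", "B", "C", "D"], ["A", "C", "D", "B"])

# A group of 16 rewards with mean 0.5 and standard deviation 0.25.
group = [0.25] * 8 + [0.75] * 8
advantages = group_relative_advantages(group)
print(reward, advantages[0], advantages[-1])
```

In this group the 0.25 responses get an advantage of about -1 and the 0.75 responses about +1, matching the worked example in the comment.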

randomcatuser 1 year ago |

Wait, what's the difference between using GRPO and traditional fine-tuning of Qwen using your provided dataset?

Would be super interesting to see which one is more data-efficient!

bradhilton 1 year ago | |

Great question! So the dataset includes prompts and solutions, but no "gold" answer per se to use for SFT. You could sample responses from larger models and then train the smaller model on their answers, but as outlined in the benchmarks there is still a lot of headroom on this task and I wouldn't expect that to get the same results. At the very least you would probably want to do rejection sampling to discard bad results. It would definitely be a good experiment!
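A minimal sketch of what that rejection sampling step could look like when building an SFT dataset. The function name, the `sample_fn` and `score_fn` callables, the `min_score` threshold, and the prompt/completion record format are all hypothetical placeholders for this example; a real pipeline would call a larger model in `sample_fn` and grade against the puzzle solutions in `score_fn`.

```python
def rejection_sample(prompt, solution, sample_fn, score_fn, n_samples=8, min_score=1.0):
    """Sample several responses and keep only those scoring at least min_score.

    sample_fn(prompt) -> response text (e.g. from a larger teacher model).
    score_fn(response, solution) -> accuracy in [0, 1].
    Both callables are hypothetical hooks supplied by the caller.
    """
    kept = []
    for _ in range(n_samples):
        response = sample_fn(prompt)
        if score_fn(response, solution) >= min_score:
            kept.append({"prompt": prompt, "completion": response})
    return kept


# Usage with deterministic stand-ins for the teacher model and the grader.
canned = iter(["A", "B", "A", "C"])
examples = rejection_sample(
    prompt="Who did it?",
    solution="A",
    sample_fn=lambda p: next(canned),
    score_fn=lambda response, sol: 1.0 if response == sol else 0.0,
    n_samples=4,
)
print(len(examples))  # 2 of the 4 canned responses match the solution
```

Only the surviving examples would go into the fine-tuning set, so the smaller model never trains on answers already known to be wrong.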