Tree Search Distillation for Language Models Using PPO (ayushtambde.com)
This part confused me: it sounded like they were only doing the MCTS at train time and then using GRPO to distill the MCTS policy into the model weights. So wouldn't the model still have the same inference cost?
Once you have identified the best method and want to productize it, it would of course make sense to apply it on top of the best model, but if you're just doing research, you can skip that expensive last step.
In what way does using this model reduce the authors' credibility?
What are your thoughts on MCTS for coding?
This can/must be paired with a smart execution harness to optimise rollout and rollback of execution paths and system state.
Does this change the calculus for optimal post-training?
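To make the harness idea concrete, here is a minimal sketch of checkpoint-and-rollback around branch exploration. Everything in it is hypothetical (the `Harness` class, the `explore` helper, and the use of a plain dict as "system state"); a real coding harness would need to snapshot files, processes, and so on, which this does not attempt.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical harness: keeps a stack of state snapshots for rollback."""
    state: dict = field(default_factory=dict)
    _snapshots: list = field(default_factory=list)

    def checkpoint(self) -> int:
        # Save a deep copy so later mutations cannot leak into the snapshot.
        self._snapshots.append(copy.deepcopy(self.state))
        return len(self._snapshots) - 1

    def apply(self, action) -> None:
        # An action is any callable that mutates the state dict in place.
        action(self.state)

    def rollback(self, checkpoint_id: int) -> None:
        # Restore the saved state and discard snapshots taken after it.
        self.state = copy.deepcopy(self._snapshots[checkpoint_id])
        del self._snapshots[checkpoint_id + 1:]


def explore(harness, actions, score):
    """Try each action from the same starting state; return the best one."""
    cp = harness.checkpoint()
    best_action, best_score = None, float("-inf")
    for action in actions:
        harness.apply(action)
        s = score(harness.state)
        if s > best_score:
            best_action, best_score = action, s
        harness.rollback(cp)  # undo side effects before the next branch
    return best_action, best_score
```

Each branch starts from the same snapshot, so a tree search can expand siblings without one execution path contaminating another. Deep-copying the whole state is the simplest choice here, not the cheapest one.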