Generalized on-policy distillation with reward extrapolation | Dark Hacker News