Dwarkesh Clips · June 23, 2026 · 13m

How On Policy Self Distillation Works - Sasha Rush

TL;DR

On-policy distillation trains on the student's actual behavior: Instead of copying a teacher's full output sequence, the student generates its own trajectory and then matches the teacher's token probabilities on that exact path.
Rush's Nadal analogy makes the difference click: Classic distillation is like watching Rafael Nadal play and trying to imitate him, while OPD is like playing your own bad tennis and having Nadal correct each mistake over your shoulder.
Composer 2.5 uses self-distillation because there is no better teacher model: Rather than relying on a stronger external model, the system creates a synthetic teacher by adding text feedback to the student's existing trajectory and comparing log probs before and after.
This helps with credit assignment in long RL traces: When trajectories can run for hundreds of turns, RL struggles to pinpoint where an error began, so a reader model flags a specific problematic message and inserts feedback right there.
There is only one rollout, not two: The model does not regenerate a better sequence after feedback. It simply re-scores the same tokens under a slightly modified context, which avoids an extra decode step.
The tradeoff is local improvement, not dramatic jumps: Because the model learns from its own imperfect trajectory, it gets incremental corrections rather than the full benefit of seeing an ideal teacher path from start to finish.
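As a rough illustration of the first point above, the sketch below samples a trajectory from a toy student and compares student and teacher distributions at every step of that same path. The function names, the two-token vocabulary, and the choice of a per-step KL from student to teacher are assumptions made for this example. The summary does not state the exact loss used.

```python
import math
import random


def sample(probs, rng):
    """Draw one token id from a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def on_policy_distill_loss(student_step, teacher_step, steps, rng):
    """Average per-step KL(student || teacher) along a student-sampled path.

    student_step and teacher_step are hypothetical callables that map a
    prefix (list of token ids) to a probability list over the vocabulary.
    """
    prefix = []
    total = 0.0
    for _ in range(steps):
        p_student = student_step(prefix)
        p_teacher = teacher_step(prefix)  # teacher scores the student's own prefix
        total += sum(
            ps * math.log(ps / pt)
            for ps, pt in zip(p_student, p_teacher)
            if ps > 0
        )
        # The next token comes from the student, not the teacher (on-policy).
        prefix.append(sample(p_student, rng))
    return total / steps, prefix


# Toy models (assumed): fixed distributions over a two-token vocabulary.
def toy_student(prefix):
    return [0.7, 0.3]


def toy_teacher(prefix):
    return [0.5, 0.5]


rng = random.Random(0)
loss, path = on_policy_distill_loss(toy_student, toy_teacher, steps=5, rng=rng)
zero_loss, _ = on_policy_distill_loss(toy_student, toy_student, steps=5, rng=rng)
print(loss, path, zero_loss)
```

The teacher never produces its own sequence here. It only scores the prefix the student actually wrote, which is the Nadal-over-your-shoulder setup rather than the watch-and-imitate one.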

The Breakdown

Composer 2.5 speeds up RL by turning a model's own flawed trajectory into a teaching signal, without needing a stronger teacher model or a second rollout. Sasha Rush's key point is that you can inject targeted text feedback into the same token sequence, re-score it under the modified context, and use the KL gap between the two scorings as a local correction signal.
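The re-scoring step can be sketched in a few lines. This is a toy under stated assumptions, not Composer 2.5's implementation: `toy_log_probs` is a hypothetical stand-in for a model forward pass, the `avoid:` feedback format is invented for the example, and the feedback is simply appended to the context rather than inserted at the flagged message.

```python
import math


def toy_log_probs(context, tokens):
    """Hypothetical scorer: one log prob per token, given a context.

    Toy rule (assumed): a token becomes less likely when the context
    contains feedback naming it as a mistake. A real system would run
    a forward pass of the model instead.
    """
    scores = []
    for tok in tokens:
        p = 0.1 if f"avoid:{tok}" in context else 0.5
        scores.append(math.log(p))
    return scores


def self_distill_signal(prompt, tokens, feedback):
    """Per-token log-prob gap for one rollout scored under two contexts."""
    student = toy_log_probs(prompt, tokens)               # original context
    teacher = toy_log_probs(prompt + [feedback], tokens)  # same tokens, plus feedback
    return [t - s for s, t in zip(student, teacher)]


prompt = ["fix the failing test"]
rollout = ["open_file", "delete_repo", "run_tests"]  # the single student rollout
signal = self_distill_signal(prompt, rollout, "avoid:delete_repo")
print(signal)
```

Only the flagged token gets a nonzero gap, which is the local credit assignment the summary describes. Nothing is decoded a second time: both passes score the same `rollout` list.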