Self-Distillation Enables Continual Learning [pdf] (arxiv.org)

100 points by teleforce 1 day ago | 23 comments

From Jan 2026.

This is very interesting:

"Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x, the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."

HarHarVeryFunny 16 hours ago

The title seems a bit misleading.

The paper is about a way to do SFT with less chance of catastrophic forgetting and performance regressions.

The idea is that SFT on new data that was NOT generated by the model (aka "off policy" data) is likely to cause problems due to the statistical mismatch between the new data and what the model has already learnt. As I understand it, their solution is to statistically align the new data with the old by feeding it to the old model, which will hopefully grok it via in-context learning, and then having the model regenerate it in its own words, so that "off policy" data becomes "on policy". The model can then be SFT trained on this regenerated data (i.e., self-distillation).
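A minimal sketch of the data flow as described in this comment, which is not necessarily the paper's exact procedure. The names `build_teacher_prompt`, `regenerate_dataset`, and `generate` are hypothetical, the prompt template is made up for illustration, and `generate` stands in for any call into the current model.

```python
from typing import Callable, List, Tuple


def build_teacher_prompt(prompt: str, demonstration: str) -> str:
    """Put the off-policy demonstration in context (hypothetical template)."""
    return (
        "Here is an example of a good response:\n"
        f"{demonstration}\n\n"
        "Now answer the following request in your own words:\n"
        f"{prompt}"
    )


def regenerate_dataset(
    pairs: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn (prompt, demonstration) pairs into (prompt, model-written target) pairs."""
    regenerated = []
    for prompt, demonstration in pairs:
        # The model sees the demonstration only here, via in-context learning.
        target = generate(build_teacher_prompt(prompt, demonstration))
        # The training example pairs the bare prompt with the model's own wording.
        regenerated.append((prompt, target))
    return regenerated
```

The point of the sketch is the asymmetry: the demonstration appears in the prompt used for generation, but the training example keeps only the original prompt and the model's own output.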

To me SFT and "continual learning" are two distinct things.

Human/animal continual learning is always-on learning that removes the need for, and distinction between, training and inference, and it is initiated by prediction failure. It's as much about skill acquisition as it is about knowledge acquisition. Continual learning can happen in any context, from trying to do something (or just observing something passively) and being wrong about the outcome of your own actions, or about what some external entity does next, to curiosity/boredom driven exploration and play, which is more along the spectrum of pure learning with less expectation of outcomes.

Continual learning is what, one day, will let the AGI intern pick up new skills on the job by trying to do things and failing/learning/practicing until they get better. This is not the same as sending the intern home with a textbook to read, or a transcript of the conversations you had with it today, and having it take these on board overnight, which is basically what SFT is designed to do: intermittent addition of new declarative data.

big-chungus4 50 minutes ago

They don't use the old model; they use an EMA of the trained model's weights as the teacher.
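For anyone unfamiliar with the term, an EMA (exponential moving average) teacher is a slowly moving copy of the student's weights. Here is a minimal sketch using plain lists of floats; the function name `ema_update` and the default `decay` of 0.99 are assumptions for illustration, not values taken from the paper.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

Called once per training step, this keeps the teacher close to the student while smoothing out step-to-step noise. A decay near 1 makes the teacher change more slowly.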

airstrike 1 day ago

Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.

I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.

teleforce 1 day ago

Fun fact: this paper is cited by Apple's Simple Self-Distillation (SSD) paper [1],[2]. I think the name is a poor choice, both because "SSD" is already a very common abbreviation and because the method belongs to on-policy self-distillation [3]. Then again, according to the authors, their proposed solution is simple because "SSD uses only temperature-shifted samples from the base model and standard cross-entropy training, without privileged context, feedback-conditioned teachers, or auxiliary supervision."
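To make the two ingredients in that quote concrete, here is a small sketch of temperature-scaled softmax and cross-entropy over a toy set of logits. It only illustrates the terms and is not the SSD training loop; the function names are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log-probability of the target token at temperature 1."""
    probs = softmax_with_temperature(logits, 1.0)
    return -math.log(probs[target_index])
```

Sampling at a temperature other than 1 changes which tokens the base model tends to emit, while the training loss on those samples is still ordinary cross-entropy.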

The Apple paper also cites another paper with a very similar self-distillation idea, by a UCLA team. Both cited papers, one by the MIT & ETH team and the other by the UCLA team, propose a novel on-policy self-distillation technique. Interestingly, the two teams submitted their papers to arXiv within one day of each other back in January this year [4],[5]. No prize for guessing who actually published the idea first.

IMHO, self-distillation fine-tuning is the future of LLM fine-tuning because it mitigates the forgetting caused by the SFT approach, which can be cumbersome when the goal is lightweight fine-tuning rather than full post-training of an LLM.

With the advent and proliferation of a plethora of open source and open weight LLM foundation models, anyone can fine-tune these models for domain specialization or sub-specialization (such as medical sub-specialties, law disciplines, branches of architecture practice, etc.) [6]. This fine-tuning can be performed with as few as 8 H200 or even 4 H100 GPUs, as reported respectively in the two papers [4],[5]. Let's see if we can replicate that with much cheaper arrangements consisting of a couple of DGX Sparks, or the latest arrangement of eight DGX Spark based nodes with a total of 1 TB RAM (128 GB x 8) [7],[8].

IMHO, if the results are valid, self-distillation could be the second best thing to happen to LLMs after the transformer.

[1] Embarrassingly simple self-distillation improves code generation (2026 - 201 comments):

https://news.ycombinator.com/item?id=47637757

[2] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

[3] Comment on "Embarrassingly simple self-distillation improves code generation":

https://news.ycombinator.com/item?id=47644784

[4] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[5] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[6] Why domain specific LLMs won't exist: an intuition (2026 - 4 comments):

https://news.ycombinator.com/item?id=47649167

[7] NVIDIA DGX Spark Review The GB10 Machine is so Freaking Cool:

https://www.servethehome.com/nvidia-dgx-spark-review-the-gb1...

[8] BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster:

https://www.servethehome.com/big-cluster-little-power-the-8x...

greesil 1 day ago

Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.
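A toy illustration of that definition: a policy maps a context to a probability distribution over the next token, and the probability of a whole sequence is the product of those next-token probabilities (a sum in log space). The `Policy` type and `sequence_logprob` below are hypothetical, and this toy conditions only on the previous token, whereas an LLM conditions on the whole prefix.

```python
import math
from typing import Dict, List

# Hypothetical toy policy: previous token -> distribution over the next token.
Policy = Dict[str, Dict[str, float]]


def sequence_logprob(policy: Policy, tokens: List[str]) -> float:
    """Sum of log pi(next | previous) over consecutive token pairs."""
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total += math.log(policy[prev][nxt])
    return total


toy_policy: Policy = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"b": 1.0},
}
```

In this vocabulary, "on policy" data is text sampled from the model's own distribution, and "off policy" data is text that came from somewhere else, such as human-written demonstrations.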