Key finding: using traces as context for synthetic data generation yields models that score up to 26pp higher than models trained directly on the traces. The 1.7B student model also beats every frontier teacher we tested, including GLM-5 at 744B.
All code and data: https://github.com/distil-labs/distil-tft-benchmarking
Curious how others handle production trace quality for fine-tuning — have you run into similar issues?