1 points by jacek-123 78 days ago

jacek-123 78 days ago

We ran this benchmark because we kept seeing the same failure mode: teams fine-tune small models on production traces, expecting them to learn their agent's behavior, but the downstream metrics come out poor. We tested five scenarios on the Schema Guided Dialogue dataset: a clean baseline plus four corruptions (noisy labels, schema drift, low data, and irrelevant trace mixing).
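To make the corruption scenarios concrete, here is a rough sketch of how three of them (noisy labels, irrelevant trace mixing, low data) could be simulated on a list of traces. This is not the benchmark's actual code. The `Trace` fields, function names, and parameters are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Trace:
    # Hypothetical trace shape: one user turn and the tool call the agent made.
    user_utterance: str
    tool_call: str  # the label a student model would be trained to predict
    domain: str


def add_label_noise(traces: List[Trace], labels: List[str],
                    rate: float, rng: random.Random) -> List[Trace]:
    """Noisy labels: with probability `rate`, swap a label for a different one."""
    noisy = []
    for t in traces:
        if rng.random() < rate:
            wrong = rng.choice([l for l in labels if l != t.tool_call])
            t = replace(t, tool_call=wrong)
        noisy.append(t)
    return noisy


def mix_irrelevant(traces: List[Trace], other: List[Trace],
                   fraction: float, rng: random.Random) -> List[Trace]:
    """Irrelevant trace mixing: add off-task traces until they are about
    `fraction` of the result (requires 0 <= fraction < 1)."""
    n_other = round(len(traces) * fraction / (1 - fraction))
    mixed = traces + rng.sample(other, min(n_other, len(other)))
    rng.shuffle(mixed)
    return mixed


def subsample(traces: List[Trace], n: int, rng: random.Random) -> List[Trace]:
    """Low data: keep at most `n` randomly chosen traces."""
    return rng.sample(traces, min(n, len(traces)))
```

Passing an explicit `random.Random(seed)` keeps each corrupted split reproducible, which matters when comparing scenarios against the clean baseline.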

Key finding: using traces as context for synthetic data generation scores up to 26 percentage points higher than training directly on them. The 1.7B-parameter student model also beats every frontier teacher we tested, including GLM-5 at 744B parameters.
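The difference between the two approaches can be sketched roughly as follows. This is a simplified illustration under assumptions, not the pipeline from the benchmark. The trace dictionary keys, the prompt wording, and the `teacher` callable (a stand-in for whatever model generates the synthetic examples) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input, output) pair used for fine-tuning


def direct_training_set(traces: List[Dict[str, str]]) -> List[Example]:
    """Direct approach: each trace becomes a training example as-is,
    so any noise or drift in the traces goes straight into the student."""
    return [(t["input"], t["output"]) for t in traces]


def synthetic_training_set(traces: List[Dict[str, str]],
                           task_description: str,
                           teacher: Callable[[str], Example],
                           n_examples: int) -> List[Example]:
    """Context approach: traces only inform the prompt; the training
    examples themselves are generated by the teacher."""
    context = "\n".join(
        f"- input: {t['input']} | output: {t['output']}" for t in traces
    )
    prompt = (
        f"Task: {task_description}\n"
        f"Example traces from production (may contain errors):\n{context}\n"
        "Write one new, correct input/output pair for this task."
    )
    # `teacher` is a hypothetical callable wrapping a model call.
    return [teacher(prompt) for _ in range(n_examples)]
```

In the second function the traces never appear in the training set directly, so a teacher that knows the task description has a chance to ignore mislabeled or off-task traces.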

All code and data: https://github.com/distil-labs/distil-tft-benchmarking

Curious how others handle production trace quality for fine-tuning. Have you run into similar issues?