Scaling Pedagogical Pre-Training: From Optimal Mixing to 10B Tokens | Dark Hacker News