Insights from Multilingual Curation for a 20T-Token Dataset | Dark Hacker News