Pretraining with hierarchical memories separating long-tail and common knowledge | Dark Hacker News