Large language model data pipelines and Common Crawl (WARC/WAT/WET) formats | Dark Hacker News