This is the big one. Every single source of data post March 2023 is corrupted beyond repair, and it's only going to get worse.
This seems like a big leap from current “known problems” to “doomed”.
Smaller models seem more desirable anyway (faster, less resource usage, etc.), so a system that distills models down to a smaller size is more attractive than ever-increasing model sizes. Additionally, there’s some evidence that random internet data is lower quality than professionally written material (e.g. books, journalism), so I wouldn’t be surprised to see future models move away from internet scraping for everything but actual fact gathering. I think most people realize that relying entirely on “knowledge” trained into the model leads to worse hallucinations and results than a hybrid approach where the model handles the NLU/NLP aspect but farms out facts and computations to dedicated systems/APIs.
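To make the hybrid idea a bit more concrete, here is a toy sketch in Python. It is purely my own illustration and not how any real product works: `FACTS`, `safe_eval`, and `route` are hypothetical names, the regex is a crude stand-in for the model's NLU step, and the dictionary stands in for a search engine or database API.

```python
import ast
import operator
import re

# Hypothetical fact store standing in for a search engine or database API.
FACTS = {"capital of france": "Paris"}

# Arithmetic operators the toy calculator backend is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic by walking the AST instead of calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)


def route(query):
    """Stand-in for the model's NLU step: decide which backend should answer."""
    q = query.lower().strip(" ?")
    math = re.fullmatch(r"what is ([\d\s.+\-*/()]+)", q)
    if math:
        # Computation goes to a dedicated calculator, not the model's weights.
        return ("calculator", safe_eval(math.group(1).strip()))
    for key, value in FACTS.items():
        if key in q:
            # Facts come from an external store that can be updated daily.
            return ("lookup", value)
    # Anything else falls back to the model's own generation.
    return ("model", None)
```

The point is only the shape: the language part decides where a question goes, and the answer to "what is 12 * (3 + 4)" or "what is the capital of France" comes from a system that can actually be correct and kept current, rather than from whatever was baked in at training time.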
What I want to read is the doom theory related to copyright issues, cost issues, or energy usage issues. Those are the open questions. There was a recent article saying GitHub Copilot costs twice what they charge for it. If true, that spells doom for the sustainability of the product. I want to hear that Google thinks training Bard on daily facts is too expensive compared to search engines. Those are the warning signs for “doom”.