Will we run out of ML data? Evidence from projecting dataset size trends (2022) (epochai.org)

66 points by kurhan 3 years ago | 121 comments

The last gen of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can turn audio to text reliably. If you point Whisper at the world's podcasts, that's a whole new source of conversational text. If you point Whisper at YouTube, that's a whole new source of all sorts of text. And then there are all sorts of private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.

chii 3 years ago | |

> If you point Whisper at YouTube, that's a whole new source of all sorts of text.

A lot of YT videos already have autogenerated English subtitles, which are actually available as a VTT download, so you don't even need to use Whisper on a video to obtain them!
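For anyone wanting to turn such a subtitle file into plain training text, here is a minimal sketch. It assumes the file follows the usual WebVTT layout (a `WEBVTT` header, optional `NOTE`/`STYLE` blocks, and cues whose timing line contains `-->`). The function name `vtt_to_text` and the sample captions are made up for illustration, and dropping consecutive duplicate lines is my own assumption about how rolling auto-captions tend to repeat themselves.

```python
import re

# Inline cue markup such as <c>...</c> or <00:00:01.000> timestamps.
TAG = re.compile(r"<[^>]+>")

def vtt_to_text(vtt: str) -> str:
    """Collapse WebVTT captions into one plain-text string (hypothetical helper)."""
    out = []
    # Blocks in a WebVTT file are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt.replace("\r\n", "\n")):
        lines = block.strip().split("\n")
        # A cue is any block that has a timing line containing "-->".
        timing = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if timing is None:
            continue  # header, NOTE or STYLE block: no caption text here
        for line in lines[timing + 1:]:
            text = TAG.sub("", line).strip()
            # Skip empty lines and consecutive repeats of the same caption line.
            if text and (not out or out[-1] != text):
                out.append(text)
    return " ".join(out)

SAMPLE = """WEBVTT

NOTE made-up sample captions

1
00:00:00.000 --> 00:00:02.000
hello <c>world</c>

2
00:00:02.000 --> 00:00:04.000
hello world
second line
"""

print(vtt_to_text(SAMPLE))
```

This only handles the text; speaker labels, positioning settings, and caption quality are all ignored.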

Salgat 3 years ago | |

But how much more data is required to make a big difference? Is doubling the dataset considered a dramatic improvement? Or is increasing the dataset by 10x needed?
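One rough way to frame the question, assuming the data-limited part of the loss follows a power law of the form B / D**beta in dataset size D. That form and the exponent below are assumptions for illustration, not something the article or this thread establishes. Under that assumption, multiplying the dataset by some factor shrinks the data-limited term by a fixed ratio:

```python
def data_term_ratio(scale: float, beta: float) -> float:
    """Factor by which a term B / D**beta changes when D is multiplied by `scale`."""
    return scale ** -beta

# Illustrative exponent only (an assumption, not a measured value).
BETA = 0.28

for scale in (2, 10):
    ratio = data_term_ratio(scale, BETA)
    print(f"{scale}x data -> data-limited term multiplied by {ratio:.3f}")
```

The point of the sketch is only that, with a small exponent, doubling the data buys a modest reduction and 10x buys a much more visible one. How that translates into downstream quality is a separate question.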

ospray 3 years ago | | |

Also, quality is likely important: will the models get better if we train them on YouTube comments?

nologic01 3 years ago |

Brute force approaches always hit some wall. ML will be no different. In the decades to come it is quite likely that algorithms will develop in directions orthogonal to current approaches. The idea that you improve performance by throwing gazillions of data into gargantuan models might even come to be seen as laughable.

Keep in mind (pun) that the only real intelligence here is us, and we are pretty good at figuring out when a tool has exhausted its utility.

airgapstopgap 3 years ago | |

We won't hit the wall.

Somewhat counterintuitively, scaling datasets is the lazy and economical approach. If you have the compute already, might as well dig an OOM more text tokens.

But there are other sources of data, and slightly different ways to utilize it. Multimodality, in very large training runs, will almost inevitably increase sample efficiency (for obvious reasons of context richness), synthetic data is already very effective [1], and other ways to do more with diminishing raw text resources exist and will keep being discovered. But a thorough abandonment of the scaling strategy is very unlikely.

Sutton's Bitter Lesson [2] points at a very powerful rule of thumb: we shouldn't turn AI engineering into a contest of smartness, we should allow complex smartness to emerge from generic low-level algorithms. What will be seen as laughable in decades to come is not the scaling strategy, but the Godlike conceit of people who thought they could devise generally applicable rules of reasoning from first principles.

1: https://arxiv.org/abs/2304.08466
2: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

sanxiyn 3 years ago | | |

I don't think your "synthetic data on ImageNet" reference shows "synthetic data is already very effective". Since many people won't read the paper, here's what it says:

Training ResNet-50 on real ImageNet gives 73.09% top-1 accuracy, while training it on synthetic data (same resolution, same number of images) generated by this work gives 64.96%, which is SOTA compared to previous work's 63.02%. Therefore, synthetic data is worse than real data for now.

But synthetic data is not useless, because training on real data plus synthetic data is a bit better than training on either alone. (Accuracy here is different due to different methodology.) Using 1:1 real data and synthetic data improves accuracy from 76.39% to 77.61%. But using 1:2 is worse than 1:1 (77.16%), even though the dataset becomes 50% larger. With 1:4, the result is worse than not using synthetic data at all. So synthetic data can at best enlarge the dataset by 5x, more likely just 2x.
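To make the ratios concrete, here is a small sketch of the arithmetic using only the accuracies quoted in this comment. The names `BASELINE`, `MIXES` and `dataset_multiplier` are made up for illustration, and no number is given here for the 1:4 mix, so only its dataset size is computed.

```python
# Top-1 accuracies (percent) as quoted in the comment above.
BASELINE = 76.39                 # real data only
MIXES = {1: 77.61, 2: 77.16}     # synthetic parts per one part real -> accuracy

def dataset_multiplier(synthetic_parts: int) -> int:
    """Total dataset size relative to real-only, for a 1:synthetic_parts mix."""
    return 1 + synthetic_parts

for parts, acc in MIXES.items():
    gain = acc - BASELINE
    print(f"1:{parts} -> {dataset_multiplier(parts)}x data, {gain:+.2f} points vs real only")

# The 1:4 mix has no accuracy quoted here; only its size is known.
print(f"1:4 -> {dataset_multiplier(4)}x data")
```

So the 2x mix gives the best quoted gain, the 3x mix gives back some of it, and the 5x mix is reported as a net loss.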

nologic01 3 years ago | | |

You are passing off personal preferences (and possibly professional interests) as rules of nature. If anything, Godlike conceit definitely applies to some ML acolytes.

In any case, with your last point "we should allow complex smartness to emerge" you essentially agree with my point that new levels will emerge from orthogonal (new) directions.

The good thing about brute force is that it summons so many resources it primes the way for smarter approaches.

For those not conceited the objective is not some deus-ex-machina but "algorithms that work".

auggierose 3 years ago | | |

But we can devise generally applicable rules of reasoning from first principles. It's called logic. I am pretty sure the next step is to properly combine machine learning and logic.

gleenn 3 years ago | |

AI had a winter of many decades because the hardware wasn't there and there were better alternatives, especially for neural nets. Now ChatGPT etc. comes out, with unbelievable results, decades in the making. And a couple of months in, we're already writing it off because of the next limitation? Maybe let's give it more than a month or two to figure out if we even need all that data. I heard they're already talking about trying to significantly reduce the number of model parameters, even though a large increase in model size is apparently the reason GPT-4 was so much better than 3. Give it a minute, IMHO, before making generalizations like this so soon.

mjburgess 3 years ago | | |

Well I imagine the commenter actually understands the domain, the techniques, and is making an informed opinion.

It is possible to form opinions by knowing the domain, rather than drawing an exponential curve of newspaper headlines which trails off "..."

StrangeATractor 3 years ago |

On this note, the data-set available if you start collecting today is tainted with experimental AI content. Not the biggest issue right now but as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence, a brave new abstraction.

bhouston 3 years ago |

We just are not thinking wide enough:

* Train on all of television history, and streaming content.

* Train on YouTube.

* I suspect at some point we'll have a recording of most of people's lives, e.g. live-streaming: https://en.wikipedia.org/wiki/Lifestreaming#Lifecasting

mountainriver 3 years ago | |

Exactly, put bots into the world with cameras and you have infinite training data. Humans also need a ton of data to train on and have way more parameters than the biggest ML model today.

eru 3 years ago | |

You can also gather arbitrarily more video data by just turning on some webcams and pointing them at the world.

In addition you can also feed your system from video games.

istjohn 3 years ago | | |

Microphones, too.

replygirl 3 years ago |

If we play our cards right, AI could free people up for more valuable pursuits, and the pace of human information production would increase by orders of magnitude.

visarga 3 years ago | |

> free people up for more valuable pursuits

It won't roll like that. AI will empower people to be more productive but won't free people up because it makes mistakes, can't help itself, and cannot function autonomously. There is no LLM application that is safe for autonomous usage today. How can we go from 0 to 1? I don't see a path. Self-driving cars still can't reach L5 to completely remove the need for a driver.

But maybe this is a blessing in disguise. It will make AI more like a new ability of humans than of the companies. Companies need people to unlock AI efficiencies. And AI tends to become open sourced, so everyone has access to the same. AI is not a moat for companies, and the human ability to hand-hold it is tied to individuals. That would make the transition easier. Solving that last 1% accuracy might encounter exponential friction and last for a while.

PeterisP 3 years ago | | |

Functioning autonomously is not the level needed to free up people.

If your department gets a bunch of entry-level hires or interns, that frees up people in your organization even if they make mistakes, require supervision and can't function autonomously. Similarly, if an AI system can do half of a particular job under human supervision, it can free up (or make redundant) half of the people doing that job.

replygirl 3 years ago | | |

Yeah, it's not that I think we'll get all the way there, it's a utopia. My expectation is that within 30 years we reduce the work week by a day or two for most people, compensate for our education system's decline, and avoid energy and food crises, and nothing else fundamentally changes.

acapybara 3 years ago | | |

> Self driving cars still can't reach L5 to completely remove the need for driver.

This will probably be (or already has been) solved by large transformer models or their successor architectures.

What was missing was common sense reasoning about what they see. We now have that.

blibble 3 years ago | |

> the pace of human information production would increase by orders of magnitude

you mean boilerplate and spam right?

gumballindie 3 years ago | |

> free people

As opposed to what? Being "captive" in jobs for paying bills?

digdugdirk 3 years ago | | |

I mean... Yes.

What would you suggest as the alternative?

haldujai 3 years ago |

I wonder if the better question is not how we get more training data but:

If we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?

Perhaps ongoing work in finetuning will take these models to the next level, but ignoring the LLM hype, it really does seem like things have plateaued for a while now (with expected gains from scaling).

HybridCurve 3 years ago |

This take is a bit silly in that they are implying the problem with training models will be that we will run out of data. It's more likely that the problem is that the current models require too much data to reach convergence.

We've been trying to speedrun neural network science for the past decade but we still don't fully understand how they work. It's like being a bad programmer who doesn't understand algorithms, so you compensate by spending money on hardware to make your programs run faster. At some point we will reach a limit where you can't buy your way out of the problem with more data or money and we'll all be forced to return to studying the foundations of the science rather than just trying to scale the existing models up.

I am certain when we get to that point everyone will realize we've been trying to feed these models too much data. It makes more sense that our current architectures are just not effective at assimilating the data they have.

bobsmooth 3 years ago |

There's gotta be entire libraries that haven't been digitized that can be mined for data.

brianr 3 years ago |

This analysis misses the impact of AI models being deployed, as is happening rapidly right now. Production applications built on AI will provide ample (infinite?) additional training data to feed back into the underlying models.

haldujai 3 years ago | |

Not sure that synthetic or LLM-generated training data is as useful as human generated text.

It seems "good enough" (for now), but synthetic data makes up a very small proportion of the training set in the current models that have been trained on it. If that proportion ends up being mostly synthetic, we'll likely see whatever weird hallucinations and biases exist in the dominant backend (GPT-4 or whatever) become amplified.

It's been shown repeatedly that garbage in = garbage out for training data.

brianr 3 years ago | | |

Agree about synthetic data. My point is that AI-powered applications that are deployed in production generate more _real_ data which can be used for training. For example, self-driving cars generate tons of data about how their models perform, as a result of the cars driving around. Similarly, code-writing AI applications will generate feedback in the form of errors, logs, etc., which can be fed back into the models as training data.

laserbeam 3 years ago |

I love how "running out of data" implies that AI companies have access to all the text we ever wrote on all platforms out there. I mean it's probably true...

flyval 3 years ago |

This is dumb. An individual human takes in more data than modern LLMs do.

https://open.substack.com/pub/echoesofid/p/why-llms-struggle...

sebzim4500 3 years ago | |

Blind kids don't though, and they still end up being smarter than GPT-4.

eru 3 years ago | | |

They still take in a lot of sensory data, eg related to touch and proprioception.

majikaja 3 years ago | |

>Just the vision data of a baby’s first year easily adds up to petabytes

What encoding is this??

lostmsu 3 years ago | | |

Uncompressed 2x 8k by 8k 24bpp 24FPS video. Comes to roughly 550GB per minute (about 33TB per hour).
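A quick back-of-the-envelope check of that encoding, as a sketch. It assumes "8k" means 8000 pixels per side, "2x" means two streams (one per eye), and that nothing is compressed. The function name is made up for illustration. Under those assumptions the raw rate lands in the tens of terabytes per hour.

```python
def raw_video_bytes_per_second(width: int, height: int, bits_per_pixel: int,
                               fps: int, streams: int = 1) -> int:
    """Uncompressed video data rate in bytes per second (hypothetical helper)."""
    bytes_per_pixel = bits_per_pixel // 8
    return streams * width * height * bytes_per_pixel * fps

# Assumed reading of "2x 8k by 8k 24bpp 24FPS": two 8000x8000 streams.
rate = raw_video_bytes_per_second(8000, 8000, 24, 24, streams=2)
per_minute = rate * 60
per_hour = rate * 3600

print(f"{rate / 1e9:.2f} GB/s")
print(f"{per_minute / 1e9:.0f} GB/minute")
print(f"{per_hour / 1e12:.1f} TB/hour")
```

Reading "8k" as 8192 instead of 8000 raises these numbers by about 5%, which does not change the order of magnitude.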

echelon 3 years ago | | |

> > Just the vision data of a baby’s first year easily adds up to petabytes

Just to add to this, the human brain also encodes quite a lot of evolutionary lessons. We didn't have to learn edge detectors.

MagicMoonlight 3 years ago |

Only if you rely on dumb learning where it’s learning pure pattern matching, rather than interacting and doing reinforcement learning based on the responses.

cs702 3 years ago |

No.

AI can generate as much synthetic data as we need, on demand.

Many SOTA models, in fact, are already being trained with synthetic AI-generated data.

See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines