Ask HN: Estimation of copyrighted material used by LLMs
1. Is it true that LLMs / AI companies have used copyrighted material for training?
2. Is it possible to estimate how much copyrighted material has been used?
2. This is harder, since a lot of the companies don't disclose their training sets.
There's no easy answer there, hence New York Times v. OpenAI.
I think sticking a straw in Zlib or AA or LibGen or whatever it is, and drinking until it makes gurgling slurping noises as it hoovers up the dregs at the bottom of the barrel, is far, far removed from “fair use”.
For example, most popular textbooks have at least several pirate copies uploaded to the web. Some of them are even in plain sight and Googleable.