Outside of privacy (leaking PII), the above is likely the main reason. Someone could have invested a lump of money to scrape as much as they can and then go to town in the courts.
The terms that prohibit it are under “2. Usage Requirements”, which restricts reverse engineering the underlying model structure.
As ridiculous as it may seem, they're doing the right thing.
so if this is what the right thing looks like…
Making it against the rules to prove their illegal behavior is not the right thing to do.
I don't think the actual TOS has been changed though.
It is a TOS violation. It’s not a big one. But the weakness of the model is the story here.
So I feel like it’s important to distinguish between sensitive PII (my Social Security or bank account number) and non-sensitive PII (my name and phone number scraped from my public website).
The former is really bad, both to train on and to divulge. The latter is not bad at all and not even remarkable, unless it's tied to something else that makes it sensitive (e.g., HIV status from a medical record).
Has anyone figured out why asking it to repeat words forever makes the exploit work?
Also, I've gotten it into infinite loops before without asking. I wonder if that would eventually reveal anything.
I think it's more an attempt to ban the people OpenAI stole data from when they pay $20 to gather evidence about what data was stolen.
It's obviously malicious; the warning seems like window dressing.
Will you be glad when those systems are good enough to replace you, having become so by using your toil for free?
[0] Heck, they could even unite and found an LLM startup themselves, training the models legally and making them available to users at various tiers.
I do not want corporate behemoths to profit from my work for free. Period.
And to sprinkle in a bit of ad hominem, I am aware that things regarding rules or ethics are viewed differently in your culture, and that is okay.
GitHub has always had such clauses, even if they were not explicit about AI model training in particular. It is best to self-host your own Git instance if you are that worried.