I found the OpenAI bot scraping my blog recently. Assuming they used that data, when will they attribute me?
Given that Google successfully used a fair use defense in Authors Guild, Inc. v. Google, Inc., I think it's likely OpenAI and the others will also win in court.
I do think it's possible for specific uses of the output of LLMs to be copyright infringement. That's why it's interesting to see Microsoft indemnify customers of their commercial products in the event a case is brought against the customer. This is smart on Microsoft's part; the risk probably isn't very high, and by making it a non-issue for their customers, many more will feel comfortable using their LLM-based features and services.
Like GitHub's servers host AGPL code as data, without having to be open-source
The perceived problem there is if their model generates an exact copy of some AGPL code, and you use it in your project unknowingly, and then you can get sued
Note: I'm declaring my comment license as https://creativecommons.org/licenses/by-sa/4.0/
So if you remix or transform my comment by responding to it, please attribute me in your response.
I believe that this fact is and will be exploited to strip copyright and effectively transfer ownership using cleanroom/firewall techniques.
That's the key part. You haven't yet proved they have actually used your content for anything (other than, potentially, reading the license to decide whether to include it in or discard it from their training set).
But in practice we'll never know for sure if they are respecting the terms of licenses until 1) this is tested in court, or 2) there's some internal leak that points in either direction.
I would think OpenAI wants the thornier legal issues actually settled so that the whole ecosystem can grow within those terms & they can lobby for the legal changes they need/want?
Because people train on corpuses of data all the time, without a license or any attribution.
Every piece of text a writer reads is training that writer. Every image an artist sees helps to train that artist. Every sound a musician hears is training that musician.
That doesn't mean they can't exclude their works from training via a license going forward. But that becomes an enforcement problem.
I also doubt "humans are just a larger Markov chain than the LLM and they're allowed to" will hold up in court.
I really hope “copyright can be used to prohibit reading and learning” does not hold up in court.
Copyright is, and should be, a protection from unauthorized reproduction. Extending it to protect the abstract ideas would be a disaster. And extending it to control stylistic learning would be even worse.
Cliff notes is not what lets you replicate the style of the author etc.
And yeah, you can use "it's just feeding it into a bunch of math" to justify nearly anything that involves software, including good old piracy. What matters is what the math is used for. (Spoiler: to line Microsoft's pockets at the expense of actual writers in this case.)
When someone pirates a book, they're replacing the original without consent or remuneration to the copyright holders.
When you train an AI on the contents of a book, you're not replacing it. If someone is interested in the content, they still need to buy it. Using ChatGPT is not a substitute. If it is, they're gonna have to prove it in court, but I doubt they'll be able to.
https://creativecommons.org/2023/03/23/the-complex-world-of-....
If you dissect the plaintiffs' claim, they are arbitrarily conflating training and regurgitating
Training is using for criticism and comparison purposes, hence fair use
And there is no lawsuit against what it regurgitates and the purpose of its output, whether someone asks it to give a list for comparison purposes, or specifically asks it for a story that has a plagiarized result
===== who is dan markunas
ChatGPT I'm sorry, but I don't have any information on a person named Dan Markunas in my database ....
who is janet saunders ChatGPT I'm sorry, but I don't have any specific information about a person named Janet Saunders in my database,
===========
As far as style goes, copyright doesn't protect that. Trademark MIGHT if your style is distinctive enough to be a trademark (and is used as such), but the "style" of a writer is largely about tempo and word choices, none of which are subject to copyright protections.
>I found [logs of users from Paramount's writers offices reading] my blog recently. Assuming they used that data, when will they attribute me?
Put that way, the idea is silly on its face. OP has no evidence that any of their work was used at all, or even that what was used could be covered under the license in the first place.
When public safety and goodwill come into focus, that's where the role of automation is scrutinized and minimized more heavily. Copyright itself is an invention and area of balancing individual rights and greater public good.
Machines are not human and they are not sentient and sapient at a level where we can view them differently. Perhaps they will change one day, but as it is today these systems are not entitled to do the same things humans get to do. They are tools performing a task, so the laws apply to them as they apply to, well, machines; copying and reproducing whole code blocks or novel chapters without attribution or licenses is something we allow a human to do in their head and not what we allow a machine to do in a prompt, regardless of the non-human mechanisms in between.
The moral calculus probably changes if machines are deemed capable of producing "useful art", as granting artists temporary monopoly ceases to be the only mechanism of spurring that art.
.. wants the thornier issues to be debated and re-tried ad infinitum, as long as they generate cash flow and build their moat(s).. more likely
This behaviour seems more consistent with wanting it sorted out than stalling for time.
Merely summarizing info and attributing it to the source is the basic element of learning, for both machines and human beings.
These suits are necessary because it's not clear where the line is, and if ChatGPT's functions actually cross it.
What is clear is that OpenAI is doing its best to avoid infringing anyone's copyright even if it is trivial for them to do so. They have the training data, so they could simply output it word for word, bypassing the LLM. They don't do that, and they further restrain their LLM from making overly long recitations.
If you can trick / manipulate the LLM into giving you too much then I say that infringement is on you.
The ability to ask a commercial product is. In fact, feeding the book to that commercial product is already infringement.
ClosedAI is doing squat. The very least they could do is ask authors for permission, and of course if they really cared they would have the LLM infer attribution and revenue share with the original creators.
The vast majority of publications (especially those of an explanatory nature) do not contribute original content/information. The exceptions are things like research articles/monographs, historical records, government reports. But copyright infringement doesn't apply here because these things weren't published with a profit motive but precisely to publicize the information as widely as possible. The only problem area I can think of involves books published by commercial publishers which promise an 'exclusive peek' into the life of some famous person (think biographies of celebrities or books like Fire and Fury). In that kind of case there is indeed original content, and revealing it in detail will arguably mean fewer sales for the authors/publishers.
I disagree with this emphasis, given that rote, repetitive or technical material that is not original authorship is not in peril. Human authors who wrote original creative content, or wrote in a style that is personal and widely recognized, their rights to trade and commerce are in peril. That is much more important over the long term, and is not worth losing for convenient information mixers.
If someone makes a commercial activity of "answering any question about book contents at any time 24/7", hires tons of people to read those books and reply to billions of such questions daily thereby helping everyone not buy any books, is that robbing book authors?
Food for thought.
But let's be direct: are we talking about market share in the millions of views, where pirate copies are also available, or the sale of any books at all compared to a few hundred over a year? Quite the difference on a subsistence level of an individual author, no?
Curiously, when I ask GPT-4 about some well-known but under-copyright book, it says it can't answer because of the copyright. For well-known books out of copyright such as Alice in Wonderland, it can recite passages but tends to get lost and start reciting another section or book at some point. Would be real frustrating to use as a substitute.
Don't teachers do the same?
- Trained their minds on existing books
- Tutor the next generation of students
- Give classes on book contents
- Answer questions about those books
The book publishing industry didn't go out of business because there are teachers answering questions. To the contrary, it benefited book sales, because most people aren't good self-learners.
What's wrong with having a machine do the same?
> - Trained their minds on existing books
Training a human = enriching a conscious human mind. "Training" AI = mechanically creating a derivative work (no conscious mind to enrich). Training a human is to "training" AI as killing a human is to "killing" a Unix process: same word, different things
Ah, the mental gymnastics people go through to justify the theft.
Just... no. It's nothing about people reading your writings and deriving things from that. It's about big companies using automated tools to ingest your writing and provide commercial services based on it. To other people. Without paying you a dime.
I see what you're saying, but I fail to see how ChatGPT merely copying their style (not content) might impact "their rights to trade and commerce". Suppose I ask ChatGPT to "tell me some jokes in the style of Louis CK". Would that make me less likely to stream a Louis CK comedy special?
(By contrast, if I ask ChatGPT to summarize the key revelations from a book like Fire and Fury, that probably would make me less likely to buy the book, because if I buy the book it'd be for the novel information contained in it, but ChatGPT already divulged it to me.)
I think you are thinking too narrowly.
Many or most well-known comedians have people write for them. Those writers will be out of a job because the results of their work were fed into an LLM and now Louis CK will pay MS for it.
Companies who used to pay skilful writers will now pay MS, who trained its AI on works by those skilful writers without asking them. They are out of a job too.
Repeat for every creative industry.