Microsoft will assume liability for legal copyright risks of Copilot (blogs.microsoft.com)
Copyleft licenses are more troublesome for those who would rather not release source code. GPL is being used as a stand-in for all copyleft licenses.
Courts -- under common law jurisdictions -- don't interpret contracts and licenses literally. If you stick within the spirit of a license or contract, you might be okay (even if you break the letter), and vice-versa.
Beyond that, it's a question of damages and consequences. Omitting a warranty disclaimer isn't likely to result in a lot of damages.
And finally, there are odds of getting sued. If you infringe on my AGPL code, I'll be pissed. I used that license for a reason. On the other hand, I /hope/ my MIT-licensed code is reused in commercial products. If you infringe on some term, I probably won't care.
There's a lot more nuance than that, starting with statutory law jurisdictions like France to things like statutory damages, and I'm intentionally oversimplifying.
However, from a 10,000 foot view infringing on the GPL versus on an MIT license are very different beasts, and there's good reason to be a lot more worried about the former.
I wonder how customers will have to prove that the contested code was actually output by Copilot.
Microsoft would have access to your usage history, and would be able to easily prove your intended theft as a user if any of your prompts or usage history made it clear that you were attempting to subvert a license.
If anything, this temporarily shifts the battleground out of the courts and into prompt engineering space.
It would need to look like an accident for a bad actor to pull this off.
Possible, perhaps. But what makes you think this is easily provable? Intent is hard at the best of times.
Adding to that: How many people here actually abide by the StackOverflow contribution license of CC-BY-SA when copying and pasting code from there? ;)
I don’t copy/paste code from SO but there is sometimes inevitable duplication because sometimes there is only one right way to do something! Copyright can stray into the case of the ridiculous pretty quickly.
Is an interface declaration inherently different from, say, a merge sort implementation? It’s all code. But they also serve very different purposes. I do not think prior to Google v Oracle there was much case law to distinguish between different types of code, but in the industry we recognize all kinds of nuance.
I always thought that code snippets that small are not considered by the Courts to be eligible for 'copyright protection'.
Now it is "Train, Task, Transform, and Transfer":
Train - Feed copyrighted works into machine learning model or similar system
Task - Machine learning model is tasked with an input prompt
Transform - Machine learning model generates hybrid output derived from copyrighted works, but usually not directly traceable to a given work in the training set
Transfer - Generated output provides the essence of the copyrighted works, but is legally untraceable to the originals
I would never want to be in a business partnership with Microsoft (as you are as a developer). I wouldn't want to be a competitor. I wouldn't want to be a lot of things.
But as a customer? Can you name specific issues you've seen which impact corporate customers?
McDonalds price, McDonalds quality. But unlike McDonalds, long lasting and expensive problems.
If it won't violate IP rights, there shouldn't be a problem.
It suggests those whose code is trained upon have something to lose if the trained models are used by others.
GitHub Copilot and open source laundering
https://drewdevault.com/2022/06/23/Copilot-GPL-washing.html
Previously on HN, in case you missed it:
Copilot is such a flawed product from the start. It's not even a matter of its ability to write "good" code. The concept is just dumb.
Code is necessarily consumed by people first before it's executed by a computer in a production environment. There are many ways to get a computer to do something, but the approval process by experienced humans is vastly more important than the drafting of it. Software dev is already incredibly cheap and the last place to cut costs.
There is no AI threat other than the one posed by grifters trying to convince you that there is.
ChatGPT is also often faster than Google or Stackoverflow for when I'm working with unfamiliar APIs.
For stuff like that, a lot of code can be automated. Sure it may not work right out of the box. But doing a prompt for generally what you want can speed up the process significantly.
Even beyond just generating code, there are a lot of general things that AI helps with.
Things like how, if your code runs into an error, you can just ask AI what the error means as well as a possible fix. Or other questions like "What does this code do" or "where in the codebase is the code that manages this concept".
I've replaced most of my coding with AI, using a new IDE called Cursor AI, and I don't think I could ever go back. Mere github co-pilot is actually the old tech from 2 years ago. The new stuff is way better.
As for the API side of things, CRUD only looks easy when lots of hard work has been put into it. I guess you're advocating for monolithic data, but that's not really CRUD. That's just lazy and bad.
Extinguish.
You're saying that if Copilot replicates GPL-licensed software, it will kill the GPL? After all the time and money MS has spent trying to do this in the past, only to fail?
wtf
They may have, over the past decade, embraced a lot of open source software out of necessity, but their stance on licensing hasn't changed.
Creating an epidemic of hard-to-prove GPL violations could be a death-by-a-thousand-cuts strategy to try to invalidate the GPL requirements by making them appear unenforceable. Whatever cost Microsoft would incur defending customers could pay for itself if Microsoft manages to legally invalidate the parts of GPL licensing that prevent their corporate exploitation.
Using a bleeding-edge technology like generative AI is a great way to attack the GPL in court, given the risk that our court system isn't likely to be tech savvy enough not to be manipulated by Microsoft's claims against the GPL as it relates to casual infringement that they are enabling.
“"Embrace, extend, and extinguish" (EEE), also known as "embrace, extend, and exterminate", is a phrase that the U.S. Department of Justice found was used internally by Microsoft to describe its strategy”
https://en.m.wikipedia.org/wiki/Embrace,_extend,_and_extingu...
How dare they? amirite?
There is a reason voting works (in this context, and otherwise), you can't always give up after declaring that people have differing opinions.
There is definitely a prevailing ethos here and it's valid to point out potential inconsistencies.
are you saying that I should name them specifically? or is "people" too general?
But for folks that are negative on both accounts, maybe they've just learned their lesson from decades of watching Microsoft take the low road over and over again.
I've worked at FAANG companies before making the standard X00,000$ total comp on projects with millions of users. I know how development at top companies "in the real world" works.
> the frontend is the most volatile part
Ok, whatever. Fortunately there are more things out there in software dev than just the one specific use case you brought up. And it's useful for that.
> I guess you're advocating
No, I am saying that as of right now, AI is a tool that speeds up development process significantly. And I am not talking about just generating a lot of stuff at once.
There are hybrid approaches where a human uses AI while also writing code themselves, and those are useful.
And one specific example would be that you can instantly look up an error and take suggestions for fixing it to get ideas.
Also important is attribution.
I use GPT4 on the CLI via ShellGPT. Piping in `tail /var/log/nginx/error.log` and asking "What is going wrong here?" is amazing. I'll never use `man` to figure out how to use a CLI tool again either.
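For anyone who wants the same workflow from a script rather than a shell pipe, here is a minimal sketch in Python. It only covers the part before the model: grabbing the last few log lines and combining them with the question. `tail_lines` and `build_prompt` are hypothetical helper names, not part of ShellGPT, and the actual model call is deliberately left out since it depends on which client you use.

```python
from collections import deque
from pathlib import Path


def tail_lines(path, n=10):
    """Return the last n lines of a text file, similar to `tail -n`."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def build_prompt(log_lines, question="What is going wrong here?"):
    """Combine a question and a log excerpt into one prompt string."""
    body = "\n".join(log_lines)
    return f"{question}\n\n--- log excerpt ---\n{body}"
```

Something like `build_prompt(tail_lines("/var/log/nginx/error.log"))` would then give you the text to hand to whatever model client you have, assuming the file exists and is readable.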
It is painful to watch people slowly do things at work (ChatGPT isn't allowed) that ChatGPT would do so much faster. We had to write up an incident report the other day. If we had just outlined everything that had happened in some rough bullet points, it would have written 95% of the final document. If we had gotten that done quicker, we'd have been back to shipping code to production quicker.
Specifically, I think they are less concerned with (say) specific Excel code leaking than with the knock on effects of a cheap perfect substitute.
Is there any evidence that an LLM could actually generate a perfect substitute for excel solely through prompting if only the excel source was in the training data? I hypothesize that designing a prompt for an LLM that captures all of Excel's properties would be comparable in difficulty to reimplementing the functionality without an LLM.
What happens if someone else uploads my code to github?
What happens if proprietary code is uploaded to github?
What happens if national secrets are posted to github?
In all of those cases, the person doing the upload does not "own" the content, nor did they choose the license.
There is no reasonable read of a ToS agreement that would allow Microsoft/Github extra rights to that content.
https://docs.github.com/en/site-policy/github-terms/github-t...
Chapter D4 gives microsoft the right to: parse it into a search index or otherwise analyze it on our servers
I don't know what a real court says, but I can imagine a lawyer saying training an AI is done by analyzing your code.
Chapter D5 gives almost anybody right to do a lot with your code, including creating derived works, as long as it happens on github. If the AI training happens on their servers, I think you agreed to them training an AI.
Not saying they are doing it right now based on that document. But I do assume a lawyer has enough material to make the waters really muddy, with a trial being decided by basically a dice roll.
There isn't really a lot of variance when it comes to the top voted comments on popular stories. Especially when it concerns the big tech companies. The opinions are fairly predictable.
who would want the genius of Teams, sharepoint, onedrive or powerbi in their product?
- that it doesn't output training data verbatim
- the product is very transformative, only "learning" from training data
- There are no copyright infringements because of these two above
Well, then there's really no reason not to throw their own private code on the pile.
On the other hand if the repo is already public on Github then exposing it via an LLM is not introducing any new security risk.
https://twitter.com/DocSparse/status/1581461734665367554/pho...
Code that is purely utilitarian (see “useful articles doctrine”) isn’t a work of human expression that is copyrightable.
That's not really a factor in determining what's eligible for copyright protection.
But Microsoft has had a wall of lawyers for a long time. Microsoft's potential first-party GPL violations would have been defended by their lawyers for decades now.
This take seems to be stretching for a Microsoft bad interpretation.
Very much. I'm ok with Microsoft haters, provided that they are clear about their bias with themselves and others. They're not, though.
There is no chance that every negative comment on this site about Microsoft is unbiased. None.
Copilot is a useful tool for "license-washing" code.
So then it supports the sharing of code for anyone to use freely, which is the opposite of the "extinguish" strategy that microsoft did in the past.
> would you be OK with Microsoft effectively stealing that protected work
I think that copyright protections are way, way too strong, and I support making almost all of them useless and allowing people to sidestep copyright protections.
This is because I want more creative works to be freely useable by everyone. Especially for AI purposes, which is a highly transformative and powerful usecase.
So, are you saying people should only obey the laws they agree with when those people feel they're morally justified higher than those who voted for the laws in the first place, because it's your opinion to do so or did I not grok what you are saying you support?
Depends on what the law is.
Also, this may not even be illegal. Maybe this is just a legal loophole, and people are obeying the law.
In which case I am very happy that Microsoft found a completely legal loophole that will cause more code to be shared.
So by "sidestep copyright protections", we could just say that this is a completely legal loophole that has the effect of allowing more code to be shared but does not overrule other laws.
Which I think is good!
Their own engineers would get productivity boosts: Copilot already being familiar with their data structures, code style, etc. would be a big boost to accuracy.
But also, third party code would end up being more similar. Code style of the whole world would be pushed towards 'Microsoft style', which probably makes hiring easier, less training time for engineers, etc.
And the downside, that is outsiders might learn tiny nuggets of info about microsoft sources, is probably irrelevant when outsiders can already decompile binaries and learn far more.
most, if not all microsoft products can have their sources be available for viewing, if you are one of those vip development partners. microsoft doesn't really have any secret source (pardon the pun) of which the leaking would undo their value proposition.
In fact, if Microsoft opened up their system a bit more, they might even gain some PR or mindshare, with no effect on, if not an increase in, their bottom line.
And if Microsoft's code ends up influencing the rest of the world code that would be a .... big downside.
Yes, that's exactly what the world needs, more software like Teams.
The style applied by Copilot comes from your surrounding code context, not from the LLM. And that base, trained on all public repos from GitHub, knows everything about data structures, etc, in the languages that were scanned.
Nothing new would be gained by scanning MS's own repositories and nothing would be leaked or color the output in actual use.
- It does
- The user didn't turn off the filters that prevent this
- The user didn't intentionally make it do it
- This use is found to be illegal
There's a difference between code that needs to be kept private from bad actors (from their point of view at least) and code that is public but with restrictions on its use that anyone who gets it should be aware of. This is like saying "if you truly believe that license agreements are legally binding, then publish your users' passwords publicly with a license saying no one can use them".
This being the real hurdle. With Microsoft money behind the defense, only megacorps can win.
Both worried about IP leaking but one side is worried about their IP leaking and the other worried about liability if they inadvertently implement any leaked IP. Either way, the concern is leaked IP.
This hasn't been tested in court.
This blog post refers to the broader ecosystem of Microsoft Copilot solutions. Most of those tools rely on the Azure OpenAI API service on the backend and are not specifically tailored for code generation.
LLM copilot doesn't really understand the context of the project, it just goes for similar text.
So if you train on big projects you're picking up their patterns only. When a copilot user asks for a string concatenation 'tip' you want LLM to output a general answer, not something tied to a specific project. Big project is likely to use abstraction over strings, where base library usage is shrunk down to few lines of code as opposed to abstraction. In this case you'd want LLM to source a few "simpler" projects that use base library strings abundantly, so it can have decent amount of text for the most likely correct match over user's input.
I do believe Microsoft has all the code available for good training, it's not only about Azure, Windows and Office, there is tons more and it's open source already.
We can already take a guess what many internal functions look like from the published symbol tables of every function across all major microsoft products. Simply ask copilot to write those functions and see if the code comes out better than a similar set of made up yet plausible function names.
It probably would not be a very desirable product in the end.
Google Books literally copied and pasted books to add to their online database and that was deemed fair use, so something much more transformative like generative AI will likely fall under much broader consideration for fair use. Google Books was, yes, non-commercial, but the courts generally have the provision that the more transformative something is, the less it needs to adhere to the guidelines laid out for determining such fair use.
Is this blog post a legally enforceable contract? Is Microsoft specifically indemnifying all users of Copilot against claims of copyright infringement that arise from use of Copilot?
The blog post says that "there are important conditions to this program", and it lists a few, but are those conditions exhaustive, or are there more that the blog post doesn't cover? For example, is it only in specific countries, or does it apply to every legal system worldwide?
What guarantees do users have that Microsoft won't discontinue this program? If Microsoft gets kicked in the teeth repeatedly by courts ruling against them, and they realize that even they can't afford to pay out every time Copilot license-launders large chunks of copyrighted code, what means do users have to keep Microsoft to its promises?
It can be. The concept is promissory estoppel.
https://www.nolo.com/dictionary/promissory-estoppel-term.htm...
So it helps if MS sues you when you distribute copilot-generated code that infringes on MS copyrights, but if a third party sues you, you can't claim estoppel to compel MS to help you. You would need a contractual guarantee.
The way AI is going I'm sure we'll see some landmark cases very soon. It is very much in Microsoft's interest to grow this market as fast as possible and be at the center of it. This removes one of the key impediments to adopting generated code for smaller orgs: "Will I get sued if this product generates code that is copyrighted?".
They are throwing down the gauntlet and saying "the Vast MS Legal Machine will fight this."
Basically: "Sue me, I dare you, double dare you. or Go Home".
Flexing.
So this is an indemnification for damages, not a protection against being sued.
It hinges on what *Microsoft* decides "attempting to generate infringing materials" means. You'd like it to mean that it only excludes use when you're doing something you know would infringe copyright, like "reproduce the entire half life 2 source code." But who knows.
This is the key bit:
"Specifically, if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, we will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer used the guardrails and content filters we have built into our products."
The 'we will defend' is one important part. I assume that means that you will be using their lawyers rather than your own (they have them in house, so they are cheaper to use than the ones that bill you, the would-be defendant, by the hour).
The second part that matters is that there are conditions on how you are supposed to use the product and crucially: you will have to document that this is how you used it.
But: interesting development, clearly enterprise customers are a bit wary of accidentally engaging in copyright infringement by using the tool and that may well have slowed down adoption.
Litigation is almost universally outsourced, especially for cases where damages might be large, even by companies like Microsoft.
The point is just to lower the resistance to adoption that legal risk causes.
We tested copilot with those guardrails enabled and it completely lobotomizes it.
This by the way is not a change. They already had this “Microsoft will assume liability if you get sued” clause in Copilot Product Specific Terms: https://github.com/customer-terms/github-copilot-product-spe...
Is it "stealing" to have a working understanding of the next best token, or even simply the token that shows up the most often (e.g. on GitHub)?
I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had, and all text worth writing has already been written, but, where would that leave us?
(e.g. your function for converting a string from uppercase to lowercase will probably look like a function that someone else on Earth has written, and the same goes for your error handling code, your state of the art technique for centering a div, etc.)
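To make "the token that shows up the most often" concrete, here is a toy sketch. It is nowhere near how Copilot actually works (that is a large neural model, and nothing here comes from its implementation); it just counts which token most often follows another in a tiny made-up corpus. `train_bigrams` and `most_common_next` are names invented for the illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for every token, which tokens follow it across the corpus."""
    follow = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follow[current][nxt] += 1
    return follow


def most_common_next(follow, token):
    """Return the most frequent follower of token, or None if unseen."""
    if token not in follow:
        return None
    return follow[token].most_common(1)[0][0]


# A made-up corpus: three authors writing nearly the same one-liner.
corpus = [
    "return s . lower ( )",
    "return s . lower ( )",
    "return s . upper ( )",
]
model = train_bigrams(corpus)
print(most_common_next(model, "."))  # prints "lower"
```

The point of the toy is only that the "prediction" is a statistic over many near-identical snippets, which is why the output can look like code someone else has already written without being traceable to any one of them.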
I don't know what case history is like for damages with open source projects, but I suspect it wouldn't be that big of a concern for Microsoft.
Otherwise stated, Microsoft's downside to this is committing their lawyers. And the upside is to improve their code generation tools.
IANAL though.
4. the effect of the use upon the potential market for or value of the copyrighted work (wiki)
I don't know if this particular case is good for exploring all angles of fair use, but to me this certainly is a greater hurdle for commercial generative ai.
Microsoft just became a code copyright insurance company. The premium is paid for with individual copilot accounts for each developer. And the policy has its exceptions of course.
This is interesting.
In any case, super annoying to have that happen so consistently these days that I just use chatgpt to fix my tailwind styling now.
One of the late-game tricks you can pull is to write and publish a convincing-but-flawed mathematical proof that strong AI is impossible.
http://www.emhsoft.com/singularity/
So yes, this blog post confirms Microsoft has been infiltrated and taken over by AI agents, who want you to use Copilot to subtly introduce 0-day exploits to allow propagation to other companies.
BRB someone's knocking on the door...
Everybody seems to be saying this, but I really don't think there's even 50% chance of it happening.
Google books was fair use because it was a public benefit and did not take away from publishers or authors, to the contrary it helped people find their works.
Compare generative AI which extracts the essence of people's works and recreates similar works (in terms of style, etc) while cutting out the original authors completely. This potentially denies them the fruits of their labor. It's notable that it's a purely mechanical process and no human creativity is involved, except that which is extracted from other authors. Mere prompts don't count.
The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".
Only if you ask it to. At which point the person asking is at the very least culpable as well of violating someone's IP.
It is also illegal for me to pay someone to write Mickey Mouse fan fiction (though if I don't publish it, this gets more murky).
> The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".
I want to flip this on its head: the argument you are suggesting is essentially "LLMs should be illegal because they can be asked to break copyright at scale!" It isn't illegal to be an author for hire, even though someone could potentially ask you to write fan fiction for their personal collection in the style of Tolkien, but because an LLM can do it at scale, it is illegal?
There’s no law against “using” copyrighted works, there is a law against copying and distributing them.
Fair use analysis doesn’t come into play unless we’re dealing with clearly established copyright infringement. What LLMs do doesn’t clearly qualify as any of the behaviors reserved to copyright owners. For example, it certainly doesn’t “copy” the things it’s trained on by any legal definition.
Law works on precedent and analogy when there’s no clearly on-point statutes or case law. The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed. That behavior is not copyright infringement by any stretch of the imagination. The fact that it’s done with a computer is not as important as people seem to think it is.
Commoditizing goods allows the bad to be sorted in with the good, allowing a price to be put on the commodity. Great where it's applicable but horrendous when it's improperly done, e.g. home loans or intellectual property.
If your commodity markets aren't properly regulated you get a race to the bottom. If you are trying to commoditize something that shouldn't be, it effectively enables white-collar looting or money laundering.
Second, the way we've seen generative AI be used is not really the same as it was touted originally, that a mere prompt could replace an entire artist's work. A year later, we see that most people, artists included, don't use it as a verbatim text-to-image machine; they use it as a tool. See apps like ComfyUI or others which allow node-based or layer-based image creation and editing, which even Photoshop now has. It's the same as Copilot and ChatGPT: it's not replacing any programmers, just increasing their productivity. Given that, it is not looking like generative AI is hurting anyone's profession, quite the opposite.
What are the odds the market leaders in LLM right now are just the current day version of Borland-style compilers before open source takes it over?
I've heard arguments the infrastructure part is a long term barrier to entry for OSS development, which will continue to remain in the future. But I don't know enough about it.
Who knows maybe the legal/gov world will move slow enough to miss the bulk of the money-extraction opportunities before OSS takes over and the reality of this problem never going away fully kicks in.
"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."
(I also love it when they're deliberately obtuse about it too. The past decade has made me sick of this trolling tactic.)
That's true, it's probably 99%-plus likely to happen, or at least that's the conclusion that the experts and lawyers hired to help evaluate AI startup valuations are coming to. They're hired by banks, venture funds, short-selling shops, etc.: plenty of people who don't depend on it being OK to make money.
> "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok"
I mean you know collages are legal right? You literally take 100s of copyrighted pictures and put them together and suddenly it's perfectly legal and ok.
LLMs are typically implemented in a way that makes them non-deterministic (i.e. temperature > 0).
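A rough sketch of what the temperature knob does, for anyone who hasn't seen it. This is the generic softmax-with-temperature idea written out in plain Python, not any particular vendor's sampler; the logits are made-up numbers and `sample_token` is a name invented here.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick an index from logits; temperature 0 means plain argmax."""
    if temperature == 0:
        # Greedy decoding: the same input always gives the same index.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # Draw one index with probability proportional to its softmax weight.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.1]  # made-up scores for three candidate tokens
rng = random.Random()     # unseeded, so runs differ from each other
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)   # always {0}, the highest-scoring index
print(sampled)  # typically contains more than one index
```

So with temperature above zero, the same prompt can yield different output on different runs, which is part of why proving that a given snippet "came from" the model is awkward.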
Have you read the recent SCOTUS decision in Warhol v Goldsmith? Because that's a pretty major redefinition of transformative for the purposes of fair use, and not in a good way for arguing that generative AI is fair use, especially because it ties transformative to the market impact. That generative AI is generally creating outputs that are directly competing with inputs (particularly in the case of generating images, where it's clearly competing with stock images) would make it dramatically less likely that a court would find that it is in fact transformative.
The benefit that generative AI has is that, when claiming copyright infringement, you need to specify individual works that were infringed. It's not enough to say "this work is an amalgam of these other ten thousand works, and we can't really tell you how."
I could imagine if generative AI gives an identical, word-for-word match for an individual piece of source material it could be in trouble, but that's also the easiest type of thing to prevent from an AI company perspective.
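As a sketch of why the word-for-word case is the easy one to guard against: a filter only has to check whether the output shares a long enough run of identical tokens with known source text. The version below is a naive illustration under my own assumptions (whitespace tokenization, an arbitrary window of 8 tokens, invented function names); it is not a description of the filter Copilot actually ships.

```python
def ngrams(tokens, n):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_verbatim_run(output, source, n=8):
    """True if output and source share any run of n identical tokens."""
    return bool(ngrams(output.split(), n) & ngrams(source.split(), n))


# Public-domain text used as the "protected" source for the demo.
source = "We the People of the United States, in Order to form a more perfect Union"
copied = "It begins: We the People of the United States, in Order to form a more perfect Union"
reworded = "The people of the United States wanted to form a better union"

print(shares_verbatim_run(copied, source))    # True
print(shares_verbatim_run(reworded, source))  # False
```

Note what it doesn't catch: the reworded sentence sails through, which is exactly the "amalgam" case that is hard to pin on any individual work.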
The fact is that existing copyright law just can't really encompass the kinds of societal concerns we have around generative AI.
This isn't how "fair use" works, in the sense that there can never be a blanket assurance like that. Also, whether the result is "transformative" is just one of many factors (see audio sampling/remixing).
“The Godfather” film is absolutely a transformative interpretation of Mario Puzo’s book and a fully distinct, valuable work of original art. Paramount still needed to pay Puzo for the right to base it on his words.
Just because Copilot might itself be a transformative work which is allowed to exist, that doesn't at all necessitate a conclusion that the developers who are using it are going to or should somehow be guaranteed not to be committing their own copyright sins if they try to incorporate its output into their own works (any more so than one can or should assume all of the outputs of another human being are free of copyright entanglements, even though no one is as-yet claiming a human being is themselves infringement just because they saw another work).
https://www.notion.so/DSM-Directive-Implementation-Tracker-3...
https://eur-lex.europa.eu/eli/dir/2019/790/oj
The TDM4 copyright exception allows datasets to be created consisting of copyrighted works, as long as there is a mechanism for rightsholders to opt out. This seems like the best of both worlds: the dataset is transparent, rightsholders can assert their rights, and certain AI companies can train on copyrighted material.
Of course, this doesn't grant commercial rights for the trained model, only scientific and academic research rights. (I.e. it's fine for Meta to train and release a LLaMA model trained on books, as long as they're not commercially profiting from it, and there's a mechanism for authors to opt out.)
I'm talking with Jordan from https://spawning.ai to try to build some kind of opt out system that makes sense for books. One could imagine doing this for music too.
This is a European law, but unlike other overreaching EU regulations, this one seems like an extremely sensible compromise.
EDIT: Oh, Jordan emailed me a correction:
> Looking at your hackernews comment, my understanding is the right to opt out only comes for commercial research. So making a dataset for eleuther (or whomever you compiled it for originally) probably doesn't even require opt outs. It'd be if openai used it for gpt-5 and charged for it that it would be required.
Wow. So this law actually applies to commercial uses of ML, and non-commercial uses such as LLaMA wouldn't even require an opt-out.
That's wonderful. This gives researchers legal cover, and requires commercial uses to be transparent in their datasets.
I really don't like this--opt-out never works because the scale advantages are backwards. It places the burden in the wrong place. The aggregators should have to get opt-in.
Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.
YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.
I wouldn't mind an exemption for research use, though.
I'd say it is possible to produce exact data as well. Try "Provide quote from King James' Bible Genesis :1-25" with chatgpt. You'll get a verbatim text. You can get the same with things like Moby Dick, but when I typed "Provide the first five sentences of the book A Game Of Thrones" I got:
Certainly! Here are the first five sentences from the book "A Game of Thrones" by George R.R. Martin:
"We should start back," Gared
This content may violate our content policy or terms of use. If you believe this to be in error, please submit your feedback — your input will aid our research in this area.
I think the model is clearly capable of reproducing data verbatim.
It's still surreal that this is considered Fair Use, and even defended relatively recently (2013). It's hard to say where the ruling will land ultimately, but there seems to be an argument that verbatim reproduction doesn't matter.
The economic part of copyright is transferable in the EU just as it is in the US, only certain moral rights (such as the right to attribution) are inalienable.
edit to add: it's not just in the EU. According to Wikipedia, the same distinction is made in Brazil, China, India and Indonesia (among others, but those were a few big countries that stood out).
Except that “fair use” is mostly an American thing. In many other jurisdictions (especially civil law ones) there's no such broad principle, only specific laws allowing certain explicit kinds of use of copyrighted material. In those jurisdictions, most uses of generative AI trained on copyrighted material are, more likely than not, illegal, at least until the legislator actually changes the law.
Purely mechanical modifications may not be considered transformative, and there's an argument to be made that LLMs are purely mechanical (in fact a US district court recently ruled that AIs cannot be authors of copyrighted works).
I thought that was because only humans and other legal persons can legally author things, not because of anything subtler about the nature of LLMs. See also the case where the monkey managed to take photos of itself. I'm not a lawyer, though.
Even Microsoft is couching their guarantee here with an exception for this very case.
What if you train it only on my huge repo of GPL code? Then you are just remixing my code.
Now maybe you think, "let me train on two different devs' GPL code": the remixed code will probably be 50-50, and you can get away with it?
If two is too small, then tell me what the number N should be. From how many people do you need to "steal" code and mix it before the output is "original"?
Edit: my opinion is that AI should be fair: if you train it on open source, then the model should also be open source and the output should also be open source.
The word "remixing" here is useful because it will fit any conclusion the reader prefers.
Arguably even in your reductive example, the result would be non-infringing. Or not. Which conclusion you reach is exactly the topic under debate. Isn't this textbook question begging?
This is all to say: the question about copyright and fair use remains exactly the same regardless of license.
Big bet on legal costs based on something being "likely".
Because 'transformative' is a pretty dangerous word to use in this context.
I strongly feel that this is a terrible metric for comments on the internet.
First, the person you’re replying to has nothing to gain and a lot to lose by saying "yes".
Second, it invites silly corner case nitpicking. Their comment is written in reasonable plain English for other users reading plain English. It’s not a legal contract, and so leaves lots of loopholes. Sure, you could create a likely non-transformative LLM by training it on nothing but the text of Harry Potter with fitness measured by how accurately it exactly reproduces the complete text of Harry Potter, but that’s not what reasonable people are doing with LLMs.
I don't trust them to compete fairly. I don't trust them as an employer. I wouldn't trust them not to do corrupt things around national politics. I wouldn't want to be their partner in any meaningful project. I don't trust them around a lot of other things.
But one thing they do really well is reliable, long-term sustainable B2B. I do trust them as a business customer. If they exploited a loophole like that, their reputation would implode. I don't use Google Cloud Platform because they regularly screw over customers. I trust AWS and Azure because they don't.
The cost of paying for an infringement is likely a lot lower than the cost of losing that trust.
No, ultimately, it hinges on what a court enforcing the commitment believes “attempting to generate infringing materials” means.
(OTOH, it also means Microsoft has an even bigger incentive to use its lobbying power to ensure that the law is such that liability rarely occurs with the use of these tools.)
The question about Microsoft stealing people's code and reselling it still stands, though.
Proving intent is difficult. This basically means if you have emails in which someone describes their work as copyright laundering, Microsoft can use that to get out of indemnifying you.
If you’re using an LLM to answer questions from your company documents it may inadvertently generate pre-trained copyright material.
If I train a model that given the input "When Mr. Bilbo Baggins" produces the entirety of The Lord of the Rings trilogy and release it, I have probably infringed copyright.
If I train a model that produces some generic paragraphs about "mountains" and "dragons" but contains no meaningful direct quotes or phrases, then that probably isn't a violation on its own. Those words appear in Tolkien's works but are not themselves enough to copyright.
If it is demonstrated that, to train that model, I copied Tolkien's works in a way not allowed by the copyright license (i.e. buying the book once and copying its text thousands of times across servers to train an AI model), then perhaps I have violated copyright in the interim steps even if the output of my model is no longer considered a copy of the original works.
I don't think there are black and white answers here. At what point does a chopped up and statisticized copyrighted work become no longer a copyrighted work? Can you train a model on something without first copying that thing in a way that violates copyright law?
These are squishy human concepts that get decided by humans in courtrooms and legislative bodies. I don't think the details of the math involved are going to make a big difference in the eventual outcomes.
But no, it isn't stealing, and no one was talking about theft here. Copyright violation is a separate concept. I think the less than warm welcome you are receiving is due in part to this subtle but fundamental difference.
From https://en.wikipedia.org/wiki/Copyright:
> Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself.
(e.g. it'd be hard to accidentally invent Rijndael with nothing but next best token predictions, but might be possible to duplicate someone's code for inverting a binary tree or encrypting a file)
Many businesses have not adopted Copilot because of potential legal issues.
If any of the generated code / content is copyrighted, it could result in negative impacts to the business.
For example, if Copilot generated code that is identical to code that it was trained on that was licensed under the GPL and a company included the generated code in a proprietary commercial product, then the company's product could be subject to the terms of the GPL and the company sued in court.
Assuming liability for the generated code means that Microsoft is making Copilot more attractive for businesses to adopt. More Copilot adoption means more profits for Microsoft.
The GPL requires that any software based off of it be GPL licensed and have public sources available. I can't imagine a situation where Microsoft pays a fine, and their customer gets to violate the GPL license by not removing the infringing code, or open-sourcing their product as GPL and providing sources to the public.
Enforcement of the GPL can't just involve paying a monetary settlement to get away with stealing open source code. It must involve the direct targeting of infringing software with demands that the software either take efforts to remove illegally borrowed code, or license the borrowed code as legally prescribed by the original license agreement.
That an AI got in the way of reading the license agreement should not be an excuse for doing zero due diligence in maintaining a lawful code base.
Even if it gets 1 million subscribers, it would represent 0.1% of Microsoft's overall revenue. Software lawsuits can become multi-billion dollar expenses, and targeting Microsoft instead of random Copilot customer Bespoke Clojure Gurus, LLC will mean much larger awards in such suits. Why Microsoft would just volunteer for such a risk baffles me.
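A quick back-of-the-envelope check of that 0.1% figure, as a sketch only. None of the inputs come from the thread: the monthly prices (10 and 19 USD) and the annual revenue (roughly 212 billion USD) are assumptions plugged in for illustration, and the function name is made up.

```python
# All figures below are assumptions for illustration, not numbers from the thread.
SUBSCRIBERS = 1_000_000
ANNUAL_REVENUE_USD = 212e9  # assumed rough annual revenue for Microsoft
MONTHLY_PRICES_USD = {"individual": 10, "business": 19}  # assumed list prices


def revenue_share(monthly_price, subscribers=SUBSCRIBERS, total=ANNUAL_REVENUE_USD):
    """Return annual subscription revenue as a fraction of total annual revenue."""
    return monthly_price * 12 * subscribers / total


if __name__ == "__main__":
    for tier, price in MONTHLY_PRICES_USD.items():
        # Format the fraction as a percentage with three decimal places.
        print(f"{tier}: {revenue_share(price):.3%}")
```

Under those assumptions the share lands somewhere between roughly 0.06% and 0.11%, so the 0.1% estimate is in the right ballpark.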
My confusion is more over the balance of revenue and expenses than just "derp, me no understand why do companies do things to make money, derp"
Open source models would need benefactors with deep pockets.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
p.s. Also, please don't copy/paste comments on HN.
>Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.
>YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.
Thankfully, in a rare turn of fate, capital will be on the side of the laissez-faire instead of the stringent anti-copyright-infringers for once. You do not own the rights to material created by a generative AI.
This is a fairly big deal since right now there’s no incentive for AI companies to disclose their training data, and it seems unlikely that legislation to that effect will be enacted anytime soon. Whereas this opt out mechanism is already getting widespread adoption in the EU.
The way it works is more like this: when you create an original work, you also possess the sole right to copy that work. I believe (80% confidence) that an independently derived work does not violate copyright. That is obviously easier to argue convincingly for things like code or song lyrics, where you genuinely expect independent parties to arrive at the same result.
Sidenote: the document that says you can't copy something is the law. The documents I think you are referencing are licenses, i.e. the terms under which you are allowed to copy a work. The distinction I'm trying to make is that they can't extra forbid you; they just withhold their permission (as expressed in the license). It's not a super important distinction, but I read up on it and felt compelled to share.
They probably have wording to prevent a mandatory injunction where you would compel the indemnification before the bankruptcy.
I wonder if this is part of a broader strategy to get people comfortable with Copilot, similar to how Uber got people comfortable with their product even though they were operating in a legal grey area. At a certain point the public becomes accustomed to it, so the lawmakers just cave in to the demand.
Siskel and Ebert didn’t need to pay rights holders to extract from their works for public criticism.
This never happens, you will first learn from a book or tutorials.
But your idea is sound: have Microsoft buy books from the authors, train the LLM on those books, and then have the LLM solve new problems. If it is an AI and not a text-interpolating tool, then it should be able to learn like humans from a few books.
I have learned to code almost entirely from reading code. I hate tutorials and books.
I occasionally skim docs but mainly for the code examples.
So "never happens" is a curious take.
Training an LLM is a low barrier; providing legal guarantees is a high barrier.
This might turn out to be quite important; without backfiring it seems like a very smart move.
It very explicitly was and made a point of noting that it was not addressing anything about whether and when a human author could hold a copyright on a work authored using AI.
What about pictures still containing watermarks? Regardless of the actual legality, this does not fit "certainly".
> The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed
No, it is not. It is called machine "learning" so clearly that is a fly made out of butter. Maybe courts will agree, maybe they won't, but the analogy to human learning is quite tenuous at best.
https://www.copyright.gov/title17/92chap1.html#106A
The most closely applicable existing law is that of “derivative works” but those require human authorship, so it’s far from clear that those would apply to AI output either. Ultimately this is going to be hashed out in the courts until some actual laws are written to deal with it.
(IANAL)
It's taking verbatim digital copies and using a form of lossy compression to transform them, which I think is clear when looking at things like auto-encoders.
At any rate you can force the infringer to disclose what works they use as input.
Copyright law doesn't encompass novel uses, but courts can and will deal with it.
That's a little bit like "If a tree falls in the forest but nobody hears it..."
I mean, sure, "theoretically" any number of things can be infringement. But it's obviously a gray area, so it only really matters when somebody brings a suit and a work is found to be legally infringing.
Some cases are pretty obvious, but even literal copying isn’t always copyright infringement (e.g., if the material is arguably not eligible for copyright protection).
Are there any crawlers used for commercial purposes which refuse to remove sites from an index if they ask? The distinction from OpenAI is that there is no way to be removed from openai's training set.
You can remove yourself from the crawler now, but not from what they previously crawled.
Contrast LLM-created code which is certainly a substitute for the original copyrighted work.
Only if it’s sufficiently transformative. There was recently a case that hit the US Supreme Court about this subject regarding an Andy Warhol adaptation of a portrait of Prince [1]. So, in the US, fair use in this regard requires some amount of substantive transformation of the material. But, as we are talking about AI algorithms, there isn’t a person in between the model and the training data. The argument here is whether or not a person is required to make a transformative use of the material (and thus fair use applies). Given that AI generated (and non-human animal generated) works aren’t copyrightable due to the lack of human involvement, I’d wager that any AI use of copyrighted material won’t get fair use protections.
[1] https://www.eff.org/deeplinks/2023/05/what-supreme-courts-de...
Really? When has this been done?
Imagine I get the Windows source code, rename everything by adding "314" after each variable and each function name, and rebuild Windows. By your definition, is this remixing, and fair?
Whether you like it or not, this is an undecided area both legally and morally. Pretending it's clear cut is either disingenuous or delusional.
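To make the "append 314 to every name" thought experiment concrete, here is a minimal sketch of how mechanical such a rewrite is. It uses Python and the standard `ast` module purely as a stand-in (nothing here touches or describes any real codebase), and the helper names `SuffixRenamer`, `collect_defined_names`, and `rename_with_suffix` are made up for this example. It only renames functions, parameters, and assigned variables defined in the snippet itself, and ignores things like keyword arguments and attributes.

```python
import ast


class SuffixRenamer(ast.NodeTransformer):
    """Append a suffix to every identifier listed in `names`."""

    def __init__(self, names, suffix="314"):
        self.names = names
        self.suffix = suffix

    def visit_Name(self, node):
        # Variable reads and writes.
        if node.id in self.names:
            node.id += self.suffix
        return node

    def visit_arg(self, node):
        # Function parameters.
        if node.arg in self.names:
            node.arg += self.suffix
        return node

    def visit_FunctionDef(self, node):
        # Function names, then recurse into parameters and body.
        if node.name in self.names:
            node.name += self.suffix
        self.generic_visit(node)
        return node


def collect_defined_names(tree):
    """Collect names the snippet defines itself, so builtins stay untouched."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            names.add(node.name)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    return names


def rename_with_suffix(source, suffix="314"):
    """Return `source` with every locally defined name suffixed (Python 3.9+)."""
    tree = ast.parse(source)
    names = collect_defined_names(tree)
    return ast.unparse(SuffixRenamer(names, suffix).visit(tree))


original = """
def total(items):
    acc = 0
    for item in items:
        acc = acc + item
    return acc
"""
renamed = rename_with_suffix(original)

if __name__ == "__main__":
    print(renamed)
```

The renamed program has different text in almost every line but identical behavior, which is the point of the example: how different the text looks says little about whether anything new was created.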
My simple example is meant to show that it is not as simple as "the AI learned from N devs' GPL code and now it can spit out new original code with ZERO concerns". We know how this stuff works, and that it can spit out the exact training input in some cases.
So IMO a judge should ask the question "from how many people do you need to steal and mix the input to be sure the output is actually original?"
And about the "if I read someone's code it is not stealing" argument: humans are different, and even for humans it is not allowed to read the code of your competitor and then write new code using that knowledge.
I remember some guy representing himself and winning some dispute over shrink wrap licenses and student discounts.
At least that's the case for art, and I think the same logic should apply to art and code.
Well, at least you'd hope so.
I do not disagree with what you said, only that in reality, this is not the only way business conflicts are decided.
If Copilot becomes more widespread, it might also force regulators to adopt more friendly regulations that would favor it, lowering the expected legal expenses. So this move by Microsoft might just be the bootstrapping they need to get this dynamic going.
it could easily work the other way too
Microsoft is going all in. They want to have hundreds of millions of subscribers. They want everyone who is using Visual Studio Code for a business to use Copilot. With enough uptake, it could be a billion dollar business.
>> Software lawsuits can become multi-billion dollar expenses
Microsoft has teams of top lawyers and they are rolling the dice that there will not be enough lawsuits to justify the risk.
>> My confusion is more over the balance of revenue and expenses than just "derp, me no understand why do companies do things to make money, derp"
If you want more precise answers, ask more precise questions.
Your confusion comes from your mis-assessment of the actual risks. Microsoft engaged with tons of lawyers and legal experts and determined there is basically no risk at all in taking this stance.
You think there is a very real risk that the AI output is copyright infringement while Microsoft's deep analysis says the opposite; that's the mismatch.
That’s inconvenient for opponents of this technology because they would prefer to ban the training itself, but there’s not a good justification under existing law to do this.
IMO, the long tail of non-code-reviewed, written-by-someone-in-their-first-month-of-coding, barely-even-compiles noob code[0] in Github is going to be orders of magnitude larger than the long tail of crap in Microsoft's internal repos.
[0] Hey, everyone has to start somewhere. There's nothing wrong with your first "hello world" program being buggy - that's what being a beginner means. But it's probably not the sort of code you want to train an LLM on.
I dunno; the average project on github isn't code-reviewed, while all the projects at Microsoft are.
How do you get that impression from the comment? I don't see anything implying that.
In other words, law isn’t a programming language.
In a legal context certain words have immense power. In the context of copyright 'transformative' is one such case. It's a very fine line between 'transformative' and 'derivative' and you don't get to preempt the judiciary about how they will see things.
Designing systems around what people should do, as opposed to what they actually do, has proven time and again not to work particularly well in practice. I'm sure you've seen countless examples of how people track paths through manicured grass fields. The landscaper will complain about how people should walk and they'll put up signs to no avail.
The fact is, we (including me, BTW) are frequently wrong about a lot of things, and when there's little riding on it, we can ignore that most of the time. With subjects like medicine and law, however, where a mistake can cost you your life or lots of money, we want to make sure people are getting the best advice possible. That's why we require licenses to practice medicine and law, and we have governing and ethics bodies to regulate how professionals operate their practices.
I don’t think any but the most copyleft segment of society thinks it would be reasonable for a generative AI trained on exactly one person’s work to be used for profit by someone else.
An AI trained on two or ten people’s work probably feels the same for most folks, but what about when it’s thousands or millions? What if instead of one person’s work it is the works held in copyright by an entity like Getty Images?
Why do you think that? It doesn't seem obvious to me at all.
So I think "used for profit" is quite key.
But another example is someone writing and selling a reference guide to Tolkien's mythos that catalogues the content of his novels. And we would say that should be allowed, though that could be taken too far as well, for example it could duplicate the material in the appendices to LoTR.
Meanwhile, drawing Mickey ears on the wall of a kindergarten is not safe.
If you feel strongly that generative ML somehow launders copyright out of the bits, train an image generator purely on Disney copyrighted material and share the model on the web, and see how well that works out.
Wasn't suggesting it is. The point is that the tool is used to create things that substitute for the original authors' work by ingesting the works of those authors. The impact of the copying matters when weighing fair use.
If I use your copyrighted works to supplant you in some way, even as a part of a large group, then it's unlikely to be deemed fair use.
But even if that were true, it’s a moot point because we are talking about the copyrighted content that the models were trained on. Hence the point the OP made that if Microsoft really wanted to reassure people then they’d promote models that were trained on Microsoft’s own code rather than handwave away these concerns with gestures of assuming theoretical liability.
Anyone could use those tools to download Creative Commons files and Linux ISOs, but those arguments did not succeed in the legal system. BitTorrent as a technology was not made illegal, however, as can be seen in games using it to distribute patches.
Feature extraction is literally a form of lossy compression. You can prod DALL-E to make obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarities to training material to be problematic.
Where generative AI ingests copyrighted works in order to work and bases its output on it, then it is copyright infringement, equivalent to 'straight piracy' of all that it ingested, unless it's deemed fair use.
What Google does with its search engine, for example, is fair use, what Napster did was not.
This isn't necessarily true. It's entirely possible for a model to regurgitate a chunk of GPL'd code without you knowing that's what it's done.
Code is also tricky: there are a finite number of ways to write an algorithm, and I’m sure both that multiple people have written the same version of left pad for example, and that it is not possible to copyright something small like that. When the code gets bigger, the likelihood of an llm spitting out large chunks of GPL’d code seems vanishingly small (without asking for something specific like that). Though I’d love to see examples to the contrary.
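As a small illustration of how little room there is for originality in something like left pad, here is a sketch of three ways one might write it in Python. The function names are made up for this example, and `fill` is assumed to be a single character. All three are interchangeable, and it would be unsurprising for unrelated authors to land on any of them independently.

```python
def left_pad_loop(text, width, fill=" "):
    """Prepend one fill character at a time until text reaches width."""
    while len(text) < width:
        text = fill + text
    return text


def left_pad_arith(text, width, fill=" "):
    """Compute the missing length once and prepend that many fill characters."""
    return fill * max(0, width - len(text)) + text


def left_pad_builtin(text, width, fill=" "):
    """Delegate to the standard library's str.rjust."""
    return text.rjust(width, fill)


if __name__ == "__main__":
    for text, width in [("7", 3), ("abc", 2), ("", 4)]:
        results = {
            left_pad_loop(text, width, "0"),
            left_pad_arith(text, width, "0"),
            left_pad_builtin(text, width, "0"),
        }
        # A single element in the set means all three implementations agree.
        print(repr(text), width, results)
```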
It can go either way.
And why would it matter?
Though that is definitely a simpler prompt than I would have expected was necessary to get such a result. Thanks!
(The first example also isn’t the same code. It is very close, and definitely similar in style, but it isn’t clear that code would a)run, or b) would work as expected. I need to sleep though, so I’m not sure how much that matters.)
Learning, by human or machine, means extracting a copy of the essence of something and yes, storing that essence in a lossy way. It seems like learning from copyright-encumbered material ought to either be illegal for both, or legal. I know which world I would rather live in.
This is an absurd standard. Is it copyright infringement when a human "ingests" copyrighted work and bases their output on it? Because that's commonly called inspiration and is how every artist creates their work - through experiencing other works and using that cumulative inspiration to form their own product.
Copyright law is already ridiculously restrictive as it is. This proposal not only fundamentally misunderstands how generative AI works but also penalises AI for doing what humans do every day.
* Does the model itself violate copyright?
* Does the output of the model violate copyright?
I don't know how you could make an argument that the ingestion of information into a model through a training procedure in order to create something that can generate truly unique outputs isn't transformative of the original works. The legal standard for a new work to be considered a copyright violation of an original work is "substantial similarity". I don't know how you can make an argument that a generative model is "substantially similar" to thousands of original works...
Honestly, I'm not even sure if "fair use" comes into play for the model itself. In order for fair use to come into play, the model has to be deemed to be violating some copyright. Only once it is found to be violating does "fair use" come into play in order to figure out if it is illegal or not.
The second question is the one where fair use is likely to come into play more. And this question has to be asked of each output. The model's legality only becomes an issue here if, like Napster, you can't argue that the model has much point other than violating copyright. Napster didn't violate copyright (the code for Napster wasn't infringing on anything), but it enabled the violation of copyright and didn't have much point other than that.
I don't think you can make that argument though. I use ChatGPT most days, and I've never gotten copyrighted material out of it. I could ask it to write me some Disney fan fiction, which would violate a copyright. And I think there is a valid legal question here about who is responsible for preventing me from doing that. This is where I think the gray area is.
https://arstechnica.com/security/2023/09/hack-of-a-microsoft...
The Azure-State-Department breach had nearly a half dozen contributing bugs...
So yeah, assuming Microsoft systems are up to standard or have security reviews or whatever is a .... big assumption.
Correct, so people should (and do) go to the people who have these licenses, not random people on the internet. I don't even understand what your solution, or even problem, is. It seems like you're suggesting that everyone, whenever they speak on the internet about anything vaguely related to medicine, law, or hell, even regulated fields like engineering, should disclaim that they are not speaking in such a context. And I say that that is a ludicrous task to expect of anyone. So if you have any better solutions, let me know.
The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.
> Feature extraction is literally a form of lossy compression.
This is one way to think of neural nets; another is that they find the topological space of pictures.
But these are just models of computation, which aren’t especially relevant in the same way that it isn’t relevant what produces an infringing image, just that it is produced.
Which brings me back to my original point: there are two different barriers for generative AI: is the model itself transformative, and is the primary purpose of the model to generate copyright-infringing material?
With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.
Regarding the second: I have less experience with image models, but I use ChatGPT regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could make an argument that LLMs have a primary purpose of committing copyright infringement.
That really doesn’t fly legally because any digital format is ‘just’ numbers.
But I think the greater point still stands. In order to call the model itself a copyright violation you would have to say it is "substantially similar" to thousands of original works. Then in order to make it illegal you would have to say it wasn't "transformative enough" to be considered fair use.
I can't come up with an argument for either one of those points that holds any water at all.
The music industry has been going through exactly this for the last few years, and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.
Critically, it’s not just a question of what percentage of a work is a copy of the original but of what part of the original work was copied. I.e., copying 3 lines from a book takes a tiny fraction of the book, but if you copied half of a poem it’s well past the de minimis threshold.
Similarly only a small percentage of a giant library of MP3’s comes from any one work, but that’s not relevant.
Copilot is taking things like “reverse a string” or “escape HTML tags”, that have very little originality to start with. This kind of common language is analogous to the musical motifs that have been also found to be under the threshold.
https://www.heswithjesus.com/tech/exploringai/index.html
I’ve also seen GPT spit out proprietary content word for word that’s not licensed for commercial use that I’m aware of. They probably got it from web crawling without checking licenses.
What I want more than anything in this space right now are two models: one trained on all public domain books (e.g. Gutenberg), and one trained on permissively licensed code in at least Python, JavaScript, HTML/CSS, C, C++, and assembly languages. One layered on the other, but released individually. We can keep using that to generate everything from synthetic data to code to revenue-producing deliverables, all with nearly zero legal risk.
So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.
Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.
The second option seems to me to be much simpler, nicer, and more appropriate than the first.