An open source lawyer’s view on the copilot class action lawsuit (katedowninglaw.com)
I wonder if a court would think that Microsoft in this case has done its due diligence to verify that the license grants it got from users are correct and in order.
e.g. 4. [..] You grant us [..] the right to [..] parse, and display Your Content [..] as necessary to provide the Service, This license includes [...] show it to [...] other users; parse it into a search index or otherwise analyze it
As the Service now includes copilot, publishing anything on Github seems to give them the right to use it in copilot. Maybe even for private repos
Besides the issue we're currently discussing, I also wonder about:
5. [..] you grant each User of GitHub a [..] license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
So if you find GPLed content on github, you might be allowed to violate the GPL as long as it happens only on github. I don't know how bad this is in practice. Their CI presumably allows you to run code for other people without granting them the rights the GPL should give them, but that might be a violation of the Github TOS as this might be abuse of the CI servers.
This might also mean you violate the GPL when publishing someone else's GPLed code on github, as you now granted Microsoft and others rights not included in the GPL.
Clearly, IANAL, and I don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.
https://docs.github.com/en/site-policy/github-terms/github-t...
Yes. This was one of the legal theories behind why Apple refused to allow GPL software in the Mac App Store. The TOS that Apple required from developers gives Apple specific rights which the GPL does not grant, and thus any software that gets uploaded must be assumed to be provided under two separate licenses. Given that many free and open source projects have multiple authors, it is a rather large assumption that the person who uploads the software has the complete authority to provide the software under multiple conflicting licenses.
It is after all the distributor that has to do the due diligence to confirm that they are in the right to distribute.
That doesn't sound right. Licences can allow sublicensing, and I think all the popular open-source ones do.
There are also additional problems specific to sublicenses. In the United States, only exclusive licensees are assumed by statute to have a right to sublicense. The theory is that exclusive licensees are assumed to have control/authority similar to that of the author. Nonexclusive licensees are not assumed to be granted such a monopoly by the licensor.
In practice, I think the entire open source world knows that people post each other's open source code on GitHub. Even projects that have very purposefully chosen to primarily use other services or self-host their source code are well aware that their code gets mirrored on GitHub and/or included in other people's repos on GitHub. Up until now, I don't think this has been controversial and I don't think GitHub gets a lot of takedown requests for this practice. I think most developers see this as a feature, not a bug. Copilot might make people rethink whether or not they want to start sending take-down requests but that'll be a tough call for a lot of people because withholding code from GitHub to avoid its usage in Copilot also effectively means making their code less easily available to the rest of the world. It may be very disruptive to other projects that include the copyright owner's code in their own projects.
If my code was uploaded on GitHub, I would DMCA it because of Copilot, but it wouldn't matter because the information is already in the model. So the DMCA does not help here.
The only way it would help is if I could DMCA the entire model and force them to retrain without my code. As it stands, this lawsuit is the only way for GitHub to be reined in; I don't have the resources to do so on my own.
IANAL.
Also, about high impact, suppose Copilot has 1 million users that use it on average 10 times a day, 5 days a week. You claim that less than 1% of uses of Copilot would result in copyright violation. Let's assume 0.1%. How many times would copyright violation happen per day? It would happen 10,000 times per day. For five days a week.
It would take a mere twenty weeks (less than six months) to reach a million violations.
That seems impactful.
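A quick back-of-the-envelope check of the numbers above, as a minimal Python sketch. The inputs are the hypothetical figures from the comment (1 million users, 10 uses a day, 0.1% of uses, 5 days a week), not measured data, and the function and parameter names are purely illustrative:

```python
def copilot_violation_estimate(
    users=1_000_000,                 # hypothetical user count from the comment
    uses_per_user_per_day=10,        # hypothetical usage from the comment
    violations_per_thousand_uses=1,  # the assumed 0.1% rate
    active_days_per_week=5,
    target=1_000_000,
):
    """Return (violations per day, violations per week, weeks to reach target)."""
    uses_per_day = users * uses_per_user_per_day
    # Integer arithmetic avoids floating-point noise in the rate.
    violations_per_day = uses_per_day * violations_per_thousand_uses // 1000
    violations_per_week = violations_per_day * active_days_per_week
    weeks_to_target = target / violations_per_week
    return violations_per_day, violations_per_week, weeks_to_target


per_day, per_week, weeks = copilot_violation_estimate()
print(per_day, per_week, weeks)  # 10000 50000 20.0
```

Changing any of the assumed inputs scales the result linearly, so the conclusion depends almost entirely on the guessed rate.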
What if I build an AGPL licenced service, using GitHub to coordinate development? According to the ToS, MS could offer a version of my service because I posted the code on GitHub, and they are using it to improve their service to me. According to my AGPL licence, they would need to share their source.
So which takes precedence: the licence or the ToS?
Over many years it has now mostly become a tool for large companies to accumulate rights (on works they didn't create themselves) and monetize them.
Maybe a reform is needed, to find a way back to the original purpose.
No, it wasn’t. Copyright was originally intended to protect the publishers of a work. It was later transformed to nominally focus on the creators, but even this was lobbied for by publishers in their own self-interest after the old law directly protecting them was allowed to lapse, and because it still had the same net effect since realizing value meant licensing to a publisher in most practical cases, so the publishers were still major beneficiaries.
And, of course, US copyrights under the Constitution do not exist for the purpose of protecting creators. A private benefit for creators is the mechanism, but the purpose is expressly to “promote the Progress of Science and useful Arts”.
> [Github's Terms of Service] specifically identifies “GitHub” to include all of its affiliates (like Microsoft) and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates.
"It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions."
i'm not paying for copilot right now because i'm waiting for this to shake out. but i'd be happy to pay (even their current asking price) if i knew the model was also open source and could be self hosted.
maybe this is the wrong way to ask the question, but hopefully it makes sense
For example,
> That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.
The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.
If a license specifically prohibits copying the code for a purpose, then it is a violation of the copyright to copy the code for that purpose. You have no other legal way to do it.
These aren't magic words, they are legal obligations. Ok, well maybe legal obligations are magic words. But it is magic that works :). Otherwise things like GPL could not function.
(Section 6 here: https://opensource.org/osd)
Let's say I added a clause to my BSD license that prohibits the copying of this code to train ML models.
Would that not immediately make GitHub in violation of this license?
Or do they only train it where the license is explicitly one of the ones it knows about?
https://medium.com/@6StringMerc/artificial-intelligence-mach...
This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.
Legal opinion on Copilot and generative AI in general hinges entirely on metaphors. If the AI is understood to behave like a human being building knowledge and drawing from it for inspiration, Copilot is just another way to write code. But we've already established legal precedent that machines can not hold copyright, which suggests that they can not be deemed to be creative, which could be used to argue that they are therefore just creating an inventory of copyright works and creating mechanical mashups.
The author's dismissal also ignores that this would not JUST result in attribution. If Copilot indexed copyleft code and were required to provide attribution when using this code, the output might also be affected and this could in turn affect the entire code base. Worse yet, Copilot may output code with conflicting licenses. The author considers only the possibility that Copilot itself might have to inherit the license (and the dismissal that it would "help noone" because it runs on a server ignores both the existence of a (presumably self-hosted) enterprise service and the existence of licenses like AGPL, which would still apply) but it seems most people's concerns are with the output instead.
I also fail to understand how the argument that it doesn't reproduce the code exactly 99% of the time is helpful. If I copy code and rename the variables and run an autoformatter on it, it's still a copy of the code. It's odd to see a lawyer use what is essentially obfuscation as a defense against copyright claims. Also 1% is an incredibly large number given how Copilot is supposed to be used and how large the potential customer base is. Given the direction GitHub is heading with "Hello GitHub" (demoed at GitHub Universe yesterday) it's not unlikely that Copilot would in some cases be used to generate hundreds, thousands or tens of thousands of lines of code in a single project.
The question isn't just whether Copilot is violating the law or not, the question is why it is or isn't because that could have wide implications outside GitHub itself. But as the author points out, sadly the lawsuit doesn't try to settle this for copyright, which might be the most impactful question.
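The point above about renaming variables and running an autoformatter can be illustrated with a minimal sketch. It uses Python's standard `tokenize` module to strip comments and layout and to replace every identifier with a positional placeholder, so two snippets that differ only in names and formatting normalize to the same token list. The helper name `normalize` and both sample snippets are made up for illustration, and this says nothing about where any legal threshold lies:

```python
import io
import keyword
import tokenize


def normalize(source: str) -> list:
    """Token list with identifiers replaced by placeholders, layout and comments dropped."""
    placeholders = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines carry no structure
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            out.append(tokenize.tok_name[tok.type])  # keep block structure, not exact whitespace
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Same identifier always maps to the same placeholder, in order of first use.
            out.append(placeholders.setdefault(tok.string, f"id{len(placeholders)}"))
        else:
            out.append(tok.string)
    return out


original = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
renamed = (
    "def sum_all(xs):\n"
    "  acc = 0   # running sum\n"
    "  for x in xs:\n"
    "    acc += x\n"
    "  return acc\n"
)
unrelated = "def total(values):\n    return sum(values)\n"

print(normalize(original) == normalize(renamed))    # True
print(normalize(original) == normalize(unrelated))  # False
```

A check like this only catches the most mechanical kind of obfuscation; reordering statements or restructuring loops would defeat it.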
Back when I was young, graph pathfinding algorithms were called AI. A few decades later they are a well understood commodity and I haven't seen anyone call them AI for a while. Maybe that'll happen to LLMs too, given a few years?
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
https://news.ycombinator.com/newsguidelines.html
See Marvin Minsky’s comment regarding “suitcase words”.
If I ask Stable Diffusion to create a picture of Elon Musk wielding lightnings and riding a giant blue sparrow over a desert during a storm, the result would be more creative than what could be produced by most humans. I believe that counts as a proof.
And that is mostly what Copilot's AI does.
It doesn't "understand the concepts and reproduce something alike" in the sense a human does. It might understand some concepts here and there, but it also does a lot of heavy lifting by verbatim "remembering" (i.e. copy-pasting) code.
This is also why some people argue that the cases for Copilot and some of the image generation networks are different, as some of the image generation networks get much closer to "understanding and reproducing a style". (Though potentially just because it is much easier to blend copy-pasted snippets in images to the point that they're unrecognizable.)
One of the main problems GitHub has IMHO is that anyone who has studied such generative methods knows that:
1) they are prone to copy-pasting
2) you don't know what they remembered (i.e. stored copies of in an obscure, human-unreadable encoding, so just distributing such a network can be copyright infringement)
3) you don't know when they copy-paste
4) the copy pasted code often is a bit obscured, ironically (and coincidentally) often comparable with how someone who knowingly commits copyright theft would obscure the code to avoid automated detection
Which means GitHub knowingly accepted and continued with tricking its Copilot users into committing copyright infringement, under the assumption that such infringement is usually obscured enough to evade automatic detection.
There is no equal sign between a person and a program.
There is also that thing called "scale" that is critical to the interpretation of the action.
Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...
This argument is hardly less flawed than the one you are criticizing. And your statement that 'there is no equal sign ...' is also unconvincing, as we're not equating these two, but the process of learning, which is quite similar.
The question is whether a person is doing the same as Copilot for this particular case, i.e. reading source code to learn.
You have not really given any argument why this is not the case. Or maybe your reference to scale? So only because Copilot has read more code than a human possibly could, that makes it different? But why exactly is reading a bit of code fine w.r.t. copyright, but reading more code suddenly violates copyright?
Note that the reason why Copilot needs more code to learn is just because the learning currently is not as efficient as for humans.
Even novelists do not sit all day long in a closed room reading other people's work and then do a collage of what they've read. Otherwise no books would have been written in the first place.
Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.
Once (if ever?) an AI is capable of producing an actual original work, I'm fine with other AIs stealing from the first one. Please leave humans alone.
That "experiment" could just as well be done on humans, though, cut them off of any work that any human has done before and you may get simple cave paintings, if you're lucky.
That's correct, but it misses the point.
This is about reconciling 1. being allowed but 2. not being allowed:
1. The human uses a machine, where the machine is an organic one it grew itself.
2. The human uses a machine, where the machine is one it made or acquired.
To a lot of us, there's no difference.
... yet.
Please study the series of events that unfolded in the music industry after folks began incorporating recordings made by other artists in their own work and proceeded to sell the result.
Spoiler: The deeply nuanced question of feeding a mechanical recording through a series of complex physical and mathematical apparatus and whether that constituted a transformational creative act did not come up during the proceedings or final judgements!
Is CoPilot just trained on OSS, or on private repos too?
Am I violating copyright? Yes
Imagine they change the character names in those paragraphs. Am I still violating copyright? Yes
At some point you can change enough of the text to not violate copyright. The grey area involves the courts.
It feels very simple to me so I might be missing something.
> It feels very simple to me so I might be missing something.
In my opinion, you are missing something subtle:
In continental Europe, there is a different law tradition - civil law (https://en.wikipedia.org/wiki/Civil_law_(legal_system) ) - that is different from the Anglo-American common law tradition. To quote from the wikipedia article:
"The civil law system is often contrasted with the common law system, which originated in medieval England, whose intellectual framework historically came from uncodified judge-made case law, and gives precedential authority to prior court decisions. [...] Conceptually, civil law proceeds from abstractions, formulates general principles, and distinguishes substantive rules from procedural rules. It holds case law secondary and subordinate to statutory law."
So if you are attached to the civil law system, you seriously want to avoid this grey area involving the courts (which is much more accepted in common law) and instead want to codify into laws what you mean by this grey area.
The simple layman's version of copyright is that copyright applies to a specific form of a thing and not about the ideas behind that thing.
So, no, George Lucas was not infringing anything. Nor is hip hop music making use of samples infringing anything. Or Andy Warhol integrating photos into his works. Nor is it illegal to paraphrase or refer other authors. And as Oracle found out by challenging it in court, trying to claim ownership over APIs to prevent third party implementations is also not going to work.
All of that falls under fair use. Fair use is what makes copyright useful. Without it you'd have to live in fear that legal copyright holders might come after you if you apply the ideas that you might have been exposed to via their copyrighted work. Fair use exists such that you can make use of information provided to you via a copyrighted work.
It's an interesting test of open source licensing because I'm not aware of any other area of copyright where works come with an explicit "if you use this somewhere else you must credit me as the initial author" in the implied/provided license.
Comparing music, literature, etc. to code is difficult because of both this difference and the existence of software patents. The manner in which infringement happens (and the scale) is often different as well.
I don't want Github or any other megacorp-backed entity abusing the open source community in the way micro$oft is here, it's as simple as that. If they wish to train it on entirely proprietary Microsoft code, then by all means go nuts, but to take the work of open source projects and to hide behind the pretense of the mathematical model behind the A"I" learning something is simply ridiculous to me.
I find it quite curious that they're not doing that (training it on their own codebase). Perhaps they're afraid of their little intelligence spitting out proprietary code verbatim like it's been shown to do many times with licensed open source code.
Next hypothetical.
These can be quite inventive works; nevertheless, no-one seriously argues that the video content does not breach the original animators' copyright.
The video content of an amv is a much better analogy for what copilot does to third parties' code than anything else I've seen in this post's discussion so far.
The most they could do is transfer any liability back to you for posting it in breach of some term in their ToS. But that would be absurd since posting someone else's code, licensed under a common (eg. OSI-approved) license, is an established and normal use case for GitHub. If their ToS really did ban the posting of some AGPL code, they really ought to have pointed it out, and of course it'd render GitHub useless for hosting AGPL code.
This would only apply when posting someone else's code. But of course you could always arrange that.
This is very similar to what happens when you sign a contributor agreement before contributing code to an open source project. When you sign the contributor agreement, you're granting a very broad license to your work to the project maintainers. They can then license your work out under any license they want. But likewise, because you are not granting them an exclusive license, you're free to put your contribution license out into the world under any license of your choosing separate and apart from the project that you contributed it to.
Technically, I think the scenario you're describing with AGPL code may well be possible and legal. But practically, I think people would stop using GitHub if they felt that doing so would lead to GitHub/Microsoft undercutting their projects, stealing their customers, or essentially stripping the project of any AGPL obligations. I think that from a business perspective, they're really gambling on the idea that developers will see Copilot as a big boon rather than a value suck. Time will tell whether their gamble has paid off.
Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.
I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.
The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.
Other parties are not granted license under the ToS, and so will have to abide by the AGPL.
That's why it's GitHub, not CodeScribe (or something)
And sometimes the network improves, depending on the quality or direction of the output; the client can be a valuable critic even without being an expert in the field
As it stands, Copilot is a black-box which strips copyright from a piece of code.
I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.
I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.
I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.
I'm not fine that Copilot allows people to take my code, 'change the variable names' and remove the license. Especially because I have no visibility of the fact that this has occurred.
As is, it's EEE applied to open source-- Microsoft's ultimate play against the ethic that brought us Linux among other things. When your brainchild gets gobbled up faster than you can blink, pushed to people who never learn about your existence, and a megacorp that you are ethically opposed to profits from the process, the need for self-actualization is no longer addressed. The fundamental incentive that pushes us to publish in the open, to have other humans acknowledge you and your work and feel pride in it, is being eliminated.
I used to be a little more agreeable with Copilot, given the training costs and all, but seeing that Stable Diffusion is willing to open up a model that cost hundreds of thousands in training, and more in engineering, and thereby create an active community dedicated to improving it every day, I just can’t help but be so annoyed when one of the world’s biggest tech companies pulls such a petty move.
https://docs.github.com/en/site-policy/github-terms/github-t...
I'm actually surprised they allowed Copilot to happen, given this section:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
One could make the argument they had no intrinsic right to use the software for Copilot except under the terms laid out under the respective softwares' licenses. This means any GPL code they copied by error is now in violation of the GPL by default. But IANAL.
edit: The exact quote was “Training machine learning models on publicly available data is considered fair use across the machine learning community” if you want to search for it.
edit 2: https://web.archive.org/web/20210629142841/http://copilot.gi...
> Frequently Asked Questions -> Training Set -> Why was GitHub Copilot trained on data from publicly available sources?
> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.
You have apparently misunderstood copyright licenses as being something that attaches to a copyrighted work and now must be respected by all users of that work. But that is totally incorrect.
Licenses are individual agreements between copyright holders (or licensees who have been granted the right to re-license) and people who want to exercise one of the rights normally withheld under copyright. A LICENSE file is nothing but an offer to grant a license with specified terms to anyone who might want to use the work, without having to nag the licensor to sign an agreement. The existence of that offer doesn't have anything to do with any other agreement the licensor and a (potential) licensee might make.
In the GitHub case, GitHub has negotiated a different license with the uploader. (That negotiation happened to take the form of a ToS, which is another kind of binding offer.) The LICENSE file has nothing to do with it. It hasn't been overridden, it's just irrelevant. It doesn't add or subtract any terms from the separate and distinct license GitHub negotiated.
Other parties will still have to abide by your license.
Monkeys have evolved enough to start making their own tools [0].
[0]: https://www.scientificamerican.com/article/monkeys-make-ston...
It doesn't matter whether it's music, literature, or code. Fair use is fair use. And it's been challenged so often that no judge is going to make any exceptions just because we are now dealing with software.
End of story. No basis for any copyright infringement here. Not even worth trying out in a court because you'd be laughed away. The plaintiffs in this case clearly realized that and did not bother with even trying to prove otherwise.
Software patents are not part of this court case either for the obvious reason that the vast majority of copyright holders in this case don't actually hold any patents whatsoever. And if they would, it would not be Github's problem but the problem of those creating possibly infringing products without a license. Github just gives people access to (public) knowledge here. That's what a patent is: public knowledge. It's up to the user to decide if they are OK shipping products that include that. And it's their problem to do any due diligence.
Why wouldn't they? They are both large, important codebases which they can do whatever they like with. If they are confident that what Copilot does is akin to learning, or at least something transformative, then it makes perfect sense.
In all fairness, the article mentions the Fair Use doctrine. You could make the argument that reading a bit of code is allowed, but doing it at a large scale would not be covered as an exemption to copyright.
Counting all the code I have read in my life, it's also quite a lot.
If you sell a product which can regurgitate large parts of code whose licence doesn't allow it to be completely freely used and modified then this kind of outcome seems a foregone conclusion.
Yes it is. If code is public you can't necessarily use or copy the code (depending on the license) but there's nothing stopping you reading it and learning how it works and any secrets contained in it (e.g. information about future products).
There's a reason most proprietary software isn't "source available".
> If you sell a product which can regurgitate large parts of code whose licence doesn't allow it to be completely freely used and modified
Yes but they don't do that. I'm not sure why people are finding this so hard to understand.
Also it's pretty clear you haven't worked with many artists from that statement.
Humans are also trained only on things produced by humans. The only exception is nature, but an ML model can be trained on photos of nature, too. Also, you are missing the point.
> Also it's pretty clear you haven't worked with many artists from that statement.
1. I employed quite a few artists over the past 15 years.
2. I wasn't talking about artists. I was talking about regular humans. The vast majority of them are absolutely unable to create anything resembling Elon Musk riding on a giant sparrow.
(Of course the GitHub ToS do not allow any CoPilot activities in the first place.)
println!("foo at {:x} is {:?}", &foo as *const _ as usize, foo);
It almost always writes what I would have. How DARE I steal from open source contributors like that?!
Check out my essay I’ve submitted recently.
Just because it's in the Terms of Use doesn't mean it can be upheld in court (or more specifically: in every court). If you uploaded your repository to a service advertising itself as a version control service, the service using your uploaded code to feed a commercial code generation product would likely be ruled as "surprising", which at least in Germany has been used by courts to dismiss claims of Terms of Use violations (e.g. when WhatsApp banned users for using third party apps).
Replace "AGPL" with Old Microsoft's "public source" (i.e. proprietary code published without an open source license) for a more likely scenario.
That's why for my own projects, I actually made sure to get approval from all contributors when vendoring a dependency just after the ToS change: https://github.com/justjanne/QuasselDroid-ng/issues/5
The defence would probably claim that GitHub effectively invites users to post AGPL code (this being a pretty fundamental part of their business model), including when they don't hold the copyright, so it is implied that the ToS indemnity cannot be interpreted to include this situation. If GitHub tried to claim otherwise, they'd have to contradict themselves and courts usually find that kind of thing unacceptable.
The indemnity would stand for other cases of course, such as users posting code without permission of a license.
This is the same but for use of open source code: if humans are allowed to use one specific (organic) neural network to read, process, and use open source code, then why shouldn't they be allowed to use some other neural network, artificial or otherwise?
A neural network is closer to a database than a human brain. So this is akin to saying: I can store your personal data in my human brain (without your consent), why am I not allowed to do it in PostgreSQL?
With code, that is denoted via the license, which when supplied with the code and especially as metadata before downloading (as is the case with GitHub) is the common means with which those limits are placed.
Humans and neural networks process information very differently and it's disingenuous to imply otherwise.
(By the way, do you have an objection to my point? I must have missed it.)
"Excellent, let's see how your car goes faster when you push the gas pedal"
My point was that insane amounts of curated fully original works were required to get the output of these generative tools to the "occasionally impressive" level it is at now, and those original works exist precisely thanks to copyright. To say "oh we don't need copyright now" is to saw off the branch on which this hinges.
If you are into this stuff and want to see it become better you should rather promote copyright and differentiation of fully original works.
There are GPL repositories which force you to open your code, which is one aspect, and there are "source available" repositories, which allow you to see the code but forbid everything else.
There are a lot of blurry areas about this, and in my opinion, "an AI learns like a human" is not a solid basis for fair use.
On the other hand, if private repositories are crawled too, this would be very, very bad.
We just talked about this with a couple of friends. I always cite what I got from where (it's just two occasions, but it's not zero), and always respect their licenses.
I'm worried about both ways of the permeation: GPL to closed and closed to open. Open source is a widely misunderstood concept and people (and companies) are using that misunderstanding to validate their blanket options. That's wrong on so many (legal to ethical, and everything in between) levels.
Emulator writers are afraid to read leaked console code, because any resemblance of their code to it means destruction of years (or decades) of reverse engineering and clean room development done in that domain. If code licensing is that important and crucial, why is a court-tested license (e.g. GPL) so worthless? Is this fair, again in the same cross-section (legal to ethical)?
There's a lot to be discussed, and a lot of ideas to be re-learnt here. Open Source (or precisely Free / Copylefted software) doesn't mean free for all. We need to understand that.
If I saw 100 LOC which was very similar to something which I wrote, AND contained a log statement copied verbatim, it's very easy to imply that the entire piece of code is a derivative work.
Let's say I write FizzBuzz:
# Copyright (c) 2022 David Allison. All rights reserved.
for num in range(100):
    if num % 3 == 0 and num % 5 == 0:
        print("DA: fizzbuzz")
    elif num % 3 == 0:
        print("DA: fizz")
    elif num % 5 == 0:
        print("DA: buzz")
    else:
        print(num)
If I found the modified FizzBuzz algorithm in the wild with one line containing the "DA" prefix, it may have been learned from a fraction of a fraction of my code, but it still contains my 'unique' creativity. Is that a copyright violation? Aside: Due to some uniquely named code I've contributed to, I strongly suspect that Copilot would output my GitHub username. I don't really want to open Pandora's box here, but I'd be curious.
On the practical side, it is actually easy to filter out sequences of words that are too similar to the training set from the output of the model. You just generate another snippet until it is "original" enough.
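To make the filtering idea above concrete, here is a minimal sketch of "generate again until it is original enough". Everything in it is an assumption for illustration: the whitespace tokenizer, the n-gram size, the overlap threshold, and the names `ngrams`, `overlap_ratio` and `generate_original` are all hypothetical, and nothing here claims to describe how Copilot actually filters its output.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of n-token windows in text (naive whitespace tokens)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, corpus_ngrams: Set[NGram], n: int) -> float:
    """Fraction of the candidate's n-grams that also occur in the corpus."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & corpus_ngrams) / len(cand)


def generate_original(
    generate: Callable[[], str],   # hypothetical stand-in for the model
    corpus: Iterable[str],         # hypothetical stand-in for the training set
    n: int = 4,                    # assumed n-gram size
    max_overlap: float = 0.5,      # assumed similarity threshold
    max_tries: int = 10,
) -> Optional[str]:
    """Resample until a snippet is below the overlap threshold, or give up."""
    corpus_ngrams: Set[NGram] = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)
    for _ in range(max_tries):
        snippet = generate()
        if overlap_ratio(snippet, corpus_ngrams, n) <= max_overlap:
            return snippet
    return None  # every attempt was too close to the corpus
```

A filter like this only catches near-verbatim token overlap. Renamed variables or reordered statements would pass it, so whether it is "original enough" in a legal sense is a separate question.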
Pragmatically, people are already knowingly committing commercially viable copyright violations of my work. I'd rather it wasn't encouraged further by a US-based 'big tech', especially if the people using my code aren't aware that they're doing anything questionable.
Some months, I earn over 100x less from OSS than I would in industry. I don't want people taking advantage any more than I'm comfortable with, especially for commercial purposes.
I pay for Copilot and this is very much the truth, but let's see what the court rules.
Btw, I would have been behind MS if they had done one of these two things:
1. Use all the code they have access to, including MS code and private code on GitHub, because that would show they actually believe that the AI works as advertised.
2. Make the model open: let people use it locally, improve it, test it for copyright issues, do whatever they want.
At the point of creation something is granted copyright.
Publishers in literature and music are right assholes who’ve created this system. Little middle men rent seeking.
It does need reform, but it is for the creators; that’s why it’s tied to the creator and not the date of publication. Fix your perspective, buckaroo.
No, under the US Constitution it is for a specified public benefit as its purpose, the private benefit is a mechanism to achieve that.
Under the Statute of Anne, it was nominally for creators (but this was lobbied for by printers after the expiration of earlier laws, and they were the prime beneficiaries in practice.)
The earlier laws were explicitly for printers.
Well, that's false. The actual US Constitution in Article I Section 8 Clause 8 says, "[The Congress shall have power] To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries."
That could possibly, one day, provide public benefits, but it doesn't have to. If public benefits happen, they are side effects. The purpose is for authors and inventors to have exclusive rights in their writings and discoveries. They are never required to publish or advance anything for the public. This is a private protection, not a public benefit for a "specific purpose".
At least according to most histories of copyright I've seen. Wikipedia seems to agree:
https://en.wikipedia.org/wiki/History_of_copyright#Early_dev...
You are talking about modern US copyright law.
But copyright laws (laws around copying) predate the existence of publishers and the declaration of independence of the United States by over 1,000 years.
No, I’m talking mostly about British copyright law prior to the nominal prioritization of creators in the Statute of Anne (1710).
(Technically, it was focused on printers rather than publishers, but the separation of function of those is a more modern arrangement.)
You can tell the part you target isn't about modern US copyright law because I later in the same post distinguish all US copyright law under the Constitution (which includes modern US copyright law) from it.
This is wrong on so many levels.
1. Copyright grants a limited set of rights to the copyright holder. "Use" typically doesn't fall into that set. Everyone has the right to "use" copyrighted material for any purpose that isn't some kind of copy or distribution.
2. Even when we consider uses which are actually covered by copyright, a license is not the only way to legally copy the material. Fair use exists.
3. There is no such thing as "the license attached to it". Licenses do not "attach" to copyrighted works. A license is an individual agreement between the copyright holder and each and every person who wants to use the material (within the scope of copyright rights and outside of fair use). Those agreements can be different in every instance, if the licensor and licensee have so agreed.
The only thing a LICENSE file or other similar way of indicating a license on code does is make a (binding) offer to license the work under the specified terms to all comers. Once anyone actually has a license by any means, including a separately negotiated license, then the LICENSE file no longer has anything to do with them or their use of the material. In the case of github, they have separately negotiated (by making a binding offer of their own in their ToS) a license to use the material for the provision of their service; therefore the LICENSE file has nothing to do with them (unless they want to use the offered license instead of the one they negotiated, and they haven't negotiated away the right to use the license offered in the LICENSE file).
I agree that use isn't governed by copyright, copying is. However, to "use" code is to make a copy of it (multiple times usually).
As far as attachment goes, I think the common sense meaning was clear. On GitHub, you can attach a license. I wasn't claiming that "attachment" was some feature of copyright law!
On fair use, I agree with your point entirely.
That line of argument might defang any claims I might have against Copilot, as I have personally uploaded much of my public open-source code to GitHub.
They can analyse it. They say they can't sell it or distribute it outside the service. Even though this can apparently happen with copilot sometimes!
https://docs.github.com/en/site-policy/github-terms/github-t...
"This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service"
Now, I read that to mean they can't sell my content. But apparently they can if they store it in a machine learning model to a greater or lesser accuracy!
Hash the content. If the content has been viewed by the user in the past, halve the numerical hash value. Then sort the list.
Doing this to a music list will create a list that is biased toward music the user has listened to before, but that will still seem random enough to look like intelligent suggestions the system has learned to make. It is just math that mimics learning.
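The hash-and-halve recipe above can be sketched in a few lines. This assumes a SHA-256 digest truncated to 64 bits as the "numerical hash value" and an ascending sort, so that halved (already viewed) items drift toward the front; the original comment specifies neither, and the names `content_hash` and `biased_order` are made up for this example.

```python
import hashlib
from typing import Iterable, List, Set


def content_hash(item: str) -> int:
    """Stable 64-bit number derived from the content (assumed hash choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def biased_order(items: Iterable[str], viewed: Set[str]) -> List[str]:
    """Sort by hash, halving the hash of anything the user has seen before."""
    def sort_key(item: str) -> int:
        value = content_hash(item)
        # Halving pulls viewed items toward the front of an ascending sort.
        return value // 2 if item in viewed else value

    return sorted(items, key=sort_key)


playlist = ["song-a", "song-b", "song-c", "song-d"]
print(biased_order(playlist, viewed={"song-b"}))
```

`hashlib` is used instead of the built-in `hash()` because string hashes in Python are randomized per process, which would reshuffle the list on every run.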
The mathematics of artificial neural networks is math. It is only math. One can make it very complex or very simple but in the end it is just math and pointers.
That's not even scratching the surface of differences
The text you quote is explicit: the public benefit (promotion of science and useful arts) is the purpose. Providing benefits to creators is a mechanism for achieving that purpose, not the purpose itself. That’s what I said before, and it remains true, and you’ve just quoted the bit of the Constitution that says it while claiming it is false.
1. People have certain rights, duties and prohibitions. Equating the right of George Lucas to use ideas he saw with rights of a machine to do that misses the point by the same measure as asserting that MS enslaves the copilot, but in the opposite direction.
2. Scale does matter. If I'm an ordinary person then the act of eating won't ruin the ecosystem. Now imagine a construct that operates under the same principle of eating, but its jaw, stomach and speed of eating are many orders of magnitude larger. Do we apply the same limitations to both, because the principle of eating is the same?
Also, since I'm spelling things out, the fact that I'm seeing the same argument many times over, and that it is so obviously flawed, makes me think that this is a symptom of astroturfing.
Hardly relevant, given that the machine has no rights, so no one is equating those with anything. The point is that the machine is doing automated learning on behalf of the developers who are training it, so what should be decided is whether those very human people have a right to train their model in that way.
And the thing is that the machine does not benefit from the same rights as a person, so we can't absolve MS from responsibility because "it does a similar thing to what people do".
So, to add one more point to spell out: the context matters! ;)
That's the thing: there is no reason to think that they are similar.
AI/ML is not artificial general intelligence. It's a mathematical model.
It may well be so. I was arguing with respect to your previous post, in which you stated as a relevant difference merely the size of the job.
With math, you can just describe everything, including the human brain.
My original argument was specifically about neural networks, that I don't really see the principal difference in how a human learns from reading code, and how Copilot has learned from reading code.
Right now it is impossible to accurately describe the human brain in the form of math. What we can do is write simplified models that either describe or mimic behavior, which, if we apply abstractions, we can call predictive models. Their effectiveness is quite poor, but that has never stopped people from trying to use poor prediction models to predict the future.
My statement on human brain was: In principle, you can describe it with math. This doesn't mean that we know how to do that yet.
My statement on Copilot was: Comparing learning of the human brain to learning of the artificial neural network, both are still very similar, much more similar to each other than to other (machine or other) learning methods. Sure, there are differences. But my point is: Those differences, why are they relevant for the copyright question?
The mind that can acknowledge and appreciate your work in this scenario (Copilot) does literally nothing of its own free will except 1) take your code and 2) give it to me, possibly combined with someone else's code. This is the sole purpose of its entire existence and full range of its capabilities. Is this enough of a difference compared to a human mind when copyright is concerned?
It spares me from knowing that you exist, that you wrote a library that does this thing I need, that I can contribute to it, etc. In such a scenario, what is the motivation for you to make your library publicly available in the first place (other than to generate revenue for Microsoft or whoever I pay for access to the network)? Does copyright have relevance to OSS now?