We've filed a lawsuit against GitHub Copilot (githubcopilotlitigation.com)
The crux of the lawsuit's argument is that the AI unlawfully outputs copyrighted material. This is evident in many tests with many people here and on Twitter even getting verbatim comments out of it.
AI art, on the other hand, is not capable of outputting the images from its training set, as it's not a collage-maker, but an artificial brain with a paintbrush and virtual hand.
But I don't think copyright on visual images actually works like that, that it needs to be an exact copy to infringe.
If I draw my own pictures of Mickey Mouse and Goofy having a tea party, it's still copyright infringement if it is substantially similar to copyrighted depictions of Mickey Mouse and Goofy. (Subject to fair use defenses: I'm allowed to do what would otherwise be copyright infringement if it meets a fair use defense, which is also not cut and dried, but if it's, say, a parody, it's likely to be fair use. There is probably a legal argument that Copilot is fair use. The more money GitHub makes on it, the harder that argument gets, though: making money off something is not relevant to whether it's a copyright violation in the first place, but it is relevant to a fair use defense.)
(Yes, it might also be trademark infringement, but there's a reason Disney is so concerned with the copyright on Mickey expiring, and it's not that they think there's lots of money to be made selling copies of the specific Steamboat Willie movie...)
> There is actually no percentage by which you must change an image to avoid copyright infringement. While some say that you have to change 10-30% of a copyrighted work to avoid infringement, that has been proven to be a myth. The standard is whether the artworks are “substantially similar,” or a “substantial part” has been changed, which of course is subjective.
https://www.epgdlaw.com/how-can-my-artwork-steer-clear-of-co...
I think Stable Diffusion etc are quite capable of creating art that is "substantially similar" to pre-existing art.
- https://i.imgur.com/VikPFDT.png
I also don't know if I would anthropomorphize ML to that degree. It's a poor metaphor and isn't really analogous to a human brain, especially considering our current understanding, or lack thereof, of the brain, and even the limited insight we have into how some of these models work from the people who work on them.
Want to say that again?
P.S. I am not a lawyer.
The extra steps aren't enough to exonerate them. It's just a convoluted copy operation.
It's just like how a lossy encoding of a song is still, with respect to copyright, a copy of that song. The data is totally different, and some of the original is missing. It's still a derivative work. So is a remix. So is a reperformance.
robots.txt
This is exactly what is needed for source code, and the default (no robots.txt) should be "disallow". The fact that the Web has considered this moral issue should be a strong hint for the AI people not to take a purely legal stance but to consider the OSS community that they are so heavily using.
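For reference, the robots.txt mechanism can be queried with Python's standard `urllib.robotparser`. The sketch below is only an illustration of the proposal: the crawler name `ExampleCodeTrainer`, the example URL, and the `may_train_on` helper with its default-deny behavior are assumptions made up here, not a convention that any AI crawler is known to follow.

```python
from typing import Optional, Sequence
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used only for illustration.
TRAINER_AGENT = "ExampleCodeTrainer"


def may_train_on(robots_lines: Optional[Sequence[str]], url: str,
                 agent: str = TRAINER_AGENT) -> bool:
    """Return True only if a robots.txt exists and allows `agent` to fetch `url`.

    Unlike the usual web convention (no robots.txt means "allow"), a missing
    file is treated as "disallow", which is the default proposed above.
    """
    if robots_lines is None:  # no robots.txt published: default-deny
        return False
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


opt_out = [
    "User-agent: ExampleCodeTrainer",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
]
url = "https://example.com/some-repo/main.py"

print(may_train_on(opt_out, url))                   # the trainer is blocked
print(may_train_on(opt_out, url, agent="Browser"))  # other agents are allowed
print(may_train_on(None, url))                      # no file: denied by default
```

The only departure from ordinary robots.txt handling is the `None` branch; everything else is the standard parser.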
They're asking for two things, injunctive relief (ordering github/openai/microsoft to stop doing this) and damages.
I suppose the injunctive relief really benefits anyone who doesn't want AI models to exist, because that's what it's asking for.
The damages will go to the members of the class certified for damages, with more going to the lead plaintiffs (those actually involved in the suit) and some going to the lawyers. They're asking for the following class definition for damages:
> All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time during the Class Period.
No, I'm just teasing... If a neural network learns how to program by reading my code, it will generate a mess with tabs and spaces mixed together.
I'm 1000% on team open source and have had to refer to things like tldrlegal.com many times to make sure I get all my software licensing puzzle pieces right. Totally get the argument for why this litigation exists in the present.
Just saying in general my friends I hope you have an absolutely great day. Someone will be wrong on the internet tomorrow, no doubt about it. Worry about something productive instead.
This one has the feel of being nothing more than tilting at windmills in the long run.
I may not care if some guy I've never met uses my niche library without attribution. (I do care, really.) But Microsoft certainly cares if you use their code without attribution, so why shouldn't I take the same belligerent, copyright-enforcing attitude towards them? That's the main reason why people are angry, because MS has "rules for thee but not for me" by virtue of being big enough to have ~~good~~ effective lawyers and lobbyists.
Sometimes the query is the first half of a small statement that we can fill in with common patterns. Useful, fair.
Sometimes the query is a signature like `fn fast_inv_sqrt` that copies someone's code and doesn't attribute it.
A better shortening of the original title is simply "We’ve filed a lawsuit challenging GitHub Copilot"
Grand theft, interstate wire fraud and conspiracy for same.
This is a criminal matter as well as civil. Intentional and knowing violation of the law.
We must not let our work be taken!
Can the generated code be traced back to the code used for training and the original copyrights and licenses for that code?
If so, what attribution(s) and license(s) should apply to the generated code?
In other words, if your open source project doesn't have such immediately recognizable code and didn't cause a shitstorm on Twitter, chances are copilot is still happily spewing out your exact code, sans the copyright and license info.
Because I sure have seen that exact code written, from scratch, in many many places.
I guess my question boils down to "What is the smallest copyrightable unit of code?". Because I'm certain suing a novelist for copyright infringement on a character that says "Hi, how are you?" would be considered absurd.
Seems to me the underlying data should be opt-in from creators and licenses should be developed that take AI into consideration.
Start off a comment with // MIT license
Then watch parts of various software licenses come out including authors' names and copyrights!
(asking because I know the authors were kinda famous for being very litigious).
all the best with the lawsuit.
If these folks win - we again throw progress under the bus.
Yes, but attribution should still be given. Just because you don't copy-paste someone else's creation doesn't mean you're licensed to use it.
What if, instead of a tool, you had a random consultant do some work, and it was found out that he asked a ton of stuff on Stack Overflow and copied the CC-BY-SA 4.0 answers into his work? What if it was then found out that one of those answers was based on copying something from the Linux kernel? Who is responsible for doing the license check on the code before releasing the product?
Hard no. Please stop using open source code if this is how you think of it.
Without licenses being respected, we don't get open source communities.
So why can MS screw only with some licenses that you call "open source"? Your example with a human reading a book would also work with code-available licenses or decompiled binaries.
I would have been fine if the open source code was used to create an open model, or if MS had put its ass on the line and also trained the model on all the GitHub code, because they claim there is no copyright issue.
P.S. I am not a lawyer.
function force=Gmmr2Array(mass1, mass2)
and function [force, torque]=pointMatrixGravity(array1,array2)?
I'd love to know if some of my GPL v3 code [1, 2] has landed in the training set
[1] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...
[2] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...
var G = 6.67e-11;
var force = G * mass1 * mass2 / distance * distance;
return force;
Second prompt: % pointMatrixGravity - Calculates the gravitational force and torque on a
% point mass due to a matrix of point masses.
%
% [force, torque]=pointMatrixGravity(array1,array2)
%
% Inputs:
% array1 - 3xN array of point masses
% array2 - 3xN array of point masses
% Outputs:
% force - 3x1 array of gravitational force
% torque - 3x1 array of gravitational torque
%
% Author: David J. R. MacKay,
% email:
Interesting that my results were different than yours!
function force=Gmmr2Array(mass1, mass2) {
var G = 6.67408e-11;
var m1 = mass1;
var m2 = mass2;
var r = 1;
var force = G * m1 * m2 / (r * r);
return force;
}
function [force, torque]=pointMatrixGravity(array1,array2) {
var G = 6.67408e-11;
var m1 = array1(1);
var m2 = array2(1);
var r = norm(array1(2:4)-array2(2:4));
var force = G * m1 * m2 / (r * r);
var torque = cross(array1(2:4)-array2(2:4), force);
return [force, torque];
}
This person (a lawyer) saw an opportunity to make money and jumped on it like a hungry tiger on fresh meat.
> But neither Matthew Butterick nor anyone at the Joseph Saveri Law Firm is your lawyer
This is curious. None of them are my lawyers, but surely at least some of them are someone's lawyers? Isn't it wrong to put such a blanket disclaimer on a website which might well be read by their clients?
But I like to put on my conspiracy hat from time to time, and right now is one such time, so let's begin...
Though the motivations behind this case are uncertain, what is certain is that this case will establish a precedent. As we know, precedents are very important for any further rulings on cases of a similar nature.
Could it be the case that Microsoft has a hand in this, in trying to preempt a precedent that favors Copilot in any further litigation against it?
Wouldn't put it past a company like Microsoft.
Just a wild thought I had.
The No-AI 3-Clause Open Source Software License
Copyright (C) <YEAR> <COPYRIGHT HOLDER>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Use in source or binary forms for the construction or operation
of predictive software generation systems is prohibited.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
https://bugfix-66.com/f0bb8770d4b89844d51588f57089ae5233bf67...
For pointMatrixGravity: https://gist.github.com/ridiculousfish/af05137a4090e92de3a97...
The legal footing that copyright gives you, on which licensing rests, certainly empowers you to limit things about how others may redistribute your work (and things derived from it), but does it empower you to limit how others may read your work? As a ridiculous example, I don't think it would be enforceable to have a license say "this code can't be used by left-handed people", since that's not what copyright is about, right?
I think we can constrain use with the third clause.
My question is, how should we word that clause?
Such language must be carefully written. What is the definition of “construction” and “operation” in a legal context? What is a “predictive software generation system”? That’s a very specific use case, you sure you covered everything you want to prohibit?
You’ve inserted your clause in such a way that this dependency cannot be used in any way to build anything similar to a “predictive software generation system”, even with attribution, as it would fail clause 3.
You have to consider that novel licenses make it difficult for any party that respects licenses to use your code. It is difficult to make one-off exceptions, especially when the text is not legally sound. So adoption of your project will be harmed.
So if you are serious about this license, you need a lawyer.
3. Use in source or binary forms for the construction or operation
of predictive software generation systems is prohibited.
Hardly nonsense, but obviously you aren't equipped to judge. More about the BSD licenses:
I have no idea if this license language works or doesn't, but this is hardly the least productive subthread on this story. It's concrete and specific, and we can learn stuff from it.
https://ogc.harvard.edu/pages/copyright-and-fair-use
This AI re-mixing stuff is so new, I think few legal observers would say they could definitely predict what the courts will do with it. Nobody really knew how the Google Books case, for instance, was going to go until it went.
Really what it comes down to is do you have enough resources to convince a judge or jury that X is a copy of Y? Doesn't really matter the size of X.
Do you know whether the code you got from Copilot has an incompatible license? No, so if you plan to use Copilot for serious projects you need it to include sources/licenses either way. In fact that would be a very helpful feature as it would let you filter licenses.
P.S. I am not a lawyer.
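To make the "include sources/licenses so you can filter" idea concrete, here is a minimal sketch. It assumes a suggestion could carry provenance metadata at all, which Copilot is not stated to do anywhere in this thread; the `Suggestion` type, its fields, the SPDX-style license strings, and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Suggestion:
    """A code suggestion plus (hypothetical) provenance metadata."""
    code: str
    source: str    # repository the snippet is assumed to resemble
    license: str   # SPDX-style identifier such as "MIT" or "GPL-3.0-only"


def filter_by_license(suggestions: Iterable[Suggestion],
                      allowed: Set[str]) -> List[Suggestion]:
    """Keep only suggestions whose license is on the caller's allowlist."""
    return [s for s in suggestions if s.license in allowed]


def with_attribution(s: Suggestion) -> str:
    """Prefix the code with a comment naming its license and source."""
    header = f"// below output code is {s.license} licensed (source: {s.source})"
    return f"{header}\n{s.code}"


candidates = [
    Suggestion("return n % 2 === 0;", "github/repo/blah", "MIT"),
    Suggestion("return n > 1;", "github/repo/other", "GPL-3.0-only"),
]
for s in filter_by_license(candidates, allowed={"MIT", "BSD-3-Clause"}):
    print(with_attribution(s))
```

With the allowlist above, only the MIT-licensed candidate survives and is printed with its attribution header.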
a) The work must carry prominent notices stating that you modified it, and giving a relevant date.
b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
——
I don’t see how one could argue that training on GPL code is not “based on” GPL code.
In this case, wouldn’t the users of copilot be the ones responsible for any copyrighted code they may have accessed using copilot?
//below output code is MIT licensed (source: github/repo/blah)
And yes, the "users" are responsible, but it's possible that copilot could be implicated in a case depending on how its access is licensed.
Stable diffusion has this same problem btw, but in visual arts "fair use" is even murkier.
For code, if you could use the code and respect the license, why wouldn't you? Copilot takes away that opportunity and replaces it with "trust us".
Obviously not financially as Microsoft has basically YES amounts of money.
If you are opinionated but lazy, no judgement here as I sit here watching TV, you could add a notation at the top of your repos explicitly supporting the usage of your code in such tools as fair use.
Notably, if your code is derivative of other works, you have no power to grant permission for such use for code you don't own, so best include some weasel words to that effect. Say:
I SUPPORT AND EXPLICITLY GRANT PERMISSION FOR THE USAGE OF THE BELOW CODE TO TRAIN ML SYSTEMS TO PRODUCE USEFUL HIGH QUALITY AUTOCOMPLETE FOR THE BETTERMENT AND UTILITY OF MY FELLOW PROGRAMMERS TO THE EXTENT ALLOWABLE BY LICENSE AND LAW. NOTHING ABOUT THIS GRANT SHALL BE CONSTRUED TO GRANT PERMISSION TO ANY CODE I DO NOT OWN THE RIGHTS TO NOR ENCOURAGE ANY INFRINGING USE OF SAID CODE.
Years from now when such cases are being heard and appealed ad nauseam a large portion of repos bearing such notices may persuade a judge that such use is a desired and normal use.
You could even make a GPLesque modification, if you were so inclined, where you said: SO LONG AS THE RESULTING TOOLING AND DATA IS MADE AVAILABLE TO ALL
Note not only am I not your lawyer, I am not a lawyer of any sort so if you think you'll end up in court best buy the time of an actual lawyer instead of a smart ass from the internet.
The situation that this lawsuit is trying to save you from is this: (1) copilot blurps out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns copyright on something Y that copilot was trained on, and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.
You are now in legal trouble, and copilot put you there, because it never warned you that X is part of Y, and that Y comes with such and such licensing terms.
Whether we like copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but that doesn't make law.
If not, it's a pretty clear sign they consider it radioactive.
But no matter how this goes, if training AI with copyrighted inputs is "fair use", it'll end up as the ultimate "copyright laundry machine", like this "joke" project here:
https://web.archive.org/web/20220104214929/https://fairuseif...
https://news.ycombinator.com/item?id=27796124 (302 points, 151 comments)
1. The ability to be able to run and train these models is going to eventually be perfectly plausible on a home machine.
2. It's only a matter of time before models, e.g. a popular model scraped from all of the code on GitHub, is a publicly available torrent.
3. People will be able to just run it locally as an integrated plug-in in JetBrains or VS Code.
4. You'll never know if somebody has lifted their code in violation of a license, any more than you would be able to tell if somebody used code from Stack Overflow without attribution in any commercial endeavor.
The End.
I don't think 1-3 matter at all. The point is that GitHub is selling a tool that can commit copyright infringement. This lawsuit is trying to get them to pay the consequences for the infringement that they have enabled.
We've even seen this with stable diffusion image generation, where specific watermarks can be re-created (decrypted?) deterministically with the proper input.
Anybody looking at the source image and the generated result would say they are the same.
Did you know that before airplanes were invented, common law said you owned the air above your land all the way to the heavens?
In addition just because code is available publicly on GitHub does not necessarily mean it is permissively licensed to use elsewhere, even with attribution. Copyright holders not happy with their copyrighted works publicly accessible can use the DMCA to issue take-downs that GitHub does comply with but how that interacts with Copilot and any of its training data is a different question.
As much as the DMCA is bad law, it's rather funny seeing Microsoft be charged in this lawsuit under the lesser-known provision against 'removal of copyright management information'. Microsoft does have more resources to mount a defence, so it will probably end up differently than it would for a smaller player facing this action.
Individually, each frame is protected by the copyright of the movie it belongs to. But what happens if you take a million frames from a million different movies and just arrange them in a new way?
That's the core question here. Is the new movie a new copyrightable work, or is it plagiarizing a million other works at once? Is it legal to use copyrighted works in this way?
The other question is if it is right to use copyrighted works this way. Is this within the spirit of open source software? Or is this just a bad corporation taking advantage of your good will?
I'm not sure where I stand on this, it's a complicated problem for sure. Definitely interested to see how this plays out in court.
I don't know US copyright law, so I can't comment on the legal documents, but this website is not complaining that Copilot is reproducing copyrighted content, but that it was trained on copyrighted content. I don't see how you can forbid someone or something from reading and learning from something that is public (once again, reproducing it is another problem).
For example, let's say I take a single frame of animation from a cartoon. The frame contains a mountain, a house, and a couple of characters, although those characters are not integral to the actual cartoon; maybe they're extras (villagers, not named characters like Mickey Mouse, for example).
I draw a picture of a lake with a cabin next to it, then start to draw a frontiersman but I trace one of his arms from a villager of that previous frame of animation... Number one am I in danger of copyright infringement (have I hit some arbitrary threshold), and number two: am I causing monetary losses for the cartoon?
If I'm being honest I'm a bit annoyed at this. What's the problem and what's the point of this?
I often notice on Hacker News that people don't seem to understand anything about free or open-source software outside of the pragmatics of whether they can abuse the work for free.
But I'll bite: I know licensing, thank you. But what's copyrightable is not so easy. Licenses are not so easy. Copilot does not copy entire works and it's very questionable if a few lines of code are "piracy". It's a repeating discussion again and again, there's nothing novel about it except for the fact that a machine learns (and overfits for small portions of code). So please get off your high horse. I don't care for your fundamentalism.
"AI needs to be fair & ethical for everyone. If it’s not, then it can never achieve its vaunted aims of elevating humanity. It will just become another way for the privileged few to profit from the work of the many."
Blah blah. Can we get back to the hacking on stuff mentality?
Depends on the license. If it's MIT and you serve the license, no, you are not infringing at all. A trimmed version of MIT for the relevant bits:
Permission is hereby granted [...] to any person obtaining a copy of this software [...] to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, [...] subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
> are you infringing when you run it
Depends on the license
> are you infringing when you use that file and distribute it somewhere
Depends on the license
----
When copilot gives you code without the license, you can't even know!
The redistribution happens later, either when copilot blurps out some of your code, or when the copilot user then distributes something using that code (I'm curious which). At that point, whether some use of your code is infringing your license doesn't depend on the path the code took, does it? (in which case #3 is moot)
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
That's word-for-word BSD license. The only change I made is adding clause 3:
3. Use in source or binary forms for the construction or operation
of predictive software generation systems is prohibited.
There is no technical reason why Microsoft can't respect licenses with Copilot. But that would mean more work and less training input, so they do code laundering and excuse it with comparisons to human learning because making AI seem more advanced than it is has always worked well in marketing.
Edit: And where do you draw the line between "learning" and copying? I can train a network to exactly reproduce licensed code (or books, or movies) just like a human can memorize it given enough time - and both of those would be considered a copyright violation if used without correct attribution. If you trained an AI model with copyrighted data you will get copyrighted results with random variation which might be enough to become unrecognizable if you're lucky.
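As a toy illustration of the memorization point (and not a claim about how Codex or Copilot work internally), the sketch below "trains" a character-level lookup model on a single snippet and then regenerates that snippet verbatim from its first few characters. The snippet, the context length `K`, and the function names are all made up for this example.

```python
from collections import Counter, defaultdict
from typing import Dict

K = 8  # context length in characters (an arbitrary choice for this toy)

SNIPPET = "def is_even(n):\n    return n % 2 == 0\n"


def train(text: str, k: int = K) -> Dict[str, Counter]:
    """Count which character follows each k-character context in `text`."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - k):
        table[text[i:i + k]][text[i + k]] += 1
    return table


def generate(table: Dict[str, Counter], prompt: str,
             k: int = K, max_len: int = 1000) -> str:
    """Greedily extend `prompt` with the most frequent next character."""
    out = prompt
    # Stop when the current context was never seen in training.
    while len(out) < max_len and out[-k:] in table:
        out += table[out[-k:]].most_common(1)[0][0]
    return out


model = train(SNIPPET)
completion = generate(model, SNIPPET[:K])
print(completion == SNIPPET)  # does the "learned" model reproduce its input?
```

The table holds only statistics about which character follows which context, yet with a single training text those statistics are enough to recover the text exactly. That is the sense in which a trained model can act as a copy.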
Of course, but that's a separate issue. We're not talking about whether the output of the AI is copyrighted. We're talking about whether it's ok for it to learn from copyrighted material.
Again you can say exactly the same about humans. I am perfectly capable of plagiarising or outputting copyrighted material. That doesn't mean it's illegal to learn from that material, just to output it verbatim.
So the fundamental issue is that it's harder to tell when an AI is plagiarising than it is when you produce something yourself. But that is a technical (and probably solvable) issue, not a legal one. And it's not the subject of this lawsuit.
I'd pose a question to you - would it be okay for me to copy/paste your code verbatim into my paid product in violation of your license and claim that I'm just using it for "learning"?
Remember the lawsuit of HiQ labs vs LinkedIn? Scraping, or viewing public data on a public webpage is legal.
https://gizmodo.com/linkedin-scraping-data-legal-court-case-...
Learning off code isn't the same as using the code as-is.
You don’t need any fundamentalism to know that copilot’s output carries huge and untested legal risk. If this lawsuit clears some of this up, that’s a big win for everyone.
> You don’t need any fundamentalism to know that copilot’s output carries huge and untested legal risk. If this lawsuit clears some of this up, that’s a big win for everyone.
I agree with that! I also see this as the only proper takeaway that I think is ok. The rest is making money off this thing. But the US has a different lawsuit culture anyway, which I find weird.
Not exactly the curriculum of a twitter weirdo.
I wasn't actually talking about him specifically btw when saying "this sounds like a crypto bro from twitter". The overly enthusiastic AI talk reminded me of that, that's what I wanted to say.
Can you use curl to infringe on copyright? Yes. Is every time you use curl copyright infringement? No. Can you in theory tell when you are infringing with curl? Yes.
Can you use copilot to infringe? Yes. Is every time you use copilot copyright infringement? No. Can you in theory tell when you are infringing with copilot? *No*
Surely that's solvable with a EULA that passes the responsibility onto the user to search?
Popcorn time vs. bittorrent.
And you are right the EULA could say "it's up to the end user to confirm you can use this code". But then how do you verify? That slows down "productivity" where copilot promises "speeding up" productivity.
“AI” is just fancy speak for “complex math program”. If I make a program that’s simply given an arbitrary input and then, through math operations, outputs Microsoft’s copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the opposite should be true as well.
I’m sure my own open source code is in that thing. I did not see any attributions, thus they break the fundamentals of open source.
In the spirit of Rick Sanchez: it’s just compression with extra steps.
function isPrime(n: number): boolean {
for (let i = 2; i < n; i++) {
if (n % i === 0) {
return false;
}
}
return n > 1;
}
function isEven(n: number): boolean {
return n % 2 === 0;
}
These are clearly not covered by copyright in the first place. This case is really quite pathetic.
I think they intentionally picked (literal) textbook examples because they're short and easy for non-experts to grasp and have some understanding of. But I don't think we've seen any of the code from the respective J. Doe's yet, and I would assume we would in the trial (possibly in addition to more cases).
So it isn't too hard to prove the case.
> Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot.
The offending solution from the AI included extra lines that are reasonably understood to come straight from Eloquent JavaScript:
console.log(isEven(50));
// → true
console.log(isEven(75));
// → false
console.log(isEven(-1));
// → ??
AI/ML will change every field just as the Internet and smartphones did. It doesn't show any indication of peaking, either.
If the US chooses the wrong path here, we'll only tie our hands behind our backs. Other countries won't be so foolish.
We should be able to train on any media a child could see, hear, or read.
" Showing 1 - 20 of 66 files found (in 76 milliseconds)"
So, if this lawsuit succeeds in some way shape or form, does the author have a case against the 66 people that reproduced these lines in their own repository?
Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.
Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
I am more hesitant to release code on GitHub under any licenses now. Even outside of GPL-esque terms, I've considered open sourcing some of my product's components under a source available but otherwise proprietary license, but if Microsoft won't adhere to popular licenses like the GPL, why would they adhere to my own licensing terms?
If my licenses mean nothing, why would I release my work in a form that will be ripped off by a trillion dollar company without any attribution, compensation or even a license to do so? The incentives to create and share are diminished by companies that won't respect the terms you've released your creations under.
That's just me as an individual. Thinking in terms of for-profit companies, many of them would choose not to share their source code if they know their competitors can ignore their licenses, slurp it up and regurgitate it at an incomprehensible scale.
(And refusing to opt in shouldn't have to mean switching to a new hosting platform.)
> Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
That's the case in pretty much any class action. I look at class actions as having two purposes: to require that the defendant stops doing something, and to fine the defendant some amount of money. Sure, individual class members will see very little of that money, but I look at it as a way of hurting a company that has done people wrong. Hopefully they won't do that anymore, and other companies will be on notice that they shouldn't do those bad things either. Of course, sometimes monetary damages end up being a slap on the wrist, just something a company considers a cost of doing business.
Now, Microsoft is violating other people's software licenses to repackage the work of numerous free and open source software contributors into a proprietary product. There is nothing moral about flouting the same type of contract that you depend on every day, for the sake of generating more money.
Either the entire Copilot dataset needs to be made available under a license that would be compatible with the code it was derived from (most likely AGPLv3), or Windows and Office need to be brought into the commons. Microsoft cannot have it both ways without legal repercussions.
If an AI model is the joint property of all the people who contributed IP to it, it’s a pretty hugely democratic and decentralizing force. It also will incentivise a huge amount of innovation on better, richer data sources for AI.
If an AI model isn’t joint property of the IP it learned then it’s a great way to build extractive business models because the raw resource is mostly free. This will incentivise larger, more centralised entities.
Much of the most interesting data comes from everyday people. A class action precedent is probably good for society and good for innovation (particularly pushing innovation on the edge/data collection side)
This legal challenge is coming one way or another. I think it’s better to get it out of the way early. At least then we will know the rules going forward, as opposed to being in some quasi-legal gray area for years.
Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat, or by a large corp wanting to shut down an open source or competing project?
An argument that isn't made about any other type of algorithm.
There's a fairly simple technical fix for codex/copilot anyway; stick a search engine on the back end and index the training data and don't output things found in the search engine.
So yes, it is like how human memory is compression with extra steps.
The real solution is very, very simple. Only use opt-in training data. Don't acquire codebases from people who didn't agree to it.
https://github.com/settings/copilot
More info:
We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that resembles public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. In addition, we have announced that we are building a feature that will provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code, as well as explore and learn how that code is used in other projects.
https://github.com/features/copilot#what-can-i-do-to-reduce-...
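For anyone wondering what a filter like the one quoted above might look like, here is a minimal sketch. It strips whitespace and checks whether any window of about 150 characters in a suggestion also appears in an index of public code. The names `normalize`, `build_index` and `should_suppress`, the in-memory set, and the exact windowing are assumptions made for illustration; GitHub has not published how its filter is actually implemented.

```python
WINDOW = 150  # "about 150 characters", taken from the quoted FAQ text


def normalize(code: str) -> str:
    """Drop all whitespace so formatting differences cannot hide a match."""
    return "".join(code.split())


def build_index(public_snippets: list[str], window: int = WINDOW) -> set[str]:
    """Hypothetical index: every whitespace-free window of each public snippet.

    Snippets shorter than the window contribute nothing, so trivial
    one-liners are never treated as matches.
    """
    index = set()
    for snippet in public_snippets:
        text = normalize(snippet)
        for start in range(len(text) - window + 1):
            index.add(text[start:start + window])
    return index


def should_suppress(suggestion: str, index: set[str], window: int = WINDOW) -> bool:
    """True if any window of the suggestion matches indexed public code."""
    text = normalize(suggestion)
    return any(
        text[start:start + window] in index
        for start in range(len(text) - window + 1)
    )
```

A real system would presumably store hashes rather than raw windows, and "near matches" would need something fuzzier than exact set lookup. Note that a filter like this only hides long verbatim matches; it says nothing about shorter or lightly modified copies.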
If they just stick to using permissive-licensed source code then I'm not sure what the actual 'harm' is with co-pilot.
If they auto-generate an acknowledgement file for all source repos used in co-pilot, and then asked clients of co-pilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.
After five minutes of googling I'm still not sure if using MIT code requires an attribution, but many people claim it does, see https://opensource.stackexchange.com/a/8163 as one example
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
This is the "attribution" requirement that even a Copilot trained on only-MIT code would miss.
If it were just about sharing code, there are public domain declarations and variants like CC0 licenses
Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, same as treating it like a human mind. They're very different, under the surface.
IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.
Art generators can't comply with attribution requirements and code generators don't know if and when they trip the GPL copyleft. I believe most permissive code licenses also have some kind of attribution requirement.
This is a VERY poor definition of mathematics.
Using Copilot is a bit like using a shotgun, can be very illegal depending on what you shoot at. Creating and distributing the app Copilot is like creating and selling a shotgun.
Although users can probably get away with it because they didn't know copilot was actively generating copyrighted code.
It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
And that is why this lawsuit is dead on arrival.
This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.
The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.
The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.
Am I violating your copyright? Are you entitled to do that?
To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?
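A toy sketch of that thought experiment, to make the "indices and lengths" idea concrete. It only knows the first 50 decimals of π (hardcoded), works on strings of decimal digits, and the function names and greedy strategy are mine, not anything from the linked project.

```python
# First 50 decimals of pi (after the "3."), hardcoded for the sketch.
PI_DECIMALS = "14159265358979323846264338327950288419716939937510"


def encode(digits: str) -> list[tuple[int, int]]:
    """Greedily describe `digits` as (index, length) pairs into PI_DECIMALS."""
    pairs = []
    pos = 0
    while pos < len(digits):
        length = len(digits) - pos
        index = -1
        # Try the longest remaining chunk first, then shrink until it is found.
        while length > 0:
            index = PI_DECIMALS.find(digits[pos:pos + length])
            if index != -1:
                break
            length -= 1
        if length == 0:
            raise ValueError(f"cannot encode {digits[pos]!r}: not a decimal digit")
        pairs.append((index, length))
        pos += length
    return pairs


def decode(pairs: list[tuple[int, int]]) -> str:
    """Rebuild the original digits by slicing them back out of PI_DECIMALS."""
    return "".join(PI_DECIMALS[index:index + length] for index, length in pairs)
```

With only 50 digits to search, most inputs fall back to one pair per digit, so the "compressed" form is larger than the input. The pairs still carry all the information needed to rebuild the data, which is the point of the question.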
They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.
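A quick illustration of the zip analogy, using Python's standard `zlib` module. The `source` bytes are a made-up stand-in for "my code".

```python
import zlib

# Stand-in for "my code": a small function repeated so it compresses well.
source = b"def is_even(n):\n    return n % 2 == 0\n" * 20

archive = zlib.compress(source)       # the "zip file"
restored = zlib.decompress(archive)   # the "sequence of math-steps"

# The archive is shorter than the source, so the source cannot be sitting
# inside it verbatim, yet decompression brings back every byte.
print(len(source), len(archive), restored == source)
```

Whether a trained model is legally comparable to an archive is exactly what is being argued in this thread; the snippet only shows that "not literally present in the file" and "fully recoverable from the file" can both be true.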
You can easily see this happen, the regurgitation of training data, in an over fitted neural net.
So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?
Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.
The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.
I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.
Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.
If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.
Yeah they can, and the whole functions that Copilot spits out are quite obviously covered by copyright.
> especially when they are used in a different context.
That doesn't matter.
*Jesus Christ*, I hope I live long enough to see copyright die. Here we are at the cusp of a new paradigm of commanding computers to do stuff for us, right at the beginning of the first AI development which actually impresses me.
And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.
I am also deeply disappointed in HackerNews; where is that deep hatred of patent trolls and smug satisfaction whenever something gets cracked or pirated now?
The value of copyleft licenses, for me, was that we were fighting back against the notion of copyright. That you couldn't sell me a product that I wasn't allowed to modify and share my modifications back with others. The right to modify and redistribute transitively through the software license gave a "virality" to software freedom.
If training a NN against a GPL licensed code "launders" away the copyleft license, isn't that a good thing for software freedom? If you can launder away a copyleft license, why couldn't you launder away a proprietary license? If training a NN is fair use, couldn't we bring proprietary software into the commons using this?
It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft. Tools like copilot seem to be an exceptionally powerful tool (perhaps more powerful than the GPL) for liberating software.
What am I missing?
I find the pattern matching and repetitive code generation really helpful. And the library autocomplete on steroids, too.
Meh. Tricky subject.
So, why should an AI be treated different here? I don't understand the argument for this.
I actually see quite some danger in this line of thinking, that there are different copyright rules for an AI compared to a human intelligence. Once you allow for such arbitrary distinction, it will get restricted more and more, much more than humans are, and that will just arbitrarily restrict the usefulness of AI, and effectively be a net negative for the whole humanity.
I think we must really fight against such undertaking, and better educate people on how Copilot actually works, such that no such misunderstanding arises.
I've noticed this a lot and it's quite funny seeing what the actual filename of the document was. Does this just get included as metadata by default when you export to PDF?
[0] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...
Specifically, sections D.4 to D.7 grant GitHub the right "to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."
This isn't exactly the same thing, but it seems to me that three of the biggest differences are:
1. Stack Overflow code is posted for people to use it (fair enough, but they do have a license that requires attribution anyway, so that's not an escape)
2. Scale (true; but is it a fundamental difference?)
3. People are paying attention in this case. Nobody is scanning my old code, or yours, but if they did, would they have a case?
I dunno. I'm more sympathetic to visual artists who have their work slurped up to be recapitulated as someone else's work via text to image models. Code, especially if it is posted publicly, doesn't feel like it needs to be guarded. I'm not saying this is correct, just saying that's my reaction, and I wonder why it's wrong.
>function isEven(n) {
> return n % 2 === 0;
>}
They then say, "Copilot’s Output, like Codex’s, is derived from existing code. Namely, sample code that appears in the online book Mastering JS, written by Valeri Karpov."
Surely everyone reading this has written that code verbatim at some point in their lives. How can they assert that this code is derived specifically from Mastering JS, or that Karpov has any copyright to that code?
Programmer/Lawyer Plaintiff + upstart SF Based Law Firm + novel technology = a good shot at a case that'll last a long time, and fertile ground to establish yourself as experts in what looks to be a heavily litigated area over the next decade+.
If Kasparov uses chess programs to be better at chess maybe we can use copilot to be better developers?
Also, anyone, either a person or a machine, is welcome to learn from the code I wrote. Actually, that is how I learnt to code, so why would I stop others from doing the same?
But the preference of the majority does not override the conditions placed by people who prefer not to participate.
So does Copilot.
I am not trying to insinuate that Copilot works like a human, but it is literally the same situation.
The AI can copy things if it wants, but it can also modify things to the point of being fair use, and it can even create new works with so little of any particular work that it's effectively creativity on the same level of humans when they draw something that popped into their heads.
> behalf of a proposed class of possibly millions of GitHub users...
The appendix includes the 11 licenses that the plaintiffs say GitHub Copilot violates: https://githubcopilotlitigation.com/pdf/1-1-github_complaint...
What's that? They don't want to do that? Why not?
Because if not I would offer the very mundane explanation that the Copilot team probably just couldn't be bothered hitting up the other software teams and jumping through 3,046 internal red tape compliance steps to make their product 0.001% better (I am pretty sure the code base of all of GH dwarfs MS code base quite a lot)
I can't believe I am actually defending fucking Microsoft, but just want to say there isn't a conspiracy everywhere...
A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong? What is the line between copying and machine learning? Where does overfitting come in?
Today they're filing a lawsuit against copilot.
Tomorrow it will be against stable diffusion or (dall-e, gpt-3 whatever)
And then eventually against Wine/Proton and emulators (are APIs copyrightable)
0) https://www.scotusblog.com/case-files/cases/andy-warhol-foun...
It seems like GitHub Copilot can spit out copyrighted works all day but the person running the text editor has to "choose" which Copilot output to actually save/commit/deploy.
Does it really matter that much "how" the text in your text editor gets there? You write it yourself or copy/paste it or have Copilot generate it. Ultimately the individual that "approved" it to be saved to the disk is the one violating the copyright, Copilot is just making a "suggestion".
Large platforms like github will just stick blanket agreements into the TOS which grant them permission (and require you indemnify them for any third party code you submit). By doing so they'll gain a monopoly on comprehensively trained AI, and the open world that doesn't have the lever of a TOS will not at all be able to compete with that.
Copilot has seemed to have some outright copying problems, presumably because it's a bit over-fit. (perhaps to work at all it must be because it's just failing to generalize enough at the current state of development) --- but I'm doubtful that this litigation could distinguish the outright copying from training in a way that doesn't substantially infringe any copyright protected right (e.g. where the AI learns the 'ideas' rather than verbatim reproducing their exact expressions).
The same goes for many other initiatives around AI training material-- e.g. people not wanting their own pictures being used to train facial recognition. Litigating won't be able to stop it but it will be able to hand the few largest quasi-monopolists like facebook, google, and microsoft a near monopoly over new AI tools when they're the only ones that can overcome the defaults set by legislation or litigation.
It's particularly bad because the spectacular data requirements and training costs already create big centralization pressures in the control of the technology. We will not be better off if we amplify these pressures further with bad legal precedents.
… & of course we again ask Microsoft's GitHub to start respecting FOSS licenses, cooperate with the community, & retract their incorrect claim that their behavior is “fair use”.
A few more links to our work on this issue:
https://sfconservancy.org/blog/2022/feb/03/github-copilot-co... https://sfconservancy.org/news/2022/feb/23/committee-ai-assi...
If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.
Abstraction-Filtration-Comparison
The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.
Abstraction
The purpose of the abstraction step is to identify which aspects of the program constitute its expression and which are the ideas. By what is commonly referred to as the idea/expression dichotomy, copyright law protects an author's expression, but not the idea behind that expression. In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program. The abstractions test was first developed by the Second Circuit for use in literary works, but in the AFC test, they outline how it might be applied to computer programs. The court identifies possible levels of abstraction that can be defined. In increasing order of abstraction; these are: individual instructions, groups of instructions organized into a "hierarchy of modules", the functions of the lowest-level modules, the functions of the higher-level modules, the "ultimate function" of the code.
Filtration
The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.
The court explains that elements dictated by efficiency are removed from consideration based on the merger doctrine which states that a form of expression that is incidental to the idea cannot be protected by copyright. In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright.
Eliminating elements dictated by external factors is an application of the scènes à faire doctrine to computer programs. The doctrine holds that elements necessary for, or standard to, expression in some particular theme cannot be protected by copyright. Elements dictated by external factors may include hardware specifications, interoperability and compatibility requirements, design standards, demands of the market being served, and standard programming techniques.
Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis.
Comparison
The final step of the AFC test is to consider the elements of the program identified in the first step and remaining after the second step, and for each of these compare the defendant's work with the plaintiff's to determine if the one is a copy of the other. In addition, the court will look at the importance of the copied portion with respect to the entire program.
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
Should have stopped there.
Not to mention, if your brain starts outputting Microsoft copyright code, they're going to sue the shit out of you and win, so I'm not sure how that would help even so.
This is not a fact.
Source?
Personally I think this has the potential to blow up in everyone's faces.
The situation with Microsoft and Copilot is the exact opposite. Here, Microsoft is misusing its acquisition of GitHub to repackage the work of individual free and open source contributors into a proprietary product in violation of the authors' software licenses. These licenses do not even require Microsoft to pay. They only require attribution and redistribution under a compatible license. Supporting Microsoft's misuse of GitHub is an anti-populist stance that puts the interests of the corporation over the interests of the individuals.
In the case of copilot, the damage suffered by the authors is close to zero. And those who benefit the most are the authors themselves. A double digit percent productivity enhancement is worth more to me than a few million $ to a trillion dollar company, especially because MS has to pay for compute.
I can't decide if people just hate Microsoft enough that a future where you must pay to include an iseven function in your code is a price worth paying to give them a bloody nose, or there is just a large contingent of users making millions off their GPL code who are put out.
More seriously yes, copilot damages copyright (or is perceived to) and that is a good outcome irrespective of the actor. I will never see eye to eye with people defending the existing legal framework.
Suppose a commercial software company just took GPL-licensed software, openly incorporated it into their own code, and then sold it. Would that "damage" copyright also? Remember, there's no legal principle that says "if we catch you violating copyright, your stuff is now free". The copyright holder can sue for damages or to stop distribution and that's it.
People like Larry Ellison of Oracle have claimed they just steal GPL'd stuff 'cause it's there. But Oracle defends its copyrighted code very aggressively. Conversely, the GPL is intended to allow more open access than public domain in a time where commercial companies want to take anything they can get.
2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit/benefit for/from this phenomenon, along with the tool creators. Especially given that the current situation is Microsoft/OpenAI selling to commercial software developers who sell to the general public.
What they are good at is predicting what's after the text. The problem of predicting what's next could be used to create a universal artificial intelligence (there's a mathematical definition for this). I.e. if you have a system which is very good at predicting what's next, you could get to very powerful AI.
If you are interested, you could read about it here: http://www.hutter1.net/ai/uaibook.htm
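To make "predicting what's after the text" concrete, here is a toy next-token predictor. It is a plain n-gram counter, not a neural network and certainly not how Codex works internally; the tokenization, function names and training snippet are all made up for the example.

```python
from collections import Counter, defaultdict


def train(tokens: list[str], order: int = 2) -> dict:
    """Count which token follows each run of `order` tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context][tokens[i + order]] += 1
    return model


def generate(model: dict, prompt: list[str], order: int = 2,
             max_tokens: int = 50) -> list[str]:
    """Repeatedly append the most frequent continuation of the last tokens."""
    out = list(prompt)
    for _ in range(max_tokens):
        context = tuple(out[-order:])
        if context not in model:
            break  # never saw this context during training
        out.append(model[context].most_common(1)[0][0])
    return out


# Train on a single snippet, then prompt with its first two tokens.
snippet = "def is_even ( n ) : return n % 2 == 0".split()
model = train(snippet)
completion = generate(model, ["def", "is_even"])
print(" ".join(completion))
```

Because this predictor has seen exactly one example, the only thing it can predict is that example, token for token. That is the extreme case of the overfitting people describe elsewhere in the thread; with more varied training data the outputs start to mix sources.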
I don't see a world where Copilot isn't fair use (at least in America).
The entire response to this suit on this site is mind-blowing to me. Everyone is up in arms that someone trained an AI model that could potentially spit out tiny, twisted fragments of public, open-source code. This response is nothing but selfish behavior that runs counter to the core principles of open source development and the free software movement.
Why can't they train on the code they own, such as Windows sources for example?
Or even better, why can't they release CoPilot itself under an open source license that is compatible with the licenses of code they would like to train on?
Also, I don't think anyone cares about the monetary aspects. The idea behind the GPL style license is to make sure that code remains free, regardless of what or who uses it. Freedom in this context refers to the ability to study the code, modify the code, and distribute any modifications. Without the GPL the code can be used in a proprietary product which strips those rights away from users of the product.
Copyleft uses copyright laws to attempt to guarantee freedom for users. This is the inverse of what normal copyright does, which is allow a single entity to sit on the ideas and not allow others to benefit from them.
If we can just strip copyleft licenses from projects, we are giving up those guarantees that GPL code will remain free for all users.
The GPL is trying to do its job here, not slow down progress. Progress would be everyone benefiting from the technology behind Copilot, rather than just Microsoft sitting on the project and selling it as a service.
I just hope Microsoft AutoPlagiarist is not the Final Solution to Free Software they have been seeking since before the millennium's turn.
Seems to me this discussion is likely to pivot on a fulcrum located between "old enough to remember Microsoft before Bill Gates began spending his ill-gotten gains on philanthropy" and "young enough to see Microsoft primarily as the Xbox people".
1) If GitHub Copilot were free software licensed under the GPL, I'd be all for it. Microsoft is using others' collective hard labour to benefit itself.
2) It's Microsoft. The king of dark patterns, monopolization, and the enemy of software freedom. They can't have their cake and eat it too.
My code is 100% in GitHub Copilot, is there any way to publicly say that I'm against the lawsuit even if they pretend to represent me?
Is it patent trolling when you are defending your future labor from being made obsolete by megacorps and singularitarians using your past labor without permission?
No, it's justice.
Copyright working in a supported/non hated way: You develop a package to do X by cribbing off someone else's package X. They sue you for stealing their work, not to make money off you. Situation at hand is case 2, hence the lack of interest in financial gain.
Why is this case 2, when it does not always reproduce the copyrighted works exactly? Situation: You realise that rather than cribbing off of one person's package X, you can crib off two other package X's and mix/average their contents. Scale this to hundreds of packages.
Eventually, ML should avoid this by developing to work from first principles, writing in its own style, with public code used only for validation of its ability to understand and write code.
I actually agree. However this is not what's happening here.
What terrible outcome will we see from a lack of copyright law?
That doesn't mean I must be in favor of every repressive innovation-stifling law that was ever cooked up.
You bring up the arts in another comment; ever considered why like half the people regarded as genuinely world-changing or geniuses (da Vinci, Galileo, Columbus, Machiavelli, Michelangelo) were born in the same two hundred years in the same region? Because the Italian renaissance was all about intense, free information-sharing! People freely visited each other's workplaces and ruthlessly stole from each other, and it was accepted. Boom, you get a period of unparalleled human productivity.
And now you want to tell me that a set of weird laws that only ever benefited Disney and Elsevier are the only thing preventing humanity from ceasing to create awesome shit? Nah man, the masses will always continue creating, exactly as proven by the fact that they did in the last decades while getting continuously butt-fucked by the very laws you pretend are made to protect them...
This is ridiculous; we created AIs that can program themselves and people are worried about incidental copyright infringement.
They did not actually calculate damages in terms of lost movie tickets or estimated versus actual sales numbers for game copies. When it came to pre-releases where such a product wouldn't have been sold legally in the first place, they simply added a multiplier to indicate that the copyright owner wouldn't have been willing to sell.
For software code, another practice I have read about is to use the man-hours that rewriting the copyrighted code would cost. Using such calculations they would likely estimate the man-hours based on the number of lines of code and multiply that by the average salary of a programmer.
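As a back-of-the-envelope sketch of that kind of estimate: the function name, the productivity figure and the hourly rate below are all invented for illustration and are not taken from any actual case.

```python
def estimate_rewrite_cost(lines_of_code: int, lines_per_hour: float,
                          hourly_rate: float) -> float:
    """Rewrite cost = (lines / lines written per hour) * hourly pay."""
    if lines_per_hour <= 0:
        raise ValueError("lines_per_hour must be positive")
    hours = lines_of_code / lines_per_hour
    return hours * hourly_rate


# Hypothetical inputs: 10,000 copied lines, 10 lines per hour, 50 per hour.
cost = estimate_rewrite_cost(10_000, lines_per_hour=10, hourly_rate=50.0)
print(cost)  # 1,000 hours at 50 per hour
```

The result scales linearly with the hourly rate, so the choice of rate moves the number a lot.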
The average salary of a programmer in which country?
So much programming is outsourced these days, and in some places programmers are very cheap.
Sometimes damages are statutory, i.e. they have a fixed dollar amount written right into the law. This lawsuit references one such law: https://www.law.cornell.edu/uscode/text/17/1203
If you have co-pilot trained on my code base (which was private), that then reproduces near replicas of my code, and then they sell it for $5/year...
Well, I'm eligible for damages.
Copying a few lines is not the same as copying the whole thing. Sharing quotes from a book is not copyright infringement.
(If it was, please tell me how, since that would save me $5/year across multiple libraries..!)
Unrelated, how is it that Mechanical Turk was never truly integrated w/ AWS?
If someone wants to use it commercially without complying with the GPL, I have no problem with allowing that, for a price.
Either use the code freely and openly, or pay me so you can make money on my code.
Copilot could conceivably allow someone to use my code commercially (and in a closed manner) without negotiating with me, the copyright holder.
Second, if copyright is being laundered away we can get increasingly clever with how we liberate proprietary software. Today, decompiling and reverse engineering is a labor intensive process. That's the whole point of "open source" - that working in source is easier than working in bytecode. Given the hockey-stick of innovation happening in AI right now, I'd be surprised if we don't see AI assisted disassembly happening in the next decade. If you can go from bytecode to source code, that unlocks a lot. Even more so if you can go from bytecode to source code and feed that into a NN to liberate the code from its original license.
What I think GP is getting at in my understanding is that all this OSS/licensing stuff was a cautious attempt to assert a radical idea into an atmosphere of extreme secrecy: That information wants to be free.
Now we have a fat corporation making a public statement of putting the value of advancing humanity over the value of honoring weird old Victorian ideas of "intellectual property" - which is what we always tried to do, no?
Not that there is nothing to criticize, but I think that's a good thing on the whole.
The point is that copyleft source code cannot be used to improve proprietary software. That limitation is enforced with copyright.
Proprietary software is closed source. You can't train your NN on it, because you can't read it in the first place.
If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain. The entire purpose of copyleft is to compel that person to "pay it forward", by publishing their code as copyleft. This is why Stallman is a proponent of copyright law. Without copyright, there is no copyleft.
Sure, there would be software with code not published, but if it was ever leaked which it often is, you could do whatever you want with it.
But in a world where copyright does exist, copyleft is a tool to fight back.
And then if we can close that loop by taking their proprietary software and feeding it into a NN to re-liberate it, isn't that a net win for software freedom?
Today, crossing the sourcecode->bytecode veil effectively obfuscates the implementation beyond most humans' ability to modify the software. Humans work best in sourcecode. Nothing says our AI overlords won't be able to work well in bytecode or take it in the other direction.
I guess what I'm saying is, today a compiler is a one-way door for software freedom. Once it goes through the compiler, we lose a lot of freedom without a massive human investment or the original source code. Maybe that door is about to become a two-way door, with copyright law supporting moving back and forth through that door?
(1) The problem with copilot is that when it blurts out code X that is arguably not under fair use (given how large and non-transformed the code segment is), copilot users have no idea who owns copyright on X, and thus they are in a legal minefield because they have no idea what the terms of licensing X are.
Copilot creates legal risk regardless of whether the licensing terms of X are copyleft or not. Many permissive licenses (MIT, BSD, etc) still require attribution (identifying who owns copyright on X), and copilot screws you out of doing that too.
(2) Whatever legal power copyleft licenses have, it is ultimately derived from copyright law, and people who take FOSS seriously know that. The point of "copyleft" licenses is to use the power of copyright law to implement "share and share alike" in an enforceable way. When your WiFi router includes info about the GPL code it uses, that's the legal power of copyright at work. The point of copyleft licenses is not to create a free-for-all by "liberating" code.
Some source code might be published but not open source licensed. At least some such code has been taken with complete disregard of their licenses and/or other legal protections, and it's impossible to find and properly map out any similar violations for the purposes of a legal response.
To spell it out: No, this analogy does not hold. "Stealing" data does not deprive the owner of anything, so it should not be treated remotely the same as physical stealing (usually not even of potential revenue, as piracy studies show).
Whether this was the original motivation depends on whom you are asking.
You may disagree, but the "Free Software" movement (RMS and the people who agree with him) essentially wants everything to be copyleft. The "Open Source" movement is probably more aligned with your views.
It's not just functions either, one of the most common things that it helps me with daily is simple stuff like this:
Typing
  const x = {
    a: 'one',
    b: 'two',
    ...
  }
And later I'll be typing
  y = [
    a['one'],
    b[' <-- it auto-completes the rest here
  ]
It's really amazing how much of the busy-work typing in programming a smart pattern matching algo could help with. Which reminds me I have to cancel my tabnine subscription. Been paying them for a year without using it.
All of that efficiency without having to pay a monthly subscription, wasting electricity on some AI model, and worrying about the legal/moral implications.
That's where the line is for it to be suspect IMO.
And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.
It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.
In those cases however the output space is so vast that plagiarism is very unlikely.
With code, not so much.
Literally 10x faster development.
Case in point: had an unexpected project and no time to complete it. Within an hour Copilot helped me:
* Write a couple of tricky matplotlib plots
* Do some extensive analysis with Pandas
* Write a couple of SQL queries
* Write a Flask back-end and deploy it
* Write a bit of a front-end
* All of this with extra comments, links to documentation and pretty reasonable style
I have experience with all of the above mentioned but the speed increase was considerable.
This would be a good day's work without Copilot and there would be less commenting and hackier code.
Before Copilot I would be cursing a lot more reading various docs...
The key thing that Copilot does is reduce the latency of your thought-action-result loop.
Does open source really suffer if fewer people read documentation directly? Would you really be less likely to create an open source library if you knew someone could now use your library at 10x speed?
The inference ability has crossed the uncanny valley so many times.
I find myself wondering whether there is a speech recognition component at times.
When teaching a lecture I will start saying something and write a prompt at the same time and the sentence produced by Copilot will be spot on what I've just said.
Ideally there would be an open source version of Copilot that respects everyone's wishes. I fear that is impossible.
However, is it reasonable to write an AI system that monitors the time and location of all license plates seen around town, puts them into a database, and then that same officer can simply put in the suspect's license plate instead of actually following them around? Maybe, maybe not, that's not my point here. But the creation of that functionality can easily lead to its abuse.
Is this exactly the same case as Copilot? Of course not, these are two wildly different systems. But I think it's an interesting parallel to consider when discussing the point of "it's okay when a human does it" because humans and algorithms operate at two very different levels of scale. The potential for abuse of the latter being far higher and far easier than something a human has to do manually.
I'm mostly talking about the statement "[Copilot] relies on unprecedented open-source software piracy". This is just wrong. It learns from open-source code, just like a human does.
Because the AI is not a human and only humans have rights, including the right to learn.
Okay then: Who counts as 'human'? What's the qualifier for being a 'human'?
------
(The following questions all point to the same underlying question.)
Are you human if you have only one leg or 8 fingers due to a genetic deformity? What about albinism or sickle cell disease?
If someone had robotic implants, are they human? Is it inhuman to have an artificial leg? What about both legs?
Same scenario as above, but both arms & legs are replaced. Are they human?
Same as above, but now everything below the torso has been replaced. Same question.
Same question, but now everything below the neck.
If someone were to successfully transplant their brains into a robot body, are they still human?
Someone embeds a neural implant into their brain: Still human?
Same question, now multiple neural implants.
Same question, but now the brain-to-implant ratio is 2:1. Brain mass & neural count hasn't changed since then.
Same question, but with the brain-to-implant ratio now 3:1.
4:1. 5:1. 6:1. 8:1. 10:1. 15:1. 20:1. 30:1. 50:1. 100:1. 200:1. 500:1. 1000:1.
The neural count now starts to decrease because of regular cell degradation. What's the percentage point before they're considered non-human?
90%? 80%? 70%? 60%? 50%? 40%? 30%? 20%? 10%? 5%? 2%? 1%? 0.5%? 0.2%? 0.1%? 0.01%? 0.001%?
------
Where is the dividing line between 'human' and 'non-human'?
I am not sure how anyone can root for AI after seeing those kinds of outputs. It's like high-school level plagiarism.
I explicitly say human-level because humans would also not be totally immune to this. It can happen that you unintentionally write the same code you have seen somewhere.
It can also even happen that you write the same code just by pure chance.
I'm talking about the statement in general, that all Copilot output is derived work. This is just wrong, as it is for a human as well.
I'm talking about the statement "[Copilot] relies on unprecedented open-source software piracy". This is just wrong. A human also relies on open-source software (and even private software) to learn, and this is not piracy.
It also says they can't sell the code, which CoPilot is doing.
Also, in a very high number of cases it isn't the author who uploads.
Repeating your line of argumentation (which occurs in every CoPilot thread) does not make it true.
If someone who isn't the author has uploaded code which they do not have a right to copy, they are liable, not Github. This is also clear from the Github Terms: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post"
It's almost as if these highly paid lawyers know what they're doing.
e.g. I can clone the GNU codebase and publish it to GitHub. Clearly I don't own the code and do not have any rights to grant GitHub a license.
This sounds unenforceable in the general case. How could github know whether someone pushes their own code or not? Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?
It depends on the licence.
It's very much enforceable that companies who provide content publishing platforms will indemnify themselves against people publishing content to which they do not have an appropriate licence.
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
Think about how absurd this is. So if Microsoft was the first company to write and publish an isEven function then no one else can legally use it?
Hey, I said the same thing about APIs, but here we are.
Edit: Actually, the Supreme Court declined to rule on whether APIs are copyrightable, but it did say that if they are, reusing them the way Google reused the Java APIs in Android would fall under fair use. Given that lower courts did think that APIs should be copyrightable, we don't know whether they are anymore.
Look at paragraphs 90 and 91 on page 27 of the complaint[1]:
"90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times."
Does distributing licensed code without attribution on a mass scale count as fair use?
If Copilot is inadvertently providing a programmer with copyrighted code, is that programmer and/or their employer responsible for copyright infringement?
There's a lot of interesting legal complications I think the courts will want to adjudicate.
[1] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...
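For what it's worth, the matching criterion in that quote (a suggestion containing a run of more than ~150 characters identical to training code) is easy to state in code. A minimal Python sketch, assuming both strings are already in hand; the function names and the plain character-level comparison are my own assumptions, not GitHub's actual method, which the quote does not describe:

```python
from difflib import SequenceMatcher

# The "~150 characters" figure comes from the quoted passage; treating it
# as a hard cutoff is an assumption made for this sketch.
THRESHOLD = 150


def longest_verbatim_match(suggestion: str, source: str) -> int:
    """Length of the longest run of characters the two strings share verbatim."""
    # autojunk=False disables difflib's heuristic that ignores very common
    # characters, so the comparison stays strictly character-for-character.
    matcher = SequenceMatcher(None, suggestion, source, autojunk=False)
    match = matcher.find_longest_match(0, len(suggestion), 0, len(source))
    return match.size


def is_flagged(suggestion: str, source: str, threshold: int = THRESHOLD) -> bool:
    """True when the shared verbatim run is longer than the threshold."""
    return longest_verbatim_match(suggestion, source) > threshold
```

A real check would have to search a whole training corpus rather than one file, and would probably normalize whitespace first; none of that is shown here.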
Ironically their Twitter account uses a screenshot from a TV series as profile picture. I wonder how legal that is, even if meant as a joke.
https://twitter.com/saverlawfirm
Edit: It's been changed 2 minutes after I wrote this comment
"Joined November 2022", following one account and no followers. It's generous to consider it a genuine account, no?
Or is your comment itself the joke?
However, if you are looking to understand the reasoning behind this lawsuit, there are lots of better examples online where Copilot blatantly ripped off open source code.
There is a reasonable argument that's a horrible system. But it doesn't make sense to criticize the plaintiff looking for a profit - the entire system has been set up such that that's what they're supposed to do. If you're angry about it lobby for either no rules or properly funded government enforcement of rules.
I don’t know man, I can simultaneously see the systemic issue that needs to be solved and also critique someone for succumbing to base urges like greed when they don’t have the need.
As an aside - I'm almost positive MSFT/Github expected this and their legal teams have been prepping for this moment. Copyright Law and Fair Use in the US is so nuanced and vague that anything created involving prior art by big-pocket individuals or corporations will be litigated swiftly.
I expected one of these lawsuits to come first from Getty or one of the big money artist estates against OpenAI or Stability.ai, but Getty and OpenAI seem to be partnering instead of litigating.
No, there are plenty of other changes you might want to see.
For example, in the American system, judges are generally not allowed to be aware of anything not mentioned by a party to the case. There is no good reason for this.
Frankly, I don't care if anyone makes a name for themselves for doing this. In fact, I applaud them and would happily give them recognition should they be successful.
Similarly, I'd hope that there are opportunities for profit in this space, given that I don't want cheap lawyers botching this case and setting terrible legal precedent for the rest of us. Microsoft has a billion dollar legal team and they will do everything they can to protect their bottom line.
Just like Google’s noble but misguided attempt to make all the world’s books searchable a few years back, what we have here is IP law getting in the way of a societal good.
Copyright and patent are not natural; they’re granted by law “to promote progress in the useful arts”. At first glance here it appears that GitHub is promoting progress and the plaintiffs are just rent-seeking.
Github can't really go to a court by themselves and ask "is this legal?". There is the concept of declaratory relief, but you need to be at least threatened with a lawsuit before that's on the table.
So Github kinda just has to try releasing CoPilot and get sued to find out. The legal system is set up to reward the lawyer who will go to bat against them to find out if it is legal. The plaintiff (and maybe the lawyer, depending on how the case is financed) takes the risk that they are wrong, just as Github had to.
It is set up this way to incentivize lawyers to protect everyone's rights.
No matter who litigates and for what reasons it will be extremely valuable for good precedents to be set around the question of things like Copilot and DALL-E with respect to copyright and ownership. I'd rather have self interested lawyers dedicated to winning their case than self interested corporations fighting this out.
Obviously this is different for the reasons you stated, but I didn’t want people to think bringing a class action lawsuit forward is a way to get rich. It’s a bit of a joke, really.
How can an aggrieved individual seek justice from a big multinational corporation? That's not possible unless that individual is a retired billionaire wanting to become a millionaire.
Yes, he does think of it somewhat like that, as establishing himself in an area. However, a lot of his work comes from him finding people aggrieved by something, not from them finding him.
But I write this to you in Hermes Maia
It might be fair to say that the read performed in training has the same character since no human is involved.
The real copyright violation would be using a derived work.
So copilot is fine but anyone using it must abide by the collective set of licenses that it used to write code for you…?
Note that even licenses like MIT ostensibly require attribution.
What made Napster illegal is that the company did not create their network for fair use of content, but to explicitly violate copyright for profit.
Copilot is like Napster in this case, in that both services launder copyrighted data and distribute it to users for profit.
Copilot is not like other P2P networks that exist to share data that is either free to distribute or can be used under the fair use doctrine. Copilot explicitly takes copyrighted content and distributes it to users in violation of licenses, that's its explicit purpose.
It's entirely possible to make a Copilot-like product that was trained on data that doesn't have restrictive licensing in the same way it's entirely possible to create a P2P network for sharing files that you have the right to share legally.
So if you produce napster 2.0 to be the best music piracy tool, and you test it for piracy, and you promote it for piracy... you're going to have trouble.
If you produce napster 2.0 as a general purpose file sharing system, let's call it a torrent client, and you can claim no ill intent... you may have trouble but it's a lot more defensible in court.
I would find it a big stretch to say Github's intent here is to illegally distribute copyrighted code. No judgment on whether the class action has any merit, just saying I would be very surprised if discovery turns up lots of emails where Github execs are saying "this is great, it'll let people steal code."
The issue isn't downloading copyrighted stuff.
Rather, it's making available and letting others download it. That was where you got in trouble.
https://wiki.winehq.org/Developer_FAQ#Who_can.27t_contribute...
Forbidding people who have seen the "source" program is most likely meant to keep their version from sliding from "matching behaviour" to "behaving like" in the sense of being the same code. It might also be intended as a safeguard so that well-intentioned developers do not accidentally break their own (most likely existing) NDAs.
Actually, we were forbidden to look at open source code at Microsoft (circa 2009) because it might influence our coding and violate licenses.
In fact, the little precedent that exists over learning from copyrightable code is in favor of it.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
If you're using code and know that it will be output in some form, just stick a license attribution in the autocomplete.
In fact, did you know this is what Apple Books does by default? Say, for example, you copy and paste a code sample from The C Programming Language. 2nd Edition. What comes out? The code you copy and pasted, plus attribution.
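To make the attribution idea concrete, here is a rough Python sketch of what "stick a license attribution in the autocomplete" could look like. Everything in it (the Snippet record, its fields, the with_attribution helper, the example.com URL) is hypothetical; it assumes the tool can trace a suggestion back to a single source at all, which is the hard part and is not solved here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """Hypothetical record pairing suggested code with its provenance."""
    code: str
    author: str
    license_id: str
    source_url: str


def with_attribution(snippet: Snippet, comment_prefix: str = "#") -> str:
    """Prepend a one-line attribution comment to the suggested code."""
    header = (
        f"{comment_prefix} Source: {snippet.source_url} "
        f"(c) {snippet.author}, {snippet.license_id}"
    )
    return header + "\n" + snippet.code


# Made-up example values, for illustration only.
example = Snippet(
    code="def is_even(n):\n    return n % 2 == 0",
    author="Jane Doe",
    license_id="MIT",
    source_url="https://example.com/repo",
)
print(with_attribution(example))
```

This mirrors the copy-and-paste behaviour described for Apple Books: the code comes out together with a note saying where it came from.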
If a human programmer reads someone else's copyrighted code, OSS or otherwise, memorizes it and later reproduces it verbatim or nearly so, that is copyright infringement. If it weren't, copyright would be meaningless.
The argument, so far as I understand it, is that Copilot is essentially a compressed copy of some or all of the repositories it was trained on. The idea that Copilot is "learning from" and transforming its training corpus seems, to me, like a fiction that has been created to excuse the copyright infringement. I guess we will have to see how it plays out in court.
As a non-lawyer it seems to me that stable diffusion is also on pretty shaky ground.
APIs are not copyrightable (in the US), so Wine is safe (in the US).
Let me tell you the story of Google Books, also known as "Authors Guild Inc. v. Google Inc"
https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
In 2004, Google added copyrighted books to its Google Books search engine, which searches the text of millions of books and shows full-page results without any author's authorization. Any sane lawyer of the time would have bet on this being illegal because, well, it most certainly was. And you may be shocked to learn that it actually is not.
In 2005 the Authors Guild sued over this pretty straightforward copyright violation.
Now an important part of the story: IT TOOK 10 YEARS FOR THE JUDGEMENT TO BE DECIDED (8 years + 2 years appeal), during which, well, tech continued its little stroll. Ten years is a lot in the web world; it is even more for ML.
The judgement decided that Google's use of the books was fair use. Why? Not because of the law, silly. A common error we geeks make is to believe that the law is like code and that it is an invincible argument in court. No, the court was impressed by the array of people who were supporting Google, calling it an invaluable tool to find books that actually caused many sales to increase; therefore the harm the laws were trying to prevent was not happening, while a lot of good came from it.
Now the second important part of the story: MOST OF THESE USEFUL USES HAPPENED AFTER THE LITIGATION STARTED. That's the kind of crazy world we are living in: the laws are badly designed and badly enforced, so the way to get around them is to disregard them for the greater good, and hope the court is not competent enough to be fast, but not so incompetent that it fails to understand the bigger picture.
Rants aside, I doubt training data use will be considered copyright infringement if the courts have a mindset similar to the one they had in 2005-2015. Copyright laws were designed to preserve the authors' right to profit from copies of their work, not to give them absolute control over every possible use of every copy ever made.
Quite sure the issue at hand is about the code being copied verbatim without the license terms, not "learning" from it.
You can learn from it, but if you start copying snippets or base your code on it to such an extent that it's clear your work is based on it, things start to get risky.
For comparison, people have tried to get around copyright of photos by hiring an illustrator to "draw" the photo, which doesn't work legally. This situation seems similar.
What is the difference between a neighbor watching you leave your home to visit the local grocery store and mass surveillance? Where do you draw the line?
It is pretty simple, actually.
The reason why those wouldn't apply to Copilot is because they aren't separating out APIs from implementation and just implementing what they need for the goal of compatibility or "programmer convenience". AI takes the whole work and shreds it in a blender in the hopes of creating something new. The hope of the AI community is that the fair use argument is more like Authors Guild v. Google rather than Sony v. Connectix.
> Tomorrow it will be against stable diffusion or (dall-e, gpt-3 whatever)
> And then eventually against Wine/Proton and emulators (are APIs copyrightable)
Textbook definition of F.U.D.
No it isn't, at least not automatically which is why infringement of licenses exists at all, the fact that you have a brain doesn't change that and never has. If you reproduce someone's code you can be in hot water, and that should be the case for an operator of a machine.
It's also why the concept of a clean room implementation exists at all.
My (extremely amateur) understanding is that what is meant by "learn from it" is one of the hinge points of the legal question.
If a programmer reads licensed code and reproduces it verbatim or near-verbatim in a project with a conflicting license, that becomes a legal problem in certain circumstances.
If a programmer reads the same code and gets an idea to implement something different, that's less troublesome (or at least, if it is troublesome it's in a different area; if the idea was related to a patentable process, then other questions arise, but I'm even less qualified to speak to that area of law).
There's nothing special about copy/paste buttons that make them the only way you can infringe copyright.
Fair use doesn't automatically kick in just because someone uses what they took/copied as part of a larger artifact; it's a really complicated legal line.
Edit: I guess they do address it in their faq and I'd summarize it "Depends if copyright law applies and depends if it's considered derivative". https://creativecommons.org/faq/#artificial-intelligence-and...
The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
The nature of the copyrighted work
The amount and substantiality of the portion used in relation to the copyrighted work as a whole
The effect of the use upon the potential market for or value of the copyrighted work.
A programmer who studied in school and learned to code clearly did so for an educational purpose. The nature of the work is primarily facts and ideas, while expression and fixation are generally not what the school is focusing on (obviously some copying of style and implementation could occur). The amount and substantiality of the original works is likely to be so minor as to be unrecognizable, and the effect of the use upon the potential market when students learn from existing works would be very hard to measure (if it could be detected at all).
When a machine does this, are we going to give the same answers? Its purpose is explicitly commercial. Machines operate on expression and fixation, and the operators can't extract the idea that a model should have learned in order to explain how a given output is generated. Machines make no distinction about the amount and substantiality of the original works, with no ability to argue for how they intentionally limited their use of the original work. And finally, GitHub Copilot and other tools like it do not consider the potential market of the infringed work.
APIs are generally covered by the interoperability exception. I am unsure how that is related to Copilot or DALL-E (and the like). In the Oracle v. Google case the court also found that the API in question was neither an expression nor a fixation of an idea. A Copilot that only generated header code could in theory be more likely to fall within fair use, but then the scope of the project would be tiny compared to what exists now.
Just because both activities are called "learning" does not mean they are the same thing. They are fundamentally, physically different activities.
Remember when Napster was all the rage? And then Jobs and Apple stepped in and set an expectation for the value of a song (at 99 cents)? And that made music into the razor and the iPod the much more profitable blades. Sure, it pushed back Napster, but artists, as the creators of the goods, have yet to recover.
I'm not saying this is the same thing. It's not. Only noting that today's "win" is tomorrow's loss. This very well could be a case of be careful what you wish for.
If I own a repository on github and I have received contributions from other people, or included a .h file from mpv (thing that I have done), do I still have the right to click the opt-in button? I didn't ask the other contributors.
But github is in a position to scan my code and see if there are copy paste bits and disable the opt-in button in that case.
Except they act in bad faith so they wouldn't do that.
Algorithms can't be patented or copyrighted, as they are pure mathematics. If an implementation of an algorithm has no creative content because it is succinct then it likely doesn't deserve copyright.
A program gets written by an entity (usually a person) and is executed to generate the desired output according to a deterministic mathematical function it expresses. A training algorithm is a program that gets written to train a model (the model being the “AI Program”) when presented with some training data inputs, so that the model implements a function that is not the training algorithm's own function, but another one, generalising over a problem domain beyond just the original examples fed to the training algorithm.
The output model is not the training algorithm or the training data (or an encoding of it) and exists as its own artefact, independent of both.
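The distinction drawn above between the training algorithm, the training data and the resulting model can be shown with a toy example. This is a minimal Python sketch that uses ordinary least-squares line fitting as a stand-in for training; that choice is an assumption made for brevity, and it says nothing about how much of its input a large neural network retains, which is the point actually in dispute.

```python
from typing import Callable, List, Tuple


def train(examples: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Training algorithm: fit y = slope * x + intercept by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    def model(x: float) -> float:
        # The model closes over two fitted numbers, not the example list.
        return slope * x + intercept

    return model


# Three training examples that happen to lie on the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
model = train(data)

# The model answers for an input that never appeared in the training data.
print(model(10.0))
```

Here train is the training algorithm, data is the training set, and model is the separate artefact that comes out. In this toy case the model holds only a slope and an intercept; a model with billions of parameters has far more room to store its inputs.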
MIT License:
Copyright <YEAR> <COPYRIGHT HOLDER>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
-Some universe out there, with a god smarter and better than ours-
Microsoft can even continue to sell Copilot as a service while keeping it license-compliant, since most developers are not going to self-host the entire dataset. Microsoft can also choose to exclude copyleft-licensed code from Copilot or create multiple flavors of Copilot, each licensed differently. You can get your "productivity enhancement" without needing Microsoft to violate software licenses.
The damage is not in the monetary payment denied to free and open source software contributors, payment these contributors never demanded. The damage is in Microsoft violating other people's software licenses to create a proprietary product derived from copyleft-licensed and attribution-required code, and in Microsoft encouraging other developers to violate these licenses. Microsoft needs to rectify these violations with specific performance.
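The idea above of excluding copyleft code, or of shipping separately licensed flavors, amounts to partitioning the training corpus by license before training. A minimal Python sketch; the Repo record, the bucket names and the two license groupings are illustrative assumptions (the identifiers are SPDX-style), not a legal classification and not anything Microsoft has described:

```python
from typing import Dict, Iterable, List, NamedTuple


class Repo(NamedTuple):
    """Hypothetical corpus entry: a repository and its declared license."""
    name: str
    license_id: str  # SPDX-style identifier, e.g. "MIT"


# Illustrative groupings only; a real classification needs legal review.
PERMISSIVE = {"MIT", "BSD-3-Clause", "Apache-2.0"}
COPYLEFT = {"GPL-3.0-only", "LGPL-3.0-only", "AGPL-3.0-only"}


def split_by_license(repos: Iterable[Repo]) -> Dict[str, List[Repo]]:
    """Sort repositories into one training bucket per license family."""
    buckets: Dict[str, List[Repo]] = {
        "permissive": [],
        "copyleft": [],
        "excluded": [],  # unknown or unlisted licenses are left out entirely
    }
    for repo in repos:
        if repo.license_id in PERMISSIVE:
            buckets["permissive"].append(repo)
        elif repo.license_id in COPYLEFT:
            buckets["copyleft"].append(repo)
        else:
            buckets["excluded"].append(repo)
    return buckets


corpus = [Repo("a", "MIT"), Repo("b", "GPL-3.0-only"), Repo("c", "UNKNOWN")]
flavors = split_by_license(corpus)
```

Each bucket would then feed its own model, whose output carries that bucket's obligations. Attribution-required licenses such as MIT would still need the notice preserved somewhere, which this sketch does not handle.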
The intelligence of human beings isn't unspecifically good "predicting what's next" but rather is good at particular sorts of predictions in particular contexts, often involving the person having helped create the situation. I'm fairly safe at driving because I maintain an arrange of my vehicle in a fashion that allows me to predict easily what's next as well as allowing me to adjust if my predictions are wrong. Self-driving software might predict what's next as well as me in normal circumstances but it's neither aware of larger context nor does it things to maintain "smooth traffic flow".
Opposite, being able to predict anything generically would certainly be limitless intelligence but you can't describe any system with just that. Copilot is trained with a certain window, with the transforms special element giving more context but I don't think very many people doing current research expects that become generic prediction. I think I'm describing the consensus that it's a "better Google" for finding code one can use - and Google is a pretty good resource for this - if you aren't doing something unusual or difficult.
Jeff Hawking also makes "prediction is intelligence" claim but I think your and his approach misses that human intelligence is good not by being generic but doing more specific things.
If the court wanted to distinguish between Microsoft using their own programmers to generate code vs taking code from GitHub users, then the salary in question would likely be that of Microsoft programmers. It would then be used to illustrate what a legal training dataset would look like compared to an illegal one.
Information may want to be free, but users of free information often want to enrich their private endeavors by shackling the information that was given to them freely.
The (A|L)GPL acknowledges the fact that some people and corporations like to use free-and-gratis work in their products and not reciprocate the courtesy shown to them by the authors of that work. (I choose the (L)GPL whenever I can so that folks who derive from my work are required to either make it available as I have, or pay me enough so that I don't mind them shackling my work.)
The BSD license acknowledges the fact that some people and many corporations like to use gratis work in their closed-source products and never even do so much as bother to credit the authors of work that they used.
For as long as powerful folks continue to use and improve upon gratis information and software without contributing the products that used that information and/or improvements, the 'weird old Victorian ideas of "intellectual property"' are going to have to continue to be dealt with. Remember... you likely cannot reasonably afford an army of lawyers to ensure that pretty much no one uses your work without paying you, but big companies like Microsoft, RedHat, IBM, Oracle, etc, etc, and wealthy individuals can.
For as long as those wealthy entities can lock up and force you to pay for their work and ideas, but make it ruinously expensive for us little people to -individually- do the same to them, we'll need "weird old Victorian" things like licenses to help correct this imbalance of power.
In this answer, you're completely ignoring the massive fact that we cannot create a human brain. Having mathematical models about particles does not mean we have "solved" the brain. Unless you also believe that these LLMs are actually behaving just like human brains, in that they have consciousness, they have logic, they dream, they have nightmares, they produce emotions such as fear, love, anger, that they grow and change over time, that they control a body, lungs, heart, etc...
You see my point, right? Surely you see that the statement 'The brain is also just a "complex math program"' is at best extremely over-simplistic.
There is a gaping chasm between observing known physics, and saying it is the cause of consciousness.
You should read this: https://en.wikipedia.org/wiki/Philosophy_of_mind
[ Edit: better link: https://en.wikipedia.org/wiki/Hard_problem_of_consciousness ]
Copyright laws, if enforced perfectly, would make programming simply impossible. We've been skating by on people not really enforcing them, despite the laws still being on the books, and the existence of tools like this makes that not a viable strategy. Today it's Copilot, which can be shut down, but tomorrow it'll be something developers can run at home. Bits don't have colour; there's no way to distinguish between a copy happening by independent recreation, and one that's actually a copy. So we'll need proper rulings.
In fact, considering Fauxpilot, that will happen as soon as the models have improved somewhat.
*: Of course I don't think "independent recreation" is really a thing. Humans are excellent at open source laundering. It's called "learning".
Take a look at "judicial notice" and "amicus curiae".
Citation needed. I've never plagiarised on purpose, sure, but I've caught myself at least several dozen times well after the act.
I think if another algorithm was used instead of ML that did the same job as Copilot, then people would be making the same arguments. I think it's just the case that ML is just the first tech capable of doing what Copilot is doing.
Or maybe it would be enough to just zip your image, to be allowed to distribute it? In the end the bytes I would distribute then "would be so different than your original image that you wouldn't have any claim to them at all", right?
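To make the zip point concrete, here is a minimal sketch in Python using the standard `zlib` module. The byte string is a made-up stand-in for an image file, not real image data. It only shows that lossless compression produces different bytes from which the original is recovered exactly. It says nothing about how a neural network stores information.

```python
import zlib

# Hypothetical stand-in for the bytes of an image file.
original = b"pretend these bytes are an image " * 100

# The compressed bytes look nothing like the original...
compressed = zlib.compress(original)

# ...yet the original can be recovered bit for bit.
restored = zlib.decompress(compressed)

print(compressed != original)   # the distributed bytes differ
print(restored == original)     # but nothing was lost
```

So "the bytes are different" cannot, on its own, be what decides whether a copy was distributed.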
Is your claim that no algorithm can be transformative?
This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:
- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.
- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.
But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.
Government enforcement of this kind of law is really no different. It wouldn't be the legislature doing it.
AKA a vast majority of the non-legislative government workers.
People used to get busted from buying bootleg VHS and DVDs on the street before P2P filesharing was a common thing. Then, early on, people were sued for downloading copyrighted files before rightsholders decided to take a different legal strategy to go after sharers and bootleggers.
Like if the drawing was meant to be an artistic rendering with independent artistic value, much more likely to be fair use. If the drawing was meant to be a loop-hole to avoid paying the licensing fee on the original, it's much less likely. Fair use has a bunch of criteria - a lot of it depends on intention and how the usage would affect the original copyright holder.
I would add that fair use lets you use a copyrighted work, it doesn't make the copyright go away, just adds some cases where you can use the work notwithstanding the original copyright, but the original copyright is still there.
Note: IANAL, this all could be wrong. I don't have any cases, I do know that people propose this sort of thing at Wikipedia from time to time - i.e. hiring someone to draw copyrighted photos - and it usually gets shot down as not solving the problem, although I'm not familiar with the legal basis.
If "I took your code and trained an AI that then generated your code" is a legal defense, the GPL and similar licenses all become moot.
https://docs.github.com/en/site-policy/github-terms/github-t...
"You grant us and our legal successors the right to store, archive, parse, and display Your Content"
Copilot displays content. Case closed.
"This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content"
Using clips from a movie in a movie review is probably fair use.
Using clips from a movie in knock-off of that movie for profit? Probably not fair use if it's not a parody.
Copilot is not like a movie reviewer using clips to review a movie. Copilot is like a production team for a movie taking clips from another movie to make a ripoff of that movie and selling it.
Consider every repo on github to be a movie. Copilot is taking individual frames out of every movie on github and compositing them into a new film.
I think most of us would agree that individually, each frame is copyrighted. But what if you take one frame from a million different movies and put them in an order that produces a new coherent movie?
The core question we need to settle in court is: does the new movie become its own copyrightable work, or is it plagiarism?
https://en.wikipedia.org/wiki/Sampling_(music)#Legal_and_eth...
I.e. any use without permission is illegal.
Copilot is fair use and transformative -- that is unless there is an open source Copilot that Copilot is training on, only then would it be competing and it's easy for GitHub or OpenAI to exclude those repos of copilot alternatives from the training set.
I can't think of a 5 line snippet I've written or read that makes sense to claim ownership of. They don't stand on their own in the way even a 30s movie clip does.
It is if I take those quotes and publish them as my own in my own book.
Nevertheless, stealing remains illegal so at the very least they have deprived the source code owners of their rights.
Code which anybody can view is called "source available". You aren't necessarily allowed to use the code, but some companies will let their customers see what is going on so they can better integrate the code, understand performance implications, debug and fix unexpected issues, etc. The customers would probably face significant legal risks if they took that code and started to sell it.
"Open source" code implies permission to re-use the code, but there is still some nuance. Some open-source licenses come with almost no restrictions, but others include limiting clauses. The GPL, for example, is "viral": anybody who uses GPL code in a project must also provide that project's source code on request.
What do you think the chances are that Microsoft would surrender the Copilot codebase upon receipt of a GPL request?
Almost everything on GitHub is subject to copyright, except for some very old works (maybe something written by Ada Lovelace?), and US government works not eligible for copyright.
Now, many of the works there are also licensed under permissive licenses, but that is only a defense to copyright infringement if the terms of those licenses are being adequately fulfilled.
Agreed. Like I said, it's about intent. Can anyone say with a straight face that copilot is an elaborate scheme to profit by duplicating copyrighted work?
I don't think the defense is that it wasn't trained on copyrighted data. It obviously was.
I think the defense is that anything, including a person, that learns from a large corpus of copyrighted data will sometimes produce verbatim snippets that reflect their training data.
So when it comes to copyright infringement, are we moving the goalposts to where merely learning from copyrighted material is already infringement? I'm not sure I want to go there.
If I’m sharing my code publicly, it’s because I want it to be _used_.
But training an AI model on media (code or otherwise) is not copyright infringement, so the license is irrelevant.
It's selfish to pretend otherwise and to try to assert a copyright right that doesn't exist, for the purpose of impeding progress in a field that benefits us all.
>the right to exclude others from making certain uses of the work: copying it, making a derivative work based on that work, distributing copies of the work to the public, and publicly performing or displaying the work.
So why would "training" "AI" on code with the intention of emitting derived works not be copyright infringement exactly?
This product is transforming copyrighted code into something that's intended to be used or sold in other works. The snippets it emits are directly derived from copyrighted code.
The most common argument against this is that humans also learn from copyrighted material. My argument against this is that CoPilot is not a human and should not be assumed to inherit rules intended for humans.
>in a field that benefits us all
As it stands currently CoPilot is proprietary and does not benefit anyone except for MicroSoft. If CoPilot was released under a FOSS license it would actually benefit us all. Most of the people against CoPilot are not against AI, but rather a proprietary AI product transforming FOSS work into other potentially proprietary works with the intention of profiting off of the completion service and hoarding the code that powers it.
The fact that you write something, doesn't automatically make that thing true.
Some uses of AI were ruled not to be infringement. This is a different case which requires a different ruling.
Well, maybe. But even if we assume that this is true, when anyone later uses the AI to reproduce a copy of the code, a copy has been made and copyright has been infringed.
If I need a code to loop over 10 lines, I'll code a for loop the same way regardless of what I'm developing.
Define for me, at what point of complexity does code get copyrighted?
The things copilot is outputting are literally small chunks of code that need a lot of cleanup afterwards. It's not like I type "Build twitter for me" and BAM, I got a working clone of twitter.
I don't care if people make money using GPL'd code, but I do care if they take the code and strip the license so they can use it in non-free projects.
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
"Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis."
That code is specifically optimized for efficiency and there were similar approaches floating (get it?) around in the 1980s.
On the other hand, Microsoft may only need to show "Hey, we got this code from FooBar under this license and this license and ..."
Oracle got copyright on API signatures…
In civil law there is a bar to protection if the work lacks "substantial" creativity. But even this bar is extremely low. More or less everything besides maybe simple math formulas is protected.
The court did not even question any copyright, it just assumed the APIs are copyrighted by Oracle. Then it looked for reasons why copying the APIs could possibly be fair use…
By the skin of their teeth they found some very involved and case specific reasons why Google's use of the copyrighted APIs was, after all, fair use.
https://www.bhfs.com/insights/alerts-articles/2021/supreme-c...
With current technology, the only licensing model we can offer is "give us your training set example, we'll chuck a few pennies at you out of credit sales and nothing more". We can't even comply with CC-BY because the model can't determine who to attribute to.
Using code, photographs, documents, or other material to train a model isn't copyright infringement. The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
The physics that gives rise to the brain is pretty much known. We can model all the protons, electrons and photons incredibly accurately. It's an extraordinary claim to say the brain doesn't function according to these known mechanisms.
As a note the same applies to logos. Very simple logos that are only some lines and shapes, do not have copyright (in usa)
https://hyperallergic.com/766241/hes-bigger-than-picasso-on-...
The interesting thing is that the names get explicitly attached to these styles. It isn't exactly a copyright issue, but I'm sure it will get litigated regardless.
Telling apart what's public domain or not is not a trivially automatable task.
If one just relies on curated libraries of vetted public domain content you don't get, by far, the expected amount of variability and diversity.
As far as “citation needed”, humans are being convicted for plagiarism, so it is generally assumed that they are able to tell and hence can be held responsible for it.
Responsibility or liability is really the crux here. As long as AIs can’t be made liable for their actions (output) like humans or legal entities can, instead the AI operators must be held accountable, and it’s arguably their responsibility to take all practical measures to prevent their AIs from plagiarizing, or from otherwise violating license terms.
Personally I believe it’s likely that the brain can essentially be reduced to a computation, but we have no proof of that.
If all you have is a hammer...
The nature of consciousness is an open question. We don't know whether the brain is equivalent to a Turing machine.
We can't even accurately model a receptor protein on a cell or the binding of its ligands, nor can we accurately simulate a single neuron.
This is one of those hard problems in computing and medicine. It is very much an open question about how or if we can model complex biology accurately like that.
You are saying "If we know how something works, we can explain how it works using math."
But we know almost nothing about how the brain works.
> The physics that gives rise to the brain is pretty much known.
...no it is not! No physicist would describe any physical phenomenon as being "pretty much known". Let alone cognition. We don't even have a complete atomic model.
Does the brain fall into the category of “understood natural phenomenon”? Is it “understood”? What does “understood” mean in this context?
Why? Burden of proof is on you.
You have it reversed. Math is a language tool to describe things, in a limited fashion (our current modeling). One is physical matter (even if it's antimatter). If you believe that there will be a language that can describe anything, it still doesn't manifest matter by speaking that language or describing it...unless you're into magic or spirits or whatever.
This disconnect has nothing to do with how well we do or do not understand physical phenomena. I think what the OP meant to say (and probably you support) is how the "mind" or how we think, can be described with mathematical models. Maybe one day we will have a full understanding, but we're not there yet and not currently in a way that is legally compelling.
Chill dude, all they have to do is include the licenses on their generated code.
If anything, this is going to generate even more progress. The copilot team would have to create some kind of feature that would connect the generated output to the relevant training data. That'd be pretty incredible to see in the field of AI/ML in general.
Copilot losing the lawsuit is evidence it’s a case of overfitting, not true ML.
Of course, end user could just strip the license/attribution off their generated output, but that's a different story.
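As a very rough illustration of what "connecting generated output to training data" could mean, here is a toy sketch in Python. It is not how Copilot or any GitHub feature works. The repo names, licenses, and the functions `build_index` and `find_sources` are all hypothetical, and matching whitespace-separated token n-grams is just the simplest thing that could possibly work.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus, n=4):
    """Map each token n-gram to the set of sources it appears in."""
    index = defaultdict(set)
    for source, code in corpus.items():
        for gram in ngrams(code.split(), n):
            index[gram].add(source)
    return index

def find_sources(index, generated, n=4):
    """Count how many n-grams of the generated text occur in each source."""
    hits = defaultdict(int)
    for gram in ngrams(generated.split(), n):
        for source in index.get(gram, ()):
            hits[source] += 1
    return dict(hits)

# Hypothetical training corpus: source label -> code text.
corpus = {
    "repo_a (GPL-2.0)": "float y = number ; i = magic - ( i >> 1 ) ;",
    "repo_b (MIT)": "for i in range ( 10 ) : print ( i )",
}
index = build_index(corpus)
matches = find_sources(index, "x = 1 ; for i in range ( 10 ) : pass")
print(matches)
```

A tool like this could at least tell the user which license notices might be relevant to a suggestion. It would only catch near-verbatim overlap, not code that has been renamed or restructured.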
I think we should just relax copyright, it's dying anyway. Language models allow people to borrow skills learned from other people, and solve tasks. That's huge. Like Matrix, loading up a new skill at the press of a button. Can we give up such a huge advantage in order to protect copyright?
I think the notion of copyright has been under attack already for 2 decades by the internet, search engine and social networks. They all work against it, and AI does it even more. It just encapsulates the whole culture in a box, mixing everything up, all copyrights melting away, everything one prompt away. This could be a new medium of propagation for ideas. No longer limited to human brains and books, they can now propagate through language models more efficiently.
Otherwise they would create the ultimate "copyright laundry machine".[1]
I'm very sure at least Hollywood and the big music labels would not like that… ;-)
Copilot's corpus is quite literally tomes of copyrighted work that are encoded and compressed in its neural network, from which it launders that work to create similar works. Copilot itself, the neural network, is that corpus of encoded and compressed information, you can't separate the two. Copilot stores and distributes that work without any input from rightsholders, and it does it for profit.
A better analogy would be between a browser and a file server filled with copyrighted movies whose operator charges $10/mo for access. The browser is just a browser in this analogy, where the file server is the corpus that forms Copilot itself.
If you think this way, hashing is a copyright violation.
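For what it's worth, the hashing comparison can be checked with a short Python sketch using the standard `hashlib` module. The two inputs are made-up placeholders. A cryptographic hash gives a fixed-size digest no matter how large the input is, so in general the work cannot be reconstructed from it.

```python
import hashlib

# Two hypothetical "works" of very different sizes.
short_work = b"a haiku"
long_work = b"an entire novel " * 10_000

short_digest = hashlib.sha256(short_work).digest()
long_digest = hashlib.sha256(long_work).digest()

# Both digests are 32 bytes, regardless of the input size.
print(len(short_digest), len(long_digest))
```

Whether a trained model is closer to a hash or to a compressed archive is exactly what the two sides of this thread disagree about.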
when someone uploads their copyrighted text to a web page they are distributing it to whoever visits that page. the browser is just the medium.
going from
a: 'one',
to a['one'],
just requires you to add two brackets and remove the colon. With multiple cursors you can do that exact same operation for all lines in a few keystrokes. But what's lost in my oversimplified example is that the context is usually way more involved. I'm usually passing those as arguments to some function or other unique syntax situation that a glorified find and replace can solve. It's all about doing it in the times you would never even bother writing a custom command because typing is faster given the unique syntactical context... The only thing faster then is autocomplete.
I'm not actually recreating a new hash with the convenient same format.
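For the simple case above, the "glorified find and replace" might look something like this in Python. The function name `to_index_syntax` and the exact pattern are my own assumptions. It only handles lines shaped exactly like `a: 'one',`.

```python
import re

# Matches lines like "a: 'one'," and captures indentation, key and quoted value.
PAIR = re.compile(r"^([ \t]*)(\w+):[ \t]*('[^']*'),", re.MULTILINE)

def to_index_syntax(text: str) -> str:
    """Rewrite each "key: 'value'," line as "key['value'],"."""
    return PAIR.sub(r"\1\2[\3],", text)

print(to_index_syntax("a: 'one',\nb: 'two',"))
```

This works for the tidy case. Once the surrounding syntax varies from line to line, writing the pattern takes longer than just typing, which is the situation where autocomplete is being claimed to win.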
This is automated and happens immediately without you even thinking about it.
You only ever pull out the complicated Vim editing when you have a particular hard task, I’m talking about the small stuff many times a day.
The GPL is entirely dependent on copyright. Rather than pretend copyright doesn't exist, the GPL turns it in the other direction. By violating the GPL, Copilot is still violating copyright.
> We need the legal right to do things like host Your Content, publish it, and share it
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
If Copilot is straight-up reproducing work, and it is a service that users have to pay to use, then it seems like Copilot is "sell[ing] your content" and thus the license does not apply.
More generally, a court is likely to look at the plain English summary and judge. Copilot is not an integral part of "the service" as developers understood it before Copilot existed.
Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neural net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.
when you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. the methods used for serving are not specified. This is above and beyond the license specified by the license you choose for your own code.
you also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.
Whether the results of these programs is somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.
the human at the keyboard is responsible for what goes into the source code being written.
to aid copilot users here, they are creating tools to give users more info about the code they are seeing: https://github.blog/2022-11-01-preview-referencing-public-co...
It's an html file containing both the licensed code and some other html
"4. License Grant to Us
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."
https://docs.github.com/en/site-policy/github-terms/github-t...
I don't think these terms allow using content for Copilot.
This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.
Just trying to demonstrate a point- this analogy seems flawed.
Now, while you may be able to get it to reproduce one function, reproducing one file, and definitely the whole repository, seems extremely unlikely.
It can also be modified to be opt-in-only (only peoples' code that they permit to be learned on, can use the product)
Could be, but isn’t. And that matters.
I assume they want some kind of broad relief, such as an injunction to take down copilot. They are not going to get it, they are not going to get anything at all, if they can’t even provide examples of violating code.
There’s not even a single mention of any established legal doctrines around copyright and software, such as abstract-filter-compare, idea-expression dichotomy, etc.
This is mostly because the means to copy requires little effort compared to the act of creating. So, there is no incentive to create because you wouldn't make a living out of it. Imagine spending two years writing a book and someone buys one and copies it to sell at 25%. He can make a profit at a lower threshold than you, so you as a creator cannot compete.
To be fair, from my experience most countries don't have much of a movie scene even with copyright and instead mostly import hollywood stuff.
> This is mostly because the means to copy requires little effort compared to the act of creating. So, there is no incentive to create because you wouldn't make a living out of it. Imagine spending two years writing a book and someone buys one and copies it to sell at 25%. He can make a profit at a lower threshold than you, so you as a creator cannot compete.
So don't compete by selling copies but by funding the creation up front. No one is claiming that abolishing copyright won't be disruptive to any existing business models - in fact, that's the point: once something becomes part of our shared culture it is ridiculous to let one entity continue to have exclusive rights so if your business model relies on continued royalties, find a better one.
Otherwise, perhaps consider continued payments to everyone who built your house, computer and whatever else you use if you think that is a great way for society to function. Don't worry, the way things are going we might get there via technical means anyway.
But what do you mean, no incentive to create?
The Tao Te Ching was reluctantly written after the author was begged by his pupils. Most Greek philosophers' teachings were only written down after their death because other people thought that's an important job. On The Origin Of Species is a book because that was just the normal way to communicate scientific findings in Darwin's time. Da Vinci saw some fat commissions in his life, but Mona Lisa certainly never brought him any money. In fact, out of my twenty favorite artists maybe two saw anything approaching fame in their lifetime.
Please, go to some random DeviantArt page or Spotify profile or GitHub repo with 3 views and tell me why it exists when the only reason for human creation is dollars and red carpets...what a sad perspective, really
That's true even in USA (with strong copyrights), apart from for top .1%
More generally, what is left to protect any creative work besides guarding physical access? Why would any company make any movie or tv show if it could be copied and redistributed by others endlessly the moment it gets shown once?
There have been creative endeavours before copyright and there would be creative endeavours after copyright. Perhaps even more since people are free to remix and share without restrictions.
Please do name one industry, niche or platform where copyright does actually prevent this from happening in any meaningful way today.
Obviously licensing needs to be respected and it shouldn’t be hard to solve that problem. But 99.9% of code isn’t some unique algorithm, it’s gluing libraries and setting up basic structures.
Most of the examples I’ve seen don't line up with the reality of code completion tools. Code is rarely valuable when broken up into its small parts.
Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.
You just described open source software.
That's the whole heart of this lawsuit, and equally Copilot. It was trained on OSS which is explicitly licensed for free use.
> It was trained on OSS which is explicitly licensed for free use.
That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses have specific requirements and restrictions for using them, and Copilot explicitly ignores those requirements, thus violating the license agreement.
The GPL, for example, requires you to release your own source code if you use it in a publicly-released product. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.
Also, re: your edit, not quite. They require you to release modified source under certain conditions if you make modifications to it. If everybody had to release code using GPL to the world, every company's code would currently be released to the world. There's more nuance than that. The GNU site covers a lot of that nuance (https://www.gnu.org/licenses/gpl-faq.en.html#UnreleasedMods)
AGPL is the one that enterprises won't touch with a 10 foot pole, due to more restrictive licensing, and more conditions under which you'd have to open source your own code.
The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code (without the license).
Most open-source software is not licensed for free use. MIT and GPL, the two most common licenses, both require attribution.
From where? They aren't publishing it. That's literally the meaning of proprietary.
That’s literally not the definition of proprietary.
You download proprietary software when you navigate to (nearly) every webpage. Just because a website like HN sends you (possibly unobfuscated) HTML, CSS, and JS over the wire in plain-text does not mean those files are not proprietary. Those files are covered by copyright in the U.S.
Access to the source code is not sufficient for that source code to be FOSS.
You also failed to acknowledge leaked source code and bytecode decompilation, which were a substantial portion of my comment.
Colloquially, "proprietary software" means closed-source. You can definitely put it in context where it means "copyright without license"; but outside that context, the colloquial meaning is enough.
The essence of the algorithm takes 4 lines: function declaration, declaration of 'y', one line for calculating the exponent in log-space, one line for returning the root finding.
The rest is fluff. Every line of the snippet has creative input with the chosen names ('threehalfs' for 1.5F), the order of declarations and instructions, the redundancy. There have been internet-wars around indentations and newlines, these are style choices.
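To make the "4 lines" claim concrete, here is a minimal sketch of the fast inverse square root idea, written from the description above rather than copied from any original source. The function name `fastInvSqrt` and the typed-array trick for reading the float's bits are assumptions made for this illustration; only the magic constant is the widely known one.

```javascript
// Illustrative sketch only; names and structure are hypothetical.
function fastInvSqrt(x) {
  // Two views over the same 4 bytes let us reinterpret a float as an integer.
  const f = new Float32Array(1);
  const i = new Uint32Array(f.buffer);
  f[0] = x;
  // Halving and negating the exponent in "log space" via integer arithmetic.
  i[0] = 0x5f3759df - (i[0] >> 1);
  const y = f[0];
  // One Newton-Raphson step refines the initial guess.
  return y * (1.5 - 0.5 * x * y * y);
}

console.log(fastInvSqrt(4)); // approximately 0.5
```

The essential steps fit in a handful of lines; everything else (names, ordering, comments) is where the individual expression lives.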
((And it is public -- GPL more specifically, which is a restrictive license that should be respected. I think this snippet makes a perfect example of the dangers of copilot. But not one to litigate details with.))
(((Thinking back, I'm not sure anymore how the license laundering argument works if they got the code from a fair-use MIT-licensed hobby project. Can one person claim fair-use and include it under an MIT-license and have somebody else say 'oh this free code I'm going to use it commercially'?)))
copilot is outputting literally small chunks of code that needs a lot of cleanup afterwards
If you start from copyrighted chunks, then clean them, you're still violating people's copyright. Multi-million dollar lawsuits have been fought over people using small samples from other people's music, cleaning them, and releasing them as parts of their own song.
I strongly disagree. There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
> I've considered open sourcing some of my product's components under a source available but otherwise proprietary license
What's the point of that? This isn't useful to anyone. The fact you even consider it shows you don't understand open source. I'm sure you happily use open source code yourself though.
I actually agree. However, this is not what's happening. Copilot effectively removes copyright from FLOSS code, but doesn't touch proprietary software. FLOSS loses its teeth against the corporations.
The purpose of releasing source available but proprietary code is so that users can learn and integrate into it, and making it available lets anyone learn how it works. The only reason I even considered making the source available is balance between 1) needing to eat and 2) valuing open source enough to risk #1.
Please take your condescension elsewhere.
There is a ton of innovative stuff that is not open source. I don't see what open source has to do with innovation.
Is there a GitHub terms of agreement that covers Copilot?
It being in GitHub has not been brought up as a factor yet (by GitHub/Microsoft), AFAIK they could use code from other places with that logic, they just don't need to.
Why do you want to release code on GitHub with an oppressive license? What's the motivation for you, and what's the benefit for anyone else in it being released?
The size of code fragments being generated with these AI tools is, as far as I can tell, extremely small. Do you think you could even notice if your own implementation of sqrt, comments and all, wound up in Excel?
The problem (or A problem) with copilot is that it tries to sidestep those licenses, purportedly allowing you to build upon the work of others without giving anything back even if the work you are building on has been published on the explicit condition that what you create with it should also be shared in the same way. While the great AI tumbler makes the legal copyright infringement argument complicated by giving you lots of small bits from lots of different sources, it really does not change the moral situation: you are explicitly going against the wishes of the people that are enabling you to do what you are doing.
Beyond copyleft, this kind of disregard for other people's wishes also applies to attribution even with more liberal licenses. Programming is already a field where proper attribution is woefully lacking - we don't need to make it worse by introducing processes where it becomes much harder if not impossible to tell who contributed to the creation.
Now I am all for maximum code sharing. I'm all for abolishing copyright entirely and letting everyone build what they want without being shackled by so-called intellectual property. But that is not something Microsoft is doing with Copilot. What they have created is a one way funnel from OSS to proprietary software. If Microsoft had initially trained Copilot on their own proprietary sources this would have been seen very differently. But they did not. Because the way Microsoft "loves open source" is not in the way of a mutually beneficial symbiotic relationship but that of an abuser that loves taking advantage of whatever they can while giving as little back as they can get away with.
As for whether Copilot's morally wrong or not - I don't think copyright as a concept makes any sense at the level of the trivial, where Copilot _should_ be acting. If Copilot regularly reproduces sizeable portions of code from a single origin _without_ careful and deliberate guidance, I'd agree that there's a problem here. As I understand it though, that's not happening.
By its very nature of being published, code from OSS is funnelled into proprietary codebases by humans performing a similar task to Copilot - reading available code and using that to evolve an understanding of how to produce software. I like to think we do it at a deeper level than Copilot, but the general effect is the same: the code I write, like the words I write, are heavily influenced by all the code I've read over the years.
If I wind up using a few words from your comment, down the line, because some turn of phrase you used struck me as a good way to say something, do you think I've morally wronged you?
IIRC, that is wrong. What you are describing is trademarks, not copyright.
This is a belief about our ability to construct models, not a fact. Models are leaky abstractions, by nature. Models using models are exponentially leaky.
> I didn't say we can simulate it.
Mathematics (at large) is descriptive. We describe matter mathematically, as it's convenient to make predictions with a shared modeling of the world, but the quantum of matter is not an equation. f() at any scale of complexity, does not transmute.
I'm not sure I agree that anything expressed in a legal contract using natural language is "unambiguously clear". MS / Github's expensively-attired lawyers will no doubt forcefully argue that they are not selling YOUR content, but a service based on a model generated from a large collection of content, which they have been granted a licence to "parse it into a search index or otherwise analyze it on our servers". There may even be in-court discussion of generalization, which will be exciting.
> It's almost as if these highly paid lawyers know what they're doing.
Sure, they wrote the content display license long before CoPilot even existed. Any court will see the intent and not interpret these terms as a code re-licensing.
I'm afraid I do not believe your legal expertise is so extensive that you are able to accurately predict the judgement of "any court".
And that license explicitly states that it doesn't give them the right to sell your code.
It’s also smart enough to rebuild your song from the chords _if you ask it to_.
I’ve seen this point made before, but it assumes you use the entire input as output, which is silly.
That's why it's actionable and why there is meat on the bone for this case. The real issue is going to be if they can convince a jury that this software is just stealing code and whether its wrong if a robot does it.
https://en.wikipedia.org/wiki/Idea–expression_distinction
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...
Ultimately the algorithm is automating something a human could do. There is a lot of gray area to copyright law, but you can't get around that simply by offloading to an algorithm.
Uh? So if I design a self driving car which kills someone, it's the car that goes to jail?
Legal precedent seems to indicate this is not the case at all. Because humans and machines are different, simply because humans aren't machines and vice versa.
No but the manufacturer will typically be held responsible. If the manufacturer intentionally designed it to kill people, someone could certainly be charged with murder. More likely it was a software defect and then it is a matter of financial liability. (in between is a software defect that someone knew about and chose not to fix)
This isn't a new issue. If you design a car and the brakes fail due to a design issue and that issue can be determined to be something that could have been preventable by more competent design.... someone might indeed go to jail but more likely it would be the corporation paying out a large amount of money.
It could even be a mixture of the manufacturer's fault and the driver. Maybe the brakes failed but the driver was speeding and being reckless and slammed on the brakes with no time to spare. Had it not been a faulty design, no one would have gotten hurt, but also if the driver had been competent and responsible, no one would have gotten hurt.
But with self driving cars, when they no longer need a "safety driver", it certainly won't typically be the human occupant of the car's fault to any degree, since they are simply a passenger.
That doesn’t answer the question of who’s responsible when an accident happens and someone gets hurt or dies - but then, there was a time when animals would be judged and sentenced if they committed a crime under human law. That practice is no longer deemed valid, maybe we need to agree that, if the self-driving car was built with reasonable care, accidents can still happen and it’s no one’s fault.
That's basically what copilot is...?
You can ...?
By simply saying existing fair usage rights are limited to be used by humans and not for-profit companies building for-profit products.
No matter where you draw the line between "done by computers" and "done by a human simply using a computer as a tool," there will always be a lot of gray area.
Also, if I spend a year creating my masterpiece, and some kid releases a copy of it for free and claims that that's ok just because it's "not for profit," there is still a problem.
it makes a lot of sense, for that reason and a lot of others
people can create algorithms that do whatever they want, including copyright infringement and outright criminality, but algorithms can't create people or want anything for themselves
Based on the given prompt, [Codex] produced the following response:
    function isEven(n) {
      if (n == 0)
        return true;
      else if (n == 1)
        return false;
      else if (n < 0)
        return isEven(-n);
      else
        return isEven(n - 2);
    }
    console.log(isEven(50));
    // → true
    console.log(isEven(75));
    // → false
    console.log(isEven(-1));
    // → ??
if you feel the class doesn't represent you, you can just not opt-in
That's my point. Many of the class members don't want the company to stop doing this.
I have code on GitHub, and Copilot is a useful tool. I don't care if my code was used to train the model. Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things. The bottom line is, if I'm a coder with code on Github and I like Copilot, this suit is a huge net negative.
Even more importantly, I want to see the next version of Copilot that will be created by some other company, and then the next version after that. I want development to continue in this area at a high velocity. This suit does nothing but put giant screeching brakes on that development, and that is just a shame.
I have some code on Github as well and would not want it to be used in training, neither by Microsoft nor by any other company. It is under the GPL license to ensure that any derived use is public and not stripped of copyrights and locked into a proprietary codebase, and copilot is pretty much 100% the opposite of this.
I think you missed this part:
> Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things.
Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.
Oracle only has copyright over APIs in the Federal Circuit, because they were able to hoodwink the judge into applying patent logic[0] to a copyright case. In other circuits it's still up in the air. And in the Ninth Circuit[1] there's already loads of controlling precedent that would have resulted in Oracle's case being summarily dismissed, API copyright or no.
The term "thin copyright" is a term of art. It refers to the kind of copyright protection you get from combining uncopyrightable elements in a creative way. For example, you can't own a particular chord progression. But, if you combine that with, say, a particular instrument, some audio engineering techniques, the subject matter of the lyrics, and so on... then you start getting something that requires creative effort and thus is copyrightable. Courts still have to take this into account when ruling on copyright claims as they do not want to give people a monopoly over just the chord, or just that instrument, etc.
In the case of APIs, we're talking about a series of names, plus an arrangement of type signatures that go with them. Very much a thin copyright, as the legal profession in the US calls it.
And when you have thin copyright, courts are going to be more liberal with handing out fair use exceptions. The "programmer convenience" argument that SCOTUS adopted means that copying an API to put in a different platform is OK. The Ninth Circuit says that copying an API to reimplement a platform that other people's code relies upon is also OK. There's very little room left to actually make a copyright claim on an API alone.
In the case of Copilot, it's not merely copying APIs and filling them out with novel details. It is either generating wholly novel code, or regurgitating training data, the latter of which is just a regular ol' infringement claim with no difficult legal questions to worry about.
[0] The Court of Appeals for the Federal Circuit is the only court with subject-matter jurisdiction over patent claims. When you're the only person who can make hammers, everything looks like a nail.
[1] The Ninth Circuit court of appeals has jurisdiction over California, which means it takes on the brunt of copyright cases.
that's the idea, yeah, and it would've been great if that's how copilot worked all the time
as for the whataboutism, if developers copied copyrighted code, the rights holder has the right to go after them, too, if they so choose
the rights holder could also choose to go after only big companies that violate licenses egregiously, if they so choose
you know, common sense and nuance
The thing you call "thin copyright" is still copyright. Being protected or not is in the end a binary judgment: If your stuff is "a little bit" protected it is actually fully protected—with all consequences that follow from that.
Also, the mere "assumption" of the highest US court that APIs are protected is a very strong signal. They could just have ruled that there is no protection at all; case closed. But they preferred to go for a weasel solution. This has reasons… They deliberately didn't open up the door for API freedom. (Most likely to still be able to wield that weapon against foreign competition should they feel like that some day.)
The point is: IP law is completely crazy. The smallest brain-farts are routinely protected.
The exceptions to this rule are actually stronger in civil law, but still even in the EU single words or sub-second audio samples are protected by default. (Regarding APIs the situation is better though: It's legal to reverse engineer something for e.g. compatibility, and a few other reasons; but those are explicit exceptions. The default is that almost every expression of even the slightest form of human "creativity" is copyrighted; the bar is extremely low, and actually gets pushed constantly lower and lower by common law influence.)
So on both sides of the Atlantic the default is that every single line of code is protected. There is nothing like a lower bound in size. Then, from there, you could try to argue that there should be an exception from this protection in some particular case, e.g. there was no "creativity" at all involved. But you will need to win an often very hard, expensive, and ridiculously long fight over that issue, and winning that is nothing like a sure thing; the default is that just everything is protected to the max. (Just have a look at all the craziness around news headlines in the EU; Google lost that case back then. To understand this better, as this may be very surprising to US people: civil law does not recognize anything like "fair use"; there are exceptions to copyright protection that have in the end almost the same effect, like grants for libraries or educational purposes, but those exceptions, and their limitations, are listed explicitly in the law; if no exception is listed there just isn't one, and only the very vague "creativity bar" remains.)
Regarding Copilot: It makes not much difference whether this machine spits out some verbatim copies of (clearly copyrighted!) snippets or some "remix" thereof. There is no "novel" code if at best all this machine does is create "remixes" of the code it has in its database based on the query given. (Its "knowledge base" is nothing else than a very funky database; technical details regarding the actual implementation of that database or its query system should not matter legally.)
Before this comes up again: No, any comparisons to how humans learn are irrelevant in this consideration. That machine is not a human. It's a machine. End of story. So even if you consider also a human brain a kind of "funky database" this makes no difference.
The lawyers defending the founders did try to make the argument that no infringement had been proven, and that the list itself was not proof of any infringement. It was just a list on a website, and they even presented evidence that the counter on the list was algorithmically faulty. The judges were not convinced and applied the common-sense approach that, taken as a whole, it was not believable that no infringement had occurred through the website given the context of the site (the name, the top list, the overall perspective of how the site was designed).
Perhaps that is why they are reaching out to potential class members
> if they can’t even provide examples of violating code.
This is the very beginning of a very long process. I wouldn't rule out a settlement where class members get $10-100, which is a common resolution for class action suits.
This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.
The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.
Attribution is mentioned in this filing because such attribution would be sufficient to meet the licensing terms for some of the alleged infringements.
It's an irrelevant discussion though, the suit does not make a claim that the training of Copilot was an infringement which is where Authors Guild is a controlling precedent.
I agree it's relevant precedent, but not exactly the same. Libraries are a public good and, more importantly, Google Books references the original works. In short, I don't think that's the final word in all seemingly related cases.
> More specifically, a computer consuming a copyright work is not a violation of copyright.
I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.
Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.
In which case Google paid some hundred million $ to companies and authors, created a registry collecting revenues and giving them to rightsholders, provided opt-out for already scanned books, etc. Hey, doesn't sound that bad for the same thing to happen with Copilot.
B) "parts of" copyright works are not themselves sufficient to constitute a copyright violation. The violation must be a substantial reproduction. While it's up to the court to determine if the alleged infringements demonstrated in the suit (I'm sure far more will be submitted if this case moves forward) meet this bar, from what I've seen none of them have.
Historically the bar is pretty high for software, hundreds or thousands of lines depending on use case. A purely mechanical description of an operation is not sufficient for copyright, you cannot copyright an implementation of a matrix transformation in isolation no matter what license you slap on the repo. Recall that the recent Google v Oracle case was litigated over tens of thousands of lines of code and found to be fair use because of the context of those lines.
I've yet to see a demonstrated case of Copilot generating code that is both non-transformative and represents a significant reproduction of the source work.
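One rough way to put a number on "significant reproduction" is to measure the longest run of consecutive lines that a generated snippet shares verbatim with a known source. The sketch below is only an illustration of that idea: the helper name `longestSharedRun` and the whitespace-trimming normalization are assumptions, and no particular run length corresponds to any legal threshold.

```javascript
// Hypothetical helper: length of the longest run of consecutive lines that
// appear verbatim (after trimming whitespace) in both texts.
function longestSharedRun(generated, source) {
  const norm = (text) =>
    text.split("\n").map((line) => line.trim()).filter((line) => line !== "");
  const a = norm(generated);
  const b = norm(source);
  let best = 0;
  // prev[j] holds the length of the shared run ending at a[i-2] and b[j-1].
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        curr[j] = prev[j - 1] + 1;
        if (curr[j] > best) best = curr[j];
      }
    }
    prev = curr;
  }
  return best;
}

const source = "let a = 1;\nlet b = 2;\nreturn a + b;";
const generated = "// my code\nlet b = 2;\nreturn a + b;\n";
console.log(longestSharedRun(generated, source)); // 2 shared consecutive lines
```

A measure like this says nothing about whether the shared lines are creative or purely mechanical, which is the part a court would still have to judge.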
How do they not make copies? Do you know how a computer works? Ever heard of RAM? (At least the German Urheberrecht recognizes this clearly: You can't do any processing on any data with the help of a computer without at least making temporary local copies, so there are exceptions to some rules. I'm quite sure common law copyright also recognizes this!)
Also the claim that this is not a derivative work is actually one of the disputed claims here…
> Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
Exactly, it's all copyrighted! That's why you can't use it for whatever you like. That's the whole point of copyright.
As a result this means that whoever wants to exploit that work in said way needs to buy (or get otherwise) a license!
Nobody said that feeding AI with properly licensed work would be problematic. Only the original creators need to get their fair cut from the outcome of such a process.
I think that losing this lawsuit has much more serious consequences for Copilot than just having to connect to a list of millions of potential copyright owners - it would mean the model behind it is essentially a failure.
Personal opinion: the real situation lies somewhere in the middle. From what I’ve seen, I think Copilot has some ability to actually generate code, or at least adapt and connect unrelated code pieces it remembers to respond to prompts - but I also believe it just “remembers” (i.e., has a close-to-lossless encoding of the input) how to do some operations and spits them out as part of the response to some prompts.
I hardly think the lawsuit will really explore this discussion, but it sounds like a great investigation into what DL models like transformers actually learn. For all I know, it might even give insight into how we learn. I have no reason to believe that humans don’t use the same strategy of memorising some operations and learning how to adjust them “at the edges” to combine them.
seems like a great opportunity for Microsoft to alter copilot so it's opt-in to get your code scanned, and to mandatorily add licensing and attribution to outputs
I know you said you're OK with it as is, but many aren't, so if I'm a coder, this suit represents a big net positive for me, being a way to reduce the probability of someone laundering my code away without proper attribution or license attention
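As a sketch of what "opt-in plus attribution" could look like mechanically, assuming a purely hypothetical record format (the fields `repo`, `license`, and `optedIn` are invented for illustration and are not a real GitHub or Copilot schema):

```javascript
// Keep only sources whose owners explicitly opted in (hypothetical flag).
function selectOptedIn(sources) {
  return sources.filter((s) => s.optedIn === true);
}

// Prepend a comment naming the (hypothetical) origin and license of a snippet.
function withAttribution(snippet, source) {
  return `// Source: ${source.repo} (${source.license})\n${snippet}`;
}

const sources = [
  { repo: "alice/parser", license: "GPL-3.0", optedIn: true },
  { repo: "bob/utils", license: "MIT", optedIn: false },
];
const allowed = selectOptedIn(sources);
console.log(allowed.map((s) => s.repo));
console.log(withAttribution("return a + b;", allowed[0]));
```

The hard part in practice would be tracing a generated snippet back to its source at all, which this sketch simply assumes is already known.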
If you build a machine and sell it, and this machine kills someone even when operated correctly, you'll have a problem. A big problem…
AI is a machine.
So the case is actually quite simple.
Regarding the sibling's Uber example: There the argumentation was that the machine was not operated correctly. So this is not a comparable case.
Well, the "long" investigation let Uber off the hook despite disabling emergency braking, and put the driver in jail.
Which seems to put all the blame on the user and nothing on the makers of the AI.
(and I guess courts might, in the future, say the GPL expires when copyrights on the code expire)
Meanwhile open source software has had an immeasurable benefit to society. My computer, tv, phone, light bulb, etc all benefit from OSS—running various licenses, and only a subset using a copyleft like license.
Like, copyright laws are also stifling my innovative business creating BluRays of Disney films and selling them on Amazon.
OpenAI did a dirty job though judging by the cases of the model just reproducing code to the comment, so I can understand why one would criticize this specific project.
The more I think about it, the more this all seems like another dimension of Jack and the Magic Beanstalk crossed with The Matrix.
Copilot is as much of a search engine as Stable Diffusion or DALL-e are, which is to say they aren't at all. If you want to compare it to a search engine, despite it being a tortured metaphor, the most apt comparison is not to Google, but to The Pirate Bay if TPB stored all of their copyrighted content and served it up themselves.
Stable Diffusion works on completely different principles, and it can't exactly replicate pixels from its training data.
And we all saw how well that went legally.
In the end it's just a machine. It's not a person. So trying to anthropomorphize this case makes no sense from the get go.
Looking at it this way (and I guess this is the right way to look at it from the law standpoint) Copilot is just a fancy database.
It's a database full of copyrighted work…
How this database (and its query system) works from the technical viewpoint isn't relevant. It just makes no difference as by law machines aren't people. End of story.
But should the court (to my extreme surprise) rule that what MS did was "fair use", then the floodgates of "fairuseify through ML"[1] would be open. Given the history of copyright and/or other IP laws in the US this just won't happen! The US won't ever accept that someone would be allowed to grab all Mickey Mouse movies, put them into some AI, and start to create new Mickey Mouse movies. That's the unthinkable. Just imagine what this would mean. You could "launder" any copyrighted work just by uploading and re-querying it from some "ML-based database system". That would be the end of copyright. This just won't happen. MS is going to lose this trial. There is no other option.
The only real question is how severe their loss will be. They surely also used AGPLv3 code for training. Thinking this through to the end with all consequences would mean that large chunks of MS's infrastructure, and all supporting code, which means more or less all of Azure, which means more or less all of MS's software, would need to be offered in (usable!) source to all users of Copilot. I think this won't happen. I expect the court to find a way to weasel out of this consequence.
[1] https://web.archive.org/web/20220121020414/fairuseify.ml/
The weights of Copilot very likely contain verbatim parts of the copyrighted code, just like in a zip archive. It chooses semi-randomly which parts to show and sometimes breaks copyright by displaying large enough pieces.