We've filed a lawsuit against GitHub Copilot

We've filed a lawsuit against GitHub Copilot (githubcopilotlitigation.com)

724 points by iworshipfaangs2 3 years ago | 781 comments

an1sotropy 3 years ago |

Seems important to point out that the announcement on this page (https://githubcopilotlitigation.com/) is a followup to https://githubcopilotinvestigation.com/ previously discussed here: https://news.ycombinator.com/item?id=33240341 (with 1219 comments)

Cort3z 3 years ago |

I’m not a lawyer, but here is why I believe a class action lawsuit is correct:

“AI” is just fancy speak for “complex math program”. If I make a program that’s simply given an arbitrary input and then, through math operations, outputs Microsoft’s copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the opposite should be true as well.

I’m sure my own open source code is in that thing. I did not see any attributions, thus they break the fundamentals of open source.

In the spirit of Rick Sanchez; It’s just compression with extra steps.

williamcotton 3 years ago | |

I read most of the complaint. The only examples of supposed copyright infringement are isEven and isPrime functions. Here's what Copilot gives me in a TypeScript file:

  function isPrime(n: number): boolean {
    for (let i = 2; i < n; i++) {
      if (n % i === 0) {
        return false;
      }
    }
    return n > 1;
  }
  
  function isEven(n: number): boolean {
    return n % 2 === 0;
  }

These are clearly not covered by copyright in the first place. This case is really quite pathetic.

eslaught 3 years ago | | |

Correct me if I'm wrong. I don't think this document needs to be a comprehensive record of every piece of copyrighted material that Copilot or Codex produce. That's something that will be produced during/for the trial process itself. Right now, this is just establishing the basic premise, and the claims for the type of behavior that is going on.

I think they intentionally picked (literal) textbook examples because they're short and easy for non-experts to grasp and have some understanding of. But I don't think we've seen any of the code from the respective J. Doe's yet, and I would assume we would in the trial (possibly in addition to more cases).

ksaj 3 years ago | | |

I tested Copilot initially with Hello World in different languages. In Lisp, it gave me verbatim code from a particular tutorial, which was made obvious because their code had "Hello <tutorialname>" where <tutorialname> was the name of a YouTube tutorial, instead of the word "World." It was surely slurped into the model via someone who had done the tutorial and uploaded their efforts to GitHub. Mind you, it's pretty much the way everyone would code it, but the inclusion of <tutorialname> is definitely an issue.

So it isn't too hard to prove the case.

nequo 3 years ago | | |

I have only skimmed. But lines 23 and 24 on page 23 also reference Copilot's autocompletion of Quake III's `Q_rsqrt`[1] and mention that it is under GPL2.

[1] https://news.ycombinator.com/item?id=27710287

TAForObvReasons 3 years ago | | |

You didn't read the relevant part of the complaint. It starts on document page 14 (PDF page 17). There's a clear footnote:

> Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot.

The offending solution from the AI included extra lines that are reasonably understood to come straight from Eloquent JavaScript:

    console.log(isEven(50));
    // → true
    console.log(isEven(75));
    // → false
    console.log(isEven(-1));
    // → ??

echelon 3 years ago | | |

Moreover, if this case wins, it threatens to disrupt one of the biggest technological progressions of all time.

AI/ML will change every field just as the Internet and smartphones did. It doesn't show any indication of peaking, either.

If the US chooses the wrong path here, we'll only tie our hands behind our backs. Other countries won't be so foolish.

We should be able to train on any media a child could see, hear, or read.

gus_massa 3 years ago | | |

That isPrime function does not even stop at sqrt(n). Asking for a state-of-the-art isPrime function is too much, but the sqrt trick is the very first step and it's free. (IIRC, the faster version uses i*i<=n.)
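For illustration, the square-root cutoff mentioned above can be sketched as follows. This is a minimal sketch, not code from the complaint or from Copilot; the name `isPrimeSqrt` is made up for this example. The loop condition uses `<=` rather than a strict `<`, because a strict comparison would misreport perfect squares such as 4 or 9 as prime.

```typescript
// Trial division that stops once i * i exceeds n.
// If n has a divisor larger than sqrt(n), it must also have one smaller
// than sqrt(n), so checking beyond that point is redundant.
function isPrimeSqrt(n: number): boolean {
  if (!Number.isInteger(n) || n < 2) {
    return false;
  }
  // i * i <= n avoids calling Math.sqrt and still covers perfect squares.
  for (let i = 2; i * i <= n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return true;
}
```

Compared with the loop that runs all the way up to n, this does on the order of sqrt(n) iterations for a prime input.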

elikoga 3 years ago | | |

Searching cs.github.com for "console.log(isEven(50));" "// → true", which is one of the passages the complaint is about because it is also reproduced in a programming textbook, gives:

" Showing 1 - 20 of 66 files found (in 76 milliseconds)"

So, if this lawsuit succeeds in some way shape or form, does the author have a case against the 66 people that reproduced these lines in their own repository?

jpollock 3 years ago | | |

Those are quite copyrightable, in the same way that rangeCheck() is copyrighted.

D13Fd 3 years ago | |

Correct legally, morally, or both?

Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.

Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.

heavyset_go 3 years ago | | |

Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.

I am more hesitant to release code on GitHub under any licenses now. Even outside of GPL-esque terms, I've considered open sourcing some of my product's components under a source available but otherwise proprietary license, but if Microsoft won't adhere to popular licenses like the GPL, why would they adhere to my own licensing terms?

If my licenses mean nothing, why would I release my work in a form that will be ripped off by a trillion dollar company without any attribution, compensation or even a license to do so? The incentives to create and share are diminished by companies that won't respect the terms you've released your creations under.

That's just me as an individual. Thinking in terms of for-profit companies, many of them would choose not to share their source code if they know their competitors can ignore their licenses, slurp it up and regurgitate it at an incomprehensible scale.

kelnos 3 years ago | | |

I'm fine with Copilot, but I think all rightsholders should be allowed to decide if they want their code training it or not. And that should be opt-in, not opt-out.

(And refusing to opt in shouldn't have to mean switching to a new hosting platform.)

> Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.

That's the case in pretty much any class action. I look at class actions as having two purposes: to require that the defendant stops doing something, and to fine the defendant some amount of money. Sure, individual class members will see very little of that money, but I look at it as a way of hurting a company that has done people wrong. Hopefully they won't do that anymore, and other companies will be on notice that they shouldn't do those bad things either. Of course, sometimes monetary damages end up being a slap on the wrist, just something a company considers a cost of doing business.

commoner 3 years ago | | |

For a long time, Microsoft has used software licenses to reap profits from Windows and Office, the two products that enabled Microsoft to capture near-monopolies in their respective markets.

Now, Microsoft is violating other people's software licenses to repackage the work of numerous free and open source software contributors into a proprietary product. There is nothing moral about flouting the same type of contract that you depend on every day, for the sake of generating more money.

Either the entire Copilot dataset needs to be made available under a license that would be compatible with the code it was derived from (most likely AGPLv3), or Windows and Office need to be brought into the commons. Microsoft cannot have it both ways without legal repercussions.

williamtrask 3 years ago | | |

I don’t think this lawsuit would hinder innovation but it would greatly change it and who owns it.

If an AI model is the joint property of all the people who contributed IP to it, it’s a pretty hugely democratic and decentralizing force. It also will incentivise a huge amount of innovation on better, richer data sources for AI.

If an AI model isn’t joint property of the IP it learned then it’s a great way to build extractive business models because the raw resource is mostly free. This will incentivise larger, more centralised entities.

Much of the most interesting data comes from everyday people. A class action precedent is probably good for society and good for innovation (particularly pushing innovation on the edge/data collection side)

mbreese 3 years ago | | |

> Morally I think this class action is dead wrong. This is how innovation dies.

This legal challenge is coming one way or another. I think it’s better to get it out of the way early. At least then we will know the rules going forward, as opposed to being in some quasi-legal gray area for years.

njharman 3 years ago | |

Say you read a bunch of code over years of a developer career. What you write is influenced by all that. It will include similar patterns, similar code, and identical snippets, knowingly or not. How large does a snippet have to be before it's copyrightable? "x"? "x==1"? "if x==1\n print('x is one')"? [obviously, replace with actual common code like if not found return 404].

Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat or by a large corp wanting to shut down an open source / competing project?

devmor 3 years ago | | |

This is a logical fallacy. A human is not an algorithm. We do not have to extend rights regarding novel invention to an algorithm to protect them for people.

kmeisthax 3 years ago | | |

Copyright already worries about this sort of thing a great deal, and it's actually a lot more well thought-out than your average hacker is aware of. There are no hard and fast rules; but generally... the thing being sued over has to be creative enough to be copyrightable in the first place. Small snippets do not qualify for copyright protection alone.

jacobjjacob 3 years ago | | |

I am vulnerable to copyright litigation for code I write, if I copied it. This is already true of anyone who is writing code.

cdrini 3 years ago | |

I haven't heard anyone saying that copilot is legal "just because it's AI." That's a pretty bad faith, reductive, and disingenuous representation. The core argument I've seen is that the output is sufficiently transformative and not straight up copying.

pessimizer 3 years ago | | |

> The core argument I've seen is that the output is sufficiently transformative and not straight up copying.

An argument that isn't made about any other type of algorithm.

benlivengood 3 years ago | |

Humans are just compression with extra steps by that logic.

There's a fairly simple technical fix for codex/copilot anyway; stick a search engine on the back end and index the training data and don't output things found in the search engine.

dleslie 3 years ago | | |

If I were to memorize my employer's IP then reproduce it (almost) verbatim and give it to a competitor, then I would be setting myself up for a world of legal hurt.

So yes, it is like how human memory is compression with extra steps.

devmor 3 years ago | | |

I don't think that would work very well because there are not infinite ways to succinctly solve most programming problems. In fact the majority of solutions will look exactly the same.

The real solution is very, very simple. Only use opt-in training data. Don't acquire codebases from people who didn't agree to it.

WithinReason 3 years ago | | |

That feature already exists, you can turn it on here:

https://github.com/settings/copilot

More info:

We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that resembles public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. In addition, we have announced that we are building a feature that will provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code, as well as explore and learn how that code is used in other projects.

https://github.com/features/copilot#what-can-i-do-to-reduce-...
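A rough sketch of the kind of filter described above, under stated assumptions: the public code is available as an in-memory list of strings, only exact matches after whitespace removal are caught (the "near matches" in the quoted description are not modelled), and the names `normalize`, `buildIndex`, `resemblesPublicCode`, and `WINDOW` are invented for this example. It is not GitHub's implementation, which the thread does not show.

```typescript
// Hypothetical window size, taken from the "about 150 characters" figure quoted above.
const WINDOW = 150;

// Remove all whitespace so that formatting differences do not hide a match.
function normalize(code: string): string {
  return code.replace(/\s+/g, "");
}

// Index every window-sized slice of each normalized public file.
function buildIndex(publicFiles: string[], window: number = WINDOW): Set<string> {
  const index = new Set<string>();
  for (const file of publicFiles) {
    const text = normalize(file);
    for (let i = 0; i + window <= text.length; i++) {
      index.add(text.slice(i, i + window));
    }
  }
  return index;
}

// A suggestion would be suppressed if any window-sized slice of it is in the index.
function resemblesPublicCode(
  suggestion: string,
  index: Set<string>,
  window: number = WINDOW
): boolean {
  const text = normalize(suggestion);
  for (let i = 0; i + window <= text.length; i++) {
    if (index.has(text.slice(i, i + window))) {
      return true;
    }
  }
  return false;
}
```

Note that anything shorter than the window is never flagged in this sketch, which is one reason short common functions would pass such a filter regardless of where they came from.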

ugh123 3 years ago | |

Attributions are fundamental to open source? I thought having source openly available was fundamental to open source (and allowed use without liability/warranty) as per apache, mit, and other licenses.

If they just stick to using permissive-licensed source code then I'm not sure what the actual 'harm' is with Copilot.

If they auto-generate an acknowledgement file for all source repos used in Copilot, and then ask clients of Copilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.

neongreen 3 years ago | | |

Apparently they are using GPL-licensed code as well, see https://twitter.com/DocSparse/status/1581461734665367554

After five minutes of googling I'm still not sure if using MIT code requires an attribution, but many people claim it does, see https://opensource.stackexchange.com/a/8163 as one example

TAForObvReasons 3 years ago | | |

Attributions are fundamental to permissive licenses as well. It's worth reading the licenses in question. MIT:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

This is the "attribution" requirement that even a Copilot trained on only-MIT code would miss.

If it were just about sharing code, there are public domain declarations and variants like CC0 licenses

Cort3z 3 years ago | | |

People would likely not share any code if they could not trust that their work would be respected, and attributed. So yes, I believe it to be fundamental to open source.

heavyset_go 3 years ago | | |

Attribution and inclusion of copies of licenses are stipulations in almost all of the popular open source licenses, including BSD and MIT licenses.

smoldesu 3 years ago | |

> “AI” is just fancy speak for “complex math program”

Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, same as treating it like a human mind. They're very different, under the surface.

IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.

Cort3z 3 years ago | | |

Who decides what constitutes an "AI program" vs just a "program"? What heuristic do we look at? At the end of the day, they have an equivalent of a .exe which runs, and outputs code that has a license attached to it.

heavyset_go 3 years ago | | |

I've been saying AI is computational statistics on steroids for a while, and I think that's an apt generalization of what ML is.

2muchcoffeeman 3 years ago | | |

But it all runs on hardware we created and we know exactly what operations were implemented in that hardware. How is it not just math?

kmeisthax 3 years ago | | |

The only license that is permissive enough for AI training is CC0.

Art generators can't comply with attribution requirements and code generators don't know if and when they trip the GPL copyleft. I believe most permissive code licenses also have some kind of attribution requirement.

urthor 3 years ago | | |

> Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand.

This is a VERY poor definition of mathematics.

galaxyLogic 3 years ago | |

Who should be sued? Microsoft who produces an application known as "Copilot" which itself contains nobody else's code but Microsoft's? OR the person who USES Copilot, to produce code which contains somebody else's copyrighted code?

Using Copilot is a bit like using a shotgun, can be very illegal depending on what you shoot at. Creating and distributing the app Copilot is like creating and selling a shotgun.

account42 3 years ago | | |

Microsoft produces a service known as "Copilot" which does contain other people's code. That the Copilot network contains other people's code is not in question, since it has been demonstrated to output other people's code and Microsoft even added (very limited) filters to detect if it outputs other people's code.

ranguna 3 years ago | | |

Everyone: Copilot, because they used copyrighted code (for training) and generate it for their product, and the people that use the product.

Although users can probably get away with it because they didn't know copilot was actively generating copyrighted code.

nudpiedo 3 years ago | | |

the metaphor is sort of broken... here the "shotgun" has the ammunition and the dead child in.

drvortex 3 years ago | |

Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

And that is why this lawsuit is dead on arrival.

klabb3 3 years ago | | |

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.

The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.

The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.

xtracto 3 years ago | | |

Say you publish a song and copyright it. Then I record it and save it in a .xz format. It's not an MP3, it is not an audio file. Say I split it into N chunks and I share them with N different people. Or with the same people, but I share them on N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.

Am I violating your copyright? Are you entitled to do that?

To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?

[1] https://github.com/philipl/pifs

andrewmcwatters 3 years ago | | |

This is demonstrably false. It is a system outputting character-for-character repository code.[1]

[1]: https://news.ycombinator.com/item?id=33457517

Cort3z 3 years ago | | |

Just to be clear: I cannot prove that they have used my code, but for the sake of argument, let's assume so.

They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.

heavyset_go 3 years ago | | |

Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.

You can easily see this happen, the regurgitation of training data, in an overfitted neural net.
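To make the regurgitation point concrete, here is a toy sketch. It is an assumption-laden stand-in, not a neural net: a character-level lookup model (the hypothetical `train` and `generate` below) that predicts the next character from the previous `order` characters. Trained on a single snippet with a long context, it has effectively memorized its training data, and prompting it with the first few characters plays the snippet back.

```typescript
// Maps each context of `order` characters to counts of the character that followed it.
type Model = Map<string, Map<string, number>>;

function train(text: string, order: number): Model {
  const model: Model = new Map();
  for (let i = 0; i + order < text.length; i++) {
    const context = text.slice(i, i + order);
    const next = text.charAt(i + order);
    let counts = model.get(context);
    if (counts === undefined) {
      counts = new Map<string, number>();
      model.set(context, counts);
    }
    const seen = counts.get(next);
    counts.set(next, (seen === undefined ? 0 : seen) + 1);
  }
  return model;
}

// Greedy generation: always append the most frequent follower of the current context.
function generate(model: Model, prompt: string, order: number, maxLength: number): string {
  let out = prompt;
  while (out.length < maxLength) {
    const counts = model.get(out.slice(-order));
    if (counts === undefined) {
      break; // context never seen during training
    }
    let best = "";
    let bestCount = 0;
    counts.forEach((count, ch) => {
      if (count > bestCount) {
        best = ch;
        bestCount = count;
      }
    });
    out += best;
  }
  return out;
}

// Training on one snippet with an 8-character context memorizes it outright.
const snippet = "function isEven(n) { return n % 2 === 0; }";
const memorized = train(snippet, 8);
const replayed = generate(memorized, snippet.slice(0, 8), 8, snippet.length);
console.log(replayed === snippet);
```

With far more training data and a shorter context, the same kind of model would blend sources instead of replaying one, which is roughly the distinction the thread is arguing over.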

vkou 3 years ago | | |

> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?

Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.

The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.

I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.

civilized 3 years ago | | |

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.

If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.

moralestapia 3 years ago | | |

Whatever you say man :^)

https://twitter.com/docsparse/status/1581461734665367554

NicoleJO 3 years ago | | |

You're wrong. See exposed code. https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...

lamontcg 3 years ago | | |

> but snippets of it cannot

Yeah they can, and the whole functions that Copilot spits out are quite obviously covered by copyright.

> especially when they are used in a different context.

That doesn't matter.

ouid 3 years ago | | |

It is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.

blackbrokkoli 3 years ago |

I am sorry for not bringing any kind of legal perspective here, but:

*Jesus Christ*, I hope I live long enough to see copyright die. Here we are at the cusp of a new paradigm of commanding computers to do stuff for us, right at the beginning of the first AI development which actually impresses me.

And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.

I am also deeply disappointed in HackerNews; where is that deep hatred of patent trolls and smug satisfaction whenever something gets cracked or pirated now?

CobrastanJorji 3 years ago |

As a non-lawyer, I am very suspicious of the claim that "Plaintiffs and the Class have suffered monetary damages as a result of Defendants’ conduct." Flagrant disregard for copyright? Sure, maybe. The output of the model is subject to copyright? Who knows! But the copyright holders being damaged in some way? Seems doubtful. The best argument I could think of would be "GitHub would have had to pay us for this, and they didn't pay us, so we lost money," but that'd presumably work out to pennies per person.

r3trohack3r 3 years ago |

I'm not confident in this stance - sharing it to have a conversation. Hopefully some folks can help me think through this!

The value of copyleft licenses, for me, was that we were fighting back against the notion of copyright. That you couldn't sell me a product that I wasn't allowed to modify and share my modifications back with others. The right to modify and redistribute transitively through the software license gave a "virality" to software freedom.

If training a NN against a GPL licensed code "launders" away the copyleft license, isn't that a good thing for software freedom? If you can launder away a copyleft license, why couldn't you launder away a proprietary license? If training a NN is fair use, couldn't we bring proprietary software into the commons using this?

It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft. Tools like copilot seem to be an exceptionally powerful tool (perhaps more powerful than the GPL) for liberating software.

What am I missing?

adlpz 3 years ago |

It feels weird saying this but, for once, I hope the big evil corporation gets to keep selling their big bad product.

I find the pattern matching and repetitive code generation really helpful. And the library autocomplete on steroids, too.

Meh. Tricky subject.

albertzeyer 3 years ago |

I really don't understand how there can be a problem with how Copilot works. Any human just works in the same way. A human is trained on lots and lots of copyrighted material. Still, what a human produces in the end is not automatically derived work from all the human has seen in his life before.

So, why should an AI be treated different here? I don't understand the argument for this.

I actually see quite some danger in this line of thinking, that there are different copyright rules for an AI compared to a human intelligence. Once you allow for such arbitrary distinction, it will get restricted more and more, much more than humans are, and that will just arbitrarily restrict the usefulness of AI, and effectively be a net negative for the whole humanity.

I think we must really fight against such undertaking, and better educate people on how Copilot actually works, such that no such misunderstanding arises.

herpderperator 3 years ago |

The title of the submitted PDF document: "Microsoft Word - 2022-11-02 Copilot Complaint (near final)"[0]

I've noticed this a lot and it's quite funny seeing what the actual filename of the document was. Does this just get included as metadata by default when you export to PDF?

[0] https://githubcopilotlitigation.com/pdf/1-0-github_complaint...

bombcar 3 years ago | |

In Word you can go to document properties or whatever and set the Title and some other fields to control what gets into the PDF.

tasuki 3 years ago | |

The typography on that document is not great. Perhaps they should read Matthew Butterick's book?

senkora 3 years ago | |

It does, yes. It’s very annoying and I have occasionally stripped it off of PDFs I’ve made, using exiftool.

mirekrusin 3 years ago | |

They should use github instead of sending "(final, 2nd revision, really final, amended)" emails.

D13Fd 3 years ago | | |

If only you could, with Word docs. Sadly you can't in any meaningful way.

deanjones 3 years ago |

This will fail very quickly. The licence that project owners publish with their code on Github applies to third parties who wish to use the code, but does not apply to Github. Authors who publish their code on Github grant Github a licence under the Github Terms: https://docs.github.com/en/site-policy/github-terms/github-t...

Specifically, sections D.4 to D.7 grant Github the right "to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."

karaterobot 3 years ago |

Does everybody credit the author when using Stack Overflow code? I have, but don't always. Not that I'm trying to steal, I just don't take the time, especially in personal projects.

This isn't exactly the same thing, but it seems to me that three of the biggest differences are:

1. Stack Overflow code is posted for people to use it (fair enough, but they do have a license that requires attribution anyway, so that's not an escape)

2. Scale (true; but is it a fundamental difference?)

3. People are paying attention in this case. Nobody is scanning my old code, or yours, but if they did, would they have a case?

I dunno. I'm more sympathetic to visual artists who have their work slurped up to be recapitulated as someone else's work via text to image models. Code, especially if it is posted publicly, doesn't feel like it needs to be guarded. I'm not saying this is correct, just saying that's my reaction, and I wonder why it's wrong.

Imnimo 3 years ago |

On page 18, they show Copilot produces the following code:

>function isEven(n) {

> return n % 2 === 0;

They then say, "Copilot’s Output, like Codex’s, is derived from existing code. Namely, sample code that appears in the online book Mastering JS, written by Valeri Karpov."

Surely everyone reading this has written that code verbatim at some point in their lives. How can they assert that this code is derived specifically from Mastering JS, or that Karpov has any copyright to that code?

celestialcheese 3 years ago |

Maybe I'm being too cynical, but this feels like it's more a law firm and individual looking to profit and make their mark in legal history rather than an aggrieved individual looking for justice.

Programmer/Lawyer Plaintiff + upstart SF Based Law Firm + novel technology = a good shot at a case that'll last a long time, and fertile ground to establish yourself as experts in what looks to be a heavily litigated area over the next decade+.

xchip 3 years ago |

LOL we look like taxi drivers fighting Uber.

If Kasparov uses chess programs to be better at chess maybe we can use copilot to be better developers?

Also, anyone, either a person or a machine, is welcome to learn from the code I wrote. Actually, that is how I learnt how to code, so why would I stop others from doing the same?

elefantastisch 3 years ago | |

Judging by the majority opinion in this thread, it seems pretty clear GitHub could have asked and gotten enough people to opt-in to have no problem training their model. They probably would have been thrilled to do it and proud of being included in the training data.

But the preference of the majority does not override the conditions placed by people who prefer not to participate.

jacooper 3 years ago | |

No human perfectly reproduces the learning material they used. If that was true, one might as well just hire engineers from Twitter and make a new platform from the code they remember!

blackbrokkoli 3 years ago | | |

Well, we humans do it occasionally. You probably remember a few specific code snippets in your lang of choice because they kept annoying you/you love them/you wrote them a lot. So if I put you in exactly the right situation, you would indeed reproduce code verbatim.

So does Copilot.

I am not trying to insinuate that Copilot works like a human, but it is literally the same situation.

abouttyme 3 years ago |

I suspect this will be the first of many lawsuits over training data sets. Just because it is obscured by artificial neural networks doesn't mean it's an original work that is not subject to copyright restrictions.

ketralnis 3 years ago | |

Yeah yeah my code produces the complete works of Mickey Mouse but it's okay because _algorithms_!

judge2020 3 years ago | | |

I don't know why we're treating it as anything less than a human brain. A human can replicate a painting from memory or a picture of Mickey Mouse and that would likely be copyright infringement, but they could also take a drawing of Mickey Mouse sitting on the beach and give him a bloody knife & some sunglasses and it'd likely be fair use of the original art.

The AI can copy things if it wants, but it can also modify things to the point of being fair use, and it can even create new works with so little of any particular work that it's effectively creativity on the same level of humans when they draw something that popped into their heads.

naillo 3 years ago |

I'm kinda sceptical that this goes anywhere given that basically they say that whatever copilot outputs is your responsibility to vet that it doesn't break any copyright (obviously that goes against the promise of it and the PR but that's the small print that gets them out of trouble).

iworshipfaangs2 3 years ago |

It's also a class action,

> behalf of a proposed class of possibly millions of GitHub users...

The appendix includes the 11 licenses that the plaintiffs say GitHub Copilot violates: https://githubcopilotlitigation.com/pdf/1-1-github_complaint...

cmrdporcupine 3 years ago |

If Microsoft is so confident in the legality and ethics of Copilot, and that it doesn't leak or steal proprietary IP... they should go train it on the MS Word and Windows and Excel source trees.

What's that? They don't want to do that? Why not?

blackbrokkoli 3 years ago | |

Did they make a statement that they did not want to do that?

Because if not I would offer the very mundane explanation that the Copilot team probably just couldn't be bothered hitting up the other software teams and jumping through 3,046 internal red tape compliance steps to make their product 0.001% better (I am pretty sure the code base of all of GH dwarfs MS code base quite a lot)

I can't believe I am actually defending fucking Microsoft, but just want to say there isn't a conspiracy everywhere...

az226 3 years ago | |

I have no doubt they will -- but the specific models will be used for Microsoft engineers. There will be a Copilot for Enterprise that trains on customers' private code.

jeffhwang 3 years ago |

Wow, this is an interesting iteration in the ongoing divide between "East Coast code" vs. "West Coast code" as defined by Larry Lessig. For background, see https://lwn.net/Articles/588055/

IceWreck 3 years ago |

I am not against this lawsuit but I'm against the implications of this because it can lead to disastrous laws.

A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong? What is the line between copying and machine learning? Where does overfitting come in?

Today they're filing a lawsuit against copilot.

Tomorrow it will be against Stable Diffusion (or DALL-E, GPT-3, whatever).

And then eventually against Wine/Proton and emulators (are APIs copyrightable?).

elcomet 3 years ago |

This is why we can't have nice things. Copilot is the best thing that has happened in developer tools in a long time; it has increased my productivity a lot. Please don't ruin it.

theamk 3 years ago | |

Write a whole bunch of code and permit copilot learning on it! Then it would be great even without violating others' copyrights.

puffoflogic 3 years ago | | |

How would you "permit copilot learning on it"? Say, what if you could upload that code to a certain website and grant the website owner the necessary license to share your work with others (via copilot)? It sounds like that would work!

protomyth 3 years ago |

I really feel that Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith[0] is going to have a big effect on this type of thing. They are basically relying on their AI magic to make it transformative. I'm starting to think the era of learning from material other people own without a license / permission is going to end quickly.

0) https://www.scotusblog.com/case-files/cases/andy-warhol-foun...

topher6345 3 years ago |

Is it not in the agency of the developer to hit the save button?

It seems like GitHub Copilot can spit out copyrighted works all day but the person running the text editor has to "choose" which Copilot output to actually save/commit/deploy.

Does it really matter that much "how" the text in your text editor gets there? You write it yourself or copy/paste it or have Copilot generate it. Ultimately the individual that "approved" it to be saved to the disk is the one violating the copyright, Copilot is just making a "suggestion".

nullc 3 years ago |

I think if this is successful it will be very bad for the open world.

Large platforms like github will just stick blanket agreements into the TOS which grant them permission (and require you indemnify them for any third party code you submit). By doing so they'll gain a monopoly on comprehensively trained AI, and the open world that doesn't have the lever of a TOS will not at all be able to compete with that.

Copilot has seemed to have some outright copying problems, presumably because it's a bit over-fit. (perhaps to work at all it must be because it's just failing to generalize enough at the current state of development) --- but I'm doubtful that this litigation could distinguish the outright copying from training in a way that doesn't substantially infringe any copyright protected right (e.g. where the AI learns the 'ideas' rather than verbatim reproducing their exact expressions).

The same goes for many other initiatives around AI training material-- e.g. people not wanting their own pictures being used to train facial recognition. Litigating won't be able to stop it but it will be able to hand the few largest quasi-monopolists like Facebook, Google, and Microsoft a near monopoly over new AI tools when they're the only ones that can overcome the defaults set by legislation or litigation.

It's particularly bad because the spectacular data requirements and training costs already create big centralization pressures in the control of the technology. We will not be better off if we amplify these pressures further with bad legal precedents.

az226 3 years ago | |

GitHub already has this in TOS -- that is the irony of the lawsuit, it is actually in GitHub's favor this happens. GitHub can in such a case jack up the price 10x as the sole provider.

bkuhn 3 years ago |

In case folks here were curious, we at the Software Freedom Conservancy have asked the Plaintiffs to endorse the Principles of Community-Oriented GPL enforcement: https://sfconservancy.org/news/2022/nov/04/class-action-laws...

… & of course we again ask Microsoft's GitHub to start respecting FOSS licenses, cooperate with the community, & retract their incorrect claim that their behavior is “fair use”.

A few more links to our work on this issue:

https://sfconservancy.org/blog/2022/feb/03/github-copilot-co... https://sfconservancy.org/news/2022/feb/23/committee-ai-assi...

foooobaba 3 years ago |

It seems like we should come to an agreement on what the license is intended for, given that the licenses were created at a time before AI like this existed. If the authors did not intend their code to be used like this, should we not respect it? Also, does it make sense to create new licenses which explicitly state whether using it for AI training is acceptable or not - or are our current licenses good enough?

solomatov 3 years ago |

The most important part of this is not whether the lawsuit will be won or lost by one of the parties, but what is the legality of fair use in machine learning, and language models. There's a good chance that it gets to Supreme Court and there will be a defining precedent to be used by future entrepreneurs about what's possible and what's not.

P.S. I am not a lawyer.

warbler73 3 years ago |

It seems obvious that AI models are derivative works of the works they are trained on but it also seems obvious that it is totally legally untested whether they are derivative works in the formal legal sense of copyright law. So it should be a good case assuming we have wise and enlightened judges who understand all nuances and can guide us into the future.

buzzy_hacker 3 years ago |

Copilot has always seemed like a blatant GPL violation to me.

puffoflogic 3 years ago | |

Code is not licensed to GitHub under the GPL. Your comment is word salad.

m00x 3 years ago | |

Care to explain in legal terms why this stance is qualified?

buzzy_hacker 3 years ago | | |

You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”. c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

——

I don’t see how one could argue that training on GPL code is not “based on” GPL code.

foooobaba 3 years ago |

If github or google indexes source code using a neural net to help you find it, given a query, is that also illegal? If you think of copilot as something that helps you find code you’re looking for, is it all that different, and if so, why?

In this case, wouldn’t the users of copilot be the ones responsible for any copyrighted code they may have accessed using copilot?

lbotos 3 years ago | |

The crux of the issue: Is the code that is being generated being used in a way that its license allows? That's it. I'm confident that this problem would go away if copilot said:

//below output code is MIT licensed (source: github/repo/blah)

And yes, the "users" are responsible, but it's possible that copilot could be implicated in a case depending on how its access is licensed.

Stable diffusion has this same problem btw, but in visual arts "fair use" is even murkier.

For code, if you could use the code and respect the license, why wouldn't you? Copilot takes away that opportunity and replaces it with "trust us".
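To make the attribution idea concrete, here is a rough TypeScript sketch of stamping a suggestion with the comment line proposed above. The `Provenance` shape and the `withAttribution` helper are invented for illustration; nothing here reflects how Copilot works or what metadata it actually has about its output.

```typescript
// Hypothetical provenance record; Copilot is not known to expose anything like this.
interface Provenance {
  license: string; // license name, e.g. "MIT"
  source: string; // where the snippet came from, e.g. "github/repo/blah"
}

// Prefix a generated snippet with a one-line attribution comment.
function withAttribution(snippet: string, p: Provenance): string {
  const header = `// below output code is ${p.license} licensed (source: ${p.source})`;
  return `${header}\n${snippet}`;
}

const suggestion = withAttribution(
  "function isEven(n: number): boolean {\n  return n % 2 === 0;\n}",
  { license: "MIT", source: "github/repo/blah" },
);
console.log(suggestion);
```

The formatting is the easy part. Whether a model can reliably tell which source a given output came from is a separate question this sketch does not answer.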

foooobaba 3 years ago | | |

This makes sense: it produces chunks, not the whole source, whereas a search engine would also give you the license.

leni536 3 years ago | |

Both services already accept DMCA notices to take content down.

foooobaba 3 years ago | | |

True, that’s another good point.

hu3 3 years ago |

As a GitHub user, is there a way to support GitHub against this lawsuit?

Obviously not financially as Microsoft has basically YES amounts of money.

michaelmrose 3 years ago | |

If you had legal expertise and a strong opinion on the matter I suppose you could write a persuasive brief for the consideration of the court. If you have a strong opinion but aren't a legal eagle you could write to your legislators in support of legislation explicitly supporting this use case or organize the support of people more capable in that arena.

If you are opinionated but lazy, no judgement here as I sit here watching TV, you could add a notation at the top of your repos explicitly supporting the usage of your code in such tools as fair use.

Notably, if your code is derivative of other works, you have no power to grant permission for such use of code you don't own, so best include some weasel words to that effect. Say:

I SUPPORT AND EXPLICITLY GRANT PERMISSION FOR THE USAGE OF THE BELOW CODE TO TRAIN ML SYSTEMS TO PRODUCE USEFUL HIGH QUALITY AUTOCOMPLETE FOR THE BETTERMENT AND UTILITY OF MY FELLOW PROGRAMMERS TO THE EXTENT ALLOWABLE BY LICENSE AND LAW. NOTHING ABOUT THIS GRANT SHALL BE CONSTRUED TO GRANT PERMISSION TO ANY CODE I DO NOT OWN THE RIGHTS TO NOR ENCOURAGE ANY INFRINGING USE OF SAID CODE.

Years from now when such cases are being heard and appealed ad nauseam a large portion of repos bearing such notices may persuade a judge that such use is a desired and normal use.

You could even make a GPLesque modification if you were so inclined, where you said: SO LONG AS THE RESULTING TOOLING AND DATA IS MADE AVAILABLE TO ALL

Note not only am I not your lawyer, I am not a lawyer of any sort so if you think you'll end up in court best buy the time of an actual lawyer instead of a smart ass from the internet.

awestroke 3 years ago |

If this leads anywhere I'll be pissed. I love CoPilot.

an1sotropy 3 years ago | |

copilot is great, and ignorance is bliss, isn't it

The situation that this lawsuit is trying to save you from is this: (1) copilot blurps out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns copyright on something Y that copilot was trained on, and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.

You are now in legal trouble, and copilot put you there, because it never warned you that X is part of Y, and that Y comes with such and such licensing terms.

Whether we like copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but that doesn't make law.

awestroke 3 years ago | | |

It's pretty obvious when it does emit copyrightable code, and you mostly have to really try to make that happen. Have you even used copilot yourself?

yamtaddle 3 years ago | |

I expect I'd love it but I've been holding off until I find out whether MS lets devs on their core products use it.

If not, it's a pretty clear sign they consider it radioactive.

still_grokking 3 years ago |

I hope MS used a lot of AGPL code to train Copilot… This would be fun.

But no matter how this goes, in case training AI with copyrighted inputs is "fair use" that'll end up as the ultimate "copyright laundry machine" like this "joke" project here:

https://web.archive.org/web/20220104214929/https://fairuseif...

https://news.ycombinator.com/item?id=27796124 (302 points, 151 comments)

rafaelturk 3 years ago |

Like everything legally related: This is not about open source fairness, protecting innovation, it's all about making money.

throwaway675309 3 years ago |

Even if this succeeds, you've already lost.

1. The ability to be able to run and train these models is going to eventually be perfectly plausible on a home machine.

2. It's only a matter of time before models, e.g. a popular model scraped from all of the code on GitHub, is a publicly available torrent.

3. People will be able to just run it locally as an integrated plug-in in JetBrains or VS Code.

4. You'll never know if somebody has lifted their code in violation of a license any more than you would be able to tell if somebody used code from Stack Overflow without attribution in any commercial endeavor.

The End.

kevincox 3 years ago | |

Just because some people get away with copyright infringement doesn't mean that copyright infringement is now legal.

I don't think 1-3 matter at all. The point is that GitHub is selling a tool that can commit copyright infringement. This lawsuit is trying to get them to pay the consequences for the infringement that they have enabled.

falcolas 3 years ago |

Crackpot Theory: Copilot (and by association many ML tools) is a form of probabilistic encryption. Once encoded, it's virtually impossible to pull the code (plaintext) directly out of the raw ML model (the cyphertext), yet when the proper key is input ('//sparse matrix transpose'), you get the relevant segment of the original function (the plaintext) back.

We've even seen this with stable diffusion image generation, where specific watermarks can be re-created (decrypted?) deterministically with the proper input.
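A toy way to see the "right key gets the plaintext back" effect, sketched in TypeScript. To be loud about the assumptions: this is a word-level n-gram lookup table with greedy decoding, not a neural network, and it says nothing about how Copilot or Stable Diffusion are built. `train` and `complete` are made-up names. It only shows that a model fit to very little data will replay its training text when prompted with the opening tokens.

```typescript
type Counts = Record<string, number>;
type Model = Record<string, Counts>;

function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Record, for every run of `order` tokens, which token followed it and how often.
function train(corpus: string[], order: number): Model {
  const model: Model = Object.create(null);
  for (const doc of corpus) {
    const tokens = tokenize(doc);
    for (let i = 0; i + order < tokens.length; i++) {
      const key = tokens.slice(i, i + order).join(" ");
      const next = tokens[i + order];
      if (!model[key]) model[key] = Object.create(null);
      model[key][next] = (model[key][next] ?? 0) + 1;
    }
  }
  return model;
}

// Greedy decoding: always append the most frequent continuation of the last `order` tokens.
function complete(model: Model, prompt: string, order: number, maxTokens: number): string {
  const out = tokenize(prompt);
  for (let n = 0; n < maxTokens && out.length >= order; n++) {
    const counts = model[out.slice(out.length - order).join(" ")];
    if (!counts) break; // context never seen in training: nothing to emit
    let best = "";
    let bestCount = 0;
    for (const tok of Object.keys(counts)) {
      if (counts[tok] > bestCount) {
        best = tok;
        bestCount = counts[tok];
      }
    }
    out.push(best);
  }
  return out.join(" ");
}

const model = train(
  ["function isEven(n) { return n % 2 === 0; }", "const greeting = 'hello world';"],
  2,
);
console.log(complete(model, "function isEven(n)", 2, 50));
```

With only two short training strings every two-token context is unique, so the prompt `function isEven(n)` walks straight back through the original line. Real models trained on huge corpora mostly don't behave this way, which is roughly what the overfitting argument is about.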

az226 3 years ago | |

This is not crackpot -- this is literally how it works. Here's an example that points to this, https://arstechnica.com/information-technology/2022/09/bette...

Anybody looking at the source image and the generated result would say they are the same.

spir 3 years ago |

The part of GitHub Copilot to which I object is that it's trained on private repos. Where does GitHub get off consuming explicitly private intellectual property for their own purposes?

garfieldnate 3 years ago |

If GitHub ends up having to tweak their product to avoid ethical/legal concerns, I actually imagine it could still be pretty cool. Right now Copilot is a black box that spits out code with no attribution; what if they worked on instead making it a glass box, where it always brings up snippets of other projects along with their licensing info so that you can decide how to incorporate the ideas fairly yourself? Or they could still output the same code suggestions, but always include attribution and license data along with it. Making the product more transparent would probably make more people comfortable with using it, anyway.
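As a sketch of what the "glass box" could hand back, assume each suggestion arrived with its source and license attached. The `AttributedSuggestion` type and `usableSuggestions` function below are hypothetical TypeScript, not an existing Copilot API. The point is only that with that metadata, a project could keep suggestions whose licenses it has decided it can comply with.

```typescript
// Hypothetical shape of a suggestion that carries its provenance along.
interface AttributedSuggestion {
  code: string;
  sourceRepo: string; // e.g. "github/repo/blah"
  license: string; // license identifier such as "MIT" or "GPL-3.0-only"
}

// Keep only suggestions whose license is on the project's allowlist.
function usableSuggestions(
  all: AttributedSuggestion[],
  allowedLicenses: string[],
): AttributedSuggestion[] {
  return all.filter((s) => allowedLicenses.indexOf(s.license) !== -1);
}

const candidates: AttributedSuggestion[] = [
  { code: "return n % 2 === 0;", sourceRepo: "github/repo/blah", license: "MIT" },
  { code: "return (n & 1) === 0;", sourceRepo: "github/repo/other", license: "GPL-3.0-only" },
];

// A project that has decided it can only take MIT-licensed snippets:
for (const s of usableSuggestions(candidates, ["MIT"])) {
  console.log(`${s.code}  // ${s.license}, from ${s.sourceRepo}`);
}
```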

Cloudef 3 years ago |

Unless Copilot spits out complete programs or libraries that are 1:1 copies of someone else's, who cares? Caring about random small code snippets is dumb.

bilsbie 3 years ago |

Laws need to change to match technology.

Did you know that before airplanes were invented, common law said you owned the air above your land all the way to the heavens?

m00x 3 years ago | |

Can you explain what damages you incur from Copilot?

jacooper 3 years ago | | |

People not following your license ? And not making their derived works under the same license like I require?

brookst 3 years ago |

I wonder if the plaintiffs' code would stand up to scrutiny of whether any of it was copied, even unintentionally, from other code they saw in their years of learning to program? I know that I have more-or-less transcribed from Stack Overflow/etc, and I have a strong suspicion that I have probably produced code identical to snippets I've seen in the past.

zach_garwood 3 years ago | |

But have you done so on an industrial scale?

brookst 3 years ago | | |

I'm just one person! Give me a team of 1000 and I'll get right on that.

layer8 3 years ago |

Copilot reminds me of the Borg: You will be assimilated. We will add your technological distinctiveness to our own. Resistance is futile.

omegacharlie 3 years ago |

Think some of the negativity about Copilot may be the perception that if an individual or small startup attempted training an ML model from public source-code and commercialised a service from it they would be drowning in legal issues from big companies not happy with their code used in such a product.

In addition just because code is available publicly on GitHub does not necessarily mean it is permissively licensed to use elsewhere, even with attribution. Copyright holders not happy with their copyrighted works publicly accessible can use the DMCA to issue take-downs that GitHub does comply with but how that interacts with Copilot and any of its training data is a different question.

As much as the DMCA is bad law, it's rather funny seeing Microsoft be charged in this lawsuit under the lesser-known provision against 'removal of copyright management information'. Microsoft does have more resources to mount a defence, so it will probably end up different compared to a smaller player facing this action.

rolenthedeep 3 years ago |

Consider each repo on github to be a movie. What copilot does is to search for sequences of frames from any movie which line up to create a new coherent movie.

Individually, each frame is protected by the copyright of the movie it belongs to. But what happens if you take a million frames from a million different movies and just arrange them in a new way?

That's the core question here. Is the new movie a new copyrightable work, or is it plagiarizing a million other works at once? Is it legal to use copyrighted works in this way?

The other question is if it is right to use copyrighted works this way. Is this within the spirit of open source software? Or is this just a bad corporation taking advantage of your good will?

I'm not sure where I stand on this, it's a complicated problem for sure. Definitely interested to see how this plays out in court.

az226 3 years ago | |

Fair use.

poulpy123 3 years ago |

>By training their AI systems on public GitHub repositories (though based on their public statements, possibly much more) we contend that the defendants have violated the legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub.

I don't know about US copyright law so I can't comment on the legal documents, but this website is not complaining that Copilot is reproducing copyrighted content but that it was trained on copyrighted content. I don't see how you can forbid someone or something to read and learn from something that is public (once again, reproducing is another problem).

throwaway675309 3 years ago |

How much code is necessary to be considered a copyright infringement from an existing code base?

For example, let's say I take a single frame of animation from a cartoon. The frame contains a mountain, a house, and a couple of characters, although those characters are not integral to the actual cartoon; maybe they're extras (villagers, not named characters like Mickey Mouse, for example).

I draw a picture of a lake with a cabin next to it, then start to draw a frontiersman but I trace one of his arms from a villager of that previous frame of animation... Number one am I in danger of copyright infringement (have I hit some arbitrary threshold), and number two: am I causing monetary losses for the cartoon?

jasonladuke0311 3 years ago |

Merits of the case aside, I'm befuddled that a company with a legal team like Microsoft approved this product. Is their assumption that this would bring in more revenue than potentially defending it in court? The math doesn't make sense to me.

RamblingCTO 3 years ago |

lol @ "open-source software piracy"

If I'm being honest I'm a bit annoyed at this. What's the problem and what's the point of this?

opine-at-random 3 years ago | |

If you'd ever read even a single one of the licenses to the software I'm sure you use everyday, you'd understand. This is such an obvious and pathetic strawman.

I notice often on hackernews that people don't seem to understand anything about free or open-source software outside of the pragmatics of whether they can abuse the work for free.

RamblingCTO 3 years ago | | |

You read a lot into my not so serious comment. Maybe internet comment sections aren't the right place for you.

But I'll bite: I know licensing, thank you. But what's copyrightable is not so easy. Licenses are not so easy. Copilot does not copy entire works and it's very questionable if a few lines of code are "piracy". It's a repeating discussion again and again, there's nothing novel about it except for the fact that a machine learns (and overfits for small portions of code). So please get off your high horse. I don't care for your fundamentalism.

bpodgursky 3 years ago | |

Lawyers want $$$$.

RamblingCTO 3 years ago | | |

Yeah I guess so. This website reads like bullshit bingo from some weird twitter dude trying to sell you his newest product:

"AI needs to be fair & ethical for everyone. If it’s not, then it can never achieve its vaunted aims of elevating humanity. It will just become another way for the privileged few to profit from the work of the many."

Blah blah. Can we get back to the hacking on stuff mentality?

renewiltord 3 years ago |

It doesn't make sense. If I make a piece of software that curls a random gist and then puts it into your editor am I infringing or are you infringing when you run it or are you infringing when you use that file and distribute it somewhere?

lbotos 3 years ago | |

> If I make a piece of software that curls a random gist and then puts it into your editor am I infringing

Depends on the license. If it's MIT and you serve the license, no, you are not infringing at all. A trimmed version of MIT for the relevant bits:

Permission is hereby granted [...] to any person obtaining a copy of this software [...] to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, [...] subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

> are you infringing when you run it

Depends on the license

> are you infringing when you use that file and distribute it somewhere

Depends on the license

----

When copilot gives you code without the license, you can't even know!

renewiltord 3 years ago | | |

Well, `curl` will download a gist without checking its license. So curl is infringing?

mezbot 3 years ago |

This issue seems to have an obvious solution that I fail to see anyone mention: Treat copilot simply as a tool, let it be trained on whatever without any consent requirements. However the outputs should be subject to copyright as with any other code produced by a human. Then on a case by case basis courts can decide if infringement has occurred. The idea of banning copilot or other AI models as a whole just seems like a collective case of sour grapes because innovation and automation is finally threatening some people who only expected these things to affect the working class

EMIRELADERO 3 years ago |

I think it's a great time to explain why this won't hit AI art such as Stable Diffusion, even if GitHub loses this case.

The crux of the lawsuit's argument is that the AI unlawfully outputs copyrighted material. This is evident in many tests with many people here and on Twitter even getting verbatim comments out of it.

AI art, on the other hand, is not capable of outputting the images from its training set, as it's not a collage-maker, but an artificial brain with a paintbrush and virtual hand.

jrochkind1 3 years ago | |

Eh... I don't know. It sounds to me like you are saying because the code example outputs exact lines, it's a copyright violation; but the image AI's necessarily don't output exact copies of even portions of pre-existing images, that's not how they work.

But I don't think copyright on visual images actually works like that, that it needs to be an exact copy to infringe.

If I draw my own pictures of Mickey Mouse and Goofy having a tea party, it's still a copyright infringement if it is substantially similar to copyrighted depictions of Mickey Mouse and Goofy (subject to fair use defenses: I'm allowed to do what would otherwise have been a copyright infringement if it meets a fair use defense, which is also not cut and dried, but if it's, say, a parody, it's likely to be fair use. There is probably a legal argument that Copilot is fair use... the more money GitHub makes on it, the harder that gets, though. Making money off something is not relevant to whether it's a copyright violation in the first place, but it is relevant to a fair use defense).

(yes, it might also be a trademark infringement; but there's a reason Disney is so concerned with copyright on Mickey expiring, and it's not that they think there's lots of money to be made on selling copies of the specific Steamboat Willie movie...)

> There is actually no percentage by which you must change an image to avoid copyright infringement. While some say that you have to change 10-30% of a copyrighted work to avoid infringement, that has been proven to be a myth. The standard is whether the artworks are “substantially similar,” or a “substantial part” has been changed, which of course is subjective.

https://www.epgdlaw.com/how-can-my-artwork-steer-clear-of-co...

I think Stable Diffusion etc are quite capable of creating art that is "substantially similar" to pre-existing art.

EMIRELADERO 3 years ago | | |

I believe fair use is the way to go then. SD would definitely be so, in my opinion.

PuddleCheese 3 years ago | |

These models can actually output images that can be extremely close to the material present in training models:

- https://i.imgur.com/VikPFDT.png

I also don't know if I would anthropomorphize ML to that degree. It's a poor metaphor and isn't really analogous to a human brain, especially considering our current understanding, or lack thereof, of the brain, and even the limited insight we have into how some of these models work from the people who work on them.

kmnc 3 years ago | |

I don’t understand this argument… if image AI gets good enough then generating exact copies of its training model seems trivial.

az226 3 years ago | |

https://arstechnica.com/information-technology/2022/09/bette...

Want to say that again?

solomatov 3 years ago | |

IMO, the case is exactly the same for copilot and generative models for images. That's why it's so important to have some precedent as a guide for future products.

P.S. I am not a lawyer.

fancyfredbot 3 years ago |

If a software developer learns how to code better by reading GPL software and then later uses the skills they developed to build closed source for profit software should they be sued?

thomastjeffery 3 years ago | |

If a software developer writes a program to remember a million lines of GPL code, then uses that dataset to "generate" some of that code, then they are essentially violating that license with extra steps.

The extra steps aren't enough to exonerate them. It's just a convoluted copy operation.

It's just like how a lossy encoding of a song is still - with respect to copyright - a copy of that song. The data is totally different, and some of the original is missing. It's still a derivative work. So is a remix. So is a reperformance.

buzzy_hacker 3 years ago | |

Copilot is not a person, it is a piece of software.

Phrodo_00 3 years ago | |

Depends on how closely they reuse the code. Writing it verbatim or nearly? Yes.

jacooper 3 years ago | |

A human doesn't perfectly reproduce the same code he learned from.

throwaway675309 3 years ago | | |

A person with eidetic memory absolutely could do so.

hjroberts 3 years ago |

Whether it is legally wrong or not to scan OSS code (I think it is wrong), there has been a time-honored precedent for disallowing automated scanning:

  robots.txt

This is exactly what is needed for source code, and the default (no robots.txt) should be "disallow".

The fact that the Web has considered this moral issue should be a strong hint for the AI people not to take a purely legal stance but consider the OSS community that they are so heavily using.
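A minimal TypeScript sketch of what a default-"disallow" check could look like. Everything specific here is an assumption: there is no standard per-repo policy file for training crawlers, `mayScan` and `parsePolicy` are made-up names, and the parser handles only a small subset of robots.txt syntax (`User-agent`, `Allow`, `Disallow`, with the longest matching path prefix winning). The one deliberate departure from real robots.txt is that a missing file, or no matching rule, means "no".

```typescript
type Rule = { allow: boolean; prefix: string };

// Parse a robots.txt-style policy into rule lists keyed by lowercased user agent.
function parsePolicy(text: string): Record<string, Rule[]> {
  const groups: Record<string, Rule[]> = Object.create(null);
  let current: string[] = [];
  let lastWasAgent = false;
  for (const raw of text.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      if (!lastWasAgent) current = []; // a new group starts here
      const agent = value.toLowerCase();
      current.push(agent);
      if (!groups[agent]) groups[agent] = [];
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      lastWasAgent = false;
      if (value === "") continue; // simplification: empty rules are ignored
      for (const agent of current) {
        groups[agent].push({ allow: field === "allow", prefix: value });
      }
    }
  }
  return groups;
}

// May `agent` read `path`? `policy` is the file's text, or null if the repo has none.
function mayScan(policy: string | null, agent: string, path: string): boolean {
  if (policy === null) return false; // no file: default to "disallow"
  const groups = parsePolicy(policy);
  const rules = groups[agent.toLowerCase()] || groups["*"] || [];
  let best: Rule | null = null;
  for (const rule of rules) {
    const matches = path.indexOf(rule.prefix) === 0;
    if (matches && (best === null || rule.prefix.length > best.prefix.length)) {
      best = rule; // longest matching prefix wins
    }
  }
  return best !== null ? best.allow : false; // no matching rule: still "disallow"
}

const policy = [
  "User-agent: *",
  "Disallow: /",
  "",
  "User-agent: ExampleTrainer # hypothetical crawler name",
  "Allow: /examples/",
  "Disallow: /",
].join("\n");

console.log(mayScan(policy, "ExampleTrainer", "/examples/demo.ts"));
console.log(mayScan(null, "ExampleTrainer", "/examples/demo.ts"));
```

Like robots.txt itself, this would only be a convention that well-behaved crawlers choose to honor. It doesn't settle any of the licensing questions by itself.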

atum47 3 years ago |

Forgive my ignorance, but who is going to benefit from this lawsuit? I have a lot of code on GitHub, can I, for instance, expect a check in the mail in case of a win?

gpm 3 years ago | |

(Not a lawyer, so this is really definitely absolutely not legal advice and if you're looking to profit you should speak to a lawyer... for instance the lawyers who just filed the lawsuit)

They're asking for two things, injunctive relief (ordering github/openai/microsoft to stop doing this) and damages.

I suppose the injunctive relief really benefits anyone who doesn't want AI models to exist, because that's what it's asking for.

The damages will go to the members of the class certified for damages, with more going to the lead plaintiffs (those actually involved in the suit) and some going to the lawyers. They're asking for the following class definition for damages:

> All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time during the Class Period.

atum47 3 years ago | | |

> if you're looking to profit you should speak to a lawyer

No, I'm just teasing... If a neural network learns how to program by reading my code, it will generate a mess with tabs and spaces mixed together.

datacruncher01 3 years ago |

I think the software is probably OK provided that the sources are credited (i.e., if Copilot copies code from, say, SDL, then the relevant code sections need to be correctly attributed and the mandatory license readme copied to the project, so all code follows the open source licenses used). That's literally the purpose of open source licenses. If Copilot can't be bothered to do that, then yeah, it should be shut down.

cothrowaway88 3 years ago |

Made a throwaway since I guess this stance is controversial. I could not care less about how copilot was made and what kind of code it outputs. It's useful and was inevitable.

I'm 1000% on team open source and have had to refer to things like tldrlegal.com many times to make sure I get all my software licensing puzzle pieces right. Totally get the argument for why this litigation exists in the present.

Just saying in general my friends I hope you have an absolutely great day. Someone will be wrong on the internet tomorrow, no doubt about it. Worry about something productive instead.

This one has the feel of being nothing more than tilting at windmills in the long run.

0cf8612b2e1e 3 years ago |

Is there any amount of public data/code/whatever I can make an offline backup of today in the event this gets pulled?

kyleee 3 years ago | |

That’s what I am wondering, as a contingency plan so at least a replica service can be created if copilot shuts down.

matthewwolfe 3 years ago |

I will never understand why people push code to public repos and then complain when someone or something uses that code. Code that you want to keep private or make money off of should be private. Only publish stuff to the public that you want other people to see and learn from. All the complaints about attribution… who cares.

YoshiRulz 3 years ago | |

> All the complaints about attribution… who cares.

I may not care if some guy I've never met uses my niche library without attribution. (I do care, really.) But Microsoft certainly cares if you use their code without attribution, so why shouldn't I take the same belligerent, copyright-enforcing attitude towards them? That's the main reason why people are angry, because MS has "rules for thee but not for me" by virtue of being big enough to have ~~good~~effective lawyers and lobbyists.

matthewwolfe 3 years ago | | |

Copilot is trained on public repos. I'd imagine if Microsoft doesn’t want you to use their code, that code would be in a private repo. There’s nothing stopping me from using code in a public repo, regardless of the license.

pmarreck 3 years ago |

This will fail. Copilot is too good, and only suggests snippets or small functions, not entire classes for example.

User23 3 years ago |

Copilot is clearly a derivative work. So is every other similar model. How is this even up for discussion?

stovenctl 3 years ago |

The comparison I would draw is it's a statistics based search engine for code.

Sometimes the query is the first half of a small statement that we can fill in with common patterns. Useful, fair.

Sometimes the query is a signature like `fn fast_inv_sqrt` that copies someone's code and doesn't attribute it.

nuc1e0n 3 years ago |

My own view is that it is not currently legal for humans to produce derivatives of copyrighted works. So it is probably already not legal to train an artificial intelligence on copyrighted works in order to produce derivatives either.

jjgon1781 3 years ago |

I am surprised by the number of people who are in favor of Copilot being trained on copyrighted data.

scoot 3 years ago |

The editorialized title isn't correct. The lawsuit is against GitHub for Copilot not against GitHub Copilot, which is not a "legal person".

A better shortening of the original title is simply "We’ve filed a lawsuit challenging GitHub Copilot"

reachableceo 3 years ago |

Let me (start or join the call) for federal investigation and the filing of criminal complaints in all relevant locales.

Grand theft, interstate wire fraud, and conspiracy for same.

This is a criminal matter as well as civil. Intentional and knowing violation of the law.

We must not let our work be taken!

gcau 3 years ago |

As much as I love the little guy beating the big evil company, I hope the lawsuit doesn't cause anything to happen to copilot. Maybe some changes, like better protection against emitting 1:1 licensed code or opting out your code from training.
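
One way such protection could work, purely as a sketch: compare each suggestion against known licensed snippets and flag long runs of identical tokens. Everything here is an assumption for illustration (the tokenizer, the helper names `tokenize`, `longestSharedRun` and `looksVerbatim`, and the 20-token threshold); it is not how Copilot actually filters output.

```javascript
// Hypothetical sketch: flag a suggestion that shares a long run of tokens
// with a known licensed snippet. Tokenizer and threshold are assumptions.
function tokenize(code) {
  // Identifiers and numbers are single tokens; any other non-space
  // character is its own token, so whitespace differences are ignored.
  return code.match(/[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|\S/g) || [];
}

// Length of the longest run of consecutive tokens that appears in both inputs
// (classic longest-common-substring dynamic programming over tokens).
function longestSharedRun(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  let best = 0;
  let prev = new Array(tb.length + 1).fill(0);
  for (let i = 1; i <= ta.length; i++) {
    const cur = new Array(tb.length + 1).fill(0);
    for (let j = 1; j <= tb.length; j++) {
      if (ta[i - 1] === tb[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) best = cur[j];
      }
    }
    prev = cur;
  }
  return best;
}

function looksVerbatim(suggestion, licensed, minTokens = 20) {
  return longestSharedRun(suggestion, licensed) >= minTokens;
}
```

A check like this ignores whitespace but is defeated by renaming a single identifier in the middle of a run, which is part of why the threshold and tokenization matter so much.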

vlovich123 3 years ago |

Can someone explain to me Microsoft’s decision here to use GPL code in the training set? It would seem like sticking to non-attribution / non-viral licenses would have kept them in the clear. Was that an insufficient size data set?

az226 3 years ago | |

It only trains on the GPL code, it doesn't reproduce entire code files verbatim. So it's fair use.

vlovich123 3 years ago | | |

Except when it does

eurasiantiger 3 years ago |

Maybe we just need to prompt it to include the proper licenses and attributions. /s

tmtvl 3 years ago | |

Eh, I don't mind Copilot being trained on my code as long as it and all projects made using it are licensed under the AGPL.

thesuperbigfrog 3 years ago |

How original is the generated code?

Can the generated code be traced back to the code used for training and the original copyrights and licenses for that code?

If so, what attribution(s) and license(s) should apply to the generated code?

dmitrygr 3 years ago | |

They demonstrate generated code being identical to some training code.

avian 3 years ago | | |

There were well-known examples of Copilot reproducing exact code snippets well before this lawsuit (e.g. Quake's fast inverse square root function). Microsoft dealt with them by simply adding the offending function names to a blocklist.

In other words, if your open source project doesn't have such immediately recognizable code and didn't cause a shitstorm on Twitter, chances are copilot is still happily spewing out your exact code, sans the copyright and license info.

m00x 3 years ago | | |

Just like developers have never copy-pasted code from stack overflow or Github :):):)

Swizec 3 years ago | | |

How many ways are there to write many of the basic algorithms we all use though? Can I copyright "({ item }) => <li>{item.label}</li>"?

Because I sure have seen that exact code written, from scratch, in many many places.

I guess my question boils down to "What is the smallest copyrightable unit of code?". Because I'm certain suing a novelist for copyright infringement on a character that says "Hi, how are you?" would be considered absurd.

arpowers 3 years ago |

The proper way to think about these LLMs is in terms of plagiarism.

Seems to me the underlying data should be opt-in from creators, and licenses should be developed that take AI into consideration.

Aeolun 3 years ago |

I find this whole subject exhausting. The only reason I’m glad there is a lawsuit is that we can finally put this thing to rest when either party wins.

Yahivin 3 years ago |

Copilot does include the licenses...

Start off a comment with // MIT license

Then watch parts of various software licenses come out including authors' names and copyrights!

marmada 3 years ago |

All these people whining about copyright need to consider: is the issue Copilot, or is the issue copyright?

amelius 3 years ago |

Can Copilot reproduce Numerical Recipes in C?

(asking because I know the authors were kinda famous for being very litigious).

HeavyStorm 3 years ago |

"Angry people brandish their fists against the incoming revolution" is also a good title.

sensanaty 3 years ago |

I personally hope they win, and win big. Anything that ruins Micro$oft's day is a boon to mine.

clusterhacks 3 years ago |

Did Microsoft use the source code of Windows (in whole or in part) as training input to Copilot?

az226 3 years ago | |

Microsoft didn't do the training. OpenAI did. They only used public code.

machiste77 3 years ago |

bruh, come on! you're gonna ruin it for the rest of us

kgarten 3 years ago |

on a tangent ... beautiful typography, I love Matthew Butterick's work on legible fonts and his guide to practical typography.

all the best with the lawsuit.

barelysapient 3 years ago |

MSFT to $0 anyone?

i_like_apis 3 years ago |

I love that this is going to lose.

SighMagi 3 years ago |

I did not see that coming.

SurgeArrest 3 years ago |

I hope this case will fail and establish a good precedent for all future AI litigation, and maybe even prevent new cases. Your code is open source - irregardless of license, one might read it as a text book and then remember or even copy snippets and re-use this somewhere else unrelated to the original application. If you don't like this, don't make your code open source. This was happening and is happening, independent of any license, all over the world, by the majority of developers. What Copilot and similar tools did was to make those snippets accessible for extrapolation in new applications.

If these folks win - we again throw progress under the bus.

jacooper 3 years ago | |

No thank you. I put a license to be followed, not to just be disregarded by an AI as "Learning material". No human perfectly reproduces their learning material no matter what, but Copilot does.

mcluck 3 years ago | | |

You mean to tell me that no one has ever perfectly replicated an example that they read somewhere? There's only so many ways to write AABB collision, fibonacci, or any number of other common algorithms. I'm not saying there aren't things to consider but I'm sure I've perfectly replicated something I read somewhere whether I'm actively aware of it or not

IshKebab 3 years ago | | |

So are you ok with it being illegal for humans to learn from copyrighted books unless they have a license that explicitly allows learning? That does not sound like a pleasant consequence.

throwaway675309 3 years ago | | |

100% false, there are loads of historical cases of people with eidetic memories being able to reproduce things that they've seen with near complete fidelity, there's no reason to believe that a coder with such a memory would be any different.

Etheryte 3 years ago | |

> Your code is open source - irregardless of license, one might read it as a text book and then remember or even copy snippets and re-use this somewhere else unrelated to the original application.

Yes, but attribution should still be given. Just because you don't copy-paste someone else's creation doesn't mean you're licensed to use it.

shagie 3 years ago | | |

Is it the role of the tool (in this case copilot) to include the license information? Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?

What if, instead of a tool, you had a random consultant do some work, and it was found out that he asked a ton of stuff on Stack Overflow and copied the CC-BY-SA 4.0 answers into his work? What if it was then found out that one of those answers was based on copying something from the Linux kernel? Who is responsible for doing the license check on the code before releasing the product?

humanwhosits 3 years ago | |

> irregardless of license

Hard no. Please stop using open source code if this is how you think of it.

Without licenses being respected, we don't get open source communities.

az226 3 years ago | | |

Licenses be damned, copyright law sits above it -- and for now, it's hard to see how this isn't fair use. The only case might be an open source Copilot alternative and GitHub and OpenAI can take any such projects out of the training set.

vesinisa 3 years ago | |

Open source does not mean public domain. Open source specifically attaches limitations on how the code may be reused.

elcomet 3 years ago | | |

There are no limitations on reading the code to learn from it.

simion314 3 years ago | |

> Your code is open source ....

So why can MS screw with only some licenses, the ones you call "open source"? Your example of a human reading a book would also work for code under source-available licenses or for decompiled binaries.

I would have been fine if the open source code had been used to create an open model, or if MS had put its ass on the line and also trained the model on all the GitHub code, since they claim there is no copyright issue.

solomatov 3 years ago | |

The problem is that copyright laws were introduced for a reason, and with a thinking similar to yours we might decide to get rid of copyright altogether, which I think is a bad idea.

P.S. I am not a lawyer.

tfsh 3 years ago | |

If organisations are going to ignore the licenses attached to my OSS and that's legitimised in the law, then that's a surefire way to irreparably damage the open source ecosystem.

ISL 3 years ago |

Can anyone with Copilot access give a short summary of its response to the prompts:

  function force=Gmmr2Array(mass1, mass2)

and

  function [force, torque]=pointMatrixGravity(array1,array2)

I'd love to know if some of my GPL v3 code [1, 2] has landed in the training set

[1] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...

[2] https://github.com/4kbt/NewtonianEotWashToolkit/blob/master/...

jmcphers 3 years ago | |

First prompt:

    var G = 6.67e-11;
    var force = G * mass1 * mass2 / distance * distance;
    return force;
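
A note on that first completion: in JavaScript `/` and `*` have equal precedence and associate left to right, so `G * mass1 * mass2 / distance * distance` divides by the distance and then multiplies by it again. The sketch below contrasts that with the inverse-square form. It assumes a `distance` parameter, which the completion itself never defines, and the function names are made up for illustration.

```javascript
// Sketch only: `distance` is an assumed parameter; the completion above
// uses it without ever defining it.
const G = 6.67e-11; // same rounded constant as in the completion

// As written in the completion: evaluates as (G * m1 * m2 / distance) * distance,
// so the distance cancels out.
function forceAsCompleted(mass1, mass2, distance) {
  return G * mass1 * mass2 / distance * distance;
}

// Newton's law of gravitation: G * m1 * m2 / r^2.
function forceInverseSquare(mass1, mass2, distance) {
  return (G * mass1 * mass2) / (distance * distance);
}
```

For two 1 kg masses 2 m apart, the first function returns `G` unchanged while the second returns `G / 4`.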

Second prompt:

  % pointMatrixGravity - Calculates the gravitational force and torque on a
  % point mass due to a matrix of point masses.
  % 
  %   [force, torque]=pointMatrixGravity(array1,array2)
  % 
  %   Inputs:
  %       array1 - 3xN array of point masses
  %       array2 - 3xN array of point masses
  %   Outputs:
  %       force - 3x1 array of gravitational force
  %       torque - 3x1 array of gravitational torque
  % 
  %   Author: David J. R. MacKay,
  %   email:

ridiculous_fish 3 years ago | | |

FYI if you use the "Open GitHub Copilot" command in VSCode you will get up to 10 different outputs for the same prompt.

Interesting that my results were different than yours!

jm4 3 years ago | |

I got something a bit different from the other person who responded to you.

  function force=Gmmr2Array(mass1, mass2) {
      var G = 6.67408e-11;
      var m1 = mass1;
      var m2 = mass2;
      var r = 1;
      var force = G * m1 * m2 / (r * r);
      return force;
  }

  function [force, torque]=pointMatrixGravity(array1,array2) {
      var G = 6.67408e-11;
      var m1 = array1(1);
      var m2 = array2(1);
      var r = norm(array1(2:4)-array2(2:4));
      var force = G * m1 * m2 / (r * r);
      var torque = cross(array1(2:4)-array2(2:4), force);
      return [force, torque];
  }

m00x 3 years ago |

The only people who gain out of class lawsuits are the lawyers.

This person (a lawyer) saw an opportunity to make money and jumped on it like a hungry tiger on fresh meat.

tasuki 3 years ago | |

I have quite a bit of respect for Matthew Butterick. I don't think he's just a lawyer looking to earn a quick buck. He cares about software and wants to make the world a better place.

> But neither Matthew Butterick nor anyone at the Joseph Saveri Law Firm is your lawyer

This is curious. None of them are my lawyers, but surely at least some of them are someone's lawyers? Isn't it wrong to put such a blanket disclaimer on a website which might well be read by their clients?

alpaca128 3 years ago | |

So he gets to make money with his profession while defending OSS licenses? I don't see the big problem.

alsodumb 3 years ago | |

This. I've seen so many class action lawsuits where at the end of the day the highest gain per capita always ends up going to the lawyers. Fuck this guy and everyone trying to make money from this.

Entinel 3 years ago |

I don't have a comment on this personally, but I want to throw this out there because every time I see people criticizing Copilot or Dall-E someone always says "BUT IT'S FAIR USE!" Those people don't seem to grasp that "Fair Use" is a defense. The burden is not on me to prove what you are doing is not fair use; the burden is on you to prove what you are doing is fair use.

VoodooJuJu 3 years ago |

As celestialcheese says [1], it seems like a manufactured case for the purpose of furthering someone's legal career rather than seeking remittance for any violations made by Copilot.

But I like to put on my conspiracy hat from time to time, and right now is one such time, so let's begin...

Though the motivations behind this case are uncertain, what is certain is that this case will establish a precedent. As we know, precedents are very important for any further rulings on cases of a similar nature.

Could it be the case that Microsoft has a hand in this, in trying to preempt a precedent that favors Copilot in any further litigation against it?

Wouldn't put it past a company like Microsoft.

Just a wild thought I had.

[1] https://news.ycombinator.com/item?id=33457826

bugfix-66 3 years ago |

Ask HN: I want to modify the BSD 2-Clause Open Source License to explicitly prohibit the use of the licensed software in training systems like Microsoft's Copilot (and use during inference). How should the third clause be worded?

  The No-AI 3-Clause Open Source Software License

  Copyright (C) <YEAR> <COPYRIGHT HOLDER>

  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in
     the documentation and/or other materials provided with the
     distribution.

  3. Use in source or binary forms for the construction or operation
     of predictive software generation systems is prohibited.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

https://bugfix-66.com/f0bb8770d4b89844d51588f57089ae5233bf67...

60secs 3 years ago |

This is why we can't have nice dystopias.

  function isEven(n) {
    if (n == 0) return true;
    else if (n == 1) return false;
    else if (n < 0) return isEven(-n);
    else return isEven(n - 2);
  }

  console.log(isEven(50));
  // → true
  console.log(isEven(75));
  // → false
  console.log(isEven(-1));
  // → ??