An open source lawyer’s view on the copilot class action lawsuit

An open source lawyer’s view on the copilot class action lawsuit (katedowninglaw.com)

118 points by spiffage 3 years ago | 175 comments

belorn 3 years ago |

A very interesting interpretation of the GitHub TOS. Kate Downing is saying that users of GitHub are giving a special license to GitHub, one that bypasses the original license. However if that is true then any upload of code that users do not have 100% copyright control of is then a copyright violation since the user would not have the authority to grant github that special license. It would be similar to a user uploading a copyrighted movie to YouTube, and Google using that as a license to use the movie in an advertisement.

I wonder if a court would think that Microsoft in this case has done its due diligence to verify that the license grants it got from users are correct and in order.

hyperman1 3 years ago | |

I also wondered about this when I read the TOS.

e.g. 4. [..] You grant us [..] the right to [..] parse, and display Your Content [..] as necessary to provide the Service, This license includes [...] show it to [...] other users; parse it into a search index or otherwise analyze it

As the Service now includes copilot, publishing anything on Github seems to give them the right to use it in copilot. Maybe even for private repos

Besides the issue we're currently discussing, I also wonder about:

5. [..] you grant each User of GitHub a [..] license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).

So if you find GPLed content on github, you might be allowed to violate the GPL as long as it happens only on github. I don't know how bad this is in practice. Their CI presumably allows you to run code for other people without granting them the rights the GPL should give them, but that might be a violation of the Github TOS as this might be abuse of the CI servers.

This might also mean you violate the GPL when publishing someone else's GPLed code on github, as you have now granted Microsoft and others rights not included in the GPL.

Clearly, IANAL, don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.

https://docs.github.com/en/site-policy/github-terms/github-t...

belorn 3 years ago | | |

> Clearly, IANAL, don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.

Yes. This was one of the legal theories behind why Apple refuses to allow GPL software in the Mac App Store. The TOS that Apple requires from developers gives Apple specific rights which the GPL does not grant, and thus any software that gets uploaded must be assumed to be provided under two separate licenses. Given that many free and open source projects have multiple authors, it is a rather large assumption that the person who uploads the software has the complete authority to provide the software under multiple conflicting licenses.

It is after all the distributor that has to do the due diligence to confirm that they are in the right to distribute.

dathinab 3 years ago | |

It also falls under the aspect of "hidden surprises" which could mean that this part of the TOS wrt. this specific aspect might not be legally binding/valid. At least in the EU. Or it might.

TazeTSchnitzel 3 years ago | |

> if that is true then any upload of code that users do not have 100% copyright control of is then a copyright violation since the user would not have the authority to grant github that special license

That doesn't sound right. Licences can allow sublicensing, and I think all the popular open-source ones do.

belorn 3 years ago | | |

Sublicensing can only create additional restrictions on top of the existing conditions inside the license. All open source licenses require at minimum that distribution provides attribution and the original copyright notice. Licenses like the GPL have additional conditions.

There are also additional problems specific to sublicenses. In the United States, only exclusive licensees are assumed by statute to have a right to sublicense. The theory is that licensees of exclusive licensees are assumed to have the control/authority similar to that of the author. Nonexclusive licensees are not assumed to be granted such a monopoly by the licensor.

lindenksv1 3 years ago | |

Kate Downing here. This is an excellent question. So, just like YouTube, GitHub would likely argue that they are protected by the DMCA and that so long as they comply with DMCA take-down requests, they are not liable for copyright infringement (direct or indirect) for third party content posted to GitHub by people other than the copyright owners. Remember that the DMCA effectively shifts that due diligence you speak of away from providers of online services and onto copyright holders themselves. Without the DMCA, many businesses that rely on user-generated content just wouldn't exist because that due diligence isn't possible at scale - it's often not even possible for individual pieces of content because the publication of any copyrighted work can be very obscure and because in the US you can hold a copyright without formally registering it.

In practice, I think the entire open source world knows that people post each other's open source code on GitHub. Even projects that have very purposefully chosen to primarily use other services or self-host their source code are well aware that their code gets mirrored on GitHub and/or included in other people's repos on GitHub. Up until now, I don't think this has been controversial and I don't think GitHub gets a lot of takedown requests for this practice. I think most developers see this as a feature, not a bug. Copilot might make people rethink whether or not they want to start sending take-down requests but that'll be a tough call for a lot of people because withholding code from GitHub to avoid its usage in Copilot also effectively means making their code less easily available to the rest of the world. It may be very disruptive to other projects that include the copyright owner's code in their own projects.

ghoward 3 years ago | | |

I am an Open Source developer. My code is not on GitHub and never will be.

If my code was uploaded on GitHub, I would DMCA it because of Copilot, but it wouldn't matter because the information is already in the model. So the DMCA does not help here.

The only way it would help is if I could DMCA the entire model and force them to retrain without my code. As it stands, this lawsuit is the only way for GitHub to be reined in; I don't have the resources to do so on my own.

IANAL.

Also, about high impact, suppose Copilot has 1 million users that use it on average 10 times a day, 5 days a week. You claim that less than 1% of uses of Copilot would result in copyright violation. Let's assume 0.1%. How many times would copyright violation happen per day? It would happen 10,000 times per day. For five days a week.

It would take a mere twenty weeks (less than six months) to reach a million violations.

That seems impactful.
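
The back-of-envelope arithmetic above can be checked with a short sketch. The inputs are the comment's own hypotheticals (1 million users, 10 uses a day, a 0.1% violation rate, 5 days a week), not measured data, and the function and constant names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical inputs from the comment above, not measured data.
USERS = 1_000_000
USES_PER_USER_PER_DAY = 10
VIOLATION_RATE = Fraction(1, 1000)  # 0.1% of uses
DAYS_PER_WEEK = 5
TARGET = 1_000_000


def violations_per_day(users, uses_per_user, rate):
    """Expected violations per working day."""
    return int(users * uses_per_user * rate)


def weeks_to_reach(target, per_day, days_per_week):
    """Whole weeks needed to reach `target` violations (rounded up)."""
    per_week = per_day * days_per_week
    return -(-target // per_week)  # ceiling division


per_day = violations_per_day(USERS, USES_PER_USER_PER_DAY, VIOLATION_RATE)
weeks = weeks_to_reach(TARGET, per_day, DAYS_PER_WEEK)
print(per_day, weeks)
```

Under these assumptions the numbers match the comment: 10,000 per day and 20 weeks to reach a million. The result scales linearly with the assumed rate, so at 0.01% it would take 200 weeks instead.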

Andrew_nenakhov 3 years ago |

A hypothetical question: imagine a filmmaker who has studied a lot of obviously copyrighted movies by famous, renowned directors. This means he has trained his neural network using their copyrighted, licensed content. Does he breach copyright when he composes and films a scene? Are visual quotes copyright theft? Homages? Did George Lucas infringe copyright when he was borrowing compositions from "Triumph of the Will"?

steve_gh 3 years ago |

Hmmm. I'm interested in the GitHub ToS, which (if I understand correctly) basically says that GitHub and its affiliates (MS) can use anything you post on GitHub to improve their service.

What if I build an AGPL licenced service, using GitHub to coordinate development? According to the ToS, MS could offer a version of my service because I posted the code on GitHub, and they are using it to improve their service to me. According to my AGPL licence, they would need to share their source.

So which takes precedence. The licence or the ToS?

visarga 3 years ago |

I think copyright itself might be on its way out. What meaning does a copyright have when I can click "Variations" on anything and get 4 suggestions in 10 seconds? Imagine how good they will be by 2030.

hooby 3 years ago | |

Copyright was originally intended to protect the creators of a work. Over many years it has now mostly become a tool for large companies to accumulate rights (on works they didn't create themselves) and monetize them.

Maybe a reform is needed, to find a way back to the original purpose.

dragonwriter 3 years ago | | |

> Copyright was originally intended to protect the creators of a work.

No, it wasn’t. Copyright was originally intended to protect the publishers of a work. It was later transformed to nominally focus on the creators, but even this was lobbied for by publishers in their own self-interest after the old law directly protecting them was allowed to lapse, and because it still had the same net effect since realizing value meant licensing to a publisher in most practical cases, so the publishers were still major beneficiaries.

And, of course, US copyrights under the Constitution do not exist for the purpose of protecting creators; instead, a private benefit for creators is the mechanism, but the purpose is expressly to “promote the progress of science and useful arts”.

izacus 3 years ago | |

There has never been more support for tightening and enforcing copyright than there is today. This is very unlikely to change due to megacorps like Microsoft, Disney, Apple et al. having a massive vested interest in using it to extract maximum profits.

classified 3 years ago | | |

Copyright protection for the rich and powerful, while those who cannot afford armies of lawyers get their stuff stolen by machine learning models. Sounds credible to me.

mjw1007 3 years ago |

I think this is the most interesting part:

> [Github's Terms of Service] specifically identifies “GitHub” to include all of its affiliates (like Microsoft) and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates.

tryre 3 years ago | |

No, the misinterpretation of the ToS is not the most interesting part. The part that clearly shows her colors is:

"It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions."

1MachineElf 3 years ago | | |

Ah, so she is an "open source lawyer" in an OSI Foundation sense...

LesZedCB 3 years ago |

out of curiosity, would anybody else cease to have an issue with copilot if it was an open source model?

i'm not paying for copilot right now because i'm waiting for this to shake out. but i'd be happy to pay (even their current asking price) if i knew the model was also open source and could be self hosted.

maybe this is the wrong way to ask the question, but hopefully it makes sense

MattPalmer1086 3 years ago |

Has anyone produced a legally watertight license or clause for other licenses that prevents code being used for training of copilot-like services?

insanitybit 3 years ago | |

The article addresses this in a number of ways.

For example,

> That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.

MattPalmer1086 3 years ago | | |

I'm not sure I understand your point.

The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.

If a license specifically prohibits copying the code for a purpose, then it is a violation of the copyright to copy the code for that purpose. You have no other legal way to do it.

These aren't magic words, they are legal obligations. Ok, well maybe legal obligations are magic words. But it is magic that works :). Otherwise things like GPL could not function.

rwmj 3 years ago | |

It would be a Field of Endeavor restriction so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.

(Section 6 here: https://opensource.org/osd)

MattPalmer1086 3 years ago | | |

I don't really care if a license meets some arbitrary definition.

Let's say I added a clause to my BSD license that prohibits the copying of this code to train ML models.

Would that not immediately make GitHub in violation of this license?

Or do they only train it where the license is explicitly one of the ones it knows about?

6stringmerc 3 years ago |

I have a companion piece talking about music and training AI/ML:

https://medium.com/@6StringMerc/artificial-intelligence-mach...

terminal_d 3 years ago |

If this isn't enough incentive to move away from github, then I don't know what is.

hnbad 3 years ago |

> It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions.

This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.

Legal opinion on Copilot and generative AI in general hinges entirely on metaphors. If the AI is understood to behave like a human being building knowledge and drawing from it for inspiration, Copilot is just another way to write code. But we've already established legal precedent that machines can not hold copyright, which suggests that they can not be deemed to be creative, which could be used to argue that they are therefore just creating an inventory of copyright works and creating mechanical mashups.

The author's dismissal also ignores that this would not JUST result in attribution. If Copilot indexed copyleft code and were required to provide attribution when using this code, the output might also be affected and this could in turn affect the entire code base. Worse yet, Copilot may output code with conflicting licenses. The author considers only the possibility that Copilot itself might have to inherit the license (and the dismissal that it would "help noone" because it runs on a server ignores both the existence of a (presumably self-hosted) enterprise service and the existence of licenses like AGPL, which would still apply) but it seems most people's concerns are with the output instead.

I also fail to understand how the argument that it doesn't reproduce the code exactly 99% of the time is helpful. If I copy code and rename the variables and run an autoformatter on it, it's still a copy of the code. It's odd to see a lawyer use what is essentially obfuscation as a defense against copyright claims. Also 1% is an incredibly large number given how Copilot is supposed to be used and how large the potential customer base is. Given the direction GitHub is heading with "Hello GitHub" (demoed at GitHub Universe yesterday) it's not unlikely that Copilot would in some cases be used to generate hundreds, thousands or tens of thousands of lines of code in a single project.

The question isn't just whether Copilot is violating the law or not, the question is why it is or isn't because that could have wide implications outside GitHub itself. But as the author points out, sadly the lawsuit doesn't try to settle this for copyright, which might be the most impactful question.
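
The point above about renamed variables and autoformatting can be illustrated with a rough sketch using Python's standard `ast` module: two snippets that differ only in identifiers and whitespace reduce to the same normalized syntax tree. The `fingerprint` helper and the sample snippets are hypothetical. This says nothing about how Copilot or a court compares code, and structural identity is not a legal test of copying.

```python
import ast
import builtins


class _Renamer(ast.NodeTransformer):
    """Replace user-chosen identifiers with positional placeholders (v0, v1, ...)."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # Leave builtins such as len or print alone so they stay distinguishable.
        if hasattr(builtins, name):
            return name
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node


def fingerprint(source):
    """Dump of the syntax tree with names normalized; layout is ignored by the parser."""
    tree = _Renamer().visit(ast.parse(source))
    return ast.dump(tree)


ORIGINAL = "def total(items):\n    acc = 0\n    for x in items:\n        acc += x\n    return acc\n"
RENAMED = "def  sum_all( values ):\n  result = 0\n  for v in values:\n      result += v\n  return result\n"
DIFFERENT = "def total(items):\n    return len(items)\n"

print(fingerprint(ORIGINAL) == fingerprint(RENAMED))
print(fingerprint(ORIGINAL) == fingerprint(DIFFERENT))
```

The sketch only handles plain functions and variables; attributes, imports, and class names are left untouched, so it is a demonstration of the idea and not a usable similarity detector.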

iLoveOncall 3 years ago |

This lawsuit is open-source developers destroying open-source.

Havoc 3 years ago |

What’s the point of licenses if TOS overrides it?

baby 3 years ago |

This is why we can’t have nice things. Copilot is the future

nomilk 3 years ago |

If organic neural networks are allowed to read and learn from open source code, why should an artificial one be any different?

geysersam 3 years ago | |

1. Humans are not neural networks. 2. Humans are not allowed to directly copy even rather short snippets of licenced code. 3. Humans do not have the capacity to memorize the entirety of GitHub.

fhd2 3 years ago | | |

I can't shake the feeling that a lot of the logic around ML models having more or less the same "rights" as humans comes from misleading marketing that they, in any shape or form, resemble human intelligence. AI is a buzzword applied to any kind of algorithm for an activity that people previously thought couldn't be automated.

Back when I was young, graph pathfinding algorithms were called AI. A few decades later they are a well understood commodity and I haven't seen anyone call them AI for a while. Maybe that'll happen to LLMs too, given a few years?

throwaway290 3 years ago | |

For one, an organic network (for the sake of the argument I'll play along if you want to reduce a human to this) has rights, freedoms and ethical values and is not controlled by a single entity and has not specifically been instantiated to generate profit for such.

insanitybit 3 years ago |

HN is so insanely frustrating, so many comments demonstrate that the user didn't read this article at all. Just immediately jumping into a "but what about this argument that I made?".

robocat 3 years ago | |

  Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."

https://news.ycombinator.com/newsguidelines.html

insanitybit 3 years ago | | |

Yeah, I'm aware, this is just so extreme at this point it feels worth pointing out.

  # Copyright (c) 2022 David Allison. All rights reserved.
  for num in range(100):
      if num % 3 == 0 and num % 5 == 0:
          print("DA: fizzbuzz")
      elif num % 3 == 0:
          print("DA: fizz")
      elif num % 5 == 0:
          print("DA: buzz")
      else:
          print(num)