FSF-calls for white papers on philosophical and legal questions around Copilot (fsf.org)

283 points by non_sequitur 4 years ago | 198 comments

davisr 4 years ago |

The ignorance in this comment section is already giving me an aneurysm. Software licenses matter. Copyright matters. If megacorps like Microsoft can sue people into oblivion for violating their copyright terms, people can sue Microsoft into oblivion for violating theirs. I don't use MS Github, I have no skin in the game, but I hope there is at least a $1000 award for every instance of AGPL and GPL license violation, because what they're doing is unfair and illegal.

This isn't ML, it is a ripoff and is violating clear software licensing terms. https://news.ycombinator.com/item?id=27710287

Software freedom matters, but I wouldn't expect the typical HN type to understand, since their money is made on exploiting freely-available software, putting it into proprietary little SaaS boxes, then re-selling it.

heavyset_go 4 years ago | |

> The ignorance in this comment section is already giving me an aneurysm. Software licenses matter. Copyright matters.

If anyone thinks they don't, ask why Microsoft didn't train Copilot on their Windows, Office, or Azure source repositories.

zarzavat 4 years ago | | |

Microsoft (presumably) did train it on their open source repositories, since those repositories are public GitHub repos. They didn't train it on anybody's private repositories.

cromka 4 years ago | | |

Case closed, everybody go home.

lostmsu 4 years ago | | |

Because that's extra work to wire them up? Until recently Windows wasn't even in Git.

xxpor 4 years ago | |

Software licenses have barely been tested in court, let alone how they apply to code injected and combined with other code via machine learning. You're extremely overconfident about how this will actually play out.

For one, just because your code is covered by the GPL, it doesn't mean every single line in isolation is copyrightable. It has to demonstrate creativity. That's why you don't have to worry about writing for (int i = 0; i < idx; i++) {.

ghoward 4 years ago | | |

You're right that code has to demonstrate creativity for copyright. But that also means that an algorithm, even a transformative algorithm, cannot change copyright because an algorithm is not creative, by definition.

This means that the output of any algorithm on copyrighted code is still under the original copyright. I mean, we still apply the copyright of the original to the output of compilers, even though compilers can be transformative with inlining and link-time optimization, to the point that it mixes disparate code in the same way Copilot does.

In fact, I wrote some software licenses [1] that codify the fact that algorithms cannot change copyright.

[1]: https://yzena.com/licenses/

alpaca128 4 years ago | | |

> it doesn't mean every single line in isolation is copyrightable

Microsoft did not just copy individual lines. They fed whole repositories into their model, ignoring the license (if it exists) even though they knew from the start that information generated by the model will be publicly available. Available usually out of context, but nonetheless - the scope of the input and intent are very clearly "everything" and "redistribution".

Just adding a filter/ML model to the output shouldn't matter. I dare you to build a Copilot clone trained on leaked internal Microsoft code and then try to argue that the output is a bit mixed up.

That is a clear violation imho.

hodgesrm 4 years ago | | |

> Software licenses have barely been tested in court...

OSS licenses have been litigated and upheld. Can't supply details of my own experience for confidentiality reasons but plenty of plaintiffs have prevailed in suits about violations of OSS license terms. My guess is the numbers are higher than you might think because a lot of the cases end in non-public settlements.

api 4 years ago | | |

What about non-traditional-FOSS licenses? There is a lot of source-available not-OSI-compliant licensed software on GitHub like MongoDB, CockroachDB, etc., and that's clearly proprietary. If this thing is trained on that and generates what amount to snippets of that code then it's clearly violating those licenses.

Then there's private repositories. If they included those in the training data set that's even more actionable.

Personally I think this is software piracy at an absolutely unprecedented scale. Machine learning is just information transfer from the training data into weights in a model, a close relative of lossy data compression. Microsoft is now reselling all its GitHub users' code for profit.

sangnoir 4 years ago | | |

> You're extremely overconfident about how this will actually play out.

I'd argue Microsoft, too, was/is overconfident about how this would play out. I would have expected a little more caution in selecting the training data.

josefx 4 years ago | | |

> it doesn't mean every single line in isolation is copyrightable.

Copilot is known to reproduce entire blocks of text, including non-functional parts like comments.

bluGill 4 years ago | | |

While they are not tested, anything other than accepting the idea kills the idea of software completely. There is lots of room to change details, but somehow copyright and the fact that the code is copied into computer memory needs to be reconciled.

austincheney 4 years ago | | |

A software license, like any license, is a permission to operate.

> it doesn't mean every single line in isolation is copyrightable

It is if you can prove reproduction apart from your own original work (fair use). Unlike patents, copyright doesn’t protect uniqueness. It is only a shield from reproduction, and if reproduction is demonstrable to a court you are likely at risk.

https://cws.auburn.edu/OVPR/pm/tt/copyrightvplagiarism

hartator 4 years ago | |

> Software licenses matter. Copyright matters.

Some of us think it is detrimental to humanity as a whole.

Y_Y 4 years ago | | |

Why not both?

Suppose that it's just a bad idea and shouldn't exist. Does that mean that I should release my code into the public domain? I think you could make a good case that even being totally opposed to copyright morally or pragmatically or otherwise, given that it currently is enforced in many places it's worthwhile to play along. For example, some people would prefer a world without copyright, but GPL their code, because it might prevent a greater evil.

sangnoir 4 years ago | | |

True, but while they exist, they should be evenly applied.

warkdarrior 4 years ago | | |

If you abolish copyright, that will only make it easier for for-profit corporations to use FOSS. There will be nothing stopping them from using FOSS, unless people stop sharing their code altogether.

pabs3 4 years ago | |

If Copilot is violating the GPL license family, then it is also violating the permissive licenses like MIT too.

MichaelMoser123 4 years ago | | |

How so? The MIT license allows you to do everything with the code. It doesn't allow you to sue the author, but that's about it. Here it is: https://opensource.org/licenses/MIT

swayson 4 years ago | |

Very well put and refreshing. Thank you.

c7DJTLrn 4 years ago | |

If I ever receive monetary compensation for violation of the license on my repositories, I will personally deliver it to you in cash. It won't happen.

I have a feeling Copilot is more of a tool for publicity than for development.

spywaregorilla 4 years ago | | |

That statement sort of depends on how important your repos are

snarky_birdie 4 years ago | |

>I don't use MS Github, I have no skin in the game

You don't have to use Github to have skin in the game. As long as someone has access to your open source code, no matter where it's hosted, anyone is free to upload it to Github. The open source license of your code allows that.

JTbane 4 years ago | |

>I hope there is at least a $1000 award for every instance of AGPL and GPL license violation

So much this. If a neural network is capable of regurgitating code verbatim (with comments!), it's not a stretch to say it's a derivative work of the GPL code used to feed it.

syshum 4 years ago | |

Thank you...

api 4 years ago | |

But don't you get it? The purpose of FOSS is to provide free labor for billion dollar companies.

tomnipotent 4 years ago | | |

A non-trivial amount of FOSS is contributed by programmers on the clock working for those same billion dollar companies.

ralph84 4 years ago |

Their link to why you shouldn't use GitHub[0] takes you to a page where they criticize GitHub for complying with US export controls. The FSF is a US corporation, why do they think that US export controls don't equally apply to savannah.gnu.org? And unlike FSF, GitHub has actually done the work of applying for export licenses so that developers in US-sanctioned countries can access GitHub[1].

[0] https://www.gnu.org/software/repo-criteria-evaluation.html#G... [1] https://github.blog/2021-01-05-advancing-developer-freedom-g...

fadjacent 4 years ago | |

There is a different and more important criticism listed too: GitHub is nonfree.

But GitHub could easily establish a non-US entity to host export-restricted code. And as for Savannah, if anyone were worried about export controls on their code, Savannah would quickly and easily have an independent person host that repo outside the US.

judge2020 4 years ago | | |

To be more specific:

https://stackoverflow.com/legal/terms-of-service/public#:~:t...

> You agree that any and all content.. that you provide to the public Network... is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0)

Technically a lot of people who copy from Stack Overflow are breaking CC BY-SA 4.0, since it requires attribution AND requires distributing code that uses it under the same license (I think - I am not your lawyer):

https://creativecommons.org/licenses/by-sa/4.0/

lamontcg 4 years ago |

Given how the racist twitterbot AI turned out, along with L4 autonomous driving by 2017, I suspect that Copilot is going to suffer most from an incredibly high velocity of churned out security bugs and bad code. SWEs are probably going to get fired for using it and companies will need to ban it, even if the legal problems don't take it down.

belorn 4 years ago |

An interesting initiative from the FSF, though I suspect most of the questions will be answered when someone attempts a similar project in a more traditional copyright-restrictive area.

An example I would like to see is a Cosinger, where the AI is trained using songs on YouTube and streaming services. With the final product, a user starts to sing and the algorithm attempts to sing along and gives the singer suggestions for how the song should continue. I could see how a lot of musicians would be willing to pay good money for such a program, and removing the obligation to pay any money for the training set would make it much more feasible to create.

There are already AIs that create music (though unlikely from proprietary training sets). A Cosinger shouldn't be too far from that.

antocv 4 years ago | |

A Cosinger would be illegal, unethical, profit-killing, anti-democracy and ultimately against our very own freedom to own intellectual property. /s

The same difference as allowing Google to prosper while beating down ThePirateBay, another search engine.

belorn 4 years ago | | |

I predict it is very likely we will see a court case where a smaller actor will take publicly available information as training data and get sued for copyright infringement. It will be interesting to see if, just like in the Pirate Bay case, the courts will be creative. In the TPB case, the accused was found guilty under a Swedish anti-biker-gang law that was written with the intention to shut down biker bars.

When Copilot came out, one thing it reminded me of was the ethical considerations of face generators in animation. The output naturally has some similarities with the training data, and it is trivial to use a limited set of actors in order to create faces with uncanny similarities to the actors. A question that people asked (here on HN, if I recall) was whether you needed permission from those actors to use them in the training set, or whether this would allow anyone to "steal" the face of public figures and create semi-lookalikes that can then be used in anything from porn to advertisement.

The law is undoubtedly going to catch up.

hartator 4 years ago |

> We already know that Copilot as it stands is unacceptable and unjust, from our perspective.

So, why call for white papers? I don’t believe they will publish any papers that go against their views.

user-the-name 4 years ago | |

Read the rest of the paragraph. They think it is unacceptable and unjust from certain perspectives that are trivial for them. However, there are other perspectives that are worth exploring, and that is what this is about.

humanistbot 4 years ago | |

You seem to be unfamiliar with (edit: or object to) the very idea of lawyers.

pavon 4 years ago | |

Just because someone has formed strong opinions about some aspect of a subject, doesn't mean they can't be open minded about another aspect. They plainly state that they don't have clear answers about many of the questions that Copilot raises, and this isn't going to be the last time that those issues appear. It is these broader issues that they want to hold discussions about, not Copilot itself. I don't see any reason not to accept this interest as genuine.

meepmorp 4 years ago | |

They have a position and they now want to support it with arguments, and they'd like it if people would help them do that.

I think that's backwards because it's putting the conclusion first then seeking to justify it, but to each their own.

user-the-name 4 years ago | | |

No, they have a position and arguments to support it, but those have nothing to do with the machine learning aspects, just with the fact that the software is proprietary.

They are asking for views on the machine learning, which they do not have arguments or a position on.

kelnos 4 years ago | | |

> I think that's backwards because it's putting the conclusion first then seeking to justify it

Isn't that literally a lawyer's job?

tyingq 4 years ago | |

They know a couple of reasons for sure. They want more reasons, or more detail on other reasons for which they aren't as sure yet.

whazor 4 years ago |

I am curious about the results.

Having tested Copilot, most suggestions are based on existing code in your opened file. Furthermore, most snippets tend to be relatively short, where it feels more like a Stack Overflow answer than existing code.

Of course it is possible to make the model generate longer pieces of code that are potentially GPL. But you would have to make a deliberate effort to do so. It also tends to adopt your coding style.

But maybe the fact that there are no guarantees makes it unfair.

dcow 4 years ago | |

The difference is that Stack Overflow has taken the legal responsibility of making sure any contributions to the site are licensed in a way that allows users to copy-paste them into their own works, and has the authority to do as much. GH does not have the authority to, without authors' permission, launder their code through an AI "tumbler" and spit out shiny suggestions stripped of all license concerns.

unnah 4 years ago | | |

I just checked Stack Overflow's terms, and they still say that all user contributions are licensed under Creative Commons CC BY-SA 4.0, which means that copying them to your own codebase is likely to be a copyright violation. Lots of people do it, but it's a well-known legal problem.

thomzane 4 years ago |

I am excited to see where these questions lead.

grepfru_it 4 years ago | |

Something like this?

  [GitHub Copilot License Config Menu]
  Show suggestions with the following tags:
  - [ ] GPLv3
  - [x] GPLv2
  - [ ] AGPL
  - [x] CC-BY-SA
  - [x] Apache License
  - [x] MIT License
  - [ ] No License Attribution
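
A minimal sketch of what a menu like this might drive, assuming each suggestion carried a license tag. That assumption is purely hypothetical: `Suggestion`, `license_tag`, and `filter_suggestions` are made-up names, and nothing here reflects how Copilot actually works.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical suggestion record; license_tag is None when unattributed."""
    code: str
    license_tag: Optional[str]


# Mirrors the boxes ticked in the mock menu above.
ALLOWED_TAGS: Set[str] = {"GPLv2", "CC-BY-SA", "Apache License", "MIT License"}


def filter_suggestions(
    suggestions: Iterable[Suggestion],
    allowed: Set[str] = ALLOWED_TAGS,
    allow_untagged: bool = False,
) -> List[Suggestion]:
    """Keep only suggestions whose license tag the user has ticked."""
    kept = []
    for s in suggestions:
        if s.license_tag is None:
            # "No License Attribution" is unticked by default.
            if allow_untagged:
                kept.append(s)
        elif s.license_tag in allowed:
            kept.append(s)
    return kept
```

The hard part is not the filter but knowing the tag in the first place: a model that blends many sources may have no single license to report.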

dunham 4 years ago | | |

Would that require generating 2^n models or can models be combined?

tyingq 4 years ago | | |

Other picklists might be handy too, especially something that would narrow to higher quality sources.

And they need a report button with a picklist of reasons.

remram 4 years ago | | |

Those licenses require attributions. You can't just say "Copyright (c) all the projects indexed in Copilot".

MichaelMoser123 4 years ago |

I actually like it that Copilot is better than me at solving interview questions. https://www.youtube.com/watch?v=FHwnrYm0mNc I for one welcome our robot overlords.

I wonder if they could retrain the model on BSD- or MIT-licensed code only. How much of the open source code is licensed as GPL vs. more permissive licenses, does anyone know?

Interesting that they want to charge for the use of Copilot. I guess that we will see this business model more in the future.

Trollmann 4 years ago | |

Haven’t watched the video but this makes a lot of sense. I assume there are quite a lot of LeetCode solution repositories on GitHub containing exact problem descriptions and LeetCode naming. So essentially it’s copying and pasting from these solutions.

6510 4 years ago |

My opinion: Copilot is a derived work.

lostmsu 4 years ago | |

An opinion: so are you.

6510 4 years ago | | |

That's much too deep for this forum.

lights0123 4 years ago |

> It requires running software that is not free/libre (Visual Studio, or parts of Visual Studio Code)

A little nitpicky, but the only proprietary part it requires is the plugin itself, not the IDE—Copilot runs just fine with the Free build of VS Code compiled from source from GitHub, after flipping a switch to enable WIP APIs.

r283492 4 years ago | |

I think you are wrong: https://vscodium.com/

lights0123 4 years ago | | |

VSCodium provides Free pre-compiled binaries of VS Code from GitHub, like I was describing. What about it makes me wrong?

I did it two days ago, installing the Copilot plugin in a Free build of VS Code provided by my distro.

zekrioca 4 years ago |

Interesting: on HN, the same link submitted at a different time gets a different number of upvotes.

Same link, just 13h ago, but with 5x fewer upvotes than the one here: https://news.ycombinator.com/item?id=27992894

ghoward 4 years ago | |

Because the US programmers were going to bed?

zekrioca 4 years ago | | |

I'd expect HN not to let duplicates be submitted.

kmeisthax 4 years ago |

>Is Copilot's training on public repositories infringing copyright? Is it fair use?

My money's on yes, but this isn't settled until SCOTUS says so.

>How likely is the output of Copilot to generate actionable claims of violations on GPL-licensed works?

This depends on how likely Copilot is to regurgitate its training input instead of generating new code. If it only does so IF you specifically ask it to (e.g. by adding Quake source comments to deliberately get Quake input), then the likelihood of innocent users - i.e. people trying to write new programs and not just launder source code - infringing copyright is also low. However, if Copilot tends to spit out substantially similar output for unrelated inputs, then this goes up by a lot. This will require an actual investigation into the statistical properties of Copilot output, something you won't really be able to do without unrestricted access to both the Copilot model and its training corpus.
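
As a rough illustration of what the first step of such an investigation could look like, here is a minimal sketch that measures how much of a generated snippet appears verbatim in a reference corpus, using whitespace-token n-grams. The function names and the choice of n are assumptions made for illustration. This is a crude proxy for "striking similarity" only; it says nothing about the legal standard of "substantial similarity".

```python
from typing import Iterable, Set, Tuple


def ngrams(tokens: list, n: int) -> Set[Tuple[str, ...]]:
    """All contiguous runs of n tokens, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the output's n-grams that occur verbatim in the corpus.

    Returns 0.0 when the output is shorter than n tokens.
    """
    out_grams = ngrams(output.split(), n)
    if not out_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.split(), n)
    return len(out_grams & corpus_grams) / len(out_grams)
```

Running this over many outputs for unrelated prompts would give a distribution of overlap scores, which is the kind of statistic the question calls for. It still needs the training corpus, which outsiders don't have.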

>How can developers ensure that any code to which they hold the copyright is protected against violations generated by Copilot?

I'm going to remove the phrase "against violations generated by Copilot" as it's immaterial to the question. Copilot infringement isn't any different from, say, a developer copypasting a function or two from a GPL library.

The answer is that, unless the infringement is obvious, it's likely to go unpunished. Content ID systems (which, AFAIK, don't really exist for software) only do "striking similarity" analysis; but the standard for copyright infringement in the US is actually lower: if you can prove access, then you only have to prove "substantial similarity". This standard is intended to deal with people who copy things and then change them up a bit so the judge doesn't notice. There is no way to automate such a check, especially not on proprietary software with only DRM-laden binaries available.

If you have source code, then perhaps you can find some similar parts. Indeed, this is what SCO tried to do to the Linux kernel and IBM AIX; and it turned out that the "copied" code was from far older sources that were liberally licensed. (Also, SCO didn't actually own UNIX.) Oracle also tried doing this to the Java classpath in Android and got smacked down by the Supreme Court. Having the source open makes it easier to investigate; but generally speaking, you need some level of suspicion in order to make it economic to investigate copyright infringement in software.

Occasionally, however, someone's copying will be so hilariously blatant that you'll actually find it. This usually happens with emulators, because it's difficult to actually hire for reverse engineering talent and most platform documentation is confidential. Maui X-Stream plagiarized and infringed PearPC (a PowerPC Macintosh emulator) to produce "CherryOS"; Atari ported old Humongous Entertainment titles to the Wii by copying ScummVM; and several Hyperkin clone consoles feature improperly licensed SNES emulation code. In every case, the copying was obvious to anyone with five minutes and a strings binary, simply because the scope of copied code was so massive.
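
For anyone who hasn't done the "five minutes and a strings binary" check, a minimal sketch of the idea: pull runs of printable ASCII out of a binary blob, then look for strings known to come from the suspected source. The helper names and the marker list are made up for illustration; the real `strings` tool has more options than this.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes, like a basic `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def find_markers(data: bytes, markers: List[str]) -> List[str]:
    """Return the markers that appear inside any extracted string."""
    found = printable_strings(data)
    return [marker for marker in markers if any(marker in s for s in found)]
```

A hit only tells you where to start looking; error messages and identifiers can match by coincidence, so it is a reason to investigate rather than proof of copying.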

>Is there a way for developers using Copilot to comply with free software licenses like the GPL?

Yes - don't use it.

I know I just said you can probably get away with stealing small snippets of code. However, if your actual intent is to comply with the GPL, you should just copy, modify, and/or fork a GPL library and be honest about it.

To add onto the FSF's usual complaints about software-as-a-service and GitHub following US export laws (which, BTW, the FSF also has to do, unless Stallman plans to literally martyr himself for--- oh god he'd actually do that); I'd argue that Copilot is unethical to use regardless of concerns over plagiarism or copyright infringement. You have no guarantee that the code you're actually writing actually works as intended, and several people have already been able to get Copilot to hilariously fail on even basic security-relevant tasks. Copilot is an autocomplete system, it doesn't have the context of what your codebase looks like. There are way better autocomplete systems that already exist in both Free and non-Free code that don't require a constant Internet connection to a Microsoft server.

>Should ethical advocacy organizations like the FSF argue for change in copyright law relevant to these questions?

I'm going to say no, because copyright law is already insane as-is and we don't need to make it worse just so that the copyleft hack still works a little better.

Please, for the love of god, we do not need stronger copyrights. We need to chain this leviathan.

pkrefta 4 years ago |

I'm using Github to publish my code and seriously I don't care whether Copilot was trained using it. I published it and in the end somebody can do anything with it without giving a damn about license, copyright etc - that's the truth of open-source.

grepfru_it 4 years ago | |

This was the same mentality that brought copyleft to the masses in 1984. While you may not care, there are others who do care about the sanctity of license agreements. This is an argument where staying silent means you accept this approach. Of the millions of open source projects, a large portion of the contributors ARE speaking up because they don't find this to be acceptable. I personally think Copilot is the future, and all this discussion is going to do is bring a license usage feature to Copilot (e.g. I want only, or I do not want, GPL code in my Copilot suggestions).

Please continue using GitHub as you were, but maybe consider acting on your words and either removing or changing licenses within your code that do not represent your ideals. Nothing is preventing you from releasing code into the public domain, so do that!

Permit 4 years ago | | |

> Of the millions of open source projects, a large portion of the contributors ARE speaking up because they don't find this to be acceptable.

Is this true? Is there really a large portion of contributors speaking up against this? I got the opposite sense, that it was a very small portion of contributors speaking up against this but I don't have any evidence one way or the other.

kelnos 4 years ago | |

> that's the truth of open-source

No, that's your opinion, which as it turns out also has no legal basis. For me, I want proper attribution from people who use my code. And for any code that I release that's under copyleft, I absolutely do want that license followed.

You seem to be fine releasing your stuff into the public domain, and that's great that you want to do that, but you don't speak for everyone.

colechristensen 4 years ago | |

Well then you're a BSD-license kind of person.

Not everybody is and that's ok too.

nitrogen 4 years ago | | |

The BSD license still requires attribution and copyright notices visible to the end user.

laumars 4 years ago | |

This is why there are a multitude of different open source software licences: some people care more than others about the terms under which their code is used by others.

johannes1234321 4 years ago | |

That is a valid position one can have.

However other people for varying reasons have other ideas ...

senko 4 years ago |

> We already know that Copilot as it stands is unacceptable and unjust [...]. Activists wonder if there isn't something fundamentally unfair about a proprietary software company building a service off their work.

> We will read the submitted white papers, and we will publish ones that we think help elucidate the problem.

Doesn't give me hope they're aiming for unbiased opinion. I would be very surprised if any of the published papers don't closely align with the FSF's a priori position.

nescioquid 4 years ago | |

It sounds like they have a legal premise and they want to work out the implications, not to open up discussion to every quibble about the FSF's values. Having an opinion on the legal issues around their licenses and values seems sort of essential to what the organization does.

The word "unbiased" seems to be doing a lot of heavy work in your comment. The FSF is inherently biased towards its project -- how is that a problem?

senko 4 years ago | | |

> The word "unbiased" seems to be doing a lot of heavy work in your comment. The FSF is inherently biased towards its project -- how is that a problem?

That's a straw man; I never said (nor do I think) the FSF should not be biased towards its project.

However, I would be more willing to trust the results of this call if I had confidence that all solid arguments are presented, even if they're not aligned with FSF's agenda. Hiding them won't make them disappear - you might as well get as informed as possible about the issue, especially if you care deeply about the issue and agree with the FSF.

kelnos 4 years ago | |

Well, sure. They're looking for legal support for their position. They're not pretending to be an unbiased, disinterested observer.

user-the-name 4 years ago | |

The part you removed is the crucial part that explains that paragraph.

ghoward 4 years ago |

I honestly wish I was in a position to write a whitepaper for this. However, I should not for several reasons:

* I have already made my position clear in public, [1] so I could probably be identified.

* I am not a lawyer, just some bloke who attempted to write FOSS licenses to combat ML on copyrighted code. [2]

[1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-and...

[2]: https://yzena.com/licenses/

slownews45 4 years ago |

Anyone feel like FSF moved from maybe engineering idealists to a very lawyer driven type org?

The big GPLv3 push and development - plenty of attacks on folks actually shipping product on GPLv2 and building communities around that model (which keeps software free but allows users of the software to do pretty much what they want with it, including putting it in devices that are locked down - cars / TiVos etc).

Here's an opportunity to really advance in an interesting area with ML -> something that may open up programming to more people -> may advance computers ability to program and modify their own programs in the long run.

And regardless of the FSF attorney stuff, places like China, and tiny little LLCs with no assets, will very likely use the wonderful amount of code on the web to develop solutions in this space, even if the FSF claims everything is a violation. Where is the vision from the FSF anymore?

One thing that's been sad about the FSF -> it's gone from what I would consider a forward looking idealism sort of thing -> here's how we could do / make cool stuff that let communities work together -> to now sort of a legal compliance type org that really is focused on "actionable claims" " protected against violations" etc.

Question - do the Linux community and other successful larger open source communities welcome the FSF and their attorneys into the discussion? I can hardly imagine the BSDs or the Linux folks really connecting with them anymore.

Is there space for a different group, maybe a collection of actual developers shipping code in larger communities, to get together, no FSF / SFC lawyers present, to think creatively about the future? What should we be working for, what is fair to everyone, what helps society, what works around pro-social community building?

A tool that helps with cross language building blocks for common functions etc (stackoverflow on steroids) - just how bad is this?