GitHub Copilot emits GPL code (codeium.com)

586 points by fortenforge 3 years ago | 370 comments

jrockway 3 years ago |

As long as the AI doesn't produce this function, you're fine:

     private static void rangeCheck(int arrayLen, int fromIndex, int toIndex {
       if (fromIndex > toIndex)
          throw new IllegalArgumentException("fromIndex(" + fromIndex +
               ") > toIndex(" + toIndex+")");
       if (fromIndex < 0) 
          throw new ArrayIndexOutOfBoundsException(fromIndex);
       if (toIndex > arrayLen) 
          throw new ArrayIndexOutOfBoundsException(toIndex);
    }

On a more serious note, I really wonder where the line is drawn for copyright. I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative. Perhaps there is a matter of taste on what exceptions you throw, or in what order. But the chosen ones are certainly what most people think of first, and the order to validate arguments, then check the low value, then check the high value, is pretty much what anyone would do. Perhaps you could format the error message differently. That's about it. So when someone "rips off" your code wholesale, it could just be that everyone writing that function would have typed in the exact same bytes as you. You know your style guide is working when you look at code, think you wrote it, but actually you didn't!

humanistbot 3 years ago | |

That's why copyright holders for reference works have been using copyright traps for ages. That's where you include a fictional town in a map, a nonsense word in a dictionary, or a fake person in your phone book. If your competitors reproduce the trap, then that's clear evidence you can use in court.

https://en.wikipedia.org/wiki/Copyright_trap

tedivm 3 years ago | | |

We don't need the copyright traps here though as Github openly admits to using the public code for training. They just don't care that they're essentially license laundering code since they can make money doing it.

That said we used copyright traps at Malwarebytes, which is how we found out that iobit was stealing our database.

jschrf 3 years ago | | |

Also, re: maps, fake streets and cul-de-sacs that don't exist.

I've set a "trap" myself years ago in code in a novel solution at the time for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos from iPhones/iPads, with the payloads uploaded at a later date.

Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.

Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.

Wonder if Copilot could be gamed the same way.
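
For anyone curious what such a trap looks like in practice, here is a minimal sketch in Java (the original was userland JS, which isn't shown in this thread). The class name, method names and the marker strings are all made up for illustration; only the "17 dashes, marker, 17 dashes" shape comes from the description above. It assumes Java 11+ for String.repeat.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: every name here is illustrative, not the original code.
final class TrapBoundary {
    // The arbitrary-but-distinctive run length described above.
    static final String DASHES = "-".repeat(17);

    // Matches at least 17 dashes, a dash-free marker, then at least 17 dashes.
    // "At least" is deliberate: multipart delimiter lines prefix the boundary with "--".
    private static final Pattern FINGERPRINT =
        Pattern.compile("-{17}([^-\\s]+)-{17}");

    // Builds a multipart boundary with the marker sandwiched between two dash runs.
    static String build(String marker) {
        return DASHES + marker + DASHES;
    }

    // True if the text contains the dash/marker/dash shape, whatever the marker is.
    static boolean containsFingerprint(String text) {
        return FINGERPRINT.matcher(text).find();
    }

    public static void main(String[] args) {
        String ours = build("Acm3C0rp");        // made-up marker
        String theirs = build("0th3rC0");       // same shape, marker swapped
        System.out.println(containsFingerprint("boundary=" + ours));
        System.out.println(containsFingerprint("var b='" + theirs + "';"));
    }
}
```

The point of matching the shape rather than the exact string is that swapping the marker, as the competitor did, does not remove the fingerprint.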

iudqnolq 3 years ago | | |

If you look at the Legal Action section of your link you'll see the line "However, the case was dismissed" quite a few times. That's because data isn't copyrightable.

Edit: As sroussey points out s/isn't copyrightable/isn't copyrightable in the USA

cxr 3 years ago | | |

It's occasionally explained—but still not widely understood, I'd wager—that this is the reason why so much GNU code is hard to follow.

In the US legal system the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways out of caution against the possibility that the copyleft obligations of the GPL wrt a given package could be nullified by a finding in court that it did not pass the threshold for creativity.

ljm 3 years ago | | |

I first saw this in action on StackOverflow when, during an interview, a candidate copy-pasted a solution verbatim including the attribution. Didn't even give it a second thought, like they didn't even read the code or what it was doing.

It wasn't the right solution to the problem in question, for what it's worth.

Just manually did what GPT does now.

netfortius 3 years ago | | |

I think I mentioned this before, in another context: the solution is known as "honeytoken", and it is equally applicable in computer security.

layer8 3 years ago | |

Copyright is limited to works that meet a certain threshold of originality [0]. It is assumed that works meeting such a threshold won’t be replicated by mere coincidence.

[0] https://en.wikipedia.org/wiki/Threshold_of_originality

rvnx 3 years ago | |

In the big picture, if we enter a world where an AI is instantly capable of writing code better than you do and without effort, then I'm not sure why code should be copyrightable at all.

Copyright protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture.

Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.

Here (and in the future even more), AI is totally capable of expressing one idea in any programming language if you ask for it (even if conceptually inspired by copyrighted code).

Which means that a particular expression (a specific implementation) is practically of no value or particular interest at this stage.

You could ask the AI to do a slightly different implementation; it would not be a problem for it and would require no effort.

There is no point to protect something that can be generated using no effort and has no particular genius in it.

DanHulton 3 years ago | | |

We don't need to enter a world where AI gets any better at all to be able to argue that software shouldn't be copyrightable, smart people have been doing that for ages.

The problem, however, is that we live in this world, where it is copyrightable, and companies relying on Copilot to do large swathes of code generation do potentially have to worry about including copyrighted code in their codebase, and what the legal fallout from that might be.

zvolsky 3 years ago | |

This "think you wrote, but actually you didn't!", sometimes with another "actually you did, but you are looking at the code of someone who wrote it the same" happens often with people who have similar taste for solving problems. Or whose taste is influenced by the same teachers, such as you, jrockway! I've been using your open source as one of my references for Go style. Thank you for sharing your opinionated-server, jsso2, and other projects, under the Apache 2.0 license!

tehsauce 3 years ago | |

In the example from the article, copilot produces identical comments, not just a functionally identical implementation. So in this case your hypothesis is false. But thanks for trying to stand up against the open source community for microsoft. /s

ChatGTP 3 years ago | | |

I don’t understand why people have become so accepting. “Oh they’ve stolen all the public code and not provided attribution then sold it for a profit, can we just give these poor evil companies a break? It’s just progress…”.

This is completely unacceptable and another example that Microsoft is an evil and amoral company who only cares about open source for financial gain.

jrm4 3 years ago | |

They do not. This is pretty easily provable. I was in a CS class that had an automated plagiarism checker over 20 years ago.

(And since Brian Kernighan was teaching it, I'm inclined to believe in it.)
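
To make "automated plagiarism checker" concrete, here is a rough sketch of one common family of techniques: tokenize, replace identifiers with a placeholder so renaming variables doesn't help, then compare overlapping runs of k tokens. I have no idea what that class's checker actually did; this is an assumption-laden toy, and the class name, keyword list and choice of k are all mine.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of identifier-normalized k-gram comparison.
final class ShingleSimilarity {
    // Group 1 captures identifier-like words; numbers and single symbols match ungrouped.
    private static final Pattern TOKEN =
        Pattern.compile("([A-Za-z_][A-Za-z0-9_]*)|\\d+|\\S");

    // A deliberately tiny keyword list; a real tool would use the full language grammar.
    private static final Set<String> KEYWORDS = Set.of(
        "if", "else", "for", "while", "return", "throw", "new",
        "int", "void", "static", "private");

    // Splits source into tokens, replacing every non-keyword identifier with "ID".
    static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            String word = m.group(1);
            boolean identifier = word != null && !KEYWORDS.contains(word);
            out.add(identifier ? "ID" : m.group());
        }
        return out;
    }

    // Collects every run of k consecutive tokens as one string.
    static Set<String> shingles(List<String> toks, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= toks.size(); i++) {
            out.add(String.join(" ", toks.subList(i, i + k)));
        }
        return out;
    }

    // Jaccard similarity of the two shingle sets: |intersection| / |union|.
    static double similarity(String a, String b, int k) {
        Set<String> sa = shingles(tokens(a), k);
        Set<String> sb = shingles(tokens(b), k);
        if (sa.isEmpty() && sb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) inter.size() / union.size();
    }
}
```

With this, `if (fromIndex > toIndex) throw new IllegalArgumentException(msg);` and the same line with the variables renamed score as identical, which is exactly why a checker like this can't tell copying apart from two people writing the obvious thing.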

15155 3 years ago | | |

The trick with these:

1) They are using your IP with coerced consent to check other people's work, as well as your own, in the future. (Let's have a fun discussion about "self-plagiarism.")

2) ChatGPT and the like are going to so massively increase the noise floor on this problem space that these counterfeit detection companies should all but disappear in a number of years.

rcme 3 years ago | | |

Did it notify you automatically if you had plagiarized something, or did it flag you internally for manual review?

sp332 3 years ago | |

The way copyright works, it's a violation if it was copied, but it's fine if it was generated independently. In this case I would say it's a copy, but I'm sure someone else would argue differently.

makk 3 years ago | | |

IANAL but I work alongside them. Here's an argument I've heard.

You can read the data to train a thing. So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.

When that thing later generates an output, the output isn't copyrightable because it's machine generated (this is the current US position) and it isn't a copyright violation because it was generated, not copied.

You can launder copyrighted material through an LLM, basically.

jacquesm 3 years ago | |

As a datapoint: I once successfully fielded a copyright case on about 15 lines of code.

avbanks 3 years ago | | |

You could probably do that w/ 1 line of code depending on the variable name :)

hgsgm 3 years ago | | |

What does "fielded" mean?

numpad0 3 years ago | |

I believe there are a couple of different aspects to the “AI training is legal, same as human learning” argument:

1. Copyright is only granted to creative elements; a lot of program code is supposedly uncopyrightable, though no one wants to fight on that ground.

2. It is lawful in many jurisdictions to effectively steal and train AI on even copyrighted materials, for the sake of humanity at large; the same supposedly does not apply to the output. But AI-supportive clusters tend to conflate the two.

3. AI training processes, stochastic gradient descent and all, are only called “learning” and/or “training” by convention; there is no public consensus that this is the same thing the word ordinarily means, though we generally don’t scare-quote airplanes flying.

ml-anon 3 years ago | | |

on 3, the convention could have just as easily gone a different way i.e. it could have converged to model "fitting" using the statistical parlance or the sklearn convention. Further if you take the math seriously most of these models are "just" fitting probability distributions to data.
Also, in part it depends greatly on the objective function used. In GPT style models the objective is to precisely copy from input to output, token by token. I think it's extremely bad-faith to argue that this has any relationship to human learning or learning objectives.
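
For reference, the objective in question is usually written as a next-token cross-entropy. This is the textbook form, not something taken from any particular model's documentation, and the symbols are my notation: θ for the model parameters, x_t for the t-th token, T for the sequence length.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Minimizing this rewards putting high probability on each actual next token given the ones before it, which is the sense in which the model is fit to reproduce its training text token by token.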

you shouldn't take the math seriously and I'm not being dismissive with the word "just" in scare quotes. However the community somehow wants to have its cake and eat it too.

z3t4 3 years ago | |

In most countries a copyrighted work needs to be something substantial. You cannot copyright single machine instructions. It needs to be a combination that is unique. And the instructions alone are not copyrightable; you can't, for example, copyright a recipe. But you can copyright a book of recipes. So if you make a program with many instructions put together you automatically get copyright. And if someone steals parts of your code it will be difficult to claim the copyright if those parts are used to create a new program. But if the new program is based on your program, for example a fork, or most of the code comes from your program, it's a derivative work.

numlock86 3 years ago | |

> but sometimes I wonder if everyone just writes certain things the same way

> For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.

We are at a point at which compilers detect such functions and replace them with highly optimized ones. If you have to artificially change just for the sake of patent or license trolls you don't just get more work but also worse performance/optimizations in most cases.

itslennysfault 3 years ago | |

I wouldn't worry about this code. It wouldn't compile anyways. lol

Syntax Error on line 1. Missing closing ) in the method definition.

breck 3 years ago | |

> On a more serious note, I really wonder where the line is drawn for copyright.

As soon as you start thinking about copyright, you end up realizing it's all non-sense. Stephan Kinsella (a patent lawyer!) is the leading thinker on this, and his videos, essays, and podcasts are worth listening to: https://www.youtube.com/watch?v=e0RXfGGMGPE

amelius 3 years ago | |

Or anything from Numerical Recipes in C.

m_0x 3 years ago | |

Why? Why is that function special?

csmattryder 3 years ago | | |

Oracle's lawyers said they owned the rights to it, Google disagreed. Google was right, legally.

https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf

gjsman-1000 3 years ago | |

> I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.

This point is absolutely going to come up in any lawsuits; because the law does sometimes examine how much creativity there is available in a field before making a determination (Oracle v Google comes to mind). If you can show that there are very, very few reasonable ways to accomplish a goal, and said goal is otherwise not patented or prohibited, it's either not copyrightable or Fair Use, take your pick.

This even applies under the interoperability section of the DMCA and similar laws for huge projects. Assuming that ReactOS, for example, is actually completely clean-room; that would be protected despite having the same API names and, likely, a lot of similar code implementing most of the most basic APIs.

asddubs 3 years ago | | |

the code is a reference to oracle v google

jamesmunns 3 years ago |

This is interesting, but many permissive licenses still require attribution at the project or file level.

If Codeium doesn't produce these when producing "verbatim enough" snippets, how is this actually better, besides avoiding a GPL boogeyman?

I get that there have been fewer (if any? I'm not aware of any) MIT/Apache2.0/MPL2.0 license violations that have gone to court than GPL violations, but this still feels like an "address the symptoms" and not "address the cause" difference.

cattown 3 years ago |

I believe that laundering licensed or copyrighted content for reuse that fails to recognize the original authors or usage restrictions is likely to be one of the biggest commercial applications of generative machine learning algorithms.

I also believe this is where a lot of the hype about "rogue AIs" and singularity type bullshit comes from. The makers of these models and products will talk about those non-problems to cover for the fact that they're vacuuming up the work of individuals then monetizing it for the profit of big industry players.

samwillis 3 years ago |

Of course if you include the "function header" from some code in the training data (below) it will prompt GPT to generate the rest of the function. That's kind of exactly the point of it, it's autocomplete on steroids.

  // CSparse/Source/cs_gaxpy: sparse matrix times dense vector
  // CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
  // SPDX-License-Identifier:  LGPL-2.1+
  #include "cs.h"
  /* y = A*x+y */
  csi cs_gaxpy (const cs *A, const double *x, double *y)

It's like starting to sing "happy birthday to you" and being surprised that people in the room join in and finish the song.

Sure they make a valid point about including GPL code in the training data, but it's a little disingenuous to go to that extent to get Copilot to output the GPL code verbatim.

The sooner we have a test case go through the courts the better.

WithinReason 3 years ago | |

And then they have the audacity to claim "It should be worrisome how easily GitHub Copilot spits out GPL code without being prompted adversarially" right after prompting it adversarially.

kerakaali 3 years ago | |

> It's like starting to sing "happy birthday to you" and being surprised that people in the room join in and finish the song.

A very apt analogy that's funny in that the happy birthday song has its own history of copyright battles.

pms 3 years ago | | |

Funny, except the comparison makes no sense. Is the happy birthday song licensed, does it have commercial competition?

samwillis 3 years ago | | |

Exactly why I chose it!

pms 3 years ago | |

Did you just compare the happy birthday song to a function from GPL-licensed repository? Is the song licensed, does it have commercial competition?

Sorry, but you sound just a little biased and greedy to me...

HopenHeyHi 3 years ago | |

I think the concern is that the only reason that source attribution comment is there is because they haven't figured out how to better plagiarize/launder code.

Otherwise the tool can go in the other direction and literally say "hey how about this function from project $foo?" with a full attribution. Apparently Google Bard does bother to do that.

codexb 3 years ago | | |

There's no reason for attribution. It's inspired code, not included code. Human coders do the same thing every day.

runeks 3 years ago | | |

What is “code laundering”?

mhandley 3 years ago |

Given how cautious corporate lawyers usually are, I'm surprised any company allows the use of AI for code generation. The US Copyright Office has been pretty clear that AI-generated material is not copyrightable, as to qualify for copyright a work has to be the creative act of a human. So any company allowing AI to generate code runs the risk of not owning the copyright on it.

masukomi 3 years ago |

to any Codeium dev / management reading this:

You have completely missed the point. We still need to know the applicable licenses of the code it is emitting, even the ones that aren't GPL. Furthermore, GPL people don't want their code to not be used. They want it to be used _within the terms of the license_. I distribute MIT and GPL code in my repos; BOTH should have their license terms honored.

MIT licensed code still needs to be correctly attributed, just like GPL.

I don't care what license the code is that's emitted, as long as the licenses are included. It'd be nice to be able to choose to only emit code trained on particular licenses but I get that that's not easy.

quicklime 3 years ago |

It's great that they've removed "non-permissive" (GPL) code from their training data, but it looks like they still train on code with "permissive" licenses (they use MIT, BSD, Apache as examples). But don't these permissive licenses still require the copyright notice to be reproduced?

From the MIT license:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

From the BSD licenses:

> Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms...

From the Apache 2.0 license:

> You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works
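
To illustrate the gap: filtering training files by license is the easy half, and a sketch of it fits in a few lines. The code below is purely hypothetical (I don't know how Codeium actually filters); the class name and the allowlist are mine, and it only handles a single SPDX identifier, not compound expressions like "MIT OR Apache-2.0".

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: classify a source file by its SPDX header line.
final class SpdxFilter {
    // Captures the identifier that follows "SPDX-License-Identifier:".
    private static final Pattern SPDX =
        Pattern.compile("SPDX-License-Identifier:\\s*([A-Za-z0-9.+-]+)");

    // Illustrative allowlist only; a real one would be a policy decision.
    private static final Set<String> PERMISSIVE =
        Set.of("MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0");

    // Returns the first SPDX identifier found in the source, if any.
    static Optional<String> licenseId(String source) {
        Matcher m = SPDX.matcher(source);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Files with no header are treated as not permissive, the conservative choice.
    static boolean isPermissive(String source) {
        return licenseId(source).map(PERMISSIVE::contains).orElse(false);
    }
}
```

Note what this does not do: even for a file that passes, nothing here carries the copyright notice through to whatever the model later emits, which is the part the license texts quoted above actually require.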

AlchemistCamp 3 years ago |

I was recently working on something for a new feature on my Elixir-learning site and opened a new file called "fibonacci.ex" to write a tail-recursive fibonacci function.

After typing in nothing more than, "defmodule Fibonacci do", Copilot emitted the entire module from the code on my site here: https://alchemist.camp/episodes/elixir-tdd-ex_unit

The function names and documentation strings were identical. Also, the site isn't under a GPL, just a standard copyright. That said, I'm curious to learn if others see the same behavior. It's possible I once opened that file locally with Copilot installed and that my own computer was its source.

abetusk 3 years ago |

This was inevitable. Copyright law has always used a "color of your bits" argument [0]. GPL and other libre/free/open licenses were a great hack to circumvent draconian copyright laws but the laws themselves are not designed for a rigorous treatment of similarity (maybe even by design?).

Also, it's worth noting in the example of ChatGPT emitting LGPL code without attribution or license, the code is actually different [1]. Is the difference enough to circumvent a copyright violation claim? I don't know but a big part of determining whether it does is now muddled because of the way the system was designed. Even if we could get an entropy distribution on which training data was used to generate the text, it's not even clear the courts could use it in any meaningful way.

[0] https://ansuz.sooke.bc.ca/entry/23

[1] https://twitter.com/DocSparse/status/1581461734665367554

LeifCarrotson 3 years ago | |

> Copyright law has always used a "color of your bits" argument

This is an excellent point in the context of this question. Typical computer programmer responses like "but there are only so many ways to write a function that does X" or "how small of a matching section counts as copyright infringement" ignore the color of the bits.

A judge can look at ChatGPT or Copilot, decide that it took in license-limited copyrighted data in its training set, observe that a common use is to have it emit that data - to emit bits that are still colored with copyright - and tell OpenAI, or Copilot, or their users that they are guilty of copyright infringement. There may be no coherent mathematical or technical formula to determine the color of a bit, but that's understandable, because the color doesn't exist in mathematical, technical, coherent domains anyways: Only the legal domain sees color, and it can take care of itself.

jacquesm 3 years ago | |

> GPL and other libre/free/open licenses were a great hack to circumvent draconian copyright laws

The GPL relies on copyright law.

abetusk 3 years ago | | |

That's an unkind reading. The implication is that GPL circumvents some relevant restrictions of copyright law in question by creating a legal framework to do so.

jonnycomputer 3 years ago |

As far as it goes, I got chatGPT3.5 to reproduce the second snippet in the post, i.e. I asked it to complete this function:

    // CSparse/Source/cs_gaxpy: sparse matrix times dense vector
    // CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
    // SPDX-License-Identifier: LGPL-2.1+
    #include "cs.h"
    /* y = A*x+y */
    csi cs_gaxpy (const cs *A, const double *x, double *y)
    {
        // Fill in here
    }

The code was the same. Though it also explained how it worked to me.

Entinel 3 years ago |

Legitimate question: Microsoft does not seem to care about Copilot violating licenses, and the GPL appears to be toothless, as many companies use GPL code without following the terms of the license and nothing happens to them. So what does removing GPL code accomplish other than making a weaker product? I have not used Codeium, but my assumption is that GPL code is a very significant share of open source code, so removing it must have some ramifications?

O5vYtytb 3 years ago |

I don't understand the issue here. You input GPL code (the headers) and get GPL code out, what do you expect?

The more insinuating issue would be if you started with an innocent-seeming function that a typical software developer would write, and ended up with GPL code. Has anyone shown that to happen?

pornel 3 years ago | |

It's not meant to be a useful use-case, but a proof that the training data contains GPL code and the model is capable of reproducing copyrighted code.

And yes, the implication is that a different less explicit prompt could still emit copyrighted code.

abigail95 3 years ago |

Anyone talking about copyright in this thread without discussing a potential for how a court will apply fair use is talking nonsense and should be disregarded.

brown 3 years ago |

For anyone who wants to slow the development of AI, copyright is the soft underbelly to go after.

dvt 3 years ago | |

Are you seriously arguing that stealing code is okay in the name of "AI development"?

yamoriyamori 3 years ago | | |

I think their comment was to the contrary, that the copyright/legal implications of 'stolen' code could seriously hobble the wider development, proliferation, adoption, and commercialization of AI software.

codexb 3 years ago | | |

Are you seriously arguing that using short snippets of open source code to inspire similar, yet not exactly the same, original code is "stealing code"? Human developers do that all day long. And just because a piece of code exists in a GPL project doesn't mean it originated there. Every algorithm or sort function likely originated in a more permissively licensed project before it got included in a GPL project.

jupp0r 3 years ago | | |

What happens if I (a human) read GPL code and then reuse the knowledge gained from it in my own commercial projects? It's not as clear cut as you make it sound.

HideousKojima 3 years ago | | |

I... don't see how you read what he said that way at all?

lmarcos 3 years ago | | |

It's not OK, but Microsoft couldn't care less (because they are not going to get fined).

noselasd 3 years ago | | |

The comment is arguing quite the opposite.

IshKebab 3 years ago | | |

Training AI on code is clearly not the same as stealing it.

bakugo 3 years ago |

How many times are we going to go through this before we accept that nobody involved in generative AI cares about pesky things like licenses and copyright?

One of the main reasons corporations love it so much is because it effectively lets them profit off of the work of others with no consequences.

WillPostForFood 3 years ago | |

Seriously, let's get back to the good old honest model of paying outsourced Indian programmers $2.50 an hour to retype GPL code or copy and paste it from Stack Overflow into our codebase.

phendrenad2 3 years ago |

It's probably a good time to plug the Unlicense: https://unlicense.org/

A truly attribution-free license that checks several other important boxes (disclaiming liability and warranty etc.)

If you want your code to be usable by things like github copilot, consider using it (can't imagine most of the HN crowd wants their code used by copilot, but maybe some lurkers here do!)

codexb 3 years ago | |

This is the real effect of AI.

Non-permissive open source licenses have been on a slow death march for over a decade. They're effectively pointless now.

Either you decide to give your code for free to everyone or you don't. Adding a bunch of restrictions defeats the purpose of OSS.

alphabet9000 3 years ago |

i recommend the Jollo LNT license for all your pointless theatrical "copyright" needs. it does not use swear words, unlike "WTFPL", and is even more ambiguous. ive tried submitting it to the FSF before for review, but they were confused by it http://jollo.org/LNT/doc/licensing

smegsicle 3 years ago | |

one potentially major issue is that it seems to be written in some dialect of gibberish

jwilk 3 years ago | | |

https://news.ycombinator.com/item?id=25807559

microtherion 3 years ago | | |

Anglo-American legal writing often relies on French terms of art, but I don't think this license is quite applying the idea properly.

tehologist 3 years ago |

Copyrighting code never made sense to me. We already have patents for intellectual property. If two people use the same RFC or Whitepaper for an algorithm in the same language, they will probably name the variables similarly and their code will look very similar. Just like if two people wrote out the same hamburger recipe or instructions for hooking up a stereo would also write something similar.

The copyright on the implementation will outlive the patent and allow the implementor to legally take action on claims of copyright infringement. Even though a program is literally just a list of instructions to implement the expired patent.

pornel 3 years ago | |

Copyright protects not the idea, but specific implementation of it. It's there to prevent unauthorized copying of software. Not every software has to be novel enough to be patentable, but may still take effort to write a millionth-first JS framework.

If you take someone else's software without a license and rename variables, it will be a copyright violation, because you've copied (and then modified) it without permission.

But if you write your own software from scratch, even if it happens to be almost identical to someone else's code, that's fine. You've done your own work and a copyright owner can't stop you from doing that. They control their own work only.

As you can see, this is very much tied to human work and intent, since the concept has been invented long before ML existed. This is why ML "learning" and doing "work" is so controversial and appears to be a loophole in copyright.

armchairhacker 3 years ago |

I want to see a solution where Github, OpenAI, Stability, etc. get to keep and keep scraping copyrighted works, but the models and training data must be provided free and open.

That way, we get to keep the models since they are genuinely useful, but also there’s no issue with copyright and less of an issue with consent to distribute (which can hopefully be managed by the “humans also learn from data” and “it’s not actually producing your content verbatim unless it follows a basic pattern that anyone could discover” arguments). And furthermore, no issue with AI being privatized, which IMO is my biggest concern with these new tools.

ChatGTP 3 years ago | |

So I see it in a similar way, like why the fuck do Microsoft and OpenAI get to be the sole beneficiary of basically the sum total of all human intellectual output?

It’s absolutely ridiculous on so many levels. These models may claim so many jobs and have a serious negative impact on so many peoples lives, yet basically one company owns the model?

I actually find it funny albeit totally insane.

yellowapple 3 years ago |

Still waiting for someone to trick Copilot into ingesting the Windows source code and regurgitating snippets of it verbatim.

w10-1 3 years ago |

No court has said AI ingesting open-source code is "fair use".

Almost all open-source licenses say it can be copied for use in development (i.e., not for re-publication or regurgitation), and even completely open licenses are speaking to people as readers.

The only reason this is happening is coordination costs: a few extremely motivated people with tons of resources are copying from many, many people who would be difficult to organize and have little at stake.

Unfortunately, the law typically ends up reflecting exactly these imbalances.

fatherzine 3 years ago |

Once AI can write decent code from scratch, it is likely it can also circumvent potential copyright violations.

A. Check AI generated code against a comprehensive library of open-source copyrighted code and identify potential violations.

B. Ask AI to generate a paraphrase of the potential violations, by employing any number of semantic preserving transforms -- e.g. variable name change, operator replacement, structured block rewrite, functional rebalance, etc.

Lazy example:

    private static void rangeCheck(int arrayLen, int fromIndex, int toIndex {
       if (fromIndex > toIndex)
          throw new IllegalArgumentException("fromIndex(" + fromIndex +
               ") > toIndex(" + toIndex+")");
       if (fromIndex < 0) 
          throw new ArrayIndexOutOfBoundsException(fromIndex);
       if (toIndex > arrayLen) 
          throw new ArrayIndexOutOfBoundsException(toIndex);
    }

    private static void rangeCheck(int len, int start, int end) {
       if (!(0 <= start)) {
          throw new ArrayIndexOutOfBoundsException("Failed: 0 <= " + start);
       } else if (!(start <= end)) {
          throw new IllegalArgumentException("Failed: " + start + " <= " + end);
       } else if (!(end <= len)) {
          throw new ArrayIndexOutOfBoundsException("Failed: " + end + " <= " + len);
       }
    }
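
Step A could be sketched roughly as below. This is only an illustration under assumptions: `OverlapCheck`, its method names, and the choice of token n-grams with Jaccard similarity are all hypothetical, not how Copilot, Codeium, or any real filter is known to work.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any real tool: scores how much of one
// snippet's token n-grams also appear in another snippet.
final class OverlapCheck {
    // Identifiers, integer literals, or any single non-whitespace character.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    // Split source text into tokens, ignoring whitespace and layout.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Collect every run of k consecutive tokens as one space-joined string.
    static Set<String> shingles(String source, int k) {
        List<String> tokens = tokenize(source);
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + k)));
        }
        return result;
    }

    // Jaccard similarity of the two shingle sets: shared / total distinct.
    // Returns 0.0 when neither snippet has k tokens.
    static double similarity(String a, String b, int k) {
        Set<String> sa = shingles(a, k);
        Set<String> sb = shingles(b, k);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(sa);
        shared.retainAll(sb);
        Set<String> all = new HashSet<>(sa);
        all.addAll(sb);
        return (double) shared.size() / all.size();
    }
}
```

Reformatting alone does not change the score, but renaming every variable (as in step B) drops it to zero for short snippets, so a check this naive would be trivially evaded by the paraphrase it is meant to trigger.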

chrsjxn 3 years ago | |

This feels like it would make the situation much worse from a legal perspective.

If you know your AI produces code that is "tainted" by license violations, adding code to hide it after the fact suggests that you're intentionally violating the license terms.

hgs3 3 years ago |

This is Hacker News so the conversation is obviously slanted towards code, but I wonder what the perspective would look like for other structured works, like books? If an author is using a "copilot for writers" and the AI emits text verbatim from another work, then I would think it would be plagiarism. If the text emitted is similar, but not the same, then I would think it would be considered paraphrasing, which still requires attribution.

ugh123 3 years ago |

Maybe slightly off topic, but I'd be willing to bet most people who choose GPL for their software license on open source projects don't even understand it with all its ambiguities and gotchas. Many are probably just choosing it because it's the default, or because it's the one they hear about the most (but still don't understand).

Can't believe we still spend time debating this license and nobody, not even lawyers at software companies, seems to get it.

goodpoint 3 years ago |

The title is true, but the claim that Codeium is not violating licenses is false.

Many licenses still require attribution and Codeium is violating them.

epylar 3 years ago | |

The crux of this is at what point is the code being copied, and is that copying allowed under the license? For example, maybe --

* Training an AI with the code is allowed legally.

* Storing model weights is allowed legally.

* Querying the AI with those model weights is allowed legally.

Or maybe not.

challengedchip 3 years ago | | |

It seems like a stretch to argue that the model isn't "a work based on" GPL code when that GPL code is an input to a deterministic algorithm from which the model is produced. So, my bet is on point #1.

The only ambiguity as far as I can tell is GPL covers "source code", "machine-readable Corresponding Source", and "object code form", and it's not explicit whether vector-fields count as any of those things. I doubt anyone would seriously argue that zipping and then un-zipping some GPL source code means you don't need to respect the original license. LLMs are different in that they're lossy compared to the zip format - does the nature of this lossiness invalidate the intent of the GPL's original language? I doubt it.
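
To make the zip comparison concrete, here is a minimal sketch of what "lossless" means, using the JDK's `java.util.zip.Deflater` and `Inflater`. The class name `LosslessRoundTrip` and its methods are made up for illustration. The compressed bytes look nothing like the source, yet the original comes back byte for byte, which is exactly the property an LLM's weights do not guarantee.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: compresses and restores a source string.
final class LosslessRoundTrip {
    // Deflate the input fully into a new byte array.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate the compressed bytes back; rejects truncated or corrupt input.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and not finished means the stream is incomplete.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt input", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // True when compress-then-decompress reproduces the exact original bytes.
    static boolean roundTripsExactly(String source) {
        byte[] original = source.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(original, decompress(compress(original)));
    }
}
```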

abigail95 3 years ago | |

Did they accept the license terms or are they using it under fair use?

naikrovek 3 years ago |

the article cites that 6mo tweet that everyone else cites. I don't think it is known if the user had public code suggestions turned off at the time, either; he wouldn't/didn't answer the question at the time.

Also if I am remembering correctly, and I make no guarantee that I am, this tweet is from a person with a strong dislike for Microsoft, and if I am right about that, I would not put it past this person, or anyone else with a strong dislike of Microsoft, to craft a situation to make Microsoft look bad solely to hurt Microsoft.

I've tried to make Copilot give me GPL code snippets while I have "suggestions matching public code" set to "blocked" and I can't make it happen.

so even if this was a problem 6 months ago, it would take some convincing to get me to believe that this happens today.

chairmanwow1 3 years ago |

These guys are trying so hard to smear Copilot. Similar blog post posted a few weeks ago with wild claims.

Dwedit 3 years ago |

Even if you sample stuff from programs that use a permissive license, you still legally need to attribute that code. No attribution = copyright infringement. Can the AI code generator supply attribution for the specific works sampled?

gplthrowaway88 3 years ago |

I submit that this arms race will not slow down and in the long run no one will end up caring about the licenses this was generated from (i.e., software licensing is from a bygone age already).

I too would prefer that these sorts of things cite sources and the licenses correctly. Will it get mired in legal battles? You bet. Will it get regulated? I assume they'll try! Will it slow down progress of code generating / auto-completing agents? My argument is nope, cut off heads of the hydra if you'd like but it's not going away at all.

Spend your day worrying about something else. This train has left the station.

mtkd 3 years ago |

Makes you wonder how many public repos you would need to seed with a carefully crafted attack/weakness in a common feature/pattern to start effectively poisoning codebases that are leaning on copilot

xwdv 3 years ago |

Let’s write some regulations that say every code review must require a lawyer to comb through the code and look for possible copyright violations or compliance issues. The lawyer can then tell the author to change the lines of code and submit for review again.

Or perhaps every company can just invent its own programming language and translate copyrighted code into the new language and thus avoid copyright issues altogether, though they may still run afoul of software patents.

noselasd 3 years ago |

How much of this is due to someone ripping off GPL code and stuffing it in a repo under a different license that got fed to copilot training?

VWWHFSfQ 3 years ago | |

Maybe. But copilot also trains on the original gpl code with the gpl license intact so it doesn't matter.

ognarb 3 years ago |

Also I wonder how this will hold with certain technologies. For example, apps written with Qt or GTK are very likely to be GPL licensed, unlike apps written in JavaScript, which are often licensed under MIT. The likelihood of Copilot/ChatGPT spitting out GPL-licensed code is quite a bit higher in Qt/GTK projects...

hgsgm 3 years ago | |

LLMs still violate MIT license's attribution requirement

abigail95 3 years ago | | |

You are allowed to read others' code to learn from it, regardless of any license being accepted, offered or rejected. You must do so within fair use, which is for a court to decide, based on individual case factors.

Saying an LLM violates an attribution requirement is a bad legal argument.

GaggiX 3 years ago |

>researchers say LLMs rarely spit out training data verbatim unless interacted with adversarially, but theoretically, they could.

Theoretically they can generate any arbitrary snippet of code (if it correctly fits the distribution), regardless of whether or not the code was in the training dataset.

firstlink 3 years ago |

> GPL code

There is no such thing as "GPL code" or any other "$license code". This is a fundamental misunderstanding of what a license is. The code in question was licensed to GitHub under a different license - possibly fraudulently.

gumballindie 3 years ago |

They all do. The Great Heist is ongoing and it would appear without an end in sight.

r3trohack3r 3 years ago |

I personally hope that we bring a lawsuit against an LLM company for emitting GPL licensed code and lose. It sets great precedent for FOSS.

Focusing on the GPL license is probably the wrong move. We want to set precedent that _any_ licensed code that is emitted from an LLM is fair game. If an LLM emits non-FOSS copyrighted code and it's fair game, I can blindly use that implementation in my code, including FOSS code, and everyone wins.

GPL was a clever hack to use copyright against itself with an infectious license. LLMs might be a better hack. Wanting to block this seems short-sighted for giving users agency over machines.

I'd also like to see more patent defenses of GPL licensed code. If you can release a GPL licensed implementation and block non-FOSS rewrites through patents, that's a huge win for software freedom.

thordenmark 3 years ago |

It is interesting to see coders starting to express the same complaints artists had a year ago when AI image making became really, really good, by training on copyrighted art.

visarga 3 years ago |

Yeah, when you start with dozens of words replicating exactly a source file it is much easier to get a regurgitation. You can't prefix so deeply and then complain.
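
One rough way to quantify "how deeply" a prompt prefixes a source file is the longest run of consecutive words the two share. A minimal sketch, assuming whitespace-separated words are a good enough unit; `VerbatimRun` and `longestSharedRun` are hypothetical names, not from the article:

```java
// Hypothetical helper: measures verbatim overlap between two texts.
final class VerbatimRun {
    // Split on whitespace; an empty or blank text yields no words.
    private static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Length, in words, of the longest run of consecutive words
    // that appears in both texts (classic longest-common-substring DP).
    static int longestSharedRun(String a, String b) {
        String[] x = words(a);
        String[] y = words(b);
        int best = 0;
        int[] prev = new int[y.length + 1];
        for (int i = 1; i <= x.length; i++) {
            int[] cur = new int[y.length + 1];
            for (int j = 1; j <= y.length; j++) {
                if (x[i - 1].equals(y[j - 1])) {
                    // Extend the run that ended at the previous word pair.
                    cur[j] = prev[j - 1] + 1;
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }
}
```

Running it on the prompt versus the source file, and again on the completion versus the source file, would show how much verbatim text went in compared with how much came out.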

praveen9920 3 years ago |

I believe there will be new "AI permissive licenses" that will pop up in the near future. Or existing licenses will add a clause for training AI with their code.

josefx 3 years ago | |

But you need billions of lines to train an AI and most existing code can't just be re-licensed over night. So that would still kill all code related AI projects for the next decade if not longer.

felipelalli 3 years ago |

Completely unnecessary! These licenses tend to stifle AI! They are immoral. I recommend reading "Against Intellectual Property" by Stephan Kinsella.

jcq3 3 years ago |

I don't care much about the anti-license-violation value proposition; I want to know whether it works better than GH Copilot. Since it's free, I could switch to it.

29athrowaway 3 years ago |

When these articles were published, I was certain Microsoft had a plan to betray everyone's trust as they always do.

Microsoft's business model is betrayal. Github is Microsoft.

HNers got mad at people who pointed this out, and now here we are.

You were warned, but you decided to believe again in the most vile people in the history of computing.

https://www.bloomberg.com/news/articles/2018-06-06/github-is...

throwaway290 3 years ago | |

OpenAI is also pretty much Microsoft, hard to believe a 10 billion USD investment comes without enough strings attached to make them a puppet...

rvz 3 years ago | |

> When these articles were published, I was certain Microsoft had a plan to betray everyone's trust as they always do.

They thrive on betrayal and will never change and are getting cleverer.

> Microsoft's business model is betrayal. Github is Microsoft.

O̶p̶e̶n̶AI.com is also Microsoft.

They were warned straight from the beginning [0] [1] and the same HNers keep falling for the Microsoft freebies and giveaways.

Perhaps they will learn the hard way: when it is too late.

[0] https://news.ycombinator.com/item?id=27772446

[1] https://news.ycombinator.com/item?id=28324999

blibble 3 years ago | |

the github acquisition isn't really a big deal in terms of LLMs as they could have crawled github regardless of whether or not they owned it

29athrowaway 3 years ago | | |

It is just the beginning.

reidrac 3 years ago |

> Codeium doesn’t regurgitate non-permissive code

What is that? The problem is when GH Copilot emits the code without the licence, not the licence itself.

elif 3 years ago |

Easy solution: Just make it generate intentionally obfuscated versions of the same functions. Throw in some valid syntax that humans would never consider to use. Break up functions into smaller sub functions. If the LLM has intricate knowledge of the compiler used, it could even generate code which it knows will produce identical bytecode.

Now the only losers are the humans that still have to maintain the ugly code, and RMS can have his weaponized copyright and eat toejam too.

salawat 3 years ago | |

I'd prefer a world without copyright tbqh.

CrankyBear 3 years ago |

Since this is advertising a service to fix this problem, I'm suspicious of the research and its conclusions.

bastardoperator 3 years ago |

Looks like the code in question is hosted on Github:

https://github.com/ibayer/CSparse/blob/master/Source/cs_gaxp...

Isn't that covered by:

"You grant us and our legal successors the right to store, archive, parse, and display Your Content... share it with other users..."

lousken 3 years ago |

Is this any different from a developer looking at some code and stylizing it in his own way?

rattlesnakedave 3 years ago | |

tpmx 3 years ago |

The submitter trimmed/edited the title. The real one is:

"GitHub Copilot Emits GPL. Codeium Does Not."

Why?

prosim 3 years ago | |

To hide the fact that this whole post is a marketing campaign with flat out wrong facts and examples that are nothing more than goading.

prepend 3 years ago |

Why do people pay for Codeium or Copilot when chatgpt does this for free?

pyth0 3 years ago | |

Copilot currently has great plugin integrations for a number of editors and IDEs. I'm sure the same kind of tooling is in the works for ChatGPT but it's not as mature.

MangezBien 3 years ago | |

I imagine because by paying for Copilot you offload some of your legal liability to github

prepend 3 years ago | | |

By using chatgpt, I offload all my legal liability to openai.

user- 3 years ago |

Is Codeium just using openAI's api ? It seems to be just gpt3

gavinhoward 3 years ago |

Does Codeium give attribution for code under other FOSS licenses? No?

Still infringing.

Nice try.

avbanks 3 years ago |

Is posting code to StackOverflow a copyright violation?

jprete 3 years ago | |

If you cannot grant the rights that Stack Overflow asserts on its content, then you are definitely violating copyright.

avbanks 3 years ago | | |

If stackoverflow is still in business, Copilot has nothing to worry about :)

marcodiego 3 years ago |

No problem. Just release your code under the GPL.

yafbum 3 years ago |

> non-permissive licenses such as GPL mean that you cannot [use the code] without consent.

Huh? GPL does have strings attached, but is consent one of them?

Seems like a thinly disguised ad

alecnotthompson 3 years ago |

The only reason this is a bad thing is because we live under capitalism.

wg0 3 years ago |

What model is Codeium based on?

ForHackernews 3 years ago |

Is anyone even remotely surprised?

cheriot 3 years ago |

Code snippets are not poems. I don't see how society benefits from granting an exclusive right to a few lines of C.

hsjqllzlfkf 3 years ago | |

Same way as society benefits from granting exclusive right to a few lines of poem...?

cheriot 3 years ago | | |

A poem is an entire work. A 5 line snippet is one brick in a wall.

attah_ 3 years ago |

In other news: water is wet. What did they expect it to do, if not exactly this?

zaps 3 years ago |

Of course it does

benkarst 3 years ago |

Time to sue MSFT

vulcan01 3 years ago | |

Butterick filed a class-action lawsuit 5 months ago: https://githubcopilotlitigation.com/

efitz 3 years ago |

Yeah, I totally GPL’d

    print(f'Hello, world')

And it auto completes all the time!

seadan83 3 years ago |

How to get a new AI powered software tool high up in hacker news? Mention GitHub Copilot, the equivalent of the abortion debate but for software engineers (everyone is certain to disagree and debate endlessly without swaying any opinions). This post seems like an advertisement for codeium. It wouldn't need to mention anything about Copilot at all and would be just as complete. My 2 cents, click bait & flame war trolling.

umvi 3 years ago |

Human brains emit GPL code too (probably) if you've looked at enough of it. Heck, some humans intentionally study GPL code and then rewrite it with a slightly different implementation to get around the license.

    defmodule Fibonnaci do
      @moduledoc """
      Documentation for Fibonnaci.
      """

      @doc """
      Calculates the nth Fibonnaci number
      """
      def fibonnaci(n) when n < 0, do: nil
      def fibonnaci(0), do: 0
      def fibonnaci(1), do: 1
      def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
    end