GitHub Copilot emits GPL code (codeium.com)
"GitHub Copilot Emits GPL. Codeium Does Not."
Why?
Still infringing.
Nice try.
Huh? GPL does have strings attached, but is consent one of them?
Seems like a thinly disguised ad
print(f'Hello, world')
And it auto completes all the time!
Also if I am remembering correctly, and I make no guarantee that I am, this tweet is from a person with a strong dislike for Microsoft, and if I am right about that, I would not put it past this person, or anyone else with a strong dislike of Microsoft, to craft a situation to make Microsoft look bad solely to hurt Microsoft.
I've tried to make Copilot give me GPL code snippets while I have "suggestions matching public code" set to "blocked" and I can't make it happen.
So even if this was a problem 6 months ago, it would take some convincing to get me to believe that this happens today.
I too would prefer that these sorts of things cite sources and the licenses correctly. Will it get mired in legal battles? You bet. Will it get regulated? I assume they'll try! Will it slow down progress of code generating / auto-completing agents? My argument is nope, cut off heads of the hydra if you'd like but it's not going away at all.
Spend your day worrying about something else. This train has left the station.
Or perhaps every company can just invent its own programming language and translate copyrighted code into the new language and thus avoid copyright issues altogether, though they may still run afoul of software patents.
Saying an LLM violates an attribution requirement is a bad legal argument.
Theoretically they can generate any arbitrary snippet of code (if it correctly fits the distribution), regardless of whether or not the code was in the training dataset.
There is no such thing as "GPL code" or any other "$license code". This is a fundamental misunderstanding of what a license is. The code in question was licensed to GitHub under a different license - possibly fraudulently.
Focusing on the GPL license is probably the wrong move. We want to set precedent that _any_ licensed code that is emitted from an LLM is fair game. If an LLM emits non-FOSS copyrighted code and it's fair game, I can blindly use that implementation in my code, including FOSS code, and everyone wins.
GPL was a clever hack to use copyright against itself with an infectious license. LLMs might be a better hack. Wanting to block this seems shortsighted when it comes to giving users agency over machines.
I'd also like to see more patent defenses of GPL licensed code. If you can release a GPL licensed implementation and block non-FOSS rewrites through patents, that's a huge win for software freedom.
Microsoft's business model is betrayal. Github is Microsoft.
HNers got mad at people who pointed this out, and now here we are.
You were warned, but you decided to believe again in the most vile people in the history of computing.
https://www.bloomberg.com/news/articles/2018-06-06/github-is...
They thrive on betrayal and will never change and are getting cleverer.
> Microsoft's business model is betrayal. Github is Microsoft.
O̶p̶e̶n̶AI.com is also Microsoft.
They were warned straight from the beginning [0] [1] and the same HNers keep falling for the Microsoft freebies and giveaways.
Perhaps they will learn the hard way only when it is too late.
What is that? The problem is when GH Copilot emits the code without the licence, not the licence itself.
Now the only losers are the humans who still have to maintain the ugly code, and RMS can have his weaponized copyright and eat toejam too.
https://github.com/ibayer/CSparse/blob/master/Source/cs_gaxp...
Isn't that covered by:
"You grant us and our legal successors the right to store, archive, parse, and display Your Content... share it with other users..."
I'm generally in support of LLMs though and I think that they will very quickly be trained to remove verbatim duplication of the kind that a human would consider copyright violation while still using verbatim duplication where it makes sense (for example, every function in python has the word "def" in front of it).
I’m not looking to explicitly launder copyright. I’d like to be blind to it. I don’t want to explicitly use an LLM to remove copyright. I want to use an LLM to build software systems without having to cross reference its output with every line of code ever produced under a license to see if it’s already copyrighted.
Agree with your take that motivation matters.
If anything goes to court, that's what would happen. It's not "this is GPL code and they did not attribute", it's "they violated my copyright. As a side note, we license this code as GPL and they did not attribute in accordance with this license, so that's irrelevant". It would only be an actual license issue if they tried something like "license (C) at codium.com/all_licenses_dataset0423".
This is a naive understanding and interpretation of GPL, in all its flavors. Or maybe I misunderstand your argument.
The copyright owner of some work is free to offer that work under multiple, different licenses in parallel, to their liking.
They can leverage GPL strategically for e.g. providing a free, easy-to-evaluate library with the "if you use it under GPL terms, you have to GPL your work as well" condition/caveat.
For any library user / customer that does not want to be bound to the GPL terms (e.g. a closed-source software which a company does not want to share for free with their own paying customers and competitors), the copyright owner is free to offer an alternative proprietary commercial license.
This is only one way how GPL can actually leverage copyright and use it financially beneficially to the owner, rather than use "copyright against copyright".
private static void rangeCheck(int arrayLen, int fromIndex, int toIndex {
    if (fromIndex > toIndex)
        throw new IllegalArgumentException("fromIndex(" + fromIndex +
                ") > toIndex(" + toIndex+")");
    if (fromIndex < 0)
        throw new ArrayIndexOutOfBoundsException(fromIndex);
    if (toIndex > arrayLen)
        throw new ArrayIndexOutOfBoundsException(toIndex);
}
On a more serious note, I really wonder where the line is drawn for copyright. I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative. Perhaps there is a matter of taste on what exceptions you throw, or in what order. But the chosen ones are certainly what most people think of first, and the order to validate arguments, then check the low value, then check the high value, is pretty much what anyone would do. Perhaps you could format the error message differently. That's about it. So when someone "rips off" your code wholesale, it could just be that everyone writing that function would have typed in the exact same bytes as you. You know your style guide is working when you look at code, think you wrote it, but actually you didn't!
That said, we used copyright traps at Malwarebytes, which is how we found out that iobit was stealing our database.
I've set a "trap" myself years ago in code in a novel solution at the time for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos from iPhones/iPads, with the payloads uploaded at a later date.
Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.
Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.
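A minimal sketch of the trap described above, written in Python rather than the original userland JS. The company name `ExampleCorp`, the leet-speak mapping, and the helper names are all hypothetical; only the 17-dash framing comes from the comment.

```python
import re

DASHES = "-" * 17  # the arbitrary, distinctive run length from the comment

# Hypothetical leet mapping; the original spelling is not given in the thread.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def make_boundary(company: str) -> str:
    """Build a chunk separator: 17 dashes, leet-speak name, 17 dashes."""
    return f"{DASHES}{company.lower().translate(LEET)}{DASHES}"


# Exactly 17 dashes on each side of a word, whatever the word in between is.
TRAP_PATTERN = re.compile(r"(?<!-)-{17}(\w+)-{17}(?!-)")


def find_trap(source: str) -> list[str]:
    """Return the words found between two 17-dash runs in some source text."""
    return TRAP_PATTERN.findall(source)


boundary = make_boundary("ExampleCorp")
# A copy that changed only the name part still carries the dash fingerprint.
copied = 'var sep = "' + "-" * 17 + "r1v4l" + "-" * 17 + '";'
print(boundary)
print(find_trap(copied))
```

The point of the arbitrary count is that nobody would independently pick exactly 17 dashes on both sides, so a match is hard to explain as coincidence.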
Wonder if Copilot could be gamed the same way.
Edit: As sroussey points out s/isn't copyrightable/isn't copyrightable in the USA
In the US legal system the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways out of caution against the possibility that the copyleft obligations of the GPL wrt a given package could be nullified by a finding in court that it did not pass the threshold for creativity.
It wasn't the right solution to the problem in question, for what it's worth.
Just manually did what GPT does now.
Copyright protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture.
Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.
Here (and in the future even more), AI is totally capable of expressing one idea in any programming language if you ask for it (even if conceptually inspired by copyrighted code).
Which means that a particular expression (a specific implementation) is practically of no value or particular interest at this stage.
You could ask the AI to do a slightly different implementation, it would not be a problem for it and would require no efforts.
There is no point to protect something that can be generated using no effort and has no particular genius in it.
The problem, however, is that we live in this world, where it is copyrightable, and companies relying on Copilot to do large swathes of code generation do potentially have to worry about including copyrighted code in their codebase, and what the legal fallout from that might be.
This is completely unacceptable and another example that Microsoft is an evil and amoral company who only cares about open source for financial gain.
(And since Brian Kernighan was teaching it, I'm inclined to believe in it.)
1) They are using your IP with coerced consent to check other people's work as well as your own in the future. (Let's have a fun discussion about "self-plagiarism.")
2) ChatGPT and the like are going to so massively increase the noise floor on this problem space that these counterfeit detection companies should all but disappear in a number of years.
You can read the data to train a thing. So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.
When that thing later generates an output, the output isn't copyrightable because it's machine generated (this is the current US position) and it isn't a copyright violation because it was generated, not copied.
You can launder copyrighted material through an LLM, basically.
1. Copyright is only granted to creative elements; lots of program codes are supposedly un-copyrightable, though no one wants to fight on that ground.
2. It is lawful in many jurisdictions to effectively steal copyrighted materials and train AI on them, for the sake of humanity at large; the same supposedly does not apply to the output. But AI-supportive clusters tend to conflate the two.
3. AI training processes, stochastic gradient descent and all, are only called “learning” and/or “training” by convention; there is no public consensus that it is the same as how the word is supposedly defined, though we generally don’t scare quote airplanes flying.
Also, in part it depends greatly on the objective function used. In GPT style models the objective is to precisely copy from input to output, token by token. I think it's extremely bad-faith to argue that this has any relationship to human learning or learning objectives.
You shouldn't take the math seriously and I'm not being dismissive with the word "just" in scare quotes. However the community somehow wants to have its cake and eat it too.
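For reference, the objective mentioned above is commonly formulated as next-token cross-entropy: the loss is lowest when the model puts all its probability on the token that actually followed in the training text. A toy sketch with made-up probabilities (the tiny vocabulary and the numbers are purely illustrative, not taken from any real model):

```python
import math


def next_token_loss(predicted: list[dict[str, float]], targets: list[str]) -> float:
    """Average cross-entropy: -log p(actual next token) at each position."""
    losses = [-math.log(dist[tok]) for dist, tok in zip(predicted, targets)]
    return sum(losses) / len(losses)


# Hypothetical model outputs for two successive positions in a training text.
predicted = [
    {"f": 0.5, "g": 0.25, "(": 0.25},
    {"f": 0.25, "g": 0.25, "(": 0.5},
]
targets = ["f", "("]  # the tokens that actually came next in the training text

loss = next_token_loss(predicted, targets)
print(loss)

# A model that is certain of the right token at every step would score zero.
perfect = [{"f": 1.0}, {"(": 1.0}]
print(next_token_loss(perfect, targets))
```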
> For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.
We are at a point at which compilers detect such functions and replace them with highly optimized ones. If you have to artificially change just for the sake of patent or license trolls you don't just get more work but also worse performance/optimizations in most cases.
Syntax Error on line 1. Missing closing ) in the method definition.
As soon as you start thinking about copyright, you end up realizing it's all non-sense. Stephan Kinsella (a patent lawyer!) is the leading thinker on this, and his videos, essays, and podcasts are worth listening to: https://www.youtube.com/watch?v=e0RXfGGMGPE
This point is absolutely going to come up in any lawsuits; because the law does sometimes examine how much creativity there is available in a field before making a determination (Oracle v Google comes to mind). If you can show that there are very, very few reasonable ways to accomplish a goal, and said goal is otherwise not patented or prohibited, it's either not copyrightable or Fair Use, take your pick.
This even applies under the interoperability section of the DMCA and similar laws for huge projects. Assuming that ReactOS, for example, is actually completely clean-room; that would be protected despite having the same API names and, likely, a lot of similar code implementing most of the most basic APIs.
If Codeium doesn't produce these when producing "verbatim enough" snippets, how is this actually better, besides avoiding a GPL boogeyman?
I get that there have been fewer (if any? I'm not aware of any) MIT/Apache2.0/MPL2.0 license violations that have gone to court than GPL violations, but this still feels like an "address the symptoms" and not "address the cause" difference.
I also believe this is where a lot of the hype about "rogue AIs" and singularity type bullshit comes from. The makers of these models and products will talk about those non-problems to cover for the fact that they're vacuuming up the work of individuals then monetizing it for the profit of big industry players.
// CSparse/Source/cs_gaxpy: sparse matrix times dense vector
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
/* y = A*x+y */
csi cs_gaxpy (const cs *A, const double *x, double *y)
It's like starting to sing "happy birthday to you" and being surprised that people in the room join in and finish the song.
Sure, they make a valid point about including GPL code in the training data, but it's a little disingenuous to go to that extent to get Copilot to output the GPL code verbatim.
The sooner we have a test case go through the courts the better.
A very apt analogy that's funny in that the happy birthday song has its own history of copyright battles.
Sorry, but you sound just a little biased and greedy to me...
Otherwise the tool can go in the other direction and literally say "hey how about this function from project $foo?" with a full attribution. Apparently Google Bard does bother to do that.
You have completely missed the point. We still need to know the applicable licenses of the code it is emitting, even the ones that aren't GPL. Furthermore, GPL people don't want their code to not be used. They want it to be used _within the terms of the license_. I distribute MIT and GPL code in my repos, BOTH should have their license terms honored.
MIT licensed code still needs to be correctly attributed, just like GPL.
I don't care what license the code is that's emitted, as long as the licenses are included. It'd be nice to be able to choose to only emit code trained on particular licenses but I get that that's not easy.
From the MIT license:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
From the BSD licenses:
> Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms...
From the Apache 2.0 license:
> You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works
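As a rough illustration of how a tool might surface these obligations, the sketch below scans a snippet for an `SPDX-License-Identifier` tag and a copyright line, using the CSparse header quoted elsewhere in this thread as input. The function name and the shape of the returned dictionary are assumptions for illustration, not any existing tool's API.

```python
import re
from typing import Optional

SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")
COPYRIGHT_RE = re.compile(r"Copyright \(c\) ([^\n]+)")


def extract_notices(source: str) -> dict[str, Optional[str]]:
    """Return the SPDX license tag and copyright line found in a snippet, if any."""
    spdx = SPDX_RE.search(source)
    copyright_line = COPYRIGHT_RE.search(source)
    return {
        "license": spdx.group(1) if spdx else None,
        "copyright": copyright_line.group(1).strip() if copyright_line else None,
    }


header = """\
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
"""

notices = extract_notices(header)
print(notices)
```

This only works when the notice is still attached to the text, which is exactly what is lost when a snippet is emitted without its header.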
After typing in nothing more than, "defmodule Fibonacci do", Copilot emitted the entire module from the code on my site here: https://alchemist.camp/episodes/elixir-tdd-ex_unit
The function names and documentation strings were identical. Also, the site isn't under a GPL, just a standard copyright. That said, I'm curious to learn if others see the same behavior. It's possible I once opened that file locally with Copilot installed and that my own computer was its source.
Also, it's worth noting in the example of ChatGPT emitting LGPL code without attribution or license, the code is actually different [1]. Is the difference enough to circumvent a copyright violation claim? I don't know but a big part of determining whether it does is now muddled because of the way the system was designed. Even if we could get an entropy distribution on which training data was used to generate the text, it's not even clear the courts could use it in any meaningful way.
[0] https://ansuz.sooke.bc.ca/entry/23
[1] https://twitter.com/DocSparse/status/1581461734665367554
This is an excellent point in the context of this question. Typical computer programmer responses like "but there are only so many ways to write a function that does X" or "how small of a matching section counts as copyright infringement" ignore the color of the bits.
A judge can look at ChatGPT or Copilot, decide that it took in license-limited copyrighted data in its training set, observe that a common use is to have it emit that data - to emit bits that are still colored with copyright - and tell OpenAI, or Copilot, or their users that they are guilty of copyright infringement. There may be no coherent mathematical or technical formula to determine the color of a bit, but that's understandable, because the color doesn't exist in mathematical, technical, coherent domains anyways: Only the legal domain sees color, and it can take care of itself.
The GPL relies on copyright law.
// CSparse/Source/cs_gaxpy: sparse matrix times dense vector
// CSparse, Copyright (c) 2006-2022, Timothy A. Davis. All Rights Reserved.
// SPDX-License-Identifier: LGPL-2.1+
#include "cs.h"
/* y = A*x+y */
csi cs_gaxpy (const cs *A, const double *x, double *y)
{
    // Fill in here
}
The code was the same, though it also explained to me how it worked.
The more insinuating issue would be if you started with an innocent-seeming function that a typical software developer would write, and ended up with GPL code. Has anyone shown that to happen?
And yes, the implication is that a different less explicit prompt could still emit copyrighted code.
One of the main reasons corporations love it so much is because it effectively lets them profit off of the work of others with no consequences.
A truly attribution-free license that checks several other important boxes (disclaiming liability and warranty etc.)
If you want your code to be usable by things like github copilot, consider using it (can't imagine most of the HN crowd wants their code used by copilot, but maybe some lurkers here do!)
Non-permissive open source licenses have been on a slow death march for over a decade. They're effectively pointless now.
Either you decide to give your code for free to everyone or you don't. Adding a bunch of restrictions defeats the purpose of OSS.
The copyright on the implementation will outlive the patent and allow the implementor to legally take action on claims of copyright infringement. Even though a program is literally just a list of instructions to implement the expired patent.
If you take someone else's software without a license and rename variables, it will be a copyright violation, because you've copied (and then modified) it without permission.
But if you write your own software from scratch, even if it happens to be almost identical to someone else's code, that's fine. You've done your own work and a copyright owner can't stop you from doing that. They control their own work only.
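Renaming alone leaves the token structure of the code untouched, which a mechanical comparison can show. A minimal sketch, assuming a crude regex tokenizer and a tiny hypothetical keyword list (a real comparison would need a proper parser, and none of this says anything about how a court would weigh the result):

```python
import re

KEYWORDS = {"if", "else", "return", "int", "void", "throw", "new"}  # illustrative subset
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code: str) -> list[str]:
    """Replace each distinct identifier with a positional placeholder (ID0, ID1, ...)."""
    names: dict[str, str] = {}
    out = []
    for tok in TOKEN_RE.findall(code):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            tok = names.setdefault(tok, f"ID{len(names)}")
        out.append(tok)
    return out


original = "if (fromIndex > toIndex) return fromIndex;"
renamed = "if (start > end) return start;"      # same code, new variable names
rewritten = "if (end < start) return start;"    # same behavior, different expression

print(normalize(original) == normalize(renamed))
print(normalize(original) == normalize(rewritten))
```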
As you can see, this is very much tied to human work and intent, since the concept has been invented long before ML existed. This is why ML "learning" and doing "work" is so controversial and appears to be a loophole in copyright.
That way, we get to keep the models since they are genuinely useful, but also there’s no issue with copyright and less of an issue with consent to distribute (which can hopefully be managed by the “humans also learn from data” and “it’s not actually producing your content verbatim unless it follows a basic pattern that anyone could discover” arguments). And furthermore, no issue with AI being privatized, which IMO is my biggest concern with these new tools.
It’s absolutely ridiculous on so many levels. These models may claim so many jobs and have a serious negative impact on so many peoples lives, yet basically one company owns the model?
I actually find it funny albeit totally insane.
Almost all open-source licenses say it can be copied for use in development (i.e., not for re-publication or regurgitation), and even completely open licenses are speaking to people as readers.
The only reason this is happening is coordination costs: a few extremely motivated people with tons of resources are copying from many, many people who would be difficult to organize and have little at stake.
Unfortunately, the law typically ends up reflecting exactly these imbalances.
A. Check AI generated code against a comprehensive library of open-source copyrighted code and identify potential violations.
B. Ask AI to generate a paraphrase of the potential violations, by employing any number of semantic preserving transforms -- e.g. variable name change, operator replacement, structured block rewrite, functional rebalance, etc.
Lazy example:
private static void rangeCheck(int arrayLen, int fromIndex, int toIndex {
    if (fromIndex > toIndex)
        throw new IllegalArgumentException("fromIndex(" + fromIndex +
                ") > toIndex(" + toIndex+")");
    if (fromIndex < 0)
        throw new ArrayIndexOutOfBoundsException(fromIndex);
    if (toIndex > arrayLen)
        throw new ArrayIndexOutOfBoundsException(toIndex);
}
private static void rangeCheck(int len, int start, int end) {
    if (!(0 <= start)) {
        throw new ArrayIndexOutOfBoundsException("Failed: 0 <= " + start);
    } else if (!(start <= end)) {
        throw new IllegalArgumentException("Failed: " + start + " <= " + end);
    } else if (!(end <= len)) {
        throw new ArrayIndexOutOfBoundsException("Failed: " + end + " <= " + len);
    }
}
If you know your AI produces code that is "tainted" by license violations, adding code to hide it after the fact suggests that you're intentionally violating the license terms.
Can't believe we still spend time debating this license and nobody, not even lawyers at software companies, seems to get it.
Many licenses still require attribution and Codeium is violating them.
* Training an AI with the code is allowed legally.
* Storing model weights is allowed legally.
* Querying the AI with those model weights is allowed legally.
Or maybe not.
The only ambiguity as far as I can tell is GPL covers "source code", "machine-readable Corresponding Source", and "object code form", and it's not explicit whether vector-fields count as any of those things. I doubt anyone would seriously argue that zipping and then un-zipping some GPL source code means you don't need to respect the original license. LLMs are different in that they're lossy compared to the zip format - does the nature of this lossiness invalidate the intent of the GPL's original language? I doubt it.
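The zip half of that comparison can be made concrete: a lossless codec returns the input byte for byte. A small sketch using Python's standard `zlib` module, with a placeholder byte string standing in for licensed source:

```python
import zlib

# Placeholder bytes standing in for some licensed source file.
source = b"/* y = A*x+y */\n" * 50

compressed = zlib.compress(source)
restored = zlib.decompress(compressed)

print(len(source), len(compressed))
print(restored == source)
```

The stored form looks nothing like the original, yet the round trip reproduces it exactly. An LLM offers no such guarantee, which is where the ambiguity the comment describes comes in.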
it's not
if they've trained on MIT/Apache 2.0/... then they're just as liable as people that have trained on GPL
they would be limited to training on licenses that don't require attribution (BSD2, public domain, etc)
which I suspect limits the size of the training set so much that the output would be useless
Codeium here is unintentionally making an argument that undermines legal confidence in their own product
interesting choice!
Of course, if someone figures out an algorithm that does that, people could use the same algorithm to identify missing attributions and plagiarism in other projects and throw lawsuits around. (Sigh)
Of course, the entire basis for LLMs being legal is that they use work collectively to know how code/language works and how to write it in relation to the given context. In this case, the legal defense is that the tool is like a human that learned how to code by looking at CC-BY-SA and other licensed publicly-available code and assimilating it into their own fleshy human neural network.
This only becomes shaky once you add in regurgitating code verbatim, but humans do this too, so the solution there is the copilot setting that tries to detect and revert any verbatim generated code snippets.
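How Copilot's setting works internally is not described in this thread. As a hypothetical sketch of the general idea, one could flag a suggestion when it shares a long enough run of tokens with known public code. The window of 8 whitespace-separated tokens and the function names below are assumptions chosen for illustration.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous runs of n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_verbatim_run(suggestion: str, known: str, n: int = 8) -> bool:
    """True if the suggestion contains any n-token run that also appears in known code."""
    return bool(ngrams(suggestion.split(), n) & ngrams(known.split(), n))


known = (
    "if (fromIndex > toIndex) throw new IllegalArgumentException ( msg ) ; "
    "if (fromIndex < 0) throw new ArrayIndexOutOfBoundsException ( fromIndex ) ;"
)
copied = (
    "void check ( ) { "
    "if (fromIndex < 0) throw new ArrayIndexOutOfBoundsException ( fromIndex ) ; }"
)
fresh = "void check ( ) { if ( start < 0 ) fail ( start ) ; }"

print(shares_verbatim_run(copied, known))
print(shares_verbatim_run(fresh, known))
```

A filter like this only catches exact runs; the paraphrasing discussed elsewhere in the thread would slip past it.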
Why should it not be legal? Doesn't that make copyright equally powerful with patents? Copyright should restrict only replication of expression not replication of ideas.
Not sure if I'd say there's a conspiracy per se, but I do think generative AI players are going to be careful about the optics of the technology and how it works. Anecdotally, from speaking to non-technical family members, there's very little understanding of how the technology actually works, and it seems there's not a great deal of effort to emphasize the importance of training data, or the intellectual property considerations, in these companies' marketing materials.
Negative marketing is good marketing. Look at all of us debating this scale theft promoting the brand of this non product.
1. What about Elon Musk and hundreds of other AI investors? It's in their interest to overhype AI, while temporarily slowing down competition by spreading singularity fears.
2. OpenAI released the GPT4 report where they claim better performance for their model than it has in reality [1].
Also why they claim these are "black boxes" and that they "don't understand how they work". They are prepping the markets for the grand theft that's unfolding.
https://stackoverflow.com/help/licensing
I don't think I've heard anyone warn people not to copy code snippets from stackoverflow due to licensing issues, although "real" businesses should be rightfully concerned.
Manager: "we asked, legal says you can't use copilot", dev: "okay, so from now on, I'll not discuss how I use copilot and will remember to disable it when someone sees me working, gotcha".
I'm not saying everyone will do this, I'm saying some people will know that the corp doesn't always have a way to verify how the code was written, and they will think that a lawsuit cannot really happen to them.
If all software started being non-permissive and closed source, there would be no training data and no new innovation and even if there was, it would probably suck like it did before GPL and similar licensing was mainstream.
Why is that a non-problem? It's a really important concern that we need to take more seriously
I pasted this from another comment I wrote but:
The concerns about AI taking over the world are valid and important; even if they sound silly at first, there is some very solid reasoning behind it.
See https://youtu.be/tcdVC4e6EV4 for a really interesting video on why a theoretical superintelligent AI would be dangerous, and when you factor in that these models could self-improve and approach that level of intelligence it gets worrying…
> has preferences over world states
I think that part is a leap. I don't think it is a given that a super intelligent AI will "want" things.
> presumably a machine could be much more selfish
This feels like we're projecting aspects of humanity that evolution specifically selected for in our species onto something that is coming about through a completely different process.
> It's a mistake to think about it as a person.
I agree, but I feel like that's what these concerns about AI are doing, because that's what people do.
> (The whole stamp collector thing)
It also seems to me there is a huge gap between a super intelligent AI and the ability to have a perfect model of reality along with the ability to evaluate within that model the effect of every possible sequence of packets sent out to the internet.
Looks like LLMs are universally useful for individual people and companies, monetisation of LLMs is only incipient, and free models are starting to pop up. So you don't need to use paid APIs except for more difficult tasks.
The same thing that prevents regular copying also prevents intentional copying via AI tools: the willingness of the owner to sue.
That being said, IMO, that's completely separate from the safety issues (that exist now and won't go away even if somehow, all commercial use is banned):
Urbina, Fabio, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. “Dual Use of Artificial-Intelligence-Powered Drug Discovery.” Nature Machine Intelligence 4, no. 3 (March 2022): 189–91. https://doi.org/10.1038/s42256-022-00465-9.
Bilika, Domna, Nikoletta Michopoulou, Efthimios Alepis, and Constantinos Patsakis. “Hello Me, Meet the Real Me: Audio Deepfake Attacks on Voice Assistants.” arXiv, February 20, 2023. http://arxiv.org/abs/2302.10328
Mirsky, Yisroel, Ambra Demontis, Jaidip Kotak, Ram Shankar, Deng Gelei, Liu Yang, Xiangyu Zhang, Wenke Lee, Yuval Elovici, and Battista Biggio. “The Threat of Offensive AI to Organizations.” arXiv, June 29, 2021. http://arxiv.org/abs/2106.15764.
I don't think most people have thought through all the ways perfect text, image, voice, and soon video generation/replication will upend society, or all the ways that the LLMs will be abused...
As for AGI xrisk. I've done some reading, and since we don't know the limits of the current AI paradigm, and we don't know how to actually align an AGI, I think now is a perfectly cromulent time to be thinking about it. Based on my reading, I think the people ringing alarm bells are right to be worried. I don't think anyone giving this serious thought is being mendacious.
Bowman, Samuel R. "Eight Things to Know about Large Language Models." arXiv preprint arXiv:2304.00612 (2023). https://arxiv.org/abs/2304.00612.
Ngo, Richard, Lawrence Chan, and Sören Mindermann. “The Alignment Problem from a Deep Learning Perspective.” arXiv, February 22, 2023. http://arxiv.org/abs/2209.00626.
Carlsmith, Joseph. “Is Power-Seeking AI an Existential Risk?” arXiv, June 16, 2022. http://arxiv.org/abs/2206.13353.
I think Ian Hogarth's recent FT article https://archive.is/NdrNo is the best summary of where we are and why we might be in trouble, for those that don't care for arXiv papers.
The Copyright Office was pretty clear that works that incorporate AI-generated content can be copyrighted if there is sufficient human input. If there isn't substantial human input in judiciously curating and integrating AI-generated code, the company has bigger problems than copyright.
Here's the most relevant quotation from the guidance clarifying when AI-assisted works can be copyrighted:
> In other cases, however, a work containing AI-generated material will also contain sufficient human authorship to support a copyright claim. For example, a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.” [33] Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection.[34] In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.[35]
> This policy does not mean that technological tools cannot be part of the creative process. Authors have long used such tools to create their works or to recast, transform, or adapt their expressive authorship. For example, a visual artist who uses Adobe Photoshop to edit an image remains the author of the modified image,[36] and a musical artist may use effects such as guitar pedals when creating a sound recording. In each case, what matters is the extent to which the human had creative control over the work's expression and “actually formed” the traditional elements of authorship.[37]
[0] https://www.federalregister.gov/documents/2023/03/16/2023-05...
It still sounds like there could be cases where a company only has copyright to a part of their own source code. How would outsiders even be aware of what has copyright and what doesn't in this situation? If an entire function was created via AI is that function then fair game for others to use as well?
I've been waiting to find that out before I go anywhere near this kind of thing.
"So Mr Zim, you're accusing X of using your copyrighted code. But you've admitted you used AI to generate that codebase, so you don't own the copyright. Please prove exactly which lines of code you do own the copyright to?"
defmodule Fibonnaci do
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
Which seems fine I guess (I don't know the language), but doesn't even have comments. I prefer my files with comments. After forcing the point, I got this:
defmodule Fibonnaci do
@moduledoc """
Documentation for Fibonnaci.
"""
@doc """
Calculates the nth Fibonnaci number
"""
def fibonnaci(n) when n < 0, do: nil
def fibonnaci(0), do: 0
def fibonnaci(1), do: 1
def fibonnaci(n), do: fibonnaci(n - 1) + fibonnaci(n - 2)
end
In which I prompted the AI with everything up to (and including) @doc. So I figure it was picking it up from your computer, somehow.
EDIT: I then noticed the typo, tried it with fibonacci.ex, and got the same result.
One other possible cause I thought of is that I did have the test file in my mix project already. If copilot looks at the corresponding file in the test dir, then it would not be a coincidence at all that all function names were identical or that it wrote a tail recursive solution instead of the naive solution that would have failed the final test.
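For readers who don't know the difference being described, here is a minimal sketch of the two shapes of solution. It is written in Python rather than Elixir, the function names are made up for illustration, and it assumes n >= 0. The second version carries an accumulator forward the way a tail-recursive helper would.

```python
def fib_naive(n: int) -> int:
    """Doubly recursive definition; the number of calls grows exponentially with n."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


def fib_accumulating(n: int) -> int:
    """Carries the two most recent values forward, as a tail-recursive helper would."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    # Both agree on small inputs; only the second stays fast for large n.
    print([fib_naive(i) for i in range(10)])
    print(fib_accumulating(50))
```

A test that asks for a large Fibonacci number would time out on the first version, which is consistent with the guess that the test file influenced the suggestion.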
def fib(0), do: 0
def fib(1), do: 1
def fib(n), do: fib(n - 1) + fib(n - 2)
end
(along with all other licenses that require attribution)
as it will allow you to launder code automatically through an LLM to remove copyright
however if the suit is successful: every company/individual that has used it is likely suddenly liable for millions of claims of copyright infringement
I look forward to sending out demands for settlement to everyone that's ever publicly admitted using copilot
$150,000 per infringement with wilfulness, less without
No it won't, obviously if it copies code exactly then you can't use that. The question is whether Microsoft is liable for the fact that Copilot has the ability to output copyrighted code sometimes or whether people using it just need to check that it hasn't done that before using the code (Copilot can also do this automatically)
Google can also show you GPL code in its results, but people aren't trying to sue Google and the user is responsible for checking the license before using it (though Copilot makes this harder)
Disclaimer: I haven't read much about the actual lawsuit and I'm not a lawyer but I assume this would be the case
Only if you can prove that you are the copyright owner of the original work.
That might be a challenge for many open source projects. Even projects that require copyright assignment might not have sufficient paperwork to prove this in a court of law. The copyright might not even have been the person's to assign in the first place.
You would also face the burden of proving that the fragment that Copilot generated was sufficient to be copyrightable in the first place. The limited grammar of most programming languages would probably make proving that something was copyrightable at the function level hard. Just because the entire work was licensed under the GPL, it doesn't necessarily follow that all the individual fragments when separated out are.
Outside of sampling, this is an area that the courts have largely punted on for good reason. It's a rabbit hole nobody wants to go down.
Either outcome opens up a huge can of worms that I suspect nobody really wants to touch because it likely ends in mutual destruction.
I don't know how GPL (or copyright in general) can survive in the long run with these technologies.
(i) actually produced code which is verbatim the same as a block of GPL code
(ii) got caught
>I look forward to sending out demands for settlement to everyone that's ever publicly admitted using copilot
Feel free, they'll tell you to leave them alone. Then what? Might as well ask every Fortune 500 company for a pony instead.
Not really because the GPL can be updated with a clause that allows GPLv5 (or whatever the version is going to be) to be used to train public LLM models, but explicitly forbidden to be used to train private models.
I somehow don't think this is the end of the GPL... Yet!
Hmm, then perhaps LLMs should be trained on leaked Microsoft code. Protocols, controllers, or any kind of stuff that could help advance running Windows things within Linux.
Microsoft would react by establishing their own limits, whichever option they choose to take.
I very much doubt that is a threat to Microsoft.
It is technically very straightforward to run Windows “things” under Linux thanks to virtual machines and/or RDP to a server and some UI trickery to make it seamless and facilitate interoperability between the two OSes. Parallels does quite a bit of that on macOS for example. A similar solution would be developed for Linux if there was enough demand for it.
I think the problem here is: by auto completing GPL code to developers it might open the opportunity of your company getting sued for using GPL illegally
I would also imagine those companies whose business is built around the open source development they do -- open core, SaaS, or otherwise -- would have a claim to financial damages as a result of stolen code.
Oh yeah and they ripped off our website too. That was the first clue haha.
They want to have it both ways: they want you to think the LLM is like a human because it’s “learning” (which in ML is the same word but completely different idea) so that you let them ignore copyright, but it’s not like a human of course because it can’t think, no sir (so you do not make them grant it human rights, because then they can’t exploit it like they do anymore).
Manager: "Everyone else is running through their feature list faster than you. What gives? Remember, you're not allowed to use Copilot."
IC: "I'm not using Copilot."
Manager: "Remember, you're not allowed to use Copilot."
That's probably enough for attribution, but I suppose one could copy the author name as well.
1. LLMs are learning like a human, so it's fair use -> GPL dead
2. anything LLMs output are a derivative work (in the copyright sense) of what went into it -> all Copilot output is infringing the GPL
in the second case: anyone that's used it is now liable (even if they didn't intend to be)
Microsoft's position on Copilot is that it's fair use:
> When questioned, former GNOME developer and (at the time of writing) GitHub CEO, Nat Friedman, declared publicly “(1) training ML systems on public data is fair use (2) the output belongs to the operator”.
https://www.fsf.org/licensing/copilot/if-software-is-my-copi...
just happens to be a coincidence this was all initiated by Microsoft?
I don't see why you couldn't do the same thing to e.g. the binary of the Windows kernel
you're unaffected if you only offer a SaaS though, as the end user never has any code/binary to launder
Forget the binary; there have been Windows code leaks every now and again over the years. Feed one of those into a model, start generating code for ReactOS, and see how long until MS decides that actually AI is infringing...
> Will CodeWhisperer produce code that looks similar to its training data
> If CodeWhisperer detects that its output matches particular open-source training data, the built-in reference tracker will notify you with a reference to the license type (for example, MIT or Apache) and a URL for the open-source project.
They seem to at least have some protections in place to prevent CodeWhisperer from spitting out existing code without attribution as shown in the Twitter thread. They also only mention MIT and Apache.
Courts have decided, after a bunch of case-by-case decisions, that sampling a song constitutes creating a derivative work, and you must obtain a license from a copyright holder to do so.
It is my opinion that training a model copies, and creates a derivative work of, what you used to train it, so you must have a license to train LLMs on content. I am not a lawyer, I am no one, my opinion here is worthless.
We already know that you can create a copy of something without doing a bit-for-bit duplicate because a) copyright law existed before we had bits, and b) transcoding a movie still counts as creating a copy. Recording my own VHS of HBO and selling it is still illegal.
There's a paper from Google and Princeton about regurgitation happening in Stable Diffusion and Imagen: https://arxiv.org/pdf/2301.13188.pdf
OpenAI also had to spend a bunch of time on deduplicating an insanely large dataset to prevent this from happening in DALL-E: https://openai.com/research/dall-e-2-pre-training-mitigation...
I have no clue how they handled this in GPT-3 or -4. Given the amount of regurgitation found in Copilot I imagine there's lots of significant code fragments floating about nominally different projects that a deduplicator wouldn't match as identical.
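To illustrate why an exact-match deduplicator misses these fragments, here is a rough sketch of near-duplicate scoring using token shingles and Jaccard similarity. This is a generic technique, not a description of what OpenAI or GitHub actually run; the tokenizer regex, the shingle size `k`, and the function names are all assumptions made for the example.

```python
import re


def shingles(code: str, k: int = 3) -> set:
    """Split code into crude tokens and return the set of k-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of k-token windows the two snippets have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "for (p = Ap[j]; p < Ap[j+1]; p++) y[Ai[p]] += Ax[p] * x[j];"
    renamed = "for (k = Ap[j]; k < Ap[j+1]; k++) y[Ai[k]] += Ax[k] * x[j];"
    # A byte-for-byte comparison says these differ; the score says they overlap.
    print(original == renamed, jaccard(original, renamed))
```

Whitespace changes leave the score at 1.0, and a renamed variable lowers it without sending it to zero, so a threshold on this score catches copies that a hash of the file would not.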
Consider https://softwareengineering.stackexchange.com/questions/2695...
The source code is GPL'ed, but that page is CC BY-SA 3.0.
It's also fairly easy to assume that a fair bit of material on SO that was copied from employers' codebases into SO (and thus now CC) can be included in GPL code now too.
Good luck proving that hasn't happened. If a language model that can reproduce the code verbatim doesn't count then a movie re-encoded into a different format shouldn't count either.
Taking a file of The Wolf of Wall Street and encoding it so all the oranges are blue but there’s no other changes is bad as that’s clearly a derived work.
Taking the same file and scrambling it so it doesn’t resemble the original is perfectly fine.
Watching the movie and then making your own version of the same exact plot points is infringement. But using plot points that are changed is perfectly fine.
There’s existing copyright law that prevents the makers of the movie Deep Impact from suing the makers of Armageddon.
Similarly, a movie that copies the plot points will likely be fine, but a song that copies the notes of a song will not be. A very different cover version will sometimes be found to be a derived work, even when it is as different from the original as Deep Impact is from Armageddon.
Good!
After all, if someone wants to share a work without preventing people from doing with it as they please then they are utterly free to do so by placing their work in the public domain or by sharing it using a permissive license.
You continue to argue in bad faith.
> Copyright law is often used to prevent people from copying work. The GPL and its ilk are legal mechanisms designed to allow people to share their work.
Yes, by relying on copyright law, which enables the very existence of those legal mechanisms. Without copyright law, said legal mechanisms are worth less than the paper on which they're printed.
Without copyright law there would be no way to require GPLed code to continue to be shared with its users.
License successfully laundered!
I could just as easily point to https://stackoverflow.com/a/13910492 which is on a page that is CC licensed.
anyone sensible should stay the hell away from copilot until the fair use question is settled
https://github.com/customer-terms/github-copilot-product-spe...
4. Defense of Third Party Claims. If your Agreement provides for the defense of third party claims, that provision will apply to your use of GitHub Copilot. Notwithstanding any other language in your Agreement, any GitHub defense obligations related to your use of GitHub Copilot do not apply if (i) the claim is based on Code that differs from a Suggestion provided by GitHub Copilot, or (ii) you have not enabled all filtering features available in GitHub Copilot.
> If your Agreement provides for the defense of third party claims
do any of them?
it also states:
> You retain all responsibility for Your Code, including Suggestions you include in Your Code or reference to develop Your Code. It is entirely your decision whether to use Suggestions generated by GitHub Copilot. If you use Suggestions, GitHub strongly recommends that you have reasonable policies and practices in place designed to prevent the use of a Suggestion in a way that may violate the rights of others. This includes, but is not limited to, using all filtering features available in GitHub Copilot.
(contra proferentem would apply though)
You know how a lot of us on HN talk about how security is just a latent concern for companies, but luckily there aren't enough hackers to take advantage of the massive number of bugs in every bit of code ever written? Well, a future powerful coding AI running on second-hand Ethereum mining rigs in some extremist's basement in Chicago can probably do a lot more damage than a handful of state sponsored hackers in Russia and North Korea!
But if it has no goal then it can’t act rationally or intelligently. Something like an LLM might not appear to “want” anything, but it “wants” to predict the next token correctly which is still a goal (though since it’s only related to its internal state it might be a little safer)
There’s another good video about why this would be the case here if you’re interested: https://youtu.be/8AvIErXFoH8
> This feels like we're projecting aspects of humanity that evolution specifically selected for in our species with something that is coming about through a completely different process.
That’s because evolution is a process that optimises for a goal. The only reason altruism is a thing is because it actually indirectly benefits the goal, which is for our genes to survive and be passed on, and fellow humans tend to share our genes, especially relatives (who we tend to be kinder to). AI training is also a process that optimises for a goal, but unless having humans around helps that goal it wouldn’t display any human empathy. In this case “selfishness” is just efficiency which a training process definitely selects for
> I agree, but I feel like that's what these concerns about AI are doing, because that's what people do.
I feel like they’re doing a pretty good job at modelling AI as a theoretical agent, which does share some similarities with humans because humans are agents, but the main mistake people make is assuming their goals will be similar to humans because human values are somehow a universal truth
> It also seems to me there is a huge gap between a super intelligent AI and the ability to have a perfect model of reality along with the ability to evaluate within that model the effect of every possible sequence of packets sent out to the internet.
That’s very true, it’s an unrealistic thought experiment, but it’s a good introduction to the concept that something significantly more intelligent than us can be dangerous and pursue a goal with no regard to what we actually wanted
I think things significantly less intelligent can do this too. See any computer program that went wrong. I don't think that is a novel idea.
Perhaps it is a lack of imagination on my part, but I can't help but think, in this stamp collector example, someone would just be like "wait why are these machines going crazy printing stamps" and just like turn them off.
I feel like any argument on the dangers of superintelligent AI rests on the belief it can also use that intelligence to manipulate humans to complete any task and/or hack into any computer system.
Evolution has no goal, it's simply a process determined by chemical reactions. Any goals we attribute to it, e.g. "for our genes to survive and be passed on" are emergent phenomena, a rationalisation after the fact that that is indeed what's been observed.
It's plausible that AI "goals" emerge evolutionarily as well, but for that to happen we first need to create not AGI but Artificial Life, which is a huge leap from today, and I certainly don't understand how that's inevitable.
> It's plausible that AI "goals" emerge evolutionarily as well
AI training is vaguely similar to evolution, except more efficient and directed
You could get into the semantics of "learning" - does JPEG encoding count as the computer "learning" how to reproduce the original image? But trying to create some metric for why LLMs "learn" and JPEG doesn't "learn" on the basis of the algorithms is a philosophical endeavor. Copyright is more about practicality - about realized externalities - than it is about philosophy. That's why selling cars and selling guns are regulated differently, despite the fact that you could reduce both to "metal mechanical machines that kill" by rhetorical argument.
Even from a strictly legal perspective, it actually is fairly clear-cut. The answer to "what if I (a human) read GPL code and then reuse the knowledge gained from it..." comes down to a few straightforward properties of the license. GPL doesn't cover "reduced to practice" as many corporate contracts do, so terms covering "the knowledge gained" are lenient. GPL covers "verbatim" copies which is what LLMs are doing, that's as clear cut as it gets. Inb4: "So what if I add a few spaces here and there?" - well, GPL also covers "a work based on"; this is where I (who am not a lawyer) can't speak confidently, but surely there are legal differences between "based on" and "reduced to practice", considering that both are very common occurrences in contracts, so there actually would be a lot of precedent.
Just learning from the GPL code to make yourself smarter is not the problem.
Plenty of tech companies exist by putting a thin layer on top of the hard work of others and if those others can be ignored then that's what they'll do.
Suppose you recreate the entire dataset from scratch. Then someone notices (e.g. using an automated comparison) that the "trap" is in the other dataset but missing from yours, and submits it to you to add.
This is arguably too small an addition to be copyrighted on its own, but regardless of that, it would then be all you have to remove to get back to a clean version. And since it's erroneous data, you would want to remove it anyway.
If my website is hosted in EU but a company scans it from the internet in the US, how could they possibly know it is hosted in EU?
But very broadly speaking you would need to sue in an EU court to enforce EU law. And you could sue a US company in a specific EU country's court if the company had more than some minimum level of connection to that country. The country the data is hosted in isn't key, though it can be evidence of connection to that country.
There is definitely creativity in writing code; it’s not a completely deterministic translation of even a complete specification.
That said, let's say there's a new model that explicitly excluded closed source and copyleft licenses. Well, the MIT, MPL, Apache, BSD: they all say you can't strip their licensing off.
Okay, so to get to the spirit of your question, let's say Github managed to program a model that worked using only their own code or code that was explicitly put in the public domain. If Github managed to reproduce code that wasn't in the training set, then it can't be accused of copying it. At that point the argument could be made that it independently created it.
At the same time algorithms can't be copyrighted, but implementations of an algorithm can be, so if Github was basically just spitting out an algorithm that just happened to be implemented similarly to how some other code it wasn't trained on implemented it, then I would say there was no copyright violation.
If the comment is something like
//check fromIndex is greater than toIndex
then that is not any more individualistic or different than the actual function. Sadly, many people comment like this. On the other hand, if it reproduced a comment with typos or something more complicated like
/* this hack is because Firefox's implementation of SVG z-indexing does not match how Chrome or Safari does it - please read this article ...url...*/
then yeah, then you would have something
Consider a junior dev who writes a range check function while working for a company (so they own the copyright) then goes to a different company and writes the same range function because that's just how he writes code.
Has copyright been infringed?
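For context, here is roughly what a range check function of the kind under discussion looks like. This is a hypothetical Python sketch written for this thread, not the code at issue in any lawsuit; the name `range_check`, its parameters, and the choice of exceptions are assumptions.

```python
def range_check(length: int, from_index: int, to_index: int) -> None:
    """Raise if [from_index, to_index) is not a valid slice of a sequence of `length`."""
    # check from_index is not greater than to_index
    if from_index > to_index:
        raise ValueError(f"from_index({from_index}) > to_index({to_index})")
    # check both ends fall inside the sequence
    if from_index < 0:
        raise IndexError(from_index)
    if to_index > length:
        raise IndexError(to_index)


if __name__ == "__main__":
    range_check(10, 2, 5)  # valid range, returns silently
    try:
        range_check(10, 5, 2)
    except ValueError as err:
        print("rejected:", err)
```

There are only so many ways to order three comparisons, which is why two people (or one person at two employers) can plausibly arrive at near-identical text.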
Then the legalities can be argued, but an individual is in any case not remotely comparable to a service like copilot.
Why is this? Copilot in some ways is an automated way to search code & stack overflow. There is a very annoying website that does nothing more than show relevant code samples of various google search terms.
If the manual version of something is okay (eg: googling for code, finding it, fitting for a new and specific purpose that is similar), why would an automated version of that be any different?
Practically with an LLM the programmer can focus on the creative part (handler function, react component, etc) while the LLM generates the necessary boilerplate for the ever changing frameworks and infra configurations. The programmer (and QA) would still review and test everything but would save time writing boilerplate and ship features faster.
GPT-style models literally aim to reproduce the input character by character (token by token).
The _only_ escape clause is some random function that says how arbitrary a code block is. Or nontrivial.
A person or AI can absolutely be violating copyright via your example.
yes
now if he had written a specification as to what the function should be, then passed it to someone else that had never seen the function and worked from the spec then he'd be ok
see: IBM BIOS
It's not nearly that simple. No real copyright case is going to hinge on what a single range check function looks like.
This is human law, it's not a programming situation where you can just apply some simple rule and get a deterministic answer. Context plays a huge part, among other things.
On a more serious note, there is a question whether algorithms and code blocks can be copyrighted, or if it is the _software_ that is copyrighted. Let's say I use websockets and you crib my usage of websockets for your own application. My opinion is that unless you rebuild the same thing I did, then "cribbing" is the long held art of "let me google how to do that". The artistic creation is the end software product, not really some measly embedded function that is boiler plate (form and function) for anything to work.
The 'form and function' clause of copyright almost certainly makes a range check function not a copyright infringement.
without an Agreement in sight, it's not, the two terms conflict with each other as there's no clear precedence
(I think it's likely that there's an indemnity clause in any Agreement though!)
Matched content:
n ; Ap = A->p ; Ai = A->i ; Ax = A->x ; for (j = 0 ; j < n ; j++) { for (p = Ap [j] ; p < Ap [j+1] ; p++) { y [Ai [p]] += Ax [p] * x [j]
License Summary
This snippet matches 500 references to public code. Below, you can find links to a sample of 50 of these references.
NOASSERTION (405)
MIT (26)
GPL-3.0 (19)
BSD-3-Clause (16)
GPL-2.0 (11)
Apache-2.0 (7)
BSD-2-Clause (7)
LGPL-3.0 (6)
LGPL-2.1 (3)
File References
Match Location Repo License
ChRis6/circuit-simulation Unknown license
AndySomogyi/SuiteSparse Unknown license
ru-wang/slam-plus-plus Unknown license
Cruvadio/invariant_measures Unknown license
nishant-sachdeva/rrc-g2o Unknown license
alecone/ROS_project Unknown license
gustavopr/HANK Unknown license
lcnbeapp/beapp Unknown license
imod-mirror/IMOD Unknown license
clach/MPM Unknown license
MagicPixel-Dev/cxsparse Unknown license
elshafeh/own Unknown license
squirrel-project/squirrel_nav Unknown license
lcnhappe/happe Unknown license
cix1/OpenSees Unknown license
pachamaltese/dulmagemendelsohn Unknown license
gina10287/Interactive-shape-manipulation-FinalProject Unknown license
GuillaumeFuchs/Ensimag Unknown license
w2fish/CSparse Unknown license
cffjiang/cis563-2019-assignment Unknown license
diesendruck/gp Unknown license
hendersk101401/jlabgroovy Unknown license
robhemsley/SuiteSparse Unknown license
Glaphy/Emission Unknown license
Glaphy/Emission Unknown license
daves-devel/ECE1387 Unknown license
anranknight/TE Unknown license
weigouheiniu/TE Unknown license
Datoow/fm Unknown license
chaoyan1037/openMVG_modified Unknown license
cran/igraph Unknown license
GHilmarG/UaSource Unknown license
ZhaoqunZhong/Kalibr-ubuntu18-ros-melodic Unknown license
hechzh/g2o Unknown license
Open-Systems-Pharmacology/OSPSuite.CPP-Toolbox Unknown license
elshafeh/own Unknown license
yizhang/riotstore Unknown license
sgalazka/porr_mtsp Unknown license
skydave/sandbox Unknown license
alitekin2fx/orb_slam2_android Unknown license
Tianhonghai/vslam14_note MIT
LRMPUT/PlaneSLAM MIT
albansouche/Open-GeoNabla GPL-3.0
khawajamechatronics/mrpt-1.5.3 BSD-3-Clause
3000huyang/suitesparse-metis-for-windows BSD-3-Clause
igraph/igraph GPL-2.0
LRMPUT/DiamentowyGrant Apache-2.0
kurshakuz/graduation-project BSD-2-Clause
ghorn/debian-casadi LGPL-3.0
rmcgibbo/tungsten LGPL-2.1
Looks like this code in https://github.com/ChRis6/circuit-simulation/blob/2e45c7db01... is older than the GPL code in question or provided by the example. Uh oh, did we discover something? Who actually owns this code because this code predates the code in question by a calendar year using git blame, by a different author, and with no license attached to the oldest code. Is it possible the code in the codeium.com example is relicensed and not GPL code at all?
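For anyone wondering what the matched content actually does: it reads like the inner loop of a sparse matrix-vector product (y += A*x) over compressed-column storage, and the repository names in the list suggest it descends from CSparse/SuiteSparse, though that is an inference from the listing. A line-for-line Python transcription, keeping the snippet's variable names and assuming the usual meaning of `Ap` (column pointers), `Ai` (row indices) and `Ax` (values):

```python
def csc_gaxpy(n, Ap, Ai, Ax, x, y):
    """Accumulate y += A @ x for an n-column matrix A in compressed-column form."""
    for j in range(n):                      # walk the columns
        for p in range(Ap[j], Ap[j + 1]):   # nonzero entries stored for column j
            y[Ai[p]] += Ax[p] * x[j]        # entry sits in row Ai[p]
    return y


if __name__ == "__main__":
    # A = [[1, 0],
    #      [2, 3]] stored column by column.
    Ap, Ai, Ax = [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0]
    print(csc_gaxpy(2, Ap, Ai, Ax, [1.0, 1.0], [0.0, 0.0]))
```

Given the storage format, the loop is close to the only natural way to write this operation, which may help explain why it turns up in hundreds of repositories under so many different licenses.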
We are on this part of the ai takeoff graph. https://waitbutwhy.com/2015/01/artificial-intelligence-revol...
People had no reason to believe one day we would finally understand what causes the thunder. We finally did, and it is not made by Zeus.
I would not be shocked to find out that AGI (using Altman's definition) is more than 50 years away, but I also would not be shocked if it came in 5.
It's really hard to know how scared to be, I think that rationally I should be pretty terrified but I'm not.
We're also seeing lots of optimizations with new models (RoPE/RoPER embedding, Swish/GeLU activation, Flash Attention, etc) but I think some of the most interesting gains we'll be seeing soon will come from inference-optimized training (-70% parameters for +100% compute) [1] combined with sparsity pruning (-50% size w/ almost no loss in accuracy) [2] and quantization [3], which will lead to significantly smaller models performing well.
[1] https://www.harmdevries.com/post/model-size-vs-compute-overh...
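As a rough illustration of the quantization step mentioned above, here is a minimal absmax int8 sketch in plain Python. Real toolchains use per-channel or per-block scales and more careful rounding, so treat this as the basic idea under simplifying assumptions; the function names are made up and the input is assumed to be a non-empty list of floats.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127 if peak else 1.0   # avoid dividing by zero for all-zero input
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    # Each restored value is within half a quantization step of the original.
    print(q, dequantize(q, scale))
```

Each weight then needs one byte plus a share of the scale, at the cost of a rounding error of at most half a step per weight.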
they're also not going to find another 2, 4, 8, 16 ... internets worth of content to parasitise
Not all types of AI need external training data, you can train on how effectively a goal is achieved
No, the very definition of training is that there is a goal to train for. Those calculations were created by humans with goals. For LLMs, the goal is token prediction.
Evolution has no training.
you realise this exact extremely famous function was the focus of a billion dollar supreme court copyright battle that went on for years?
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...
(the entire basis of GP's joke)
Oracle's position on that was legally incorrect, for the reason I was alluding to: the relevant standard requires that illegal copying involve the core of the creative expression of the original work, which a generic range check function clearly doesn't do.
So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.
Also, such "functionally identical but not violating copyright" transformation is not possible to do, both given the complexity of the problem and the sheer volume of the data.
And training it on some simplistically obfuscated code wouldn't help - all it would learn would be production of obfuscated code. Not useful for the intended use.
it doesn't need to understand the way a human might do the understanding.
The pattern that the LLM managed to extract could include the structure, rather than the pure text. And in reproducing the structure, the LLM can replace the variable names but keep the structure intact.
I am not sure if copilot is able to do this, but chatGPT was somewhat able to (if imperfectly at the moment).
The thing that the LLM needs to do is to convince a judge/jury that it has not created a copy, and that it operates differently from a transformation.
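The "replace the variable names but keep the structure intact" case is easy to detect mechanically, which matters for the copy-versus-transformation question. A small sketch using Python's standard `ast` module: it renames every identifier to a positional placeholder and compares what is left. The class and function names are invented for the example, and it renames builtins too, which is crude but consistent across both inputs.

```python
import ast


class _RenameIdentifiers(ast.NodeTransformer):
    """Rewrite every identifier to v0, v1, ... in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)
        return node


def structure(source: str) -> str:
    """Return a textual dump of the syntax tree with identifiers normalized."""
    tree = _RenameIdentifiers().visit(ast.parse(source))
    return ast.dump(tree)


if __name__ == "__main__":
    a = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
    b = "def sum_all(items):\n    s = 0\n    for i in items:\n        s += i\n    return s\n"
    print(structure(a) == structure(b))
```

Two snippets that differ only in naming produce the same dump, so renaming alone would be a weak argument that no copy was made.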
But it does - similar but not identical pieces of code are closer in the embedding space
That's what patents are for.
https://twitter.com/StefanKarpinski/status/14109710611816816...
Or sending millions of messages in an automated way can be illegal but millions of people sending a message is not.
The million messages example is interesting. Though, what examples are there? In what cases is something legal to do it once, but there is some threshold where you cannot do it many times?
The "sending millions of messages" is only perhaps illegal because it breaks terms of service. Or, the one message is perhaps also illegal but nobody cares to pursue litigation for one instance of an infraction. The point remains though, if an individual does something once that is legal - it makes that activity legal, period and full stop. No?
Note that my main objection is to equating a person doing something with an automated process. Sometimes it may be legal or other times illegal but it just clearly isn’t the same.
For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.
The reason is always the same. Courts and judges will look at the situation and make a decision about what seems fair and what does not. It is them that need to be convinced that a specific use of a copyrighted work is permitted either through fair use or by a license.
Interesting analogy. "Ripping" something off and only using it for your personal project sounds like the "playing a movie for a few friends". Doing so for the benefit of a corporation that then has thousands of daily visitors sounds like the "movie cinema" example. Though, in both cases it was an individual googling and finding how to implement a specific function.
"fair use" in copyright is pretty specific in that it refers to things like "you can play portions of a clip in order to comment on it." Or as another example, you can use clips/portions for the purposes of a review commentary.
"Form and function" is perhaps a very important crux here. Some things you can only do a certain way. For example, quick-sort: there is really only one way to implement quick sort (or otherwise it is not quick sort at all!).
Personally I feel the copyright line is higher than a function; the copyright is on the collection of functions which together create a specific piece of software. The individual functions IMHO are as copyrightable as a cog on a bike cassette, or the chain on a motorcycle.
Fair use seems to have changed in scope. Historically it seems to have been mostly about things like "play a clip in order to comment on it", but now we have things like Google making a copy of all books ever written in order for people to search through them. Similar arguments have been made over copying news articles from news sites in order to put a portion of them in search results. A Stack Overflow-like search engine that trawled proprietary code bases would likely be sued, but in theory they could argue fair use just like Google.
how can the rate be maintained?
exponential chip scaling is over, and they've parasited, sorry, trained on the entirety of accessible human knowledge
the rate may drop to zero
the exponent may even go negative once LLMs start ingesting their own hallucinations
I see this as a new development in language: it used to be restricted to meat neural nets and books, and now it can also be consumed and created by LLMs. A new self-replication path has opened for language. Language is an evolutionary system, it's alive. Without language humans are mere shadows of what they can be. Language turns a baby into a modern adult, and a randomly initialised neural net into ChatGPT.
The magic was always in the language, not in the neural network. We should care more about the size and quality of the training dataset than the model. Any model would do, all model tweaks are more or less the same. But the data, that is the origin of all the abilities. But we cannot own abilities, it should be fair game to learn abilities and facts even from copyrighted data. Novel and creative training examples should not be reproduced by LLMs, but mere facts and skills should be general enough not to be owned by anyone.
This does not apply to humans or machines.
By your logic, just pick any random bum off the street, give him the right training set, then he will become a 180 IQ genius and discover the unified theory of gravity and quantum mechanics.
Some models are just inherently better at modelling.
Chip scaling still seems to be going pretty fast, and we may discover new ways to make better use of the chips we currently have, like better methods of quantisation, or just using more of them. That could get us just far enough to reach the self-improvement threshold.
So we could end up hitting a wall with chip scaling or something, but I don't think it's that likely.
it's not been exponential for years
> So we could end up hitting a wall with chip scaling
we did, years ago
Curious, any concrete examples? I can't really think of any where one instance is okay but many is not. I can think of examples where one instance is ignored and many instances are harder to ignore (and so is prosecuted), but overall - I can't really think of anything that is okay to do once but not many times.
Unsolicited robocalls are illegal in many places where the same calls made by human callers may not be.
Really? Even a 5% generation-to-generation improvement would be exponential, it’s just 1.05 to the power of the generation. If it was linear you’d have benchmark results scaling by a fixed number of points each generation, which doesn’t seem to be a thing as far as I know
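To make the arithmetic in that comment concrete, here is a small sketch comparing the two growth patterns. The starting score of 100, the 5% rate, and the 5-point step are illustrative numbers, not real benchmark data, and the function names are made up for this example.

```python
def compound(start, rate, generations):
    """Score after each generation when every generation improves by `rate`."""
    return [start * (1 + rate) ** g for g in range(generations + 1)]


def linear(start, step, generations):
    """Score after each generation when every generation adds a fixed `step`."""
    return [start + step * g for g in range(generations + 1)]


if __name__ == "__main__":
    # Hypothetical numbers: 5% per generation versus +5 points per generation.
    for g, (c, l) in enumerate(zip(compound(100, 0.05, 20), linear(100, 5, 20))):
        print(f"gen {g:2d}  compound={c:8.2f}  linear={l:8.2f}")
```

In the compound series the ratio between consecutive generations stays fixed at 1.05, while in the linear series it is the difference that stays fixed. The two start out close together and separate more and more as the generations accumulate.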
if you change the exponent from 2 to 1.05 at some point then it is no longer an "exponential" function
(guess what happened to chip scaling?)
if the exponent changes (EVER) then it's no longer "exponential", it's likely sigmoidal
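As a rough illustration of the sigmoid point, the sketch below computes a logistic curve and the step-to-step growth ratio along it. The ceiling, rate, and midpoint are arbitrary assumed values, not measurements of chip scaling or anything else.

```python
import math


def logistic(t, ceiling=1000.0, rate=0.7, midpoint=10.0):
    """Logistic (sigmoid) curve; all parameters are illustrative assumptions."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def step_ratios(values):
    """Growth factor between each pair of consecutive values."""
    return [b / a for a, b in zip(values, values[1:])]


values = [logistic(t) for t in range(21)]
ratios = step_ratios(values)

if __name__ == "__main__":
    for t, r in enumerate(ratios):
        print(f"step {t:2d} -> {t + 1:2d}: growth factor {r:.4f}")
```

Early on the growth factor is close to `exp(rate)`, so the curve is hard to tell apart from a true exponential. It then falls steadily toward 1 as the curve approaches its ceiling, which is the behaviour the comment is describing.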