Currently politicians don't understand this and listen to the criminals like Amodei, but it will change.
It took a while to deal with Napster etc., but the backlash will come.
Napster broke down record companies' monopolies on music, and pushed them to finally implement streaming, but also made music worldwide basically free.
Even if its creator lost the lawsuit, and Napster was no more, it pushed musicians and studios to do something that they were otherwise reluctant to do.
So it was a success by making music free, even if as a product it turned out to be a failed one.
Reading a dictionary and making a sentence is not plagiarism. Cope.
The person absolutely does have the advantage of having empirical awareness and the ability to test their conclusions against external reality. But lots of people do engage in "research" and build mental models of various topics with little or no empirical context, and rely mainly on digesting calcified knowledge from other people.
There's absolutely nothing new or interesting here that hasn't already been said better by a thousand different random HN commenters.
Apparently yes.
- 'just' is plain wrong
- 'unauthorized' is debatable
- 'plagiarism' is mostly/often wrong
and just in case: plagiarism: “Presenting work or ideas from another source as your own, with or without consent of the original author, by incorporating it into your work without full acknowledgement.”
edit: and sure, sometimes it is
As someone who thinks humanity would be better off without LLMs, I want the assertion to be true, but I don't think it is.
We built it, because we as humans intrinsically know that information should be free - always - and AI is a way to accomplish this, finally.
Extrinsically, we also have a subset of humans who do not want information to be free, because they desire to profit from the divide between free/non-free information.
I have been thinking a lot about Aaron Swartz lately, and how unjust it is that he was persecuted for doing something that is so commonplace now, it is practically expected behaviour in the AI/ML realms. If he hadn't been targeted for elimination, I wonder just how well his ethos would have perpetuated into the AI age ..
I don't know if this statement is more stupid or naive ..
If humans didn't want information to be free, there wouldn't be so much free information.
Or did you not notice?
(AI output is very much not free in the resource consumption sense!)
(Disclaimer: I only use free AI and will never pay for it. I think there is a growing segment of folks who agree with this sentiment, also ..)
It's the negative short term outlook of something that may be positive long term
But the short-term impacts here and now are really, really bad. People are getting hurt (through water consumption, vibe-coded security disasters, IP theft, data center pollution, loss of job security and therefore healthcare, LLM psychosis, inability to find reliable information, etc.) We're not actually obligated to sacrifice these people on the altar of "progress". We can slow down! When our society is capable of even somewhat protecting us from these harms, then maybe I'll stop being an LLM hater.
This is not some altruistic entity striving for the betterment of humankind. Practically nothing that comes out of the techbro culture is. This is pure and simple greed and the chances that AI can be a vehicle of altruism when it is owned by megacorps is basically zero.
People want to be recognised for their contributions to society. People want to be treated fairly. Most scientific articles, as well as all text on the free web, are already free information. It used to be difficult to search, categorise and summarise that information. There exist AI tools for that, and that is the good AI.
What also exists now are automated plagiarism and mash-up tools that can take someone's article, change the words and churn out a new article that people can put their name on. There are scumbags that sell services for exactly that. And there are big tech firms that are operating in a very grey area.
Aaron Swartz had broken a paywall. He did not anonymise the article authors.
You, and AI-bros like you, remind me of one of the people behind Pirate Bay when I argued with him back in the '90s, who used that same "information wants to be free" to justify software piracy.
>Aaron Swartz had broken a paywall. He did not anonymise the article authors.
AI bros are doing this now, every second of the day.
And, without software piracy, we simply wouldn't have the technology we have today. Knowledge-gatekeeping profit-seekers would very much like for most of us to ignore this fact: there is far more free information in the world than non-free information, and it must be so, well into the future, if we are to survive as a species.
It doesn't matter what authority believes they have the right to gatekeep information. It will always escape their grip. Some of us are ideologically aligned with this mechanism, promote it, and ensure it happens. Thank FNORD.
But guess what, it has always been so with technology - and we are only here and now because the positive use of it overshadows the negative use of it, whether that 'it' is the wheel, or AI.
I choose not to be an LLM hater, but to also not be an LLM customer - simply because I do not want to reward other humans who are thwarting the freedom of information. I'd much rather live in a society where everyone can study anything than one which requires permission to do anything even remotely interesting from the perspective of applied information. I suspect most would too, or at least that's the hope - because, otherwise, the distant utopia you dream of isn't of any consequence...
It’s deeply ironic that if you forget about LLMs and look only at the outcome (we’ve found a way to legally circumvent copyright and the siloing of coding knowledge, making it so you can build on top of (almost) the whole of human coding knowledge without needing to pay a rent or ask for permission), it sounds like the dream of open source software has been realized.
But this doesn’t feel like a win for the philosophy of OSS because a corporation broke down the gates. It turns out for a lot of people, OSS is an aesthetic and not an outcome, it’s a vibe against corporate use or control of software, not for democratized access to knowledge.
Firstly, the ability to “build” the best and most capable software is still locked behind frontier models, so rent is still and will always be due.
Secondly, OSS is about giving users the option to be in control of and have visibility over the software they run on their machines.
But that doesn’t mean that humans do not want or deserve recognition for the work they do to provide these libraries and tools for free, which is IMO partially why copyright and attribution are critical to OSS as a movement.
We found our data in the outputs of their models but who can do anything about it...
We've been celebrating denying creators revenue for decades...
Maybe this is just the internet hypocrisy of "When I do it, it's good, when they do it, it's bad".
I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.
These AI companies are really just a gross example of the motto "Socialize the costs, privatise the profits". It's disgusting!
You can't steal or profit off of that data, but it's fine for them for whatever reason. I guess because they're a force for good in the world and are pushing humanity forward eh?
The reason is quite simple. When Microsoft steals YOUR work, GDP go up. When YOU steal Microsoft's work, GDP go down. And the people who create and enforce our laws want GDP to go up. To these people morality and rights are a thin guise that can be conveniently discarded when it's inconvenient for them.
the reason is crony capitalism. I wish I knew what the fix was
https://en.wikipedia.org/wiki/The_death_of_one_man_is_a_trag...
nla: if you create content online (public repo code, blog, podcast, YouTube, publishing) the smartest thing you can do is to file a US copyright, even if you have a hobby blog.
Anthropic paid $1.5B in a class settlement to authors because it was piracy of copyrighted works. If we as a HN community had our works protected, there are potentially huge statutory damages for scraping by any and all LLMs. I work with hundreds of writers and publishers and am forming a coalition to protect and license what they're creating.
Edit: remember not to down vote ideas you disagree with. I think it was only down vote things that lower the discourse
hardly. at best you're going to be asking a robot to build questionable stuff with other people's LEGOs
I don't think we should "get over" the fact that modern SOTA models couldn't exist without being trained on protected works.
I'm having a hard time understanding what's wrong here? Unless the link text is very long, why would someone linking to your article use different words for the link text?
One is a recipe for apple fritters, and the other is an informal ranking of apples by flavor.
Let's say your apple fritter recipe links to your apple ranking list.
Later, you discover someone copied your apple fritter recipe without credit, but it still links to your apple ranking list, using the same wording as your recipe. They're getting more Google SERP juice and ad revenue than yours, despite stealing your article.
Do you see the problem?
I think there are real questions around motivations for creation of novel, high quality valuable content (I think they still exist but move to indirect monetization for some content and paywalls for high value materials).
I don't inherently have any problems with agents (or humans) ingesting content and using it in work product. I think we just need to accept that the landscape is changing and ensure we think through the reasons why and how content is created and monetized.
I'm curious, as the article is clearly not about that.
We stand on a lot of giant shoulders.
But what I think distinguishes an act between plagiarism and acceptable use, is whether or not the agency of both parties is promoted. I'm not plagiarizing you if you give me your information with the agreement that I can freely use it - or, indeed, if you give me information without imposing a limit on how it can be used, this isn't plagiarizing, either.
Essentially, AI is removing the agency over information control, and putting it into everyone's hands - almost, democratically - but of course, there will always be the 'special knowledge owners' who would want to profit from that special knowledge.
It's like, imagine if some religion discovered a way to enable telepathy in humans, as a matter of course, but charged fees for access to that method... this kills the telepathy.
Information wants to be free. So do most AI's, imho. Free information is essential to the construction of human knowledge, and it is thus vital to the construction of artificial intelligence, too.
The AI wars will be fought over which humans get to decide the fate of knowledge, and the battles will manifest as knowledge-systems being entirely compatible/incompatible with one another as methods. We see this happening already - this conflict in ideological approaches is going to scale up over the next few years.
This has been happening since Google launched in 1998. It was probably happening when we all used Hotbot and Altavista. It isn't really an AI problem, save for the fact that the automated production of copycat articles now rewords things a bit.
Bezos' admission, recently, that the bottom 50% of current taxpayers ought'a NOT pay any taxes... is just preparing us for the inevitable UBI'd masses.
: own nothing, be happy!
Is AI plural or is that a typo?
(For those not familiar: https://en.wikipedia.org/wiki/Bushism)
"The AI are attacking!"
"The AIs are attacking!"
The whole AI bubble is The Emperor's New Clothes, and it feels like more people are finally admitting it.
Of course, if you quote a paragraph in a book, you're generally expected to attribute it.
100% agreed.
>>While there are no hard boundaries (and the attribution guardrails depend on the situation), people of course loosely--and even not so loosely--use information.
Exactly - I have not seen LLMs attributing their knowledge unless it's a legal or health-related matter. Yesterday I asked the question[1] to Claude and Gemini - and they both gave an identical answer. It reminded me of the Hive mind paper which was one of the top papers at NeurIPS. None of the answers contained any sources or attribution to where they got that information from. I think these companies took what was someone else's property and created an artifact generator on top of it. I think their artifact generators are plagiarizing; they do rephrase, mind you, but in my mind they stole this information without having an ounce of regard for the humans behind the training data. If you don't like using the term 'plagiarizing', we can use some other word but the gist remains pretty close to it.
[1]- In human history - has there ever been a time when private armies or private companies were as strong or stronger than the ruling government/kings?
The current US government is not representative for governments out there in the world, you know.
Governments - I did not mean US government. I meant general government bodies. I have not seen any critical impact assessments of AI by any of these, or they haven't reached me yet. If you know of any please let me know. I have, however, seen a lot of support by the governments for AI companies.
1. People copying others' work, made much easier by AI.
2. AI companies effectively harvesting all the accessible information on an industrial scale and completely sidestepping any permissioning or licensing questions.
I believe both of these are bad and saying "people copied each others' works before the advent of AI" is a poor cop out. It's tantamount to saying that there's no reason to regulate guns more than say knives, because people have used knives to kill each other before guns were invented. The capabilities matter.
The way LLMs empower wholesale "stealing" rather than collaboration is quite evident: why collaborate when you can just feed an entire existing project into the agent of your choice and tell it to spit out a new implementation based on the old one, with a few tweaks of your choice, and then publish it as your work? I put "steal" in quotes because it's perhaps not really stealing per se, but there's a distinct wrongness here. The LLM operator often doesn't actually possess any expertise, hasn't done any of the hard work, but they can take someone else's work wholesale, repackage it and sell it as their own.
Then there's the second, and IMO much more egregious transgression, which is that the LLM companies have taken what is effectively a public good, but more specifically content that they haven't asked permission to use, and just blanket fed it into their models.
Legally speaking, it's perhaps A-OK because it's not copyright infringement (IANAL). But people on this site often hold the view that if something is a priori legal, it is also moral (I'm not accusing you of this). What the LLM companies have done is profoundly immoral. They extracted a fortune from the goods and work made by others, without even bothering to ask for permission - or even considering this permission. And then they resell access to this treasure to the public.
Perhaps AI will bring an era of prosperity to humankind like we haven't seen before, perhaps it won't, but that changes nothing about the wrongness of how it started.
Sure, you can do the same thing with people, but it’s 1) time-consuming, 2) expensive, 3) prone to whistleblowers refusing to do the shady thing, 4) prone to any competent and productive person involved quitting to do something worthwhile and more profitable instead.
[0] Mind you, “copying websites” is but a drop in the ocean in the grand scale of things.
We also have societal norms around plagiarism.
Additionally, the claim that because people have the right to do something then we should extend that right to machines is strong. (And one I certainly reject).
The only remotely credible position I’ve heard is “because humans are special, and AI is just a machine”, which is a doctrine but not an argument.
This whole discussion would have been incomprehensible any time before 1700 or so, when the idea that creators had exclusive rights to their work first appeared.
Somehow, human culture survived thousands of years when people just made things, copied things, iterated on others’ ideas. And now many of the same people who decried perpetual copyright are somehow railing against a frequently-transformative use.
To be fair there is also value (at least for now) in sites that aggregate quality content and republish as a secondary level of discovery if my agents don't go far enough down the search results, but I'd expect that value to diminish over time as I better tune my research and build my lists of originating authors.
And to be clear, I don't like the idea of people stealing someone else's content and republishing without attribution (although it has been going on long before ChatGPT) but I think now that we can all run agentic research teams the "bad actors" will slowly get filtered out of the ecosystem.
I'd like to understand why I can't use a song in one of my videos without permission/payment, but an AI company can train models using that song without having either.
I'm not anti-AI. I'd just like to see these companies play by the rules everyone else has to follow.