The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
IANAL, but this basically sounds like LLaMA was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least from the last checkpoint before they were used).
And they will have much better knowledge, answers, etc. than the Western, lawyer-approved models.
Sometimes knowledge needs to be set free I guess.
At this point with the quality of current web content and the collapse of journalism as an industry I think we can say online ads have utterly failed as a replacement income stream.
Unless you want all LLMs to say “I’m sorry, the data I was trained on ends in 2023,” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.
Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.
it's in a weird place imo: with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same
ie,
you're allowed to scrape the web
you're allowed to take what you scrape and put it in a database
you're allowed to use your database to inform on decisions you might make, or content you might create
but once you put an AI model in the mix, all of a sudden there are problems. despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before
and if truly free and open source LLMs come into the game, then might the corporate ones become crippled from copyright? that's bad for business
They probably can:
https://github.com/zjunlp/EasyEdit
> I wonder if this is going to cause issues down the road.
There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.
... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.
No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of information from the training data. The project you linked only describes a selective finetuning approach.
Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the larger the time people spend in the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.
https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'
This is not charity, this is a shrewd business move.
My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.
I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.
How can I play with open source LLMs locally?
You can leverage those big CPUs while still loading both GPUs with a 65B model.
... If you are feeling extra nice, you should set that up as an AI horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return allowing you priority access to models other hosts are running: https://aihorde.net/
It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.
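For a rough sense of why the smaller models fit on ordinary machines, here is a back-of-the-envelope sketch of weight memory only. It assumes the nominal parameter counts in the model names (7B, 13B, 30B, 65B) and a uniform number of bits per weight; real quantized files add some overhead, and inference also needs memory for the context cache and activations, so treat the numbers as lower bounds. The function and table names are just illustrative.

```python
# Rough weight-memory estimate: parameters * bits per weight / 8 bytes.
# Assumption: nominal parameter counts taken from the model names; real
# checkpoints differ slightly, and runtime memory (KV cache, activations,
# quantization metadata) is NOT included.

NOMINAL_PARAMS_BILLIONS = {"7B": 7, "13B": 13, "30B": 30, "65B": 65}


def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def estimate_table(bits_per_weight: float) -> dict:
    """Map each nominal model size to its estimated weight memory in GiB."""
    return {
        name: round(weight_memory_gib(count, bits_per_weight), 1)
        for name, count in NOMINAL_PARAMS_BILLIONS.items()
    }


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit:", estimate_table(bits))
```

Under these assumptions a 4-bit 7B model needs only a few GiB for its weights, while a 4-bit 65B model needs roughly 30 GiB, which is why the larger sizes get split across GPUs and system RAM.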
Also, it has no "1 click" exe release like kobold.
I originally had two 2080 Tis, partly to experiment with virtio/Proxmox (you need one GPU for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run Windows games fast in a VM, and Proton made that unnecessary). Later on I upgraded one of them to a 3080 Ti.
It's a System76 machine, they make good stuff
Well, now there is a commercial release. I guess it wasn't some corporate plot after all!
Some people just can't admit when a corporation does a good thing.
(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free.)
Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?
Does it mean that any blogs that I wrote from my own insights, will automatically be trained on the model… without my permission?
As an author, it feels like it’s stealing the knowledge and insight without appropriate attribution.
hardware is the only moat
If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.
QLoRA is the most cost-effective method so far. Some people also do fine-tuning on Google TPUs.
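To illustrate where the savings come from: LoRA freezes each pretrained weight matrix and trains only two small low-rank factors next to it, and QLoRA additionally keeps the frozen base weights in 4-bit. The sketch below just counts parameters for a single layer. The 4096 x 4096 layer shape and rank 8 are illustrative assumptions, not figures from any specific model or training run.

```python
# Parameter-count sketch for a LoRA adapter on one linear layer.
# A frozen weight W (d_out x d_in) gets a trainable update B @ A,
# where A is (rank x d_in) and B is (d_out x rank).
# The dimensions used below are hypothetical, chosen only for illustration.


def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when fine-tuning the whole matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters when only the two low-rank factors are trained."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the layer's parameters that LoRA actually updates."""
    return lora_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 8  # assumed layer shape and rank
    print("full fine-tune:", full_params(d_in, d_out))
    print("LoRA adapter:  ", lora_params(d_in, d_out, rank))
    print(f"fraction:       {trainable_fraction(d_in, d_out, rank):.4%}")
```

With these assumed sizes the adapter trains well under 1% of the layer's parameters, so gradients and optimizer state shrink accordingly. The 4-bit base weights in QLoRA reduce the memory for the frozen part on top of that.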
Open-source commercial?
Free as in beer vs. free as in speech and the whole thing.
If you listen to the definition the Open Source Initiative would have applied to the term open source had they succeeded in acquiring rights to the term, then commercial is redundant with open source, not the opposite of it.
Virtually every discussion in the LLM space right now is almost immediately bifurcated by the "can I use this commercially?" question which has a somewhat chilling effect on innovation. The best performing open source LLMs we have today are llama-based, particularly the WizardLM variants, so giving them more actual industry exposure will hopefully be a force multiplier.
In your scenario, despite the unrealistic coding process, the machine code is the source code, because that's what everyone is working on.
In the development of LLMs, the weights are in no way the preferred form for development. Programmers don't work on weights. They work on data, infrastructure, the model, the code for training, etc. The point of machine learning is not to work on weights.
Unless you anthropomorphize optimizers, in which case the weights are indeed the preferred form of editing, but I have never seen anyone, even the most forward AGI supporters, argue that optimizers are intelligent agents.
It seems like the existing large platforms of today (Microsoft’s enterprise moat, Google’s ads and internet services, Meta’s social networks, Apple’s consumer and mobile products) will remain the primary platforms of the future. So having models that can operate exclusively on those platforms via integration with their key products and data will only continue this trend. If you’re an outsider with an AI model, you’ll have a harder time getting access to critical data and your standalone AI product (e.g., ChatGPT) won’t be as useful.
More broadly speaking, I believe the days when the top X largest companies in the stock market would be displaced by newer companies every decade or so are over. The FAANGs just control so many major platforms in so many aspects of our lives.
It also helps that they buy or otherwise cooperate to destroy their competition in questionable ways while heavily lobbying the gov to favor them over others in a quid-pro-quo that benefits politicians and not their constituents.
I disagree: I think big tech is hard to disrupt ATM because the companies are still young and nimble. In the last cycle, the companies being displaced were ancient (by tech standards). When Google and Facebook are 30 years old, their DNA will get in the way of adapting to a new paradigm that will change the world. A paradigm that may be to the Metaverse what the smartphone was to the Apple Newton.
Maybe that's Meta's play here? Maybe the idea is that the ecosystem around a model could be as valuable or more valuable than the model itself too, so an OSS model could benefit Meta a lot more by gaining more of the ecosystem mind share?
Or maybe Yann LeCun is just a hippie who dreams of free love, hard drugs and open-source models?
They might have done well to make gg an offer he couldn't refuse and take on ggml and llama.cpp as an open source project.
Facebook benefits heavily from the open source development done on LLaMA. There was a report I saw that Facebook has started using llama.cpp internally for inference. Updates to the licensing will cement Facebook as the go-to choice for open source language models.
My hypothesis, based on the context of Mark discussing the release, is that it's going to be completely open source and can be licensed to be used commercially. Not that Meta is going to add a whole new revenue side of business to compete with OpenAI. i.e. "Here is model, with commercially permissive licensing" not "Here is model that you can use commercially but must pay me"
https://www.youtube.com/watch?v=Ff4fRgnuFgQ&ab_channel=LexFr...
They can even write it as 'good will' on their financial statements.
It kind of is working.
It seems they will release the weights under some license that allows commercial usage.
How they monetise it (which I assume they will try and do?) is an interesting question.
Maybe some variant of paying a licensing fee?
There doesn't necessarily have to be one. Facebook's goal may be to help commoditize its complements. https://gwern.net/complement
https://huggingface.co/ycros/airoboros-65b-gpt4-1.4.1-PI-819...
Check the prompting syntax here, it has a huge effect on the output:
In a rough way, a NN is just a compiler designed to translate a boatload of simple data into a useful program that operates on similar data.
It might feel like "brand rehab" or "good will" as a consumer, but a lot of this work was put in motion a while ago.
It’s really the ultimate nightmare with the internet becoming just TV 3.0 in which content is controlled and curated … you just consume mindlessly.
Any attempts to create a Reddit clone.. or system in which people freely communicate is now “regulated” for “hate” speech or “terrorism”. The days of open discourse … appear to be numbered. Even email will be analyzed by AI to look for “trends” or “optimize” employee efficiency.
It really is time for a new internet.
> Any attempts to create a Reddit clone.. or system in which people freely communicate is now “regulated” for “hate” speech or “terrorism”.
The fact is that many “systems in which people freely communicate” are regularly first adopted by people who participate in hate speech and terrorism.
> ... Infinite Jest, also called "the Entertainment" or "the samizdat". The film is so compelling that its viewers lose all interest in anything other than repeatedly viewing it, and thus eventually die.
You release your weights, others can build on top of that, fine tune it in different ways, produce new weights they can share with others. Seems very OSS-y.
I feel like there is some semantic nitpicky point being made here that is completely going over my head.
For all practical purposes, if you are part of the team who released the LLMs, you would be writing and modifying the code of data processing, of the model, and of the training process. Those should be considered source code.
And we do have the model, which is pretty OSS-y, and which is why we can fine-tune the weights. But from a broader perspective, it's not fully OSS-y, because we don't have the code for anything else. There's no way to change, for example, how the training is done in the first place.
The network architecture itself is not source code, but a rough specification constraining the optimizer, which searches for possible program descriptions that, within the specified constraints, minimize some loss function with respect to the data.
Neither data nor network architecture is the actual source; they are better seen as recipes which, if followed (albeit at great expense), allow finding behaviorally similar programs. As you can see, the standard ideas of open source don’t quite carry over because the actual "source-code" is not human interpretable.
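A toy sketch of that framing, under heavy simplification: the "architecture" is a one-variable linear function, the "data" is five points, and plain gradient descent is the optimizer that searches for weights minimizing mean squared error. Everything here (the data, learning rate, step count and function names) is made up for illustration and has nothing to do with how any real LLM is trained.

```python
# Toy illustration: data + architecture are the recipe, the optimizer
# finds the weights. Architecture: y = w * x + b. Loss: mean squared error.
# All values below are illustrative assumptions.


def train(xs, ys, learning_rate=0.05, steps=2000):
    """Search for (w, b) minimizing mean squared error by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current "program" on the training data.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]  # data produced by a hidden rule
    w, b = train(xs, ys)
    print(f"learned weights: w={w:.4f}, b={b:.4f}")
```

The two learned numbers are the only artifact you would ship. Nothing in them records the loop or the data that produced them, which is the sense in which released weights resemble a compiled output more than source.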
I've often talked about weights being the equivalent to assembly, your note seems to map to a similar intuition. And in that sense provided we ever solve the interpretability problem, we could in theory disassemble the weights to achieve similar outcomes as we do in asm-to-C. Interesting thought experiment insofar as, if the weights ought not be classified as open source (notwithstanding your first point which I agree with), can the disassembled output be classified as open source?
That's totally fair. And you're correct in that I was making an argument for positive outcomes being orthogonal to the semantic distinction.
> I also believe that actual open source models have the near-term opportunity to make an impact and shape the future landscape, with red pajamas and others in the works. The distinction could be very important in the near term, at the rate this field is developing at.
I think Falcon and MPT support your point as well, but those are still models that were trained on very small budgets relative to llama or gpt-3/4. There's a clear quality delta, albeit that gap is closing. Through that lens, I think having a large, well-funded org doing the pre-training work for the OSS community and releasing the weights permissively is a net positive.
At most, these efforts will amount to data laundering where it will be impossible to prove that a piece of data was used to train the model, not provide conclusive proof that it was removed.
... But yeah, fundamentally the only way to throw out the books is to throw out the weights.
Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.
Also, I have seen 2 separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."
I don’t think society is going to have a hissyfit until some app comes along that makes it super easy for people to train good models locally on people and then generate whatever they want. That day’s coming really soon though.
If you can show me people who work in AI calling just the weights a "model" then I would happily update my internal definition of the word. I am certainly not an expert in the subject, I am just going off what I've read from the community over the past few years.
The pieces to do local LoRA training are all there, but honestly the tyranny of CUDA is the biggest blocker for the average person.
I know there was a phone app that did a limited thing where they gave you profile images and they made bank. I'm a little surprised nobody has tried going whole hog, if the app stores would even allow it.
By some definition of "worked". If we define "worked" as "made money for", who it worked mostly for are the middlemen and a minority of writers... a minority that with the advent of LLMs is likely to shrink even further.
It was probably intended that way, but the reality is that the power has been with the publisher since the beginning, and they've absolutely been screwing over the authors as well. Only the most successful authors have gotten decent deals.
I don't have an answer to this either though, I just wanted to point out that copyright has arguably never been successful at getting money to the content creators proportional to the value the publisher extracted from the work either.
My guess is extremely poorly. Again, the biggest might be fine. Instead of publishers paying fairly little to authors they could just literally take the best books and print them, taking all of the profits…not to mention ebooks.
I’m not an author so I can’t speak to how much publishers make, but I’d assume that if one was way better than the others in how much they distribute to authors, all of the best authors would jump ship. Markets have a way of working things out.
A lot of people want to be authors, and any time that happens - game dev, teachers, musicians, etc. - you’re going to take on a bit of extra hardship compared to other jobs.
According to https://www.spiegel.de/international/zeitgeist/no-copyright-... we already had that A/B experiment.
My point was that it doesn't improve their lives, and that's much easier to check in isolation just by reading the news about the current writers' strike and how the industry just ignores it until fall, expecting their savings to run out.
Really, copyright just doesn't give the content creators any meaningful power as this right is generally owned by the industry/publisher, not the authors.
Journals get their content for free. Actually often they charge the authors for it.
Research is mainly funded by governments and taxes.
https://www.brookings.edu/articles/rd-for-the-public-good-wa....
Industrial R&D also tends to be more "research for hire" rather than pure research. A bit closer to consulting.
Anyway my point still stands.
Put differently, we consider (but don't think a whole lot about) Wikipedia's "funding," because that's NOT the most important part/innovation of that model.
We'd do better to ask what is.
Can you give some examples of new knowledge that was copyrighted? Generally copyright is used to protect art, software and textbooks. People who produce new knowledge generally are not paid by copyright. The knowledge is either kept secret or published in a journal from which the author receives no compensation.
I think my main point here is that “legal” does not imply moral or acceptable to society, and our understanding of the technical legal status is not a prerequisite for exploring those factors, which may be the thing that changes the legal status in response to the major shift in landscape.
You risk nothing by assuming things are legal until explicitly illegal.
They are not similar, and I suspect that if they were (i.e. humans could absorb that much information), the information landscape and the market models for exchanging value would look nothing like they do today, and AI wouldn’t be rocking the boat, it’d just be another adherent to the resulting laws.
You can't take a regime that works decently with human-rate copying and convert it to computer-rate copying, because fundamentally the give-and-take of rights to each side is balanced against feasible limits of reproduction.
Or, to put it another way, if you can copy/synthesize at most 1 book a day, I can extend you a lot more implicit rights... than I can afford to someone who can copy/synthesize every book ever in a day.
IMO Google and their massive Google Books DB would have a better leg to stand on here if they trained on that dataset, as they owned physical copies of all the books.
The problem with current AI models is that they memorize stuff: there is the case of an AI memorizing an algorithm perfectly, or reciting quotes from Dune and then getting censored.
Now you, as a paying user of these AI tools, are not writing reviews but probably using them for commercial purposes, and it would not be fair if your proprietary code used code copy-pasted from GPL code.
If these AIs were so clever, then IMO you could have them learn, say, Python exactly like a human would: a few books and some exercises on Python, some books on algorithms, some books on HTML or whatever tech. But today they train on the whole of GitHub and you get a mix of stuff. My suggestion would also improve the sorry state of JS in ChatGPT, where it uses super old syntax and outdated patterns like it is coding for IE6. My guess is that this is because it is trained on old or bad code, and this means most of the code from now on will be old syntax and bad.
Citation needed.
More interesting is the broader conversation which involves society’s response to a major shift in the information economy, new questions about what role these tools should play, and how laws should evolve accordingly.
The factors surrounding the emergence/unfolding of AI tooling can’t be stripped down to just the corporate interests involved.