Meta to release open-source commercial AI model (zdnet.com)

177 points by maskil 2 years ago | 159 comments

foob 2 years ago |

From the recent story about the Sarah Silverman lawsuit [1]:

The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

IANAL, but this basically sounds like LLaMA was trained on illegally obtained books, by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

[1] https://news.ycombinator.com/item?id=36657540

ramshanker 2 years ago | |

Sometimes I wonder: what if someone in some country downloads the whole of Z-Library/Libgen, all the books ever printed, all the papers ever published, all the newspapers and so on, trains a model on it, and releases the model as open source? There are jurisdictions with lax rules.

Such models would have much better knowledge, answers, etc. than the Western, lawyer-approved models.

Sometimes knowledge needs to be set free I guess.

TX81Z 2 years ago | | |

The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things.

At this point with the quality of current web content and the collapse of journalism as an industry I think we can say online ads have utterly failed as a replacement income stream.

Unless you want all LLMs to say “I’m sorry the data I was trained on ends in 2023” you still need a content funding model. Maybe not copyright, but sure as hell not ads either.

dogma1138 2 years ago | |

Training and copyright is going to be interesting. People can be trained on “illegally obtained” books too, yet you’d probably be hard pressed to argue that the work of any employee who downloaded a book or a paper from a “libre library” is fruit of the poisonous tree down the line.

l33t233372 2 years ago | | |

If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.

Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.

stainablesteel 2 years ago | |

it's not deemed illegal yet

it's in a weird place imo, with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same

i.e.,

you're allowed to scrape the web

you're allowed to take what you scrape and put it in a database

you're allowed to use your database to inform on decisions you might make, or content you might create

but once you put an AI model in the mix, all of a sudden there are problems. despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before

and if truly free and open source LLMs come into the game, then might the corporate ones become crippled by copyright? that's bad for business

brucethemoose2 2 years ago | |

> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

They probably can:

https://github.com/zjunlp/EasyEdit

> I wonder if this is going to cause issues down the road.

There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.

... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.

Der_Einzige 2 years ago | | |

I’ve been wondering when the landmark moral panic would start against Civit.AI and the coomer crowd. People have no idea just how much porn is being produced by this stuff. One of the top textual inversions right now is a… age slider… (https://civitai.com/models/65214/age-slider) ewww. It’s also extremely well rated and reviewed on there. I’m terrified of the impending backlash because, depending on what happens, the party going on in AI could end.

whimsicalism 2 years ago | | |

That is not at all the same thing as removing the books.

twayt 2 years ago | | |

> They probably can:

No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of the information from the training data and retraining. The project you linked only describes a selective finetuning approach.

potsandpans 2 years ago | | |

that is quite the spicy claim

wongarsu 2 years ago | |

If we accept the argument that you can train an ML model on data scraped from the internet because the model is sufficiently transformative and thus isn't impacted by the copyright of that data, then how does that change simply because somebody else distributed the data illegally? Either the ML model breaks the copyright chain or it doesn't. Or is the argument that using data that was provided to you in violation of copyright is illegal in general?

_Algernon_ 2 years ago | |

How is it different from training on random blogs, or Stack Overflow, or in general "The Internet"?

schleck8 2 years ago | |

Really, really bad look for Eleuther if this is true. I did not expect them to do something like this and not even see the issue with it.

Ancalagon 2 years ago | | |

Move fast and break the law.

Der_Einzige 2 years ago | | |

Most large datasets are full of copyrighted content. They aren’t unique.

cameldrv 2 years ago | |

It seems difficult to argue that Meta can copy every ebook in existence to train a model, but then other people cannot copy the resulting model.

zargon 2 years ago |

It's not open source, it's freeware or something like that. Weights aren't the source code of LLMs, they're the binaries.

greatpostman 2 years ago |

Meta is going to ruin OpenAI's moat on purpose. Great business strategy and good for everyone but Meta's competitors.

jonnat 2 years ago | |

Quite the opposite, this is great for Meta's competitors. Meta is not trying to get market share with this strategy, it's trying to commoditize its complements (https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/).

Content is a complement to a social network: the cheaper it is to create content, the more content is available, the easier it is to optimize a feed, the more time people spend on the platform, the higher the revenue. GenAI is just a method to drive the cost of content creation to zero.

strikelaserclaw 2 years ago | | |

Kind of a dystopian nightmare world, in which large corporations use AI to create low-cost, infinite content that humans engage with (mostly content catering to the human tendency for tribalism, prestige, sexual desires, etc.). Sounds like we are creating a world similar to the Matrix.

freedomben 2 years ago | | |

I wish I could upvote this a dozen times. This is a very insightful comment. Read the link above first if you aren't sure what "commoditize their complements" means.

herodoturtle 2 years ago | |

Reminds me of Joel Spolsky’s essay on “Commoditize your complement”:

https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/

ekojs 2 years ago |

Seems that the source is an FT article that was discussed yesterday: https://news.ycombinator.com/item?id=36712168

From the FT article: '“The goal is to diminish the current dominance of OpenAI,” said one person with knowledge of high-level strategy at Meta.'

forgingahead 2 years ago |

Zuck is a total killer. What better way to fight Google and Microsoft than to effectively spawn thousands of potential competitors to their AI businesses with this (and other) releases? There will be a mad scramble over the released weights to develop new tech, these startups will raise tons of money, and then fight the larger incumbents.

This is not charity, this is a shrewd business move.

amelius 2 years ago | |

"Commoditize your complement"

whimsicalism 2 years ago |

If you read past the title, this article is not at all clear about whether they are referring to a commercial offering (i.e., license our model for $$) or an open-source license with commercial usage (Apache, etc.).

My guess is still the latter because that's what I've heard the rumors about, but this article is pretty unclear on this fact.

brucethemoose2 2 years ago | |

Falcon 40B was released under a "free with royalties above a certain amount of profit" license, and got roasted for it. It was so bad that they changed the license to Apache.

I don't think any business would run such a "licensed" model over MPT 30B or Falcon 40B, unless it's way better than LLaMA 65B.

whimsicalism 2 years ago | | |

I think it is supposed to be better than LLaMA 65B. Plenty of businesses are paying for OAI API access.

pmarreck 2 years ago |

I have a 128 core Threadripper, a 2080 Ti and a 3080 Ti.

How can I play with open source LLMs locally?

brucethemoose2 2 years ago | |

Kobold.cpp is your best bet.

You can leverage those big CPUs while still loading both GPUs with a 65B model.

... If you are feeling extra nice, you should set that up as an AI horde worker whenever you run koboldcpp to play with models. It will run API requests for others in the background whenever it's not crunching your own requests, in return allowing you priority access to models other hosts are running: https://aihorde.net/
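
For a rough sense of why a 65B model has to be split between the two GPUs and system RAM, here is a back-of-the-envelope sketch in Python. It assumes exactly 4 bits per weight (real quantized formats carry some extra overhead), 80 equally sized transformer layers, and 11 GB and 12 GB of VRAM for the 2080 Ti and 3080 Ti. It ignores the KV cache, activations and embeddings, so real usable capacity is lower. The function names are made up for illustration and are not part of koboldcpp or llama.cpp.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def split_layers(n_layers: int, total_gb: float, vram_gb: list[float]) -> tuple[list[int], int]:
    """Greedily assign whole layers to each GPU; whatever is left stays on the CPU."""
    per_layer_gb = total_gb / n_layers  # assumes all layers are the same size
    remaining = n_layers
    on_gpu = []
    for capacity in vram_gb:
        fit = min(remaining, int(capacity // per_layer_gb))
        on_gpu.append(fit)
        remaining -= fit
    return on_gpu, remaining


total = weight_gb(65e9, 4)  # 65B parameters at an assumed 4 bits each
gpu_layers, cpu_layers = split_layers(80, total, [11.0, 12.0])
print(f"weights: {total:.1f} GB, GPU layers: {gpu_layers}, CPU layers: {cpu_layers}")
```

Under these assumptions the weights come to about 32.5 GB against 23 GB of combined VRAM, so roughly 56 of the 80 layers could sit on the GPUs and the rest would run from system RAM on the CPU.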

pmarreck 2 years ago | | |

oooh, this is a great idea

estreeper 2 years ago | |

If you're just looking to play with something locally for the first time, this is the simplest project I've found and has a simple web UI: https://github.com/cocktailpeanut/dalai

It works for 7B/13B/30B/65B LLaMA and Alpaca (fine-tuned LLaMA which definitely works better). The smaller models at least should run on pretty much any computer.

brucethemoose2 2 years ago | | |

That project seems unmaintained, which is a problem because llama.cpp is changing extremely rapidly.

Also, it has no "1 click" exe release like kobold.

freedomben 2 years ago | |

May I ask why you have such an amazing machine, and two nice graphics cards? Feel free to tell me it's none of my business, it's just very interesting to me :-)

pmarreck 2 years ago | | |

Career dev who had the cash and wanted to experiment with anything that can be done concurrently, such as in my language of choice lately, which features high concurrency (https://elixir-lang.org/), or these LLMs, or anything else that can be done in massively parallel fashion (which is, perhaps surprisingly, only a minority of possible computer work, but it still means I can run many apps without much slowdown!)

I originally had two 2080 Tis to experiment also with virtio/proxmox (you need one for the host and one for any VM you run). I never got that running successfully at the time, but then Proton got really good (I mainly just wanted to run Windows games fast in a VM, but that circumvented that). Later on I upgraded one of them to a 3080 Ti.

It's a System76 machine, they make good stuff.

loufe 2 years ago |

I'm surprised nobody here has brought up the censorship in this model. Listening to Mark Zuckerberg on Lex Fridman's podcast talk about it, it sounds like the model will be significantly blunted vs its "research" version release.

stale2002 2 years ago |

I remember arguing with people who honest to god thought that LLaMA was some sort of secret ploy, to trick startups into using it, so that Meta could sue them for using it commercially.

Well now there is a commercial release. I guess it wasn't some corporate plot after all!

Some people just can't admit when a corporation does a good thing.

(In this case, the good thing is being done to obsolete their competitors, but it is good nonetheless that a commercial LLM is available for people to use for free)

obblekk 2 years ago |

Maybe they've solved the fingerprinting problem and can identify text generated from their model, and this is a way of discovering the market they can sell more advanced models to directly. B2B leadgen...

vlovich123 2 years ago | |

I don't think so because I believe you can train AI models against other AI models. I believe you can fingerprint a family of models, but that's not going to tell whether you just used the general approach outlined in the academic papers.

sva_ 2 years ago | |

I mean you could probably just train it on some sequence s.t. the model identifies itself, would be hard to detect that

sebzim4500 2 years ago | | |

That would probably work to detect if e.g. OpenAI or Anthropic start using their weights directly. It wouldn't detect whether e.g. a blog was generated with their model or not.

0cf8612b2e1e 2 years ago |

From my quick skim I could not find a date. Any idea when this might happen?

rvz 2 years ago |

See. They don't care about the LLaMA model leak. It turns out that it is OpenAI that cares, because it ruins their moat. It costs Meta nothing to release a better open-source or freely available version of LLaMA again.

Still waiting for the 'Meta is dying' and 'Fire Mark Zuckerberg' calls from last year. A year later, where are they now?

sebzim4500 2 years ago | |

To be fair to those commenters in the past, I don't think anyone could have foreseen that Zuckerberg would turn out to be "the good guy".

TheBengaluruGuy 2 years ago |

This conversation triggers a thought.

Does it mean that any blog posts I wrote from my own insights will automatically be used to train the model… without my permission?

As an author, it feels like it’s stealing the knowledge and insight without appropriate attribution.

anaganisk 2 years ago | |

I think we are at a point where we just have to let go, unless we are Disney with an army of lawyers. Maybe it's time for a change in thinking. Having said that, attribution allows a person to trace the source; it's not a success marker anymore. If enough negative statements generated by AI get popular, that could piss off countries or people. For example, if some LLM recognizes Taiwan as an independent country, you can bet China will push for attribution to sources. We already have bills pending in multiple countries that want access to personal or encrypted messages to trace the source.

Jeff_Brown 2 years ago |

What's the monetization model here? Is this a closed-source version of their open-source model? (That's suggested by the phrase in the article, "a commercial version of LLaMA, its open-source large language model".)

sagebird 2 years ago |

repeat after me:

hardware is the only moat

If you want to live the good life before you are exquisitely extinguished, spend every other day figuring out how to buy more NVDA, the other days exercising outside, being human.

sifar 2 years ago | |

until better algorithms or a newer paradigm obviate the need for large memory/computations

bilsbie 2 years ago |

Is it possible to do further training on the weights they release?

brucethemoose2 2 years ago | |

Yes, and there is a sea of finetunes. See: https://huggingface.co/models?sort=modified&search=Ggml

QLoRA is the most cost-effective method so far. Some people also do finetuning on Google TPUs.
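
To illustrate why LoRA-style finetuning (which QLoRA builds on, keeping the base weights frozen in 4-bit and training only small adapter matrices) is cheap, here is a minimal parameter-count sketch. The 8192 x 8192 weight matrix and the rank of 8 are illustrative assumptions, not settings taken from any particular finetune, and `lora_params` is a hypothetical helper, not a library function.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the full weight matrix that stays frozen."""
    return d_out * d_in


d, rank = 8192, 8  # assumed layer width and adapter rank
adapter = lora_params(d, d, rank)
full = full_params(d, d)
print(f"adapter: {adapter:,} trainable vs full matrix: {full:,} ({adapter / full:.3%})")
```

With these numbers the adapter trains about 0.2% as many parameters as the matrix it modifies, which is where most of the memory savings for gradients and optimizer state come from.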

40yearoldman 2 years ago |

Is the title an oxymoron?

Open-source commercial?

RobotToaster 2 years ago | |

If anything it's a tautology, open source by definition allows commercial use.

satvikpendem 2 years ago | |

No, you can sell open source software commercially. That being said, I'm wondering if the license will truly be open source or more like Stable Diffusion's license which is not really open source.

schleck8 2 years ago | | |

Because deep learning weights aren't source code.

https://huggingface.co/blog/open_rail

isaacremuant 2 years ago | |

I think you could've googled that one and found years of discussion on it.

Free as in beer vs. free as in speech and the whole thing.

gpm 2 years ago | |

Commercial presumably as opposed to non-commercial licensing (e.g. the CC BY-NC license, or the weird situation LLaMA is in).

If you listen to the definition the Open Source Initiative would have applied to the term open source had they succeeded in acquiring rights to the term, then commercial is redundant with open source, not the opposite of it.