Open source AI is the path forward (about.fb.com)
Llama 3 Training System
Total: 19.2 exaFLOPS
                     |
         +-----------+-----------+
         |                       |
     Cluster 1               Cluster 2
   9.6 exaFLOPS            9.6 exaFLOPS
         |                       |
   +-----+-----+           +-----+-----+
   |           |           |           |
12K GPUs    12K GPUs    12K GPUs    12K GPUs
   |           |           |           |
 [####]      [####]      [####]      [####]
  400+        400+        400+        400+
TFLOPS/GPU  TFLOPS/GPU  TFLOPS/GPU  TFLOPS/GPU
If he really wants to replicate Linux's success against proprietary Unices, he needs to release Llama with some kind of GPL equivalent that forces everyone to play the open source game.
They provide their model, with weights and code, as "source available", and it looks like they allow commercial use until a cap of 700M monthly active users is surpassed. They also don't allow you to train other AI models with their model:
""" ... v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Meta Llama 3 or derivative works thereof). ... """
There’s a legal precedent that says hard work alone isn’t enough to guarantee copyright, i.e. it doesn’t matter that it took millions of dollars to train.
Has anyone tried that?
I hate how the moment it's too late will be, by design, closed doors.
This is Meta (LLaMA, which has had available weights for a while), not OpenAI (GPT).
> This is one reason several closed providers consistently lobby governments against open source.
Is this substantially true? I've noticed a tendency of those who support the general arguments in this post to conflate the beliefs of people concerned about AI existential risk, some of whom work at the leading AI labs, with the position of the labs themselves. In most cases I've seen, the AI labs (especially OpenAI) have lobbied against any additional regulation on AI, including with SB1047[1] and the EU AI Act[2]. Can anyone provide an example of this in the context of actual legislation?
> On this front, open source should be significantly safer since the systems are more transparent and can be widely scrutinized. Historically, open source software has been more secure for this reason.
This might be true if we could actually understand what was happening in neural networks, or train them to consistently avoid unwanted behaviors. As things are, the public weights are simply inscrutable black boxes, and the existence of jailbreaks and other strange LLM behaviors shows that we don't understand how our training processes create models' emergent behaviors. The capabilities of these models and their influence are growing faster than our understanding of them and our ability to steer them to behave precisely how we want, and that will only get harder as the models get more powerful.
> At this point, the balance of power will be critical to AI safety. I think it will be better to live in a world where AI is widely deployed so that larger actors can check the power of smaller bad actors.
This paragraph ignores the concept of offense/defense balance. It's much easier to cause a pandemic than to stop one, and cyberattacks, while not as bad as pandemics, seem to also favor the attacker (this one is contingent on how much AI tools can improve our ability to write secure code). At the extreme, it would clearly be bad if everyone had access to an anti-matter weapon large enough to destroy the Earth; at some level of capability, we have to limit the commands an advanced AI will follow from an arbitrary person.
That said, I'm unsure if limiting public weights at this time would be good regulation. They do seem to have some benefits in increasing research around alignment/interpretability, and I don't know if I buy the argument that public weights are significantly more dangerous from a "misaligned ASI" perspective than many competing closed companies. I also don't buy the view of some in the leading labs that we'll likely have "human level" systems by the end of the decade; it seems possible but unlikely. But I worry that Zuckerberg's vision of the future does not adequately guard against downside risks, and is not compatible with the way the technology will actually develop.
[1] https://thebulletin.org/2024/06/california-ai-bill-becomes-a...
Jokes aside ~ 405b x 2 bytes of memory (FP16), so say 810 gigs, maybe 1000 gigs or so required in reality, need maybe 2 aws p5 instances?
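The back-of-the-envelope sizing above can be checked with a short Python sketch. The 1000 GB working figure comes from the comment; the 640 GB of accelerator memory per node (8 accelerators of 80 GB each) is an assumption used for illustration, not something stated in the thread.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Size of the raw weights in decimal gigabytes (no activations or KV cache)."""
    return num_params * bytes_per_param / 1e9

def nodes_needed(required_gb: float, memory_per_node_gb: float) -> int:
    """Smallest number of nodes whose combined memory covers the requirement."""
    return math.ceil(required_gb / memory_per_node_gb)

PARAMS = 405e9        # 405B parameters
FP16_BYTES = 2        # two bytes per parameter
NODE_GB = 8 * 80      # assumed: 8 accelerators with 80 GB each per node
WORKING_GB = 1000     # the comment's rough "in reality" figure

print(weight_memory_gb(PARAMS, FP16_BYTES))   # weights alone
print(nodes_needed(WORKING_GB, NODE_GB))      # nodes for the padded estimate
```

Under these assumptions the weights alone come to 810 GB, and the padded 1000 GB estimate needs two such nodes, which matches the guess in the comment.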
> Third, a key difference between Meta and closed model providers is that selling access to AI models isn’t our business model. That means openly releasing Llama doesn’t undercut our revenue, sustainability, or ability to invest in research like it does for closed providers. (This is one reason several closed providers consistently lobby governments against open source.)
Maybe this is a strategic play to hurt other AI companies that depend on this business model?
Private repos are not being reproduced by any modern AI. Their source code is safe, although AI arguably lowers the bar to compete with them.
Having run many red teams recently as I build out promptfoo's red teaming featureset [0], I've noticed the Llama models punch above their weight in terms of accuracy when it comes to safety. People hate excessive guardrails and Llama seems to thread the needle.
Very bullish on open source.
Does anyone have details on exactly what this means or where/how this metric gets derived?
We mostly don’t all want or need the hardware to run these AIs ourselves, all the time. But, when we do, we need lots of it for a little while.
This is what Holochain was born to do. We can rent massive capacity when we need it, or earn money renting ours when we don’t.
All running cryptographically trusted software at Internet scale, without the knowledge or authorization of commercial or government “do-gooders”.
Exciting times!
Still huge props to them for doing what they do.
Mostly unrelated to the correctness of the article, but this feels like a bad argument. AFAIK, Anthropic/OpenAI/Google are not having issues with their weights being leaked (are they?). Why is it that Meta's model weights are?
It seems safe to assume that not all the companies doing leading-edge LLM’s have good security and that the industry as a whole isn’t set up to keep secrets for long. Things aren’t locked down to the level of classified research. And it sounds like Zuckerberg doesn’t want to play the game that way.
At the state level, China has independent AI research efforts and they’re going to figure it out. It’s largely a matter of timing, which could matter a lot.
There’s still an argument to be made against making proliferation too easy. Just because states have powerful weapons doesn’t mean you want them in the hands of people on the street.
The main threat actors there would be powerful nation-states, in which case they'd be unlikely to leak what they've taken.
It is a bad argument though, because one day possession of AI models (and associated resources) might confer great and dangerous power, and we can't just throw up our hands and say "welp, no point trying to protect this, might as well let everyone have it". I don't think that'll happen anytime soon, but I am personally somewhat in the AI doomer camp.
Llama 3.1 Official Launch
By giving away higher and higher quality models, they undermine the potential return on investment for startups who seek money to train their own. Thus investment in foundation model building stops and they control the ecosystem.
- Open training data (this is very big)
- Open training algorithms (does it include infrastructure code?)
- Open weights (result of previous two)
- Open runtime algorithm
Can you imagine the disinformation they could spread with those? With enough of them you could have a massively global site made entirely for spreading it. God what if such a thing got into the hands of an egocentric billionaire?
- We need to control our own destiny and not get locked into a closed vendor.
- We need to protect our data.
- We want to invest in the ecosystem that’s going to be the standard for the long term.
Thank you Meta for being the bright light of ethical guidance for us all.
We don't get the data or training code. The small runtime framework is open source but that's of little use as it's largely fixed in implementation due to the weights. Yes we can fine tune but that is akin to modifying video games - we can do it but there's only so much you can do within reasonable effort and no one would call most video games 'open source'*.
It's freeware, and Meta's strategy is much more akin to the strategy Microsoft used with Internet Explorer to capture the web browser market. No one was saying god bless Microsoft for trying to capture the browser market with I.E. Nothing wrong with Meta's strategy, just don't call it open source.
*weights are data and so is the video/audio output of a video game. If we gave away that video game output for free we wouldn't call the video game open source as the myriad freeware games essentially do.
Meta provides open source code to modify the weights (fine tune the model). In this context, fine-tuning the model is better compared to being able to modify the code of the game.
Can't wait to see how the landscape will look in 2027 and beyond.
The actual problem is running these models. Very few companies can afford the hardware to run these models privately. If you run them in the cloud, then I don't see any potential financial gain for any company to fine-tune these huge models just to catch up with OpenAI or Anthropic, when you can probably get a much better deal by fine-tuning the closed-source models.
Also this point:
> We need to protect our data. Many organizations handle sensitive data that they need to secure and can’t send to closed models over cloud APIs.
First, it's ironic that Meta is talking about privacy. Second, most companies will run these models in the cloud anyway. You can run OpenAI via Azure Enterprise and Anthropic on AWS Bedrock.
I can run Llama 3 70B on my (64GB RAM M2) laptop. I haven't tried 3.1 yet but I expect to be able to run that 70B model too.
As for the 405B model, the Llama 3.1 announcement says:
> To support large-scale production inference for a model at the scale of the 405B, we quantized our models from 16-bit (BF16) to 8-bit (FP8) numerics, effectively lowering the compute requirements needed and allowing the model to run within a single server node.
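A rough Python sketch of why precision matters for both claims above. It only counts the weights, ignoring activations and KV cache. The 640 GB node capacity (8 accelerators of 80 GB each) and the 4-bit quantization for the laptop case are assumptions for illustration; neither is stated in the thread.

```python
def weights_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

def fits(num_params: float, bits_per_param: int, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weights_gb(num_params, bits_per_param) <= capacity_gb

NODE_GB = 8 * 80   # assumed: one server node with 8 accelerators of 80 GB each
LAPTOP_GB = 64     # the laptop RAM mentioned in the comment

print(fits(405e9, 16, NODE_GB))    # 405B in BF16
print(fits(405e9, 8, NODE_GB))     # 405B in FP8
print(fits(70e9, 16, LAPTOP_GB))   # 70B in 16-bit
print(fits(70e9, 4, LAPTOP_GB))    # 70B at an assumed 4-bit quantization
```

With these numbers the 405B weights drop from 810 GB to 405 GB when going from BF16 to FP8, which is what lets them fit in one assumed 640 GB node, and a 70B model only fits in 64 GB of RAM once it is quantized well below 16 bits.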
Only the big players can afford to push go, and FB would love to see OpenAI’s code so they can point it to their proprietary user data.
So about all the bots and sock puppets on social media..
Claude is supposed to be better, but it is also even more locked down than ChatGPT.
Word will let me write a manifesto for a new Nazi party, but Claude is so locked down that it won't find a cartoon in a picture and Gemini... well.
If AIs are not to harm society, they need to enable us to think in new ways.
And you can't even try it without an FB/IG account.
Zuck will never change.
Why do people keep mislabeling this as Open Source? The whole point of calling something Open Source is that the "magic sauce" of how to build something is publicly available, so I could build it myself if I have the means. But without the training data publicly available, could I train Llama 3.1 if I had the means? No wonder Zuckerberg doesn't start with defining what Open Source actually means, as then the blogpost would have lost all meaning from the get go.
Just call it "Open Model" or something. As it stands right now, the meaning of Open Source is being diluted by all these companies pretending to do one thing, while actually doing something else.
I initially got very excited seeing the title and the domain, but hopelessly sad after reading through the article and realizing they're still trying to pass their artifacts off as Open Source projects.
I don't think not releasing the commit history of a project makes it not Open Source, this seems like that to me. What's important is you can download it, run it, modify it, and re-release it. Being able to see how the sausage was made would be interesting, but I don't think Meta have to show their training data any more than they are obligated to release their planning meeting notes for React development.
Edit: I think the restrictions in the license itself are good cause for saying it shouldn't be called Open Source, fwiw.
Right, I'm not talking about the commit history, but rather that anyone (with means) should be able to produce the final artifact themselves, if they want. For weights like this, that requires at least the training script + the training data. Without that, it's very misleading to call the project Open Source, when only the result of the training is released.
> What's important is you can download it, run it, modify it, and re-release it
But I literally cannot download the project, build it and run it myself? I can only use the binaries (weights) provided by Meta. No one can modify how the artifact is produced, only modify the already produced artifact.
That's like saying that Slack is Open Source because if I want to, I could patch the binary with a hex editor and add/remove things as I see fit? No one believes Slack should be called Open Source for that.
If you want to train on top of Llama there's absolutely nothing stopping you. Plenty of open source tools to do parameter optimization.
> is way less valuable than the weights for the vast majority of people
The same is true for most Open Source projects, most people use the distributed binaries or other artifacts from the projects, and couldn't care less about the code itself. But that doesn't warrant us changing the meaning of Open Source just because companies feel like it's free PR.
> If you want to train on top of Llama there's absolutely nothing stopping you.
Sure, but in order for the intent of Open Source to be true for Llama, I should be able to build this project from scratch. Say I have a farm of 100 A100's, could I reproduce the Llama model from scratch today?
If that included, e.g. reading all of Github for code, I wouldn't expect them to host an entire separate read-only copy of Github because they trained on it and say "this is part of our open source model"
Open model weights are still commendable, but it's a far cry from open-source (or even libre) software!
They could release 50% of their best data but that would only stop them from attracting the best talent.
(Disclaimer: I work for an IBM subsidiary but not on any of these products)
I guess this is a rhetorical question, but this is a press release from Meta itself. It's just a marketing ploy, of course.
This is hard to disagree with.
If Zuckerberg had his way, mobile device OSes would let Meta ingest microphone and GPS data 24/7 (just like much of the general public already thinks they do because of the effectiveness of the other sorts of tracking they are able to do).
There are certainly legit innovations that haven't shipped because gatekeepers don't allow them. But there've been lots of harmful "innovations" blocked, too.
Not that anyone would go buy 100,000 H100s to train their own Llama, but words matter. Definitions matter.
The far more important distinction is "open" versus "not open", and I disagree that we should cede that distinction while trying to fight for "source". The Llama license is restrictive in a number of ways (it incorporates an entire acceptable use policy) that make it most definitely not "open" in the customary sense.
The acceptable use policy seems fine. Don't use it to break the law, solicit sex, kill people, or lie.
If the training data was openly available, even if you can't afford to retrain a new version, a competitor like Amazon could do it for you
I don't fully agree.
Isn't that like saying *nix being open source is worthless unless you're planning to ship your own Linux distro?
Knowing how the sausage is made is important if you're an animal rights activist.
They're more like obfuscated binaries. When it comes to fine-tuning only however things shift a little bit, yes.
AI2’s OLMo is an example of what open source actually looks like for LLMs:
https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
The Llama license has a lot of restrictions, based on user base size, type of use, etc.
For example you're not allowed to use Llama to train or improve other models.
But it goes much further than that. The government of India can't use Llama because they're too large. Sex workers are not allowed to use Llama due to the acceptable use policy of the license. Then there is also the vague language prohibiting discrimination, racism, etc. Good luck getting something like that approved by your legal team.
You cannot produce the final artifact with the training script + data. Meta also cannot reproduce the current weights with the training script + data. You could produce some other set of weights that are just about as good, but it's not a deterministic process like compiling source code.
> That's like saying that Slack is Open Source because if I want to, I could patch the binary with a hex editor and add/remove things as I see fit? No one believes Slack should be called Open Source for that.
This analogy doesn't work because it's not like Meta can "patch" Llama any more than you can. They can only finetune it like everyone else, or produce an entirely different LLM by training from scratch like everyone else.
The right to release your changes is another difference; if you patch Slack with a hex editor to do some useful thing, you're not allowed to release that changed Slack to others.
If Slack lost their source code, went out of business, and released a decompiled version of the built product into the public domain, that would in some sense be "open source," even if not as good as something like Linux. LLMs though do not have a source code-like representation that is easily and deterministically modifiable like that, no matter who the owner is or what the license is.
If you built a business on Llama 3.1, you're not going to suddenly go down in flames because you can't upgrade to Llama 4.
Even saying you really needed to upgrade, Llama 4 would be a new model that you'd have to adapt your prompts for anyway, you can't just version bump and call it good. If you're going to update prompts anyway, at that point you can just switch to any other competitor model. Updating models isn't urgent, you have time to do it slowly and right.
> If the training data was openly available, even if you can't afford to retrain a new version, a competitor like Amazon could do it for you
If Llama 4 changed the license then presumably you wouldn't have access to its training data even if you did have access to Llama 3.1's. So now you have access to Llama 3.1's training data... now what? You want to recreate the Llama 3.1 weights in response to the Llama 4 release?
"we're open source, you can use it for anything you can imagine. But you can't use it for these specific things."
Then there's the added rub of the source not really being source code, but a CSV file.
That's fine. If you want to set that expectation, great! But don't call it open source.
People do typically modify model weights. They are the preferred form for modifying the model.
Saying “build” llama is just a nonsense comparison to traditional compiled software. “Building llama” is more akin to taking the raw weights as text and putting them into a nice pickle file. Or loading it into an inference engine.
Demanding that you have everything needed to recreate the weights from scratch is like arguing an application cannot be open source unless it also includes the user testing history and design documents.
And of course some idiots don’t understand what a pickled weights file is and claim it’s as useless as a distributed binary if you want to modify the program just because it is technically compiled; not understanding that the point of the pickled file is “convenience” and that it unpacks back to the original form. Like arguing open source software can’t be distributed in zip files.
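The claim that a pickled weights file is a convenience wrapper that unpacks back to the original form can be illustrated with a toy Python sketch. The dictionary below is a hypothetical stand-in for a checkpoint, with made-up parameter names and values; it is not Llama's actual checkpoint format.

```python
import pickle

# Toy stand-in for a checkpoint: parameter names mapped to plain lists of floats.
weights = {"layer0.weight": [0.1, -0.2, 0.3], "layer0.bias": [0.0]}

blob = pickle.dumps(weights)     # serialize the structure to bytes
restored = pickle.loads(blob)    # unpack it back into Python objects

# The round trip gives back an equal but independent copy of the structure.
assert restored == weights

# The restored parameters can be edited directly, like any other data.
restored["layer0.bias"][0] = 0.5
print(restored["layer0.bias"])
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.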
> Say I have a farm of 100 A100's, could I reproduce the Llama model from scratch today?
Say you have a piece of paper. Can you reproduce `print("hello world")` from scratch?
If we insist upon the release of training data with Open models, you might as well kiss the idea of usable Open LLMs goodbye. Most of the content in training datasets like The Pile is not licensed for redistribution in any way, shape or form. It would jeopardize projects that do use transparent training data while not offering anything of value to the community compared to the training code. Republishing all training data is an absolute trap.
But distributing the weights is a "form" of distribution. You can recover many items of the dataset (most easily, the outliers) by using the weights.
Just because they are codified in a non-readily accessible way, does not mean that you are not distributing them.
It's scary to think that "training" is becoming a thinly veiled way to strip copyright of works.
> does not mean that you are not distributing them.
Except you literally aren't distributing them. It's like accusing me of pirating a movie because I sent a screenshot or a scene description to my friend.
> It's scary to think that "training" is becoming a thinly veiled way to strip copyright of works.
This is the way it's been for years. Google is given Fair Use for redistributing incomplete parts of copyrighted text materials verbatim, since their application is transformative: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Or Corellium, who won their case to use copyrighted Apple code in novel and transformative ways: https://www.forbes.com/sites/thomasbrewster/2023/12/14/apple...
Copyright has always been a limited power.
Does FB even have the capability to do that? I'd assume there's a bunch of data that's not theirs and they can't even release it. Let alone some data that they might not want to admit is in the source.
Also, that doesn't matter in this discussion - if you are unable to release the source under appropriate licence (for whatever reason), you should not call it Open Source.
https://raw.githubusercontent.com/meta-llama/llama-models/ma...
> 2. Additional Commercial Terms. If, on the Llama 3.1 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
on a side note OpenAI is losing users on its own. It doesn't need meta to put it out of business.
The definition of free software (and open source, for that matter) is well-established. The same definition applies to all programs, whether they are "AI" or not. In any case, if a program was built by training against a dataset, the whole dataset is part of the source code.
Llama is distributed in binary form, and it was built based on a secret dataset. Referring to it as "open source" is not ignorance, it's malice.
If that is the case then the weights must inherit all these copyrights. It has been shown (at least in image processing) that you can extract many training images from the weights, almost verbatim. Hiding the training data does not solve this issue.
But regardless of copyright issues, people here are complaining about the malicious use of the term "open source", to signify a completely different thing (more like "open api").
I'm not sure why I keep seeing this. What is the equivalent of the training data for something like the Linux kernel?
It's the source code.
For the linux kernel:
compile(sourcecode) = binary
For llama:
train(data) = weights
1. Meta pushed engineering wages higher across the industry.
2. They promote high performing engineers very quickly. There are engineers making 7 figures there with just a few years experience.
3. They have open sourced the most important frameworks: React and PyTorch
This company is a guiding light forcing the hand of other large corporations. Mark Zuckerberg is a hero, and has done a fantastic job
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
> Some people argue that we must close our models to prevent China from gaining access to them, but my view is that this will not work and will only disadvantage the US and its allies. Our adversaries are great at espionage, stealing models that fit on a thumb drive is relatively easy, and most tech companies are far from operating in a way that would make this more difficult. It seems most likely that a world of only closed models results in a small number of big companies plus our geopolitical adversaries having access to leading models, while startups, universities, and small businesses miss out on opportunities.
I don't see open source being able to compete with the cutting-edge proprietary models. There's just not enough money. GPT-5 will take an estimated $1.2 billion to train. MS and OpenAI are already talking about building a $100 billion training data center.
How can you compete with that if your plan is to give away the training result for free?
HSBC estimates the training cost for GPT-5 between $1.7B and $2.5B.
Vlad Bastion Research estimates $1.25B - 2.25B.
Some people on HN estimate $10B:
Because they sold the resultant code and systems built on it for money... this is the gold miner saying that all shovels and jeans should be free.
Am I happy Facebook open sources some of their code? Sure, I think it's good for everyone. Do I think they're talking out of both sides of their mouth? Absolutely.
Let me know when Facebook opens up the entirety of their Ad and Tracking platforms and we can start talking about how it's silly for companies to keep software closed.
I can say with 100% confidence if Facebook were selling their AI advances instead of selling the output it produces, they wouldn't be advocating for everyone else to open source their stacks.
You're acting as if commoditizing one's complements is either new or reprehensible [1].
I'm acting as if calling on other companies to open source their core product, just because it's a complement for you, and acting as if it's for the benefit of mankind is disingenuous, which it is.
At the end, it's actually Facebook doing the right thing (though they are known for being evil).
It's a bit of an irony.
The supposedly "good" and "open" people like Google or OpenAI, haven't given their model weights.
A bit like Microsoft became the company that actually supports the whole open-source ecosystem with GitHub.
It's absolutely not useless for developers looking to build a competing project.
>The supposedly "good" and "open" people like Google or OpenAI, haven't given their model weights.
Because they're monetizing it... the only reason Facebook is giving it away is because it's a complement to their core product of selling ads. If they were monetizing it, it would be closed source. Just like their Ads platform...
* You can't use them for any purpose. For example, the license prohibits using these models to train other models.
* You can't meaningfully modify them given there is almost no information available about the training data, how they were trained, or how the training data was processed.
As such, the model itself is not available under an open source license and the AI does not comply with the "open source AI" definition by OSI.
It's an utter disgrace for Meta to write such a blogpost patting themselves on the back while lying about how open these models are.
> If you use the Llama Materials or any outputs or results of the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name.
Regardless, the license [1] still has many restrictions, such as the acceptable use policy [2].
[1] https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/mai...
I was under the impression that you could still fine-tune the models or apply your own RLHF on top of them. My understanding is that the training data would mostly be useful for training the model yourself from scratch (possibly after modifying the training data), which would be extremely expensive and out of reach for most people
This is why Silo AI, for example, had to start from scratch to get better support for small European languages.
The weights are the result of the development process, like the source code of a program is the result of a development process.
But it does benefit mankind.
More free tech products is good for the world.
This is a good thing. When people or companies do good things, they should get the credit for doing good things.
Is it bad for mankind that Meta publishes its weights? Mutually beneficial is a valid game state--there is no moral law that requires anything good be made as a sacrifice.
Not if it's a fair use (which is obviously the defence they're hoping for)
Dead internet theory is very much happening in real time, and I dread what's about to come since the world has collectively decided to lose their minds with this AI crap. And people on this site are unironically excited about this garbage that is indistinguishable from spam getting more and more popular. What a fucking joke
Plus I didn't want to risk my employer's Facebook App being in limbo if I got banned, so I left Facebook alone, never to return.
Facebook trying to police the world is the only thing keeping me away, if I can use the platform and post meme comments again, maybe I might reconsider, but I doubt it. Reddit is in a similar boat. You can get banned, but all the creepy pedophile comments from decades and recently are still up no problem.
IF I ever had to go to FB for anything, I'd probably install a wall-removing browser extension. Mobile app is of course out of question.
OPs comments read like we're describing something the SS built (Godwin says hi).
What are the alternatives for local groups? I've recently seen an increase in the amount of Discourse forums available, which is nice, but I don't think it'd be very appealing to the average cycling or hiking group.
I'm excited about the former since AI has massively improved my productivity as a programmer to a point where I can't imagine going back. Everything is not black or white and people can be excited about one part of something and hate another at the same time.
If this is what productivity looks like then I'm proud to be unproductive.
… to be subsequently drowned out by AI “copies” of themselves, which in turn are used to train more AIs, until we don't have a Dead Internet¹ but a Habsburg Internet.
--
There is still plenty high quality stuff too if that is what you’re looking for. If you want to roll with the pigs in the shit, who am I to tell you no?
* * *
On "collective losing of minds", you might appreciate this quote from 1841 (!) by Charles MacKay. I quoted it in the past[1] here, but it is worth re-posting:
"In reading the history of nations, we find that, like individuals, they have their whims and their peculiarities; their seasons of excitement and recklessness, when they care not what they do. We find that whole communities suddenly fix their minds upon one object, and go mad in its pursuit; that millions of people become simultaneously impressed with one delusion, and run after it, till their attention is caught by some new folly more captivating than the first [...]
"Men, it has been well said, think in herds; it will be seen that they go mad in herds, while they only recover their senses slowly, and one by one."
— from MacKay's book, 'Extraordinary Popular Delusions and the Madness of Crowds'
Personally I have tons of creative ideas which I think would be interesting and engaging but for which I lack the resources to bring into this world, so I'm hoping that in the long term AI tools can help bridge this gap. I'm really hopeful for a world where people from all over the world can share their creative stories, rather than being mostly limited to a few rich people in Hollywood.
Unfortunately I do expect this to end up being the minority of content, especially as we continue being flooded by increasing amounts of trash. But maybe that's just opening up the opportunity for someone to develop new content curation tools. If anything, even before the rise of AI stuff there were mountains of content, and we saw with the rise of TikTok that a good recommendation algorithm still leaves room for new platforms.
The content you're enjoying today still exists, but it's a needle in a haystack of AI spam
Maybe they’ll go outside.
The AI model complements the platform, and their platform is the money maker. They hold the belief that open sourcing their tools benefits their platform in the long run, which is why they're doing it. And in doing so, they aren't under the control of any competitors.
I would say it's more like a grocery store providing free parking, a bus stop, self-checkout, online menu, and free delivery.
Not the usual nation-state rhetoric, but something that justifies that closed source leads to better user-experience and fewer security and privacy issues.
An ecosystem that benefits vendors, customers, and the makers of closed source?
Are there historical analogies other than Microsoft Windows or Apple iPhone / iOS?
But they still have 70 thousand people (a small country) doing _something_. What are they doing? Updating Facebook UI? Not really, the UI hasn't been updated, and you don't need 70 thousand people to do that. Stuff like React and Llama? Good, I guess, we'll see how they make use of Llama in a couple of years. Spellcheck for posts maybe?
This is a very important concern in Health Care because of HIPAA compliance. You can't just send your data over the wire to someone's proprietary API. You would at least need to de-identify your data. This can be a tricky task, especially with unstructured text.
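To illustrate why de-identifying unstructured text is tricky, here is a minimal sketch of naive pattern-based redaction. The patterns, labels, and sample note are all made up for illustration, and this is nowhere near a compliant de-identification pipeline (the HIPAA Safe Harbor method lists 18 identifier types, and this covers a handful of easy ones).

```python
import re

# Hypothetical, deliberately simple patterns for a few easy identifier types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up clinical note, not real data.
note = (
    "Pt John Smith, DOB 3/14/1962, called from 555-123-4567; "
    "email jsmith@example.com. Seen by Dr. Lee."
)
print(redact(note))
```

The structured identifiers get masked, but the patient and clinician names pass straight through, because free-text names have no fixed shape a regex can catch. That gap is the tricky part, and it is why people reach for NER models or human review before sending anything to a third-party API.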
---
Some observations:
* The model is much better at trajectory correcting and putting out a chain of tangential thoughts than other frontier models like Sonnet or GPT-4o. Usually, these models are limited to outputting "one thought", no matter how verbose that thought might be.
* I remember in Dec of 2022 telling famous "tier 1" VCs that frontier models would eventually be like databases: extremely hard to build, but the best ones will eventually be open and win as it's too important to too many large players. I remember the confidence in their ridicule at the time but it seems increasingly more likely that this will be true.
Okay then Mark. Replace "modern AI models" with "social media" and repeat this statement with a straight face.
It's a bit buggy but it is fun.
Disclaimer: I am the author of L2E
On a more serious note, I don't really buy his arguments about safety. First, widespread AI does not reduce unintentional harm but increases it, because the accident rate compounds. Second, the chance of success for threat actors will increase, because of the asymmetric advantage of gaining access to all open information while hiding their own. But there is no going back at this point; I'll enjoy it while it lasts. AGI will come sooner or later anyway.
Meta announced they have 25 providers ready on day 1, so no it's not all AWS.
1. Software: this is all Pytorch/HF, so completely open-source. This is total parity between what corporates have and what the public has.
2. Model weights: Meta and a few other orgs release open models - as opposed to OpenAI's closed models. So, ok, we have something to work with.
3. Data: to actually do anything useful you need tons of data. This is beyond the reach of the ordinary man, setting aside the legality issues.
4. Hardware: GPUs, which are extremely expensive. Not just that, even if you have the top dollars, you have to go stand in a queue and wait for O(months), since mega-corporates have gotten there before you.
For Inference, you need 1,2 and 4. For training (or fine-tuning), you need all of these. With newer and larger models like the latest Llama, 4 is truly beyond the reach of ordinary entities.
This is NOTHING like open-source, where a random guy can edit/recompile/deploy software on a commodity computer. Wrt LLMs, Data/Hardware are in the equation, and the playing field is completely stacked. This thread has a bunch of people discussing nuances of 1 and 2, but this bike-shedding only hides the basic point: Control of LLMs is for mega-corps, not for individuals.
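To put rough numbers on the hardware point, here is a back-of-envelope sketch of the memory needed just to hold the weights of a 405B-parameter model (the size mentioned elsewhere in this thread). The 80 GB per-device figure is an assumption, and the estimate ignores KV cache, activations, and (for training) gradients and optimizer state, so real requirements are higher.

```python
import math


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_devices(n_params: float, bytes_per_param: float, device_gb: float) -> int:
    """Smallest number of devices whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / device_gb)


N_PARAMS = 405e9   # parameter count of the largest model discussed here
DEVICE_GB = 80.0   # ASSUMPTION: memory per accelerator
PRECISIONS = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}  # bytes per parameter

for name, bytes_per_param in PRECISIONS.items():
    gb = weight_memory_gb(N_PARAMS, bytes_per_param)
    n = min_devices(N_PARAMS, bytes_per_param, DEVICE_GB)
    print(f"{name}: {gb:.1f} GB of weights, at least {n} x {DEVICE_GB:.0f} GB devices")
```

At 16-bit precision the weights alone come to 810 GB, which is why inference on the largest model is a multi-GPU job before you even consider training.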
Open-source code in the past was fantastic because the West had a monopoly on CPUs and computers. Sharing and contributing was amazing, and it was still ensured that tyrants couldn't use this tech to harm people, simply because they didn't have the hardware to run it.
But now, things are different. China is advancing in chip technology, and Russia is using open-source AI to harm people at scale today, with auto-targeting drones being just the start. Red Sea conflict etc.
And somehow, Zuckerberg keeps finding ways to mess up people's lives, despite having the best intentions.
Right now you can build a semi-autonomous drone with AI to kill people for ~$500-700. The western world will still use safe and secure commercial models, while a new axis of evil will use models based on Meta's or any other open source release to do whatever harm they can imagine, with not a hint of control.
This particular model. Fine-tune it to develop a nuclear bomb using all the research a government can gather at scale. Killing drone swarms, etc. Once the knowledge is public, these models can serve as a base for giving expert-level knowledge to anyone who wants it, uncensored. Especially if you are a government that wants to destroy the peaceful order for whatever reason.
Open weights (and open inference code) is NOT open source, but just some weak open washing marketing.
The model that comes closest to being TRULY open is AI2’s OLMo. See their blog post on their approach:
https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
I think the only thing they’re not open about is how they’ve curated/censored their “Dolma” training data set, as I don’t think they explicitly share each decision made or the original uncensored dataset:
https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-co...
By the way, OSI is working on defining open source for AI. They post weekly updates to their blog. Example:
https://opensource.org/blog/open-source-ai-definition-weekly...
I was thinking today about Musk, Zuckerberg and Altman. Each claims that the next version of their big LLMs will be the best.
For some reason it reminded me of one apocryphal cause of WW1, which was that the kings of Europe were locked in a kind of ego driven contest. It made me think about the Nation State as a technology. In some sense, the kings were employing the new technology which was clearly going to be the basis for the future political order. And they were pitting their own implementation of this new technology against the other kings.
I feel we are seeing a similar clash of kings playing out. The claims that this is all just business or some larger claim about the good of humanity seem secondary to the ego stakes of the major players. And when it was about who built the biggest rocket, it felt less dangerous.
It breaks my heart just a little bit. I feel sympathy in some sense for the AIs we will create, especially if they do reach the level of AGI. As another tortured analogy, it is like a bunch of competitive parents forcing their children into adversarial relationships to satisfy the parent's ego.
however, the "open-source" narrative is being pushed a bit too much like descriptive ML models were called "AI", or applied statistics "data science". with reinforced examples such as this, we start to lose the original meaning of the term.
the current approach of startups or small players "open-sourcing" their platforms and tools as a means to promote network effect works but is harmful in the long run.
you will find examples of terraform and red hat happening, and a very segmented market. if you want the true spirit of open-source, there must be a way to replicate the weights through access to training data and code. whether one could afford millions of GPU hours or not, real innovation would come from remixing the internals, and not just fine-tuning existing stuff.
i understand that this is not realistically going to ever happen, but don't perform deceptive marketing at the same time.
*I reserve the right to remove this praise if they abuse this open source model position in the future.
With the new model, I am seeing a lot about how open source they are and how they can be built upon. Is it now completely open source, or similar to their last models?
Gradient descent works on these models just like the prior ones.
What people are complaining about (totally unreasonably in my view) is obviously Meta is not "open sourcing" all the training data, so nobody can retrain the model from scratch themselves. This argument to me is just silly. The whole point of these models is they distil pretraining on massive data sets you wouldn't have access to otherwise. If you insist on them releasing the data set, they will have to cut it down to 0.1% of the size and you will be getting what you had access to already in the first place.
Without those you're locked in to them in terms of licensing of future versions.
My impression is that AI, if done correctly, will be the new way to build APIs with large data sets and information. It can't write code unless you want to dump billions of dollars into a solution with millions of dollars of operational costs. As it stands, it loses context too quickly to do advanced human tasks. BUT this is where it is great at assembling data and information. You know what is great at assembling data and information? APIs.
Think of it this way: if we can make it faster and it trains on a data lake for a company, it could be used to return information faster than a nested micro-service architecture that is just a spiderweb of dependencies.
Because AI loses context simple API requests could actually be more efficient.
Also, are there any "IP" rights attached at all to a bunch of numbers coming out of a formula that someone else calculated for you? (edit: after all, a "model" is just a matrix of numbers coming out of running a training algorithm that is not owned by Meta over training data that is not owned by Meta.)
Meta imposes a notification duty AND a request for another license (no mention of the details of these) for applications of their model with a large number of users. This is against the spirit of open source. (In practical terms it is not a show stopper since you can easily switch models, although they all have subtly different behaviours and quality levels.)
https://www.vox.com/future-perfect/24151437/ai-israel-gaza-w...
https://www.972mag.com/mass-assassination-factory-israel-cal...
https://www.theguardian.com/world/2024/apr/03/israel-gaza-ai...
Fine-tuning, updating, and running the model without very deep domain knowledge: that's what we get as an outcome.
If you are a software engineer and you steal a model in some closed format from OpenAI, you will not get much benefit even if you understand the format of that model; it's a complex beast by all means.
This is a playbook for how anyone can run it.
So yeah, big corp is evil from one side, but oh well, think of North Korea, Russia etc. level of evilness and what they can do with that.
You’re missing a then to your if. What happens if it’s “truly” open per your definition versus not?
Another benefit is that we can learn from how the training and other steps actually work. We can change them to suit our needs (although costs are impractical today). Etc. It’s all the usual open source benefits.
I imagine its main use would be to train other models by distilling them down with LoRA/quantization etc. (assuming we have a tokenizer). Or use them to generate training data for smaller models directly.
But, I do think there is always a way to share without disclosing too many specifics, like this[1] lecture from this year's spring course at Stanford. You can always say, for example:
- The most common technique for filtering was using voting LLMs (without disclosing said llms or quantity of data).
- We built on top of a filtering technique for removing poor code using ____ by ____ authors (without disclosing or handwaving how you exactly filtered, but saying that you had to filter).
- We mixed certain proportion of this data with that data to make it better (without saying what proportion)
[1] https://www.youtube.com/watch?v=jm2hyJLFfN8&list=PLoROMvodv4...
That kind of burning down is classified as "mostly peaceful" by mainstream and AI.
Between those bots (for nefarious, mundane, or marketing reasons) and previous attempts at automated bots, “broad” internet discourse was already ruined. Now people recognize it. This will have the effect of pushing communities back to smaller sizes, I think this is a good thing.
People shouldn’t have trusted all the things they read online from untrusted sources in the first place.
To date, I have not seen any "evil" applications of AI, let alone dangerous or even useful ones. If Russia or North Korea get their hands on a modern AI model, the CIA will get their "Red Mercury" call: https://en.wikipedia.org/wiki/Red_mercury
You’ll probably find you can no longer make an account. I’m in the same boat as you (not used and haven’t missed in over a decade), however, my partner needed an account to manage an ad campaign for a client and neither of us were able to make one. Both tried a load of different things and, ultimately, gave up. Had to tell the client what they needed over a video call
Why not both?
There’s also the option to move on from Netflix if you don’t like its content
Llama 3.1 - https://news.ycombinator.com/item?id=41046540 - July 2024 (114 comments)
$279mm in 1957 dollars is about $3.2bn today [2]. A public cluster of GPUs provided for free to American universities, companies and non-profits might not be a bad idea.
[1] https://en.m.wikipedia.org/wiki/Heavy_Press_Program
[2] https://data.bls.gov/cgi-bin/cpicalc.pl?cost1=279&year1=1957...
(To connect universities to the different supercomputing centers, the NSF funded the NSFnet network in the 80s, which was basically the backbone of the Internet in the 80s and early 90s. The supercomputing funding has really, really paid off for the USA)
This would be the logical place to put such a programme.
I'm in Canada, and our science funding has likewise fallen year after year as a proportion of our GDP. I'm still benefiting from A100 clusters funded by tax payer dollars, but think of the advantage we'd have over industry if we didn't have to fight over resources.
Terrible name unless they low-key plan to make AI researchers' hair fall out.
Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants.
The investment was made to build the press, which created significant jobs and capital investment. The press, and others like it, were subsequently operated by and then sold to a private operator, which in turn enabled the massive expansion of both military manufacturing, and commercial aviation and other manufacturing.
The Heavy Press Program was a strategic investment that paid dividends by both advancing the state of the art in manufacturing at the time it was built, and improving manufacturing capacity.
A GPU cluster might not be the correct investment, but a strategic investment in increasing, for example, the availability of training data, or interoperability of tools, or ease of use for building, training, and distributing models would probably pay big dividends.
Totally agree. That doesn't mean it can't generate massive ROI.
> Govt investment would also drive the cost of GPUs up a great deal
Difficult to say this ex ante. On its own, yes. But it would displace some demand. And it could help boost chip production in the long run.
> Not sure why a publicly accessible GPU cluster would be a better solution than the current system of research grants
Those receiving the grants have to pay a private owner of the GPUs. That gatekeeping might be both problematic, if there is a conflict of interests, and inefficient. (Consider why the government runs its own supercomputers versus contracting everything to Oracle and IBM.)
You mean a better solution than different teams paying AWS over and over, potentially spending 10x on rent rather than using all that cash as a down payment on actually owning hardware? I can't really speak for the total costs of depreciation/hardware maintenance but renting forever isn't usually a great alternative to buying.
Sure, academia could build LLMs, and there is at least one large-scale project for that: https://gpt-nl.com/ On the other hand, these kinds of models still need to demonstrate specific scientific value that goes beyond using a chatbot for generating ideas and summarizing documents.
So I fully agree that the research budget cuts in the past decades have been catastrophic, and probably have contributed to all the disasters the world is currently facing. But I think that funding prestigious super-projects is not the best way to spend funds.
To access the resource I had to go through EuroCC [0], which is a network facilitating access to and exploitation of HPC/HTC infra. It is (or can be) a great competing model to US cloud providers.
As a small business I got 8 hrs of consultancy and 10k compute hours for free. I’m still learning the details but my understanding is that after that the prices are very competitive.
[1] https://developer.apple.com/metal/tensorflow-plugin/ [2] https://www.xda-developers.com/nvidia-cuda-amd-zluda/
Until we get cheaper cards that stand the test of time, building a public cluster is just a waste of money. There are far better ways to spend $1b in research dollars.
The private companies buying hundreds of billions of dollars of GPUs aren't writing them off in 2 years. They won't be cutting edge for long. But that's not the point--they'll still be available.
> Nvidia's profit margins on the H100 are crazy
I don't see how the current practice of giving a researcher a grant so they can rent time on a Google cluster that runs H100s is more efficient. It's just a question of capex or opex. As a state, the U.S. has a structural advantage in the former.
> far better ways to spend $1b in research dollars
One assumes the U.S. government wouldn't be paying list price. In any case, the purpose isn't purely research ROI. Like the heavy presses, it's in making a prohibitively-expensive capital asset generally available.
AI is a fad, the brick and mortar of the future is open source tools.
The USA and Europe are already doing that on a grand scale, in different forms, at both national and international levels.
I work at an HPC center which provides servers nationally and collaborates at the international level.
[1] https://www.technologyreview.com/2024/05/13/1092322/why-amer...
That just doesn't seem a good idea.
How much capability would $3.2bn in terms of AI computing power provide, including the operational and power costs of the cluster?
Certainly, you could build a "$3.2bn GPU cluster", but it would be dark.
So, how much learning time would $3.2bn provide? 1 year? 10 years?
Just curious about hand wavy guesses. I have no idea of the scope of these clusters.
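For anyone who wants to plug in their own guesses, here is a minimal sketch of the arithmetic: split a fixed budget between buying GPUs and paying for their electricity over a chosen number of years. Every input value in the example is a placeholder assumption, not a quote or a measured figure, and the model ignores staff, buildings, networking refreshes, and failures.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class ClusterAssumptions:
    budget_usd: float               # total money available
    cost_per_gpu_usd: float         # all-in purchase cost per GPU (placeholder)
    power_per_gpu_kw: float         # average draw per GPU incl. cooling (placeholder)
    electricity_usd_per_kwh: float  # power price (placeholder)
    years: float                    # how long the cluster must stay lit


def opex_per_gpu_usd(a: ClusterAssumptions) -> float:
    """Electricity cost of running one GPU around the clock for a.years."""
    return a.power_per_gpu_kw * HOURS_PER_YEAR * a.years * a.electricity_usd_per_kwh


def affordable_gpus(a: ClusterAssumptions) -> int:
    """Number of GPUs the budget can both buy and power for a.years."""
    return int(a.budget_usd // (a.cost_per_gpu_usd + opex_per_gpu_usd(a)))


# Placeholder inputs only; swap in your own numbers.
guess = ClusterAssumptions(
    budget_usd=3.2e9,
    cost_per_gpu_usd=40_000,
    power_per_gpu_kw=1.5,
    electricity_usd_per_kwh=0.10,
    years=5,
)
n = affordable_gpus(guess)
print(f"{n} GPUs, {n * HOURS_PER_YEAR * guess.years:.3g} GPU-hours over {guess.years} years")
```

Setting `years=0` reproduces the "dark" cluster: the whole budget goes to hardware and nothing is left to run it.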
Unfortunately, the dominant LLM architecture makes it relatively infeasible right now.
- Gaming hardware has too limited VRAM for training any kind of near-state-of-the-art model. Nvidia is being annoyingly smart about this to sell enterprise GPUs at exorbitant markups.
- Right now communication between machines seems to be the bottleneck, and this is way worse with limited VRAM. Even with data-centre-grade interconnect (mostly Infiniband, which is also Nvidia, smart-asses), any failed links tend to cause big delays in training.
Nevertheless, it is a good direction to push towards, and the government could indeed help, but it will take time. We need both a more healthy competitive landscape in hardware, and research towards model architectures that are easy to train in a distributed manner (this was also the key to the success of Transformers, but we need to go further).
They probably won't be using it now because the phone in your pocket is likely more powerful. Moore's law did end, but data center hardware is still evolving orders of magnitude faster than forging presses.
If anything, allocate compute to citizens.
If something like this were to become a reality, I could see something like "CitizenCloud" where once you prove that you are a US Citizen (or green card holder or some other requirement), you can then be allocated a number of credits every month for running workloads on the "CitizenCloud". Everyone would get a baseline amount, from there if you can prove you are a researcher or own a business related to AI then you can get more credits.
Why couldn’t law enforcement be private too? You call 911, several private security squads rush to solve your immediate crime issue, and the ones who manage to shoot the suspect send you a $20k bill. Seems efficient. If you don’t like the size of the bill, you can always get private crime insurance.
Government distorting undeveloped markets that have a lot of room for competition to increase efficiencies is a bad thing.
Government agencies running programs that should not be profitable, or where the only profit to be left comes at the expense of society as a whole, is a good thing.
Lots of basic medicine is the go to example here, treating cancer isn't going to be "profitable" and attempting to make it such just leads to dead people.
On the flip side, one can argue that dentistry has seen amazing strides in affordability and technological progress through the free market. From dental xrays to improvements in dental procedures to make them less painful for the patients.
Eye surgery is another area where competition has led to good consumer outcomes.
But life or death situations where people can't spend time researching? The only profit there comes through exploiting people.
that is bereft of detail enough to just be wrong. There are things that government is good for and things that government is bad for, but "anything" is just too broad, and reveals an anti-government bias which just isn't well thought out.
I find the language around "open source AI" to be confusing. With "open source" there's usually "source" to open, right? As in, there is human legible code that can be read and modified by the user? If so, then how can current ML models be open source? They're very large matrices that are, for the most part, inscrutable to the user. They seem akin to binaries, which, yes, can be modified by the user, but are extremely obscured to the user, and require enormous effort to understand and effectively modify.
"Open source" code is not just code that isn't executed remotely over an API, and it seems like maybe it's being conflated with that here?
- No more vendor lock-in
- Instead of just wrapping proprietary API endpoints, developers can now integrate AI deeply into their products in a very cost-effective and performant way
- A price race to the bottom, with near-instant LLM responses at very low prices, is on the horizon
As a founder, it feels like a very exciting time to build a startup as your product automatically becomes better, cheaper, and more scalable with every major AI advancement. This leads to a powerful flywheel effect: https://www.kadoa.com/blog/ai-flywheel
Maybe there will be a big price war while the major players fight for positioning, but they still need to make money off their investments, so someone is going to have to raise prices at some point, and you'll be locked into their system if you build on it.
There are going to be loads of providers for these open models. Openrouter already has 3 providers for the new 405B model within hours.
Depends on how you define this. Most of the top companies don't care as much about making a profit off of AI inference itself, if the existence of the -feature- of AI inference drives more usage and/or sales of their other products (phones, computers, operating systems, etc.)
That's why, for example, Google and Bing searches automatically perform LLM inference at no cost to the user.
Including adtech models, which are predominantly cloud-based.
And so the models that have mechanisms for curating and preventing such misapplied weighting, and then the organizations and individuals who accurately create adjustments to the models, will in the end be the winners - where truth has been more honed for.
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...
https://huggingface.co/failspy/Llama-3-70B-Instruct-ablitera...
This is not altruism, although it's still great for devs and startups. All of FB's GPU investment is primarily for new AI products: "friends", recommendations, and selling ads.
https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
Someone can correct me here but AFAIK we don't even know which datasets are used to train these models, so why should we even use "open" to describe Llama? This is more similar to freeware than an open-source project.
[1] https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/202...
In fairness to Llama, the source code itself (though not the training data) is available to access, although not really under a license that many would consider open source.
This means they need content that will grab attention, and creating open source models that allow anyone to create any content on their own becomes good for Meta. The users of the models can post it to their Instagram/FB/Threads account.
Releasing an open model also releases Meta from the burden of having to police the content the model generates, once the open source community fine-tunes the models.
Overall, this is a sound business move for Meta. The post doesn't really talk about the true benefit, instead moralizing about open source.
But I have strong doubts they (or any other company) actually believe what they are saying.
Here is the reality:
- Facebook is spending untold billions on GPU hardware.
- Facebook is arguing in favor of open sourcing the models, that they spent billions of dollars to generate, for free...?
It follows that companies with much smaller resources (money) will not be able to match what Facebook is doing. Seems like an attempt to kill off the competition (specifically, smaller organizations) before they can take root.
Bravo! While I don't agree with Zuck's views and actions on many fronts, on this occasion I think he and the AI folks at Meta deserve our praise and gratitude. With this release, they have brought the cost of pretraining a frontier 400B+ parameter model to ZERO for pretty much everyone -- well, everyone except Meta's key competitors.[a] THANK YOU ZUCK.
Meanwhile, the business-minded people at Meta surely won't mind if the release of these frontier models to the public happens to completely mess up the AI plans of competitors like OpenAI/Microsoft, Google, Anthropic, etc. Come to think of it, the negative impact on such competitors was likely a key motivation for releasing the new models.
---
[a] The license is not open to the handful of companies worldwide which have more than 700M users.
Step 1. Chick-Fil-A releases a grass-fed beef burger to spite other fast-food joints, calls it "the vegan burger"
Step 2. A couple of outraged vegans show up in the comments, pointing out that beef, even grass-fed beef, isn't vegan
Step 3. Fast food enthusiasts push back: it's unreasonable to want companies to abide by this restrictive definition of "vegan". Clearly this burger is a gamechanger and the definition needs to adapt to the times.
Step 4. Goto Step 2 in an infinite loop
That's the difference between open source and free software.
I.e., the more important thing - the more "free" thing - is the licensing now.
E.g., I play around with different image diffusion models like Stable Diffusion and specific fine-tuned variations for ControlNet or LoRA that I plug into ComfyUI.
But I can't use it at work because of the licensing. I have to use InvokeAI instead of ComfyUI if I want to be careful and only very specific image diffusion models without the latest and greatest fine-tuning. As others have said - the weights themselves are rather inscrutable. So we're building on more abstract shapes now.
But the key open thing is making sure (1) the tools to modify the weights are open and permissive (ComfyUI, related scripts or parts of both the training and deployment) and (2) the underlying weights of the base models and the tools to recreate them have MIT or other generous licensing. As well as the fine-tuned variants for specific tasks.
It's not going to be the naive construction in the future where you take a base model and as company A you produce company A's fine tuned model and you're done.
It's going to be a tree of fine-tuned models as a node-based editor like ComfyUI already shows and that whole tree has to be open if we're to keep the same hacker spirit where anyone can tinker with it and also at some point make money off of it. Or go free software the whole way (i.e., LGPL or equivalent the whole tree of tools).
In that sense unfortunately Llama has a ways to go to be truly open: https://news.ycombinator.com/item?id=36816395
In terms of inference and interface (since you mentioned comfy) there are many truly open source options such as vLLM (though there isn't a single really performant open source solution for inference yet).
Ok, first of all, has this really worked? AI moderators still can't capture the mass of obvious spam/bots on all their platforms, Threads included. Second, AI detection doesn't work, and with how much better the systems are getting, it's probably never going to, unless you keep the best models for yourself, and it is clear from the rest of the note that it's not Zuck's intention to do so.
> As long as everyone has access to similar generations of models – which open source promotes – then governments and institutions with more compute resources will be able to check bad actors with less compute.
This just doesn't make sense. How are you going to prevent AI spam, AI deepfakes from causing harm with more compute? What are you gonna do with more compute about nonconsensual deepfakes? People are already using AI to bypass identity verification on your social media networks, and pump out loads of spam.
I don't think that's true. I don't think even the best privately held models will be able to detect AI text reliably enough for that to be worthwhile.
I still agree with his general take - bad actors will get these models or make them themselves, you can't stop it. But the logic about compute power is odd.
FB was notorious for censorship. Anyway, what is with the "actions/actors" terminology? This is straightforward totalitarian language.
This also has the important effect of neutralizing the critique of US Government AI regulation because it will democratize "frontier" models and make enforcement nearly impossible. Thank you, Zuck, this is an important and historic move.
It also opens up the market to a lot more entry in the area of "ancillary services to support the effective use of frontier models" (including safety-oriented concerns), which should really be the larger market segment.
Plus there's still the spectre of SB-1047 hanging around.
Is the vision here to treat LLM-based AI as a "public good", akin to a utility provider in a civilized country (taxpayer funded, govt maintained, non-for-profit)?
I think we could arguably call this "open source" when all the infra blueprints, scripts and configs are freely available for anyone to try and duplicate the state-of-the-art (resource and grokking requirements notwithstanding)
You also can't use it if you're the government of India.
Neither can sex workers use it. (Do you know if your customers are sex workers?)
There are also very vague restrictions for things like discrimination, racism etc.
Llama could change the license on later versions to kill your business and you have no options as you don't know how they trained it or have the budget to.
It's not much more free than binary software.
The whole thing is interesting, but this part strikes me as potentially anticompetitive reasoning. I wonder what the lines are that they have to avoid crossing here?
"Commoditize your complements" is an accepted strategy. And while pricing below cost to harm competitors is often illegal, the reality is that the marginal cost of software is zero.
Which open-source has such restrictions and clause?
C'mon folks, they're opening up for free to 99.99% of potential users what cost hundreds of millions of dollars, if not in the ballpark of a billion.
Let's appreciate that, instead of focusing on semantics for a while.
A good wording for this is "open-washing" as described in this paper: https://dl.acm.org/doi/fullHtml/10.1145/3630106.3659005
The HPC domain (data and compute intensive applications that typically need vector, parallel or other such architectures) has been around for the longest time, but confined to academic / government tasks.
LLMs, with their famous "matrix multiply" at their very core, are basically demolishing an ossified frontier where a few commercial entities (Intel, Microsoft, Apple, Google, Samsung etc) have defined for decades what computing looks like for most people.
Assuming that the genie is out of the bottle, the question is: what is the shape of end-user devices that are optimally designed to use compute intensive open source algorithms? The "AI PC" is already a marketing gimmick, but could it be that Linux desktops and smartphones will suddenly be "AI natives"?
For sure it's a transformational period and the landscape T+10 yrs could be drastically different...
I think it's interesting to think about this question of open source, benefits, risk, and even competition, without all of the baggage that Meta brings.
I agree with the FTC, that the benefits of open-weight models are significant for competition. The challenge is in distinguishing between good competition and bad competition.
Some kind of competition can harm consumers and critical public goods, including democracy itself. For example, competing for people's scarce attention or for their food buying, with increasingly optimized and addictive innovations. Or competition to build the most powerful biological weapons.
Other kinds of competition can massively accelerate valuable innovation.
The FTC must navigate a tricky balance here — leaning into competition that serves consumers and the broader public, while being careful about what kind of competition it is accelerating that could cause significant risk and harm.
It's also obviously not just "big tech" that cares about the risks behind open-weight foundation models. Many people have written about these risks even before it became a subject of major tech investment. (In other words, A16Z's framing is often rather misleading.) There are many non-big tech actors who are very concerned about current and potential negative impacts of open-weight foundation models.
One approach which can provide the best of both worlds, is for cases where there are significant potential risks, to ensure that there is at least some period of time where weights are not provided openly, in order to learn a bit about the potential implications of new models.
Longer-term, there may be a line where models are too risky to share openly, and it may be unclear what that line is. In that case, it's important that we have governance systems for such decisions that are not just profit-driven, and which can help us continue to get the best of all worlds. (Plug: my organization, the AI & Democracy Foundation; https://ai-dem.org/; is working to develop such systems and hiring.)
i am not down with this concept of the chattering class deciding what are good markets and what are bad, unless it is due to broad-based and obvious moral judgements.
But this is really positive stuff and it’s nice to view my time there through the lens of such a change for the better.
Keep up the good work on this folks.
Time to start thinking about opening up a little on the training data.
There is still a lot of modifying you can do with a set of weights, and they make great foundations for new stuff, but yeah we may never see a competitive model that's 100% buildable at home.
Edit: mkolodny points out that the model code is shared (under llama license at least), which is really all you need to run training https://github.com/meta-llama/llama3/blob/main/llama/model.p...
In a better world, there would be no “I ran some algos on it and now it’s mine” defense.
If you have open data and open source code you can reproduce the weights
Has that changed?
I believe this is the current draft: https://opensource.org/deepdive/drafts/the-open-source-ai-de...
People are framing this as if it was an open-source hierarchy, with "actual" open-source requiring all training code to be shared. This is not obvious to me, as I'm not asking people that share open-source libraries to also share the tools they used to develop them. I'm also not asking them to share all the design documents/architecture discussion behind this software. It's sufficient that I can take the end result and reshape it in any way I desire.
This is coming from an LLM practitioner that finetunes models for a living; and this constant debate about open-source vs open-weights seems like a huge distraction vs the impact open-sourcing something like Llama has... this is truly a Linux-like moment. (at a much smaller scale of course, for now at least)
The source of a language model is the text it was trained on. Llama models are not open source (contrary to their claims), they are open weight.
15T tokens, 45 terabytes. Seems fairly open source to me.
There is still a lot you can do with weights, like fine tuning, and it is arguably more useful as retraining the entire model would cost millions in compute.
- If we start with the closed training set, that is closed and stolen, so call it Stolen Source.
- What is distributed is a bunch of float arrays. The Llama architecture is published, but not the training or inference code. Without code there is no open source. You can as well call a compiler book open source, because it tells you how to build a compiler.
Pure marketing, but predictably many people follow their corporate overlords and eagerly adopt the co-opted terms.
Reminder again that FB is not releasing this out of altruism, but because they have an existing profitable business model that does not depend on generated chats. They probably do use it internally for tracking and building profiles, but that is the same as using Linux internally, so they release the weights to destroy the competition.
Isn't price dumping an anti trust issue?
Inference code is the runtime; the code that runs the model. Not the model itself.
Additionally, models can be (and are) fine tuned via APIs, so if that is the threshold required for a system to be "open source", then that would also make the GPT4 family and other such API only models which allow finetuning open source.
If everyone open sources their AI code, Meta can snatch the bits that help them without much fear of helping their direct competitors.
If we make large strides in interpretability we may have something resembling source code, but we're certainly not there yet. I don't think the solution to that problem should be to change the definition of open source and pretend the problem has been solved.
"Finding an agreement on what constitutes Open Source AI is the most important challenge facing the free software (also known as open source) movement. European regulation already started referring to "free and open source AI", large economic actors like Meta are calling their systems "open source" despite the fact that their license contain restrictions on fields-of-use (among other things) and the landscape is evolving so quickly that if we don't keep up, we'll be irrelevant."
[1] https://fosdem.org/2024/schedule/event/fosdem-2024-2805-movi... defining-open-source-ai/
I'm not sure if facebook has done that
FB's strategy is that they are content to be only a user, and fine with ruining competitors' businesses with good-enough free alternatives while collecting awards as saviors of whatever.
Just say "open weights", not "open source".
Does the training data require permission from the copyright holder to use? Are the weights really open source or more like compiled assembly?
But just because a single developer couldn’t do it doesn’t mean it couldn’t be done. It means nobody has organized a large enough effort yet.
For something like a browser, which is critical for security, you need both the organization and the trust. Despite frequent criticism, Mozilla (for example) is still considered pretty trustworthy in a way that an unknown developer can’t be.
The actual point that matters is that these models are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer.
.. the thing is, we have not dealt with LLMs much, it's hard to say what can be considered an open source LLM just yet, so we use that as a metaphor for now
New stuff, so probably not good to force old words, with known meanings, onto new stuff.
This post is an ad and trying to paint these things as something they aren't.
If the FOSS community sets this as the benchmark for open source in respect of AI, they're going to lose control of the term. In most jurisdictions it would be illegal for the likes of Meta to release training data.
HN spends a day figuring out how it’s actually bad
If we lived in a sensible world we'd have nuked Meta into a trillion tiny little pieces some time around the Cambridge Analytica bullshit.
* they need LLMs that they can control for features on their platforms (Fb/Instagram, but I can see many use cases on VR too)
* they cannot sell it. They have no cloud services to offer.
So they would spend this money anyways, but to compensate some losses they just decided to use it to fix their PR by contenting developers
I also think LeCun opposes OpenAI's gatekeeping at a philosophical/political level. He's using his position to strengthen open-source AI. Sure, there are strategic business considerations, but I wouldn't rule out principled motivations too.
It will seem incredibly weird today to have an imaginary friend that you treat as a genuine relationship but I genuinely expect this will happen and become a commonplace thing within the next two decades.
Given the mountain of GPUs they bought at precisely the right moment I don't think that's entirely accurate
If I remember correctly, FB didn't buy those GPUs because of OpenAI; they were going to buy them anyway, but Mark said whatever we are buying, let's double it.
> A complement is a product that you usually buy together with another product. Gas and cars are complements. Computer hardware is a classic complement of computer operating systems. And babysitters are a complement of dinner at fine restaurants. In a small town, when the local five star restaurant has a two-for-one Valentine’s day special, the local babysitters double their rates. (Actually, the nine-year-olds get roped into early service.)
> All else being equal, demand for a product increases when the prices of its complements decrease.
Smartphones are a complement of Instagram. VR headsets are a complement of the metaverse. AI could be a component of a social network, but it's not a complement.
1. Is there such a thing as 'attention grabbing AI content' ? Most AI content I see is the opposite of 'attention grabbing'. Kindle store is flooded with this garbage and none of it is particularly 'attention grabbing'.
2. Why would creation of such content, even if it was truly attention grabbing, benefit meta in particular ?
3. How would proliferation of AI content lead to more ad spend in the economy? Ad budgets won't increase because of AI content.
To me this is a typical Zuckerberg play. Attach Meta's name to whatever is trendy at the moment, like the (now forgotten) metaverse, cryptocoins and a bunch of other failed stuff that was trendy for a second. Meta is NOT a Gen AI company (or a metaverse company, or a crypto company) like he is scamming (more like colluding with) the market to believe. A mere distraction from slowing user growth on ALL of Meta's apps.
ppl seem to have just forgotten this https://en.wikipedia.org/wiki/Diem_(digital_currency)
Every piece of content in any feed (good, bad, or otherwise) benefits the aggregator (Meta, YouTube, whatever), because someone will look at it. Not everything will go viral, but it doesn't matter. Scroll whatever on Twitter, YouTube Shorts, Reddit, etc. Meta has a massive presence in social media, so content being generated is shared there.
The more content of any type leads to more engagement on the platforms where it's being shared. Every Meta feed serves the viewer an ad (for which Meta is paid) every 3 or so posts (pieces of content). It doesn't matter if the user doesn't like 1/5 posts or whatever, the number of ads still goes up.
More important are the products that Meta will be able to make if the industry standardizes on Llama. They would have a front seat, not just with access to the latest unreleased models but also in setting the direction of progress and what the next-gen LLM optimizes for. If you're Twitter or Snap or TikTok, or compete with Meta on product, then good luck trying to keep up.
That is why they hopped on the Attention is All You Need train
Then all other visual AI content will be banned. If that is where legislation is heading.
Through this lens, Meta’s actions make more sense to me. Why invest billions in VR/AR? The answer is simple: don’t get locked out of the next platform, maybe you can own the next one. Why invest in LLMs? Again, don’t get locked out. Google and OpenAI/Microsoft are far larger and ahead of Meta right now, and Meta genuinely believes the best way to make sure they have an LLM they control is to make everyone else have an LLM they can control. That way community efforts are unified around their standard.
Small guys are the ones being screwed over by AI companies and having their text/art/code stolen without any attribution or adherence to license. I don’t think Meta is on their side at all
It's helpful to also look at what do the developers and companies (everyone outside of top 5/10 big tech companies) get out of this. They get open access to weights of SOTA LLM models that take billions of dollars to train and 10s of billions a year to run the AI labs that make these. They get the freedom to fine tune them, to distill them, and to host them on their own hardware in whatever way works best for their products and services.
There is still, just about, a strong ethos (especially in the research teams) to chuck loads of stuff over the wall into open source (pytorch, detectron, SAM, aria etc).
But it's seen internally as a two part strategy:
1) strong recruitment tool (come work with us, we've done cool things, and you'll be able to write papers)
2) seeding the research community with a common toolset.
Meta wants to make sure they commoditize their complements: they don’t want a world where OpenAI captures all the value of content generation, they want the cost of producing the best content to be as close to free as possible.
For now, Meta seems to release Llama models in ways that don't significantly lock people into their infrastructure. If that ever stops being the case, you should fork rather than trust their judgment. I say this knowing full well that most of the internet is on AWS or GCP, most brick and mortar businesses use Windows, and carrying a proprietary smartphone is essentially required to participate in many aspects of the modern economy. All of this is a mistake. You can't resist all lock-in. The players involved effectively run the world. You should still try where you can, and we should still be happy when tech companies either slip up or make the momentary strategic decision to make this easier
Fork what? The secret sauce is in the training data and infrastructure. I don't think either of those is currently open.
Also, the underdog always touts Open Source and standards, so it’s good to remain skeptical when/if tables turn.
Pretty sure the only reason Meta’s managed to do this is because of Zuck’s iron grip on the board (majority voting rights). This is great for Open Source and regular people though!
Was always their modus operandi, surely. How else would they have survived.
Thanks for returning everyone else's content, and never mind all the content stealing your platform did.
We interviewed Thomas who led Llama 2 and 3 post training here in case you want to hear from someone closer to the ground on the models https://www.latent.space/p/llama-3
"Commoditize Your Complement" is often cited here: https://gwern.net/complement
There is a demo video that shows a user wearing a Quest VR headset and asks the AI "what do you see" and it interprets everything around it. Then, "what goes well with these shorts"... You can see where this is going. Wearing headsets with AIs monitoring everything the users see and collecting even more data is becoming normalized. Imagine the private data harvesting capabilities of the internet but anywhere in the physical world. People need not even choose to wear a Meta headset, simply passing a user with a Meta headset in public will be enough to have private data collected. This will be the inevitable result of vision models improvements integrated into mobile VR/AR headsets.
He was clear in that one of their motivations is avoiding vendor lockin. He doesn't want Meta to be under the control of their competitors or other AI providers.
He also recognizes the value brought to his company by open sourcing products. Just look at React, PyTorch, and GraphQL. All industry standards, and all brought tremendous value to Facebook.
It's a proprietary dump of data you can't replicate or verify.
What were the sources? What datasets it was trained on? What are the training parameters? And so on and so on
It is still far from zero.
Is it possible to run this with ollama?
Ollama will offload as many layers as it can to the gpu then the rest will run on the cpu/ram.
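To make the GPU/CPU split described above concrete, here is a rough sketch of a greedy "fill VRAM first, spill the rest to system RAM" calculation. This is not Ollama's actual code or API: the function name `split_layers`, the uniform per-layer size, and every number below are assumptions chosen purely for illustration.

```python
GIB = 1024 ** 3


def split_layers(n_layers: int, layer_bytes: int, vram_bytes: int,
                 reserve_bytes: int = 0) -> tuple[int, int]:
    """Return (gpu_layers, cpu_layers) for a greedy offload.

    Assumes every layer has the same size, which is a simplification.
    reserve_bytes models VRAM kept free for things like the KV cache.
    """
    usable = max(vram_bytes - reserve_bytes, 0)
    gpu_layers = min(n_layers, usable // layer_bytes)
    return gpu_layers, n_layers - gpu_layers


# Hypothetical model: 80 layers of 0.5 GiB each on a 24 GiB card,
# with 2 GiB held back.
gpu_layers, cpu_layers = split_layers(80, GIB // 2, 24 * GIB, 2 * GIB)
print(f"{gpu_layers} layers on GPU, {cpu_layers} layers on CPU/RAM")
```

The practical upshot is that the more layers land on the CPU side, the more each token is bottlenecked by system RAM bandwidth rather than the GPU.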
Nope. Not one bit. Supporting F/OSS when it suits you in one area and then being totally dismissive of it in every other area should not be lauded. How about open sourcing some of FB's VR efforts?
Used to contribute in the early 2000s with my Pentium for a while.
Ever got any results?
Also, for training LLMs, I understand there is a huge bandwidth problem with this approach.
NVIDIA offers discounts https://developer.nvidia.com/education-pricing
eg. for Australia, the National Computing Infrastructure allows researchers to reserve time on:
- 160 nodes each containing four Nvidia V100 GPUs and two 24-core Intel Xeon Scalable 'Cascade Lake' processors.
- 2 nodes of the NVIDIA DGX A100 system, with 8 A100 GPUs per node.
It's not the "inference code", its the code that specifies the architecture of the model and loads the model. The "inference code" is mostly the model, and the model is not legible to a human reader.
Maybe someday open source models will be possible, but we will need much better interpretability tools so we can generate the source code from the model. In most software projects you write the source as a specification that is then used by the computer to implement the software, but in this case the process is reversed.
Add to the list of benefits to Meta that it keeps LeCun happy.
Although it is clear that the computing capacity of the GPU would be very underutilized with the SSD as the bottleneck. Even using RAM instead of VRAM is pretty impractical. It might be a bit better for chips like Apple's where the CPU, RAM and GPU are all tightly connected on the same SoC, and the main RAM is used as the VRAM.
Would that performance be still worth more than the electricity cost? Would the earnings be high enough for a wide population to be motivated to go through the hassle of setting up their machine to serve requests?
This never made sense to me -- Apple could easily hire top talent to write Apple Silicon bindings for these popular libraries. I work at a creative ad agency, we have tons of high end apple devices yet the neural cores sit unused most of the time.
This time, humanity narrowly averted complete disaster thanks to the huge efforts and resources of a small number of people.
I wonder if we are witnessing the end of humanity's open knowledge and compute (at least until we pass through a neo dark ages and reach the next age of enlightenment).
Whether it'll be due to profit or control, it looks like humanity is poised to get fucked.
I like the fact that these can be made with just mass-printed multiplication (and in ternary computing's case - addition) gates which require little more than 10 year old tech which is already widely distributed.
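A small sketch of why ternary weights only need addition: if every weight is restricted to -1, 0 or +1, a dot product never has to multiply, it only adds, subtracts or skips. The function name `ternary_dot` and the sample values are made up for illustration and do not come from any particular ternary-model implementation.

```python
def ternary_dot(weights, activations):
    """Dot product for weights in {-1, 0, +1} using only add/subtract."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a      # +1: add the activation
        elif w == -1:
            acc -= a      # -1: subtract the activation
        elif w != 0:
            raise ValueError(f"non-ternary weight: {w}")
        # 0: skip the activation entirely
    return acc


weights = [1, 0, -1, 1]
activations = [5, 7, 2, 3]
# Same result as the ordinary multiply-accumulate version.
assert ternary_dot(weights, activations) == sum(
    w * a for w, a in zip(weights, activations)
)
print(ternary_dot(weights, activations))
```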
it's a bigger investment, but it's an investment which will pay dividends for decades. with a compute cluster, the government is taking on an asset in the form of the cluster but also liabilities in the form of operations and administration.
with a fab, the government takes on either a promise of lower taxes for N years or hands over a bag of cash. after that they're clear of it. the company operating the fab will be responsible for the risks and on-going expenses.
on top of that...
> thousand of staff
the company will employ/attract even more top talent, each of whom will pay taxes and eventually go on to found related companies or teach the next generation or what have you. not to mention the risk reduction that comes with on-shoring something as critical to national security and the economy as a fab.
a public-access compute cluster isn't a bad idea, but it probably makes more sense to fund/operate it in similar PPP model. non-profit consortium of universities and business pool resources to plan, build, and operate it, government recognizes it as a public good and chips in a significant amount of money to help.
It's not going to stay like this I can assure you that :).
Open router is a paid api so that can absolutely be sustainable.
And Meta has multiple reasons for going the open route, some explained in their posts, some less so (it harms their competitors).
I reckon there will be a llama 4 and beyond
See Joel on Software "Smart companies try to commoditize their products’ complements" https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
Source is the input to some built artifact. It is the source of that artifact. As in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you are using "source" as is analogous to the source of the compiler in traditional programming.
An asset is an artifact used as input that is preserved verbatim in the output. For example, a logo baked into an application to be rendered in the UI. The compilation of the program doesn't make a new logo, it just moves the asset into the built artifact.
Aside from licensing content, that content creators don’t like redistribution means a lawful model would probably only use Gutenberg’s collection and permissive code. Anything else, including Wikipedia, usually has licensing requirements they might violate.
Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.
Small companies' interests are aligned with Meta's, as they are now on an equal footing with large incumbent players. They can now compete with a similarly sized team at a big tech company instead of that team + dozens of AI scientists.
Training Code and dataset are analogous to the developer who wrote the script
Model and weights are end product that is then released
Inference Code is the runtime that could execute the code. That would be e.g. PyTorch, which can import the weights and run inference.
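The three-way split above can be sketched with a toy model. This is only an illustration of the analogy, not how Llama is trained or served: the linear model, the dataset, and the names `train` and `infer` are all invented, and JSON stands in for a real weights file.

```python
import json


def train(data, lr=0.01, steps=5000):
    """'Training code': fit y = w*x + b by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}


def infer(weights, x):
    """'Inference code': the runtime that executes the released weights."""
    return weights["w"] * x + weights["b"]


# 'Dataset': stays private in this analogy.
dataset = [(x, 3 * x + 1) for x in range(5)]

# 'Model and weights': the only artifact that gets published.
released = json.dumps(train(dataset))

# A downstream user needs only the released weights and the runtime.
weights = json.loads(released)
print(round(infer(weights, 10), 2))
```

Note that `infer` works without ever seeing `dataset` or `train`, which is the point both sides of the open-weights debate are arguing over.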
No, I completely disagree. Python is near pseudo-text source. Source exists for the specific purpose of being easily and completely understood, by humans, because it's for and from humans. You can turn a python calculator into a web server, because it can be split and separated at any point, because it can be completely understood at any point, and it's deterministic at every point.
A model cannot be understood by a human. It isn't meant to be. It's meant to be used, very close to as is. You can't fundamentally change the model, or dissect it, you can only nudge it in a direction, with the force of that nudge being proportional to the money you can burn, along with hope that it turns out how you want.
That's why I say it's closer to a binary: more of a black box you can use. You can't easily make a binary do something fundamentally different without changing the source. You can't easily see into that black box, or even know what it will do without trying. You can only nudge it to act a little differently, or use it as part of a workflow. (decompilation tools aside ;))
The DoE helped subsidize development of Kepler, Maxwell, Pascal, etc along with the underlying stack like NVLink, NGC, CUDA, etc either via purchases or allowing grants to be commercialized by Nvidia. They also played matchmaker by helping connect private sector research partners with Nvidia.
The DoE also did the same thing for AMD and Intel.
But before that, it was video games, like quake. Nvidia wouldn't be viable if not for games.
But before that, graphics research was subsidized by the DoD, back when visualizing things in 3D cost serious money.
It's funny how technology advances.
Bitcoin moved to FPGAs/ASICs very quickly because dedicated hardware was vastly more efficient; GPUs were only viable from Oct 2010. By 2013, when ASICs came online, GPUs only made sense if someone else was paying for both the hardware and electricity.
Llama is probably just running on spare capacity (I mean, sure, they've kept increasing capex, but if they're worried about an llm-based fb competitor they sort of have to in order to enact their copycat strategy)
https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...
HN censors too. Facebook just does it automatically on a huge scale with no reasoning behind each censor.
Censorship is just tuning people or things you don't want out. Censorship of your own content as a user is extremely annoying and Facebook's censorship is quite unethical. It doesn't help safety of the users, it helps safety of the business.
Also Facebook censors things that are objectively not offensive in lots of instances. YouTube too. Safety for their brand.
The best kind of open source: All the important ingredients to make it work (more and more data and money) are either not open source or in the hands of Meta. It's prohibitive by design.
People seem happy to help build Meta's empire once again in return for scraps.
I am still amazed that we can do that.
Imagine if the source code was in a programming language of which the basic syntax and semantics were known to no one but the original developers.
Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.
Raw training datasets similarly have some value, as you can analyze them for different characteristics to understand why the trained model is under/over-representing different concepts.
But yes real FOSS should be "open-build" and allow anyone to build a test-passing artifact from raw source material.
Also... Wow, Mark Zuckerberg is such a liar! Implying that llama is open source when it isn't, while at the same time trying to gather the goodwill of FOSS developers.
You are aware Facebook tracks everyone, not just people with Facebook accounts, right? They have a history of being anti-consumer in every sense of the word. So while I can understand where you're coming from, it's just not anywhere close to being reality.
If you want to or not, if you consent or not, Facebook is tracking and selling you.
No they are not selling me. How can they sell my attention to advertisers when I don't look at their ads? How can they influence me if I don't engage with their algorithm? You're the one who's trying to sell me your fear and mistrust.
I am talking about in general, not me personally. No popular content on any website/platform is AI generated. Maybe you have examples that lead you to believe that it's possible on a mass scale.
> look like a famous character or put the person in a movie scene
what attention grabbing movie used gen ai persons
you have plenty of bot or "AI/LLM" generated content, that is consumed -- up to and including things like "news".
as for the comment about movies, i'm confused -- CGI has been a thing for a long time, and "AI" has been used to convey aging or how a person might look given some conditions, on screen, as well as a whole host of things.
while this might not be an LLM, it is certainly computer generated, predictive, and artificially generated.
Can anyone share the cost of their pre-built clusters, they’ve recently started selling? (sorry feeling lazy to research atm, I might do that later when I have more time).
https://smicro.eu/nvidia-hgx-h100-640gb-935-24287-0001-000-1
8x H100 HGX cluster for €250k + VAT
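As a quick sanity check on what that listing implies per card, here is the arithmetic as a small sketch. The helper name `per_gpu_price` is made up, the price covers the whole HGX system rather than bare GPUs, and the VAT rate is left as a parameter because it varies by country.

```python
def per_gpu_price(system_price: float, gpus: int, vat_rate: float = 0.0) -> float:
    """Price per GPU for a multi-GPU system, optionally including VAT."""
    return system_price * (1 + vat_rate) / gpus


# EUR 250k for 8 GPUs, before VAT.
print(per_gpu_price(250_000, 8))
```

So the listing works out to roughly €31k per H100 before VAT.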
This way the government pays 2'500 USD per card, not 40'000 USD or whatever absurd.
20-25 year old drugs are a lot more useful than 20-25 year old GPUs, and the manufacturing supply chain is not a bottleneck.
There's no generics for the latest and greatest drugs, and a fancy gene therapy might run a lot more than $40k.
You want to punish NVIDIA for calling its shots correctly? You don't see the many ways that backfires?
That said, it strikes me that the actual limiting factor is fab capacity not nvidia's designs and we probably need to lift the monopolies preventing competition there if we want to reduce prices.
One of the things that the post mentioned was the meager profit margin that the companies made during this time.
But the thing is that this set the American auto and aviation industry up to rule the world for decades.
A government going to a company and saying 'we need you to produce this product for us at a lower margin than you'd like to' isn't the end of the world.
I don't know if this is one of those scenarios but they exist.
[0] https://www.construction-physics.com/p/how-to-build-300000-a...
Along similar lines, I'm trying to build a developer credits program where I get whomever (AMD/Dell) to purchase credits on my super computers, that we then give away to developers to build solutions, which drives more demand for our hardware, and we commit to re-invest those credits back into more hardware. The idea is to create a win-win-win (us, them, you) developer flywheel ecosystem. It isn't a new idea at all, Nvidia and hyperscalers have been doing this for ages.
You can use that model with open data to train it from scratch yourself. Or you can load Meta’s open weights and have a working LLM.
That said, the Llama license doesn't meet strict definitions of OS, and I bet they have internal tooling for datacenter-scale training that's not represented here.
That makes it source available ( https://en.wikipedia.org/wiki/Source-available_software ), not open source
RMS is an advocate for Free Software. Free Software generally implies Open Source, but not the converse.
RMS considers openness of source to be a separate category from the freeness of software. "Free software is a political movement; open source is a development model."
Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.
No true Scotsman https://en.wikipedia.org/wiki/No_true_Scotsman
OSI helped popularize the open source movement. They not only make it palatable to businesses, but got them excited about it. I think that FSF/Stallman alone would not have been very successful on this front with GPL/AGPL.
Your characterization is quite easily refutable, because at the time that OSI was founded, there was already an explosion of possible licenses and RMS and other GNUnatics were making lots of noise about GNU/Linux and trying to be as maximalist as possible while presenting any choice other than the GNU GPL as "against freedom".
This certainly would not have held well with people who were using the MIT Licence or BSD licences (created around the same time as the GNU GPL v1), who believed (and continue to believe) that there were options other than a restrictive viral licence‡. Yes, some of the people involved vilified the "free software principles", but there were also GNU "advocates" who were making RMS look tame with their wording (I recall someone telling me to enjoy "software slavery" because I preferred licences other than the GNU GPL).
The "Free Software" advocates were pretending that the goals of their licence were the only goals that should matter for all authors and consumers of software. That is not and never has been the case, so it is unsurprising that there was a bit of reaction to such extremism.
OSI and the open source label were a move to make things easier for corporations to accept and understand by providing (a) a clear unifying definition, and (b) a set of licences and guidelines for knowing what licenses did what and the risks and obligations they presented to people who used software under those licences.
‡ Don't @ me on this, because both the virality and restrictiveness are features of the GNU GPL. If it weren't for the nonsense in the preamble, it would be a good licence. As it is, it is an effective if rampantly misrepresented licence.
It also says commercial orgs can get access via negotiation, I expect a random member of the public would be able to go that route as well. I expect that there would be some hurdles to cross, it isn't really common for random members of the public to be doing the kinds of research Gadi was created to benefit. I expect it is the same way in this case in Canada. I suppose the argument is if there weren't any gatekeeping at all, you might end up with all kinds of unsuitable stuff on the cluster, e.g. crypto miners and such.
Possibly another way for a true random person to get access would be to get some kind of 0-hour academic affiliation via someone willing to back you up, or one could enrol in a random AI course or something and then talk to the lecturer in charge.
In reality, the (also taxpayer-subsidised) university pays some fee for access, but it doesn't come from any of our budgets.
It's pretty meagre pickings!
These resources aren't available to the public, but if I were king for a day we'd increase science funding such that we'd have compute resources available to high-school students and the general public (possibly following training on how to use it).
Making sure folks didn't use it to mine bitcoin would be important, though ;)
I wish that wasn't the case though!
No, it doesn't mean that. To quote the page I linked, emphasis mine,
> Source-available software is software released through a source code distribution model that includes arrangements where the source can be viewed, and in some cases modified, but without necessarily meeting the criteria to be called open-source. The licenses associated with the offerings range from allowing code to be viewed for reference to allowing code to be modified and redistributed for both commercial and non-commercial purposes.
> This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.
Per https://github.com/meta-llama/llama3/blob/main/LICENSE there's also a laundry list of ways you're not allowed to use it, including restrictions on commercial use. So not Open Source.
Isn't that what the model is? Just a collection of weights?
I'd consider the ability to admit when even your most hated adversary is doing something right, a hallmark of acting smarter.
Now, they haven't released the training data with the model weights. THAT plus the training tooling would be "end to end open source". Apple actually did that very thing recently, and it flew under almost everyone's radar for some reason:
https://x.com/vaishaal/status/1813956553042711006?s=46&t=qWa...
The best I can tell is that their self-interest here is more about gathering mindshare. That's not a terrible motive; in fact, that's a pretty decent one. It's not the bully pressing you into their ecosystem with a tit-for-tat; it's the nerd showing off his latest and going "Here. Try it. Join me. Join us."
I have faith they will continue to do what's in their best interests and if their best interests happen to align with mine, then I will support that. Just like how I don't bother killing the spider in my basement because it helps clean up the other bugs.
Of all the things to expand the scope of government spending why would they choose AI, or more specifically GPUs?
As for the why... because there's no shortage of capital for AI. It sounds like the government would like to encourage redirecting that capital to something that's good for the economy at large, rather than good for the investors of a handful of Silicon Valley firms interested only in their own short term gains.
If it succeeds, you were ahead of the curve. If it fails, you were prudent enough to fund an investigation early. Either way, bleeding edge tech gives you a W.
If it fails certain groups ensure everyone knows the government "wasted" taxpayer money.
Would you mind expanding on these options? Universal training data sounds intriguing.
Much of this is already available to private sector entities, but having a publicly funded organization responsible for curating and publishing this would enable new entrants to quickly and easily get a foundation without having to scrape the internet again, especially given how rapidly model generated content is being published.
Admittedly it hasn't been cleaned all that much - you still need to put a bit of effort into that (newer certificates tend to be better quality), but it's very low friction overall. I'd love to see them do this with more datasets
The public is incredibly lazy, though. Don't expect them to do anything until their hand is forced, which doesn't bode well for the action to meet a desirable outcome.
> if facial recognition becomes that common for wearers, most of the population is going to adorn something to prevent that
"Most of the population" is going to be "the wearers".
> Coworkers don’t and wouldn’t tolerate coworkers taking videos or pictures of them.
Here is a fun experience you can try: just hit "record" on every single Teams or Meet meeting you're ever on (or just set recording as the default setting in the app).
See how many coworkers comment on it, let alone protest.
I can tell you from experience (of having been in thousands of hours of recorded meetings in the last 3 years) that the answer is zero.
> And if facial recognition becomes that common for wearers, most of the population is going to adorn something to prevent that
My brother in Christ, you sincerely underestimate how much "most of the population" gives a shit. Most people are being tracked by Google Maps or FindMy, are triangulated with cell towers that know their exact coordinates, and willingly use social media that profiles them individually. The population doesn't even try in the slightest to resist any of it.
The same applies here, you can take those models and modify them to do whatever you want (provided you know how to train ML models), without having to ask for permission, get scrutinized or pay someone.
I personally think using the term open source is fine, as it conveys the intent correctly, even if, yes, weights are not sources you can read with your eyes.
Model weights are like a binary that nobody has the source for. We need another term.
Here, modifying that model is not harder than doing regular ML, and I can redistribute.
Meta doesn’t have access to some magic higher level abstraction for that model that would make working with it easier that they did not release.
The sources in ML are the architecture, the training and inference code, and a paper describing the training procedure. It’s all there.
The "Additional Commercial Terms" section of the license includes restrictions that would not meet the OSI definition of open source. You must ask for permission if you have too many users.
"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer."
I presume you agree with it.
> rather than serving access
It's not the same access though.
I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.
> They don't "[allow] developers to modify its code however they want"
Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.
That's a huge deal!
And it is dishonest to compare a situation where limitations are both minimal and almost unenforceable (except against maybe Google) to a situation where it's physically not possible to get access to the model weights to do what you want with them.
The limitations here are technical, not legal. (Though I am aware of the legal restrictions as well, and I think it's worth noting that no other project would get by calling themselves open source while imposing a restriction which prevents competitors from using the system to build their competing systems.) There isn't any source code to read and modify. Yes, you can fine tune a model just like you can modify a binary but this isn't source code. Source code is a human readable specification that a computer can use to transform into executable code. This allows the human to directly modify functionality in the specification. We simply don't have that, and it will not be possible unless we make a lot of strides in interpretability research.
> Its not the same access though.
> I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.
I'm not saying that systems that are provided as SaaS don't tend to be more restrictive in terms of what they let you do through the API they expose vs what is possible if you run the same system locally. That may not always be true, but sure, as a general rule it is. I mean, it can't be less restrictive. However, that doesn't mean that being able to run code on your own machine makes the code open source. I wouldn't consider Windows open source, for example. Why? Because they haven't released the source code for Windows. Likewise, I wouldn't consider these models open source because their creators haven't released source code for them. Being technically infeasible to do doesn't mean that the definition changes such that it's no longer technically infeasible. It is simply infeasible, and if we want to change that, we need to do work in interpretability, not pretend like the problem is already solved.
"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer." And that this is very significant.
> 160 nodes each containing four Nvidia V100 GPUs
and two, well, it's a CPU-based supercomputer.
None of that changes that there have been major technical breakthroughs, and entire classes of products and services that didn't exist before those investments in NASA (see https://en.wikipedia.org/wiki/NASA_spin-off_technologies for a short list). There are 15 departments and dozens of Agencies that comprise the US Federal government, many of whom make investments in science and technology as part of their mandates, and most of that is delivered through some structure of public-private partnerships.
What you see as over-hyped and over-funded nonsense could be the next ground breaking technology, and that is why we need both elected leaders who (at least in theory) represent the will of the people, and appointed, skilled bureaucrats who provide the elected leaders with the skills, domain expertise, and experience that the winners of the popularity contest probably don't have.
Yep, there will be waste, but at least with public funds there is the appearance of accountability that just doesn't exist with private sector funds.
There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models.
"Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient.
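To make the distinction concrete, here is a toy sketch (not how any real LLM is fine-tuned; the quadratic loss, the step counts, and the function names are all made up for illustration). It compares following a gradient with randomly perturbing weights and keeping whatever happens to score better:

```python
import random

def loss(w):
    # Hypothetical toy loss with its minimum at w = (3, -2).
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    # Analytic gradient of the toy loss above.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 2.0)]

def gradient_descent(w, lr=0.1, steps=50):
    # Each step moves directly downhill along the negative gradient.
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def random_search(w, steps=50, scale=0.1, seed=0):
    # "Brute force" in the sense of the comment: try a random nudge,
    # keep it only if the loss happens to improve.
    rng = random.Random(seed)
    best = list(w)
    for _ in range(steps):
        cand = [bi + rng.gauss(0.0, scale) for bi in best]
        if loss(cand) < loss(best):
            best = cand
    return best

start = [0.0, 0.0]
gd = gradient_descent(start)
rs = random_search(start)
print("gradient descent loss:", loss(gd))
print("random search loss:  ", loss(rs))
```

With the same number of steps, the gradient-following run should end up far closer to the minimum, which is the point being made: the update direction is computed, not guessed.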
Yes, the difference is that one is provided over a remote API, and the provider of the API can restrict how you interact with it, while the other is performed directly by the user. One is a SaaS solution, the other is a compiled solution, and neither is open source.
""Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient."
Whatever you want to call it, this doesn't sound like modifying functionality in source code. When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times. What I don't do is have a very simple routine make very small modifications to all of the system's functionality, then check the result of that small change across the broad spectrum of functionality, and repeat millions of times.
You can take the weights and train LoRAs (which is close to fine-tuning), but you can also build custom adapters on top (classification heads). You can mix models from different fine-tunes or perform model surgery (adding additional layers, attention heads, MoE).
You can perform model decomposition and amplify some of its characteristics. You can also train multi-modal adapters for the model. Prompt tuning requires weights as well.
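For anyone who hasn't seen what a LoRA-style adapter amounts to, here is a rough sketch in plain Python. It assumes the usual formulation (frozen weight matrix W plus a trainable low-rank product B·A, with B starting at zero); the 4x4 size, the rank of 1, and the "trained" values are invented for illustration and do not come from any real model:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d, r = 4, 1  # hypothetical layer width and adapter rank

# Frozen base weights (identity here, purely as a stand-in).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# Adapter factors: B is d x r and starts at zero, A is r x d.
B = [[0.0] for _ in range(d)]
A = [[0.5, -0.5, 0.25, 0.0]]

# Before any training the adapter contributes nothing.
W_before = add(W, matmul(B, A))

# Pretend a few training steps updated B (values are made up).
B = [[1.0], [0.0], [2.0], [0.0]]
W_after = add(W, matmul(B, A))

full_params = d * d
adapter_params = d * r + r * d
print("effective weights changed:", W_after != W)
print("trainable params:", adapter_params, "of", full_params)
```

The base matrix W is never modified; only the small factors are trained, which is why this is only possible when you actually hold the weights.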
I would even say that having the model is more potent in the hands of individual users than having the dataset.
You can modify individual neurons if you are so inclined. That's what Anthropic have done with the Claude family of models [1]. You cannot do that using any closed model. So "Open Weights" looks very much like "Open Source".
Techniques for introspection of weights are very primitive, but I do think new techniques will be developed, or even new architectures which will make it much easier.
[1] https://www.anthropic.com/news/mapping-mind-language-model
Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?
Yeah, that is my point. Things that don't have source code can't be open source.
"Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?"
I think we need to be wary of dilemmas without solutions here. For example, let's think about another analogy: I was in a car accident last week. How can I open source my car accident?
I don't think all, or even most things, are actually "open sourcable". ML models could be open sourced, but it would require a lot of work to interpret the models and generate the source code from them.
GNU says "The GNU GPL can be used for general data which is not software, as long as one can determine what the definition of “source code” refers to in the particular case. As it turns out, the DSL (see below) also requires that you determine what the “source code” is, using approximately the same definition that the GPL uses."
and offers these categories, for example:
https://www.gnu.org/licenses/license-list.en.html#NonFreeSof...
* Software Licenses
  * GPL-Compatible Free Software Licenses
  * GPL-Incompatible Free Software Licenses
* Licenses For Documentation
  * Free Documentation Licenses
* Licenses for Other Works
  * Licenses for Works of Practical Use besides Software and Documentation
  * Licenses for Fonts
  * Licenses for Works stating a Viewpoint (e.g., Opinion or Testimony)
  * Licenses for Designs for Physical Objects
No one on the planet understands how the model weights work exactly, nor can they modify them specifically (i.e. hand modifying the weights to get the result they want). This is an impossible standard.
The source code is open (sorta, it does have some restrictions). The weights are open. The training data is closed.
Which is my point. These models aren't open source because there is no source code to open. Maybe one day we will have strong enough interpretability to generate source from these models, and then we could have open source models. But today it's not possible, and changing the meaning of open source such that it is possible probably isn't a great idea.
https://opensource.org/license (linking to OSI for the list because it's convenient, not because they get to decide)
In English, proper nouns are capitalized.
"Open" and "source" are both very normal English words. English speakers have "the right" to use them according to their own perspective and with personal context. It's the difference between referring to a blue tooth, and Bluetooth, or to an apple store or an Apple store.
Who died and made OSI God?
Is free, but it's not open source
Can you really say this model will still be useful in 2 years, 5 years for you? And that FB's stance on these models will still be open source at that time once they incrementally make improvements? Maybe, maybe not. But FB doesn't give anything away for free, and the fact that you think so is your blindness, not mine. In case you haven't figured it out, this isn't a technology problem, this is a "FB needs marketshare and it needs it fast" problem.
Is it, though? They are literally giving this away "for free". https://dev.to/llm_explorer/llama3-license-explained-2915 Unless you build a service with it that has over 700 million monthly users (read: "problem anyone would love to have"), you do not have to re-negotiate a license agreement with them. Beyond that, it can't "phone home" or do any other sorts of nefarious shite. The other limitations there, which you can plainly read, seem not very restrictive.
Is there a magic secret clause conspiracy buried within the license agreement that you believe will be magically pulled out at the worst possible moment? >..<
Sometimes, good things happen. Sorry you're "too blinded" by past hurt experience to see that, I guess
We've seen people try to deceptively describe non-OSS projects as open source, and no doubt we will continue to see it. Thankfully the community (including Hacker News) is quick to call it out, and to insist on not cheapening the term.
This is one of the topics that just keeps turning up:
* https://news.ycombinator.com/item?id=24483168
Speak for yourself, please. The term is much older than 1998, with one easily-Googled example being https://www.cia.gov/readingroom/docs/DOC_0000639879.pdf , and an explicit case of IT-related usage being https://i.imgur.com/Nw4is6s.png from https://www.google.com/books/edition/InfoWarCon/09X3Ove9uKgC... .
Unless a registered trademark is involved (spoiler: it's not) no one, whether part of a so-called "community" or not, has any authority to gatekeep or dictate the terms under which a generic phrase like "open source" can be used.
Recently, companies are trying to market things as open source when in reality, they fail to adhere to the definition.
I think we should not let these companies change the meaning of the term, which means it's important to explain every time they try to seem more open than they are.
I'm afraid the battle is being lost though.
It was defined and accepted by the community well before OSI came around though.
Why do you think these private entities are willing to invest the massive capital it takes to keep the frontier advancing at that rate?
> I do want to limit the amount we reward NVIDIA for calling the shots correctly to maximize the benefit to society
Why wouldn't NVIDIA be a solid steward of that capital given their track record?
Because whether they make 100x or 200x they make a shitload of money.
> Why wouldn't NVIDIA be a solid steward of that capital given their track record?
The problem isn't who is the steward of the capital. The problem is that the economically efficient thing to do for a single company is (given sufficient fab capacity, and a monopoly) to raise prices to extract a greater share of the pie at the expense of shrinking the size of the pie. I'm not worried about who takes the profit, I'm worried about the size of the pie.
It's not a certainty that they 'make a shitload of money'. Reducing the right tail payoffs absolutely reduces the capital allocated to solve problems - many of which are risky bets.
Your solution absolutely decreases capital investment at the margin, this is indisputable and basic economics. Even worse when the taking is not due to some pre-existing law, so companies have to deal with the additional uncertainty of whether & when future people will decide in retrospect that they got too large a payoff and arbitrarily decide to take it from them.
Past performance is not indicative of future results.
Lol it's not "monopolies" limiting fab capacity. Existing fab companies can barely manage to stand-up a new fab in different cities. Fabs are impossibly complex and beyond risky to fund.
It's the kind of thing you'd put government money into making, but it's so risky that governments really don't want to spend billions and fail, so they give existing companies billions so if they fail it's not the government's fault.
Under your idea, we’ll try a badly broken economic philosophy again. And while we’re at it, we will completely stifle investment in innovation.
Please read through their "acceptable use" policy before you decide whether this is really in line with open source.
I'm not taking a specific position on this license. I haven't read it closely. My broad point is simply that open source AI, as a term, cannot practically require the training data be made available.
How come releasing an LLM trained on that data is not illegal then? I think it should be.
For an LLM, that’s not the training data. That’s the model itself. You don’t make changes to an LLM by going back to the training data and making changes to it, then re-running the training. You update the model itself with more training data.
You can’t even use the training code and original training data to reproduce the existing model. A lot of it is non-deterministic, so you’ll get different results each time anyway.
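One well-known contributor to this kind of non-determinism (an assumption about what is meant here; there are others, such as data shuffling and random initialization) is that floating-point addition is not associative. A parallel reduction that sums the same numbers in a different order can give a slightly different result, and those differences compound over many training steps. A minimal illustration:

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)
print(right_to_left)
print("identical:", left_to_right == right_to_left)
```

The two sums differ only in the last bits, but on hardware where the order of accumulation is not fixed from run to run, that is enough to make bit-for-bit reproduction of a long training run impractical.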
Another complication is that the object code for normal software is a clear derivative work of the source code. It’s a direct translation from one form to another. This isn’t the case with LLMs and their training data. The models learn from it, but they aren’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data. It learns from it, it isn’t a copy of it. This is mostly the reason why distributing training data is infeasible – the model’s creator may not have the license to do so.
Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.
I think new terminology is needed for open AI models. We can’t simply re-use what works for human-editable code because it’s a fundamentally different type of thing with different technical and legal constraints.
Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.
Synthetic part of the training data could be released.
I disagree with the purists - if you can legally change the source or weights - even without having access to the data used by the upstream authors - it's open enough for me. YMMV.
There is a massive difference between a compiled binary that you are allowed to do anything you want with, including modifying it, building something else on top or even pulling parts of it out and using in something else, and a SaaS offering where you can't modify the software at all. But that doesn't make the compiled binary open source.
To really be intellectually curious we need to be open to the idea that there is not (yet) a solution to this problem. Or in the analogy you laid out, that it is simply not possible for the system to be "open source".
Note that most of the licenses listed under the "Licenses for Other Works" section say "It is incompatible with the GNU GPL. Please don't use it for software or documentation, since it is incompatible with the GNU GPL and with the GNU FDL." This is because these are not free software/open source licenses. They are licenses that the FSF endorses because they encourage openness and copyleft in non-software mediums, and play nicely with the GPL when used appropriately (i.e. not for software).
The GPL is appropriate for many works that we wouldn't conventionally view as software, but in those contexts the analogy is usually so close to the literal nature of software that it stops being an analogy. The major difference is public perception. For example, we don't generally view jpegs as software. However, jpegs, at their heart, are executable binaries with very domain specific instructions that are executed in a very much non-Turing complete context. The source code for the jpeg is the XCF or similar (if it exists) which contains a specification (code) for building the binary. The code becomes human readable once loaded into an IDE, such as GIMP, designed to display and interact with the specification. This is code that is most easily interacted with using a visual IDE, but that doesn't change the fact that it is code.
There are some scenarios where you could identify a "source code" but not a "software". For example, a cake can be open sourced by releasing the recipe. In such a context, though, there is literally source code. It's just that the code never produces a binary, and is compiled by a human and kitchen instead of a computer. There is open source hardware, where the source code is a human readable hardware specification which can be easily modified, and the hardware is compiled by a human or machine using that specification.
The scenario where someone has bred a specific plant, however, can not be open source, unless they have also deobfuscated the genome, released the genome publicly, and there is also some feasible way to convert the deobfuscated genome, or a modification of it, into a seed.
They are an intellectual property company holding the rights on plans to make graphics cards, not even a company actually making graphics cards.
The government could launch an initiative "OpenGPU" or "OpenAI Accelerator", where the government orders GPUs from TSMC directly, without the middleman.
It may require some tweaking in the law to allow exception to intellectual property for "public interest".
Reflexively, I count that harm as a feature. I don't like private capital markets because I've been screwed by private capital on multiple occasions.
But you are right: I don't understand how these actions would harm. So please do expand your concerns.
Here’s a more important point: how far would the open source people have gotten without GCC and glibc?
Much less far than they will ever admit, in my experience.
> Like I said, honest open source advocates won’t take issue to how I framed their position.
Yet you've failed to provide even a single point of evidence to back up your claim.
> "honest open source advocates"
You've literally just made this term up. It's meaningless.
I’ve met honest open source advocates before and, once again, they would be unlikely to refute the fact that “open source” was invented in explicit contrast to “free software” to achieve corporate palatability.
The comment you are responding to was literally responding to a comment which validated this exact sentiment.
As to providing evidence, those of us who were there at the time don’t need any and those of you who weren’t ought to seek some. It’s not my job to link to the nearly infinite number of conversations where this obvious dynamic played out.
You can also modify a binary, but that doesn't mean that binaries are open source.
"That's what Anthropic have done with the Claude family of models [1]. ... Techniques for introspection of weights are very primitive, but i do think new techniques will be developed"
Yeah, I don't think what we have now is robust enough interpretability to be capable of generating something comparable to "source code", but I would like to see us get there at some point. It might sound crazy, but a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy.
I think getting to open sourcable models is probably pretty important for producing models that actually do what we want them to do, and as these models become more powerful and integrated into our lives and production processes the inability to make them do what we actually want them to do may become increasingly dangerous. Muddling the meaning of open source today to market your product, then, can have troubling downstream effects, as focus in the open source community may be taken away from interpretability and placed on distributing and tuning public weights.
If you don't have a way to replicate what they did to create the model, it seems more like freeware than open source.
This should also make everyone very skeptical of any claim they are making, from benchmark results to the legalities involved in their training process to the prospect of future progress on these models. Without being able to vet their results against the same datasets they're using, there is no way to verify what they're saying, and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me
As a developer, if you have a working Llama model, including the source code and weights, and it's crucial for something you're building or have already built, it's still fundamentally a good thing that Meta isn't gating it behind an API and if they went away tomorrow, you could still use, self-host, retrain, and study the models
A) Release the data, and if it ends up causing a privacy scandal, at least you can actually call it open this time.
B) Neuter the dataset, and the model
All I ever see in these threads is a lot of whining and no viable alternative solutions (I’m fine with the idea of it being a hard problem, but when I see this attitude from “researchers” it makes me less optimistic about the future)
> and the credulity that otherwise smart people have been exhibiting in this space has been baffling to me
Remove the “otherwise” and you’re halfway to understanding your error.
Isn't that a bit like arguing that a Linux kernel driver isn't open source if I just give you a bunch of GPL-licensed source code that speaks to my device, but no documentation on how my device works? If you take away the source code you have no way to recreate it. But so far that never caused anyone to call the code not open-source. The closest is the whole GPL3 Tivoization debate and that was very divisive.
The heart of the issue is that open source is kind of hard to define for anything that isn't software. As a proxy we could look at Stallman's free software definition. Free software shares a common history with open source, and most open source software is free/libre, and the other way around, so this might be a useful proxy.
So checking the four software freedoms:
- The freedom to run the program as you wish, for any purpose: For most purposes. There's that 700M user restriction, also Meta forbids breaking the law and requires you to follow their acceptable use policy.
- The freedom to study how the program works, and change it so it does your computing as you wish: yes. You can change it by fine tuning it, and the weights allow you to figure out how it works. At least as well as anyone knows how any large neural network works, but it's not like Meta is keeping something from you here
- The freedom to redistribute copies so you can help your neighbor: Allowed, no real asterisks
- The freedom to distribute copies of your modified versions to others: Yes
So is it Free Software™? Not really, but it is pretty close.
What would you have them do instead? Specifically?
If you can't fork it and take the project in your own direction, it's not open source.
Forgive me, I am AI naive, is there some way to harness Llama to train one's own actually-open AI?
private capital is absolutely the driving force for the vast majority of innovations since the beginning of the 20th century. public capital may be involved, but it is dwarfed by private capital markets.
Private nuclear research is heavily dependent on governmental contracts to function. Solar was subsidized to heck and back for years. Public investment does work, and does make a difference.
I would even say governmental involvement is sometimes even the deciding factor, to determine if research is worth pursuing. Some major capital investors have decided AI models cannot possibly gain enough money to pay for their training costs. So what do we do when we believe something is a net good for society, but isn’t going to be profitable?
Of course I agree I'm going to stop marginal investments from occurring into research into patent-able technologies by reducing the expected profit. But I'm going to do so very slightly because I'm not shifting the expected value by very much. Meanwhile I'm going to greatly increase the investment into the existing technology we already have, and allow many more people to try to improve upon it, and I'm going to argue the benefits greatly outweigh the costs.
Whether I'm right or wrong about the net benefit, the basic economics here is that there are both costs and benefits to my proposed action.
And yes I'm going to marginally reduce future investments because the same might happen in the future and that reduces expected value. In fact if I was in charge the same would happen in the future. And the trade-off I get for this is that society gets the benefit of the same actually happening in the future and us not being hamstrung by unbreachable monopolies.
I think you're shifting it by a lot. If the government can post-hoc decide to invalidate patents because the holder is getting too successful, you are introducing a substantial impact on expectations and uncertainty. Your action is not taken in a vacuum.
> Meanwhile I'm going to greatly increase the investment into the existing technology we already have, and allow many more people to try to improve upon it, and I'm going to argue the benefits greatly outweigh the costs.
I think this is a much more speculative impact. Why will people even fund the improvements if the government might just decide they've gotten too large a slice of the pie later on down the road?
> the trade-off I get for this is that society gets the benefit of the same actually happening in the future and us not being hamstrung by unbreachable monopolies.
No the trade-off is that materially less is produced. These incentive effects are not small. Take for instance, drug price controls - a similar post-facto taking because we feel that the profits from R&D are too high. Introducing proposed price controls leads to hundreds of fewer drugs over the next decade [0] - and likely millions of premature deaths downstream of these incentive effects. And that's with a policy with a clear path towards short-term upside (cheaper drug prices). Discounted GPUs by invalidating nvidia's patents has a much more tenuous upside and clear downside.
[0]: https://bpb-us-w2.wpmucdn.com/voices.uchicago.edu/dist/d/312...
You're massively increasing uncertainty.
> the same would happen in the future. And the trade-off I get for this is that society gets the benefit
Why would you expect it would ever happen again? What you want is an unrealized capital gains tax. Not to nuke our semiconductor industry.
Your claim that removing a profit motivation will increase investment is flat out wrong. Everything else crumbles from there.
Who? It's not their data.
It depends on the binary and the license the binary is released under. If the binary is released to the public domain, for example, you are free to make whatever modifications you wish. And there are plenty of licenses like this, that allow closed source software to be used as the user wishes. That doesn't make it open source.
Likewise, there are plenty of closed source projects whose binaries we can poke and prod with much higher understanding of what our changes are actually doing than we're able to get when we poke and prod LLMs. If you want to make a Pokemon Red/Blue or Minecraft mod you have a lot of tools at your disposal.
A project that only exists as a binary which the copyright holder has relinquished rights to, or has released under some similar permissive closed source license, but people have poked around enough to figure out how to modify certain parts of the binary with some degree of predictability is a more apt analogy. Especially if the original author has lost the source code, as there is no source code to speak of when discussing these models.
I would not call that binary "open source", because the source would, in fact, not be open.
Yes.
You can change it however you like, then look at the paper [1] under section 3.2. to know which hyperparameters were used during training and finetune the model to work with your new tokenizer using e.g. FineWeb [2] dataset.
You'll need to do only a fraction of the training you would have needed to do if you were to start a training from scratch for your tokenizer of choice. The weights released by Meta give you a massive head start and cost saving.
The fact that it's not trivial to do and out of reach of most consumers is not a matter of openness. That's just how ML is today.
[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...
Just like open source?
> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.
The entire point of having the pre-trained weight released is to *not* have to do this. You just need to finetune, which can be done with very little data, depending on the task, and many open source toolkits, that work with those weights, exist to make this trivial.
I can do all sorts of things by “fine tuning” Excel with formulas, but I certainly don’t have the source for Excel.
When AI research was still mostly academic, I'm sure a lot of people still cheated, but there was somewhat less incentive to, and norms like publishing datasets made it easier to verify claims made in research papers. In a world where people don't, and there's significant financial incentive to lie, I just kind of assume they're lying
The community has the authority to complain about companies mis-labelling their pork products as vegan, even if nobody has a registered trademark on the term vegan. Would you tell people to shut up about that case because they don't have a registered trademark? Likewise, the community has authority to complain about Meta/Facebook mis-labelling code as open source even when they put restrictions on usage. It's not gate-keeping or dictatorship to complain about being misled or being lied to.
I especially like how I'm the one telling people to "shut up" all of a sudden.
As for the rest, see my other reply.
There is 0 doubt that you are better off finetuning that model to use your tokenizer than training from scratch. So what Meta gives you for free massively helps you build your model, that's OSS to me.
I do think there are some serious IP issues, as IP rules can be hijacked in the US, but that means you fix those problems, not blow up IP that was rightfully earned
They are leaders in solar and EVs.
Remember how Japan leapfrogged the western car industry, and six sigma became required reading for managers in every industry?
Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.
Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.
Meta et al would love for the choice to be between, on one hand, open weights only, and, on the other hand, open training data, because the latter is impractical. That dichotomy guarantees that when someone says open source AI they'll mean open weights. (The way open source software, today, generally means source available, not FOSS.)
Here's the source of the disagreement. You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding.
Other person is saying it doesn't matter how convenient it is or how much Meta wants to use it, that the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use.
This would be like Adobe giving Photoshop away for free, but for personal use only and not for making ads for Adobe's competitors. Sure, Adobe likes it and most users may be fine with it, but it isn't open source.
>The way open source software, today, generally means source available, not FOSS.
I don't agree with that. When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core".
I'm actually not a fan of Meta's definition. I'm arguing specifically against an unrealistic definition, because for practical purposes that cedes the term to Meta.
> the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use
Agree. I think the focus should be on the use restrictions.
> When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core"
This isn't consistently applied. It's why we have the free vs open vs FOSS fracture.
Right, so the onus is on Facebook/Meta to get that right, then they could call something Open Source. Until then, find another name that doesn't already have a specific meaning.
> (The way open source software, today, generally means source available, not FOSS.)
No, but it's going that way. Open Source, today, still means that the things you need to build a project are publicly available for you to download and run on your own machine, granted you have the means to do so. What you're thinking of is literally called "Source Available" which is very different from "Open Source".
The intent of Open Source is for people to be able to reproduce the work themselves, with modifications if they want to. Is that something you can do today with the various Llama models? No, because one core part of the project's "source code" (what you need to reproduce it from scratch), the training data, is being held back and kept private.
you are playing very loosely with terms that have specific, widely accepted definitions (e.g. https://opensource.org/osd )
I don't get why you think it would be useful to call LLMs with published weights "open source"
OSI's definition is far from the only one [1]. Switzerland is currently implementing CH Open's definition, the EU another one, et cetera.
> I don't get why you think it would be useful to call LLMs with published weights "open source"
I don't. I'm saying that if the choice is between open weights or open weights + open training data, open weights will win because the useful definition will outcompete the pristine one in a public context.
[1] https://en.wikipedia.org/wiki/Open-source_software#Definitio...
Realistically, nobody outside of Hacker News commenters has ever cared about the OSD. It's just not how the term is used colloquially.
If the company wants to help research, it should full-throatedly endorse the position that it doesn't consider it a violation of privacy to train on the data it does, and release it so that it can be useful for research. If the company thinks it's safeguarding user privacy, it shouldn't be training models on data it considers private and then using them in public-facing ways at all
As it stands, Facebook seems to take the position that it wants to help the development of software built on models like Llama, but not really the fundamental research that goes into building those models in the same way
Thousands of entities would scramble to sue Facebook over any released dataset no matter what the privacy implications of the dataset are.
It's just not worth it in any world. I believe you are not thinking of this problem from the view of the PM or VPs that would actually have to approve this: if I were a VP and I was 99% confident that the dataset had no privacy implications, I still wouldn't release it. Just not worth the inevitable long, drawn out lawsuits from people and regulators trying to get their pound of flesh.
I feel the world is too hostile to big tech and AI to enable something like this. So, unless we want to kill AGI development in the cradle, this is what we get - and we can thank modern populist techno-pessimism for cultivating this environment.
There's no AGI development in the cradle. And the world isn't "hostile". The world is increasingly tired of predatory behavior by supranational corporations
Lmao what? If the world were sane and hostile to big tech, we would've nuked them all years ago for all the bullshit they pulled and continue to pull. Big tech has politicians in their pockets, but thankfully the "populist techno-pessimist" (read: normal people who are sick of billionaires exploiting the entire planet) are finally starting to turn their opinions, albeit slowly.
If we lived in a sane world Cambridge Analytica would've been the death knell of Facebook and all of the people involved with it. But we instead live in a world where psychopathic pieces of shit like Zucc get away with it, because they can just buy off any politician who knocks on their doors.
The ire people have toward tech companies right now is, like most ire, perhaps in places overreaching. But it is mostly justified by the real actions of tech companies, and facebook has done more to deserve it than most. The thought process you just described sounds like an accurate prediction of the mindset and culture of a VP within Facebook, and I'd like you to reflect on it for a sec. Basically, you rightly point out that the org releasing what data they have would likely invite lawsuits, and then you proceeded to do some kind of insane offscreen mental gymnastics that allow this reality to mean nothing to you but that the unwashed masses irrationally hate the company for some unknowable reason
Like you're talking about a company that has spent the last decade buying competitors to maintain an insane amount of control over billions of users' access to their friends, feeding them an increasingly degraded and invasive channel of information that also from time to time runs nonconsensual social experiments on them, and following even people who didn't opt in around the internet through shady analytics plugins in order to sell dossiers of information on them to whoever will pay. What do you think it is? Are people just jealous of their success, or might they have some legit grievances that may cause them to distrust and maybe even loathe such an entity? It is hard for me to believe Facebook has a dataset large enough to train a current-gen LLM that wouldn't also feel, viscerally, to many, like a privacy violation. Whether any party that felt this way could actually win a lawsuit is questionable though, as the US doesn't really have significant privacy laws, and this is partially due to extensive collaboration with, and lobbying by, Facebook and other tech companies who do mass-surveillance of this kind
I remember a movie called Das Leben der Anderen (2006) (Officially translated as "the lives of others") which got accolades for how it could make people who hadn't experienced it feel how unsettling the surveillance state of East Germany was, and now your average American is more comprehensively surveilled than the Stasi could have imagined, and this is in large part due to companies like facebook
Frankly, I'm not an AGI doomer, but if the capabilities of near-future AI systems are even in the vague ballpark of the (fairly unfounded) claims the American tech monopolies make about them, it would be an unprecedented disaster on a global scale if those companies got there first, so inasmuch as we view "AGI research" as something that's inevitably going to hit milestones in corporate labs with secretive datasets, I think we should absolutely kill it to whatever degree is possible, and that's as someone who truly, deeply believes that AI research has been beneficial to humanity and could continue to become moreso
We can't prove that a model like llama will never produce a segment of its training data set verbatim.
Any potential privacy scandal is already in motion.
My cynical assumption is that Meta knows that competitors like OpenAI have PR-bombs in their trained model and therefore would never opensource the weights.
This whole 'Open Source' thing is a bigger pet peeve than it should be, because I've received criticism for using the term on a page where I literally just posted a .zip file full of source code. The smart thing to do would have been to ignore and forget the criticism, which I will now work harder at doing.
In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'. It's a standard English-language word that according to Merriam-Webster goes back to 1944. So that would amount to an open-and-shut case of false advertising, which I don't think applies here at all.
I don't see the difference. Open source software is a term of art with a specific meaning accepted by its community. When people misuse the term, invariably in such a way as to broaden it to include whatever it is they're pushing, it's right that the community responds harshly.
This kind of argument is literally why trademark law exists. OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.
and (strong personal opinion) any software developer should have a firm grip on the terminology and details for legal reasons
There is a large span of people between gray beard programmer and lay person, and many in that span have some concept of open-source. It's often used synonymously with visible source, free software, or in this case, open weights.
It seems unfortunate - though expected - that over half of the comments in this thread are debating the OSD for the umpteenth time instead of discussing the actual model release or accompanying news posts. Meanwhile communities like /r/LocalLlama are going hog wild with this release and already seeing what it can do.
> any software developer should have a firm grip on the terminology and details for legal reasons
They'd simply need to review the terms of the license to see if it fits their usage. It doesn't really matter if the license satisfies the OSD or not.
For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here?
I'm guessing that all these definitions have at least some points in common, which involves (another guess) at least being able to produce the output artifacts/binaries by yourself, something that you cannot do with Llama, just as an example.
Was on the HN front page earlier [1][2]. The definition comes strikingly close to source on request with no use restrictions.
> all these definitions have at least some points in common
Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.
[1] https://news.ycombinator.com/item?id=41047172
[2] https://joinup.ec.europa.eu/collection/open-source-observato...
Agreed, but are we splitting hairs here and is it relevant to the claim made earlier?
> (The way open source software, today, generally means source available, not FOSS.)
Do any of these principles or definitions from these orgs agree/disagree with that?
My hypothesis is that they generally would go against that belief and instead argue that open source is different from source available. But I haven't looked specifically to confirm if that's true or not, just a guess.
Agreed. There is no trademark on aileron or carburetor or context-free grammar. A couple of years ago I made this same point myself. [0]
> A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.
This taxonomy doesn't hold up.
Again, it's a term of art with a clear meaning accepted by its community. We've seen numerous instances of cynical and deceptive misuse of the term, which the community rightly calls out because it's not fair play, it's deliberate deception.
> This kind of argument is literally why trademark law exists
It is not. Trademark law exists to protect brands, not to clarify terminology.
You seem to be contradicting your earlier point that terms of art do not require licenses.
> OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.
I haven't expressed any opinion on that topic, and I don't see a need to.
Come up with a new term and trademark that, and heck, I'll help you out with a legal fund donation when Facebook and friends inevitably try to appropriate it. Apart from that, you've fought the good fight and done what you could. Let it go.
I don't think so. Take the Swiss definition. Source on request, not even available. Yet being branded and accepted as open source.
(To be clear, the Swiss example favours FOSS. But it also permits source on request and bundles them together under the same label.)
Don't understand what big tech does for humanity and how much they rely on it in the day to day. Literally all of their modern conveniences are enabled by big tech.
In my experience many 'normal people' understand far more than you deign credit, many are able to forgo modern 'conveniences' if pressed.
I think you have too much faith in the average person. They scarcely understand how nearly everything in their life has been manufactured on or designed on something powered by big tech.
Zuck knows this very well and it does him no honour to speak like this; from his position, it amounts to an attempt to change the present semantics of open source. Of course, others do that too - using the notion of open source to describe something very far from open.
What Meta is doing under his command can better be described as releasing the resulting...build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.
What's more ridiculous is that precisely because the result is not the source in its whole form, these graphical structures can be made available. Only thanks to the fact it is not traceable to the source, which makes the whole game not only closed, but like... sealed forever. An unfair retell of humanity's knowledge tossed around in a very obscure container that nobody can reverse engineer.
how's that even remotely similar to open source?
If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.
With LLMs, weights are the binary code: it's how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. They are the source code.
There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.
Using open data and dclm: https://github.com/mlfoundations/dclm
Edit typo.
Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.
LLMs are fundamentally different to software and using terms from software just muddies the waters.
Linux sources :: dataset that goes into training
Linux sources' build confs and scripts :: training code + hyperparameters
GCC :: Python + PyTorch or whatever they use in training
Compiled Linux kernel binary :: model weights
LLMs are not software any more than photographs are.
They're still software, they just don't have source code (yet).
If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.
In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.
It's the computation that is costly.
This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.
In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source"
To be "open source" we would want LLMs where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.
Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.
The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copy left licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.
They only give you a blob of data you can run.
DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.
It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source, you need to be able to recreate and modify the codebase but that has nothing to do with the training data or the trained model.
Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.
I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.
"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."
> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available
The M in LLM is for "Model".
The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify the inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).
Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.
> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.
We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.
The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is specifically with regards to OSS. I'm not aware of a solution to the interpretability problem; even if the model is shared we can't understand what's in it.
Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.
So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.
I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there- We could require or encourage smaller creators to release their training data too and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.
Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide that and treat it as an open secret. We should either change the law, or apply it equally.
But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.
I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)
What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.
I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use that as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reasons.
The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the Linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?
The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.
Weights aren't really a binary in the sense of what a compiler produces: they contain no instructions and are more just a bunch of floating point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or 3D model.
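To make that concrete, here is a toy sketch (not how any real model file is laid out; the four-float "weights file" and both interpreter functions are made up for illustration). The same bytes give different results depending on which code reads them, which is roughly the sense in which weights are data rather than a program:

```python
import struct

# Hypothetical "weights file": four little-endian float32 values.
# There are no instructions in here, only numbers.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)


def load_weights(blob):
    """Decode a blob of little-endian float32 values into a flat list."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


def as_row_major_2x2(w, x):
    """Treat w as a 2x2 matrix stored row by row and multiply it by x."""
    return [w[0] * x[0] + w[1] * x[1], w[2] * x[0] + w[3] * x[1]]


def as_column_major_2x2(w, x):
    """Treat the same values as a 2x2 matrix stored column by column."""
    return [w[0] * x[0] + w[2] * x[1], w[1] * x[0] + w[3] * x[1]]


weights = load_weights(raw)
x = [1.0, 1.0]
row_result = as_row_major_2x2(weights, x)     # [3.0, 7.0]
col_result = as_column_major_2x2(weights, x)  # [4.0, 6.0]
```

Neither reading is "in" the file; the meaning comes entirely from the separate code that loads and applies the values.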
IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.
Training data isn't source code at all; it's content fed into the ingestion side to train a model. As long as the source for ingesting and training a model is available, which it sounds like isn't the case for Meta, that would be open source as best I understand it.
Said a little differently, I would need to be able to review all code used to generate a model and all code used to query the model for it to be OSS. I don't need Meta's training data or their actual model at all, I can train my own with code that I can fully audit and modify if I choose to.
Actually, you can reverse engineer an executable into something that could be compiled back into an executable with the exact same functionality, which is AFAIK impossible to do with "open weights". Still, we don't call free executables "open source".
And yes, like other executables, they are not literal black boxes. Rather, they provide machine readable specifications which are not human readable without immense effort.
For an LLM to be open source there would need to be source code. Source code, btw, is not just a procedure that can be handed to a machine to produce code that can be executed by the machine. That means the training data and code are not sufficient (or necessary) for an open source model.
What we need for an open source model is a human readable specification of the model's functionality and data structures which allows the user to modify specific arbitrary functionality or structure, and can be used to produce an executable (the model weights).
We simply need much stronger interpretability for that to be possible.
Videos aren't software and neither are LLMs.
If a video somehow does have source code which can "generate it", then the question of what it means for the source code to the video to be open even if the only program which can read it and generate the video is closed source is equivalent to asking if a program written in Visual Basic can ever be open source given that the Visual Basic compiler is closed source. Personally, I can see arguments either way on this issue, though most people seem to agree that the program is still open source in such a situation.
However, we need not care too much about the answer to that specific conundrum, as the moral equivalents of both the compiler and the runtime virtual machine are almost always open source. What is then important is much easier: if you don't provide the source code to the project, even if the compiler is open source and even if it runs on an open source machine, clearly the project -- whatever it is that we might try to be discussing, including video files -- cannot be open source. The idea that a video can be open source when what you mean is that the video is unencrypted and redistributable but was merely intended to be played in an open source video player is absurd.
If you're given the source material and project files to continue editing where the original editors finished, and you're granted the rights to re-distribute - Yes, that would be open source[1].
Much like we have "open source hardware" where the "source" consists of original schematics, PCB layouts, BOM, etc. [2]
If a video lacks a specification file (the source code) which can be used by a human reader to modify specific features in the video, then it is software that is simply incapable of being open sourced.