Could you train a ChatGPT-beating model for $85k and run it in a browser? (simonwillison.net)

430 points by sirteno 3 years ago | 170 comments

whalesalad 3 years ago |

Are there any training/ownership models like Folding@Home? People could donate idle GPU resources in exchange for access to the data, and perhaps ownership. Then instead of someone needing to pony up $85k to train a model, a thousand people can train a fraction of the model on their consumer GPU and pool the results, reap the collective rewards.

dekhn 3 years ago | |

A few people have built frameworks to do this.

There is still a very large open problem in how to federate large numbers of loosely coupled computers to speed up training "interesting" models. I've worked in both domains (protein folding via Folding@Home/protein folding using supercomputers, and ML training on single nodes/ML training on supercomputers) and at least so far, ML hasn't really been a good match for embarrassingly parallel compute. Even in protein folding, folding@home has a number of limitations that are much better addressed on supercomputers (for example: if your problem requires making extremely long individual simulations of large proteins).

All that could change, but I think for the time being, interesting/big models need to be trained on tightly coupled GPUs.

itissid 3 years ago | | |

And you can rule out most of the Monte Carlo stuff too. That rules out parallelizing modern statistical frameworks like Stan, which are used for explainable models; things like financial risk modeling, which samples posteriors using MCMC, also can't be parallelized.

whalesalad 3 years ago | | |

Probably going to mirror the transition from single-threaded to multi-threaded compute. It took a while until application architectures that utilize multi-core took hold of the populace.

mirekrusin 3 years ago | |

Unfortunately training is not an embarrassingly parallelisable [0] problem. It would require a new architecture. Current models diverge too fast. By the time you'd download and/or calculate your contribution, the model would have descended somewhere else and your delta would not be applicable, because it was based on the wrong initial state.

It would be great if merge-ability would exist. It would also likely apply to efficient/optimal shrinking for models.

Maybe you could dispatch tasks to train on many variations of similar tasks and take the average of the results? It could probably help in some way, but you'd still have a large serialized pipeline to munch through and you'd likely require some serious hardware, e.g. dual RTX 4090s, on the client side.

[0] https://en.wikipedia.org/wiki/Embarrassingly_parallel

amitport 3 years ago | | |

hmmm... seems like you're reinventing distributed learning.

merge-ability does exist and you can average the results.

spyder 3 years ago | |

Learning@Home using Decentralized Mixture-of-Expert models:

https://learning-at-home.github.io/

https://training-transformers-together.github.io/

https://arxiv.org/abs/2002.04013

ftxbro 3 years ago | |

Yes there is petals/bloom https://github.com/bigscience-workshop/petals but it's not so great. Maybe it will improve or a better one will come.

riedel 3 years ago | | |

I read that it is only scoring the model collaboratively but it allows some fine-tuning I guess.

Getting the actual gradient descent to parallelize is more difficult because one needs to average the gradient when using data/batch parallelism. It becomes more a network speed than GPU speed problem. Or are LLMs somehow different?
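A minimal sketch of the gradient averaging described above, using a toy 1-D linear model in plain Python. The data, the two-way shard split, and the learning rate are made up for illustration; real frameworks do the averaging with an all-reduce over the network, which is where the bandwidth cost comes from.

```python
# Toy data-parallel training: each "worker" computes a gradient on its own
# shard, the gradients are averaged, and one shared weight is updated.

def grad_mse(w, shard):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated from y = 2x
shards = [data[:2], data[2:]]  # two equal-sized workers (hypothetical split)

w = 0.0
lr = 0.05
for step in range(50):
    local_grads = [grad_mse(w, s) for s in shards]  # would run in parallel
    avg_grad = sum(local_grads) / len(local_grads)  # stands in for all-reduce
    w -= lr * avg_grad

print(w)
```

With equal-sized shards the averaged gradient is the same as the full-batch gradient, so the result matches single-machine training. The catch is that every step has to exchange a gradient as large as the model itself.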

whalesalad 3 years ago | | |

Really interesting live monitor of the network: http://health.petals.ml

polishdude20 3 years ago | | |

I wonder how they handle illegal content. Like, if you're running training data on your computer, what's to stop someone else's data that is illegal, from being uploaded to your computer as part of training?

ellisv 3 years ago | |

That’d be cool but I don’t think most idle consumer GPUs (6-8GB) would have large enough memory for a single iteration (batch size 1) of modern LLMs.

But I’d love to see more federated/distributed learning platforms.

mirekrusin 3 years ago | | |

6GB can store 3 billion parameters at 16 bits each; GPT-3 has 175 billion parameters.
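The arithmetic behind that, as a rough sketch. It counts only the weights; training also needs room for gradients, optimizer state and activations, so the real requirement is higher. The function names are just for illustration.

```python
def weights_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def max_params(mem_gb, bytes_per_param):
    """How many parameters fit in mem_gb at the given precision."""
    return mem_gb * 1e9 / bytes_per_param

print(max_params(6, 2) / 1e9)   # billions of 16-bit params that fit in 6 GB
print(weights_gb(175e9, 2))     # GB for 175B params at 16 bits
print(weights_gb(7e9, 0.5))     # GB for 7B params at 4 bits
```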

whalesalad 3 years ago | | |

Is it possible to break the model apart? Or does the entire thing need to be architected from the get-go such that an individual GPU can own a portion end to end?

semitones 3 years ago | |

The main reason an arbitrarily distributed set of compute nodes cannot give you good performance for training a model (even if you have an immodest number of nodes) is that the latency of the inter-node communication will be a massive bottleneck. GPU cloud providers shell out big bucks for ultra-fast intra-DC networking via InfiniBand and the like, and the networking gets as much attention as (sometimes more than) the capabilities of the nodes themselves.
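A back-of-envelope sketch of why the network dominates. It looks only at bandwidth (latency makes things worse still) and assumes a naive scheme where each step ships one full 16-bit copy of the gradients. The link speeds are illustrative assumptions, not measurements from any provider.

```python
def sync_seconds(n_params, bytes_per_param, link_gbit_per_s):
    """Seconds to move one full gradient copy over a link.

    Ignores latency, compression, and overlap with compute.
    """
    bits = n_params * bytes_per_param * 8
    return bits / (link_gbit_per_s * 1e9)

n = 7e9  # a 7B-parameter model

home = sync_seconds(n, 2, 0.1)  # assumed 100 Mbit/s home uplink
dc = sync_seconds(n, 2, 400)    # assumed 400 Gbit/s datacenter link

print(f"home uplink: {home:.0f} s per sync")
print(f"datacenter:  {dc:.2f} s per sync")
```

Under these assumptions a single synchronization takes minutes over a home connection and well under a second inside a datacenter, and training needs a very large number of such steps.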

neoromantique 3 years ago | |

How long until somebody creates a crypto project on that?

buildbuildbuild 3 years ago | | |

Bittensor is one, not an endorsement. chat.bittensor.com

_trampeltier 3 years ago | |

Start a Boinc project.

https://boinc.berkeley.edu/projects.php

peter303 3 years ago | |

Every parameter needs to reach every other parameter. Ideally you have enough core memory for that. But there are tiling algorithms.

cleanchit 3 years ago | |

This is how you get skynet.

ftxbro 3 years ago |

His estimate is that you could train a LLaMA-7B scale model for around $82,432 and then fine-tune it for a total of less than $85K. But when I saw the fine tuned LLaMA-like models they were worse in my opinion even than GPT-3. They were like GPT-2.5 or like that. Not nearly as good as ChatGPT 3.5 and certainly not ChatGPT-beating. Of course, far enough in the future you could certainly run one in the browser for $85K or much less, like even $1 if you go far enough into the future.

captainmuon 3 years ago |

I guess companies like OpenAI and Google have no incentives to make models use less resources. The compute required, and of course also their training data, is their moat.

If you accept that your model knows less about the world - it doesn't have to know about every restaurant in Mexico City or the biography of every soccer player around the world - then you can get away with far fewer parameters and much less training data. Then you can't query it like an oracle about random things anymore, but you shouldn't do that anyway. But it should still be able to do tasks like reformulating texts, judging similarity (by embedding distance), and so on.

And TFA mentions it also, you could hook up your simple language model with something like ReAct to get really good results. I don't see it running in the browser, but if you had a license-wise clean model that you can run on premises on one or two GPUs, that would be huge for a lot of people!
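For the "judging similarity by embedding distance" part, a minimal sketch using cosine similarity. The 3-dimensional vectors are hypothetical stand-ins; a real model would produce embeddings with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
emb = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(cosine_similarity(emb["cat"], emb["kitten"]))
print(cosine_similarity(emb["cat"], emb["invoice"]))
```

The first score comes out much higher than the second, which is the whole trick: no world knowledge is queried, only distances between vectors.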

lxe 3 years ago |

Keep in mind that image transformer models like stable diffusion are generally smaller than language models, so they are easier to fit in wasm space.

Also, you can finetune llama-7b on a 3090 for about $3 using LoRA.

bitL 3 years ago | |

Only for images. People want to generate videos next and those models will be likely GPT-sized.

Metus 3 years ago | | |

There is a video model making the rounds on /r/stablediffusion and it is just a tiny bit larger than Stable Diffusion.

danielbln 3 years ago | |

Generative image models don't use transformers, they're diffusion models. LLMs are transformers.

GaggiX 3 years ago | | |

Diffusion models can use a transformer architecture, example: DiT. Stable Diffusion is using a U-Net architecture with transformer blocks.

lxe 3 years ago | | |

Ah yes that's right. Well they technically do use a visual transformer for CLIP text encoder as I understand.

JasonZ2 3 years ago |

Does anyone know how the results from a 7B parameter model with bloomz.cpp (https://github.com/NouamaneTazi/bloomz.cpp) compare to the 7B parameter Alpaca model with llama.cpp (https://github.com/ggerganov/llama.cpp)?

I have the latter working on an M1 MacBook Air with very good results for what it is. Curious if bloomz.cpp is significantly better or just about the same.

captaincrowbar 3 years ago |

The big problem with AI R&D is that nobody can keep up with the big bux companies. It makes this kind of project a bit pointless. Even if you can run a GPT3-equivalent on a web browser, how many people are going to bother (except as a stunt) when GPT4 is available?

adeon 3 years ago | |

The ones that can't use GPT4 for whatever reason. Maybe you are a company and you don't want to send OpenAI your prompts. Or a person who has very private prompts and feels sketchy about sending them over.

Or maybe you are an individual who has a use case that's too edgy for OpenAI or a silicon valley corporate image. When Replika shut down people trying to have virtual boyfriend/girlfriends on their platform, their reddit filled up with people who mourned like they just lost a partner.

I think it's important that alternative non-big bux company options exist, even if most people don't want to or need to use them.

moffkalast 3 years ago | | |

Or maybe you're in Italy and OpenAI had just been banned from the country for not adhering to GDPR. I suspect the rest of the EU may follow soon.

psychphysic 3 years ago | | |

Those are seriously niche use cases. They exist but can they fund gpt5 level development?

simonw 3 years ago | |

An increasingly common complaint I'm hearing about GPT3/4/etc is people who don't want to pass any of their private data to another company.

Running models locally is by far the most promising solution for that concern.

dangond 3 years ago | |

Cost is a big reason. It doesn't matter how good the top-of-the-line models are if the cheaper ones suit your needs. Commoditization is great that way. I'd absolutely use an open source GPT-4 in my browser over a pricy closed GPT-5 once we get to that point.

version_five 3 years ago |

If you have ~100k to spend, aren't there options to buy a gpu rather than just blow it all on cloud? How much is an 8xA100 machine?

4xA100 is 75k, 8 is 140k https://shop.lambdalabs.com/deep-learning/servers/hyperplane...

munk-a 3 years ago |

A wonderful thing about software development is that there is so much reserved space for creativity that we have huge gaps between costs and value. Whether the average person could do this for 85k I'm uncertain of - but there is a very significant slice of people that could do it for well under 85k now that the ground work has been done. This leads to the hilarious paradox where a software based business worth millions could be built on top of code valued around 60k to write.

nico 3 years ago | |

> This leads to the hilarious paradox where a software based business worth millions could be built on top of code valued around 60k to write.

Or the fact that software based businesses just took a massive hit in value overnight and cannot possibly defend such high valuations anymore.

The value of companies is quickly going to shift from tech moats to brands.

Think CocaCola - anyone can create a drink that tastes as good or better than coke, but it's incredibly hard to compete with the CocaCola brand.

Now think what would have happened if CocaCola had been super expensive to make, and all of a sudden, in a matter of weeks, it became incredibly cheap.

This is what happened to the saltpeter industry in 1909 when synthetic saltpeter was invented. The whole industry was extinct in a few years.

prerok 3 years ago | |

Nit: not to write but to run. The cost of development is not considered in these calculations.

thih9 3 years ago |

> as opposed to OpenAI’s continuing practice of not revealing the sources of their training data.

Looks like that choice makes it more difficult to adopt, trust, or collaborate on the new tech.

What are the benefits? Is there more to that than competitive advantage? If not, ClosedAI sounds more accurate.

Tryk 3 years ago |

Why doesn't someone just start a gofundme/kickstarter with the goal of funding the training of an open-source ChatGPT-capable model?

cj 3 years ago | |

Create a clone of OpenAI that pledges to remain open and remain not-for-profit.

That could do really well via crowd funding with the right spin/marketing behind it.

gessha 3 years ago | | |

And when everyone buys in, you go private everything and reap the benefits. Brilliant!

GartzenDeHaes 3 years ago |

It's interesting to me that LLaMA-nB's still produce reasonable results after 4-bit quantization of the 32-bit weights. Does this indicate some possibility of reducing the compute required for training?
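For a feel of what 4-bit quantization does to weights, here is a simplified sketch: symmetric absmax rounding with a single scale for the whole list. This is an assumption-laden toy, not the scheme any particular library uses; real implementations typically quantize small blocks of weights with a scale per block. The sample weights are invented.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-7, 7] using one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.05, 0.33, -0.88]  # made-up example values
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(max_err)
```

Each weight moves by at most half a quantization step, which hints at why inference can tolerate it. Whether the same tolerance carries over to training is the open question the comment raises; this sketch does not answer it.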

lmeyerov 3 years ago |

It seems the quality goes up & cost goes down significantly with Colossal AI's recent push: https://medium.com/@yangyou_berkeley/colossalchat-an-open-so...

Their writeup makes it sound like, net, 2X+ over Alpaca, and that's an early run.

The browser side is interesting too. Browser JS VMs have a memory cap of 1GB, so that may ultimately be the bottleneck here...

lmeyerov 3 years ago | |

Interesting, since I looked last year, Chrome has started raising the caps internally on buffer allocation to potentially 16GB: https://chromium.googlesource.com/chromium/src/+/2bf3e35d7a4...

Last time I tried on a few engines, it was just 1-2GB for typed arrays, which are essentially the backing structure for this kind of work. Be interesting to try again..

For our product, we actually want to dump 10GB+ on to the WebGL side, which may or may not get mirrored on the CPU side. Not sure if additional limits there on the software side. And after that, consumer devices often have another 10GB+ CPU RAM free, which we'd also like to use for our more limited non-GPU stuff :)

jesse__ 3 years ago | |

I thought the memory limit (in V8 at least) was 2GB due to the GC not wanting to pass 64 bit pointers around, and using the high bit of a 32-bit offset for .. something I now forget ..?

Do you have a source showing a JS runtime with a 1GB limit?

jesse__ 3 years ago | | |

UPDATE: After a nominal amount of googling around it appears valid sizes have increased on 64-bit systems to a maximum of 8GB, and stayed at 2GB on 32-bit systems, for FF at least. I guess it's probably 'implementation defined'

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

SebJansen 3 years ago | |

does the 1gb limit extend to wasm?

make3 3 years ago |

Alpaca uses knowledge distillation (it's trained on outputs from OpenAI models). It's something to keep in mind. You're teaching your model to copy another model's outputs.

thewataccount 3 years ago | |

> You're teaching your model to copy another model's outputs.

Which itself was trained on human outputs to do the same thing.

Very soon it will be full Ouroboros as humans use the model's output to finetune themselves.

visarga 3 years ago | |

> You're teaching your model to copy another model's outputs.

That's a time honoured tradition in ML, invented by the father of the field himself, Geoffrey Hinton, in 2015.

> Distilling the Knowledge in a Neural Network

https://arxiv.org/abs/1503.02531

brrrrrm 3 years ago |

The WebGPU demo mentioned in this post is insane. Blows any WASM approach out of the water. Unfortunately that performance is not supported anywhere but chrome canary (behind a flag)

raphlinus 3 years ago | |

This will be changing soon. I believe Chrome M113 is scheduled to ship to stable on May 2, and will support WebGPU 1.0. I agree it's a game-changing technology.

fzliu 3 years ago |

I was a bit skeptical about loading a _4GB_ model at first. Then I double-checked: Firefox is using about 5GB of memory for me. My current open tabs are mail, calendar, a couple Google Docs, two Arxiv papers, two blog posts, two Youtube videos, milvus.io documentation, and chat.openai.com.

A lot of applications and developers these days take memory management for granted, so embedding a 4GB model to significantly enhance coding and writing capabilities doesn't seem too far-fetched.

astlouis44 3 years ago |

WebGPU is going to be a major component in this. Modern GPUs, prevalent in mobile devices, desktops and laptops, are more than enough to do all of this client side.

agnokapathetic 3 years ago |

> My friends at Replicate told me that a simple rule of thumb for A100 cloud costs is $1/hour.

AWS charges $32/hr for an 8xA100 instance (p4d.24xlarge), which comes out to $4/hour/GPU. Yes, you can get lower pricing with a 3 year reservation, but that's not what this question is asking.

You also need 256 nodes to be colocated on the same fabric -- which AWS will do for you but only if you reserve for years.

thewataccount 3 years ago | |

AWS certainly isn't the cheapest for this; did they mention using AWS? Lambda Labs is $12/hr for 8xA100s, and there are others relatively close to this price on demand. I assume you can get a better deal if you contact them for a large project.

Replicate themselves rent out GPU time so I assume they would definitely know as that's almost certainly the core of their business.

sebzim4500 3 years ago | |

Maybe they are using spot instances? $1/hr is about right for those.

celestialcheese 3 years ago | |

lambdalabs will let you do on-demand 8xa100 @ 80GB VRAM/GPU for $12/hr, or reserved @ $10.86/hr

8xA100 @ 40gb for $8/hr

Replicate friend isn't far off.
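To compare the quotes in this subthread on equal terms, a small sketch that converts each 8-GPU node price to a per-GPU rate and multiplies by the training budget. The 82,432 GPU-hour figure is what the article's $82,432 estimate implies at $1 per A100-hour; the prices are the ones quoted by commenters here, not current list prices.

```python
GPU_HOURS = 82_432  # implied by $82,432 at $1 per A100-hour

# Dollars per hour for an 8xA100 node, as quoted in the thread.
node_rates = {
    "AWS p4d.24xlarge on-demand": 32.00,
    "Lambda 8xA100 80GB on-demand": 12.00,
    "Lambda 8xA100 80GB reserved": 10.86,
    "Lambda 8xA100 40GB": 8.00,
}

def training_cost(node_rate, gpus_per_node=8, gpu_hours=GPU_HOURS):
    """Total cost if every GPU-hour is billed at the node's per-GPU rate."""
    return node_rate / gpus_per_node * gpu_hours

costs = {name: training_cost(rate) for name, rate in node_rates.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
```

This ignores availability, cluster size requirements and interconnect, which other comments point out can matter as much as the hourly rate.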

pavelstoev 3 years ago | |

Depending on the model, you can train on lesser (cheaper) GPUs, but system-level optimizations are needed. Which is what we provide at centml.ai

IanCal 3 years ago | |

Lambda labs charges about 11-12/hr for 8xA100.

robmsmt 3 years ago | | |

and is completely at capacity

d4rkp4ttern 3 years ago |

Everyone seems to assume that all the “tricks” behind training ChatGPT are known. The only clues are in papers from ClosedAI like the InstructGPT paper. So we assume there is Supervised Fine Tuning, then Reward Modeling and finally RLHF.

But there are most likely other tricks that ClosedAI has not published. These probably took years of R&D to come up with, others trying to replicate ChatGPT would need to come up with these tricks on their own.

Also curiously the app was released in late 2022 while the knowledge cutoff is 2021 — I was curious why that might be, and one hypothesis I had was that it may have been because they wanted to keep the training data fixed while they iterated on numerous methods, hyperparameter tuning etc. All of these are unfortunately a defensive moat that ClosedAI has.

pavelstoev 3 years ago |

Training a ChatGPT-beating model for much less than $85,000 is entirely feasible. At CentML, we're actively working on model training and inference optimization without affecting accuracy, which can help reduce costs and make such ambitious projects realistic. By maximizing (>90%) GPU and platform hardware utilization, we aim to bring down the expenses associated with large-scale models, making them more accessible for various applications. Additionally, our solutions also have a positive environmental impact, addressing the excess CO2 concerns. If you're interested in learning more about how we are doing it, please reach out via our website: https://centml.ai

nwoli 3 years ago |

What we need is a RETRO style model where basically after the input you go through a small net that just fetches a desired set of weights from a server (serving data without compute is dirt cheap) and is then executed locally. We’ll get there eventually

tinco 3 years ago | |

Can anyone explain or link some resource on why these big GPT models all don't incorporate any RETRO style? I'm only very superficially following ML developments and I was so hyped by RETRO and then none of the modern world changing models apply it.

nwoli 3 years ago | | |

OpenAI might very well be using that internally; who knows how they implement things. Also Emad retweeted a RETRO-related thing a bit back, so they might very well be using that for their awaited LM. Here’s hoping.

breck 3 years ago |

Just want to say SimonW has become one of my favorite writers covering the AI revolution. Always fun thought experiments with linked code and very constructive for people thinking about how to make this stuff more accessible to the masses.

skybrian 3 years ago |

I wonder why anyone would want to run it in a browser, other than to show it could be done? It's not like the extra latency would matter, since these things are slow.

Running it on a server you control makes more sense. You can pick appropriate hardware for running the AI. Then access it from any browser you like, including from your phone, and switch devices whenever you like. It won't use up all the CPU/GPU on a portable device and run down your battery.

If you want to run the server at home, maybe use something like Tailscale?

simonw 3 years ago | |

The browser thing is definitely more for show than anything else - I used it to help demonstrate quite how surprisingly lightweight these models can be.

jedberg 3 years ago |

With the explosion of LLMs and people figuring out ways to train/use them relatively cheaply, unique data sets will become that much more valuable, and will be the key differentiator between LLMs.

Interestingly, it seems like companies that run chat programs where they can read the chats are best suited to building "human conversation" LLMs, but someone who manages large text datasets for others is in the perfect place to "win" the LLM battle.

fswd 3 years ago |

There is somebody finetuning 160m rwkv4 on alpaca on the rwkv discord. I am out of the office and can't link, but the person posted in the prompt showcase channel.

buzzier 3 years ago | |

RWKV-v4 Web Demo (169m/430m params) https://josephrocca.github.io/rwkv-v4-web/demo/

nope96 3 years ago |

I remember watching one of the final episodes of Connections 3: With James Burke, and he casually said we'd have personal assistants that we could talk to (in our PDAs). That was 1997 and I knew enough about computers to think he was being overly optimistic about the speed of progress. Not in our lifetimes. Guess I was wrong!

alecco 3 years ago |

Interesting blog but the extrapolations are way overblown. I tried one of the 30bn models and it's not even remotely close to GPT-3.

Don't get me wrong, this is very interesting and I hope more is done in the open models. But let's not over-hype by 10x.

gessha 3 years ago |

We need a DAWNBench* benchmark for training ChatGPT the fastest and cheapest.

* https://dawn.cs.stanford.edu/benchmark/

ushakov 3 years ago |

Now imagine loading 3.9 GB each time you want to interact with a webpage

KMnO4 3 years ago | |

Yeah, I’ve used Jira.

neilellis 3 years ago | | |

:-)

sroussey 3 years ago | |

10yrs from now models will be in the OS. Maybe even in silicon. No downloads required.

pessimizer 3 years ago | | |

Not in mine. I don't even want redhat's bullshit in there. I'm not installing some black box into my OS that was programmed with motives that can't be extracted from the model at rest.

swader999 3 years ago | | |

The OS will be in the cloud interfacing into our brain by then. I don't want this btw.

cavisne 3 years ago |

There is a minimum cluster size to get good utilization of the GPUs. $1 an hour per chip might get you one A100 but it won’t get you hundreds clustered together.

ChumpGPT 3 years ago |

I'm not so smart and I don't understand a lot about ChatGPT, etc, but could there be a client side app like Folding@home that would allow millions of people to give processing power to train an LLM?

v4dok 3 years ago |

Can someone at the EU, the only player in this thing with no strategy yet, just pool together enough resources so the open-source people can train models? We don't ask much, just give compute power.

0xfaded 3 years ago | |

No, that could risk public money benefitting a private party.

Feel free to form a multinational consortium and submit a grant application to one of our distribution partners under the Horizon program though.

Now, how do you plan to create jobs and reduce CO2?

PeterisP 3 years ago | |

Yes, there are a bunch of government-funded supercomputers or clusters which can be obtained for public research needs (based on an evaluation of which projects are likely to bring the most benefit), and are used, among other things, to train large language models. E.g. some interesting Swedish models got trained on https://www.nsc.liu.se/systems/berzelius/ .

TMWNN 3 years ago |

Hey, that means it can be turned into an Electron app!

ultrablack 3 years ago |

If you could, you should have done it 6 months ago.

munk-a 3 years ago | |

I mean - is there a developer alive that'd be unable to write the nascent version of Twitter? I think that Twitter as a business exists entirely because of the concept - the code to cover the core functionality is absolutely trivial to replicate.

I don't think this is a very helpful statement because actually finding the idea on what to build is the hard part - or even just believing it's possible. The company I work at has been using NLP for years now and we have a model that's great at what we do... but if you asked if we could develop that into a chatbot as functional as chatgpt two years ago you'd probably be met with some pretty heavy skepticism.

Cloning something that has been proven possible is always easier than taking the risk building the first version with no real grasp of feasibility.

rspoerri 3 years ago |

So cool it runs on a browser /sarcasm/ i might not even need a computer. Or internet when we are at it.

It either runs locally or it runs on the cloud. Data could come from both locations as well. So it's mostly technically irrelevant if it's displaying in a browser or not.

Except when it comes to usability. I don't get why people love software running in a browser. I often close important tools i have not saved when they're in a browser. I can't have offline tools which work if i am in a tunnel (living in Switzerland this is an issue). Or it's incompatible because i am running LibreWolf.

/sorry to be nitpicking on this topic ;-)
