Could you train a ChatGPT-beating model for $85k and run it in a browser? (simonwillison.net)
Don't get me wrong, this is very interesting and I hope more is done in the open models. But let's not over-hype by 10x.
Feel free to form a multinational consortium and submit a grant application to one of our distribution partners under the Horizon program though.
Now, how do you plan to create jobs and reduce CO2?
I don't think this is a very helpful statement because actually finding the idea on what to build is the hard part - or even just believing it's possible. The company I work at has been using NLP for years now and we have a model that's great at what we do... but if you asked if we could develop that into a chatbot as functional as chatgpt two years ago you'd probably be met with some pretty heavy skepticism.
Cloning something that has been proven possible is always easier than taking the risk building the first version with no real grasp of feasibility.
It either runs locally or it runs on the cloud. Data could come from both locations as well. So it's mostly technically irrelevant if it's displaying in a browser or not.
Except when it comes to usability. I don't get why people love software running in a browser. I often close important tools I haven't saved when they're in a browser. I can't have offline tools that work when I'm in a tunnel (living in Switzerland, this is an issue). Or it's incompatible because I'm running LibreWolf.
/sorry to be nitpicking on this topic ;-)
If you read the article, part of the argument was for the sandboxing that the browser provides.
"Obviously if you’re going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we’ve ever created."
I don't know exactly how browser sandboxing works. But isn't its purpose to prevent access to the local system, while mostly leaving access to the internet open?
Is that really a good way to limit an AI system's API access?
1 - Everyone already has a web browser, so there's no software to download (or the software is automatically downloaded, installed and run, if you want to look at it that way... either way, the experience is a lot easier and more seamless for the user)
2 - The website owner has control of the software, so they can update it and manage user access as they like, and it's easier to track users and usage that way
3 - There are a ton of web developers out there, so it's easier to find people to work on your app
4 - You ostensibly don't need to rewrite your app for every OS, but may need to modify it for every supported browser
1 - Not everyone has or wants fast access to the internet all the time.
2 - I try to prevent access of most of the apps to the internet. I don't want companies to access my data or even metadata of my usage.
3 - sure, but it doesn't make it better for the user.
4 - Also supporting different screen sizes and interaction types (touch or mouse) can be a big part of the work.
The most important part for a user is if he/she is only using the app rarely or once. Not having to install it will make the difference between using it or not. However with the app stores most OS's feature today this can change pretty soon and be equally simple.
I might be old school on this, but I resent subscription-based apps. For applications that do not need to change, deliver no additional service, or aren't absolutely vital for me, I will never subscribe. And browser-based apps are at the core of this unfortunate development. But that's gone very far from the original topic :-)
https://developer.mozilla.org/en-US/docs/Web/Security/Same-o...
There is still a very large open problem in how to federate large numbers of loosely coupled computers to speed up training "interesting" models. I've worked in both domains (protein folding via Folding@Home/protein folding using supercomputers, and ML training on single nodes/ML training on supercomputers) and at least so far, ML hasn't really been a good match for embarrassingly parallel compute. Even in protein folding, folding@home has a number of limitations that are much better addressed on supercomputers (for example: if your problem requires making extremely long individual simulations of large proteins).
All that could change, but I think for the time being, interesting/big models need to be trained on tightly coupled GPUs.
It would be great if merge-ability would exist. It would also likely apply to efficient/optimal shrinking for models.
Maybe you could dispatch tasks to train on many variations of similar tasks and take the average of the results? It could probably help in some way, but you'd still have a large serialized pipeline to munch through, and you'd likely require some serious hardware on the client side, e.g. dual RTX 4090s.
merge-ability does exist and you can average the results.
https://learning-at-home.github.io/
Getting the actual gradient descent to parallelize is more difficult because one needs to average the gradient when using data/batch parallelism. It becomes more a network speed than GPU speed problem. Or are LLMs somehow different?
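To make the averaging step concrete, here is a minimal NumPy sketch of data-parallel gradient descent on a toy linear-regression problem. The worker count, shard sizes, and function names are illustrative assumptions, not taken from the linked project. It shows that averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient, which is why every single step needs a round of communication the size of the model.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

# Hypothetical setup: 4 workers, each holding an equal-sized shard of the batch.
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
worker_grads = [mse_gradient(w, Xs, ys) for Xs, ys in shards]

# The "all-reduce" step: each worker contributes a gradient as large as the model,
# and nobody can start the next step until the average is known.
averaged = np.mean(worker_grads, axis=0)
full_batch = mse_gradient(w, X, y)

print(np.allclose(averaged, full_batch))
```

With a 3-parameter model the exchange is trivial; with billions of parameters the same exchange happens on every step, so link speed rather than GPU speed sets the pace.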
But I’d love to see more federated/distributed learning platforms.
If you accept that your model knows less about the world - it doesn't have to know about every restaurant in Mexico City or the biography of every soccer player around the world - then you can get away with far fewer parameters and much less training data. Then you can't query it like an oracle about random things anymore, but you shouldn't do that anyway. But it should still be able to do tasks like reformulating texts, judging similarity (by embedding distance), and so on.
And TFA mentions it also, you could hook up your simple language model with something like ReAct to get really good results. I don't see it running in the browser, but if you had a license-wise clean model that you can run on premises on one or two GPUs, that would be huge for a lot of people!
Also, you can finetune LLaMA-7B on a 3090 for about $3 using LoRA.
I have the latter working on a M1 Macbook Air with very good results for what it is. Curious if bloomz.cpp is significantly better or just about the same.
Or maybe you are an individual who has a use case that's too edgy for OpenAI or a silicon valley corporate image. When Replika shut down people trying to have virtual boyfriend/girlfriends on their platform, their reddit filled up with people who mourned like they just lost a partner.
I think it's important that alternative non-big bux company options exist, even if most people don't want to or need to use them.
Running models locally is by far the most promising solution for that concern.
4xA100 is 75k, 8 is 140k https://shop.lambdalabs.com/deep-learning/servers/hyperplane...
Or the fact that software based businesses just took a massive hit in value overnight and cannot possibly defend such high valuations anymore.
The value of companies is quickly going to shift from tech moats to brands.
Think CocaCola - anyone can create a drink that tastes as good or better than coke, but it's incredibly hard to compete with the CocaCola brand.
Now think what would have happened if CocaCola had been super expensive to make, and all of a sudden, in a matter of weeks, it became incredibly cheap.
This is what happened to the saltpeter industry in 1909 when synthetic saltpeter was invented. The whole industry was extinct in a few years.
Looks like that choice makes it more difficult to adopt, trust, or collaborate on the new tech.
What are the benefits? Is there more to that than competitive advantage? If not, ClosedAI sounds more accurate.
Their writeup makes it sound like, net, 2X+ over Alpaca, and that's an early run.
The browser side is interesting too. Browser JS VMs have a memory cap of 1GB, so that may ultimately be the bottleneck here...
Last time I tried on a few engines, it was just 1-2GB for typed arrays, which are essentially the backing structure for this kind of work. Be interesting to try again..
For our product, we actually want to dump 10GB+ on to the WebGL side, which may or may not get mirrored on the CPU side. Not sure if additional limits there on the software side. And after that, consumer devices often have another 10GB+ CPU RAM free, which we'd also like to use for our more limited non-GPU stuff :)
Do you have a source showing a JS runtime with a 1GB limit?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
Which itself was trained on human outputs to do the same thing.
Very soon it will be full Ouroboros as humans use the model's output to finetune themselves.
That's a time honoured tradition in ML, invented by the father of the field himself, Geoffrey Hinton, in 2015.
> Distilling the Knowledge in a Neural Network
A lot of applications and developers these days take memory management for granted, so embedding a 4GB model to significantly enhance coding and writing capabilities doesn't seem too far-fetched.
AWS charges $32/hr for an 8xA100 instance (p4d.24xlarge), which comes out to $4/hour/GPU. Yes, you can get lower pricing with a 3-year reservation, but that's not what this question is asking.
You also need 256 nodes to be colocated on the same fabric -- which AWS will do for you but only if you reserve for years.
Replicate themselves rent out GPU time so I assume they would definitely know as that's almost certainly the core of their business.
8xA100 @ 40gb for $8/hr
Replicate friend isn't far off.
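The per-GPU arithmetic behind the two quotes in this subthread, as a small Python sketch. The prices are the ones quoted in the comments above and have not been independently checked; the function name is made up for illustration.

```python
def per_gpu_hourly(node_price_per_hour: float, gpus_per_node: int) -> float:
    """Hourly price of one GPU when the node is billed as a whole."""
    return node_price_per_hour / gpus_per_node

aws_p4d = per_gpu_hourly(32.0, 8)       # quoted AWS on-demand price for 8xA100
cheaper_quote = per_gpu_hourly(8.0, 8)  # quoted "8xA100 @ 40gb for $8/hr"

# How many times more expensive the AWS quote is per GPU-hour.
ratio = aws_p4d / cheaper_quote
print(aws_p4d, cheaper_quote, ratio)
```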
But there are most likely other tricks that ClosedAI has not published. These probably took years of R&D to come up with, others trying to replicate ChatGPT would need to come up with these tricks on their own.
Also curiously the app was released in late 2022 while the knowledge cutoff is 2021 — I was curious why that might be, and one hypothesis I had was that it may have been because they wanted to keep the training data fixed while they iterated on numerous methods, hyperparameter tuning etc. All of these are unfortunately a defensive moat that ClosedAI has.
Running it on a server you control makes more sense. You can pick appropriate hardware for running the AI. Then access it from any browser you like, including from your phone, and switch devices whenever you like. It won't use up all the CPU/GPU on a portable device and run down your battery.
If you want to run the server at home, maybe use something like Tailscale?
Interestingly, it seems like companies that run chat programs where they can read the chats are best suited to building "human conversation" LLMs, but those who manage large text datasets for others are in the perfect place to "win" the LLM battle.
My biggest problem: I haven't managed to get a great summarization out of a LLaMA derivative that runs on my laptop yet. Maybe I haven't tried the right model or the right prompt yet though, but that feels essential to me for a bunch of different applications.
I still think a LLaMA/Alpaca fine-tuned for the ReAct pattern that can execute additional tools would be a VERY interesting thing to explore.
[ ReAct: https://til.simonwillison.net/llms/python-react-pattern ]
Even davinci can be used as part of a chain, because you can direct it to structure and unstructure data, then extract the individual components and build them into tasks. Cohere, LLaMA, et al. currently struggle to produce these results reliably, even if you can chat with them, and frankly it's not about the chat.
Example from a Stack Overflow question, where the questions were split out before being sent down the chain so that each point could be answered individually:
This is a customer question:
I'm a beginner RoR programmer who's planning to deploy my app using Heroku. Word from my other advisor friends says that Heroku is really easy, good to use. The only problem is that I still have no idea what Heroku does...
I've looked at their website and in a nutshell, what Heroku does is help with scaling but... why does that even matter? How does Heroku help with:
Speed - My research implied that deploying AWS on the US East Coast would be the fastest if I am targeting a US/Asia-based audience.
Security - How secure are they?
Scaling - How does it actually work?
Cost efficiency - There's something like a dyno that makes it easy to scale.
How do they fare against their competitors? For example, Engine Yard and bluebox?
Please use layman English terms to explain... I'm a beginner programmer.
Extract the scenario from the question including a summary of every detail, list every question, in JSON:
{ "scenario": "A beginner RoR programmer is planning to deploy their app using Heroku and is seeking advice about deploying it.", "questions": [ "What does Heroku do?", "How does deploying AWS on the US East Coast help with speed?", "How secure is Heroku?", "How does scaling with Heroku work?", "What is a dyno and why is it cost efficient?", "How does Heroku compare to its competitors, such as Engine Yard and Bluebox?" ] }
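The next link in such a chain could look roughly like the Python sketch below: parse the JSON the model returned, then build one follow-up prompt per extracted question. The prompt template and the `build_followup_prompts` helper are hypothetical; the actual model call is deliberately left out.

```python
import json

# The model output from the example above, verbatim.
model_output = """
{ "scenario": "A beginner RoR programmer is planning to deploy their app using Heroku and is seeking advice about deploying it.", "questions": [ "What does Heroku do?", "How does deploying AWS on the US East Coast help with speed?", "How secure is Heroku?", "How does scaling with Heroku work?", "What is a dyno and why is it cost efficient?", "How does Heroku compare to its competitors, such as Engine Yard and Bluebox?" ] }
"""

def build_followup_prompts(parsed: dict) -> list[str]:
    """One prompt per question, each carrying the shared scenario as context."""
    return [
        f"Scenario: {parsed['scenario']}\nQuestion: {question}\nAnswer:"
        for question in parsed["questions"]
    ]

# json.loads raises an error if the model did not produce valid JSON,
# which is exactly the reliability problem described above.
parsed = json.loads(model_output)
prompts = build_followup_prompts(parsed)

for prompt in prompts:
    print(prompt, end="\n\n")
```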
I want the best LLMs to be open source too, but I'm not delusional enough to make insane claims like the hundreds of GitHub forks out there.
How do you do this without being incredibly wealthy?
Hypothesis 1: With better logical thinking (an API call away!), I bet you could train a GPT based on a “small” initial dataset. Why shouldn’t multilingual wikipedia/wiktionary and libgen be enough? That’s what, like less than 10% of the OpenAI training? /s
Hypothesis 2: Data sets of philosophical dialogues could help efficiently develop AI reasoning skills.
Socratic thinking in Plato and Xenophon represented a powerful new mode of critical thinking. Maybe some Student-Teacher-Student template of dialogue could be powerful in developing useful datasets for AI training.
What is the utility of different AI reflective loops for generating training data? (References appreciated if you know any) One possibility to test is a chain of Analyze, Evaluate and Apply loops, applied over and over? “analyze the above piece of text, then evaluate it, then apply to everyday life.”
Now, on HN, many have expressed concern that GPT trained on GPT-GPT conversations is going to result in very misaligned models. Like a copy machine degradation, do we want training data from the AI being trained on the AI? But, on the other hand, it is possible that supporting reflective thought is a good idea in AI (we generally value reflective thought) or a bad idea (maybe the reflection will somehow turn it evil, or at least misaligned).
Design Question: how might we create useful training data through a process of structuring AI-AI dialogue?
“Student-Teacher-Student” conversations seem like they could be good as a useful mode of dialogue. Previously, I’ve finetuned GPT with the complete works of Plato and I was able to generate interesting new dialogues. But the question is whether new dialogues could produce useful data. Perhaps I could use GPT4 to read a part of Plato and then try to autocomplete another part of Plato. Or, as above, use a piece of Platonic dialogue as a target, then use an Analyze, Evaluate, Apply chain on it. We could use methods like these over and over again to make a large dataset about philosophical reasoning. We could have human ratings of the reasonableness of the dialogue output.
If a Socratic structure of thinking could read the complete works of Plato over and over again, commenting, countering and synthesizing— with human oversight (RLHF), perhaps we could develop a small module for philosophical reasoning. It might still need millions of conversations, though. But, perhaps by reflecting philosophically by itself, it could produce a sufficiently large dataset that enabled a sophisticated small model with very open resources.
And, you’d still need the human preference training RLHF to get it to interact well—and I think it also needs some world model.
In any case, I think making smaller and smaller models is a good idea, it sounds fun.
TL;DR
1. AI training has philosophically interesting implications
2. Philosophical reasoning is valuable to develop in AI
3. Good philosophical reasoning might be a key benchmark for small models. These models don’t need to know everything but perhaps they could learn what they don’t know.
4. Reading a lot of Plato over and over could be a great way to train GPT that it doesn’t know a lot.
5. What kind of AI-AI dialogues might produce training data that is useful for training small models?
“1.) While large datasets and models aim for general capability, smaller systems can target specific skills like philosophical reasoning in depth. Testing models on nuanced logic, conceptual analysis and ethics could benchmark their progress, especially if combined with broader knowledge. But these abilities alone won't achieve real-world alignment - we must also instill human values and practical wisdom.
2.) Repeatedly exposing models to philosophical texts like Plato's dialogues could improve their reasoning if guided and reviewed by researchers. Look for both progress and problems in how they interpret, discuss, and extend ideas. Analyses can inform how best to structure philosophical training for alignment by providing evidence of what does/does not work. But reading alone won't necessarily lead models to become safe, ethical or beneficial - significant oversight and feedback are required.
3.) Carefully-designed AI-AI dialogues could generate data on models' reasoning skills, especially for limited systems. Have them debate complex issues, challenge or build on each other's thinking, consider analogies and counterfactuals. Review conversations to check for undesirable or biased beliefs, as well as areas of progress. Look for principles of how to scaffold productive discussions that facilitate improvement and value alignment. But without close oversight of these interactions, they risk amplifying errors or other problematic behaviors.
Opportunities:
• Study how philosophical knowledge and skills develop in smaller models through approaches like text analysis, structured debates, and conceptual evaluations. Monitor for progress as well as issues, using insights to inform values-aligned training.
• Explore ways to balance philosophical and empirical learning. Test how models apply theoretical reasoning to real-world situations, analyzing any errors from imbalance. Look for crossover effects and how to facilitate mutually-beneficial learning across knowledge domains.
• Research transparent and auditable techniques for instilling human values in limited AI systems. Develop methods to verify what values models have internalized and how they apply them in context. Consider how approaches may differ for narrow prototypes vs. more advanced agents.
• Propose and discuss evidence-based techniques for "Constitutional AI" - systematically developing safe and ethical systems. Consider policies around data, objectives, abilities, oversight, and shuttering models if needed for alignment. Debate how to make this a collaborative, multidisciplinary process even for open models.
• Study philosophical issues of machine mind, knowledge, goals and responsibility as we build reasoning models. How should we think about and ensure their ethical development? What are our obligations as researchers, and how can we meet them?
There is significant opportunity for research on aligning "small, open models" that focuses on reasoning and philosophy. But this requires acknowledgement of limitations and commitment to oversight, responsibility, and guarding against problems - not assuming abilities will necessarily lead to safe, ethical or beneficial behavior on their own.
By analyzing how knowledge and skills develop, exploring balanced and transparent methods, considering our own assumptions and obligations as model builders, and emphasizing "safe failure", we can make progress toward systems that autonomously apply nuanced logic in service of human priorities. The potential is there, but so is the work required - we must choose to take it on and see it through at each step. With rigorous reflection and review combined with pragmatic experimentation, philosophy and AI can be mutually informative. But only if we actively build in human wisdom and values along the way. The key opportunities are there for researchers willing to have the deeper discussions and make the harder choices - both to achieve the goal and ensure we are shaping it rightly. Small, open models focused on reasoning are a promising path, but one that requires care, responsibility and oversight to follow productively. Progress is possible, but dependent on our commitment as guides. If done responsibly, these systems could yield many benefits - but we must step up to meet the challenge, not assume it will be solved for us. The work is ours to do. Let's take it on.“
Ignoring the operational costs of on-prem hardware is pretty common, but those costs are significant and can greatly change the calculation.
Sure, if you're planning to service a large number of users, building your infrastructure in-house might be a bit overkill, as you'll need an infrastructure team to service it as well.
If you just want to buy 4 GPUs to put in one server to run some training yourself, I don't think it's that much overkill. Especially considering you can recover much of the cost even after a year by selling much of the equipment you bought. Most of your losses will be costs for electricity and internet connection.
Cloud pricing is pretty steep and obviously has a fat profit margin but building your own data centers isn't cheap either. Doing this at scale is not something most companies would be very good at either. Which means it probably is quite a bit more expensive relative to what the big cloud providers are doing.
One is buying capital that produces models, the other is buying a single model.
In my experience, physical hardware has a management overhead over cloud resources. Backups, large disk storage for big models, etc.
https://www.reddit.com/r/StableDiffusion/comments/126xsxu/ni...
Too bad SD learned the Shutterstock watermark so well, lol
You’re only out of luck if each iteration is too compute intense to fit on one worker node, even if each iteration might be embarrassingly parallelizable, since the overhead of having to aggregate computations across workers at every iteration would be too high.
If you have similar variants of the same task you can accelerate it more where the diff is.
You can't average past results computed from historic base weights - it's a linear process.
If you could do that, you'd just map training examples to diffs and merge them all.
Or take two distinct models and merge them into a model that is roughly the sum of them. You can't do that; it's not a linear process.
Let me clarify:
It's a serialised, iterative, step-repeating process where each step depends on the output of the previous one - aka a linear process.
Each step is a non-linear transformation (gradient descent).
It's not a task you can distribute over the internet, because it would require transferring gigabytes of data (the whole model's weights) on each step.
To put it another way: each distributed task has a massive input, needs quick computation, and tasks arrive very frequently - which means it can't be distributed over the internet.
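A back-of-the-envelope version of that claim in Python. Every number here is an assumption chosen for illustration (a 7-billion-parameter model, 2 bytes per weight, a 100 Mbit/s home connection); none of them come from the thread.

```python
def transfer_seconds(num_params: int, bytes_per_param: int, link_bits_per_second: int) -> float:
    """Time to ship one full copy of the weights over a link, ignoring latency and overhead."""
    payload_bits = num_params * bytes_per_param * 8
    return payload_bits / link_bits_per_second

NUM_PARAMS = 7_000_000_000      # assumed model size
BYTES_PER_PARAM = 2             # assumed 16-bit weights
HOME_LINK_BPS = 100_000_000     # assumed 100 Mbit/s connection

payload_bytes = NUM_PARAMS * BYTES_PER_PARAM
seconds_per_sync = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, HOME_LINK_BPS)

print(f"{payload_bytes / 1e9:.0f} GB per sync, {seconds_per_sync / 60:.1f} minutes per step")
```

Under these assumptions a single synchronisation takes many minutes, while a training run needs a very large number of sequential steps, which is the mismatch being described.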
Buying and selling hardware isn't free; it comes with its own cost. I would not want to be in the position of selling a $100K box of computer equipment- ever.
True, but some things are harder to sell than others. A100's in today's market would be easy to sell. Harder to buy, because the supply is so low unless you're Google or another big name, but if you're trying to sell them, I'm sure you can get rid of them quickly.
The challenge is that, for it to work cost-effectively, you need to be able to append what is basically a final, algorithmically designed network layer to the model. Until OpenAI exposes the full logits and/or some way to modify them on the fly, you're going to be stuck with open source models. I've mostly run things against GPT-2, but trying LLaMA is on my list.
[1] "Structural Alignment: Modifying Transformers (like GPT) to Follow a JSON Schema" @ https://github.com/newhouseb/clownfish
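As a rough illustration of why access to the full logits matters, the sketch below masks out every token the output format does not allow before picking the next token. This is a generic constrained-decoding toy with a made-up six-token vocabulary and hand-written logits; it is not the linked project's code.

```python
import numpy as np

def constrained_argmax(logits: np.ndarray, allowed_token_ids: list[int]) -> int:
    """Pick the highest-scoring token among those the grammar currently allows."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    return int(np.argmax(masked))

# Toy vocabulary and toy logits (assumptions for illustration only).
vocab = ["{", "}", '"', "hello", ":", "42"]
logits = np.array([0.1, 0.3, 0.2, 2.5, 0.0, 1.0])

# Hypothetical grammar state: a JSON object has to start with "{".
allowed = [vocab.index("{")]

unconstrained = vocab[int(np.argmax(logits))]
constrained = vocab[constrained_argmax(logits, allowed)]
print(unconstrained, constrained)
```

The model on its own prefers "hello", but the mask forces the only structurally valid token. Doing this requires reading and rewriting the whole logit vector at every step, which a text-only API does not allow.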
GPT-3 etc can only do this because they had a LOT of code included in their training sets.
The LLaMA paper says Github was 4.5% of the training corpus, so maybe it does have that stuff baked in and just needs extra tuning or different prompts to tap into that knowledge.
“This time” happens with open source where there’s typically no economical incentive and people are doing it for the heck of it.
I think there are plenty of rich people who would benefit indirectly from running an OpenAI-like non-profit.
And companies would not want to do that. Imagine you make a partner AI that goes unhinged like Bing did and tells you to kill yourself or something similar. I can't imagine companies would want that kind of risk.
Even Jennifer Lawrence stored her nudes on iCloud.
If you think a small corp is going to get a big gov contract outside of a nepo-state you're in for a shock.
My thinking about pricing doesn't include that option because I wouldn't just hook a server like that up to a regular outlet in an office and use it for production work. If that works for you, you can happily ignore my comments. But if you go ahead and build such a thing and operate it for a year, please let us know if there were any costs, either in dollars or in suffering, associated with your decision.
[edit: adding in that the value of this machine also suggests it cannot live unattended in an insecure location, like an office]
signed, person who used to build closet clusters at universities
Of course that's still a very small system when talking LLM training. The only reason I would not put that in a regular office is its extreme price. Do you really want something worth 80k in a form factor that could be casually carried through the door?
Most people who rent cloud servers are not doing this type of workload.
Sadly the reality of funding today makes it unlikely that these two will both be simultaneously satisfied. The problem is that history will look back on the necessary business plan and deem it a failure even if it generates a company that does a billion dollars plus in annual revenue.
This is actually not unique to large language models but most innovation around computers. The basic problem is that if you build a force-multiplier (spreadsheets, personal computing, large-language models all come to mind) then what will make it succeed is its versatility: people want a hammer that can be used for smashing all manner of things, not just your company's particular brand of matching nails. And most people will only pick up that hammer once per week or once per month, only like 1% of the economy if that will be totally revolutionized, "we use this force-multiplier every day, it is now indispensable, we can't imagine life without it," and it's never predictable what that sector will be -- it's going to be like "oh, who ever dreamed that the killer application for LLMs would be them replacing AutoCAD at mechanical contractors" or some shit.
In those strange eons, to wildly succeed, one must give up on anticipating all usages of the software, one must cease controlling it and set it free. "Well where's the profit in that?" -- it is that this company was one of the first players in the overall market, they got an early chance to stake out as much territory as possible. But the market exploded way larger than they could handle and then everybody looks back on them and says "wow, what a failure, they only captured 1% of that market, they could have been so much more successful." Yeah, they captured 1% of a $100B market, some failure, right?
But what actually happens is that companies see the potential, investors get dollar signs in their eyes, everyone starts to lock down and control these, "you may use large language models but only in the ways that we say, through the interfaces which we provide," and then the only thing that you can use it for is to get generic conversational advice about your hemorrhoids, so after 5-10 years the bubble of excitement fizzles out. Nobody ever dreams to apply it to AutoCAD or whatever, and the world remains unchanged.
OpenAI has spent a lot of money to get their result. It's safe to assume it will take a lot of money to get a similar result, and then to share it (although I assume bit torrent will be good enough). Once people are running their models, they can innovate to their hearts content. It's not clear how or why they'd give money back to the enabling technology. So how does money flow back to the innovators in proportion to the value produced, if not a SaaS?
But maybe the governments will make one and maintain it with taxes as an infrastructure service, like roads, giving everyone expanded powers of cognition, memory, and expertise, and raising the consciousnesses of humanity to new heights. Probably in USA it wouldn't happen if we judge ourselves only in zero sum relation to others - helping everyone would be a wash and only waste our money!
The problem with making something nationalised or a utility is you'd better have made sure there's no innovation needed or risk required. Once that's all settled, then maybe consider it.
https://github.com/bigscience-workshop/petals is a project that does this kind of thing for running inference - I tried it out in Google Colab and it seemed to work pretty well.
Model training is much harder though, because it requires a HUGE amount of high bandwidth data exchange between the machines doing the training - way more than is feasible to send over anything other than a local network connection.
My own experience with this was a distributed ray tracer where the server sent the full model to the machines and then each machine would ask for one scan line to do, report back, and then ask for another scan line and repeated.
There was no interaction between the machines - what was on one scan line didn't need any coordination with what was on another scan line.
Likewise, with SETI@home, the server could give you a chunk of data and you could analyze that chunk - the contents of another chunk of data didn't change the analysis being done on this one.
Furthermore, these can be done asynchronously and then assembled when everything is done. Only the very final product / analysis / artifact needs all of the data and nothing other than the end process is waiting on any sub process.
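The scan-line pattern described above can be sketched in a few lines of Python. The "renderer" here is a stand-in that computes a dummy pixel value, and the image size and worker count are arbitrary assumptions; the point is only that no scan line needs anything from any other.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4  # tiny image, for illustration only

def render_scanline(y: int) -> list[int]:
    """Stand-in for ray tracing one row: each pixel depends only on (x, y)."""
    return [(x * y) % 256 for x in range(WIDTH)]

# Workers pull rows independently; results are only assembled at the very end.
with ThreadPoolExecutor(max_workers=4) as pool:
    image = list(pool.map(render_scanline, range(HEIGHT)))

print(len(image), "rows rendered")
```

Because each row is independent, a slow or disconnected worker only delays its own rows, and the order in which rows finish does not matter.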
For doing gradient descent ( https://www.3blue1brown.com/lessons/gradient-descent ), as I understand it, each iteration is dependent on the previous one.
Doing 13,002 dimensional (for the example of a 784 -> 16 -> 16 -> 10 neuron net digit recognizer in the 3b1b page) matrix math is the parallel part... but if you get into the billions of parameters it gets much larger. Matrix multiplication has difficulty across a network. For example - http://www.lac.inpe.br/~stephan/CAP-372/Fox_example.pdf and http://www.cs.csi.cuny.edu/~gu/teaching/courses/csc76010/sli...
> We are now ready for the second stage. In this stage, we broadcast the next column (mod n) of A across the processes and shift-up (mod n) the B values.
That use of "broadcast" - the matrix multiplication is limited by the speed of the slowest node and it needs to send all the data from the previous calculation to all the nodes making it difficult to use across a network that experiences latency.
When doing ML training, they mostly need TB/sec of bandwidth... and the high end extremes are in PB/sec ( https://www.cerebras.net/product-chip/ ) ... and I'm sitting here watching Steam download.
The inefficiencies of the network, slow computers, and the amount of data transfer needed to perform the next calculation make network-distributed machine learning "not a good choice" at this time.
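The 13,002 figure quoted above for the 784 -> 16 -> 16 -> 10 network can be checked with a short Python sketch that counts one weight per connection plus one bias per non-input neuron. The helper name is made up for illustration.

```python
def count_parameters(layer_sizes: list[int]) -> int:
    """Weights (n_in * n_out) plus biases (n_out) for each pair of adjacent layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

digit_net = count_parameters([784, 16, 16, 10])
print(digit_net)
```

The same counting rule is what pushes larger networks into the billions of parameters, and every one of those values has to be kept in sync between machines during training.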