' Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?
Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.'
It is something I have been wondering about: why did Meta not keep training while the loss curves were still going down? Could they conceivably release a Llama 2.1 made from checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?
Because choosing the LR decay requires knowing the # of steps in advance. LR is too small after the 2T tokens, and changing it afterwards doesn't tend to help.
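A minimal sketch of why the step count has to be fixed up front, assuming a cosine schedule with linear warmup. The function name, the peak and minimum learning rates, and the warmup length are illustrative values, not figures from this thread.

```python
import math


def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000):
    """Learning rate at `step` for a cosine schedule planned for `total_steps`.

    All default values are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past the planned end.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine decay from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# The same step gets a very different learning rate depending on the planned horizon.
lr_short_plan = cosine_lr(50_000, total_steps=50_000)   # already at the floor
lr_long_plan = cosine_lr(50_000, total_steps=100_000)   # still roughly mid-decay
```

Because `total_steps` appears inside the decay term, a run planned for a shorter horizon has already reached its floor learning rate by the time a longer run would still be taking large steps.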
If I remember correctly, it's because the main reason they trained multiple models was to show a scaling trend. Each model was trained using a chinchilla-optimal mix of model size, compute, and data size. The point was to provide an empirical scaling law that could possibly be extrapolated to estimate the performance of more expensive models, like imagine a billion dollar model for which the model size, data size, and compute are picked in the chinchilla optimal ratios.
On small models the chinchilla optimal scaling stops training the model even when the model is still improving.
The problem comes when people are actually using these small llama models rather than treating them as just data points. If you are actually using these models, what you want is one that is trained forever on as many tokens and training time as possible.
Get together 5 people in that position and it's less than a week's income for the group. That sounds doable as a hobby for those lucky people.
More realistically, it's within range for a grant, or use of someone else's hardware if they aren't using it, as the sibling comment from wongarsu said.
Also cloud vendors sometimes give out large batches of credits to startups and such as marketing incentive to get future customers.
That's a lot for a hobby, but small enough that it might be running on a university machine (the TinyLlama devs provide a way to cite them and all seem to work or study at Singapore University of Technology) or could be sponsored (no indication of that now, but "people made an awesome model in our cloud" is good advertisement). Government grants or grants in general also aren't out of the question, especially for a topic with this much hype.
They are training the model on 3000/22=136 times the chinchilla-optimal token count. It will be interesting to see how much it improves when trained this far beyond that value.
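The 3000/22 figure can be reproduced with a short calculation. This is a sketch that assumes the commonly cited rule of thumb of about 20 training tokens per parameter for Chinchilla-optimal training; the function names are made up for illustration.

```python
# Assumption: roughly 20 tokens per parameter, a widely used approximation
# of the Chinchilla-optimal ratio, not an exact constant.
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal number of training tokens for a model size."""
    return n_params * tokens_per_param


def overtraining_ratio(train_tokens, n_params):
    """How many times past the compute-optimal token count a run goes."""
    return train_tokens / chinchilla_optimal_tokens(n_params)


tinyllama_params = 1_100_000_000           # 1.1B parameters
tinyllama_tokens = 3_000_000_000_000       # 3T training tokens

optimal = chinchilla_optimal_tokens(tinyllama_params)        # 22B tokens
ratio = overtraining_ratio(tinyllama_tokens, tinyllama_params)  # about 136
```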
?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users.
I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM.
I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar.
That's the thing: you need a whole GPU per concurrent user, which is insanely expensive if you want to run it as part of a SaaS (which is what most for-profits want to do). Of course running models locally is much better in almost every regard, but nobody is gonna be a billionaire with that…
And I'm not sure what you mean by inference latency being infeasible. Most people using these models at home don't even bother with the 7B and go straight to 13B because it's easy to run too and much smarter. And any cloud GPU can run 13B.
Also, when are we going to start seeing open weights MOE models being released?
2- The only 2 I know of are airoboros[1] and Hydra, which is still in progress.
[0] https://x.com/ggerganov/status/1698667093711880687?s=46&t=Jp...
Hydra, is this it? https://github.com/SkunkworksAI/hydra-moe
The "main" training step using huge amounts of inputs is called pre-training. The idea is that after that pre-training, you might fine tune the model for your specific use case.
2T not saturating on a 7B is very different from 3T on a 1B.
It’s my understanding that the entire race to ever-more parameters was driven by that.
Newer large datasets like the ones used here optimize for diversity. (e.g. SlimPajama is a heavily-deduped dataset)
Yeah, the line keeps going down as the model gets bigger. What's your point? That there's a hump in the middle?
AFAIU slim pajama is about 627B tokens, and Starcoder:
> approximately 250 Billion tokens.
Ed: I see TFA says:
> Combined Dataset Size - Around 950B tokens
> Total Tokens During Training - 3 trillion (slightly more than 3 epochs/1430k steps)
... but I'm not seeing how one becomes three? That's more like 1 trillion than 3 trillion tokens?
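One way to read the quoted figures, assuming "epochs" means repeated passes over the same combined dataset, is sketched below. The variable names are illustrative; the numbers are the ones quoted in this thread.

```python
# Figures quoted in the thread.
slimpajama_tokens = 627e9        # "about 627B tokens"
starcoder_tokens = 250e9         # "approximately 250 Billion tokens"
combined_dataset_tokens = 950e9  # "Around 950B tokens" per TFA
total_training_tokens = 3e12     # "3 trillion"
training_steps = 1_430_000       # "1430k steps"

# Sum of the two per-dataset figures quoted by the commenter.
sum_of_quoted_parts = slimpajama_tokens + starcoder_tokens  # 877B

# Number of passes over the combined dataset needed to reach 3T tokens.
epochs = total_training_tokens / combined_dataset_tokens    # about 3.16

# Tokens consumed per optimizer step implied by the step count.
tokens_per_step = total_training_tokens / training_steps    # about 2.1M
```

Under that reading, the dataset stays at roughly one trillion unique tokens, and the 3 trillion figure counts the same tokens seen about three times.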
Citation would be nice. In my experience a restart is sometimes required, when the model gets unstable and 'explodes' or gets stuck in a local minimum. This is common with GANs. I usually roll back the model a bit, but keep the latest discriminator, so that the discriminator 'knows' what to expect. It works in most cases, except for the 'fatality', when the model blows up no matter what. That's the end of training.
i watched that series so many times…
"A somewhat bigger consumer GPU can batch it and serve dozens of users."
Did you not read it?
- Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low.
- A 4090 amortized over 4 years, working days & hours, is 20 cents per working hour; this is basically the same as the power going into it. It's less than a penny per hour per concurrent on a task like this.
- Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it.
- If you hit massive scale and want to buy A100s to improve the economics because you're drowning in business, you can go ahead and readily do that at that time...
It may not be super profitable, but it's not untenable either.
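The amortization argument in the bullets above can be sketched as a back-of-the-envelope calculation. The GPU price, the working hours per year, and the number of concurrent users are assumptions chosen for illustration; only the 4-year horizon and the "power costs about the same" claim come from the comment.

```python
def gpu_cost_per_hour(gpu_price_usd, years, work_hours_per_year):
    """Hardware cost per working hour when amortized over `years`."""
    return gpu_price_usd / (years * work_hours_per_year)


def cost_per_user_hour(hardware_per_hour, power_per_hour, concurrent_users):
    """Hardware plus power cost per hour, split across concurrent users."""
    return (hardware_per_hour + power_per_hour) / concurrent_users


# Assumptions: a 4090 at about $1600, and 250 working days of 8 hours each.
hardware = gpu_cost_per_hour(gpu_price_usd=1600, years=4, work_hours_per_year=2000)

# The comment says power costs about the same as the amortized hardware.
power = hardware

# Assumption: "dozens" of concurrent users taken as 48 for the example.
per_user = cost_per_user_hour(hardware, power, concurrent_users=48)
```

With these inputs the hardware comes to 20 cents per working hour and the per-user figure lands under one cent per hour. Different price or utilization assumptions shift the result proportionally.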
The exception is the A100 GPU which does not use 100% of GPU compute and therefore you get benefit from batching, but is hella expensive.
The economics are not simple, and in most cases "just use the ChatGPT API" is also the most cost-effective option anyways. A smaller 1.1B model (which would likely not be compute-bound) with similar performance to a 7B model may tip the scales.
From what I understand, they are severely bandwidth bound at a GPU batch size of 1. Even llama.cpp is fairly RAM speed bound on a CPU with much less compute than a GPU.
It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...
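A rough sketch of the bandwidth-bound argument: at batch size 1, every generated token requires reading all the weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The 360 GB/s figure (roughly a 12GB RTX 3060) and the 4-bit quantization are assumptions for illustration, and the estimate ignores the KV cache and other overhead.

```python
def max_tokens_per_second(mem_bandwidth_gb_s, n_params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Each token needs one full pass over the weights, so the ceiling is
    memory bandwidth divided by the size of the weights in memory.
    """
    model_size_gb = n_params_billion * bytes_per_param
    return mem_bandwidth_gb_s / model_size_gb


# Assumptions: 360 GB/s memory bandwidth, 7B parameters at 4 bits (0.5 bytes) each.
llama_7b_q4 = max_tokens_per_second(360, 7, 0.5)

# Same hypothetical GPU with a 1.1B model at the same quantization.
tiny_1b_q4 = max_tokens_per_second(360, 1.1, 0.5)
```

Batching helps because one pass over the weights can serve several sequences at once, which is why throughput per GPU can rise well above the single-stream bound until compute becomes the limit.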
An LLM with batch_size=1 technically cannot use '100%' of the GPU, because it has to move a lot of data around and use different blocks of the GPU. So, when tensor cores are used, CUDA cores are idle. Tensor cores are used for matrix multiplication, CUDA cores for activation functions (I'm simplifying). The model has to use both at different times, moving data between them. Meanwhile the GPU monitor may report 100%. But it's still possible to insert another process. I think I've seen this idea in the PyTorch docs.
As for a 1.1B LLM, it would be nice. Interesting experiment anyway. I'm only afraid that with a big and diverse dataset the model will focus more on memorization and generic logic may not emerge. They aren't doing anything new in terms of architecture and training methods.
But that's not how it works: you need to have enough of it to accommodate for peek usage, but a good fraction of that isn't going to be running most of the time. You'd end up with a cost that's not too far from what Cloud providers are offering, which is roughly 3 times that price. And you need to pay for the whole server hosting these GPUs (this is less of a factor when you're using big GPUs like H100, but if you want to stick with consumer-grade GPUs, then the host is still a non-trivial fraction of the cost, and you're supporting a server for a small bunch of concurrent users, which means your infra team is going to work with a massive pool of servers very quickly, with all the associated costs).
> It's less than a penny per hour per concurrent on a task like this.
It's still two orders of magnitude more expensive than any other SaaS business.
> Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it.
Maybe, but then again you're trying to build a service that has to add much more value than what the typical SaaS start-up provides.
Also regarding this:
> - Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low.
ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but:
- this market is already taken by them, so your startup isn't gonna do the same.
- when you start integrating LLMs in tools you use routinely (an IDE being the typical example), then the token generation amount skyrockets.
Really? Some SaaS businesses have users doing things that generate tens of thousands of IOs per user request across spinning storage, or even far more.
> ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but:
I think you basically completely misunderstood everything I said. Here, the point was that someone using it is generating tokens a very large proportion of the time they're sitting in front of the service compared to most use cases-- but it's still only like 20% of the time.
We all have a pretty good understanding of the tradeoffs between owning hardware vs. elastic usage of a utility. We know that "peek usage" [sic] is higher than average (which is why there's a duty cycle correction in the calculation in the first place).
> - when you start integrating LLMs in tools you use routinely (an IDE being the typical example), then the token generation amount skyrockets.
It all depends. The system I just built and deployed does not need to be immediately responsive to end-users (users can tolerate a delay of a couple of minutes), with a few thousand tokens per user per week, and usage smeared pretty well over a several hour per day window. There's a lot of reasons (beyond economics) why moving it to a consumer GPU is attractive, but it won't be happy with a 1B parameter model.
You are very smart indeed…
There's plenty of reasons why firms will want to run this stuff on-prem, both for their own usage and as a service. It probably will not be the majority of usage or zero, but instead a noticeable small chunk.
Yes, it's more expensive than many things, but not anywhere close to the most expensive service that people choose to run on-prem. And you can still support a decent userbase from a few computers, depending upon what you're doing.