Mistral "Mixtral" 8x7B 32k model [magnet](twitter.com) |
Mistral "Mixtral" 8x7B 32k model [magnet](twitter.com) |
RAM Wise, you can easily run a 70b with 128GB, 8x7B is obviously less than that.
Compute wise, I suppose it would be a bit slower than running a 13b.
edit: "actually", I think it might be faster than a 13b. 8 random 7b ~= 115GB, Mixtral is under 90. I will have to wait for more info/understanding.
I had to manually add these trackers and now it works: https://gist.github.com/mcandre/eab4166938ed4205bef4
EDIT: I mean, I guess they didn't hack their own twitter account, but still.
Holy shit, this is some clever marketing.
Kinda wonder if any of their employees were part of the warez scene at some point.
Mixtral brings MoE to an already-powerful model.
What is the max number of tokens in the output?
Mistral Mixtral Model Magnet
Mistral Mixtral Model Magnet
MistralLite is basically overfit to summarize and retrieve in its 32K context, but its extremely good at that for a 7B. Its kinda useless for anything else.
Yi 200K is... smart with the long context. An example I often cite is a Captain character in a story I 'wrote' with the llm. A Yi 200K finetune generated a debriefing for like 40K of context in a story, correctly assessing what plot points should be kept secret and making some very interesting deductions. You could never possibly do that with RAG on a 4K model, or even models that "cheat" with their huge attention like Anthropic.
I test at 75K just because that's the most my 3090 will hold.
https://huggingface.co/fblgit/una-xaberius-34b-v1beta
https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16
I mention this because it could theoretically be applied to Mistral Moe. If the uplift is the same as regular Mistral 7B, and Mistral Moe is good, the end result is a scary good model.
This might be an inflection point where desktop-runnable OSS is really breathing down GPT-4's neck.
I asked around a bit about the example and it was strangely coherent and focused across the whole conversation. It was really well detecting, where I'm starting a new thread (without clearing a context) or referring to things before.
It caught me off guard as well with this:
> me: What does following mean [content of the docker compose]
> cybertron-7b: In the provided YAML configuration, "following" refers to specifying dependencies
I've never seen any model using my exact wording in quotes in conversation like that.
EDIT: just saw this “ Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”
If you have ollama installed you can try it out with `ollama run nollama/una-cybertron-7b-v2`.
[1]: https://huggingface.co/TheBloke/una-cybertron-7B-v2-GGUF
On the human ratings, three different 7B LLMs (Two different Openchat models and a Mistral fine tune) beat a version of GPT-3.5.
(The top 9 chatbots are GPT and Claude versions. Tenth place is a 70B model. While it's great that there's so much interest in 7B models, and it's incredible that people are pushing them so far, I selfishly wish more effort would go into 13B models... since those are the biggest that my macbook can run.)
But its not totally irrelevant. They are still a datapoint to consider with some performance correlation. YMMV, but these models actually seem to be quite good for the size in my initial testing.
Although, anyone claiming a 7b LLM is better than a well trained 70b LLM like Llama 2 70b chat for the general case, doesn't know what they are talking about.
In the future will it be possible? Absolutely, but today we have no architecture or training methodology which would allow it to be possible.
You can rank models yourself with a private automated benchmark which models don't have a chance to overfit to or with a good human evaluation study.
Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard because Mistral 7b is lower than Llama 2 70b chat.
Any thoughts on that?
The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.
The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.
I think that the '7b beating 70b' is mostly due to the fact that Mistral is likely trained on considerably more tokens than Chinchilla optimal. So is llama-70b, but not to the same degree.
Pretty much anything with ~6-8GB of memory that's not super old.
It will run on my 6GB laptop RTX 2060 extremely quickly. It will run on my IGP or Phone with MLC-LLM. It will run fast on a laptop with a small GPU, with the rest offloaded to CPU.
Small, CPU only servers are kinda the only questionable thing. It runs, just not very fast, especially with long prompts (which are particularly hard for CPUs). There's also not a lot of support for AI ASICs.
New open weights LLM from @MistralAI
params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2
Likely related code: https://github.com/mistralai/megablocks-public
Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.
If people are wondering why there is so much AI activity right around now, it's because the biggest deep learning conference (NeurIPS) is next week.
Can we expect some big announcements (new architectures, models, etc) at the conference from different companies? Sorry, not too familiar what the culture for research conferences is.
Recent announcements from companies tend to be even more divorced from conference dates, as they release anemic "Technical Reports" that largely wouldn't pass muster in a peer review.
>- n_heads / n_kv_heads = 32/8 => 4X
These two are exactly the same as the old Mistral-7B
Its does remind me how some Google employee was bragging that they disclosed the weights for the Gemini, and only the small mobile Gemini, as if that's a generous step over other companies.
I am 100% in agreement with your viewpoint, but feel squeamish seeing an un-needed lie coupled to it to justify it. Just so much Othering these days.
Not that this affects Google's user base in any way, at the moment.
{
"dim": 4096,
"n_layers": 32,
"head_dim": 128,
"hidden_dim": 14336,
"n_heads": 32,
"n_kv_heads": 8,
"norm_eps": 1e-05,
"vocab_size": 32000,
"moe": {
"num_experts_per_tok": 2,
"num_experts": 8
}
}Everything open source is llama now. Facebook all but standardized the architecture.
I dunno about the moe. Is there existing transformers code for that part? It kinda looks like there is based on the config.
ChatGPT 4 is amazing yes and i've been a day 1 subscriber, but it's huge, runs on server farms far away and is more or less a black box.
Mistral is tiny, and amazingly coherent and useful for it's size for both general questions and code, uncensored, and a leap i wouldn't have believed possible in just a year.
I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.
Mistral - magnet link and that's it
[1] https://www.bloomberg.com/news/articles/2023-12-04/openai-ri...
It’s crazy what can be done with this small model and 2 hours of fine tuning.
Chatbot with function calling? Check.
90 +% accuracy multi label classifier, even when you only have 15 examples for each label? Check.
Craaaazy powerful.
- Mixture of Experts architecture.
- 8x 7B parameters experts (potentially trained starting with their base 7B model?).
- 96GB of weights. You won't be able to run this on your home GPU.
Excited to play with this once it's somewhat documented on how to get it running on a dual 4090 Setup.
What memory will this need? I guess it won't run on my 12GB of vram
"moe": {"num_experts_per_tok": 2, "num_experts": 8}
I bet many people will re-discover bittorrent tonight
Its also a good candidate for splitting across small GPUs, maybe.
One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU and the downstream expert model weights on the CPU /IGP. This is actually pretty efficient, as the CPU/IGP is really bad at the prompt ingestion but reasonably fast at ~14B token generation.
Llama.cpp all but already does this, I'm sure MLC will implement it as well.
Warning: the implementation might be off as there's no official one. We at Fireworks tried to reverse-engineer model architecture today with the help of awsome folks from the community. The generations look reasonably good, but there might be some details missing.
If you want to follow the reverse-engineering story: https://twitter.com/dzhulgakov/status/1733330954348085439
It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).
Mistral 7B is basically an 8K model, but was marked as a 32K one.
Anyway, if the vanilla version requires 2x80gb cards, I wonder how would it run on a M2 Ultra 192gb Mac Studio.
Anyone having the machine could try?
Is it mainly because its hard to apply the limitations so that it doesn't spit out bomb making instructions?
I remember running llama1 33B 8 months ago that as i remember was on Mistral 7B's level while other 7B models were a rambling mess.
The jump in "potency" is what is so extreme.
If an LLM or a SmallLM can be retrained or fine-tuned constantly, every week or every day to incorporate recent information then outdated models trained a year or two years back hold no chance to keep up. Dunno about the licensing but OpenAI could incorporate a smaller model like Mistral7B into their GPT stack, re-train it from scratch every week, and charge the same as GPT-4. There are users who might certainly prefer the weaker, albeit updated models.
I was really hoping for a 13B Mistral. I'm not sure if this MOE will run on my 3090 with 24GB. Fingers crossed that quantization + offloading + future tricks will make it runnable.
They are not far from GPT-3 logic wise i'd say if you consider the breadth of data, ie. very little in 7GB's; so missing other languages, niche topics and prose styles etc.
I honestly wouldn't be surprised if 13B would be indistinguishable from GPT-3.5 on some levels. And if that is the case - then coupled with the latest developments in decoding - like Ultrafastbert, Speculative, Jacobi, Lookahead etc. i honestly wouldn't be surprised to see local LLM's on current GPT-4 level within a few years.
That seems kinda low, are you using Metal GPU acceleration with llama.cpp? I don't have a macbook, but saw some of the llama.cpp benchmarks that suggest it can reach close to 30tk/s with GPU acceleration.
If anyone has faster than 12tkps on Air's let me know.
I'm using the LM Studio GUI over llama.cpp with the "Apple Metal GPU" option. Increasing CPU threads seemingly does nothing either without metal.
Ram usage hovers at 5.5GB with a q5_k_m of Mistral.
People are running models 2-4 times that size on local GPUs.
What's more, this will run on a MacBook CPU just fine-- and at an extremely high speed.
This is just about right for 24GB. I bet that is intentional on their part.
This seems like a non-sequitur. Doesn't MoE select an expert for each token? Presumably, the same expert would frequently be selected for a number of tokens in a row. At that point, you're only running a 7B model, which will easily fit on a GPU. It will be slower when "swapping" experts if you can't fit them all into VRAM at the same time, but it shouldn't be catastrophic for performance in the way that being unable to fit all layers of an LLM is. It's also easy to imagine caching the N most recent experts in VRAM, where N is the largest number that still fits into your VRAM.
Even if you can't fit all of them in the VRAM, you could load everything in tmpfs, which at least removes disk I/O penalty.
You can these days, even in a portable device running on battery.
96GB fits comfortably in some laptop GPUs released this year.
Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?
I would think no differently than you can run a large regular model on a multiGPU setup (which people do!). Its still all one network even if not all of it is activated for each token, and since its much smaller than a 56B model, it seems like there are significant components of the network that are shared.
As far as I understand in a MOE model only one/few experts are actually used at the same time, shouldn't the inference speed for this new MOE model be roughly the same as for a normal Mistral 7B then?
7B models have a reasonable throughput when ran on a beefy CPU, especially when quantized down to 4bit precision, so couldn't Mixtral be comfortably ran on a CPU too then, just with 8 times the memory footprint?
So you need roughly two loaded in memory per token. Roughly the speed and memory of a 13B per token.
Only issues is that's per-token. 2 experts are choosen per token, which means if they aren't the same ones as the last token, you need to load them into memory.
So yeah to not be disk limited you'd need roughly 8 times the memory and it would run at the speed of a 13B model.
~~~Note on quantization, iirc smaller models lose more performance when quantized vs larger models. So this would be the speed of a 4bit 13B model but with the penalty from a 4bit 7B model.~~~ Actually I have zero idea how quantization scales for MoE, I imagine it has the penalty I mentioned but that's pure speculation.
How are AMD/Intel totally missing this boat?
https://twitter.com/zacharynado/status/1732425598465900708
(Alt: https://nitter.net/zacharynado/status/1732425598465900708)
That is fair though, this was an impulsive addition on my part.
Other sibling commenter refulgentis is correct too. The Apple M{1-3} Max chips have up to 400GB/s memory bandwidth. I think that's noticably faster than every other consumer CPU out there. But it's slower than a top Nvidia GPU. If the entire 96GB model has to be read by the GPU for each token, that will limit unquantised performance to 4 tokens/s at best. However, as the "Mixtral" model under discussion is a mixture-of-experts, it doesn't have to read the whole model for each token, so it might go faster. Perhaps still single-digit tokens/s though, for unquantised.
- Exui with exl2 files on good GPUs.
- Koboldcpp with gguf files for small GPUs and Apple silicon.
There are many reasons, but in a nutshell they are the fastest and most VRAM efficient.
I can fit 34Bs with about 75K context on a single 24GB 3090 before the quality drop from quantization really starts to get dramatic.
Unfortunately, that's because they have Wall St. analysts looking at their videos who will (indirectly) determine how big of a bonus Sundar and co takes home at the end of the year. Mistral doesn't have to worry about that.
It would probably fit in 32GB at 4-bit but probably won’t run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable.
Its theoretically doable, with quantization from the recent 2 bit quant paper and a custom implementation (in exllamav2?)
EDIT: Actually the download is much smaller than 8x7B. Not sure how, but its sized more like a 30B, perfect for a 3090. Very interesting.
That's a total of 12B, meaning this should be able to be run on the same hardware as 13B models with some loading time between generations.
Or watch Tim Dettmers, who is releasing code to run Mixtral 8x7b in just 4GB of RAM.
You can find most popular models converted for you on huggingface.co if you add exl2 to your search and start with the 4 bit quantized version. Don't bother going above 6 bits even if you have spare vram, practically it doesn't offer much benefit.
For reference I max out around 60 tok/s at 4bit, 50 tok/s at 5bit, 40 at 6bit for some random 7b parameter model on a rtx 2070.
For reference, I'm getting 10 t/s with a Q5_K_M Mistral GGUF model.
If MOE models do well it could be great for commodity hw based distributed inference approaches.
In other words, assuming you ask a coding question and there's a coding expert in the mix, it would answer it completely.
Just today, i asked GPT and Bard(Gemini) to write code using slint, neither of them had any idea of slint. Slint being a relatively new library, like two and a half (0.1 version) to one and a half (0.2 version) years back [1] is not something they trained on.
Natural language doesn't change that much over the course of a handful of years, but in coding 2 years back may as well be a century. My argument is that, SmallLMs not only they are relevant, they are actually desirable, if the best solution is to be retrained from scratch.
If on the other hand a billion token context window proves to be practical, or the RAG technique solves most of use cases, then LLMs might suffice. This RAG technique, could it be aware, of million of git commits daily, on several projects, and keep it's knowledge base up to date? I don't know about that.
use slint::slint;
slint! { DialogBox := Window { width: 400px; height: 200px; title: "Input Dialog";
VerticalBox {
padding: 20px;
spacing: 10px;
TextInput {
id: input_field;
placeholder_text: "Enter text here";
}
Button {
text: "Submit";
clicked => {
// Handle the button click event
println!("Input: {}", input_field.text());
}
}
}
}
}fn main() { DialogBox::new().run(); }
I do not have a GPT4 subscription, i did not bother because it is so slow, limited queries etc. If the cutoff date is improved, like being updated periodically i may think about it. (Late response, forgot about the comment!)
So 2B for attention + 5Bx2 for inference = 12B in RAM at runtime.
It's between 7B and 13B in terms of VRAM usage and 70B in terms of performance.
Tim Dettmers (QLoRA creator) released code to run Mixtral 8x7b in 4GB of VRAM. (But it benchmarks better than Llama-2 70B).
These aren't totally common configurations, but they're not totally out of reach like buying an H100 for personal use.
2. Yes, and my MBP M1 Pro can run quantized 34b models. My point was that when you do MoE, memory requirements suddenly become too challenging. A 7b Q8 is roughly 7GB (7b parameters × 8 bits each). But 8x of that would be 56GB, and all of that must be in memory to run.
For some uninteresting regurgitation, sure. But size - width and depth - seems like an important piece for ability to extract deep understanding of the universe.
Also, MoE, as I understand it, will inherently not be able to glean insight into, nor reason about, and certainly not be able to come up with novel understanding, for cross-expert areas.
I believe size matters, a lot.
Its just that each bit picks up different "parts" of the training more strongly, which can be selectively picked at runtime. This is actually kinda analogous to animals, which dont fire every single neuron so frequently like monolithic models do.
The tradeoff, at equivalent quality, is essentially increased VRAM usage for faster, more splittable inference and training, though the exact balance of this tradeoff is an excellent question.
The trick is not this neural alignment - it is training on many, many more tokens than Chinchilla recommends.
[1] https://digital-strategy.ec.europa.eu/en/policies/regulatory...
2B for the attention head and 5B from each of 2 experts.
It should be able to run slightly faster than a 13B desnse model, in as little as 16GB of RAM with room to spare.
I don't think that's the case, for full speed you still need (5B*8)/2+2~fewB overhead.
I think the experts chosen per-token? That means that yes you technically only need two in VRAM memory+router/overhead per token, but you'll have to constantly be loading in different experts unless you can fit them all, which would still be terrible for performance.
So you'll still be PCIE/RAM speed limited unless you can fit all of the experts into memory (or get really lucky and only need two experts).
Which is more than a price of RTX A6000 48gb ($4k used on ebay)
it's not very good, at all, but now we can claim some pretty massive speedups.
I can't find anything for llama 2 70B on 4090 after 10 minutes of poking around, 13B is about 30 tkn/s. it looks like people generally don't run 70B unless they have multiple 4090s.
We clearly see that Mistral-7B is in some important, representative respects (eg coding) superior to Falcon-180B, and superior across the board to stuff like OPT-175B or Bloom-175B.
"Well trained" is relative. Models are, overwhelmingly, functions of their data, not just scale and architecture. Better data allows for yet-unknown performance jumps, and data curation techniques are a closely-guarded secret. I have no doubt that a 7B beating our best 60-70Bs is possible already, eg using something like Phi methods for data and more powerful architectures like some variation of universal transformer.
I don't know how Mistral performs on codegen specifically, but models which are finetuned for a specific use case can definitely punch above their weight class. As I stated, I'm just talking about the general case.
But so far we don't know of a 7b model (there could be a private one we don't know about) which is able to beat a modern 70b model such as Llama 2 70b. Could one have been created which is able to do that but we simply don't know about it? Yes. Could we apply Phi's technique to 7b models and be able to reach Llama 2 70b levels of performance? Maybe, but I'll believe it when we have a 7b model based on it and a human evaluation study to confirm. It's been months now since the Phi study came out and I haven't heard about any new 7b model being built on it. If it really was such a breakthrough to allow 10x parameter reduction and 100x dataset reduction, it would be dumb for these companies to not pursue it.
Actually I am testing the 34B myself (not the 7B), and it seems good.
If you chatted with them, you know .. that strange sensation, you know what is it.. Intelligence. Xaberius-34B is the highest performer of the board, and is NOT contaminated.
Gotcha
> Actually I am testing the 34B myself (not the 7B), and it seems good.
I've heard good things about it
With a 7b and a 70b trained the same way, the 70b should always be better
I see too many misleading "NEW 7B MODEL BEATS GPT-4" posts. People test those models a couple of times, come back to the comments section, declare it true, and onlookers know no better than to believe it and in my opinion has led to many people claiming 7b models have gotten as good as Llama 2 70b or GPT-4 when it is not the case when you account for the overfit being exhibited by these models and actually put them to the test via human evaluation.
Do we actually either no that it is MoE or that size? IIRC both if those started as outsidr guesses that somehow just became accepted knowledge without any actual confirmation.
TL;DR you can think of it as the initial part of the model is essentially dedicated to learning which experts to choose.
Honestly its crazy that AMD indulges in this, especially now. Their workstation market share is comparatively tiny, and instead they could have a swarm of devs (like me) pecking away at AMD compatibility on AI repos if they sold cheap 32GB/48GB cards.