Mistral AI Launches New 8x22B MOE Model

Mistral AI Launches New 8x22B MOE Model (twitter.com)

379 points by varunvummadi 2 years ago | 153 comments

freeqaz 2 years ago |

What's the easiest way to run this assuming that you have the weights and the hardware? Even if it's offloading half of the model to RAM, what tool do you use to load this? Ollama? Llama.cpp? Or just import it with some Python library?

Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?

fbdab103 2 years ago | |

I think the llamafile[0] system works the best. Binary works on the command line or launches a mini webserver. Llamafile offers builds of Mixtral-8x7B-Instruct, so presumably they may package this one up as well (potentially a quantized format).

You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as is against a llamafile?

[0] https://github.com/Mozilla-Ocho/llamafile

jart 2 years ago | | |

llamafile author here. I'm downloading Mixtral 8x22b right now. I can't say for certain it'll work until I try it, but let's keep our fingers crossed! If not, we'll be shipping a release as soon as possible that gets it working.

My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.

noman-land 2 years ago | | |

+1 on llamafile. You can point it to a custom model.

varunvummadi 2 years ago | |

The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a couple of A100s, and you can benchmark it using this library (https://github.com/EleutherAI/lm-evaluation-harness).

sheepscreek 2 years ago | | |

In that regard, it’s even easier to use one Apple Studio with sufficient RAM and llama.cpp or even PyTorch for inference.

hmottestad 2 years ago | |

LM Studio is a great way to test out LLMs on my MacBook: https://lmstudio.ai/

Really easy to search huggingface for new models to test directly in the app.

LeoPanthera 2 years ago | | |

Make sure you get the prompt template set correctly, the defaults are wrong for a lot of models.

bevekspldnw 2 years ago | |

There is a user called TheBloke on Hugging Face who releases pre-quantized models pretty soon after the full-size drop. Just watch their page and pray you can fit the 4-bit in your GPU.

I’m sure they are already working on it.

nathanasmith 2 years ago | | |

TheBloke stopped uploading in January. There are others that have stepped up though.

MPSimmons 2 years ago | | |

I think 4-bit for this is supposed to be over 70GB, so definitely still heavy hardware.

mritchie712 2 years ago | |

you can try it on together here:

https://api.together.xyz/playground/language/mistralai/Mixtr...

SushiHippie 2 years ago |

[dupe] https://news.ycombinator.com/item?id=39986047

Which has the link to the tweet instead of the profile:

https://twitter.com/MistralAI/status/1777869263778291896

mlsu 2 years ago |

8x22b. If this is as good as Mixtral 8x7b we are in for a wonderful time.

cchance 2 years ago | |

I've heard command-r is the first open-source model to beat GPT-4 in benchmarks

jxy 2 years ago | | |

It's "Command R+". "Command R" is a smaller model.

varunvummadi 2 years ago | | |

It beats the old GPT-4 version on the lmsys benchmark; you can check it out here: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar... But Command R is commercially licensed. We can assume that Mistral will do a better job.

moralestapia 2 years ago | |

You mean better, right?

Why would you want another 8x7b, if you already have it ...

nazka 2 years ago |

Off topic, but are we now back at the same performance as ChatGPT 4 at the time people said it worked like magic (meaning before the nerf to make it more politically correct, which made its performance crash)?

hmottestad 2 years ago | |

I’ve been testing a lot of LLMs on my MacBook and I would say that all of them are far away from being as good as GPT-4, at any time. Many are as good as GPT-3 though. There are also a lot of models that are fine tuned for specific tasks.

Language support is one big thing that is missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue with GPT-4.

Eisenstein 2 years ago | | |

Which ones have you tested? There were some huge ones released recently.

segmondy 2 years ago | |

With open models, yes we are at the performance of at least the first release of ChatGPT 4.

sp332 2 years ago | | |

Could you recommend one or a few in particular?

zmmmmm 2 years ago |

A pre-Llama3 race for everyone to get their best small models on the table?

moffkalast 2 years ago | |

262 GB is not exactly small. But yes it seems they're all getting them out the door in case they end up being worse than llama-3 in which case it'll be too embarrassing to release later.

hmottestad 2 years ago | | |

Since it’s a MOE model it will only need to load a few of the 8 sub models into vram in order to answer a query. So it may look large, but I think a quantized model will easily fit on a Mac with 64GB of memory and maybe even a bit fewer bits and it’ll fit into 32GB.

I think it might be the end for 24GB 4090 cards though :(

swyx 2 years ago | |

this is likely v true given llama 3 rumored to release in next 2 weeks

nen-nomad 2 years ago |

Mixtral 8x7b has been good to work with, and I am looking forward to trying this one as well.

ZeljkoS 2 years ago |

Here is the unofficial benchmark: https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1/...

bevekspldnw 2 years ago | |

Wish it had GPT-4, that’s the one to beat still.

GuB-42 2 years ago | | |

It is there, not for all the benchmarks, but for those where it is included, GPT-4 scores much higher.

Not surprising since GPT-4 is still state-of-the-art and much bigger. Where Mistral has been particularly impressive is when you take the size of the model into account.

deoxykev 2 years ago |

4 bit quants should require 85GB VRAM, so this will fit nicely on 4x 24G consumer GPUs, plus some leftover for KV cache optimization.

qeternity 2 years ago | |

4bit should take up less than this, there are quite a few shared parameters between experts.

But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.
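A rough way to see why batching ends up touching most experts. This is a sketch under a simplifying assumption that is not from the thread: each token independently picks 2 of the 8 experts uniformly at random in a given layer. Real routers are learned and not uniform, so treat the numbers as a ballpark only.

```python
def expected_active_experts(batch_tokens: int, n_experts: int = 8, top_k: int = 2) -> float:
    """Expected number of distinct experts hit in one MoE layer.

    Assumption (hypothetical, for illustration): every token selects top_k
    distinct experts uniformly at random, independently of the other tokens.
    """
    # Chance that one particular expert is skipped by a single token.
    p_miss_one_token = 1 - top_k / n_experts
    # Chance that the same expert is skipped by every token in the batch.
    p_miss_all = p_miss_one_token ** batch_tokens
    return n_experts * (1 - p_miss_all)


for batch in (1, 4, 16):
    print(batch, round(expected_active_experts(batch), 2))
```

Under this assumption a single token touches 2 experts per layer, while a batch of 16 tokens is already expected to touch nearly all 8.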

hedgehog 2 years ago | |

I've found the 2 bit quant of Mixtral 8x7B is usable for some purposes with an 8GB GPU. I'm curious how this new model will work in similar cheap 8-16GB GPU configurations.

reissbaker 2 years ago | | |

16GB will be way too small unfortunately — this has over 3x the param count, so at best you're looking at a 24GB card with extreme 2bit quantization.

Really though if you're just looking to run models personally and not finetune (which requires monstrous amounts of VRAM), Macs are the way to go for this kind of mega model: Macs have unified memory between the GPU and CPU, and you can buy them with a lot of RAM. It'll be cheaper than trying to buy enough GPU VRAM. A Mac Studio with 192GB unified RAM is under $6k — two A6000s will run you over $9k and still only give you 96GB VRAM (and God help you if you try to build the equivalent system out of 4090s or A100s/H100s).

Or just rent the GPU time as needed from cloud providers like RunPod, although that may or may not be what you're looking for.

aydyn 2 years ago | | |

AFAIK, 2-bit quant leads to too much loss of performance, such that you're better off using a different smaller model altogether. See here:

https://www.reddit.com/r/LocalLLaMA/comments/18ituzh/mixtral...

cjbprime 2 years ago | | |

Wouldn't expect that to work at all.

zone411 2 years ago |

Very important to note that this is a base model, not an instruct model. Instruct fine-tuned models are what's useful for chat.

haolez 2 years ago | |

What's the feeling of playing with a powerful base model? Will it just complete the prompt text like a continuation of it?

MPSimmons 2 years ago | | |

Generally, yes, it literally just tries to predict the next token again and again and again.

This model is apparently surprisingly good at chat, even though it is a base model, and will take part in it to some extent. It should be really interesting once it's fine-tuned.
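To make "predict the next token again and again" concrete, here is a toy sketch. The bigram table and function names are made up for illustration; a real base model scores its whole vocabulary with a neural network and usually samples rather than always taking the top choice.

```python
# Hypothetical toy "model": for each token, scores for possible next tokens.
BIGRAMS = {
    "the": {"model": 0.6, "weights": 0.4},
    "model": {"predicts": 0.7, "is": 0.3},
    "predicts": {"the": 0.9, "tokens": 0.1},
}


def next_token(token):
    """Greedy choice: return the highest-scoring continuation, or None."""
    candidates = BIGRAMS.get(token)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def complete(prompt, max_new_tokens=5):
    """Extend the prompt one token at a time, feeding each output back in."""
    tokens = prompt.split()
    if not tokens:
        return prompt
    for _ in range(max_new_tokens):
        nxt = next_token(tokens[-1])
        if nxt is None:  # nothing known to follow this token
            break
        tokens.append(nxt)
    return " ".join(tokens)


print(complete("the"))
```

There is no notion of a question or an answer in the loop, only continuation, which is why a base model tends to carry on the text rather than reply to it.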

talsperre 2 years ago |

Right on time as LLama 3 is released.

jimmySixDOF 2 years ago | |

And on the same day, Google Gemini Pro gets almost completely open long-context multimodal access and OpenAI upgrades to GPT4-Turbo. It was a big day in general for news drops, that's for sure!

abdullahkhalids 2 years ago |

Why are some of their models open, and others closed? What is their strategy?

Jackson__ 2 years ago | |

My personal speculation is that their closed models are based on other companies' models.

For example, on EQbench[0], Miqu[1], a leaked continued pretrain based on LLama2, performs extremely similarly to the Mistral Medium model their API offers.

Maybe they're thinking it'd be bad PR for them to release models they didn't create from scratch, or there is some contractual obligation preventing the release.

[0]https://eqbench.com/index.html

[1]https://huggingface.co/miqudev/miqu-1-70b

moffkalast 2 years ago | | |

That's quite likely, some have also speculated that Mistral 7B got some EU grant funding that stipulated it had to be openly released later, and Mixtral is based on Mistral 7B so it would likely be subject to the same terms. I haven't found any source to substantiate it though.

unraveller 2 years ago | |

Mistral have stated they want to chase the fine-tune dollar to support le research. We should get thrown a bone of hard to tune mid-range stuff occasionally. Especially when big announcements about small models are expected later in the week (llama3) or when haiku is stealing the thunder from mixtral 8x7b.

kvmet 2 years ago | |

It's gotta be either perceived value or training data/licensing restrictions.

blackeyeblitzar 2 years ago | |

I am not sure why some are open and some are closed - if I had to speculate, it’s perhaps that the commercial models help fund the team. They come with safety features built-in as well as API-based access (instead of needing to self-host). They word their mission (https://mistral.ai/company/#missions) as follows:

> Our mission is to make frontier AI ubiquitous, and to provide tailor-made AI to all the builders. This requires fierce independence, strong commitment to open, portable and customisable solutions, and an extreme focus on shipping the most advanced technology in limited time.

wkat4242 2 years ago |

Weird, the last post I see at that link is from the 8th of December 2023 and it's not about this.

Edit: Ah, it's the wrong link. https://news.ycombinator.com/item?id=39986047

Thanks SushiHippie!

intellectronica 2 years ago |

It's weird that more than a day after the weights dropped, there still isn't a proper announcement from Mistral with a model card. Nor is it available on Mistral's own platform.

tosh 2 years ago | |

at least they confirmed it is Apache 2.0

https://twitter.com/arthurmensch/status/1778308399144333411

ein0p 2 years ago |

To this day 8x7b Mixtral remains the best model you can run on a single 48GB GPU. This has the potential to become the best model you can run on two such GPUs, or on an MBP with maxed out RAM, when 4-bit quantized.

ryao 2 years ago | |

I am looking forward to the pricing of those dropping. It is a shame that high memory graphics cards are not mainstream.

rspoerri 2 years ago | |

I hope i get it to run on my 96gb m2 in q4.

rspoerri 2 years ago | | |

It actually does, in case anybody wonders. But it seems as if it's not fine tuned to chat, or i'm doing it wrong at the moment. Getting a lot of duplicates and non useful answers.

noman-land 2 years ago | |

My first thought was how much RAM? Will it work on 64GB M1?

jwitthuhn 2 years ago | | |

It is ~260GB with presumably fp16 weights. Should fit into 64GB at 3-bit quantization (~49GB).

Edit: To add to this, I've had good luck getting solid output out of mixtral 8x7b at 3-bit, so that isn't small enough to completely kill the model's quality.
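The estimate above is just a linear scaling of the fp16 size by bits per weight. A minimal sketch, assuming the ~260GB fp16 figure quoted in this comment and ignoring per-block scale metadata, which makes real quantized files somewhat larger than the nominal bit width suggests:

```python
def quantized_size_gb(fp16_size_gb: float, bits: float) -> float:
    """Scale an fp16 (16 bits per weight) checkpoint size to `bits` per weight."""
    return fp16_size_gb * bits / 16


fp16_gb = 260  # approximate size quoted in the comment above
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit: ~{quantized_size_gb(fp16_gb, bits):.0f} GB")
```

This covers weights only; the KV cache and runtime overhead need additional memory on top.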

ein0p 2 years ago | | |

Nope. Just the weights would take 88GB at 4 bit. 128GB MBP ought to be able to run it. If I were to guess, a version for Apple MLX should be available within a few days, for those of us fortunate enough to own such a thing.

varunvummadi 2 years ago |

They just announced their new model on Twitter, which you can download via torrent.

aurareturn 2 years ago |

Might be a dumb question but does this mean this model has 176B params?

idiliv 2 years ago | |

In Mixtral 8x7B, the 8 means that the model uses Mixture-of-Experts (MoE) layers with 8 experts. The 7B means that if you were to remove 7 of the 8 experts in each layer, then you would end up with a 7B model (which would have exactly the same architecture as Mistral 7B). Therefore, a 1x7B model has 7B params. An 8x7B model has 1 * 7B + (8-1) * sz_expert params, where sz_expert is some constant value that the MoE layers increase by when adding one expert. In the case of Mixtral 8x7B the model size is 46.3GB, so, sz_expert ≈ 5.6B.

If these assumptions port over to 8x22B, then 8x22B has, at 281GB, sz_expert ≈ 13.8B.

KTibow 2 years ago | | |

I tried to check this for myself.

I agree on the first one: (46.3 - 7) / 7 = 5.61b.

The second one doesn't match up: (281 - 22) / 7 = 37b, or (140.5 - 22) / 7 = 16.93b. Am I doing something wrong?

idiliv 2 years ago | | |

Oh, and to answer your actual question: Assuming that the model is released with 16 bits (2 bytes) per parameter, it has 281GB / 2 bytes = 140.5B parameters.
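The arithmetic in this subthread can be sketched as follows, using the formula from the parent comment (total = base + (n_experts - 1) * sz_expert) and assuming 16-bit weights, i.e. 2 bytes per parameter. The sizes are the ones quoted in the comments; the function names are made up for illustration.

```python
def params_billions(file_size_gb: float, bytes_per_param: float = 2) -> float:
    """Parameter count in billions implied by a checkpoint size in GB."""
    return file_size_gb / bytes_per_param


def expert_size_billions(total_b: float, base_b: float, n_experts: int = 8) -> float:
    """Solve total = base + (n_experts - 1) * sz_expert for sz_expert."""
    return (total_b - base_b) / (n_experts - 1)


total_8x22b = params_billions(281)  # 281GB at 2 bytes per parameter
print(total_8x22b)
print(round(expert_size_billions(46.3, 7), 2))          # 8x7B figures from the thread
print(round(expert_size_billions(total_8x22b, 22), 2))  # 8x22B under the same formula
```

Whether the 22B "base" figure carries over to 8x22B in the same way is itself an assumption made in the thread, so the last number is only as good as that premise.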

hovering_nox 2 years ago | |

8x7 had 46B or so.

resource_waste 2 years ago |

What is the excitement around models that aren't as good as llama?

This is clearly an inferior model that they are willing to share for marketing purposes.

If it was an improvement over llama, sure, but it seems like just an ad for bad AI.

Me1000 2 years ago | |

Mixtral 8x7b was way better than llama2 70b and used less RAM and compute at the same time. This model is way better than llama.

In fact I would go as far as saying llama2 isn’t that good compared to some of the most recent models.

jeppebemad 2 years ago | |

We use their earlier Mixtral model because it outperforms llama for our use case. They do not release full models for marketing purposes, though it definitely grabs attention! You may need to revise your views..

cma 2 years ago | |

It beats llama on the benchmark posted below (though maybe leaked into training data). But also you can run it on cheaper split up hardware with less individual vram than the big llama.

zone411 2 years ago | |

What makes you think it's not as good as LLaMA? It's likely much better. There are multiple open-weight models that are better than LLaMA 2 out there already.

swalsh 2 years ago |

Is this Mistral large?

Jackson__ 2 years ago | |

Unlikely, this model has a max sequence length of 65k, while mistral large is 32k.

varunvummadi 2 years ago | |

Not sure. I'm trying to download the torrent and check it out.

fbdab103 2 years ago | | |

For those of us without twitter, how many GB is the model?

stainablesteel 2 years ago |

has anyone had success making an auto-gpt concept for mistral/llama models? i haven't found one

dkasper 2 years ago | |

Has anyone had success making an auto-gpt with any models? Besides toy use cases

danenania 2 years ago | | |

I built one using GPT-4[1]. It's not perfect but is working quite well and is now being used by hundreds of users, apart from me, to work on real, non-toy tasks. For example, I used it to build most of a production-ready AWS infrastructure (and accompanying deploy script) with the AWS CDK.

I want to add Mistral support soon, probably via together.ai or a similar service.

1 - https://github.com/plandex-ai/plandex

angilly 2 years ago |

The lack of a corresponding announcement on their blog makes me worry about a Twitter account compromise and a malicious model. Any way to verify it’s really from them?

simonw 2 years ago | |

Their https://twitter.com/MistralAI account has 5 tweets since the account opened, three of which were model release magnet links.

https://twitter.com/MistralAILabs is their other Twitter account, which is very slightly more useful though still very low traffic.

swyx 2 years ago | |

you must be new to mistral releases. they invented the magnet first blog later meta

angilly 2 years ago | | |

At 3:30a France local? Alrighty. I still wait a lil bit ;)

llm_trw 2 years ago | |

This is how they released every model so far.

tjtang2019 2 years ago |

What are the advantages compared to GPT? Looking forward to using it!

qball 2 years ago | |

>What are the advantages compared to GPT?

It actually does what you tell it, and won't try to silently change your prompt to conform to a specific flavor of Californian hysterics, which is what OpenAI's products do.

Also, since it's a local model, your queries aren't being datamined nor can access to the service be revoked on a whim.