Mistral "Mixtral" 8x7B 32k model [magnet]

Mistral "Mixtral" 8x7B 32k model [magnet](twitter.com)

546 points by xyzzyrz 2 years ago | 239 comments

In other llm news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... Very good:

https://huggingface.co/fblgit/una-xaberius-34b-v1beta

https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16

I mention this because it could theoretically be applied to Mistral Moe. If the uplift is the same as regular Mistral 7B, and Mistral Moe is good, the end result is a scary good model.

This might be an inflection point where desktop-runnable OSS is really breathing down GPT-4's neck.

eurekin 2 years ago | |

I just played with 7b version. It really feels different than anything I tried before. It could explain a docker compose file. It generated a simple vue application component.

I asked around a bit about the example and it was strangely coherent and focused across the whole conversation. It was really well detecting, where I'm starting a new thread (without clearing a context) or referring to things before.

It caught me off guard as well with this:

> me: What does following mean [content of the docker compose]

> cybertron-7b: In the provided YAML configuration, "following" refers to specifying dependencies

I've never seen any model using my exact wording in quotes in conversation like that.

mark_l_watson 2 years ago | | |

How did you run it? Are there model files in Ollama format? Are you running on NVidia or Apple Silicon?

EDIT: just saw this “ Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”

brucethemoose2 2 years ago | | |

Yeah, the Yi version is quite something too.

nikvdp 2 years ago | |

This piqued my interest so I made an ollama modelfile of it for the smallest variant (from TheBloke's GGUF [1] version). It does indeed seem impressively gpt4-ish for such a small model! Feels more coherent than openhermes2.5-mistral which was my previous goto local llm.

If you have ollama installed you can try it out with `ollama run nollama/una-cybertron-7b-v2`.

[1]: https://huggingface.co/TheBloke/una-cybertron-7B-v2-GGUF

fblgit 2 years ago | |

Correct. UNA can align the MoE at multiple layers, experts, nearly any part of the neural network I would say. Xaberius 34B v1 "BETA".. is the king, and its just that.. the beta. I'll be focusing on the Mixtral, its a christmas gift.. modular in that way, thanks for the lab @mistral!

brucethemoose2 2 years ago | | |

Do a Yi 200K version as well! That would make my Christmas, as Mistral Moe is only maybe 32K.

inciampati 2 years ago | | |

Do you have any docs describing the method?

stavros 2 years ago | |

Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?

sbierwagen 2 years ago | | |

If you don't like machine evaluations, you can take a look at the lmsys chatbot arena. You give a prompt, two chatbots answer anonymously, and you pick which answer is better: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

On the human ratings, three different 7B LLMs (Two different Openchat models and a Mistral fine tune) beat a version of GPT-3.5.

(The top 9 chatbots are GPT and Claude versions. Tenth place is a 70B model. While it's great that there's so much interest in 7B models, and it's incredible that people are pushing them so far, I selfishly wish more effort would go into 13B models... since those are the biggest that my macbook can run.)

puttycat 2 years ago | | |

I wonder how it will rank on benchmarks which are password-protected to prevent test contamination, for example: https://github.com/taucompling/bliss

brucethemoose2 2 years ago | | |

Yes, absolutely. I was just preaching this.

But its not totally irrelevant. They are still a datapoint to consider with some performance correlation. YMMV, but these models actually seem to be quite good for the size in my initial testing.

typon 2 years ago | | |

Yes. The only thing that is relevant is a hidden benchmark that's never released and run by a trusted third party.

nabakin 2 years ago | | |

More or less. The automated benchmarks themselves can be useful when you weed out the models which are overfitting to them.

Although, anyone claiming a 7b LLM is better than a well trained 70b LLM like Llama 2 70b chat for the general case, doesn't know what they are talking about.

In the future will it be possible? Absolutely, but today we have no architecture or training methodology which would allow it to be possible.

You can rank models yourself with a private automated benchmark which models don't have a chance to overfit to or with a good human evaluation study.

Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard because Mistral 7b is lower than Llama 2 70b chat.

screye 2 years ago | |

Yeah, and Mistral doesn't particularly care about lobotomizing the model with 'safety-training'. So it can achieve much better performance per-parameter than anthropic/google/OpenAI while being more steerable as well.

behnamoh 2 years ago | | |

until Mistral gets too big for lawyers to ignore.

_boffin_ 2 years ago | |

Interesting. One thing i noticed is that Mistral has a `max_position_embeddings` of ~32k while these have it at 4096.

Any thoughts on that?

brucethemoose2 2 years ago | | |

Is complicated.

The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.

The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.

whimsicalism 2 years ago | |

DPO is pretty good as well.

I think that the '7b beating 70b' is mostly due to the fact that Mistral is likely trained on considerably more tokens than Chinchilla optimal. So is llama-70b, but not to the same degree.

3abiton 2 years ago | |

HF leaderboards are rarely reflective of real world performance especially in small variations, but nonetheless, this is promising. What are the HW requirements for this latest Mistral7B?

eyegor 2 years ago | | |

Any 7b can run well (~50 tok/s) on an 8gb gpu if you tune the context size. 13b can sometimes run well but typically you'll end up with a tiny context window or slow inference. For cpu, I wouldn't recommend going above 1.3b unless you don't mind waiting around.

brucethemoose2 2 years ago | | |

> What are the HW requirements for this latest Mistral7B

Pretty much anything with ~6-8GB of memory that's not super old.

It will run on my 6GB laptop RTX 2060 extremely quickly. It will run on my IGP or Phone with MLC-LLM. It will run fast on a laptop with a small GPU, with the rest offloaded to CPU.

Small, CPU only servers are kinda the only questionable thing. It runs, just not very fast, especially with long prompts (which are particularly hard for CPUs). There's also not a lot of support for AI ASICs.

swyx 2 years ago | |

what is neural alignment? who came up with it?

brucethemoose2 2 years ago | | |

@fblgit apparently, from earlier in this thread.

BryanLegend 2 years ago |

Andrej Karpathy's take:

New open weights LLM from @MistralAI

params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2

Likely related code: https://github.com/mistralai/megablocks-public

Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

If people are wondering why there is so much AI activity right around now, it's because the biggest deep learning conference (NeurIPS) is next week.

https://twitter.com/karpathy/status/1733181701361451130

henrysg 2 years ago | |

> Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

crakenzak 2 years ago | |

> it's because the biggest deep learning conference (NeurIPS) is next week.

Can we expect some big announcements (new architectures, models, etc) at the conference from different companies? Sorry, not too familiar what the culture for research conferences is.

jbarrow 2 years ago | | |

Typically not. Google as an example: the transformer paper (Vaswani et al., 2017) was arxiv'd in June of 2017, and NeurIPS (the conference in which it was published) was in December of that year; BERT (Devlin et al., 2019) was similarly arxiv'd before publication.

Recent announcements from companies tend to be even more divorced from conference dates, as they release anemic "Technical Reports" that largely wouldn't pass muster in a peer review.

GaggiX 2 years ago | |

>-hidden_dim / dim = 14336/4096 => 3.5X MLP expand

>- n_heads / n_kv_heads = 32/8 => 4X

These two are exactly the same as the old Mistral-7B

Der_Einzige 2 years ago | |

Also, because EMNLP 2023 is happening right now.

aubanel 2 years ago |

Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini!

brucethemoose2 2 years ago | |

I will take weights over docs.

Its does remind me how some Google employee was bragging that they disclosed the weights for the Gemini, and only the small mobile Gemini, as if that's a generous step over other companies.

refulgentis 2 years ago | | |

I don't think that's true, because quite simply, they have not.

I am 100% in agreement with your viewpoint, but feel squeamish seeing an un-needed lie coupled to it to justify it. Just so much Othering these days.

whimsicalism 2 years ago | | |

they did not disclose the weights for any gemini, you must have misunderstood

maremmano 2 years ago |

Do you need some fancy announcement? let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph...

eurekin 2 years ago | |

I find that a way more bold and confident than dropping a obviously manipulated and unrealistic marketing page or video

maremmano 2 years ago | | |

Frankly I don't know why Google continues to act this way. Let's remind the "Google Duplex: A.I. Assistant Calls Local Businesses To Make Appointments" story. https://www.youtube.com/watch?v=D5VN56jQMWM

Not that this affects Google's user base in any way, at the moment.

seydor 2 years ago | |

FILE_ID.DIZ

nulld3v 2 years ago |

Looks to be Mixture of Experts, here is the params.json:

    {
        "dim": 4096,
        "n_layers": 32,
        "head_dim": 128,
        "hidden_dim": 14336,
        "n_heads": 32,
        "n_kv_heads": 8,
        "norm_eps": 1e-05,
        "vocab_size": 32000,
        "moe": {
            "num_experts_per_tok": 2,
            "num_experts": 8
        }
    }

sockaddr 2 years ago | |

What does expert mean in this context?

moffkalast 2 years ago | | |

It means it's 8 7B models in a trench coat in a sense, it runs as fast as a 14B (2 experts at a time apparently) but takes up as much memory as a 40B model (70% * 8 * 7B). There is some process trained into it that chooses which experts to use based on the question posed. GPT 4 is allegedly based on the same architecture, but at 8*222B.

sp332 2 years ago | |

I don't see any code in there. What runtime could load these weights?

brucethemoose2 2 years ago | | |

Its presumably llama just like Mistral.

Everything open source is llama now. Facebook all but standardized the architecture.

I dunno about the moe. Is there existing transformers code for that part? It kinda looks like there is based on the config.

sigmar 2 years ago |

Not exactly similar companies in terms of their goals, but pretty hilarious to contrast this model announcement with Google's Gemini announcement two days ago.

cuuupid 2 years ago |

Stark contrast with Google's "all demo no model" approach from earlier this week! Seems to be trained off Stanford's Megablocks: https://github.com/mistralai/megablocks-public

MyFirstSass 2 years ago |

Hot take but Mistral 7B is the actual state of the art of LLM's.

ChatGPT 4 is amazing yes and i've been a day 1 subscriber, but it's huge, runs on server farms far away and is more or less a black box.

Mistral is tiny, and amazingly coherent and useful for it's size for both general questions and code, uncensored, and a leap i wouldn't have believed possible in just a year.

I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

jpdus 2 years ago |

We now have a (experimental) working HF version here: https://huggingface.co/DiscoResearch/mixtral-7b-8expert

manojlds 2 years ago |

Google - Fake demo

Mistral - magnet link and that's it

mareksotak 2 years ago |

Some companies spend weeks on landing pages, demos and cute thought through promo videos and then there is Mistral, casually dropping a magnet link on Friday.

tananaev 2 years ago | |

I'm sure it's also a marketing move to build a certain reputation. Looks like it's working.

HlessClaudesman 2 years ago | | |

Not geoblocking the entirety of Europe also makes them stand out like a ringmaster amongst clowns.

throwaway4aday 2 years ago | | |

technically, it is marketing but at this level marketing is indistinguishable from shipping

tarruda 2 years ago | |

I'm curious about their business model.

jorge-d 2 years ago | | |

Well so far their business model seems to be mostly centered about raising money[1]. I do hope they succeed in becoming a succesful contender against OpenAI.

[1] https://www.bloomberg.com/news/articles/2023-12-04/openai-ri...

nuz 2 years ago | | |

They can make plenty by offering consulting fees for finetuning and general support around their models.

leobg 2 years ago |

I love Mistral.

It’s crazy what can be done with this small model and 2 hours of fine tuning.

Chatbot with function calling? Check.

90 +% accuracy multi label classifier, even when you only have 15 examples for each label? Check.

Craaaazy powerful.

leodriesch 2 years ago | |

Could you link me to a finetune optimized for function calling? I was looking for one a few weeks ago but did not find any.

leobg 2 years ago | | |

See sibling comment.

jeanloolz 2 years ago | |

Can you point me to a function calling fine tune mistral model? This is the only feature that keeps me from migrating away from openai. I searched a few time but could not find anything in HG

leobg 2 years ago | | |

Can’t share the model, since it was trained for a client. I don’t know if any public datasets exist. But Mistral will learn what you throw at it. So if you build a dataset of chat conversations that contains, say, answers in the form of {“answer”:”The answer”, “image”:”Prompt for stable diffusion”}, you’ll get a model that can generate images, and also will know when to use that capability. It’s insane how well that works.

kcorbitt 2 years ago |

No public statement from Mistral yet. What we know:

- Mixture of Experts architecture.

- 8x 7B parameters experts (potentially trained starting with their base 7B model?).

- 96GB of weights. You won't be able to run this on your home GPU.

fortunefox 2 years ago |

Releasing a model with a magnet link and some ascii art gives me way more confidence in the product than any OpenAI blog post ever could.

Excited to play with this once it's somewhat documented on how to get it running on a dual 4090 Setup.

seydor 2 years ago |

looks like they're too busy being awesome. i need a fake video to understand this!

What memory will this need? I guess it won't run on my 12GB of vram

"moe": {"num_experts_per_tok": 2, "num_experts": 8}

I bet many people will re-discover bittorrent tonight

brucethemoose2 2 years ago | |

Looks like it will squeeze into 24GB once the llama runtimes work it out.

Its also a good candidate for splitting across small GPUs, maybe.

One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU and the downstream expert model weights on the CPU /IGP. This is actually pretty efficient, as the CPU/IGP is really bad at the prompt ingestion but reasonably fast at ~14B token generation.

Llama.cpp all but already does this, I'm sure MLC will implement it as well.

syntaxing 2 years ago | |

BitTorrent was the craze when llama was leaked on torrent. Then Facebook started taking down all huggingface repos and a bunch of people transitioned to torrent released temporarily. llama 2 changed all this but it was a fun time.

dzhulgakov 2 years ago |

You can try Mixtral live at https://app.fireworks.ai/ (soon to be faster too)

Warning: the implementation might be off as there's no official one. We at Fireworks tried to reverse-engineer model architecture today with the help of awsome folks from the community. The generations look reasonably good, but there might be some details missing.

If you want to follow the reverse-engineering story: https://twitter.com/dzhulgakov/status/1733330954348085439

_fizz_buzz_ 2 years ago |

Does anybody have a tutorial or documentation how I can run this and play around with this locally. A „getting started“ guide of sorts?

0cf8612b2e1e 2 years ago | |

Even better if a llamafile gets released.

YetAnotherNick 2 years ago |

86 GB. So it's likely a Mixture of experts model with 8 experts. Exciting.

tarruda 2 years ago | |

Damn, I was hoping it was still a single 7B model that I would be able to run on my GPU

renonce 2 years ago | | |

You can, wait for a 4-bit quantized version

tarruda 2 years ago |

Still 7B, but now with 32k context. Looking forward to see how it compares with the previous one, and what the community does with it.

MacsHeadroom 2 years ago | |

Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).

dragonwriter 2 years ago | | |

Given the config parametes posted, its 2 experts per token, so the conputation cost per token should be the cost of the conponent that selects experts + 2× cost of a 7B model.

brucethemoose2 2 years ago | |

We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marked as a 32K one.

seydor 2 years ago | |

unfortunately too big for the broader community to test. Will be very interesting to see how well it performs compared to the large models

brucethemoose2 2 years ago | | |

Not really, looks like a ~40B class model which is very runnable.

balnazzar 2 years ago |

Might be relevant: https://twitter.com/dzhulgakov/status/1733217065811742863.

Anyway, if the vanilla version requires 2x80gb cards, I wonder how would it run on a M2 Ultra 192gb Mac Studio.

Anyone having the machine could try?

cloudhan 2 years ago |

Might be the training code related with the model https://github.com/mistralai/megablocks-public/tree/pstock/m...

cloudhan 2 years ago | |

Mixtral-8x7B support --> Support new model

https://github.com/stanford-futuredata/megablocks/pull/45

udev4096 2 years ago |

https://nitter.rawbit.ninja/MistralAI/status/173315051239503...

smlacy 2 years ago |

https://nitter.net/MistralAI/status/1733150512395038967

lxe 2 years ago |

If anyone can help running this, would be appreciated. Resources so far:

- https://github.com/dzhulgakov/llama-mistral

swah 2 years ago |

Kinda following all this stuff from outside w/o really understanding, but why are these things released like this, instead of "competing ChatGPTs apps" with higher and higher quality/costs? Could be open sourced but also hosted version that is maybe 5 usd/minute - if the results are great I guess people would pay the fair price...

Is it mainly because its hard to apply the limitations so that it doesn't spit out bomb making instructions?

maremmano 2 years ago |

Who know if I can run this on MBC Pro M3 max 128gb? at what TPS?

marci 2 years ago | |

If I understand correctly:

RAM Wise, you can easily run a 70b with 128GB, 8x7B is obviously less than that.

Compute wise, I suppose it would be a bit slower than running a 13b.

edit: "actually", I think it might be faster than a 13b. 8 random 7b ~= 115GB, Mixtral is under 90. I will have to wait for more info/understanding.

treprinum 2 years ago | |

I would say so based on LLaMA 2 70B; if it's 8x inference in MoE then I guess you'd see <20 tokens/sec?

M4v3R 2 years ago | |

Big chance that you’ll be able to run it using Ollama app soon enough.

deoxykev 2 years ago | |

I would like to know this as well.

udev4096 2 years ago |

based mistral casually dropping a magnet link

sergiotapia 2 years ago |

Stuck on "Retrieving data" from the Magnet link and "Downloading metadata" when adding the magnet to the download list.

I had to manually add these trackers and now it works: https://gist.github.com/mcandre/eab4166938ed4205bef4

Jayakumark 2 years ago |

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen

politician 2 years ago |

Honest question: Why isn't this on Huggingface? Is this one a leaked model with a questionable training or alignment methodology?

EDIT: I mean, I guess they didn't hack their own twitter account, but still.

kcorbitt 2 years ago | |

It'll be on Huggingface soon. This is how they dropped their original 7B model as well. It's a marketing thing, but it works!

politician 2 years ago | | |

Ah, well, ok. I appreciate the torrent link -- much faster distribution.

politician 2 years ago | | |

@kcorbitt Low priority, probably not worth an email: Does using OpenPipe.ai to fine-tune a model result in a downloadable LoRA adapter? It's not clear from the website if the fine-tune is exportable.

asolidtime1 2 years ago |

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen/b...

Holy shit, this is some clever marketing.

Kinda wonder if any of their employees were part of the warez scene at some point.

userbinator 2 years ago | |

They certainly got that aesthetic right; the only thing that stands out (but might be a necessity) is using real names instead of handles.

yodsanklai 2 years ago |

Can anyone explain what this means?

ukuina 2 years ago | |

Possibly a huge leap forward in open-source model capability. GPT4's prowess supposedly comes from strong dataset + RLHF + MoE (Mixture of Experts).

Mixtral brings MoE to an already-powerful model.

_uqgj 2 years ago |

multimodal? 32k context is pretty impressive, curious to test instructability

brucethemoose2 2 years ago | |

MistralLite is already 32K, and Yi 200K actually works pretty well out to at least 75K (the most I tested)

civilitty 2 years ago | | |

What kind of tests did you run out to that length? (Needle in haystack, summarization, structured data extraction, etc)

What is the max number of tokens in the output?

stevebmark 2 years ago |

Mistral Mixtral Model Magnet

lagniappe 2 years ago |

Magnet link says invalid for me

ahmetkca 2 years ago |

Let’s go multimodal

poulpy123 2 years ago |

is it eight 7b models in a trench coat ?

VerticalBox { padding: 20px; spacing: 10px; TextInput { id: input_field; placeholder_text: "Enter text here"; } Button { text: "Submit"; clicked => { // Handle the button click event println!("Input: {}", input_field.text()); } } } }