Mistral Small 3

620 points by jasondavies 1 year ago | 194 comments

simonw 1 year ago |

I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.

I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. This downloads 14GB of model weights:

  ollama run mistral-small:24b

Then using my https://llm.datasette.io/ tool (so I can log my prompts to SQLite):

  llm install llm-ollama
  llm -m mistral-small:24b "say hi"

More notes here: https://simonwillison.net/2025/Jan/30/mistral-small-3/

simonw 1 year ago | |

The API pricing is notable too: they dropped the prices by half from the old Mistral Small - that one was $0.20/million tokens of input, $0.60/million for output.

The new Mistral Small 3 API model is $0.10/$0.30.

For comparison, GPT-4o-mini is $0.15/$0.60.
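
To put those prices in concrete terms, here is a rough Python sketch that costs out one workload at the three price points quoted above. The workload size (2M input tokens, 6.5M output tokens) and the `job_cost` helper are assumptions for illustration, not part of any provider's API.

```python
# Prices in USD per million tokens (input, output), as quoted above.
PRICES = {
    "mistral-small (old)": (0.20, 0.60),
    "mistral-small-3": (0.10, 0.30),
    "gpt-4o-mini": (0.15, 0.60),
}

def job_cost(input_tokens, output_tokens, price_in, price_out):
    """Return the USD cost of a job given per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical workload: 2M tokens in, 6.5M tokens out.
for name, (p_in, p_out) in PRICES.items():
    print(f"{name}: ${job_cost(2_000_000, 6_500_000, p_in, p_out):.2f}")
```

For this output-heavy workload the totals come to roughly $4.30, $2.15 and $4.20, so the output price dominates the bill.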

85392_school 1 year ago | | |

Competition will likely be cheaper. (For context, Deepinfra runs larger 32B models at $0.07/$0.16)

isoprophlex 1 year ago | |

I make very heavy use of structured output (to convert unstructured data into something processable, eg for process mining on customer service mailboxes)

Is it any good for this, if you tested it?

I'm looking for something that hits the sweet spot of runs locally & follows prescribed output structure, but I've been quite underwhelmed so far

enkrs 1 year ago | | |

I thought structured output is a solved problem now. I've had consistent results with ollama structured outputs [1] by passing Zod schema with the request. Works even with very small models. What are the challenges you're facing?

[1] https://ollama.com/blog/structured-outputs
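
A minimal Python sketch of the same idea, using a plain JSON Schema instead of Zod. It only builds the request body for Ollama's `/api/chat` endpoint (with the schema in the `format` field) and checks a reply; nothing is sent over the network. The schema fields, the prompt wording and both helper names are hypothetical.

```python
import json

# Hypothetical schema: the fields we want back for each customer email.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "intent": {"type": "string"},
        "urgent": {"type": "boolean"},
    },
    "required": ["intent", "urgent"],
}

def build_chat_payload(model, email_text):
    """Build a JSON body for POST /api/chat that constrains output to the schema."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": "Extract the intent from this email:\n" + email_text}
        ],
        "format": TICKET_SCHEMA,  # JSON Schema goes in the `format` field
        "stream": False,
    }

def parse_reply(content):
    """Parse the reply text as JSON and verify the required keys are present."""
    data = json.loads(content)
    missing = [key for key in TICKET_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

payload = build_chat_payload("mistral-small:24b", "Please refund my last order.")
print(json.dumps(payload, indent=2))
```

Checking the reply on the client side is still worthwhile, since a truncated generation can produce incomplete JSON even with a schema in place.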

the_mitsuhiko 1 year ago | | |

I get decent JSON from it with the "assistant: {" trick. I'm not sure how well trained it is to do JSON. The template on ollama has tool calls, so I assume they made sure JSON works: https://ollama.com/library/mistral-small:24b/blobs/6db27cd4e...
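
A rough sketch of that prefill trick in Python: the assistant turn is started with `{` so the model continues a JSON object, and the prefix is glued back on before parsing. The `generate` callable is a hypothetical wrapper around whatever backend is in use, and whether a trailing assistant message is treated as a prefill depends on that backend.

```python
import json

def prefill_json(generate, prompt):
    """Request JSON by starting the assistant turn with '{' and parsing the result."""
    prefix = "{"
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": prefix},  # the prefill the model continues from
    ]
    # `generate` is a hypothetical callable: messages -> continuation text.
    continuation = generate(messages)
    return json.loads(prefix + continuation)

# Stand-in backend for demonstration; a real one would call the model.
def fake_generate(messages):
    return '"intent": "refund"}'

print(prefill_json(fake_generate, "Summarise this email as JSON."))
```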

starik36 1 year ago | | |

The only model that I've found to be useful in processing customer emails is o1-preview. The rest of the models work too, but don't get all the minutiae of the emails.

My scenario is pretty specific though and is all about determining intent (e.g. what does the customer want) and mapping it onto my internal structures.

The model is very slow, but definitely worth it.

d4rkp4ttern 1 year ago | | |

It does decently well actually. You can test function-calling using Langroid. There are several example scripts you could try from the repo, e.g.

    uv run examples/basic/tool-extract-short-example.py --model ollama/mistral-small

sample output: https://gist.github.com/pchalasani/662d7f13dbe690d6e2bfef01c...

Langroid has a ToolMessage mechanism that lets you specify a tool/fn-call using Pydantic, which is then transpiled into system message instructions.

mohsen1 1 year ago | | |

See function calling being called out here

https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...

mercer 1 year ago | | |

I've found phi4 to be very good for this.

rkwz 1 year ago | | |

What local models are you currently using and what issues are you facing?

halyconWays 1 year ago | |

Maybe I'm an outlier but I don't see much value in running tiny local models vs. using a more powerful desktop in my house to host a larger and far more usable model. I run Open WebUI and connect it to my own llama.cpp/koboldcpp that runs a 4-bit 70B model, and can connect to it anywhere easily with Tailscale. For questions that even 70B can't handle I have Open WebUI hit OpenRouter and can choose between all the flagship models.

Every time I've tried a tiny model it's been too questionable to trust.

kamranjon 1 year ago | | |

Have you tried Gemma 27b? I’ve been using it with llamafile and it’s pretty incredible. I think the winds are changing a bit and small models are becoming much more capable. Worth giving some of the smaller ones a shot if it’s been a while. I can run Gemma 27b on my 32gb MacBook Pro and it’s pretty capable with code too.

pks016 1 year ago | |

Question for people who have spent more time with these small models: what's the current best small model for extracting information from a large number of PDFs? I have multiple collections of research articles. I want to do two tasks: 1) extract info from the PDFs, 2) classify papers based on the content of the paper.

Or point me in the right direction.

themanmaran 1 year ago | | |

Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.

We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction, with Qwen 72B at around 70%. Smaller models will go down from there.

But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.

rahimnathwani 1 year ago | |

Given you have 64GB RAM, you could run mistral-small:24b-instruct-2501-q8_0

jhickok 1 year ago | |

Do you know how many tokens per second you are getting? I have a similar laptop that I can test on later but if you have that info handy let me know!

snickell 1 year ago | | |

M2 max with 64GB: 14 tokens/s running `ollama run mistral-small:24b --verbose`

prettyblocks 1 year ago | |

Hey Simon - In your experience, what's the best "small" model for function/tool calling? Of the ones I've tested they seem to return the function call even when it's not needed, which requires all kinds of meta prompting & templating to weed out. Have you found a model that more or less just gets it right?

simonw 1 year ago | | |

I'm afraid I don't have a great answer for that - I haven't spent nearly enough time with function calling in these models.

I'm hoping to add function calling to my LLM library soon which will make me much better equipped to experiment here.

jonas21 1 year ago | |

I don't understand the joke.

simonw 1 year ago | | |

It's hardly a joke at all. Even the very best models tend to be awful at writing jokes.

I find the addition of an explanation at the end (never a sign of a good joke) amusing at the meta-level:

  Why did the badger bring a puffin to the party?

  Because he heard puffins make great party 'Puffins'!

  (That's a play on the word "puffins" and the phrase "party people.")

emmelaich 1 year ago | | |

Apparently "party puffin" is a company that sells cheap party supplies and decorations. That's all that I can think of.

asb 1 year ago |

Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache:

We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-licensed models

tadamcz 1 year ago |

Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:

* run LLM evaluations systematically and at scale

* share the data with the public in a rigorous and transparent way

We use the UK government's Inspect [1] library to run the evaluations.

As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,592 samples represent 8.5M tokens (2M in, 6.5M out).

You can see the full transcripts here in Inspect’s interactive interface: https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...

Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.

It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.

[0]: https://epoch.ai/

[1]: https://github.com/UKGovernmentBEIS/inspect_ai

[2]: https://arxiv.org/abs/2103.03874

[3]: https://huggingface.co/datasets/TIGER-Lab/MathInstruct
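
The repeated-sampling scheme described above (8 samples per question) can be sketched as follows. This is only one plausible way to get a mean accuracy and a ± figure, treating questions as the independent unit; it is an assumption, not a description of how Epoch AI or Inspect actually compute their error bars. The toy data is made up.

```python
import math

def mean_accuracy_with_se(results):
    """Return (mean accuracy, standard error) from per-question sample outcomes.

    `results` has one entry per question; each entry is a list of 0/1
    outcomes, one per sample drawn for that question.
    """
    # Average the repeated samples within each question first.
    per_question = [sum(r) / len(r) for r in results]
    n = len(per_question)
    mean = sum(per_question) / n
    # Sample variance across questions, then standard error of the mean.
    variance = sum((p - mean) ** 2 for p in per_question) / (n - 1)
    return mean, math.sqrt(variance / n)

# Tiny made-up example: 4 questions, 8 samples each.
toy = [[1] * 8, [0] * 8, [1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0]]
mean, se = mean_accuracy_with_se(toy)
print(f"accuracy = {mean:.3f} +/- {se:.3f}")
```

The per-question averages also show consistency: a value of 0 or 1 means the model always gave the same verdict, while values near 0.5 mean it flipped between samples.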

mohsen1 1 year ago |

Not so subtle in function calling example[1]

        "role": "assistant",
        "content": "---\n\nOpenAI is a FOR-profit company.",

[1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...

spwa4 1 year ago |

So the point of this release is

1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version)

2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100)

3) Performance, according to mistral, somewhere between Qwen 2.5 32B and Llama 3.3 70B, roughly equal with GPT4o-mini

4) ollama run mistral-small (14G download) 9 tokens/sec on the question "who is the president of the US?" (also to enjoy that the answer ISN'T orange idiot)
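
The latency and throughput figures in that list are two views of the same number. A small Python sketch of the conversion (the helper names are made up for illustration):

```python
def tokens_per_second(latency_ms):
    """Convert per-token latency in milliseconds to tokens per second."""
    return 1000.0 / latency_ms

def latency_ms(tokens_per_sec):
    """Convert tokens per second to per-token latency in milliseconds."""
    return 1000.0 / tokens_per_sec

print(f"{tokens_per_second(11):.1f} tokens/s at 11 ms per token")
print(f"{latency_ms(9):.0f} ms per token at 9 tokens/s")
```

So the 9 tokens/sec seen locally corresponds to about 111 ms per token, roughly ten times the quoted 11 ms figure.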

freehorse 1 year ago |

I tried just a few of the code-generating prompts I have used over the last few days, and it looks quite good and promising. It seems at least on par with qwen2.5-coder-32b, which was the first local model I would actually use for code. I am also surprised how far we have come in the last year, with small models producing so much more polished output.

On another note, I also wish they would follow up with a new version of the 8x7B mixtral. It was one of my favourite models, but at the time it could barely fit in my ram, and now that I have more ram it is rather outdated. But I don't complain, this model anyway is great and it is great that they are one of the companies which actually publish such models targeted to edge computing.

msp26 1 year ago |

Finally, all the recent MoE model releases make me depressed with my mere 24GB VRAM.

> Note that Mistral Small 3 is neither trained with RL nor synthetic data

Not using synthetic data at all is a little strange

colonial 1 year ago | |

I recall seeing some complaints recently w.r.t. one of the heavily synthetic models (Phi?) - apparently they tend to overfit on STEM "book knowledge" while struggling with fuzzier stuff and instruction following.

I'm not much of an LLM user, though, so take my warmed over recollections with a grain of salt.

bloopernova 1 year ago | |

I'm surprised no GPU cards are available with like a TB of older/cheaper RAM.

gr3ml1n 1 year ago | | |

Not surprising at all: Nvidia doesn't want to compete with their own datacenter cards.

aurareturn 1 year ago | | |

Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

What good is 1TB RAM if the bandwidth is fed through a straw? Models would run very slow.

You can see this effect on 128GB MacBook Pros. Yes, the model will fit but it’s slow. 500GB/s of memory bandwidth feeds 128GB RAM at a maximum rate of 3.9x per second. This means if your model is 128GB large, your max tokens/s is 3.9. In the real world, it’s more like 2-3 tokens/s after overhead and compute. That’s too slow to use comfortably.

You’re probably wondering why not increase memory bandwidth too. Well, you need faster memory chips such as HBM and/or more memory channels. These changes will result in drastically more power consumption and bigger memory controllers. Great, you’ll pay for those. Now you’re bottlenecked by compute. Just add more compute? Ok, you just recreated the Nvidia H100 GPU. That’ll be $20k please.

Some people have tried to use AMD Epyc CPUs with 8 channel memory for inference but those are also painfully slow in most cases.
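
The bandwidth arithmetic above can be written down as a back-of-envelope bound. This sketch assumes a dense model generating one token at a time, so every token requires reading all the weights once; the 14GB case reuses the quantized download size mentioned elsewhere in the thread and is only an illustration.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    return bandwidth_gb_s / model_size_gb

# 128GB of weights on a 500GB/s machine, as in the example above.
print(f"{max_tokens_per_second(500, 128):.1f} tokens/s")

# Illustrative: a 14GB quantized model on the same machine.
print(f"{max_tokens_per_second(500, 14):.1f} tokens/s")
```

Real throughput lands below this bound once compute and overhead are included, as the comment notes.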

bugglebeetle 1 year ago |

Interested to see what folks do with putting DeepSeek-style RL methods on top of this. The smaller Mistral models have always punched above their weight and been the best for fine-tuning.

petercooper 1 year ago | |

It's not RL, but you can get a long way with a thorough system prompt to encourage it to engage in 'thinking' behavior on its own without extra training. Just playing with it myself now with promising results - Mistral Small seems very receptive to this approach (not all models are - cough, Llama).

Update: This is such a prompt: https://gist.github.com/peterc/955d797ee35b3c777d76a2d881d2f...

yodsanklai 1 year ago |

I'm curious, what do people do with these smaller models?

Beretta_Vexee 1 year ago | |

RAG mainly, feature extraction, tagging, document and e-mail classification. You don't need a 24B parameter model to know whether the e-mail should go to accounting or customer support.
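
A minimal Python sketch of that kind of e-mail routing. The label set, the prompt wording and the `generate` callable (any function that takes a prompt and returns the model's text) are all assumptions for illustration.

```python
LABELS = ("accounting", "customer_support")  # hypothetical routing targets

def build_prompt(email_text):
    """Ask for exactly one label so the answer is easy to validate."""
    return (
        "Classify the email into exactly one of: "
        + ", ".join(LABELS)
        + ".\nReply with the label only.\n\nEmail:\n"
        + email_text
    )

def classify_email(generate, email_text, fallback="customer_support"):
    """Call a hypothetical `generate(prompt) -> str` and normalise its answer."""
    answer = generate(build_prompt(email_text)).strip().lower()
    # Small models sometimes add extra words; fall back rather than guess.
    return answer if answer in LABELS else fallback

# Stand-in model for demonstration.
print(classify_email(lambda prompt: " Accounting\n", "Invoice 1234 is overdue."))
```

Validating the answer against a fixed label set is what makes a small model safe to use here, since any off-script reply is caught and sent to the fallback queue.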

Panoramix 1 year ago | | |

Would this work for non-text data? Like finding outliers in a time series or classifying trends, that kind of thing

pheeney 1 year ago | | |

What models would you recommend for basic classification if you don't need a 24B parameter one?

celestialcheese 1 year ago | |

Classification, tagging tasks. Way easier than older ML techniques and very fast to implement.

mattgreenrocks 1 year ago | | |

When compared against more traditional ML approaches, how do they fare in terms of quality?

ignoramous 1 year ago | |

Mistral repeatedly emphasizes "accuracy" and "latency" for this Small (24b) model, which to me means (and as they also point out):

- Local virtual assistants.

- Local automated workflows.

Also from TFA:

  Our customers are evaluating Mistral Small 3 across multiple industries, including:

  - Financial services customers for fraud detection
  - Healthcare providers for customer triaging
  - Robotics, automotive, and manufacturing companies for on-device command and control
  - Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.

frankfrank13 1 year ago | |

They're fast, I used 4o mini to run the final synthesis in a CoT app and to do initial entity/value extraction in an ETL. Mistral is pretty good for code completions too, if I was in the Cursor business I would consider a model like this for small code-block level completions, and let the bigger models handle chat, large requests, etc.

_boffin_ 1 year ago | |

Cleaning messy assessor data. Email draft generation.

superkuh 1 year ago | |

Not spend $6000 on hardware because they run on computers we already have. But more seriously, they're fine and plenty fun for making recreational IRC bots.

rahimnathwani 1 year ago |

Until today, no language model I've run locally on a 32GB M1 has been able to answer this question correctly: "What was Mary J Blige's first album?"

Today, a 4-bit quantized version of Mistral Small (14GB model size) answered correctly :)

https://ollama.com/library/mistral-small:24b-instruct-2501-q...

kamranjon 1 year ago | |

I just tried your question against Gemma 2 27b llamafile on my M1 Macbook with 32gb of ram, here is the transcript:

>>> What was Mary J Blige's first album?

Mary J. Blige's first album was titled *"What's the 411?"*.

It was released on July 28, 1992, by Uptown Records and became a critical and commercial success, establishing her as the "Queen of Hip-Hop Soul."

Would you like to know more about the album, like its tracklist or its impact on music?

rahimnathwani 1 year ago | | |

Ah! I had not tried any gemma models locally. It worked:

  % llm -m gemma2:27b-instruct-q4_0 "What was Mary J Blige's first album?"
  Mary J. Blige's first album was **"What's the 411?"** It was released in July 1992.
  
  Let me know if you have any other questions about Mary J. Blige!

cptcobalt 1 year ago |

This is really exciting: the 12-32b size range is my favorite on my home computer, and the mistrals have been historically great and embraced for various fine-tuning.

At 24b, I think this has a good chance of fitting on my more memory constrained work computer.

ericol 1 year ago | |

> the mistrals have been historically great and embraced for various fine-tuning

Are there any guides on fine tuning them that you can recommend?

ekam 1 year ago | | |

Unsloth is the one I personally hear the most about

timestretch 1 year ago |

Their models have been great, but I wish they'd include the number of parameters in the model name, like every other model.

jbentley1 1 year ago | |

It's 24B parameters

rcarmo 1 year ago |

There's also a 22b model that I appreciate, since it _almost_ fits into my 12GB 3060. But, alas, I might need to get a new GPU if this trend of fatter smaller models continues.

aargh_aargh 1 year ago | |

That's the older version (4 months old), check the release date.

rcarmo 1 year ago | | |

Ah. I need to find a tighter quantization then, if it exists at all.

GaggiX 1 year ago |

Hopefully they will finetune it using RL like DeepSeek did; it would be great to have more open reasoning models.

Alifatisk 1 year ago |

Is there a good benchmark one can look at that shows the best performing llm in terms of instruction following or overall score?

The only ones I am aware of are benchmarks on Twitter, Chatbot Arena [1] and the Aider benchmark [2]

1. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leade...

2. https://aider.chat/docs/leaderboards

mike31fr 1 year ago |

Running it on a MacBook with M1 Pro chip and 32 GB of RAM is quite slow. I expected it to be as fast as phi4 but it's much slower.

mike31fr 1 year ago | |

With eval rate numbers:

- phi4: 12 tokens/s

- mistral-small: 9 tokens/s

On Nvidia RTX 4090 laptop:

- phi4: 36 tokens/s

- mistral-small: 16 tokens/s

Terretta 1 year ago |

"When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a Macbook with 32GB RAM."

jszymborski 1 year ago | |

The trouble now is finding an RTX 4090.

hnuser123456 1 year ago | | |

RTX 3090s are easy to find and work just as well.

benkaiser 1 year ago | | |

Runs on an AMD 7900 XTX at about 20 tokens per second using LM Studio + Vulkan.

unraveller 1 year ago |

What's this stuff about the model catering to ‘80%’ of generative AI tasks? What model do they expect me to use for the other 20% of the time, when my question needs reasoning smarts?

sneak 1 year ago | |

There are APIs that use a very small model to determine the complexity of the request then route it to different apis or models based on the result of that classifier model.

This way you can do cheap/local automatically without the api client having to know anything about it, and the proxy will send the requests out to an expensive big model only when necessary.
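
A rough sketch of such a routing proxy in Python. In the setup described above the complexity check would itself be a very small model; here a word-count heuristic stands in for it, and all function names are hypothetical.

```python
def route(prompt, is_complex, small_model, large_model):
    """Send the prompt to the large model only when the classifier says so."""
    target = large_model if is_complex(prompt) else small_model
    return target(prompt)

def length_heuristic(prompt, max_words=50):
    """Stand-in classifier (an assumption): long prompts count as complex."""
    return len(prompt.split()) > max_words

# Stand-in backends; real ones would call a local model and a hosted API.
def cheap_local(prompt):
    return "small-model answer"

def expensive_remote(prompt):
    return "large-model answer"

print(route("What is 2 + 2?", length_heuristic, cheap_local, expensive_remote))
```

Because the caller only ever sees `route`, the choice of backend stays hidden from the API client, which is the point made above.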

abdullahkhalids 1 year ago | |

Crazy idea: a small super fast model whose only job is to decide which model to send your task to.

hahnchen 1 year ago | | |

Not so crazy, it sorta exists https://withmartian.com so it's probably a good idea to pursue

sneak 1 year ago | | |

This already exists but I forgot the name. It’s an api proxy.

zamadatix 1 year ago | |

Take your pick based on your use cases and needs?

xnx 1 year ago | |

Mistral Large

Havoc 1 year ago |

Used it a bit today on coding tasks and overall very pleasant. The combination of fast and fits into 24gb is also appreciated

Wouldn’t be surprised if this gets used a fair bit given open license

butz 1 year ago |

Is there a gguf version that could be used with llamafile?

simonw 1 year ago | |

A bunch have started showing up here: https://huggingface.co/models?other=base_model:quantized:mis...

The lmstudio-community ones tend to work well in my experience.

picografix 1 year ago |

Tried running it locally. Gone are the days when you would get broken responses from local models (I know this changed a while ago, but this is the first time I've tried in a long while).

adt 1 year ago |

https://lifearchitect.ai/models-table/

mrbonner 1 year ago |

Is there a chance for me to get an eGPU (external GPU dock) for my M1 16GB laptop to plunge through this model?

hnfong 1 year ago | |

The smaller IQ2/Q3 GGUF quants should run "fine" on your existing 16GB.

(also, I don't know that M1 supports any eGPU...)

rvz 1 year ago |

The AI race to zero continues to accelerate and Mistral has shown one card to just stay in the race. (And released for free)

OpenAI's reaction to DeepSeek looked more like cope and panic after they realized they're getting squeezed at their own game.

Notice how Google hasn't said anything with these announcements and didn't rush out a model nor did they do any price cuts? They are not in panic and have something up their sleeve.

I'd expect Google to release a new reasoning model that is competitive with DeepSeek and o1 (or matches o3). Would be even more interesting if they release it for free.

beAbU 1 year ago | |

Google has been consistently found with their finger up their nose during this entire AI bubble.

The reason why they are so silent is because they are still reacting to ChatGPT 3.5

jug 1 year ago | | |

Gemini 2.0 Experimental is now a leading LLM. They started out poorly, but after a more reasonable 1.5 Pro, 2.0 is in another class entirely and a direct competitor to o1 (or o1-mini as for Gemini 2.0 Flash). They've made quick strides forward as the DeepMind team is kicking into gear, and I feel like they're neglected a bit too often these days, especially now while usage cost on AI Studio is a nice $0.

christianqchung 1 year ago | | |

The Gemini launch was a complete disaster, but technically speaking since February 2024, Gemini 1.5 pro and the ensuing lineup have been very impressive.

staticman2 1 year ago | | |

Gemini 1.5 pro is extremely impressive at reading 2 million tokens of a document and answering questions about it. And at least for the time being it's offered for free on AI studio.

upbeat_general 1 year ago | | |

imo gemini-exp-1206 is the best public LLM that exists right now.

jiraiya0 1 year ago | |

Already tried it. It’s called gemini-2.0-flash-thinking-exp-01-21. Looks better than DeepSeek.

k__ 1 year ago | | |

R1 or V3?

Havoc 1 year ago |

How does that fit into a 4090? The files on the repo look way too large. Do they mean a quant?

cbg0 1 year ago | |

> Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.
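
A back-of-envelope Python sketch of why quantization is what makes this fit. It counts only the weights at a nominal number of bits per weight; real quantized files are somewhat larger because they store scales and keep some layers at higher precision, and the KV cache needs room on top. The helper name is made up.

```python
def weight_size_gb(n_params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8

# 24B parameters at common nominal precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_size_gb(24, bits):.0f} GB")
```

At 16 bits the weights alone are about 48 GB, well over the 24 GB of VRAM on an RTX 4090, while a 4-bit quant is about 12 GB nominally, in line with the 14GB download mentioned elsewhere in the thread.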

fuegoio 1 year ago |

Finally something from them

azinman2 1 year ago | |

They released codestral on Jan 13. What do you mean by “finally”?

strobe 1 year ago |

Not sure how much worse it is than the original, but mistral-small:22b-instruct-2409-q2_K seems to work on a 16GB VRAM GPU.

resource_waste 1 year ago |

Curious how it actually compares to LLaMa.

Last year Mistral was garbage compared to LLaMa. I needed a permissive license, so I was forced to use Mistral, but I had LLaMa that I could compare it to. I was always extremely jealous of LLaMa since the Berkeley Sterling finetune was so amazing.

I ended up giving up on the project because Mistral was so unusable.

My conspiracy was that there was some European patriotism that gave Mistral a bit more hype than was merited.

maven29 1 year ago | |

They're both European. Look at the author names on the llama paper.

resource_waste 1 year ago | | |

That is a very European thing to say/do/claim.

staticman2 1 year ago | |

I remember a huge improvement between the Mistral models that came out in January and February 2024 versus their later releases starting in July 2024, so it wouldn't surprise me if this was a good model for its size.

netdur 1 year ago |

seems on par or better than gpt4 mini

mariconrobot 1 year ago |

i cunt get past the name

fvv 1 year ago |

Given the new USA AI diffusion rules, will Mistral be able to survive and attract new capital? I mean, given that France is a top-tier country.

beAbU 1 year ago | |

This sounds like a USA problem, rather than a Mistral problem.

lkbm 1 year ago | | |

Not being able to attract capital would clearly be a Mistral problem.

solomatov 1 year ago | |

What are these ai diffusion rules?

Beretta_Vexee 1 year ago | | |

"Those destinations, which are listed in paragraph (a) to Supplement No. 5 to Part 740, are Australia, Belgium, Canada, Denmark, Finland, France, Germany, Ireland, Italy, Japan, the Netherlands, New Zealand, Norway, Republic of Korea, Spain, Sweden, Taiwan, the United Kingdom, and the United States. For these destinations, this IFR makes minimal changes: companies in these destinations generally will be able to obtain the most advanced ICs without a license as long as they certify compliance with specific requirements provided in § 740.27." [0]

France seems clearly exempt from most of the requirements. The main requirement of 740.27 is to sign a license under U.S. law, under which customers are prohibited from re-exporting ICs to non-Tier 1 countries without U.S. approval.

What's more, the text refers to AIs, which can have dual uses. The concept of dual civil-military use concerns a large number of technologies, and dates back to the first nuclear technologies.

The text gives a few examples of dual-use models, such as models that simulate or facilitate the production of chemical compounds that could be used for chemical weapon creation, non conventional weapon creation or that could simplify or replace already identified dual-use goods or technologies.

These uses are already covered by existing legislation on dual-use goods, and US export control. The American legislator is therefore potentially thinking of other uses, such as satellite and radar image analysis, and electronic warfare.

As France is a nuclear-armed country with its own version of those technologies, it makes little sense to place it under embargo.

But France isn't going to like being obliged once again to apply American law and regulation on its soil.

As a European, I hope that alternatives to American dependence will soon appear.

[0] https://www.federalregister.gov/documents/2025/01/15/2025-00...

m3kw9 1 year ago |

Sorry to dampen the news, but a 4o-mini-level model isn’t really useful for anything other than "talk to me for fun" type applications.