Mistral Small 3 (mistral.ai)
Last year Mistral was garbage compared to LLaMa. I needed a permissive license, so I was forced to use Mistral, but I had LLaMa to compare it against. I was always extremely jealous of LLaMa, since the Berkeley Starling finetune was so amazing.
I ended up giving up on the project because Mistral was so unusable.
My conspiracy theory was that some European patriotism gave Mistral a bit more hype than was merited.
France seems clearly exempt from most of the requirements. The main requirement of 740.27 is to sign a license under U.S. law, under which customers are prohibited from re-exporting ICs to non-Tier 1 countries without U.S. approval.
What's more, the text refers to AIs, which can have dual uses. The concept of dual civil-military use concerns a large number of technologies, and dates back to the first nuclear technologies.
The text gives a few examples of dual-use models, such as models that simulate or facilitate the production of chemical compounds that could be used to create chemical or other non-conventional weapons, or that could simplify or replace already identified dual-use goods or technologies.
These uses are already covered by existing legislation on dual-use goods, and US export control. The American legislator is therefore potentially thinking of other uses, such as satellite and radar image analysis, and electronic warfare.
As France is a nuclear-armed country with its own version of those technologies, it makes little sense to place it under embargo.
But France isn't going to like being forced, once again, to apply American law and regulation on its own soil.
As a European, I hope that alternatives to American dependence will soon appear.
[0] https://www.federalregister.gov/documents/2025/01/15/2025-00...
Merikan company with Merikan investment gets the credit. No one except Europeans cares where the interchangeable workers reside.
I'm trying to remember the other case where people lol'd at Europe/Italy for taking credit for something that was clearly invented in the US. I think the person was born there, and moved to the US, but Italy still took credit.
lol no. It's probably even more embarrassing that they left Europe.
Even if for now France is strong-armed into applying the same restrictions, they will be in a much better position than US companies if US-Europe relations deteriorate. Something that's not entirely unlikely under Trump. We are a week into his presidency and France is already talking about deploying troops to Greenland.
On the other hand, ICs could be subject to restrictions, and France has no alternative for sourcing large-capacity ICs.
The USA could use dollar-denominated transactions to broaden the scope of the text. It's not insurmountable, but it will complicate matters.
It's telling that the release from OpenAI today warns about exactly this threat in their lengthy security section: https://cdn.openai.com/o3-mini-system-card.pdf
I'm running it on an M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. This downloads 14GB of model weights:
ollama run mistral-small:24b
Then using my https://llm.datasette.io/ tool (so I can log my prompts to SQLite):
llm install llm-ollama
llm -m mistral-small:24b "say hi"
More notes here: https://simonwillison.net/2025/Jan/30/mistral-small-3/
The new Mistral Small 3 API model is $0.10/$0.30 per million input/output tokens.
For comparison, GPT-4o-mini is $0.15/$0.60.
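To put those two price pairs side by side, here is a quick sketch of the arithmetic. It assumes both pairs are USD per million input/output tokens; the workload size is made up, and `cost_usd` is just a helper name I picked.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in USD, with prices quoted per million tokens."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical workload: 2M input tokens, 1M output tokens.
mistral_small = cost_usd(2_000_000, 1_000_000, 0.10, 0.30)
gpt_4o_mini = cost_usd(2_000_000, 1_000_000, 0.15, 0.60)
print(f"Mistral Small 3: ${mistral_small:.2f}, GPT-4o-mini: ${gpt_4o_mini:.2f}")
```

For that made-up workload it comes out to roughly $0.50 versus $0.90, so the gap grows the more output-heavy the job is.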
Is it any good for this, if you tested it?
I'm looking for something that hits the sweet spot of runs locally & follows prescribed output structure, but I've been quite underwhelmed so far
My scenario is pretty specific though and is all about determining intent (e.g. what does the customer want) and mapping it onto my internal structures.
The model is very slow, but definitely worth it.
uv run examples/basic/tool-extract-short-example.py --model ollama/mistral-small
sample output:
https://gist.github.com/pchalasani/662d7f13dbe690d6e2bfef01c...
Langroid has a ToolMessage mechanism that lets you specify a tool/fn-call using Pydantic, which is then transpiled into system message instructions.
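This is not Langroid's actual API, but the general idea (declare the structure once, render it as instructions, then validate what comes back) can be sketched with only the standard library. Everything named here (`Intent`, `schema_instructions`, `parse_reply`) is hypothetical, and the model reply is hard-coded rather than fetched from an LLM.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Intent:
    """Hypothetical target structure for a customer-intent extraction."""
    intent: str
    product: str
    urgent: bool


def schema_instructions(cls) -> str:
    """Render the dataclass fields as system-message instructions."""
    spec = ", ".join(f'"{f.name}": <{f.type.__name__}>' for f in fields(cls))
    return f"Reply with a single JSON object only: {{{spec}}}"


def parse_reply(cls, reply: str):
    """Parse the model's reply and check field names and types."""
    data = json.loads(reply)
    expected = {f.name: f.type for f in fields(cls)}
    if set(data) != set(expected):
        raise ValueError(f"unexpected fields: {sorted(set(data) ^ set(expected))}")
    for name, typ in expected.items():
        if not isinstance(data[name], typ):
            raise ValueError(f"{name} should be {typ.__name__}")
    return cls(**data)


# Stand-in for what a local model might return.
reply = '{"intent": "refund", "product": "headphones", "urgent": true}'
print(schema_instructions(Intent))
print(parse_reply(Intent, reply))
```

A reply that fails validation raises `ValueError`, which is the natural place to retry the model or fall back to a bigger one.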
https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...
Every time I've tried a tiny model it's been too questionable to trust.
Or point me in the right direction.
We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction, with Qwen 72B at around 70%. Smaller models will go down from there.
But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title, or hard like array_of_all_citations?
I'm hoping to add function calling to my LLM library soon which will make me much better equipped to experiment here.
I find the addition of an explanation at the end (never a sign of a good joke) amusing at the meta-level:
Why did the badger bring a puffin to the party?
Because he heard puffins make great party 'Puffins'!
(That's a play on the word "puffins" and the phrase "party people.")
We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-licensed models
* run LLM evaluations systematically and at scale
* share the data with the public in a rigorous and transparent way
We use the UK government's Inspect [1] library to run the evaluations.
As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,592 samples represent 8.5M tokens (2M in, 6.5M out).
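For anyone wondering how repeated sampling can feed into an error bar, here is one common approach (I'm not claiming it is exactly what Inspect or Epoch computes): average the samples within each question, then take the standard error across questions, since samples of the same question are not independent. The data below is made up and `accuracy_with_se` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev


def accuracy_with_se(results: list[list[int]]) -> tuple[float, float]:
    """results[q][k] is 1 if sample k of question q was correct, else 0.

    Returns (mean accuracy, standard error clustered by question).
    """
    per_question = [mean(samples) for samples in results]
    acc = mean(per_question)
    se = stdev(per_question) / sqrt(len(per_question))
    return acc, se


# Toy data: 4 questions, 8 samples each.
toy = [
    [1] * 8,                    # always right
    [0] * 8,                    # always wrong
    [1, 1, 1, 1, 0, 0, 0, 0],   # right half the time
    [1, 0, 0, 0, 0, 0, 0, 0],   # right once
]
acc, se = accuracy_with_se(toy)
print(f"accuracy = {acc:.3f} ± {se:.3f}")
```

The per-question means also show consistency directly: values near 0 or 1 mean the model answers the same way every time, values near 0.5 mean it is flipping a coin.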
You can see the full transcripts here in Inspect’s interactive interface: https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...
Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.
It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.
[0]: https://epoch.ai/
[1]: https://github.com/UKGovernmentBEIS/inspect_ai
"role": "assistant",
"content": "---\n\nOpenAI is a FOR-profit company.",
[1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...
1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version)
2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100)
3) Performance, according to Mistral, somewhere between Qwen 2.5 32B and Llama 3.3 70B, roughly equal with GPT-4o-mini
4) ollama run mistral-small (14G download) 9 tokens/sec on the question "who is the president of the US?" (also to enjoy that the answer ISN'T orange idiot)
On another note, I also wish they would follow up with a new version of the 8x7B Mixtral. It was one of my favourite models, but at the time it could barely fit in my RAM, and now that I have more RAM it is rather outdated. But I'm not complaining: this model is great anyway, and it is great that they are one of the companies that actually publish such models targeted at edge computing.
> Note that Mistral Small 3 is neither trained with RL nor synthetic data
Not using synthetic data at all is a little strange
I'm not much of an LLM user, though, so take my warmed over recollections with a grain of salt.
What good is 1TB of RAM if the bandwidth is fed through a straw? Models would run very slowly.
You can see this effect on 128GB MacBook Pros. Yes, the model will fit but it’s slow. 500GB/s of memory bandwidth feeds 128GB RAM at a maximum rate of 3.9x per second. This means if your model is 128GB large, your max tokens/s is 3.9. In the real world, it’s more like 2-3 tokens/s after overhead and compute. That’s too slow to use comfortably.
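The back-of-the-envelope rule in that comment is easy to play with. This sketch assumes a dense model where every weight is read once per generated token, and it ignores compute, KV cache traffic, and overhead, so treat the result as a ceiling rather than a prediction. The 500GB/s figure is the one quoted above; `max_tokens_per_second` is a name I made up.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound for a dense model: all weights are read once per token."""
    return bandwidth_gb_s / model_size_gb


print(round(max_tokens_per_second(500, 128), 1))  # a model that fills 128GB
print(round(max_tokens_per_second(500, 14), 1))   # a 14GB 4-bit quantized model
```

The same bandwidth that caps a 128GB model at about 3.9 tokens/s gives a 14GB model a ceiling of roughly 35 tokens/s, which is why smaller quantized models feel so much more usable on the same machine.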
You’re probably wondering why not increase memory bandwidth too. Well, you need faster memory chips such as HBM and/or more memory channels. These changes will result in drastically more power consumption and bigger memory controllers. Great, you’ll pay for those. Now you’re bottlenecked by compute. Just add more compute? Ok, you just recreated the Nvidia H100 GPU. That’ll be $20k please.
Some people have tried to use AMD Epyc CPUs with 8 channel memory for inference but those are also painfully slow in most cases.
Update: This is such a prompt: https://gist.github.com/peterc/955d797ee35b3c777d76a2d881d2f...
- Local virtual assistants.
- Local automated workflows.
Also from TFA:
Our customers are evaluating Mistral Small 3 across multiple industries, including:
- Financial services customers for fraud detection
- Healthcare providers for customer triaging
- Robotics, automotive, and manufacturing companies for on-device command and control
- Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.
Today, a 4-bit quantized version of Mistral Small (14GB model size) answered correctly :)
https://ollama.com/library/mistral-small:24b-instruct-2501-q...
>>> What was Mary J Blige's first album?
Mary J. Blige's first album was titled *"What's the 411?"*.
It was released on July 28, 1992, by Uptown Records and became a critical and commercial success, establishing her as the "Queen of Hip-Hop Soul."
Would you like to know more about the album, like its tracklist or its impact on music?
% llm -m gemma2:27b-instruct-q4_0 "What was Mary J Blige's first album?"
Mary J. Blige's first album was **"What's the 411?"** It was released in July 1992.
Let me know if you have any other questions about Mary J. Blige!
At 24b, I think this has a good chance of fitting on my more memory-constrained work computer.
The only ones I am aware of are benchmarks on Twitter, Chatbot Arena [1] and the Aider benchmark [2]
1. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leade...
- phi4: 12 tokens/s
- mistral-small: 9 tokens/s
On Nvidia RTX 4090 laptop:
- phi4: 36 tokens/s
- mistral-small: 16 tokens/s
This way you can do cheap/local automatically without the api client having to know anything about it, and the proxy will send the requests out to an expensive big model only when necessary.
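A minimal sketch of that routing idea, with the caveat that the hard part (deciding when the cheap answer is not good enough) is reduced here to a pluggable check. All names are hypothetical, and the two "models" are plain functions standing in for real API calls.

```python
from typing import Callable

Model = Callable[[str], str]


def make_router(cheap: Model, expensive: Model,
                good_enough: Callable[[str, str], bool]) -> Model:
    """Return a callable that tries the cheap model first and escalates
    to the expensive one only if the cheap answer fails the check."""
    def route(prompt: str) -> str:
        answer = cheap(prompt)
        if good_enough(prompt, answer):
            return answer
        return expensive(prompt)
    return route


# Stand-ins for real model calls.
def local_model(prompt: str) -> str:
    return "" if len(prompt) > 40 else f"local: {prompt}"


def big_model(prompt: str) -> str:
    return f"big: {prompt}"


route = make_router(local_model, big_model, lambda p, a: bool(a.strip()))
print(route("say hi"))
print(route("a much longer prompt that the toy local model gives up on"))
```

Because `route` has the same signature as a single model, the client calling it never needs to know which backend answered.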
Wouldn’t be surprised if this gets used a fair bit given open license
The lmstudio-community ones tend to work well in my experience.
(also, I don't know that M1 supports any eGPU...)
OpenAI's reaction to DeepSeek looked more like cope and panic after they realized they're getting squeezed at their own game.
Notice how Google hasn't said anything with these announcements and didn't rush out a model nor did they do any price cuts? They are not in panic and have something up their sleeve.
I'd expect Google to release a new reasoning model that is competitive with DeepSeek and o1 (or matches o3). Would be even more interesting if they release it for free.
The reason why they are so silent is because they are still reacting to ChatGPT 3.5
It's a bit like developing a binary application and slapping a FOSS license on the binary while keeping the code proprietary. Not saying that's wrong or anything, but people reading these announcements tend to misunderstand what actually got FOSS licensed when the companies write stuff like this.
To consider just the power of fine tuning: all of the press DeepSeek have received is over their R1 model, a relatively tiny fine-tune on their open source V3 model. The vast majority of the compute and data pipeline work to build R1 was complete in V3, while that final fine-tuning step to R1 is possible even by an enthusiastic dedicated individual. (And there are many interesting ways of doing it.)
The insistence every time open sourced model weights come up that it is not "truly" open source is tiring. There is enormous value in open source weights compared to closed APIs. Let us call them open source weights. What you want can be "open source data" or somesuch.
This kind of purity test mindset doesn't help anyone. They are shipping the most modifiable form of their model.
Like every other open source / source available LLM?
No one's going to pay for an inferior closed model...
One question I have regarding evals: what sampling temperature and/or method do you use? As far as I understand, temperature and sampling method can impact model output a lot. Would love to hear your thoughts on how different settings of the same model can impact output, and how to go about evaluating models when it's not clear how to use them to their fullest.
For models we run ourselves from the weights, at the moment we'd use vLLM's defaults, but this may warrant more thought and adjustment. Other things being equal, I prefer to use an AI lab's API, with settings as vanilla as possible, so that we essentially defer to them on these judgments. For example, this is why we ran this Mistral model from Mistral's API instead of from the weights.
I believe the `temperature` parameter, for example, has different implementations across architectures/models, so it's not as simple as picking a single temperature number for all models.
However, I'm curious if you have further thoughts on how we should approach this.
By the way, in the log viewer UI, for any model call, you can click on the "API" button to see the payloads that were sent. In this case, you can see that we do not send any values to Mistral for `top_p`, `temperature`, etc.
I have used such models to structure human-generated data into something a script can then read and process, extracting the important aspects of the data (e.g. what time the human reported doing X, for how long, with whom, etc.) into something like a CSV file with columns for timestamps and whatever other variables I am interested in.
Note: from October; also I work at Airtrain
Over the last 12-18 months though, the instruction-following capabilities of the models have improved substantially. This new mistral model in particular is fantastic at doing what you ask.
My approach to this personally and professionally is to just benchmark. If I have a classification task, I use a tiny model first, eval both, and see how much improvement I'd get using an LLM. Generally speaking though, the VRAM costs are so high for the latter that it's often not worth it. It really is a case-by-case decision though. Sometimes you want one generic model to do a bunch of tasks rather than train/finetune a dozen small models that you manage in production instead.
Still, you can go from 0 to ~mostly~ clean data in a few prompts and iterations, vs potentially a few hours with a fine tuning pipeline for BERT. They can actually work well in tandem to bootstrap some training data and then use them together to refine your classification.
I tried GPT-4o; it's good, but it'll cost a lot if I want to process all the documents.
2. Give a few Sonnet or 4o input/output examples to haiku, 4o-mini, or any other smaller model. Giving good examples to smaller models can bring the output quality closer to (or on par with) the better model.
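A sketch of what that few-shot setup can look like. It assumes the common chat format of `role`/`content` dictionaries; `build_few_shot_messages`, the system prompt, and the example pairs are all made up for illustration, and the (input, output) pairs are meant to be ones you collected from the stronger model beforehand.

```python
def build_few_shot_messages(examples: list[tuple[str, str]], new_input: str,
                            system: str = "Extract the requested fields as JSON.") -> list[dict]:
    """Turn (input, output) pairs from a stronger model into a chat
    message list that a smaller model can imitate."""
    messages = [{"role": "system", "content": system}]
    for source_text, strong_output in examples:
        messages.append({"role": "user", "content": source_text})
        messages.append({"role": "assistant", "content": strong_output})
    messages.append({"role": "user", "content": new_input})
    return messages


# Hypothetical pairs previously produced by the larger model.
examples = [
    ("Invoice 17 from Acme, due 2025-02-01", '{"vendor": "Acme", "due": "2025-02-01"}'),
    ("Bill from Initech, pay by 2025-03-15", '{"vendor": "Initech", "due": "2025-03-15"}'),
]
messages = build_few_shot_messages(examples, "Invoice from Globex, due 2025-04-30")
print(len(messages), messages[-1]["role"])
```

The resulting list can be passed to whichever chat client you use for the smaller model; only the examples cost big-model tokens, and you pay for them once.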
Also, I don't buy the argument that because many in the ecosystem mislabel/mislead people about the licensing, it is ethically OK for everyone else to do so too.
Given that, I'd expect a single 3060 (if a large enough one existed) to run at about 16 tok/s, so 20 tok/s on two isn't bad given that they aren't NVLinked.
Hopefully it is at least quad-channel.
I guess I'm wary of the messaging because I'm a developer 99% thanks to FOSS, and being able to learn from FOSS projects how to build similar stuff myself. Without FOSS, I probably wouldn't have been able to "escape" the working-class my family was "stuck in" when I grew up.
I want to do whatever I can to make sure others have the same opportunity, and it doesn't matter if the weights themselves are FOSS or not, others cannot learn how to create their own models based on just looking at the weights. You need to be able to learn the model architecture, training and what datasets models are using too, otherwise you won't get very far.
> This kind of purity test mindset doesn't help anyone. They are shipping the most modifiable form of their model.
It does help others who might be stuck in the same situation I was stuck in, that's not nothing nor is it about "purity". They're not shipping the most open model they can, they could have done something like OLMo (https://github.com/allenai/OLMo) which can teach people how to build their own models from scratch.
I'm not sure I'd even call Llama "open weights". For me that would mean I can download the weights freely (you cannot download Llama weights without signing a license agreement) and use them freely. You cannot use them freely, and you need to add a notice from Meta/Llama to everything that uses Llama, saying:
> prominently display “Built with Llama” on a related website, user interface, blogpost, about page, or product documentation.
https://www.llama.com/llama3_2/license/
Not sure what the correct label is, but it's not open source nor open weights, as far as I can tell.
For someone who basically couldn't have become a developer without FOSS, this way of thinking is so backwards, especially on Hacker News. I thought we were pro-FOSS in general, but somehow LLMs get a pass because "they're too complicated and no one would build one from scratch".
Yes, it'd be nice if it was open and reproducible from start to finish. But let's not let perfect be the enemy of good.
"Let's not let companies exploit well-known definitions for their own gain" is what I'm going for, regardless if we personally gain from it or not.
The llm replies with a joke that is barely a joke.
The man says "another."
The llm gives another unfunny response.
"Another!"
Followed by another similarly lacking response.
"Another!"
With exasperation, the llm replies "stop badgering me!"
Except it won't, because that's not a high likelihood output. ;)
But there are a ton of models I can't run at all locally due to VRAM limitations. I'd take being able to run those models slower. I know there are some ways to get these running on CPU orders of magnitude slower, but ideally there's some sort of middle ground.
Agree that there is more value in open source weights than closed APIs, but what I really want to enable, is people learning how to create their own models from scratch. FOSS to me means being able to learn from other projects, how to build the thing yourself, and I wrote about why this is important to me here: https://news.ycombinator.com/item?id=42878817
It's not a puritan view but purely practical. Many companies started using FOSS as a marketing label (like what Meta does) and as someone who probably wouldn't be a software developer without being able to learn from FOSS, it fucking sucks that the ML/AI ecosystem is seemingly OK with the term being hijacked.
The thing you want, open source model data pipelines, is a different thing. Its existence in no way invalidates the concept of an open source model. Nothing has been hijacked.
Meta/Llama probably started the trend, and they still today say "The open-source AI models" and "Llama is the leading open source model family" which is grossly misleading.
You cannot download the Llama models or weights without signing a license agreement, you're not allowed to use it for anything you want, you need to add a disclaimer on anything that uses Llama (which almost the entire ecosystem breaks as they seemingly missed this when they signed the agreement) and so on, which to me goes directly against what FOSS means.
If you cannot reproduce the artifact yourself (again, granted you have the resources), you'd have a really hard time convincing me that that is FOSS.
If it had not been hijacked, then such articles would not exist.
Meta is falsely and deceptively, but also carefully, pretending to be Open Source.
The Open Source Definition – Open Source Initiative https://opensource.org/osd
What is Free Software? - GNU Project - Free Software Foundation https://www.gnu.org/philosophy/free-sw.html
Word "Open" as in "Open Source" - Words to Avoid (or Use with Care) Because They Are Loaded or Confusing https://www.gnu.org/philosophy/words-to-avoid.html#Open
Please refrain from using "open" or "open source" as a synonym for "free software." These terms originate from different perspectives and values. The free software movement advocates for your freedom in computing, grounded in principles of justice. The open source approach, on the other hand, does not promote a set of values in the same way. When discussing open source views, it's appropriate to use that term. However, when referring to our views, our software, or our movement, please use "free software" or "free (libre) software" instead. Using "open source" in this context can lead to misunderstandings, as it implies our views are similar to those of the open source movement.
It seems to me that open source weights enable everything the FOSS community is practically capable of doing.
You can still learn web development even though you don't have 10,000s of users with a large fleet of servers and distributed servers. Thanks to FOSS, it's trivial to go through GitHub and find projects you can learn a bunch from, which is exactly what I did when I started out.
With LLMs, you don't have a lot of options. Sure, you can download and fine-tune the weights, but what if you're interested in how the weights are created in the first place? Some companies are doing a good job (like the folks building OLMo) of creating those resources, but the others seem to just want to use FOSS because it's good marketing vs. OpenAI et al.
I'm not sure how someone would argue (in good faith) that training on copyrighted materials does not cause the weights to be a derivative of those materials and the output of their AI is not protected under copyright but the part in the middle, the weights, does fall under copyright.
Note that this would be about the weights (i.e. the numbers), not their container.
The opinion that AI output isn't copyrightable derives from the opinion of the US Copyright Office, which argues that AI output is more like commissioning an artist than like taking a picture. And since the artist isn't human they can't claim copyright for their work.
It's not at all obvious to me that the same argument would hold for the output of AI training. Never mind that the above argument about AI output is just the opinion of some US agency and hasn't been tested in court anywhere in the world.
The only defense these AI companies have is making the weights machine output and thus not copyrightable.
But then again that's the theory, the copyright system follows money and it wouldn't be surprising to have contradicting ideas being allowed.
Similarly, claiming copyright on AI output is like claiming copyright on something like `init_state(42, &s); for (int i=0; i < count; i++) output[i] = next_random(&s);`. While there is a bit of (theoretical) effort involved in choosing 42 as a starting input, ultimately you can't really claim copyright on a bunch of random numbers because you chose the initial seed value.
Of course you can claim copyright on the code, but doing the same on the output makes no sense: even if the idea of owning random numbers isn't absurd enough, consider what would happen if, say, 10000 people did the same thing (and to make things even more clear, what if `init_state` used only 8 bits of the given number, therefore making sure that there would be a lot of people ending up with the same numbers).
AI is essentially `init_state` and `next_random`, just with more involved algorithms than a random number generator.
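The thought experiment is easy to reproduce. This Python sketch uses the standard `random` module in place of the hypothetical `init_state`/`next_random` pair, and `generate` is a name I made up; the 8-bit mask mirrors the "only 8 bits of the given number" variant.

```python
import random


def generate(seed: int, count: int = 5) -> list[int]:
    """Deterministic 'work': the same seed always yields the same numbers."""
    rng = random.Random(seed & 0xFF)  # keep only the low 8 bits of the seed
    return [rng.randrange(1000) for _ in range(count)]


print(generate(42) == generate(42))        # same seed, same output
print(generate(42) == generate(42 + 256))  # different seed, same low 8 bits
```

Both comparisons print `True`: with only 256 possible internal states, anyone choosing a "different" seed has a good chance of landing on exactly the same output as someone else.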
https://en.wikipedia.org/wiki/Threshold_of_originality
Areas of dispute include photographs of famous paintings (is it more in the character of a photocopy?), photographs taken by animals (does the human get copyright if they deliberately created the situation where the animal would take a photograph?), and videos taken automatically (can a CCTV video have an author?)
Historically, the results are all over the place.
My concern in this thread is people rejecting the concept of open source model weights as not "true" open source, because there is more that could be open sourced. It discounts a huge amount of value model developers provide when they open source weights. You are doing a variant of that here by trying to claim a narrow definition of "free software". I don't have any interest in the FSF definition.
Finetuning weights and building infrastructure around that involves almost all the same things as building a model, except it's actually possible. That's where I've seen most small-scale FOSS development take place over the last few years.
Learning how to make a small website is useful, and so is the website.
Learning how to finetune a large GPT is useful, and so is the finetuned model.
Learning how to train a 124M GPT is useful, but the resulting model is useless.
Those are two completely different roles? One is mostly around infrastructure and the other is actual ML. There are people who know both, I'll give you that, but I don't think that's the default or even common. Fine-tuning is trivial compared to building your own model and deployments/infrastructure is something else entirely.