Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning (techcommunity.microsoft.com)
Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on Hugging Face already. I tried this one: https://huggingface.co/matteogeniaccio/phi-4/tree/main
I got it working with my LLM tool like this:
llm install llm-gguf
llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
llm chat -m gguf/phi-4-Q4_K_M
Here are some initial transcripts: https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...
More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
> Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner.
Phi is all about synthetic training, and a prompt -> SVG -> render -> evaluate image -> feedback loop feels like an ideal fit for synthetic learning.
You can push it quite far with things like basic 2D physics (plotting the scene after N seconds), optics/rays, magnetic force, etc.
SVG as LLM window to physical world.
What? Let’s try not to go full forehead into hype.
SVGs would be an awfully poor analogy for the physical world…
TL;DR it was nigh-impossible to get it to emit the proper "end of message" token. (IMHO the chat training was too rushed). So all the local LLM apps tried silently hacking around it. The funny thing to me was no one would say it out loud. Field isn't very consumer friendly, yet.
Nice. Thanks.
Do you think sampling the stack traces of millions of machines is a good dataset for improving code performance? Maybe sample android/jvm bytecode.
Maybe a sort of novelty sampling to avoid re-sampling hot-path?
I'd be interested to see any LLM emitting any kind of text-to-picture instructions get results that are beyond kindergartner-cardboard-cutout levels of art.
https://gist.github.com/uschen/38fc65fa7e43f5765a584c6cd24e1...
Copied SVG from gist into figma, added dark gray #444444 background, exported as PNG 1x.
Prompt:
Are you based on OpenAI?
Reply:
No, I am not based on OpenAI's models; I am developed by Microsoft. While both systems are designed for natural language processing and understanding, we may have different architectures, training methodologies, or datasets. However, our shared goal is to provide helpful and informative responses while ensuring user safety and privacy.<|user|>
Are you similar to ChatGPT?<|end|>
For example, how do you handle the phi-4 model's GGUF chat template?
You can click on the little info icon on Hugging Face to see that directly.
For https://huggingface.co/matteogeniaccio/phi-4/tree/main?show_... that's this:
{% for message in messages %}{% if
(message['role'] == 'system')
%}{{'<|im_start|>system<|im_sep|>' +
message['content'] + '<|im_end|>'}}{%
elif (message['role'] == 'user')
%}{{'<|im_start|>user<|im_sep|>' +
message['content'] +
'<|im_end|><|im_start|>assistant<|im_sep|>'}}{%
elif (message['role'] == 'assistant')
%}{{message['content'] + '<|im_end|>'}}{%
endif %}{% endfor %}
Is this not supposed to cause model collapse?
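To make the chat template quoted above easier to read, here is a minimal Python sketch of the same logic. It is only an illustration of what the Jinja template appears to do, not how llm-gguf or llama.cpp actually apply it; the function name `render_phi4_prompt` and the constant names are made up for this example.

```python
from typing import Dict, List

# Special tokens copied from the Jinja template quoted above.
IM_START = "<|im_start|>"
IM_SEP = "<|im_sep|>"
IM_END = "<|im_end|>"


def render_phi4_prompt(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper that mirrors the quoted template branch by branch."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(f"{IM_START}system{IM_SEP}{content}{IM_END}")
        elif role == "user":
            # A user turn also opens the assistant turn, so generation
            # continues directly after the trailing <|im_sep|>.
            parts.append(
                f"{IM_START}user{IM_SEP}{content}{IM_END}"
                f"{IM_START}assistant{IM_SEP}"
            )
        elif role == "assistant":
            # Earlier assistant replies are only closed; their opening
            # tokens were emitted by the preceding user turn.
            parts.append(f"{content}{IM_END}")
    return "".join(parts)


print(render_phi4_prompt([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Are you based on OpenAI?"},
]))
```

Note that the template uses <|im_end|> as the end-of-turn marker, whereas the transcript earlier in the thread shows <|user|> and <|end|> leaking into the output.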
For diffusion-based image generators, training only on synthetic data over repeated rounds of model training can cause model collapse, as errors in the output get amplified in the trained model. It is usually the 2nd or 3rd model created this way (with the output of the previous model used as training input for the next) that collapses.
It was found that using primary data alongside synthetic data avoided the model collapse. Likewise, some sort of human scoring/evaluation can help avoid artefacts.
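A toy illustration of the effect described above, under heavy assumptions: the "model" here is just a Gaussian fitted by maximum likelihood, not a diffusion model or an LLM, and all names (`fit_gaussian`, `train_generations`) and sizes are invented for the sketch. Each generation is trained on samples drawn from the previous generation, optionally mixed with a fixed set of primary data.

```python
import math
import random


def fit_gaussian(samples):
    """Maximum-likelihood mean and standard deviation of the samples."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(variance)


def train_generations(generations, n_synthetic, real_data, seed=0):
    """Refit a Gaussian repeatedly on its own samples plus optional real data.

    Returns the standard deviation of the final generation.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the true distribution
    for _ in range(generations):
        # Synthetic data comes from the previous generation's model.
        synthetic = [rng.gauss(mean, std) for _ in range(n_synthetic)]
        mean, std = fit_gaussian(list(real_data) + synthetic)
    return std


# Fixed "primary" data drawn once from the true distribution N(0, 1).
data_rng = random.Random(42)
real_data = [data_rng.gauss(0.0, 1.0) for _ in range(20)]

# Both runs train on 40 points per generation.
pure_std = train_generations(1000, 40, real_data=[])
mixed_std = train_generations(1000, 20, real_data=real_data)

print(f"synthetic only: final std = {pure_std:.3g}")
print(f"real + synthetic: final std = {mixed_std:.3g}")
```

In the synthetic-only run, the fitted spread shrinks toward zero because each refit loses a little variance and nothing restores it. In the mixed run, the fixed real samples keep anchoring the fit. Whether this carries over to large generative models is exactly what the comments here are debating.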
As far as I can tell model collapse happens when you deliberately train LLMs on low quality LLM-generated data so that you can write a paper about it.
Thought it was official just not on huggingface but rather whatever azure competitor thing they're pushing?
There are GGUFs now, so someone must have released some weights somewhere.
I think the point is more the demonstration that such a small model can have such good performance than any actual usefulness.
I've just assumed it's down to how it was trained, but no expert.
> While phi-4 demonstrates relatively strong performance in answering questions and performing reasoning tasks, it is less proficient at rigorously following detailed instructions, particularly those involving specific formatting requirements.
How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?
I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than llama 3 or Mistral at this point. Maybe that will be wrong, but I don't have high hopes.
I wonder what will be next month's buzzphrase.
Only from the perspective of those still complaining about the use of the term "reasoning", who now find themselves left behind as the world has moved on.
For everyone else, the phrasing change perfectly fits the technological change.
1) Realize when it's reached an impasse, then backtrack and explore alternatives
2) Recognize when no further progress towards the goal appears possible, and switch from exploiting existing knowledge to exploring/acquiring new knowledge to attempt to proceed. An LLM has limited agency, but could for example ask a question or do a web search.
In either case, prediction failure needs to be treated as a learning signal so the same mistake isn't repeated, and when new knowledge is acquired that needs to be remembered. In both cases this learning would need to persist beyond the current context in order to be something that the LLM can build on in the future - e.g. to acquire a job skill that may take a lot of experience/experimentation to master.
It doesn't matter what you call it (basic or advanced), but it seems that current attempts at adding reasoning to LLMs (e.g. OpenAI's o1) are based around 1), a search-like strategy, and learning is in-context and ephemeral. General animal-like reasoning needs to also support 2) - resolving impasses by targeted new knowledge acquisition (and/or just curiosity-driven experimentation), as well as continual learning.
I think once they get to about the 85th percentile, we could upgrade the phrase to advanced reasoning. I'm roughly equating it with the percentage of the US population with at least a master's degree.
I feel like I'm taking crazy pills sometimes.
The worst was the gpt4o update in November. Basically a 2 liner on what it is better at and in reality it regressed in multiple benchmarks.
Here we just get MMLU, which is widely known to be saturated, and knowing they trained on synthetic data, we have no idea how much "weight" was given to MMLU-like training data.
Benchmarks are not perfect, but they give me context to build upon.
edit: the benchmarks are covered in the paper: https://arxiv.org/pdf/2412.08905
Microsoft Research just dropped Phi-4 14B, an open-source model that’s turning heads. It claims to rival Llama 3.3 70B with a fraction of the parameters — 5x fewer, to be exact.
What’s the secret? Synthetic data. -> Higher quality, Less misinformation, More diversity
The Phi models always have great benchmark scores, but they always disappoint me in real-world use cases.
The Phi series is famous for being trained on benchmarks.
I tried again with #phi4 through Ollama, but it's not satisfactory.
To me, at the moment, IFEval is the most important LLM benchmark.
But look at Microsoft's smart business strategy:
1) have unlimited access to GPT-4
2) prompt it to generate 30B tokens
3) train a 1B parameter model
4) call it phi-1
5) show benchmarks beating models 10x the size
6) never release the data
7) never detail how to generate the data (this time they described it at a very high level)
8) claim victory over small models
Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4, and they will run on MacBook Pros with at least 64GB of RAM. However, I have an M3 Max and I can’t say that models of this size run at comfortable speeds. They’re a bit sluggish.
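As a rough back-of-the-envelope check on why models of this size need that much RAM, the sketch below estimates the memory taken by the weights alone as parameters times bits per weight. The function name `weight_memory_gb` is made up for this example, and the bit widths are nominal: real GGUF quantizations such as Q4_K_M use somewhat more than 4 bits per weight on average, and the KV cache and runtime overhead come on top.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts as quoted in the thread (nominal, rounded).
models = [("Phi-4 14B", 14e9), ("Llama 3.3 70B", 70e9), ("Qwen 2.5 72B", 72e9)]

for name, n_params in models:
    sizes = ", ".join(
        f"{bits}-bit: {weight_memory_gb(n_params, bits):.0f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name}: {sizes}")
```

By this estimate a 70B model at a nominal 4 bits already takes about 35 GB for weights, which is why 64GB is a practical floor once the OS and other applications are counted.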
If only those models supported anything other than English
Did you mean _external gpu_?
Choose any 12GB or more video card with GDDR6 or superior and you'll have at least double the performance of a base m4 mini.
The base model is almost an older generation. Thunderbolt 4 instead of 5, slower bandwidths, slower SSDs.
Phi-4 is yet another step towards a small, open, GPT-4 level model. I think we're getting quite close.
Check the benchmarks comparing to GPT-4o on the first page of their technical report if you haven't already https://arxiv.org/pdf/2412.08905
The Qwen2 models that run on my MacBook Pro are GPT-4 level too.
Some people do place value on running locally, and I'm not against them for it, but realistically no 70B class model has the amount of general knowledge or understanding of nuance of any recent GPT-4 checkpoint.
That being said these models are still very strong compared to what we had a year ago and capable of useful work
Planned week delay between release on their own platform and hf
But much like you I decided I can be patient / use the ggufs
If that physics engine were itself a good model for the real world, then you could do simulated evolution to get an end result that is at least as functional as a bike (though perhaps it wouldn't look like a traditional bike) even if the only values available to the LLM were the gross characteristics like overall dimensions and mass.
But I'd say the chance of getting a pelican SVG out of a model like this is mostly related to lots of text describing the anatomy of pelicans, and it would not gain anything from synthetic data.
I don’t think the fact that the images are described with vectors magically makes it better for representing physics than any other image representation. Maybe less so, since there will be so much textual information not related to the physical properties of the object.
What about them makes it easier to map to physics than an AABB?
For soft body physics, I’m pretty sure a simpler sort of distance field representation would be even better. (I’m not as familiar with soft body as rigid body)
You can of course just rasterise the vector for output, it's not like people view these things on oscilloscopes.
Whether it's a good model or a model that the LLMs can manipulate, I have no idea. But it's entirely plausible that I could feed an SVG map and ask it geospatial reasoning questions like how far between A and B, what's the nearest grocery store, etc.
I sometimes wonder how many other model or vendor-specific improvements there are, that are missed by third-party tools despite being well-documented by the vendors.
You might be interested in Telosnex, been working on it for ~year and it's in good shape and is more or less designed for this sort of flexibility / allowing user input into requests. Pick any* provider, write up your own canned scripts, with incremental complexity: ex. your average user would just perceive it as "that AI app with the little picker for search vs. chat vs. art"
* OpenAI, Claude, Mistral, Groq Llama 3.x, and one I'm forgetting....Google! And .gguf
All of the Qwen models are basically fluent in both English and Chinese.
Claude 3.5 Sonnet is in second place: https://github.com/simonw/pelican-bicycle?tab=readme-ov-file...
My personal test has been "A horse eating apples next to a tree" but the deliberate absurdity of your example is a much more useful test.
Do you know if this is a recognized technique that people use to study LLMs?
Isn't it just like any kind of conversion or translation? I.e. a relationship mapping between different domains, and just as much parroting of "known" paths between parts of different domains?
If "sun" is associated with "round", "up high", "yellow", "heat" in English, that will map to those things in SVG or in whatever bizarre format you throw at it, with relatively isomorphic paths existing there, just knitted together as a different metamorphosis or cluster of nodes.
On a tangent it's interesting what constitutes the heaviest nodes in the data, how shared is "yellow" or "up high" between different domains, and what is above and below them hierarchically weight-wise. Is there a heaviest "thing in the entire dataset"?
If you dump a heatmap of a description of the sun and an SVG of a sun - of the neuron / axon like cloud of data in some model - would it look similar in some way?
I don’t think it reflects any understanding. But to go from screenshot to conceptually accurate and working code was impressive.
At the end of the day, there's only so much you can cram into any given number of parameters, regardless of what any artificial benchmark says.
For $500 all included?
Here's a config for around the same price. All brand new parts for $573. You can spend the difference improving any part you wish, or maybe get a used 3060 and go AM5 instead (Ryzen 8400F). Both paths are upgradeable.
https://pcpartpicker.com/list/ftK8rM
Double the LLM performance. Half the desktop performance. But you can use both at the same time. Your computer will not slow down when running inference.
by modern standards. at the time, it was trained according to neural scaling laws oai believed to hold.
One of the most exciting trends of the past year has been models getting dramatically smaller while maintaining similar levels of capability.
This is what o1-pro yielded: https://gist.github.com/carbocation/8d780ad4c3312693ca9a43c6...
https://int19h.org/chatgpt/lakeside/index.html
One interesting thing that I found out while doing this is that if you ask GPT-4 to produce SVG suitable for use in HTML, it will often just generate base64-encoded data: URIs directly. Which do contain valid SVG inside as requested.
The most significant part I took away is that when safety "alignment" was done the ability plummeted. So that really makes me wonder how much better these models would be if they weren't lobotomized to prevent them from saying bad words.
The comparison is between something you can buy off the shelf like a powerful Mac, vs something powered by a Grace Hopper CPU from Nvidia, which would require both lots of money and a business relationship.
Honestly, people pay $4k for nice TVs, refrigerators and even couches, and those are not professional tools by any stretch. If LLMs needed a $50k Mac Pro with maxed out everything, that might be different. But anything that's a laptop is definitely regular consumer hardware.
If I'm running a software business selling software that runs on 'consumer hardware' the more people can run my software, the more people can pay me. For me, the term means the hardware used by a typical-ish consumer. I'll check the Steam hardware survey, find the 75th-percentile gamer has 8 cores, 32GB RAM, 12GB VRAM - and I'd better make sure my software works on a machine like that.
On the other hand, 'consumer hardware' could also be used to simply mean hardware available off-the-shelf from retailers who sell to consumers. By this definition, 128GB of RAM is 'consumer hardware' even if it only counts as 0.5% in Steam's hardware survey.
Today, lots of people spend far more than that for gaming PCs. An Alienware R16 (unquestionably a consumer PC) with 64 GB of RAM starts at $4700.
It is an expensive computer, but the best mainstream computers at any particular time have always cost between $2500 and $5000.
You'll need a mini-pc with two M.2 slots, like this:
https://www.amazon.com/Beelink-SER7-7840HS-Computer-Display/...
And a riser like this:
https://www.amazon.com/CERRXIAN-Graphics-Left-PCI-Express-Ex...
And some courage to open it and rig the stuff in.
Then you can plug a GPU into it. It should have decent load times. Better than an eGPU, worse than the AM4 desktop build, fast enough to beat the M4 (once the data is in the GPU, it doesn't matter).
It makes for a very portable setup. I haven't built it, but I think it's a reasonable LLM choice comparable to the M4 in speed and portability while still being upgradable.
Edit: and you'll need an external power supply of at least 400W:)
For AI and LLMs, I'm not aware of any company even selling the model assets directly to consumers. They're either completely unavailable (OpenAI) or freely licensed, so the companies training them aren't really dependent on what the average person has for commercial success.
https://github.com/ggerganov/llama.cpp/discussions/4167
OMM, Llama 3.3 70B runs at ~7 text generation tokens per second on Macbook Pro Max 128GB, while generating GPT-4 feeling text with more in depth responses and fewer bullets. Llama 3.3 70B also doesn't fight the system prompt, it leans in.
Consider e.g. LM Studio (0.3.5 or newer) for a Metal (MLX) centered UI, include MLX in your search term when downloading models.
Also, do not scrimp on the storage. At 60GB - 100GB per model, it takes a day of experimentation to use 2.5TB of storage in your model cache. And remember to exclude that path from your Time Machine backups.
The problem is that often you can't run anything else. I've had trouble running larger models in 64GB when I've had a bunch of Firefox and VS Code tabs open at the same time.
Qwen2.5 Coder 14B at a 4 bit quantization could run but you will need to be diligent about what else you have in memory at the same time.
Yes, it is not "local" as I will have to use the internet when not at home. But it will also not drain the battery very quickly when using it, which I suspect would happen to a Macbook Pro running such models. Also 70B models are out of reach of my setup, but I think they are painfully slow on Mac hardware.