DeepSeek 4 Flash local inference engine for Metal (github.com)

498 points by tamnd 11 days ago | 159 comments

kgeist 10 days ago |

Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get with each day. If you remove enough abstractions and code directly to the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent which tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.

Aurornis 10 days ago | |

> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.

While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.

There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.

GeekyBear 10 days ago | | |

DeepSeek's custom PTX code has previously outperformed CUDA running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,

https://www.tomshardware.com/tech-industry/artificial-intell...

Custom code targeting one specific hardware implementation can improve performance quite a bit.

LoganDark 10 days ago | | |

When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.

xtracto 10 days ago | |

This reminds me of the famous high-performance FizzBuzz code golf answer [1]. If we could implement optimizations like that for inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...

Juvination 10 days ago | | |

I love scrolling and reading through this, thinking yeah of course Python is slower than Java, oh wow Rust is pretty on par I wonder what the Java devs did. Then you hit asm and your jaw drops.

mirsadm 10 days ago | |

I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.

davidwritesbugs 10 days ago | | |

I tried getting any SOTA LLM (GPT 5, Opus 4.6, DeepSeek V4 Pro, GLM-5) to write a Metal 4 shader for a bottle USDZ and none of them got it right. They screwed up the normals and textures, a total mess. I tried getting them to do it in Metal 3 and it was still crappy.

wahnfrieden 10 days ago | | |

Just curious if you've tried GPT 5.5 Pro?

egesko 10 days ago | |

I'll add to this: What if chips were designed for the model? What would happen if we moved from digital to analog (vectors are not represented as bits, but instead as voltages)? Could the compute heavy matrix multiplications be done via op-amps? And could this analog approach be way more efficient than the limitations of bit representation?

kristianp 10 days ago | | |

There is https://taalas.com/ . Their chips are all digital though. The weights are written to silicon.

joshmarlow 10 days ago | |

Another suggestion for optimizing local inference - the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models might like to use a trailing `,` in JSON output, some don't - so if your parser can handle the quirks of the specific model, then you get higher-performing functionality.
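
To make the trailing-comma example concrete, here is a minimal Python sketch of a tolerant parser. It is only an illustration of the idea, not how Hermes or any particular harness does it; the function names `strip_trailing_commas` and `parse_model_json` are hypothetical. It tries strict JSON first and only falls back to the lenient path when that fails.

```python
import json
from typing import Any


def strip_trailing_commas(text: str) -> str:
    """Remove commas that directly precede a closing } or ], outside strings."""
    out = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Copy string contents verbatim, tracking escapes and the end quote.
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
            continue
        if ch in "}]":
            # Walk back over whitespace and drop a dangling comma if present.
            j = len(out) - 1
            while j >= 0 and out[j].isspace():
                j -= 1
            if j >= 0 and out[j] == ",":
                del out[j]
        out.append(ch)
    return "".join(out)


def parse_model_json(text: str) -> Any:
    """Parse model output as JSON, tolerating trailing commas (hypothetical helper)."""
    try:
        return json.loads(text)  # strict path first
    except json.JSONDecodeError:
        return json.loads(strip_trailing_commas(text))
```

For example, `parse_model_json('{"tool": "read", "args": ["a", "b",],}')` parses to the same dictionary as the strict form, while a comma inside a string value is left alone.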

didip 10 days ago | |

What if PyTorch is extended to have a pluggable compiler? For M GPU types and N models, if the backend allows, run a specialized compiler?

nopurpose 10 days ago | |

Ultra-optimized HW-specific engines are what Mojo lang seems to be targeting, but I rarely hear about it here.

andsoitis 10 days ago | | |

> Mojo lang seems to be targeting, but I rarely hear about it here

Momentum over at Mojo lang seems very very slow.

According to their roadmap, they're still busy on Phase 1 ("High performance CPU + GPU coding"), and haven't touched Phase 2 ("Systems application programming") and Phase 3 ("Dynamic object-oriented programming").

So perhaps there isn't much to talk about?

p_stuart82 10 days ago | |

this feels closer to ATLAS/FFTW than a model runner. the generated kernel ages out, the tuning harness is the bit you actually want to keep.
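
A rough Python sketch of what such a tuning harness might look like, in the ATLAS/FFTW spirit: time several candidate implementations on the target machine, reject any whose output drifts from a reference, and keep the fastest. The toy dot-product "kernels" and the names `autotune` and `best_time` are assumptions made for illustration; a real harness would compile and time GPU kernels instead.

```python
import time
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def dot_loop(a: Vector, b: Vector) -> float:
    """Reference implementation: plain indexed loop."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_zip(a: Vector, b: Vector) -> float:
    """Candidate: generator expression over zipped pairs."""
    return sum(x * y for x, y in zip(a, b))


def dot_broken(a: Vector, b: Vector) -> float:
    """Deliberately wrong candidate: fast, but must be rejected."""
    return 0.0


def best_time(fn: Kernel, a: Vector, b: Vector, repeats: int = 5) -> float:
    """Return the best wall-clock time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(a, b)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(candidates: Dict[str, Kernel], reference: Kernel,
             a: Vector, b: Vector, tol: float = 1e-9) -> str:
    """Pick the fastest candidate whose output matches the reference."""
    expected = reference(a, b)
    timings: Dict[str, float] = {}
    for name, fn in candidates.items():
        if abs(fn(a, b) - expected) > tol * max(1.0, abs(expected)):
            continue  # quality gate: drop variants that change the result
        timings[name] = best_time(fn, a, b)
    if not timings:
        raise ValueError("no candidate passed the correctness check")
    return min(timings, key=timings.get)
```

The candidate set is the part that ages out with each model or GPU; the correctness gate and timing loop are the part that carries over.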

lhl 10 days ago |

I think especially with the ability for SOTA AI to optimize kernels more people should try their hand at making better inference for their specific hardware.

I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.

Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, and I started tuning a W8A8-INT8 model. After a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill 20% faster and decode 50% faster than the best llama.cpp numbers for Qwen3.6 MoE.

I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.

maherbeg 11 days ago |

This is so sick. I'm really curious to see what focused effort on optimizing a single open source model can look like over many months. Not only on the inference serving side, but also on the harness optimization side and building custom workflows to narrow the gap between things frontier models can infer and deduce and what open source models natively lack due to size, training etc.

antirez 10 days ago |

A random, funny, interesting and telling data point: my MacBook M3 Max, while DS4 is generating tokens at full speed, peaks at 50W of power draw...

speu 11 days ago |

I've been trying deepseek-v4-flash in OpenCode (via OpenRouter) and I'm blown away. It's no Opus, obviously, but it had zero issues with any regular coding task I threw at it. v4-flash is remarkably "good enough" for what I needed. The whole evening of coding cost me $0.52 in API credits.

jiehong 10 days ago | |

Using it in Kagi Assistant is stupidly slow. I get like 10 t/s.

While it’s pretty fast in the official app for example.

Kagi Assistant is also kind of broken when using Qwen 3.6 Plus.

So, beware of using them in Kagi at the moment.

dev_hugepages 7 days ago | | |

Probably a provider thing. Looking at https://help.kagi.com/kagi/ai/llms-privacy.html, they're using deepinfra.

Looking at https://openrouter.ai/deepseek/deepseek-v4-flash/providers tells us that the deepseek provider achieves 49tps of throughput while deepinfra 19tps.

nazgulsenpai 10 days ago |

I keep seeing DS4 and in order my brain interprets it as Dark Souls 4 (sadface), DualShock 4, Deep Seek 4.

visarga 10 days ago |

Large LLMs on MacBook produce tokens at an acceptable speed but the problem is reading context. Not incremental reading like when you have a chat session, because they use KV cache, but large size reading, like when you paste a big file. It can take minutes.

antirez 10 days ago | |

DS4 can process 460 prompt tokens per second. Not stellar but not so slow. That is on an M3 Max. See the benchmarks in the README.
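
As a back-of-the-envelope check on what 460 prompt tokens per second means in practice, the small Python sketch below divides prompt length by prefill rate. The 25k-token prompt matches the Claude Code figure quoted elsewhere in the thread; the 100k-token prompt is a hypothetical "big pasted file" added purely for illustration, and the function name `prefill_seconds` is an assumption.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate the time to process a prompt before the first output token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


if __name__ == "__main__":
    # 25k tokens: the initial Claude Code prompt size mentioned in the thread.
    # 100k tokens: a hypothetical large paste, not a measured workload.
    for tokens in (25_000, 100_000):
        secs = prefill_seconds(tokens, 460.0)
        print(f"{tokens} tokens at 460 t/s: {secs:.0f} s ({secs / 60:.1f} min)")
```

At this rate a 25k-token prompt takes a little under a minute, and a 100k-token prompt a bit over three and a half minutes, assuming the rate holds constant across context lengths (which it may not).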

brcmthrowaway 10 days ago | |

Why is this the case?

Are there any architectures that don't rely on feeding the entire history back into the chat?

Recurrent LLMs?

bel8 10 days ago | |

And unless I'm mistaken, the repo is about running it with 2bit quantization.

This is probably far from the raw intelligence provided by cloud providers.

Still, this shines more light on local LLMs for agentic workflows.

antirez 10 days ago | | |

It runs both q2 and original (4 bit routed experts). At the same speed more or less. The q2 quants are not what you could expect: it works extremely well for a few reasons. For the full model you need a Mac with 256GB.

habosa 10 days ago | |

Can you ELI5 why this is so slow for local inference but so fast for using hosted models?

layoric 10 days ago |

Very impressive. One thing that seems odd to me is that it takes something like 4 minutes before it starts a response for a large input? I don't use Mac hardware for LLMs, but that is quite surprising and would seem to be a pretty large stumbling block for practical usage.

Edit: Caching story makes a lot more sense for regular usage:

> Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
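
To illustrate the prefix-reuse idea described in that README excerpt, here is a toy Python sketch of a disk-backed prefix cache. It is not the engine's actual on-disk format or lookup logic: the class `DiskPrefixCache`, the `.kv` file naming, and the pickled `state` object (a stand-in for real KV tensors) are all assumptions made for illustration.

```python
import hashlib
import os
import pickle
from typing import Any, List, Optional, Tuple


def prefix_keys(tokens: List[int]) -> List[str]:
    """Return one hash per prefix length (index i covers tokens[:i + 1])."""
    h = hashlib.sha256()
    keys = []
    for tok in tokens:
        h.update(tok.to_bytes(4, "little"))
        keys.append(h.copy().hexdigest())
    return keys


class DiskPrefixCache:
    """Toy cache mapping a token prefix to a saved state on disk."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.directory, key + ".kv")

    def save(self, tokens: List[int], state: Any) -> None:
        """Store the state reached after processing exactly `tokens`."""
        if not tokens:
            raise ValueError("cannot cache an empty prefix")
        with open(self._path(prefix_keys(tokens)[-1]), "wb") as f:
            pickle.dump(state, f)

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[Any]]:
        """Return (matched_length, state) for the longest cached prefix."""
        keys = prefix_keys(tokens)
        for n in range(len(tokens), 0, -1):
            path = self._path(keys[n - 1])
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return n, pickle.load(f)
        return 0, None
```

With a hit of length `n`, only `tokens[n:]` would still need prefill, which is why a restarted session with the same 25k-token system prompt could skip most of the expensive work.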

antirez 10 days ago | |

Yep, that happens with coding agents sending a very large system prompt, and also when later tool calls feed it large files or diffs. But with the M3 Ultra the prefill speed is almost 500 t/s, which is well into the usable zone. With the M3 Max you need a bit more patience, but it works well, and since it emits the thinking process, if you use the pi agent you don't wait: you read the uncensored chain of thought. I posted a video on X yesterday using it with my M3 Max. It spills tokens at a decent speed.

zozbot234 10 days ago | | |

Given how small the KV cache for this model seems to be for small contexts, can you clarify how the engine behaves if you try to run increasingly larger batches on your prosumer hardware (RAM 128 GB)? Does it eventually become compute limited?

Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
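
For readers unfamiliar with what "transparent mmap" would mean here, the following Python sketch reads one row of a float32 matrix through a memory map, so the OS faults in only the pages that are touched. It says nothing about whether the engine supports this; the file layout (raw little-endian float32, row-major) and the helper names `write_weights` and `read_row` are hypothetical.

```python
import mmap
import struct
from typing import List


def write_weights(path: str, values: List[float]) -> None:
    """Write values as raw little-endian float32 (hypothetical file layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_row(path: str, row: int, cols: int) -> List[float]:
    """Read one row of a row-major float32 matrix via a read-only mmap.

    Only the pages backing the requested row are touched here; the rest
    of the file is left to the OS page cache to load if and when needed.
    """
    offset = row * cols * 4  # 4 bytes per float32
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{cols}f", mm, offset))
```

On a CPU path this kind of access pattern lets a model larger than RAM run at all, at the cost of storage latency whenever a needed page is not resident.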

If the latter test is successful, next would be testing Macs with more limited RAM, first running simple requests (would be quite slow) then larger batches (might be more worthwhile if one can partially amortize the cost of fetching weights from storage, and be bottlenecked by other factors).

segmondy 10 days ago | | |

Curious why you went this route. Don't you think you could have achieved 80% or more of this performance within llama.cpp?

MrBuddyCasino 10 days ago | |

Prefill is faster on the M5s, the older generations are a bit weak.

sev_verso 10 days ago |

I've tried it out with Claude Code on my existing codebase and it seemed to hold its weight (despite being the 2-bit quant). Takes minutes on prompt processing, the actual edits are reasonably quick at above 20 tks.

The good: It succeeded with discovering, applying edits and writing a test for a small task I gave it. The bad: It could not address a small nitpick I had. The ugly: It hallucinated a conversation about "The Duck" that I had with it simultaneously while trying to solve another problem. I can only imagine it's one of examples in the initial Claude Code prompt:

--cut-- However, the user's query is "Can you track these 3 videos here?" which seems unrelated. Perhaps the user is asking if I can track the progress of three videos they are working on?

Let me re-read the user's message. The user said "Source Code" and "The Agent" and "The Duck", it could be video titles. And they are asking if I can track these 3 videos.

?? That doesn't make sense in the context. Could there be two different conversations? --cut--

kristianp 10 days ago |

Hmm, I'm unable to order more than 96GB RAM for a Mac Studio, even with the M3 Ultra or M4 Max. Is this AU-specific? However, with the MacBook Pro I can specify 128GB with the M5 Max.

https://www.apple.com/au/shop/buy-mac/mac-studio

Joeri 10 days ago | |

It’s not just AU: https://9to5mac.com/2026/05/05/apples-most-powerful-mac-stud...

They’ve dropped all the Mac Studio configs higher than 96GB, as well as the base Mac mini. They’re also rumored to be considering taking the Neo base config off the market.

This seems to be how they’re dealing with supply constraints for fab capacity and RAM.

Terretta 10 days ago | |

Difficult to believe this memory is made of unobtanium.

Maybe Apple would rather not price it at all than experience blowback for either gouging or lack of inventory.

smcleod 10 days ago | |

The Studio is really old now. The new one will drop at some point, no doubt with more memory options. The 128GB M5 Max MBP is great though.

Terretta 10 days ago | | |

And yet, aside from offering 512GB, that really old Studio M3 Ultra runs LLMs faster (especially sustained) than the new M5 Max.

Havoc 10 days ago |

Was excited until I realized DS flash is still enormous. Oh well...glad it exists anyway & happy to see antirez still doing fun stuff

zozbot234 10 days ago | |

It could run viably with SSD offload on Macs with very little memory. You could even exploit batching to make the model almost compute limited even in that challenging setting, seeing as the KV cache is so extremely small (for non-humongous context). In fact, if that approach can be made to work I'd like to see a comparison between DS4 Flash and Pro on the same (Mac) hardware.

Havoc 10 days ago | | |

>It could run viably with SSD offload on Macs with very little memory

Not really. That's going to land you somewhere in the 0.2-0.5 tokens a second range

Lovely as modern nvmes are they're not memory

amunozo 10 days ago |

I am curious about it producing fewer tokens except in max mode. I love DeepSeek V4 Flash and use it extensively; it's so cheap I can use it all day and still not use up my $10 OpenCode Go subscription. Because of this I always use it in max mode, but now I wonder whether I should use high instead.

unshavedyak 10 days ago | |

What do you use it for? I tend to just stick to SOTA (Claude 4.7 Max thinking), and put up with the slow req/response. I'm not sure what type of work i'd trust a less thinking model, as my intuition is built around what Claude vSOTA Max can handle.

Nonetheless eventually i want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.

edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.

tmaly 10 days ago |

The intro was the best part of the README in my opinion. The rest of the README looks and feels AI generated. I am guilty of this same thing with README files.

ZeroGravitas 10 days ago |

Did I miss a simple motivating benchmark or goal?

I'm assuming this is faster, and/or lets you run a bigger, smarter model than just using the generic tool chain, but it doesn't spell out the level of existing improvements over that baseline or expected improvements as far as I can see?

Presumably you can work it out based on the numbers given if you have the relevant comparison values.

dejli 10 days ago |

The beauty of it is that you can clone it and run make, and it just works. No Python shenanigans; what a blessing for this ecosystem.

sourcecodeplz 10 days ago |

Great project!

This is also a fine example of a vibe-coded project with purpose, as you acknowledged.

shivnathtathe 10 days ago |

Been working on local-first LLM observability for exactly this use case — tracing local model pipelines without sending data to cloud. Happy to share if anyone's interested.

shay_ker 11 days ago |

How does it compare to popular local inference engines, e.g. ollama, lm studio, or handrolled llama.cpp? I saw a brief benchmark in the readme but wasn't sure if there was more.

fgfarben 9 days ago |

On both the llama.cpp based version and the custom Metal version, the model forgets how to use tools somewhere around the 50,000 token mark.

brcmthrowaway 10 days ago |

How does this compare with oMLX?

octocop 10 days ago |

Finally someone who pays proper respect to GGML ecosystem.

npgraph 10 days ago |

Any direct TPS comparison to Ollama?

zozbot234 10 days ago | |

Ollama has no local support for DeepSeek V4 at present; it's only listed as a cloud model. Even llama.cpp is still waiting for support.

happyPersonR 10 days ago |

So just gonna ask a question, probably will get downvoted

I know this is flash, but….

But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and like more than 10% of gdp?

Someone needs to answer because this isn’t even a m4 or m5… WHAT THE FUCK

Sidenote: shout out antirez love my redis :)

AlotOfReading 10 days ago | |

This is built atop a tower of stuff people built with profiling and performance-oriented design.

That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this that are outside normal corporate constraints. There aren't many of those.

happyPersonR 10 days ago | | |

Building this into agentic dev workflows (subject to token/time constraints) is something I spent a lot of time doing at work. I actually am kind of proud of that hahah

But you’re right I agree

In the corporate world they sadly don’t take kindly to performance profiling as a first class citizen

Granted I will say optimization without requirements may not be beneficial but at least profiling itself seems worthy if you have use cases.

A lot of us have been working in the network packet pusher software, distributed systems, and distributed storage space

I’m happy to see more stuff like this :)

TL;DR: I’ve not seen a lot of flamegraphs of LLMs end to end … idk if anyone else has?

liuliu 10 days ago | |

DSv4 generates much faster on NVIDIA class hardware. It is just a very efficient model.

wmf 10 days ago | |

Every lab has a bunch of people doing nothing but optimizing.

ifeot 10 days ago | | |

8011943553

fgfarben 10 days ago | |

The world is not China.