I know this is flash, but….
But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and like more than 10% of gdp?
Someone needs to answer because this isn’t even an M4 or M5… WHAT THE FUCK
Sidenote: shout-out to antirez, love my Redis :)
That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this that are outside normal corporate constraints. There aren't many of those.
But you’re right I agree
In the corporate world they sadly don’t take kindly to performance profiling as a first class citizen
Granted I will say optimization without requirements may not be beneficial but at least profiling itself seems worthy if you have use cases.
A lot of us have been working in the network packet-pusher software, distributed systems, and distributed storage space.
I’m happy to see more stuff like this :)
TL;DR: I’ve not seen a lot of end-to-end flamegraphs of LLMs… idk if anyone else has?
Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get with each day. If you remove enough abstractions and code directly to the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent which tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,
https://www.tomshardware.com/tech-industry/artificial-intell...
Custom code targeting one specific hardware implementation can improve performance quite a bit.
[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
Momentum over at Mojo lang seems very very slow.
According to their roadmap, they're still busy on Phase 1 ("High performance CPU + GPU coding"), and haven't touched Phase 2 ("Systems application programming") and Phase 3 ("Dynamic object-oriented programming").
So perhaps there isn't much to talk about?
I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, and I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
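To put those roofline specs in perspective, here is a rough Python sketch of the ridge point implied by the quoted peak numbers (123 TFLOPS/TOPS and 864 GB/s). The function names are made up for illustration. The 2 ops/byte figure for batch-1 INT8 decode is an assumption (one multiply-add per weight byte read), not a measurement from the card.

```python
# Peak numbers quoted above for the W7900.
PEAK_OPS = 123e12   # 123 FP16 TFLOPS / INT8 TOPS
MEM_BW = 864e9      # 864 GB/s memory bandwidth


def ridge_point(peak_ops=PEAK_OPS, mem_bw=MEM_BW):
    """Arithmetic intensity (ops per byte) at which a kernel stops
    being memory-bound and becomes compute-bound."""
    return peak_ops / mem_bw


def attainable_ops(intensity, peak_ops=PEAK_OPS, mem_bw=MEM_BW):
    """Classic roofline bound: min(compute roof, intensity * bandwidth)."""
    return min(peak_ops, intensity * mem_bw)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.1f} ops/byte")
    # ASSUMPTION: batch-1 INT8 decode does ~2 ops per weight byte read.
    print(f"bound at 2 ops/byte: {attainable_ops(2.0) / 1e12:.3f} TOPS")
```

Under that assumption, single-stream decode sits far to the left of the ridge point, which would be consistent with decode being limited by memory bandwidth rather than by peak compute.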
While it’s pretty fast in the official app for example.
Kagi Assistant is also kind of broken when using Qwen 3.6 Plus.
So, beware of using them in Kagi at the moment.
Looking at https://openrouter.ai/deepseek/deepseek-v4-flash/providers tells us that the DeepSeek provider achieves 49 tps of throughput while DeepInfra achieves 19 tps.
Are there any architectures that don't rely on feeding the entire history back into the chat?
Recurrent LLMs?
This is probably far from the raw intelligence provided by cloud providers.
Still, this shines more light on local LLMs for agentic workflows.
Edit: Caching story makes a lot more sense for regular usage:
> Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
If the latter test is successful, next would be testing Macs with more limited RAM, first running simple requests (would be quite slow) then larger batches (might be more worthwhile if one can partially amortize the cost of fetching weights from storage, and be bottlenecked by other factors).
The good: It succeeded in discovering the code, applying edits and writing a test for a small task I gave it. The bad: It could not address a small nitpick I had. The ugly: It hallucinated a conversation about "The Duck" that I supposedly had with it simultaneously while trying to solve another problem. I can only imagine it's one of the examples in the initial Claude Code prompt:
--cut-- However, the user's query is "Can you track these 3 videos here?" which seems unrelated. Perhaps the user is asking if I can track the progress of three videos they are working on?
Let me re-read the user's message. The user said "Source Code" and "The Agent" and "The Duck", it could be video titles. And they are asking if I can track these 3 videos.
?? That doesn't make sense in the context. Could there be two different conversations? --cut--
They’ve dropped all the Mac Studio configs higher than 96 GB, as well as the base Mac mini. They’re also rumored to be considering taking the Neo base config off the market.
This seems to be how they’re dealing with supply constraints for fab capacity and RAM.
Maybe Apple would rather not price it at all than experience blowback for either gouging or lack of inventory.
Not really. That's going to land you somewhere in the 0.2-0.5 tokens a second range.
Lovely as modern NVMe drives are, they're not memory.
Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.
edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.
I'm assuming this is faster, and/or lets you run a bigger, smarter model than just using the generic tool chain, but it doesn't spell out the level of existing improvements over that baseline or expected improvements as far as I can see?
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
Even if not perfect, if you publish on GH or HF, some other agent can maybe start there and not from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE), which is the biggest LLM I can run for practical use on the other hardware I have for local LLMs (M2 Max). Even if MTP is not working well, it's still an improvement on the current llama.cpp, which does not run Ling-2.6-flash at all. This - https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....
There's a docs/ folder in there that is probably of interest as well.
My effort is called shady-thinker and is on github at github.com/tmzt/shady-thinker.
This was inspired in part by Antirez's earlier work with C kernels as well as other efforts to support in-browser LLMs. I've adapted them to Rust and the wgpu library.
Gemma 4 is also the next likely target (with the MTP work) as I'm experimenting with local AI agents.
I'd love to see what you've done to improve prefill and decode even if it's not directly applicable.
One difference: I'm using MLX and GPTQ 4-bit quants (including AutoRound) with safetensors. My shader pipeline is pretty much fixed for each model, so ggml just adds unnecessary complexity.
I think llama.cpp could have done a much better job supporting PC. Sure, some of it is due to bad vendor support, but with so many users I am surprised we don't see more optimized inference on standard PCs.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:*
- `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.

*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.

*Router top-k (within-block fix)*:
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
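The grid-occupancy rule of thumb above can be sketched as a small Python helper. This is one reading of the notes, not the author's tooling: it assumes `Grid_Size` counts total work-items (so `Grid_Size / Workgroup_Size` is the number of workgroups) and that the 0.3 / 1.0 / 4.0 thresholds apply to workgroups per CU. The function name is hypothetical, and the notes do not say what to do above 4.0, so that case is only flagged.

```python
def classify_grid_occupancy(grid_size, workgroup_size, cu_count):
    """Return (workgroups per CU, label) using the thresholds in the notes.

    ASSUMPTION: grid_size is the total number of work-items, so
    grid_size / workgroup_size is the number of workgroups launched.
    """
    workgroups = grid_size / workgroup_size
    ratio = workgroups / cu_count
    if ratio < 0.3:
        label = "massively undersubscribed"
    elif ratio < 1.0:
        label = "partially utilized"
    elif ratio <= 4.0:
        label = "healthy"
    else:
        # The notes list no guidance beyond 4.0.
        label = "above the listed ranges"
    return ratio, label


if __name__ == "__main__":
    # Router kernel from the notes: one block of 512 threads on a
    # W7900 (96 CUs).
    print(classify_grid_occupancy(512, 512, 96))
```

For the router kernel this lands deep in the undersubscribed range, but since `num_tokens=1` the grid cannot grow, which is why the fix in the notes is applied within the block instead.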
Everyone who's betting their competency on the generosity of billionaires selling tokens for 1/10-1/20th of the cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
An example is https://blog.can.ac/2026/02/12/the-harness-problem/ for just improving edits.
Or if we could really steer these open source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift https://jamon.dev/night-shift)?
They said the same thing about open source chess engines.
48 GB is enough for a capable LLM.
Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
- Consumers of LLM inference (developers and hobbyists) will be more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and be incentivized to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)
- A larger market for on-device (and therefore open-weight) LLMs will probably result in more research concentrated on those inherently more efficient (because compute/memory-constrained) models.
I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but is still much less than what, let's say, a PS5 draws.
Also remember how AWS had the same promise and now we're just deploying stack after stack and need 'FinOps' teams to get us to be more resource-efficient?
Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.
A Mac is also the rest of the personal computer!
If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus which feel maybe comparable-ish around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever I'm burning 500W in a datacenter somewhere during that time?
Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.
This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.
Claude Sonnet is probably running on an 8-GPU box that consumes 10 kW, while Opus might use more like 50 kW, but that's shared by a bunch of users thanks to batching.
I could write an engine that only uses 10W on your machine, but it wouldn't be meaningful if it was also 10X slower.
More power consumption is usually an indicator that the hardware is being fully utilized, all else being equal (comparing GPU to GPU or CPU to CPU, not apples to oranges).
Washing machine 900W. Hair dryer 1500W. Pizza oven 2000W. So yeah, you say 50W, yeah sure same as video rendering or gaming I guess, yet not really an OMG-level number.
And frankly I'm not quite sure there's anything like economy of scale where it gets more efficient if you serve more users (like some sibling comments seem to imply).
Last thing, and I know many know but also many others don't or have forgotten: a watt is a rate of consumption, not an absolute amount. The amount is energy, measured in joules. So you say 50W, but what you pay for (or the planet pays, whatever) generally is the amount of energy, hence you need to say for how long that consumption was sustained. 50W over 2 hours is 100 watt-hours (360,000 joules), the actual resource you consumed and paid for.
Power (watts) is like speed (m/s). You say 50 miles an hour, need to say how long was the drive, so we know how far you got.
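As a quick check of the power-versus-energy arithmetic, here is a minimal Python sketch. The helper names are made up for illustration. It only uses the definition 1 W = 1 J/s, so 1 Wh = 3600 J.

```python
SECONDS_PER_HOUR = 3600


def energy_watt_hours(power_watts, hours):
    """Energy in watt-hours for a constant power draw."""
    return power_watts * hours


def energy_joules(power_watts, hours):
    """Energy in joules: watts (J/s) times the duration in seconds."""
    return power_watts * hours * SECONDS_PER_HOUR


if __name__ == "__main__":
    # 50 W sustained for 2 hours.
    print(energy_watt_hours(50, 2), "Wh")
    print(energy_joules(50, 2), "J")
```

So 50 W held for 2 hours comes to 100 Wh, or 0.1 kWh, which is the unit that shows up on an electricity bill.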
Also, datacenter scale devices are almost certainly designed to minimize energy use per operation given comparable latency. You can still compete as an on prem consumer by (1) repurposing your existing hardware, which saves on high CapEx costs, (2) increasing latency, getting your answer computed in a longer time, which probably saves at least some power by design if you can leverage e.g. NPUs, or (3) running smaller or more bespoke models that aren't worthwhile for the bigger players to serve at scale.
There's also a likely gain in serving more requests in parallel, but it may have more to do with successfully amortizing memory access for model weights than any inherent increase in efficiency. Anyway, I've argued in sibling comments that you perhaps can also leverage this on consumer hardware for the special case of DeepSeek V4.
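The weight-amortization argument can be illustrated with a deliberately crude roofline-style model in Python. Everything here is an assumption for illustration: the function name, the example numbers, and the simplification that each decode step reads all active weights once regardless of batch size. It ignores KV-cache traffic, and for MoE models different requests can activate different experts, so the real weight traffic grows with batch size and the gain is smaller than this suggests.

```python
def decode_tokens_per_second(batch, weight_bytes, flops_per_token,
                             mem_bw, peak_flops):
    """Aggregate decode throughput under a simple roofline model.

    Each step reads the weights once (memory time) and does
    batch * flops_per_token of math (compute time); the slower of the
    two sets the step time, and each step yields `batch` tokens.
    """
    mem_time = weight_bytes / mem_bw
    compute_time = batch * flops_per_token / peak_flops
    step_time = max(mem_time, compute_time)
    return batch / step_time


if __name__ == "__main__":
    # HYPOTHETICAL machine and model: 10 GB of weights read per step,
    # 20 GFLOP per token, 1 TB/s of bandwidth, 100 TFLOPS peak.
    for batch in (1, 8, 100):
        tps = decode_tokens_per_second(batch, 10e9, 20e9, 1e12, 100e12)
        print(batch, round(tps))
```

In this toy model, throughput scales almost linearly with batch size until compute time overtakes memory time. If power draw stays roughly flat over that range, energy per token falls accordingly, which is the sense in which serving more requests at once can be more efficient. It is a sketch of the mechanism, not hard data.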
https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...
That's pretty compelling.
[1] https://finance.yahoo.com/sectors/technology/articles/nvidia...
They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPUs that need to be constantly replaced, while your phone is full of mostly unused compute capacity and will barely break a sweat if it's only serving queries for a single user at a time.
They don't seem to give much thought to the energy usage per user (or what this will potentially do to your phone battery), or how different phone-sized vs data center-sized models are in terms of capability.
One thing I would love to see is if this dogfoods itself
Like, would dsv4 with q2 be able to do this task itself on this hardware?
Sidenote: I wish I had an M4/M3… thinking about getting an ASUS ROG Flow Z13 Gaming Laptop (Model GZ302EA-XS99). It uses PCIe 4.0, so disk might be a little slower, but I want to see how this does on, like, Vulkan :)
Yes, of course that was a brain fart of mine. A watt is a joule per second, certainly not a joule per hour. I made the point of "lecturing" readers on power v. energy since Antirez (OP) wrote _"50W of energy usage..."_ (instead of power consumption) and it's a mistake people often make. So my side point was: ok, 50W, but for how long?
The other thing I'm arguing is 50W is nothing to be shocked by. I would like to see an argument for the opposite. I'd like to know what's the power consumption of playing eg. Baldur's Gate for a couple hours on a gaming rig and I wager we surpass that by a margin.
Now, the data center economies of scale. You're saying they almost certainly exist. Okay, whatever, I don't know. Requests served in parallel. Amortizing memory access for model weights. Likely. I'm writing this with a thinly veiled dismissive attitude because I believe it would be very useful to have hard data on whether or not serving many users v. just one user makes LLMs more efficient. It's an important point with wide-ranging implications.
If there are economies of scale, like you claim, and one day a wealthy patron gifts me a 40k USD rig where I can run a frontier LLM locally, then I'd still be making selfish use of the commons (energy, which belongs to the planet, all of us, that kinda stuff) because the efficient/responsible choice is to pool and use a cloud vendor (or pool your rig with neighbors etc).
But saying a machine can be more efficient if it serves many users sounds to me a bit like nine women making a baby in a month.
A big cloud vendor does not have the same opportunity: they cannot repurpose your own existing hardware. And they'll definitely want to minimize latency in order to get maximum throughput/utilization from the hardware they did buy, even at an energy cost. That's why I was careful to note latency as a possible factor before.