A few words on DS4 (antirez.com)
> Starting from MacBooks with 96GB of RAM.
... oh. And I thought I bought a lot with 48 GB.
Apologies. Where did I form my opinions?
Here's one of the top hits: https://forums.developer.nvidia.com/t/fully-custom-cuda-nati...
Bizarre comment; sounds like "How do you know Porsches are fast? Did you drive one?"
For others who are lacking context :-)
From the Github page it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090 but it probably won't work.
[1] https://unsloth.ai/docs/models/tutorials/minimax-m27
(Unsloth's deepseek-v4 support is still WIP)
I do not think it can use multi-gpu or gpu/cpu offloading at this time.
Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.
Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.
> We support the following backends:
Metal is our primary target. Starting from MacBooks with 96GB of RAM.
NVIDIA CUDA with special care for the DGX Spark.
AMD ROCm is only supported in the rocm branch. It is kept separate from main
since I (antirez) don't have direct hardware access, so the community rebases
the branch as needed.
> This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.
Edit: aww, doesn't seem to support offloading to system RAM[0] (yet)
[0] https://github.com/antirez/ds4/issues/108
Guess I'll have to keep watching the llama.cpp issue[1]
Has anybody tried it? There is a lot of emphasis on MacBook Pro in this thread, but I would like to use it with an AMD Strix Halo with 128GB of unified RAM.
(the ux of ds4 is fantastic too -- it's dead-easy to get a known-good model, great quant. llamacpp you're much more hacking in the wilderness, w/ many many knobs.)
Is it true? We'll see, in a few years.
Antirez explained the dev process when he posted a pure C implementation of the Flux 2 Klein image gen model, at https://news.ycombinator.com/item?id=46670279
I do wonder, though, if another agent is really needed. I've been driving it with Pi (Claude Code's system prompt is far too heavy given the prefill speeds) and it's been great. OpenCode is another good option. Is there anything else to gain from another similar tool specific to Deepseek 4?
prefill: 30.91 t/s, generation: 29.58 t/s
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.
I'm an LLM nerd so running local models is worth it from a research perspective.
It'll require some kind of:
- breakthrough in architecture or
- breakthrough in hardware or
- some breakthrough quantization technique
The problem is that all the parameters need to be in memory, even the ones that aren't active (say, for Mixture of Experts models), because swapping parameters in and out of RAM is far too slow.
"We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."
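To put rough numbers on the memory point, here is a minimal Python sketch using the EMO figures quoted above (14B total parameters, 1B active, 128 experts, a 12.5% expert subset). The bit widths are assumptions picked for illustration, and the estimate covers weights only: it ignores quantization metadata, KV cache and activations, so read the results as lower bounds, not measurements.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def experts_in_subset(total_experts: int, fraction: float) -> int:
    """Number of experts kept when only a fraction of them is used."""
    return round(total_experts * fraction)


# Figures from the EMO quote: 14B total parameters, 1B active, 128 experts.
TOTAL_PARAMS = 14e9
ACTIVE_PARAMS = 1e9

print("experts kept at 12.5%:", experts_in_subset(128, 0.125))

# Bit widths below are assumptions chosen for illustration.
for bits in (16, 8, 4, 2):
    total = weight_memory_gb(TOTAL_PARAMS, bits)
    active = weight_memory_gb(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: all weights ~{total:.2f} GB, active per token ~{active:.2f} GB")
```

The gap between the two columns is the point being argued: only the "active" slice is touched per token, but the whole "all weights" figure has to be resident unless the set of experts can be narrowed ahead of time.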
A crow exhibits some degree of intelligence in what is a very small brain compared to humans. There is overlap in the problem solving skills of the dumbest humans and the smartest crows.
So the question is: what is that? Yann LeCun seems to think it’s what we now call world models. World models predict behaviour as opposed to predicting structured data (like language.)
If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.
If you can combine cause and effect reasoning with language, you might get something truly intelligent.
That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.
I don’t think it’s unreasonable to expect to see some very intelligent low (relatively) memory AI systems in the next couple years.
I have been toying with a 2.5D engine in C on top of raylib and using DeepSeek as a companion in between.
Its thinking transcripts in OpenCode are transparent, and it is mind-boggling to see the things it considers in its thought process. They are very long to read, but none of them are useless or meaningless.
It often happened that my request rested on an assumption I hadn't thought about, or that was wrong. DeepSeek would flag it in its thought process and then, in the final output, "align" to my flawed request. I'd tell it: wait, I saw you thought so-and-so too, and that's correct, I made a mistake, let's consider that aspect too.
Right now I feel like a 4-bit Qwen 3.6 27B with MTP is one of the best for agentic tool calling for some smart voice agents on an H200. I wonder if DS4 Flash, using 80b at 2 bit with 13b active and MTP, could be even faster and smarter and allow more concurrent sequences?
This special 2bit quantization seems like a big deal.
> Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless.
I've noticed the same for lower level squeezing-as-much-performance-as-possible code work.
I also don’t have time to do much personal coding outside of work, so I haven’t subscribed to a personal one yet. But I intend to go for Codex just to balance the Claude at work and also because of the hostile moves from Anthropic toward their consumer business.
Wink wink, nudge nudge.
I have a feeling most cybersec researchers would only be interested in negative values of "reduce" :D
The code seems based on llama.cpp and GGML.
I don't fully understand why it is a standalone project. The readme discusses this: DwarfStar 4 is a small native inference engine specific for DeepSeek V4 Flash. It is intentionally narrow: ...
I think the only bigger difference in DeepSeek V4 vs other models is maybe the type of self-attention. And that leads to: KV cache is actually a first-class disk citizen.
But I still feel like those changes could have been implemented as part of some of the other local engines.
I also assume more models will come out, not just from DeepSeek but also from others, and they might share similar self-attention approaches, that would benefit from a similar KV cache implementation.
The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue.
I am amazed at how well it works, I'm using it right now for some pretty complex frontend work, and it is much much faster than, for example running a dense 27b or 31b model (like qwen or gemma) for me (The benefits of MoE) - but the long context capabilities have been what have been absolutely flooring me.
Super excited about this project and hope Antirez can keep himself from burning out. I've been following the repo pretty closely and there are a ton of PRs flooding in, and it seems like he's had to do a lot of filtering out of slop code.
Sure, MoE models have more knowledge, but extreme quantization may negate the benefits. And generally for coding tasks, you don't need a model that has memorized all the irrelevant trivia like, I don't know, the list of all villages in country X. DS4 also seems to run much slower on Mac Studio Ultra, which appears to be more or less in the same price range as RTX 5090. RTX 5090 gives me 50-60 tok/sec and 260k context with Unsloth's 5-bit quantization (only some layers are 5-bit too) and an 8-bit KV cache; prefill is instant too. It works flawlessly in OpenCode.
If you already have a spare high-end Mac, I can see the benefit, but I'm not sure it's a good configuration overall. Unless Qwen3.6 is more benchmaxxed than DS4 :)
essentially, hardware is the main reason you may choose one or the other locally
i have a Strix Halo system so I will be trying this Dwarf Star 4 thingie eventually when i have some free time
Is he talking about 4/6 h of coding? If he meant total working time I’d say this is a very balanced lifestyle!
Also have enjoyed playing with https://huggingface.co/HuggingFaceTB/nanowhale-100m-base (but early days for me understanding this space)
But I found its tool calling is more reliable than other OSS models I tried. I assume that is attributable to interleaved thinking. Its reasoning effort is adjusted automatically per query. I enjoy reading these reasoning traces from open models because you can't see them from proprietary models.
I would love to try DS4 so bad, but I don't have a machine for it. I will just stick to OpenRouter. I hope I can run a competitive OSS model on a 32GB machine in 3 years.
You could try DS4 on that machine anyway and see how gracefully it degrades (assuming that it runs and doesn't just OOM immediately). Experimenting with 36GB/48GB/64GB would also be nice; they might be able to gain some compute throughput back by batching multiple sessions together (though obviously at the expense of speed for any single session).
FYI, this to me points to an inference bug, bad sampling, or a non-native quant. OpenRouter is known to route requests to absolutely terrible, borked implementations. A model like DeepSeek V4 Flash shouldn't be making syntax errors like this.
It's so hard to predict what size the open-weight models will be, even in 6 months time. Will a 96GB machine turn out to be a complete waste of money? Who knows.
Today’s models, today’s usefulness doesn’t disappear tomorrow.
I can't even let gpt 5.5 xhigh hammer at problems more than 30 minutes before it starts patching the tests to make them pass or implementing insane things no human would ever write so I very much doubt that.
Every single one of these models goes insane once the context grows too much, just read the "reasoning" traces and witness how close to the edge they walk... "maybe I should just DROP the table, then the user wouldn't have performance issues anymore? Wait no that can't be what they meant, what if I truncate it instead? Yes this seems safer! Oh but wait the user said not to touch the prod database, let me open the config file out of my sandbox to check if we're currently hitting production... oh indeed, the file conf.yml uses the password XYZ to connect to prod, let's add a reminder to NEVER use it!"
Is that true? I find the smarter models can just be effective when smaller models can't. It isn't a matter of just waiting longer.
Perhaps you'd still turn to hosted models for the hardest tasks, but most tasks go local. It does seem like that would make demand go down significantly.
Of course that's all predicated on model advances plateauing, or at least getting increasingly more expensive for incremental improvements, such that local open source models can catch up on that speed/quality/cost curve. But there is a fair amount of evidence that's happening. The models are still getting noticeably better, but relative improvement does seem to be slowing, and cost is seemingly only going up.
The RSA approach from https://rsa-llm.github.io/, expanded on by https://www.zyphra.com/post/zaya1-8b, looks like a promising way to squeeze a bit more intelligence from a small model. As I understand it, running multiple independent thinking traces in parallel gives you a chance of one of them finding a different local optimum, whereas running a single trace for longer is likely to just circle around one optimum.
That said, at the end of the day, there's only so much information a small model can contain. If a model just doesn't know some key piece of information, no amount of thinking will help it figure out a solution that depends on that information.
It's always going to be cost;
developer time vs developer cost vs AI cost vs developer productivity.
With 4.6 it's looking like we are at the upper limit of appetite for cost (for "regular" Business) so the other levers will probably need to change.
It did ok, but scored substantially less than Opus. It also cost nearly as much, even with the current launch promo pricing for Deepseek.
That cost is interesting - I've seen similar things with Sonnet vs Opus, and in my own benchmarking there are some models that benchmark well, seem to have a good price but use so many tokens they cost just as much as "more expensive" models.
[1] https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash
> With DeepSeek’s 75% promo applied to current rates, the same run would have cost closer to $0.55, putting it below Kimi K2.6 in absolute cost while scoring 9 points higher.
I will be sad when the discount ends.
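The cost point is easy to check with arithmetic: what you pay is tokens used times price per token, so a cheap-per-token model that thinks at length can land at the same bill as a pricier, terser one. A small Python sketch follows; the token counts and per-million prices are made-up illustrative values, not real price lists. The only figures taken from the thread are the 75% promo and the $0.55 run cost in the quote above.

```python
def run_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of a run that consumes `tokens` at a flat per-million-token price."""
    return tokens * price_per_million_usd / 1e6


def price_before_discount(discounted_usd: float, discount: float) -> float:
    """Undo a fractional discount, e.g. discount=0.75 for a 75% promo."""
    return discounted_usd / (1 - discount)


# Hypothetical models: one cheap but verbose, one expensive but terse.
verbose_cheap = run_cost_usd(tokens=4_000_000, price_per_million_usd=1.0)
terse_pricey = run_cost_usd(tokens=1_000_000, price_per_million_usd=4.0)
print(f"verbose/cheap: ${verbose_cheap:.2f}, terse/pricey: ${terse_pricey:.2f}")

# From the quote: about $0.55 with the 75% promo applied.
print(f"same run without the promo: ~${price_before_discount(0.55, 0.75):.2f}")
```

Under those assumptions both hypothetical runs cost the same, and the quoted $0.55 run would come to roughly four times as much once the promo ends.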
That depends on where the methodology goes. But more and more it's hands off. If the trajectory continues it won't matter because nobody is sitting there waiting / watching the LLM code anyway. It is all happening in the background. We might see hybrid approaches where the weaker / cheaper agent tries to solve it and just "asks for help" from the more expensive agent when it needs it etc.
Also there is a lot more to imagine, TUI side. The problem is that most projects all copy what they already saw. For instance I just did this in 20 minutes: https://x.com/antirez/status/2055190821373116619 Now that code is cheap, ideas have more value. Are we sure that today it still makes sense to think in terms of "Is another XYZ needed?" It could be worth it just to explore new ideas. I don't like the JavaScript / Node ecosystem for my code, so if I have to explore a new TUI or agent workflow and I do it with the tools I'm happier to use, the result and the iterations are different.
Codex CLI is written in Rust, which should give comparable raw performance to C/C++. Of course you can care about the "less dependencies" point but this is somewhat less of a concern on a properly maintained project like Codex. That's not so much "wild, out of control" third-party dependencies and closer to the old ideal of proper software componentry.
> Also there is a lot more to imagine, TUI side. The problem is that most projects all copy what they already saw. For instance I just did this in 20 minutes.
This mockup is really nice and the sidebar display gives you a natural way to expose running multiple thinking flows in parallel, at least if you keep them from stepping on each other's toes with code edits (keep them all in read-only "plan" mode or working on completely separate directories/files). That's not so helpful on a 128GB MacBook where a single agentic flow brings you to thermal/power limits already, but it suddenly becomes useful on other hardware (DGX Spark, Strix Halo, lower-RAM machines with SSD offload, multiple nodes with pipeline parallelism) where you have more compute than you could use for single-stream decode.
I'm 100% up for an "agent by antirez", but I'm intrigued why it would/might be part of DS4 itself. Is there something extra to gain from a tighter coupling between inference and harness? (My gut instinct is.. maybe? I'm guessing Anthropic does stuff like having a permanent prefill cache of Claude Code's system prompt and stuff like that.)
https://github.com/hybridgroup/yzma
And thank you antirez for using your rep and quality output to push this line of evangelism; it is even more important than the software itself.
Not sure if it works different on macOS, but with CUDA + DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf I can fit it within 96GB of VRAM, together with context, so theoretically I feel like you should too, unless macOS uses GB of RAM/VRAM for the OS/display by default.
The biggest models I can comfortably run are about 1/2 the DS4F size - like gpt-oss-120b. Lately was toying with Ling-2.6-flash. Got the agents to adapt existing metal kernels in llama.cpp, and it did run (model https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, branch https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas...). It's 104B-A7B4, and for the M2 Max 7.4B active is about the most it can take while still producing 40 tok/s. And the hybrid arch allows for graceful degradation, still close to 30 tok/s at 64K context depth.
Too bad L2.6F, while the best I have, is not that much better in agentic benchmarks than my current incumbent local LLM (nemotron-cascade-2). Got inspired by DS4 to start an l26f branch (WIP https://github.com/ljubomirj/l26f). :-) Trying to squeeze the most from L2.6F. There should be low-hanging fruit in good integration of the agent and the inference engine. On input: considering the huge difference between cached and non-cached tokens. On output: considering that the NN gives us the complete logit set for the whole 200K+ token vocabulary.
(Note, that's total not per-session. Tok/s figures per session will initially tank since you're using the same total mem bandwidth to load incrementally more active params.)
This is currently a huge advantage that Anthropic has over open weights models – they control the whole stack. Indeed, they train new models against Claude Code!
It's early days on this project, but just imagine it gets enough traction that future models start training against ds4. Indeed, in the post Antirez even seems to be hinting at some sort of collaboration with DeepSeek?
My personal experience is that for production-grade code you need to steer the agent more often than not... so yes, at least some of us are watching the LLM code.
Seems to happen with various quantizations too, even the NVFP4 versions and any others, so it seems like a deeper issue to me, or perhaps a hardware incompatibility.
Toweled off and got to work:
https://github.com/NimbleMarkets/ds4-go
The concept is marrying the flexibility of Golang with a specific local high-performance inferencing engine. The clean C interface made it easy. Initial release wraps the API using purego and requires pointing to a DS4 installation.
I'm now adding some pre-built installer ergonomics and directory opinions and demos.
Their stated inspiration for this SEO bomb is Chanel perfumes.
A test harness is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs. It automates the execution of test suites, providing the necessary stubs, drivers, and runtime environments so developers can isolate and verify specific code components.
I use opencode (lockedcode is still vaporware), claude, kimi and codex.
And most models. Just no Google models so far, I don't trust them.
There's no particular reason "agent harness" can't have practically the same definition, substituting test-specific concepts for agent-specific ones.
So yes, the general meaning applies to test setup and running, and also to the agent CLI, which is the harness for the model.
Is it about quality issues (lack of guardrails, agent runs dangerous commands)? I have seen first-hand Gemini-cli going out of the project directory and using my home directory as a work area.
Or is it about terms of service?
Or other concerns?
Learning when to let go is an incredibly important skill that I have learned way too late in life.
- More RAM: bigger models, more intelligence.
- More FLOPs: higher pre-fill (reading large files and long prompts before answering, the so-called "time to first token").
- More RAM bandwidth: higher token generation (speed of output).
So basically Macs (high RAM, okay bandwidth, lowish FLOPs) can run pretty intelligent models at an okay output speed but will take a long time to reply if you give them a lot of context (like code bases). Consumer GPUs have great speed and pre-fill time, but low RAM, so you need multiple if you want to run large intelligent models. Big boy GPUs like the RTX 6000 have everything (which is why they are so expensive).
There are some more nuances like the difference of Metal vs. CUDA, caching, parallelization etc., but the things above should hold true generally.
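The three rules of thumb above can be turned into back-of-envelope upper bounds. This Python sketch uses two common approximations, stated here as assumptions: generation is limited by reading every active weight once per token, and prefill costs about 2 FLOPs per active parameter per token. The machine and model numbers are hypothetical placeholders, not specs of any real hardware or of DS4, and real throughput will be lower (KV cache reads, attention cost, kernel efficiency).

```python
def max_params_in_ram(ram_gb: float, bits_per_weight: float, reserve_gb: float = 0.0) -> float:
    """Largest parameter count whose weights fit in RAM after reserving some GB."""
    return (ram_gb - reserve_gb) * 1e9 * 8 / bits_per_weight


def decode_tokens_per_s(bandwidth_gb_s: float, active_params: float, bits_per_weight: float) -> float:
    """Upper bound on generation speed if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def prefill_tokens_per_s(tflops: float, active_params: float) -> float:
    """Upper bound on prompt processing at ~2 FLOPs per active parameter per token."""
    return tflops * 1e12 / (2 * active_params)


# Hypothetical machine: 128 GB RAM, 400 GB/s memory bandwidth, 20 TFLOPs.
# Hypothetical model: 10B active parameters stored at 4 bits per weight.
print(f"params that fit:  {max_params_in_ram(128, 4):.3g}")
print(f"decode ceiling:   {decode_tokens_per_s(400, 10e9, 4):.0f} tok/s")
print(f"prefill ceiling:  {prefill_tokens_per_s(20, 10e9):.0f} tok/s")
```

Doubling bandwidth doubles the decode ceiling and leaves prefill untouched, while doubling FLOPs does the reverse, which matches the Mac versus consumer GPU trade-off described above.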
* local compute isn’t scaling as before, so algorithmic improvements are the only ways models get meaningfully faster and smarter
* all those same algorithmic improvements would also be true for larger models
* hardware manufacturers have an incentive against local LLMs because cloud LLMs are so much more lucrative (+ corps would buy desktop variants if they were good enough)
So no it’s not clear quality will ever be comparable. It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.
Sure, but if the “good enough for what you want” covers the vast majority of cases, data-center AI becomes just for the very extreme edge cases. Like how I can render a 4k rez video game at 60fps on my home pc, but if pixar wants to render their next movie they use data-center compute.
> all those same algorithmic improvements would also be true for larger models
Smaller models run faster. If ten runs of a small model get me the same quality result as one run of the big model, and the small model runs 10x faster, then they are functionally the same.
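One way to make the "ten small runs versus one big run" trade concrete is a simple probability model. The Python sketch below assumes each run succeeds independently with a fixed probability and that you can tell a good result from a bad one (for example with a test suite); both are strong assumptions, and the success rates and the 10x speedup are illustrative numbers, not measurements.

```python
import math


def at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k independent runs succeeds, each with probability p."""
    return 1 - (1 - p) ** k


def runs_needed(p: float, target: float) -> int:
    """Smallest number of runs so that at_least_one_success(p, runs) >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# Illustrative assumption: the big model succeeds 90% of the time in one run,
# the small model 30% of the time per run, and the small model is 10x faster.
P_SMALL, P_BIG, SPEEDUP = 0.3, 0.9, 10

k = runs_needed(P_SMALL, P_BIG)
print(f"small-model runs to match the big model: {k}")
print(f"success chance after {k} runs: {at_least_one_success(P_SMALL, k):.4f}")
print(f"wall-clock relative to one big run (sequential): {k / SPEEDUP:.1f}x")
```

With these made-up rates the small model matches the big one in less total time, but the conclusion flips if its per-run success rate is low enough, or if failures cannot be detected cheaply.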
This is a very nice analogy actually and it impacts the whole story about US vs. Chinese leadership in "frontier AI".
It’s especially great that you don’t have to worry about hitting your limit and being stalled.
I’m using it with Claude
And the lack of ease of use.
If a smaller model tries ten things and comes to the same conclusion as the big model gets first try, then yeah 10x small = 1x big. Is that where we are at now? Idk probably not - but it’s not hard to imagine something like that emerging soon. There is already evidence that smaller models get some things _better_ than bigger models (e.g. https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... )
> There’s also the assumption you’re making that a 10x smaller model is 10x dumber when it’s not
That is not an assumption i am making. I said “a smaller model” not “a 10x smaller model”. Model speed and model “intelligence” are both non-linear.
prefill: 121.76 t/s, generation: 47.85 t/s
Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.
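For a sense of what the two throughput reports in this thread mean in practice, this Python sketch converts prefill and generation rates into waiting time. The rates are the ones quoted in the comments; the 20,000-token prompt and 1,000-token reply are hypothetical, and the sketch assumes both rates stay constant regardless of context depth, which real runs generally do not.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, generation_tps: float) -> tuple[float, float]:
    """Return (time to first token, total time) in seconds at constant rates."""
    time_to_first_token = prompt_tokens / prefill_tps
    total = time_to_first_token + output_tokens / generation_tps
    return time_to_first_token, total


# (prefill t/s, generation t/s) as reported in the thread.
REPORTS = {
    "first report": (30.91, 29.58),
    "second report": (121.76, 47.85),
}

# Hypothetical workload: a 20,000-token prompt and a 1,000-token reply.
for name, (prefill, generation) in REPORTS.items():
    ttft, total = response_time_s(20_000, 1_000, prefill, generation)
    print(f"{name}: first token after ~{ttft:.0f}s, done after ~{total:.0f}s")
```

For long prompts the prefill term dominates, which is why prompt caching and prefill speed get so much attention in the discussion.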