Qwen 3.6 27B is the sweet spot for local development

Qwen 3.6 27B is the sweet spot for local development(quesma.com)

373 points by stared 3 hours ago | 322 comments

iagooar 1 hour ago |

I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

andai 8 minutes ago | |

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.

There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

geophile 19 minutes ago | |

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.

My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

Matl 16 minutes ago | |

> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.

Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.

But you do need a fast LAN connection, otherwise working with agents will be a pain.

jarjoura 32 minutes ago | |

TBF, I just recently picked up this same model, and it's reminding me of the last gen Intel i9 MBP. Just visiting any non-basic website spins up the fans and battery life isn't great either. Yes, this thing is fast, but damn it gets hot just using it for normal tasks.

Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.

iagooar 28 minutes ago | | |

Pants and gloves, and I guess if you run it for more than an hour, sunglasses too to stare directly at the fire.

Arubis 1 hour ago | |

Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!

manmal 1 hour ago | | |

There is no MacBook Pro with OLED (yet).

swang 1 hour ago | |

I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.

dimitrios1 41 minutes ago | | |

I tried to run it on a M4 Air for shits and giggles.

After about 1 minute the entire machine basically bricked and I had to hard reset :D

cmgbhm 45 minutes ago | |

A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!

xd1936 48 minutes ago | |

Apple does not currently sell a Mac Mini with 64GB RAM.

iagooar 39 minutes ago | | |

Get a 2nd hand one. I was lucky enough to get a new one first, last week I get a 2nd hand one in order to run one of my Hermes minions at work.

acters 1 hour ago | |

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

lee_ars 17 minutes ago | | |

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

pkroll 41 minutes ago | | |

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

cosmic_cheese 22 minutes ago | |

They really need to release those updated Studios already.

seanmcdirmid 37 minutes ago | |

What sort of M5 are you running? A max? MacMini's don't offer max CPUs.

iagooar 31 minutes ago | | |

M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.

Fr0styMatt88 31 minutes ago | |

What kind of speed in tk/s do you get with the MacBook?

iagooar 8 minutes ago | | |

qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too.

qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).

SkitterKherpi 1 hour ago | |

I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.

jazzyjackson 56 minutes ago | | |

They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link

I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option

awesomeusername 1 hour ago | | |

It's out, I'm daily driving one. It's great

busymom0 1 hour ago | |

Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.

verdverm 1 hour ago | |

Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)

oceanplexian 1 hour ago | |

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

chorizo 51 minutes ago | | |

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

iagooar 25 minutes ago | | |

My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.

But man, I have never purchased a computer which is more expensive than a decent family car.

jnovek 44 minutes ago | | |

An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

bensyverson 3 hours ago |

The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.

[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

onion2k 3 hours ago |

None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.

The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

doodlesdev 2 hours ago |

I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?

(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)

pkroll 15 minutes ago |

Since no one else posted it... I have open-webui pointed at a linux box with 128 gig of ram and an RTX Pro 6000, and after a couple of runs on trivia, had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."

72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.

That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)

ctkhn 28 minutes ago |

I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.

cpburns2009 27 minutes ago |

Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.

starefossen 1 hour ago |

We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace

beastman82 3 hours ago |

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

kofu 2 hours ago | |

My experience also aligns with this. I'm running gemma4 31B on a 4090 through llm.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality in the first try (Rust, C++ and C#). so it needs less revisions to be at a level I'm comfortable for merging.

beastman82 2 hours ago | | |

I second unsloth models. I'm using them over blackwell-oriented nvfp4 models as they are (empirically) top quality and performance.

accrual 2 hours ago | |

Nice. I flip flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plan or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.

nok22kon 2 hours ago | | |

you can probably run Gemma4 26B on your card also at 4 bit. World of a difference compared with 12B.

nozzlegear 1 hour ago | |

I can't Gemma4 to actually finish a turn properly, it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.

clusterhacks 23 minutes ago | | |

Huh. Same problem, and I run with llama.cpp. In my case, Gemma4-31B (4-bit quant though) will just stop sometimes.

0x0000000 3 hours ago |

> ... on my Macbook Max M5 128 GB

Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?

scotty79 6 minutes ago | |

Qwen3.6 runs great on GPU with 24GB VRAM. You could get used 3090 for it.

kllrnohj 3 hours ago | |

You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.

DanHulton 2 hours ago | | |

This is what I run on an M5 MacBook Air 32GB. Works great.

I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.

Definitely the sweet spot for me.

__s 3 hours ago | |

I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced

rhdunn 2 hours ago | |

A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files).

For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.

jboss10 43 minutes ago | | |

For the 35B model, ofloading to RAM doesn't slow it down much. If you have a nice CPU and a weak GPU, it will be fast enough to use.

wpm 3 hours ago | |

It wasn't $10k a month ago

mr_mitm 2 hours ago | |

Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.

spike021 3 hours ago | |

Certainly won't work on my M4 Pro with 24GB lol

MatthiasPortzel 2 hours ago | | |

I’m using it on a 48GB machine and it causes some lag, so it might be worse on 24, but it should run.

Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).

https://unsloth.ai/docs/models/qwen3.6

whynotmaybe 2 hours ago | | |

I feel you!

Sent from my 8gb M2 Mac mini.

ljosifov 45 minutes ago |

Running 27B dense model on M5 128GB is ok, but one can do better.

On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.

27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.

drnick1 43 minutes ago | |

Works beautifully on a 3090, very usable speed. Don't expect Opus 4.8-level performance, but there are some things you just need to keep local.

ljosifov 17 minutes ago | | |

True - they are workhorses. Not super bright, but good enough for lots of everyday tasks. I've found sweet spot to be turning thinking off, as it adds small or no value, while increasing the token count and waiting time. Last 27B I used was https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-GGUF - specifically post-train adapted a bit to run with thinking off. I saw today the 35B-A3B MoE from the same HF acc is out, downloading that rn to try.

zedascouves 23 minutes ago |

Just tried on some arduino code. after 10 minutes i got a list of improvements to my code.

I ran those throu opus saking if it was good advice and was not impressed:

I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already in your file, and its headline "critical" claim misreads what the code does. Going point by point:...

zx76 1 hour ago |

I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).

I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.

I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.

I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.

jboss10 36 minutes ago |

I don't understand the talk about how expensive the hardware is. These models can run on very old or old and low end. I've been running Qwen3.6-35B Q4 on an old 1080 GPU(8GB vram) with 32GB sys RAM. I have a i7-12700.

It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.

I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.

These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.

CMay 8 minutes ago | |

The MoE models hold up better on old hardware, but the dense models like this post promotes are in fact better. This isn't unique to Qwen. Are the dense models better-enough to use given the performance costs? It depends on what you are doing.

If a model runs fast enough for your use case and does exactly what you need it to, then you don't need a much slower model that might be more accurate. If you do anything more complicated, the dense models become more necessary and they are much more computationally heavy by comparison.

On your hardware an Unsloth quant of Gemma 4 26BA4B QAT would likely give you better results, but because it has 4B active parameters instead of Qwen's 3B active parameters, it will probably run slower.

felooboolooomba 28 minutes ago | |

Mind sharing the command line you use to rig it up?

RedCinnabar 3 hours ago |

Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.

guax 1 hour ago | |

Its so funny, these "toy models" would be the wet dreams of researchers not 5 years ago.

Progress marches without mercy.

jboss10 40 minutes ago | |

They can be ran on 32GB with 8GB VRAM. I don't think these will be on 16GB for a while. (35B MoE)

TheCycoONE 16 minutes ago | | |

I have 32GB of RAM with 16GB VRAM and I haven't had a lot of luck running larger models like this. Are you able to expand on that?

giancarlostoro 3 hours ago | |

You need it to run in about 8 GB so you have extra space for the context window.

Catloafdev 3 hours ago | |

Hello, it's the internet calling, today is that day.

https://github.com/ikawrakow/ik_llama.cpp

Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.

rhgraysonii 3 hours ago |

I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?

MatthiasPortzel 2 hours ago | |

I posted this elsewhere, but Unsloth says the 27B model should run in 18GB. That leaves little RAM for other tasks, but it depends on your tolerance for slowness I suppose. I haven’t tried it in 24GB so report back if you do.

https://unsloth.ai/docs/models/qwen3.6

dofm 3 hours ago | |

You might be interested in Ornith 1.0 9B, which is a new intriguing post-training of Qwen 3.5 9B.

Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.

I don't know about 48GB but 64GB should be enough.

simonw 2 hours ago | | |

I've been trying Ornith 1.0 35B, I'm pretty impressed with it: https://simonwillison.net/2026/Jun/29/ornith/

rhgraysonii 3 hours ago | | |

Thanks! I was thinking of doing the 128gb to have some future proofing. I figure at this point, it's akin to a mechanic keeping great tools around, when it comes to having this sort of homelab and exposing it for your own uses. And great practice for building the next era of user facing computing that will be around as this proliferates.

jjcm 2 hours ago |

I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.

Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong

alansaber 25 minutes ago |

Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?

IronWolve 2 hours ago |

I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.

Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.

Otternonsenz 2 hours ago |

Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?

I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).

And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.

jboss10 28 minutes ago | |

I have 8GB VRAM, but 32GB sys ram. I can run qwen 3.6 35B at 30 tok/s. I also use pi, and it's smart enough to extend itself(multishot and maybe a few tries)

For you, you could try gemma-4-26B-A4B

jboss10 26 minutes ago | |

I have 8GB VRAM but 32GB RAM. Qwen 3.6 35B runs nicely.

You should look at gemma-4-26B-A4B. 16+8=24gb and Q4 is about 16GB. Not much context left, but might run.

fumeux_fume 2 hours ago | |

I suspect with those specs, you're not in the game right now for reliably using local models for code generation. The easiest way in is a MacBook with at least 32GB of RAM. This should be able to run a 4bit quantization of qwen 3.6 using the MLX format really well.

Otternonsenz 1 hour ago | | |

Now that I’m dipping more into this space, am gonna see what I can upgrade with the motherboard I have, but RAM pricing as it is, I’ll need to be smart about when I upgrade.

I very much appreciate the frank response, as it makes me feel less defeated at knowing my understanding of how it should work is not the full issue, hahaha

fluoridation 2 hours ago | |

I think at 16 GB you'd struggle to run the regular development tools nowadays, forget about any interesting inference.

Otternonsenz 1 hour ago | | |

Fully agreed, and my hope is as open models grow and change, that getting some amount of this working on Pro-sumer hardware will be more attainable.

But certainly seems like we are a few years away from that, sadly.

Am I also screwed in being able to train my own small model or adjust another one with such a non-workhorse PC?

diseasedyak 1 hour ago |

I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.

For anything else local, including writing some automation scripts and such, it works great.

Zambyte 1 hour ago | |

Can you link the model? I also have a 24gb card (7900 XTX). I've been having success with the dense 27b model, but I'd like to see if the 35b iq4 is any better.

jboss10 54 minutes ago | | |

https://unsloth.ai/docs/models/qwen3.6 And https://huggingface.co/collections/unsloth/qwen36

ai_fry_ur_brain 1 hour ago | |

Whats your example of a "great automation script"?

kpw94 3 hours ago |

> What it does:

> --jinja for tool calling support

Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year

devin 25 minutes ago |

If I have 10k to spend, what should I buy for the best local model experience?

felooboolooomba 30 minutes ago |

What's the minimum requirement for a Nvidia card to run it? For let's say 10 t/s.

cdnsteve 49 minutes ago |

Checkout details on what this runs on for local AI here: https://tokenstead.ai/models/qwen3-6-27b

blopker 2 hours ago |

I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.

However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.

Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.

Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.

Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.

While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.

Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.

iwontberude 2 hours ago | |

> I don't think coding is one

Certainly this is falsifiable easily by any of us doing it on a regular basis

> Qwen stuck in thought loops

This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this

blopker 1 hour ago | | |

Sure, local coding is clearly _possible_, but it's not practical for most people. I've yet to see a reliable setup, if you have one, I'd love to see.

> creating plans, using subagents and compactions

Yes, these are all things that Claude Code does for you. However, for the thought loop issue, these are not the fixes. The canonical fix is to limit the number of thought tokens (llama.cpp's `--reasoning-budget`) or try to mess with the various penalty parameters. In any case, it's not a solved problem as far as I can tell.

MangoCoffee 1 hour ago |

Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.

jboss10 24 minutes ago | |

Qwen 3.6 35B runs on 32GB with a 1080. That GPU is from 2017.

logankeenan 46 minutes ago | |

3090 was released six years ago and is still very relevant for running models locally.

guax 1 hour ago | |

> replace their GPUs faster than they can buy them

How does that work? They have negative GPUs now!

blueside 1 hour ago |

i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.

don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.

narrator 1 hour ago |

In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.

seemaze 3 hours ago |

I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.

https://pi-local-coding-bench.dev

jononor 1 hour ago | |

Many people in LocalLLaMA Reddit community has been reporting the same, that 3.5 122B-A10B is on par or slightly better. And a 3.6 or 3.7 od the 122B is one of the models people want to see the most.

HotGarbage 3 hours ago |

And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.

dofm 3 hours ago | |

It will run (somewhat slowly) on a five year old M1 Max with 64GB RAM.

Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.

I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.

I would really like to see a 12B or 16B Qwen 3.6.

I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.

Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.

sleepyeldrazi 3 hours ago | | |

I need to ask, since I have desperately wanted to make Gemma 4 12B work, but im not sure if its the quant (i usually up it to q8, which is a lot higher than iq4_nl that i use for 3.6 27B) or the model itself, but it just starts confusing itself really quickly when I give it coding tasks. And quickly starts failing tool calls.

I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).

How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.

dom96 2 hours ago |

What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.

I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.

aand16 3 hours ago |

I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!

lor_louis 3 hours ago | |

Do no give me hope like that.

layer8 3 hours ago | |

Are RAM prices down?

alfiedotwtf 1 hour ago | |

Qwen 3.7 120B will kill off Antropic’s IPO

mendeza 3 hours ago | |

I am eagerly waiting!

jensC 1 hour ago | | |

Me too, I am on a Jetson Orion 64GB (about 50W max). Using the nvidia graphic cards for AI seem to be so power hungry that it was not a choice I could take with todays environmental problems.

drillsteps5 1 hour ago |

I honestly don't get the hostility against local models in this thread (and in some other threads recently).

I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.

You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).

What exactly is wrong with any of that?

SkitterKherpi 2 hours ago |

27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.

mbgerring 2 hours ago |

Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?

blobbers 3 hours ago |

How does llama.cpp use the GPU efficiently as opposed to MLX?

Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?

TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.

If I can generate voice at the same time as video, that would be useful.

dannyw 3 hours ago | |

Llama.cpp uses the GPU very effectively because inference of LLMs is very rudimentary and basically as simple as your GPU memory bandwidth. That's essentially the baseline performance ceiling, with model-specific optimisations like MTP potentially increasing it.

The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.

The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.

prasanthabr 2 hours ago |

Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?

drillsteps5 57 minutes ago | |

A decent gaming machine perfectly doubles as your friendly local inference server. Just start llama-server with the model of your choosing and start chatting with it through its Web interface or connect any chat completion-compatible client (agentic or not) which will use REST to send requests and receive responses. From any device on your network. Voila.

LeBit 1 hour ago | |

Which components are you thinking about?

markdog12 2 hours ago |

I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.

beastman82 2 hours ago | |

I posted elsewhere but if you have more space try gemma4 31b

mannyv 1 hour ago |

FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?

kmike84 1 hour ago | |

You care if you run it on a laptop. It's getting hot, fans are spinning, and you may want to use laptop for other things while the agent is working.

mannyv 48 minutes ago | | |

I have a Studio 128gb, so it's not an issue.

anonym29 3 hours ago |

Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.

BoredomIsFun 2 hours ago | |

> I get hallucinated tool call parameters and bizarre invocations

tweaking sampler might help

cat_plus_plus 2 hours ago |

Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)

verdverm 2 hours ago |

Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B

I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark

colinsane 1 hour ago | |

AgentWorld is _fantastic_. i just migrated "down" from the 122B A10B qwen model to agentworld (35B A3B) because it feels as capable, easier to steer, and it's 3x faster.

also i like that if i drop more sophisticated tools into my harness (e.g. any of the NLP/RAG-based search tools in place of grep/rg), the agent will actually reach for them and make progress faster; previous models have been reluctant to embrace new tools.

ascii0eks84 3 hours ago |

Very capable lora adapters are surfacing but it seems they are very niche.

DenisM 3 hours ago | |

Can you share more? It’s the first I hear of lora outside research papers. Practical applications would be great to see.

Lora if effective could be a great reason to run local models.

mikert89 3 hours ago |

none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model

jlongr 3 hours ago | |

skill issue

mikert89 40 minutes ago | | |

the models suck

dmezzetti 2 hours ago |

Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.

rusk 3 hours ago |

Spent a week trying to get sensible results out of llama 3.3 At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.

Qwen on the other hand got straight to work with astonishing competency on the same system.

From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.

culi 3 hours ago | |

You might find this helpful. llama is not anywhere near the Pareto distribution (performance vs cost)

https://arena.ai/leaderboard/code/webdev/pareto?license=open...

https://arena.ai/leaderboard/text/pareto?license=open-source

k__ 3 hours ago | | |

Llama3.1 instruct seems to be doing okay on that page, mostly because it's dirt cheap.

am17an 3 hours ago | |

llama 3? Are you from 2023?

217 3 hours ago |

This is kind of like saying grass is green to be honest

madduci 3 hours ago | |

Like everybody got 128 GB RAM..

sleepyeldrazi 3 hours ago | | |

I've been running it almost since launch on a 3090 (24gb vram), you really don't need that much. Second hand those are really cheap and i get 50-70 t/s (with MTP at 2), full ctx. IQ4_NL (unsloth) on this model seems suspiciously competent, and after the (by now not so recent) updates to q4 KV on llama.cpp, I just keep going back to it after dsv4pro disappointed me for the 100th time because it gave up on a task.

dofm 3 hours ago | | |

Doesn't need it at Q4 at least; it'll run in 64GB.