Local Qwen isn't a worse Opus, it's a different tool (blog.alexellis.io)

137 points by alphabettsy 4 hours ago | 56 comments

glerk 2 hours ago |

If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.

With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.

This is not scientific at all, just vibes, YMMV.
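A minimal sketch of the "give it a shape and let it fill it in" idea, assuming a plain tag-wrapped prompt with JSON examples. The function name `build_shaped_prompt`, the tag names, and the example fields are all hypothetical; nothing here is a documented Qwen prompt format.

```python
import json


def build_shaped_prompt(task, examples, output_keys):
    """Wrap a task in tags, show prior examples, and pin the JSON output shape.

    The tag names are illustrative assumptions, not an official template.
    """
    parts = ["<task>", task, "</task>", "<examples>"]
    for ex in examples:
        parts.append("<example>")
        parts.append("<input>" + ex["input"] + "</input>")
        # Render each example output as JSON so the model sees the target shape.
        parts.append("<output>" + json.dumps(ex["output"]) + "</output>")
        parts.append("</example>")
    parts.append("</examples>")
    # An empty skeleton tells the model which keys it is expected to fill in.
    skeleton = {key: "" for key in output_keys}
    parts.append("<output_format>" + json.dumps(skeleton) + "</output_format>")
    return "\n".join(parts)


prompt = build_shaped_prompt(
    "Classify the log line",
    [{"input": "disk full", "output": {"severity": "high"}}],
    ["severity"],
)
print(prompt)
```

Whether this helps with a given model is exactly the kind of thing worth checking by rerunning it a few times rather than taking on faith.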

weitendorf 23 minutes ago | |

One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.

I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model versions or sizes. And your results can be massively biased by small differences in input.
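A rough sketch of that rerun experiment, assuming you already have some callable that sends a prompt to a model and returns text. The `generate` parameter is a hypothetical stand-in for whatever client you use. Character-level similarity from `difflib` is a crude proxy for "did the runs agree", not a semantic measure.

```python
import difflib
import itertools


def mean_pairwise_similarity(outputs):
    """Average difflib ratio over all pairs; 1.0 means every run gave identical text."""
    pairs = list(itertools.combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    ratios = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


def rerun(generate, prompt, runs=5):
    """Send the same prompt several times and report how much the outputs agree.

    `generate` is a hypothetical callable: prompt string in, completion string out.
    """
    outputs = [generate(prompt) for _ in range(runs)]
    return mean_pairwise_similarity(outputs), outputs
```

Running it once per prompt variant gives a baseline for how much run-to-run noise there is before crediting a difference to the wording.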

There’s certainly still a skill to it, especially with agentic loops: if you can get the model into some kind of self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training data, you’re golden. But it’s hard to know exactly where the sweet spots are (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).

The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I just wish this kind of “stability” in output was more emphasized in their training so that they’d be more predictable. I assume that’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…

h05sz487b 38 minutes ago | |

> It is very much like playing an instrument.

Or it is more like playing a slot machine and you imagine the rest.

cube00 10 minutes ago | | |

This is how I feel whenever I see bold all caps instructions in a system prompt or they've conducted "research" and found the magic prompt template that makes the model pay out.

Maybe it works some of the time but it isn't a solution that works every time.

It reminds me of people that hover waiting to play a slot machine when someone gets up and it hasn't paid out.

glerk 15 minutes ago | | |

It is a bit of both. A non-deterministic instrument and a predictable slot machine.

psychoslave 15 minutes ago | | |

I play slot machines as an instrument! ;)

visiondude 39 minutes ago | |

while not scientific this has been my experience as well. i will add that language specificity in word choice is also a learned behavior. for example, the word “investigate” vs the phrase “look into”. You will find the outputs are quite different. can you guess which will use more tokens? it’s stuff like this that actually sets people apart in the top percentile of using these tools

qsera 24 minutes ago | | |

Mmm..interesting..So now people are finding behavior patterns in LLMs which are trained on behavior patterns of people...

stingraycharles 2 hours ago | |

I agree with your general gist, and in general it’s a case of “the best tool for the particular job”, keeping token spend and other things in mind as well.

What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.

willtemperley 51 minutes ago | | |

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

sanderjd 2 hours ago | | |

I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something indelible that it is not possible to capture empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.

dv35z 1 hour ago | | |

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

vkazanov 1 hour ago | |

The problem is not the details, the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable but the models change all the time.

rkuska 56 minutes ago | | |

It's the system prompts that change all the time, especially in Claude Code.

theshrike79 52 minutes ago | |

Yep.

IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.

BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.

My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]

As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.

But I won't give any creative open-ended tasks to any other model than Claude.

[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...

weitendorf 5 minutes ago | | |

The parsing thing, or the willingness to instantly drop into janky unsanitized string manipulations, or to constantly push back against work on infra projects because some random package on GitHub has 200 stars so it’s totally the safer approach, is driving me insane.

On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude Code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think it’s also a real bias from over-optimization for certain RLVR outcomes vs others.

hashmap 1 hour ago | |

totally true. one key for claude is to not smell like an evaluator, it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.

glerk 48 minutes ago | | |

at the risk of sharing my secret magic spells :)

> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>

can go a long way.

of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.

reverius42 2 hours ago | |

These are the vibes that power vibecoding.

zmmmmm 2 hours ago |

That's a great write up.

The one thing I feel it seems to underestimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.

So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.

sanderjd 2 hours ago | |

Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?

theshrike79 41 minutes ago | | |

The power of Opus isn't just the model, it's in the harness too.

You can try it by using Opus through GitHub Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).

theplumber 1 hour ago | | |

I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close

marak830 1 hour ago | | |

GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but a small business could quite easily run it (or use something like OpenRouter).

rippeltippel 2 hours ago | |

Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.

It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.

It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of a model, which may not carry over to a newer version of that model.

appplication 2 hours ago | |

Agree 100%, even on claude 4.5 being the turning point for agentic coding. It completely turned me around on it.

hypfer 2 hours ago |

That was a lot of text for me still having no idea what the point of the author was (beside what I can infer from the headline that is).

I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.

Does that have anything to do with the topic suggested by the headline? Not sure.

neonstatic 1 hour ago | |

Everything is an ad these days. The article was not useless, but for the information it provides, it could have been two paragraphs.

hypfer 1 hour ago | | |

FWIW it told me stuff about openfaas. Now I know how to mentally file it and how to mentally file the author. The GitHub profile alone might not have sent the same signal, so this is useful.

Is it bad software? Idk. Probably not.

Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.

nessex 11 minutes ago |

This is a great post that covers a lot of the recent ground. I have a very similar setup after a very similar journey, minus the RTX6000. Worth noting though that a lot of the recent changes make a single 3090/4090 much more viable here too. MTP and the recent improvements to kv quantization in particular, as well as model-specific template & quant fixes. I run a 4090 with the 4-bit quantized variant of the same model now and have had a great experience. Qwen3.5 was already a big step up, but with 3.6 and the rest of the improvements it's substantially more reliable as a daily use tool and I find myself reaching for hosted models a lot less. Feels like I could work entirely without them if they were to disappear without going back to typing every line of code myself.

To make 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, gpu and use-case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 kv since the 4090 performs well with it... In addition to most of the same config as the author elsewhere. I get 50-60tps generation with a power limit of 275W (450W is default), more than enough to offer a roughly Opus-speed feedback loop.
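For a sense of scale, here is back-of-the-envelope arithmetic on those figures. It assumes the GPU sits at the full 275W cap for the whole generation and ignores the rest of the machine, so treat it as a rough GPU-only upper bound rather than a measurement.

```python
def joules_per_token(gpu_watts, tokens_per_second):
    """Energy per generated token, assuming the GPU draws gpu_watts throughout."""
    return gpu_watts / tokens_per_second


def kwh_per_million_tokens(gpu_watts, tokens_per_second):
    """Convert joules per token to kWh per million tokens (1 kWh = 3.6e6 J)."""
    return joules_per_token(gpu_watts, tokens_per_second) * 1_000_000 / 3_600_000


# 275 W cap and the 50-60 tokens/s range quoted above.
for tps in (50, 60):
    print(tps, round(joules_per_token(275, tps), 2), round(kwh_per_million_tokens(275, tps), 2))
```

Under those assumptions it works out to roughly 4.6 to 5.5 joules per token, or about 1.3 to 1.5 kWh per million generated tokens.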

I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.

There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.

All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.

whazor 10 minutes ago |

Would be interesting to use local models for:

- tool calling

- code base exploration

- anonymizing / abstracting your request

Such that your local AI communicates with the frontier model like an expensive consultant giving high-level advice.

I think that, due to the lower latency of a local model, this could be faster.
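A toy sketch of the anonymizing step, with plain regexes swapped in for the local model that would do this in the setup described above. The patterns, the placeholder format, and the function names `anonymize` and `restore` are all hypothetical. Real requests would need far broader coverage (names, hostnames, secrets, paths).

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize(text):
    """Swap emails and IPv4 addresses for placeholders; return the text and the mapping."""
    mapping = {}

    def substitute(kind):
        def repl(match):
            value = match.group(0)
            # Reuse the placeholder if the same value shows up again.
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder
            placeholder = f"<{kind}_{len(mapping) + 1}>"
            mapping[placeholder] = value
            return placeholder
        return repl

    text = EMAIL.sub(substitute("EMAIL"), text)
    text = IPV4.sub(substitute("IP"), text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back into the remote model's reply."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

The mapping never leaves the local machine: only the redacted text would go to the hosted model, and `restore` is applied to whatever comes back.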

zkmon 17 minutes ago |

The article seems to talk a lot about 27B. In my experience, I saw 35B-A3B to be equally good in quality and the MoE gave more tg/s.

gpt5 2 hours ago |

This article is a good summary of local models, unlike the way they are sometimes hyped as fantastic tools for coding and agentic local work. The reality is that they are rather limited, would not do well on a long or complex task, and are prone to fall into loops, forget their tasks, etc. Not mentioned in the article is that they are also rather expensive - not just for the hardware cost, but also electricity. These 3090 and 5090 machines are pretty power hungry, and these models are pretty slow on these machines, making them consume more power per token.

Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.

wallkroft 1 hour ago |

>Local Qwen isn't a worse Opus >looks inside >local Qwen is not "near Opus levels"

cptskippy 2 hours ago |

I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general-purpose coder and researcher and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.

askvictor 1 hour ago | |

Does Intel make decent GPUs now? I must be out of the loop...

speedgoose 1 hour ago | | |

They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

hbbio 2 hours ago | |

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

jauntywundrkind 2 hours ago | |

What's the value running the smaller model too? Why not just the big model for everything? I note both are dense, as well.

Ritewut 1 hour ago | | |

Tokens per second. The difference between 8B and something like 16B is not as big as you might think in practical usage, and 8B is a lot faster and more interactive than 16B, but there are certain things where it is useful to farm work out to the large model.
