Local AI needs to be the norm(unix.foo) |
Local AI needs to be the norm(unix.foo) |
No student will want to use local AI apps if their Macbook Air's battery dies in 2 hours.
I tried Cline and couldn't get it working well and part of this was that at the time it expected OpenAIs output format.
Dont quite think its ready yet.
I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
informatics aren't magic, you'll never be able to compress """knowledge""" into a small model in a way equivalent to the 1.5 TB model
Oh yeah , it feels independent and not lazy , sure
Work? I don't want it local at all. I want it all cloud agent.
Welcome back to 2014. Let us now continue yelling at the cloud.
If we could even get something like GPT 5.5 running locally that would be quite useful.
If you are simply measuring Watt Cost per Token, you are missing the mark drastically. You have to measure quality output per Watt.
It sounds reasonably difficult to benchmark this, maybe I'm wrong though.
Naturally, it's actually complicated. But LLM is a considerable weight and risk. Maybe it's worth involving and maybe not.
And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.
Streaming Services are getting worse and more expensive. I don't see a single report suggesting piracy is decreasing, it seemingly is only increasing now.
When costs increase, quality decreases people look for alternatives. The advent of faster broadband enabled Napster and MP3 sharing. I think this could have a resurgence if the peices align correctly (a new bitorrent client, a new torrent site, something to break the status quo).
How this related to AI, I don't know, although I wouldn't be set on the idea that we will never have local AI as the norm. There is a lot more movement in this space then there is for local streaming imo.
who can afford a house?
NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090) but it has excellent memory bandwidth (~1.8TB/s). If you want more than the you need to buy an RTX Pro (eg RTX 6000 Pro w/ 96GB for ~$10K) or you get into high high end solutions like H100, H200, etc that have significantly more memory and even higher bandwidth on HBM memory (eg 3.2TB/s+).
NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.
Many, myself include, expect there to be no refresh to the 5000 series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028 realistically.
One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).
So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.
You really have to use this hardware almost 24x7 for it to be economical because otherwise H100 computer hours are probably cheaper.
But what happens when the next generation of GPUs comes out to the trillions in AI DC investment? It's going to halve its value. That's over $1 trillion in capex that will disappear overnight, effectively.
I think Apple is the dark horse here because they have no interest in NVidia's psuedo-monopoly. I'm just waiting for them to realize it.
Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.
It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and werre $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.
So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this last so both companies are doomed because open source models are going to be sufficiently good.
We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.
Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.
When I say 'moat' I don't mean moat specific to a company vis-a-vis other companies, but 'moat' specific to the set of inference providers vis-a-vis self-hosted local inference.
The moat consists primarily of being able to batch inference requests.
If we pretend people weren't interested in long context-lengths, there would be a moat for inference providers. who can batch many requests so that streaming the model weights (regardless if from system RAM to GPU RAM; or from GPU RAM to GPU cache SRAM) can be amortized over multiple requests.
However people do want longer memory than the native context length.
One approach is continual learning (basically continue training by using the past conversation as extra corpus material; interspersed with training on continuations from the frozen model, so it doesn't drift or catastrophically forget knowledge / politeness / ...).
However this is very expensive for inference providers, since they would have to multiply model weight storage with the number of users U=N. For a single user the memory cost of continual learning is much less since they only need to support a single user, and are returned some of the memory cost through elimination of KV-caches, and returned higher quality answers compared to subquadratic approximations of quadratic attention.
An advantage of continual learning is that the conversation / code base / context is continuously rebaked into model weights, and so doesn't need KV caches! It doesn't need imperfect approximations to quadratic attention, it attends through working knowledge being updated.
Nothing prevents local LLM users from implementing this and benefiting from the dropped requirements of KV caches and enjoying true quadratic attention implicitly over the whole codebase, or many overlapping projects indeed.
The only remaining moat of inference providers vis-a-vis continual learning local LLM's is the batching advantage, plus the gradient update costs for continual learning minus the KV storage and compute costs, minus the performance loss due to inexact approximations to quadratic attention.
This points towards a stronger incentive for local hosting than currently realized (none of the popular local LLM tools currently support continual learning, once this genie is out of the bottle it will be a permanent decrease of the inference provider moat, the cost of which can't be expressed merely in hardware or energy costs, since it is difficult to quantify the financial loss of inexact approximations to quadratic attention, the financial loss due to limited effective context length and the concomitant loss in quality of the result)
And local inference requires fairly beefy hardware, that is FAR from ubiquitous across today's userbases. Local models are also still far dumber than what frontier labs can serve.
Weird that this is getting such a tidal wave of upvotes.
> Stop shipping distributed systems when you meant to ship a feature.
But not in the contex the author meant.
Many people don't realize that when you have a frontend, a backend (several instances, for failover/scaling), a (separate) database, maybe some object store -- you have a distributed system.
A recent article[0] touched on that, although most HN commenters[1] latched on the "go" part. But there's something to avoiding rube goldberg machines where we don't need them.
Not at all sure about that. They have really good compute, and DeepSeek V4 (with antirez's 2-bit expert layer quant) may be able to leverage that compute via parallel inference - the jury is still out on that. Now if you had said Strix Halo/Strix Point or perhaps the Intel close equivalents, that would've been a slightly stronger case.
This is what I'm really waiting for. It will enable models comparable to current SOTA at the enthusiast price range.
Does it really work?
Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!
proceeds to brutalise the reader with an 88-point headline font.
A useful framing over “local vs cloud AI” can be split along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn’t touch data), but open-source models running locally for ops: maintenance, debugging and monitoring (touches data). If you need to fall back to frontier intelligence at some point for a particularly hard to resolve problem, you can still rely on local models for pre-transforming and filtering input in a way that's privacy-preserving or satisfies some constraint before it’s sent off to the cloud for processing. OpenAI's privacy filter is a good example of a model that can be used to mask PII and secrets and that can run locally: https://openai.com/index/introducing-openai-privacy-filter/, before sending any data externally for processing.
Another framing for local vs frontier closed which the article mentions is whether the task saturates model capability. With certain tasks like PDF processing or voice or summarization, adding more intelligence isn't necessarily useful. Arguably we've approached that point for chat interfaces already with frontier open-source models. But for coding and ops through well structured tool use inside a coding capable harness, we're still a ways away.
Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.
Still waiting for those analog AI chips that were supposed to make it lightning fast using minimal energy...
This has been the case for way longer than openAI and Anthropic has been around with services like AWS, Cloudflare, etc.
Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)
Not saying it’s _wrong_ either – maybe it doesn’t use a backend of its own (the client downloads content directly from some predefined set of sites), maybe there is functionality to adjust how the summaries work that benefit from doing it on device, etc. Just doesn’t convince me that ”local AI should be the norm”.
Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?
Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.
Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.
Based on what I understand about how the former works, I would assume that the latter has the same properties and failure modes.
Small models are still in their infancy, and there's still much to sort out about and around them, as well
I have been working on a VERY SMALL local-first ai lab myself. nothing crazy, a text editor, a claw, and some lightweight models I started playing with. Absolutely looking for contributions as well.
Reading the tea leaves here, it will probably be common for OS’s to have built in models that can be accessed via API. Apple already does this.
Why not ship your own model? In the age of Electron apps, 10GB+ apps are not unheard of.
It seems easier to have industry specs that define a common interface for local models.
I also assume the OS can, or would need to, be involved in proving the models. That may not be a good thing depending on your views of OS vendors, but sharing a single local model does seem more like an OS concern.
Local models are absolutely going to be the future for things like simple automation and classification tasks that run occasionally and don't need to rely on internet access.
But for all of the serious stuff where you are doing knowledge work, the models will simply continue to be too big, and too slow to run locally.
The article says:
> Use cloud models only when they’re genuinely necessary.
But at least for me, they're genuinely necessary for 99+% of my LLM usage.
At the end of the day, the constraint here really is efficiency and cost.
Privacy can be ensured with the legal system, the same way that businesses that compete with Google still have no problem storing their data in Google Workspace and Google Cloud. The contractual guarantees of privacy are ironclad, and Google would lose its entire cloud business overnight as its customers fled if it ever violated those contractual agreements (on top of whatever penalties they allow for).
I don't think that many people have built apps against these models.
I mean, I use a heavily quantized version of qwen3 for image classification, caption generation, prompt expansion etc. for image generation, instruction-driven edits, and so on. You can go a long way when you don't need a lot.
A model that can do tool calls - any tool calls at all - can look reasonably cool once you put it in a harness where there's enough immediate context to take action. You can get carried away by anything happening at all. But golly gosh it's a long way short of intelligence available in the bigger models.
And the lighter you make your harness, giving the model more free reign, more autonomy, you get a big jump in capability combined with a big jump in failure modes when the model is dumb.
Is there a solution for this? I'm currently just making users download onnx models if they want a feature, but it's not smooth UX
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
I have tried quite a bunch of local models, and the reality is that it's not just a matter of of "it's a small model that should be hostable easily". Its also a matter of whats your acceptable prefill TTFT and decode t/s.
All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (no use as coding agent decently possible), and even for things like classification, context size is immediatly an issue.
I say that with 6m experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing ans tagging HN articles published on the front page barely make the queue sustainable and not growing continuously.
Used to take me maybe 10-20 minutes per sheet.
Then I got codex to whip up a script that sends each sheet to a fairly low parameter locally running LLM and I have the yaml in a couple seconds.
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
I think the Quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users, rather than developer’s laziness.
You can do more and better with private AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.
Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.
HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.
... uh?
Right now it feels like we have all the pieces but nobody integrating all that into an amazing experience.
You don't have any guarantees in terms of data, that's true, you rely on the provider. But this is similar to a database or other services where you don't have the knowledge or resources to run them yourself. Hardware cost is an additional factor here.
If on the other hand your idea works out and the model fits the use case, you can always decide to move to a dedicated infrastructure later.
It would be nice if model makers could at minimum embrace test harnesses, and stretch goal if they’re going to change underlying formats then at least land compatible readers in the big engines (e.g. llama.cpp and vllm)
The goal is that you would assign roles to models based on tasks, capabilities and observed performance. The router would then take care of model selection in the background.
It's tricky though. Probably have another two weeks before I can release the runtime.
I have a preview up at https://role-model.dev/
You can follow me on Twitter if you want updates (see profile)
- and for the web / javascript / svelte applications?
- suggestions for local OCR for bulk images?
Quote 2: "I can only speak on the tooling available within the Apple ecosystem since that’s what I focused initial development efforts on."
Oh, the irony. I will use your tooling when is available on Android with F-droid, that's when, at least, be decoupled from big companies grip.
First of all, no AI model will say "I'm too smart for this question, I suggest you use a cheaper one so I don't make unnecessary money for my owner" or "I'm too dumb, so instead of hallucinating I'll suggest you go to the cloud and ask my smarter sibling".
Second, there is no incentive in the market for tooling to evolve that way. There will be the illusion that some models will do that, similar to today (or maybe some harnesses rather) but nobody will willinglylet money sit on the table. These data centers are not being built to solve world hunger. They are built to ultimately hook you on more realistic fake bs youtube videos so you feel good while getting even more ads injected into your life.
I just dont want us to put all this effort in to on-device computation when we need to get to "SOTA-equivalent" self-hosted computation faster.
Great observation! Often the excitement of novelty makes us lose sight of the real goal
The important issue is where is the data stored. And there are far to many advantages to having your data in the cloud: you can access it from whatever device you happen to have, and it isn't lost if you lose the device. This also outsources your backups to the cloud which is probably doing a much better job than you would (maybe no on hacker news, but nearly everyone else) - the cloud has earned a bad reputation for backups, but it is still much better than most people would be.
Once you accept the data is going to be elsewhere it doesn't matter if the compute is elsewhere or not. The data is the important part.
What needs to be the norm is more self-hosting your own data. Companies should not be outsourcing this by default - even where you outsource some of it, you need to watch your contracts and ensure the ownership is yours - not shared. Once your data is yours on your own cloud accessible servers we can start asking can we run our AI models in the same data center as we already have our data in. I don't need my AI model to run on my phone, it can run on the server in my basement which has a lot more power available (my phone has a better GPU but I can't afford the battery power to run AI on my phone)
I assumed self-hosted AI would fall under local AI for the purposes of this article. Does the author really need to spell it out?
2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3.6-35B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
A smaller cheaper local model can delivery most the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start to charge the real price, the C-level will have to impose budges or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted
Now today, AI is very expensive and not readily accessible to most people without paying a good amount.
The early internet became now you can just get a free phone from phone companies so long as you get their extras. Then you get a ton of subscriptions and ad-ons, but you don’t have to spend money, could just use youtube with ads etc.
Local AI would similarly shift this dynamic to paying for access to plug-in’s and tools for your local AI to be able to use. Like how the subscription model works right now.
With local model advancements, such as specifically Qwen 3.6 35B A3B, this future is becoming more likely by the year IMO.
Damned if they do, damned if they don't.
Also why doesn't their task manager show that it's actually the one downloading? Why does it go out of it's way to hide this activity?
Since I have conky on my desktop I could catch this immediately, and take the action I preferred with my own computer, which was to _immediately_ disable it.
https://developer.chrome.com/blog/new-in-chrome-148#prompt-a...
https://www.google.com/chrome/ai-innovations/
They have absolutely not been shy about any of this.
This comment is quite dishonest about the nature of the discussion.
Not to mention that the LLM that I choose to run requires a monster machine and is infinitely more capable than whatever google chose to put on their browser?
I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.
Why should connecting small models to big models result in higher output quality than just running the big models without the small models?
Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. A summariser model, a python coding model, or however fine grained/macro you want to go), the people training those models will need to turn a profit.
So how much will that cost? And how often will consumers have to pay? Models have a very short self life. Say you have a dedicated python coding model - that needs re-training every time there's a significant update to the language itself, any popular packages, related technologies (e.g. servers, cloud infra etc). So how often will users need to "upgrade" to the lastest version? It's going to be "frequently".
And it still needs the language stuff on top of that. Users aren't going to interact with a python coding model by writing python. They're going to use natural language. So the model needs all that stuff. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function". It needs to know about bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?
And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significat proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", then people will just stop using it. The utility of these things will be quickly overshadowed by the frustrations.
The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.
Commoditizing complements. If Anthropic/OpenAI/etc is eating your lunch, make it work with cheap local LLMs , you can beat them on price by having local inference you don't pay (nor need data centers for), and try to keep your (user/data) moat.
The more Anth/OAI disrupt, the more likely this is to happen. If they don't disrupt enough (.ie: grow as an ecosystem to defend against incentives to commoditize), then yes, those incentives are removed, but they also leave money on the table, which they need.
Not only at business level, but also geopolitical (to a lesser extent? or not since lots of open weight models comes form China?).
The additional up-front cost for hardware designed to run an LLM in addition to normal workload is unlikely to be accepted by most consumers.
The scale will be very constrained (like Apples on-device models which are small, heavily quantized, and have a small 4K token context window). It’s also terrible for battery life.
AI as it is implemented today is simply just computationally expensive and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don’t really see it getting large scale adoption.
Companies will probably need a server-backed solution as fallback if they want reasonable user experience, so why even invest in diverse hardware support.
I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies. The temptation to earn money with your data is just too great, plus the risk of the data being stolen and sold illegally.
Local AI should be the default. For everone who can't do local AI, we need confidential compute. Yes, it has been hacked before. But it's making it a lot harder.
Still, we all do it with Google. (I don't do it anymore but i did it for mostly two decades so I include myself)
We don't. And never did.
1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.
2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.
All of this being said, it seems Claude gave up this "constitution" it used to train on? I remember trying to get it to help me code some video editing tools, and it was convinced I was pirating videos and so wouldn't help me anymore in that session.
It runs by now on 8GB Vram, so a Legion 5 for about 1500$ could be a good workhorse.
The obvious optimization for the case presented would be to generate all the summaries on a server instead of in the client. Then the totally used compute would scale with the number of articles instead of number of users.
The moment we see standardized and batteries-included pathways to integrate search, ideally at no additional cost, in things like LM Studio combined with better tool calling in the local models, you'll quickly see local model performance catch up.
* What is the answer to local AI for native apps on Windows?
* What is the answer to local AI for Linux?
This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.
run an ai api endpoint on a unix domain socket
``` harbor pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL
# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox harbor up searxng webui llamacpp openterminal ```
That's it, it's already better than Claude's or ChatGPT's app.
As one commenter mentioned, 2x Mac Studio M3 Max with 512GB can run frontier models and it costs $30k (with RDMA). Apply an efficiency ratio for being in a datacenter, and you understand why OpenAI and the likes spend north of $10k _per customer_ of CAPEX.
Add to that the electricity costs and you've got a very shaky business model. I for one would like to thank the VC for subsidizing my tokens.
With that said, the VCs are not crazy and probably factored in an annual cost decrease of computing power. But how do you make sure that we won't run local LLMs when the HW becomes affordable -- if ever ?
The answer has always been the same in our industry: vendor lock-in. They are getting the users now at a loss, hoping for future captive revenues.
So, be careful when your code maintenance requires the full context that yielded that code, and that this context is in [Claude Code|Codex|Cursor].
This is why I believe OAI and Anthropic I’ve been so aggressive at offering services outside of their pure models like Claude Design. This is what will be competitive and keeping people subscribed.
This is what makes me continuously doubt and rewrite the local-first approach to inline chat in my editor. Next edit/ code complete makes more sense due to latency advantage. But chat is hard.
It's fast and feels good to run locally, but output quality is just not ChatGPT etal.
We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.
You can also…turn it off.
Chrome silently elected people into it _and_ downloaded the model without asking because they decided that’s something they (chrome) fancied doing.
The difference should be pretty obvious.
Please show me where in either of those documents it explains it's going to download a 4GB model.
It's a totally separate tab that opens. It's got nothing to do with what you use as your homepage.
I'm on gentoo. I have to update chrome manually. I updated it. On update I _never_ get a "what's new" page. I've had this profile for more than a decade so I have no actual idea why, but, I can absolutely tell you, I do *not* get one. After update it started consuming all my bandwidth. This use did not show in it's task manager. I have a metered connection. This is a problem for me. I worried it was a compromised plugin. I had to spend 10 minutes in Firefox discovering why chrome was doing this then going to the configuration and disabling this.
This was a disappointing experience. I'm sorry you feel differently; other than stating the obvious, I seriously have no idea what you and the other corporate defense squad members are trying to achieve with this gaslighting nonsense.
Note that this package and update is actually not maintained by Google at all, it's done by Gentoo: https://wiki.gentoo.org/wiki/Project:Chromium/How_to_bump_Ch...
I hate to be an apologist for anything but I think you are pointing fingers in the wrong place. The Google-official releases use the built-in automatic updater and do show What's New. This is a Gentoo release and they chose to do their own thing for updates.
> I've never had a "What's new" tab ever open because I disable the customized home page where that's displayed.
I'm not "denying your experience" of not having the what's new tab. I'm denying your explanation for it.
You wrongly thought it was due to disabling the home page, and then you were insulting to the parent with the snarky "I'm guessing you're not aware that's an option".
You were the one who wasn't aware of the real explanation. Now you make up a totally unwarranted accusations ("going out of your way to deny my exact experience", "gaslighting nonsense"), and add character assassination on top ("you're this eager to defend them?", "corporate defense squad members").
Your comment is extremely inappropriate. Please re-read the HN guidelines, especially:
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
And what did they say to me?
"I'm guessing you immediately close the What's New Chrome tab when you update?"
If that's not snarky then I don't know what is. It's rude, churlish, and presumptive. Yet you take absolutely no notice of this as if you didn't care to do anything other than attack me.
> You wrongly thought it was due to disabling the home page
And you are 100% certain it isn't? Why?
> Your comment is extremely inappropriate.
I think your definitions of "snarky" and "extremely" need to be adjusted. You've downvoted me, you disagree, I get that, what more do you want? You've persisted this conversation as if there is more to get from it but all you seem interested in is this off topic browbeating.
Instead of responding to the point, or engaging with it on any level, you've picked one small nit, and then attempted to derail the entire point with it. What /you/ are doing is inappropriate.
If it's so inappropriate then why did you reply? You can't have it both ways.
It doesn't matter. It's not a reason to be insulting back. I'm not insulting you here, even though you're insulting me.
> And you are 100% certain it isn't? Why?
I already explained. This is trivial to test. Another commenter here also provided the more likely explanation.
> If it's so inappropriate then why did you reply?
To let you know why, so you can learn from this. When people are insulting to me, I don't just ignore it so people can walk all over me. But I also don't insult back. I explain what they're doing so they can learn to have better manners.
It's not picking "one small nit", it's standing up to abusive language and behavior ("corporate defense squad members", "gaslighting nonsense"). Please be better in the future.
It's here, right now. I'm running quantized Qwen and Gemma on a decent, but three years old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spendings. It can answer simple questions, analyze code and even write code when little context is required. Probably I could get a half-decent autocomplete out of it, if I bother with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.
Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self hosting, because sharing can utilize servers much more efficiently. Company can spend half a million bucks on a rig running GLM 5.1, and get data security, flexibility and lack of censorship, but oh it's so expensive compared to Anthropic per-seat plans.
I mean I've been forcing my good old 1080ti to run local models since a short while after llama was first leaked.
But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"
Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.
It's important that the average consumer can do it, I think the limitations for that are: things are changing too quickly, ram+compute components are exceedingly expensive now, we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but blowing their foot clean off.
Would be interesting to see a Taalas-like chip in a product, albeit there's so many changes going on atm with diffusion based models, Google's Turboquant (which as someone who has had to almost always run quantized models, makes a lot of sense to me).
I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless, it tried to analyze a very small codebase before going full on agentic and ran out of context window immediately
I don't have time to tweak 1,000 permutations of settings just re-prove that its not as smart as Opus 4.6
I need out the box multimodal behavior as similar as typing claude in the command line and its so not there yet
but I'm open to seeing what people's workflows are
The best analogy is the difference between having N senior level engineers working for you, versus having N entry level engineers.
With frontier cloud models, you can give a single invocation one task, and it can figure everything out.
With local models, you have to manage the inputs and outputs quite a bit more, but you can achieve similar results for tasks you set up harnesses for. They are not as a good at finding the right answer internally from their own weights, but they are very capable of ingesting context and reformatting text - for example, for debugging, local models can debug issues quite well if you give them the error and documentation for a particular feature you are trying to implement.
Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which is "expensive" to do math on.
Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].
If this research is reproducable and reusable outside their research models, this means the cost of running self-hosted LLMs will be reduced by an order of magnitude once this hits mainstream.
However that's not the real battle here. The real battle is control of information to operate over.
While I might have access to a decent model - I don't have the huge integrated databases of everything that companies like Google have, and increasingly governments will accumulate.
As a citizen AI operating of these large datasets is where the concern should be.
This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud).
I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.
Nvidia and other hardware sellers would love if they could sell a bunch of chips to individual consumers that would sit idle for 95% of its life.
I guess, it'll most likely be an AI processing and everything else becoming API.
In case of GPTs and Claudes of the world. They'll be just using an Indexing APIs and KB on top of their LLMs.
To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.
You can only squeeze so many parameters on consumer grade hardware(that's actually affordable, two 4090s is not consumer grade and neither is 128gb macbooks, this is incredibly expensive for the average person, and the models you can still run are not "good enough" they are still essentially useless).
People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-1 20-1 loss ratio. Guess what, that WILL end and probably soon. This idea that companies can afford to give you access to 2mm in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.
Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy making handmade watches, one a month in Switzerland makes a whole lot more than the guy running a manufacturing line make 50k in China. Just write your own fkin code people.
Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency isn't fungible, the llm hype is a lie to convince you that it is.
With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.
This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.
Not if you're OK with 4-bit quantization. More like $30K-$50K one time.
Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ).
I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?
I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.
> Just write your own fkin code people
Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.
The question is would you choose to save $10 a day if it causes your inference to slow down 10x and waste 2 hours a day waiting on stuff.
This isn't about the local models you're running on your old gaming rig, or the tesla p40 rig you build for local llm's.
This is about code leveraging the local resources where the code is running for it's AI needs. Rather than making an API call to an external AI service, the code leverages the AI capabilities built into the hardware it runs on. With modern Apple, Intel, and AMD silicon all shipping dedicated AI acceleration, this is the where IMO the focus should be heading.
How many Flops or whatever can your phone do? I bet it's enough to paint the walls of your living room, or draw a pretty good pelican on a bike.
- text-to-speech - speech-to-text - dictionary - encyclopedia - help troubleshooting errors - generate common recipes and nutritional facts - proofread emails, blog posts - search a large trove of documents, find information, summarize it (RAG) - manipulate your terminal/browser/etc - analyze a picture or video - generate a picture or video - generate PDFs, documents, etc (code exec) - simple programming - financial analysis/planning - math and science analysis - find simple first aid/medical information - "rubber ducking" but the duck talks back
A quarter of those don't need more than a gig of RAM, the rest benefit from more RAM. Technically you don't even need a GPU, it just makes it faster. I do half that stuff on my laptop with local models every day.
That said, it really doesn't need to be local. I like the idea that I can do all that stuff offline if I'm traveling, but I usually have cell service, and the total tokens is pretty cheap (like $2/month for all my non-coding AI use).
For the different on-device LLM, I literally went to HuggingFace and filtered by the smallest available models that can do the job, and Granite-4.0-h-1b works just fine, it corrects typos, infers dates, currencies all fields I need.
And it got me thinking how my first reflex was to rely on a cloud LLM which is waaay overkill for my need. Granted, an on-device LLM will need to be loaded on the devices on install or downloaded after the fact (which adds latency when the user needs it for the first time) but still, it's a better tradeoff than a cloud LLM.
I decided on a basic parser, and so far it seems to work fine. granted, it struggles with some words, but I just need to finetune it to have as much coverage as possible in terms of typos without triggering false positives.
A lot of developers have that reflex too and go along with it and then just pass the API costs to the customer. I could have gone that route too but turned out I don't even need an LLM for my usecase.
Until then, I'm going to keep sending my JSON to the server farm in Virginia because it's the only place that can serve me a model that actually works for my uses.
The dependency we have with anthropic and openai for coding for instance is insane. Most accept it because either they don't care, or they just hope chinese will never stop open weights. The business model of open weights is very new, include some power play between countries and labs, and move an absurd amount of money without any concrete oversight from most people.
It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reason outside our control.
1- Do a particular task with great capability (due to its constrained, limited scope) 2- Do it in such a way, it integrates gracefully in your workflow without ever requiring you to know you are using an LM.
There is a difference between outsourcing your workflow to AI and actually utilizing it.
Check this: https://www.distillabs.ai/blog/we-benchmarked-12-small-langu...
Reason being is that many workloads for AI are dynamically mixed, where training from multiple subjects comes into play and you just can't know exactly what mix will be required for each task ahead of time.
I was hoping loras would do this for us as well but they don't really seem to have worked out for llms (compared to in the image/video diffusion space).
Perhaps some future model will have some sort of "core" that can load/unload portions of itself dynamically at runtime. Like go for a very horizontal architecture/hundreds of MoE and unload/load those paths/weights once a parent value meets or exceeds some minimum, hmmm.
They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.
In my experiments with local LLMs I noticed that while increasing the size of the model is nice the real thing that turns a barely useless model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model. And it doesn't have a training cutoff. Sure, the bigger model is probably better at using tools but I often find the smaller models to be good enough.
Knowledge and clean data sets are becoming increasingly valuable, and free community knowledge is drying up. The next big programming language won’t have years of Stack Overflow posts to train on.
Maybe we will see some kind of licensing deals where owners of good datasets charge you a fee to let your AI search them.
A self hosted inference solution that offer good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now to be honest I have done zero research about this and have zero idea how feasible that is, maybe it already exists and there's some discord servers I should join?
Edit: I don't need to mention it here but what's incredible is that open models are in the ballpark of the best commercial models so supposedly, the hardest part by far is already solved.
>that open models are in the ballpark of the best commercial models
This is basically true for certain tasks. As an example, chat interfaces are not well poised to take advantage of higher model intelligence than what the best open source models already provide. But coding harnesses still benefit from greater model intelligence and even more so, the reinforcement learning that tightly interlinks the provider's coding harness (claude-code, codex) with the model's tool calling interfaces is another reason for discrepancy in effectiveness even when controlled for model intelligence. The opencode founder (open source coding harness that supports different model providers) was recently complaining about the challenges making the harness work well with different providers: https://x.com/thdxr/status/2053290393727324313
I haven't seen a text-based model sharing site spring up yet (perhaps they already have and I don't know about it yet). Civitai, being focused on image-generation, has the obvious advantage that it's easy to show off impressive results from the model on the front page of the website, and judging what someone's home-grown fine-tuned LLM will produce is a lot harder. But at some point I expect a Civitai equivalent site for text models, especially code-based ones, to become popular. That will seriously undercut Anthropic, OpenAI, et al, and will probably force them to find a price equilibrium.
Because once you're competing with "I spend $2,500 up front on a powerful video card, download an open-source model for free, and then I get pretty much everything I need for free" (additional power cost of running that video card isn't nothing, but probably not noticeable in your power bill compared to what you're already using)... then suddenly $200/month means your customers are thinking "after one year I would have been better off with the homegrown solution". The only way they'll continue to pay $200/month is if Claude/GPT/Gemini/whoever is truly head-and-shoulders above the "pay upfront once for hardware then use it for free afterwards" models available. And that's going to be doable, perhaps, but tough.
I agree local models are great, and it’s cool that Apple has models built in now. But I feel like it basically has to be an OS level feature or users are going to get upset. I’d certainly rather have a small utility call out to OpenAI than download its own model.
I think the future will probably be a hybrid of:
1. local AI for simple, private, everyday tasks
2. online AI for very hard or long tasks
local LLMs builds tool that does exactly what user wants, how it wants it, which is bext UX
this becomes AI literacy
LLMs already nicely bridge the gap form "I want this" to "here's a local page that does it".
examples of tools i have built that requires almost very low tech knowledge * push a button on my phone to take screenshot in my mac (when i watch videos) * help me exercise, gamify it for me * "help me track time spent online to how it impacts what i do in real life, built a tool that rewards and me points me towads things that make me DO things online" * i want to improve my writing, give me exercises and build addiitonal tools (leading to an "append only" digital keyboard i use to exercise )
local AI can already create these tools, and no external company is ever going to beat me/the-user because instead of getting features i don't want, or that almost do what i want, or that do something that advantages the company they just do what I want
Repositories of tools-as-ideas created by others are quite often just index.html and ... that's all? manage data in localstorage, end of it?
Online inferences is still needed for large data (audio/video/images) processing. For now? we don't know, history suggests we'll have the capabilities to do that locally "soon". Or maybe not :)
The main issue is "online for collaboration". Not same user across different devices, that is easy. MeteorJS-style approaches (making local copies of part of dbs, reconcile to remote/origin) seems to be an interesting possibility at small scale, since once you have the right primitives in place you can go horizontally everywhere.
On the other hand… v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we’ll get a similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you’ll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources like a local wikipedia copy from kiwix could do a lot for factual knowledge, too; there’s only so much you can encode in the model itself, but even a cheapish (pre-curent prices) 2TB drive can hold an immense amount of LLM-accessible data.
Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.
This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.
> And for those tasks, local models can be truly excellent.
100% true and I use them for this. But the open-source models seem to be drying up unfortunately. There never was much incentive for the big players to train a model and give it away for free, it was mostly virtue signalling and advertising for their knowhow. The AI "race" seems to have entered a new phase that's more on clamping down costs and making money and this doesn't fit in well.
I hope good local models will still appear but the days that there was a new groundbreaking model for download every couple of weeks is over :'(
TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretation matures far enough or b) our multi-agent systems all become multi-model.
For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.
For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool-use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts
As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.
Maybe it would do better with the new Gemma 4 models, which the Chrome devs have been hinting at moving to. And why the API doesn't let you introspect / pick the model, I'm still not sure.
Yup, that's the plan. No local model, no webpage; more, better and cheaper adtech extortion/surveillance for vendors while everyone else pays for the juice and hardware degradation.
Anthropic is going to go out of business by probably Q1 2027 due to not paying their bills. OpenAI will become a new Oracle, serving a luxury product for enterprises and governments. Google and Microsoft will keep doing what Google and Microsoft do. Chinese vendors will capture a significant amount of business over the next 10 years by running the models in non-Chinese DCs, with demand coming from their much lower prices. 95% of regular users will be paying for open model subscriptions, even if their local machine can run the model, because the providers will be offering features that are hard to impossible to replicate locally.
- Self hosting is expensive. It involves expensive machines with GPUs that cost hundreds per month if you use cloud based ones. You might need multiple of those. And you need people to mind those machines and they are even more expensive per month.
- If you run stuff on your laptop, it consumes a lot of resources and energy. I have qwen running on my laptop. Even minimal usage turns my laptop in a radiator. Nice as a demo, but I can't have it this hot all the time. It would run out of battery, and it's probably not great for longevity of components in the laptop.
- Models are evolving quickly and the self hosted smaller ones aren't as good when it comes to things like tool usage, reasoning, etc. Being able to switch tot he latest model is valuable.
- It's easier to get your use case working with one of the top models than with one of the smaller self hosted ones.
- If you get the wrong hardware, it might not be able to run the latest models very soon.
- Self hosting models is mostly a cost optimization. It only becomes relevant if you hit a certain scale.
- You have alternatives in the form of hosted models via a wide range of service providers. Some of those are EU based and offer all the things you'd be looking for if you are offering your services there. Including legal requirements.
- Reinventing what these companies do in house is technically challenging and possibly more expensive than self hosting models because now you need a lot of engineering capacity dedicated to that. And legal. And all the rest.
If, like most companies/people, you are at the experimenting stage, the cheapest and fastest is just getting an API key from an API provider of your choice. You can take it from there if your experiment actually works. And then it's mostly about optimizing cost. If your API usage goes to the thousands per month or worse, it becomes a cost/quality trade off.
This stuff is expensive because supply is much lower than demand. If everyone was to run their own hardware with a batch size of 1, we'd have 100x more demand for inference hardware and electricity than we do now, and people would be even more frustrated. Efficiency is everything, and we need all the economies of scale we can get to meet demand.
The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time fixing the knobs with the local one.
I just realized this with coding agents, yeah, you probably shouldn't always use latest version at xhigh, but you will end doing it because you do the job in less time, with less "effort" and basically at the same price.
I guess we'll see a real effort for local AI only when major vendors will start billing based on actual token usage.
And now with LLMs we can create even more fabulously addictive experiences, even more finely tuned information flows, even more treacherous servants. I very much doubt that we'll be allowed full control of it all. Every effort will be spent to centralize power, and every effort will be spent to extract as much cash as possible from us for the privilege.
Not all phones are like this. GNU/Linux phones obeying users exist too.
The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.
Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).
https://news.ycombinator.com/item?id=48050751
A specialist handrolls a cut-down framework to power a 1 or 2 bit quantised version of a cut-down sort-of-frontier model.
It can be yours if you have 128GB or 256GB of RAM.
That also doesn't preclude LLM services from being massively successful, they'll just have to justify the pricing and complexity that comes with their adoption, just like any other product.
Which also, as I feel the need to remind everyone every time it comes up, has not yet once been actually shown to be a workable strategy. For any worker in any industry.
And to be clear, I'm talking about a worker, sitting in a chair, replaced with an agent, sitting in... a server, I guess, where nothing else about the org has to be changed. That's what's being advertised and sold, and it has never to my knowledge actually happened.
If their product is "access to a big model running on a really big computer" (if we can count 'multiple data-centers' as a single enormous distributed computer), then the product "small, accessible device that everyone has" risks killing their cash cow.
Ironically enough, the first company to really focus on "an LLM in every phone" will have a good shot at actually being the ones that "changed everythingTM", in the way Microsoft changed the world from IBM mainframes to PCs, or Apple made smartphones a thing.
And he would have the audience believing all the demos were running through third party AI providers, until at the last moment explaining “actually all of that ran on device with no connection to any external services.”
You mean the famously hard task? The one picked because it stretches frontier models to their limits?
Unfortunately, as soon as it's a famously hard task trainers know they need to succeed at it and it loses a lot of the power to detect correctness.
Maybe this is an example of training overfit. But it won't be too long before local models chew through the "famously hard tasks". Except possibly ARC-AGI. That's one benchmark that is still developing with capabilities. And every time a new ARC-AGI benchmark is released it make the SOTA LLMs look pathetic. Because there is very little understanding or transferability with LLMs. But in terms of benchmark-able micro tasks, the local LLMs are improving.
https://www.notion.so/adeelkhamisa/Cohere-s-next-steps-to-be...
I urge you to reconsider this attitude. If AI has a tenth the significance people claim, you're signing away your life; your ideas, your privacy, your very sovereignty of mind, all under someone else's control and revocable at any moment. Don't move your brain to the cloud.
Imagine an alternate timeline where we never had personal general purpose computers, only dumb terminals to access corporate servers on subscription. Don't vote for that world with your wallet, today.
Don't be a cloudhead!
If there's a newline in my comment, why not retain it? Whyyyyy?!
For any model where you notice looping, tune the LLM settings. Reduce temperature and top_p, increase presence/frequency penalty, reduce context size. If you have a specific task to do, fine-tuning is the absolute best way to both reduce memory usage and boost performance and quality. Remember that tiny models are not designed for 0-shot/1-shot, they need lots of specific instruction and context in the prompt, with multi-shot prompts having a dramatic effect on output quality. Try to keep your prompt to specific tasks. Think of small models as children, SOTA models as experienced professionals, and middle-of-the-road models as an average adult; you give the bigger ones more responsibility/agency, but more rules and guardrails to the little ones.
For coding you do want the biggest model you can fit, so this is where larger RAM shines (32GB+ iGPU). If you can fit a dense model, do that. MoE is ok but will perform better on narrower tasks. Use the bleeding edge forks of llamacpp for turboquant/etc and Multi-Token Prediction.
The last thing is quants. If you're running something that isn't the bare model (like an unsloth dynamic quant), model performance is gonna suffer the smaller you go, and smaller models will be much more affected. So try to max out the amount of memory you can dedicate to the model, and pick larger quants like Q6/Q8. You can quant the k/v cache but that also may have a negative effect. And again, if you can fine-tune for a task, you will gain much more performance and quality and reduce memory.
I have a lot of fun with the local models and seeing what they can do.
I appreciate the SOTA models even more after my local experiments. The local models are really impressive these days, but the gap to SOTA is huge for complex tasks.
Gives you more control over the outcome and more steering anyway.
Of course then you'll be asking "uhh lemme know when Opus 6.8 level performance is available locally". People are never happy.
Gemma 4 and Qwen 3.6 are legit beast models that would steamroll every API offering from 2 years ago.
The economics of running SOTA locally just does not make sense, because you’re not using it 24/7 at 80%+ utilization while the cloud based providers can.
The huge difference to open source is that you can't just train an LLM with free time and motivation. You need lots of data and a lot of compute.
I sure want to be wrong on that, I definitely like the open-weight version of the future more
In the same way you can imagine the Chinese government pushing the release of deepseek etc to make sure no one thinks the US has “won” and to keep everyone aware that a foreign model might leapfrog in the short term future etc.
At some point though if OpenAI/Antropic/Google plateau or go bust then the open source sponsorship becomes less likely, as making it open source was a weapon not a principle.
Not everything good in our society needs to have a "business model". People still work on it. It's FINE.
So, the business model of open models is the same as closed models: Sell inference. Open source is marketing for that inference.
https://try.works/#why-chinese-ai-labs-went-open-and-will-re...
This is what I do not understand as well and advertising the knowledge and more advanced model is also the only thing that comes to my mind.
Since a month I am using gemma4 locally successfully on a MBP M2 for many search queries (wikipedia style questions) and it is really good, fast enough (30-40t/s) and feels nice as it keeps these queries private. But I don't understand why Google does this and so I think "we" need to find a better solution where the entire pipeline is open and the compute somehow crowdfunded. Because there will be a time when these local models will get more closed like Android is closing down. One restriction they might enforce in the future could be that they cripple the models down for "sensitive" topics like cybersecurity or health topics. Or the government could even feel the need to force them to do so.
I don't think local will necessarily be open-weight. And then it's not that different from personal computing: you're giving up the big lucrative corporate mainframe, thin-client model for "sell copies to a ton of individuals."
So it'd be someone else (an Apple, or the next-year equivalent of 1976 Apple) who'd start eating into that. There are a few on-device things today, but not for much heavy lifting. At first it's a toy, could maybe become more realized in a still-toy-like basis like a fully-local Alexa; in the future it grows until it eats 80-90% of the OpenAI/Anthropic use cases.
Incumbents would always rather you pay a subscription or per-use forever, but if the market looks big enough, someone will try to disrupt it.
Selling managed self-hosting solutions would be another. That is the business of that recent American company.
Selling fine-tuning services or similar adaptations is another. That is what Unsloth is going for, I believe.
Most likely any sound business strategy is going to be of "commoditize your compliments" type. There are many complementary products to open-weight - some probably not invented/discovered yet.
Much like the current Twitter model, being able to put your thumb on the scale of "truth". Bake a stronger bias towards their preferred narrative directly into the model. Could be as "benign" as training it to prefer Azure over AWS. Could be much worse.
Sometimes there are things where the public good is best served with public expenditure.
What stops you from running the best open weighted LLMs currently available on consumer grade hardware for the rest of time? They're good enough for 95% of use cases, and they don't have a used by date. From what I can see, the "danger" is not having the next tier that comes out, but the impact of that is very low.
For quite a lot of use cases, the current systems arguably do get worse over time if not continually updated. The knowledge cutoff date will start to hurt more and more as the weights age in a hypothetical scenario where you are stuck with them forever.
Coding, one of the most popular usescases today, would not be great if it say only understood java to a version from years ago etc.
Pockets are too deep, it will only change once everyone is out of money.
Uh… the hardware requirements? And stop acting like some dog shit 8B model the average Joe can run on a laptop is even close to being comparable to what Claude or even Codex can currently do.
I have pretty good hardware and I’ve tinkered with the best sub-150B models you can use and they are awful compared to Anthropic/OAI/Grok.
They're not at all, not even close. Especially when you consider the use cases for people who are paying for LLM services today.
Read through a 1970s-era issue of Popular Electronics or Byte, and then spend some time surfing /r/LocalLlama. You'll get a sense of real-time deja vu, like you're watching history unfold again.
1. Innovate, create, and offer it all at sweetheart prices to the public while you rack up debt.
2. Shovel in more money and either buy out or outlast the competition. Become dominant. Lock in your users any which way you can.
3. Enshittify and cash in.
The deals Anthropic, OpenAI, etc. offer won't stay this good much longer. Don't let them lock you in. Failing that, you should budget more for the same service. You're going to need it. Having an open alternative running on your own hardware offers non-negligible peace of mind.
Huggingface.
The reason HF doesn’t also compete for image gen is probably some combination of momentum from Civit AI and HF not wanting to deal with the moderation headache.
But for a site sharing code-generation models, it's a very different scenario. I'm curious to see what will happen in that space.
I can’t wait to run my models locally. The sooner I can do my shit without some American mega corp gulping down all my data, the better.
In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.
At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.
And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.
That's not a problem, that's a feature; I have something like 8 tabs open to different free-tier providers. ChatGPT, Claude and Gemini are the SOTA ones.
I have no problem maxing one out, then moving to the next. I can do this all day, have them implement specific functions (or classes) in my code. The things is, because I actually know how to write and design software, I don't need to run an agent in a loop to produce everything in a day, I can use the web chatbots with copy/paste to literally generate thousands of lines of code per hour while still having a strong mental model of the code that I can go in and change whatever I need to.[1]
---------------------
[1] Just did that this morning on a Python project: because I designed what I needed, each generation was me prompting for a single function. So when I needed to add something this morning I didn't even bother asking an chatbot to do it, I just went ahead directly to the correct place and did it.
You can't do that if you generate the entire thing from specs.
I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.
I'm currently running both Sonnet 4.6 and Qwen 3.6-27b on the same codebase (via OpenCode, the parameters were carefully tuned to have a good quality/context size ratio), and on this project, they both struggle with complex non-trivial tasks, and both work flawlessly otherwise. Sonnet 4.6 understands the intent better if my task is ambiguously formulated, but otherwise the gap is pretty small for coding under a harness.
Different usage patterns - you want to issue a single spec then walk away and come back later (when it has consumed $10k worth of API tokens inside your $200/m subscription) to a finished product.
Many people issue a spec for a single function, a single class or similar. When you break it down like that, the advantages of SOTA models shrinks.
I’ve begun to suspect that most people are probably running different hardware. Sure, you run the latest deep flash on your brand new M5 128G maybe you get acceptable performance?
But honestly, how many people have an extra $9000 laying around these days?
Right now, running with acceptable performance is kind of a luxury. I wish the people who always say - “This is great!” - would realize that not everyone has their hardware.
* Have a box with sufficient spare (V)RAM -- probably 8G for simple categorization with qwen3.5-4b, and 24G or more for more intelligent categorization with qwen3.6-27b or gemma4-31b.
* Download or compile llama.cpp. Choose a model, then choose one of the "quantized" builds that will actually fit on your hardware. There are literally hundreds to thousands of these per model on Hugging Face.
* Spend half a day tuning command-line parameters until llama.cpp doesn't crash.
* Watch llama.cpp regularly OOM itself, then put it in a systemd service with a memory limit so it doesn't take the entire machine down when it dies.
* Download all your photos to a folder.
* Start vibing a Python script to categorize your images by repeatedly prompting the LLM with each image in turn.
* Spend days tweaking/refining the prompt to try to get the LLM to actually do what you want.
The endgame is one of:
* The local model categorizes your images. Yay.
* The local model is too slow and you give up. Boo.
* The local model is too slow, so you spend $1k-$10k on hardware. Your image categorization task becomes a cover story for buying new gear. Yay.
* The local model can't understand your categorization metric, so you give up. Boo.
* You eagerly await news of the next open model being released. Yay?
* You consider replacing your local model with a frontier model, but then you realize you'd be spending $500 to categorize your photos. Boo.
* You refuse to allow Google/Gemini/Anthropic to train on your nudes. Boo.
I’m interested in self-hosting for privacy and control. I already owned the hardware I’m testing with, so my spend is limited to time and electricity.
The “LLM pods” you describe will be loaded with spyware and adware (see: Smart TVs), and average consumers won’t max their compute around the clock so naturally data centers are able to make more efficient use of hardware by maximizing utilization.
In terms of maximising compute I kind of agree but also kinda not - people's laptops and phones aren't burning at 100% 24/7 either. Sure AI requires so much more compute...but not _that_ much more, especially as technology marches on.
For the general use case; I could be wrong but I'd see it sort of like a GPU/NAS/etc. "Pay once" rather than a subscription (to a service offered by a datacenter).
But tbf, the way things are now _is_ all subscription models and consumers just kinda let it happen. I would love to be able to pay a one-off fee for lightroom...but I can't because they want a subscription to "pay for all the updating we're doing". They barely update shit.
But I wish we could actually have nice things. I imagine there's a niche for a middle ground: a privacy-preserving device that uses local-only models and doesn't spy on the user, and sells for a one-time payment with no subscription. It'll be expensive, though, likely more expensive than using a cloud-hosted model.
For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening.
I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.
LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.
Edit: the model is Qwen3.6-35B
It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.
so when I encounter a common but invalidated friction, I explain it like I’m 5, understanding that many of the engineering and entrepreneurial problem solvers have the emotional intelligence of a 5 year old
Be aware that it is still a beast that sucks in a lot of memory.
Oh, one more thing ;) remember to keep your Mac plugged in...
The Open Source AI Definition (OSAID) is quite ridiculous, I prefer the Debian ML policy for defining freedoms around AI.
Frontier US labs could still have an advantage for a long time, but many use cases would start gravitating towards Chinese models if they 10x the data centers and provide similar quality inference for a third of the cost.
A universal translator with image and voice recognition and a decent breadth of encyclopedic knowledge in only a small fraction of an English Wikipedia dump(6GB/20+GB) is not "huge".
It is probably closer to the theoretical limit than anyone could have expected.
Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand
> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.
I predict the B200 data centers we're build today will be obsolete in 3 years and we'll be using whatever models and hardware that isn't even on a road map today. Likely not NVIDIA, likely not OpenAI or Anthropic. Maybe Chinese?
In the mean time, we must continue building software with the clumsy coding agents tied to cloud services as this (for now) seems to be about the only area where AI economically makes sense.
If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.
Note that we are talking about 95% of everyone's use cases, not your specific use cases (which could require better models all the time).
The feature of using all these SOTAs to exhaustion on the free tiers is burning their VC money!
The more I use for free, the more of their money I burn, the closer we'll get to actual 3rd-party and independent setups (local or otherwise).
The ide I built has a full terminal, file system, git integration and AI agent. It uses a private cloud Linux container that is persistent so I can install packages and do anything I want from any phone, computer or browser. It’s amazing that we live in a time where we can build custom software for ourselves just for fun. I will never have to worry about cursor or vs changing getting bought and moth balled like Atom (my favorite ide). I now own my tool and will forever.
Who runs IDE with LLM agents accessing your local filesystem, on bare metal?
Or am I alone to run everything LLM related on my VM just for development work. Then because of ZED genius decision, you need to share your GPU to VM, then some important features will not work, like snapshots. So you also need workaround for this, etc.
Too much hassle, Zed is not for me.
But I'm anti-Apple, so maybe that's the reason :)
Btw, even "ImHex" devs realized this and they're providing version without acceleration for VM use. They're using ImGui. Using it for local desktop app UI is also ridiculous, imho. Whatever.
Doesn’t ghostty also use graphics acceleration? I was under the impression that rendering text is a relatively challenging graphics compute task.
Donations. Have you donated lately?
Wikipedia is cheap compared to creating and training models.
I don’t think donations will suffice at all.
As an example, we had millions of web developers download and install Firebug before browsers shipped their own dev tools. Donations over the course of multiple years would have paid my salary for a month if I were not a volunteer.
But from the “it’s fine” point of view, models will be baked into your OS.
Then later models will be embedded into hardware. Likely only OS makers models.
DeepSeek said it spent $5.6M [1] on training V3, which doesn't sound too much for a near-SOTA model.
An open source entity can come up with a hybrid business model, such as requiring a small fee from those who want to host the model as a business for the first n months following the release of a new model, but making it fully free for individuals.
How many crowdfunded projects do you know that have raised even one percent of that? Who’s going to be in charge of collecting that scale of money? Perhaps some sort of company formed for the benefit of humanity, which will promise to be a non-profit? Some sort of “Open” AI?
Oh, wait.
I can't say that you are lying and you are not exactly exaggerating either. It is true that a new SOTA model -- from literal scratch -- it would be expensive.
But, and it is not a small but, is the starting point really zero?
Side note though, it’s the speed that bothers me more than the reasoning. Qwen 3.5 is awesome, but my Claude subscription can tear through similar workloads an order of magnitude faster than my local LLM can when using Haiku. That’ll matter a lot to some people.
10 years ago I was using 16GB in my MBP and today it's 48GB. It's just a 3x increase during mostly a bonanza period.
And the Mac Studio was available with 512GB until ram got scarce and they cut the max in half recently.
There's plenty of demand for RAM right now. We'll see how this turns out.
It seems that a lot of PC building people are confused too deeply by Intel marketing and fixated on getting the flashiest CPU attainable within budget. Similar things happened with previous AI hype, and some people were using HDD boot drives on GPU rigs and asking others whether low end i7 would cut it. They acted very confused when told that they need SSD and Pentium is plentium.
I mean, there is a shortage going on, but when it'll be over anyhow - whether due to all the last three standing filing bankruptcy or CXMT-Huawei starts delivering in shiploads or Kioxia enters the market - and it comes back down to $2/GB, or even $5/GB, just max it out and forget about it for 10 years. Why not.
Because late stage capitalism demands endless growth in order to pay executives and shareholders (especially those late to the train) more and more YoY.
And those requirements for growth mean that cost cutting is needed. Over the past few decades cost _have_ been cut, building things more efficiently, components becoming cheaper, larger volumes in mass manufacturing.
But we have already reached a point where there are no other places to cut than the quality of the product itself. Look to shrinkflation in food and other places - look at how "live action" versions are being made of previously animated movies, how game franchises from 2 decades ago are being brought back from the dead, the huge influx of remasters etc.
Why? Because it's cheaper to revive/reuse an existing IP than it is to create a new one + it guarantees success with the drooling consumer masses. And cheaper = more Ferraris for the multi millionaire/billionaire execs.
See how much Mario movie made? Just wait...bet you there'll be a live action version. ;)
wrapped. It looks better that way.
The cost of not being efficient is even higher DRAM costs than we have now, given supply and demand.
Maybe the future is a selection of local, specific stack trained models?
https://arxiv.org/html/2605.06663v1
It might be possible to train a big generalist that is a composition of modules, some of which can be dropped dynamically at inference time, depending on the prompt.
> For those of us a bit crazy, we are running KimiK2.6, GLM5.1
Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.
If you believe what you read here, the gap is closing fast.
The cost to transmit text is basically free and instantaneous. The rent (i.e. a GPU in a data center) vs buy is going to favor rent until buy is a trivial expense. Like 50-100 range.
Even then a LLM that just works is easier than dealing with your own
Video game streaming is the closest thing, and it's never really taken off. (And this, IMO, is a good comparison because it's a pretty similar magnitude up-front-cost, $500-$4000.)
Once the local-AI-is-good-enough (Sonnet level for a lot of basic tasks, say) for a $1k up-front investment the appeal of having something that can chew on various tasks 24/7 w/o rate limits, API token budget charge concerns, etc, is going to unlock a lot of new approaches to problems. Essentially more fully-baked line-of-business OpenClaw-type things. Or the smart home automation bot of Siri's dreams. You can more easily make that all private and secure when all the compute is local: don't give any outside network access. Push data into the sandbox periodically via boring old scripts-on-cronjobs, vs giving any sort of "agentic" harness external access. Have extremely limited data structures for getting output/instructions back out. I'd never want to pass info about my personal finances into a third party remote model; but I'd let a local one crunch numbers on it.
Even if you need Opus/Mythos/whatever level for certain tasks, if 95% of everything else you'd pay Anthropic or OpenAI for can now be done on things you own w/o third party risk... what does that do to the investment appeal of building better AI appliances to sell end users vs building better centralized models?
I think "what if today's LLM performance, but running entirely under your control and your own hardware" opens up a LOT of interesting functionality. Crowdsource the whole world's creativity to figure out what to do with it, vs waiting for product managers and engineers at 3 individual companies to release features.
Anyways, who's spending $1k for a LLM machine when they can spend $20 (or 0) on a subscription? And who's having an LLM crunching away 24/7 anyways? Anyone who is going to do something like that probably wants a cutting edge model.
It'll (probably) get to a point where the hardware is cheap enough and advancement levels off. But we're a ways from that and even then when a data center is 20ms away why not offload heavy compute that's mostly text in text out.
Also, because I wrote and own the code I don’t have to update if I don’t want to. I could choose instead to build around the dependency. That’s much more control over than when Microsoft bought GitHub and destroyed the Atom ide which I loved in favor of vscode which I still hate
Not every country is in a crypto-libertarian race to hoard power and wealth.
Meanwhile, in the EU, the model would be collectively financed, trained by a competent, neutral agency... and then completely lobotomized in the name of "the children," "safety," "IP rights," "correct speech," dozens of individual countries' legal and regulatory requirements, and any number of additional vocal, noncontributing NGOs.
So no one would get rich off of the public model, but no one would get much of anything else out of it, either.
As another reply suggests, there's a reason why things happen in the USA first. Even when they don't, the prime movers move here as soon as they can. Or at least they used to.
What is completely different from every other product is how much they’re spending, and how much they’re obligating themselves to spend going forward. I think there’s a very good chance that the existing providers could be miles underwater coming out of this. Even if the business is not the everything to everybody that they’re banking on it being, they still owe all of that money back to the people they borrowed it from, and they will be a lot less likely to float them cash to get them back to a normal operating mode if they burned the last ocean of cash promising the universe and winding up with “oh yeah, that’s pretty useful sometimes.”
I guess if the time horizons is long, like 20 years, then maybe the spending, as it begins to amortize, gets more in line?
I was thinking that a comparison could be to cloud providers, each of which had to spend a lot of money to build out datacenter before making money. Difference there is AWS proved the product first, so when Microsoft and Google came along, they knew it would work and be profitable. With AI, nobody has proven it will work and be profitable, they're all competing for that at the same time which is a potentially dangerous mix for the reasons you cited.
And look at the difference in spending between their building out general-purpose-computing cloud data centers that even then, had potential use cases if the business failed. What are they going to do… start a massive, extremely expensive pre-rendered online gaming service? Only render Disney movies?
I dunno. None of this makes sense to me.
like by selling it at a loss to build dependencies and then jacking the price up year after year by whatever amount is just below the cost of removing the dependency
How serious a risk is poisoned weights?
Can we leverage the cryptobros into using LLM training as a proof of work?
What do you mean "trust it"? It sounds like you want to vibe-code (never look at the output), and maybe for that you need SOTA, but like I said in a different comment, I can easily generate 1000s of lines of code per hour just prompting the chatbots.
I don't, because I actually review everything, but I can, and some of those chatbots are actually SOTA anyway.
With subpar models I must be more careful on providing instructions and check it step by step because the path it chose is wrong, or I didn't ask for or the agent stuck in a loop somewhere.
You are going off vibes alone, this is easily verified, please go verify.
What makes you think they have zero reason to subsidize, because the providers aren't a household names you assume they wouldn't operate at a loss? Whats your logic here? You make no sense.
Also, a lot of money is being made on input tokens and cached tokens, which are much cheaper to compute.
DeepSeek published their math for serving the V3/R1 models. They were 535% profitable: https://github.com/deepseek-ai/open-infra-index/blob/main/20...
If Anthropic and OpenAI are subsidizing the metered API usage, their model is going to end up just as successful as MoviePass. They are burning enough money on the training costs already.
If you have a machine running at 150 tok/ps you can only make $5820 a month at $15 per 1mm running 24/7. It costs a hell of a lot more than 6k a month to run Claude 4.7 @ 150 tok/ps on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless its still not profitable especially for how long it takes to turn around a request and the caching is probably not all that profitable.
If you have a machine running at 150 tok/ps you can only make $5820 a month at $15 per 1mm running 24/7. It costs a hell of a lot more than 6k a month to run Claude 4.7 @ 150 tok/ps on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless its still not profitable especially for how long it takes to turn around a request and the caching is probably not all that profitable.
The cost of cloud compute actually hasn't gone down for old hardware all that much, it still costs $500.00 a year rent 4 core i7700k that's 10 years old. Don't expect much more valuable hardware, like modern GPUs to deflate in price all that quickly.
There's 3 fabs in the world that make ddr7 and they aren't going to be selling their stock to consumers going forward, it will be purchased by datacenters almost entirely and stay in them until EOL.
Your brain is going to atrophy (this is proven), they'll raise the price to something thats closer to break even and you'll be forced to pay it because you no longer have those muscles.
It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!
it isn't that large of a model and the compressed kv implementation is not that complicated
the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upscale it and scale down.
vllm runs dsv4 flash fine right right now
dgx sparks cannot really run it correctly right now with released vllm but there are PRs, it's just a matter of time. you would need 3 of them. they will still be almost 1/2 as fast as the mac studio.
so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.
i don't comprehend why people are in such disbelief at how much better this stuff runs on a mac studio than on NVIDIA hardware with 1/5th the VRAM. look, what can i say? NVIDIA is a bigger rip off than Apple is!
honest question, i'm very interested in this, but too casual as of now to know any better.
I'm not, you've actually illustrated my point. LLMs in 2022 were very impressive. By 2024 the general public was finding them an acceptable replacement for many research driven tasks and massive shortcuts for other tasks (coding, image work, document preperation, etc).
Those models are absolutely runnable on consumer hardware now, and we were extremely happy with the results. It's no different to how we used to think CRTs were amazing or early smartphones, but going back now they seem awful.
We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.
No they weren't. They were a gimmick - it is only in the past 6 or so months that frontier models have started to do stuff beyond mere gimmicks when it comes to coding, and you could make the argument that Mythos has been the first 'Holy shit' moment that we've had that has stepped us beyond 'Yeah that's really neat but...'
> Those models are absolutely runnable on consumer hardware now,
A sub 50B model is awful and can't even write proper English sentences half the time, to say nothing of how bad its world knowledge is. Try the 32B Gemma 4 local model for a week and then go back to Claude and then get back to me.
> We're long past "danger". If what we have is the best we'll ever have open source, we're already in an excellent position.
Not sure what to tell you other than that you and I have very different standards. What we have locally right now is barely more than a glorified autocomplete, and it feels worse than using ChatGPT 2 years ago because the context window is less and it doesn't have good webhooks on consumer setups. Another thing I'd say is that you clearly have no clue what 'consumer hardware' means, or what consumers that can even get this stuff running locally would have to do to get it to even rival the frontier models in terms of their usability (most consumers are't going to just boot into Ubuntu and run this thing from a command line) flow, to say nothing of the hardware requirements. I'd love to never use Claude or Gemini or ChatGPT again for both privacy and money reasons, but the quality of outputs and depth of thinking and writing ability between even the very best local models you can run right now is many orders of magnitude less than what you get using distributed frontier models, and those 'very best' local models require a top of the line machine that 99.9999% of consumers don't have and would never consider buying. The cloud models all have like a trillion(!) parameters now. It isn't even close.
I sure hope the local side of things massively improves over the next 2-3 years, but based on how this has gone my guess is that in 3 years you'll be lucky, if you have very top of the line hardware, to get benchmark performance that we had 6 months ago with the frontier models. The distributed hardware/memory gap is just too big.
This is simply untrue. Using agentic orchestration I was writing production code daily 3 years ago. Hallucinations happened sometimes and context window was smaller (so you had to do some funky workarounds to deal with larger codebases), but it was workable. There have been a lot of marked improvements from a code perspective then - a lot model related yes, but also a lot in the ease of use, interfaces, etc.
> Another thing I'd say is that you clearly have no clue what 'consumer hardware' means, or what consumers that can even get this stuff running locally would have to do to get it to even rival the frontier models in terms of their usability (most consumers are't going to just boot into Ubuntu and run this thing from a command line) flow, to say nothing of the hardware requirements.
You've moved the goalposts. My point was that the "danger" of no new open models being released isn't that high as the existing ones are already impressive. Their ease of use or daily driving isn't relevant to that. If there were a need, someone could wrap a clean interface and support around it, or run it as their own cloud solution.
You seem to be arguing something adjacent to my point, which is fine I guess but I have little to say. Also multiple of your comments have come across quite aggressive and rude. Just food for thought if you want to work on that or not.
A single M3 maxed can run a Q2 Kimi 2.6, though thats with a hardly degraded perplexity.
2x M3s with RDMA can run a lossless Kimi2.6 at Q4, but with CPU only you would get okayish decode but horrible (+1m) TTFT, that wouldnt be a great _interactive_ experience.
I'm very curious what kind of hardware advancements you're imagining. Because we're already kind of near a physical wall regarding heat dissipation on phones.
I mean hey, maybe foundational physics will surprise the world with a radical breakthrough that disappears heat into a black hole or something, but I sure wouldn't hold my breath
Serving models on dedicated hardware is not the same as your at home 150t/s thing. Inference is measured in thousands of tokens / s in aggregate (i.e. for all the sessions in parallel). That's how they make money.
A friend an I had previously worked on an entropy extraction scheme and he recently got around to making a writeup about our work: https://wuille.net/posts/binomial-randomness-extractors/
I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter and it was obviously trained a lot on ordinary code were such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue which mostly solved its confusion.
It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reaches 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, and reading microarch timing tables online and made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct).
I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful.
(oh and this experience was also while I only had the model running at 19tok/s)
Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes.
I could have done this work myself but I didn't have to and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI.
Congrats, but you're in the 0.0001% thats not just frying their brains, fapping to their local models or doing various magic tricks like a toddler entertained by playing with velcro.
At the end of the day you lost an opportunity to improve yourself and excercise your brain, maybe the opportunity cost is worth it idk, but Im going to keep taking things slow.
Handmade swiss watches > mass manufactured immitations. Handmade clothes > walmart clothes.
$50k is a median priced car in the US. I'd guess >99.9% of people do not own $4000 of GPUs. I consider myself a computer person and I dont think I even own $4000 of computer hardware in total
A car is super useful, so is an AI. But even if we decide cars are incomparably more useful a great many people pay much more than $4000 over the minimum viable car, and that's money that could be deployed to secure access to private, secure, and autonomous AI facilities. A few thousand dollars in computing is consumer hardware, or at least could easily be with more reason and awareness driving adoption.
People spend a LOT of money in things less useful than local copy of qwen3.6-27b can be.
A top-spec MacBook Pro is >$4k, so I assure you that plenty of computer people do own $4k of computer hardware.
Hell, most tech folks are wandering around with a ~$1k smartphone in their pocket too.
It builds good will also. it also shows research prowess.
For China it's different. They need to show Americans who don't trust them at all because of propaganda that they have no tricks up their sleeve. It also doesn't hurt when Chinese companies drop models for free people can run at home that are about as good as sonnet. Serious mic drop.
Running AI models on local hardware was exploratory at first, and if it's so easy today it's thanks to open source. It's a little bit coincidental that we have this today, and that mainstream hardware have this capability. The fact that a phone can run very small models is exploratory or some kind of marketing opportunity at best.
Why would hardware company ships cards with more AI capabilites (like more VRAM) in the foreseable future ? On what ground does the marketing for on device AI will keep generating interest ? For something as important, it's very uncertain. But above all, it should not depends on these brittle justifications.
Showing good will in distribution and research prowess today is positive communication, but it can be exactly the oppositite if/when an attack using those small models will reach a high value target.
For China the cultural difference is so huge, it's difficult to say. I would think they first and foremost need to show to evryone inside and outside of China that they match american models. Second, i would say that when americans prefer few very powerfull companies on the get go because they can leverage a lot of capital rapidly to industrialize, China will prefer leveraging a lot of smaller companies exploring a lot of things simultanously (so doing a lot of research), THEN creating legislation to let only the best (or a few) to survive effectively. In the end it's the same result (monopoly or oligopoly), but China may have a stronger core (research) and America may have stronger productive capital, that may be proved obsolete... In the long run, in either side it's a gamble, again.
I disagree on the second point. I think most Americans don't prefer fewer competition, that's a bit antithetical to the free market.
I doubt the Chinese government cares as much about controlling a few companies as you think they do.
China has a few things going for it beyond research. They are mission driven, they actually have needs for this technology, their needs will forward their entire economy as they are the world's largest manufacturers. They are also huge exporters and have buckets of customer support for various languages.
China also has considerably stronger infrastructure for electricity, etc. even with an nividia embargo they are doing more than showing up.
I don't think it's a matter of who "wins". There is no winning. I think China stands to gain far more from LLMs than the US does, and they have proven they don't need the us to do it, even with he us trying to sabotage it's every move into the space. The game is already more or less over in my mind.
If anything I see LLMs as having a huge market in China, and now the US can't even sell it to them.
All I care about is, if I have to use this technology, let me run it locally to avoid the surveillance capitalism aspect. That seems to be the real reason the us has propped up it economy in anticipation for this technology. Yet it doesn't long term benefit the us nor me.
Basically small and medium models that are crazy well trained for their sizes.
Then we have a lot of specular decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.
Local LLM is the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If we were good with LLMs a couple months ago, we're good with the open models now.
That's irrelevant to my decision to use local or not.
I have to assume current architectures aren't optimal though, the idea that we stumbled into the one and only optimal solution seems almost impossible.
If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.
Not all that different from the old terminal & mainframe->pc shifts.
Finally - hardware has seemingly gotten out ahead of software that most folks use - watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4k video really taxed all but the nicest systems. Hardware fixed that problem, like it very well could this one.
This LLM trained only and entirely on pre-1930s texts was able to code Python programs when given only a short example:
Or will human readable code be less and less of a thing as AI learns it's own, more terse language to talk to other AI's.
Definitely not the high end local LLMs. The small ones, yes, absolutely.
> If you project out that hardware just a couple of years
One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current glut for memory, it's unlikely we'll see lots of advancements in terms of average memory available or its bandwidth on regular (not super high end devices) in the coming years.
Alternatively, it's possible we get dedicated SMLs for e.g. phone specific use cases, that are optimised and run well.
A lot of useful AI work is shifting from “knowing more” to “working with more context”, files, recordings, repos, screenshots, browsing history, etc.
Once that happens, memory and orchestration start mattering much more than raw model size.
It feels very obvious that the solution is to have a smaller model that can be trained exclusively on Java information to augment the older model. If the architecture doesn't support it currently, then that's what the architecture will look like in the future.
Otherwise you'd be arguing that, to serve users who want to an up-to-date LLM on topic X, you have to train the model on the entire ABC all over again.
It's simply ludicrous to have a coding LLM that needs to be retrained on the latest published poems and pastry recipes to generate Java.
Having an LLM use a web search tool isn't the same thing as researching a topic, IMO, because it's so ephemeral and needs constant reinforcement. LLMs aren't learning machines, they're static ones.
It's easy to rattle off a half-dozen different vectors of likely enshittification over the next few years -- ranging from increasing censorship, to lower rate limits, to removal of existing features and forced addition of unwelcome new ones, to extortionate price increases, to unexplained and irreversible account bans. The only way to avoid them all is by running weights you own on hardware you control.
How smart and how fast is your local model? Those are certainly important questions, but "Does it exist at all?" is more important.
The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.
What did you use to do this, something standard like llamacpp or something else like vllm or your own contraption ?
It's now spit out about 40 tokens after maybe 18 hours and has not finished the "thinking" stage of responding to the prompt. I'll let it keep running to see what happens
I use an anaconda environment, though would have preferred an "uv" environment, on Linux and automate the startup sequence using the following script (start_comfy.sh) from the term rather than manually starting the environment from same said term:
#!/bin/bash
#
# temporary shell version
eval "$(conda shell.bash hook)"
conda activate comfy-env
comfy launch -- --lowvram --cpu-vae
Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz
I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.
I can't help but feel that companies using AI, engaging in employee layoffs, are shooting themselves in the foot. The endgame for them will be zero profits, since displaced workers translates to no money to pay for goods and services :|
I mean, inference engine might need to get some tweaks, to support whatever compute is available. But then, if you put a few terabytes of disk for swap, and replace RAM to bigger sticks if possible, it should work? Slowly, of course, but there is no reason it should not to.
I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even.
I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical.
They're not smarter, they just know more stuff.
You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.
The "smarts" comes from post-training, especially around tool use.
That's one of the biggest remaining head-scratchers in this whole business. You do need all that unrelated stuff to make a good coding model.
Nobody knows why you can't build a coding model by training on nothing but code, CS texts, specifications, and case studies, but so far it appears that you can't.
Effectively they are saying "yea don't crowd our data centers with small queries, go ahead and send your frontier questions to our frontier models. Oh btw those us models? You can run something about as good for free from us if you want hah." It's a power and marketing move. It's also insanely smart to keep up with it to remain sustainable as a brand. Especially given how small their investments into this are.
Look at anthropics growing pains. Deepseek has other hosts spreading their brand for free while they grow. Brilliant honestly. In my opinion it makes anthropic and openai look clueless on a lot of levels.
China is playing a different game here. To them this is commoditizing their compliment and building good will. The Chinese economy doesn't teter on the brink of collapse to deliver frontier grade LLMs. Nope, Alibaba just made qwen because it needs it. It needs efficient models. Similarly, in China they manufacture and automate so much more than the US ever could. LLMs to them are a topping not the whole meal like they are in the us.
The reason it works: each time you read the model (memory bound) to calculate the next token, you can also update multiple requests (compute bound) while at it. It's also much more energy-efficient per token.
The idea that everyone is spinning up a $2 million in GPUs to scan their email inbox, search the web or avoid learning something is still ridiculous to me regardless.
China? Im getting ready to watch the URKL (universal robot knockout league) go on. The USA is dicking around with failed robot dogs.
The USA has been a failed country, coasting on massive inertia. But the tech avenues from a article I cant find showed the USA 8/64 areas excelling. China was 56/64 areas excelling.
Smart people in China design fast manufacturing lines for $25k/yr.
Smart people in the US design bond hedging strategies or ad-pixel trackers for $250k/yr.
China is in the stage the US was in 60 years ago, and eventually those high paying, high impact jobs will suck the intelligence out of all the "blue collar" work. Just like it did in the US.
Dodging politics, the power structures in us industry need serious revamping.
USA exports and exported services, especially in IT. And a lot. USA has nothing to export is true only if you intentionally ignore stuff USA exports.
They're state companies, not some kind of ethical VC charity fund project.
If the US’s fascist experiment continues past the current president, we’ll absolutely be nationalizing frontier companies or exerting equivalent control.
https://try.works/#why-chinese-ai-labs-went-open-and-will-re...
It did work for Deepseek for sure and it seems to move the needle for Xiaomi's MiMo; but will it be enough for Qwen and Gemma? Those are the models you can actually run without going all-in on AI (but only with gaming GPUs and such).
The compute required to run these models is still very far out of reach for the average consumer, yet known enthusiast, therefore they still sell inference, whilst also getting consumer goodwill for providing open weights.
I don't need a model that can easily produce CSAM or reproduce copyrighted works verbatim in order to be productive.
And that's not how these things work. If you censor the model for one purpose, you will degrade it for others. We both know that the bureaucrats won't stop at either of those purposes. It's not in a censor's nature to walk away satisfied.
And Anthropic is famous for putting guardrails on their models, and yet continue to lead.
We don't have to, and shouldn't have to, tolerate tools that easily produce csam or similarly undesirable output.
There are plenty of other uses that people have been making for a long time-- e.g. I know someone who uses a fine tuned local model to sort their incoming email and scan their outgoing messages for accidental privacy leaks.
I don't agree with your assessment on an opportunity lost-- I got my reps in on the original work, the AI gave an incremental step forward which made the whole exercise somewhat more valuable to me with minimal additional cost. I think this improves the cost vs benefit in a way that makes me more likely to try other pointless activities, knowing that when I run out of gas I can toss it to AI to try some variations.
Sometimes you're also 27 steps deep on a nested subproblem and you're really just trying to solve sometime. Even in finr craftsmanship not every step needs to be about maximum craftsmanship. :) Sometimes it's just good to get something done.
I think this is much like any other tool. One can carve furniture using only hand tools, but the benefits of a router are hard to dispute. Both approaches exist in the world and sometimes both are used in concert.
As far as people frying their brains with AI -- you don't need local models for that, plenty of people are driving themselves into deep personally and socially destructive delusion just using the chat interfaces.
I agree with you, there's a way to use them responsibly like your router anology, I just think most aren't doing this correctly and its a slippery slope. I'll contend that you probably have used them responsibly in your example.
Reciprocal?
That's what Chinese models are doing, and beating Opus et al.
An LLM that knows English very well isn't actually very large and certainly not hundreds of billions of parameters.
... currently testing out Stepfun 3.5 Flash Q4_k_m as a stop gap (unless it blows my socks off first).
Also the fact that an M5 version will be coming, and they likely know they are going to sell out on day one (I expect we'll see a price correction from Apple for higher end configs of M5 studios, base price will probably stay the same), so they need to build up stock reserves.
This piqued my interest on how it does it and after briefly checking the project it seems it only has two features for automatic photo categorization. 1) it can group photos by date and 2) It has face detection and recognition that uses trained weights (so ML "intelligence").
I got away from google images and upload to my own Immich instance.
I also use an open source camera app on fdroid to degoogle that whole path.
"They" fully well know that they current frontier model are maybe 6 month ahead of what people will have access to without their control. See Deepseek as Exibit B
The reason you can't run these locally are more with the fact that those mythos sized models require extreme amount of memory and processing power to run at acceptable speeds. And neither you, nor I can afford to pay for those resources to run those models locally. A big reason is that "running locally" means running on your own hardware. And for almost everyone this means "running on hardware that will spent a big portion of its time just sleeping". Because data center and providers have higher utilization rates, they can easily outpace you. That and the fact that when they place an order it's usually for hundreds of thousands of units.
That is why the huge lobby machine is grinding away to make those models illegal.
Rather I think it is just hard for local LLMs to compete in this early stage when the cloud providers are allowed by investors to be unprofitable.
You can grow the utilization rate well beyond that if you don't always care about getting a quick, real-time response. (And if you do, then maybe the cloud model was the better deal after all!)
And, assuming the allegations are true, don't things like Deepseek and Qwen offer existence proofs that frontier models are (and will forever be) trivially distilled down to run domain-specific tasks on boxes that cost a few months of Claude Max subscription?
Isn't that a function of RAM supply not being available now?
Even if that weren't the case, every corp _needs_ you to be on a subscription.
qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU
qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.
You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.
There's also models that aren't in the r/locallama zeitgeist. IBM released a new 4b parameter model for structured text extraction last week, and there's a sea of recent chinese OCR models too.
IMO the open wights models are so good that in a lot of cases it's not worth paying frontier labs for OCR purposes. The only barrier to entry is the effort to set up a pipeline, and havin the spare CPU/GPU capacity.
Besides those, there are a few smaller open-weights models that are dedicated for OCR tasks, for instance DeepSeek-OCR-2 and IBM granite-vision-4.1-4b. (They can be found on huggingface.co)
The dedicated vision models can be run on much cheaper hardware, including smartphones, than the big models that can process images besides text.
Similarly, besides bigger multimodal models, that can accept audio, images or text as imput, there are smaller open-weights models that are dedicated for speech recognition, e.g. Xiaomi MiMo-V2.5-ASR and IBM granite-speech-4.1-2b.
Apple doesn't even sell a model. They just have a deal to use Googles. They can't "protect" their cloud version of a model they don't have.
That's an interesting way to view the world. I mean, utterly stupid as it is, but interesting.
But the previous sentence is even stupider (a Perl script 10 years ago could write code like Qwen does now?), so I guess at least it's consistent.
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet, idfk, maybe it's a skill issue but I love Opus 4.7, undisputed king, but Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
But they diverge greatly on other particular ones whenever the ViT tower and the apriori knowledge of the world is crucial. I wish Gemma was on par but both me and Google know they not.
I'm going to switch to local LLMs for most stuff soon.
I stopped doing local stuff for a bit when I realised I didn't know how well it is supposed to work so have been on Claude for a few months now.
I think I'll try OpenCode this time.
Usually I do stuff in devcontainers, qwen code (non local) was the only time I managed to lose some work as it got confused when I ran out of tokens.
There's still quite a way to go - it does seem like Claude code itself is pretty badly coded, so I think there is a space for open source to come in with a high quality harness at some point.
Thot_experiment is saying that his 2016 Toyota Prius is a great and reliable car for his daily commute and running errands.
Whereas everyone is screeching about its capability gap with a Lockheed Martin F35 lightning.
I didn't read "and how were those models trained" as "Are we there yet?"
Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.
It's theft all the way across the board, and when someone tries to make the argument that open models theft is bad, but Altman or Amodei's theft is good.. they are revealing a lot about themselves
There will not ever be a monthly subscription for LLM tokens. The economics isn't there.
Local tokens will always be cheaper.
I'm not sure what you mean by "There will not ever be a monthly subscription for LLM tokens." That already exists!
In the future LLMs will be priced per token, not all-you-can-eat.
I'm kidding around. I run 31b models myself too and am perfectly happy with them.
(of course if i'm being honest 640kB is fine, i'm sure tons of the world's commerce is handled by less for example, the delta between a system with 640kb of ram and a modern one is near nil for many people, the UX on a PoS terminal does not require more than that for example, the hacker news UX could also be roughly the same)
How refreshing to hear this kind of old-school hacker thinking, in a thread where most people have given up on local computing in exchange for convenience and permanent third-party dependency.
With embedded systems affordable and ubiquitous, hopefully a growing segment of the new generation will also learn to push the limit of available hardware and see how far we can take it. As an engineer there's a satisfaction in solving things with what you got.
There's a new technique, 1-bit family of language models that can achieve up to 9x memory efficiency compared to existing models. Still multiple gigabytes for practical use I imagine, but it's great progress toward local AI, which I believe will be common in the near future. https://prismml.com/news/ternary-bonsai
Subscriptions aren't gonna go away. They're great for businesses. Rate limits or pricing might change but the underlying business model is very good.
The reason usage-based is so much more expensive than subscription isn't that usage-based is the "true" cost and subscription is a loss leader — just like a buying 30 consecutive day passes to a gym being more expensive than a monthly membership isn't a result of memberships being a loss leader. Memberships are the business model! The day passes are overpriced to steer you into buying the membership.
(This is a generous argument: it also ignores the massive software stack optimization the cloud companies do that doesn't trickle down to local-rig-sized deployments; for example, prefill/decode disaggregation, which would double the VRAM requirements for a local rig — if you could even do it on a local rig, which you can't, because local rigs don't have Infiniband. But at scale, prefill/decode disaggregation improves capital efficiency, since you can tune the compute-bound prefill node differently than the memory-bound decode node.)
The advantage of local rigs is not capital-efficient tokens. It's privacy. But then again, you can get zero-data-retention options from many inference companies, so for many use cases it may not matter unless you need strict guarantees the data never leaves the building...
Sometimes it really is free though, because the hardware was bought to serve some other existing needs and that capital expense was fully depreciated quite some time ago. Underutilised hardware is essentially ubiquitous.
> Within any time budget, you can get many orders of magnitude more large-model tokens off an 8xB200 than off a local rig.
But using that 8xB200 setup to run inference on cheap, non-frontier models is a plain waste. Its highest and best use is in an AI datacenter serving exceptionally smart models like Gemini DeepThink, GPT Pro or Claude Mythos. (If this isn't true, it means that the current level of large-scale investment in frontier, super intelligent AI is misplaced, and you should worry about that; not whether some models are best ran on lower-end hardware!)
No one has 8xRTX Pro 6000s that have depreciated to zero "quite some time ago."
> But using that 8xB200 setup to run inference in cheap, non-frontier models is plain waste
From whose perspective? If someone wants to run an open-source model — and plenty do — someone buying or renting an 8xB200 to serve it cheaply at scale is much better than everyone buying huge amounts of pointless, wasted hardware such as 8xRTX Pro 6000s for $80,000 per person.