Local AI needs to be the norm

1903 points by cylo 58 days ago | 749 comments

pronik 58 days ago |

They will be, and that moment is not that far off. We've got the progression in place already: first, large data centers could have performant LLMs, we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly going into "128 GB VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly moving to "using local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.

reisse 58 days ago | |

> They will be, and that moment is not that far off.

It's here, right now. I'm running quantized Qwen and Gemma on a decent, but three years old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spendings. It can answer simple questions, analyze code and even write code when little context is required. Probably I could get a half-decent autocomplete out of it, if I bother with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.

> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.

Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self hosting, because sharing can utilize servers much more efficiently. Company can spend half a million bucks on a rig running GLM 5.1, and get data security, flexibility and lack of censorship, but oh it's so expensive compared to Anthropic per-seat plans.

pbgcp2026 58 days ago | | |

I'm sorry to spoil it for you, but Perl script was able to do all of that like ... 10 years ago? The out-of-the-box Shotwell manages photos quite well without any intelligence. The problem, as people mentioned above, is SOTA models cognitive and tooling abilities. Also, have you noticed as top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it. See Mythos as Exibit A.

digitaltrees 58 days ago | | |

I built my own IDE and run my own model specifically to have private agentic coding. I can still access model APIs but I can be purely local if I want too. It’s amazing.

DrewADesign 58 days ago | | |

Multiple gazillion dollar companies each seem to be spending to ensure that they alone pretty much dominate all knowledge work, with customers eating up their tokens like Cookie Monster. I wonder if the any of them could survive as LLM providers if they not only failed to do that, but the entire industry ended up selling what the current Cookie Monster would call a “sometimes snack,” for very special occasions?

datadrivenangel 58 days ago | | |

In my experience once you get to ~30 gigs of ram for a model like Gemma4, the rest of the 128g of memory is simply nice to have. The speed and costs are what make it tough though, because its slower and more expensive than the same model served on a big accelerator card, and is going to be worse than a frontier model.

sanderjd 57 days ago | | |

Are there any harnesses that are attempting to optimize for using local models like this? Unsurprisingly, my naive attempts to integrate with harnesses designed for frontier models have gone poorly. But it seems like a harness that understands the capabilities and limitations better could perform significantly better.

fennecfoxy 58 days ago | | |

>It's here, right now.

I mean I've been forcing my good old 1080ti to run local models since a short while after llama was first leaked.

But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"

Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.

It's important that the average consumer can do it, I think the limitations for that are: things are changing too quickly, ram+compute components are exceedingly expensive now, we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but blowing their foot clean off.

Would be interesting to see a Taalas-like chip in a product, albeit there's so many changes going on atm with diffusion based models, Google's Turboquant (which as someone who has had to almost always run quantized models, makes a lot of sense to me).

yieldcrv 58 days ago | | |

I need to see these proper harnesses

I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless, it tried to analyze a very small codebase before going full on agentic and ran out of context window immediately

I don't have time to tweak 1,000 permutations of settings just re-prove that its not as smart as Opus 4.6

I need out the box multimodal behavior as similar as typing claude in the command line and its so not there yet

but I'm open to seeing what people's workflows are

jimbokun 57 days ago | | |

Has anyone tried to calculate the break even cost of buying a PC to run an LLM locally, vs the amount of tokens you could get from an AI provider?

nsvd2 57 days ago | | |

I run Gemma locally on a 3090, it's amazing how useful it is to be able to call out to ollama in a bash script or cron job.

winocm 58 days ago | | |

Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop.

dust1n 58 days ago | | |

Can you share how you use it to categorize trip photos!

antidamage 58 days ago | | |

This is my exact setup as well and dear lord gemma is absolutely batshit insane. I'm trying to get a self-reflection and confidence loop going now, but it does feel like it's not the local resources, it's the limits of the training. Dedicated coding or dedicated real-world task models would be a good optimisation.

root_axis 58 days ago | |

You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling for a model's reliability. Quantized models with double digit param counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.

thot_experiment 58 days ago | | |

Flat wrong. Q6 Gemma 31b feels a lot like opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of usecases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails and then having Claude do it only to come away with something perfectly usable from the local model.

segmondy 58 days ago | | |

Jokes on you. We are already running Deepseekv4Flash, Mimo2.5, MiniMax2.7, Qwen3-397B locally in very affordable hardware. These models are in the real of Opus4.6. For those of us a bit crazy, we are running KimiK2.6, GLM5.1 and more ...

wincy 58 days ago | | |

Won’t these H100s drop in price in a few years? With the data center build out surely these will become 1/10th the price and you’ll be able to set up a local LLM as good as opus 4.7. Even if the frontier model become more advanced, and memory hungry, you could use the same power usage as your oven to run a current day frontier model as needed? If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.

CuriouslyC 58 days ago | | |

Parameter size gets you world knowledge and better persistence of behavior as context grows. Both of those things can be engineered around to a large degree, and the latest Qwen models show that small models can be quite smart in narrow domains and short time windows.

ActorNightly 57 days ago | | |

Yes and no.

The best analogy is the difference between having N senior level engineers working for you, versus having N entry level engineers.

With frontier cloud models, you can give a single invocation one task, and it can figure everything out.

With local models, you have to manage the inputs and outputs quite a bit more, but you can achieve similar results for tasks you set up harnesses for. They are not as a good at finding the right answer internally from their own weights, but they are very capable of ingesting context and reformatting text - for example, for debugging, local models can debug issues quite well if you give them the error and documentation for a particular feature you are trying to implement.

stubish 58 days ago | | |

It depends on what you mean for 'productive'. Article mainly seems to be about targeting consumer level hardware, such as the Neural Processing Unit you need for a 'Copilot PC'. Windows Recall is (was?) one such local AI application. If Microsoft get their way and my next PC has one, I look forward to using it for 'productive' purposes such as playing games, handling natural language stuff and leaving my GPU free for GPUing.

josteink 58 days ago | | |

> You are greatly underestimating the current hardware requirements for productive local LLMs.

Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which is "expensive" to do math on.

Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].

If this research is reproducable and reusable outside their research models, this means the cost of running self-hosted LLMs will be reduced by an order of magnitude once this hits mainstream.

[1] https://github.com/microsoft/BitNet

byzantinegene 58 days ago | | |

i would argue we don't need anything near Opus to be productive. Sonnet is plenty productive enough

DrScientist 58 days ago | |

I think it's inevitable that access to good enough LLM models will be democratised.

However that's not the real battle here. The real battle is control of information to operate over.

While I might have access to a decent model - I don't have the huge integrated databases of everything that companies like Google have, and increasingly governments will accumulate.

As a citizen AI operating of these large datasets is where the concern should be.

pier25 58 days ago | |

How fast do you reckon most people will be able to afford 128-256GB of RAM?

Schiendelman 58 days ago | | |

Other than this recent spike, it's been trending cheaper continuously for decades. In a few years 128GB will be as affordable as 12GB (what flagship phones have now) is today.

cpt_sobel 58 days ago | | |

Their prices are currently so unreachable because of the big players hoarding every chip they can get their hands on, but if/when the market realizes that locally deployed LLMs are the way to go, maybe (hopefully?) then more chips will be available to the consumers for lower prices.

discordance 58 days ago | | |

“Gradually, then suddenly”

emadb 58 days ago | |

Do you think small models will arrive? I mean if I need to write a web application in typescript why should I use a model that knows all the programming languages and it is able to reply to any questions about almost everything? I just a need a small performant model that knows how to write web applications in typescript. That could be very helpful and easy to run on my laptop.

driese 58 days ago | | |

For the same reason that a human who is fluent in five languages can probably express themselves better in either one compared to human that only speaks one, while also having a more nuanced understanding of general grammar. From what I know, learning on a more diverse set makes a model better overall.

thot_experiment 58 days ago | | |

Depending on your laptop, if your laptop is a Strix Halo or a Macbook with a decent amount of ram, that day they arrived is about 6 months ago, and today if you can run Gemma 31b, you're golden for your basic workslop code. You can do most of it with local models. Heck, for a lot of the tier of programming you might encounter in the average job Qwen 35b MoE is good enough and it can hit 100tok/s on decent hardware.

elbasti 58 days ago | |

> The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.

This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud).

I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.

stubish 58 days ago | | |

I think the proportion is small because someone has to pay for the cloud services. When phones, PCs and Desktops ship with NPUs whole new markets open up for all that stuff people want but not enough to pay for.

RataNova 58 days ago | |

The biggest impact of local models may simply be that they prevent remote inference from becoming the only game in town

xnx 57 days ago | |

> how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.

Nvidia and other hardware sellers would love if they could sell a bunch of chips to individual consumers that would sit idle for 95% of its life.

inf3cti0n95 58 days ago | |

Certainly, I don't think Data centers are the way here.

I guess, it'll most likely be an AI processing and everything else becoming API.

In case of GPTs and Claudes of the world. They'll be just using an Indexing APIs and KB on top of their LLMs.

dakolli 58 days ago | |

This is simply delusional, It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.

I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.

You can only squeeze so many parameters on consumer grade hardware(that's actually affordable, two 4090s is not consumer grade and neither is 128gb macbooks, this is incredibly expensive for the average person, and the models you can still run are not "good enough" they are still essentially useless).

People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-1 20-1 loss ratio. Guess what, that WILL end and probably soon. This idea that companies can afford to give you access to 2mm in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.

Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy making handmade watches, one a month in Switzerland makes a whole lot more than the guy running a manufacturing line make 50k in China. Just write your own fkin code people.

Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency isn't fungible, the llm hype is a lie to convince you that it is.

zozbot234 58 days ago | | |

No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference.

With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.

This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.

NitpickLawyer 58 days ago | | |

API prices are most likely not subsidised. A brief look at openrouter can tell you that. There are plenty of providers that have 0 reason to subsidise that sell models at roughly the same average price. So the model works for them (or they wouldn't do it otherwise).

CamperBob2 58 days ago | | |

It cost 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per mm.

Not if you're OK with 4-bit quantization. More like $30K-$50K one time.

Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ).

nullc 58 days ago | | |

> two 4090s is not consumer grade

I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?

I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.

vachina 57 days ago | | |

Training to be artisanal coder now.

hparadiz 58 days ago | | |

Posts like this are so funny to me. I'm staring at a mountain of old hardware right now that cost about $20k ten years ago. I have to pay someone now to come haul it away. What makes you think the current new hardware won't end up with the same fate.

> Just write your own fkin code people

Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.

simooooo 53 days ago | |

Even on a 5090 qwen is really impressive. Felt as good as Claude for little projects.

dnnddidiej 58 days ago | |

Except you will want the frontier to compete. Local models are useful but you will always need $$$ to be in the same order of magintude as frontier. And also $$$ for same token speed.

The question is would you choose to save $10 a day if it causes your inference to slow down 10x and waste 2 hours a day waiting on stuff.

Akuehne 57 days ago |

I feel like lots of people here are just commenting on the headline.

This isn't about the local models you're running on your old gaming rig, or the tesla p40 rig you build for local llm's.

This is about code leveraging the local resources where the code is running for it's AI needs. Rather than making an API call to an external AI service, the code leverages the AI capabilities built into the hardware it runs on. With modern Apple, Intel, and AMD silicon all shipping dedicated AI acceleration, this is the where IMO the focus should be heading.

How many Flops or whatever can your phone do? I bet it's enough to paint the walls of your living room, or draw a pretty good pelican on a bike.

0xbadcafebee 58 days ago |

Here's some things you can do right now with local models on a consumer device:

- text-to-speech - speech-to-text - dictionary - encyclopedia - help troubleshooting errors - generate common recipes and nutritional facts - proofread emails, blog posts - search a large trove of documents, find information, summarize it (RAG) - manipulate your terminal/browser/etc - analyze a picture or video - generate a picture or video - generate PDFs, documents, etc (code exec) - simple programming - financial analysis/planning - math and science analysis - find simple first aid/medical information - "rubber ducking" but the duck talks back

A quarter of those don't need more than a gig of RAM, the rest benefit from more RAM. Technically you don't even need a GPU, it just makes it faster. I do half that stuff on my laptop with local models every day.

That said, it really doesn't need to be local. I like the idea that I can do all that stuff offline if I'm traveling, but I usually have cell service, and the total tokens is pretty cheap (like $2/month for all my non-coding AI use).

chakintosh 57 days ago |

I'm literally working on an iOS app right now that needs to infer some input fields from free text typed by the user. Now to take into consideration typos, unstructured text (pricing, dates .. etc), I was pondering a cloud LLM or a basic local parser or even a local on-device LLM (ANE for 15+ devices and a different on-device LLM for the older models)

For the different on-device LLM, I literally went to HuggingFace and filtered by the smallest available models that can do the job, and Granite-4.0-h-1b works just fine, it corrects typos, infers dates, currencies all fields I need.

And it got me thinking how my first reflex was to rely on a cloud LLM which is waaay overkill for my need. Granted, an on-device LLM will need to be loaded on the devices on install or downloaded after the fact (which adds latency when the user needs it for the first time) but still, it's a better tradeoff than a cloud LLM.

I decided on a basic parser, and so far it seems to work fine. granted, it struggles with some words, but I just need to finetune it to have as much coverage as possible in terms of typos without triggering false positives.

A lot of developers have that reflex too and go along with it and then just pass the API costs to the customer. I could have gone that route too but turned out I don't even need an LLM for my usecase.

coevcan 57 days ago | |

Apple includes a local LLM on all recent iPhones, https://developer.apple.com/documentation/foundationmodels. Seems like a bad idea to force your users to download a 3GB LLM just to parse a text field.

chakintosh 57 days ago | | |

Yeah but I need broader coverage on older phones. No I'm not going for a 3rd party LLM. Foundation Models for iPhone 15 and newer, and a parser for the older ones. Currently training a Word Tagger in Create ML

adamtaylor_13 58 days ago |

Cool, well let me know when Opus 4.5 level performance is available locally, at speeds that serve everyday use, and 100% I'm right there with you.

Until then, I'm going to keep sending my JSON to the server farm in Virginia because it's the only place that can serve me a model that actually works for my uses.

TheJCDenton 58 days ago |

For the mainstream audience, the sentiment around local ai today is the same that they had around open source a few decades ago. For a few products, some paid solutions were so much more advanced that open source were very often completely overlooked. Why bother ? And the like. Then we had captive SaaS and other plateforms and now it's obviously wrong for most of us.

The dependency we have with anthropic and openai for coding for instance is insane. Most accept it because either they don't care, or they just hope chinese will never stop open weights. The business model of open weights is very new, include some power play between countries and labs, and move an absurd amount of money without any concrete oversight from most people.

It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reason outside our control.

gkcnlr 58 days ago |

It seems like everybody is focused on "LLM"s, a.k.a Large Language Models. One interesting addition to that is fine-tuned- small parameter, distilled, context-dependent small language models that:

1- Do a particular task with great capability (due to its constrained, limited scope) 2- Do it in such a way, it integrates gracefully in your workflow without ever requiring you to know you are using an LM.

There is a difference between outsourcing your workflow to AI and actually utilizing it.

Check this: https://www.distillabs.ai/blog/we-benchmarked-12-small-langu...

fennecfoxy 58 days ago | |

Eh I think the small model thing is kind of a no-go.

Reason being is that many workloads for AI are dynamically mixed, where training from multiple subjects comes into play and you just can't know exactly what mix will be required for each task ahead of time.

I was hoping loras would do this for us as well but they don't really seem to have worked out for llms (compared to in the image/video diffusion space).

Perhaps some future model will have some sort of "core" that can load/unload portions of itself dynamically at runtime. Like go for a very horizontal architecture/hundreds of MoE and unload/load those paths/weights once a parent value meets or exceeds some minimum, hmmm.

wrxd 58 days ago |

The example in the post confirms my theory that for local models to succeed they need to be "good enough", not big enough that they can compete with frontier models.

They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.

In my experiments with local LLMs I noticed that while increasing the size of the model is nice the real thing that turns a barely useless model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model. And it doesn't have a training cutoff. Sure, the bigger model is probably better at using tools but I often find the smaller models to be good enough.

Gigachad 58 days ago | |

Will there even be a web to search in the future? These days public access blogs are dying and being replaced with hallucinated AI websites. Sites with original research like Reddit and YouTube are being locked up to prevent 3rd party indexing.

Knowledge and clean data sets are becoming increasingly valuable, and free community knowledge is drying up. The next big programming language won’t have years of Stack Overflow posts to train on.

Maybe we will see some kind of licensing deals where owners of good datasets charge you a fee to let your AI search them.

Guillaume86 58 days ago |

I think we should separate the private AI discussion from the local AI discussion. The pragmatic choice to run big LLMs is one/several big servers online, but that doesn't mean private companies should be the only ones to run them.

A self hosted inference solution that offer good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now to be honest I have done zero research about this and have zero idea how feasible that is, maybe it already exists and there's some discord servers I should join?

Edit: I don't need to mention it here but what's incredible is that open models are in the ballpark of the best commercial models so supposedly, the hardest part by far is already solved.

FrasiertheLion 58 days ago | |

Another option is verifiably private inference with open source models running inside secure enclaves on the cloud (using NVIDIA confidential computing), and the enclave code is open source and verified via remote attestation upon connection, cryptographically proving that the inference provider cannot see any data. Tinfoil: https://tinfoil.sh/ is a good example of this (disclaimer: i'm the cofounder). You can read more about how this works here: https://docs.tinfoil.sh/verification/verification-in-tinfoil

>that open models are in the ballpark of the best commercial models

This is basically true for certain tasks. As an example, chat interfaces are not well poised to take advantage of higher model intelligence than what the best open source models already provide. But coding harnesses still benefit from greater model intelligence and even more so, the reinforcement learning that tightly interlinks the provider's coding harness (claude-code, codex) with the model's tool calling interfaces is another reason for discrepancy in effectiveness even when controlled for model intelligence. The opencode founder (open source coding harness that supports different model providers) was recently complaining about the challenges making the harness work well with different providers: https://x.com/thdxr/status/2053290393727324313

rmunn 58 days ago |

For image generation, this has already happened. To what degree, I can't tell, as I don't do image generation much so I don't have numbers on Midjourney subscriptions or any other image-AI-as-a-service sites. But civitai.com has become a place where people share their models, based off of Stable Diffusion or other similar bases, with various fine-tunings to achieve desired results. You name it, you can find a model for it at Civitai, and people doing some very creative things with them. (And also a lot of the obvious things, but it's the Internet, what did you expect?)

I haven't seen a text-based model sharing site spring up yet (perhaps they already have and I don't know about it yet). Civitai, being focused on image-generation, has the obvious advantage that it's easy to show off impressive results from the model on the front page of the website, and judging what someone's home-grown fine-tuned LLM will produce is a lot harder. But at some point I expect a Civitai equivalent site for text models, especially code-based ones, to become popular. That will seriously undercut Anthropic, OpenAI, et al, and will probably force them to find a price equilibrium.

Because once you're competing with "I spend $2,500 up front on a powerful video card, download an open-source model for free, and then I get pretty much everything I need for free" (additional power cost of running that video card isn't nothing, but probably not noticeable in your power bill compared to what you're already using)... then suddenly $200/month means your customers are thinking "after one year I would have been better off with the homegrown solution". The only way they'll continue to pay $200/month is if Claude/GPT/Gemini/whoever is truly head-and-shoulders above the "pay upfront once for hardware then use it for free afterwards" models available. And that's going to be doable, perhaps, but tough.

supermdguy 58 days ago |

Interesting to see this after the recent post about Chrome’s on-device model using up 4gb of storage, which frustrated a lot of people [1].

I agree local models are great, and it’s cool that Apple has models built in now. But I feel like it basically has to be an OS level feature or users are going to get upset. I’d certainly rather have a small utility call out to OpenAI than download its own model.

[1]: https://news.ycombinator.com/item?id=48019219

appreciatorBus 57 days ago | |

The way I interpret the drama over the Chrome model is that for a large chunk of users, perhaps the majority, Chrome is the OS, and this 4GB model will be their OS Level feature for local AI.

tzm 58 days ago |

People want local AI, but only if UX is good. Tooling/harness quality may matter as much as model quality.

I think the future will probably be a hybrid of:

1. local AI for simple, private, everyday tasks

2. online AI for very hard or long tasks

anemoknee 58 days ago | |

The Clippy app someone made and posted here a while back is the perfect average person LLM interface;

https://felixrieseberg.github.io/clippy/

all2 57 days ago | | |

This is so good. Wiring in small models for a variety of tasks would make this absolutely sing.

rufasterisco 58 days ago | |

it's a self enforcing loop

local LLMs builds tool that does exactly what user wants, how it wants it, which is bext UX

this becomes AI literacy

LLMs already nicely bridge the gap form "I want this" to "here's a local page that does it".

examples of tools i have built that requires almost very low tech knowledge * push a button on my phone to take screenshot in my mac (when i watch videos) * help me exercise, gamify it for me * "help me track time spent online to how it impacts what i do in real life, built a tool that rewards and me points me towads things that make me DO things online" * i want to improve my writing, give me exercises and build addiitonal tools (leading to an "append only" digital keyboard i use to exercise )

local AI can already create these tools, and no external company is ever going to beat me/the-user because instead of getting features i don't want, or that almost do what i want, or that do something that advantages the company they just do what I want

Repositories of tools-as-ideas created by others are quite often just index.html and ... that's all? manage data in localstorage, end of it?

Online inferences is still needed for large data (audio/video/images) processing. For now? we don't know, history suggests we'll have the capabilities to do that locally "soon". Or maybe not :)

The main issue is "online for collaboration". Not same user across different devices, that is easy. MeteorJS-style approaches (making local copies of part of dbs, reconcile to remote/origin) seems to be an interesting possibility at small scale, since once you have the right primitives in place you can go horizontally everywhere.

revolvingthrow 58 days ago |

A local Answer Machine is the dream, especially when the internet is decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - deepseek v4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which imo is 128 GB ram + 16 GB vram. 4 ram slots on a consumer board craters ram speed, 256 gb macs are too expensive, and even then the inference is ungodly slow.

On the other hand… v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we’ll get a similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you’ll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources like a local wikipedia copy from kiwix could do a lot for factual knowledge, too; there’s only so much you can encode in the model itself, but even a cheapish (pre-curent prices) 2TB drive can hold an immense amount of LLM-accessible data.

Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.

zozbot234 58 days ago | |

> and even then the inference is ungodly slow.

This is the wrong way of putting it. Local inference with SOTA models is all about slowing down compute for the sake of fitting on bespoke repurposed hardware. You don't need to go fast if you have the whole machine to yourself 24/7. Cloud AI vendors can't match that kind of economics.

wolvoleo 57 days ago |

> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.

> And for those tasks, local models can be truly excellent.

100% true and I use them for this. But the open-source models seem to be drying up unfortunately. There never was much incentive for the big players to train a model and give it away for free, it was mostly virtue signalling and advertising for their knowhow. The AI "race" seems to have entered a new phase that's more on clamping down costs and making money and this doesn't fit in well.

I hope good local models will still appear but the days that there was a new groundbreaking model for download every couple of weeks is over :'(

robot-wrangler 58 days ago |

Entrenched interests are going to do everything to stop local, but there's at least a few technical reasons to believe small and specialized models could be the norm eventually. If that does happen, local will follow.

TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretation matures far enough or b) our multi-agent systems all become multi-model.

For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.

For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool-use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts

scriptsmith 58 days ago |

I've got some demos of what the new Prompt API in Chrome that uses a local model can do: https://adsm.dev/posts/prompt-api/#what-could-you-build-with...

As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.

2ndorderthought 58 days ago | |

Yea I do not recommend treating chromes prompt API as a good example of local LLMs. It's fine and stuff but it's really weak. 8b models from a year ago are better in some ways. And a lot of the recent model drops are meaningfully better.

scriptsmith 58 days ago | | |

It's based on a Gemma 3n model, and yeah it's not the best. But if you have a use case that needs constrained JSON output for example, it's pretty neat.

Maybe it would do better with the new Gemma 4 models, which the Chrome devs have been hinting at moving to. And why the API doesn't let you introspect / pick the model, I'm still not sure.

robot-wrangler 58 days ago | |

> I've got some demos of what the new Prompt API can do: > Use surrounding context to rewrite your ad copy:

Yup, that's the plan. No local model, no webpage; more, better and cheaper adtech extortion/surveillance for vendors while everyone else pays for the juice and hardware degradation.

dakolli 58 days ago | |

So you're running an llm to do data transformation that deterministic processes would be much better suited for and running 1,000 watt power supply to do so. Wild.

sinansaka 57 days ago |

I'm betting my startup on it. The subsidised model subscription will start to dry out and providers will lean heavier into locking down how they want their models to be used (Anhropic has been paving the way already). The only way forward is open weight models. If you are working on any LLM powered product be careful betting on utilising user subscriptions.

0xbadcafebee 57 days ago | |

Maybe you know something I don't, but it seems the standard will continue to be a large number of companies hosting and reselling LLMs as both subscription plans and pay-as-you-go. It's virtually identical to the mobile market: the economics of the business require a large regular infusion of cash, and limits are used to prevent a minority of users from making the service unusable/unprofitable. A few giants are the most expensive but offer the most features, and cheap providers offer less for less. All of this will happen because people constantly want "more": more bandwidth, more quality, etc. Capitalism rewards this constant growth/advancement with constantly increasing bills.

Anthropic is going to go out of business by probably Q1 2027 due to not paying their bills. OpenAI will become a new Oracle, serving a luxury product for enterprises and governments. Google and Microsoft will keep doing what Google and Microsoft do. Chinese vendors will capture a significant amount of business over the next 10 years by running the models in non-Chinese DCs, with demand coming from their much lower prices. 95% of regular users will be paying for open model subscriptions, even if their local machine can run the model, because the providers will be offering features that are hard to impossible to replicate locally.

sanderjd 57 days ago | |

What is your startup?

jillesvangurp 58 days ago |

I get the sentiment for self hosting. But there are a few counter arguments:

- Self hosting is expensive. It involves expensive machines with GPUs that cost hundreds per month if you use cloud based ones. You might need multiple of those. And you need people to mind those machines and they are even more expensive per month.

- If you run stuff on your laptop, it consumes a lot of resources and energy. I have qwen running on my laptop. Even minimal usage turns my laptop in a radiator. Nice as a demo, but I can't have it this hot all the time. It would run out of battery, and it's probably not great for longevity of components in the laptop.

- Models are evolving quickly and the self hosted smaller ones aren't as good when it comes to things like tool usage, reasoning, etc. Being able to switch tot he latest model is valuable.

- It's easier to get your use case working with one of the top models than with one of the smaller self hosted ones.

- If you get the wrong hardware, it might not be able to run the latest models very soon.

- Self hosting models is mostly a cost optimization. It only becomes relevant if you hit a certain scale.

- You have alternatives in the form of hosted models via a wide range of service providers. Some of those are EU based and offer all the things you'd be looking for if you are offering your services there. Including legal requirements.

- Reinventing what these companies do in house is technically challenging and possibly more expensive than self hosting models because now you need a lot of engineering capacity dedicated to that. And legal. And all the rest.

If, like most companies/people, you are at the experimenting stage, the cheapest and fastest is just getting an API key from an API provider of your choice. You can take it from there if your experiment actually works. And then it's mostly about optimizing cost. If your API usage goes to the thousands per month or worse, it becomes a cost/quality trade off.

timeattack 58 days ago |

My problem with LLMs (apart from philosophical aspects and economical impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs -- sure, but something really useful -- no). Apart from that it requires immense computing power, it also requires a dataset which is for the most part is obtained illegally.

duchenne 58 days ago |

Cloud models can use batch processing which is significantly more efficient. A local model has basically a batch of one which takes as much time to process as a batch of 100 because the gpu is memory bound and spend most of its time loading the model from vram to the gpu cache while the gpu cores are idle. With a batch of 100 the model loading time and compute time are roughly similar. So local Models have a first 100x lower efficiency. Secondly, local models are idle most of the time waiting for the user to write a prompt, so the efficiency gap is probably more around 1000x.

r0b05 58 days ago | |

It's an interesting point but local gpu efficiency is not something I think about when I'm being rate limited or when my subscription costs keep rising.

fleventynine 57 days ago | | |

I think folks in this thread are underestimating how expensive it is to serve a SoTA model at 100 tokens a second. In addition to the $500k in capital costs, you also have significant electricity costs.

This stuff is expensive because supply is much lower than demand. If everyone was to run their own hardware with a batch size of 1, we'd have 100x more demand for inference hardware and electricity than we do now, and people would be even more frustrated. Efficiency is everything, and we need all the economies of scale we can get to meet demand.

DrScientist 58 days ago | |

And what if your local computer essentially has an model chip with dedicated memory where the model stays loading 100% of the time?

vb-8448 58 days ago |

> Use cloud models only when they’re genuinely necessary.

The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time fixing the knobs with the local one.

I just realized this with coding agents, yeah, you probably shouldn't always use latest version at xhigh, but you will end doing it because you do the job in less time, with less "effort" and basically at the same price.

I guess we'll see a real effort for local AI only when major vendors will start billing based on actual token usage.

holtkam2 58 days ago |

I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.

leoc 57 days ago |

(I am not an expert on anything.) One happy circumstance here is that while the RAM cartel is chasing Big AI's money today, in the medium term its self-interest probably makes it a supporter of local AI. A new, compelling reason to have 128GiB, 256GiB or more of VRAM on all your devices? You can be sure that the dollar signs are glowing in their eyes already. The less efficient use of VRAM by personal devies (any given device's VRAM will be mostly idle much of the time) tends to make it more attractive, all else being equal (though of course it isn't) compared to the centralised systems run by engineers and accountants striving all day to maximise ROI; and in any case, since the short-run supply constraints on RAM go away in the longer term, the RAM manufacturers will be able to supply both. My guess is that you can probably also also explain Apple's AI strategy (sit tight and wait for Moore's Law to make local AI more viable) and maybe even nVidia's (lay the groundwork for a gradual switch from selling shovels to the army to selling shovels at Home Depot over time, at least as a Plan B) in similar terms.

dTal 57 days ago | |

Just because we'll have to pay for the hardware, doesn't mean we'll have meaningful control. Look at what happened with phones - weak and limited slaves to the mothership, secured against pesky users with powerful encryption, yet costing more than a vastly superior laptop; quasi-mandatory platforms for highly addictive experiences, centered around the flow of information.

And now with LLMs we can create even more fabulously addictive experiences, even more finely tuned information flows, even more treacherous servants. I very much doubt that we'll be allowed full control of it all. Every effort will be spent to centralize power, and every effort will be spent to extract as much cash as possible from us for the privilege.

array_key_first 57 days ago | | |

Phones are such a travesty because they're so incredibly overpowered. I think there's a lot of people out there where their iPhone has more compute than their laptop or desktop, but it can't do 1/10th the amount of stuff. What a waste!

fsflover 57 days ago | | |

> Look at what happened with phones - weak and limited slaves to the mothership, secured against pesky users with powerful encryption

Not all phones are like this. GNU/Linux phones obeying users exist too.

jjordan 58 days ago |

It feels like we're one technological breakthrough away from all of these data centers going up to be deemed irrelevant.

Lalabadie 58 days ago | |

The cynical take is getting more and more to be the only rational one:

The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.

_heimdall 58 days ago | | |

It seems pretty clearly inline with the dotcom bubble to me. Every company claims to be a leading AI company, those building infrastructure are promising the moon and getting 1/3 of the way there, and no one knows how to monetize it justify the hype or expense.

jjordan 58 days ago | | |

oof, this bubble popping is gonna be brutal.

i_love_retros 58 days ago | |

What would that breakthrough be?

Waterluvian 58 days ago | | |

Magic math and computer science that allows us to get the same quality response for a fraction of the GPU.

_heimdall 58 days ago | | |

I'd assume its a totally different architecture that isn't based on storing a compressed dataset of all digital human text.

krupan 58 days ago | |

It took us only, what 70-ish years of computer and AI research to get to this point, so yeah, probably just one little thing and then we'll have it </sarcasm>

Seriously. I have never ever seen so many people so willingly drink the marketing kool-aid from companies selling their product before. It's scarier to me than any threats of AI actually disrupting society (because it is so far from being capable of doing that).

Animats 58 days ago |

Question: for software development, how much of an AI do you need for local development? Can it be run locally? Can someone train something that knows a lot about software but lacks comprehensive coverage of history, politics, and popular culture?

mrkeen 58 days ago | |

This is a good snapshot of things:

https://news.ycombinator.com/item?id=48050751

A specialist handrolls a cut-down framework to power a 1 or 2 bit quantised version of a cut-down sort-of-frontier model.

It can be yours if you have 128GB or 256GB of RAM.

dd8601fn 58 days ago | |

The ones that are good for more than elaborate auto-complete are pretty hefty, but it can be done. They’re still not Opus behind claude code.

hyfgfh 58 days ago |

Local LLMs is the only thing viable and probably the only thing it will remain once the hype dies down.

A smaller cheaper local model can delivery most the value for coding, while we still use some services for code review and security compliance.

Once the VC money runs out and they start to charge the real price, the C-level will have to impose budges or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted

manyatoms 58 days ago |

It just depends how quickly models become "good enough" that we don't care about SOTA

julianlam 58 days ago | |

Arguably, some of the things HN readers ask for can be capably completed by a local open weight model for free.

mattlondon 58 days ago |

Yet there is another post a few rows down where people are losing their shit that Chrome has a local LLM model that uses a couple of GB of space for local-inference.

Damned if they do, damned if they don't.

dlcarrier 58 days ago | |

Maybe don't use gigabytes of bandwidth and storage space, without asking.

hparadiz 58 days ago | | |

Easy. Stop using Chrome.

themafia 58 days ago | |

If it was such a good and laudable idea why didn't they tell me about it before they activated it? It seems to me like they avoided it in the hopes that I wouldn't notice, because, presumably if I had, I would have IMMEDIATELY disabled it.

Also why doesn't their task manager show that it's actually the one downloading? Why does it go out of it's way to hide this activity?

Since I have conky on my desktop I could catch this immediately, and take the action I preferred with my own computer, which was to _immediately_ disable it.

StilesCrisis 58 days ago | | |

I'm guessing you immediately close the What's New Chrome tab when you update?

https://developer.chrome.com/blog/new-in-chrome-148#prompt-a...

https://www.google.com/chrome/ai-innovations/

They have absolutely not been shy about any of this.

userbinator 58 days ago | |

If I want a model I'll go download one. (And I did, not long ago, to play around with image generation.)

bytecauldron 58 days ago | |

This is a bit disingenuous. People aren't losing their shit about a local model being installed. It's the lack of user autonomy. Just give the option to download a model instead of a silent install. It's not that hard. This is how every other local option works.

wmf 58 days ago | | |

AFAIK Apple and MS auto-download local models.

aabhay 58 days ago | |

This is a weird take. If its not opt in or you’re shoe horning it into a browser, then that sucks. Nobody is getting enraged that an app for running local LLMs downloads data to do so.

avadodin 58 days ago | | |

Although you can opt out and even disable the download feature when you build them in some cases, most of the local LLM tools are too download–happy by default.

fg137 58 days ago | |

You might want to read the comments to understand what people are actually complaining about.

This comment is quite dishonest about the nature of the discussion.

ekjhgkejhgk 58 days ago | |

You don't understand the difference between "I run a local LLM because I chose to" vs "The browser chose to run a local LLM and I have no say"? You don't understand?

Not to mention that the LLM that I choose to run requires a monster machine and is infinitely more capable than whatever google chose to put on their browser?

I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.

StilesCrisis 58 days ago | | |

Did you opt into WebGPU? QUIC? Canvas 2D? Brotli? Browsers don't work that way.

tedzhu 57 days ago | |

Typical HN arguing they need a button to opt in. In reality 99% people don't care, if it works they're fine with that.

ninjahawk1 58 days ago |

In my opinion, this is similar to the earlier internet and computers. Few households or individuals had access to state of the art computers, it was primarily research or more well-off individuals. Most random people didn’t really know what it was and certainly didn’t use one.

Now today, AI is very expensive and not readily accessible to most people without paying a good amount.

The early internet became now you can just get a free phone from phone companies so long as you get their extras. Then you get a ton of subscriptions and ad-ons, but you don’t have to spend money, could just use youtube with ads etc.

Local AI would similarly shift this dynamic to paying for access to plug-in’s and tools for your local AI to be able to use. Like how the subscription model works right now.

With local model advancements, such as specifically Qwen 3.6 35B A3B, this future is becoming more likely by the year IMO.

almogodel 58 days ago |

Remember nodes and graphs? A comfy user interface allows pretty incredible wiring among models local ai is like eurorack. The current graph skews heavily towards a a pair of small dense models collaborating with the large heavyweights selectively. It’s Qwen 3.6 27B with Gemma 4 31B, both unquantized, bf16/fp16, with phi 14b, nemotron cascade 2, and then those large heavyweights, r1 and subsequent deepseek models including speciale, gpt oss 120b, glm, min max,kimi, command r, mistrals, ever body, up in one graph, all them llm nodes patched and interconnected. Slow, resource intense, better than non local ai. I used Matteo’s graphllm for inspiration, and comfy ui (and st), and used the models to roll a new imgui node/graph model compositor. Now what?!

gpugreg 58 days ago | |

> Slow, resource intense, better than non local ai

Why should connecting small models to big models result in higher output quality than just running the big models without the small models?

CamperBob2 57 days ago | | |

A hardware analogy: an amplifier might have an open-loop gain of a hundred million or more, but if you actually try to use it without some negative feedback, it will only give you one of two possible output levels. And/or a whole lot of noise.

tomelders 58 days ago |

I do think local models are the future, but there's still the question of cost to be answered. Even if there's some slew of effincency improvements that mean an LLM can run locally on consumer level hardware on an affordable budget (and that's a big "if"), there's still the cost of training the modles to consider.

Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. A summariser model, a python coding model, or however fine grained/macro you want to go), the people training those models will need to turn a profit.

So how much will that cost? And how often will consumers have to pay? Models have a very short self life. Say you have a dedicated python coding model - that needs re-training every time there's a significant update to the language itself, any popular packages, related technologies (e.g. servers, cloud infra etc). So how often will users need to "upgrade" to the lastest version? It's going to be "frequently".

And it still needs the language stuff on top of that. Users aren't going to interact with a python coding model by writing python. They're going to use natural language. So the model needs all that stuff. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function". It needs to know about bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?

And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significat proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", then people will just stop using it. The utility of these things will be quickly overshadowed by the frustrations.

The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.

rufasterisco 58 days ago | |

> the question of cost to be answered.

Commoditizing complements. If Anthropic/OpenAI/etc is eating your lunch, make it work with cheap local LLMs , you can beat them on price by having local inference you don't pay (nor need data centers for), and try to keep your (user/data) moat.

The more Anth/OAI disrupt, the more likely this is to happen. If they don't disrupt enough (.ie: grow as an ecosystem to defend against incentives to commoditize), then yes, those incentives are removed, but they also leave money on the table, which they need.

Not only at business level, but also geopolitical (to a lesser extent? or not since lots of open weight models comes form China?).

tomelders 58 days ago | | |

What are you talking about Willis?

mgrund 58 days ago |

I really really want to like local AI, but I highly doubt it will see wide adoption for a long time.

The additional up-front cost for hardware designed to run an LLM in addition to normal workload is unlikely to be accepted by most consumers.

The scale will be very constrained (like Apples on-device models which are small, heavily quantized, and have a small 4K token context window). It’s also terrible for battery life.

AI as it is implemented today is simply just computationally expensive and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don’t really see it getting large scale adoption.

Companies will probably need a server-backed solution as fallback if they want reasonable user experience, so why even invest in diverse hardware support.

mitchsayre 57 days ago |

I now fully believe that the models will soon be compact enough to work even on older mobile devices. I work on lightweight text-to-speech models. After training on distilled datasets the models sound basically the same as any closed-source speech API model, they just need a ton of data to train on. Other researchers are seeing similar gains with other types of models and its only a matter of time before one drops thats commercially viable. Once this happens, the innovative apps and games will begin shipping AI as a feature that drives the user experience forward, rather than the thing you price the entire product around.

Tepix 58 days ago |

I'm pretty sure that AI assistants will become widespread.

I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies. The temptation to earn money with your data is just too great, plus the risk of the data being stolen and sold illegally.

Local AI should be the default. For everone who can't do local AI, we need confidential compute. Yes, it has been hacked before. But it's making it a lot harder.

pjerem 58 days ago | |

> I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies.

Still, we all do it with Google. (I don't do it anymore but i did it for mostly two decades so I include myself)

jesterson 56 days ago | | |

> Still, we all do it with Google

We don't. And never did.

dana321 58 days ago |

"NO AI" needs to be the norm, we should be working on better ways of sharing information and better documentation instead of fighting with computers for substandard results.

diwank 58 days ago |

in order for us to get there, i think we need a standardized api at the os layer for local models so that the os could optimize, batch and safely allocate resources. something like an analog of chrome's local model "prompt" api but provided and managed by the os itself. the user can choose which model they want to primarily use and so on but all of the heavy lifting and continuous batching is done automatically by the os

wilg 58 days ago |

Two issues -

1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.

2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.

manc_lad 58 days ago | |

I think using the best model for every tasks makes sense when these models are subsidised. when the prices go up (assuming they do) this could trigger a more varied approach. assuming the model doesn't self select for you.

nate 58 days ago |

I've been fooling with the Apple Foundation model for AlliHat, so you can chat with it from a Safari sidebar instead of just Claude. It's passible for some basic things like summarizing a page. But it really reminds me of Claude from like 3 years ago. I was trying to get it to generate synonyms for me and it would only generate about 10 with some duplicates. And when I asked for more, it said it would be a waste of resources to generate more. It has some kind of "act responsible" thing that Claude seemed to have. I also asked it to help me come up with synonyms for the game Pimantle, and it decided Pimantle was related to the adult industry and no matter how many times I said "it's just a game" or "I think you've misunderstood", it was stuck on not helping me with anything related to adult websites. And recommended I should play Wordle instead.

All of this being said, it seems Claude gave up this "constitution" it used to train on? I remember trying to get it to help me code some video editing tools, and it was convinced I was pirating videos and so wouldn't help me anymore in that session.

QuadrupleA 58 days ago |

Not sure how excited I feel about visiting your website and having it auto-download a 8GB model with GPT-3.5 level hallucinations, and then probably crash because I only have 6GB of VRAM. My dad won't be able to use it, or anyone else without a bleeding edge device. On a powerful enough "neural engine" device the battery will be drained quickly, while the heatsink burns a hole in my lap.

dgb23 58 days ago | |

Local could also mean self hosted.

The obvious optimization for the case presented would be to generate all the summaries on a server instead of in the client. Then the totally used compute would scale with the number of articles instead of number of users.

AuditMind 57 days ago |

It's almost here. Look at the new Qwen 3.6 models. Solid stuff there.

It runs by now on 8GB Vram, so a Legion 5 for about 1500$ could be a good workhorse.

hackyhacky 58 days ago |

I would like a standardized API for local AI to exist outside of the Apple ecosystem. The Prompt API is Chrome is halfway there.

* What is the answer to local AI for native apps on Windows?

* What is the answer to local AI for Linux?

This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.

franze 58 days ago | |

i researched that question for apfel https://github.com/Arthur-Ficial/apfel and standardized API is openai api so thats what i went with

hackyhacky 58 days ago | | |

OpenAI's API is not local AI.

teravor 58 days ago | |

> What is the answer to local AI for Linux?

run an ai api endpoint on a unix domain socket

tristor 57 days ago |

The biggest challenge I have with local models right now (and I use them extensively) is search integration and tool calling. The thing that Claude and ChatGPT get right for most general purpose use cases which is hard to do with a local model is the model deciding when to search vs use its built-in training, and having strong search tooling, as well as tool calling for additional data sources via MCP. If you can incorporate the right data into the context window, local models are more than good enough for general purpose usage as they stand today. Qwen 3.5, Gemma 4, even gpt-oss-120b are solid at reasonable quants if they have the right data.

The moment we see standardized and batteries-included pathways to integrate search, ideally at no additional cost, in things like LM Studio combined with better tool calling in the local models, you'll quickly see local model performance catch up.

everlier 58 days ago |

There was never a better time to run LLMs locally. It's just a few commands from zero till a fully working LLM homelab.

``` harbor pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL

# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox harbor up searxng webui llamacpp openterminal ```

That's it, it's already better than Claude's or ChatGPT's app.

shmerl 58 days ago |

Depending on some remote AI provider is a major lock-in pitfall. But it's exactly what those AI providers want you to do.

gregjw 58 days ago |

Is there a place to learn more about Local AI specifically and maybe even more specifically about models for bespoke purposes or curating them yourself for more specific uses? Feels like theres a lot of fat you can trim off because you don't need generic use, but I don't understand where to even begin there.

StevenWaterman 58 days ago | |

/r/localllama is one of the most useful places

harrouet 57 days ago |

Running LLMs locally is one way to realize the level of hardware and infrastructure that frontier AI companies are running. Makes me wonder about future strategies.

As one commenter mentioned, 2x Mac Studio M3 Max with 512GB can run frontier models and it costs $30k (with RDMA). Apply an efficiency ratio for being in a datacenter, and you understand why OpenAI and the likes spend north of $10k _per customer_ of CAPEX.

Add to that the electricity costs and you've got a very shaky business model. I for one would like to thank the VC for subsidizing my tokens.

With that said, the VCs are not crazy and probably factored in an annual cost decrease of computing power. But how do you make sure that we won't run local LLMs when the HW becomes affordable -- if ever ?

The answer has always been the same in our industry: vendor lock-in. They are getting the users now at a loss, hoping for future captive revenues.

So, be careful when your code maintenance requires the full context that yielded that code, and that this context is in [Claude Code|Codex|Cursor].

testfrequency 58 days ago |

Local AI is definitely going to be the future as these models continue to advance at the rapid pace they already are.

This is why I believe OAI and Anthropic I’ve been so aggressive at offering services outside of their pure models like Claude Design. This is what will be competitive and keeping people subscribed.

ksec 58 days ago |

While I agree that would be the goal, we are too early for that. Just like how speech recognition used to require many server in a Datacenter to process and you send your data over. It is now completely on devices.

We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.

hackermanai 58 days ago |

> “But Local Models Aren’t As Smart”

This is what makes me continuously doubt and rewrite the local-first approach to inline chat in my editor. Next edit/ code complete makes more sense due to latency advantage. But chat is hard.

It's fast and feels good to run locally, but output quality is just not ChatGPT etal.

deivid 58 days ago |

Sounds great, but if you din't cave to apple/google (eg: graphene, lineage), models are not built-in. Every app needs to ship their own models, and they are not tiny.

Is there a solution for this? I'm currently just making users download onnx models if they want a feature, but it's not smooth UX

refulgentis 58 days ago |

The shitty thing here is, either everyone's shipping 800 MB at least with their binary, or, you have to rely on the platform vendor anyway. I'm hoping there's enough external pressure that the OS vendors turn it more into a repository than a blessed-model-garden.

wrxd 58 days ago | |

To be fair the author of the post is using the model Apple provides with the OS so it doesn't have any extra binary size

Galanwe 58 days ago |

I would love for local inference to be possible, but from my experience, Kimi 2.6 is the only model that would be worth it, and its a $10k (M3 Ultra max spec'd - 30s TTFT so kind of slowish) to $30k (RTX6000/700GB+ DDR5) upfront, noise / power consumption aside.

mft_ 58 days ago | |

You're maybe missing the article's point, which is to use local models appropriately:

> “But Local Models Aren’t As Smart”

> Correct.

> But also so what?

> And for those tasks, local models can be truly excellent.

Galanwe 58 days ago | | |

This is a bit naive IMHO...

I have tried quite a bunch of local models, and the reality is that it's not just a matter of of "it's a small model that should be hostable easily". Its also a matter of whats your acceptable prefill TTFT and decode t/s.

All the local models I used, on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (no use as coding agent decently possible), and even for things like classification, context size is immediatly an issue.

I say that with 6m experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing ans tagging HN articles published on the front page barely make the queue sustainable and not growing continuously.

mikrl 58 days ago | | |

One of my hobbyist workflows involved transcribing ETF prospecti into yaml for an optimizer to optimize over.

Used to take me maybe 10-20 minutes per sheet.

Then I got codex to whip up a script that sends each sheet to a fairly low parameter locally running LLM and I have the yaml in a couple seconds.

My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…

manlymuppet 58 days ago |

People are trying to “make the best software”, though.

I think the Quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users, rather than developer’s laziness.

You can do more and better with private AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.

Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.

HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.

jdub 58 days ago | |

> Most people won’t settle for a product if it’s not the very best and incredibly convenient.

... uh?

manlymuppet 58 days ago | | |

That is, excluding Microsoft users.

deweywsu 57 days ago |

How is having local AI going to produce a result that's any better than using OpenAI or Anthropic? Isn't what we really need programmers who rely on themselves more than AI so they avoid technical debt accumulation?

jononor 57 days ago | |

Having local AI as a credible threat will keep them on their toes. Which will benefit consumers a lot.

imnes 58 days ago |

I'm going through a similar exercise right now in an app I'm building. No server dependencies, for features that have traditionally used server side APIs, moving those capabilities onto the device. And also utilizing the on-board AI features provided by Android and iOS. So far it's been a very positive experience, and the capabilities provided on these devices have been more than capable for my needs. Working on providing apps that don't have ongoing operation costs of running server side infrastructure, so I can offer them as "pay once, run it forever" instead of ongoing subscription costs for the user.

h05sz487b 58 days ago |

I really want this to be true. For me getting all models to run to the best of my hardwares ability and the cli tool to also make best use of the model is still a headache. I had coding models not being able to do a search and replace depending on the tool through which they were called, visible <thinking> elements in my message flow, agents doing a task, failing at the linter, then reverting everything again so the linter is happy and presenting the result as a "good compromise".

Right now it feels like we have all the pieces but nobody integrating all that into an amazing experience.

acidhousemcnab 57 days ago |

We need better GUI and OS integrations with sandboxed local LLMs, before this is thrust on everyone and rolled out as the default in commercial OSes. Here in Berlin, I was functionally surrounded and hounded out of a local meetup, due to confrontation over the naive pushing of OS-level and network access agentic AI, done in the mode of mystical powers and artistic possibilities, which due to recent experiences, comes off as string-pulling, to produce a threat or danger that then must be observed and kept tabs on, according to Goodhart's Law.

continueops_com 58 days ago |

Opus 1M context window and lighting fast response time is hard to compete with, even if you run a local A100 the local models are just not as good as tool calling, long running tasks and non-hallucinations

twoodfin 58 days ago | |

It was hard for an Apple ][ to compete with an IBM mainframe at enterprise data processing, but the power of personal ownership & commodity economics was disruptive enough that 30 years later 99%+ of enterprise data processing was taking place on descendants of the original personal computers.

antidamage 58 days ago |

The roadblock to this is you seem to have to build it yourself. I've noted that none of the current cloud models are very good at building a replacement for themselves, and there's significant work that needs to be done to make a local LLM reliable in any way. I haven't found a single standalone package that makes setting them up easy. Sure, I can run Hermes Agent and a model, but getting the self-reflection loop in and all of the other stuff the need to actually be good? I'm still at it, trying to get anything to work reliably and factually.

DonsDiscountGas 58 days ago | |

Could be an opportunity for a business? Except nobody ever wants to pay for software

nezhar 58 days ago |

For me, building with open weights models sounds like the right approach — you are able to switch providers, and you can control where the server is running.

You don't have any guarantees in terms of data, that's true, you rely on the provider. But this is similar to a database or other services where you don't have the knowledge or resources to run them yourself. Hardware cost is an additional factor here.

If on the other hand your idea works out and the model fits the use case, you can always decide to move to a dedicated infrastructure later.

alfiedotwtf 58 days ago |

This would be nice, but unfortunately the norm at the moment is - release a rushed model that doesn’t work with llama.cpp, but if it does, make sure that the chat template is broken. And even if it did have a perfect chat template, let the model loop endlessly rewriting the same file with same content for hours on end.

It would be nice if model makers could at minimum embrace test harnesses, and stretch goal if they’re going to change underlying formats then at least land compatible readers in the big engines (e.g. llama.cpp and vllm)

8cvor6j844qw_d6 57 days ago |

Any recommendations to run a local model on a Raspberry Pi 5 16 GB?

try-working 58 days ago |

I'm building a protocol and router runtime for hybrid local/cloud AI.

The goal is that you would assign roles to models based on tasks, capabilities and observed performance. The router would then take care of model selection in the background.

It's tricky though. Probably have another two weeks before I can release the runtime.

I have a preview up at https://role-model.dev/

You can follow me on Twitter if you want updates (see profile)

krupan 58 days ago |

Here I was hoping that this was some plea for us to get away from proprietary solutions that we have no control over and go back to open source, but no, not that at all.

vivzkestrel 58 days ago |

- can we get suggestions from people on what would the equivalent for android

- and for the web / javascript / svelte applications?

- suggestions for local OCR for bulk images?

kajman 58 days ago | |

I hope there's no web equivalent for a while. I usually hate app lock-in, but any hasty API for this is going to be a DoS or fingerprinting nightmare.

holoduke 58 days ago |

We need computers with 128gb or maybe even 192gb of memory before local use make sense. From my own experience 32b LLMs are the absolute minimum for proper tool use and decent output quality. But for local ai you want also vision models and maybe even various LLMs. Plus some memory for the system of course. On my 36gb M3 the 24b Gemma model is nice. But the entire system gets allocated for that thing.

teiferer 58 days ago |

Every reply here forgets/overlooks the main reason for why this is not going to happen: The astronomical AI data center investments currently underway. Those place are not just for training. They are for inference too and the way all those investments are expected to eventually pay off. The whole AI sector of our industry depends on running models in these places.

zozbot234 58 days ago | |

These astronomical AI data centers will be used for high-value inference with smarter models that really are too large for running locally. The investments will be fine once they pivot to that use. Currently available open models are not in that range.

teiferer 57 days ago | | |

I don't buy that that will be a useful distinction.

First of all, no AI model will say "I'm too smart for this question, I suggest you use a cheaper one so I don't make unnecessary money for my owner" or "I'm too dumb, so instead of hallucinating I'll suggest you go to the cloud and ask my smarter sibling".

Second, there is no incentive in the market for tooling to evolve that way. There will be the illusion that some models will do that, similar to today (or maybe some harnesses rather) but nobody will willinglylet money sit on the table. These data centers are not being built to solve world hunger. They are built to ultimately hook you on more realistic fake bs youtube videos so you feel good while getting even more ads injected into your life.

unnouinceput 57 days ago |

Quote 1: "We need to return to a habit of building software where our local devices do the work."

Quote 2: "I can only speak on the tooling available within the Apple ecosystem since that’s what I focused initial development efforts on."

Oh, the irony. I will use your tooling when is available on Android with F-droid, that's when, at least, be decoupled from big companies grip.

andychiare 57 days ago |

> “AI everywhere” is not the goal. Useful software is the goal.

Great observation! Often the excitement of novelty makes us lose sight of the real goal

JamesSwift 57 days ago |

I think moving straight to local models is missing the required next step of open/self-hostable models which is certain to be the "AI future" end-state. Then local models become an optimization on top of that.

I just dont want us to put all this effort in to on-device computation when we need to get to "SOTA-equivalent" self-hosted computation faster.

butz 57 days ago |

Really silly, when you buy "AI PC" with "AI CPU" and still run any "GenAI" related stuff in the cloud.

mercurialsolo 58 days ago |

Not your weights not your brain. Owning your own action and decision model is super important as these models emulate more of our decisions, thinking and learning. Built claudectl - a local brain for coding agents https://github.com/mercurialsolo/claudectl

eldenring 58 days ago |

This article makes 0 sense. Its not up to billing or computer systems or ease of use or anything else that matters. The question is will the scaling laws, which in the asymptote are likely the laws of physics, hold up in converting energy to smarter models. Its not really up to anyone, the labs or developers, to choose if local or remote models will be the norm.

bluGill 57 days ago |

WRONG, this completely ignores the most important issue and so is completely wrong.

The important issue is where is the data stored. And there are far to many advantages to having your data in the cloud: you can access it from whatever device you happen to have, and it isn't lost if you lose the device. This also outsources your backups to the cloud which is probably doing a much better job than you would (maybe no on hacker news, but nearly everyone else) - the cloud has earned a bad reputation for backups, but it is still much better than most people would be.

Once you accept the data is going to be elsewhere it doesn't matter if the compute is elsewhere or not. The data is the important part.

What needs to be the norm is more self-hosting your own data. Companies should not be outsourcing this by default - even where you outsource some of it, you need to watch your contracts and ensure the ownership is yours - not shared. Once your data is yours on your own cloud accessible servers we can start asking can we run our AI models in the same data center as we already have our data in. I don't need my AI model to run on my phone, it can run on the server in my basement which has a lot more power available (my phone has a better GPU but I can't afford the battery power to run AI on my phone)

mohamedkoubaa 57 days ago | |

I want a way to backup my data fully encrypted somewhere and have custody of the keys - but importantly, the data should all be decrypted locally where all my apps can use the data without any network

chasd00 57 days ago | | |

tar -czf - /path/to/folder | gpg -c -o folder.tar.gz.gpg then scp/POST that somewhere /s..kinda

pcthrowaway 57 days ago | |

> What needs to be the norm is more self-hosting your own data

I assumed self-hosted AI would fall under local AI for the purposes of this article. Does the author really need to spell it out?

butterNaN 56 days ago |

The ability to build my own local AI is exactly what I want to learn. Are there any good resources to learn this?

z3t4 58 days ago |

We are experimenting with local LLM and opencode at work and the quality is not as good as Claude code et.al but it's not far off and local speed is actually faster. We got 3 of Nvidias latest AI GPU's which was not cheep. It's not good enough to train our own models, but we can run the biggest open models with some tweaking.

giancarlostoro 56 days ago |

We could have been there if the big AI companies didnt create a RAM crisis. I will be buying the next iteration of the Mac Studio, I have been doing local inference on my Macbook Pro and just small models, I cant imagine how much better things will be on the Mac Studio.

msteffen 58 days ago |

> One of the current trends in modern software is for developers to slap an API call to OpenAI or Anthropic for features within their app.

Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!

vegabook 58 days ago |

>> years ago I launched "The Brutalist Report"

proceeds to brutalise the reader with an 88-point headline font.

knlam 58 days ago |

you know what is the hard part about local ai? Supporting it cross platform. The OP get it easy by playing in Apple ecosystem but when you need to support local AI to both iOS/Android the approach is completely different. Even get the users to download the smallest models can be a challenge

FrasiertheLion 58 days ago |

Overall I'm bullish on standardized local APIs that ship with the browser or platform. Far more tractable than expecting end users to stand up their own local model instances, though r/LocalLLaMA is a fantastic community to follow if you want to go that route.

A useful framing over “local vs cloud AI” can be split along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn’t touch data), but open-source models running locally for ops: maintenance, debugging and monitoring (touches data). If you need to fall back to frontier intelligence at some point for a particularly hard to resolve problem, you can still rely on local models for pre-transforming and filtering input in a way that's privacy-preserving or satisfies some constraint before it’s sent off to the cloud for processing. OpenAI's privacy filter is a good example of a model that can be used to mask PII and secrets and that can run locally: https://openai.com/index/introducing-openai-privacy-filter/, before sending any data externally for processing.

Another framing for local vs frontier closed which the article mentions is whether the task saturates model capability. With certain tasks like PDF processing or voice or summarization, adding more intelligence isn't necessarily useful. Arguably we've approached that point for chat interfaces already with frontier open-source models. But for coding and ops through well structured tool use inside a coding capable harness, we're still a ways away.

Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.

reshef316 58 days ago |

not saying i disagree with the general statement, but there need to be options, not everyone has a machine capable of doing the same type of lifting required to properly run a local version. so what, if my machine is older i'll be locked out? restricted? forced to pay?

Slix 57 days ago |

Chrome did this, and there was a huge outcry. Even though local AI is much better for privacy.

daishi55 58 days ago |

> We are building applications that stop working the moment the server crashes or a credit card expires

Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)

simonkagedal 58 days ago | |

Furthermore, for the example given, it would have made a lot of sense to me to generate those article summaries on the backend. Once and for all, no need to burden each client device (which are going to need to download the content anyway), no need to tie yourself to a specific provider (Apple in this case), can have the same experience everywhere. Of course, the backend could use a local (to itself) model.

Not saying it’s _wrong_ either – maybe it doesn’t use a backend of its own (the client downloads content directly from some predefined set of sites), maybe there is functionality to adjust how the summaries work that benefit from doing it on device, etc. Just doesn’t convince me that ”local AI should be the norm”.

barrkel 58 days ago |

Local models are extraordinarily expensive if you're not maximizing throughput, and you're not going to be maximizing it.

Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?

Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.

Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.

ge96 57 days ago |

I'm looking into it since it I'm going to be sending personal info/thoughts would like to keep it local. I have a 4070 running the TheBloke 7B mistral via llama cpp. I still am not using llms daily though other than Google searches.

katzito 57 days ago |

Most people are lazy (which is (mostly) good) and don't care (which is (mostly) not good), as Gmail has proven since 2004 (according to Google AI).

Still waiting for those analog AI chips that were supposed to make it lightning fast using minimal energy...

nsvd2 57 days ago | |

Assuming you're talking about Taalas, they have a live demo for inference on their HC1 chip.

katzito 57 days ago | | |

Taalas HC1 is digital. Was thinking more along those lines: https://mythic.ai/

khoury 58 days ago |

Agree with the sentiment, but: "We are building applications that stop working the moment the server crashes or a credit card expires."

This has been the case for way longer than openAI and Anthropic has been around with services like AWS, Cloudflare, etc.

noashavit 57 days ago |

Relying on external APIs network failure points and unavoidable latency from the round trips. There is also the AI API rate limits that come into play. We might find that for critical workflows, local compute is the only reliable architecture.

october8140 58 days ago |

They will never let us have enough RAM every again. RAM will be kept behind locked doors in the name of national security and only trusted corporations will be aloud to run AIs and "safely" run them in the cloud and sell them to us.

j3th9n 58 days ago | |

I’ll make my own RAM, with the help of AI.

yuppiepuppie 58 days ago | |

Is this a conspiracy?

dgb23 58 days ago |

I‘m surpised at the presented dichotomy between JSON formatting and what the Apple SDK provides to parse output into structs.

Based on what I understand about how the former works, I would assume that the latter has the same properties and failure modes.

PeterStuer 57 days ago |

I use a 4090 and 96GB ram to run local models slowly (atm Qwen-code-next at 7 tps) with their full context window. I keep this up just for testing and practicing fallback should I lose access to Claude and GPT.

rduffyuk 58 days ago |

agree with the article but the limitation for local llm usefulness is the limited scope from my experiments. eventually context heavy data pipelines require larger models which consumer hardware can't deal with yet. the local model for summary on a page like you describe could be done via code as well, i've found using an llm isn't always the right choice. for example i use ner tagging in my md docs for better indexing and llm search capabilities. this is purely code based and not via an llm. tried with an llm and the results were a lot worse. augmenting tools to make the llm produce better outputs gives better results.

hydra-f 58 days ago |

Unless there's a breakthrough or a transition to diffusion models, it's hard to imagine them becoming an affordable commodity

Small models are still in their infancy, and there's still much to sort out about and around them, as well

RataNova 58 days ago |

I mostly agree, though I think local AI will need better UX around failure modes. Cloud models are often used not just because developers are lazy, but because they are more capable and easier to support consistently across devices.

artursapek 58 days ago |

I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and very hopeful I can one day just sell a permanent license to the software instead with customers using local LLMs to power it.

shailendra_sis 52 days ago |

Yes, local ai is the future. More important is democratizing the ai for the common masses.

kandros 57 days ago |

We need more tools like QMD that beautifully download and use local models under the hood

https://github.com/tobi/qmd

latentframe 57 days ago |

A lot of AI aspects probably don’t need to be permanent cloud services as local hardware improves part of the industry may change from renting intelligence to on-device computing.

Aleesha_hacker 58 days ago |

To what extent is this strategy currently feasible for windows of android development? I am interested in how portable local-first AI is across platforms, but it seems promising on Apple devices.

selectedambient 57 days ago |

Agree. We ought to be measuring the minimum viability of lesser parameter, local models for specific tasks. You don't need opus 4.7 or sonnet 4.6 to accomplish some of these basic, yet tedious tasks, i.e. the news aggregator you demonstrated. Thinking about things like, how many parameters does it take to manipulate a pdf in every way possible with accurate results? Likely, a reason there isn't a coordinated push toward people running local models is the fact that your data couldn't be mined, manipulated, and abused; obviously outside pure capability of some of the frontier models (which truthfully some of which aren't even very good). While I think we may see more things like Apple's models, like you mentioned being run locally, I think we all know at the end of the day they're phoning home in some way (which if that is fine for you, fine). Again though, and you touch on this in the article, highly specified tasks that have a certain amount of redundancy built in are very suitable for these local models right now, without relying on enormous weights and token usage.

I have been working on a VERY SMALL local-first ai lab myself. nothing crazy, a text editor, a claw, and some lightweight models I started playing with. Absolutely looking for contributions as well.

selectedambient 57 days ago | |

didn't want to lead with it but if interested: https://mithraeums.github.io/

worthless-trash 57 days ago |

How long till we have distributed AI, where we can have different people run/understand different parts of problems and pass off work to different nodes across the internet.

stuaxo 58 days ago |

Harnessed seem to be a big part of what makes stuff good or not.

I tried Cline and couldn't get it working well and part of this was that at the time it expected OpenAIs output format.

grig0r 56 days ago |

This doesn't make sense for consumer apps if it chugs a ton of RAM.

No student will want to use local AI apps if their Macbook Air's battery dies in 2 hours.

ramon156 58 days ago |

GLM 5.1 is very impressive, I wouldn't be surprised if we get to a point where it can live in ~48Gb and have a reliable speed/quality

RyanZhuuuu 58 days ago |

I’m skeptical that local AI will work well with today’s technology. Running capable models consumes too many resources on end-user devices.

imrozim 58 days ago |

I use Claude api for my startup and the billings and rate limiting hurt. But local models cant do what i needed yet. Wish they could.

anArbitraryOne 58 days ago |

Just let me turn it off to preserve battery life

rarisma 58 days ago |

I think with turbo quant forks eventually being merged, its becoming more feasible on mid tier consumer h/w

Dont quite think its ready yet.

prometheus1992 58 days ago |

Agreed, but the way ram prices are going, I don't think we would be able to afford hardware that can run any useful model.

cubefox 58 days ago |

Local AI is a bit like wind parks. Everyone is in favor, except if they are in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model: https://news.ycombinator.com/item?id=48019219

I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.

TechSquidTV 58 days ago |

Local AI will catch up. Unless we can't get our hands on hardware anymore, which is a legitimate concern I have.

throawayonthe 57 days ago |

it's not going to happen with LLMs unless ram + storage gets several orders of magnitude cheaper like, yesterday

informatics aren't magic, you'll never be able to compress """knowledge""" into a small model in a way equivalent to the 1.5 TB model

kilroy123 57 days ago | |

I agree. But I also think the future is some kind of hybrid approach where agents run locally, what they can, and then call out to the cloud for what they can't.

acidhousemcnab 57 days ago | |

This will happen, but reconfiguring the infrastructure of the entire planet to train LLMs and run them over networks might be the "bubble", the megalomania.

maxdo 58 days ago |

The start of the argument is already broken . Ok , slapping api is bad , so you push api that mimics to your provider, install some Chinese llm that will never obey any lawsuit in your country , install 500 packages to do so , every of them has a potential risk a security issue . How is that better ?

Oh yeah , it feels independent and not lazy , sure

1a527dd5 58 days ago |

Consumer/private needs to be local.

Work? I don't want it local at all. I want it all cloud agent.

eyk19 58 days ago |

Apple stock is going to skyrocket

baal80spam 58 days ago | |

Maybe. What about NVDA?

j45 58 days ago |

It’s easier to say 32 gb ram needs to be the norm to start getting movement on this

karmasimida 58 days ago |

How? Memory price is sky high, that is the choke hold the monopoly will not let go

tuananh 58 days ago |

local llm doesn't need to match SOTA performance in order to be useful.

unixhero 56 days ago |

How do we know that Qwen is not up to anything nefarious?

cl0ckt0wer 58 days ago |

If they do then hardware costs will explode even more

runfreeapps 57 days ago |

Any project that requires a local model should always be the way to go on first attempts and if the functionality is acceptable should stay with local models. Token burn is a serious problem and will ultimately lead developers to ask one question "Do I really need Opus xyz?" For most requirements of standard applications the answer is no. So using open-source llm models that are integrating in practical use-cases to create a value-add not for 'hey look I have AI in my app, sign up please.' Open source models are competing well and is the way to go for the majority of projects and mindsets do have to change and I see them changing this way rapidly. You don't have to host your open-source llm locally but host it with a 3rd party, it is cost-effective and the token burn is not a barrier.

hypfer 58 days ago |

Same as local compute.

Welcome back to 2014. Let us now continue yelling at the cloud.

agentifysh 58 days ago |

Until the hardware is economical and powerful enough, local AI that can compete with frontier models today is still far off.

If we could even get something like GPT 5.5 running locally that would be quite useful.

Salgat 58 days ago |

Local models are much less energy efficient right?

HDBaseT 58 days ago | |

It's a good question, although I think hard to quantify.

If you are simply measuring Watt Cost per Token, you are missing the mark drastically. You have to measure quality output per Watt.

It sounds reasonably difficult to benchmark this, maybe I'm wrong though.

osjxjsjxjs 57 days ago |

No AI needs to be the norm. Again.

sgt 58 days ago |

I guess Google got that memo!

krupan 58 days ago |

If you don't need a lot of smarts, do you even need an LLM? Aren't older machine learning techniques just as good, or like, you know, old-school algorithms?

a96 55 days ago | |

Yes. For essentially any problem where a complete solution exists that doesn't use an LLM, it will beat any solution that does in size, speed, energy use, reliability and everything else.

Naturally, it's actually complicated. But LLM is a considerable weight and risk. Maybe it's worth involving and maybe not.

williamtrask 58 days ago |

I wonder if a popularization moment for local AI will ultimately be the pin-prick that pops the AI bubble. Like the deepseek or openclaw moments but bigger/next.

gdulli 58 days ago | |

That's like wondering if enough people discovering local media streaming will disrupt commercial streaming services. It's not going to happen. Most people are not ambitious and will let themselves be controlled by the services of least resistance.

And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.

HDBaseT 58 days ago | | |

I like the analogy of streaming services vs local media streaming, although I don't think it holds up when looking at history.

Streaming Services are getting worse and more expensive. I don't see a single report suggesting piracy is decreasing, it seemingly is only increasing now.

When costs increase, quality decreases people look for alternatives. The advent of faster broadband enabled Napster and MP3 sharing. I think this could have a resurgence if the peices align correctly (a new bitorrent client, a new torrent site, something to break the status quo).

How this related to AI, I don't know, although I wouldn't be set on the idea that we will never have local AI as the norm. There is a lot more movement in this space then there is for local streaming imo.

williamtrask 58 days ago | | |

Yeah... probably right. I do hold out hope that this is mostly a timeframe thing. Like, the library, printing press, etc. all had their moments of centralization. But eventually they federated.

ChoGGi 58 days ago |

Who can afford local AI?

m463 58 days ago | |

Who can afford to backup their own photos?

who can afford a house?

jmyeet 58 days ago |

I've been looking into options for this and we are getting close. There are two main constraints: memory and memory bandwidth.

NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090) but it has excellent memory bandwidth (~1.8TB/s). If you want more than the you need to buy an RTX Pro (eg RTX 6000 Pro w/ 96GB for ~$10K) or you get into high high end solutions like H100, H200, etc that have significantly more memory and even higher bandwidth on HBM memory (eg 3.2TB/s+).

NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.

In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.

Many, myself include, expect there to be no refresh to the 5000 series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028 realistically.

One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).

So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.

You really have to use this hardware almost 24x7 for it to be economical because otherwise H100 computer hours are probably cheaper.

But what happens when the next generation of GPUs comes out to the trillions in AI DC investment? It's going to halve its value. That's over $1 trillion in capex that will disappear overnight, effectively.

I think Apple is the dark horse here because they have no interest in NVidia's psuedo-monopoly. I'm just waiting for them to realize it.

Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.

It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and werre $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.

So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this last so both companies are doomed because open source models are going to be sufficiently good.

We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.

Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.

DoctorOetker 58 days ago |

One advantage of local AI is continual learning.

When I say 'moat' I don't mean moat specific to a company vis-a-vis other companies, but 'moat' specific to the set of inference providers vis-a-vis self-hosted local inference.

The moat consists primarily of being able to batch inference requests.

If we pretend people weren't interested in long context-lengths, there would be a moat for inference providers. who can batch many requests so that streaming the model weights (regardless if from system RAM to GPU RAM; or from GPU RAM to GPU cache SRAM) can be amortized over multiple requests.

However people do want longer memory than the native context length.

One approach is continual learning (basically continue training by using the past conversation as extra corpus material; interspersed with training on continuations from the frozen model, so it doesn't drift or catastrophically forget knowledge / politeness / ...).

However this is very expensive for inference providers, since they would have to multiply model weight storage with the number of users U=N. For a single user the memory cost of continual learning is much less since they only need to support a single user, and are returned some of the memory cost through elimination of KV-caches, and returned higher quality answers compared to subquadratic approximations of quadratic attention.

An advantage of continual learning is that the conversation / code base / context is continuously rebaked into model weights, and so doesn't need KV caches! It doesn't need imperfect approximations to quadratic attention, it attends through working knowledge being updated.

Nothing prevents local LLM users from implementing this and benefiting from the dropped requirements of KV caches and enjoying true quadratic attention implicitly over the whole codebase, or many overlapping projects indeed.

The only remaining moat of inference providers vis-a-vis continual learning local LLM's is the batching advantage, plus the gradient update costs for continual learning minus the KV storage and compute costs, minus the performance loss due to inexact approximations to quadratic attention.

This points towards a stronger incentive for local hosting than currently realized (none of the popular local LLM tools currently support continual learning, once this genie is out of the bottle it will be a permanent decrease of the inference provider moat, the cost of which can't be expressed merely in hardware or energy costs, since it is difficult to quantify the financial loss of inexact approximations to quadratic attention, the financial loss due to limited effective context length and the concomitant loss in quality of the result)

QuadrupleA 58 days ago |

This is just emotional rhetoric. Pretty much any app in the last 20 years has depended on a server somewhere, or a cloud provider. Like an AI provider, they can go down, they can turn off if you don't pay your bill, etc.

And local inference requires fairly beefy hardware, that is FAR from ubiquitous across today's userbases. Local models are also still far dumber than what frontier labs can serve.

Weird that this is getting such a tidal wave of upvotes.

cryo32 57 days ago |

I think no AI needs to be the norm. Even if we have enough RAM to run it locally, the dependency stack we have on hardware, training and geopolitics is too much of a risk to take on. If something breaks, like supply chain, or the model is found to have particular bias or exploits baked in, we're fucked.

senko 57 days ago |

I love this line:

> Stop shipping distributed systems when you meant to ship a feature.

But not in the contex the author meant.

Many people don't realize that when you have a frontend, a backend (several instances, for failover/scaling), a (separate) database, maybe some object store -- you have a distributed system.

A recent article[0] touched on that, although most HN commenters[1] latched on the "go" part. But there's something to avoiding rube goldberg machines where we don't need them.

[0] https://blainsmith.com/articles/just-fucking-use-go/

[1] https://news.ycombinator.com/item?id=48062997

plexescor 57 days ago |

Yea i agree to this. Especially considering that now even igpus can get respectable scores, like my iris xe 80eu 16gb ram @ 2133mhz gets like 6-8 Tokens per second in gemma-4-E4B model