Apple Silicon costs more than OpenRouter (williamangel.net)
* but Apple will collect all your keystrokes anyway
Oh, and the author didn't mention at all anything related to inference optimization, so no idea if they even know about or enabled things like speculative decoding, optimized attention backends, quantization, etc.
At least AI slop would have hit on far more of the things I listed above. This is worse-than-AI.
Chances are that token prices will go down, but chances also are that the AI bubble pops and all of a sudden all these companies will either have to make a buck out of the inference or go bankrupt.
Getting your own hardware just grants you stable pricing.
It's almost guaranteed imo that as model quality evens out, inference cost will drop towards 0 at insane speeds, given how well llama works as an ASIC.
Also nobody I know picks local over OpenRouter on price. They pick it for offline, for data not leaving the machine, for no rate limits, for not having a provider go down mid-task. If $/Mtok is the only axis, sure, cloud wins.
In practice the pattern I see is leaving a small model running on easy background tasks while using the laptop normally, not a dedicated inference box hammered flat out for 5 years.
I wish people stopped deluding themselves — I regularly try (and benchmark for my purposes) local models and they are NOWHERE near the huge models like Sonnet or Opus. Nowhere. Yes, you can sometimes get plausibly-looking output for simple tasks, but for anything even remotely requiring thinking there is simply no comparison.
Local models are useful. I use them for spam filtering, and soon intend to use them for image tagging and OCR. But let's stop saying they can get us "anthropic sonnet levels of performance", because that's just not true.
Now, it looks like the providers I use have good limits. But I do worry about this.
Obviously, if the RAM apocalypse passes, high-end configurations will preserve resale value worse than base models, but it's still a hefty bonus of Apple hardware that might change the math a lot.
tldr;
Hardware depreciation costs are the major factor.
But if we assume ZERO hardware depreciation (not realistic), then local inference becomes super cheap: roughly 90%+ cheaper.
Third case: the break-even happens only if we can get, at the very least, 8.7 years of useful hardware life. A more realistic number, however, when working 8 hrs/day and not 24 hrs/day, is around 25 years.
So, for now, local inference is preferable if you deeply care about privacy. From cost perspective, it's still not there.
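A rough sketch of how the 24 hrs/day break-even stretches at a lower duty cycle. It assumes the break-even horizon scales linearly with hours of use per day, which is my simplification and not something the article states; the function name is hypothetical. Linear scaling of 8.7 years lands near 26 years, in line with the "around 25 years" figure above.

```python
def break_even_years(years_at_full_utilization: float, hours_per_day: float) -> float:
    """Scale a 24 h/day break-even horizon to a lower daily duty cycle.

    Assumes break-even time is inversely proportional to hours used per day.
    """
    if not 0 < hours_per_day <= 24:
        raise ValueError("hours_per_day must be in (0, 24]")
    return years_at_full_utilization * 24.0 / hours_per_day


if __name__ == "__main__":
    full_time = 8.7  # years, the 24 h/day figure quoted in the comment
    for hours in (24, 8):
        years = break_even_years(full_time, hours)
        print(f"{hours:>2} h/day -> {years:.1f} years to break even")
```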
Next paragraph
> At ~50-100 watts and $0.18/kWh that's $0.009 or $0.018 per hour. $0.02 per hour. $0.48 cents per day for the electricity to be running inference at 100%.
lol
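For reference, the arithmetic in the quoted passage can be checked in a few lines. This is only a sketch using the numbers from the quote (50-100 W, $0.18/kWh); the function name is mine. The daily figure works out to about 48 cents, i.e. $0.48, presumably what "$0.48 cents" was meant to say.

```python
def electricity_cost_per_hour(watts: float, usd_per_kwh: float) -> float:
    """Cost in USD of drawing `watts` continuously for one hour."""
    return watts / 1000.0 * usd_per_kwh


RATE_USD_PER_KWH = 0.18  # rate taken from the quoted passage

low = electricity_cost_per_hour(50, RATE_USD_PER_KWH)    # lower bound of the range
high = electricity_cost_per_hour(100, RATE_USD_PER_KWH)  # upper bound of the range

# The quote rounds the hourly cost up to $0.02 before multiplying by 24.
per_day_exact = high * 24
per_day_rounded = round(high, 2) * 24

if __name__ == "__main__":
    print(f"per hour: ${low:.3f} to ${high:.3f}")
    print(f"per day (exact):   ${per_day_exact:.3f}")
    print(f"per day (rounded): ${per_day_rounded:.2f}")
```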
So, for comparison, a 5090 has 32GB of VRAM and you can get one for ~$3000 maybe. To go beyond that memory with current generation (ie Blackwell) GPUs, you have to go to the RTX 6000 Pro w/ 96GB of VRAM and that's almost $10,000 for the GPU by itself. Beyond that you're in the H100/H200 GPUs and you're talking much bigger money.
Part of the problem here is the author is looking at laptops. That's the only place you'll find the M5 Max currently. The real problem here is that the Mac Studios haven't been updated in almost 2 years. There were configs of those with 256/512GB of RAM but they've been discontinued, possibly because of the RAM shortage and possibly because they're reaching EOL. Apple hasn't said why. They never do.
Many expect M5 Ultra Mac Studios in Q3 and the M5 Ultra may well have >1TB/s of memory bandwidth (for comparison, the 5090 is 1.8TB/s). Memory bandwidth isn't the only issue. A 5090 will still have more compute power (most likely) but being able to run large models without going to a $10k+ GPU could be huge.
But yes, it's hard to compete with the scales and discounted electricity of a data center. Even H200 compute hours are kinda cheap if you consider the capital cost of what you're using.
I've looked into getting a 128GB M5 Max 16" MBP. That retails for $6k. You might be able to get it for $5400. But I don't think the value proposition is quite there yet. It's close though.
Apple isn’t expecting wholesale adoption of on-device models this year or next. But all of their design and iteration suggests they see it coming.
Apart from that, as detailed in the article, pricing for local compute also depends on electricity prices.
By the way, I don't want to snark about it, my English is not very good, but it's "per se", not "per say". Just commenting on this petty thing because it seems to be a common misspelling, and it always trips me up a bit. Makes me wonder about another supposed meaning like "from hearsay".
E.g.
Privacy
Uptime
Future cost structure controls
This is a field that has moved very quickly. And it has moved in a direction to try to trap users into certain habits. But these habits might not best align with what best benefits end users today or some time in the future.
> 64 gigs should run a model like Gemma 4 31b
No, it can run anything in the 70B range. It's a notable quality upgrade from the 30B, which isn't obvious because the famous flurry of April releases didn't contain any 70Bs.
It can also run 120B in UD-Q3. Or 230B disk-streamed.
If small model is great it will be hosted with good electricity cost and will be utilized 24/7.
Isn't it 2+2 of economics ?
CPU is a commodity, and we are still buying CPUs and RAM from vendors for the same reason.
A 2022 Mac Studio w/ M1 Ultra and 128gb was ~$5200 new and I see them selling for over $4k on eBay.
Can’t sell your used tokens…
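To put the resale point in numbers, here is a minimal sketch using the two prices mentioned above (~$5200 new, $4000 as a floor for "over $4k" used). It ignores selling fees, taxes, and inflation, and the function names are hypothetical.

```python
def net_hardware_cost(purchase_usd: float, resale_usd: float) -> float:
    """What the hardware actually cost you once resale is counted."""
    return purchase_usd - resale_usd


def retained_fraction(purchase_usd: float, resale_usd: float) -> float:
    """Share of the purchase price recovered on resale."""
    return resale_usd / purchase_usd


PURCHASE_USD = 5200.0  # M1 Ultra Mac Studio, price quoted in the comment
RESALE_USD = 4000.0    # lower bound of the quoted eBay prices

if __name__ == "__main__":
    print(f"net cost: ${net_hardware_cost(PURCHASE_USD, RESALE_USD):.0f}")
    print(f"value retained: {retained_fraction(PURCHASE_USD, RESALE_USD):.0%}")
```

On these figures the net hardware cost is about $1200, which is the number a per-token comparison would need to amortize instead of the full sticker price.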
Running locally, you get confidentiality of knowing your tokens are only ever being processed by your own hardware. You get the integrity of knowing your model isn't being secretly or silently quantized differently behind the scenes, or having its weights updated in ways you don't want. And you get the availability of never having to worry about an API outage, or even an internet outage, for local inference capacity.
And this isn't even starting to address the whole added world of features and tunability you get when you control the inference stack. Sampling parameters, caching mechanisms, interpretability etc.
OpenRouter may be cheaper than frontier labs, but you still lose all of these benefits from open weight models the moment you decide to rely on someone else's hardware for your processing.
But you are dependent on them, which is the biggest factor IMO. There was a website posted here before about people getting banned from using it over silly reasons, not to mention price hikes or privacy concerns. Maybe it's more expensive or slower to run locally right now, but you are in full control of everything.
When I see so many options, that looks like it would take months to audit whether it actually makes sense and is safe to use. But I guess some people are fine with YOLO-ing it.
This is why the idea that the AI labs are in trouble because inference will be a commodity is _completely backwards_. Some of the largest and most powerful companies in the world sell commodities. They compete on scale and efficiency, and you are never going to be able to compete with the big labs on either.
But then they talk about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast but the author points out: you're only getting 10-40 tokens per second. It's not bad, but it's not meant for this!
It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to be a Mac.
Apple silicon works out pretty good if you're not burning tokens 24/7/365 and you're not buying hardware specifically to do it. I use my Mac Studio a few times a week for things that I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like a H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on hardware that's multi tenant with cheap electricity will pretty much always win.
I don't do local inference other than hobby & learning reasons because electricity is so expensive where I am at.
Some do it to have control over their ability to use AI. Some do it because they think it will be cheaper to not have to pay a SaaS to generate tokens for them.
But for those interested in the latter case, it seems like it's not actually cheaper after all, at least at current prices. But then I don't expect prices to drastically jump because of how much competition there is in model development.
For most people, they might be better off with OpenRouter models and providers supporting Zero Data Retention. On the cloud, that’s as good as it gets for privacy - your data is never retained beyond the life of the request.
the less you use local LLM, the less sense it makes since you paid a lot for hardware you don't use
He should compare his MacBook to Open Router on Kimi 2.6 1.1T or GLM 5.1 (754B), at bfloat16 precision, which he can't ofc.
But it furthers his point that things like open router are a better idea, which is not surprising.
There are 2 caveats here:
Some places have higher prices for industrial than residential power, as residential power might be subsidized by the government.
And DCs also pay for cooling, which residential users effectively pay for only if they have AC and it is hot outside. So their effective power cost is some multiple of industrial pricing.
That isn't the case for many, though, and there is a whole social media space where people are hyping up the latest homebrew options for running models, believing it frees them from the yoke of big AI.
Millions of people are buying big $ maxed-out hardware like the Mac Studios or DGX specifically to run LLMs. Someone rationally running the numbers is a good thing.
What's your source for this?
You also get the benefit of privacy, freedom from censorship, and control over the model used (i.e. it will not be rugpulled on you in three months after you've built a workflow around a specific model's idiosyncrasies).
Excusing everything else that u/bastawhiz said[0]; the obvious fact here is that Claude, OpenAI, Gemini et al. are quite literally burning through 100's of billions of dollars and selling it back to you for pennies on the dollar in the hopes that they get to be the only one left.
If I spend $10 growing Oranges and sell them to you for $1; then of course it's more expensive for you to do the growing.
I feel like I'm taking crazy pills. These models will become more expensive over time, it's functionally impossible for them not to, they just want to capture the market before they have to stop selling at a huge loss.
If you want a faster model, go for qwen3.6 35B (or gemma 4 26B if gemma models perform better for your tasks). There is a reason why people (myself included) haven't shut up about those two (especially the 27B). It's small enough to run at a decent speed (especially with the built-in MTP that finally has official llama.cpp support) and for many workloads (every benchmark I have ever thrown at it) it is matching or surpassing models it has no right to.
A couple of days ago I woke up with my internet being down, started 27B in pi, told it to diagnose what's wrong by giving it my router's password, went to grab a coffee, and by the time I got back I had a full report with suggestions on how to proceed. I love openrouter and I use it for many things, but it is not cheaper.
Subjectivity and opinions based on personal experience with all those models implied naturally, I assume the 31B gemma has cases in which it edges out, I've just failed finding any and I have been running all 4 models mentioned since hours after each of them dropped nonstop for different tasks. Hell, for my hermes, I've started getting better results once I switched from gemma 4 26B to qwen3.5 9B, not even the massively improved 3.6 series. It just feels outdated/ cherrypicked to not use what by many accounts is the current consumer hardware SOTA if doing such an analysis.
More critically, in practice, setting up local models seems more like a hobby, an educational exercise, or an act of privacy control than it is for cost cutting or productivity.
Personal computers eliminated an earlier terminal era, and most if not all of those companies are gone except for IBM and a few stragglers and they are a shell of their former selves.
I say, as the author of the blog post, writing this on a MacBook M5 max 128gb..
Another open secret is that certain companies give you tens of thousands of tokens freely, with pretty respectable models such as Gemini 3.1 and GLM 4.6.
I looked at a couple random agentic sessions in my openrouter activity, and the input cost is 10x the output cost.
Prompt caching on openrouter is complicated and unreliable. On local hardware with llama-cpp, it's mostly free.
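One plausible reason input cost dominates in agentic sessions: without caching, every turn resends the whole conversation so far, so billed input tokens grow roughly quadratically with the number of turns while output grows linearly. The sketch below is a toy model of that, not OpenRouter's billing logic; the function name and all numbers are hypothetical.

```python
def session_token_totals(turns: int,
                         system_tokens: int,
                         tokens_added_per_turn: int,
                         output_tokens_per_turn: int) -> tuple[int, int]:
    """Total (input, output) tokens billed over a session with no prompt caching.

    Each turn appends new user/tool content to the context, sends the whole
    context as input, and then appends the model's reply to the history.
    """
    total_in = 0
    total_out = 0
    context = system_tokens
    for _ in range(turns):
        context += tokens_added_per_turn   # new user message or tool result
        total_in += context                # full context is resent as input
        total_out += output_tokens_per_turn
        context += output_tokens_per_turn  # reply becomes part of the history
    return total_in, total_out


if __name__ == "__main__":
    # Hypothetical session: 20 turns, 2000-token system prompt,
    # 500 new input tokens and 300 output tokens per turn.
    tokens_in, tokens_out = session_token_totals(20, 2000, 500, 300)
    print(f"input tokens:  {tokens_in}")
    print(f"output tokens: {tokens_out}")
    print(f"input/output token ratio: {tokens_in / tokens_out:.1f}x")
```

With a working prefix cache, most of that resent context is not recomputed from scratch, which is why caching behaviour matters so much for this kind of workload.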
This is like comparing an e-bike at home with e-bike rental and concluding that we therefore need to rent a Toyota since it can go at similar speeds. Getting tired of bad posts getting so much attention.
I expect self-hosted to be quite competitive pretty soon. Github Copilot is already wildly more expensive than it was last month. People are going from spending a few bucks to a few thousand for that same usage. So, if it doesn't get a lot more efficient (like 3x the tokens, or more, from the same infrastructure), the prices will have to go up quite a lot to keep the lights on. Everything in AI is running partly on investors money, everyone is trying to buy a monopoly and insurmountable lead and some way to lock people into a specific model and ecosystem, but so far that hasn't happened (except for people who voluntarily lock themselves into a specific ecosystem, but even in those cases, it's usually easy to get the AI to help move to another, there are no truly unique features in AI that at least one, and probably three or four, other players don't also offer).
Second thing is you can starkly upgrade the token generation locally if you use agent teams. Single conversations are memory bandwidth bound and don't fully make use of your compute. If you can batch tokens from multiple agents you can easily 5x token generation.
It's in the same category as rooftop solar for me. It doesn't have to make strict economic sense if you're the particular type of person who gets peace of mind from control of infrastructure / reduced dependency.
Shortening the lifespan?
There's a bunch of retro hardware which should make people pause and realize they're stupid to assume hardware slows down on average even 5% 20 years later (it's probably closer to 2% and I'm being generous).
HVAC/power delivery and generation are the major factors, and if you didn't skimp/get defective parts for this and replace failed moving parts (usually fans), your hardware is basically the same 20 years down the line as it was today.
Also using LLMs locally doesn't even induce sustained 100% GPU usage over significant periods of time for most real (agentic coding in OpenCode) use-cases.
So we shouldn’t be comparing it to the cost of open router api access at all, we should be comparing it to the cost of a 4 credit university course.
* Industrial power pricing
* Wholesale hardware pricing
* Utilization density of shared API
means API always wins a cost shootout.
Privacy & tinkering is cool too though
But in _every_ metric other than privacy it was better to run via OpenRouter than a local model, and not by a small amount.
Direct link to the comparison charts:
https://sendcheckit.com/blog/ai-powered-subject-line-alterna...
Add to that the privacy improvements and data protection, and potentially further specific inference if needed, and it's a no-brainer.
Again, AI is a tool, and you pick the right tool for the job. I would wager, with no evidence looked up, that the majority of devs would be happy with 10-30 tokens per second locally.
also you gotta realize frontier models have massive "system prompts" that clog up the context window with garbage.
being able to write your own system prompts gives you a MASSIVE edge..
This is common when processing PII. Lawyers, doctors, or similar should not be using cloud solutions.
Also, it's harder to set up and always more expensive than any cloud solution.
It depends on how often you use it (and your tolerance for slow inference) and whether you would have otherwise bought a higher spec. For my needs, this costs a LOT more.
but you lose access to the most capable models, you can run only the small ones
And, since it's a Mac, whenever you're ready to upgrade it'll still have a fairly decent resale value.
https://www.ebay.com/sch/i.html?_nkw=apple+mac+studio+m3+ult...
I think you'd need to tinker quickly, realize anything with CUDA (other than the awful DGX Spark) is better for learning, the prefill is killing your ability to actually run models large enough to justify that RAM, and then cure yourself before the rest of the crowd.
An unreasonable number of these people spent $10,000+ for Mac Studios that are still compute bottlenecked and don't have anything more efficient than Gemma 4 to run.
Also, there are good technical reasons for inference being much more efficient at scale.
But that's not the point I'm making. (or, it kind of is, but it's more high level than that).
They're running spot and preemptible GPU instances (60-80% cheaper than on-demand), paying wholesale industrial electricity rates, and running at multi-tenant utilisation densities that make your MacBook look like a bonfire. Of course they're not individually loss-making on inference, they're aggregating cheap commodity compute and skimming a margin, and on paper that's what makes it seem like a good idea, certainly not a loss leader right?
But zoom out a bit; the entire stack is swimming in VC money. OpenRouter itself just raised at a $1.3B valuation backed by a16z. The Chinese models that now account for 36% of all tokens routed through the platform (DeepSeek, Qwen) are priced the way they are because Beijing-adjacent capital has decided market share matters more than margin right now.
So yes, technically no single party is "throwing money away" on each token; they're just all simultaneously subsidising different parts of the stack for strategic reasons. The floor price you're seeing isn't a stable equilibrium, it's a pile of investor money that hasn't entirely finished burning yet.
On Apple Silicon you can get 4x-8x more tokens per second if you run more queries in parallel (as long as your inference server supports it, and has enough spare RAM for more KV caches).
When inference is done at datacenter scales, when you distribute generation across multiple GPUs and have kernels carefully tuned to specific hardware, the compute vs DRAM bandwidth speed ratio gets absurd like 200:1. That's why everyone gives you batch inference at a steep discount.
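A back-of-the-envelope roofline sketch of why batching helps. It assumes each decode step reads the full weights once (shared by every sequence in the batch) and that compute scales with batch size; it ignores KV-cache traffic and all overheads. The hardware numbers are hypothetical placeholders, not measurements of any Apple or NVIDIA chip.

```python
def decode_tokens_per_second(batch_size: int,
                             weight_bytes: float,
                             mem_bandwidth_bytes_per_s: float,
                             flops_per_token: float,
                             compute_flops_per_s: float) -> float:
    """Aggregate decode throughput under a simple roofline model.

    One decode step produces one token per sequence. Its duration is the
    larger of the time to stream the weights from memory and the time to
    do the arithmetic for the whole batch.
    """
    t_memory = weight_bytes / mem_bandwidth_bytes_per_s
    t_compute = batch_size * flops_per_token / compute_flops_per_s
    return batch_size / max(t_memory, t_compute)


# Hypothetical setup: ~30B parameters at 4 bits (~15 GB of weights),
# 500 GB/s memory bandwidth, ~2 FLOPs per parameter per token,
# 30 TFLOP/s of usable compute.
WEIGHT_BYTES = 15e9
BANDWIDTH = 500e9
FLOPS_PER_TOKEN = 60e9
COMPUTE = 30e12

if __name__ == "__main__":
    for batch in (1, 4, 8, 32):
        tps = decode_tokens_per_second(
            batch, WEIGHT_BYTES, BANDWIDTH, FLOPS_PER_TOKEN, COMPUTE)
        print(f"batch {batch:>2}: {tps:7.1f} tokens/s total")
```

In this toy model throughput scales almost linearly with batch size until the compute time catches up with the memory time, after which it flattens. The larger the compute-to-bandwidth ratio, the larger the batch needed to reach that point.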
The most intelligent model at a given time is much larger than the previous, which is why token costs for GPT5.5 are higher than 5.4. But you should expect that 2 years from now, serving a GPT5.5 sized model will be cheaper than GPT5.5 today. You should expect it to be even cheaper to get an equally intelligent model 2 years from now, because distillation techniques are effective at reducing the necessary parameter count for the same benchmark scores.
I’m struggling to find the quotes.
If the arms race stopped tomorrow the current price pays for the inference.
Seems to be on its way! I know of at least one person whose company is looking at a 20x increase, and afaict (from related looking around, nothing concrete tho) business accounts are missing some costs in the calculator so it'll likely be higher.
[0]: these API are not sold at a loss either, by the way. But it's a nice meme so let's just pretend they are.
In other words, inference is fairly profitable for them and the rest of the money is spent growing revenue as quickly as possible. Building models is still an expensive line item but the costs for that are going down with time.
There is also maybe a “capture the market” mentality but I don’t think that’s necessarily it - the tools and processes are largely fungible and that’s a huge problem. They need to figure out how to make it sticky for “capture the market”, but there’s also a very real “grow as big as possible as quickly as possible to take on Google”; Google has an existential threat here.
They could have said the same about transistors. People keep inventing new ways to keep the costs down. Just look at the latest Qwen, DeepSeek, BitNet. Interesting tidbit: they’re all open, and as Google said in 2022: they have no moat.
How big/deep of a loss?
I feel like I read this every day for years that Uber did this same "idiotic, losing" strategy (how it was pitched/discussed) and then one day we woke up and... without much fuss, boom, they were profitable seemingly overnight.
For me nothing says low class like the Porsche dealer saying "we can call an Uber to take you home". Ridiculous… and it was a low-class experience: a small, dirty car. Never again, ha ha ha…
Why? It's no less crazy than when Uber and Lyft were doing the same thing. Or when the entire tech industry was doing it in the dot com boom.
Investment-driven market growth at a loss is like the least surprising thing in all of this. The tech is new and fascinating. The bubble is just another trip through the funhouse.
There are huge economies to be had by batching requests and using lots of RAM for MoE (sparse models). You can't achieve that efficiency at batch size 1 on a single node.
Likewise, DeepSeek V4 Flash is quite accessible on local models, with DwarfStar 4 making it easy to run on a 96GB MacBook.
There's nothing wrong with paying for inference, but local models bring up some pretty amazing possibilities, such as entirely offline usage or being able to work on private PII, legally privileged, etc. sort of data, or performing tasks with no concern given whatsoever towards billing overruns.
The other possibility is being able to build a service which you can be 100% assured you can keep running without worrying about a service going down or being end-of-lifed, which is currently a problem with frontier models. My local Qwen setup is entirely predictable. It can run as long as I can keep finding hardware to run it.
A sensible strategy uses both: have local inference tools available, and use both low-cost and high-cost cloud based models. You can use GPT-5.5 and Opus-4.7 for things they excel at (including laundering the latter via a Claude subscription to make it cheaper) for demanding reasoning tasks, DeepSeek V4 Pro for slightly less demanding tasks, V4 Flash for most (not all) code generation, and then local models for things where you want a local model.
> If you want a good dense model, use qwen3.6 27B instead, speed will be up, and if you don't take my word for it being smarter, take openrouter's prices of it against the bigger, slower and less memory-efficient gemma do the talking.
Don't know if this is the correct read. I think those providers are simply taking cue from Alibaba's first-party pricing for the 27B Dense. It's kinda overpriced imo. Perhaps it can be explained by how 'reasoning-inefficient' (relative to frontier models or even Gemma) the Qwen models are and longer sequence lengths are expensive to serve.
But I do agree that the openrouter prices aren't a strong signal and probably should have worded it a little better. It's just a really stark and 'in your eyes' gap.
Ignoring that it was just tosser hyperbole (that absolutely zero reasonable people need to question), yes, enormous numbers of people are buying GPUs or hardware with the explicit goal of running local LLMs, and social media is full of people hyping various setups and models. Mac Minis are almost impossible to find, and that alone is selling at a clip of about 300,000 every four months. Large memory GPUs are basically a myth at this point. All so people can pay more to get a worse result than commercial options, which is precisely the point of the submission.
These local setups only ever make sense if you have something that confidential, or you're doing something that ToS of the majors would ban you for.
Now given this pedantry horseshit, you'll probably demand that I specifically show a citation on DGX or Studio sales, which...rofl.
I have not observed this at all. If anything, the big 3 are getting both better and more helpful in this area. And if you pass the security checks, Anthropic will give you limited access to Mythos.
Devil's advocate:
* inflation caused everything to go up to some degree since then
* if it was "that bad" as you say, they wouldn't be extremely profitable and have so many users
both things can be true? "they cut the driver pay in half and doubled the price" did not lead to the collapse of the business/people to stop using it.
because it costs $1k-$2k instead of $10k-30k+ for optimized devices
Anthropic: https://x.com/jaminball/status/2052112309364162874
> Those are the economics of the industry today, or not today but where we're projecting forward in a year or two.
Given those models aren't sold at MSRP anymore and are primarily being sought for inference, it's apropos of an article on the costs of inference on Apple Silicon.
As for privacy, I'm sure there are many people that are not so interested in that aspect.
Millions of people are paying thousands of dollars a year to buy a slightly upgraded entertainment package in their car. There are 60 million or so millionaires alone, including 6 million+ in China.
There are a lot of people with a lot of wealth on the planet. A lot. Millions...it isn't that unfounded, friend.
So doing this "this is HN" snide jerk act, and then basically projecting your lot on the planet is...I don't know if you intended it, but it's rather amazing.
America is basically proposing AI using the equivalent bloatware of Windows 11.
also it's possible that the scale of inference needed (e.g. Jevons paradox) keeps growing to the point that training costs can fully be absorbed (since training cost is one off vs. inference that can scale).
(I suspect that might be the thinking, I don't know if it will be true, it's also possible that no model will create a moat big enough to attract enough of the inference traffic to make it true).
Depending on the chips/architecture used, the off-peak traffic from inference can also subsidize the training costs.
I think the question in terms of throwing money away isn't the inference layer: it's whether the companies training open models will be able to financially keep doing so. How long will Moonshot keep releasing future Kimi models? I think there's an interesting wedge they're exploring with being basically a base-model-trainer-as-a-service, i.e. selling rights to Fireworks to sell finetuning services to the Cursors of the world, but it's entirely possible it doesn't pan out.
That being said, Nvidia seems willing to step up to being the base model trainer of last resort via the Nemotron family of open models, since it helps sell more of their hardware — similar to their investments in the CUDA stack to sell hardware (unsurprisingly, Nemotron is designed to run most efficiently on Nvidia hardware, e.g. native NVFP4). So I suspect there will continue to be a pretty good market here.
All that says is that it gets more expensive in the future as competitors exit the market and sustainability becomes important. That’s why Uber and Lyft were so cheap until they killed taxis. One major difference of course is that some models will remain largely good enough and the incremental cost of running will keep dropping to 0 over time since the hardware needed doesn’t get more expensive and is already purchased.
I only object to taking current prices as if they are perpetual prices.
Correct me if I'm wrong, but I believe this is a feature that only Google has figured out how to implement. All of the other pay-as-you-go token services have a cap you can set, some by monthly spending, some with API key resolution, others by how much you put into the account. I use many, and if configured with auto-purchase disabled, it's not possible to have a "surprise" bill (except for Google!)
It can worry over Part C while I have my 10:30 group meet. And it can worry over Part D while I do whatever other silly, time-wasting thing all humans do in almost all organizations. Then I still haven't reviewed Part B, yet, so the extremely slow AI is waiting on me.
Maybe someday I'll be good enough to need faster AI so I can rewrite something like Bun in a few days. Right now, slow and local fits my use case very well.
If I used an actual direct API it probably would've been much faster, but I'm doing it for hobby / fun reasons. You also get to fiddle with a lot more params.
I use a 5060ti 16gb and a minipc.
I tunnel in via Tailscale and access it with my phone or laptop from anywhere. It’s pretty good and will only get better as I optimize.
Amazon was losing money; they were losing money because they were growing and spent all of their cash flow on growth. It wasn't merely regarded as a hopelessly unprofitable business, it was regarded as potentially fraudulent. The share price collapsed in 2014 because, some thought, the profit would never come, investing in growth was pointless, etc.
Last year Amazon made nearly $100bn in profit. Stock is up 20x from then... this is after AWS was known (everyone also said that was a massive fraud, could never be profitable... we know it was printing from day one), after it was the world's biggest retailer, etc.
It is difficult to overstate how consistently people make this mistake, not just individually but in aggregate. You see the same thing with restaurants, consumer products, office leasing, so many businesses. This is not to say that the future will happen any particular way but that what Anthropic and co are doing is obviously rational and based upon very real cash flow. Anthropic's growth in revenue is, I believe, unparalleled in modern corporate history. A slight difference in this case is also that the economics of training these models are improving exponentially over time.
The restaurants next to the mines were profitable up until the moment the mines themselves shut down: one doesn't exist without the other.
You can't ringfence inference as "the profitable bit" and then hand-wave away the training. Without continuous training there is no inference product.
Claude 3 Opus isn't sitting there making revenue in 2026 - the thing is just deprecated. The moment you stop spending billions on the next model, your "profitable" inference business is on borrowed time until someone else makes it obsolete.
Maybe I made a mistake in my analogy... They're not growing a farm and then selling oranges. They're on a treadmill where stopping is death, and the treadmill costs $10bn a year to keep running.
This claim deserves teasing apart.
Clearly, training is a Red Queen's race today. If a model provider were to unilaterally decide to stop training, they would very quickly lose market share to competitors with better models.
On the other hand, what if market and investment conditions change such that everybody has to stop training?
In that case, the models are still there and still as useful as they were the day before. So why wouldn't there still be an inference product?
You’re literally describing all companies. Google takes about $270bn/year to run. If they stopped spending that they’d die pretty darn quick. It’s also a description of working - unless you’d built up significant savings, if you stopped working you’re also going to die.
The reason they're losing money on paper is because the models keep growing 10x in size every generation but they're not getting 10x returns on model inference (closer to 2x)
Unless they are changing the architecture in huge ways. The pre-training done for 3 goes into later models. I am sure the frontier labs are figuring out how to pretrain generic feedstocks that can be fed into downstream training pipelines. DeepSeek's incremental training run cost was what, 5M? Alibaba and DeepSeek have the best, most efficient training pipelines; look at the rate at which custom Qwen models are being pumped out.
In this analogy, model training would be akin to developing better oranges, but your competitors are also developing better oranges so if you stop spending heavily to improve your oranges, consumers are going to buy ~zero oranges from you within a couple years. (Expanding the farm might be analogous to expanding data centers.)
The news they successfully buried was that companies like AirBnB are now running Qwen and open source models. The free oranges are now good enough. There is no future unless the goal is to get to super intelligence and utterly take over the world before anyone else gets one. Anything else and free models are six months behind. The money now is the opposite of what everyone thought a year ago: datacenters. Everyone thought AWS was fucked. Turns out AWS is really good at running Qwen.
Sure, I can wait hours for my local model to finish, or I can spend basically as much and get the answer right away
There’s a lot of exciting stuff with local LLMs despite the speed, but for me I don’t have the discipline and working memory to jump from project to project.
Dario's point was the opposite of yours; he used per-model accounting to explain why the company P&L gets worse every year, not better. His own numbers (10x training costs each generation and ~2x revenue return.)…
"It looks like it's getting worse and worse" are his words, not mine.
In 2025, Anthropic's inference costs came in 23% over their own projections. They cut their gross margin forecast from 50% to 40%.
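To make the per-model accounting argument concrete, here is a rough sketch. It assumes one reading of the quoted numbers: each generation costs 10x more to train than the last, and each model earns about 2x its own training cost, all in the year after it was trained. The starting cost of 100 and the one-year lag are illustrative assumptions, not figures from Anthropic.

```python
def company_pnl_by_year(first_training_cost, generations,
                        cost_growth=10.0, revenue_multiple=2.0):
    """Return one (year, training_spend, revenue, net) tuple per year.

    Assumption: the model trained in year N earns all of its revenue
    in year N+1, while the next (10x more expensive) model is trained.
    """
    rows = []
    previous_cost = 0.0
    cost = float(first_training_cost)
    for year in range(generations):
        # Revenue this year comes from the model trained last year.
        revenue = revenue_multiple * previous_cost
        rows.append((year, cost, revenue, revenue - cost))
        previous_cost = cost
        cost *= cost_growth
    return rows


def model_lifetime_profit(training_cost, revenue_multiple=2.0):
    """Profit of a single model viewed in isolation."""
    return revenue_multiple * training_cost - training_cost


if __name__ == "__main__":
    for year, spend, revenue, net in company_pnl_by_year(100, 4):
        print(f"year {year}: spend={spend:.0f} revenue={revenue:.0f} net={net:.0f}")
```

Under these assumptions every individual model is profitable, yet the company-wide loss grows each year for as long as the next generation keeps getting trained.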
1) Roughly break-even to a little bit cheaper per-token cost
2) Much, much faster
So the cost of the mac barely even matters, it's just an extra cost beyond.
Sure, data center providers can pay lower rates.
The point of this article is that LLMs at home really don't make a ton of sense, unless you are willing to pay through the nose for privacy. There is absolutely no cost saving to be had.
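For anyone who wants to redo the comparison with their own numbers, the arithmetic is roughly this. It is a sketch only: the function name and every parameter are hypothetical inputs you would have to measure or look up yourself, and it ignores resale value, cooling, and your time.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def local_usd_per_mtok(hardware_usd, lifetime_years, utilization,
                       tokens_per_sec, watts, usd_per_kwh):
    """Amortized cost in USD per million generated tokens on local hardware.

    utilization is the fraction of the lifetime (0 < utilization <= 1)
    that the machine actually spends generating tokens.
    """
    busy_seconds = lifetime_years * SECONDS_PER_YEAR * utilization
    tokens = busy_seconds * tokens_per_sec
    # Energy is only counted while the machine is busy (an assumption).
    energy_kwh = (watts / 1000) * (busy_seconds / 3600)
    total_usd = hardware_usd + energy_kwh * usd_per_kwh
    return total_usd / tokens * 1_000_000
```

Compare the result against the provider's listed $/Mtok for the same model. Note how much the answer depends on utilization: the hardware cost is fixed, so a box that sits idle most of the day gets very expensive per token.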
If you're looking at your own datacenter as a larger corporate client, that could change.
There are also some providers that will contractually keep your data private, like AWS Bedrock or parts of Google/Azure (I don't know their stack names).
AWS even has AWS Secret Region and AWS Top Secret Region if you want to use LLMs on classified data.
You have to value privacy at a roughly absurd level to not want to use LLMs run efficiently at scale by someone else. For the home user, just the extra efficiency produced by batching requests from a large number of users in a datacenter is a real win.
Some of these companies are even selling tokens below cost to get market share. If someone will sell you a service for a dollar bill or three quarters, why wouldn't you take the three quarters?
Because one day they'll send you an email informing you the new rate is $1.50, and if you missed the email, that's not their problem.
So if you are lucky you might end up with something that still runs but most folks won't find it particularly useful
Real estate is only a clearly good investment if you ignore opportunity cost.
In addition, the interest portion of the mortgage almost always ends up near the rent the owner would have paid. The total mortgage payment is higher, but the difference is principal (and the principal share grows quickly), while a fixed payment also counteracts rent inflation.
Even if you move out after 5 years, you still own the place and can rent it out, and then it pays for itself, to skip the cost of selling it back to market.
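The interest-versus-principal split can be checked with the standard fixed-rate amortization formula. A small sketch, assuming a fixed rate above zero and monthly payments; the loan figures in the usage note are made up for illustration.

```python
def monthly_payment(principal, annual_rate, years):
    """Fixed monthly payment for a fully amortizing loan (annual_rate > 0)."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)


def payment_split(principal, annual_rate, years, month):
    """Return (interest, principal_paid) for the given 1-based month."""
    r = annual_rate / 12
    payment = monthly_payment(principal, annual_rate, years)
    balance = principal
    # Walk the balance forward to the start of the requested month.
    for _ in range(month - 1):
        balance -= payment - balance * r
    interest = balance * r
    return interest, payment - interest
```

For example, `payment_split(400_000, 0.06, 30, 1)` and `payment_split(400_000, 0.06, 30, 180)` show the same total payment with a larger principal share in the later month. The interest portion is the part comparable to rent; the rest is equity.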
House sellers receive offers from buyers, sometimes including letters, and can choose to sell to any of them (or none of them), whether or not those offers are higher than the listed price. It's not so different.
> And then the management company would come inside my house to inspect that I wasn't running a meth lab or something.
Yeah that part is different. I also prefer owning.
No, not quite. It really comes down to opex vs capex and the depreciation schedule for your investment.
Software development is typically categorized as capex, on a 3-5 year depreciation schedule. You assume the software you write today will be generating value for you that long.
If a big, expensive model training project only gives you value for a year or less, that is not like most companies.
But believing that the financials of a project are governed solely by how IRS rules force you to account for headcount is kind of silly.
> If a big, expensive model training project only gives you value for a year or less, that is not like most companies.
The model itself that gets built? Sure (although clearly the timelines are getting longer). However, the important bit here is the research that got done along the way and the infrastructure built to make that model-building process cheaper, better, etc. All of that stuff sticks around, but because it’s hard to appreciate externally you discount it to 0, when it’s literally what they actually spent the money on.
But none of that even matters. Google had 270B in opex and their capex has grown from 50B in 2024 to 90B in 2025 and is projected to grow to ~175B for 2026. But even if you discount the “AI” treadmill, you’re still looking at many tens of billions in capex that if they stopped they’d die.
There is no moat. In the end, what we are calling AI today will just be something that is incorporated into existing programs that people will use to help them accomplish a task. The public will not be paying more for it. It will just be a commodity added to the existing ecosystems we have today. They
https://kpmg.com/kpmg-us/content/dam/kpmg/pdf/2023/tcja-chan...
My home is "paid for". Except for the HOA and property taxes that are not that far off from what I was previously paying in rent, the ongoing maintenance costs with random large spikes, and the opportunity cost of having a large chunk of money in the house and not in the market. It was still probably the right decision, but it's not at all a free lunch.
And you didn't need to go live in a HOA. I don't, and it's much cheaper.
Sure, the same way that the benefits of a fixed mortgage payment are baked into sale prices. The efficient market hypothesis would say that neither renting nor buying should be obviously superior in the long term, because if either was then people would bid up rents/prices until it wasn't.
> And you didn't need to go live in a HOA
I pretty much did, unless I wanted to significantly compromise on other factors.
I keep hearing that property in the USA is in its biggest bubble yet, with the affordable housing shortage being a red herring: real estate managers and boomers are unwilling or unable to reduce their prices despite not getting renters/buyers, because doing so would kick off a death spiral as their interest rates would consequently go up (because of lower security). Along with the AI layoffs, etc.
I'm not American so I only hear the occasional interview so don't have any idea if it's really as pressing as these industry professionals keep saying but I'm definitely at the edge of my seat watching...
Buying has a much higher entry point: you need a bunch of cash at the start, then a ton of paperwork.
It is absolutely possible that a local buying market is inflated precisely because the area is so desirable that buying to rent is (or was) a good investment, but that's rarely true for a bigger market.