Apple Silicon costs more than OpenRouter

Apple Silicon costs more than OpenRouter (williamangel.net)

305 points by datadrivenangel 17 hours ago | 262 comments

bastawhiz 16 hours ago |

This isn't a good analysis, and it's because it keeps rounding everything up. He rounds up the cost of electricity by 10%. He has a range of power use, takes the high end (which is 2x the low end) and multiplies it by the inflated electricity cost.

But then they talk about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast but the author points out: you're only getting 10-40 tokens per second. It's not bad, but it's not meant for this!

It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to be a Mac.

Apple silicon works out pretty well if you're not burning tokens 24/7/365 and you're not buying hardware specifically to do it. I use my Mac Studio a few times a week for things that I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on hardware that's multi tenant with cheap electricity will pretty much always win.

datadrivenangel 16 hours ago | |

Rounding everything down in the most optimistic setting got me to $0.40 per million tokens, and openrouter has the same model at $.38/mtok.

nativeit 13 hours ago | | |

But once all that is done you still own a Mac in one case, and you don’t in the other, correct?

650REDHAIR 16 hours ago | | |

I’ll keep my data local over a $.02/mtok difference.

formerly_proven 14 hours ago | | |

What is it with AI SaaS naming themselves "openxyz" when there is 0% open about them?

novok 11 hours ago | | |

Also, many people have even cheaper power, or free unused surplus power from solar.

I don't do local inference other than hobby & learning reasons because electricity is so expensive where I am at.

avidphantasm 6 hours ago | |

Not sure where 40 tokens per second is coming from. I’ve seen 95-100 tokens per second on M5 Max 128GB running Gemma 4 31B. I’ve done experiments where it is faster than Claude Opus 4.5 for the same prompts.

dhiraj_bhakta_ 38 minutes ago | | |

can you provide your configurations pls ?

faitswulff 15 hours ago | |

The article makes no sense. I can't use OpenRouter as a general purpose computing device. Why are we comparing a whole computer to a single purpose SaaS?

mpyne 14 hours ago | | |

They're responding to the people doing things like buying the most expensive Mac they can find specifically to do local inference for their AI agents.

Some do it to have control over their ability to use AI. Some do it because they think it will be cheaper to not have to pay a SaaS to generate tokens for them.

But for those interested in the latter case, it seems like it's not actually cheaper after all, at least at current prices. But then I don't expect prices to drastically jump because of how much competition there is in model development.

sheepscreek 13 hours ago | | |

No, that’s not the point. I think this is to help people who are thinking about getting a beefier Mac so they can run their LLMs on it too. Some in particular want a dedicated Mac Mini or Studio for this purpose. The breakdown, even if slightly flawed, offers a good insight into the economics of it.

For most people, they might be better off with OpenRouter models and providers supporting Zero Data Retention. On the cloud, that’s as good as it gets for privacy - your data is never retained beyond the life of the request.

tuwtuwtuwtuw 15 hours ago | | |

I think it's because there are a lot of people writing articles about the benefits of running local models. I think it's fair to say that there are daily threads on HN singing the praises of local inference. I also see people buying new hardware where the main trigger is the ability to run local models.

ikidd 6 hours ago | |

Actually, figuring it on generating tokens 24/7 is the best-case scenario. If you figure it at 8 hours a day of actual use, the fixed cost of the hardware is still the largest portion of the budget, but now you generate 1/3 the tokens, so you triple that cost per token.
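
A rough sketch of that arithmetic in Python. The $4,000 price, 3-year lifespan, and 20 tokens per second are placeholder assumptions for illustration, not figures from the article; only the 24/7 versus 8 hours/day comparison comes from the comment above.

```python
def hardware_cost_per_mtok(price_usd, years, tokens_per_sec, hours_per_day):
    """Amortized hardware cost per million tokens (electricity ignored)."""
    seconds_used = years * 365 * hours_per_day * 3600
    mtok = tokens_per_sec * seconds_used / 1_000_000
    return price_usd / mtok

# Placeholder inputs (assumptions, not numbers from the article).
full_time = hardware_cost_per_mtok(4000, 3, 20, hours_per_day=24)
work_day = hardware_cost_per_mtok(4000, 3, 20, hours_per_day=8)
ratio = work_day / full_time

print(f"24/7:      ${full_time:.2f} per million tokens")
print(f"8 hrs/day: ${work_day:.2f} per million tokens")
print(f"ratio:     {ratio:.1f}x")
```

Whatever price and speed you plug in, the ratio stays at 3x, because the hardware cost is fixed while the token count scales with hours of use.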

econ 4 hours ago | |

Boss, I make 16.50 per hour, say 15, I work 36 hours, say 35, say 500 per week, say 4 weeks per month, that's only about 2000! Don't you agree I need a raise?

statestreet123 14 hours ago | |

Rounded up, yes, and oddly inefficient for someone obsessed with inefficiency. One could buy a brand new 64gb M5 macbook for well over 4k. Another could buy a scratched up but functioning M1 Max 64gb off of ebay for a little over 1k—and somehow get the same 10-20 t/s with 31b that the author does with an M5. Or better yet, have a frontier model do the planning and judging, and have a local MOE model execute at 50 t/s. All of this achievable by a former English major with too much free time.

novok 11 hours ago | | |

I have an M1 Pro, and a M4 & M5 max to play with at work and the speed difference is very significant between all 3 machines, the M1 Pro is far slower, and the M5 is significantly faster than the M4. And a windows 3090 beats all of them but eats twice the amount of power per token. This is all running the same 24GB memory friendly model with LM studio.

outside1234 2 hours ago | |

We also have no idea what it actually costs Anthropic. This could be wildly subsidized and actually Apple Silicon is more cost effective.

giancarlostoro 9 hours ago | |

Honestly, I don't even see my Macbook Pro costing me anywhere near as much as using any of these AI services, but maybe I'm just not seeing a significant increase in my power bill to notice? I am the power user who uses Claude Max pretty much all the time to prototype ideas, and build things I actually use, and has given me a lot of value, I work full time and have a family to raise and care for, my free coding time is mostly limited to ideas. Now I can draft a plan with detail, review the code, run the code, test it, and use software custom tailored to my needs.

dist-epoch 16 hours ago | |

using it 24/7 brings the average cost down, not up.

the less you use local LLM, the less sense it makes since you paid a lot for hardware you don't use

bastawhiz 15 hours ago | | |

That's the point: why would you buy a device that's specifically not optimized to be used for 24/7 inference? It's expensive hardware that's not designed to be used in that situation! The power use for inference isn't especially good and you're not getting even a fraction of the benefit from the hardware that you're paying for.

groundzeros2015 16 hours ago | | |

The hardware has multiple uses for the same cost. The pay-per-use server does not.

make3 7 hours ago | |

The real reason this comparison makes no sense is that only a vanishingly small fraction of people seriously using ai to code would seriously use a model so far from the top models (including open source ones).

He should compare his MacBook to Open Router on Kimi 2.6 1.1T or GLM 5.1 (754B), at bfloat16 precision, which he can't ofc.

But it furthers his point that things like open router are a better idea, which is not surprising.

PunchyHamster 9 hours ago | |

> Yeah, data centers don't pay residential electricity rates.

There are 2 caveats here:

Some places have higher prices for industrial power than for residential, as residential power might be subsidized by the government.

Data centers also pay for cooling, which residential users effectively pay for only if they have AC and it is hot outside. So the effective power rate is some multiple of industrial pricing.

bastawhiz 9 hours ago | | |

Generally you don't build a data center in a place that doesn't sell you electricity for cheap

llm_nerd 14 hours ago | |

Your post makes sense if you bought the hardware for other reasons, and maybe run models occasionally as a novelty.

That isn't the case for many, though, and there is a whole social media space where people are hyping up the latest homebrew options for running models, believing it frees them from the yoke of big AI.

Millions of people are buying big $ maxed-out hardware like the Mac Studios or DGX specifically to run LLMs. Someone rationally running the numbers is a good thing.

atq2119 12 hours ago | | |

Let's not get ahead of ourselves. Millions, really? I can believe there are a lot of enthusiasts doing this, but "millions" needs a citation.

curt15 8 hours ago | | |

> Millions of people are buying big $ maxed-out hardware like the Mac Studios or DGX specifically to run LLMs.

What's your source for this?

cyanydeez 16 hours ago | |

nothing about the current data center craze looks efficient.

bastawhiz 15 hours ago | | |

Whether or not you think building data centers is a good idea, it's inarguable that the per-token efficiency (power, hardware, etc) is FAR higher in a data center. That's literally what it's designed for.

trollbridge 14 hours ago | | |

Probably because lots of data centres are being built (or half-built) which are sitting idle.

applfanboysbgon 16 hours ago |

Unless I'm misunderstanding, this is counting the entire laptop in the cost of generating tokens. The calculation seems to omit that, in addition to receiving LLM output, you have also received a laptop in exchange for your money. If you intend to put this machine in a dark corner and run it solely as a token-munching server, a laptop would be an exceptionally poor choice of technology for this purpose. But if you intend to use the laptop as a laptop, having a laptop is a pretty big benefit over not having a laptop.

You also get the benefit of privacy, freedom from censorship, and control over the model used (i.e. it will not be rugpulled on you in three months after you've built a workflow around a specific model's idiosyncrasies).

dijit 15 hours ago |

Frontier AI companies are selling at a loss.

Excusing everything else that u/bastawhiz said[0]; the obvious fact here is that Claude, OpenAI, Gemini et al. are quite literally burning through 100's of billions of dollars and selling it back to you for pennies on the dollar in the hopes that they get to be the only one left.

If I spend $10 growing Oranges and sell them to you for $1; then of course it's more expensive for you to do the growing.

I feel like I'm taking crazy pills. These models will become more expensive over time, it's functionally impossible for them not to, they just want to capture the market before they have to stop selling at a huge loss.

[0]: https://news.ycombinator.com/item?id=48168433

sleepyeldrazi 15 hours ago |

If you want a good dense model, use qwen3.6 27B instead; speed will be up, and if you don't take my word for it being smarter, let openrouter's prices for it against the bigger, slower and less memory-efficient gemma do the talking.

If you want a faster model, go for qwen3.6 35B (or gemma 4 26B if gemma models perform better for your tasks). There is a reason why people (myself included) haven't shut up about those two (especially the 27B). It's small enough to run at a decent speed (especially with the built in MTP that finally has official llama.cpp support) and for many workloads (every benchmark I have ever thrown at it) it is matching or surpassing models it has no right to.

A couple of days ago I woke up with my internet being down, started 27B in pi, told it to diagnose what's wrong by giving it my router's password, went to grab a coffee and by the time I got back, I had a full report with suggestions on how to proceed. I love openrouter and I use it for many things, but it is not cheaper.

Subjectivity and opinions based on personal experience with all those models implied naturally, I assume the 31B gemma has cases in which it edges out, I've just failed finding any and I have been running all 4 models mentioned since hours after each of them dropped nonstop for different tasks. Hell, for my hermes, I've started getting better results once I switched from gemma 4 26B to qwen3.5 9B, not even the massively improved 3.6 series. It just feels outdated/ cherrypicked to not use what by many accounts is the current consumer hardware SOTA if doing such an analysis.

konaraddi 15 hours ago |

A lot of comments here are about the issues with the analysis in OP’s post but many of them are “a distinction without a difference” with respect to the broader conclusion. When we look purely at cost and performance (setting aside privacy), it’s better for individual devs to pay for hosted than to self-host. Employers are paying for tokens on the job and most devs are finding the $PREFERRED_PROVIDER’s $20/$100/$200/month subscription sufficient outside of work. Most devs don’t fall under the conditions in which running local models makes sense purely on the basis of cost vs performance.

More critically, in practice, setting up local models seems more like a hobby, an educational exercise, or an act of privacy control than it is for cost cutting or productivity.

Danox 13 hours ago | |

The model makers’ mainframe dream of computing isn’t coming back no matter what OpenAI, Google, Anthropic or Microsoft want. There are too many smart tech barbarians at the gate who want in, and they’re not going to be satisfied going back to the computer terminal era.

Personal computers eliminated an earlier terminal era, and most if not all of those companies are gone except for IBM and a few stragglers and they are a shell of their former selves.

antirez 15 hours ago |

Mmmm, nope if you do the smart thing. MacBook M5 max 128gb is a premium laptop at 6k, but with it you can do many things and it is a good daily driver. Then, it can also run DeepSeek V4 flash and perform non trivial tasks locally, without censorship or limitations, even without an internet connection and on very privacy sensitive data. That's a good deal. If you spend 25k on a dual Mac Studio 512gb to abandon OpenAI and company you are going to be disappointed by both performance and cost.

datadrivenangel 14 hours ago | |

The smart thing is to get a ~48gb MacBook and use it as your daily driver, and then budget ~$800/year for AI subscriptions or tokens and you'll end up at the same price.

I say, as the author of the blog post, writing this on a MacBook M5 max 128gb..

antirez 12 hours ago | | |

I agree with you, practically. But there is another angle of the story: for instance models are starting to be useless to do security stuff, since they are every day more censored. Also prices skyrocketed in the latest months, what will happen later? A few months ago I was shocked people resisted to spend 20$/month to get basically free frontier models, and I warned we were headed to house monthly rent figures in the future as AI becomes more and more required to do work. So indeed what you say is absolutely true now (but $800/year is not accurate: you need 20x accounts to do real work in my experience, so $200 * 12 = 2400$/year). But if you have a 128GB MacBook, that no longer looks so costly compared to 2400$/year of frontier models, you can experience uncensored LLMs, a quick thing that always works to do low-value work like TLDR this blog post for me, or what's wrong in this function? Or could you explain me this API? And for this kind of work, DeepSeek v4 Flash looks basically frontier. So if you look at things in perspective, they could have a different shape.

kamranjon 15 hours ago | |

Yea my m4 max with 128gb has ended up making a lot of sense for me. I do video editing, I train ml models, I run large open AI models, I do 3d modeling, rendering and cad work. I never do all of this 100% of the time, I’ll setup a ml training to run over night and check results in the morning, during work I’ll set it up as a server and run local models, on my own time I’ll edit video and work on 3d modeling. It’s an incredibly versatile machine - and all of this is done while keeping your data on your device and giving you full control over your workflows.

throwa356262 12 hours ago | |

Don't tell the HN crowd, but you can run some of these models on a $200 rpi5 or a $500 AMD mini PCs.

Another open secret is that certain companies give you tens of thousands of tokens for free, with pretty respectable models such as Gemini 3.1 and GLM 4.6.

maho 16 hours ago |

The author only compared output token costs -- but for typical agentic workloads, input tokens dominate the costs by a large margin. Running inference locally, input tokens are, to first order, free. (They only generate implicit costs through higher time-to-first-token, higher power use, and lower token output speed).

Wilya 15 hours ago | |

Yeah, that completely invalidates his point.

I looked at a couple random agentic sessions in my openrouter activity, and the input cost is 10x the output cost.

Prompt caching on openrouter is complicated and unreliable. On local hardware with llama-cpp, it's mostly free.

amluto 12 hours ago | |

Even ignoring superior caching on a local setup, Mac hardware can often process input tokens around 10x as quickly as it produces output tokens. Openrouter seems to have only a 2x difference on the same models.

bigyabai 8 hours ago | | |

For larger contexts (eg. 20,000+ token agent workflows), being 10x faster still isn't enough. You have to be close to ~100x faster at crunching contexts for it to feel like realtime.

Jayakumark 16 hours ago |

OP is comparing against Gemma everywhere but concludes paying Anthropic makes more sense. Anthropic is $15 per million output tokens, which is 30-35x more expensive, even on openrouter.

This is like comparing an e-bike at home with an e-bike rental and concluding that we therefore need to rent a Toyota since it can go at similar speeds. Getting tired of bad posts getting so much attention.

SwellJoe 3 hours ago |

Everything is currently heavily subsidized. If the AI companies don't improve efficiency, they'll eventually have to start charging what it actually costs to offer the service, which is a multiple of what they currently charge.

I expect self-hosted to be quite competitive pretty soon. Github Copilot is already wildly more expensive than it was last month. People are going from spending a few bucks to a few thousand for that same usage. So, if it doesn't get a lot more efficient (like 3x the tokens, or more, from the same infrastructure), the prices will have to go up quite a lot to keep the lights on. Everything in AI is running partly on investors' money. Everyone is trying to buy a monopoly, an insurmountable lead, and some way to lock people into a specific model and ecosystem, but so far that hasn't happened (except for people who voluntarily lock themselves into a specific ecosystem, but even in those cases, it's usually easy to get the AI to help move to another; there are no truly unique features in AI that at least one, and probably three or four, other players don't also offer).

Sinidir 14 hours ago |

Article is seriously wrong, because it makes a huge mistake in the last part. You can't simply look at the produced tokens and that is your cost. In agentic coding there are lots of turns meaning you not only pay for the output tokens you also pay for all the input tokens sent each time (even if a lot cheaper, like 10x when cached). So this calculation does not accurately represent the api cost at all.
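
To make that concrete, here is a minimal sketch of how input cost accumulates over a multi-turn agentic session. Every number in it (prices, token counts, turn count) is a made-up placeholder, and it assumes an idealized cache where everything sent on the previous turn is billed at the cached rate; the only thing taken from the comment is the roughly 10x discount for cached input.

```python
def session_cost(turns, system_tokens, output_per_turn, tool_per_turn,
                 in_price, cached_in_price, out_price):
    """Return (input_cost, output_cost) in USD; prices are per million tokens."""
    context = system_tokens  # tokens sent to the model on the next turn
    cached = 0               # tokens already sent on the previous turn
    input_cost = 0.0
    output_cost = 0.0
    for _ in range(turns):
        fresh = context - cached
        # The whole context is resent every turn: old part cached, new part full price.
        input_cost += (cached * cached_in_price + fresh * in_price) / 1e6
        output_cost += output_per_turn * out_price / 1e6
        cached = context
        # The model's output and any tool results join the context for the next turn.
        context += output_per_turn + tool_per_turn
    return input_cost, output_cost

# Hypothetical session: 50 turns, 10k-token system prompt, placeholder prices.
input_cost, output_cost = session_cost(
    turns=50, system_tokens=10_000, output_per_turn=500, tool_per_turn=1_500,
    in_price=1.00, cached_in_price=0.10, out_price=5.00,
)
print(f"input: ${input_cost:.4f}  output: ${output_cost:.4f}")
```

Even with output priced at 5x input and a 10x cache discount, the input side comes out larger in this toy session, because the context is paid for again on every turn.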

Second thing is you can starkly upgrade the token generation locally if you use agent teams. Single conversations are memory bandwidth bound and don't fully make use of your compute. If you can batch tokens from multiple agents you can easily 5x token generation.

regexorcist 16 hours ago |

I simply can't go back to cloud AI. Privacy and full control are more important to me than speed and SOTA models.

xyzzy123 16 hours ago | |

Also predictability, resilience, sovereignty. I'm not worried about other people's outages, that unexpected demand will impact me at an inconvenient time, that someone's watering down my model, that my costs will change unpredictably or that some unforeseen error will lead to a huge bill.

It's in the same category as rooftop solar for me. It doesn't have to make strict economic sense if you're the particular type of person who gets peace of mind from control of infrastructure / reduced dependency.

ycui7 4 hours ago |

This is not surprising at all. The biggest benefit of a cloud model in terms of energy efficiency is that when running more than one request, the power consumption of the GPU stays roughly the same. The more concurrent requests the server can handle, the less power each request consumes. The server GPU is already likely more energy efficient than a local GPU, and concurrency makes the cost structure unbeatable by local hardware. It is generally assumed that local hardware only runs one request at a time, but if the local engine is meant to serve a small business with meaningful concurrency, the economics might still work out.

nu11ptr 16 hours ago |

"Accelerated depreciation (if any) from shortening the lifespan of the device will be more expensive than the electricity"

Shortening the lifespan?

Der_Einzige 16 hours ago | |

There is a lot of FUD here, and the notion that hardware depreciates in this manner is widely held. I blame Michael Burry of The Big Short, who is perpetuating these lies to the investor community today.

There's a bunch of retro hardware which should make people pause and realize they're stupid to assume hardware slows down on average even 5% 20 years later (it's probably closer to 2% and I'm being generous).

HVAC/power delivery and generation are the major factors, and if you didn't skimp/get defective parts for this and replace failed moving parts (usually fans), your hardware is basically the same 20 years down the line as it was today.

Also using LLMs locally doesn't even induce sustained 100% GPU usage over significant periods of time for most real (agentic coding in OpenCode) use-cases.

datadrivenangel 14 hours ago | | |

There are tons of things that can start failing on hardware. I don't realistically expect some LLM usage to materially reduce the lifespan of the laptop, but running it 24/7 for AI usage makes me think that I'm more likely to get 3 years out of the device instead of 10.

zmmmmm 2 hours ago |

I have free electricity from solar and an old Macbook Pro M1 Max that has depreciated to zero and has no other use. Now how do the economics work out?

synthos 16 hours ago |

How much does your data privacy cost?

datadrivenangel 16 hours ago | |

As stated in the analysis, thousands of dollars. That said, the smart thing to do is target smaller models (few billion parameters) and then use larger models for non-privacy tasks.

Guillaume86 11 hours ago |

I suppose folks here already know this but it deserves a mention: subscription pricing is 10-20x cheaper than API pricing at Anthropic for example and it will be a far better experience (better models, faster responses, as much parallelism as you want, etc) so if it works for you there's no economic argument to buy a machine for inference at the moment.

schaefer 4 hours ago |

For me, the value in local inference is getting your hands dirty and goofing around. That is to say, learning.

So we shouldn’t be comparing it to the cost of open router api access at all, we should be comparing it to the cost of a 4 credit university course.

netika 13 hours ago |

In my testing, qwen-3.6-27b in full precision is well below sonnet, but above claude haiku in coding tasks. Gemma is not even close to qwen, it’s much, much worse.

robertkarl 3 hours ago | |

How do you test? I made this comment elsewhere... but I don't see a good benchmark that covers "how good is this thing at actually driving coding with tool use locally"?

Havoc 16 hours ago |

I like that the numbers were crunched, but the answer to these is always a bit of a foregone conclusion.

* Industrial power pricing

* Wholesale hardware pricing

* Utilization density of shared API

means API always wins a cost shootout.

Privacy & tinkering is cool too though

michaelbuckbee 16 hours ago |

Slightly different slice into this a very similar situation (local vs OpenRouter AI inference).

But in _every_ metric other than privacy it was better to run via OpenRouter than a local model, and not by a small amount.

Direct link to the comparison charts:

https://sendcheckit.com/blog/ai-powered-subject-line-alterna...

bilekas 16 hours ago |

I don't hear people debating which is cheaper, local or cloud-run models. The conversation, at least what I hear, is that a lot of the time users are not utilizing an awful lot of tokens, and those providers will be paid even if you never use them. 80% - 90% of the work I and my team are doing with AI is grunt work: write tests for this, implement an FFT here, write the DB query for X. Nothing exhausting. Those who are using AI for whole cloth "vibe coded" applications and services are definitely better suited to cloud. If a work laptop can run my local models and get the performance my work needs for development, why wouldn't I as a company prefer that?

Add to that the privacy improvements, data protection, and potentially further specific inference if needed, and it's a no-brainer.

Again, AI is a tool, and you want the right tool for the job. I would wager, with no evidence looked up, that the majority of devs would be happy with 10-30 tokens per second locally.

trvz 16 hours ago |

Local LLMs aren’t about cost, but control.

macwhisperer 9 hours ago |

I run the latest 20b-30b models on a MacBook Air... running inference with an MoE (25 tps) for like 2 hours is like 10% battery.. (look me up on huggingface to download my models)

also you gotta realize frontier models have massive "system prompts" that clog up the context window with garbage.

being able to write your own system prompts gives you a MASSIVE edge..

klipklop 11 hours ago |

Even if a Mac mini at home were slightly cheaper per token, I would still use OpenRouter because I want to outsource the heat generation and noise to a datacenter.

cientifico 13 hours ago |

Right now, local inference only makes sense for privacy reasons.

This is common when processing PII. Lawyers, doctors, or similar should not be using cloud solutions.

Also it's harder to set up and always more expensive than any cloud solution.

jwr 13 hours ago |

> "run a model like Gemma 4 31b, which is almost anthropic sonnet levels of performance"

I wish people stopped deluding themselves — I regularly try (and benchmark for my purposes) local models and they are NOWHERE near the huge models like Sonnet or Opus. Nowhere. Yes, you can sometimes get plausibly-looking output for simple tasks, but for anything even remotely requiring thinking there is simply no comparison.

Local models are useful. I use them for spam filtering, and soon intend to use them for image tagging and OCR. But let's stop saying they can get us "anthropic sonnet levels of performance", because that's just not true.

g-technology 7 hours ago | |

It all depends on use case. A local fine-tuned model on a very specific use case can definitely outperform a much bigger cloud model that doesn’t have the training on your use case. But that requires looking at the AI models as a means to an end and not a Swiss Army knife that can do it all.

zkmon 15 hours ago |

Consider deepseek as well. About 50 cents per 1M tokens, for >1T model

____tom____ 10 hours ago |

One important difference is that costs are bounded on your own machine. Like with cloud providers, I'm always worried that cost may accidentally explode if I launch an agent swarm wrong.

Now, it looks like the providers I use have good limits. But I do worry about this.

kryptiskt 9 hours ago |

This doesn't compare like for like, since it's comparing the total cost for the local machine with the usage cost for the cloud service, despite the cloud service also needing a local machine to be useful.

ares623 9 hours ago | |

So typical AI booster slop

SXX 13 hours ago |

The author forgot that after 3 years, when the hardware is no longer decent for inference, you can still resell it for 25-50% of the price.

Obviously if RAM apocalypse passes by then high-end configurations preserve resale value worse than base models, but still it's hefty bonus of Apple hardware that might change math a lot.

freakynit 16 hours ago |

So I did the India-specific analysis for a tier-3 city. Here, electricity costs 1/3rd of the US version, and you also get solar subsidy up to a certain amount.

https://shorturl.at/q6gRE

tldr;

Hardware depreciation costs are the major factor.

But, if we assume ZERO hardware depreciation (not realistic), then local inference becomes super cheap.. roughly, 90%+ cheaper.

Third case: the break-even happens only if we can get, at the very very very least, 8.7 years of useful hardware life. A more realistic number, however, when working 8 hrs/day and not 24 hrs/day, is around 25 years.

So, for now, local inference is preferable if you deeply care about privacy. From cost perspective, it's still not there.

datadrivenangel 14 hours ago | |

I think your link is broken? Would love to see the analysis as well.

freakynit 13 hours ago | | |

not much of an analysis really.. just simple math... anyways, that markdown share site takes around 5-10 seconds to load the page.. so, just hang on a bit more time :)

perbu 15 hours ago |

For me, the appeal of local compute is first and foremost confidentiality and having the possibility to run my 200K documents through an LLM just to see what happens without having to consider the cost.

matrix12 9 hours ago |

And this all assumes OpenRouter costs and availability will persist.

brisket_bronson 16 hours ago |

> Let's round up to $0.20 per kWh.

Next paragraph

> At ~50-100 watts and $0.18/kWh that's $0.009 or $0.018 per hour. $0.02 per hour. $0.48 cents per day for the electricity to be running inference at 100%.
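
For anyone who wants to check the two quoted rates side by side, a quick sketch of the arithmetic. The wattages and rates are the ones quoted above; the function name is just for illustration.

```python
def electricity_cost_per_hour(watts, usd_per_kwh):
    """Cost in USD of running a constant load for one hour."""
    return watts / 1000 * usd_per_kwh

for rate in (0.18, 0.20):
    low = electricity_cost_per_hour(50, rate)
    high = electricity_cost_per_hour(100, rate)
    print(f"${rate:.2f}/kWh: ${low:.3f}-${high:.3f} per hour, "
          f"${high * 24:.2f} per day at 100 W")
```

At $0.20/kWh and 100 W this works out to $0.02 per hour and $0.48 per day, which matches the quoted daily figure; at $0.18/kWh the same load comes to about $0.43 per day.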

lol

jmyeet 15 hours ago |

I've dug into this previously for one simple reason: NVidia segments the market by capping VRAM and Apple silicon uses a shared memory model that could challenge that but it currently doesn't. And I really wonder if Apple realizes the potential of what they have or if they even care.

So, for comparison, a 5090 has 32GB of VRAM and you can get one for ~$3000 maybe. To go beyond that memory with current generation (ie Blackwell) GPUs, you have to go to the RTX 6000 Pro w/ 96GB of VRAM and that's almost $10,000 for the GPU by itself. Beyond that you're in the H100/H200 GPUs and you're talking much bigger money.

Part of the problem here is the author is looking at laptops. That's the only place you'll find the M5 Max currently. The real problem here is that the Mac Studios haven't been updated in almost 2 years. There were configs of those with 256/512GB of RAM but they've been discontinued, possibly because of the RAM shortage and possibly because they're reaching EOL. Apple hasn't said why. They never do.

Many expect M5 Ultra Mac Studios in Q3 and the M5 Ultra may well have >1TB/s of memory bandwidth (for comparison, the 5090 is 1.8TB/s). Memory bandwidth isn't the only issue. A 5090 will still have more compute power (most likely) but being able to run large models without going to a $10k+ GPU could be huge.

But yes, it's hard to compete with the scales and discounted electricity of a data center. Even H200 compute hours are kinda cheap if you consider the capital cost of what you're using.

I've looked into getting a 128GB M5 Max 16" MBP. That retails for $6k. You might be able to get it for $5400. But I don't think the value proposition is quite there yet. It's close though.

gizajob 15 hours ago | |

I think Apple really do care and know that Moore’s law is likely to position them as major winners in this race in 3-7 years time.

brookst 15 hours ago | | |

This. The M5’s massive speedup in prefill is a good sign.

Apple isn’t expecting wholesale adoption of on-device models this year or next. But all of their design and iteration suggests they see it coming.

SpyCoder77 16 hours ago |

Open router doesn't cost money per say, it depends on the providers pricing

moritzwarhier 16 hours ago | |

> OpenRouter has Gemma4 31b at ~38-50 cents per million tokens. This means that on the optimistic side (50 watts, 40 tokens per second, and 10 years) the pro max is as cheap as openrouter. On the pessimistic side (100 watts and 3 years at 10 tokens per second) the pro max is 10x the cost. I think ~3x the cost per million tokens is likely the right number for local inference on the pro max from an accounting perspective.

Apart from that, as detailed in the article, pricing for local compute also depends on electricity prices.
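As a rough sketch of how the two ends of that quoted range come about: the wattage, tokens per second and lifetimes are from the quote, while the $4299 laptop price and the $0.18/kWh rate are figures quoted elsewhere in this thread. Running flat out 24/7 for a 365-day year with no resale value is an assumption on my part, so treat the output as illustrative rather than as the article's exact method.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes 24/7 use and a 365-day year


def usd_per_mtok(price_usd: float, watts: float, tokens_per_second: float,
                 years: float, usd_per_kwh: float = 0.18) -> float:
    """Hardware amortization plus electricity, per million tokens."""
    lifetime_mtok = tokens_per_second * SECONDS_PER_YEAR * years / 1e6
    hardware = price_usd / lifetime_mtok

    mtok_per_hour = tokens_per_second * 3600 / 1e6
    electricity = (watts / 1000 * usd_per_kwh) / mtok_per_hour
    return hardware + electricity


optimistic = usd_per_mtok(4299, watts=50, tokens_per_second=40, years=10)
pessimistic = usd_per_mtok(4299, watts=100, tokens_per_second=10, years=3)

print(f"optimistic:  ${optimistic:.2f}/Mtok")
print(f"pessimistic: ${pessimistic:.2f}/Mtok")
```

Under these assumptions the optimistic case lands around $0.40 per million tokens and the pessimistic case around $5, which is in line with the "as cheap as OpenRouter" and "10x the cost" ends of the quote.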

By the way, I don't want to snark about it, my English is not very good, but it's "per se", not "per say". Just commenting on this petty thing because it seems to be a common misspelling, and it always trips me up a bit. Makes me wonder about another supposed meaning like "from hearsay".

mnahkies 16 hours ago | |

They do take a cut of 5.5%, (as they should)

SecretDreams 16 hours ago |

Will this cost structure always be this way and are there other benefits to not running your LLM on the cloud?

E.g.

Privacy

Uptime

Future cost structure controls

This is a field that has moved very quickly. And it has moved in a direction to try to trap users into certain habits. But these habits might not best align with what best benefits end users today or some time in the future.

Archit3ch 14 hours ago |

Except I already have a local Mac to run Xcode. OpenRouter cannot help with that, at any price.

> 64 gigs should run a model like Gemma 4 31b

No, it can run anything in the 70B range. It's a notable quality upgrade from the 30B, which isn't obvious because the famous flurry of April releases didn't contain any 70Bs.

It can also run 120B in UD-Q3. Or 230B disk-streamed.
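A back-of-the-envelope way to sanity check those sizes is weights ≈ parameters × bits per weight / 8. The sketch below is only that estimate: the bits-per-weight values (about 4.5 for a typical 4-bit quant, about 3.5 for a 3-bit one) and the 8 GB of headroom for the OS and KV cache are my assumptions, not measured numbers for any particular GGUF file.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1e9 bytes) for a quantized model."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float = 64, headroom_gb: float = 8) -> bool:
    """True if the weights fit in RAM after reserving assumed headroom."""
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - headroom_gb


# (label, parameters in billions, assumed average bits per weight)
candidates = [
    ("70B at ~4.5 bpw", 70, 4.5),
    ("120B at ~3.5 bpw", 120, 3.5),
    ("230B at ~3.5 bpw", 230, 3.5),
]

for label, params, bpw in candidates:
    size = weights_gb(params, bpw)
    verdict = "fits" if fits(params, bpw) else "does not fit, needs disk streaming"
    print(f"{label}: ~{size:.1f} GB, {verdict}")
```

With these assumed numbers the 70B and 120B cases fit in 64 GB and the 230B case does not, which matches the disk-streaming caveat above.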

maxdo 15 hours ago |

I'm surprised people are so ignorantly talking up the advantages of buying a very expensive device, running it only sometimes, and expecting to beat cloud vendors.

If a small model is great, it will be hosted somewhere with cheap electricity and utilized 24/7.

Isn't this the 2+2 of economics?

CPUs are a commodity, and we are still buying CPUs and RAM from vendors for the same reason.

throw1234567891 15 hours ago | |

Put a cost on sending your intellectual property to a SaaS provider located who knows where. It's only half a problem when it is just your own IP, and hopefully it's not the IP of your clients. Maybe it's fine if one is building yet another HTML page nobody really cares about.

bitwize 12 hours ago | |

It's a good thing the market is adjusting to the reality that no one needs to own powerful computers, just terminals into the feed of compute available through the cloud, then! Nothing could possibly go wrong from that!

JSR_FDED 16 hours ago |

Wouldn’t a Mac Mini be a better comparison?

sgt 16 hours ago | |

Yes, or Mac Studio. Laptops with screens aren't made to run 24/7 heavy workloads.

650REDHAIR 15 hours ago | |

Also after a few years you can sell and upgrade.

A 2022 Mac Studio w/ M1 Ultra and 128GB was ~$5200 new and I see them selling for over $4k on eBay.

Can’t sell your used tokens…
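Put as arithmetic, using the two prices above and treating "over $4k" as exactly $4000 (so this is a conservative sketch, and it ignores eBay fees and shipping, which I have not accounted for):

```python
def net_hardware_cost(purchase_usd: float, resale_usd: float) -> float:
    """What the machine actually cost you once it is resold."""
    return purchase_usd - resale_usd


purchase = 5200  # approximate new price quoted above
resale = 4000    # lower bound of the quoted resale price

net = net_hardware_cost(purchase, resale)
retained = resale / purchase

print(f"net hardware cost: ${net}")
print(f"value retained: {retained:.0%}")
```

Amortizing the net cost instead of the sticker price shrinks the hardware share of any $/Mtok estimate considerably.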

onesociety2022 14 hours ago | | |

You can’t actually - due to the RAM shortage you can’t even upgrade to an M3 Ultra Mac Studio with 128GB RAM. That model has been discontinued. Even the 96GB model has a wait time of 5 months in most locations. This is the reason why the resale value is so high.

clearstack 14 hours ago |

Apple services are ~27% of revenue and growing double-digits. The chip is a moat for that flywheel, not a standalone compute bet.

not_the_fda 5 hours ago |

Isn't this just saying cloud AI providers are heavily subsidizing the true cost of the service?

weird-eye-issue 5 hours ago | |

Not necessarily since they are definitely not running them on Apple Silicon and you don't know their costs

anonym29 15 hours ago |

The true advantage of locally self-hostable, open weight models isn't about monetary cost at all, it's about the CIA triad.

Running locally, you get the confidentiality of knowing your tokens are only ever being processed by your own hardware. You get the integrity of knowing your model isn't being secretly or silently quantized differently behind the scenes, or having its weights updated in ways you don't want. And you get the availability of never having to worry about an API outage, or even an internet outage, for local inference capacity.

And this isn't even starting to address the whole added world of features and tunability you get when you control the inference stack. Sampling parameters, caching mechanisms, interpretability etc.

OpenRouter may be cheaper than frontier labs, but you still lose all of these benefits from open weight models the moment you decide to rely on someone else's hardware for your processing.

panny 16 hours ago |

Your laptop AI costs too much? Speculative investors can help!

tamimio 10 hours ago |

> Throwing money at anthropic makes more sense in this context.

But you are dependent on them, which is the biggest factor IMO. There was a website posted here before about people getting banned from using it for silly reasons, not to mention price hikes and privacy concerns. Maybe for now it's more expensive or slower to run locally, but you are in full control of everything.

varispeed 10 hours ago |

What is the security of OpenRouter? I have a feeling the user has no idea where their data is going or how it will be used, or am I wrong?

When I see so many options, that looks like it would take months to audit whether it actually makes sense and is safe to use. But I guess some people are fine with YOLO-ing it.

empath75 12 hours ago |

It should not at all be surprising that running models at home is more expensive than commodity providers. That's just generally true of running your own stuff. Even if the cost in money isn't higher, the cost in time is often _significantly_ higher.

This is why the idea that the AI labs are in trouble because inference will be a commodity is _completely backwards_. Some of the largest and most powerful companies in the world sell commodities. They compete on scale and efficiency, and you are never going to be able to compete with the big labs on either.

deadbabe 14 hours ago |

What would really elevate an article like this is if we could somehow quantify human brain’s equivalent outputs and compare the costs with local LLM and cloud LLMs.

datadrivenangel 14 hours ago | |

The computer does most specific tasks better, faster, and cheaper than I do.

christkv 15 hours ago |

Bizarre. Running local models has nothing to do with cost. It's about privacy first and foremost.

newsclues 16 hours ago |

Local isn’t (just) about cost, it’s control and trust.

pshirshov 9 hours ago |

... but comes with privacy guarantees *

* but Apple will collect all your keystrokes anyway

mbgerring 14 hours ago |

Now include the externalized cost in the U.S. of deploying ~100% of productive capital to build data centers instead of, for example, first-world transportation infrastructure, and tell me which one is cheaper

tuwtuwtuwtuw 14 hours ago | |

Why would I want to include that when determining the cost per token?

mbgerring 14 hours ago | | |

It’s part of the cost per token

an0malous 16 hours ago |

OpenRouter and other LLM platforms are being subsidized by VC investment to charge less than it costs them to run inference; the MacBook Pro is not.

hankerapp 12 hours ago | |

Bingo. I, for one, am loving this phase of enjoying the LLMs at the expense of VC money. Just like how I enjoyed cheap rides and deliveries on Uber. And with the fragmentation in the field, I don't see a monopoly coming up.

Kwpolska 16 hours ago | |

When the AI bubble inevitably pops, the author will find a new way to skew results in favor of cloud LLMs. Like including the price of a desk and a chair in the local token cost.

datadrivenangel 16 hours ago | | |

I really wanted the laptop to look better cost-wise, but it doesn't.

Der_Einzige 16 hours ago |

OpenRouter doesn't expose all the LLM sampling parameters/research that llama.cpp, vLLM, SGLang, et al. expose (so no high temperature/highly diverse outputs). Also OpenRouter doesn't let you use steering vectors or LoRA or other personalization techniques per-request. Also no true guarantees of ZDR/privacy/data sovereignty.

Oh, and the author didn't mention at all anything related to inference optimization, so no idea if they even know about or enabled things like speculative decoding, optimized attention backends, quantization, etc.

At least AI slop would have hit on far more of the things I listed above. This is worse-than-AI.

znpy 12 hours ago |

I think that the main flaw in the reasoning is assuming that the cost of tokens will stay the same over the years.

Chances are that token prices will go down, but chances also are that the AI bubble pops and all of a sudden all these companies will either have to make a buck out of the inference or go bankrupt.

Getting your own hardware just grants you stable pricing.

8note 5 hours ago | |

if all those companies go bankrupt, you could also buy their hardware on the cheap.

it's almost guaranteed imo that as the model quality evens out, inference cost will drop towards 0 at insane speeds, given how well llama works as an asic.

mrtimeman 16 hours ago |

The full-amortization framing is doing a lot of work here. I bought my laptop because I needed a laptop, not as an inference box, and running a model on it is incidental to that. Once the hardware is sunk for other reasons, the only cost left is electricity plus whatever depreciation you accelerate by hammering the SoC, which the post actually acknowledges in one parenthetical before allocating the full $4299 to tokens anyway.

Also nobody I know picks local over OpenRouter on price. They pick it for offline, for data not leaving the machine, for no rate limits, for not having a provider go down mid-task. If $/Mtok is the only axis, sure, cloud wins.

In practice the pattern I see is leaving a small model running on easy background tasks while using the laptop normally, not a dedicated inference box hammered flat out for 5 years.
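For what the sunk-hardware view looks like in numbers, here is a sketch of the electricity-only marginal cost. The wattage and tokens-per-second ranges and the $0.18/kWh rate are the ones quoted from the article elsewhere in this thread; attributing the full power draw to inference and ignoring accelerated wear are simplifying assumptions of mine.

```python
def electricity_usd_per_mtok(watts: float, tokens_per_second: float,
                             usd_per_kwh: float = 0.18) -> float:
    """Marginal cost per million tokens when the hardware is already paid for."""
    usd_per_hour = watts / 1000 * usd_per_kwh
    mtok_per_hour = tokens_per_second * 3600 / 1e6
    return usd_per_hour / mtok_per_hour


best = electricity_usd_per_mtok(watts=50, tokens_per_second=40)
worst = electricity_usd_per_mtok(watts=100, tokens_per_second=10)

print(f"best case:  ${best:.4f}/Mtok")
print(f"worst case: ${worst:.2f}/Mtok")
```

At the efficient end the marginal cost is a small fraction of the quoted ~38-50 cents per million tokens on OpenRouter; at the slow, power-hungry end electricity alone reaches the top of that range, so the result depends heavily on which end of the article's ranges a given setup sits at.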