Always rooting for Hugging Face
It seems to me there is no chance local ML is going to be anywhere out of the toy status compared to closed source ones in the short term
a) to have an idea how many tokens I use and
b) be independent of VC financed token machines and
c) I can use it on a plane/train
Also I never have to wait in a queue, nor will I be told to wait for a few hours. And I get many answers in a second. I don't do full vibe coding with a dozen agents though. I read all the code it produces and guide it where necessary.
Last but not least, at some point the VC funded party will be over and when this happens one had better know how to be highly efficient in AI token use.
What's the advantage of Qwen Code CLI over OpenCode?
the space moved from Consumer to Enterprise pretty fast due to models getting bigger
Ollama and webui seem to rapidly lose their charm. Ollama now includes cloud APIs, which makes no sense for a local tool.
Both $0 revenue "companies", but have created software that is essential to the wider ecosystem and has mindshare value; Bun for JavaScript and GGML for AI models.
But of course the VCs needed an exit sooner or later. That was inevitable.
but maybe I'm just slightly out of the loop
I have tried many tools locally and was never really happy with any. I finally tried Qwen Code CLI, assuming that it would run well with a Qwen model, and it does. YMMV, I mostly do JavaScript and Python. The most important setting was the max context size; it then auto compacts before reaching it. I run with 65536 but may raise this a bit.
Last but not least, OpenCode is VC funded, so at some point they will have to make money, while Gemini CLI / Qwen CLI are not the primary products of their companies but are definitely dog-fooded.
Btw I also get 42-60 tps on M4 Max with the MLX 4 bit quants hosted by LM Studio, which software do you use to run it?
Yesterday I tested the latest llama.cpp and the result is that PP has made a huge jump to 420 tps which is 30% faster than MLX on my M1. TG is now 25 tps which is below MLX but does not degrade much, at 50k context it is still 22-23 tps.
Together with Qwen Code CLI, llama.cpp re-processes the full KV cache a lot less often. So for now I am switching back to llama.cpp.
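To put those throughput numbers in perspective, here is a rough sketch of why a full KV cache re-process is so noticeable. The 420 tps prompt processing and 22 tps generation figures are the ones quoted above; the function name and the 1,000-token reply are made up for illustration.

```python
def estimate_seconds(prompt_tokens: int, reply_tokens: int,
                     pp_tps: float, tg_tps: float) -> tuple[float, float]:
    """Return (prompt processing seconds, token generation seconds)."""
    return prompt_tokens / pp_tps, reply_tokens / tg_tps


# Re-processing a 50k-token context from scratch, then generating a reply.
pp_s, tg_s = estimate_seconds(50_000, 1_000, pp_tps=420, tg_tps=22)
print(f"prompt: {pp_s:.0f} s, reply: {tg_s:.0f} s")  # prompt: 119 s, reply: 45 s
```

So every time the cache is invalidated at that context size, it costs roughly two minutes before the first new token appears.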
It is worth spending some time with the settings. I am really annoyed by the silly jokes (was it Claude that started this?). You can disable them with customWittyPhrases. Also setting contextWindowSize will make the CLI auto compress, which works really well for me.
And depending on what you do, maybe set privacy.usageStatisticsEnabled to false.
Like Gemini, Qwen CLI supports OpenTelemetry. When I have time I'll have a look at why the KV cache gets invalidated.
I'm old enough to remember when traffic was expensive, so I've no idea how they've managed to offer free hosting for so many models. Hopefully it's backed by a sustainable business model, as the ecosystem would be meaningfully worse without them.
We still need good value hardware to run Kimi/GLM in-house, but at least we've got the weights and distribution sorted.
They provide excellent documentation and they’re often very quick to get high quality quants up in major formats. They’re a very trustworthy brand.
If you stream weights in from SSD storage and freely use swap to extend your KV cache it will be really slow (multiple seconds per token!) but run on basically anything. And that's still really good for stuff that can be computed overnight, perhaps even by batching many requests simultaneously. It gets progressively better as you add more compute, of course.
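A back-of-the-envelope sketch of where "multiple seconds per token" comes from, assuming a dense model whose weights are all read from SSD once per generated token and nothing is cached in RAM. The 70B size, 4-bit quantization and 5 GB/s drive are hypothetical numbers, not measurements.

```python
def seconds_per_token(params_billion: float, bits_per_weight: float,
                      ssd_gb_per_s: float) -> float:
    """Lower bound on time per token when every weight is streamed from disk."""
    weight_gb = params_billion * bits_per_weight / 8  # GB read per token
    return weight_gb / ssd_gb_per_s


# Hypothetical: 70B parameters at 4 bits per weight on a 5 GB/s NVMe drive.
print(seconds_per_token(70, 4, 5.0))  # 7.0
```

Mixture-of-experts models only touch a fraction of their weights per token, so they would land below this bound; swapping the KV cache as well would push it higher.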
This is fun for proving that it can be done, but that's 100X slower than hosted models and 1000X slower than GPT-Codex-Spark.
That's like going from real time conversation to e-mailing someone who only checks their inbox twice a day if you're lucky.
Harder to track downloads then. Only when clients hit the tracker would they be able to get download stats, and forget about private repositories or the "gated" ones that Meta/Facebook does for their "open" models.
Still, if vanity metrics weren't so important, it'd be a great option. I've even thought of creating my own torrent mirror of HF to provide as a public service, as eventually access to models will be restricted, and it would be nice to be prepared for that moment a bit better.
Here's that README from March 10th 2023 https://github.com/ggml-org/llama.cpp/blob/775328064e69db1eb...
> The main goal is to run the model using 4-bit quantization on a MacBook. [...] This was hacked in an evening - I have no idea if it works correctly.
Hugging Face have been a great open source steward of Transformers, I'm optimistic the same will be true for GGML.
I wrote a bit about this here: https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-f...
I generally try to include something in a comment that's not information already under discussion - in this case that was the link and quote from the original README.
And for those who think it's just organic with all of the upvotes, HN absolutely does have a +/- comment bias for users, and it does automatically feature certain people and suppress others.
How solid is its business model? Is it long-term viable? Will they ever "sell out"?
Since I don't see it mentioned here, LlamaBarn is an awesome little—but mighty—MacOS menubar program, making access to llama.cpp's great web UI and downloading of tastefully curated models easy as pie. It automatically determines the available model- and context-sizes based on available RAM.
https://github.com/ggml-org/LlamaBarn
Downloaded models live in:
~/.llamabarn
Apart from running on localhost, the server address and port can be set via CLI:
    # bind to all interfaces (0.0.0.0)
    defaults write app.llamabarn.LlamaBarn exposeToNetwork -bool YES
    # or bind to a specific IP (e.g., for Tailscale)
    defaults write app.llamabarn.LlamaBarn exposeToNetwork -string "100.x.x.x"
    # disable (default)
    defaults delete app.llamabarn.LlamaBarn exposeToNetwork
As for models, plenty of GGUF quantized (down to 2-bit) available on HF and modelscope.
I want this to be true, but business interests win out in the end. Llama.cpp is now the de-facto standard for local inference; more and more projects depend on it. If a company controls it, that means that company controls the local LLM ecosystem. And yeah, Hugging Face seems nice now... so did Google originally. If we all don't want to be locked in, we either need a llama.cpp competitor (with a universal abstraction), or it should be controlled by an independent nonprofit.
I am somewhat anxious about "integration with the Hugging Face transformers library" and possible python ecosystem entanglements that might cause. I know llama.cpp and ggml already have plenty of python tooling but it's not strictly required unless you're quantizing models yourself or other such things.
Is my only option to invest in a system with more computing power? These local models look great, especially something like https://huggingface.co/AlicanKiraz0/Cybersecurity-BaronLLM_O... for assisting in penetration testing.
I've experimented with a variety of configurations on my local system, but in the end it turns into a makeshift heater.
Then I fell down the rabbit holes of uv, rust and C++ and forgot about LLMs. Today after I saw this announcement and answered someone’s question about how to set it up, when I got home, I decided to play with llama.cpp again.
I was surprised and impressed:
https://ontouchstart.github.io/rabbit-holes/llama.cpp/
I am not going to use mlx-lm or lmstudio anymore. llama.cpp is so much fun.
I did use candle for wasm based inference for teaching purposes - that was reasonably painless and pretty nice.
How can I realistically get involved in the AI development space? I feel left out of what's going on, living in a bubble where AI is forced on me by my employer (GitHub Copilot). What is a realistic road map to slowly get into AI development, whatever that means?
My background is full stack development in Java and React, albeit development is slow.
I've only messed with AI on the application side: I created a local chatbot for demo purposes to understand what RAG is about, and I've run models locally. But all of this is very superficial and I feel I'm not in deep with what AI is about. I get that I'm too 'late' to be on the side of building the next frontier model and that it makes no sense to try, so what else can I do?
I know Python. Is the next step maybe to do 'LLM from scratch'? Or do I pick up the Google machine learning crash course certificate? Or do the recently released Nvidia certification?
I'm open to suggestions
But if you're adjacent to some leaf use-case for AI, you're likely already as good as anyone else at productizing it.
And that's who is getting hired: people who show they can deliver product-market fit.
Hopefully this does not mean consolidation due to resources drying up, but a true fusion of the best.
In either case - huge thanks to them for keeping AI open!
I think, for some definition of “banned”, that’s the case. It doesn’t stop the Chinese labs from having organization accounts on HF and distributing models there. ModelScope is apparently the HF-equivalent for reaching Chinese users.
That's interesting. I thought they would be somewhat redundant. They do similar things after all, except training.
However these things are dynamic and change over time. As I read the discussion just now, the GP comment was the ~5th top-level comment.
https://giftarticle.ft.com/giftarticle/actions/redeem/9b4eca...
GitHub is great -- huge fan. To some degree they "sold out" to Microsoft and things could have gone more south, but thankfully Microsoft has ruled them with a very kind hand, and overall I'm extremely happy with the way they've handled it.
I guess I always retain a bit of skepticism with such things, and the long-term viability and goodness of such things never feels totally sure.
Oh no, never. Don't worry, the usual investors are very well known for fighting for user autonomy (AMD, Nvidia, Intel, IBM, Qualcomm)
They are all very pro consumers and all backers are certainly here for your enjoyment only
For your Mac, you can use Ollama, or MLX (Mac ARM specific, requires different engine and different model disk format, but is faster). Ramalama may help fix bugs or ease the process w/MLX. Use either Docker Desktop or Colima for the VM + Docker.
For today's coding & reasoning models, you need a minimum of 32GB VRAM combined (graphics + system), the more in GPU the better. Copying memory between CPU and GPU is too slow so the model needs to "live" in GPU space. If it can't fit all in GPU space, your CPU has to work hard, and you get a space heater. That Mac M1 will do 5-10 tokens/s with 8GB (and CPU on full blast), or 50 tokens/s with 32GB RAM (CPU idling). And now you know why there's a RAM shortage.
Is hopelessly dated. There are much better newer models around.
I picked up a second-hand 64GB M1 Max MacBook Pro a while back for not too much money for such experimentation. It’s sufficiently fast at running any LLM models that it can fit in memory, but the gap between those models and Claude is considerable. However, this might be a path for you? It can also run all manner of diffusion models, but there the performance suffers (vs. an older discrete GPU) and you’re waiting sometimes many minutes for an edit or an image.
https://www.reddit.com/r/LocalLLM/
Every time I ask the same thing here, people point me there.
https://www.docker.com/blog/run-llms-locally/
As far as how to find good models to run locally, I found this site recently, and I liked the data it provides:
Sounds like you're very serious about supporting local AI. I have a query for you (and anyone else who feels like donating) about whether you'd be willing to donate some memory/bandwidth resources p2p to hosting an offline model:
We have a local model we would like to distribute but don't have a good CDN.
As a user/supporter question, would you be willing to donate some spare memory/bandwidth in a simple dedicated browser tab you keep open on your desktop that plays silent audio (so it isn't put in the background and unloaded), allocates 100 MB to 1 GB of RAM, and acts as a WebRTC peer serving checksummed models?[1] (Then our server only has to check from time to time that you still have it, by sending you some salt and a part of the file to hash; your tab proves it still has it by doing so.) This doesn't require any trust, and the receiving user will also hash it and report if there's a mismatch.
Our server federates the p2p connections, so when someone downloads they do so from a trusted peer (one who has contributed and passed the audits) like you. We considered building a binary for people to run, but we figured people couldn't trust our binaries, or someone would target our build process somehow. We are paranoid about trust, whereas a web model is inherently untrusted and safer. Why do all this?
The purpose of this would be to host an offline model: we successfully ported a 1 GB model from C++ and Python to WASM and WebGPU (you can see Claude doing so here, we livestreamed some of it[2]), but the model weights at 1 GB are too much for us to host.
Please let us know whether this is something you would contribute a background tab to hosting on your desktop. It wouldn't impact you much and you could set how much memory to dedicate to it, but you would have the good feeling of knowing that you're helping people run a trusted offline model if they want - from their very own browser, no download required. The model we ported is fast enough for anyone to run on their own machines. Let me know if this is something you'd be willing to keep a tab open for.
[1] filesharing over webrtc works like this: https://taonexus.com/p2pfilesharing/ you can try it in 2 browser tabs.
[2] https://www.youtube.com/watch?v=tbAkySCXyp0 and some other videos
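The salted audit described above could look roughly like this. It is only a sketch of one reading of the idea: it assumes the server names a random byte range and the peer returns SHA-256 over salt plus that slice. All function names are hypothetical and this is not the project's actual protocol.

```python
import hashlib
import os
import secrets


def make_challenge(file_size: int, chunk_size: int = 1024) -> tuple[bytes, int, int]:
    """Server side: pick a fresh salt and a random byte range to audit."""
    salt = os.urandom(16)
    length = min(chunk_size, file_size)
    offset = secrets.randbelow(file_size - length + 1)
    return salt, offset, length


def answer_challenge(data: bytes, salt: bytes, offset: int, length: int) -> str:
    """Peer side: hash the salt together with the requested slice of the file."""
    return hashlib.sha256(salt + data[offset:offset + length]).hexdigest()


def verify(reference: bytes, salt: bytes, offset: int, length: int, response: str) -> bool:
    """Server side: recompute the answer from its own copy and compare."""
    expected = answer_challenge(reference, salt, offset, length)
    return secrets.compare_digest(expected, response)


model = os.urandom(4096)  # stand-in for the model weights
salt, offset, length = make_challenge(len(model))
print(verify(model, salt, offset, length, answer_challenge(model, salt, offset, length)))
```

Because the salt changes on every audit, a peer cannot precompute answers and then discard the file; it has to keep the bytes around to answer correctly.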
What services would you need that Hugging Face doesn't provide?
That is not true. I am serving models off Cloudflare R2. It is 1 petabyte per month in egress use and I basically pay peanuts (~$200 everything included).
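For scale, a quick sketch of what those figures work out to, using only the numbers quoted above (1 petabyte per month, about $200) and assuming decimal units and a 30-day month. The function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an approximation


def average_gbit_per_s(bytes_per_month: float) -> float:
    """Average sustained throughput implied by a monthly egress volume."""
    return bytes_per_month * 8 / SECONDS_PER_MONTH / 1e9


def dollars_per_tb(monthly_bill: float, bytes_per_month: float) -> float:
    """Effective cost per terabyte served."""
    return monthly_bill / (bytes_per_month / 1e12)


egress = 1e15  # 1 PB in bytes
print(f"{average_gbit_per_s(egress):.2f} Gbit/s sustained on average")  # 3.09 Gbit/s
print(f"${dollars_per_tb(200, egress):.2f} per TB")  # $0.20 per TB
```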
Most people will not choose Metal if they're picking between the two moats. CUDA is far-and-away the better hardware architecture, not to mention better-supported by the community.
To summarize, they rejected Nvidia's offer because they didn't want one outsized investor who could sway decisions. And "the company was also able to turn down Nvidia due to its stable finances. Hugging Face operates a 'freemium' business model. Three per cent of customers, usually large corporations, pay for additional features such as more storage space and the ability to set up private repositories."
Finally, we would like the possibility of setting up market dynamics in the future: if you aren't currently using all your RAM, why not rent it out? This matches the p2p edge architecture we envision.
In addition, our work on WebGPU would allow you to rent out your GPU to a background tab whenever you're not using it. Why have all that silicon sit idle when you could rent it out?
You could also donate it to help fine tune our own sovereign model.
All of this will let us bootstrap to the point where we could be trusted with a download.
We have a rather paranoid approach to security.
We are not going to do what you suggest. Instead, our approach is to use the RAM people aren't using at the moment for a fast edge cache close to their area.
We've tried this architecture and get very low latency and high bandwidth. People would not be contributing their resources to anything they don't know about.
Once you're swapping from disk, the performance will be quite unusable for most people. And for local inference, KV cache is the worst possible choice to put on disk.
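To get a feel for how much data is involved, here is a sketch of the usual KV cache size estimate for standard attention with grouped-query KV heads. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: keys and values for every layer and position."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape with a 65,536-token context and 16-bit cache entries.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=65_536)
print(f"{size / 2**30:.1f} GiB")  # 8.0 GiB
```

With ordinary full attention, the whole cache is read for every generated token, which is why paging it from disk hurts far more than paging rarely used data.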
The issue you'll actually run into is that most residential housing isn't wired for more than ~2kW per room.
Not saying this is the case, but it's what the comment implies, so "just upvote your faves" doesn't really address it.
I would like to see others being promoted to the top rather than Simon’s constant shilling for backlinks to his blog every time an AI topic is on the front page.
BitTorrent protocol is IMO better for downloading large files. When I want to download something which exceeds a couple of GB, and I see two links, direct download and BitTorrent, I always click on the torrent.
On paper, HTTP supports range requests to resume partial downloads. IME, it seems modern web browsers neglected to implement it properly. They won’t resume after browser is reopened, or the computer is restarted. Command-line HTTP clients like wget are more reliable, however many web servers these days require some session cookies or one-time query string tokens, and it’s hard to pass that stuff from browser to command-line.
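For reference, resuming with a range request is not much code on the client side. This is a minimal standard-library sketch with illustrative function names. It assumes the server honours `Range` and needs no cookies or one-time tokens, which is exactly the part that tends to break in practice.

```python
import os
import urllib.request


def build_resume_request(url: str, partial_path: str) -> urllib.request.Request:
    """Ask only for the bytes we do not have yet."""
    already = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if already:
        req.add_header("Range", f"bytes={already}-")
    return req


def resume_download(url: str, partial_path: str, chunk_size: int = 1 << 20) -> None:
    """Append to a partial file if the server supports ranges, else start over."""
    req = build_resume_request(url, partial_path)
    # Note: a server may answer 416 if the file is already complete,
    # which urlopen raises as an HTTPError.
    with urllib.request.urlopen(req) as resp:
        # 206 means the range was honoured; 200 means the full file is being sent.
        mode = "ab" if resp.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)
```

Calling `resume_download(url, "model.gguf.part")` again after an interruption picks up where the file on disk ends.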
I live in Montenegro, CDN connectivity is not great here. Only a few of them like Steam and GOG saturate my 300 megabit/sec download link. Others are much slower, e.g. Windows updates download at about 100 megabit/sec. BitTorrent protocol almost always delivers the 300 megabit/sec bandwidth.
Suppose HF did the opposite because the bandwidth saved is more and they're not as concerned you might download a different model from someone else.
Exactly.
There are configurable settings for each account, which might be automatically or manually set (I'm not sure which), that control the initial position of a comment in threads, and how long it stays there. There might be a reward system, where comments from high-karma accounts are prioritized over others, and accounts with "strikes", e.g. direct warnings from moderators, are penalized.
The difference in upvotes that account ultimately receives, and thus the impact on the discussion, is quite stark. The more visible a comment is, i.e. the more at the top it is, the more upvotes it can collect, which in turn makes it stay at the top, and so on.
It's safe to assume that certain accounts, such as those of YC staff, mods, or alumni, or tech celebrities like simonw, are given the highest priority.
I've noticed this on my own account. Before being warned for an IMO bullshit reason, my comments started to appear near the middle, and quickly float down to the bottom, whereas before they would usually be at the top for a few minutes. The quality of what I say hasn't changed, though the account's standing, and certainly the community itself, has.
I don't mind, nor particularly care about an arbitrary number. This is a proprietary platform run by a VC firm. It would be silly to expect that they've cracked the code of online discourse, or that their goal is to keep it balanced. The discussions here are better on average than elsewhere because of the community, although that also has been declining over the years.
I still find it jarring that most people would vote on a comment depending on if they agree with it or not, instead of engaging with it intellectually, which often pushes interesting comments to the bottom. This is an unsolved problem here, as much as it is on other platforms.
This isn't to say that social media is fair, or that people vote properly or that any ranking system based on agreement by readers is a good one. However, generally when you are getting negativity communicated to you and you are seeing consistently poor results around actions you take, it is going to be useful to examine the possibility that there is a difference in how you perceive what you are doing vs how others do. In that case spending time trying to figure out ways in which you are being wronged so that you can continue in the same manner is going to be time wasted.
We don't have the source for HN, nor do we have the obvious bias metadata that the moderators have put in place, but simply paying attention betrays that manipulation mechanisms exist and are heavily utilized.
For instance I clearly have a "bad guy" flag on my account, and frequently see my highly rated comments sorted below literally greyed out comments. Comments older than mine, so it isn't just the normal "well newer comments get a boost", it's just that there is a comment "DEI" in place where some people get a freebie boost and some people get a freebie detriment. It's why often mediocre content and comments by the core group is always floating high.
And let me make it very clear that I do not care. I don't harbour any delusions about some tight community or the like, and HN is not important in my life or my ego. I also know that it's basically a propaganda network for YC (I mean...it's right in the URL), and good for them. It's their site and they can do anything they want with it.
I only commented because some people really think this place is a meritocracy+democracy. That isn't how it works, even if they really want people to think that.
My point is that HN definitely has certain weights associated with accounts, which control the karma, visibility, and ultimately discussion of certain topics.
This problem doesn't affect only negativity or downvotes, but upvotes as well. The most upvoted comments are not necessarily of the highest quality, or contribute the most to the discussion. They just happen to be the most visible, and to generally align with the feeling of the hive mind.
I know this because some of my own comments have been at the top, without being anything special, while others I think are, barely get any attention. I certainly examine my thinking whenever it strongly aligns with the hive mind, as this community does not particularly align with my values.
I also tend to seek out comments near the bottom of threads, and have dead comments enabled, precisely to counteract this flawed system. I often find quality opinions there, so I suggest everyone do the same as well.
An essential feature of a healthy and interesting discussion forum is to accommodate different viewpoints. That starts by not burying those that disagree with the majority, or boosting those that agree. AFAIK no online system has gotten this right yet.
On another note: I'm a bit paranoid about quantization. I know people are not good at discerning model quality at these levels of "intelligence" anymore, I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?
I was recently trying Qwen 3 Coder Next and there are benchmark numbers in your article but they seem to be for the official checkpoint, not the quantized ones. But it is not even really clear (and chatbots confuse them for benchmarks of the quantized versions btw.)
I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.
Very hard. $$$
The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.
Attention is ALL You Need.
Hypothetically my ISP will sell me unmetered 10 Gb service but I wonder if they would actually make good on their word ...
It's a bit like any legalization question -- the black market exists anyway, so a regulatory framework could bring at least some of it into the sunlight.
But that'll only stop a small part. Anyone could share the infohash, and if you're using the DHT/magnet links without .torrent files or clicks on a website, no one can count those downloads unless they too scrape the DHT for peers who are reporting they've completed the download.
Which can be falsified. Head over to your favorite tracker and sort by completed downloads to see what I mean.
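To illustrate how little is needed to share a torrent without any website in the loop: a trackerless magnet link is just the infohash in a URI, and peers are then found through the DHT. A small sketch, where the function name and the example infohash are placeholders rather than a real torrent:

```python
from typing import Optional
from urllib.parse import quote

HEX_DIGITS = set("0123456789abcdef")


def magnet_link(infohash_hex: str, name: Optional[str] = None) -> str:
    """Build a trackerless magnet URI from a 40-character v1 (SHA-1) infohash."""
    infohash = infohash_hex.lower()
    if len(infohash) != 40 or not set(infohash) <= HEX_DIGITS:
        raise ValueError("expected a 40-character hex infohash")
    link = f"magnet:?xt=urn:btih:{infohash}"
    if name:
        link += "&dn=" + quote(name)  # display name is optional and cosmetic
    return link


# Placeholder infohash for illustration only.
print(magnet_link("0123456789abcdef0123456789abcdef01234567", "model.gguf"))
```

Nothing in that string passes through a server the publisher controls, which is why completed-download counts can only come from tracker or DHT scraping.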
I feel like you're making this statement in bad faith, rather than honestly believing the developers of the forum software here have built in a clause to pin simonw's comments to the top.
This does not happen. It hasn't even happened when pg made the forum in the first place.
I know things change rapidly so I'm not counting them out quite yet but I don't see them as a serious contender currently
It's still not fully post-trained and it's a non-reasoning model, but it's worth keeping an eye on if you don't want to use the Chinese models that currently are the best open-weight options.
there’s a small tinfoil hat part of me that suspects part of their obscene investments and cornering the hardware market is driven by a conscious attempt to stop open source local from taking off. they want it all, the money, the control, and to be the only source of information to us.
Assume AWS spot at, say, $20/hr for 8 B200 GPUs, so roughly $20 per quant. If the benchmark runs on BF16 plus 8, 6, 5, 4, 3 and 2-bit quants, that's about 7 tests, so roughly $140 to $420 per model. Time-wise, 7 hours to 1 day or so.
We could run them after a model release which might work as well.
This is also on 1 benchmark.
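The estimate above can be written out as a small calculation. The $20/hr rate and the seven precisions are from the comment; reading the range as 1 to 3 hours of node time per quant is an assumption made here so that the totals match the quoted $140 to $420.

```python
def benchmark_cost(rate_per_hour: float, n_quants: int, hours_per_quant: float) -> float:
    """Cost of running one benchmark once on every quant of one model."""
    return rate_per_hour * n_quants * hours_per_quant


quants = ["BF16", "8-bit", "6-bit", "5-bit", "4-bit", "3-bit", "2-bit"]
low = benchmark_cost(20, len(quants), hours_per_quant=1)   # 7 hours of node time
high = benchmark_cost(20, len(quants), hours_per_quant=3)  # 21 hours, about a day
print(f"${low:.0f} to ${high:.0f} per model, per benchmark")  # $140 to $420
```

Multiply by the number of models and benchmarks tracked to see how quickly a continuously updated leaderboard adds up.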
As general purpose chatbots small Mistral models are better than comparably sized Chinese models, as they have better SimpleQA scores and general knowledge of Western culture.
I am not sure if you actually tried that. Mistrals are widely accepted go-to models for roleplay and creative writing. No Qwens are good at prose, except for their latest big Qwen 3.5.
> I don’t think their corpus is lacking in western knowledge,
It absolutely does, especially pop culture knowledge.
That doesn't make any sense to me. Am I missing something?
That would be suboptimal, as Gemini has too old a knowledge cutoff. I am long past the need for such advice anyway, as I've been using local models since mid 2024.
It’s only a very low level model access where search isn’t used. Local models also need to be configured to use search, and I haven't had a use case to do that yet.
Gemini seems to call this “grounding with google search”. If you have Gemini installed in your enterprise, it will also search internal data sources for context.
If it decides to do so, and even then baked in knowledge would influence the result.
In any case I do not need Gemini or any other LLMs to figure out settings for my llama.cpp, thank you very much.
If you are able to figure out the right settings for a model that was released last week, then great for you! But it sounds like you just don’t trust LLMs to use current knowledge, and have some misconception about how they satisfy recent knowledge requests.