Ask HN: Which LLMs can run locally on most consumer computers?
Are there any? I was thinking about LLM-based agents and games, and this will probably only be viable when most devices can handle LLMs running locally.
I think we're at least 10-15 years from being able to run low-latency agents that "RAG" themselves into the games they are a part of, where there are hundreds of them, some of them NPCs, others controlling some game mechanic or checking whether the output from other agents is acceptable or needs to be run again.
At the moment a MacBook Air with 16 GB can run Phi-Medium 14gb, which is extremely impressive, but it's 7 tokens per second, way too slow for any kind of gaming. You would need to 100x the performance, and we need 5+ generations before I can see this happening.
Unless there's some other application?
I think it's two-fold. The primary one is that it's likely very difficult to maintain a designer's storyline vision and desired "atmosphere / feel", because LLMs currently "go off the rails" too easily. The second is that the teams with enough funding to properly fine-tune generative AI for dialog, level/environment creation, character generation, etc. are generally making AAA or AAA-adjacent games, which already need so much of a consumer GPU's VRAM that there's not a lot left over for large ML models to run in parallel.
I do think though that we should already be seeing indie games doing more with LLMs and 3D character/level/item generation than we are. Of course AI Dungeon has been trailblazing this for a long time, but I just expected to see more widely recognized success by now from many projects. I take this as a signal that it's hard to make a "good" game using AI generation. If anyone has any suggestions for open-world games with a significant amount of AI generation that allows player interaction to significantly affect the in-game universe, I'd be very interested in play-testing them. Can be any genre / style / budget. I just want to see more of what people are accomplishing in this space.
My hope is that there will be space for both the current style of game, where every aspect is created/designed by a human, and games of various types where the creators give the world an overall narrative/aesthetic/vision but the details are implemented by AI. That would allow true open-world play where you can finally just walk into any shop, with RAG/etc. providing complete continuity over months/years of play, so that characters remember the conversations/interactions/actions of you and anyone else playing in the same world.
I do think there's something of an "end-game" for this where a game is released that has no game at all in it, but rather generates games for each player based on what they want to play that day, and creates them as you play them. But I'd like to imagine that this won't replace other games (even if it does take a bit of the air out of the room), but rather exist alongside games with human-curated experiences.
I think we're currently stuck in a local minimum where AI isn't up to the task of making a coherent player-interactable world, but an incoherent or fragmented and non-interactable world isn't impressive enough (like No Man's Sky).
But this thing still has a long way to go.
Anyone working on top games through mods that wants to explore this, let me know, Next AI Labs would be interested in supporting such efforts.
It's all very exciting, if a little janky.
You're also forgetting that batch performance is already an order of magnitude better than single session inference.
I can't see LLMs in games being used for anything more than some random NPC voice quips. And whose voice would be used? Would voice actors be okay with this?
There are already too many bad games, we certainly don't need thousands more with AI-generated drivel dialogue, although having human writers is not a panacea either way.
> wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF...
> chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
> ./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile -ngl 999
You'll likely want to move beyond the first examples so you can choose models & methods. Either way, LlamaIndex (LI) has tons of great documentation and was originally built for this purpose. They also have a commercial parsing product with very generous free quotas (last I checked).
https://cookbook.openai.com/examples/parse_pdf_docs_for_rag
There are several other examples like this, but I got stuck in the jargon of LangChain, LlamaIndex, etc.
You can also upload files to ChatGPT and ask questions about them.
For example, another comment asked:
"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"
So what if you used a paid LLM to analyze these PDFs and create the data, but then moved that data to a weaker LLM in order to run question-answer sessions on it? The idea being that you don't need the better LLM at this point, as you've already extracted the data into a more efficient form.
In fact, if you split data preprocessing into small enough steps, they could also be run on weaker LLMs. It would take a lot more time, but that is doable.
It or something like it could likely be applied to any form of generation including what you are describing.
[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...
For example, ask the (better, costlier) Claude Opus to generate high-quality prompts, which get fed into (worse, cheaper) Claude Sonnet.
8GB vram cards can run 7B models
16GB vram cards can run 13B models
24GB vram cards can run up to 33B models
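A rough way to sanity-check these pairings is to estimate the size of the weights alone. This is only a sketch: the 4.5 bits per weight figure is my assumption for a typical 4-bit quantization (including its scaling overhead), and it ignores the context cache and runtime overhead, so real usage will be higher.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# (VRAM in GB, model size in billions of parameters) from the list above
for vram_gb, params_b in [(8, 7), (16, 13), (24, 33)]:
    fp16 = weight_gib(params_b, 16)    # unquantized half precision
    q4 = weight_gib(params_b, 4.5)     # assumed ~4-bit quantization
    print(f"{params_b}B on {vram_gb}GB: fp16 ~{fp16:.1f} GiB, 4-bit ~{q4:.1f} GiB")
```

The fp16 numbers do not fit in any of these cards, which suggests the pairings above only hold for quantized models.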
Now to your question, what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could be run reasonably well even on the CPU. IntelliJ has an absolutely microscopic < 1B model that it uses for code completion locally. It's quite good and I don't notice any delay.
I hope that helps, it's not 1:1, and it's a bit confusing
I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.
My LLM server with a 16 GB GPU mainly runs Llama 3 with an expanded context window, which also costs much more memory.
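To put a number on the context window cost, here is a back-of-the-envelope sketch of the key/value cache size. The layer and head counts are what I understand the published Llama 3 8B configuration to be (32 layers, 8 KV heads, head dimension 128); treat them as assumptions, and note that runtimes may quantize or lay out the cache differently.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total / 2**30


# Assumed Llama 3 8B shape; 2 bytes per value corresponds to fp16.
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8192, 32768):
    size = kv_cache_gib(context_len=ctx, **LLAMA3_8B)
    print(f"context {ctx}: ~{size:.1f} GiB of KV cache")
```

The cache grows linearly with context length, so quadrupling the window quadruples this part of the memory bill on top of the weights.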
I imagine that Copilot+ will become the target minimum spec for many local LLM products and that most local LLM vendors will use GPU instead of NPU if a good GPU is available.
Nvidia 4070 Ti has roughly the same performance: https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...
Of course, I'm massively oversimplifying, but it should be in the ballpark.
One can run local LLMs even on a Raspberry Pi, although it will be horribly slow.
The underlying CLI tools do this, the app makes it easier to see and manage.
Thankfully, between llama 3 8b [1] and mistral 7b [2] you have two really capable generic instruction models you can use out of the box that could run locally for many folks. And the base models are straightforward to finetune if you need different capabilities more specific to your game use cases.
CPU/sysmem offloading is an option with gguf-based models but will hinder your latency and throughput significantly.
The quantized versions of the above models do fit easily in many consumer grade gpus (4-5GB for the weights themselves quantized at 4bpw), but it really depends on how much of your vram overhead you want to dedicate to the model weights vs actually running your game.
[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Without a GPU I think it will likely be a poor experience, but it won't be long until you'll have to go out of your way to buy consumer hardware that doesn't integrate some kind of TPU.
Hope your game doesn’t have a big texture budget.
Even more than LLMs, I'm curious about how transformers can be used to produce more convincing game AI in the areas where it is notoriously bad, like 4X games.
The game itself is not going to have much VRAM to work with though on older GPUs. Unless you use something fairly tiny like phi3-mini.
There are a lot more options if you can establish that the user has a 3090 or 4090.
They will say things like "It's a GPU inside a CPU". No, that is the marketers telling you about integrated GPUs.
There is a huge divide between CPU and GPU people. GPU people are doing application. CPU people are... happy that they got anything to run.
Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free right?
What I think is, from my limited understanding about this field, if smaller models can run on consumer hardware reliably and speedily that would be a game changer.
Not on most consumer computers, which likely lack a dedicated GPU. My M2 struggles with a 7B model (it's the only thing that makes it warm), and the token speed is unbearable. I switched to remote APIs for the speed.
If you are targeting gamers with a GPU, the answer may change, but as others have pointed out, there are numerous issues here.
> This would virtually make inference free right?
Yes-ish, if you are only counting your dollars, however it will slow their computer down and have slow response time, which will impact adoption of your game.
If you want to go this route, I'd start with a 2B sized model, and not worry about shipping it nicely. Get some early users to see if this is the way forward.
I suspect that remote LLM calls with sophisticated caching (cross user / convo / pre-gen'd) are something worth exploring as well. IIRC, people suspected gpt-3.5-turbo was caching common queries and avoided the LLM when it could, for the speed.
You can also look into lower parameter models (3B for example) to determine if the balance between accuracy and performance fits your use case.
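A minimal sketch of the caching idea, assuming exact-match caching on a normalized prompt. `call_remote` is a hypothetical stand-in for whatever remote API you use, and `fake_remote` exists only so the example runs offline; a real game would probably want smarter matching and pre-generated entries.

```python
import hashlib
from typing import Callable, Dict


class CachedLLM:
    """Wraps a remote call and skips it when the same prompt was seen before."""

    def __init__(self, call_remote: Callable[[str], str]):
        self._call_remote = call_remote
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        reply = self._call_remote(prompt)
        self._cache[key] = reply
        return reply


def fake_remote(prompt: str) -> str:
    # Hypothetical placeholder for a real API call.
    return f"reply to: {prompt.strip()}"


llm = CachedLLM(fake_remote)
first = llm.ask("What do you sell?")
second = llm.ask("  what do you SELL? ")  # served from the cache
print(llm.hits, llm.misses)
```

Sharing one cache across users is what would make this pay off, since common NPC questions tend to repeat.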
>Is there a way to reliably package these models with existing games and make them run locally? This would virtually make inference free right?
I don't have any knowledge of game dev so I can't comment on this, but yes, packaging it locally would make the inference free.
Here is one example, testing performance of different GPUs and Macs with various flavours of Llama:
https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
And even in AI Dungeon the AI plays so fast and loose that it breaks immersion. Like if I’m doing a space trading roleplay, it doesn’t consider things like making sure the product I’m buying or selling meets a specific spec, and often a vendor will start offering to buy Product X from me while I’m negotiating purchasing Product X from them. This "type" of continuity problem happens constantly in AI Dungeon.
We’re just not there yet, but I have confidence we’ll get there. I think it’s possible even with our current model/training paradigms but we aren’t using RLHF for game applications yet.
I really think the next step is a heavily AI-integrated version of D&D where the DM can serve as a "filter" for some of the more unhinged output (where appropriate; an intentionally incoherent goblin with some text-to-speech could be phenomenal).
I think that's about where we're at, and I'm expecting a wave of "AI-enhanced" D&D apps any day now. They probably already exist and I just haven't seen them. I would imagine there are still occasional issues with the AI utterly choking; I see it every once in a while on some of my more "fantasy" prompts where I get too specific and it just ignores what I asked.
That was my experience when I was experimenting with using current LLMs to generate quests. You can of course ask for both a human-readable quest description and also a JSON object (according to some schema) describing the important quest elements, but the failure rate of the results was too high. Maybe 10% of quests would have some important mismatch between the description and the JSON; the description would mention an important object but it would be left out of the JSON, or the JSON would mention an important NPC but the description wouldn't, etc.
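For what it's worth, this kind of mismatch can be caught mechanically before a quest ships to the player, so a bad generation gets re-rolled instead of becoming an unsolvable quest. Below is a rough sketch, assuming a hypothetical schema with `items` and `npcs` lists and a hypothetical list of entity names the game already knows about. Substring matching is crude and would miss paraphrases.

```python
import json
from typing import List


def find_mismatches(description: str, quest_json: str,
                    known_entities: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the pair agrees."""
    quest = json.loads(quest_json)
    text = description.lower()
    in_json = [name for field in ("items", "npcs") for name in quest.get(field, [])]
    problems = []
    # Direction 1: everything in the JSON must be mentioned in the description.
    for name in in_json:
        if name.lower() not in text:
            problems.append(f"'{name}' is in the JSON but not in the description")
    # Direction 2: known entities mentioned in the description must be in the JSON.
    json_names = {name.lower() for name in in_json}
    for name in known_entities:
        if name.lower() in text and name.lower() not in json_names:
            problems.append(f"'{name}' is in the description but not in the JSON")
    return problems


description = "Old Tom asks you to bring the Silver Key to Mara."
quest_json = '{"items": ["Silver Key", "Vault Map"], "npcs": ["Mara"]}'
world = ["Silver Key", "Vault Map", "Mara", "Old Tom"]

issues = find_mismatches(description, quest_json, world)
for issue in issues:
    print(issue)
```

Any quest with a non-empty result could simply be regenerated, trading some extra inference for a lower failure rate.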
As a player, I think it would get frustrating quickly if 10% of quests were unsolvable, especially since, as a player, you don't know when a quest is unsolvable; maybe you just haven't found the item/NPC yet.
An interesting flip side I was just thinking about is the AI saying too much. NPCs keeping secrets until the player gets enough reputation or does a favor or whatever is pretty common. I wonder how good they are at keeping those secrets.
Prompt injection is one thing, and vaguely equivalent to cheat codes, which is fine, but what is the likelihood that a player just asking for more info ends with the AI spitting out the secret without the quest being completed? And if that happens, will the AI know to unlock the next area or whatever, since there's then no reason for the player to do that NPC's quest?
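One way to sidestep the leaking problem is to not give the model the secret until the game state says the player has earned it, since the model can't reveal what isn't in its context. A minimal sketch, with hypothetical names (`Npc`, `build_system_prompt`) and a simple reputation threshold standing in for whatever unlock condition the game uses:

```python
from dataclasses import dataclass


@dataclass
class Npc:
    name: str
    persona: str
    secret: str
    required_reputation: int


def build_system_prompt(npc: Npc, player_reputation: int) -> str:
    """Include the secret in the prompt only once the player has unlocked it."""
    lines = [f"You are {npc.name}. {npc.persona}"]
    if player_reputation >= npc.required_reputation:
        lines.append(f"You may reveal this if asked: {npc.secret}")
    else:
        lines.append("You know nothing beyond everyday gossip.")
    return "\n".join(lines)


smith = Npc(
    name="Mara",
    persona="A gruff blacksmith who distrusts strangers.",
    secret="The vault key is hidden under the anvil.",
    required_reputation=10,
)

print(build_system_prompt(smith, player_reputation=3))
print(build_system_prompt(smith, player_reputation=12))
```

This keeps the unlock decision in ordinary game code, so the next area opens based on state the game controls and not on what the model happened to say.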
Should be neat stuff, I'm looking forward to how this all works together when the kinks get ironed out.
Take a single character in the game, and give that character the depth and nuance of a true experience with a Zen Master / Inquiry facilitator, powered by AI. IXCoach.com can do a phenomenal job powering this, so literally the only code needed for an MVP is the mod + character API.
Then, the cost benefit ratio is 400x, and in a day of coding you have taken a game that is mostly pure entertainment, and provided a means for depth, nuance and personal development that literally leads the market.
I pinged the executive producer of CD Projekt Red on this, it's viable.
Then llama3:8b[2]. It output 28 words/second. This is higher despite the larger model, perhaps because llama3 obeyed my request to use short words.
Then mixtral:8x7b[3]. That output 10.5 words/second. It looked like 2 tokens/word, as the pattern was quite repetitive and visible, but again I have no easy way to measure it.
That was on battery, set to "Low power" mode, and I was impressed that even with mixtral:8x7b, the fans didn't come on at all for the first 2 minutes of continuous output. Total system power usage peaked at 44W, of which about 38W was attributable to the GPU.
[1] https://ollama.com/library/phi3 [2] https://ollama.com/library/llama3 [3] https://ollama.com/library/mixtral
I came across this thread while doing some research, and it's been helpful.
(I hate how common Tragedy of the Commons is. =/)
What I think is, from my limited understanding about this field, if smaller models can run on consumer hardware reliably and speedily that would be a game changer.
Not really. Inference is never "free" unless you cache the result (which is just a static output) or unless you reduce complexity (which yields procedurally less-usable outputs).
In fact the Radeon which cost me only 300 bucks new performs almost as well running LLMs as the 4090 which really surprised me! I think the fast memory (the Radeon has the same 1TB/s memory bandwidth as the 4090!) helps a lot there.
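The memory bandwidth hunch can be checked with rough arithmetic. Assuming single-user generation has to read every weight once per token, bandwidth divided by model size gives a ceiling on tokens per second. This is a simplification (it ignores compute, the context cache, and software overhead), and the 4.5 GB model size is just an illustrative figure.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# ~1 TB/s as mentioned above; model sizes in GB are illustrative assumptions.
for weights_gb in (4.5, 20.0):
    ceiling = max_tokens_per_second(1000, weights_gb)
    print(f"{weights_gb} GB of weights: at most ~{ceiling:.0f} tokens/s")
```

Under this model, two cards with the same bandwidth share the same ceiling regardless of price, which would be consistent with the result described above.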
When I run a local model (significantly) bigger than the 24GB VRAM on the 4090 it won't even load for 15 minutes while the 4090 is pegged at 100% all the time. Eventually I just gave up.
Yeah the key here is partial offloading. If you're trying to offload more layers than your GPU has memory for, you're gonna have a bad time. I find it kind of infuriating that this is still kind of a black art. There's definitely room for better tooling here.
Regardless, with 24GB of vram, I try to limit my offloading to 20GB and let the rest go to ram. Maybe it's the nature of the 8x7B model I run that makes it better at offloading than other large models. I'm not sure. I wouldn't try the 70B models for sure.
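A crude way to take some of the black art out of this is to estimate how many layers fit in a VRAM budget before loading. The sketch below assumes all layers are the same size, which is only approximately true, and the 26 GB / 32-layer model is an illustrative guess, not a measurement.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Number of whole layers that fit in the VRAM budget, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_budget_gb // per_layer_gb))


# Budget 20 GB of a 24 GB card, leaving headroom for context and the desktop.
n = layers_on_gpu(model_gb=26, n_layers=32, vram_budget_gb=20)
print(f"offload {n} of 32 layers to the GPU")
```

The result would then be the number to pass as the GPU layer count (the `-ngl` flag seen in the llamafile command earlier in the thread) in place of a value like 999.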