Ask HN: Which LLMs can run locally on most consumer computers

75 points by FezzikTheGiant 2 years ago | 94 comments

Are there any? I was thinking about LLM based agents and games and this will probably only be viable when most devices can handle LLMs running locally.

MyFirstSass 2 years ago |

I've been curious as to when games would implement any kind of these new technologies, but i think they are simply too slow for now?

I think we're at least 10-15 years from being able to run low latency agents that "rag" themselves into the games they are a part of, where there are 100's of them, some of them NPC's other's controlling some game mechanic or checking if the output from other agents is acceptable or needs to be run again.

At the moment a macbook air 16 gb can run Phi-Medium 14gb, which is extremely impressive, but it's 7 tokens per second, way to slow for any kind of gaming, you need to 100x performance and we need 5+ generations before i can see this happening.

Unless there's some other application?

reaperman 2 years ago | |

> for games: i think they are simply too slow for now?

I think it's two-fold. The primary one is that it's likely very difficult to maintain a designers storyline vision and desired "atmosphere / feel", because LLM's currently "go off the rails" too easily. The second is that the teams with enough funding to properly fine-tune generative AI to do dialog, level/environment-creation, character-generation, etc. that funding means they're generally making AAA or AAA-adjacent games, which already need so much of a consumer GPU VRAM that there's not a lot left over for large ML models to run in parallel.

I do think though that we should already be seeing indie games doing more with LLM's and 3D character/level/item generation than we are. Of course AI Dungeon has been trailblazing this for a long time but I just expected to see more widely-recognized success by now from many projects. I take this as a signal that it's hard to make a "good" game using AI generation. If anyone has any suggestions for open-world games with significant amount of AI generation that allows player interaction to significantly affect the in-game universe, I'd be very interested in play-testing them. Can be any genre / style / budget. I just want to see more of what people are accomplishing in this space.

My hope is that there will be space for both the current style of game where every aspect is created/designed by a human, as well as for games of various types where the world is given an overall narrative/aesthetic/vision by the creators, but the details are implemented by AI and allows true open-world play where you finally can just walk into any shop and use RAG/etc to allow complete continuity over months/years of play where characters remember your conversations/interactions/actions of you and anyone playing in the same world.

I do think there's something of an "end-game" for this where a game is released that has no game at all in it, but rather generates games for each player based on what they want to play that day, and creates them as you play them. But I'd like to imagine that this won't replace other games (even if it does take a bit of the air out of the room), but rather exist alongside games with human-curated experiences.

everforward 2 years ago | | |

I think any NPC with dialogue important to a goal (a quest, a tutorial, etc) is going to be hard to use generative AI for. It not only needs to be coherent with the story, but it needs to correctly include certain ideas. I.e. if the NPC gives a quest to go find some item at some location, it needs to say what the item is and where it is.

I think we're currently stuck in a local minima where AI isn't up to the task of making a coherent player-interactable world, but an incoherent or fragmented and non-interactable world isn't impressive enough (like No Man's Sky).

daemon_9009 2 years ago | | |

Current games which are using LLMs only activate the model when the user is talking to the NPC, but in order to create a real dynamic story which is completely random but to the point, the agents need to interact with other as well,so lets say there are around 100 agents in the game they need to interact with each other to generate some emergent behavior. The form of interaction can be questioned here. will it be in natural lang? or just some embeddings or states.

But this thing still has a long way to go.

IXCoach 2 years ago | | |

I agree in the context of LLMs running locally. For API connected games, cloud support for nuanced conversations would be a tremendous value add. Take a hit like Cyberpunk, create a Mod that wires into a custom AI from ixcoach.com... we could literally integrate the most nuanced self inquiry practices into the top games this way.

Anyone working on top games through mods that wants to explore this, let me know, Next AI Labs would be interested in supporting such efforts.

wing-_-nuts 2 years ago | |

There are mods for skyrim right now that run an NPC's dialog and lore through a small 7B model outputs text dialog. Heck if you wanted you could run a 2B whisper model and get reasonably decent voice output.

It's all very exciting, if a little janky.

pants2 2 years ago | |

If we're just talking about NPCs in a video game, I bet the game studios have the resources to train a very specific LLM optimized for NPCs. Lots of training data could probably be stripped out; after all your average quest-giver in Skyrim doesn't need to know how to implement Black Scholes in Rust.

imtringued 2 years ago | |

The problem is that you need two GPUs and the AI one can't be from AMD. We aren't 15 years away. More like two or three. NPUs are coming and DDR6 plus quad channel memory would get you decent performance on small LLMs like llama3.

You're also forgetting that batch performance is already an order of magnitude better than single session inference.

FezzikTheGiant 2 years ago | |

I agree on the most part, but I still think some pretty cool games can come up with local LLMs. Suck up for example, though not local afaik, is a pretty cool one.

phi-go 2 years ago | |

There are a few games that use LLMs and voice, they are usually hilariously janky.

FezzikTheGiant 2 years ago | | |

Could you name some?

antisthenes 2 years ago | |

How in the world would this be tested? Anything pertaining to game logic needs to be deterministic.

I can't see LLMs in games being used for anything more than some random NPC voice quips. And whose voice would be used? Would voice actors be okay with this?

There are already too many bad games, we certainly don't need thousands more with AI-generated drivel dialogue, although having human writers is not a panacea either way.

pants2 2 years ago | | |

Have other AI agents test the game in thousands of scenarios. Voice actors are not needed, SOTA TTS systems can synthesize a brand new voice from a description.

ynniv 2 years ago |

See llamafile (https://github.com/Mozilla-Ocho/llamafile), a standalone packaging of llama.cpp that runs an LLM locally. It will use the GPU, but falls back on the CPU. CPU-only performance of small, quantized models is still pretty decent, and the page lists estimated memory requirements for currently popular models.

ultrasaurus 2 years ago | |

+100 to this, I don't think many people reading this thread realize how easy they've made it to run a LLM locally. It's a great start if you want to kick multiple tires (be careful to clean up! the gigs add up).

> wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF...

> chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

> ./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile -ngl 999

https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/

blakesterz 2 years ago |

Maybe a dumb question, but I think anyone reading this question would know a good answer for me. If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally? "Best" in this case would be I would want to get the best/smartest answers from my questions about these PDFs. They're all full-text PDFs, studies and results on a specific genetic condition that I'd like to understand better by asking something smart questions.

verdverm 2 years ago | |

LlamaIndex can make this task possible in a very few (surprisingly few) lines of code: https://docs.llamaindex.ai/en/stable/understanding/putting_i...

You'll likely want to move beyond the first examples so you can choose models & methods. Either way, LI has tons of great documentation and was originally built for this purpose. They also have a commercial Parsing product with very generous free quotas (last I checked)

manishsharan 2 years ago | |

If its just for you, may I suggest Open AI's python notebook examples. This was the one I used to get started.

https://cookbook.openai.com/examples/parse_pdf_docs_for_rag

There are several other examples like this .. but I got stuck in jargon of Langchain or LlamaIndex etc..

solardev 2 years ago | |

Not self hosted, but Google Notebook LLM is OK at that: https://notebooklm.google.com/

You can also upload files to ChatGPT and ask questions about it.

keiferski 2 years ago |

Is there any validity to the idea of using a higher-level LLM to generate the initial data, and then copying that data to a lower-level LLM for actual use?

For example, another comment asked:

"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"

So what if you used a paid LLM to analyze these PDFs and create the data, but then moved that data to a weaker LLM in order to run question-answer sessions on it? The idea being that you don't need the better LLM at this point, as you've already extracted the data into a more efficient form.

StrauXX 2 years ago | |

Maybe. You'd need to develop such a "more efficient" format. Turning unstructured text into knowledge graphs has gotten attention lately. Though I'm honestly skeptical of how useful those will turn out to be. Often times you just can't break down unstructured data into structured data without loosing a ton of information. Turning the data into an intermediary, not directly understandable by humans (say very-high density embeddings) format might be a more promising path.

abdullin 2 years ago | |

Yes, this can work. I’ve done that in a few cases.

In fact, if you split data preprocessing in small enough steps, they could also be run on weaker LLMs. It would take a lot more time, but that is doable.

kkielhofner 2 years ago | |

There is actually a specific approach of this concept for generating synthetic data for training datasets called UDAPDR[0].

It or something like it could likely be applied to any form of generation including what you are describing.

[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...

kevinkeller 2 years ago | |

Yes, this model works in many cases.

For example, ask the (better, costlier) Claude Opus to generate high-quality prompts, which get fed into (worse, cheaper) Claude Sonnet.

thibaut_barrere 2 years ago | |

Yes, that is what I am doing on some projects

Isuckatcode 2 years ago |

I was able to successfully run Llama 3 8B, mistral 7B, phi and other 7B models using Ollama [1] on my M1 MacBook Air.

[1] https://ollama.com

wing-_-nuts 2 years ago |

The general rule is that VRAM == parameter count in billions (I'm generalizing gguf finetunes here)

8GB vram cards can run 7B models

16GB vram cards can run 13B models

24GB vram cards can run up to 33B models

Now to your question, what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could be ran reasonably well even on the CPU. Intellij has a absolutely microscopic < 1B model that it uses for code completion locally. It's quite good and I don't notice any delay.

noboostforyou 2 years ago | |

Perhaps there's a simple explanation but why does 24GB of VRAM offer such a large relative uplift in parameter count? (is memory bandwidth a factor rather than just the total memory amount?)

wing-_-nuts 2 years ago | | |

So, this is a bit misleading. For whatever reason the models tend to be released in certain parameter sizes. 7B models are popular. The next highest is 13B. There are few in between (some 11B). Likewise the jump from 13 is straight to 33B. You can run finetunes of a 33B model that have been cut down a little and fit them in a 24GB card. Likewise those 13B models running on 16GB cards have a lot of head room. You don't need to run as cut down a model, and you can run it with more context (i.e. the amount of your chat it can hold in memory)

I hope that helps, it's not 1:1, and it's a bit confusing

wkat4242 2 years ago | | |

Probably quantisation.

I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.

My LLM server with 16gb gpu mainly runs llama3 with expanded context window which also costs much more memory.

onion2k 2 years ago |

I run Mistral 7b and Llama 3 locally using jani.ai on a 32GB Dell laptop and get about 6 tokens per second with a context window of 8k. It's definitely usable if you're patient. I'm glad I also have a Hugging Face account though.

Liquix 2 years ago | |

seconded - IMHO Jan has the cleanest UI and most straightforward setup out of all LLM frontends available now.

https://jan.ai/

https://github.com/janhq/jan

bryanlarsen 2 years ago |

Related question: what's the minimum GPU that's roughly equivalent to Microsoft's Copilot+ spec NPU?

I imagine that Copilot+ will become the target minimum spec for many local LLM products and that most local LLM vendors will use GPU instead of NPU if a good GPU is available.

tda 2 years ago | |

I was looking out for a new laptop but was wondering the same. This NPU thing might be one of Microsoft's bets that pays off, and makes all pre-NPU hardware obsolete quickly. Though of course they have doubled down on various failed projexts before (arm Windows, windows phones, etc)

kevinkeller 2 years ago | |

The NPU in the Snapdragon SoC used by the Windows Surface laptops was quoted to be ~ 40 trillion ops/s (TOPS).

Nvidia 4070 Ti has roughly the same performance: https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...

Of course, I'm massively oversimplifying, but it should be in the ballpark.

artemisart 2 years ago | | |

No, the Nvidia 4070 Ti has much higher performance, TOPS is for integer operations, the 4070 Ti has ~40 float32 TFLOPS and 641 TOPS https://www.nvidia.com/fr-fr/geforce/graphics-cards/40-serie... (which I would say would be peak TOPS for int4 operations, comparing it to the 4080 datasheet, and a bit more than half that for int8 operations) https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid... page 34. I did not find the datasheet for 4070 Ti.

imtringued 2 years ago | |

Basically any GPU with at least 32GB RAM and 12 TFLOPs.

andy_ppp 2 years ago |

“Caniuse” equivalent for LLMs depending on machine specs would be extremely useful!

abdullin 2 years ago | |

There are too many variables at play, unfortunately.

One can ran local LLMs even on RaspberryPi, although it will be horribly slow.

andy_ppp 2 years ago | | |

Maybe it wouldn’t be an algorithm, maybe it would be a reporting site where you can review your experience if there’s no way to calculate it.

Terretta 2 years ago | |

LM Studio on MacOS provides an estimate of whether a model will run on the GPU, also lets you partially offload.

The underlying CLI tools do this, the app makes it easier to see and manage.

spmurrayzzz 2 years ago |

Running them at the edge is definitely possible on most hardware, but not ideal by any means. You'll have to set latency and throughput expectations fairly low if you don't have a GPU to utilize. This is why I'd disagree with your statement re: viability — its really going to be most viable if you centralize the inference in a distributed cloud environment off-device.

Thankfully, between llama 3 8b [1] and mistral 7b [2] you have two really capable generic instruction models you can use out of the box that could run locally for many folks. And the base models are straightforward to finetune if you need different capabilities more specific to your game use cases.

CPU/sysmem offloading is an option with gguf-based models but will hinder your latency and throughput significantly.

The quantized versions of the above models do fit easily in many consumer grade gpus (4-5GB for the weights themselves quantized at 4bpw), but it really depends on how much of your vram overhead you want to dedicate to the model weights vs actually running your game.

[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

psynister 2 years ago |

Check out Ollama, it's built to run models locally. Llama3 8b runs great locally for me, 70b is very slow. Plenty of options.

b5n 2 years ago |

Quantized 6-8b models run well on consumer GPUs. My concern would be vram limits given you'll likely be expecting the card to do compute _and_ graphics.

Without a GPU I think it will likely be a poor experience, but it won't be long until you'll have to go out of your way to buy consumer hardware that doesn't integrate some kind of TPU.

xyc 2 years ago |

I have been using local LLM as a daily driver. Built https://recurse.chat for it. I've used Llama 3, WizardLM 2, Mistral mostly, and sometimes just trying out models from hugging face (Recently added support for adding it from Hugging Face https://x.com/recursechat/status/1794132295781322909)

pshc 2 years ago |

Quantized 4/5-bit 8b models with medium-short context might be shippable. Still, it’s going to require a nice GPU for all that RAM. Plus you would have to support AMD—I would experiment with llama.cpp as it runs on many architectures.

Hope your game doesn’t have a big texture budget.

root_axis 2 years ago |

Seems like there is high potential for some NPC text generation from LLMs, especially a model that is trained to produce NPC dialog alongside discrete data that can be processed to correlate the content of the speech with the state of the game. This is going to be a tough challenge with a lot of room for research and creative approaches to producing immersive experiences. Unfortunately, only single-player and cooperative experiences will be practical for the foreseeable future since its trivial to totally break the immersion with some prompt poisoning.

Even more than LLMs, I'm curious about how transformers can be used to produce more convincing game AI in the areas where they are notoriously bad like 4x games.

talldayo 2 years ago |

Gemma 2B and Phi-3 3B, if you run them at Q4 quantization. I wouldn't bother with anything larger than 4B parameters; you're just not going to be able to reliably expect an end-user to run that size of model on a phone yet.

calculito 2 years ago |

I assume the question is rather which LLM can cover most of the tasks while delivering decent quality. I would prefer an architecture using different LLM for different tasks rather like 'specialists' instead of simple 'agents'. I used to take the main task and divide it in smaller tasks and see what can I use to solve the problem. Sometimes rule-based approaches can be already enough for a sub-task and LLM would be not only overkill but also more difficult to implement and maintain.

rahimrezgui 2 years ago | |

so what is your answer to the question?

calculito 2 years ago | | |

Depends of what you want to do!? Just for testing most of the 7B model are a good compromise between quality and performance (speak execution time)

jsheard 2 years ago |

I imagine you would have to solve some tricky scheduling issues to run an LLM on the GPU while it's also busy rendering the game. Frames need to be rendered at a more or less consistent rate no matter what, but the LLM would likely have erratic, spiky GPU utilisation depending on what the agents are doing, so you would have to throttle the LLM execution very carefully. Probably doable but I don't think there's any existing framework support for that.

callwhendone 2 years ago | |

or have 2 gpus

jsheard 2 years ago | | |

That also works but approximately zero gamers have two discrete GPUs. You can't even rely on users to have an integrated GPU and a discrete GPU, there's a lot of systems which only have one or the other.

ilaksh 2 years ago |

You can 100% do that with quantized models that are 8b and below. Take a look at ollama to experiment. For incorporating in a game I would probably use llama.cpp or candle.

The game itself is not going to have much VRAM to work with though on older GPUs. Unless you use something fairly tiny like phi3-mini.

There are a lot more options if you can establish that the user has a 3090 or 4090.

sn0wr8ven 2 years ago |

There definitely are smaller LLMs that can run on consumer computers, but as for their performance... You would be lucky to get a full sentence. On the other hand, sending and receiving responses as text is probably the fastest and most realistic way to implement these things in games.

imtringued 2 years ago | |

I've gone past the 8k context window with very good text generation on llama3. I don't know what you're smoking.

winwang 2 years ago |

Check out this subreddit for a decent "source of truth": reddit.com/r/localllama

resource_waste 2 years ago | |

Nah, too many fanboys thinking their CPU testing is actually using LLMs.

They will say things like "Its a GPU inside a CPU". No that is the marketers telling you about integrated GPUs.

There is a huge divide between CPU and GPU people. GPU people are doing application. CPU people are... happy that they got anything to run.

Terretta 2 years ago |

Macbook Pro with 128GB RAM runs Llama 3 70B entirely in memory and on GPU. It's remarkable to have a performant LLM that smart and that fast on a (pro)sumer laptop.

jaggs 2 years ago |

Mistral is pretty good, and delivers solid results.

FezzikTheGiant 2 years ago | |

Interesting - is it viable do you think to package a llm like that with an existing game and run it locally - I assume it will be intensive to run but wouldn't that eliminate inference costs?

Werewolf255 2 years ago | | |

It would be intensive but it's very doable. You could use koboldcpp or something like that with an exposed endpoint just on the local machine and use that. You'll likely run into issues with GPU vendors and ensuring that you've got the right software versions running, but with some checking, it should be viable. Maybe include a fallback in case the system can't produce results in a timely manner.

jaggs 2 years ago | | |

Why would you get costs with a local model?