Ask HN: What is the current (Apr. 2024) gold standard of running an LLM locally?
There are many options and opinions about. What is currently the recommended approach for running an LLM locally (e.g., on my 3090 24GB)? Are options ‘idiot proof’ yet?
flox will also install properly accelerated torch/transformers/sentence-transformers/diffusers/etc: they were kind enough to give me a preview of their soon-to-be-released SDXL environment suite (please don’t hold them to my “soon”, I just know it looks close to me). So you can do all the modern image stuff pretty much up to whatever is on HuggingFace.
I don’t have the time I need to be emphasizing this, but the last piece before I’m going to open source this is I’ve got a halfway decent sketch of a binary replacement/complement for the OpenAI-compatible JSON/HTTP one everyone is using now.
I have incomplete bindings to whisper.cpp and llama.cpp for those modalities, and when it’s good enough I hope the buf.build people will accept it as a donation to the community-managed ConnectRPC project suite.
We’re really close to a plausible shot at open standards on this before NVIDIA or someone totally locks down the protocol via the RT stuff.
edit: I almost forgot to mention. We have decent support for multi-vendor, mostly in practice courtesy of the excellent ‘gptel’, though both nvim and VSCode are planned for out-of-the-box support too.
The gap is opening up a bit again between the best closed and best open models.
This is speculation, but I strongly believe the current opus API-accessible build is more than a point release; it’s a fundamental capability increase (though it has a weird BPE truncation issue that could just be a beta bug, but it could hint at something deeper).
It can produce verbatim artifacts from 100s of thousands of tokens ago and restart from any branch in the context, takes dramatically longer when it needs to go deep, and claims it’s accessing a sophisticated memory hierarchy. Personally I’ve never been slackjawed with amazement on anything in AI except my first night with SD and this thing.
(I just got a little emotional because these are things we used to say on reddit and now we say them about reddit. How the mighty have fallen)
Did for me just now, although as of a week or two ago reddit has been blocking many of my attempts to access it through a VPN, old. or not. I usually need to reconnect about 3 or 4 times before a page will load.
I'm sure there are myriad browser extensions that will do it at the DOM level, but that's such a heavy-handed solution, and also lol I'm not putting an extension on the cartesian product of all my browsers on all my machines in the service of dis-enshittifying one once-beloved social network.
why do you think my info about the 3090 https://en.wikipedia.org/wiki/IBM_3090 is going to be anything less than up-to-date?
on the other hand, 24gb of late 80's memory... how many acres of raised floor data center would that take?
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
Everyone in the VC world misunderstands why oobabooga is successful and tries to embrace not maximalism.
Your example product to benchmark yourself against is Blender, if you want to seriously compete against oobabooga. You need maximalism.
brew install ollama
brew services start ollama
ollama pull mistral
You can query Ollama via HTTP. It provides a consistent interface for prompting, regardless of model.
https://github.com/ollama/ollama/blob/main/docs/api.md#reque...
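For anyone who wants to see what "query via HTTP" looks like in practice, here is a minimal sketch using only the Python standard library. It assumes Ollama is running locally on its default port (11434) and that you have already pulled `mistral` as in the commands above; the helper names `build_generate_request` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's generate endpoint (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Usage (needs a running Ollama server):
#   print(generate("mistral", "Why is the sky blue?"))
```

Swapping models is just a matter of changing the `model` string, which is the "consistent interface" part.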
We also run a Mac Studio with a bigger model (70b), M2 ultra and 192GB ram, as a chat server. It's pretty fast. Here we use Open WebUI as interface.
Software-wise, Ollama is OK, as most IDE plugins can work with it now. I personally don't like the Go code they have. Some key features that I would need are also missing and are just never getting done, even though multiple people have submitted PRs for some of them.
LM Studio is better overall, both as server or as chat interface.
I can also recommend CodeGPT plugin for JetBrains products and Continue plugin for VSCode.
As a chat server UI, as I mentioned, Open WebUI works great. I use it with Together AI as a backend too.
Or maybe I'm just working in cash poor environments...
Edit: also, can you do training / finetuning on an m2 like that?
it's pretty 'idiot proof', if you ask me.
What do you do with one of these?
Does it generate images? Write code? Can you ask it generic questions?
Do you have to 'train' it?
Do you need a large amount of storage to hold the data to train the model on?
Many of the open-source tools that run these models also let you edit the system prompt, which lets you tweak their personality.
The more advanced tools let you train them, but most of the time, people are downloading pre-existing models and using them directly.
If you are training models, it depends what you are doing. Finetuning an existing pre-trained model requires lots of examples but you can often do a lot with, say, 1000 examples in a dataset.
If you are training a large model completely from scratch, then, yes, you need tons of data and very few people are doing that on their local machines.
Our tool, https://github.com/transformerlab/transformerlab-app, also supports the latter (document search) using local LLMs.
https://python.langchain.com/docs/get_started/introduction
I like LangChain, but it can get complex for use cases beyond a simple "give the LLM a string, get a string back". I've found myself spending more time in the LangChain docs than working on my actual idea/problem. However, it's still a very good framework and they've done an amazing job IMO.
edit: "Are options ‘idiot proof’ yet?" - from my limited experience, Ollama is about as straightforward as it gets.
I've got an Ollama instance running on a VPS providing a backend for a discord bot.
Granted, it's slower of course, but it's the best bang for your buck on VRAM, so you can run larger models than a smaller but faster card might be able to. (Not an expert.)
Edit: if using in desktop tower, you'll need to cool it somehow. I'm using a 3D printed fan thingy, but some people have figured out how to use a 1080 ti APU cooler with it too.
Apple M2 or M3 Macs are becoming a viable option because of MLX https://github.com/ml-explore/mlx . If you are getting an M-series Mac for LLMs, I'd recommend getting something with 24GB or more of RAM.
Edit: using a P40, whisper as ASR
Together Gift It solves the problem the way you’d think: with AI. Just kidding. It solves the problem by keeping everything in one place. No more group texts. There are wish lists and everything you’d want around that type of thing. There is also AI.
The thing to watch out for (if you have disposable income) is the new RTX 5090. Rumors are floating around that they are going to have 48GB of RAM per card. Even if not, the RAM bandwidth is going to be a lot faster. People doing ML on 4090s or 3090s are going to move to those, so you can pick up a second 3090 for cheap, at which point you can load higher-parameter models. However, you will have to learn the Hugging Face Accelerate library to support multi-GPU inference (not hard, just some reading and trial and error).
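To put rough numbers on what a second 24GB card buys you, here is a back-of-the-envelope sketch. The weight size is just parameters times bits per weight; the 20% overhead for KV cache and activations is my own assumption, not a measured figure, and the function names are made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    # params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Rough check; overhead_fraction is an assumed allowance, not a measurement."""
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


# One 3090 (24 GB) vs. two 3090s (48 GB), 70B model quantized to 4 bits:
print(weight_memory_gb(70, 4))   # weights alone
print(fits_in_vram(70, 4, 24))   # single card
print(fits_in_vram(70, 4, 48))   # two cards
```

By this estimate a 4-bit 70B model needs about 35 GB for weights alone, so it does not fit on one 3090 but plausibly fits across two. Real usage depends on context length and the runtime you use.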
Its API is great if you want to integrate it with your code editor or create your own applications.
I have written a blog [1] on the process of deployment and integration with neovim and vscode.
I also created an application [2] to chat with LLMs by adding the context of a PDF document.
Update: I would like to add that because the API is simple and Ollama is now available on Windows I don’t have to share my GPU between multiple VMs to interact with it.
[1] https://www.avni.sh/posts/homelab/self-hosting-ollama/ [2] https://github.com/bovem/chat-with-doc
Also, they have Python and (less relevant to me) JavaScript libraries. So I assume you don't have to go through LangChain anymore.
we screwed around with it on a live stream: https://www.youtube.com/live/3YhBoox4JvQ?si=dkni5LY3EALnWVuE...
If you're writing something that will run on someone's local machine I think we're at the point where you can start building with the assumption that they'll have a local, fast, decent LLM.
I don't believe that at all. I don't have any kind of local LLM. My mother doesn't, either. Nor does my sister. My girl-friend? Nope.
Guess it's going to be a variant of Llama or Grok.
Ease? Probably ollama
Speed and you are batching on gpu? vLLM
gpt4all is decent as well, and also provides a way to retrieve information from local documents.
Seriously, this is the insane duo that can get you going in moments with chatgpt3.5 quality.
For squeezing every bit of performance out of your GPU, check out ONNX or TensorRT. They're not exactly plug-and-play, but they're getting easier to use.
And yeah, Docker can make life a bit easier by handling most of the setup mess for you. Just pull a container and you're more or less good to go.
It's not quite "idiot-proof" yet, but it's getting there. Just be ready to troubleshoot and tinker a bit.
Source code: https://github.com/leoneversberg/llm-chatbot-rag
https://ollama.com/download/windows
WinGet and Scoop apparently also have it. Chocolatey doesn't seem to.
https://medium.com/@tofujoy77/loom-ai-uncovering-creative-wr...
> Request access to Llama
Which to me gives the impression that access is gated and by application only. But Ollama downloaded it without so much as a "y".
Is that just Meta's website UI? Registration isn't actually required?
If this is your concern, I'd encourage you to read the code yourself. If you find it meets the bar you're expecting, then I'd suggest you submit a PR which updates the README to answer your question.
But it seems like there's a linux brew.
Oobabooga is the closest thing we have to maximalism. It exposes by far the largest number of parameters/settings/backends compared to all the others.
My main point is that the world yearns for a proper "Photoshop for text" - and no one has even tried to make this (closest is oobabooga). All VC backed competitors are not even close to the mark on what they should be doing here.
https://huggingface.co/meta-llama/Llama-2-7b
Ollama's installer didn't ask me for any contact info.
The requirement has always been more of a fig leaf than anything else. It allows Facebook to say everyone who downloaded from them agreed to some legal/ethics nonsense. Like all click-through license agreements nobody reads it; it's there so Facebook can pretend to be shocked when misuse of the model comes to light.
Some will do it anyway for PR/marketing, but if the creator does not interact with, have access to, or collect your data they have no obligation to have a privacy policy.
You could perhaps intercept the HTTP requests, but that would require decrypting SSL, so you'd have to MITM yourself if you wanted to do it at the network level.
You'd replace the DNS request for `reddit.com` to some device that intercepts the traffic. That device would redirect the HTTP request to `old.reddit.com` if applicable; static assets would be routed to `reddit.com`. I don't know how things like HSTS and certificate pinning fit into an idea like this.
Doing this at the device level is probably easiest. I've used Redirector [0] in the past and it works well.
Create a vhost for reddit.com, add reddit.com to your hosts file so it points to the webserver (or set up a stub in the DNS pointing to the vhost webserver), and do a redirect to old.reddit.com on your webserver vhost.
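Whichever layer does the redirect (extension, proxy, or vhost), the rewrite rule itself is tiny. A sketch of the logic in Python, assuming you only want to rewrite the bare and `www` hosts and leave everything else (static asset hosts, `old.` itself) alone; the function name and host set are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these hosts should be redirected; asset hosts pass through.
REDIRECT_HOSTS = {"reddit.com", "www.reddit.com"}


def to_old_reddit(url: str) -> str:
    """Return the old.reddit.com equivalent of a reddit.com URL, else the URL unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in REDIRECT_HOSTS:
        return url
    # Keep scheme, path, query, and fragment; swap only the host.
    return urlunsplit((parts.scheme, "old.reddit.com", parts.path, parts.query, parts.fragment))


print(to_old_reddit("https://www.reddit.com/r/LocalLLaMA/?sort=new"))
```

This only covers the URL rewriting; it does nothing about the TLS, HSTS, or certificate pinning questions raised above.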
I can't imagine any type of Pi-hole setup which would be faster, seems like a "holding a hammer" kind of solution.