A brief history of LLaMA models

A brief history of LLaMA models (agi-sphere.com)

245 points by andrewon 3 years ago | 84 comments

jiggawatts 3 years ago |

It keeps saying the phrase “model you can run locally”, but despite days of trying, I failed to compile any of the GitHub repos associated with these models.

None of the Python dependencies are strongly versioned, and “something” happened to the CUDA compatibility of one of them about a month ago. The original developers “got lucky” but now nobody else can compile this stuff.

After years of using only C# and Rust, both of which have sane package managers with semantic versioning, lock files, reproducible builds, and even SHA checksums, the Python package ecosystem looks ridiculously immature and even childish.

Seriously, can anyone here build a docker image for running these models on CUDA? I think right now it’s borderline impossible, but I’d be happy to be corrected…

valine 3 years ago | |

I’ve got 4 different llama models running locally with CUDA and can freely switch between them, including LLaVA which is a multimodal LLaMA variant.

None of them are particularly difficult to get running, the trick is to search the project’s github issue tracker. 99% of the time your problem will be in there with steps to fix it.

jiggawatts 3 years ago | | |

> the trick is to search the project’s github issue tracker.

What ever happened to the crazy notion of Dockerfiles that simply build successfully?

Isn’t half the point of containerisation that it papers over the madness of the Python module ecosystem?

MakeUsersWant 3 years ago | | |

Could you publish a set of known-good versions (pip freeze, OS version, etc)?
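
A snapshot like that can be captured with the standard library alone. The following is a minimal sketch, assuming Python 3.8 or newer; the function name `snapshot_environment` is made up for illustration, and it records only Python-level packages and the OS string, not the CUDA driver or toolkit version.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return (header_lines, pinned_requirements) describing this environment."""
    header = [
        f"# python {platform.python_version()}",
        f"# platform {platform.platform()}",
    ]
    # One "name==version" line per installed distribution, similar in spirit
    # to `pip freeze` (without pip's handling of editable or VCS installs).
    pins = sorted(
        {
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
            if dist.metadata["Name"]  # skip broken installs with no name
        },
        key=str.lower,
    )
    return header, pins


if __name__ == "__main__":
    header, pins = snapshot_environment()
    print("\n".join(header + pins))
```

Redirecting the output to a file gives something that can be posted alongside a bug report or fed back to `pip install -r`.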

nl 3 years ago | |

Use the Hugging Face Transformers library. Unlike random GitHub repos, it is professionally maintained with proper versioning.

Here's the docs: https://huggingface.co/docs/transformers/main/model_doc/llam...

int_19h 3 years ago | |

All of these things exist in the Python package ecosystem, and are generally much more common outside of ML/DS stuff. The latter... well, it reminds me of coding in early PHP days. Basically, anything goes so long as it works.

kmod 3 years ago | |

I believe the CUDA stuff, via Nvidia licensing restrictions, is forced to live outside of these packaging systems (so that you sign an Nvidia EULA). Not saying this is a good thing, but I think that none of the systems you mentioned would handle this well either.

emikulic 3 years ago | | |

This used to be the case but nowadays you can just pip install e.g. https://libraries.io/pypi/nvidia-cuda-runtime-cu11

crowwork 3 years ago | |

https://mlc.ai/mlc-llm/

jiggawatts 3 years ago | | |

I love how the response to a complaint about unreproducible builds without any versions being specified is an install script that straight up clones the "current commit" of a Git repo instead of a specific working commit id or tag.

Astonishing.

Taek 3 years ago | |

I have it running locally using the oobabooga webui. Setup was moderately annoying, but I'm definitely no Python expert and I didn't have too much trouble.

DustinBrett 3 years ago | |

I had it running before with Dalai (https://github.com/cocktailpeanut/dalai) but have since moved to using the browser based WebGPU method (https://mlc.ai/web-llm/) which uses Vicuna 7B and is quite good.

KETpXDDzR 3 years ago | |

llama.cpp was easy to set up IMO

jiggawatts 3 years ago | | |

Can you link to a working Dockerfile?

I've heard several people say that it is easy, but then surely it ought to be trivial to script the build so that it works reliably in a container!

rch 3 years ago | |

Just use Nixpkgs already.

microtonal 3 years ago | | |

Upstream Hydra doesn't build packages with CUDA because it uses a non-FLOSS license. So they are not in the binary cache. You'll end up rebuilding every CUDA-using package every time a transitive dependency is changed. Yeah, I know, pin the world. But you'll still have to build these packages on every machine. So, you have to run your own binary cache. As you see, the rabbit hole gets deep pretty quickly.

The only recourse is using the -bin flavors of PyTorch, etc. which will just download the precompiled upstream versions. Sadly, the result will still be much slower than other distributions. First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on your machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the limited parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16 core machine.

So, you spend hours fixing derivations so that they build on many core machines and upstream all the diffs. Or YOLO and you disable unit tests altogether. A few hours/days later (depending on your knowledge of Nix), you finally have a build of all packages that you want, you launch whatever you are doing on your CUDA-capable GPU. Turns out that it is 30-50% slower. Finding out why is another multi-day expedition in profiling and tinkering.

In the end pyenv (or a Docker container) on a boring distribution doesn't look so bad.

(Disclaimer: I initially added the PyTorch/libtorch bin packages to nixpkgs and was co-maintainer of the PyTorch derivation for a while.)

dagenix 3 years ago | |

There is a lot that can be improved with python packaging, but calling it "childish" is itself a pretty immature comment.

jiggawatts 3 years ago | | |

Is it?

Literally every example I've seen so far is completely unversioned and mere weeks after being written simply doesn't work as a direct consequence.

E.g: https://github.com/oobabooga/text-generation-webui/blob/ee68...

Take this line:

    pip3 install torch torchvision torchaudio

Which version of torch is this? The latest.

    FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

Which version of CUDA is this? An incompatible one, apparently. Game over.

Check out "requirements.txt":

    accelerate==0.18.0
    colorama
    datasets
    flexgen==0.1.7
    gradio==3.25.0
    markdown
    numpy
    pandas
    Pillow>=9.5.0
    pyyaml
    requests
    rwkv==0.7.3
    safetensors==0.3.0
    sentencepiece
    tqdm

Wow. Less than half of those have any version specified. The rest? "Meh, I don't care, whatever."

Then this beauty:

    git+https://github.com/huggingface/peft

I love reaching out to the Internet in the middle of a build pipeline to pull the latest commit of a random repo, because that's so nice and safe, scalable, and cacheable in an artefact repository!
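
The complaint above can be checked mechanically. Here is a rough sketch of a requirements audit in plain Python; the names `classify_requirement` and `audit` are invented for this example, and the parsing is deliberately simplistic (it is not a full implementation of pip's requirements syntax). It treats `==` as pinned, other comparison operators as a range, bare names as unpinned, and a `git+` URL as pinned only when it ends in a full 40-character commit hash.

```python
import re

# A VCS URL counts as pinned only if it names a full commit hash after "@".
_COMMIT_RE = re.compile(r"@[0-9a-f]{40}(#.*)?$")


def classify_requirement(line):
    """Return 'pinned', 'range', 'unpinned', or None for blank/comment lines."""
    # Drop comments: a "#" at line start or preceded by whitespace.
    req = re.sub(r"(^|\s)#.*$", "", line).strip()
    if not req:
        return None
    if req.startswith("git+"):
        return "pinned" if _COMMIT_RE.search(req) else "unpinned"
    spec = req.split(";", 1)[0]  # ignore environment markers
    if "==" in spec and "*" not in spec and "," not in spec:
        return "pinned"
    if re.search(r"[<>~!]=|[<>]", spec):
        return "range"
    return "unpinned"


def audit(lines):
    """Group requirement lines by how tightly they are versioned."""
    report = {"pinned": [], "range": [], "unpinned": []}
    for line in lines:
        kind = classify_requirement(line)
        if kind:
            report[kind].append(line.strip())
    return report


# The requirements quoted above, plus the git+ line.
QUOTED = """
accelerate==0.18.0
colorama
datasets
flexgen==0.1.7
gradio==3.25.0
markdown
numpy
pandas
Pillow>=9.5.0
pyyaml
requests
rwkv==0.7.3
safetensors==0.3.0
sentencepiece
tqdm
git+https://github.com/huggingface/peft
""".splitlines()

if __name__ == "__main__":
    for kind, items in audit(QUOTED).items():
        print(f"{kind}: {len(items)}")
```

On the quoted list this reports 5 pinned, 1 range, and 10 unpinned entries, the last group including the `git+` URL because it names no commit.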

The NPM ecosystem gets regularly excoriated for the exact same mistakes, which by now are so well known, so often warned against, so often exploited, so regularly broken that it's getting boring.

It's like SQL injection. If you're still doing it in 2023, if your site is still getting hacked because of it, then you absolutely deserve to be labelled immature and even childish.

doodlesdev 3 years ago |

   > Our system thinks you might be a robot!
   We're really sorry about this, but it's getting harder and harder to tell the difference between humans and bots these days.

Yeah, fuck you too. Come on, really, why put this in front of a _blog post_? Is it that hard to keep up with the bot requests when serving a static page?

api 3 years ago | |

A lot of people just stick cloudflare in front of anything because of cargo cultism.

A $5/mo VPS can serve a blog to tens of thousands of people unless you are running something stupidly inefficient. If it’s a static blog make that hundreds of thousands. For millions you might need to splurge on the $10 or $20 per month VPS.

hewlett 3 years ago | | |

You can either spend $5 per month for VPS for a webserver for your static blog which you now have to secure properly, or you can just stick it on Cloudflare Pages for free

Spivak 3 years ago | | |

Or you use the free thing and never think about it?

vessenes 3 years ago |

Most places that recommend llama.cpp for mac fail to mention https://github.com/jankais3r/LLaMA_MPS, which runs unquantized 7b and 13b models on the M1/M2 GPU directly. It's slightly slower (not by a lot) and uses significantly less energy. To me the win of not having to quantize while not melting a hole in my lap is huge; I wish more people knew about it.

simonw 3 years ago |

I'm running Vicuna (a LLaMA variant) on my iPhone right now. https://twitter.com/simonw/status/1652358994214928384

The same team that built that iPhone app - MLC - also got Vicuna running directly in a web browser using Web GPU: https://simonwillison.net/2023/Apr/16/web-llm/

newswasboring 3 years ago | |

With all these new AI models, both stable diffusion and llama especially, I'm considering switching to iPhone. I don't think I fully understand why iPhones and Macs are getting so many implementations but it seems like it's hardware based.

simonw 3 years ago | | |

My understanding is that part of it is that Apple Silicon shares all available RAM between CPU and GPU.

I'm not sure how many of these models are actively taking advantage of that architecture yet though.

AnthonyMouse 3 years ago | | |

Most of these implementations are not platform-specific. I've been running llama.cpp on x86_64 hardware and the performance is fine. The small models are fast and the quantized 65B model generates about a token per second on a system with dual-channel DDR4, which isn't unusable.

The tough thing to find is something affordable that will run the unquantized 65B model at an acceptable speed. You can put 128GB of RAM in affordable hardware but ordinary desktops aren't fast. The things that are fast are expensive (e.g. I bet Epyc 9000 series would do great). And that's the thing Apple doesn't get you either, because Apple Silicon isn't available with that much RAM, and if it was it wouldn't be affordable (the 96GB Macbook Pro, which isn't enough to run the full model, is >$4000).
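
The arithmetic behind those numbers can be sketched in a few lines. This is a back-of-the-envelope estimate, not a benchmark: it counts the weights only (no KV cache or activations), assumes exactly 16 or 4 bits per weight (real quantized formats use somewhat more), and assumes DDR4-3200 for the dual-channel bandwidth figure, which the comment does not specify. The function names are made up for illustration.

```python
def weights_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def gb_to_gib(gb):
    """Convert decimal GB to binary GiB, the unit RAM is usually sold in."""
    return gb * 1e9 / 2**30


def tokens_per_second_ceiling(model_gb, bandwidth_gb_s):
    """Rough upper bound if every weight is read once per generated token."""
    return bandwidth_gb_s / model_gb


fp16_gb = weights_gb(65, 16)  # 130.0 GB, roughly 121 GiB
q4_gb = weights_gb(65, 4)     # 32.5 GB

# Assumed: dual-channel DDR4-3200 = 2 channels * 8 bytes * 3200 MT/s.
ddr4_dual_gb_s = 2 * 8 * 3200e6 / 1e9  # about 51.2 GB/s

if __name__ == "__main__":
    print(f"fp16 weights: {fp16_gb:.1f} GB ({gb_to_gib(fp16_gb):.0f} GiB)")
    print(f"4-bit weights: {q4_gb:.1f} GB")
    print(f"4-bit ceiling: {tokens_per_second_ceiling(q4_gb, ddr4_dual_gb_s):.2f} tokens/s")
```

Under these assumptions the unquantized weights come to about 121 GiB, which does not fit in 96 GB and only barely fits in 128 GB, and the 4-bit model tops out at between one and two tokens per second on dual-channel DDR4, in line with the figure quoted above.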

sp332 3 years ago | | |

iPhones leaned in to "computational photography" a long time ago. Eventually they added custom hardware to handle all the matrix multiplies efficiently. They exposed some of it to apps with an API called CoreML. They've been adding more features like on-device photo tagging, voice recognition, VR stuff.

bkm 3 years ago | | |

Homogenized hardware I assume, this is why iOS had so many photography Apps too.

brucethemoose2 3 years ago |

There is also CodyCapybara (7B finetuned on code competitions), the "uncensored" Vicuna, OpenAssistant 13B (which is said to be very good), various non-English tunes, medalpaca... the release pace is maddening.

acapybara 3 years ago | |

And let's not forget about Alpacino (offensive/unfiltered model).

brianjking 3 years ago |

I'll never understand why everyone is spending so much time on a model you cannot use commercially (at all).

Secondly, most of us can't even use the model for research or personal use, given the license.

FloatArtifact 3 years ago |

There needs to be a site dedicated to tracking all these models with regular updates.

mdaniel 3 years ago | |

Heh, there is, and you're on it. But a slightly more serious answer is that it would be a good feature for Hugging Face to add, since they're the GitHub of models. I actually suggested to GitHub that they should allow contributions of repo topics, since a lot of developers don't know or don't bother to add topics to their repos, making discoverability harder than necessary. GH ignored it, but maybe Hugging Face could implement such a thing.

foobarbecue 3 years ago |

Ok I gotta know... what's the art?