A brief history of LLaMA models (agi-sphere.com)
None of the Python dependencies are strongly versioned, and “something” happened to the CUDA compatibility of one of them about a month ago. The original developers “got lucky” but now nobody else can compile this stuff.
After years of using only C# and Rust, both of which have sane package managers with semantic versioning, lock files, reproducible builds, and even SHA checksums, the Python package ecosystem looks ridiculously immature and even childish.
Seriously, can anyone here build a docker image for running these models on CUDA? I think right now it’s borderline impossible, but I’d be happy to be corrected…
None of them are particularly difficult to get running; the trick is to search the project’s GitHub issue tracker. 99% of the time your problem will be in there with steps to fix it.
What ever happened to the crazy notion of Dockerfiles that simply build successfully?
Isn’t half the point of containerisation that it papers over the madness of the Python module ecosystem?
Here's the docs: https://huggingface.co/docs/transformers/main/model_doc/llam...
Astonishing.
I've heard several people say that it is easy, but then surely it ought to be trivial to script the build so that it works reliably in a container!
The only recourse is using the -bin flavors of PyTorch, etc., which will just download the precompiled upstream versions. Sadly, the result will still be much slower than other distributions. First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on your machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the limited parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16-core machine.
So, you spend hours fixing derivations so that they build on many-core machines and upstream all the diffs. Or YOLO and you disable unit tests altogether. A few hours/days later (depending on your knowledge of Nix), you finally have a build of all the packages that you want, and you launch whatever you are doing on your CUDA-capable GPU. Turns out that it is 30-50% slower. Finding out why is another multi-day expedition in profiling and tinkering.
In the end pyenv (or a Docker container) on a boring distribution doesn't look so bad.
(Disclaimer: I initially added the PyTorch/libtorch bin packages to nixpkgs and was co-maintainer of the PyTorch derivation for a while.)
Literally every example I've seen so far is completely unversioned and mere weeks after being written simply doesn't work as a direct consequence.
E.g: https://github.com/oobabooga/text-generation-webui/blob/ee68...
Take this line:
pip3 install torch torchvision torchaudio
Which version of torch is this? The latest.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
Which version of CUDA is this? An incompatible one, apparently. Game over.
Check out "requirements.txt":
accelerate==0.18.0
colorama
datasets
flexgen==0.1.7
gradio==3.25.0
markdown
numpy
pandas
Pillow>=9.5.0
pyyaml
requests
rwkv==0.7.3
safetensors==0.3.0
sentencepiece
tqdm
Wow. Less than half of those have any version specified. The rest? "Meh, I don't care, whatever."
Then this beauty:
git+https://github.com/huggingface/peft
I love reaching out to the Internet in the middle of a build pipeline to pull the latest commit of a random repo, because that's so nice and safe, scalable, and cacheable in an artefact repository!
The NPM ecosystem gets regularly excoriated for the exact same mistakes, which by now are so well known, so often warned against, so often exploited, so regularly broken that it's getting boring.
It's like SQL injection. If you're still doing it in 2023, if your site is still getting hacked because of it, then you absolutely deserve to be labelled immature and even childish.
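The unpinned-requirements complaint above can be checked mechanically. The following is only a minimal sketch: it copies the fifteen requirement lines quoted in the comment, and the helper name `find_unpinned` and the regular expression are illustrative assumptions, not part of pip or any other tool. It treats only `name==version` as an exact pin, so a range such as `Pillow>=9.5.0` and a bare git URL both count as unpinned.

```python
import re

# Requirement lines copied from the comment above.
REQUIREMENTS = """\
accelerate==0.18.0
colorama
datasets
flexgen==0.1.7
gradio==3.25.0
markdown
numpy
pandas
Pillow>=9.5.0
pyyaml
requests
rwkv==0.7.3
safetensors==0.3.0
sentencepiece
tqdm
"""

# Only "name==version" counts as an exact pin; anything else can drift over time.
PINNED = re.compile(r"^[A-Za-z0-9._-]+==[^=\s]+$")


def find_unpinned(text):
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


unpinned = find_unpinned(REQUIREMENTS)
total = len(REQUIREMENTS.splitlines())
print(f"{len(unpinned)} of {total} entries are not exactly pinned:")
for line in unpinned:
    print("  " + line)

# A dependency on the tip of a git repository is flagged as well.
print(find_unpinned("git+https://github.com/huggingface/peft"))
```

A check like this could run in CI before the image build, so a loose specification fails fast instead of weeks later.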
> Our system thinks you might be a robot!
We're really sorry about this, but it's getting harder and harder to tell the difference between humans and bots these days.
Yeah, fuck you too. Come on, really, why put this in front of a _blog post_? Is it that hard to keep up with the bot requests when serving a static page?
A $5/mo VPS can serve a blog to tens of thousands of people unless you are running something stupidly inefficient. If it’s a static blog make that hundreds of thousands. For millions you might need to splurge on the $10 or $20 per month VPS.
The same team that built that iPhone app - MLC - also got Vicuna running directly in a web browser using Web GPU: https://simonwillison.net/2023/Apr/16/web-llm/
I'm not sure how many of these models are actively taking advantage of that architecture yet though.
The tough thing to find is something affordable that will run the unquantized 65B model at an acceptable speed. You can put 128GB of RAM in affordable hardware but ordinary desktops aren't fast. The things that are fast are expensive (e.g. I bet Epyc 9000 series would do great). And that's the thing Apple doesn't get you either, because Apple Silicon isn't available with that much RAM, and if it was it wouldn't be affordable (the 96GB Macbook Pro, which isn't enough to run the full model, is >$4000).
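As a rough back-of-the-envelope check on the memory figures above, the size of the weights alone is the parameter count times the bytes per weight. This sketch assumes exactly 65 billion parameters and ignores activations, context cache, and runtime overhead, so real requirements are somewhat higher; the function name is purely illustrative.

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 65e9  # assumption: nominal parameter count of the "65B" model

for label, bits in [("32-bit float", 32), ("16-bit float", 16),
                    ("8-bit integer", 8), ("4-bit", 4)]:
    print(f"{label:>14}: {weights_size_gb(N_PARAMS, bits):6.1f} GB")
```

At 16 bits per weight this comes to about 130 GB, which is why a 96GB machine cannot hold the unquantized model.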
Secondly, most of us can't even use the model for research or personal use, given the license.
It was discovered though, that while models may need this level of precision when creating them ("training"), they don't need it nearly as much after the fact, when simply running them to get results ("inference").
So quantisation is the process of getting that big set of, say, 32-bit floats, and "mapping" them to a much smaller number type. Eg, an 8-bit integer ("INT8"). This is a number in the range 0-255 (or -128 to +127).
So, to quantise a list of 32-bit floats, you could go through the list and analyse. Maybe they're all in the range -1.0 to +1.0. Maybe there are many around the value of 0.99999 and 0.998 etc, so you decide to assign those the value "255" instead.
Repeat this until you've squashed that bunch of 32-bit values into 8-bits each. (Eg, maybe 0.750000 could be 192, etc.)
This could give a saving in memory footprint for the model of 4x smaller, and also makes it able to be run faster. So while you needed 16GB to run it before, now you might only need 4GB.
The expense is that the model won't be as accurate. But typically it retains something on the order of 90% of its accuracy, in exchange for a 4x memory saving. So it's deemed worth it.
It's through this process folks can run models that would normally require a 5-figure GPU to run, on their home machine, or even on the CPU, as it might be able to process integers easier and faster than floating point.
https://huggingface.co/docs/optimum/concept_guides/quantizat...
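The mapping described above can be sketched in a few lines of plain Python. This is a deliberately simple linear scheme with one shared scale and offset for the whole list, chosen for illustration; real quantisation libraries use more refined variants, and the exact codes produced here will not match the illustrative numbers in the comment (0.75 does not land on 192 under this scheme). The function names are assumptions made for the example.

```python
import struct


def quantize(values):
    """Map floats to unsigned 8-bit codes (0-255) using one shared scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the 8-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.75, 0.998, 0.99999, 1.0]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

float32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per value
int8_bytes = len(bytes(codes))                                  # 1 byte per value
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("bytes as float32:", float32_bytes, "bytes as 8-bit codes:", int8_bytes)
print("largest round-trip error:", max_error)
```

Note that nearby values such as 0.998 and 0.99999 can collapse onto the same code, which is exactly the loss of precision the comment describes.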
The reason companies working on AI would be foolish to argue for more copyrightability is that it would be hard to conclude the models were copyrightable works without also concluding that the models are unlawful derivatives of the material they were trained on. "Congrats, models can be owned, but regrets: you're bankrupt now because you just committed 4.6 billion acts of copyright infringement carrying statutory damages of $250k each."
You might argue that this is far from sure, OKAY-- but parties that take this view will out-compete ones that don't. If it does turn out to be problematic, the people that had something to work from now will pivot to backing their work on something else and will still be ahead of people sitting on their hands.
You could see it as a calculated risk, but it seems at least as safe as the one behind the underlying authors of the model weights training on material they're not licensed to distribute.
Also, with business there are few "can do / can't do" - it's about managing risks. If a penalty for doing is negligible (FB cannot catch you abusing license in private), from a business standpoint there is no issue in doing so - especially with things that are ethically kind-of-ok.
https://github.com/togethercomputer/RedPajama-Data/
https://twitter.com/togethercompute/status/16479179892645191...
I doubt the Facebook Police are going to bust down your door at 3am.
…or are they? peeks through curtains
It's still noticeably slower than GPU, though.
SP5 system board ~$1000
Epyc 9124 $1083
192GB registered DDR5 (12x16GB) ~$1000
case, power supply, modest storage: ~$300
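Adding up the approximate prices listed above, and estimating the theoretical peak memory bandwidth, can be done as follows. The memory speed is an assumption (DDR5-4800, which the comment does not state), as is the standard 64-bit data bus per channel; sustained real-world bandwidth will be lower than this peak.

```python
# Approximate prices in USD, copied from the list above.
PARTS_USD = {
    "SP5 system board": 1000,
    "Epyc 9124": 1083,
    "192GB registered DDR5 (12x16GB)": 1000,
    "case, power supply, modest storage": 300,
}
total_usd = sum(PARTS_USD.values())

CHANNELS = 12               # memory channels, as stated in the comment
TRANSFERS_PER_SEC = 4800e6  # assumption: DDR5-4800
BYTES_PER_TRANSFER = 8      # assumption: 64-bit data bus per channel

# Theoretical peak bandwidth in decimal GB/s.
peak_gb_per_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9

print(f"total: ~${total_usd}")
print(f"theoretical peak bandwidth: {peak_gb_per_s:.1f} GB/s")
```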
460GB/s bandwidth from 12 memory channels, 50% more memory and you'd have more than $1000 left over. But >$3000 is not a low price either, it's just lower.
Complaining that people won’t work for you for free is a bit much, don’t you think?
Now it is apparently seen as "working for free for ungrateful people"
Facebook seems to be pretty hands off (as is expected since the code is open source) unless you distribute the model weights and then they drop the dmca banhammer.
So, yeah, simply complaining with no effort to understand the problem is kind of ungrateful.
congratulations, it now works.
If you're not a developer, maybe you'll have to type sudo apt install build-essential first. Congratulations, now you too, a non-developer, are running it locally.
Do you appreciate that people aren't making technical mistakes on purpose just to spite you? Or that maybe some of the folks writing these libraries are experts in fields other than dependency management? Are you an expert in all things? Would you find it helpful if someone identifies one thing that you aren't great at and then calls you names on the internet over that one thing?
There is a pretty significant difference between making a technical critique and just being rude. And being right about the former doesn't make the latter ok.
Have you ever read about how open source project leaders often experience a lot of toxicity and anxiety about trying to keep up with the users they are supporting for free? If not, I suggest you do since this is the exact type of comment that is hurtful and unhelpful.
I was wondering whether it is possible in nixpkgs to create a branch that attempts to match the package versions of specific distributions, especially Ubuntu, as the ML world is mostly using it. My idea is to somehow use the deb package information to “shadow” another distribution.
> First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on your machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the little amount of parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16 core machine.
I understand your comments, including the one above and the one about CUDA binaries. Just one clarification on the concurrency failures in tests: do you mean that running multi-process tests overloads the machine, so that tests then fail due to assumptions made by the package authors?
My main point is that Nix as a system is so incredibly powerful that perhaps there is an ability to “shadow” boring distributions, especially Debian-based ones, in some automated way. Then we would have the best of both: baseline stability from the distribution and the extensibility of Nix.
I've found that quite some test suites have race conditions (e.g. simultaneous modification of files, etc.), which manifest themselves e.g. when a package uses pytest-xdist (and the machine has enough cores).
> My main point is that Nix as a system is so incredibly powerful that perhaps there is an ability to “shadow” boring distributions,
I think things would improve vastly if it was possible to do CUDA builds in Hydra and have the resulting packages in the binary cache. My idea (when I was still contributing to nixpkgs) was to somehow mark CUDA derivations specially, so that they get built but not stored in the binary cache. That would allow packages with CUDA dependencies to get built as well (e.g. PyTorch). Nix would then only have to build CUDA locally (which is cheap, since it only entails unpacking the binary distribution and putting things in the right output paths) and would get everything else through the binary cache (like prebuilt PyTorch). But AFAIR it'd require some coordinated changes between Nix, Hydra, etc.
Then I started working for a company in the Python/Cython ecosystem and quickly found out that Nix is not really viable for most Python development. So I am now just using pyenv and pip, which works fine most of the time (we have some people in our team who are very good at maintaining proper package version bounds).
Ubuntu seems to be winning mindshare across the board. While this would be different from nixpkgs itself, I was wondering whether it is possible to mass-convert deb packages into Nix expressions. Combined with overlays, this would allow rapid incremental testing of marginal modifications to a current distribution’s stack of versions.
A bit like how Nix community has tools on top of the various language packaging systems but this would be a layer on top of the debian packaging standards.
Maybe it’s crazy but just an idea I’ve been having recently and wondering how hard it might be. Importantly debian deb and apt systems are very reproducible and structured which is a good fit for a Nix based layer.
But I also think it's fine for individuals and researchers working in ML to expect some extra compiling, as long as the outcome is reliable. I'm stuck at home this weekend resurrecting an analysis from 10+ years ago, complete with Python, R, Java, and Fortran dependencies^, and I'm definitely wishing I'd known about Nix back then.
^btw, thanks to whomever included hdf5-mpi in Nixpkgs. Your work is greatly appreciated.
If you absolutely must then build it separately and link (or use) that exactly like blender does with their binaries. Campbell (one of the core blender devs) used to love to bump the python version as soon as it was released and if you wanted to do any dev work you’d have to run another python environment until the distro version caught up. Being as I liked to use the fedora libs as a sort of sanity check this was a bit of a hassle to say the least.
That's not much of an issue on Nix. You can just override Python for your particular (machine learning) package set. The rest of the system will continue to use Python unmodified.
That has nothing to do with Python tooling being bad. A safe assumption is that Python package managers are being developed by developers, who have no excuse.
If a C++ codebase developed by scientists had null pointer exceptions in it, then I could excuse things. But if the C++ compiler itself introduced unforced null pointer errors, then it absolutely deserves criticism.
It shouldn't be possible for a ML researcher to use Conda or whatever package manager in a way that despite using a formally specified "requirements.txt", it won't build a week later because of how loose the specification of module versions is allowed to be.
The Python attitude, and more specifically Conda, is at fault here, not the ML researcher trying to get his job done.
> Conda is at fault here
Conda is so so bad. But trying to explain why to people who have fallen into its trap is difficult. People don’t realize the packages are not signed on enough information to reproduce them. The optimizer that finds matching versions to make an environment satisfying your constraints is a really bad idea.
As an experienced C++ developer, unfortunately and fortunately, I’ve concluded the most “correct” solution is to use nixpkgs.
It’s a problem.