OpenLLaMA: An Open Reproduction of LLaMA

OpenLLaMA: An Open Reproduction of LLaMA (github.com)

484 points by sadiq 3 years ago | 180 comments

diimdeep 3 years ago |

To use with llama.cpp on CPU and 8GB RAM

  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
  python3 -m pip install -r requirements.txt

  cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
  python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
  ./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
  ./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock

gigel82 3 years ago | |

You the real MVP!

Though I'm getting this error on an Intel MacBook (Monterey); it works fine on a Windows 11 box:

   python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
   Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
   Traceback (most recent call last):
    File "/l/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
      convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
    File "/l/llama.cpp/convert.py", line 1129, in main
       model_plus = load_some_model(args.model)
     File "/l/llama.cpp/convert.py", line 1055, in load_some_model
       models_plus.append(lazy_load_file(path))
     File "/l/llama.cpp/convert.py", line 857, in lazy_load_file
       raise ValueError(f"unknown format: {path}")
   ValueError: unknown format: models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin

sebastianhoitz 3 years ago | | |

I had the same issue and then noticed that I need git lfs - otherwise just cloning the repo will not download the weights.
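A quick way to check for this before running the converter: without git lfs, each weight file in the clone is a tiny text stub whose first line is the Git LFS pointer header, which would explain the "unknown format" error. The sketch below scans a model directory for such stubs. The function names are made up for illustration; only the header string comes from the Git LFS pointer format.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(model_dir):
    """List files under `model_dir` that are still LFS stubs, not real weights."""
    return sorted(
        p for p in Path(model_dir).rglob("*")
        if p.is_file() and is_lfs_pointer(p)
    )
```

If `find_lfs_pointers("models/open_llama_7b_preview_200bt")` returns anything, installing git lfs and running `git lfs pull` inside the cloned repo should fetch the real files.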

kdtsh 3 years ago | | |

I get the same error on an M-series MacBook (Ventura). However, from the repo's README.md it looks like make should work instead of cmake; I’ll give that a try.

logicchains 3 years ago |

It's not clear from the GitHub page: are there any plans to eventually train the 30 or 65 billion parameter LLaMA models? The 65B model seems comparable to GPT-3.5 for many things, and can run fine on a beefy desktop just on CPU (CPU RAM is much cheaper than GPU RAM). It'd be amazing to have an open source version.

wokwokwok 3 years ago | |

There’s a lot of controversy about “7B is good enough and small enough for consumer hardware, so it’s good enough, full stop”.

…but, although it is true that for a fixed compute budget these small models can have impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that is easily beaten by larger, well-trained models.

It’s just way more expensive to train larger models.

They specifically note they are training a smaller 3B model in the future.

So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be footing the bill for training a larger model.

This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.

b33j0r 3 years ago | | |

Agreed. With some work, 13B runs on consumer hardware at this point. That redefines consumer to a 3090 (but hey, some depressed crypto guys are selling them. I recently got another GPU for my homelab this way).

30B is within reach, with compression techniques that seem to lose very little information of the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important. Assuming an appropriate activation function after this transformation.

No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.

scotty79 3 years ago | | |

Do you know of any research that tries to take a large pre-trained model and make it smaller by cutting out the least-activated neurons, then training it a bit more so it doesn't lose performance?

moffkalast 3 years ago | | |

> They specifically note they are training a smaller 3B model in the future.

They're kidding, right? There's no way that thing will be more useful than one of those flan models.

ummonk 3 years ago | | |

Given inference costs and ability to run on devices, there's an argument to be made for training models that are smaller than Chinchilla-optimal though, especially if you can still eke out improved performance with longer training times.

tarruda 3 years ago | |

I ran the 30B and 65B Q4 on a laptop with 64 GB of RAM (8/16 CPU). It worked, but tokens/s was too low for it to be practically useful.

logicchains 3 years ago | | |

That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5 GHz cores and 256 GB of RAM, I get around 3 tokens/sec, which is usable if not ideal. I wonder if the difference is related to the RAM or the number of CPUs?
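One rough way to think about the RAM-versus-CPU question: when generating one token at a time, a dense model has to read essentially all of its weights from memory per token, so memory bandwidth sets a ceiling on tokens/sec no matter how many cores there are. The sketch below is only that back-of-the-envelope bound. The model size and both bandwidth figures are assumptions for illustration, not measurements of either machine in this thread.

```python
def max_tokens_per_sec(model_gb, mem_bandwidth_gb_s):
    """Upper bound on tokens/sec if every token reads all weights once."""
    return mem_bandwidth_gb_s / model_gb


MODEL_GB = 37  # assumption: roughly a 65B model at ~4.5 bits per weight

# Hypothetical peak memory bandwidths in GB/s; real machines will differ.
for name, bandwidth in [("dual-channel laptop", 50), ("8-channel server", 200)]:
    bound = max_tokens_per_sec(MODEL_GB, bandwidth)
    print(f"{name}: at most ~{bound:.1f} tokens/sec")
```

Under these assumed numbers the server's ceiling is several times the laptop's, which would be consistent with memory channels mattering more than core count, though real throughput also depends on threading, cache behavior, and swapping.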

simion314 3 years ago | | |

Slow can still be useful if you don't need to chat with it: you could code it to do a long-running job, such as reviewing your entire project like a code analysis tool, or summarizing a lot of content.

bagels 3 years ago | | |

How low? I think everybody has different requirements there.

quickthrower2 3 years ago | | |

If I rent an A100 what kind of speed could I expect?

newswasboring 3 years ago | |

At least for now they are focused on 7B and then 3B[1].

[1]https://github.com/openlm-research/open_llama#future-plans

Silverback_VII 3 years ago | |

I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe that these models have a lot of redundant computation and could be a lot smaller without losing quality.

cubefox 3 years ago | | |

The Chinchilla scaling law describes, apart from the training data size, the optimal number of parameters for a given amount of computing power for training. See

https://dynomight.net/scaling/
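For a feel of the numbers, here is a minimal sketch using two common rules of thumb associated with Chinchilla: roughly 20 training tokens per parameter at the compute-optimal point, and training compute of roughly 6 FLOPs per parameter per token. Both constants are approximations assumed here, not exact values from the paper.

```python
# Rule-of-thumb constants (assumptions, not exact fitted values).
TOKENS_PER_PARAM = 20       # compute-optimal tokens per parameter, roughly
FLOPS_PER_PARAM_TOKEN = 6   # training FLOPs ~= 6 * params * tokens


def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params, n_tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


for n in (3e9, 7e9, 65e9):
    d = chinchilla_optimal_tokens(n)
    print(f"{n / 1e9:.0f}B params: ~{d / 1e9:.0f}B tokens, "
          f"~{training_flops(n, d):.1e} FLOPs")
```

By this estimate a 7B model is compute-optimal at around 140B tokens, so training it on 1T tokens goes well past that point, trading extra training compute for a smaller model that is cheaper to run.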

jjice 3 years ago |

Does anyone have any resources they recommend for just understanding the base terminology of models like this? I always see the terms "weights", "tokens", "model", etc. I feel like I understand what these mean, but I have no idea what I need to care about them for in open models like this? If I were to download an open model to run on my machine, would I download the weights? I'm just ignorant in the ML space I guess but not sure where to start.

superpope99 3 years ago |

I'm always curious about the cost of these training runs. Some back of the envelope calculations:

> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run

1 trillion / 1900 = 526315789 chip seconds ~= 150000 chip hours.

Assuming "on-demand" pricing [1] that's about $500,000 training cost.

[1] https://cloud.google.com/tpu/pricing
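The same arithmetic as a small script, so the inputs are easy to change. The per-chip-hour price is an assumption plugged in for illustration; check the pricing page [1] for the current figure.

```python
def training_cost(tokens, tokens_per_sec_per_chip, usd_per_chip_hour):
    """Return (chip_seconds, chip_hours, cost_usd) for one pass over `tokens`."""
    chip_seconds = tokens / tokens_per_sec_per_chip
    chip_hours = chip_seconds / 3600
    return chip_seconds, chip_hours, chip_hours * usd_per_chip_hour


TOKENS = 1e12                      # 1 trillion tokens, as in the estimate above
TOKENS_PER_SEC_PER_CHIP = 1900     # throughput quoted from the README
ASSUMED_USD_PER_CHIP_HOUR = 3.22   # assumption: on-demand TPU v4 price

secs, hours, usd = training_cost(
    TOKENS, TOKENS_PER_SEC_PER_CHIP, ASSUMED_USD_PER_CHIP_HOUR
)
print(f"{secs:,.0f} chip-seconds = {hours:,.0f} chip-hours = ${usd:,.0f}")
```

With the assumed price this lands a little under the $500,000 ballpark; a negotiated or preemptible rate would scale the result linearly.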

p1esk 3 years ago | |

At these levels of spending the actual cost is heavily negotiated and is usually far below the advertised on-demand pricing.

Considering I could negotiate A100 for under a dollar/hr - 8 months ago, when they were in high demand, I wouldn't be surprised if the cost was close to 100k for this training run.

execveat 3 years ago | |

Nobody in their right mind is using GCE for training. Take a look at real prices: https://vast.ai/

simonw 3 years ago | | |

I got the impression that kind of thing (buying time on GPUs hosted in people's homes) isn't useful for training large models, because model training requires extremely high bandwidth connections between the GPUs such that you effectively need them in the same rack.
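A rough sketch of why bandwidth dominates, assuming naive data-parallel training where every worker has to exchange a full copy of the gradients on each step. The gradient precision and both link speeds are hypothetical inputs; real setups use all-reduce, sharding, and overlap of communication with compute, so this is only an order-of-magnitude illustration.

```python
def sync_seconds(n_params, bytes_per_grad, link_gbit_per_s):
    """Time to move one full copy of the gradients over a single link."""
    total_bits = n_params * bytes_per_grad * 8
    return total_bits / (link_gbit_per_s * 1e9)


N_PARAMS = 7e9
BYTES_PER_GRAD = 2  # assumption: 16-bit gradients

# Hypothetical link speeds in Gbit/s.
for name, gbit in [("home uplink", 0.1), ("datacenter interconnect", 400)]:
    t = sync_seconds(N_PARAMS, BYTES_PER_GRAD, gbit)
    print(f"{name} ({gbit} Gbit/s): ~{t:,.2f} s per gradient exchange")
```

With these assumed figures a single exchange takes many minutes over the home link and a fraction of a second over the fast interconnect, and training needs one such exchange per optimizer step.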

qeternity 3 years ago | | |

Anyone training this size of model is almost certainly using AWS/GCE.

The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.

superpope99 3 years ago | | |

Aren't they explicitly using TPUs in their training? Vast AI are only offering GPUs.

bravura 3 years ago | | |

These nodes typically have slow downstream, and thus are hard to use when training requires pulling a huge dataset.

lostmsu 3 years ago | | |

Only 19 GPUs with 30+ GB of VRAM in all of North America.

I might be misreading it. It might be just 12 GPUs.

jeron 3 years ago | | |

also, https://brev.dev/

quickthrower2 3 years ago |

I am quite new to this, I would like to get it running. Would the process roughly be:

1. Get a machine with decent GPU, probably rent cloud GPU.

2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...

3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.

4. Install EasyLM:

    conda env create -f scripts/gpu_environment.yml
    conda activate EasyLM

5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:

    python -m EasyLM.models.llama.llama_serve \
         --mesh_dim='1,1,-1' \
         --load_llama_config='13B' \
         --load_checkpoint='params::path/to/easylm/llama/checkpoint'

Am I even close?

newswasboring 3 years ago |

How is this model performing better than LLaMA on a lot of tasks[1] even though it's trained on a fifth of the data (200 billion vs. 1 trillion tokens)?

[1]https://github.com/openlm-research/open_llama#evaluation

YetAnotherNick 3 years ago | |

They are likely doing some interpolation for 200B or benchmarking it the wrong way. E.g., HellaSwag accuracy for LLaMA 7B is 0.76[1], but the repo lists 0.56. Even at 200B tokens, LLaMA scores higher than 0.56 according to the charts.

[1]: https://arxiv.org/pdf/2302.13971.pdf

byefruit 3 years ago | | |

They ran lm-evaluation-harness on both this model and the original llama weights, which is the correct way to do it.

Many people have been struggling to reproduce the benchmark numbers included in the original llama paper.

slekker 3 years ago | |

Nobody knows :^)

tarruda 3 years ago | |

Maybe it uses a higher quality dataset

logicchains 3 years ago |

Would be very interesting to see https://github.com/BlinkDL/RWKV-LM trained on the same data

leobg 3 years ago | |

Interesting. Have you done anything with RWKV?

vessenes 3 years ago | | |

I evaluated RWKV recently, and it's interesting for sure. It's undertrained and has a quirky architecture, so some parts of it are different from playing with the LLaMA ecosystem. The huge context length is super appealing, and in my tests, long prompts do seem to work and get coherent results.

Where it's slow is in tokenization -- it can be very, very slow to make an initial tokenization of a prompt. I think this has to do with how the network actually functions, like there's a forward loop that feeds each token in to the network sequentially.

I would guess if it had the same level of attention and work that the Llama stack is getting it would be pretty fantastic, but that's just a guess, I'm a hobbyist only.

logicchains 3 years ago | | |

Nope, not yet, the current 14B version is much worse than LLaMA 65B. But there are apparently plans to train a RWKV-65B by the end of the year, and if including the LLaMA training dataset results in something like LLaMA-65B but with infinite context then that'd be really amazing.

Taek 3 years ago |

How is this different from what RedPajamas is doing?

Also, most people don't mind running LLaMA 7B at home so much because of enforceability, but a lot of commercial businesses would love to run a 65b parameter model if possible and can't because the license is more meaningfully prohibitive in a business context. Open versions of the larger models are a lot more meaningful to society at this point.

execveat 3 years ago | |

RedPajama is creating a dataset. This is a permissively licensed model trained on that dataset.

slama 3 years ago | | |

RedPajama is also training both foundation and instruct-tuned models

Source: https://twitter.com/togethercompute/status/16527350961501757...

bradleyjg 3 years ago | |

I agree with this. For a lot of companies hundreds of thousands of dollars or single digit millions on fine tuning, inference, and so on is entirely feasible but using model weights with clouded legal status isn’t.

bluecoconut 3 years ago |

Really exciting how fast fully pre-trained new models are appearing.

Here's another repo (with the same "open-llama" name) that has also been available on Hugging Face for a few weeks (different training dataset).

https://github.com/s-JoL/Open-Llama https://huggingface.co/s-JoL/Open-Llama-V1

LudwigNagasena 3 years ago |

Is anyone familiar with the BOINC-style grid computing scene for ML and, specifically, LLM? Is there something interesting going on, or is it infeasible? Will things like OpenLLaMA help it?

literalAardvark 3 years ago | |

They seem to scale up, not out, so grids don't really work.

What everyone is using are HPC grade low latency interconnects to make the cluster look as close as possible to a single big TPU.

pmoriarty 3 years ago | | |

"They seem to scale up, not out, so grids don't really work."

Can someone explain what this means? I don't understand.

sigmar 3 years ago | |

I haven't looked into it or tried it yet, but there is https://petals.ml/

Eduard 3 years ago |

Can someone explain how to tell if a model doesn't require a GPU and can run on a CPU?

After setting up dalai, OpenAssistant, gpt4all and a bunch of other (albeit nonworking) LLM thingies, my current hunch is:

if the model somewhere has "GGML" in its name, it doesn't require a GPU.

execveat 3 years ago | |

Technically anything that's based on pytorch can run on CPU, you just need to tell it to do so. For example, in textgen add '--cpu' and you're done. It will be super slow though.

GGML format is meant to be executed through llama.cpp, which doesn't use the GPU by default. You can often find these models in a quantized form as well, which helps performance (at a cost of accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements, and for q5_1 for the best quality right now (well, among quantized models).

Oh yeah, textgen supports llama.cpp and also provides an API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:

  pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h...
  pip install -U llama-cpp-python
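To get a rough idea of whether a quantized model fits in RAM, you can estimate the weight size from the parameter count and the bits stored per weight. The bits-per-weight values below are approximations that assume ggml's 32-weight block layouts (quantized values plus per-block scale data); treat them as assumptions, and note that the estimate ignores the KV cache and other runtime overhead.

```python
# Approximate bits per weight (assumed; includes per-block scale overhead).
BITS_PER_WEIGHT = {"f16": 16.0, "q4_0": 4.5, "q5_0": 5.5, "q5_1": 6.0}


def weights_gib(n_params, quant):
    """Estimated size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


for quant in ("f16", "q5_1", "q5_0", "q4_0"):
    print(f"7B at {quant}: ~{weights_gib(7e9, quant):.1f} GiB")
```

By this estimate a 7B model shrinks from around 13 GiB at f16 to under 5 GiB at q5_0 or q4_0, which is why it can run on a machine with 8 GB of RAM.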

martythemaniak 3 years ago |

Has anyone successfully used embeddings with anything other than OpenAI's APIs? I've seen lots of debates on using embeddings vs fine-tuning for things like chatbots on private data, but is there a reason why you can't use both? IE, fine-tune LLaMA on your data, then run the same embeddings approach on top of your own fine-tuned model?

ianpurton 3 years ago |

> We are currently focused on completing the training process on the entire RedPajama dataset.

So that's 1.2 trillion tokens. Nice.

jasonm23 3 years ago |

Forgive my ignorance, but can a model be fine-tuned on a specific codebase after, say, training on all the standard docs for the language, third-party libs, and so on?

I have no formal idea how this is done, but my assumption is that "something like that" should work.

Please disabuse me of any silly ideas.

quickthrower2 3 years ago |

So is this free as in “do what you f’ing like with it”?

mkl 3 years ago | |

Mostly, yes. It's Apache License 2.0: https://github.com/openlm-research/open_llama/blob/main/LICE...

venelin_valkov 3 years ago |

I made a YouTube video on how to run OpenLLaMA on Google Colab with Hugging Face Transformers (using a T4 GPU): https://www.youtube.com/watch?v=1NOPciKuQb8

Hope that helps!

version_five 3 years ago |

Has anyone actually used this? I poked around and it's so poorly documented that I don't see how one can readily, short of trying to go through the code, understand how to do a minimal run.

gigel82 3 years ago | |

I've used it with llama.cpp; results are not great, but not entirely terrible (I'd say somewhere between GPT-2 and GPT-3). Still, totally free and open source is great and I'm looking forward to more development from them (and others building on top like an RLHF / alpaca / chat kind of thing).

version_five 3 years ago | | |

Thanks for answering! In my skim of the thread I only saw people mention trying it with llama.cpp. I tried to get the EasyLM framework going but could not figure out the parameters I needed. Definitely agree it's great to see real open source models being built.

scotty79 3 years ago |

Motivation?

igravious 3 years ago | |

Happily, licensing.

newswasboring 3 years ago | | |

Why the hell would you be happy about duplicate work?

newswasboring 3 years ago | |

Sadly, licensing.

  (modal) fme:/mnt/c/temp/modal$ modal run openllama.py
  Initialized. View app at https://modal.com/apps/ap-9...
  Created objects.
  +-- Created download_models.
  +-- Created mount /mnt/c/temp/modal/openllama.py
  +-- Created OpenLlamaModel.generate.
  +-- Created mount /mnt/c/temp/modal/openllama.py
  Downloading shards: 100%|##########| 2/2 [00:00<00:00, 1733.54it/s]
  Loading checkpoint shards: 100%|##########| 2/2 [00:12<00:00, 6.23s/it]
  Building a website can be done in 10 simple steps:
  1. Choose a domain name.
  2. Choose a web hosting service.
  3. Choose a web hosting package.
  4. Choose a web hosting plan.
  5. Choose a web hosting package.
  6. Choose a web hosting plan.
  7. Choose a web hosting package.
  8. Choose a web hosting plan.
  9. Choose a web hosting package.
  10. Choose a web hosting plan.
  11. Choose a web hosting package.
  12. Choose a web hosting package.
  13. Choose a web hosting package.
  14. Choose a web hosting
  App completed.