OpenLLaMA: An Open Reproduction of LLaMA (github.com)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
python3 -m pip install -r requirements.txt
cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock
Though I'm getting this error on an Intel MacBook (Monterey); it works fine on a Windows 11 box:
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
Traceback (most recent call last):
File "/l/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
File "/l/llama.cpp/convert.py", line 1129, in main
model_plus = load_some_model(args.model)
File "/l/llama.cpp/convert.py", line 1055, in load_some_model
models_plus.append(lazy_load_file(path))
File "/l/llama.cpp/convert.py", line 857, in lazy_load_file
raise ValueError(f"unknown format: {path}")
ValueError: unknown format: models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
…but, although it is true that, for a fixed compute budget, these small models can achieve impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that is easily beaten by larger, well-trained models.
It’s just way more expensive to train larger models.
They specifically note they are training a smaller 3B model in the future.
So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be fielding the cost for training a larger model.
This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.
30B is within reach, with compression techniques that seem to lose very little information of the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important. Assuming an appropriate activation function after this transformation.
No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.
They're kidding right, there's no way that thing will be more useful than one of those flan models.
[1] https://github.com/openlm-research/open_llama#future-plans
> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run
1 trillion / 1900 = 526315789 chip seconds ~= 150000 chip hours.
Assuming "on-demand" pricing [1] that's about $500,000 training cost.
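For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. The token count, throughput, and $500,000 figure are the ones quoted in the thread; the per-chip-hour rate is not a published price, just what that estimate implies when you divide it out.

```python
# Figures quoted in the thread.
TOKENS = 1_000_000_000_000        # "1 trillion" training tokens
TOKENS_PER_SEC_PER_CHIP = 1900    # reported throughput per TPU-v4 chip
QUOTED_COST_USD = 500_000         # the commenter's on-demand estimate

chip_seconds = TOKENS / TOKENS_PER_SEC_PER_CHIP
chip_hours = chip_seconds / 3600

# Assumption: this is only the rate implied by the estimate, not a price list entry.
implied_rate = QUOTED_COST_USD / chip_hours

print(f"{chip_seconds:,.0f} chip-seconds")
print(f"{chip_hours:,.0f} chip-hours")
print(f"${implied_rate:.2f} per chip-hour implied by the estimate")
```

The chip-hour figure comes out slightly under the rounded 150,000 used in the comment, so the estimate is in the right ballpark.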
Considering I could negotiate A100 for under a dollar/hr - 8 months ago, when they were in high demand, I wouldn't be surprised if the cost was close to 100k for this training run.
The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.
I might be misreading it. It might be just 12 GPUs.
1. Get a machine with decent GPU, probably rent cloud GPU.
2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...
3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.
4. Install EasyLM:
conda env create -f scripts/gpu_environment.yml
conda activate EasyLM
5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:
python -m EasyLM.models.llama.llama_serve \
--mesh_dim='1,1,-1' \
--load_llama_config='13B' \
--load_checkpoint='params::path/to/easylm/llama/checkpoint' \
Am I even close?
Many people have been struggling to reproduce the benchmark numbers included in the original LLaMA paper.
Where it's slow is in tokenization -- it can be very, very slow to make an initial tokenization of a prompt. I think this has to do with how the network actually functions, like there's a forward loop that feeds each token in to the network sequentially.
I would guess if it had the same level of attention and work that the Llama stack is getting it would be pretty fantastic, but that's just a guess, I'm a hobbyist only.
Also, most people don't mind running LLaMA 7B at home so much because of enforceability, but a lot of commercial businesses would love to run a 65b parameter model if possible and can't because the license is more meaningfully prohibitive in a business context. Open versions of the larger models are a lot more meaningful to society at this point.
Source: https://twitter.com/togethercompute/status/16527350961501757...
Here's another repo (with the same "open-llama" name) that has been available on hugging face as well for a few weeks. (different training dataset)
https://github.com/s-JoL/Open-Llama https://huggingface.co/s-JoL/Open-Llama-V1
What everyone is using are HPC grade low latency interconnects to make the cluster look as close as possible to a single big TPU.
Can someone explain what this means? I don't understand.
After setting up dalai, OpenAssistant, gpt4all and a bunch of other (albeit nonworking) LLM thingies, my current hunch is:
if the model somewhere has "GGML" in its name, it doesn't require a GPU.
GGML format is meant to be executed through llama.cpp, which doesn't use the GPU by default. You can often find these models in quantized form as well, which helps performance (at a cost in accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements; look for q5_1 for the best quality right now (well, among quantized models).
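To make the speed/accuracy trade-off concrete, here is a rough sketch of block-wise 4-bit quantization in NumPy. It is a simplified illustration, not the actual q4_0 or q5_1 file layout used by llama.cpp; the function names, the block size of 32, and the symmetric rounding scheme are all assumptions made for the example.

```python
import numpy as np

def quantize_blocks(weights, block_size=32, bits=4):
    """Symmetric block-wise quantization: one float scale per block plus small ints."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid dividing by zero on all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    """Recover approximate float weights from the ints and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)         # stand-in for one weight tensor
q, s = quantize_blocks(w)
w_hat = dequantize_blocks(q, s)
max_err = float(np.abs(w - w_hat).max())
print("largest absolute rounding error:", max_err)
```

The rounding error per weight is bounded by half the block's scale, which is why more bits (or a smarter per-block encoding) recover quality at the cost of size.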
Oh yeah, textgen supports llama.cpp, and also provides API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:
pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h...
pip install -U llama-cpp-python
So that's 1.2 trillion tokens. Nice.
I have no formal idea how this is done, but my assumption is that "something like that" should work.
Please disabuse me of any silly ideas.
Hope that helps!
I have felt the same in the past, related to a completely different topic. I know how it feels: it's like people are not calling things what they are, just using weird words.
"weights" - synapses in the AI brain
"tokens" - word fragments
"model" - of course, the model is the AI brain
"context" - the model can only handle a piece of text, can't put whole books in, so this limited window is the context
"GPT" - predicts the next word, trained on everything; if you feed its last predicted word back in, it can write long texts
"LoRA" - a lightweight plug-in model for tweaking the big model
"loss" - a score telling how bad the output is
"training" - change the model until it fits the data
"quantisation" - making a low precision version of the model because it still works, but now is much faster and needs less compute
"embedding" - just a vector, it stands for the meaning of a word token or a piece of image; these embeddings are learned
It's like generating code in a language that you know nothing about. You should check for bugs, but you can't.
I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly.
My advice to folks is, if you actually want to know how this stuff works at some basic level, put in some time learning how basic linear and logistic regression work, including how to train it using back propagation. From there you'll have a solid foundation that gives enough context to understand most deep learning concepts at a high level.
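As a starting point for that kind of study, here is a minimal sketch of logistic regression trained by gradient descent in NumPy. The toy data set, learning rate, and step count are arbitrary choices for illustration; the gradient lines are the same chain-rule step that backpropagation repeats layer by layer in a deeper network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up for the example): the label is 1 when x0 + x1 > 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.5          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    p = sigmoid(X @ w + b)        # forward pass: predicted probabilities
    grad_z = (p - y) / len(y)     # gradient of mean cross-entropy loss w.r.t. the logits
    w -= lr * (X.T @ grad_z)      # chain rule back to the weights
    b -= lr * grad_z.sum()        # and to the bias

accuracy = float(((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean())
print("training accuracy:", accuracy)
```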
when it can hallucinate content, why do that instead of reading a blog post from an expert?
After going through this series I can say I basically understand weights, tokens, back-propagation, layers, embeddings, etc.
Just curious, didn't see any date...
A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.
Tokens are sort of "words" in a sentence, but the ML may be translating the word itself into a more abstract concept in 'word space': eg, a bunch of floating point values.
At least some of what I just said is probably wrong, but now someone will correct me and we'll both be more right!
> A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.
Because data doesn't really flow through weight matrices, though perhaps this is true if you squint at very simple models. Deep learning architectures are generally more complicated than multiplying values by weights and pushing the results to the next layer, though which architecture to use depends heavily on context.
> Tokens are sort of "words" in a sentence
Tokens are funny. What a token is depends on the context of the model you're using, but generally a token is a portion of a word. (Why? Efficiency is one reason; handling unknown words is another.)
To learn more deeply though, get started with getting it to work and when you are curious or something doesn't work, try to understand why and recursively go back to fill in the foundational details.
For example, download the code and try to get it to work. Why is it not working? Oh, it's trying to look for the model. Search for how to get the model and set it up. Then the key step: recursively look up every single thing in the guide or setup. Don't try to set something up or fix something without truly understanding what it is you are doing (e.g. by copying and pasting). This gives you a structured way to fill in the foundations of what you are trying to get to work in a more focused and productive manner. At the end you might realize that their approach or yours is not optimal: "oh, it was telling me to download the 65B model when I can only run 7B on my machine because ..."
If you want access to a serious GPU or TPU, then the sensible solution is to rent one in the cloud. If you just want to run smaller versions of these models, you can achieve impressive results at home on consumer grade gaming hardware.
The FastChat framework supports the Vicuna LLM, along with several others: https://github.com/lm-sys/FastChat
The Oobabooga web interface aims to become the standard interface for chat models: https://github.com/oobabooga/text-generation-webui
I don't see any indication that OpenLLaMA will run on either of those without modification. But one of those, or some other framework, may emerge as a de facto standard for running these models.
https://github.com/modal-labs/modal-examples/blob/main/06_gp...
Refined training usually means updating the weights of what's called a foundational model using a large amount of well-structured data. It's very expensive and can disrupt the usefulness of having all the generalizations baked in from the training data [1].
While LLMs can generate text based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. But I do think they hold a lot of promise in reasoning.
Small corollary: LLMs do not know ahead of time what they are generating. Secondly, they use the input from you and from themselves to drive the next message.
This sets us up for a strategy called in-context learning [1]. We take advantage of the above corollary and prime the model with context to drive the next message. In your case, a query about some specific code base with knowledge about standard docs etc.
Only there is a big problem, context sizes. Damn. 4k tokens?
We can be clever about this but there is still a lot of work and research needed. We can take all that code and standard docs and create embeddings of them [2]. Embeddings are mathematical representations of words or phrases that capture some of their semantic meaning. Basically the state of a trained neural network given inputs.
This will allow us to group similar words and concepts together closer in what is called a vector space. We can then do the same for our query and iterate over each pair finding the top-k or whatever most similar pairs. Many ways to find the most similar pairs but what's nice is cosine similarity search. Basically a fancy dot product of the pairs with a higher score indicating greater similarity. This will allow us to prime our model with the most "relevant" information to deal with the context limit. We can hope that the LLM would reason about the information just right and voila.
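The retrieval step described above can be sketched in a few lines of NumPy. The tiny 3-dimensional vectors stand in for real embeddings, and `top_k_cosine` is a hypothetical helper name; in practice the vectors would come from an embedding model and the search would usually go through a vector index rather than a brute-force loop.

```python
import numpy as np

def top_k_cosine(query, docs, k=2):
    """Return (index, score) for the k documents most similar to the query."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity = dot product of unit vectors
    order = np.argsort(-scores)[:k]       # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Made-up embeddings for four code/doc snippets.
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_embedding = np.array([1.0, 0.0, 0.1])

hits = top_k_cosine(query_embedding, doc_embeddings, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

The snippets behind the returned indices are what would get pasted into the prompt, so the choice of k is effectively a budget against the context limit.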
So yeah, basically create a fancy information retrieval system that picks the most relevant information to give your model to reason about (basically this [3]). That also skirts around the context limitations while not overfitting and narrowing the training information that allows the model to reason (controversial).
1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf
2: Embeddings https://arxiv.org/pdf/2201.10005.pdf
3: https://twitter.com/marktenenholtz/status/165156810719298355...
Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
Traceback (most recent call last):
File "convert-pth-to-ggml.py", line 11, in <module>
convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
File "/Volumes/mac/Dev/llama.cpp/convert.py", line 1145, in main
model_plus = load_some_model(args.model)
File "/Volumes/mac/Dev/llama.cpp/convert.py", line 1071, in load_some_model
models_plus.append(lazy_load_file(path))
File "/Volumes/mac/Dev/llama.cpp/convert.py", line 865, in lazy_load_file
return lazy_load_torch_file(fp, path)
File "/Volumes/mac/Dev/llama.cpp/convert.py", line 737, in lazy_load_torch_file
model = unpickler.load()
TypeError: 'staticmethod' object is not callable
I assume that it's completely impractical to train on distributed systems?
A quick, very unscientific test using oobabooga/text-generation-webui with some models I tried earlier gives me:
* oasst-sft-7-llama-30b (spread over 4x GPU): Output generated in 28.26 seconds (5.77 tokens/s, 163 tokens, context 55, seed 1589698825)
* llama-30b-4bit-128g (only using 1 GPU as it is so small): Output generated in 12.88 seconds (6.29 tokens/s, 81 tokens, context 308, seed 1374806153)
* llama-65b-4bit-128g (only using 2 GPU): Output generated in 33.36 seconds (3.81 tokens/s, 127 tokens, context 94, seed 512503086)
* llama (vanilla, using 4x GPU): Output generated in 5.75 seconds (4.69 tokens/s, 27 tokens, context 160, seed 1561420693)
They all feel fast enough for interactive use. If you do not have an interface that streams the output (so you can see it progressing) it might feel a bit weird if you often have to wait ~30s to get the whole output chunk.
Or a beefy MacBook Pro. I recently bought one with 64gb of memory and Llama 65B infers very promptly as long as I'm using quantized weights (and the Mac's GPU).
But I’m waiting until my friends can afford it. Right now (which at this pace might mean I change my mind tonight) I am earnestly studying how to make this a thing anyone can install as part of a product they can use without a subscription.
When doing quick estimates, I just assume every syllable is a token. It tends to overestimate, which is fine for my OOM mitigation purposes.
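A crude version of that estimate might look like the sketch below. It is not a real tokenizer: it treats each run of vowels as one syllable and counts each run of digits as a single token, and `estimate_tokens` is a hypothetical helper name. Real tokenizers will give different numbers.

```python
import re

def estimate_tokens(text):
    """Rough token estimate: one token per vowel group (a crude syllable proxy)."""
    words = re.findall(r"[A-Za-z']+|\d+", text)
    # Every word counts at least once, so numbers and vowel-less words are not dropped.
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)

prompt = "Building a website can be done in 10 simple steps:"
estimate = estimate_tokens(prompt)
print("estimated tokens:", estimate)
```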
They really have come a long way since... A few weeks ago.
Using cublas with my 1080ti results in a 52% speedup compared to cpu-only. Vram usage is very minimal.
If you don't trust its memory, copy a piece of high quality text in the topic of interest inside the context, as reference.
And getting rid of the NC clause of the original LLaMAs too, of course.
As of right now, there's trouble replicating the eval results of the paper, for example.
I don’t think it will cost me much to not use the explicitly-not-a-search-engine thing as a search engine.
Which LLM will you use to verify that ChatGPT is more knowledgeable than human experts on a given topic?
They are both insanely powerful tools, and like most insanely powerful tools, the hazards are considerable.
(modal) fme:/mnt/c/temp/modal$ modal run openllama.py
Initialized. View app at https://modal.com/apps/ap-9...
Created objects.
+-- Created download_models.
+-- Created mount /mnt/c/temp/modal/openllama.py
+-- Created OpenLlamaModel.generate.
+-- Created mount /mnt/c/temp/modal/openllama.py
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 1733.54it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.23s/it]
Building a website can be done in 10 simple steps:
1. Choose a domain name. 2. Choose a web hosting service. 3. Choose a web hosting package. 4. Choose a web hosting plan. 5. Choose a web hosting package. 6. Choose a web hosting plan. 7. Choose a web hosting package. 8. Choose a web hosting plan. 9. Choose a web hosting package. 10. Choose a web hosting plan. 11. Choose a web hosting package. 12. Choose a web hosting package. 13. Choose a web hosting package. 14. Choose a web hosting
App completed.
2-3c per run seems very high. That's probably just the cost if you have to spin up a new container. You can shorten the idle timeout on a container if it's typically going to serve just one request. If it's going to serve more requests, then the startup and idle shutdown cost is amortized over more requests :)
In a typical fully connected hidden layer, each neuron needs the values of all the neurons in the previous layer, so you need all the data in one place. Obviously you can distribute the actual calculations, which is what a GPU does, but distributing that over networked CPUs will be incredibly slow and require the whole thing to be loaded into memory on all instances.
My bet is on some kind of light based or analog electric accelerator PCIE card to be the next best thing for this sort of inference, since it should be able to calculate multiple layers at once. FPGAs also work but only for fixed weights.
Out=lots of machines through network
Confabulation is the unintended generation of false memories.
Hallucination is false perception.
Clearly, the phenomenon that LLM researchers call hallucination better fits confabulation.
It's not perceiving reality incorrectly, it's presenting wholesale fiction as fact both coherently and with absolute confidence. It even forges supporting documentation ad-hoc.
GPT is not a poor schizophrenic suffering from delusions or innocuous "hallucinations." It is the world's most advanced liar.
IMO, so long as you're aware the information is often subtly wrong, it's not that different from, e.g., physics classes progressively lying to you less to allow your brain to build a framework to house the incoming ideas.
The latest Genoa has 12-channel DDR5-4800 support (and boosted AVX-512) and I'd imagine should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).
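A back-of-the-envelope check of that suggestion, as a sketch. The assumptions are loud ones: plain 4-bit weights with no per-block scale overhead, no allowance for the KV cache or activations, bandwidth rounded up to 1000 GB/s, and every weight read exactly once per generated token. Real numbers will be lower.

```python
PARAMS = 65e9             # 65B parameters
BITS_PER_WEIGHT = 4       # assumption: plain 4-bit weights, ignoring quantization metadata
GPU_MEMORY_GB = 24        # per RTX 3090, as quoted above
NUM_GPUS = 2
BANDWIDTH_GB_S = 1000     # "just shy of 1TB/s", rounded up

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
fits = weights_gb <= GPU_MEMORY_GB * NUM_GPUS

# If the layers are split across the cards and run one after the other, each token
# still streams all the weights once, so bandwidth divided by size bounds tokens/s.
tokens_per_s_bound = BANDWIDTH_GB_S / weights_gb

print(f"weights: {weights_gb:.1f} GB, fits in {GPU_MEMORY_GB * NUM_GPUS} GB: {fits}")
print(f"bandwidth-bound ceiling: about {tokens_per_s_bound:.0f} tokens/s")
```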
With my LLaMA AVX implementation on 32-bit floats [0] there is no performance gain after 2 threads, so the remaining 14 available threads are of no use; there is no memory bandwidth left to load them with work :)
The primary bottleneck for now is compute.
They've recently made a big improvement to performance by introducing partial gpu acceleration if you compile with a gpu accelerated variant of BLAS. Either cublas (Nvidia) or CLBlast (slightly slower but supports almost everything: Nvidia, Apple, AMD, mobile, raspberry pi etc)
When did you test this? Maybe llama.cpp had some improvements since I used it (which was at the start of the project).
These are worse as they imply the thing generating the words knows the truth and purposely says something else.
An LLM is just doing next token prediction. It's a mathematical process. It's not trying to "hide" the truth from you.
Lies, BS, and con artistry all require conscious motive and intent. That’s a bridge too far, for me, in ascribing ‘intelligence’ to these models.
Hallucination, to me, conveys ‘seeing things (facts) that are not there’. To the extent the models are ‘perceiving’, they ARE perceiving reality incorrectly. Granted, I expect many times it’s because the source of the model training data are, at best, just wrong or are lying.
Besides,
> it's presenting wholesale fiction as fact both coherently and with absolute confidence
That is not in any way distinct from perceiving reality incorrectly. It is a symptom common to both skilled lying and hallucination.
You split the big matrices into smaller matrices to dispatch the workload. But this means you have to add some communication overhead (roughly n_layers sequential synchronisation points per token). In the official LLaMA implementation this is done transparently using RowParallelLinear, ColumnParallelLinear and ParallelEmbedding; see https://github.com/facebookresearch/llama/blob/main/llama/mo...
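Here is a small NumPy sketch of that matrix-splitting idea. It only mimics the concept behind ColumnParallelLinear and RowParallelLinear on a single machine; it is not the code from the LLaMA repository, and the shapes, the ReLU, and the two-worker split are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))        # one token's activations
W1 = rng.normal(size=(8, 16))      # first linear layer
W2 = rng.normal(size=(16, 8))      # second linear layer

# Single-device reference: linear -> ReLU -> linear.
reference = np.maximum(x @ W1, 0) @ W2

n_workers = 2
W1_shards = np.split(W1, n_workers, axis=1)   # column-parallel: each worker owns some output columns
W2_shards = np.split(W2, n_workers, axis=0)   # row-parallel: each worker owns the matching input rows

# Each worker computes its slice with no communication in between...
partials = [np.maximum(x @ a, 0) @ b for a, b in zip(W1_shards, W2_shards)]

# ...and one sum across workers (the all-reduce) is the synchronisation point.
combined = sum(partials)

print(np.allclose(reference, combined))
```

Pairing a column split with a row split is what keeps the communication down to a single reduction for the two matrix multiplies.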
Transformers have multiple attention heads that can be computed independently and then summed together to produce the output of the layer. This allows the parameter space to be split among machines without having to transfer the parameters at each iteration.
In practice, they typically use servers with clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take a few?). This allows even the biggest models to be trained on large batches of several hundreds, or even thousands, of elements (the total memory usage is _not_ proportional to the product of number of parameters and the batch size, but it does increase as a function of both of them, a term of which being indeed the product of the two). It makes for some very tricky engineering choices to make just the right data travel across connections, trying to avoid as much as possible that you have to sync large amount of data between different machines (so "chunking" things to stay on the 640GB range) with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make physical connections as fast as possible...
To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-)
I saw a reference that said GPT-3, with 96 decoder layers, was trained on a 400 GPU cluster, so that seems like the ballpark for a 175B parameter model. That's 50 of the hypothetical machines we talked about (well .. really 100 for GPT-3 since back in those days, max was 40 or 48 GB per GPU).
I also wonder why NVIDIA (or Cerebras) isn't beefing up GPU memory. If someone sold a 1TB GPU, they could charge 100 grand, easy. As I understood it, NVIDIA's GPU memory is just HBM-6 .. so they'd make a profit?
That's absolutely nuts. That's basically the entire capital cost of an 8x A100 hyperplane from LambdaLabs [1] plus power for a year plus administration! What's the point of cloud hardware if you're paying for everything reserve anyway?
Roughly the same setup costs $12/hour at Lambda if you're lucky enough to snag one so it looks like demand for 8x A100 is so high that you basically have to pay AWS for an entire pod to get access to one, unless you want to pay $40 per hour (!!!)
[1] https://shop.lambdalabs.com/deep-learning/servers/hyperplane...
For multiple GPUs there are quite a few ways to improve memory footprint and speed: https://huggingface.co/docs/transformers/perf_train_gpu_many Although I'm not sure if the implementations in HuggingFace are really on par with the SOTA methods (they shouldn't be far away in any case). I guess they should be at least on par, if not better, with whatever OpenAI used for GPT-3 back then, things evolving so quickly in this realm...
On the last point, I can only assume there are some hard thresholds which are difficult to overcome in order to add more memory, otherwise they would. Even an 80GB GPU was unthinkable a dozen years ago; before the deep learning explosion, around 2GB was the norm. A couple of years ago, when 16GB or 32GB was the best you'd get from Nvidia, AMD did come out with consumer-grade GPUs having significantly larger memory (maybe 48GB back then? I can't remember), which could have stirred the market a bit I guess, but it didn't pick up for deep learning (I suspect mostly due to the lack of an equivalent to cuDNN / CUDA, which makes it possible to "easily" build deep learning frameworks on top of the GPUs).
My take on this is, if there's a competitor who fights hard to regain market share, and bets big on offering more memory, and still the best it comes up with is just a couple of times more than what the others have, it must be not as easy as "let's stick another bank of memory here and sell it", or they would have...?
I also think people who say that search engines lie are seriously overestimating the number of lies returned in search results. Social media is one thing, but the broader internet is filled with articles from relatively reputable sources. When I Google "what is a large language model" my top results (there aren't even ads on this particular query to really muddle things) are:
1. Wikipedia
Sure this is the most obvious place for lies but we already understand that. Moreover, the people writing the text have some notion of what is true and false unlike an LLM. I can always also use the links it provides.
2. Nvidia
Sure they have a financial motive to promote LLMs but I don't see a reason they have to outright mislead me. They also happen to publish a significant amount of ML research so probably a good source.
3. TechTarget
I don't know this source well but their description seems to agree deeply with the other two so I can be relatively sure on both this and the others' accuracy. It's a really similar story with Bing. I can also look for sources that cite specific people like a sourced Forbes article that interviews people from an LLM company.
With multiple sources, I can also build a consensus on what an LLM is and reach out further. If I really want to be sure I can type a site:edu to just double check. When I have the source and the text I can test both agreement with consensus and weigh the strength of a source. I can't do that with an LLM since it's the same model when you reprompt. I get that LLMs can give a good place to begin by giving you keywords and phrases to search but it's a really, really poor replacement for search or for learning stuff you don't have experience in.
There is a rather substantial difference between a search engine, which suggests sources which the reader can evaluate based on their merits, and a language model, whose output may or may not be based on any sources at all, and which cannot (accurately) cite sources for statements it makes.
> Similar degrees of caution and skepticism must be applied to results from both ML and traditional search engines.
This is a fairly ridiculous statement.
Really? Have you used Google lately -- say, in the past 6-12 months?
If a person is in the habit of using a search engine like a chat bot by typing in questions AskJeeves-style and then believing what text pops up in the info cards above the ads (which are themselves above the search results), I could see how the distinction between chat bots and search engines could seem trivial.
The similarity between chat bots and search engines breaks down significantly if the user scrolls down past the info cards and ads and then clicks on a link to an external website. At that point in the user experience it is no longer like chatting with a confident NPC.
This is a weird thing to write to a stranger. I suppose there will be no need to caution people about rudeness or making strange assumptions in the utopian future where humans only talk to chatbots, though.
Of course, it will be trivial for such bots to emulate humans if they find that useful.
Fun times.
"I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly."
:)
Thank goodness that I didn’t do that, I’d certainly have egg on my face if I hadn’t included myself in the joke and somebody called me out on it!