Releasing 3B and 7B RedPajama (together.xyz)
The one-l lama,
He's a priest.
The two-l llama,
He's a beast.
And I will bet
A silk pajama
There isn't any
Three-l lllama.
As a non-native English speaker (though also a parent of a toddler), I wasn't familiar with the book series.
There's really only one thing I care about: How does this compare to GPT-4?
I have no use for models that aren't at that level. Even though this almost definitely isn't at that level, it's hard to know how close or far it is from the data presented.
The big story here for me is that the difference in training set is what makes the difference in quality. There is no secret sauce, the open source architectures do well, provided you give them a large and diverse enough training set. That would mean it is just a matter of pooling resources to train really capable open source models. That makes what RedPajama is doing, compiling the best open dataset, very important for the future of high quality open source LLM’s.
If you want to play around with this yourself you can install oobabooga and figure out what model fits your hardware from the locallama reddit wiki. The llama.cpp 7B and 13B models can be run on CPU if you have enough RAM. I’ve had lots of fun talking to 7B and 13B alpaca and vicuna models running locally.
It's really fun to enable both the whisper extension and the TTS extension and have two-way voice chats with your computer while being able to send it pictures as well. Truly mind bending.
Quantized 30B models run at acceptable speeds on decent hardware and are pretty capable. It's my understanding that the open source community is iterating extremely fast on small model sizes getting the most out of them by pushing the data quality higher and higher, and then they plan to scale up to at least 30B parameter models.
I really can't wait to see the results of that process. In the end you're going to have a 30B model that's totally uncensored and is a mix of Wizard + Vicuna. It's going to be a veryyyy capable model.
Bigger ones as well, you just have to wait longer. Nothing for real time usage, but if you can wait 10-20 minutes, you can use them on CPU.
For example a therapist, a search bot for your diary, a company intranet help bot. Anything where the prompt contains something you don’t want to send to a third party.
Thanks!
Assume a truly competitive model in the Open Source world is still a ways off. These teams and their infrastructure are still in their early days while OpenAI is more at the fine-tuning and polishing stage. The fact that these open teams are able to have something in the same universe in terms of functionality this fast is pretty amazing... but it will take time before there's an artifact that will be a strong competitor.
I'll give you the answer for every open source model over the next 2 years: It's far worse
I suspect Open Source LLMs will outpace the release version of GPT-4 before the end of this year.
It's less likely they will outpace whatever version of GPT-4 is shipped later this year, but still very much possible.
Open source models can already approximate GPT-3.5 for most tasks on common home hardware, right now.
On one hand, the resources required to run these models continue to fall dramatically, thanks to the techniques discovered by researchers: GPTQ quantizing down to 4, 3, 2, even 1 bits! model pruning! hybrid vram offloading! better, more efficient architectures! 1-click finetuning on consumer hardware! Of course, the free lunches won't last forever, and this will level off, but it's still incredible.
And on the other side of the coin, the power of all computing devices continues its ever-upward exponential growth.
So you have a continuous lowering of requirements, combined with a continuous increase in available power... surely these two trends will collide, and I can only imagine what this stuff will be like at that intersection.
As the resources required to train and fine-tune these models become consumer-hardware friendly, I think we'll see a shift towards a bunch of smaller models. Open models like these also mean the results of security and capability research are publicly available. Models like this one and the Replit code model will become the new base all open source models are based on. I am really looking forward to the GPT-J 4-bit, CUDA-optimized 7B models; the others I have tested run fast on a 2070 Max-Q and 16 GB of RAM, where I was getting ~7 tokens/second. LoRA can work directly with 4-bit quantized models. While ggml CPU models are very strong, I don't believe we'll move away from GPU-accelerated training and fine-tuning anytime soon.
LLaMA’s main issue is that its license prevents commercial use.
If you want to use an LLM inside of a product, you may need to internationalize it at some point, so multilingual support matters.
Let's wait for someone to port it to a cheaper and more powerful C-based engine like llama-cpp.
build a model that can change the number of parameters in the vicinity of some meaning, effectively increasing the local resolution around that meaning
so parameter space becomes linked-parameter space, between models
links could be pruned based on activation frequency
another way of seeing the concept is a tree of models/llms
and one additional model/llm that all it does is manage the tree (ie. build it as it goes, use it to infer, prune it, etc)
Or is it too dumb what I’m saying?
The 3B model, being super fast and accessible, is a game changer for a lot of us who may not have the latest hardware. I mean, running on an RTX 2070 that was released 5 years ago? That's pretty cool.
As for the 7B model, it's great to see that it's already outperforming the Pythia 7B. The bigger dataset definitely seems to be making a difference here. I'm eager to see how far this project goes, and what kinda improvements we can expect in the coming weeks with the new RedPajama dataset they're working on.
One thing I found interesting is the mention of differences between the LLaMA 7B and their replication. I'd love to learn more about those differences, as it could shed light on what's working well and what could be improved further.
Furthermore, model size is still the most significant contributor to output quality. E.g. vanilla llama-30b at 4-bit has better perplexity than any llama-13b finetune at 8-bit. Thus, if 4-bit lets you fit a larger model into available (V)RAM, you're still better off.
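To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes nominal parameter counts (13, 30, 65 billion) and counts only the weights, ignoring activations, KV cache, and quantization metadata, so real usage will be somewhat higher. The function name is made up for illustration.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal sizes only; these are assumptions, not measured footprints.
for name, params, bits in [
    ("llama-13b @ 8-bit", 13, 8),
    ("llama-30b @ 4-bit", 30, 4),
    ("llama-65b @ 4-bit", 65, 4),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 4-bit 30B model needs only slightly more memory than an 8-bit 13B model, which is why the larger model at lower precision is a realistic option on the same hardware.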
This is also why analog computing is seriously considered as a hardware architecture for LLMs: if you don't actually need bit-perfect matmul for things to work well, it can be done much simpler as an analog circuit, and then you can cram a lot more of them on the same chip. Any resulting quality loss would presumably be minor, and in any case would be more than compensated by the much larger model sizes allowed by such architecture.
The weights scale the output values from the previous layer, and the weighted values are summed. So it seems to me, instead of having a high-precision weight scale a single output, if you cloned the node in the previous layer M times, you could still have sqrt(M) bits of precision with 1-bit weights (or M bits, my brain is in weekend mode).
Thus a larger network with lower-precision weights should have the ability to have approximately the same precision as a smaller network with high-precision weights.
The larger network has more interconnects though, so seems like it could allow for more interesting space to explore during training, leading to better results.
Then again, I could be entirely wrong.
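A quick way to check the "how many bits" question is to count how many distinct sums M one-bit weights can produce. This is only a toy model: it assumes every cloned node carries the same input value and that each weight is either -1 or +1, which is not how a real network behaves.

```python
import math


def distinct_levels(m: int) -> int:
    """Count distinct sums of m weights that are each -1 or +1."""
    sums = {0}
    for _ in range(m):
        sums = {s + w for s in sums for w in (-1, 1)}
    return len(sums)


for m in (1, 3, 7, 15, 255):
    levels = distinct_levels(m)
    print(f"M={m:3d}: {levels} levels ~ {math.log2(levels):.1f} bits")
```

In this simplified setting M clones give M + 1 levels, so the effective precision grows like log2(M + 1) bits rather than sqrt(M) or M: matching one 8-bit weight would take 255 one-bit clones.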
We’re finding out that many models are undertrained for their sizes, and a good option is to post process them into smaller models by teaching a smaller model to mimic their output. Quantization effectively cuts down the model size as well. No loss in quality means that the model has not been trained enough to take advantage of the depth of precision that is available.
We can use GPS to locate anything down to a sliding scale of decimal precision. There are only so many digits you need to locate a city or even a house.
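The GPS analogy can be made concrete with a little arithmetic. The sketch below assumes roughly 111 km per degree of latitude (a common approximation) and looks only at north-south distance.

```python
KM_PER_DEGREE_LAT = 111.0  # rough figure, assumed for illustration


def resolution_m(decimal_places: int) -> float:
    """Approximate north-south distance covered by one step of the last digit."""
    return KM_PER_DEGREE_LAT * 1000 / 10 ** decimal_places


for places in range(0, 7):
    print(f"{places} decimal places: ~{resolution_m(places):,.2f} m")
```

Two or three decimal places already narrow things down to a city or a neighbourhood, and around five get you to a house; extra digits beyond that add precision nobody uses, which is the point being made about weights.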
I played with a pirated 7B model a while back. My computer runs a 1080 TI - so it used to be good but now it's pretty old. The model ran with a reasonable number of tokens/sec, but the quality was just trash compared to what I'd grown used to with ChatGPT. It was a novelty I interacted with for just a single evening.
I truly don't understand the use case for a 3B model with our current technologies.
What are you going to use it for?
Also, ChatGPT just can't do a lot of things because of their "rules". I was doing question answering about products on Amazon with ChatGPT and it refused to answer any questions about underwear, certain books/videos, etc.
Would the way the m2 MacBooks share memory be an advantage, or would the lack of cuda support be a killer? Can you do anything with 16GB, or do you need 128gb or something like that? How large are the datasets?
I've only used scikit-learn and pandas so far, I'm not very familiar with neural networks yet
Sure, you may have played with a 7B model in the past, but that doesn't mean there's no use case for a smaller model like the 3B. In fact, having a performant, smaller model is a game changer for a lot of applications that don't require the massive scale of the larger models. Plus, smaller models are generally faster and more accessible, which is always a plus.
I find it very uncanny to see comments like this that sound like ChatGPT but are surprisingly relevant to the discussion.
But the actual model architecture is slightly different, based on Pythia
I guess what is needed is a pythia.cpp https://github.com/ggerganov/llama.cpp/issues/742
That's exactly the core of the email that leaked out of Google: it's proving far better to be able to have lots of people iterating quickly (which necessarily means broad access to the necessary hardware) than to rely on massive models and bespoke hardware.
I'd anticipate something along the lines of a breakthrough in guided model shrinking, or some trick in partial model application that lets you radically reduce the number of calculations needed. Otherwise whatever happens isn't as likely to come out of the open source LLM community.
Very true, but can't Google just wait and take from the open-source-LLM community the findings, then quickly update their models on their huge clusters? It's not like they will lose the top position, already done that.
So do they use the weights that are say 32 bit floats and just round them to the nearest something putting them in a range 0-255? I guess I can see how it could work if weights are all close to zero, so -1 to 1 is mapped to 0-255.
But I would have thought the model relied on the higher accuracy during training. So losing that would screw it up.
>So do they use the weights that are say 32 bit floats and just round them to the nearest
That's how they used to do it, and still how 8bit quantization works. That's called "Round to Nearest" or RTN quantization. That's not how it works anymore though.
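For reference, a minimal sketch of round-to-nearest quantization as described in the question: floats are mapped onto the integers 0-255 using the tensor's own min and max. The function names and sample weights are made up, and real 8-bit schemes typically work per row or per block rather than over one flat list.

```python
def quantize_rtn(weights, bits=8):
    """Map floats onto integers 0..2**bits - 1 by rounding to the nearest level."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


weights = [-1.0, -0.42, 0.0, 0.003, 0.37, 1.0]  # made-up example values
codes, scale, lo = quantize_rtn(weights)
restored = dequantize(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(f"max round-trip error: {max_err:.5f} (step size {scale:.5f})")
```

The round-trip error is bounded by half a step, so at 8 bits over a range of -1 to 1 each weight moves by at most about 0.004.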
The current algorithms (GPTQ, RPTQ, etc.) are more complex, including things like lining up the weights in order of least to greatest, placing them in bins (typically 32 or 128 weights per bin), and then computing an offset for each bin which is added to the RTN value. In some cases bins are identical and redundant and can be re-used without saving the same identical bin twice. These are just a few of the space saving measures which go into effective low-bit quantization without sacrificing quality.
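The binning idea can be illustrated on its own. The sketch below is not GPTQ or any published algorithm: it only shows, on made-up and already sorted data, how giving each bin of 32 weights its own offset and scale reduces 4-bit rounding error compared with one global range.

```python
def quantize_group(group, bits):
    """Round-to-nearest within one group, using that group's own offset and scale."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    # Return the dequantized values so the error can be measured directly.
    return [round((w - lo) / scale) * scale + lo for w in group]


def mean_abs_error(weights, bits, group_size):
    """Quantize in consecutive groups and report the mean absolute error."""
    restored = []
    for i in range(0, len(weights), group_size):
        restored += quantize_group(weights[i:i + group_size], bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# 64 small weights followed by 64 larger ones (made-up, sorted ascending)
weights = [0.001 * i for i in range(64)] + [1.0 + 0.05 * i for i in range(64)]

global_err = mean_abs_error(weights, bits=4, group_size=len(weights))
grouped_err = mean_abs_error(weights, bits=4, group_size=32)
print("one global range:", global_err)
print("groups of 32:    ", grouped_err)
```

The cost is a little extra metadata per bin (one offset and one scale), which is small next to 32 or 128 weights.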
It's very similar to state of the art video codecs or image compression algorithms. A raw photograph taken by my digital camera is 60MB, but a PNG of the same photo is 30x smaller at 2MB without a single artifact. It should be no surprise that we can reduce models by 4x, 8x, or even more without sacrificing quality.
I can actually see jpg artifacts on the jpg variants of the png files that I generate in Stable Diffusion, and the impacts from quantization down to 3,2, even 1 bit are FAR more than the impacts of switching from png to jpg.
Also, I actually have published peer reviewed research on LLMs and spend a majority of my time on this earth thinking about and coding for them. I know what I'm talking about and you shouldn't try to dismiss my criticisms so quickly.
Even the coomers at civitai have done polls where their own users find dreambooth models better than lora models on average, likely because the likeness of a person can be more properly trained when heavier/stronger methods are utilized. Same dynamic here with quantization.
Yes, as a model scales up in size quantization hurts it less. The claims made that extreme quantization is not noticeable at all when the model is super large are just pathetically wrong.
Yes, during training, where you need to make tiny adjustments to weights. But as far as I understand it inference can still work well because of the sheer number of weights. Give a black-and-white image a high resolution and you can represent any shade of gray if you zoom out a bit.
On the other hand, getting > 8 GiB VRAM on a laptop GPU is rare; you're definitely not getting 128 GiB VRAM, so Apple Arm, with 32 or 64 GiB of RAM (get 128 if you can afford it) is going to get you more gigabytes of usable RAM for training/inference.
A brand new RTX A6000 (48 GB VRAM) is probably the largest you can get in a single card that can run in a regular PC. It can be had for $4-5k and is sufficient for llama-65b.
Beyond that, yeah, you're looking at dedicated multi-GPU server hardware.
Both consumer and workstation (the latter may be cheaper per GB of RAM, but with fewer shaders) 16-24 GB GPUs (RTX 3080Ti/3090/4090/A4000/A4500/A5000), including in laptops, are not hard to find (pricey, but not “hyperexpensive clusters”), and it's not until you jump above a single 48 GB RTX A6000 that you need a “cluster”.
So we are all in agreement here that a 3B model is fundamentally inferior to a larger model?
Not that it doesn’t have uses; not that there’s no value in research in small models.
Just, honestly, that these smaller models don’t have the capabilities of the larger models.
It’d be good to have a direct acknowledgment of that, because it seems like you’re going out of your way to promote the “it’s fine to have a small model” view; and it is, roughly speaking. Parameter count isn’t everything. Small models are accessible, you can easily fine tune them. They are interesting.
…but, they are not as good, as far as I’m aware, in terms of output, in terms of general purpose function, as larger models.
Sounds like the difference between edge and centralized ML scoring.
At my job, I can’t casually fire up 8xA100 80gb instances. And if I could, the performance wouldn’t have the throughput I require to be useful. Big models are operationally much more expensive.
The smallest/fastest model that is accurate enough for your use case is ideal.
Sure.
…but it’s also fair to say that the smallest model that can fit your use case will be bounded by the parameter count.
No amount of training data can make a 100-param model do text summarisation.
If you have a 3B param model, and you want a chat-GPT to embed in your app, do you think it’ll do?
I don’t.
The output is not at that quality level, because it’s too small.
Not everyone needs that; but these 3B / 7B models don’t have the capability to do everything.
But ultimately small models are very good for most things, and much preferable (for running at home to organize your digital life, on a small SBC or an old computer).
> Sure, you may have played with a 7B model in the past, but that doesn't mean there's no use case for a smaller model like the 3B. In fact, having a performant, smaller model is a game changer for a lot of applications that don't require the massive scale of the larger models. Plus, smaller models are generally faster and more accessible, which is always a plus.
It's hard to pick out the actual answer: what is the application that this is good at? What has their "more nuanced" approach to understanding performance increased this model's performance at doing?
> How can someone get into using these models
You can use gradio (online) or download the weights at https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main (git will not download them, they're too big, so do it manually) and then load the model in PyTorch and try inference (text generation). But you'll need either a lot of RAM (16 GB, 32 GB+) or VRAM (a GPU).
> How might I go about using these models for doing things like say summarizing news articles or video transcriptions
Again, you might try it online, or set up a Python/bash/PowerShell script that loads the model for you so you can use it. If you can pay, I would recommend runpod for the shared GPUs.
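For long articles or transcripts, one common pattern is to summarize in chunks and then summarize the summaries, since these models have a limited context window. A rough sketch follows, with the model call abstracted behind a hypothetical `generate` function: swap in whatever actually runs your model. The prompts, the 300-word chunk size, and `fake_generate` are all assumptions for illustration.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text: str, generate: Callable[[str], str], max_words: int = 300) -> str:
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return ""
    partials = [generate(f"Summarize the following text:\n\n{c}\n\nSummary:") for c in chunks]
    if len(partials) == 1:
        return partials[0]
    combined = "\n".join(partials)
    return generate(f"Combine these notes into one short summary:\n\n{combined}\n\nSummary:")


# Stand-in for a real model call (hypothetical; replace with your own inference code).
def fake_generate(prompt: str) -> str:
    return f"[summary of {len(prompt.split())} words]"


print(summarize_long_text("word " * 1000, fake_generate))
```

Splitting on words is crude; a real script would count tokens with the model's tokenizer and probably split on paragraph or sentence boundaries.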
> When someone tunes a model for a task, what exactly are they doing and how does this ‘change’ the model?
From my view ... not much ... "fine-tuning" means training (tuning) on a specific dataset (fine, as in fine-grained). As far as I know (I'm not sure), they just run more epochs on the model with the new data you have provided until they reach a good loss (the model works), which is why quality data is important.
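The "run more epochs on new data" idea can be shown on a toy scale. This sketch uses a two-parameter linear model instead of an LLM, with made-up "pretrained" values and made-up task data; the mechanics (start from existing weights, keep doing gradient descent on the new dataset, watch the loss drop) are the same in spirit, but everything else is simplified.

```python
def mse(w, b, data):
    """Mean squared error of the model y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, lr=0.05, epochs=200):
    """Plain gradient descent on mean squared error, starting from the given w and b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# "Pretrained" parameters, assumed to come from earlier training on other data.
w0, b0 = 2.0, 0.0

# New, task-specific data that follows y = 2.5*x + 1 (made up for illustration).
new_data = [(x, 2.5 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]

w1, b1 = train(w0, b0, new_data)
print(f"loss before fine-tuning: {mse(w0, b0, new_data):.4f}")
print(f"loss after fine-tuning:  {mse(w1, b1, new_data):.4f}")
```

The model itself is unchanged in shape; only the weight values move, which is the sense in which fine-tuning "changes" it.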
You might try https://github.com/oobabooga/text-generation-webui they have a pretty easy setup config. Again, you'll need a lot of RAM and a good CPU for inference on CPU or a GPU.
Yeah it wouldn't be as flexible as an LLM (for example synonyms won't work), but I doubt that for this particular task it'll be that big of a problem, and you can ask it to tweak the program in various ways (for example introducing crude spaced-repetition), making it arguably better than the AI solution, which takes some time to prompt engineer and will never be "perfect".
I don't really know how much better fine-tuning makes these models, so I can't think of anything that they can actually be used for where they aren't worse than traditional programs, maybe as an AI in games? for example making them role-play as a historical figure in Civilization 6.
Overall I like the progress: LLaMA releases -> LLaMA fine-tuned on larger models gets similar performance to ChatGPT with fewer parameters (more efficient) -> People can replicate LLaMA's model without anything special, effectively making LLMs a "Commodity" -> You are Here.