SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency (infini-ai-lab.github.io)

131 points by zinccat 2 years ago | 61 comments

spxneo 2 years ago |

This is quite worrying for OpenAI, as token prices have been plummeting thanks to Meta, and it's going to have to keep cutting its prices while capex remains flat. Whatever Sam says in interviews, just think the opposite and the whole picture comes together.

It's almost a mathematical certainty that people who invested in OpenAI will need to reincarnate in multiple universes to ever see that money again, but no bother: many are probably NVIDIA stockholders too, which evens out the damage.

jiggawatts 2 years ago | |

There’s a Pareto frontier where Meta is pushing out the boundaries along the “private” and “cheap” axes.

OpenAI can release GPT 4.5 or 5 and push out the boundary in the direction of “correctness” and “multimodality”.

Either way, we win as customers while the level of competition remains this hot.

I personally want a smart AI much more than a cheap or fast one. Your mileage may vary.

mft_ 2 years ago | | |

Well, Pareto is about optimisation, not either/or. I want a model that’s smart enough, while also being locally-executable.

I don’t know whether/when we’ll get there, and whether it will be improvements in models, or underlying model technology, or GPU/TPUs with larger memory at a consumer price point, or something else, that will deliver it.

michelsedgh 2 years ago | |

I agree with you somewhat. You are correct unless they have a much better GPT model that they have not released for whatever reason. They are a year ahead of competitors and GPT4 is pretty old now. I find it hard to believe they don’t have much more capable models now. We will see, though.

j45 2 years ago | | |

The polish of OpenAI stuff when released has been quite mature since gpt4 or even 3.5.

They are no doubt sitting on ultra polished stuff. When you are the tip of the arrow though and the cutting edge itself it might not be as efficient but does it ever show you things you can’t unsee.

OpenAI can launch a video thing a day after a competitor because it’s ready to go. I am less and less skeptical every time they ship, because the quality of the first version isn’t sliding backwards, even in different areas like video.

Maybe releasing it is strategic, or releasing it also requires supporting it infrastructure wise and then some. That might be a challenge.

My feeling is the next model, or an in-between one, may have massive efficiency and performance improvements without having to go quantum with brute forcing it.

Meanwhile others who are following what OpenAI has done seem to be able to optimize it and make it more efficient whether it’s open source or otherwise.

Both are doing important work and I'm not sure I want to see it as a one winner take all game.

The way AI vendors are responding suddenly to another’s launch feels like they are always ready to launch and continue to add functionality to it that could also ship.

It reminds me of when Microsoft spent a billion dollars advertising that Bing had a billion pages indexed. Google stayed quiet. Then, when the money had been spent by Microsoft, Google simply added a zero or two to the number on their search page, back when they used to list how many pages they had indexed. They were just sitting on it already done, announcing it when it was to their benefit.

Plankaluel 2 years ago | | |

GPT-4 is not a single model. The GPT-4 that was released initially a year ago is way worse in benchmarks than the newest versions of it, and the original version has been beaten by quite a lot of other models by this point.

The newest version of GPT-4 is probably still overall the best model currently, but it is only a few months old, and the picture depends a lot on what benchmarks you are looking at.

E.g. for what we are doing at our company (document processing, etc.) Claude-3 Opus and Gemini-1.5 Pro are currently the better models. The newest GPT-4 even performed worse than a previous version.

So to me it def. seems like the gap is getting smaller. Of course, OpenAI could be coming out with GPT-5 next week and it could be vastly better than all other current models.

easygenes 2 years ago | | |

There's wide speculation that what will be branded as either GPT-4.5 or GPT-5 has finished pretraining now and is undergoing internal testing for a fairly near-term release.

imtringued 2 years ago | | |

I'm not saying Claude 3 and Gemini are better than GPT4 in every aspect, but those two models can at least perform addition on arbitrarily long numbers, meanwhile GPT4 struggles.

j-bos 2 years ago | |

Isn't that why he's making rounds to lock down the biggest AI's?

hiddencost 2 years ago | |

I suspect that when it costs 0.5c per 100 million generated tokens, and you can generate 1000 tokens per second, they'll be very happy.

moralestapia 2 years ago | |

Disclaimer: not a fan of "Open"AI

Everyone could say anything about open source models, but they're comparing themselves to what OpenAI released a year ago. They haven't shown all of their cards yet and they have a decent moat already in place; some say they have no moat, I disagree, they have one of the best moats possible which is brand awareness.

Sora on its own could bring in billions in revenue; an open-source Sora will take at least another year, if not two, to come out. Then more time until it can run on commodity hardware. An open source model that only runs in a dedicated H100 is actually less useful than a closed model behind an API call; not to detract from open source, I think it's the way to go but I'm just being pragmatic and realistic. There's a reason why MS Office is still the top productivity app in the world, even though dozens of open source alternatives exist.

Hendrikto 2 years ago | | |

> they have one of the best moats possible which is brand awareness.

Do they though?

If you talk to "regular people", everybody knows ChatGPT, but nobody knows or cares about OpenAI. And most of them don't even really know that name. They call it ChatUuuuhm, ChatThingy, Chad Gippity, or similar.

I think they will just switch, when something better comes along.

poslathian 2 years ago | | |

MS had yet to fully stabilize that lead a full decade after they had won the OS platform standard for IBM-compatible PCs. A platform standard moat goes way, way beyond a brand advantage.

Azure, while significant, has no similar monopoly to support OpenAI. Do you really see a structural advantage to OpenAI beyond the Microsoft products integrating it?

jstummbillig 2 years ago | |

I disagree.

a) A year after GPT-4 set the bar, it's still the best model, despite everyone else not having to do it first. Just copy, and just software. And that's not for lack of trying by every other viable prime player on the planet with unprecedented acceleration.

Imagine any other piece of software where the incumbent has a mere 2-3 year head start, during which they had to work out the entire product, and everyone else, despite just having to copy it and pressing the pedal through the floor, is still struggling to catch up.

b) The current models, including GPT-4, are so bad. A few billion can be made just by continuing to play this game of improvements for a few years and getting better each year. I think people are wildly confused about how big this market is going to be when that happens. They are not squeezing hosting or compute. They are squeezing intelligence. Intelligence is the entire economy. The notion that there would ever not be room for multiple things here, maybe through size or specialisation or cost (as with all other intelligence), and that a few billion dollars are a big deal, is so strange to me.

c) The game will, at some point, be mostly about infra and optimization. People come to the conclusion that's a problem for the incumbents, when our entire industry is mostly about infra and optimization. AWS is infra and optimization. I think even the average HN tinkerer understands that therein lies a proposition that's not exactly equivalent to "just rent a few servers and do it yourself".

anon373839 2 years ago | | |

> A year after GPT-4 set the bar, it's still the best model

Debatable. Many people find Claude Opus superior, and I know I've found it consistently better for challenging coding questions. More importantly, the delta between GPT-4 and everything else is getting smaller and smaller. Llama 3 is basically interchangeable with GPT-4 for a huge number of tasks, despite its smaller size.

hackerlight 2 years ago | |

Depends how good their next model is, and if they prevent leaks and departures so they can prolong the lead for an undetermined amount of time.

14u2c 2 years ago | |

Most of the big "investments" in OpenAI are in the form of compute credits. I fail to see the downside of that.

modeless 2 years ago |

I don't need exact results. FP8 quantization is almost lossless and even 6-bit quantization is usually acceptable. Can this be combined with quantization?

mmoskal 2 years ago | |

Yes. It's speculative decoding but instead of generating just a few sequential tokens with the draft model they generate a whole tree of some sort of optimal shape with hundreds of possible sequences.

It ends up being somewhat faster than regular speculative decoding in a normal setting (GPU only). If you are doing CPU offloading, it's massively faster.

Edit typo
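A toy sketch of the tree idea described above may make it more concrete. It is not the Sequoia implementation: `target_next` and `draft_candidates` are made-up stand-ins for the large and small models, the tree shape is a fixed depth and branching factor rather than an optimized one, and only greedy decoding is covered. The point it illustrates is that the output matches plain greedy decoding of the target while needing fewer target passes.

```python
from typing import List, Set, Tuple

VOCAB = 10  # toy vocabulary size (assumption for illustration)

def target_next(ctx: Tuple[int, ...]) -> int:
    """Hypothetical stand-in for the large model's greedy next token."""
    return (31 * sum(ctx) + 7 * len(ctx)) % VOCAB

def draft_candidates(ctx: Tuple[int, ...], k: int) -> List[int]:
    """Hypothetical stand-in for the small model's top-k guesses."""
    guess = target_next(ctx)
    if len(ctx) % 3 == 0:
        guess = (guess + 1) % VOCAB  # make the first guess wrong sometimes
    return [(guess - i) % VOCAB for i in range(k)]

def build_tree(ctx: Tuple[int, ...], depth: int, k: int) -> Set[Tuple[int, ...]]:
    """All token paths the draft proposes below ctx (k branches per node)."""
    paths: Set[Tuple[int, ...]] = set()
    frontier: List[Tuple[int, ...]] = [()]
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            for tok in draft_candidates(ctx + path, k):
                child = path + (tok,)
                paths.add(child)
                next_frontier.append(child)
        frontier = next_frontier
    return paths

def verify(ctx: Tuple[int, ...], paths: Set[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Follow the target's greedy choice through the tree.

    A real system would score every tree node in one batched target pass;
    here that pass is simulated by calling target_next along the walk.
    Always returns at least one token (the target's own choice).
    """
    accepted: Tuple[int, ...] = ()
    while True:
        accepted += (target_next(ctx + accepted),)
        if accepted not in paths:  # draft did not propose this continuation
            return accepted

def speculative_generate(prompt: List[int], n_tokens: int,
                         depth: int = 3, k: int = 2) -> Tuple[List[int], int]:
    """Generate n_tokens and report how many (simulated) target passes were used."""
    out = tuple(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        out += verify(out, build_tree(out, depth, k))
        passes += 1
    return list(out[len(prompt):len(prompt) + n_tokens]), passes

def greedy_generate(prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target pass per token."""
    out = tuple(prompt)
    for _ in range(n_tokens):
        out += (target_next(out),)
    return list(out[len(prompt):])

tokens, passes = speculative_generate([1, 2, 3], 20)
print(tokens == greedy_generate([1, 2, 3], 20), passes)
```

With CPU offloading, each target pass means streaming the weights over PCIe, so cutting the number of passes is where most of the benefit would come from.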

dimask 2 years ago | |

> Can this be combined with quantization?

It is in their TODO part in https://github.com/Infini-AI-Lab/Sequoia/tree/main

alecco 2 years ago | | |

INT8, not FP8

freeqaz 2 years ago |

So this is 8x faster for serving these models than before? Or is this about it being more deterministic? I can't quite tell from reading it.

maccam912 2 years ago | |

The idea is to serve models that would normally be considered too large for GPU memory (70 billion parameters at 16 bits, i.e. 2 bytes, each, for 140 GB of memory required). Some people figured out you can offload the model and only have parts of it loaded, so a 24 GB GPU like the 4090 can still serve the model, but it goes a lot slower. They have a new way to serve the same model on the same GPU but with 8x better throughput. Something about decoding tokens on a smaller model maybe, then just checking multiple tokens on the larger model in a single batch. Magic, but ultimately it's the same model, same GPU, same output as before, but much better throughput.
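The memory figure is simple arithmetic. A minimal sketch, counting weights only (KV cache and activations are ignored, and `weight_memory_gb` is just an illustrative helper name):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (10^9 bytes); ignores KV cache and activations."""
    return n_params * bits_per_param / 8 / 1e9

# A 70B-parameter model at a few common precisions versus a 24 GB card.
for bits in (16, 8, 4):
    gb = weight_memory_gb(70e9, bits)
    print(f"70B @ {bits}-bit: {gb:.0f} GB, fits in 24 GB: {gb <= 24}")
```

Even at 4 bits the weights alone exceed 24 GB, which is why offloading comes up at all for a single 4090.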

aussieguy1234 2 years ago |

I'm looking at buying 2x RTX 3060s to run Llama 70B on the new PC I just purchased.

Will this work, or do I need a Tesla P40 or two?

tarruda 2 years ago | |

Note that 2 RTX 3060 will probably be significantly slower than RTX4090.

Even with RTX 4090, 2 tokens per second is very slow and likely not ideal for most tasks. It is impressive (much faster than previous solutions), but still very slow for real time use.

If you want to run Llama 3 70B, it might be better to purchase a Mac Studio with 64 GB RAM (more for longer contexts) and run with 4-bit quantization.

My 2 cents: For most common tasks Llama 3 8B will be more than enough, and you can run that with full precision using a single RTX 3090. At a much lower cost, you can also run Llama 3 8B with 8-bit quantization on a single RTX 3060, if it has 12 GB of VRAM.

dannyw 2 years ago | |

Theoretically there's no reason why this shouldn't work, but you likely will find the software isn't designed for multi-GPU and have to reimplement/fix things yourself.

You will also be getting about 720 GB/s of memory bandwidth with 2x 3060, instead of 1 TB/s with the 4090, so expect lower performance.
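As a rough way to reason about those bandwidth numbers: for a model that fits entirely in VRAM, single-stream decoding is often treated as memory-bound, with every weight read once per token. The sketch below is a back-of-the-envelope bound under that assumption, not a benchmark; the 20 GB model size is a made-up example, 360 GB/s per 3060 is simply half of the 720 GB/s quoted above, and the `concurrent` flag is an illustrative simplification of how the cards share the work.

```python
def decode_tokens_per_sec_bound(model_gb: float, per_gpu_bw_gb_s: float,
                                n_gpus: int = 1, concurrent: bool = True) -> float:
    """Upper bound on single-stream decode speed if all weights are read once per token.

    concurrent=True:  GPUs read their shards at the same time, so bandwidth adds up.
    concurrent=False: GPUs take turns (layers split across cards), so it does not.
    """
    effective_bw = per_gpu_bw_gb_s * (n_gpus if concurrent else 1)
    return effective_bw / model_gb

model_gb = 20.0  # hypothetical quantized model that fits in 24 GB
print("2x 3060, concurrent:", decode_tokens_per_sec_bound(model_gb, 360, 2, True))
print("2x 3060, taking turns:", decode_tokens_per_sec_bound(model_gb, 360, 2, False))
print("1x 4090:", decode_tokens_per_sec_bound(model_gb, 1000))
```

Real throughput will be lower than any of these bounds, and software that splits layers across the two cards would land closer to the "taking turns" figure.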

34679 2 years ago | |

I picked up a couple RTX 4060ti in the 16GB version for $450 each a couple days ago from Bestbuy. Had been looking at the 3060 like yourself. Installed LM Studio and have been trying out a bunch of models with varying levels of quantization, completely pain free.

thelittleone 2 years ago |

Other than portability and privacy, are there any benefits to running a local model with a 4090, versus running the same model on-demand on a cloud service with the same or more powerful card?

razodactyl 2 years ago | |

There are always going to be pros and cons. That's why solutions like managed databases are a reality. From an expert perspective it seems like there's more to lose, but from the perspective of a company with employee turnover, possible data loss, security etc., the benefits start to far outweigh the costs.

This reasoning can mostly be applied here. If you want to learn about the LLM and pull it apart, perhaps fine-tune and tinker, then 100% go ahead and run locally. You won't, however, be able to scale this up easily for a consumer base, and the electricity use and heat output start to become a problem.

At some point it's more beneficial to pay the provider for inference, this includes upkeep, latest models, faster generation, stability, hosting etc.

Pros and cons! Choice is important and Meta is doing the right thing by the AI community and tech community in general by being realistic with these programs. The ecosystem is giving back by being able to access these high quality models.

j45 2 years ago | | |

What Meta is doing is very nice and differentiates them.

I also hope that won't change if it becomes more palatable to not be open.

kaliqt 2 years ago | |

Guaranteed uptime.

You are the guarantor, but that's good enough.

choppaface 2 years ago | |

Eventually these models will need to run on mobile devices, so commodity desktop GPUs are a good stepping stone. Alexnet / Caffe got traction because they could be run on commodity desktop machines. Then a few years later phones could run object detection etc.

imtringued 2 years ago | |

If you have a robot or self driving car, you're going to want on device inference for your vision language models.

For video games, being locked to a cloud service means the feature will disappear when the servers are shut down.

zwaps 2 years ago |

Is it me or is this paper basically missing all technical information?

I get that there’s proprietary technology, but if so, can we please not put this on arXiv and pretend it’s a scientific contribution?

qrios 2 years ago | |

The linked github repo [1] seems to have the code available and well documented.

[1] https://github.com/Infini-AI-Lab/Sequoia/tree/main/Engine

halyconWays 2 years ago |

Someone get this into koboldcpp