https://github.com/vllm-project/vllm/releases/tag/v0.7.1
MHA is still faster in low QPS regime apparently.
https://neuralmagic.com/blog/enhancing-deepseek-models-with-...
Also published this month was theoretical proof showing that for the same KV Cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.
I am very curious to see how well-optimized Deepseek's code is compared to leading LLM serving software like vLLM or SGLang.
(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)
It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.
Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.
If Deepseek R1 had used standard MHA, they would need 1749KB per token for KV cache storage. This means that once the conversation reaches ~46,000 tokens, the KV cache will have exceeded the entire storage capacity of a single H100.
Using MLA, each token now consumes 125KB. This means you can hit ~640,000 tokens (2x Ulysses) before overflowing.
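For anyone who wants to check the arithmetic, here is a rough sketch. It takes the two per-token figures above as given and assumes (my assumptions, not stated above) 80 GB of HBM on one H100, decimal units, and that the whole card is available for KV cache, which ignores the weights entirely.

```python
# Back-of-the-envelope KV cache capacity from the per-token figures quoted above.
# Assumptions: 80 GB of HBM, 1 KB = 1000 bytes, no memory reserved for weights.

HBM_BYTES = 80e9  # assumed capacity of a single H100


def max_cached_tokens(kv_bytes_per_token: float, hbm_bytes: float = HBM_BYTES) -> int:
    """Number of tokens whose KV cache fits in hbm_bytes."""
    return int(hbm_bytes // kv_bytes_per_token)


mha_tokens = max_cached_tokens(1749e3)  # standard MHA: 1749 KB per token
mla_tokens = max_cached_tokens(125e3)   # MLA: 125 KB per token

print(f"MHA: ~{mha_tokens:,} tokens before the cache fills the card")
print(f"MLA: ~{mla_tokens:,} tokens before the cache fills the card")
print(f"per-token reduction: ~{1749 / 125:.1f}x")
```

In practice the usable number is lower in both cases, since the weights and activations also live in HBM.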
https://verticalserve.medium.com/group-query-attention-58283...
Unrelated, it's always impressed me how Singapore buys 15% of the world's h100's. Really is the AI development capital of the world.
>Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.
(Showing my lack of breadth of knowledge in the ecosystem (s))
Inference providers like Fireworks, or major clouds, can use this to reduce their cost, if they don't already have a replication with similar perf.
vLLM and SGLang may integrate this to be faster at serving DeepSeek-V2/V2.5/V3/R1 on H100/H800s.
I believe that's why they didn't release this back then, this is part of their "moat" (pretty weak tho) and it only benefits competitors.
Open sourcing this after being very popular may indicate that they don't want all the users to use their API/Chat and now want the world to serve it instead? Idk.
There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the people who are able to build or adapt projects like this, which go deep into hardware systems, will be the most sought after.
Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.
You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.
I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.
Do you feel GenAI coding is substantially different from the lineage of 4GL to 'low code' approaches?
Reason I'm asking is because, despite all the promises, they all suffered from what Spolsky coined the 'leaky abstraction' problem.
Once something goes wrong, the user is left without recourse in a sea of additional complexity created by the tooling that was meant to not have to deal with it in the first place.
My own opinion is that GenAI is different because of (a) its recursive reflexive potential (you can use the tool itself to help you past the failure) and (b) it shifts the input out of the necessity for algorithmic/systemic thinking (which may come as a surprise to the audience here but my experience has taught me is alien to dare I say the majority of people).
Now don't get me wrong. We have not reached the point where (a)+(b) make it to where you don't need application layer devs, but we are definitely seeing some progress.
As for going deeper into the stack to "escape" AI, I would venture that is probably a non starter as the deeper you go the more constrained the domain is, so your escape strategy relies on AI reasoning making little progress, where AI reasoning has always been more successful in smaller well defined spaces.
I do agree that if you are "only" a developer, you will have to be in some sort of tightly defined niche, and how long those niches survive is anyone's guess.
Yes, this unfortunately does mean a reduction in the less skilled workforce, but frankly that's an on the whole good thing. Does anyone really enjoy writing and testing boilerplate day in day out for low pay, it's the same as the old white collar pushing paper around until retirement...
> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.
So, which one is it then?
https://jax-ml.github.io/scaling-book/inference/ - good read!
Also, they could have outsourced the computation to a subsidiary company in the US, I suppose.
Incidentally, I put the word "only" in quotes because I morally and aesthetically appreciate the strength of someone who can write to spec. I have no interest in demeaning the effort it takes to do so. I have worked with supposedly senior developers who ignore specs completely, even when the specs are done by a technical person and include details / unit tests.
Singapore is the billing location, not the shipping location, which makes sense because they’re the HQ of a lot of companies in the region.
https://www.tomshardware.com/tech-industry/deepseek-gpu-smug...
From the authors of FlashAttention:
> This [decoding] operation has been optimized with FlashAttention (v1 and v2 recently) in the training case, where the bottleneck is the memory bandwidth to read and write the intermediate results
And then they continue with:
> However, these optimizations don’t apply directly to the inference case, because the bottlenecks are different. For training, FlashAttention parallelizes across the batch size and query length dimensions. During inference, the query length is typically 1 ... With a batch size of 1, FlashAttention will use less than 1% of the GPU!
And then they come up with a different proposal, FlashDecoding, that optimizes for inference time:
> Our new approach Flash-Decoding is based on FlashAttention, and adds a new parallelization dimension: the keys/values sequence length. It combines the benefits of the 2 approaches from above. Like FlashAttention, it stores very little extra data to global memory, however it fully utilizes the GPU even when the batch size is small, as long as the context length is large enough.
Link: https://crfm.stanford.edu/2023/10/12/flashdecoding.html
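To make the "new parallelization dimension" concrete, here is a minimal pure-Python sketch of the split-KV idea for a single query. It is not the Flash-Decoding kernel; the function names are made up for illustration, and the chunks are processed in a plain loop where a GPU would process them in parallel.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (chunk max score, sum of exps, unnormalized weighted value sum).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return m, sum(weights), acc


def split_kv_attention(q, keys, values, chunk_size):
    """Process KV chunks independently, then merge with a log-sum-exp rescale."""
    parts = [partial_attention(q, keys[i:i + chunk_size], values[i:i + chunk_size])
             for i in range(0, len(keys), chunk_size)]
    global_max = max(m for m, _, _ in parts)
    denom = 0.0
    out = [0.0] * len(values[0])
    for m, s, acc in parts:
        rescale = math.exp(m - global_max)  # bring each chunk onto a common scale
        denom += rescale * s
        out = [o + rescale * a for o, a in zip(out, acc)]
    return [o / denom for o in out]


# Toy data: 6 cached tokens, head dim 4, value dim 3.
q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(4 * i + j) for j in range(4)] for i in range(6)]
values = [[math.cos(3 * i + j) for j in range(3)] for i in range(6)]

one_chunk = split_kv_attention(q, keys, values, chunk_size=len(keys))
three_chunks = split_kv_attention(q, keys, values, chunk_size=2)
print(one_chunk)
print(three_chunks)
```

With a single chunk this is ordinary softmax attention, so comparing the two printed results shows that splitting along the sequence length only adds a small reduction step at the end.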
Classic softmax attention aka Softmax(Q K^T/sqrt(d_k))V consists of two matrix multiplications.
This means QK^T=O and then softmax(O/sqrt(d_k))V.
The matrix O is quadratic with respect to the number of input tokens. Writing the O matrix to main memory is bound by the maximum bandwidth of your memory.
Then it has to be read out again to be multiplied against V.
What flash attention does is change the algorithm. Flash attention computes the same result as softmax attention (it is exact, differing only by floating-point rounding), but it gets there with a tiled, online softmax. The changed algorithm allows you to fuse the independent kernels.
Instead of writing out the O matrix to main memory, its softmax is calculated against V immediately. The double memory roundtrip is now gone. This in itself does not change the fact that both softmax attention and flash attention are quadratic with respect to the input, but it sure as hell improves the speed of "prefill".
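A toy version of that fusion, for one query row and in plain Python, might look like the following. It is only a sketch of the online-softmax trick, not the actual FlashAttention kernel (which works on tiles in SRAM), and both function names are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Materialize the whole score row first, then softmax, then multiply by V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]  # stored row
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]


def fused_attention(q, keys, values):
    """Single pass: running max, running denominator, running output.

    No score is ever stored; each one is folded into the output immediately.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # 0.0 on the first step
        w = math.exp(s - new_max)
        denom = denom * correction + w
        out = [o * correction + w * x for o, x in zip(out, v)]
        running_max = new_max
    return [o / denom for o in out]


q = [1.0, -0.5, 0.75, 0.1]
keys = [[math.sin(i + 2 * j) for j in range(4)] for i in range(8)]
values = [[math.cos(i - j) for j in range(2)] for i in range(8)]

print(naive_attention(q, keys, values))
print(fused_attention(q, keys, values))
```

The two functions should agree to within floating-point rounding; the difference is only in what has to be written out and read back.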
If you tile the Q, K, V matrices into n blocks each, you will still have to load O(n^2) blocks.
But here is the thing. Matrix multiplication is an operation with a significant amount of shared data. This means the multipliers calculating the dot products are being fed from the same flip flops, or the data is shifted around via a systolic array. You end up in a situation with an insignificant memory load, but a massive amount of arithmetic.
In addition to that, you have all the tokens already, so the MLPs at the end of the layer can be processed as GEMM instead of GEMV.
This is why "prefill" is compute intensive instead of memory intensive.
During token generation, you need to perform attention for the next token, with all the tokens already in the KV cache. You load n entries from the KV cache, then do GEMV on the MLP and you have to do this over and over again in a sequential fashion. This means that memory bandwidth is the deciding factor for token generation.
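The GEMM vs GEMV point can be put in numbers with a crude arithmetic-intensity estimate for one linear layer. This is a sketch under my own assumptions: a hypothetical 4096x4096 weight matrix at 2 bytes per element, weights read once, activations read and written once, no caching effects.

```python
def linear_layer_intensity(num_tokens: int, d_in: int, d_out: int,
                           bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for y = x @ W, with x of shape (num_tokens, d_in)."""
    flops = 2 * num_tokens * d_in * d_out  # one multiply and one add per weight use
    bytes_moved = bytes_per_elem * (
        d_in * d_out            # weights, read once
        + num_tokens * d_in     # input activations
        + num_tokens * d_out    # output activations
    )
    return flops / bytes_moved


# Hypothetical layer shape, chosen only for illustration.
prefill = linear_layer_intensity(num_tokens=1024, d_in=4096, d_out=4096)  # GEMM
decode = linear_layer_intensity(num_tokens=1, d_in=4096, d_out=4096)      # GEMV

print(f"prefill (1024 tokens): {prefill:.1f} FLOP/byte")
print(f"decode  (1 token):     {decode:.2f} FLOP/byte")
```

With one token the weights are loaded for a single use each, so the ratio sits around 1 FLOP per byte; with many tokens the same weights are reused across the batch and the ratio climbs into the hundreds.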
Now here is a caveat: if SRAM is limited relative to your TOPS, then it is possible that even flash attention is memory bound, but for a different reason. It's memory bound because the maximum tile size that can be held in SRAM can be processed faster than it takes to load it from system memory or VRAM, and you are performing a quadratic number of tile loading operations. This is only noticeable near the extreme top end of context lengths, between 32k and 128k tokens.
O(seq_len*dk + seq_len^2)
whereas Att(i) computation with FA runs in O(seq_len^2*dk^2/SRAM_size)
Q, K, V computation remains the same. And ATTN(0,n)*Wo also remains the same. In a smaller model, with N=12, D=768, dk=64, seq_len=1k, SRAM=32KB, ..., FA optimization would roughly translate to 0.5M vs 4.5M per-head(att(i)). So ~10x improvement but in the grand scheme of things, in per-attention-layer it becomes ~91M vs ~45M so ~2x of net improvement.
> This is why "prefill" is compute intensive instead of memory intensive.
Yes, I think I agree and I have corrected myself elsewhere in the thread. The original thought that I actually wanted to convey in my initial comment which was somehow lost throughout the discussion is that - prefill/training will benefit from the FlashAttention/MLA but the inference will not. I can agree that the formulation "only when memory access time dominates the compute in attention implementation" was wrong.
> During token generation ... memory bandwidth is the deciding factor for token generation.
A Llama3-70B MLP layer takes roughly 1 TFLOPS and 0.6 GB of bandwidth for 1024 tokens. Assuming that 1023 entries are taken from a KV-cache, attention layer computation for a single token will take ~0.6 GFLOPS and ~0.2 GB of bandwidth. To load the rest of the values from KV-cache at FP16 precision, it will take us 1023*0.1MB or ~1 GB.
So, ~1 TFLOPS and ~1 GB of bandwidth per Transformer layer. On hardware such as H100, this still looks like a compute-bound problem to me. OTOH on a CPU with 15 TFLOPS of compute but only <1TB/s of memory bandwidth, it becomes a memory-bound problem. Or no?
FA, compared to naive implementation, made training / prefill (i.e. when you can have multiple tokens in the same sequence visible) compute-bound instead of memory-access bound.
So, currently, on MHA/GQA, with Flash Attention, training/prefill is compute-bound, whereas decoding is memory-access-bound.
Before FA, both prefill / decode are bound by memory-access. FA solved the problem of training/prefill. But because kvcache is large, decoding is inherently bound by memory-access.
Our goal is always to make everything compute-bound.
I did not say anything like that? What I said is that FlashAttention and arguably MLA will not make any significant gains in the inference time. And this is true.
Also, FWIW there are certainly model shapes that are compute-bound in the decode phase so saying that decoding is universally inherently bound by memory access is what is plain wrong, if I were to use your dictionary.
MLA made it possible to cache a smaller form of k/v, mitigating the problem (but not completely solving it: on shorter contexts and smaller batches it's still memory-access bound).
Another country has X because they were expected (in the terms of their purchase) to not sell it to an adversary. So yes they’re supposed to honor that agreement and are not supposed to trade that particular thing X with each other. Not doing so invites sanctions and other consequences. Is it worth the risk just to do business with a dictatorship? Probably not.
How about if they got {X} from Mexico ( who got it from Agnes .. ) ?
People say “it’s used for money laundering” as if we’re supposed to be on China’s side about restricting people’s ability to move money out of the country over certain amounts
Like, oh you’re against freedom from a repressive regime? Or oh you’re only against it when it’s the American government restricting US citizens flow of capital? like I’m confused, pick a lane
Capital controls are obsoleted under any context
It means that people can and will provide this service, and 1000's will build on this and make offers that you can use in either a commodity base market, or with a specific niche target.
It means regulatory capture and control will be much, much harder to execute.
It means AI might continue to be a benefit also to you rather than just a way to control, propagandize and exploit you.
That at least allows other companies/research labs to develop competing cutting edge LLM technology and come up with efficiency breakthroughs. The alternative is for the tech to be hidden inside OpenAI and FANGs or released as old versions.
With the next wave of investment targeting local on-device robotics, I'm way more bullish about local AI than vertical SaaS AI.
"99% written by DeepSeek-R1" according to the author.
To be fair torch didn't try very hard optimizing on CPU either.
[0] 200MB is actually a very generous number, i tried to download some AI thing via pip3 the other day and it wanted 600MB or so of CUDA stuff. Meanwhile i do not even have an Nvidia GPU.
Fwiw, there are always many attempts at optimizing code (assembly etc). This is good! Great to try new techniques. However, you get what you constrain. So I've seen optimized code that drops checks that the compiler authors say are required in the standard. So, if you don't explicitly tell your optimizer "this is a case I care about, this is the desired output" it will ignore that case.
Did we find a faster implementation than the compiler creates? Well, I mean, sure, if you don't know why the compiler is doing what it is doing
It is indeed pretty silly that's not the default and you have to go to https://pytorch.org/get-started/locally/, copy the argument `--index-url https://download.pytorch.org/whl/cpu` to install CPU-only torch. But the alternative would be having the world's scientists wondering why they can't use their GPUs after `pip install torch` so /shrug
Since the number differs by roughly 1024x, maybe you forgot that you just need to work on the last decoded token for MLP, too? Because you don't need hidden state for previous tokens in Attn now.
So, the final number would be ~0.6 GFLOPS (self-attention across heads) + ~0.15 GFLOPS (attention) + ~1 GFLOPS (ffwd) which in total give or take is ~2 GFLOPS per-layer.
Bandwidth-wise, the ~1GB number I previously gave was also wrong (llama3-70B has 8 KV heads). Now, with more precise calculations that figure is ~0.6 GB per-layer.
So, at batch_size=1, FP8 precision, 1024 tokens, during the decode phase with KV-cache, we need ~2GFLOPS of compute and ~0.6GB of bandwidth per each layer. Still looks compute-bound to me.
H100 has 3.3TB/s HBM bandwidth on paper, and ~1000TFLOPS bf16 compute on paper. That's 1:300. 0.6GB vs ~2GFLOPS is 1:3. Tell me how is this compute bound?
(also, your number, even after accounting for GQA, is still off. You usually can't store kvcache in fp8.)
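Spelled out as a tiny roofline check, using only the on-paper H100 figures and the per-layer numbers quoted above (the function name is mine, and real kernels reach well below peak on both axes):

```python
PEAK_FLOPS = 1000e12     # ~1000 TFLOPS bf16, on paper
PEAK_BANDWIDTH = 3.3e12  # 3.3 TB/s HBM, on paper


def classify(flops: float, bytes_moved: float,
             peak_flops: float = PEAK_FLOPS,
             peak_bw: float = PEAK_BANDWIDTH):
    """Compare the time to do the math with the time to move the bytes."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / peak_bw
    kind = "compute-bound" if compute_s > memory_s else "memory-bound"
    return kind, compute_s, memory_s


# Per-layer decode figures from the parent comment: ~2 GFLOP and ~0.6 GB.
kind, compute_s, memory_s = classify(flops=2e9, bytes_moved=0.6e9)

print(f"hardware balance point: {PEAK_FLOPS / PEAK_BANDWIDTH:.0f} FLOP/byte")
print(f"workload intensity:     {2e9 / 0.6e9:.1f} FLOP/byte")
print(f"compute time: {compute_s * 1e6:.1f} us, memory time: {memory_s * 1e6:.1f} us")
print(kind)
```

The workload would need roughly a hundred times more FLOPs per byte before the compute side becomes the limit.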
> MLA, FlashAttention and similar optimizations will provide the benefits only when memory access time dominates
> Those would be [...] not the decode phase
This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.
Reading your quotes, it looks like maybe you are talking about GPU utilization issues? (i.e. not launching enough threads). Due to the parallelization strategy of the original FA it indeed does not even keep the GPU busy if q*bs is too small. But this is not an inherent limitation of FA-style kernels and can be solved and people did solve it. Or you simply batch more. Now you can keep the GPUs busy at 100% waiting for memory access, but memory access time still dominates, hence "memory-access-bound". And here comes MLA.
> FWIW there are certainly model shapes that are compute-bound in the decode phase
Yeah. But so far all I read don't really work ("work" means being at least just slightly worse than alternatives) under same wall-clock time compute budget. Do you have any pointer to a working example, even on smaller 3B-ish models?
Let's take llama3-8B for an example. GFLOPS needed for self-attention per-layer per-token is roughly 0.15 GFLOPS. For simplicity reasons let's assume that we store all our weights in FP8 precision, then our load memory-bandwidth required for the same is 0.05 GB. Store memory-bandwidth is negligible. If we expand this further to a 1k tokens context, this becomes ~180 GFLOPS and ~0.35 GB per-layer per-1k-ctx.
Assuming that our HW is H100, is this compute-bound or memory-bound?
> It's going to take me some minutes to find out what's wrong in this napkin math.
I am sure you will. Please don't be so entitled.
Sorry, what? Who the fuck in this world runs decode without k/v cache??! If you run without k/v cache you are basically doing prefill for every token you generate and that's not what we called "decode". That's what we called "prefill".
k/v cache, while named "cache", is a lot more important than what people would perceive as a "cache". It's the essential part of the algorithm. If you lose your k/v cache you must run prefill again. If you run prefill for every token you generate it's not O(n^2), it's going to be O(n^3).
And yeah, you can run prefill 1000 times to generate a 1000 tokens output. Or you can run prefill once and with the persisted k/v cache run decode 1000 times. Tradeoff has to be made here but it simply makes no sense to drop a k/v cache in the middle of generating a response, as your number shows, recomputing is guaranteed to be slower than loading k/v cache.
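If you want to see the gap, here is a rough count of query-key score computations with and without the cache. It is a sketch only: it assumes causal attention, counts one unit per (query, key) pair, ignores the MLPs and projections, and the prompt length of 24 is an arbitrary number picked for the example.

```python
def scores_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Prefill once, then each later step scores one query against the cached keys."""
    prefill = prompt_len * (prompt_len + 1) // 2  # causal pairs over the prompt
    decode = sum(prompt_len + t for t in range(1, new_tokens))
    return prefill + decode


def scores_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Re-run a full causal prefill over the whole sequence for every new token."""
    total = 0
    for t in range(new_tokens):
        n = prompt_len + t  # sequence length seen when producing token t
        total += n * (n + 1) // 2
    return total


prompt_len, new_tokens = 24, 1000  # hypothetical prompt, 1000-token output
cached = scores_with_cache(prompt_len, new_tokens)
uncached = scores_without_cache(prompt_len, new_tokens)

print(f"with k/v cache:    {cached:,} score computations")
print(f"without k/v cache: {uncached:,} score computations")
print(f"ratio: ~{uncached / cached:.0f}x")
```

The cached total grows quadratically with the final sequence length and the uncached one cubically, which is the O(n^2) vs O(n^3) point above.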
> Please don't be so entitled.
When someone came up with a wrong number, I try to be nice and run the numbers myself and figure out why someone would end up with such a number and point out the specific mistake, instead of dumping a page of my own calculation. It's usually just a missing factor somewhere. Guess I shouldn't be so nice to retards who keep insisting that you can be fine without k/v cache during decoding. Also in this case I admit I failed to have a theory on why your number is so off because giving out prefill numbers and claiming it's decode isn't in my book.
Yeah, I know this sounds extremely mean, feel free to downvote, but I hope readers can feel my frustration now.
> Also in this case I admit I failed to have a theory on why your number is so off because giving out prefill numbers and claiming it's decode isn't in my book.
Maybe it's because it is not off? It's not terribly difficult to sum up all the matmul calculations and the number of bytes one needs to load and store for each layer in self-attention. My number could be off by a bit but it is certainly not terribly off.
> different opinions
I won't argue with you so hard if it's your "opinions". What you described is not an opinion. And facts could be wrong. Plainly wrong.
> Maybe it's because it is not off?
Yeah, as I said earlier your number might be correct as an estimation for prefilling 1000 tokens on Llama 3 8B. That's not what everybody here called "decode". Your number shows that prefill is compute-bound. So what?