DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs

(github.com)

441 points by helloericsf 1 year ago | 108 comments

vLLM supports MLA for Deepseek models as of 3 weeks ago. 3x higher generation throughput and 10x token memory capacity.

https://github.com/vllm-project/vllm/releases/tag/v0.7.1

MHA is still faster in low QPS regime apparently.

https://neuralmagic.com/blog/enhancing-deepseek-models-with-...

Also published this month was theoretical proof showing that for the same KV Cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.

https://arxiv.org/pdf/2502.07864

shihab 1 year ago | |

For future readers, note that those 3x and 10x figures are compared to vLLM's own previous release, and NOT compared to Deepseek's implementation.

I am very curious to see how well-optimized Deepseek's code is compared to leading LLM serving software like vLLM or SGLang.

lhl 1 year ago | |

It's great to see vLLM getting faster/better for DeepSeek. I tested vLLM vs SGLang a couple weeks ago and SGLang's DeepSeek support was much better/faster (on 2 x p5 H100 nodes). It's great that no one's standing still, I saw this recent AMD article that reported SGLang perf on MI300X has increased by 4X over the past couple weeks: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...

(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)

It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.

menaerus 1 year ago | |

Pretty significant improvements. However, my back-of-the-napkin math suggests that MLA, FlashAttention and similar optimizations provide benefits only when memory access time dominates compute in the attention implementation? That would be the prefill phase (or TTFT) and training (when batch_size >> 1), but not the decode phase (inference)?

FL33TW00D 1 year ago | | |

You have it backwards.

Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.

rfoo 1 year ago | | |

You've got it backwards. After FlashAttention, it's the decoding part being bound mainly by memory access. With FA as long as you have enough batch size you can push training/prefill to be compute-bound.
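A rough way to see the compute-bound vs. memory-bound split described in the two replies above is to count FLOPs per byte of KV cache read. The sketch below is a simplified roofline estimate, not a measurement. It assumes BF16 (2 bytes per element), ignores causal masking, softmax, and Q/output traffic, and uses approximate H100 SXM peak figures (989 TFLOPS dense BF16, 3.35 TB/s) that do not come from this thread. The function and constant names are made up for illustration.

```python
# Simplified roofline estimate for attention over a KV cache.
# Assumptions (not from the thread): BF16 storage, approximate H100 SXM peaks.

PEAK_FLOPS = 989e12        # assumed dense BF16 peak, FLOP/s
PEAK_BANDWIDTH = 3.35e12   # assumed HBM bandwidth, bytes/s


def attention_intensity(queries_sharing_kv, bytes_per_element=2):
    """FLOPs per byte of KV cache read, per head and per head dimension."""
    # For each cached token we read one K element and one V element.
    bytes_read = 2 * bytes_per_element
    # Each query does a multiply-add against K (q.k) and one against V (p.v),
    # which is 4 FLOPs per query for that K/V element pair.
    flops = 4 * queries_sharing_kv
    return flops / bytes_read


# Intensity above this ridge point means compute-bound, below means memory-bound.
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH

decode = attention_intensity(queries_sharing_kv=1)      # one new token per step
prefill = attention_intensity(queries_sharing_kv=4096)  # whole prompt at once

print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(f"decode:  {decode:.0f} FLOP/byte -> memory-bound: {decode < ridge_point}")
print(f"prefill: {prefill:.0f} FLOP/byte -> compute-bound: {prefill > ridge_point}")
```

Under these assumptions decode sits around 1 FLOP/byte, far below the ridge point, while prefill reuses each K/V element across thousands of queries. Larger batches do not help plain MHA decode here, since each sequence reads its own KV cache.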

albertzeyer 1 year ago | |

I also just read that paper. But I wonder, even though MLA is strictly more powerful, do you really gain from that in experiments? The paper doesn't do many experimental comparisons. GQA, on the other hand, should still be faster (no need for an extra linear transformation).

helloericsf 1 year ago |

X: https://x.com/deepseek_ai/status/1893836827574030466

BF16 support; paged KV cache (block size 64); 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800.

WithinReason 1 year ago | |

That's 90% bandwidth efficiency and 60% compute efficiency

https://www.nvidia.com/en-us/data-center/h100/
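The percentages above can be roughly reproduced by dividing the reported FlashMLA figures by peak numbers. The peaks used here (3.35 TB/s HBM bandwidth and 989 TFLOPS dense BF16 for an H100 SXM) are assumptions taken as approximate datasheet values, not figures stated in this thread, and the H800 is assumed to match them on these two specs.

```python
# Reported FlashMLA figures (from the announcement quoted in the thread).
achieved_bandwidth_gbps = 3000
achieved_tflops = 580

# Assumed peak figures for an H100 SXM (approximate, not from the thread).
peak_bandwidth_gbps = 3350
peak_bf16_tflops = 989

bandwidth_efficiency = achieved_bandwidth_gbps / peak_bandwidth_gbps
compute_efficiency = achieved_tflops / peak_bf16_tflops

print(f"bandwidth efficiency: {bandwidth_efficiency:.1%}")
print(f"compute efficiency:   {compute_efficiency:.1%}")
```

With these inputs the results land close to the 90% and 60% quoted above.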

helloericsf 1 year ago | | |

They don't have H100s. Wink, wink.

FL33TW00D 1 year ago |

It seems to me that MLA will become the standard from here on out.

If Deepseek R1 had used standard MHA, they would need 1749KB per token for KV cache storage. This means that once the conversation reaches ~46,000 tokens, the KV cache will have exceeded the entire storage capacity of a single H100.

Using MLA, each token now consumes 125KB. This means you can hit ~640,000 tokens (2x Ulysses) before overflowing.
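The token limits above follow from dividing GPU memory by the per-token KV cache size. This minimal sketch takes the 1749KB and 125KB figures as given and assumes an 80 GB H100 with decimal units and all memory available for the KV cache (so model weights and activations are ignored). The function name is hypothetical.

```python
# Assumption: 80 GB H100, decimal units, entire memory used for KV cache.
H100_MEMORY_BYTES = 80 * 10**9


def max_tokens(per_token_kb, capacity_bytes=H100_MEMORY_BYTES):
    """Number of whole tokens whose KV cache fits in the given capacity."""
    return capacity_bytes // (per_token_kb * 1000)


mha_tokens = max_tokens(1749)  # per-token size quoted for standard MHA
mla_tokens = max_tokens(125)   # per-token size quoted for MLA

print(f"MHA: ~{mha_tokens:,} tokens")
print(f"MLA: ~{mla_tokens:,} tokens")
print(f"MLA fits ~{mla_tokens / mha_tokens:.0f}x more tokens")
```

In a real deployment the weights take a large share of memory, so the usable token counts would be lower for both variants; the ratio between them stays the same.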

ur-whale 1 year ago |

For those who wonder ... it's somewhat likely that MLA means Multi-head Latent Attention

https://verticalserve.medium.com/group-query-attention-58283...

https://paperswithcode.com/method/multi-head-attention

eigenvalue 1 year ago |

Nice, probably saved a bunch of FANG devs a lot of hours of work trying to knock this off.

nicce 1 year ago | |

There were likely some startups that tried to sell the same thing…

anon389r58r58 1 year ago | | |

You mean like Modular?

imranq 1 year ago |

Dang only forward passes. The real secret was in the backward pass! I was also curious to learn how they implemented the dualpipe scheduler

rfoo 1 year ago | |

Do they even have an optimized backward? It looks like optimizations like this aren't needed during training. Their V2 paper also suggests so.

mohsen1 1 year ago |

I'm confused. Weren't there sanctions against Chinese companies about Hopper GPUs? Are they just admitting that they had access to H100s against the US sanctions?!

thot_experiment 1 year ago | |

Just the H100. The H800 is a region-specific version of the card for China with shitty NVLink bandwidth, which makes it rougher for building big clusters, but DeepSeek was able to mitigate the impact of that by being clever (rumored to have made significant use of PTX assembly instead of just using CUDA; we'll probably find out in the releases this week).

ahofmann 1 year ago | |

It isn't illegal for Chinese companies to buy H100 cards. It is illegal for USA companies to sell them to China. So the "admit" part wouldn't be on China's side.

jofzar 1 year ago | | |

It's also totally legal to sell H100 cards to a country that is very close to China.

Unrelated, it's always impressed me how Singapore buys 15% of the world's H100s. Really is the AI development capital of the world.

amelius 1 year ago | | |

Also breaking the law to growth-hack happens all the time, see Uber.

Tiberium 1 year ago | |

H800 is the export variant that they had access to. They directly reference it in the repo:

>Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.

WiSaGaN 1 year ago | |

H20 is a Hopper GPU, and it is allowed to be sold in China.

jonplackett 1 year ago | |

Can everyone stop downvoting people just for asking questions - this isn’t Stack Overflow!

feverzsj 1 year ago | |

The secret ingredient is smuggling.

tasuki 1 year ago | | |

I'd be very careful when using that word in this situation. If China wants X, and another country has X, who are you to say they shouldn't trade with each other?

7952 1 year ago | | |

Do you think that would be morally wrong? Honest question.

rob_c 1 year ago |

Great work. Any plans to integrate with PyTorch or TensorFlow, I wonder?

(Showing my lack of breadth of knowledge in the ecosystem (s))

behnamoh 1 year ago |

Open AI is back!

echelon 1 year ago | |

The real "Open" AI.

fsndz 1 year ago | | |

DeepSeek is just the gift that keeps on giving. I now agree with people who say open source AI will win: https://open.substack.com/pub/transitions/p/deepseek-is-comi...

mclau156 1 year ago |

Was really hoping we could get flash games back with AI

kridsdale1 1 year ago | |

Ask an LLM to write you some ActionScript3

syntex 1 year ago |

What can I do with that?

rfoo 1 year ago | |

Probably nothing.

Inference providers like Fireworks, or major clouds, can use this to reduce their cost, if they don't already have a replication with similar perf.

vLLM and SGLang may integrate this to be faster at serving DeepSeek-V2/V2.5/V3/R1 on H100/H800s.

I believe that's why they didn't release this back then, this is part of their "moat" (pretty weak tho) and it only benefits competitors.

Open sourcing this after being very popular may indicate that they don't want all the users to use their API/Chat and now want the world to serve it instead? Idk.

rvz 1 year ago |

This is the minimum bar that I expect very elite programmers should be striving for in the age of AI. DeepSeek should be studied as an example, and this is only the first of many projects from them.

There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the ones who are able to build or adapt projects like this, which are deep into hardware systems, will be the most sought after.

Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.

You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.

m3kw9 1 year ago |

MHGA making hopper great again

deyiao 1 year ago |

I heard their inferencing framework is way lower than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vllm or llama.cpp

reissbaker 1 year ago | |

By "lower" you mean cheaper/better?

I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.

nialv7 1 year ago | | |

I think they meant lower level.

helloericsf 1 year ago | |

What do you mean by "lower"? To my understanding, they will open 5 infra related repos this week. Let's revisit your comparison question on Friday.

find0x90 1 year ago | |

I don't see any use of PTX, might be in one of the other repos they plan to release.

DesiLurker 1 year ago | | |

Right, I think the PTX use is a bigger deal than it's getting coverage for. This creates an opening for other vendors to get their foot in with PTX-to-LLVM-IR translation for existing CUDA kernels.

feverzsj 1 year ago | |

Maybe. Apple ditched them in China, because their infra can't handle large scale users.

helloericsf 1 year ago | | |

Don't think the decision is based on infra, or any technical reasons. It's more on the service support side. How does a 200-person company support 44M iPhone users in China?

chvid 1 year ago | | |

Is that true? I thought Apple was going to use their own infrastructure.

tw1984 1 year ago | | |

DeepSeek doesn't have any experience supporting a 50 million user base. That was the reason cited by Apple a few weeks ago.