Popping the GPU Bubble

113 points by radq 3 hours ago | 21 comments

augment_me 22 minutes ago |

As someone who works in the field, I think the blog is nice, but it has a lot of CODEX fingerprints on it, and it's also very specific to the size of the model in question in a way that is not made explicit until the very last section.

In general, for some reason CODEX loves CUDA streams; it's the first optimization it goes for every time when writing GPU kernels. However, in many cases this is not a bottleneck. It happens to be one here because the model in the blog is small (a 2.4ms FW pass is tiny, and 9B params sit on a single GPU). Large models are closer to 30-40ms. The CPU-GPU sync is 1-2ms, so when working on larger MoE models the scheduling of tokens in this way is much less important than, for example, scheduling of computation/communication or kernel optimization.

I wish the blog would state this at the start with the premise of what has been done, or show that this is indeed the bottleneck with some benchmarking. Otherwise it is kind of overselling things imo.
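
To put rough numbers on the model-size point, here is a minimal back-of-the-envelope sketch. It assumes the CPU-GPU sync is fully serialized with the forward pass (no overlap at all), which is a simplification, and `bubble_fraction` is a name made up for illustration. The 2.4ms, 30-40ms and 1-2ms figures are the ones quoted in this comment.

```python
def bubble_fraction(forward_ms: float, sync_ms: float) -> float:
    """Share of each step the GPU sits idle, assuming the sync is
    fully serialized with the forward pass (no overlap)."""
    return sync_ms / (forward_ms + sync_ms)


# Timings are the ones quoted in the comment; the serialized model is an assumption.
for forward_ms in (2.4, 30.0, 40.0):
    for sync_ms in (1.0, 2.0):
        frac = bubble_fraction(forward_ms, sync_ms)
        print(f"forward={forward_ms:5.1f} ms  sync={sync_ms:.1f} ms  idle={frac:6.1%}")
```

Under that assumption the 2.4ms model loses roughly 29-45% of each step to the sync, while a 30-40ms forward pass loses only about 2-6%.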

radq 3 minutes ago | |

Appreciate you saying the blog was nice. Not sure what you mean by "CODEX fingerprints", but I'll engage with the other points. We work on small models, and our customers want real-time inference on modern GPUs. The sub-title says "near-realtime VLM inference". 20-30ms forward passes are a non-starter for these workloads.

If you scroll down to the section titled "A cost model for the bubble", you will find both benchmark results and us saying, "you get back anywhere from a few percent to a third; more the faster your accelerator/model is".

blueblazin 2 hours ago |

I really appreciate this type of article. I feel like a lot of knowledge in LLM training and inference is locked inside the heads of practitioners, similar to compiler engineers before.

To work in LLM training/inference you’re expected to know this stuff, but to know this stuff you need to be working in the space.

radq 2 hours ago | |

Thank you for the kind words. We will write and share more of these.

rjzzleep 2 hours ago | |

Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is being poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices, as witnessed in this post. But if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, no matter the context size or the model size. So then you have two things to worry about:

First, how do you know exactly what the optimal VRAM assignment per model and per context size is, which currently seems to be based purely on experience? And second, how do you make sure that only that amount is available to your infra/containers, which is being handled by DRA and stuff like https://project-hami.io
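
For the first question, a common starting point is a simple estimate of weights plus KV cache. The sketch below is only that: every number in `ModelShape` is a hypothetical placeholder rather than a real model's config, and the estimate ignores activations, the CUDA context and allocator fragmentation, so treat it as a lower bound.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    # All values are hypothetical placeholders, not a real model's config.
    n_params: float            # total parameter count
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2   # fp16/bf16


def weights_bytes(m: ModelShape) -> float:
    return m.n_params * m.bytes_per_value


def kv_cache_bytes(m: ModelShape, context_len: int, batch_size: int = 1) -> int:
    # 2x for keys and values, stored per layer, per KV head, per token.
    per_token = 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value
    return per_token * context_len * batch_size


shape = ModelShape(n_params=9e9, n_layers=32, n_kv_heads=8, head_dim=128)
for ctx in (4_096, 32_768):
    total = weights_bytes(shape) + kv_cache_bytes(shape, ctx)
    print(f"context={ctx:6d}  estimated VRAM >= {total / 2**30:.1f} GiB")
```

With these placeholder dimensions the KV cache grows linearly with context length (128 KiB per token), which is why a single fixed VRAM budget per model is hard to pick without knowing the context size.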

This is only tangentially related to the blog post, but the title is picked in such a way that I couldn't help but put the shameless plug here. When he wrote "popping the bubble", I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.

Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.

gardnr 2 hours ago |

Different bubble than the one I was hoping for.

This appears to be different than the recent "Speculative Pipeline Decoding" paper: https://arxiv.org/abs/2605.30852

nl 2 hours ago |

> you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.

This is true, but I've never heard anyone refer to this as a GPU bubble before.

I think most people hear "GPU bubble" and think of a financial bubble of some kind.

SCdF 2 hours ago | |

It appears to be a real term? https://docs.vulkan.org/tutorial/latest/Synchronization/Asyn...

Very odd, but perhaps more familiar to graphics programmers? I will say I'd probably call it a stall, which is exactly what the Vulkan docs call it moments later, so :shrug:

kibibu 2 hours ago | | |

"bubble" used to be used a lot more when talking about very deep pipelines, e.g. Pentium 4 depth.

spaqin 1 hour ago | |

Pretty sure that would be "[GPU performance] bottlenecked [by the CPU]" in most common terms.

_zoltan_ 1 hour ago | |

while the title is misleading, when reading GPU profiling data, we do call these bubbles - where the GPU _could_ do something, but it's idle.

any time your GPU is idle = you are losing $$$ = your TCO is going up. you don't want that.
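
a minimal sketch of how you could pull those bubbles out of a trace, assuming the profile has been exported as a list of (start, end) kernel timestamps. that tuple format, the `find_bubbles` name and the timestamps below are all made up for illustration, not any profiler's actual export format.

```python
def find_bubbles(kernels, min_gap_us=0.0):
    """Return (start, end) gaps during which no kernel was running.

    `kernels` is a list of (start_us, end_us) tuples; this format is an
    assumption for the sketch. Overlapping kernels (e.g. from multiple
    streams) are handled by tracking the latest end time seen so far.
    """
    bubbles = []
    busy_until = None
    for start, end in sorted(kernels):
        if busy_until is not None and start - busy_until > min_gap_us:
            bubbles.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return bubbles


# Made-up timestamps in microseconds, purely for illustration.
trace = [(0, 100), (90, 150), (400, 500), (500, 650), (900, 1000)]
bubbles = find_bubbles(trace)
idle = sum(end - start for start, end in bubbles)
span = max(end for _, end in trace) - min(start for start, _ in trace)
print(bubbles, f"idle share: {idle / span:.0%}")
```

the `min_gap_us` threshold is there so that tiny launch gaps can be filtered out if you only care about the larger stalls.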

vkazanov 2 hours ago | |

I saw it in literature on CPU pipelines in quotes, never without.

IshKebab 2 hours ago | | |

I've never seen it in quotes, but yeah it is a very common term in pipelined CPUs.

cma 2 hours ago | |

It's very common to call it a GPU bubble in gamedev, though not strictly for CPU-induced bubbles.

nnevatie 2 hours ago | |

Yes, the title seems off - I also thought I was going to be reading about the AI/pricing bubble.

rusk 2 hours ago | |

The term I would use would be “underutilised”

barries11 2 hours ago | | |

"stall" is the best term I can think of as in "pipeline stall".

Better term, anyone?

Schlagbohrer 1 hour ago |

I love the brand name, Moondream

fragmede 28 minutes ago |

That's a terrible name for that and I can't say that Hanlon's razor applies. The bubble that everyone's knowingly referring to is the stock market collapsing like in 2001. To choose a headline that can be mistaken for that just to get clicks is shit. You could've called it GPU-CPU pipeline stall, but no, you intentionally chose a name that would be confused for something else just to get clicks?