Nvidia DGX GH200 Whitepaper

Nvidia DGX GH200 Whitepaper (resources.nvidia.com)

95 points by volta87 2 years ago | 43 comments

tuetuopay 2 years ago |

Why is this called a whitepaper, as this is more of a documentation and architecture overview of the cluster? Wow a CLOS topology for networking, very innovative.

Details on NVLink would be great. For example, the needs and problems solved by their custom cables seemingly required by NVLink would be worth a whitepaper.

Don't get me wrong, it's still great that the general public can get a glimpse into Grace Hopper. And they do a good job of simplifying while throwing around mind-boggling numbers (the NVLink bandwidth is insane, though there's no word on latency, which is crucial for remote memory access).

mmaunder 2 years ago | |

> Why is this called a whitepaper, as this is more of a documentation and architecture overview of the cluster?

That’s what a marketing white paper is and does. It’s not an academic paper.

flakiness 2 years ago | | |

To be fair, NVIDIA used to publish more detailed "white papers" for their GPUs, e.g. [1], and CPU textbooks like H&P [2] draw a lot of details from these. This less detailed "whitepaper" still has a scent of that old tradition.

[1] https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-...

[2] https://www.amazon.com/Computer-Architecture-Quantitative-Jo...

danpalmer 2 years ago | | |

I was always taught that “whitepapers” were this sort of thing and were distinct from academic papers. However this seems to be industry or ecosystem specific because the cryptocurrency ecosystem uses “whitepaper” to mean their academic papers, or at least their approximation of them.

callalex 2 years ago | |

NVDA has spent too much time surrounded by cryptocurrency hacks that published “whitepapers” left and right with zero technical information or innovation. As they say, never get high on your own supply.

syntaxing 2 years ago | |

Agreed, seems like an application note more than a white paper.

smodad 2 years ago |

What's funny is that even though the DGX GH200 is some of the most powerful hardware available, there's such a voracious demand that it's not gonna be enough to quench it. In fact, this is one of those cases where I think the demand will always outpace supply. Exciting stuff ahead.

I heard Elon say something interesting during the discussion/launch of xAI: "My prediction is that we will go from an extreme silicon shortage today, to probably a voltage-transformer shortage in about a year, and then an electricity shortage in about a year, two years."

I'm not sure about the timeline, but it's an intriguing idea that soon the rate limiting resource will be electricity. I wonder how true that is and if we're prepared for that.

jiggawatts 2 years ago | |

He’s just plain wrong about the electricity usage going up because of AI compute.

To a first approximation, the amount of silicon wafers going through fabs globally is constant. We won’t suddenly increase chip manufacturing a hundredfold! There aren’t enough fabs or “tools” like the ASML EUV machines for that.

Electricity is used for lots of things, not just compute, and within compute the AI fraction is tiny. We’re ramping up a rounding error to a slightly larger rounding error.

What will increase is global energy demand for overall economic activity as manufacturing and industry are accelerated by AIs.

Anyone who’s played games like Factorio would know intuitively that the only two real inputs to the economy are raw materials and energy. Increases to manufacturing speed need matching increases to energy supply!

smodad 2 years ago | | |

I bet you're right. Even if you take into account that a data center is a monster consumer of energy, in the grand scheme of things it's not that big. Some back of the envelope math:

Global electrical production in 2022 was ~30,000 TWh.[1]

If we over-estimate that a hyperscale data-center will consume about 100 MW of power, per year that would be around 876 GWh.[2]

Let's overestimate again and say that 1,000 new data centers spring up in a year. Together they would consume 876 TWh per year.

That is 2.92% of total electricity production. Given that I overestimated the energy consumption by more than an order of magnitude, I would say the term "rounding error" is accurate.
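For anyone who wants to check the arithmetic, here is a minimal sketch that reproduces it. All inputs are the deliberately high estimates from this comment; the variable names are just for illustration.

```python
# Back-of-the-envelope check of the figures in this comment.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

global_production_twh = 30_000  # ~2022 global electricity production [1]
datacenter_power_mw = 100       # deliberately high per-site estimate [2]
new_datacenters = 1_000         # deliberately high count of new sites

# Energy per data center per year: MW * h = MWh, then MWh -> GWh.
per_datacenter_gwh = datacenter_power_mw * HOURS_PER_YEAR / 1_000

# Energy for all new data centers: GWh -> TWh.
total_twh = per_datacenter_gwh * new_datacenters / 1_000

share_percent = 100 * total_twh / global_production_twh

print(f"Per data center:      {per_datacenter_gwh:.0f} GWh/year")
print(f"All new data centers: {total_twh:.0f} TWh/year")
print(f"Share of production:  {share_percent:.2f}%")
```

Changing `datacenter_power_mw` or `new_datacenters` shows how far the estimates would have to move before the share stops looking like a rounding error.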

I think the main limiting factor in the near term is going to be chip production capacity. The fabs take so long to spin up, it's going to be a while before we can even consider "electricity production" being a limiting factor.

[1] https://yearbook.enerdata.net/electricity/world-electricity-...

[2] https://cc-techgroup.com/data-center-energy-consumption/

paskjdfparwerwe 2 years ago | | |

Elon is speaking with all the Eliezer-esque "foom" in mind, wherein AI will explode and either kill us or help us take over the Universe (and destroy everything in our way).

wmf 2 years ago | | |

A wafer of H100s uses far more electricity than a wafer of [Apple] A16s though.

michaelt 2 years ago | |

> I wonder how true that is

An Nvidia A100 costs $10000 and consumes 300W.

It seems unlikely that anyone could afford the number of A100s needed to create an electricity shortage.

If there is an electricity shortage, it is far more likely that ageing infrastructure and rising demand for air conditioning and electric car charging are to blame.
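A rough sketch of that affordability argument, using the A100 price and power draw from this comment and the ~30,000 TWh global production figure quoted elsewhere in the thread. The 1% threshold is a hypothetical choice for illustration, and the sketch ignores cooling and host-system overhead.

```python
HOURS_PER_YEAR = 24 * 365

a100_price_usd = 10_000         # from this comment
a100_power_w = 300              # from this comment
global_production_twh = 30_000  # figure quoted elsewhere in the thread
target_share = 0.01             # hypothetical threshold: 1% of global production

# Annual energy of one GPU running flat out: W * h = Wh, then Wh -> MWh.
per_gpu_mwh = a100_power_w * HOURS_PER_YEAR / 1_000_000

# Target energy: TWh -> MWh.
target_mwh = global_production_twh * target_share * 1_000_000

gpus_needed = target_mwh / per_gpu_mwh
cost_usd = gpus_needed * a100_price_usd

print(f"GPUs needed:   {gpus_needed / 1e6:.0f} million")
print(f"Purchase cost: ${cost_usd / 1e12:.2f} trillion")
```

Under these assumptions, even a 1% share takes on the order of a hundred million GPUs and more than a trillion dollars in hardware.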

callalex 2 years ago | |

Are there any examples at all of that guy being right about a technology prediction?

kanwisher 2 years ago | | |

Electric cars, rocket ships …

swyx 2 years ago | |

i mean he's not the only one. sama's other big bet is on nuclear fusion. https://blog.samaltman.com/helion

mmaunder 2 years ago |

The memory and bandwidth numbers are mind blowing. Going to be very hard to catch Nvidia. It’s as if competitors are going through the motions for participation prizes.

jacquesm 2 years ago |

I wonder how much this thing will cost. The best I've been able to find so far is a 'low 8 digits' estimate in an AnandTech article, but nothing more specific than that.

https://www.anandtech.com/show/18877/nvidia-grace-hopper-has...

tikkun 2 years ago | |

Some private cloud execs I talked with ballparked it at $15-25mm [1].

[1]: (I wrote this) https://gpus.llm-utils.org/nvidia-h100-gpus-supply-and-deman...

tikkun 2 years ago |

As context: 1x DGX GH200 has 256x GH200s, each of which has 1x H100 and 1x Grace CPU.

luc4sdreyer 2 years ago | |

Adding up to "1 exaFLOPS" (sparse FP8). For reference, the fastest FP64 supercomputer is the AMD-based Frontier supercomputer, at 1.1 exaFLOPS.
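A quick sketch of what those figures imply per superchip, using only the numbers quoted in this subthread. The comparison with Frontier is printed only to show how close the headline figures are; the two numbers are measured at very different precisions.

```python
total_flops = 1e18       # "1 exaFLOPS" sparse FP8, as quoted above
superchips = 256         # GH200 superchips per DGX GH200, as quoted above
frontier_flops = 1.1e18  # Frontier's FP64 figure, as quoted above

# Implied per-superchip rate, converted from FLOPS to petaFLOPS.
per_superchip_pflops = total_flops / superchips / 1e15
print(f"Implied per-superchip rate: {per_superchip_pflops:.2f} PFLOPS (sparse FP8)")

# This ratio compares 8-bit sparse throughput with 64-bit throughput,
# so it says nothing about how the two systems compare on the same workload.
headline_ratio = total_flops / frontier_flops
print(f"Headline ratio vs Frontier: {headline_ratio:.2f}")
```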

danbruc 2 years ago | | |

Does sparse mean anything other than we can not actually do as many FP8 operations per second as we just claimed? To me it sounds like they can do X matrix operations per second on sparse matrices using Y FP8 operations per second, but instead of just saying what Y is they tell us how many FP8 operations would be required if the matrices were not sparse. Is this pure marketing bullshit or is there some logic to this? How sparse do those matrices have to be? Or am I misunderstanding this claim?

redox99 2 years ago | | |

They quote sparse FP8 because it's the biggest number. The most relevant number would be FP16 (non-sparse), but they don't mention that.
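To put rough numbers on the two comments above, here is a sketch under two assumptions that are not stated in the thread. As I understand NVIDIA's public descriptions, "sparse" refers to 2:4 structured sparsity (two of every four weights are zero), which doubles the quoted peak rate, and peak FP8 throughput is twice peak FP16. The helper `prune_2_of_4` is a hypothetical illustration of the pattern, not NVIDIA's pruning method.

```python
headline_sparse_fp8 = 1e18  # exaFLOPS figure quoted in the thread

SPARSITY_SPEEDUP = 2  # assumption: 2:4 structured sparsity doubles the peak rate
FP8_OVER_FP16 = 2     # assumption: FP8 peak is twice the FP16 peak

dense_fp8 = headline_sparse_fp8 / SPARSITY_SPEEDUP
dense_fp16 = dense_fp8 / FP8_OVER_FP16
print(f"Dense FP8:  {dense_fp8 / 1e18:.2f} exaFLOPS")
print(f"Dense FP16: {dense_fp16 / 1e18:.2f} exaFLOPS")


def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
pruned = prune_2_of_4(weights)
nonzero = sum(1 for w in pruned if w != 0.0)
print(pruned)
print(f"{nonzero} of {len(pruned)} multiplications actually needed")
```

So the headline counts the multiplications a dense matrix would have needed, while the hardware skips the half that are known to be zero. Whether a given model tolerates that pruning is a separate question.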

LASR 2 years ago |

It would be interesting to know what kind of next-gen models this can train.

On the LLM frontier, we’re starting to hit the limits of reasoning abilities in the current gen.

paskjdfparwerwe 2 years ago | |

... and the current generation is just an ensemble of the prev. generation.

moab 2 years ago |

Unfortunate that they don't mention the running times for any of the applications they benchmark (e.g., PageRank). Does anyone in the know have some idea how long this takes?

m3kw9 2 years ago |

So basically 2x faster than H100

luc4sdreyer 2 years ago | |

They claim 1.1x to 7x, depending on what you're doing. The 10% to 50% is for the ~10k GPU LLM training, where the main bottleneck tends to be networking:

> DGX GH200 enables more efficient parallel mapping and alleviates the networking communication bottleneck. As a result, up to 1.5x faster training time can be achieved over a DGX H100-based solution for LLM training at scale.

kvetching 2 years ago | |

Was this upgrade known, or is this out of left field, and are people who stocked up on H100s going to feel a little regret?

wmf 2 years ago | | |

It's been on the roadmap for a few years although there were no performance numbers. I assume GH200 is more expensive so the price/performance advantage may not be overwhelming. Worst case you order GH200s and then scalp your H100s on the used market.