GPU utilization can be a misleading metric (trainy.ai)

144 points by roanakb 1 year ago | 36 comments

> you can get 100% GPU utilization by just reading/writing to memory while doing 0 computations

Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.

On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.

tanelpoder 1 year ago | |

This reminds me of the Linux/Unix disk busy "%util" metric in tools like sar and iostat. People sometimes interpret the 100%util as a physical ceiling for the disk IO capacity, just like with CPUs ("we need more disks to get disk I/O utilization down!").

It is a correct metric when your block device has a single physical spinning disk that can only accept one request at a time (dispatch queue depth=1). But the moment you deal with SSDs (capable of highly concurrent NAND IO), SAN storage block devices striped over many physical disks, or even a single spinning disk that can internally queue and reorder IOs for more efficient seeking, just hitting 100%util at the host block device level doesn't mean that you've hit some IOPS ceiling.

So, looks like the GPU "SM efficiency" analysis is somewhat like logging in to the storage array itself and checking how busy each physical disk (or at least each disk controller) inside that storage array is.
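
To make the %util point concrete, here is a minimal sketch in Python. It assumes a simplified model where "busy" just means at least one request is in flight; the interval data is made up for illustration and is not taken from iostat or sar.

```python
def busy_fraction(intervals, window):
    """Fraction of the window during which at least one request is in flight."""
    busy = 0.0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this request: close the previous busy stretch.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent request: extend the busy stretch.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / window


def mean_in_flight(intervals, window):
    """Average number of requests in flight over the window."""
    return sum(end - start for start, end in intervals) / window


# Illustrative data: one request at a time, back to back, for 10 seconds.
serial = [(float(t), float(t + 1)) for t in range(10)]
# Eight requests at a time over the same 10 seconds.
concurrent = serial * 8

util_serial = busy_fraction(serial, 10.0)
util_concurrent = busy_fraction(concurrent, 10.0)
print(util_serial, mean_in_flight(serial, 10.0))
print(util_concurrent, mean_in_flight(concurrent, 10.0))
```

Both workloads report the same busy fraction even though one keeps eight times as many requests in flight, which is why a "busy" percentage alone can't tell you how close a concurrent device is to its ceiling.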

serial_dev 1 year ago | |

This sounds like the good old "having high test coverage is bad because I can get to 100% just by calling functions and doing nothing, asserting nothing with them".

100% test coverage doesn't mean your tests are good, but having 50% (or pick your number) means they are bad / not sufficient.

heavenlyblue 1 year ago | | |

That isn't even necessarily true. For interpreted languages having a test that just runs code asserts that the code is able to run (i.e. you are not calling a string object as a function for example). Which is not enough to always assert functionality but still better than nothing.

HPsquared 1 year ago | | |

In other words it's "necessary, but not sufficient".
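
A tiny Python sketch of the "just running the code" case described above. The function names are hypothetical and the bug is planted on purpose.

```python
def greet(name):
    greeting = "hello"
    # Planted bug: `greeting` is a str, so calling it raises TypeError at run time.
    return greeting(name)


def smoke_test(fn, *args):
    """Run fn without asserting anything about its result.

    Returns the exception class name if the call fails, else None.
    """
    try:
        fn(*args)
    except Exception as exc:
        return type(exc).__name__
    return None


result = smoke_test(greet, "world")
print(result)
```

The test asserts nothing about the return value, yet it still surfaces the error. It would not catch a function that runs fine and returns the wrong answer, which is the "not sufficient" half.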

roanakb 1 year ago | |

Yup, similar to SM efficiency in that sense too. If you aren't seeing >80%, there is certainly time left on the table. But getting a high SM efficiency value doesn't guarantee you're making good use of the hardware either. (still a better proxy than GPU util though)

shaklee3 1 year ago | |

This is not true. Lots of algorithms simply can't use 100% of the GPU even though they're written as optimally as possible. FFT is one.

defrost 1 year ago | | |

In remote sensing / computational physics applications it's rare to have a single FFT to compute (whatever algorithm is chosen).

Hence the practice of stuffing many FFTs through GPU grids in parallel and working to max out the hardware usage in order to increase application throughput.

e.g.:

https://arxiv.org/pdf/1707.07263

https://ieeexplore.ieee.org/document/9835388

jorvi 1 year ago | |

> On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.

Some of us like having more than 2 hours of battery life, and not scalding our skin in the process of using our devices.

antognini 1 year ago |

When understanding the performance of your model it's very helpful to look at a roofline plot [1]. The roofline plot will show you the floating-point performance as a function of arithmetic intensity for the various ops in your model. The plot has two regimes: a memory-bound regime on the left and a compute-bound regime on the right. This can help to identify memory-bound ops that are taking a significant fraction of compute time.

[1]: https://en.wikipedia.org/wiki/Roofline_model
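
For anyone who hasn't seen it, the roofline bound itself is a single `min`. A rough sketch in Python follows; the peak numbers are placeholders for a hypothetical device, not the specs of any real GPU.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline bound: min(peak compute, intensity * memory bandwidth).

    arithmetic_intensity is in FLOP/byte, peak_flops in FLOP/s,
    peak_bandwidth in bytes/s.
    """
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Intensity at which an op stops being memory-bound."""
    return peak_flops / peak_bandwidth


# Hypothetical device (assumed numbers): 100 TFLOP/s and 2 TB/s.
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
for ai in (1.0, 10.0, 50.0, 200.0):
    bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:6.1f} FLOP/byte  bound={bound / 1e12:6.1f} TFLOP/s  {regime}")
```

Ops to the left of the ridge point sit on the sloped part of the roof, so the only ways to speed them up are moving fewer bytes or fusing them with neighbouring ops.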

roanakb 1 year ago | |

Agreed, roofline plots would be quite powerful in this context. From a quick search, seems like the only way to create a roofline plot for your model would be to use Nsight [1]? Would be interested to know if there are any simpler tools, since one of the big benefits of SM efficiency is how easily the metric is accessed.

[1]: https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s...

jfkfif 1 year ago | | |

Depending on the size of your application you can calculate flops by hand

https://docs.nersc.gov/tools/performance/roofline/
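
As a sketch of what "by hand" can look like for a single op, here is the count for a dense matmul in Python. It assumes the usual 2 FLOPs per multiply-add and the best-case memory traffic where each matrix is moved exactly once; real kernels move more.

```python
def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product, 2 per multiply-add."""
    return 2 * m * n * k


def matmul_bytes(m, n, k, bytes_per_element=2):
    """Best-case traffic: read both inputs once, write the output once.

    bytes_per_element=2 assumes fp16/bf16 storage.
    """
    return (m * k + k * n + m * n) * bytes_per_element


def arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte moved, the x-axis of a roofline plot."""
    return matmul_flops(m, n, k) / matmul_bytes(m, n, k, bytes_per_element)


# A large square matmul versus a matrix-vector style product.
square = arithmetic_intensity(4096, 4096, 4096)
skinny = arithmetic_intensity(1, 4096, 4096)
print(f"square: {square:.1f} FLOP/byte")
print(f"skinny: {skinny:.3f} FLOP/byte")
```

The skinny case lands at roughly 1 FLOP/byte, so it will be memory-bound on pretty much any modern GPU no matter how well the kernel is written.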

sundalia 1 year ago |

Application-specific metrics are the way to go. For ML training this is one example: https://cloud.google.com/blog/products/ai-machine-learning/g...

roanakb 1 year ago | |

Nice, seems like ML Productivity Goodput is a pretty well thought-out metric to understand the overall efficiency of your cluster. I'll consider adding this into our cluster management platform. Only potential drawbacks I'd guess are it being somewhat difficult to compute since it relies on metrics like MFUs, and not something we can observe layer-by-layer to understand inefficient kernels, but I'll take a deeper look. Thanks!
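
For reference, a back-of-the-envelope MFU calculation can be sketched like this in Python. It assumes the common approximation of about 6 FLOPs per parameter per token for training a dense transformer (forward plus backward, ignoring attention), and every number below is hypothetical.

```python
def model_flops_utilization(params, tokens_per_second, peak_flops_per_gpu, num_gpus):
    """MFU = achieved model FLOP/s divided by aggregate peak FLOP/s.

    Assumes ~6 * params FLOPs per trained token, which is an
    approximation and not an exact count for any specific model.
    """
    flops_per_token = 6 * params
    achieved_flops_per_second = flops_per_token * tokens_per_second
    return achieved_flops_per_second / (peak_flops_per_gpu * num_gpus)


# Hypothetical run: 7B parameters, 20k tokens/s across 8 GPUs,
# each with an assumed 1 PFLOP/s peak.
mfu = model_flops_utilization(
    params=7e9,
    tokens_per_second=20_000,
    peak_flops_per_gpu=1e15,
    num_gpus=8,
)
print(f"MFU = {mfu:.1%}")
```

Note this only needs throughput and the parameter count, so it is cheap to compute for a whole job, but as mentioned above it says nothing about which layer or kernel is the slow one.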

sergiotapia 1 year ago |

running GPU models and maximizing utilization is pretty opaque to me as a layman coming into the scene.

take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481... - running a fairly simple model that takes only 70ms per image pair, but because I have 300 images it becomes a big time sink.

by using ThreadPoolExecutor, I cut that down to about 16 seconds. i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS? I haven't been successful at even running the MPS daemon on my Linux server yet. very opaque for sure!

dahart 1 year ago | |

Start with Nsight Systems and turn on GPU metrics. It’s super easy and the plots will give you an immediate sense of your utilization, and low-hanging optimization opportunities.

So using 10-wide parallel processing took your batch from 21 seconds down to 16 seconds, did I do the arithmetic correctly? That suggests the single-threaded version isn’t too bad. I mean a 25% improvement is great and nothing to sneeze at, but batching might only be trimming the gaps in between image pairs, or queueing up your memory copies while the previous inference is running. You can verify this with nsys profiles.

> i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS?

No idea, it’s not always easy (and generally speaking gets harder and harder as you approach 100%), but first profile to see what your utilization is before going down any big technical route. Maybe with your ThreadPoolExecutor, you’re already getting max utilization and using MPS can’t possibly help.
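
Spelling out the arithmetic from this exchange in Python (300 pairs at 70 ms each, 16 seconds with the thread pool, 10 workers). The "ideal" line assumes perfect scaling across workers, which is an assumption for comparison only.

```python
n_pairs = 300
ms_per_pair = 70
workers = 10

serial_seconds = n_pairs * ms_per_pair / 1000      # one pair at a time
threaded_seconds = 16.0                            # reported with ThreadPoolExecutor
ideal_seconds = serial_seconds / workers           # assumes perfect scaling

time_saved = (serial_seconds - threaded_seconds) / serial_seconds
speedup = serial_seconds / threaded_seconds

print(f"serial:   {serial_seconds:.1f} s")
print(f"threaded: {threaded_seconds:.1f} s ({time_saved:.0%} less time, {speedup:.2f}x)")
print(f"ideal:    {ideal_seconds:.1f} s if 10 workers scaled perfectly")
```

The gap between 16 s and the ideal figure is consistent with the threads mostly overlapping small gaps around a GPU that is already serializing the inferences, but a profile is the only way to confirm that.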

zaptrem 1 year ago | |

Batch as many requests together as possible and your utilization will increase.
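
A minimal sketch of the bookkeeping side of batching in Python. `run_batch` is a hypothetical stand-in for one batched model call, and the batch size of 32 is arbitrary.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batch(batch):
    # Hypothetical placeholder: a real version would stack the inputs
    # into one tensor and make a single forward pass for the whole batch.
    return [f"result-for-{item}" for item in batch]


image_pairs = list(range(300))  # stand-ins for the 300 image pairs
results = []
calls = 0
for batch in batches(image_pairs, 32):
    results.extend(run_batch(batch))
    calls += 1

print(calls, len(results))
```

The point is that the model gets called 10 times instead of 300, so per-call launch and copy overhead is paid far less often and each call gives the GPU more parallel work.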

asaiacai 1 year ago | |

totally agreed. One of our main findings during this process is that there's still a lot of alpha in finding the right kernels for the job/model. We're hoping that `torch.compile` will become more mature in the future, because the current docs on performance, at least on the PyTorch side, definitely leave us wanting more.

DamonsJ 1 year ago |

"If we have a CUDA kernel that continuously runs for 10 seconds but only uses 1 SM, on an H100, this would register 100% utilization, but the SM efficiency would be 1 / 132 = 0.7%."

does this situation register 100% utilization? BTW, SM occupancy is also a metric you need to look at if you care about kernel efficiency

roanakb 1 year ago | |

Yup, you'll see 100% utilization on a kernel over a time period if it's considered active, which includes just having a single thread executing [1]. SM occupancy is great but can be a little difficult to interpret since you're not simply trying to maximize it, unlike SM efficiency.

[1]: https://pytorch.org/blog/pytorch-profiler-1.9-released/#gpu-...
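
The numbers in the quoted example can be reproduced with a toy model in Python. This is a simplification: it treats utilization as "any kernel running" and SM efficiency as time-weighted active SMs, and assumes kernels don't overlap. The 132 SM count comes from the quote.

```python
H100_SM_COUNT = 132  # from the quoted example


def gpu_utilization(kernels, window_seconds):
    """Fraction of the window in which any kernel was running.

    kernels is a list of (duration_seconds, active_sms);
    kernels are assumed not to overlap in time.
    """
    return min(1.0, sum(duration for duration, _ in kernels) / window_seconds)


def sm_efficiency(kernels, total_sms, window_seconds):
    """Time-weighted fraction of SMs that were active over the window."""
    sm_seconds = sum(duration * sms for duration, sms in kernels)
    return sm_seconds / (total_sms * window_seconds)


# One kernel that runs for the full 10 seconds on a single SM.
one_sm_kernel = [(10.0, 1)]
util = gpu_utilization(one_sm_kernel, 10.0)
eff = sm_efficiency(one_sm_kernel, H100_SM_COUNT, 10.0)
print(f"utilization:   {util:.0%}")
print(f"SM efficiency: {eff:.2%}")
```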

rurban 1 year ago | |

That's why I look mostly at the H100 temperatures. Gives a better utilization metric

saagarjha 1 year ago |

If you have a basic understanding of what your kernels are supposed to do, looking at pipe usage and roofline analysis in Nsight Compute is often helpful, since it will show you how hard you’re saturating those.

pavelstoev 1 year ago |

I recommend the hidet backend in torch.compile; it implements many advanced model-specific optimizations automatically. https://github.com/hidet-org/hidet

roanakb 1 year ago | |

oh this looks great, thank you for bringing this up! I'll have to give it a try, but seems like the FSDP limitation on torch.compile might carry over?

areichenbach 1 year ago |

I’ve recently been trusting gpu watt usage over utilization. Any idea how good that is as a simple proxy (if I’m just looking at nvidia-smi)?

aabhay 1 year ago | |

Power usage is indeed a better representation of GPU utilization during ML training. It has the advantage of combining many important indirect signals that aren’t visible, and avoids many pitfalls of compute usage, which can go to 100% even in all-reduce deadlocks, among other scenarios.

asaiacai 1 year ago | |

power is also a good proxy. For example, we've had distributed runs that we monitored on WandB where one of our workers died in the middle and the rest were basically stalling on the dead worker. On WandB, we were only logging GPU stats on one worker and that one had 100% util but basically no excess power draw compared to having nothing running, which is how I found out something was stalling. Restarting fixed it and got the power draw up to normal, but even with high power draw, we were still having some sections of code with low SM efficiency (~20%) for that training.
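
A rough sketch in Python of checking power next to utilization, for anyone who wants to script the "100% util but idle-level power" check described above. It uses the `utilization.gpu`, `power.draw` and `power.limit` query fields of nvidia-smi; the thresholds and the sample output are made up for illustration, and some GPUs report `[N/A]` for power, which this sketch does not handle.

```python
import subprocess

QUERY_FIELDS = "utilization.gpu,power.draw,power.limit"


def parse_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output, one GPU per line."""
    rows = []
    for line in text.strip().splitlines():
        util, draw, limit = (float(field) for field in line.split(","))
        rows.append({
            "util_pct": util,
            "power_w": draw,
            "limit_w": limit,
            "power_frac": draw / limit,
        })
    return rows


def query_gpus():
    """Call nvidia-smi (must be on PATH) and return the parsed rows."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)


def looks_stalled(row, util_threshold=90.0, power_threshold=0.3):
    """High reported utilization with low power draw (assumed thresholds)."""
    return row["util_pct"] >= util_threshold and row["power_frac"] < power_threshold


# Made-up sample output so the sketch runs without a GPU.
sample = "100, 95.12, 700.00\n98, 612.40, 700.00\n"
rows = parse_smi_csv(sample)
flags = [looks_stalled(row) for row in rows]
print(flags)
```

On a machine with GPUs, `query_gpus()` would replace the sample string. The thresholds would need tuning per GPU model, since idle power varies a lot.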

danielvaughn 1 year ago |

We ran into a similar problem with CPU utilization at my job. Created an alert for when our systems hit 90% CPU util, and ended up with a ton of noise. We realized that for some of our workloads, this was normal and expected.

ScoutOrgo 1 year ago |

As someone that is familiar with using nvidia-smi to track util, what are some commands people use to track the SM efficiency? The end of the article had some references, but no examples of what to use explicitly.

roanakb 1 year ago | |

Unfortunately, SM efficiency is not accessible via nvidia-smi. The best methods to track it would be to:

1. Profile your model with PyTorch Profiler
2. Export metrics with NVIDIA DCGM

AeZ1E 1 year ago |

gpu utilization is not everything, people! MFUs are where it's at. time to recalibrate those expectations and tap into the true potential of your gpus. brace yourselves, the real efficiency is yet to come!