GPU utilization can be a misleading metric (trainy.ai)
Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.
On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.
It is a correct metric when your block device is a single physical spinning disk that can only accept one request at a time (dispatch queue depth = 1). But the moment you deal with SSDs (capable of highly concurrent NAND I/O), SAN block devices striped over many physical disks, or even a single spinning disk that can internally queue and reorder I/Os for more efficient seeking, hitting 100% util at the host block device level doesn't mean that you've hit some IOPS ceiling.
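A toy sketch of the point above, in case it helps. This is not how iostat computes anything, and the request timelines are made up. It only shows that a "busy fraction" counting time with at least one request in flight reads 100% both for a device handling one request at a time and for one handling 32 at once.

```python
def busy_fraction(intervals, window):
    """Fraction of the window during which at least one request is in flight."""
    busy = 0.0
    span_start, span_end = None, None
    for start, end in sorted(intervals):
        if span_end is None or start > span_end:
            # Gap before this request: close out the previous busy span.
            if span_end is not None:
                busy += span_end - span_start
            span_start, span_end = start, end
        else:
            # Touches or overlaps the current busy span: extend it.
            span_end = max(span_end, end)
    if span_end is not None:
        busy += span_end - span_start
    return busy / window


def mean_in_flight(intervals, window):
    """Average number of requests in flight over the window."""
    return sum(end - start for start, end in intervals) / window


# Hypothetical 1000 ms window; every request takes 10 ms.
serial = [(t, t + 10) for t in range(0, 1000, 10)]  # one at a time
concurrent = [(t, t + 10) for t in range(0, 1000, 10) for _ in range(32)]  # 32 at a time

print(busy_fraction(serial, 1000), mean_in_flight(serial, 1000))
print(busy_fraction(concurrent, 1000), mean_in_flight(concurrent, 1000))
```

Both timelines report the same busy fraction, while the second one is doing 32 times the work. The busy fraction alone can't tell them apart.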
So, looks like the GPU "SM efficiency" analysis is somewhat like logging in to the storage array itself and checking how busy each physical disk (or at least each disk controller) inside that storage array is.
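The same idea on the GPU side, as a rough model rather than a real measurement. It assumes "GPU utilization" means the share of time with any kernel running (which is how NVML describes it) and treats "SM efficiency" as the share of SM-time actually occupied. The SM count and the kernel timelines below are hypothetical.

```python
TOTAL_SMS = 100  # hypothetical SM count, chosen for round numbers


def gpu_utilization(kernels, window_ms):
    """Share of the window with any kernel running.

    Assumes the kernels run back to back and never overlap.
    """
    return sum(duration for duration, _ in kernels) / window_ms


def sm_efficiency(kernels, window_ms, total_sms=TOTAL_SMS):
    """Share of available SM-time that a kernel actually occupied."""
    sm_time = sum(duration * sms for duration, sms in kernels)
    return sm_time / (window_ms * total_sms)


# Each entry is (duration_ms, sms_used). Both timelines fill a 1000 ms window.
tiny_kernels = [(10, 1)] * 100          # every kernel keeps a single SM busy
wide_kernels = [(10, TOTAL_SMS)] * 100  # every kernel spreads across all SMs

for name, kernels in [("tiny", tiny_kernels), ("wide", wide_kernels)]:
    print(name, gpu_utilization(kernels, 1000), sm_efficiency(kernels, 1000))
```

In this model both timelines show full time-based utilization, but the tiny-kernel one leaves almost all of the SMs idle.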
100% test coverage doesn't mean your tests are good, but having 50% (or pick your number) means they are bad / not sufficient.
Hence the practice of stuffing many FFTs through GPU grids in parallel and working to max out the hardware usage in order to increase application throughput.
eg:
Some of us like having more than 2 hours of battery life, and not scalding our skin in the process of using our devices.
[1]: https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s...
take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481... - running a fairly simple model that takes only 70ms per image pair, but because I have 300 images it becomes a big time sink.
By using ThreadPoolExecutor, I cut that down to about 16 seconds. i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS? I haven't been successful at even running the MPS daemon on my Linux server yet. Very opaque for sure!
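For readers who haven't seen the pattern, here is a minimal sketch of fanning image pairs out over a thread pool. It is not the code from the gist: `match_pair` is a hypothetical stand-in that just sleeps briefly, the file names are made up, and 10 workers is only an assumed pool size.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def match_pair(pair):
    """Hypothetical stand-in for one model call; it only sleeps briefly."""
    time.sleep(0.001)
    left, right = pair
    return f"{left}|{right}"


def run_serial(pairs):
    # One pair after another, the way a plain loop would do it.
    return [match_pair(pair) for pair in pairs]


def run_threaded(pairs, max_workers=10):
    # pool.map preserves input order, so results line up with pairs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(match_pair, pairs))


# Made-up file names, 300 pairs to mirror the image count in the comment.
pairs = [(f"img_{i}.jpg", "reference.jpg") for i in range(300)]
results = run_threaded(pairs)
print(len(results))
```

Threads only help to the extent that the real work releases the GIL or waits on something (I/O, host-to-device copies, a running kernel), so how much this buys you depends on what the model call actually does.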
So using 10-wide parallel processing took your batch from 21 seconds down to 16 seconds, did I do the arithmetic correctly? That suggests the single-threaded version isn’t too bad. I mean a 25% improvement is great and nothing to sneeze at, but batching might only be trimming the gaps in between image pairs, or queueing up your memory copies while the previous inference is running. You can verify this with nsys profiles.
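Spelling out that arithmetic with the numbers from the thread (300 images, 70 ms per pair, about 16 seconds threaded). The 10-way ideal is just the serial time divided by an assumed 10 workers, not something anyone measured.

```python
images = 300
seconds_per_pair = 0.070
workers = 10  # assumed pool width

serial_seconds = images * seconds_per_pair    # about 21 s
threaded_seconds = 16.0                       # reported figure
ideal_seconds = serial_seconds / workers      # about 2.1 s if scaling were perfect

speedup = serial_seconds / threaded_seconds          # about 1.31x
time_saved = 1 - threaded_seconds / serial_seconds   # about 24%

print(f"serial {serial_seconds:.1f}s, threaded {threaded_seconds:.1f}s, "
      f"ideal {ideal_seconds:.1f}s")
print(f"speedup {speedup:.2f}x, time saved {time_saved:.0%}")
```

The gap between roughly 16 seconds observed and roughly 2 seconds ideal is the part worth profiling.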
> i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS?
No idea, it’s not always easy (and generally speaking gets harder and harder as you approach 100%), but first profile to see what your utilization is before going down any big technical route. Maybe with your ThreadPoolExecutor, you’re already getting max utilization and using MPS can’t possibly help.
Does this situation register 100% utilization? BTW, SM occupancy is also a metric you need to care about if you're concerned with kernel efficiency.
[1]: https://pytorch.org/blog/pytorch-profiler-1.9-released/#gpu-...
1. Profile your model with PyTorch Profiler
2. Export metrics with NVIDIA DCGM
What I mean is: where did you take that from? I program FFTs on GPUs, and I see no reason for the "inherently can't reach 100% utilization by any metric".
Even then they ran at 80% "by design" for expected hard real-time usage. They only went to 11 and dropped results in toast-until-they-smoke tests and with operators that redlined limits (and got feedback to that effect).