Continuous Nvidia CUDA Profiling in Production

Continuous Nvidia CUDA Profiling in Production(polarsignals.com)

98 points by brancz 253 days ago | 10 comments

gnurizen 246 days ago |

Author here, would be happy to field any questions or feedback!

knlb 246 days ago | |

Thanks for the post, this is pretty cool!

I feel like I've seen Cupti have fairly high overhead depending on the cuda version, but I'm not very confident -- did you happen to benchmark different workloads with cupti on/off?

---

If you're taking feature requests: a way to subscribe to -- and get tracebacks for -- cuda context creation would be very useful; I've definitely been surprised by finding processes on the wrong gpu and being easily able to figure out where they came from would be great.

I did a hack by using LD_PRELOAD to subscribe/publish the event, but never really followed through on getting the python stack trace.

gnurizen 246 days ago | | |

CUPTI is kind of a choose your own adventure thing, as you subscribe to more stuff the overhead goes up, this is kind of minimalist profiler that just subscribes to the kernel launches and nothing else. Still to your point depending on kernel launch frequency/granularity it may be higher overhead than some would want in production, we have plans to address that with some probabilistic sampling instead of profiling everything but wanted to get this into folks hands and get some real world feedback first.

embedding-shape 246 days ago | |

This "low-overhead always on GPU profiler" seems really cool and useful, but we're not using Kubernetes for anything, and the instructions for how to use it seems to only include Kubernetes. Is there a way of running this without Kubernetes?

gnurizen 246 days ago | | |

Yeah the quickstart guide covers docker, k8s and "raw" binary options:

https://www.parca.dev/docs/quickstart/

sirhcm 246 days ago | |

Does the profiler read any of the GPU's performance counters? Would be super cool to have an open source tool that can capture the same data nsight compute does.

gnurizen 246 days ago | | |

This profiler is focused on kernel execution but we do scrape high level metrics (https://www.polarsignals.com/blog/posts/2025/06/04/latest-in... which is based on https://github.com/polarsignals/gpu-metrics-agent). What performance counters in particular were you interested in?