so I built an emulator instead
you give it a kernel, it predicts execution time on any gpu without running it. h100, a100, v100, whatever.
how: scraped specs for 50+ nvidia gpus, then built a tile-based simulator that models memory bandwidth, occupancy, and sm scheduling. validated against 12 real gpus, mean error was 1.2%
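to give a rough idea of what a per-kernel estimate can look like, here's a minimal python sketch. it is not the actual simulator, just a roofline-style bound (max of compute time and memory time) with a wave-quantization penalty standing in for occupancy and sm scheduling. `GpuSpec`, `KernelProfile`, and `estimate_time_us` are made-up names for illustration, and the a100 numbers are approximate public datasheet values, so treat them as assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Hypothetical spec record; fields are the minimum this sketch needs."""
    name: str
    num_sms: int
    mem_bw_gb_s: float       # peak DRAM bandwidth, GB/s
    peak_fp32_tflops: float  # peak FP32 throughput, TFLOP/s


@dataclass(frozen=True)
class KernelProfile:
    """Hypothetical static description of one kernel launch."""
    flops: float        # total floating-point operations
    bytes_moved: float  # total DRAM traffic in bytes
    num_blocks: int     # thread blocks in the grid
    blocks_per_sm: int  # resident blocks per SM (occupancy limit)


def estimate_time_us(gpu: GpuSpec, k: KernelProfile) -> float:
    """Roofline bound scaled by how well the grid fills whole waves."""
    compute_s = k.flops / (gpu.peak_fp32_tflops * 1e12)
    memory_s = k.bytes_moved / (gpu.mem_bw_gb_s * 1e9)
    ideal_s = max(compute_s, memory_s)

    # A "wave" is one full set of concurrently resident blocks.
    slots = gpu.num_sms * k.blocks_per_sm
    waves = math.ceil(k.num_blocks / slots)
    wave_efficiency = k.num_blocks / (waves * slots)

    return ideal_s / wave_efficiency * 1e6


# Approximate A100 (40 GB) datasheet figures; assumption, verify before use.
a100 = GpuSpec("a100", num_sms=108, mem_bw_gb_s=1555.0, peak_fp32_tflops=19.5)
kernel = KernelProfile(flops=1e9, bytes_moved=1.555e9,
                       num_blocks=216, blocks_per_sm=2)
print(f"{a100.name}: {estimate_time_us(a100, kernel):.1f} us")
```

the wave term is why adding a single extra block past a full wave can nearly double the estimate: the last, mostly empty wave still costs a full pass. a real tile-based model would go well beyond this, but the inputs are the same kind of thing.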
doesn't work for: dynamic parallelism, multi-gpu, or tiny kernels under 1us. I'll figure those out soon
has anyone solved this differently?