we couldn’t afford that, so we built an emulator that predicts how your kernel performs on GPUs like the H100, A100, RTX 4090, or V100 without running a single line on them
it’s not a guess, it gives real numbers
e.g. 2.4 ms on RTX 4090, 5.1 ms on V100, both within 1% of what the hardware measures
how it works
- NeuSight (99% accuracy): splits the kernel into tiles and simulates each one against real GPU specs (132 SMs on H100, 10 on GTX 1060), accounting for occupancy, memory bandwidth, and wave scheduling
- NCU Baseline (95–98%): if you’ve profiled the kernel once with Nsight Compute, we scale that measurement to other GPUs using per-architecture factors (Hopper is 1.05x Ada, Ampere is 0.92x), all measured manually
- Analytical (85–92%): roofline-model fallback, works even without source code
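
to make the arithmetic behind the tile tier and the roofline fallback concrete, here’s a rough python sketch. it is not the emulator’s code: the function names, the `blocks_per_sm` occupancy input, and the peak FLOP/s and bandwidth numbers in the example are all assumptions picked for illustration, and the real simulation accounts for far more than this

```python
import math


def num_waves(num_tiles: int, num_sms: int, blocks_per_sm: int) -> int:
    """Number of scheduling waves needed to run all tiles.

    Assumption: one tile maps to one thread block, and occupancy lets
    each SM hold `blocks_per_sm` blocks at the same time.
    """
    concurrent_blocks = num_sms * blocks_per_sm
    return math.ceil(num_tiles / concurrent_blocks)


def roofline_time_ms(flops: float, bytes_moved: float,
                     peak_flops: float, peak_bw: float) -> float:
    """Roofline lower bound on kernel time in milliseconds.

    The kernel cannot finish faster than its compute time or its
    memory-traffic time, whichever is larger.
    """
    compute_s = flops / peak_flops      # seconds if purely compute-bound
    memory_s = bytes_moved / peak_bw    # seconds if purely memory-bound
    return max(compute_s, memory_s) * 1e3


# Hypothetical kernel: 1024 tiles, 2 resident blocks per SM.
print(num_waves(1024, num_sms=132, blocks_per_sm=2))  # 4 waves on a 132-SM GPU
print(num_waves(1024, num_sms=10, blocks_per_sm=2))   # 52 waves on a 10-SM GPU

# Hypothetical kernel: 2 GFLOP of work, 400 MB of traffic, on a made-up
# GPU with 10 TFLOP/s peak compute and 1 TB/s peak bandwidth.
# Memory time (0.4 ms) exceeds compute time (0.2 ms), so it is memory-bound.
print(roofline_time_ms(2e9, 4e8, peak_flops=1e13, peak_bw=1e12))
```

the wave count is why the same kernel behaves so differently across SM counts: the last wave is often only partly full, and that tail is the kind of thing a per-tile simulation has to capture and a plain roofline bound does not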
we validated on 47 kernels across 12 GPUs
accuracy stayed above 98%, occupancy predictions were almost perfect
one team saved $18k in GPU cloud time
another found bugs on an A100 they didn’t own
still missing dynamic parallelism, multi-GPU, and fully accurate tensor core modeling, but we’re getting there
happy to go into the math or architecture details if anyone’s curious