So we built an emulator. Give it your kernel code and it tells you how it will run on any GPU (H100, A100, RTX 4090, V100, whatever) without running a single line.
Not a rough estimate. Real numbers, like 2.4 ms on an RTX 4090 and 5.1 ms on a V100. Within 1% of actual hardware.
How it works:
We have three emulators. Each trades accuracy for speed:
1. NeuSight (99% accurate): Breaks your kernel into tiles and simulates each one. Uses real GPU specs from our database (132 SMs for H100, 10 SMs for GTX 1060, etc). Checks occupancy, memory bandwidth, and wave scheduling.
2. NCU Baseline (95-98% accurate): If you already profiled on one GPU with Nsight Compute, we scale those numbers to others. Relative to Ada, Hopper is 1.05x at compute and Ampere is 0.92x. We measured it all.
3. Analytical (85-92% accurate): Fast backup using a roofline model. Works even without source code.
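To make the three approaches concrete, here is a rough Python sketch of the underlying ideas. It is not our production code: `GpuSpec`, the function names, and the toy spec values are made up for illustration, the tile model is reduced to "number of waves times time per wave", and the scaling helper assumes a compute-bound kernel with Ada as the 1.0 baseline.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Hypothetical spec record (illustrative, not the real database schema)."""
    name: str
    num_sms: int
    peak_tflops: float        # peak compute throughput, TFLOP/s
    mem_bandwidth_gbs: float  # peak DRAM bandwidth, GB/s


def roofline_time_ms(flops: float, bytes_moved: float, gpu: GpuSpec) -> float:
    """Analytical idea: runtime is bounded by the slower of compute and memory."""
    compute_s = flops / (gpu.peak_tflops * 1e12)
    memory_s = bytes_moved / (gpu.mem_bandwidth_gbs * 1e9)
    return max(compute_s, memory_s) * 1e3


def num_waves(num_blocks: int, blocks_per_sm: int, gpu: GpuSpec) -> int:
    """Wave scheduling: how many rounds are needed to run every thread block."""
    concurrent_blocks = blocks_per_sm * gpu.num_sms
    return math.ceil(num_blocks / concurrent_blocks)


def tiled_time_ms(num_blocks: int, blocks_per_sm: int,
                  flops_per_block: float, bytes_per_block: float,
                  gpu: GpuSpec) -> float:
    """Tile idea, heavily simplified: cost one full wave, multiply by wave count.

    Assumes a partial last wave takes as long as a full one (tail effect).
    """
    concurrent_blocks = blocks_per_sm * gpu.num_sms
    wave_ms = roofline_time_ms(flops_per_block * concurrent_blocks,
                               bytes_per_block * concurrent_blocks, gpu)
    return num_waves(num_blocks, blocks_per_sm, gpu) * wave_ms


def scale_measured_time_ms(measured_ms: float,
                           src_compute_factor: float,
                           dst_compute_factor: float) -> float:
    """Baseline idea: rescale a profiled time by relative compute factors.

    Only meaningful for compute-bound kernels; factors are relative to Ada = 1.0.
    """
    return measured_ms * src_compute_factor / dst_compute_factor


# Toy GPU with round numbers so the arithmetic is easy to follow.
toy = GpuSpec(name="toy", num_sms=10, peak_tflops=1.0, mem_bandwidth_gbs=100.0)
print(num_waves(num_blocks=45, blocks_per_sm=2, gpu=toy))
print(tiled_time_ms(45, 2, flops_per_block=5e7, bytes_per_block=0.0, gpu=toy))
print(scale_measured_time_ms(2.4, src_compute_factor=1.0, dst_compute_factor=1.05))
```

The real emulators model much more than this (occupancy limits, caches, and so on), but the trade-off is the same: the more of the kernel's structure you simulate, the slower and more accurate the prediction.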
We validated with 47 test kernels on 12 real GPUs. Results: 98-99% accuracy on execution time. Occupancy prediction is basically perfect.
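For anyone wondering what "accuracy" means here, one common way to define it is 100% minus the relative error against the measured time. The sketch below assumes that definition; the helper names and the sample timings are made up for illustration and are not our validation data.

```python
from typing import Iterable, Tuple


def accuracy_pct(predicted_ms: float, measured_ms: float) -> float:
    """Accuracy as 100% minus the relative error (assumed definition)."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    return 100.0 * (1.0 - abs(predicted_ms - measured_ms) / measured_ms)


def mean_accuracy_pct(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average accuracy over (predicted_ms, measured_ms) pairs."""
    scores = [accuracy_pct(p, m) for p, m in pairs]
    if not scores:
        raise ValueError("need at least one (predicted, measured) pair")
    return sum(scores) / len(scores)


# Made-up timings, purely to show the calculation.
samples = [(99.0, 100.0), (51.0, 50.0)]
print(mean_accuracy_pct(samples))
```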
One team saved $18,000 in cloud costs. Another caught bugs on an A100 they don't even own.
What doesn't work yet: dynamic parallelism, multi-GPU, and precise tensor core modeling. None of it is impossible, and we will figure it out soon.
Built into RightNow AI (our CUDA editor). Free to try.
https://www.rightnowai.co/blog/building-99-accurate-gpu-emul...
We're a small team and we spent 3 months on this.
Happy to answer questions!!