1 points by jaberjaber23 270 days ago

Testing CUDA kernels on 15 different GPUs costs $3,000/month. We couldn't afford that.

So we built an emulator. Give it your kernel code and it predicts how the kernel will run on any GPU (H100, A100, RTX 4090, V100, whatever) without running a single line.

Not a rough estimate. Real numbers: 2.4 ms on an RTX 4090, 5.1 ms on a V100. Within 1% of actual hardware.

How it works:

We have three emulators. Each trades accuracy for speed:

1. NeuSight (99% accurate): Breaks your kernel into tiles and simulates each one. Uses real GPU specs from our database (132 SMs for H100, 10 SMs for GTX 1060, etc.). Checks occupancy, memory bandwidth, and wave scheduling.

2. NCU Baseline (95-98% accurate): If you have already profiled on one GPU, we scale the result to others. At compute, Hopper is 1.05x as fast as Ada and Ampere is 0.92x. We measured it all.

3. Analytical (85-92% accurate): Fast fallback using the roofline model. Works even without source code.
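
To make the three tiers concrete, here is a minimal Python sketch of the kind of arithmetic each one involves. It is an illustration under stated assumptions, not the emulator's actual code: the function names, the per-SM resource limits, and the reading of the 1.05x / 0.92x figures as compute throughput relative to Ada are all assumptions. Only the SM counts and the two ratios come from the post.

```python
import math

# SM counts quoted in the post. Everything else below is an illustrative assumption.
SM_COUNT = {"H100": 132, "GTX 1060": 10}

# Assumed reading of the post's ratios: compute throughput relative to Ada = 1.0.
COMPUTE_SPEED_VS_ADA = {"Ada": 1.0, "Hopper": 1.05, "Ampere": 0.92}


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  max_threads_per_sm=2048, regs_per_sm=65536,
                  smem_per_sm=48 * 1024, max_blocks_per_sm=32):
    """Tier 1 idea: resident blocks per SM are capped by the scarcest resource.

    The per-SM limits are placeholder defaults, not the specs of a specific GPU.
    """
    by_threads = max_threads_per_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    return min(by_threads, by_regs, by_smem, max_blocks_per_sm)


def occupancy(threads_per_block, resident_blocks, max_threads_per_sm=2048):
    """Fraction of an SM's thread slots that are occupied."""
    return resident_blocks * threads_per_block / max_threads_per_sm


def waves(grid_blocks, gpu, resident_blocks):
    """Number of scheduling waves needed to run the whole grid."""
    return math.ceil(grid_blocks / (SM_COUNT[gpu] * resident_blocks))


def scale_time_ms(measured_ms, source_arch, target_arch):
    """Tier 2 idea: rescale a profiled time by the ratio of compute factors."""
    return (measured_ms * COMPUTE_SPEED_VS_ADA[source_arch]
            / COMPUTE_SPEED_VS_ADA[target_arch])


def roofline_time_ms(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Tier 3 idea: the kernel takes as long as its slower resource needs."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / peak_bytes_per_s
    return max(compute_s, memory_s) * 1e3
```

With these placeholder limits, a 256-thread block using 32 registers per thread and no shared memory gives 8 resident blocks per SM, so a 4096-block grid needs 4 waves on an H100 (132 SMs) and 52 on a GTX 1060 (10 SMs). A real tile-level simulator would also account for allocation granularity, clocks, and cache behavior, which this sketch ignores.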

We validated with 47 test kernels on 12 real GPUs. Results: 98-99% accuracy on execution time. Occupancy prediction is basically perfect.
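
The post does not spell out how accuracy is computed. One common reading, assumed here, is one minus the relative error between predicted and measured time. A small sketch of that metric (the function name and the sample numbers are hypothetical):

```python
def accuracy(predicted_ms, measured_ms):
    """Accuracy as 1 minus the relative error, floored at 0.

    This definition is an assumption; the post does not state its metric.
    """
    return max(0.0, 1.0 - abs(predicted_ms - measured_ms) / measured_ms)


def mean_accuracy(pairs):
    """Average accuracy over (predicted_ms, measured_ms) pairs."""
    return sum(accuracy(p, m) for p, m in pairs) / len(pairs)
```

Under this definition, a prediction of 4.9 ms against a measured 5.0 ms (made-up numbers, not validation results) scores 98%.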

One team saved $18,000 in cloud costs. Another caught bugs on an A100 they don't even own.

What doesn't work yet: dynamic parallelism, multi-GPU, and precise tensor core modeling. Nothing is impossible, though, and we will figure it out soon.

Built into RightNow AI (our CUDA editor). Free to try.

https://www.rightnowai.co/blog/building-99-accurate-gpu-emul...

We're a small team, and we spent 3 months on this.

Happy to answer questions!