I was wrong!!
We're now seeing multi-agent systems that take your PyTorch code and spit out CUDA or Triton kernels with 2x to 14x speedups over torch.compile(mode='max-autotune-no-cudagraphs'). Not on toy benchmarks. On real models like Llama-3.1-8B, Whisper, and Stable Diffusion.