I made a kernel 2.2x faster. It made my training loop 3x slower | Dark Hacker News