How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog | Dark Hacker News