Learning CUDA by optimizing matrix-vector multiplication for cuBLAS-like perf | Dark Hacker News