In our most recent blog post, we explore two operations found in almost all deep learning models: Reduction and Matrix Multiplication (Matmul). The post shows how difficult it is to choose the right kernel by hand, since the best choice depends on the hardware and on the shapes and strides of the inputs.
For instance, in some scenarios our first reduction algorithm can be 3X as fast as our second one, while in others it can be 19X slower, so selecting the right kernel for the job matters a great deal. For Matmul, the best kernel is often 3X as fast as the worst one, but which kernel is fastest changes constantly across the spectrum of scenarios.
We hope the post conveys how flexible our solution to this problem is and how it supports our mission of creating the fastest framework on all hardware.
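To make the selection problem concrete, here is a minimal sketch of one generic approach: time every candidate kernel on the actual input and cache the winner per input shape. This is an illustration only and not necessarily the solution the post describes. The names `KernelSelector`, `reduce_rows_loop`, and `reduce_rows_builtin` are hypothetical, and the two pure-Python "kernels" merely stand in for real reduction kernels.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Kernel = Callable[[Matrix], List[float]]


def reduce_rows_loop(matrix: Matrix) -> List[float]:
    """Hypothetical kernel A: row-wise sum with an explicit accumulation loop."""
    out = []
    for row in matrix:
        acc = 0.0
        for x in row:
            acc += x
        out.append(acc)
    return out


def reduce_rows_builtin(matrix: Matrix) -> List[float]:
    """Hypothetical kernel B: row-wise sum using the built-in sum()."""
    return [float(sum(row)) for row in matrix]


class KernelSelector:
    """Benchmarks all candidates once per input shape and caches the fastest."""

    def __init__(self, candidates: Dict[str, Kernel], repeats: int = 3):
        self.candidates = candidates
        self.repeats = repeats
        # Maps (rows, cols) to the name of the fastest kernel seen for that shape.
        self.cache: Dict[Tuple[int, int], str] = {}

    def _time(self, kernel: Kernel, matrix: Matrix) -> float:
        # Keep the best of several runs to reduce measurement noise.
        best = float("inf")
        for _ in range(self.repeats):
            start = time.perf_counter()
            kernel(matrix)
            best = min(best, time.perf_counter() - start)
        return best

    def run(self, matrix: Matrix) -> List[float]:
        key = (len(matrix), len(matrix[0]) if matrix else 0)
        if key not in self.cache:
            timings = {
                name: self._time(kernel, matrix)
                for name, kernel in self.candidates.items()
            }
            self.cache[key] = min(timings, key=timings.get)
        return self.candidates[self.cache[key]](matrix)


selector = KernelSelector(
    {"loop": reduce_rows_loop, "builtin": reduce_rows_builtin}
)
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
result = selector.run(data)  # first call benchmarks, later calls reuse the cache
```

Which kernel wins depends on the machine and the input, which is the point: the choice is measured rather than hard-coded. A real implementation would also need to key the cache on strides and hardware, as the post's discussion suggests.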