Optimizing Deep Learning Framework Performance: Autotuning GPU Kernels (burn.dev)

8 points by nathanielsimard 2 years ago | 1 comment

When developing our WebGPU backend for the Burn deep learning framework, we faced numerous challenges in optimizing execution speed. Autotune is our solution to the problem of selecting the most efficient kernel for each GPU operation, taking into account factors such as device specifications and input shapes. Here, a kernel is a relatively simple algorithm that performs a single task on the GPU, normally in the hot loop of an AI model.
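To make the idea concrete, here is a minimal sketch of what autotuning boils down to: time every candidate kernel the first time a given input shape shows up, then reuse the winner from a cache. This is not Burn's actual implementation. `TuneKey`, `Autotuner`, and the two row-sum functions are hypothetical names invented for this illustration, the "kernels" are plain CPU functions standing in for GPU kernels, and rounding shapes up to a power of two is just one assumed way to let similar shapes share a cache entry.

```rust
use std::collections::HashMap;
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical cache key: the input shape, rounded up to powers of two
/// so that similar shapes share one benchmark result.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TuneKey {
    rows: usize,
    cols: usize,
}

impl TuneKey {
    fn new(rows: usize, cols: usize) -> Self {
        Self {
            rows: rows.next_power_of_two(),
            cols: cols.next_power_of_two(),
        }
    }
}

/// A candidate kernel: sums each row of a row-major `rows x cols` matrix.
type Kernel = fn(&[f32], usize, usize) -> Vec<f32>;

/// Stand-in kernel 1: walk the matrix one row at a time (assumes cols > 0).
fn reduce_rows_by_row(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    data.chunks(cols).take(rows).map(|row| row.iter().sum()).collect()
}

/// Stand-in kernel 2: walk the matrix one column at a time.
fn reduce_rows_by_col(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0; rows];
    for c in 0..cols {
        for r in 0..rows {
            out[r] += data[r * cols + c];
        }
    }
    out
}

/// Hypothetical autotuner: maps each key to the index of the fastest kernel.
struct Autotuner {
    kernels: Vec<(&'static str, Kernel)>,
    cache: HashMap<TuneKey, usize>,
}

impl Autotuner {
    fn new(kernels: Vec<(&'static str, Kernel)>) -> Self {
        Self { kernels, cache: HashMap::new() }
    }

    /// Runs the operation, benchmarking the candidates only on a cache miss.
    fn run(&mut self, data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
        let key = TuneKey::new(rows, cols);
        let index = match self.cache.get(&key).copied() {
            Some(index) => index,
            None => {
                let index = self.benchmark(data, rows, cols);
                self.cache.insert(key, index);
                index
            }
        };
        (self.kernels[index].1)(data, rows, cols)
    }

    /// Times each candidate on the real input and returns the fastest index.
    fn benchmark(&self, data: &[f32], rows: usize, cols: usize) -> usize {
        let mut best = (0, Duration::MAX);
        for (index, (_name, kernel)) in self.kernels.iter().enumerate() {
            let start = Instant::now();
            for _ in 0..3 {
                black_box(kernel(black_box(data), rows, cols));
            }
            let elapsed = start.elapsed();
            if elapsed < best.1 {
                best = (index, elapsed);
            }
        }
        best.0
    }
}
```

In this sketch you would build an `Autotuner` with both candidates and call `run` for every reduction: the first call for a new shape pays the benchmarking cost, and later calls with the same key go straight to the cached winner. A real GPU version would also need the device in the key and would have to wait for the GPU to finish before reading the timer.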

In our most recent blog post, we explore two operations found in almost all deep learning models: Reduction and Matrix Multiplication (Matmul). The post highlights how difficult it is to choose the right kernel by hand, given the uncertainty introduced by the hardware and by the variety of input shapes and strides.

For instance, in some scenarios our first reduction algorithm is 3X as fast as our second one, while in others it is 19X slower, which shows how much selecting the right kernel matters. For Matmul, the best kernel is often 3X as fast as the worst one, but which kernel is fastest changes constantly across the spectrum of scenarios.

We hope the post highlights our flexible solution to this problem and how it can support our mission of creating the fastest framework on all hardware.