VkFFT – Vulkan Fast Fourier Transform library (github.com)

123 points by DTolm 5 years ago | 50 comments

DTolm 5 years ago |

Hello! Since the last post VkFFT has experienced a number of huge improvements and optimizations. Namely:

-It now supports sequences up to 2^32 in all dimensions (algorithmically; in practice it is limited by allocatable memory size, and a switch to a 64-bit addressing scheme is planned for a future release)

-configurations optimized for a wider range of systems and vendors

-benchmarked Radeon VII and RTX 3080, shows that FFT is extremely bandwidth limited on modern GPUs

-VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

-added double and half precision support and precision tests against FFTW on CPU

-improved native zero-padding - up to 3x performance boost

-switched license to MPL 2.0

Thanks for your attention! I am happy to answer any questions.

devit 5 years ago | |

> VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

What is your explanation for this?

Is the VkFFT algorithm better? Is SPIR-V fundamentally more expressive than PTX? Are nVidia drivers better at compiling SPIR-V than PTX?

Have you compared the generated GPU assembly from both?

DTolm 5 years ago | | |

FFT is an extremely bandwidth limited problem, so if both algorithms spend most of their time on one upload, the overall time will be similar. More in-depth analysis of how VkFFT and cuFFT scale with memory clocks and bandwidth can be found here: https://www.reddit.com/r/nvidia/comments/jxlbjs/rtx_3090_ove...

I don't know exactly what cuFFT does differently, but I am fairly certain they use very similar memory layout and algorithms behind their code (judging by execution times only).

The main takeaway should be that Vulkan allows low-level memory control with similar performance, while being cross platform and open source. I don't think that SPIR-V is more expressive - I bet Nvidia wouldn't allow this. But that doesn't stop it from still being good.

stagger87 5 years ago | |

Do I understand your benchmark plots correctly?

Using the single precision at 1k FFT size as my example.

~165,000 kB/ms performance

Converts to 165,000 MB/s performance

Divide by 8 to convert to complex samples, so 20,625 M complex samples per second.

Divide by 1k to get FFT count of ~20.14M FFT/IFFTs per second?

These benchmarks also include transfer time to and from the GPU?

DTolm 5 years ago | | |

1k FFT size in single precision is 1024 x 2 x sizeof(float) = 8KB. If we set aside that it won't utilize the full GPU (not even one compute unit) and assume that it scales similarly to big systems, then:

1) 165GB/s is the algorithmic bandwidth of the benchmark, which includes consecutive FFT+iFFT. Each of them takes one upload and one download from the chip, for a total of 4 memory transfers. The real bandwidth for this value will be 4*165=660GB/s.

2) One FFT is 2 transfers: upload and download. Total 16KB.

3) 660GB/s / 16KB = 43M iterations per second. Similar to your number, but your number didn't account for the benchmark having 4 transfers instead of 2.
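
For anyone who wants to check the arithmetic, here is a small Python sketch of both estimates. The function names are made up for illustration. One assumption to flag: the 43M figure is reproduced when GB and KB are read as binary units (GiB, KiB); with a decimal GB divided by 16 KiB the result is closer to 40M.

```python
# Back-of-the-envelope check of the throughput estimates in this thread.
# Function names are illustrative only; nothing here is part of VkFFT.

KIB = 1024
GIB = 1024 ** 3


def fft_buffer_bytes(points, bytes_per_real=4):
    """Size of one complex buffer: real + imaginary part per point."""
    return points * 2 * bytes_per_real


def ffts_per_second(algorithmic_bw, points, transfers_in_benchmark=4,
                    transfers_per_fft=2):
    """Estimate FFTs per second from the benchmark's algorithmic bandwidth.

    The benchmark runs FFT + iFFT, each with one upload and one download,
    so the real memory traffic is transfers_in_benchmark times the
    algorithmic bandwidth. A single FFT moves its buffer twice.
    """
    real_bw = transfers_in_benchmark * algorithmic_bw
    bytes_per_fft = transfers_per_fft * fft_buffer_bytes(points)
    return real_bw / bytes_per_fft


def naive_ffts_per_second(bandwidth, points):
    """The simpler estimate: bandwidth / bytes per sample / points."""
    complex_samples_per_second = bandwidth / 8
    return complex_samples_per_second / points


if __name__ == "__main__":
    # Binary units assumed for the first line (see note above).
    print(ffts_per_second(165 * GIB, 1024))        # about 43.25 million
    print(naive_ffts_per_second(165_000e6, 1024))  # about 20.14 million
```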

These benchmarks don't include transfers to and from GPU, as those are done with PCI-E bandwidth (30GB/s) which is really slow compared to VRAM-chip bandwidth (>500GB/s). This is why it is important to have enough VRAM and avoid CPU communications as much as possible.

jiehong 5 years ago | |

> -benchmarked Radeon VII and RTX 3080, shows that FFT is extremely bandwidth limited on modern GPUs

Great to see that!

I expect huge improvements in that area with AMD's new RX series with SAM activated [0].

[0]: https://www.amd.com/en/technologies/smart-access-memory

DTolm 5 years ago | | |

Actually, it is still best to aim at zero transfers between GPU and CPU during the execution. The GPU is limited by VRAM-chip bandwidth which is much bigger than the PCI-E bandwidth. And it should not be affected by SAM.

enriquto 5 years ago | |

Any plans for arbitrary-size transforms? (i.e., not restricted to vectors whose dimension is a power of two)

DTolm 5 years ago | | |

Yes, this is indeed something I would like to add in the future. While adding support for different radix kernels for small prime factors is not that hard, writing an efficient scheduler is a much more challenging task (each sequence, even for a power of 2, is now split differently for different architectures to optimize performance).

Bluestein's algorithm, typically used for arbitrary prime sizes, requires both zero-padding and convolution support, which are already efficiently implemented, so it is also not completely out of reach.
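
To illustrate why zero-padding and convolution are the two ingredients, here is a minimal pure-Python sketch of Bluestein's algorithm. It is not VkFFT code, all function names are made up for this example, and it favors readability over speed. The DFT of length n is rewritten as a convolution with a chirp, the inputs are zero-padded to a power of two m >= 2n-1, and the convolution is done with power-of-two FFTs.

```python
import cmath


def fft_pow2(x, inverse=False):
    """Recursive radix-2 FFT. len(x) must be a power of two.
    The inverse transform is left unscaled (caller divides by len(x))."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_pow2(x[0::2], inverse)
    odd = fft_pow2(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def bluestein_dft(x):
    """DFT of arbitrary length via chirp multiplication + convolution."""
    n = len(x)
    # Chirp w[k] = exp(-i*pi*k^2/n); k^2 is reduced mod 2n to limit
    # the size of the angle passed to exp.
    w = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n)
         for k in range(n)]
    # Smallest power of two that avoids wrap-around in the convolution.
    m = 1
    while m < 2 * n - 1:
        m *= 2
    # Zero-padded, chirp-multiplied input.
    a = [x[k] * w[k] for k in range(n)] + [0j] * (m - n)
    # Conjugate chirp, stored for both positive and negative offsets.
    b = [0j] * m
    b[0] = w[0].conjugate()
    for k in range(1, n):
        b[k] = b[m - k] = w[k].conjugate()
    # Circular convolution through power-of-two FFTs.
    fa = fft_pow2(a)
    fb = fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)
    return [w[k] * conv[k] / m for k in range(n)]


def naive_dft(x):
    """O(n^2) reference DFT used only to check the result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]  # prime length
    diff = max(abs(p - q) for p, q in zip(bluestein_dft(data),
                                          naive_dft(data)))
    print("max difference vs naive DFT:", diff)
```

The chirp identity used here is jk = (j^2 + k^2 - (k-j)^2) / 2, which turns the DFT sum into a convolution over k-j.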

dcgudeman 5 years ago | |

why did you choose MPL 2.0?

DTolm 5 years ago | | |

It is a great open-source license for library projects. For example, Eigen uses it: https://eigen.tuxfamily.org/index.php?title=News:Relicensing...!

p0sixlang 5 years ago |

Can someone ELI5 what this library is useful for?

Lichtso 5 years ago | |

So far there have been two ways to do heavy compute tasks on GPUs: CUDA (Nvidia only) and OpenCL (all vendors). Nvidia invested a lot in software and toolchains to make CUDA the go-to option for many projects (especially in the machine learning community). Meanwhile OpenCL is falling apart and sees less and less support and updates.

However, the Vulkan API, which is also supported by most vendors (except Apple, where you have to use a compatibility layer called MoltenVK), is gaining traction in the compute sector. If you trust the benchmarks, then this library shows that you can get similar performance out of Vulkan compute to what you would expect from CUDA. It is just that this library provides only a very small fraction of the features of the CUDA ecosystem, so the Vulkan compute ecosystem still has a lot of catching up to do.

Edit: In case it is not obvious from the title, the library is used to calculate the https://en.wikipedia.org/wiki/Fast_Fourier_transform

matthiasv 5 years ago | | |

> Meanwhile OpenCL is falling apart and sees less and less support and updates.

I think this view is too pessimistic. In fact, support is either getting better (Intel oneAPI, Microsoft CLonD3D12, AMD ROCm, Mesa NIR-clover, …) or is unchanged but still maintained (NVIDIA). Moreover, Khronos noticed that OpenCL 2.x was a dead end and decided to start over from a point that all vendors could agree on.

enriquto 5 years ago | | |

I'm fascinated, and at the same time slightly troubled, by your usage of the word "compute".

mrweasel 5 years ago |

I don’t write C++, but isn’t the code extremely messy? Also it appears to be C++ and not C like the “Read me” says.

DTolm 5 years ago | |

The library only includes vkFFT.h file (in C) and a set of shaders (C-like language compiled to SPIR-V). Vulkan_FFT.cpp is only an example that shows how VkFFT can be used. It also contains the benchmark in it, but it is not a part of the library.

mrweasel 5 years ago | | |

Aah, okay, I was a little confused about the Vulkan_FFT.cpp. It seemed a little weird to have everything in the .h file, and not just the functions you want to expose in the library.

Again, I know no C++ and a very limited amount of C, so don’t put too much value in my comment. You seem to be very fond of switch statements; consider not stuffing too much code into each case. It makes the flow hard to follow. Break the case code into functions and call those.

You have a switch with 40 cases, to load the SPIR-V. I feel like there’s a better way to deal with that. Maybe just a strict naming convention, so having the ID is enough to locate the file.

Impressive work in any case.

29athrowaway 5 years ago |

Rendering a triangle in Vulkan will make you cry.

Narann 5 years ago | |

With OpenGL you draw a triangle, and eventually write a pipeline.

With Vulkan you write a pipeline, and eventually draw a triangle.

exDM69 5 years ago | | |

Best explanation of the two APIs I've ever heard.

OpenGL "hello triangle" is short only if you cut corners. If you do it the way you'd do in a production app, you're not that far off from the lines of code it takes to do it in Vulkan. It's still less, but on the same order of magnitude.

0-_-0 5 years ago | | |

OpenGL gives you a Toyota, Vulkan gives you the parts to a Ferrari

NL807 5 years ago | |

That's because Vulkan is designed to render millions of triangles, not one.

You wouldn't use Vulkan to render a single triangle for the same reason you wouldn't use a helicopter to get a bottle of milk from your local shop.

Reelin 5 years ago | |

What's your point? So will writing your own custom UI toolkit, or manually doing your own font rendering, or implementing a custom equivalent to TensorFlow, or writing your own implementation of the C standard library, or ...

If you don't need low level control, you should be using middleware or a full blown engine (Godot, Unity, Unreal, etc).

meekrohprocess 5 years ago | |

Eh, it's an investment.

Debugging your first segfault will also make you cry, but it's good for you. It builds character, and prepares you for the more insidious segfaults that are lurking out in the tall grasses.