VkFFT – Vulkan Fast Fourier Transform Library (github.com)

220 points by ah- 5 years ago | 127 comments

zdw 5 years ago |

If I were a hiring person at AMD or Intel, I'd shortlist this guy for a job, as they need help competing against the head start CUDA has in the GPU-based compute space.

slavik81 5 years ago | |

The AMD Math Libraries team is hiring [1], and one of the libraries they develop is rocFFT [2]. Disclosure: I work at AMD, though not on rocFFT.

[1]: https://jobs.amd.com/job/Calgary-GPU-Libraries-Software-Deve... [2]: https://github.com/ROCmSoftwarePlatform/rocFFT

tinus_hn 5 years ago | | |

The author lists his email address on the site and indicates he’s looking for a position.

jjeaff 5 years ago | |

Ya, but the important question is can they invert a binary tree on a whiteboard?

umvi 5 years ago | | |

Just get a clear whiteboard, draw the binary tree, then flip the whiteboard 180 around the vertical axis so you are now looking through the back of the whiteboard.

qppo 5 years ago | | |

"Write an FFT" is the DSP engineer interview question that's analogous to tree traversal algorithm whiteboarding. The hard part is remembering how a butterfly computation works, and you'll almost never need to implement it.
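For readers who have not seen the butterfly mentioned above, here is a minimal sketch of a recursive radix-2 decimation-in-time FFT. It is a textbook formulation, not code from VkFFT, and the function names `fft` and `dft` are chosen only for this illustration. The butterfly is the pair of lines that combine one even-indexed and one odd-indexed partial result with a twiddle factor.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n)
        t = twiddle * odd[k]
        # The butterfly: one add and one subtract per output pair.
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Naive O(n^2) DFT, used only as a reference to check the FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.0]
max_error = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))
print("largest difference between FFT and naive DFT:", max_error)
```

Production libraries avoid the recursion and the per-call twiddle computation, but the arithmetic per butterfly is the same.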

mangamadaiyan 5 years ago | | |

... or are leetcode-proficient, these days.

TomVDB 5 years ago | |

One should hope that the non-CUDA GPU compute library ecosystem has already advanced beyond being able to calculate FFTs!

singhrac 5 years ago | | |

Sure, but if Nvidia/OpenAI/Google/Facebook have shown anything, it's that there's always more kernels to invent and train bigger nets with.

andi999 5 years ago | | |

Last time I checked there was no good FFT for AMD.

slavik81 5 years ago |

What are the common applications for these sorts of GPU-accelerated FFTs? We mostly just solved problems analytically in undergrad, and the little bit of naive coding we did seemed pretty fast. I feel like this must be used for problems I would have learned about in grad school, if I had continued in electrical engineering.

DTolm 5 years ago | |

I have used VkFFT to create a GPU version of the magnetic simulation software Spirit (https://github.com/DTolm/spirit). Besides the FFT, it also has a lot of general linear algebra routines, like efficient GPU reduce/scan, and system solvers, like CG, LBFGS, VP, Runge-Kutta and Depondt. This version of Spirit is faster than CUDA-based software that has been out and updated for ~6 years, because I have full control over all the code I use. You might want to check the discussions on reddit for this project: https://www.reddit.com/r/MachineLearning/comments/ilcw2f/p_v... and https://www.reddit.com/r/programming/comments/il9sar/vulkan_...

Reelin 5 years ago | |

Likely any HPC application that has an FFT somewhere in its pipeline and is otherwise amenable to being run on a GPU.

Fluid flow, heat transfer, and other such physical phenomena that you might want to simulate.

Phase correlation in image processing is another example. (https://en.wikipedia.org/wiki/Phase_correlation)
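As a rough illustration of phase correlation, the sketch below estimates a circular shift between two 1D signals. Image registration uses the 2D version of the same idea. The helper names (`dft`, `phase_correlation_shift`) and the test signal are made up for this example, and a naive DFT stands in for a real FFT library to keep it dependency-free.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT; the inverse includes the 1/n normalisation."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def phase_correlation_shift(a, b):
    """Estimate s such that b[i] == a[(i - s) % n] for equal-length signals."""
    fa, fb = dft(a), dft(b)
    cross = []
    for u, v in zip(fa, fb):
        p = v * u.conjugate()
        mag = abs(p)
        # Keep only the phase; skip bins with (near) zero energy.
        cross.append(p / mag if mag > 1e-12 else 0j)
    corr = dft(cross, inverse=True)
    # The normalised cross-power spectrum inverts to a peak at the shift.
    return max(range(len(corr)), key=lambda i: corr[i].real)


a = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 1.0, 0.0]
b = [a[(i - 3) % len(a)] for i in range(len(a))]  # a delayed by 3 samples
print("estimated shift:", phase_correlation_shift(a, b))
```

The cost is dominated by two forward transforms and one inverse transform, which is why a fast GPU FFT helps for large images.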

MD simulations rely on FFT but I'm not sure how much is typically (or can be) done on the GPU. For example, NAMD employs cuFFT on the GPU in some cases. (https://aip.scitation.org/doi/10.1063/5.0014475)

amelius 5 years ago | | |

Machine learning uses CNNs, which are directly based on FFTs.

hadeson 5 years ago | |

It could be used to accelerate Convolutional Neural Nets training [0]

[0] https://arxiv.org/abs/1312.5851
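The link between convolutions and FFTs is the convolution theorem: a convolution in the signal domain becomes an element-wise product in the frequency domain. The sketch below checks this on a tiny 1D example. It is not the method of the linked paper, only the underlying identity, and `dft`, `fft_convolve` and `direct_convolve` are names invented here. A naive DFT is used so the example needs no external packages.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT; the inverse includes the 1/n normalisation."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def fft_convolve(a, b):
    """Linear convolution via zero-padding, transform, multiply, inverse."""
    n = len(a) + len(b) - 1
    fa = dft(list(a) + [0.0] * (n - len(a)))
    fb = dft(list(b) + [0.0] * (n - len(b)))
    product = [u * v for u, v in zip(fa, fb)]
    return [v.real for v in dft(product, inverse=True)]


def direct_convolve(a, b):
    """Reference O(n*m) convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i + j] += u * v
    return out


signal = [1.0, 2.0, 3.0]
kernel = [0.0, 1.0, 0.5]
print(direct_convolve(signal, kernel))
print([round(v, 9) for v in fft_convolve(signal, kernel)])
```

With a real FFT the transform route costs O(n log n) instead of O(n*m), which pays off mainly for large kernels.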

enriquto 5 years ago | |

If you could filter and focus raw radar data in realtime it would be really cool!

gorkish 5 years ago | |

Software defined radio / RF DSP is another area where FFT and IFFT performance and accuracy are critical.

looping__lui 5 years ago | |

Imaging. E.g., large convolutions.

HelloNurse 5 years ago | |

The same as any FFT, but accelerated; with the tradeoff that the cost of moving data from and to the GPU needs to be amortized. It's also a good proof of concept for other kinds of GPU computations.

p1mrx 5 years ago |

How does using Vulkan for computation fit into the OpenCL/CUDA landscape? Is CUDA's proprietary nature doing meaningful harm, and does Vulkan help?

Jhsto 5 years ago | |

You can run OpenCL kernels on Vulkan, at least in theory: SPIR-V supports the OpenCL memory model. CUDA might be machine-translatable if you can compile to an LLVM target (clang seems to have experimental support developed outside of Nvidia), which you then retarget to SPIR-V using a cross-compiler. The LLVM to SPIR-V cross-compiler, however, is limited in its translation for the time being.

In general, Vulkan is the thing that commands the GPU, but it is not opinionated about the language used to write the kernel, as long as it compiles to SPIR-V. SPIR-V in itself is like a parallel LLVM IR. If you look into the project source, the shaders are written in GLSL and have been pre-compiled into SPIR-V using a cross-compiler. The C file you find in the project root serves as the loader program for the SPIR-V files.

The Futhark project did some initial benchmarks on translating OpenCL to Vulkan. The results were mainly slowdowns. You can read about it here: https://futhark-lang.org/student-projects/steffen-msc-projec...

jgavris 5 years ago | | |

We run OpenCL on top of Vulkan in a production application on Android, thanks to a project from Google / Codeplay and other contributors https://github.com/google/clspv. SPIR-V can't represent all of OpenCL, but maybe enough for most people's use cases.

pjmlp 5 years ago | |

Badly. OctaneRender moved away from Vulkan to CUDA because they found that Vulkan compute wasn't at the level they wanted.

https://home.otoy.com/octane2020-rndr-released/

"OTOY | GTC 2020: Real-Time Raytracing, Holographic Displays, Light Field Media and RNDR Network"

https://www.youtube.com/watch?v=Qfy6CTaSHcc

littlestymaar 5 years ago | | |

I couldn't find any details about the migration in either link, but it looks like they make massive use of Nvidia-specific features, so even with exactly the same performance it would make total sense to use CUDA just because the tooling is more mature.

querez 5 years ago |

"VkFFT aims to provide community with an open-source alternative to Nvidia's cuFFT library, while achieving better performance."

There are no error bars on the graphs, so it's very hard to judge if the minor differences are significant. I work in research, so probably I'm peculiar about this point, but: I'd expect better from anyone who's taken basic statistics. But from a quick look, it seems like the performance is pretty much just "on par".

It would also be nice to know how performance is on other hardware. I'm assuming it's tuned to Nvidia GPUs (or maybe even the specific GPU mentioned). But how does this perform on Intel or AMD hardware? How does it compare to `rocFFT` or Intel's own implementation?

DTolm 5 years ago | |

The FFT and iFFT are performed consecutively up to 1000 times, and then each run is repeated 5 more times. The total result is averaged for both VkFFT and cuFFT and stays roughly the same between launches. The minor performance gains (5-20%) are noticeable. If you have a better testing technique, I am open to suggestions.

I have tested VkFFT on an Intel UHD620 GPU and the performance scaled at the same rate as most benchmarks do. There are a couple of parameters that can be modified for different GPUs (like the amount of memory coalesced, which is 32 bits on Nvidia GPUs after Pascal and 64 bits for Intel). I have no access to an AMD machine, otherwise I would have refined the launch configuration parameters for it too. I have not tested libraries other than cuFFT yet.

querez 5 years ago | | |

Thanks for the further clarification! If you ran this several times, you could calculate standard deviations or confidence intervals. It would be nice if you could report one such measure, so it's clearer that the differences are not just random fluctuations. E.g. you could include them as error bars in your plots. You could also run a statistical test (in this case, a t-test is very easy to do) and report the p-value. Those are the things I'd expect my students to do if they had to do something like this for a report or a project, because it's the only way for people to judge whether differences show a clear signal or are just random fluctuations due to measurement noise.
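A minimal sketch of the kind of report suggested here, using only the Python standard library. The timing numbers are synthetic placeholders invented for this example, not measurements of VkFFT or cuFFT, and `welch_t` is a hypothetical helper name. It computes the mean and standard deviation per library plus Welch's t statistic and its degrees of freedom. The p-value then comes from a t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variance
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Synthetic placeholder timings in milliseconds (NOT real benchmark data).
library_a_ms = [10.2, 10.4, 10.1, 10.3, 10.2]
library_b_ms = [10.9, 11.1, 10.8, 11.0, 11.2]

for name, runs in (("A", library_a_ms), ("B", library_b_ms)):
    print(name, "mean:", statistics.mean(runs), "stdev:", statistics.stdev(runs))

t_stat, dof = welch_t(library_a_ms, library_b_ms)
print("Welch t:", t_stat, "degrees of freedom:", dof)
```

The per-library standard deviations are what would be drawn as error bars on the benchmark plots.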

Also: I should've said this in my first post already, which in hindsight might sound too negative: I think this is a cool project and you did a great job! I just thought this might improve the presentation of your results a bit.

Jhsto 5 years ago |

I think this guy will have no problem getting hired. Being conscientious enough to push code online works so much better than CV preparation courses. You know you're on the right path when you are asked to play up your CV abstract rather than to downplay it.

Personally, I would have a hard time hiring anyone without a Github account and less so working in a place where nobody has one.

ncmncm 5 years ago | |

To me a Gitlab account, instead, would signify superior judgment.

adamnemecek 5 years ago | | |

Not if you want your work to be discovered.

solipsism 5 years ago | | |

Getting downvoted, but this is no more arbitrary, myopic, and unfair to the applicant than the parent.

oxxoxoxooo 5 years ago |

What is "Native zero padding to model open systems"? And how come it is "up to 2x faster than simply padding input array with zeros"?

gct 5 years ago | |

So you can pad your input array with zeros, but the algorithm doesn't know that it's padded, and will just compute with those zeros like any other value. If you could tell it that they were zeros it could take advantage of x*0=0 and x+0=x to significantly reduce computation. That's what I think that is.

DTolm 5 years ago | | |

That is almost the correct answer. To go even further, there are sequences that are completely full of zeros in the padded case of multidimensional FFTs and we can omit their FFTs entirely.
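The point about all-zero sequences can be sketched with a row-column 2D transform. This is an illustration of the idea only, not VkFFT's implementation. The function `fft2_padded` and its `skip_zero_rows` flag are invented for this example, and a naive DFT stands in for the real 1D transform. When a small block is padded into a larger grid, every row that lies entirely in the padding transforms to zeros, so the first pass can skip it.

```python
import cmath


def dft(x):
    """Naive 1D DFT, standing in for a fast 1D FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft2_padded(data, size, skip_zero_rows):
    """2D DFT of `data` zero-padded to size x size.

    Returns (spectrum, number of 1D transforms actually computed).
    """
    rows_in = len(data)
    padded = [list(r) + [0.0] * (size - len(r)) for r in data]
    padded += [[0.0] * size for _ in range(size - rows_in)]

    count = 0
    row_pass = []
    for i, row in enumerate(padded):
        if skip_zero_rows and i >= rows_in:
            # The transform of an all-zero row is all zeros: no work needed.
            row_pass.append([0j] * size)
        else:
            row_pass.append(dft(row))
            count += 1

    # Column pass: after the row pass every column is generally non-zero.
    cols = [dft([row_pass[i][j] for i in range(size)]) for j in range(size)]
    count += size
    spectrum = [[cols[j][i] for j in range(size)] for i in range(size)]
    return spectrum, count


block = [[1.0, 2.0], [3.0, 4.0]]
full, n_full = fft2_padded(block, 4, skip_zero_rows=False)
fast, n_fast = fft2_padded(block, 4, skip_zero_rows=True)
print("1D transforms without skipping:", n_full)
print("1D transforms with skipping:", n_fast)
```

In three dimensions with padding along every axis, a larger fraction of the 1D sequences in the early passes is entirely zero, so more of the work can be dropped.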

Lichtso 5 years ago |

Very cool!

Seems a bit more feature complete than my take on the problem: https://github.com/Lichtso/VulkanFFT

Still, to beat CUDA with Vulkan a lot is still missing: Scan, Reduce, Sort, Aggregate, Partition, Select, Binning, etc.

DTolm 5 years ago | |

I have some of these routines, like Reduce and Scan, in my other project https://github.com/DTolm/spirit. It also has implementations of linear algebra solvers like CG, VP, Runge-Kutta and some others. These routines have to be inlined in users' shaders in some way to perform well. Releasing them as a standalone library will require some thinking, because some routines have multiple shader dispatches.

meisel 5 years ago |

Warning: LGPL license

ncmncm 5 years ago | |

... which, being a header-only library, happens to place no restrictions or requirements of any kind on the calling program.

detaro 5 years ago | | |

I don't think it's that easy? LGPLv3 has an explicit carve-out for headers which makes that scenario easy, but this is 2.1...

phkahler 5 years ago |

Isn't LGPL 2.1 an odd license for something like this? Does it produce a library?

microcolonel 5 years ago | |

> Does it produce a library?

It is a library.

bialpio 5 years ago | | |

A _header-only_ library. Not sure how LGPL works for those - not much to avoid linking against... Throw it in your own .dll / .so and use that in your closed-source projects? Standard disclosure: IANAL.

fluffything 5 years ago |

> Support for big FFT dimension sizes. Current limits: C2C - (2^24, 2^15, 2^15),

What about bigger than big? > 2^29 or so? Are these sizes for double precision?

DTolm 5 years ago | |

Currently, I hit the limit on the maximum number of workgroups for one submitted dispatch (this is why the y and z axis limits are lower than the x one for now). It can be removed by adding multiple dispatches to the code, which I will do in one of the next updates. To go past 2^24 I need to polish the four-stage FFT algorithm to allow for >2 data transfers, which I have implemented but not yet tested. There will also be a single-precision limit in this range, as the twiddle factor values will be close to 1e-8, which is close to machine error.
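One way to read the single-precision remark is to look at the smallest twiddle factor exp(-2*pi*i/N): its imaginary part shrinks like 2*pi/N, and its real part gets so close to 1 that 32-bit floats can no longer tell the difference. The sketch below is my own illustration of that reading, not code from VkFFT. It emulates float32 rounding with `struct`, and the helper names are invented for the example.

```python
import math
import struct

FLOAT32_EPSILON = 2.0 ** -23  # spacing between 1.0 and the next float32


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def smallest_twiddle(n):
    """Real and imaginary magnitude of exp(-2*pi*i/n), in float64."""
    angle = 2.0 * math.pi / n
    return math.cos(angle), math.sin(angle)


for log2_n in (10, 24, 29):
    real, imag = smallest_twiddle(2 ** log2_n)
    print(
        f"N = 2^{log2_n}: imag part {imag:.3e}, "
        f"real part rounds to 1.0 in float32: {to_float32(real) == 1.0}, "
        f"imag part below float32 epsilon: {imag < FLOAT32_EPSILON}"
    )
```

For N around 2^29 the imaginary part is on the order of 1e-8, which matches the figure quoted in the comment and sits below the float32 epsilon of about 1.19e-7.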

bobowzki 5 years ago |

I wonder if this works on the raspberry pi with the new Vulkan drivers.

Mizza 5 years ago |

I'm very eager to see GPU acceleration make its way into audio production, which is all still heavily CPU bound.

A Free GPUFFT implementation will certainly help! Great work.

mmis1000 5 years ago | |

https://en.wikipedia.org/wiki/AMD_TrueAudio I believe AMD did that, but little to no software actually makes use of it.

adamnemecek 5 years ago | |

It's not gonna happen, audio is much less throughput intensive but a lot more latency sensitive.

reitzensteinm 5 years ago | | |

You can read back from a GPU in 10 µs, which is just a single sample period at 96 kHz.

If your entire stack lived in the GPU, and you're just reading out the result, this is trivial.

If you're constantly copying buffers back and forth because some effects are implemented in the CPU and some in the GPU, not so much!

It's probably the case that a full stack GPU implementation would blow what we have out of the water, but you'd lose your entire ecosystem in the process, so it's probably never going to happen.

codetrotter 5 years ago | | |

I would think a GPU might help if you have a lot of audio channels and a lot of effects on each channel.

But even if that is not the case, machine learning is making its way into music production tools more and more. No doubt a beefy GPU will be useful to a lot of music production professionals in the future at least, as the tools they are using begin to leverage ML more and more.

viraptor 5 years ago | | |

Why do you think it's not going to happen? And for which use case?

The time budget to refresh a video frame is 8 ms at 120 Hz if everything else came free. In practice it is closer to <4 ms. So even looking at close to worst-case conditions, that's about the delay of sound traveling a meter, which should be fine for a lot of real-life applications.
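The latency figures quoted in this subthread are easy to check with a few lines of arithmetic. The only number assumed here that is not in the thread is the speed of sound, taken as roughly 343 m/s in air at about 20 °C.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: dry air at about 20 degrees C


def period_seconds(rate_hz):
    """Duration of one cycle (sample or frame) at the given rate."""
    return 1.0 / rate_hz


def sound_distance_m(delay_s):
    """Distance sound travels in air during the given delay."""
    return SPEED_OF_SOUND_M_PER_S * delay_s


sample_period_us = period_seconds(96_000) * 1e6  # one audio sample at 96 kHz
frame_period_ms = period_seconds(120) * 1e3      # one video frame at 120 Hz
distance_m = sound_distance_m(0.004)             # a 4 ms processing budget

print(f"sample period at 96 kHz: {sample_period_us:.2f} us")
print(f"frame period at 120 Hz: {frame_period_ms:.2f} ms")
print(f"sound travels {distance_m:.2f} m in 4 ms")
```

So a 4 ms budget corresponds to standing a bit over a meter further from a loudspeaker.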

colejohnson66 5 years ago | | |

Could it be possible to “prerender” the audio on the GPU when it’s not being worked on (say, a track not being edited)? Then just play that track if it’s not edited before the user hits play?

singhrac 5 years ago | | |

I've heard credible claims that GPUs these days (esp. TPUs) have lower latency for big models than CPUs. I haven't really investigated, but I could see it happening if you give the TPU a huge L1 cache or something.

rektide 5 years ago |

may someday please someone help dethrone the underlord of AI & rise us up

person_of_color 5 years ago |

This guy will get a foot in but still have to do a gotcha interview loop