Optimizing Gaussian blurs on a mobile GPU (sunsetlakesoftware.com)
The more "boxes" you use in your sliding window (analogous to a convolution kernel), the better you can approximate a Gaussian kernel. If you're using a big Gaussian kernel on a big image, integral images can greatly reduce the number of operations performed, since each box sum costs a constant number of lookups regardless of the box size, and thus potentially enable a big speed-up.
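A minimal sketch of the integral-image idea in plain Python, on a grayscale image stored as a list of lists. All function names are hypothetical, and reading "more boxes" as "more repeated box passes" is an assumption on my part. The summed-area table makes every box sum cost four lookups no matter how large the radius is, and stacking a few box passes gives a rough Gaussian (central limit theorem).

```python
def integral_image(img):
    """Summed-area table: sat[y][x] = sum of img[0:y][0:x].
    The extra zero row/column avoids bounds checks later."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_blur(img, r):
    """Box blur of radius r; cost per pixel is O(1) regardless of r.
    Windows are clamped at the borders and normalized by their true area."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            # Four lookups give the sum over rows y0..y1-1, cols x0..x1-1.
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            out[y][x] = total / ((y1 - y0) * (x1 - x0))
    return out


def approx_gaussian(img, r, passes=3):
    """Repeated box blurs converge toward a Gaussian-shaped response."""
    for _ in range(passes):
        img = box_blur(img, r)
    return img
```

Three passes is a common compromise; each extra pass widens the response and makes it smoother while keeping the per-pixel cost independent of the radius.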
Additionally, the advantage of computing pixel by pixel is that the shader can run in a massively parallel fashion.
Or do you mean that memory is the bottleneck, as in shipping the image into the GPU's memory space?
Also, if your convolution kernel is short enough, you can beat an FFT. In 2D with an NxN image, an FFT filter has a complexity of about (N^2 + N^2 log N). If your kernel is KxK, direct convolution is (N^2 K^2), so you have to compare log N with K^2. But keep in mind that 1.) these are asymptotic complexities, so the constant factors matter; 2.) K may or may not depend on N; 3.) FFTs may have larger memory requirements, although you may be able to get away with in-place FFTs. Also, you don't have to explicitly compute the FFT of the Gaussian; you can work it out analytically (you should get another Gaussian...).
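A rough back-of-the-envelope comparison using the formulas above with constants dropped. Taking the log as base 2 is an assumption, and `fft_overhead` is a hypothetical knob standing in for the constant factors that point 1.) warns about (transform setup, the inverse FFT, memory traffic). The separable cost anticipates the point made in the next comment.

```python
import math


def fft_cost(n):
    """Approximate op count for FFT-based filtering of an n x n image."""
    return n * n + n * n * math.log2(n)


def direct_cost(n, k):
    """Direct 2D convolution with a k x k kernel."""
    return n * n * k * k


def separable_cost(n, k):
    """Two 1D passes (horizontal, then vertical) with a length-k kernel."""
    return 2 * n * n * k


def prefer_fft(n, k, fft_overhead=4.0):
    """True if the FFT route looks cheaper than direct 2D convolution.
    fft_overhead is a hypothetical tuning constant, not a measured value."""
    return fft_overhead * fft_cost(n) < direct_cost(n, k)
```

For a 512x512 image with that (made-up) overhead of 4, a 5x5 kernel favours direct convolution while a 9x9 kernel tips toward the FFT; a real crossover would have to come from profiling.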
Also, as someone pointed out, the Gaussian is separable, so the complexity of direct convolution drops to (K N^2).
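A minimal separable Gaussian blur in plain Python, to show where the K N^2 comes from: one 1D pass along rows, then one along columns, so each pixel costs 2K taps instead of K^2. Function names are hypothetical; truncating the kernel at 3 sigma and clamp-to-edge sampling are my assumptions.

```python
import math


def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian weights, truncated at roughly 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(img, kernel):
    """Apply a symmetric 1D kernel along each row, clamping at the edges."""
    r = len(kernel) // 2
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row = img[y]
        for x in range(w):
            acc = 0.0
            for i, k in enumerate(kernel):
                xx = min(w - 1, max(0, x + i - r))  # clamp-to-edge sample
                acc += k * row[xx]
            out[y][x] = acc
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def separable_gaussian_blur(img, sigma):
    """Horizontal pass, then a vertical pass done as a row pass on the transpose."""
    k = gaussian_kernel_1d(sigma)
    horizontal = convolve_rows(img, k)
    return transpose(convolve_rows(transpose(horizontal), k))
```

Because the 2D Gaussian is the outer product of two 1D Gaussians, the two passes give the same result as the full KxK kernel (up to rounding).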
Anyway, what are FFT libraries like for iOS? I suppose that with Apple's policies you can't compile FFTW for your app?
There's an analysis of the 1D case here with some interesting cautions at the end: http://www.engineeringproductivitytools.com/stuff/T0001/PT15...
Really great read!
You might need an Arbitrary Hack Constant (oh, sorry, "tuning parameter") to decide when to make that transition.
> Instead of filtering the image in its native resolution, I used OpenGL's native support for linear interpolation to downsample the source image by 4X in width and height, blurred that downsampled image, then linearly upsampled via OpenGL afterward.
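A 1D CPU sketch of the downsample, blur, upsample pipeline described in the quote. On the GPU the resampling comes essentially free from texture linear filtering; here block averaging and an explicit lerp stand in for it, which is an assumption about how to emulate that behaviour. The small [1, 2, 1] blur is a hypothetical placeholder for whatever blur is actually run at low resolution, and all names are hypothetical.

```python
def downsample_box(signal, factor):
    """Average each run of `factor` samples (assumes len(signal) % factor == 0)."""
    return [sum(signal[i:i + factor]) / factor for i in range(0, len(signal), factor)]


def blur_121(signal):
    """Small [1, 2, 1] / 4 blur with clamp-to-edge sampling."""
    n = len(signal)
    return [(signal[max(0, i - 1)] + 2 * signal[i] + signal[min(n - 1, i + 1)]) / 4
            for i in range(n)]


def upsample_linear(signal, factor):
    """Linear interpolation back to full resolution, treating samples as texel centres."""
    n = len(signal)
    out = []
    for j in range(n * factor):
        # Centre of output sample j, expressed in low-res texel coordinates.
        pos = (j + 0.5) / factor - 0.5
        pos = min(max(pos, 0.0), n - 1.0)  # clamp to edge
        i0 = int(pos)
        i1 = min(i0 + 1, n - 1)
        t = pos - i0
        out.append(signal[i0] * (1.0 - t) + signal[i1] * t)
    return out


def cheap_blur(signal, factor=4):
    """Blur at 1/factor resolution: the inner blur touches factor-times fewer samples."""
    return upsample_linear(blur_121(downsample_box(signal, factor)), factor)
```

In 2D the saving is factor^2 (16x fewer pixels at 4X), and a small kernel at low resolution covers a wide footprint in the original image.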
Amusingly, Autodesk ships a GLSL shader with Flame that uses a triangular kernel under the hood but labels the button in the interface "Gaussian", the cheeky swines...
The nicest option to imitate lens blur is a circular kernel, but that's quite aggressively non-separable. The other thing that really helps is blurring linear-light values instead of the usual gamma-encoded images. To my eye, the blurs in Windows 7's Aero themes very clearly don't do this: dark lines spread out too much and look generally muddy.
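A small sketch of blurring in linear light: decode sRGB to linear, blur, re-encode. The transfer functions are the standard sRGB curves; the function names and the 1D toy kernel are hypothetical. On a black-to-white edge, the pixel next to the edge comes out at 0.25 when blurred in gamma space but roughly 0.54 in linear light, which is the "dark spreads too much" effect described above.

```python
def srgb_to_linear(c):
    """sRGB-encoded value in [0, 1] -> linear light (standard sRGB curve)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(v):
    """Linear light in [0, 1] -> sRGB-encoded value."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def blur_1d(values, kernel):
    """1D convolution with an odd-length, normalized kernel; clamp-to-edge."""
    r = len(kernel) // 2
    n = len(values)
    return [sum(w * values[min(n - 1, max(0, i + j - r))] for j, w in enumerate(kernel))
            for i in range(n)]


def blur_linear_light(srgb_values, kernel):
    """Decode to linear light, blur, then re-encode to sRGB."""
    linear = [srgb_to_linear(v) for v in srgb_values]
    return [linear_to_srgb(v) for v in blur_1d(linear, kernel)]


edge = [0.0, 0.0, 1.0, 1.0]
kernel = [0.25, 0.5, 0.25]
gamma_space = blur_1d(edge, kernel)             # blurs the encoded values directly
linear_light = blur_linear_light(edge, kernel)  # blurs physical light intensities
```

In a shader the same effect is usually had by sampling from an sRGB texture (which decodes to linear on read) and writing to an sRGB render target.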
VFX guy who loves to blur things here :)
And the problem with that is that you can't know the cache size in advance. Profiling helps, but it only gives you a local optimization for the particular GPUs you profiled on.
If you want your code to run well on any GPU, the pixel-by-pixel approach usually works best. The GPU scheduler can then run as many neighboring threads as possible on each subprocessor. Note that every subprocessor also has its own local cache, which is very fast.