Understanding Convolution in Deep Learning (2015)(timdettmers.com) |
Understanding Convolution in Deep Learning (2015)(timdettmers.com) |
"Convolution with a kernel K" describes a system whose impulse response is K. In discrete time, suppose you have K=[1,2] and convolve [0,1,2,0] with it- you wind up with [0,1,3,2,0], if I'm awake enough for arithmetic.
Correlation with a kernel K is convolution with K time-reversed (i.e. [2,1])- you'd get [0,2,5,2,0] (again if I'm awake). Note that 5- right there, the input signal "lines up just right" with the kernel- 2x2 + 1x1. That's why it's called correlation- its output is big when the input looks like the kernel.
It's a binary operator on functions that yields a third function. It has a lot of useful properties and equivalences, like that it can be described as the product of two Fourier transforms (although that's very roundabout).
You're actually introduced to convolution in middle school when you're taught how to multiply monomials to build a polynomial (at my middle school they called it "FOIL").
That doesn't help one understand what it is at all. Convolution in DL is simply a set of dot products of a patch from the input with a bunch of filters. Each resulting dot product is simply a measure of similarity between a patch and a filter. That's all there is to it.
The idea behind convolution in deep learning is that, if a particular pattern of pixels is meaningful, then it is probably also meaningful if you shift the whole thing in some direction. So you can force some layers of the network to be the same under translation, and it'll be faster to pick up some sorts of patterns.
It's faster because its reduces the dimensionality of the inputs down to something manageable (hundreds or low thousands). You can replace convolutions with most other types of dimensionality reduction (including other types of layers) and outside of image tasks you'll get very similar or even better performance.
(Even before that it has been used in signal processing.)
> technically being simpler to compute.
They're equivalent, since the only meaningful way to "compute" a continuous convolution is symbolically, and discrete convolutions obey most of the same identities.
If one can place a lower bound on the time step resolution of a simulation then continuous convolutions are evaluated using discrete convolutions, which can represent the continuous case exactly via the Nyquist-Shannon sampling theorem.
Interestingly enough, to prove the Sampling Theorem you need to rely on the identity that multiplication in frequency is convolution in time, and to prove that it can't be realized in a physical system (breaks causality, since you multiply by a superposition of Heavisides which of course are infinitely long sinc functions in both directions of time).
And more interesting is that signals and systems is mostly applied dynamics and statistics, so it shouldn't be surprising if there's overlap.
1 : a form or shape that is folded in curved or tortuous windings e.g the convolutions of the intestines
2 : one of the irregular ridges on the surface of the brain and especially of the cerebrum of higher mammals
3 : a complication or intricacy of form, design, or structure … societies in which the convolutions of power and the caprices of the powerful are ever-present dangers to survival.
After this is clear read the mathematical idea on wikipedia. After reading that, do google scholar search on the AI papers that first mentioned it. That is the way to go.
Convolution of f and g at t is:
integration of f(x) * g(t - x)
Cross correlation(which is termed convolution in DL) at t is: integration of f(x) * g(t + x)
See the figure in the wikipedia page you shared.