Automatic Differentiation with Julia

Automatic Differentiation with Julia (blog.rogerluo.me)

146 points by Acur 7 years ago | 78 comments

You may like to take a look at Flux's implementation [1]; roughly the same idea but "professionalised" with performance work, tighter integration with the type system, nested AD and so on. It's a little less simple for that, of course, but is still under 500 LOC of fairly straightforward Julia code, and is generally a bit faster than PyTorch.

The Julia world has done a lot of experimentation with AD and is converging on some really cool things, so if you're interested in this field it's definitely worth a look.

[1]: https://github.com/FluxML/Flux.jl/blob/master/src/tracker/Tr...

cbkeller 7 years ago | |

Flux is awesome. One of the biggest advantages IMO is that kernels you can easily write yourself should be by default just as fast as what's built-in -- since it's all written in Julia and Julia itself is fast, without having to rely on C/C++/FORTRAN under the hood. As the Flux devs say:

"You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own." http://fluxml.ai/Flux.jl/stable/

edit: and the automatic differentiation works on them too!

IngoBlechschmid 7 years ago |

For those unaware of what automatic differentiation is: It's a close-to-magical tool which turns code for evaluating a function f into code for evaluating its derivative f'.

It uses a special sort of invented numbers which square to zero even though they are not themselves zero.

Here is one of the many tutorials on automatic differentiation: https://pizzaseminar.speicherleck.de/automatic-differentiati...
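To make the "numbers which square to zero" idea concrete, here is a minimal sketch of forward-mode differentiation with dual numbers. It is written in Python rather than Julia purely for illustration; the `Dual` class and the `derivative` helper are hypothetical names, not taken from Flux or from the linked tutorial, and only addition and multiplication are supported.

```python
class Dual:
    """A number a + b*eps with eps*eps == 0 (illustrative sketch only)."""

    def __init__(self, real, eps=0.0):
        self.real = real  # ordinary value
        self.eps = eps    # coefficient of eps, which carries the derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, because eps^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f at x + eps and read the derivative off the eps part."""
    return f(Dual(x, 1.0)).eps


def f(x):
    return x * x * x + 2 * x  # f'(x) = 3x^2 + 2


slope = derivative(f, 3.0)    # 3*9 + 2 = 29
```

The code for `f` is unchanged; only the number type it is evaluated on differs, which is why the technique can look magical at first.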

kxyvr 7 years ago | |

To clarify, your statement that, "It uses a special sort of invented numbers which square to zero even though they are not themselves zero." is not entirely true. In case someone else is looking at this, what the OP is referring to is dual numbers. That is one way to implement things, but not the most common way for fast tools.

Fundamentally, automatic differentiation is the methodical application of the chain rule. Forward mode results from the application of the rules for directional derivatives. Reverse mode results from the total derivative. The reason that we have a reverse pass in reverse mode can be seen from this perspective. The directional derivative is `(f o g)'(x)dx = f'(g(x)) g'(x)dx`. Note, we compute `g(x)` before `f(g(x))` and the derivative `g'(x)dx` before applying `f'(g(x))`. Therefore, we can compute the derivatives as we compute the answer. If we want the gradient, which results from the total derivative, we have `grad (f o g)(x) = g'(x)* grad f(g(x))`. Although we still compute `g(x)` before `f(g(x))` during our computation, the gradient requires the computation of `grad f(g(x))` before the application of the adjoint operator `g'(x)*`. We do the evaluations on the first pass, and cache extra values, and then compute the gradient on a reverse pass because we need the adjoint of the total derivatives in the reverse order.

Or, at least that's my bias in how to derive things.
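The ordering argument above can be sketched on a small composition. The functions `g` and `f` below are made-up examples, and their derivative rules (`jvp_g`, `jvp_f`, `grad_f`, `vjp_g`) are written by hand rather than generated by a tool; the point is only the order in which they are applied. Python is used for illustration.

```python
def g(x):
    # g: R^2 -> R^2
    return (x[0] * x[1], x[0] + x[1])


def f(y):
    # f: R^2 -> R
    return y[0] ** 2 + 3.0 * y[1]


def jvp_g(x, dx):
    # g'(x) dx, where g'(x) = [[x1, x0], [1, 1]]
    return (x[1] * dx[0] + x[0] * dx[1], dx[0] + dx[1])


def jvp_f(y, dy):
    # f'(y) dy, where f'(y) = [2*y0, 3]
    return 2.0 * y[0] * dy[0] + 3.0 * dy[1]


def grad_f(y):
    return (2.0 * y[0], 3.0)


def vjp_g(x, w):
    # g'(x)* w: the adjoint (transpose) of g'(x) applied to w
    return (x[1] * w[0] + w[1], x[0] * w[0] + w[1])


def forward_mode(x, dx):
    # Derivatives travel in the same order as the evaluation: g, then f.
    y, dy = g(x), jvp_g(x, dx)
    return f(y), jvp_f(y, dy)


def reverse_mode(x):
    y = g(x)             # first pass: evaluate and cache g(x)
    value = f(y)
    w = grad_f(y)        # reverse pass starts at the output ...
    return value, vjp_g(x, w)  # ... and applies g'(x)* last


x = (2.0, 5.0)
value, grad = reverse_mode(x)
_, d0 = forward_mode(x, (1.0, 0.0))  # directional derivative along e0
_, d1 = forward_mode(x, (0.0, 1.0))  # directional derivative along e1
```

Here one reverse pass yields the whole gradient, while forward mode needs one pass per input direction to recover the same two numbers.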

IngoBlechschmid 7 years ago | | |

Thank you for the very insightful background.

Besides efficiency concerns (which I don't have a clue about), a disadvantage of the point of view using dual numbers is that, to my knowledge, it can only be used to derive the forward mode of automatic differentiation. Still I take pleasure in appreciating the slightly mystic aura of the dual numbers. :-)

aaaaaaaaaab 7 years ago | |

>close-to-magical

As magical as the chain rule of differentiation.

abecedarius 7 years ago | | |

It’s interesting though that the way calculus is classically taught does not make this obvious.

elcomet 7 years ago | |

This is really neat.

Is this really used in practice?

It seems to me that most of the AD frameworks used for deep learning implement the backward function that returns the Jacobian for every initial function, and then chain those backward functions.
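A rough sketch of that "record a backward function per operation, then chain them" design, assuming a tiny scalar-only graph. The `Var` class and `backward` function are hypothetical and are not the API of any framework mentioned in this thread; each recorded function here applies a local vector-Jacobian product rather than materialising a full Jacobian.

```python
class Var:
    """Scalar node that remembers how to push gradients to its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local backward function)
        self.grad = 0.0

    def __add__(self, other):
        other = _wrap(other)
        return Var(self.value + other.value,
                   ((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        other = _wrap(other)
        return Var(self.value * other.value,
                   ((self, lambda g, o=other: g * o.value),
                    (other, lambda g, s=self: g * s.value)))


def _wrap(v):
    return v if isinstance(v, Var) else Var(v)


def backward(out):
    """Chain the recorded backward functions in reverse topological order."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)  # parents are appended before their children

    visit(out)
    out.grad = 1.0
    for v in reversed(order):
        for parent, back in v.parents:
            parent.grad += back(v.grad)


x, y = Var(3.0), Var(4.0)
z = x * y + x * x  # dz/dx = y + 2x = 10, dz/dy = x = 3
backward(z)
```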

cultus 7 years ago | | |

I used it for non-ML tasks in geophysics, which made my life a lot easier. However, I think most scientists and engineers aren't aware of it. It has been described as "criminally underused."

aaaaaaaaaab 7 years ago | | |

To calculate the Jacobian of a function R^n -> R^m, forward-mode AD is preferable if n << m, while reverse-mode is faster for n >> m.

From this it should be clear why machine learning uses reverse-mode.

On the other hand, forward-mode is better for e.g. calculating the tangent of a high-dimensional curve (i.e. R -> R^n).
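The pass counts behind this rule of thumb can be sketched without a real AD tool. The example below assumes we are handed a Jacobian-vector product (`jvp`, what one forward-mode pass gives) and a vector-Jacobian product (`vjp`, what one reverse-mode pass gives), and uses a linear map as a stand-in so both are easy to write by hand; all helper names are hypothetical.

```python
def make_linear(A):
    """Stand-in for a function R^n -> R^m whose Jacobian is the matrix A."""
    m, n = len(A), len(A[0])

    def jvp(v):  # one forward-mode pass: J v
        return [sum(A[r][c] * v[c] for c in range(n)) for r in range(m)]

    def vjp(w):  # one reverse-mode pass: J^T w
        return [sum(A[r][c] * w[r] for r in range(m)) for c in range(n)]

    return jvp, vjp, n, m


def basis(i, size):
    return [1.0 if j == i else 0.0 for j in range(size)]


def jacobian_forward(jvp, n):
    """Build J column by column: n passes, one per input."""
    cols = [jvp(basis(i, n)) for i in range(n)]
    rows = [[col[r] for col in cols] for r in range(len(cols[0]))]
    return rows, n


def jacobian_reverse(vjp, m):
    """Build J row by row: m passes, one per output."""
    rows = [vjp(basis(i, m)) for i in range(m)]
    return rows, m


# A curve R -> R^3: forward mode needs 1 pass, reverse mode needs 3.
curve = [[1.0], [2.0], [3.0]]
jvp, vjp, n, m = make_linear(curve)
J_fwd, fwd_passes = jacobian_forward(jvp, n)
J_rev, rev_passes = jacobian_reverse(vjp, m)
```

For a scalar loss R^n -> R the counts flip: n forward passes against a single reverse pass.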

IngoBlechschmid 7 years ago | | |

Yes, definitely, there are even battle-tested implementations for Fortran available.

Though I have never seen AD frameworks used in production contexts for neural networks/backpropagation. As you say, the code for this seems to be mostly hand-rolled. Please take this negative statement with a grain of salt, I don't actually work in machine learning.

RivieraKid 7 years ago | |

How useful is it in practice?

RBerenguel 7 years ago | | |

It's used extensively in numerical analysis/differential equations/some PDE work and the like, at least.

ncfausti 7 years ago |

For those curious about Julia, I just found this:

https://www.infoworld.com/article/3284380/data-science/what-...

Close to C speed in a dynamic language? Seems pretty great on paper. Is this generally the case?

kxyvr 7 years ago |

Alright, this looks neat, but I'm having a terrible time figuring out what's going on with the benchmark. Typically for AD, it's easiest to see four things: number of variables, time to compute the function directly with no AD, time to compute the function with AD enabled and calculate the gradient, and the ratio between these two times. Then, we run the same example with a varying number of variables to see how things scale. The advantage of reverse mode is that, theoretically, this ratio is fixed and bounded at 4 or 5 regardless of the number of variables. Realistically, this is much higher, but I think it's fine if it's somewhere between 10 and 40 times the cost of a function evaluation as long as it scales. I can't figure out what their ratio is or whether or not it's appropriately scaling. Can anyone else?
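For reference, a report of that shape could be produced by a harness along these lines. This is a sketch, not the benchmark from the linked post: `benchmark` and `best_time` are hypothetical helpers, and the gradient below is hand-written as a stand-in, so to measure a real tool you would pass its gradient call as `grad_f` instead.

```python
import time


def best_time(fn, arg, repeats=5):
    """Best wall-clock time of fn(arg) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(f, grad_f, sizes):
    """Return rows of (n, time of f, time of gradient, ratio)."""
    rows = []
    for n in sizes:
        x = [float(i) for i in range(n)]
        t_f = best_time(f, x)
        t_grad = best_time(grad_f, x)
        ratio = t_grad / t_f if t_f > 0 else float("inf")
        rows.append((n, t_f, t_grad, ratio))
    return rows


# Stand-ins: replace sum_of_squares_grad with an AD-computed gradient.
def sum_of_squares(x):
    return sum(v * v for v in x)


def sum_of_squares_grad(x):
    return [2.0 * v for v in x]


rows = benchmark(sum_of_squares, sum_of_squares_grad, [1000, 2000, 4000])
for n, t_f, t_grad, ratio in rows:
    print(f"n={n:5d}  f={t_f:.2e}s  grad={t_grad:.2e}s  ratio={ratio:.2f}")
```

If the last column stays roughly flat as n grows, the gradient is scaling the way reverse mode promises.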