Implement your own source transformation AD with Julia (blog.rogerluo.me)

66 points by metalwhale 6 years ago | 27 comments

We now have so many AD implementations in Julia that we have built infrastructure separating the definition of primitive derivative rules from the AD mechanism itself, so the rules can be shared among all of them. Of course it would be better if there were just one AD to rule them all, but there are tradeoffs in the design space that make that hard. I think having all these different implementations has helped crystallize what the design space is, what choices need to be made, and what the interesting classes of applications are. I think that'll help with the next generation of these tools (disclaimer: I just started working on one of those next-generation tools yesterday ;) ).

stochastimus 6 years ago

Thanks for this. Glad to see the Julia community is going strong. At Zebrium we use Julia at the core of our log structuring engine, and talk to it over gRPC. Keep it up!

memexy 6 years ago

Differentiating through control flow has never made sense to me. What does it mean to differentiate the following function: "f(x) = x > 0 ? x : -x"? If you plot this function you get a sharp corner at 0, which means it's not differentiable there: the slope approaching from the left is -1 and the slope approaching from the right is 1. Since 1 ≠ -1, the derivative does not exist at 0.

So how are AD libraries claiming to differentiate such functions? Is there an implicit assumption that the user knows the derivative does not make sense at 0?

Edit: I just tried this and it gives the wrong answer without any hint that it's incorrect:

"""
julia> f
f (generic function with 1 method)

julia> f(0), f'(0)
(0, -1)

julia> f'(1), f'(-1)
(1, -1)
"""
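For what it's worth, here is a minimal sketch of why a tool would report -1 at 0. It uses forward-mode dual numbers, which is a simpler technique than the source transformation Zygote performs, so treat it as an illustration of the idea rather than a description of Zygote's internals. The names `Dual` and `derivative` are made up for this example. The point is that AD differentiates whichever branch actually executes, and at x = 0 that is the `-x` branch.

```julia
# Forward-mode AD sketch: carry a derivative alongside each value.
# `Dual` and `derivative` are hypothetical names for this illustration.
struct Dual
    val::Float64  # function value
    der::Float64  # derivative with respect to the input
end

# Only the two operations that f needs.
Base.:-(a::Dual) = Dual(-a.val, -a.der)    # d(-x) = -dx
Base.:>(a::Dual, b::Real) = a.val > b      # comparisons look at the value only

f(x) = x > 0 ? x : -x

# Seed the input with derivative 1 and read the derivative of the result.
derivative(f, x) = f(Dual(x, 1.0)).der

derivative(f, 0.0)   # -1.0: the `-x` branch ran, so its slope is reported
derivative(f, 1.0)   #  1.0
derivative(f, -1.0)  # -1.0
```

The comparison never sees the derivative part, so nothing in the mechanism can notice that 0 sits exactly on the branch boundary.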

KenoFischer 6 years ago

Yes, you can just consider it a piecewise differentiable function. That's not usually a problem in practice if you think of the derivatives more as heuristics in a search problem than as absolute truth. Obviously, how well your optimization works depends on the differentiability properties of your underlying function and on how well your optimization scheme can handle them.

throwawayiionqz 6 years ago

If subgradients are enough (-1 is a valid subgradient at 0 in your example), then there are sound approaches to computing subgradients with AD; see https://arxiv.org/abs/1809.08530
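To make "valid subgradient" concrete: g is a subgradient of a convex f at x0 when f(y) ≥ f(x0) + g·(y − x0) holds for every y. The sketch below spot-checks that inequality for the absolute value function at 0 on a grid of sample points. The helper name `is_subgradient` and the grid are assumptions made for this example, and a grid check is evidence, not a proof.

```julia
# g is a subgradient of f at x0 if f(y) >= f(x0) + g * (y - x0) for all y.
# Here we only test the inequality on a finite set of sample points `ys`.
is_subgradient(f, x0, g, ys) = all(f(y) >= f(x0) + g * (y - x0) for y in ys)

ys = -2.0:0.25:2.0   # sample points (an arbitrary choice)

is_subgradient(abs, 0.0, -1.0, ys)  # true: the value AD returned above
is_subgradient(abs, 0.0, 0.5, ys)   # true: anything in [-1, 1] works at 0
is_subgradient(abs, 0.0, 1.5, ys)   # false: the line overshoots abs for y > 0
```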

memexy 6 years ago

Thanks for the link.

currymj 6 years ago

if you want to be rigorous about it, you can often talk in terms of elements of the set of subgradients rather than gradients, and the convergence proofs for many of the popular optimization algorithms still go through.

memexy 6 years ago

Any references for sub-gradients and convergence proofs I can take a look at?

ssivark 6 years ago

The kinks correspond to a set of measure zero, which you will likely never hit during execution, so one can safely ignore the problem as not physically relevant. One way to think of the problem is that the cost function we’re differentiating is approximate/fake, and is whatever it needs to be (in some special neighborhoods) to give us the derivatives we consider sensible (over large regions).

After all, there’s nothing so special about the ReLU... It would be very very weird/unstable if our algorithms worked for ReLU, but not for the kink-smoothed version of ReLU.
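As a rough illustration of that last point, the sketch below compares the ReLU derivative with the derivative of one possible smoothed version. Softplus with a sharpness parameter β is an assumption chosen for this example; the comment above does not say which smoothing is meant. The value 0 assigned to the ReLU derivative at the kink is likewise just a convention.

```julia
# ReLU derivative, with the (arbitrary) convention of 0 at the kink.
relu_grad(x) = x > 0 ? 1.0 : 0.0

# Softplus as one smooth stand-in for ReLU; larger β gives a sharper corner.
softplus(x; β=10.0) = log1p(exp(β * x)) / β

# Its derivative is the logistic function of β * x.
softplus_grad(x; β=10.0) = 1 / (1 + exp(-β * x))

# Print both derivatives side by side at a few points around the kink.
for x in (-1.0, -0.1, 0.0, 0.1, 1.0)
    println(x, "  ", relu_grad(x), "  ", softplus_grad(x))
end
```

Away from 0 the two derivatives nearly coincide; they differ only in a small neighborhood of the kink, where the smoothed derivative passes through 0.5.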

LolWolf 6 years ago

Hmm... I'm not sure I agree.

All optimal points (for, say, optimizing a linear function) will lie on the extremal points of the feasible domain, many of which will be points where the constraint functions are not differentiable. In all cases you can turn nonlinear objective function optimization (say over f) into linear objective function optimization by adding a constraint f(x) ≤ t and moving t to the objective.

Now, I will agree that smooth optimization algorithms will work OK, but try optimizing abs(x) with GD; you'll find that the best possible error you can achieve (other than by sheer luck) will be ~O(L), where L is your step size.
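The abs(x) experiment is easy to try. This is a minimal sketch with a fixed step size; the starting point, step size, and iteration count are arbitrary choices for illustration, and `abs_grad` and `descend` are hypothetical helper names.

```julia
# Slope of abs away from 0; at 0 it takes the -1 branch,
# matching f(x) = x > 0 ? x : -x from earlier in the thread.
abs_grad(x) = x > 0 ? 1.0 : -1.0

# Plain gradient descent with a constant step size.
function descend(x, step, iters)
    for _ in 1:iters
        x -= step * abs_grad(x)
    end
    return x
end

x_final = descend(1.05, 0.1, 1000)
abs(x_final)   # stays near 0.05: the iterate hops between about +0.05 and -0.05
```

Because the slope has magnitude 1 right up to the minimum, every step moves the iterate by the full step size, so it ends up bouncing across 0 instead of settling. Shrinking the step size over time is the usual remedy in subgradient methods.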

memexy 6 years ago

My question wasn't about the theoretical aspects of measurability, since any countable set of points will have measure 0, but about all AD libraries sweeping this kind of issue under the rug. Where in the Zygote docs is it mentioned that the absolute value function will give the wrong answer when differentiated?

superdimwit 6 years ago

In the same way that the ReLU derivative is not defined at x=0. Most of the time, in practice, this all doesn't really matter and you can still get gradient descent to work in a useful way.

throwgeorge 6 years ago

the gradient doesn't exist but subgradients do exist (at points of non-differentiability) and that is still useful (case in point as someone else mentions ReLU)

https://see.stanford.edu/materials/lsocoee364b/01-subgradien...

memexy 6 years ago

Thanks. That's a nice tutorial.

kersny 6 years ago

Some more related info on different algorithmic differentiation approaches in Julia: https://github.com/MikeInnes/diff-zoo

metalwhale 6 years ago

Thank you so much for sharing this great repo! I have noticed that the source transformation notebook is not finished yet. How is it now?

dgb23 6 years ago

This article is way over my head right now. But I bookmarked it. Differentiable programming and probabilistic programming are among the things that motivated me to learn the language (still a beginner), aside from just brushing up and sharpening my math skills in a practical manner.

About that... One thing that I didn't expect but should have been obvious is that introductory content is often geared towards scientists/mathematicians rather than engineers, which makes sense given that this is the target audience.

They often explain the programming side and not the mathematical/scientific side. Which is fine, because they present the right vocabulary for me to explore from different sources.

This article seems to be very much engineering focused but there is a ton of vocabulary I'm not used to yet. I assume the reader is expected to have a solid understanding of the paradigm and at least a high level understanding of Zygote.

KenoFischer 6 years ago

You are correct. Our technical documentation is mostly aimed at working scientists who want to start using these techniques in their work. That does sometimes lead to funny cases where a document assumes you know what a smooth manifold is but will explain try/catch blocks. We've started trying to put together more introductory-focused material at https://juliaacademy.com/. We don't currently have anything particularly AD focused (outside of the general ML courses), but I think that's a topic that's high on the interest list.

dgb23 6 years ago

Thank you for pointing out this fantastic resource.

> mostly aimed at working scientists

My primary goal is to learn what data scientists (primarily) do, in the sense of: how do they think and approach problems, what are the limitations and the prerequisites, etc. (And, as I said, to improve my math skills.)

I think there is merit in engineers learning these things (within reasonable scope) because at some point there needs to be a system that provides and transforms data into a format that scientists/analysts can work with. And on the other hand there are things that engineers can implement and learn to improve their systems. I'm excited about both and curious about how far I can get.

cat199 6 years ago

bit of a detour, but being able to do things like this is a big part of what lisp programmers are getting at when they bring up the advantages of 'syntax as data': being able to perform complete high-level runtime program introspection, transformation, and code generation. good to see these kinds of techniques are available in other good languages that might appeal to the less parenthetically-inclined, not that that is the only benefit of julia

metalwhale 6 years ago

Disclaimer: I'm not the author. Just interested in the article and want to share this awesome post.