Implement your own source transformation AD with Julia(blog.rogerluo.me) |
Implement your own source transformation AD with Julia(blog.rogerluo.me) |
So how are AD libraries claiming to differentiate such functions? Is there an implicit assumption that the user knows the derivative does not make sense at 0?
Edit: I just tried this and it gives the wrong answer without any hint that it's incorrect:
"""
julia> f
f (generic function with 1 method)
julia> f(0), f'(0)
(0, -1)
julia> f'(1), f'(-1)
(1, -1)
"""
After all, there’s nothing so special about the ReLU... It would be very very weird/unstable if our algorithms worked for ReLU, but not the link-smoothed version of ReLU.
All optimal points (for, say, optimizing a linear function) will lie on the extremal points of the feasible domain, many of which will be points where the constraint functions are not differentiable. In all cases you can turn nonlinear objective function optimization (say over f) into linear objective function optimization by adding a constraint f(x) ≤ t and moving t to the objective.
Now, I will agree that smooth optimization algorithms will work ok, but try optimizing abs(x) with GD; you'll find that the best possible error you can achieve (other than by sheer luck) will be ~O(L) where L is your stepsize.
https://see.stanford.edu/materials/lsocoee364b/01-subgradien...
About that... One thing that I didn't expect but should have been obvious is that introductory content is often geared towards scientists/mathematicians rather than engineers, which makes sense given that this is the target audience.
They often explain the programming side and not the mathematical/scientific side. Which is fine, because they present the right vocabulary for me to explore from different sources.
This article seems to be very much engineering focused but there is a ton of vocabulary I'm not used to yet. I assume the reader is expected to have a solid understanding of the paradigm and at least a high level understanding of Zygote.
> mostly aimed at working scientists
My primary goals are to learn what (primarily) data-scientists do. In the sense of: How do they think and approach problems, what are the limitations and the prerequisites etc. (And as I said to improve my math skills.)
I think there is merit in engineers learning these things (within reasonable scope) because at some point there needs to be a system that provides and transforms data into a format that scientists/analysts can work with. And in the other hand there are things that engineers can implement and learn to improve their systems. I'm excited about both and curious about how far I can get.
Even in this case, though, subgradients may not exist (they're only guaranteed to exist for convex functions and any nonconvex function has, almost by definition, at least one point for which there exists no subgradient), in which case one talks about sub-derivatives instead. These always exist whenever a function is semicontinuous (such as the "if" statement example given in the GGP comment), even if it is not convex [1].
All of this is to say, the subject of general variational analysis is mathematically nice, but computationally extraordinarily difficult.
The first proofs of even local convergence of GD to local optima (not stationary points!) with high probability, even in the case of smooth, differentiable functions, only recently emerged as well, and the results are rather weak in the sense of limits [2]. Query-hardness has also been shown for many examples (even with unbounded computational time) in [3] even when the functions are smooth.
The non-smooth case becomes even harder and you have exponential-size query complexity in the number of dimensions even when you restrict that the function cannot "vary too much," locally (i.e., if it is required to be L-Lipschitz) [4].
-----
[0] For example, Variational Analysis covers a good amount of this stuff https://sites.math.washington.edu/~rtr/papers/rtr169-VarAnal...
[1] c.f., chapter 8 in [0].
[2] https://arxiv.org/abs/1602.04915
[3] https://arxiv.org/abs/1912.02365 in the stochastic case, and https://arxiv.org/abs/1710.11606 along with https://arxiv.org/abs/1711.00841 in the nonstochastic case
[4] See, e.g., 1.1.3 in Nesterov's Introductory Lectures in Convex Optimization (can be easily found online).
Sorry I screwed up a bit when writing that: an optimal point exists which lies on the extremal points of the feasible domain. In many cases, it can be shown that the optimal point lies on the extremal points of the feasible domain (when the objective is linear)
This is not, in general, true for smooth functions so long as L is small enough (you can reach arbitrary accuracy with GD if L is smaller than ~ the reciprocal of the Lipchitz constant of a differentiable objective function but it need not be arbitrarily small).