The Modern Mathematics of Deep Learning

The Modern Mathematics of Deep Learning (arxiv.org)

276 points by tims457 5 years ago | 70 comments

It seems like this title can leave the reader with the wrong impression. Calculus really is "the mathematics of Newtonian physics". This is just "some mathematics that might help a bit with your intuitions about deep learning".

I.e., deep learning is fundamentally just about getting the mathematically simple but complex and multi-layered "neural networks" to do stuff. Training them, testing them and deploying them. There are many intuitions about these things but there's no complete theory: some intuitions involve mathematical analogies and simplifications while others involve "folk knowledge" or large-scale experiments. And that's not saying folks giving math about deep learning aren't proving real things. It's just that they aren't characterizing the whole or even a substantial part of such systems.

It's not surprising that a complex system like a many-layered ReLU network can't be fully characterized or solved mathematically. You'd expect that of any arbitrarily complex algorithmic construct. Differential equations of many variables and arbitrary functions also can't have their solutions fully characterized.

fogof 5 years ago | |

As a PhD student who sort of burned out on this type of research, I agree that the complexity of neural networks as a mathematical construct makes them very difficult to analyze. This might also have to do with deep learning theory being a subset of learning theory, which is subject to "No Free Lunch" [1], which means that you always have to be very careful not to try to prove something that turns out to be impossible.

That being said, research on the Kernel regime is one of the very cool ideas, in my opinion, to gain traction in this field in the past few years. To summarize: "If you make a neural network wide enough, it gains the power to control its output on each individual input separately, and will begin to fit its training data perfectly". Of course, the real pleasure is in understanding all the mathematical details of this statement!

[1] : https://en.wikipedia.org/wiki/No_free_lunch_theorem
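A rough sketch of the "wide enough to fit the training data perfectly" statement, under a loud simplification: the hidden layer is random and frozen, and only the output weights are fitted by least squares (a random-features model, not a trained network and not the full kernel-regime analysis). The function name `training_mse`, the sizes, and the seeds are all made up for illustration.

```python
import numpy as np


def training_mse(x, y, width, seed=0):
    """Training error of a two-layer ReLU model with a frozen random hidden layer.

    Assumption: only the output weights are fitted (by least squares).
    This is a random-features stand-in for a wide network, nothing more.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[1], width))   # random hidden weights, never trained
    b = rng.normal(size=width)                 # random hidden biases, never trained
    features = np.maximum(x @ w + b, 0.0)      # ReLU hidden activations, shape (n, width)
    # Least-squares fit of the output layer on the training set.
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return float(np.mean((features @ coef - y) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(20, 5))   # 20 training inputs in 5 dimensions
    y = rng.normal(size=20)        # pure-noise labels: nothing to "learn"
    for width in (2, 20, 2000):
        print(width, training_mse(x, y, width))
```

With far more hidden units than training points, the model has enough freedom to set its output on each training input separately, so the training error should drop to roughly machine precision even on noise labels; the narrow model cannot do that.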

joe_the_user 5 years ago | | |

I got my master's years ago so now I'm a strict amateur. That said, I don't think the "No free lunch theorem" is very "interesting". It's nearly tautological that no approximation method works for "any" function. The set of predictable/interesting/useful/"real-world" functions is going to have measure 0 compared to white noise, so "any function" will basically look like white noise and can't be predicted. Approximating functions/sequences with vanishingly low Kolmogorov complexity is more interesting: impossible in general by Gödel's theorem, but what's the case "on average"? (It depends on the choice process and so is ill-defined, but defining it might be interesting.) The kernel regime stuff looks interesting but I don't know its relation to wide networks.

Neural networks "tend to generalize well in the real world". That's a pretty fuzzy statement imo since "real world" is hardly defined but it's still what people experience and it's more useful to provide a more precise model where this works rather than a model where this doesn't work.

Also, there's good theory on deep networks as universal approximators, as well as theories of wide/shallow networks [1].

[1]: https://arxiv.org/abs/1901.02220

conformist 5 years ago | |

It seems like it aims at giving somebody who would like to get started doing theoretical research in the field some pointers and basic insights. I don't think it does a particularly bad job at this, in particular given that it will be a book chapter? The target audience is probably people who have had some exposure to Functional Analysis and the like before.

jhrmnn 5 years ago | |

There are a few works that try to put deep learning on some theoretical basis. I like this one, for example:

https://arxiv.org/abs/1703.00810

This goes beyond mere intuition, but it is also still very far from a “complete theory”.

I find it disappointing that so few people in deep learning work on the theoretical foundations.

quibono 5 years ago | | |

What are some subfields of mathematics that you would say are crucial for gaining a proper understanding of all the things related to deep learning (e.g. let's say the paper you linked)? Even though the theory isn't complete, I'm sure a grounding in certain fields of mathematics will be helpful.

0-_-0 5 years ago | | |

Of the many "understanding neural networks" papers this is one of the few valuable ones.

keithalewis 5 years ago | | |

Agreed. Until we get to the point where there are theorems of the form, for example, "Given a problem satisfying conditions X, the optimal number of layers to minimize expected training time for data satisfying Y is Z", it is just stamp collecting.

pcbro141 5 years ago |

Tangent, but has anyone taken Fast.ai or similar courses and transitioned into the Deep Learning/ML field without an MS/PhD? To be honest, I don't even know what 'doing ML/DL' looks like in practice, but I'm just curious if a lot of folks get into the field without graduate degrees.

amelius 5 years ago |

What are the prerequisites?

mathgenius 5 years ago |

After skimming through the paper it's clear that the title should be read as "The Modern (Mathematics of Deep Learning)" and not my original parse which was "The (Modern Mathematics) of Deep Learning." Very different things.

somewhereoutth 5 years ago |

Wake me up when 'deep learning' has independently created a language to communicate within a group of peers while under environmental pressure.

(and that language is co-expressive with human languages)

scaraffe 5 years ago | |

> a language to communicate within a group of peers while under environmental pressure

What does this mean?

visarga 5 years ago | |

Self play applied to language creation as opposed to go and chess? I like this idea.

rohittidke 5 years ago |

I believe that the curse of dimensionality doesn't apply here, as we are optimizing the "universal approximator" of the "surface" of the possible real-world function.

ganzuul 5 years ago | |

> Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. - https://en.wikipedia.org/wiki/Kernel_method

As it relates to this: https://en.wikipedia.org/wiki/Neural_tangent_kernel

To me, this is JFM. Not sure if I'm connecting the dots right either. I just don't know of anything else claiming to solve the curse.
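For what it's worth, the "inner products without ever computing the coordinates" part of that quote can be checked on a toy case. This is only a sketch using the degree-2 polynomial kernel as a stand-in (it is not the neural tangent kernel); the helper names `explicit_features` and `poly_kernel` are made up for illustration.

```python
import numpy as np


def explicit_features(x):
    """Explicit degree-2 feature map: every product x_i * x_j, flattened.

    For a d-dimensional input this has d * d coordinates.
    """
    return np.outer(x, x).ravel()


def poly_kernel(x, z):
    """Same inner product computed directly in input space as (x . z)^2."""
    return float(np.dot(x, z) ** 2)


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0])
    z = np.array([4.0, 5.0, 6.0])
    via_features = float(np.dot(explicit_features(x), explicit_features(z)))
    via_kernel = poly_kernel(x, z)
    print(via_features, via_kernel)
```

Both routes give the same number, but the kernel route never builds the d * d feature vector. That is the sense in which the feature space stays "implicit"; whether this says anything about the curse of dimensionality is a separate question.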

antipaul 5 years ago | |

Does “possible” in your statement refer to the inherent constraints of the architecture as specified by the researcher, or something else?