The Modern Mathematics of Deep Learning (arxiv.org)
I.e., deep learning is fundamentally just about getting the mathematically simple yet complex, multi-layered "neural networks" to do stuff: training them, testing them and deploying them. There are many intuitions about these things but there's no complete theory. Some intuitions involve mathematical analogies and simplifications, while others involve "folk knowledge" or large-scale experiments. That's not to say that folks doing math about deep learning aren't proving real things. It's just that they aren't characterizing the whole, or even a substantial part, of such systems.
It's not surprising that a complex construct like a many-layered ReLU network can't be fully characterized or solved mathematically. You'd expect that of any arbitrarily complex algorithmic construct. Differential equations of many variables and arbitrary functions also can't have their solutions fully characterized.
That being said, research on the Kernel regime is one of the very cool ideas, in my opinion, to gain traction in this field in the past few years. To summarize: "If you make a neural network wide enough, it gains the power to control its output on each individual input separately, and will begin to fit its training data perfectly". Of course, the real pleasure is in understanding all the mathematical details of this statement!
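A minimal sketch of that statement, under a strong simplifying assumption: the hidden layer is random and frozen (a random-feature model), and only the output weights are solved for. The data, the width of 200 and all names are made up for illustration; this is not the construction from the paper.

```python
import random

random.seed(0)

# Toy training set: 5 one-dimensional inputs with arbitrary labels.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0, -1.0, 0.5, 2.0, -0.5]
n = len(xs)

WIDTH = 200  # assumption: far more hidden units than training points
a = [random.gauss(0, 1) for _ in range(WIDTH)]  # frozen hidden weights
b = [random.gauss(0, 1) for _ in range(WIDTH)]  # frozen hidden biases

def features(x):
    """Hidden-layer activations relu(a_j * x + b_j)."""
    return [max(0.0, a[j] * x + b[j]) for j in range(WIDTH)]

Phi = [features(x) for x in xs]

# Kernel (Gram) matrix K[i][k] = <features(x_i), features(x_k)>.
K = [[sum(p * q for p, q in zip(Phi[i], Phi[k])) for k in range(n)]
     for i in range(n)]

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    m = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * m
    for r in reversed(range(m)):
        s = M[r][m] - sum(M[r][k] * z[k] for k in range(r + 1, m))
        z[r] = s / M[r][r]
    return z

# Minimum-norm output weights: w_out = Phi^T K^{-1} y.
alpha = solve(K, ys)
w_out = [sum(alpha[i] * Phi[i][j] for i in range(n)) for j in range(WIDTH)]

def predict(x):
    return sum(wj * fj for wj, fj in zip(w_out, features(x)))

max_err = max(abs(predict(x) - y) for x, y in zip(xs, ys))
print("largest training error:", max_err)
```

With many more features than data points the kernel matrix is invertible, so the labels can be matched one input at a time, whatever they are.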
Neural networks "tend to generalize well in the real world". That's a pretty fuzzy statement imo since "real world" is hardly defined but it's still what people experience and it's more useful to provide a more precise model where this works rather than a model where this doesn't work.
Also, there's good theory on deep networks as universal approximators, as well as theories of wide/shallow networks [1].
https://arxiv.org/abs/1703.00810
This goes beyond mere intuition, but it is also still very far from a “complete theory”.
I find it disappointing that so few people in deep learning work on the theoretical foundations.
(and that language is co-expressive with human languages)
what does this mean?
As it relates to this: https://en.wikipedia.org/wiki/Neural_tangent_kernel
To me, this is JFM. Not sure if I'm connecting the dots right either. I just don't know of anything else claiming to solve the curse.
Fast.ai is too high level. I don't like it. You would be better served taking actual university courses. A few days ago people linked to LeCun's university class[1]. This is a solid introduction. Does not cover everything but that is OK. Seems like it is missing Bayesian approaches. Then if you want to specialize in vision or speech or robotics or whatever, you take special classes on that topic and learn all the SOTA techniques. Then you are ready to do research already, or apply your knowledge to build stuff. Of course you still have to learn how to do real machine learning, which involves all the data manipulation stuff, but that is learned by doing.
The youtube playlist is here: https://www.youtube.com/playlist?list=PL_iWQOsE6TfVmKkQHucjP...
Prof. Sergey Levine is REALLY good at explaining the intuitions of DL algorithms. This class also includes lectures on ML basics and very approachable assignments.
Many classes/blog posts start with describing what a neuron is - that IMHO is a super terrible way to teach a beginner.
To understand DL, one should know why we need activations (because linear models are not enough) and why we need back-propagation (because we are optimizing a loss using SGD). This class is really good at explaining those things in an intuitive way. Following along, I felt I built a pretty solid ML/DL foundation for myself.
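To make the "linear models are not enough" point concrete, here is a small sketch using XOR as the standard toy example (my choice, not something from the class). Any linear model satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), while XOR needs 0 on the left and 2 on the right. A hand-built two-unit ReLU network gets it right; the weights are picked by hand, not learned.

```python
import random

def relu(z):
    return max(0.0, z)

def linear(x1, x2, w1, w2, b):
    """A purely linear model: no activation anywhere."""
    return w1 * x1 + w2 * x2 + b

def xor_net(x1, x2):
    """Two hidden ReLU units with hand-picked weights."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1 - 2.0 * h2

# For any weights, a linear model obeys this identity, so it cannot
# output 0, 1, 1, 0 on the four XOR inputs.
random.seed(0)
w1, w2, b = (random.uniform(-5, 5) for _ in range(3))
lhs = linear(0, 0, w1, w2, b) + linear(1, 1, w1, w2, b)
rhs = linear(0, 1, w1, w2, b) + linear(1, 0, w1, w2, b)
print("linear identity holds:", abs(lhs - rhs) < 1e-9)

xor_outputs = [xor_net(p, q) for p, q in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print("ReLU network on XOR inputs:", xor_outputs)
```

Stacking more linear layers does not help, since a composition of linear maps is still linear; the activation is what breaks the identity above.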
I don't think fast.ai is enough if you want to do theoretical research in deep learning, but it certainly provides enough to work on practical problems with deep learning. That said, many of us in the fastai community are able to delve deep into, understand, and implement recent deep learning papers and even develop novel techniques. So I think with a little extra studying, one could easily transition to core deep learning research.
The PyTorch codebase for, say, a transformer (a deep learning architecture which makes use of "attention") is something I still haven't grokked. I have however been able to pitch in with bug fixes as I continue learning and getting to that point.
This is how I would hope an entry-level position would be at a job. At some point companies have to realize education is just a part of it and that it takes time; particularly when things change this fast. I have no real-world clue though unfortunately.
Anyway, working on machine learning with vision is the first time I've actually felt like my work was exciting. The "result" you get is so much fun and working together with people given the proper culture is presumably a fantastic experience. I just (personally) can't get excited about using my code to write CRUD/frontend anymore. Not to imply those are the only two options; but that's been the case for me until recently.
I don’t do this work myself, but we’ve hired many interns from bootcamps to do ML, and ones from college with ML projects. The bootcamp grads with no additional background have almost universally hit hard walls once anything gets more complex than using Keras to glue together layers. It’s given me the impression, anecdotally, that bootcamps are largely predatory: they take one’s money and provide only a veneer of knowledge in the area. This doesn’t seem to apply to people with a CS or math background who took an ML bootcamp to add that dimension to their already-mathematical skillset. But people who took a bootcamp to reskill from a totally unrelated and perhaps qualitative field (again, only anecdotally, with an n of perhaps only 20) have not had success with a bootcamp alone. They have had success doing what the above poster recommended: taking university courses in the area.
Very respectfully, if you’re in a boot camp right now, you’re unlikely to be deep enough into the day-to-day work of ML to make the assertion you’re making.
Unfortunately, not all data is available or provided in a data "friendly" format - sometimes all you get are image files, and similar. Maybe you want to read some value off these images, count objects, or whatever - which traditionally has been done by trained/skilled workers.
With CNNs, it _can_ be a trivial task to implement models for solving the above problems. That's time and money saved for a business.
That being said, I'm also thinking about starting an ML PhD because it does honestly open more doors to top research groups.
Correction: not ML PhD by itself - publications in top conferences open doors. Looking at the acceptance rates, I'm guessing most people with ML PhDs don't have such publications.
(1) Linear Algebra
(2) Optimization Theory (Convex Analysis, non-convex optimization) [0], [2]
(3) Probability Theory and Statistics (Measure Theory, Multivariate Statistics) [1], [3], [4], [5]
(4) Analysis, to a lesser extent. (2) and (3) are the most important.
I would give more references, but my background is too theoretical (and my field is Numerical Analysis of PDE). From the classes I took in college, three or four on each of (1-4), a person with a similar background can recognize the tools without much digging. Maybe some folks here can provide some insights into books that center on applications. So I'm trying not to diverge into too much theory (i.e. for measures, [4] instead of Folland). There also seems to be good use of Analysis techniques in the paper, see theorem 2.1.
I love that the paper references the Moore-Penrose pseudo-inverse, an object of study in both statistics and optimization for which I had to give a lecture for a course.
[0] https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf Convex Optimization, Boyd and Vandenberghe
[1] An Introduction to Multivariate Statistical Analysis, Anderson
[2] Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Bauschke-Combettes
[3] Theory of Multivariate Statistics, Bilodeau-Brenner
[4] The Elements of Integration and Lebesgue Measure, Bartle
[5] Probability: Theory and Examples, Durrett
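For anyone who hasn't met the Moore-Penrose pseudo-inverse mentioned above, a small sketch on a made-up 2x3 system. It assumes A has full row rank, in which case the pseudo-inverse is A^T (A A^T)^{-1} and A^+ b is the minimum-norm solution of A x = b. Exact fractions are used so the result can be checked by hand.

```python
from fractions import Fraction

# Underdetermined system: 2 equations, 3 unknowns (illustrative values).
A = [[Fraction(1), Fraction(0), Fraction(1)],
     [Fraction(0), Fraction(1), Fraction(1)]]
b = [Fraction(2), Fraction(3)]

# G = A A^T is 2x2 and invertible because A has full row rank.
G = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(2)]
     for i in range(2)]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
G_inv = [[G[1][1] / det, -G[0][1] / det],
         [-G[1][0] / det, G[0][0] / det]]

# Pseudo-inverse for the full-row-rank case: A^+ = A^T (A A^T)^{-1}, a 3x2 matrix.
A_pinv = [[sum(A[i][r] * G_inv[i][c] for i in range(2)) for c in range(2)]
          for r in range(3)]

# x = A^+ b solves A x = b and has the smallest Euclidean norm of all solutions.
x = [sum(A_pinv[r][c] * b[c] for c in range(2)) for r in range(3)]
Ax = [sum(A[i][k] * x[k] for k in range(3)) for i in range(2)]

# Every other solution is x + t * (1, 1, -1); x is orthogonal to that direction.
null_dir = [Fraction(1), Fraction(1), Fraction(-1)]
dot = sum(xi * ni for xi, ni in zip(x, null_dir))

print("x =", x, " A x =", Ax, " x . null_dir =", dot)
```

The orthogonality to the null space is what makes this particular solution the minimum-norm one.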
Some helpful resources are linked here: https://www.reddit.com/r/MachineLearning/comments/najnjg/r_t...
These include linear algebra, graph theory, probability, algorithms, mathematical analysis, topology, differential geometry. But the most important prereqs are math maturity and mental toughness/endurance.
I stopped studying maths well before university. I am not some kind of math super genius. But working on my own stuff, which did involve new problems, I was up the creek fairly quickly without a solid mathematical understanding of the techniques I was trying to use.
I don't think the bar is particularly high here. Solid understanding of stats, ESL...but I have seen people shotgunning models (I did this years ago too), and that isn't going to work very long.
Also, I don't really understand why you wouldn't study some of this stuff. Maths as taught in schools treats you like a meat calculator...that isn't fun. But if you are interested in ML, going through Stats, Linear Algebra...it is pretty interesting because there are so many clear connections with your work.
I've seen differential equations, Markov chains, differential geometry and other stuff. We might be in heady days before the "big breakthrough" is made. But these constructs might be inherently pathological (even then, non-pathological variants might be possible).
The question is whether it’s useful.
Is there a group doing this in Zurich?
I've always interpreted that as "we've found an algorithm that could, given a foreseeable amount of computing power and maybe some tweaks, simulate human decision making".
It isn't so much that neural networks can approximate the real world as they can approximate human perception of the real world.
Neural networks are "universal approximators" in that they work as well as virtually any previous approximation method. So given a big snapshot of input data and human judgement on it, they can approximate that. They can also approximate a snapshot of some input-output pairs not produced by humans but having patterns (solutions to differential equations, for example).
So, they can approximate what humans do in a given domain. But there's no reason to think they're acting in the same way as humans and I'd say very few people seriously working on ML believe that.
Neural nets don't learn anything like us, and they don't reproduce our functions. We build on massive amounts of general symbolic knowledge, and can easily do tasks zero-shot (without explicit examples).
Neural networks really should be seen as just giant random functions that you progressively modify in tiny ways until they fit your data. As parent says, we've just been lucky or good at constraining these functions in a way that they can only learn useful functions (i.e. convnets) or that they somehow learn these more quickly.
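A toy sketch of "random function, modified in tiny ways until it fits": a one-hidden-layer tanh network with random initial weights, trained by plain full-batch gradient descent with hand-written gradients. The target (y = x^2 on five points), the hidden size, the learning rate and the step count are all arbitrary choices for illustration.

```python
import math
import random

random.seed(0)

H = 8  # hidden units (illustrative)
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x * x for x in xs]
n = len(xs)

# Start from a random function: all weights drawn from a standard normal.
w = [random.gauss(0, 1) for _ in range(H)]  # input -> hidden weights
b = [random.gauss(0, 1) for _ in range(H)]  # hidden biases
v = [random.gauss(0, 1) for _ in range(H)]  # hidden -> output weights
c = 0.0                                     # output bias

def predict(x):
    return sum(v[j] * math.tanh(w[j] * x + b[j]) for j in range(H)) + c

def loss():
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / n

initial_loss = loss()

lr = 0.005  # each step changes the function only slightly
for _ in range(4000):
    gw, gb, gv, gc = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, y in zip(xs, ys):
        h = [math.tanh(w[j] * x + b[j]) for j in range(H)]
        err = sum(v[j] * h[j] for j in range(H)) + c - y
        d = 2.0 * err / n                      # d(loss)/d(prediction)
        for j in range(H):
            gv[j] += d * h[j]
            dz = d * v[j] * (1.0 - h[j] ** 2)  # back through tanh
            gw[j] += dz * x
            gb[j] += dz
        gc += d
    for j in range(H):
        w[j] -= lr * gw[j]
        b[j] -= lr * gb[j]
        v[j] -= lr * gv[j]
    c -= lr * gc

final_loss = loss()
print("loss before:", initial_loss, "loss after:", final_loss)
```

Nothing here knows anything about parabolas; the fit comes entirely from many small corrections to a random starting point.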
It is completely plausible that when neural nets get scaled up to something approaching human-brain numbers of connections they will well approximate a human brain or be a few tweaks away. Obviously it won't be knowable until state of the art gets there, but there is no reason to think human intelligence is going to be complicated. It is one evolutionary step up from some pretty basic animals.
That's why residual blocks are interesting. They pass that low-level information to later blocks (which have an easier time processing the granular details) while also leveraging the ability of earlier blocks to extract abstract information. It allows you to extract and combine information at multiple levels of granularity (or abstraction).
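A bare-bones sketch of the idea, assuming the simplest form of a residual block, output = x + f(x), with f a single dense layer followed by ReLU (real architectures use convolutions, normalization and so on). The function names are made up for illustration.

```python
def relu(z):
    return max(0.0, z)

def dense(x, W, b):
    """Plain fully connected layer: W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def residual_block(x, W, b):
    """Skip connection: the input is added back onto the transformed signal."""
    fx = [relu(z) for z in dense(x, W, b)]
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0, 3.0]

# With all-zero weights the block is exactly the identity, so the
# low-level input reaches the next block untouched.
zero_W = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
passthrough = residual_block(x, zero_W, zero_b)

# With non-zero weights the block adds a correction on top of x.
W = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
corrected = residual_block(x, W, zero_b)
print(passthrough, corrected)
```

The block only has to learn what to add to its input, and the original signal stays available further down the stack.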
Convnets are also invariant to certain transformations (e.g. translation, and to some degree scale), which I think is a better characterization than "can only learn something useful." They're forced to learn information that is more general, which increases the usefulness of each bit, which means you get a higher density of usefulness per FLOP. But you also lose specific information in that process. What if location is meaningful? For example, audio spectrogram analysis can suffer from that property, because specific location on the Y axis is highly meaningful.
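A small sketch of where the translation property comes from, using a 1-D circular convolution (wrap-around padding is assumed so that the identity is exact; with ordinary zero padding it only holds away from the borders). Shifting the input shifts the feature map by the same amount, and a global max-pool on top discards the position entirely.

```python
def circ_conv(x, w):
    """Circular cross-correlation: y[i] = sum_k w[k] * x[(i + k) % n]."""
    n = len(x)
    return [sum(w[k] * x[(i + k) % n] for k in range(len(w)))
            for i in range(n)]

def shift(x, s):
    """Circularly shift a signal s positions to the right."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]

x = [0, 0, 5, 1, 0, 0, 0, 0]  # a small "blob" in the signal
w = [1, -1]                   # a simple edge-detecting filter

feat = circ_conv(x, w)
feat_of_shifted = circ_conv(shift(x, 3), w)

# Equivariance: convolving the shifted input equals shifting the feature map.
equivariant = feat_of_shifted == shift(feat, 3)

# Invariance: global max-pooling gives the same value wherever the blob sits.
invariant = max(feat) == max(feat_of_shifted)
print(equivariant, invariant)
```

The pooled value is the same for both inputs, which is exactly the information loss that hurts when position carries meaning, as on the frequency axis of a spectrogram.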
Now, tempered distributions are functions that assign a complex number to a very rapidly decaying function (a Schwartz space function) and that satisfy linearity properties. So a tempered distribution is a function that takes functions and maps them to complex numbers. https://secure.math.ubc.ca/~feldman/m321/distributions.pdf
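Written out, as a sketch of the standard definition (the notation is mine and may differ from the linked notes): a tempered distribution is a continuous linear functional on the Schwartz space.

```latex
% A tempered distribution T maps Schwartz functions to complex numbers:
T : \mathcal{S}(\mathbb{R}^n) \to \mathbb{C}, \qquad \varphi \mapsto \langle T, \varphi \rangle

% Linearity, for all a, b \in \mathbb{C} and \varphi, \psi \in \mathcal{S}(\mathbb{R}^n):
\langle T, a\varphi + b\psi \rangle = a \langle T, \varphi \rangle + b \langle T, \psi \rangle

% Example 1: the Dirac delta evaluates its argument at the origin.
\langle \delta, \varphi \rangle = \varphi(0)

% Example 2: a locally integrable function f of at most polynomial growth
% acts by integration against the test function.
\langle T_f, \varphi \rangle = \int_{\mathbb{R}^n} f(x)\, \varphi(x)\, dx
```

The delta example shows why this is a genuine generalization: no ordinary function f makes the integral in the second example equal to evaluation at 0 for every test function.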
In general, the dual of a space of functions is a space of set functions, aka measures. https://keithalewis.github.io/math/dual.html
Some mathematical concepts are needed in order to present rigorous results. While one can argue about the necessity and relevance of these results for real-world applications, they at least explain various aspects of deep learning in restricted settings, leading to a better general understanding and intuition.
Then again math is hard for us. So I think there are nuances.
It is reasonable to believe that written language is easier to learn for a neural net that is trained on both images and words, so that it can form visual links between words. Maybe that takes more computational grunt than we have at the moment. The failure so far proves nothing.
You do realize we can train a neural network to perform this task? It is a binary classification problem. When I look at a grammatically incorrect sentence I don't do much symbolic reasoning - it just feels "wrong" to me. It does not match any patterns I have in my head for grammatically correct sentences. There's a lot of pattern matching in our thinking process.
What's missing in the current generation of neural networks is efficient information storage and ability to recall that information (e.g. lookup) or update it (direct write).
I'm doing a master's in deep learning for NLP and I'm not sure we can. Language modelling can't do this because grammatical yet semantically implausible combinations of words are assigned very low probability (i.e. very high perplexity), the classic example being Noam Chomsky's "Colorless green ideas sleep furiously".
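For reference, perplexity is the exponential of the average negative log-probability per token, so a sentence the model considers unlikely gets a high perplexity whether or not it is grammatical. A small sketch with made-up per-token probabilities (no real language model is involved):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model might assign to each token of
# a common sentence and of a grammatical but implausible one.
common_sentence = [0.2, 0.3, 0.25, 0.4]
implausible_sentence = [0.01, 0.002, 0.005, 0.001]

ppl_common = perplexity(common_sentence)
ppl_implausible = perplexity(implausible_sentence)
print(ppl_common, ppl_implausible)
```

A score like this measures how expected a sentence is, not whether it is well-formed, which is why thresholding it makes a poor grammaticality test.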
What would be a training set for this? I assume we would first try to do parsing to extract the grammatical role of each word. Then what would be the dataset? A massive attempt at generating the set of all possible trees that are grammatical?
I guess we could use massive textual datasets from reputable sources, extract their grammatical role trees, and learn from that. Generating negative examples with sufficient coverage would be very hard. Strict generative modelling without negative examples with good coverage would run into the same problem as language modelling, where acceptable but unlikely examples would be assigned low probability despite being good.
It would seem to me that in order to generate negative examples with good coverage, you would need to have a man-made program with a definition of what grammaticality means, which would make training a neural network pointless to begin with.
Seems like the experts agree with my take: https://linguistics.stackexchange.com/a/1108
That isn't a problem of scale, it's a problem of architecture. This is one of the reasons Deepmind decided to tackle Starcraft. It's very difficult to solve Starcraft without your AI having some ability to develop and then manipulate a mental model of the game, because that's what you need to construct and unfold original, non-linear strategies.
Unlike current DL models, humans have a world model (common sense) which is formed through an ability to create/update/lookup explicit rules/facts. Once we figure out how to incorporate that into a learning algorithm and/or a model architecture, AI will become a lot smarter.
I’d encourage you to read a little more about the topic with an open mind. You might learn something.