Understand how transformers work by demystifying the math behind them (osanseviero.github.io)
We aren't going to see more progress until we have a way to generalize the compute graph as a learnable parameter. I dunno if this is even possible in the traditional sense of gradients due to chaotic effects (i.e. small changes cause big shifts in performance); it may have to be some form of genetic algorithm or PSO that happens under the hood.
That's not it at all. What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them. The big theoretical advantage over RNNs (which were used for sequences prior to transformers), is that transformers support this in a lossless way, as each element has full access to all the information in every other element in the sequence (or at least all the ones that occurred before it in time sequences). RNNs and "linear transformers" on the other hand compress past values, so generally the last element of a long sequence will not have access to all the information in the first element of the sequence (unless the RNN internal state was really really big so it didn't need to discard any information).
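To make the "each element decides what matters to it from every other element" description concrete, here is a minimal sketch of single-head causal self-attention in NumPy. The sizes, the random weights, and the names `Wq`, `Wk`, `Wv` are assumptions for illustration only, not taken from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first so exp() cannot overflow.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    # Each row of X is one sequence element (token embedding).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # scores[i, j]: how relevant element j is to element i.
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: element i may only look at positions j <= i.
    n = X.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of the (uncompressed) value rows.
    return weights @ V, weights

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = causal_self_attention(X, Wq, Wk, Wv)
```

Note that nothing is summarized into a fixed-size state: the last row of `weights` still has a separate entry for the first element, which is the "lossless" access described above.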
They do that in theory. In practice, it's just all matrix multiplication. You could easily structure a transformer as a bunch of fully connected deep layers and it would be mathematically equivalent, just computationally inefficient.
That's a bold statement since a ton of progress has been made without learning the compute graph.
I suppose we'll see in the next year!
Information compression is cool, but I want actual AI.
I'm more concerned with an LLM having the ability to be trained to the point where a subset of the graph represents all the NAND gates necessary for a CPU and RAM, so when you ask it questions it can actually run code to compute them accurately instead of offering a statistical best guess, i.e. decompression after lossy compression.
Agreed. Seems analogous with how human mental processes are used to solve the kind of problems we'd like LLMs to solve (going beyond "language processing" which transformers do well, to actual reasoning which they can only mimic). Although you risk it becoming a Turing machine by giving it flow control & then training is a problem as you say. Perhaps not intractable though.
You can un-discretize the space of compute graphs by interpolating its points by simplices. More precisely, each graph is a subgraph of the complete graph, and the subgraph is identified by the indicator function of its edges, whose values are either 0 or 1. By using weighted edges with values between 0 and 1, the space of all graphs (with the same number of vertices) becomes continuous and connected, and you can move around it in small gradient steps.
Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.
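A toy sketch of the relaxation described above, for plain graphs only: edges of the complete graph on three vertices get weights in [0, 1], and gradient descent moves continuously from one subgraph to another. The squared-distance loss and the learning rate are assumptions chosen purely for illustration; a real objective would measure how well the graph computes something.

```python
# Edges of the complete graph on 3 vertices; a subgraph is one weight per edge.
EDGES = [(0, 1), (0, 2), (1, 2)]

graph_a = [1.0, 0.0, 0.0]  # indicator vector: only edge (0, 1) present
graph_b = [0.0, 1.0, 1.0]  # indicator vector: edges (0, 2) and (1, 2) present

def loss(w, target):
    # Hypothetical toy loss: squared distance to a target graph.
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w, target):
    return [2 * (wi - ti) for wi, ti in zip(w, target)]

def step(w, target, lr=0.1):
    # One gradient step, clipped so every edge weight stays in [0, 1].
    g = grad(w, target)
    return [min(1.0, max(0.0, wi - lr * gi)) for wi, gi in zip(w, g)]

w = list(graph_a)
history = [w]
for _ in range(50):
    w = step(w, graph_b)
    history.append(w)
```

Every intermediate point in `history` is a weighted graph rather than a 0/1 graph, which is what makes the small steps possible.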
I think that Hebbian learning is going to make a comeback at some point, and it will be used to connect static subgraphs to other subgraphs, which can be trained either separately or on the fly.
As far as I understand this is wrong. You're not computing gradients at any point, so this is not a gradient explosion. I believe the problem is with the implementation of softmax; here [0] you have an explanation of how to implement a numerically stable softmax.
That said, the whole neural network will be sensitive to large values, so it won't be fixed by a numerically stable softmax. The normalization is a key aspect for the network to work.
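For reference, the usual stabilization trick is to subtract the maximum before exponentiating, which leaves the result unchanged mathematically. A minimal pure-Python sketch follows; the input values are made up for illustration.

```python
import math

def naive_softmax(xs):
    # exp() of a large input overflows a float.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    # Shifting by the max does not change the result, because
    # exp(x - m) / sum(exp(x_j - m)) == exp(x) / sum(exp(x_j)).
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same differences, much larger values

try:
    naive_large = naive_softmax(large)
except OverflowError:
    naive_large = None  # math.exp(1000.0) is out of float range

stable_large = stable_softmax(large)
```

The stable version returns the same distribution for `small` and `large`, since softmax only depends on differences between inputs.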
A hard concept?
But a monad is just a monoid in the category of endofunctors, so what's the problem?
> Hello -> [1,2,3,4] World -> [2,3,4,5]
The vectors are random, but they look like they have a pattern here. Does the 2 in both vectors mean something? Or is it the entire set that makes it unique?
That the numbers are reused isn’t meaningful here: a 1 in the first position is quite unrelated to a 1 in the second (as no convolutions are done over this vector)
I have a feeling it should be a common question, but I just can't find the keyword to search.
PS. If anyone has any links to a thorough discussion of positional embedding, that would be great. I never got a satisfying answer about the usage of sine / cosine and (multiplication vs addition)
Yes, it seems like a transformer model simple enough for us to understand isn't able to do anything interesting, and a transformer complex enough to do something interesting is too complex for us to understand.
I would love to study something in the middle, a model that is both simple enough to understand and complex enough to do something interesting.
https://www.neelnanda.io/mechanistic-interpretability/gettin...
I asked ChatGPT to explain how to modify a basic ANN to implement self-attention without using the terms Matrix or Vector and it gave me a really simple explanation. Though I haven't tried to implement it yet.
I prefer to think of everything in terms of nodes, weights and layers. Matrices and vectors just make it harder to relate to what's happening in the ANN.
The way I'm used to writing ANNs, each input node is a scalar but the feed forward algorithm looks like vector-matrix multiplication since you multiply all the input nodes by the weights then sum them up... Anyway, I feel like I'm approaching these descriptions with the wrong mindset. Maybe I lack the necessary background.
So for “World”
PE(1, 0) = sin(1 / 10000^(2*0 / 4)) = sin(1 / 10000^0) = sin(1) ≈ 0.84
PE(1, 1) = cos(1 / 10000^(2*0 / 4)) = cos(1 / 10000^0) = cos(1) ≈ 0.54
PE(1, 2) = sin(1 / 10000^(2*1 / 4)) = sin(1 / 10000^.5) ≈ 0.01
PE(1, 3) = cos(1 / 10000^(2*1 / 4)) = cos(1 / 10000^.5) ≈ 1
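The four values above can be reproduced with a few lines of Python. This is a sketch of the sinusoidal formula as used in the worked example (0-based position and dimension indices, `d_model = 4`); the function name is made up for illustration.

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    # Even dimensions use sin, odd dimensions use cos; each sin/cos pair
    # shares the same frequency, set by the pair index i // 2.
    pe = []
    for i in range(d_model):
        angle = pos / base ** (2 * (i // 2) / d_model)
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

pe_world = positional_encoding(pos=1, d_model=4)
print([round(v, 2) for v in pe_world])  # [0.84, 0.54, 0.01, 1.0]
```

For position 0 the same function gives `[0.0, 1.0, 0.0, 1.0]`, since sin(0) = 0 and cos(0) = 1.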
I also wondered if these formulae were devised with 1-based indexing in mind (though I guess for larger dimensions it doesn't make much difference), as the paper states
> The wavelengths form a geometric progression from 2π to 10000 · 2π
That led me to this chain of PRs - https://github.com/tensorflow/tensor2tensor/pull/177 - turns out the original code was actually quite different to that stated in the paper. I guess slight variations in how you calculate this encoding don't affect things too much?
Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z)
in Decoder step 7 instead be Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z_self_attention)
? Also, is layer_norm missing in Decoder step 8...
From their abstract:
``One of the most exciting and promising novel architectures, the Transformer neural network, was developed without the brain in mind. In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since it is closely related to current hippocampal models from neuroscience. We additionally show the transformer version offers dramatic performance gains over the neuroscience version.``
https://medium.com/@Mosbeh_Barhoumi/forward-forward-algorith...
I love them because they do give another resource at explaining models such as transformers and I think this one is pretty well done (note: you really need to do something about the equation in 4.2...)
First, the critique is coming from love. Great work, so I don't want it to be taken as I'm saying anything it isn't.
Why I hate these is that they are labeled as "math behind" but I think this is not quite fitting. This is the opposite of the complaint I made about the Introduction to DL post the other day[0]. The issue isn't that there isn't math, but contextually it is being labeled as a mathematical approach but I'm not seeing anything that distinguishes it as deeper than what you'd get from Karpathy's videos or the Annotated Transformer (I like this more than illustrated). There's nothing wrong with that, but just think it might mislead people, especially as there is a serious lack of places to find a much deeper mathematical explanation behind architectures and the naming makes it harder to find for those that are looking for that, because they'll find these posts. Simply, the complaint is about framing.
To be clear, the complaint is just about the subtitle, because the article is good and a useful resource for people seeking to learn attention and transformers. But let me try to clarify what I personally think (you're welcome to disagree, it is an opinion) would more accurately represent "demystifying all the math behind them":
- I would include a much deeper discussion of both embedding and positional embedding. For the former you should at minimum be discussing how it is created and discussing the dequantization. This post may give a reader the impression that this is not taking place (the distinction between tokenization and embedding is ambiguous, and this looks to just briefly mention tokenization. I specifically think a novice might take away that the dequantization is happening due to the positional encoding, and not in the embedding). The tokenization and embedding are a vastly underappreciated and incredibly important aspect of making discrete models work (not just LLMs or LMs. The principle is more general).
- Same goes for the positional embedding, which I have only in a handful of cases seen discussed, and then it is taken rather matter-of-factly. For a mathematical explanation you do need to explain the idea behind generating unique signals for each position, explain why we need a high frequency, and it is worth mentioning how this can be learnable (often with similar results, which is why most don't bother), and other forms like rotational. The principle is far more general than even a Fourier series (unmentioned!). The continuous aspect also matters a lot here, and we (often) don't want discretized positional encoding. If this isn't explained it feels rather arbitrary, and in some ways it is but in others it isn't.
- The attention mechanism is vastly under-explained, though I understand why. There are many approaches to tackle this, some from graphs, some from category theory, and many others. They're all valuable pieces to the puzzle. But at minimum I think there needs to be a clear identification as to what the dot product is doing, the softmax, the scale (see softmax tempering), and why we then have the value. Their key-query-value names were not chosen at random and the database analogy is quite helpful. Maybe many don't understand the relationship of dot products and angles between vectors? But this can even get complex as we would expect values to go to 0 in high dimensions (which they kinda do if you look at the attention matrices post learning which often look highly diagonal and why you can initialize them as diagonally spiked for sometimes faster training). This would be a great place to bring up how there might be some surprising aspects to the attention mechanism considering matrices represent affine transformations of data (linear) and we might not see the non-linearity here (softmax) or understand why softmax works better than other non-linears or normalizers (try it yourself!).
- There's more but I've written a wall. So I'll just say we can continue for the residuals (also see Meta's 3 Things Everyone Should Know About Vision Transformers, in the DeiT repo), why we have pre-norm as opposed to the original post-norm (it looks like post-norm is being used!), the residuals (knot theory can help here a bit), and why we have the linear layer (similarly the unknotting discussion helps, especially quantifying why we like a 4x ratio, but isn't absolutely necessary).
Idk, are people interested in these things? I know most people aren't, and there's absolutely nothing wrong with that (you can still build strong models without this knowledge, but it is definitely helpful). I do feel that we often call these things black boxes but they aren't completely opaque. They sure aren't transparent, especially through scale, but they aren't "black" either. (Allen-Zhu & Li's Physics of LLMs is a great resource btw and I'd love if other users posted/referenced more things they liked. I purposefully didn't link btw)
So, I do like the post, and I think it has good value (and certainly there is always value in teaching to learn!), but I disagree with the HN title and post's subtitle.
Although this is HN but my background is still stronger.
And by the way, is it worth it to invest time to get some idea about this whole AI field? I'm from a compE background
Might be worth thinking about how it will specifically affect your field of expertise. Jensen Huang says your job won't be taken over by an AI but by a human using an AI.
Not today.
Off to a search...
It still baffles me why such a stochastic parrot / next-token predictor will recognize these "unseen combinations of tokens" and reuse them in its response.
That is to say: Having a correct conditional probability distribution over the next token conditional on the previous tokens, produces a correct probability distribution over sequences of tokens.
And, “correct probability distribution over sequences of tokens” (or, “correct conditional probability distribution over sequences of tokens, conditional on whatever”), can be... well, you can describe pretty much any kind of input/output behavior in those terms.
So, “it works by predicting the next token” is, at least in principle, not much of a constraint on what kinds of input/output behavior it can have?
So, whatever impressive thing it does, is not really in conflict with its output being produced from the probability distribution P(X_{n+1}=x_{n+1} | X_1=x_1, ..., X_n=x_n) (“predicting the next token”)
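The point about conditionals can be checked on a toy example: multiplying next-token conditionals gives a probability for each whole sequence, and those probabilities sum to 1 over all sequences of a given length. The two-token vocabulary and the numbers in `next_token_probs` are invented for illustration, and this toy model only looks at the last token, which a real model need not do.

```python
import itertools

VOCAB = ["a", "b"]

def next_token_probs(prefix):
    # Hypothetical toy "model": P(next token | prefix).
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    if prefix[-1] == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.3, "b": 0.7}

def sequence_prob(tokens):
    # Chain rule: P(x_1..x_n) = product over t of P(x_t | x_1..x_{t-1}).
    p = 1.0
    for t, tok in enumerate(tokens):
        p *= next_token_probs(tokens[:t])[tok]
    return p

all_seqs = list(itertools.product(VOCAB, repeat=3))
total = sum(sequence_prob(s) for s in all_seqs)  # sums to 1 up to float rounding
```

Any distribution over sequences can be written this way, which is why "it only predicts the next token" does not by itself limit what input/output behavior is representable.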
Next token prediction is more intelligent than it sounds
Let's say you want to predict if you'll pass an exam based on how many hours you studied (x1) and how many exercises you did (x2). A neuron will learn a weight for each variable (w1 and w2). If the model learns w1=0.5 and w2=1, the model will provide more importance to the # of exercises.
So if you study for 10 hours and only do 2 exercises, the model will do x1*w1 + x2*w2 = 10*0.5 + 2*1 = 7. The neuron then outputs that. This is a bit (but not much) simplified - we also have a bias term and an activation to process the output.
Congrats! We built our first neuron together! Have thousands of these neurons in connected layers, and you suddenly have a deep neural network. Have billions or trillions of them, you have an LLM :)
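The exam example above, written out as code. The weights come from the comment; the bias value of -6 and the choice of a sigmoid activation are assumptions added only to show where those two pieces would go.

```python
import math

def neuron(inputs, weights, bias=0.0, activation=None):
    # Weighted sum of the inputs, plus a bias, optionally passed
    # through an activation function.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z) if activation else z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

hours, exercises = 10, 2
weights = [0.5, 1.0]  # w1 (hours studied), w2 (exercises done)

raw = neuron([hours, exercises], weights)  # 10*0.5 + 2*1 = 7.0
# Hypothetical bias and activation, turning the raw score into a value in (0, 1).
prob = neuron([hours, exercises], weights, bias=-6.0, activation=sigmoid)
```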
See the “definition” section in https://en.wikipedia.org/wiki/Perceptron .
It’s mainly fancy math. With tools like PyTorch or TensorFlow, you use Python to describe a graph of computations which gets compiled down into optimized instructions.
There are some examples of people making transformers and other NN architectures in about 100 lines of code. I’d google for those to see what these things look like in code.
The training loop, data, and resulting weights are where the magic is.
The code is disappointingly simple.
> The code is disappointingly simple.
I absolutely adore this sentence, it made me laugh to imagine coders or other folks looking at the code and thinking "That's it?!? But that's simple!" Although it feels a little similar to some of the basic reactions that go to make up DNA: start with simple units that work together to form something much more complex.
(apologies for poor metaphors, I'm still trying to grasp some of the concepts involved with this)
Someone please correct me if I'm wrong or my terminology is wrong.
Transformers do have coefficients that are fit, but that's more broad.. could be used for any sort of regression or optimization, and not necessarily indicative of biological analogs.
So I think the terms "learned model" or "weights" are malapropisms for Transformers, carried over from deep nets because of structural similarities, like many layers, and the development workflow.
The functional units in Transformer's layers have lost their original biological inspiration and functional analog. The core function in Transformers is more like autoencoding/decoding (concepts from info theory) and model/grammar-free translation, with a unique attention based optimization. Transformers were developed for translation. The magic is smth like "attending" to important parts of the translation inputs & outputs as tokens are generated, maybe as a kind of deviation on pure autoencoding, due to the bias from the .. learned model :) See I can't even escape it.
Attention as a powerful systemic optimization is the actual closer bit of neuro/bio-inspiration here.. but more from Cog Psych than micro/neuro anatomy.
Btw, not only is attention a key insight for Transformers, but it's an interesting biographical note that the lead inventor of it, Jakob Uszkoreit, went on to work on a bio-AI startup after Google.
I really appreciate you taking the time to provide all this feedback. This feedback + additional resources are extremely useful.
I agree that the subtitle is not as accurate as it could be. I'll revisit it! As for content updates, I've been doing some additional updates in the last days based on feedback (e.g. more info about tokenization and the token embeddings). Although diving in some of your suggestions is likely out of scope for this article, I in particular agree that expanding the attention mechanism content (e.g. the analogy with databases or explaining what is dot product) would increase the quality of the article. I will look into expanding this!
I also think a more rigorous, separate mathematical exploration into attention mechanisms and recent advancements would be a great tool for the ecosystem.
Once again, thank you for all the amazing feedback!
And I just realized we're in a slack channel together haha (I don't think we've ever talked though). I poked around your website and saw you're at HF. Love you guys to death. You all also have tons of awesome blog posts and you're one of the most useful forces in ML. So I really do appreciate all the work.
It mentions it comes from the original Attention Is All You Need paper and goes on into more detail.
It seems to be named exactly as you would expect. Key/Value as in KV store, with Query being the term being retrieved.
Though to be fair, actual biological evolution is more complex than simple genetic algorithms. More like evolution strategies with meta-parameter-learning and adaptive rate tuning among other things.
A good place to start on that topic is the word2vec paper.
I was sure I missed something, so I didn’t even try to implement it since I was so sure I missed the complicated bit.
But no, all the complexity is in the mathematical implications
Also, I found these 2 links pretty good too. 1. http://ai.stanford.edu/blog/understanding-incontext/ 2. http://ai.stanford.edu/blog/in-context-learning/
I'm still not completely convinced. Probably need to dwell on the topic longer.
We eventually moved on to lighter than air flight, which once again did not teach us any of those things and also was a dead end from the "get to the sky/moon" perspective, so then we invented heavier than air flight, which once again could not teach us about orbits, rockets, distances, or the vacuum of space.
What got us to the moon was rigorous analysis of reality with math to discover Newton's laws of motion, from which you can derive rockets, orbits, the insane scale of space, etc. No amount of further progress in planes, airships, kites, birds, anything on earth would ever have taught us the techniques to get to the moon. We had to analyze the form and nature of reality itself and derive an internally consistent model of that physical reality in order to understand anything about doing space.
Considering the chasm in the number of neurons between the apes and most other animals, I think one could claim that climbing those trees had some contribution to the ability to understand those things. ;) Navigating trees, at weight and speed, has a minimum intelligence requirement.
That aside, it seems like AI has had the most empirical success by not imposing hard constraints/structure, but letting models learn completely "organically". The computationalists (the folks who have historically been more into this "AI has to have things like logical consistency embedded into its structure" kind of thinking) seem to have basically lost, empirically. Who even knows what Soar[1] is nowadays? Maybe some marriage of the two paradigms will lead to better results, but I doubt that things will head in that direction anytime soon given how massively far just having parallelizable architectures and adding more parameters has gotten us.
[1] https://en.wikipedia.org/wiki/Soar_(cognitive_architecture)
As you were querying specs for a board at component level it could give you a schematic, I think, with citations to the actual data sheets.
I suppose the same scale up could be used for systems that needed a varying number of specific power supplies.
So being able to load a boatload of official datasheets and have them referenced in the design was what caught my eye as being useful.
Philosophically, you can't start ad hoc-ing functionalities on top of LLMs and expect major progress. Sure, you can make them better, but you will never get to the state where AI is massively useful.
For example, let's say you gather a whole bunch of experts in respective fields, and you give them a task to put together a detailed plan on how to build a flying car. You will have people doing design, doing simulations, researching material sourcing, creating CNC programs for manufacturing parts, sourcing tools and equipment, writing software, etc. And when executing this plan, they would be open to feedback for anything missed, and can advise on how to proceed.
The AI with the above capability should be able to go out on the internet, gather the respective data, run any sort of algorithms it needs to run, and perhaps after a month of number crunching on a cloud rented TPU rack produce a step by step plan with costs on how to do all of that. And it would be better than those experts because it should be able to create much higher fidelity simulations to account for things like vibration and predict if some connector is going to wobble loose.
Evolution created various neural structures in biological brains (visual cortex, medulla, thalamus, etc) rather ad-hoc, and those resulted in "massively useful" systems. Why should AI be different?
Consider how humans design things. We don't talk through every CPU cycle to convince ourselves a design works, we use bespoke tooling. Not all problems are language shaped.
Tool usage is better, because the LLM can access the relevant computing/simulation at the highest fidelity and as fast as they can run on a real or virtual computer, rather than emulated poorly in a giant pyramid of matrix multiplications.
Am I missing the point?
This is why I am very interested in analog again—quantum stuff is statistical already, so why go from statistical (analog) to digital (huge drop-off in performance, e.g. just look at basic addition in an ALU) and back to statistical. Very interested. Not sure if it will ever be worth it, but can’t rule it out.
Your whole brain might just be doing "information compression" by that analogy. An LLM is sort of learning concepts. Even Word2Vec "learned" that king - male + female = queen and that's a small model that's really just one part (not exact, but similar) of a transformer.
One level deep information compression is cool, but I want actual AI.
It's true that our brains compress information, but we compress it in a much more complex manner, in the sense that we can not only recall stuff, but also execute a decision tree that often involves physical actions to find the answer we are looking for.
The minute you take a token and turn it into an embedding, then start changing the numbers in that embedding based on other embeddings and learned weights, you are playing around with concepts.
As for executing a decision tree, ReAct or Tree of Thought or Graph of Thought is doing that. It might not be doing it as well as a human does, on certain tasks, but it's pretty darn amazing.
Is Ibn Sina (Avicenna, year ~1000) fine?
> [the higher faculty proper of humans is] the primary function of a natural body possessing organs in so far as it commits acts of rational choice and deduction through opinion; and in so far as it perceives universal matters
Or, "Intelligence is the ability to reason, determining concepts".
(And a proper artificial such thing is something that does it well.)
It is a tool that can be given a project in language X and produce an idiomatic port in language Y.
It is a tool that, given a 20-page paper spec, will ask the questions needed to clarify the spec.
Like, it really just seems like LLMs are a really good way of doing statistics rather than the closest model we have of the brain/mind, even if there are some connections we can draw post-hoc between transformers and the human brain.
Hebbian learning has never been used with much success in training neural nets. Backpropagation is not bio-inspired, but backpropagation is certainly used to train transformers.
For Backprop, I'm basing this off the development of the Perceptron. Wiki supports this and its bio-inspired origin[1].
As for its use in Transformers, if you mean simple regressing of errors or use of gradient descent, I'd agree, but that's not usually called Backprop and the term isn't used in the original paper. The term typically means back propagating the errors thru the entire network at a certain stage of learning, and that's not present in Transformers that I can tell.
Happy to see any support for your claims tho.
I don't see any information in your linked Wikipedia article that supports a bio-inspired origin. In fact, researchers have been wondering whether an equivalent to Backprop might be found in biological brains, but Backprop is widely believed to be biologically implausible (see e.g. https://arxiv.org/pdf/1502.04156.pdf, https://www.sciencedirect.com/science/article/pii/S089360801...).
It's not surprising that the term Backprop is not mentioned in the original paper, it isn't mentioned in most neural network research, because it's simply the default method to optimize weights and additionally it's hidden away by modern autodiff frameworks, so no one actually has to give it any thought. But backprop is definitely used in transformers (see e.g. https://aclanthology.org/2020.emnlp-main.463.pdf, https://arxiv.org/pdf/2004.08249, https://proceedings.mlr.press/v202/phang23a/phang23a.pdf, https://dinkofranceschi.com/docs/bft.pdf)
"For example, when doing the backpropagation (the technique through which the models learn), the gradients can become too large"
But I think this is more of a borrowing and it's not used again in description and may just be a misconception. There's no use of the Backprop term in the original paper nor any stage of learning where output errors are run thru the whole network in a deep regression.
What I do see in Transformers is localized uses of gradient descent, and Backprop in NNs also uses GD...but that seems the extent of it.
Is there a deep regression? Maybe I'm missing it
https://courses.grainger.illinois.edu/ece448/sp2023/slides/l...
From another source:
"Backpropagation Through Time (BPTT) is an adaptation of backpropagation used for training recurrent neural networks (RNNs), which are designed to process sequences of data and have internal memory. Because the output at a given time step might depend on inputs from previous time steps, the forward pass involves unfolding the RNN through time, which essentially converts it into a deep feedforward neural network with shared weights across the time steps. The error for each time step is computed, and then BPTT is used to calculate the gradients across the entire unfolded sequence, propagating the error not just backward through the layers but also backward through the time steps. Updates are then made to the network weights in a way that should minimize errors for all time steps. This is computationally more involved than standard backpropagation and has its own challenges such as exploding or vanishing gradients"
The bio-inspiration was via Frank Rosenblatt, who is referred to in that article tho yeah, the history is over in his article:
https://en.wikipedia.org/wiki/Frank_Rosenblatt#Perceptron
"Rosenblatt was best known for the Perceptron, an electronic device which was constructed in accordance with biological principles and showed an ability to learn.
He developed and extended this approach in numerous papers and a book called Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, published by Spartan Books in 1962.[6] He received international recognition for the Perceptron.
The Mark I Perceptron, which is generally recognized as a forerunner to artificial intelligence, currently resides in the Smithsonian Institution in Washington D.C."
Your Juergen page is interesting, tho no direct comment on Rosenblatt there. He does cite the work on this page:
https://people.idsia.ch/~juergen/deep-learning-overview.html (refs R58, R61)
My reading is that a long-known idea, about multi-variate regression, was reinterpreted by Rosenblatt by 1958 via the bio-inspired Perceptron, and then that was criticized by Minsky and others and viable methods were achieved by 1965. When I was taught NNs by Mitchell at CMU in the 1990s (lectures similar to his book Machine Learning), this was the same basic story. Also reminds me of a moment in class one day when a Stats Prof who was surveying the course broke out with "but wait, isn't this all just multivariate regression??" :) Mitchell agreed to the functional similarity, but I think that helps highlight how the biomimicry was crucial to developing the idea. It had lain hidden in plain sight for a century.
Agreed, and I was aware, there has since been criticism of the biological plausibility of backprop.
Your further links with refs to backprop in transformers are interesting; I hadn't seen these. It's clear the term is being used like you say, tho I still see ambiguity in its utility here. Autodifferentiation, gradient descent, multi-variate regression etc. are ofc in common use and scanning these papers it's not clear to me the terms aren't simply being conflated. What had stood unique for me with backprop was a coherent whole-network regression. This to me looks like a piecewise approach.
But anyways, I see your point. Thanks!
Link to PDF and some screens from intro here..
Sort of. You can get LLMs to produce some new things, but these are statistical averages of existing information. It's kinda like a static "knowledge tree", where it can do some interpolation, but even then, it's interpolation based on statistically occurring text.
(if not obvious.. you'd shove it in right after the embedding layer...)