Do Machine Learning Models Memorize or Generalize?

Do Machine Learning Models Memorize or Generalize? (pair.withgoogle.com)

454 points by 1wheel 2 years ago | 210 comments

Sometimes I think the reason human memory is, in some sense, so amazing is that what we lack in the storage capacity machines have, we make up for in our ability to create patterns that dramatically compress the amount of information stored. Then it is like we compress those patterns together with other patterns and are still able to extract things from them. It is an incredibly lossy compression, but it gets the job done.

ComputerGuru 2 years ago | |

That’s not exactly true, there doesn’t seem to be an upper bound (that we can reach) on storage capacity in the brain [0]. Instead, the brain actually works to actively distill knowledge that doesn’t need to be memorized verbatim into its essential components in order to achieve exactly this “generalized intuition and understanding” to avoid overfitting.

[0]: https://www.scientificamerican.com/article/new-estimate-boos...

halflings 2 years ago | | |

> That’s not exactly true [...] Instead, the brain actually works to actively distill knowledge that doesn’t need to be memorized verbatim into its essential components

...but that's exactly what OP said, no?

I remember attending an ML presentation where the speaker shared a quote I can't find anymore (speaking of memory and generalization :)), which said something like: "To learn is to forget"

If we memorized everything perfectly, we would not learn anything: instead of remembering the concept of a "chair", you would remember thousands of separate instances of things you've seen that have a certain combination of colors and shapes etc

It's the fact that we forget certain details (small differences between all these chairs) that makes us learn what a "chair" is.

Likewise, if you remembered every single word in a book, you would not understand its meaning; understanding its meaning = being able to "summarize" (compress) this long list of words into something more essential: storyline, characters, feelings, etc.

jjk166 2 years ago | | |

Distilling knowledge is data compression.

nonameiguess 2 years ago | | |

I've thought about this a lot in the context of the desire people seem to have to try and achieve human immortality or at least indefinite lifespans. If SciAm is correct here and the upper bound is a quadrillion bytes, we may not be able to hit that given the bound on possible human experiences, but someone who lived long enough would eventually hit that. After a hundred million years or whatever the real number is of life, you'd either lose the ability to form new memories or you'd have to overwrite old ones to do so.

Aside from having to eventually experience the death of all stars and light and the decay of most of the universe's baryonic matter and then face an eternity of darkness with nothing to touch, it's yet another reason I don't think immortality (as opposed to just a very long lifespan) is actually desirable.

oneTbrain23 2 years ago | | |

You're obviously hand-waving away Alzheimer's and dementia. Humans don't know exactly how brains work. The computational storage figure is just an estimate based on how we understand a von Neumann computer storing data as 1s and 0s. In every psychological test conducted on the human mind, it clearly has a limit.

downboots 2 years ago | | |

Can "distill knowledge" be made precise?

gattilorenz 2 years ago | | |

Is there a “realistic upper bound” on things that should be memorized verbatim? Ancient Greeks probably memorized the Iliad and other poems (rhyming and metre might work as a substitute for data compression, in this case), and many medieval preachers apparently memorized the whole Bible…

firecall 2 years ago | | |

Does the brain require more energy to store more information?

Or is it always running at the same pace regardless of if it’s empty or not?

I guess the brain doesn’t really work like that… But I’m curious :-)

TheRealSteel 2 years ago | | |

You seem to have just re-stated what the other person said.

bufferoverflow 2 years ago | |

There are rare people who remember everything

https://youtu.be/hpTCZ-hO6iI

svachalek 2 years ago | | |

It's pretty fascinating to me how "normal" Marilu Henner seems to be. I'm getting older and my memory is not what it was, but when I was younger it was pretty extraordinary. I did really well in school and college but over time I've realized it was mostly due to being able to remember most things pretty effortlessly, over being truly "smart" in a classic sense.

But having so much of the past being so accessible is tough. There are lots of memories I'd rather not have, that are vivid and easily called up. And still, I think it's only a fraction of what her memory seems to be like.

hgsgm 2 years ago | | |

Is there scientific evidence of that or just claims?

tbalsam 2 years ago | |

For more information and the related math behind associative memories, please see Hopfield Neural Networks.

While the upper bound is technically "infinity", there is a tradeoff between the amount of concepts stored and the fundamental amount of information storable per concept, similar to how other tradeoff principles like the uncertainty principle, etc work.
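To make the Hopfield reference concrete, here is a minimal sketch of an associative memory, assuming the classic binary (+1/-1) formulation with the Hebbian outer-product rule. The network size, pattern count, and noise level are illustrative choices, not values from the comment. It stores a few random patterns and then tries to recover one of them from a corrupted copy.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian rule: average of outer products, with no self-connections."""
    n_units = patterns.shape[1]
    weights = patterns.T @ patterns / n_units
    np.fill_diagonal(weights, 0.0)
    return weights

def recall(weights, state, sweeps=5):
    """Asynchronous updates: each unit takes the sign of its weighted input."""
    state = state.copy()
    for _ in range(sweeps):
        for i in range(len(state)):
            state[i] = 1 if weights[i] @ state >= 0 else -1
    return state

rng = np.random.default_rng(0)
N_UNITS, N_PATTERNS, N_FLIPPED = 100, 3, 10  # illustrative sizes (assumptions)

patterns = rng.choice([-1, 1], size=(N_PATTERNS, N_UNITS))
weights = train_hopfield(patterns)

# Corrupt the first stored pattern by flipping a few units.
noisy = patterns[0].copy()
flipped = rng.choice(N_UNITS, size=N_FLIPPED, replace=False)
noisy[flipped] *= -1

recovered = recall(weights, noisy)
overlap = float(recovered @ patterns[0]) / N_UNITS  # 1.0 means exact recall
print(f"overlap with the stored pattern: {overlap:.2f}")
```

Raising `N_PATTERNS` while keeping `N_UNITS` fixed is one way to watch the tradeoff: for this particular learning rule, recall is known to degrade once the number of stored patterns approaches roughly 0.14 times the number of units.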

scrps 2 years ago | | |

Thank you

mr_toad 2 years ago | |

Artificial neural networks work a lot like compression algorithms in their ability to predict the future. The trained network is a compression algorithm - it does not store compressed data.

We don’t know if the animal brain works the same way, but I suspect it is mostly compression algorithms designed to predict things, and doesn’t store much data at all.
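One way to see the link between prediction and compression: an ideal entropy coder spends about -log2(p) bits on a symbol the model predicted with probability p, so better prediction means fewer bits. The sketch below assumes a toy adaptive character-frequency model (not a neural network) and an illustrative test string, and compares it against a model that predicts nothing.

```python
import math
from collections import Counter

ALPHABET_SIZE = 256  # assumption: symbols are single bytes

def uniform_bits(text):
    """Cost when every symbol is assumed equally likely (no prediction)."""
    return len(text) * math.log2(ALPHABET_SIZE)

def adaptive_bits(text):
    """Cost under a model that predicts each symbol from counts seen so far."""
    counts = Counter()
    seen = 0
    bits = 0.0
    for ch in text:
        # Laplace smoothing keeps unseen symbols at a nonzero probability.
        p = (counts[ch] + 1) / (seen + ALPHABET_SIZE)
        bits -= math.log2(p)
        counts[ch] += 1
        seen += 1
    return bits

text = "abracadabra " * 50  # illustrative, highly repetitive input

baseline = uniform_bits(text)
predicted = adaptive_bits(text)
print(f"no prediction: {baseline:.0f} bits, with prediction: {predicted:.0f} bits")
```

Note that the model itself holds only running counts, not the text, which matches the comment's point that the trained predictor is the compression algorithm rather than a store of compressed data.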

bobboies 2 years ago | |

Good example: in my math and physics classes I found it really helpful to understand the general concepts; then, instead of memorizing formulas, I could actually derive them from other known (perhaps easier-to-remember) facts.

Geometry is good for training in this way—and often very helpful for physics proofs too!

lacrimacida 2 years ago | | |

Too bad this method is penalized most on timed tests, where memorization is favored. But deriving results reinforces knowledge, understanding and patterns best, in my opinion.

pillefitz 2 years ago | |

That is essentially what embeddings do

nightski 2 years ago | | |

Maybe, except from my understanding an embedding vector tends to be much larger than the source token (due to the high dimensionality of the embedding space), so it's almost like reverse compression in a way. That said, I know vector DBs have much more efficient ways of storing those vector embeddings.
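A rough size comparison illustrates the point about embeddings being larger than their tokens. The vocabulary size, embedding width, and dtypes below are hypothetical values picked for illustration, not those of any particular model.

```python
import numpy as np

VOCAB_SIZE = 1_000  # hypothetical vocabulary size
EMBED_DIM = 768     # hypothetical embedding width

rng = np.random.default_rng(0)
# Stand-in for a learned embedding table: one float32 row per token.
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

token_id = np.int32(42)             # the "source token" is just an integer index
vector = embedding_table[token_id]  # the embedding is a lookup into the table

id_bytes = token_id.nbytes
vector_bytes = vector.nbytes
print(f"token id: {id_bytes} bytes, embedding: {vector_bytes} bytes")
```

Under these assumptions the vector is hundreds of times larger than the id it was looked up from, so any compression-like behaviour has to come from what the geometry of the space encodes rather than from the byte count.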

BSEdlMMldESB 2 years ago | |

yes, when we do this to history, it becomes filled with conspiracies. but is merely a process to 'understand' history by projecting intentionalities.

this 'compression' is what 'understanding' something really entails; at first... but then there's more.

when knowledge becomes understood it enables perception (e.g. we perceive meaning in words once we learn to read).

when we get really good at this understanding-perception we may start to 'manipulate' the abstractions we 'perceive'. an example would be to 'understand a cube' and then being able to rotate it around so to predict what would happen without really needing the cube. but this is an overly simplistic example

NovaDudely 2 years ago | | |

This was the thinking I was taking. It is a useful tool at first but taken too far can be a bad thing in some situations.

pyinstallwoes 2 years ago | |

Maxwell’s demon to entropy

greenflag 2 years ago |

It seems the take home is weight decay induces sparsity which helps learn the "true" representation rather than an overfit one. It's interesting the human brain has a comparable mechanism prevalent in development [1]. I would love to know from someone in the field if this was the inspiration for weight decay (or presumably just the more equivalent nn pruning [2]).

[1] https://en.wikipedia.org/wiki/Synaptic_pruning

[2] https://en.wikipedia.org/wiki/Pruning_(artificial_neural_net...

gorjusborg 2 years ago |

Grr, the AI folks are ruining the term 'grok'.

It means roughly 'to understand completely, fully'.

To use the same term to describe generalization... just shows you didn't grok grokking.

jimwhite42 2 years ago |

I'm not sure if I'm remembering it right, but I think it was on a Raphaël Millière interview on Mindscape, where Raphaël said something along the lines of when there are many dimensions in a machine learning model, the distinction between interpolation and extrapolation is not clear like it is in our usual areas of reasoning. I can't work out if this could be something similar to what the article is talking about.

_ache_ 2 years ago |

Does anyone know how those charts are created? I bet they are half generated by some sort of library and then manually improved, but the generated animated SVGs are beautiful.

1wheel 2 years ago | |

Basically just a bunch of d3 — could be cleaned up significantly, but that's hard to do while iterating and polishing the charts.

I also have a couple of little libraries for things like annotations, interleaving svg/canvas and making d3 a bit less verbose.

- https://github.com/PAIR-code/ai-explorables/tree/master/sour...

- https://1wheel.github.io/swoopy-drag/

- https://github.com/gka/d3-jetpack

- https://roadtolarissa.com/hot-reload/

iaw 2 years ago | | |

I was going to ask the same question. Those are some great visualizations

ComputerGuru 2 years ago |

PSA: if you’re interested in the details of this topic, it’s probably best to view TFA on a computer as there is data in the visualizations that you can’t explore on mobile.

SimplyUnknown 2 years ago |

First of all, great blog post with great examples. Reminds me of what distill.pub used to be.

Second, the article correctly states that typically L2 weight decay is used, leading to a lot of weights with small magnitudes. For models that generalize better, would it then be better to always use L1 weight decay to promote sparsity in combination with longer training?

I wonder whether deep learning models that only use sparse fourier features rather than dense linear layers would work better...
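The L1-versus-L2 question can be explored on a toy problem. The sketch below assumes a plain linear regression with a sparse ground truth (not the article's models) and arbitrary hyperparameters; L2 is applied as ordinary weight decay in the gradient step, and L1 is applied with a proximal soft-thresholding step, which is one common way to get exact zeros.

```python
import numpy as np

def fit(X, y, penalty, lam=0.1, lr=0.05, steps=2000):
    """Gradient descent on mean squared error with an L1 or L2 penalty."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n_samples
        if penalty == "l2":
            # L2 weight decay: shrink every weight proportionally.
            w = w - lr * (grad + lam * w)
        else:
            # L1 via a proximal step: shrink by a constant and clip at zero.
            w = w - lr * grad
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -3.0, 1.5]  # only three features matter (toy assumption)
y = X @ true_w + 0.1 * rng.normal(size=200)

w_l1 = fit(X, y, "l1")
w_l2 = fit(X, y, "l2")

exact_zeros_l1 = int(np.sum(w_l1 == 0.0))
exact_zeros_l2 = int(np.sum(w_l2 == 0.0))
print(f"exact zeros with L1: {exact_zeros_l1}, with L2: {exact_zeros_l2}")
```

This only shows that L1 produces exactly sparse weights on a convex toy problem. Whether that translates into better generalization for deep networks trained longer is the open question the comment is asking.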

medium_spicy 2 years ago | |

Short answer: if the inputs can be represented well on the Fourier basis, yes. I have a patent in process on this, fingers crossed.

Longer answer: deep learning models are usually trying to find the best nonlinear basis in which to represent inputs; if the inputs are well-represented (read that as: can be sparsely represented) in some basis known a-priori, it usually helps to just put them in that basis, e.g., by FFT’ing RF signals.

The challenge is that the overall-optimal basis might not be the same as those of any local minima, so you’ve got to do some tricks to nudge the network closer.
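As a small illustration of "sparsely represented in a known basis", the sketch below builds a signal from two sinusoids (an arbitrary choice of frequencies and amplitudes) and counts how many coefficients are non-negligible in the time domain versus after an FFT.

```python
import numpy as np

N_SAMPLES = 256
t = np.arange(N_SAMPLES)

# Illustrative signal: two sinusoids that fit a whole number of cycles.
signal = (np.sin(2 * np.pi * 5 * t / N_SAMPLES)
          + 0.5 * np.sin(2 * np.pi * 12 * t / N_SAMPLES))

spectrum = np.fft.rfft(signal)
magnitude = np.abs(spectrum)

# Count entries that are not numerically negligible in each representation.
significant_time = int(np.sum(np.abs(signal) > 1e-6))
significant_freq = int(np.sum(magnitude > 1e-6 * magnitude.max()))
print(f"time domain: {significant_time} values, Fourier domain: {significant_freq} values")
```

A model fed the Fourier coefficients only has to attend to a handful of inputs, which is the intuition behind transforming inputs such as RF signals before training. Real signals are noisier and rarely this cleanly sparse.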

qumpis 2 years ago | |

Slightly related, but the sparsity-inducing activation function ReLU is often used in neural networks.

taeric 2 years ago |

I'm curious how representative the target function is? I get that it is common for you to want a model to learn the important pieces of an input, but a string of bits, and only caring about the first three, feels particularly contrived. Literally a truth table on relevant parameters of size 8? And trained with 4.8 million samples? Or am I misunderstanding something there? (I fully expect I'm misunderstanding something.)
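To make the "truth table of size 8" observation concrete, here is a sketch of a task of that shape. The input length of 30 bits and the use of parity as the target are assumptions made for illustration; the comment only says that the first three bits matter.

```python
import itertools
import random

N_BITS = 30    # assumed input length
N_RELEVANT = 3 # per the comment, only the first three bits matter

def label(bits):
    """Hypothetical target: parity of the relevant prefix."""
    return sum(bits[:N_RELEVANT]) % 2

# The whole function is captured by a table over the relevant bits.
truth_table = {
    prefix: sum(prefix) % 2
    for prefix in itertools.product([0, 1], repeat=N_RELEVANT)
}

rng = random.Random(0)
samples = [tuple(rng.randint(0, 1) for _ in range(N_BITS)) for _ in range(1000)]

# Every sample's label can be read off the table, ignoring the other bits.
consistent = all(label(s) == truth_table[s[:N_RELEVANT]] for s in samples)
print(f"table entries: {len(truth_table)}, all samples consistent: {consistent}")
```

The remaining bits are pure distraction, so the interesting part for a learner is discovering which inputs to ignore, not the size of the table itself.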

jaggirs 2 years ago | |

I have observed this pattern before in computer vision tasks (train accuracy flatlining for a while before test acc starts to go up). The point of the simple tasks is to be able to interpret what could be going on behind the scenes when this happens.

taeric 2 years ago | | |

No doubt. But I have also seen what people thought were generalized models failing on outlier, but valid, data. Quite often.

Put another way, it isn't just how simple this task seems to be in the number of terms that are important, but isn't it also a rather dense function?

Probably a better question to ask is how sensitive models looking at less dense functions (or more dense ones) are to this. I'm not trying to disavow the ideas.

superkuh 2 years ago |

There were no auto-discovery RSS/Atom feeds in the HTML, no links to the RSS feed anywhere, but by guessing at possible feed names and locations I was able to find the "Explorables" RSS feed at: https://pair.withgoogle.com/explorables/rss.xml

lachlan_gray 2 years ago |

It looks like grid cells!

https://en.wikipedia.org/wiki/Grid_cell

If you plot a heat map of a neuron in the hidden layer on a 2D chart where one axis is $a$ and the other is $b$, I think you might get a triangular lattice. If it's doing what I think it is, then looking at another hidden neuron would give a different lattice with another orientation + scale.

Also you could make a base 67 adding machine by chaining these together.

I also can't help the gut feeling that the relationship between W_in-proj's neurons compared to the relationship between W_out-proj's neurons looks like the same mapping as the one between the semitone circle and the circle of fifths

https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pi...

flyer_go 2 years ago |

I don't think I have seen an answer here that actually challenges this question - from my experience, I have yet to see a neural network actually learn representations outside the range in which it was trained. Some papers have tried to use things like sinusoidal activation functions that can force a neural network to fit a repeating function, but on its own I would call it pure coincidence.

On generalization: it's still memorization. I think there has been some proof that ChatGPT does 'try' to perform some higher-level thinking but still has problems due to the dictionary-type lookup table it uses. The higher-level thinking or AGI that people are excited about is a form of generalization that is so impressive we don't really think of it as memorization. But I actually wonder whether our own drive to generate original thought is really all that separate from what we are currently seeing.
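The point about failing outside the training range can be reproduced with a much simpler fitted model. The sketch below uses a degree-7 polynomial as a stand-in for a neural network (an assumption made for brevity, since both are flexible function approximators fitted to samples) and compares its error inside and outside the interval it was fitted on.

```python
import numpy as np

# Fit only on one period of a sine wave.
x_train = np.linspace(0.0, 2.0 * np.pi, 200)
coeffs = np.polyfit(x_train, np.sin(x_train), deg=7)

x_inside = np.linspace(0.0, 2.0 * np.pi, 50)            # within the training range
x_outside = np.linspace(4.0 * np.pi, 6.0 * np.pi, 50)   # well beyond it

err_inside = float(np.max(np.abs(np.polyval(coeffs, x_inside) - np.sin(x_inside))))
err_outside = float(np.max(np.abs(np.polyval(coeffs, x_outside) - np.sin(x_outside))))
print(f"max error inside: {err_inside:.4f}, outside: {err_outside:.1f}")
```

The fit is close inside the range and diverges outside it, even though the underlying function simply repeats. Nothing in the fitted data tells the model that it should.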

mjburgess 2 years ago |

Statistical learning can typically be phrased in terms of k nearest neighbours

In the case of NNs we have a "modal knn" (memorising) going to a "mean knn" ('generalising') under the right sort of training.

I'd call both of these memorising, but the latter is a kind of weighted recall.

Generalisation as a property of statistical models (ie., models of conditional freqs) is not the same property as generalisation in the case of scientific models.

In the latter a scientific model is general because it models causally necessary effects from causes -- so, necessarily if X then Y.

Whereas generalisation in associative stats is just about whether you're drawing data from the empirical freq. distribution or whether you've modelled first. In all automated stats the only diff between the "model" and "the data" is some sort of weighted averaging operation.

So in automated stats (ie., ML,AI) it's really just whether the model uses a mean.
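One possible reading of the "recall versus weighted averaging" distinction, sketched with k-nearest-neighbour regression on noisy samples of a sine curve. The data, noise level, and values of k are illustrative assumptions, and k=1 is used here as the pure-recall case.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict each query as the mean label of its k nearest training points."""
    distances = np.abs(x_train[None, :] - x_query[:, None])
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))
y_train = np.sin(x_train) + 0.3 * rng.normal(size=200)  # noisy observations

x_test = np.linspace(0.5, 2.0 * np.pi - 0.5, 100)
y_true = np.sin(x_test)  # noise-free target

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# k=1 replays the stored labels exactly; k=15 averages over neighbours.
train_err_k1 = mse(knn_predict(x_train, y_train, x_train, 1), y_train)
test_err_k1 = mse(knn_predict(x_train, y_train, x_test, 1), y_true)
test_err_k15 = mse(knn_predict(x_train, y_train, x_test, 15), y_true)
print(f"k=1 train: {train_err_k1:.3f}, k=1 test: {test_err_k1:.3f}, k=15 test: {test_err_k15:.3f}")
```

With k=1 the training error is zero because every point retrieves itself, noise included. Averaging over neighbours gives up that perfect recall and tracks the underlying curve more closely, which is the sense in which both are lookups over the same stored data.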

esafak 2 years ago |

I haven't read the latest literature but my understanding is that "grokking" is the phase transition that occurs during the coalescing of islands of understanding (increasingly abstract features) that eventually form a pathway to generalization. And that this is something associated with over-parameterized models, which have the potential to learn multiple paths (explanations).

https://en.wikipedia.org/wiki/Percolation_theory

A relevant, recent paper I found from a quick search: The semantic landscape paradigm for neural networks (https://arxiv.org/abs/2307.09550)

ajuc 2 years ago |

I was trying to make an AI for my 2d sidescrolling game with asteroid-like steering learn from recorded player input + surroundings.

It generalized splendidly - its conclusion was that you always need to press "forward" and do nothing else, no matter what happens :)

huijzer 2 years ago |

A bit of both, but it does certainly generalize. Just look into the sentiment neuron from OpenAI in 2017, or come up with a unique question for ChatGPT.

davidguetta 2 years ago |

hierarchize would be a better term than generalize

westurner 2 years ago |

If you omit the training data points where the baseball hits the ground, what will a machine learning model predict?

You can train a classical ML model on the known orbits of the planets in the past, but it can presumably never predict orbits given unseen n-body gravity events like another dense mass moving through the solar system because of classical insufficiency to model quantum problems, for example.

Church-Turing-Deutsch doesn't say there could not exist a Classical / Quantum correspondence; but a classical model on a classical computer cannot be sufficient for quantum-hard problems. (e.g. Quantum Discord says that there are entanglement and non-entanglement nonlocal relations in the data.)

Regardless of whether they sufficiently generalize, [LLMs, ML Models, and AutoMLs] don't yet Critically Think and it's dangerous to take action without critical thought.

Critical Thinking; Logic, Rationality: https://en.wikipedia.org/wiki/Critical_thinking#Logic_and_ra...

tehjoker 2 years ago |

Well they memorize points and lines (or tanh) between different parts of the space right? So it depends on whether a useful generalization can be extracted from the line estimation and how dense the points on the landscape are no?

djha-skin 2 years ago |

How is this even a shock.

Anyone who has so much as taken a class on this knows that even the simplest of perceptron networks, decision trees, or any form of machine learning model generalizes. That's why we use them. If they don't, it's called overfitting[1], where the model is so accurate on the training data that its inferential ability on new data suffers.

I know that the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.

No, really: what part of their base argument is novel?

1: https://en.wikipedia.org/wiki/Overfitting

MagicMoonlight 2 years ago |

Memorise because there is no decision component. It attempts to just brute force a pattern rather than thinking through the information and making a conclusion.

blueyes 2 years ago |

If your data set is too small, they memorize. If you train them well on a large dataset, they learn to generalize.

visarga 2 years ago | |

they only generalise with big datasets, that is the rule

blueyes 2 years ago | | |

That's what I said.

wwarner 2 years ago |

This is such a good explainer

lsh123 2 years ago |

Current ML models neither memorize nor generalize, but instead approximate.

tipsytoad 2 years ago |

Seriously, are they only talking about weight decay? Why so complicated?

agumonkey 2 years ago |

They ponderize.

lewhoo 2 years ago |

So, the TLDR could be: they memorize at first and then generalize?

drdeca 2 years ago | |

depends on the hyperparameters, and the architecture (and probably the task)

aappleby 2 years ago |

They digest.

xaellison 2 years ago |

what's the TLDR: memorize, or generalize?