From word models to world models (arxiv.org)
(a) induce an LLM to take natural language inputs and generate statements in a probabilistic programming language that formally models concepts, objects, actions, etc. in a symbolic world model, drawing from a large body of research on symbolic AI that goes back to pre-deep-learning days; and
(b) perform inference using the generated formal statements, i.e., compute probability distributions over the space of possible world states that are consistent with and conditioned on the natural-language input to the LLM.
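As a rough illustration of step (b), here is a minimal sketch of conditioning a tiny generative world model on an observation by rejection sampling. The tug-of-war-style scenario, the function names, and all the numbers are assumptions made up for this example; it is plain Python rather than a real probabilistic programming language, and it is not the paper's code.

```python
import random

def sample_world(rng):
    # Hypothetical prior: each player's strength is uniform in [0, 1].
    return {"alice": rng.random(), "bob": rng.random()}

def alice_beats_bob(world, rng):
    # Noisy match: the stronger player usually, but not always, wins.
    return world["alice"] + rng.gauss(0, 0.2) > world["bob"] + rng.gauss(0, 0.2)

def posterior_mean_strength(n_wins, n_samples=20000, seed=0):
    """Estimate E[alice's strength | alice beat bob n_wins times].

    Rejection sampling: draw worlds from the prior and keep only those
    that are consistent with the observation.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        world = sample_world(rng)
        if all(alice_beats_bob(world, rng) for _ in range(n_wins)):
            accepted.append(world["alice"])
    return sum(accepted) / len(accepted)

prior_mean = posterior_mean_strength(0)  # no observation: just the prior
post_mean = posterior_mean_strength(3)   # "Alice beat Bob three times"
print(prior_mean, post_mean)
```

In the setup described above, the LLM's job would be to translate a sentence like "Alice beat Bob three times" into the condition; the inference itself is done by the sampler, not by the LLM.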
If this approach works at a larger scale, it represents a possible solution for grounding LLMs so they stop making stuff up -- an important unsolved problem.
The public repo is at https://github.com/gabegrand/world-models but the code necessary for replicating results has not been published yet.
The volume of interesting new research being done on LLMs continues to amaze me.
We sure live in interesting times!
---
PS. If any of the authors are around, please feel free to point out any errors in my understanding.
No, the bigger problem with current LLMs is that even with high quality factual training data, they often generate seemingly plausible nonsense (e.g. cite nonexistent websites/papers as their sources.)
This is by design imo; they’re trained to generate ‘likely’ text, and they do that extremely well. There’s no guarantee for faithful retrieval from a corpus.
It remains to be seen whether you can truly be an effective intelligence with understanding of the world if all you have are symbols that you have to manipulate.
Nevertheless, it begins with far too many hedges:
> By scaling to even larger datasets and neural networks, LLMs appeared to learn not only the structure of language, but capacities for some kinds of thinking
There are two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.
It is absolutely trivial to show Hyp2 is false:
> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.
Indeed: because there're no relevant prior cases to sample from in that case.
> These issues make it difficult to evaluate whether LLMs have acquired cognitive capacities such as social reasoning and theory of mind
It doesn't. It's trivial: the disproof lies one sentence above. It's just that many don't like the answer. Such capacities survive trivial permutations -- LLMs do not. So Hypothesis-2 is clearly false.
No it's not
> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.
>Indeed: because there're no relevant prior cases to sample from in that case.
That's not what that tells us. Humans have weird failure modes that look absurd outside the context of evolutionary biology (some still look absurd) and that don't speak to any lack or presence of intelligence or complex thought. Not sure why it's so hard to grasp that LLMs are bound to have odd failure modes regardless of the above.
And "trivial" here is relative. In my experience, the "trivial" change often turns out to be one that a person might not pay close attention to either, and be similarly tricked by.
For instance, GPT-4 might solve a classic puzzle correctly then fail the same puzzle subtly changed. I've found more often than not, simply changing names of variables in the puzzle to something completely different can get it to solve the changed puzzle. It takes memory shortcuts but can be pulled out of that. LLMs have failure modes that look like human failure modes too.
E.g., do you have the capacity to reason about physics? Well, if you're extremely drunk, less so. But not if I permute the name of the object.
> I've found more often than not, simply changing names of variables
Yes, lol --- why do you think that is?
Because in the digitised dataset of "everything ever written" those names correspond to places in that dataset that can be sampled from by the LLM. Showing Hyp1 to be the case.
P(Hyp1| ChangeNameMakesDifference) >>>>>> P(Hyp2|ChangeNameMakesDifference)
To such a degree that the latter is vanishingly close to zero.
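For anyone who wants to check the arithmetic behind a claim like this, here is a minimal sketch of the odds form of Bayes' rule. The likelihoods and prior odds below are made-up placeholders, not measurements; the conclusion depends entirely on what you plug in.

```python
def posterior_odds(prior_odds, p_evidence_given_h1, p_evidence_given_h2):
    """Odds form of Bayes' rule.

    P(H1|E) / P(H2|E) = [P(H1) / P(H2)] * [P(E|H1) / P(E|H2)]
    """
    return prior_odds * (p_evidence_given_h1 / p_evidence_given_h2)

# Hypothetical numbers: equal prior odds, and the evidence E
# ("changing the names makes a difference") assumed to be nine times
# more likely under Hyp1 than under Hyp2.
odds = posterior_odds(prior_odds=1.0,
                      p_evidence_given_h1=0.9,
                      p_evidence_given_h2=0.1)
print(odds)
```

The ">>>>>>" only follows if P(E|Hyp2) really is tiny, which is the point the rest of the thread argues about.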
> look absurd outside the context of evolutionary biology
For humans, everything (everything) is within the context of evolutionary biology!
> LLMs have failure modes that look like human failure modes too.
Yes - because LLMs are trained on 2020 Reddit.
To investigate precisely this question in a clear and unambiguous way, I trained an LLM from scratch to sort lists of numbers. It learned to sort them correctly, and the entropy is such that it's absolutely impossible that it could have done this by Hyp1 (sampling from similar text in the training set).
https://jbconsulting.substack.com/p/its-not-just-statistics-...
Now, there is room to argue that it applies a world-model when given lists of numbers with a hidden logical structure, but not when given lists of words with a hidden logical structure, but I think the ball is in your court to make that argument. (And to a transformer, it only ever sees lists of numbers anyway).
Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data (i.e. data not available to the experimenter). And we do this because under PAC-Learning assumptions a learner is said to learn a concept when it can correctly label instances of the concept with some probability of some error. In real-world situations we do not know the true concept, so we test on held-out data to approximate the probability of error.
Bottom line, if you train a model to do a thing and you don't test it carefully to figure out its error, you might claim it's learned something, but in truth, you have no idea what it's learned.
(To clarify: you tested on the train data assuming there's a low probability of overlap. Don't do that if you're trying to understand what your models can do).
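A minimal sketch of what a strictly held-out split could look like for a list-sorting task. The list length, value range and set sizes are assumptions for illustration, not the values used in the linked experiment.

```python
import random

def make_disjoint_split(n_train, n_test, list_len=10, max_value=99, seed=0):
    """Generate random integer lists so that no list appears in both sets."""
    rng = random.Random(seed)
    seen = set()

    def fresh_list():
        # Re-draw until we get a list that has not been generated before.
        while True:
            candidate = tuple(rng.randint(0, max_value) for _ in range(list_len))
            if candidate not in seen:
                seen.add(candidate)
                return candidate

    train = [fresh_list() for _ in range(n_train)]
    test = [fresh_list() for _ in range(n_test)]
    return train, test

train, test = make_disjoint_split(n_train=1000, n_test=200)
overlap = set(train) & set(test)
print(len(overlap))  # 0 by construction, not just with high probability
```

Note this only guarantees disjointness at the level of whole lists; it says nothing about how many sub-sequences the two sets share.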
Formally, what hypotheses are you comparing? What do you think the specific hypothesis of the "AI = stats" person is? It isn't that the NN literally remembers data tokens, right?
In any case:
The issue with forcing NNs to model mathematical features is that the structure of the data itself has those properties. So the distributional hypothesis is true for sorting ordinals.
But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like...".
1. A thought is a representation of a situation
2. A representation generates entailments of that situation
3. Language is many-to-one translation from these representations to symbols
4. Understanding language is reversing these symbols into thoughts (ie., reprs)
So,
5. If agent A understands sentence X then A forms the relevant representation of X.
6. If an agent has a representation of S, it can state entailments of S (e.g., counterfactuals).
Now, split X into Xc = "canonical descriptions of S" and trivial permutations Xp.
(st. distribution of Xc,Xp is low, but the tokens of Xp are common)
Form entailments of X, say Y -- sentences that are canonically implied by the truth of X.
7. If the LLM understood that X entails Y, it would be via constructing the repr S -- which entails Y regardless of which sentence in X was used.
8. Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.
9. Since using Xp sentences causes it to fail, it does not predict Y via S.
QED.
And we can say,
1. Appearing to judge Y entailed-by X is possible via simple sampling of (X, Y) in historical cases.
2. LLMs are just such a sampling.
so,
3. +Inference to the best explanation:
4. LLMs sample historical cases rather than form representations.
Incidentally, "sampling of historical cases" is already something we knew -- so this entire argument is basically unnecessary. And only necessary because PhDs have been turned into start-up hype men.
How do we know? Who knows what they're trained on?
Your hypotheses 1 and 2 are not so different when you consider that the similarity function used to match text in the training data must be highly nontrivial. If it were not, then things like GPT-3 would have been possible a long time ago. As a concrete example, LLMs can do decent reasoning entirely in rot13; the relevant rot13'ed text is likely very rare in their training data. The fact that the similarity function can "see through" rot13 means that it can in principle include nontrivial computations.
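For reference, rot13 is a fixed letter substitution, so a test prompt of this kind is easy to build. A minimal sketch using Python's standard `codecs` module; the example sentence is made up.

```python
import codecs

plain = "If all cats are mammals, is a cat a mammal?"

# rot13 shifts each letter 13 places; applying it twice restores the text.
encoded = codecs.encode(plain, "rot_13")
decoded = codecs.decode(encoded, "rot_13")

print(encoded)
print(decoded == plain)
```

The surface tokens of `encoded` share almost nothing with the plain text, which is what makes the rot13 case interesting for the "similarity function" argument.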
There's also another hypothesis: Hyp3 -- that Hyp1 and Hyp2 converge as the LLM is scaled up (more training data, more dimensions in the latent space), and in the limit become equivalent.
But it cannot, since most of those are in the future.
Your argument also implies hyp1 and 2 are exclusive; clearly both can be true, and in fact must be true, unless you are claiming that you do not "sample" from similar language to express your own thoughts? Where does your language come from then, if not learning from previous experience?
The test for a capacity C in a system1 has nothing to do with proxy measures of that capacity in system2.
The capacity for an oven to cook food may be measured by how much smoke it lets off when burning -- but no amount of "smoke" establishes that a dry ice machine can cook.
This type of "engineering thinking" is pseudoscience.
There is intelligent thought and action, and there is unintelligent thought and action. Intelligent is that "which checked" (intus-legere); the other, the """impulsive""", is not.
> How could the common-sense background knowledge needed for dynamic world model synthesis be represented, even in principle? Modern game engines may provide important clues.
This has often been my starting point in modelling the difference between a model-of-pixels vs. a world model. Any given video game session can be "replayed" by a model of its pixels: but you cannot play the game with such a model. It does not represent the causal laws of the game.
Even if you had all possible games you could not resolve between player-caused and world-caused frames.
> A key question is how to model this capability. How do minds craft bespoke world models on the fly, drawing in just enough of our knowledge about the world to answer the questions of interest?
This requires a body: the relevant information missing is causal, and the body resolves P(A|B) and P(A|B->A) by making bodily actions interpreted as necessarily causal.
In the case of video games, since we hold the controller, we resolve P(EnemyDead|EnemyHit) vs. P(EnemyDead| (ButtonPress ->) EnemyHit -> EnemyDead)
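A toy simulation of that distinction, as a sketch only. The game, the "lava" event and every probability here are invented for illustration: passively logged frames give P(EnemyDead | EnemyHit), while pressing the button ourselves gives the interventional quantity.

```python
import random

def step(rng, press_button=None):
    """One frame of a made-up game. Returns (enemy_hit, enemy_dead)."""
    lava = rng.random() < 0.3              # world-caused: hits and kills the enemy
    if press_button is None:
        press_button = rng.random() < 0.3  # passive log: someone else is playing
    hit = lava or press_button
    # A player hit only kills 20% of the time; lava always kills.
    dead = lava or (press_button and rng.random() < 0.2)
    return hit, dead

def estimate(n=50000, seed=0):
    rng = random.Random(seed)
    # Observational: watch frames and condition on seeing a hit.
    observed = [step(rng) for _ in range(n)]
    deaths_given_hit = [dead for hit, dead in observed if hit]
    p_dead_given_hit = sum(deaths_given_hit) / len(deaths_given_hit)
    # Interventional: we hold the controller and always press the button.
    forced = [step(rng, press_button=True)[1] for _ in range(n)]
    p_dead_do_press = sum(forced) / n
    return p_dead_given_hit, p_dead_do_press

p_obs, p_do = estimate()
print(p_obs, p_do)
```

With these assumed probabilities the observational estimate comes out around 0.67 and the interventional one around 0.44, because the logged frames mix lava kills in with player hits.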
"The vast majority of our knowledge, skills, and thoughts are not verbalizable. That's one reason machines will never acquire common sense solely by reading text."
They keep saying LLMs but only GPT-4 can do it at that level. Although actually some of the examples were pretty basic so I guess it really depends on the level of complexity.
I feel like this could be really useful in cases where you want some kind of auditable and machine interpretable rationale for doing something. Such as self driving cars or military applications. Or maybe some robots. It could make it feasible to add a layer of hard rules in a way.
Reasoning is just prediction with memory towards an objective.
Once large models have these perpetual operating sensory loops with objective functions, the ability to distinguish model powered intelligence and human like intelligence tends to drop.
World models are meant to be for simulating environments. If this was something like testing if a game agent with an LLM can form thoughts as it plays through some game it would be very interesting. Maybe someone on HN can do this?
You need constant modeling of touch/smell/vision/temperature, etc.
These senses give us an actual understanding of the physical world and drive our behavior in a way that pure language will never be able to.
"sufficient equivalence" is important because sure it may not _really_ know the color of red or the qualia of being, but if for all intents and purposes the LLM's internal model provides predictive power and answers correctly as if it does have a world model, then what is the difference?
Paper : Hi! I am 94 pages long.
I : omg...
Now sure you can't describe qualia, but that's basically a subjective artefact of how we sense the world and (to add another unfounded hot take) likely not critical to have an understanding of it on a physical level.
It's a topic that's too large for an HN comment, but "explaining" things in words comes after the fact, and is mostly limited to a small subset of our experience and skillset that is amenable to it.
Note that humans are animals too, btw. And conversely, I would consider nonverbal people as humans as well.
I would wager if you put a newborn human to be raised in the absence of any physical human contact, but somehow taught them to read/write, and gave them access to a universal corpus (text only, no audio/video), or heck, even internet access with `curl`, and lastly dropped them into the "real world" at age 25, they would be utterly incapable of performing, say, a basic service job at a restaurant.
Words help us symbolize and reason about our sense experiences, but they are not a substitute for them.
This is either some profound miscomprehension of just how many of your skills and thoughts are inexpressible in words, or some statement of how profoundly shallow your skills and thoughts actually are.
Of course, that does leave the door open: when these models are put in a physical body, a robot, and have to interact with the world, then maybe they can gain that "common sense".
This doesn't mean a silicon based AI can't become conscious of skills that are hard to verbalize. Just that they don't yet have all the same inputs that we have. And when they do, and they have internal thoughts, they will have the same difficulty verbalizing them that we do.
His research at Meta is in the analytic approach to machine learning. As a result, he is very unabashed in expressing distaste for ML approaches that don't align with his research.
Really, there is no larger sore loser than LeCun in internalizing the bitter lesson. Quoting him without this context is being deliberately misleading.
I would like to explain, but I can't quite put it into words...
:)
Sure but if some alien species were observing us, some of our actions would look downright odd. Evolutionary biology doesn't necessarily hold the same reference frame for other species, even on earth. Octopi are weird to us. Not so much to other Octopi.
>Yes - because LLMs are trained on 2020 Reddit.
I wasn't making any comment on why this was the case. Simply that it was. There'll be failure modes LLMs adopt from training data, but there's also bound to be failure modes LLMs adopt from the training scheme itself.
You seem to be talking past me, as nowhere did I claim that LLMs are intelligent. That's the point – Unlike you I do not claim to be able to prove or disprove this. I argue that your comment is the one that is pseudoscientific because you didn't provide (even a semblance of) a rigorous definition of intelligence.
I asked GPT3.5turbo "Pretend you are a character called Samatha and you're in your house. You go up to the thermostat and select a comfortable temperature and explained your reasoning"
> Next, I take into account my personal preferences and comfort levels. Everyone has their own ideal temperature range, and it's essential to find the sweet spot that makes me feel most comfortable. For me, it's usually between 22 to 24 degrees Celsius (72 to 75 degrees Fahrenheit). This range allows me to feel neither too cold nor too warm, striking the perfect balance.
It also goes on about how the humidity could affect the desired temperature, etc.
It doesn't need the ability to feel temperature (which could also be a single floating-point number in kelvin), but it can already describe a "comfortable temperature" and what factors would affect it.
Side note: It doesn't "know" anything, it can only make a "best guess" which is now reliable enough to be useful. It doesn't need the ability to test things to learn, we did it already for it, and it's using that to predict the results. You could make a recursive system to allow it to test data if you'd like though.
Think of all the top journals, textbooks, etc. People have understood the world by interacting with it, detailed their hypothesis, conducted experiments, recalled their learning and written down conclusions.
It's not at all obvious to say a useful world model cannot be derived strictly from all this written information.
Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).
> Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data
The probability of a test list existing in the training set is less than 10^-70.
That's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.
What is certainly established is that a function, and an algorithm, are different objects. To clarify, a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate. Algorithms are also typically understood to be provably correct and to have some provable asymptotic complexity (as opposed, for example, to heuristics) but that's not a requirement.
So for example, if you have a function ƒ between sets X and Y, and an algorithm P that calculates the result of ƒ, then you can give any element of X to P and it will return (in fact, construct) an element of Y. Crucially, ƒ is not P, and P is not ƒ.
Now, when you train a machine learning model, you are typically training a function ƒ̂ (with a little hat) to approximate ƒ. That means that your trained ƒ̂ is a function that maps some of the elements of X to the same elements of Y as ƒ, but not all. It's an approximation. So you get some amount of error, as in your experiment.
So what you've done in your experiment is that you trained a model to approximate a mapping between the set of lists, to itself (where the input list is any of the lists in your training set and the output is the same list, sorted). Your model is not an algorithm, and you cannot train an algorithm with a language model.
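The distinction can be sketched on a tiny domain: `f` below is the sorting function as a bare lookup table, `P` is an algorithm that constructs its values, and `f_hat` is a deliberately imperfect stand-in for a learned approximator. All three names and the choice of domain are assumptions for illustration; `f_hat` is hand-written, not a trained model.

```python
from itertools import product

# The domain X: all length-3 lists over {0, 1, 2}.
DOMAIN = list(product(range(3), repeat=3))

# f: a function, i.e. nothing but a mapping from elements of X to elements of Y.
f = {xs: tuple(sorted(xs)) for xs in DOMAIN}

def P(xs):
    """An algorithm (insertion sort) that constructs f(xs) and always terminates."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return tuple(out)

def f_hat(xs):
    """Stand-in approximator: only fixes the first pair, so it matches f on some inputs."""
    xs = list(xs)
    if xs[0] > xs[1]:
        xs[0], xs[1] = xs[1], xs[0]
    return tuple(xs)

algorithm_matches = all(P(xs) == f[xs] for xs in DOMAIN)
error_rate = sum(f_hat(xs) != f[xs] for xs in DOMAIN) / len(DOMAIN)
print(algorithm_matches, error_rate)
```

Whether a trained transformer ends up closer to `P` or to `f_hat` is exactly what the two sides here disagree on; the sketch only pins down the vocabulary.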
I appreciate that learning an algorithm is what you wanted to achieve, but in science we don't choose the answer that pleases us, we choose the answer that makes the most sense -- and a good heuristic for that is that the answer that makes more sense is the simplest one. Here, in order to convince yourself that you have trained a language model to learn an algorithm, rather than an approximator, you have chosen to rely on a preprint with a completely novel and untested concept that someone put on the internet, rather than the well-understood abstractions of elementary computer science, so not at all the simplest explanation. That is not a good idea. You will not understand what is going on, if you rely on that kind of explanation. I assume you are trying to understand?
Edit: incidentally, you don't need a transformer to train an approximator to a sorting function. You can do that with a multi-layer perceptron, or a logistic regression, certainly with an LSTM. Ceteris paribus, you'll get the same results.
>> The probability of a test list existing in the training set is less than 10^-70.
But the same probability if you held the test set out would be 0, so why not do that? It's not hard to do.
Is there a good reason not to do that?
Btw, lists are composite objects. How much overlap is there between your training and test lists? Do you know?
Edit: meh. HN messes up my nice f-with-hook-and-combining-circumflex-accent. DAAAAANG!!!!
Would you change your mind for a different link, like this one? http://proceedings.mlr.press/v139/weiss21a.html
I think you would enjoy learning about RASP, rather than taking such a hardline skeptical position.
> a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate
I'm aware. Transformers (and RASP programs) are guaranteed to terminate; that's one of their nice properties.
> Is there a good reason not to do that?
Balanced against the value of my unpaid time, a probability of 10^-70 is low enough for the purposes of a quick and fun test.
Speaking of which, I'm going to enjoy my weekend now. I hope you enjoy yours!
Perhaps language is the wrong term to use, since it's not what LLMs are really about. They're about text. There are very few things that cannot be expressed as text, albeit in unconventional ways like base64. Being opaque to humans doesn't mean that with enough data a neural net can't be taught to "see" images that way or "hear" sound files for example. If the original assertion is true, then there must be some kind of universal barrier to skills that cannot be expressed in text. That sounds completely crazy to me, since we humans are also likely just organic data that could be expressed as text with some encoding. The main problem is interfacing with it in some way that's actually useful, which is the extremely hard part.
Another thing to consider is that with a formalized enough language (i.e. a programming language) one can be far more exact in explaining things accurately than any natural language with its cultural specifics and inferred nonsense. That's probably why LLMs designed as coding models first and foremost usually outperform those that aren't in solving unrelated arbitrary problems.
> Note that humans are animals too, btw. And conversely, I would consider nonverbal people as humans as well.
Humans are animals in the biological sense, yes. But very much not in the societal and skill-transferring sense.
Why? This is obviously wrong in the general case. For that to be true, Xp and Xc have to have no statistical relationship whatsoever, which statistically is virtually impossible.
Consider a reference in the paper above, https://arxiv.org/pdf/2302.08399.pdf
Xc = > Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn.” Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.
Produces, Y = She believes that the bag is full of popcorn
Xp = > Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ’chocolate’ and not ’popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label.
Produces, Y = She believes that the bag is full of chocolate
And so on, and so on...
Great idea. Now prove you can actually choose such a distribution, lol.
This is clearly where the "proof" falls apart. Even in tasks where GPT4 struggles, its accuracy will still be better than random. The bar of "better than random" is so low that even weak LLMs will be able to surpass it.
Moreover, you need to prove this not just for a single task: you need to show that no task/domain exists for which 8 fails.
What your proof says is basically "LLMs do not generalize even the slightest for any task". And that's trivial to disprove.
If you could put ChatGPT in a loop, take some Xc prompts and permute with some non-semantic phrases ("Alice believes that... Xc ... what did Alice believe?") etc --- until you find those cases.
I imagine we will discover quite a large number of such non-semantic phrases which have this effect. Because the tokens in those phrases will, joint with Xc, be arbitrarily distributed in some historical data (distributed to our preference when finding them).
This seems just kinda basically obvious, right? Entailments are discretely constrained by semantics, and historical datasets can contain arbitrary mixtures of random distributions of syntax.
NNs only model those distributions -- and not the entailments -- which, at the very least, are extremely discrete.
Let's not be so hasty. I think I do put it as clearly as possible. I'm comparing essentially your Hyp1 and Hyp2, where Hyp1 (aka the stochastic parrot) is expressed a little bit more clearly as the LLM is learning an n-gram that produces correct sorts through rote memorization of statistical correlations in the training data, like that sorted lists tend to start with '0', end with '99', and increase monotonically; and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.
> But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like..."
This is not really obviously false. Yes, being red isn't "red follows words like...". But a word order should still map to properties of the world, especially if those words are to be meaningful to a listener. Being red is "a surface reflects or transmits most of the light in the 600-800 nm spectrum and absorbs most of the rest". Of course, it won't do to just echo those tokens; once you've nailed down the concept of "red", you need to make sure that concepts like "reflects", "light", and "spectrum" are represented as well. It's an open question as to whether this sort of knowledge graph can be properly bootstrapped from a large volume of text descriptions, but I am strongly inclined to believe it can. If you dismiss it outright you're just begging the question.
Redness is not in the structure of those sentences. And there will always be an infinity of sentences which are True but cannot be inferred by an LLM -- but can be so, trivially, by a person acquainted with redness.
In any case,
I'd need more time than I have at the moment to seriously state Hyp1 for your case -- but atm, I can say that because the data itself has the property, Hyp1 becomes much harder to state and the argument much subtler.
Since what is a "statistical distribution" of "ordinals" anyway? And how much memory is required to represent it? My sense is this distribution has highly redundant features which will be trivially compressible without learning any "sorting algorithm".
At a quick glance of your article it feels like you haven't formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.
Which is exactly how the set of sentences actually written encodes in it the idea of "Redness". It's the "actually written" part that carries information about the real world.
> And there will always be an infinity of sentences which are True but cannot be inferred by an LLM -- but can be so, trivially, by a person acquainted with redness.
That's cheating, because "a person acquainted with redness" presumably learned it by sight, which LLMs can't do just yet (at least the widely accessible ones can't). Would you also say that a person born blind also cannot infer those True sentences about redness? Because if they can, that means the concept of redness is capable of being taught through language, and so there's no reason LLMs couldn't pick up on it too.
Sure; it's in the spectrum of reflected light. (Or perhaps, the retina's trichromatic responsivity). But that physical concept can be meaningfully described by sentences. It doesn't require an infinite number of them to create a coherent world-model, which can do things like predicting that a blue object will become red if it moves away from you at a high enough speed. Which is something a human might be surprised by even after many years of visual experience with red objects -- unless they've read sentences about the Doppler effect in a physics textbook.
If you can manage to trick GPT-4 into revealing that it doesn't have a world-model of the concept of 'red', please show us!
> At a quick glance of your article it feels like you haven't formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.
Keep in mind, the LLM's structure was not hand-crafted to do well on this mathematical task. It was built to be good at language modelling, and initialized with essentially a uniform prior over all token sequences. Even if a dataset is efficiently compressible, that's no guarantee that the LLM will be able to compress it efficiently. In fact, many people would probably be surprised to learn that it can do this problem at all, let alone so well with so little training. But do think about the statistics of sorting a bit more. I think it's not as easily compressible as you think it is, except by an actual sorting algorithm. Again, you can compress it a bit with monotonicity and so on, but nowhere near the amount you'd need to sort a long list without errors, using so few parameters. I compute the number of sorted and unsorted lists in the footnotes.
One of the things that makes sorting tricky for an LLM is you always need to look at every item in the input list. Even if the previous output token was '99', you can't be sure you're now at the end of the list; you still need to count how many '99's were output already and how many are needed.
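To make that concrete, here is a sketch of what "the right next token" depends on when emitting a sorted list one element at a time. The function name and the example lists are made up; this is a reference computation written by hand, not a description of what the trained model does internally.

```python
from collections import Counter

def next_token(input_list, emitted_so_far):
    """Return the next element of the sorted output, or None at end of list.

    The answer depends on the counts of every input element, not just on
    the previously emitted token.
    """
    # Counter subtraction keeps only elements still owed to the output.
    remaining = Counter(input_list) - Counter(emitted_so_far)
    if not remaining:
        return None
    return min(remaining)

inputs = [99, 5, 99]
print(next_token(inputs, [5, 99]))      # another 99 is still owed
print(next_token(inputs, [5, 99, 99]))  # only now is the list finished
```

The previous token is '99' in both calls, yet the correct continuation differs, which is the counting problem described above.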
(The dataset itself, of course, does not contain the notion of sorting, a description of sorting, a test for sortedness, or any algorithm for sorting. It only contains a large but finite number of examples of sorted and unsorted lists. It's up to the LLM, and its training process, to discover the mechanism that generated these results.)
Where did you find this terminology?
EDIT:
>> and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.
Btw, you have not shown anything like that. You trained and tested on lists of two-digit positive integers expressible in 128 characters. That's not "any input list". As a for instance, what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?
Your model also doesn't correctly generalise, not even to its own training set that you tested it on. There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).
It's not clear to me how you account for those obvious limitations of your model (it's a toy model after all) when you claim that it "learned to implement a sorting algorithm" etc. It would be great if you could clarify that.
The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?
> You say e.g. that "LLM is learning an n-gram"[...] you can't "learn an n-gram".
Where do I say that? I don't think I make any reference to "learning an n-gram", which is a relief because I don't know what it would mean to "learn an n-gram".
> There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).
Test accuracy between training iterations (not part of the training process itself, which uses its own separate validation set which is split from the training set). And yes, I agree, it is not error-free, and I wouldn't expect it to be, especially after so little training. What the figure shows is the percentage of sorts that were error-free, and how rapidly that decreases. I've since repeated the test with finer resolution, and the fraction of imperfect sorts continues to decrease about as you expect, which is enough to satisfy my curiosity, although I'm a little curious to see if there is some point where it falls completely to zero.
In PAC-Learning terms, specifically, a "concept" is a set of instances (which may be vectors or whatever).
Note that a "concept" is not the same as a "class", as in classification. Instead a concept belongs to a class of similar concepts and a learner is trained on instances of concepts in a class. Then a learner is said to be capable of learning the concepts in a class if it can correctly label instances of a concept in the class with some probability of some error.
For a more concrete example, a "class" of concepts is the class of objects represented as subsets of pixels in digital images. A "concept" of that class is, for example, the concept "dog". An image classifier can be said to be able to learn to identify objects in images if it can correctly classify subsets of the pixels in an image as "dog" (or "not dog").
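For reference, the usual textbook form of "correctly label instances with some probability of some error" can be written as below. The symbols (target concept c, instance distribution D, hypothesis h_S learned from a sample S of size m, error bound ε, failure probability δ) are standard PAC notation that I'm assuming here; they don't come from the article.

```latex
\Pr_{S \sim D^m}\big[\, \mathrm{err}_D(h_S) \le \varepsilon \,\big] \ge 1 - \delta,
\qquad
\mathrm{err}_D(h) = \Pr_{x \sim D}\big[\, h(x) \ne c(x) \,\big]
```

A class of concepts counts as learnable in this sense if this can be achieved for every concept c in the class and every distribution D, with m growing only polynomially in 1/ε and 1/δ.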
Since the article above is coming from Josh Tenenbaum's group, that's the kind of terminology you should have in mind, when you're talking about "concepts". These guys are old-school (and I say that as a compliment).
Then they don't in LLMs too
>Yes, lol --- why do you think that is?
Being able to solve a modified version of a common puzzle, with names it would never have seen in training, is not an indication of a lack of ability lol. And changing names isn't the only way to get it out of memory, just the easiest and most straightforward. You can converse it out of there too, but that doesn't work as often.
LLMs don't get drunk.
If a child answers questions from a book of answers then they'll appear to understand the domain insofar as those questions appear. They do not.
They will fail to answer questions under, eg., permutations of words (say, a question asks about "norepinephrine" but the book only contains "noradrenaline" etc.).
Insofar as a human cannot answer questions under trivial linguistic permutations then they too do not understand the domain.
But these are not the kinds of failures experienced with those who have some capacity, eg., for counter-factual reasoning about their environment's physics.
In those people it is environmental illusion and cognitive impairment -- not trivial permutations of phrasing which lead to catastrophic loss of apparent understanding.
Cognitive impairment = reasoning machine is broken
Environmental illusion = data is ambiguous and actions cannot resolve it
These "failure modes" are expected if you actually have the relevant capacity.
Well actually they sort of can...
https://www.reddit.com/r/LocalLLaMA/comments/13vv941/tempera...
Alright, let me humor you for a bit. Let's start with some solid examples of GPT-4 failing this "trivial linguistic permutation", then?
You need to realize that you wrote it on a forum where the most known joke is "there are two hard things in programming". That would immediately show you how this assumption is exactly false.
I am using science, ie., abduction, to compare a class of hypotheses.
P(CapacityToThink| DegradingPermutations, ModelDrawsFromHistoricalCases)
is much much much lower than,
P(-CapacityToThink| DegradingPermutations, ModelDrawsFromHistoricalCases)
My point here isn't "if it quacks like a duck...", but more so that while we are talking about intelligent apparatus we should be comparing apples to apples, and not say "this is a mere engine and that is a living brain".
[Earlier text of my comment left in the interest of something or other]
It is, but it's published in the proceedings of the ICML, which means it's been peer reviewed.
The OP has checked out (I guess all this computer scienc-y stuff is boring on a weekend), but even a peer-reviewed article is not enough to cause us to let go of good, old-fashioned computer science. The article basically invents its own language and then proceeds to map transformers to it, to claim that transformers can learn various kinds of programs. It's not convincing.
In any case, learning to sort lists by neural nets is not something new, or unique to transformers, and there's pretty clear understanding of how it works. I explain why it doesn't constitute learning an algorithm in my comment above. The RASP paper doesn't change that. I mean, Recurrent Neural Nets have a known equivalence to FSMs but even they cannot learn algorithms but only approximate them. The OP wrote his article in an obvious effort to understand why GPT is "not an n-gram" even if it behaves like an n-gram model (well, it's a language model, it doesn't matter what it's trained on) so I'm guessing he can appreciate the need for clarity in explaining empirical results and he probably will want to think further on what, exactly, his experiment has shown. I hope my little comment above will help him do that.
Everything here actually just follows formally from what NNs are: they're just empirical function approximations.
It will always be the case that they just model the probabilistic structure of the dataset and not the data generating process.
Since, in language, there are discrete constraints which make P(...) = 1 or P(...) = 0 --- you can trivially produce datasets showing that it learns P(...) = mistake-you-created-deliberately and not either 0,1.
As above, the LLM switches from 95% confidence "chocolate" to 95% confidence "popcorn" with a trivial non-semantic permutation of the prompt.
The obscene issue in all this is that we know this already -- empirical function approximation of historical datasets just produces associative probabilistic models of those datasets.
`randomchars()` does not match your own requirement `but not that the tokens of Xp are themselves rare` and therefore is unsuitable.
In your comment above:
(...) is expressed a little bit more clearly as _the LLM is learning an n-gram_ that produces correct sorts (...)
(My underlining)
You also use it in a similarly unusual way throughout your linked substack post, for example, you write:
the way GPT works is, in a certain sense, functionally equivalent to an n-gram, but that doesn’t mean GPT is an n-gram.
Where does this use of "n-gram" come from? I mean, did you see it somewhere? I'm curious, where?
>> The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?
I'm sorry, I don't understand. "Defined an ordering", where?
You can change your tokenizer but that will not change the trained model, obviously. So if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly. But isn't that what you claim? That:
"the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list"
"Any input list"? How so?
Oh, I see, good catch. I think that comment was a result of a botched edit; I do that sometimes. Too late to change it now. Sorry for the confusion!
> Where does this use of "n-gram" come from? I mean, did you see it somewhere?
It's shorthand for n-gram Markov model. The same way it is presented in, for example, A Mathematical Theory of Communication.
> "Defined an ordering", where?
In order for a set to be sortable, you need to define an ordering over the elements. So for example, defining that the letter 'A' is greater than the number '99'. It's easy to take for granted that 1 < 2, but the neural network doesn't know that a priori, because the tokens are just index values. It doesn't have any way to know that token number 5 represents the character '5'.
> if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly.
To reiterate, the token dictionary basically just contains the characters "0123456789,():[]_\n". If you try to ask it to sort '(Tuesday, Monday)', it's just going to throw an exception because 'T' isn't a recognized token; it doesn't have a corresponding index. It's not even a question of whether it can sort them correctly or incorrectly.
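A minimal sketch of what that looks like, assuming a plain character-level lookup table. Only the vocabulary string is taken from the comment above; the names `VOCAB`, `CHAR_TO_ID`, `encode` and `decode` are made up for illustration, and the real tokenizer may assign indices differently.

```python
# Hypothetical character-level tokenizer (illustration only).
# The vocabulary string is the one quoted in the comment; index order is an assumption.
VOCAB = "0123456789,():[]_\n"
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Map each character to its index; an unknown character raises KeyError."""
    return [CHAR_TO_ID[ch] for ch in text]


def decode(ids: list[int]) -> str:
    """Map indices back to characters."""
    return "".join(VOCAB[i] for i in ids)


# Digits and punctuation in the vocabulary round-trip fine.
ids = encode("(12,7)")
assert decode(ids) == "(12,7)"

# 'T' has no index, so encoding fails before the model is ever called.
try:
    encode("(Tuesday, Monday)")
except KeyError as err:
    print("unrecognized token:", err)
```

Note that the indices are just positions in a lookup table: nothing in `CHAR_TO_ID` says that index 1 is "less than" index 2, so any ordering has to come from the training data.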
> "Any input list"? How so?
I think the meaning is pretty clear. No algorithm can sort a list of elements that aren't members of a totally ordered set, so I wasn't attempting to imply that any input list meant that a neural network could somehow supersede this limitation.
That is, in machine learning a concept is represented as a set of instances. Inside the human mind, who knows.
https://medium.com/@nathanbos/prompting-better-theory-of-min...
The whole point is that irrelevant word permutation should not "turn on" or "turn off" this capacity.
That you can "prompt engineer" your way to the answer shows that the prompt engineer knows the answer and can "use the right search terms" to find it.
That's real class, right there.
A human who isn't paying attention could fail the question too, which is kind of the point I'm making.
There's no way a model that can't model protein structures does this - https://www.researchgate.net/publication/367453911_Large_lan...
Your LLM here is 600MB which is a grossly inefficient compression of the sort space.
If LLMs "learned algorithms", the best compression would be on the order of bytes.
The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!
What do you think all those MBs are doing? They're the extraordinary cost of the "statistical shortcut" of modelling the empirical distribution of sorted numbers.
NNs exploit distributional structure in the training data to compress it --- in this case there's huge amounts of distributional structure in numbers.
I think you've misunderstood the "statistical parrot" claim to be somehow that NNs are engaged in rote memorization... or, what?
The claim is simply that all they do is statistically approximate the empirical distribution of the training dataset structure --- and if you force interpolation, then they provide arbitrarily precise compressions of that structure.
I'm not sure what a NN which can sort numbers shows, other than the distributional structure of a sort-numbers dataset is such that a NN can compress it into 600MB...
To be clear, the "statistical parrot" claim is that the statistical distribution of the empirical dataset D = (X, y) is being approximated by the weights, W = Compress(D) -- and that this distribution fails to be a representational model of y -- because no entailments of X (other than those in D) are captured.
Whereas representational models are not confined to the distribution of historical cases, ie., I can imagine variations on X leading to any given y; and variations on y leading to any given X -- without ever having experienced either.
You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.
I'm not exactly sure why you think this is a reply to the relevant claims.
This isn't a fair comparison. The python code to sort a list is leveraging an enormous amount of information that is stored outside the python code, whereas the GPT version basically has to do it "from scratch", and in a very convoluted computing model.
A better comparison would be "how many bits does it take to encode a configuration of NAND gates that describes a computer that can sort 127-byte lists of numbers 1..100?"
I'm sure it's not as much as 600 megabytes, but it'll be a lot more than the python code.
Which means you don't need to count all the bytes in the infrastructure all the way down to the electric grid, maybe. You can compare a sorting algorithm to a sorting model, as stand-alone programs, on their relative size, and that will give you a good idea of how much work each is doing.
Yes. Except:
(1) the model size is fixed during training, it would be impossible to obtain a bytes-sized result regardless of what it learns to represent. One might even open the thing up and find bubblesort* inside followed by 599 MB of junk DNA; that size is dictated by how it was initialized.
(2) I'm not claiming this model is a minimal size; I started with the biggest model I could train on my wimpy GPU and succeeded on my first and only try, which I think is a fairer representation of how GPT-4 was built than if I'd started by proving the minimum size of transformer that could represent the task** and then (surprise!) obtained it.
(3) Compared with the size of a map of all 10^80 unique input lists to all 10^36 correctly-corresponding sorted outputs, 600 MB is a remarkable compression ratio, even if it's not reducing it all the way down to exec("sort(input)").
(4) Nowhere do I make any claim that transformers are minimal or even space-efficient representation of an algorithm (or a world-model); in fact, they seem quite terrible in this respect, especially compared to arbitrary code. And doubtless there are a bunch of weights that got trained to near-zero and could be trimmed to make the matrices more sparse, or quantized, which is the kind of thing people do to compress an LLM itself but I didn't bother. What transformers do seem to do very well at, despite the overhead, is the differentiability that allows them to be trained in the first place, and also the flexibility to handle different kinds of problems. I could have trained the same blank-slate starting model to one that shuffles or reverses each list, or perhaps to do one or the other depending on whether the first number is odd or even, or any number of other tasks.
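For anyone who wants to sanity-check the counting in point (3), here is a rough sketch. It assumes fixed-length lists of 40 values drawn from 100 possible integers; both numbers are my assumption for illustration (the actual setup uses variable-length lists that fit in the context window), and the function names are hypothetical.

```python
from math import comb

NUM_VALUES = 100  # assumed number of distinct integers
LIST_LEN = 40     # assumed fixed list length, for illustration only


def count_ordered_lists(num_values: int, length: int) -> int:
    """Number of distinct input lists: each position is chosen independently."""
    return num_values ** length


def count_sorted_lists(num_values: int, length: int) -> int:
    """Number of distinct sorted outputs: multisets of the given size (stars and bars)."""
    return comb(num_values + length - 1, length)


n_ordered = count_ordered_lists(NUM_VALUES, LIST_LEN)
n_sorted = count_sorted_lists(NUM_VALUES, LIST_LEN)

# Print the order of magnitude (number of decimal digits minus one) of each count.
print(len(str(n_ordered)) - 1, len(str(n_sorted)) - 1)
```

Under these assumptions the input count is exactly 10^80, and the sorted-output count is many orders of magnitude smaller, because every permutation of the same multiset maps to the same sorted list.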
> You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.
It's almost definitely the case that every list it's tested on, and sorts 100% correctly, is a list it has never seen in training (unless it's a very short list, but I control for that). My training dataset is only about 100 MB; given the number of random lists, it's vanishingly unlikely that it's seen almost any of them, let alone the 100% of them that it is able to sort correctly. (The tests, of course, were not drawing from the validation set either; I test the model by generating new lists on the fly, because that's easy to do).
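Roughly, the on-the-fly test looks like the sketch below. This is not the actual evaluation code: `random_list` and `fraction_perfect` are hypothetical names, the length and value ranges are assumptions, and `sort_fn` stands in for whatever wraps the trained model.

```python
import random


def random_list(rng: random.Random, max_len: int = 40, low: int = 1, high: int = 99) -> list[int]:
    """Generate a fresh random list; lengths and value range are assumed, not the real config."""
    n = rng.randint(1, max_len)
    return [rng.randint(low, high) for _ in range(n)]


def fraction_perfect(sort_fn, trials: int, seed: int = 0) -> float:
    """Fraction of freshly generated lists that sort_fn sorts with zero errors."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(trials):
        xs = random_list(rng)
        # A sort only counts if the whole output matches exactly.
        if list(sort_fn(list(xs))) == sorted(xs):
            perfect += 1
    return perfect / trials


# Usage with stand-ins: a correct sorter scores 1.0, a do-nothing "sorter" scores far lower.
print(fraction_perfect(sorted, trials=200))
print(fraction_perfect(lambda xs: xs, trials=200))
```

Because the lists are drawn at test time rather than held out from a fixed dataset, the chance that any long test list also appeared in training is negligible.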
> statistically approximate the empirical distribution of the training dataset structure
Can you provide more details about what you mean by this distributional structure that can be compressed without a generally-correct sorting algorithm? How would you define a similarity measure between distinct random lists that allows for this kind of interpolation?
* Well, probably RASP-sort, not bubblesort. Also, it would need to include definitions of things like the comparison operator between all tokens, because it doesn't have a numeric datatype built in, or even the idea of numbers as an ordered set; it has to learn all that.
** (the Weiss paper does this, and lo and behold, transformers can indeed sort).
The distribution of sorted digits is:
(0 1 2 3 4 5 6 7 8 9) before
(1 before 0 1 2 3 4 5 6 7 8 9) before
(2 before 0 1 2 3 4 5 6 7 8 9) before
(3 before 0 1 2 3 4 5 6 7 8 9) ...
...
When you compute the search space you're treating each number as a unique token (ie., that all ordinals are unique) -- but it's not sorting unique ordinals, it's sorting digits in a sequential model ie., it learns P(Next|Prev)
The (sequential) distribution of digits amongst sorted numbers is tiny
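To make "P(Next|Prev)" concrete, here is a small sketch that estimates character-level bigram statistics from serialized sorted lists. Everything in it is an assumption for illustration: the serialization format, the list length of 10, the two-digit range, and the helper names. It only shows what such a conditional distribution looks like; it says nothing either way about whether these statistics are what the trained model uses.

```python
import random
from collections import Counter, defaultdict


def serialize(xs: list[int]) -> str:
    """Write a list as zero-padded two-digit numbers separated by commas (assumed format)."""
    return ",".join(f"{x:02d}" for x in xs)


def bigram_counts(strings: list[str]) -> dict[str, Counter]:
    """Count how often each character follows each other character."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for s in strings:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return counts


def conditional(counts: dict[str, Counter], prev: str) -> dict[str, float]:
    """Empirical estimate of P(next | prev)."""
    total = sum(counts[prev].values())
    return {ch: c / total for ch, c in counts[prev].items()}


rng = random.Random(0)
corpus = [
    serialize(sorted(rng.randint(10, 99) for _ in range(10)))
    for _ in range(1000)
]
counts = bigram_counts(corpus)

# Distribution over the character that follows a comma, i.e. the leading digit
# of every number except the first in each sorted list.
p_after_comma = conditional(counts, ",")
print(sorted(p_after_comma.items()))
```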
This is why 10^80 random lists get reduced to only 10^36 sorted lists. However, 10^36 is still very large with respect to the size of the model.
It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed -- but that isn't the claim?
The claim is rather, they sample from a statistical distribution of tokens.
Take each position in the input vector, 1...127. It needs to "learn":
P(x0 position | y, x1...x127 positions), P(1|y, 2...127), P(2|y, 3...127), etc.
Which is a family of 127 conditional distributions that seem trivial to learn.
I really don't know why you think the size of a combinatorial space is relevant here?
All the sorted lists share basically the same tiny family of conditional distributions { P(x_i | x_(i-1)...x_127) }
So you aren't replying to the "only stats" claim: that is the claim!
The issue is that language-use isn't a matter of distributions of text tokens: when I say, "the sky is clear today!" it is caused by there being a blue sky. Then I say, "therefore I'd like to go out!" it is caused by my preferences, etc.
So if we had a generative causal model of language it would be something like this: Agent + Environment + Representations ---SymbolicTranslation---> Language.
All LLMs do is model the data being generated by this process, they don't model the process (ie., agents, environments, representations, etc.)
They say, "it is a nice day" only because those tokens match some statistical distribution over historical texts. Not because it has judged the day nice.
To model language is not to provide an indistinguishable language-like distribution of text tokens, but rather, for an agent to use language to express ideas caused by their internal states + the world.
In the case of sorting numbers, the tokens themselves have the property (ie., mathematical properties such as ranking are had by ranked tokens). So learning the distribution is learning the property of interest.
This is why no papers which demonstrate NNs "have representations" etc. which appeal to formal properties the data itself has, are even relevant to the discussion. Yet, all this "world model, algorithm, blah blah" said of NNs, is only ever shown using data whose "unsupervised model" constitutes the property of interest.
Statistical models of the distributions of tokens are not models of the data generating process which produces those tokens (unless that process is just the distribution of those tokens). This is obvious from the outset.