From word models to world models

From word models to world models (arxiv.org)

100 points by dimmuborgir 3 years ago | 113 comments

cs702 3 years ago |

After a quick/superficial read, my understanding is that the authors:

(a) induce an LLM to take natural language inputs and generate statements in a probabilistic programming language that formally models concepts, objects, actions, etc. in a symbolic world model, drawing from a large body of research on symbolic AI that goes back to pre-deep-learning days; and

(b) perform inference using the generated formal statements, i.e., compute probability distributions over the space of possible world states that are consistent with and conditioned on the natural-language input to the LLM.
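
A toy sketch of steps (a) and (b), written in plain Python rather than the probabilistic programming language the paper uses. The two-player world model, the function names, and the rejection sampler are all illustrative assumptions, not the authors' code; `alice_beat_bob` stands in for a formal statement an LLM might generate from the sentence "Alice beat Bob".

```python
import random

def sample_world(rng):
    # Hypothetical world model: two players with latent strengths.
    return {"alice": rng.gauss(50, 10), "bob": rng.gauss(50, 10)}

def alice_beat_bob(world):
    # Stand-in for a generated formal statement ("Alice beat Bob").
    return world["alice"] > world["bob"]

def posterior_mean_strength(condition, player, n=20000, seed=0):
    # Rejection sampling: keep only sampled worlds consistent with the condition.
    rng = random.Random(seed)
    worlds = (sample_world(rng) for _ in range(n))
    kept = [w[player] for w in worlds if condition(w)]
    return sum(kept) / len(kept)

prior_mean = posterior_mean_strength(lambda w: True, "alice")
post_mean = posterior_mean_strength(alice_beat_bob, "alice")
print(round(prior_mean, 1), round(post_mean, 1))
```

Conditioning on the observation shifts the estimate of Alice's strength upward, which is the kind of inference over world states that (b) describes.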

If this approach works at a larger scale, it represents a possible solution for grounding LLMs so they stop making stuff up -- an important unsolved problem.

The public repo is at https://github.com/gabegrand/world-models but the code necessary for replicating results has not been published yet.

The volume of interesting new research being done on LLMs continues to amaze me.

We sure live in interesting times!

---

PS. If any of the authors are around, please feel free to point out any errors in my understanding.

skepticATX 3 years ago | |

I have not yet read the paper, but based on this description it seems like it provides grounding in the context of the training data, which is kind of the rub with current LLMs to begin with, right? We don't have a set of high quality training data that is completely unbiased and factual.

agnosticmantis 3 years ago | | |

> … which is kind of the rub with current LLMs to begin with, right?

No, the bigger problem with current LLMs is that even with high quality factual training data, they often generate seemingly plausible nonsense (e.g. cite nonexistent websites/papers as their sources.)

This is by design imo; they’re trained to generate ‘likely’ text, and they do that extremely well. There’s no guarantee for faithful retrieval from a corpus.

cs702 3 years ago | | |

I'd describe it as grounding the model with a formally specified symbolic world model.

andsoitis 3 years ago | |

Humans’ experience and understanding of the world around them isn’t limited to a symbolic representation.

It remains to be seen whether you can truly be an effective intelligence with understanding of the world if all you have are symbols that you have to manipulate.

mjburgess 3 years ago |

It's a surprise to see a paper actually try to solve the problem of modelling thought via language.

Nevertheless, it begins with far too many hedges:

> By scaling to even larger datasets and neural networks, LLMs appeared to learn not only the structure of language, but capacities for some kinds of thinking

There's two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.

It is absolutely trivial to show Hyp2 is false:

> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.

Indeed: because there're no relevant prior cases to sample from in that case.

> These issues make it difficult to evaluate whether LLMs have acquired cognitive capacities such as social reasoning and theory of mind

It doesn't. It's trivial: the disproof lies one sentence above. It's just that many don't like the answer. Such capacities survive trivial permutations -- LLMs do not. So Hypothesis-2 is clearly false.

famouswaffles 3 years ago | |

>It is absolutely trivial to show Hyp2 is false

No it's not

> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.

>Indeed: because there're no relevant prior cases to sample from in that case.

That's not what that tells us. Humans have weird failure modes that look absurd outside the context of evolutionary biology (some still look absurd) and that don't speak to any lack or presence of intelligence or complex thought. Not sure why it's so hard to grasp that LLMs are bound to have odd failure modes regardless of the above.

And trivial here is relative. In my experience, "trivial" often turns out to be trivial in the way a person may not pay close attention to and be similarly tricked.

For instance, GPT-4 might solve a classic puzzle correctly then fail the same puzzle subtly changed. I've found more often than not, simply changing names of variables in the puzzle to something completely different can get it to solve the changed puzzle. It takes memory shortcuts but can be pulled out of that. LLMs have failure modes that look like human failure modes too.

mjburgess 3 years ago | | |

The "failure modes" in humans do not show we lack the capacity.

Eg., do you have the capacity to reason about physics? Well, if you're extremely drunk, less so. But not if I permute the name of the object.

> I've found more often than not, simply changing names of variables

Yes, lol --- why do you think that is?

Because in the digitised dataset of "everything ever written" those names correspond to places in that dataset that can be sampled from by the LLM. Showing Hyp1 to be the case.

P(Hyp1| ChangeNameMakesDifference) >>>>>> P(Hyp2|ChangeNameMakesDifference)

To such a degree that the latter is vanishingly close to zero.
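
The comparison above is a claim about posterior odds. A sketch of the arithmetic in odds form, with made-up priors and likelihoods chosen only for illustration (none of these numbers come from the thread or the paper):

```python
def posterior_odds(prior_h1, prior_h2, lik_h1, lik_h2):
    # Bayes' rule in odds form:
    # P(H1|E) / P(H2|E) = (P(E|H1) / P(E|H2)) * (P(H1) / P(H2))
    return (lik_h1 / lik_h2) * (prior_h1 / prior_h2)

# Assumed values: equal priors; renaming matters often under Hyp1, rarely under Hyp2.
odds = posterior_odds(prior_h1=0.5, prior_h2=0.5, lik_h1=0.9, lik_h2=0.01)
print(odds)
```

The result is driven entirely by the likelihood assigned to the evidence under Hyp2, so the size of the gap depends on how that number is justified.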

sgt101 3 years ago | | |

was with you until:

> look absurd outside the context of evolutionary biology

for humans, everything (everything) is within the context of evolutionary biology!

> LLMs have failure modes that look like human failure modes too.

Yes - because LLMs are trained on 2020 Reddit.

jbay808 3 years ago | |

> It is absolutely trivial to show Hyp2 is false

To investigate precisely this question in a clear and unambiguous way, I trained an LLM from scratch to sort lists of numbers. It learned to sort them correctly, and the entropy is such that it's absolutely impossible that it could have done this by Hyp1 (sampling from similar text in the training set).

https://jbconsulting.substack.com/p/its-not-just-statistics-...

Now, there is room to argue that it applies a world-model when given lists of numbers with a hidden logical structure, but not when given lists of words with a hidden logical structure, but I think the ball is in your court to make that argument. (And to a transformer, it only ever sees lists of numbers anyway).
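
The entropy argument can be made concrete with a back-of-the-envelope count. The list length, value range, and training-set size below are assumptions for illustration only, not figures taken from the linked post:

```python
list_len = 10             # assumed length of each input list
vocab = 100               # assumed number of distinct values per position
train_examples = 10**7    # assumed number of training lists

# Ordered lists with repetition allowed: vocab choices at each position.
possible_inputs = vocab ** list_len
coverage = train_examples / possible_inputs
print(f"{possible_inputs:.1e} possible lists, training covers {coverage:.1e} of them")
```

Under these assumptions almost every test list is one the model has never seen, so exact lookup of a matching training example cannot explain correct outputs. Whether "sampling from similar text" covers looser notions of similarity is a separate question.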

YeGoblynQueenne 3 years ago | | |

Your model is not sorting correctly and it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data (i.e. data not available to the experimenter). And we do this because under PAC-Learning assumptions a learner is said to learn a concept when it can correctly label instances of the concept with some probability of some error. In real-world situations we do not know the true concept, so we test on held-out data to approximate the probability of error.

Bottom line, if you train a model to do a thing and you don't test it carefully to figure out its error, you might claim it's learned something, but in truth, you have no idea what it's learned.

(To clarify: you tested on the train data assuming there's a low probability of overlap. Don't do that if you're trying to understand what your models can do).
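
A minimal sketch of building a test set that is guaranteed disjoint from the training set, for a sorting task like the one discussed. The data generator, list length, and value range are hypothetical, not taken from the linked experiment:

```python
import random

def make_example(rng, length=5, vocab=100):
    # One (input list, sorted list) pair, stored as tuples so it can go in a set.
    xs = tuple(rng.randrange(vocab) for _ in range(length))
    return xs, tuple(sorted(xs))

def disjoint_split(n_train, n_test, seed=0):
    rng = random.Random(seed)
    train = {make_example(rng) for _ in range(n_train)}
    test = set()
    while len(test) < n_test:
        ex = make_example(rng)
        if ex not in train:   # reject anything already seen in training
            test.add(ex)
    return train, test

train, test = disjoint_split(1000, 200)
print(len(train & test))  # 0 by construction
```

Error measured on `test` then estimates error on unseen inputs, rather than relying on a low probability of overlap.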

mjburgess 3 years ago | | |

So this is a really good starting point -- but you haven't formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something".

Formally, what hypotheses are you comparing? What do you think the specific hypothesis of the "AI = stats" person is? It isn't that the NN literally remembers data tokens, right?

In any case:

The issue with forcing NNs to model mathematical features is that the structure of the data itself has those properties. So the distributional hypothesis is true for sorting ordinals.

But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like...".

redox99 3 years ago | |

If it's "absolutely trivial" to show that LLMs don't have the capacity to form thought, then please publish a paper proving that. So all the "stupid" people studying LLMs that can't come up with such trivial proofs can move on to other stuff.

parpfish 3 years ago | | |

"I have a truly marvelous demonstration that LLMs don't have the capacity to form thought which this margin is too narrow to contain."

mjburgess 3 years ago | | |

You may wish to read the paper above. But if you want a quick proof:

1. A thought is a representation of a situation

2. A representation generates entailments of that situation

3. Language is many-to-one translation from these representations to symbols

4. Understanding language is reversing these symbols into thoughts (ie., reprs)

So,

5. If agent A understands sentence X then A forms the relevant representation of X.

6. If an agent has a representation of S, it can state entailments of S (eg., counterfactuals).

Now, split X into Xc = "canonical descriptions of S" and trivial permutations Xp.

(st. distribution of Xc,Xp is low, but the tokens of Xp are common)

Form entailments of X, say Y -- sentences that are canonically implied by the truth of X.

7. If the LLM understood that X entails Y, it would be via constructing the repr S -- which entails Y regardless of which sentence in X was used.

8. Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.

9. Since using Xp sentences causes it to fail, it does not predict Y via S.

QED.

And we can say,

1. Appearing to judge Y entailed-by X is possible via simple sampling of (X, Y) in historical cases.

2. LLMs are just such a sampling.

so,

3. +Inference to the best explanation:

4. LLMs sample historical cases rather than form representations.

Incidentally, "sampling of historical cases" is already something we knew -- so this entire argument is basically unnecessary. And only necessary because PhDs have been turned into start-up hype men.

rytill 3 years ago | |

I don't think you really disproved anything. You're just saying another hypothesis. Often, LLMs produce impressive results on domains that aren't in the training set.

sgt101 3 years ago | | |

>LLMs produce impressive results on domains that aren't in the training set.

How do we know? Who knows what they're trained on?

canjobear 3 years ago | |

> it's sampling from similar text which is distributed so-as-to-express a thought by some agent;

Your hypotheses 1 and 2 are not so different when you consider that the similarity function used to match text in the training data must be highly nontrivial. If it were not, then things like GPT-3 would have been possible a long time ago. As a concrete example, LLMs can do decent reasoning entirely in rot13; the relevant rot13'ed text is likely very rare in their training data. The fact that the similarity function can "see through" rot13 means that it can in principle include nontrivial computations.
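
For reference, rot13 is a fixed letter substitution, so the surface tokens change completely while the logical content is untouched. A small illustration using Python's standard `codecs` module (the example sentence is made up):

```python
import codecs

plain = "If all cats are mammals, is Tom the cat a mammal?"
encoded = codecs.encode(plain, "rot_13")    # each letter shifted by 13 places
decoded = codecs.decode(encoded, "rot_13")  # applying it again restores the text
print(encoded)
print(decoded == plain)
```

A matcher that only compared surface strings would find nothing in common between `plain` and `encoded`.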

TeMPOraL 3 years ago | |

> There's two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.

There's also another hypothesis: Hyp3 -- that Hyp1 and Hyp2 converge as the LLM is scaled up (more training data, more dimensions in the latent space), and in the limit become equivalent.

mjburgess 3 years ago | | |

They're indistinguishable via naive measurement (prompting) if the LLM can sample from all possible data: there's a very large infinity of (Q, A, time) triples (ie., it's real-valued).

But it cannot, since most of those are in the future.

bryan0 3 years ago | |

Failing on "trivial alterations to the same underlying domain" is not a disproof of thought.

Your argument also implies hyp1 and 2 are exclusive, clearly both can be true, and in fact must be true, unless you are claiming that you do not "sample" from similar language to express your own thoughts? Where does your language come from then, if not learning from previous experience?

anankaie 3 years ago | | |

While I agree with you on the relation of GP's Hyp1 and Hyp2, you are making an unfounded assumption of a sampling process being necessary to perform human speech. I do not believe we have the understanding of how thought is represented in the human brain to make that judgement. In other words, just because sampling from a distribution can produce human-like text does not mean that it is the only way to do that, and thus that it must be the way that humans produce text, spoken or written.

mirekrusin 3 years ago | |

"Trivial to show" as in it's trivial to show that addition on uint8 doesn't work ie. 250+250?

fiso64 3 years ago | |

Don't try to ham-fist scientific sounding wording into your (very unscientific) argument. This is not a disproof of anything because you failed to define what it means to have the ability to form rational thoughts. With a definition, you would then wanna prove this for humans as a sanity check: Do we never make stupid mistakes? Ok, we make fewer of those than LLMs. Then what is the threshold for accuracy after which you consider a system to be intelligent? Do all humans pass that threshold, or do kids or people with a lower than average IQ fail?

mjburgess 3 years ago | | |

This entire paper is written as a disproof of the distributional hypothesis. If you want to understand why it's a profoundly unhelpful pseudoscientific idea, this paper is a good start.

The test for a capacity C in a system1 has nothing to do with proxy measures of that capacity in system2.

The capacity for an oven to cook food may be measured by how much smoke it lets off when burning -- but no amount of "smoke" establishes that a dry ice machine can cook.

This type of "engineering thinking" is pseudoscience.

mdp2021 3 years ago | | |

> humans

There is intelligent thought and action, and there is unintelligent thought and action. Intelligent is that "which checked" (intus-legere); the other, the """impulsive""", is not.

mjburgess 3 years ago |

The level of understanding of the problem that this paper expresses is extraordinary in my reading of this field --- it's a genuinely amazing synthesis.

> How could the common-sense background knowledge needed for dynamic world model synthesis be represented, even in principle? Modern game engines may provide important clues.

This has often been my starting point in modelling the difference between a model-of-pixels vs. a world model. Any given video game session can be "replayed" by a model of its pixels: but you cannot play the game with such a model. It does not represent the causal laws of the game.

Even if you had all possible games you could not resolve between player-caused and world-caused frames.

> A key question is how to model this capability. How do minds craft bespoke world models on the fly, drawing in just enough of our knowledge about the world to answer the questions of interest?

This requires a body: the relevant information missing is causal, and the body resolves P(A|B) and P(A|B->A) by making bodily actions interpreted as necessarily causal.

In the case of video games, since we hold the controller, we resolve P(EnemyDead|EnemyHit) vs. P(EnemyDead| (ButtonPress ->) EnemyHit -> EnemyDead)
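
A small simulation of that distinction: conditioning on an observed hit versus forcing the hit yourself. The hidden `weak_enemy` confounder and all probabilities are invented for illustration; nothing here comes from the paper.

```python
import random

def p_dead_given_hit(n=100000, seed=0, intervene=None):
    rng = random.Random(seed)
    hits = dead_after_hit = 0
    for _ in range(n):
        weak_enemy = rng.random() < 0.5   # hidden confounder
        if intervene is None:
            # Observation only: weak enemies happen to get hit more often.
            hit = rng.random() < (0.8 if weak_enemy else 0.2)
        else:
            # Intervention: we hold the controller and decide the hit ourselves.
            hit = intervene
        p_dead = (0.5 if hit else 0.0) + (0.4 if weak_enemy else 0.0)
        dead = rng.random() < p_dead
        if hit:
            hits += 1
            dead_after_hit += dead
    return dead_after_hit / hits

observed = p_dead_given_hit()                # P(EnemyDead | EnemyHit)
forced = p_dead_given_hit(intervene=True)    # P(EnemyDead | do(EnemyHit))
print(round(observed, 2), round(forced, 2))
```

Passive observation overstates the effect of hitting because weak enemies are both hit more and die more; only the intervention isolates the causal contribution of the hit.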

antiquark 3 years ago |

I doubt that word models can lead to world models. To quote Yann LeCun:

"The vast majority of our knowledge, skills, and thoughts are not verbalizable. That's one reason machines will never acquire common sense solely by reading text."

https://twitter.com/ylecun/status/1368235803147649028

gibsonf1 3 years ago |

Unfortunately, this effort fully misses the boat. Human cognition is about concepts, not language, and that's where one must start to understand it. Language simply serializes our conceptual thinking in multiple language formats; the key is what's being serialized and how that actually works in conceptual awareness.

buzzy_hacker 3 years ago | |

Maybe they can’t be so fully separated. https://en.m.wikipedia.org/wiki/Linguistic_relativity

gibsonf1 3 years ago | | |

I think the key point is that serialized words symbolize concepts and other logic such that if you can't retrieve that concept into your awareness, you will not understand the word. Learning and forming the concepts comes prior to attaching common word symbols to them based on the region you live in. So if you start with words, you never get anywhere, hence the complete lack of any intelligence in the LLM approach.

canjobear 3 years ago | |

Read more carefully. Their "language of thought" is not a natural language, it's a variant of lambda calculus with probabilistic semantics.

gibsonf1 3 years ago | | |

Right, derived from word pattern statistics. The CYC project tried first-order predicate calculus with complete failure. This is not how we think or how conceptual awareness works. The key giveaway is what they don't talk about: Concepts.

dimatura 3 years ago |

This is really interesting. The title references the "Language of Thought" hypothesis from early cognitive psychology, which posited that thought consists of symbol manipulation akin to computer programs. The same idea was also behind what is often referred to as GOFAI. But the idea has largely fallen out of fashion in both psychology and AI. There's a twist here in the "probabilistic" part, and of course the surprising success of LLMs makes this a more compelling idea than it would've been only a couple of years ago. And there's also an acknowledgement of the need for some kind of sensorimotor grounding as well. Pretty cool!

ilaksh 3 years ago |

So they are using GPT-4 to write Lisp? Or some probabilistic language that looks like Lisp.

They keep saying LLMs but only GPT-4 can do it at that level. Although actually some of the examples were pretty basic so I guess it really depends on the level of complexity.

I feel like this could be really useful in cases where you want some kind of auditable and machine interpretable rationale for doing something. Such as self driving cars or military applications. Or maybe some robots. It could make it feasible to add a layer of hard rules in a way.

mercurialsolo 3 years ago |

Humans come in all shapes and forms of sensory as well as cognitive abilities. Our true ability to be human comes from objectives (derived from biological and socially bound complex systems) that drive us, feedback loops (ability to morph / affect the goals) and continuous sensory capabilities.

Reasoning is just prediction with memory towards an objective.

Once large models have these perpetual operating sensory loops with objective functions, the ability to distinguish model powered intelligence and human like intelligence tends to drop.

wilonth 3 years ago |

Was excited for a moment, thought it was related to this https://worldmodels.github.io/.

World models are meant for simulating environments. If this were something like testing whether a game agent with an LLM can form thoughts as it plays through some game, it would be very interesting. Maybe someone on HN can do this?

Philpax 3 years ago | |

Check out https://voyager.minedojo.org/, which uses an LLM to play Minecraft.

wilonth 3 years ago | |

"Hush hush, I'm gonna sacrifice the queen to do a surprise checkmate!" Agent said

antisthenes 3 years ago |

World modeling is impossible without sensory input.

You need constant modeling of touch/smell/vision/temperature, etc.

These senses give us an actual understanding of the physical world and drive our behavior in a way that pure language will never be able to.

stevenhuang 3 years ago | |

A facsimile of sufficient equivalence to the world models we derive from our 5 senses may be approached through derivation of descriptive language only.

"sufficient equivalence" is important because sure it may not _really_ know the color of red or the qualia of being, but if for all intents and purposes the LLM's internal model provides predictive power and answers correctly as if it does have a world model, then what is the difference?

esafak 3 years ago | | |

That's not how physics works. We understand the world by interacting with it. How do you know your internal model is right until it is tested in reality?

sgt101 3 years ago |

I : hhmmppp a paper from Tenenbaum's group, let's read.

Paper : Hi! I am 94 pages long.

I : omg...