Diffusion models are real-time game engines (gamengen.github.io)

1149 points by jmorgan 1 year ago | 409 comments

vessenes 1 year ago |

So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.

That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.

Anyway, a fun idea that worked! Love those.

wavemode 1 year ago | |

> Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).

This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.

dewarrn1 1 year ago | | |

Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...

nmstoker 1 year ago | | |

I saw a longer video of this that Ethan Mollick posted and in that one, the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments but you only get a few seconds of each.

If I studied the longer one more closely, I'm sure inconsistencies would be seen but it seemed able to recall presence/absence of destroyed items, dead monsters etc on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.

whiteboardr 1 year ago | | |

But does it need to be frame-based?

What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)

mensetmanusman 1 year ago | | |

That is kind of cool though, I would play like being lost in a dream.

If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.

codeflo 1 year ago | | |

Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.

hoosieree 1 year ago | | |

Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.

TeMPOraL 1 year ago | | |

So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without outside reference, everything that falls out of short-term memory (few frames here) gets reimagined.

Groxx 1 year ago | | |

There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)

Workaccount2 1 year ago | | |

I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.

nielsbot 1 year ago | | |

You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.

alickz 1 year ago | | |

is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?

refibrillator 1 year ago | |

Just want to clarify a couple possible misconceptions:

The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause/effect. It just renders one frame at a time (after all it’s a text to image model, not text to video). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.

Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.

De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.

In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
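For anyone who wants to see the shape of that trick, here is a minimal sketch, assuming frames are flat lists of floats. The name `corrupt_context` and the `max_sigma` value are made up for illustration, and passing the sampled noise level on to the model is an assumption, not something taken from the paper's code.

```python
import random

def corrupt_context(frames, max_sigma=0.7, rng=None):
    """Add Gaussian noise of one randomly sampled level to every context frame.

    Returns the noisy frames plus the sampled level, so the level could be
    given to the model as extra conditioning (an assumption in this sketch).
    """
    rng = rng or random.Random()
    sigma = rng.uniform(0.0, max_sigma)  # illustrative range, not a tuned value
    noisy = [[pixel + rng.gauss(0.0, sigma) for pixel in frame] for frame in frames]
    return noisy, sigma

# Three tiny 4-pixel "frames" standing in for the previous-frame context.
context = [[0.5] * 4 for _ in range(3)]
noisy_context, sigma = corrupt_context(context, rng=random.Random(0))
```

During training the model would be asked to predict the clean next frame from `noisy_context`, so at inference time it has already seen imperfect inputs like its own slightly-off outputs.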

rvnx 1 year ago | | |

The concept is that you train a diffusion model by feeding it all the possible frames seen in the game.

The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.

Now you show it N frames as input and ask it "give me frame N+1", and it gives you frame N+1 back based on how it was originally seen during training.

But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from a past database.

The drift you mentioned is actually a clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.

It's a bit like training Stable Diffusion on Simpsons episodes: it outputs the next frame of an existing episode that was in the training set, but a few frames later goes wild and buggy.

nine_k 1 year ago | |

But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

I would call it the world's least efficient video compression.

What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?

PoignardAzur 1 year ago | | |

> But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.

So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.

Sharlin 1 year ago | | |

No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.

taneq 1 year ago | | |

It's more like the Tetris Effect, where the model has seen so much Doom that it confabulates gameplay.

TeMPOraL 1 year ago | | |

It's a memory of a video looped to controls, so frame 1 is "I wonder how it would look if the player pressed D instead of W", then frame 2 is based on frame 1, etc., and a couple of frames in, it's already not remembering, but imagining the gameplay on the fly. It's not prerecorded, it responds to inputs during generation. That's what makes it a game engine.
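A toy sketch of that loop, with a hypothetical `toy_model` standing in for the diffusion model. Here a "frame" is just an integer and the context length of 4 is arbitrary, so this only illustrates the feedback structure, not the real system.

```python
from collections import deque

def toy_model(past_frames, past_actions):
    """Hypothetical stand-in for the denoiser: shift the last frame by the latest action."""
    step = {"W": 1, "S": -1}.get(past_actions[-1], 0)
    return past_frames[-1] + step

def play(first_frame, actions, context_len=4):
    """Generate one frame per action, feeding each output back in as context."""
    frames = deque([first_frame], maxlen=context_len)
    recent_actions = deque(maxlen=context_len)
    shown = []
    for action in actions:
        recent_actions.append(action)
        next_frame = toy_model(list(frames), list(recent_actions))
        frames.append(next_frame)  # the model's own output becomes future input
        shown.append(next_frame)
    return shown

print(play(0, ["W", "W", "D", "S"]))  # prints [1, 2, 2, 1]
```

Because `frames` has a fixed `maxlen`, anything older than the window is simply gone, which is where the "reimagined" geometry comes from.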

mensetmanusman 1 year ago | | |

They could down convert the entire model to only utilize the subset of matrix components from stable diffusion. This approach may be able to improve internet bandwidth efficiency assuming consumers in the future have powerful enough computers.

WithinReason 1 year ago | | |

If it's trained on absolute player coordinates then it would likely just morph into the known map at those coordinates.

pradn 1 year ago | |

> Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.

Even Google only has one Gemini (in a few versions).

fennecfoxy 1 year ago | |

If anybody, Google would know most about that, after their LLM memo all that time ago (basically "we're losing because we're trying to fight/compete with OS models"): https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

raghavbali 1 year ago | |

Nicely summarised. Another important thing that clearly stands out (not to undermine the efforts and work gone into this) is the fact that more and more we are now seeing larger and more complex building blocks emerging (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together for even more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.

Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).

PS: Not great examples, but I hope you get the idea ;)

bubaumba 1 year ago | |

> nice reminder that open models are useful to

You didn't say open _what_ models. Was that intentional?

Philpax 1 year ago | | |

They did, SD 1.4

wkcheng 1 year ago |

It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.

Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

SeanAnderson 1 year ago |

After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly, but, to me, the way the abstract is worded heavily implied this was occurring.

It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.

There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.

zzanz 1 year ago |

The quest to run doom on everything continues. Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement? I just find it funny that on a linear scale of hardware specification, Doom now finds itself on both ends.

fngjdflmdflg 1 year ago | |

>Technically speaking, isn't this the greatest possible anti-Doom

When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.

ynniv 1 year ago | | |

It's dreaming Doom.

bugglebeetle 1 year ago | | |

Pierre Menard, Author of Doom.

Terr_ 1 year ago | |

> the Doom with the highest possible hardware requirement?

Isn't that possible by setting arbitrarily high goals for ray-cast rendering?

Vecr 1 year ago | |

It's the No-Doom.

WithinReason 1 year ago | | |

Undoom?

x-complexity 1 year ago | |

> Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement?

Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.

The next step of anti-Doom would be a model generating the model, generating the Doom output.

nurettin 1 year ago | | |

Isn't this technically a model (training step) generating a model (a neural network) generating Doom output?

yuchi 1 year ago | | |

“…now it can implement Doom!”

rldjbpin 1 year ago | |

to me the closer analogy here is the "running minecraft inside minecraft" (https://news.ycombinator.com/item?id=32901461)

godelski 1 year ago |

Doom system requirements:

  - 4 MB RAM
  - 12 MB disk space

Stable diffusion v1

  > 860M UNet and CLIP ViT-L/14 (540M)
  Checkpoint size:
    4.27 GB
    7.7 GB (full EMA)
  Running on a TPU-v5e
    Peak compute per chip (bf16)  197 TFLOPs
    Peak compute per chip (Int8)  393 TFLOPs
    HBM2 capacity and bandwidth  16 GB, 819 GBps
    Interchip Interconnect BW  1600 Gbps

This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.
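Rough arithmetic behind "hundreds of times over", as a sketch: it reads the smaller checkpoint figure above as 4.27 gigabytes, uses decimal units, and compares it only against the 12 MB disk requirement. All three are simplifying assumptions.

```python
DOOM_DISK_MB = 12        # disk requirement listed above
CHECKPOINT_GB = 4.27     # smaller Stable Diffusion v1 checkpoint listed above

checkpoint_mb = CHECKPOINT_GB * 1000   # decimal units, an assumption
ratio = checkpoint_mb / DOOM_DISK_MB

print(f"checkpoint is about {ratio:.0f}x Doom's disk footprint")
```

Checkpoint bytes are not the same thing as usable memorization capacity, so this is only an order-of-magnitude comparison.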

What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get considering pretrained models and the ViZDoom environment? Was Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

- https://cloud.google.com/tpu/docs/v5e

- https://github.com/Farama-Foundation/ViZDoom

- https://zdoom.org/index

Sohcahtoa82 1 year ago |

It's always fun reading the dead comments on a post like this. People love to point out how pointless this is.

Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.

Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your death beds wishing you had allowed yourself to have more fun.

HellDunkel 1 year ago |

Although impressive, I must disagree. Diffusion models are not game engines. A game engine is a component to propel your game (along the time axis?). In that sense it is similar to the engine of a car, hence the name. It does not need a single working car nor a road to drive on to do its job. The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.

refibrillator 1 year ago |

There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.

IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.

danjl 1 year ago |

So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?

dtagames 1 year ago |

A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

kqr 1 year ago | |

> even rules which are not visible on-screen.

If a rule was changed but it's never visible on the screen, did it really change?

> It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

znx_0 1 year ago | | |

> If a rule was changed but it's never visible on the screen, did it really change?

Well for "some" games it does really change

darby_nine 1 year ago | | |

> Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

It's much simpler than actually creating a game....

throwthrowuknow 1 year ago | |

They only trained it on one game and only embedded the control inputs. You could train it on many games and embed a lot more information about each of them which could possibly allow you to specify a prompt that would describe the game and then play it.

calebh 1 year ago | |

One thing I'd like to see is to take a game rendered with low poly assets (or segmented in some way) and use a diffusion model to add realistic or stylized art details. This would fix the consistency problem while still providing tangible benefits.

momojo 1 year ago | |

The title should be "Diffusion Models can be used to render frames given user input"

sharpshadow 1 year ago | |

So all it did is generate a video of the gameplay which is slightly different from the video it used for training?

TeMPOraL 1 year ago | | |

No, it implements a 3D FPS that's interactive, and renders each frame based on your input and a lot of memorized gameplay.

alkonaut 1 year ago |

The job of the game engine is also to render the world given only the world's properties (textures, geometries, physics rules, ...), and not given "training data that had to be supplied from an already written engine".

I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.

helloplanets 1 year ago |

So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

Given a sufficient enough separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.

To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent on what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.

I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
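A minimal sketch of what that hash-to-snapshot mapping could look like, assuming inputs are button names and a snapshot is a small dict. `sequence_key`, `record`, `lookup` and the Sokoban-style state are all hypothetical names invented for illustration.

```python
import hashlib

def sequence_key(inputs):
    """Hash an ordered sequence of button presses into a short hex key."""
    joined = ",".join(inputs).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()[:16]

# Hypothetical table: input-sequence hash -> small snapshot of engine state.
snapshots = {}

def record(inputs, state):
    """Store the state reached after playing exactly this input sequence."""
    snapshots[sequence_key(inputs)] = state

def lookup(inputs):
    """Return the stored state for this input sequence, or None if unseen."""
    return snapshots.get(sequence_key(inputs))

record(["RIGHT", "RIGHT", "UP"], {"player": (2, 1), "boxes": [(3, 1)]})
state = lookup(["RIGHT", "RIGHT", "UP"])
```

The catch is that with k buttons there are k^n distinct sequences of length n, so a literal table only works for short sequences or for games where many sequences collapse to the same state.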

toppy 1 year ago | |

I think you've just encoded the title of the paper

panki27 1 year ago |

> Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.

arc-in-space 1 year ago | |

This, watching the generated clips feels uncomfortable, like a nightmare. Geometry is "swimming" with camera movement, objects randomly appear and disappear, damage is inconsistent.

The entire thing would probably crash and burn if you did something just slightly unusual compared to the training data, too. People talking about 'generated' games often seem to fantasize about an AI that will make up new outcomes for players that go off the beaten path, but a large part of the fun of real games is figuring out what you can do within the predetermined constraints set by the game's code. (Pen-and-paper RPGs are highly open-ended, but even a Game Master needs to sometimes protect the players from themselves; whereas the current generation of AI is famously incapable of saying no.)

aithrowaway1987 1 year ago | |

I also noticed that they played AI DOOM very slowly: in an actual game you are running around like a madman, but in the video clips the player is moving in a very careful, halting manner. In particular the player only moves in straight lines or turns while stationary, they almost never turn while running. Also didn't see much strafing.

I suspect there is a reason for this: running while turning doesn't work properly and makes it very obvious that the system doesn't have a consistent internal 3D view of the world. I'm already getting motion sickness from the inconsistencies in straight-line movement, I can't imagine turning is any better.

freestyle24147 1 year ago | |

It made me laugh. Maybe they pulled random people from the hallway who had never seen the original Doom (or any FPS), or maybe only selected people who wore glasses and forgot them at their desk.

meheleventyone 1 year ago | |

It's telling IMO that they only want people's opinions based on our notoriously faulty memories rather than sitting comparable situations next to one another in the game and simulation then analyzing them. Several things jump out watching the example video.

GaggiX 1 year ago | | |

>rather than sitting comparable situations next to one another in the game and simulation then analyzing them.

That's literally how the human rating was set up if you read the paper.

golol 1 year ago |

What I understand is the following: If this works so well, why didn't we have good video generation much earlier? After diffusion models were seen to work, the most obvious thing to do was to generate the next frame based on previous frames but... it took 1-2 years for good video models to appear. For example, compare Sora generating Minecraft video versus this method generating Minecraft video. Say in both cases the player is standing on a meadow with few inputs and watching some pigs. In the Sora video you'd expect the typical glitches to appear, like erratic, sliding movement, overlapping legs, multiplication of pigs etc. Would these glitches not appear in the GameNGen video? Why?

Closi 1 year ago | |

Because video is much more difficult than images (it's lots of images that have to be consistent across time, with motion following laws of physics etc), and this is much more limited in terms of scope than pure arbitrary video generation.

golol 1 year ago | | |

This misses the point, I'm comparing two methods of generating minecraft videos.

pantalaimon 1 year ago | |

I would have thought it is much easier to generate huge amounts of game footage for training, but as I understand this is not what was done here.

mo_42 1 year ago |

An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?

radarsat1 1 year ago | |

There has definitely been research for simulating physics based on observation, especially in fluid dynamics but also for rigid body motion and collision. It's important for robotics applications actually. You can bet people will be applying this technique in those contexts.

I think for real world application one challenge is going to be the "action" signal which is a necessary component of the conditioning signal that makes the simulation reactive. In video games you can just record the buttons, but for real world scenarios you need difficult and intrusive sensor setups for recording force signals.

(Again for robotics though maybe it's enough to record the motor commands, just that you can't easily record the "motor commands" for humans, for example)

cubefox 1 year ago | |

A popular theory in neuroscience is that this is what the brain does:

https://slatestarcodex.com/2017/09/05/book-review-surfing-un...

It's called predictive coding. By trying to predict sensory stimuli, the brain creates a simplified model of the world, including common sense physics. Yann LeCun says that this is a major key to AGI. Another one is effective planning.

But while current predictive models (autoregressive LLMs) work well on text, they don't work well on video data, because of the large outcome space. In an LLM, text prediction boils down to a probability distribution over a few thousand possible next tokens, while there are several orders of magnitude more possible "next frames" in a video. Diffusion models work better on video data, but they are not inherently predictive like causal LLMs. Apparently this new Doom model made some progress on that front though.
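A back-of-the-envelope for the size gap, under assumed numbers: a 5,000-token vocabulary for "a few thousand" and a 320x240 RGB frame with 8 bits per channel. Neither figure comes from the paper. Counting raw pixel combinations like this ignores that almost all of them are implausible frames.

```python
import math

# Assumed, illustrative numbers (not from the comment or the paper):
VOCAB_SIZE = 5_000                  # "a few thousand" possible next tokens
WIDTH, HEIGHT, CHANNELS = 320, 240, 3
BITS_PER_CHANNEL = 8

# Work in log10 so the frame count never has to be materialised.
token_digits = math.log10(VOCAB_SIZE)
frame_digits = WIDTH * HEIGHT * CHANNELS * BITS_PER_CHANNEL * math.log10(2)

print(f"next-token outcomes: about 10^{token_digits:.1f}")
print(f"next-frame outcomes: about 10^{frame_digits:.0f}")
```

Under these assumptions the exponent for frames is in the hundreds of thousands, which is why an explicit probability for every possible next frame is not an option the way it is for tokens.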

ccozan 1 year ago | | |

However, this is due to how we actually digitize video. From a human point of view, looking in my room reduces the load to the _objects_ in the room and everything else is just noise (like the color of the wall could be just a single item to remember, while otherwise in the digital world, it needs to remember all the pixels)

icoder 1 year ago |

This is impressive. But at the same time, it can't count. We see this every time, and I understand why it happens, but it is still intriguing. We are so close or in some ways even way beyond, and yet at the same time so extremely far away, from 'our' intelligence.

(I say it can't count because there are numerous examples where the bullet count glitches, it goes right impressively often, but still, counting, being up or down, is something computers have been able to do flawlessly basically since forever)

(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)

marci 1 year ago | |

'our' intelligence may not be the best thing we can make. It would be like trying to only make planes that flap their wings or trucks with legs. A bit like using an LLM to do multiplication. Not the best tool. Biomimicry is great for inspiration, but shouldn't be a 1-to-1 copy, especially in a different scale and medium.

icoder 1 year ago | | |

Sure, although I still think a system with less of a contrast between how well it performs 'modally' and how bad it performs incidentally, would be more practical.

What I wonder is whether LLMs will inherently always have this dichotomy and we need something 'extra' (reasoning, attention or something less biomimicried), or whether this will eventually resolve itself (to an acceptable extent) when they improve even further.

lIl-IIIl 1 year ago |

How does it know how many times it needs to shoot the zombie before it dies?

Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

I love how the player's health goes down when he stands in the radioactive green water.

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

meheleventyone 1 year ago | |

> I love how the player's health goes down when he stands in the radioactive green water.

This is one of the bits that was weird to me, it doesn't work correctly. In the real game you take damage at a consistent rate, in the video the player doesn't, and whether the player takes damage or not seems highly dependent on some factor that isn't whether or not the player is in the radioactive slime. My thought is that it's learnt something else that correlates poorly.

golol 1 year ago | |

It gets a number of previous frames as input I think.
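A toy illustration of why the window size matters for something like hit counts, assuming each frame can be labelled "hit" or "idle" (a made-up encoding) and `context_len` is at least 1. This is one possible explanation for the random-looking deaths, not something verified against the model.

```python
def hits_visible(history, context_len):
    """Count 'hit' frames still inside the model's last `context_len` frames."""
    window = history[-context_len:]
    return sum(1 for frame in window if frame == "hit")

history = ["idle", "hit", "idle", "idle", "hit", "idle"]

print(hits_visible(history, 1))  # prints 0
print(hits_visible(history, 3))  # prints 1
print(hits_visible(history, 6))  # prints 2
```

With a one-frame window no earlier shot is visible at all, and even a longer window forgets hits once they scroll out of it.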

lupusreal 1 year ago | |

> In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

They trained this thing on bot gameplay, so I bet it does poorly when advanced strategies like deliberately inducing mob infighting are employed (the bots probably didn't do that a lot, if at all.)

masterspy7 1 year ago |

There's been a ton of work to generate assets for games using AI: 3d models, textures, code, etc. None of that may even be necessary with a generative game engine like this! If you could scale this up, train on all games in existence, etc. I bet some interesting things would happen

rererereferred 1 year ago | |

But can you grab what this AI has learned and generate the 3d models, maps and code to turn it into an actual game that can run on a user's PC? That would be amazing.

passion__desire 1 year ago | | |

Jensen Huang's vision that future games will be generated rather than rendered is coming true.

kleiba 1 year ago | | |

What would be the point? This model has been trained on an existing game, so turning it back into assets, maps, and code would just give you a copy of the original game you started with. I suppose you could create variations of it then... but:

You don't even need to do all of that - this trained model already is the game, i.e., it's interactive, you can play the game.

whamlastxmas 1 year ago | |

I would absolutely love if they could take this demo, add a new door that isn’t in the original, and see what it generates behind that door

nolist_policy 1 year ago |

Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?

zbendefy 1 year ago | |

I think some state is also being given (or if it's not, it could be given) to the network, like 3d world position/orientation of the player, that could help the neural network anchor the player in the world.

smusamashah 1 year ago |

Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

I noticed a few hallucinations, e.g. when it picked up a green jacket from a corner, walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.

kqr 1 year ago | |

> Is it possible to break the camera free and roam around the map freely and view it from different angles?

I would assume only if the training data contained this type of imagery, which it did not. The training data (from what I understand) consisted only of input+video of actual gameplay, so that is what the model is trained to mimic.

This is like a dog that has been trained to form English words – what's impressive is not that it does it well, but that it does it at all.

Sohcahtoa82 1 year ago | |

> Therefore I don't think it has any clue about the 3D world of the game at all.

AI models don't "know" things at all.

At best, they're just very fuzzy predictors. In this case, given the last couple frames of video and a user input, it predicts the next frame.

It has zero knowledge of the game world, game rules, interactions, etc. It's merely a mapping of [pixels, input] -> pixels.
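As a rough illustration of that mapping (a toy sketch, not GameNGen's code): the loop below feeds a bounded window of past frames and actions into a predictor and appends whatever comes back. `rollout`, `toy_predictor`, and `context_len` are hypothetical names, and frames are plain tuples of ints rather than images.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Frame = Tuple[int, ...]  # stand-in for an image: a flat tuple of pixel values
Action = int             # stand-in for one player input

Predictor = Callable[[Sequence[Frame], Sequence[Action]], Frame]


def rollout(
    predict_next: Predictor,
    first_frame: Frame,
    actions: Sequence[Action],
    context_len: int = 4,  # hypothetical window size
) -> List[Frame]:
    """Generate frames one at a time, each from recent frames plus recent actions."""
    # deque(maxlen=...) silently drops the oldest entry once the window is full,
    # so anything older than `context_len` steps is invisible to the predictor.
    frames: Deque[Frame] = deque([first_frame], maxlen=context_len)
    past_actions: Deque[Action] = deque(maxlen=context_len)
    generated = [first_frame]
    for action in actions:
        past_actions.append(action)
        next_frame = predict_next(tuple(frames), tuple(past_actions))
        frames.append(next_frame)  # the model's own output becomes future input
        generated.append(next_frame)
    return generated


def toy_predictor(frames: Sequence[Frame], actions: Sequence[Action]) -> Frame:
    """Toy rule: add the latest action to every pixel of the latest frame."""
    return tuple(pixel + actions[-1] for pixel in frames[-1])


print(rollout(toy_predictor, (0, 0), [1, 2, 3]))
# [(0, 0), (1, 1), (3, 3), (6, 6)]
```

The only "memory" in this loop is the window itself, which is why state that scrolls out of it cannot be recovered.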

ravetcofx 1 year ago |

There is going to be a flood of these dreamlike "games" in the next few years. This feels like a bit of a breakthrough in the engineering of these systems.

Kapura 1 year ago |

What is useful about this? I am a game programmer, and I cannot imagine a world where this improves any part of the development process. It seems to me to be a way to copy a game without literally copying the assets and code; plagiarism with extra steps. What am I missing?

arduinomancer 1 year ago |

How does the model “remember” the whole state of the world?

Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?

a_e_k 1 year ago | |

Watch closely in the videos and you'll see that enemies often respawn when offscreen and sometimes when onscreen. Destroyed barrels come back, ammo count and health fluctuate weirdly, etc. It's still impressive, but it's not perfect in that regard.

Sharlin 1 year ago | | |

Not unlike in (human) dreams.

Jensson 1 year ago | |

It doesn't even remember the state of the game you look at. Doors spawning right in front of you, particle effects turning into enemies mid flight etc, so just regular gen AI issues.

Edit: Can see this in the first 10 seconds of the first video under "Full Gameplay Videos", stairs turning to corridor turning to closed door for no reason without looking away.

csmattryder 1 year ago | | |

There's also the case in the video (0:59) where the player jumps into the poison but doesn't take damage for a few seconds then takes two doses back-to-back - they should've taken a hit of damage every ~500-1000ms(?)

Guessing the model hasn't been taught enough about that, because most people don't jump into hazards.

raincole 1 year ago | |

It doesn't. You need to put the world state in the input (the "prompt", even if it doesn't look like a prompt in this case). Whatever is not in the prompt is lost.

rldjbpin 1 year ago |

this is truly a cool demo, but a very misleading title.

to me it seems like a very bruteforce or greedy way to give the impression to a user that they are "playing" a game. the difference being that you already own the game to make this possible, but cannot let the user use that copy!

using generative AI for game creation is at a nascent stage but there are much more elegant ways to go about the end goal. perhaps in the future with computing so far ahead that we moved beyond the current architecture, this might be worth doing instead of emulation perhaps.

dabochen 1 year ago |

So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

If so, is it more like imagination/hallucination rather than rendering?

famouswaffles 1 year ago | |

It's conditioned on previous frames AND player actions so it's interactive.

rrnechmech 1 year ago |

> To mitigate auto-regressive drift during inference, we corrupt context frames by adding Gaussian noise to encoded frames during training. This allows the network to correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.

I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
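One way to read the quoted passage, as a sketch rather than the paper's actual code: at inference the model's own slightly wrong frames are fed back in as context, so errors compound over time (the "drift"), whereas in training it would otherwise only ever see clean recorded frames. Corrupting the context during training exposes it to imperfect inputs. The helper below is hypothetical (`corrupt_context` and `max_sigma` are made-up names), works on plain lists of floats standing in for encoded frames, and returns the sampled noise level so it could be passed to the model as an extra input.

```python
import random
from typing import List, Tuple

Latent = List[float]  # stand-in for one encoded frame


def corrupt_context(
    context: List[Latent],
    max_sigma: float,
    rng: random.Random,
) -> Tuple[List[Latent], float]:
    """Add Gaussian noise of a randomly chosen strength to every context frame."""
    # A fresh noise level per training example, anywhere from none to max_sigma.
    sigma = rng.uniform(0.0, max_sigma)
    noisy = [[x + rng.gauss(0.0, sigma) for x in frame] for frame in context]
    return noisy, sigma


rng = random.Random(0)
clean_context = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
noisy_context, sigma = corrupt_context(clean_context, max_sigma=0.7, rng=rng)
# Training would then condition on `noisy_context` (and optionally `sigma`),
# while the prediction target stays the clean next frame.
```

The intuition is that a model which has practised predicting a clean frame from a degraded history is less likely to amplify its own small mistakes when they show up in the history at play time.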

jamilton 1 year ago |

I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

Any other similar existing datasets?

A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.

jamilton 1 year ago | |

Although ideally a follow up work would be something where there won’t be any potential legal trouble with releasing the complete model so people can play it.

A similar approach but with a game where the exact input is obvious and unambiguous from the graphics alone so that you can use unannotated data might work. You’d just have to create a model to create the action annotations. I’m not sure what the point would be, but it sounds like it’d be interesting.

throwthrowuknow 1 year ago |

Several thoughts for future work:

1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.

2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.

3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?

4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.

bufferoverflow 1 year ago |

That's probably how our reality is rendered.

TheRealPomax 1 year ago |

If by "game" you mean "literal hallucination" then yes. But if we're not trying to click-bait, then no: it's not really a game when there is no permanence or determinism to be found anywhere. It might be a "game-flavoured dream simulator", but it's absolutely not a game engine.

t1c 1 year ago |

They got DOOM running on a diffusion engine before GTA 6

broast 1 year ago |

Maybe one day this will be how operating systems work.

misterflibble 1 year ago | |

Don't give them ideas lol terrifying stuff if that happens!

KhoomeiK 1 year ago |

NVIDIA did something similar with GANs in 2020 [1], except users could actually play those games (unlike in this diffusion work which just plays back simulated video). Sentdex later adapted this to play GTA with a really cool demo [2].

[1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

[2] https://www.youtube.com/watch?v=udPY5rQVoW0

dysoco 1 year ago |

Ah finally we are starting to see something gaming related. I'm curious as to why we haven't seen more of neural networks applied to games even in a completely experimental fashion; we used to have a lot of little experimental indie games such as Façade (2005) and I'm surprised we don't have something similar years after the advent of LLMs.

We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?

raincole 1 year ago | |

> We could have mods for old games that generate voices for the characters for example

You mean in real time? Or just in general?

There are a lot of mods that use AI-generated voices. I'd say it's the norm in the modding community now.

troupo 1 year ago |

Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

This is not a game engine.

Creating a new good game? Good luck with that.

throwmeaway222 1 year ago |

You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.

gwern 1 year ago |

People may recall GameGAN from May 2020: https://arxiv.org/abs/2005.12126#nvidia https://nv-tlabs.github.io/gameGAN/#nvidia https://github.com/nv-tlabs/GameGAN_code

kcaj 1 year ago |

Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play SORA.

jetrink 1 year ago |

What if instead of a video game, this was trained on video and control inputs from people operating equipment like warehouse robots? Then an automated system could visualize the result of a proposed action or series of actions when operating the equipment itself. You would need a different model/algorithm to propose control inputs, but this would offer a way for the system to validate and refine plans as part of a problem solving feedback loop.

Workaccount2 1 year ago | |

>Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control

https://deepmind.google/discover/blog/rt-2-new-model-transla...

yair99dd 1 year ago |

YouTube user hu-po does critical, in-depth streams on AI papers. Here is his take on this (and other relevant) paper: https://www.youtube.com/live/JZgqQB4Aekc

lynx23 1 year ago |

Hehe, this sounds like the backstory of a remake of the Terminator, or "I Have No Mouth, and I Must Scream." In the aftermath of AI killing off humanity, researchers look deeply into how this could have happened. And after a number of dead ends, they finally realize: it was trained, in its infancy, on Doom!

wantsanagent 1 year ago |

Anyone have reliable numbers on the file sizes here? Doom.exe from my searches was around 715k, and with all assets somewhere around 10MB. It looks like the SD 1.4 files are over 2GB, so it's likely we're looking at a 200-2000x increase in file size depending on if you think of this as an 'engine' or the full game.
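Back-of-the-envelope check using only the figures quoted above (715 KB executable, roughly 10 MB with assets, roughly 2 GB for SD 1.4), with decimal units assumed:

```python
# All sizes are the rough numbers from this comment, not measured values.
DOOM_EXE_BYTES = 715 * 1000        # "around 715k"
DOOM_FULL_BYTES = 10 * 1000**2     # "somewhere around 10MB"
SD14_BYTES = 2 * 1000**3           # "over 2GB", taken as exactly 2 GB


def size_ratio(model_bytes: int, game_bytes: int) -> float:
    """How many times larger the model is than the game artifact."""
    return model_bytes / game_bytes


print(round(size_ratio(SD14_BYTES, DOOM_FULL_BYTES)))  # 200
print(round(size_ratio(SD14_BYTES, DOOM_EXE_BYTES)))   # 2797
```

With these inputs the upper end lands nearer 2,800x than 2,000x, and since the model is "over" 2 GB both ratios are lower bounds.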

lukol 1 year ago |

I believe future game engines will be state machines with deterministic algorithms that can be reproduced at any time. However, rendering said state into visual / auditory / etc. experiences will be taken over by AI models.

This will also allow players to easily customize what they experience without changing the core game loop.

nuz 1 year ago |

I wonder how overfit it is though. You could fit a lot of doom resolution jpeg frames into 4gb (the size of SD1.4)

JDEngi 1 year ago |

This is going to be the future of cloud gaming, isn't it? In order to deal with the latency, we just generate the next frame locally, and we'll have the true frame coming in later from the cloud, so we're never dreaming too far ahead of the actual game.
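A toy sketch of that predict-then-correct loop, in the spirit of client-side prediction with server reconciliation. Everything here is hypothetical: `PredictiveClient` and its methods are made-up names, integers stand in for frames/state, and `predict` stands in for a local next-frame model.

```python
from typing import Callable, List


class PredictiveClient:
    """Shows locally predicted state now, then re-bases on true state when it arrives."""

    def __init__(self, predict: Callable[[int, int], int], initial_state: int) -> None:
        self.predict = predict               # (state, action) -> guessed next state
        self.confirmed_tick = 0              # last tick the server has confirmed
        self.confirmed_state = initial_state
        self.pending: List[int] = []         # actions applied since the confirmed tick

    def current(self) -> int:
        """Replay unconfirmed actions on top of the last confirmed state."""
        state = self.confirmed_state
        for action in self.pending:
            state = self.predict(state, action)
        return state

    def local_step(self, action: int) -> int:
        """Apply an input immediately using the local model."""
        self.pending.append(action)
        return self.current()

    def on_server_state(self, tick: int, state: int) -> None:
        """Adopt the true state for `tick` and drop the actions it already covers."""
        covered = tick - self.confirmed_tick
        if covered <= 0:
            return  # stale or duplicate update
        self.pending = self.pending[covered:]
        self.confirmed_tick = tick
        self.confirmed_state = state


client = PredictiveClient(lambda state, action: state + action, initial_state=0)
client.local_step(1)           # local guess after tick 1: 1
client.local_step(1)           # local guess after tick 2: 2
client.on_server_state(1, 2)   # server: the true tick-1 state was 2
print(client.current())        # 3 (true tick-1 state plus one predicted step)
```

Because only the unconfirmed actions are replayed, the local guess is never more than the network round trip ahead of the true state.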

KETpXDDzR 1 year ago |

I think the correct title should be "Diffusion Models Are Fake Real-Time Game Engines". I don't think just more training will ever be sufficient to create a complete game engine. It would need to "understand" what it's doing.

ciroduran 1 year ago |

Congrats on running Doom on a Diffusion Model :D

I was really entranced by how combat is rendered (the grunt doing weird stuff in very much the style that the model generates images). Now I'd like to see this implemented in a shader in a game

seydor 1 year ago |

I wonder how far it is from this to generating language reasoning about the game from the game itself, rather than learning a large corpus of language, like LLMs do. That would be a true grounded language generator

golol 1 year ago |

Certain categories of youtube videos can also be viewed as some sort of game where the actions are the audio/transcript advanced a couple of seconds. Add two eggs. Fetch the ball. I'm walking in the park.

darrinm 1 year ago |

So… is it interactive? Playable? Or just generating a video of gameplay?

vunderba 1 year ago | |

From the article: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

The demo is actual gameplay at ~20 FPS.

darrinm 1 year ago | | |

It confused me that their stated evaluations by humans are comparing video clips rather than evaluating game play.

holoduke 1 year ago |

I saw a video a while ago where they recreated actual doom footage with a diffusion technique so it looked like a jungle or anything you liked. Can't find it anymore, but it looked impressive.

jumploops 1 year ago |

This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

Instead of working through a game, it’s building generic UI components and using common abstractions.

qnleigh 1 year ago |

Could a similar scheme be used to drastically improve the visual quality of a video game? You would train the model on gameplay rendered at low and high quality (say with and without ray tracing, and with low and high density meshing), and try to get it to convert a quick render into something photorealistic on the fly.

When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.

agys 1 year ago | |

Isn't that what Nvidia’s Ray Reconstruction and DLSS (frame generation and upscaler) are doing, more or less?

qnleigh 1 year ago | | |

At a high level I guess so. I don't know enough about Ray Reconstruction (though the results are impressive), but I was thinking of something more drastic than DLSS. Diffusion models on static images can turn a cartoon into a photorealistic image. Doing something similar for a game, where a low-quality render is turned into something that would otherwise take seconds to render, seems qualitatively quite different from DLSS. In principle a model could fill in huge amounts of detail, like increasing the number of particles in a particle-based effect, adding shading/lighting effects...

LtdJorge 1 year ago |

So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?

lackoftactics 1 year ago |

I think Alan's conservative countdown to AGI will need to be updated after this. https://lifearchitect.ai/agi/ This is really impressive stuff. I thought about it a couple of months ago, that probably this is the next modality worth exploring for data, but didn't imagine it would come so fast. On the other side, the amount of compute required is crazy.

acoye 1 year ago |

Nvidia CEO reckons your GPU will be replaced with AI in “5-10 years”. So this is sort of the first working game, I guess.

acoye 1 year ago |

I'd love to see John Carmack come back from his AGI hiatus and advance AI based rendering. This would be super cool.

amunozo 1 year ago |

This is amazing and an interesting discovery. It is a pity that I don't find it capable of creating anything new.

harha_ 1 year ago |

This is so sick I don't know what to say. I never expected this, aren't the implications of this huge?

aithrowaway1987 1 year ago | |

I am struggling to understand a single implication of this! How does this generalize to anything other than playing retro games in the most expensive way possible? The very intention of this project is overfitting to data in a non-generalizable way! Maybe it's just pure engineering, that good ANNs are getting cheap and fast. But this project still seems to have the fundamental weaknesses of all AI projects:

- needs a huge amount of data, which a priori precludes a lot of interesting use cases

- flashy-but-misleading demos which hide the actual weaknesses of the AI software (note that the player is moving very haltingly compared to a real game of DOOM, where you almost never stop moving)

- AI nailing something really complicated for humans (98% effective raycasting, 98% effective Python codegen) while failing to grasp abstract concepts rigorously understood by fish (object permanence, quantity)

I am genuinely struggling to see this as a meaningful step forward. It seems more like a World's Fair exhibit - a fun and impressive diversion, but probably not a vision of the future. Putting it another way: unlike AlphaGo, Deep Blue wasn't really a technological milestone so much as a sociological milestone reflecting the apex of a certain approach to AI. I think this DOOM project is in a similar vein.

harha_ 1 year ago | | |

I agree with you, when I made this comment I was simply excited but that didn't last too long. I find this technology both exciting and dystopian, the latter because the dystopic use of it is already happening all over the internet. For now, it's been used only for entertainment AFAIK, which is the kind of use I don't like either, because I prefer human created entertainment over this crap.

maxglute 1 year ago |

RL tetris effect hallucination.

Wish there was 1000s of hours of hardcore henry to train. Maybe scrape gopro war cams.

nicman23 1 year ago |

what i want from something like this is a mix. a model that can infinitely "zoom" into an object's texture (even if not perfect, it would be fine) and a model that would create 3d geometry from bump maps / normals

mobiuscog 1 year ago |

Video Game streamers are next in line to be replaced by AI I guess.

EcommerceFlow 1 year ago |

Jensen said that this is the future of gaming a few months ago fyi.

Fraterkes 1 year ago | |

Thousands of different people have been speculating about this kind of thing for years.

weakfish 1 year ago | |

Who is that?

kqr 1 year ago |

I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.

gwbas1c 1 year ago |

Am I the only one who thinks this is faked?

It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.

GaggiX 1 year ago | |

>Am I the only one who thinks this is faked?

Yes.

amelius 1 year ago |

Yes, and you can use an LLM to simulate role playing games.

piperswe 1 year ago |

This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.

jasonkstevens 1 year ago |

AI no longer plays Doom: it is Doom.

aghilmort 1 year ago |

looking forward to &/or wondering about overlap with notion of ray tracing LLMs

itomato 1 year ago |

The gibs are a dead giveaway

joseferben 1 year ago |

impressive, imagine this but photo realistic with vr goggles.

thegabriele 1 year ago |

Wow, I bet Boston Dynamics and such are quite interested

YeGoblynQueenne 1 year ago |

Misleading Titles Are Everywhere These Days.

danielmarkbruce 1 year ago |

What is the point of this? It's hard to see how this is useful. Maybe it's just an exercise to show what a diffusion model can do?

richard___ 1 year ago |

Uhhh… demos would be more convincing with enemies and decreasing health

Kiro 1 year ago | |

I see enemies and decreasing health on hit. But even if it lacked those, it seems like a pretty irrelevant nitpick that is completely underplaying what we're seeing here. The fact that this is even possible at all feels like science fiction.

dean2432 1 year ago |

So in the future we can play FPS games given any setting? Pog

sitkack 1 year ago |

What most programmers don't understand is that in the very near future, the entire application will be delivered by an AI model: no source, no text, just connect to the app over RDP. The whole app will be created by example; the app developer will train the app like a dog trainer trains a dog.

Jonovono 1 year ago | |

I think it's possible AI models will generate dynamic UI for each client and stream the UI to clients (maybe eventually client devices will generate their UI on the fly) similar to Google Stadia. Maybe some offset of video that allows the remote to control it. Maybe Wasm based - just stream wasm bytecode around? The guy behind VLC is building a library for ultra low latency: https://www.kyber.video/techology.

I was playing around with the idea in this: https://github.com/StreamUI/StreamUI. The thinking is to take the ideas of Elixir LiveView to the extreme.

sitkack 1 year ago | | |

I am so glad you posted, this is super cool!

I too have been thinking about how to push dynamic wasm to the client for super low latency UIs.

LiveView is just the beginning. Your readme is dreamy. I'll dive into your project at the end of Sept when I get back into deep tech.

ukuina 1 year ago | |

So... https://websim.ai except over pixels instead of in your browser?

sitkack 1 year ago | | |

Yes, and that is super neat.

Grimblewald 1 year ago | |

that might work for some applications, especially recreational things, but I think we're a while away from it doing away with all things, especially where deterministic behavior, efficiency, or reliability are important.

sitkack 1 year ago | | |

Problems for two papers down the line.