Diffusion models are real-time game engines (gamengen.github.io)
I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
Any other similar existing datasets?
A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.
A similar approach but with a game where the exact input is obvious and unambiguous from the graphics alone so that you can use unannotated data might work. You’d just have to create a model to create the action annotations. I’m not sure what the point would be, but it sounds like it’d be interesting.
1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.
2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.
3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?
4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.
We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?
You mean in real time? Or just in general?
There are a lot of mods that use AI-generated voices. I'd say it's the norm in the modding community now.
A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.
This is not a game engine.
Creating a new good game? Good luck with that.
I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.
https://deepmind.google/discover/blog/rt-2-new-model-transla...
This will also allow players to easily customize what they experience without changing the core game loop.
I was really entranced by how combat is rendered (the grunt doing weird stuff in very much the style that the model generates images). Now I'd like to see this implemented in a shader in a game.
I was playing around with the idea in this: https://github.com/StreamUI/StreamUI. The thinking is to take the ideas of Elixir LiveView to the extreme.
I too have been thinking about how to push dynamic wasm to the client for super low latency UIs.
LiveView is just the beginning. Your readme is dreamy. I'll dive into your project at the end of Sept when I get back into deep tech.
The demo is actual gameplay at ~20 FPS.
Instead of working through a game, it’s building generic UI components and using common abstractions.
When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.
- needs a huge amount of data, which a priori precludes a lot of interesting use cases
- flashy-but-misleading demos which hide the actual weaknesses of the AI software (note that the player is moving very haltingly compared to a real game of DOOM, where you almost never stop moving)
- AI nailing something really complicated for humans (98% effective raycasting, 98% effective Python codegen) while failing to grasp abstract concepts rigorously understood by fish (object permanence, quantity)
I am genuinely struggling to see this as a meaningful step forward. It seems more like a World's Fair exhibit - a fun and impressive diversion, but probably not a vision of the future. Putting it another way: unlike AlphaGo, Deep Blue wasn't really a technological milestone so much as a sociological milestone reflecting the apex of a certain approach to AI. I think this DOOM project is in a similar vein.
Wish there was 1000s of hours of hardcore henry to train. Maybe scrape gopro war cams.
Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.
It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.
Yes.
The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.
That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.
Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.
Anyway, a fun idea that worked! Love those.
To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).
This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.
[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...
If I studied the longer one more closely, I'm sure inconsistencies would be seen but it seemed able to recall presence/absence of destroyed items, dead monsters etc on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.
What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?
A dialogue between two parties with different functionality so to speak.
(Non technical person here - just fantasizing)
If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.
or do we believe it's an inherent limitation in the approach?
The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause/effect. It just renders one frame at a time (after all it’s a text to image model, not text to video). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.
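A minimal sketch of that loop, for illustration only. `predict_next_frame` is a hypothetical stand-in for the diffusion model, and the context length of 4 is an assumed value, not a figure from the paper:

```python
from collections import deque

import numpy as np

CONTEXT_LEN = 4               # assumed context length, for illustration only
FRAME_SHAPE = (240, 320, 3)   # height, width, RGB


def predict_next_frame(frames, actions):
    """Hypothetical stand-in for the diffusion model.

    A real model would denoise a new frame conditioned on the context
    frames and actions. This placeholder ignores the actions and simply
    averages the context frames so that the loop can run.
    """
    return np.mean(np.stack(frames), axis=0)


def rollout(initial_frames, player_actions):
    """Generate one frame per action, feeding each output back as context."""
    frames = deque(initial_frames, maxlen=CONTEXT_LEN)
    actions = deque([0] * len(initial_frames), maxlen=CONTEXT_LEN)
    generated = []
    for action in player_actions:
        actions.append(action)
        frame = predict_next_frame(list(frames), list(actions))
        frames.append(frame)   # the only "memory" is this sliding window
        generated.append(frame)
    return generated
```

Anything that scrolls out of the sliding window is gone unless it is still visible in a later frame, which is one way to read the "no state" point above.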
Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.
De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.
In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
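The noise trick can be sketched roughly as below. This is an illustration under assumptions, not the paper's code: the function name, the maximum noise strength and the number of discrete levels are all made up here.

```python
import numpy as np


def corrupt_context(context_frames, rng, max_sigma=0.7, num_levels=10):
    """Add Gaussian noise of a randomly sampled strength to context frames.

    Returns the noisy frames plus the discrete noise level. The level would
    be passed to the model as an extra input so it can learn how much to
    trust (and correct) its context. max_sigma and num_levels are assumed.
    """
    level = int(rng.integers(0, num_levels))        # level 0 means no noise
    sigma = max_sigma * level / (num_levels - 1)
    noise = rng.normal(0.0, 1.0, size=context_frames.shape)
    return context_frames + sigma * noise, level


rng = np.random.default_rng(0)
clean = np.zeros((4, 30, 40, 4))   # 4 context frames in a toy latent shape
noisy, level = corrupt_context(clean, rng)
```

During training the target stays the clean next frame, so the model practises producing a clean output from imperfect context. That is the situation it faces at inference time, when its own slightly wrong outputs are fed back in.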
The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.
Now you show it N frames as input and ask it "give me frame N+1", and it gives you frame N+1 back based on how it was originally seen during training.
But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from past database.
The drift you mentioned is actually a clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.
It's a bit like if you train stable diffusion on Simpsons episodes, and that it outputs the next frame of an existing episode that was in the training set, but few frames later goes wild and buggy.
I would call it the world's least efficient video compression.
What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?
It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.
So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.
A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.
Even Google only has one Gemini (in a few versions).
Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).
PS: Not great examples, but I hope you get the idea ;)
Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.
It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.
There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.
When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.
Isn't that possible by setting arbitrarily high goals for ray-cast rendering?
Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.
The next step of anti-Doom would be a model generating the model, generating the Doom output.
- 4 MB RAM
- 12 MB disk space
Stable diffusion v1 > 860M UNet and CLIP ViT-L/14 (540M)
Checkpoint size:
4.27 GB
7.7 GB (full EMA)
Running on a TPU-v5e
Peak compute per chip (bf16) 197 TFLOPs
Peak compute per chip (Int8) 393 TFLOPs
HBM2 capacity and bandwidth 16 GB, 819 GBps
Interchip Interconnect BW 1600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.
What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).
I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.
- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
- https://cloud.google.com/tpu/docs/v5e
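As a rough check on the "hundreds of times over" remark, using only the figures quoted above (12 MB of disk for the original game, and the two checkpoint sizes read as gigabytes) and assuming decimal units:

```python
DOOM_DISK_MB = 12              # disk requirement quoted above
CHECKPOINT_GB = 4.27           # checkpoint size quoted above
CHECKPOINT_FULL_EMA_GB = 7.7   # full-EMA checkpoint size quoted above

MB_PER_GB = 1000               # assumption: decimal units

ratio = CHECKPOINT_GB * MB_PER_GB / DOOM_DISK_MB
ratio_full = CHECKPOINT_FULL_EMA_GB * MB_PER_GB / DOOM_DISK_MB

print(f"checkpoint / game size: {ratio:.0f}x")
print(f"full-EMA checkpoint / game size: {ratio_full:.0f}x")
```

That works out to roughly 356x and 642x. It compares raw bytes only and says nothing about how efficiently weights can actually store game content.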
Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.
Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.
Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your death beds wishing you had allowed yourself to have more fun.
Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.
IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.
These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).
If a rule was changed but it's never visible on the screen, did it really change?
> It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).
Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.
Well for "some" games it does really change
It's much simpler than actually creating a game....
I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.
Given a sufficient enough separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.
To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent to what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.
I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
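A toy sketch of the idea above, assuming a made-up deterministic game (a player on a 5-cell corridor) and a hash of the full input sequence as the lookup key. Every name here is hypothetical:

```python
import hashlib


def step(state, key):
    """Toy deterministic game: a player on a 5-cell corridor."""
    pos = state["pos"]
    if key == "L":
        pos = max(0, pos - 1)
    elif key == "R":
        pos = min(4, pos + 1)
    return {"pos": pos}


def sequence_hash(keys):
    """Stable identifier for a full input sequence."""
    return hashlib.sha256("".join(keys).encode("utf-8")).hexdigest()


def record(sequences, start):
    """Replay each input sequence and store the resulting state snapshot."""
    table = {}
    for keys in sequences:
        state = dict(start)
        for key in keys:
            state = step(state, key)
        table[sequence_hash(keys)] = state
    return table


table = record([["R", "R"], ["R", "L", "R"], ["L"]], start={"pos": 0})
```

One caveat with this keying: different input sequences that reach the same state still get different hashes, so the table grows with the number of sequences rather than the number of distinct states.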
I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.
The entire thing would probably crash and burn if you did something just slightly unusual compared to the training data, too. People talking about 'generated' games often seem to fantasize about an AI that will make up new outcomes for players that go off the beaten path, but a large part of the fun of real games is figuring out what you can do within the predetermined constraints set by the game's code. (Pen-and-paper RPGs are highly open-ended, but even a Game Master sometimes needs to protect the players from themselves, whereas the current generation of AI is famously incapable of saying no.)
I suspect there is a reason for this: running while turning doesn't work properly and makes it very obvious that the system doesn't have a consistent internal 3D view of the world. I'm already getting motion sickness from the inconsistencies in straight-line movement, I can't imagine turning is any better.
That's literally how the human rating was set up if you read the paper.
I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?
I think for real world application one challenge is going to be the "action" signal which is a necessary component of the conditioning signal that makes the simulation reactive. In video games you can just record the buttons, but for real world scenarios you need difficult and intrusive sensor setups for recording force signals.
(Again for robotics though maybe it's enough to record the motor commands, just that you can't easily record the "motor commands" for humans, for example)
https://slatestarcodex.com/2017/09/05/book-review-surfing-un...
It's called predictive coding. By trying to predict sensory stimuli, the brain creates a simplified model of the world, including common sense physics. Yann LeCun says that this is a major key to AGI. Another one is effective planning.
But while current predictive models (autoregressive LLMs) work well on text, they don't work well on video data, because of the large outcome space. In an LLM, text prediction boils down to a probability distribution over a few thousand possible next tokens, while there are several orders of magnitude more possible "next frames" in a video. Diffusion models work better on video data, but they are not inherently predictive like causal LLMs. Apparently this new Doom model made some progress on that front though.
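To put rough numbers on the size of the two outcome spaces, here is a back-of-the-envelope count. The vocabulary size is an assumed round figure, and the frame is taken as 320x240 RGB with 8 bits per channel:

```python
import math

VOCAB_SIZE = 50_000                 # assumed round figure for an LLM vocabulary
WIDTH, HEIGHT, CHANNELS = 320, 240, 3
LEVELS_PER_CHANNEL = 256            # 8-bit colour

# Work in log10, since the frame count is far too large to print usefully.
log10_tokens = math.log10(VOCAB_SIZE)
log10_frames = WIDTH * HEIGHT * CHANNELS * math.log10(LEVELS_PER_CHANNEL)

print(f"next-token outcomes: about 10^{log10_tokens:.1f}")
print(f"next-frame outcomes: about 10^{log10_frames:.0f}")
```

Almost all of those frames are noise rather than plausible gameplay, so the raw count overstates the difficulty. It still shows why an explicit probability table over outcomes is not practical for raw pixels.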
(I say it can't count because there are numerous examples where the bullet count glitches, it goes right impressively often, but still, counting, being up or down, is something computers have been able to do flawlessly basically since forever)
(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)
What I wonder is whether LLMs will inherently always have this dichotomy and we need something 'extra' (reasoning, attention or something less biomimetic), or whether this will eventually resolve itself (to an acceptable extent) when they improve even further.
Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.
From the video it seems like it is probability based - they may die right away or it might take way longer than it should.
I love how the player's health goes down when he stands in the radioactive green water.
In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.
This is one of the bits that was weird to me, it doesn't work correctly. In the real game you take damage at a consistent rate, in the video the player doesn't and whether the player takes damage or not seems highly dependent on some factor that isn't whether or not the player is in the radioactive slime. My thought is that it's learnt something else that correlates poorly.
They trained this thing on bot gameplay, so I bet it does poorly when advanced strategies like deliberately inducing mob infighting are employed (the bots probably didn't do that a lot, if at all.)
You don't even need to do all of that - this trained model already is the game, i.e., it's interactive, you can play the game.
I noticed a few hallucinations e.g. when it picked green jacket from a corner, walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.
I would assume only if the training data contained this type of imagery, which it did not. The training data (from what I understand) consisted only of input+video of actual gameplay, so that is what the model is trained to mimic.
This is like a dog that has been trained to form English words – what's impressive is not that it does it well, but that it does it at all.
AI models don't "know" things at all.
At best, they're just very fuzzy predictors. In this case, given the last couple frames of video and a user input, it predicts the next frame.
It has zero knowledge of the game world, game rules, interactions, etc. It's merely a mapping of [pixels, input] -> pixels.
Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?
Edit: Can see this in the first 10 seconds of the first video under "Full Gameplay Videos", stairs turning to corridor turning to closed door for no reason without looking away.
Guessing the model hasn't been taught enough about that, because most people don't jump into hazards.
to me it seems like a very bruteforce or greedy way to give the impression to a user that they are "playing" a game. the difference being that you already own the game to make this possible, but cannot let the user use that copy!
using generative AI for game creation is at a nascent stage but there are much more elegant ways to go about the end goal. perhaps in the future with computing so far ahead that we moved beyond the current architecture, this might be worth doing instead of emulation perhaps.
If so, is it more like imagination/hallucination rather than rendering?
As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.
If that is the case, what does aphantasia tell us?
[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...
It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)
It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.
Something doesn't add up, in my opinion, though. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are like 2 orders of magnitude faster, which would indicate that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.
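A quick back-of-the-envelope on how much of that gap the lower resolution alone could explain. The 512x512 comparison size is an assumption about typical Stable Diffusion v1 usage, and the sketch deliberately ignores everything else (denoising step count, hardware, batching):

```python
FPS = 20                   # frame rate reported for the demo
GAME_RES = (320, 240)      # resolution mentioned above
SD_RES = (512, 512)        # assumed typical Stable Diffusion v1 output size
VAE_DOWNSAMPLE = 8         # SD v1 latents are 8x smaller per side

frame_budget_ms = 1000 / FPS


def latent_cells(res):
    """Number of spatial positions in the latent for a given image size."""
    width, height = res
    return (width // VAE_DOWNSAMPLE) * (height // VAE_DOWNSAMPLE)


pixel_ratio = (SD_RES[0] * SD_RES[1]) / (GAME_RES[0] * GAME_RES[1])
latent_ratio = latent_cells(SD_RES) / latent_cells(GAME_RES)

print(f"time budget per frame: {frame_budget_ms:.0f} ms")
print(f"pixel ratio: {pixel_ratio:.2f}x, latent ratio: {latent_ratio:.2f}x")
```

Under these assumptions the smaller image accounts for a factor of about 3.4, nowhere near two orders of magnitude, so most of the speedup would have to come from factors this sketch does not model.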
This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.
It’s just the stochastic parrot argument again.
This is an incredibly complex hypothesis that doesn't really seem justified by the evidence
> A is the set of key presses and mouse movements…
> …to condition on actions, we simply learn an embedding A_emb for each action
So, it’s clear that in this model the diffusion process is conditioned by embedding A that is derived from user actions rather than words.
Then a noised start frame is encoded into latents and concatenated on to the noise latents as a second conditioning.
So we have a diffusion model which is trained solely on images of doom, and which is conditioned on current doom frames and user actions to produce subsequent frames.
So yes, the users are playing it.
However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.
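The two conditioning paths described above can be sketched with plain arrays. This is a shape-level illustration only: the action count, embedding width and latent shape are assumed values, and the embedding table is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ACTIONS = 8              # assumed size of the discrete action set
EMBED_DIM = 16               # assumed embedding width
LATENT_SHAPE = (4, 30, 40)   # (channels, height, width), toy values

# One vector per action (A_emb); learned in a real model, random here.
action_embedding = rng.normal(size=(NUM_ACTIONS, EMBED_DIM))


def build_inputs(past_actions, past_latents, rng):
    """Assemble the two conditioning signals for one denoising call."""
    # 1) Action conditioning: look up the embedding of each past action.
    #    These vectors take the place of the usual text conditioning.
    action_tokens = action_embedding[np.asarray(past_actions)]
    # 2) Frame conditioning: stack the (already noised) context latents
    #    with fresh noise latents along the channel axis.
    noise = rng.normal(size=LATENT_SHAPE)
    context = np.concatenate(list(past_latents), axis=0)
    unet_input = np.concatenate([noise, context], axis=0)
    return action_tokens, unet_input


latents = [rng.normal(size=LATENT_SHAPE) for _ in range(3)]
tokens, x = build_inputs([2, 5, 1], latents, rng)
```

With three context frames, `tokens` has shape (3, 16) and `x` has shape (16, 30, 40): four channels of noise followed by twelve channels of context.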
Since the splats are specifically designed for rendering it seems like it would be an efficient way for the image model to learn the geometry without having to encode it on the image model itself.
https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural Network's version of GTA V: GAN Theft Auto"
> Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.
The abstract is ambiguously worded which has caused a lot of confusion here, but the paper is unmistakably clear about this point.
Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.
Well, you're wrong, as shown in the first video and stated by the authors themselves. Maybe next time check more carefully instead of writing comments in such an authoritative tone about things you don't actually know.
The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.
I may be wrong with how it works, but I think this is just hallucinating in real time. It has no internal state per se, it knows what was on screen in the previous few frames and it knows what inputs the user is pressing, and so it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".
As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.
I would've really liked to see a section of the paper explicitly call out that they used humans in real time. There's a lot of sentences that led me to believe otherwise. It's clear that they used a bunch of agents to simulate gameplay where those agents submitted user inputs to affect the gameplay and they captured those inputs in their model. This made it a bit murky as to whether humans ever actually got involved.
This statement, "Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play"
led me to believe that while they had an ultimate goal of user input (why wouldn't they) they sufficed by approximating human input.
I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:
"Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."
and it's like.. okay.. if you have a section in results on human evaluation, and your goal is to have humans play, then why are you talking just about humans reviewing video rather than giving some sort of feedback on the human gameplay experience - even if it's not especially positive?
Still, in the Discussion section, it mentions, "The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases." which makes it more clear that humans gave input which went outside the bounds of the automatic agents. It doesn't seem like this would occur if it were agents simulating more input.
Ultimately, I think that the paper itself could've been more clear in this regard, but clearly the publishing website tries to be very explicit by saying upfront - "Real-time recordings of people playing the game DOOM" and it's pretty hard to argue against that.
Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.
Imagine if text2game was possible. there would be some sort of network generating each frame from an image generated by text, with some underlying 3d physics simulation to keep all the multiplayer screens sync'd
this paper does not seem to show that possibility, but rather uses some clever wording to make you think people were playing in real time. we can't even generate more than 5~10 seconds of video without it hallucinating. something this persistent would require an extreme amount of gameplay video training. it can be done but the video shown by this paper is not true to its words.
Yes, the computational cost is ridiculous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.
The fact that you can model a sequence of frames with such contextual awareness without explicitly having to encode it is the real breakthrough here, both from a pure gaming standpoint and for simulation in general.
OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste
1) the model has enough memory to store not only all game assets and engine but even hundreds of "plays".
2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)
3) an interesting point I was wondering to compare current state of things (I mean I'll give you this but it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯\_(ツ)_/¯)
4) the point that you can rip a game
I'm really not sure what you're contesting because I said several things.
> it lacks basic things like pre-computing, storing, etc.
It does? Last I checked neural nets store information. I guess I need to return my PhD because last I checked there's a UNet in SD 1.4 and that contains a decoder.
That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
I guess that's the occasion to remind that ML is splendid at interpolating, but extrapolating, maybe don't keep your hopes too high.
Namely, to have a "perfect flight sim" using GoPros, you'll need to record hundreds of stalls and crashes.
> Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.
You're jumping ahead there and I'm not convinced you could do this ever (unless you're model is already a great physics engine). The paper itself has feeds the controls into the network. But a flight sim will be harder better you'd need to also feed in air conditions. I just don't see how you could do this from video alone, let alone just video from the cockpit. Humans could not do this. There's just not enough information.And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.
You can feed it with videos of usage of any software, or real-world footage recorded by a GoPro mounted on your shoulder (with body motion measured by some sensors, though the action space would be much larger).
Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.
When in reality this is the least efficient and reliable form of Doom yet created, using literally millions of times the computation used by the first x86 PCs that were able to render and play doom in real-time.
But it's a funny party trick, sure.
It's unavoidable though. The rising cost of living and the romanticization of entrepreneurs as if they were rock stars lead towards this hustle mindset.
And here we are, binging netflix movies over such copper wires.
I'm not saying games will be replaced by diffusion models dreaming up next images based on user input, but a variation of that might end up in a form of interactive art creation or a new form of entertainment.
I don't see how.
This game "engine" is purely mapping [pixels, input] -> new pixels. It has no notion of game state (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again), not to mention that it requires the game to already exist in order to train it.
I suppose, in theory, you could train the network to include game state in the input and output, or potentially even handle game state outside the network entirely and just make it one of the inputs, but the output would be incredibly noisy and nigh unplayable.
And like I said, all of it requires the game to already exist in order to train the network.
In a way this is a "simulated game engine", trained from actual game engine data. But I would argue a working simulated game engine becomes a game engine of its own, as it is then able to "propel the game" as you say. The way it achieves this becomes irrelevant: in one case the content was crafted by humans, in the other case it mimics existing game content, and the player really doesn't care!
> An engine would also work offroad.
Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places. I'd even say extrapolation capabilities of such a model could be better than a traditional game engine, as it can make things up as it goes, while if you accidentally cross a wall in a typical game engine the screen goes blank.
training the model with a final game will never give you an engine. maybe a „simulated game“ or even a „game“ but certainly not an „engine“. the latter would mean the model would be capable of deriving and extracting the technical and intellectual concepts and applying them elsewhere.
They easily could have demonstrated this by seeding the model with images of Doom maps which weren't in the training set, but they chose not to. I'm sure they tried it and the results just weren't good, probably morphing the map into one of the ones it was trained on at the first opportunity.
At which point, you effectively would be interpolating in latent space through the source code to actually "render" the game. You'd have an entire latent space computer, with an engine, assets, textures, a software renderer.
With a sufficiently powerful computer, one could imagine interpolating in this latent space between, say, Factorio and TF2 (two of my favorites), and tweaking this latent space to your liking by conditioning it on any number of gameplay aspects.
This future comes very quickly for subsets of the pipeline, like the very end stage of rendering -- DLSS is already in production, for example. Maybe Nvidia's revenue wraps back to gaming once again, as we all become bolted into a neural metaverse.
God I love that they chose DOOM.
Neural nets are not guaranteed to converge to anything even remotely optimal, so no that isn't how it works. Also even though neural nets can approximate any function they usually can't do it in a time or space efficient manner, resulting in much larger programs than the human written code.
> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM.
You and I have very different definitions of compression
https://news.ycombinator.com/item?id=41377398
> Someone in the field could probably correct me on that.
^__^
The first thing I thought when I saw this was: couldn't my immediate experience be exactly the same thing? Including the illusion of a separate main character to whom events are occurring?
I would expect something in this realm to be a little better at not being visually inconsistent when you look away and look back. A red monster turning into a blue friendly etc.
Sit down and write down a text prompt for a "fun new game". You can start with something relatively simple like a Mario-like platformer.
By page 300, when you're about halfway through describing what you mean, you might understand why this is wishful thinking
Not really. This is a reproduction of the first level of Doom. Nothing original is being created.
(Jk of course I know what you mean, but you can seriously see text prompts as compressed forms of programming that leverage the model's prior knowledge)
- you could build a non-real-time version of the game engine and use the neural net as a real-time approximation
- you could edit videos shot in real life to have huds or whatever and train the neural net to simulate reality rather than doom. (this paper used 900 million frames which i think is about a year of video if it's 30fps, but maybe algorithmic improvements can cut the training requirements down) and a year of video isn't actually all that much—like, maybe you could recruit 500 people to play paintball while wearing gopro cameras with accelerometers and gyros on their heads and paintball guns, so that you could get a year of video in a weekend?
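The back-of-the-envelope figures above can be checked with a few lines. This is only the arithmetic, assuming the 900 million frame count and the 30 fps rate quoted in the comment; the 500 recruits are the commenter's hypothetical.

```python
FRAMES = 900_000_000  # training frames, as quoted above
FPS = 30              # assumed capture rate
PEOPLE = 500          # hypothetical number of recruits

seconds = FRAMES / FPS
days = seconds / 86_400                     # seconds per day
hours_per_person = seconds / 3_600 / PEOPLE

print(f"{days:.0f} days of footage in total")
print(f"{hours_per_person:.1f} hours of recording per person")
```

So "about a year" holds up (roughly 347 days), and splitting it 500 ways leaves a bit under 17 hours of footage each, which is a long weekend but not an absurd one.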
I imagine a game like that could get so convincing in its details and immersiveness that one could forget they're playing a game.
We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.
Sora was trained on a much more diverse dataset, and so has to learn more general solutions in order to maintain consistency, which is harder. The low resolution and simple, highly repetitive textures of doom definitely help as well.
In general, this is just an easier problem to approach because of the more focused constraints. It's also worth mentioning that noise was added during the process in order to make the model robust to small perturbations.
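For anyone wondering what "adding noise" looks like in practice: the model's own slightly wrong frames get fed back in as input, so small errors can compound over a long rollout. Training on deliberately corrupted context frames is one way to teach the model to tolerate that. Below is a minimal sketch of the idea, not the paper's implementation. The function name `corrupt_context`, the flat-list frame layout, and the `max_sigma` default are all assumptions made for illustration.

```python
import random

def corrupt_context(frames, max_sigma=0.7, rng=None):
    """Add Gaussian noise of one randomly drawn strength to all context frames.

    frames: list of frames, each a flat list of floats (hypothetical layout).
    Returns the noisy frames plus the noise level used, so that the level
    could be handed to the model as an extra conditioning input.
    """
    if rng is None:
        rng = random.Random()
    sigma = rng.uniform(0.0, max_sigma)
    noisy = [[px + rng.gauss(0.0, sigma) for px in frame] for frame in frames]
    return noisy, sigma

# Example: three tiny 4-pixel "frames" used as conditioning context.
context = [[0.0, 0.5, 1.0, 0.5] for _ in range(3)]
noisy_context, level = corrupt_context(context, rng=random.Random(0))
```

At training time the target frame stays clean while the context is corrupted, so the model learns to produce a good next frame even from imperfect history.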
The link may be that we actually see differences between “frames”, rather than the frames directly. That in itself would imply that a form of sub-visual representation is being processed by our brain. For aphantasia, it could be that we work directly on this representation instead of recalling imagery through the visual system.
Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".
I’m nowhere near qualified to speak of this with certainty, but it seems plausible to me.
1: Build the entire game first
2: Record agents playing hundreds/thousands/millions of hours of it
3: Be able to run the simulation at far higher resolution than what's in the demo videos here for it to even matter that the hypothetical game is 'very graphically advanced'
This is the most impractical way yet invented to 'play' a game.
I don't really know why everyone is piling on me here. Sorry for a bit of fun speculating! This model is on the continuum. There is a latent representation of Doom in weights. some weights, not these weights. Therefore some representation of doom in a neural net could become more efficient over time. That's really the point I'm trying to make.
It says in a shy way that it is based on: "Ha & Schmidhuber (2018) who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector"
So it means they most likely took https://worldmodels.github.io/ (that is actually open-source) or something similar and swapped the frame generation by Stable Diffusion that was released in 2022.
Also: https://news.ycombinator.com/item?id=41376722
Also: define "fun" and "new" in a "simple text prompt". Current image generators suck at properly reflecting what you want exactly, because they regurgitate existing things and styles.
> Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the [aphantasia] brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".
(I obviously don't know what I'm talking about, just a fellow aphant)
1) yes you are correct. the point i was making is that, in the context of the discovery/research, that's outside the scope, and 'easier' to do, as it has been done in other verticals (ie.: e2e self driving)
2) yep, aligned here
3) I'm not fully following here, but agree this is not NeurIPS, and no Schmidhuber's bickering.
4) The network does store information, it just doesn't store gameplay information. That could be forced, but as per point 1, it is beyond the scope of this research, and I think that is the right approach.
3) It's always hard to evaluate. I was thinking about ripping the game, so a reasonable metric is a comparison with a human's ability to perform the task. Of course I'm A LOT faster than my dishwasher at cleaning dishes but I'm not occupied while it is going, so it still has high utility. (Someone tell reviewer 2 lol)
4) Why should we believe that it doesn't store gameplay? The model was fed "user" inputs and frames. So it has this information and this information appears useful for learning the task.
I guess I should start hoarding video of myself now.
All video games are, by definition, interactive videos.
What I imagine you're asking about is, a typical game like Doom is effectively a function:
f(internal state, player input) -> (new frame, new internal state)
where internal state is the shape and looks of loaded map, positions and behaviors and stats of enemies, player, items, etc.
A typical AI that plays Doom, which is not what's happening here, is (at runtime):
f(last frame) -> new player input
and is attached in a loop to the previous case in the obvious way.
What we have here, however, is a game you can play but implemented in a diffusion model, and it works like this:
f(player input, N last frames) -> new frame
Of note here is the lack of game state - the state is implicit in the contents of the N previous frames, and is otherwise not represented or mutated explicitly. The diffusion model has seen so much Doom that it, in a way, internalized most of the state and its evolution, so it can look at what's going on and guess what's about to happen. Which is what it does: it renders the next frame by predicting it, based on current user input and last N frames. And then that frame becomes the input for the next prediction, and so on, and so on.
So yes, it's totally an interactive video and a game and a third thing - a probabilistic emulation of Doom on a generative ML model.
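The third signature, f(player input, N last frames) -> new frame, can be sketched as a plain loop. This is a toy illustration of the feedback structure only: `play`, `toy_model`, and the default `n_context` are made-up names and values, and the "model" here is a trivial function standing in for the diffusion network.

```python
from collections import deque

def play(predict_next_frame, initial_frames, inputs, n_context=4):
    """Roll out f(player input, N last frames) -> new frame."""
    # The sliding window of recent frames is the only "state" that is kept.
    context = deque(initial_frames, maxlen=n_context)
    rendered = []
    for action in inputs:
        frame = predict_next_frame(action, list(context))
        rendered.append(frame)
        context.append(frame)  # the prediction becomes input for the next step
    return rendered

# Toy stand-in for the model: a "frame" is just the player's x position.
def toy_model(action, last_frames):
    step = {"left": -1, "right": 1}.get(action, 0)
    return last_frames[-1] + step

print(play(toy_model, [0], ["right", "right", "left", "noop"]))  # [1, 2, 1, 1]
```

Anything that falls out of the `deque` is gone for good unless the model has somehow baked it into its weights, which is exactly the "no explicit game state" point.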
That opens up a new branch of possibilities.
With that said, I wholly disagree that this is not an engine. This is absolutely a game engine and while this particular demo uses the engine to recreate DOOM, an existing game, you could certainly use this engine to produce new games in addition to extrapolating existing games in novel ways.
Maybe it is, but doing that with the entire scene instead of just a small part of it makes the problem massively harder, as the model needs to grow exponentially to remember more things. It isn't something that we will manage anytime soon, maybe 10-20 years with current architecture and same compute progress.
Then you make that even harder by remembering a whole game level? No, ain't gonna happen in our lifetimes without massive changes to the architecture. They would need to make a different model keep track of level state etc, not just an image to image model.
But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.
It would in an SCP-themed game. Or dreamscape/Inception themed one.
Hell, "you're trapped in Doom-like dreamscape, escape before you lose your mind" is a very interesting pitch for a game. Basically take this Doom thing and make walking through a specific, unique-looking doorway from the original game to be the victory condition - the player's job would be to coerce the model to generate it, while also not dying in the Doom fever dream game itself. I'd play the hell out of this.
(Implementation-wise, just loop in a simple recognition model to continuously evaluate the victory condition from the last few frames, and some OCR to detect when the player's hit points indicator on the HUD drops to zero.)
(I'll happily pay $100 this year to the first project that gets this to work. I bet I'm not the only one. Doesn't have to be Doom specifically, just has to be interesting.)
Mainly just wanted to temper expectations I'm seeing throughout this thread that the model is actually simulating Doom. I don't know what will be required to get from here to there, but we're definitely not there yet.
What I'd posit is that it's not actually a very good replication of the game but is very good at replicating short clips that almost look like the game, and the short time horizons are deliberately chosen because the authors know the model lacks coherence beyond that.
Do you mean the PSNR and LPIPS metrics used in paper?
I’m not very familiar with Gaussian splats models, but aren’t they just a way of constructing images using multiple superimposed parameterized Gaussian distributions, sort of like the Fourier series does with waveforms using sine and cosine waves?
I’m not seeing how that would apply here but I’d be interested in hearing how you would do it.
There's been a bunch of work on making splats efficient and good at representing geometry. Reading more, perhaps NeRFs would be a better fit, since they're an actual neural network.
My thinking is that if you trained a NeRF ahead of time to represent the geometry and layout of the levels, and plugged that into the diffusion model (as a part of computing the latents, and then also on the other side so it can be used to improve the rendering) then the diffusion model could focus on learning how actions manipulate the world without having to learn the geometry representation.
To be honest none of the stuff in the paper is very practical, you almost certainly do not want a diffusion model trying to be an entire game under any circumstances.
What you might want to do is use a diffusion model to transform a low poly, low fidelity game world into something photorealistic. So the geometry, player movement and physics etc would all make sense, and then the model paints over it something that looks like reality based on some primitive texture cues in the low fidelity render.
I’d bet money that something like that will happen and it is the future of games and video.
Sounds like a great game.
> not to mention that it requires the game to already exist in order to train it
Diffusion models create new images that did not previously exist all of the time, so I'm not sure how that follows. It's not hard to extrapolate from TFA to a model that generically creates games based on some input
Well, you see a wall, you turn around, then turn back, and the wall is still there. With enough training data the model will be able to pick up the state of the enemy, because it has ALREADY learned the state of the wall thanks to the much more numerous data on the wall. It's probably impractical to do this, but this is only a stepping stone, like I said.
> not to mention that it requires the game to already exist in order to train it.
Is this a problem? Do games not exist? Not only do we have tons of games, but we also have in theory unlimited amounts of training data for each game.
It's really important to understand that ALL THE MODEL KNOWS is a mapping of [pixels, input] -> new pixels. It has zero knowledge of game state. The wall is still there after spinning 360 degrees simply because it knows that the image of a view facing away from the wall while holding the key to turn right eventually becomes an image of a view of the wall.
The only "state" that is known is the last few frames of the game screen. Because of this, it's simply not possible for the game model to know if an enemy should be shown as dead or alive once it has been off-screen for longer than those few frames. It also means that if you keep turning away and towards an enemy, it could teleport around. Once it's off the screen for those few frames, the model will have forgotten about it.
> Is this a problem? Do games not exist?
If you're trying to make a new game, then you need new frames to train the model on.
This is false. What occurs inside the model is unknown. It arranges pixel input and produces pixel output as if it actually understands game state. Like LLMs we don't actually fully understand what's going on internally. You can't assume that models don't "understand" things just because the high level training methodology only includes pixel input and output.
>The only "state" that is known is the last few frames of the game screen. Because of this, it's simply not possible for the game model to know if an enemy should be shown as dead or alive once it has been off-screen for longer than those few frames. It also means that if you keep turning away and towards an enemy, it could teleport around. Once it's off the screen for those few frames, the model will have forgotten about it.
This is true. But then one could say it knows game state for up to a few frames. That's different from saying the model ONLY knows pixel input and pixel output. Very different.
There are other tricks for long term memory storage as well. Think Radar. Radar will capture the state of the enemy beyond just visual frames so the model won't forget an enemy was behind them.
Game state can also be encoded into some frame pixels in the bottom lines. The model can pick up on these associations.
edit: someone mentioned that the game state lasts past a few frames.
>If you're trying to make a new game, then you need new frames to train the model on.
Right, so for a generative model, instead of training on one game you would train on a multitude of games. The model would then output a new type of game based on a seed number.
Alternatively you could have a model generate a model.
All of what I'm saying is of course speculative. As I said, this model is a stepping stone for the future, just as the LLM, which is only trivially helpful now, can be a stepping stone toward replacing programmers altogether.
It's easy to see this by noting that you can often prune networks quite a bit without any loss in performance. I.e. the effective dimension of the manifold the weights live on can be much, much smaller than the total capacity allows for. In fact, good regularization is exactly that which encourages the model itself to be compressible.
Capacity is autological. The amount of information it can express.
Training dynamics are the way the model learns, the optimization process, etc. So this is where things like regularization come into play.
There's also architecture which affects the training dynamics as well as model capacity. Which makes no guarantee that you get the most information dense representation.
Fwiw, the authors did also try distillation.
> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.
And they're not wrong! An ideally trained network could, in principle, learn the data-generating program, if that program is within its class of representable functions. I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).
You're right that there's no guarantee that the model finds the most "dense" representation. The goal of regularization is to encourage that, though!
All over the place in ML there are bounds like:
test loss <= train loss + model complexity
Hence minimizing model complexity improves generalization performance. This is a kind of Occam's Razor: the simplest model generalizes best. So the OP is on the right track - we definitely want networks to learn the "underlying" process that explains the data, which in this case would be a latent representation of the source code (well, except that doesn't really make sense since you'd need the whole rest of the compute stack that code runs on - the neural net has no external resources/embodied complexity it calls, unlike the source code which gets to rely on drivers, hardware, operating systems, etc.)
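For concreteness, one standard textbook instance of a bound with this shape is the Rademacher complexity bound. It is given here only as an illustration, not as the specific bound the comment had in mind. Assume a loss $\ell$ taking values in $[0,1]$, $n$ i.i.d. training samples, and a hypothesis class $\mathcal{F}$. Then with probability at least $1-\delta$, for every $f \in \mathcal{F}$:

```latex
R(f) \;\le\; \widehat{R}_n(f) \;+\; 2\,\mathfrak{R}_n(\ell \circ \mathcal{F}) \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}}
```

Here $R(f)$ is the expected (test) loss, $\widehat{R}_n(f)$ is the empirical (train) loss, and $\mathfrak{R}_n(\ell \circ \mathcal{F})$ is the Rademacher complexity of the loss class, which plays the role of "model complexity".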
They succeeded in the research, gained knowledge, and might be able to do something awesome with it.
It’s a success even if they don’t sell anything.
If it would be able to invent action and maps and let the user play "infinite doom", then it would be very different (and impressive!).
Generating "infinite Doom" is exactly what this model is doing, as it does not capture the larger map layout well enough to stay consistent with it.
I mean... no? Not even close? Multiplying the number of game states by the number of inputs at any given frame gives you a number vastly bigger than 1 billion, not even comparable. Even with 20 days of play time to train on, it's entirely likely that at no point did someone stop at a certain location and look to the left from that angle. They might have done from similar angles, but the model then has to reconstruct some sense of the geometry of the level to synthesize the frame. They might also not have arrived there from the same direction, which again the model needs some smarts to understand.
I get your point, it's very overtrained on these particular levels of Doom, which means you might as well just play Doom. But this is not a hash table lookup we're talking about, it's pretty impressive work.
Not exactly that, but Nvidia does something like this already; they call it DLSS. It uses previous frames and motion vectors to render the next frame using machine learning.
It always blew my mind how well it worked on a 33 MHz 486. I'm fairly sure it ran at 30 fps in 320x200. That gives it just over 17 clock cycles per pixel, and that doesn't even include time for game logic.
My memory could be wrong, though, but even if it required a 66 MHz chip to reach 30 fps, that's still only 34 clocks per pixel on an architecture that required multiple clocks for a simple integer add instruction.
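The cycles-per-pixel figures are easy to verify. This is just the division, assuming the 320x200 resolution and 30 fps from the comment; the helper name `cycles_per_pixel` is made up for this sketch.

```python
def cycles_per_pixel(clock_hz, width=320, height=200, fps=30):
    """CPU clock cycles available per rendered pixel at the given frame rate."""
    return clock_hz / (width * height * fps)

for mhz in (33, 66):
    budget = cycles_per_pixel(mhz * 1_000_000)
    print(f"{mhz} MHz: {budget:.1f} cycles per pixel")
```

That gives roughly 17.2 and 34.4 cycles per pixel, matching the numbers above.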
“Everything” would mean all objects and the elements they’re made of, their rules on how they interact and decay.
A modularized ecosystem i guess, comprised of “sub-systems” of sorts.
The other model, that provides all interaction (cause for effect) could either be run artificially or be used interactively by a human - opening up the possibility for being a tree : )
This all would need an interfacing agent that in principle would be an engine simulating the second law of thermodynamics and at the same time recording every state that has changed and diverged off the driving actor’s vector in time.
Basically the “effects” model keeping track of everyone's history.
In the end a system with an “everything” model (that can grow over time), a “cause” model messing with it, brought together and documented by the “effect” model.
(Again … non technical person, just fantasizing) : )
Repeat for quest lines, new cities, etc, with the npcs having real time dialogue and interactions that happen entirely off screen, no guarantee of there being a massive quest objective, and some sort of recorder of events that keeps a running tally of everything that goes on so that as the PCs interact with it they are never repeating the same dreary thing.
If this were a MMORPG it would require so much processing and architecting, but it would have the potential to be the greatest game in human history.
Off the top of my head, DOOM is open source, so it should be reasonable to set up repeatable scenarios and use some frames from the game to create a starting scenario for the simulation that is the same. Then the input from the player of the game could be used to drive the simulated version. You could go further and instrument events occurring in the game for direct comparison to the simulation. I’d be interested in setting a baseline for playtime of the level in question and using sessions of around that length as an ultimate test.
There are some obvious mechanical deficiencies seen in the videos they’ve published. One that really stood out to me was the damage taken when in the radioactive slime. So I don’t think the analysis would need to be particularly deep to find differences.
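A crude first pass at that comparison could replay the same inputs through both the real engine and the model, then score the frames pairwise with a pixel-level metric such as PSNR (which comes up elsewhere in the thread). The sketch below assumes frames are flat lists of 8-bit pixel values. `psnr` and `compare_runs` are illustrative names, not from any Doom or GameNGen tooling.

```python
import math

def psnr(reference, candidate, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(candidate):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)

def compare_runs(engine_frames, model_frames):
    """Per-frame PSNR for two recordings driven by the same inputs."""
    return [psnr(e, m) for e, m in zip(engine_frames, model_frames)]
```

A pixel metric only tells you how far the pictures drift apart. Mechanical errors like the slime damage would need the instrumented game events mentioned above.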
Heck, it is far simpler than video, because the point of view and frame is fixed.
Further - "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions." specifically "and actions"
User input is being fed into this system and subsequent frames take that into account. The user is "actually" firing a gun.
I highly suggest you read the paper briefly before commenting on the topic. The whole point is that it's not just generating a video.
Wouldn't be as pure though.
In this Sora video the dragon covers half the scene, and it's basically identical when it is revealed again ~5 seconds later, or about 150 frames later. There is lots of evidence (and some studies) that these models are in fact building internal world models.
https://www.youtube.com/watch?v=LXJ-yLiktDU
Buckle in, the train is moving way faster. I don't think there would be much surprise if this is solved in the next few generations of video generators. The first generation is already doing very well.
You always get this from AI enthusiasts, they come and post "proof" that disproves their own point.
Most of the mob of people are indistinct, but there is a woman in a lime green coat who is visible, and then obstructed by the dragon twice (beard and ribbon) and reappears fine. Unfortunately when the dragon fully moves past she has been lost to frame right.
There is another person in black holding a red satchel which is visible both before and after the dragon has passed.
Nothing about the storefronts appears to change. The complex sign full of Chinese text (which might be gibberish text: it's highly stylized and I don't know Chinese) appears to survive the dragon passing without even any changes to the individual ideograms.
There is also a red box shaped like a Chinese paper lantern with a single gold ideogram on it at the store entrance which spends most of the video obscured by the dragon and is still in the same location after it passes (though video artifacting makes it more challenging to verify that that ideogram is unchanged it certainly does not appear substantially different)
What detail are you seeing that is different before and after the obstruction?
If you want to be right because you can find any difference. Sure. You win. But also completely missed the point.
> An ideally trained network could, in principle, learn the data-generating program
No disagreement.
> I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).
Also no disagreement.
I suggested that this probably isn't the case here since they tried distillation and saw no effect. While this isn't proof that this particular model can't be compressed more it does suggest that it is non-trivial. This is especially true given the huge difference in size. I mean we're talking about 700x...
Where I think our disagreement is in that I read the OP as saying __this__ network. If we're talking about a theoretical network, well... nothing I said anywhere is in any disagreement with that. I even said in the post I linked to that the difference shows that there's still a long way to go but that this is still cool. Why did I assume OP was talking about __this__ network? Well because we're in a thread talking about a paper and well... yes, we're talking about compression machines so theoretically (well not actually supported by any math theory) this is true for so many things and that is a bit elementary. So makes more sense (imo) that we're talking about this network. And I wanted to make it clear that this network is nowhere near compression. Can further research later result in something that is better than the source code? Who knows? For all the reasons we've both mentioned. We know they are universal approximators (which are not universal mimicers and have limits) but we have no guarantee of global convergence (let alone proof such a thing exists in many problems).
And I'm not sure why you're trying to explain the basic concepts to me. I mentioned I was an ML researcher. I see you're a PhD at Oxford. I'm sure you would be annoyed if I was doing the same to you. We can talk at a different level.
I agree with you that this network probably has not found the source code or something like a minimal description in its weights.
Honestly, I'm writing a paper on model compression/complexity right now, so I may have co-opted the discussion to practice talking about these things...! Just a bit over-eager (,,>﹏<,,)
Have you given much thought to how we can encourage models to be more compressible? I'd love to be able to explicitly penalize the filesize during training, but in some usefully learnable way. Proxies like weight norm penalties have problems in the limit.
I actually have some stuff I'm working on in that area that is having some success. I do need to extend it to diffusion but I see nothing stopping me.
Personally I think a major slowdown for our community is its avoidance of math. Like you don't need to have tons of math in the papers, but many of the lessons you learn in the higher level topics do translate to usable techniques in ML. Though I would also like to see a stronger push on theory because empirical results can be deceiving (Von Neumann's elephant and all)
edit: someone should train it on MyHouse.wad
Objectively, Simons and Chabris (and many others) have a lot of data to support these ideas. Subjectively, I can say that these types of tasks (inattentional blindness, change blindness, etc.) are humbling.
Even having a clue why I'm linking this, I virtually guarantee you won't catch everything.
And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.
We don't memorize things that the environment remembers for us if they aren't relevant for other reasons.
What this demo demonstrates to me is how incredibly willing we are to accept what seems familiar to us as accurate.
I bet if you look closely and objectively you will see even more anomalies. But at first watch, I didn’t see most errors because I think accepting something is more efficient for the brain.
The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking things not related to that focus. Also, there is social pressure at the end to have performed well at the task; evaluating them on a task which is intentionally completely different from the one explicitly given is going to bias people away from reporting gorillas.
And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.
Why would you think the gorillas were unusual?
Furthermore, even what we attend to isn't always represented with all that much detail. Simons has a whole series of cool demonstration experiments where they show that they can swap out someone you're speaking with (an unfamiliar conversational partner like a store clerk or someone asking for directions), and you may not even notice [0]. It's rather eerie.
Map 1 has 2'518 walkable map units. There are 65'536 angles.
2'518*65'536=165'019'648
If you capture 165M frames, you already cover all the possibilities in terms of camera / player view, but probably the diffusion models don't even need to have all the frames (the same way that LLMs don't).
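For anyone who wants to check the arithmetic, here is a minimal sketch. It takes the two figures quoted above at face value (they are not independently verified) and assumes, as that comment does, that both position and view angle are discrete. The names `WALKABLE_MAP_UNITS`, `VIEW_ANGLES` and `pose_count` are made up for illustration.

```python
# Figures quoted in the comment above; treated here as given, not verified.
WALKABLE_MAP_UNITS = 2_518
VIEW_ANGLES = 65_536  # 2**16


def pose_count(positions: int, angles: int) -> int:
    """Number of (position, angle) pairs if both are treated as discrete."""
    return positions * angles


total_poses = pose_count(WALKABLE_MAP_UNITS, VIEW_ANGLES)
print(f"{total_poses:,}")  # 165,019,648
```

This only counts camera poses under that discreteness assumption; it says nothing about enemies, doors, pickups, or any other game state.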
And Doom movement isn't tile based. The map may be, but you can be in many many places on a tile.
Correct. You are certainly not moving between the tiles as discrete units in doom.
What makes you think a mechanical "predict next frame based on existing games" will be any good?
We could build a 'game' which would learn and adapt to precisely the chemistry that makes someone tick and then provide them a map to find the state in which their brain releases their desired state.
Then if the game has a directive - it should be pointed to work as a training tool to allow the user to determine how to release these chemicals themselves at will. Resulting in a player-base which no longer requires anything external for accessing their own desired states.
Not to mention this childish nonsense about "forget they're playing a game," as if every game needs to be lifelike VR and there's no room for stylization or imagination. I am worried for the future that people think they want these things.
Compare it to music gen algos that can now produce music that is 100% indiscernible from generic crappy music. Which is insane given that 5 years ago they could maybe create the sound of something that someone might describe as "sort of guitar-like". At this rate of progress it's probably not going to be long before AI is making better music than humans. And it's infinitely available too.
Fun variant: give it hidden state by doing the offscreen scratch pixel buffer thing, but not grading its content in training. Train the model as before, grading on the "onscreen" output, and let it keep the side channel to do what it wants with. It'd be interesting to see what way it would use it, what data it would store, and how it would be encoded.
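A rough sketch of what "not grading its content" could look like, assuming the scratch area is simply the bottom rows of each frame and the loss is a plain mean squared error. This is not the paper's training setup; `onscreen_mse` and the frame layout are hypothetical, and it uses nested lists instead of a tensor library to stay self-contained.

```python
from typing import List

Frame = List[List[float]]  # rows of pixel intensities


def onscreen_mse(pred: Frame, target: Frame, onscreen_rows: int) -> float:
    """Mean squared error over the first `onscreen_rows` rows only.

    Rows past `onscreen_rows` are the hypothetical scratch buffer and
    contribute nothing to the loss.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred[:onscreen_rows], target[:onscreen_rows]):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Two predictions that differ only in the scratch row (the last one).
target = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
pred_a = [[0.0, 1.0], [1.0, 0.5], [9.0, 9.0]]
pred_b = [[0.0, 1.0], [1.0, 0.5], [-3.0, 7.0]]

loss_a = onscreen_mse(pred_a, target, onscreen_rows=2)
loss_b = onscreen_mse(pred_b, target, onscreen_rows=2)
print(loss_a == loss_b)  # the scratch row has no effect on the loss
```

Since the scratch rows never enter the loss, the only pressure on what goes there would come indirectly, through how useful it is for later frames when fed back in as context.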
First frame, guy in blue hat next to a flag. That flag and the guy are then gone afterwards.
The two flags near the wall are gone, there is something triangular there but there were two flags before the dragon went past.
Then not to mention that the crowd is 6 people deep after the dragon went past, while just 4 people deep before, it is way more crowded.
Instead of the flag that was there before the dragon, it put in 2 more flags afterwards far more to the left.
In the third second a guy was out of frame for a few frames, and suddenly gained a blue scarf. After the dragon went by he turned into a woman. Next to that person was a guy with a blue cap, he completely disappears.
> Most of the mob of people are indistinct
No they aren't, they are mostly distinct and basically all of them change. If you ignore that the entire mob totally changes in number, appearance, and position, sure, it is pretty good, except it forgot the flags. But how can you ignore the mob when we talk about the model remembering details? The wall is much less information dense than the mob, so it is much easier for the model to remember; the difficulty is in the mob.
> but there is a woman in a lime green coat who is visible,
She was just out of frame for a fraction of a second, not the big bit where the dragon moves past. The guy in blue jacket and blue cap behind her disappears though, or merges with another person and becomes a woman with a muffler after the dragon moved past.
So, in the end some big strokes were kept, and that was a very tiny part of the image that was both there before and after the dragon moved past so it was far from a whole image with full details. Almost all details are wrong.
Maybe he meant that the house looked mostly the same. I agree the upper parts do, but I looked at the windows and they were completely different: they are full of people's heads after the dragon moved past, while before it was just clean walls.
Not in a game and those were enemies, it completely changed what and how many they are, people would notice such a massive change instantly if they looked away and suddenly there were 50% more enemies.
> The model clearly shows the ability to go beyond "image-to-image" rendering.
I never argued against that. Adding a third dimension (time) makes generating a video the same kind of problem as generating an image, it is not harder to draw a straight pencil with something covering it than to draw the scene with something covering it for a while.
But still, even though it is that simple, these models are really bad at it, because it requires very large models and much compute. So I just extrapolated based on their current abilities that we know, as you demonstrated there, to say roughly how long until we can even have consistent short videos.
Note that videos won't have the same progression as images, as the early image models were very small and we quickly scaled up there, while now for video we start at really scaled up models and we have to wait until compute gets cheaper/faster the slow way.
> But also completely missed the point.
You completely missed my point or you changed your point afterwards. My point was that current models can only remember little bits under such circumstances, and to remember a whole scene they need to be massively larger. Almost all details in the scene you showed were missed, the large strokes are there but to keep the details around you need an exponentially larger model.
There cannot be "video compression artifacts" because it hasn’t even seen any compressed video during training, as far as I can see.
Seriously, how is this even a discussion? The article is clear that the novel thing is that this is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.