Genie: Generative Interactive Environments (sites.google.com)
If these are generating fully interactive environments, why are all the clips ~1 second long?
Based on the first sentence in your paper, I would have expected a playable example as a demo. Or 20.
But reading a bit further into the paper, it sounds like the model needs to be actively running inference and will generate the next frame on the fly as actions are taken. Is that correct?
Secondly, why are all the videos about half a second long? I thought video generation had come much further than this. My guess is that the world model unravels at any length longer than that, which is (and has always been) the problem with models like these. Video generation aside, we already had pretty good world models for games; see the Dreamer line of work: https://danijar.com/project/dreamerv3/
Anyway, about my second question: why are the videos only half a second or so long? Does the model unravel after that?
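For readers wondering what "generate the next frame on the fly as actions are taken" would look like in code, here is a minimal sketch of an action-conditioned autoregressive rollout. It is not Genie's implementation: `toy_next_frame` is a hypothetical stand-in for a learned dynamics model, and frames are plain nested lists of floats. The point is only that every new frame depends on the frames generated so far plus the latest action, so small per-step errors can compound over a long rollout.

```python
import random


def toy_next_frame(history, action, rng):
    """Hypothetical stand-in for a learned dynamics model (an assumption).

    Shifts the latest frame horizontally by `action` pixels (wrapping around)
    and adds a little noise, then clamps values to [0, 1].
    """
    last = history[-1]
    width = len(last[0])
    shifted = [[row[(x - action) % width] for x in range(width)] for row in last]
    return [
        [min(1.0, max(0.0, v + rng.gauss(0.0, 0.01))) for v in row]
        for row in shifted
    ]


def rollout(first_frame, actions, seed=0):
    """Generate one new frame per action, feeding generated frames back in."""
    rng = random.Random(seed)
    frames = [first_frame]
    for action in actions:
        # The model only ever sees its own previous outputs plus the action.
        frames.append(toy_next_frame(frames, action, rng))
    return frames
```

In an interactive setting the `actions` list would come from the player one step at a time, which is why inference has to keep running while the "game" is played.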
Also
> This is the first version of something that is now possible and will only improve with scale.
11B params is already pretty large by Stable Diffusion and LLM standards. How much higher do we need to scale until we get something useful beyond simple setups?
In the video, the character becomes a pixelated mess. In the static image, the character is clearly on rocks in the foreground, but in the "game" we see the character magically jumping from the foreground rocks to the background structure which also contains significant distortions.
The extremely short demo videos make it slightly harder to catch these obvious issues.
The internal politics at these places must be exhausting. Industry research was supposed to be free from the publish or perish mindset, but it seems like it just got replaced by a different kind of need for posturing.
The resolution is 90p but we use an upsampler to make it 360p for examples on the website.
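To make the 90p to 360p step concrete, here is a minimal sketch of 4x upsampling. The comment does not say which upsampler is used, so nearest-neighbour repetition is an assumption chosen only to show the resolution arithmetic; `upsample_nearest` is a hypothetical helper, not part of any Genie code.

```python
def upsample_nearest(frame, factor=4):
    """Nearest-neighbour upsampling: repeat each pixel `factor` times per axis.

    `frame` is a list of rows, each row a list of pixel values.
    This is an illustrative stand-in, not the upsampler used for the website.
    """
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(factor)]  # widen the row
        out.extend(list(wide) for _ in range(factor))   # repeat it vertically
    return out


# A 90-row frame becomes a 360-row frame (the 160-pixel width is an assumption).
low_res = [[0] * 160 for _ in range(90)]
high_res = upsample_nearest(low_res, factor=4)
```

Upsampling of this kind adds pixels but no new detail, so artifacts present at 90p remain in the 360p clips.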
Point of clarification -- we don't expect bigger models to be the only way to improve this, and we are working on innovations on the modeling side. However, we don't want to overlook the significance of scaling either :)
Why not add inductive biases then and make your life easier? What's with this choice to try and do everything the hard way, presumably to make a point? In the end the point made is so specific that it translates to nothing that is usable in real problems.
See MuZero, for example: sure, you can learn without being given the rules explicitly, just from the win/loss signal, but then that only works in board games and Atari games, without a snowball's chance in hell of working in the real world. We're dazzled by the technical prowess, but real utility? Where is that?
In the Appendix we have a case study that should be possible to re-implement and run with a single GPU/TPU. We are hoping the community can build from that and innovate. If you take these steps and get stuck, feel free to get in touch!