Don't Look Twice: Faster Video Transformers with Run-Length Tokenization (rccchoudhury.github.io)
https://en.wikipedia.org/wiki/Event_camera
“Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.”
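To make the quoted description concrete, here is a rough sketch of how one could fake event-camera output from ordinary frames. Everything here is an assumption for illustration: the function name `frames_to_events`, the log-intensity threshold model, and the `(t, x, y, polarity)` tuple layout are mine, not from the Wikipedia article or the paper. A real sensor is asynchronous with much finer timestamps, while this version is quantized to frame indices.

```python
import numpy as np

def frames_to_events(frames, threshold=0.2):
    """Toy event-camera simulation (hypothetical helper, not a real sensor model).

    frames: float array [T, H, W] of positive intensities.
    Returns a list of (t, x, y, polarity) tuples, polarity in {+1, -1}.
    """
    eps = 1e-3  # avoids log(0) on black pixels
    # Each pixel remembers the log intensity at which it last fired.
    ref = np.log(frames[0] + eps)
    events = []
    for t in range(1, len(frames)):
        log_i = np.log(frames[t] + eps)
        delta = log_i - ref
        # A pixel reports only when its brightness changed enough; otherwise it stays silent.
        ys, xs = np.nonzero(np.abs(delta) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, int(x), int(y), 1 if delta[y, x] > 0 else -1))
        # Only the pixels that fired update their reference level.
        ref[ys, xs] = log_i[ys, xs]
    return events
```

A static scene produces an empty list, which is the property the thread is comparing to dropping repeated video tokens.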
https://bioone.org/journals/journal-of-vertebrate-paleontolo...
However, I do think that background information can sometimes be important. I reckon a mild improvement on this model would be to leave the background in the first frame, and perhaps every x frames, so that the model gets better context cues. This would also more accurately replicate video compression.
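A minimal sketch of what I mean, assuming grayscale frames and a simple per-patch difference test. This is not the paper's implementation: `keep_mask`, `patch`, `threshold`, and `keyframe_every` are names I made up, and comparing each patch only with the previous frame is a simplification.

```python
import numpy as np

def keep_mask(frames, patch=4, threshold=0.1, keyframe_every=8):
    """Decide which patch tokens to keep (hypothetical helper).

    frames: float array [T, H, W].
    Returns a bool array [T, H // patch, W // patch]; True means keep the token.
    """
    t, h, w = frames.shape
    gh, gw = h // patch, w // patch
    # Cut each frame into non-overlapping patches: [T, gh, gw, patch, patch].
    patches = (
        frames[:, : gh * patch, : gw * patch]
        .reshape(t, gh, patch, gw, patch)
        .transpose(0, 1, 3, 2, 4)
    )
    # Mean absolute change of each patch relative to the same patch one frame earlier.
    diff = np.abs(patches[1:] - patches[:-1]).mean(axis=(3, 4))
    mask = np.ones((t, gh, gw), dtype=bool)  # first frame is always kept whole
    mask[1:] = diff > threshold              # later frames keep only changed patches
    mask[::keyframe_every] = True            # every x-th frame keeps its background too
    return mask
```

With `keyframe_every=8`, frames 0, 8, 16, ... keep all of their tokens, much like keyframes in a video codec, and the frames in between keep only the patches that changed.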
I feel like this is very much like the early days of data compression, where a few logical but kind of ad hoc principles are being investigated ahead of a more sophisticated theory that integrates what is being attempted, how to identify success, and how to recognize pathways toward the optimal solution.
These papers are the foundations of that work.
That's similar to how the human visual system 'paints' a coherent scene from a quite narrow field of high-resolution view, filling in the rest with educated guesses and assumptions.
There are also other recent ones that synthesize a new camera view from any vantage point, not just rotation and FOV changes like the above. But they still might want stabilized video as the baseline input, if they don't already use it.
Besides saccades and tracking, your eyes also do a lot of stabilization, even counter-rotating on the roll axis as you lean your head to the side. I'm not sure whether they roll when tracking a subject that rolls; I would guess that isn't common enough to be needed.