Perceptually lossless (talking head) video compression at 22kbit/s

Perceptually lossless (talking head) video compression at 22kbit/s (mlumiste.com)

225 points by skandium 1 year ago | 140 comments

gwd 1 year ago |

This reminds me of a scene in "A Fire Upon the Deep" (1992) where they're on a video call with someone on another spaceship; but something seems a bit "off". Then someone notices that the actual bitrate they're getting from the other vessel is tiny -- far lower than they should be getting given the conditions -- and so most of what they're seeing on their own screens isn't actual video feed, but their local computer's reconstruction.

Rebelgecko 1 year ago | |

Was that the same book that had the concept of (paraphrasing using modern terminology) doing interstellar communications by sending back and forth LLMs trained on the people who wanted to talk, prompted to try and get a good business deal or whatever?

alex-robbins 1 year ago | | |

That happened in Redemption Ark by Alastair Reynolds (2002), though of course the idea may also have been used before or since.

DoneWithAllThat 1 year ago | | |

This idea was also used in The Algebraist.

miohtama 1 year ago | |

And also it was a deep fake.

BTW This is the best sci-fi book ever.

Retric 1 year ago | | |

Might be better if you like space opera style really soft science fiction. I really didn’t enjoy it.

jf 1 year ago | | |

I beg to differ. A Deepness In The Sky is the best sci-fi book ever.

_kb 1 year ago | |

At least for audio, that dystopia is already shipping in end-user product: https://blog.webex.com/collaboration/hybrid-work/next-level-...

janandonly 1 year ago | |

I came here to reply with exactly this and found a fellow geek had beaten me to it. Indeed a brilliant book.

zbobet2012 1 year ago |

These sorts of models pop up here quite a bit, and they ignore fundamental facts of video codecs (video-specific lossy compression technologies).

Traditional codecs have always focused on trade-offs among encode complexity, decode complexity, and latency, where complexity = compute. If every target device ran a 4090 at full power, we could go far below 22kbps with traditional codec techniques for content like this. 22kbps isn't particularly impressive given these compute constraints.

This is my field, and trust me we (MPEG committees, AOM) look at "AI" based models, including GANs constantly. They don't yet look promising compared to traditional methods.

Oh and benchmarking against a video compression standard that's over twenty years old isn't doing a lot either for the plausibility of these methods.

skandium 1 year ago | |

This is my field as well, although I come from the neural network angle.

Learned video codecs definitely do look promising: Microsoft's DCVC-FM (https://github.com/microsoft/DCVC) beats H.267 in BD-rate. Another benefit of the learned approach is being able to run on soon-to-be-commodity NPUs, without special hardware accommodation requirements.

In the CLIC challenge, hybrid codecs (traditional + learned components) are so far the best, so that has been a letdown for pure end-to-end learned codecs, agreed. But something like H.267 is currently not cheap to run either.

zbobet2012 1 year ago | | |

Winning in BD-rate isn't hard, though. You need to win in BD-rate and have a hardware-implementable, power-efficient, cheap decoder.

Agreed, hybrid presents a real opportunity.

AzzyHN 1 year ago | | |

Did you mean H.266? Or is there some secret H.267 that hasn't been agreed upon yet?

smokel 1 year ago | |

Why so sour? This particular article doesn't seem to ignore a lot, it even references the Nvidia work that inspired it, as well as a recent benchmark.

Someone was just having fun here, it's not as if they present it as a general codec.

AzzyHN 1 year ago | |

Really? Would there be a way to replicate this with currently available encoders? I'd like to try it

LeoPanthera 1 year ago |

This is very impressive, but “perceptually lossless” isn’t a thing and doesn’t make sense. It means “lossy”.

Vecr 1 year ago |

Fire Upon the Deep had more or less this. Story important, so I won't say more. That series in general had absolutely brutal bandwidth limitations.

MayeulC 1 year ago |

I like how the saddle in the background moves with the reconstructed head; it probably works better with uncluttered backgrounds.

This is interesting tech, and the considerations in the introduction are particularly noteworthy. I never considered the possibility of animating 2D avatars with no 3D pipeline at all.

antiquark 1 year ago |

Not quite lossless... look at the bicycle seat behind him. When he tilts his head, the seat moves with his hair.

manmal 1 year ago | |

His gaze also doesn’t quite match.

hinkley 1 year ago | | |

Why is nobody noticing the eyes?? This is important!

I feel like I’m taking crazy pills.

metaphor 1 year ago | |

Very noticeable jitter in bicycle front tire too.

red0point 1 year ago |

> But one overlooked use case of the technology is (talking head) video compression.

> On a spectrum of model architectures, it achieves higher compression efficiency at the cost of model complexity. Indeed, the full LivePortrait model has 130m parameters compared to DCVC’s 20 million. While that’s tiny compared to LLMs, it currently requires an Nvidia RTX 4090 to run it in real time (in addition to parameters, a large culprit is using expensive warping operations). That means deploying to edge runtimes such as Apple Neural Engine is still quite a ways ahead.

It’s very cool that this is possible, but the compression use case is indeed... a bit far-fetched. An insanely large model requiring the most expensive consumer GPU to run on both ends, while at the same time being limited in bandwidth so much (22kbps), is a _very_ limited scenario.

gambiting 1 year ago | |

One cool use would be communication in space - where it's feasible that both sides would have access to high-end compute units but have a very limited bandwidth between each other.

bliteben 1 year ago | | |

Wonder if it's better than a single color channel hologram though?

bityard 1 year ago | | |

Bandwidth is not the limitation in space comms, latency is.

JamesLeonis 1 year ago | | |

Increasingly mobile networks are like this. There are all kinds of bandwidth issues, especially when customers are subject to metered pricing for data.

loa_in_ 1 year ago | |

Staying in contact with someone for hours on a metered mobile internet connection comes to mind. Low bandwidth translates to low total data volume over time. If I could video chat on one of those free internet SIM cards, that would be a breakthrough.
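
For a rough sense of scale, here is a back-of-the-envelope sketch of that data volume. It assumes the article's 22 kbit/s figure for video only, with no audio and no protocol overhead; the function name is just illustrative.

```python
def megabytes_per_hour(kbps):
    """Total data volume for one hour of streaming at a constant bitrate."""
    bits = kbps * 1000 * 3600      # kbit/s -> bits per hour
    return bits / 8 / 1_000_000    # bits -> bytes -> megabytes (decimal MB)


# One hour of talking-head video at the article's bitrate
print(megabytes_per_hour(22))
```

Under those assumptions an hour of video comes to roughly 10 MB, which is the kind of budget a heavily metered plan could plausibly absorb.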

omh 1 year ago | |

One use case might be if you have limited bandwidth, perhaps only a voice call, and want to join a video conference. I could imagine dialling in to a conference with a virtual face as an improvement over no video at all.

hinkley 1 year ago |

The second example shown is not perceptually lossless, unless you’re so far on the spectrum you won’t make eye contact even with a picture of a person. The reconstructed head doesn’t look in the same direction as the original.

However, it does raise an interesting property: if you are on the spectrum or have ADHD, you only need one headshot of yourself staring directly at the camera, and then the capture software can stop you from looking at your taskbar or off into space.

DCH3416 1 year ago | |

> unless you’re so far on the spectrum you won’t make eye contact even with a picture of a person.

I don't know. I think you'd be surprised.

That's already kind of an issue with vloggers. Often they're looking just left or right of the camera at a monitor or something.

AndrewVos 1 year ago |

Elon weirdly looks more human than usual in the AI version!

initramfs 1 year ago |

Nice feature for low-bandwidth 4G cell systems.

Reminds me of the video chat in Metal Gear Solid 1 https://youtu.be/59ialBNj4lE?t=21

dormento 1 year ago | |

Now that you mention it, it never occurred to me that Snake's radio transmitted video as well. "Did you like my new sunglasses?"

If you could reserve a small portion of the radio bandwidth to broadcast a thumbnail + low bandwidth compressed representation of the face movements, you could technically have something similar without encoding any video (think low res, eye + mouth movements).
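
A minimal sketch of that idea: send the thumbnail once, then only per-frame keypoint movements. It assumes normalized (x, y) keypoints and a hypothetical signed 8-bit quantization per axis; the function names and the scale of 127 are illustrative and not taken from the article or from LivePortrait.

```python
import struct

SCALE = 127  # hypothetical: one quantization step is 1/127 of the frame size


def _quantize(delta):
    """Map a normalized coordinate delta to a signed 8-bit integer, clipping large moves."""
    return max(-128, min(127, round(delta * SCALE)))


def encode_deltas(prev, curr):
    """Pack the (dx, dy) movement of each keypoint into 2 bytes."""
    payload = bytearray()
    for (px, py), (cx, cy) in zip(prev, curr):
        payload += struct.pack("bb", _quantize(cx - px), _quantize(cy - py))
    return bytes(payload)


def decode_deltas(prev, payload):
    """Rebuild the current keypoints from the previous ones plus the packed deltas."""
    points = []
    for i, (px, py) in enumerate(prev):
        dx, dy = struct.unpack_from("bb", payload, 2 * i)
        points.append((px + dx / SCALE, py + dy / SCALE))
    return points


prev = [(0.50, 0.50), (0.25, 0.75)]
curr = [(0.51, 0.49), (0.25, 0.80)]
payload = encode_deltas(prev, curr)
restored = decode_deltas(prev, payload)
print(len(payload), restored)
```

In a real stream the encoder would have to diff against the decoder's reconstructed keypoints rather than the originals, otherwise the quantization error accumulates from frame to frame.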

hinkley 1 year ago | |

Nice feature for many to one video conferencing as well. Though I don’t know if the organizers will agree.

pastelsky 1 year ago |

Did not expect to see Emraan Hashmi in this post!

shaan7 1 year ago | |

Indeed! Bollywood makes it to HN xD

stuaxo 1 year ago |

Bit off-putting that it's Musk for some reason. Maybe it's just overexposure to his bullshit; I could quite happily never see him again.

Maybe there is a custom web filter in there somewhere that could block particular people and images of them.

Separo 1 year ago | |

We could run something like TensorFlow.js in a Chrome extension to identify the person in the image and replace it in the DOM. A little resource-intensive to run inference on every image, but probably worth it in this case.

JimDabell 1 year ago |

I got some interesting replies when I suggested this technique here:

https://news.ycombinator.com/item?id=22907718

userbinator 1 year ago |

> the only information that needs to be transmitted is the change in expression, pose and facial keypoints

Does anyone else remember the weirder (for lack of a better term) features of MPEG-4 part 2, like face and body animation? It did something like that, but as far as I know nearly no one used that feature for anything.

https://en.wikipedia.org/wiki/Face_Animation_Parameter

> and in the worst, trust on the internet will be heavily undermined

...as long as the model doesn't include data to put a shoe on one's head.

accra4rx 1 year ago |

Why does he have to deepfake Emraan Hashmi, the serial kisser?

tommiegannert 1 year ago |

Now that we're moving towards context-specific compression algorithms, can we please use WASM as the file header for these media files, instead of inventing something new. :)

up2isomorphism 1 year ago |

“Perceptually lossless” is an oxymoron.

ranger_danger 1 year ago | |

As there are several patents, published studies, IEEE papers and thousands of Google results for the term, I think it's safe to say that many people do not agree with your interpretation of the term.

hinkley 1 year ago | |

You’re still listening to vinyl, arntcha?

Lossiness definitely matters when you’re doing forensics. But not for consumers.

If you just want to bop to Taylor who the fuck cares. The iPod ended that argument. Yes I can be a perfectionist, or I can have one thousand songs in my pocket. That was more than half of your collection for many people at the time.

up2isomorphism 1 year ago | | |

Calm down dude. It is just a marketing term for something lossy.

esafak 1 year ago | |

It means you don't perceive the loss. What are you arguing; that you can perceive any loss?

Brian_K_White 1 year ago | |

There is no oxymoron in "no perceived loss".

andrewstuart 1 year ago |

The more magic AI makes, the less magical the world becomes.

EarlKing 1 year ago | |

Clearly Sauron is a jealous ringmaker and doesn't like hobbits using his ring to shitpost.

Joel_Mckay 1 year ago | | |

Probably just disappointed at the wasted bandwidth:

24fps * 52 facial 3D markers * 16bit packed delta planar projected offsets (x,y) = 19.968 kbps

And this is done in Unreal games on a potato graphics card all the time:

https://apps.apple.com/us/app/live-link-face/id1495370836

I am sure calling modern heuristics "AI" gets people excited, but it doesn't seem "Magical" when trivial implementations are functionally equivalent. =3
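
The arithmetic above can be checked in a few lines of Python. This sketch assumes, as the figure implies, that each marker's (x, y) delta is packed into 16 bits total and that there is no container or protocol overhead; the function name is illustrative.

```python
def stream_kbps(fps, markers, bits_per_marker):
    """Raw bitrate of a marker-delta stream, in kbit/s (1 kbit = 1000 bits)."""
    return fps * markers * bits_per_marker / 1000


# 24 frames/s, 52 facial markers, 16 bits per marker
print(stream_kbps(24, 52, 16))
```

That lands just under the article's 22 kbit/s before any entropy coding is applied.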

psychoslave 1 year ago | |

The greatest feat ever: let magic disappear before wonder of understanding.

xyzsparetimexyz 1 year ago | |

Oh shut up. There are plenty of awful uses for AI, but this isn't one of them.

HPsquared 1 year ago | |

This is the power of numerical methods.

andrewstuart 1 year ago | | |

There’s a finite amount of magic and if AI borrows it here then it must be repaid there.

andai 1 year ago | |

What did you mean by this?

satvikpendem 1 year ago | |

> Any sufficiently advanced technology is indistinguishable from magic.

- Arthur C. Clarke

andai 1 year ago | |

andai 1 year ago | | |

Why am I downvoted for asking parent to clarify? Was I impolite for not using a full sentence?