Falsehoods programmers believe about video (haasn.xyz)
I would like to know what's wrong with this approach. I watch a lot of commentated speed-run videos: that's often something like ~244p video, plus soft subtitles. The subtitles get rendered at the source resolution (presumably, into the video framebuffer) and then upscaled along with the image, forcing them to be a tiny blurry mess instead of the crisp, readable text they could be.
Closed captions are positioned on the screen to indicate who's talking, have descriptive audio for sound effects, and should be in a high-contrast, easy-to-read font (most people with hearing deficiencies also have problems seeing, i.e. out-of-date prescriptions for both hearing aids and eyeglasses).
As far as I know, QuickTime does it right but the Apple TV, Netflix, and YouTube fuck it up, but that's because I helped write the QuickTime one way back.
Here is a demo: https://www.youtube.com/watch?v=BbqPe-IceP4
Please do not spread falsehoods.
Disclaimer: I work at YouTube.
The issue you can run into in practice is stuff like softsubbed signs, which can clash and look out of place with the native video if you render them at full res. There's also a related issue, which is that if you're using something like motion interpolation (e.g. “smoothmotion”, “fluidmotion” etc. or even stuff like MVTools/SVP), softsubbed signs will not match the video during pans etc., making them stutter and look very out-of-place - the only way to fix that is to render them on top of the video before applying the relevant motion interpolation algorithms.
Personally I've always wished for a world in which subtitles are split into two files, one for dialogue and one for signs, with an ability to distinguish between the two. (Heck, I think softsubbed signs should just be separate transparent video streams that are overlaid on top of the native picture, allowing you to essentially hardsub signs while still being capable of disabling them)
Also, sometimes, rendering at full resolution is prohibitively expensive, e.g. watching heavily softsubbed 720p content on a 4K screen.
Sure, you have to transform the coordinates to the output. But still, better to render fonts at the final resolution; they'll always look better than if scaled after rendering.
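To make "transform the coordinates to the output" concrete, here is a minimal sketch. It assumes the output has the same aspect ratio as the source and no letterboxing; `SubtitleBox` and `to_output_space` are made-up names for illustration, not any player's API. The point is that position and font size get scaled first, and the glyphs are only rasterized afterwards at the output size.

```python
from dataclasses import dataclass


@dataclass
class SubtitleBox:
    """Hypothetical subtitle placement, in pixels of some coordinate space."""
    x: float
    y: float
    font_px: float


def to_output_space(box, src_size, dst_size):
    """Map a placement authored against the source video onto the output.

    Only geometry is scaled here; the text itself would be rasterized
    afterwards at the resulting font size, instead of upscaling a bitmap.
    Assumes identical aspect ratio and no letterboxing.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return SubtitleBox(box.x * sx, box.y * sy, box.font_px * sy)


if __name__ == "__main__":
    # A 12 px caption on a 320x240 video, shown on a 1280x960 surface.
    print(to_output_space(SubtitleBox(160, 200, 12), (320, 240), (1280, 960)))
```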
The only practical downside I have noticed is that accurate rendering of subs containing complex vector graphics or effects (ASS supports that) at > HD resolutions takes a lot of CPU time, sometimes more than a single core can handle in realtime.
There probably is a lot of potential for optimization, but those are hobby projects for their maintainers.
whilst i don't necessarily agree... i do agree that if you want to conform to specs then you can't go thinking this way.
Hah, this strikes really close to home. I've had to work with so so many subtitle files in Eastern European and Turkish Windows codepages mostly but not entirely compatible with Win-1252. There's no way to tell them apart programmatically, so you check that the extended characters make sense. It's a bit of a nightmare.
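A rough sketch of what "check that the extended characters make sense" could look like: decode the bytes with each candidate codepage and score how many of the non-ASCII characters belong to an alphabet you expect for that codepage. The `EXPECTED` sets below are hand-picked assumptions for illustration, not a complete language model, and ties remain genuinely ambiguous (text whose only extended characters are ö and ü decodes identically in all three).

```python
# Hand-picked, incomplete alphabets per codepage (an assumption for this sketch).
EXPECTED = {
    "cp1250": set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻáčďéěíňřšťúůýžÁČĎÉĚÍŇŘŠŤÚŮÝŽőűŐŰöüÖÜ"),
    "cp1252": set("àâäæçèéêëîïñòóôöùúûüßÀÂÄÆÇÈÉÊËÎÏÑÒÓÔÖÙÚÛÜ"),
    "cp1254": set("çğıöşüÇĞİÖŞÜâîû"),
}


def score(raw: bytes, codepage: str) -> float:
    """Fraction of non-ASCII characters that look plausible for this codepage."""
    try:
        text = raw.decode(codepage)
    except UnicodeDecodeError:
        return -1.0  # bytes that are undefined in this codepage
    extended = [ch for ch in text if ord(ch) > 127]
    if not extended:
        return 0.0  # pure ASCII: every candidate is equally fine
    return sum(ch in EXPECTED[codepage] for ch in extended) / len(extended)


def guess_codepage(raw: bytes) -> str:
    """Pick the candidate with the highest plausibility score."""
    return max(EXPECTED, key=lambda cp: score(raw, cp))


if __name__ == "__main__":
    sample = "Şişli ağacı".encode("cp1254")
    print(guess_codepage(sample))
```

The same bytes decode without error as cp1250 or cp1252 too; they just come out as letters like Þ and ð that the other alphabets don't expect, which is the only signal this heuristic has.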
hell, they don't survive alt-tabbing into a game that has a different resolution than the monitor
Mplayer and co, on the other hand can cope with it but my window manager can mess it up so I don't bother.
> I can exclusively use the video clock for timing
Heh. I just finished writing up a design doc to address problems I had with this, and I referenced "Falsehoods programmers believe about time". Then I opened Hacker News and saw this article. So this is very timely for me.
(My doc: https://github.com/scottlamb/moonfire-nvr/blob/new-schema/de...)
it's a nightmare, but the reason for these observations is precisely that it shouldn't be a nightmare. this area of programming is a wasteland ... nobody that good wants to solve these trivial problems :/
/sarcasm
and
> video decoding is easily parallelizable
At a previous job, I don't know if it was just the field I was in or just bad luck, but having to explain this over and over again was kind of a personal nightmare.
That being said, this is an excellent list!
Also, none of these unfounded preconceptions make intuitive sense, so I don't see why people would believe them.
Interlaced video files should no longer exist.
Seriously, fk interlaced video.
> upscaling algorithms can invent information that doesn’t exist in the image
That's not a falsehood. Upscaling does invent information that doesn't exist in the image.
Yes, they should, as should silent movies, black and white movies, old game consoles with exotic output formats like vector graphics, and the like.
It is a worthy endeavor to create and maintain video playback software that lets people consume beloved content that was made to the technology of its day, including home videos, sports games, TV shows with special effects edited in 60i, and video games.
The upscaled image does not have more information than what was in the original image; you can reconstruct the upscaled image given only the information available in the original image, the output resolution dimensions, and upscaling algorithm.
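A toy illustration of that argument, using nearest-neighbour scaling purely because it is short (the reasoning is the same for any deterministic scaler): the upscaled image is a pure function of the original, the factor, and the algorithm, and in this particular case the original can even be read straight back out of it.

```python
def upscale_nearest(img, factor):
    """Nearest-neighbour upscale of a 2D list of pixel values."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def downsample(img, factor):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


if __name__ == "__main__":
    original = [[1, 2], [3, 4]]
    bigger = upscale_nearest(original, 2)
    print(bigger)
    # Nothing new was learned about the scene: the original is fully recoverable.
    print(downsample(bigger, 2) == original)
```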
https://hn.algolia.com/?query=falsehoods%20programmers%20bel...
And while this topic is not personally relevant to me since I don't work with video decoding, I do find learning about different technologies interesting. Reading this gives me an appreciation for how much effort goes into making video, something we all take for granted, work.
If people only posted articles that were relevant to a majority of readers, HN would be a much less interesting place.
I've no direct experience with, say, Russian or Latin American governments, but cultures that use explicit patronymic or matronymic names might expect that broken out as well.
If you ever need to submit user data to the government (e.g. for tax reasons), and you don't ask your user to break the name apart, then you will necessarily be guessing, which seems strictly worse than just asking them how their name might split.
At the end of the day, if you operate in a given culture, then you need to address those cultural norms. Bending over backwards to support every possible edge case seems unwise if they also happen to disagree with those norms.
"Dear [first name]," flows better than "Dear [opaque string],".
1. Throw an error and won't let me enter my last name as it is supposed to be spelled.
2. Truncate the last part of my last name.
3. Try to be clever and end up shoving the first half of my last name into a middle-name field.
My preference for names, addresses, and other personal data is to stop trying to constrain people to preconceived "standards" and just let them enter their information the way they want it to be.
Normally I just use it with a space in the last name field, but then I get exactly the same problems you mention.
So many mundane things have the "how hard can this be" ..
I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.
The "Name" version is a good example of that, I can easily see how most of the examples on this list can be falsehoods.
On the other hand in TFA some of the affirmations leave me more perplexed. For instance, regarding color conversion: "converting from A to B is just the inverse of converting from B to A". I wonder what's meant here. Is it just a matter of rounding or is there more to it than that?
The catch 22 here is that if you understand this list then chances are you already knew about most of these gotchas.
So yeah, a pretty bad format. Now we just have to write "`Falsehood programmers believe about X` considered harmful".
A better approach would be to pick the list up and turn them into a collaborative work. Wiki, maybe?
Try experimenting with chroma subsampling in JPGs, but note that not all image viewers have good chroma upscaling. MPV can display still images as well as video and you can choose the chroma scaling algorithm.
What's more, YCbCr is more efficiently compressed than RGB even if you don't subsample, for the same reason that a DCT saves bits even if you don't quantize: Linearly dependent or redundant information is moved into fewer components, in this case most of the information moves into the Y channel with the Cb and Cr both being very flat in comparison. (Just look at a typical YCbCr image reinterpreted as grayscale to see what I meant)
but you get the exact same effect from higher resolutions, e.g. going from SD->HD->2K->4K we see the same thing... and we are still doing it, so i would question highly that it is subjectively better in a long-term sense given this continuing trend.
i remember hearing people discuss this sort of thing when HD was new, and they stopped after a while - i suspect because they got used to it, and they now realise how low the quality of the SD image was. i noticed this in myself as well...
edit: incidentally there is a discussion about this here (first google thing i found): http://www.neogaf.com/forum/showthread.php?t=1308591
it seems either nobody or very few are taking the perspective that 4:2:0/4:2:2 looks better, and there are even a few descriptions of precisely what they notice as being worse.
what i think of as undershooting or overshooting is relative to the range... and besides that, what is wrong with clamping? it's how computer graphics has always had to deal with these things... limited range simply doesn't exist in that context, and it doesn't harm anything.
when computer games are forced into limited range for consoles you don't get these unless your tv is applying one of those god awful filters that ruins everything anyway... (i'm still not sure why so many tvs have these - reference monitors never do anything this insane) ... but i can tell you what you do get, a subjectively /and/ measurably worse quality of image than from a monitor.
(i don't think i'm alone in this based on the contents of the ITU-R BT.2100 either... which defines a full range as well as a 'narrow' one)
If you can jump ahead, it would seem to be easy to have multiple threads, starting at key frames to decode the content. You'd have to splice them together, but this seems possible.
It's a resource issue (memory, cpu, etc; and meeting latency requirements between those constraints), versus the subtly different standards "H.264" hardware and software follow, as well as a few other intricacies with how the whole standard works anyways. Again, it's not that it can't be done, but as the article says it can't be done easily or at least in certain situations done consistently.
Key frames are a good anchor around anything you're doing with H264 (and other formats), but they're not the be-all and end-all -- and they may even cause you trouble if you "trust" them too much. It is perhaps a bit like date time programming. You can create something fairly easily that works for a decent amount of time, and even if it ends up being incorrect your clients may not even notice... or it may break down in a catastrophic manner in the future. But doing the latter is certainly not correct and it's certainly not professional. Quite honestly, I'd say date time programming looks like a dream compared to the inconsistent nightmare that is video programming. Date/time logic needs to be sound because many programs rely on consistent and sane output from a program perspective, whereas video programming gets to slide as long as the output is generally correct from a human visual perspective.
It's been a few years since I've dived into this stuff, so some things may have changed/gotten cleaned up. But the article seems to indicate that the ecosystem hasn't really changed.
although i contend that most decoders are very threadable - just that the people trying to do it usually lack the time or the skill, more usually the former.
the state of video in programming is a total mess from my experiences.
The font will look better but you have zero guarantee that the subtitles will be better too. Furthermore, you will lose any artistic value that the creator intended.
For example, go get the Russian movie Night Watch and watch it with the original subtitles hardcoded and as a separate file. The director insisted on doing the subtitles himself and he used them for great artistic effect throughout the movie [1]. Watch it with scaling and aspect ratio stretching to see how nicely rendered, crisp high resolution fonts can be inferior to a pixelated, stretched version created with intent by an artist.
[1] http://readingsounds.net/wp-content/uploads/2015/12/NightWat...
As far as I understand it, limited range was historically used so you could use efficient fixed-function integer math for your processing filters without needing to worry about overflow or underflow inside the processing chain. You can't just “clamp back” a signal after an overflow happens.
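A tiny sketch of why clamping afterwards doesn't help, assuming an 8-bit fixed-function stage whose arithmetic wraps around (the overshoot of 12 is an arbitrary stand-in for something like filter ringing). With full-range white at 255 the overshoot wraps to near-black before any clamp can see it; with limited-range white at 235 the same overshoot still fits in the byte.

```python
def add_u8_wrapping(sample, delta):
    """Model an 8-bit integer stage that silently wraps on overflow."""
    return (sample + delta) & 0xFF


FULL_RANGE_WHITE = 255     # full range uses 0..255
LIMITED_RANGE_WHITE = 235  # limited ("TV") range puts white at 235
OVERSHOOT = 12             # arbitrary example value, e.g. ringing from a filter

if __name__ == "__main__":
    wrapped = add_u8_wrapping(FULL_RANGE_WHITE, OVERSHOOT)
    print(wrapped)                 # the bright pixel has become nearly black
    print(min(wrapped, 255))       # clamping now is too late, the value is already wrong
    print(add_u8_wrapping(LIMITED_RANGE_WHITE, OVERSHOOT))  # headroom absorbs it
```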
Of course, it's pretty much irrelevant in 2016 when floating point processing is the norm and TVs come with their own operating systems, so these days it just exists for backwards compatibility with the existing stuff - which is a property that video standards have tried to preserve as much as possible since the early beginnings of television.
Nobody is trying to argue that 4:2:0 video looks objectively superior to 4:4:4 video if given a free choice. Obviously, full chroma information will always be better, such as is the case for something like a PC monitor vs a TV with subsampling.
The problem is that 4:4:4 chroma requires more bits to compress, so when you're designing a video/image codec, you have to ask yourself whether the difference in bitrate between 4:2:0 and 4:4:4 is worth the difference in quality, and the answer seems to be “no”.
This means that when you're serving, say, a 5 Mbps youtube video where the bitrate is already fixed, 4:2:0 is going to give you more bits to put into useful stuff (e.g. luma plane) instead of having to waste them on mostly-redundant chroma information.
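For a sense of scale, here is the raw sample count per frame under each subsampling scheme. This is only arithmetic on uncompressed samples and assumes even frame dimensions; a real encoder does not spend bits in proportion to sample count, so treat it as an upper bound on what subsampling frees up, not as a bitrate model.

```python
# Fraction of luma resolution kept by EACH of the two chroma planes.
CHROMA_FRACTION = {"4:4:4": 1.0, "4:2:2": 0.5, "4:2:0": 0.25}


def samples_per_frame(width, height, subsampling):
    """Raw samples per frame: one luma plane plus two chroma planes."""
    luma = width * height
    chroma = 2 * luma * CHROMA_FRACTION[subsampling]
    return int(luma + chroma)


if __name__ == "__main__":
    for scheme in ("4:4:4", "4:2:2", "4:2:0"):
        print(scheme, samples_per_frame(1920, 1080, scheme))
```

At 1080p, 4:2:0 carries half as many raw samples as 4:4:4.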
In lossless JPEG it seems they omitted the DCT primarily for this reason: It not being a lossless operation to begin with, if you actually want to store the result. What other lossless codecs often do is store a lossy version such as that produced by a DCT, alongside a compressed residual stream coding the difference (error).
In either case, it's important to note the distinction between reordering and compressing; reordering tricks like DCT can reorder entropy without affecting the number of bits required to store them, but the simple fact of having reordered data can make the resultant stream much easier to predict.
For example, compare an input signal like this one:
FF 00 FF 01 FF 02 FF 03 FF 04 ...
By applying a reordering transformation to move all of the low and high bytes together, you turn it into
FF FF FF FF FF .. 00 01 02 03 04 ..
which is much more easily compressed. As for whether that's the case for (some suitable definition of) lossless DCT, I'm not sure.
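The byte example above can be checked with a very simple predictor. The sketch below assumes a plain "previous byte" delta predictor (my choice for illustration, not something any particular codec is claimed to use) and counts how many distinct residual values each ordering produces; fewer distinct residuals means an entropy coder has much less to do.

```python
def interleaved(n):
    """Build FF 00 FF 01 FF 02 ... with n pairs (n <= 256)."""
    out = bytearray()
    for k in range(n):
        out += bytes([0xFF, k])
    return bytes(out)


def reorder(data):
    """Move all even-position bytes first, then all odd-position bytes."""
    return data[0::2] + data[1::2]


def delta_residuals(data):
    """Residuals of a 'previous byte' predictor, modulo 256."""
    prev = 0
    residuals = []
    for b in data:
        residuals.append((b - prev) & 0xFF)
        prev = b
    return residuals


if __name__ == "__main__":
    raw = interleaved(5)
    print(raw.hex(" "), len(set(delta_residuals(raw))))
    shuffled = reorder(raw)
    print(shuffled.hex(" "), len(set(delta_residuals(shuffled))))
```

Both streams hold exactly the same bytes; only the order differs, yet the reordered one collapses to three distinct residuals while the interleaved one keeps producing new ones.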
The essential function of subtitles and closed captions is to enable a viewer to read dialogue (or contextual audio elements) without needing to either hear or understand the audio. It may be in the same language or not.
As one example, in some Chinese markets TV and movies are all subtitled in Chinese, not (primarily) for the deaf, but because the standard Chinese subtitles are intelligible to readers whose only spoken language is a mutually unintelligible dialect.
In Ukraine we either have three fields in government forms (Family name, Given name, Patronymic) or two (Family name, Given name), for example on ticket booking forms. Or just one, but you should write names in FGP order.
Funny thing is that the patronymic part is left out in transliteration, so in travel documents you see FGP form in Cyrillic and FG form in Latin letters. The transliteration algorithm is a bit funny, so people tend to have different Latin spellings of the same name. And even different Cyrillic spellings of the same name, depending on whether it's written in Ukrainian or Russian, because Russian doesn't have dotted and double-dotted "i"; they use "и" instead.
In the past the formal way to address people used given name and patronymic, but that's not that true anymore.
In documents, sometimes full FGP form is only used at the start of the document and subsequent uses just include family name and first two letters of given and patronymic name. That is the only thing you can safely do automatically. Signatures also use this short form.
Another thing is that the male and female versions of patronymic and family names can differ, so you can't even compare names, not just process them automatically.
And the good thing with patronymic names - combination of three names and birth date uniquely identifies 99.7% of voters, so this really matters.
At the end of the day there are cultural conventions around names, and various agencies use them. I don't see why software should be explicitly culturally neutral, unless your audience is explicitly a global one (and even then, I think localization is preferable to just sticking names into a single field).
Many colour spaces are non-overlapping, i.e. one colour space has colours a different colour space simply doesn't have, so converting between them is often lossy and thus non-invertible.
Wouldn't that be overlapping but non-coextensive? Non-overlapping would be no colors in common between color spaces, which would be odd.
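Gamut aside, plain rounding is already enough to break invertibility. The sketch below uses the full-range BT.601 coefficients as found in JFIF/JPEG with 8-bit integer storage on both sides; it is one illustrative conversion, not a statement about what any given player does. A nearly black green pixel goes in and a slightly different grey comes out.

```python
def clamp8(v):
    """Clamp an integer to the 0..255 range of an 8-bit sample."""
    return max(0, min(255, v))


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF-style) RGB -> YCbCr, rounded to 8 bits."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return tuple(clamp8(round(v)) for v in (y, cb, cr))


def ycbcr_to_rgb(y, cb, cr):
    """Inverse of the matrix above, again rounded to 8 bits."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return tuple(clamp8(round(v)) for v in (r, g, b))


if __name__ == "__main__":
    pixel = (0, 1, 0)
    encoded = rgb_to_ycbcr(*pixel)
    print(pixel, "->", encoded, "->", ycbcr_to_rgb(*encoded))
```

The mathematical matrices are exact inverses of each other; it is the quantization to 8-bit codes in between that throws information away.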
1. Everything said in every "Falsehoods Programmers Believe..." list is true.
The Falsehoods sound like ultimate truths only because of the literary genre. They sound like they were written by an expert who not only knows what's true, but also knows what we think we know, which kind of automatically takes him/her to the next level of expertise.
4. Every falsehood that is true CAN be accounted for.
5. Making your code compatible with a falsehood doesn't come with a price.
6. There are no falsehoods which are mutually exclusive.
Hmm.
Wouldn't "the keyframes just don't work correctly" result in corrupted output anyway?
If we're worrying about already-broken situations then it is quite obvious that additional breakage may occur in related features.
[1] I haven't actually read most of the h.265 spec. It's possible these are technically invalid files.
This is how my subtitles / closed captions have looked for me on youtube for a year or so now [1] (on up-to-date Mac Chrome). The font is extremely small and blurry and practically transparent, and there is a horrible background color, which was usually yellow until a week or so ago, but has now changed to green for Christmas.
All I want for Christmas is readable YouTube text. I'm so glad YouTube is trying to keep up with the season's festivities by changing the background color of their absolutely unreadable text from yellow to green, but shouldn't they try to make the text readable by default somehow instead? Maybe a point size larger than 10 points, and a transparency higher than 10 percent, and a neutral or at least less nauseating background color?
Do all users have different randomly selected fonts and point sizes and colors? Why does it change randomly without any user intervention? Is this some sort of a/b/.../z testing? Get it together, YouTube!
I most certainly didn't do anything to configure the closed captions like this. Are there keyboard commands so power users can quickly switch fonts to strange colors and point sizes, that my cats may have pressed when walking on my keyboard?
Some genius at YouTube decided to implement persistent keyboard shortcuts that enable cats to easily and stealthily change the closed captioned text into unreadable colors!
My cat can press "o" to make the text lighter and fuzzier, and press "b" to cycle through a garish series of primary background colors plus black and white, including the same color as the text, rendering it invisible. There may be others, but I can't tell and I'm afraid to try.
Hoping that my opposable thumbs would enable me to get some help, I pressed "?" expecting to get a list of keyboard shortcuts, but that didn't do anything but violate the Principle of Least Astonishment [2].
It's not all my cat's fault, though -- some of the blame lies with YouTube: purposefully designing, implementing and not documenting such annoyingly cat-friendly but unhelpfully user-hostile keyboard shortcuts.
Googling for "youtube keyboard shortcuts" doesn't show any links to official YouTube documentation on the first page of results -- the top featured hit is an outdated page from an "SEO Consultant" full of social networking widgets and ads and self promotion, that doesn't even mention the closed captioning related keyboard shortcuts, which my cat discovered all by himself.
Does YouTube itself even document its own keyboard shortcuts online anywhere, let alone providing pop-up "?" help?
And does anybody really think that changing the transparency and background color of closed captioned text is so important that it deserved several dedicated undocumented keyboard shortcuts, no matter what the usability consequences were? Or that the user's inadvertent color and transparency preferences should be persisted across all videos instead of applied per-video? Who would even want partially transparent text anyway, let alone a key to change between several transparencies?
[2] https://en.wikipedia.org/wiki/Principle_of_least_astonishmen...
Please assume the other side possibly doesn't know something you know (if that's really the case here), instead of being rude and accusing them of spreading falsehoods.
That is frustratingly poor contrast.
Here is what mine look like http://imgur.com/HLIVXQ6
You can click settings again to change font sizes, font family, colors, etc.
Really? I've been assuming that addressing people by first name--even people you've just met--is now the default, at least for the United States and Canada. Are you in the USA/Canada, by the way?
I know that it used to be rude to address someone by their first name unless you knew them well. You had to say, Mr. last-name or Mrs/Ms./Miss last-name. I know this from old movies.
But I thought that the etiquette has changed completely: First name is fine and last name sounds rather formal. Do others have a different experience?
This is incorrect. Calling people by something less formal than Mr./Miss/Mrs. <family-name> is certainly the current norm, but the alternative used is a personal choice of the person being addressed and often different from (sometimes, though far from always, a shortened form of) the legal personal name.
> I know that it used to be rude to address someone by their first name unless you knew them well. You had to say, Mr. last-name or Mrs/Ms./Miss last-name.
It remains rude to address someone by less formal terms until and unless you have sufficient contact with them to know the less formal appellation that they prefer you to use. It is more expected now than in the past that people will very quickly accept the use of less formal address and inform you of their preferred form.
It is also more common for businesses wishing to feign familiarity to presume that first name information from a customer registry, credit card, or other source is equivalent to stating a preferred form of address and consenting to have the businesses agents use that form; it is not, and quite a lot of people react badly to it. You would be well advised not to imitate those businesses.
Nope.
> I know that it used to be rude to address someone by their first name unless you knew them well.
Correct.
> But I thought that the etiquette has changed completely: First name is fine and last name sounds rather formal. Do others have a different experience?
Outside of school/university it's simply presumptuous. Don't do it.
Far creepier is the thing where they use your purchase history to make predictions about your health condition, like where Target would send out baby-related stuff when its algorithms discerned that customers were likely pregnant: http://www.businessinsider.com/the-incredible-story-of-how-t...
Compared to that stuff, "Dear Frondo" seems absolutely benign.
So I still do not see how this would prohibit parallel processing.
However, software that is 100% perfect is pretty much impossible to write, and if you think there's a systematic issue, please file a bug, so it can help others in same situation.
The bug reports I've submitted to google have been ignored, and that's a frustrating distraction from what I'm paid to do. Maybe if you submit one yourself, somebody will pay attention, because google is paying you to work on youtube, and hopefully they will take you more seriously than their users.
FWIW, I always read "file a bug report", when not used to mean "I need more detail" but to mean "talk to the hand" and when spoken by someone working close to a project, as "fuck off", particularly if the person never even bothered to determine whether or not you've used their bug tracker in the past (or even filed a bug already for this specific issue).
When I find someone on a forum with a bug that I haven't heard about, I sit around and talk to them until they either get tired of wanting to talk to me or I get the information I need to fix the problem. The alternative would essentially translate to "I don't actually care about this bug", as that's the only way you are going to get certain classes of bug report. I have shown people at Apple bugs that they were absolutely fascinated by momentarily and then told "File a Radar", which I clearly wasn't in a position to do at the moment and which of course I forgot to do when I got home... they should know this happens, because this assuredly happens to almost every single person they tell that to (and no, "well, we do see a large number of bugs filed" is not evidence against "people you tell to file a bug using your arcane system, particularly if they have to do it days later, probably won't"), and yet even when a potentially rare and real and critical bug is shown to them in person (this was even at an event where the whole point was to work with customers on their issues), their response is essentially "engh, I don't care if this doesn't work unless it affects a ton of people". As someone who works in security, I'm going to assert "do you want vulnerabilities? because this attitude is how you get vulnerabilities": every bug is precious as it is a mistake in your mental model of the software, and who knows how far down the rabbit hole that mistake will take you.
Sure: I realize that the engineer isn't always the best person to do this, and even in my tiny company I had to solve that, but the solution isn't to tell people to "go use the bug tracker", a comment which shunts annoying work learning a new system, one which is all too likely to demoralize them (Apple's Radar is a great example of this), but instead to have someone whose job is to talk to people to follow up with credible bugs: I'd go "hey Xyz, there's a guy on this forum who's complaining about something I hadn't heard of before; can you try to get more details from them?" (where Xyz has changed over the years, but has always been one of the few key positions). I couldn't begin to count the number of times I have debugged an issue with someone on reddit.
I can't actually remember ever being in a business meeting/academic setting, where Mr. or Ms. was used, at any point. In movies, sure, but it really does seem quaint.
No, it's not. Preferred personal informal names are the norm for anyone you've been introduced to (and are normally part of that introduction); those may be legal personal ("first" in the usual English order) names, but often are legal middle names, derivative forms of either first or middle names, or names distinct from any legal name.
you cut the video into a handful of parts at keyframes, process the parts individually in a streaming manner and then splice the partial results together.
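As a toy model of that scheme: frames are either "I" (self-contained) or "P" (a delta against the previous decoded frame), segments start at each "I" frame, and segments are decoded independently and then concatenated. This is a sketch under strong assumptions that real streams often violate, as others in the thread point out: the stream starts with a keyframe, and nothing ever references a frame on the other side of a keyframe. The frame format and function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_segment(segment):
    """Decode frames in order; 'P' frames depend on the previous decoded frame."""
    decoded = []
    for kind, payload in segment:
        if kind == "I":
            decoded.append(payload)                # keyframe: self-contained
        else:
            decoded.append(decoded[-1] + payload)  # delta against previous frame
    return decoded


def split_at_keyframes(stream):
    """Start a new segment at every 'I' frame."""
    segments = []
    for frame in stream:
        if frame[0] == "I" or not segments:
            segments.append([])
        segments[-1].append(frame)
    return segments


def decode_parallel(stream, workers=4):
    """Decode each segment on its own worker, then splice results in order."""
    segments = split_at_keyframes(stream)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(decode_segment, segments))  # map preserves order
    return [frame for part in parts for frame in part]


if __name__ == "__main__":
    stream = [("I", 10), ("P", 1), ("P", 2), ("I", 50), ("P", -5), ("I", 7)]
    print(decode_parallel(stream))
    print(decode_parallel(stream) == decode_segment(stream))
```

In this idealized model the spliced result matches a sequential decode exactly; the hard part in practice is everything the model leaves out.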
If we're talking about playback then creating seeking-thumbnails could similarly benefit from parallel processing.
If YouTube were open source, and I could look at the source code of the keyboard handler to find the cause of the problem myself, prove my bad experience was not just a falsehood to be brushed off, and possibly even suggest a fix, then maybe I would have been more motivated to put my own time into filing a bug report.
But Google is a huge well funded advertising company that paid billions of dollars for YouTube and makes billions of dollars off of it, has a huge complex system set up for digital rights management, promoting and paying for advertisements, enabling copyright holders to report violations, paying many employees for actively pursuing and resolving those copyright violations, removing inappropriate content, hiring conservative lobbyists and sending executives to kiss Donald Trump's ring [1], etc.
So I would expect YouTube employees to put at least as much time and effort into reporting bugs about their own product to their employer, as they put into monetizing YouTube while defending its reputation from people they perceive as spreading falsehoods about it.
[1] http://www.reuters.com/article/us-usa-trump-google-idUSKBN14...
I don't have access to file a bug through a work account for the next few days, and if you come across any issues (like CC broken by default), please file a bug with a lot of details. People do look at that stuff. I am glad that you found the source of the issue, and I hope you can agree that it would've been impossible to find it if I had just filed a bug. I do not work anywhere close to the team that implemented the CC, and when people have said "file a bug" to me in a work context, they have meant it as a way to "let's keep track of this so it's not forgotten". Luckily, the people I have met at work have been good about this. I do not speak for Google or anyone else there, just sharing my own personal experience.
It's a design and documentation bug, that needs to be addressed at a higher level by re-evaluating the decisions and justifications behind all the keyboard accelerators, removing the ones that nobody actually uses and that cause more problems than they solve (like making closed captioned text transparent and changing its colors), implementing full and immediate "?" keyboard help, and writing some online documentation.
So should I simply click "send feedback" on any random youtube video and write up my suggestions, as this page tells me to? [1] I've done that now, so let's see what happens.
Do you really sincerely think my suggestion will actually make it back to the designers through that channel and that changes will happen as a result? Is there a way for me to track it?
Or is there a better accountable bug tracking system that I can actually submit a real trackable bug into and watch the progress and see if it gets marked "will not fix", like https://bugs.chromium.org but for youtube? Do you have access to a better bug tracking system for youtube that's not public?
Luckily, at YT, I have not met anyone that said "file a bug report" and meant "fuck off". I have worked with some people in the past at a different company that have meant that, but not here. Usually it has meant "let's file a bug to not forget about it", this is just my experience, and I just wanted to share it. I am only speaking for myself, and others might have different experiences.