Keeping Netflix Reliable Using Prioritized Load Shedding (netflixtechblog.com)
Thank you to the engineers and developers!
Granted, we had specific QoS/traffic shaping to improve reliability without gobbling up all the bandwidth (streaming Netflix was an advertised feature of the wifi service), but it still seemed like magic.
I'm amazed that service allowed streaming though...
Which is why TCP is a horrible choice for any streaming service and a horrible choice for lossy connections, and I would be quite surprised if Netflix relied on it. UDP is the perfect choice for streaming, since video decoders can handle packet loss pretty well. The rest you can achieve with a good tradeoff between Reed-Solomon codes and key framing.
And if this is true, then how could it be that Amazon works without problem and Netflix doesn’t?
I'd imagine this is largely due to MSS clamping rather than actual MTU caused packet loss.
I assume the browse screen is based entirely on TCP?
I'm struggling to understand why packet loss would prevent it from loading -- it should be slower but TCP should handle re-transmission, no?
Or is Netflix doing something tricky with UDP even in their browsing UX?
Back in the day we used to have timeouts based on individual reads/writes, which often better answer "is this HTTP request making progress?". However, the problem with this sort of timeout is that it doesn't compose well, so most people end up with an end-to-end deadline.
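One way to get both, as a rough sketch: keep a per-read/write timeout but cap it by whatever is left of the end-to-end deadline. The `Deadline` class and its method names are hypothetical, not taken from any particular HTTP library.

```python
import time


class Deadline:
    """End-to-end budget that per-operation timeouts are derived from (hypothetical helper)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + budget_s

    def remaining(self):
        # Seconds left in the overall budget, never negative.
        return max(0.0, self._expires - self._clock())

    def op_timeout(self, per_op_s):
        # Timeout for the next read/write: the per-operation limit,
        # capped by what is left of the end-to-end deadline.
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("end-to-end deadline exceeded")
        return min(per_op_s, left)
```

Before each read you would call something like `sock.settimeout(deadline.op_timeout(3.0))`, so a stalled read fails fast while the whole request still cannot outlive its budget.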
QUIC doesn't count because it's not tricky.
I'd love to see a source for this but seeing as YouTube works great over regular HTTP and TCP, I doubt anyone else is out in the weeds trying some custom UDP solution and reinventing wheels.
Used to have similar problems with an ADSL line but found if I limited the line (both up and down) I could find a magic number where the packet loss went away. (Well, most of the time :))
Though it did need to be tuned for different times of the day, i.e. high-congestion times need it to be lower.
Though technically it shouldn't be your problem :(
You may employ techniques more complex than a simple bucketing mechanism, such as acutely observing the degree to which clients are exceeding their baseline. However, these techniques aren’t free. The cost of simply throwing away the request can overwhelm your server - and the more steps you add before the shedding part, the lower the maximum throughput you can tolerate before going to 0 availability. It’s important to understand at what point this happens when designing a system that takes advantage of this technique.
For example, if you do it at the OS level, it is a lot cheaper than leaving it to the server process. If you choose to do it in your application logic, think carefully about how much work is done for the request before it gets thrown away. Are you validating a token before you are making your decision?
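A minimal sketch of "decide before you do the work", assuming a simple priority-bucket scheme. The names (`Request`, `max_admitted_priority`, `handle`), the four priority levels and the 0.6 utilization threshold are all made up for illustration and are not Netflix's actual values.

```python
from dataclasses import dataclass


@dataclass
class Request:
    priority: int  # assumed convention: 0 = most critical, higher = less important
    token: str


def max_admitted_priority(utilization, levels=4, start=0.6):
    """Highest priority number still admitted at this utilization (0.0 to 1.0)."""
    if utilization < start:
        return levels - 1  # no shedding below the threshold
    frac = min(1.0, (utilization - start) / (1.0 - start))
    # Shed progressively more buckets, but always keep priority 0.
    shed = min(levels - 1, int(frac * levels))
    return levels - 1 - shed


def handle(req, utilization, validate_token):
    # Cheap integer comparison first: shed before any expensive work happens.
    if req.priority > max_admitted_priority(utilization):
        return 503
    # Expensive step (e.g. token validation) only for admitted requests.
    if not validate_token(req.token):
        return 401
    return 200
```

The point of the ordering is that a shed request costs one comparison, not a token validation, so the shedding path stays cheap as load rises.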
It is becoming de rigueur to quell 99th-percentile latency spikes (i.e. 1 in 100 requests takes substantially longer) by terminating the requests, which may not always be in the best interest of the user even if it is convenient for the devops teams and their promotion packets.
Looks like the arrow goes the wrong direction.
Seems like a pretty bad Medium bug.
Edit: it is a bad link and I can see why this would happen if you had the Medium app installed. It’s a “branded” Medium post (i.e. appears on the Netflix-owned domain) but clicking the link redirects you to medium.com then redirects you back to the cname.
"Load Shedding".
Shout-out to my fellow South Africans.
Some of the things they mentioned were also user impacting, like not being able to select a video's language, but less critical. You obviously still want that feature, but it's less important than being able to watch at all.
"Clearly nobody cares about" - what? The whole point here is "people care most about video streaming" and less about the metadata etc that they lower in priority.
https://www.latimes.com/california/story/2020-09-21/online-l...
Fire up the developer tools / network view and go watch a Netflix video; try pausing, etc. It is incredibly straightforward.
Might as well suggest that HFT finance firms enter a business of providing fast and reliable internet service to rural areas, because their employees have an extremely high expertise in providing bleeding-edge insanely responsive internet service from the exchanges to their offices (not kidding at all, they legitimately drilled through mountain ranges[0] and set up microwave towers just to get an edge over competitors[1])
0. https://www.ft.com/content/d81f96ea-d43c-11e7-a303-9060cb1e5...
1. https://www.bloomberg.com/news/features/2019-03-08/the-gazil...
There’s “inflight entertainment” where all the movies/shows are indeed stored locally on the plane, with either seatback or custom/white label streaming app for BYOD.
But in addition they were advertising streaming Netflix and YouTube over the satellite WiFi.
https://news.ycombinator.com/item?id=8638946
Even the live ones like Twitch.
Because they all want to run through HTML5 web browsers, re-use the same TLS as everyone else, and not write a ton of new code.
When QUIC gets big, they'll probably switch to UDP, not because it's better on every connection, but because it will be popular and it will be better on lossy connections. But for now TCP does work fine.
That's why youtube-dl can rip video without implementing tons of weird proprietary protocols - It's just HTTPS. Otherwise these video sites wouldn't run at all in Firefox.
UDP provides no out-of-order packet handling, which _needs_ to be handled for video streaming. UDP is by default unbuffered throughout transport and tends to cause greater stress to client systems since they need to respond per packet rather than per traffic stream (IP+port combo). As a client developer, you end up reimplementing 90-95% of what TCP gives you out of the box at great development and QA cost. You also drain battery on mobile devices with all the interrupts you're causing by doing UDP. The upside with a UDP-based implementation is that the latency from server to client display is usually much lower (tens of milliseconds vs hundreds to thousands), but the trade-offs involved are almost never worth it for a static media streaming site like Netflix.
Even dynamic media streaming sites like Twitch rarely dip into UDP server-client implementations unless there are some unusual requirements.
Netflix is pure TCP I'm sure - look up HLS and DASH.
Perhaps probabilistically terminating calls would work better? I assume the decision has to be made ahead of time with timeout contexts if they're anything like cancellation tokens, so even if you give just 5% of all your inbound requests a deadline 10000x as long, you’ll still get some useful info to work with.
As a user, I would absolutely hate it. I somehow frequently run into pockets of badly written or architected code that cause some of my requests to take a minute or more to be fulfilled on an otherwise responsive server - if I had to retry “just” twenty times for it to go through, I’d lose my mind.
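The probabilistic-deadline idea above could look roughly like this, assuming the deadline is chosen once when the request arrives. `pick_deadline` and its parameters are hypothetical; the 5% and 10000x defaults come straight from the comment.

```python
import random


def pick_deadline(base_s, long_fraction=0.05, multiplier=10_000, rng=random.random):
    """Return the deadline (seconds) for one inbound request.

    A small fraction of requests gets a much longer deadline, so some slow
    requests still complete and can be inspected instead of all being cut off.
    """
    if rng() < long_fraction:
        return base_s * multiplier
    return base_s
```

Passing `rng` in makes the choice easy to pin down in tests. In production you would probably also want to log which requests drew the long deadline.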
Supporting Path MTU discovery (PMTUD), or perhaps just capping their outbound packets to 1450 or similar. Cloudflare found and fixed a problem in this space: https://blog.cloudflare.com/path-mtu-discovery-in-practice/
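For the "cap outbound packets" option, a sketch of what that might look like on a single socket, assuming 1450 means the IP packet size and ignoring TCP options. The helper names are hypothetical. `TCP_MAXSEG` is a real socket option but is not available on every platform, and a server-wide fix would more likely live in the OS or network configuration.

```python
import socket

IPV4_TCP_OVERHEAD = 40  # 20-byte IPv4 header + 20-byte TCP header, no options
IPV6_TCP_OVERHEAD = 60  # 40-byte IPv6 header + 20-byte TCP header, no options


def mss_for_packet_size(packet_bytes, ipv6=False):
    """TCP payload size (MSS) that keeps each packet within packet_bytes."""
    overhead = IPV6_TCP_OVERHEAD if ipv6 else IPV4_TCP_OVERHEAD
    return packet_bytes - overhead


def clamp_mss(sock, packet_bytes=1450, ipv6=False):
    """Ask the kernel to limit segment size; call before connect()."""
    if not hasattr(socket, "TCP_MAXSEG"):
        raise OSError("TCP_MAXSEG is not supported on this platform")
    sock.setsockopt(
        socket.IPPROTO_TCP,
        socket.TCP_MAXSEG,
        mss_for_packet_size(packet_bytes, ipv6),
    )
```

With a 1450-byte cap over IPv4 this works out to an MSS of 1410 bytes.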
Thanks for sharing, I learned a lot from that blog post.