Nov 16 GCP Load Balancing Incident Report (status.cloud.google.com)

172 points by joshma 4 years ago | 76 comments

darkwater 4 years ago |

"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."

This reminds everyone that even the top-notch engineers that work at Google are still humans. A bugfix that didn't really fix the bug is one of the more human things that can happen. I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.

nanis 4 years ago | |

> "even the top-notch engineers that work at Google are still humans."

A top result in a Google search tells me: "According to 2016 annual report as of December 31, 2016 there was 27,169 employees in research and development and 14,287 in operations."

With that many people, it is unreasonable to flat-out assume that everyone who works at Google is top-notch. This kind of stereotyping is insidious in the labor-market for people who are otherwise excellent but do not have the magic fairy dust of Google sprinkled on them.

It is important to remember that no matter how impressive the machinery looks from the outside, everything eventually traces back to a human being typing some text in some editor with an imperfect model of how some lego pieces fit together.

dninednjwryv 4 years ago | | |

This exactly. I work at FAANG and am so tired of the stereotype. Lots of very mediocre people everywhere

darkwater 4 years ago | | |

I beg to disagree. I have been through a couple of processes with FB/Google and the bar is insanely high. I have to say that I have no college degree, just years of work experience, and I'm not the type who will study to prepare for an interview; I think I should know everything requested by heart because I'm familiar with it or used to doing it. I guess that maybe there are people who prepare for this and then, once they are in... relax.

hiddencost 4 years ago | | |

2016 was a very long time ago.

throwoutway 4 years ago |

Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.

I’m not good with statistics but what are the chances?

hyperman1 4 years ago | |

Race conditions are weird.

I had a service that ran fine for years if not decades on Java. One day, a minor update came in to the GNU core utils, which were not at all used by the service itself, and this somehow triggered the race every time in less than 5 minutes, taking down our production cluster. The same update didn't do anything to preproduction, even under much higher load than prod had.

There was a clear bug to fix and a clear root cause. Even so, I never understood what exactly pushed it over the edge.

notyourday 4 years ago | |

Being someone who read hundreds of incident reports and postmortems that I was involved in personally in some capacity on a "fixing" side and thousands on a receiving side, I'm always amazed that otherwise intelligent people believe the details shared in them. The art of writing a postmortem is the art of feeding hungry hyenas in a zoo without blowing a budget: the details are bunk used to convince the hyenas to continue to eat the food rations.

Here's what this postmortem actually says :

* There was an undeniable, user observable issue between 10:04 and 11:28 PT as the customers could not change configuration.

* There was some root cause issue that we will say ran between time X and time Y, we do not acknowledge that your specific service was impacted in that window, unless specified separately.

* At some point we worked around/fixed the underlying issue.

* At 11:28 we fixed the user observable issue.

* The following is the number of minutes we acknowledge to be down for SLA purposes. Remember to pay your bill.

dastbe 4 years ago | |

i think they are higher than you expect, because usually what causes the bug to be known is a worsening state of the system that makes the bug more likely to be hit.

i would ask how the engineer found the race condition, and whether that doesn’t imply a much greater risk.

skunkworker 4 years ago | | |

This. As the state continues to worsen, the chance grows that someone observing will go "huh, that looks off" and look into it, all while your system hasn't toppled over yet. No notice or write-up would be necessary, but you now definitely know what the problem is. And then, while you are working on a patch, the system finally topples over and causes an incident/outage.

mattlondon 4 years ago | | |

There likely was monitoring for various "problems" in production - error rates, validation failures etc, or even just good old crash counts.

An alert may have fired that led to someone debugging the issue in detail.

I can totally imagine a slow creeping Metric Of Death that has slowly slowly slowly been creeping up for ages and then suddenly breaches some threshold and then becomes a problem.
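
As a rough sketch of that kind of creeping metric (the numbers are invented for illustration and have nothing to do with the actual incident): a validation-failure count that grows a little every week stays under the alert threshold for months and then crosses it.

```python
def first_breach(samples, threshold):
    """Return the index of the first sample above the threshold, or None."""
    for i, value in enumerate(samples):
        if value > threshold:
            return i
    return None

# Hypothetical metric: failures per million requests, creeping up by 200 a week.
weekly_failures_per_million = [1000 + 200 * week for week in range(52)]

# Hypothetical alert threshold of 5000 failures per million.
breach_week = first_breach(weekly_failures_per_million, threshold=5000)
print(breach_week)  # 21
```

Nothing about the system changes abruptly in week 21; the alert is simply the first moment anyone is told to look.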

Spooky23 4 years ago | |

Load balancers and database servers are great candidates for this type of bug.

You can live with something for a long time, but once you hit a critical mass or trigger a particular condition, failures cascade.

jedimastert 4 years ago | |

Race conditions aren't random, but chaotic. It's very probable that the reason the race condition wasn't caught in the first place is that it was probably "impossible" to trigger until some butterfly-patch flapped its wings halfway across the server farm to cause cascading millisecond changes in timing to ripple out.

scottlamb 4 years ago | |

Off-hand, the odds seem pretty low. But maybe some seemingly-unrelated performance change in the release before made the race more likely to go badly. If so, it may not be just a coincidence that an engineer found the problem and there actually was a production outage so close together. I've seen things like that before.

zeckalpha 4 years ago | |

Pretty high with enough bugs.

sudhirj 4 years ago | |

The chances are relatively low, but this is survivorship bias, no? The thousands or tens of thousands of times the problem was fixed before it manifested are invisible to us.

InsomniacL 4 years ago | |

imagine the following:

If Service B returns before Service A, an error occurs. Service A is lightning fast, and Service B is a slug. Service A incurs an unexpected performance penalty for every new user added to the system. This slight incremental degradation adds up; eventually additional system load, such as a periodic virus scan on Service A, has a chance to push it over the edge.
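
A toy simulation of that scenario, as a sketch only: the latency distributions, the per-user penalty, and the function name are all assumptions picked to make the effect visible, not anything from the incident report.

```python
import random

def failure_rate(users, trials=20_000, seed=0):
    """Fraction of trials in which Service B returns before Service A."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Assumed latencies in ms: A starts fast but slows a bit per user.
        a = rng.gauss(5.0 + 0.001 * users, 2.0)
        # B is the "slug": slow, but unaffected by the user count.
        b = rng.gauss(100.0, 10.0)
        if b < a:
            failures += 1
    return failures / trials

for users in (0, 50_000, 80_000, 95_000):
    print(users, failure_rate(users))
```

With few users the race is effectively never lost; as Service A's average latency approaches Service B's, the failure rate climbs toward a coin flip. That is how a bug can sit dormant for months and then start firing.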

snowwolf 4 years ago | |

I don't know the rollout process but perhaps it involves taking servers offline, putting more load on the still live unpatched servers, increasing the probability of the race condition occurring?

londons_explore 4 years ago | |

I could imagine that the mitigations they had put in place were perhaps just in the process of being removed, perhaps by some engineer who was slightly ahead of the rollout finishing...

It's the same as me seeing apt on my machine is 88% done installing some package and deciding that's probably enough to make it runnable in a new tab...

daenz 4 years ago | |

Bingo, if I was being paranoid, I would say someone leaked knowledge of this exploit after it was discovered.

lrem 4 years ago | | |

Being a Googler privy to the internal postmortem: there was no way to trigger this externally (the faulty server is in the control plane) AND triggering this by a Google engineer would require some determination and leaving a ton of audit trail.

cranekam 4 years ago | | |

It’s much more likely that other factors increased the chances of hitting the bug. Maybe the race condition was more likely to be hit if the amount of configuration data increased or the frequency with which configuration changes were compiled went up? The component with the bug doesn’t exist in a vacuum and its behaviour could likely be influenced by external systems.

htrp 4 years ago |

Did Roblox ever release the incident report from their outage?

xyst 4 years ago | |

I haven’t been able to locate anything since the Halloween announcement

https://blog.roblox.com/2021/10/update-recent-service-outage...

Maybe they are hoping most people forget?

encryptluks2 4 years ago | | |

Haha, it was down for like 2-3 days. Prob waiting to announce a major security incident.

chairmanwow1 4 years ago |

Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.

My company wasn’t affected, so I wasn’t paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable.

Anyone else have stabilizing anecdata?

lrem 4 years ago | |

As a Googler privy to the internal postmortem: as stated in the public postmortem, all traffic was unaffected within 33 minutes of the problem appearing. The bug was very on/off: at 09:35PT a corrupted configuration stopped ~immediately (usually double digit seconds of propagation delay) all traffic. At 10:08PT it was verified that the whole service is running the configuration from before the corruption.

The >1h duration was for inability to change your load balancing configuration.

roytries 4 years ago | |

Maybe you're thinking of this incident? https://status.cloud.google.com/incidents/1xkAB1KmLrh5g3v9ZE.... It was a few days earlier and took almost 2 hours.

cajones314 4 years ago | |

We received errors at least 45 minutes before their stated time. :-/

lrem 4 years ago | | |

Then you have been hit by some other issue.

leetrout 4 years ago | |

It was definitely more than the 404s they are claiming. Go playground was 503'd.

detaro 4 years ago | | |

Which it could easily have been because it itself received a 404 from something and couldn't handle that.

bullen 4 years ago |

This is my experience of the outage: My DNS servers stopped working but HTTP was operational if I used the IP, so something is rotten with this report.

Lesson learned I will switch to AWS in Asia and only use GCP in central US, with GCP as backup in Asia and IONOS in central US.

Europe is a non-issue for hosting because it's where I live and services are plentiful.

I'm going to pay for a fixed IP on the fiber I can get that on and host the first DNS on my own hardware with lead-acid backup.

Enough of this external dependency crap!

MayeulC 4 years ago | |

> I'm going to pay for a fixed IP on the fiber

This is nice for backup, but I would expect more downtime from your ISP than the big cloud platforms. Also, you might want a platform with anycast DNS if you care about (initial page load) latency.

bullen 4 years ago | | |

Sure you get more downtime, that's why I have 2x fibers with my 100% read uptime database between them, that way both fibers have to go down at the same time for existing customers to be unable to login.
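
Back-of-the-envelope version of that, as a sketch: the 99.5% per-link availability is a made-up figure, and the calculation assumes the two fibers fail independently, which a shared conduit, power feed, or upstream provider would break.

```python
def both_down_probability(avail_a, avail_b):
    """Chance that two links are down at once, assuming independent failures."""
    return (1 - avail_a) * (1 - avail_b)

def downtime_hours_per_year(unavailability):
    """Convert a fraction of time unavailable into hours per year."""
    return unavailability * 365 * 24

single = 1 - 0.995                          # one fiber, assumed 99.5% available
pair = both_down_probability(0.995, 0.995)  # both fibers down together

print(downtime_hours_per_year(single))
print(downtime_hours_per_year(pair))
```

Under those assumptions, one fiber is down for roughly 44 hours a year, while both are down together for well under an hour.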

I noticed DNS was a bit slow on first lookup, it's not a big deal for my product and well worth the extra control.

I looked up anycast, and it's unclear how you enable that if you have your own DNS servers. I have 3, one on each continent, but I'm pretty sure the DNS provider I use does not use the DNS in the right region!

Is that something you tell the root DNS servers about through your domain name registrar?

You would think this had been built into the root servers ages ago? They can clearly see where my DNS servers are!?

breakingcups 4 years ago | | |

Anecdotally, I've had 100% uptime on my ISP for the past 3 years and have read many a cloud provider's post mortem in that time.

My company hosts a large portion co-located in a datacenter and has the same uptime as my ISP. Clouds seem to be more complex which invites more opportunity for things to go wrong.

breakingcups 4 years ago |

What I would not give for a comprehensive leak of Google's major internal post-mortems.

gigatexal 4 years ago |

I find the post mortem really humanizing. As a customer of GCP there’s no love lost on my end.

justicezyx 4 years ago | |

Why?

If you read past post mortems, you should notice that configuration induced outages have been the sole category of all large-scale outages.

GCP is repeating the same mistake with similar cycle. (Don't quote me on this, that's just my impression)

That means they are not improving the situation.

theevilsharpie 4 years ago | | |

> If you read past post mortem, you should notice that configuration induced outages have been the sole category of all large-scale outages.

Is it really that surprising? GCP's services are designed to be fault tolerant, and can easily deal with node and equipment failures.

Bugs and configuration errors are much more difficult to deal with, because the computer is doing what it's been programmed to do, even if that isn't necessarily what they wanted or intended. Correctness-checking tools can catch trivial configuration errors, but problems can still slip through, especially if they only manifest themselves under a production load.

If GCP were repeating literally the same failure over and over again, I could understand the frustration, but I don't think that's the case here. Demanding that GCP avoid all configuration-related outages seems unreasonable -- they would either have to stop any further development (since after all, any change has the potential to cause an outage), or they'd need some type of mechanism for the computer to do what the developers meant rather than what they said, which is well beyond any current or foreseeable technology and would essentially require a Star Trek-level sentient computer.

cjbprime 4 years ago | | |

Configuration change being the most likely cause of outage is true across all post mortems, not solely Google's. It feels like you're blaming them for not solving something that no-one else knows how to solve either. Facebook outage, Salesforce, Azure, it's all configs.

londons_explore 4 years ago |

This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.

oofbey 4 years ago | |

Companies sugar-coating their outage reports is a pet peeve of mine and a real trust-buster. “Some users may experience delays” typically means the whole thing is completely dead. Companies that are really open and honest about such things are rare these days but really deserve praise and support for doing so.

dustintrex 4 years ago | | |

Most companies don't release outage reports period (exhibit A: Roblox). The cloud hyperscalers kind of need to though, because it's not just their own business on the line.

In the early days of GCP, major outage reports were written and signed by SRE VP Ben Treynor:

https://status.cloud.google.com/incident/compute/16007

londons_explore 4 years ago | | |

"5% of users faced difficulty logging in" typically means that the whole service was down, but that only 5% of users attempted to use the service during the downtime. They also count accounts that have been dormant since 2004... so it looks like a smaller number...

stevefan1999 4 years ago |

one bug fixed, two bugs introduced...

m0zg 4 years ago |

> customers affected by the outage _may have_ encountered 404 errors

> for the inconvenience this service outage _may have_ caused

Not a fan of this language guys/gals. You've done a doo-doo, and you know exactly what percentage (if not how many exactly) of the requests were 404s and for which customers. Why the weasel language? Own it.

kixiQu 4 years ago | |

if I had to guess, not a Googler...

Someone in a tech role wrote something like "because of the limitations of XYZ system we can't get a crisp measurement of the number of 404 errors customers experienced", failed to add a ballpark estimate because they thought everyone was on the same page about severity, and someone polishing the language saw it and interpreted it as "I mean, who can really say whether there were 404s?"

And the latter one would have been originally written as something more normal, then someone else read it and objected, "Most customers were outside of the blast impact!" (or somesuch) so then because the purpose of the post was informational to all customers, instead of scoping the apology to the customers who were impacted they came up with that language.

Committee communications are a painful mess, and the more important everyone thinks an issue is the more likely they are to mangle it.

cyral 4 years ago | |

Yeah we saw 100% of requests fail for a 20 minute timeframe for our production service, nothing made it through. Definitely a lot more than “may”.

SteveNuts 4 years ago |

Is there any possibility that data POSTed during that outage would have leaked some pretty sensitive data?

For example, I enter my credit card info on Etsy prior to the issue and just as I hit send the payload now gets sent to Google?

At that scale there have to be many examples of similar issues, no?

cjbprime 4 years ago | |

Why do you think there might be? They just described how the error was their system returning 404s.

londons_explore 4 years ago |

This to me shows Google hasn't put sufficient monitoring in place to know the scale of problems and the correct scale of response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes. (On top of diagnosis and response times)

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter is a really large load on all surrounding infrastructure, so needs to be tested properly. But doing so can reduce a 25 minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback)
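
The arithmetic in that last paragraph, spelled out as a quick sketch. The 10-minute diagnosis time, 15-minute rolling restart, and 30-second rollback are the figures from this comment; the function name is made up for illustration.

```python
def outage_minutes(diagnosis_min, recovery_sec):
    """Total outage: time to identify the culprit plus time to recover."""
    return diagnosis_min + recovery_sec / 60

rolling_restart = outage_minutes(diagnosis_min=10, recovery_sec=15 * 60)
nuke_and_restart = outage_minutes(diagnosis_min=10, recovery_sec=30)

print(rolling_restart, nuke_and_restart)  # 25.0 10.5
```

Once recovery takes seconds, diagnosis time dominates the total, so the estimate depends almost entirely on how fast the faulty component can be identified.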

nine_k 4 years ago | |

The 15 seconds figure may be very wishful thinking. Often a service startup is a short burst of severe resource consumption. Doing it with 100% of the fleet at once may stall everything in an uncontrollable overloaded state.
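
To make the trade-off concrete, here is a small sketch that counts how many instances are in their startup burst at the same moment under different rollout schedules. The fleet size, batch sizes, and timings are hypothetical, and real startup cost is not a flat per-instance number.

```python
def peak_concurrent_startups(fleet_size, batch_size, startup_sec, batch_interval_sec):
    """Peak number of instances simultaneously in their startup burst."""
    events = []
    t = 0
    remaining = fleet_size
    while remaining > 0:
        n = min(batch_size, remaining)
        events.append((t, n))                 # batch begins starting up
        events.append((t + startup_sec, -n))  # batch finishes starting up
        remaining -= n
        t += batch_interval_sec
    # At equal timestamps, process finishes (negative) before new starts.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

print(peak_concurrent_startups(1000, 1000, 15, 15))  # everything at once
print(peak_concurrent_startups(1000, 100, 15, 15))   # batches, no overlap
print(peak_concurrent_startups(1000, 100, 15, 5))    # batches, overlapping
```

Restarting everything at once puts the whole fleet's startup burst on the surrounding infrastructure simultaneously. Batching caps the peak, but the rollout then takes a multiple of the single-instance startup time.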

codeflo 4 years ago | | |

Is infrastructure at this scale typically unable to do a cold start? I can believe that this is very difficult to design for, but being unable to do it sounds dangerous to me.

(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)

piyh 4 years ago | | |

Everyone flushing the toilet at the same time to clean the pipes

sdenton4 4 years ago | |

'Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.'

Take some time to consider what a restart means, across many data centers on machines which have no memory of the world before the start of their present job...

detaro 4 years ago | |

> rollback to the last known good configuration

could very much be the "fast" option. 15s restart, or anything close to it, across the entirety of it sounds quite unlikely.

foota 4 years ago | |

15 second rollbacks don't exist at scale.