Google Cloud region currently down due to water intrusion

Google Cloud region currently down due to water intrusion (status.cloud.google.com)

289 points by kalabilla 3 years ago | 173 comments

dang 3 years ago |

All: please don't post low-effort comments that merely react to the first association you have. We're trying for curious conversation here, which is something else.

https://news.ycombinator.com/newsguidelines.html

jacquesm 3 years ago |

This seems to significantly under-report what's going on, see:

https://www.theregister.com/2023/04/26/google_cloud_outage/

There is mention of a fire as well.

madaxe_again 3 years ago | |

Oh, the irony.

A few years ago I implemented a top to bottom ISO27k1 ISMS for a client handling extremely sensitive and mission-critical data for industry.

One risk I recommended controls for was that of a fire and/or flood at their primary datacentre for their client-facing offerings - this datacentre. I’ve experienced the misery of a datacentre oops myself, firsthand, twice, and it’s a genuine risk that has to be mitigated.

At my insistence, I had them burn hundreds of man-hours ensuring that they could failover to a new environment in a different datacentre with a bare minimum of fuss, as what I arrived to was an all the eggs in one basket situation. It took a fair bit of re-engineering of how deployments worked, how data was replicated, how the environment was configured - but they got there, and the ISMS was put into operation, and was audited cleanly by a reputable auditor, and everyone lived happily ever after.

Except… they were acquired by private equity. Who had no truck with all of this costly prancing about with consultants and systems. Risk register? Why do we need this? What value does it add today? ISO27k1? Don’t be silly. We have that certificate. You don’t need it. Dev team, ops team, leadership — almost everyone — ejected and replaced with a few support staff.

I see their sites are down.

jacquesm 3 years ago | | |

There's that beautiful German word again... schadenfreude. I have had similar discussions multiple times in the last year and the magic thinking around the cloud is so strong that it is sometimes impossible to get through. The fact that cloud stuff can go down and that in the end it is your data and no amount of cloud credits are going to help you if your data is lost seems to be utterly beyond some people's comprehension.

DebtDeflation 3 years ago | |

Plot twist: the server racks were made out of sodium.

H8crilA 3 years ago | | |

You're not far off: the batteries are (probably) made of lithium.

Also, why batteries in a datacenter? When you implement a flush() command at the lowest level you're faced with two choices: 1) actually write to disk, then return from the call, 2) write to some cache/RAM and have just enough battery locally to ensure that you can write it to disk even if all power goes out.

Then there's the other problem of surviving long enough between a power interruption and diesel generators starting up. But this is a smaller problem, rebooting all instances in a datacenter is less bad than losing some data that was correctly flush()ed by software. Bad flush() behaviour can result in errors that cannot be recovered from without a complicated manual intervention (for example if it causes corrupted and unreadable database files).
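
To make option 1 concrete, here is a minimal sketch in Python, assuming a POSIX-like system. The name durable_write is made up for illustration. os.fsync asks the OS to push the file's data to the storage device before the call returns. Whether the device then holds it on stable media or in a battery-backed cache (option 2) depends on the hardware and is not visible from this code.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to flush its buffers for this file to the device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "record.bin")
        durable_write(target, b"committed")
        with open(target, "rb") as f:
            print(f.read())
```

The trade-off is latency: every call waits for the device, which is why hardware with a protected write cache can acknowledge the flush earlier.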

ironmagma 3 years ago | | |

NaCl, the revolutionary Sodium Cloud technology.

pcurve 3 years ago | |

The outage has been going on for 40+ hours now...

I think this is sort of big.

jonatron 3 years ago | |

This doesn't sound as bad as OVH's 2021 fire.

nik736 3 years ago | | |

Well, we had pictures of the OVH fire very quickly. Google seems to be not very transparent about what exactly is happening...

sschueller 3 years ago | | |

If you drench a fire in water in a DC it might be just as bad.

jacquesm 3 years ago | | |

I wouldn't draw any conclusions just yet.

t0mas88 3 years ago |

I can't ignore the feeling that Google Cloud is subpar compared to AWS. How did this cause a multi-zone failure again? Why haven't they fixed those dependencies after the last few times they had a full region failure?

radicaldreamer 3 years ago |

Not sure what kind of fire there was there, but once those automatic sprinkler systems get going, they are very difficult to stop.

Someone in my freshman college dorm decided to use one as a clothes hanger hook and broke the thermometer in there. The sprinkler damaged the entire floor with water and the floor below had spotty rain as well.

The fire department came and was mainly concerned about evacuating everyone rather than shutting the water off.

The water is typically chemically treated and has been sitting there for years as well -- very nasty stuff.

sbierwagen 3 years ago | |

Always worth tracking down the sprinkler shut off valve in your residence/place of work. If you’re in a high rise it’ll be the big red wheel on the sprinkler main in the fire stairs. If it’s a spurious activation you can just shut it off yourself, you don’t need to ask anybody’s permission.

The fire department is always going to prioritize safety of life, and after all it’s not their stuff getting soaked.

Kon-Peki 3 years ago | | |

> The fire department is always going to prioritize safety of life, and after all it’s not their stuff getting soaked.

They won't hesitate to smash your stuff or break down your walls either.

Being in a fire is no joke. You've got to be crazy to think that your stuff is important. It's not.

newZWhoDis 3 years ago | | |

True, and if you mess up how are they gonna know?

Your fingerprints won’t survive the fire!

local_crmdgeon 3 years ago | | |

Please do not disable or tamper with your building's sprinkler systems.

You will not care about your stuff when you're in jail for negligent manslaughter.

deagle50 3 years ago | |

This happened in my freshman dorm as well. The broken sprinkler was on the 3rd or 4th floor and my room which was on the 1st got at least an inch of water.

mvanbaak 3 years ago | |

Datacenters don't use sprinkler systems (or at least they should not).

packetslave 3 years ago | | |

A non-water fire suppression system for a 300,000+ square-foot warehouse-scale datacenter would be incredibly expensive.

palcu 3 years ago |

[disclaimer: SRE @ Google, I was involved with the incident, obvious conflicts of interest]

Hey Dang, thanks for cleaning up the thread. One thing to note is that the title is not correct. The entire region is not currently down, as the regional impact was mitigated as of 06:39 PDT, per the support dashboard (though I think it was earlier). The impact is currently zonal (europe-west9-a), so having zone in the title as opposed to region would reflect reality more closely.

Finally, there's lots of good feedback on this thread and on the previous one (https://news.ycombinator.com/item?id=35711349), so we obviously have a lot of lessons to learn.

Waterluvian 3 years ago | |

Would you be able to comment a bit on the emotional (perhaps there’s a better word) aspect of the response?

Was there a lot of anxiety? Panic? Or was it just a “woof that sucks. Time to follow a checklist and then do a bunch of paperwork”?

What I’m curious about is what it feels like on a team at a company like Google when there is a major system failure.

palcu 3 years ago | | |

There's not much emotion as the core team working on the huge outages is more like an "SRE for SRE". They are all people who've been with the company for a long time and they've been in the secondary seat for at least one previous big rodeo. Not to mention that we're all running a checklist that has been exercised multiple times and there's always somebody on the call who could help if a step fails.

Personally, I wasn't part of the actual mitigation of the overall Paris DC recovery this time, as I was busy with an unfortunate[0] side effect of the outage. These generate more anxiety, as being woken up at 6am and being told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.

Finally, it's worth repeating that incident management is just a part of the SRE job and after several years I've understood that it is not the most important one. The best SREs I know are not great when it comes to a huge incident. But their work has avoided the other 99 outages that could have appeared on the front page of Hacker News.

[0]: https://news.ycombinator.com/item?id=35734224

rickette 3 years ago |

Anyone know how this could affect multiple zones? "Customers can failover to zones in other regions". Unless a whole area got flooded.

asymptotic 3 years ago |

When I worked at AWS there was a similar scenario in eu-west-2. There was a fire in one of the availability zones (AZs). The fire suppression system kicked in and flooded the data center up to ankle or knee height. All the racks were powered off and the building was evacuated for hours (I don't remember the duration of the evacuation) until the water was pumped out.

But for the service team I worked for, our AZ-evacuation story wasn't great at the time and it took us tens of minutes to manually move out of the AZ, but at least there wasn't a customer-visible availability impact. Once we did it was just monitoring and baby-sitting until we got the word to move back in, I think it was 1-2 days later.

If you operate on AWS you work with the assumption that an AZ is a failure domain, and can die at any time. Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time). But if you operate services in the cloud you have to know what the failure domain is.
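
One way to make "an AZ is a failure domain" checkable is to ask whether a deployment still has enough replicas after losing any single zone. The sketch below is illustrative only: the function name, the replica names, and the zone labels are all hypothetical, and it models placement as a plain dictionary rather than calling any cloud API.

```python
from collections import Counter


def survives_single_zone_loss(placement: dict[str, str], min_replicas: int) -> bool:
    """Return True if at least min_replicas remain after losing any one zone.

    placement maps a replica name to the zone it runs in.
    """
    per_zone = Counter(placement.values())
    total = len(placement)
    if not per_zone:
        # Nothing is deployed, so only a requirement of zero replicas is met.
        return min_replicas <= 0
    # For each zone, count the replicas that would be left if it disappeared.
    return all(total - count >= min_replicas for count in per_zone.values())


if __name__ == "__main__":
    spread = {"r1": "zone-a", "r2": "zone-b", "r3": "zone-c"}
    lopsided = {"r1": "zone-a", "r2": "zone-a", "r3": "zone-b"}
    print(survives_single_zone_loss(spread, 2))
    print(survives_single_zone_loss(lopsided, 2))
```

The lopsided placement has the same replica count as the spread one, but two of its replicas share a zone, so a single zone failure leaves it below the minimum.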

jamesfinlayson 3 years ago | |

> Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time)

Ouch, hopefully none of the major services? I recently had to look into this for work (for disaster recovery preparation) and it seemed like ECS, Lambda, S3, DynamoDB and Aurora Serverless (and probably CloudWatch and IAM) all said they handled availability zone failures transparently enough.

asymptotic 3 years ago | | |

I’m familiar with Lambda and DynamoDB. When I left in 2022 they both had strong automated or semi-automated AZ evacuation stories.

I’m not that familiar with S3, but I never noticed any concerns with S3 during an AZ outage. I’m not at all familiar with Aurora Serverless or ECS.

For all AWS services you can always ask AWS Support pointed, specific questions about availability. They usually defer to the service team and they’ll give you their perspective.

Also keep in mind that AWS teams differentiate between the availability of their control and data planes. During an AZ outage you may struggle to create/delete resources until an AZ evacuation is completed internally, but already created resources should always meet the public SLA. That’s why, especially for DR, I recommend active-active or active-“pilot light”: have everything created in all AZs/regions, so that your DR plan doesn’t need to create resources.
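
A rough sketch of what "don't create resources in your DR plan" can look like in code, under the assumption that every endpoint was provisioned ahead of time. The Endpoint type, the route function, and the example.invalid URLs are hypothetical. The point is that failover only selects among things that already exist and never calls a create/delete API, so it does not depend on the control plane being available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    zone: str
    url: str
    healthy: bool


def route(endpoints: list[Endpoint], failed_zones: set[str]) -> list[str]:
    """Return URLs of pre-provisioned, healthy endpoints outside failed zones."""
    candidates = [
        e.url for e in endpoints
        if e.healthy and e.zone not in failed_zones
    ]
    if not candidates:
        # Nothing to fall back on: the plan would now need the control plane.
        raise RuntimeError("no pre-provisioned capacity outside the failed zones")
    return candidates


if __name__ == "__main__":
    fleet = [
        Endpoint("zone-a", "https://a.example.invalid", True),
        Endpoint("zone-b", "https://b.example.invalid", True),
        Endpoint("zone-c", "https://c.example.invalid", False),
    ]
    print(route(fleet, failed_zones={"zone-a"}))
```

If the function raises, that is a sign the standby capacity was never created, which is exactly the situation the pilot-light approach is meant to avoid.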

richardw 3 years ago |

I can imagine clients who used one DC being impacted. But Google’s services would be designed for a single DC going down, right? Data would be eventually consistent (once they find and plug the hard drives in), but isn’t this the promise of the cloud, and aren’t they (approximately) the best at using it?

I have to assume it’s a fault that not even distributed services can paper over. Eg lots of crucial data in flight and they’re reluctant to drop it. Can an expert weigh in?

I love Google’s post-mortems. This one will be epic.

lamontcg 3 years ago |

Nobody here with any thoughts for the operations/datacenter engineers trying to deal with stopping and cleaning up the disaster, just customers complaining...

rurp 3 years ago | |

"Thoughts and Prayers" type comments don't make for particularly interesting reading.

I think it's safe to assume that most people feel empathy for others struggling, whether or not they type it out regularly. Then again, some AI evangelists have had me questioning that assumption lately.

BlackjackCF 3 years ago | |

I think people who are complaining are stressed out about their own services being down.

If you’ve only ever used the cloud, you’re not necessarily aware of everything that’s involved at data centers. If you’re not familiar with them, I don’t think you’d know how many things can (literally) blow up in your face. If someone sees flooding, they generally aren’t thinking that it’ll lead to fires.

Anyway, I just want to think that everyone generally has good intentions and just doesn’t know what’s ACTUALLY happening in the DC, or how much work it will be for the folks working in the DC to restore services.

Hopefully all the failsafes kicked in and worked and nobody was injured.

okdood64 3 years ago | |

Nope. Big company bad. <Insert snarky overreactionary comment based on armchair knowledge>

Literally no concern here for anyone's safety or sanity in dealing with this.

effdee 3 years ago |

This post (in French) has some more details:

https://www.mail-archive.com/frnog@frnog.org/msg72320.html

Jgrubb 3 years ago |

europe-west9 is Paris

sgt 3 years ago | |

Apparently the servers were told they were expected to delay their retirement a bit.

antifa 3 years ago | |

Thanks for posting which one, this is the most important detail and should have been in the title...

manojr13 3 years ago |

Let the servers cool down for some time. They might have been working very hard.

fnordpiglet 3 years ago |

Last time they let Bard pick a data center location and design.

lukax 3 years ago |

Google now has a "data lake" in Paris.

nyc_data_geek1 3 years ago | |

This is not what I meant by digital ocean

walrus01 3 years ago | | |

This is what happens when your cloud condenses in the water/vapor cycle and returns to liquid form temporarily.

krisoft 3 years ago | |

I don’t see the problem. Clouds are just water droplets anyway.

Joking aside, I hope we will get a nice postmortem with juicy civil engineering details.

IntelMiner 3 years ago | |

Google's DC is underwater

OVH's caught fire

What's next, us-east-1 gets hit by Godzilla?

cgb223 3 years ago | | |

Lol us-east-1 already went down for a day back in 2017 when an intern accidentally took down the whole DC. We could call him “Godzilla”

Source: my startup (stupidly) hosted our entire infra in us-east-1 at the time. Was a …tough day

firstSpeaker 3 years ago | | |

Some of the regions are so critical for AWS that them going down will bring down most of the control plane :P

glogla 3 years ago | | |

Water and fire already had their way, I suspect a tornado and a landslide are next.

1123581321 3 years ago | | |

Earthquakes and wind damage should be next.

syngrog66 3 years ago | | |

If us-east-1 is not in Tokyo, it's safe.

nixcraft 3 years ago | |

I hope whoever is hosting data in that zone has thoroughly tested and verified backups offline or with another cloud provider. Of course, you can do a complete DC failover, depending upon service needs, but it costs more resources. Either way, timely tested backups are the only way to survive natural or manufactured disasters. Good luck to Google's ops team and everyone else involved with the GCP region in the EU.

Eji1700 3 years ago | |

I'm down for coding to go full circle.

We called them bugs because you literally had to go in and get the dead bugs out of your electrical system.

Now we can call it fishing because some pirate has sailed onto your datalake and is looking for sunken hashes.

What do you think the hourly for Cloud Architect/Data lake power boy level 1 should start at?

kortex 3 years ago |

A perfect example of when "cloud just means someone else's computers". It's literally a leaky abstraction.

doubled112 3 years ago | |

Sure is leaky. And a cloud is a bunch of water vapour that eventually comes crashing down to earth. I'll never understand how we decided it was a good metaphor for a place we run our services.

Startlingly accurate in this case.

dijit 3 years ago | | |

“the cloud” comes from old network diagrams that used a cloud to mean “internet” or “unknown network”.

I think “unknown network” definitely accurately captures what hyperscalers are selling. :)

88913527 3 years ago | | |

After spending 5 minutes engaging with Product Managers, I am not at all surprised they landed on calling it 'the Cloud'.

paulmd 3 years ago | | |

https://www.youtube.com/watch?v=AnxrJiS5uKU

benatkin 3 years ago | |

1) Pay for stuff

2) Not be able to use it

3) Company continues to pretend this doesn't happen on the regular

worldsavior 3 years ago | | |

> Company continues to pretend this doesn't happen on the regular

What do you want them to say? "Hey we have X breakdowns but please, pay!"

throwaway2729 3 years ago | |

google cloud poised for precipitous fall

kotaKat 3 years ago |

Ah, a rainy day in the cloud.

kotaKat 3 years ago | |

"Please don't fulminate. Please don't sneer, including at the rest of the community." @dang, since this got hidden and I can't reply to the toppost.

We're allowed a little humor, damnit.

redindian75 3 years ago |

This is the problem with storing data in the cloud - whenever it rains you may have a big data problem.

burnt_toast 3 years ago | |

It's okay because once the water evaporates it's backup in the cloud.

timack 3 years ago | | |

Really? Are you cirrus?

CobrastanJorji 3 years ago |

A series of tubes would have helped with this.

aruggirello 3 years ago | |

Unfortunately, it appears Google Plumber was discontinued by Alphabet Inc.

iJohnDoe 3 years ago | | |

The plumber was laid off.

geocrasher 3 years ago | |

Thank you for your input, Senator Stevens.

moffkalast 3 years ago | |

Or a big truck to bring some tarps

throwawaaarrgh 3 years ago |

I know I said our pipeline abstraction was leaky but this is ridiculous

faangiq 3 years ago |

But I thought Google only hired geniuses?

iJohnDoe 3 years ago | |

Yes, and we often hear genius insights and anecdotes from their SRE employees who have their blog links posted to HN.