Google Cloud Europe service disruption

Google Cloud Europe service disruption (status.cloud.google.com)

216 points by eodafbaloo 3 years ago | 139 comments

Title is incorrect, this is not a general outage. There are two separate issues:

europe-west9 (Paris) has been physically flooded with water somehow and is hard down. This is obviously bad if you're using the region in question, but has zero impact elsewhere. https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...

There is a separate issue stopping changes to HTTP load balancers across most of GCP, but it has no impact on serving and they're rolling out a fix already. https://status.cloud.google.com/incidents/uSjFxRvKBheLA4Zr5q...

tinco 3 years ago | |

So this is probably too soon, thoughts and prayers for the datacenter operators and staff out there, but are they going to auction off the flooded hardware? Trying to restore a flooded Google rack sounds like a super fun project.

Anyone have experience with losing an entire DC to flooding?

edit: I just Googled it (lol) and this DC has to be brand spanking new (https://cloud.google.com/blog/products/infrastructure/google...), apparently they just opened it last June. Google must be livid with the contractors who built the place for it to get flooded so soon.

vivegi 3 years ago | | |

2015 Chennai (South India) Floods. It was the flood of a century. [1]

Our DC was intact, but the building and access were cut off. We lost the backup diesel power generators in the flooding. Of course, grid power was cut off.

Our DC operating team managed to shut down all the servers and racks cleanly before UPS power was completely drained. The 4 engineers and 2 security guards then swam out of the compound in chest-high water. (I am not kidding.)

When the rains subsided and the flood waters receded after a couple of days, we had to plan the restart. The facility still had to be certified by health and safety, but we needed to get the datacenter back up.

A secondary operations site that would remote-connect to the DC was brought up in 1 week, since we estimated that the rains could continue for a few more days and cause interruptions. But the critical item for the plan to work was getting a new backup power setup. We rolled in a truck-mounted diesel generator, positioned it at the highest point on the campus (also close to our building tower that housed the DC) and ran power cables to it (we had to source this, and it was a challenge given the time crunch and the rains).

We moved staff to other cities by bus (the airport was shut down) as part of our recovery plan, but we still needed connectivity to our DC for some of the critical processes.

Long story short, it worked.

I'll never forget the experience and the scars from this war story.

[1]: https://en.wikipedia.org/wiki/2015_South_India_floods

numbsafari 3 years ago | | |

I once was a customer of a DC whose roof drainage was clogged, turning it into a lake after a couple of rain storms. It then proceeded to rain inside the DC as the roof started to leak from all the pressure.

"Servers are down, I'll head over to the DC" turned into "Um... it's raining _in the DC_. Get me some tarps and get us cut over to the backup in the office".

Ah, the glory days of running out of a single co-lo across the parking lot with our "backup site" being a former broom closet.

milesward 3 years ago | | |

The machines are not industry standard stuff, and they don't auction, they destroy for customer security. See here: https://www.datacenterknowledge.com/google-alphabet/robots-n...

pjc50 3 years ago | | |

I'm not sure what the disk encryption story is in Google Cloud but I'd rather it didn't end up on eBay. Mind you, "flooded" covers a wide range of possibilities and a surprisingly small amount of water ingress would trip a breaker while leaving the racks in good order.

twistedpair 3 years ago | | |

Better than when Planet's DC actually exploded [1].

Restoration is hard when health and safety are in question. Good luck to these ops folks <3

[1] https://www.datacenterknowledge.com/archives/2008/06/01/expl...

Pr0Ger 3 years ago | | |

A long time ago, one server room of SPB-IX (located in the basement of the university building) was flooded. It was a fun day for the engineers, who unplugged the surviving equipment while standing knee-deep in water.

It was before the dam [1] was built, when floods were a huge problem in SPB.

[1]: https://en.wikipedia.org/wiki/Saint_Petersburg_Dam

wkat4242 3 years ago | | |

Umm thoughts and prayers? It's not as if their house is being washed away :) They just have a busy day at work. Keeps things exciting :P

verdverm 3 years ago | | |

I doubt they would let anyone have access to their hardware. There is a ton of proprietary stuff in there

MuffinFlavored 3 years ago | | |

> but are they going to auction off the flooded hardware?

I wonder how many inches/feet we're talking here? The hardware on the top (unless it experienced an electrical short) is most likely fine?

bushbaba 3 years ago | | |

Likely not. It’s also not Google’s first DC flood/water intrusion causing a GCP incident.

bsdz 3 years ago | |

I'm not sure if it's a separate issue, but I've had trouble creating new VM instances in Google Cloud Console or listing GPU types using their CLI, and I'm in europe-west2. The ticket I was following originally got merged with the Paris flood ticket (by Google). It was working until midnight (London) last night, went down before 8am, and recovered about 1h ago for me. Not sure why an outage at one regional data centre can affect services elsewhere. Perhaps it happens when metadata from different data centers is pooled together for listing options?

aristus 3 years ago | | |

Also, consider everyone either automatically or manually trying to make up for the lost capacity in eu.

AlfeG 3 years ago | | |

Every customer in the affected region is trying to restore data/compute in other regions. It's a well-known and expected issue when a region is lost.

martius 3 years ago | | |

Cloud Console is having issues related to the outage in europe-west9

> Customer using Cloud Console globally are unable to open and view the Compute Engine related pages like: Instance creation page Disk creation page Instance templates page Instance Groups page

https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...

mostlystatic 3 years ago | | |

Same – was unable to create new VMs in all regions between 7:15am and 11:41am UK time. Not limited to France.

londons_explore 3 years ago | |

> There is a separate issue stopping changes to HTTP load balancers across most of GCP

Is it me, or has Google had issues with pushing changes to load balancers pretty much every few months for the past decade? Even before GCP launched, people here on HN sometimes said an outage was extended because load balancer configs couldn't be changed.

Have they not considered just redesigning their config push mechanism...

derefr 3 years ago | | |

My impression, from reading the docs around Google's "premium-tier network routing" (and just from the "feeling" of deploying GCLB updates), is that when you're configuring "a" Google Cloud Load Balancer, you're actually configuring "the" Google Cloud Load Balancer. I.e., your per-tenant virtual LB config resources get baked down, along with every other tenant's virtual LB config resources, to form a single real config file across all of GCP (maybe all of Google?), which then gets deployed not only to all of Google's real border network switches, across all their data centers, but also to all their edge network switches, in every backbone transit hub they have a POP in.

(Why not just the switches for the DC(s) your VPC is in? Because GCLB IP addresses are anycast addresses, with BGP peers routing them to their nearest Google POP, at which point Google's own backhaul — that's the "premium-tier networking" — takes over delivering your packets to the correct DC. Doing this requires all of Google's POP edge switches to know that a given GCLB-netblock IP address is currently claimed by "a project in DC X", in order to forward the anycast packets there.)
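The lookup a POP edge switch would need for this can be sketched as a small prefix table. This is only an illustration of the idea, not Google's implementation: the netblocks are documentation ranges, the data center names are hypothetical, and longest-prefix matching is my assumption about how overlapping claims would be resolved.

```python
import ipaddress

# Hypothetical table: anycast netblock -> data center currently claiming it.
# The prefixes are RFC 5737 documentation ranges, not real GCLB addresses.
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): "dc-x",
    ipaddress.ip_network("203.0.113.128/25"): "dc-y",
    ipaddress.ip_network("198.51.100.0/24"): "dc-z",
}


def forward_target(dst_ip, routes=ROUTES):
    """Return the data center for dst_ip, or None if no netblock claims it.

    Uses longest-prefix match (an assumption): the most specific
    matching netblock wins.
    """
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]
```

Every POP needs an up-to-date copy of a table like `ROUTES` for the forwarding to be correct, which is why a config push has to reach all of them.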

To ensure consistency between deployed GCLB config versions across this huge distributed system, and to avoid having their switches constantly interrupted by config changes, it would seem to me that at least one, and possibly all four, of the following mechanisms then take place:

1. Some distributed system, probably something ZooKeeper-esque, keeps global GCLB state, receiving virtual GCLB resource updates at each node and consensus-ing with the nodes in other regions to arrive at a new consistent GCLB state. Reaching this new consensus state across a globally-distributed system takes time, and so introduces latency. (But probably very little, because the resources being referenced are all sharded to their own DCs, so the "consensus algorithm" can be one that never has to resolve conflicts, and instead just needs to ensure all nodes have heard all updates from all other nodes.)

2. Even after a consistent global GCLB state is reached, not every one of those new consistent global states gets converted into a network-switch config file and pushed to all the POPs. Instead, some system takes a snapshot every X minutes of the latest consistent state of the global-GCLB-config-state system, and creates and publishes a network-switch config file for that snapshot state. This introduces variable latency. (A famous speedrunning analogy: you can do everything else to remediate your app problems as fast as you like, but your LB config update arrives at a bus stop, and must wait for the next "config snapshot" bus to come. If it just missed the previous bus, it will have to wait around longer for the next one.)

3. Even after the new network-switch config file is published, the switches might receive it, but only "tick over" into a new config file state on some schedule, potentially skipping some config-file states if they're received at a bad time. Or, alternately, the switches might themselves coordinate so that only when all switches have a given config file available, will any of them go ahead and "tick over" into that new config.

4. Finally, there is probably a "distributed latch" to ensure that all POPs have been updated with the config file that contains your updates, before the Google Cloud control plane will tell you that your update has been applied.
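The "bus stop" effect in mechanism 2 can be put into numbers with a toy model. This is a sketch under stated assumptions, not a description of Google's pipeline: snapshots are assumed to fire at fixed multiples of a hypothetical `snapshot_interval_s`, and everything else (consensus, distribution, latching) is lumped into a hypothetical `fixed_overhead_s`.

```python
import math


def publish_delay(submit_time_s, snapshot_interval_s):
    """Seconds an update waits for the next snapshot "bus".

    Assumes snapshots fire at t = 0, interval, 2 * interval, ...
    An update arriving exactly on a boundary catches that snapshot.
    """
    if snapshot_interval_s <= 0:
        raise ValueError("snapshot interval must be positive")
    buses_elapsed = math.ceil(submit_time_s / snapshot_interval_s)
    next_snapshot_s = buses_elapsed * snapshot_interval_s
    return next_snapshot_s - submit_time_s


def total_rollout_latency(submit_time_s, snapshot_interval_s, fixed_overhead_s):
    """Constant overhead plus the variable wait for the next snapshot."""
    return fixed_overhead_s + publish_delay(submit_time_s, snapshot_interval_s)
```

With made-up values of a 600 s interval and 420 s of overhead, an update submitted one second after a snapshot waits 599 s plus the overhead (about 17 minutes), while one submitted exactly on a boundary waits only the 7-minute overhead. Those values are illustrative, not measured.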

No matter which of these factors is at fault, it's a painfully long time. I've never seen a GKE GCLB Ingress resource take less than 7 minutes to acquire an IP address; sometimes, it takes as much as 17 minutes!

And while there's definitely some constant component to the time that this config rollout takes, there's also a huge variable component to it. At least one of #2, #3, or #4 must be happening; possibly multiple of them.

---

You might ask why load-balancer changes in AWS don't suffer from this same problem. AWS doesn't have nearly as complex a problem to solve, since AFAIK their ALBs don't give out anycast IPs, just regular unicast IPs that require the packets be delivered to the AWS DC over the public Internet. (Though, on the other hand, AWS CDN changes do take minutes to roll out: CloudFront at least seems to use a distributed version latch for rollouts, and might be doing some of the other steps above as well.)

You might ask why routing changes in Cloudflare don't suffer from this same problem. I don't know! But I know that they don't give their tenants individual anycast IP addresses, instead assigning tenants to 2-to-3 of N anycast "hub" addresses they statically maintain; and then, rather than routing packets arriving at those addresses based purely on the IP, they have to do L4 (TLS SNI) or L7 (HTTP Host header) routing. Presumably, doing that demands "smart" switches; which can then be arbitrarily programmed to do dynamic stuff — like keeping routing rules in an in-memory read-through cache with TTLs, rather than depending on an external system to push new routing tables to them.
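The "in-memory read-through cache with TTLs" idea can be sketched in a few lines. This is a generic illustration, not Cloudflare's code: `loader` stands in for whatever authoritative lookup maps a hostname to a routing rule, and the `clock` parameter exists only so the behaviour can be tested without sleeping.

```python
import time


class ReadThroughCache:
    """Cache that fetches from `loader` on a miss or after the TTL expires."""

    def __init__(self, loader, ttl_s, clock=time.monotonic):
        self._loader = loader      # hypothetical authoritative lookup
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # still fresh: serve from memory
        value = self._loader(key)  # miss or stale: read through
        self._entries[key] = (value, now + self._ttl_s)
        return value
```

The trade-off is that a rule change becomes visible at each switch only after its cached entry expires, so staleness is bounded by the TTL rather than by a global config push.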

gbajson 3 years ago | |

"europe-west-9 (Paris) has been physically flooded [...], but has zero impact elsewhere."

I am afraid this is not true. We have nothing in europe-west9, but the problem in this region caused a global problem with Cloud Console, which hit us because we were unable to use it for several hours.

Snippet from https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...:

"Cloud Console: Experienced a global outage, which has been mitigated. Management tasks should be operational again for operations outside the affected region (europe-west9). Primary impact was observed from 2023-04-25 23:15:30 PDT to 2023-04-26 03:38:40 PDT."
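For scale, the impact window in that snippet works out to a little under four and a half hours. A quick check, treating PDT as a fixed UTC-7 offset (an assumption that holds for these dates):

```python
from datetime import datetime, timedelta, timezone

# PDT modelled as a fixed offset; fine for two timestamps a few hours apart.
PDT = timezone(timedelta(hours=-7), "PDT")

start = datetime(2023, 4, 25, 23, 15, 30, tzinfo=PDT)
end = datetime(2023, 4, 26, 3, 38, 40, tzinfo=PDT)

impact = end - start
print(impact)  # 4:23:10
```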

terom 3 years ago | |

Per [1], there was a related issue affecting Cloud Console operations globally, starting from the point where the incident went regional at 23:00 PDT, and lasting until 02:00 PDT-ish. It is incorrect to say that this had zero impact elsewhere.

Sounds like some global control plane related to instance management operations started returning errors once one region failed. Or perhaps it was just the UI frontend?

[1] https://status.cloud.google.com/incidents/BWK7QzFBmfaZ4iztke...

yla92 3 years ago | |

For some reason that might be related to the 2nd issue, even though it says resolved, I am still seeing network errors in GKE nodes located in Singapore (asia-southeast1)

  Warning  FailedToCreateRoute      4m59s                  route_controller  Could not create route fc61a148-b428-43fa-xxxx-xxxx 10.28.167.0/24 for node gke-xxx-xxx after 16.320065487s: googleapi: Error 503: INTERNAL_ERROR - Internal error. Please try again or contact Google Support.

Anyone facing something similar?
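The error text says "Please try again". The route controller presumably retries on its own, but for client code that calls a cloud API directly, a generic retry with exponential backoff and jitter might look like the sketch below. `TransientError` and `call_with_backoff` are hypothetical names, not part of any Google client library, and the delay values are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 503."""


def call_with_backoff(op, max_attempts=5, base_delay_s=1.0, max_delay_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or max_attempts is reached.

    Waits up to base_delay_s * 2**attempt (capped at max_delay_s) between
    attempts, scaled by a random factor so clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay * rng())
```

Jitter matters most in exactly this kind of incident, when many tenants are retrying against the same struggling control plane.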

theolivenbaum 3 years ago | |

Wait it's not DNS for a change?

mananaysiempre 3 years ago | | |

What’s more obscure and less tested than figurative plumbing? Literal plumbing!

m4jor 3 years ago | | |

It's still DNS

Droplets Nuking Servers

stall84 3 years ago | |

This isn't one of the undersea data centers that (at least) Microsoft had been building in the Atlantic, right? (Being under the ocean helps with cooling, obviously.)

compumike 3 years ago | |

Wow, “physically flooded with water somehow” and “load balancers” config propagation issue are so drastically different!

Good reminder that downtime happens for many wild reasons, and you may want to take 30 seconds and set up a free website / API monitor with Heii On-Call [1] because we would have alerted you to either of these issues if they affected your app.

Really, a simple HTTP probe provides tremendous monitoring power. I already was telling people that it covered issues at the DNS, TCP, SSL certificate, load balancer, framework, and application layers. Now I will have to add “datacenter flood” as well :P
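A minimal sketch of such a probe, using only the Python standard library. It is not how Heii On-Call works internally; the `probe` function, its return shape, and the injectable `opener` parameter (there so it can be tested without a network) are all my own choices.

```python
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Fetch url once and return (ok, detail).

    A single request exercises DNS, TCP, TLS, the load balancer and the
    application, so a failure at any of those layers shows up here.
    """
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, TLS error, timeout, ...
        return False, f"unreachable: {exc}"
    return 200 <= status < 300, f"HTTP {status}"
```

Run it from somewhere outside the infrastructure being monitored; a probe hosted in the same region as the app goes down with it.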

Best wishes to everyone working on europe-west9.

[1] https://heiioncall.com/ (I recently helped build our HTTP probe background infrastructure in Crystal)

antonvs 3 years ago | | |

We just use a simple cloud function for that.

xrayarx 3 years ago |

Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region

https://twitter.com/GCP_Incidents

aeyes 3 years ago | |

But europe-west9-a is only one zone, why does the whole region fall over as a consequence?

bushbaba 3 years ago | | |

GCP has multiple zones in the same physical building. Not all cloud providers have distinct physical buildings for each Availability Zone.

de6u99er 3 years ago | | |

AFAIK all zones (a, b, and c) have been reported to be down. I'd love to understand what happened.

throwoutway 3 years ago | | |

Probably some dependencies they did not plan for

belter 3 years ago | |

How large is the flood? How far away are these zones?

wut42 3 years ago | | |

It happened at GlobalSwitch Clichy, near Paris. From what I gathered from a French forum[1], it started with a flood and then a fire. No rooms have been affected, apparently.

[1]: https://lafibre.info/datacenter/incendie-maitrise-globalswit...

Takennickname 3 years ago |

I thought this title meant cancelled. I literally felt the blood leave my face.

cal85 3 years ago | |

I thought it meant “now available” and was surprised it wasn’t already.

iainmerrick 3 years ago | | |

Yes, it would be less confusing to say “down”.

tvanantwerp 3 years ago | |

This was my first thought too. Shows how Google has trained us to expect the worst from them...

neeleshs 3 years ago | | |

GCP doesn't operate the same way as Google consumer products. We have been a paying customer for over 5 years and I also have only good things to say about GCP and their support.

davidkuennen 3 years ago | | |

TBH, GCP isn't operating like Google does.

I'm a long-time customer and have only good things to say so far.

nonethewiser 3 years ago | |

I read it that way too. Not sure why. Maybe the difference between out and down.

CydeWeys 3 years ago | |

Doing so the day after bragging about finally achieving profitability on the earnings call would be A Move for sure.

LadyCailin 3 years ago | |

I did too, and was not at all surprised.

CodeCompost 3 years ago | |

I thought "out, out where?" thinking it put on a hat and went outside.

antonvs 3 years ago | |

It was probably striking for better working conditions so Google terminated it

astrange 3 years ago | | |

A memorable observation from idlewords is that Googlers will organize for better conditions for Googlers but they never organize for better conditions for their users.

testemailfordg2 3 years ago |

Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region. There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage. Customers are advised to failover to other regions if they are impacted.

nielsole 3 years ago |

> Water intrusion in europe-west9-a

> We expect general unavailability of the europe-west9 region.

Why would emergency shutdown of a single AZ lead to general unavailability of a region? Isn't that the point of multiple AZs?

> There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage

yikes

numbsafari 3 years ago | |

From other comments here, it sounds like multiple zones in that region are located in the same datacenter?

If so, that's ... not good.

hf_twink 3 years ago | | |

that’s how GCP does zones, firewalled off with separate networks/power in the same physical location.

femtozer 3 years ago |

It seems that there is a second issue due to a fire in a GlobalSwitch datacenter where Google hosts edge cache locations (article in French):

https://dcmag.fr/breve-un-depart-dincendie-dans-un-batiment-...

base 3 years ago |

Service Health Status: https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...

KomoD 3 years ago |

Terrible title, can it be changed?

drag0s 3 years ago |

Looks like it is/was not only Europe. Today we had issues in the US too, and some other regions are still affected as well (us-east1, asia-northeast1, asia-south1 & asia-southeast2).

eodafbaloo 3 years ago |

It seems that it went back to normal except for Paris

skrebbel 3 years ago | |

I love how well google cloud status reflects real life.

rippercushions 3 years ago | | |

At least Paris is red. When's the last time you saw more than a green dot with a little exclamation mark on the AWS status page?

sgt 3 years ago | |

Is that so? All of them are showing a network alert.

p_l 3 years ago | | |

There's a network alert on the whole of EU because one of the regions in EU is out.

A case of "need to drill down"

xrayarx 3 years ago |

From what can be seen on the Status page, more than just Europe seems to have a problem.

jostein 3 years ago | |

I think their status page briefly showed more regions affected, but I have not noticed any problems in europe-north1 or europe-west1 where I have systems running.

perfmode 3 years ago |

the flooding meant to provide cover for intelligence services to infiltrate?

have i been watching too much espionage media?

dannyw 3 years ago | |

Google is an intelligence service, more or less.