Google Kubernetes Engine's third consecutive day of service disruption (status.cloud.google.com)

779 points by rlancer 7 years ago | 407 comments

I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.

manigandham 7 years ago | |

When everything works, GCP is the best. Stable, fast, simple, reliable.

When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.

They already have the tools and access so most issues should take minutes for them to gather diagnostics, but instead they keep sending tickets back for "more info", inevitably followed by a hand-off to another team in a different time zone. We have spent days trying to convince them there was an issue before, which just seems unacceptable.

I can understand support costs but there should be a test (with all vendors) where I can officially certify that I know what I'm talking about and don't need to go through the "prove it's actually a problem" phase every time.

laurencei 7 years ago | | |

As someone who works for Government and Enterprise - all I care about sometimes is how a company behaves when everything goes wrong.

The issue with outages for the Government organizations I have dealt with is rarely the outage itself - but strong communication about what is occurring and realistic approximate ETAs, or options around mitigation.

Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.

If all I could say was a "support ticket" was logged, and we are waiting on a reply (hours later) - I guarantee the conversation after the outage is going to be about moving to another solution provider with strong SLAs.

Osiris 7 years ago | | |

"Support costs" calculation often doesn't include the costs of not having support.

When I worked at GoDaddy, around 2/3 of the company was customer support.

At the current company I'm at, a cryptocurrency exchange, our support agents frequently hear that customers prefer our service over others because of our fast support response times (crypto exchanges are notorious for really poor support).

All of my interactions with Amazon support have been resolved to my satisfaction within 10 minutes or less.

Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.

Google hasn't learned this lesson.

jgalentine007 7 years ago | | |

With Dell you can certify with them so you can get replacement parts and such without the BS back and forth with some guy in India. Saves everyone time and money.

dvdgsng 7 years ago | | |

> instead they keep sending tickets back for "more info"

Isn't that the case with basically every support request, no matter the company or severity? The first couple of emails from 1st and even 2nd level support are mostly about answering the same questions about the environment over and over again. We've had this ping-pong situation with production outages (which we eventually analysed and worked around by ourselves) and with fairly small issues like requesting more information about an undocumented behavior which didn't even affect us much. No matter how important or urgent the initial issue was, eventually most requests end up being closed unresolved.

softawre 7 years ago | | |

Heh, your "test" reminds me of an old Hanselman article:

https://www.hanselman.com/blog/FizzBinTheTechnicalSupportSec...

ElBarto 7 years ago | | |

To say "when it works it's stable and reliable" implies that it is neither...

dilyevsky 7 years ago | |

We had an issue a few weeks back where all nodes in west1-a could not pull docker images. Google support was pinballing the P1 issue around the globe and across multiple teams for a few days until I root-caused it for them - turned out to be GCE service account issues affecting the entire zone. 2 days to roll back (no status page update). I know nobody gives a fuck but can’t help but feel vindicated as an ex-Google SRE.

icelancer 7 years ago | | |

I think a lot of people give a fuck here; I do, at least. Thanks for outlining it, these things are fascinating (to me anyway, who has never worked in IT/ops).

navinsylvester 7 years ago | |

We have been GCP customers for the last couple of years. We use other cloud platforms (AWS, IBM, Oracle, OrionVM) too. We don't use GKE but use a Rancher/Kubernetes combo on their standard platform.

So far GCP is the best, hands down in terms of stability. We never had a single outage or maintenance downtime notification till now. We are power users but our monitoring didn't pick up any anomaly, so I don't think this issue had a rampant impact on other services.

But I find it concerning that they provided very little update on what went wrong. I also think it's better to expect nil support out of any big cloud provider if you don't have paid support. Funny how all these big cloud providers think you are not eligible for support de facto. Sigh.

rogerkirkness 7 years ago | | |

I agree with this. Compared to AWS, when Google says it's down, it's down, and that's rare. When they say it's up, it's up.

vira28 7 years ago | | |

I use the AWS free tier and get customer support through email, but that's not the case with GCP. Do they provide free email support?

If you are an early-stage startup, can you afford their $200/month support when your entire GCP bill is under $1? However, that doesn't mean you don't have to support them.

ToFab123 7 years ago | | |

I don't understand why someone would choose to deploy anything mission critical without having a support contract with the ISP, the manufacturer of the software, etc.

lilbobbytables 7 years ago | |

You're doing me a scare. I'm in the evaluation phase with them. Maybe I'm missing something here, but this is not at all what the linked post says.

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

So, it's a UI console issue, it appears you can still manage

"Affected customers can use gcloud command [1] in order to create new Node Pools. [1]"

Similarly, it actually was resolved on Friday, but they forgot to mark it as such.

"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."
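The workaround quoted above points at the gcloud CLI. A minimal sketch of what scripting it might look like, assuming the `gcloud container node-pools create` command with its `--cluster`, `--zone`, and `--num-nodes` flags. The pool, cluster, and zone names are placeholders for illustration and do not come from the incident report.

```python
import shlex


def node_pool_create_command(pool, cluster, zone, num_nodes=3):
    """Build the argv for creating a GKE node pool through the gcloud CLI.

    Every value passed in here is a caller-supplied placeholder.
    """
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--zone={zone}",
        f"--num-nodes={num_nodes}",
    ]


# Hypothetical names, used only to show the shape of the command.
argv = node_pool_create_command("pool-2", "my-cluster", "us-central1-a")

# Print the command for review. On a machine with gcloud installed and
# authenticated, subprocess.run(argv, check=True) would execute it.
print(shlex.join(argv))
```

Building the command as a list rather than a single string avoids shell quoting problems if a name ever contains unusual characters.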

shareometry 7 years ago | | |

You are right about the Google blog content itself not indicating three days of outage. Turns out they just forgot to mark that particular issue as resolved on Friday, as you point out. This is my mistake. I would update my comment to reflect this, but it doesn't seem to allow an edit at this point.

The items I put down in my comment are based largely on user reports, though (there isn't much else to go on). And I mean these items as questions (i.e. "is this accurate?"). Folks here on HN have definitely been reporting ongoing problems and seem to be suggesting that they are not resolved and are actually larger in scope than the Google blog post addressed.

Someone from Google commented here a few hours ago indicating Google was looking into it. And other folks here are reporting that they don't have the same problems. So it's kind of an open question what's going on.

I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.

haldora 7 years ago | | |

I've been failing all weekend to create nodes in a GKE cluster through either the UI console or gcloud. Even right now I can't get any nodes to spin up.

Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.

timdumol 7 years ago | | |

We've had no issues deleting and creating node pools this weekend (on asia-east1-a). No other problems noticed either.

fizzledbits 7 years ago | | |

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request" An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

raincom 7 years ago | | |

If you run your own k8s on GCP, you are not going to be affected by GKE.

aviv 7 years ago | | |

I can't comment regarding GKE as we don't use that particular service, however we are very heavy users of many other GCP services, including Compute, Datastore, BigQuery, Pub/Sub, Storage, Functions, Speech, and others. Zero issues this weekend, everything is running 100% as any normal day.

tejohnso 7 years ago | |

> For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement)

What blog statement are you referring to? I don't see any such statement. Can you provide a link?

The OP incident status issue says "We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI". It also says "Affected customers can use gcloud command in order to create new Node Pools."

So it sounds like a web interface problem, not a severely limiting, backend systems problem with global scope.

Also, the report says "The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific". So the whole issue lasted about 10 hours, not three whole days.

> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems

I don't see much of that.

paulddraper 7 years ago | | |

I believe the OP was referring to the very same blog (web log) you cited.

https://status.cloud.google.com/incident/container-engine/18...

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

> So it sounds like a web interface problem, not a severely limiting

Depends who you ask as to whether this is "severely" limiting, but yes there is a workaround by using an alternate interface.

marcinzm 7 years ago | |

Right now we don't know. It's one of two possibilities from what I can tell:

a) Google had a global service disruption that impacted Kubernetes node pool creation and possibly other services since Friday. They had a largely separate issue for a web UI disruption (what this thread links to) which they forgot to close on Friday. They still have not provided any issue tracker for the service disruption and it's possible they only learned about it from this Hacker News thread.

b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.

johnpython 7 years ago | |

This is why GCP has no hope of ever taking significant market share from AWS. Google thinks they can treat their cloud customers like they treat users of their free services. Customer support and communication are essential.

ernsheong 7 years ago | | |

As if something like this has never happened to AWS?

halbritt 7 years ago | | |

I'm not sure about the market share, but I agree with the last two sentences.

...and I'm a happy GCP customer.

rorykoehler 7 years ago | |

I recently removed my hosting from GCP. The pricing is confusing and unbelievable. Their customer service is a joke. I don't trust Google for long-term consistency due to the way they shut down their own apps, but I let that slide as I doubt they will do that with their cloud services. I have experience with AWS (rock solid, world class support but also costly), DigitalOcean (improving fast), Heroku (good for beginners but also expensive and not as full featured as AWS) and finally Hetzner (too early to judge).

meow_mix 7 years ago | |

I think you're missing the portion about how it only appears to be the console ui, no?

ransom1538 7 years ago | |

“2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.”

Ok. So on aws we were* paying for putting systems across regions, but, honestly I don’t get the point. When an entire region is down what I have noticed is that all things are fucked globally on aws. Feel free to pay double - but it seems* if you are paying that much just pay for an additional cloud provider. Looks like it’s the same deal on GCP.

human_error 7 years ago | | |

> When an entire region is down what I have noticed is that all things are fucked globally on aws.

Do you have an example on this?

usmannk 7 years ago |

We had an issue a few weeks ago where the google front-end servers were mangling responses from Pub/Sub and returning 502 responses, making the service completely unusable and knocking over a number of things we have running in production. Despite paying for enterprise support and having in a P1 ticket, we had to spend Friday to Sunday gathering evidence to prove to the support staff that there was indeed a problem, because their monitoring wasn't detecting it. Right now I'm doing something similar (and since Friday!) but for TLS issues they're having. Again, because their support reps don't believe there's a problem. There are so many more problems than they ever show on their status page...
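For anyone stuck in the same "gather evidence" loop, a minimal sketch of one way to summarize client-side observations when the provider's monitoring shows nothing. It assumes you have already collected a list of HTTP status codes from your own client logs. The sample list below is made up for illustration and is not data from this incident.

```python
from collections import Counter


def summarize_status_codes(status_codes):
    """Tally observed HTTP status codes and compute the share of 5xx responses."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    server_errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    return {
        "total": total,
        "by_code": dict(counts),
        "server_error_rate": server_errors / total if total else 0.0,
    }


# Hypothetical sample: eight responses, two of them 502s.
observed = [200, 200, 502, 200, 502, 200, 200, 200]
summary = summarize_status_codes(observed)
print(summary)
```

Attaching a per-code tally with timestamps to a ticket tends to be harder to dismiss than a description of the symptom.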

Jedi72 7 years ago |

"The data says engagement is down 46%, I think it's time we drop the product."

- Someone at Google right now, probably.

justinsb 7 years ago | |

I can assure you that's not the case! Also, while people like to repeat this meme, Google Cloud does have a formal deprecation policy (https://cloud.google.com/terms/), whose intent is to give you some assurances.

(I work at Google, on GKE, though I am not a lawyer and thus don't work on the deprecation policy)

chrisseaton 7 years ago | | |

> Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer

for any reason

at any time

brian-armstrong 7 years ago | | |

What happens when they suddenly deprecate the deprecation policy?

whydoineedthis 7 years ago | | |

I'm pretty sure he just forgot the /s (sarcasm) on his post, but this was pretty cool information anyway, so thanks!

davemp 7 years ago | | |

I think it’s telling of Google’s culture that the corporate arm felt the need to formalize this in law. I won’t pretend to know what it’s telling. Just suggest that you listen for yourself. Look at rule of law versus the ideas of liberty if you’d like a stronger nudge.

justinsb 7 years ago |

Hi - I work at Google on GKE - sorry about the problems you're experiencing. There's a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.

hacknat 7 years ago |

Question to Google employees:

Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature rich cloud (particularly your networking products), but down time like this is unacceptable.

scarface74 7 years ago |

Say I were a CTO (I’m nowhere near it), why would I choose GCP over AWS or Azure? Even if, after doing a technical assessment, I thought that GCP was technically slightly better, if something happened, the first question I would be asked is “why did you choose GCP over AWS?”

No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.

Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for their enterprise support.

From a developer/architect standpoint, I’ve been focused the last year on learning everything I could about AWS and chose a company that fully embraced it. AWS experience is much more marketable than GCP. It’s more popular than Azure too, but there are plenty of MS shops around that are using Azure.

AlexB138 7 years ago |

This has been going on longer than three days. We have been dealing with this exact issue since at least Monday (11/5) morning in us-central1.

splap 7 years ago | |

same here. using gcloud, not web console

fizzledbits 7 years ago | |

same here as well, my us-central1 instance still will not boot

marcinzm 7 years ago |

>Nov 09, 2018 05:59

>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

Wait, did the people tasked with fixing this just take the weekend off?

justinsb 7 years ago | |

The incident with the UI (where we suggested using gcloud temporarily) was opened in https://status.cloud.google.com/incident/container-engine/18..., but then what sure looks to me like the same incident was closed in https://status.cloud.google.com/incident/container-engine/18....

My working assumption is that 18006 should have closed out 18005. But now it sounds like there's a different issue, which we're working to get to the bottom of.

jasonlotito 7 years ago | |

The people tasked with fixing this aren't the ones providing the updates.

INTPenis 7 years ago | | |

Understandable but in my experience the incident manager assigned is still supposed to keep track of progress during weekends when you have a major incident.

And this is likely a major incident with significant customer impact.

The way google is handling all this gives a pretty poor impression. Seems like this kubernetes is just a PoC.

marcinzm 7 years ago | | |

Fair point but still seems odd that the people providing updates took the weekend off during a large scale customer impacting issue. I'm sure all the people spending the weekend trying to mitigate the impact of this on their infrastructure would love to have timely updates.

trhway 7 years ago | | |

They have whole book describing who fixes, provides updates, etc. Fun meditative reading while waiting for the outage to get fixed.

https://landing.google.com/sre/sre-book/chapters/managing-in...

Looks like this time Mary took the whole week off without telling Josephine :)

rlancer 7 years ago |

The status page is inaccurate: the issue doesn't only affect the web UI; the same operations are not functioning via the CLI.

pm90 7 years ago | |

It's kinda strange that HN seems to be the most effective way to give feedback to Google Cloud :/

Draiken 7 years ago | | |

I also find it weird that on HN where normally people are very skeptical of any argument without data backing it, when it comes to this outage, people are assuming everything written here affects everyone.

Perhaps some of the issues are localized? Perhaps it's even user error (it happens, you know?). But because a small amount of HN users say "it's everywhere!" then suddenly people reach for their pitchforks.

Sometimes we just don't have all the information.

breakingcups 7 years ago | | |

Most transparent, at least.

kenan_warren 7 years ago | |

Yeah almost all regions and zones for any compute instance have been exhausted since about 1pm PST on Friday. I finally got one up last night on us-east1, but my older cluster is basically SOL until it's fixed on us-west1. It went down for an upgrade and never came back up because of the same resource issue.

camhutch 7 years ago | |

I just tried turning up my 1-node test cluster via terraform, and it worked fine. I would have thought the gcloud CLI would be using the same API.

I did this in the australia-southeast1-a zone.

base698 7 years ago | |

What operations? Status just shows node pool creation.

rlancer 7 years ago | | |

Cannot create new Clusters or Node Pools and cannot resize existing Node Pools; as far as users are reporting, it's happening in all regions too.

Error message when creating a new Cluster:

Deploy error: Not all instances running in IGM after 35m7.509000994s. Expect 1. Current errors: [ZONE_RESOURCE_POOL_EXHAUSTED]: Instance 'gke-cluster-3-pool-1-41b0abf8-73d7' creation failed: The zone 'projects/url-shortner-218503/zones/us-west2-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later. - ; .
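For anyone collecting these failures from logs, a minimal sketch of pulling the error code, instance name, and zone out of a message shaped like the one above. The regular expression is an assumption based only on this single sample, so other error formats may not match.

```python
import re

# Pattern derived from the one error message quoted in this thread.
ERROR_PATTERN = re.compile(
    r"\[(?P<code>[A-Z_]+)\]: Instance '(?P<instance>[^']+)' creation failed: "
    r"The zone '(?P<zone_path>[^']+)'"
)


def parse_igm_error(message):
    """Return the error code, instance name, and short zone name, or None if no match."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    return {
        "code": match.group("code"),
        "instance": match.group("instance"),
        # The zone arrives as a full resource path; keep only the last segment.
        "zone": match.group("zone_path").rsplit("/", 1)[-1],
    }


SAMPLE = (
    "Current errors: [ZONE_RESOURCE_POOL_EXHAUSTED]: Instance "
    "'gke-cluster-3-pool-1-41b0abf8-73d7' creation failed: The zone "
    "'projects/url-shortner-218503/zones/us-west2-b' does not have enough "
    "resources available to fulfill the request."
)

print(parse_igm_error(SAMPLE))
```

Grouping parsed results by zone and time would show whether the exhaustion is confined to a few zones or is as widespread as people here are reporting.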

scarface74 7 years ago |

A generic question: Our company is completely dependent on AWS. Sure we have taken all of the standard precautions for redundancy, but what happened here could just as easily happen with AWS - a needed resource is down globally.

What would a small business do as a contingency plan?

rlancer 7 years ago |

UPDATE: Got some clarity, these issues are caused by "resource exhaustion" meaning there are no resources left to be allocated.

halbritt 7 years ago | |

I'm curious to see if this is true.

I faced some pretty serious resource allocation issues earlier in the year. The us-west1-a zone was oversubscribed. I was unable to get any real information from support with regard to capacity. Eventually my rep gave me some qualitative information that I was able to act on.

7ewis 7 years ago |

I honestly don't mind if providers have outages - we can't expect 100.00% accuracy, I know the systems I manage certainly don't achieve that.

One thing I do care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.

(I'm not affected by the GKE outage so opinions may differ right now!)

locusm 7 years ago |

Do not use GCP without paying for support. We have had resource allocation errors for weeks, as have a lot of other people. Check out the posts in their forum where folk on basic support get zero love. https://groups.google.com/forum/?utm_medium=email&utm_source...

thwy12321 7 years ago |

Been trying to spin up VM instances all day, had to try every single zone just to get one up. Not only is this incredibly harmful to a technology business dependent on this infra, it wasn't obvious to me what the issue was until I tried creating instances. Nothing says, hey resources are constrained here, try this one. Just about ready to bite the bullet and move to AWS.
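Trying every zone by hand can at least be automated. A minimal sketch of a zone fallback loop follows. `ZoneExhausted` and the `create_instance` callable are hypothetical stand-ins for whatever client and error type you actually use; no real GCP API is called here.

```python
class ZoneExhausted(Exception):
    """Hypothetical error signalling that a zone has no capacity left."""


def create_in_first_available_zone(zones, create_instance):
    """Try each zone in order and return (zone, result) for the first success.

    Raises ZoneExhausted if every zone in the list is out of capacity.
    """
    failed = []
    for zone in zones:
        try:
            return zone, create_instance(zone)
        except ZoneExhausted:
            failed.append(zone)
    raise ZoneExhausted("no capacity in any of: " + ", ".join(failed))


def fake_create(zone):
    """Stub for demonstration: pretend only us-east1-b has capacity."""
    if zone != "us-east1-b":
        raise ZoneExhausted(zone)
    return "instance-in-" + zone


candidate_zones = ["us-central1-a", "us-west2-c", "us-east1-b"]
print(create_in_first_available_zone(candidate_zones, fake_create))
```

This only papers over the problem, of course. It does nothing for workloads pinned to a zone by data locality or existing disks.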

pfd1986 7 years ago | |

Same here. We have spent 2 days trying to create instances and migrate images just to figure out later they can't start.

Right when I convinced our project to get migrated from AWS...

Masiosare 7 years ago | | |

Same question... why would you do that? AWS is super stable most of the time. I have been running k8s over EC2 (not EKS) for a year and it works like a charm. I've even run experiments using spot instances and it's pretty good (no guarantee there of course).

tigershark 7 years ago | | |

Why on earth would you do that, unless you had huge problems on aws?

sladey 7 years ago |

Seems to be some weird underlying issue going on at GCP at the moment. Had cloud build webhooks returning a 500 error. Noticed we were at 255 images and deleting some fixed the issue. Created a P2 ticket about the issue before we managed to solve it and haven't had a response in 40+ hours.

The timeline of this disruption matches when we started experiencing cloud build errors.

lstamour 7 years ago | |

Outsider here, but I believe Cloud Build runs on GKE Jobs, so if they’re having trouble, it does indeed sound related.

ernsheong 7 years ago |

"third consecutive day of service disruption" is not an accurate statement? Latest update was Nov 11 saying things resolved on Nov 9.

https://status.cloud.google.com/incident/container-engine/18...

ernsheong 7 years ago | |

If all nodes in GKE clusters were down for 3 days, I would consider this newsworthy and shocking. This... is not. Come on, people.

013a 7 years ago |

Cloud providers have all of the potential in the world to make each region truly isolated. I shouldn't have to architect my application to be multi-cloud, at least for stability reasons.

Yet, somehow every major cloud provider experiences global outages.

That old AWS S3 outage in us-east-1 was an interesting one; when it went down, many services which rely on S3 also went down in other regions besides us-east-1, because they were using us-east-1 buckets. I have a feeling this is more common than you'd think: globally-redundant services which rely on some single point of geographical failure for some small part.
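That kind of hidden dependency can be hunted for in your own inventory. A minimal sketch, assuming you can describe each service as a region plus a list of things it depends on. The inventory below is invented for illustration.

```python
def cross_region_dependencies(services):
    """List (service, dependency, dependency_region) where the regions differ."""
    region_of = {name: info["region"] for name, info in services.items()}
    findings = []
    for name, info in services.items():
        for dep in info.get("depends_on", []):
            if region_of.get(dep) != info["region"]:
                findings.append((name, dep, region_of.get(dep)))
    return findings


# Hypothetical inventory: a European frontend quietly reading a US bucket.
SERVICES = {
    "assets-bucket": {"region": "us-east-1"},
    "web-eu": {"region": "eu-west-1", "depends_on": ["assets-bucket"]},
    "web-us": {"region": "us-east-1", "depends_on": ["assets-bucket"]},
}

for service, dep, dep_region in cross_region_dependencies(SERVICES):
    print(service, "depends on", dep, "in", dep_region)
```

A dependency missing from the inventory is reported with a region of `None`, which is itself worth investigating.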

threeseed 7 years ago | |

AWS regions are very much isolated from each other.

We know because we are still waiting here in ap-southeast-2 for services such as EKS to be made available. Pretty sure that any reliance within their backend services on us-east-1 was just a temporary bug and nothing systemic.

spiderPig 7 years ago |

Our company is dependent on this as well and the way customer service has been handling this has been abysmal thus far.

qaq 7 years ago |

There is no magic: public clouds have incredibly complex control planes and, marketing fluff aside, you would very likely experience much better uptime at a single top-tier DC than at a cloud provider.

arunoda 7 years ago |

This is not only GKE, but GCE as well. I cannot create instances in almost all zones. I tried both preemptible and normal as well.

Always saying resource not available. My account is a pretty new account.

In contrast, one of my friends has a pretty old account which is very active. He has no such issue.

So I think due to this issue, Google has enabled some resource limitation for new accounts.

But they should properly communicate this issue.

gigatexal 7 years ago |

Oh man, must be a tough time to be an SRE at Google Cloud. But... they’re Google. They have been doing internal cloud for years and years. Borg, which K8s is a reimplementation of, has been the heart of Google for so long now you’d think they’d be able to architect their systems to have no outages whatsoever. I mean nobody is perfect but this looks bad.

Jedi72 7 years ago | |

Goes to show outsourcing infrastructure is more about blame shifting, so that when things go wrong it's "not our fault", than about reducing actual downtime.

closeparen 7 years ago |

Doesn’t GKE “just” run an independent Kubernetes cluster on customer VMs? How is a widespread outage like this possible?

regnerba 7 years ago | |

GKE does the creation of the VMs and setup of them, joining them to the cluster and applying labels for example.

The specific issue appears to be about creating new "node pools". Creating standard VMs in GCP works fine however, so this is specific to GKE and their internal tooling that integrates with the rest of GCP.

GKE doesn't (at least to my knowledge) allow you to create VMs separately and join them to the cluster in any kind of easy fashion.

kenan_warren 7 years ago | | |

It's actually not just GKE, there have been issues creating normal VMs since late Friday night. It seems anything that required creating VMs gave back resource exhaustion errors. I finally got a cluster for us-east1 setup last night so it looks like the resource issues are clearing up though.

raincom 7 years ago | |

Nope, GKE = master/control plane owned by Google. Customers are just tenants, who can schedule workloads.

rlancer 7 years ago | |

GKE gives you a fully managed Master Node.

fizzledbits 7 years ago |

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request"

An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

And yet, the status page says all services are available.

Is this typical of others' experiences?

wijowa 7 years ago |

Right now we're experiencing an issue where a small percentage of end users on our GKE site are getting super slow speeds. The issue is ISP related as they can switch to a 4G hot spot in the same location and get normal speeds... and inside our system the timing looks normal. So there's a slowdown either TO the load balancer or FROM the load balancer. Took a week to convince Google's support contractor to even believe it wasn't an issue with our site and their advice is generally along the lines of Turn it off and Turn it back on again (which might actually fix the problem) though that's easier said than done in GCP.

nielsole 7 years ago |

I use preemptible machines with autoscaling and for the first time did not have any machines available for multiple hours yesterday. I am wondering whether this falls under the normal preemptible behaviour or this service degradation.

wb3tech 7 years ago |

If anyone is interested, here is my documented experience with this issue. I freaking love GCP and GKE, although I have no production environment now, as it was an HA cluster in us-central1. Working on federation now.

https://stackoverflow.com/questions/53244471/gke-cluster-won...

regnerba 7 years ago |

Is this just about creating new pools? I haven't noticed an issue with our existing pools scaling.

rlancer 7 years ago | |

You were able to add more Nodes to your pool? Are you using any auto scaling?

_wmd 7 years ago |

When guerilla marketing backfires

bdibs 7 years ago |

As someone currently trying to decide between GCP and AWS for a project, is this a regular occurrence?

And for those who have used both, which would you go with today?

franky_g 7 years ago |

Had it affected all regions or just some?

Is there another status page, Google? Because the last update I'm looking at... is dated the 9th.

justinsb 7 years ago | |

The general page is at https://status.cloud.google.com/; you can scroll down to see GKE, and my (unofficial) belief is that https://status.cloud.google.com/incident/container-engine/18... should have closed out https://status.cloud.google.com/incident/container-engine/18...

_If_ that's the case, something else is causing the error messages other people are seeing

fulafel 7 years ago |

Offtopic but are there some documented exceptions to the "keep the original title" rule?

whatshisface 7 years ago |

Why do cloud providers have more global outages than major flagship websites like google.com?

betaby 7 years ago | |

They don't run on the same infra. Amazon.com doesn't run on AWS.

talonx 7 years ago | | |

On the contrary, it does. They made the transition gradually.

fergie 7 years ago |

Things break after everybody has gone home on a Friday? 3 day disruption.

thomasfl 7 years ago |

I'd like to upvote, but 666 points seemed relevant.

haosdent 7 years ago |

Time to use Mesos.

shiftnight 7 years ago |

I have a question. At what point does k8s make sense?

I have a feeling that a microservice architecture is overkill for 99% of businesses. You can serve a lot of customers on a single node with the hardware available today. Often times, sharding on customers is rather trivial as well.

Monolith for the win! Opinions?
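To make the "sharding on customers is rather trivial" point concrete, a minimal sketch of mapping a customer ID to one of N shards with a stable hash. The function name and the modulo scheme are illustrative assumptions, not a description of any particular system.

```python
import hashlib


def shard_for_customer(customer_id, shard_count):
    """Map a customer ID to a shard index in range(shard_count).

    Uses SHA-256 rather than the built-in hash(), which is randomized per
    process for strings, so the mapping stays the same across restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical customer IDs spread over four shards.
for customer in ["acme", "globex", "initech"]:
    print(customer, "->", shard_for_customer(customer, 4))
```

The catch with plain modulo is that changing the shard count remaps most customers, so a lookup table or consistent hashing is the usual next step once shards need to be added.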

aaaaaaaaaab 7 years ago |

Daily reminder that there's no "cloud", just other people's computers. ( ͡° ͜ʖ ͡°)

spullara 7 years ago |

If a hosting service is down and nobody uses it, is there really any disruption?