Google Kubernetes Engine's third consecutive day of service disruption (status.cloud.google.com)
https://stackoverflow.com/questions/53244471/gke-cluster-won...
And for those who have used both, which would you go with today?
Is there another Google status page? Because the last update I'm looking at is dated the 9th.
_If_ that's the case, something else is causing the error messages other people are seeing
I have a feeling that a microservice architecture is overkill for 99% of businesses. You can serve a lot of customers on a single node with the hardware available today. Oftentimes, sharding on customers is rather trivial as well.
Monolith for the win! Opinions?
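For what "sharding on customers" can look like in practice, here is a minimal sketch. It assumes customers are identified by a string ID and that shards are simply numbered; the function name and shard count are illustrative, not from the comment above.

```python
import hashlib


def shard_for_customer(customer_id: str, num_shards: int) -> int:
    """Map a customer ID to a shard index in the range [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a stable hash: Python's built-in hash() is salted per process for
    # strings, so it would route the same customer differently after a restart.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for customer in ("acme", "globex", "initech"):
        print(customer, "-> shard", shard_for_customer(customer, 4))
```

Plain modulo sharding like this remaps most customers when the shard count changes, so it fits best when the number of shards is fixed up front.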
Things like throwing another node into the cluster, or rolling updates are free, which you would otherwise need to develop yourself. All of that is totally doable, of course, but I like being able to lean on tooling that is not custom, when possible.
When your infrastructure does need to become more complicated, you're already ready for it. Even if I were only serving a single language, starting with a K8s stack makes a lot of sense, to me, from a tooling perspective. Yeah normal VMs might be simpler, conceptually, but I don't consider K8s terribly complicated from a user perspective, when you're staying around the lanes they intend you to stay in. Part of this may also be my having worked with pretty poor ops teams in the past, but I think K8s gives you a really good baseline that gives pretty good defaults about a lot of your infrastructure, without a lot of investment on your part.
That said, if you're managing it on a bare metal server, then VMs may be much easier for you. K8s The Hard Way and similar guides go into how that would work, but managing high availability etcd servers and the like is a bit outside my comfort zone. YMMV.
Most monoliths software companies build aren't actually monoliths, conceptually. Let's say you integrate with the Facebook API to pull some user data. Facebook is, within the conceptual model of your application, a service. Hell, you even have to worry "a little bit" about maintaining it: provisioning and rotating API keys, possibly paying for it, keeping up to date on deprecations, writing code to wire it up, worrying about network faults and uptime... That sounds like a service to me; we're three steps short of a true in-house service, as you don't have to worry about writing its code and actually running it, but conceptually it's strikingly similar.
Facebook is a bad example here. Let's talk Authentication. It's a natural "first demonolithized service" that many companies will reach to build. Auth0, Okta, etc. will sell you a SaaS product, or you can build your own with many freely available libraries. Conceptually they fill the same role in your application.
Let's say you use Postgres. That's pretty much a service in your application. A-ha; that's a cool monolith you've got there, already communicating over a network ain't it. Got a redis cache? Elasticsearch? Nginx proxy? Load balancer? Central logging and monitoring? Uh oh, this isn't really looking like a monolith anymore is it? You wanted it to be a monolith, but you've already got a few networked services. Whoops.
"Service-oriented" isn't first-and-foremost a way of building your application. It's a way of thinking about your architecture. It means things like decoupling, gracefully handling network failures, scaling out instead of up, etc. All of these concepts apply whether you're building a dozen services or you're buying a dozen services.
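To make "gracefully handling network failures" concrete, here is a minimal retry-with-exponential-backoff sketch. The function name, the choice of `ConnectionError` as the retryable failure, and the delay values are all assumptions for illustration; a real client would pick these to match the service it calls.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                # Out of retries: surface the failure to the caller.
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The same wrapper applies whether the dependency is an in-house service or a bought one such as a hosted auth provider, which is the point the comment is making.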
Monolithic architectures are old news because of this recognition; no one builds monoliths anymore. It's arguable if anyone ever did, truly. We all depend on networked services, many that other people provide. The sooner you think in terms of networked services, the sooner your application will be more reliable and offer a superior experience to customers.
And then, it's a natural step to building some in-house. I am staunchly in the camp of "'monolith' first, with the intention of going into services" because it forces you to start thinking about these big networking problems early. You can't avoid it.
Even if you deploy k8s privately, or over at Amazon, I think there's enough horror stories to make you think twice about the technology.
Then, if it isn't going to be k8s for microservices, what's a more reliable alternative?
The key issue here is that k8s was written with very large goals in mind. That a small business can easily spin it up quickly and run a few microservices or even a monolith + some workers is just coincidental. It is NOT the design goal. And the result of that is that a lot of the tooling and writing around k8s reflects that. A lot of the advice around practices like observability and service meshes comes from people who've worked in the top 1% (or less) of companies in terms of computing complexity. What I'm personally seeing is that this advice is starting to trickle down into the mainstream as gospel. Which strangely makes sense. No one else has the ability to preach with such assurance because not many people in small companies have actually been in the scenarios of the big guns. The only problem is that it's gospel without considering context.
So at what point does k8s make sense? Only when you have answers to the following:
* Getting started is easy, maintaining and keeping up with the goings-on is a full time job - Do you have at least 1 engineer that you can spare to work on maintaining k8s as their primary job? It doesn't mean full time. But if they have to drop everything else to go work on k8s and investigate strange I/O performance issues, are you ready to allow that?
* The k8s ecosystem is like the JS framework ecosystem right now - There are no set ways of doing anything. You want to do CI/CD? Should you use helm charts? Helm charts inherited from a chart folder? Or are you fine using the PATCH API/kubectl patch commands to upgrade deployments? Who's going to maintain the pipeline? Who's going to write the custom code for your github deployments or your brigade scripts or your custom in house tool? Who's going to think about securing this stuff and the UX around it? That's just CI/CD mind you. We aren't anywhere close to the weeds about deciding if you want to use ingresses vs Load balancers and how you are going to run into service provider limits on certain resources. Are you ready to have at minimum 1 developer working on this stuff and taking time to talk to the team about it?
* Speaking about the team, k8s and Docker in general are a shift in thinking - This might sound surprising but the fact that Jessie Frazelle (y'all should all follow her btw) is occasionally seen reiterating the point that containers are NOT VMs is a decent indicator that people don't understand k8s or Docker at a conceptual level. When you adopt k8s, you are going to pass that complexity to your developers at some point. Either that or your dev ops team takes on that full complexity and that's a fair amount to abstract away from the developers which will likely increase the work load of devops and/or their team size. Are you prepared for either path?
* Oh also, what do your development environments start to look like? This is partly related to microservices but are you dockerizing your applications to work on the local dev environment? Who's responsible for that transition? As much as one tries to resist it, once you are on k8s you'll want to take advantage of it. Someone will build a small thing as a microservice or a worker that the monolith or other services depend on. How are you going to set that up locally? And again, who's going to help the devs accumulate that knowledge while they are busy trying to build the product. (Please don't put your hopes on devs wanting to learn that after hours. That's just cruel).
I can't write everything else I have in mind on this topic. It'd go on for a long long time. But the common theme here is that the choice around adopting k8s is generally put on a table of technical pros and cons. I'd argue that there's a significant hidden cost of human impact as well. Not all these decisions are upfront but it is the pain that you will adopt and have to decide on at some point.
Again, at what point does k8s make sense? Like I said, you ideally should be paining before you start to consider k8s because for nearly every feature of k8s, there is a well documented, well established, well secured parallel that already exists in the myriad of service providers. It's a matter of taking careful stock of how much upfront pain you are trading away for pain that you WILL accumulate later.
PS - If anyone claims that adopting a newer technology is going to make things outright less painful, that's a good sign of immaturity. I've been there and I picture myself smashing my head into a table every now and then when I think of how immature I used to be. Apologies to people I've worked with at past jobs.
PPS - From the k8s site, "Designed on the same principles that allows Google to run billions of containers a week, Kubernetes can scale without increasing your ops team." <-- is the kind of claim that we need to take flamethrowers to. On paper, 1 dev with the kubectl+kops CLI can scale services to run with 1000's of nodes and millions of containers. But realistically, you don't get there without having incurred significantly more complex use cases. So no, nothing scales independently.
Given how both the JS and devops worlds seem to be progressing, is there any reason to believe that this will change before the next thing comes and K8S becomes a ghost town?
Also, migrating to microservices for existing services might not be worth it, especially if you don't operate at a massive scale.
Keep it simple stupid is still a solid design decision, despite all the microservice/container hype.
Most businesses only need a couple of servers that provide the service, spread redundantly with a HA capability.
1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).
2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.
3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.
4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.
5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.
6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.
I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.
When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.
They already have the tools and access so most issues should take minutes for them to gather diagnostics, but instead they keep sending tickets back for "more info", inevitably followed by a hand-off to another team in a different time zone. We have spent days trying to convince them there was an issue before, which just seems unacceptable.
I can understand support costs but there should be a test (with all vendors) where I can officially certify that I know what I'm talking about and don't need to go through the "prove its actually a problem" phase every time.
The issue with outages for the Government organizations I have dealt with is rarely the outage itself - but strong communication about what is occurring and realistic approximate ETAs, or options around mitigation.
Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.
If all I could say was a "support ticket" was logged, and we are waiting on a reply (hours later) - I guarantee the conversation after the outage is going to be about moving to another solution provider with strong SLAs.
When I worked at GoDaddy, around 2/3 of the company was customer support.
At the current company I'm at, a cryptocurrency exchange, our support agents frequently hear they prefer our service over others because of our fast support response times (crypto exchanges are notorious for really poor support).
All of my interactions with Amazon support have been resolved to my satisfaction within 10 minutes or less.
Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.
Google hasn't learned this lesson.
Isn't that the case with basically every support request, no matter the company or severity? The first couple of emails from 1st & even 2nd level support are mostly about answering the same questions about the environment over and over again. We've had this ping-pong situation with production outages (which we eventually analysed and worked around by ourselves) and fairly small issues like requesting more information about an undocumented behavior which didn't even affect us much. No matter how important or urgent the initial issue was, eventually most requests end up being closed unresolved.
https://www.hanselman.com/blog/FizzBinTheTechnicalSupportSec...
So far GCP is the best, hands down in terms of stability. We never had a single outage or maintenance downtime notification till now. We are power users but our monitoring didn't pick up any anomaly, so I don't think this issue had rampant impact on other services.
But I find it concerning that they provided very little update on what went wrong. I also think it's better to expect nil support out of any big cloud provider if you don't have paid support. Funny how all these big cloud providers think you are not eligible for support de-facto. Sigh.
If you are an early stage startup, can you afford their 200/Month support when your entire GCP bill is under $1? However, that doesn't mean you don't have to support them.
"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."
So, it's a UI console issue, it appears you can still manage
"Affected customers can use gcloud command [1] in order to create new Node Pools. [1]"
Similarly, it actually was resolved on Friday, but they forgot to mark it as such.
"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."
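The gcloud workaround quoted above might look roughly like the following when scripted. This is a sketch under assumptions: the link behind "[1]" is not reproduced in this thread, so the exact command Google had in mind is a guess at `gcloud container node-pools create`; the pool, cluster, and zone names are placeholders.

```python
import subprocess
from typing import List


def node_pool_create_command(
    pool: str, cluster: str, zone: str, num_nodes: int = 3
) -> List[str]:
    """Build the argv for creating a GKE node pool from the CLI."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--zone={zone}",
        f"--num-nodes={num_nodes}",
    ]


def create_node_pool(pool: str, cluster: str, zone: str, num_nodes: int = 3) -> None:
    """Run the command; requires an installed and authenticated gcloud CLI."""
    cmd = node_pool_create_command(pool, cluster, zone, num_nodes)
    # check=True raises CalledProcessError if gcloud exits non-zero.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder names; print the command instead of running it.
    print(" ".join(node_pool_create_command("pool-2", "my-cluster", "us-central1-a")))
```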
The items I put down in my comment are based largely on user reports, though (there isn't much else to go on). And I mean these items as questions (i.e. "is this accurate?"). Folks here on HN have definitely been reporting ongoing problems and seem to be suggesting that they are not resolved and are actually larger in scope than the Google blog post addressed.
Someone from Google commented here a few hours ago indicating Google was looking into it. And other folks here are reporting that they don't have the same problems. So it's kind of an open question what's going on.
I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.
Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.
What blog statement are you referring to? I don't see any such statement. Can you provide a link?
The OP incident status issue says "We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI". It also says "Affected customers can use gcloud command in order to create new Node Pools."
So it sounds like a web interface problem, not a severely limiting, backend systems problem with global scope.
Also, the report says "The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific". So the whole issue lasted about 10 hours, not three whole days.
> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems
I don't see much of that.
https://status.cloud.google.com/incident/container-engine/18...
"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."
> So it sounds like a web interface problem, not a severely limiting
Depends who you ask as to whether this is "severely" limiting, but yes there is a workaround by using an alternate interface.
a) Google had a global service disruption that impacted Kubernetes node pool creation and possibly other services since Friday. They had a largely separate issue for a web UI disruption (what this thread links to) which they forgot to close on Friday. They still have not provided any issue tracker for the service disruption and it's possible they only learned about it from this Hacker News thread.
b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.
Ok. So on aws we were* paying for putting systems across regions, but, honestly I don’t get the point. When an entire region is down what I have noticed is that all things are fucked globally on aws. Feel free to pay double - but it seems* if you are paying that much just pay for an additional cloud provider. Looks like it’s the same deal on GCP.
Do you have an example on this?
- Someone at Google right now, probably.
(I work at Google, on GKE, though I am not a lawyer and thus don't work on the deprecation policy)
for any reason
at any time
It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.
Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature rich cloud (particularly your networking products), but down time like this is unacceptable.
No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.
Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for their enterprise support.
From a developer/architect standpoint, I’ve been focused the last year on learning everything I could about AWS and chose a company that fully embraced it. AWS experience is much more marketable than GCP. It’s more popular than Azure too, but there are plenty of MS shops around that are using Azure.
>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.
Wait, did the people tasked with fixing this just take the weekend off?
My working assumption is that 18006 should have closed out 18005. But now it sounds like there's a different issue, which we're working to get to the bottom of.
And this is likely a major incident with significant customer impact.
The way google is handling all this gives a pretty poor impression. Seems like this kubernetes is just a PoC.
https://landing.google.com/sre/sre-book/chapters/managing-in...
Looks like this time Mary took the whole week off without telling Josephine :)
Perhaps some of the issues are localized? Perhaps it's even user error (it happens, you know?). But because a small number of HN users say "it's everywhere!", suddenly people reach for their pitchforks.
Sometimes we just don't have all the information.
I did this in the australia-southeast1-a zone.
Error message when creating a new Cluster:
Deploy error: Not all instances running in IGM after 35m7.509000994s. Expect 1. Current errors: [ZONE_RESOURCE_POOL_EXHAUSTED]: Instance 'gke-cluster-3-pool-1-41b0abf8-73d7' creation failed: The zone 'projects/url-shortner-218503/zones/us-west2-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later. - ; .
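The error's own advice ("Try a different zone, or try again later") can be automated. Below is a minimal sketch of zone fallback; `ZoneExhaustedError` and the `create` callable are hypothetical stand-ins for however your tooling reports `ZONE_RESOURCE_POOL_EXHAUSTED`, not part of any Google client library.

```python
from typing import Callable, List, Sequence


class ZoneExhaustedError(Exception):
    """Hypothetical error raised when a zone has no capacity left."""


def create_in_first_available_zone(
    create: Callable[[str], str], zones: Sequence[str]
) -> str:
    """Try each candidate zone in order and return the first successful result."""
    exhausted: List[str] = []
    for zone in zones:
        try:
            return create(zone)
        except ZoneExhaustedError:
            # This zone is out of capacity; remember it and move on.
            exhausted.append(zone)
    raise ZoneExhaustedError(
        "all candidate zones exhausted: " + ", ".join(exhausted)
    )
```

This only helps when the shortage is confined to some zones; if every candidate is exhausted the function still fails, just with a clearer message.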
What would a small business do as a contingency plan?
I faced some pretty serious resource allocation issues earlier in the year. The us-west1-a zone was oversubscribed. I was unable to get any real information from support with regard to capacity. Eventually my rep gave me some qualitative information that I was able to act on.
One thing I do care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.
(I'm not affected by the GKE outage so opinions may differ right now!)
Right when I convinced our project to get migrated from AWS...
The timeline of this disruption matches when we started experiencing cloud build errors.
https://status.cloud.google.com/incident/container-engine/18...
Yet, somehow every major cloud provider experiences global outages.
That old AWS S3 outage in us-east-1 was an interesting one; when it went down, many services which rely on S3 also went down, in other regions besides us-east-1, because they were using us-east-1 buckets. I have a feeling this is more common than you'd think: globally-redundant services which rely on some single point of geographical failure for some small part.
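One way to look for this kind of hidden single point of failure is to walk your own dependency graph and ask what goes down if one region does. A minimal sketch, assuming you can list each service's region and its direct dependencies; the service names in the usage example are made up.

```python
from typing import Dict, List, Set


def affected_by_region_outage(
    failed_region: str,
    service_region: Dict[str, str],
    depends_on: Dict[str, List[str]],
) -> Set[str]:
    """Return every service that is down, directly or transitively,
    when all services hosted in failed_region are unavailable."""
    down = {s for s, region in service_region.items() if region == failed_region}
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            # A service goes down as soon as any of its dependencies is down.
            if service not in down and any(dep in down for dep in deps):
                down.add(service)
                changed = True
    return down


if __name__ == "__main__":
    regions = {"bucket": "us-east-1", "api-eu": "eu-west-1", "web-eu": "eu-west-1"}
    deps = {"api-eu": ["bucket"], "web-eu": ["api-eu"]}
    print(sorted(affected_by_region_outage("us-east-1", regions, deps)))
```

This treats every dependency as hard; services that can degrade gracefully without a dependency would need a softer model.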
We know because we are still waiting here in ap-southeast-2 for services such as EKS to be made available. Pretty sure that any reliance within their backend services on us-east-1 was just a temporary bug and nothing systemic.
Always saying resource not available. My account is a pretty new account.
In contrast, one of my friends has a pretty old account which is very active. He has no such issue.
So I think due to this issue, Google has enabled some resource limitation for new accounts.
But they should properly communicate this issue.
The specific issue appears to be about creating new "node pools". Creating standard VMs in GCP works fine however, so this is specific to GKE and their internal tooling that integrates with the rest of GCP.
GKE doesn't (at least to my knowledge) allow you to create VMs separately and join them to the cluster in any kind of easy fashion.
An instance in us-central1-a has refused to start since last Thursday or Friday.
I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.
On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.
And yet, the status page says all services are available.
Is this typical of others' experiences?
That being said I really do think there is a difference between who is working at google today and the google we all fell in love with pre-2008.
I am sure there are amazing people still working at google, but nowhere near like it was.
The way I like to think about google is that some amazing people made an awesome train that builds tracks in front of it -- you can call them gods maybe -- but those people are gone -- or at least the critical mass required to build such a train has dwindled to just dust. What we have left is an awesome train full of people pulling the many levers left behind.
To make things even worse my last interview as an SRE left me wondering if even the people who are there know this as well, and they are actually working hard to keep out those who might shed light on this. I don't say that because I did not get the job -- I am actually happy I did not get extended an offer.
I say this with one exception, the old-timer who was my last interview. I could tell he was dripping in knowledge and eager to share it with any that would listen. I came out of his 45 min session learning many things -- I would actually pay to work with a guy like that.
I would also like to point out that the work ethic was not what I expected. I was told that when on call, my duty was to figure out whether the root cause was in the segment I was responsible for. I don't know about you, but if my phone rings at night I am going to see it through to a resolution and understand the problem in full -- even if it is not on the segment that I was assigned.
/end rant
During my time on the GCE team (note I don't work at Google now) I knew multiple full-time Google employee support reps, including some still at the company. They have the good attitude and deep knowledge you'd hope for.
The problem is simply about how Google scales their GCP support org. To be completely clear, AWS support is by and large not great either.
If you're a big or strategically important customer, of course, you can get a good response from either company.
Perhaps if you explained it on a whiteboard...
From my personal experience - I think the first two levels of support staff at all big cloud providers are no good if the problem isn't an obvious dumb one on your part. I always prefer to forgo support and try to go through every bit of their documentation to figure it out on our own. This helps save a huge amount of time. But if you have developer support - it can help expedite things a little.
That's my favorite.
As another comment pointed out, what's the point of having so many zones and redundancy around the globe if such global failure can still happen? I thought the "cloud" was supposed to make this kind of failure impossible
I've been creating GCP instances in us-central1-a and us-central1-c today without issue. Which zone were you using in NA?
I have been noticing unusual restarts, but I haven't been able to pin down the cause yet (may be my software and not GCP itself).
You have to remember that you're trying to have access to backend platforms and infrastructure at all times, which almost no public utility does (assuming "the cloud" is "public utility computing"). Power plants go into partial shutdown, water treatment plants stop processing, etc. Utilities are only designed to provide constant reliability for the last mile.
If there's a problem with your power company, they can redirect power from another part of the grid to service customers. But some part of your power company is just... down. Luckily you have no need to operate on all parts of the grid at all times, so you don't notice it's down. But failure will still happen.
Your main concern should be the reliability of the last mile. Getting away from managing infrastructure yourself is the first step in that equation. AppEngine and FaaS should be the only computing resources you use, and only object storage and databases for managing data. This will get you closer to public utility-like computing.
But there's no way to get truly reliable computing today. We would all need to use edge computing, and that means leaning heavily on ISPs and content provider networks. Every cloud computing provider is looking into this right now, but considering who actually owns the last mile, I don't think we're going to see edge computing "take over" for at least a decade.
If set up properly to be utilized correctly, yeah. But, it's not a perfect world though.
People who respond here could be Google employees who care about the issue and respond because they know about it.
What he can mention (a lot of people are working on it) is what you would expect when something is going down. All other cloud providers do the same.
There is a reason why Google has been having a hard time making inroads in the enterprise cloud. Kind of an impedance mismatch between enterprise and the Google style. That roughly 2-story-high "We heart API" sign on the Google Enterprise building facing 237 just screams about it :)
The problem is that running a global SDN like this means if you do something wrong, you can have outages that impact multiple regions simultaneously.
This is why AWS has strict regional isolation and will never create cross-region dependencies (outside of some truly global services like IAM and Route 53 that have sufficient redundancy that they should (hopefully) never go down).
Disclaimer: I work for AWS, but my opinions are my own.
Disclaimer: Google employee in ads, who worked on many many fires throughout the years, but talking from my personal perspective and not for my employer. I am sure we are striving to have 0, but realistically, I have seen many that say things happen. Learn, and improve.
(Disclosure: I worked for Google, including GCP, for a few years ending in 2015. I don't work or speak for them now and have no inside info on this outage.)
Most of what you can read of Google's approach will teach you their ideal computing environment is a single planetary resource, pushing any natural segmentation and partitioning out of view.
It's the opposite really: the expectation that service providers have no unexpected downtime is unrealistic, and it's strange this idea persists.
I agree, in general, outages are almost inevitable, but global outages shouldn't occur. It suggests at least a couple of things:
1) Bad software deployments, without proper validation. A message elsewhere in this post on HN suggests that problems have been occurring for at least 5 days, which makes me think this is the most likely situation. If this is the case, presumably given this is multiple days into the issue, rolling back isn't an option. That doesn't say good things about their testing or deployment stories, and possibly their monitoring of the product? Even if the deployment validation processes failed to catch it, you'd really hope alarming would have caught it.
or:
2) Regions aren't isolated from each other. Cross-region dependencies are bad, for all sorts of obvious reasons.
5. Years.
Nothing to see here, move along.
- Security posture. Project Zero is class leading, and there's absolutely a "fear-based" component there, with the open question of when Project Zero discovers a new exploit, who will they share it with before going public? The upcoming Security Command Center product looks miles ahead of the disparate and poorly integrated solutions AWS or Azure offers.
- Cost. Apples to apples, GCP is cheaper than any other cloud platform. Combine that with easy-to-use models like preemptible instances which can reduce costs further; deploying a similar strategy to AWS takes substantially more engineering effort.
- Class leading software talent. Google is proven to be on the forefront of new CS research, then pivoting that into products that software companies depend on; you can look all the way back to BigQuery, their AI work, or more recently in Spanner or Kubernetes.
- GKE. It's miles ahead of the competition. If you're on Kubernetes and it's not on GKE, then you've got legacy reasons for being where you're at.
Plenty of great reasons. Reliability is just one factor in the equation, and GCP definitely isn't that far behind AWS. We have really short memories as humans, but too soon we seem to forget Azure's global outage just a couple months ago due to a weather issue at one datacenter, or AWS's massive us-east-1 S3 outage caused by a human incorrectly entering a command. Shit happens, and it's alright. As humans, we're all learning, and as long as we learn from this and we get better then that's what matters.
Or you have legitimate reasons for running on your own hardware, e.g. compliance or locality (I work at SAP's internal cloud and we have way more regions than the hyperscalers because our customers want to have their data stay in their own country).
But, whether it is right or not, as an architect/manager, etc, you have to think about what’s not just best technically. You also have to manage your reputational risks if things go south and less selfishly, how quickly can you find someone with the relevant experience.
From a reputation standpoint, even if AWS and GCP have the same reliability, no one will blame you if AWS goes down if you followed best practices. If an AWS resource had a global outage, you’re in the same boat as a ton of other people. If everyone else was up and running fine but you weren’t because you were on the distant third cloud provider, you don’t have as much coverage.
I went out on a limb and chose Hashicorp’s Nomad as the basis of a make or break my job project I was the Dev lead/architect for hoping like hell things didn’t go south and the first thing people were going to ask me is why I chose it. No one had heard of Nomad but I needed a “distributed cron” type system that could run anything and it was on prem. It was the right decision but I took a chance.
From a staffing standpoint, you can throw a brick and hit someone who at least thinks they know something about AWS or Azure. GCP, not so much.
It’s not about which company is technically better, but I didn’t want to ignore your technical arguments...
Native integration with G-Suite as an identity provider. Unified permissions modeling from the IDP, to work apps like email/Drive, to cloud resources, all the way into Kubernetes IAM.
You can also do this with AWS - use a third party identity provider and map its identities to native IAM users and roles.
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_cr...
Cost. Apples to apples, GCP is cheaper than any other cloud platform. Combine that with easy-to-use models like preemptible instances which can reduce costs further; deploying a similar strategy to AWS takes substantially more engineering effort.
The equivalent would be spot instances on AWS.
From what (little) I know about preemptible instances, it seems kind of random when they get reassigned but Google tries to be fair about it. The analogous thing on AWS would be spot instances where you set the amount you want to pay.
Class leading software talent. Google is proven to be on the forefront of new CS research, then pivoting that into products that software companies depend on; you can look all the way back to BigQuery, their AI work, or more recently in Spanner or Kubernetes.
All of the cloud providers have managed Kubernetes.
As far as BigQuery goes, the equivalent would be Redshift.
https://blog.panoply.io/a-full-comparison-of-redshift-and-bi...
Reliability is just one factor in the equation, and GCP definitely isn't that far behind AWS
Things happen. I never made an argument about reliability.
GCP can be a fair bit cheaper than AWS and Azure for certain workloads. Raw compute/memory is about the same. Storage can make a big difference. GCP persistent SSD costs a bit more than AWS GP2 with much better performance and way cheaper than IO2. Local SSD is also way, way cheaper than I2 instances.
Most folks deploying distributed data stores that need guaranteed performance are using local disk, so this can be a really big deal.
However, I could see doing a multicloud solution where I took advantage of the price difference for one project.
The AWS console is wildly inconsistent. I’ll give you that. But any projects I am doing are usually defined by a CloudFormation template, and I can see all of the related resources by looking at the stack that was run.
Theoretically, you could use the stack price estimator, I haven’t tried it though.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...
Going multi region on AWS should be safe enough.
If a multi region, multi service meltdown happens on AWS, it will feel like most of the internet has gone down to a lot of users. Being such a catastrophic failure, I bet the service will be restored pretty fast, not in 3 days.
You could go multi cloud though. But when half of the internet struggles to work correctly, I’d not feel too bad about my small business’ downtime.
Additionally, from a "nobody ever got fired for buying IBM" perspective, you're unlikely to catch much blame from your users for going down when everyone else was down too.
Multi cloud is almost always more pain than gain. You’d spend time and effort abstracting away the value that a cloud provider brings in canned services.
Hell, multi region is often more than many workloads need.
Then start looking at points of failure and sort them based on severity and probability. Is your own software deployment going to generate more downtime per year than a regional aws outage?
There are formal academic ways to determine what your overall availability is, but I don't have those on hand. Suffice to say, it takes significant research, planning, execution, and testing to ensure a target availability. (See Netflix https://medium.com/netflix-techblog/the-netflix-simian-army-... ) If someone says they have 99.9% or better uptime, they had better have proof in my mind (or a fat SLA violation payout).
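As a rough illustration of the availability arithmetic mentioned above, here is a minimal sketch using the usual textbook rules: components that must all be up multiply their availabilities, while redundant replicas are down only when all of them are down. The function names and the example figures are hypothetical, and the redundancy formula assumes failures are independent, which real outages often are not.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Availability of independent replicas where any one is enough."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures: an app at 99.9% that depends on a service at 99.9%.
single_region = serial(0.999, 0.999)
# Two such regions, assumed (optimistically) to fail independently.
two_regions = redundant(single_region, single_region)
```

Plugging in your own deployment failure rate next to the provider's regional figure is one way to answer the question of which one actually dominates your downtime.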
People outsource to cloud providers not because they are cheap, but because managing infra in house is hard. Also move fast and break things.
Read the AWS docs about availability: there are availability zones in a region; spread across those to minimize impact. Then test when something goes down. Fix/repeat.
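To make the "spread across zones, then test" advice concrete, here is a small sketch that places instances round-robin across zones and checks how much capacity survives losing one zone. The zone names and helper names are made up for illustration; a real setup would rely on the provider's own placement features rather than code like this.

```python
from collections import Counter
from itertools import cycle, islice
from typing import List, Sequence


def spread(instance_count: int, zones: Sequence[str]) -> List[str]:
    """Assign instances to zones round-robin so the load is even."""
    return list(islice(cycle(zones), instance_count))


def surviving_fraction(placement: Sequence[str], failed_zone: str) -> float:
    """Fraction of instances still running if one zone goes down."""
    alive = sum(1 for zone in placement if zone != failed_zone)
    return alive / len(placement)


# Hypothetical zone names for illustration.
placement = spread(6, ["zone-a", "zone-b", "zone-c"])
per_zone = Counter(placement)
remaining = surviving_fraction(placement, "zone-a")
```

With three zones, losing one leaves roughly two thirds of capacity, so the remaining instances need enough headroom to absorb the difference.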
Most companies I’ve been at don’t offer multi region support for their services because it’s too expensive for the service provided even in so-called “price insensitive” enterprises (you can’t just make up a price that’s huge, they do have budgets still) and most of their customers are unwilling / unable to pay more for the extra availability. If your software is designed from the start better, multi region failovers should be fairly inexpensive though. But all the bolted on “multi region” software I’ve seen has been hideously expensive and oftentimes less reliable due to the design being soundly not able to tolerate failures well.
Ultimately it’s a risk/return decision.
“Is going exclusively with AWS/Azure/GCP etc a better decision in reliability, financial and maintainability terms than complicating the design to improve resiliency? And will this more complex solution actually improve reliability?”
If AWS ever screws up, you will be able to continue running the business even if it might take weeks to start over.
For live redundancy, you should have a secondary datacenter on another provider, but realistically it's hard to do and most business never achieve that. Instead, just stick with AWS and if there is a problem the strategy is to sip coffee while waiting for them to resolve it. Much better this way than you having to fix it yourself.
Depends on your definition of small. If it's small enough not to have a dedicated infrastructure team designing a multicloud solution, then the contingency plan may be: switch DNS to a static site saying "we're down until AWS fixes the issue, check back later".
Otherwise it depends on your specific scenario, your support contracts, and lots of other things. You need to decide what matters, how much the mitigation costs vs downtime, and go from there.
I wish I was only being tongue-in-cheek.
Terraform using AMIs plus Chef recipes that work in the cloud and on bare metal. Don't use AWS-specific services.
This would allow you to spin over to another cloud provider, vSphere, or bare metal with minimal work.
To answer the original question: It looks like this issue was just a UI bug that affected the console, the service itself wasn't impacted. Events that do impact the service will be contained to a region, meaning you can mitigate it with proper redundancy across regions, no zany multi-cloud solution required.
We actually got to a point where we had a couple of spare parts onsite (sticks of RAM, HD, etc) and so we repair immediately and then request the replacement. This was on a large HPC cluster so we had almost daily failures of some kind (most commonly we'd get a stick of RAM that would fail ECC checks repeatedly).
They have though; they've just drawn the conclusion that they'd rather put massive amounts of effort into building services that users can use without needing support. This approach works well once the problems have been ironed out, but it's horrible until that's the case. Google's mature products like Ads, Docs, GMail, etc are amazing. Their new products ... aren't.
Google Ads and such also have a terrible support reputation, even with clients spending 8 figures.
Until something goes wrong and the only recourse is to post an angry Hacker News thread or call up people you personally know at Google to get it fixed. For example https://techcrunch.com/2017/12/22/that-time-i-got-locked-out....
Any trade comes to me if it's urgent, and I appear more professional as I've got a functioning system.
I might be a chancer running my entire system off a shoestring, but being up when everyone else has taken a dive looks good.
Without our multi-cloud set up we would have been down for over an hour. In our business this is not an option.
20% of my support experiences are amazing.
Fortunately, I don't require decent support to keep my service running. My sales rep tells me that he's aware of the problem.
I speculate it's simply the result of GCP trying to grow the org very quickly.
Then there was the global load balancer outage in July.
Looking through the incident history, there were essentially monthly multi-region or global service disruptions of various services.
b. To the Agreement:
Google may make changes to this Agreement, including pricing (and any linked documents) from time to time. .... Google will provide at least 90 days’ advance notice for materially adverse changes to any SLAs by either: (i) sending an email to Customer’s primary point of contact; (ii) posting a notice in the Admin Console; or (iii) posting a notice to the applicable SLA webpage. If Customer does not agree to the revised Agreement, please stop using the Services. Google will post any modification to this Agreement to the Terms URL.
Affected customers can use gcloud command [1] in order to create new Node Pools. [1] https://cloud.google.com/sdk/gcloud/reference/container/node...
That led me to believe that only the web UI was affected.
Note how I never stated the inference. This is because I wanted to share a way of thinking without feeling the responsibility to reply to people attempting to force me to prove some prescriptive, arbitrary inference rule by exhaustion. I do not participate in such practices casually. I also consider it rude to subject people to such practices without consent. I also believe it is a practice that kills online discussion platforms. See this community’s thought provoking guidelines :)
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
There are lots of little things to like about GCP that are superior to AWS. Network IO, some of the bigdata products. Not having to deal with IAM. In the end it would be some combination of those things that should drive the decision. Basic enterprise IT shops moving to "cloud" should choose AWS 90% of the time.
Anyone starting from scratch on kubernetes or considering shifting all of their infrastructure to it should absolutely choose GKE. Anyone currently in EKS or AKS should sign up for GCP today and evaluate the differences to see what they're missing.
It's not like 5 years ago when everyone was ramping up their offerings with a yearly price drop and a new generation.
"Shibboleet" https://www.xkcd.com/806/
GCP does have role-based support models with a flat-rate plan, which is really great, but the overall quality of the responses leaves much to be desired.
Even when working in small companies with small infrastructure, I've kept recreation of infrastructure as one of my high priorities (one reason it really bugged me in one job to have to depend on Oracle Databases that I couldn't automate to the same degree.)
In my mind, it's not different from the importance of having, and testing restoration of, backups. If your infrastructure gets compromised somehow, or you find yourself up the creek with your provider, you've got to be able to rebuild everything from scratch.
Then you realize a lot of software and databases can only run from a single instance, with zero support for multiple regions, and you're not going to rewrite everything, so resiliency just can't happen.
I am not saying that with vast numbers it's feasible, but big cloud providers don't even give you the opportunity to raise a ticket if it's their fault. There is an extra price you pay when you opt for any one of them, but many don't realize it. Having said that, almost all the time our in-house expertise is better than their first two levels of support staff. We realized this early, so we handle it better by going over the documentation and making our code resilient, since all cloud platforms have some limit or another; overselling in a region is something they can't avoid. Going across multiple regions when you handle these exceptions is the only way through.
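One way to picture "handle these exceptions and go across regions" is the sketch below. It is only an illustration: `QuotaExceeded`, `run_with_region_fallback`, and the `action` callable are hypothetical names standing in for whatever error and provisioning call your provider's SDK actually exposes.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Hypothetical error for a region that is out of capacity or quota."""


def run_with_region_fallback(
    action: Callable[[str], T], regions: Sequence[str]
) -> T:
    """Try the action in each region in order and return the first success."""
    last_error: Optional[Exception] = None
    for region in regions:
        try:
            return action(region)
        except QuotaExceeded as error:
            # This region is exhausted; remember why and try the next one.
            last_error = error
    raise RuntimeError("all regions exhausted") from last_error
```

Only the capacity error triggers a fallback here; any other exception propagates immediately, so real bugs are not hidden behind region hopping.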
https://blog.thousandeyes.com/amazon-route-53-dns-and-bgp-hi...
edit: The default is also only about the UI issue and there's no issue tracker for the broader non-UI disruptions going on since Friday.
What exactly is your point?
Nope.
Edit: I saw your point a bit late. It was limited to GKE, which makes my initial comment about "service" incorrect, and it was global, which keeps my comment about "region" correct. On a related note, an SRE from GKE posted on Slack that GCE was out of resources and so GKE faced resource exhaustion as well [1][2] - so it _might_ have been a multi-service outage.
1. https://googlecloud-community.slack.com/messages/C0B9GKTKJ/c...
2. https://googlecloud-community.slack.com/archives/C0B9GKTKJ/p...
https://www.geekwire.com/2018/widespread-outage-amazon-web-s...
edit: I stand corrected. Apparently the S3 outage wasn't global, though its effects were.
Meanwhile, this outage has only really been noticeable to ops teams, since it doesn't affect existing nodes or anything outside GKE. It's definitely concerning and the fix is taking far too long, but as far as global outages go the impact is relatively minor.
edit:
nvm s3 has regions, it's the bucket names that are global.
http(s)://<bucket>.s3.amazonaws.com/<object>
http(s)://s3.amazonaws.com/<bucket>/<object>
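For anyone unfamiliar with the two URL forms just quoted, here is a tiny sketch that builds each one. The function names are hypothetical, and it does no URL-encoding or validation of bucket names; it only mirrors the two patterns as written.

```python
def virtual_hosted_url(bucket: str, key: str, scheme: str = "https") -> str:
    """Build the form where the bucket name is part of the hostname."""
    return f"{scheme}://{bucket}.s3.amazonaws.com/{key}"


def path_style_url(bucket: str, key: str, scheme: str = "https") -> str:
    """Build the form where the bucket name is the first path segment."""
    return f"{scheme}://s3.amazonaws.com/{bucket}/{key}"
```

In both forms the bucket is identified by name alone, with no region in the URL, which fits the point that the names are what is global.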
The person updating that status dashboard may or may not be an engineer, the IM certainly is.
That's yes, it's still being investigated?
You may feel that's a bad decision, but I doubt that people are in a panic because they can't push out an update that would not be noticeably different from the last one.
Just to clarify, what should this update contain?
GKE being the exception, since it was launched a couple years before EKS. AWS clearly has way more services, and the features are way deeper than GCP.
Just compare virtual machines and managed databases, AWS has about 2-3x more types of VMs (VMs with more than 4TB of RAM, FPGAs, AMD Epyc, etc.), and in databases, more than just MySQL and PostgreSQL. When you start looking at features you get features that you just can't get in GCP, like 16 read-replicas, point in time recovery, backtrack, etc.
Disclaimer: I work for AWS but my opinions are my own.
Some of GCP's unique compelling features include live VM migration that makes it less relevant when a host has to reboot, the new life that has recently been put into Google App Engine (both flexible environment and the second generation standard environment runtimes), the global load balancer with a single IP and no pre-warming, and Cloud Spanner.
In terms of feature coverage breadth I started my previous comment by agreeing that AWS was ahead, and I still reaffirm that. But if you randomly select a feature that they both have to a level which purports to meet a given customer requirement, the GCP offering will frequently have advantages over the AWS equivalent.
Examples besides GKE: BigQuery is better regarded than Amazon Redshift, with less maintenance hassle. And EC2 instance, disk, and network performance is way more variable than GCE which generally delivers what it promises.
One bit of praise for AWS: when Amazon does document something, the doc is easier to find and understand, and one is less likely to find something out of date in a way that doesn't work. But GCP is more likely to have documented the thing in the first place, especially in the case of system-imposed limits.
To be clear, I want there to be three or four competitive and widely used cloud options. I just think GCP is now often the best of the major players in the cases where its scope meets customer needs.
Disk and network performance is extremely consistent with AWS so long as you use newer instance types and storage types. You can't reasonably compare the old EBS magnetic storage to the newer general purpose SSD and provisioned IOPS volume types, and likewise, newer instances get consistent non-blocking 25 Gbps network performance.
I'm not so sure I would praise our documentation; it is one of the areas that I wish we were better at. Some of the less used services and features don't have excellent documentation, and in some cases you really have to figure it out on your own.
GCP is a pretty nice system overall, but most of the time when I see comparisons, when GCP looks better it's because the person making the comparison is comparing the AWS they remember from 5-6 years ago with the GCP of today, which would be like comparing GAE from 2012 with today.
If you're stuck implementing a suboptimal solution, that's not your fault, and not the intent of my above comment.
Disaster recovery by switching to another provider is simple when minimal centos/rhel images are used.
Are you not using any of their managed services and are you maintaining your own on VMs? If so, you have the worst of both worlds. You’re spending more on hosting and you’re not saving money on letting someone else do the “undifferentiated heavy lifting”.
The syntax for provisioning these doesn’t work that well for some find and replace to work. Are you using a templater to generate cloud-specific HCL from a template or something? Sounds like a pretty big problem to solve to me and not just something where you can win via discipline.
Are software development and release processes improving to mitigate these outages? We don't know. You have to trust the marketing. Will regions ever be fully isolated? We don't know. Will AWS IAM and console ever not be global services? We don't know.
Blah blah blah "We'll do better in the future". Right. Sure. Some service credits will get handed out and everyone will forget until the next outage.
Disclaimer: Not a software engineer, but have worked in ops most of my career. You will have downtime, I assure you. It is unavoidable, even at global scale. You will never abstract and silo everything per region.
[1] https://www.theregister.co.uk/2017/03/01/aws_s3_outage/
[2] https://www.cnbc.com/2018/07/16/aws-hits-snag-after-amazon-p...
[3] https://www.cnet.com/news/google-cloud-issues-causes-outages...
[4] https://www.datacenterknowledge.com/uptime/microsoft-blames-...
http://highscalability.com/blog/2012/5/9/cell-architectures....
> Facebook Platform Appears to be down
> A check of https://developers.facebook.com/status/dashboard/ returns an error and I'm unable to login with facebook to some of my mobile apps.
You're right that Athena seems like the current competitor to BigQuery. This is one of those things that are easy to overlook when people made the comparison as recently as a couple of years ago (before Athena was introduced) and Redshift vs BigQuery is still often the comparison people make. This is where Amazon's branding is confusing to the customer: so many similar but slightly different product niches, filled at different times by entirely different products with entirely unrelated names.
When adding features, GCP would usually fill adjacent niches like "serverless Redshift" by adding a serverless mode to Redshift, or something like that, and behavior would be mostly similar. Harder to overlook and less risky to try.
Meanwhile, when Athena was introduced, people who had compared Redshift and BigQuery and ruled out the former as too much hassle said "ah, GCP made Amazon introduce a serverless Redshift. But it's built on totally different technology. I wonder if it will be one of the good AWS products instead of the bad ones." (Yes, bad ones exist. Amazon WorkMail is under the AWS umbrella but basically ignored, to give one example.)
And then they go back to the rest of their day, since moving products (whether from Redshift or BigQuery) to Athena would not be worth the transition cost, and forget about Athena entirely.
On the disk/network question, no I didn't see performance problems with provisioned IOPS volume types, but that doesn't matter: for GCE's equivalent of EBS magnetic storage, they do indeed give what they promise, at way less cost than their premium disk types. There's no reason it isn't a fair comparison.
And for the "instance" part of my EC2 performance comment, I was referring to a noisy neighbor problem where sometimes a newly created instance would have much worse CPU performance than promised and so sometimes delete and recreate was the solution. GCE does a much better job at ensuring the promised CPUs.
I'm glad AWS and GCP have lots of features, improve all the time, and copy each other when warranted. But I don't think the general thrust of my comparison has gone invalid, even if my recent data is more skewed toward GCP and my AWS data is skewed toward 2-3 years old. Only the specifics have changed (and the feature gap narrowed with respect to important features).
They're equivalent in the sense that you have nodes that can die anytime, but it's much more complicated. You could technically have a much lower cost on AWS by aggressively bidding low but we've had a few instances where the node only lived a few minutes.
Preemptible nodes are max 24h, and from our stats, they really live around that amount of time. I think the lowest we've had was a node dying after 22h.
You also save out of the box because they apply a discount when your instance is running for a certain number of hours.
You can get an even bigger discount by agreeing to a committed use, which you pay per month instead of in one shot, unlike AWS.
I'm going to add a few more reasons to the above reply:
- UI and CLI is so much better in GCP
I don't have to switch between 20 regions to see my instances/resources. From one screen, I can see them all and filter however I like.
- GCP encourages creating different projects and applying the same billing.
It's doable in AWS too, of course, but coupled with the fact that you have different projects and regions, and you can't see all instances of a project at once, this makes a super bad experience
- Networks are so much better in GCP
Out of the box, your regions are connected and have their own CIDR. Doing that in AWS is complicated.
- BigQuery integration is really good
A lot of logs and analytics can be exported to BigQuery, such as billings, or storage access. Coupled with Data Studio and you have non technical people doing dashboards.
- Kubernetes inside GCP is a lot better than AWS'
https://blog.hasura.io/gke-vs-aks-vs-eks-411f080640dc
- Firewall rules > EC2 Security Group
- A lot of small quality-of-life improvements that make the experience a lot better overall
... like automatically managing SSH keys for instances, instead of having a master ssh key and sharing that.
Here's the thing though, a lot of GCP can be replicated, just like what you linked for the identity provider. With GCP, there's a lot of stuff out of the box -- so dev and ops can focus on the important stuff.
Overall, AWS is just a confusing mess and offers a very bad UX. Moving to GCP was the best move we've made.
Moved for bizdev reasons, and really appreciated the improved quality of life.
"Cloud" is not a thing one buys and one's reputation has nothing to do with the reliability of the services consumed, but the reliability of the services provided.
To put it more succinctly, "you own your availability".
In the end, "cloud" is a commodity and all cloud providers are trying to get vendor lock-in. My goal as a manager is not to couple my business revenue linearly to any particular product or service.
Cloud is only an interchangeable commodity if you’re treating it like an overpriced colo and not using it to save costs on staff, maintenance, and helping deliver product faster.
Sure, we use support tickets with vendors for small things. Console button bugging out, etc. But for large incidents, every vendor has a representative within an hour driving distance and will be called into a room with our engineers to fix the problem. This kind of outage, with zero communication, means the dropping of a contract.
Communication is critical for trust, especially if we're running a business off it.
You need failovers to different providers and hopefully also have your hardware for general workloads
And suddenly the CEO doesn't care anymore if one of your potential failovers is behaving flaky in specific circumstances
Not saying it's good as it is.. communication as a SaaS provider is, as you said, one of the most important things... But this specific issue was not as bad as some people insinuate in this thread
Don't get it wrong. AWS is the exact same thing as Google. All you will do is log a ticket and receive an automated ack by the next day.
Considering that even tech companies hardly manage to have a pair of DevOps or sysadmins, running one's own infrastructure is completely out of the question.
The separate account was set up partially on my insistence but it was set up in the same region.
If needed, we could have done VPC peerings across regions. (https://aws.amazon.com/about-aws/whats-new/2017/11/announcin...)
Some services in AWS are also not available in others (I'm quite familiar with AWS Data Pipeline not being available outside the "core" regions like us-east-1, eu-west-1) and having services in one region make usage of resources in another region is a huge change when most developers outside ones with technology literate customers are under the gun to push features out fast over sound design. The matrix of services and configurations necessary to mix and match regions and availability zones is non-trivial if you make extensive usage of AWS services above the IAAS layer.
Also, cross-region VPC peering has a TON of limitations that are rather annoying depending upon how well your network has been architected (by default in most companies outside enterprises with a deep bench of network engineers, this would be rated at "complete crap barely better than a typical home wifi network"). Heck, even though I'm non-dumb at networks I have to keep reminding myself of various cross-region VPC limitations when refactoring cross-region VPCs, like where you can reference security groups, how to propagate Route 53 records, etc.
This was a while back though. Now we depend on a lot more AWS stuff.
Cluster administration and identity management are unique to each provider and fairly challenging to get right.
If you look at the docs now[1], new buckets are regionalized and the region is in the URL for non-us-east-1 regions.
[1] https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_...
"We're working on it", possibly an ETA for a fix or some details. Technically it's fluff but people are not machines and the update is for people. We like the feeling that people are working on a fix, that people care and that the end is in sight. It makes the situation less stressful and, as for why Google should care, less stressed engineers won't bad mouth Google as much after the fact.
What's the better communication plan: detailed, hourly updates or terse, one-line blog posts scattered across several days?
And I'm confused about what was good about that response. That article is about how the s3 outage caused so many issues that Amazon couldn't update their status dashboard to inform users at all.
But I blame most of the cost overruns when using cloud providers on “consultants” who think they are “moving to the cloud” when all they really know is how to set up a little networking infrastructure and know nothing about how to use the developer, DevOps, or other hosted solutions.
Most ”consultants” I’ve run across only know how to do a lift and shift and do a one to one mapping of the on prem VMs and networking infrastructure to the cloud. They know nothing about automation, transforming Devops practices, or transforming development practices and architecture.
A lift and shift should only be the first phase.
Thinking that is the reason for consultants is a problem
Note : I worked directly for companies as an employee when building these stacks
At the same time you ignored the massive complexity and size of Google compared to what they were at the beginning.
This is voodoo organisational analysis.
That being said -- when you are on call -- dropping everything is exactly what is expected.
If you've got VC money to blow so you can pretend your SaaS toy can feed 500 people while having money left to throw at things, that's cool. Just remember that other people might be running sustainable businesses.
And just like that you turned a $200/month bill into a $10k/month strawman.
> Just remember that other people might be running sustainable businesses.
Why are you pretending that a startup that can't afford $200/month is a "sustainable businesses"?
I'm glad AWS's free tier is working for you, but complaining that Google doesn't want to give you free capacity for your business and then also provide you free support for that business is pretty absurd.
Let's hope you don't have a life threatening medical emergency that can't wait near an affected healthcare facility while that silly software is down.
If your ability to operate an ER is dependent on a remote data center, you have no business being a public health provider.
Am I the only one that finds this slightly humorous?
It's likely the fix is checked in and will start roll out on Monday.
Disclaimer: I work on Google Cloud and while I believe we could use more words here, this doesn't seem like a huge problem. It's embarrassing that the issue with the UI was shipped, and I'm sure this will be addressed in the post mortem as well as whether it could have been mitigated quicker than a roll forward.
Based on comments in this thread even gcloud is failing and so are other non-kubernetes services. Which may be inaccurate but there's a lot of people saying the same thing so maybe it is.
You're right, however, that the linked issue is only about the UI. So Google isn't even tracking the service disruption issue in its issue tracker, much less updating people on it. I personally think that's even worse...
> 7.1 Discontinuance of Services. Subject to Section 7.2, Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer.
Let's take a look at Section 7.2:
> 7.2 Deprecation Policy. Google will announce if it intends to discontinue or make backwards incompatible changes to the Services specified at the URL in the next sentence. Google will use commercially reasonable efforts to continue to operate those Services versions and features identified at https://cloud.google.com/terms/deprecation without these changes for at least one year after that announcement, unless (as Google determines in its reasonable good faith judgment):
>
> (i) required by law or third party relationship (including if there is a change in applicable law or relationship), or
>
> (ii) doing so could create a security risk or substantial economic or material technical burden.
>
> The above policy is the "Deprecation Policy."
To me that looks like a reasonable deprecation policy.
It might be, until they jack up the prices 15X with limited notice (looking at you, Google Maps [1]). No deprecation needed, just force users off the platform unless they're willing to pay a massive premium.
[1] https://www.google.com/search?q=google+maps+price+increase
The fact that they're all Google makes reputation damage bleed across meaningfully different parts of what's in truth now a conglomerate under the umbrella name Google.
If they ever do deprecate something people have built on though they're gonna get absolutely crucified. That's probably better protection than any terms of service.
They do this all the time, and they get crucified every time. I built a Google Hangout App and a Chrome App, both on platforms that were eventually shut down.
This is where the meme came from, and it's why I personally stopped building on top of Google products. A 1-year deprecation policy is no assurance to me if I plan for my app to live longer than that.
If a service Google runs is losing money, what reason would they have to not shut it down?
Which is the deprecation policy. (I mean I share your frustration with Google's what-appears-to-be-at-least haphazard policy of shutting down services instead of trying to gain traction. But, let's not misrepresent what they say).
I don't think it's wrong - they can deprecate any service they want whenever they want, unless people have paid for and signed a contract that says otherwise, which I guess people aren't doing.
But the policy doesn't really guarantee anything at all, does it, due to the referenced escape hatches? It might as well not exist?
"Subject to the deprecation policy [which says that Google will give at least 1 year notice before cancelling services], Google may discontinue..."
In other words, at any time, Google can give you a year's notice.
(I work at Google, but am not a lawyer and this isn't official in any capacity).
Please don't selectively quote things out of context to give a misleading impression.
> commercially reasonable
> substantial economic or material technical burden
Is one engineer working on an old service to keep it alive commercially reasonable or a substantial burden? I don't know. Do you?
In practice this policy lets them shut off anything they want any time they want. Again, it's their playground; they can do what they want unless they signed a contract saying they'd do something else for you, so I don't have a problem with it.
To be clear, that policy is a contract. And those things would be decided by a jury. And if my understanding is correct, the reasonable person standard applies. So you can answer this yourself, do you think a reasonable person would believe that your interpretation is valid?
If not, why mention it?
Caveat emptor, folks.
At the end of the day, changing your underlying infrastructure is so risky, and so seldom justified by a cost-benefit analysis, that it’s rarely done.
That's a pretty manpower intensive way of operating. I think the fact that you get cloud agnostic this way is probably not worth it.
Besides, I’m assuming that the cost savings a small company can get from being billed under a much larger organization account would make up for it. That and having cheap shared netops support.
Of course, that doesn't make them knowledgeable enough to run stable infrastructure, and they will move on as soon as they realize they are being abused to work overnights and weekends.
A company that doesn’t want the overhead of an MSP, which in my experience is less than the cost of a full-time dev, is not a company I’m going to work for. It would tell me a lot about their mentality.
Doable, but it’s a hell of a lot of hassle and that CapEx is huge for a startup.
I’d go bare metal in a second for any kind of cost conscious business that needed scale and had established revenue.
If I pay you for a service that would take time to migrate off of, and you are making money off me now, I am going to be ripshit if you decide to just turn it off because it's suddenly not making money for you in the short term. Google's done this a lot, and the fact that they don't provide concrete timelines in their contract gives even less reason to trust them.
People look at AWS's track record, and trust that. People look at Google's track record, overlook what to an inside-the-company Googler perspective are dramatically significant organizational boundaries or product lifecycle definitions that are very poorly communicated outside the company, mentally apply reputational damage from one part of Google (or from a preview-stage GCP product) to a different part of the company (or to a generally available GCP product), and don't trust that.
Google has always been worse at externally facing PR than at the internal reality, even when I worked there (2011-2015). Major company weakness.
But the internal reality inside GCP, perceptions aside, is pretty good even now.
If it's costing them money, they haven't yet figured out a model that works in their favour.
The bit of Maps Platform integration for management of the billing and API layer was called out in the announcement blog as an integration with the console specifically, and the docs and other branding around Maps Platform remain distinct from GCP still in excessively subtle ways that Googlers pay more attention to than everyone else, like hosting the docs on developers.google.com instead of cloud.google.com and having Platform in its name separately from Cloud Platform.
This stuff makes sense to Googlers not only because of the org chart but also because Google has a pretty unified API layer technology and because Google put in a lot of work to unify billing tech & management. Reusing that is efficient but not always clear.
But you're right to be confused. Their branding is a mess and always has been. This is the same company that thought Google Play Books makes sense as a product name.
Google's product / PR / comms / exec people are very bad at understanding how external people who don't know Google's org chart and internal tech will perceive these things, or at least bad at prioritizing those concerns.
They live and breathe their corporate internals too much to realize this. Some Google engineers and tech writers realize the confusion but pick other battles to fight instead (like making good quality products).
They do at least document which services are subject to the GCP Deprecation Policy (Maps is not there): https://cloud.google.com/terms/deprecation
As for what products are actually part of GCP, it's the parts of this page that aren't an external partner's brand name, aren't called out separately like G Suite or Cloud Identity or Cloud Search, and aren't purely open source projects like Knative and Istio (as opposed to the productized versions within GCP), with the caveat that the level so far of integration into GCP of Google acquisitions like Apigee, Firebase, and Stackdriver varies depending on per-company specifics: https://cloud.google.com/products/
G Suite and Cloud Identity accounts can be used with GCP, just like any other Google accounts. They are part of Google Cloud but not Google Cloud Platform.
Hope I waded through the mess correctly for you. :)
I mean sure, they could go and probably afford to waste $200 extra on something random that will be useless to them most of the time, but that money is going straight out of their paycheck.
You don't remain profitable though by repeatedly making bad decisions like that. Which was my point.
Running a (small) profitable business is about making the right decisions consistently, and if you're likely to waste money on one thing, you're also likely to waste it on the 19 other similar things.
Maybe speak to literally anyone you know who is running a small business if you want to know more. Yes, that includes the small local stores on your street.
At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.
This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.
You’re trying to equate small businesses with hobbies. You’ve now resorted to straw men, slippery slopes, and false equivalency. Maybe consider that if you have to distort the situation this much to make your point, you might just be wrong.
> At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.
I didn’t say anything about anyone’s livelihood. You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.
I bet the guy who started this thread about GCP’s support cost has made a sum total of <$1000 from his “startup”. Likely <$10. Hobby.
I don’t care if “quite a few people” got pissed about my comment. People with egos that delicate shouldn’t use social media.
I was trying to tell you that most small businesses can't go around spending hundreds of bucks on things that provide little value, whether that's business support plans for services they use or something else. It's true regardless of whether you're a brick and mortar store or some online service.
> This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.
Speaking about false equivalencies...
> You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.
First off, I spoke of businesses making generally less than that.
Also (I already said this, good job ignoring that!) paying 200 bucks on a single useless thing is survivable for even a small business - but you know what's better than only making one bad business decision? Making no bad ones at all. Making too many will quickly break the camel's back.
Which was my whole argument and it's also what people generally refer to when they say they can't afford something.
For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.
By the way that GCP is so full of loopholes where Google can get out of its obligations it's laughable. So it's not even that clear cut that the GCP is really a better alternative.
And even when it turns out to be legally sound, when stuff like this happens, who's going to sue google over it? Nobody, and they know it.
But as I say in another comment, the contract is less important than both trust and reality. Keep in mind nobody focuses on how AWS doesn't even have a public deprecation policy.
I'm right there with many people in this thread in agreeing that Google has a trust problem, due mostly to real perception issues stemming from Google's habits outside GCP, which can and do impact people's perceptions of what they'll do with GCP.
The reality of what Google has done and will do with GCP, though, is pretty good. Sure they do sometimes deprecate things in ways Amazon never would. But not nearly as often or as abruptly as they do on the consumer side - that would be commercial suicide - and they do other things better than Amazon. Tradeoffs.
Because it makes more people feel comfortable enough to use your services and pay you, without actually binding you towards any sort of behavior that would cost you money. There's a direct financial incentive here to use legalese to give the semblance of reliability without having to deliver on it.
And I'm telling you that if you built your business on top of GCP, a support contract is probably not "low value". You'd happily pay $200 for support on your critical infrastructure, just as you'd happily pay $200 for a repairman to fix your washing machine if you owned a laundromat.
If you don't need support, then sure, don't pay for the plan. If you do need support, $200 seems pretty reasonable.
> Speaking about false equivalencies...
Signing up for a monthly recurring support plan in case you need it is literally insuring your business.
> For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.
A support plan for your critical infrastructure probably isn't "useless". Which is the point. If your need for support is that low, then either you've built your own redundant systems to protect you or more likely you aren't running a real business.
No. It's just words. Actions speak louder than words. Google's actions in the last couple of days spoke pretty loud. No amount of words will change that.
Are you working for Google PR or something?
I'm still a fan of GCP as a suite of products and services, as much as I recognize many of Google's organizational failings and disagree with plenty of their product decisions in other areas of Google.
Google (including GCP) has been bad at external communication as long as I've paid attention, and that includes external communications around incidents. What actions are you referring to, beyond poor and confusing communication (i.e. words) around what is or isn't broken or fixed at what points during the incident? That's most of the problem I'm aware of from this incident.
With that said, part of the reason people notice GCP's outages more than AWS's is that GCP publicly notes their outages way more than AWS does. In other words, among the outages that either cloud has, Google much more often creates an incident on their public status page and Amazon much more often fails to.
My "reality of [...] GCP" comment was about the bigger picture of the cloud platform offering, not any one specific incident.