Tell HN: AWS appears to be down again
Console is flickering between "website is unavailable" and being up for my team. This is happening very frequently just now, reliability seems to have taken a hit.
Notably: cognito, r53 and the default web UI. (You can work around the webui one I’m told, by passing a different domain instead of just console.aws.amazon.com)
just the weekly internet apocalypse, happy holidays fellow SREs
I'm having issues with Slack from central EU (Poland) -- can't upload images or send emoji reactions to posts (curiously, text works fine). Wondering if it's linked.
Only with AWS and Github do I seem to get panicked text messages on my phone first thing in the morning... Our workloads on Azure typically only have faults when everyone is in bed.
Santa is bringing me a Synology in three days.
> Due to this degradation your instance could already be unreachable
>:(
So it's not shocking to me that something going down in us-east-1 could have impact on other regions.
I’m not affiliated with them, and haven’t even really used them other than to explore a bit. They come highly recommended by my acquaintances, though.
Meanwhile, I currently have a gig to work on a video service which features a never-updated centos 6, an unsupported python 2 blob website, and a push-to-prod deployment procedure, running a single postgres db serving streaming for 4 million users a month.
And it's got years of uptime, costs 1/100th of AWS, and can be maintained by one dev.
Not saying "cloud is bad", but we got to stop screaming old techs are no good either.
At least, that's what I understood.
Though I guess there's still probably just lost revenue that could be captured by having better uptime, even if your competitors are down.
AWS consists of over 200 services offered in 86 availability zones in 26 regions each with their own availability.
If one service in one availability zone being impaired equals a post about “AWS is down” we might as well auto-post that every day.
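To put a rough number on the "auto-post that every day" point, here is a back-of-the-envelope sketch. It assumes every service runs in every availability zone and that each service/AZ pair is impaired independently with the same made-up daily probability; neither assumption is true of real AWS, and `p_unit` is not a measured figure.

```python
def p_any_impaired(units: int, p_unit: float) -> float:
    """Chance that at least one of `units` independent units is impaired."""
    return 1 - (1 - p_unit) ** units


# 200 services x 86 AZs, from the comment above (assumes every service is in every AZ).
units = 200 * 86

# Assumed, not measured: each service/AZ pair has a 0.01% chance of a bad day.
p_unit = 0.0001

print(units)
print(round(p_any_impaired(units, p_unit), 2))
```

With these assumed inputs the chance that something, somewhere, is impaired on a given day comes out at roughly 0.82, which is the gist of the argument: at this scale "one thing is broken" is close to the normal state.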
Specifically large parts of the management API, and IAM service are seemingly centrally hosted in us-east-1.
If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region
Wonder if it's connected?
Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <-Quite literally their words.
Me: leaves cause that treatment was the final straw
Amazon and Facebook both go down within a month of each other, and supposedly they needed backups
Them: shocked pikachu face
I must admit that I do always try and maintain a separate data backup for true disaster recovery scenarios - but those are mainly focused around AWS locking me out of our AWS account (and hence we can't access our data or backups) or recovering from a crypto scam hack that also corrupts on-platform backups, for example.
What happens if AWS or [insert other megacloud] decides your account needs to be nuked from orbit due to a hack or some other confusion? We almost had this happen over the summer because of a problem with our bank's ability to process ACH payments. Very frustrating experience. Still isn't fully resolved.
What happens if an admin account is taken over and your account gets screwed up?
What happens if an admin loses his shit and blows up your account?
What happens if your software has a bug that destroys a bunch of your data or fubars your account?
There's a ton of cases where having at least a simple replica of your S3 buckets into a third-party cloud could prove highly valuable.
Was it just a miscommunication around AWS billing and them thinking you weren't paying? Or did AWS somehow put itself in the middle of, or react to, your use of ACH payment processing for *non-AWS* receivables or payables?
If the latter, that's a business risk I'd never even thought about. I'm not even sure how they'd know. But I'm mindful that things like the MATCH list [0] exist, and how easily a merchant can accidentally wind up on these lists from either human error or a small amount of high-value chargebacks. If cloud providers are somehow paying attention to merchant services reputation, that would be very scary for many businesses!
[0] https://www.merchantmaverick.com/learning-terminated-merchan...
1) Can you make your on prem infrastructure go down less than Amazon's?
2) Is it worth it?
In my experience most people grossly underestimate how expensive it is to create reliable infrastructure and at the same time overestimate how important it is for their services to run uninterrupted.
--
EDIT: I am not arguing you shouldn't build your more reliable infrastructure. AWS is just a point on a spectrum of possible compromises between cost and reliability. It might not be right for you. If it is too expensive -- go for cheaper options with less reliability.
If it is too unreliable -- go build your own, but make sure you are not making a huge mistake, because you may not understand what it actually costs to build to AWS's level.
For example, personally, not having to focus on infra reliability makes it possible for me to focus on other things that are more important to my company. Do I care about outages? Of course I do, but I understand that doing this better than AWS does would cost me a huge amount of focus on something that is not a core goal of what we are doing. I would rather spend that time thinking about how to hire/retain better people and how to make my product better.
And adding all the complexity of running this infra to my company would cause the entire organisation to be less flexible, which is also a cost.
So you can't look at cost of running the infra like a bill of materials for parts and services.
And if there is an outage it is good to know there is huge organisation there trying to fix it while my small organisation can focus preparing for what to do when it comes back up.
Obviously depends on what you need, but for a small to medium web app that needs a load-balancer, a few app servers, a database and a cache, yes absolutely - all of these have been solved problems for over a decade and aren't rocket science to install & maintain.
> Is it worth it?
I'd argue that the "worth" would be less about immunity to occasional outages but the continuous savings when it comes to price per performance & not having to pay for bandwidth.
> overestimate how important it is for their services to run uninterrupted.
Agreed. However when running on-prem, should your service go down and you need it back up, you can do something about it. With the cloud, you have no choice but to wait.
Backup is cheap when you're focused about what you're backing up.
In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon. Though that's getting harder!
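A minimal sketch of why "uncorrelated" is the part that matters, assuming two providers that fail completely independently of each other. The 99.9% availability figure is an assumption for illustration, not a claim about any real provider.

```python
def combined_availability(a: float, b: float) -> float:
    """Availability of a service that stays up if either of two
    independent providers is up."""
    return 1 - (1 - a) * (1 - b)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1 - availability) * 365 * 24


single = 0.999  # assumed availability of each provider
both = combined_availability(single, single)

print(round(downtime_hours_per_year(single), 2))
print(round(downtime_hours_per_year(both), 4))
```

With the assumed 99.9% each, one provider alone is down for about 8.76 hours a year, while both being down at the same moment works out to well under a minute. That only holds if the failures really are independent; the more your backup shares with Amazon (same region, same DNS, same upstream), the closer you drift back to the single-provider number.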
"In more than one way" doesn't have to be local, but it may be across multiple cloud services. Still, "local" is nice in that it doesn't require the Internet. ("The Internet" doesn't tend to go down, but the portion you are on certainly can.) Of course, as workers disperse, "local" means less and less nowadays.
Over the last two years, my track record has destroyed AWS. I've got a single Mac Mini with two VMs on it, plugged in to a UPS with enough power to keep it running for about three hours. It's never had a second of unplanned downtime.
About 15 years ago I got sick of maintaining my own stuff. I stopped building Linux desktops and bought an Apple laptop. I moved my email, calendars, contacts, chat, photos, etc, to Google. But lately I've swung 180 degrees and have been undoing all those decisions. It's not as much of a PITA as I remember. Maybe I'm better at it now? Or maybe it will become a PITA and I'll swing right back.
EDIT: I realize you're talking in a commercial sense and I'm talking about a homelab sense. Still, take my anecdote for what it's worth. :D
I think you're right about what we over- & under-estimate, but that we also under-estimate the inflection point for when it makes sense to begin relying on major cloud services. Put another way: we over-estimate our requirements, causing us to pessimistically reach for services that have problems that we'd otherwise never have.
For extra safety, and extra work, you could even take Azure as a backup if you're not locked in with AWS.
It's now hard to say how frequently Amazon's infrastructure goes down. The incident rate seems to have accelerated.
...My home Internet even is scoring better than Amazon right now, in fact. Yours probably is too.
In my experience problem number 3 is the hardest to solve.
At AWS, we built a few layers of redundant infrastructure with multi-AZ availability within a region and then global availability across multiple regions. All this was done at roughly half the cost of the traditional hosting, even when including the additional person-hours required to maintain it on our end.
Keeping our infra simple helped that work, and it's literally been years since an outage caused by any AWS issues, even though there have been several large AWS events.
AWS maintains a fiction of turnkey infrastructure, and the reality of building your own is so starkly different that I haven't seen an IT group for some time that could successfully push back on these sorts of discussions.
Building your own datacenter is still too much like maintaining a muscle car, fiddly bits and grease under your fingernails all the time, meanwhile the world has moved on, and we now have several options in soccer mom EVs that can challenge a classic Corvette in the quarter mile, and obliterate its 0-60-0 time. There is no Hyundai for the operations people, and there should be.
I don't know the physics of shipping such a thing, but I think we really do need to be able to buy a populated and pre-wired rack and slot it into the data center. Literally slot it in. If you've ever been curious about maritime shipping, you know that they have a system for securing containers to cranes, trailers, each other, and I don't see a reason you couldn't steal that same design for mounting a server rack to the floor. Other than the pins would need to be removable (eg, a bolt that screws into a threaded hole in the floor) so you don't trip on them.
In a word, we need to make physical servers fungible. There are any number of things that we need to do to get there, but I think we can. Honestly I'm surprised we haven't heard more of this sort of talk from Dell, especially after they bought VMWare. This just seems like a huge failure of imagination. Or maybe it's simply a revolution lacking a poster child. At this rate that 'child' has already been born, and we are just waiting to see who it is.
I'd wager that will still give you more uptime than a physically-hosted solution for the same cost.
It’s not a bad idea to store backups offline, but costs might make that an expensive proposition.
I've had buckets and objects disappear into the ether.
It is exceedingly rare, but it's not impossible.
Offline/alt-cloud backups are probably a lot cheaper than you think, and will win you points during any audit.
Most of the folks impacted by cloud outages do not have highly available systems in place. Perhaps, for their business, the cost doesn't justify the outcome.
If you need high uptime for instances, build your system to be highly available and leverage the fault domain constructs your provider offers (placement groups, availability zones, regions, load balancing, DNS routing, autoscaling groups, service discovery, etc). For instances, double down and use spot instances and maximum lifetimes in your groups so that you're continuously validating that your application can recover from instance interruptions.
If you're heavy on applications that leverage cloud APIs, such as is often the case with lambdas, then strongly consider multi-region active/active, as API outages tend to cross AZs and impact the entire region.
Maybe a cheery note asking how the team is doing, sent right in the middle of an outage.
Passive aggressive? As hell. Cathartic? Damn skippy.
As a counterpoint, though, my last place had a large Java app, split between colo'd metal and AWS. Seemed like the colo'd stuff failed more (bad RAM mostly, a few CPUs, and an occasional PSU). Entirely anecdotal.
In our case {LargeCloud} acquired {SaaSVendor}. We were already using {LargeCloud}, with an existing billing arrangement. When {LargeCloud} got around to integrating the {SaaSVendor} into their billing system, it exposed multiple bugs in {LargeCloud}'s billing system, and ultimately limitations in our bank's internal systems--a well known establishment and it would blow your mind to learn how much manual crap they do.
Traditionally, we received favor from {SaaSVendor} through Invoices. But when {SaaSVendor} was subsumed by {LargeCloud}, we stopped receiving invoices. Our internal ops reached out to {LargeCloud} about this two days before we got our first "You will experience Dire Consequences" email from {LargeCloud}'s Robot Overlords. Our attempts to contact {LargeCloud} regarding this concerning message were always routed to a Robot Overlord who only spoke in tongues and could not solve our problems. Eventually, we were able to get the Robot Overlord to escalate us to a Robot Superlord that would only tell us to "follow the instructions in this handy dandy web page thing", except following the instructions always summoned a "Server 500" Demon, which {LargeCloud} claimed was impossible because their Robots are Divine and Holy.
Finally circling back through random Human Actors we were able to avert the countdown to destruction. Some Robot Necromancer was able to resurrect our billing account from the "Server 500" Demon, but we would now need to set up automatic ACH payments, as whatever fix was implemented could only persist with regular monthly succor upon the altars of the Federal Reserve Automated Clearing WaffleHouse. Invoices, payments arranged through Our Lady of Visa and The Master Card would no longer suffice.
We believed we had made the appropriate incantations before FratBoy 3000 at our local branch of the Federal Reserve Chapel. However, we eventually received another threat of Dire Consequences from {LargeCloud}, indicating that our prayers were not received. It took significant supplication in order to get FratBoy 3000 to confirm that our Federal Reserve Chapel had misrouted our prayers, deducting them from our account, but sending them to the wrong Demon, through no fault of our own.
The whole time this was going on, we kept getting threats of Dire Consequences. We were told by Human Actors to have great faith, that the {LargeCloud} Robot Overlords had been placated through their secret prostrations. FratBoy 3000 was replaced by our Federal Reserve Chaplain, who informed us that they had no robots, this was all the result of Human Actor failures, but that, forthwith, all of our prayers could be answered if we moved all of our faith into a New Account which itself required additional monthly supplication, but would ensure divine routing of our prayers would always be successful.
To this day, we continue to make our monthly pilgrimage to our local Federal Reserve Chapel, supplicating upon all necessary altars. The threats of Dire Consequences from {LargeCloud} have subsided. But we have cast ourselves out onto the trail, seeking refuge from a more receptive and responsive Federal Reserve Chapel.
Everybody focuses on "what if us-east-X goes down", but, literally, sometimes it's a combination of billing and payment issues that can keep you up at night.
To do it, first I would not use any cloud features that cannot be easily set up in another cloud. So no lambdas. Just k8s clusters, maybe DBs if they can be set up to back up between clouds. I was able to migrate from AWS k8s to DO k8s very easily: I just pointed my k8s configs to the new cluster (plus configuring the DO load balancers).
In my case, I need the dynamic DNS (haven't looked into it yet), auto-scaling is already set up with k8s, and the DB backups between DBs (next project).
It's possible to go down in a mostly uncorrelated way to Amazon by just being down all the time.
Obviously this is implicit in your comment, but I'll say it anyway: your backups need to actually work when you need them. You need to test them (really test them) to make sure they're not secretly non-functional in some subtle way when Amazon is really down.
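One way to "really test them" is to restore into a scratch location and compare content hashes against the source, rather than trusting that the backup job exited cleanly. A minimal sketch, assuming both the source and the restored copy are plain directory trees on local disk; the function names are made up for illustration and this says nothing about how any particular backup tool works.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return a list of problems: files missing from, or different in, the restore."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"differs: {rel}")
    return problems
```

An empty list means every source file came back byte-for-byte. It does not prove the restored system actually boots or serves traffic, so it complements a full restore drill rather than replacing one.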
Having written this, I'm going to ping our SME on the cache replication and remind him that since the last time he benchmarked it, we've upgraded to a newer generation of EC2 instances that has lower latency, and could he please run those numbers again.
Another side benefit of being with AWS is when you do have an outage, a lot of other people have outages, and so you sort of blend in with the noise. It's not great to be down, but if you're down and also "big service X" who's also an AWS customer is down, it makes your downtime look less like a lack of competence and more like an unavoidable force of nature.
I worked at a company whose bread and butter was online services (e-commerce SaaS platform, similar to Netsuite) and we had significantly fewer outages than AWS had.
But we had redundancies built in to most things, I'm not saying it was perfect but it worked.
The major difference might be that almost nobody is willing to spend 20% of what they spend on AWS/GCP to have a self-hosted solution.
The reason "cloud is so expensive" is because they're essentially telling you what the price will be and even if they only spend 40% of that on actual hardware and operations: it's more than most companies would invest in themselves.
This is absurd, of course, but it's absolutely true.
Uptime improved rather dramatically after that.
What does it tell you that there is a market for this, where essentially what you are buying from them is a management and control plane, when other companies like BMC have been selling that as a standalone product for decades (and for the most part failing to live up to their customers' actual expectations)?
[1] https://www.bizety.com/2020/06/28/aws-outposts-google-anthos...
edit: I actually think a big pull of the cloud is also about shutting down archaic internal IT organizations that have been slowing people down so that it takes weeks and weeks to launch a simple new webservice. Better to give your programmers a cloud account and let them get shit done.
What it tells me is that someone new needs to step in.
With the caveat that you're going to have to implement all your access controls, monitoring and compliance mechanisms on those alternate backups. No point winning points during an audit for having backups outside AWS if you lose even more points for "backups weren't properly secured against unauthorized access".
And you're regularly restoring from those alternate backups as well to check their integrity, right?
But none of that changes the fact that you shouldn't put all your eggs in one basket.
Agreed. Arguably, not using an existing cloud service is a red flag on any new hires. AWS being the primary, but experience using GCS or Azure are at least viable skills, even if your business is AWS-based.
But the "fad-based-development" meme is not going away any time soon. The incentives in the business are built around it (really! No one wants to work on a boring old relational database solution any more). In the old days it was 4th generation languages, RUP, XML and Function Point Analysis... today it's functional programming, SDKs, big-three cloud PaaS experience or (shudder) block-chain.
I think back to my much younger self, when I thought that technology was something to be mastered to solve real-world problems, and I laugh. Little did I know the real problem to be solved was to figure out how to solve those same-old business problems but with the technology of the season (Kubernetes, GraphQL or ML).
"Let's move our internal app with 50 users to k8s in the cloud." --true story
It's a real shame that the collective world of technology does not properly respect the simple solutions that work.
It is almost funny the dichotomy here. Most technological people "admire" the simplicity, elegance and extensibility of the command line. But tell those same people that the best data store for the solution is a relational database and their nose crinkles up.
Every dependency scrutinized and discarded if possible.
I would probably work for free if someone set up their own on-prem cloud in Tanzu, OpenShift, or Rancher and used old school proven frameworks for development.
Working in AWS has been a real shitty experience at these large companies. All the nit picky problems (of which there are thousands) get dumped on devs who are trying to deliver working software.
2 - Caching is life. We have 3 layers of caching: cloudflare, varnish, and redis. Most things don't need to be real time. A lot of things can be a month old and the user doesn't care. Users need immediate feedback to be happy, but not necessarily fresh data.
3 - if you compile nginx manually, you get to use a lot of plugins that can do stuff super fast, including serving videos. You can script stuff in lua that will just skip the backend completely.
4 - mind your encoding. We carefully chose how we encode videos. The ffmpeg parameters are pretty insane, but the space / quality ratio is amazing, especially on mobile. It takes a lot of time to experiment with those, nobody shares them :)
5 - we offload everything we can to cron tasks or task queues. Including, obviously, encoding, screenshooting, etc.
6 - don't hold data you can't lose. E.G: billing. This way you can have a relaxed attitude toward data. If we ever lose a day of business, users will be in a bad mood for a week, but that won't be the end of the world. We don't need a bullet proof system if bullets can't kill us.
7 - give money to ffmpeg and opencv, because damn those things are fast. And good.
8 - servers are hosted across 2 providers. This way, if one goes down, or decides to stop doing business with us Google style, we have a second one. Happened recently with leaseweb: they shut down a whole room without offering an alternative.
E.G: votes.
They don't hit the backend on write. We pile them from nginx to redis, then once a day, we aggregate and store on postgres, which the backends will consume. We just store each vote in localStorage as well so that the user feels like it's real time when they vote, but in reality it's updated once a day. But votes don't affect the money side of our business, so if we lose them one day, it does not mean death.
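For anyone who wants the shape of that write path, here is a rough sketch. It is not the poster's code: a plain in-memory list stands in for the redis buffer and a dict stands in for the postgres table, and all names are hypothetical.

```python
from collections import Counter


class VoteBuffer:
    """Stand-in for the redis list that votes are piled into (assumption)."""

    def __init__(self) -> None:
        self._pending = []

    def record(self, video_id: str, delta: int) -> None:
        # Cheap append on the write path; the database is never touched here.
        self._pending.append((video_id, delta))

    def drain(self):
        # Hand over everything collected so far and start a fresh batch.
        batch, self._pending = self._pending, []
        return batch


def aggregate_daily(buffer: VoteBuffer, totals: dict) -> None:
    """Once-a-day job: fold buffered votes into the stored totals.

    `totals` stands in for the postgres table (assumption).
    """
    sums = Counter()
    for video_id, delta in buffer.drain():
        sums[video_id] += delta
    for video_id, delta in sums.items():
        totals[video_id] = totals.get(video_id, 0) + delta
```

The trade-off is the one described above: writes become a cheap append, totals lag by up to a day, and anything still in the buffer is lost if the buffer dies before the daily job runs.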
P.S: yes, postgres/redis/elasticsearch only hold metadata. Videos are stored on disk. There are no docker images, no microservices, FS is ext4. Which means with a lot of RAM, the OS FS cache will have most popular videos already loaded and ready to be streamed. Everything is raid 0, so if we get one disk corrupted, we lose the server. But we upload each video to several servers, so when a disk gets corrupted, we just replace the whole server. In fact, if anything goes wrong on a server, we replace it. It's not worth it to find the root cause, unless 2 servers die in the same way successively.
Regarding the ffmpeg parameters and formats in general: Do you use newer formats too, like AV1 and the like?
When us-east-1 is sufficiently borked the management API and IAM services in all regions tend to go down with it.
Static infrastructures usually avoid the fallout, but anyone dependent on the API or otherwise dynamically created resources often get caught in the blast regardless of region
Logging in with root credentials was not possible in any region, and even logging in with IAM creds in other regions yielded an intermittently buggy console
and as is usual with us-east-1 outages management API calls were a complete crap shoot regardless of region
The dependency chains can bite you too. During the us-east-1 outage, a Lambda run by cron-like schedules via EventBridge was itself in an okay state, but the EventBridge events that kick it off were stuck in a queue that was released when the problem was fixed. So if your Lambda wasn't idempotent, and you ran it in another region during the outage, you ended up with problems.
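A minimal sketch of the idempotency idea, assuming each scheduled run can be identified by a stable key such as the job name plus the scheduled time. The class and the key format are hypothetical, and the in-memory set stands in for a durable store shared across regions.

```python
class IdempotentHandler:
    """Runs the job at most once per event key, ignoring replays."""

    def __init__(self) -> None:
        # Stand-in for a durable, shared store of processed keys (assumption).
        self._seen = set()
        self.runs = 0

    def handle(self, event_key: str) -> bool:
        """Return True if the job ran, False if this key was already processed."""
        if event_key in self._seen:
            return False
        self._seen.add(event_key)
        self.runs += 1  # placeholder for the real side effect
        return True


handler = IdempotentHandler()
handler.handle("nightly-report/2021-12-22")  # manual run in another region
handler.handle("nightly-report/2021-12-22")  # queued event released after the outage
print(handler.runs)
```

In a real deployment the check-and-mark step has to be atomic in a store that both regions can reach, otherwise two concurrent deliveries can both pass the check. That store is itself a dependency that needs to survive the outage.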
We didn't take any downtime, but if anything had gone wrong there would have been nothing we could do about it until IAM came back up.
Global services such as Route 53, Cognito, the default cloud console and CloudFront are managed out of us-east-1.
If us-east-1 is unavailable, as is commonly the case, and you depend on those systems, you are also down.
it does not matter if you're in timbuktu-1, you are dead in the water.
it is a myth that amazon availability zones are truly independent.
please stop blaming the victim, because you can do everything right and still fail if you are not aware of this; and you are perpetuating that unawareness.
> are not truly independent of each other
Indeed. They are even on the same planet!
> please stop blaming the victim
Excuse me?
> Indeed. They are even on the same planet!
Clever bastard, aren't you.
>> please stop blaming the victim
> Excuse me?
"If you're affected by us-east-1 outages then you're not hosting in other regions and you're doing it wrong".
Except: You can be affected by this outage if you did everything right. You're putting blame on people being down for not being hosted in different regions when it would not help them. You've effectively shifted blame away from Amazon and onto the person who cannot control their uptime by doing what you said.
You are attributing a quote to me which I never expressed, nor was that expressed elsewhere in this thread. You are even using quotation marks....
I certainly didn't mean to blame anyone. You appear to see this AWS issue as one of victims and victimizers. I was just trying to point out an agency that people may have in some situations.
I was just re-wording the sentiment.
Let me quote you properly.
> Also, you can just take two different amazon regions and hope they don't both go down at the same time.
Do you see how replacing that in my comments does not change the sentiment?
You need multiple physical links running in to different ISPs because builders working on properties further down the street could accidentally cut through your fibre. Or the ISP themselves could suffer an outage.
You need a back up generator and to be a short distance away from a petrol station so you can refuel quickly and regularly when suffering from longer durations of power outages. You absolutely do not want to run out of diesel!
You need redundancy of every piece of hardware AND you need to test that failover works as expected because the last thing you need is a core switch to fail and traffic not to route over secondary core switch like expected.
You need multiple air con units, and you need them powered off different mains inputs, so if the electrics fail on one unit it doesn’t take out the others. I guarantee you that if the air cons fail, it will be on the hottest day of the year and no amount of portable units will stop your servers from overheating.
You need beefy UPS with multiple batteries. Ideally multiple UPSs with each UPS powering a different rail on your racks so that if one UPS fails your hardware is still powered from the other rail. And you need to regularly check the battery status and loads on the UPS. Remember that the back up generator takes a second or two to kick in, so you need something to keep the power to the servers and networking hardware uninterrupted. And since all your hardware is powered via the UPS, if that dies you still lose power even if the building is powered.
And you then need to duplicate all of the above in a second location just in case the first location still goes down.
By the way, all of the possible failure points I’ve raised above HAVE failed on me when managing HA on prem.
The reason people move to the cloud for HA is because rolling your own is like rolling your own encryption: it’s hard, error prone, expensive, and even when you have the right people on the team there’s still a good chance you’ll fuck it up. AWS, for all its faults, does make this side of the job easier.
In fact I used to run some hobby projects in OVH (as an aside, I really liked their services) so I’m aware that they have their own failures too.
Colo space assumes that the colo is operating more efficiently than AWS/Azure/GCP when in reality you’re comparing apples and oranges.
financial services, telecom and high-precision manufacturing companies
One of these things is not like the other, one of these things is not the same...What use does a CNC shop have for an extensive on-prem multi-DC with failover and high availability? It'd be like buying your own snowplows to make sure that the road is clear so your employees can get to work. Maybe necessary if you live in a place with very bad snowplows and no existing infrastructure, but in most places, just a waste of money.
Also the reasons those companies usually run their own infra is historically down to legislation more than preference. At least that’s been the case with almost all of the companies I’ve built on prem HA systems for.
At my last job we provided redundant paths (including entry to your building) as an add-on service. So you might not need two ISPs if you're only worried about fiber cuts. You could still be worried about things like "we think all Juniper routers in the world will die at the exact same instant", in which case you need to make sure you pick an ISP that uses Cisco equipment. And of course, it's possible that your ISP pushes a bad route and breaks the entirety of their link to the rest of the Internet.
I don't see why the petrol station needs to be a short distance away. Unless the plan is to walk to the petrol station and back (which should not be the plan[1]), anyplace within reasonable driving distance should do.
[1] long duration electrical outages will often take out everything a short distance away, and the petrol stations usually have electric pumps.
Also buying fuel from a petrol station is going to be more expensive than having a commercial tanker refill it. So ideally you wouldn’t be making large top ups from the local petrol station except under exceptional outages.
As for wider power outages affecting the fuel pumps, I suspect they might have their own generators too. But even if they don’t, outages can still be localised (eg road works accidentally cutting through the mains for that street - I’ve had that happen before too). So there’s still a benefit in having a petrol station near by.
To be clear, I’m not suggesting those petrol stations should be 5 minutes walking distance. Just close enough to drive there and back in under half an hour.
some natural disasters can render driving trickier than walking. extremely large snow storms, for instance. you can still walk a block, but you might be hard pressed to drive 5 miles.
(i don't have a bone in this particular cautiousness-fight; personally i'd just suggest folks producing DR plans cover the relevant natural disasters for the area they live in, while balancing management desires, and a realistic assessment of their own willingness to come to work to execute a DR plan during a natural disaster.)
If you are going to the level of the above, you go with co-location in purpose built centers at a wholesale level. The "layer1" is all done to the specs you state and you don't have to worry about it.
On-prem rarely actually means physically on-prem at any scale beyond a small IT office room. It means co-locating in purpose built datacenters.
I'm sure examples exist, but the days of large corporate datacenters are pretty much long over - just inertia keeping the old ones going before they move to somewhere like Equinix or DRT. With the wholesalers you can basically design things to spec, and they build out 10ksqft 2MW critical load room for you a few months later.
A few organizations will find it worthwhile to continue to build at this scale (e.g. Visa, the government) but it's exceptionally small.
Then you’re not running HA and thus the argument about cloud downtime being “worse” than on prem is moot.
Obviously if your SLA is basically “we will do our best” then there are all sorts of short cuts one can take. ;)
My building has a natural gas backup generator.
I’ve never seen a data center with natural gas backup power. But I don't know if that's because of reliability or if it's too expensive for a big natural gas hookup that's used rarely. Though I have heard of the opposite -- using natural gas turbines as primary power and utility power as backup.
If you're outsourcing that, you'd likely have to pay a boatload just for someone to be available for help, let alone the actual tasks themselves. Like you said, if you're on-prem and something goes down, you can do something. But you've gotta have the personnel to actually do something.
That said, I think you're spot-on as long as you have the skillset already.
I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.
Bare metal infrastructure requires a lot more management at any given scale. I mean, you can run stuff that lets you do part of the management the same as cloud resources, but you also have to then manage that software and manage the hardware.
30 years ago when you talked on-prem that's what this meant. It's now shifted to on-prem meaning your own hardware in massive shared facilities that handle all that "hard stuff" like redundant power and cooling for you.
Bespoke datacenter builds for true-on prem certainly exist, but it's not what that term typically means any longer - at least in my line of business. When I'm selling racks of colo now, my customers are calling that their on-prem facilities.
In fact a large part of my previous business was dismantling true "on-prem" facilities to move to such large shared wholesalers.
We colocate about 20 servers and on the average month, no one spends any time managing them. At all.
If you do move to an established data centre then you’re back to my earlier point that you’re still then dependant on their services instead of having ownership to fix all the problems yourself (which was the original argument the GP made in favour of switching away from the cloud).
Yes, Hetzner upgrades DCs (datacenter buildings), but they are the equivalent to AWS AZs (Availability Zones). When they upgrade a DC, they notify way in advance, and if you set up your services to span multiple DCs as is recommended, it does not affect you.
We run high-availability Ceph, Postgres, and Consul across 3 Hetzner DCs, and have not had a Hetzner-induced service downtime in the 5 years that we have been doing so.
I would never use this as part of the backup and restore plan; but I was lucky when a bunch of customer files were deleted due to a bug in a release. Something like 100k files were deleted from Google Storage without us having a backup. In a panic we contacted GCP. We were able to provide a list of all the file names from our logs. In the end, all but 6 files were recovered.
I think it took around 2-3 days to get all the files restored, which was still a big headache and impactful to people.
If you have to take care of availability, redundancy, delete protection, and backups yourself, then why pay the premium S3 is charging?
Either you don't trust the cloud and you can run NAS or equivalent (with s3 APIs easily today) much cheaper or trust them to keep your data safe and available.
No point in investing in S3 and then doing it again yourself.
I mean that's just obviously wrong, though.
There is a point.
> Either you don't trust the cloud and you can run NAS or equivalent (with s3 APIs easily today) much cheaper or trust them to keep your data safe and available.
What if you trust the cloud 90%, and you trust yourself 90%, and you think it's likely that the failure cases between the two are likely to be independent? Then it seems like the smart decision would be to do both.
Your position is basically arguing that redundant systems are never necessary, because "either you trust A or you trust B, why do both?" If it's absolutely critical that you don't suffer a particular failure, then having redundant systems is very wise.
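To put rough numbers on the "trust the cloud 90%, trust yourself 90%" case: the sketch below assumes the two failure modes really are independent, which is the big "if" in that argument. The function name is made up for illustration.

```python
def combined_survival(p_a: float, p_b: float) -> float:
    """Chance that at least one of two independent copies survives.

    Assumes independence: the data is lost only if BOTH copies fail.
    """
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)


cloud = 0.90        # "trust the cloud 90%"
self_hosted = 0.90  # "trust yourself 90%"
both = combined_survival(cloud, self_hosted)

print(f"either alone: {cloud:.0%}, both together: {both:.0%}")
```

With independent failures the combined figure comes out at roughly 99%. If the failures are correlated (same bug, same credentials, same hardware batch), the gain shrinks accordingly.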
You can argue that you protect against different threats than AWS does. So far I have not seen a meaningful argument for threats that on-prem protects against differently from the cloud, such that you need both.
Say, for example, your solution is to put all your data backups on the moon; then it makes sense to do both, since AWS does not protect against planet-wide threats.
However, if you are both protecting against the exact same risks, provider redundancy only protects against events like AWS going down for days or months, or going bankrupt.
All business decisions carry some risk. Provider redundancy does not seem like a risk worth mitigating, given what it would cost most businesses I have seen.
Even Amazon.com and Google's apps host on their own clouds and do not go multi-cloud. Their regular businesses are much bigger than their cloud businesses, and they still risk those by sticking to their own cloud services only.
But you still have some risks here, yes, with a super low probability, but a company-killing impact.
In some industries - banking, finance, anything regulated, or really (I'd argue) anywhere where losing all of your data is company killing - you will need a disaster recovery strategy in place.
The risks requiring non-AWS backups are things like:
- A failed payment goes unnoticed and AWS locks you out of your AWS account, which also goes unnoticed, and the account and data are deleted
- A bad actor gains access to the root account through faxing Amazon a fake notarized letter, finding a leaked AWS key, or social engineering one of your DevOps team, and encrypts all of your data while removing your AWS-based backups
- An internal bad actor deletes all of your AWS data because they know they're about to be fired
...and so on.
There are so many scenarios that aren't technical which can make a single vendor dependency for your entire business unwise.
A storage array in a separate DC somewhere where your platform can send (and only send! not access or modify) backups of your business critical data ticks off those super low probability but company-killing impact risks.
This is why risk matrices have separate probability and impact sections. Minuscule probability but "the company directors go to jail" impact? Better believe I'm spending some time on that.
Between these two protections, it's pretty hard to lose data from S3 if you really want to keep it. I would guess they are better protections than you could achieve in your own self managed DC.
I'm guessing AWS has some clause in their contract that means they can refuse to deal with you or even return any of your data if they feel like it. Not sure if that's ever happened, but still worth considering it.
For most companies what AWS or Azure offers is more than adequate.
An internal bad actor with that level of privileged access can delete your local backups too, and anything an external one can do to AWS he can likely do more easily to your company's storage DC.
Bottom line: it doesn't matter. If customers can pay to cover all this low-probability stuff that can only happen in the cloud and not on-prem, sure, go ahead. Half the things customers pay for they don't need or use anyway.
[1] assuming your business model allows you to spend the expense outlay you need for the threat model
Also, what we saw on Dec 7th was that the complexity of Amazon's infrastructure introduces risks of downtime that simply cannot be fully mitigated by Amazon, or by any other single provider. More redundancy introduces more complexity at both the micro level and macro level.
It doesn't really cost that much to at least store replicated data in an independent cloud, particularly a low-cost one like Digital Ocean.
Customers don't care if it's your fault or not, they only care that your stuff is broken. That safety blanket of having a vendor to blame for the problem might feel like it'll protect your job but the fact is that there are many points in your career where there is one customer we can't afford to lose for financial or political reasons, and if your lack of pessimistic thinking loses us that customer, then you're boned. You might not be fired, but you'll be at the top of the list for a layoff round (and if the loss was financial, that'll happen).
In IT, we pay someone else to clean our offices and restock supplies because it's not part of our core business. It's fine to let that go. If I work at a hotel or a restaurant, though, 'we' have our own people that clean the buildings and equipment. Because a hotel is a clean, dry building that people rent in increments of 24 hours. Similarly, a restaurant has to build up a core competency in cleanliness or the health department will shut them down. If we violate that social contract, we take it in the teeth, and then people legislate away our opportunities to cut those corners.
For the life of me I can't figure out why IT companies are running to AWS. This is the exact same sort of facilities management problem that physical businesses deal with internally.
I have saved myself and my teams from a few architectural blunders by asking the head of IT or Operations what they think of my solution. Sometimes the answer starts with, "nobody would ever deploy a solution that looked like that". Better to get that feedback in private rather than in a post-mortem or via veto in a launch meeting. But I have had less and less access to that sort of domain knowledge over the last decade, between Cloud Services and centralized, faceless IT at some bigger companies. It's a huge loss of wisdom, and I don't know that the consequences are entirely outweighed by the advantages.
In some orgs, recreating lost data, code, deployment and more is literally hundreds of thousands of hours of work.
In a smaller org, the devastation can be just as stark. Losing hundreds of hours of work can be a death knell.
Anyone advocating placing an entire org's future on one provider is literally, completely incompetent.
It's the equiv of a home user thinking all their baby pics will be safe on google or facebook. It is just plain dumb.
Relevant blog post, https://aws.amazon.com/blogs/aws/preview-aws-backup-adds-sup...
Ah, the Vinnie Boombatz treatment.
Maybe having S3 redundancy wasn't the most important thing to be tackled? Does your company really need that complexity? Are you so big and such an important service that you cannot possibly risk going down or losing data?
In my experience, the kind of person that argues about "arrogant 25 year olds that know everything" is the kind of person that only sees their side of a discussion and refuses to understand the whole context. Maybe OP was in the right, maybe they weren't. But the fact that they are focusing on age and making ad hominem attacks is a red flag in my book.
It's also possible the response was "That's an excellent point! I think we should put that on the backlog. Since this data is already a backup of our DB data, I think we should focus on getting the feature out rather than replicating to GCP."
Those are two plausible conversations. Instead, what we have is "these arrogant 25 year olds that have 1-2 years of experience and know it all." That's a red flag to me.
And this is of course valid reason to ignore basic data preservation approaches.
Myself I am an old fart and I realize that I am too independent / cautious. But I see way too many young programmers who just read sales pitch and honestly believe that once data is on Amazon/Azure/Google it is automatically safe, their apps are automatically scalable, etc. etc.
And again, my point isn't that you never need backups. My point is that it is entirely plausible that at that point in time backups from S3 weren't a priority.
Add object versioning for your bucket (1 click) and mirror/sync your bucket to another bucket (a few more clicks).
Yes, your S3 costs will double, but usually they're peanuts compared to all the other costs, anyway.
Debating it takes longer than configuring it.
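For anyone who would rather script the "versioning plus mirror" setup than click through the console, here is a minimal sketch using boto3. The bucket names and role ARN are placeholders, the rule ID is made up, and none of this has been run against a real account, so treat the replication rule in particular as a starting point to check against the S3 docs. Replication needs versioning enabled on both buckets and an IAM role S3 can assume.

```python
def versioning_request(bucket: str) -> dict:
    """Keyword arguments for s3.put_bucket_versioning."""
    return {"Bucket": bucket, "VersioningConfiguration": {"Status": "Enabled"}}


def replication_request(source_bucket: str, dest_bucket: str, role_arn: str) -> dict:
    """Keyword arguments for s3.put_bucket_replication (mirror everything)."""
    return {
        "Bucket": source_bucket,
        "ReplicationConfiguration": {
            "Role": role_arn,  # placeholder: a role S3 is allowed to assume
            "Rules": [
                {
                    "ID": "mirror-everything",  # hypothetical rule name
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # empty prefix = all objects
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
                }
            ],
        },
    }


def apply(source_bucket: str, dest_bucket: str, role_arn: str) -> None:
    """Send the requests. Needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(**versioning_request(source_bucket))
    s3.put_bucket_versioning(**versioning_request(dest_bucket))
    s3.put_bucket_replication(
        **replication_request(source_bucket, dest_bucket, role_arn)
    )
```

Leaving delete-marker replication disabled is a deliberate choice in this sketch: a delete on the source then does not propagate to the mirror.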
I have my family photos on a RAIDed NAS. It took me years to get that setup simply because there were higher priority things in my life. I never once thought "ahh I don't need backups of our data" I just had more important things to do.
The Azure outage was just AD service but you can roll your own there if you wanted.
Plus if you want to talk about SaaS then OVH et al have their own SaaS too. In fact the difference between OVH and AWS is more about scale than it is about reliability (with AWS you can buy hardware and rack it in AWS just like with OVH too).
Or maybe by “old skool” you mean the few independent hosts that don’t offer SaaS. However they’re usually pretty small fry and thus outages are less likely to be reported. Whereas any AWS service going down is massive news.
I’m not a cloud-fanboy by any means (I actually find AWS the least enjoyable to manage from a purely superficial perspective) but I’ve worked across a number of different hosting providers as well as building out HA systems on prem and the anti-cloud sentiment here really misses the pragmatic reality of things.
I would also argue many aren't even using multiple availability zones, as evidenced by the wide array of problems on the internet when a single AZ goes down.
I think you're vastly over-estimating how most companies are using AWS, and are substituting your own requirements for theirs.
Which is very common in tech. It's part of why people shit on cloud, microservices, and other techniques large mega-corps use on HN. People write posts with lots of assumptions and few details, then people that don't know any better just carbon copy it because hey, it's what Google does. Meanwhile their lambda microservice system serving a blazing 60 requests per minute has more downtime than if I just hosted it on my laptop with my dialup internet connection.
This is a really confusing question. Redundancy requires more than 1 option. It's not about it being better than AWS, it's that in order to have it you need something besides just AWS. AWS may provide redundant drives, but they don't provide a redundant AWS. AWS can protect against many things, but it cannot protect against AWS being unavailable.
This is probably true with Google, but AWS contributes > 50% of Amazon's operating income. [1]
[1] https://www.techradar.com/news/aws-is-now-a-bigger-part-of-a...
Their retail/e-commerce side is less profitable than AWS, but the absolute revenue is still massive, and the risk of losing a chunk of that revenue (and income) due to tech issues is still an enormous risk for Amazon.
You and AWS are using similar chips and similar hard disks, with similar failure rates.
If you both use the same hardware from, say, the same batch, both can have defects and fail at similar times. Or you both use the same file system, which, say, corrupts both of your backups.
90% is not a magic number. You need to know AWS's supply chains and practices thoroughly, and keep yours different enough that you don't share AWS's risks, for your system to have an independent probability of failure.
(Consider a buggy software release which incorrectly deletes a backup. Depending on the bug it’s very possible it will delete in both places.)
Hence why I compare doing HA in AWS correctly vs doing HA on prem correctly.
Sure, your threat model may vary. But relying on cloud only for your backup is simply not enough. If you split access for your AWS backup and your DC backup between two different people, you have mitigated that threat. If you only have 1 backup location, that's going to be very hard.
Every cloud provider has compliance locks that even the root user cannot disable, plus version history, and you can set up your own copy workflow from one storage container to a second one without delete/update access on the second, split between two different people or whatever.
You don't need to do any of it offsite.
Having had to restore databases from tapes and removable drives for a compliance/legal incident, we had a failure rate of >50% on the tapes and about 33% for the removable drives.
I came away not trusting any backup that wasn’t on line.
The extra expense outlay for the 2 additional backups is approximately $50/month, so it's not going to break the bank.
At $50/month scale a lot of things are possible. Most companies cannot store their data in a hard disk in a safe. If you can, then cloud is a convenience not a necessity for you. I.e. you are perfectly fine running your storage stack for the most part.
My company is not very big (100ish employees) and we pay $200k+ for AWS in just storage, and AWS is not even our primary cloud. If we had to do what you have, it would probably be another $500k in bandwidth costs alone. Add running costs in another cloud, recurring bandwidth for transfers, and retrieval from Glacier for older data on top of that.[1]
Over 3 years that would easily be $1-$1.5 million in net new expenses at our scale.
No sane business is going to sign off on +3x storage costs on a risk that cannot be easily modeled[2] and costs that cannot be priced into the product, just so one sysadmin can sleep better at night.
[1] Your hard-disk-in-a-safe third component is not a sensible discussion point at reasonable scale.
[2] this would be probability of data loss with AWS * business cost of losing that data > cost of secondary system.
Or probability of data availability event (like now) * business cost of that > cost of an active secondary system.
For almost no business in the world would either inequality hold.
For example, even if the cost is 100B dollars in revenue, with 6 nines of durability the expected loss would be only about $100,000 (100B * 0.000001); a secondary system is much costlier than that.
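The inequality in [2] is easy to play with. A small sketch, using the figures quoted in this comment (100B at risk, a one-in-a-million loss probability) and, purely as an illustrative stand-in for the secondary system's cost, the $500k bandwidth estimate from above. The function names are made up.

```python
def expected_loss(business_cost: float, probability: float) -> float:
    """Expected loss = probability of the event * business cost of the event."""
    return business_cost * probability


def secondary_is_justified(
    business_cost: float, probability: float, secondary_cost: float
) -> bool:
    """True when the expected loss exceeds what the secondary system costs."""
    return expected_loss(business_cost, probability) > secondary_cost


revenue_at_risk = 100e9   # "100B dollars in revenue"
p_loss = 1e-6             # "6 nines" read as a one-in-a-million loss probability
secondary_cost = 500_000  # assumption: the $500k bandwidth estimate quoted above

loss = expected_loss(revenue_at_risk, p_loss)
print(f"expected loss: ${loss:,.0f}")
print("secondary justified:", secondary_is_justified(revenue_at_risk, p_loss, secondary_cost))
```

For these inputs the product 100B * 0.000001 works out to about $100,000, still well under the assumed cost of a second system. The result flips quickly if the probability is higher than the durability figure suggests, which is what the account-lockout and bad-actor scenarios elsewhere in the thread are about.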
I don't get how this is relevant at all, it's more about how much data your company stores than how many employees it has.
I've worked for a company with 5000 employees that stored less data (fewer data?) than my current employer that has less than 100.
> No sane business is going to sign off on +3x storage costs on a risk that cannot be easily modeled
Probably not, but for us the cost is about 0.1x our aws storage costs, so it's a no-brainer.
> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
So they have 2 different sources of power coming in. And generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while generators spin up if the 2 primaries go out. Or perhaps some failure in the source switching equipment (typically called a "static transfer switch").
Some detail on different approaches: https://www.donwil.com/wp-content/uploads/white-papers/Using...
We're still waiting on the RCA for last week's us-west outage...
Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?
IE why are we (consumers) hearing about this or being obviously impacted (eg Epic Games Store is very broken right now)? Is my assessment wrong, or are all these apps that are failing built wrong? Or something in between?
Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.
It's also unclear to me how often things fail in a way that actually only affect one AZ, but I haven't seen any good statistics either way on that one.
It's turtles all the way down, and underneath all the turtles is us-east-1.
Off the top of my head, US-EAST-1 is:
(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),
(2) consistently in the first set of regions to get new features,
(3) usually in the lowest price tier for features whose pricing varies by region,
(4) where certain global (notionally region agnostic) services are effectively hosted and certain interactions with them in region-specific services need to be done.
#4 is a unique feature of US-East-1, #2-#3 are factors in region selection that can also favor other regions, e.g., for users in the West US, US-West-2 beats US-West-1 on them, and is why some users topologically closer to US-West-1 favor US-West-2.
I'm not sure why this is a big deal though, this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.
IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount) but being honest this wasn't that bad.
I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.
Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.
1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source meaning that we can lose a single power supply/circuit and nothing happens 2. Dual network (at minimum) - For the same reasons as above since the switches didn't always have dual power in them.
I've only had a DC fail once when the engineer was performing work on the power circuitry for the DC and thought he was taking down one, but was in fact the wrong one and took both power circuits down at the same time.
However, a power cut (in the traditional sense where the supplier has a failure so nothing comes in over the wire) should have literally zero effect!
What am I missing?
I've never worked anywhere with Amazon's budget so why are they not handling this? Is it more than just the incoming supply being down?
edit: The question isn't necessarily AWS specific, just any data on amount of downtime per cloud provider on a timeline would be nice.
There indeed has been an uptick in AWS outages recently. You can see a bit of the history here: https://statusgator.com/services/amazon-web-services
If you can’t even admit you’re having an issue how can you keep an accurate record?
Obviously it’s not good for an AZ to go down, but it does happen, which is why any production workload should be architected to fail over and recover seamlessly to other AZs, typically by just dropping nodes in the down AZ.
People commenting that servers shouldn’t go down etc. don’t understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it’s like complaining that you lost data because a disk failed. Disks fail… build architecture where that won’t take you down.
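"Just dropping nodes in the down AZ" can be sketched in a few lines. This is a toy model, not any particular load balancer's API: the Node type, the node names and the AZ IDs other than use1-az4 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    az: str


def routable_nodes(nodes: list[Node], down_azs: set[str]) -> list[Node]:
    """Keep only nodes whose AZ is not currently marked as down."""
    return [n for n in nodes if n.az not in down_azs]


def survives_any_single_az_loss(nodes: list[Node]) -> bool:
    """True if at least one node is left after losing any one AZ."""
    azs = {n.az for n in nodes}
    return bool(azs) and all(routable_nodes(nodes, {az}) for az in azs)


fleet = [
    Node("web-1", "use1-az4"),
    Node("web-2", "use1-az1"),
    Node("web-3", "use1-az2"),
]

remaining = routable_nodes(fleet, {"use1-az4"})
print([n.name for n in remaining])
print(survives_any_single_az_loss(fleet))
```

The check only says that something is left to route to. Whether the remaining nodes can carry the full load is a separate capacity question.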
“ditch your on-prem infrastructure and migrate to a major cloud provider”
And it's starting to seem like it could be something like:
“ditch your on-prem infrastructure and spin up your own managed cloud”
This is probably untenable for larger orgs where convenience gets the blank check treatment, but for smaller operations that can’t realize that value at scale and are spooked by these outages, what are the alternatives?
Edit: Restarting Slack does update the edited messages.
Edit 15:24 CET: Slack is back up.
- edits failing or working with big lag;
- "Threads" view slow;
- can't emoji-react;
- can't upload images;
- people also say they can't join new channels.
> We are experiencing issues with file uploads, message editing, and other services. We're currently investigating the issue and will provide a status update once we have more information.
> Dec 22, 1:58 PM GMT+1
remote: Compressing source files... done.
remote: Building source:
remote:
remote: ! Heroku Git error, please try again shortly.
remote: ! See http://status.heroku.com for current Heroku platform status.
remote: ! If the problem persists, please open a ticket
remote: ! on https://help.heroku.com/tickets/new
Another thread: https://news.ycombinator.com/item?id=29648325
In all seriousness though - even non-regional AWS services seem to have ties to us-east-1 as evidenced by the recent outages. So you might be impacted even if it looks like (on paper at least) you’re not using any services tied to that region.
The console is throwing errors from time to time. As usual no information on AWS status page.
aws ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | select(.ZoneId == "use1-az4") | .ZoneName'
The letters are randomised per AWS account so that instances are spread evenly and biases to certain letters don't lead to biases to certain zones.
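If you don't have jq handy, the same lookup is a few lines of Python over the JSON that `aws ec2 describe-availability-zones` prints. The sample data below is made up (the letter-to-ID mapping differs per account); only the `AvailabilityZones`, `ZoneId` and `ZoneName` keys come from the command above.

```python
import json
from typing import Optional


def zone_name_for_id(describe_output: dict, zone_id: str) -> Optional[str]:
    """Return the account-specific zone name (e.g. us-east-1c) for a zone ID."""
    for zone in describe_output.get("AvailabilityZones", []):
        if zone.get("ZoneId") == zone_id:
            return zone.get("ZoneName")
    return None


# Hypothetical output for one account; your mapping will differ.
sample = json.loads("""
{
  "AvailabilityZones": [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az2"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az6"},
    {"ZoneName": "us-east-1c", "ZoneId": "use1-az4"}
  ]
}
""")

print(zone_name_for_id(sample, "use1-az4"))
```

In practice you would pipe the real command output into `json.load(sys.stdin)` instead of using the sample.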
I am not sure if the movement to the cloud has reduced the number of failures, but it definitely has made these failures more catastrophic.
Our profession is busy making the world less reliable and more fragile; we will have our reckoning just like the shipping industry did.
Today, on Slack, I could not edit messages, could not edit statuses and could not post attachments. Pretty annoying!
Packaging a way to migrate off AWS could be a unicorn idea.
I mean I'm glad it exists, don't get me wrong. Just weird that they'd have two status pages, one seemingly existing only to sort of 'mock' themselves...
Apparently it does some simple transformations of the actual status page, which is why the Amazon copyright stuff is in there.
[05:01 AM PST] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
Like how come down detector can do a superb job of detecting when AWS goes down and AWS can't? Because AWS doesn't want account managers of SLAs asking for credits for the uptime they're paying for but not getting.
That, or people never took the “if AWS goes down then lots of people will have a problem, so we’ll be fine” line seriously; there are few such cases.
Edit: Not supporting Amazon, I generally dislike the company. I just don't understand the extent to which the criticism is justified.
1. Did AMZN build an appropriate architecture?
2. Did AMZN properly represent that architecture in both documentation and sales efforts?
3. What the heck is going on with AMZN?
Let's say that they build an environment in which power is not fully redundant and tested at the rack level, but is fully redundant and tested across multiple availability zones. Did they then issue statements of reliability to their prospective and existing customers saying that a single availability zone does not have redundant power, and customers must duplicate functionality in at least 2 AZs to survive a SPOF?
Nothing happens if you remember that your new capacity limit per DC supply is 50% of the actual limit, and you're 100% confident that either of your supplies can seamlessly handle their load suddenly increasing by 100%.
I've seen more than one failure in a DC where they wired it up as you described, had a whole power side fail, followed by the other side promptly also failing because it couldn't handle the sudden new load placed on it.
Normally this is factored into the Rack you buy from a hardware provider, they will tell you that you have 10A or 16A on each feed, if you exceed that: it will work, but you are overloading their feed and they might complain about it.
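The 50% rule is simple enough to check mechanically. A minimal sketch, assuming two feeds with the same rating and that the surviving feed picks up the whole load when the other one drops; the function name and the example loads are made up, the 16A rating is one of the figures mentioned above.

```python
def survives_feed_loss(load_a_amps: float, load_b_amps: float, feed_rating_amps: float) -> bool:
    """True if one feed alone can carry the combined load of both feeds."""
    return load_a_amps + load_b_amps <= feed_rating_amps


# A rack on two 16A feeds, drawing 7A from each: one feed can carry all 14A.
print(survives_feed_loss(7, 7, 16))

# Same rack drawing 10A from each: fine day to day, but losing a feed
# would push 20A onto a 16A circuit.
print(survives_feed_loss(10, 10, 16))
```

The second case is the failure described above: everything works until one side drops, then the other side trips on the sudden extra load.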
This is all local scale. Your setup would not survive a data center scale power outage. At scale power outages are datacenter scale.
Data centers lose supply lines. They lose transformers. Sometimes they lose primary feed and secondary feed at the same time. Automatic transfer switches cannot be tested periodically i.e. they are typically tested once. Testing them is not "fire up a generator and see if we can draw from it"
It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations.
And to top it off each rack had its own smaller UPS at the bottom and top, fed off both rails, and each server was fed from both.
We never had a power issue there; in fact SDGE would ask them to throw to the generators during potential brown-out conditions.
Of course this was a datacenter that was a former General Atomics setup iirc ...
Also, there are banks of batteries and generators in between the power company cables and the kit: did they not kick-in?
Again, this is all pure speculation: I have absolutely no idea of the exact failure, nor how their infrastructure is held together - this is all just speculation for the hell of it :)
Citation needed - the same issue with testing, data races and expensive bandwidth come up.
For big DC workloads, it is usually, though not always, better to take the higher failure rate than add redundancy.
Actually, now that I type that it makes sense. Scaling a few tens of dollars to a bajillion servers on the off-chance that you get an inbound power failure (quite rare I'd reckon) might cost more than what they'd lose if it does actually fail.
So yeah, they're potentially just balancing the risk here and minimising cost on the hardware.
Edit: changed grammar a bit.
Perhaps we are going to discover how AWS produces such lofty margins by way of their next RCA publication.
My guess is that they cheaped out in having redundant PSUs to get you to use multiple availability zones. (More zones = more revenue)
Even a single PSU shouldn’t be an issue if they plugged in an ATS switch though.
Somewhat surprising to see how many things are failing though, which implies, either that a lot of services aren't able to fail-over to a different availability zone, or there is something else going wrong.
Specifically, large parts of the management API and the IAM service are seemingly centrally hosted in us-east-1. So-called global endpoints are also dependent on us-east-1, as are parts of AWS' internal event queues (e.g. EventBridge triggers).
If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region
One incident I recall involved our GCP regional storage buckets, which we were using to achieve multi-region redundancy. One day, both regions went down simultaneously. Google told us that the data was safe but the control plane and API for the service is global. Now I always wonder when I read about MR what that actually means...
Your point here deserves highlighting. A failure such as a zone failing is nowadays a relatively simple problem to have. But cloud services do have bugs, internal limits or partial failures that are much more complex. They often require support assistance, which is where the expertise of their staff comes into play. Having a single provider that you know well and trust is better than having multiple providers where you need to keep track of disparate issues.
You really need multi-region and also not be relying on any AWS service that’s located only in us-east-1 (including everything from creating new S3 buckets to IAM’s STS).
Isn't that because Elasticache will distribute the cluster across AZs automatically?
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/...
Load balancers are not doing well at all. The only way in this case to avoid an outage is to be cross-region or cross-cloud, which is considerably more complex to handle and requires more resources to do well.
And I hope that nobody is listening to your advice about blaming and pointing fingers; that's the worst way to solve anything.
It's AWS's job to ensure that things are reliable, that there is redundancy and that multi-AZ infra should be safe enough. The amount of issues in US-EAST-1 lately is really worrying.
I do agree that the end of this year has been a very bad period for AWS. I wonder whether there’s a connection to the pandemic conditions and the current job market – it feels like a lot of teams are running without much slack capacity, which could lead to both mistakes and longer recovery times.
In the past I've seen both of those systems seamlessly handle an AZ failure. Today was different.
Is that comparison fair? If you have 2 mirrored RAID-5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen. These entire datacenter failures should be anticipated, but to expect them is a bit too easy I think. There are plenty of hosters who haven't had this happen even once in the last decade in their only datacenter. I do not find it strange to expect or even demand that level, while still protecting yourself in case it happens anyway, if that fits your specific project and budget.
Edit: OK, I meant that RAID-5 remark in the same context as the hosting; it can and does happen but it shouldn't; you should plan for a contingency, but 'expect' goes too far. We never had it (1000s of hard drives, decades of hosting, millions of sites) and so we plan for it with backups; if it happens it will take some downtime but it costs next to nothing over time to do that. If we expected it, we would need to take far different measures. And we had less downtime in a decade than an AWS AZ had in the past months. I have a problem with the word 'expect'.
There are plenty of situations where this might happen if they’re in your room: a lightning strike can cause a surge that causes the disks to fry, a thief might break in and steal your system, your house might burn down, an earthquake could cause your disks to crash, a flood could destroy the machines, and a sinkhole could open up and swallow your house. You may laugh at some of these as being improbable, but I have seen _all_ of these take out systems between my times in Florida (lightning, thief, sinkhole, and flood) and California (earthquake and house fire).
The fix for this is the same fix as being proposed by the parent post - putting physical space between the two systems so if one place becomes unavailable you still have a backup.
Here are some examples where that happened:
1. Drive manufacturer had a hardware issue affecting a certain production batch, causing failures pretty reliably after a certain number of power-on hours. A friend learned the hard way that his request to have mixed drives in his RAID array wasn’t followed.
2. AC issues showed a problem with airflow, causing one row to get enough warmer that faults were happening faster than the RAID rebuild time.
3. UPS took out a couple racks by cycling power off and on repeatedly until the hardware failed.
No, these aren’t common but they were very hard to recover from because even if some of the drives were usable you couldn’t trust them. One interesting dynamic of the public clouds is that you tend to have better bounds on the maximum outage duration, which is an interesting trade-off compared to several incidents I’ve seen where the downtime stretched into weeks due to replacement delays or manual rebuild processes.
Same manufacturer, same disk space, same location, same operator, same maintenance schedule, same legal jurisdiction, same planet, you name it, and there's a common failure to match
HA! I had received new 16-bay chassis and all of the drives needed plus cold spares for each chassis. Set them up and started the RAID-5 init on a Friday. Left them running in the rack over the weekend. Returned on Monday to find multiple drives in each chassis had failed. Even with one of the 16 drives dedicated as a hot spare, the volumes would all have failed in an unrecoverable manner.
All drives were purchased at the same time, and happened to all come from a single batch from the manufacturer. The manufacturer confirmed this via serial numbers, and admitted they had an issue during production. All drives were replaced and at a larger volume size.
TL;DR: Drives will fail, and manufacturing issues happen. Don't buy all of your drives in an array from the same batch! It will happen. To say it won't is just pure inexperience.
The context of the parent seems to be that they intermittently couldn't get to the console. That seems fair to me. If we're blaming developers and finding gaps in HA design, then AWS should also figure out how to make the console url resilient. If it's not, then AWS does appear to be down.
I imagine it's pretty hard to design around these failures, because it's not always clear what to do. You would think, for example, that load balancers would work properly during this outage. They aren't. Or that you could deploy an Elasticache cluster to the remaining AZs. You can't. And I imagine the problems vary based on the AWS outage type.
Similarly, with the earlier broad us-east-1 outage, you couldn't update Route53 records. I don't think that was known beforehand by everyone that uses AWS. You can imagine changing DNS records might be useful during an outage.
As others have said, they are not being forthright about the severity of the issue, as is standard.
edit: of course, AWS does have this: AWS Fault Injection Simulator
Be in multiple AZs, and even multiple regions but if you're going to be in only one AZ or one region, make it us-east-2.
I have a server at OVH (not affiliated to them) which, at this point, I keep only for fun. It has 3162 days of uptime as I type this.
3 162 days. That's 8 years+ of uptime.
Does it have the traffic of Amazon? No.
Is it secure? Very likely not: it's running an old Debian version (Debian 7, which came out in, well, 2013).
It only has one port opened though, SSH. And with quite a hardened SSH setup at that.
I installed all the security patches I could install without rebooting it (so, yes, I know, this means I didn't install all the security patches for some required rebooting).
This server is, by now, a statement. It's about how stable Linux can be. It's about how amazingly stable Debian is. It's also about OVH: at times they had part of their datacenter burn (yup), at times they had full racks that had to be moved/disconnected. But somehow my server never got affected. It may have happened that at one point OVH had connectivity issues, but my server never went down.
I "gave back" many of my servers I didn't need anymore. But this one I keep just because...
I still use it, but only as an additional online/off-site backup where I send encrypted backups. It's not as if it gets zero use: I typically push backups to it daily.
They're only backups, they're encrypted. Even if my server is "owned" by some bad guys, the damage they could do is limited. Never seen anything suspicious on it though.
I like to do "silly" stuff like that. Like that one time I solved LCS35 by computing for about four years on commodity hardware at home.
I think it's about time I start to do some archeology on that server, to see what I can find. Apparently I installed Debian 7 on it in mid-October 2013.
I've created a temporary user account on it, and at times I've handed the password (before resetting it) to people just so they could SSH in and type: "uptime".
It is a thing of beauty.
Eight. Years. Of. Uptime.
Awesome! Are you Bernard Fabrot [0]?
[0] https://www.csail.mit.edu/news/programmers-solve-mits-20-yea...
At its current use, it's likely not a major issue but imagine if someone saw this uptime and thought to take it as a statement of reliability and built a service on it. I for one, would want that disclosed because this is a disaster waiting to happen. I'd much rather someone disclose that they had a few servers each with no longer than 7 days of uptime because they'd been fully imaged and cycled in that time...
Similarly, my laptop, if I keep it plugged in the wall, and enable httpd on localhost, will surely have better uptime than any of the top clouds. I'd bet that it'd have 100% uptime if I plugged in a UPS and cared for traffic on my local network only.
In reality, most people don't need to scale. An occasional spike in traffic is a nuisance, but not the end of the world, and security is not terribly hard, if you keep your servers patched (which is trivial to automate).
I really don't understand why there's so much FUD around running your own stuff.
Of course it doesn't. Why are you asking antagonistic questions?
Better uptime than paying for EC2 on AWS US-East-1.
Obviously this approach isn't scalable but it serves me well.
A much faster and more effective solution that doesn't have you trading cloud problems with on-prem problems (the power outage still happens, except now it's your team that has to handle it) would be to update your services to run in multiple AZs and multiple regions.
Get out of AWS if you want, but don't get out of AWS because of outages. You should be able to mitigate this relatively easily.
Everything fails, we can argue the rate. But I would argue that understanding your constraints is better.
If you know that your secret storage system can't survive if a machine goes away: well, you wire redundant paths to the hardware and do memory mirroring and RAID the hell out of the disks. And if it fails you have a standby in place.
But if you use AWS Cognito.
And it goes down.
You're fucked mate.
I remember we had a power outage in 2006, it actually took one of my services off air. Since then of course that has been rectified, and the loss of a building wouldn't impact on any of the critical, essential or important services I provide.
Commenters will show up like clockwork and say shit like:
“What man, it’s not like cars didn’t crash before? Haha”
Don’t be dense dude. And definitely don’t pursue a leadership position anytime in the future.
If your app depends on a few 3rd party services -- SendGrid, Twilio, Okta -- and they're all hosted on different infra then congrats! You're gonna have issues when any one of them is down, yayyy.
Also the marketing benefit can't be downplayed. If your postmortem is "AWS was having issues" then your execs and customers just accept that as the cost of doing business because there's a built-in assumption that AWS, Azure, GCP are world class and any in-house team couldn't do it better.
He also has a decent newsletter and witty commentary, for all things AWS.
[0] https://twitter.com/quinnypig/status/1468331194471178241?s=2...
Interesting quote:
“This is exactly the sort of design that lets me sleep like a baby,” said DeSantis. “And indeed, this new design is getting even better availability” – better than “seven nines” or 99.99999 percent uptime, DeSantis said.
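For scale, here is a quick back-of-the-envelope conversion of "nines" into a yearly downtime budget. This is only a sketch: the function name is made up for illustration, it assumes a 365-day year, and it treats availability as a plain fraction of wall-clock time, which is not necessarily how AWS measures it.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    Example: nines=3 means 99.9% availability, i.e. 0.1% of the year
    may be spent down.
    """
    return SECONDS_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"{n} nines: {allowed_downtime_seconds(n):.2f} seconds per year")
```

Under those assumptions, seven nines works out to roughly three seconds of downtime per year.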
It's turtles all the way down.
For example, too often people will set up clustered databases and whatnot because "they need HA" without much thought about all the other potential effects of using a cluster, such as much more complicated recovery scenarios.
In the vast majority of cases, an active-passive replicated database with manual failover is likely to have fewer pitfalls and gives you the same operational HA a clustered database would, even though in the case of a (rare) real failure it would not automatically recover like a cluster might.
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#...:
(Edit: I hope I didn't sound sarcastic. I don't open random console pages and scroll all the way down to check for new features. Some people will have noticed, some won't.)
On our side we saw some EC2 VM totally disconnected from the network in 3 AZs.
More modular and a lot less copper at 10x the voltage. Still a lot of copper.
This is what happens when we've been affected by outages (even without involving support).
The problem with both this example, and the AWS one (it needs to have better availability than your personal home-spun solution, and it does), is that people are amazing at deluding themselves.
"Yes, cars are dangerous, because other people can't drive. But I'm a better than average driver"
"Yes, other people will build unreliable systems. But I know how to architect for my use case and ensure that for my needs the availability will be higher than AWS's"
Both are true* in the micro sense and false in the macro sense.
* Not really. 88% of americans think they are "above average" drivers.
In my experience, execs and customers don't treat an outage differently because AWS is at fault. Though the developers do often have the attitude that it's "someone else's problem", which can actually make execs more worried than if the problem was well known and under the company's control.
It's perfectly scalable. Just give everybody their own home server.
That's ATS. It is not really advisable to test their performance under load, because the failure of an ATS would be catastrophic. An ATS would typically be tested at installation, and after that its parameters would be monitored.
Replacing a functional in line ATS would be a 9-12 months long project.
> Also, there are banks of batteries and generators in between the power company cables and the kit: did they not kick-in?
At high energy you are pretty much always going to use an ATS.
Because that would mean no power at all to the DC and no way to get it back? (I am completely ignorant on this topic)
While most of smarts in the ATS are in the electronics, the really nasty failures come from the mechanical part.
At the end of the day a high energy ATS looks just like a switch behind a meter in your house. There's a lip that goes from one position to another, except in a high energy ATS the lip is big and when the transfer occurs it slams from one source to another.
There are only so many of those physical slams that it can withstand to begin with, so you want to minimize that number.
The second failure mode is that after transfer to the non-main source, the lip can get stuck there, making it impossible to switch back to the main. [Once I saw the lip melt into the secondary position. While I thought it was weird, the guys from the power company said it is not that uncommon.] This creates a massive problem as the non-main source is typically not designed for long-term 24x7 operation. So now you are stuck on a secondary feeding system and you can't just transfer to main without de-energizing the system, i.e. taking the power out of the entire data center.
I've had bad power supplies fry out taking the whole power circuit with it, and thus half (or whatever fraction) of the rack's power. I've also had bad power supplies bring down the whole machine as they shunted everything internal too.
When things go bad, anything can happen. You can provide the best effort, and it'll usually work as expected, but there will always be something that can overcome your best efforts.
Put another way: even if your home ISP has had 100% uptime, are you comfortable saying that was true for all of their customers?
Software is much easier than hardware. If you are to start a project today in this kind of hardware, you will be operating it in 2029, without changes.
The thing is, us-east-1 represents the whole AWS for the majority of us.
Your question reads as a strawman. It matters nothing if EC2 is also available in Mumbai or Hong Kong if by default the whole world deploys everything and anything to us-east-1, and us-east-1 alone.
https://www.reddit.com/r/aws/comments/nztxa5/why_useast1_reg...
The fewer API calls you need to make in-band with whatever throughput is generated via your customer demand, the better. Related to that, I have been critical of lambda/FaaS/serverless infrastructure patterns for similar reasons. Always felt like a brittle house of cards to me (N.B. I do still use aws lambda, but keep it constrained to non-critical workloads).
Agreed; however, this is somewhat difficult to do correctly. There are all sorts of systems that might have hidden dependencies on managed services. e.g. AWS IAM roles will almost always be checked at some point if your services need to interact with AWS managed services.
I think cloud providers could meet developers half way here, by providing ways to reduce API usage; but I'm not sure if it aligns with their incentives.
It's like the duality of modular code. If you want to manage one change in a lot of places, it's easiest to change it in the one module that everything else sources. But that means that one change to that module can take down everything. The alternative where you copy+paste the same change everywhere is the most resilient to failure, but also the most difficult and expensive.
AWS provides a lot of modular, dynamic things because that's what their customers want to use. But using each of those things increases the probability of failure. It's up to the customer to decide how they want to design their system using the components available.... and the customers always chose the easy path rather than the resilient path.
The great thing is that with AWS, at least you have the option to design a super freaking reliable system. But ultimately there's no way to make it easy, short of a sort of "Heroku for super reliable systems". (I know there are a few, but I don't know anything about them)
Obviously people can operate things however they want, but you won't get a tier 3 classification with that setup.
But, point taken: yes your power feed should be running at <50%. But that just means you treat 50% as 100% just like any resource.
Mostly this is outsourced to the datacenter provider; they'll give you a per-side rating (usually 10A or 16A), which also matches the cooling profile of the cabinet.
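The "treat 50% as 100%" rule can be written down as a one-line check. This is a rough sketch, assuming a dual-fed cabinet where both sides have the same rating and the surviving side must carry the whole load if the other fails; the function name and the example currents are invented for illustration.

```python
def survives_feed_loss(load_a_amps: float, load_b_amps: float,
                       per_side_rating_amps: float) -> bool:
    """True if either feed alone could carry the whole cabinet.

    Assumes both sides share the same rating and that all load moves to
    the surviving side when one feed is lost.
    """
    return load_a_amps + load_b_amps <= per_side_rating_amps


if __name__ == "__main__":
    # 7A per side on 16A feeds: each side is under 50%, so one side can take it all.
    print(survives_feed_loss(7, 7, 16))
    # 9A per side on 16A feeds: fine day to day, but 18A will not fit on one 16A feed.
    print(survives_feed_loss(9, 9, 16))
```

Each side looks comfortable at 9A of 16A, which is exactly why the budget has to be done against the combined load rather than per feed.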
However, with their comment DC == Data Center, not Direct Current.
For example, in railway signaling, drivers are trained to interpret a signal with no light as the most restrictive aspect (e.g. "danger"). That way, any failure of a bulb in a colored light signal, or a failure of the signal as a whole, results in a safe outcome (albeit that the train might be delayed while the driver calls up the signaler).
Or, in another example from the railways, the air brake system on a train is configured such that a loss of air pressure causes emergency brake activation.
Fail-safe doesn't mean "able to continue operation in the presence of failures"; it means "systematically safe in the presence of failure".
Systems which require "liveness" (e.g. fly-by-wire for a relaxed stability aircraft) need different safety mechanisms because failure of the control law is never safe.
And even then, you still need to define "safe". Imagine a lock powered by an electromagnet. What happens if you lose power?
The safety-first approach is almost always for the unpowered lock to default to the open state — allow people to escape in case of emergency.
Conversely, the security-first approach is to keep the door locked — nothing goes in or out until the situation is under control.
A more complex solution is to design the lock to be bistable. During operating hours when the door is unlocked, failure keeps it unlocked. Outside operating hours, when the door is set to locked, it stays locked.
The common factor with all these scenarios is that you have a failure mode (power outage), and a design for how the system ensures a reasonable outcome in the face of said failure.
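The three lock designs can be sketched as explicit policies for what happens on power loss. The enum and function names below are hypothetical, chosen only to mirror the wording above; a real access-control system would have more states and inputs.

```python
from enum import Enum


class LockState(Enum):
    LOCKED = "locked"
    UNLOCKED = "unlocked"


class PowerLossPolicy(Enum):
    SAFETY_FIRST = "safety_first"      # open on failure so people can escape
    SECURITY_FIRST = "security_first"  # stay locked until things are under control
    BISTABLE = "bistable"              # keep whatever state was last set


def state_after_power_loss(policy: PowerLossPolicy,
                           state_before: LockState) -> LockState:
    """Return the lock state once power is lost, under the given design."""
    if policy is PowerLossPolicy.SAFETY_FIRST:
        return LockState.UNLOCKED
    if policy is PowerLossPolicy.SECURITY_FIRST:
        return LockState.LOCKED
    # Bistable: the failure does not change the state at all.
    return state_before


if __name__ == "__main__":
    for policy in PowerLossPolicy:
        for before in LockState:
            after = state_after_power_loss(policy, before)
            print(f"{policy.value:>14}: {before.value} -> {after.value}")
```

The point is that each policy is a deliberate answer to "what happens when power fails", not an accident of the hardware.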
"fail-safe" doesn't mean "doesn't fail", it means that the failure mode chooses false negatives or false positives (depending on the context) to be on the safe side.
Or you ask if it's a lesson about how real systems operate? Because yes, it's a very serious lesson about how real systems operate.
Anyway, you seem out of grasp on system engineering. Your reply downthread isn't applicable (of course fail-safes can fail, anything can fail). If you want to learn more about this area (not everybody wants to, and it's OK), following that link of system theory books on the wiki may be a good idea. Or maybe start at the root:
https://en.wikipedia.org/wiki/Systems_theory
Notice that there is a huge amount of handwaving in system engineering. I don't think this is good, but I don't think it's avoidable either.
In my experience, you can be specific, but then you get the problem that people think that if they just 'what if' a narrow solution to the particular problem you're presenting they've invalidated the example, when the point was 1. that this is a representative problem, not this specific problem and 2. in real life you don't get a big arrow pointing at the exact problem 3. in real life you don't have one of these problems, your entire system is made out of these problems, because you can't help but have them, and 4. availability bias: the fact that I'm pointing an arrow at this problem for demonstration purposes makes it very easy to see, but in real life, you wouldn't have a guarantee that the problem you see is the most important one.
There's a certain mindset that can only be acquired through experience. Then you can talk systems engineering to other systems engineers and it makes sense. But prior to that it just sounds like people making excuses or telling silly stories or something.
"(of course fail-safes can fail, anything can fail)"
Another way to think of it is the correlation between failure. In principle, you want all your failures to be uncorrelated, so you can do analysis assuming they're all independent events, which means you can use high school statistics on them. Unfortunately, in real life there's a long tail (but a completely real tail) of correlation you can't get rid of. If nothing else, things are physically correlated by virtue of existing in the same physical location... if a server catches fire, you're going to experience all sorts of highly correlated failures in that location. And "just don't let things catch fire" isn't terribly practical, unfortunately.
Which reiterates the theme that in real life, you generally have very incomplete data to be operating on. I don't have a machine that I can take into my data center and point at my servers and get a "fire will start in this server in 89 hours" readout. I don't get a heads up that the world's largest DDOS is about to be fired at my system in ten minutes. I don't get a heads up that a catastrophic security vulnerability is about to come out in the largest logging library for the largest language and I'm going to have a never-before-seen random rolling restart on half the services in my company with who knows what consequences. All the little sample problems I can give in order to demonstrate systems engineering problems imply a degree of visibility you don't get in real life.
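The correlation point can be made concrete with a toy model. This is a sketch with made-up numbers: two replicas each fail on their own with probability p, and a shared event (a fire, a bad drive batch) with probability c takes out both at once. The function names are invented for illustration.

```python
def p_both_fail_independent(p: float) -> float:
    """Probability that two replicas both fail if failures are independent."""
    return p * p


def p_both_fail_with_common_cause(p: float, c: float) -> float:
    """Same, but a shared event with probability c takes out both replicas.

    Either the common cause happens (c), or it does not (1 - c) and both
    replicas fail on their own (p * p).
    """
    return c + (1 - c) * p * p


if __name__ == "__main__":
    p, c = 0.01, 0.001  # illustrative values only, not measured data
    print(f"independent:       {p_both_fail_independent(p):.7f}")
    print(f"with common cause: {p_both_fail_with_common_cause(p, c):.7f}")
```

With these example values, even a common cause ten times rarer than an individual failure makes losing both replicas about eleven times more likely than the independent estimate suggests.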
It conveys reality, that "fail-safe" isn't literal, as if anyone believed that.
I don't think this makes sense, you are using the three statements "Software is cheaper", "Software takes less time" and "software is easier" as if they all mean the same thing, and proving one means proving all of them.
Hardware takes a long time, okay, that does not mean it's expensive. Building a hydroelectric dam takes 20 years, but it provides the cheapest source of electricity that ever existed. Ships can take a decade from order to delivery, they are the cheapest mode of transport.
When your server requirements get into needing 5-6 servers (not at all atypical for a startup in their first year of being launched), running your own stuff becomes more of a challenge pretty quickly. Factor in 2-3x growth a year, and the challenges just mount.
What challenges are you thinking of? You buy a full-rack in colocation and then just buy servers/hardware when required.
If a company has the budget for AWS or some other cloud provider then they would have a budget for colocation; which in long term is cheaper. I see no additional challenge other than maintaining X amount of hardware than just one.
Buying hardware upfront is not feasible even if I had the cash (which most don't); I don't know if the company would last that long or would be doing things that require x servers.
What you are saying is similar to saying maybe it is cheaper to buy the building/floor instead of renting office space. Most small businesses cannot afford to do that, or expect their business to change (fail or take off) before the ROI would justify that commitment.
This is all assuming that the startup has skill in setting up and managing physical servers and that there are no opportunity costs (delayed features) in doing so; neither is a given.
Small companies (and poor people) typically don't buy low-quality stuff or buy into rent-seeking business models because they are dumb; it is usually because they cannot afford to do long-term thinking.
This comment just screams of engineer-only focus. Running servers yourself brings almost no value to the customer and is a specialized skill that you have to pay for. All to solve... a couple days of service-specific downtime a year? People need to chill out with their non-mission critical software. God forbid someone can't access their HR portal for an hour.
I don't think people care that AWS has other customers, they want their workload to work, if it doesn't: then that's a today issue.
Is this not antagonistic? It's pointless to make these statements, so your parent comment pointed it out. Go downvote the first one instead.
Sometimes this _can_ be costly. For example with something like autoscaling, that's an active system I've seen fail when seemingly unrelated systems are failing. The result is scaling out systems intentionally ahead of time to deal with oversubscription or burst traffic, which can leave you with (costly) idle compute.
I don't mind this tradeoff personally, but can understand that budget constraints are going to be different org to org.
I've managed to stick to EC2/ELB and S3 as passive systems for the vast majority of what we build at my org (~90% of our stack). And for the most part, AWS failures are hitless for us as a result.
DC = Datacenter? makes no sense, so my head replaced it with "Power Supply" instead of "DC Supply", second sentence does make sense as being datacenter though.
Deploying a service to a single region is not, nor has it ever been, "customers don't know how to use AWS".
If anything, cargo culting this belief in global deployments being necessary, especially with services that have at most a regional demand, is a telltale sign a customer has no idea about what he is doing and is just mindlessly wasting money and engineering effort on something no one needs.
This blend of bad cargo cult advice sounds like a variant of microservices everywhere.
* Negative temperature coefficient of reactivity: as temperature increases, the neutron flux is reduced, which both makes it more controllable, and tends to prevent runaway reactions.
* Negative void coefficient of reactivity: as voids (steam pockets) increase, the neutron flux is reduced.
* Control rods constructed solely of neutron absorber. The RBMK reactor (Chernobyl) in particular used graphite followers (tips), which _increased_ reactivity initially when being lowered.
It's also worth noting that nuclear reactors are designed to be operated within certain limits. The RBMK reactor would have been fine had it been operated as designed.
Source: was a nuclear reactor operator on a submarine.
e.g. consider a railway track circuit - this is the way that a signaling system knows whether a particular block of a track is occupied by a train or not. The wheels and axle are conductive so you can measure this electrically by determining whether there's a circuit between the rails or not.
The naive way to do this would be to say something like "OK, we'll apply a voltage to one rail, and if we see a current flowing between the rails we'll say the block is occupied." This is not fail-safe. Say the rail has a small break, or if power is interrupted: no current will flow, so the track always looks unoccupied even if there's a train.
The better way is to say "We'll apply a voltage to one rail, but we'll have the rails connected together in a circuit during normal operation. That will energize a relay which will cause the track to indicate clear. If a train is on the track, then we'll get a short circuit, which will cause the relay to de-energize, indicating the track is occupied."
If the power fails, it shows the track occupied because the relay opens. If the rail develops a crack, the circuit opens, again causing the relay to open and indicate the track is occupied. If the relay fails, then as long as it fails open (which is the predominant failure mode of relays) the track is also indicated as occupied.
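The two track-circuit designs can be compared with a small truth-table model. This is a simplified sketch, not a description of any real signaling system: the class, field, and function names are invented, and it only models the failures mentioned above (power loss, broken rail, relay failing open).

```python
from dataclasses import dataclass


@dataclass
class TrackBlock:
    train_present: bool = False
    power_ok: bool = True
    rail_intact: bool = True
    relay_failed_open: bool = False


def naive_shows_occupied(block: TrackBlock) -> bool:
    """Naive design: 'occupied' only when current flows through the axle."""
    return block.power_ok and block.rail_intact and block.train_present


def fail_safe_shows_occupied(block: TrackBlock) -> bool:
    """Fail-safe design: 'clear' only while the relay is held energized.

    A train shorts the rails and drops the relay; so does any modeled fault.
    """
    relay_energized = (
        block.power_ok
        and block.rail_intact
        and not block.train_present
        and not block.relay_failed_open
    )
    return not relay_energized


if __name__ == "__main__":
    hidden_train = TrackBlock(train_present=True, power_ok=False)
    print("naive:    ", naive_shows_occupied(hidden_train))      # unsafe: reads clear
    print("fail-safe:", fail_safe_shows_occupied(hidden_train))  # reads occupied
```

Note what the model leaves out: a relay that fails closed (stuck energized) would still read clear, which is why the argument leans on failing open being the predominant failure mode.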
Fail safes do fail. Often due to severe user error.
You only need one server, slap on a hypervisor and you're rocking. Heck, you can buy entry servers from Dell on a budget; upgrade later.
A 10u rack, which is adequate for any small business comes to around $500 a month in LA. 4u would be more than enough for a startup and that's around $200/month.
Is the start-up not going to purchase computer hardware, monitors, television screens for their clients when they sit in the waiting room? Email accounts with Office365, a website, a domain name? If they can fit that in to their budget I am pretty sure they could afford a server and colocation space.
> What you are saying is similar to saying may be it is cheaper to buy the building /floor instead of renting space for office. - most small biz cannot afford do that, or expect their business to change (fail/take off) in the time frame ROI would come to take that commitment.
But colocation is dynamic. Contracts can be negotiated.
> buy into rent seeking business models
And AWS isn't a rent seeking business model?
Looking at EC2 instances, for a 32-core "Dedicated" instance with a 120 GB disk, you're looking at 679.54 USD a month. 120 GB isn't much, especially when the developers start doing their thing.
For $500 you can have so much more, and hardware you actually own and that if the company does not lift off it can then be sold. Is that not the better investment?
No remarks to lack of skills.
It's just strange.
(Using copyrighted material is permitted under fair use; this website is a parody. I’m not a lawyer but at some level preserving the copyright notice is probably better than claiming it as their own.)
You may say that the original work is copyright of the respective owners and that this is a parody work. But that's not what the site is doing. The footer contains the original, unaltered copyright, creating confusion as to who owns the derived work. Amazon does not own this, nor do they endorse it, so you're not allowed to say it's copyrighted by Amazon.
Speeding? Basically a national pastime at this point.
Misrepresentation, common fraud, and misappropriation? Par for the course in most small businesses.
It's only a crime if someone gives enough of a shit to do something about it; otherwise, it's just life.
"almost unusable" is maybe exaggerating, but there were definitely issues affecting more than just the single AZ.
Yes, some of these we should be better at handling ourselves, but... it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands.
edit: just to short circuit any "well, why aren't you running redundant regions" - we run redundant regions at all times. But for reasons of latency, many customers will bind to their closest region, and the nature of our technology is highly location-bound. It is not possible for us to move active sessions to an alternate region. So something like this is... unpleasant.
If they claim tier4, then they basically have everything in n+n configuration.
I used to work next door to a "major" cable TV station's broadcast location. They had multiple generators on-site, and one of them was running 24/7 (they rotated which one was hot). A major power outage hit, and there was a thunderous roar as all of the generators fired up. The channel never went off the air.
Our main computer lab had a serial UPS that was online 100% of the time, though the inverters were under a very light load. If the mains even acted 'weird' (dips, bad power factor, spikes) the UPS jumped full on, and didn't revert to main power until the main was stable for some duration of time. The UPS was able to carry the full lab (which was quite large) for about two hours, allowing plenty of time for the generator to fire up.
The UPS ran a lot, and because the main was 'weird', the outages were often short; the generator wouldn't even start during the first ten minutes of UPS coverage. Of course, the rest of the building would be dark, other than emergency lighting.
I was an embedded firmware engineer, and our development lab was directly on the wall behind the UPS. When it fired into 100% mode, it roared, mostly from cooling. It was sort of a heads up that the power was likely to fail soon.
A few minutes seems correct for one place I worked.
This was back in the 90's, before UPS technology got really interesting. Our system was two large rooms with racks and racks and racks of car batteries wired together. When the power went out, the batteries took over until the diesel generator could come online.
I saw it work during several hurricanes and other flood events.
I always found the idea of running an entire building off of car batteries amusing. The engineers didn't share my mirth.
I wonder if a lot of AWS dc design in this area predates the battery grid storage revolution with (what my impression is) a far faster adaptation/switchover time than a generator spin up, and possibly software systems that work to detect and switch over quickly?
AWS can claim it will be best of breed, but they aren't going to throw out a DC power redundancy investment (or threaten downtime) that they can't wring more ROI on.
Tesla apparently did some early pilot stuff: https://www.datacenterdynamics.com/en/analysis/teslas-powerp...
Maybe it could affect people buying services as well.
No. You rack the server, connect the cable to the switch, press the on-switch, let it boot and then operate it as you would any other computer. Need a new server? Get the DCOps remote hands to rack the server, cable it to the switch and press the on-switch. If you can install Windows 10, you can install Linux.
> This comment just screams of engineer-only focus. Running servers yourself brings almost no value to the customer and is a specialized skill that you have to pay for.
Hmm, the data is yours; you own the data, and that's value to me if I am the customer. And there's no added value to the customer if you host it in the cloud either; if anything it's a loss in retrospect.
There is no specialized skill you have to pay for. Sure, if you were going to run a high-density compute mainframe running some specialized OS like AIX, then yeah. But buying a server, installing an OS, plugging it in to a switch and navigating it with SSH requires no overqualified anything.
To your second point: We still own the data for hosted solutions. We have our backups, we have direct access. Barring a catastrophic failure and wiping of AWS, our data is there and ours. The value to our customers of using cloud providers is that the time saved on building infrastructure is instead spent on delivering value to the customer in the form of features/bug fixes. And yes I know the argument of "you end up spending more time debugging AWS", but I think you don't need to reach that point if you keep things pretty simple, especially early stage.
I think you're vastly oversimplifying the task. And if I had to guess, it's something you're pretty familiar with so it makes sense that it's easy to you! I'm sure an early stage company would be happy to have you to save a ton of money in their early days on infrastructure costs.
All of which you have to do on a cloud provider too. Fire/Power are normally handled by the DC. You have to have someone with knowledge in the first place to operate that in the cloud, and running an application such as HAProxy is at the same skill level. Especially with the vast fields of blogs you can find on the topic.
The cloud brings the instant "power-up" methodology, and I'll concede you could be right there. If you want a perfectly optimized cloud platform you would need a dedicated "cloud" engineer to ensure security, connectivity, etc. But if you're going to do that then again you might as well move in-house and hire a system admin; you've still got to operate your company's infrastructure. It's a moot point.
> The value to our customers for using cloud providers is the time saved on building infrastructure is instead spent on delivering value to the customer in the form of features/bug fixes.
I suppose this is a mixed area and what customers value varies. For me, I value a service that uses its own hardware rather than cloud, on the basis that they are willing to put the skill in to operate their own infrastructure rather than relying on a cloud provider.
> I think you're vastly oversimplifying the task.
I don't think so. People seem to think that setting up what you can in the cloud is impossible on bare metal when really it isn't. What did DevOps do before the cloud providers? Amazon, Azure, X have only lassoed FOSS software, constructed a web GUI admin panel and sold it as a service.
This is not to say Cloud doesn't have a purpose, otherwise it wouldn't exist today.
> it's something you're pretty familiar with so it makes sense that it's easy to you
That's true, and I won't dispute it: I've been working in the SysAdmin field since 2009. But I disagree with the implication, as I started with very little knowledge and gained it by setting up such infrastructures. I've trained a few people, and those with very little knowledge of servers could set up the kind of infrastructure that companies run in AWS.
That's a warped view of the world. A corporation can always take you to court and harass the crap out of you, the court will side with your defense because you were right and claimant was wrong, you had the right to do what you did.
Source? Has there ever been an industry wide survey that compares availability from "insert average colo/data center operations" with the cloud ones?
And I'm not talking about "we have 12 SREs who are based in Cupertino and are all paid top dollar to support a colo"...I'm talking average.
I worked through the ranks at a large enterprise that ran a “big” datacenter for a decade. The facilities team was about 6 people, average salary around $90k. I can only remember one power interruption affecting more than a rack, caused by a failure during a maintenance event that required a shutdown for safety reasons. The rest is like any other industrial facility - you have service contracts for the equipment, etc. and maintain things.
There’s a cost/capability curve that you need to plan around for these matters. You need to make business and engineering decisions based on your actual circumstances. If the answer is automatically “AWS <whatever>“, you’re making a decision to burn dollars for convenience.
Ok so $540k salaries + benefits, so ~$700k. Then you have transaction costs:
- Annual salary increases
- Any cost associated with people leaving (severance, hiring, recruiters, HR, HR systems)
- Systems that run in the data center (logging, monitoring, etc.)
- Procurement costs with changing costs in hardware (silicon shortages, etc.)
- Security compliance overhead and associated risks
- Finance resources required to capitalize and manage asset allocation
- etc. etc.
Versus
- Click a button and voila it works.
- Hire far fewer engineers to manage the system administration portion
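For anyone who wants to sanity-check the salary line above, here is a rough sketch of the arithmetic. The headcount and average salary come from the parent comment; the 30% benefits uplift is purely an assumption chosen to land near the ~$700k estimate, and the function name is made up for illustration. None of the other line items (turnover, procurement, compliance) are modeled.

```python
def staffing_cost(headcount: int, avg_salary: int, benefits_pct: int) -> tuple[int, int]:
    """Return (base salaries, salaries plus benefits) in whole dollars."""
    base = headcount * avg_salary
    # Integer math keeps the result exact; benefits_pct is a percentage uplift.
    loaded = base * (100 + benefits_pct) // 100
    return base, loaded


# 6 facilities staff at roughly $90k each (from the thread).
# benefits_pct=30 is an assumed uplift, not a figure anyone quoted.
base, loaded = staffing_cost(headcount=6, avg_salary=90_000, benefits_pct=30)
print(f"base ${base:,}, with benefits ${loaded:,}")
```

Swapping in your own headcount and uplift is the whole point; the "it depends" only gets resolved once the real numbers are plugged in.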
> If the answer is automatically “AWS <whatever>“, you’re making a decision to burn dollars for convenience.
100% AGREE. The answer is always "it depends", but just as people say "just put it in the cloud", the opposite claim of "well, it worked for us using a data center" isn't that simple either.
I’ve been deploying to AWS for years and can’t remember an outage on their side in my region. But this is anecdotal and doesn’t necessarily reflect the statistics.
It is as if the software industry has collectively forgotten how to run basic data center operations. Something that used to be a blue collar skill is now treated like arcane magic.
It's not arcane magic. It's undifferentiated toil that requires hiring for a different skill set than tech companies generally want to hire for. Of course when you get to a certain size it may make sense to take on this cost.
I want us to stop pretending individual actors lack the agency to make their own decisions, and they're all blind to how AWS is charging them a fortune for such simple things they can do themselves. You get value from AWS or you stop using AWS.
And what rate is this? It gets attention because it impacts more people, but AWS / GCP / Azure uptime is still better than what I've seen for small / mid size businesses trying to manage their own infrastructure.
So again, we're talking about cloud providers because of their scope and size, and they're still doing better than MidSizedBank managing their own infrastructure.
The "when" shouldn't really matter- Diesel engines aren't a new thing. Warming them up isn't really a thing either- they'll have electric warmers hooked up to the building power to keep them ready to go.
Edit to add: I was at a place that took over a company that had one of these. With all of the dead batteries, it was just a really really large inverter taking the 3-phase AC to DC back to AC with a really nice and clean sine wave.
Why in the world would the state need to be involved in this level of decision?
> it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands
"Expect to lose an AZ" includes not being able to make any changes to existing instances in the affected AZ.
If you had instances across multiple AZs behind an ELB with health checks, then the ELB should automatically remove the affected instances.
If you have a different architecture, you would want to:
* Have another mechanism that automatically stops sending traffic to impaired instances (ideal), or
* Have a means to manually remove the instances from service without being able to interact with or modify those instances in any way
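To make the idea concrete, here is a toy sketch of the two options: the routing side decides on its own who is in rotation, so nothing depends on the broken instance acknowledging a command. This is not how ELB is implemented; the class, the method names, the threshold and the addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthCheckedPool:
    """Toy pool: targets leave rotation after repeated failed probes."""
    probe: Callable[[str], bool]      # returns True if the target answered
    unhealthy_threshold: int = 3      # consecutive failures before removal
    failures: Dict[str, int] = field(default_factory=dict)

    def register(self, target: str) -> None:
        self.failures[target] = 0

    def deregister(self, target: str) -> None:
        # Manual removal: purely local bookkeeping, the target is never contacted.
        self.failures.pop(target, None)

    def run_checks(self) -> None:
        for target in self.failures:
            try:
                ok = self.probe(target)
            except Exception:
                ok = False            # a timeout or error counts as a failed probe
            self.failures[target] = 0 if ok else self.failures[target] + 1

    def in_service(self) -> List[str]:
        return [t for t, n in self.failures.items() if n < self.unhealthy_threshold]


# Simulate one AZ going dark: its instance stops answering probes.
dead = {"10.0.1.5"}
pool = HealthCheckedPool(probe=lambda target: target not in dead)
for target in ("10.0.1.5", "10.0.2.7"):
    pool.register(target)
for _ in range(3):
    pool.run_checks()
print(pool.in_service())
```

The design point is that both `run_checks` and `deregister` only touch state held by the router, which is what "expect to lose an AZ" has to mean in practice.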
Does that help, or have I misunderstood your problem?
Amazon has much bigger legal issues to focus on than some satire.
Obviously, there’s a huge capital investment component too that has to be incorporated. Those costs may be really high if you’re in a growth phase as you need to overbuy capacity.
Just to be clear, I’m not arguing that on-prem is magically cheap. :) But it has its place too!
It is also clearly not satire. That would not hold up in court, and there are many instances where they have tried that angle and failed.
And consumers are clearly deceived - hence why my original comment asking about it was written and has several upvotes.
Being concerned for the proper respect of IP laws is something that benefits everyone.
Your argument would have a small amount of merit if you acknowledged that the laws DO NOT protect people like they do corporations. That is a hollow ideal, not reality.
Regardless of your pointed comment, I'm operating in the land of legal objectivity. The law doesn't care about your feelings much.
> Nope. Because these laws also protect people who make such websites from the corporations they're commenting on, too.
My response is that your assumption is very obviously wrong: the law does not protect individuals and corporations alike.
That’s all.
That is a weird understanding of what I said, and I don't really think you're arguing in good faith here. There's a lot of bias so I am choosing to not further this conversation.