Tell HN: AWS appears to be down again

863 points by riknox 4 years ago | 614 comments

Console is flickering between "website is unavailable" and being up for my team. This is happening very frequently just now, reliability seems to have taken a hit.

aledalgrande 4 years ago |

If you haven't seen yet, news is it was a power loss:

> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.

vinay_ys 4 years ago | |

This is quite interesting as they claim their datacenter design does better than Uptime's Tier3+ design requirements which require redundant power supply paths. [https://aws.amazon.com/compliance/uptimeinstitute/]. I really hope they publish a thorough RCA for this incident.

tyingq 4 years ago | | |

"Electrical power systems are designed to be fully redundant so that in the event of a disruption, uninterruptible power supply units can be engaged for certain functions, while generators can provide backup power for the entire facility." https://aws.amazon.com/compliance/data-center/infrastructure...

So they have 2 different sources of power coming in. And generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while generators spin up if the 2 primaries go out. Or perhaps some failure in the source switching equipment (typically called a "static transfer switch").

Some detail on different approaches: https://www.donwil.com/wp-content/uploads/white-papers/Using...

JshWright 4 years ago | | |

> I really hope they publish a thorough RCA for this incident.

We're still waiting on the RCA for last week's us-west outage...

codeduck 4 years ago | |

another example of a single dc in a single AZ rendering an entire region almost unusable. This has shades of eu-central-1 all over again.

nightpool 4 years ago | | |

Amazon is claiming the failure is limited to a single AZ. Are you seeing failures for instances outside of that AZ? If not, how has this rendered "the entire region almost unusable"?

SCdF 4 years ago | |

So dumb question from someone who hasn't maintained large public infrastructure:

Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?

IE why are we (consumers) hearing about this or being obviously impacted (eg Epic Games Store is very broken right now)? Is my assessment wrong, or are all these apps that are failing built wrong? Or something in between?

fulafel 4 years ago | | |

IME people rarely test and drill for the failovers, it's just a checkbox in a high level plan. Maybe they have a todo item for it somewhere but it never seems very important as AZ failures are usually quite rare. After ignoring the issue for a while it starts to seem risky to test for it, you might get an outage due to bugs it's likely to uncover.

gpm 4 years ago | | |

> or are all these apps that are failing built wrong

Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.

It's also unclear to me how often things fail in a way that actually only affects one AZ, but I haven't seen any good statistics either way on that one.

peeters 4 years ago | | |

As I understand it for something like SQS, Lambda etc, AWS should automatically tolerate an AZ going down. They're responsible for making the service highly available. For something like EC2 though, where a customer is just running a node on AWS, there's no automatic failover. It's a lot more complicated to replicate a running, stateful virtual machine and have it seamlessly failover to a different host. So typically it's up to the developers to use EC2 in a way that makes it easy to relaunch the nodes on a different AZ.

robjan 4 years ago | | |

That's the theory but in practice very few companies bother because it's expensive, complicated and most workloads or customers can tolerate less than 100% uptime.

sprite 4 years ago | | |

I thought I was Multi AZ but something failed. I am mostly running EC2 + RDS both with 2 availability zones. I will have to dig into the problem but I think the issue is that my setup for RDS is one writer instance and one reader instance, each in a different AZ. However I guess there was nothing for it to fail over to since my other instance was the writer instance, so I guess I need to keep a 3rd instance available preferably in a 3rd AZ?

TruthWillHurt 4 years ago | | |

Amazon shifts the responsibility for multi-AZ deployment to us customers, saving themselves complexity and charging us extra - win-win for them.

_joel 4 years ago | | |

You're supposed to build your app across multiple AZs, but I know a lot of companies that don't do this and shove everything in a single AZ. It's not just about deploying an instance there but ensuring the consistency of data and state across the AZs.

xyst 4 years ago | |

This region in general is a clusterfuck. If companies by now do not have a disaster recovery and resiliency strategy in place, you are just shooting yourself in the foot.

philsnow 4 years ago | | |

In today's world of stitching together dozens of services, who each probably do the same thing, how is one to avoid a dependency on us-east-1? Add yet another bullet to the vendor questionnaire (ugh) about whether they are singly-homed / have a failover plan?

It's turtles all the way down, and underneath all the turtles is us-east-1.

notyourday 4 years ago | |

We are being told that there are still issues in USE1-AZ4 and some of the instances are stuck in the wrong state as of 16:15 EST. There's no ETA for resolution.

alostpuppy 4 years ago | |

Why do folks host their stuff in us-East? Is there a draw other than organizational momentum?

dragonwriter 4 years ago | | |

> Why do folks host their stuff in us-East?

Off the top of my head, US-EAST-1 is:

(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),

(2) consistently in the first set of regions to get new features,

(3) usually in the lowest price tier for features whose pricing varies by region,

(4) where certain global (notionally region agnostic) services are effectively hosted and certain interactions with them in region-specific services need to be done.

#4 is a unique feature of US-East-1, #2-#3 are factors in region selection that can also favor other regions, e.g., for users in the West US, US-West-2 beats US-West-1 on them, and is why some users topologically closer to US-West-1 favor US-West-2.

superdug 4 years ago | | |

It's the cheapest.

GrumpyNl 4 years ago | |

How come they dont have power backups?

chkhd 4 years ago | | |

"When a fail-safe system fails, it fails by failing to fail-safe." - https://en.wikipedia.org/wiki/Systemantics

redm 4 years ago | | |

Some datacenter failures aren't related to redundancy. Some examples: 1) transfer switch failure where you can't switch over to backup generators and the UPS runs out, 2) someone accidentally hits the EOD, 3) maintenance work makes a mistake such as turning off the wrong circuits, 4) cooling doesn't switch over fully to backups and while your systems have power, it's too hot to run. The list can go on and on.

I'm not sure why this is a big deal though, this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.

taf2 4 years ago | | |

it was not a total power loss. out of 40 instances we had running at the time of the incident only 5 of our instances appeared to be lost to the power outage. the bigger issue for us was ec2 api to stop/start these instances appeared to be unavailable (but probably due to the rack these instances were in having no power). The other issue that was impactful to us was that many of the remaining running instances in the zone had intermittent connectivity out to the internet. Additionally, the incident was made worse by many of our supporting vendors being impacted as well...

IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount) but being honest this wasn't that bad.

chousuke 4 years ago | | |

Sometimes, you have a component which fails in such a way that your redundancies can't really help.

I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.

Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.

trelane 4 years ago | | |

Anything can fail, even your backup, and especially if it's mechanical.

Spooky23 4 years ago | | |

Their datacenter(s) aren’t magic because they are AWS. That facility is probably a decade old and like anything else as it ages the technical and maintenance debt makes management more challenging.

thetinguy 4 years ago | | |

They do. I remember watching one of their sessions where they showed every rack having its own battery backup.

TrueDuality 4 years ago | | |

According to the SOC certifications they give their customers they do.

ItsBob 4 years ago |

I've built out many 42U racks in DC's in my time and there were a couple of rules that we never skipped:

1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source, meaning that we can lose a single power supply/circuit and nothing happens

2. Dual network (at minimum) - For the same reasons as above since the switches didn't always have dual power in them.

I've only had a DC fail once, when the engineer was performing work on the power circuitry for the DC and thought he was taking down one circuit, but it was in fact the wrong one and he took both power circuits down at the same time.

However, a power cut (in the traditional sense where the supplier has a failure so nothing comes in over the wire) should have literally zero effect!

What am I missing?

I've never worked anywhere with Amazon's budget so why are they not handling this? Is it more than just the incoming supply being down?

Hippocrates 4 years ago |

Every time a major cloud provider has an outage, Infra people and execs cry foul and say we need to move to <the other one>. But does anyone really have an objective measure of how clouds stack up reliability-wise? I doubt it, since outages and their effects are nuanced. The other move is that they want to go multi-cloud... But I’ve been involved in enough multi-cloud initiatives to know how much time and effort those soak up, not to mention the overhead costs of maintaining two sets of infra sub-optimally. I would say that for most businesses, these costs far exceed that occasional six-hour-long outage.

hnarn 4 years ago |

Is there a history of AWS downtimes available somewhere? This makes what, three times in as many months?

edit: The question isn't necessarily AWS specific, just any data on amount of downtime per cloud provider on a timeline would be nice.

colinbartlett 4 years ago | |

I have tons of this kind of data due to my side project, StatusGator. For some services like the big cloud providers I have data going back 7 years.

There indeed has been an uptick in AWS outages recently. You can see a bit of the history here: https://statusgator.com/services/amazon-web-services

exikyut 4 years ago | | |

(I was idly curious. It appears this data is available as part of the ~US$280/mo tier, along with a bunch of other things.)

MatteoFrigo 4 years ago | |

I don't know about AWS, but both Google Cloud and Oracle Cloud maintain at least a high level history of past outages. See https://status.cloud.google.com/summary and https://ocistatus.oraclecloud.com/history

dijit 4 years ago | | |

Given the hilariously awful reputation of the AWS status page I would hazard a guess that such a page would also be incredibly inaccurate.

If you can’t even admit you’re having an issue how can you keep an accurate record?

LuciusVerus 4 years ago | |

I'd say three times in as many weeks, give or take

spmurrayzzz 4 years ago | |

This is a little more broad, beyond just cloud infra providers, but includes some of the kind of data you're looking for (post-mortems for outage events): https://github.com/danluu/post-mortems

andyjih_ 4 years ago |

The most hilarious irony of not being able to acknowledge a 4AM page in the PagerDuty mobile app because AWS is down.

exikyut 4 years ago | |

(Which was about AWS being down?)

JCM9 4 years ago |

AWS didn’t “go down”. They had an outage in one AZ, which is why there are multiple AZs in each region. If your app went down then you should be blaming your developers on this one, not AWS. Those having issues are discovering gaps in their HA designs.

Obviously it’s not good for an AZ to go down but it does happen and why any production workload should be architected to have seamless failover and recover to other AZs, typically by just dropping nodes in the down AZ.

People commenting that servers shouldn’t go down etc. don’t understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it’s like complaining that you lost data because a disk failed. Disks fail… build architecture where that won’t take you down.
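
To make "just dropping nodes in the down AZ" concrete, here is a minimal Python sketch, not anyone's production code. The `Node` type, the `FLEET` list, and the function names are hypothetical; in a real setup the set of down zones would come from health checks rather than being hard-coded.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Node:
    name: str
    zone_id: str  # physical zone ID such as "use1-az4", not the per-account letter


def drop_nodes_in_down_zones(nodes: Iterable[Node], down_zone_ids: Set[str]) -> List[Node]:
    """Keep only the nodes that sit outside the zones currently marked as down."""
    return [n for n in nodes if n.zone_id not in down_zone_ids]


def pick_targets(nodes: Iterable[Node], down_zone_ids: Set[str], min_healthy: int = 1) -> List[Node]:
    """Return the surviving nodes, or fail loudly if too few are left to serve traffic."""
    healthy = drop_nodes_in_down_zones(nodes, down_zone_ids)
    if len(healthy) < min_healthy:
        raise RuntimeError("not enough capacity outside the affected zones")
    return healthy


# Hypothetical fleet spread across three zones.
FLEET = [
    Node("web-1", "use1-az4"),
    Node("web-2", "use1-az1"),
    Node("web-3", "use1-az6"),
]

if __name__ == "__main__":
    # With use1-az4 marked down, traffic goes to the remaining two nodes.
    for node in pick_targets(FLEET, {"use1-az4"}):
        print(node.name, node.zone_id)
```

The point of `min_healthy` is that failing over only helps if the other zones have enough spare capacity to absorb the load.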

IceWreck 4 years ago |

Honestly my server at home has more uptime than US-East-1

RONROC 4 years ago |

The prevailing wisdom throughout the last couple of years was:

“ditch your on-prem infrastructure and migrate to a major cloud provider”

And it’s starting to seem like it could be something like:

“ditch your on-prem infrastructure and spin up your own managed cloud”

This is probably untenable for larger orgs where convenience gets the blank check treatment, but for smaller operations that can’t realize that value at scale and are spooked by these outages, what are the alternatives?

potas 4 years ago |

Slack seems to have some issues because of that - I'm not sure if anyone is receiving messages, as it became completely silent for the last 15 minutes or so.

jenoer 4 years ago | |

Sending and receiving messages works here, but editing them does not, it throws an error. Statuses such as "calling" also do not seem to be updated any longer.

Edit: Restarting Slack does update the edited messages.

Edit 15:24 CET: Slack is back up.

jakub_g 4 years ago | | |

Same: only normal text seems kinda working

- edits failing or working with big lag;

- "Threads" view slow;

- can't emoji-react;

- can't upload images;

- people also say they can't join new channels.

darkwater 4 years ago | |

I fail to understand how a big player like Slack can be impacted this way by a failure in a single AZ in a specific AWS region. But at least the main feature (sending and displaying messages) is still working.

jakub_g 4 years ago | |

https://status.slack.com/2021-12/a17eae991fdc437d

> We are experiencing issues with file uploads, message editing, and other services. We're currently investigating the issue and will provide a status update once we have more information.

> Dec 22, 1:58 PM GMT+1

Pandabob 4 years ago | |

Uploading images doesn't work for me.

oneeyedpigeon 4 years ago | |

New messages seem to be ok for me, but editing old ones and uploading images both seem to be broken right now.

aden1ne 4 years ago | |

I can't edit messages, nor create channels. Messages are only received with a several minute delay.

izietto 4 years ago |

I guess that's why I'm experiencing weird issues with Heroku:

    remote: Compressing source files... done.
    remote: Building source:
    remote: 
    remote: ! Heroku Git error, please try again shortly.
    remote: ! See http://status.heroku.com for current Heroku platform status.
    remote: ! If the problem persists, please open a ticket
    remote: ! on https://help.heroku.com/tickets/new

dijit 4 years ago | |

Yes.

Another thread: https://news.ycombinator.com/item?id=29648325

vegai_ 4 years ago |

5ish years ago it was common knowledge that us-east-1 is generally the worst place to put anything that needs to be reliable. I guess this is still true?

taf2 4 years ago | |

I don't know about that. It was more like common knowledge that one availability zone in us-east-1 was a problem - you would have to figure out which one it was usually by spinning up instances in all 4 zones (now 6)... and that it was the largest of all regions making it ideal place to put your service if you wanted to be close to other vendors/partners in AWS...

beermonster 4 years ago | |

us-east-1 seems to be AWS’s not so well kept little dark secret!

In all seriousness though - even non-regional AWS services seem to have ties to us-east-1 as evidenced by the recent outages. So you might be impacted even if it looks like (on paper at least) you’re not using any services tied to that region.

thow-58d4e8b 4 years ago | |

Unfortunately, the fact that us-east-1 is roughly 10% cheaper than other regions usually overrides any other concerns

dolibasija 4 years ago |

One of our EC2 instances in us-east-1c is unavailable and stuck in "stopping" state after a force stop. Interestingly enough, EC2 instances in us-east-1b don't seem to be affected.

The console is throwing errors from time to time. As usual no information on AWS status page.

JshWright 4 years ago | |

Instances stuck in the "stopping" state is pretty common, in my experience.

crescentfresh 4 years ago | |

The affected zone is use1-az4. Whatever that maps to (1a, 1b, 1c) is different per customer.

benedikt 4 years ago | | |

you can find out which zone is mapped to use1-az4 for your account with awscli:

    aws ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | select(.ZoneId == "use1-az4") | .ZoneName'
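
If you would rather script this than pipe through jq, the same lookup can be sketched in Python. This is a minimal sketch that assumes you already have the JSON printed by `aws ec2 describe-availability-zones` (it reads the same `AvailabilityZones`, `ZoneId`, and `ZoneName` fields as the jq filter). The `SAMPLE` payload and the function name `zone_names_for_id` are made up for illustration; the real mapping differs per account.

```python
import json
from typing import List


def zone_names_for_id(describe_output: dict, zone_id: str) -> List[str]:
    """Return the account-specific zone names that map to a physical zone ID."""
    return [
        az["ZoneName"]
        for az in describe_output.get("AvailabilityZones", [])
        if az.get("ZoneId") == zone_id
    ]


# Hypothetical, trimmed-down example of the CLI output for one account.
SAMPLE = json.loads("""
{
  "AvailabilityZones": [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az6"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az1"},
    {"ZoneName": "us-east-1c", "ZoneId": "use1-az4"}
  ]
}
""")

if __name__ == "__main__":
    print(zone_names_for_id(SAMPLE, "use1-az4"))
```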

chrishynes 4 years ago | |

I had the same issue with unavailable, but on an instance in us-east-1b. Finally just got the force stop to go through a minute ago and it's now running and available again.

mike-cardwell 4 years ago | | |

Your us-east-1b may be the parent's us-east-1c.

The letters are randomised per AWS account so that instances are spread evenly and biases to certain letters don't lead to biases to certain zones.

throwaway984393 4 years ago | | |

I'm not sure if we should say "AWS is down" if only us-east-1 is down. That region is more unstable than Marjorie Taylor Greene on a one-legged stool.

300bps 4 years ago | |

The 1c part is meaningless. Those letters are randomized per customer to prevent letter biases from leading to more people in 1a for instance.

crescentfresh 4 years ago | |

Was stuck on stopping in us-east-1b. Cannot start now.

ClumsyPilot 4 years ago |

Now that everyone and their dog is on AWS, it is not just 'a website stops working', half the world, from telephones to security doors and Iot equipment, stops working?

I am not sure if the movement to the cloud has reduced the number of failures, but it definitely has made these failures more catastrophic.

Our profession is busy making the world less reliable and more fragile; we will have our reckoning just like the shipping industry did.

dehrmann 4 years ago | |

It's more like it's making downtimes correlated rather than random. For everything other than urgent communication, I'm not sure if this is a big deal.
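
A rough back-of-the-envelope sketch of correlated versus random downtime, in Python. The numbers are hypothetical (five dependencies, each up 99.9% of the time), and "fully correlated" here assumes outages are nested, i.e. everything goes down together with the least reliable service.

```python
from typing import Sequence


def all_up_independent(availabilities: Sequence[float]) -> float:
    """Probability that every dependency is up at once if failures are independent."""
    p = 1.0
    for a in availabilities:
        p *= a
    return p


def all_up_fully_correlated(availabilities: Sequence[float]) -> float:
    """If outages always coincide, you are exactly as available as the worst dependency."""
    return min(availabilities)


if __name__ == "__main__":
    deps = [0.999] * 5  # hypothetical: five services at 99.9% each
    print("independent:     ", all_up_independent(deps))
    print("fully correlated:", all_up_fully_correlated(deps))
```

Under those assumptions, independent failures mean something is broken roughly 0.5% of the time, while correlated failures mean everything is broken together about 0.1% of the time: fewer bad days, but each one hits everything at once.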

madeofpalk 4 years ago | |

all I've noticed is slack was a bit unreliable for a little bit, but i just carried on and otherwise ignored it. my world did not stop working.

ClumsyPilot 4 years ago | | |

My apartment block has a dialing system that, instead of using a cable that goes to your apartment, relies on IP telephony and calls your mobile phone. It stops working if there is no internet, or your phone is out of battery, or you are not home but your wife is.

KronisLV 4 years ago | | |

Same, maybe that was a related issue.

Today, on Slack i could not edit messages, could not edit statuses and could not post attachments. Pretty annoying!

schnebbau 4 years ago |

So, how many execs are going to push to move to self-managed hosting in the new year?

Packaging a way to migrate off AWS could be a unicorn idea.

qwertyuiop_ 4 years ago | |

None. Amazon hired all the ex-VPs, CTOs, and Directors of small, medium, and large companies with Rolodexes.

mikece 4 years ago | |

Would need one hell of a compression algorithm to keep the data exfiltration costs down.

pm90 4 years ago | | |

Pied Piper

adamm255 4 years ago | |

Anyone using VMware Cloud services is probably laughing. Just chuck it at Azure or GCP or back on prem.

dehrmann 4 years ago | |

Depends on how many customers are ready to move to a different vendor. I suspect most customers are forgiving because either they were also down or half the services they use were down. You don't get fired for hosting in AWS.

wallacoloo 4 years ago | |

AWS has its Outposts product for on-prem hosting. not 100% self-managed, but maybe enough to satisfy the execs and make your market a bit smaller.

Nextgrid 4 years ago | | |

Does it come with its own locally-hosted console or does it still rely on the main AWS control plane? If the latter then it could be affected too.

rsp1984 4 years ago |

Bitbucket having issues too: https://bitbucket.status.atlassian.com/

captn3m0 4 years ago |

4:35 AM PST We are investigating increased EC2 launched failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.

via https://stop.lying.cloud/

junon 4 years ago | |

Can anyone explain the affiliation of stop.lying.cloud to Amazon? All of the legalese in the header/footer seem to indicate it's actually owned and run by Amazon. If so... why? Why not just... use the real status page?

I mean I'm glad it exists, don't get me wrong. Just weird that they'd have two status pages, one seemingly existing only to sort of 'mock' themselves...

taspeotis 4 years ago | | |

The people who maintain the unofficial site would have, at some point, used their CTRL and C keys followed not immediately, but closely by, their CTRL and V keys.

jrumbut 4 years ago | | |

I was curious too. An HN user takes credit for it here: https://news.ycombinator.com/item?id=24499159

Apparently it does some simple transformations of the actual status page, which is why the Amazon copyright stuff is in there.

deadbunny 4 years ago | | |

FWIW `lying.cloud` is registered with Namecheap. `amazon.com`/`aws.com`/`amazon.ca` are all registered with MarkMonitor. And I know that AWS uses Gandi behind the scenes for domain reg. Given that, I'd hazard a guess that it's not owned by Amazon. Definitely not a guarantee though.

IceWreck 4 years ago | | |

Amazon's own status page sort of lies. So someone probably wget-ed the status page, kept the same html and css and hooked it to their own API to display correct info.

mule1 4 years ago |

Feel for devops peeps who are just trying to chill for Christmas

stunt 4 years ago |

It seems that it's due to a power loss.

[05:01 AM PST] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.

pawelduda 4 years ago |

Bitbucket is affected, pages randomly take forever to load or return 500

Pandabob 4 years ago | |

Yep, just botched a merge likely because of this.

el_duderino 4 years ago | |

Bitbucket just completed their migration to AWS too. Rough start.

darkwater 4 years ago |

Fields of green here https://status.aws.amazon.com/ Anyway I can access the web console with no issue (eu-west)

hnarn 4 years ago | |

I think it's pretty widely accepted that AWS' own status pages are utterly useless.

s_dev 4 years ago | | |

You would think that, but there are always a few contrarian AWS evangelists in the comments going on about the "difficulty" in operating a status page as though it were trying to conjure a P=NP proof.

Like how come down detector can do a superb job of detecting when AWS goes down and AWS can't? Because AWS doesn't want account managers of SLAs asking for credits for the uptime they're paying for but not getting.

https://downdetector.co.uk/status/aws-amazon-web-services/

darkwater 4 years ago | | |

Yeah, it was just to confirm that this time was no different :)

lordnacho 4 years ago | |

The elite DevOps teams are always assigned to the status page

temp0826 4 years ago | |

Changes to this page require very high level management approvals (source: used to work at aws)

JCM9 4 years ago | |

Status page says there are issues. It’s not all green.

oneeyedpigeon 4 years ago | | |

Now. It took a lot longer for that page to know/admit the problems than it did half the internet.

anshumankmr 4 years ago |

If AWS, GCP and Azure go down, we will be back in the stone ages, right?

dijit 4 years ago | |

The only stuff that will work will probably depend on things in AWS in some form.

That, or people never took the “if AWS goes down then lots of people will have a problem, so we’ll be fine” line seriously; there are few such cases.

omosubi 4 years ago |

I do wonder if the great resignation has anything to do with this. My team (no affiliation with Amazon) was cut in half from last year and we are struggling to keep up with all the work

sctgrhm 4 years ago |

Invision image uploads are down too because of this : https://status.invisionapp.com/

camdenreslink 4 years ago |

Who needs chaos monkey? Just host on AWS for a similar effect.

gtsop 4 years ago |

Question to the sysadmins here: Is it really that outrageous of amazon to have such issues or are people way too spoiled to appreciate the effort that goes into maintaining such a service?

Edit: Not supporting amazon, i generally dislike the company. I just don't understand the extent to which the criticism is justified

dsr_ 4 years ago | |

The issue is in three parts:

1. Did AMZN build an appropriate architecture?

2. Did AMZN properly represent that architecture in both documentation and sales efforts?

3. What the heck is going on with AMZN?

Let's say that they build an environment in which power is not fully redundant and tested at the rack level, but is fully redundant and tested across multiple availability zones. Did they then issue statements of reliability to their prospective and existing customers saying that a single availability zone does not have redundant power, and customers must duplicate functionality in at least 2 AZs to survive a SPOF?
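
As a rough illustration of why the "at least 2 AZs" claim matters, here is a small Python sketch of the textbook arithmetic. It assumes zone failures are independent, which is a strong assumption and not something AWS states; the 99% single-zone figure is hypothetical, and `combined_availability` is just a name picked for this example.

```python
def combined_availability(single_az: float, zones: int) -> float:
    """Availability of a service that stays up as long as at least one zone is up.

    Assumes each zone fails independently with the same probability.
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    return 1.0 - (1.0 - single_az) ** zones


if __name__ == "__main__":
    for n in (1, 2, 3):
        # Hypothetical single-zone availability of 99%.
        print(n, "zone(s):", combined_availability(0.99, n))
```

If failures are not independent (shared control plane, shared power event), the real number sits somewhere between the single-zone figure and this estimate.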

rswail 4 years ago |

So why are people not migrating out of us-east-1? Operating in ap-southeast, we weren't that affected by the us-east-1 down time, although our system is reasonably static and doesn't make lots of IAM calls (which seems to be a large SPOF from us-east-1).

dijit 4 years ago | |

Some “global” systems run in us-east-1, so even if you’re not hosted there, a service you depend on might be.

Notably: cognito, r53 and the default web UI. (You can work around the webui one I’m told, by passing a different domain instead of just console.aws.amazon.com)

watermelon0 4 years ago | | |

Don't forget about CloudFront, which can only be configured via us-east-1.

taf2 4 years ago | |

latency. us-east-1 is positioned very nicely relative to many large businesses in North America and Europe. This gives you pretty good access to a very large percentage of the economies of the world with good latency... while not requiring you to architect your application around multiple regions...

reactive55 4 years ago |

Bitbucket is down as well because of this. https://bitbucket.status.atlassian.com/incidents/r8kyb5w606g...

sprite 4 years ago |

My Elastic Beanstalk instances are completely unreachable. Seems at the very least ELB is down. Looking @ down detector it looks like this is taking a bunch of sites down with it. As usual AWS status page shows all green.

exabrial 4 years ago |

As an industry, can we please stop making products like vacuums that can't operate unless someone else's computer is working in a field in Virginia? There's literally no reason for it.

antihero 4 years ago |

I wonder how many 9s AWS is going for. Can't be a lot of 9s anymore.

arh68 4 years ago | |

89.9999 % has a lot of 9s, dare I say military-grade.

yabones 4 years ago | |

Nine Fives is the new Five Nines!
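
For anyone who wants the arithmetic behind the nines jokes, a quick Python sketch. It assumes a 365-day year and treats availability as a plain percentage of wall-clock time; the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for label, pct in [
        ("five nines", 99.999),
        ("89.9999%", 89.9999),
        ("nine fives", 55.5555555),
    ]:
        print(f"{label}: {downtime_hours_per_year(pct):.2f} hours/year")
```

Five nines works out to a bit over five minutes a year; 89.9999% is over a month.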

loudtieblahblah 4 years ago |

Yay! Adult snowday!

RobertKerans 4 years ago | |

Apropos of nothing, but a few Christmases ago the place I worked had a dedicated fibre line that some workmen doing gas line repairs sawed straight through, took out everything; I was just a drone worker at the time & it was a beautiful thing

exogenousdata 4 years ago |

Looks like the SEC's Edgar website is affected. This is the site the SEC uses to post the filings of public companies. Normally there are a hundred or more company filings in the morning starting at 6am ET. This morning there are two.

https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent

debarshri 4 years ago |

Hubspot seems to be down too [0].

[0] https://status.hubspot.com/

amai 4 years ago |

Thank goodness we host all IT services in the same cloud. Imagine the chaos we'd have if everything didn't fail at the same time.

iso1631 4 years ago |

Ahh, the cloud

https://imgflip.com/i/5yrt24

lukeqsee 4 years ago |

I can't get to the console either, receiving a "Temporarily unavailable" notice without branding.

sascha_sl 4 years ago |

quay.io is also dead, as well as giphy, some parts of slack

just the weekly internet apocalypse, happy holidays fellow SREs

richardfey 4 years ago |

As far as I understood a whole availability zone went down; today is also the day a lot of people understand why "multi-AZ" matters, so I don't think it's fair to say that services are down because the whole AWS is down.

jakub_g 4 years ago |

Where are you located? "X is down" without location is only moderately useful.

I'm having issues with Slack from central EU (Poland): can't upload images or send emoji reactions to posts (curiously, text works fine). Wondering if it's linked.

riknox 4 years ago | |

AWS Console runs in us-east-1 so that points to at least that region having issues IIRC. I am also having Slack issues in EU.

hdjjhhvvhga 4 years ago | |

You should complain to Slack then. It's their problem to choose a reliable provider, and AWS seems to have trouble with keeping this status.

kemals 4 years ago |

Here is The Internet Report episode on the topic of recent AWS outages that covers outage and root causes: https://youtu.be/N68pQy8r1DI

bob1029 4 years ago |

2 of our servers are fucked right now. VOIP services down.

Only with AWS and Github do I seem to get panicked text messages on my phone first thing in the morning... Our workloads on Azure typically only have faults when everyone is in bed.

fipar 4 years ago |

https://downdetector.com/status/aws-amazon-web-services/

devoutsalsa 4 years ago |

We'll never really know the answer, but I have to wonder what percentage of comments on this thread are from Amazon downplaying the severity & other cloud providers hyping it up.

mongrelion 4 years ago | |

You give HN too much credit.

j10c 4 years ago |

I also had a problem loading YouTube at the same time (for 10-15 minutes). It looks like a coincidence, but who knows if Google uses some of the infrastructure from AWS.

pkulak 4 years ago |

I used to think it was silly to have your own hardware (like a NAS) in your house. What makes you think you can do it better than AWS?

Santa is bringing me a Synology in three days.

darkstar999 4 years ago | |

Why not both? I just got a Synology NAS and it makes cloud sync dead simple. Now the most important things are on my PC, mirrored on 2 drives in my NAS, and on AWS S3 (or any other cloud storage).

pkulak 4 years ago | | |

Oh yeah. My plan is to migrate everything to the NAS, then have that back up to Glacier and/or Rsync.net. By S3, do you mean Glacier?

RobertKerans 4 years ago |

Assuming crates.io is AWS-backed? Getting a fun situation where direct dependencies of an application are downloading but then the sub-dependencies aren't.

lukeqsee 4 years ago | |

crates.io is directly hosted on GitHub, but I'm sure some dependencies use S3 or other AWS services for things.

pietroalbini 4 years ago | | |

The crates.io index is hosted on GitHub, but the application/API is hosted on Heroku (so in the us-east-1 AWS region) and the downloads on S3/CloudFront. And yes crates.io is currently impacted.

RobertKerans 4 years ago | | |

Yep, S3 possibly the villain here

mwcampbell 4 years ago | |

Yeah, and I can't publish a crate.

kingsloi 4 years ago |

Of all the AWS outages, my team and I have dodged them all, except this one. 3 instances down and unavailable.

> Due to this degradation your instance could already be unreachable

>:(

electroly 4 years ago | |

FWIW I don't think that message has anything to do with this outage. I think it's just a coincidence that you got some degraded hosts. They didn't send out emails like that for this AZ outage (nor would I expect them to -- that email is for when host machines die).

bobviolier 4 years ago |

Seems illogical that this is just a single AZ in a single US region. We are having issues pulling images from public.ecr.aws from an EU region.

saxonww 4 years ago | |

I don't know what's still true, but at one point us-east-1 seemed more critical than other regions because there were some things that had to be there. One thing that comes to mind is ACM certificates used with things like API Gateway (probably Cloudfront), they had to be in us-east-1 no matter where the rest of your infrastructure was.

So it's not shocking to me that something going down in us-east-1 could have impact on other regions.

l0b0 4 years ago |

Meta: I posted a "PyPI is down" link a few days ago, and the post got insta-flagged. Is there some rule about this sort of thing?

sswaner 4 years ago |

Not down as of 7:40 EST. US-EAST-1 hosted site (athene.com). Cognito, API Gateway, Lambda, S3, DynamoDB, RDS, CloudFront.

throwaway875487 4 years ago |

Our RDS instances have completely packed up. Hell knows what's going on. Here come the customer support tickets.

anonu 4 years ago |

Better polish off your BCP docs. People will be asking for them quite a bit more in the new year.

sprite 4 years ago |

My app running on AWS is currently down. Having intermittent problems with console as well.

dugmartin 4 years ago | |

I'm getting a plain "504 Gateway Time-out" page when trying to access anything past the console homepage in us-east-1.

stevehawk 4 years ago | |

also having console issues in us-east-1, bitbucket is randomly throwing bad gateways at me

streamofdigits 4 years ago |

Somebody call the IT department

allocate 4 years ago |

Also running a big production app in east-1 and we're experiencing issues.

sprite 4 years ago | |

I'm also in east-1 and completely down.

throwaway81523 4 years ago |

Ok, enough AWS outages to say I'm tired of hearing about low end stuff being flaky.

BiteCode_dev 4 years ago | |

"Don't use a self-hosted monolith, it's not reliable! You need a cloud FS with a load balancer under observability and your data in a db that scales horizontally, all orchestrated by kubs."

Meanwhile, I currently have a gig working on a video service which features a never-updated CentOS 6, an unsupported Python 2 blob website, and a push-to-prod deployment procedure, running a single Postgres db serving streaming for 4 million users a month.

And it's got years of uptime, costs 1/100th of AWS, and can be maintained by one dev.

Not saying "cloud is bad", but we got to stop screaming old techs are no good either.

osrec 4 years ago | | |

Purely out of interest, I'd like to know more about your streaming architecture. I assume postgres just holds the meta data, and the actual video content is stored elsewhere? What strategies have you employed to scale the streaming part of your service? I imagine 4 million users a month is quite a significant amount of traffic!

henriquez 4 years ago | |

Heroku isn’t “low end,” it’s a PaaS built on top of AWS. So you’re really just hearing about another AWS outage lol

christophilus 4 years ago | | |

They're not saying Heroku is low end. They're saying, "I'm tired of hearing that it's irresponsible to run your own servers."

At least, that's what I understood.

mijoharas 4 years ago | | |

This comment doesn't say anything about heroku?

jacob019 4 years ago | |

Right. I've had an excellent experience with Vultr for the last couple of years, for about 1/10th the cost of AWS. I use other small VPS providers as well. I run my own small business and I need to keep costs down to stay competitive. I used to use AWS more but the bill always creeps up to inappropriate levels. AWS billing is insulting: oh, you forgot to renew your reserved instance? That's going to be double this month. I still use CloudFront, Route 53, and a few of the smallest instances for mail servers and Asterisk though. It's foolish to go all in with AWS, or with anything really.

api 4 years ago | |

Nobody ever got fired for using AWS.

alecbz 4 years ago | | |

I wonder to what extent this actually becomes less of a problem the more people use AWS. At this point AWS being down just feels like "the internet is down", it's hard for customers to be too mad at any company being down when all their competitors are too.

Though I guess there's still probably just lost revenue that could be captured by having better uptime, even if your competitors are down.

trabant00 4 years ago | | |

True sad fact. I first thought it was a management problem, but lately I see it is the tech bros who push for fads in the hopes of staying relevant and not assuming responsibility for choices.

debarshri 4 years ago | | |

Today DO also went down. We briefly could not log in.

pxue 4 years ago | | |

maybe except a team at google? ;)

flatiron 4 years ago | | |

If you rely solely on east 1 maybe?

bognition 4 years ago |

What a way to start my day

300bps 4 years ago |

Can we please stop saying, “AWS is down”?

AWS consists of over 200 services offered in 86 availability zones in 26 regions each with their own availability.

If one service in one availability zone being impaired equals a post about “AWS is down” we might as well auto-post that every day.

omh2 4 years ago | |

AWS doesn't follow their own advice about hosting multi-regional so every time us-east-1 has significant issues pretty much every AZ and region is affected.

Specifically large parts of the management API, and IAM service are seemingly centrally hosted in us-east-1.

If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region

satya71 4 years ago | |

Seems enough services in us-east-1 are down to cause most apps to fail. My simple app uses 10s of AWS services, at least some of which are out.

300bps 4 years ago | | |

I may have seen more of these posts than you. The last one I saw where “AWS is down” was us-west-1.

KptMarchewa 4 years ago | |

Would be cool if this wasn't the region where AWS hosts their internals, making other regions unusable, right?

sawmurai 4 years ago | |

It's like my grandma saying "Honey, the internet is broken again." xD

biznickman 4 years ago |

Why isn't Heroku showing a status error despite being offline?

mikece 4 years ago | |

Because it's built on AWS and uses the AWS status page for its status info?

sreitshamer 4 years ago |

Console is sluggish for me, but S3 (us-east-1) seems to work fine.

ChrisMarshallNY 4 years ago |

I can't play Borderlands 3 this morning (Epic).

Wonder if it's connected?

13daug 4 years ago |

This S3 how you gonna get you investment back from it

networkisfine 4 years ago |

Isn't the point of the design of an availability zone having multiple data centers so that if a single data center in the availability zone fails, services aren't affected?

Demcox 4 years ago |

Imgur is suffering from this too, I think.

amai 4 years ago |

A problem with log4j/Log4Shell?

whoomp12342 4 years ago |

the cloud is great they said...

tomerbd 4 years ago |

Rumble was up all this time.

reactive55 4 years ago |

Bitbucket is down as well

exabrial 4 years ago |

Stat That.

quantumfissure 4 years ago |

Me: Hesitation at last job about moving absolutely everything (including backups) to AWS, because if it goes down it's a problem. I'm a firm believer in some kind of physical/easily accessible backup.

Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <-Quite literally their words.

Me: leaves cause that treatment was the final straw

Amazon and Facebook both go down within a month of each other, and supposedly they needed backups

Them: shocked pikachu face

CaptRon 4 years ago |

At least HN works.

sydthrowaway 4 years ago |

Switch to Azure

clavicat 4 years ago |

How much more frequent do these outages need to become before it starts triggering SLA limits?

sh4un 4 years ago |

Damn you all eggs in one basket.