Microsoft Azure Outage

spoils19 3 years ago |

It's good that Microsoft saved money via layoffs so that it balances out when customers leave Azure. Very forward-thinking company.

cryptonym 3 years ago | |

Leave to go where?

On-premise and being miserable having to wait months to get a new server with poor automation, observability and worse outages? To another major cloud provider with similar pricing and outages?

Cloud helped mostly with automation and scaling but if your system is that critical, you should consider a good CDN as load balancer and multi-cloud (or at least multi-region) for actual robustness.

jakewins 3 years ago | | |

AWS and GCP both have ~100% uptime in every region for VMs this month. Meanwhile the majority of Azure regions have had various outages in the same period: https://cloudharmony.com/status-of-compute

azfubar 3 years ago | | |

Almost certainly due to Azure's broken policy where we have critical change advisories that block deployments for huge periods of time towards the end of the year because of Black Friday and then the holidays. Every team has basically been unable to deploy since the week before Thanksgiving, when a surprise CCOA was pushed out by leadership at the behest of a certain big customer... then there was the World Cup and the winter holidays. Nobody could really deploy anything from a week before Thanksgiving until a week after New Year's... almost two months' worth of batched changes, and every team pressing the YOLO button as soon as they could in January.

And now layoffs so everyone is super unmotivated! Excellent stuff going on right now from Microsoft senior leadership.

fomine3 3 years ago | | |

Interesting. Can I see this longer than a month?

heinternets 3 years ago | | |

Right after an outage of course it will show like that.

After an AWS outage it would also look unfavourable for AWS, right?

barbazoo 3 years ago | | |

Wow I didn't expect the difference to be so obvious.

throwaway2037 3 years ago | | |

It is weird that this answer was downvoted. I agree. What a great page!

cm2187 3 years ago | | |

Where I worked, the internal approval processes and controls over cloud resources were as lengthy as those for on-premise hardware. So that may be the case for small companies, but I don't think there is much of a difference in those large bureaucracies.

dx034 3 years ago |

Shows that all these availability zones and regions don't really help if an outage can knock out a whole cloud provider. And that's not specific to Microsoft. The only way to really ensure uptime is to use two providers. Sadly, that's basically only possible with on-prem/colocation where traffic is cheap.

sofixa 3 years ago | |

It's mostly Azure though that is badly designed to such an extent that multiple times there have been global outages. In general Azure availability, security (the only major cloud provider with not one but multiple cross-tenant security exploits) and usability are pretty terrible so it shouldn't be used for anything but saying "this is how it should not be done".

GCP had a similar thing once, where a BGP update knocked out their Asian regions.

AWS have never had a global outage. (And no, that time S3 in us-east-1 was down wasn't a global outage, the only customer code/workloads that were impacted was code interacting with S3 that didn't specify the region and had to rely on us-east-1 to determine it, and it didn't work anymore)

Andys 3 years ago | | |

To be fair, AWS once had a global Route53 outage, which was effectively a global outage for anyone using AWS for DNS.

snorkel 3 years ago | | |

That outage was limited to Route 53 DNS record editing and not DNS lookups.

eurg 3 years ago | | |

Do you have a link to an article about that? My google-fu is weak, and this sounds interesting - that should not happen to DNS - at all - and from the outside Route53 looks quite well managed. So what the heck did they do?

codalan 3 years ago | | |

It was back in 2019.

https://twitter.com/AWSSupport/status/1186735657387003904

I forget the details. I do remember half of our internal tools not working at the time due to DNS issues, though. Good times.

wereallterrrist 3 years ago | | |

Someday someone will write a book about how AD, AAD, etc, exert the control they do at MS and go as unchecked (or at the time) as they do. AD's inability to execute made Azure a significantly less pleasant platform until they finally fixed accounts a couple of years back to properly do OAuth 2.0 with ARM.

Maybe the book is just "AD brings in the money" but wow, they sure bring it down as well. Global outages like that always stink of AD.

throwawaythekey 3 years ago | | |

There have been several CloudFront outages that have effectively been semi-global outages.

nosebear 3 years ago |

I'm hearing from four different friends from four different companies in Germany that they can't really work right now.

steve1977 3 years ago | |

If they were relying on Outlook and Teams to be productive, they probably couldn't really work before either.

hnarn 3 years ago | | |

What a naive comment. As if the only truly important jobs exist in engineering and require nothing but git and a book on C.

OscarDC 3 years ago | | |

I interpreted this comment as more of a jab at how inefficient Outlook and Teams themselves are as applications.

I don't know if it's the right interpretation, but I kind of agreed with it, considering the huge issues I've had with Teams (curiously, some of them only affect Linux users, which is weird considering that I only use the Teams web page) - not saying I could do better though!

choeger 3 years ago | | |

Yeah, what BS. Everyone knows that if you have a book on C, you can always quickly implement git yourself.

vikramkr 3 years ago | | |

I'm unsure what being in engineering has to do with using outlook and teams?

steve1977 3 years ago | | |

That wasn't my point. But tools like Teams kill more productivity than they enable, at least in my experience. If anything, I was more productive yesterday, because I got disturbed less.

dsign 3 years ago |

This makes you wonder if some centralization patterns, i.e. Azure AD, are not a national security problem?

ruffrey 3 years ago |

In the azure portal, it shows a "Routine Unplanned outage" - ??

rossdavidh 3 years ago | |

Well points for honesty, at least. :)

ugh123 3 years ago | |

I guess that's the 0.0001% of outage for an advertised 99.9999% uptime
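For scale, here is a rough sketch of what that figure would allow per year. The function name is made up for illustration, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Downtime budget per year, in seconds, for a given availability percentage."""
    return (100.0 - availability_percent) / 100.0 * SECONDS_PER_YEAR


# "Six nines" leaves roughly half a minute per year.
print(f"{allowed_downtime_seconds(99.9999):.1f} s")
```

By this estimate, an outage lasting hours would use up far more than the 0.0001% budget.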

funnymony 3 years ago | |

At least they have a sense of humor

idk1 3 years ago |

Does this mean they need to rebrand, because it's not up 365 days of the year? Maybe rebrand it to Microsoft 364.5?

altairprime 3 years ago | |

There are 365.2425 days per year, and a six-hour outage is just about 0.2425 days, which suggests that they can still declare 365 when considering this specific outage only.
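A quick check of the arithmetic, treating the 0.2425 as a fraction of a day (the helper name is only for illustration):

```python
# Mean Gregorian year length, as quoted in the comment above.
DAYS_PER_YEAR = 365.2425

fractional_day = DAYS_PER_YEAR - 365       # the leftover 0.2425 days
fractional_hours = fractional_day * 24     # about 5.82 hours


def fits_in_fraction(outage_hours: float) -> bool:
    """True if the outage is no longer than the leftover fraction of a day."""
    return outage_hours <= fractional_hours


print(f"{fractional_hours:.2f} hours")
```

So the leftover is about 5.82 hours, and a full six-hour outage overshoots it by roughly eleven minutes.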

ericpauley 3 years ago | |

I think the joke always went that they should rename it Microsoft 360.

hobofan 3 years ago |

Not sure if it's directly related, but GitHub is also experiencing issues: https://www.githubstatus.com/

marvinblum 3 years ago | |

"We are investigating reports of issues with Actions. This looks related to Azure networking issue which is impacting multiple regions. We are seeing improvements and will continue to monitor this."

ricc 3 years ago | |

GH has been a Microsoft company since 2018...

quickthrower2 3 years ago | | |

Good to see GH is eating the dog food

kgdinesh 3 years ago |

At work, we all got kicked out of a Teams meeting an hour ago, and sending/receiving e-mails on Outlook seems to be slow.

Location: Chennai, India

midasz 3 years ago | |

This is going to be the most productive day ever

sli 3 years ago |

Every Azure product I've had to use has been lousy in every possible way. Azure DevOps at my last employer was a nightmare and nobody in the company liked it, not even the managers who decided on it.

BLKNSLVR 3 years ago | |

I've been learning / using DevOps for the past four months and find it "quite good", and have previously used Jira, although not in great detail.

I'm making the effort to learn it in increasing detail as it's the company-wide chosen system. I'm interested to know what made / makes it a nightmare for anyone else.

(And I'm no fan of Microsoft as a whole)

telcal 3 years ago | |

I use Azure DevOps daily and honestly have no issues, it works well. What didn't work for you?

reset-password 3 years ago |

I have some Azure services that are not able to consistently make outbound HTTP requests to my heartbeat monitoring service so I'm getting alert after alert this morning. This is just the nudge I needed, and I'll be moving the whole thing to Linode later this afternoon.

alkonaut 3 years ago |

Wouldn't it be quite simple to set up an unofficial status page that just pings some relevant services and, at least when they have a disastrous outage, shows it?

Because I think it's clear that their status page is useless and "manual".
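A minimal sketch of such a probe, using only the Python standard library. The endpoint list and the "5xx or no response means down" rule are assumptions for illustration, not a description of any real status page.

```python
import urllib.error
import urllib.request

# Illustrative endpoints only; swap in the services you actually depend on.
ENDPOINTS = ["https://portal.azure.com/", "https://dev.azure.com/"]


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0          # DNS failure, refused connection, timeout, ...


def classify(status: int) -> str:
    """Treat no response or any 5xx as 'down'; everything else as 'up'."""
    if status == 0 or status >= 500:
        return "down"
    return "up"           # 4xx usually just means the probe is not logged in


def summarize(statuses: dict[str, int]) -> dict[str, str]:
    """Map each probed URL to 'up' or 'down'."""
    return {url: classify(code) for url, code in statuses.items()}
```

Running `summarize({url: probe(url) for url in ENDPOINTS})` on a schedule from a machine outside the provider being monitored would give the raw data for a page like this.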

alkonaut 3 years ago |

It comes and goes. Teams and Azure DevOps sometimes work perfectly for a few minutes, then respond with all 503s for a few minutes.

saikatsg 3 years ago |

> We've identified a potential networking issue and are reviewing telemetry to determine the next troubleshooting steps. You can find additional information on our status page at https://msft.it/6011eAYPc or on SHD under MO502273.

ricardobayes 3 years ago | |

I'm so surprised by MS's strategy of using random domains and TLDs; this certainly doesn't make phishing avoidance easy.

noinsight 3 years ago | | |

If you implement an allowlisting proxy, the number of required domains for M365 / Azure is something like 120 [1]. Google basically requires three, tunnel.cloudproxy.app, *.google.com and *.googleapis.com. Amazon requires *.aws.amazon.com, *.amazonaws.com, *.awsstatic.com, *.api.aws and *.aws.dev.

Microsoft has some great domain planning.

[1] https://learn.microsoft.com/en-us/microsoft-365/enterprise/u...
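To illustrate what a wildcard allowlist check looks like, here is a small sketch using the Google and Amazon patterns quoted above. Shell-style matching via `fnmatch` is an assumption here; real proxies may interpret wildcards differently.

```python
from fnmatch import fnmatchcase

# Patterns quoted in the comment above for Google and Amazon.
ALLOWLIST = [
    "tunnel.cloudproxy.app",
    "*.google.com",
    "*.googleapis.com",
    "*.aws.amazon.com",
    "*.amazonaws.com",
    "*.awsstatic.com",
    "*.api.aws",
    "*.aws.dev",
]


def is_allowed(host: str, patterns=ALLOWLIST) -> bool:
    """True if host matches any allowlist pattern (case-insensitive)."""
    host = host.lower().rstrip(".")  # normalise case and a trailing root dot
    return any(fnmatchcase(host, pattern) for pattern in patterns)


print(is_allowed("www.google.com"))   # matches *.google.com
print(is_allowed("evilgoogle.com"))   # no dot before "google.com", so no match
```

Eight patterns fit on one screen; a list of 120 is much harder to audit by eye.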

ricardobayes 3 years ago | | |

My point is MS uses a lot of unrelated domains that are very different from the main brand; even the one above looks dodgy (msft[.]it). From your list, microsoftonline-p[.]com is an official domain, but it looks like a typosquat. I think it's quite far from "great domain planning".

adql 3 years ago | | |

> I think it's quite far from "great domain planning".

The poster saying they have 120 of them would imply that it was sarcasm.

robertlagrant 3 years ago | | |

They appear to be being sarcastic. I don't think anyone would be seriously saying 120 is better than 3 or 6 domains.

tenplusfive 3 years ago | | |

Luckily Microsoft also provides a service for that: Safelinks https://learn.microsoft.com/en-us/microsoft-365/security/off...

Also a personal favorite of mine: http://microsft.com (not entirely sure if it's just to prevent typosquatting or if this is actually used in some products)

luckylion 3 years ago | | |

I don't know whether it's a typo but https://support.microsoft.com/en-us/topic/contact-us-91f63b4... lists "EOC: criskgro@microsft.com (For CEE and MEA)" under the Microsoft Credit Services. It feels like a typo, but who knows. If they don't have anything in place to catch this type of error, it's probably a good idea to register every domain someone could accidentally type.

fomine3 3 years ago | | |

There's no MX record on the domain, so it seems to be a typo.

jiggawatts 3 years ago | | |

microsft.com was used specifically for telemetry to bypass web proxy blocks for *.microsoft.com put in by administrators of secure networks.

I know this because I was one of those admins trying to plug the leaks.

Windows 10 + Office uses 200+ domains just for Microsoft stuff, of which something like 120 are for telemetry.

ridgered4 3 years ago | | |

And I imagine they add new domains with updates all the time.

At home I was trying to avoid random reboots from updates in a foolproof way in a Windows VM that ran long processing tasks. I determined the only reasonable course of action was to remove all internet access. Stamping out the massive list of changing domains (and hard-coded IP addresses?) would just be too much work, and I know I would never keep up with it.

A white list might work.

I mused that you could have a constantly updating Windows machine and monitor all of its connections, adding them to a block list on an external firewall, but in addition to being complex to set up, I bet it wouldn't even catch everything.

zerohp 3 years ago | | |

Yet people continue to defend Microsoft's telemetry practices. The OS won't let you opt out without it fighting you and they'll even fight you for blocking it on the network.

Windows is spyware.

joecool1029 3 years ago | | |

The .it ccTLD is especially bad. Almost all of the generated SEO spam links to malicious ad networks I get on search pages are .it domains, all written in machine English, not Italian. Thanks for reminding me; I just discovered that -site:.it works in search queries to filter it out.

Tepix 3 years ago | | |

Makes sense to use a different domain if everything is down, because the outage could also affect DNS for the main domain.

wiradikusuma 3 years ago | | |

I think what the OP is saying is, if you have multiple random domains, how would people know which ones are legit (or not)? Say I have mixxxrosoft.com, how would you know this is one of MS' official domains?

latchkey 3 years ago | |

It is often very difficult to test networking changes in production. For example, firewall rules. What sort of tools do people use for this?

ChickeNES 3 years ago |

Does the Internet Archive use Azure? archive.org is throwing 503s

voytec 3 years ago | |

Two weeks ago they were affected by the Elasticsearch outage[1], too.

[1] https://news.ycombinator.com/item?id=34337518

braymundo 3 years ago |

DuckDuckGo is also affected (blank search results).

ochrist 3 years ago |

https://downdetector.dk/ indicates several MS products and services are having problems. Here is the status from MS on Twitter: https://twitter.com/MSFT365Status/status/1618149579341369345 Edit: Added this link which apparently is the new status page and seems to be updated: https://status.office365.com/

danjc 3 years ago |

Auth via Microsoft ID is degraded, our platform is blipping (cache retries, message retries due to packet loss), access to the Azure portal is degraded and the Azure status page isn't loading consistently.

LilBytes 3 years ago |

Nothing is working for me, Oceania/Australia.

Including O365, Azure, Azure DevOps.

kornish 3 years ago |

Ah - so that's why GitHub Actions are unreliable right now.

Benjamin_Dobell 3 years ago | |

Glad it wasn't just me. I was waiting over 10 minutes for a hosted runner.

quickthrower2 3 years ago | | |

Such a late 2010s / 2020s problem :-(

adql 3 years ago |

Office359 strikes again

Yuioup 3 years ago | |

You mean Office364

DoctorDabadedoo 3 years ago | | |

Everyone deserves a break between Christmas and New Year's, even the folks at MS! /s

wrldos 3 years ago | | |

0<Office<365

cube00 3 years ago |

> The issue is causing impact in waves, peaking approximately every 30 minutes.

Does anyone have any general ideas on what kind of outage manifests itself like this? Devices retrying to authenticate every 30 minutes and finding the service is down perhaps?

urbandw311er 3 years ago | |

Can sometimes be scaling/monitoring loops. i.e. cluster comes up, provides some limited service, gets overloaded and drops below required performance metric, gets killed by monitoring/scaling system, repeat...
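A toy model of that kind of loop, to show how it produces periodic waves. Every number here is made up, and nothing in it is a claim about what actually happened in this outage.

```python
def simulate(ticks: int, warmup: int = 3, capacity: int = 10,
             demand: int = 15, kill_backlog: int = 20) -> list[str]:
    """Return the cluster state at each tick: 'starting' or 'serving'."""
    states = []
    backlog = 0
    starting_left = warmup
    for _ in range(ticks):
        if starting_left > 0:          # cluster is still coming up
            states.append("starting")
            starting_left -= 1
            continue
        states.append("serving")
        backlog += demand - capacity   # overloaded: the queue grows every tick
        if backlog > kill_backlog:     # performance metric breached
            backlog = 0                # monitor kills and restarts the cluster
            starting_left = warmup
    return states


print(simulate(16))
```

With these defaults the cluster starts for 3 ticks, serves for 5, gets killed, and repeats, so clients see impact in waves with a fixed period of 8 ticks.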

osivertsson 3 years ago |

Many games that use Azure PlayFab are down as well due to this. Both PlayFab services and PlayFab MPS game-server hosting are currently broken.

https://status.playfab.com/

kemals 3 years ago |

ThousandEyes public outage map shows the scale of the Office365 outage: https://www.thousandeyes.com/outages/

hansamann 3 years ago |

DuckDuckGo.com - no search results showing up at all... are they on Azure?

pred_ 3 years ago | |

Yes.

    $ dig +short duckduckgo.com | xargs whois | grep Organization
    Organization:   Microsoft Corporation (MSFT)

atom058 3 years ago | |

They get their search results from Bing

jupiterblues- 3 years ago |

Minecraft, Azure, Office 365, etc... MS cloud services have issues