Microsoft Azure Outage (twitter.com)
Oh please. Azure is plenty capable of taking themselves offline on their own.
Azure, Teams, Outlook are almost down from Greece and Germany, and their status page shows that everything is fine :-)
It's about contractual obligations and SLAs. Things are not officially down in most agreements until MSFT acknowledges they're down. Refunds issued to your largest customers because your blob storage failed to meet 99.9999% uptime are directly tied to these statuses.
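To put the uptime figure in perspective, here is a rough sketch of how little downtime such a target leaves room for. It assumes a 30-day billing month and a simple "percentage of elapsed time" definition of uptime; real SLAs define the measurement window and exclusions in their own fine print, and the function name is just illustrative.

```python
# Allowed downtime for a given uptime target over a 30-day month (illustrative).
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day billing month


def allowed_downtime_seconds(uptime_percent: float,
                             period_seconds: int = SECONDS_PER_30_DAY_MONTH) -> float:
    """Return how many seconds of downtime fit inside the target for the period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_seconds * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Compare a few common targets with the one quoted in the comment.
    for target in (99.9, 99.99, 99.9999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target}% uptime leaves {seconds:.1f} s of downtime per 30 days")
```

With a budget that small, whether an incident is officially acknowledged or not decides whether the target was met.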
I'm joking, but...
I guess many developers do not use Azure voluntarily but are forced to by their companies (or customers).
It's utter shit of a service. Even worse if you need to write integrations for it
I tried it a couple of years ago. After finishing the trial, I removed all instances and disks, supposedly completely blanking the account. And also supposedly deleted the account.
To this day, I still keep receiving some kind of invoice for about $2 USD that they say I owe. And when I log in to the "Oracle Cloud account" nothing works because my account seems to be half-deleted (I get error screens when accessing several of their piece-of-shit panels).
To make things worse, I suddenly started receiving emails in Portuguese from some of their sales team. I guess my last name sounds kind of Portuguese, so someone said, "yeah, you write to him."
And while using their system I was not really impressed. Their cost structure was weirder than AWS (and that's saying something) and to mount a volume in an instance you had to do some funky commands.
I would NEVER trust business technology to that sort of system.
Shit is expensive as hell
For the same money I could rent some weak linux box for a year
Or something decent for a month
Edit 10ms
You do realize what you set up in that tutorial, right? A Kubernetes cluster with 11 full-scale microservices that are dimensioned so they can serve the average medium-size business. For a mere hobby this is huuuuuuugely overdimensioned.
If you were to do the same on Azure, it would cost more. If you are comparing it to a cheap Linux box, what the hell are you using Kubernetes clusters for then?
If somebody gives you keys to a Ferrari, don't blame the manufacturer when you drive it off a cliff at 120 miles/hour...
The monthly $60-$100 developer credit was fantastic as well. It avoided the usual fighting for approval/budget to test things out.
Add to that that AWS doesn't really engage in the normal business-to-business sales process but simply gives you a price list and tells you "that's what it costs" pretty much straight up, and it's no surprise a lot of traditional enterprises with huge existing Microsoft bills end up with the vendor they know, understand and think they can control.
It's not that there is anything really wrong with AWS: their support is good and their products work. But it's a messy platform where you really need to pay attention, and you might even need to engage a consultant to fully understand what you're paying for and how optimization decisions are affecting your ROI, as everything is priced individually in AWS whereas Azure does a bit more bundling into packages.
Since many companies rely on it, especially for role-based access to internal resources, you can't avoid it as a developer/employee.
Microsoft has just found out how to sell Azure: scare compliance teams that AWS and GCP are horrible, especially in the EU and in banking. Use their Office monopoly to give huge discounts if you buy as a bundle, and be awesome on comparison charts. They check all the boxes of services they offer. For an exec, it doesn't matter how well those services are executed; that's a developer problem that a system integrator will solve.
And yet it continues to rake in billions and grow 20-40% year over year (even if it is slowing).
While Google, Amazon and others were busy complaining about GDPR, Microsoft was busy working on being compliant, with the result that today they're pretty much the only legal/compliant solution in most of the EU.
The more regulated the industry (health, finance, etc), the more you can be certain that it's running on Azure if it's EU based and running in the cloud.
All to say, I agree wholeheartedly with every word.
Did Windows ME and Windows Vista also work really great for you?
We have actively pushed for AWS or even GCP but it's futile when it doesn't align with business. I'd imagine a lot of developers are facing the same company issues.
Azure is a chore compared to AWS.
The Ferrari analogy would be something like being billed 100 USD for a 1-minute ride.
I've run it 4 times, for 5 minutes each, plus the time needed for it to wake up.
All I'm saying is that it is expensive for such a small amount of usage.
Sure, buyer beware but is it reasonable that a clearly marked demo project is set up with services to that level of resourcing?
Nobody is going to take a demo like that and start running a business off it tomorrow.
Windows Vista was honestly worse for me, not due to bugs but for being two years ahead of the hardware curve, with GPU vendors seemingly twiddling their thumbs during the betas; once WDDM¹ went live, they panicked and rolled out alpha-quality work. So many driver crashes, compounded by the heavy RAM requirements... Other than that, and with less of a UAC nazi, I could see an OS that was similar to what Windows 7 became if I squinted. By then hardware had caught up, drivers were mature, and on top of that Microsoft had optimized its performance.
In hindsight, WDDM should've been an update to Windows XP that could be rolled out well in advance and let developers focus on a single thing rather than new OS compatibility on top, and deep changes like UAC.
¹ It was necessary work though: https://en.wikipedia.org/wiki/Windows_Display_Driver_Model
As for Vista, while I did not use it in its day I can tell its problems were far more to do with crapass hardware manufacturers and their crapass drivers. Vista with access to 7's drivers and hardware runs just fine.
I think it's an important enough page that it can't be automated. It needs a manual approval from a human, for the very basics, like even if the status reporting system is operating correctly, because of various downstream effects.
If it's a false positive they just resolve it without it affecting SLA and if it's a real problem then us customers wouldn't have to debug our own stack for 2 hours before Microsoft informs us that they are the problem.
EDIT: Wonder how many man-years of extra debugging work their non-working status page has caused customers.
Works equally well. See the point?
(1) The monitoring system would be altered to ignore tests that return false positives (at the expense of missing the alert when it represents an outage).
(2) Fixing the monitoring. It wasn't working for the sysadmins/operators, anyway, since it had so many false positives that their "mental model" was essentially based on (1), anyway.
At least, where I've forced the issue of doing just this, that's exactly what happened. At the end of the day, especially since SLAs took a hit and that affected bonus payouts, monitoring got a lot better -- as did overall team function when we truly realized how bad things were -- we stopped doing workarounds and started fixing problems at a more fundamental level which led to SLAs that were both accurate and excellent.
It helped bring attention to a hidden problem which resulted in time being allocated to fix tests that dropped constant false-positives and to evaluate each for whether or not it should exist in the first place.
And so updates to the status page become political and locked behind senior management approvals... like AWS.
Is it free or not?
So many half baked features and legitimate bugs in their platform that they either don’t fix or take years to fix.
       Azure
       /   \
  Azure     \
  DevOps----- Teams
Developers developers developers!
- The interface is laggy
- Scrolling back in long messages is buggy, it often skips around and loses its place
- No built in "whiteboarding" tools in screen sharing
- Teams will often keep ringing on my phone for up to a minute after I picked up a call on my laptop
- Sometimes I can't click reactions on messages. I click the emoji and nothing happens
Overall, it's just poorly made software. It feels like something that was made by a couple of interns in their spare time, not a keystone product from a multi-billion dollar company
Azure, just like other cloud services (I've used AWS, but as I understand it GCP is the same), doesn't believe in timely billing. You can and will receive charges against an account for services that were turned off yesterday, the day before, even last week, as billing gradually catches up to reality. This means that there is no way to actually cap a budget. If you decide "Once this costs $100 I'm turning it off", you are not capping your expense at $100: after you turn it off, charges keep arriving. I've seen them arrive a week later, and I wouldn't be surprised if it can be longer. Should they do that? Well, even if they shouldn't, good luck making them stop.
But with the "free" Azure credits that have no money behind them, when it drops dead Microsoft eats all the residual charges that will be discovered days or weeks later, because there is no other party for them to bill.
I work for a University, I suspect that if you paid full price for these services it makes no economic sense, a $100 Azure credit that cost $100 is a bad deal, but the University gets an enormous discount, for obvious reasons, and if the other cloud vendors don't want to offer actual billing it does feel like they deserve the consequences.
The first analogy I thought of was the stories about drug dealers giving away free samples to schoolchildren to get them hooked before asking for money.
It also offers budget caps, but indeed, those are more of a warning than a hard shutdown. That's annoying. It's the same at Microsoft, by the way, except for that developer credit acting as a failsafe.
Google gives 100k in free credits to universities and startups, by the way (and even to individual departments if you are a big university). You just have to apply and let them bring in trainers, and you have to actually use a percentage, otherwise they take it away the next year.
It sounds to me like some legacy Windows 2000 spaghettini fettuccini is powering some parts of Azure.
For Cloud to make economic sense, you need to treat it very differently from traditional infrastructure. For example, simply shutting down our Dev environment outside of business hours means we're not paying for the compute the majority of the time.
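As a back-of-the-envelope check on that claim, here is a small sketch. The schedule (10 hours a day, 5 days a week) and the hourly rate are made-up assumptions, and it only models compute that is billed while running; storage and other always-on charges are not covered.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def running_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week an environment runs under a fixed schedule."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule out of range")
    return hours_per_day * days_per_week / HOURS_PER_WEEK


def weekly_compute_cost(hourly_rate: float, hours_per_day: float,
                        days_per_week: int) -> float:
    """Weekly compute cost, assuming billing stops while the environment is off."""
    return hourly_rate * HOURS_PER_WEEK * running_fraction(hours_per_day, days_per_week)


if __name__ == "__main__":
    # Hypothetical schedule and rate, purely for illustration.
    fraction = running_fraction(hours_per_day=10, days_per_week=5)
    print(f"Dev environment runs {fraction:.0%} of the week")
    print(f"Always-on cost per week: {weekly_compute_cost(2.0, 24, 7):.2f}")
    print(f"Scheduled cost per week: {weekly_compute_cost(2.0, 10, 5):.2f}")
```

Under those assumptions the environment is off for well over half of the week, which is the "majority of the time" the comment refers to.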
Documentation lies, support lies, metrics lie, bugs everywhere, and when something breaks the status page is always all green and support tries to convince you it's your fault anyway. They're only here to prevent you from enforcing the SLA. The distrust is pervasive. I stopped suspecting my code, if something breaks outside of a planned maintenance it is _always_ Azure.
My latest support ticket: Azure App Service internal DNS server broke and there is no way to bypass it short of hardcoding IPs in /etc/hosts. Support told me that if I wanted App Service to work reliably I had to implement their DNS server myself. To rephrase, my PaaS provider told me to spend time and money to implement the very platform I was paying them for, and it just so happened to be absolutely impossible because of an unannounced BC break a few months prior (which is another lengthy and frustrating story).
This morning I had a VM cut out of the network and 10% of my App Service traffic just disappeared. No explanation, no incident report, nothing.
These days I'm working with AWS, and it just works. If something isn't working you know it's your fault and that the answer is in the documentation. I'm not spending days on workarounds, I'm actually implementing as planned. I have no words to describe the relief I'm feeling.
- If you need scale, you pick AWS or Azure (GCP doesn't have the same scale, and is catching up)
- If you are a retailer, you don't pick AWS, because you're a competitor and they'll use whatever nasty (but legal) tricks to eat your lunch money
- Windows stack workloads seem to run better on the AWS virtualization stack
- Linux stack workloads seem to run better on the Azure virtualization stack
- GCP has great integrations/automation/api, AWS is pretty good too
- AWS has great support
- GCP has terrible support
- Azure is somewhere in between the two above in terms of support
It depends what is important to you.
Bonus chatter: Oracle Exadata is an unmatched force to be reckoned with, but OCI as a whole doesn't have their shit together.
Lots of MSSQL and PowerBI licenses, lots of other Windows env features. Great deals to bundle those in w/ Azure deployments.
Great pricing too -- for the first 3 years. But at 4 years...
It's weird how slow they are with manual sign-off though.
If you work with Microsoft, you might as well spend a few bucks extra and have an external monitoring system monitor Microsoft's systems so you get real-time third-party confirmation when your monitoring alerts you of issues concerning your system. It's the price you pay for scale, I guess. More money involved = more lawyers involved = more accountants involved = more MBAs involved = more corporate bullshit.
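A minimal sketch of what such an external check could look like, using only the Python standard library. The URLs are hypothetical placeholders, "healthy" is assumed to mean an HTTP status in the 2xx or 3xx range, and a real setup would run this from infrastructure that does not depend on the provider being monitored.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch url and return the HTTP status code; 0 means no response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def probe(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Return True if the endpoint looks healthy (assumed: status 200-399)."""
    return 200 <= fetch(url) < 400


def failing_endpoints(urls: Iterable[str],
                      fetch: Callable[[str], int] = http_status) -> List[str]:
    """Return the subset of urls that did not look healthy."""
    return [url for url in urls if not probe(url, fetch)]


if __name__ == "__main__":
    # Hypothetical endpoints; replace with the services you depend on.
    endpoints = ["https://login.example.com/health", "https://api.example.com/health"]
    for url in failing_endpoints(endpoints):
        print(f"DOWN (as seen from outside): {url}")
```

The `fetch` parameter is injectable so the logic can be exercised without network access.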
In 7 years we had one AWS AZ outage and we didn't even notice because our monitoring platform in there couldn't reach the network (learned something!). But nothing broke. Even the us-east-1 outages didn't affect us.
We had to switch everything to SSD to get reliability comparable to on-prem VMware.
That's not the goal.
> It's hard to see how the goal here could be anything other than trying to add plausible deniability for what would otherwise be obvious deception
That's the goal. The "status page" is considered the source of truth for most of the big contracts. If status-page=OK then your contract with them isn't violated. So changing the status page is a big deal, with real financial implications. The status page isn't a view into the SREs' tickets; it's a declaration that the service isn't being provided.
I disagree. What if you're having issues and the status page is incorrectly reporting an incident? It would be easy to waste a load of time waiting for the status page to sort itself out, only to find out you've still got an issue.
Nobody is under any illusion that Microsoft just really likes universities for some reason. But on the other hand, we did need lots of this stuff and it's very cheap, budgets are tight and it's not as though hand-rolling even more stuff would be cheaper - we do hand roll some things where it makes sense.
For example, periodically senior people say "Why do we spend $$$$ on a supercomputer? Surely we could rent one from the cloud?" and we (well, not me, different group same department) go OK, we will cost that for you. And they get Azure, Google, etc. to quote them for what they need a supercomputer to do, and then they present this, "The Cloud providers can do that for $$$$$". Ah, that's more money. No thanks, we will continue to run our own supercomputer.
It's not even close. Cloud supercomputer is great if you need the supercomputer for six weeks to do a special project and then you're done with it, the Cloud provider saves you a lot of money. But the University needs supercomputers all the time, so the numbers do not work.
On-premise and being miserable having to wait months to get a new server with poor automation, observability and worse outages? To another major cloud provider with similar pricing and outages?
Cloud helped mostly with automation and scaling but if your system is that critical, you should consider a good CDN as load balancer and multi-cloud (or at least multi-region) for actual robustness.
GCP had a similar thing once, where a BGP update knocked out their Asian regions.
AWS have never had a global outage. (And no, that time S3 in us-east-1 was down wasn't a global outage, the only customer code/workloads that were impacted was code interacting with S3 that didn't specify the region and had to rely on us-east-1 to determine it, and it didn't work anymore)
Maybe the book is just "AD brings in the money" but wow, they sure bring it down as well. Global outages like that always stink of AD.
Location: Chennai, India
I'm making the effort to learn it in increasing detail as it's the company-wide chosen system. I'm interested to know what made / makes it a nightmare for anyone else.
(And I'm no fan of Microsoft as a whole)
Because I think it's clear that their status page is useless and "manual".
Microsoft has some great domain planning.
[1] https://learn.microsoft.com/en-us/microsoft-365/enterprise/u...
Also a personal favorite of mine: http://microsft.com (not entirely sure if it's just to prevent typosquatting or if this is actually used in some products)
Including O365, Azure, Azure Devops.
Does anyone have any general ideas on what kind of outage manifests itself like this? Devices retrying to authenticate every 30 minutes and finding the service is down perhaps?
This made me laugh out loud. I'm working in a multi-tenant, multi-subscription environment with Azure AD just now. MS force you to use 2FA and I picked the wrong 2FA app.
Now it's completely and utterly comical trying to work out which generated 2FA auth code I need to key in when auth'ing in Visual Studio because there are absolutely no visual cues as to which subscription it's trying to authenticate to. You can't tell VS that "I'm only interested in auth'ing to this particular subscription". Now it prompts me for almost every subscription we use and it's a whack-a-mole experience. They really need to fix the UI/UX in VS for this.
Of course when it comes to mandatory password change time I have to go through this pain all over again.
What a PITA.
https://azure.microsoft.com/en-us/explore/global-infrastruct...
What might be happening is that there is fine print you have to read and be in compliance with in order to be eligible for the SLA.
For example, look at all the conditions which have to be met for a breach of VM SLA in Azure:
https://azure.microsoft.com/en-us/support/legal/sla/virtual-...
Hidden in the SLA details are typically hints on how you can become more resilient in the cloud. So it pays to read the SLA details and really deeply understand what they are telling you.
This was circa 2018 but AWS was so much more stable at that time. Ok, US-E-1 AWS had issues from time to time but they acked them and fixed them
Our AWS reps are all over stuff when it goes down. I regularly get to talk to actual real product managers and engineers via our enterprise support if anything goes wrong.
You'd really have to try to make it so screwy.
It's kind of a shame. Like most things, Azure was better when it was smaller. I loved the first version of Functions.
I don't know if it's the right interpretation to have, but I kind of agreed with it, considering the huge issues I had with Teams (curiously, some of them only affect Linux users, which is weird considering that I only use the Teams web page) - not saying I could do better though!
The poster saying they have 120 of them would imply that it was sarcasm.
And now layoffs so everyone is super unmotivated! Excellent stuff going on right now from Microsoft senior leadership.
After an AWS outage it would also reflect unfavourably on AWS, right?
https://twitter.com/AWSSupport/status/1186735657387003904
I forget the details. I do remember half of our internal tools not working at the time due to DNS issues, though. Good times.
I know this because I was one of those admins trying to plug the leaks.
Windows 10 + Office uses 200+ domains just for Microsoft stuff, of which something like 120 are for telemetry.
At home I was trying to avoid random reboots from updates in a foolproof way in a Windows VM that ran long processing tasks. I determined the only reasonable course of action was to remove all internet access. Stamping out the massive list of changing domains (and hard-coded IP addresses?) would just be too much work, and I know I would never keep up with it.
A white list might work.
I mused that you could have a constantly updating Windows machine and monitor all of its connections, adding them to a block list on an external firewall, but in addition to being complex to set up, I bet it wouldn't even catch everything.
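For what the allow-list idea might look like in practice, here is a rough sketch of default-deny hostname matching. The domain names are hypothetical, and a name-based check like this would not cover connections made directly to hard-coded IP addresses, so it is only one piece of the puzzle.

```python
from typing import Iterable


def is_allowed(hostname: str, allowed_domains: Iterable[str]) -> bool:
    """Default-deny check: allow a host only if it is, or is under, an allowed domain."""
    host = hostname.lower().rstrip(".")
    for domain in allowed_domains:
        base = domain.lower().rstrip(".")
        # Match the domain itself or any subdomain, but not look-alike suffixes
        # such as "badexample.com" for "example.com".
        if host == base or host.endswith("." + base):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical allow list for a machine that only needs one service.
    allow_list = {"example.com", "updates.example.org"}
    for name in ("api.example.com", "telemetry.example.net", "badexample.com"):
        verdict = "allow" if is_allowed(name, allow_list) else "block"
        print(f"{name}: {verdict}")
```

The appeal over a block list is that unknown or newly added domains are blocked by default instead of needing to be discovered first.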
Windows is spyware.
I don’t yet have enough context to fully evaluate it against Cognito. It may end up being nice to have B2C as a first-class AAD tenant, but until I get far enough along to realize those benefits, there will be a lot more cursing under my breath about the need for another layer of identity and the lack of control plane access through Azure Resource Manager APIs/tooling.