IBM Cloud was down, as well as their status page (cloud.ibm.com)
Fastly error: unknown domain: www.ebay.com. Please check that this domain has been added to a service.
The cable TV channel is still independent.
IBM Cloud - unsafe
At least AWS signs their routes I think.
If you can't even sign your own routes - hard to have a ton of pity.
So it appears to affect anyone who depends on IBM Cloud.
Maybe it helps with doing a sanity check before picking a provider. And, I guess, at a basic level it helps with accountability/transparency.
Do you have similar %'s of monitored cloud services that have gone off the air during other providers' outages?
(I figure either you're in devops and too busy putting out fires to read this thread, or you're not and your work is halted because of the incident, so you might have time to read and reply ;)
Two of the biggest advantages were:
Price for hardware. As a base price, their bare-metal gear was significantly cheaper than equivalent-specced AWS gear (if it was even possible to get something like that). We managed to snag quite a few 'interesting' configurations of things at various times that you just couldn't get at all in AWS. Things like PCI SSDs, very large RAM configs, or High-Frequency low-core count CPUs.
Free international/regional transfer. We took significant advantage of this to move data around. We'd replicate TBs of data around.
At various times management and dev teams would complain and say that we should move everything to AWS (or whatever cloud provider they'd just met with at a conference).
We consistently showed higher performance and lower cost by significant margins. On cost alone, we were paying a small fraction of what it'd cost on AWS, even after taking into consideration ways to reduce cost on AWS such as scaling, spot instances and reserved-instances.
We had a couple thousand bare metal servers, and barely used any of their API stuff.
As with any facility, there were occasional issues with electrical transfer switches, core router failures, fiber cuts, etc. Stuff happens, but we got pretty good communication, and things got resolved in a reasonable amount of time. Service got noticeably worse after IBM, but we were already planning to move to our acquirer's hosting, because that's what happens when you're acquired. Oh, and their load balancers had garbage uptime.
Bandwidth prices used to be pretty reasonable, but they've adopted AWS style obscene pricing. At least they still let you use the private network for free (including to other datacenters).
MS and Google do provide those features though.
Over the past few years we have experienced quite a few network-related outages. Not usually to this extent, more generally a failure of some piece of network gear that takes out either backend or frontend traffic from a particular data center. We seriously priced out a migration to another provider recently, but in the end what held us back was cross-AZ transfer costs on AWS. We found it would raise our operating costs significantly, so the matter was dropped.
We were much happier with the service and support we received prior to the IBM acquisition.
Currently on them because we have an OpenVPN based infrastructure that is very challenging to migrate.
Lastly the majority of our customers are in the midwest or Texas, and the proximity of their Dallas DC was a huge performance win for us.
In small and mid-size organizations, the CSP gave better pricing, or they help with your sales, etc.
In large organizations, IBM/Oracle bundle it with their existing products that are being paid for anyway, or account managers have great relationships with decision makers, or the company has already signed big multi-year deals.
This is not just IBM, it applies to GCP/Azure/AWS as well.
I also really like CouchDB which IBM Cloudant is based on.
Is that enough for me to use IBM Cloud? No, not really.
I'm going to wait a bit to see if we get a status update; otherwise we'll be spinning up instances on AWS to fail over (which will be enormously costly for bandwidth).
No status, no nothing, we're in the dark.
Literally this morning I was wondering what ever happened to it, like did it die a quiet death? Oh it rebranded to IBM cloud in 2017. Now this news.
I think there's an eponymous law named for this sort of thing.
https://cloud.ibm.com/status?selected=history
- 2020-06-10 02:19 UTC - RESOLVED - The network operations team adjusted routing policies to fix an issue introduced by a 3rd party provider and this resolved the incident
But when it worked, it worked. API was voodoo.
It doesn't help that their status page is also hosted on IBM Cloud.
A better approach is to have it hosted on a different cloud platform. If you really care, you’ll set it up on a different domain and nameserver as well with a long lived redirect (cached on CDNs) from the usual status.example.com or example.com/status.
"Our cloud can never go completely down We are IBM, we have Watson..."
At least give me something I can point my customers at to show them this is not due to my incompetence.
The purpose of the signalling here is twofold.
1) If convincing enough (with details), you can keep current customers from moving to a competitor.
2) It also lets new customers see how you actually handle a crisis. If you manage the crisis well enough, you can point to this instance to prove you have the technical know-how to handle their needs.
If they don't tell anything, or aren't transparent, then they can expect a mass exodus of customers.
Pingdom[1] had a spike of website outages from 11k => 27k.
Sorry to be glib, I'm sure it's a tough time for people who were sold on their cloud platform and work on it!
Hope they get a root cause and a quick fix. I’m not a fan of their cloud service but I know people working on the outage and fix are stressed.
>> A 3rd party network provider was advertising routes which resulted in our WW traffic becoming severely impeded.
No IBM computer has ever made a mistake or distorted information. They are all, by any practical definition of the words, foolproof and incapable of error.
It is not that companies become consciously malicious or are incompetent to start with. It becomes a vicious cycle: as more and more poor management and engineering talent join, the good ones leave, and the cycle continues.
Acquisitions and mergers stave off the slow slide into irrelevance for a while, till the best of the new guys leave too. Systemic cultural change is very, very hard to achieve in large organizations.
If they are receptive to feedback and clearly want to do better, I would be kind and explain why I had suggested it not be there in the first place and cite this as an example.
If they were being adamant or denying it was their fault, I'd probably be really quiet and just make subtle remarks about how it would have been better if they listened.
(Was interested to see what you were up to these days, which is how I stumbled on it).
Seriously they probably tested it and it worked in theory, just not in practice and now they fix it for reals.
The idea that they could even get to this point probably seemed unfathomable. It does to me.
Or do we simply accept it, and make it the norm, that even the lowest level of organizational governance is corrupt?
I am serious about this, because how people perceive their own rights, their own roles, their own status, their own influence and their organization's wrongdoing will influence the attitude in the long run against each and every organization in society in my opinion.
I know that I was blowing the question out of proportion, but it bugged me to ask anyway.
But whether you can get away with that depends on culture.
The last time we went down I questioned out loud the point of the status page and the general consensus was for others to be able to reference our outage.
aws support create-case \
  --subject "not working" \
  --communication-body file://description.txt
I wonder if that's a thing that would even cross a typical IBM-er's mind? It might just be me, but I get a very strong smell of "We're IBM! There's nowhere else for you to go!" from them...
If you'd like an outside resource to suggest or read up on better postmortem practices, the Google SRE Book has a chapter [0] on postmortem culture. It's an amazing change of pace and a huge stress level improvement for us SREs.
[0] https://landing.google.com/sre/sre-book/chapters/postmortem-...
https://www.ribbonfarm.com/2009/10/07/the-gervais-principle-...
The link was just a data point, not evidence of anything in particular.
(Because there is always a next time)
But I've certainly seen more Top50 companies than not who have at least a few Manager / Sr Manager-level folks, in charge of key teams who own the sole keys to necessary functions, who are happy to say no to anything they're ignorant of, without any impetus to learn about it.
(You should headhunt the guy at BackBlaze who does their hard drive stats blog posts, and release this data analysis quarterly!)
(You should contact the guy at BackBlaze who does their hard drive stats blog posts, and pay him to do this for you as a side hustle - and release this data analysis quarterly!)
:-)
I think the biggest issue is that far too many people assume that the AWS Savings stories are universally applicable, and that it's safe to assume AWS is going to be the cheap option.
I'm sure there are folks for whom AWS is the cheap option, but it wasn't at my last job, and it's not for my current one (even though they are using it).
A cynic might argue that is why their pricing structure is so complicated in the first place.
For us as well. It was so nice to have things work one day and the next and the next, although I guess they wouldn't have worked today.
Favorite firefighting moment was when wdc lost half the fiber in ~2014, and we had to move all of our traffic out, so that there was capacity. Our guy asked why we had to move, and your guy said something like 'Because if you guys move, we only need one customer to move.' :D
Apparently it was enough information for dsmcr to properly ID the service, though not enough for nixgeek, I think.
The odd thing is, for half the price I could get SL service w/10TB from a reseller, while at list price I only got 1-2TB bandwidth, and sales absolutely would not budge on that. I wonder why.
The key thing is that each IP 5-tuple (peerA, peerB, protocol, portA, portB) will always take the same path over their network (most likely a different path for return packets, when A and B are switched). So in order to properly probe, you need to probe on a lot of port combos, and once you find a broken combo, you need to run MTR on those ports, so you can give them the MTR that shows the issue.
Or, if you can, have your internode protocol run on multiple connections and drop connections that are showing issues, and let a different customer file the tickets :)
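The port-combo probing described above could be sketched roughly like this in Python. It is only an illustration, not what anyone in this thread actually ran: `tcp_probe` and `find_broken_combos` are made-up names, and it assumes a plain TCP connect from a fixed source port is a good enough stand-in for your real traffic.

```python
import socket
from itertools import product
from typing import Callable, Iterable, List, Tuple

PortPair = Tuple[int, int]  # (source port, destination port)


def tcp_probe(host: str, src_port: int, dst_port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connect from src_port to host:dst_port succeeds.

    Caveat: a local bind failure (source port already in use) is also
    reported as False, so re-check suspicious combos by hand.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.settimeout(timeout)
    try:
        sock.bind(("", src_port))       # pin the source port of the 5-tuple
        sock.connect((host, dst_port))  # pin the destination port
        return True
    except OSError:                     # timeouts and refusals land here too
        return False
    finally:
        sock.close()


def find_broken_combos(
    probe: Callable[[int, int], bool],
    src_ports: Iterable[int],
    dst_ports: Iterable[int],
    attempts: int = 3,
) -> List[PortPair]:
    """Return the (src, dst) port pairs for which every attempt failed."""
    broken = []
    for src, dst in product(src_ports, dst_ports):
        # One lucky success is enough to consider the path usable.
        if not any(probe(src, dst) for _ in range(attempts)):
            broken.append((src, dst))
    return broken
```

Something like `find_broken_combos(lambda s, d: tcp_probe("peer.example.net", s, d), range(40000, 40064), [443])` (hostname is a placeholder) would give you the combos worth running MTR on before you file the ticket.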
(email is in my profile if you want to discuss)
But don't worry, you're not the first to mention it. I suppose I should just fix it and deal with the spam like normal.
I liked the unintended effect of cutting down on spam. I guess a lot of spam bots are written on top of standard libraries that reject bad certs. :)
Also, this was ironically a great way to publicly call someone out for a seemingly bad decision without being cruel about it, so props to you!
This is intriguing. I'm going to remember this but I'm too anal about perfect A+ TLS and renewal is already fully automated these days anyway :-\
I wonder if one could set up their TLS stack to get this effect without the tradeoff...
(at least partly tongue-in-cheek) will it support DDL too? can I INSERT infra? or is this a read-only endeavor? :)
If someone does 'DROP TABLE ec2.instances', what exactly are they trying to accomplish? Do they want to terminate every ec2.instance? Should we let them?
Questions like that make write access very difficult.
I've always imagined it as a big tower shoved under someone's desk. The side panel of the case is off because otherwise it overheats. On the screen there's a single maximized window of DrRacket. A post it note warns you not to quit or reboot the system.
edit: It's possible that you could take out an entire TLD and make it impossible to resolve domains on that TLD once all the cached records expire. But that kind of targeted attack would not be possible with a BGP error, unless it was a very specifically crafted BGP error happening over a very long period of time (weeks-months-years depending on the record TTLs).
I guess I'm just calling out the people who are making fun of them for having their status page dependent on the same hardware it's monitoring when it's not clear that's the case just because they are both down?
I would suppose if it's a different TLD domain, then it would be more likely to conclude that.
Yeah, even if you could find a way to deny the spammers via esoteric configuration, it'll just make them realize they forgot to turn off TLS validation anyway (which is clearly what they meant to do)
Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.
I'm not going to make fun of them for their status page being down, but it certainly doesn't reflect well on the brand/products.