IBM Cloud was down, as well as their status page (cloud.ibm.com)

268 points by whyleym 6 years ago | 181 comments

My side project StatusGator monitors status pages (including IBM's ill-fated page) and I'm seeing more than 10% of the nearly 800 services we monitor having an outage right now.

So it appears to affect anyone who depends on IBM Cloud.

bberenberg 6 years ago | |

I really wonder how people get value out of a meta status page when my experience is that status pages are often incorrect about what the actual status is. Whether they're manually updated, or it's a case of "your 9s are not my 9s", it seems like a compounded broken telephone problem.

pas 6 years ago | | |

Probably it's great to have some very big-picture overview, both in scope ("all" the cloud) and in time (as in "all" time), and maybe there's even some value in looking at the correlation of these.

Maybe it helps with doing a sanity check before picking a provider. And, I guess, at a basic level it helps with accountability/transparency.

Jedd 6 years ago | |

Big fans of StatusGator here.

Do you have similar %'s of monitored cloud services that have gone off the air during other providers' outages?

colinbartlett 6 years ago | | |

Thanks and that's a GREAT idea for some detailed analysis. I have been trying to make better use of the 5+ years of status page history stashed in the cupboard.

ComputerGuru 6 years ago |

So what are HNers using IBM Cloud for and where do you see that it has an edge over AWS offerings (where an overlap exists, obviously)?

(I figure either you’re in devops and too busy putting out fires to read this thread, or you’re not and your work is halted because of the incident, so you might have time to read and reply ;)

paranoidrobot 6 years ago | |

My previous job used Softlayer heavily.

Two of the biggest advantages were:

Price for hardware. As a base price, their bare-metal gear was significantly cheaper than equivalent-specced AWS gear (if it was even possible to get something like that). We managed to snag quite a few 'interesting' configurations of things at various times that you just couldn't get at all in AWS. Things like PCI SSDs, very large RAM configs, or High-Frequency low-core count CPUs.

Free international/regional transfer. We took significant advantage of this to move data around. We'd replicate TBs of data around.

At various times management and dev teams would complain and say that we should move everything to AWS (or whatever cloud provider they'd just met with at a conference).

We consistently showed higher performance and lower cost by significant margins. On cost alone, we were paying a small fraction of what it'd cost on AWS, even after taking into consideration ways to reduce cost on AWS such as scaling, spot instances and reserved-instances.

jgalt212 6 years ago | | |

I really would like to see an AWS memo which maps out common use cases and expected costs (selling points) vs actual use cases and actual costs (pain points).

toast0 6 years ago | |

We used Softlayer (rebranded to IBM Cloud, and affected by this) at my last job. For the most part, their service pretty much just works; clearly not today. :)

We had a couple thousand bare metal servers, and barely used any of their API stuff.

As with any facility, there were occasional issues with electrical transfer switches, core router failures, fiber cuts, etc. Stuff happens, but we got pretty good communication, and things got resolved in a reasonable amount of time. Service got noticeably worse after IBM, but we were already planning to move to our acquirer's hosting, because that's what happens when you're acquired. Oh, and their load balancers had garbage uptime.

Bandwidth prices used to be pretty reasonable, but they've adopted AWS style obscene pricing. At least they still let you use the private network for free (including to other datacenters).

dang 6 years ago | | |

HN ran on a box at Softlayer until early 2018 or so. This makes me think that the title of this post (which was submitted as "IBM Cloud down as well as their status page which looks to be hosted there") could at some point have been "IBM Cloud down as well as their status page which looks to be hosted there as well as the forum where people post these things which also looks to be hosted there".

dsmcr 6 years ago | | |

You guys were one of the best use cases for the SL model, which really hasn't changed in 10+ years. You had very few dependencies on the less-reliable (read: all of them) services inside the SL stack and mostly managed everything on box and in software. In a few POPs you guys were running about 50% of the total SL backbone bandwidth. There were a lot of sad panda hats when you guys started to transition away.

nixgeek 6 years ago | | |

Last job for me was also a few thousand bare metal servers at SoftLayer. Acquired and moved to that infrastructure instead. Wonder if it's the same acquisition? :-)

dsmcr 6 years ago | |

FWIW - IBM Cloud today has basically no benefit over AWS, Azure or GCE or even against some of the smaller regional players like AliCloud. The notable exception would be if you need to run a bare metal solution and leverage their free backbone which is a pretty narrow use case these days. The main selling point previously was to stand up your own VMware environment but even that came with a laundry list of caveats (unsupported hardware, limited VLANs, non-flexible IP space) that made it painful to use. Today AWS is vastly more performant, flexible, reliable and has a bunch of useful services you don't get from IBM Cloud.

nojito 6 years ago | | |

Price isn’t a benefit?

Xenoamorphous 6 years ago | |

If we speak specifically about IBM Cloud vs AWS, we use the Natural Language Understanding API in IBM Cloud and as far as I know the equivalent AWS offering, Comprehend, doesn't provide named entity disambiguation nor links to knowledge graphs (IBM links to DBpedia).

MS and Google do provide those features though.

dsmcr 6 years ago | | |

Unfortunately, that API changes regularly and often in undocumented ways that cause breakages for customers. It's really a lot of fun to deal with when suddenly a bunch of automation breaks and it turns out an unannounced push fundamentally rewrites foundational API calls.

fieldmarshal 6 years ago | |

We have been leasing bare metal servers since the pre-IBM Softlayer days.

Over the past few years we have experienced quite a few network-related outages. Not usually to this extent, more generally a failure of some piece of network gear that takes out either backend or frontend traffic from a particular data center. We seriously priced out a migration to another provider recently, but in the end what held us back was cross-AZ transfer costs on AWS. We found it would raise our operating costs significantly, so the matter was dropped.

We were much happier with the service and support we received prior to the IBM acquisition.

sky_rw 6 years ago | |

I had originally signed up due to the availability and pricing of bare metal servers and the mixed Windows/Linux server offerings. Their Windows Server licensing was better than AWS's and I didn't want to be on Azure for a variety of reasons.

Currently on them because we have an OpenVPN based infrastructure that is very challenging to migrate.

Lastly, the majority of our customers are in the Midwest or Texas, and the proximity of their Dallas DC was a huge performance win for us.

manquer 6 years ago | |

It is rarely just a technical decision; usually money is the reason.

In small and mid-size organizations, the CSP gives better pricing, or they help with your sales, etc.

In large organizations, IBM/Oracle bundle it with their existing products that are already being paid for anyway, or account managers have great relationships with decision makers, or the company has already signed big multi-year deals.

This is not just IBM, it applies to GCP/Azure/AWS as well.

nihil75 6 years ago | |

I like OpenWhisk which is the basis for their serverless compute offering. Has orchestration/state-machine functionality that makes it superior to GCP Cloud Functions, and uses Docker containers which makes it more flexible than AWS Lambda.

I also really like CouchDB which IBM Cloudant is based on.

Is that enough for me to use IBM Cloud? No, not really.

spydum 6 years ago | |

I suspect nobody really uses it outside of weird outsourced financial modelling/planning tools like TM1 and other apps people stopped wanting to manage themselves.

freehunter 6 years ago | | |

I work as a consultant with big enterprise companies and I can assure you big enterprise companies are using IBM very heavily. As well as Oracle and HP and other uncool tech companies.

rezonant 6 years ago | |

We use Restream.io and SolarWinds Papertrail; both were down today. My guess is they use IBM Cloud itself or some rack space that IBM has rented to other clouds, which is apparently typical at the edge of the major public cloud regions.

blantonl 6 years ago |

All of Broadcastify's audio servers (hosted with Softlayer in their Dallas datacenter) are completely unreachable and down.

I'm going to wait a bit to see if we get a status update; otherwise we'll be spinning up instances on AWS to fail over (which will be enormously costly for bandwidth).

No status, no nothing, we're in the dark.

Operyl 6 years ago | |

Hey. Do you want to shoot me an email, IRC chat, or anything? I can keep you up to date with what I'm hearing from my manager.

dashesyan 6 years ago | | |

Hey, I'm a customer of IBM Cloud, too. Could you share what you're hearing from them? It would be nice to know what's going on

Fordec 6 years ago |

I remember I was at an IBM sponsored hackathon around 2015 where it was a requirement to use Bluemix. Over the course of the weekend the service went down for hours 3 times.

Literally this morning I was wondering whatever happened to it. Did it die a quiet death? Oh, it rebranded to IBM Cloud in 2017. Now this news.

I think there's an eponymous law named for this sort of thing.

kinghuang 6 years ago | |

That's funny. I've had the exact same experience with Bluemix at a Hackathon in the past. It was down for almost the entire weekend, screwing all the teams that didn't pivot early enough.

vmh1928 6 years ago |

On the Cloud Status History page, scroll down to the 6:32 entry that says "Unable to Access IBM Cloud":

https://cloud.ibm.com/status?selected=history

- 2020-06-10 02:19 UTC - RESOLVED - The network operations team adjusted routing policies to fix an issue introduced by a 3rd party provider and this resolved the incident

voz_ 6 years ago |

I generally do everything on AWS or GCP, with a little Azure sometimes for personal projects. In what world does IBM beat one of those three in anything? Genuinely curious: how are they able to stay competitive?

twalla 6 years ago | |

Their bare metal cloud offering (SoftLayer acquisition) was actually pretty good when I used it about 4 years ago. It wasn’t the most intuitive API or UI, but you could get a bare metal server anywhere in the world in a few minutes.

rad_gruchalski 6 years ago | | |

When the wind blows in the right direction. Sometimes your server would get stuck in provisioning for hours and only get „un-stuck” after creating a support ticket. Which, I kid you not, at one of my previous jobs, we had automated in our provisioning pipeline. Good times.

But when it worked, it worked. API was voodoo.

jonfw 6 years ago | |

They've got the only real managed OpenShift option right now, and their managed Kubernetes service is really great and seamless IMO.

wmf 6 years ago | |

They had bare metal before Packet or AWS and inter-region traffic is free.

Operyl 6 years ago | | |

The biggest thing going for them is a 100% free dark fiber private network. You do have to pay for the bigger pipe (100 Mbps included for each server, a minimal upcharge for gigabit), but that's pretty much a rounding error.

blantonl 6 years ago | |

Softlayer

blazefox69 6 years ago |

Fixed it for you https://github.com/ibm-cloud-docs/overview/pull/74

caiobegotti 6 years ago |

Honest slightly cynical question: most probably someone inside the responsible team said some day that it would be very stupid to host the status page inside the same infrastructure being monitored, but they were probably ignored... what should that person do now? Say "toldya!" out loud in the postmortem meeting or simply shut up and move on because reality is that we are hired to do some stupid task and not to think for ourselves?

Lyren 6 years ago |

I received communication ~15min ago that they're actively looking into the issue. I submitted the ticket roughly 20min ago. So it seems they're aware.

It doesn't help that their status page is also hosted on IBM Cloud.

whyleym 6 years ago |

Found this from a user on Twitter - "Our status page for IBM Aspera is on StatusPage, so you can track here as a bank shot: https://status.aspera.io "

gatvol 6 years ago |

Well if they cannot foresee this eventuality, what else are they missing under the hood?

julianeon 6 years ago |

Seems pretty dumb to host a status page in a way that it could go down, when it should be a static page that is trivially hosted on CDNs worldwide.

koolba 6 years ago | |

You can’t cache it for that long though.

A better approach is to have it hosted on a different cloud platform. If you really care, you’ll set it up on a different domain and nameserver as well with a long lived redirect (cached on CDNs) from the usual status.example.com or example.com/status.

julianeon 6 years ago | | |

Thanks; you're right - the caching would be a problem, so your solution makes more sense.

syshum 6 years ago | |

Overconfident in their own cloud.

"Our cloud can never go completely down. We are IBM, we have Watson..."

sky_rw 6 years ago | |

Their status page also seems integrated into their internal support ticketing system. It's not a traditional status page. They wanted to maintain a consistent garbage interface to keep it in line with the rest of their administrative service.

sky_rw 6 years ago |

The most infuriating thing about this is the ZERO communication coming out of IBM Cloud. No emails. No updates to twitter. Status page down. Support lines clogged.

At least give me something I can point my customers at to show them this is not due to my incompetence.

bizt 6 years ago | |

Yep, super annoying. I had to link my customers to a TechCrunch page :(

shaabanban 6 years ago |

Also still no communication from IBM that anything is wrong.

Operyl 6 years ago | |

Account managers are texting, but they have no VPN access right now.

adrr 6 years ago | | |

It seems all their external network connections are down. I assume people will have to drive to the data centers to fix. I really want to see a post mortem on this outage.

mark-r 6 years ago | | |

Let me guess, their VPN authorization runs on the IBM cloud.

akerro 6 years ago |

Haha, Amazon had the same problem a few years ago when they had a fire in a datacenter. Their status checker page was hosted in the same building and was showing everything was fine, while thousands of websites hosted on AWS were down.

shaabanban 6 years ago |

Wonder if we'll ever get a post-mortem about this... Seems to be global.

Operyl 6 years ago | |

Maybe. About 3/4 of all outages get a post mortem. The other 1/4 of the time they refuse to tell us anything.

mbreese 6 years ago | | |

There will have to be a post mortem on this. The convention is to be as transparent as possible as to what went wrong. This helps to let current customers know that you found the problem, and have put plans in place to make sure it doesn't happen again.

The purpose of the signalling here is twofold.

1) If convincing enough (with details), you can keep current customers from moving to a competitor.

2) It also lets new customers see how you actually handle a crisis. If you manage the crisis well enough, then you can point to this instance to prove you have the technical know-how to handle their needs.

If they don't tell anything, or aren't transparent, then they can expect a mass exodus of customers.

colinbartlett 6 years ago | | |

Do you actually have data on that or are you conjecturing? Because I would really love to see data about that if it exists somewhere.

redler 6 years ago | |

I certainly hope so, considering all the IBM customers that are going to have to explain this to their customers in turn.

thephyber 6 years ago |

How sure are we that this outage is limited to IBM cloud?

Pingdom[1] had a spike of website outages from 11k => 27k.

[1] https://livemap.pingdom.com/

Nextgrid 6 years ago | |

It's most likely customers of IBM cloud whose systems rely on something hosted there and are thus down as well.

thephyber 6 years ago | | |

Yes, I considered that possibility before posting.

AaronFriel 6 years ago |

Ah, is this the exception that proves the rule that "no one was ever fired for buying IBM?"

Sorry to be glib, I'm sure it's a tough time for people who were sold on their cloud platform and work on it!

mark-r 6 years ago | |

Everybody's cloud goes down sometime. The big fail here was hosting their status page on the same infrastructure.

oceanswave 6 years ago | | |

But usually only a single AZ or region... seems like this is bigger?

Operyl 6 years ago |

Yup .. hit us pretty badly. Our account manager doesn't know either.

homeglue 6 years ago |

I've seen multiple services get affected this morning including Sendgrid, Nexmo and Up bank, all at the same time. Wondering if this is related.

leetrout 6 years ago |

Hugops.

Hope they get a root cause and a quick fix. I’m not a fan of their cloud service but I know people working on the outage and fix are stressed.

kitteh 6 years ago |

About a month ago their Northern Virginia region was down. All the BGP prefixes associated with it disappeared from the internet (routes withdrawn). This time (I went to check when someone mentioned it) they kept advertising, but all traffic went nowhere once it got into their network. Curious to see if there is an RFO released.

aiisjustanif 6 years ago | |

I wish we had a record of this.

kitteh 6 years ago | | |

I do. I store all this stuff. Where should I put it?

nonines 6 years ago |

This looks related (smoking gun?) https://status.aspera.io/incidents/t9r03x71dxkl

>> A 3rd party network provider was advertising routes which resulted in our WW traffic becoming severely impeded.

rbanffy 6 years ago | |

It can only be attributable to human error.

No IBM computer has ever made a mistake or distorted information. They are all, by any practical definition of the words, foolproof and incapable of error.

stevehawk 6 years ago |

Guess they didn't learn from AWS hosting its status page (in particular its icons) in S3.

bantec 6 years ago |

It’s the second significant issue with IBM in the last year (absolutely unsuitable for critical infrastructure; we are FinTech).

cerw 6 years ago |

Been like that for the last hour. Network traffic from Sydney (GCP) to Sydney (IBM): 62% packet loss.

ck2 6 years ago |

Even weather.com was down, but someone broke eBay too:

       Fastly error: unknown domain: www.ebay.com. Please check that this domain has been added to a service.

toast0 6 years ago | |

weather.com makes sense. IBM bought the Weather Channel a while ago, so hosting is likely tied to IBM Cloud at this point (although it looks like it's fronted by Akamai).

vmh1928 6 years ago | | |

IBM bought the technology part called the Weather Company. That's the part that gathers weather info from all over and makes it available.

The cable TV channel is still independent.

supernova87a 6 years ago | |

Aha, I guess that explains why Wunderground.com was out too.

pmarreck 6 years ago |

Imagine hosting your status page on a different domain

9nGQluzmnq3M 6 years ago | |

DNS worked fine here, this was an infra issue.

nadavami 6 years ago |

It seems like the status page just came back up.

woakas 6 years ago |

Our site (ubidots.com) is not completely down, but the IBM network has high latency.

someguy12321 6 years ago |

heads be rolling tomorrow!

anon102010 6 years ago |

A quick check of Cloudflare's isbgpsafeyet page:

IBM Cloud - unsafe

At least AWS signs their routes I think.

If you can't even sign your own routes - hard to have a ton of pity.

kortilla 6 years ago | |

Signing routes doesn’t mean others reject unsigned routes. AWS is just as vulnerable to hijacking as anyone.
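
To make that distinction concrete, here is a minimal sketch of route origin validation along the lines of RFC 6811. It assumes a hypothetical in-memory list of ROAs; the `Roa` class, the `validate_origin` function, and the documentation prefixes and AS numbers are all illustrative, not anything AWS or IBM is known to run. Publishing a ROA only gives other networks something to check against. A network that never runs a check like this, or that accepts "invalid" routes anyway, gets no protection from it.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    """One validated ROA payload: who may originate which prefix (hypothetical type)."""
    prefix: str       # e.g. "192.0.2.0/24"
    max_length: int   # longest prefix length the holder allows
    asn: int          # AS number authorized to originate the prefix


def validate_origin(route_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify a BGP announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA covers the route only if the route sits inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches when the length is allowed and the origin AS agrees.
        if route.prefixlen <= roa.max_length and origin_asn == roa.asn:
            return "valid"
    # Covered but never matched: someone else's AS, or a too-specific prefix.
    return "invalid" if covered else "not-found"


# Documentation-only prefixes and AS numbers, chosen for illustration.
roas = [Roa("192.0.2.0/24", 24, 64500)]
print(validate_origin("192.0.2.0/24", 64500, roas))     # valid
print(validate_origin("192.0.2.0/24", 64501, roas))     # invalid (wrong origin AS)
print(validate_origin("192.0.2.0/25", 64500, roas))     # invalid (more specific than max_length)
print(validate_origin("198.51.100.0/24", 64501, roas))  # not-found (no covering ROA)
```

The "not-found" case is the one that matters for this thread: a prefix with no ROA cannot be marked invalid by anyone, and a signed prefix is only protected on networks that actually drop "invalid" announcements.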