January 28th Incident Report (github.com)

451 points by Oompa 10 years ago | 183 comments

eric_h 10 years ago |

> One of the biggest customer-facing effects of this delay was that status.github.com wasn't set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Amazon could learn a thing or two from Github in terms of understanding customer expectations.

dmunoz 10 years ago | |

I recently stepped into a role with a devops component, and one of my first surprises was just how slow status.aws.amazon.com was to update about ongoing issues. I had to scramble to find confirmation on Twitter and external forums for the client.

atom_enger 10 years ago | | |

What's even worse is that when Amazon finally updates their status page it's usually still a green icon with a little i tick for "information" even if it was a partial outage. It takes a lot for the icons to go red which is what you'd look for if you're experiencing issues.

I do the same thing, often searching Twitter for "aws" or "outage" and find people complaining about the problem which confirms my suspicions. It's a sad state of affairs when you have to do this and Amazon doesn't seem interested in fixing it.

eric_h 10 years ago | | |

Day to day I mostly write software, but I also help manage our infrastructure (we're a small company - 9 people total, 4 engineers, I'm one of the 2 who understand managing servers well enough to support it). We were on Linode up until about a year and change ago and switched to AWS/Opsworks to both decrease our infrastructure bill and increase our ability to scale horizontally quickly (for unfortunately long definitions of quickly - "running setup...")

Both Linode and Amazon suck at their status pages (though linode was quite informative about their DDoS outages that started on Christmas). Every amazon issue we've had, the status page only changed once they'd more or less fixed it. As far as I'm concerned their status page is basically useless unless it's an extended outage, at which point it's still basically useless...

vacri 10 years ago | |

> Amazon could learn a thing or two from Github in terms of understanding customer expectations.

Do you mean that "the cloud provider that is bigger than the next 14 combined and whose jargon has spread through the community" doesn't understand what customers are interested in and delivering on that?

existencebox 10 years ago | | |

Gonna speak up to defend OP here: I've worn the devops hat for products across multiple "Large Companies" (Amazon and larger scale) and found that for small products where it was me and a few other devs keeping the lights on, we would have outage alerts on status pages/twitter typically _before_ public users even realized something was wrong, since we were all very high touch on the project.

The bigger a project gets, the less prioritized something like a status page often seems to get. Larger entities certainly _have_ them but I often see more things interfering as scale grows (this isn't only a MS thing, let me make clear) whether it be domain switches between engineering and social management (status is often via twitter), feeding the status page via a long telemetry/monitoring platform that has some lag, or a high threshold for what "outage" means to avoid flappy notices (at the cost of some false negatives).

I'm not even going to make a value judgement on the tradeoff of these costs at this point, (I certainly wouldn't dismiss it offhand as a net negative although equally it's not all roses) but at the very least I'd observe that something like a status page _CAN_ be serviced very well from an up and comer (for as much as Github is that any more) and it's far from a true statement that bigCOs can't take learnings from improving customer happiness from newer entities. (In fact, I wish that was a more common practice!)
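
The "high threshold to avoid flappy notices" tradeoff mentioned above can be sketched in a few lines of Python. This is only an illustration: the class name `StatusIndicator` and the thresholds are made up, and nothing here reflects how any real status page is implemented.

```python
from dataclasses import dataclass


@dataclass
class StatusIndicator:
    """Debounced status light: flips only after N consecutive disagreeing probes."""
    red_after: int = 3    # consecutive failed probes before going red (assumed value)
    green_after: int = 5  # consecutive good probes before going green (assumed value)
    status: str = "green"
    _streak: int = 0

    def observe(self, probe_ok: bool) -> str:
        # A probe that agrees with the current status resets the streak.
        agrees = probe_ok == (self.status == "green")
        if agrees:
            self._streak = 0
            return self.status
        self._streak += 1
        needed = self.red_after if self.status == "green" else self.green_after
        if self._streak >= needed:
            self.status = "red" if self.status == "green" else "green"
            self._streak = 0
        return self.status


def replay(probes, **kwargs):
    """Feed a sequence of probe results through a fresh indicator."""
    light = StatusIndicator(**kwargs)
    return [light.observe(p) for p in probes]
```

With the default thresholds, `replay([False, True, False, True])` never leaves green (the flapping is suppressed), while three failures in a row turn the light red only on the third probe. That delay is the false-negative cost: lowering `red_after` to 1 reports faster but flaps on every blip.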

mikeash 10 years ago | | |

Do you mean that just because a company has huge market share it must be doing every single thing better than its competitors?

bosdev 10 years ago |

There's no mention of why they don't have redundant systems in more than one datacenter. As they say, it is unavoidable to have power or connectivity disruptions in a datacenter. This is why reliable configurations have redundancy in another datacenter elsewhere in the world.

danielvf 10 years ago |

For all that work to be done in just two hours is amazing, especially with degraded internal tools, and both hardware and ops teams working simultaneously.

DarkTree 10 years ago |

I don't know enough about server infrastructure to comment on whether or not Github was adequately prepared or reacted appropriately to fix the problem.

But wow it is refreshing to hear a company take full responsibility and own up to a mistake/failure and apologize for it.

Like people, all companies will make mistakes and have momentary problems. It's normal. So own up to it and learn how to avoid the mistake in the future.

eric_h 10 years ago | |

As I said in another comment, the fact that they found an 8 minute delay from outage to status page update to be unacceptable speaks volumes to how much they value their relationship with their customers.

As an aside, I feel that I'm quite fortunate to work in the EST timezone, as their outage apparently started at about 7pm my time. We have a general rule at my company to not deploy after 6pm unless an emergency fix absolutely needs to go up.

I saw the title of the story and said to myself, what outage? :P

pedalpete 10 years ago |

Does Github run anything like Netflix Simbian Army against its services? As a company by engineers for engineers with the scale that github has reached, I'm a bit surprised they are lacking a bit more redundancy. Though they may not need the uptime of Netflix, an outage of more than a few minutes on github could affect businesses that rely on the service.

imbriaco 10 years ago | |

Google "Netflix downtime" for evidence that Netflix also has outages. Google has outages, sometimes very significant ones of Google Apps. Facebook has outages.

Complex systems fail. Period. All the time. Things like the Simian Army are fantastic tools that help you identify a host of problems and remediate them in advance, but they cannot test every combinatorial possibility in a complex distributed system.

At the end of the day, the best defense is to have skilled people who are practiced at responding to problems. GitHub has those in spades, which is why they could respond to a widespread failure of their physical layer in just over 2 hours.

The biggest win with the Simian Army isn't that it improves your redundancy. It's that it gives your people opportunities to _practice_ responses.

drdrey 10 years ago | | |

More than practicing responses, Chaos Monkey and Failure Injection Testing allow us to verify that we don't have unexpected hard dependencies. Sometimes you find out that your service can't start if another one becomes latent, in which case you can plan for it by adding redundancy/extra capacity, fallbacks or working in degraded mode.
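
As a rough illustration of the idea (not Netflix's actual tooling), injecting latency into a dependency and checking that the caller degrades instead of hanging might look like this in Python. The helper names `call_with_fallback` and `inject_latency`, and the `recommendations` stand-in service, are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(dependency, fallback, timeout_s):
    """Call `dependency`; if it fails or exceeds `timeout_s`, return `fallback`."""
    future = _pool.submit(dependency)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # dependency is latent: degrade instead of hanging
    except Exception:
        return fallback  # dependency is down: degrade instead of crashing


def inject_latency(func, delay_s):
    """Wrap `func` so every call is delayed by `delay_s` seconds (the injected fault)."""
    def slow(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slow


def recommendations():
    # Hypothetical stand-in for a call to a peer service.
    return ["personalised", "list"]


healthy = call_with_fallback(recommendations, ["generic"], timeout_s=0.5)
degraded = call_with_fallback(
    inject_latency(recommendations, 0.3), ["generic"], timeout_s=0.05
)
```

Here `healthy` gets the real answer and `degraded` gets the generic fallback. A caller with no timeout and no fallback would simply block for as long as the peer is slow, which is the kind of unexpected hard dependency this testing is meant to surface.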

kuschku 10 years ago | | |

I remember in 2013 a full-day outage of Google.

tbrock 10 years ago | |

It's "simian army". A simbian army is like a herd of dildos to sit on. I doubt that would have helped github's services recover faster.

MatekCopatek 10 years ago | | |

You're thinking of sybian army.

I'm really tempted to continue with "a simbian army is actually" but this isn't Reddit so end of comment thread.

onetwotree 10 years ago |

Every time I read about a massive systems failure, I think of Jurassic Park and am mildly grateful that the velociraptor paddock wasn't depending on the system's operation.

chris_wot 10 years ago | |

I think you'll find they were.

mattdeboard 10 years ago | |

Well as long as you're not Samuel L. Jackson in that scenario you should be fine. Ish.

onetwotree 10 years ago | | |

Samuel L. Jackson taught me everything I know about ethics in software engineering.

Including the principle that if your software breaks, you're the one who has to go get savaged by velociraptors to fix it.

mjevans 10 years ago |

This just shows how difficult it is to avoid hidden dependencies without a complete, cleanly isolated, testing environment of sufficient scale to replicate production operations and do strange system fault scenarios somewhere that won't kill production.

imbriaco 10 years ago | |

It turns out that it's even hard then. Complex systems, by their very nature, fail in unexpected and unpredictable ways. If that weren't bad enough, hindsight bias makes it way too easy for us to look back with perfect knowledge and opine "That was so obvious, how could they have missed such a rudimentary issue?"

If only things were that easy.

ssmoot 10 years ago | | |

I'm not sure what part of servers failing to POST is especially complex or related to distributed computing.

For all the fawning over being provided technical details, this article was pretty light on them.

I don't think Github going down for a couple hours is that big of a deal TBH. But it does seem to expose a few really basic failings in their DR planning IMO.

I also think it's ridiculous that some commenters are trying to frame this as a distributed computing problem. It's not even a clustering problem (apparently). It's just looking at the iDRAC or whatever to see why the server isn't getting past POST and putting your recovery plan into action.

This is white box vanilla stuff that happens to everybody.

That servers had to be rebuilt as part of DR says a lot.

The fact that there was a Redis dependency during bootstrap? Probably a good thing. You know as well as anyone I'm sure the last thing you want is a bunch of processes that only look like they're up. And even if they could not error without their Redis connections, if Redis is used for caching, what's that going to do to availability? Would it be a good thing to have the processes up if they can only handle 10% of the usual load?

Those are details that aren't there.

But complex distributed computing problem this is not. Not as it was presented anyways.

ones_and_zeros 10 years ago | |

Or use the Netflix model: Chaos testing in production.

toomuchtodo 10 years ago | | |

No system is perfect; as you continue to add 9s, the cost increases steeply.

Usually it's just cheaper to be down for an hour or two than to architect for the end of times.
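
For a sense of scale, here is the downtime budget behind each extra 9, as a small Python calculation. It only shows how fast the budget shrinks; it does not model cost, and the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (2 means 99%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10 ** (-n))
    print(f"{availability:.3f}% -> {allowed_downtime_minutes(n):8.1f} min/year")
```

Each additional 9 cuts the budget by a factor of ten: roughly 5256 minutes a year at 99%, about 526 at 99.9%, about 53 at 99.99%. A single two-hour outage (120 minutes) fits comfortably inside three nines but is more than twice the whole yearly budget for four.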

aaronblohowiak 10 years ago | | |

Part of our Chaos testing in prod is exercising our ability to route traffic around failures of entire regions. jobs.netflix.com

bluecmd 10 years ago | | |

Or Google for that matter. DiRT.

viraptor 10 years ago |

> ... Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment.

That's an awesome idea. I wish all companies published the firmware releases in simple rss feeds, so everyone could easily integrate them with their trackers.

(If someone's bored, that may be a nice service actually ;) )

vhost- 10 years ago | |

This was one of the toughest things about admining hardware clusters. Firmware updates (and firmware issues) are so hard to track down. It's so annoying. I remember spending a week tracking down an issue with a RAID controller and then spending another day or two on the phone with the vendor trying to get a firmware update so we did not have 2 racks of hardware sitting on a ticking time-bomb.

Cthulhu_ 10 years ago | |

I've played with the idea of some automated software update reporting site ages ago - it'd read rss feeds and scrape websites for the required info. It'd probably need adjustments for each hardware manufacturer / product though, and regular updating. But that could possibly be part of an open source project, give the firmware maintainers the opportunity to help out too.
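
A minimal sketch of the feed-to-tracker part, assuming a vendor published firmware releases as plain RSS 2.0. The sample feed, the `vendor.example` URLs and the `new_firmware_issues` helper are all invented for illustration, and actually filing the issue in a tracker is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; no real vendor feed is assumed to look like this.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example RAID controller firmware</title>
  <item><title>Firmware 25.3.0</title><link>https://vendor.example/fw/25.3.0</link>
    <description>Resolved drive detection issue after cold boot.</description></item>
  <item><title>Firmware 25.2.1</title><link>https://vendor.example/fw/25.2.1</link>
    <description>Usability improvements.</description></item>
</channel></rss>"""


def new_firmware_issues(feed_xml, already_seen):
    """Return one issue dict per feed item whose link is not in `already_seen`."""
    root = ET.fromstring(feed_xml)
    issues = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        if not link or link in already_seen:
            continue
        issues.append({
            "title": "Review " + item.findtext("title", default="(untitled)").strip(),
            "body": item.findtext("description", default="").strip() + "\n" + link,
        })
    return issues


seen = {"https://vendor.example/fw/25.2.1"}
issues = new_firmware_issues(SAMPLE_FEED, seen)
```

With the sample data, `issues` contains a single entry titled "Review Firmware 25.3.0" whose body carries the changelog line and the link. The hard part described above remains: most vendors would need a per-site scraper in place of the tidy feed assumed here.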

matt_wulfeck 10 years ago |

> Remote access console screenshots from the failed hardware showed boot failures because the physical drives were no longer recognized.

I'm getting flashbacks. All of the servers in the DC reboot and NONE of them come online. No network or anything. Even remotely rebooting them again we had nothing. Finally getting a screen (which is a pain in itself) we saw they were all stuck on a grub screen. Grub detected an error and decided not to boot automatically. Needless to say we patched grub and removed this "feature" promptly!

gaius 10 years ago |

You can very clearly see two kinds of people posting on this thread: those who have actually dealt with failures of complex distributed systems, and those who think it's easy.

Animats 10 years ago |

"We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet."

Tell us which vendor shipped that firmware, so everyone else can stop buying from them.

gruez 10 years ago | |

I'm guessing they didn't disclose the vendor because they didn't want to be sued for defamation.

theptip 10 years ago | | |

And/or they want to maintain a working relationship with said vendor. Going nuclear is a good way of getting _exactly_ the minimum level of service that your SLA specifies.

Animats 10 years ago | | |

Truth is an absolute defense to libel in the US.

merqurio 10 years ago |

I feel it was a good incident for the Open Source community, to see how dependent we are on GitHub today. I feel sad whenever I see another large project like Python moving to GitHub, a closed-source company. I know, GitLab is there as an alternative, but I would love to see all the big Open Source projects putting pressure on GitHub to make them open their source code, as right now they are a big player in open source, like it or not.

rqebmm 10 years ago |

It must be nice to know that the majority of your customers are familiar enough with the nature of your work that they'll actually understand a relatively complex issue like this. Almost by definition, we've all been there.

dsmithatx 10 years ago |

If only Bitbucket could give such comprehensive reports. A few months back outages seemed almost daily. Things are more stable now. I hope for the long term.

viraptor 10 years ago | |

Isn't BB's problem basically that there are too many users? GH's outage writeup is cool, because it's a one off and it can be analysed. When BB is just overloaded for a long time and needs more power, it's not going to be very interesting.

(unless I missed some specific non capacity related outages?)

yeukhon 10 years ago | | |

Maybe. BitBucket was also an acquisition so for some time I believe there was a lack of resources provided to them and there was a huge technical debt/integration effort required. At this very time, I don't know if Atlassian actually cares much about BitBucket. They are probably more concerned about delivering Stash than BitBucket, my wild guess.

I was an active BB user a couple years ago, and the project I worked on would hg clone from BB many times a day so I would be the first one to notice a 503 or whatever error coming from their service. Typically I would see one or two outages per month; some lasted a few minutes, some lasted several hours. Most of the time the outage impacted git/hg checkout, so I think that was their technical bottleneck.

guelo 10 years ago |

Weird that they didn't say what caused the power outage and what the mitigations are for that.

sh4na 10 years ago | |

If it's a data center owned by a third party, they probably can't talk about it.

jlgaddis 10 years ago | |

  RFO: A squirrel climbed into a transformer
       and a short time later they both blew up.

gsibble 10 years ago | |

I'm also confused about how the racks would lose power. Surely they had UPSes.

abrookewood 10 years ago | | |

Generally speaking, I'd recommend AGAINST running UPSes in racks that are managed by top-tier data centres. I've had way more trouble with UPSes misbehaving than I ever have with data centres losing power. EDIT: I'd also point out that 2 hours is a long time to be running on in-rack UPSes. I've usually seen them designed to withstand about an hour, but not much more.

ams6110 10 years ago | | |

UPSs don't always cover everything. There are systems that are considered critical that are on UPS, and others that are considered restartable that might not be. There are a lot of tradeoffs in a data center. Having full UPS and generator backup capacity for everything gets very expensive.

technion 10 years ago | | |

I have multiple experiences with high end DCs with dual UPS and diesel genset experiencing power failure.

Once it involved fire alarms, which trigger safety shutdowns within a suite. The other involved a failed static switch panel - i.e., the things that aren't meant to be able to fail.

tmsh 10 years ago |

> Over the past week, we have devoted significant time and effort towards understanding the nature of the cascading failure which led to GitHub being unavailable for over two hours.

I don't mean to be blasphemous, but from a high level, are the performance issues with Ruby (and Rails) that necessitate close binding with Redis (i.e., lots of caching) part of the issue?

It sounds like the fundamental issue is not Ruby, nor Redis, but the close coupling between them. That's sort of interesting.

byroot 10 years ago | |

No the fundamental issue is that an application should not require any external service to boot.

It has nothing to do with Ruby, or Rails or even Redis. It's just a design flaw of the application, that you often learn the hard way.

atom_enger 10 years ago | |

I don't think that Ruby/Rails has anything to do with this, really. If you want to scale any app, you're going to want to do some caching somewhere. What this boils down to is that their app has a dependency in an initializer that depends on redis. Without a connection to redis, it will flap.

lukeasrodgers 10 years ago | |

As someone with a fair bit of ruby+rails+redis experience, I don't think this is blasphemous, but I also don't think the performance issues of ruby/rails have anything to do with the failure. Generally you would cache/store something in redis not because your programming language or framework is slow, but because a query to another database is slow (or at least, slower than redis), or because redis data structures happen to be a good/quick way to store certain kinds of data.

I believe the fundamental issue was just that redis availability was taken for granted by app servers so that certain code paths/requests would fail if it wasn't available, rather than merely be slower.
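
To illustrate "slower rather than failing", here is a small Python sketch of a cache wrapper that does not connect at boot and falls back to the slow path when the cache is unreachable. `LazyCache` and `unreachable_cache` are hypothetical names, the client is assumed to be dict-like, and this is not GitHub's code or a real Redis client API.

```python
class LazyCache:
    """Cache wrapper that connects on first use and never raises to callers."""

    def __init__(self, connect):
        self._connect = connect  # callable returning a dict-like client (assumption)
        self._client = None      # note: no connection attempt at boot

    def _get_client(self):
        if self._client is None:
            self._client = self._connect()
        return self._client

    def fetch(self, key, compute):
        """Return the cached value, or fall back to `compute()` if the cache is down."""
        try:
            client = self._get_client()
            if key in client:
                return client[key]
        except ConnectionError:
            self._client = None  # retry the connection on a later request
            return compute()     # slower, but the request still succeeds
        value = compute()
        try:
            client[key] = value
        except ConnectionError:
            self._client = None
        return value


def unreachable_cache():
    raise ConnectionError("cache cluster is down")  # simulated outage


app_cache = LazyCache(unreachable_cache)  # "boot" succeeds anyway
page = app_cache.fetch("repo:42", lambda: "rendered from primary database")
```

The process starts even though the cache is down, and `page` is still produced from the slow path. Whether the slow path can carry production load without the cache is a separate capacity question that the sketch does not answer.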

cognivore 10 years ago |

Um, work from your local cache for a few hours? Isn't that one of the main reasons for git?

majewsky 10 years ago | |

Not all processes that involve GitHub are development processes. I've seen automated deployments fail inside a corporate network when the resident HTTP proxy had a bad day and could not connect to github.com.

timiblossom 10 years ago |

If you use Redis, you should try out Dynomite at http://github.com/Netflix/Dynomite. It can provide HA for Redis servers

rurounijones 10 years ago |

I would have expected there to be a notification system owned by the DC that literally sends an email to clients saying "Power blipped / failed".

That would have given them immediate context and avoided wasting time on DDoS protection.

spydum 10 years ago |

So, while it sounds like they have reasonable HA, they fell down on DR. Unrelated: I could not comprehend what this means: "technicians to bring these servers back online by draining the flea power to bring"

Flea power?

Someone1234 10 years ago | |

I assume they mean completely disconnect the equipment from ALL external power sources. Typically even when a piece of equipment is offline in a data center, it continues to draw power, and will often keep running systems like DRAC and other management/status tools (since the whole concept of a data center is NEVER having to get up out of your chair, so even a "shutdown" system needs to be able to be remotely started).

Since the firmware had a bug, bad state could be stored; completely removing power may clear that state, and appears to have done so in this case. They may have also needed to pull the backup battery, and reset the firmware settings, but I wouldn't presume that just from the term "flea power."

spydum 10 years ago | | |

sure enough, it's a real term, and it's relatively old.. http://answers.google.com/answers/threadview/id/185999.html

I have never known what to call this, but have definitely been engaged in draining a few fleas.

Also, I can't believe it's been that long since google answers has been closed..

tonylxc 10 years ago |

TL;DR: "We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, ..."

This doesn't sound very good.

jpatokal 10 years ago | |

No, it sounds good, because it's realistic and then you can build mitigation strategies.

I was recently involved in an outage that occurred because the same datacenter was hit by lightning three times in a row. Everything was redundant up the wazoo and handled the first two hits just fine, but by the time the power went out for the third time within N minutes, there wasn't enough juice left in some of the batteries!

Now would it be possible to build an automated system that can withstand this? Probably. But would your time & money be better spent worrying about other failure modes? Almost certainly.

jrockway 10 years ago | |

If your plan to avoid downtime is to prevent power outages, you're going to have downtime. All their sentence says is they can't prevent power outages. That's fine, because the other 1/nth of your servers are on a different power grid in a different state.

tonylxc 10 years ago | | |

I totally share the view that the best way to avoid failure is to embrace it and cope with it.

It is true that their sentence is all about recovery; however, it is disappointing that they didn't mention anything about a redundant datacenter.

otterley 10 years ago | |

Whose datacenter are they in? This is the second time in less than two weeks that they've suffered a power-related issue. My company is in 4 different sites around the world and we've never lost power ever - and, if one circuit did go out, we'd still be up and running because all of our servers have redundant power supplies on separate infeed circuits.

mattdeboard 10 years ago |

Anyone have a link to a description of the firmware bug that caused the disk-mounting failure after power was restored?

ymse 10 years ago | |

I'm going to guess that these are Dell R730xd boxes with PERC H730 Mini controllers (LSI MegaRAID SAS-3 3108).

A failed/failing drive present during cold boot could cause the controller to believe there were no drives present. To add insult to injury, on early BIOS versions this made the UEFI interface inaccessible. The only way to recover from this state was to re-seat the RAID controller.

There were also two bizarre cases where the operating system SSD RAID1 would be wiped and replaced with a NTFS partition after upgrading the controller firmware (and more) on an affected system (hanging/flapping drives). Attempts to enter UEFI caused a fatal crash, but reinstall (over PXE) worked fine. BIOS upgrade from within fresh install restored it.

From the changelog:

    Fixes: 
    - Decreased latency impact for passthrough commands on SATA disks
    - Improved error handling for iDRAC / CEM storage functions
    - Usability improvements for CTRL-R and HII utilities
    - Resolved several cases where foreign drives could not be imported
    - Resolved several issues where the presence of failed drives could lead to controller hangs
    - Resolved issues with managing controllers in HBA mode from iDRAC / CEM
    - Resolved issues with displayed Virtual Disk and Non-RAID Drive counts in BIOS boot mode
    - Corrected issue with tape media on H330 where tape was not being treated as sequential device
    - resolved an issue where Inserted hard drives might not get detected properly.

TazeTSchnitzel 10 years ago |

> We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code.

I seem to recall a recent post on here about how you shouldn't have such hard dependencies. It's good advice.

Incidentally, this type of dependency is unlikely to happen if you have a shared-nothing model (like PHP has, for instance), because in such a system each request is isolated and tries to connect on its own.

totally 10 years ago |

> Because we have experience mitigating DDoS attacks, our response procedure is now habit and we are pleased we could act quickly and confidently without distracting other efforts to resolve the incident.

The thing that fixed the last problem doesn't always fix the current problem.

dgritsko 10 years ago | |

Occam's razor isn't a bad rule of thumb, however.

swrobel 10 years ago |

Anyone got a good tl;dr version?

alblue 10 years ago | |

Power outage in DC brought many machines down. Redis clusters failed to start owing to disk issues (not cleanly unmounted?). The reboot of remaining machines uncovered an unknown dependency on the machines needing the redis cluster to be up in order to boot.

There were other learning points, such as immediately going into anti-DDoS mode, and human communication issues that meant the problem wasn't recognised or escalated until some time after the issues started occurring.

aidenn0 10 years ago | |

Power outage brought 25% of servers down.

Firmware issue meant that a large fraction of their servers could not detect the disks on reboot.

This prevented the redis cluster from starting.

They inadvertently have a hard-dependency on redis being up for the majority of their infrastructure to start.

daigoba66 10 years ago | |

Lost power. Took a while to get the servers cleanly rebooted.

contingencies 10 years ago | |

No CI/test process was in place for critical systems to ensure that they had no external dependencies.

Takeaway: If you run any complex system, ensure that each component is tested for its response to various degrees of failure in peer services, including but not limited to totally unavailable, intermittent connectivity, reduced bandwidth, lossy links, power-cycling peers.

No CI/test process was in place for hardware/firmware combos to ensure they recovered fine from power loss.

Takeaway: If you run a decent-sized cluster, ensure all new hardware ingested is tested through various power state transitions multiple times, and again after firmware updates. With software defined networking now the norm, we have little excuse not to put a machine through its paces on an automated basis before accepting it to run critical infrastructure.

No CI/test process was in place for status advisory processes to ensure they were sufficiently rapid, representative, and automated.

Takeaway: Test your status update processes as you would test any other component service. If humans are involved, drill them regularly.

Infrastructure was too dependent on a single data center.

Takeaway: Analyze worst case failure modes, which are usually entire-site and power, networking or security related. Where possible, never depend on a single site. (At a more abstract level of business, this extends to legal jurisdictions). Don't believe the promises of third party service providers (SLAs).

PS. I am available for consulting, and not expensive.
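
The first takeaway (test each component against degrees of peer failure) could be sketched roughly as follows in Python. `FakePeer`, `handle_request` and the three failure modes are hypothetical stand-ins; a real suite would also cover reduced bandwidth, lossy links and power-cycling peers, which need more than an in-process fake.

```python
import itertools


class FakePeer:
    """Stand-in peer service whose behaviour follows a scripted failure mode."""

    def __init__(self, mode):
        self.mode = mode
        self._calls = itertools.count(1)

    def get(self):
        n = next(self._calls)
        if self.mode == "unavailable":
            raise ConnectionError("peer is down")
        if self.mode == "intermittent" and n % 2 == 1:
            raise ConnectionError("dropped connection")  # every odd call fails
        return "fresh data"


def handle_request(peer, retries=2):
    """Component under test: retry a little, then serve a degraded response."""
    for _ in range(retries):
        try:
            return peer.get()
        except ConnectionError:
            continue
    return "degraded response"


def run_failure_matrix():
    """Exercise the component against each failure mode and collect the results."""
    return {mode: handle_request(FakePeer(mode))
            for mode in ("healthy", "intermittent", "unavailable")}


results = run_failure_matrix()
```

Running this in CI and asserting on `results` (fresh data for the healthy and intermittent peers, a degraded response for the unavailable one) turns "no hard external dependencies" into a check that fails the build when someone adds one.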

maerF0x0 10 years ago | |

Intern trips on power cable, 25% of servers go down.

Edit: this is mostly the "DR" part of tldr :P

draw_down 10 years ago | |

"Stuff went wrong and our servers were down for a couple hours."

You're welcome.

jargonless 10 years ago |

What is this "HA" jargon?

I would STFW, but searching for "HA" isn't helpful.

dang 10 years ago | |

We detached this subthread from https://news.ycombinator.com/item?id=11030063 and marked it off-topic.

polysaturate 10 years ago | |

Pretty sure it's "High Availability" in this instance...

suraj 10 years ago | |

High availability

cycomachead 10 years ago | |

I had to STFW to figure out what STFW meant...TIL.

xzlzx 10 years ago | |

You could google "HA", click on the Wikipedia link that shows all the things "HA" may refer to, and deduce that the most logical thing in the list, given the context, would be this link: https://en.wikipedia.org/wiki/High_availability.

dgritsko 10 years ago | | |

Would it have been so hard to just type "high availability" rather than making him feel bad for being one of today's 10,000? https://xkcd.com/1053/

mattbeckman 10 years ago | |

Also of note: the ever-popular HAProxy, which GitHub uses and mentions, stands for "High Availability Proxy".

julesbond007 10 years ago |

I seriously doubt this version of the story. While it's possible for several pieces of hardware/firmware to fail in all your datacenters, for them to fail at the same time is highly unlikely. This may just be PR spin to make people think they're not vulnerable to security attacks.

While this was happening at Github, I noticed several other companies facing that same issue at the same time. Atlassian was down for the most part. It could have been an issue with the service github uses, but they won't admit that. Notice they never said what the firmware issue was instead blaming it on 'hardware'.

I think they should be transparent with people about such vulnerability, but I suspect they would never say so because then they would lose revenue.

Here on my blog I talked about this issue: http://julesjaypaulynice.com/simple-server-malicious-attacks...

I think it was some ddos campaign going on over the web.

dandandan 10 years ago | |

They're not hosted in multiple datacenters; there was a power interruption in their single datacenter that exposed this firmware bug. The point of this postmortem isn't the initial power interruption but rather its repercussions, why it took so long to recover from and how they can improve their response and communications in the future.

julesbond007 10 years ago | | |

Ok...so this is another PR...without admitting the issue. I don't know github's infrastructure, but they have a single point of failure? Last I knew, every place these days has backup power, especially a datacenter...so those were not working either? My point is that it's much better to be upfront sometimes. In fact github didn't have to say anything about the whole thing since everyone forgot already...