Update on 1/28 service outage

Update on 1/28 service outage (github.com)

179 points by traviskuhl 10 years ago | 186 comments

rburhum 10 years ago |

Yesterday I was being a bit of an ass to a few people about how "the whole point of using git is so that we can do decentralized code management and why these dependencies were being pulled from our private github if they could be sent point to point yadda yadda yadda". Then they proceeded to go over the list of package managers and dependencies we used and I had to shut up. Even when we host our own Docker Hub and package managers (we do), if you dig far enough, you can find some dependency of a dependency of a dependency that relies on GitHub. Brew/npm/build script/whatever. It is crazy how everything has changed so much in the past few years. GitHub went from something that was really nice to have to a core requirement for complex systems that rely heavily on open source.

chatmasta 10 years ago | |

If this is a problem worth solving, you can absolutely solve it. The easiest would be through the use of a caching proxy and/or load balancing system.

The caching proxy system could be as simple as setting up a squid cache for apt. Multiple projects exist which do this already.

The load balancing system would involve keeping a private mirror of every repository in the dependency graph, and falling back to the mirror when GitHub fails. To automate this, proxy all git requests. If github is up, let the request pass through. If no mirror exists for the repository, create one. If GitHub fails, fall back to the mirror.
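The pass-through-then-fall-back behaviour described above can be sketched roughly as follows. This is only a toy model in Python, not a working git proxy: `MirrorFallback` and `fetch_upstream` are hypothetical names, and the "repository data" is just bytes held in memory rather than a real mirror on disk.

```python
from typing import Callable, Dict


class MirrorFallback:
    """Toy model of a proxy that mirrors every repository it sees."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]):
        # fetch_upstream is a hypothetical callable that raises
        # ConnectionError when the upstream host is unreachable.
        self.fetch_upstream = fetch_upstream
        self.mirrors: Dict[str, bytes] = {}

    def fetch(self, repo: str) -> bytes:
        try:
            data = self.fetch_upstream(repo)
        except ConnectionError:
            # Upstream is down: serve the last mirrored copy, if one exists.
            if repo in self.mirrors:
                return self.mirrors[repo]
            raise
        # Upstream is up: pass the result through and create/refresh the mirror.
        self.mirrors[repo] = data
        return data
```

Because the mirror is refreshed on every successful upstream request, it is at most as stale as the last time anyone fetched that repository. A repository that was never requested before the outage still fails, which is the gap a real setup would have to close by pre-populating mirrors.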

yeukhon 10 years ago | | |

> The easiest would be through the use of a caching proxy and/or load balancing system.

Sorry if I sound like an asshole, but before we actually use the word "easiest", can you please share with us how to do all of that? It is not as easy as you claim, to be honest. Not just some /etc/hosts hijack.

tylorr 10 years ago | | |

For the load balancing system I would assume you would also want to keep the mirrors up to date as well? So for every request if mirror exists and is out of date, update it.

viperscape 10 years ago | |

The package system for the rust language actually relies on github, as many found out during outage. I don't know if that will change, probably will with a read copy in a different git service.. but I thought it was interesting because I use github for everything save a few private projects, as I imagine most do. I'm not sure what to think of this, it seems backwards and grossly incompetent, yet here we are using it almost exclusively. It might be smart to decentralize some of this with torrents, if that's possible. Even if it was the read portion of a repository, it seems like something to consider, if it hasn't been already

brinker 10 years ago | | |

> The package system for the rust language actually relies on github, as many found out during outage.

This is not quite correct (although close to it). Cargo doesn't rely on GitHub, but it expects that there is some publicly-accessible git repository from which it can pull the source for any crate, and most crates use GitHub. So it's not a particular choice of Cargo, but a side-effect of GitHub's popularity in the community, and the fact that Cargo does not host source code itself.

rms_returns 10 years ago | | |

Not just rust language, to the best of my knowledge, even packagist, the php package manager relies heavily on github for sourcing its packages. But I think they have other resources too, apart from github.

seiji 10 years ago | |

> if you dig far enough, you can find some dependency of a dependency of dependency that relies on GitHub. Brew/npm/build script/whatever

But really, why?

Is it just institutional laziness on the part of all developers? We had reliable rsync CPAN mirrors in 1995. In the early days of the Internet, companies would mutually host secondary DNS for each other to be more reliable. For some reason, we've forgotten all about reliability and disaster recovery and geographical distribution. Now the collective programmer mindset with regards to global infrastructure seems to be "lol, we're too dumb to make things work, let's just outsource everything to closed source, for-profit companies and hope for the best."

hguant 10 years ago | | |

I think a large part of this is that cloud hosting has allowed us to abstract those problems - reliability, disaster recovery, geographical distribution - away, and we don't really think of computers as computers anymore. It's a service or a platform or what have you, and the expectation is that it will always be there. I wouldn't say this is laziness, just a byproduct of changing how we view Internet architecture. We used to build systems to take care of reliability etc. because everyone had those problems. Now, those are only things you'll experience if you host your own stuff, or work for one of the big providers. (Broad assertion, I know, but I think it's mostly true)

swalsh 10 years ago | |

I mean a really simple solution (simple to say, maybe not to do) would be for package managers to require a "backup" repository from a different domain; then, if you get a 500 error, try the second remote repository. Use git for its advantages.

ugexe 10 years ago | | |

I think you mean a mirror, and many package managers use them.

forrestthewoods 10 years ago | |

And people give me shit when I argue that open source projects should include 100% of dependencies.

kbenson 10 years ago | | |

I think that's a bit crazy as well. This is a problem if your build process happens often and requires pulling external data. Ideally, you want a way to cache that external data, and a way to force invalidation of that cache.

Building, at least after the first time, should not require external access. There are security reasons for this as well.
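A minimal sketch of that idea, assuming a hypothetical `download` callable that does the actual network fetch: the first build populates the cache, later builds never touch the network, and `invalidate` forces a fresh download on the next request. The in-memory dict stands in for whatever on-disk store a real build system would use.

```python
from typing import Callable, Dict


class DependencyCache:
    """Caches downloaded dependencies so repeat builds need no network."""

    def __init__(self, download: Callable[[str], bytes]):
        # download is a hypothetical function that fetches a named
        # dependency from an external source.
        self.download = download
        self._store: Dict[str, bytes] = {}

    def get(self, name: str) -> bytes:
        # Only go to the network when the dependency is not cached yet.
        if name not in self._store:
            self._store[name] = self.download(name)
        return self._store[name]

    def invalidate(self, name: str) -> None:
        # Drop the cached copy; the next get() downloads it again.
        self._store.pop(name, None)
```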

dsjoerg 10 years ago | | |

how far down the stack do you go? do open source projects need to include their own compiler? what would compile it?

zwetan 10 years ago | | |

Same here, but I don't care: all deps HAVE TO be in the repo, period.

In fact I go further than that: anything that a project depends on HAS TO be "saved" somehow, somewhere. Use a special commercial tool? Save it. Use some particular OS? Save the ISO. Need a particular version of a compiler / SDK? Have an installer ready, etc.

But nowadays it seems devs program temporary stuff meant to last just a few months.

jessaustin 10 years ago | | |

If you personally run software for which reliability is important, absolutely you should maintain your own vendor repos. Open source projects are not in that position, and following your advice would lead to much harmful coupling and repetition.

nickpsecurity 10 years ago | |

That's a good point. I've been ignoring learning Git as long as I can but almost everything on my todo list heavily uses it. Or ties into it as you said. So, I'm going to have to bite the bullet and learn it.

Yet, I swore Git fans told me its decentralized design avoids single points of failure, where everyone has a copy and can still work when a node is down, just not necessarily coordinate or sync in a straightforward way. This situation makes me think, either for Git or just Github, there's some gap between the ideal they described and how things work in practice. I mean, even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime in my experience.

When I pick up Git/Github, I think I'll implement a way to constantly pull anything from Git projects into local repos and copies. Probably non-Git copies as a backup. I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.

ajkjk 10 years ago | | |

I'm sorry to be rude, but, it sounds like you should go learn Git and come back to this conversation.

The decentralized design does avoid single points of failures, and everyone does have a copy. So - check, check, great. Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization. But there is certainly no immediate coupling between the Git repository on your computer and the Github repository it's pulling from. It's not like Github being down in any way prevents you from working on code you've already checked out, unless you need to go check out more code.

(The same obviously may not be true for package managers and build scripts that are not running in isolation from your upstream repository, which is where the problems have arisen.)

kbenson 10 years ago | | |

Git works as advertised, but when your build processes all start with a sync from the upstream master (the equivalent of "svn up"), and a lot of build scripts require that to work, they've thrown away that advantage when building.

Everyone with a checked out repo should have been able to develop and commit, branch and merge locally fine though.

jallmann 10 years ago | | |

> either for Git or just Github, there's some gap between the ideal they described and how things work in practice

The hub-spoke topology is the easiest way of distributing source code to a lot of people. If the hub goes down, this is what happens. If that leads to a halt in productivity, then that is a failure in contingency planning. Git gives you many tools to distribute your workflow, but that won't save you if your workflow is centralized around Github.

Granted, sometimes you don't really have a choice whether to depend on Github, such as when working with language package managers. Perhaps that goes to show that mirroring and resiliency should be a design consideration in those tools, but it's not a shortcoming of Git itself.

> even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime

It's easier than ever to have HA with a DVCS: clone the repository somewhere else and keep it in sync with commit hooks.

Large FOSS projects (should) do this by keeping a self hosted repository, and mirroring somewhere else like Github, Bitbucket, etc. Internally, an org should be able to quickly stand up a SSH or HTTP server for the purpose, or have collaborators push-pull directly from each other. Worst case? Send patches. Git apply works really well, and you might be surprised at how clever git-merge is when everyone finally syncs up.

That's what it means to be distributed: there is no real concept of a "central" node, unlike Subversion. Every local checkout has a full copy of the repository history. Any centralization is a (somewhat understandable) incidental artifact of how Git is being used.

ymse 10 years ago | | |

> I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.

In a certain sense, git is "append-only". If you change a commit in history, every descendant commit will have its SHA hash changed. Naturally this will conflict with other copies of the repository.
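The hash-chaining idea behind this can be illustrated with a simplified model. This is not git's real object format (real commits hash a tree, author, timestamps and so on); `chain_ids` is a hypothetical helper that only shows why rewriting one commit changes the id of everything built on top of it.

```python
import hashlib
from typing import List


def chain_ids(messages: List[str]) -> List[str]:
    """Return one id per commit, derived from its content and its parent's id."""
    ids: List[str] = []
    parent = ""  # the first commit has no parent
    for message in messages:
        digest = hashlib.sha1((parent + "\n" + message).encode()).hexdigest()
        ids.append(digest)
        parent = digest
    return ids


original = chain_ids(["add README", "fix typo", "add tests"])
rewritten = chain_ids(["add README", "fix typo properly", "add tests"])

# The commit before the edit keeps its id; the edited commit and every
# commit after it get new ids, even though "add tests" itself is unchanged.
unchanged = [a == b for a, b in zip(original, rewritten)]
```

Any other copy of the repository still holds the original ids, which is why a rewritten history shows up as a divergence rather than passing silently.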

For backups you should do a "git clone --bare" which checks out the internal git structure with data and history, but not the actual files.

maker1138 10 years ago | | |

Git is to GitHub as JavaScript is to Java. Though their names are similar they are very different things.

debaserab2 10 years ago | | |

git != github

skewart 10 years ago |

Am I the only one who is a little shocked that a power outage could have such a huge effect and bring them down for so long? I'm not an infrastructure guy, and I don't know anything about Github's systems, but aren't data center power outages pretty much exactly the kind of thing you plan for with multi-region failover and whatnot? Is it actually frighteningly easy for this kind of thing to happen despite following best practices? Or is it more likely that there's more to the story than what they're sharing now?

nickpsecurity 10 years ago |

Here's the only page I could quickly find on Github's architecture for those interested:

https://github.com/blog/530-how-we-made-github-fast

This looks like a single datacenter. I don't see anything here indicating high availability or other datacenters. You'll usually spot either an outright mention of it or certain components/setups common in it. They might have updated their stuff for redundancy since then. However, if it's same architecture, then the reason for the downtime might be intentional design where only a single datacenter has to go down.

Might be fine given how people apparently use the service. It's just good to know that this is the case so users can factor that into how they use the product and have a way of working around the expected downtime if it's critical to do so.

bhaak 10 years ago |

"Millions of people and businesses depend on GitHub"

Well, we shouldn't depend on it so much.

I shudder at the thought of what an outage of GitHub would mean for our company. This time, we were lucky as it was during the night in Europe.

Unfortunately, I don't have the power to test this scenario in our company.

anton_gogolev 10 years ago |

It's one thing when one temporarily loses access to remote repositories for pushes. Quite bearable, because you can exchange code across your corporate network using patches and whatnot. And it's totally different when you cannot friggin build anything because package managers grab dependencies directly off of GitHub.

msbarnett 10 years ago | |

This is more an argument for caching or vending dependencies than anything else.

If the ability to make builds is critical to your org, making your build process depend on the availability of third-party services over which you have no control is going to end in tears.

banku_brougham 10 years ago | | |

This is it. Production builds have to have dependencies hosted internally, not all over the web.

saidajigumi 10 years ago | |

Agreed. The modern ease of pulling in third-party dependencies, while wonderful in its way, has gotten so easy that even "simple" applications require automated caching infrastructure. E.g. if you just fork your top-level dependencies, you won't pick up any of your recursive dependencies.

I suppose we all need package manager and git/VCS aware recursive forking/caching tools now. E.g. works with npm, gem, etc. and recursively forks your entire dependency chain.

And to think that I managed that sort of thing entirely by hand some years back. (For C/C++ libs, then, so far more manageable.)

bjacobel 10 years ago |

Not much detail here. A more thorough postmortem would give me more confidence they can recover from another similar issue. Hoping to see one soon.

anon987 10 years ago | |

Yep, I think most of these post-mortems from any company are pointless from a technical perspective. It's 4 paragraphs that boil down to "someone did something wrong and we'll make sure it doesn't happen" with zero specifics.

There's no point in reading these because there's no technical information. Stuff like this is something you send to your customer because they want root cause.

Zikes 10 years ago | | |

I strongly disagree that these sorts of communications are pointless. In every major service outage I've seen where the company maintained a degree of silence, it's caused major damage to their public relations and consumer trust.

I know it doesn't tell you much about exactly what happened, but the truth is they may still be sorting that out and focusing on ensuring it does not happen again. An in-depth post-mortem accompanied by a description of the fix would be great. In the meantime, admitting culpability and apologizing are the ideal essential first steps.

Zikes 10 years ago | |

I agree that a postmortem would be great, but it's good PR for companies to quickly put out statements like this to admit fault and maintain customer trust.

outworlder 10 years ago | |

Give them time.

frik 10 years ago |

You can see the cascade effect on their status page graphs: https://status.github.com/

Loic 10 years ago | |

What is impressive is that with a website 2h down, they can still announce a 97% availability for the day even though the graph clearly shows the 2h of failures in the day... :-/

WillAbides 10 years ago | | |

The 97% you see on the status page is for the past 24 hours. That doesn't include any of the outage being discussed here.

arthurschreiber 10 years ago | | |

Unless I'm mistaken, 97% of (24 hours) = 23.28 hours.
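The arithmetic can be checked in both directions with a small sketch (the function names are made up for illustration): 97% of a 24-hour window leaves 0.72 hours, or roughly 43 minutes, of downtime, whereas a full 2-hour outage inside the window would read as about 91.7%.

```python
def downtime_minutes(availability: float, window_hours: float = 24.0) -> float:
    """Minutes of downtime implied by an availability fraction over a window."""
    return (1.0 - availability) * window_hours * 60.0


def availability_for(downtime_hours: float, window_hours: float = 24.0) -> float:
    """Availability fraction implied by a given amount of downtime."""
    return 1.0 - downtime_hours / window_hours


implied_downtime = downtime_minutes(0.97)      # about 43 minutes
two_hour_outage = availability_for(2.0)        # about 0.917
```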

ceejayoz 10 years ago | |

Interesting that their exception logging didn't get turned back on until this morning, from the looks of things.

rcthompson 10 years ago | | |

Well, if exception logger was going off nonstop due to the outage, yet not providing any new information, it would make sense if they disabled it until things had returned to normal.

tommoor 10 years ago |

This post makes it sound like Github has its own data centers and power infrastructure; this is definitely news to me. I'd presumed co-lo at best.

noazark 10 years ago | |

The last news I've heard about it was back in 2009, https://github.com/blog/493-github-is-moving-to-rackspace. But I've also heard that they have some infrastructure on site (clearly not what they were talking about).

seiji 10 years ago | |

"data center" is a confusing term.

Very few companies build their "data centers" (apple, google, amazon, NSA, actual 'data center' companies, etc). Most companies rent cage space in a larger data center and call that their "private data center." Smaller companies will rent a few dedicated servers or colo half racks from other resellers.

brazzledazzle 10 years ago | |

Unless it's explicitly stated to be a wholly owned data center I always assume companies are talking about rack space in a bigger DC like supernap.

moondev 10 years ago |

Github doesn't deploy their services in multiple az's?

rs999gti 10 years ago | |

Maybe they do.

But this two hour failure tells me that they have never really tried a hot failover and failback scenario in order to test the resiliency of their site.

detaro 10 years ago | | |

Or something happened that didn't happen in the tests. And if they suspected something might be in an inconsistent state, taking some downtime to make sure it comes back up properly clearly is the better option.

moondev 10 years ago | | |

Hope we get more info about it. Would be very interesting to see how their architecture is set up.

beachstartup 10 years ago |

it seriously makes me lol that people are upset, or surprised, that an internet service went down for a couple of hours. a couple of hours! get some perspective please. go for a walk, get a tasty burrito, try a new brand of hot sauce.

"why didn't they do X, Y, or Z"

the answer in every case is it's extremely expensive, or extremely hard to do, or both. you want a reason, there's the reason. maybe they'll fix it. maybe they won't. next question.

make your own backups and redundant systems. "but github is so critical!" -- even more reason to have a backup. bad shit happens in this world. even to good people. prepare or suffer the consequences.

ljk 10 years ago |

Maybe I'm ignorant, but why do companies rely on github? Why not just host it in-house? If there's power outage in the office then everything would be down anyways, right?

danneu 10 years ago | |

A rare two hour Github outage isn't enough to make anyone on my team want to start dicking around with internal tools.

gavazzy 10 years ago |

Would it be possible for a cross between Git and Torrents? Rather than having a central server to pull/push from, instead the server would provide a list of clients. If the server goes down, the list is still available, and so people who depend on it would be able to communicate.

matt_wulfeck 10 years ago |

Why is it so hard for us to distribute our dependencies? Hash the package to a sha and put it anywhere on the Internet. Then we just need a service that holds and updates the locations of the hashes and we can fetch them anywhere.
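A rough sketch of that scheme, with everything here hypothetical: `HashLocator` plays the role of the "service that holds the locations", `download` stands in for an HTTP fetch, and SHA-256 is an assumed choice of hash. The point is that the package is identified by its digest, so it does not matter which host serves it as long as the bytes verify.

```python
import hashlib
from typing import Callable, Dict, List


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class HashLocator:
    """Maps a package digest to every location known to host it."""

    def __init__(self) -> None:
        self.locations: Dict[str, List[str]] = {}

    def publish(self, sha: str, url: str) -> None:
        self.locations.setdefault(sha, []).append(url)

    def fetch(self, sha: str, download: Callable[[str], bytes]) -> bytes:
        # Try each known location; skip hosts that are down or that
        # return bytes which do not match the requested digest.
        for url in self.locations.get(sha, []):
            try:
                data = download(url)
            except ConnectionError:
                continue
            if digest(data) == sha:
                return data
        raise LookupError("no reachable location serves " + sha)
```

The locator itself then becomes the thing that must stay available, so in practice it would need mirroring too.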

ibejoeb 10 years ago |

For those that have been affected by this, what parts of your process were disrupted? I've read, so far:

  * Build fails due to unreachable dependencies hosted by GitHub
  * Development process depends on PRs

free2rhyme214 10 years ago |

Chinese DDoS? Somehow I don't buy power going out at a server farm.

hayleox 10 years ago | |

Why not? Things break. Electricity is one of those magical things that's very hard to have insanely good uptime -- frankly, it's incredibly impressive that power outages aren't more common.

And why would GitHub not disclose that it was a DDoS? They were very forthcoming when there actually _was_ a Chinese DDoS last April: http://arstechnica.com/security/2015/04/ddos-attacks-that-cr...

And in a DDoS, the service typically becomes slower and slower until it reaches the point where only like one in a hundred requests succeeds. With the GitHub outage, it died fairly instantaneously, and it was completely 100% dead. There was no timeout as the servers tried to respond -- the "no servers are available" error page loaded instantly every time.

cjbprime 10 years ago | |

> Chinese DDoS? Somehow I don't buy power going out at a server farm.

You should read more about server farms.

johnhenry 10 years ago | |

Considering the attacks within the past year, I was thinking the same thing. I hate to spread conspiracies without foundation, but I wonder if anyone has seen an assessment of the cyberkinetic capabilities of nations around the world?

smaili 10 years ago |

It's always scary when a cloud service you rely on goes down but great to see GitHub recover. Well done!

out_of_protocol 10 years ago |

Various date/time formats across the world bring me to my knees. If the 1/28 outage was _that_ rough, 2/28 would be twice as bad and 28/28 would feel like armageddon, maybe?

ryanfitz 10 years ago |

I recently read a blog post from Github about them operating their own datacenter http://githubengineering.com/githubs-metal-cloud/

I'm not positive, but it sounds like a fairly recent switch from a cloud provider to their own datacenter. If that's the case, I'd expect a number of outages to come in the following months.

secure 10 years ago | |

AFAIK, they never used a cloud provider.

ryanfitz 10 years ago | | |

Github was hosted at Rackspace; here is their blog post about it: https://github.com/blog/493-github-is-moving-to-rackspace

From their blog posted last month:

As we started transitioning hosts and services to our own data center, we quickly realized we'd also need an efficient process for installing and configuring operating systems on this new hardware.