Update on 1/28 service outage (github.com)
The caching proxy system could be as simple as setting up a squid cache for apt. Multiple projects exist which do this already.
The load balancing system would involve keeping a private mirror of every repository in the dependency graph, and falling back to the mirror when GitHub fails. To automate this, proxy all git requests. If GitHub is up, let the request pass through. If no mirror exists for the repository, create one. If GitHub fails, fall back to the mirror.
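A minimal sketch of that proxy logic, in Python, with no real networking. The class name `GitProxy` and the three return values are made up for illustration; "creating a mirror" is modeled as adding the repository name to a set.

```python
from dataclasses import dataclass, field


@dataclass
class GitProxy:
    """Toy model of the proxy described above; no network access is performed."""

    # Repositories we already hold a private mirror of (hypothetical bookkeeping).
    mirrored: set = field(default_factory=set)

    def handle(self, repo: str, upstream_up: bool) -> str:
        """Return where a request for `repo` would be served from."""
        if upstream_up:
            # Upstream is healthy: make sure a mirror exists, then pass through.
            self.mirrored.add(repo)  # stands in for creating/refreshing the mirror
            return "upstream"
        if repo in self.mirrored:
            # Upstream failed: fall back to the private mirror.
            return "mirror"
        # Never requested while upstream was up, so there is nothing to fall back to.
        return "unavailable"
```

The last branch is the catch: a repository that was never requested while GitHub was up has no mirror, so the dependency graph has to be walked at least once before an outage.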
Sorry if I sound like an asshole, but before we actually use the word "easiest", can you please share with us how to do all of that? It is not as easy as you claim, to be honest. It's not just some /etc/hosts hijack.
This is not quite correct (although close to it). Cargo doesn't rely on GitHub, but it expects that there is some publicly-accessible git repository from which it can pull the source for any crate, and most crates use GitHub. So it's not a particular choice of Cargo, but a side-effect of GitHub's popularity in the community, and the fact that Cargo does not host source code itself.
But really, why?
Is it just institutional laziness on the part of all developers? We had reliable rsync CPAN mirrors in 1995. In the early days of the Internet, companies would mutually host secondary DNS for each other to be more reliable. For some reason, we've forgotten all about reliability and disaster recovery and geographical distribution. Now the collective programmer mindset with regards to global infrastructure seems to be "lol, we're too dumb to make things work, let's just outsource everything to closed source, for-profit companies and hope for the best."
Building, at least after the first time, should not require external access. There are security reasons for this as well.
In fact I go further than that: anything that a project depends on HAS TO be "saved" somehow, somewhere. Use a special commercial tool? Save it. Use some particular OS? Save the ISO. Need a particular version of a compiler / SDK? Have an installer ready, etc.
But nowadays it seems devs program temporary stuff meant to last just a few months.
Yet, I swore Git fans told me its decentralized design avoids single points of failure: everyone has a copy and can still work when a node is down, just not necessarily coordinate or sync in a straightforward way. This situation makes me think that, either for Git or just Github, there's some gap between the ideal they described and how things work in practice. I mean, even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime in my experience.
When I pick up Git/Github, I think I'll implement a way to constantly pull anything from Git projects into local repos and copies. Probably non-Git copies as a backup. I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.
The decentralized design does avoid single points of failures, and everyone does have a copy. So - check, check, great. Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization. But there is certainly no immediate coupling between the Git repository on your computer and the Github repository it's pulling from. It's not like Github being down in any way prevents you from working on code you've already checked out, unless you need to go check out more code.
(The same obviously may not be true for package managers and build scripts that are not running in isolation from your upstream repository, which is where the problems have arisen.)
Everyone with a checked out repo should have been able to develop and commit, branch and merge locally fine though.
The hub-spoke topology is the easiest way of distributing source code to a lot of people. If the hub goes down, this is what happens. If that leads to a halt in productivity, then that is a failure in contingency planning. Git gives you many tools to distribute your workflow, but that won't save you if your workflow is centralized around Github.
Granted, sometimes you don't really have a choice whether to depend on Github, such as when working with language package managers. Perhaps that goes to show that mirroring and resiliency should be a design consideration in those tools, but it's not a shortcoming of Git itself.
> even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime
It's easier than ever to have HA with a DVCS: clone the repository somewhere else and keep it in sync with commit hooks.
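A rough sketch of what such a sync hook could run, in Python. The function only builds the `git push` command lines; the remote names are whatever you configured with `git remote add` ("backup" below is a hypothetical example), and `sync_commands` is a name made up for this sketch.

```python
def sync_commands(remotes: list[str]) -> list[list[str]]:
    """Build the git invocations that push every branch and tag to each remote."""
    commands = []
    for remote in remotes:
        commands.append(["git", "push", remote, "--all"])   # all local branches
        commands.append(["git", "push", remote, "--tags"])  # all tags
    return commands
```

A hook script (for example `.git/hooks/post-commit`) could pass each returned list to `subprocess.run`. Whether you want that on every commit or on a timer is a judgment call, since a slow or unreachable mirror would otherwise stall local work.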
Large FOSS projects (should) do this by keeping a self hosted repository, and mirroring somewhere else like Github, Bitbucket, etc. Internally, an org should be able to quickly stand up an SSH or HTTP server for the purpose, or have collaborators push-pull directly from each other. Worst case? Send patches. git apply works really well, and you might be surprised at how clever git-merge is when everyone finally syncs up.
That's what it means to be distributed: there is no real concept of a "central" node, unlike Subversion. Every local checkout has a full copy of the repository history. Any centralization is a (somewhat understandable) incidental artifact of how Git is being used.
In a certain sense, git is "append-only". If you change a commit in history, that commit and every descendant commit will have its SHA hash changed. Naturally this will conflict with other copies of the repository.
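A toy illustration of why rewriting history is so visible, in Python. This is not Git's real object format, just a hash chain built on the same idea: each commit ID covers its parent's ID plus its own content. The helper names and commit messages are made up.

```python
import hashlib


def commit_id(parent_id: str, content: str) -> str:
    """Hash a toy commit: the ID covers the parent's ID plus this commit's content."""
    return hashlib.sha1(f"{parent_id}\n{content}".encode()).hexdigest()


def chain_ids(contents: list[str]) -> list[str]:
    """Return the IDs of a linear history, oldest commit first."""
    ids, parent = [], ""
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


original = chain_ids(["add README", "add parser", "fix typo"])
rewritten = chain_ids(["add README", "add PARSER", "fix typo"])  # edit the middle commit
```

Comparing the two lists, the first ID is identical, while the edited commit and the one built on top of it both get new IDs. That is what makes a rewritten history clash with every other copy.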
For backups you should do a "git clone --bare", which copies the internal git structure with data and history, but does not check out the actual files.
https://github.com/blog/530-how-we-made-github-fast
This looks like a single datacenter. I don't see anything here indicating high availability or other datacenters. You'll usually spot either an outright mention of it or certain components/setups common in it. They might have updated their stuff for redundancy since then. However, if it's the same architecture, then the downtime might follow from an intentional design where only a single datacenter has to go down to take out the service.
Might be fine given how people apparently use the service. It's just good to know that this is the case so users can factor that into how they use the product and have a way of working around the expected downtime if it's critical to do so.
Well, we shouldn't depend on it so much.
I shudder at the thought what an outage of GitHub would mean for our company. This time, we were lucky as it was during the night in Europe.
Unfortunately, I don't have the power to test this scenario in our company.
If the ability to make builds is critical to your org, making your build process depend on the availability of third-party services over which you have no control is going to end in tears.
I suppose we all need package manager and git/VCS aware recursive forking/caching tools now. E.g. works with npm, gem, etc. and recursively forks your entire dependency chain.
And to think that I managed that sort of thing entirely by hand some years back. (For C/C++ libs, then, so far more manageable.)
There's no point in reading these because there's no technical information. Stuff like this is something you send to your customer because they want root cause.
I know it doesn't tell you much about exactly what happened, but the truth is they may still be sorting that out and focusing on ensuring it does not happen again. An in-depth post-mortem accompanied by a description of the fix would be great. In the meantime, admitting culpability and apologizing are the ideal essential first steps.
Very few companies build their "data centers" (apple, google, amazon, NSA, actual 'data center' companies, etc). Most companies rent cage space in a larger data center and call that their "private data center." Smaller companies will rent a few dedicated servers or colo half racks from other resellers.
But this two hour failure tells me that they have never really tried a hot failover and failback scenario in order to test the resiliency of their site.
"why didn't they do X, Y, or Z"
the answer in every case is it's extremely expensive, or extremely hard to do, or both. you want a reason, there's the reason. maybe they'll fix it. maybe they won't. next question.
make your own backups and redundant systems. "but github is so critical!" -- even more reason to have a backup. bad shit happens in this world. even to good people. prepare or suffer the consequences.
* Build fails due to unreachable dependencies hosted by GitHub
* Development process depends on PRs
And why would GitHub not disclose that it was a DDoS? They were very forthcoming when there actually _was_ a Chinese DDoS last April: http://arstechnica.com/security/2015/04/ddos-attacks-that-cr...
And in a DDoS, the service typically becomes slower and slower until it reaches the point where only like one in a hundred requests succeeds. With the GitHub outage, it died fairly instantaneously, and it was completely 100% dead. There was no timeout as the servers tried to respond -- the "no servers are available" error page loaded instantly every time.
You should read more about server farms.
I'm not positive, but it sounds like a fairly recent switch from a cloud provider to their own datacenter. If that's the case, I'd expect a number of outages to come in the following months.
From their blog posted last month:
As we started transitioning hosts and services to our own data center, we quickly realized we'd also need an efficient process for installing and configuring operating systems on this new hardware.
And the worst that can happen is a customer's stream stops and they have to restart it.
But in most big companies you have thousands of apps that are all doing very different things. Perhaps a critical app might run on 4 hosts spread across two data centres. You're not going to convince people to have Chaos Monkey regularly and randomly bringing down these hosts; it would cause real impact and is risky. Yeah, in theory it should be able to cope, but in reality the scales in most orgs are quite different.
That said github sounds a lot more like the netflix end of the scale, doing one specific thing at large scale.
Actually, wasn't this[0] what did happen several years ago when Amazon Ireland went down for days on end?[1]
[0] TL;DR: Cascading effects of power outage.
[1] http://readwrite.com/2011/08/08/amazons-ireland-services-sti... (didn't read the article, it was just high in the google search results)
Are those kinds of failover scenarios frequently messy and risky at the scale of Github? Or is it more likely that in the context of a fast growing company, and even at a place as "cloudy" as Github, there are bound to be some serious bugs lurking in your system design?
It's one of those things where, if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening. So when it does, it's not pretty. :)
Would love to read examples of who is doing this and how. Reminds me of Netflix's Chaos Monkey, only applied to electricity. :p
I can't remember the last time there was a power outage at a Tier I or II data center -- they're all N+1, from the cabinet PDUs to the distribution units to the UPSes to the diesel generators. Some even go so far as to connect to multiple in-feeds from different utility providers.
At my company, every piece of server, storage, and network equipment we own is connected via redundant power supplies to different circuits (except for nonessential equipment like monitors; we can simply re-plug them into the functioning circuit). I can't imagine running a datacenter any other way.
You can't failover things you didn't predict.
1. Degraded performance that might be a fault justifying fail-over. A human in the loop is a must here as complex services can just act weird under load or randomly.
2. Corrupted data or packets coming in that might indicate a failure. Might automatically fail-over here.
3. No data coming in at all for 5-10 seconds, esp on a dedicated line. Fail-over automatically here as nothing sending data is already the definition of downtime and probably indicates a huge failure.
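The three cases above could be sketched as a small decision function in Python. Everything named here (`Action`, `triage`, `silence_limit`) is hypothetical, and the 5-second default is just the low end of the "5-10 seconds" mentioned in case 3. Case 2 is treated as automatic, although the text only says it "might" be.

```python
from enum import Enum


class Action(Enum):
    NONE = "keep running"
    ASK_HUMAN = "page an operator"
    FAIL_OVER = "fail over automatically"


def triage(seconds_silent: float, corrupted: bool, degraded: bool,
           silence_limit: float = 5.0) -> Action:
    """Map the three signals from the list above to an action."""
    if seconds_silent >= silence_limit:
        return Action.FAIL_OVER   # case 3: no data at all is already downtime
    if corrupted:
        return Action.FAIL_OVER   # case 2: corrupted data or packets
    if degraded:
        return Action.ASK_HUMAN   # case 1: might just be load, keep a human in the loop
    return Action.NONE
```

The ordering matters: total silence is checked first because it is the least ambiguous signal, and degraded performance last because it is the one most likely to be a false alarm.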
Companies should also do plenty of practice fail-overs at various layers of the stack during non-critical hours to ensure the mechanisms work. In Github's case, number 3 should've applied, and solutions from as far back as the '80s would kick in automatically within seconds to minutes. Their tech or DR setup must just not be capable of that. There could be good financial reasons or something for that, but not technological ones.
That all assumes there's a total and catastrophic failure at the main datacenter. If not, there are local backup batteries to sustain a smoother fail-over plus shutdown. Plus, there are tricks like isolating the monitoring systems from the main systems and power supply using things like data diodes over optocouplers or infrared. At least one thing will still be working and feeding you reliable information over a wireless connection after the full failure.
NonStop and VMS setups from late 80's did better than Github. My own setups involving a minimum of servers plus apps with loose coupling could fail-over in such a situation. So, this just has to be bad architecture caused by who knows what. Examples below of OpenVMS in catastrophic situations having either no downtime or short downtime due to good architecture plus disaster planning.
Case study of active-active at World Trade Center http://h71000.www7.hp.com/openvms/brochures/commerzbank/comm...
Marketing piece where HP straight-up detonates a datacenter. Guess who was number 1 in recovery. :) https://youtu.be/bUwthF9x210?t=34s
The cases that work are not the ones you hear about. Best practices and testing reduce the risk of making the news, but can't guarantee success.
That said, it's possible that github may have considered that this particular style of outage is rare enough that they don't want to make their design tolerate it. Though if that were the case, I'd wager they'd re-evaluate the cost/benefit right around now. :)
Purely conjecture, but I suspect since github uses mysql cluster they only write to a single dc, which would be the primary dc that failed in this case.
The things that come to mind: issue trackers, messaging, not being able to see latest pull requests.
Update: Now I'm starting to understand the build dependency issue. Still, why do you need to rebuild all dependencies from the GitHub repo to build the application? Can't the currently available version work?
You're right in that you're (probably) not totally deadlocked. But I can't start to estimate the lost $$ in productivity that comes with a global GH outage because of all that.
Should that fail, start working on the local repos until github is back, then sync back to it
If a company can't maintain their own internal tools and self-hosting servers, why does the same company think it can run reliable services for users?
Not putting the core of your business on a remote platform is disaster mitigation 101.
Github should not be the master, it should be a mirror of a company master that they host on their own server.
The main problem with that is some companies do not want the cost of the infra + the cost of the sysadmin to set that up, etc.
The second problem is the build, even if you host your own repos, if all your dependencies are on github and you don't include them in the repo, then you are bust.
https://news.ycombinator.com/item?id=10182282
That would solve readability, plenty of subversion, verifiability, much of portability, and perform anywhere from OK to good. Not going to happen but academics and proprietary software already did it to varying degrees. As post noted, traceability & verification from requirements to specs to code to object code is a requirement for high assurance systems. My methods, mostly borrowed from better researchers, are the easiest ones to use.
If I'm using some obscure tool that my distro doesn't package, that's when I mirror the version I'm using, and build my own RPM from source if it needs to be deployed to prod servers rather than merely run from rpmbuild.
* static libraries
* dynamic libraries
Provide compiled libs for the platforms of your choice. Preferably all three of Windows, OS X, and Linux. Users can issue pull requests if there is a platform or variant they wish to add.
Hah, because it's been defunct for a while now. Thanks for the reminder, removed it from my profile.
The common trend is that the systems constantly sync critical data, can detect downtime, and automatically (or manually) fail over when it occurs. There have been OSes and ISVs offering that capability, many proven in the field, going back decades. Certain high-tech companies just don't apply those for whatever reasons. Maybe their stacks just still don't have that feature.
EDIT: Here's more info: http://www.datacenterknowledge.com/archives/2014/09/15/faceb...
Like an above commenter mentioned, weird activity in electrical system can make some products go haywire and even corrupt data in unexpected ways. Of course, simulated takedowns and all appropriate measures for countering common issues should've already happened before a real one. Just to be extra clear there.
Relying on Github is not the problem, relying on Github to be available 24/7 is. Github provides a free master node for your eventually consistent database needs, where the database is git. The eventual portion is key here.
And yes, there have been concerns raised about what would happen if Github took a turn like Sourceforge, which usually get brought up when information about new shady practices at Sourceforge come up (or they get rehashed here).
The fact that everyone uses git more or less the same as svn is the problem. Git is decentralized, but because so many people rely on github most don't ever use the decentralized aspect to it.
Several commenters helpfully described how Git can easily prevent stuff like this and that project-level stuff is why this is a liability. That's good to know as it's already a selling point to management types for a solution like it. Can just ensure the problem doesn't show up in a local deployment by a wiser configuration.
Chaos Monkey fits when people build and deploy their services with the notion that any particular instance (or dependency) could fail at any given time. It's a tough road to evolve out of a legacy, monolithic stack without much redundancy baked in.
They have a focussed business with relatively little variation in how they make money - all their customers simply pay for a streaming service.
Most large companies, certainly banks anyway, have thousands of apps because there's also thousands of different parts of the business making money in their own unique ways that have their own unique needs.
What works for Netflix therefore can't work for other businesses, because the actual business is much more heterogeneous than that of Netflix, and the technology will reflect that whether it is organised in microservices or monolithically. That part is totally irrelevant.
The difference between theory and reality is precisely the reason Chaos Monkey and tools like it exist.
What you're essentially saying is that in theory, these systems have been designed to be resilient, but in reality, they may not be. If that's the case, then you'd better verify your resiliency, because being resilient in theory but not reality isn't going to help you when your service goes down.
It's a lot easier to promote that if it is thousands of servers doing something fairly mundane where, worst-case, it not working means a tiny, tiny proportion of your customers have to restart their video stream. So what?
But for a small heterogeneous business where what's happening has a much higher cash density, the practicalities of randomly killing things in production, and the risk that represents, rather get in the way. Even though in theory you should be able to kill anything in production with minimal impact, you are much less inclined to take that risk when the stakes are higher.
I mean, I've heard about things so wrong and ease it's like shooting fish in a bucket but... exploding fish in a datacenter? That's on another level.
It looks like it.
"The decentralized design does avoid single points of failures, and everyone does have a copy. "
So, like many decentralized systems I've used, a master node gets worked around by other nodes who communicate in another way? Or would some retarded situation be possible where...
"Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization."
...one node going down could prevent collaboration? Oh, you answered that. That sounds better than CVS but shit by distributed systems standards. I'll still learn it anyway since everyone is using it. Probably in next week or two.
It's not surprising at all that if you make a master repo at the root of the tree, and it goes down, then you can't communicate with it. But it doesn't prohibit any communication between other nodes, or re-wiring the tree, and it definitely doesn't inherently block development work on any of the other nodes.
It just so happens, though, that people's build scripts and package managers like to refresh packages from the root and don't handle failure modes of that operation very well. That's the only place problems emerge, besides the obvious fact that if your public releases of software go through the root, and the root is down, then you can't release until it's up. But you could easily make a new root if you wanted to.
That's the critical part. So, countering this risk is apparently a manual thing if one uses off-the-shelf tooling for Git. I'll just have to remember to look at that if I do a deployment. Put it on a checklist or something.
"There are only two hard things in Computer Science: cache invalidation and naming things."
-- Phil Karlton
Set up your build environment with whatever manual intervention is required so that it can run without downloading remote resources. Build as needed. There is no reason for, and many reasons against, downloading dependencies during the build process, but that doesn't necessitate duplicating those dependencies within your own source tree. As long as there are directions on how to download a specific, definitive version of the dependency, whether that is automated or not isn't really a big deal if it's done infrequently.
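One way to enforce that split is a pre-build check that refuses to start unless every dependency is already on disk. A minimal Python sketch, assuming dependencies are kept as archive files in a local cache directory; the function name and file layout are made up for illustration.

```python
from pathlib import Path


def missing_dependencies(required: list[str], cache_dir: Path) -> list[str]:
    """Return required archive names that are not already in the local cache."""
    return [name for name in required if not (cache_dir / name).is_file()]
```

If the returned list is non-empty, the build stops and someone runs the (possibly manual) setup step. The build itself never reaches for the network.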
Also, unfrozen dependencies mean you are at the mercy of any dependency change breaking your build at any time.
With that, even if your first build runs, fetches those deps, and succeeds at T1, it is not guaranteed at all that the build will work at T1+n.
There is a big difference between your team working from trunk and your team being dependent on other projects' trunks.
Now, if you're doing this with mission-critical software, you should probably be maintaining mirrors of those dependencies locally on infrastructure you control, but, again, that's another of the things that Git makes easy.
You should never be dependent on a reference that can move, unless you're willing to accept the consequences (that includes branches in any version control system, tags if you don't have infrastructure to verify that they haven't changed, external non-version-controlled downloads, etc.).
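A small Python check along those lines: flag any dependency whose reference is not a full commit ID. This is a sketch under the assumption that commit IDs are 40-character lowercase SHA-1 hex strings; the function names and the shape of the `deps` mapping are hypothetical.

```python
import re

# Assumption: SHA-1 object names, i.e. exactly 40 lowercase hex digits.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_pinned(ref: str) -> bool:
    """True only for a full 40-hex-digit commit ID, which cannot move."""
    return FULL_SHA1.fullmatch(ref) is not None


def unpinned(deps: dict[str, str]) -> list[str]:
    """Return names of dependencies whose ref could move (branch, tag, etc.)."""
    return sorted(name for name, ref in deps.items() if not is_pinned(ref))
```

For example, a mapping with one entry on `master`, one on a tag like `v1.2.0`, and one on a full commit ID would report the first two. Tags are flagged on purpose, since nothing stops upstream from re-pointing them.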
Basically, what you should learn here is that you shouldn't build your business around a third-party service's continued availability. Especially if it's a third-party service where you're not paying for an SLA, like Github. Reproducibility of builds is a different issue, and including 100% of your dependencies in your own source repository is not the only solution to it.
* Barring a SHA-1 collision, which is highly unlikely with Git.
Obviously you can run the first build. You wouldn't be using Github if you never got it working in the first place.
To clarify, setting up the build environment may require network access, but if the process of building requires it, there are many places where it can go wrong, both operationally and security wise.
> Also, unfrozen dependencies mean you are at the mercy of any dependency change breaking your build at any time. ...
I agree, but that's a separate discussion and doesn't really apply here. There's nothing preventing the pulling of a specifically tagged version for builds. If someone's build process that used Git for dependencies is not doing this, whether they are using Github or some internal server is irrelevant, the same problems apply.
As soon as Chaos Monkey causes a service interruption for, say, traders, it would get turned off and whoever had such a bright idea fired. But if it causes a service interruption for a tiny proportion of people watching streaming videos, no big deal.
Its proponents just ignore this practical reality and seem politically unaware.
git remote set-url --add --push origin git://original/repo.git
git remote set-url --add --push origin git://another/repo.git
Personally I try not to form strong opinions about things I haven't actually learned or understood yet.
The reality can be rather different[1][2][3].
1. http://money.cnn.com/2011/04/21/technology/amazon_server_out...
2. http://www.zdnet.com/article/amazon-web-services-suffers-out...
3. http://www.theregister.co.uk/2015/09/20/aws_database_outage/
Most developers I've seen reject even learning about networks or DNS or operating systems or databases. Such willful ignorance boggles the mind, but they are praised because their goals are shipping half-broken things as rapidly as possible to flip upwards for those oh-so-tasty acquihire payouts.
We even saw this week how overconsumption of convenience APIs can put entire companies in danger when those privately controlled convenience APIs just decide to shut down one day. Convenience of immediacy always seems to trump convenience of long term stability.
I will argue that this trend has always existed. I'm sure you can find an x86/68k/z80 developer complaining that developers are going "lalala we don't want to know how anything works! give us the C language and go away."*
I'm sure there are developers who couldn't imagine learning C without learning x86, and saw developers learning C without learning x86 as "willful ignorance".
Good abstractions will cause developers to simply gloss over how they work.
But if you are programming in C and notice that something goes wrong with the hardware (for example, an instruction does not do something that it is supposed to do), you will have to ask for help, since it is someone else's work that is faulty. Sounds reasonable?
Functional programming proves my point even more where they don't know how the hardware functions or even use the same model. Yet, with good compiler and language design, they can make robust, fast, and recently parallel programs staying totally within their model. Most problems we pick up outside the abstraction gaps can be fixed in the tooling or with interface checks.
So, I think the common perception of people doing crap code while working within an abstraction is unjustified and even disproven by good practices in that area. Much like I would be unjustified in accusing assembly coders of being "willfully ignorant" or working within foolish abstractions because they didn't know underlying microprogramming or RTL. They don't need it: just knowledge of how to effectively use the assembly. Actually, I saw one commenting so let me go try that real quickly. :)
One difference: C->x86 is a static translation layer. Other network/system things dynamically change out from under your "designed" system and alter threat/security/disaster/reliability/consistency models in a potentially unpredictable combinational fashion.
Saying "cloud abstraction" or "I trust this API and don't care how it works" is basically committing every https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu... and just saying "X can't break because we use provider Y who guarantees they can violate the laws of physics for us!"
If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred Assembly, neither understanding what their programming languages compile to, nor asking to see how their data is stored or transmitted in the true bits of the wire. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our APIs, in our proofs, and in our memory models and need not burden ourselves with confusing engineering detail that has no scientific value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.
I'm sure all the assembly programmers were complaining that the C programmers had no respect for "how anything works".
But, yeah, I hear you... Great movie as well. One of few that brings my favorite mad scientist into eye of mainstream audience as well. I doubt I must name him. :)
EDIT to add: I'm guessing you think the geeks were too sadistic to pass up the opportunity, eh?
No, it just seems weird, that's all. I don't see how either interpretation would benefit HP.
Not so much off-the-shelf tooling for Git, it's more off-the-shelf tooling for Node/Ruby/Go/Rust/PHP.
Nothing about Node's npm really requires it to depend on GitHub alone; in fact I think you can use any Git repo. It's just that most tend to use a single Git repo, and there is no way to configure mirrors.
"and there is no way to configure mirrors."
Is that in Git itself or the project-specific tooling you're mentioning?
Git (like most other DVCSs) supports mirroring. For example Linux, hosted on Github (https://github.com/torvalds/linux/commits/master), is also mirrored and hosted on kernel.org (https://git.kernel.org/cgit/). Or the Apache projects (https://github.com/apache/cassandra), which are also hosted on apache.org (https://git-wip-us.apache.org/repos/asf?p=cassandra.git). Generally when commits are merged with upstream, they are mirrored to all other hosts.
The tools, however, are generally configured with only the GitHub address (or the author of the tool only publishes to GitHub), and the tools (unlike say Perl's CPAN) don't offer to maintain mirrors of the libraries published. So when github is down, a tool like npm will give up, even though the author could have another git repo host elsewhere.
If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred RTL/Transistor language, neither understanding what their assembly languages and microprograms compile to, nor asking to see how their data is stored or transmitted in the true bits of the CPU's network-on-a-chip and memory plus analog values and circuitry many run through at interfaces. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our assemblers, our C compilers, our APIs, in our proofs, and in our memory models and ISA models and need not burden ourselves with confusing engineering detail that has no scientific value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.
Source: LISP, Forth, and Oberon communities who did hardware to microcode to language & OS all integrated & consistent. :P
It's the opposite: Linux is hosted on kernel.org, and the mirror on github.com is just something that was created during a kernel.org outage. The canonical address is the kernel.org one.
(The Linux repository on kernel.org, by the way, is one of the oldest git repositories; IIRC, it was created when git was only a few weeks old.)
Github is popular because it is opinionated--it chooses to use git in certain ways, thus reducing the complexity for people who aren't git experts (i.e. most people).
The most sophisticated users of git--the Linux and git projects, probably--do not rely on github at all. As far as I know, they share code via emailed patches. Some of those developers might not even be using git at all! They just send patches upstream, and the upstream developer checks the patches into their local git repo and then preps a larger patch to be emailed farther upstream.
I remember thinking in my early reading that git was like an assembly language for build systems. It really needed a front-end of some kind to smooth things over for new and casual users. Maybe not as heavyweight as Github but better than the main program. Can keep the low-level stuff in for advanced users.
Was that or is that still a common assessment or was my initial impression off?
https://git-scm.com/download/gui/linux
Github (which provides a desktop app in addition to their website) is by far the most successful one, I think because they define a whole simplified and social experience, not just a client.
To me it seems like people seem to segment into two camps: those who want to do the basics (they tend to use GitHub), and those who want to use the full power of git (they tend to use the CLI).