GitHub availability report: October 2022 (github.blog)

49 points by edmorley 3 years ago | 39 comments

Of all the many SaaS vendors I use, GitHub has the worst availability by far.

There isn't a month that goes by without our devs being impacted.

GitHub - please just work on fixing this. Your product is great but your availability is your biggest problem. It's beyond a joke at this point.

handsclean 3 years ago

GitHub’s reputation for reliability really did a 180 after the Microsoft acquisition, as some predicted. It’s strange to me that despite this, you still get vociferous argument every step of the way to blaming Microsoft. People don’t remember GitHub’s past reputation for excellent reliability, then they accuse you of rose-tinted glasses, then they say we’re just noticing it more now, then they say GitHub’s complexity significantly changed at a time that just happened to coincide with the acquisition, then they say better reliability is impossible. No, man, Microsoft acquired it, and when they got around to transitioning it to their infra, reliability plummeted.

rvz 3 years ago

I know. They have been very unreliable for years, as I predicted here [0], and you can see all the times it went down or had intermittent issues [1]. I'm not really surprised to see GitHub become less reliable than someone self-hosting a typical Git server.

This is why it makes no sense going 'all in' on GitHub services.

[0] https://news.ycombinator.com/item?id=22867803

[1] https://news.ycombinator.com/item?id=32752965

smcleod 3 years ago

I absolutely second this. Their APIs are incredibly flaky, especially with Actions, and it's incredibly annoying that they don't report on their _real_ availability. I can't remember the last time their customer-facing monitoring showed any related service degradation during an outage.

lol768 3 years ago

> Of all the many SaaS vendors I use, GitHub has the worst availability by far.

Have you ever tried using GitLab?

rozenmd 3 years ago

Thankfully the web app was relatively stable last month compared to September: https://github.onlineornot.com/

group_love 3 years ago

> There isn't a month that goes by without our devs being impacted.

+1, same experience here for a medium-sized (60) eng team.

kgeist 3 years ago

> Attempting to retry these failed jobs tied up our worker and it was unable to process new incoming events, resulting in a severe backlog in our queues.

Interesting - we have this kind of thing quite often. Basically, an event gets stuck in the queue due to a logic error or a prior race condition, and it's endlessly retried, blocking the rest of the events from being processed. We can't just automatically remove such an event from the queue because events must be processed in order or client data can get corrupted. It requires manual intervention (we have alerts in place), and every time it's a new event, so we have to be creative and think quickly about how to unblock the queue without corrupting client data by skipping events. After an event is unstuck, there's a huge queue of unprocessed events which can take up to a few hours to empty in the worst cases. Fortunately we have some sharding in place, so there can be several independent workers processing the same global queue. With workers' shard affinity we can process shard data in order AND in parallel, so SRE can temporarily increase the number of workers when the queue gets too large, to speed it up. I still don't know how to solve this kind of problem once and for all (i.e. to have zero manual intervention). Is it even solvable?
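To make the shard-affinity idea concrete, here is a minimal sketch. It assumes each event carries a client ID and that ordering only matters per client. The names `shard_of` and `partition_by_shard` and the event tuple layout are hypothetical, not taken from the system described above.

```python
import zlib
from collections import defaultdict


def shard_of(client_id: str, num_shards: int) -> int:
    """Map a client to a fixed shard (assumption: ordering matters per client only)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(client_id.encode("utf-8")) % num_shards


def partition_by_shard(events, num_shards):
    """Split one global queue into per-shard lanes, keeping arrival order in each lane.

    `events` is an iterable of (sequence_number, client_id, payload) tuples.
    Each lane can be handed to its own worker: order holds within a lane,
    and separate lanes can be drained in parallel.
    """
    lanes = defaultdict(list)
    for event in events:
        _, client_id, _ = event
        lanes[shard_of(client_id, num_shards)].append(event)
    return dict(lanes)
```

A stuck event then blocks only its own lane rather than the whole queue, which matches the mitigation described above but does not remove the need to deal with the stuck event itself.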

throwawaythekey 3 years ago

It sounds like you've identified the issue yourself. You are relying on ordering when processing events. You need to either loosen that requirement or do better testing to prevent head of line blocking.

I don't know much about your application but the fact that you can mitigate the problem by scaling the number of workers suggests that the order requirements might actually be fairly weak. As a worst case outcome you may be able to push all events interdependent to the one with an error to a DLQ using a temporary blacklisting mechanism, but by that stage I think I would just prefer better testing.
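A rough sketch of that DLQ idea, assuming "interdependent" means "shares the same key" (for example a client ID). The function `drain`, the `handler` callback and the in-memory lists are hypothetical stand-ins for a real queue and dead-letter queue.

```python
def drain(events, handler):
    """Process (key, payload) events in order, quarantining keys that fail.

    When an event fails, its key is temporarily blacklisted: that event and
    every later event with the same key go to the dead-letter list, in their
    original order, so they can be replayed once the bug is fixed.
    Events for other keys keep flowing.
    """
    blocked_keys = set()
    processed = []
    dead_letters = []
    for key, payload in events:
        if key in blocked_keys:
            # Preserve per-key ordering: never run an event past a failed predecessor.
            dead_letters.append((key, payload))
            continue
        try:
            handler(key, payload)
        except Exception:
            blocked_keys.add(key)
            dead_letters.append((key, payload))
        else:
            processed.append((key, payload))
    return processed, dead_letters
```

This trades head-of-line blocking for a growing per-key backlog, so it only helps if events for different keys really are independent.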

simonpantzare 3 years ago

Sounds like a DAG-based task orchestrator could be a good fit, where tasks state their dependencies and are allowed to run only once those dependencies have all completed.
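As a small illustration of the scheduling rule, here is a sketch using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The `run_dag` helper and the task names in the usage note are hypothetical; a real orchestrator would also need retries, persistence and parallel execution.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, run_task):
    """Run each task only after all of its declared dependencies have completed.

    `dependencies` maps a task name to the set of tasks it depends on.
    Returns the order in which tasks were run.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # get_ready() only yields tasks whose dependencies are all marked done.
        for task in sorter.get_ready():
            run_task(task)
            order.append(task)
            sorter.done(task)
    return order
```

For example, `run_dag({"deploy": {"build", "test"}, "test": {"build"}, "build": set()}, print)` runs `build`, then `test`, then `deploy`.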

chris24680 3 years ago

It's amazing how many people here are able to precisely diagnose what GitHub should 'just' do, especially without access to their code base or experience with working at their scale.

brink 3 years ago

I'm not sure about the technical details behind their outages since they're a little vague on that, but it's funny how every Rails developer holds up GitHub, a Ruby on Rails shop, as the reason Rails should live on, when its availability is some of the worst in the tech scene. Lazy evaluation is great until it's not.

tenderlove 3 years ago

You're not sure about the technical details, but it's clearly the technology's fault?

cheshire137 3 years ago

What, do you know anything about Rails?

pizza234 3 years ago

All the incidents seem to be platform-agnostic:

- improper database validation

- older component not tested against configuration change

- uncontrolled automation DoS

- incompletely distributed secrets

o_m 3 years ago

Isn't it GitHub Actions that has been having the availability problems for a while now? It started having problems after Microsoft acquired GitHub, and there has been speculation that it is because it was ported to .NET and Azure. Is Codespaces built with Rails?

jabart 3 years ago

It tends to have a lot of abuse where people try to run crypto miners on free GitHub Actions accounts. Azure has been more stable the past few years and .NET wouldn't have any issues with stability at scale. Likely just a hard problem to solve at the scale they run at.

c2h5oh 3 years ago

IDK what the reason is; all I know is it's been down so much we've blacklisted it for anything of importance and migrated projects already using it to something else.

JoyrexJ9 3 years ago

Trust me, nothing has been ported to .NET; nothing would be gained from such a move. That's not how Microsoft works. Source: I work at Microsoft

erk__ 3 years ago

Actions was not a thing until after GitHub was acquired, but idk if they used something else in the early days of Actions.

pdimitar 3 years ago

Maybe start moving away from Ruby on Rails. Good web stacks in Golang and Rust do exist and at that scale they're likely the only sensible choices.

How far must the sunk cost fallacy go before something is done?

speedgoose 3 years ago

Which Golang or Rust web stacks are as good as Rails?

pdimitar 3 years ago

I'm not saying they are as good -- I'm saying that they are good.

Meaning that at some point the extra programmer difficulty is worth it if your everyday web stack can't keep up.

darksaints 3 years ago

Rails is a huge problem, but the most mature libraries/frameworks in Go/Rust are all micro-frameworks, which isn't much of a replacement. Maybe some .NET frameworks would be a better choice.

pdimitar 3 years ago

Well I mean if stability plus performance are the main requirements then I'd disqualify everything except Rust.

Though I'd personally do it in Elixir but again, speed. GitHub is huge and should rise up to the challenge.

hit8run 3 years ago

It’s quite the opposite: Rails is a big enabler.

hit8run 3 years ago

Because you like Go or Rust better? Makes sense… muhahaha

pdimitar 3 years ago

I don't like either very much. But I've worked with them extensively and they are a better fit for when you want to squeeze more resources and more stability (the latter depends on certain details but it's certainly easier to achieve compared to Rails).