GitHub availability report: October 2022 (github.blog)
There isn't a month that goes by without our devs being impacted.
GitHub - please just work on fixing this. Your product is great but your availability is your biggest problem. It's beyond a joke at this point.
This is why it makes no sense going 'all in' on GitHub services.
Have you ever tried using GitLab?
+1 -- same experience here for a medium size (60) eng team.
Interesting - we have this kind of thing quite often. Basically, an event is stuck in the queue due to a logic error or a prior race condition, and it's endlessly retried blocking the rest of the events from being processed. We can't just automatically remove such an event from the queue because events must be processed in order or client data can get corrupted. It requires manual intervention (we have alerts in place), and every time it's a new event so we have to be creative and think quickly - how to unblock the queue without corrupting client data by skipping events. After an event is unstuck, there's a huge queue of unprocessed events which can take up to a few hours to be emptied in worst cases. Fortunately we have some sharding in place so there can be several independent workers processing the same global queue - with workers' shard affinity we can process shard data in order AND in parallel, so SRE can temporarily increase the number of workers when the queue gets too large, to speed it up. I still don't know how to solve this kind of problem once and for all (i.e. to have zero manual intervention). Is it even solvable?
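For anyone trying to picture the shard affinity part, here is a minimal sketch. It assumes each event carries a `client` field and that the shard count stays fixed while workers are assigned to shards. The names `shard_for` and `partition` are made up for illustration and are not from any real system described above.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Stable hash, so a given key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def partition(events, num_shards):
    """Split one global ordered queue into per-shard queues.

    Events are appended in arrival order, so the relative order of
    events for any single client is preserved inside its shard.
    """
    shards = defaultdict(list)
    for event in events:
        shards[shard_for(event["client"], num_shards)].append(event)
    return shards
```

Each shard can then be drained by its own worker. Ordering only holds within a shard, which is why a stuck event blocks its own shard but not the others.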
I don't know much about your application, but the fact that you can mitigate the problem by scaling the number of workers suggests that the ordering requirements might actually be fairly weak. In the worst case you may be able to push the failing event, and every event that depends on it, to a DLQ (dead-letter queue) using a temporary blacklisting mechanism, but by that stage I think I would just prefer better testing.
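A rough sketch of that DLQ idea, under the assumption that "interdependent" can be approximated by events sharing a `key` field (e.g. a client id). The function `drain`, the `key` field, and the retry count are all hypothetical, chosen only to illustrate the approach.

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process events in order; park a failing event and every later
    event with the same key, so unrelated keys keep flowing."""
    blocked = set()      # keys that are temporarily blacklisted
    dead_letters = []    # parked events, kept in their original order
    processed = []
    pending = deque(queue)
    while pending:
        event = pending.popleft()
        key = event["key"]
        if key in blocked:
            # An earlier event for this key failed: do not skip ahead.
            dead_letters.append(event)
            continue
        for _ in range(max_attempts):
            try:
                handler(event)
                processed.append(event)
                break
            except Exception:
                pass  # retry until attempts are exhausted
        else:
            # All attempts failed: blacklist the key and park the event.
            blocked.add(key)
            dead_letters.append(event)
    return processed, dead_letters
```

Once the bug is fixed, the parked events can be replayed in the same order and the key removed from the blacklist. Whether this is safe depends entirely on events with different keys really being independent.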
- improper database validation
- older component not tested against configuration change
- uncontrolled automation DOS
- incompletely distributed secrets
How far must the sunk cost fallacy go before something is done?
Meaning that at some point the extra programmer difficulty is worth it if your everyday web stack can't keep up.
Though I'd personally do it in Elixir but again, speed. GitHub is huge and should rise up to the challenge.
I've seen good and productive Rails teams but they had to deliberately stop themselves from certain practices, otherwise they ran into problems. Long topic though, and people get very emotional and preachy defending Rails so it's a fruitless discussion 99.9% of the time.
In the end, use what you feel works best for you and your team. Objective differences in programmer productivity, machine speed, iteration speed and other metrics do exist though, and it's very tiring to see people constantly pretend otherwise.
My point is that if the stack regularly falls over, then programmer convenience has to be sacrificed in favor of a stable and mega-fast alternative that requires more programmer energy.
I love working with dynamic languages. I can prototype almost anything that I want to do, in hours. But I also recognized the need for a hardcore stack for a previous contract and went the long and painful route with Rust.
Result: the project has been running for 7 months now, has only been restarted 4 times to update it (re-deployment), has never crashed once, and handles 5000+ network connections and streams data from them 24/7.
Peak CPU usage on a 4-core VPS: 27%.
Peak memory usage: 180MB. Normal average memory usage: 80MB.
Right tool for the job.
Obviously I can't know for sure but it's not an uninformed assumption.
If you do this, you will realize that none are close to what you describe.
Also, have you considered that if you had weekly outages while billion-dollar companies continued to stick with Rails, maybe you were the problem?
And even if zero of their incidents alluded to performance problems with Rails I still worked a lot with it and I know for a fact that it's a factor.
Your snark doesn't change reality but you are free to pretend otherwise, fine with me.
> Also, have you considered that if you had weekly outages while billion-dollar companies continued to stick with Rails, maybe you were the problem?
Indeed, a programmer not having the executive power to change the deployment tech and server (Puma at the time) is surely me being the problem. Especially after he made a study demonstrating the problems, calculated how much programmer time was wasted on these matters every week, and still got ignored. Perhaps I am the problem indeed!
That's a capacity problem caused by a logic bug. Nothing stack specific. If you throw more work at a system than it is designed to handle, you'll hit a bottleneck.
> Your snark doesn't change reality
What reality? You are just barking your uneducated opinion. No one who ever worked on a service anywhere close to the scale of GitHub (regardless of the stack) would make such statements.
I bet you I could cause this bug on a Rust product if you let me near the code ;)
In practice however, I found people working with certain languages and stacks to be more thorough. Still largely depends on the person in the important position though, that much is always true.