Is it because they are over the curve and don't make "any" changes to their system. As opposed to other companies, we are still maturing?
> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.
> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.
https://www.share.org/blog/mainframe-matters-how-mainframes-...
https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...
https://www.ibm.com/it-infrastructure/z/transaction-processi...
https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...
I suspect they sometimes 'fail open' (ie. allow all payments through and reconcile later) too.
When dispute is at play, it's a hot potato that no one wants to hold between the merchant, processor, ISO, sales agent & bank. The card networks have been smart to eliminate themselves from that step.
Stolen/lost cards can simply be flagged in a master db/table and can be rejected quickly for example.
For example:
* My credit card statement should have links to the merchant, the address, a list of the things I bought, a link to the returns process, etc.
* Why can't my statement also have the total number of calories I've purchased in the last month, or grams of carbon in fuel I've put in the truck?
* Why can't I use my mastercard to pay another mastercard user directly?
* Why hasn't mastercard produced a '2 factor' for card payments rather than forcing every bank to implement their own?
* Why can't I buy a dual Mastercard/Visa/Other card, which works with merchants who are picky and will only accept one or the other?
* Why are we still issuing bits of plastic in the digital age anyway?
* Why don't the cards have a microusb plug on one edge, or NFC to plug into a phone or computer to log in, to act as an identity card, to authenticate or make payments, or anything else other companies issue smartcards for?
* Why don't mastercard work with mobile providers to issue cards that you can spend your pay-as-you-go balance with, turning a mobile provider into a bank.
It seems mastercards business is 'stuck', and there are opportunities to innovate all around them, but they won't.
There's a 20 minute gap between investigation and "rollback". Why did they rollback if the service was back to normal? How can they decide, and document the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly all variables were not considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then, you need to analyze, understand, and document the root cause. Rolling-back was a poor decision, imo.
How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).
That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
Damn what a mess. Sounds like y'all are rolling out way to many changes too quickly with little to no time for integration testing.
It's a somewhat amateur move to assume you can just arbitrarily rollback without consequence, without testing etc.
One solution I don't see mentioned, don't upgrade to minor versions ever. And create a dependency matrix so if you do rollback, you rollback all the other things that depend on the thing you're rolling back as well.
Doing a large rollback based on a hunch seems like an overreaction.
It's totally normal for engineers to commit these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation is in place to reduce operator errors.
Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.
Of course, I wasn't there so I could be completely off.
idk the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. it can be very costly.
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
I hope that lessons are learned from this operational event, and invest towards building metrics and tooling that allows you to, first of all, prevent issues, and second, shorten the outage/mitigation times in the future.
I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.
That seems like a lot of change in a week, or does deploys mean something else like customer websites being deployed?
From the description/comment it also sounds like the database operates directly on files rather than file leases as there's no notion of a separate local - cluster-scoped - byte-level replication layer below it. Harder to shoot a stateful node.. And sounds like it's tricky to externally cross-check various rates, i.e. monitor replication RPCs and notice that certain nodes are stepping away from the expected numbers without depending on the health of the nodes themselves.
Hopefully the database doesn't also mix geo-replication for local access requirements / sovereignty in among the same mechanisms too.. rather than separating out into some aggregation layers above purely cluster-scoped zones!
Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)
Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:
> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible
[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE
Mitigation is your top-priority. Bringing the system back to a good shape.
If there needs to be follow-up actions, take the less-impactful steps to prevent another wave.
If there was a deployment, roll-back.
My concern here is, a deployment have been made months ago, and many other changes that could make things worse were introduced. This is the case. The difference between taking an extra 10-20 minutes to make sure everything is fine, versus taking a hot call and causing another outage makes a big difference.
I'm just asking questions based on the documentation provided; I do not have more insights.
I am happy Stripe is being open about the issue, that way many the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.
However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.
The full report is here if you're curious: https://www.sec.gov/litigation/admin/2013/34-70694.pdf
If you have a good CM system, you should have a timeline of changes that you can correlate against incidents. Most incidents are caused by changes, so you can narrow down most incidents to a handful of changes.
Then the question is, if you have a handful of changes that you could roll back, and rollbacks are risk free, then does it make sense to delay rolling back any particular change until the root cause is understood?
This was a focus in our after-action review. The nodes responded as healthy to active checks, while silently dropping updates on their replication lag, together this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."
could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?
The reason I ask is that I heard of ATM's still running windows XP or stuff like that. but if it's not networked could it be that that actually has a bigger uptime than anything you can do on windows 7 or 10?
what I mean is even though it is hilariously out of date to be using windows xp, still, by any measure it's had a billion device-days to expose its failure modes.
when you upgrade to the latest minor version of databases, don't you sacrifice the known bad for an unknown good?
excuse my ignorance on this subject.
I kinda lost count of how many times Nagios barfed itself and reported an error while the application was fine
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
The article said that the node stalled in a way that was unforseen which may have caused standard recovery mechanisms to silently fail.
What's fun for a software person is that there's a lot of interesting digressions and stuff to learn in the Cloudflare one. The whole explanation of the regexp at the end is something that no one cares about from the business side, but is an interesting read in and of itself.
It's worth noting that yours came out a bit more than a week faster than theirs, which jgrahamc clearly spent a lot of time writing. No idea if anyone cares about the speed with which these things are released...
The first question is who is this written for: It lacks the detail I would write for the incident review meeting audience, while lacking a simpler story for the non technical. As it is at the time I read it, I don't think it aims any audience very well.
I understand that the level of detail of the internal report might be excessive for the internal report, but if technical readers are the target, some more details would have helped. For example the monitoring details that Will described in another thread are a key missing detail that, if anything, would make Stripe look better, as problems like that happen all the time. I bet there are more details that are equally useful that would be in an internal report that would not reveal delicate information. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year, and I could handwave a year worth of changes. Since this part of the Stripe infrastructure is relatively unique, it's difficult to understand from the outside, and looks as if it doesn't have enough information.
In particular, the remediations say very little that is understandable from the outside: Most of the text could apply to pretty much any incident on a storage of queuing subsystem I was ever a part of: More alerts, an extra chart in an ever growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details, it says very little.
I understand why you might not want to divulge that level of detail though. If we want fewer details, then the article could cut all kinds of low-information sections, and instead focus more on the response, and the things that will be changed in the future. The most interesting bit about this is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, And why the actions that ultimately led to the second incident were made would be enlightening, and would humanize the piece.
Thank you for not just providing a public root cause analysis, but coming here to discuss it in HN.
If anyone on HN knows anyone who has the sort of interesting life story where they both know what can cause a cluster election to fail and like writing about that sort of thing, we would eagerly like to make their acquaintance.
The remediation part is quite cautious/generic but overall it seems like a good faith effort by someone constrained by corporate rules.
Also, does your company's engineering decisions change based on other companies' post-mortems?
I don't understand why people demand the usage of incorrect language.
On the other hand, if for a significant number of users the site was completely unusable for some period of time, then I think it's fair to use the word "outage". (Even if it's not a complete outage affecting all users.)
I don't know whether other people would interpret these terms the same way I do, nor do I think there's enough information in this blog post to determine for sure which label is more accurate for this particular incident. So personally, I'm not going to be too picky about the wording.
The fact that you needed to qualify “outage” with “complete” clearly means the word on its own is not incorrect for cases where a system was “only” mostly unavailable rather than completely so.
> I don't understand why people demand the usage of incorrect language.
The irony.
> Stripe splits data by kind into different database clusters and by quantity into different shards.
So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
The following incident comes across as reckless and avoidable as there should have been procedures to safely test the rollback (and perhaps there were, but a perfect storm allowed it fail in prod). Lacking details about how the second incident came to be or how they will be prevented going forward places the second incident as 'not fine'.
This information is what the GP comment is asking for.
Compare this PM with Cloudflare's PM, where they detail how they tested rules, what safeguards were in place, how the incident came to be, and how they intend to prevent similar incidents; the impression given here is that they will put up more fire alarms and fire extinguishers but do little fire prevention.
It doesn't shard nicely. Failovers have rather nasty semantics that can cause nasty bugs in client side code. Performance cliffs abound.
If your datastore is anything over 1TB, I'd be using postgres, or if you can manage it something bigtable-like.
On the contrary, I developed early merchant and payment gateway tech, and they absolutely do. The scenario you describe is extraordinarily rare, which allows an arbitrage between CAP perfection and customer satisfaction.
On a separate note, at any given time, some parts of our national payments ecosystem are “down”. There are enough players involved you have an appearance of resilience.
You can see this in a mall, when one store’s card swipe terminals are down, and another’s are not, and almost never happens that all the stores are down at the same time.
You can think of all these other players as an incidental circuit breaker pattern upstream of Visa.
VisaNet itself is surprisingly unscaled, capable of only about 24,000 transactions per second. Twenty years ago, our gateway would hit 15,000 transactions per second real world use. To do that, we scattered/gathered across many independent paths into card networks and various merchant banks.
https://usa.visa.com/content/dam/VCOM/download/corporate/med...
https://www.capgemini.com/wp-content/uploads/2017/07/Domesti...
There are of course limits on how big an offline transaction you can take intentionally or unintentionally and probably the airline wears the full cost of failed transactions in this case.
Doesn't matter that much when its for a $5 coffee, plus they know who you were if they really wanted to chase it down.
And as mentioned electronic terminals absolutely have automatic offline modes also.
Shop owners would even get a reward for snipping a bad credit user’s card in half (something that survives to this day only as a meme).
Not change.
Make every bit of software in your stack export as a monitoring metric it's build date. Have an alert if any bit of software goes over 1 month old. Manually or automatically re-build and redeploy that software.
That prevents 'bit rot' meaning you daren't reduild or rollback something.
This is a valid question.
As a database and security expert, I carefully weigh database changes. However, developers and security zealots typically charge ahead "because compliance."
Email me if you need help with that.
But customers want new features, so Stripe does changes.
Well I mean they're not exactly on the Internet with an IP address and no firewall, are they? (Or they would have been compromised already.)
Whatever it is, it must be separated off as an "insecure enclave".
So that's why I'm wondering about this technique. You don't just miss out on security updates, you miss performance and architecture improvements, too, if you stop upgrading.
But can that be the path toward 100% uptime? Known bad and out of date configurations, carefully maintained in a brittle known state?
Phones die.
If you don’t care, I suggest you look into Apple Pay or something similar. You’ll find many merchants that you won’t be able to pay.
There are certainly changes that cannot be rolled back such that the affected users are magically fixed, which is not what I am suggesting. In the context of mission critical systems, mitigation is usually strongly preferred. For example, the Google SRE book says the following:
> Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!
> Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. [...] The highest priority is to resolve the issue at hand quickly.”
I have seen too many incidents (one in the last 2 days in fact) that were prolonged because people dismissed blindly rolling back changes, merely because they thought the changes were not the root cause.
OK, but then what if it's new data being stored in real time, so there isn't any previous backup with the data in the intended form? In this case, we're talking about Stripe, which is presumably processing a high volume of financial transactions even in just a few minutes. Obviously there is no good option if your choice is between preventing some or all of your new transactions or losing data about some of your previous transactions, but it doesn't seem unreasonable to do at least some cursory checking about whether you're about to cause the latter effect before you roll back.
I'm not saying that the decision won't be to do the rollback much of the time. I'm just saying it's unlikely to be entirely without risk and so there is a decision to be considered. Rolling back on autopilot is probably a bad idea no matter how good a change management process you might use, unless perhaps we're talking about some sort of automatic mechanism that could do so almost immediately, before there was enough time for significant amounts of data to be accumulated and then potentially lost by the rollback.
So the legit question is, can insecure systems (e.g. ancient mainframes) be wrapped by a security layer (WAF, etc.) to get better uptime than patching an exposed system?
Rollbacks should always be safe. They should always be automatically tested. So a software release should do a gradual rollout (ie. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check a rollback still works.
The rollout should fail if health checks (including checking business metrics like conversion rates) on the new release or old release fails.
If only the new release fails, a rollback should be initiated automatically.
If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.
For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.
But then if you're monitoring your production system, how do you make that decision about the performance of the new version being acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And you can only make a single change, which can therefore be rolled back to exactly the previous version without any confounding factors, during that whole time.
And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.