Root cause analysis: significantly elevated error rates on 2019‑07‑10 (stripe.com)

203 points by gr2020 6 years ago | 108 comments

vjagrawal1984 6 years ago |

In the face of so many outages from big companies, I wonder how Visa/MasterCard is so resilient.

Is it because they are ahead of the curve and don't make "any" changes to their system, whereas other companies are still maturing?

wallflower 6 years ago | |

Mainframes.

> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.

> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.

https://www.share.org/blog/mainframe-matters-how-mainframes-...

https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...

https://www.ibm.com/it-infrastructure/servers/mainframes

andrewg 6 years ago | | |

Specifically they run IBM zTPF on their mainframes, which is also used by airlines. Some installations have uptimes measured in decades.

https://www.ibm.com/it-infrastructure/z/transaction-processi...

londons_explore 6 years ago | |

Both have had plenty of downtime:

https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...

I suspect they sometimes 'fail open' (i.e., allow all payments through and reconcile later) too.

segmondy 6 years ago | | |

No they don't. Say I sell a diamond ring for $20k and Visa reports the card as valid when it's not: the buyer just got a free $20k ring. The card could be expired, cancelled, or not have enough balance. The merchant must be paid, their processor has to pay them, and the bank that issued the card must provide that credit until the cardholder pays it back. If the card was expired or had a $10 balance, the cardholder will refuse, and it gets really messy fast. Visa is not willing to assume such risk; they simply provide a network. If it goes down, it goes down and everyone on their network is screwed.

When a dispute is in play, it's a hot potato that no one wants to hold among the merchant, processor, ISO, sales agent & bank. The card networks have been smart to eliminate themselves from that step.

jasonjei 6 years ago | |

That’s a great point. In spite of technical changes such as Apple Pay/Android Pay, chip cards, and so on, I can't recall an instance when I was unable to use a credit card globally. It seems most failures in running a credit card are pretty localized, too, and never at the interchange level...

raverbashing 6 years ago | | |

I suspect there's a lot of caching involved as well. When making a purchase you probably don't need all the info to go all the way to the bank and back.

Stolen/lost cards can simply be flagged in a master db/table and can be rejected quickly for example.
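A rough sketch of that idea, purely as an illustration: a flagged card is rejected from a local table and never reaches the issuer. Everything here (`FLAGGED_CARDS`, `authorize`, `fake_issuer`, the card numbers) is made up for the example and says nothing about how Visa or Mastercard actually work.

```python
# Hypothetical local deny-list; the card numbers are invented for this example.
FLAGGED_CARDS = {"4000-0000-0000-0002"}


def authorize(card_number, amount_cents, ask_issuer):
    """Reject flagged cards locally; otherwise defer to the issuer callback."""
    if card_number in FLAGGED_CARDS:
        return "declined_locally"  # no round trip to the issuing bank
    return ask_issuer(card_number, amount_cents)


issuer_calls = []  # records which cards actually reached the (fake) issuer


def fake_issuer(card_number, amount_cents):
    """Stand-in for the issuing bank; approves everything it is asked about."""
    issuer_calls.append(card_number)
    return "approved"


stolen = authorize("4000-0000-0000-0002", 1500, fake_issuer)
valid = authorize("4000-0000-0000-0001", 1500, fake_issuer)
```

Only the second call reaches `fake_issuer`; the flagged card is turned away by the local lookup.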

Thaxll 6 years ago | |

They're also much simpler, and the systems behind payment solutions haven't changed that much in the last 10 years.

londons_explore 6 years ago | | |

They are also miles behind on features customers want...

For example:

* My credit card statement should have links to the merchant, the address, a list of the things I bought, a link to the returns process, etc.

* Why can't my statement also have the total number of calories I've purchased in the last month, or grams of carbon in fuel I've put in the truck?

* Why can't I use my Mastercard to pay another Mastercard user directly?

* Why hasn't Mastercard produced a '2 factor' for card payments rather than forcing every bank to implement their own?

* Why can't I buy a dual Mastercard/Visa/Other card, which works with merchants who are picky and will only accept one or the other?

* Why are we still issuing bits of plastic in the digital age anyway?

* Why don't the cards have a microusb plug on one edge, or NFC to plug into a phone or computer to log in, to act as an identity card, to authenticate or make payments, or anything else other companies issue smartcards for?

* Why doesn't Mastercard work with mobile providers to issue cards that you can spend your pay-as-you-go balance with, turning a mobile provider into a bank?

It seems Mastercard's business is 'stuck', and there are opportunities to innovate all around them, but they won't.

segmondy 6 years ago | |

They are not, they go down quite often. lol.

ssalazars 6 years ago |

> [2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.

> [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

There's a 29-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How can they decide on and document the change within 29 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then, you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.

laCour 6 years ago |

"[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons."

How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.

zby 6 years ago |

So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case. I was expecting something like the https://en.wikipedia.org/wiki/5_Whys method, with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ but alas, no: it was too shallow for that.

zbentley 6 years ago | |

It is likely that this RCA was shallow because it was intended for everyone--including non-technical users, who (at least in my experience) tend to misinterpret or get confused by deep technical or systemic failure analysis.

It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).

throwawaydba 6 years ago | | |

From reading the RCA, this should be the trinity of mysql + orchestrator + vitess. If stripe can't get it right, there is no chance for the others.

gr2020 6 years ago |

Anybody know what database they’re using?

conroy 6 years ago | |

MongoDB is the primary data store used at Stripe.

a13n 6 years ago | | |

Really speaks volumes about how mature MongoDB has become considering how solid Stripe's reliability is.

segmondy 6 years ago |

As I mentioned earlier, "human error often, configuration changes often, new changes often." https://news.ycombinator.com/item?id=20406116

chance_state 6 years ago |

This reads like the marketing/PR teams wrote much of it. Compare to the Cloudflare post-mortem from today: https://blog.cloudflare.com/details-of-the-cloudflare-outage...

mual 6 years ago |

Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.

jacquesm 6 years ago |

Why don't they call 'significantly elevated error rates' an 'outage' instead?

dps 6 years ago | |

(Stripe CTO here)

That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
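To illustrate the double-charge risk described above, here is a minimal sketch of how an idempotency key can make a retry safe. It is a toy in-memory model under stated assumptions, not Stripe's implementation or API: `ToyPaymentServer`, `charge`, and the key strings are hypothetical names invented for the example.

```python
import uuid


class ToyPaymentServer:
    """Hypothetical in-memory server; not Stripe's actual API or behavior."""

    def __init__(self):
        self.charges = []  # every charge that actually "moved money"
        self._seen = {}    # idempotency key -> stored response

    def charge(self, order_id, amount_cents, idempotency_key=None):
        # A repeated key replays the stored response instead of charging again.
        if idempotency_key is not None and idempotency_key in self._seen:
            return self._seen[idempotency_key]
        response = {
            "charge_id": str(uuid.uuid4()),
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.charges.append(response)
        if idempotency_key is not None:
            self._seen[idempotency_key] = response
        return response


server = ToyPaymentServer()

# Without a key, a blind retry after an ambiguous failure charges twice.
server.charge("order-1", 2000)
server.charge("order-1", 2000)

# With a key derived from the order, the retry creates no second charge.
first = server.charge("order-2", 5000, idempotency_key="order-2-charge")
retry = server.charge("order-2", 5000, idempotency_key="order-2-charge")

double_charged = sum(c["order_id"] == "order-1" for c in server.charges)
charged_once = sum(c["order_id"] == "order-2" for c in server.charges)
```

In this sketch `order-1` ends up with two charges while `order-2` has one, which is the difference between re-charging everything blindly and retrying with a stable key per order.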

luminati 6 years ago |

Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.

debt 6 years ago |

"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."

Damn what a mess. Sounds like y'all are rolling out way too many changes too quickly with little to no time for integration testing.

It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing etc.

One solution I don't see mentioned: don't upgrade to minor versions, ever. And create a dependency matrix so that if you do roll back, you also roll back all the other things that depend on the thing you're rolling back.
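A minimal sketch of what such a dependency matrix could look like, assuming it is kept as a simple mapping from each component to the components that depend on it. The component names and the `rollback_set` helper are hypothetical and only loosely inspired by the RCA's wording; they do not describe Stripe's real systems.

```python
from collections import deque

# Hypothetical dependency matrix: component -> components that depend on it.
DEPENDENTS = {
    "election-protocol": ["failover-config"],
    "failover-config": ["shard-router"],
    "shard-router": [],
    "metrics-agent": [],
}


def rollback_set(component, dependents):
    """Return the component plus everything that transitively depends on it,
    in breadth-first order."""
    seen = [component]
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


# Rolling back the election protocol pulls in everything downstream of it.
plan = rollback_set("election-protocol", DEPENDENTS)
```

Here `plan` covers the election protocol, the config that depends on it, and whatever depends on that config, while the unrelated `metrics-agent` is left alone.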

cetico 6 years ago | |

Yes this was very surprising. The system was working fine after the cluster restart. There was no need for an emergency rollback.

Doing a large rollback based on a hunch seems like an overreaction.

It's totally normal for engineers to commit these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation are in place to reduce operator errors.

Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.

Of course, I wasn't there so I could be completely off.

debt 6 years ago | | |

"That's fine."

idk the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. it can be very costly.

EugeneOZ 6 years ago | |

Not sure why this is downvoted, but it all really looks like untested deployments to production servers.

dang 6 years ago | | |

Possibly downvoted because of the name-calling ('what a mess', 'amateur move'), which degrades discussion and is against the site guidelines. It's also sort of distasteful to pile on like that.

https://news.ycombinator.com/newsguidelines.html