DoorDash manages high-availability CockroachDB clusters at scale (cockroachlabs.com)

52 points by orangechairs 2 years ago | 61 comments

gizmo 2 years ago |

DoorDash has about 35 million users, and there is zero interaction between users. The median user uses DoorDash maybe once a week. So 5 million sessions a day, all happening in the same 3-hour window. That's roughly 2 million sessions per hour at peak times.

How does DoorDash get to 1.2 million queries per second? 1.2M QPS * roughly 10,000 seconds in 3 hours = 12 billion queries to process 5 million orders? That's wild. Is it all analytics? This is highly suspect. 35m users isn't nothing, but it isn't exactly Facebook scale either.
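
As a rough sanity check of that arithmetic, here is a minimal Python sketch. The 3-hour window and the 5 million orders per day are this comment's estimates, not DoorDash figures, and the calculation assumes the peak rate is sustained for the whole window, which almost certainly overstates the total.

```python
# Back-of-the-envelope check of the figures in this comment.
PEAK_QPS = 1_200_000        # daily peak quoted in the article
PEAK_WINDOW_HOURS = 3       # assumption from the comment
ORDERS_PER_DAY = 5_000_000  # assumption from the comment

# 3 hours is 10,800 seconds; the comment rounds this down to 10,000.
seconds_in_window = PEAK_WINDOW_HOURS * 3600

# Upper bound: pretend the peak rate holds for the entire window.
queries_in_window = PEAK_QPS * seconds_in_window
queries_per_order = queries_in_window / ORDERS_PER_DAY

print(f"{queries_in_window:,} queries in the peak window")
print(f"{queries_per_order:,.0f} queries per order")
```

With the unrounded 10,800 seconds this comes out a little higher than the 12 billion total quoted in the comment, but it is the same order of magnitude.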

BrentOzar 2 years ago | |

I’m not excusing the wild number, but just tossing out some additional load:

* Drivers checking in for work, especially if the apps poll automatically
* Drivers phoning home with live location updates
* Restaurants sending automated updates on order status
* Push notifications to users with status changes on their orders
* Users with multiple devices (like I have at least 5 devices with the UberEats app)

jakjak123 2 years ago | | |

Yes, our server had 120k queries/sec, but 80% of that traffic was driver heartbeats or connection verification. We halved it by disabling the connection verification query.

cdchn 2 years ago | | |

Even just searching and browsing restaurants and menus is probably dozens of queries for every interaction.

scottlamb 2 years ago | |

> 12 billion queries to process 5 million orders?

2,400 queries per order? That's not that crazy IMHO. There might be significant database fan-out on each click (depending on how they do geographic lookups, search ranking / synonyms / sponsored stuff, the repeat-your-last-order feature, whether the ranked search returns the full object or a reference that then has to be individually queried, etc.). There might be many clicks per order because people browse a lot (both to find a restaurant and then to find dishes within the restaurant), leave reviews, poll for delivery status updates, etc.

gizmo 2 years ago | | |

That's fair but that also suggests most actions hit the main database directly instead of caching layers. Possible, but somewhat unusual at this scale.

beembeem 2 years ago | | |

> 2,400 queries per order? That's not that crazy IMHO.

Isn't that off by at least an order of magnitude though? It forces them to operate a much larger cluster than should be necessary.

rbranson 2 years ago | |

Given how these blog posts are typically written, it is very likely the 1.2 million QPS figure is an all-time peak, not anything like an average.

orangechairs 2 years ago | | |

According to their slides/video (bottom of the blog post), the 1.2 million QPS is their daily peak number, not the average.

beoberha 2 years ago | |

It’s naive to think that database access is only happening when a customer makes an order. Each driver has workflows they exercise and data that needs to be stored. Same with vendors. There could be operational data for their infrastructure.

That said, 1.2 million queries per second is wild. It would be interesting to see the breakdown.

jen20 2 years ago | |

> there is zero interaction between users

A curious description for a platform which acts as a broker for transactions between users!

indymike 2 years ago | |

Massive amounts of user tracking.

Thaxll 2 years ago | |

Your numbers are off by a large factor.

milkglass 2 years ago | |

It's all the new reminders telling users to tip.

xyst 2 years ago |

> About 1.2 million queries per second at daily peak hours.

> About 2,300 total nodes spread across 300+ clusters.

> About 1.9 petabytes of data on disk.

> Close to 900 changefeeds.

> Largest cluster is currently 280 TB in size (but has peaked above 600 TB), with a single table that is 122 TB.

all of this yet my food still arrives cold af

kidding aside, I wonder if DD has the same problems as Uber or Lyft except with food delivery. Each new "change feed" is a specific region, county/municipality, or city. Federal, state, and local laws all handled delicately.

orangechairs 2 years ago | |

DoorDash's engineering blog has a much more in-depth look at their architecture: https://doordash.engineering/2023/02/07/how-we-scaled-new-ve...

beembeem 2 years ago | |

> my food still arrives cold af

Ha.

The first thing I noticed, and you almost got to it in your summary: at 1.2MM/2300 ≈ 520 qps per node, this isn't a wild setup. I'm still wrapping my head around how they're generating that amount of load, but per node it seems like an easy task for any database to handle.
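
A quick sketch of that per-node math, using only the numbers quoted from the article. It assumes load is spread evenly across nodes and clusters, which is surely not the case in practice, and it treats "300+" clusters as exactly 300.

```python
# Per-node and per-cluster averages from the quoted article figures.
PEAK_QPS = 1_200_000
TOTAL_NODES = 2_300
CLUSTERS = 300  # the article says "300+", so this is a lower bound

qps_per_node = PEAK_QPS / TOTAL_NODES        # average load on one node
nodes_per_cluster = TOTAL_NODES / CLUSTERS   # average cluster size
qps_per_cluster = PEAK_QPS / CLUSTERS        # average load on one cluster

print(f"{qps_per_node:.0f} qps per node")
print(f"{nodes_per_cluster:.1f} nodes per cluster")
print(f"{qps_per_cluster:.0f} qps per cluster")
```

Averages hide the skew: a handful of hot clusters could be carrying most of the traffic while the rest sit nearly idle.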

rickreynoldssf 2 years ago |

I'm not really seeing why DoorDash needs all their operational data in one monster clustered database. I would think it's so much simpler to shard the data by region for operational queries and aggregate in the background for long-term storage.

sean0- 2 years ago |

Ha! Amazing (didn’t know this was being written or put up).

This is a summary of a recent conference talk:

https://youtu.be/jCjrfpF64Kc?si=Gf-gp_ixX2V6Qz8V

This was my team. We did and lived this. AMA.

sverhagen 2 years ago | |

Well, it looks like the sibling threads are very interested to know where the need comes from to even _have_ 1.2 million queries per second. How does that break down? How much is that just core functionality versus analytics and tracking?

sean0- 2 years ago | | |

That's core functionality, not analytics. Nearly all of that is browsing, ordering, location updates, etc. The dismissive comments are amusing and show a lack of understanding of how the business works and, subsequently, the technology required to power the end-to-end flow for users.

joshstrange 2 years ago |

Do you know what DoorDash doesn’t manage? A staging/test environment. All testing for API integrations is done in prod on a shared account. The docs and the API endpoints themselves leave a lot to be desired as well.

jvans 2 years ago | |

I've been advocating for this approach for a long time. At a certain scale it is so brutally difficult to maintain an environment that mirrors production that the effort isn't worth it. With enough tooling in place you can mitigate the risk to customers significantly.

stingraycharles 2 years ago | |

So I assume they use feature flags instead, and staggered rollout of new features? As that’s a common alternative to heavy up-front testing.

snihalani 2 years ago |

interesting. curious if anyone has benchmarked it relative to other dbs. like: https://benchmark.clickhouse.com/

karmakaze 2 years ago | |

CrDB is not about many, many low-latency queries, like, say, MySQL. It's designed more for getting your workload down to as few large queries as possible, since every one incurs quorum latencies. You don't want to prototype something in Rails and hope there are no hidden lazy queries happening along the way.

It wouldn't be a good idea to take a large working PostgreSQL app and try to switch over to using CrDB. You'd spend all your time (unwittingly rewriting the entire app) speeding up and grouping a few queries at a time.

jordanthoms 2 years ago | | |

We took a large working PostgreSQL app and switched it over to CRDB, and that doesn't match my experience. Our existing schemas and query patterns moved over nicely. Latency for small indexed reads and writes did increase from ~1ms to ~3ms, but the max throughput is now effectively unlimited, since we can add capacity by adding new nodes to the cluster and letting CRDB automatically rebalance the workload. There was an increase in cost, as it needs more cores, disk, etc. compared to a single-primary PostgreSQL, but that makes sense when you consider that every bit of data is getting stored on 5 different nodes and there are overheads to maintain the consistency.

For the highest throughput endpoints we did make some changes to be more optimal on CRDB so we could run a smaller cluster, but it didn't require anything close to a rewrite.

namibj 2 years ago | | |

You can have small queries, they just have to be sent before you block on the results from the first of each group.
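
To illustrate the idea, here is a minimal sketch that talks to no real database. `fake_query` and `QUORUM_LATENCY_S` are hypothetical stand-ins that only sleep to simulate a round trip; the point is the difference between awaiting each query in turn and issuing the whole group before blocking on any result.

```python
import asyncio
import time

QUORUM_LATENCY_S = 0.003  # hypothetical ~3 ms round trip per query


async def fake_query(key: str) -> str:
    """Stand-in for a small indexed read; it only sleeps to simulate latency."""
    await asyncio.sleep(QUORUM_LATENCY_S)
    return f"row:{key}"


async def sequential(keys):
    # Each query waits for the previous result: roughly len(keys) * latency.
    return [await fake_query(k) for k in keys]


async def pipelined(keys):
    # Every query is issued before blocking on any result: roughly 1 * latency.
    return await asyncio.gather(*(fake_query(k) for k in keys))


async def main():
    keys = [f"k{i}" for i in range(20)]

    start = time.perf_counter()
    seq_rows = await sequential(keys)
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    pipe_rows = await pipelined(keys)
    t_pipe = time.perf_counter() - start

    return seq_rows, list(pipe_rows), t_seq, t_pipe


seq_rows, pipe_rows, t_seq, t_pipe = asyncio.run(main())
print(f"sequential: {t_seq * 1000:.1f} ms, pipelined: {t_pipe * 1000:.1f} ms")
```

Both versions return the same rows in the same order; only the waiting overlaps. With a real driver the equivalent would depend on whether it supports concurrent or batched statements, which this sketch does not cover.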

jordanthoms 2 years ago | |

Clickhouse is a totally different use case - Cockroach is OLTP, Clickhouse is OLAP. We use both Cockroach and Clickhouse at scale and they are both great but not competing products - Cockroach is great for the types of reads and writes you do when serving user requests, processing transactions, etc., but isn't optimal for analytics queries where you are going to do things like read and aggregate data on a 50TB table. Clickhouse eats those kinds of aggregate queries for breakfast, and is fast for some types of small read queries too, but it's not built to handle random writes or frequently updating rows of data.

cebert 2 years ago |

This reads like a long form advertisement.

al_borland 2 years ago | |

Case studies hosted on a company's own website generally are. It's kind of an "it worked for them, so it will work for you" thing.

candiddevmike 2 years ago | |

"Art of the possible" (YMMV)