If you decide to move off of RDS but stay on Postgres, you can run your own Postgres but now lose all the benefits of a managed service. You could move off of AWS (eg to Azure), but moving to a different cloud is a huge lift. That, btw, would also be required if you want to try something like Spanner (move to GCP). Moving off of Postgres to another database is also risky. The migration will obviously take some effort, but you're also probably talking about lots of code changes, schema changes, etc, as well as unknown operational risks that you'll discover on the new system. This applies if you're talking about things like Cockroach or even moving to MySQL.
That said, rolling your own sharding is a MASSIVE undertaking. Limitless looks promising, since it takes care of a lot of the issues that Figma ended up spending time on (basically, you shouldn't need something like Figma's DBProxy, as shard routing, shard splitting, etc will be taken care of). It's still in preview though, and like the article mentioned, the costs may be high.
Overall, no easy decisions on this one, unfortunately.
It's a large challenge, but it's absolutely doable. A ton of companies did this 10-15 years ago, basically every successful social network, user generated content site, many e-commerce sites, massively multiplayer games, etc. Today's pre-baked solutions didn't exist then, so we all just rolled our own, typically on MySQL back then.
With DIY, the key thing is to sidestep any need for cross-shard joins. This is easier if you only use your relational DB for OLTP, and already have OLAP use-cases elsewhere.
Storing "association" type relation tables on 2 shards helps tremendously too: for example, if user A follows user B, you want to record this on both user A and user B's shards. This way you can do "list all IDs of everyone user A follows" as well as "list all IDs of users following user B" without crossing shard boundaries. Once you have the IDs, you have to do a multi-shard query to get the actual user rows, but that's a simple scatter-gather by ID and easy to parallelize.
Implementing shard splitting is hard, but again definitely doable. Or avoid it entirely by putting many smaller shards on each physical DB server -- then instead of splitting a big server, you can just move an entire shard to another server, which avoids any row-by-row operations.
Many other tricks like this. It's a lot of tribal knowledge scattered across database conference talks from a decade ago :)
And yes, there are many tricks like having more logical shards than physical ones, collocating tables by the same shard_id, etc. It's still difficult. You need tooling for everything from shard splitting (even if that is just loving a logical shard), to schema migrations, not to mention if you end up needing cross-shard transactions or cross-shard joins.
Generally, you'd need a team of very strong infrastructure engineers. Most companies don't have the resources for that. There are definitely some engineers out there that could whip this all together.
It’s not really just postgres compatibility they are after, but compatibility with the Amazon RDS version of postgres. They also wanted to have something they could adopt incrementally and back out of when something unanticipated goes wrong.
Also, I think yugabyte uses an older version of the postgres processing engine, which may or may not be a big deal, depending on what they are using.
Pitching Citus ran into issues where people were hoping it would handle sharding transparently, which isn't the case. But for someone who's evaluating rolling their own sharding, being able to manage sharding keys explicitly is how Citus allows efficient joins based on your workload. So yes, if you're looking to roll an unmanaged sharded postgresql cluster, consider starting with Citus
Yes. It may not fits your need but take a look at PlanetScale. ( Based on MySQL and Vitess but I have seen quite a few people moving from Postgres )
Both of these companies are very specific about their circumstances and requirements
- Time is ticking, and downtime is guaranteed if they don’t do anything
- They are not interested in giving up the massive amount feature AWS supports via RDS, very specially around data recovery (anyone involved with Business Continuity planning would understand the importance of this)
- They need to be able to do it safely, incrementally, and without closing the door on reverting their implementation/rollout
- The impact on Developers should be minimal
“Engineering” at its core is about navigating the solution space given your requirements, and they did it well!
Both Figma and Notion meticulously focused on the minimal feature set they needed, in the most important order, to prevent disastrous downtime (e.g Figma didn’t need to support all of SQL for sharding, just their most used subset).
Both companies call out (rightfully so) that they have extensive experience operating RDS at this point, and that existing solutions either didn’t give them the control they needed (Notion) or required a significant rewrite or their data structures (Figma), which was not acceptable.
I think many people also completely underestimate how important operational experience with your solution is at this scale. Switch to Citus/Vitess? You’ll now find out the hard way all the “gotchas” of running those solutions at scale, and it would guarantedly have resulted in significant downtime as they procured this knowledge.
They’d also have to spend a ton of time catching up to RDS features they were suddenly lacking, which I would wager would take much more time than the time it took implementing their solutions.
Great job to both teams!
[0] https://www.figma.com/blog/how-figmas-databases-team-lived-t...
For me to understand their choices, I’ll first have to understand their system, which can be a task in itself. Without that, I’d not become aware of the nuances, only general patterns.
https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql...
Also these can add unnecessary complexity by having features you don't need. Or they might be missing features you need. Contributing up-stream can be difficult and there might be conflicts of interest especially for projects which have separate paid version.
Obviously it could interfere with demand for aurora limitless
A company like figma (billions in revenue) putting a small team to implementing a database sharding solution for an un-implemented use case (RDS, not just postgres). AND open sourcing it creating a value for the community is a net-good for the industry.
It works for Figma(for now), but for it to work as a solution for other companies with different hardware, data schema and access patterns will add even more complexity to the mix.
It's a excellent solution but I don't think it be good enough in the long run.
Building something super simple can be ok for the base use case but if you are a multi-billion dollar company you can probably afford a few dbas to actually make your platform good.
Many customers will have a tiny amount of data. For those customers a dedicated database is huge amount of overhead. There may not be any single customer who it makes sense to allocate dedicated "hardware" for.
Sure you have to deal with a one-time pain to shard your thingy, but you don't need to pay for tens-of-thousands of individual database servers, write interesting tools to keep their schemas in sync, wrangle their backups, etc.
There is a bit of overhead, but not huge by any means.
Being able to update schema one customer at a time is a huge plus in my view, as it gives you a lot of flexibility in rollout. You deploy a new version of the application on a new application server and move the customers on the new servers updating their schema one by one (automatically, obviously)
Backups are routinely automated anyway.
Well there's trade-offs with this too, whether needing aggregate data across shards for features, reporting, etc. Shared data between customers, users, etc. API access, etc.
Our shard key - Workspace ID - is a UUIDv4 so there’s a pretty high number of orgs per shard without conflict.
I have seen this several times before and it's always a symptom of having too many engineers working below the waterline. Rather than work on the actual customer-facing problem, let's port the backend to do event-sourcing/cqrs, move all our infrastructure to k8s, change language from x to y etc.
These are all what I would call "internal goals" (ie they may or may not be necessary or even essential to progress but are not directly customer-visible in their outcomes even if they may enable customer features to be built or indirectly improve the customer experience later) and need to be held to an extremely high level of scrutiny.
If you're amazon/google/meta and you need to do this because of extreme user scale I might believe you. If you're CERN or someone and you need to do this because of absolutely ridiculous data scale I might believe you. The idea that it's better for figma to write their own sharding solution than it is to port to one of the existing ones just doesn't pass even the most basic sniff test.
There is always a price. In this case the database team did something quick, cheap and easily. But the Application teams now have to deal with handling all the nuaces of the system. Maybe Figma has more people in these Apps teams with time on their hands to handle it.
It follows the same approach but is far more sophisticated and mature.
I know over simplified approach but majority of problem would be solved.
of course it's not cheap, but probably they are deep into the AWS money pit anyway (so running Citus/whatever would be similar TCO)
and it's okay, AWS is expensive for a lot of bootstrapped startups with huge infra requirements for each additional user, but Figma and Notion are virtually on the exact opposite of that spectrum
also it shows that there's no trivial solution in this space, sharding SQL DBs is very hard in general, and the extant solutions have sharp edges
Also, I’m assuming Amazon or Google will sometimes roll their own solutions on problems of a scale in the same ballpark as Figma’s.
But anyhow, what’s the scale at which this becomes acceptable, exactly? Is there a magical number which serves as a universal threshold? Or is there - like in all engineering decisions - a very concrete economic case for which you and I both lack a lot of the requisite context and inputs?
Seems like a useful argument design pattern.
In both our case and Figma’s, sharding Postgres ASAP was of critical importance because of transaction ID wraparound threat or other capacity issues that promise hard days-long downtime. The kind of downtime that costs 10s of millions of dollars of brand damage alone. Possibly even company ending.
In such a situation, failure is not an option, and you must pick the least risky solution. Moving to an unmanaged cluster system and figuring out your own point-in-time backup/restore, access control provisioning, etc etc has a lot more unknown unknowns than sticking with the managed database vendor you know. The potential failure scenarios of Citus have scary worst cases - we get backup and restore wrong but it seems to work fine in test, then we move to Citus, then something breaks and we can’t restore from backup after all. It’s equally bad to mis-estimate the amount of time needed to bring up the new system. Let’s say you estimate 6 months to get parity with RDS built in features needed to survive disaster and start moving data over, but instead it takes 10 months. Is there enough time left to finish before going hard down? The clock is ticking. Staying with RDS keeps a whole class of new risk out of the picture.
At least here at Notion, NO ONE wanted to build something complicated for fun. We really wanted the company we’d spent years working for and on-call for to survive.
Our story: https://www.notion.so/blog/sharding-postgres-at-notion
> So, now, let me speculate. The real reason why Figma reinvented the wheel by creating their own custom solution for sharding might be as straightforward as this — Figma wanted to stay on RDS, and since Amazon had decided not to support the CitusData extension in the past, the Figma team had no choice but to develop their own sharding solution from scratch.
I could of course be wrong but it really just feels to me like the reasons given in the article are attempts to justify a decision that was actually made because of "not invented here" syndrome.
> over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house. While migrating, we would have had to rebuild our domain expertise from scratch.
So they had in house expertise to run performantly on RDS but that same experience couldn't be translated to switching over to it running on EC2 + Citus? Rather they used another non-experience concept of building their own sharding? That left me scratching my head.
That said, I've seen plenty of unprofitable startups with high burn rate play this game. That seems a bit suicidal to me.
The article implies otherwise. E.g. it quotes Figma saying: “Given our very aggressive growth rate, we had only months of runway remaining.”
Second best option is ability to easily create prod environments, and then give those to your biggest customers (bigname.figma.com) etc. No single figma customer will go beyond an i3.metal for the DB, or the app.
This issue is so pervasive that we've all just assumed that it must be necessary.
also, it's... strange that they had 18 months and "extremely tight timeline pressure" we simply don't know enough about the situation
We recently did a reshard from 32 nodes / 15 shards per node -> 96 nodes / 5 shards per node. That puts us in footing to scale up for a while before we need to reshard again. This is a pretty smooth process, and each time we scale out we get much more scale up capacity.
Our shard logic is very simple static assignment based only on the Workspace ID. If we wanted to add workspace created time routing, we'd need to starting plumbing that information around the system in ways that are slightly annoying. Probably the move would be to re-key the Workspace table to use a date-embedding UUID format.
First, the data size is growing and we didn't really know the growth rate in advance. Sharding gives you some flexibility in the infrastructure sizing. And yes, you don't want to wait until the last minute.
Second, it helps us to spread the disk I/O. Possible on a single machine if you're a little bit careful with disk types and sizes. But again, the overall load still grows.
Third, all the bulk operations take a long time on a single server. Each of the distributed servers takes about an hour to back up and 2-3 hours to restore. I'd feel uneasy if it was much longer.
As companies get larger they move further up the stack whether it's sharding techniques, databases, custom orchestration software, their own networking hardware, etc.
The whole point of a database is to contain the state - as a pet - so the rest of your application can be stateless - as cattle.
To really get cattle database systems, you need a self-managing cluster architecture that puts things on autopilot like Neon where you've got >=2 copies of each row and can tolerate losing any single box without unavailability.
But restoring a small DB from a fresh backup, if things go really wrong, is faster, and does not affect other customers.
I completely agree wrt having a hot spare / cluster with transparent failover and management.
> Common DBA tasks for Amazon RDS for PostgreSQL
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appen...
I just read the article and from what I can tell the Figma team made a somewhat reasonable sounding decision
My main point is this decision makes no sense on its face[1]. Obviously I'm lacking the real context, so there may be overwhelming circumstances which mean that it was the right decision anyway, but these weren't explained in TFA for me. In TFA the reasoning was superficial, and this is the sort of decision that really should be held to a very high standard because as I say these types of internal goals have the potential to burn a ton of valuable engineering time on things which don't affect the customer-facing offering.
Now we have in a sibling thread someone from notion saying they did the same thing and for me exactly the same reasoning applies. It could be that all these different Saas companies are so special that them each building their own individual postgres sharding solutions to work around the fact that they can't get a sharded, managed postgres instance makes sense. Or not.
[1] That's what I mean by saying it doesn't pass the sniff test. It might actually be the right decision but your instincts should rebel against it because it feels very wrong. So there needs to be a serious examination before going down that path.
I had a problem just recently where I worked at a place that's using blue/green aws rds deployments with mysql replication, and binlogs cant be moved in that service.
This is something that is bog simple in a non-managed service, and as a result we can either manage app replication, re-sync data on each b/g upgrade, or do physical replication (slow). My point isn't that rds is bad, it's just that if you are already deciding to implement your own significant infrastructure on top of database it seems weird to me to not just have the knobs on the thing itself.
Though you could say the same is true of the storage, and tbqh most of the cloud storage is dogshit these days but we just deal with it.
They also stated it was too risky to migrate data stores on the timeline they were working within
Those all seem like measured engineering decisions AFAICT
Do data dump from prod for initial sync and then setup replication from RDS to new cluster. Once synced do switch. Then you're off RDS and can shard on Citus.
The coordination alone usually takes months