Scaling Rails and Postgres to millions of users at Microsoft (stepchange.work)

202 points by htormey 1 year ago | 91 comments

pajeets 1 year ago |

Postgres can be scaled vertically like Stack Overflow did. With a cache at the edge for popular reads if you absolutely must (but you most likely don't).

No need to microservice or sync read replicas even (unless you are making a game). No load balancers. Just up the RAM and CPU up to TB levels for heavy real world apps (99% of you won't ever run into this issue).

Seriously, it's so easy to create scalable backend services with PostgREST, RPC, triggers, V8, even queues now, all in Postgres. You don't even need cloud. Even a mildly RAM'd VPS will do for most apps.

got rid of redis, kubernetes, rabbitmq, bunch of SaaS tools. I just do everything on Postgres and scale vertically.

One server. No serverless. No microservice or load handlers. It's sooo easy.

mr_toad 1 year ago | |

Stack Overflow absolutely had load balancers, and 9 web servers, and Redis caches. They also use 4 SQL servers, so not entirely vertical either. And they were only serving 500 requests a second on average (peak was probably higher).

pajeets 1 year ago | | |

was it? i read it was a huge ram server

danmaz74 1 year ago | |

Having at least 2 web servers and a read-only DB replica for redundancy/high availability is very easy and much safer. Yes, setting up a single server is faster, but if your DB server dies - and at some point it will happen - the redundant setup will save you not just a lot of downtime, but also a lot of stress and additional work.

brightball 1 year ago | | |

Read replicas come with their own complexity as you have to account for the lag time on the replica for UX. This leads to a lot of unexpected quirks if it’s not planned for.
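
A minimal sketch of one common way to plan for that lag, assuming a "read your own writes" rule: after a user writes, that user's reads are pinned to the primary for a short window. The class name, the 2-second window, and the "primary"/"replica" labels are illustrative assumptions, not part of any particular library.

```python
import time


class ReadRouter:
    """Route reads to the primary shortly after a user's own write."""

    def __init__(self, pin_seconds=2.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed upper bound on replica lag
        self.clock = clock
        self._last_write = {}  # user_id -> time of that user's latest write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"  # replica may not have this user's write yet
        return "replica"
```

Other users keep reading from the replica, so only the person who just wrote pays the cost of hitting the primary.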

cultofmetatron 1 year ago | | |

my startup has a similar setup (elixir + postgres). we use aurora so we get automated failover. it's more expensive but it's just a cost of doing business.

justinclift 1 year ago | |

That works for the performance aspect, but doesn't address any kind of High Availability (HA).

There are definitely ways to make HA work, especially if you run your own hardware, but the point is that you'll need (at least) a 2nd server to take over the load of the primary one that died.

pajeets 1 year ago | | |

sure failover is recommended if you have HA commitments

nazka 1 year ago | |

Thank you for sharing this! I have been diving into it.

How do you manage transactions with PostgREST? Is there a way to do it inside it? Or does it need to be in a good old endpoint/microservice? I can’t find anything in their documentation about complex business logic beyond CRUD operations.

steve-chavez 1 year ago | | |

Transactions are done using database functions https://docs.postgrest.org/en/v12/references/api/functions.h....
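
A rough sketch of what calling such a function looks like from a client, assuming PostgREST's convention of exposing database functions under /rpc/<function_name> with arguments sent as a JSON object. The function name `transfer_funds`, its arguments, and the localhost URL are hypothetical; the multi-statement logic itself would live in the SQL function, which executes within a single transaction.

```python
import json
import urllib.request


def build_rpc_request(base_url, function_name, args):
    """Build (but do not send) a POST to PostgREST's /rpc/<function> endpoint."""
    url = f"{base_url.rstrip('/')}/rpc/{function_name}"
    body = json.dumps(args).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical function and arguments; urllib.request.urlopen(req) would send it.
req = build_rpc_request(
    "http://localhost:3000",
    "transfer_funds",
    {"from_account": 1, "to_account": 2, "amount": 50},
)
```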

whakim 1 year ago | |

Yes, scaling vertically is much easier than scaling horizontally and dealing with replicas, caching, etc. But that certainly has limits and shouldn’t be taken as gospel, and is also way more expensive when you’re starting to deal with terabytes of RAM.

I also find it very difficult to trust your advice when you’re telling folks to stick Postgres on a VPS - for almost any real organization using a managed database will pay for itself many times over, especially at the start.

pajeets 1 year ago | | |

looking at hetzner benchmarks i would say a VPS is quite enough to run Postgres for an Alexa Top 1000 site. Once you get into the top 100, you will need more RAM than what is offered.

But my point is you won't ever hit this type of traffic. You don't even need Kafka to handle streams of logs from a fleet of generators from the wild. Postgres just works.

In general, the problem with modern backend architectural thinking is that it treats the database as some unreliable bottleneck, but that is an old-fashioned belief.

The vast majority of HN users and startups are not going to be servicing more than 1 million transactions per second. Even a medium sized VPS from Digital Ocean running Postgres can handle that load just fine.

Postgres is very fast and efficient, and you don't need to build your architecture around problems you won't ever hit and prepay a premium for that <0.1% peak that happens so infrequently (unless you are a bank and receive fines for that).

seabrookmx 1 year ago | |

> One server

What happens if this server dies?

wongarsu 1 year ago | | |

Then your service is offline until you fix it. For many services a completely acceptable thing to happen once in a blue moon

Most would probably get two servers with a simple failover strategy. But on the other hand servers rarely die. At the scale of a datacenter it happens often, but if you have like six of them, buy server grade stuff and replace them every 3-5 years, chances are you won't experience any hardware issues.

pajeets 1 year ago | | |

if you can't risk this rarity then get a failover server with equal specs

maybe add another for good measure... if the biz insurance needs extreme HA then absolutely have multiple failovers

my point is you aren't doing extreme orchestration or routing

throw in cloudflare ddos protection too

JB_Dev 1 year ago | |

Eventually you get data residency asks to keep data in the right region and for that you need to have horizontal partitioning of some kind.
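
As a minimal sketch of what "horizontal partitioning of some kind" can mean here, assuming one database per region and a lookup keyed on the user's region: the region codes and hostnames below are placeholders, not a real deployment.

```python
# Hypothetical region -> database mapping; hostnames are placeholders.
REGION_DATABASES = {
    "eu": "postgres://db.eu.example.internal/app",
    "us": "postgres://db.us.example.internal/app",
}


def database_for(user_region):
    """Return the database that must hold this user's data."""
    try:
        return REGION_DATABASES[user_region]
    except KeyError:
        # Fail loudly instead of silently storing data in the wrong region.
        raise ValueError(f"no database configured for region {user_region!r}") from None
```

Raising on an unknown region is a deliberate choice in this sketch: a default fallback would quietly put data somewhere it may not be allowed to live.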

jamil7 1 year ago | |

Our backend at work does use a read replica purely for websockets. I always wondered if it was overkill, I’m not a backend developer, though.

pajeets 1 year ago | | |

not sure what you are building but i hope that was for a real time multiplayer game, otherwise it doesn't make sense to have bi-directional communication when you only need reads

making read replicas also accept writes is needed for such cases, but as soon as you have more than one place to write you run into edge cases and complexities in debugging

mattacular 1 year ago | |

> Just up the RAM and CPU up to TB levels

not sure what CPU at TB levels means but hope your wallet scales better vertically

cosmicradiance 1 year ago | | |

They are definitely not on the cloud.

cdiamand 1 year ago |

I ran into some scaling challenges with Postgres a few years ago and had to dive into the docs.

While I was mostly living out of the "High Availability, Load Balancing, and Replication" chapter, I couldn't help but poke around and found the docs to be excellent in general. Highly recommend checking them out.

https://www.postgresql.org/docs/16/index.html

danpalmer 1 year ago | |

They are excellent! Another great example is the Django project, which I always point to for how to write and structure great technical documentation. Working with Django/Postgres is such a nice combo and the standards of documentation and community are a huge part of that.

irjustin 1 year ago | | |

Interestingly I have had almost the exact opposite experience being very frustrated with the Django docs.

To be fair, it could be because I'm frustrated with Django's design decisions having come from Rails.

From learning Django a few years ago, I still carry a deep loathing for polymorphism (generic relations[0]) and model validations (full clean[1]).

You know what - it's design decisions...

[0] https://docs.djangoproject.com/en/5.1/ref/contrib/contenttyp...

[1] https://docs.djangoproject.com/en/5.1/ref/models/instances/#...

jbverschoor 1 year ago | |

Like many of the BSDs

aerzen 1 year ago | | |

Did Postgres used to be a BSD? Are they known for good documentation?

rubyfan 1 year ago |

15 years ago I worked on a couple of really high profile rails sites. We had millions of users with Rails and a single mysql instance (+memcached and nginx). Back then ruby was a bit slower than it is today but I’m certain some of the challenges you face at that scale are things people still do today…

1. Try to make most things static-ish reads and cache generic stuff, e.g. most things became non-user specific HTML that got cached as SSI via nginx or memcached

2. Move dynamic content to services to load after static-ish main content, e.g. comments, likes, etc. would be loaded via JSON after the page load

3. Move write operations to microservices, i.e. creating new content and changes to DB become mostly deferrable background operations

I guess the strategy was to do as much serving of content without dipping into ruby layer except for write or infrequent reads that would update cache.
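
The strategy above can be sketched roughly as follows. This is an in-process stand-in, not the nginx SSI / memcached setup described: a read-through cache that only calls the expensive render on a miss, plus a queue that defers writes to a background step. All class and parameter names are made up for illustration.

```python
import time
from collections import deque


class PageCache:
    """Read-through cache: render only on a miss or after the TTL expires."""

    def __init__(self, render, ttl_seconds=60.0, clock=time.monotonic):
        self.render = render  # stand-in for the expensive app-layer render
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, html)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # served without touching the app layer
        html = self.render(key)
        self._entries[key] = (self.clock() + self.ttl_seconds, html)
        return html

    def invalidate(self, key):
        self._entries.pop(key, None)


class DeferredWrites:
    """Queue writes and apply them later in a background step."""

    def __init__(self, apply_write, cache):
        self.apply_write = apply_write  # stand-in for the real DB write
        self.cache = cache
        self._pending = deque()

    def enqueue(self, key, payload):
        self._pending.append((key, payload))

    def drain(self):
        """Apply queued writes, invalidating affected pages; return the count."""
        count = 0
        while self._pending:
            key, payload = self._pending.popleft()
            self.apply_write(key, payload)
            self.cache.invalidate(key)
            count += 1
        return count
```

Until `drain()` runs, readers keep seeing the cached page, which is the trade-off this approach accepts: slightly stale reads in exchange for keeping most requests out of the application layer.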

teleforce 1 year ago |

Please check this excellent book by a former Microsoft and Groupon engineer on scaling Rails and Postgres:

[1] High Performance PostgreSQL for Rails: Reliable, Scalable, Maintainable Database Applications by Andrew Atkinson:

https://pragprog.com/titles/aapsql/high-performance-postgres...

giovannibonetti 1 year ago |

What a small world. Earlier today I got tagged in a PR [1] where Andrew became the maintainer of a Ruby gem related to database migrations. Good to know he is involved in multiple projects in this space.

[1] https://github.com/lfittl/activerecord-clean-db-structure/is...

andatki 1 year ago | |

Hi there! That's funny! This interview and those gem updates were unrelated. However both are part of the sweet spot for me of education, advocacy, and technical solutions for PostgreSQL and Ruby on Rails apps.

I hope you’re able to check out the podcast episode and enjoy it. Thanks for weighing in within the gem comments, and for commenting here on this connection. :)

benwilber0 1 year ago |

Postgres can scale to millions of users, but Rails definitely can't. Unless you're prepared to spend a ton of money.

petcat 1 year ago | |

For real. Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill. I've worked at unicorn companies trying to do exactly that.

Their baseline was 800 instances of the Rails app...lol.

I'm not going to name names (you've heard of them) ... but this is a company that had to invent an entirely new and novel deployment process in order to get new code onto the massive beast of Rails servers within a finite amount of time.

loktarogar 1 year ago | | |

I've scaled a single rails server to 50k concurrent, and so if Rails is the theoretical bottleneck there, and we base it off scaling my meager efforts, that's only 20 servers for 1 mil concurrent, or around $1000/mo at the price point I was paying (heroku).
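
Spelling out that back-of-the-envelope estimate, under the same assumptions the comment makes (capacity scales linearly with server count and Rails is the only bottleneck); the per-server price is derived from the quoted total rather than stated in the comment:

```python
import math

target_concurrent = 1_000_000  # the 1 million concurrent users asked about
per_server = 50_000            # concurrency reported for one Rails server
monthly_budget = 1_000         # approximate total quoted, in USD

servers = math.ceil(target_concurrent / per_server)
cost_per_server = monthly_budget / servers  # implied price per server

print(servers, cost_per_server)
```

That works out to 20 servers at an implied $50 per server per month.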

Rails these days isn't the top of the speed meters but it's not that slow either.

cies 1 year ago |

My experience scaling up Rails (mostly in size of codebase NOT in size of traffic) really made me love typesafe languages.

IDE smartness (auto complete, refactoring), compile error instead of runtime, clear APIs...

Kotlin is a pretty nice "Type-safe Ruby" to me nowadays.

Alifatisk 1 year ago | |

I had a similar experience. Working in a large Ruby codebase made me realise how important type hints are; sometimes I had to investigate what types were expected and required because the editor was unable to tell me. I hope RBS / Sorbet solves this.

neonsunset 1 year ago |

This desperately needs the Walmart treatment of JET.com’s teams post-acquisition :)

jojobas 1 year ago |

What's Rails and Postgres? Do they mean ASP.NET and MS SQL Server?

andatki 1 year ago | |

Rails and Postgres (and AWS) was the pre-acquisition stack, and development continued with that stack during this time period (2020-2021). https://en.wikipedia.org/wiki/Flip_(software)

Microsoft acquired companies with web and mobile platforms with varied backgrounds at a high rate. I got the sense that the tech stack—at least when it was based on open source—was evaluated for ongoing maintenance and evolution on a case by case basis. There was a cloud migration to Azure and encouragement to adopt Surface laptops and VS Code, but the leadership advocated for continuing development in the stack as feature development was ongoing, and the team was small.

Besides hosted commercial versions, I was happy to see Microsoft supporting community/open source PostgreSQL so much and they continue to do so.

https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...

https://techcommunity.microsoft.com/t5/azure-database-for-po...

neonsunset 1 year ago | | |

PostgreSQL has been the most popular choice for greenfield .NET projects for a while too. There really isn't any vendor lock-in as most of the ecosystem is built with swappable components.

djaouen 1 year ago |

I don't understand why you wouldn't just use Elixir/Phoenix if you need to scale?

foundart 1 year ago | |

Perhaps because you need to scale quickly and already have a large Rails app that would take a long time to recreate in another language and framework.

SkyPuncher 1 year ago | |

It’s hard to compete with Rails productivity

seabrookmx 1 year ago | |

I don't understand why you wouldn't use <compiled language that's faster than the BEAM> if you need to scale?

djaouen 1 year ago | | |

I mean, you could, but you'd be missing out on the Rails-esque nature of Elixir/Phoenix.

datadeft 1 year ago |

Scaling a non-scalable-by-default framework that should have been a few services written in a performance-first language, at a billion+ USD company.

I am not sure why we are boiling the oceans for the sake of a language like Ruby and a framework like Rails. I love those to death, but Amazon's approach is much better (or it used to be): you can't make a service for 10,000+ users in anything other than C++ or Java (probably Rust as well nowadays).

For millions of users the CPU cost difference probably justifies the rewrite cost.