Scaling Rails and Postgres to millions of users at Microsoft (stepchange.work)
No need for microservices or even synced read replicas (unless you are making a game). No load balancers. Just up the RAM and CPU to TB levels for heavy real-world apps (99% of you won't ever run into this issue).
Seriously, it's so easy to create scalable backend services with PostgREST, RPC, triggers, V8, even queues now, all in Postgres. You don't even need cloud. Even a mildly RAM'd VPS will do for most apps.
got rid of redis, kubernetes, rabbitmq, bunch of SaaS tools. I just do everything on Postgres and scale vertically.
One server. No serverless. No microservice or load handlers. It's sooo easy.
There are definitely ways to make HA work, especially if you run your own hardware, but the point is that you'll need (at least) a 2nd server to take over the load of the primary one that died.
How do you manage transactions with PostgREST? Is there a way to do it inside it? Or does it need to be in a good old endpoint/microservice? I can’t find anything in their documentation about complex business logic beyond CRUD operations.
I also find it very difficult to trust your advice when you’re telling folks to stick Postgres on a VPS - for almost any real organization using a managed database will pay for itself many times over, especially at the start.
But my point is you won't ever hit this type of traffic. You don't even need Kafka to handle streams of logs from a fleet of generators from the wild. Postgres just works.
In general, the problem with modern backend architectural thinking is that it treats the database as some unreliable bottleneck, but that is an old-fashioned belief.
The vast majority of HN users and startups are not going to be servicing more than 1 million transactions per second. Even a medium-sized VPS from Digital Ocean running Postgres can handle that load just fine.
Postgres is very fast and efficient, and you don't need to build your architecture around problems you won't ever hit and prepay a premium for that <0.1% peak that happens so infrequently (unless you are a bank and receive fines for that).
What happens if this server dies?
Most would probably get two servers with a simple failover strategy. But on the other hand servers rarely die. At the scale of a datacenter it happens often, but if you have like six of them, buy server-grade stuff and replace them every 3-5 years, chances are you won't experience any hardware issues.
maybe add another for good measure....if the biz insurance needs extreme HA then absolutely have multiple failover
my point is you arent doing extreme orchestration or routing
throw a cloudflare ddos protection too
making read replicas also accept writes is needed for such cases, but as soon as you have more than one place to write you run into edge cases and complexities in debugging
not sure what CPU at TB levels means but hope your wallet scales better vertically
While I was mostly living out of the "High Availability, Load Balancing, and Replication" chapter, I couldn't help but poke around and found the docs to be excellent in general. Highly recommend checking them out.
To be fair, it could be because I'm frustrated with Django's design decisions having come from Rails.
Since learning Django a few years ago, I still carry a deep loathing for its polymorphism (generic relations[0]) and model validations (full_clean[1]).
You know what - it's design decisions...
[0] https://docs.djangoproject.com/en/5.1/ref/contrib/contenttyp...
[1] https://docs.djangoproject.com/en/5.1/ref/models/instances/#...
1. try to make most things static-ish reads and cache generic stuff, e.g. most things became non-user specific HTML that got cached as SSI via nginx or memcached
2. move dynamic content to services to load after static-ish main content, e.g. comments, likes, etc. would be loaded via JSON after the page load
3. Move write operations to microservices, i.e. creating new content and changes to DB become mostly deferrable background operations
I guess the strategy was to do as much serving of content without dipping into ruby layer except for write or infrequent reads that would update cache.
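A rough sketch of that strategy, in plain Python rather than nginx/memcached plus Rails. Everything here (PageCache, DeferredWrites, the in-memory store) is a made-up stand-in for illustration, not what that team actually ran: reads are rendered once and then served from cache, and writes are queued and applied later, which is also when the cached page gets invalidated.

```python
from collections import deque


class PageCache:
    """Toy read-through cache: render once, serve from memory afterwards."""

    def __init__(self, render):
        self._render = render   # expensive function, stands in for the Ruby layer
        self._pages = {}        # path -> cached HTML
        self.render_calls = 0   # how often we had to dip into the expensive layer

    def get(self, path):
        if path not in self._pages:
            self.render_calls += 1
            self._pages[path] = self._render(path)
        return self._pages[path]

    def invalidate(self, path):
        self._pages.pop(path, None)


class DeferredWrites:
    """Toy write queue: accept the write now, apply it in the background later."""

    def __init__(self, store, cache):
        self._store = store
        self._cache = cache
        self._pending = deque()

    def submit(self, path, body):
        self._pending.append((path, body))

    def drain(self):
        """Apply all queued writes and drop the affected cached pages."""
        applied = 0
        while self._pending:
            path, body = self._pending.popleft()
            self._store[path] = body
            self._cache.invalidate(path)
            applied += 1
        return applied


store = {"/post/1": "hello"}
cache = PageCache(lambda path: f"<html>{store[path]}</html>")
writes = DeferredWrites(store, cache)

first = cache.get("/post/1")               # rendered
second = cache.get("/post/1")              # served from cache
writes.submit("/post/1", "hello, edited")
stale = cache.get("/post/1")               # still the old page until the queue drains
writes.drain()
fresh = cache.get("/post/1")               # rendered again after invalidation
```

The trade-off is visible in `stale`: readers can see old content until the background job catches up, which is fine for comments and likes but not for everything.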
[1] High Performance PostgreSQL for Rails Reliable, Scalable, Maintainable Database Applications by Andrew Atkinson:
https://pragprog.com/titles/aapsql/high-performance-postgres...
[1] https://github.com/lfittl/activerecord-clean-db-structure/is...
I hope you’re able to check out the podcast episode and enjoy it. Thanks for weighing in within the gem comments, and for commenting here on this connection. :)
Their baseline was 800 instances of the Rails app...lol.
I'm not going to name names (you've heard of them) ... but this is a company that had to invent an entirely new and novel deployment process in order to get new code onto the massive beast of Rails servers within a finite amount of time.
Rails these days isn't the top of the speed meters but it's not that slow either.
IDE smartness (auto complete, refactoring), compile error instead of runtime, clear APIs...
Kotlin is a pretty nice "Type-safe Ruby" to me nowadays.
Microsoft acquired companies with web and mobile platforms with varied backgrounds at a high rate. I got the sense that the tech stack—at least when it was based on open source—was evaluated for ongoing maintenance and evolution on a case by case basis. There was a cloud migration to Azure and encouragement to adopt Surface laptops and VS Code, but the leadership advocated for continuing development in the stack as feature development was ongoing, and the team was small.
Besides hosted commercial versions, I was happy to see Microsoft supporting community/open source PostgreSQL so much and they continue to do so.
https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...
https://techcommunity.microsoft.com/t5/azure-database-for-po...
/s
I am not sure why we are boiling the oceans for the sake of a language like Ruby and a framework like Rails. I love those to death but Amazon's approach is much better (or it used to be): you can't make a service for 10,000+ users in anything other than C++ or Java (probably Rust as well nowadays).
For millions of users the CPU cost difference probably justifies the rewrite cost.
We're running 270k+ RPM no sweat, and our spend for those containers is maybe 1/100th what you're quoting there.
The idea that Rails can't handle high load is just such bloody nonsense.
You can build an abomination with any framework, if you try.
Can you deploy something to vercel that supports a million concurrent users for less than $250K/month? What about using AWS Lambdas? Go microservices running in K8s?
I think your infra bills are going to skyrocket no matter your software stack if you're serving 1 million+ concurrent users.
You might be surprised at how far you can go with the KISS approach with modern hardware and open source tools.
So if you have a lot of money then you can start implementing from scratch your own web framework in C. It will be the perfect framework for your own product and you can put 50 dev/sec/ops/* on the team to make sure both the framework and product code are written.
But some (probably most) products are started with 1-2 people trying to find product-market fit, or whatever the name is for solving a real problem for paying users as fast as they can. Scaling is then deferred until money is coming in.
This is similar because this is about a startup/product bought by Microsoft and not built inhouse.
For fast delivery of stable secure code for web apps Rails is a perfect fit. I am not saying the only one but there are not that many offering the stability and batteries included to deliver with a small team a web app that can scale to product market fit while keeping the team small.
My go-to example is graphql-ruby, which really chokes serializing complex object graphs (or did, it's been a while now since I've had to use it). It is pretty easy to consume 100s of ms purely on compute to serialize a complex graphql response.
> it is not about the language
Sure how about these people?
https://thenewstack.io/which-programming-languages-use-the-l...
Edit: Careful for the non-realtime reporting though if you want to run very slow queries - those will pause replication and can be a PITA.
Good documentation? Yes
It's not cheap at roughly $200/hr but already if you have this type of traffic then you are generating revenues (hopefully) at much greater amounts.
It still is. But you have to look at it in perspective. Do you have customers that NEED high availability and will pull out pitchforks if you are down for even a few minutes? I do. The peace of mind is what you're paying for in that case.
Plus it's still cheaper than paying a devops guy a full-time salary to maintain these systems if you do it on your own.
https://nickcraver.com/blog/2016/02/03/stack-overflow-a-tech...
I get what you're saying, they didn't do dynamic and "wild" horizontal scaling, they focused more on having an optimal architecture with beefy "vertically scaled" servers.
Very much something we should focus on. These days horizontal scaling, microservices, kubernetes, and just generally "throwing compute" at the problem is the lazy answer to scaling issues.
However, if they have a peak of 450 web requests per second and somewhere between 11000 - 23800 SQL queries per second, that'd mean roughly 24 - 53 SQL queries to serve a single request. There's probably a lot of background processes and whatnot (and also queries needed for web sockets) that cut the number down and it's not that bad either way, but I do wonder why that is.
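For anyone who wants to check that estimate, here is the back-of-the-envelope division. It assumes, as the comment does, that the peak web request rate and the SQL query range can be compared directly, which ignores background jobs and web sockets.

```python
# Figures quoted in the comment above.
peak_requests_per_sec = 450
sql_queries_per_sec = (11_000, 23_800)

# Naive ratio: every SQL query is attributed to a web request.
low, high = (q / peak_requests_per_sec for q in sql_queries_per_sec)

print(f"{low:.1f} to {high:.1f} SQL queries per web request")
# prints "24.4 to 52.9 SQL queries per web request"
```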
The apps with good performance that I've generally worked with attempted to minimize the amount of DB requests needed to serve a user's request (e.g. session cached in Redis/Valkey and using DB views to return an optimized data structure that can be returned with minimal transformations).
Either way, that's a quite beefy setup!
For GraphQL on Rails you can avoid graphql-ruby and use Agoo[1] instead so that that work is outsourced to C. So in practice it's not a problem.
Exactly. So C/C++/Fortran is better in this regard than Python.
When you need them... it's nice to have them "just there", implemented correctly (at least as correctly as they can be in an entirely generic way).
Model validations is a whole thing... I think that Django offering a built-in auto-generated admin leads to a whole slew of differing decisions that end up coming back to be really tricky to handle.
But yea, I can complain at length.
- Model validations aren't run automatically. Need to call full_clean manually.
- EXCEPT when you're in a form! Forms have their own clean, which IS run automatically because is_valid() is run.
- This also happens to run the model's full_clean.
- DRF has its own version of create which is separate and also does not run full_clean.
- Validation errors in DRF's Serializers are a separate class of errors from model validations and thus model Val Errors are not handled automatically.
- Can't monkey patch models.Model.save to run full_clean automatically because it breaks some models like User AND now it would run twice for Forms+Model[0].
Because of some very old web-forum style design decisions, model validations aren't unified thus the fragmentation makes you need to know whether you're calling .save()/.create() manually, are in a form, or in DRF. And it's been requested to change this behavior but it breaks backwards compat[0].
It's frustrating because in Rails this is a solved problem. Model validations ALWAYS run (and only once) because... I'm validating the model. Model validations == data validations, which means it should be true for all areas regardless of caller. In the exceptional cases I should be required to be explicit about skipping (i.e. Rails), whereas in Django I need to be explicit about running it - sometimes... depends where I am.
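To make the difference concrete, here is a framework-free toy. These classes are not Django or Rails code; they only borrow the full_clean/save names to mimic the two defaults being compared: validation that the caller has to remember to run, versus validation that runs on every save unless explicitly skipped (the `validate` flag is modelled loosely on Rails' `save(validate: false)`).

```python
class ValidationError(Exception):
    pass


class OptInModel:
    """Toy stand-in for the opt-in default: save() does not validate."""

    def __init__(self, name):
        self.name = name
        self.saved = False

    def full_clean(self):
        if not self.name:
            raise ValidationError("name must not be blank")

    def save(self):
        # No validation here; the caller must have run full_clean() first.
        self.saved = True


class AlwaysValidatedModel(OptInModel):
    """Toy stand-in for the always-on default: skipping must be explicit."""

    def save(self, validate=True):
        if validate:
            self.full_clean()
        super().save()


opt_in = OptInModel(name="")
opt_in.save()                  # invalid data goes through silently

strict = AlwaysValidatedModel(name="")
try:
    strict.save()
    rejected = False
except ValidationError:
    rejected = True            # the same invalid data is stopped at save()
```

With the second shape, forms and API layers never need their own "did anyone validate this?" logic; the cost is that a blanket override like this is exactly what the list above says breaks in real Django.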
[0] https://stackoverflow.com/questions/4441539/why-doesnt-djang...
I think Django seems confused on the issue of clean/validation. On the one hand, it could say the "model" is just a database table and any validation should live in the business logic of your application. This would be a standard way of architecting a system where the persistence layer is in some peripheral part that isn't tied to the business logic. It's also how things like SQLAlchemy ORM are meant to be used. On the other hand, it could try to magically handle the translation of real business objects (with validation) to database tables.
It tries to do both, with bad results IMO. It sucks to use it on the periphery like SQLAlchemy, it's just not designed for that at all. So everyone builds "fat" models that try to be simultaneously business objects plus database tables. This just doesn't work for many reasons. It very quickly falls apart due to the object relational mismatch. I don't know how Rails works, but I can't imagine this ever working right. The only way is to do validation in the business layer of the application. Doing it in the views, like rest framework or form cleans is even worse.
By the way, I was running my startup on 17 physical machines on Hetzner, so I'm not talking from marketing but from experience.
For us we separate validations in two. Business and Data validations, which are generally defined as:
- Business: The Invoice in Country X needs to ensure Y and Z taxes are applied at Billing T+3 days, otherwise throw an error.
- Data Validation: The company's currency must match the country it operates in.
Business validations and logic always go inside services, whereas data validations are on the model. Data validations apply to 100% of all inserts. Once there's an IF statement segmenting a group it becomes business validation.
I could see an argument as to why the above is bad because sometimes it's a qualitative decision. Once in a while the lines get blurry, a data validation becomes _slightly_ too complex and an argument ensues as to whether it's data vs business logic.
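A minimal sketch of that split, with hypothetical names throughout (Company, InvoiceService, the COUNTRY_CURRENCY table and the tax rule are invented for illustration, not taken from a real codebase). The model carries the check that holds for every row; the service carries the rule that branches on country.

```python
from dataclasses import dataclass

# Hypothetical lookup tables, for illustration only.
COUNTRY_CURRENCY = {"DE": "EUR", "US": "USD"}
REQUIRED_TAXES = {"DE": {"VAT"}}


class DataValidationError(Exception):
    pass


class BusinessRuleError(Exception):
    pass


@dataclass
class Company:
    country: str
    currency: str

    def validate(self):
        # Data validation: applies to 100% of inserts, no segment-specific IF.
        if COUNTRY_CURRENCY.get(self.country) != self.currency:
            raise DataValidationError(
                f"currency {self.currency} does not match country {self.country}"
            )


class InvoiceService:
    def bill(self, company, taxes):
        company.validate()
        # Business validation: the rule depends on which country we are in.
        missing = REQUIRED_TAXES.get(company.country, set()) - set(taxes)
        if missing:
            raise BusinessRuleError(f"missing taxes: {sorted(missing)}")
        return {"country": company.country, "taxes": sorted(taxes)}


service = InvoiceService()
ok = service.bill(Company("DE", "EUR"), {"VAT"})
```

The blurry cases mentioned next are the ones where a check like the currency one grows a per-country exception; at that point, by the definition above, it has become a business rule and moves into the service.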
Our team really adheres to services and not fat models, sorry DHH.
To me, it's all so controversial that whatever you pick will work out just fine - just stick to it and don't get lazy about it.
The ultimate I think is Domain-Driven Design (or Clean Architecture). This gives you a true core domain model that isn't constrained by frameworks etc. It's as powerful as it can be in whatever language you use (which in the case of Python is very powerful indeed). Some people have tried to get it to work with Django but it fights against you. It's probably more up front work as you won't get things like Django admin, but unless you really, truly are doing CRUD, then admin shouldn't be considered a good thing (it's like doing updates directly on the database, undermining any semblance of business rules).