It seems lost on the authors that yes that might work for some folks just fine, but others really do want the Land Rover and all its additional baked in features beyond getting you from A to B.
Your comparisons are similar to many others out there that focus on measuring basic CPU and memory. This type of comparison, where AWS/Azure/GCP is treated as a "dumb" datacenter, is easy for alternatives like Hetzner or self-hosting to "win".
>Do you really need the advanced features of AWS and Azure right now? Or would a simple virtual machine at a reasonable price be sufficient? [...] There’s a growing movement among tech companies and startups to opt for more cost-effective hosting solutions like Hetzner. The high costs associated with AWS and Azure
Many (most?) YC startups are not using AWS as a low-level dumb data center with blank EC2 virtual machines and installing infrastructure software like Linux and PostgreSQL on it. Instead, they are using higher-level AWS managed services such as DynamoDB, Kinesis, SQS, etc.
Therefore, the more difficult comparison (that almost no blog post ever does) is the startup's costs for its employees to re-create/re-invent the set of higher-level AWS services that they need.
Sure, there's the "but you don't need to pay expensive AWS costs for DynamoDB when one can just install open-source Cassandra at Hetzner; and instead of AWS Kinesis, install your own Kafka, etc". Well, you add up more and more of those "just install and manage your own X,Y,Zs" and you can end up crossing the threshold where paying AWS cloud fees costs less than your staff maintaining it. The threshold for AWS isn't just massive scale of 100+ million users. The threshold can be the complexity and scope of higher-level services you need the cloud to take care of on your behalf so your small team can concentrate on the aspects of the business that are true differentiators. In other words, instead of employees installing Cassandra, they're adding features to the smartphone app.
If your company doesn't need any of the Big 3 clouds' higher-level platform services, it's easier to save money with alternatives.
As soon as your startup does get big, it starts to make more sense to try and migrate to 'dumb' machines and save on infrastructure costs, especially if your business is low margin and your infrastructure costs are high.
And adding one dev/engineer is _massively_ more expensive, so you seldom want to scale in that axis when the option is to, say, use a managed database or even a complete data pipeline.
If you have a good understanding of load up front, however, those are probably non-issues.
E.g., if you're running K8s (one thing I typically recommend you buy a managed one of), you can install your own Kafka in it, using an operator that does about 85% of what MSK does.
Sure, you'll need to dedicate person hours to support the operator, but is supporting that any more expensive than supporting AWS products? That you're already paying through the nose for?
If you are bootstrapping a CRUD app business then 1 beefy Hetzner box (or something slightly more reliable) with PostgreSQL is probably fine until you reach the scale where you sell the business. You care about burn rate above all.
If you are VC backed, go all in on GCP or AWS because that's what you're expected to do and what the expensive people you hire are going to know.
Same with RDS, etc.
It’s pretty great not to waste time when your number comes up in the lottery for the most bizarre of 0.000001% issues.
The operator only solves the happy path. An AWS support ticket usually can solve the unhappy path.
For high-scale operations, you need to think real hard about how you do things and usually simplicity is key, and trying to do as little as possible on the high throughput parts is useful.
The costs do add up when you have professionals maintaining your Cassandra/Kafka boxes, but the same degree of complexity exists on AWS, when you try to weave together a tapestry of EC2s, lambdas, various storage services, with all the delicious complexity of multiple VPCs and networking fineries while not blowing the budget.
It's a different skillset, but not less work.
Even storage in hyperscalers is inherently redundant—and I keep getting folk who ask about setting up their own RAID array, or using their own containers and job management when there’s a dozen zero-code alternatives in each individual hyperscaler.
Part of me thinks, man, the engineers not afraid of setting up a Postgres or Redis really should be worth a lot more, given how absurd the prices can get. I guess the getting started costs for these services are usually manageable though; by the time the bill is big it's a "nice problem to have" because you have significant load now, and presumably customers & revenue to show for it.
More so, I think orgs are somewhat rightfully afraid of running infra because historically we have been bad at it. It's been every sys-op or devops for themselves in the world. Everyone making their own practices, assembling their own stack of networking setup, init scripts, db procedures, monitoring, alerting, resilience/reliability. This stuff has a lot of dimensions of care to it.
And even when you go the extra mile to document everything, it's still rough to hand off ownership. A new gal joins; how long does it take to get comfortable? And how much will her style & preferences mesh with what's been strung up so far? Or worse, what happens when someone quits? How load bearing were they?
And this is why I'm so humongously excited about Kubernetes. Fleet was pretty sweet & cool & direct in the past, RIP, but like so many of the "way to run containers" options it was just that: a way to run containers. Having an extensible system, where operators keep networking, storage, databases running, where tasks like backups and migrations and high availability are built in to well tested controllers: it cuts out so so so many things that operators had to discover, socialize, and test test test test test test before. There are such incredibly good load bearing systems-that-maintain-systems (i.e. autonomic) available, that compete very much with the paid for/managed services that have done likewise for us for so long.
And it's a consistent paradigm, for whatever you are up to. Write a manifest with what you want, send it to api-server, wait for operator to make it so. Instead of having different dimensions or concerns have different operational paradigms & styles, there's a unified extensible Desired State Management that does a damn good job.
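To make that loop concrete, here's a toy sketch in plain Python. It is not real Kubernetes code: the `Cluster` class, the `reconcile` function and the service names are all made up for illustration, and a real operator watches the API server rather than being called by hand.

```python
# Toy desired-state reconciliation loop (illustrative only, not Kubernetes code).
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical in-memory stand-in for real infrastructure: name -> replica count.
    running: dict = field(default_factory=dict)


def reconcile(desired: dict, cluster: Cluster) -> list:
    """Compare desired and actual state, apply the difference, return the actions taken."""
    actions = []
    # Create or resize anything that does not match the manifest.
    for name, replicas in desired.items():
        current = cluster.running.get(name, 0)
        if current != replicas:
            actions.append(f"scale {name}: {current} -> {replicas}")
            cluster.running[name] = replicas
    # Remove anything that is running but no longer declared.
    for name in list(cluster.running):
        if name not in desired:
            actions.append(f"delete {name}")
            del cluster.running[name]
    return actions


manifest = {"postgres": 3, "kafka": 3}                        # what we want
cluster = Cluster(running={"postgres": 1, "legacy-cron": 1})  # what exists

first_pass = reconcile(manifest, cluster)
second_pass = reconcile(manifest, cluster)  # already converged, so nothing to do
print(first_pass)
print(second_pass)
```

The point of the pattern is that the second pass is a no-op: you declare the end state once and the loop keeps nudging reality toward it, whatever the starting point was.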
It felt like running services was in the dark ages for so long, that each shop was fractured & alone with their infrastructure, and it was obvious why managed services were winning. But today there's a hope that we can run services, well, in a way that will be very clear & explicit if it ever needs to be handed off.
But only if they agree to be on call 24/7 to support what they deployed. Ask engineers to guarantee you won’t lose data and see how they tell you to buy RDS.
To add, if you ever want to get ISO/PCI DSS etc. certification done then good luck implementing the gazillion checklist items which Azure/AWS/GCP have already taken care of.
They will have object storage soon, but don't hold your breath for one-click Kubernetes etc. So the fancier your infrastructure, the more your startup would need to invest in time and money to use Hetzner and thus make it "not worth it".
There is also a GPT that you can use that will generate the module block for you based on your requirements.
For example, instead of the ancient F8 series used in the article, a modern D8as_v5 Azure instance under a 3-year Savings Plan is $115/mo.
Also, the article compares CPX41 to EC2 and Azure VMs with dedicated cores, not shared cores. The CCX33 Hetzner model is closer to the normal clouds, and costs $50/mo, so now we're at 2x the price instead of 10x the price. (Conversely, the B8als_v2 size uses shared cores and is also 2x the price of CPX41 at $74/mo)
For that 2x cost you get a lot more features, first-party and third-party support, more locations, faster networking, etc... That's worth it for most large enterprises that care about ticking checkboxes on audit reports more than absolute cost. Or to put it this way: the annual price difference is just $600, which is the same cost to an org as half a day of engineer-time or less. If Hetzner is the slightest bit more difficult than a large public cloud VM for anything, ever, then it's not cheaper. This could be patching, maintenance, migrations, backup, recovery, automation, encryption, or just about anything else.
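Back-of-envelope version of that argument, as a rough sketch. The two monthly prices are the ones quoted above; the $150/h engineer cost is an assumption (roughly what "half a day for $600" implies), so plug in your own numbers.

```python
# Break-even sketch: how many engineer-hours per year erase the hosting savings?

def annual_gap(expensive_per_month: float, cheap_per_month: float) -> float:
    """Yearly price difference between two monthly list prices."""
    return (expensive_per_month - cheap_per_month) * 12


def break_even_hours(gap_per_year: float, engineer_hourly_cost: float) -> float:
    """Hours of extra engineering work per year that cost as much as the gap."""
    return gap_per_year / engineer_hourly_cost


AZURE_D8AS_V5 = 115.0    # $/mo, figure quoted in the comment
HETZNER_CCX33 = 50.0     # $/mo, figure quoted in the comment
ENGINEER_HOURLY = 150.0  # $/h, ASSUMED fully loaded cost; adjust for your org

gap = annual_gap(AZURE_D8AS_V5, HETZNER_CCX33)
hours = break_even_hours(gap, ENGINEER_HOURLY)
print(f"annual gap: ${gap:.0f}, break-even: {hours:.1f} engineer-hours/year")
```

With these inputs the gap comes out at $780/yr rather than $600, which doesn't change the conclusion: about five engineer-hours a year of extra hassle wipes out the saving.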
There are other differences as well. Hetzner has a separate charge for load balancers and IP addresses, whereas with Azure they're included in the price of the VM.
The biggest cost difference is that the public clouds charge eyewatering amounts for Internet egress traffic. Azure is about 100x as expensive as Hetzner, which is just crazy.
Both of these scale to zero and offer 180k vCPU-seconds free per month and 360k GB-seconds free per month. You incur billing only against the active execution time. Cloud Run Jobs has a whole separate free monthly grant as well.
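For a sense of scale, here's a rough sketch of what those grants cover. The container size (1 vCPU, 0.5 GB) is an assumption, and this ignores per-request charges and anything else that might show up on the bill.

```python
# How many hours of active execution fit inside the free monthly grant?
FREE_VCPU_SECONDS = 180_000  # figure from the comment above
FREE_GB_SECONDS = 360_000    # figure from the comment above


def free_active_hours(vcpus: float, memory_gb: float) -> float:
    """Hours of billed execution covered by the grant; the tighter limit wins."""
    cpu_limited_seconds = FREE_VCPU_SECONDS / vcpus
    memory_limited_seconds = FREE_GB_SECONDS / memory_gb
    return min(cpu_limited_seconds, memory_limited_seconds) / 3600


# ASSUMED container shape: 1 vCPU and 0.5 GB of memory.
covered_hours = free_active_hours(vcpus=1, memory_gb=0.5)
print(f"{covered_hours:.0f} hours of active execution per month")
```

That is about 50 hours of active execution a month for that container size, which goes a long way when you only pay while requests are actually being handled.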
You can run A LOT for free within those constraints. Certainly a blog or website. To prevent cold starts, just set up Cloud Scheduler (also free for this purpose) to ping the container every few minutes.
Use Supabase for a DB or one of the serverless options (if it works for your data use case) like Firestore, CosmosDB and you can run workloads for a few cents per month with an architecture that will scale easily if you need it to.
6 min video showing the receipts and how easy this is: https://youtu.be/GlnEm7JyvyY
YMMV but all costs aren't instance costs.
And they're not just salespeople: they've actually told us multiple times when a feature wouldn't work for us unless we held it wrong in a dangerous (and expensive) way.
Can you give examples of this? I'd love to hear more about the kinds of guidance they can give.
This is one of the more important points and why the point "The learning curve of a single server isn't so big, especially when compared to AWS" is sitting a bit wrong with me.
Sure, if you talk about 1 VM, I agree. And I wouldn't second guess doing this, at all. It would be my initial plan as well as long as I don't have to make any strong availability guarantees. And for this use case, I'd call AWS a bad choice. It's not a simple VM provider.
But once you start running e.g. a redundant postgres cluster for updates without downtime, the amount of stuff to know also grows, a lot. Suddenly you also need backups, tests of backups. And this is where AWS/the cloud allows you to save time, and treadmill time.
Would probably give them way more budget in actually building applications than running the infrastructure.
Maybe I'll extend the article to include the point of using a managed postgres at AWS / Azure / fly.io, whatever, in combination with Hetzner VMs.
The pricing is more on par with Digital Ocean/Linode.
Maybe, just maybe, I want to use LVM or something entirely unknown to them. Not necessarily in a privacy sense, but control.
If you're looking for a cheap one-off server, the server auction has some very good deals.
[0] Full details at https://blog.searchmysite.net/posts/migrating-off-aws-has-re...
Take the recent Lichess downtime, for example. Their main server had a hardware issue that required physical intervention. This meant the site was down for over 10 hours, and there wasn't much they could do except wait for OVH to send a tech.
If Lichess had been on AWS, the provider would have automatically moved their workload to a functioning server, and the outage would have been much shorter or possibly avoided altogether.
For Lichess, a non-profit, this tradeoff still makes sense. Their service, while important to its users, isn't critical. Nobody dies if Lichess is down and the cost savings help them keep running. But if your business can't afford downtime, the extra guarantees from a public cloud provider can definitely be worth paying for.
They're leaving other things on AWS, i.e. partial migration is quite doable.
Hetzner starts at 50 Euro, only has servers in Europe and is going to require a ton more work.
AWS has the right idea, they give everyone who asks nicely thousands in free credits to get started. Then 2 years in you're hooked. I don't want to learn a new system.
It will take slightly more effort than Lightsail, yes.
I still don't think I feel like migrating though. Captain Rover isn't exactly lightweight.
I have only stumbled on one service that does it. It's a Datadog alternative, so the bar is not that high for pricing.
Even with automation tools like Ansible or immutable server images, packaging as Docker images and running on a container orchestrator has always been much easier.
Hetzner doesn't have the services AWS provides, which is the reason most companies I know use AWS.
If we could run our crap on any server, we would, but managed services are still cost-effective vs hiring our own 24/7/365 rotation of on-call ops people.
They have the skills, cash flow, and resources to do whatever they want.
Yeah if people had less shaky stacks. But it is always easier to pay someone to run the hack.
"In the beginning" the clouds promised to use their scale to soak up your unpredictable demand. You as the customer didn't have to think about capacity, or planning ahead, budgets, opex, etc... Just swipe your credit card and go from zero to any number you please and back again at any time of your choosing. Because there are so many other customers using the cloud with you, the unpredictable nature of your individual usage is averaged out and the cloud vendor gets a (slightly) noisy but manageable usage level of their resources. They have to work a little harder to predict future capacity needs, but you pay a premium for this.
"A little later" the MBAs realise that they can squeeze 5% more profit out of their customers with lock-in contracts that make everything "nice and predictable" instead of the stochastic noise they had to "deal with" before. Getting rid of that makes things a lot harder for you as the customer, but they don't care. They care about that 5%.
Ta-da... we're back at having to "procure", we're back at budgets that have to be planned for 3 years in advance, we're back at having to have time machines.
PaaS services or even VM scaling sets with volatile instances can still be stupefyingly cheaper, but that point is really hard to make to architecture astronauts.
> They’re conceptually simple, but you soon realize that you need at least a couple of 24/7 always on boxes and that you only really should use Cloud Run-like services for burstable workloads.
This is simply not true and Cloud Run-like services offer an easy path for progressive scaling.
1. You can scale it to 0 at the outset as you build your app
2. You can set it to scale to a minimum of n instances (e.g. minimum 1, 2) to have fast response times
3. If you find a need for a 24x7 instance, take the same container image and you can launch a Compute Engine instance with the container directly and scale that way.
4. If you need more control beyond that, move those containers into GKE Autopilot or full GKE or your container orchestrator of choice.
Not only is it easy and free to get started, it provides a straightforward path to adapting the underlying deployment and compute model based on needs as the app scales without the need to pay anything until you actually need 24x7 compute (and even then, it's a matter of setting your Cloud Run service to min=1 instances to get 24x7 compute or configuring a CE instance with the same exact container).
Most people think it is easier to use EC2 than Fargate since the first is the most famous one. But actually, it is the other way around!
If you're not an HN person with sysadmin skills, yes. But it is NOT that hard to have an in-house RAID HD setup with a failover server, or a failover NAT gateway. AWS and cloud providers are just a rip-off.
Lichess admins are highly skilled and I'm sure they already have a well designed infrastructure. You can see what they use at https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSk...
The issue was with network equipment that they didn't even manage. You can't load balance when your core network is down. There was nothing they could do as I understand it.
More details at: https://lichess.org/@/Lichess/blog/post-mortem-of-our-longes...
OP's comment is valid - physical servers might incur downtime.
But I do agree with your sentiment. "Downtime" is not an argument which should tilt the discussion towards either physical servers or the cloud. AWS data centers famously also have outages, while physical servers often have uptimes of multiple years. So what's better? It's hard to tell, but at the very least, none of these solutions is downtime-free.
You can almost certainly fit all your business logic into one or two App Engine apps, and fit all your data into one database. While you have just a few programmers, the fact they're all sharing a process with each other won't matter.
The goal is working product and paying customers ASAP, not a nicely architected microservices backend 2 years from now.
Yes, it'll end up being a mess when the company has pivoted and changed directions a bunch of times, and when you finally come to get to 50M users+ scale you'll probably have to rewrite from scratch. But by then, you ought to be rewriting from scratch, because you won't know the true requirements till you get to that scale.
BUT, it's _very_ highly geared towards fully-online games where everyone playing the game is connected to a server all the time. Our game was asynchronous PvP where the attacker was online, but the defender wasn't.
I had a 30 minute chat with them and they confirmed that it _could_ be made to work, but it'd be extremely janky and expensive in our use case.
We ended up building our own (or actually expanding our existing setup a bit).
--
We've also got pretty good estimates on how much something might cost: we have an application that needs a specific number of writes/reads with X amount of data per write, what would be the cost on different Amazon services.
Again they came back with numbers and with many services (DynamoDB especially) it would've been either impossible or prohibitively expensive. We ended up changing the application structure to do less IOPS + more aggressive caching and ended up using plain S3 as storage.
Without their consultation (And inside knowledge about AWS internal hard limits) we would've spent weeks building a solution that will eventually fail as the data stored per user goes up.
[0] https://docs.aws.amazon.com/gamelift/latest/flexmatchguide/m...
Plus Xeons, ofc
If I was looking to scale up an existing operation considerably and minimize costs as much as possible, I'd consider spinning up e.g. a Postgres cluster or minio on their infra, which would be significantly cheaper than RDS or S3. But it's not something that I would gladly do---the storage deals provided by hyperscalers are quite reasonable, as you say.
I have been running fault-tolerant systems spread across multiple dedicated servers (inside system with multiple DB/KV stores distributed/replicated/sharded, Kafka etc). If one server experiences hardware failure, the system will automatically recover within seconds to minutes (depending on which server/part of service failed) without any data loss.
It's not that hard. You need the knowledge, but it's not rocket science.
I've seen people shoehorning all sorts of batch processing into HTTP (backed by queues or not) and it has tremendous overhead over just having the cores and RAM there in the same place as the data.
I learned that lesson with Google Apps, and never designed anything to rely solely on HTTP ever again.
More realistically, I've found that the cost is between 3x and 7x what people were paying before.
I'm not surprised cloud adoption has slowed to a crawl. Azure and AWS won't admit this publicly because it would tank their share price, but they can't hide it from observant people. For example, they used to get the latest Intel or AMD CPUs before retail availability in huge numbers. Now? They're 2-3 generations behind because they're not rolling out new servers in significant numbers. The customers are all tightening their belts because of the global economic downturn, and one of the most expensive things they've been splurging on before was public cloud hosting.
However, the image remains "warm" and incurs zero cost once the last request ends. So I usually have a `/heartbeat` endpoint for this purpose and point a Cloud Scheduler job at it.
I haven't read the docs to figure out the exact heuristics of when it becomes "cold" again.
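A minimal sketch of such a `/heartbeat` endpoint, using only the Python standard library. The handler and function names are illustrative, and the self-check at the bottom exists only so the example can be run locally. The one platform assumption is that the container listens on the port passed in the `PORT` environment variable.

```python
# Minimal /heartbeat endpoint for a scheduler to ping (stdlib only).
import http.client
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HeartbeatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer the keep-warm ping cheaply; everything else is a 404 here.
        if self.path == "/heartbeat":
            code, body = 200, b"ok"
        else:
            code, body = 404, b"not found"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def make_server(host: str, port: int) -> ThreadingHTTPServer:
    return ThreadingHTTPServer((host, port), HeartbeatHandler)


def main() -> None:
    # What a container entrypoint would call: listen on $PORT (default 8080).
    make_server("0.0.0.0", int(os.environ.get("PORT", "8080"))).serve_forever()


# Local self-check: bind an ephemeral port, hit /heartbeat once, shut down.
server = make_server("127.0.0.1", 0)
threading.Thread(target=server.serve_forever, daemon=True).start()
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", "/heartbeat")
response = conn.getresponse()
status, payload = response.status, response.read()
conn.close()
server.shutdown()
server.server_close()
print(status, payload)
```

In a real deployment you would call `main()` instead of the self-check and point the scheduler job at the service URL plus `/heartbeat`.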
These also run on VMs.
> database servers
These also run on VMs.
> DNS
This is such a tiny cost that it’s not worth mentioning at any scale.
> Backups
This can go any number of ways, with price tags all over the place, yes.
> Access management
There are plenty of free and paid solutions available.
> Monitoring
See Backups.
There's almost no testing and validation needed for something like AWS RDS Postgres backups. Occasionally you restore an instance and that's it.
Other things like Postgres out-of-disk-space is a 10 min fix on AWS to increment the assigned space. If your VM provider is offering SAN/NAS you may be in luck, otherwise hopefully you have a ballast file or some logs you can delete to free enough space to get things running long enough to fix the problem.
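For anyone who hasn't met the ballast trick: you park a big throwaway file on the data volume ahead of time and delete it when the disk fills up, which buys breathing room. A small sketch follows; the function names and the 4 MiB demo size are made up for illustration, and a real ballast would be gigabytes on the database volume.

```python
# Ballast file sketch: reserve disk space up front, release it in an emergency.
import os
import tempfile


def create_ballast(path: str, size_bytes: int, chunk: int = 1024 * 1024) -> None:
    """Write real zero bytes so the space is allocated rather than sparse.

    Caveat: on filesystems that compress or skip zero blocks this may reserve
    little or nothing; writing random bytes would be the workaround there.
    """
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def release_ballast(path: str) -> int:
    """Delete the ballast and return how many bytes were freed."""
    freed = os.path.getsize(path)
    os.remove(path)
    return freed


# Demo in a temp dir with a tiny file.
demo_dir = tempfile.mkdtemp()
ballast_path = os.path.join(demo_dir, "ballast.bin")
create_ballast(ballast_path, 4 * 1024 * 1024)
freed_bytes = release_ballast(ballast_path)
os.rmdir(demo_dir)
print(f"freed {freed_bytes} bytes")
```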
I promise you, running a DB in RDS is almost as difficult as running it on metal or a VM, except that when things go wrong, you don’t have as much insight into why.
Running a Linux server is not hard.
I mentioned monitoring being required and having a wide variety of options and price points.
If your DB runs out of space in a VM that’s on you for not having adequate monitoring and alerting. If you don’t have networked storage available and somehow fail to see the predicted eventual out of disk, again, that’s on you.
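The kind of check meant here doesn't have to be fancy. A rough sketch, with assumed thresholds (80% and 90%) and a naive linear projection; any real monitoring stack does this better, but it shows how little is needed to see the problem coming:

```python
# Disk-space alerting sketch: current usage level plus a naive days-until-full estimate.
import shutil


def classify(used_fraction: float, warn_at: float = 0.80, crit_at: float = 0.90) -> str:
    """Map a usage fraction to an alert level (thresholds are assumptions)."""
    if used_fraction >= crit_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def disk_alert(path: str) -> str:
    """Alert level for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify(usage.used / usage.total)


def days_until_full(used_then: float, used_now: float, total: float, days_between: float) -> float:
    """Linear extrapolation from two samples; infinite if usage is not growing."""
    growth_per_day = (used_now - used_then) / days_between
    if growth_per_day <= 0:
        return float("inf")
    return (total - used_now) / growth_per_day


print(disk_alert("."))
# Hypothetical samples: 400 GB used ten days ago, 500 GB now, 1000 GB volume.
projected_days = days_until_full(400, 500, 1000, 10)
print(f"full in about {projected_days:.0f} days")
```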
For free.
Yep, if your Kafka is mission critical and crashes hard, that is bad.
But things like Kafka are _never_ a black box you just spin up and never worry about, if anyone thinks so, CAP theorem will give them an awful surprise one day.
You're always going to need someone in your team who understands the tech and how to make best use of it.
MSK won't tell you how many partitions your topic needs, or whether your retention strategy should be delete, or compact, or both.
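As an example of the knowledge that stays with you either way, here's a sketch of a commonly cited rule of thumb for partition counts: take the target throughput and divide by what a single partition can sustain on the producer side and on the consumer side, then use the larger result. The throughput numbers below are invented; you'd have to measure your own.

```python
# Rule-of-thumb partition sizing (a starting point, not a guarantee).
import math


def partitions_needed(target_mb_s: float,
                      producer_mb_s_per_partition: float,
                      consumer_mb_s_per_partition: float) -> int:
    """Enough partitions that neither producers nor consumers are the bottleneck."""
    by_producer = target_mb_s / producer_mb_s_per_partition
    by_consumer = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(by_producer, by_consumer))


# ASSUMED measurements: 100 MB/s target, 10 MB/s per partition when producing,
# 20 MB/s per partition when consuming.
partition_count = partitions_needed(100, 10, 20)
print(partition_count)
```

Nothing about this changes between a managed cluster and one you run yourself: somebody on the team still has to measure the per-partition numbers and pick the count.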
You still need that knowledge of the "managed" service to make effective use of it.
And that knowledge sits rather close to knowledge of how the system works, so given you'll need that knowledge anyway, may as well cultivate it instead.
Oh, and the operators also solve a lot of the unhappy paths too, FYI.
I tend to describe the operator approach as "half-managed" because things like multiple-AZ stretch clusters need some configuration.
But then, maybe you didn't want a 3-AZ cluster? Maybe a 2.5? MSK says no.
…
> And that knowledge sits rather close to knowledge of how the system works, so given you'll need that knowledge anyway, may as well cultivate it instead.
This has been my argument forever, and it’s always met with disagreement, because entirely too many people have no desire to learn their tooling. They just want an API that they can push data into, and get it back out. What happens inside is irrelevant.
It’s extremely sad to me.
At some point, we have to decide that there are a lot of knowledge expectations depending on your stack, especially as parts of your application grow.
Say you're a Python-based webapp running with Postgres, Kafka, and Elasticsearch. Your stack requires pretty decent knowledge of:
1. Postgres
2. Kafka
3. Elasticache
4. Linux (and a lot more than what many developers I've encountered seem to have)
5. Kubernetes, because it is 2024
6. Whatever frameworks you're doing with your webapp + ensuring you're keeping up with security best practices
7. + the soup involved with exposing your webapp to customers
Being able to handle any of these 6 at scale requires a different skillset. It's unreasonable to expect anyone to be an expert at all of this -- in a real, tried-and-true environment -- especially with deadlines and SLAs involved.
Relying on volunteer support of varying degrees of quality for your business sounds insane.
Also at that point the business should really be donating or contributing to the development of the software otherwise it is considered what we call a dick move.
> Relying on volunteer support of varying degrees of quality for your business sounds insane.
Given my experiences of Confluent paid support, and my experiences of the volunteer support around Kafka, I disagree.
Not sure we agree on the meaning of this phrase in this context.
For 0 money. That kinda free.
I’d rather focus on my expertise and mental energy in other tools that are much more significant to the stack I support.
For big flagship services you can usually get pretty good support (EC2, S3, SQS, Lambda)
For smaller/more niche services where AWS stood up a managed version of some OSS it's more hit and miss (like managed RabbitMQ).
In both cases, it definitely helps to have an open line to your TAM and send them case numbers and they'll usually do some internal nudging to keep things moving. In addition, for projects, you can usually reach out ahead of time and get some dedicated SMEs to help set things up/train you.
In either case, hopefully you've never had the displeasure of working with Azure support.
They usually tend to be genuinely helpful but are a far cry from solving your issues themselves.
Of course there’s a minuscule possibility of you having a new use case. But is that good enough reason to build your infrastructure? That is a business call you need to make.
Until you’re at quite a high scale, you probably don’t actually need Kafka. There are plenty of much lighter ways to do pub/sub, including Postgres itself.
Similarly, if your RDBMS schema is properly defined and your queries are well-written, you probably also don’t need Redis / EC.
Re: K8s, if you do need it, I’m not sure why people think that it’s so much easier to run EKS than your own cluster. The only thing you get to skip is the control plane; everything else is still your responsibility. Same with Postgres – you still are wholly responsible for its schema/table maintenance and optimization on major DBaaS.
In any case, nowhere did I say one person should be an expert at all of this.
As someone who accidentally specialised in Kafka... ...bingo.
So many companies using it who don't need the sheer scale it offers, and get to pay the complexity cost anyway, with no benefits.
But you know as well as I do that what the other participant in this conversation means is that if a for-profit entity relies on support that is "free of charge" in this way, such that it can continue to profit on the back of their product support, then the for-profit entity really ought to seriously consider a voluntary donation of some kind to support the continued maintenance and support of the product.