Serverless: A lesson learned the hard way (sourcebox.be)
But if you can ignore that, you can probably also ignore the fact that your code runs on a server.
...until you receive the $206 bill for the work done by the server.
No, dear author. Setting up the AWS billing alarm was the smartest thing you ever did. It probably saved you tens of thousands of dollars (or at least the headache associated with fighting Amazon over the bill).
Developers make mistakes. It's part of the job. It's not unusual or bad in any way. A bad developer is one who denies that fact and fails to prepare for it. A great developer is one like the author.
So I guess the question is, with a mistake like this, is it better to be charged hundreds or thousands of dollars, or to have your service degrade or go offline until you can fix it?
We dynamically create and instantiate new servers based on load, provided the load is sustained for a while. Once a server is up, it's added to the load balancer. Once its load goes down, it's spun down after it has spent some time idle (it costs to instantiate, so we might as well keep it outside of the queue for a bit before completely removing it).
This all runs automatically. If we don't limit it, it's on us.
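A minimal sketch of that kind of policy, to make the "if we don't limit it, it's on us" point concrete. The class name, thresholds, and tick counts are all made up for illustration and are not the commenter's actual setup.

```python
from dataclasses import dataclass


@dataclass
class Autoscaler:
    """Toy policy: add a server on sustained load, remove one after idling."""

    max_servers: int = 5     # hard cap, the limit that keeps the bill bounded
    high_load: float = 0.8   # utilisation that counts as "loaded" (assumed)
    low_load: float = 0.2    # utilisation that counts as "idle" (assumed)
    sustain_ticks: int = 3   # load must stay high this long before scaling up
    idle_ticks: int = 5      # a server may idle this long before removal
    servers: int = 1
    _hot: int = 0            # consecutive high-load ticks seen so far
    _idle: int = 0           # consecutive low-load ticks seen so far

    def tick(self, load: float) -> int:
        """Feed one load sample and return the resulting server count."""
        if load >= self.high_load:
            self._hot += 1
            self._idle = 0
        elif load <= self.low_load:
            self._idle += 1
            self._hot = 0
        else:
            self._hot = 0
            self._idle = 0

        if self._hot >= self.sustain_ticks and self.servers < self.max_servers:
            self.servers += 1   # sustained load: bring one more server up
            self._hot = 0
        elif self._idle >= self.idle_ticks and self.servers > 1:
            self.servers -= 1   # idle long enough: spin one down
            self._idle = 0
        return self.servers
```

Without `max_servers`, a bug that generates its own load would keep adding servers for as long as it runs.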
How is this not a problem with how he managed it?
> This is probably the most stupid thing I ever did. One missing return; ended up costing me $206.
He clearly mentioned it's his error there.
If the degraded or offline system is used by people, and these people cannot work, the cost can be a lot higher. For example, 10 people not able to work could cost something in the range of $250-$750 per hour.
Moreover, if customers are lost due to this degradation of service and CAC is high, then clearly the cheapest thing is a high bill from AWS, which probably is also capped by Amazon (and handled as an alert by Amazon).
So yeah, let's blame the developer, but let's not play like mistakes don't happen and they're not costly in the "serverless" world.
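As a rough worked example of the downtime figure quoted above (10 people, $250-$750 per hour, which implies $25-$75 per person-hour), here is how long an outage would have to last before it costs more than the $206 bill. The function names are illustrative only.

```python
def downtime_cost(people: int, hourly_rate: float, hours: float) -> float:
    """Cost of `people` being unable to work for `hours`."""
    return people * hourly_rate * hours


def break_even_hours(bill: float, people: int, hourly_rate: float) -> float:
    """Outage length at which lost work costs as much as the bill."""
    return bill / (people * hourly_rate)


# The $206 bill against the low and high ends of the quoted range.
low = break_even_hours(206, people=10, hourly_rate=25)   # about 0.82 hours
high = break_even_hours(206, people=10, hourly_rate=75)  # about 0.27 hours
print(f"break-even outage: {high * 60:.0f} to {low * 60:.0f} minutes")
```

Under those assumptions, an outage of somewhere between a quarter of an hour and an hour already costs as much as the bill did.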
It’s easy to burn tens or hundreds of thousands ‘accidentally’ on “server”, easier than on serverless.
If you’re spending real money, you should have an account team. Talk to them if such a problem happens.
It's not a "problem" with server/serverless, of course, but with no-scaling-by-default vs. unlimited-scaling-by-default (which is imo the better way to split the server/serverless topic): one of them is going to cost more when things get thrown for a loop.
Without autoscaling, you would just have a queue that grows until the machine runs out of disk space. Either way, this was a problem with code and not event based scaling.
Never use a pay-per-use service that does not include a reasonable "turn off after $X" feature and appropriate warnings. Also, never use such services without being sure to configure such settings.
I like to think of this as a self-inflicted "DDOC" attack: Distributed Denial of Capital.
Best not to leave yourself exposed.
With a regular server you could go viral and your server dies, so you also lose without bound, in lost business/goodwill/whatever.
Also need to take into account the time/effort spent on making the regular server scale, albeit this is also a relatively flat rate.
People still play with fire. Limit your losses: go with DigitalOcean or something for $5/mo flat, no matter what.
I think the author meant this less as "playing with fire" and more as experimenting with new tech. But yes, I agree that for personal sites, running on your own money, you probably want to stick with something safer like the $5/mo DigitalOcean box.
I'm surprised people still talk about cloud services as being cheaper esp where developers are free to use what they want.
Idea: use API Gateway to configure a quota to match your budget projections. That will force a hard stop. Would be nice if AWS made this easier.
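A sketch of how that could look, assuming a hypothetical all-in price per million requests that you would have to work out from your own bill. It only builds the arguments for API Gateway's `create_usage_plan` call; the plan name is a placeholder.

```python
def usage_plan_params(monthly_budget_usd, cost_per_million_usd, name="budget-cap"):
    """Turn a monthly budget into a request quota for an API Gateway usage plan.

    `cost_per_million_usd` is an assumed all-in cost (gateway + Lambda + ...)
    per million requests; it is not a published AWS price.
    """
    # Largest whole number of requests that still fits inside the budget.
    quota = int(monthly_budget_usd * 1_000_000 // cost_per_million_usd)
    return {
        "name": name,
        "quota": {"limit": quota, "period": "MONTH"},
    }


params = usage_plan_params(monthly_budget_usd=10, cost_per_million_usd=5)
# With AWS credentials configured, this would be passed on as:
#   boto3.client("apigateway").create_usage_plan(**params)
```

Usage plan quotas are enforced per API key, so this only acts as a hard stop for traffic that arrives with a key, and it does nothing for Lambda invocations triggered from other sources.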
My main challenge with serverless is using Lambda with API Gateway. Lambda has no database connection pooling, so I end up with a ridiculous number of connections to RDS - one for each simultaneous user. I haven't found a solution to this yet, other than not using API Gateway.
All a developer needs to do immediately after adding a credit card to AWS/Azure/GCP would be to create an IAM role with permission to automatically add and track fine-grained billing alarms and notify via email/sms for any potential billing overages.
I think a $60/yr service like this would be useful to protect against future events of bill shock.
https://github.com/Teevity/ice https://billgist.com/ http://cloudcheckr.com/
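For the AWS side, a minimal sketch of the fine-grained billing alarms described above. It only builds the arguments for CloudWatch's `put_metric_alarm`; the alarm names, the thresholds, and the SNS topic ARN are placeholders, and it assumes billing alerts are already enabled on the account.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Arguments for one CloudWatch alarm on the account's estimated charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",  # hypothetical naming
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,              # 6 hours; the metric updates slowly
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that emails/texts you
    }


def alarm_ladder(thresholds, sns_topic_arn):
    """One alarm per threshold, lowest first, so warnings escalate."""
    return [billing_alarm_params(t, sns_topic_arn) for t in sorted(thresholds)]


# Placeholder ARN; billing metrics are published in us-east-1.
ladder = alarm_ladder([50, 10, 25], "arn:aws:sns:us-east-1:123456789012:billing")
# With credentials configured, each entry would be applied with:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**entry)
```

These alarms only notify; nothing here stops the spending.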
a) go into credit (so they will charge you at end of month)
b) disable services
Maybe AWS/Google also support a hard limit on spending.
It's gotten much, much easier, and is just another form of command line management, similar to the CLI framework tools with your preferred stack.
Once that first setup is done, similar to setting up a serverless environment, you are generally restoring backups of your base image and beginning projects from there.
It also immensely helps to learn about how to build something to scale that isn't completely reliant on the PaaS layer.
It's nice not to have to worry about a server, but I feel like there are just as many little things to futz with in serverless architectures especially before "environment" variables existed in Lambda.
I have migrated all my services to GCE. At least GCE provides free decent quotas for every resource.
Qualify your statements.
This is the takeaway quote from this for me.
When you have customers doing events, it’s more often that the scale up is from a real event than that someone fat fingered a config.
If they are broadcasting an unscheduled Obama speech from home page of a major paper, that’s not the time to go “Oh, anomalous, shut it down.” By the time that gets fixed and back on, Obama’s left the building - and your customer leaves too.
If you are in the business of offering a service with “elasticity” as a core capability, we found it better for SLOs and better for the bottom line to ‘fix’ this after the fact by discussion than to attempt to tell real spikes from glitches.
If you don’t want elasticity, you might not be looking for “cloud”.
and a SDK like boto3:
http://boto3.readthedocs.io/en/latest/reference/services/bud...
I can't imagine this changing.
None of the "Cloud providers" offer that. They "claim" that it could impact service - yeah, service of debt that you owe them.
Even a few bytes sitting on S3 continue to incur charges and it's hard to be real-time with spend tracking at the scale of these providers so the only option they have is to delete your entire account immediately. Is that what you want? Who would?
For most companies, business continuity matters. The proper solution is to use the budget and reporting features to check your work.
The colloquialism for "real money", at least to me, is "a substantial sum". If that's what you intended, wouldn't it make sense that you wouldn't have an account team if the only time you spent real money was by accident?
That's simply not true. You can accidentally run up huge bills with EC2 instances too. One typo in your CloudFormation templates could spin up a ton of reserved p2.16xlarge's.
Of course, if you consider EC2, and other AWS services, to be "serverless" too - you're not physically managing your own racks after all - then, yeah, fair enough, it is a problem exclusive to these "serverless" IaaS/PaaS providers.
The OP said I should run my own infrastructure. I -could- host my blog by running a web server atop a server I administer, sure. I'd have to take on all the infrastructural tasks of doing that, securing it, ensuring any availability/scalability concerns I may have are taken care of, etc, but I -could- do that.
Instead, S3 + Cloudfront (or, sure, any flavor of hosting and edge caching options you care for; I was not implying "Just AWS") means I don't have to worry about any of that. For me, the reduced level of control, increased availability, scalability, and easy "it just works", is worth the tradeoff. As is the pennies per month it costs me given the low utilization and pay-as-you-go model. It's hardly a scam.
Of course scalability is incomparable in both cases, so if it's something that really matters - and matters more than money - of course something like AWS is a better choice.
And Cloudflare is still an option, since everyone needs their precious caching.
The fundamental issue here is serverless is great at allowing you to automatically scale to meet demand, but it also is great at automatically scaling to meet unexpected resource usage caused by errors (or poor design). And so this means a mistake on your end can cost you a lot of money, because the system thought that it was real demand.
If it is the latter you do not want any rate limiting, you want everything to scale as fast as possible (I hope there are no bugs on your end). Rate limiting means that your new customers get a poor experience and so they are more likely to ask for a refund, or not renew next time.
1. Warn me at $X but don't throttle me for any reason--I'll pay if I go viral
2. Warn me at $X and start throttling until I get to $Y at which point stop service and stop charging
3. Warn me at $X and stop service/charging immediately
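The three options above can be written down as a small decision function. This is only a sketch of the proposed behaviour; no provider is claimed to expose it, and every name here is invented for illustration.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"            # below the warning threshold, nothing to do
    WARN = "warn"          # notify, keep serving at full speed
    THROTTLE = "throttle"  # notify and slow down
    STOP = "stop"          # stop service and stop charging


def decide(policy: int, spend: float, warn_at: float, stop_at: float = None) -> Action:
    """Map current spend to an action under policy 1, 2 or 3 from the list."""
    if spend < warn_at:
        return Action.RUN
    if policy == 1:
        return Action.WARN          # never throttle: "I'll pay if I go viral"
    if policy == 2:
        if stop_at is None:
            raise ValueError("policy 2 needs a stop_at ($Y) value")
        return Action.STOP if spend >= stop_at else Action.THROTTLE
    if policy == 3:
        return Action.STOP          # hard cap at $X
    raise ValueError(f"unknown policy: {policy}")
```

For example, `decide(2, spend=40, warn_at=30, stop_at=100)` gives `Action.THROTTLE`, while the same spend under policy 3 gives `Action.STOP`.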
When you are on shared hosting, the expectation is that you get shut off when you go over.
When you are on "unlimited" shared hosting, the expectation is that you and everyone on the server gets throttled when you go over.
When you are on a VPS, the expectation is that you will be throttled when you go over, and you will be throttled much less than with other options when your neighbor goes over.
With cloud, then, the expectation is that if you go over, you are charged more proportionately, but things continue to work.
Of course, this is a simplification, but I think it accurate enough to be useful.
I do agree that it would be better to choose your api/provisioning and node reliability separately from overage behavior, but most of these behaviors and expectations were based on traditions that were shaped by technical constraints.
To credibly say "we will keep you online and just charge you" you need a lot of spare capacity.
Throttling one customer on a shared host without impacting other customers used to be very difficult. It is still way easier to throttle one VPS customer, and easier still to throttle that one customer when they have their own kernel and reserved memory; it is not as big of a deal as it once was, considering everyone now uses SSD, but systems that share page cache are notoriously difficult to set up such that light users don't impact heavy users.
> based on their business/hobby needs
AWS is not interested in hobbyists - other vendors are picking up the crumbs there.
Besides, we aren't really talking about production databases at large companies. The people who want caps are devs learning and experimenting. It could come with disclaimers that if you enable a cap and exceed it, your services will go offline unexpectedly, and that may leave databases in inconsistent states. But for a large number of usage scenarios that is a completely acceptable tradeoff.
The simple fact is, not having a cap certainly puts me off experimenting with a service, due to a fear of a mistake causing a big bill. And developers learning and investigating a technology is what precedes them recommending that technology to their companies.
Last time I looked, Azure allowed a zero spend cap on free accounts, but you can't change the amount to anything else, and once you remove it you can't switch the cap back on. That's limited, but it's perfect for a learning environment.
If Azure can implement a zero spend cap, there is absolutely no reason that either AWS or Azure can't implement an x spend cap in exactly the same way.
Then AWS is not focused on their use-case. You can make the argument that they're throwing away potential business here, but AWS is already the gorilla in the room, and people clamour for their products already. A couple of years ago they were rated as being bigger than the next 17 VPS providers combined.
> I'm sorry, I just don't buy that. It doesn't have to be a hard cap, it could be a soft one. i.e. at £x your servers start shutting down, you'll get billed for a few extra minutes over your cap, before things have finished shutting down. Servers are totally capable of being shutdown without destroying databases.
The idea of a payment cap sounds easy, but with something as complex as AWS, it's incredibly difficult, and everyone would demand different behaviour at the cap. So, you hit your cap. Turning off EC2 servers is easy. What about the data you've stored on S3? That costs. Should it be purged? What about your disk drives on EBS, should they be purged? How about items you have queued in SQS, should they be purged? Are you using RDS databases? They can't be stopped, only destroyed (you can do a final snapshot, but that's going to go to block storage, which costs. Not much, but it costs).
"Just stop anything that costs" sounds easy, but it's not, not when you have a service as complex as AWS. AWS's current model of "forgive the bill for obvious mistakes" is way more workable.
Besides, Azure proves that it's clearly possible. Azure have a cap. MSDN gives you free Azure credits. When you open an account via MSDN you still have to put in payment details, but you have the option to enable a hard cap that prevents you spending past your free credits. So Azure have clearly got a solution for stopping all the services when the credit limit is reached.
All of what you describe as problems are just decisions to be made. S3 data..? Delete it. Make it read only. Pretend to delete it, but make recovery possible for x days. Doesn't really matter, just pick one when you build it, and document what it does. People who want a cap are going to be more concerned with the overspend than any data or service integrity. They could stick up a disclaimer... "If you enable this cap your data may be destroyed or corrupted if your spending reaches the cap". There are solutions to the implementation problem.
Besides, they probably already have all this code in place. If your payment method gets declined, I'm prepared to bet Amazon don't just let all your services continue running indefinitely because shutting them down automatically is too hard of a problem for them to solve. So any cap could be implemented by just triggering the payment declined function.
I guess it could limit global request rate. But the idea of unbounded elastic services behind a global rate limiter is just funny to me. Like a Ferrari with a 50mph limiter.
Unless I have hard guarantees, I give "cloud providers" re-loadable cards. Can't take more money than what's on there.
I would greatly prefer to pay up front, and have services draw down my credit. That way, I could control my costs directly and concisely. No surprise billing. DoSes get stopped when the funds run out; they aren't the infinite money piggy bank they are now with debt.
I also understand why some clients would want a debt based system where they can expand and contract their costs. I'm cool with that, as long as you know what you're signing up for. The person in this article didn't, and surprise billing is majorly at fault here.
My solution would turn "You owe us $20,000 by end of month" into "Your credit is exhausted after 10 minutes. Something seems wrong with this account when compared to history."
[1] https://azure.microsoft.com/en-us/support/legal/offer-detail...
And even 100% code coverage doesn't find all possible errors.
Unit tests are specifically useful for refactors. You can refactor your code and ensure that it behaves as intended. Integration tests are great, too, don't get me wrong. Either or both would have probably caught this.
O_o
> ... Make it read only.
The primary use-case of S3 is reading objects. This would not be a deterrent for quite a few use-cases.
> ... Pretend to delete it, but make recovery possible for x days.
Still consumes the space that they're charging for in the first place.
> People who want a cap are going to be more concerned with the overspend than any data or service integrity.
This is patently not true, and is why I think you don't really grok why implementing a cap is difficult. It's specifically why I said "everyone would demand different behaviour at the cap". Some would want only this or that service to stop, for example.
A small business sets up a payment cap and hits that cap because they went viral? BAM, all their block storage, destroyed. All their backups, their analytics, their RDS databases, just gone. Right at the time they needed it most. That's a much harder lesson to deal with than "oops, our bill's a bit high because we made a mistake, can you please forgive it?". Or even "ouch, okay we'll pay it". The protection you want for hobbyists would destroy small businesses that may not understand what is actually meant when that hard cap is hit. It's not that caps aren't doable at all, it's just that they're a wicked problem, and the more you look at it, the more issues you can see.
As for soft caps, what is the functional difference between a soft cap and the billing warnings they already have?
Also, their claim on S3 is "we don't lose objects". Destroying objects because of billing would utterly undermine that claim.
> If your payment method gets declined, I'm prepared to bet Amazon don't just let all your services continue running indefinitely because shutting them down automatically is too hard of a problem for them to solve.
AWS does not destroy your services because of late payment. Source: we've just been in late payment.
Are you serious? This would mean Azure is not fit for business:
https://docs.microsoft.com/en-us/azure/billing/billing-spend...
When your usage results in charges that exhaust the monthly amounts included in your offer, the services that you deployed are disabled for the rest of that billing month. For example, Cloud Services that you deployed are removed from production and your Azure virtual machines are stopped and de-allocated. To prevent your services from being disabled, you can choose to remove your spending limit. When your services are disabled, the data in your storage accounts and databases are available in a read-only manner for administrators. At the beginning of the next billing month, if your offer includes credits over multiple months, your subscription will be re-enabled. Then you can redeploy your Cloud Services and have full access to your storage accounts and databases.
The decision not to implement this in AWS has nothing to do with technical issues - they can all be solved one way or another.
But this thread was initially about a dev running some experiments and getting a $200 unexpected bill. They would have been very happy with a $30 cap that just deleted everything. That functionality would be easy to build if they wanted to. But they don't. For other reasons, not because it's hard.
The thing I don't buy is Amazon claiming it's too hard to build a cap. A cap that suits some people would be easy. What they really mean is... A simple hard cap is only useful to customers we don't care about because they don't pay us enough. An advanced cap with all the kinds of failover options and thresholds that a medium sized business might want is complex, and the people who actually pay the big money (those we care about) don't actually want caps anyway.
That's an honest answer, I'd be happy if they formulated it this way. Fortunately, there are other cloud providers with billing caps implemented properly, and you don't hear horror stories about them (problems with spending too much on AWS are very common though).