Scaling up the Prime Video audio/video monitoring service and reducing costs (primevideotech.com)
Also, the pricing of AWS quickly goes up as you go from EC2 -> Fargate -> Lambda. I don't know why on earth someone would build microservices at the lambda-level.
The worst software systems I have ever seen were micro-services. One of them is more than 20 years old. The WTF count per minute is exponential.
I have 0 experience with serverless/cloud. Just a thought.
Wouldn’t have expected prime to be pushing around images on s3
It's fine to split things up, but we have to be careful how we do it + aware of the overheads.
It is pretty bad. It happens in 8 out of 10 movies. There is some misconfiguration in their AV transcoding pipeline.
And here, we have an article talking about Monolith vs. Microservices improving user experience.
Netflix's shiny new compression scheme a couple years ago didn't work on my Sony TV's buggy silicon. The only way I got that fixed was by knowing someone on the inside.
Hulu usually can't make it through an episode without the video freezing at least once. Sometimes it just refuses to work at all until I completely reboot the TV.
HBO Max's UI is just really cheesy and slow, but whatever it's fine.
Paramount+ is my new favorite to hate on. The UI is maddeningly glitchy and lethargic. I pay for no ads, but it plays ads anyway, on Star Trek episodes from 1996. It doesn't remember progress in a show more than once every week or two, just enough to remind you that it's supposed to be a feature. On my phone, it doesn't hide the typical menu overlays unless I do a complex sequence of finger taps. One time I tried to file a bug report from inside the logged-into app, and I got an email back claiming that they would love to consider my concerns but can't because they don't have an account associated with my email address.
And I use a FireStick, FWIW.
BTW, their own transcoder product MediaConvert seems to have this issue (it is possible that it could be user error too, in how they have used the product or set up the parameters). [1]
My guess is PrimeVideo dogfoods MediaConvert and they also have this issue. They could have fixed it for newer content, but previously transcoded content still has issues (which will remain until they are re-transcoded).
[1]: https://repost.aws/questions/QUGajgu4zKTlewlTg1M96i_Q/questi...?
Microservices are no more or less scalable than a monolith. The main benefit of microservices is allowing multiple teams to work independently from each other without everyone "stepping on each other's toes". You can have scalable monoliths and unscalable microservices.
This is not fully true. A microservice architecture is more finely scalable than a monolith.
To take a very basic example, if you have a peak of users watching a video you can scale up the microservice dedicated to serving videos, but not scale up the service dedicated to users signups, which isn't having an increased load.
No, splitting a codebase does not magically make it more scalable in production. You still have to prove that the authentication component would create significant unnecessary load if it was scaled up together with the video service.
Apologies, but I strongly disagree and I'm going to go on a bit of a rant here....
This is a myth, and one of the reasons people are making these ridiculous architecture decisions. If you have a monolith that serves videos and enables signups, you can deploy as many instances of that as you like based on the highest need. It doesn't matter if user signups are a fraction of video watches, it just means that your user signup endpoint is not getting called as much. Maybe you're deploying a larger codebase than you need to but that's hardly a downside.
In your example, let's say we have 2 endpoints that are behind a gateway or L7 LB so that we can point them at different codebases if we like:
- videoservice.com/signup
- videoservice.com/watch
If I'm getting 100k rps to /watch, and 100 rps to /signup, I can just deploy loads of instances of my monolith behind the /watch endpoint. Maybe that monolith contains code for /signup, but it's not going to get called. So what.
I've seen this approach used in many places. You don't need to split the code to do this at all. Sure it might feel "cleaner" to you to do this, but it's not needed.
Now, you may get to a point where your deployment is really heavy and time consuming and you don't want to deploy everything just to scale up /watch - but again I'd argue that is not really anything to do with scalability, it's about being able to deploy things independently. Using a microservice doesn't make your service more scalable here, but it might make it easier to deploy.
Microservices are nothing to do with scalability. They are about how you organise code and teams to achieve better development velocity.
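A rough back-of-the-envelope sketch of the /watch vs /signup argument above. The per-instance capacity of 500 rps is a made-up number, and the sketch assumes both endpoints cost about the same per request; it ignores per-instance overheads such as warm caches and idle connections.

```python
import math

def instances_needed(rps: float, rps_per_instance: float) -> int:
    """Instances required to serve a load, with at least one instance."""
    return max(1, math.ceil(rps / rps_per_instance))

WATCH_RPS = 100_000    # from the example above
SIGNUP_RPS = 100       # from the example above
RPS_PER_INSTANCE = 500 # hypothetical capacity, the same for both endpoints

# One monolith fleet serving both endpoints.
monolith_instances = instances_needed(WATCH_RPS + SIGNUP_RPS, RPS_PER_INSTANCE)

# Two separately scaled services.
split_instances = (instances_needed(WATCH_RPS, RPS_PER_INSTANCE)
                   + instances_needed(SIGNUP_RPS, RPS_PER_INSTANCE))

print(monolith_instances, split_instances)
```

Under these assumptions the two fleets come out the same size, which is the point being made: carrying unused /signup code does not by itself cost extra instances.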
I don't think this is strictly true; even though microservices are usually used that way.
Scaling up everything even when not needed has its difficulties. You can have lots of unnecessary initialisation tasks, lots of unused caches warmed up, database and socket connections that are not needed, complexities in work sharing algorithms etc.
So various parts of Amazon have to work through the same AWS pricing programs that the rest of us do?
Also, Prime Video isn't part of AWS but the consumer / devices / other part of (retail) Amazon.
Source: worked there
If your architecture has a high cost to develop, test and run when a cheaper architecture meets your needs, it's a sign that you have overengineered. In my experience there is an order-of-magnitude increase in complexity by adopting microservices that only starts to pay off when your org and user base are huge.
"Let's make our entire website serverless now" erm, no?
It's cargo culting of the worst kind
Understanding this behavioralism will get you through many situations in life.
The product (Prime Video) is still built using many business oriented services. Furthermore, this service appears to be developed and operated by a single team.
That being said, there are some lessons here - there are good ideas in most design paradigms, but if you take them to the extreme, you're going to see some weird side effects. Understand the benefits and engineer a balanced solution.
We are looking into serverless as a way to exhibit to our customers that we are strictly following certain pre-packaged compliance models. Cost & performance are a distant 2nd concern to security & compliance for us. And to be clear - we aren't necessarily talking about actual security - this is more about making a B2B client feel more secure by way of our standardized operating model.
The thinking goes something like - If we don't have direct access to any servers, hard drives or databases, there aren't any major audit points to discuss. Storage of PII is the hottest topic in our industry and we can sidestep entire aspects of The Auditor's main quest line by avoiding certain technology choices. If we decided to go with an on-prem setup and rack our own servers, we'd have to endure uncomfortable levels of compliance.
Put differently, if you want to achieve something like PCI-DSS or ITAR compliance without having to convert your [home] office into a SCIF, serverless can be a fantastic thing to consider.
If performance & cost are the primary considerations and you don't have auditors breathing down your neck, maybe stick with simpler tech.
Being an early engineer at most of my stints, I have built and scaled multiple startups using the approach and it has never failed me. The pitfalls of microservices are not worth it unless absolutely necessary.
I always made it a point to group by business-logic rather than separate at whatever curve ball "new-tech" throws at me.
- granularity
- bandwidth negligibility
Breaking everything down to a gnat's ass might improve testability, but is testability the product? Do I really need a Java stack trace that reads like an Andrew Wiles proof?[1] Maybe I do, at scale.
Then there is the non-zero cost of the packet shuffling. Every edge in the architectural graph, not just the nodes, costs. But we just throw a waiter into the code and move on to the next line. No biggie.
What was most interesting was "It also increased our scaling capabilities." Granularity was supposed to let "serverless" absorb the entire universe, I thought.
At a higher level of abstraction, maybe The Famous Article is a map/reduce job: the requirements dissolved into solution, and a proper number of components precipitated out.
[1] https://en.m.wikipedia.org/wiki/Wiles%27s_proof_of_Fermat%27...
Taking "malloc for the Internet" [1] a bit /too/ literally there.
[1] https://aws.amazon.com/blogs/aws/eight-years-and-counting-of...
> grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too
> seem very confusing to grug
Even with monolith -> microservices I've seen it go wrong. On one Go application I worked on, it would take a senior engineer a week to add a basic CRUD endpoint, as the code had been split into microservices along the wrong boundaries. There was a ridiculous amount of wiring up and service-to-service calls that needed to be done. I remember suggesting a monolith might be more appropriate, and was told it used to be a monolith but had been "refactored to microservices"...
This type of stuff can literally kill early stage companies.
They're good enough, though, to deliver an MVP quickly, but that's about all.
Microservices seem to be a decent idea with a terrible name. The idea of running services that are small enough that they can be managed by a single team makes sense - it enables each team to deploy their own stuff.
But if you break things down further, where you need multiple "services" to perform a single task, and you have a single team managing multiple services - all you do is increase operational & computational overhead.
Serverless and edge aren’t the same thing.
But I think it's interesting that if we took a time machine back to 2014 or 2015 the tone here would be quite different, and microservices were all the rage on this forum as I recall.
I like to hope that the industry learns from its failed trends, but I'm now old enough to see this is rarely the case.
Scaling is definitely a good thing, microservices make scaling easier, no doubt about that. But an MVP rarely needs k8s level scaling, it just needs to be written well so it can scale in the future.
I love the anecdotes about just buying a Hetzner server which can handle a surprising amount.
One of my ideas is a company that maintains an incremental infrastructure that can grow to handle extreme levels of traffic - the infrastructure itself mutates over time.
In my experience, AWS/Amazon people do not force you or even direct you to a particular architectural choice. They are relatively indifferent about it.
Instead, trend-driven architectures seem to come from the tech community themselves. It's the customers often making the wrong choice.
- When people use the solution -> problem path instead of problem -> proposals -> cost analysis -> solution they get what they deserve.
- It is possible to optimize most infrastructures and code, it depends how much obviously but I have seen such percentages before
The real question is: why didn't they choose the right stack for their problem to begin with?
It is clearly made by people who don't really understand (or do not care) how distributed workflows work.
And the pricing is prohibitive for running it at scale. In my opinion it should be free to use, provided you glue together other AWS services with it.
It is like saying Oracle/Postgres/MySQL/MSSQL. If you say they are different in some X functionality, yeah duh they are different in X functionality. However, they are all SQL databases.
In the same way, edge and serverless both run on limited compute resources (which is the point of the article and the point I was making). That they differ in functionality X (latency/closeness to your user) has nothing to do with either the article or my answer.
Write a verbal description of your cloud function and let the LLM simulate the execution.
Very cheap to develop. Very expensive to execute.
> ChatGPT: Yes, Avogadro's number is even. The value of Avogadro's number is approximately 6.022 x 10^23, and since it ends with the digit 2, it is an even number.
Right answer, wrong reasoning.
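To spell out why the reasoning is wrong: the "2" ChatGPT points at is a digit of the rounded mantissa, not the last digit of the number. Since the 2019 SI redefinition the Avogadro constant is fixed at exactly 6.02214076 x 10^23, which is an integer ending in zeros. A quick check, with variable names chosen only for illustration:

```python
# Exact value fixed by the 2019 SI redefinition: 6.02214076 x 10^23.
AVOGADRO = 602_214_076 * 10**15

is_even = AVOGADRO % 2 == 0
last_digit = AVOGADRO % 10  # the digit that actually decides parity

print(AVOGADRO, is_even, last_digit)
```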
I know there are nuances in the article, but my first impression was it's saying "we went back to basics and stopped using needless expensive AWS stuff that caused us to completely over architect our application and the results were much better". Which is a good lesson, and a good story, but there's a kind of irony that it's come from an internal Amazon team. As another poster commented, I wouldn't be surprised if it's taken down at some point.
If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.
With cloud if demand drops you can scale up and down as needed. Helping customers cut costs during difficult times makes sense since those customers are more likely to survive and stay with you through good times.
So in context I think this article makes sense since long-term sustainable growth of AWS should be linked with the growth of their customers' businesses.
It's quite simple: if workload x can be done 100% cheaper on-prem then it's an obvious move (probably). If AWS manages to get that closer to 30-40%, then the operational benefits of using AWS make more sense: more workloads, more total spend.
(They finally delivered 3.10 last month at least)
And then charging them to use AWS anywhere and outpost!
Why? Using the model they switched to (which uses a different set of AWS services) instead of the model they switched from is a recommendation that the AWS tech advisers that are made available to enterprise customers will make for certain workloads.
Now when they do that, they can also point to this article as additional backing.
AWS doesn't have an equally distributed interest in selling all of its products. Some AWS products exist because customers need/demand them and others exist because they provide higher margins and tighter lock-in to Amazon: the first type of products are great for customer acquisition, the role of their sales folk is to then convince people using the former to migrate to the latter.
I'm pretty happy with the monolith that we run at our business and this seems to validate our decision to stick to that monolith, but I'm also pretty confident that where we use AWS Lambda, serverless is absolutely the right way to go.
For example, I've written a Lambda application to reply to webhook calls and send API calls whenever those come in. It costs maybe $2 per month to run in compute and requests. Would that make more sense to rewrite as a monolith and run on EC2? I really doubt it.
IMO, distributed software is more practical for working development than for technical reasons.
We all know from the basics that performant software comes from single structures that do not require packing and unpacking data. But scaling large applications is hard, and it was much more expensive back then. Now that we overreacted to microservices, we will overreact to monoliths again. And we will bounce many more times until AI takes our jobs and does the loop itself.
Why? they're still using "AWS stuff" - EC2 and ECS etc. Serverless is a fraction of the services AWS offers.
AWS actively promote ways of reducing customers bills. This article could be considered a puff piece for the AWS Compute Savings Plan:
At some point the industry will wake up to the fact the AWS pricing pages are the real API docs, meanwhile dumb shit like this will keep happening over and over again, and AWS absolutely are not to blame for it, any more than e.g. a vendor of cabling is guilty of burning down the house of someone who plugged 10 electric heaters into a chain of double-gang power extension cords
Not at all. My time working with AWS reps, they never pushed a particular way of doing things. Rather, they tried to make what we wanted to do easier. And the caveat was always to test and make decisions on what was important to us. This isn't an anti-AWS article. Rather, it's exactly the type of thing I'd expect from them. Use the right tool for the right job.
Tldr build the right thing.
>"AWS sales and support teams continue to spend much of their time helping customers optimize AWS spend so they can weather this uncertain economy," Brian Olsavsky, Amazon's finance chief, said on a conference call with analysts.[0]
Amazon isn't afraid of this trend, they're embracing it. Better to cannibalise yourself than be disrupted by someone else
https://twitter.com/DanRose999/status/1287944667414196225?s=...
[0] https://www.cnbc.com/2023/04/27/aws-q1-earnings-report-2023....
The key is to look down on nothing, become competent with multiple architectures, and know which ones not to implement in a use case if the one to use isn’t clear right away
Yes agreed there was some funny business like not selling Chromecast, but the guiding principle was generally to make things customers want...
Some irony in my anecdotal experiences is that most places that don't have the traffic to justify the cost of these super distributed service architectures also see a performance penalty from introducing network calls and marshaling costs
It’s something they use to check for defects in the video stream - hence the storing of individual frames in S3.
Original title: Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%
This post is going to pick up a lot of traction and I suspect these comments are going to bikeshed monolith vs microservices for the next day.
On reading it, this is for a video quality monitoring system that needs to consume and process video. Generally a compute- and time-intensive task. Something not always suited to serverless, particularly when it’s not easy to parallelise.
The task at hand doesn’t sound ideally suited to serverless, but the existence of the post shows that’s not readily obvious. So it’s a valuable post to explain a scenario where a few big machines is the best call.
But the sensationalism of the headline, would suggest all serverless is expensive and wasteful. When in reality the same is true for a non-ideal workload on a monolith.
PrimeVideo is very much based on a microservice architecture. Hell, my team which isn't client facing and has a very dedicated purpose has easily more microservices than engineers.
Not surprising that didn't go well. This strikes me as a punching bag example.
Anyone who has worked with images, video, 3d models, or even just really large blocks of text or numbers before (any kind of actually "big data") knows how much work goes into NOT copying the frames/files around unnecessarily, even in memory. Copying them across the network is just a completely naive first pass at implementing something like this.
Video processing is very definitely a job you want to bring the functions to the data for. That is why graphics card APIs are built the way they are. You don't see OpenGL offering a ton of functions to copy the framebuffers into RAM so you can work on them there only to copy them back to the video card. And if you did do that, you would quickly find out that you can be 10x to 100x more efficient by just learning compute shaders or OpenCL.
You could do this in a distributed fashion though, but it would have to look more like Hadoop jobs. I predict the final answer here, if they want to be reasonably fast as well, is going to be sending the videos to G4 instances and switching the detectors over to a shader language.
In general, if the data is much bigger than the code in bytes, move the code, not the data.
IO is almost always the most expensive part of any data processing job. If you're going to do highly scalable data processing, you need to be measuring how much time you spend on IO versus actually running your processing job, per record. That will make it dead obvious where you should spend your optimization efforts.
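A minimal sketch of the per-record IO vs compute measurement described above. `fake_load` and `fake_process` are hypothetical stand-ins (a 10 ms sleep imitates a network fetch); in a real job you would pass in your own load and process functions.

```python
import time
from typing import Callable, Dict, Iterable

def profile_job(keys: Iterable[str],
                load: Callable[[str], bytes],
                process: Callable[[bytes], object]) -> Dict[str, float]:
    """Split wall-clock time into IO (load) and compute (process), per record."""
    io_seconds = 0.0
    compute_seconds = 0.0
    records = 0
    for key in keys:
        start = time.perf_counter()
        record = load(key)        # IO: disk, network, object storage, ...
        loaded = time.perf_counter()
        process(record)           # compute: the actual processing job
        done = time.perf_counter()
        io_seconds += loaded - start
        compute_seconds += done - loaded
        records += 1
    per_record = max(records, 1)  # avoid dividing by zero on empty input
    return {
        "records": records,
        "io_s_per_record": io_seconds / per_record,
        "compute_s_per_record": compute_seconds / per_record,
    }

def fake_load(key: str) -> bytes:
    """Hypothetical loader: pretends each fetch takes about 10 ms."""
    time.sleep(0.01)
    return key.encode() * 1000

def fake_process(record: bytes) -> int:
    """Hypothetical detector: a cheap pass over the bytes."""
    return sum(record)

stats = profile_job(["frame-1", "frame-2", "frame-3"], fake_load, fake_process)
print(stats)
```

If `io_s_per_record` dominates, as it does with these stand-ins, the optimization effort belongs in data movement rather than in the processing code.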
Some excerpts:
> This eliminated the need for the S3 bucket as the intermediate storage for video frames because our data transfer now happened in the memory.
My candid reaction: Seriously? WTF?
I am honestly surprised that someone thought it was a good idea to shuffle video frames over the wire to S3 and then back down to run some buffer computations. Fixing the problem and then calling it a win?
But I think I understand what might have led to this. At AWS, there is an emphasis on using their own services. So when use cases that don't fit well on top of AWS services come up, there is internal pressure to shoehorn it anyway. Hence these sorts of decisions.
I feel that's like 95% of the "we migrated from X to Y and now it is better" stories: most of the improvements come from rewriting the app/infrastructure after learning the lessons, with only a small part sometimes being the change in tech
> Amazon Web Services, Inc. is a subsidiary of Amazon
So it’s technically another company.
Another comment seems to confirm this akshually comment ^_^’
They changed a single service, the Prime Video audio/video monitoring service, from a few Lambda and Step Function components into a 'monolith'. This monolith is still one of presumably many services within Prime Video.
And the article itself mentions the 90% cost reduction.
So the title seems pretty much in-line with the original intent.
That has some newsworthiness and the title kind of reflects that.
…and going to a newer AWS service (ECS), instead.
Honestly, the original architecture was insane though. They needed to monitor encoding quality for video streams so they decided to save each encoded video frame as a separate image on S3 and pass it around to various machines for processing.
That is a massive data explosion and very inefficient. It makes a lot more sense that they now look for defects directly on the machines that are encoding the video.
Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.
No, that’s still a bad architecture. Bandwidth within AWS may be “free” within the same AZ, but it’s very limited. Until you get to very very large instance types, you max out at 30 Gbps instance networking, and even the largest types only hit 200 Gbps. A single 1080p uncompressed stream is 3 Gbps or so. There is no way you can effectively use any of the large M7g instances to decode and stream uncompressed video. (Maybe the very smallest, but that has its own issues.)
In contrast, if you decode and process the data on the same machine, you can very easily fit enough buffers in memory, getting the full memory bandwidth, which is more like 1 Tbps. If you can process partial frames so you never write whole frames to memory, you can live in cache for even more bandwidth and improved multi-core scalability.
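For reference, the "3 Gbps or so" figure can be reproduced with simple arithmetic. The sketch assumes 8-bit RGB (24 bits per pixel) at 60 frames per second, which the comment does not state; at 30 fps or with chroma subsampling the number is lower.

```python
WIDTH, HEIGHT = 1920, 1080
BITS_PER_PIXEL = 24       # assumption: 8-bit RGB, no chroma subsampling
FRAMES_PER_SECOND = 60    # assumption: 60 fps

bits_per_second = WIDTH * HEIGHT * BITS_PER_PIXEL * FRAMES_PER_SECOND
gbps = bits_per_second / 1e9

NIC_GBPS = 30             # the instance networking ceiling quoted above
streams_per_nic = int(NIC_GBPS // gbps)

print(f"{gbps:.2f} Gbps per stream, {streams_per_nic} streams per {NIC_GBPS} Gbps NIC")
```

So even before protocol overhead, a 30 Gbps link carries only about ten such streams, while memory bandwidth on the decoding host is far higher.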
In this case they were using AWS Step Functions, which are known to be expensive ($0.025 per 1,000 state transitions), and they wrote:
> Our service performed multiple state transitions for every second of the stream
Secondly, they were making large numbers of S3 requests to temporarily store and download each video frame, which became a cost factor.
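To get a feel for the orchestration cost alone, here is a rough estimate using the $0.025 per 1,000 transitions price quoted above. The figure of 5 transitions per second of stream is purely an assumption standing in for "multiple"; S3 request costs and Lambda compute are not included.

```python
PRICE_PER_1000_TRANSITIONS = 0.025  # USD, as quoted above
TRANSITIONS_PER_STREAM_SECOND = 5   # assumption: "multiple" per second

def transition_cost(stream_seconds: float) -> float:
    """Step Functions transition cost in USD for one monitored stream."""
    transitions = stream_seconds * TRANSITIONS_PER_STREAM_SECOND
    return transitions / 1000 * PRICE_PER_1000_TRANSITIONS

cost_per_stream_hour = transition_cost(3600)
cost_per_stream_day = transition_cost(24 * 3600)

print(f"${cost_per_stream_hour:.2f} per stream-hour, ${cost_per_stream_day:.2f} per stream-day")
```

The same calculation, run before launch with the expected number of concurrent streams, would also show how quickly the workload approaches account-level limits on transitions.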
They had a hammer - and every problem looked like a nail. In my experience this happens to every developer at a certain stage when he/she gets in touch with a new technology; it doesn't mean that the tech itself is bad - it depends on the scenario, though.
Like, did they even think about cost when designing this the first time?
Obviously no, only after managers complained.
The latter of course helping Amazon market "serverless" to the unwashed masses as a "solution".
This is so obvious in my head. I can't think of a single good reason where a SFN makes sense here.
The message seems more that they outgrew AWS lambda but that lambda was a good choice at first.
“There are use cases where Amazon EC2 and Amazon ECS are a better platform than AWS Lambda” is…not actually a message that anyone involved in AWS has ever been afraid to put forward.
I mean, the whole reason that AWS has a whole raft of different compute solutions is that, notionally, removing any one would make the offering less fit for some use case.
The article mostly lays the blame on Step Functions. Also, lambdas are portrayed as event handlers that don't run very often. This means long-running tasks that are run occasionally, or events that don't fire that often. Once throughput needs go up or your invocation frequency gets close to the millisecond range, the rule of thumb is that you already need a dedicated service.
Most importantly it's good for mental health though.
Same for cloud, same for <pattern>
If everything is a hammer you'll hurt your thumb/hand/arm.
At least now (for some time) the pattern is named, so broadly when talking about this sort of thing, the name conjures up the same/similar image in everyone's heads.
There are all sorts of inputs to the choice of architectural patterns, including budget, scalability (up and down), criticality, security, secrecy, team skills and knowledge, preference, organisational layout, organisation size, vendor landscape, existing contracts, legal jurisdiction ....
Some of this would have been really easy to predict (eg. hitting account limits) if they simply took the time to calculate how many workflow transitions they'd need to execute for the load.
1) We have sexy new product! Everyone use it so we have some use-case stories to tell and we look credible! Who cares if it's not the right tool for the job! We need a splashy way to use hackneyed business speak like "we're eating our own dog food" at the next user con so all the IT middle managers there will fight over early access and adoption. PROFIT! (Screams of technology teams in the background of "a knife is the most expensive, useless pry tool you can buy, but whatever, you are not listening, mmmkay").
2) A few quarters/years later (if you're lucky and you made it or someone with enough gravity in their title finally saw the light): Why is expense so high in this business unit? This is insane! Let's go back to a more sane architecture. (Screams of technology teams going back to what was working in the first place, but was not sexy nor necessarily new now that no one is watching and hype cycle is over)
Does this mean that serverless is useless? Dumb? Uneconomical? No way. For bursty, very short running workloads, it can be GREAT and INCREDIBLY economical.
What is useless and "dumb" is whoever thought that Prime Video's encoding workloads were going to do anything but increase cost and were somehow a fit for a system whose business case specifically necessitates bursty, shorter workloads that are primarily scale-to-zero for significant periods of the day/week/month.
It was a marketing stunt gone horribly wrong: intentional or not, but that doesn't repudiate the value of "serverless" for the right workloads, it just proves you better really understand the technology and the business case and the scale economics, and that goes for any technology.
Feels very strongly like they just moved from one AWS platform to another.
The delay between asynchronously communicating processes differs in these architectures, and I suspect they were unable to orchestrate microservices to match the RPC "inside" a monolith model. Nobody can: it only matters if your IPC is causing delay you can avoid.
Most of us aren't in a room where the real cost is high: 90% of computers are more than 90% idle 90% of the time. Amazon is not in that cohort.
In this case, they probably should have used Step Functions Express, which charges based on duration as opposed to number of transitions. They're looking for "on host orchestration": orchestrating a bunch of things that usually finish quickly and are done over and over, many times. Standard Step Functions is better when workflows run longer and exactly-once semantics are needed. Link for reading about the differences between Express and Standard Step Functions: https://docs.aws.amazon.com/en_us/step-functions/latest/dg/c....
This also exemplifies something I learned while at Amazon & AWS: Amazon themselves don't know how best to use AWS. This is one of the great examples. I'll share one more:
- In my team within AWS, we were building a new service, and someone proposed building a whole new microservice to monitor the progress of requests to ensure we don't drop requests. As soon as I mentioned the visibility timeout in SQS queues, the whole need for the service went away. Saving Amazon money ($$) & time (also $$). But if I or someone else hadn't mentioned it, we would have built it.
I don't think serverless is a silver bullet, but I don't think this is a great example of when not to use serverless. It helps to know the differences between various services and when to use what.
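For readers unfamiliar with the mechanism, here is a toy in-memory model of a visibility timeout. It is not the SQS API: `ToyQueue` and its methods are invented for illustration, and time is passed in explicitly so the behaviour is deterministic. The idea it mirrors is that a received message is hidden rather than removed, and reappears for another consumer if it is not deleted before the timeout expires.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class ToyQueue:
    """Toy model of a queue with a visibility timeout (not the real SQS API)."""
    visibility_timeout: float
    # message id -> (body, time at which the message becomes visible again)
    _messages: Dict[int, Tuple[str, float]] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def send(self, body: str, now: float) -> int:
        msg_id = next(self._ids)
        self._messages[msg_id] = (body, now)  # visible immediately
        return msg_id

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                # Hide the message instead of removing it.
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        """Acknowledge successful processing; the message is gone for good."""
        self._messages.pop(msg_id, None)

queue = ToyQueue(visibility_timeout=30.0)
queue.send("request-1", now=0.0)

first = queue.receive(now=0.0)    # worker A takes the message, then crashes
hidden = queue.receive(now=10.0)  # still invisible to worker B
retry = queue.receive(now=31.0)   # timeout elapsed: redelivered to worker B
queue.delete(retry[0])            # worker B finishes and acknowledges
after = queue.receive(now=100.0)  # nothing left to redeliver
```

Because an unacknowledged request comes back on its own, no separate service is needed to watch for dropped requests, which is the saving described above.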
PS: Ex Amazon & AWS here. I have nothing to gain or lose by AWS usage going up or down. I'm currently using a serverless architecture for my new startup which may bias my opinions here.
If the article said fargate, which is technically still serverless we could have avoided a whole microservice vs monolith debate or serverless vs hosts/instances debate.
I haven't yet seen a project/product which would need microservice architecture for technical reasons. If you need to scale, you can just scale monoliths (perhaps serving in different roles).
The use case for microservice architecture is, IMHO, organizational and high-level-architecture driven. I've worked in a big company (20K employees) which was completely redesigning its back-office IT solution, which ended up as a mesh of various microservices serving various needs (typically consumed by purpose-built frontends), worked on by different teams. There, a monolith didn't make sense, because there was no single purpose, no single product.
But if I'm building a product, I will choose monolith every time. Maaaaybe in some very special cases, I will build some auxiliary services serving the monolith, but there needs to be a very good reason to do so.
I did feel a bit embarrassed having to make a microservice after having argued against them so much over the years. Hopefully I can stop producing PDFs soon so I can delete the entire thing :P
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." -- Melvin E. Conway
But that just puts into perspective how silly this argument is because I have no idea what a project means to other people.
Naturally it could be a single container taking care of all those integrations.
With microservices, it is easy to see services which are down or have a high error rate or latency, have a clear API contract and call out the team for breaking it, and assign costs that teams have an incentive to reduce, or at least not increase.
Large monolithic repos with many independent targets for testing and deployment work the best at huge scales. If you are only a few hundred engineers, a monorepo with monolithic deployments and tests works fine.
That is a good point about reliability and cost though. I hadn't heard that before.
A lot of people here say...one service per team. But to me that is, or can be, a monolith. Often a team is a product line, so you have one service for that product. Is that a monolith? I don't know either, I guess.
I -do- know most people who go around promoting that sweet microservice life end up being the worst. They seem to want every db table to be its own service, introduce tons of message passing and queues, etc, for absolutely no reason. I think we can probably all agree that is about the worst way to go about it.
But doing it right, is nevertheless hard. Because cutting your business into chunks ... is not as easy as it always looks.
In my org in Google we average over one microservice per engineer. I'll be adding two in the next couple weeks. With the right automation setup you don't notice them any more than you do server instances.
The conversation went as below.
- Does your app work fine?
- Yes.
- Do you have any problems?
- No.
- Why do you want to migrate then?
- Silence.
Your job is to figure out what they actually need even if they don't understand it?
Seems pretty par for the course.
There are dozens of reasons to migrate to the cloud. Do they apply to everyone? No. Are they always worth the cost? No. But the whole "cloud vs not cloud" argument that happened, got settled ("cloud"), and is now being restarted by the DHH-like is not data-driven and full of exaggerations and fear-mongering from both sides.
Then you add on top of that that the main product of moving to the cloud is "operations" which is typically measured in "hours of human capital being impacted outside of core working hours". When the market is booming, tech humans are expensive and fickle, and don't want to undertake more operations than they should have to, and companies are forced to pay cloud providers.
But in today's 2023 climate, any company looking around to decide how much to spend on cloud just says "Why would we pay for something when we can just ask our engineers to work more hours, and invite them to quit if they don't like it, oh wait nobody else is hiring anyway"
No cost calculator of $$$ saved considers that overtime is free in our industry.
tl;dr the cloud backlash is overblown, more companies/businesses would benefit from cloud than not.
Everyone wants to do the new cool thing. Everyone wants it on their CV. To be followed some years later by everyone saying how awful it is, and moving on to the next fad. Rinse, repeat, round and round we go with no actual intelligence being applied.
If you ask me, if the time and focus is invested properly, it would be much more efficient to run a monolith instead. That's what some small number of great teams end up doing.
"lateley" as in "for the last 5 years"?
If I could trivially package them up and deploy them locally with intrinsically less effort and wall-time than the old way, that'd be amazing.
If I could somehow get the horizontal scaling promises and redundancy as some kind of built-in, like I can with say, memcache, that'd be cool.
If I could do these kinds of "hard" things with them more trivially, that'd be really nice.
There's a lot of things I want them to do but it's a god-damn bull-riding rodeo every time I try to get there.
And before you reply, I know you're an expert and can do all these things trivially. That's amazing. The vast majority of the industry creates a giant fragile spaghetti knot with them and I am not a full time k8s admin nor do I want this to be a career trajectory. It should be like you know, wine, ffmpeg, imagemagick, virtualbox, lua, qemu, redis, gnuplot, lvm2, gdb, ssh, sqlite; tools like that. It's pretty easy to get them to do really nice things. Those things deliver on their promises and potential pretty nicely.
It's nice that nobody feels a need to hype curl or squid. They just work. Isn't that nice? I mean look at gdb's website: https://www.sourceware.org/gdb/ it doesn't even have CSS animations --- in fact, it doesn't even have CSS.
But then for perfect decorrelation you'd also need independent databases behind the microservices, and queues between them for horizontal communication, and few are actually going all in with that, and so fall in the 95% where they go through the motions and the effort of splitting microservices, but reap no actual benefit from it.
https://en.wikipedia.org/wiki/Cargo_cult_programming#:~:text....
If you have a microservice first architecture, the perception is, it's easier to describe effort to re-write an individual service or split it into two services as there is a clearly delineated body of work. Bizarre service-to-service dependencies may still exist and a poorly implemented microservice architecture is still a potential challenge.
Point being, organizations incentivize bad economic decisions on the part of engineers through the inability to recognize that rework is a necessary aspect of developing software and by constantly eschewing rework in favor of feature delivery it sends a strong message to the engineer about what to prioritize.
The "pain" of figuring out how to deploy your "normal app" quickly amortizes over just how much easier and more reliable code is.
That's assuming they are big enough to warrant that in the first place.
IMO if it is a dozen unrelated endpoints but all of that still takes less than dev/month to manage it's probably entirely wasted effort to "fix it"
And if any of them grows to warrant dev/month or more.... separate that one and only that one out of that.
My biggest grief with microservices is the fact that it's effectively become a war on having a coherent logical normalized relational data model inside an organization.
However, I think there is something hiding inside the µservices movement that is actually much more generally applicable and useful: API-first development.
And of course good old OO.
Micro-services were invented by an outstanding software outsourcing company to milk billable hours and offload responsibilities in large org.
If you want to save costs and are a not-that-large business, go for monolith-first. Keep it modular.
You also have to consider things like it now being harder for people to see the system as a holistic whole (the tricky bugs are often in the composition of components) and a lot of subtle effects that brings. Even just increasing the friction for people to move between teams or friction for security people to apply consistent standards across all groups.
I have only seen this from the business side (I'm not a developer), but I have seen teams start coding in another team's service just to be able to proceed.
It's not always good to create silos like this either.
You need truly gargantuan scale before things become logically separate code-bases.
It quickly became clear even he had no experience with the set of tools & services he had advocated, and the whole thing went off the rails slowly & surely.
Lo & behold 100% of existing customers are still on the on-prem offering 2 years later, and if you throw in the new customers that were shoehorned onto the AWS offering, his team has captured 2% of customer use after 2 years of effort.
I was back on AWS for the first time in a few years this week and the amount of new "upsell" prompts in the console is ridiculous. Spin up an RDS instance - "hey, would you like an Elasticache cluster too?". I think AWS are very aware of this behaviour and encourage it. Simplicity is not in their interest.
Don't forget to configure Route53, VPC, IAM and an ELB.
Great - ready to start writing your app now?
Oh wow one of those components as configured with the other components isn't behaving as expected - time to contact AWS support!
Cynically I think CTOs see all this stuff and think they'll turn all their expensive on-shore devs into cheaper DevOps because AWS is magic and you don't need to write hard app code anymore.
I'd counter that AWS forces expensive on-shore devs into having to wear an entire new hat and be half a DevOps engineer to figure out how to make their code work on this alphabet soup instead of a Linux server.
A few requests per millisecond should be well within the capabilities of this instance, depending on the complexity of each request of course.
It was a small semi static contact form that was deployed on 27 web apps (9 services x 3 environments) and used a NoSQL storage, redis, serverless stuff, etc.
Insanely complex deployment process, crazy complexity and all over the place.
Of course the only rational take on monoliths versus microservices is "use the right tool for the job".
But systems design interviews, FAANG, 'thought leaders', etc basically ignore this nuance in favour of something like the following.
Question: design pastebin (edit, I of course mean a URL shortener not pastebin)
Rational first pass but wrong Answer: Have a monolith that chucks the URL in the database.
Whereas the only winning answer is going to have a bunch of services, separate persistence and caching, a CDN, load balancing, replicas, probably a DNS and a service mesh chucked in for good measure.
I think this article shows that this is training and producing people who can't even think of the obvious first answer because they have been so thoroughly indoctrinated.
It would be nice to know how much latency there was in the microservice version vs the monolithic version.
I point this out because how we talk about a problem determines what solutions we even acknowledge as being on the table here. Saying it's a realtime system when it isn't, or thinking we need realtime processing when we don't, makes people throw out solutions prematurely, and the thrown-out solutions are often the right answers.
Once you acknowledge that your system will not be "realtime" and you actually don't have the time-boxing and specific time window delivery constraints that actual realtime problem spaces have, you can weigh all of your actual options with an eye for what will be fastest and most efficient given the budget and hardware you have to throw at this problem.
I've seen some staggering cost savings realized because someone happened to notice that an inefficient implementation that wasn't a problem two years ago at the scale it was running at back then did not age well to the 10x volume it was handling two years later. The reason it hadn't fallen over was that horizontal scaling features built into the cloud products were able to keep it running with minimal attention from the SRE's.
As mentioned in other comments, there are options such as Fargate, that would still technically be "serverless" and still yield similar cost reductions. Not to mention that AWS also has Step Functions Express for "on host orchestration" use cases. This seems like a case where the original architecture wasn't very well researched and neither was the new one.
It is a win. Just not the win they're alluding to.
Even really smart, capable people in general have really poorly calibrated intuition when it comes to the intrinsic overhead of software. It's a testament to the raw computational power of modern hardware I guess. In the case of AWS, it's never been easier to accidentally a million dollars a month.
That summarizes hype-based design very well.
Apparently they didn’t know about the EXPRESS execution model, or the much improved Map state. The story seems to be one of failing to do the math and design for constraints rather than an indictment of serverless.
I have to agree with others - it is amazing this article saw the light of day.
I had a project to work around a bottleneck of the framework. It could only process about 70 CAN frames per second before running out of CPU. The vehicle's CAN bus had several thousand per second, though. At the time I was able to fix the problem by adding filtering to the CAN adapter's kernel module.
A couple years later, I worked on replacing the python based framework with C++. I discovered the underlying root cause of the bottleneck. Someone (cough my manager) had figured out a very "pythonic" way to extract bit-packed fields from the 64-bit CAN frame payloads. They converted every 8-byte payload buffer into a canonical binary representation, i.e. ascii strings of 1's and 0's. They then used string slicing syntax to extract fields. Finally, they cast the resulting substrings back to integers. Awesome!
I've since used python many times to process CAN frames in realtime, scaling up to thousands of frames per second without the CPU breaking a sweat. One trick is to use integer bit shifts and masks rather than string printing, slicing and parsing...
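To make the difference concrete, here is a minimal sketch of both approaches. The function names, the sample payload and the MSB-first bit numbering are assumptions for illustration; the original framework's field layout is not described in the comment.

```python
def extract_field_via_string(payload: bytes, start_bit: int, length: int) -> int:
    """Slow approach: render the payload as a string of 1s and 0s, slice it, parse it back."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    return int(bits[start_bit:start_bit + length], 2)


def extract_field_via_shift(payload: bytes, start_bit: int, length: int) -> int:
    """Same field using one integer conversion, a right shift and a mask."""
    value = int.from_bytes(payload, "big")
    total_bits = len(payload) * 8
    # Move the wanted field down to the least significant bits, then mask off the rest.
    shift = total_bits - start_bit - length
    return (value >> shift) & ((1 << length) - 1)


# Hypothetical 8-byte payload and a hypothetical 12-bit signal starting at bit 4,
# counting bits from the most significant bit of the first byte.
frame = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
assert extract_field_via_string(frame, 4, 12) == extract_field_via_shift(frame, 4, 12)
```

Both functions return the same value; the second avoids building and parsing a 64-character string for every frame.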
They most definitely want to as that would most likely mean more money (and promotions) is flowing there.
~9 months almost a decade ago, so not substantially newer.
More than likely, Prime Video making their numbers look better makes AWS' numbers look (slightly) worse, because they're doing a little less business. In the overarching grand scheme of things, this will save Amazon some amount of physical computing resources they weren't getting paid by an outside customer for, but good luck figuring out how much that actual real world savings is.
Emphasis on "their server less architecture". Sometimes good tools are used poorly.
For example they describe a high throughput workload, and each workload spread through a bunch of lambdas that handled bite size bits of the workflow. Also, they managed the workflow with step functions. Just imagine the number of network calls involved to run a single job, let alone all the work pulling data to/from a data store like S3 into/out of a lambda. I'd guess the bulk of their wall time was IO to set up the computation.
Of course you get far better performance if you get rid of all these interfaces.
I also think it would be way healthier if teams acted as "maintainers" rather than "sole developer" of a service.
For example if team A wants a feature from a service team B manages, they should be free (after communicating that so there is no conflict/work duplication) to just make that feature and submit a pull request to team B.
Then team B can make sure it's up to their standard but that's shorter work than getting the whole machine of "submit a ticket for team B to add feature, find manpower to do it, and schedule work" running.
As a developer, I have certainly seen the same. Pretty sure this very scenario is where I heard the term "away team" used in the industry: send your folks over to change things, and under our guidance they can check in the code.
If you have a single team, you shouldn't be doing microservices.
A simple example: I have a SPA that has the following features: auth(login, logout), dashboard, feature a, feature b. I can write a few very simple lambda functions and deploy these the same way (IaC). What do we (my team) win? We can implement each function in a language we want. You have a feature that is too slow? Rewrite it in Rust. You have an amazing Python lib for feature a? Use Python. What else? We almost never touch auth, so if a feature has a bug it does not impact the entire application. Security is better because we can allow individual functions to access part of the infra they really need to access. Lambda functions can call other lambda functions as well.
Downside is that we cannot use a shared cache that is easy with a monolith. People need to design the boxes well which functionality goes to which lambda function. We have to use distributed trace ids to track requests.
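As a rough illustration of the trace id point, the sketch below threads one id through two functions and includes it in every log line. The handler names, the `trace_id` field and the direct function call standing in for a Lambda-to-Lambda invocation are all assumptions; real Lambda handlers also receive a context argument, which is omitted here.

```python
import json
import uuid


def ensure_trace_id(event: dict) -> dict:
    """Return a copy of the event that carries a trace_id, reusing one if present."""
    traced = dict(event)
    traced.setdefault("trace_id", uuid.uuid4().hex)
    return traced


def log_line(trace_id: str, function_name: str, message: str) -> str:
    """Build a structured log line that can later be searched by trace_id."""
    return json.dumps({"trace_id": trace_id, "function": function_name, "message": message})


def feature_a_handler(event: dict) -> dict:
    """Hypothetical downstream function: logs with the id it was given."""
    event = ensure_trace_id(event)
    print(log_line(event["trace_id"], "feature_a", "handling request"))
    return {"trace_id": event["trace_id"], "status": "ok"}


def dashboard_handler(event: dict) -> dict:
    """Hypothetical entry function: creates the id if needed and passes it on."""
    event = ensure_trace_id(event)
    print(log_line(event["trace_id"], "dashboard", "handling request"))
    # The direct call stands in for invoking another function over the network.
    return feature_a_handler({"trace_id": event["trace_id"], "user": event.get("user")})
```

With a monolith the same request would share one process and one log stream, which is why this bookkeeping only shows up once the functions are split apart.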
Basically cut down the cruft when deploying another small self-contained feature but still keep the code running (savings of a few MB of memory are meaningless if you just have a few dozen features that might run at the same time anyway).
Then I realized it's basically reinventing the ancient idea of "application server" like JBoss and EJB... which is kinda the case for lambda anyway.
I think it's actually quite rare for companies to have data so actually autonomous and unrelated that it does not logically relate to anything else in the organization.
Currently supported on all non-Firefox major browsers. https://caniuse.com/url-scroll-to-text-fragment
Seems Brave is an exception amongst Chromium browsers. They don’t implement it for privacy reasons.
Amazon Retail and AWS are the same legal entity for stocks, but other than that they might as well be separate companies.
Retail uses AWS with all the same APIs and quirks as any other company. The only thing different is the negotiation on price (which many large companies also do).
Meanwhile, AWS is apathetic towards feature requests from Retail, and especially operational support for Retail.
In many ways Retail would be better off if it was a separate company and could threaten AWS with a multi-cloud diversification play.
This is why microservice costs often far outweigh the benefits, but they rarely consider the cost in their crusade to 'break up the monolith'.
Google "continuous profiling".
I'm not sure why you would think that that reinvents microservices.
Microservices is just taking a monolith and moving the components into separate processes that communicate via RPC.
> Microservices is just taking a monolith and moving the components into separate processes that communicate via RPC.
Microservice architecture divides the responsibility much more than that. Each service has its own redis cache, local cache, tests, and likely even a different DB etc.
Honestly sounds like something I'd do but I've never programmed anything more dangerous than a toaster let alone a car.
I once threw together a mylar balloon helium blimp in the shape of a Dragon space capsule. My goal was to fly it over the cafeteria crowd at SpaceX during the C2 launch. For control, I used the PCB of a travel wifi router. I soldered three small DC motors to its LED outputs. The embedded software consisted of something like:
nc -l -u -p 10000 | bash
I then connected my laptop to the access point and ran a python script that would send UDP packets containing shell commands to toggle the LED GPIO pins based on arrow keypresses.
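A minimal sketch of what the laptop side could look like, assuming the router listens on UDP port 10000 as in the `nc` line above. The key names, the LED paths and the router address are placeholders; the comment does not give the real GPIO names, and reading actual arrow keypresses is left out.

```python
import socket

# Hypothetical mapping from a key name to the shell command the router should run.
# The LED names are placeholders, not the ones on the actual router.
KEY_COMMANDS = {
    "up": "echo 1 > /sys/class/leds/led0/brightness",
    "down": "echo 0 > /sys/class/leds/led0/brightness",
    "left": "echo 1 > /sys/class/leds/led1/brightness",
    "right": "echo 1 > /sys/class/leds/led2/brightness",
}


def command_for_key(key: str) -> bytes:
    """Return the newline-terminated shell command for a key, or b"" if the key is unmapped."""
    command = KEY_COMMANDS.get(key)
    return (command + "\n").encode() if command else b""


def send_command(sock: socket.socket, address: tuple, key: str) -> int:
    """Send the command for a key as one UDP datagram; returns the number of bytes sent."""
    payload = command_for_key(key)
    if not payload:
        return 0
    return sock.sendto(payload, address)


# Example use (address is an assumption):
#   with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
#       send_command(sock, ("192.168.1.1", 10000), "left")
```

Piping whatever arrives on a port straight into a shell is of course only reasonable for a toy like this on an isolated access point.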
The crowd really enjoyed the novelty. After the excitement was over, I flew it around some more in the cafeteria. Elon Musk walked up to it floating in the air, paused for a few seconds, then looked around the room trying to find the operator. I was just like any other employee hanging out at a table casually typing on my laptop, though.
Good times. On my last day there I still had a helium tank under my desk. So, I filled up a life-sized Elmo balloon (a left over prototype), then let it float up into the rafters of the office. It was presumably up there for a month or two.
> nc -l -u -p 10000 | bash
That's a neat idea. Did you have to flash it with a custom firmware or do they typically come with netcat etc installed?
1. https://dev.l1x.be/posts/2023/02/28/using-python-3.11-with-a...
> Lambda also optimizes the image and caches it close to where the functions runs so cold start times are the same as for .zip archives.[0]
This[1] article shows almost no discernable difference in .NET cold start times between containerised and regular lambdas.
It's easy to imagine developers pushing up bloated images, slowing startup down and blaming docker/AWS for it.
[0] https://aws.amazon.com/blogs/compute/working-with-lambda-lay...
[1] https://www.kloia.com/blog/aws-lambda-container-image-.net-b...
I've done the AWS solutions architect associate level cert and I can tell you from first-hand experience that in order to pass the exam you need to memorize a lot of AWS propaganda that was written primarily to optimize AWS profit, not to optimize customer satisfaction. How many of those solution architects take those materials with a grain of salt vs how many of them genuinely believe that crap, I don't know.
I'm still trying to figure out which one Aurora and Cognito fall under.
In this article, they didn't bolt the serverless architecture onto another existing monolith, but rather rewrote the Step Functions and Lambda functions to be a single ECS task.
When I have 200 things that cost $5/mo each to run but fit nicely on a single 8core/32gb ram server.. then this lambda stuff starts to seem crazy expensive right?
I like AWS and I would still recommend it. It saves some work but also creates new stuff to do. Especially if you also want the costs to be manageable. Automatic updates, configuring a firewall + reverse proxy with automatic certificate renewals and your favorite deployment mechanism isn't more complicated or labor intensive than managing a small application with AWS. You need to interface it just like software you run on your server.
One of the services I host needed to be authenticated by IP. Happens. You easily get a static IP on AWS for incoming traffic. No problem and cheap too. Now try to get one for the other direction... Possible too, but maintenance just became at least as labor intensive as hosting your own machine. AWS just has to fit your scenario and I think many people overestimate how comparatively easier it has become to host a server with feasible security today. Chances are your databases would be less public than if you skip the AWS documentation.
I think you might be unintentionally arguing with a strawman, as everyone else here is talking about using monoliths instead of that.
Few people want to administer a bunch of micro services themselves, but running a single service on a box is pretty low effort, even if you duplicate it for fail over/redundancy
For example you'll have to read fine print to find out that 256MB lambda will have the compute power of a 90s desktop PC because compute scales with memory. And to get access to "one core" of compute you have to use like 2GB of memory.
Now you may say "serverless isn't geared towards compute" - but this kind of CPU bottlenecking affects rudimentary stuff - like using any framework that does some upfront optimizations will murder your first request/cold start performance - EF Core ORM expression compiler will take seconds to cold start the model/queries ! For comparison I can run ~100 integration tests (with entire context bootstrap for each) against a real database in that time on my desktop machine. It's unbelievably slow - unless you're doing trivial "reparse this JSON and manually concat shit to a DB query" kind of workloads.
You could say those frameworks aren't suited for serverless - or you could say that the pricing is designed to screw over people trying to port these kinds of workloads to serverless.
If you went to a car rental and they told you we have a cheap car that's slower when you add passengers - and then you drive it to pick up your wife and it turns out it only goes 20 km/h when your wife gets in - you would be rightfully mad. You could say "why didn't you ask for specifications" but you have certain expectations of what a car should behave like and what they gave you doesn't really qualify as a car no matter if their disclaimer was technically correct.
I don't care about what is the equivalent computing power in 90s desktop measurement because you cannot replace a lambda function with a 90s desktop, so it is pointless.
The right approach is: I have a problem A that I can implement using AWS Lambda, AWS EC2 or your favourite DHH approved stack; how do the costs of these compare to each other?
90s CPU comparison is just to demonstrate how out of place it is with what people are used to even on lowest tier hosts with shared CPU cores. Low ram compute seems to be artificially limited to make low ram lambdas useful in very narrow use cases.
For reference I have a devops team in-company that deployed and maintained several AWS projects, including some serverless, even they were surprised at the low compute available at low RAM lambdas.
It does look like they replaced the serverless implementation of a service with a hosted app because this service wasn't scaling.
They don't really communicate around the architecture of the whole Prime Video product but it doesn't look like a monolith.
A single team managing 10 microservices that actually make sense to be microservices (like the PDF renderer example above [1]) is kinda good and perfectly manageable.
A team with one single microservice that would actually work better if it was part of a monolith is already in the "creates more problems than it solves" territory.
Having hundreds of engineers work in a single monolith in a single repo without any kind of (enforced) boundaries is a one way ticket to a big ball of mud. You need to invest heavily in tooling to make it work, and e.g. Google does so.
Having a network in between teams is a relatively easy way to enforce boundaries.
Moved the monolith to .NET Core, kept the report service on .NET Framework. A win for everybody.
It's fine to just have plain "services" to do things like this where you need to leverage another OS/framework/whatever and just hive off something like PDF conversion while your core application remains a monolith.
Non-copyable build outputs sound a bit wild - you're thinking of builds that encode absolute paths into the output binaries?
I still recall the day when my local Docker builds necessitated a new router to properly manage streaming traffic at home while I downloaded a few GBs of layer images. Or the time I wanted to setup a 'simple' hosted Kubernetes cluster of my own in my lab for testing, only to discover the nightmare that is networking on it. Then there was the grim discovery that Docker containers were much more sensitive to the hosting environment than I had assumed, resulting in some fun "but it worked and tested fine" moments.
Did they all work eventually for me? Yes. Was it simple? Not by my standards.
If you find all that simple, kudos to you.
Like, compared to implementing hitless rollback over bare metal services, the k8s way is "easy": just set some stuff in YAML and have proper healthchecks in your app.
I wish I had infinite time to document all the issues. This isn't a small nuanced detailed thing - it falls deeply, systemically, fundamentally short and in practice you still get the magical monolithic system it tried to kill but now with more obfuscation, complexity and a theatrical sleight of hand to convince yourself it isn't that.
Instead of the server being configured for the monolithic app, it's now extensively and carefully configured for the myriad of containers, hostnames, configurations and connections of the containers running the microservice app.
It's in practice the same problem with a different costume.
The other promise of it being a collection of smaller constrained services running on tcp ports talking to each other ... that's nothing new. You've invented the idea of computer networks.
I'm not so sure it's a black/white true/false. Depends on what goes in the docker image. It's something like for larger deployments docker is faster but for small deployments it's the other way.
Separate services aren't a silver bullet, but as an fyi to the younger software developers, we tried "just have all teams work on the same code base and deployable artifact" for a long while and it didn't work very well either.
Now in one place such an engineer uses pandas 2, in another place pandas 1, but it is just one single app. What does it say about the quality of engineering and mental focus of a solo developer who cannot accomplish the same thing with the same API, or cannot refactor the code already written for pandas 1 to pandas 2?
Sounds to me like more of an engineering discipline and engineering mindfulness problem.
Fix is simple with a simple rule - everyone has to use the latest major version, always.
Micro services do not make any sort of people's communication go away, they move it to different boundaries. From dependencies to the business layer/interfaces, which is a lot harder to navigate and negotiate.
Imagine needing a field in your downstream service. They refuse because they don't see it as their domain and you cram it on your side and what not. Ask anyone working in a micro services environment and they'll tell you it is a recurring issue every quarter if not more.
Just press this button to start the upgrade build and....boom! 10,000 services and their dependencies being built on a ton of hardware; we can practically guarantee your change in dependency will be checked... Whoops, turns out your one dependency change cascaded into about 1.5% breakage....no, I don't know who owns those packages; why do you ask? That's not my job!
/s
Memory is the principal lever available to Lambda developers for controlling the performance of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The Lambda console defaults new functions to the smallest setting and many developers also choose 128 MB for their functions.
https://docs.aws.amazon.com/lambda/latest/operatorguide/comp...
CPU Allocation
It is known that at 1,792 MB we get 1 full vCPU (notice the v in front of CPU). A vCPU is “a thread of either an Intel Xeon core or an AMD EPYC core”. This is valid for the compute-optimized instance types, which are the underlying Lambda infrastructure (not a hard commitment by AWS, but a general rule).
If 1,024 MB are allocated to a function, it gets roughly 57% of a vCPU (1,024 / 1,792 ~= 0.57). It is obviously impossible to divide a CPU thread. In the background, AWS is dividing the CPU time. With 1,024 MB, the function will receive 57% of the processing time. The CPU may switch to perform other tasks on the remaining 43% of the time.
The result of this CPU allocation model is: the more memory is allocated to a function, the faster it will accomplish a given task.
https://dashbird.io/knowledge-base/aws-lambda/resource-alloc...
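The arithmetic in the quoted passage can be sketched as below. The 1,792 MB figure is the one quoted above (described there as a general rule, not an AWS commitment); the function name and the choice to cap the result at one vCPU are assumptions made to keep the example simple.

```python
# Memory setting at which a function is said to get one full vCPU (figure from the quote above).
FULL_VCPU_MB = 1792


def vcpu_share(memory_mb: int) -> float:
    """Approximate fraction of one vCPU for a memory setting, using the linear rule quoted above.

    Capped at 1.0 as a simplification; larger settings are not modelled here.
    """
    return min(memory_mb / FULL_VCPU_MB, 1.0)


for memory_mb in (128, 256, 1024, 1792):
    print(f"{memory_mb:>5} MB -> {vcpu_share(memory_mb):.0%} of a vCPU")
```

Under this rule the 256 MB setting discussed earlier in the thread works out to about one seventh of a vCPU, which is why CPU-heavy startup work feels so slow at small memory sizes.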
> For reference I have a devops team in-company that deployed and maintained several AWS projects, including some serverless
Same here.
> even they were surprised at the low compute available at low RAM lambdas.
I wasn't because we measured it and based on the measurement we calculated what we want. I think it is a good approach not to assume anything.
I have lost track of how many times I did a git pull on a Python based solution only to find I broke all the things when I tried to upgrade one package.
Do you need a screenshot and red box around the text or would you believe me if I tell you it is written on their Lambda pricing page near the beginning? It's also written in the docs about configuring Lambda functions, so at this point it is a PEBKAC/RTFM issue, not "them not being upfront".
And frankly it is done that way because they have standardized machines; scheduling CPU heavy/memory light and CPU light/memory heavy is extra complexity. I mean, they should, but they have no real incentive to, as in most cases apps written in slower languages are also memory-fatter so it fits well enough.
> If you went to a car rental and they told you we have a cheap car that's slower when you add passengers - and then you drive it to pick up your wife and it turns out it only goes 20 km/h when your wife gets in - you would be rightfully mad.
Getting lowest tier one is more like renting a 125cc bike than a car if anything. You can do plenty with that limit in efficient language too.
Simple CPU time calculator on the pricing calculator page when you enter the RAM would be sufficient, linking to the said docs. Trivial to implement, really cleans up things when planning resource costs.
In software engineering, a monolithic application describes a single-tiered software application in which the user interface and data access code are combined into a single program from a single platform.
"mono" stands for one/alone/singular, so monolithic is kinda defined to be exactly that, yes.
You can still have multiple monoliths, but they wouldn't communicate with each other and would be entirely separate applications.
Cloud vendors also mostly sell minimum use packages for discounts in the range of 20 to 80% (called e.g. "committed use discount" or "compute savings plan"). Lots of businesses use those, because two-digit discounts are real money, but they might find themselves in the same spot as with physical hardware they don't need...
And cloud proponents pretend data center / rack space / server leasing doesn't exist either, for those trying to avoid large up front costs.
It also means some poor fuck at AWS gets woken up in the middle of the night instead of me when things go to shit.
It absolutely comes at a cost, and might not be the right fit for an organisation that's absolutely on top of its hardware requirements and can afford to divert resources from new development work. For the rest of us it saves a lot of dev hours that would have otherwise been spent in pointless meetings or debating the best implementation of whatever half-baked stack has oozed its way out of the organisation in an attempt to replicate what's handed to you with a cloud solution.
Also it's going to be simpler to provision your base (committed use) on the cloud and then handle bursts on the cloud, than it is to have your base on prem and burst to the cloud.
You can lease physical servers, turning it into opex.
You can also rent them for little bit extra via managed dedicated servers from vendors like OVH.
Yes, but that sunk cost is probably still lower than what you paid AWS for the option to scale up and down.
If a business invests in cloud infrastructure and signs a binding contract for 5 years, only to find out that it actually wants to abandon that project a year later, that is also a sunk cost. Long-term contracts tend to be cheaper, so it's a trade-off between saving money and risk.
It all depends on the risk analysis, how risk-averse one wants to be, and the economics/liquidity needs.
You don't need to go all in on buying racks and other hardware, when you can rent servers at Hetzner at a cheaper cost than aws.
Really though, it seems like a hybrid on-prem/cloud approach is one to consider. Software like Anthos eases this, though there are pitfalls with this approach too.
Although in this specific case, being a team at AWS, they are using their own company data centers, so it's essentially on-prem to them.
A really cheap server leasing deal will cost you yearly about as much as the purchase price of the server. With opaque AWS services it is probably more like a month of subscription to pay for the hardware that you are indirectly using.
They were entirely unusable.
Opening a relatively small file in notepad could take multiple minutes. OS click and typing response times were measured in seconds.
Despite wasting thousands of developer hours each year, they refused to upgrade their data center. Probably because doing so would have been a major budget fight that requires an executive to actually advocate for something instead of making their characteristic animalistic grunts of agreement.
For better or worse I haven't seen the same issue with cloud expenditure. It seems to be perceived as a necessary expense, rather than the engineering department getting ideas above their station.
Thankfully after they understood the problem it only took 8 months of procurement, techs going to the data center 10+ times with endless screw ups, and everyone pointing the finger at each other.
While the cloud sucks in many ways the traditional setup has big problems as soon as you hit a midsize company ime.
2. Add a test with a network dependency, and when that dependency is slow / down / turned off, the test starts failing.
3. Add a dependency on a third-party Github repo that clones from `main`, and the next time some dev touches a file in that repo your test starts failing.
4. Add a test that allocates memory in proportion to size of the codebase (e.g. because it tries to build a giant in-memory tarball of all the .mp4 assets). Eventually it will get flaky when it starts scraping up against the build machine's limit. Extra fun if your builds run without defined memory limits on machines of different sizes.
In a monolithic build, there's all sorts of ways for a single person to cause other teams' tests to fail, even months or years after they've left the company. Some of them can be prevented mechanically (such as by running tests without network access), but a lot come down to "tell them to stop doing that".
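As one illustration of the "prevent it mechanically" option, here is a minimal sketch that makes tests fail fast if they try to open a socket. It only patches Python's `socket.socket`, so it is not a real sandbox (a subprocess or a C extension could still reach the network); `no_network` and `NetworkBlockedError` are names invented for this example.

```python
import socket
from contextlib import contextmanager


class NetworkBlockedError(RuntimeError):
    """Raised when code under test tries to create a socket."""


@contextmanager
def no_network():
    """Temporarily replace socket.socket so any attempt to open one fails."""
    original = socket.socket

    def guarded(*args, **kwargs):
        raise NetworkBlockedError("network access is disabled in tests")

    socket.socket = guarded
    try:
        yield
    finally:
        # Always restore the real class, even if the test body raised.
        socket.socket = original


if __name__ == "__main__":
    with no_network():
        try:
            socket.socket()
        except NetworkBlockedError as exc:
            print("blocked:", exc)
```

Wrapping the test runner in something like this turns "the dependency was slow today" into an immediate, attributable failure for whoever added the dependency.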
That's why big companies never run one build per repo.
IRT (2), network dependencies were forbidden in general. Over any long enough timespan, the rate of failure is 100%. If you wanted to use the network, you had to consider the failure case and handle it in your tests.
For (3), all dependencies were committed as part of the repo. All dependencies had to be reviewed for any issues before being used, so this made sense. You simply weren’t allowed to just randomly include a new dependency without a review/PR to add it.
For (4), our dev environments had less memory than build machines and the same as production. If you couldn’t build it in a dev environment, it wasn’t getting committed without special treatment from dev ops (and a really good reason).
You can just... not allow code not passing test into master branch. They can fuck around in their own one, that's what branches are for
2. How about an old way of blocking deployments less: parts of the monolithic app are developed as libraries with a stable API, and a new version of a library is released only after its own tests have passed. Then you bump the dependency version in the monolith and run the integration tests. If they fail, you can revert to the old library version and still go ahead with the deployment (as long as you don't depend on something added in the latest library release). If you do depend on the new feature and the component providing it is broken, microservices would not help you either.
> If you are only a few hundred engineers, monorepo with monolithic deployments and tests work fine.
And here lies a very important problem IMHO: many (if not the majority of) organizations that do at least some software development have fewer than 100 software developers, but the industry best practices (which include microservice architecture) are defined by FAANG-sized organizations, and at least some of these practices are sub-optimal for small shops.
This is not a commonly known fact. Just to take an example of GitHub, this check is disabled by default:
> Require branches to be up to date before merging
> Whether pull requests targeting a matching branch must be tested with the latest code.
Because this really doesn't scale well with the number of developers.
For larger teams the solution is to use a merge queue, e.g. https://shopify.engineering/successfully-merging-work-1000-d...
I haven't tried it, but GitHub is now offering such a feature in public beta: https://github.blog/changelog/2023-02-08-pull-request-merge-...
I'm sure this happens occasionally, but I've never experienced it, and it seems to be rare enough that it's not that big of a concern. Especially since it'll be easily remedied by either just fixing the error or just reverting one or both of the changes.
The last thing I want to do is have my build broken by someone on another team and then have to track them down and babysitting the revert. That is easily an hour of my time wasted.
That's my experience at least. Things still break, you just notice it later.
Not going with a big cloud provider def doesn’t mean that you need to buy physical servers and build an on-prem data center.
...while forgetting that to have a sane on-call rotation for cloud you also need at least 3 people on that rotation who are clued in enough on cloud operations. Sure, they can be "developers", but if your app architecture requires so little maintenance and flea removal that they aren't doing much ops work, chances are it would need just as little in a rented or dedicated server environment.
And endless orgies of "call for pricing" with hardware vendors and hosting. Shitty websites where you can buy preconfigured servers somewhat cheaply, or vendor websites where you can configure everything but overpay. Useless sales-droids trying to "value-add" stuff on top.
Cloud buys are a lot friendlier, because you only have the one cloud vendor to worry about. Entry level you just pay list price by clicking a button. If you buy a lot, you are big enough to have your own business people to hammer out a rebate on list price, still very easy, still very simple. But overall still more expensive unfortunately.
I'd hope there aren't actually hours of meetings for a single $5/mo VM?
But I would hope there are reviews and meetings when deploying enough of these to amount to real money. Companies that don't do that soon enough find themselves with a million dollar AWS bill without understanding what's going on.
Spend is spend, it's vital to understand what is being spent on what and why.
Slightly exaggerated in the case of the $5 machine, probably 2-3 man-hours total, but it took 4 days for it to be deployed instead of ~5 minutes. We did spend tens of hours justifying why the business should spend ~$100 more per month on a production system where the metrics clearly indicated that it was resource constrained.
The same IT department that demanded we justify every penny spent did not apply any of that rigour to their own spending. Control over the deployment of resources was used as a political tool to increase their headcount.
> I would hope there are reviews and meetings when deploying enough of these to amount to real money. Companies that don't do that soon enough find themselves with a million dollar AWS bill without understanding what's going on.
I consider the judicious use of resources to be part of my job as a software engineer. A development team that isn't considering how they can reduce spend, tidy up, or right-size their resources is a massive red flag to me. Organisations frequently shoot themselves in the foot by shifting that responsibility away from the development team. The result is usually factional infighting and more meetings.
Typing/executing in Powershell was just as slow
When committing, do a ff-only of ‘main’ to your branch. Yes, this forces everyone to rebase before “merging” but in practice, this results in the least amount of failures, tests being run after you resolved any conflicts, etc.
If you can use GitHub merge queues, that solves a ton of this, and you can run tests on the final merge before actually merging instead of relying on rebasing.
This. It makes life so much simpler. With teams that don't have a lot of experience with git, however, I tend to use the "Squash and Merge" feature, coupled with forcing a linear history.
Alternatively, if you have a low enough merge volume, requiring mergers (by policy) to squash and rebase (and re-run tests before attempting to merge) can work too, as others have already mentioned.
Once you have on-premises you need people that know switches, routers, rackmount server, hardware, virtualization, etc, plus keeping all of that properly maintained (security patches, IaC, periodic updates, analyzing performance, making sure it's properly architected, etc).
I often see people saying it's the same cost or less but it's really not. Unless you have no idea what you should be doing.
Virtualization, IaC, analyzing performance, right architecture etc is all for later, when you've grown enough to need that.
Yeah, I think it might be a different perspective about when that all should be done.
I tend to do that right from the beginning because I often see it snowball later on and nobody ever fixes it or does it "properly" (in my opinion, possibly not the right one).
But that's a good point, no doubt.
A cloud vendor (who will be nameless as I signed an NDA specifically that prevents me from disparaging them; but one of the big three) ran out of capacity for me and it was 3 months before they managed to fix it. -- that was with a couple million a month in spend.
Cloud is still servers; you just depend on someone else's capacity management skills and you hope that there isn't a rush to populate a location (like when a region goes down and everyone's auto-provisioners move to your region).
I have to deal with a grumpy finance guy that thinks my whole department is overpaid already, especially so if we might use the dreaded `CapEx` word.
> Nope. AWS makes it dead simple to move from RDS to Aurora by clicking a button. There's no way to move data from Aurora to RDS short of doing a SQL dump and reloading everything that way. I found this out when my previous employer was looking at moving from RDS to Aurora.
I got a bit of a chuckle out of this. There's no way to move from Aurora to RDS short of... 2 minutes of actual work and a lot of waiting around due to the limitations of the hardware?
I get that it's not as easy as a literal button click, but this isn't vendor lock in.
Physical replication should be something any database can offer, given it's something cross-database migrations used decades ago with no problem.
Just like publications that are easy to subscribe to online but make you spend 2 hours on a call with a rep pushing discounts or whatever before you can cancel.
Just not cool
I will also mention that the AWS team we were working with on this didn't mention DMS, and, when directly asked, literally told me there was no easy way to do an Aurora -> RDS migration.
Previously it wasn't great, but since they started using Logical Replication for Postgres, it's gotten far better than it was in the past.
A monolithic build means that your ability to develop and deploy your team's code is dependent on every other team. As the number of teams gets larger, that multiplier really hurts.
I'd just try to stay away from Amazon if I could.
There were a thousand automated checks to prevent you from doing the same thing as someone else that caused downtime in the past. It was virtually impossible to commit code that deleted/truncated a table, for example.
The sarcasm was warranted.
I'm not sure what I was supposed to take away from your cryptic sentence. Is there a two minute solution to this problem that you are smugly keeping to yourself so you can mock people replying to you?
"There's no way to move from Aurora to RDS short of... 2 minutes of actual work and a lot of waiting around due to the limitations of the hardware?"
You seem to be having trouble getting past the word "and", so I've helpfully italicized the part you've repeatedly missed or ignored.
Now sure, that's a bit vague, but if you want more details it might have been advisable to ask a question rather than simply ignoring half the sentence because you don't understand it and jumping in with a correction.
And honestly, even if it's vague on some details, there's no universe in which "2 minutes and a lot of waiting around" = "2 minutes". Whatever vagueness you might accuse me of, that fact isn't vague.
That is a fair answer! AFAIK, there are two things to consider:
* Aurora does have some vendor lock-in features, if I'm not wrong?
* Moving from Aurora to Postgres will lead to downtime of 2 minutes + unknown waiting, whereas this is not the case when you convert from Postgres to Aurora.
Consider: is that impracticality caused by Amazon creating vendor lock-in? Or is that impracticality caused by the fact that reading terabytes of data from storage, transferring it over the network, and writing it into storage is inherently slow because of the physical limitations of hardware, no matter what vendor you're using?
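A back-of-the-envelope sketch of the "physical limitations" part, assuming the copy is limited by one sustained throughput figure and ignoring indexing, verification, and retries. The sizes and throughput below are made-up inputs, not measurements of any AWS service.

```python
def transfer_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained throughput.

    Uses decimal units (1 TB = 1,000,000 MB) and assumes the slowest
    stage (read, network, or write) sets the overall throughput.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    size_mb = size_tb * 1_000_000
    return size_mb / throughput_mb_per_s / 3600


if __name__ == "__main__":
    # Hypothetical: a 5 TB database dumped and reloaded at 200 MB/s sustained.
    print(round(transfer_hours(5, 200), 1), "hours")
```

With those made-up inputs the copy alone takes roughly 7 hours, whichever vendor's hardware is doing it.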
It's a bit odd for me to be in the position of defending Amazon here. I genuinely don't like them, don't use them, and generally do think they're guilty of creating a lot of vendor lock-in. But this is legitimately not an example of any of that.
If the CEO asked how long a migration will take would you respond "2 minutes of engineering time"?
Obviously I'm going to spend a lot more time communicating the details of the situation to the CEO of a company that is paying me, than I'm going to spend communicating in a Hacker News comment. But as it turns out, no amount of communication is going to be effective if people don't bother to read past the first opportunity they see to jump in with a correction, even if that means stopping reading mid-sentence.
As is, I've no responsibility to show you anything, and you're just making unwarranted assumptions about what I have and haven't considered, based on a pretty selective reading of what I've said.