Scaling up the Prime Video audio/video monitoring service and reducing costs

Scaling up the Prime Video audio/video monitoring service and reducing costs (primevideotech.com)

989 points by debdut 3 years ago | 507 comments

My word. I'm sort of gobsmacked this article exists.

I know there are nuances in the article, but my first impression was it's saying "we went back to basics and stopped using needless expensive AWS stuff that caused us to completely over-architect our application and the results were much better". Which is a good lesson, and a good story, but there's a kind of irony it's come from an internal Amazon team. As another poster commented, I wouldn't be surprised if it's taken down at some point.

BbzzbB 3 years ago | |

There was an article not long ago from AWS saying they'll be focussing on cutting cost for customers. Maybe the next step of that process will be pushing their clients off of AWS and telling them to just host on prem.

kypro 3 years ago | | |

I know you're joking around, but no. As they also explained, a benefit of cloud (and therefore of using AWS) is that it can scale flexibly with their customers' businesses.

If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.

With cloud if demand drops you can scale up and down as needed. Helping customers cut costs during difficult times makes sense since those customers are more likely to survive and stay with you through good times.

So in context I think this article makes sense since long-term sustainable growth of AWS should be linked with the growth of their customers' businesses.

nodefortytwo 3 years ago | | |

To be fair to AWS, they do work really hard (at least at an account level) to optimize workloads with you. They do this so overall you'll move more workloads to them.

It's quite simple: if workload X can be done 100% cheaper on-prem, then it's an obvious move (probably). If AWS manages to get that closer to 30-40%, then the operational benefits of using AWS make more sense: more workloads, more total spend.

bootsmann 3 years ago | | |

Still waiting for python 3.11 on lambdas so must not be that big of a focus.

(They finally delivered 3.10 last month at least)

re-thc 3 years ago | | |

> Maybe the next step of that process will be pushing their clients off of AWS and telling them to just host on prem.

And then charging them to use AWS anywhere and outpost!

donalhunt 3 years ago | |

Archived:

https://archive.is/LFtNg

http://web.archive.org/web/20230504060528/https://www.primev...

dragonwriter 3 years ago | |

> Which is a good lesson, and a good story, but there’s a kind of irony it’s come from an internal Amazon team. As another poster commented, I wouldn’t be surprised if it’s taken down at some point.

Why? Using the model they switched to (which uses a different set of AWS services) instead of the model they switched from is a recommendation that the AWS tech advisers that are made available to enterprise customers will make for certain workloads.

Now when they do that, they can also point to this article as additional backing.

lucideer 3 years ago | | |

Have you had AWS tech advisers advise teams in your company to go with this stack? Because I haven't.

AWS doesn't have an equally distributed interest in selling all of its products. Some AWS products exist because customers need/demand them and others exist because they provide higher margins and tighter lock-in to Amazon: the first type of products are great for customer acquisition, the role of their sales folk is to then convince people using the former to migrate to the latter.

adql 3 years ago | | |

The "different set of services" is so basic it can be run anywhere, and near-everywhere is cheaper than AWS.

Aeolun 3 years ago | |

I feel like it’s an object lesson in using the right solution for a problem. Step functions do not appear to me to be something that you’d use for things that need to be executed multiple times per second.

jon-wood 3 years ago | | |

Having occasionally looked at them for workflow-driven tasks, I'm not sure what the use case for Step Functions is. Unless your workflow is being called once an hour or something, they seem infeasibly expensive for what they offer, and somehow manage to be more complex than just writing some code to model the workflow.

Hamuko 3 years ago | | |

Yeah, this is my takeaway too.

I'm pretty happy with the monolith that we run at our business and this seems to validate our decision to stick to that monolith, but I'm also pretty confident that where we use AWS Lambda, serverless is absolutely the right way to go.

For example, I've written a Lambda application to reply to webhook calls and send API calls whenever those come in. It costs maybe $2 per month to run in compute and requests. Would that make more sense to rewrite as a monolith and run on EC2? I really doubt it.

motbus3 3 years ago | |

I think it is fine. There are scenarios where you need distributed and there are scenarios where you don't.

IMO, distributed software is adopted more for practical development reasons than for technical ones.

We all know from the basics that performant software comes from single structures that do not require packing and unpacking data. But scaling large applications is hard, and it was much more expensive back then. Now that we have overreacted to microservices, we will overreact to monoliths again. And we will bounce many more times until AI takes our jobs and does the loop itself.

Jenk 3 years ago | |

The cynic in me (so like 93% of me) reads this as "Instead of abandoning AWS altogether, we changed how we use AWS, but most importantly we're still on AWS".

fnordpiglet 3 years ago | |

As an ex-AWS senior dude, we never looked at our service stack as a sell at any cost, but as a continuum of service offerings that could be assembled to be more cost optimal at higher operational burden to (mostly) ops free at a higher premium. The goal was to provide a lego kit of power tools and disappear from view tools. At least in my org we never tried to upsell or convince customers of architectures that accreted revenues at their expense; we tried to honestly assess their sophistication and desire for ops burden and complexity vs cost savings by building it themselves with the lower level kit. By our measure using AWS brought us business, and we were generally more motivated by customer obsession over soaking them. I know Andy definitely had that view and drilled it into our collective heads. In many ways as an engineering minded person I appreciated the sentiment as I enjoy solving problems more than screwing people out of their money for sport.

helsinkiandrew 3 years ago | |

> I wouldn't be surprised if it's taken down at some point ...

Why? They're still using "AWS stuff" - EC2 and ECS etc. Serverless is a fraction of the services AWS offers.

AWS actively promote ways of reducing customers' bills. This article could be considered a puff piece for the AWS Compute Savings Plan:

https://aws.amazon.com/savingsplans/compute-pricing/

blowski 3 years ago | | |

Exactly. You could easily frame it as "if AWS seems expensive, you're using it wrong". That an internal team could get it so wrong is testament to how difficult it is to get right, but of course, there's a consultant for helping with that.

dmw_ng 3 years ago | |

The smoking gun is probably the box that was previously labelled "Media Conversion Service" (Elemental MediaConvert - easily 5-6 figures/mo. for a small amount of snappy on-demand capacity, or crippled slow-as-molasses reserved queues) now labelled "Media Converter" running on ECS. For example, vt1 instances are <$200/mo. spot and each instance packs enough transcode to power a small galaxy, for fine-grained tuning an equivalent CPU-only transcode solution isn't that much more expensive either.

At some point the industry will wake up to the fact the AWS pricing pages are the real API docs, meanwhile dumb shit like this will keep happening over and over again, and AWS absolutely are not to blame for it, any more than e.g. a vendor of cabling is guilty of burning down the house of someone who plugged 10 electric heaters into a chain of double-gang power extension cords

joelhaasnoot 3 years ago | |

Half of the AWS certifications aren't about what's what, but about what to use when, and using it for the right use case.

datadeft 3 years ago | | |

Exactly right. Most cloud victims are people who have faith instead of cost calculations. DHH & co. are the prime example. It seems even Amazon has such people. I guess hiring is much harder nowadays.

benjaminwootton 3 years ago | |

That was my reaction too. I know Microservices doesn’t equal cloud, but putting a big monolith on a big server is tangential to AWS interests to say the least!

jasonlotito 3 years ago | |

> but there's a kind of irony it's come from an internal Amazon team

Not at all. My time working with AWS reps, they never pushed a particular way of doing things. Rather, they tried to make what we wanted to do easier. And the caveat was always to test and make decisions on what was important to us. This isn't an anti-AWS article. Rather, it's exactly the type of thing I'd expect from them. Use the right tool for the right job.

djtango 3 years ago | |

>Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis.

Tldr build the right thing.

>"AWS sales and support teams continue to spend much of their time helping customers optimize AWS spend so they can weather this uncertain economy," Brian Olsavsky, Amazon's finance chief, said on a conference call with analysts.[0]

Amazon isn't afraid of this trend, they're embracing it. Better to cannibalise yourself than be disrupted by someone else

https://twitter.com/DanRose999/status/1287944667414196225?s=...

[0] https://www.cnbc.com/2023/04/27/aws-q1-earnings-report-2023....

pauby 3 years ago | |

It's already on the Wayback Machine https://web.archive.org/web/20230504060528/https://www.prime...

steveBK123 3 years ago | |

Yeah this article seems like heresy for someone at Amazon to have written about AWS, no way it lives long.

j45 3 years ago | |

Around 2008 the idea of microservices was looked down on, until it wasn’t.

The key is to look down on nothing, become competent with multiple architectures, and know which ones not to implement in a use case if the one to use isn’t clear right away.

seydor 3 years ago | |

Maybe they'll publish the opposite results in 6 months

goodpoint 3 years ago | | |

And someone else will get promoted.

credit_guy 3 years ago | |

I don't read it like that at all. Both solutions use the Amazon cloud. Only in one solution you distribute a lot of processes, just because it's possible, and easy to code. When they figured out that rampant distribution was costly, they put more thinking into keeping a lot of computation in the same place (so, "monolith", but still in the cloud). No surprise, they found great savings. If they hadn't, they would not have written about it. But they had to put some (most likely major) effort into redesigning the application.

emodendroket 3 years ago | |

I don’t really agree that this somehow exposes those tools as bad. It more shows that they weren’t that well suited for this particular use case.

seanhandley 3 years ago | |

It's been online for 2 weeks already.

abrookewood 3 years ago | |

Yep, expect the Lambda team to raise hell.

djtango 3 years ago | | |

Probably an unpopular take and my experience is almost 10 years old, but I would be surprised to see the Amazon I worked at try to bury something like this. If the product isn't what the customer wants, it isn't what the customer wants - move on and build something the customer wants.

Yes, agreed, there was some funny business like not selling Chromecast, but the guiding principle was generally to make things customers want...

onion2k 3 years ago | | |

Do you think the Lambda team want people to use as many of their services as possible even when it's not actually appropriate and there are better architectures and approaches available? I doubt that. They probably understand that Lambda is a good service for some things and not for others, and using it as a part of deploying things to AWS is a great idea but using it where it doesn't fit makes all of AWS look bad (in particular, hard to use and expensive.)

oblio 3 years ago | | |

Especially now that it's on the frontpage of HN :-)))

datadeft 3 years ago | | |

Why would that be? Somebody figured out that AWS Lambda is not the answer to every single question?

fbn79 3 years ago | |

But they migrated to AWS ECS, which is still expensive serverless AWS stuff, just fully managed by Amazon.

nevon 3 years ago | | |

This is simply incorrect. ECS doesn't cost anything other than what you're paying for the EC2 instances that you place your tasks on. Fargate does, but that's not what they're using.

bawolff 3 years ago |

I'm pretty convinced that microservices are one of those things that make sense 5% of the time and the other 95% is cargo culting.

boredumb 3 years ago |

AWS has a great business model of people over "optimizing" their architecture using new toys from Amazon and being charged through the nose for it. It's amazing how clients that are doing a few requests per second will want a fully distributed, serverless, microservice + dynamodb + s3 + athena + etc + etc, in order to serve a semi-static web app and print some reports off throughout the day and pay 10-50k a month when the entire thing could run on a few nodes and even a managed RDS instance for a thousand bucks a month. I would argue at this point that early optimization of architecture is astronomically worse than even your co-worker that keeps turning all of your non-critical, low-volume iterable functions into lanes to utilize SIMD instructions.

Some irony in my anecdotal experiences is that most places that don't have the traffic to justify the cost of these super distributed service architectures also see a performance penalty from introducing network calls and marshaling costs

iamflimflam1 3 years ago |

This really is a click bait title. They are talking about their video quality monitoring service, not their video streaming service.

It’s something they use to check for defects in the video stream - hence the storing of individual frames in S3.

Original title: Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%

debdut 3 years ago | |

The subtitle is "The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs." And the article itself mentions the 90% cost reduction. So the title seems pretty much in-line with the original intent.

ojkelly 3 years ago | | |

But by omission it reads as if Prime Video rebuilt their stack without serverless and got a 90% cost reduction.

This post is going to pick up a lot of traction and I suspect these comments are going to bikeshed monolith vs microservices for the next day.

On reading it, this is for a video quality monitoring system that needs to consume and process video. Generally a compute and time intensive task. Something not always suited to serverless, particularly when it’s not easy to parallelise.

The task at hand doesn’t sound ideally suited to serverless, but the existence of the post shows that’s not readily obvious. So it’s a valuable post to explain a scenario where a few big machines is the best call.

But the sensationalism of the headline would suggest all serverless is expensive and wasteful, when in reality the same is true for a non-ideal workload on a monolith.

iLoveOncall 3 years ago | |

Yes this is a ridiculous clickbait. For once the original title is not and the poster had to make it so... Why is dang not changing it back?

PrimeVideo is very much based on a microservice architecture. Hell, my team which isn't client facing and has a very dedicated purpose has easily more microservices than engineers.

burnished 3 years ago | | |

Well, it was probably the middle of the night.

chmaynard 3 years ago | |

I guess all titles are clickbait to some degree. That said, the OP should have used the original title. Dan G. often corrects this mistake after the fact.

alpos 3 years ago |

"We built a video stream processor by splitting every 1080p+, multi hour long, 30-60fps video into individual images and copying them across networks multiple times."

Not surprising that didn't go well. This strikes me as a punching bag example.

Anyone who has worked with images, video, 3d models, or even just really large blocks of text or numbers before (any kind of actually "big data") knows how much work goes into NOT copying the frames/files around unnecessarily, even in memory. Copying them across the network is just a completely naive first pass at implementing something like this.
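To make the "don't copy, even in memory" point concrete, here is a minimal sketch in Python. The frame geometry and the `frame_views` helper are hypothetical, chosen only for illustration; the idea is that every frame is a `memoryview` slice into one decoded buffer rather than its own copy.

```python
# Hypothetical frame geometry (8-bit RGB 1080p); not taken from the article.
WIDTH, HEIGHT, CHANNELS = 1920, 1080, 3
FRAME_BYTES = WIDTH * HEIGHT * CHANNELS


def frame_views(buffer, frame_count):
    """Return one zero-copy view per frame into a single decoded buffer."""
    view = memoryview(buffer)
    return [
        view[i * FRAME_BYTES:(i + 1) * FRAME_BYTES]
        for i in range(frame_count)
    ]


decoded = bytearray(FRAME_BYTES * 4)      # stand-in for four decoded frames
frames = frame_views(decoded, 4)

decoded[FRAME_BYTES] = 255                # write the first byte of frame 1
first_byte_seen_by_view = frames[1][0]    # the view sees the write: no copy

copied = bytes(frames[1])                 # an explicit copy, by contrast
decoded[FRAME_BYTES] = 0                  # the copy keeps the old value
```

A detector that accepts anything supporting the buffer protocol can work on `frames[i]` directly, so the only full-size allocation is the decode buffer itself.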

Video processing is very definitely a job you want to bring the functions to the data for. That is why graphics card APIs are built the way they are. You don't see OpenGL offering a ton of functions to copy the framebuffers into RAM so you can work on them there only to copy them back to the video card. And if you did do that, you would quickly find out that you can be 10x to 100x more efficient by just learning compute shaders or OpenCL.

You could do this in a distributed fashion though, but it would have to look more like Hadoop jobs. I predict the final answer here, if they want to be reasonably fast as well, is going to be sending the videos to G4 instances and switching the detectors over to a shader language.

In general, if the data is much bigger than the code in bytes, move the code, not the data.

IO is almost always the most expensive part of any data processing job. If you're going to do highly scalable data processing, you need to be measuring how much time you spend on IO versus actually running your processing job, per record. That will make it dead obvious where you should spend your optimization efforts.
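A rough sketch of that per-record measurement, in Python. The `profile_pipeline` helper and the `load`/`process` callables are hypothetical stand-ins; in a real job `load` would be the S3 or disk fetch and `process` the detector.

```python
import time


def profile_pipeline(records, load, process):
    """Accumulate wall-clock time spent in load() versus process()."""
    io_seconds = 0.0
    cpu_seconds = 0.0
    count = 0
    for record in records:
        t0 = time.perf_counter()
        data = load(record)          # IO: fetch the record's payload
        t1 = time.perf_counter()
        process(data)                # compute: run the actual job
        t2 = time.perf_counter()
        io_seconds += t1 - t0
        cpu_seconds += t2 - t1
        count += 1
    total = io_seconds + cpu_seconds
    return {
        "records": count,
        "io_per_record": io_seconds / count if count else 0.0,
        "cpu_per_record": cpu_seconds / count if count else 0.0,
        "io_fraction": io_seconds / total if total else 0.0,
    }


def fake_load(record):
    time.sleep(0.005)                # simulated network round trip
    return bytes(1024)


def fake_process(data):
    return sum(data)                 # trivial stand-in for a detector


stats = profile_pipeline(range(5), fake_load, fake_process)
```

If `io_fraction` comes out close to 1, making `process` faster will barely move the total; the fetch path is where the time goes.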

LASR 3 years ago |

This is not a discussion of monolith vs serverless. This is some terrible engineering all over that was "fixed".

Some excerpts:

> This eliminated the need for the S3 bucket as the intermediate storage for video frames because our data transfer now happened in the memory.

My candid reaction: Seriously? WTF?

I am honestly surprised that someone thought it was a good idea to shuffle video frames over the wire to S3 and then back down to run some buffer computations. Fixing the problem and then calling it a win?

But I think I understand what might have led to this. At AWS, there is an emphasis on using their own services. So when use cases that don't fit well on top of AWS services come up, there is internal pressure to shoehorn it anyway. Hence these sorts of decisions.

ripper1138 3 years ago | |

This is what L6 and L7 are building at Amazon, meanwhile in sys design interviews I’m being asked to design solutions for a gaming platform with 50M concurrent users.

adql 3 years ago | |

> This is not a discussion of monolith vs serverless. This is some terrible engineering all over that was "fixed".

I feel that's like 95% of the "we migrated from X to Y and now it is better" posts: most of the improvement comes from rewriting the app/infrastructure after learning the lessons, with only a small part sometimes being the change in tech.

samwillis 3 years ago |

Next they will transition to on premises hardware from the cloud to save another 90%.... oh wait...

clnq 3 years ago | |

It turns out taking it offline has yet another 90% reduction in cost.

NBJack 3 years ago | | |

And vastly improves security!

rbanffy 3 years ago | |

From Amazon's PoV, AWS is on-prem ;-)

gondaloof 3 years ago | | |

Tongue in cheek, but

> Amazon Web Services, Inc. is a subsidiary of Amazon

So it’s technically another company.

Another comment seems to confirm this akshually comment ^_^’

https://news.ycombinator.com/item?id=35812230

selcuka 3 years ago | |

I imagine they will transition to bare metal as the next step.

yolovoe 3 years ago | | |

Don't. There's no benefit to using metal as opposed to the largest virt (which will take up the entire server anyways) pretty much. Metal just tends to be somewhat less reliable. Source: I work here.

anyfactor 3 years ago | |

I wouldn't be surprised if AWS started an on-prem hardware leasing service. Some companies are providing "On-premise As A Service" solutions.

deadcore 3 years ago | | |

They already do ^_^. AWS Outposts[1]

[1]: https://aws.amazon.com/outposts/

atty 3 years ago | | |

Isn’t that what AWS Outposts are? Leased hardware with a subset of AWS services running on them.

slekker 3 years ago | | |

Could you name names? I am very interested in this!

bobsmooth 3 years ago | | |

What a delightful euphemism for "leasing".

pbd 3 years ago | |

lol. Amazon is literally where microservices became mainstream.

jjevanoorschot 3 years ago |

The title is editorialised to be clickbait. The original title is "Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%".

They changed a single service, the Prime Video audio/video monitoring service, from a few Lambda and Step Function components into a 'monolith'. This monolith is still one of presumably many services within Prime Video.

kstenerud 3 years ago | |

The subtitle is "The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs."

And the article itself mentions the 90% cost reduction.

So the title seems pretty much in-line with the original intent.

jjevanoorschot 3 years ago | | |

The title makes it sound like Prime Video abandoned microservices altogether, but in reality they only did so for a single service.

herculity275 3 years ago | | |

Prime Video has hundreds of teams, VQA is a tiny team that owns a very specific QA service. Omitting that distinction from the title absolutely is clickbait.

oaiey 3 years ago | |

The worth here is that Amazon is writing about not going into AWS PaaS native programming (what Lambda is) because it is too expensive for them.

That has some newsworthiness and the title kind of reflects that.

dragonwriter 3 years ago | | |

> The worth here is that Amazon is writing about not going into AWS PaaS native programming (what Lambda is) because it is too expensive for them.

…and going to a newer AWS service (ECS), instead.

bhouston 3 years ago |

I wish this was a good condemnation of microservices in a general use case but it is very specific to the task at hand.

Honestly, the original architecture was insane though. They needed to monitor encoding quality for video streams so they decided to save each encoded video frame as a separate image on S3 and pass it around to various machines for processing.

That is a massive data explosion and very inefficient. It makes a lot more sense that they now look for defects directly on the machines that are encoding the video.

Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.

amluto 3 years ago | |

> Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.

No, that’s still a bad architecture. Bandwidth within AWS may be “free” within the same AZ, but it’s very limited. Until you get to very very large instance types, you max out at 30 Gbps instance networking, and even the largest types only hit 200 Gbps. A single 1080p uncompressed stream is 3 Gbps or so. There is no way you can effectively use any of the large M7g instances to decode and stream uncompressed video. (Maybe the very smallest, but that has its own issues.)

In contrast, if you decode and process the data on the same machine, you can very easily fit enough buffers in memory, getting the full memory bandwidth, which is more like 1Tbps. If you can process partial frames so you never write whole frames to memory, you can live in cache for even more bandwidth and improved multi core scalability.
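The "3 Gbps or so" figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 8-bit RGB (24 bits per pixel) at 60 fps, which the comment does not state; at 30 fps the number halves.

```python
def uncompressed_gbps(width, height, fps, bits_per_pixel=24):
    """Raw video bitrate in gigabits per second (10**9 bits)."""
    return width * height * bits_per_pixel * fps / 1e9


# Assumed format: 1080p, 60 fps, 24 bits per pixel.
stream_gbps = uncompressed_gbps(1920, 1080, 60)

# How many such streams fit in the 30 Gbps NIC mentioned above,
# ignoring protocol overhead and any other traffic on the instance.
streams_per_30g_nic = int(30 // stream_gbps)
```

Under these assumptions `stream_gbps` is about 2.99 and `streams_per_30g_nic` is 10, so a handful of streams is enough to saturate the network long before the cores run out of work.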

bhouston 3 years ago | | |

Ah. I was thinking that the encoding machines were not bandwidth limited but rather cpu limited as they were doing expensive encoding algorithms. So I was thinking the streams were streaming out at less than real time. I figured this was better than the dual/multi encode method I think they are now relying upon when all the detection code doesn’t fit on the same machine as the encoder.

kiesel 3 years ago |

This is less an example of why serverless is bad and more an example of using unsuitable services for tasks they were not meant for.

In this case they were using AWS Step Functions, which are known to be expensive ($0.025 per 1,000 state transitions), and they wrote:

> Our service performed multiple state transitions for every second of the stream

Secondly, they were using large amounts of S3 requests to temporarily store and download each video frame which became a cost factor.

They had a hammer - and every problem looked like a nail. In my experience this happens to every developer at a certain stage when he/she gets in touch with a new technology; it doesn't mean that the tech itself is bad - it depends on the scenario, though.
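To put a rough number on the Step Functions pricing quoted above, here is a small Python sketch. The price is the figure from this comment; the transition rate (3 per second) and the fleet size are made-up placeholders, since the article only says "multiple" transitions per second.

```python
PRICE_PER_1000_TRANSITIONS = 0.025   # USD, as quoted in this comment


def step_function_cost(transitions_per_second, stream_seconds, streams=1):
    """Orchestration cost in USD for a given amount of streamed video."""
    transitions = transitions_per_second * stream_seconds * streams
    return transitions / 1000 * PRICE_PER_1000_TRANSITIONS


SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical: 3 transitions per second of stream, one stream, one day.
one_stream_day = step_function_cost(3, SECONDS_PER_DAY)

# Hypothetical fleet of 1,000 concurrent streams for one day.
fleet_day = step_function_cost(3, SECONDS_PER_DAY, streams=1000)
```

With those placeholder inputs, one stream costs about $6.48 per day in state transitions alone and a thousand streams about $6,480 per day, before any Lambda or S3 charges.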

ikiris 3 years ago |

Sending video frames between services is expensive, and paying per state transition for something that makes state transitions multiple times per second in a single stream is also expensive...

Like, did they even think about cost when designing this the first time?

bhouston 3 years ago | |

Yeah completely insane original design. A design I would expect from a first year intern who is just trying to make his first project work and is picking random technologies to string together.

ocdtrekkie 3 years ago | |

Considering they don't actually pay the bill for this and it is internal accounting, probably not. Belt tightening has probably pushed cloud providers to figure out if they're wasting stuff they could put to better use, and I assume when it launched and nobody was watching Prime Video, inefficiencies were both smaller and less noticeable.

bagels 3 years ago | | |

Their team/org has resource budgets too.

adql 3 years ago | |

>Like, did they even think about cost when designing this the first time?

Obviously no, only after managers complained.

radicalbyte 3 years ago | |

It stinks of a lack of very basic engineering skills to me combined with a large dose of CV-driven-development.

The latter of course helping Amazon market "serverless" to the unwashed masses as a "solution".

Vosporos 3 years ago | |

why should they, they're richer than God!

eru 3 years ago | | |

You don't get (and stay) rich by wasting all your resources.

ishanjain28 3 years ago |

> The main scaling bottleneck in the architecture was the orchestration management that was implemented using AWS Step Functions. *Our service performed multiple state transitions for every second of the stream*(???), so we quickly reached account limits. Besides that, AWS Step Functions charges users per state transition.

This is so obvious in my head. I can't think of a single good reason where a SFN makes sense here.

bberrry 3 years ago |

I'd be surprised if this doesn't get taken down as it casts AWS lambda in an unfavorable light (and rightly so). That's the impression I have of Amazon's leadership but maybe I'm wrong.

dpwm 3 years ago | |

> We designed our initial solution as a distributed system using serverless components (for example, AWS Step Functions or AWS Lambda), which was a good choice for building the service quickly.

The message seems more that they outgrew AWS lambda but that lambda was a good choice at first.

supriyo-biswas 3 years ago | | |

The post literally says that they could hit only 5% of the expected workload with their serverless architecture, so IMO it is still quite negative.

qaq 3 years ago | | |

Well, they do work for Amazon, so they can't say Lambda sux. A monolith is way faster to develop, especially the CI/CD part, so no, if they had started with a monolith there would have been no downside.

dragonwriter 3 years ago | |

> I’d be surprised if this doesn’t get taken down as it casts AWS lambda in an unfavorable light

“There are use cases where Amazon EC2 and Amazon ECS are a better platform than AWS Lambda” is…not actually a message that anyone involved in AWS has ever been afraid to put forward.

I mean, the whole reason that AWS has a whole raft of different compute solutions is that, notionally, removing any one would make the offering less fit for some use case.

emodendroket 3 years ago | |

The solution was using a different array of AWS resources so I don't see how anything is being cast in a bad light. Lambda is great for many use cases.

simplotek 3 years ago | |

> I'd be surprised if this doesn't get taken down as it casts AWS lambda in an unfavorable light (and rightly so).

The article mostly lays the blame on Step Functions. Also, Lambdas are portrayed as event handlers that don't run very often. This means long-running tasks that are run occasionally, or events that don't fire that often. Once throughput needs go up or your invocation frequency comes closer to the millisecond, the rule of thumb is that you already require a dedicated service.

ad-astra 3 years ago |

Storing individual frames in S3??? Insanity! Their initial distributed architecture is unbelievable.

jpgvm 3 years ago |

Dead horse and all that but please just stick to Boring Tech, it is better for your mental health, not to mention your business, development velocity, defect rate, etc.

Most importantly it's good for mental health though.

noobermin 3 years ago | |

Not good for resume padding hype chasers. Especially the managerial types who never need to actually write the code.

time4tea 3 years ago |

Microservices are just an architectural pattern, and like all patterns there are places where they are highly appropriate, and others where they are inappropriate.

Same for cloud, same for <pattern>

If everything is a hammer you'll hurt your thumb/hand/arm.

At least now (for some time) the pattern is named, so broadly when talking about this sort of thing, the name conjures up the same/similar image in everyone's heads.

There are all sorts of inputs to the choice of architectural patterns, including budget, scalability (up and down), criticality, security, secrecy, team skills and knowledge, preference, organisational layout, organisation size, vendor landscape, existing contracts, legal jurisdiction ....

mparnisari 3 years ago |

As a former AWS employee I can almost guarantee that the person that made the original design got a promotion over it.

throwaway2990 3 years ago | |

As a never-been AWS employee, I can almost guarantee you the original design was most likely simple, and the use of Lambdas and Step Functions was a good choice and not expensive, but the functionality grew and the cost skyrocketed. This is only the normal evolution of a service.

IceHegel 3 years ago | |

They put individual video frames as images in S3. That’s absurdly dumb. It’s like putting a frame buffer on an HDD.

_joel 3 years ago | | |

In any normal company, not one that profits from such dumbness :)

fastest963 3 years ago |

Clickbait title. The expensive part was passing around individual frames and the associated S3 operations. It's not clear if they could've kept a distributed architecture but made the work units be chunks of frames or even whole videos. Monoliths can inefficiently use S3 and other cloud services to rack up a huge bill.
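As a rough illustration of the "chunks of frames" idea, the Python sketch below counts storage requests for per-frame objects versus per-chunk objects. It assumes one PUT and one GET per stored object, a 30 fps stream, and 10-second chunks; all three are simplifying assumptions, not details from the article.

```python
import math


def object_requests(total_frames, frames_per_object):
    """Requests needed if every stored object is written once and read once."""
    objects = math.ceil(total_frames / frames_per_object)
    return 2 * objects               # one PUT plus one GET per object


FPS = 30                             # hypothetical frame rate
frames_in_hour = FPS * 3600

per_frame_requests = object_requests(frames_in_hour, 1)
per_chunk_requests = object_requests(frames_in_hour, FPS * 10)  # 10 s chunks
```

For one hour of video this is 216,000 requests per frame-sized object against 720 for 10-second chunks, a 300x drop in request count while staying distributed. Whether the detectors could actually consume chunks is exactly the open question here.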

bagels 3 years ago |

I don't want to come off too harsh on this, but it sounds like the service didn't meet the initial design requirements?

Some of this would have been really easy to predict (eg. hitting account limits) if they simply took the time to calculate how many workflow transitions they'd need to execute for the load.
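To make that concrete, here is a back-of-the-envelope sketch. The stream count, the transitions per second of stream, and the account limit are all invented assumptions, not numbers from the article or from AWS documentation.

```python
# Estimate the sustained state-transition rate and compare it to a quota.
# Every number used below is a hypothetical placeholder.

def transitions_per_second(streams: int, transitions_per_stream_second: int) -> int:
    """Total state transitions the account must sustain each second."""
    return streams * transitions_per_stream_second


def exceeds_limit(required_per_second: int, account_limit_per_second: int) -> bool:
    """True if the estimated load is above the (assumed) account quota."""
    return required_per_second > account_limit_per_second


required = transitions_per_second(streams=500, transitions_per_stream_second=10)
print(required)                                                # 5000
print(exceeds_limit(required, account_limit_per_second=2000))  # True
```

The point is only that the check is a multiplication and a comparison, so it can be done before the first line of the workflow is written.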

camgunz 3 years ago |

If you came to me with a design that included passing individual video frames through S3 instead of RAM I would honestly think you were joking. What a wild article.

IceHegel 3 years ago | |

I’m all for big, fast, monoliths - but I’m not sure I want to hear it from the team that saved video frames to s3 in their AWS Step Function video encoder.

vinay_ys 3 years ago |

AWS is truly a customer-first company. I was an AWS customer in its early days (2006-2012) and then again recently (2022-now). And they have been consistent in being customer-first. In the last year, they have proactively helped us cut our AWS spend by multiples. I'm not surprised at all by this article coming from within Amazon. Kudos for maintaining such a culture.

basitmustafa 3 years ago |

The headline is a bit of a misnomer. This happens in large businesses all the time (which isn't to say it's "good", hardly is, but it suggests the causation is incorrect here, which then indicates the conclusion is entirely off-base):

1) We have sexy new product! Everyone use it so we have some use-case stories to tell and we look credible! Who cares if it's not the right tool for the job! We need a splashy way to use hackneyed business speak like "we're eating our own dog food" at the next user con so all the IT middle managers there will fight over early access and adoption. PROFIT! (Screams of technology teams in the background of "a knife is the most expensive, useless pry tool you can buy, but whatever, you are not listening, mmmkay").

2) A few quarters/years later (if you're lucky and you made it or someone with enough gravity in their title finally saw the light): Why is expense so high in this business unit? This is insane! Let's go back to a more sane architecture. (Screams of technology teams going back to what was working in the first place, but was not sexy nor necessarily new now that no one is watching and hype cycle is over)

Does this mean that serverless is useless? Dumb? Uneconomical? No way. For bursty, very short running workloads, it can be GREAT and INCREDIBLY economical.

What is useless and "dumb" is whoever thought that Prime Video's encoding workloads were going to do anything but increase cost and were somehow a fit for a system whose business case specifically necessitates bursty, shorter workloads that are primarily scale-to-zero for significant periods of the day/week/month.

It was a marketing stunt gone horribly wrong: intentional or not, but that doesn't repudiate the value of "serverless" for the right workloads, it just proves you better really understand the technology and the business case and the scale economics, and that goes for any technology.

asdfman123 3 years ago |

What are the developers doing, though, if they’re not diagnosing why Reaper isn’t communicating with the Zanzibar service registry?

ggm 3 years ago |

amazon product ditches amazon product for another amazon product?

feels very strongly they just moved from one AWS platform to another.

delay between asynchronous communicating processes differs in these architectures and I suspect they were unable to orchestrate microservices to match the RPC "inside" a monolith model. Nobody can: It only matters if your IPC is causing delay you can avoid.

Most of us aren't in a room where the real cost is high: 90% of computers are more than 90% idle 90% of the time. Amazon is not in that cohort.

tylerdurden91 3 years ago |

I think what most people are missing here is that they used AWS Step Functions in the wrong place. Part of the blame here is that, in its over-enthusiasm to get more users, AWS doesn't properly educate customers on when to use which service. Worse, for each use case AWS has dozens of options, making the choice incredibly hard.

In this case, they probably should have used Step Functions Express, which charges based on duration as opposed to number of transitions. What they were looking for is "on-host orchestration": orchestrating a bunch of things that each finish quickly and are repeated over and over. Standard Step Functions is better when workflows run longer and exactly-once semantics are needed. Link for reading about the differences between Express & Standard Step Functions: https://docs.aws.amazon.com/en_us/step-functions/latest/dg/c....
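A toy comparison of the two billing shapes (pay per transition vs. pay per request plus duration). The function names are made up, and the prices passed in are placeholders to be replaced with whatever the current AWS pricing page says; nothing here is an official formula.

```python
# Two simplified billing models for a workflow engine.
# All prices are caller-supplied assumptions, not authoritative AWS prices.

def per_transition_cost(executions: int, transitions_per_execution: int,
                        price_per_transition: float) -> float:
    """Standard-style billing: every state transition is charged."""
    return executions * transitions_per_execution * price_per_transition


def per_duration_cost(executions: int, seconds_per_execution: float,
                      memory_gb: float, price_per_request: float,
                      price_per_gb_second: float) -> float:
    """Express-style billing: a per-request fee plus memory * duration."""
    request_cost = executions * price_per_request
    duration_cost = (executions * seconds_per_execution
                     * memory_gb * price_per_gb_second)
    return request_cost + duration_cost


# Hypothetical workload: 1M short executions with 20 transitions each.
standard = per_transition_cost(1_000_000, 20, price_per_transition=0.000025)
express = per_duration_cost(1_000_000, seconds_per_execution=1.0, memory_gb=0.064,
                            price_per_request=0.000001,
                            price_per_gb_second=0.00001667)
print(f"per-transition: ${standard:.2f}, per-duration: ${express:.2f}")
```

With many transitions packed into short executions, the per-transition model grows with the transition count while the per-duration model does not, which is the mismatch described above.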

This also exemplifies something I learned while at Amazon & AWS: Amazon themselves don't know how best to use AWS. This is one of the great examples. I'll share one more:

- In my team within AWS, we were building a new service, and someone proposed building a whole new microservice to monitor the progress of requests to ensure we don't drop requests. As soon as I mentioned the visibility timeout in SQS queues, the whole need for the service went away, saving Amazon money ($$) & time (also $$). But if I or someone else hadn't mentioned it, we would have built it.
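For readers who haven't used it, the idea behind a visibility timeout can be sketched with a toy in-memory queue. This is not the SQS API: `ToyQueue` and its methods are invented for illustration, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ToyQueue:
    """Hypothetical queue: a received message is hidden, not removed."""
    visibility_timeout: float
    # message id -> (body, time at which the message becomes visible again)
    _messages: Dict[int, Tuple[str, float]] = field(default_factory=dict)
    _next_id: int = 0

    def send(self, body: str) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = (body, 0.0)  # visible immediately
        return msg_id

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        """Return one visible message and hide it for visibility_timeout seconds."""
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        """Acknowledge successful processing; the message is removed for good."""
        self._messages.pop(msg_id, None)


q = ToyQueue(visibility_timeout=30.0)
q.send("request-1")
print(q.receive(now=0.0))    # (0, 'request-1')  worker A takes it, then crashes
print(q.receive(now=10.0))   # None              still hidden from other workers
print(q.receive(now=31.0))   # (0, 'request-1')  reappears, worker B retries
q.delete(0)
print(q.receive(now=100.0))  # None              acknowledged, never redelivered
```

Because an unacknowledged message comes back by itself, a separate service that watches for dropped requests has nothing left to do in this model.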

I don't think serverless is a silver bullet, but I don't think this is a great example of when not to use serverless. It helps to know the differences between various services and when to use what.

PS: Ex Amazon & AWS here. I have nothing to gain or lose by AWS usage going up or down. I'm currently using a serverless architecture for my new startup which may bias my opinions here.

tylerdurden91 3 years ago | |

Worth adding, as mentioned in other comments, that moving video data around at that scale was a bad choice to begin with. They could have considered Fargate, avoided moving the data around so much, and realized similar reductions in cost. So the wins are not really coming from moving to a monolith as much as they're coming from eliminating unnecessary data transfers.

If the article had said Fargate, which is technically still serverless, we could have avoided a whole microservice vs monolith debate or serverless vs hosts/instances debate.

jdub 3 years ago |

I work in streaming video, specialise on AWS, and have enjoyed using Step Functions for certain (non-video) projects. I am _astonished_ that Step Functions + S3 was even considered as a starting point for defect detection in streaming video. Astonished.

LVB 3 years ago |

> Moving the solution to Amazon EC2 and Amazon ECS also allowed us to use the Amazon EC2 compute saving plans that will help drive costs down even further.

So various parts of Amazon have to work through the same AWS pricing programs that the rest of us do?

zoover2020 3 years ago | |

There are internal discount rates per service (IMR), but there's no such thing as free lunch

Also, Prime Video isn't part of AWS but the consumer / devices / other part of (retail) Amazon.

Source: worked there

ldargin 3 years ago | |

Yes, to keep track of costs.

Garlef 3 years ago | | |

Also, they might actually be different legal entities.

sen 3 years ago |

Everything old is new again.

asim 3 years ago | |

This. All trends are cyclical. Microservices have a purpose. Monoliths have a purpose. They are not mutually exclusive. One is the path to the other but there may also be resets along the way. I spent 10 years doing microservices and now I'm back to a monolith. It's a refreshing change but it's also a project in its infancy. Breaking that out over time will only happen as and when needed.

rbanffy 3 years ago | |

Not really - what they realized is that the billing model was not well aligned to what they were doing.

bcoughlan 3 years ago |

I quite like the idea of viewing run cost as architecture fitness function https://www.thoughtworks.com/radar/techniques/run-cost-as-ar....

If your architecture has a high cost to develop, test and run when a cheaper architecture meets your needs, it's a sign that you have overengineered. In my experience there is an order-of-magnitude increase in complexity by adopting microservices that only starts to pay off when your org and user base are huge.

raverbashing 3 years ago |

I love how some developers jumped on the serverless bandwagon with some of the least "serverless" workloads first

"Let's make our entire website serverless now" erm, no?

It's cargo culting of the worst kind

selcuka 3 years ago | |

It's the same story as NoSQL. "Let's migrate our transactional data that requires strict referential integrity to CouchDB... Oh, wait..."

quickthrower2 3 years ago | | |

Waiting for the next wave of tech misuse due to LLMs and ML!

1-6 3 years ago | |

Well, even outside of the world of computers, you have sheeple everywhere who will do what they're told without questioning anything.

Understanding this behavioralism will get you through many situations in life.

alexchamberlain 3 years ago |

Somewhat interesting article, but this isn't a monolith, at least not by a microservice fanboy definition.

The product (Prime Video) is still built using many business oriented services. Furthermore, this service appears to be developed and operated by a single team.

That being said, there are some lessons here - there are good ideas in most design paradigms, but if you take them to the extreme, you're going to see some weird side effects. Understand the benefits and engineer a balanced solution.

bob1029 3 years ago |

I think serverless has its place, but this problem doesn't seem like a fantastic fit.

We are looking into serverless as a way to exhibit to our customers that we are strictly following certain pre-packaged compliance models. Cost & performance are a distant 2nd concern to security & compliance for us. And to be clear - we aren't necessarily talking about actual security - this is more about making a B2B client feel more secure by way of our standardized operating model.

The thinking goes something like - If we don't have direct access to any servers, hard drives or databases, there aren't any major audit points to discuss. Storage of PII is the hottest topic in our industry and we can sidestep entire aspects of The Auditor's main quest line by avoiding certain technology choices. If we decided to go with an on-prem setup and rack our own servers, we'd have to endure uncomfortable levels of compliance.

Put differently, if you want to achieve something like PCI-DSS or ITAR compliance without having to convert your [home] office into a SCIF, serverless can be a fantastic thing to consider.

If performance & cost are the primary considerations and you don't have auditors breathing down your neck, maybe stick with simpler tech.

InvOfSmallC 3 years ago | |

Overall, as stated in the article, what to use is a case-by-case choice. My experience tells me it's always a good idea to start with the monolith, but I don't know enough about PII to tell you your idea is over-engineered. I feel there are better ways, though. Also, you don't need to use Lambda to avoid being on-prem; EC2 is enough.

uber1geek 3 years ago |

I am a big fan of Django's apps model ... what I like to call a "Modular Monolith".

Being an early engineer at most of my stints, I have built and scaled multiple startups using the approach and it has never failed me. The pitfalls of microservices are not worth it unless absolutely necessary.

I always made it a point to group by business-logic rather than separate at whatever curve ball "new-tech" throws at me.

smitty1e 3 years ago |

Two naive ideas that may be OK as a going-in position:

- granularity

- bandwidth negligibility

Breaking everything down to a gnat's ass might improve testability, but is testability the product? Do I really need a Java stack trace that reads like an Andrew Wiles proof?[1] Maybe I do, at scale.

Then there is the non-zero cost of the packet shuffling. Every edge in the architectural graph, not just the nodes, costs. But we just throw a waiter into the code and move on to the next line. No biggie.

What was most interesting was "It also increased our scaling capabilities." Granularity was supposed to let "serverless" absorb the entire universe, I thought.

At a higher level of abstraction, maybe The Famous Article is a map/reduce job: the requirements dissolved into solution, and a proper number of components precipitated out.

[1] https://en.m.wikipedia.org/wiki/Wiles%27s_proof_of_Fermat%27...

terom 3 years ago |

> The second cost problem we discovered was about the way we were passing video frames (images) around different components. To reduce computationally expensive video conversion jobs, we built a microservice that splits videos into frames and temporarily uploads images to an Amazon Simple Storage Service (Amazon S3) bucket. Defect detectors (where each of them also runs as a separate microservice) then download images and processed it concurrently using AWS Lambda. However, the high number of Tier-1 calls to the S3 bucket was expensive.

Taking "malloc for the Internet" [1] a bit /too/ literally there.

[1] https://aws.amazon.com/blogs/aws/eight-years-and-counting-of...

recursivedoubts 3 years ago |

https://grugbrain.dev/#grug-on-microservices

> grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too

> seem very confusing to grug

tyingq 3 years ago |

Seems somewhat curious that they didn't at least include Fargate. Feels like they jumped all the way from the typical overengineered setup into using AWS in a way that's very close to just "I need virtual machines".

tylerdurden91 3 years ago | |

Absolutely. Neither fargate nor step functions express. Seems like they did not evaluate all the options before making the jump.

christkv 3 years ago |

I've never seen successful micro services if the starting point is not a monolith. The most successful ones I've seen are hybrid ones where some parts needed to be scaled are refactored as a micro service to run in parallel.

lastangryman 3 years ago | |

Bang on. A friend I work with used to say "microservices are for scaling teams, not tech" which I liked.

Even with monolith -> microservices I've seen it go wrong. One Go application I worked on it would take a senior engineer a week to add a basic CRUD endpoint as the code had been split in to microservices along the wrong boundaries. There was a ridiculous amount of wiring up and service to service calls that needed done. I remember suggesting a monolith might be more appropriate, and was told it used to be a monolith but had been "refactored to microservices"...

This type of stuff can literally kill early stage companies.

revskill 3 years ago |

Lambda, Step Functions, ... are just a pure marketing scam to me, because the price is ridiculously high for 99% of real-world use cases.

They are good enough to deliver an MVP quickly, though, but that's about it.

zenbowman 3 years ago |

Shipping around individual video frames between components is really an astonishingly bad idea.

Microservices seem to be a decent idea with a terrible name. The idea of running services that are small enough that they can be managed by a single team makes sense - it enables each team to deploy their own stuff.

But if you break things down further, where you need multiple "services" to perform a single task, and you have a single team managing multiple services - all you do is increase operational & computational overhead.

shri_krishna 3 years ago |

The only time it makes sense to use edge/serverless anything is lightweight APIs and rendering HTML to end users so they get the page loaded as quickly as possible. That's the only use case good for edge. And any supporting infra that can help deliver rendered pages asap (like kv store on the edge for storing sessions, lightweight database on the edge for user profile data, queues etc). Anything that requires decent amount of processing should not live on the edge/serverless. It defeats the purpose.

dragonwriter 3 years ago | |

> The only time it makes sense to use edge/serverless anything is lightweight APIs and rendering HTML to end users so they get the page loaded as quickly as possible. That’s the only use case good for edge.

Serverless and edge aren’t the same thing.

shri_krishna 3 years ago | | |

Nope. Edge is just serverless that is closer to your user to reduce the number of network hops. Both are essentially the same when it comes to technical functionality. They run on limited resources and should not be used for intensive workloads.

dannyobrien 3 years ago |

Am I right in understanding this is just their defect-detection system?

AndrewPGameDev 3 years ago | |

Yes, this is just the defect detector and not the actual video streaming service.

hubraumhugo 3 years ago |

I'll launch a consulting business focused on migrations from microservices to monoliths and from the cloud to in-house. Pricing would be a % of the saving over the first year.

cmrdporcupine 3 years ago |

I'm happy to see, in the discussion here, the continued backlash against microservices and the deleterious effects they have had on software complexity and data modelling.

But I think it's interesting that if we took a time machine back to 2014 or 2015 the tone here would be quite different, and microservices were all the rage on this forum as I recall.

I like to hope that the industry learns from its failed trends, but I'm now old enough to see this is rarely the case.

INTPenis 3 years ago |

These days when project managers of new products seek my advice as a solutions architect I tend to suggest they create a minimally viable product that is written modularly so it can scale, but deploy it very simply on a few servers just like we used to 15 years ago.

Scaling is definitely a good thing, microservices make scaling easier, no doubt about that. But an MVP rarely needs k8s level scaling, it just needs to be written well so it can scale in the future.

samsquire 3 years ago |

I've been having lots of thoughts lately about how you build a system that a) can respond to scale, b) for the most affordable price possible, c) while scaling infrastructure spend with income.

I love the anecdotes about just buying a Hetzner server which can handle a surprising amount.

One of my ideas is a company that maintains an incremental infrastructure that can grow to handle extreme levels of traffic - the infrastructure itself mutates over time.

babbledabbler 3 years ago |

Breaking things into tiny functions and putting them on many different servers incurs tradeoff costs in both complexity and compute. There is a complexity cost in having to deal with the setup, security, and orchestration of those functions, and a compute cost because if the overall system is running constantly it will be less efficient and therefore more expensive than running on one box.

makkes 3 years ago | |

I agree on the tradeoffs you have to make. The main cost driver here was storage and traffic, though.

babbledabbler 3 years ago | | |

Good point. "communication" should also be on the list. I don't think storage is technically the tradeoff in this case even though it's S3. It's the traffic between those components that's costing them.

dahwolf 3 years ago |

Rarely will I defend Amazon in anything, but I'll make an exception.

In my experience, AWS/Amazon people do not force you or even direct you to a particular architectural choice. They are relatively indifferent about it.

Instead, trend-driven architectures seem to come from the tech community themselves. It's the customers often making the wrong choice.

datadeft 3 years ago |

Two things:

- When people use the solution -> problem path instead of problem -> proposals -> cost analysis -> solution they get what they deserve.

- It is possible to optimize most infrastructures and code, it depends how much obviously but I have seen such percentages before

The real question is: why didn't they choose the right stack for their problem to begin with?

huksley 3 years ago |

AWS Step Functions are bad for so many reasons. Scaling, pricing, developer experience, etc.

It is clearly made by people who don't really understand (or do not care) how distributed workflows work.

And the pricing is prohibitive for running it at scale. In my opinion it should be free to use, provided you glue together other AWS services with it.

1-6 3 years ago |

Perhaps Amazon reached peak saturation for its video streaming services, so unknown unknowns no longer held it back from using a more efficient monolithic architecture. Distributing services across multiple machines is certainly more scalable, but all those API calls can add up.

tnsengimana 3 years ago |

Over-engineering at its best. I tend to see microservices as a double-edged sword and in this case, there was no need for them.

Also, the pricing of AWS quickly goes up as you go from EC2 -> Fargate -> Lambda. I don't know why on earth someone would build microservices at the lambda-level.

Cthulhu_ 3 years ago |

They basically underestimated the cost of moving millions of small files to and from S3; it kinda makes sense if they want to save those images for a long time, but in this case it was for semi-real-time error detection, which is much faster to do in-memory.

deterministic 3 years ago |

Micro-services is BS invented by cloud providers to solve problems you don’t have at 10x the cost.

The worst software systems I have ever seen were micro-services. One of them is more than 20 years old. The WTF count per minute is exponential.

bjornsing 3 years ago |

It’s expensive to store individual video frames in s3 for no good reason? Go figure…

kreco 3 years ago |

At this point isn't the lesson to use serverless stack for fast iterative processes then use a custom solution once you know exactly what you want?

I have 0 experience with serverless/cloud. Just a thought.

yakshaving_jgt 3 years ago | |

I think the lesson ought to be that you should start by writing one computer program and running it on one computer.

basilgohar 3 years ago |

I wonder if this is, in some way, a kind of signalling of where AWS wants to go – maybe they want to shift more towards dedicated hosting rather than all of these separate services?

retrac98 3 years ago |

You can read more of these sorts of posts at https://www.microservice-stories.com/

h05sz487b 3 years ago |

So one team at Prime for one specific application learned, that serverless was not the ideal compute model for their workload. Wow.

sjinta 3 years ago |

I think using "Monolith" for what they ended up with is badly chosen. Basically they just made a service less granular (or less micro, if you will).

Havoc 3 years ago |

Feels more like the initial version was a prototype not meant to scale

Wouldn’t have expected prime to be pushing around images on s3

stuaxo 3 years ago |

Locality of reference matters.

It's fine to split things up, but we have to be careful how we do it + aware of the overheads.

prisonguard 3 years ago |

This is astonishing coming from Amazon.

marcopicentini 3 years ago |

They could save millions by migrating to Digital Ocean or Hetzner (+Cloud66).

abrookewood 3 years ago |

I'm waiting for the AWS Lambda team to talk to Marketing to get this taken down ...

vivegi 3 years ago |

Of all the video streaming services I have used, PrimeVideo is the one where the video/audio sync becomes terrible progressively.

It is pretty bad. It happens in 8 out of 10 movies. There is some misconfiguration in their AV transcoding pipeline.

And here, we have an article talking about Monolith vs. Microservices improving user experience.

sgtnoodle 3 years ago | |

Of all the streaming services that have irritated me, I can't recall any serious technical problems with prime. I suppose I have a vague memory of poor AV sync that could have been on prime, it was always a problem at the start of streaming that would work itself out after a few seconds.

Netflix's shiny new compression scheme a couple years ago didn't work on my Sony TV's buggy silicon. The only way I got that fixed was by knowing someone on the inside.

Hulu usually can't make it through an episode without the video freezing at least once. Sometimes it just refuses to work at all until I completely reboot the TV.

HBO Max's UI is just really cheesy and slow, but whatever it's fine.

Paramount+ is my new favorite to hate on. The UI is maddeningly glitchy and lethargic. I pay for no ads, but it plays ads anyway, on Star Trek episodes from 1996. It doesn't remember progress in a show more than once every week or two, just enough to remind you that it's supposed to be a feature. On my phone, it doesn't hide the typical menu overlays unless I do a complex sequence of finger taps. One time I tried to file a bug report from inside the logged-into app, and I got an email back claiming that they would love to consider my concerns but can't because they don't have an account associated with my email address.

vivegi 3 years ago | | |

For me, the sync is fine at the start of playback on PrimeVideo. It just becomes bad progressively (which leads me to believe they have used a video framerate that is ever so slightly different from the source and have keyframes insertion after a longer than optimal duration; similarly sample rate mismatch for output audio relative to input audio stream could be a potential cause).
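A worked example of how a tiny frame-rate mismatch would produce exactly this kind of progressive drift. The 24 fps source and 24000/1001 fps output are hypothetical values picked for illustration; nothing in this thread confirms what Prime Video actually does.

```python
from fractions import Fraction


def av_drift_seconds(duration_s: Fraction, source_fps: Fraction,
                     output_fps: Fraction) -> Fraction:
    """Seconds by which video lags (+) or leads (-) the audio after
    duration_s of source content, if frames authored at source_fps are
    played back at output_fps while the audio plays at its original speed."""
    frames = duration_s * source_fps
    video_playback_time = frames / output_fps
    return video_playback_time - duration_s


# Hypothetical two-hour movie: 24 fps source played at 24000/1001 fps.
drift = av_drift_seconds(Fraction(7200), Fraction(24), Fraction(24000, 1001))
print(float(drift))  # 7.2
```

A 0.1% rate error is invisible in the first minute (about 0.06 s under these assumptions) but adds up to several seconds by the end of a film, which matches sync that starts fine and degrades.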

And I use a FireStick, FWIW.

BTW, their own transcoder product MediaConvert seems to have this issue (it is possible that it could be user error too, in how they have used the product or set up the parameters). [1]

My guess is PrimeVideo dogfoods MediaConvert and they also have this issue. They could have fixed it for newer content, but previously transcoded content still has issues (which will remain until they are re-transcoded).

[1]: https://repost.aws/questions/QUGajgu4zKTlewlTg1M96i_Q/questi...?

chrismsimpson 3 years ago |

This article is going to keep me employed for some time yet

debdut 3 years ago |

What! They changed the title. Tells you something

amne 3 years ago |

but .. but .. there's no buzz words in this solution. monolith? ew!

dragonwriter 3 years ago | |

“container”

finikytou 3 years ago |

they just coded a step functions monolith...

Alifatisk 3 years ago |

Guess who happy DHH was reading this

Alifatisk 3 years ago | |

how*

ThouYS 3 years ago |

hahahahahahaha

EVa5I7bHFq9mnYK 3 years ago |

I guess what AWS sells is not servers, but software to manage them automatically, to load balance, to replicate etc. Once, in a short time, GPT can write such (pretty standard) software for you, Amazon will, too, go down.

_joel 3 years ago | |

You're vastly oversimplifying this, imho. It's not just being able to write something and get AI to write terraform for you (it doesn't do it all that well atm in reality, for anything complex). You can't automate the people who you need to convince to make those decisions internally, on the whole, at least :)

iLoveOncall 3 years ago | |

Sure, ChatGPT will automate in a short time what tens of thousands of top engineers have built over a decade.

EVa5I7bHFq9mnYK 3 years ago | | |

Of course not. It will help millions of small companies to write scripts so they won't need AWS anymore.

bberrry 3 years ago |

I wouldn't call it a monolith as the number of instances could be scaled up. Mono implies single instance. They just combined multiple microservices into a larger one.

config_yml 3 years ago | |

I am not sure if you’re joking.

bberrry 3 years ago | | |

I don't see what's funny about my statement. Please elaborate on your definition of monoliths vs scalable microservices.

rbanffy 3 years ago | | |

We could call it a polylith.

cogitoergofutuo 3 years ago | |

It’s also not really serverless to begin with, because at the end of the day code is being executed on a physical device that many of us might call a “server”