Maestro: Netflix's Workflow Orchestrator (netflixtechblog.com)

307 points by vquemener 1 year ago | 159 comments

slt2021 1 year ago |

I used to be impressed with these corporate techblogs and their internal proprietary systems, but not so much anymore. Because code is a liability.

I would rather use off-the-shelf open source stuff with a long history of maintenance and improvement than reinvent cron/celery/airflow/whatever, because code is a liability. Somebody needs to maintain it, fix bugs, add new features. Unless I get a +1 grade promotion and a salary/RSU bump, ofc.

People need to realize that code is a liability, anything that is not the business critical stuff that earns/makes $$$ for the company is a distraction and resource sink.

cortesoft 1 year ago | |

Isn't this exactly WHY this blog post exists? They are open sourcing this software so that they don't have to maintain it all internally anymore.

They had a need that an existing "off-the-shelf open source" project didn't solve, so they created this and are now turning it into an "off-the-shelf open source" project so they can keep using it without having to maintain it entirely themselves.

How are these open source tools supposed to be created in the first place? This is the process, someone has to do it

rjh29 1 year ago | | |

Usually the corporate needs differ too much and they end up keeping their own fork anyway.

Netflix has the resources to maintain this. It's probably more a PR move for their hiring division.

slt2021 1 year ago | | |

So Netflix expects the open source community to pick up the maintenance tab?

I understand how open source projects are born, but I struggle to see what the novelty of this project is. Just another Java CRUD app with some questionable design choices that are only applicable to Netflix:

1. They claim it is a distributed system, but it is just a regular Java CRUD app with a SQL backend

2. Java-like DSL with parser and classloader (why? Just why?)

Projects like these are perfect examples of Enterprise Grade FizzBuzz (https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...) and this is exactly what I don't like about it

bluepizza 1 year ago | |

> People need to realize that code is a liability

This is an extreme point of view, that is tightly connected to the MBA-driven min-maxing of everything under the sun.

I am glad that there are folks who aren't afraid to code new systems and champion new ideas. Even in the corporate sense, mediocre risk averse solutions will only take you so far. The most profitable companies tend to be quite daring in their tech.

Code is not a liability. Code is what makes a company move its gears.

delecti 1 year ago | | |

Code being a liability is not a contradiction with code being what makes a company move its gears. The trucks of a delivery service are a liability (requiring maintenance, depreciation accounting, fuel), but are also the only thing that lets the company deliver. A delivery company should own as few trucks as necessary, and no fewer. Any company should publish/run/maintain as little code as necessary, and no less.

pants2 1 year ago | | |

Using open-source is a liability too, with added problems of code licensing conflicts, supply chain attacks, zero-day vulnerabilities, relying on maintainers that don’t work for you, etc.

YawningAngel 1 year ago | |

Off-the-shelf open source stuff is often the product of big companies open sourcing internal tools though. Airflow, which you name check, is a great example of this. Temporal is another example in the space. Someone has to be dumb enough to build new stuff

slt2021 1 year ago | | |

Airflow and Temporal have teams dedicated to maintaining and extending their systems. And these systems are business critical for Astronomer and Temporal, respectively.

And they develop them in a way that works for many customers and use cases, not just Netflix.

But for Netflix this is just another auxiliary system, one of many. Basically just a nice GUI to schedule cron jobs. Does it make sense to sink resources into a custom cron?

bhawks 1 year ago | |

> with long history of maintenance and improvement,

That is a huge load bearing statement.

Do you plan on any contributions back to the community yourself?

Build vs. buy is always an important conversation but claiming that the 'buy'-side path has perfectly 0 maintenance and reliability costs reeks of naivety.

slt2021 1 year ago | | |

If I needed container orchestration I would use k8s. I can improve it and even propose patches, report bugs, or chip into an open source maintainers' fund. I won't write my own orchestrator, especially being in a streaming business.

That's what I meant: it isn't even necessarily Build-vs-Buy, but rather Use-Open-Source-and-Contribute or Reinvent-the-wheel-for-L6-promo-and-then-opensource?

Would the world be better with 10 workflow orchestrator systems or one mature one?

makeset 1 year ago | |

> anything that is not the business critical stuff

That's an important qualifier. For skilled teams in performance-critical domains, the inflection point where any outside code becomes a low-quality/low-control liability is not that far.

ripped_britches 1 year ago | |

100%. Very rarely are these systems built as robustly as those from external folks who earn a profit on building robustness. The best example, of course, is Stripe. But I see this in everything from visual snapshot testing tools to custom CI workflows. The good thing is you can always rely on competitive market dynamics to price the off-the-shelf solution down to a reasonable margin above maintenance costs.

why-el 1 year ago | |

I am confused by this comment:

> open source stuff with long history of maintenance and improvement

Improvement and maintenance are contingent on usage, and having been used at Netflix, this project is in a better place to have already faced whatever bug you are worried about (and let's be real, 99% of applications won't ever get the luck to exercise code paths sophisticated enough to find bugs Netflix has not found already).

You might be unnecessarily projecting here. You don't have evidence to support that open sourcing this might have been for any other reason than it is simply good for the community to have.

archerx 1 year ago | |

This is a naive view, other people’s code is even more of a liability. Look at crowdstrike and opensource infiltrations. Using opensource software doesn’t magically grant you security nor stability.

jcgrillo 1 year ago | |

> People need to realize that code is a liability

Code that you own and intimately understand is less of a liability than some 3rd party dependency (paid or free). Stitching together a patchwork of dependencies is not likely the optimal result. The more aligned your codebase is with the problem you're trying to solve the better, and if functionality is core to your business better to own than borrow or rent.

alfalfasprout 1 year ago | |

I very much disagree with this take-- and the more I've experienced throughout my career the more I'm sure of it.

Companies spend an IMMENSE amount of time and effort adapting sometimes subpar off the shelf solutions to fit their infra and pay an ongoing tax w/ increasing tech debt trying to support them. Often something bespoke and smaller + more tailored would unlock significantly more productivity if the investment is made consciously.

Any code that is written has both assets and liabilities. But to claim it is a distraction and resource sink is a very, very bad take. Every decision to build something in-house needs to be done thoughtfully and deliberately.

jefurii 1 year ago | |

This sounds like the beginning of a sales pitch.

wodenokoto 1 year ago | |

Aren’t most of those things developed in house at tech giants and later open sourced?

MetaWhirledPeas 1 year ago | |

> code is a liability

3rd parties are also a liability. Pick your poison. Trust in unknown individuals, trust in megacorps, or trust your own people. Choosing wisely is why people get paid the big bucks.

paxys 1 year ago | |

> I would rather use off-the-shelf open source stuff with long history of maintenance and improvement

Where do you think this "off-the-shelf open source stuff" comes from exactly?

beanjuiceII 1 year ago | |

Off the shelf comes with its own set of burdens; it's not always sunshine, rainbows, and lollies

ldjkfkdsjnv 1 year ago | |

I was going to say this. I never mess with random libraries like this, always so much pain.

hintymad 1 year ago |

I wonder how many iterations we will need before engineers are happy with a workflow solution. Netflix had multiple solutions before Maestro, such as metaflow. Uber built multiple solutions too. Amazon had at least a dozen internal workflow engines. It's quite curious why engineers are so keen on building their own workflow engines.

Update: I just find it really interesting that many individuals in many companies like to build workflow engines. This is not a deriding comment towards anyone or Netflix in particular. To me, such an observation is worth some friendly chitchat.

rubenfiszel 1 year ago |

Founder of https://windmill.dev here, which shares many similarities with Maestro.

> Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines. It oversees the entire lifecycle of a workflow, from start to finish, including retries, queuing, task distribution to compute engines, etc.. Users can package their business logic in various formats such as Docker images, notebooks, bash script, SQL, Python, and more. Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflow, and conditional branch, etc.

You could replace Maestro with Windmill here and it would be precisely correct. Their rollup is what we call the openflow state.

Main differences I see:

- Windmill is written in Rust instead of Java.

- Maestro relies on CockroachDB for state and we use PostgreSQL for everything (state but also queue). I can see why they would use CockroachDB: we had to roll out our own sharding algorithms to make Windmill scale horizontally on our very large scale customer instances

- Maestro is Apache 2.0 vs Windmill AGPL, which is less friendly

- It's backed by Netflix, so infinite money; although we are profitable, we are a much smaller company

- Maestro doesn't have extensive docs about self-hosting on k8s or docker-compose and either there is no UI to build stuff, or the UI is not yet well surfaced in their documentation

But overall, pretty cool stuff to open-source, will keep an eye on it and benchmark it asap

arresin 1 year ago | |

Anyone considering windmill needs to look at this first:

https://www.windmill.dev/docs/advanced/local_development.

Why do I need to "sync" with windmill? Why is there an IDE built into windmill? Why is this so convoluted? It's like it's starting with the goal of lock-in before even developing a good product or finding market fit.

ensignavenger 1 year ago | |

Thanks for the great comparison! While Maestro is Apache licensed, if it depends on CockroachDB, Cockroach itself isn't even Open Source, so that isn't great. I would rather have an AGPL codebase than a non open source dependency. Of course, over time someone could add alternative DB support.

jamra 1 year ago | | |

I really wonder why they didn’t choose something like RocksDB for more speed.

rwky 1 year ago | |

Been using Windmill for a few months and so far it's rock solid. Keep it up!

skissane 1 year ago |

I'm a bit confused about what is going on here: This project appears to use Netflix/conductor [0]. But you go to that repo, you see it has been archived, with a message saying it is replaced by Netflix's internal non-OSS version, and by unmentioned community forks – by which I assume they mean Orkes Conductor [1]. But this isn't using Orkes Conductor, it looks like it is using the discontinued Netflix version `com.netflix.conductor:conductor-core:2.31.5` [2] – and an outdated version of it too.

[0] https://github.com/Netflix/conductor

[1] https://github.com/conductor-oss/conductor

[2] https://github.com/Netflix/maestro/blob/e8bee3f1625d3f31d84d...

saturn8601 1 year ago |

Anyone here use Activebatch? To me it is the best software I wish had an equivalent for non enterprise users. I have tried and tried to use other "competitors" but Activebatch's simplicity of just attaching a simple MS SQL DB, installing the Windows GUI and execution agent is just click, click, click and now you have a robust GUI based automation environment where you don't have to use code...or if you want, go ahead and use code in any language if you want...but you don't have to.

Airflow may be robust but it is hidden behind a complexity fence that prevents most from seeing whatever its true capability may be. The same goes for other "open source" competitors.

Why can't someone just develop a robust DB backed GUI first system?

I have tried online services as well, they pale in comparison. I guess the cost of maintaining extensions is what kills simpler paid offerings?

It's a complete shame that ActiveBatch is walled off behind a stupid enterprise sales model. This has prevented this wonderful piece of software from being picked up by the wider community. It's like a hidden secret. :/

skywhopper 1 year ago |

Advice: don’t rely on any tool open-sourced by Netflix. They have a long history of dropping support for things after they’ve announced them. Someone got a checkmark on their promotion packet by getting this blog post and code sharing out the door, but don’t build your business on a solution like this.

meliora245 1 year ago |

why would one consider this over something more established such as Temporal, also I see Maestro is written in Java vs Temporal's Go

trustno2 1 year ago | |

Temporal's Go is... something. They used to use Java (I think), then they switched to Go, and the Go is very Java-like.

Or maybe I just don't know Fx.

https://github.com/temporalio/temporal/blob/main/service/mat...

The issue we hit with Temporal - again and again - is that it's very under-documented, and it's something you install at the core of your business, yet it's really hard to understand what is going on, through all the layers and through the very obtuse documentation.

Maestro has... no documentation? OK Temporal wins by default.

swyx 1 year ago | | |

no just the SDK is Java. temporal is 99% Golang, even at Uber https://github.com/uber/cadence

robryan 1 year ago | |

Netflix also uses temporal: https://temporal.io/in-use/netflix

tiffanyh 1 year ago | | |

Is Temporal still alive?

(website doesn't resolve for me)

EDIT: I found the GitHub page

https://github.com/temporalio/temporal

iamspoilt 1 year ago | |

That's also my question.

aimazon 1 year ago | |

isn’t Maestro an alternative to Airflow, not Temporal? Temporal isn’t a workflow orchestrator. There’s some overlap on the internals but they’re different designs for different use cases.

troebr 1 year ago | |

Didn't they rewrite some of Temporal's core in Rust?

sjansen 1 year ago | | |

They (re)wrote most of the client SDKs on a Rust core, but the Temporal server is still written in Go.

pantsforbirds 1 year ago |

This is a really great-looking project. I know I've considered building a (probably worse) version of exactly this on almost every mixed ML + Data Engineering project I've ever worked on.

I'm looking forward to testing it out.

HugoLu88 1 year ago |

I'm building something in the space (orchestra) so here's my take:

Folks making stuff open source and building in the open is obviously brilliant, but when it comes to "orchestrators" (as this is, and identifies) there is already so much that has come before (Airflow and so on) that it's quite hard to see how this actually adds anything to the space other than another option nobody is ever going to use in a commercial setting.

Shameless plug: https://getorchestra.io

indiv0 1 year ago |

Is this meaningfully different from Conductor (which they archived a while back)? Browsing through the code I see quite a few similarities. Plus the use of JSON as the workflow definition language.

opiniateddev 1 year ago | |

Conductor was moved here: https://github.com/conductor-oss/conductor Maestro uses conductor as its core.

https://github.com/Netflix/maestro/blob/main/maestro-engine/...

https://netflixtechblog.com/orchestrating-data-ml-workflows-...

iamsanteri 1 year ago |

So will this serve as a stand-in replacement for something like Airflow?

then4p 1 year ago | |

I'm also missing comparisons to other existing tools like airflow, dagster, mlflow...

makestuff 1 year ago | |

Yeah, also curious if this is meant as a replacement for Airflow.

dboreham 1 year ago |

Interesting. My team recently built a thing for managing long running, multi-machine, restartable, cascading batch jobs in an unrelated vehicle. Had no idea it was a category.

gtrubetskoy 1 year ago |

The name Maestro has already been used for a workflow orchestrator which I worked on back in 2016. That maestro is SQL-centric and infers dependencies automatically by simply examining the SQL. It's written in Go and is BigQuery-specific (but could be easily adjusted to use any SQL-based system).

https://github.com/voxmedia/maestro/
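
To make the idea of inferring dependencies from SQL concrete, here is a minimal sketch. It is not taken from voxmedia/maestro; the regex, the function names, and the example table names are all assumptions for illustration. It treats any identifier after `FROM` or `JOIN` as a table reference and uses the standard library's `graphlib` (Python 3.9+) to order the queries. A real implementation would need a proper SQL parser to handle subqueries, CTEs, and quoting.

```python
import re
from graphlib import TopologicalSorter

# Naive pattern (an assumption): the identifier following FROM or JOIN.
# Real SQL needs a real parser; this only illustrates the idea.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def infer_dependencies(queries):
    """Map each output table to the other managed tables its SQL reads from."""
    return {
        table: {ref for ref in TABLE_REF.findall(sql) if ref in queries and ref != table}
        for table, sql in queries.items()
    }


def execution_order(queries):
    """Return tables ordered so each one runs after everything it reads from."""
    return list(TopologicalSorter(infer_dependencies(queries)).static_order())


# Hypothetical example: keys are the tables each query produces.
queries = {
    "daily_summary": "SELECT u.id, COUNT(*) FROM clean_events e JOIN users u ON e.uid = u.id GROUP BY u.id",
    "clean_events": "SELECT * FROM raw.events WHERE valid",
    "users": "SELECT * FROM raw.users",
}
order = execution_order(queries)
```

References to tables that no query produces (such as `raw.events` here) are treated as external inputs and ignored when ordering.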

stepanhruda 1 year ago | |

With all due respect, there are so many projects. They don’t care about clashing with a repo that has 12 stars and 14 commits.

nijave 1 year ago | | |

Worked at a bank that named their container "cloud" platform GCP and it was in no way related to Google facepalm

jekude 1 year ago |

Seems like they re-engineered Temporal: https://temporal.io/

troebr 1 year ago | |

They did use Temporal at Netflix, they gave a couple presentations 2 years ago. I think this is very much not-Temporal because it relies on a DSL instead of workflow as code.

I don't know if it's a scale-thing, I'm not a workflow expert but this seems more in line with the map-reduce of yore, as in you get some big fat steps and you coordinate them, although you could have coarse-grained activities in Temporal workflows.

I'd be curious to see what the tradeoffs are between the two and if they still have usages for Temporal. Maybe Maestro is better for less technical people? Latency? Scale?

swyx 1 year ago | | |

former temporal employee here. netflix is very big, temporal-at-netflix always coexisted with other orchestration solutions including conductor

> I think this is very much not-Temporal because it relies on a DSL instead of workflow as code.

yup you get it. maestro defines things as json, which just inherently limits how you can write and test it with your normal app code

willbeddow 1 year ago |

I'm sure this is very nice, but the article reads as if written by AI. The first thing I'd want to see is an example workflow (both code and configuration) in a realistic use case. Instead, there's a lot of "powerful and flexible" language, but the example workflow doesn't come until halfway down, and then it's just foobar

halamadrid 1 year ago |

Very nice, Netflix has a reputation of making great OSS products. I wonder where does this stand with Conductor.

opiniateddev 1 year ago | |

Maestro is a domain specific implementation for ML and data pipelines that uses Conductor as its core

https://netflixtechblog.com/orchestrating-data-ml-workflows-...

https://github.com/Netflix/maestro/blob/main/maestro-engine/...

tiffanyh 1 year ago |

Don't see many Java projects being posted on HN.

xyst 1 year ago | |

We only upvote Go or Rust projects here ;)

andbberger 1 year ago |

slightly off topic, but there is dire need for a scientific "workflow manager" built to FAANG engineering standards attuned for the needs of academia (ie primarily designed to facilitate execution of DAGs on clusters). The airflows of the world have complex unnecessary features and require extensive kitbashing to plug into slurm and the academic side of things is a huge mess. Snakemake comes the closest but suffers from massive feature creep, a bizarre specification DSL (superset of python) and blurred resource requirement abstraction boundaries.

slt2021 1 year ago | |

Academia would do better to learn k8s and one of the k8s-native workflow orchestrators. This is as close to FAANG grade and open source as they can get, and arguably a bit better than this repo

andbberger 1 year ago | | |

for better or worse slurm is the status quo for HPC. it works, every university has a slurm cluster, people already know how to use it

torrance 1 year ago | |

What about Nextflow?

andbberger 1 year ago | | |

I considered Nextflow before begrudgingly settling on snakemake for my current project. Didn't record why... possibly because snakemake was already a known quantity and I was under time pressure or because I felt the task DAG would be difficult to specify in WDL. It's certainly the most mature of the bunch.

_Wintermute 1 year ago | | |

Nobody wants to write or debug groovy, especially scientists who are used to python. It also causes havoc on a busy SLURM scheduler with its lack of array jobs (heard this is being fixed soon).

oneplane 1 year ago |

Looks a bit like Argo Workflows combined with Argo Events. Makes sense to have so many projects and products converge around the same endstate.

antishatter 1 year ago |

Anyone have a recommendation for a workflow orchestrator for single server deployments? Looking at running a project at home and for certain pieces think it would be easiest to orchestrate with a tool like Maestro or Airflow but they’re basically set up to run in clusters with admins to manage them.

rwky 1 year ago | |

Windmill is pretty lightweight and easy to deploy. https://www.windmill.dev/ you can configure it to have a single worker on the same server as the ui and database.

katrotz 1 year ago | |

I'd recommend Kestra[1] since it can be run on a single node

[1] https://kestra.io/

ssfak 1 year ago | |

For Python tasks you can check Prefect, among others..

mianos 1 year ago |

Interesting how complete this is. It's almost as comprehensive as prefect.io

This is a critical software infrastructure I have been promoting for years yet almost everyone thinks they don't need it.

kabes 1 year ago |

It says one of the big differentiators from 'traditional workflow orchestrators' is that it supports cyclic graphs. But BPMN (and the orchestrators using it) also supports loops.

petromir 1 year ago |

So they abandoned https://github.com/Netflix/conductor to create Maestro

febed 1 year ago |

Dagster is a better alternative, because of its asset first philosophy. Task based workflows are still available if you really need it.

Sparkyte 1 year ago |

What's the difference between this and enqueuing work into a queue, then waiting for a job to pick it up at a scheduled time? Not saying build a Kafka cluster to serve this, but most cloud providers have queuing tools.

sjansen 1 year ago | |

Putting work in a queue is only the start. Most organizations start there and gradually write ad hoc logic as they discover problems like dependencies, retries, & scheduling.

Dependencies: what can be done in parallel and what must be done in sequence? For example, three tasks get pushed in the queue and only after all three finish a fourth task must be run.

Retries: The concept is simple. The details are killer. For example, if a task fails, how long should the delay between retries be? Too short and you create a retry storm. Forget to add some jitter and you get thundering herds all retrying at the same time.

Scheduling: Because cron is good enough, until it isn't.

A good workflow solution provides battle tested versions of all of the above. Better yet, a great workflow solution makes it easier to keep business logic separate from plumbing so that it's easier to reason about and test.
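
The dependency and retry points above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Maestro or any particular orchestrator does it: the function names, the default `base`/`cap` values, and the thread-pool approach are all hypothetical. It shows exponential backoff with full jitter (a random delay between zero and a growing ceiling) and a fan-in step that runs only after all upstream tasks have finished.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def run_with_retries(task, max_attempts=4, sleep=time.sleep, rng=random):
    """Call task() until it succeeds, sleeping a jittered delay between failures."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))


def run_fan_in(upstream_tasks, downstream_task, runner=run_with_retries):
    """Run upstream tasks in parallel; call downstream only after all succeed."""
    with ThreadPoolExecutor(max_workers=len(upstream_tasks)) as pool:
        futures = [pool.submit(runner, task) for task in upstream_tasks]
        results = [f.result() for f in futures]  # blocks until every task is done
    return downstream_task(results)
```

The jitter is what spreads retries out: without it, every client that failed at the same moment would retry at the same moment too. Even this sketch leaves out persistence, timeouts, and cancellation, which is part of the argument for using a battle-tested workflow tool.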

shawabawa3 1 year ago | |

workflows typically involve chains of jobs with state transitions, waits, triggers, error handling etc

a lot more than just e.g. celery jobs

nijave 1 year ago | |

A workflow manager implements a Choreography based saga pattern https://microservices.io/patterns/data/saga.html

bjourne 1 year ago |

What is a workflow in this context?

monkychop 1 year ago |

Hooolaa

monkychop 1 year ago |

Eduardo

nikhilsimha 1 year ago |

great job on open sourcing!