Maestro: Netflix's Workflow Orchestrator (netflixtechblog.com)
I would rather use off-the-shelf open source stuff with a long history of maintenance and improvement than reinvent cron/celery/airflow/whatever, because code is a liability. Somebody needs to maintain it, fix bugs, add new features. Unless I get a +1 grade promotion and a salary/RSU bump, ofc.
People need to realize that code is a liability, anything that is not the business critical stuff that earns/makes $$$ for the company is a distraction and resource sink.
They had a need that an existing "off-the-shelf open source" project didn't solve, so they created this and are now turning it into an "off-the-shelf open source" project so they can keep using it without having to maintain it entirely themselves.
How are these open source tools supposed to be created in the first place? This is the process, someone has to do it
Netflix has the resources to maintain this. It's probably more a PR move for their hiring division.
I understand how open source projects are born, but I struggle to see what the novelty of this project is. Just another Java CRUD app with some questionable design choices that are only applicable to Netflix:
1. They claim it is a distributed system, but it is just a regular Java CRUD app with a SQL backend
2. Java-like DSL with parser and classloader (why? Just why?)
Projects like these are perfect examples of Enterprise Grade FizzBuzz (https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...) and this is exactly what I don't like about it
This is an extreme point of view, that is tightly connected to the MBA-driven min-maxing of everything under the sun.
I am glad that there are folks who aren't afraid to code new systems and champion new ideas. Even in the corporate sense, mediocre risk averse solutions will only take you so far. The most profitable companies tend to be quite daring in their tech.
Code is not a liability. Code is what makes a company move its gears.
And they develop them in a way that works for many customers and use cases, not just netflix.
But for Netflix this is just another auxiliary system, one of many. Basically just a nice GUI to schedule cron jobs. Does it make sense to sink resources into a custom cron?
That is a huge load bearing statement.
Do you plan on any contributions back to the community yourself?
Build vs. buy is always an important conversation but claiming that the 'buy'-side path has perfectly 0 maintenance and reliability costs reeks of naivety.
That's what I meant. It isn't even necessarily Build-vs-Buy, but rather Use-Open-Source-and-Contribute or Reinvent-the-wheel-for-L6-promo-and-then-opensource?
Would the world be better with 10 workflow orchestrator systems or one mature one?
> open source stuff with long history of maintenance and improvement
Improvement and maintenance are contingent on usage, and having been used at Netflix, this project is in a better place to have already faced whatever bug you are worried about (and let's be real, 99% of applications won't ever get the luck to exercise code paths sophisticated enough to find bugs Netflix has not found already).
You might be unnecessarily projecting here. You don't have evidence to support that open sourcing this might have been for any other reason than it is simply good for the community to have.
Code that you own and intimately understand is less of a liability than some 3rd party dependency (paid or free). Stitching together a patchwork of dependencies is not likely the optimal result. The more aligned your codebase is with the problem you're trying to solve the better, and if functionality is core to your business better to own than borrow or rent.
Companies spend an IMMENSE amount of time and effort adapting sometimes subpar off the shelf solutions to fit their infra and pay an ongoing tax w/ increasing tech debt trying to support them. Often something bespoke and smaller + more tailored would unlock significantly more productivity if the investment is made consciously.
Any code that is written has both assets and liabilities. But to claim it is a distraction and resource sink is a very, very bad take. Every decision to build something in-house needs to be done thoughtfully and deliberately.
3rd parties are also a liability. Pick your poison. Trust in unknown individuals, trust in megacorps, or trust your own people. Choosing wisely is why people get paid the big bucks.
Where do you think this "off-the-shelf open source stuff" comes from exactly?
Update: I just find it really interesting that many individuals in many companies like to build workflow engines. This is not a deriding comment towards anyone or Netflix in particular. To me, such an observation is worth some friendly chitchat.
> Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines. It oversees the entire lifecycle of a workflow, from start to finish, including retries, queuing, task distribution to compute engines, etc.. Users can package their business logic in various formats such as Docker images, notebooks, bash script, SQL, Python, and more. Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflow, and conditional branch, etc.
You could replace Maestro with Windmill here and it would be precisely correct. Their rollup is what we call the openflow state.
Main differences I see:
- Windmill is written in Rust instead of Java.
- Maestro relies on CockroachDB for state, while we use PostgreSQL for everything (state but also queue). I can see why they would use CockroachDB: we had to roll our own sharding algorithms to make Windmill scale horizontally on our very-large-scale customer instances
- Maestro is Apache 2.0 vs Windmill AGPL which is less friendly
- It's backed by Netflix, so infinite money, whereas we are profitable but a much smaller company
- Maestro doesn't have extensive docs about self-hosting on k8s or docker-compose and either there is no UI to build stuff, or the UI is not yet well surfaced in their documentation
But overall, pretty cool stuff to open-source, will keep an eye on it and benchmark it asap
https://www.windmill.dev/docs/advanced/local_development.
Why do I need to "sync" with windmill? Why is there an IDE built into windmill? Why is this so convoluted? It's like it's starting with the goal of lock-in before even developing a good product or finding market fit.
[0] https://github.com/Netflix/conductor
[1] https://github.com/conductor-oss/conductor
[2] https://github.com/Netflix/maestro/blob/e8bee3f1625d3f31d84d...
Airflow may be robust but it is hidden behind a complexity fence that prevents most from seeing whatever its true capability may be. The same goes for other "open source" competitors.
Why can't someone just develop a robust DB backed GUI first system?
I have tried online services as well, they pale in comparison. I guess the cost of maintaining extensions is what kills simpler paid offerings?
It's a complete shame that ActiveBatch is walled off behind a stupid enterprise sales model. This has prevented this wonderful piece of software from being picked up by the wider community. It's like a hidden secret. :/
Or maybe I just don't know Fx.
https://github.com/temporalio/temporal/blob/main/service/mat...
The issue we hit with Temporal - again and again - is that it's very under-documented, and it's something you install at the core of your business, yet it's really hard to understand what is going on, through all the layers and through the very obtuse documentation.
Maestro has... no documentation? OK Temporal wins by default.
(website doesn't resolve for me)
EDIT: I found the GitHub page
I'm looking forward to testing it out.
Folks making stuff open source and building in the open is obviously brilliant, but when it comes to "orchestrators" (as this is, and identifies) so much has already come before (Airflow and so on) that it's quite hard to see how this actually adds anything to the space other than another option nobody is ever going to use in a commercial setting.
Shameless plug: https://getorchestra.io
https://github.com/Netflix/maestro/blob/main/maestro-engine/...
https://netflixtechblog.com/orchestrating-data-ml-workflows-...
I don't know if it's a scale-thing, I'm not a workflow expert but this seems more in line with the map-reduce of yore, as in you get some big fat steps and you coordinate them, although you could have coarse-grained activities in Temporal workflows.
I'd be curious to see what the tradeoffs are between the two and if they still have usages for Temporal. Maybe Maestro is better for less technical people? Latency? Scale?
> I think this is very much not-Temporal because it relies on a DSL instead of workflow as code.
yup you get it. maestro defines things as json, which just inherently limits how you can write and test it with your normal app code
https://netflixtechblog.com/orchestrating-data-ml-workflows-...
https://github.com/Netflix/maestro/blob/main/maestro-engine/...
This is a critical software infrastructure I have been promoting for years yet almost everyone thinks they don't need it.
Dependencies: what can be done in parallel and what must be done in sequence? For example, three tasks get pushed in the queue and only after all three finish a fourth task must be run.
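For what it's worth, the "three tasks, then a fourth" case above can be sketched in a few lines. This is only a minimal illustration, not how Maestro or any other engine does it; `run_fan_in` and the task names are made up for the example, and it assumes the tasks are plain Python callables running in one process.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_fan_in(parallel_tasks, final_task):
    """Run parallel_tasks concurrently, then call final_task with their results.

    Hypothetical helper for illustration: a real engine would also persist
    state so the workflow survives a crash of the coordinating process.
    """
    with ThreadPoolExecutor(max_workers=len(parallel_tasks)) as pool:
        futures = [pool.submit(task) for task in parallel_tasks]
        wait(futures)  # block until every upstream task has finished
        # .result() re-raises the exception of any task that failed,
        # so final_task never runs on partial input.
        results = [future.result() for future in futures]
    return final_task(results)


# Three independent tasks (names are assumptions) and a fourth that needs all three.
def extract_a():
    return 1


def extract_b():
    return 2


def extract_c():
    return 3


total = run_fan_in([extract_a, extract_b, extract_c], sum)
```

The easy part is the happy path shown here. The hard part, which this sketch skips entirely, is doing the same thing durably across machines and restarts.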
Retries: The concept is simple. The details are killer. For example, if a task fails, how long should the delay between retries be? Too short and you create a retry storm. Forget to add some jitter and you get thundering herds all retrying at the same time.
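To make the retry point concrete, here is a rough sketch of exponential backoff with "full jitter" (each delay is drawn uniformly between zero and an exponentially growing ceiling). The function names and default values are assumptions chosen for illustration, not taken from any particular workflow engine.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt, up to `cap`, and the actual
    delay is random within [0, ceiling] so that many clients failing at
    the same moment do not all retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def retry(task, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is exhausted.

    `sleep` is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

Even this leaves out most of the killer details: which errors are retryable at all, whether the task is idempotent, and what happens to the retry count if the worker itself dies.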
Scheduling: Because cron is good enough, until it isn't.
A good workflow solution provides battle tested versions of all of the above. Better yet, a great workflow solution makes it easier to keep business logic separate from plumbing so that it's easier to reason about and test.
a lot more than just e.g. celery jobs
In reality there are five main concerns:
1. Resource scheduling: "I have a job or collection of jobs to run... allocate them to the machines I have"
2. Dependency solving: if my jobs have dependencies on each other, perform the topological sort so I can dispatch things to my resource scheduler
3. API/DSL for creating jobs and workflows. I want to define a DAG... sometimes static, sometimes on the fly.
4. Cron-like functionality. I want to be able to run things on a schedule or ad-hoc.
5. Domain awareness: if doing ETL I want my DAGs to be data aware... if doing ML/AI workflows then I want to be able to surface info about what I'm actually doing with them
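As a small illustration of the dependency-solving concern, Python's standard library (3.9+) already ships a topological sorter. The sketch below groups jobs into batches that could be handed to a resource scheduler together; `dispatch_order` and the job names are hypothetical. It only handles acyclic graphs, so it would not cover the cyclic workflows Maestro advertises.

```python
from graphlib import TopologicalSorter


def dispatch_order(dependencies):
    """Return a list of batches; jobs within a batch have no unmet dependencies.

    `dependencies` maps each job name to the set of jobs it depends on.
    Hypothetical helper for illustration only.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents unlock
    return batches


# Example graph (job names are assumptions): three extracts feed a load step,
# which feeds a report.
example = {
    "load": {"extract_a", "extract_b", "extract_c"},
    "report": {"load"},
}
batches = dispatch_order(example)
```

In a real system each batch would be dispatched asynchronously and `done()` called as individual jobs complete, rather than waiting for the whole batch.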
No one solution does all these things cleanly. So companies end up building or hacking around off the shelf stuff to deal with the downsides of existing solutions. Hence it's a perpetual cycle of everyone being unhappy.
I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.
> I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.
But rather more specialized tools that solve specific issues.
What you describe just sounds like a better implemented version of Airflow or the over 100 other systems that are actively trying to be this today (Flyte, Dagster, Prefect, Argo Workflows, Kubeflow, Nifi, Oozie, Conductor, Cadence, Temporal, Step Functions, Logic Apps, your CI system of choice has their own, need I continue, that is not even scratching the surface). Most of those have some sort of "plugin" ecosystem for custom code, in varying degrees of robustness.
For what it is worth, everyone and their mom thinks they can make and wants to be this orchestrator. It's a problem that is just so generic and such a wide net that you end up with annoying-to-use building blocks because everyone wants to architecture astronaut themselves into being the generic workflow orchestration engine. The ultimate system design trap: Something so fundamentally easy to grok and conceptualize that you can PoC one in hours or days, but near infinite possibilities of what you can do with it, resulting in near infinite edge cases.
Instead, I'd rather companies just focus on the problem space that it lends itself to. Instead of Dagster saying "Automate any workflow" and try to capture that space, just make building blocks for data engineering workflows and get really good at that. Instead of Github Actions being a generic "workflow engine" just have it really good at making CI workflow building blocks.
But we can't have it that way. Because then some architecture astronaut will come around and design a generic workflow engine for orchestrating your domain specific workflow engines and say that you no longer need those.
Actually I think I just convinced myself that what you are suggesting actually IS the right way. If companies just said "we will provide an Airflow plugin" instead of building their own damn Airflow this would be easy. But we won't ever have that either. What we really need is some standards around that. Like if CNCF got together and got tired of this and said "This is THE canonical and supported engine for Kube workflows, bring your plugins here if you want us to pump you up". That might work. They've usually had better luck with putting people in lockstep in the Kube ecosystem at least than Apache has historically for more general FOSS stuff. Probably because the problem space there is more limited.
> ...Users can use Metaflow library to create workflows in Maestro to execute DAGs consisting of arbitrary Python code. from https://netflixtechblog.com/orchestrating-data-ml-workflows-...
The orchestration section in this article (https://netflixtechblog.com/supporting-diverse-ml-systems-at...) goes into detail on how Metaflow interplays with Maestro (and Airflow, Argo Workflows & Step Functions)
I’m starting to think workflow engines are somewhat of a design smell.
It’s enticing to think you can build this reusable thing once and use it for a ton of different workflows, but besides requiring more than one asynchronous step, these workflows have almost nothing in common.
Different data, different APIs, different feedback required from users or other systems to continue.
Probably so, but the real design smell seems to be thinking of a workflow engine as a panacea for sustainable business process automation.
You have to really understand the business flow before you automate it. You have to continuously update your understanding of it as it changes. You have to refactor it into sub-flows or bigger/smaller units of work. You have to have tests, tracer-bullets, and well-defined user-stories that the flows represent.
Else your business flow automation accumulates process debt. Just as much as a full-code-based solution accumulates technical debt.
And, just like technical debt, it's much easier (or at least more interesting) to propose a rewrite or framework change than it is to propose an investment in refactoring, testing, and gradual migrations.
It’s really easy to build a custom workflow engine and optimize it for specific use cases. I think we haven’t yet seen a convergence simply because this tool hasn’t yet been built.
Consider the recent rise of tools that quickly dominated their fields: Terraform (IaC), Kubernetes (distributed compute). Both systems are hella complex, but they solve hard problems. Generic workflow engines are complex to understand and difficult to operate and offer a middling experience so many folks don’t even bother.
Often times what happens is the workflow engine is tailored to a specific problem and then other teams discover the engine and want to use it for their projects, but often need some additional feature, sometimes which completely up-ends the mental model of the engine itself.
So they hire tons of engineers who have nothing to do but rearchitect the mess their microservices have created.
Then there are others who create observability and test harnesses for all of that.
When Pornhub and other porn sites can deliver orders of magnitude more data across the world with much simpler systems, you know it's all bullshit.
In the time since those problems have been solved and now are offered as a service by most cloud providers (for a hefty fee of course)
When is that, exactly? https://www.statista.com/chart/15692/distribution-of-global-...
That's nothing. My dedicated server delivers two orders of magnitude greater traffic than Pornhub (and everything in the Mindgeek network really). And I don't even need the cloud. Just better engineering.
Because all the existing ones suck.
(We built our own tiny one too. We need tight integration with systemd jobs and cgroups, and existing solutions don't do that.)
[0] https://techcrunch.com/2023/12/13/orkes-forks-conductor-as-n...
My impression of the code base was that it needed a lot of work to run in a non-Netflix environment. That is part of why the project I was working on ended up abandoning Conductor: we were going to embed Conductor in our product as a workflow engine, but we ended up building our own workflow engine from scratch instead. Another team did end up using it for some internal use cases, but scalability/reliability/etc are less of a concern for internal use cases as opposed to customer-facing ones.
And then Netflix abandons it – and then they open source something else which depends on an old version of it – well, I'm happy they open source anything, but it fits with my earlier impression – throwing stuff over the fence which can be a struggle to adopt in an outside environment. Still, throwing it over the fence is better than not releasing it at all.
Does the world need another workflow orchestrator? Who knows - some folks at Netflix seem willing to pay a handful of engineers $ to do so. Good luck to them
To be fair, I doubt Maestro will take off like Airflow did.
Airflow filled a void of an easier orchestrator for Big Data with a prettier UI than the competitors of the time (Oozie, Luigi), implementing some UX patterns which had been tested at scale at Facebook with data swarm.
The field is quite a bit more crowded now.
Just one of the questions I have regarding this: China has nearly 1.4 billion people, and barely any of them use any of the services here. Instead, they have their own video platforms. And you're telling me that none of those platforms generates at least as much traffic as Prime Video? I doubt it.
[0] https://www.sandvine.com/hubfs/Sandvine_Redesign_2019/Downlo...
Isn’t this the deal with all open source? They are giving something (the code and access to the project) in return for help maintaining it?
No one is being forced to do anything. It is not like there is some open source contributor somewhere now saying, “oh damn, now I have to maintain this, too?”
If people like it and find value in it, they can help contribute to the project in ways they want. Netflix gets to use those contributions, in return for letting people use their contributions. That is just how open source works.
If Netflix still heavily uses this internally, they should still do the most maintenance. Others contribute based on their own needs.
This is exactly the same stack I have to deal with daily, and management's reasoning is that it is the lowest common denominator that works well with a 3-month contract developer delivering the Nth microservice whose sole job is to call another service.
I think the notion of open sourcing a project is that you are literally asking the community for help and that the community will naturally help you with the maintenance.
In fairness, the very nature of open source is that the community is only going to pick up the maintenance tab if the value they're getting out of it is worth it.
It's open source, and they don't have to accept external contributions. Terms have a well-defined meaning, please refrain from calling open source code not open source, and not open source code, open source.
It not having those things is fine, and eventually someone may still take the source and create an open project around it. But understanding that it is a Netflix project helps calibrate people's understanding around whether the model when you find a bug is going to be "fork, fix, and run the fork indefinitely" or "fork, fix, contribution accepted, drop fork and return to upstream."
any other 'source available' licenses would not (legally) let you do that.
The only way a 'truck' could be a liability is a lease for said truck.
There are plenty of economically rational reasons why a company may own more trucks than it strictly needs to manage delivery. For example: wanting to handle seasonal bursts, wanting to ensure reliability, preparing for an expansion, being able to lease capacity to other businesses.
Actually you can go replace truck with server and you describe what made AWS make initial sense.
Please stop misusing accounting concepts.
Assets can also be liabilities. The mortgages in a mortgage-backed security are both an asset and a liability, as was only too well demonstrated in 2008... It's an asset in the security portfolio, but until you sell the security, it's a liability for whoever is securitizing it.
The problem was that the market value of those assets plummeted because no one expected them to generate the agreed-upon cash flows, since the underlying loans were going into correlated defaults. Despite all this, the only party that saw the mortgage as a liability was the individual whose responsibility it was to make a monthly payment on said mortgage.
Outside of swaps and other derivatives financial instruments and other properties don't magically switch from being an asset to being a liability based on random external factors.
This conversation is like accountants talking about processes, threads, fibers and context switching... very imprecisely.
> a person or thing whose presence or behavior is likely to cause embarrassment or put one at a disadvantage.
Code is absolutely a liability. Code deteriorates as conditions change, and unchanged code also becomes more vulnerable in a way that conventional objects can't.
You are describing an operating expense which has an entirely different nature than a liability.
'comes with a maintenance liability' is a handwaving statement that means practically nothing without a ton of contextual information. A true liability has a contractual set of obligations to pay defined amounts on an agreed-upon schedule. No one is going to come after you for not changing the oil on your truck; try missing payments on a lease.
But does it make sense for a trucking (streaming) company to create its own plumbing equipment? I’d rather use Plumbers Supply Inc that every other company uses from Plumber Depot, or use open-source-plumbers.com, because I am not in the plumbing business
This describes Google and Amazon perfectly - while you can armchair quarterback their biz decisions they are definitely doing well for themselves.
The whole of AWS re:Invent is repackaging whatever open source project is trending, hiding its control plane from the user, exposing it via the AWS control plane instead, and charging people per usage instead of per server
Incentives are everything. That's why managers are so careful when applying them to their own jobs.
Several parties will come after you for not changing the oil on your semi-truck that is being used professionally for freight, starting with your driver, your insurance company, and the US Department of Transportation (DOT), specifically the Federal Motor Carrier Safety Administration (FMCSA), to whom you have to provide maintenance records. Trucking is a highly regulated industry, and after Crowdstrike, software engineering is only going to get more regulated, not less.
I wasn't saying they switch; I'm saying they can be both an asset and a liability. Liability isn't strictly an accounting term. It also can refer to something that acts as a disadvantage. Illiquid assets whose valuation can be volatile can be a liability.
Amazon’s approach is the opposite: steal open source repo and make $$$ off of open source contributors’ labor
As for pay per server vs pay per usage. Heck you know Amazon actually bills the team who caused the cost. And gives finance a report on how much each team is spending and on what. Good luck doing that on prem.