MLOps is mostly data engineering (cpard.xyz)
* Prometheus/Grafana/TSDB/... can be used to set up a model monitoring platform, since you're observing metrics whether they come from an ML service or a normal service.
* Any service deployment tool can be used to deploy ML models, since they are services.
* AirFlow/Dagster/... can be used to orchestrate model training, since training a model is basically a data engineering task.
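To illustrate the monitoring point in the list above: a model metric can be exposed in the same text format Prometheus scrapes from any other service. This is only a minimal sketch using the standard library; the metric name `model_prediction_mean`, the `model` label, and the model versions are hypothetical, and a real service would more likely use a Prometheus client library than format the text by hand.

```python
from typing import Dict


def render_gauge(name: str, help_text: str, samples: Dict[str, float],
                 label: str = "model") -> str:
    """Render one gauge metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(samples.items()):
        # One sample line per label value, e.g. name{model="churn-v1"} 0.42
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical metric: mean predicted score per deployed model version.
text = render_gauge(
    "model_prediction_mean",
    "Mean predicted score over the last window.",
    {"churn-v1": 0.42, "churn-v2": 0.38},
)
print(text)
```

Nothing here is ML-specific apart from what the number means, which is the point being made.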
With that said, I still believe that there is space for ML-specific tools to be created.
* Model Monitoring tools (ArizeAI is the only one I've used) can be tailored to be easily usable by ML Engineers without requiring DE knowledge.
* Deploying models in production has some specificities: things like GPU support, adaptive batching, ... Those specificities can be implemented inside a model deployment tool.
* Training orchestration is the only domain where I think there's truly no need for new tools.
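As a rough idea of what the batching item in the list above involves, here is a simplified size-or-timeout policy: a batch closes when it is full or when the oldest request in it has waited too long. This is a deterministic sketch over arrival timestamps, not how any particular serving tool implements it; real adaptive batching also tunes these limits at runtime, which is not shown. The function name and parameters are made up for illustration.

```python
from typing import List


def plan_batches(arrivals_ms: List[int], max_batch: int,
                 max_wait_ms: int) -> List[List[int]]:
    """Group request arrival times (in ms) into inference batches.

    A batch is closed when it already holds max_batch requests, or when the
    next request arrives more than max_wait_ms after the batch's first one.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_ms):
        if current and (len(current) >= max_batch
                        or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


batches = plan_batches([0, 1, 2, 3, 50, 51, 200], max_batch=3, max_wait_ms=10)
print(batches)
```

Bursts get packed into full batches for GPU efficiency, while isolated requests are not held back waiting for company.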
In the end, the most successful platform I built was a custom ORM around Redis objects and queues. The most important part wasn't the fancy data processing platform but the details of the container layers and the refactoring of the code to make it easily composable, releasable, and easy for the scientists to play with, with enough guard rails that they wouldn't diverge too far from the structure.
It made us incredibly fast at iterating. Of all the things I worked with, Airflow was the one I was most hyped about from all the videos I had seen, and it turned out to be the biggest mess of them all.
So, talking about monitoring, training, and recording model drift is only a single side of the domain.
Delta Lake/Apache Iceberg solves that.
PS. I am a maintainer at flyte.org (thoughts are my own)
* Add a GPU resource requirement on one of your steps
* Add an auto-scaler that adds GPU nodes to your cluster based on the GPU resource demand.
After having written the above, I realize that it might sound like that famous HN comment about how you can /easily/ re-create Dropbox yourself, which might actually prove your point that there is a need for ML-specific tools for the training part.
At Canva I built auto-scaling GPU infra on K8s for model training[1], and it's way too much work and operational expense to be worth building yourself. I went to work at Modal because building it properly once and then distributing the solution was going to be just way better and more efficient.
1. https://canvatechblog.com/supporting-gpu-accelerated-machine...
Anecdotally, I've worked in high-performance computing and machine learning for years now and the past few months I've seen a huge spike in the number of messages I get for MLOps positions. I think companies are slowly starting to realize that setting up machine learning at scale isn't as simple as deploying poorly written code by research scientists to managed platforms.
This is some truth right here
This doesn’t bode well for ML in lots of orgs. Obviously machine learning is very powerful and effective in many use cases. But the value proposition already isn’t there for lots of companies. At such places, discovering a hidden requirement for more resources is a great reason to change directions.
To be very clear, I’m not talking about the field in general. I’m talking about orgs where management doesn’t see value generated from the ML efforts that suddenly demand more resources to operate.
I don't understand the argument, and I'm not feigning confusion. Everyone has to scale at some point and every solution has its limits. If they were successful with Airflow and Pandas/NumPy for a long time and now, well, the fan is spinning, they are going to call for experts to put the pieces in place. Asking for help is a sign of maturity. It really depends on the state of the system when the experts arrive.
I personally think every org can use ML (it all decays to statistics, then linear algebra).
I want to clarify something about my intention with this post. There is a reason I chose "mostly" in the title. I'm not dismissing the different needs of ML.
If a category withstands the tests of the market, then there's good reason for it to exist.
But, we have ended up creating silos within orgs with fundamentally aligned goals because of the way we build products and companies around them.
What I'm advocating for in this article is the need to think more holistically when we design and build data infra tooling. Yes, ML has unique challenges, but these challenges won't be addressed by reinventing everything again and again.
Tooling should be built having in mind all the practitioners involved in the lifecycle of data.
It's harder to do but at least we'll stop wasting our time building one Airflow copy after the other that is doomed to fail.
These transformations are so different that some of them are "model-independent": you can do the transformations and reuse the output features across many models (aggregations, binning, feature crosses, embeddings, etc.), but the data scientists' transformations are model-dependent and not reusable across different models. For example, an XGBoost model doesn't want normalized numerical features, but a DNN typically needs normalized or standardized numerical features.
These differences are reflected in how we build our "ML pipelines" - we split them into feature/training/inference pipelines, enabling us to localize model-independent transformations to feature pipelines, while doing model-dependent transformations in training/inference pipelines (while ensuring no skew).
In summary, the devil's in the details in pipelines for ML. I agree, however, that orchestrators like Airflow are good enough for orchestration. However, the ML Assets (mutable & reusable features, immutable models, immutable training/inference datasets) are different and tooling will ultimately reflect those differences.
In MLOps, feature pipelines produce features (from raw data). Training pipelines produce models (from features/labels). Inference pipelines produce predictions (from models + features). There is no such thing as an "ML Pipeline" in a production ML system. There is no ML pipeline that goes from raw data to predictions. We have the above FTI pipelines (feature/training/inference pipelines).
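A toy version of the FTI split might look like the following. It is only a sketch under simple assumptions: the event schema (`user`, `amount`), the threshold "model", and all function names are invented for illustration. The feature pipeline does a model-independent aggregation, while standardization (model-dependent) is fitted in the training pipeline and reused unchanged at inference so the two cannot skew.

```python
from statistics import mean, pstdev
from typing import Dict, List


def feature_pipeline(raw_events: List[Dict]) -> List[Dict]:
    """Model-independent: aggregate raw events into reusable per-user features."""
    per_user: Dict[str, List[float]] = {}
    for event in raw_events:
        per_user.setdefault(event["user"], []).append(event["amount"])
    return [
        {"user": user, "total_spend": sum(amounts), "n_orders": len(amounts)}
        for user, amounts in sorted(per_user.items())
    ]


def training_pipeline(features: List[Dict], labels: Dict[str, bool]) -> Dict:
    """Model-dependent: fit standardization, then a toy threshold 'model'."""
    values = [f["total_spend"] for f in features]
    mu, sigma = mean(values), pstdev(values) or 1.0
    scaled = {f["user"]: (f["total_spend"] - mu) / sigma for f in features}
    positives = [s for user, s in scaled.items() if labels[user]]
    negatives = [s for user, s in scaled.items() if not labels[user]]
    threshold = (mean(positives) + mean(negatives)) / 2
    # The scaling parameters are stored with the model to avoid skew.
    return {"mu": mu, "sigma": sigma, "threshold": threshold}


def inference_pipeline(model: Dict, features: List[Dict]) -> Dict[str, bool]:
    """Apply the same model-dependent scaling that training used."""
    return {
        f["user"]: (f["total_spend"] - model["mu"]) / model["sigma"]
        > model["threshold"]
        for f in features
    }


raw_events = [
    {"user": "a", "amount": 10.0}, {"user": "a", "amount": 30.0},
    {"user": "b", "amount": 5.0},
    {"user": "c", "amount": 100.0}, {"user": "c", "amount": 60.0},
]
features = feature_pipeline(raw_events)
model = training_pipeline(features, {"a": False, "b": False, "c": True})
predictions = inference_pipeline(model, features)
print(predictions)
```

The same `features` could feed a tree model that skips the scaling step entirely, which is why the scaling lives with the model rather than in the feature pipeline.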
The principles of MLOps are around being able to develop faster (shorten the development lifecycle) through automated testing and versioning. You need to validate data to build features. You need tested features to build models. You need to test models for your ML systems. It's a hierarchy: data -> features -> models -> ML apps. Versioning is needed for features and models in order to safely upgrade systems and enable them to evolve over time.
I cover a lot of this in a course I developed called 'serverless ml'.
There are some people who have expertise in building production infrastructure, writing production code, and managing production data, but they are few and far between. So finding a "system" that lets data experts work with code experts work with infrastructure experts is important.
Many people say it's "just software engineering" or "just DevOps", but I feel like they are either not respectful enough of the challenges of whichever pillar they are ignoring, or they don't even know that those challenges exist.
Filtering out the BS and finding the smart people who are writing interesting things about MLOps is difficult as they use the same terminology (and if the smart people switched, the BS people would follow, so they may as well stand their ground) but the BS cover doesn't mean that there's nothing substantial underneath.
That can still mean it requires a unique skillset that devops people don't generally have.
It’s just hard to imagine how infrastructure and deployment suddenly isn’t devops because it deals with ML.
For one, the customer is entirely different. You are mostly serving data scientists who don’t have strong engineering skills, which dramatically skews the solutions toward things like Python and Jupyter.
This is a big reason why the tool space is different and has been successful at what it does.
Model training and serving are absolutely nothing like traditional methods. In serving, you are deploying a stateful model, not a stateless backend. That model’s state should ideally be continuously trained, and it requires different scaling and monitoring capabilities.
In training, the GPU problem is far from solved and it is unique to ML with things like how you shard models and fit their weights into memory.
There are extremely challenging problems in this space that simply aren’t the same as devops, and this is coming from a former k8s contributor.
That data engineers or devops people don’t have the background to do MLops doesn’t mean MLops is outside of those disciplines. MLops increasingly seems like a very specific version of these things with unique challenges.
Whether or not serving models fits “traditional” methods seems irrelevant.
This is true for pretty much every engineering discipline/role.
(2) A major difference w/ the conventional software development CI/CD pipelines is the sheer size of the data involved. When you are dealing with “tiny data” you can waste resources on Docker, but when your foundation model is 100x the size, when the training process is distributed, and takes a day, the quantity is taking on a quality of its own.
(3) The worst performance sin is moving data around, although some of it will be necessary insofar as the system is distributed. Avoiding excess data movement can be the difference between training a model and failing to train a model, but when you put together a patchwork of MLOps programs you will find they are moving data around internally, for good reasons sometimes and for no reason other times, and the easy (and sometimes only) integration method is moving data around. Don’t be that guy!
But that doesn't make it any less legitimate - DevOps came from Dev + SysOps, but nobody is arguing DevOps shouldn't be a thing (although you might argue it's no different from SysOps).
In general, buzzwords align pretty closely to VC funding cycles.
As far as I can see, MLOps is just the equivalent generation of DevOps applied to ML.
First, there’s the matter that ML introduces a significant stack and complexity into what was already a relatively complex framework. I mean, managing storage, quality, data processing, streaming, scheduling/orchestration, transformation logic and SLAs requires a lot of tools, whichever combination pleases you the most. Even full platforms offered by some of the players in this market can get quite complex, and it’s very hard to set everything right and handle all the cases. Specialised tooling or skills are probably a good idea to focus on the things that matter to ML and that go beyond what DE already covers. Think of all the frameworks, the statistical libraries, the different nature of the logic for ML features when compared to regular reporting needs, the different quality requirements and structure that ML expects, managing versions of raw, labelled and test datasets, etc. (there are many more, the discussion already covered quite a few).
Which brings me to the second thing: the knowledge stack required to run ML. Besides some of the usual DE stack (developing, data manipulation, quality, etc.), there is a whole new set of skills related to several branches of math, parallelism, very complex and costly infrastructure management, research skills, experiment design, and specific algorithms and approaches (does the regular DE need to understand neural network patterns, how data- and model-parallel training works, the statistics behind setting up and running drift measures, what all the metrics behind model performance mean, etc.?). I find this a better reason for specialisation than any other: there’s only so much one person can hold in their head, and ML development and operations is just getting more complex by the day.
So, my point is, from a very simplified and abstract perspective, the author seems right. But, in practice, you won’t be able to just stack that on top of data engineers and not expect them to become specialized - and that’s where the ML Engineer and MLOps engineer roles are emerging. They’re not completely new, but they’re no longer your regular data engineer or data scientist.
A lot of groups start using Kafka because they have high-throughput event streams they want to aggregate over, and then you just use Kafka for everything because managing a 5 TPS topic alongside a 100k TPS topic is trivial.
In terms of why Kafka is a good fit for that workflow — database writes are unnecessary and oddly structured for raw events we want to expire in a few days, and dealing with buffering blob file writes can cause data loss, so Kafka can really simplify the producer architecture where the producer is also a consumer from a producer who wants an ack. Combine this with how trivial it is to have multiple readers on a pub sub system, and it is easy to scale from 1 to N consumers of a data source without duplicating the data everywhere. E.g. you could have three aggregation jobs that use the same data, one job that writes the topic’s data to blob-style storage for batch use-cases, and a low latency inference job all running from the same data stream.
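The "1 to N consumers without duplicating the data" point can be shown with a toy in-memory log. This is not the Kafka API, just a sketch of the idea behind it: records are appended once, the returned offset plays the role of the ack, and each consumer group keeps its own read position. The class and group names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


class TopicLog:
    """Toy append-only log with an independent offset per consumer group."""

    def __init__(self) -> None:
        self._records: List[Any] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, record: Any) -> int:
        """Append a record and return its offset, which acts as the ack."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 100) -> List[Any]:
        """Return unread records for this group and advance only its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for amount in (10, 20, 30):
    log.publish({"amount": amount})

# An aggregation job and an archival job read the same stored records.
aggregated = sum(r["amount"] for r in log.poll("aggregator"))
archived = log.poll("archiver", max_records=2)

log.publish({"amount": 40})
late_for_aggregator = log.poll("aggregator")
rest_for_archiver = log.poll("archiver")
print(aggregated, archived, late_for_aggregator, rest_for_archiver)
```

Adding a third consumer, say a low-latency inference job, is just another group name; the producer and the stored data do not change.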
More or less, Kafka just simplifies scale-out in some cases, maybe not your case, though. If you’re kicking off workers to do ingestion, you might have a system where you are pulling files down at some infrequent cadence (let’s say every 10 minutes). In that case Kafka is likely going to be overkill and feel like a lot of work for an API call that then becomes a task tracked in some database.
We evaluated several commercial MLops tools and ended up going with generic tools that we already use, instead of something new that's branded for MLops. I.e. postgres + snowflake instead of a commercial feature store -- model deployment, monitoring, and alerting on the same platform as the rest of the company's applications -- etc. When we tried "ML" tools, they took so much work to adapt to our use cases that they really added no value.
"It may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction – see Figure 1"
https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd...
PDF: https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd...
So it's an okay overview of some ML engineering / ops things with a contradictory title which isn't followed up on (and which I'm sure gets more clicks).
So no, MLOps isn't just data engineering. For more information read your own article.
I'm going through the main pillars of MLOps and explaining how they overlap with data engineering.
Also, the title says, "Mostly" not "just" data engineering.
It might be misleading if you just go through the sub-headers but I'm sure if you go through the content you will see that the title and the content are pretty aligned.
I was doing this work since before MLOps was the new buzzword in town, and it was always under the data engineering job title. It was only in the past few years that data engineering has become more focused, requiring new titles/job descriptions to truly cover the different specializations.
systematically orchestrates the data preprocessing and post-processing of the training loop for multi-dimensional data and various types of analysis
But corporate "ML" has been until recently "dashboards" or "big data" or "data analytics" or "look at your corporate records with a computer" in disguise.
This is hardly an uncommon problem outside of machine learning.
A single vendor/tech does not "solve" anything when the task at hand implies you need to entirely re-design data pipelines, ML modelling and benchmarking.
Reproducibility is more than just upstream data versioning.
Use Kafka as the super fast layer to connect producers and consumers. Have it write to S3/Blob. Then insert into your DB layer for "cold" access. Inserting into a DB takes longer than pub/sub, and you don't want to mess up the publishing of data.
Airflow is also absolutely not built for that purpose. It's ~10yr old Hadoop-era technology.
As for configuring the Kubernetes pod operator to ask for pods with GPUs, it exposes the k8s Python API in the DAG definition. I haven't done it myself, but I think that it's not really Airflow that's going to be a pain there. Getting the pod spec right is gonna have to happen whatever does the orchestration.
(Full disclosure: my employer offers airflow as a service)
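For the "getting the pod spec right" part, the core of a GPU request is small regardless of what submits it. The sketch below builds a plain Pod manifest as a Python dict rather than going through any orchestrator's API; it assumes the cluster runs the NVIDIA device plugin, which exposes GPUs as the `nvidia.com/gpu` resource. The pod name, container name, and image are hypothetical.

```python
import json


def gpu_pod_spec(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Kubernetes Pod manifest that requests NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "train",
                    "image": image,
                    # GPUs are requested through limits on the extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical training image and pod name.
spec = gpu_pod_spec("train-step", "registry.example.com/trainer:latest", gpus=2)
print(json.dumps(spec, indent=2))
```

The harder parts discussed in this thread (node pools, autoscaling on pending GPU pods, drivers) sit outside this manifest.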
https://www.davidsbatista.net/blog/2018/03/31/SentenceClassi...
for early processing of data, you can think of it as a profiling tool that works at the level of an individual “cell” in a table, so given a number like 90214 it will guess that it could be a zip code, given “John” it would guess it is a proper name, etc. I’d contrast that to tools that profile a whole column at a time (funny… these are all 5 digit zero padded numbers) but that column is full of numbers consistent with
https://en.wikipedia.org/wiki/Benford%27s_law
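As an illustration of a column-level check of this kind (a sketch, not the system described here), one can compare a column's leading-digit distribution with the Benford expectation P(d) = log10(1 + 1/d). The distance measure and the sample columns below are choices made for this example.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


def leading_digit_shares(values: Iterable[int]) -> Dict[int, float]:
    """Observed share of each leading digit among non-zero integers."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_distance(values: Iterable[int]) -> float:
    """Total variation distance from Benford: 0 is a perfect match, 1 is maximal."""
    shares = leading_digit_shares(values)
    return sum(abs(shares[d] - benford_expected(d)) for d in range(1, 10)) / 2


# Powers of two are a classic sequence that follows Benford's law closely.
organic = [2 ** n for n in range(1, 200)]
# A column of zip-code-like identifiers all starts with the same digit.
zip_like = list(range(90000, 90200))

print(round(benford_distance(organic), 3), round(benford_distance(zip_like), 3))
```

A small distance suggests organically generated quantities, while a large one is a hint that the column holds codes or identifiers rather than measurements.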
This system was successful for a few customers including a major telecom and ultimately the company got bought by a major shoe and clothing brand.
We had an “MLOps” system we developed internally for training those CNN models and we trained some for that task and trained some for other tasks. There were numerous problems with the models that we were concerned about then, most of them have been solved by transformer models which we were just starting to read about then.
I agree with everything you said for whatever it’s worth. I’m mostly just making the observation that there are a lot of machine learning endeavors that aren’t generating much value.
I agree with this. From what I have seen, execs want something fancy, when you could give them a boring-ish tool that reduces your OODA loop cycle time to 30% of what it was using unsexy techniques. Much of it has to do with tolerancing of answers: being within 5-10% is more than enough to drive the business, but someone somewhere said it had to be exact, and that blows out the latency budget.
When engaging in consulting gigs, it is super important to know what kind of org you are dealing with before you get involved. The myopic penny-pinching orgs should be steered well away from, which I think was your point.
Okay, so, where is your training data? Is your training data in a layout which your training code can just linearly scan on S3? Or do you have to transform it first? Or provision a dataset cache on-demand? Is this data engineering or training orchestration?
Not claiming this is the only or best solution, but the way my team solved that was by creating an internal Python lib with common happy paths to access our infrastructure and processes. We deploy our data pipelines as FastAPI services and call them using Airflow. This architecture has scaled really well: we have 300+ data pipelines, even more schedules and 3 engineers. We use Knative so our AWS bill is quite cheap for the number of services we are running.
It all boiled down to treating ml / data engineering problems as common software problems.
Yeah, that's my point, it's hardly a solved problem and you have to write software for this!
ELT = data engineering. Model architecture & training design = MLE. MLOps is the storage of the training data, monitoring of the whole process, caching of model for use in serving and deployment, and retiring of resources. MLOps has some overlap with dataops, e.g. caching of training data, serving of model as application, but monitors for different things like data/concept drift.
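To make "monitors for different things like data/concept drift" a bit more concrete, here is a sketch of one common data-drift statistic, the Population Stability Index (PSI), which compares how a feature is distributed across fixed bins in a reference window versus a current window. PSI is a choice made for this example, not something named above, and it only addresses data drift; concept drift needs labels or outcomes. Bin edges and sample data are made up.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples over shared bin edges."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index is the number of edges the value has reached.
            counts[sum(v >= e for e in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


edges = [0.25, 0.5, 0.75]
training_scores = [i / 100 for i in range(100)]
same_scores = list(training_scores)
shifted_scores = [v + 0.3 for v in training_scores]

print(psi(training_scores, same_scores, edges))
print(psi(training_scores, shifted_scores, edges))
```

The value is zero for identical distributions and grows as mass moves between bins; what counts as "too much" is a threshold each team has to pick for itself.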
But I have to say something about Modal. The difference with this vendor is that they try to reimagine the way people build on the Cloud and it's worth checking out just to see how different the developer experience could be.
I know that most people use it because of the easy and affordable access to GPUs, but I think we are missing the true innovation here, which is the developer experience.
I would even consider Modal as a cloud infra product, although a vertical one, more than an ML or DE product.
*edited to fix some spelling*
Glad you really get what we're trying to do with Modal. You're right it's not just an easy way to get serverless GPUs.
Modal is reimagining software development practices for the cloud era. Developing in the cloud should not be just writing YAML or HashiCorp Configuration Language templates, pushing and pulling Docker images, and re-running 'infratool up' over and over until things work.
Common reasons not to include (1) "I have soooo many AWS credits that I want to use" and (2) (our company's reason) "We have on-prem GPUs but sometimes need Cloud GPUs as well with the same interface".
Using e.g. Ray with AWS is very painful; it took us a long time to iron out all the quirks.