MLOps is mostly data engineering (cpard.xyz)

169 points by dpbrinkm 3 years ago | 88 comments

I've been an MLOps Engineer for around 3 years now and I mostly agree with the article. There is a big overlap between the ML-specific tools that are popping up on the market and traditional Data Engineering tools, and I think people don't always realize that:

* Prometheus/Grafana/TSDB/... can be used to set up a model monitoring platform, since you're observing metrics whether they come from an ML service or a normal service.

* Any service deployment tool can be used to deploy ML models, since they are services.

* AirFlow/Dagster/... can be used to orchestrate model training, since training a model is basically a data engineering task.
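To illustrate the first bullet, here is a minimal sketch that renders a model metric and an ordinary service metric with the same function, using the Prometheus text exposition format. It is hand-rolled for illustration only; in practice you would use an official client library. The metric names, label names and values are made up.

```python
def _escape(value):
    """Escape a label value for the Prometheus text format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, value, labels=None):
    """Render one gauge sample (HELP line, TYPE line, sample line)."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name}{label_str} {value}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metrics: one from a model, one from a plain web service.
model_metric = render_gauge(
    "model_mean_prediction", "Mean predicted score", 0.25, {"model": "churn"}
)
service_metric = render_gauge(
    "http_inflight_requests", "Requests currently in flight", 7, {"service": "api"}
)
```

Both strings have the same shape, which is the point: a scraper and a dashboard do not need to know that one of them describes a model.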

With that said, I still believe that there is space for ML-specific tools to be created.

* Model Monitoring tools (ArizeAI is the only one I've used) can be tailored to be easily usable by ML Engineers without requiring DE knowledge.

* Deploying models in production has some specificities: things like GPU support, adaptive batching, ... Those specificities can be implemented inside a model deployment tool.

* Training orchestration is the only domain where I think there's truly no need for new tools.
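As a rough idea of what the batching mentioned above involves, here is a sketch of one simple reading of it: collect requests until either a maximum batch size is reached or a short wait expires. The function name and parameters are hypothetical, and this is not how any particular serving tool implements it.

```python
import queue
import time


def next_batch(requests, max_batch=8, max_wait_s=0.01):
    """Collect up to max_batch items, waiting at most max_wait_s after the first."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough, run the model on what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch
```

A serving loop would call `next_batch` repeatedly and pass each batch to the model in a single forward pass. Larger `max_batch` and `max_wait_s` trade latency for GPU utilization.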

rjzzleep 3 years ago | |

I've used dozens of platforms, distributed job queues and pipelining tools, including Airflow, Pachyderm and a bunch of others. Most of them turned out to be more effort than they were worth and were designed around a very specific use case. Some of them looked fantastic but then had all sorts of weird cases to account for. Kinda like how ArgoCD looks great, but has a bunch of common bugs that nobody seems to care enough about to fix.

In the end the most successful platform I built was a custom ORM I built around Redis objects and queues, and the most important part wasn't actually the fancy data processing platform, but the details of the container layers, the refactoring of the code to make it easily composable, releasable and easy for the scientists to play with, but with enough guard rails so they wouldn't diverge too far from the structure.

It made us incredibly fast at iterating. Of all the things I worked with, Airflow was the one I was most hyped about from all the videos I had seen, and it turned out to be the biggest mess of them all.

sitkack 3 years ago | | |

External blackboards and functional code that operates over them continue to pay dividends. I too have had an amazing amount of success with this pattern. The Redis API is just so damn nice.

Fiahil 3 years ago | |

Yet, everyone misses reproducibility and data versioning :)

So, talking about monitoring, training, and recording model drift is only a single side of the domain.

KptMarchewa 3 years ago | | |

> Yet, everyone misses reproducibility and data versioning :)

Delta Lake/Apache Iceberg solves that.

alfalfasprout 3 years ago | | |

Not to mention dependency management! Since a lot of ML code is in Python this ends up being a very tricky thing to handle at scale (especially if you need to update dependencies, etc.)

kumare3 3 years ago | |

May I recommend looking into flyte.org? It is an open source, Kubernetes-native "orchestration" style tool, but essentially an infrastructure component that is geared to making your ML Engineers and Data scientists more productive. I think iteration velocity, dynamic infrastructure management and trackability are really important and fundamentally different needs of such products.

PS. I am a maintainer at flyte.org (thoughts are my own)

mountainriver 3 years ago | |

Training orchestration does need new tools to spin up GPU instances, make the most of them, and then spin them down. We are still struggling in this domain.

Longwelwind 3 years ago | | |

I don't know what tools you are using but this can be achieved with Airflow on k8s, for example:

* Add a GPU resource requirement on one of your steps

* Add an auto-scaler that adds GPU nodes to your cluster based on the GPU resource demand.
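For the first step, the GPU requirement ultimately ends up on the pod that runs the training task. Here is a sketch of such a pod manifest built as a plain Python dict. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs as the `nvidia.com/gpu` resource. The pod name and image are hypothetical, and how the orchestrator attaches the manifest to a step is not shown.

```python
import json


def training_pod(name, image, gpus=1):
    """Build a Kubernetes Pod manifest that requests `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "train",
                    "image": image,
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical names, for illustration only.
manifest = training_pod("train-step", "registry.example.com/trainer:latest", gpus=2)
manifest_json = json.dumps(manifest, indent=2)
```

When no node has a free GPU, a pod like this stays pending, and that unschedulable pod is the signal a node autoscaler reacts to in the second step.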

After having written the above, I realize that it might sound like that famous HN comment about how you can /easily/ re-create Dropbox yourself, which might actually prove your point that there is a need for ML-specific tools for the training part.

thundergolfer 3 years ago | | |

Just use https://modal.com/ :)

At Canva I built auto-scaling GPU infra on K8s for model training[1], and it's way too much work and operational expense to be worth building yourself. I went to work at Modal because building it properly once and then distributing the solution was going to be just way better and more efficient.

1. https://canvatechblog.com/supporting-gpu-accelerated-machine...

pid-1 3 years ago | | |

Why isn't that solved by k8s + a node autoscaler such as Karpenter?

sandkoan 3 years ago | |

You may want to look at run.house [0] for a pretty powerful solution to many of these problems.

[0] https://github.com/run-house/runhouse

o10449366 3 years ago |

Sure, MLOps is just Data Engineering in disguise when you ignore the complexities of hardware provisioning, GPU optimization, integration tests for model performance and quality, benchmarking, resource constraints (network, disk, memory, GPU memory), etc.

Anecdotally, I've worked in high-performance computing and machine learning for years now and the past few months I've seen a huge spike in the number of messages I get for MLOps positions. I think companies are slowly starting to realize that setting up machine learning at scale isn't as simple as deploying poorly written code by research scientists to managed platforms.

dpbrinkm 3 years ago | |

>I think companies are slowly starting to realize that setting up machine learning at scale isn't as simple as deploying poorly written code by research scientists to managed platforms.

This is some truth right here

nonethewiser 3 years ago | |

> Anecdotally, I've worked in high-performance computing and machine learning for years now and the past few months I've seen a huge spike in the number of messages I get for MLOps positions. I think companies are slowly starting to realize that setting up machine learning at scale isn't as simple as deploying poorly written code by research scientists to managed platforms.

This doesn’t bode well for ML in lots of orgs. Obviously machine learning is very powerful and effective in many use cases. But the value proposition already isn’t there for lots of companies. At such places, discovering a hidden requirement for more resources is a great reason to change directions.

To be very clear, I’m not talking about the field in general. I’m talking about orgs where management doesn’t see value generated from the ML efforts that suddenly demand more resources to operate.

sitkack 3 years ago | | |

Could you explain the statement, "At such places, discovering a hidden requirement for more resources is a great reason to change directions."

I don't understand the argument, not feigning confusion. Everyone has to scale at some point and every solution has its limits. If they were successful with Airflow and Pandas/Numpy for a long time and then well, now the fan is spinning. They are going to call for experts that put the pieces in place. Asking for help is a sign of maturity. It really depends on the state of the system when the experts arrive.

I personally think every org can use ML (it all decays to statistics, then linear algebra).

cpard 3 years ago |

Hey folks I'm the author of the post and happy to see that it gets so much attention on HN. Thank you for the incredible comments!

I want to clarify something about my intention with this post. There is a reason I chose "mostly" on the title. I'm not dismissing the different needs of ML.

If a category withstands the tests of the market, then there's good reason for it to exist.

But, we have ended up creating silos within orgs with fundamentally aligned goals because of the way we build products and companies around them.

What I'm advocating for in this article, is the need to think more holistically when we design and build data infra tooling. Yes ML has unique challenges but these challenges won't be addressed by reinventing everything again and again.

Tooling should be built having in mind all the practitioners involved in the lifecycle of data.

It's harder to do but at least we'll stop wasting our time building one Airflow copy after the other that is doomed to fail.

jamesblonde 3 years ago | |

Here's one thing that you missed - transformations. Data engineers hear the word transformations and think they know what they mean - aggregations, binning, data reductions, cleansing, etc. Data scientists, however, think preparing features for models with transformations - encoding categorical variables (one-hot-encoding, LabelEncoding, OrdinalEncoding) and normalizing/standardizing/log-transforms for numerical features.

These transformations are different enough to matter: some of them are "model-independent" - you can do the transformations and reuse the output feature across many models (aggregations, binning, feature-crosses, embeddings, etc) - but the data scientists' transformations are model-dependent and not reusable across different models. E.g., an XGBoost model doesn't want normalized numerical features, but a DNN typically needs normalized or standardized numerical features.

These differences are reflected in how we build our "ML pipelines" - we split them into feature/training/inference pipelines, enabling us to localize model-independent transformations to feature pipelines, while doing model-dependent transformations in training/inference pipelines (while ensuring no skew).
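The split described above can be sketched in a few lines. The aggregation is model-independent and reusable. The standardization is model-dependent: it is fitted on training data, and the same parameters are reused at inference time to avoid skew. All function and variable names here are hypothetical.

```python
from statistics import mean, pstdev


def aggregate_spend(purchases):
    """Model-independent (feature pipeline): total spend per user."""
    totals = {}
    for user, amount in purchases:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


def fit_standardizer(values):
    """Model-dependent (training pipeline): learn scaling from training data only."""
    mu, sigma = mean(values), pstdev(values)
    return mu, sigma or 1.0  # avoid dividing by zero on constant features


def standardize(value, params):
    """Applied in both training and inference with the SAME params (no skew)."""
    mu, sigma = params
    return (value - mu) / sigma


train_features = aggregate_spend([("a", 10.0), ("a", 20.0), ("b", 50.0)])
params = fit_standardizer(list(train_features.values()))

# At inference time a new user's feature is scaled with the stored params,
# never with statistics recomputed from the incoming data.
scaled_new_user = standardize(60.0, params)
```

A tree-based model could consume `train_features` directly, while a neural network would consume the standardized values, which is why the scaling step belongs with the model and not with the feature.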

In summary, the devil's in the details in pipelines for ML. I agree, however, that orchestrators like Airflow are good enough for orchestration. However, the ML Assets (mutable & reusable features, immutable models, immutable training/inference datasets) are different and tooling will ultimately reflect those differences.

jamesblonde 3 years ago |

IMO, this article misses the essence of and principles of MLOps. The essence of MLOps is that it is about processes (and tooling/platforms) for creating ML assets - features/labels and models. We call them FTI pipelines. Data engineering has data pipelines that produce datasets for consumption.

In MLOps, feature pipelines produce features (from raw data). Training pipelines produce models (from features/labels). Inference pipelines produce predictions (from models + features). There is no such thing as a "ML Pipeline" in a production ML system. There is no ML pipeline that goes from raw data to predictions. We have the above FTI pipelines (feature/training/inference pipelines).
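As a toy illustration of the three pipelines, the sketch below keeps them as separate functions that only communicate through a feature store and a model registry (plain dicts here). The "model" is just a threshold, and every name is hypothetical.

```python
from statistics import mean


def feature_pipeline(raw_rows, feature_store):
    """Raw data -> features."""
    for row in raw_rows:
        feature_store[row["id"]] = row["clicks"] / max(row["views"], 1)


def training_pipeline(feature_store, labels, model_registry, name="ctr_threshold"):
    """Features + labels -> model (a threshold halfway between the class means)."""
    pos = [feature_store[i] for i, y in labels.items() if y == 1]
    neg = [feature_store[i] for i, y in labels.items() if y == 0]
    model_registry[name] = {"threshold": (mean(pos) + mean(neg)) / 2}


def inference_pipeline(ids, feature_store, model_registry, name="ctr_threshold"):
    """Model + features -> predictions."""
    threshold = model_registry[name]["threshold"]
    return {i: int(feature_store[i] > threshold) for i in ids}


store, registry = {}, {}
feature_pipeline(
    [
        {"id": 1, "clicks": 8, "views": 10},
        {"id": 2, "clicks": 1, "views": 10},
        {"id": 3, "clicks": 6, "views": 10},
    ],
    store,
)
training_pipeline(store, {1: 1, 2: 0}, registry)
predictions = inference_pipeline([3], store, registry)
```

Each function can be scheduled, tested and versioned on its own, because none of them calls another directly.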

The principles of MLOps are around being able to develop faster (shorten the development lifecycle) through automated testing and versioning. You need to validate data to build features. You need tested features to build models. You need to test models for your ML systems. It's a hierarchy: data->features->models->ML Apps. Versioning is needed for features and models in order to safely upgrade systems and enable them to evolve over time.

I cover a lot of this in a course I developed called 'serverless ml'.

ritzaco 3 years ago |

I agree there's a lot of "Agile" like BS around MLOps, but this article doesn't really give the prior art enough attention IMO. Data Engineering is a large part of MLOps, but there are unique parts of production ML engineering so it makes sense that it is (slowly) evolving its own discipline.

There are some people who have expertise in building production infrastructure, writing production code, and managing production data, but they are few and far between. So finding a "system" that lets data experts work with code experts work with infrastructure experts is important.

Many people say it's "just software engineering" or "just DevOps", but I feel like they are either not respectful enough of the challenges of whichever pillar they are ignoring, or they don't even know that those challenges exist.

Filtering out the BS and finding the smart people who are writing interesting things about MLOps is difficult as they use the same terminology (and if the smart people switched, the BS people would follow, so they may as well stand their ground) but the BS cover doesn't mean that there's nothing substantial underneath.

nonethewiser 3 years ago | |

Couldn’t you say it’s just devops applied to a specific type of app that often isn’t deployed to production? And takes significant understanding of the underlying systems to deploy? It certainly feels like ML discovering devops.

That can still mean it requires a unique skillset that devops people don’t generally have.

It’s just hard to imagine how infrastructure and deployment suddenly isn’t devops because it deals with ML.

ritzaco 3 years ago | | |

I mean there has definitely been an explosion of "Ops" terms, and not all of them are justified. I think MLOps makes sense to emphasise the three pieces though. I know people want the "Ops" part of "DevOps" to mean "magic optimization for anything", but it goes back to "ops" as in infrastructure/servers etc. So if DevOps is software engineering + deploying and running software reliably, then it still leaves a big gap on the "data" side - both in terms of orchestrating source data, but also in handling large model files, and tracking things like model decay.

mountainriver 3 years ago |

This article feels like it’s written by someone who doesn’t understand the problems faced by ML. These people come through the MLops community every so often. They think no one has realized that dev ops and DE are similar, when in reality they just don’t yet realize how different ML is.

For one, the customer is entirely different. You are mostly serving data scientists who don’t have strong engineering skills, which dramatically skews the solutions toward things like Python and Jupyter.

This is a big reason why the tool space is different and has been successful at what it does.

Model training and serving are absolutely nothing like traditional methods. In serving, you are deploying a stateful model, not a stateless backend. That model’s state should ideally be continuously trained, and it requires different scaling and monitoring capabilities.

In training, the GPU problem is far from solved and it is unique to ML with things like how you shard models and fit their weights into memory.

There are extremely challenging problems in this space that simply aren’t the same as devops, and this is coming from a former k8s contributor.

mylons 3 years ago | |

as someone who’s been in the industry for 20 years, and worked in ML data engineering at twitch, i think you’re just using new words (some are tools but so what) to describe moving data around and making it accessible to people who need it.

eropple 3 years ago | | |

Yeah - I've never been a "data engineer", but I've been doing this sort of shoveling for twenty years. We called it "ETL" and felt faintly foolish blogging about it.

mountainriver 3 years ago | | |

Yes I've been an engineer for 15 years at a bunch of big companies. MLops isn't ML data engineering. I doubt data engineering has many changes outside of active learning.

nonethewiser 3 years ago | |

It sounds more like MLops does fall into devops or data engineering but it expands the definition.

That data engineers or devops people don’t have the background to do MLops doesn’t mean MLops is outside of those disciplines. MLops increasingly seems like a very specific version of these things with unique challenges.

Whether or not serving models fits “traditional” methods seems irrelevant.

boredumb 3 years ago |

Learning some pytorch is truly not the bulk of the work to build a model, having to wrangle and mangle a massive amount of data coming from less-than-ideal data sources, orchestrating the jobs and making all of this available for your training routines to slurp up is a lot of work that evolves fairly quickly beneath you.

PartiallyTyped 3 years ago | |

I wonder whether you could get a GPT-3 level LLM to extract the data for you.

PaulHoule 3 years ago | | |

not fast enough, but maybe it can help write the program that does.

danthelion 3 years ago |

Yes, and Data Engineering is just Software Engineering "in disguise"

marcyb5st 3 years ago | |

While I agree with the premise, there's the added "complication" that knowledge or experience as a Data Scientist/ML Engineer is highly beneficial. Pure SWEs, in my experience, struggle a bit in anticipating how Data Scientists will shoot themselves in the feet. Having experienced it yourself before is a big plus.

rldjbpin 3 years ago | |

underrated comment. of course i agree that there are specificities with mlops and data engineering that don't exist for a typical full-stack application. but if you boil it down enough, you just find the same underlying concepts.

rickette 3 years ago | |

Exactly my thought.

nerdponx 3 years ago | |

Or database admin.

nerdponx 3 years ago | | |

I guess I should clarify because I was downvoted. Data engineering involves a lot of database work. So if ML engineering is largely data engineering, then it's also at least partially database administration, transitively.

noobcoder 3 years ago |

Before entering the field of ML, I perceived MLOps as a superhero with abilities to handle and deploy ML models. However, it seems that MLOps is more or less a typical engineer who acquired skills to manage and deploy data infrastructure for ML purposes (exclusively) by exposure to data engineering.

fest 3 years ago | |

> However, it seems that XYZ is more or less a typical engineer who acquired skills..

This is true for pretty much every engineering discipline/role.

mountainriver 3 years ago | |

Eh that’s not what I’ve seen, non deterministic models have a whole class of unique challenges

PaulHoule 3 years ago |

(1) The killer product encompasses “all of the above”. If you really are going to buy five of them, God help you: between all the mistakes vendors will make that you’ll have to work around and the overhead of moving data around, you’re in for it.

(2) A major difference w/ the conventional software development CI/CD pipelines is the sheer size of the data involved. When you are dealing with “tiny data” you can waste resources on Docker, but when your foundation model is 100x the size, when the training process is distributed, and takes a day, the quantity is taking on a quality of its own.

(3) The worst performance sin is moving data around, although this will be necessary insofar as the system is distributed. Avoiding excess data movement can be the difference between training a model and failing to train a model, but when you put together a patchwork of ML ops programs you will find they are moving data around internally, for good reasons sometimes and no reason other times, plus the easy (and sometimes only) integration method is moving data around. Don’t be that guy!

chatmasta 3 years ago |

If a buzzword is a portmanteau of a previous buzzword (DevOps), combined with a newly hot buzz word (ML), then chances are it's something in disguise.

But that doesn't make it any less legitimate - DevOps came from Dev + SysOps, but nobody is arguing DevOps shouldn't be a thing (although you might argue it's no different from SysOps).

In general, buzzwords align pretty closely to VC funding cycles.

pydry 3 years ago | |

DevOps was associated with a new generation of sysops with stuff like managed infrastructure, IaaC, continuous deploys, etc. It was a whole new generation of that thing.

As far as I can see MLOps is just equivalent generation devops applied to ML.

bobbruno 3 years ago |

On a first read, I could agree with most, if not all, of the author’s arguments. But there are two aspects that were simply left out that I have to consider when doing MLOps, which can prove too complex for just saying “It’s an extension of tooling that data engineering already has”.

First, there’s the matter that ML introduces a significant stack and complexity into what was already a relatively complex framework. I mean, managing storage, quality, data processing, streaming, scheduling/orchestration, transformation logic and SLAs requires a lot of tools, whichever combination pleases you the most. Even full platforms offered by some of the players in this market can get quite complex, and it’s very hard to set everything right and handle all the cases. Specialised tooling or skills is probably a good idea to focus on the things that matter to ML and that go beyond what DE already covers. Think of all the frameworks, the statistical libraries, the different nature of the logic for ML features when compared to regular reporting needs, the different quality requirements and structure that ML expects, managing versions of raw, labelled and test datasets, etc. (there are many more, the discussion already covered quite a few).

Which brings me to the second thing - the knowledge stack required to run ML. Besides some of the usual DE stack (developing, data manipulation, quality, etc), there is a whole new set of skills related to several branches of math, parallelism, very complex and costly infrastructure management, research skills, experiment design, and specific algorithms and approaches. Does the regular DE need to understand neural network patterns, how data- and model-parallel training works, the statistics behind setting up and running drift measures, what all the metrics behind model performance mean, etc.? I find this a better reason for specialisation than any other - there’s only so much one person can hold in their head, and ML development and operations is just getting more complex by the day.

So, my point is, from a very simplified and abstract perspective, the author seems right. But, in practice, you won’t be able to just stack that on top of data engineers and not expect them to become specialized - and that’s where the ML Engineer and MLOps engineer roles are emerging. They’re not completely new, but they’re no longer your regular data engineer or data scientist.

kfk 3 years ago |

We are experimenting with workers using the simple Python arq library and Redis, and I have yet to find an MLOps or Data Engineering use case that is not a good fit for an API+Worker on K8S. For instance, you need to manage ML artifacts? You can just offer an API endpoint so the ML models can automatically update the artifacts. You need data ingestion? You can have a worker running ingestion scripts and kick off the worker via API. We tried pub/sub and Kafka but it can be really wasteful: workers can process work for multiple streams, but Kafka cannot. But of course I wonder if I am missing something, I am not an ML engineer so probably I am?
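The API+Worker shape described above can be sketched with the standard library alone. This is a stand-in, not the commenter's setup: `queue.Queue` takes the place of Redis, a plain thread takes the place of an arq worker, and the job kinds and handlers are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()
results = {}

# One worker serves several kinds of work, not just one stream.
HANDLERS = {
    "ingest": lambda payload: len(payload["rows"]),
    "register_artifact": lambda payload: f"stored:{payload['name']}",
}


def api_enqueue(job_id, kind, payload):
    """What an API endpoint would do: queue the job and return immediately."""
    jobs.put((job_id, kind, payload))
    return job_id


def worker():
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, kind, payload = item
        results[job_id] = HANDLERS[kind](payload)
        jobs.task_done()


def demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    api_enqueue("j1", "ingest", {"rows": [1, 2, 3]})
    api_enqueue("j2", "register_artifact", {"name": "model-v2"})
    jobs.join()     # wait until both jobs are processed
    jobs.put(None)  # tell the worker to stop
    thread.join()
    return dict(results)
```

Swapping the in-process queue for a Redis-backed one changes where the jobs live, not the shape of the code.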

alextheparrot 3 years ago | |

It isn’t particularly clear what technical requirements you were working against, but let me give it a shot:

A lot of groups start using Kafka because they have high-throughput event streams they want to aggregate over, and then you just use Kafka for everything because managing a 5 TPS topic alongside a 100k TPS topic is trivial.

In terms of why Kafka is a good fit for that workflow — database writes are unnecessary and oddly structured for raw events we want to expire in a few days, and dealing with buffering blob file writes can cause data loss, so Kafka can really simplify the producer architecture where the producer is also a consumer from a producer who wants an ack. Combine this with how trivial it is to have multiple readers on a pub sub system, and it is easy to scale from 1 to N consumers of a data source without duplicating the data everywhere. E.g. you could have three aggregation jobs that use the same data, one job that writes the topic’s data to blob-style storage for batch use-cases, and a low latency inference job all running from the same data stream.
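The "multiple readers on one stream" idea can be illustrated with a toy in-memory log. Each consumer group keeps its own offset into a single stored copy of the records. This is only a model of the concept; it is not the Kafka client API, and all names are hypothetical.

```python
class Log:
    """Append-only log with an independent read offset per consumer group."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # group name -> index of the next record to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, group, max_records=100):
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def __len__(self):
        return len(self._records)


log = Log()
for event in ("click", "view", "click"):
    log.append(event)

aggregated = log.poll("aggregation-job")           # reads all three events
archived_first = log.poll("blob-archiver", 2)      # reads at its own pace
archived_rest = log.poll("blob-archiver")
```

Adding a third reader, such as a low-latency inference job, means adding one more offset, not another copy of the data.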

More or less, Kafka just simplifies scale-out in some cases, maybe not your case, though. If you’re kicking off workers to do ingestion, you might have a system where you are pulling files down at some infrequent cadence (let’s say every 10 minutes). In that case Kafka is likely going to be overkill and feel like a lot of work for an API call that then becomes a task tracked in some database.

kfk 3 years ago | | |

interesting, in my case I need the events to be inserted in a SQL db, even the real time ones. For instance, I receive Contacts data from Hubspot in realtime, I send those contacts over to Salesforce and I store them in Postgres. Why Postgres? Because we want to keep a history of the contacts, plus we will need to have 1 source of truth for customers data to fulfill various data privacy requirements. How about Kafka for such a use case? Let's say I receive maybe 10,000 contacts per day.
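For scale, the volume mentioned above can be converted to an average rate and compared with the 5 TPS topic used as the "small" example earlier in the thread. This is only arithmetic on the numbers quoted; it says nothing about bursts.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

contacts_per_day = 10_000
avg_events_per_second = contacts_per_day / SECONDS_PER_DAY  # about 0.116

# How many times smaller this is than the 5 TPS topic mentioned above.
small_topic_tps = 5
ratio = small_topic_tps / avg_events_per_second  # about 43.2
```

On average that is roughly one contact every 8.6 seconds.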

mmierz 3 years ago |

I'm currently working in an MLOps engineer role at a mid sized company. I agree with the article that most of what I do is plain old software engineering. I don't think I'm interchangeable with any other backend dev though, because ML expertise really does come in handy here. I think the thing that makes it a bit specialized is that we are providing tools to allow our data scientists to self-serve model deployment and monitoring, but they are by and large not expert web programmers. So we need to anticipate the kind of mistakes they're likely to make and provide opinionated tools that guide them into building sane software in the specific context of our company's technology. As well as direct support as needed.

We evaluated several commercial MLops tools and ended up going with generic tools that we already use, instead of something new that's branded for MLops. I.e. postgres + snowflake instead of a commercial feature store -- model deployment, monitoring, and alerting on the same platform as the rest of the company's applications -- etc. When we tried "ML" tools, they took so much work to adapt to our use cases that they really added no value.

antipaul 3 years ago |

The overlap in "deployment" between MLOps and Software engineering was hinted at in that well-known 2015 NeurIPS paper, "Hidden Technical Debt in Machine Learning Systems":

"It may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction – see Figure 1"

https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd...

PDF: https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd...

jstx1 3 years ago |

This is a strange article. The body of the article correctly talks about all the model work... but data engineers typically don't have to do any of that work.

So it's an okay overview of some ML engineering / ops things with a contradictory title which isn't followed up on (and which I'm sure gets more clicks).

So no, MLOps isn't just data engineering. For more information read your own article.

cpard 3 years ago | |

Hey, thanks for the feedback! There is a reason the article has the structure it does.

I'm going through the main pillars of MLOps and explaining how they overlap with data engineering.

Also, the title says, "Mostly" not "just" data engineering.

It might be misleading if you just go through the sub-headers but I'm sure if you go through the content you will see that the title and the content are pretty aligned.

antonvs 3 years ago | |

Betteridge’s Law of Headlines

stuartaxelowen 3 years ago |

Feature stores are essentially materialized views (aside from any realtime feature resolution needed). I think it's a good thing that there is specialized effort being taken here, though: feature stores are an abstraction that could be useful in other domains also, and this surge in interest is an opportunity for us to make better tools.
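The analogy can be made concrete with SQLite from the Python standard library. SQLite has no native materialized views, so this sketch emulates one by storing the result of an aggregate query in a table and recomputing it on demand. Table and column names are hypothetical.

```python
import sqlite3


def refresh_user_features(conn):
    """Recompute the stored feature table from raw events (a manual refresh)."""
    conn.execute("DROP TABLE IF EXISTS user_features")
    conn.execute(
        """
        CREATE TABLE user_features AS
        SELECT user_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
        FROM orders
        GROUP BY user_id
        """
    )
    conn.commit()


def get_features(conn, user_id):
    """Cheap lookup at training or inference time; no aggregation happens here."""
    return conn.execute(
        "SELECT n_orders, avg_amount FROM user_features WHERE user_id = ?",
        (user_id,),
    ).fetchone()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 30.0), (2, 5.0)]
    )
    refresh_user_features(conn)
    before = get_features(conn, 1)

    conn.execute("INSERT INTO orders VALUES (1, 50.0)")
    stale = get_features(conn, 1)   # unchanged until the next refresh
    refresh_user_features(conn)
    after = get_features(conn, 1)
    conn.close()
    return before, stale, after
```

The stale read in the middle is the part that the "realtime feature resolution" caveat is about: a stored view is only as fresh as its last refresh.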

nonethewiser 3 years ago |

Consider that there are different types of developers. This remains true in the context of “Devops”. Devops doesn’t mean web dev ops or something. Given that, how is MLops not a certain type of devops? It’s basically ML engineers figuring out how to deploy their systems to production, no?

KptMarchewa 3 years ago | |

"Devops" only means terraform monkey now. Just as "data engineer" is python/sql monkey.

achileas 3 years ago |

alwayshasbeen.jpg

I was doing this work since before MLOps was the new buzzword in town, and it was always under the data engineering job title. It was only in the past few years that data engineering has become more focused, requiring new titles/job descriptions to truly cover the different specializations.

Kalanos 3 years ago |

https://docs.aiqc.io

systematically orchestrates the data preprocessing and post-processing of the training loop for multi-dimensional data and various types of analysis

jldugger 3 years ago |

Side note: was this a tweetstorm originally? Literally every paragraph is a single sentence, often long ones with no reader affordances like bolding key points.

hgsgm 3 years ago |

Is ML just statistics (data analysis) in disguise?

version_five 3 years ago | |

No.

But corporate "ML" has been until recently "dashboards" or "big data" or "data analytics" or "look at your corporate records with a computer" in disguise.

jstx1 3 years ago | |

They're distinct, there isn't all that much analysis in ML.

steveBK123 3 years ago |

Data Engineering with a top hat & bow tie, really.