One of the problems is that the demos at PyData, Spark Summit and whatnot do not survive first contact with reality, even for really simple things.
For example, some libraries expect a filepath for their data. Say you want to use Keras from a notebook and your data lives somewhere other than on disk. That's the normal case if your job isn't to write blog posts on ML deployments, but to serve real clients who expect you to explore data, build, deploy, and manage models, then build applications that use them and also look pretty, with money on the line, not toy projects. You suddenly have to dive into the framework internals to make it work with, say, object storage.
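The blunt workaround, when a library insists on a filepath, is to copy the object to a temporary file first. A minimal sketch of that, using only the standard library; `read_object` is a hypothetical stand-in for whatever GET call your object store client gives you, and this does nothing for data that doesn't fit on local disk:

```python
import contextlib
import os
import tempfile
from typing import Callable, Iterator


@contextlib.contextmanager
def as_local_path(read_object: Callable[[str], bytes], key: str) -> Iterator[str]:
    """Copy one object to a temp file and yield its path; clean up afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, os.path.basename(key))
        with open(path, "wb") as f:
            f.write(read_object(key))
        yield path  # hand this to the library that only accepts filepaths


# Fake in-memory "object store", for illustration only.
fake_store = {"datasets/train/cat.jpg": b"not really a jpeg"}

with as_local_path(fake_store.__getitem__, "datasets/train/cat.jpg") as path:
    with open(path, "rb") as f:
        data = f.read()
```

It works for a handful of files. With 100k+ images you are back to writing your own loader.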
Another example: say your project is image classification and you have 100k+ images. Min.io does not support pagination because it's not really "S3", so you have to build pagination for the users yourself, because you're displaying it like a directory and it must act like a directory. The way Min.io does it in their front end is to download the whole list recursively and then do an infinite scroll. This can be 20MB+ of data through the network. It works great if you have great internet bandwidth, but in a lot of parts of the world people have maybe 1Mbps (notice Mbps, not MBps), and this won't work when a user just wants to "explore the directory structure".
Heck, one of our colleagues was not using the product and when pressed, she said the notebook was taking forever to load. There were 30 megabytes of static files being downloaded and she had a 5kB/second connection during lockdown. We dug into rebuilding it, then compressing static files, and caching. And she's having trouble using the AppBook for real projects in vision, for example specifying a data source and having the boxes display properly.
One way we're developing the product is by going through the real projects we have worked on, with real data, and redoing them retrospectively on our platform to make sure it works with real problems. We're not optimizing for a demo at an event, we're optimizing for something that really works for us, because we don't have teams of "data scientists", "ML engineers", and "deployment engineers". We want the couple of ML practitioners we have to be able to get data projects running in a self-service way, which means that by definition you have to inherit all the complexity you're trying to spare users.
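Both of those numbers are easy to sanity-check. The arithmetic below assumes decimal units (1 MB = 1,000,000 bytes) and ignores latency and protocol overhead, so real life is worse:

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: bytes -> bits, divided by link speed in bits/s."""
    return size_bytes * 8 / link_bits_per_second


MB = 1_000_000  # decimal megabyte, in bytes

# 20 MB directory listing over a 1 Mbps link
listing = transfer_seconds(20 * MB, 1_000_000)

# 30 MB of static files at 5 kB/s (= 40,000 bits/s)
notebook = transfer_seconds(30 * MB, 5_000 * 8)

print(f"listing:  {listing:.0f} s")
print(f"notebook: {notebook / 60:.0f} min")
```

That is 160 seconds to look at a folder and 100 minutes to open a notebook.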
You hit the same kind of problem when you can't trivially create an "empty bucket". Users don't care that S3 is not the same as a filesystem. You're pretending it is by showing a "folder" icon, and you'd damn well better get it to work like a "folder" where one can create structures for image classes and then traverse them. The API does not allow that, so you have to write the code that gives it the look and feel of a directory, and then write something that makes "pagination" work to display hundreds of thousands of images. And that's just 100K+ images, not millions or billions. But you wouldn't have that problem with a hello world example or the talk you give.
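To give an idea of what "make it act like a directory" means, here is a rough sketch of one-level listing with pagination over a flat key space. Everything in it is an assumption for illustration: the sorted `keys` list stands in for the store, `list_dir` and `Page` are made-up names, and the "empty folder" is emulated with the common convention of a zero-byte object whose key ends in `/`. It is not how we or Min.io actually implement it.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Optional


class Page(NamedTuple):
    entries: List[str]         # folders end with "/", files do not
    next_token: Optional[str]  # pass back as start_after; None on the last page


def list_dir(keys: List[str], prefix: str = "", start_after: Optional[str] = None,
             page_size: int = 3, delimiter: str = "/") -> Page:
    """Return one page of the immediate children of `prefix`. `keys` must be sorted."""
    if start_after is None:
        i = bisect_left(keys, prefix)
    elif start_after.endswith(delimiter):
        # Resuming after a folder: skip everything stored underneath it.
        i = bisect_left(keys, start_after)
        while i < len(keys) and keys[i].startswith(start_after):
            i += 1
    else:
        i = bisect_right(keys, start_after)

    entries: List[str] = []
    while i < len(keys) and keys[i].startswith(prefix):
        if len(entries) == page_size:
            return Page(entries, entries[-1])  # more children remain
        rest = keys[i][len(prefix):]
        if rest == "":
            i += 1  # the folder's own placeholder object, not a child
            continue
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Collapse every key under this sub-folder into a single entry.
            folder = prefix + head + delimiter
            entries.append(folder)
            while i < len(keys) and keys[i].startswith(folder):
                i += 1
        else:
            entries.append(keys[i])
            i += 1
    return Page(entries, None)


keys = sorted([
    "train/cats/001.jpg", "train/cats/002.jpg",
    "train/dogs/001.jpg",
    "train/birds/",        # zero-byte placeholder: an "empty folder"
    "train/labels.csv",
    "train/README.txt",
])

first = list_dir(keys, "train/")
second = list_dir(keys, "train/", start_after=first.next_token)
```

With the default page size of 3, `first` holds `train/README.txt`, `train/birds/` and `train/cats/`, and `second` holds `train/dogs/` and `train/labels.csv`. Listing `train/birds/` returns nothing, which is exactly what an empty folder should do.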
The deployment problem, for instance. Yes, you see the example and it looks great. Then you try to reproduce the example in the repo, and it does not work.
Let's say you use MLflow to "deploy". It has a client and a server. As you'd expect, the client makes a request to the server, and the server does "things". But let's say you're deploying a model that's in object storage: object storage credentials must be put server side and client side. You can't just make a request from the client to save a model and let the server handle it in the backend with whatever storage you have. No, you must specify the object storage URL and credentials in the client code.
Which means, if you don't want to play house, you have to proxy requests and authenticate them in a "man-in-the-middle" fashion between the MLflow client and the MLflow server itself, just so that your credentials do not leak.
This would be mitigated if you were using Min.io in a multi-tenant mode so each user has their own "object storage", but Min.io does not have an API with which you can do that (user creation, etc.), and you must do it with their `mc` client. Which means you have to create these on the fly for each user and wrap the calls.
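Concretely, this is roughly what ends up in every user's notebook environment. The variable names are the ones the MLflow client and boto3 read for an S3-compatible artifact store, as far as I know; the helper name and all values are placeholders I made up:

```python
import os
from typing import Dict


def mlflow_client_env(tracking_uri: str, s3_endpoint: str,
                      access_key: str, secret_key: str) -> Dict[str, str]:
    """Environment a client needs to log a model to S3-compatible storage."""
    return {
        "MLFLOW_TRACKING_URI": tracking_uri,
        "MLFLOW_S3_ENDPOINT_URL": s3_endpoint,
        # The storage credentials live on the client, in plain sight of the user.
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
    }


env = mlflow_client_env(
    tracking_uri="http://mlflow.example.internal:5000",   # placeholder
    s3_endpoint="http://minio.example.internal:9000",     # placeholder
    access_key="PLACEHOLDER_ACCESS_KEY",
    secret_key="PLACEHOLDER_SECRET_KEY",
)
os.environ.update(env)
```

Anyone who can run `os.environ` in that notebook can read the storage secret.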
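The wrapping looks something like the sketch below. `mc admin user add` is the real subcommand; the alias, the function names, and the choice of secret length are my assumptions, and a real version still has to attach a policy and create the user's bucket:

```python
import secrets
import subprocess
from typing import Callable, List


def mc_add_user_cmd(alias: str, access_key: str, secret_key: str) -> List[str]:
    """Build the argv for `mc admin user add ALIAS ACCESSKEY SECRETKEY`."""
    return ["mc", "admin", "user", "add", alias, access_key, secret_key]


def provision_user(alias: str, username: str,
                   run: Callable[..., object] = subprocess.run) -> str:
    """Create a per-user identity by shelling out to mc; returns the new secret."""
    # 24 random bytes -> 32 URL-safe characters, an assumed-safe secret length.
    secret_key = secrets.token_urlsafe(24)
    # Caveat: passing the secret as an argument exposes it in the process list.
    run(mc_add_user_cmd(alias, username, secret_key), check=True)
    return secret_key


# Dry run with a fake runner, so nothing is executed here.
recorded: List[List[str]] = []
secret = provision_user("myminio", "alice",
                        run=lambda cmd, check: recorded.append(cmd))
```

`run` is injectable only so the wrapper can be exercised without a Min.io server around.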
There's also the problem of workload scheduling, notebook collaboration and versioning. You give 2GB of RAM? OK. Users need way more. What do you do next? You give 100GB of RAM? You make it elastic? How do you deal with "runway models" (as opposed to Instagram models) that are hemorrhaging your resources? You have to think about resource management and workload management. Do you institute quotas so that one user doesn't monopolize all the resources?
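Even the most naive quota scheme is code you have to write and keep correct. A toy sketch, with made-up limits and names, that only does bookkeeping and enforces nothing on the actual containers:

```python
from collections import defaultdict


class QuotaExceeded(Exception):
    pass


class MemoryQuota:
    """Tracks RAM reservations in GB against a per-user cap and a cluster cap."""

    def __init__(self, cluster_gb: float, per_user_gb: float) -> None:
        self.cluster_gb = cluster_gb
        self.per_user_gb = per_user_gb
        self.used = defaultdict(float)  # user -> GB currently reserved

    def reserve(self, user: str, gb: float) -> None:
        if self.used[user] + gb > self.per_user_gb:
            raise QuotaExceeded(f"{user} would exceed the per-user cap")
        if sum(self.used.values()) + gb > self.cluster_gb:
            raise QuotaExceeded("cluster is full")
        self.used[user] += gb

    def release(self, user: str, gb: float) -> None:
        self.used[user] = max(0.0, self.used[user] - gb)


quota = MemoryQuota(cluster_gb=100, per_user_gb=40)
quota.reserve("alice", 32)
quota.reserve("bob", 32)

try:
    quota.reserve("alice", 16)  # 32 + 16 = 48 GB, over the 40 GB per-user cap
except QuotaExceeded as err:
    print(err)
```

And this says nothing about what to do with the job that was refused: queue it, kill something else, or tell the user to come back later.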
How do you deal with real time collaboration and versioning? Because you know, you're working on real projects with real people? Do they have to version their notebooks? They don't know how to use Git when they do ML. Do you hack on the Contents API and have a custom ContentsManager? Do you dig through operational transformation or CRDT to give it the look and feel people expect now for collaboration?
It is the stitching together and managing of these fragmented tools and their idiosyncrasies that makes the posts I read on some data science Medium blog, or the talks I watch about machine learning lifecycle management, completely shock me. I really would love it to be that way, but it simply isn't. Maybe it is when you're toying with a Jupyter notebook or on Kaggle, training a model on data on your disk, wrapping a Flask application around it, then writing a blog post on how easy it is.
Let's then say that you have "deployed" your model with the super ML lifecycle management library, which really just starts a process and launches a Flask application. How do you shut it down or manage it? What about drift? How do you retrain it? Do you use Airflow or NiFi or the like? Who configures them, the user? What's the schedule?
So, yes. I understand why your question is: "Since everybody has it figured out and blogs about it and demos at conferences, am I that stupid or is everyone full of baloney? Is there something everyone knows that I don't or what?"
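Just the "shut it down" part already means owning the process handle yourself. A bare-bones sketch with the standard library; `ManagedProcess` is a made-up name and the sleeping child process is a stand-in for the model server:

```python
import subprocess
import sys
from typing import List, Optional


class ManagedProcess:
    """Start a child process and stop it politely, then forcefully if needed."""

    def __init__(self, argv: List[str]) -> None:
        self.argv = argv
        self.proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        self.proc = subprocess.Popen(self.argv)

    def is_running(self) -> bool:
        return self.proc is not None and self.proc.poll() is None

    def stop(self, grace_seconds: float = 5.0) -> None:
        if self.proc is None or not self.is_running():
            return
        self.proc.terminate()  # ask nicely first
        try:
            self.proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            self.proc.kill()   # then stop asking
            self.proc.wait()


# Stand-in for "the deployed model": a process that just sleeps.
server = ManagedProcess([sys.executable, "-c", "import time; time.sleep(60)"])
server.start()
was_running = server.is_running()
server.stop()
```

This handles one process on one machine. Restarts, health checks, retraining and scheduling are all still open.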