Data Version Control

161 points by HerrMonnezza 3 years ago | 59 comments

lizen_one 3 years ago |

DVC has had the following problems, when I tested it (half a year ago):

It gets super slow (waiting minutes) when there are a few thousand files tracked. Thousands of files have to be tracked if you have, e.g., a 10GB file per day and region, plus the artifacts generated from it.

You are encouraged (it can only track artifacts) if you model your pipeline in DVC (think like make). However, it cannot run tasks in parallel. So it takes a lot of time to run a pipeline while you are on a beefy machine and only one core is used. Obviously, you cannot run other tools (e.g. snakemake) to distribute/parallelize on multiple machines. Running one (part of a) stage also has some overhead, because it does commits/checks after/before running the executable of the task.

Sometimes you get merge conflicts if you manually run one part of a (partially parameterized) stage on one machine and the other part on another machine. These are cumbersome to fix.

Currently, I think they are more focused on ML features like experiment tracking (I prefer other mature tools here) instead of performance and data safety.

There is an alternative implementation from a single developer (I cannot find it right now) that fixes some problems. However, I do not use it because it probably will not have the same development progress and testing as DVC.

This sounds negative, but I think it is currently one of the best tools in this space.

kvnhn 3 years ago | |

You might be referring to me/Dud[0]. If you are, first off, thanks! I'd love to know more about what development progress you are hoping for. Is there a specific set of features that bars you from using Dud? As far as testing goes, Dud has a large and growing set of unit and integration tests[1] that are run in GitHub CI. I'll never have the same resources as Iterative/DVC, but my hope is that being open source will attract collaborators. PRs are always welcome ;)

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...

remram 3 years ago | |

> You are encouraged if you model your pipeline in DVC.

Encouraged to do what?

You might want to slow down on the use of parentheses, we are both getting lost in them.

nerdponx 3 years ago | | |

I assume they meant to say "you are encouraged to use DVC to run your model and experiment pipeline". They want to encourage you to do this because they are trying to build a business around being a data science ops ecosystem. But the truth is that DVC is not a great tool for running "experiments" searching over a parameter space. It could be improved in that regard, but that's just not what I use it for, nor is it what I recommend it to other people for.

However, it's fantastic for tracking artifacts throughout a project that have been generated by other means, for keeping those artifacts tightly in sync with Git, and for making it easy to share those artifacts without forcing people to re-run expensive pipelines.

jdoliner 3 years ago | |

DVC is great for use cases that don't get to this scale or have these needs. And the issues here are non-trivial to solve. I've spent a lot of time figuring out how to solve them in Pachyderm, which is good for use cases where you do need higher levels of scale or might run into merge conflicts with DVC. There are trade-offs, though. DVC is definitely easier for a single developer / data scientist to get up and running with.

nerdponx 3 years ago | | |

I think it's worth noting that DVC can be used to track artifacts that have been generated by other tools. For example, you could use MLFlow to run several model experiments, but at the end track the artifacts with DVC. Personally I think that this is the best way to use it.

However, I agree that in general it's best for smaller projects and use cases. For example, it still shares the primary deficiency of Make in that it can only track files on the file system, and not things like ensuring a database table has been created (unless you 'touch' your own sentinel files).
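To illustrate the sentinel-file workaround mentioned above, here is a minimal sketch, assuming a local SQLite database. The function name `ensure_table`, the `events` table, and the marker path are all hypothetical and chosen only for illustration; nothing here is part of DVC or Make.

```python
import sqlite3
from pathlib import Path


def ensure_table(db_path: Path, sentinel: Path) -> bool:
    """Create a table if needed, then touch a sentinel file.

    Returns True if the step ran, False if the sentinel already existed.
    """
    if sentinel.exists():
        # A file-based tool would consider this step up to date.
        return False

    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical table, used only for this example.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events "
            "(id INTEGER PRIMARY KEY, payload TEXT)"
        )
        conn.commit()
    finally:
        conn.close()

    # The marker file is the only thing a file-tracking tool can see.
    sentinel.parent.mkdir(parents=True, exist_ok=True)
    sentinel.touch()
    return True
```

The sentinel only records that the step ran once. If the table is later dropped by hand, the marker goes stale, which is exactly the weakness of tracking files instead of the real state.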

bagavi 3 years ago | |

The alternative tool you are referring to is `Dud` I believe

DVC is the best tool (I found) in spite of being dead slow and complex (trying to do many things).

What alternatives would you recommend?

DougBTX 3 years ago | |

What’s best if parallel step processing is required?

mountainriver 3 years ago | |

Yeah we had a lot of problems with things getting out of sync and we just got tired of it

throwawaybutwhy 3 years ago |

The package phones home. One has to set an env var or fix several lines of code to prevent that.

sva_ 3 years ago | |

I wondered how they'll make money

https://www.crunchbase.com/organization/iterative-ai/company...

nerdponx 3 years ago | | |

I think their plan was/is to make money on corporate licenses and support, as well as SaaS/cloud products.

machinekob 3 years ago | | |

They won't. They can make the investors' money back only by selling the company to Amazon/Microsoft/Google, but in this economy that won't happen.

shcheklein 3 years ago | |

Hey, yes, we've decided to keep it opt-out for now and it collects fully anonymized basic statistics. Here is the full policy: https://dvc.org/doc/user-guide/analytics .

It should be easy to opt out, though, through `dvc config core.analytics false` or an env variable `DVC_ANALYTICS=False`.

Could you please clarify about the `several lines of code`? We were trying to make it very open and visible what we collect (it prints a large message when it starts) + make it easy to disable it.

prepend 3 years ago | | |

This seems pretty anti-user, since most users prefer opt-in. It seems pretty shady to keep in behavior that users don’t like and that potentially harms them (you think it’s fully anonymized).

That’s your prerogative, as it’s your project, but it makes me wonder what else you’re doing that’s against users’ best interest and in your own.

pabs3 3 years ago | |

I wonder what the GDPR implications of this are. I note that other projects (e.g. Cura) switched their telemetry to opt-in.

https://github.com/Ultimaker/Cura/issues/2810

adhocmobility 3 years ago |

If you just want a git for large data files, and your files don't get updated too often (e.g. an ML model deployed in production which gets updated every month), then git-lfs is a nice solution. Bitbucket and GitHub both have support for it.

tomthe 3 years ago |

Can anyone compare this to DataLad [1], which someone introduced to me as "git for data"?

[1]: https://www.datalad.org/

benhurmarcel 3 years ago | |

And what about Dolt?

https://docs.dolthub.com/introduction/what-is-dolt

shcheklein 3 years ago | | |

Dolt is for tabular data. It's like SQLite but with branching and versioning at the DB level. DVC is file-based. It saves large files, directories, etc. to one of the supported storages: S3, GCP, Azure, etc. It's more like git-lfs in that sense.

Another difference is that for DVC (surprisingly) data versioning itself is just one of the main fundamental layers that is needed to provide holistic ML experiments tracking and versioning. So, DVC has a layer to describe an ML project, run it, capture and version inputs/outputs. In that sense DVC becomes a more opinionated / high level tool if that makes sense.

remram 3 years ago | |

Doesn't use git-annex like DataLad. That alone is a huge benefit given the state of that tool.

imiric 3 years ago | | |

I'm curious, what's the problem with git-annex?

I've considered using it before as an alternative to Git LFS.

jefurii 3 years ago | | |

What's wrong with git-annex? My work has been using it for almost 10 years to manage 40TB+ of data. It's always been rock solid.

polemic 3 years ago |

If you're looking for something that actually tracks tabular data there's https://kartproject.org. Geo focused but also works with standard database tables. Built with git (kart repos are git repos), can track PostgreSQL, MSSQL, MySQL etc.

LaserToy 3 years ago |

Can it be used for large and fast changing datasets?

Example: 100 TB, written to every 10 mins.

Or 1 TB of parquet, 40% of which is rewritten daily.

nerdponx 3 years ago | |

DVC is expressly for tracking artifacts that are files on disk, and only by comparing their MD5 hashes. So it can definitely track the parquet files, but you are not going to get row or field diffs or anything like that.
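A minimal sketch of what whole-file hash comparison means in practice. This is not DVC's actual code; `file_md5` and `has_changed` are hypothetical helper names, and the chunk size is an arbitrary assumption.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading it in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_md5: str) -> bool:
    """True if the file's current content differs from the recorded hash."""
    return file_md5(path) != recorded_md5
```

The comparison only says whether a file changed as a whole. Rewriting a single row in a parquet file yields a completely different hash, with no hint about which rows or fields differ.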

Maybe Pachyderm or Dolt would be better tools here.

AlotOfReading 3 years ago | | |

Why would you use MD5 in anything written in the last 5 years? The SHA family is faster on modern hardware and there aren't trivial collisions floating around out there.

snthpy 3 years ago | |

What about Apache Iceberg for those?

smeagull 3 years ago |

I don't think this tool can encompass everything you need in managing ML models and data sets, even if you limit it to versioning data.

I'd need such a tool to manage features, checkpoints and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.

And I'd really like the code to be handled separately from the data. Git is not the place to do this. Because the choice of picking pairs of code and data should happen at a higher level, and be tracked along with the results - that's not going in a repo - MLFlow or Tensorboard handles it better.

davidatbu 3 years ago | |

How do you merge multiple versions of data using tensorboard? Or what other tool handles that for you?

What's the case for handling code and data separately? In my experience, the primary motivation for using such a tool is easy reproducibility through easy tracking of code, hyperparams, and data. It's not obvious to me how that goal would be advanced by tracking code and data separately.

smeagull 3 years ago | | |

Tensorboard doesn't do that, I was referring to things a dataset/model management tool should do. For us, Tensorboard tracks the datasets as hyperparams. The actual multiple versions of data end up being handled on the warehouse side. Prefect is what we use for running those DAGs to make the different versions.

Handling code and data separately is important, to allow easy updates to one or the other. They are loosely coupled to allow quicker updates, rather than having to increment versions on both as with DVC. DVC is also far heavier weight, as it pulls the data referenced in the dvc files, and you have to pick out on the CLI which ones you want.

Downloading as required to a local cache when needed from your actual scripts works much better. It's just like what transformers does for pre-trained models.
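A rough sketch of that download-on-demand pattern, using only the standard library. The name `fetch_cached` and the cache layout (one file per URL, keyed by the SHA-256 of the URL) are assumptions made for this example; this is not how transformers or any particular library implements its cache.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: Path) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache entry by the URL itself (an assumption of this sketch).
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if target.exists():
        return target  # cache hit: no network access

    # Download to a temporary name first so a partial file is never
    # mistaken for a complete cache entry.
    tmp = cache_dir / (key + ".part")
    with urllib.request.urlopen(url) as response, tmp.open("wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

Because the key is the URL rather than the content, the cache will not notice if the remote file changes under the same URL. Putting a version or hash into the URL is one way to keep code and data loosely coupled while still being reproducible.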

bs7280 3 years ago |

What value does this provide that I can't get by versioning my data in partitioned parquet files on s3?

shcheklein 3 years ago | |

I think parquet won't help with images, video, ML models.

Also, it is one thing to physically provide a way to version data (e.g. partitioned parquet files, cloud versioning, etc.), but another to also have a mechanism for saving / codifying the dataset version in the project. E.g., to answer the question of which version of the data a model was built with, you would need to save some identifier / hash / list of the files that were used. DVC takes care of that part as well.

(It also has mechanics to cache data that you download, Makefile-like pipelines, etc.)
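To make "saving some identifier / hash / list of files" concrete, here is a small sketch of a dataset manifest that could be committed next to the code. The function name `build_manifest` and the JSON layout are hypothetical and do not reflect DVC's actual file format.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(data_dir: Path) -> dict:
    """Map each file under data_dir to its SHA-256 and derive a dataset id."""
    files = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(data_dir).as_posix()
        # read_bytes() is fine for a sketch; large files should be streamed.
        files[rel] = hashlib.sha256(path.read_bytes()).hexdigest()

    # The dataset id changes whenever any file is added, removed, or edited.
    canonical = json.dumps(files, sort_keys=True).encode("utf-8")
    return {
        "dataset_id": hashlib.sha256(canonical).hexdigest(),
        "files": files,
    }
```

Writing the result of `build_manifest` to a small JSON file and committing it with the training code is enough to answer "which data was this model built with?" later, as long as the files themselves are kept somewhere retrievable.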