Data Version Control (dvc.org)
It gets super slow (waiting minutes) when there are a few thousand files tracked. Thousands of files have to be tracked if you have e.g. a 10GB file per day and region, plus the artifacts generated from it.
You are encouraged (it only can track artifacts) if you model your pipeline in DVC (think like make). However, it cannot run tasks in parallel. So it takes a lot of time to run a pipeline while you are on a beefy machine and only one core is used. Obviously, you cannot run other tools (e.g. snakemake) to distribute/parallelize on multiple machines. Running one (part of a) stage also has some overhead, because it does commits/checks after/before running the executable of the task.
Sometimes you get merge conflicts if you manually run one part of a (partially parameterized) stage on one machine and the other part on another machine. These are cumbersome to fix.
Currently, I think they are more focused on ML features like experiment tracking (I prefer other mature tools here) instead of performance and data safety.
There is an alternative implementation from a single developer (I cannot find it right now) that fixes some problems. However, I do not use it because it probably will not have the same development progress and testing as DVC.
This sounds negative, but I think it is currently one of the best tools in this space.
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...
Encouraged to do what?
You might want to slow down on the use of parentheses, we are both getting lost in them.
However, it's fantastic for tracking artifacts throughout a project that have been generated by other means, for keeping those artifacts tightly in sync with Git, and for making it easy to share those artifacts without forcing people to re-run expensive pipelines.
However, I agree that in general it's best for smaller projects and use cases. For example, it still shares the primary deficiency of Make in that it can only track files on the file system, and not things like ensuring a database table has been created (unless you 'touch' your own sentinel files).
DVC is the best tool (I found) in spite of being dead slow and complex (trying to do many things).
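The sentinel-file workaround mentioned above could look roughly like this. It is a minimal sketch in Python that uses SQLite as a stand-in for the database; the function name `ensure_table`, the `events` table, and the sentinel path are all hypothetical and not part of DVC or Make.

```python
import sqlite3
from pathlib import Path


def ensure_table(db_path: str, sentinel: str) -> bool:
    """Create a table, then touch a sentinel file that a file-based tool can track.

    Returns True if the work was done on this call, False if the sentinel
    already existed and the step was skipped.
    """
    marker = Path(sentinel)
    if marker.exists():
        return False  # a Make-like tool would treat this step as up to date

    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical table, used only for illustration.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        conn.commit()
    finally:
        conn.close()

    # Touch the sentinel only after the database step has succeeded.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

The sentinel file is what you would then declare as the stage output, so the pipeline tool has something on the file system to check. The obvious weakness is that the file can drift out of sync with the real state of the database.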
What alternatives would you recommend?
https://www.crunchbase.com/organization/iterative-ai/company...
It should be easy to opt out, though: use `dvc config core.analytics false` or an env variable `DVC_ANALYTICS=False`.
Could you please clarify about the `several lines of code`? We were trying to make it very open and visible what we collect (it prints a large message when it starts) + make it easy to disable it.
That’s your prerogative as it’s your project, but it makes me wonder what else you’re doing that’s against users’ best interest and in your own.
Another difference is that for DVC (surprisingly) data versioning itself is just one of the main fundamental layers that is needed to provide holistic ML experiments tracking and versioning. So, DVC has a layer to describe an ML project, run it, capture and version inputs/outputs. In that sense DVC becomes a more opinionated / high level tool if that makes sense.
Example: 100 TB, writes every 10 mins.
Or 1 TB of Parquet, 40% of which is rewritten daily.
Maybe Pachyderm or Dolt would be better tools here.
I'd need such a tool to manage features, checkpoints and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.
And I'd really like the code to be handled separately from the data. Git is not the place to do this. Because the choice of picking pairs of code and data should happen at a higher level, and be tracked along with the results - that's not going in a repo - MLFlow or Tensorboard handles it better.
What's the case for handling code and data separately? In my experience, the primary motivation for using such a tool is easy reproducibility through easy tracking of code, hyperparams, and data. It's not obvious to me how that goal would be advanced by tracking code and data separately.
Handling code and data separately is important, to allow easy updates to one or the other. They are loosely coupled to allow quicker updates, rather than having to increment versions on both as per DVC, and DVC is far heavier weight as it pulls the data referenced in the dvc files, and you have to pick out on the CLI which ones you want.
Downloading as required to a local cache when needed from your actual scripts works much better. It's just like what transformers does for pre-trained models.
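As a rough illustration of the "download on demand into a local cache" pattern described above, here is a minimal Python sketch. The function name `cached_path` and the default cache directory are hypothetical; this is not how transformers or DVC implement caching, only the general idea.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def cached_path(url: str, cache_dir: str = "~/.cache/my_datasets") -> Path:
    """Return a local path for `url`, downloading it only on a cache miss."""
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)

    # Key the entry by a hash of the URL so any URL maps to a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache / key
    if target.exists():
        return target  # cache hit: no network access

    # Download under a temporary name, then rename, so an interrupted
    # download never leaves a half-written file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

A training script would call `cached_path(...)` wherever it needs the file. Note that the cache is keyed by URL, not by content, so this sketch assumes the URL itself identifies a fixed version of the data.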
Also, one thing is to physically provide a way to version data (e.g. partitioned parquet files, cloud versioning, etc, etc), but another one is to also have a mechanism of saving / codifying dataset version into the project. E.g. to answer the question which version of data this model was built with you would need to save some identifier / hash / list of files that were used. DVC takes care of that part as well.
(it has mechanics to cache data that you download, make-file like pipelines, etc)
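To make the idea of "saving some identifier / hash / list of files" concrete, here is a minimal Python sketch that records a per-file checksum and a single dataset identifier. The names `file_sha256` and `dataset_manifest`, the choice of SHA-256, and the JSON layout are assumptions for illustration; this is not DVC's on-disk format.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_manifest(root: str) -> dict:
    """Map every file under `root` to its checksum and derive one dataset id."""
    base = Path(root)
    files = {
        p.relative_to(base).as_posix(): file_sha256(p)
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }
    # The dataset id is a hash over the sorted (path, checksum) pairs, so it
    # changes whenever a file is added, removed, renamed, or modified.
    canonical = json.dumps(files, sort_keys=True).encode("utf-8")
    return {"files": files, "dataset_id": hashlib.sha256(canonical).hexdigest()}
```

Committing the resulting manifest (or just the `dataset_id`) next to the training code is one way to answer "which version of the data was this model built with" later on.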
That, combined with the nature of re-using the same filename for the metadata files, meant that it was common for folks to commit the binary and push it. Again, lots of history rewriting to get git sizes back down.
Maybe there exist solutions to my problems but I had spent hours wrestling with it trying to fix these bad states, and it caused me much distress.
Also configuring the backing store was generally more painful, especially if you needed >2GB.
DVC was easy to use from the first moment. The separate meta files meant that it can't get into mixed clean/smudge states. If you aren't in a cloud workflow already, the backing store was a bit tricky, but even without AWS I made it work.
1. All git-lfs files are kept in the same folder
2. No one can directly push commits to one of the main branches; they need to raise a PR. This means that commits go through review and it's easy to tell if they've accidentally committed a binary, and we can just delete their branch from the remote, bringing the size back down.
For example, I think CodeOcean might use git-lfs under the hood but handles upload and download separately from the UI. In the below sample, you can clone the repo from the Capsule menu but data and results are downloadable from a contextual menu available from each, respectively.
Ideally I'd love to use git-lfs on top of S3, directly. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to trust them with long-term data storage.
Huggingface datasets are built on git-lfs and it works really well for them for storage of large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well funded or supported community effort to do the same, and in a performant way.
If you know of any such solution, please let me know!
It comes with a smart versioning approach, checks the Δ based on the checksum and has a feature to visualize the lineage.
You can also use your existing object store and link it for very large / sensitive data.[2]
Disclaimer: I work at W&B.
[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model... [2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...
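Checksum-based change detection of the kind mentioned above can be sketched generically. This is a minimal Python illustration, not W&B's implementation; the function name `manifest_delta` and the `{path: checksum}` input shape are assumptions.

```python
def manifest_delta(old: dict, new: dict) -> dict:
    """Compare two {path: checksum} mappings and report what changed."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        # Present in both versions, but the content checksum differs.
        "changed": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
    }
```

Only the paths listed under `added` and `changed` would need to be uploaded for a new version; everything else can be referenced from the previous one.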
Thinking more abstractly, there is benefit for code and data to live "next" to each other, if possible. Atomically committed to a codebase and the latter loaded / used by the former without connecting to yet another workflow.
> Tensorboard tracks the datasets as hyperparams.
Clever!
> Warehouse side .. Prefect
I'll have to check out warehouse-side things and Prefect to see what you mean.
Appreciate all the pointers!
There's an ongoing discussion about replacing/configuring the hash function, and it looks like there might be some movement toward replacing the hash and other speedups in 3.0
https://github.com/iterative/dvc/issues/3069
> We not only want to switch to a different algorithm in 3.0, but to also provide better performance/ui/architecture/ecosystem for data management, and all of that while not seizing releases with new features (experiements, dvc machine, plots, etc) and bug fixes for 2.0, so we've been gradually rebuilding that and will likely be ready for 3.0 in the upcoming months. - https://github.com/iterative/dvc/issues/3069#issuecomment-93...
Would love your feedback on what's missing there! We've been improving it lately - e.g.
- Hydra support https://dvc.org/doc/user-guide/experiment-management/hydra
- VS Code extension - https://marketplace.visualstudio.com/items?itemName=Iterativ...
Ideally I'd like the tool I use for data versioning (DVC/git-lfs/git-annex) to be orthogonal to that which I use for hyperparameter sweeping (DVC/Optuna/SageMaker experiments), orthogonal to that which I use for configuration management (DVC/Hydra/plain YAML), and orthogonal to that which I use for experimental DAG management (DVC/Makefile).
Optuna is becoming very popular in the data-science/deep learning ecosystem at the moment. It would be great to see more composable tools, rather than having to opt all-in into a given ecosystem.
Love the work that DVC is doing to tackle these difficult problems, though!
We've tried to make it as open as possible - code is available (it's open source), we write openly about this at the very start, we have a policy online, made it easy to opt-out. If you have other ideas how to make it even more friendly, more visible, etc - let us know please.
Still, we've preferred so far to keep it opt-out since it's crucial for us to see major product trends (which features are being used more, product growth MoM etc). Opt-in at this stage realistically won't give us this information.
I think the challenge I have is that since you’re getting IP address that will be an opportunity to abuse. And there seems to be some rule that any data that can be misused will eventually be misused.
Since you’re not willing to make it opt-in, I think perhaps the only other way would be to support an automated distro that doesn’t include it so users are at least able to easily choose a version.
I admire you for responding to this thread and me as it’s definitely not easy. I just feel like one of the main benefits of open source is its alignment with user benefits so it’s discouraging when an open source project chooses code that users don’t want.
https://docs.brew.sh/Analytics https://docs.npmjs.com/policies/privacy#how-does-npm-collect... VS Code, etc
> I think the challenge I have is that since you’re getting IP address that will be an opportunity to abuse.
Yes! And we are migrating to the new package / infrastructure because of this - https://github.com/iterative/telemetry-python (DVC's sister tool MLEM is already on it and it's not sending (saving) IP addresses, nor using GA or any other third-party tools, data is saved into BigQuery and eventually we'll make publicly accessible - https://mlem.ai/doc/user-guide/analytics to be fully GDPR compatible). It's a legacy system that DVC had in place. There was no intention to use those IP addresses in some way.
> I think perhaps the only other way would be to support an automated distro that doesn’t include it so users are at least able to easily choose a version.
Thanks. To some extent brew-like policy (not sending anything significant before there is a chance to disable it and there is clear explicit message) should be mitigating this, but I'll check if it works this way now and if it can be improved.
* git diff doesn't work in any sensible way
* if you forget and do `git add` instead of `git annex add`, everything is fine, but you've now spoilt the nice thing that git annex does of de-duping files. (git annex only stores one copy of identical files)
* for our use case (which I'm sure is the wrong way of doing things) it's possible to overwrite the single copy of a file that git annex stores, which rather spoils the point of the thing. I do think it's down to the way we use it, though, so not specifically a git annex problem
The _great_ thing about git annex is it can be self-hosted. For various reasons we can't put our source data in one of the systems that uses git-lfs.
We've got about 800 GB of data in git annex and I've been happy with it despite the limitations.
git annex config --set annex.largefiles 'largerthan=1kb and not (mimeencoding=us-ascii or mimeencoding=utf-8)'
> By default, git-annex add adds all files to the annex (except dotfiles), and git add adds files to git (unless they were added to the annex previously). When annex.largefiles is configured, both git annex add and git add will add matching large files to the annex, and the other files to git. —https://git-annex.branchable.com/git-annex/
Note that git add will add large files unlocked, though, since (as far as I understand) it’s assumed you’re still modifying them for safety:
> If you use git add to add a file to the annex, it will be added in unlocked form from the beginning. This allows workflows where a file starts out unlocked, is modified as necessary, and is locked once it reaches its final version. —https://git-annex.branchable.com/git-annex-unlock/
Huggingface uses git-lfs for large datasets with good success. git-lfs on GitHub gets very pricey at higher volumes of data. Would love the affordability of object storage, just with a better git blob storage interface, that will be around in the future.
Most of these systems do their own hash calculations and are not interchangeable with each other. I feel like git-lfs has the momentum in data science at the moment, but needs some better options for people who want a low cost storage option that they can control.
Huggingface is great, but it's one more service to onboard if you're in an enterprise. And data privacy/retention/governance means that many people would like their data to reside on their own infrastructure.
If AWS were to give us a low cost git-lfs hosted service on top of S3 it would be very popular.
If anyone knows of some good alternatives, please let us know!
When does it use hard links? As far as I remember it used symlinks unless you used something like annex.hardlink (described in the man page: https://git-annex.branchable.com/git-annex/)
Well, anything stored by git-annex has read-only file permissions. Apps will follow the symlink, yes, but they will fail to write to the location if they try.
> The way the "check out" feature works is also weird, causing a change in the shared version history.
Unlocking a file changes it from a symlink to a git-annex pointer file from git’s perspective (git-annex accomplishes this via git’s smudge filter interface), but you don’t have to commit the unlock. You can unlock, modify locally, re-lock, and commit the new changed version in one go. It’s nice that you can commit the unlocking action itself if you want a file to be unlocked in all clones of the repository. You can choose whether to commit the unlock depending on if it fits your use case.
For curious readers, https://git-annex.branchable.com/tips/unlocked_files/ discusses these topics in more detail.
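To make the symlink, read-only, and de-duplication behaviour discussed in this thread concrete, here is a minimal Python sketch of a content-addressed store. It is a simplification for illustration only, not git-annex's actual layout or key format; the function name `annex_like_add` and the use of SHA-256 as the key are assumptions, and it expects a POSIX file system with symlink support.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def annex_like_add(path: str, store_dir: str) -> Path:
    """Move a file into a content-addressed store and leave a symlink behind."""
    src = Path(path)
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)

    # The key depends only on the content, so identical files share one copy.
    key = hashlib.sha256(src.read_bytes()).hexdigest()
    stored = store / key
    if stored.exists():
        src.unlink()  # content is already stored; drop the duplicate
    else:
        shutil.move(str(src), str(stored))
        # Clear all write bits so the single stored copy is read-only.
        stored.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    # Replace the original path with a symlink into the store.
    src.symlink_to(stored.resolve())
    return stored
```

Reads through the symlink work as usual, while an application that tries to write through it hits the read-only permission on the stored file. That is the behaviour the "unlock" workflow exists to relax.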
One in C# (with support for auth)
https://github.com/alanedwardes/Estranged.Lfs
One in Rust (but no Auth, have to run reverse proxy)
https://github.com/jasonwhite/rudolfs
Both seem interesting. Anyone use these?