Avoid Git LFS if possible (gregoryszorc.com)
Your repo size could easily balloon to terabytes, for every clone. Additionally, I think there are other performance issues, but I don't allow this to happen, so I'm not sure.
SVN happily handles terabytes, due to the client-server interface. As does LFS. My biggest gripe with LFS is that it turns your distributed tool into a client-server one. I kinda wish they had an easy "skip lfs" type option.
I've been using a decentralised svn workflow at work for so long, I didn't even think of this :)
Do you include git SHAs in your bug tracking system? Or perhaps your department wiki links to a specific commit to document lessons learned? Maybe you're using Sentry and find including the git SHA of the build to be invaluable for troubleshooting?
For some organizations, rewriting history would be a non-event and for others it would be a major disruption.
The git-lfs devs obviously don't use ssh, so you get the feeling they are a bit exasperated by this call to support an industry standard protocol which is widely used as part of ecosystems and workflows involving Git.
LFS has existed for several years, and as far as I know Git still doesn’t have support for large files. At this point I’m not holding out much hope.
If they're just sitting around it's fine, but then why would you have them in version control?
I’m sure that if you know how to use it... maybe... you can figure it out.
That said; here’s my battle story:
Estimate the time it’ll take to move all our repositories from A to B, they said.
Us: with all branches?
Them: just main and develop.
Us: you just clone and push to the new origin, it’s not zero but it’s trivial.
Weeks later...
Yeah. LFS is now banned.
LFS is not a distributed version control system; once you use it, a clone is no longer “as good” as the original, because it refers to an LFS server that is independent of your clone.
...also, actually cloning all the LFS content from GitLab is both slow and occasionally broken in a way that requires you to restart the clone.
:(
git lfs fetch --all
Every modern Git hosting server now has support for LFS directly inside the Git server (GitLab, GitHub, and Gerrit, to my knowledge).
This solves the authentication issue nicely, which makes it easy for developers.
Git LFS is starting to be adopted by vendors and is becoming usable. It solves a real problem when you are tired of having to double your Git server CPUs every 6 months because your git upload packs take a huge amount of time trying to recompress those big files over and over.
Putting blobs into an SCM was always a bad idea, but in git it's particularly bad because the whole tree is always checked out at once. (I think last year a major change was added, that makes blob handling slightly more efficient though)
Still, I think there is still no independently developed git lfs server. I know, most people don't run git servers themselves but in the LFS use case it actually makes sense. There is also git annex which is completely open and free but adoption is very poor and the handling is even more obscure.
If you're not rewriting the files, also yes.
If you don't need the history, put them on a normal web server.
You need server side support, which GitHub and GitLab have, and then a special clone command:
git clone --filter=blob:none
Some background about the feature is here: https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
The blobless clone is going to be ensaddening the next time that I'm examining the history of some source code when I'm hacking away without a network connection.
In the end it's the same as LFS though, in that examining old commits without a network is a bummer. No free lunch here besides something a bit more complex like git-annex.
I'm frankly surprised Git hasn't made LFS an official part by now. It fixes the problem, the problem is common and real, and Git hasn't offered a better alternative.
If LFS was made official it would solve this critique, since that is really the only critique here.
Absolutely not. Having worked with Mercurial LFS and Git LFS, the differences seem subtle but they are there. Basically,
In Mercurial, LFS is (to an extent) an implementation detail of how you check out a repository. It doesn't mean altering the repository contents itself (the data), it just means altering how you get that data. Contrast with Git LFS, where the data itself must be altered in order to become LFS data, and the "LFS flag" is recorded in history.
This is not something that you would solve by upstreaming LFS. You would need to redesign LFS.
You can remove it after the fact if you don't like it, it supports a ton of protocols, and it's distributed just like git is (you can share the files managed by git-annex among different repos or even among different non-git backends such as S3).
The main issue that git-annex does not solve is that, like Git LFS, it's not a part of git proper and it shows in its occasionally clunky integration. By virtue of having more knobs and dials it also potentially has more to learn than Git LFS.
IIRC, it's $5 per 50GB per month? That's really a deal breaker for me, and I wonder whether people who actually use LFS at volume avoid LFS-over-GitHub.
What you describe sounds fair to me. The problem is that 50GB is not a decent fraction of a hard disk.
Hell even uploading to an S3 compatible API was insanely cheaper than Github.
That, and I really hated the feeling that Git LFS was being designed for a server architecture. I didn't have an easy way to locally dump the written files without running an HTTP server.
There are a couple of Git LFS servers that upload to, say, S3, but I really just wanted a dumb FS or SSH dump of my LFS files. Running a localhost server feels so... anti-Git to me.
So definitely not "as much as you want." If you pull it too many times you may get charged another $5.
Not a solution.
This said, the whole git-lfs bit feels like a (bad) afterthought the way it's implemented. I'd love to see some significant reduction of complexity (you shouldn't need to do 'git lfs enable', it should be done automatically), and increases in resiliency (sharding into FEC'ed blocks with distributed checksums, etc.) so we don't have to deal with 'bad' files.
I was a fan of mercurial before I switched to git ... it was IMO an easier/better system at the time (early 2010s). Not likely to switch now though.
I get that a mercurial developer has different preferences but I don’t think that this is an especially effective form of advocacy.
I see so many git repos with READMEs saying download this huge pretrained weights file from {Dropbox link, Google drive link, Baidu link, ...} and I don't think that's a very good user experience compared to LFS.
LFS itself sucks and should be transparent without having to install it, but it's slightly better than downloading stuff from Dropbox or Google Drive.
We had an image on AWS go bad, still not sure how. Our devs lost the ability to pull. Disabling LFS could not be done (because of rewriting history). "disable smudge" is not an official option, and none of the hacks work reliably. We finally excluded all images from smudge, and downloaded them with SFTP. Git status shows all the images as having changed, and we are downright unhappy...
I would be happy to hear that I just don't know how to use LFS, but even if so, that means the docs are woefully unhelpful.
I want to: 1) Tell LFS to get whatever files it can, and just throw a warning on issues. 2) If an image is restored without using LFS, git should still know the file has not been modified (by comparing the checksum or whatever smudge would do).
Note that with Yarn 2 you're committing .tar.gz's of packages rather than the JS files themselves, so it lends itself quite well to LFS as there are a smaller number of large files.
https://yarnpkg.com/features/zero-installs#how-do-you-reach-... https://yarnpkg.com/features/zero-installs#is-it-different-f...
Sure, LFS contaminates a repository, but so do large files, sensitive data removal, and references to packages and package managers that might become obsolete or non-existent in the future. The chances of your project compiling after 15 years (the age of git, by the way) are very slim, and the chances of an entirely compilable history being useful are even slimmer.
And I think the author's statement about setting up LFS being hard is exaggerated. It's a handful of command lines that should be in the "welcome to our company" manual anyway.
I've used LFS in the past and while it can be misused, as with all other tools, it does the job without too many headaches compared to submodules and ignored tracked files.
1. Type information. Enough to synthesize a fake example.
2. A simple preview. This can be a thumb or video snippet, for example.
3. Checksum and URL of the big file.
This way your code can work at compile/test time using the snippet or synthesized data, and you can fetch the actual big data at ship time.
You can then also use the best version control tool for the job for the particular big files in question.
- You have file type and preview that you can use without getting the full thing
- You have a custom metadata for each file enforced by your scripts -- for example for archives, you may store the list of files inside. This will allow your CI tests to validate the references into the files without having to download the whole huge thing.
- You fully control remote fetch logic. Multiple servers? Migration rules for old revisions? That weird auth scheme that your IT insists on? It is all supported with a bit of code.
- You fully control local storage. Do you want a computer-wide shared CAS cache between multiple users? What if you have NAS that most users mount? Or maybe s3fs is your thing? Adding support is easy.
The main downside is that you get to do all the tooling and documentation, so I would not recommend this for smaller teams. Nor would I recommend this for open-source projects.
But if your infra team is big enough to support this, you'll definitely have a better experience than with generic Git LFS.
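As a rough illustration of the pointer-file idea above, here is a minimal Python sketch. Everything in it is hypothetical: the JSON layout, the field names (`media_type`, `preview`, `sha256`, `url`, `size`) and the `make_pointer` helper are assumptions made up for this example, not part of any existing tool. The `size` field is an extra that was not in the list above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BigFilePointer:
    """Small, committable stand-in for a large file that lives elsewhere."""

    media_type: str  # type information, enough to synthesize a fake example
    preview: str     # path to a thumbnail or short snippet kept in the repo
    sha256: str      # checksum of the real file
    url: str         # where the real file can be fetched from
    size: int        # size of the real file in bytes (extra, assumed field)

    def to_json(self) -> str:
        """Serialize to the text that would be committed."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "BigFilePointer":
        """Rebuild a pointer from committed text."""
        return cls(**json.loads(text))

    def verify(self, data: bytes) -> bool:
        """Check fetched bytes against the recorded size and checksum."""
        return (
            len(data) == self.size
            and hashlib.sha256(data).hexdigest() == self.sha256
        )


def make_pointer(data: bytes, media_type: str, preview: str, url: str) -> BigFilePointer:
    """Create the pointer for a big file whose bytes are available locally."""
    digest = hashlib.sha256(data).hexdigest()
    return BigFilePointer(media_type, preview, digest, url, len(data))
```

The fetch logic is deliberately left out, since that is exactly the part each team would customize; a fetcher would download from `url` (or a mirror) and call `verify` before placing the file in its local cache.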
Details: I wanted to have a remote I can push to but anonymous users can only pull from, couldn't piece it together.
ssh user@rsync.net git clone blah
... will properly handle LFS assets, etc.
This is in response to several requests we have had for this feature...
I'd title it: "Why Mercurial is better than git+LFS"
> Git on Windows client corrupts files > 4Gb
It's apparently an upstream issue with Git on Windows, but if you depend on something, you inherit its issues.
It just adds complication for a limit that shouldn't be there anyway.
This seemed to me a sensible approach as Artifactory is a repository for binaries (usually, the compiled output of a project). It also seemed to me that the decisions on which versions to retain and when an update to a binary is expected or when that resource is now frozen and a replacement would be a new version is similar to the decision on when a build is a snapshot vs a release.
> Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
Build the real thing then..
> Since I'm a maintainer of the Mercurial version control tool,
Not to mention, many users are paying for a service that provides LFS, and hosting an LFS service isn't crazy hard. It's a file server with a custom API, it's mostly doable using S3 as a backend. It's not like this is crazy complicated stuff.
https://bryan-murdock.blogspot.com/2020/03/git-annex-is-grea...
This means there were no read-only operations: you just want some files, and that throwaway clone or CI machine would get recorded into the global repo state. If you are not careful, it will be propagated forever and will appear in the various reports.
The "one-way door" as I understand the article to be describing is talking about the additional layer of centralization that Git LFS brings. In particular it's pretty annoying to have to always spin up a full HTTPS server just to be able to have access to your files. There is now always a source of truth that is inconvenient to work around when you might still have the files lying around on a bunch of different hard drives or USB drives.
Whereas with git-annex, it is true that without rewriting history, even if you disable git-annex moving forward, you'll still have symlinks in your git history. However, as long as you still have your exact binary files sitting around somewhere, you can always import them back on the fly, so e.g. to move away from git-annex you can just commit the binary files directly to your git directory and then just copy them out to a separate folder whenever you go back to an old commit and re-import them.
But perhaps I'm interpreting the author incorrectly, in which case it's hard for me to see how any solution for large files in git would allow you to move back without rewriting history to an ordinary git repository without large file support.
edit: The article claims it's a "one-way door" because you can't move to an altogether different system without rewriting history, which is true of git-annex. My bad.
I think I'll stick to LFS.
Partial clones (https://docs.gitlab.com/ee/topics/git/partial_clone.html)
Shallow clones (see the --depth argument: https://linux.die.net/man/1/git-clone)
The problem with large files is not so much that putting a 1Gb file in Git is a problem. If you just have one revision of it, you get a 1Gb repo, and things run at a reasonable speed. The problem is when you have 10 revisions of the 1Gb file and you end up dealing with 10Gb of data when you only want one, because the default git clone model is to give you the full history of everything since the beginning of time. This is fine for (compressible) text files, less fine for large binary blobs.
Git-lfs is a hack and it has caused me pain every time I've used it, despite Gitlab having good support for it. Some of this is more implementation detail - the command line UI has some weirdness to it, there's no clear error if someone doesn't have git-lfs when cloning and so something in your build process down the line breaks with a weird error because you've got a marker file instead of the expected binary blob. Some of it is inherent though - the hardest problem is that we now can't easily mirror the git repo from our internal gitlab to the client's gitlab because the config has to hold the http server address with the blobs in. We have workarounds but they're not fun.
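One way to turn the "marker file instead of the expected binary blob" failure into a clear error is to check for it early in the build. A minimal Python sketch, assuming the standard LFS pointer layout (a `version https://git-lfs.github.com/spec/v1` line followed by an `oid sha256:` line); the function names are hypothetical helpers, not part of git-lfs:

```python
import re

# First line of a Git LFS pointer file.
LFS_VERSION_LINE = b"version https://git-lfs.github.com/spec/v1\n"
# The object id line: "oid sha256:" followed by 64 lowercase hex digits.
OID_RE = re.compile(rb"^oid sha256:[0-9a-f]{64}$", re.MULTILINE)


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the first bytes of a file look like a Git LFS pointer."""
    return head.startswith(LFS_VERSION_LINE) and OID_RE.search(head) is not None


def assert_real_blob(path: str) -> None:
    """Fail loudly if `path` still holds a pointer instead of the real content."""
    with open(path, "rb") as fh:
        head = fh.read(1024)  # pointer files are tiny text files
    if looks_like_lfs_pointer(head):
        raise RuntimeError(
            f"{path} is a Git LFS pointer, not the real file; "
            "is git-lfs installed and were the objects fetched?"
        )
```

Calling `assert_real_blob` on a few known binary assets at the start of a build would surface the problem where it originates rather than as a confusing decoder error later.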
The solution is to get over the 'always have the whole repository' thing. This is also useful for massive monorepos because you can clone and checkout just the subfolder you need and not all of everything.
I say this, but I haven't yet used partial clones in anger (unlike git-lfs). I have high hopes though, and it's a feature in early days.
It doesn't fix all of the problems with LFS, but it helps a lot with some of them (and happens to also be a decent Make replacement in certain situations).
[0]: https://dvc.org/
Your option is basically Git LFS, possibly also VFSForGit, or putting your large files in separate storage.
To me, most cases of large files in VCS seem like using a hammer as a screwdriver.
The gitattributes file provides a version-controlled and review-controlled mechanism to decide exactly which objects get special treatment and which ones don't. Since it's a part of the repository itself, you don't have to remind new developers to specify some unusual arguments to git at clone time to avoid a performance disaster.
> In the end it's the same as LFS though, in that examining old commits without a network is a bummer.
Except for a crucial detail: The tools I use to examine history are the log, diff[tool], and blame. All of those tools continue to function normally on an LFS-enabled offline clone. IIUC, `--filter=blob:none` doesn't work at all, and `--filter=blob:limit=256k` is a proxy which almost, but doesn't quite work.
If you for some reason require redundancy of a package repo, then host your own.
While I was writing it I found the basic process of large file storage to be insanely simple with Git. I debated doing the same thing but backed by a seasoned backup solution, like Borg/Bup/etc.
Currently I'm using an artifactory for this, but it would be much nicer if this could be integrated.
Lots of people want to use document databases as if they were relational, lots of people want to use their RDBMS as a file server and lots of people use spreadsheets for just about everything.
Lots of people wanting to use a product in a certain way doesn't mean it's a good idea, nor that someone else should make it work for them that way.
LFS belongs in your build scripts; the model the git extension uses doesn't even match the git VCS model.
* Initial setup includes git filter rules so that "git add" automatically uses git-fat for matching files (no need to remember to invoke git-fat when adding/changing files).
* It works by rsync'ing to/from the remote. The setup for this is in a single ".gitfat" file, separate from the filter rules.
* You do need to run "git fat push" and "git fat pull"; this can probably be automated with hooks.
So just offhand without even trying to think about the "right" way to do what you want, the committed ".gitfat" could be to a read-only remote, then you can swap it with your own un-committed file for a push that has an rsync-writeable remote.
Also, the whole thing is a single 628-line python file, so worst case it would be easy to tweak it to read something like ".gitfat-push" and not have to manually swap it.
Exactly. Here's an (anonymized) example of a git-annex symlink from one of my repos:
../../.git/annex/objects/AA/BB/SHA256-s123456--abcdf...1234/SHA256-s8968192--abcdf...1234
It's just a link to a file with a SHA256 hash in the name and path. The simplest way to reconstruct that in the future is to just check-in the whole `objects` directory into the repo, and copy/symlink it back to `.git/annex` when needed. You definitely don't need the git-annex software itself to view the data in the future.
I personally have hundreds of gigabytes of data in git-annex repos. It works great!
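To illustrate that the link alone is enough to identify the content, here is a small Python sketch that pulls the size and hash out of a link target and checks a candidate file against them. It assumes keys of the plain `SHA256-s<size>--<hex digest>` shape shown in the example above; other git-annex key flavours are not handled, and the function names are made up for this example.

```python
import hashlib
import re

# Matches keys like "SHA256-s8968192--<64 hex digits>" (assumed key shape).
KEY_RE = re.compile(r"^SHA256-s(?P<size>\d+)--(?P<digest>[0-9a-f]{64})$")


def parse_annex_key(link_target: str) -> tuple[int, str]:
    """Extract (size, sha256 hex digest) from the last component of a link target."""
    key = link_target.rsplit("/", 1)[-1]
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a plain SHA256 key: {key!r}")
    return int(match.group("size")), match.group("digest")


def matches_key(link_target: str, data: bytes) -> bool:
    """Check whether `data` is the content the symlink refers to."""
    size, digest = parse_annex_key(link_target)
    return len(data) == size and hashlib.sha256(data).hexdigest() == digest
```

With something like this, a pile of loose files on an old drive can be matched back to the symlinks in history without the git-annex software itself.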
I’ve used largefiles and ran into these issues and ended up having to turn it off after a few years because it’s so problematic with the tooling since it modifies the underlying mercurial commit structure like git lfs.
However it sounds like mercurial lfs is different in that it only modifies the transport layer, though I’m not totally clear on the details and have been meaning to look into it further.
However, my impression is that in fact largefiles is basically the only game in town and Mercurial LFS if anything is meant to be even more like Git LFS to the point of being compatible with it.
The thing I'm more curious about is I don't immediately see how large file support in git (or mercurial), whether implemented as a separate tool or natively, could ever feasibly be "transparently erasable," that is rewindable back to be absolutely identical to a repository with no large files support without rewriting revision history.
It doesn't seem impossible (e.g. maybe you could somehow maintain a duplicate shadow revision history and transparently intercept syscalls?), but the approaches I can think of all have pretty hefty downsides and feel even more like hacks than the current crop of tools.
LRzip happens to have such a format preprocessor that would make for exceedingly efficient binary history at cost of being more similar to git pack file than incremental versions.
Then again, GitHub in particular sets a very low limit on binary size in version control.
http://jojodiff.sourceforge.net/
https://github.com/janjongboom/janpatch
Rsync is also very popular, even if not that efficient. xdelta, bsdiff, BDelta, bdiff are all crap.
Tracking changes of binaries makes a lot of sense if you use that to only store incremental changes to the file. Git stores each modification of a binary file as a separate blob since it doesn't know how to track its changes.
This is mitigated in large part by the compression applied in git-gc; after packing, objects went from 196 MB to 108 MB.
In our project it helped dramatically, as you only pull X MB instead of X * Y MB when a CI job or developer clones the (already big) repo.
Note that this can now be accomplished with Git directly, by using --filter=blob:none when you clone; this will cause Git to basically lazy-load blobs (i.e. file contents) by only downloading blobs from the server when necessary (i.e. when checking out, when doing a diff, etc).
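For scripted setups (CI, onboarding tools), the same clone can be driven from Python. This is only a thin sketch around the `--filter` flag mentioned in this thread; the helper names and the example URL are hypothetical, and the server has to support partial clone for the filter to have any effect.

```python
import subprocess


def partial_clone_args(url: str, dest: str, blob_filter: str = "blob:none") -> list[str]:
    """Build the git argument list for a partial clone with the given filter."""
    return ["git", "clone", f"--filter={blob_filter}", url, dest]


def partial_clone(url: str, dest: str, blob_filter: str = "blob:none") -> None:
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(partial_clone_args(url, dest, blob_filter), check=True)
```

Passing `blob_filter="blob:limit=256k"` instead keeps small blobs local and defers only the large ones, which is the variant discussed elsewhere in the thread.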
If you were to add the $1.15 storage cost onto a reasonable bandwidth number like $0.01/GB you'd just about reach a third of $5.
I think if you introspect you'll see you are defending a flaw in something you like with spurious technical objections because you don't want to admit it isn't perfect. That's understandable. Happens a lot.
How is it supposed to work? The LFS way where you're storing a pointer to an external http resource? Just put a script in your repo to fetch that then.
Or maybe stick the data in git's merkle tree and have a really slow repo? Why bother with LFS then?
One obvious improvement would be for Git to use the hash of the object rather than the pointer file when calculating tree hashes. That makes the storage method for the actual files independent of the commit hashes.
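To make the point concrete: tree hashes are built from blob IDs, and a blob ID is a hash over a short header plus the file bytes. A minimal Python sketch (assuming a SHA-1 repository and the standard LFS pointer layout; the function names are just for illustration) shows that committing the pointer and committing the real content give different blob IDs, which is why the storage method leaks into commit hashes today.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob with this content (SHA-1 repositories)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def lfs_pointer(content: bytes) -> bytes:
    """Text pointer that Git LFS commits in place of the real content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        b"version https://git-lfs.github.com/spec/v1\n"
        + b"oid sha256:" + oid.encode("ascii") + b"\n"
        + b"size %d\n" % len(content)
    )


big_file = b"\x00" * 4096
# Same logical file, two different blob IDs depending on how it is stored.
print(git_blob_id(big_file))
print(git_blob_id(lfs_pointer(big_file)))
```

Under the proposal above, the tree would record the first ID either way, so moving a file in or out of external storage would not change any commit hash.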
People have mentioned several other VCSs in this thread that do it already.
If you store 50GB in AWS S3 (US-East-2), download 1000GB, do 100 PUT operations and 1000 GET operations, the cost is $89.68 per month.
Considering that GitHub isn't just providing you with storage, but a complete Git LFS solution plus storage, plus traffic that you can just use and not think about, I think it's worth the expense. But then again I probably wouldn't store binary blobs in Git.
One is that amazon has an enormous markup on bandwidth, compared to their other products.
The other is that GitHub does not actually let you download each file 20x in a month and "not think about" it. 50GB of space for a month only gets you 50GB of bandwidth.
If Amazon didn't explicitly ban people from using Lightsail bandwidth with other services, you could put together an all-AWS package that has 150GB of high quality S3 storage and enough bandwidth to download it 2-3x per $5 (minimum order quantity 2). For a service like B2 you could store 250GB twice (each copy having its own cross-server RAID) and download it once for $5. At digitalocean $5 will get you 250GB of probably-redundant data with 1TB of bandwidth, though it eventually tapers off toward 167GB/$5.
I need the data for my team, so we pay, but I use it as an excuse to force the team to clean up data every so often. As a game company we need large repos. It was either Gitlab or Azure DevOps.
It's quite easy to burn 1000G of storage on GitHub/GitLab (again, don't forget all the revisions). That puts just the storage cost at $6000/year. At this price point, it's really worth hosting on your own.
As Danieru mentioned, they are forcing people to do manual cleanups, which probably indicates the storage costs are high enough to be worth the manual intervention.
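The $6000/year figure can be reproduced with a few lines of Python. The sketch below assumes storage is sold in fixed-size monthly packs and ignores bandwidth entirely. The $5 per 50GB rate is the GitHub figure quoted elsewhere in this thread; the $5 per 10GB rate is an assumption inferred from the "5x as much" remark about GitLab and from the $6000 number itself.

```python
import math


def annual_pack_cost(total_gb: float, pack_gb: float, pack_price_usd: float) -> float:
    """Yearly storage cost when capacity is sold in fixed-size monthly packs."""
    packs = math.ceil(total_gb / pack_gb)  # partial packs are billed as whole packs
    return packs * pack_price_usd * 12


# $5 per 50GB per month (GitHub rate quoted in the thread).
github_style = annual_pack_cost(1000, 50, 5)
# Assumed: $5 per 10GB per month, i.e. five times the rate above.
gitlab_style = annual_pack_cost(1000, 10, 5)
print(github_style, gitlab_style)
```

For 1000GB that works out to $1200 a year at the first rate and $6000 a year at the second, or roughly $1.20 versus $6 per gigabyte per year.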
Am I happy paying 6 dollars a year for a single gigabyte? I'd rather pay less.
Edit: I pay attention to costs, and repo size is the sort of thing which has a habit of growing. All else being equal I'd prefer my team receive the money. A dollar I can give to my team/employees feels good, a dollar paying for overpriced storage feels bad.
You can't diff, and I'm not convinced the VCS should carry the burden of version controlling assets. Seems better to have a separate dedicated system for such purposes.
Then again I don't do game development so I'm not familiar with the requirements of such projects
I know that a small game studio must have a tight budget — and yours looks like a really interesting project (just subscribed) — but it seemed like an awfully strong objection to what I would have assumed would be a small fraction of your total expenses.
> it doesn't seem terribly far off of S3's pricing
GitHub's offering is close to S3 but only because AWS charges so much for bandwidth. The storage portion is less than a quarter of the equivalent bill.
And then GitLab is charging 5x as much as GitHub.