Avoid Git LFS if possible (gregoryszorc.com)

143 points by reimbar 5 years ago | 140 comments

asymptosis 5 years ago |

Something missing from the list of problems: Git LFS is an HTTP(S) protocol, so it is problematic at best when you are using Git over SSH[1].

The git-lfs devs obviously don't use ssh, so you get the feeling they are a bit exasperated by this call to support an industry standard protocol which is widely used as part of ecosystems and workflows involving Git.

[1] https://github.com/git-lfs/git-lfs/issues/1044

alkonaut 5 years ago | |

That issue has come a long way though, already a draft PR! This seems like it could actually happen.

Aeolun 5 years ago |

Did he really just try to make the argument that we shouldn’t use LFS because Git will have large file support at some unspecified point in the future?

LFS has existed for several years, and as far as I know Git still doesn’t have support for large files. At this point I’m not holding out much hope.

cerved 5 years ago | |

git supports large files, it just can't track changes in binary files efficiently and if they're large you check in a new blob every modification.

If they're just sitting around it's fine, but then why would you have them in VC

pooya13 5 years ago | | |

It tracks the changes fine. It’s just that it doesn’t make sense to track changes in a binary.

wokwokwok 5 years ago |

I despise LFS.

I’m sure that if you know how to use it... maybe... you can figure it out.

That said; here’s my battle story:

Estimate the time it’ll take to move all our repositories from a to b they said.

Us: with all branches?

Them: just main and develop.

Us: you just clone and push to the new origin, it’s not zero but it’s trivial.

Weeks later...

Yeah. LFS is now banned.

LFS is not a distributed version control system; once you use it, a clone is no longer “as good” as the original, because it refers to a LFS server that is independent of your clone.

...also, actually cloning all the LFS content from GitLab is both slow and occasionally broken in a way that requires you to restart the clone.

iab 5 years ago | |

I would rather maintain a handwritten journal of 1s and 0s than use git LFS again

toomanyducks 5 years ago | | |

honestly had no idea what git LFS was before seeing this, but okay then

tardyp 5 years ago | |

You can easily mirror LFS objects by just using

    git lfs fetch --all

Every modern git hosting server now supports LFS directly inside the git server (GitLab, GitHub, and Gerrit, to my knowledge).

This solves the authentication issue nicely, which makes it easy for developers.

git lfs is starting to be adopted by vendors and is becoming usable. It solves a real problem when you are tired of having to double your git server CPUs every 6 months because your git upload packs take a huge amount of time trying to recompress those big files over and over.

blablabla123 5 years ago | | |

> It solves a real problem when you are tired of having to double your git server CPUs every 6 months

Putting blobs into an SCM was always a bad idea, but in git it's particularly bad because the whole tree is always checked out at once. (I think last year a major change was added, that makes blob handling slightly more efficient though)

Still, I think there is no independently developed git lfs server. I know most people don't run git servers themselves, but in the LFS use case it actually makes sense. There is also git annex, which is completely open and free, but adoption is very poor and the handling is even more obscure.

pooya13 5 years ago | |

What is your alternative then? Version control binary files and have your repos grow gigabytes?

AstralStorm 5 years ago | | |

If you're rewriting files and need the version history, yes.

If you're not rewriting the files, also yes.

If you don't need the history, put them on a normal web server.

Game_Ender 5 years ago |

The latest version of git has a very similar feature called “partial clones” to what the author describes for Mercurial. All the data is still in your history, no extra tools are needed, but you only fetch the blobs from the server for the commits you checkout. So just like LFS larger blobs not on master are effectively free, but you still grab all the blobs for your current commit.

You need server side support, which GitHub and GitLab have, and then a special clone command:

    git clone --filter=blob:none <repository-url>

Some background about the feature is here: https://github.blog/2020-12-21-get-up-to-speed-with-partial-...

brandmeyer 5 years ago | |

This looks too aggressive. The nice thing about git-lfs is that only the binary file type(s) you care about are run through git-lfs. All other ordinary diffable text is treated normally.

The blobless clone is going to be ensaddening the next time that I'm examining the history of some source code when I'm hacking away without a network connection.

Game_Ender 5 years ago | | |

You can mitigate a bit of this by only ignoring blobs over a certain size like "--filter=blob:limit=256k" which should allow most ordinary text files through.

In the end it's the same as LFS though, in that examining old commits without a network is a bummer. No free lunch here besides something a bit more complex like git-annex.
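For what it's worth, the size rule can be modelled in a few lines of Python. This is only a sketch of the selection rule as I read it in the git documentation (`blob:limit=<n>` omits blobs of at least n bytes, with k/m/g as multiples of 1024); the function names and the sample file list are made up, and nothing here talks to git.

```python
def parse_limit(spec: str) -> int:
    """Parse a size such as '256k' into bytes (k, m, g are powers of 1024)."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    spec = spec.strip().lower()
    if spec and spec[-1] in units:
        return int(spec[:-1]) * units[spec[-1]]
    return int(spec)


def fetched_up_front(blobs: dict, limit: str) -> list:
    """Return the paths whose blobs are smaller than the limit (not filtered out)."""
    threshold = parse_limit(limit)
    return sorted(path for path, size in blobs.items() if size < threshold)


# Hypothetical repository contents: path -> blob size in bytes.
sample = {
    "src/main.c": 12_000,
    "README.md": 3_500,
    "assets/level1.pak": 48_000_000,
    "assets/logo.png": 300_000,
}
print(fetched_up_front(sample, "256k"))  # ['README.md', 'src/main.c']
```

Anything above the threshold would only be downloaded when a checkout actually needs it, which is where the offline pain comes from.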

Someone1234 5 years ago |

All three points are really just the same point repeated three times: That it isn't part of core/official GIT ("stop gap" until official, irreversible to later official solution, and adds complexity that an official version would lack due to extra/third party tooling).

I'm frankly surprised GIT hasn't made LFS an official part by now. It fixes the problem, the problem is common and real, and GIT hasn't offered a better alternative.

If LFS was made official it would solve this critique, since that is really the only critique here.

klodolph 5 years ago | |

> All three points are really just the same point repeated three times

Absolutely not. Having worked with Mercurial LFS and Git LFS, the differences seem subtle but they are there. Basically,

In Mercurial, LFS is (to an extent) an implementation detail of how you check out a repository. It doesn't mean altering the repository contents itself (the data), it just means altering how you get that data. Contrast with Git LFS, where the data itself must be altered in order to become LFS data, and the "LFS flag" is recorded in history.

This is not something that you would solve by upstreaming LFS. You would need to redesign LFS.

swiley 5 years ago | |

LFS is definitely outside the scope of git.

IshKebab 5 years ago | | |

It shouldn't be though.

dwohnitmok 5 years ago |

git-annex is an interesting alternative if the HTTP-first nature of Git LFS and the one-way door bother you.

You can remove it after the fact if you don't like it, it supports a ton of protocols, and it's distributed just like git is (you can share the files managed by git-annex among different repos or even among different non-git backends such as S3).

The main issue that git-annex does not solve is that, like Git LFS, it's not a part of git proper and it shows in its occasionally clunky integration. By virtue of having more knobs and dials it also potentially has more to learn than Git LFS.

CreepGin 5 years ago |

I've been using Git LFS with several large Unity projects in the past several years. Never really had any problems. It was always just "enable and forget" kind of thing.

P_I_Staker 5 years ago | |

Yeah, unless you can avoid large files entirely, or are okay with a separate tool (this pisses off a lot of devs IME), then just don't use git? I don't like that option. I think this is sensational. At this point LFS is a really big deal. I don't think it's going anywhere or that LFS users will be shafted.

coley 5 years ago | |

This is my experience as well so far. It took 15-20 minutes to learn about it, install it, and setup configs. Since then I haven't had to think about it once.

goodcjw2 5 years ago |

A side topic: is there a concrete reason why github's LFS solution has to be so expensive?

IIRC, it's $5 per 50GB per month? That's really a deal breaker for me, and I wonder whether people who actually use LFS at volume avoid LFS-over-GitHub.

tux1968 5 years ago | |

$60 / year for a decent fraction of a hard disk and the associated backup resources, seems pretty fair to me. What price would you expect?

Dylan16807 5 years ago | | |

> $60 / year for a decent fraction of a hard disk and the associated backup resources, seems pretty fair to me.

What you describe sounds fair to me. The problem is that 50GB is not a decent fraction of a hard disk.

moshmosh 5 years ago | | |

It's a very large markup on the small-user retail cost of the basic thing they're providing (web-accessible, access-controlled file storage—see, for example, BackBlaze B2) but that's utterly typical of services that can get away with charging you a "convenience fee" for that sort of thing once you're on their SaaS. 2-3x markup isn't unusual, and that's about what this is, and that's above typical retail—even if GH's not managing the storage and such themselves, they're likely getting an even better (bulk) rate.

adkadskhj 5 years ago | |

Yea i actually wrote my own file chunking and general git-lfs-like backend for this exact reason. I liked Git LFS, but Github's pricing felt insane for my indie dev. For my needs i could backup onto a local server, network drive, or w/e at an insanely cheaper price.

Hell even uploading to an S3 compatible API was insanely cheaper than Github.

That and i really hated the feeling that Git LFS was being designed for a server architecture. I didn't have an easy way to locally dump the written files without running an HTTP server.

There are a couple Git LFS servers that upload to, say, S3 - but i really just wanted a dumb FS or SSH dump of my LFS files. Running a localhost server feels so.. anti-Git to me.

elcritch 5 years ago | | |

Do you have a repo for it?

toomuchtodo 5 years ago | |

Consider S3 storage and egress costs. You’re paying a flat rate to store and then pull that 50GB data (edit: removed an incorrect statement here).

Someone1234 5 years ago | | |

Just for clarity, it is 50 GB of storage and 50 GB of bandwidth.

So definitely not "as much as you want." If you pull it too many times you may get charged another $5.
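A quick back-of-the-envelope sketch in Python, using only the figures quoted in this thread ($5 for a pack of 50 GB storage plus 50 GB bandwidth per month). The billing model (you buy whole packs until both quotas are covered, no free tier) is my assumption, not something checked against GitHub's pricing page.

```python
import math

PACK_PRICE_USD = 5      # figure quoted in this thread
PACK_STORAGE_GB = 50    # storage included per pack
PACK_BANDWIDTH_GB = 50  # monthly bandwidth included per pack


def monthly_cost(stored_gb: float, downloaded_gb: float) -> int:
    """Monthly cost, assuming whole packs must cover both storage and bandwidth."""
    packs_for_storage = math.ceil(stored_gb / PACK_STORAGE_GB)
    packs_for_bandwidth = math.ceil(downloaded_gb / PACK_BANDWIDTH_GB)
    return max(packs_for_storage, packs_for_bandwidth) * PACK_PRICE_USD


# 40 GB of assets pulled in full 5 times in a month is 200 GB of bandwidth.
print(monthly_cost(40, 40 * 5))  # 20
```

Under these assumptions bandwidth, not storage, is what drives the bill once CI or several developers pull the same assets repeatedly.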

hpcjoe 5 years ago |

Just this past week, git lfs was throwing smudge errors for me. Not really sure what the issue was, I followed the recommendations to disable, pull, and re-enable. And got them again. So I disabled. And left it disabled.

Not a solution.

This said, the whole git-lfs bit feels like a (bad) afterthought the way it's implemented. I'd love to see some significant reduction of complexity (you shouldn't need to do 'git lfs enable', it should be done automatically), and increases in resiliency (sharding into FEC'ed blocks with distributed checksums, etc.) so we don't have to deal with 'bad' files.

I was a fan of mercurial before I switched to git ... it was IMO an easier/better system at the time (early 2010s). Not likely to switch now though.

klodolph 5 years ago | |

I would say that if you care about good LFS support, that is a sufficient reason to use Mercurial. Harder to find Mercurial hosting these days, though, but I'm not worried that the Mercurial project will die off (since both Facebook and Google use it, in some manner).

jfim 5 years ago | | |

Is Facebook still using mercurial? It seems that there was a blog post about it in 2014, but their repo[0] just seems to say that their codebase was originally based on/evolved from mercurial.

[0] https://github.com/facebookexperimental/eden

acdha 5 years ago |

This is really overstating the cost of a one-time setup step. History rewriting is only necessary for preexisting projects and you can use things like GitLab’s push rules to ensure that it’s never necessary in the future.

I get that a mercurial developer has different preferences but I don’t think that this is an especially effective form of advocacy.

dheera 5 years ago |

Okay, so I should avoid it. What is the alternative?

I see so many git repos with READMEs saying download this huge pretrained weights file from {Dropbox link, Google drive link, Baidu link, ...} and I don't think that's a very good user experience compared to LFS.

LFS itself sucks and should be transparent without having to install it, but it's slightly better than downloading stuff from Dropbox or Google Drive.

korijn 5 years ago |

I'm honestly super content with LFS. Wrote our own little API server to hook it up to Azure Blob Storage, never have issues with it. I don't recognize the issues mentioned in the article at all. Our whole team relies on it for years, and it delivers. No problems. Keep up the great work, git-lfs maintainers! Much love.

sam_goody 5 years ago |

If you have a hundred images in git, and one cannot be downloaded for any reason, git smudge will not be able to run, and you won't be able to git pull at all.

We had an image on AWS go bad, still not sure how. Our devs lost the ability to pull. Disabling LFS could not be done (because of rewriting history). "disable smudge" is not an official option, and none of the hacks work reliably. We finally excluded all images from smudge, and downloaded them with SFTP. Git status shows all the images as having changed, and we are downright unhappy...

I would be happy to hear that I just don't know how to use LFS, but even if so, that means the docs are woefully not useful.

I want to: 1) Tell LFS to get whatever files it could, and just throw a warning on issues. 2) If image is restored not using LFS, git should still know the file has not been modified (by comparing the checksum or whatever smudge would do).
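To illustrate wish 2: an LFS pointer file records a SHA-256 `oid` and a `size`, so in principle a file restored by other means can be checked against it. The Python sketch below does that check by hand; the helper names and the demo content are made up, and this is not how git-lfs itself decides what shows up in `git status`.

```python
import hashlib


def parse_pointer(text: str) -> dict:
    """Split the 'key value' lines of a pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


def matches_pointer(pointer_text: str, content: bytes) -> bool:
    """True if content has the size and SHA-256 recorded in the pointer."""
    fields = parse_pointer(pointer_text)
    algo, _, expected = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported hash: {algo}")
    return (
        int(fields["size"]) == len(content)
        and hashlib.sha256(content).hexdigest() == expected
    )


# Made-up content standing in for an image that was fetched over SFTP.
image = b"pretend this is a PNG"
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(image).hexdigest()}\n"
    f"size {len(image)}\n"
)
print(matches_pointer(pointer, image))          # True
print(matches_pointer(pointer, image + b"!"))   # False
```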

madjam002 5 years ago |

As much as Git LFS is a bit of a pain, on recent projects I've resorted to committing my node_modules with Yarn 2 to Git using LFS and it works really well.

Note that with Yarn 2 you're committing .tar.gz's of packages rather than the JS files themselves, so it lends itself quite well to LFS as there are a smaller number of large files.

https://yarnpkg.com/features/zero-installs#how-do-you-reach-...

https://yarnpkg.com/features/zero-installs#is-it-different-f...

devinrhode2 5 years ago | |

Does yarn2 recommend also using LFS? Do you see any performance improvements when using LFS?

cerved 5 years ago | |

why are you committing packages?

madjam002 5 years ago | | |

Because why not? It’s recommended in Yarn 2 and I don’t see there being any downsides with Git LFS as the files stored in Git are essentially pointers.

hobofan 5 years ago | | |

I would assume to prevent situations like the left-pad incident.

TeeMassive 5 years ago |

The reason the author provides is in my opinion weak compared to both his alternatives.

Sure, lfs contaminates a repository, but so do large files, sensitive data removal, and references to packages and package managers that might become obsolete or non-existent in the future. The chances of your project compiling after 15 years (the age of git, by the way) are very slim, and the chances of an entirely compilable history being useful are even slimmer.

And I think the author's statement about setting up lfs being hard is exaggerated. It's a handful of commands that should be in the "welcome to our company" manual anyway.

I've used lfs in the past and while it can be misused, as with all other tools, it does the job without too many headaches compared to submodules and ignored tracked files.

breck 5 years ago |

My practice for storing large files with Git is to include the metadata for the large file in a tiny file(s):

1. Type information. Enough to synthesize a fake example.

2. A simple preview. This can be a thumb or video snippet, for example.

3. Checksum and URL of the big file.

This way your code can work at compile/test time using the snippet or synthesized data, and you can fetch the actual big data at ship time.

You can then also use the best version control tool for the job for the particular big files in question.
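A minimal Python sketch of the three points above, under my own assumptions: the stub is a small JSON file, the preview is just the first few bytes, and the field names, URL, and helper names are all hypothetical.

```python
import hashlib
import json


def make_stub(data: bytes, url: str, kind: str, preview: bytes) -> str:
    """Build the small JSON stub that gets committed instead of the big file."""
    return json.dumps(
        {
            "type": kind,                       # 1. type information
            "preview_hex": preview.hex(),       # 2. a simple preview
            "sha256": hashlib.sha256(data).hexdigest(),  # 3. checksum...
            "size": len(data),
            "url": url,                         # ...and URL of the big file
        },
        indent=2,
        sort_keys=True,
    )


def load_for_tests(stub_json: str) -> bytes:
    """Tests only need the preview, so no network access is required."""
    return bytes.fromhex(json.loads(stub_json)["preview_hex"])


def verify_download(stub_json: str, downloaded: bytes) -> bool:
    """At ship time, check the fetched file against the recorded checksum."""
    stub = json.loads(stub_json)
    return hashlib.sha256(downloaded).hexdigest() == stub["sha256"]


big_file = b"x" * 1_000_000  # stand-in for the real asset
stub = make_stub(big_file, "https://example.com/assets/big.bin", "raw-bytes", big_file[:16])
print(len(stub) < 1000, load_for_tests(stub) == b"x" * 16, verify_download(stub, big_file))
```

The actual download step is left out on purpose, since it depends entirely on where the big files live.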

CobrastanJorji 5 years ago | |

Is this just a manual equivalent of git LFS, or is there some advantage here?

theamk 5 years ago | | |

This is pretty superior to git LFS in many aspects:

- You have file type and preview that you can use without getting the full thing

- You have a custom metadata for each file enforced by your scripts -- for example for archives, you may store the list of files inside. This will allow your CI tests to validate the references into the files without having to download the whole huge thing.

- You fully control remote fetch logic. Multiple servers? Migration rules for old revisions? That weird auth scheme that your IT insists on? It is all supported with a bit of code.

- You fully control local storage. Do you want a computer-wide shared CAS cache between multiple users? What if you have NAS that most users mount? Or maybe s3fs is your thing? Adding support is easy.

The main downside is that you get to do all the tooling and documentation, so I would not recommend this for the smaller teams. Nor would I recommend this for open-source projects.

But if your infra team is big enough to support this, you'll definitely have the better experience than generic Git LFS.

breck 5 years ago | | |

It's a design pattern that ensures testability of the system without any dependencies on the big files.

dmm 5 years ago | |

Tools like git-annex or dvc support similar strategies.

remram 5 years ago | | |

Is git-annex still alive? Last time I tried to use it, it was very rough, and the official wiki (that serves as doc + bug tracker) gives database errors trying to create an account.

Details: I wanted to have a remote I can push to but anonymous users can only pull from, couldn't piece it together.

rsync 5 years ago |

FWIW, rsync.net is currently deploying LFS support such that operations like:

  ssh user@rsync.net git clone blah

... will properly handle LFS assets, etc.

This is in response to several requests we have had for this feature...

KETpXDDzR 5 years ago |

This opinion only lists issues, not solutions. Sure, they advertise mercurial, but migrating from git to mercurial is unrealistic for many cases.

I'd title it: "Why Mercurial is better than git+LFS"

cerved 5 years ago | |

The author is detailing the problems wrt git-lfs, why they are problems and how those problems are overcome in a similar technical solution in a similar VCS. I think the original title is fine

shabbyrobe 5 years ago |

Here's another fun one: https://github.com/git-lfs/git-lfs/issues/2434

> Git on Windows client corrupts files > 4Gb

It's apparently an upstream issue with Git on Windows, but if you depend on something, you inherit its issues.

robmsmt 5 years ago |

Pushing Github past the 100mb limit has to be the most requested feature. Ridiculous that we have to use the fudge that is GitLFS.

It just adds complication for a limit that shouldn't be there anyway.

viraptor 5 years ago | |

You can use GitLab instead, with a limit of a few GB.

robmsmt 5 years ago | | |

You can also self host with GOGS / Gitea and no limit but getting my company to move from Github will be a large undertaking. It's not worth it for GitLFS on its own.

wbillingsley 5 years ago |

The solution I've tended to use in classes (where there'll always be some student who hasn't installed LFS) is to store the large files in Artifactory, so they are pulled in at build-time in the same way as libraries.

This seemed to me a sensible approach as Artifactory is a repository for binaries (usually, the compiled output of a project). It also seemed to me that the decisions on which versions to retain and when an update to a binary is expected or when that resource is now frozen and a replacement would be a new version is similar to the decision on when a build is a snapshot vs a release.

temac 5 years ago |

If you don't jump on random tech without good reason, you already apply this advice naturally. Especially since, once you really need it and also want Git, there is not much of an alternative (as the author recognizes). In this context, just waiting for a potential "better support for handling of large files" in official Git makes little sense; plus, I'll make the wild prediction that what will actually happen is that Git LFS will (continue to) be improved and used by most people (and maybe even integrated into "official Git"?).

ziml77 5 years ago | |

This is what I was thinking too. There's really nothing about Git LFS that should come as a surprise. Yes it rewrites history, but how else are you going to cut bloat from the repo after it's been stuffed in there? And the fact that the file is stored on something completely outside of git is clearly and concisely explained as the main text, directly above the download button, on https://git-lfs.github.com/

> Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.

justaguy88 5 years ago |

> Git LFS is a Stop Gap Solution

Build the real thing then..

klodolph 5 years ago | |

The author of the article is a Mercurial maintainer, and Mercurial has the "real thing" implemented already (and it has been part of Mercurial since at least 2012, at least in some form). So it's already done, just not for Git.

formerly_proven 5 years ago | |

He does.

> Since I'm a maintainer of the Mercurial version control tool,

ecnahc515 5 years ago |

You don't need to rewrite history unless you weren't using LFS or accidentally committed large files to the repository. Nothing about LFS "requires" rewriting history.

Not to mention, many users are paying for a service that provides LFS, and hosting an LFS service isn't crazy hard. It's a file server with a custom API, it's mostly doable using S3 as a backend. It's not like this is crazy complicated stuff.
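To give a feel for the "custom API": as far as I understand the Git LFS batch API, a client POSTs a JSON list of objects to `<lfs-url>/objects/batch` and gets back per-object download links (which could be, say, presigned S3 URLs). The Python sketch below only builds such a request and reads a response; the server URL and helper names are hypothetical and nothing is sent over the network.

```python
import json
import urllib.request

LFS_MEDIA_TYPE = "application/vnd.git-lfs+json"


def build_batch_request(lfs_url: str, objects: list) -> urllib.request.Request:
    """Prepare (but do not send) a batch 'download' request for the given objects."""
    body = json.dumps(
        {"operation": "download", "transfers": ["basic"], "objects": objects}
    ).encode()
    return urllib.request.Request(
        lfs_url.rstrip("/") + "/objects/batch",
        data=body,
        headers={"Accept": LFS_MEDIA_TYPE, "Content-Type": LFS_MEDIA_TYPE},
        method="POST",
    )


def download_links(response_json: str) -> dict:
    """Map each oid in a batch response to its download href, if one was offered."""
    links = {}
    for obj in json.loads(response_json)["objects"]:
        action = obj.get("actions", {}).get("download")
        if action:
            links[obj["oid"]] = action["href"]
    return links


# Hypothetical server URL and object id.
req = build_batch_request(
    "https://lfs.example.com/team/repo.git/info/lfs",
    [{"oid": "0" * 64, "size": 123}],
)
print(req.get_method(), req.full_url)
```

A real server also has to handle uploads, authentication, and errors, so this shows the shape of the exchange rather than a complete client.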

cerved 5 years ago | |

other stuff might require rewriting history

SavantIdiot 5 years ago |

Yep. All of this. I tried using Git LFS for a project and reverted back to links to cloud server for the large binary blobs and hashes on those blobs.

tpoacher 5 years ago |

I keep hearing the mantra that "svn is better for large files than git" but never really understood why. To me a large file is a large file; if you make changes, worst case scenario you add the entire new file to the commit, best case you add some sort of binary diff. Does git do the former and svn the latter by any chance?

P_I_Staker 5 years ago | |

For a sizeable project, or one with lots of binary commits, eg. 1-10 GB sized items per day... that type of thing, you have to use LFS or SVN.

Your repo size could easily balloon to terabytes, for every clone. Additionally, I think there's other performance issues, but I don't allow this to happen, so I'm not sure.

SVN happily handles terabytes, due to the client server interface. As does LFS. My biggest gripe with LFS is that it turns your distributed tool into a client server one. I kinda wish they had an easy "skip lfs" type option.

lmz 5 years ago | |

An svn working copy has one version stored locally. A git clone has all versions stored locally. All versions of a large file takes up lots of space.
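Rough arithmetic for that difference, as a Python sketch. It assumes the worst case for git (every revision of the binary stored whole, no delta or compression savings) and ignores svn's extra pristine copy; the 2 GB file and 100 revisions are made-up numbers.

```python
def svn_working_copy_gb(file_gb: float, revisions: int) -> float:
    """A working copy holds only the checked-out revision, however many exist."""
    return file_gb


def git_clone_gb(file_gb: float, revisions: int) -> float:
    """Worst case for a full clone: every revision is stored whole locally."""
    return file_gb * revisions


# A 2 GB binary that has been modified 100 times.
print(svn_working_copy_gb(2, 100), git_clone_gb(2, 100))  # 2 200
```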

tpoacher 5 years ago | | |

I see. So the idea is not that svn's handling of large files at the repo-level is somehow better than that of a git repo per se, but that it's fine for the (possibly remote) svn repo to take the 'large file' hit, since the (presumably local) wc is disjoint from it, and thus unaffected in terms of local storage. Ok, that makes sense...

I've been using a decentralised svn workflow at work for so long, I didn't even think of this :)

sjburt 5 years ago |

The thing that always rubbed me the wrong way about git-lfs was that they cloned the git-scm.com site design. It's not part of git!

[1]https://git-lfs.github.com/

[2]https://git-scm.com/

alkonaut 5 years ago |

I’m using Git+LFS because my issue tracker, CI/CD etc natively speaks it. Not because it’s in any way superior or even on par with the large file handling of Mercurial (or even SVN to be honest).

slaymaker1907 5 years ago |

Is rewriting the history for large repos really that difficult besides coordinating with other contributors? My understanding is that it shouldn't be that much worse than "git gc --aggressive". Yes it is expensive, but it is the sort of thing you can schedule to do overnight or on a weekend.

sjansen 5 years ago | |

The issue is breaking external references.

Do you include git SHAs in your bug tracking system? Or perhaps your department wiki links to a specific commit to document lessons learned? Maybe you're using Sentry and find including the git SHA of the build to be invaluable for troubleshooting?

For some organizations, rewriting history would be a non-event and for others it would be a major disruption.
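Why the SHAs break can be shown in a small Python sketch. Git hashes each object as `<kind> <size>\0` plus its body, and a commit's body names its tree and parent, so changing an early commit changes the id of everything built on it. The commit layout below is simplified, and the author line and tree ids are placeholders I made up.

```python
import hashlib
from typing import Optional


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object id as git computes it: header '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree: str, parent: Optional[str], message: str) -> str:
    """Id of a simplified commit object with a fixed, made-up author line."""
    who = "Dev <dev@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())


old_tree = "a" * 40  # placeholder: tree containing the raw big file
new_tree = "b" * 40  # placeholder: tree containing an LFS pointer instead

root = commit_id(old_tree, None, "add big.psd")
child = commit_id(old_tree, root, "fix typo")

# Rewriting the root commit changes its id...
new_root = commit_id(new_tree, None, "add big.psd")
# ...so the child has to be recreated on top of it and gets a new id too.
new_child = commit_id(new_tree, new_root, "fix typo")
print(root != new_root, child != new_child)  # True True
```

Every bug report or wiki page that recorded `root` or `child` now points at an id the rewritten repository no longer contains.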

cookiecaper 5 years ago | | |

Yeah, git is really not a mature or well-designed VCS. The fact that you can trivially lose the supposed permanent reference -- and that it's encouraged as part of several common workflows at that -- should be more than enough to demonstrate this. If you care about history, use a VCS like Fossil.

alkonaut 5 years ago | |

The problem I see is that things like commit hashes which are etched in history in bug reports, version tags etc, instantly lose meaning. Whether or not that’s a problem depends on how much of that you have.

cerved 5 years ago | |

git gc doesn't rewrite history, it packs objects in your local repository into a pack file

pooya13 5 years ago |

Maybe I am missing the point. What is the alternative this article proposes then?... Also, Git is not central so how can you ever integrate large file support without a separate server?

thenoblesunfish 5 years ago |

Good points, but it seems optimistic to assume that git will have good, native, large file support anytime soon. I've been waiting quite a while for git submodules to improve...

chrisdbanks 5 years ago |

The main argument here seems to be that we shouldn’t use LFS because Git will have large file support at some unspecified point in the future? Similarly you could argue that we shouldn't use a Covid vaccine because we'll develop a cure in the future. Why vaccinate billions of people when we can just treat the 1% of people who get ill? Clearly that argument doesn't work. People need a solution now. Ironically, we had to stop using mercurial because it didn't have an LFS alternative, even though I prefer it. LFS is definitely not ideal but as a solution to a real world problem, it works. There may be issues around cloning repos and losing history in the future, but those are one-off issues where you have to accept the pain, rather than living in pain every day.