Rename files to match hash of contents (gist.github.com)
$ ls
01.jpg 03.jpg 03_copy.jpg 04.jpg 05.jpg
$ git init
Initialized empty Git repository in /tmp/test/.git/
$ git hash-object -w *
82f7d50fc89d2fd47150aff539ea4acf45ec1589
0080672bc4f248c400d569cce1a2a3d743eb1331
0080672bc4f248c400d569cce1a2a3d743eb1331
58db57b10c219b9b71f0223e58a6dc0d51cfe207
05dcde743807bddaf55ad1231572c1365d4db4af
$ find .git/objects -type f
.git/objects/00/80672bc4f248c400d569cce1a2a3d743eb1331
.git/objects/05/dcde743807bddaf55ad1231572c1365d4db4af
.git/objects/58/db57b10c219b9b71f0223e58a6dc0d51cfe207
.git/objects/82/f7d50fc89d2fd47150aff539ea4acf45ec1589
If you're curious, you can read more about how it works here: https://git-scm.com/book/en/v1/Git-Internals-Git-Objects
Very cool!
[1] https://github.com/adrianlopezroche/fdupes
Edit: just noticed that it's using md5, which is broken [2], and that it's using truncated md5 hashes.....!
[2] https://natmchugh.blogspot.ca/2015/02/create-your-own-md5-co...
md5 is fine for deduplicating. It's extremely improbable you'd 'organically' get an md5 hash clash between two different files.
Even such a simple optimization can make a huge difference on a large directory of images or MP3s.
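The comment above doesn't spell out which optimization it means. One common candidate, and this is my assumption, is to compare file sizes first and only hash files that share a size with at least one other file. A rough Python sketch of that idea (`file_digest` and `find_duplicates` are hypothetical names, and this is not how any particular tool is implemented):

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded into memory whole."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, algo="sha256"):
    """Return lists of paths under `root` that have the same size and the same digest."""
    # Pass 1: group by size. This needs only a stat() per file, no reads.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files whose size is shared with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip the read
        for path in paths:
            by_digest[(size, file_digest(path, algo))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

On a directory of photos or MP3s most files have a unique size, so the second pass reads only a small fraction of the data. A cautious tool would still compare candidate files byte for byte before deleting anything.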
Also, what of truncating the hashes?
I don't get why people try to justify using severely weakened things when using the non-broken (i.e., secure) version is a /trivial/ drop-in replacement...
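On the truncation question, the usual back-of-the-envelope tool is the birthday bound: with n distinct files and a b-bit hash, the chance of at least one accidental collision is roughly 1 - exp(-n(n-1) / 2^(b+1)). A small Python sketch, assuming the hash output behaves like uniformly random bits (the 32-bit figure below is purely illustrative, I haven't checked how many characters the script actually keeps):

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `num_files` distinct files share a hash.

    Uses the birthday approximation 1 - exp(-n(n-1) / 2^(b+1)), which assumes
    hash outputs are uniformly random. It says nothing about deliberate attacks.
    """
    pairs = num_files * (num_files - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for bits in (32, 64, 128, 256):
        p = collision_probability(100_000, bits)
        print(f"{bits:3d}-bit hash, 100,000 files: p = {p:.3g}")
```

With 100,000 files, a hash cut down to 32 bits is more likely than not to produce an accidental collision, while at the full 128 bits the chance is negligible. So truncation hurts even if you only care about accidental clashes and not about attackers.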
So while you're correct about the two images on that blog, the only reason why you'd get a clash is because the author of that blog post spent ~15 hours on an AWS GPU instance to generate the correct prefixes which, when appended to those files, result in a clash.
So, I guess if you are in the habit of grabbing random files from your hdd, loading them onto an AWS GPU instance for 15 hours (per file) and generating hash collisions, then yeah, don't use fdupes.
I was unimpressed by the shell script at the original link, which uses a truncated md5 hash...
And if you are deduping on really fast storage, you'd get way better performance (with comparable safety) using something like xxHash64 (https://cyan4973.github.io/xxHash/).