Data Deduplication with Linux

Data Deduplication with Linux (linuxjournal.com)

50 points by chintanp 14 years ago | 24 comments

wazoox 14 years ago |

For those interested, I did a lot of lessfs testing and published the results on my professional blog a while ago:

* first post: http://blogs.intellique.com/tech/2010/12/22#dedupe

* detailed setup and benchmark results: http://blogs.intellique.com/tech/2011/01/03#dedupe-config

After more than 9 months running lessfs, I recommend it.

chintanp 14 years ago |

Required reading from my course on Advanced Storage Systems at CMU: http://www.cs.cmu.edu/~15-610/READINGS/optional/zhu2008.pdf

Really good paper which describes in detail how the deduplication works.

ak217 14 years ago |

So, from what I understand, this is great but more of a proof of concept, since FUSE performance kills it. As far as putting it in production, there are a few unresolved questions which I haven't seen picked apart:

- Can dedup be integrated into the VFS layer, like unionfs is shooting for, or does it have to be integrated with the underlying filesystem?

- Is online dedup possible, and does the answer change when running on SSDs?

- What's the best granularity (block-level? inode-level? block extent-level?) and how badly can it randomize the I/O? I imagine one would have to do a lot of real-world benchmarking to find this out.

- Are there possible privacy issues (i.e. finding out through I/O patterns whether someone else has a given block or file stored), and how should they be dealt with?

res0nat0r 14 years ago |

Bup is also a pretty cool git-based dedup backup utility:

https://github.com/apenwarr/bup#readme

viraptor 14 years ago |

I was wondering - with the current amount of abstraction and similar (sometimes redundant) metadata on almost everything - what percent of duplicate blocks could be found on a standard desktop system?

I don't think it would be useful, I'm just interested in the level of "standard" data duplication.

viraptor 14 years ago | |

Actually the btrfs email thread contained the answer (http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg0...):

"I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes."

It's a relatively small amount. Then again - you're storing 300MB of exactly the same blocks of data... Unless they're manual backup files, this looks like a big waste to me.
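For anyone who wants to repeat that kind of measurement, here is a rough sketch of a userspace check: hash every fixed-size block under a directory and count the bytes in blocks that were already seen. The 4 KiB block size, the use of SHA-256, and the function names are assumptions for illustration; this is not the tool from the quoted mail.

```python
import hashlib
import os
import sys

BLOCK_SIZE = 4096  # assumed block size; real filesystems and dedup tools differ


def block_digests(path, block_size=BLOCK_SIZE):
    """Yield (digest, length) for each fixed-size block of one file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield hashlib.sha256(block).digest(), len(block)


def duplicate_bytes(root, block_size=BLOCK_SIZE):
    """Return (total_bytes, bytes_in_repeated_blocks) for regular files under root."""
    seen = set()
    total = saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, device nodes, ...
            try:
                for digest, length in block_digests(path, block_size):
                    total += length
                    if digest in seen:
                        saved += length  # block content already stored once
                    else:
                        seen.add(digest)
            except OSError:
                continue  # unreadable file: ignore it
    return total, saved


if __name__ == "__main__":
    total, saved = duplicate_bytes(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{total} bytes scanned, {saved} bytes in duplicate blocks")
```

The set of digests lives in memory, so this only scales to modest trees, and it ignores hard links and sparse files, both of which would skew the numbers.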

radiowave 14 years ago | | |

Yup, that's about the same proportion I found when I recently tried copying my data across to a ZFS system with the dedup switched on.

I then decided to disable the dedup, because it comes at a cost: the checksum data (which would mostly be living on the SSD read cache I had attached) was occupying SSD space worth about 3 times as much as the conventional disk space that the duplicate data was occupying.

I noticed that the opendedup site (linked from the article) claims a much lower volume of checksum data, relative to the number of files; perhaps an order of magnitude less than I observed with ZFS, but they seem to achieve that by using a fixed 128KB block size, which brings along its own waste. (ZFS uses a variable block size.) I haven't actually done the numbers here, but I wouldn't be at all surprised to find that for my data, the 128KB block size would be costing as much disk space as what dedup was saving me. (YMMV, of course.)
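Doing the numbers could look something like the sketch below. It assumes the simplest possible model, where every file is padded out to a whole number of fixed-size blocks; whether opendedup really stores data that way is not something the thread confirms, and `slack_bytes` is a made-up helper name.

```python
def slack_bytes(file_sizes, block_size=128 * 1024):
    """Padding wasted if each file occupies a whole number of blocks.

    Assumption: tail blocks are stored at full size, with no tail packing
    or compression. Empty files are counted as wasting nothing.
    """
    # -size % block_size is the distance from size up to the next multiple.
    return sum(-size % block_size for size in file_sizes)
```

Under this model a 1-byte file wastes 131071 bytes, so a tree with many small files can lose more to padding than dedup wins back. Feeding it the real file sizes (for example from `os.walk` and `os.path.getsize`) gives a figure to compare against the dedup savings.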

ak217 14 years ago | |

I think this will be much more useful on large multi-user storage systems, e.g. the classic example of the 5 MB email attachment sent to 100 people.

makmanalp 14 years ago |

btrfs also has a deduplication feature in the works: http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg0...

tobias3 14 years ago |

I tested it and I don't recommend it. (It was about a year ago, though.) It was really slow, and some blog posts about the reliability of the data storage backend were a little bit scary.

I would recommend using zfs-fuse. You don't have the FUSE->File on a filesystem->Hard disk indirection (thus more speed). And additionally you get all the cool ZFS features! If you need even more speed, there is a ZFS kernel module for Linux and a dedup patch for btrfs. I don't think those are production ready, though.

Dylan16807 14 years ago | |

I tried ZFS dedup, but there was something like a 20x slowdown when writing files compared to ZFS without dedup, and this was on under ten gigabytes of files. I don't know if I somehow had the cache settings wrong or what the problem was, but I didn't manage to fix it, even after trying both the FUSE and kernel versions. (On Ubuntu 11.04.)

tobias3 14 years ago | | |

Yeah, random access on hard disks is awfully slow. And if you have dedup, you can cause lots of random access... Even with a fairly small amount of data, the hash table used for dedup can be too big to fit into memory. Then ZFS puts it onto the disk and it is even slower. Luckily there is a feature to use SSDs as a cache device in this case.
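A back-of-the-envelope sketch of why the table outgrows RAM: one entry per unique block, times a per-entry cost. The 320 bytes per entry used here is a ballpark often quoted for ZFS and should be treated as an assumption, as should the helper name.

```python
def dedup_table_bytes(data_bytes, avg_block_size, bytes_per_entry=320):
    """Worst-case dedup table size: every block unique, one entry each.

    bytes_per_entry=320 is an assumed ballpark, not a measured value.
    """
    blocks = -(-data_bytes // avg_block_size)  # ceiling division
    return blocks * bytes_per_entry
```

With these assumptions, 1 TiB of data in 128 KiB blocks needs about 2.5 GiB of table, and the same data in 4 KiB blocks needs about 80 GiB. Small blocks find more duplicates but make the table far more expensive to keep in memory.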

wazoox 14 years ago | |

Actually, except for DataDomain's ultra-expensive specialised hardware (and maybe a couple of similar enterprise solutions), all dedupe systems come with a huge performance hit. ZFS is no exception...

alecco 14 years ago |

I don't understand the complication of using a database. The sensible approach would be something like BMDiff with [page] indexing on top for random access.

billswift 14 years ago | |

I remember a spate of academic articles a few (3-7?) years ago talking about how all filesystems were going to be replaced by single huge databases to hold all our "files". Maybe this is partially a continuation of that research.

alecco 14 years ago | | |

I remember over a decade ago Microsoft was working on a FAT based filesystem backed by one of their database engines.

wcoenen 14 years ago |

lessfs appears to do block-level deduplication (like ZFS). This means that if I copy a huge file but add a few bytes at the start, I won't get any benefit from deduplication, because the data no longer aligns with the original block boundaries.

I wonder if there is a way to improve on that?
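One commonly used answer is content-defined chunking: instead of cutting every N bytes, cut wherever a rolling hash of the last few bytes hits a chosen pattern, so boundaries move with the data. Below is a minimal sketch using a Gear-style rolling hash. The table, the mask, and all names are illustrative assumptions, and nothing here is taken from lessfs or ZFS. Real implementations also enforce minimum and maximum chunk sizes, which this sketch leaves out.

```python
import hashlib

# One pseudo-random 32-bit value per byte value, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]
CUT_MASK = 0xFFE00000  # top 11 bits zero: a cut about every 2 KiB on random data


def fixed_chunks(data, size=4096):
    """Split data at fixed offsets, as a block-level dedup would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data):
    """Split data wherever the rolling hash of recent bytes matches CUT_MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each shift pushes older bytes out of the 32-bit state, so h
        # depends only on the last 32 bytes, not on the file offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & CUT_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_bytes(old_chunks, new_chunks):
    """Bytes of new_chunks whose content already exists among old_chunks."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(len(c) for c in new_chunks
               if hashlib.sha256(c).digest() in known)
```

Comparing `shared_bytes(fixed_chunks(old), fixed_chunks(new))` with `shared_bytes(content_defined_chunks(old), content_defined_chunks(new))`, where `new` is `old` with a few bytes prepended, shows the difference: the fixed split shares essentially nothing, while the content-defined split resynchronises after the first chunk or so. The price is variable-sized chunks and a hash computation per byte.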