Making Backup Validation Easier

Making Backup Validation Easier (brokensandals.net)

41 points by jaw 6 years ago | 17 comments

gruez 6 years ago |

This seems worse than just hashing the file. Random bit flips will probably go undetected using this method, but hashing would catch them.

jaw 6 years ago | |

I'm mostly trying to address cases where there is no original file that I fully trust. If I'm exporting my data from some web app/service, I can't get a hash of the data as it is in the actual source of truth on their servers, and there's multiple points at which an error could be introduced before the completed export file lands on my machine.

It's a good point that hashing is a better method when you have access to the original files.

close04 6 years ago | | |

> and there's multiple points at which an error could be introduced before the completed export file lands on my machine

Aren't all bets off at this point? I mean, validating the backup seems like skipping steps if you are not validating the source. Scrolling through thumbnails is better than nothing, sure. But it's really prone to false negatives. Corrupted images can look good in a thumbnail, and your eyes might miss even glaring corruption because you scrolled too fast. If it's not an image file it gets more challenging.

You seem to have one of those corner cases where basically no automated method can solve your problem but the volume of data is just low enough to alleviate the issues with a bit of manual intervention.

akie 6 years ago | |

The contents of my backups are never the same, not from one single day to the other - so hashes would be useless.

HideousKojima 6 years ago | | |

You don't need hashes to match between days at all, though. You simply hash the file that was just backed up and the backup copy of it, then compare the two.
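A minimal sketch of that same-day check, assuming both the source file and its backup copy are readable as local paths. The function names `sha256_of` and `copy_matches` are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches(source_path, backup_path):
    """True if the backup copy is byte-for-byte identical to the source (by hash)."""
    return sha256_of(source_path) == sha256_of(backup_path)
```

Because both hashes are taken right after the backup runs, it does not matter that the contents differ from the previous day.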

close04 6 years ago |

I think making a list of the files to be copied and their hashes, then a list of files that were copied and their hashes, then comparing the 2 lists should provide an even quicker way to validate. Or even hashing the entire source and destination (hash of the list of hashes) and providing both values to the user to visually compare.
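A rough sketch of the two-list idea, assuming the source and destination are both local directory trees. All names here (`build_manifest`, `diff_manifests`, `root_hash`) are hypothetical, and the "hash of the list of hashes" is computed over sorted `hash  path` lines, which is one possible convention among many.

```python
import hashlib
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a single file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, POSIX-style) to its hash."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_hash(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_manifests(source, backup):
    """Return (missing, extra, changed) relative paths, each sorted."""
    missing = sorted(source.keys() - backup.keys())
    extra = sorted(backup.keys() - source.keys())
    changed = sorted(k for k in source.keys() & backup.keys() if source[k] != backup[k])
    return missing, extra, changed


def root_hash(manifest):
    """Single digest over the sorted manifest, for a quick visual comparison."""
    digest = hashlib.sha256()
    for rel_path in sorted(manifest):
        digest.update(f"{manifest[rel_path]}  {rel_path}\n".encode("utf-8"))
    return digest.hexdigest()
```

If `root_hash` of the two manifests matches, the trees have the same paths and contents; if not, `diff_manifests` shows which files to look at.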

As far as I can tell the method described in the article doesn't really validate the backups in any way, just provides some statistics that will fail in very plausible ways.

And of course, if the data is important to you and there are special circumstances that could affect the process, nothing beats an actual restore test.

jaw 6 years ago | |

I replied to a similar point about hashing here - https://news.ycombinator.com/item?id=23032633

You're correct that the methods I described are a far cry from actually guaranteeing that the backup has no errors. In the same way that a unit test doesn't prove code is error-free, but _can_ justify increased confidence in the code, I'm interested in techniques that can justify increased confidence in my backups. Particularly in cases where I don't have direct access to the original data, and where exhaustively checking the data manually is too time-consuming to be worth it.

wila 6 years ago |

What I did is to do all my work in a VMware virtual machine.

Then I wrote software for backing up VMs automatically (disclaimer: this is a commercial product I sell).

There's options for getting an email on success, failure or both. The VM files are all hashed.

VMs are easy to restore, so an actual restore is pretty easy to do without risking overwriting the original. If a file hash does not match on restore, my software will complain but continue the restore anyway.
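To illustrate the "complain but continue" behaviour in general terms, here is a hypothetical sketch. It is not the commercial product's code; it assumes a dict of expected hashes recorded at backup time, and it restores into a separate target directory so the original is never touched.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore(backup_dir, target_dir, expected_hashes):
    """Copy every listed file into target_dir; warn on hash mismatch but keep going.

    expected_hashes maps relative paths to the hashes recorded at backup time
    (an assumed format). Returns the list of relative paths that did not match.
    """
    backup_dir, target_dir = Path(backup_dir), Path(target_dir)
    mismatched = []
    for rel_path, expected in expected_hashes.items():
        dst = target_dir / rel_path
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup_dir / rel_path, dst)
        if sha256_of(dst) != expected:
            # Report the problem, but do not abort the rest of the restore.
            print(f"WARNING: hash mismatch for {rel_path}")
            mismatched.append(rel_path)
    return mismatched
```

Continuing after a mismatch means one damaged file does not block recovery of everything else, and the returned list still records what needs attention.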

FWIW, all my code etc... is also in source control, so I am not relying on a single layer for that.