Making Backup Validation Easier (brokensandals.net)
It's a good point that hashing is a better method when you have access to the original files.
Aren't all bets off at this point? I mean, validating the backup seems like skipping steps if you are not validating the source. Scrolling through thumbnails is better than nothing, sure. But it's really prone to false negatives. Corrupted images can look fine in a thumbnail, and your eyes might miss even glaring corruption because you scrolled too fast. If it's not an image file, it gets more challenging.
You seem to have one of those corner cases where basically no automated method can solve your problem but the volume of data is just low enough to alleviate the issues with a bit of manual intervention.
As far as I can tell the method described in the article doesn't really validate the backups in any way, just provides some statistics that will fail in very plausible ways.
And of course, if the data is important to you and there are special circumstances that could affect the process, nothing beats an actual restore test.
You're correct that the methods I described are a far cry from actually guaranteeing that the backup has no errors. In the same way that a unit test doesn't prove code is error-free, but _can_ justify increased confidence in the code, I'm interested in techniques that can justify increased confidence in my backups. Particularly in cases where I don't have direct access to the original data, and where exhaustively checking the data manually is too time-consuming to be worth it.
Then I wrote software for backing up VMs automatically (disclaimer: this is a commercial product I sell).
There are options for getting an email on success, failure, or both. The VM files are all hashed.
VMs are easy to restore, so an actual restore test is pretty easy to do without risking overwriting the original. If a file hash does not match on restore, my software will complain, but continue the restore anyway.
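For anyone wanting the same "complain, but keep going" behavior without a product, here is a minimal Python sketch. It is not how the software above works internally; the manifest format (relative path mapped to a SHA-256 hex digest) and the names `sha256_of` and `verify_restored` are assumptions made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large VM images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restored(manifest, restore_dir):
    """Compare restored files against a manifest recorded at backup time.

    manifest is a hypothetical dict of relative path -> expected hex digest.
    Problems are collected and returned instead of raised, so one bad file
    does not abort checking the rest of the restore.
    """
    problems = []
    for rel_path, expected in manifest.items():
        restored = Path(restore_dir) / rel_path
        if not restored.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(restored) != expected:
            problems.append((rel_path, "hash mismatch"))
    return problems
```

An empty list means every file in the manifest was found with the expected hash. It says nothing about files that were never in the manifest.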
FWIW, all my code, etc., is also in source control, so I am not relying on a single layer for that.
"Better than nothing" is pretty much what I'm going for here. Almost all my personal data stored in cloud services falls into this "corner case": I only have indirect access to the source, it's important enough to me that I want to do some level of checking, but it's not important enough to spend the huge amounts of time it would take to inspect every individual datum.
Most people solve this issue by keeping multiple versions, not by trying to "validate" the backups somehow.
Metrics for todoist-fullsync:
name   1 days ago  8 days ago
------------------------------
files  1           1
bytes  82363       86661
items  85          87
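A comparison like the one in this table could also be checked automatically. The sketch below is only an illustration, not the article's tool: the function name `suspicious_changes` and the 50% drop threshold are assumptions, and the numbers are copied from the table above.

```python
def suspicious_changes(older, newer, max_drop=0.5):
    """Flag metrics that vanished or shrank by more than max_drop (a fraction)."""
    flagged = []
    for name, old_value in older.items():
        new_value = newer.get(name)
        if new_value is None:
            flagged.append(f"{name}: missing from newer snapshot")
        elif old_value > 0 and new_value < old_value * (1 - max_drop):
            flagged.append(f"{name}: dropped from {old_value} to {new_value}")
    return flagged


# Values from the todoist-fullsync table above.
eight_days_ago = {"files": 1, "bytes": 86661, "items": 87}
one_day_ago = {"files": 1, "bytes": 82363, "items": 85}

for warning in suspicious_changes(eight_days_ago, one_day_ago):
    print("SUSPICIOUS:", warning)
```

A fixed threshold is crude. It would miss slow shrinkage across many snapshots and would complain after a legitimate mass cleanup, so it is a prompt to look at the backup, not a verdict on it.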
The "items" line there seems like it's actually parsing the file and counting the number of entries in it? It's also captured in point #2: "Can be intuitively evaluated as plausible or suspicious. If the number of tasks in my Todoist export dropped from dozens to 1, that would be cause for concern."
> Not only do you have to validate a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (ie. the file format is valid but a chunk of it is missing).
But you don't have to do that perfectly to get value out of it; for example:
- If the .json file parses as json, then at least you probably didn't truncate the download mid-stream.
- If it also contains a particular attribute, then you probably didn't save a structured error response instead of the actual data, or save something from an endpoint that has since changed in a radically non-passive (breaking) way and may no longer return adequate data.
- If it also has roughly the number of elements you expect, you probably didn't miss entire pages of the response.
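Those three checks fit in a few lines of Python. This is a rough sketch under assumptions: `check_json_export` is a made-up name, and the key names in the example call (`sync_token`, `items`) are placeholders for whatever attribute and list your export is supposed to contain.

```python
import json


def check_json_export(text, required_key, list_key, min_items):
    """Return a list of problems found in a JSON export (empty if none)."""
    # Check 1: does it parse at all? A truncated download usually will not.
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    # Check 2: an attribute that real data has and an error response lacks.
    if required_key not in data:
        problems.append(f"missing expected attribute {required_key!r}")
    # Check 3: roughly the number of elements we expect.
    items = data.get(list_key)
    if not isinstance(items, list):
        problems.append(f"{list_key!r} is missing or not a list")
    elif len(items) < min_items:
        problems.append(f"only {len(items)} entries in {list_key!r}, expected at least {min_items}")
    return problems


# Hypothetical example: an error response saved in place of the real export.
print(check_json_export('{"error": "rate limited"}', "sync_token", "items", 50))
```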
I haven't found a good way to verify these without doing a full database restore and seeing if the logs apply cleanly, along with having the DB do internal checks.
JFTR, this is supported on Linux as well and, especially when using LVM, is quite simple and straightforward.
You can do it manually [0,1] or using tools made for just this purpose, such as mylvmbackup [2] (which should be available in most distributions' package repositories).
---
[0]: https://www.badllama.com/content/mysql-backups-using-lvm-sna...
[1]: https://www.percona.com/blog/2006/08/21/using-lvm-for-mysql-...
Obviously, that won't catch all potential problems in the file, but it's a low-effort way to catch some.