Tarsnap outage postmortem (mail.tarsnap.com)
Ok, I really wasn't expecting this to land at the top of HN. I'd love to stick around to answer any questions people have, but it's 10PM and my toddler decided to go to bed at 5PM... so if I'm lucky I can get about 4 hours of sleep before she decides that it's time to get up. I'll check in and answer questions in the morning.
God bless you Colin, but reading this, it appears you're the only one in charge of the infrastructure for this service. I'm glad you're clear about no SLA, but this seems like a big liability between me and my backups.
I know you didn’t ask me — but I don’t think Colin can answer differently other than saying that he is training a family member or friend to take over if needed.
Here’s more https://news.ycombinator.com/item?id=7514753 this is also linked there http://mail.tarsnap.com/tarsnap-users/msg00846.html
Very old threads but I am not sure much has changed there https://www.tarsnap.com/contact.html
Why would you use it instead of restic? Well, for pricing in pico dollars ;-)
and because it has a functional GUI with a tiny system footprint, and there really aren't many such solutions out there.
Hence the toddler.
(FWIW, S3 can be somewhat straightforwardly configured so that old data is effectively immutable. Google Cloud Storage’s similarly named versioning feature appears to be far weaker.)
Wasabi does $7/TB with no ingress/egress fees. My NAS is set up to rclone to it about once a day and I've yet to have any problems
A lot of 'lessons learned' analysis boils down to this: in order to prevent a recurrence of X, we introduced complex subsystem Y, the unexpected effects of which you can read about in our next post-mortem.
"Our simple model that fails gracefully did so and was simple to recover"
Redundancies and failsafes are not free - they add complexity.
99.9% availability fails in boring ways.
99.999% availability fails in fascinating ways.
The main lesson learned was "rehearse this process at least once a year".
> at the present time it is possible — but quite unlikely — that a hardware failure would result in the Tarsnap service becoming unavailable until a new EC2 instance can be launched and the Tarsnap server code can be restarted ... So far such an outage has never occurred
I read the postmortem as saying that a hardware failure did cause it to be unavailable, and that the code could not simply be restarted; a new server had to be built.
If that is correct, as well as writing up learning (as Jacques mentions) this page could be updated with outage information -- or even info on changes to reduce risk of repetition.
For what it's worth, one outage of a single day in fifteen years is impressive. If my ballpark math is correct, that's about 99.98% uptime, i.e. three nines.
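For anyone who wants to check the ballpark, here is a minimal sketch of the arithmetic in Python. It assumes a single outage of 24 hours (or the 26 hours mentioned elsewhere in the thread) over 15 years of 365.25 days each; the function names are made up for illustration. Both inputs land near 99.98%, i.e. three nines. Four nines would allow only about 13 hours of downtime over the same period.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length including leap days


def availability_percent(outage_hours: float, years: float) -> float:
    """Percentage of the period during which the service was up."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1.0 - outage_hours / total_hours)


def whole_nines(percent: float) -> int:
    """Number of complete leading nines, e.g. 99.95 -> 3, 99.995 -> 4."""
    unavailability = 1.0 - percent / 100.0
    # Small epsilon guards against floating-point error at exact boundaries.
    return math.floor(-math.log10(unavailability) + 1e-9)


for hours in (24, 26):
    pct = availability_percent(hours, 15)
    print(f"{hours}h outage in 15 years: {pct:.3f}% ({whole_nines(pct)} nines)")
```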
Have been having some luck reading https://www.amazon.com/No-Cry-Sleep-Solution-Toddlers-Presch... - available everywhere libraries (blockbuster for books!) are found.
I too had a few EC2 instances go down with signs of being severed from the EBS in the recent couple of weeks; mine were in eu-west.
- Set up nightly automatic snapshots of EBS volumes (this is supported natively now in AWS under lifecycle manager).
- Use EBS volumes of the new GP3 type, and perhaps use provisioned IOPS.
- Set up an auto-scaling group with automatic failover. Of course this increases cost, but it should be able to fail over automatically to a standby EC2 instance (assuming all the code works automatically, which the blog post indicates is not currently the case).
What prevents you from distributing load among other regions?
(Also: did you ever think about abandoning AWS?)
- The use of “I” begs the question: what’s the “bus factor” of Tarsnap? If you were unavailable, temporarily or permanently, what are the contingency plans?
- Will you be making any other changes to improve the recovery time, or did the system mostly function as designed? For example having a hot spare central server?
This speaks volumes to me about what kind of person Percival is; that credit would appear to be generously on the "make customer whole" side of the fence, and unlike the major cloud providers, he didn't make each customer come and individually grovel for it. And a clearly written, technical, detailed PM, too. This is how it ought to be done, and done everywhere. Thanks for being a beacon of light in the dark.
That's well put.
It makes me very happy to live in a world where tarsnap exists and is priced in picodollars.
Also, I would suggest thinking about the business long term and seeing whether you can increase the revenue enough to hire a part-timer who can be of great help in case a similar event happens.
We are also a small cloud solution provider (we focus on ML APIs) and over the years it has become clear to us that when you use cloud hardware (either dedicated or virtual), outages happen from time to time. RAM, HDDs or other parts of the hardware can malfunction at any time. So this is something which 100% needs to be taken into consideration when running any high-availability online service over the long term.
For example in both trains and cars, thanks to anti-lock braking, the correct way to stop the vehicle ASAP is to brake just like normal but as hard as you can, the computers will automatically solve the much trickier problem of turning your input into maximum deliverable braking force by periodically releasing brakes on sticking wheels.
If you run a fire drill, it's surprisingly difficult to get employees to use fire doors that they're used to finding alarmed and unusable. Even though intellectually they know that, say, the door at the bottom of the stairwell is a fire door, with crash bars and leads directly to the outside world, and this is a fire drill, they are likely to (for example) exit on a higher floor and go through a chokepoint lobby, as they would normally, instead of following this safer path that is emergency only. Sadly it is hard to fix buildings after construction if they were designed with such "unused" emergency exits.
For a backup process, having machine-image restores be a service that is sometimes, though not constantly, used for some other reason is a good way to become comfortable with how it works, that it works, etc. At work, for example, we routinely test upgrades on test servers restored from a recent backup: restore serviceA to testA, apply the upgrade, discover that the upgrade completely ruins the service, throw testA away and report that this upgrade is garbage. In the process we gain confidence in the restore procedure. When things go badly wrong, infrastructure people aren't trying to recall something they only ever did in a drill; they are very used to this procedure because they do it "all the time".
Rehearsing this annually is definitely going to be a high priority.
Recommend writing a TLA+ model to catch stuff like this
(People here asking about the low Bus Factor: you don't keep your backups in one service/location, eh? You use Tarsnap and Restic with Backblaze, Rsync.net, S3, etc. right? "Backups are a tax you pay for the luxury of restore.")
I have been using Tarsnap for a decade, and not only have there been minimal availability issues, there have been almost no issues of any kind that I can recall.
>> So far such an outage has never occurred; but over time Tarsnap will become more tolerant of failures in order to minimize the probability that such an outage occurs in the future.
Neglecting the pricing, does Tarsnap have any advantage over Restic?
Restic also deduplicates, using little data.
I mean.. you could purchase a cheaper service and also donate to various efforts. Bonus: Then you'd also be able to pick those efforts.
Colin, could the website be updated to the 2010s? :P
Far be it from me to tell anyone how to write software, but why build a database on top of S3 when you can just chuck the metadata into RDS with however much replication you want?
The backups themselves should be in S3, but using S3 as a NoSQL append-only database seems unwise.
This would benefit from being further from the metal.
I personally would go with the simpler solution because in my experience you need an awful lot of extra complexity before you get to the same level of reliability that you have with the simpler system. Most complexity is just making things worse.
You can see this clearly when it comes to clustering servers. A single server with a solid power supply and network hookup will be more reliable than any attempt at making that service redundant until you get to something like 5x more costly and complex. Then maybe you'll have the same MTBF as you had with the single server. Beyond that you can have actual improvements. YMMV and you may be able to get better reliability at the same level of performance in some cases but on average it always first gets far more complex, costly and fragile before you see any real improvements.
I strongly believe that the best path to real reliability is simplicity (which is: as simple as possible) and good backups. For stuff that needs to be available 24x7 and 365 days per year this limits your choices in available technologies considerably.
This is Colin's job. Colin has his name attached to it. It's really important to Colin.
You're not going to get the same kind of service from BigBackupCorp. Their employees are replaceable, their management is replaceable, and to be honest, you as a customer are replaceable, if they decide to move in a different direction and become BigFlowerArrangementShippingCorp.
The neat thing about a small business is that it runs entirely on its own profits. There are no stock price games or VC jiggery-pokery or anything like that. If it's a profitable business, there will be somebody to come along and take it over and make it their job with their name attached to it. I think the open Internet benefits a lot from this sort of thing.
They should take separate buses to ______.
Better to have multiple layers of backup, of which tarsnap and friends are only one, and verify regularly.
Tarsnap makes a lot of sense when you benefit from the encryption and (especially) de-duplication features that it offers. For me, all of my most important personal and business data, from multiple decades, compresses-and-deduplicates down to around 6GiB. Considering the high value of the data I store in it, tarsnap's pricing actually feels absurdly low.
Can you provide more detail why you think so? I don't believe there is any use case in which tarsnap makes sense, other than maybe some Plan-C backup solution which you fall back on in the highly unlikely event that neither Plan-A nor Plan-B worked.
Concretely, what benefits does tarsnap offer over restic or borg in combination with rsync.net, to make up for the substantial downsides (such as insanely slow restore, complete lack of wetware redundancy or being written in C[1])?
Tarsnap : $0.25 / GB storage, $0.25 / GB bandwidth cost
rsync.net : $0.015 / GB storage, no bandwidth cost
s3 : $0.023 / GB storage, some complicated bandwidth pricing
If tarsnap is built on top of s3, they're charging 10 times for the storage cost. Easy money from the uninformed?
Tarsnap is a wonderful piece of software. You're paying for that.
That said, is the value of "Tarsnap" worth the price difference from "Borg+rsync.net"? (Or Restic, I've been meaning to look into Restic). I'm not so sure. These days I'm a customer of rsync.net, not of Tarsnap.
But I still firmly disagree with the "Colin's just exploiting the uninformed" angle.
Geez, that's really not improving the comparison with Tarsnap.
Backblaze: $0.005 / GB storage, $0.01 / GB download.
I don't think so. Anyone who can use this software I'm sure knows what other options exist.
The 120Gb is the contents of my OneDrive and local repository trees. This is everything I've ever done that I want to keep and is approximately 115Gb of photos and not a lot else!
That's pretty much any SaaS... look at the various log or metrics gathering solution, where you pay serious multipliers of what would cost to run same software on your own instance.
I've been using Tarsnap for 10+ years. There's some Linux stuff getting backed up, configs and such. It costs next to nothing for this kind of usage.
While on the price, patio11 (Patrick) has written an article about tarsnap’s issues more than nine years ago (April 2014). One of the suggestions was to raise prices, IIRC. It’s a long post, but you can read it [1] and the HN post [2] from that time.
[1] In case of an emergency, you will always be able to get back your data from tarsnap at a blazing rate of 50kB/s https://github.com/Tarsnap/tarsnap/issues/333.
How many of the world's best and brightest are doing all sorts of busywork? At least Colin has some time to do whatever he wants to do while running tarsnap.
This is entirely Safari's fault for not having good compatibility with a common existing webpage format.
Anyway, if you're the intended audience (someone using tarsnap), you also received a copy to your email address, where you can read the text with your email reader of choice.
<p> is far more appropriate
That isn’t apple’s problem, nor mine.
It’s not impossible, and likely just a fault of whatever list thing is used, but it could be better, and it’s nice if people let you know as such right?
i assumed the parent did not know how to do that, i tried locally and it seemed to work, but i did not pay attention to the text
original:
on the left side of the url input field you'll find "AA" (the first smaller than the second), tap that
then, near the bottom of the pop-up menu you'll have "Show Reader", tap that
if you're not happy with the text as displayed then, you can go back to the "AA" menu and change the options
On a less technical note: Always avoid the fancy option when it makes sense. (From a veteran of building and maintaining large scale high performance high availability systems)
S3 is not the problem here. The problem is building a database on top of S3, and having to reimplement all the consistency, atomicity, transactions etc. on top.
>no thought to a schema, no migrations to manage
There is, in fact, always a schema. Some people choose to ignore it's there, to their detriment.
>Always avoid the fancy option when it makes sense.
It's not the 1980s. Postgres is not fancy, and Greenspunning it is a mistake.
>Almost guaranteed it's cheaper.
Cheaper than a 26-hour outage?
Cost and reliability?
* Using S3 as a simple database is generally going to be much cheaper than RDS.
* If you turn on point in time restore, then losing data stored in S3 is not a possibility worth worrying about on a practical level for most people. RDS replication is easy enough to use, but adds more cost and a little bit of extra infra complexity.
It's a bad trade. Thousands of hours of a high human capital computer scientist vs. a few tens of dollars a month for RDS.
>Reliability
Empirically false: none of this would have happened if Tarsnap used Postgres instead of a home-spun database.
There's client libraries like Delta Lake that implement ACID on S3.
Much of the Grafana stack uses S3 for storage (Mimir/metrics, Loki/logs, Tempo/traces).
That said, I'm not sure about the implementation Tarsnap uses--if it's completely ad-hoc or based off other patterns/libraries.
How, exactly, is that a good thing?
Yes, I hired him in 2015 IIRC. If you look at tarsnap's GitHub you'll see a lot of commits from gperciva.
He mentored me so that I was able to contribute and eventually help maintain and manage LilyPond's Documentation and Patch Testing in a meaningful and rewarding way - all without any programming experience.
1. Key person gets hit by bus
2. You see the black bar on Hacker News and learn the sad news
3. You go download all your data from the service, which is still up because there is no bus access to data centers.
4. You feel like a jerk for all your creepy "hit by bus" talk.
5. A few weeks later, some VC-funded operation with multiple employees you depended on disappears overnight without a trace.
Just about this step... you are supposed to have it already. You just have to find another service and start using it.
It won't, though, because of the points mentioned by the post you're replying to. It's been 15 years; tarsnap is as popular as it's going to get.
I don't find that that logically follows from making bank. Not everyone who makes bank is a positive influence.
Tarsnap does provide value, even if I think it's less than its cost: I'm just commenting on the general case that making money would mean you're providing good value
We also have .edu / student / nonprofit discounts. Email us.
Finally, Debian and FreeBSD project members get free accounts. See the committers handbook, etc., for details.
[1] Whenever we lower our prices, we increase quota on existing customers to "normalize" them to the new price/GB. If you do nothing, your rsync.net account just grows over time due to this.
Embedding the log-structured representation of user data in Postgres would increase complexity and overhead without offering significant resiliency or recoverability advantages — in fact, quite the opposite.
I was only edgy about it because when it takes 36h it blocks the next daily backup, and I wondered whether that was going to get worse (it hasn't).
In general, there's an unavoidable trade-off between creating many small packs (harder on metadata throughout the system, inside restic and on the backing store, but more efficient to prune) versus creating big packs, which are easier on the metadata but might incur a big repack cost.
I guess a bit more intelligent repacking could avoid some of that cost by packing stuff together that might be more likely to get pruned together.
Your comment sounds like tarsnap is more secure (in terms of longevity) than rsync.net. Is this true? If yes, why?
Genuine question, because I'm using rsync.net for my critical stuff and would gladly move to tarsnap if appropriate.
First is that my usage of Tarsnap pre-dates my usage of rsync.net so it's been the primary backup of my home directory since ~2010. I haven't felt a need to change it and the 2 occasions I needed to restore it everything worked perfectly. i.e don't fix what isn't broken.
The second is that while I can in theory restore from rsync.net I actually never have... this is more a testament to the reliability of ZFS though, I guess, and local snapshots have always been enough. That said, the convenience of send/recv is sort of awesome.
Lastly is that I don't use ZFS on my client machines. If I did I would probably consider rsync.net for everything.
So it's not really an explicit "I think one is more secure or durable than the other"; it's that Tarsnap has fulfilled my DR needs successfully for a long time and I have come to trust it to do so in the future.
The airline industry is as safe as it is because every accident gets thoroughly investigated with detailed reports ("post-mortems") including what to do differently going forward. These are taken as gospel among all players in the industry and as a result, you very rarely see two different accidents caused by the same thing anymore.
I'm sorry if I misconstrued your meaning, but I am flattered that you think there are things beneath me!
However, if the user clicked the "reader mode" button, that's a good sign the user thinks this is reflowable text. Firefox's reader mode figures this out. Safari's doesn't.
Also it’s really simple and does what it says it does, nothing more, nothing less. In today’s everything convoluted and bloated world this is a luxury imho. The GUI app is also quite good and functional. Support is prompt (that is if you need it).
You don’t have to worry about files being deleted just because your machine didn’t connect or back up for some time even if you keep paying (hello Backblaze) etc. I mean there’s no circus, melodrama, and cliffhangers involved.
I personally would never use it to back up my entire laptop, due to price alone. But I have a subset of VVI files and Tarsnap is one of more than one backups for those files. So for that use-case Tarsnap is perfect for me, so far.
This "uninformed mom-and-pop" is potentially compiling the client application from source, but can't do basic math to compare tarsnap's pricing to the top 20 or so competitors that rank above tarsnap in SEO?
For example, my team has people across the world for HW bringup, so we can't allow our code hosting or CI to be down for more than a few hours. Of course, backups have different uptime requirements, but as for everything, it's a tradeoff between features, of which an SLA is one.
Tarsnap's features are granularity of cost, reliability of storage, and encryption, but not 99.999% uptime.
Meeeeh, my ISP cut around 100+ fiber connections in my town and spent three weeks fixing it. My neighbor has a business line; there's an SLA on those that, among other things, requires them to reestablish his connection within 3 - 5 hours. It took them over 500 hours, so that SLA is useless for anything but forcing compensation.
The problem is that the SLA should give an indication of available resources, but in reality it's mostly a contractual thing for most companies: they'll pay the "fine" or refund a customer if they fail to hit their SLA and that's about it. Tarsnap most likely has better availability than many midsize competitors simply because it's just one person who really cares about it. Doesn't help if he's hit by a bus though.
What is this mythical unimportant data that people still want to back up?
Subjectively you may feel that your data is super important, but objectively it probably isn't.
When people talk about 'super important' (totally a technical term), I think of things like DB backups in software companies, backups of financial reporting for firms, etc. Not your tax return from 2008.
These are examples of data that I could easily live without. Where losing it would either be a matter of re-doing old work, or just forgetting about old and minor things.
I have lots of stuff like this. Often it is easier to just back up an entire folder than go through sub/sub folders separating stuff into: important, not very important. Storage costs are low enough to just backup everything (almost). Also, one often doesn't know what may be important/useful in future. For example a couple of years ago I had this huge buildroot system (600gb) to build firmware images for a single board computer I spent quite a while to put together. The project I was doing it for got cancelled so I had no need to keep it. Still I wish I did, as I'd love to be able to tinker with it now, but 600gb is not a trivial amount to store so it got deleted. Most of this data was pulled from various online resources that don't exist anymore too.
What's the moral of my story? If you have a fast internet connection (I don't), back up "everything" to the cloud. Then find the "really important stuff" like the pictures of your children etc. and back it up again to a different cloud.
If you're in the middle of nowhere on a slow LTE connection like me, building a NAS box is not a bad idea for backups.
Well I used to until macOS kinda went off the rails a bit. Now it’s mostly an exercise in running my arch script for my thinkpad.
Being stuck between operating systems is kinda a mess though, makes backup and file sync in general really hard. But everyone’s gotta have their own cloud, right?!
Why can’t I just put a cloud under my bed and forget about it?
Just buy a Synology NAS. Keep default settings, set up a few user accounts, tweak a few things here and there, enable encryption, install Active Backup on all your devices, done.
There are many cheaper/more open options for self-owned NAS storage, but unlike a Synology they're definitely a far cry from "and forget about it".
I’m a native English speaker but sometimes I swear I’m losing grasp on communication in the Internet age and am sincerely trying to understand this all.
For example, a code review on a mailing list can only make sense with the linebreaks and spacing preserved.
However, as you knew to try, there is "reader mode", which is meant to heuristically ignore the exact html in order to display textual content.
Firefox's reader mode has no trouble figuring out that this is a block of text that can be reflowed.
Safari's heuristics clearly fall short on one of the more common kinds of textual blurbs you might want to reader-view-ize.
Seems like a safari problem to me.
It was absolutely unreadable on mobile Firefox (Android), initially didn't even think to use the reader mode, which indeed did help make it readable!
I think this is the first time I've ever actually needed that functionality, never quite got it beforehand. Thanks for the suggestion!
------- -------
| foo | ---> | bar |
------- -------
It's an older technology but it checks out.
Well, it can't be that ba..
$0.25 x 2000 = $500
Yikes. And this is without BW costs.
At $500/M you can just rent a dedicated physical server with a lot of HDDs and still have money left for your favourite pumpkin latte.
For comparison rsync.net says it's $0.015 per GB/Mo, for 2TBs that's $30/m and no BW costs.
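The comparison above can be laid out as a rough sketch in Python. The per-GB prices are simply the figures quoted in this thread (not verified against current price lists), only storage is counted, and bandwidth and deduplication are ignored; the dictionary and function names are illustrative.

```python
# Storage prices in USD per GB per month, as quoted by commenters in this thread.
PRICE_PER_GB_MONTH = {
    "tarsnap": 0.25,
    "s3": 0.023,
    "rsync.net": 0.015,
    "backblaze": 0.005,
}


def monthly_storage_cost(gigabytes: float, price_per_gb: float) -> float:
    """Storage-only cost for one month; ignores bandwidth and dedup savings."""
    return round(gigabytes * price_per_gb, 2)


stored_gb = 2000  # the 2 TB example from the comment above
for provider, price in PRICE_PER_GB_MONTH.items():
    print(f"{provider:>10}: ${monthly_storage_cost(stored_gb, price):.2f}/month")
```

Note that this treats 2000 GB as the billed amount everywhere; a provider that bills only for deduplicated data would be charged on a smaller number.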
[0] https://www.scaleway.com/en/object-storage
Now that I think about it... some kind of micro-distributed backup server (throw on few of your machines, auto-replicate between) would be a neat project...
Just slap rsync/syncthing to the backup dir.
- your comment was a very valid question but rather quip-like, offhanded, seemed off etc etc. I mean something like that…
- Tarsnap is an hn darling
If I have to pick one I think it’s the latter :)
He’s brought far more value to the community than that, of course.
If I want to store my 100GB of data now, and I want to have it stored for a year, I want to pay for that year's worth of storage of 100GB of data now and not worry about any money or account problems for that bit of data.
Which is probably okay if you want to pivot from a geek-ish service to one that geeks don't use, of course. Does the owner want that?
Oh dear. It’s an HN thing. I have had brushes with it only once or twice across various accounts across years but it’s very much an HN thing.
Whenever you see an utterly useless quip (or sometimes even name calling or offensive words) being heavily upvoted you should know that some alpha HNer has arrived on the scene :)
But to be honest I have never seen the author of Tarsnap engage in such privileged gentlemanly d-baggery. He is quite cool, as they say.
Anyway I just ignore it and move on. But again OP could have worded the question better. I mean no matter how good or bad you want to feel about it — it’s just a vc run anon forum and just another forum.
Early in my career, I became the second person able to support and operate a system that was public facing and responsible for billions of dollars of activity that mattered to many individuals and stakeholders. The entire team retired over a period of six months, after giving the folks in charge a year or more notice. After about 12 weeks, I was the sole guy, training 4-5 new people.
We’re all probably using a service like this. As demonstrated by Twitter, well engineered systems can persist, even without proper care and feeding, until they don’t.
That is to say, if Tarsnap is the only place you're keeping sensitive/important data, then you're "not doing it right" as a backup. Things happen... your hard drive can die suddenly, and a data center can burst into flames, all on the same day.
This is why house doors open in but business doors have to open out - if there’s a crush against a fire door it opens.
You even see this in aviation, where everything is checklisted; the pilots will first stabilize the plane in an emergency and then run the checklist. And small planes that behave unexpectedly always have higher crash rates.
This doesn't work for normal people because normal people don't drill non-normal events until the response is instinctive.
> It's a bad trade.
Maybe. But that's the reason. You never acknowledged that advantage in your question so it needed to be emphasized
The opportunity cost of building your own database is 10,000x the cost of running RDS for a year.
* RDS costs obviously scale linearly with ongoing time and probably scale linearly with the total amount of data being backed up. So depending on the revenue of the business, these extra costs could easily end up outweighing the (notional) cost of the time saved, which is mostly a one-off expense.
* The cost of a software engineer's time is notional in the context of a one-person business. The author of Tarsnap isn't going to be able to employ fewer than zero additional software engineers to maintain Tarsnap because of the time saved by using RDS.
Optimizing your system or upgrading it just becomes a "trash boot drive and reinstall" operation, applied without a care in the world.
That’s barely possible to read on a high dpi screen, but fairly uncomfortable.
Dumb is `aws s3 cp` and being done in 5 minutes.
It's just PITA to add another instance.
To be fair, they deserve it a bit as they went up in flames twice.
Indeed, after the first fire, the geniuses over there collected all the UPS and batteries they could find from the DC and stored them all in a pile in a closed container... where they predictably bulged, failed, sparked and eventually triggered another fire after a couple days.
But you should NEVER design a system that requires normal people to drill non-normal events; even planes have been redesigned to "fix" problems where the pilot had to do something unintuitive or unexpected, because eventually it WILL catch up to you.
SLI - Service Level Indicators - Metrics ie Latency of each request / response cycle
SLO - Service Level Objective - What threshold we are aiming for - 10 ms from request to response averaged over 1 hour period.
SLA - Service Level Agreement - contract with the customer covering what happens if we breach (credits given, put the CTO in stocks and throw eggs at him etc)
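To make the distinction concrete, here is a minimal sketch in Python: the SLI is the measured number, the SLO is the threshold it is compared against, and the SLA is whatever the contract says happens when the comparison fails (not modelled here). The 10 ms threshold comes from the example above; the latency samples and function names are hypothetical.

```python
from statistics import mean

SLO_MEAN_LATENCY_MS = 10.0  # objective from the example: 10 ms averaged over an hour


def mean_latency_sli(latencies_ms: list[float]) -> float:
    """SLI: the measured indicator, here mean request/response latency."""
    return mean(latencies_ms)


def meets_slo(latencies_ms: list[float],
              threshold_ms: float = SLO_MEAN_LATENCY_MS) -> bool:
    """SLO check: is the indicator within the objective for this window?"""
    return mean_latency_sli(latencies_ms) <= threshold_ms


# Hypothetical samples standing in for one hour of requests.
sample_hour = [4.2, 7.9, 12.5, 6.1, 9.3]
print("SLI (mean ms):", round(mean_latency_sli(sample_hour), 2))
print("SLO met:", meets_slo(sample_hour))
```

A single slow request (12.5 ms here) does not breach the objective, because the SLO is defined on the hourly average rather than on each request.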
Instead we get refunded some pitiful amount when our business is seriously disrupted for an extended period of time.
This ability is critical to prevent a compromised system from having its data wiped and having all backups wiped as well.
I haven't been able to figure out how to do this in any other system. But if someone has a tutorial, I am all ears.
I've been musing on this subject all afternoon. I'm a user of Tarsnap, and I do find it expensive, in the sense that I would prefer to back up larger amounts of data for less money. At the moment I back up photos separately from Tarsnap and in an ad hoc way.
But I still cannot figure out a way to get all the benefits I get from Tarsnap from any other software solution.
* Must be usable under Nixos.
* Backups must be asymmetrically encrypted so that backups can be automated, yet a compromise of the system cannot immediately gain read authorization to archived data.
* Backups must be append-only without further credentials, or otherwise prevent a compromised system from being able to delete existing archives.
* Deduplication between archives while still allowing archives to independently be deleted.
Using the ZFS snapshot functionality with rsync.net, for example, with Duplicity comes close. However, as I recall, duplicity wants to do regular (typically monthly) full backups and then incremental backups from there. You cannot remove these full backups without deleting the entire month's worth of backups, and because the full backups are independently encrypted, there is (of course) no deduplication between full snapshots, even though the data is still likely largely the same. And because the snapshots are encrypted, it is impossible for the rsync.net storage to see or even know that large parts of the encrypted data are identical.
AFAICT there is really nothing else that does what Tarsnap does.
* Create an S3 bucket and enable versioning
* Create a new user and give it only s3:PutObject on your new bucket
* Create an auth keypair for that user and put it on your server
Now any server compromise that gets those keys can only add new data to your backup bucket, and can't read, overwrite, or delete any previous backup.
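For anyone wanting to try the recipe above, here is a rough sketch of what the write-only policy could look like, built as a Python dict and dumped to JSON. The bucket name and the `Sid` are made up for illustration, and this is a sketch of the idea rather than a vetted policy; versioning itself is a bucket setting and is not part of this document.

```python
import json

BUCKET = "example-backup-bucket"  # hypothetical bucket name, not from the thread


def append_only_policy(bucket: str) -> dict:
    """Build an IAM policy document granting only s3:PutObject on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteOnlyBackups",  # arbitrary label
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Objects inside the bucket; no list, get, or delete actions are granted.
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


policy = append_only_policy(BUCKET)
print(json.dumps(policy, indent=2))
```

With versioning enabled on the bucket, a PutObject to an existing key adds a new version rather than replacing the old one, and removing old versions needs permissions this policy does not grant.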
There's no dedup, so that could be a deal-breaker.
There's also no real encryption (though that shouldn't be too hard to add I guess). I don't really see the gain though. Anyone who compromises the server keys is blocked from reading by AWS permissions. Granted, that's not quite as reliable as good crypto for blocking reading, but on the deleting side, there's never going to be anything but the auth system of whatever solution you're using to block that.
I get that there's some applications out there where preventing data exfiltration is important enough to need strong crypto (though is that really important when we're talking about full compromise of your server, which gets the attacker direct access to the data anyways?), but I decided that the risk of failing to implement properly or full data loss due to losing the keys or them being corrupted wasn't worth the risk of blocking somebody who somehow compromised the AWS account security from being able to read backup data.
My main machine is currently storing 1.6 TB (compressed) of total archives with tarsnap, but only 33 GB (compressed) of unique data within those archives. So if S3 is 50x cheaper, then not having deduplication would be a wash.
However other comments here suggest that S3 is only 10x cheaper.
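The break-even reasoning above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch only: it assumes 1.6 TB means 1600 GB, compares storage prices alone (no bandwidth or request charges), and treats the 50x and 10x price ratios as the hypothetical figures quoted in the thread.

```python
total_archived_gb = 1600.0  # 1.6 TB of archives, assuming decimal units
unique_gb = 33.0            # unique data after deduplication

# How many times larger the undeduplicated archives are than the unique data.
dedup_ratio = total_archived_gb / unique_gb


def relative_cost(price_ratio: float) -> float:
    """Cost of storing everything undeduplicated on the cheaper store,
    relative to storing only unique data on tarsnap.
    A value above 1 means the cheaper store ends up costing more."""
    return dedup_ratio / price_ratio


print(f"dedup ratio: {dedup_ratio:.1f}x")
print(f"relative cost if 50x cheaper: {relative_cost(50):.2f}")
print(f"relative cost if 10x cheaper: {relative_cost(10):.2f}")
```

Under these assumptions the dedup ratio is a little under 50, so a 50x price gap is roughly a wash while a 10x gap leaves the undeduplicated store several times more expensive.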
Edit: just saw your sibling / reply-to-self comment. This setup would fulfill the requirements you posted, or at least I would assume that restic runs under (or compiles for) your nix OS. It doesn't use asymmetric encryption for this but the goal of append-only is there
> because the snapshots are encrypted, it is impossible for the rsync.net storage to see or even know that large parts of the encrypted data is identical
If they don't see a large amount of data incoming, they'll know large parts of the data are identical (or removed, I suppose). Hiding traffic volumes is fundamentally only possible by introducing dummy data
The thing is that tarsnap deduplicates over arbitrarily long time periods, letting me make arbitrarily long staggered sequences of retained archives.
Perhaps I should really reconsider if I really need such long lived archives, but it is hard to bring myself to drop them.
I learned in a small car from the same brand as my father's larger car, so the controls are in the same place and the symbols on stuff are identical; all that was different once I had a license and borrowed dad's car was that it's longer and has more power.
It also probably shouldn't be legal for me to drive today, but it is. I learned 25 years ago, and I haven't driven anything in over a decade, so a rational system would say nah, you're too rusty, get a refresher course, but there's no mandate for that.
Most of those problems are moot if you're only ever writing from a single head node. If all your data is strictly ordered and you have no meaningful concurrency, this is a far, far simpler system.
Complex is Greenspunning a database and having it blow up in your face and cause a twenty-six hour outage. You never hear about such things with Postgres because Postgres is rock-solid.
But to your point, if your system requires less than a thousand lines of code to open a file, do basic parsing and processing (which no data storage system is going to do anyway), and write the output to another file, I personally can't say that Postgres or MySQL or any other solution is really worth the effort/cost to build and maintain. In the system being discussed, the benefits of an RDBMS simply don't matter: any strongly consistent key-value store would work.
> Complex is Greenspunning a database and having it blow up in your face and cause a twenty-six hour outage.
S3 didn't cause the outage, and from the look of it, neither did the code that processes the files. It was an application logic problem which caused issues during the restore process, and this would have been an issue regardless.
You could make an argument that the recovery being slower than it could have been was a problem, but it's wild to say that and imply that traditional databases have no performance cliffs. Especially when dealing with corruption or data recovery. Raw file storage will never have a Postgres transaction id wraparound incident (see: Sentry outage for most of a day in 2015, MailChimp/Mandrill for over a day in 2019) or have to rebuild a critical index.
In this case there was a hard coded concurrency limit with S3 of 250 outstanding requests. Bumping that up to 3000 would have been easy and reasonable (S3 rate limits at 5000). How confident would you be that your database can performantly handle a backfill during recovery? Have you provisioned enough iops? Are you running an RDS instance with only a burstable vCPU limit? To say Postgres is "rock-solid" (and make no mistake, I am a Postgres fanboy) dismisses the many and varying ways that it can fail in unusual and surprising ways.
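To make the concurrency cap concrete, here is a minimal sketch of a bounded-concurrency fetch loop using `asyncio.Semaphore`. It is not Tarsnap's code: `fetch_object` is a hypothetical stand-in that sleeps instead of talking to S3, and the only number taken from the discussion is the limit of 250.

```python
import asyncio

MAX_OUTSTANDING = 250  # the hard-coded limit discussed above; change it to experiment


async def fetch_object(key: str) -> str:
    """Hypothetical stand-in for one S3 GET; sleeps instead of doing network I/O."""
    await asyncio.sleep(0.001)
    return key


async def fetch_all(keys, limit=MAX_OUTSTANDING):
    """Fetch every key while never having more than `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded(key):
        nonlocal in_flight, peak
        async with sem:  # blocks once `limit` requests are outstanding
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch_object(key)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(bounded(k) for k in keys))
    return results, peak


results, peak = asyncio.run(fetch_all([f"obj-{i}" for i in range(1000)]))
print(len(results), "objects fetched, peak in flight:", peak)
```

With a cap like this, recovery throughput is bounded by roughly the limit divided by per-request latency, which is why the size of that one constant matters so much during a large restore.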
Did you miss where I said it's read-after-write strongly consistent?
AFAICT, going by HN, there's been roughly 30 hours of non-availability over 15 years. RDS didn't even support Postgres when Tarsnap was released.
EDIT: Tarsnap predates RDS.
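Taking the rough figures above at face value (about 30 hours of downtime over 15 years, both estimates from the comment rather than measured numbers), the implied availability works out as follows.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, including leap years

downtime_hours = 30.0  # rough estimate quoted above
years = 15.0           # rough service lifetime quoted above

total_hours = years * HOURS_PER_YEAR
availability = 1.0 - downtime_hours / total_hours

print(f"total hours: {total_hours:.0f}")
print(f"availability: {availability * 100:.3f}%")
```

That lands between "three nines" and "four nines", assuming the 30-hour estimate is in the right ballpark.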
Ugh.
Try picking and choosing specific file types or file extensions from filesystems holding thousands of files.
I ended up having to cobble together some god-awful pre-process PowerShell with multiple pipes just because restic can't reliably grep on Windows.
:(
That is news to me. I back up almost a million files spread across 4 Windows devices, with heavy use of --files-from and --iexclude and it seems to work. What am I missing?
I agree that restic filtering options are pretty limited. Too limited, really. But what's there seems to work?
With regards to duplicity, Tarsnap does full deduplication across all backups for any given "machine", while still letting you independently remove any snapshots you like. i.e. no special "full snapshot" that must always be kept around, and no need for multiple full snapshots that have no deduplication between them.
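As a toy illustration of how deduplication across archives can coexist with deleting any archive independently, here is a reference-counted chunk store. It is only a sketch of the general idea, not how Tarsnap is implemented: it uses fixed-size chunks, keeps everything in memory, and does no encryption, and the class and method names are invented for the example.

```python
import hashlib


class ChunkStore:
    """Toy deduplicating store: archives share chunks, tracked by reference counts."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}    # digest -> chunk bytes (stored once)
        self.refcount = {}  # digest -> number of references from archives
        self.archives = {}  # archive name -> ordered list of digests

    def create(self, name: str, data: bytes) -> None:
        if name in self.archives:
            raise ValueError(f"archive {name!r} already exists")
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk  # only new chunks cost storage
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        self.archives[name] = digests

    def delete(self, name: str) -> None:
        # Drop one archive; a chunk is freed only when no archive references it.
        for digest in self.archives.pop(name):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.chunks[digest]

    def read(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.archives[name])


store = ChunkStore()
store.create("monday", b"aaaabbbbcccc")
store.create("tuesday", b"aaaabbbbdddd")  # shares two chunks with monday
chunks_with_both = len(store.chunks)
store.delete("monday")                    # no "full backup" has to be kept
print(chunks_with_both, len(store.chunks), store.read("tuesday"))
```

Every archive is just a list of chunk references, so there is no privileged "full" snapshot; deleting the oldest archive only frees the chunks nothing else uses.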
There are services like rsync.net that support borg at a lower price. Borgbase is one of them. I haven’t used either of these.
And rsync.net is even one of them!
"Special "borg accounts" are available at a very deep discount for technically proficient users." -- https://www.rsync.net/products/borg.html
...hrm, it seems they didn't update that page with last year's price drop. https://web.archive.org/web/20220319135035/https://www.rsync... It used to be a deep discount, now it's the same for <100TB. I wonder if they did drop the Borg prices too and just forgot to update that page?
Running rsync to the target and forgetting about it is quite easy, though I admit rsync.net's deal is getting worse these days, with minimum usage requirements imposed here and there.
Plus "written by cperciva and heavily battle tested by Serious Sysadmins" is a feature I couldn't recreate myself - notice that while there was an outage, part of the reason for it taking a while was a conscious choice to take a much longer path to resolution than bringing up the previous server in the name of paranoia. Paranoia about data corruption is a nice thing to have in a backup system and something I'm happily willing to trade-off uptime for.
However: For backups of bulk data then, yes, it's going to be relatively expensive. I wouldn't put e.g. my media backups on tarsnap, but "use tarsnap for your git repositories and other high value data, and something else for the rest" is both perfectly doable and an approach I suspect cperciva himself would endorse.
As an Actual Serious Sysadmin that Actually Manages Big Systems for a Living, that screams lack of preparation more than anything else.
Yes you should be careful but you should also have procedures in place and know the system well enough to trust it. And the fact is that the "boring" architecture of an RDS DB instead of that S3 database abomination thing would just start right up if the master DB server failed.
It honestly looks like a trap many intelligent people fall into where they turn their cool-but-ultimately-flawed mental exercise into the bedrock of the product. I don't want to use baby's-first-database on my production servers (I'm looking at you Lennart Poettering and journald) and I don't want my data/metadata stored on some experimental one.
Without agreeing or disagreeing with those, "I'm not going to trust the filesystem on the existing machine" was the choice I was talking about.
The tarsnap architecture still does more things.
You're welcome to feel that you don't need those things, but that wasn't my point.
I have written about this some time ago if you’re interested: https://www.franzoni.eu/ransomware-resistant-backups/
There's also a service like rsync.net where you can just rsync to the destination and they do the versioning and so on for less than a tenth of the cost of tarsnap.
I just now need a deduplicating asymmetrically encrypted backup program.
I've tried duplicity in the past, and maybe I should try it again. But my recollection is that duplicity will just fail to do backups at the slightest hint of any problem. Like maybe if the last backup was interrupted then no more backups for you until you attend to it.
Edit: More memories returning of having to dig out my decryption key to resync the metadata when duplicity gets unhappy, and then since my target server was append-only, duplicity was upset when it wasn't allowed to overwrite any of its incomplete metadata files. I guess the ZFS snapshot technique would alleviate the latter issue.
To be fair, if tarsnap gets confused it needs the keys to do its fsck command, but I recall this sort of thing happening regularly with duplicity and almost never with tarsnap.
An rsync.net account can have any arbitrary schedule of snapshots - including days, weeks, months, quarters and years.
Optimal reading width for speed/comprehension is also fewer than 80 characters afaik. I think different sources I read years ago were undecided whether it's closer to 60- or 70-character lines. Either way, rescaling when there is no ASCII art or position-dependent characters seems rather basic and <pre> disallows that
I always wear my glasses when I'm using my iPad, which is in landscape mode almost all the time (the exceptions being the rare apps like Uber's that won't do landscape mode).
My youngest once found some sort of chocolate drops called "unicorn poo" - which seems a more ironic thing to chuck at CTOs !
Secure
Technology
Oversight for
Corporate
Software
STOCS Act here we come !
Edit : yeah I could not get the K in ... that's hard
So it's not just ease of use. It's actual _functionality_ to me - getting from raw object storage to a fully working, attack-resistant backup strategy, is not trivial; hence, comparing tarsnap (or rsync.net, or borgbase, or whatever) to B2 or S3 makes little to no sense.
You _could_ compare it to crashplan or backblaze personal backup if you like, but IIRC those don't work for *nix systems, only for Win and Mac.
Those restrictions are enforced by the service.
Thought it used readonly features of S3/Glacier or something..