Tarsnap outage postmortem

Tarsnap outage postmortem (mail.tarsnap.com)

553 points by anderiv 2 years ago | 319 comments

cperciva 2 years ago |

blinks

Ok, I really wasn't expecting this to land at the top of HN. I'd love to stick around to answer any questions people have, but it's 10PM and my toddler decided to go to bed at 5PM... so if I'm lucky I can get about 4 hours of sleep before she decides that it's time to get up. I'll check in and answer questions in the morning.

stigz 2 years ago | |

Why would I use your service over restic?

God bless you Colin, but reading this, it appears you're the only one in charge of the infrastructure for this service. I'm glad you're clear about no SLA, but this seems like a big liability between me and my backups.

ivanhoe 2 years ago | | |

It's been a pretty well-known fact for years that tarsnap is basically a one-man show, and yet Colin has managed to provide fantastic service so far. Sometimes having the ppl who built the service also managing it is actually a big plus, compared to other services where you first have to fight through outsourced & underpaid support that's limited to template answers, only to finally get some "engineer" who got that job 2 months ago and is more clueless about their system than I am...

crossroadsguy 2 years ago | | |

You really shouldn’t if that’s a major concern for you, and it is a valid concern. For the same reason I’ll never use PurelyMail; otherwise it’s perfect.

I know you didn’t ask me — but I don’t think Colin can answer differently other than saying that he is training a family member or friend to take over if needed.

Here’s more https://news.ycombinator.com/item?id=7514753 this is also linked there http://mail.tarsnap.com/tarsnap-users/msg00846.html

Very old threads but I am not sure much has changed there https://www.tarsnap.com/contact.html

Why would you use it instead of restic? Well, for pricing in pico dollars ;-)

And because it has a functional GUI with a tiny system footprint, and there really aren’t many such solutions out there.

twic 2 years ago | | |

> God bless you Colin, but reading this, it appears you're the only one in charge of the infrastructure for this service

Hence the toddler.

amluto 2 years ago | | |

tarsnap natively protects against inadvertent or malicious deletion or corruption: old tarsnap backups are immutable. The low-cost competitors (restic, borg, etc) seem to have this feature as an afterthought, and they make it surprisingly difficult.

(FWIW, S3 can be somewhat straightforwardly configured so that old data is effectively immutable. Google Cloud Storage’s similarly named versioning feature appears to be far weaker.)

88913527 2 years ago | | |

Even large organizations can have fairly regular availability issues. I appreciate the noted flaws of "single point of failure", but I also see orgs where 100s of people have access to the infrastructure, make a change, and then it breaks something. I wouldn't do business with an org just because they have many people; that doesn't mean they're operationally sound, at least not to my expectations.

k8sToGo 2 years ago | | |

If the data is super important, you should be storing backups with two different providers anyway.

throwaway290 2 years ago | | |

What use is an SLA? If a service goes down for too long, are you really going to hire a lawyer to sue over the SLA, or just... use another backup?

IntelMiner 2 years ago | | |

I'm curious how the prices shake out against services like Wasabi, since it's just dumping to an AWS S3 bucket

Wasabi does $7/TB with no ingress/egress fees. My NAS is set up to rclone to it about once a day and I've yet to have any problems

Mawr 2 years ago | | |

Uptime isn't an important property of a backup solution, so I'm not sure where the expectation comes from?

jacquesm 2 years ago | |

In future postmortems (of which I hope there will be very few or even none) you may want to spell out your 'lessons learned' to show why particular items will never recur.

idlewords 2 years ago | | |

It always amuses me how people want reassurance that the next crisis will be a fresh, new problem, and not one the person can demonstrably solve.

A lot of 'lessons learned' analysis boils down to this: in order to prevent a recurrence of X, we introduced complex subsystem Y, the unexpected effects of which you can read about in our next post-mortem.

rsync 2 years ago | | |

You should consider this possible lesson:

"Our simple model that fails gracefully did so and was simple to recover"

Redundancies and failsafes are not free - they add complexity.

99.9% availability fails in boring ways.

99.999% availability fails in fascinating ways.

cperciva 2 years ago | | |

Yeah, I was going to do that but it was getting late, I wanted to get some sleep, and the post-mortem had already been waiting far too long to be sent out.

The main lesson learned was "rehearse this process at least once a year".

vintagedave 2 years ago | | |

The infrastructure page* says,

> at the present time it is possible — but quite unlikely — that a hardware failure would result in the Tarsnap service becoming unavailable until a new EC2 instance can be launched and the Tarsnap server code can be restarted ... So far such an outage has never occurred

I read the postmortem as saying that a hardware failure did cause it to be unavailable, and that the code could not simply be restarted; a new server had to be built.

If that is correct, as well as writing up lessons learned (as Jacques mentions), this page could be updated with outage information -- or even info on changes to reduce risk of repetition.

For what it's worth, one outage of a single day in fifteen years is impressive. If my ballpark math is correct, that's roughly 99.98% uptime, i.e. between three and four nines.

* http://www.tarsnap.com/infrastructure.html
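
The uptime figure is easy to check with a few lines of arithmetic. This is a rough sketch, assuming a single 24-hour outage and fifteen years of 365.25 days each; the helper names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # average year length, leap days included


def downtime_fraction(outage_hours, years):
    """Fraction of the whole period during which the service was down."""
    return outage_hours / (years * HOURS_PER_YEAR)


def nines(fraction_down):
    """Number of nines of availability, e.g. 0.001 downtime -> 3.0."""
    return -math.log10(fraction_down)


down = downtime_fraction(outage_hours=24, years=15)
uptime_percent = 100 * (1 - down)
print(f"{uptime_percent:.3f}% uptime, {nines(down):.2f} nines")
```

Under these assumptions the result comes out near 99.98%, which is more than three nines but short of four. Using the 26 hours of downtime mentioned elsewhere in the thread changes the figure only slightly.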

mike_d 2 years ago | |

This was an extremely well written and thoughtful postmortem, but I hope to never see one from you again. :)

Tepix 2 years ago | | |

It was a postmortem without the mandatory "how can we prevent this in the future" steps…

bombcar 2 years ago | |

Time to get your toddler providing round-the-clock support! ;)

Have been having some luck reading https://www.amazon.com/No-Cry-Sleep-Solution-Toddlers-Presch... - available everywhere libraries (blockbuster for books!) are found.

cperciva 2 years ago | | |

She's generally a wonderful girl. Right now she's dealing with her second molars coming out and just picked up a cold though, which is throwing off her sleep schedule.

gfv 2 years ago | |

How long do you keep the transaction logs before rewriting them?

I too had a few EC2 instances go down with signs of being severed from the EBS in the recent couple of weeks; mine were in eu-west.

cperciva 2 years ago | | |

There's a continual background cleaning process which depends on the amount of storage which can be reclaimed -- there's a tradeoff between cleaning too slowly (and paying for wasted storage) and cleaning too fast (and paying for lots of S3 operations). I think it averages a couple weeks right now.
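
The tradeoff has the same shape as a classic batch-size problem. The toy model below is not Tarsnap's actual policy: it assumes garbage accrues at a constant rate and that each cleaning pass has a fixed cost in S3 operations, and every number in it is hypothetical.

```python
import math


def daily_cost(interval_days, garbage_per_day, price_per_byte_day, pass_cost):
    """Average cost per day when cleaning every `interval_days` days."""
    # Garbage grows linearly between passes, so on average half an
    # interval's worth of reclaimable bytes is sitting in paid storage.
    wasted_storage = price_per_byte_day * garbage_per_day * interval_days / 2
    # The fixed cost of one cleaning pass is spread over the interval.
    operations = pass_cost / interval_days
    return wasted_storage + operations


def cheapest_interval(garbage_per_day, price_per_byte_day, pass_cost):
    """Interval that minimises daily_cost (set the derivative to zero)."""
    return math.sqrt(2 * pass_cost / (price_per_byte_day * garbage_per_day))


# Hypothetical inputs: 1 GB of new garbage per day, 1 picodollar per
# byte-day of storage, and 5 cents of S3 requests per cleaning pass.
params = dict(garbage_per_day=1e9, price_per_byte_day=1e-12, pass_cost=0.05)
best = cheapest_interval(**params)
for days in (1, best, 100):
    print(f"clean every {days:6.1f} days: ${daily_cost(days, **params):.4f}/day")
```

Cleaning much more often than the optimum is dominated by request costs, and cleaning much less often is dominated by wasted storage, which matches the description above.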

dharmapure 2 years ago | |

Thank you for the post-mortem Colin and I hope you get some sleep!

cperciva 2 years ago | | |

Thanks, I did! My long suffering wife was up at 3:30 though. :-(

LinAGKar 2 years ago | |

What I'm wondering is, I had data on Tarsnap, why am I only hearing about this now?

nodesocket 2 years ago | |

Some recommendations on the AWS front (not sure if some of these are already implemented since the postmortem does not go into AWS details).

- Set up nightly automatic snapshots of EBS volumes (this is supported natively now in AWS under lifecycle manager).

- Use EBS volumes of the new gp3 type, and perhaps use provisioned IOPS.

- Set up an auto-scaling group with automatic failover. Of course this increases cost, but it should be able to fail over automatically to a standby EC2 instance (assuming all the code works automatically, which the blog post indicates is not currently the case).

e63f67dd-065b 2 years ago | |

Can you say a bit more about the log-structured S3 filesystem? I wrote something very similar recently (https://github.com/isaackhor/objectfs) and I'm curious what made you settle on that architecture. The closest thing I know of that's similar is Nvidia's ProxyFS (https://github.com/NVIDIA/proxyfs)

nextaccountic 2 years ago | |

> the central Tarsnap server (hosted in Amazon's EC2 us-east-1 region)

What prevents you from distributing load among other regions?

(Also: did you ever think about abandoning AWS?)

rlt 2 years ago | |

Nice write up. A couple questions:

- The use of “I” begs the question: what’s the “bus factor” of Tarsnap? If you were unavailable, temporarily or permanently, what are the contingency plans?

- Will you be making any other changes to improve the recovery time, or did the system mostly function as designed? For example having a hot spare central server?

throwawaaarrgh 2 years ago | |

Are you gonna switch to us-east-2?

deathanatos 2 years ago |

> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, on 2023-07-13 (after some dust settled and I caught up on some sleep) I credited everyone's Tarsnap accounts with 50% of a month's storage costs.

This speaks volumes to me about what kind of person Percival is; that credit would appear to be generously on the "make customer whole" side of the fence, and unlike the major cloud providers, he didn't make each customer come and individually grovel for it. And a clearly written, technical, detailed PM, too. This is how it ought to be done, and done everywhere. Thanks for being a beacon of light in the dark.

rsync 2 years ago | |

"Thanks for being a beacon of light in the dark."

That's well put.

It makes me very happy to live in a world where tarsnap exists and is priced in picodollars.

cperciva 2 years ago | | |

For the record, I'm happy to live in a world where rsync.net exists. I've pointed quite a few customers in your direction over the years, when tarsnap hasn't been suitable for their needs for a variety of reasons.

hightrees2023 2 years ago |

The downtime could have been much shorter if you had properly set up and _tested_ disaster recovery steps. Create a full-fledged separate staging system which you can bring down and recreate, periodically test various failure modes, and document all the detailed steps of a system restore.

I would also suggest thinking about the business long term and seeing whether you can increase revenue enough to hire a part-timer who can be of great help if a similar event happens.

We are also a small cloud solution provider (we focus on ML APIs), and over the years it has become clear to us that when you use cloud hardware (either dedicated or virtual), outages happen from time to time. RAM, HDDs or other parts of the hardware can just malfunction at any time. So this is something which 100% needs to be taken into consideration when running any high-availability online service over the long term.

idlewords 2 years ago |

Hats off to you for an honest postmortem and your capable handling of a difficult situation. The only remark I would offer is with respect to sleep deprivation—when you're the only person who can fix a problem, there's no shame in trading some additional outage time for a fresh mind. Though it feels weird to go nap when all the klaxons are blaring, problems are too easy to compound under the combination of adrenaline and inadequate sleep.

cperciva 2 years ago | |

Don't worry, I had a couple naps in there. "This seems to be running smoothly but it will take several more hours; I'll set my alarm to wake me up in two hours and have a nap" is part of why I didn't notice the second step was unnecessarily I/O bound.

gus_massa 2 years ago | | |

IIUC the process had a few steps where you only had to wait for a long time while data was transferred or processed. Those were probably useful for taking a nap, eating, or just drinking more coffee.

zokier 2 years ago |

Based on the description it sounds like it should be relatively easy to test this recovery process on a regular basis, to catch any lingering bugs and evaluate the recovery time. As they say, the only backups are the ones you have tested.

baz00 2 years ago | |

As someone who just discovered my DR process does not work by testing it, 100% this. The only plan that is likely to work is a repeatable tested one.

tialaramex 2 years ago | | |

Ideally, the thing you do in an emergency is largely routine, so that it happens by instinct rather than being a special case you need to remember. It should not be different in arbitrary ways.

For example, in both trains and cars, thanks to anti-lock braking, the correct way to stop the vehicle ASAP is to brake just like normal but as hard as you can; the computers will automatically solve the much trickier problem of turning your input into maximum deliverable braking force by periodically releasing brakes on sticking wheels.

If you run a fire drill, it's surprisingly difficult to get employees to use fire doors that they're used to finding alarmed and unusable. Even though intellectually they know that, say, the door at the bottom of the stairwell is a fire door, with crash bars and leads directly to the outside world, and this is a fire drill, they are likely to (for example) exit on a higher floor and go through a chokepoint lobby, as they would normally, instead of following this safer path that is emergency only. Sadly it is hard to fix buildings after construction if they were designed with such "unused" emergency exits.

For a backup process, having machine-image restores be something that is used sometimes, though not constantly, for some other reason is a good way to become comfortable with how it works, that it works, etc. At work, for example, we routinely test upgrades on test servers restored from a recent backup: restore serviceA to testA, apply the upgrade, discover the upgrade completely ruins the service, throw testA away, and report that this upgrade is garbage. In the process we gain confidence in the restore process. When things go badly wrong, infrastructure people are not trying to recall something they only ever did in a drill; they are very used to this procedure because they do it "all the time".

cperciva 2 years ago | |

Yep! I've been meaning to do it for a while but there was always something higher priority... I didn't realize until this outage that it had been almost a decade since I had tested it.

Rehearsing this annually is definitely going to be a high priority.

mplewis 2 years ago |

I always appreciate seeing a professional, courteous, and honest postmortem like this one.

verytrivial 2 years ago |

(caveat: I may be running on old tarsnap company info but) I must say, the ONLY thing that has ever made me shy away from seriously using tarsnap was the prospect of an unexpected Colin Percival outage. i.e. key person risk. I'm guessing I'm not alone in this.

abiro 2 years ago |

> The second step failed almost immediately, with an error telling me that a replayed log entry was recording data belonging to a machine which didn't exist. This provoked some head-scratching until I realized that this was introduced by some code I wrote in 2014: Occasionally Tarsnap users need to move a machine between accounts, and I handle this by storing a new "machine registration" log entry and deleting the previous one

Recommend writing a TLA+ model to catch stuff like this
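
To make the quoted failure concrete, here is a toy replay loop that trips over the same kind of log. It is only an illustration: the entry layout, the names, and `ReplayError` are invented, and the postmortem does not describe Tarsnap's real log format.

```python
class ReplayError(Exception):
    """Raised when a log entry refers to state that does not exist yet."""


def replay_registrations(log):
    """Rebuild machine ownership and stored blocks from (seq, kind, machine, payload)."""
    owners = {}  # machine id -> owning account
    blocks = {}  # machine id -> names of stored blocks
    for seq, kind, machine, payload in sorted(log):
        if kind == "register":
            owners[machine] = payload
            blocks.setdefault(machine, [])
        elif kind == "store":
            if machine not in owners:
                raise ReplayError(f"entry {seq}: data for unknown machine {machine!r}")
            blocks[machine].append(payload)
    return owners, blocks


original = [
    (1, "register", "m1", "alice"),
    (2, "store", "m1", "block-a"),
    (3, "store", "m1", "block-b"),
]

# Moving m1 to another account: append a new registration entry and
# delete the old one, as in the quoted description.
after_move = [e for e in original if e[0] != 1] + [(4, "register", "m1", "bob")]

try:
    replay_registrations(after_move)
except ReplayError as err:
    print("replay failed:", err)
```

The live server never sees a problem because its in-memory state already knows the machine; only a replay from scratch meets the store entries before any registration.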

colonwqbang 2 years ago |

What would be the benefit of tarsnap over using something like restic+backblaze at order(s) of magnitude lower cost? What specific need would motivate you to pay $3000 per TB-year?
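
For reference, the $3000 figure can be reproduced from the picodollar pricing. The sketch assumes Tarsnap's published storage price of 250 picodollars per byte-month (taken from its pricing page, not from this thread) and decimal terabytes; it ignores bandwidth charges and any savings from deduplication or compression.

```python
PICODOLLARS_PER_DOLLAR = 10**12
BYTES_PER_TB = 10**12  # decimal terabyte

# Assumed storage price: 250 picodollars per byte per month.
STORAGE_PRICE_PICODOLLARS = 250


def storage_cost_dollars(stored_bytes, months):
    """Storage cost in dollars, computed in integer picodollars to avoid rounding."""
    picodollars = STORAGE_PRICE_PICODOLLARS * stored_bytes * months
    return picodollars / PICODOLLARS_PER_DOLLAR


dollars_per_tb_year = storage_cost_dollars(BYTES_PER_TB, months=12)
print(f"${dollars_per_tb_year:.2f} per TB-year")
```

The bill depends on the bytes actually stored after deduplication and compression, so this is an upper bound for a terabyte of source data rather than a typical charge.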

carapace 2 years ago | |

Some of us have lots of extra money and like an excuse to give some of it to cperciva so he doesn't have to work a shit job and can apply his skills and talents to bigger, better things?

(People here asking about the low Bus Factor: you don't keep your backups in one service/location, eh? You use Tarsnap and Restic with Backblaze, Rsync.net, S3, etc. right? "Backups are a tax you pay for the luxury of restore.")

jpgvm 2 years ago | |

Extremely good deduplication means that for the core set of very important data I backup to Tarsnap the costs are negligible. I imagine the math is probably different if your data is changing more frequently. I for instance use other services to manage my video and photo libraries but my accounting databases, critical documents, etc are backed up to Tarsnap.

I have been using Tarsnap for a decade, and not only have there been minimal availability issues, there have been almost no issues of any kind that I can recall.

mherrmann 2 years ago |

It sounds like most of the 26h downtime was spent restoring backups. Incidentally, this is exactly the reason why Tarsnap is unusable for me for production environments. Backup restoration (as a user) is excruciatingly slow. When my systems are offline, I have no patience to wait for hours for my backup service. Maybe things are better now; the last time I tried was a few years ago, when Tarsnap took on the order of an hour to restore a backup of a few GBs.

akashshah87 2 years ago |

Unfortunately, looks like https://www.tarsnap.com/infrastructure.html will have to be updated.

>> So far such an outage has never occurred; but over time Tarsnap will become more tolerant of failures in order to minimize the probability that such an outage occurs in the future.

viscousviolin 2 years ago |

Unrelated to the outage, but I'm curious nonetheless: would it be possible to hook up Tarsnap's encryption software to a Dropbox folder? I'm not sure if it even makes sense to use Tarsnap for this, but I'd love to have an easy setup that allows me to use Dropbox's servers but only let them see encrypted data so they can't snoop.

matthiaswh 2 years ago | |

You probably want something like https://cryptomator.org/

ivoras 2 years ago | |

Doesn't plain old Duplicity (https://duplicity.us/) do that already? (except for de-duplication)

aborsy 2 years ago |

Tarsnap is undoubtedly expensive, but it also donates to various efforts!

Neglecting the pricing, does Tarsnap have any advantage over Restic?

Restic also deduplicates, using little data.

mattbee 2 years ago | |

The deduping in restic is just on the edge of acceptable for me, making me think I'd have trouble with a lot more data. Basically the once-a-month "prune" operation takes about 36h (to B2). I feel I could be tuning something, but it works and I don't want to touch it.

aborsy 2 years ago | | |

I backup around 2TB with Restic, also tried locally with Borg. The size is nearly the same. Sadly, I can’t even test with Tarsnap! (absurd pricing for 2TB).

sandgiant 2 years ago | | |

Curious how much you back up, which version of restic you're running and why you think the deduplication is borderline unacceptable. There were several major (orders of magnitude) improvements made to pruning within the past ~1 year, that's why I'm interested.

zgluck 2 years ago | |

> Tarsnap is undoubtedly expensive, but it also donates to various efforts!

I mean.. you could purchase a cheaper service and also donate to various efforts. Bonus: Then you'd also be able to pick those efforts.

bartvk 2 years ago | |

How do you compare the two, price-wise? With Restic, you have to provide your own storage.

RockRobotRock 2 years ago |

Aren't these storage prices absurd? Please let me know if I'm misunderstanding.

switch007 2 years ago |

Not to be that guy, but it’s unreadable on iOS, whether zoomed in or in reader mode, in portrait or landscape.

Colin, could the website be updated to the 2010s? :P

zetalyrae 2 years ago |

>The process of recovering the EC2 instance state consists of two steps: First, reading all of the metadata headers from S3; and second, "replaying" all of those operations locally. (These cannot be performed at the same time, since the use of log-structured storage means that log entries are "rewritten" to free up storage when data is deleted; log entries contain sequence numbers to allow them to be replayed in the correct order, but they must be sorted into the correct order after being retrieved before they can be replayed.)
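
The "fetch everything, sort, then replay" constraint in the quote can be shown with a small sketch. The entry layout and the key-value state are invented for illustration and are not Tarsnap's actual format; the point is only that entries come back in storage order, not in sequence order.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    seq: int  # sequence number assigned when the operation was first logged
    op: str  # "put" or "delete"
    key: str
    value: str = ""


def replay(entries):
    """Rebuild state by applying every entry in sequence-number order."""
    state = {}
    # Sorting needs the complete set of entries, which is why fetching
    # and replaying cannot be interleaved.
    for entry in sorted(entries, key=lambda e: e.seq):
        if entry.op == "put":
            state[entry.key] = entry.value
        elif entry.op == "delete":
            state.pop(entry.key, None)
        else:
            raise ValueError(f"unknown operation {entry.op!r}")
    return state


# Retrieval order: rewritten entries can sit in newer objects, so old
# sequence numbers may turn up after newer ones.
fetched = [
    LogEntry(3, "delete", "a"),
    LogEntry(1, "put", "a", "v1"),
    LogEntry(4, "put", "b", "v2"),
    LogEntry(2, "put", "a", "v1b"),
]

state = replay(fetched)
print(state)
```

Applying `fetched` in retrieval order instead would resurrect key `a`, because its delete would run before the puts it was meant to undo.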

Far be it from me to tell anyone how to write software, but why build a database on top of S3 when you can just chuck the metadata into RDS with however much replication you want?

The backups themselves should be in S3, but using S3 as a NoSQL append-only database seems unwise.

This would benefit from being further from the metal.

1. Key person gets hit by bus

2. You see the black bar on Hacker News and learn the sad news

3. You go download all your data from the service, which is still up because there is no bus access to data centers.

4. You feel like a jerk for all your creepy "hit by bus" talk.

5. A few weeks later, some VC-funded operation with multiple employees you depended on disappears overnight without a trace.