Full technical details on Asana's worst outage(blog.asana.com) |
In every outage report I read, something like that happened. At least Asana didn't blame the technology they were using.
Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.
Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.
This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high level explanation without getting into the gruesome details.
I'm not saying my approach is best, but I do think trying to avoid scaring people in your explanation is an idea.
They require us to actually do the work of identifying the issues and writing up what happened and why. I realize that having a customer contract to do this shouldn't be a requirement, but human psychology is a funny thing. I can turn to my PM and say "I have to do this, it's part of the contract" and they immediately back off.
I agree it might not be the best solution but it's definitely better than not doing them.
The latter is useful, for example, when my boss asks me to evaluate whether to continue using a service after an incident. If I can't get enough information to make a recommendation, I might propose a switch out of distrust, especially when the problem was related to security or privacy.
... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully) but it doesn't give you the same experience.
We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and themselves rolled back, which is a rare situation for us. So there was a bit of scrambling to look into the chat logs to determine a safe (non-rolled back) release we could roll back to. Then the high CPU caused our roll back to be really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.
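For anyone wondering what "determine a safe (non-rolled back) release" could look like if it were automated instead of dug out of chat logs, here is a minimal sketch. It assumes a release history kept as an ordered list with a flag per release; the `Release` type, the `rolled_back` field, and the `r101`-style version names are all made up for illustration and are not Asana's actual tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    version: str       # hypothetical release identifier
    rolled_back: bool  # True if this release was itself reverted as bad


def last_safe_release(history: List[Release], current: str) -> Optional[str]:
    """Return the newest release before `current` that was never rolled back.

    `history` is ordered oldest to newest. Returns None if no safe
    release exists before `current`.
    """
    versions = [r.version for r in history]
    idx = versions.index(current)  # raises ValueError if `current` is unknown
    for release in reversed(history[:idx]):
        if not release.rolled_back:
            return release.version
    return None


# Hypothetical history: two bad releases in a row, then the one that is live.
history = [
    Release("r101", rolled_back=False),
    Release("r102", rolled_back=True),
    Release("r103", rolled_back=True),
    Release("r104", rolled_back=False),  # currently live and misbehaving
]
target = last_safe_release(history, "r104")
```

With a record like this, the rollback target is a lookup rather than a chat-log search, which matters most when the boxes are already pegged.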
Also, when only deploying two times a day, it's harder to tell which of the included changes have the problem. That's an argument for more frequent deploys!
In that case, should you be doing daily deployments to production?
Are the daily drops predominately bug fixes or also a regular drip of new functionality?
I think the old world of quarterly releases was also bad for other reasons. I'm curious about the right middle point.
Every time a company like Asana comes clean about outages and software quality issues, the canon of knowledge improves. Thank you for sharing!
Performance is the hardest thing to integration test for. Keeping careful track of CPU/memory/network/disk load with automated alerts can help.
(Fancy systems like running a traffic replica can help, too, but at a much higher cost.)
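To illustrate the "automated alerts" part, here is a rough sketch of an alert that only fires when CPU stays above a threshold for several consecutive samples, so a single spike doesn't page anyone. The class name, the 90% threshold, and the window of 3 are assumptions picked for the example, not recommendations from the post.

```python
from collections import deque


class SustainedLoadAlert:
    """Fire when every one of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest `window` samples

    def observe(self, cpu_percent: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


alert = SustainedLoadAlert(threshold=90.0, window=3)
readings = [50.0, 95.0, 97.0, 99.0, 60.0]  # made-up CPU percentages
fired = [alert.observe(r) for r in readings]
```

In a real setup the samples would come from whatever metrics agent you already run; the point is only that the rule is cheap to express.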
Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.
Let this be a lesson to all of us. Have basic dashboards and alarming.
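Option (b) above, sending a portion of traffic to a static "try again later" page, can be sketched roughly like this. Note this is an application-level or load-balancer-level illustration, not the DNS-level approach mentioned; the function name and the idea of bucketing by a client ID are assumptions for the example.

```python
import zlib


def should_shed(client_id: str, shed_percent: int) -> bool:
    """Decide whether this client is sent to the static page.

    Clients are hashed into 100 stable buckets, so the same client gets
    the same answer on every request while `shed_percent` is unchanged.
    """
    bucket = zlib.crc32(client_id.encode("utf-8")) % 100
    return bucket < shed_percent


# Hypothetical usage: shed 30% of clients while the web servers recover.
decisions = {cid: should_shed(cid, 30) for cid in ("alice", "bob", "carol")}
```

Bucketing by a stable hash rather than rolling a random number per request means a given user either gets the app or the static page consistently, instead of flapping between the two.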
If you're working in Slack or chat, you've got a minimum of half a dozen people typing and putting out suggestions and offering to investigate something. That's all time stamped. And even if you're not doing that real-time, you may be using something like a GitHub issue to discuss the problem via comments, which are also time-stamped.
No one at the moment of the incident is probably going "Ah, it's 8:01, better write down that I identified the problem." It's most likely "hay I think I got it one sec" and then that works. Or doesn't. But hopefully it does.
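Since everything in chat is already time-stamped, turning it into a postmortem timeline is mostly sorting. A minimal sketch, assuming the messages have been exported as (ISO timestamp, author, text) tuples; the export format, names, date, and times below are invented for illustration.

```python
from datetime import datetime


def build_timeline(messages):
    """Sort (iso_timestamp, author, text) tuples and format one line each."""
    ordered = sorted(messages, key=lambda m: datetime.fromisoformat(m[0]))
    return [
        f"{datetime.fromisoformat(ts):%H:%M} {author}: {text}"
        for ts, author, text in ordered
    ]


# Hypothetical chat export, deliberately out of order.
messages = [
    ("2000-01-01T08:12:00", "bob", "rolling back now"),
    ("2000-01-01T08:01:00", "alice", "hay I think I got it one sec"),
]
timeline = build_timeline(messages)
```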
judging from the number of 'sorry's in the text, seems like post mortems have been slowly adapted into a very specialized form of semi-fictional stage drama in which the audience is pandered to excessively through the use of hyperbolic apology.
I'm not sure what you're using for dashboards but Datadog makes it pretty easy to find this stuff. I'm not a Datadog shill and I actually am not a huge fan of the product, but it's what we use and it's been a big help over our previous Munin installation.
Other process changes that could prevent this are good load testing in a stage environment and getting your company using the real prod code on the real prod infrastructure as its main/default install. A lot of the benefits of "dogfooding" are lost if you're using alpha code on dev-only boxes (as you state that you are in another comment).
As another commenter said, I'm not sure that postmortems like this are valuable unless the problem was particularly complex/interesting. I'm sure that a lot of people at Asana know how to fix this and that it's just a matter of getting management to allow them to do so. I'm sure you owe your customers an explanation of some sort, but I don't know if you need to get into details that say "Yeah, it was just a pretty typical organizational failure, we really should've known better". Everyone has those, but it's best not to publicize them too much.
I'm not going to hold it against Asana because I've worked at a lot of companies and I know how this goes, but when people come here and analyze the cause, as a postmortem invites the readers to do, you seem a little defensive. Perhaps it's best to keep the explanation more brief/vague when it's not a complex failure.
1. Describe the root cause and what you failed at. 2. Blame the stuff you are using / other people (clouds you use). 3. Just say nothing and try to forget what happened.
What do you think is best?
You can simply make a statement that goes something like this:
"We've completed our investigation of the outage and we found that it was caused by both technical and procedural errors in the manner in which we deploy our code and monitor the environment. We gathered all the information we require and have made improvements based on it that would help to prevent these issues and other issues with similar causality from occurring. While we do apologize for any inconvenience that the outage may have caused, we do want to stress that because of the lessons learned from it our service would grow to be more robust and reliable in the future."
That's it, simple even if generic. Having to read 3 pages of technical details isn't really helpful to anyone; if anything, the more "suspicious" people might see that as an attempt to mask the real cause of the issues.
But overall, when you go into specifics you also give people the ability to focus their frustrations and disapproval on a specific subject, which is never good. After reading this, what I "feel" at first glance is that the fault lies with the engineers that monitored the environment, so the engineers are incapable of performing their duties; now I feel like the hiring and management processes in that company are not working well if they let "unqualified" engineers in. This is how a minor outage blows up into a specific complaint or negative bias towards a company, and you can easily avoid it by giving enough "reassuring" information but not enough for anyone to actually sink their teeth into.
Overall, a generic positive statement is more likely to be received as "well, it sucks, but shit melts down sometimes and sometimes people make mistakes". A more technical statement might be received as "well, why did you hire bob in the first place?" or "why the fuck are you using this_framework_i_dont_like?".
"Asana had an outage for 45 minutes yesterday. This was due to an issue with a deploy that was pushed the night prior. We apologize for the inconvenience and are undertaking a thorough review of processes to ensure that similar events don't occur in the future. Please be assured everything is back in working order now. Thank you for your patience and continued patronage."
Big detailed postmortems like this should remain internal documents unless they describe a complex or rare technical failure, news and/or discussion of which will actually benefit the larger community.
Two lines, with the same information someone not very technically literate would understand from the OP. I agree with being transparent, but I also believe in not unnecessarily scaring and/or confusing customers, either.
(Pretty soon they'll just start outing individual engineers...)
The process failed the engineer. Testing, deployment, and monitoring infrastructure was not up to the task of supporting human beings. That it happened to be triggered by engineer X instead of engineer Y is entirely coincidental.
The audience of the post mortem matters. When I see the two line summary, I have no idea whether that's a CYA whitewash, or a sincere part of a process of improvement. When I see the full PM, it builds more trust.
If you're not an engineer capable of understanding the details, it may have a different effect. And if you're part of a corporate culture of politics, shaming, and status chasing, it must feel totally alien.
Three cheers for transparency!
I think you do need to at least acknowledge the problem, with a clear non-technical explanation of the problem in the first paragraph. The rest should go into the real technical details of the result of the investigation, not the investigation itself.
- train the people more
- help them get over it (some people could be really mad and lose confidence after they did badly)