Full technical details on Asana's worst outage (blog.asana.com)

77 points by marcog1 9 years ago | 62 comments

merb 9 years ago |

> Initially the on-call engineers didn’t understand the severity of the problem

In every outage report I read, something like that happened. At least Asana didn't blame the technology they were using.

babo 9 years ago | |

For me that was the great part of the post mortem: they identified the response process itself as the root cause.

merb 9 years ago | | |

Yep, that was what I was thinking as well.

katzgrau 9 years ago |

These sorts of deeply apologetic and hyper-transparent post-mortems have become commonplace, but sometimes I wonder how beneficial they are.

Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.

Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.

This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high level explanation without getting into the gruesome details.

I'm not saying my approach is best, but I do think trying to avoid scaring people in your explanation is an idea.

bognition 9 years ago | |

I work at a shop that does these kinds of post mortems. I find them highly beneficial.

They require us to actually do the work of identifying the issues and writing up what happened and why. I realize that having a customer contract to do this shouldn't be a requirement, but human psychology is a funny thing. I can turn to my PM and say "I have to do this, it's part of the contract" and they immediately back off.

I agree it might not be the best solution but it's definitely better than not doing them.

dogma1138 9 years ago | | |

I think the OP didn't mean that these post mortems are not beneficial internally; what he said was that disclosing all these details to the public can be confusing and maybe counterproductive.

smoe 9 years ago | |

Personally I'd produce both: a brief high-level explanation for non-technical people (e.g. customers, press) and an in-depth blog post with the gruesome details.

The latter is useful, for example, when my boss asks me to evaluate whether to continue using a service after an incident. If I can't get enough information to make a recommendation, I might propose a switch out of distrust, especially when the problem was related to security or privacy.

rossjudson 9 years ago | |

Your approach works for the incident, but not for the relationship. Transparency about the technical nature of the outage is a commitment to the client that this type of outage won't recur, and steps are being taken to ensure that. It pierces the veil of arrogance by assuming client competence. That client is actually someone who reports to someone else, and they're going to have to explain their outage to the boss. For cloud providers, this kind of transparent post-mortem is the root of a fan-out of incident analysis.

marcog1 9 years ago | |

Every response from our users so far has been thanking us for the transparency. It also represents our internal transparency, and that has a real impact on recruiting.

noir_lord 9 years ago | | |

Not an Asana user, but if I were, this kind of response is exactly what I'd like to see as both a user and a developer, so well done.

madelinecameron 9 years ago |

>And to make things even more confusing, our engineers were all using the dogfooding version of Asana, which runs on different AWS-EC2 instances than the production version

... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully) but it doesn't give you the same experience.

marcog1 9 years ago | |

You want to replicate as much as possible, but if we ran canary on the same machines we could have testing code bring down production. That's bad.

bArray 9 years ago |

Was this incident really recorded minute by minute, or is that made up? I've noticed a lot of companies that give this kind of detail like to give a minute-by-minute report. I just don't understand how they get that accuracy.

kctess5 9 years ago |

I find it interesting that they didn't notice the overloading for so long. Also that it took so long to roll back. Given that they reportedly roll out twice a day, it seems like identifying a rollback target would be fairly quick.

marcog1 9 years ago | |

This was the first time we had this class of outage. Many things were in a very bad state, and many of these symptoms were more familiar to us. So we spent time ruling them out before realising webserver CPU was closer to the root cause than the other symptoms.

We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and themselves rolled back, which is a rare situation for us. So there was a bit of scrambling to look into the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU caused our rollback to be really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.

tomjen3 9 years ago | | |

What I don't get is why you didn't immediately see the relatively low CPU usage on the database server and the super high usage on the webservers in a Nagios (or similar) dashboard.

jwatte 9 years ago | | |

Roll backs are in chat logs? I'd assume your scripts would record what they do when they do it, including roll backs.
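
To illustrate the point about scripts recording their own actions: a minimal sketch, assuming a machine-written deploy log exists, of how a "safe" rollback target could be looked up instead of searched for in chat. The `DeployEvent` record, its fields, and the release names are hypothetical and are not Asana's actual tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeployEvent:
    """One entry of a hypothetical machine-written deploy log."""
    release: str   # release identifier, e.g. a build number
    action: str    # "deploy", or "rollback" meaning this release was abandoned


def last_good_release(events: List[DeployEvent], current: str) -> Optional[str]:
    """Return the most recently deployed release that was never rolled back,
    skipping the release that is currently being abandoned."""
    rolled_back = {e.release for e in events if e.action == "rollback"}
    for event in reversed(events):
        if (event.action == "deploy"
                and event.release != current
                and event.release not in rolled_back):
            return event.release
    return None  # no safe target on record


log = [
    DeployEvent("r101", "deploy"),
    DeployEvent("r102", "deploy"),
    DeployEvent("r102", "rollback"),
    DeployEvent("r103", "deploy"),
    DeployEvent("r103", "rollback"),
    DeployEvent("r104", "deploy"),   # the release misbehaving right now
]
target = last_good_release(log, current="r104")
```

Here the two most recent earlier releases were themselves rolled back, so the lookup skips them and lands on `r101`.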

Also, when only deploying two times a day, it's harder to tell which of the included changes have the problem. That's an argument for more frequent deploys!

abhishekash 9 years ago | | |

Seems like pretty ambitious logging if it tripped the servers! I will be careful with my logging next time :)

ycombinatorMan 9 years ago | | |

Out of curiosity, why are you deploying to all your web servers simultaneously? Could you not do a partial roll-out to make sure something like this doesn't happen?
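
A partial roll-out of the kind asked about here could look roughly like the sketch below. It assumes the caller supplies `deploy` and `healthy` callbacks; those callbacks, the host names, and the stage fractions are all hypothetical placeholders.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.05, 0.25, 1.0),
) -> List[str]:
    """Deploy to growing fractions of the fleet and stop at the first
    stage where any already-updated host reports unhealthy.

    Returns the hosts that received the new release.
    """
    done: List[str] = []
    for fraction in stage_fractions:
        # Always advance by at least one host so small fleets make progress.
        upto = max(len(done) + 1, int(len(hosts) * fraction))
        batch = list(hosts[len(done):upto])
        for host in batch:
            deploy(host)
        done.extend(batch)
        if not all(healthy(h) for h in done):
            break  # halt: the remaining hosts keep the old release
    return done


fleet = [f"web-{i}" for i in range(10)]
deployed: List[str] = []
touched = staged_rollout(
    fleet,
    deploy=deployed.append,
    healthy=lambda host: host != "web-3",   # pretend web-3 overloads
    stage_fractions=(0.1, 0.5, 1.0),
)
```

In this run the second stage reaches `web-3`, the health check fails, and the other half of the fleet never receives the release. A real health check would need to wait long enough for a CPU problem to show up before moving to the next stage.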

mathattack 9 years ago |

Not a bad reaction. With all the reverts is there a QA issue? Or too many releases?

marcog1 9 years ago | |

When you do daily deployments, you can't QA every one much. You rely on automated tests and internal users using the new code for a couple of hours before the deployment. We were unlucky in this case with the number of bad releases. Each was relatively minor, and ironically one was to fix a bug with the code that caused this outage. We run a 5 whys for most of them.

Mtinie 9 years ago | | |

> When you do daily deployments, you can't QA every one much.

In that case, should you be doing daily deployments to production?

mathattack 9 years ago | | |

I include automated testing in my definition of QA. (Necessary but not sufficient)

Are the daily drops predominately bug fixes or also a regular drip of new functionality?

I think the old world of quarterly releases was also bad for other reasons. I'm curious about the right middle point.

Every time a company like Asana comes clean about outages and software quality issues, the canon of knowledge improves. Thank you for sharing!

zzzcpan 9 years ago |

Strangely, there are no actual technical details in the report and the blame is on the process, although most of the time there is some way to prevent bugs from causing problems with better architecture.

jwatte 9 years ago | |

The detail was right there: debugging something in security caused massive logging which caused CPU bottlenecking.

Performance is the hardest thing to integration test for. Keeping careful track of CPU/memory/network/disk load with automated alerts can help.
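
As a rough illustration of an automated load alert: a minimal sketch that fires only when a metric stays above a threshold for several consecutive samples, so a single spike does not page anyone. The class name, the 90% threshold, and the window of 3 samples are assumptions chosen for the example.

```python
from collections import deque
from typing import Deque


class SustainedThresholdAlert:
    """Fire when a metric stays above `threshold` for `window` consecutive samples."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.samples: Deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alert should fire now."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.threshold for v in self.samples)


cpu_alert = SustainedThresholdAlert(threshold=90.0, window=3)
readings = [40.0, 95.0, 97.0, 60.0, 96.0, 98.0, 99.0]
fired = [cpu_alert.observe(r) for r in readings]
```

With these readings the alert stays quiet through the brief spike at 95 and 97 and fires only on the last sample, once three readings in a row exceed 90.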

(Fancy systems like running a traffic replica can help, too, but at a much higher cost.)

marcog1 9 years ago | | |

We actually have a traffic replica (dark client) setup for the new webserver architecture we are gradually migrating to. It likely would have caught this before deploying to users.

cookiecaper 9 years ago |

Reading through this, it sounds like some basic monitoring would've quickly allowed them to pinpoint the cause instead of wasting time with database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the server is redlining now, better roll that back". A bug or issue in the recent deploy would logically be one of the first suspects in such a circumstance. Don't know why they wasted 30-60 minutes on a red herring. The correlation would be even more obvious if they took advantage of Datadog's event stream and marked each deployment.
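
The deploy-versus-spike correlation described above can be sketched without any particular monitoring product. The function below is an assumption-laden illustration, not Munin's or Datadog's API: it takes plain lists of timestamps, and the 10-minute window and 1.5x jump factor are arbitrary.

```python
from statistics import mean
from typing import List, Optional, Tuple

Sample = Tuple[int, float]   # (unix timestamp in seconds, CPU percent)
Deploy = Tuple[int, str]     # (unix timestamp in seconds, release name)


def suspect_deploy(
    cpu: List[Sample],
    deploys: List[Deploy],
    window_s: int = 600,
    jump: float = 1.5,
) -> Optional[str]:
    """Return the earliest release whose following window has mean CPU at
    least `jump` times the mean of the preceding window, else None."""
    for ts, release in sorted(deploys):
        before = [v for t, v in cpu if ts - window_s <= t < ts]
        after = [v for t, v in cpu if ts <= t < ts + window_s]
        if not before or not after:
            continue  # not enough data around this deploy
        before_mean, after_mean = mean(before), mean(after)
        if after_mean > before_mean and after_mean >= jump * before_mean:
            return release
    return None


# One sample per minute: steady at 30%, then redlining after the second deploy.
cpu = [(t, 30.0) for t in range(0, 1200, 60)] + [(t, 95.0) for t in range(1200, 1800, 60)]
deploys = [(600, "r103"), (1200, "r104")]
culprit = suspect_deploy(cpu, deploys)
```

The first deploy leaves CPU flat and is skipped; the second is followed by a jump from 30% to 95% and is returned as the suspect.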

Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.

Let this be a lesson to all of us. Have basic dashboards and alarming.

qaq 9 years ago |

This is "not that different" from getting a very high load spike. Do you guys not have some autoscaling setup?

marcog1 9 years ago | |

We do, but it didn't help given the cause of the high CPU was our logging infrastructure (Amazon Kinesis) being overloaded by the webservers.

matt_wulfeck 9 years ago | | |

Does Kinesis not support UDP syslog-style logging? Some of these old technologies had the right idea: if you're sending too much data, drop the packets on the floor instead of falling over!
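
The "drop it on the floor" idea can be approximated in-process regardless of the transport. The sketch below is not a Kinesis client; it is a hypothetical Python `logging` handler, named here for illustration, that buffers records in a bounded queue and discards new ones when the queue is full, so a flood of log lines cannot block or back up the request path.

```python
import logging
import queue


class DroppingQueueHandler(logging.Handler):
    """Buffer records in a bounded queue; discard new ones when it is full."""

    def __init__(self, capacity: int) -> None:
        super().__init__()
        # capacity must be positive: queue.Queue treats maxsize <= 0 as unbounded.
        self.buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.buffer.put_nowait(record)   # never blocks the caller
        except queue.Full:
            self.dropped += 1                # shed load instead of backing up


logger = logging.getLogger("dropdemo")
logger.setLevel(logging.INFO)
logger.propagate = False                     # keep the demo output quiet
handler = DroppingQueueHandler(capacity=100)
logger.addHandler(handler)

for i in range(250):
    logger.info("debug detail %d", i)
```

After the loop the buffer holds the first 100 records and the other 150 are counted as dropped. In a real setup a separate worker would drain `handler.buffer` and ship records to the log backend, and the `dropped` counter would itself be worth alerting on.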

babo 9 years ago | |

Autoscaling is not necessarily driven by CPU load.

qaq 9 years ago | | |

True, but by default it is.

jwatte 9 years ago |

The real support for a frequent deployment system is in the immune system! I've had good luck with a deployment immune system that rolls back if CPU or other load jumps, even if it doesn't immediately cause user failure. (I.e., monitor crucial internals, not just user availability.)
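
A deployment immune system along these lines might be sketched as follows. Everything here is an assumption for illustration: the `deploy`, `roll_back`, and `sample_cpu` callbacks are hypothetical, and the 1.3x ratio and 90% hard limit are arbitrary thresholds.

```python
from statistics import mean
from typing import Callable, Sequence


def should_roll_back(
    baseline_cpu: Sequence[float],
    post_deploy_cpu: Sequence[float],
    max_ratio: float = 1.3,
    hard_limit: float = 90.0,
) -> bool:
    """Decide whether post-deploy CPU looks bad enough to revert."""
    if not baseline_cpu or not post_deploy_cpu:
        return False  # not enough data to judge either way
    after = mean(post_deploy_cpu)
    return after >= hard_limit or after > max_ratio * mean(baseline_cpu)


def guarded_deploy(
    deploy: Callable[[], None],
    roll_back: Callable[[], None],
    sample_cpu: Callable[[], float],
    samples: int = 5,
) -> bool:
    """Deploy, compare CPU before and after, and revert on a jump.

    Returns True if the release was kept, False if it was rolled back.
    """
    baseline = [sample_cpu() for _ in range(samples)]
    deploy()
    after = [sample_cpu() for _ in range(samples)]
    if should_roll_back(baseline, after):
        roll_back()
        return False
    return True


# Simulated CPU readings: about 30% before the deploy, climbing toward 90% after.
readings = iter([30, 32, 31, 29, 30, 70, 75, 80, 85, 88])
actions = []
kept = guarded_deploy(
    deploy=lambda: actions.append("deploy"),
    roll_back=lambda: actions.append("rollback"),
    sample_cpu=lambda: next(readings),
)
```

In this simulation no user-facing failure has happened yet, but mean CPU has more than doubled, so the release is reverted. A production version would space the samples out over time rather than reading them back to back.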