DataDog is having a major outage across almost all services (status.datadoghq.com)

156 points by sercant 3 years ago | 39 comments

donutshop 3 years ago |

Latest scoop:

At 06:00 UTC on March 8th, 2023 the Datadog platform started experiencing widespread issues across multiple products. The web application was unavailable or intermittently loading, and data ingestion & monitor evaluation were delayed.

We have identified and remedied the issue that caused this outage. We will prepare and share a detailed root cause analysis as soon as possible after our incident response is complete, but we can share a preliminary analysis now. A critical software update applied to a broad set of hosts in our infrastructure caused a subset of these hosts to lose network connectivity.

The primary impact of this was that several of our regional Kubernetes clusters became unhealthy, affecting the control plane that keeps our workloads running smoothly. At this point, we believe we have repaired all the affected Kubernetes clusters, and our recovery efforts are now focused on the application layer above this.

The web application is now generally available, although data and monitor evaluation remains delayed in some cases (refer to the Status Page in your region for the latest information). We have made substantial progress on restoring the various core services that were impacted by the incident, and have now moved on to getting our data processing pipelines for metrics, logs, traces, and other data into a healthy state.

It is difficult to give a precise ETA on our full recovery and we are focusing our efforts on restoring real-time data and alerts within a matter of hours (not minutes, but also not days). The recovery of historical data (between the start of the outage and 15 minutes in the past) has been deprioritized.

We understand the impact an outage can have, and are sorry for the disruption.

mlhpdx 3 years ago |

Ironically, DD promotes using their tool to set and measure SLAs but has a low bar on their SLA:

> Excluding scheduled maintenance windows, Datadog will use commercially reasonable efforts to maintain 99.8% availability of the hosted portion of the Service for each calendar month during the term of this Agreement. The Service will be deemed “available” so long as Authorized Users are able to login to the Service interface and access monitoring data. Excluding planned maintenance periods, in the event the Service availability drops below 99.8% for two consecutive months, Customer may terminate the Service in the calendar month following such two-month period upon written notice to Datadog. To assess uptime, Customer may, if under a Paying Plan, request the Service availability for a prior month by filing a support ticket through the Site.

palijer 3 years ago | |

I don't understand how that is ironic.

Doesn't seem like that SLA could be defined as a "low bar" to me honestly, 99.8% in writing is impressive. It's public as well, meaning if you need a better SLA they aren't the ones for you.

New Relic is 98.5% https://docs.newrelic.com/docs/licenses/license-information/...

everforward 3 years ago | | |

There are two fronts to why it's a low bar.

The first is that it doesn't cover key platform features. I don't see anything about error rates on metric ingestion or error rates/timing on sending out alerts. Being able to log in and look at metrics is like 4th or 5th on my list of things I care about. It also doesn't preclude a severely degraded service being considered up (e.g. a 25% error rate, but refreshing enough times will get it to load). Dynatrace, by comparison, does count the service as unavailable if it's unable to receive any inbound data.

The second is that their SLA doesn't give out credits, it just allows you to cancel your contract in the calendar month following 2 months of not hitting their SLA. In other words, using their SLA means finding a new provider and migrating within ~30 days. It also means there's no real penalty to them for violating their SLA, since customers upset about the uptime would just not renew their contract. This just lets that happen at an accelerated rate.

99.8% is also not that high of an SLA (especially with what it covers). That's ~1.5 hours of downtime per month, which I would consider pretty average or even mediocre. That's almost a half hour outage per week. To me, 99.9% is good (~45 minutes/month) and 99.99% is impressive (~4 minutes/month).
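For anyone who wants to check the arithmetic above, here is a minimal sketch. It assumes a 30-day month (the SLA text only says "calendar month"), and the function name `allowed_downtime_minutes` is made up for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare the three availability levels mentioned above.
for pct in (99.8, 99.9, 99.99):
    per_month = allowed_downtime_minutes(pct)
    per_week = allowed_downtime_minutes(pct, days=7)
    print(f"{pct}%: {per_month:.1f} min/month, {per_week:.1f} min/week")
```

Under that assumption, 99.8% works out to about 86 minutes per month (roughly 20 minutes per week), 99.9% to about 43 minutes per month, and 99.99% to a bit over 4 minutes per month.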

t0mas88 3 years ago | | |

The low bar is:

- It needs to be missed 2 consecutive months before it applies

- You can't see the uptime, have to submit support tickets to get it

- And then you only get to cancel a bit earlier (after 2 months of fuckups), not even a service-credit or refund

It's a completely useless SLA

Ancalagon 3 years ago | |

Can't measure uptime if you have no metrics (points to head)

specialdragon 3 years ago |

We're up to 10 hours of downtime on their services.

Anyone know what could have caused this? Companies generally [citation needed] don't go down for half a day across all their services.

palijer 3 years ago | |

I think having very infrequent large long outages is the norm here actually.

I can't think of any company this size that didn't have some outage of this magnitude at least once.

Facebook BGP for instance, Slack in Feb of '22, Cloudflare in June, YouTube, Twitch, Sony PlayStation etc etc have all had incidents this wide and long.

indigodaddy 3 years ago | | |

Roblox had a super long one didn't they a year or two ago?

palijer 3 years ago |

All of Datadog's auto-muting logic during incidents is super well thought out and impresses me every time.

I'd have a few incidents open at 2am for missing business metrics and hosts falling out of the sky due to this if they didn't have that logic there, but instead we've sent out no false alerts for this.

civicsquid 3 years ago | |

For those who haven’t used Datadog, what do you mean by auto-muting logic? What does it do exactly?

palijer 3 years ago | | |

We have alerts set up that expect metrics for things like "orders placed" to always be happening at expected rates.

When Datadog has the very rare outage that breaks ingestion, all of our alerts would normally go off because we aren't seeing the expected volume of "orders placed". That would open StatusPage incidents for us and our customers, call the pagers, and get folks working.

But instead they automatically suppress the false alerts that their outage would otherwise trigger. Saves me a lot of headaches.

Stuff like this is why I am happy paying the Datadog bills. Even their outages are good.
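To make the idea concrete, here is a rough sketch of that kind of muting. It is not Datadog's actual logic, which isn't described in this thread; the function `evaluate_monitor`, its parameters, and the three state names are all hypothetical.

```python
from typing import Optional


def evaluate_monitor(orders_per_min: Optional[float],
                     min_expected: float,
                     ingestion_healthy: bool) -> str:
    """Return "MUTED", "ALERT", or "OK" for an 'orders placed' style monitor."""
    # If the monitoring provider itself can't ingest data, missing or low
    # values say nothing about our own system, so don't page anyone.
    if not ingestion_healthy:
        return "MUTED"
    # Ingestion is fine, so missing data or a low rate is a real signal.
    if orders_per_min is None or orders_per_min < min_expected:
        return "ALERT"
    return "OK"
```

The point is only the ordering of the checks: the health of the ingestion path is looked at before the metric value, so a provider-side outage never reaches the alerting branch.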

jlmorton 3 years ago |

> Mar 08, 2023 - 13:14 EST > Update - We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

> Mar 08, 2023 - 12:29 EST > Update - We continue progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

> Mar 08, 2023 - 11:46 EST > Update - We continue progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.

-----

I won't post all of it, but you get the picture. Datadog's status updates go on like this for 12 hours. This is a status update anti-pattern. These updates add no value, except maybe some small reassurance that Datadog hasn't forgotten they are down, and they continue to work on it.

But it's actually worse than no update at all, because now every customer needs to parse through a whole mess of these updates to try and figure out what's happening, and if you subscribe to the updates, you start getting spam with no new information every 45 minutes.

I know companies are in a bind here. They don't want to provide estimates they might miss, or ad hoc engineering info without vetting. On the other hand, customers complain about a lack of communication if there are no updates. But spamming your status page like this is not a real update, it's a pretend update. Just something to point to when customers complain about a lack of communication, but ultimately still a lack of communication.

palijer 3 years ago | |

I agree they don't add anything and do subtract, but I think it is the lesser of the two evils.

I find myself wasting a lot of time refreshing, and force reloading JavaScript on incident updates if they haven't had an update in hours.

I'd much rather have a look and see "cool, the auto-communicate thing is still broadcasting the same message as of 12 minutes ago".

StatusPage is definitely missing a "last update still current as of $time" option, but I prefer the repetition personally.

It likely also reduces customer anger of the "there hasn't been an update in hours!?!" kind, as seen with the recent Atlassian outage and Okta as well.

jlmorton 3 years ago | | |

> It likely reduces customer anger

I think you're probably right, and I think that's what ultimately irks me. It's like an infinite no-update-required hack.

With that said, the recent updates have been informative, so I'll stop complaining. Cheers to the team working through what must've been a tough problem.

ultrasaurus 3 years ago |

There's some 3rd party uptime tracking here that might be useful: https://app.metrist.io/demo/datadog

mlrtime 3 years ago |

It's been hard down for 5 hours now.

sgt 3 years ago | |

At least their self monitoring works.

ollien 3 years ago | | |

I joked about this with a coworker, but I do have to wonder what they actually use for monitoring internally. It would be interesting if they just have a second copy of the prod stack for internal monitoring, or something.

masterSshKey 3 years ago | | |

fact: datadog self monitoring is datadog.

indigodaddy 3 years ago |

Still getting "500 Internal Server Error, see context.apiResponse for more details" on most monitors in event history..

justinzollars 3 years ago |

This affected a production deploy today - our team did not know Datadog was down prior to deployment. Horrible.

palijer 3 years ago | |

We have "datadog outage" as an abort condition that is checked before any deploys or operations. I highly suggest implementing something like that for things you depend on.

If y'all are slinging code, you are aware that nothing is 100% available. A consumer failing to anticipate that isn't really the provider's fault.

temp_praneshp 3 years ago | | |

What actual metric do you monitor for "datadog outage"? Simply have the deploy tooling make an API request/something else?

Our playbook asks the deployer to take a look at dashboard X, but automating this would be nicer for some of our CD pipelines
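One way this could be automated, as a sketch only: have the pipeline poll the provider's status page before deploying. This assumes the status page is hosted on Statuspage and exposes the usual `/api/v2/status.json` summary, which nobody in this thread has confirmed; `fetch_status`, `provider_is_healthy`, and `should_abort_deploy` are hypothetical names.

```python
import json
from urllib.request import urlopen

# Assumption: the status page serves the standard Statuspage summary here.
STATUS_URL = "https://status.datadoghq.com/api/v2/status.json"


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> dict:
    """Download and decode the status summary JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def provider_is_healthy(payload: dict) -> bool:
    """True only when the reported indicator is "none" (no open incident)."""
    indicator = payload.get("status", {}).get("indicator")
    return indicator == "none"


def should_abort_deploy(fetch=fetch_status) -> bool:
    """Abort when the provider reports an incident or cannot be reached."""
    try:
        return not provider_is_healthy(fetch())
    except Exception:
        # Fail closed: if we can't tell whether monitoring works, don't deploy.
        return True
```

A CD step would call `should_abort_deploy()` and stop the pipeline when it returns `True`. Failing closed is a design choice; a team that prefers to deploy when the status page itself is unreachable would return `False` in the `except` branch instead.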

amanj41 3 years ago | |

Out of curiosity, what datadog services do you use that cause deployments to be interrupted when they are having an outage?

justinzollars 3 years ago | | |

Professionally it is a very bad idea to deploy to production without log monitoring and alerts to see the effects of a deployment.

mjcl 3 years ago | | |

Different person, but during a deploy the pipeline checks some of the APM metrics (error rate, new errors, latency) to determine if it should roll back the deploy. A previous company would do a canary deployment (like 2 out of 60 instances), wait 20 minutes and only proceed to a full deploy if the canary instances had similar error & latency rates.
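The canary gate described above might look roughly like this. It is a sketch, not that company's pipeline: the `Stats` type, the choice of p95 latency, and both tolerance values are assumptions picked for illustration, since "similar" was not defined.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile latency in milliseconds


def canary_is_healthy(canary: Stats, baseline: Stats,
                      max_error_delta: float = 0.005,
                      max_latency_ratio: float = 1.2) -> bool:
    """Compare canary instances against the rest of the fleet.

    The canary passes if its error rate is at most `max_error_delta` above
    the baseline and its latency is at most `max_latency_ratio` times the
    baseline latency.
    """
    error_ok = canary.error_rate <= baseline.error_rate + max_error_delta
    latency_ok = canary.p95_latency_ms <= baseline.p95_latency_ms * max_latency_ratio
    return error_ok and latency_ok
```

After the wait period, the pipeline would collect both sets of numbers from the APM tool and proceed to the full deploy only if `canary_is_healthy(...)` returns `True`. Note that this check depends on the monitoring data being available, which is exactly what breaks during an outage like this one.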

akagarwa 3 years ago |

At least it is a buying opportunity for the DDOG stock. Thank you DDOG!

sgreene570 3 years ago | |

underrated comment. DDOG to the DDOGE moon.