DataDog is having a major outage across almost all services (status.datadoghq.com)
At 06:00 UTC on March 8th, 2023 the Datadog platform started experiencing widespread issues across multiple products. The web application was unavailable or intermittently loading, and data ingestion & monitor evaluation were delayed.
We have identified and remedied the issue that caused this outage. We will prepare and share a detailed root cause analysis as soon as possible after our incident response is complete, but we can share a preliminary analysis now. A critical software update applied to a broad set of hosts in our infrastructure caused a subset of these hosts to lose network connectivity.
The primary impact of this was that several of our regional Kubernetes clusters became unhealthy, affecting the control plane that keeps our workloads running smoothly. At this point, we believe we have repaired all the affected Kubernetes clusters, and our recovery efforts are now focused on the application layer above this.
The web application is now generally available, although data and monitor evaluation remains delayed in some cases (refer to the Status Page in your region for the latest information). We have made substantial progress on restoring the various core services that were impacted by the incident, and have now moved on to getting our data processing pipelines for metrics, logs, traces, and other data into a healthy state.
It is difficult to give a precise ETA on our full recovery and we are focusing our efforts on restoring real-time data and alerts within a matter of hours (not minutes, but also not days). The recovery of historical data (between the start of the outage and 15 minutes in the past) has been deprioritized.
We understand the impact an outage can have, and are sorry for the disruption.
> Excluding scheduled maintenance windows, Datadog will use commercially reasonable efforts to maintain 99.8% availability of the hosted portion of the Service for each calendar month during the term of this Agreement. The Service will be deemed “available” so long as Authorized Users are able to login to the Service interface and access monitoring data. Excluding planned maintenance periods, in the event the Service availability drops below 99.8% for two consecutive months, Customer may terminate the Service in the calendar month following such two-month period upon written notice to Datadog. To assess uptime, Customer may, if under a Paying Plan, request the Service availability for a prior month by filing a support ticket through the Site.
Doesn't seem like that SLA could be defined as a "low bar" to me honestly, 99.8% in writing is impressive. It's public as well, meaning if you need a better SLA they aren't the ones for you.
New Relic is 98.5% https://docs.newrelic.com/docs/licenses/license-information/...
The first is that it doesn't cover key platform features. I don't see anything about error rates on metric ingestion or error rates/timing on sending out alerts. Being able to log in and look at metrics is like 4th or 5th on my list of things I care about. It also doesn't preclude a severely degraded service being considered up (e.g. a 25% error rate, but refreshing enough times will get it to load). DynaTrace, by comparison, does count the service as unavailable if it's unable to receive any inbound data.
The second is that their SLA doesn't give out credits, it just allows you to cancel your contract in the calendar month following 2 months of not hitting their SLA. In other words, using their SLA means finding a new provider and migrating within ~30 days. It also means there's no real penalty to them for violating their SLA, since customers upset about the uptime would just not renew their contract. This just lets that happen at an accelerated rate.
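A minimal sketch of how the termination clause quoted above plays out, assuming you already have a list of monthly availability percentages (the numbers used here are hypothetical, and `termination_right_triggered` is an illustrative name, not anything Datadog provides):

```python
def termination_right_triggered(monthly_availability_pct, threshold=99.8):
    """Return True if any two consecutive months both fell below the threshold.

    monthly_availability_pct: availability per calendar month, oldest first.
    threshold: the SLA target from the quoted agreement (99.8%).
    """
    below = [pct < threshold for pct in monthly_availability_pct]
    # Pair each month with the one that follows it and look for two misses in a row.
    return any(first and second for first, second in zip(below, below[1:]))


# Hypothetical year fragment: one very bad month surrounded by good ones.
one_bad_month = [99.95, 98.3, 99.9]
# Hypothetical year fragment: two mediocre months back to back.
two_bad_months = [99.95, 99.7, 99.6]
```

Under these assumptions, `termination_right_triggered(one_bad_month)` is False and `termination_right_triggered(two_bad_months)` is True: a single month with a long outage never triggers the clause on its own.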
99.8% is also not that high of an SLA (especially with what it covers). That's ~1.5 hours of downtime per month, which I would consider pretty average or even mediocre. That's almost a half hour outage per week. To me, 99.9% is good (~45 minutes/month) and 99.99% is impressive (~4 minutes/month).
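For anyone who wants to check the arithmetic, here is a quick sketch that converts an availability percentage into a downtime budget. It assumes a 30-day month and counts every minute equally (no maintenance-window exclusions), which is a simplification of the quoted SLA:

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


# Availability levels mentioned in this thread.
for pct in (98.5, 99.8, 99.9, 99.99):
    per_month = downtime_budget_minutes(pct)
    per_week = downtime_budget_minutes(pct, days=7)
    print(f"{pct}%: {per_month:.1f} min/month, {per_week:.1f} min/week")
```

With those assumptions, 99.8% works out to about 86 minutes per month (roughly 20 minutes per week), 99.9% to about 43 minutes per month, and 99.99% to a little over 4 minutes per month. The 98.5% figure quoted for New Relic allows about 648 minutes, or close to 11 hours, per month.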
- It needs to be missed 2 consecutive months before it applies
- You can't see the uptime; you have to submit support tickets to get it
- And then you only get to cancel a bit earlier (after 2 months of fuckups), not even a service-credit or refund
It's a completely useless SLA
Anyone know what could have caused this? Companies generally [citation needed] don't go down for half a day across all their services.
I can't think of any company this size that didn't have some outage of this magnitude at least once.
Facebook BGP for instance, Slack in Feb of '22, Cloudflare in June, YouTube, Twitch, Sony PlayStation etc etc have all had incidents this wide and long.
I'd have a few incidents open at 2am for missing business metrics and hosts falling out of the sky due to this if they didn't have that logic there, but instead we've sent out no false alerts for this.
When datadog has the very rare outage that breaks ingestion, all of our alerts would normally go off because we aren't seeing the expected volume of "orders placed" and open up StatusPage incidents for us and our customers, call the pagers and get folks working.
But instead they automatically stop any false alerts that would normally alert here because of their outage. Saves me a lot of headaches.
Stuff like this is why I am happy paying the Datadog bills. Even their outages are good.
> Mar 08, 2023 - 12:29 EST > Update - We continue progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
> Mar 08, 2023 - 11:46 EST > Update - We continue progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
-----
I won't post all of it, but you get the picture. Datadog's status updates go on like this for 12 hours. This is a status update anti-pattern. These updates add no value, except maybe some small reassurance that Datadog hasn't forgotten they are down, and they continue to work on it.
But it's actually worse than no update at all, because now every customer needs to parse through a whole mess of these updates to try and figure out what's happening, and if you subscribe to the updates, you start getting spam with no new information every 45 minutes.
I know companies are in a bind here. They don't want to provide estimates they might miss, or ad hoc engineering info without vetting. On the other hand, customers complain about a lack of communication if there are no updates. But spamming your status page like this is not a real update, it's a pretend update. It's just something to point to when customers complain about a lack of communication, but ultimately it's still a lack of communication.
I find myself wasting a lot of time refreshing, and force reloading JavaScript on incident updates if they haven't had an update in hours.
I'd much rather have a look and think "cool, the auto-communicate thing is still broadcasting the same message as of 12 minutes ago".
StatusPage is definitely missing a "last update still current as of $time" option, but I prefer the repetition personally.
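To make the "last update still current as of $time" idea concrete, here is a rough sketch of how repeated updates could be collapsed on the reader's side. This is not a StatusPage feature; `Update`, `CollapsedUpdate`, and `collapse_repeats` are hypothetical names, and the input is assumed to be a list of already-parsed (timestamp, text) updates:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Update:
    posted_at: datetime
    text: str


@dataclass
class CollapsedUpdate:
    text: str
    first_posted: datetime
    still_current_as_of: datetime


def collapse_repeats(updates):
    """Merge runs of consecutive identical updates into a single entry.

    Returns entries oldest first; each keeps the time the message was first
    posted and the most recent time it was repeated unchanged.
    """
    collapsed = []
    for update in sorted(updates, key=lambda u: u.posted_at):
        if collapsed and collapsed[-1].text == update.text:
            # Same message again: only move the "still current" marker forward.
            collapsed[-1].still_current_as_of = update.posted_at
        else:
            collapsed.append(
                CollapsedUpdate(update.text, update.posted_at, update.posted_at)
            )
    return collapsed
```

Fed the two identical updates quoted above (11:46 and 12:29), this would yield one entry first posted at 11:46 and still current as of 12:29, which keeps the reassurance without the noise.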
It likely also reduces the number of angry customers saying "there hasn't been an update in hours!?!", as happened with the recent Atlassian and Okta outages.
I think you're probably right, and I think that's what ultimately irks me. It's like an infinite no-update-required hack.
With that said, the recent updates have been informative, so I'll stop complaining. Cheers to the team working through what must've been a tough problem.
If y'all are slinging code, you are aware that nothing is 100% available; a consumer failing to anticipate that isn't really the provider's fault.
Our playbook asks the deployer to take a look at dashboard X, but automating this would be nicer for some of our CD pipelines
Datadog doesn't go down often enough for me to invest time in automating locking deploys based on it.
For anything bigger than a regular code deploy we typically have a runbook ahead of time, and in our template we have a manual check for "make sure datadog is operational" that needs to be checked off on the call. Same with GitHub, CircleCI, AWS, etc., all because we got burned once and in the postmortem identified that a simple "preflight checklist" would have prevented the issue from lasting so long.
It's a good sanity check; reading The Checklist Manifesto influenced me here. We work in complex systems, gotta make sure all the stuff is in working order before takeoff.
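If someone did want to automate the "make sure datadog is operational" preflight item, a rough sketch might look like the following. It assumes the status page is hosted on Atlassian Statuspage and exposes the standard `/api/v2/status.json` summary with a `status.indicator` field; verify that for your region's page before relying on it. `fetch_status` and `deploy_allowed` are hypothetical helper names, and which indicators block a deploy is a policy choice, not something from this thread:

```python
import json
import urllib.request

# Assumption: standard Statuspage v2 endpoint on the status page's own domain.
STATUS_URL = "https://status.datadoghq.com/api/v2/status.json"

# Statuspage indicators are "none", "minor", "major", or "critical".
# Assumed policy: only these two are considered safe to deploy under.
SAFE_INDICATORS = {"none", "minor"}


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> dict:
    """Download and parse the status summary JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def deploy_allowed(payload: dict) -> bool:
    """Return True only when the reported indicator is in the safe set.

    A missing or unrecognized indicator fails closed (deploy blocked).
    """
    indicator = payload.get("status", {}).get("indicator")
    return indicator in SAFE_INDICATORS
```

In a CD pipeline step, `deploy_allowed(fetch_status())` returning False could fail the job or require a manual override. Keeping the decision in a pure function separate from the network call makes the policy easy to test without touching the status page.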
Opting out would just mean all your missing data alerts fire every time Datadog has an incident and you would then check, see that everything is missing, and then identify the cause as the Datadog incident.
It's much better to have them handle it and auto-mute the impacted monitors than communicate to my customers every time about false alerts saying all our services are down.
You are missing the last step, which is that, knowing alerts are down, you can actively monitor using other tools/reporting for the duration of their incident.
And why would you have no logs? Even assuming you ingest logs through Datadog (they monitor on much more than just logs and not everyone uses all facets of their offering), you would presumably have some way to access them more directly (even tailing output directly if necessary).
And lastly, why would you communicate to your customers without any idea of the scope or cause of the issue? It would likely be clear very quickly that Datadog was having issues when you see that all your metrics are suddenly discontinued without other ill effect.
Cute, but it gets the point across: watchmen for the watchmen, with each layer slightly less mission-critical than the last.
If you just want notifications for when datadog is down, their StatusPage does a fine job of clearly communicating incidents.
I wouldn't want to rely on a "when multiple of our 'missing business metric' monitors alert, check and see if datadog is down" step in a runbook. I don't like false alerts. I don't like paging folks about false alerts. Waking up an oncall dev at 2am saying all of production is down when it is just datadog is bad for morale. Alert fatigue is a real and measurable issue with consequences. Avoiding false alerts is good. If the notification says "all of production is down" and that isn't the case, there is impact for that. I'd much prefer having a StatusPage alert at a lower severity and communication level say "datadog ingestion is down".
Instead, use their StatusPage notifications and then execute your plan from that notification, not all of your alerts firing.
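As a toy illustration of that routing preference (not how Datadog's auto-muting actually works, which this thread doesn't describe in detail), the decision could be sketched like this. `Route`, `route_alert`, and the `"missing_data"` alert type are made-up names for the example:

```python
from enum import Enum


class Route(Enum):
    PAGE_ONCALL = "page_oncall"
    LOW_SEVERITY_NOTICE = "low_severity_notice"


def route_alert(alert_type: str, vendor_incident_active: bool) -> Route:
    """Decide where an alert goes.

    Missing-data alerts are downgraded while the monitoring vendor reports an
    active incident, since the gap is most likely on their side. Every other
    alert still pages, because it carries an actual signal.
    """
    if alert_type == "missing_data" and vendor_incident_active:
        return Route.LOW_SEVERITY_NOTICE
    return Route.PAGE_ONCALL
```

So a "no orders placed in 15 minutes" monitor firing during a vendor ingestion incident becomes a low-severity notice, while the same monitor firing on a normal day still wakes someone up.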
>And why would you have no logs?
I mean Datadog logs/metrics etc. Currently, we are missing everything from them. We can still ssh into things etc, they aren't gone, but from the Datadog monitor's view in this scenario, they stopped seeing logs/metrics and would alert if Datadog didn't automatically mute them.
>why would you communicate to your customers without any idea of the scope or cause of the issue?
We prioritize Time To Communicate as a metric. When we notice an issue in production, we want customers to find out from us that we are investigating, instead of troubleshooting and encountering the issue themselves, getting mad, and clogging up our support resources. Flaky alerts here don't work at all for us.