From my experience in AWS, part of the problem is scope of impact. It's easy to lose track of just how many active customers there are at any time, and it's easy to see the platform as a cohesive whole, i.e. "If it's affecting you it must be affecting everyone else". In reality almost every customer-impacting event affects only a tiny percentage of the active users at any one time. I know it can be hard to believe or see this as an external customer, because after all the service appears to be down to you.
Take, for example, when people start saying "us-east-1a" is down. What is "us-east-1a"? If you've watched some of the re:Invent talks you'll know that it actually describes numerous data centres in close proximity (within a certain millisecond network target). If one of those has an incident, it might look to some customers like "us-east-1a" is down, when the reality might be that 95%+ of the data centres are still fully functional and most customers aren't seeing an impact.
You might have an incident affecting just 2% of the API calls and less than 2% of the user base (even that would be unusually large and a source of big drama internally). The service could be super stable and extremely reliable, but the other 98% could get completely the wrong idea if they saw a status alert. And of course, from a PR perspective, the same goes for anyone looking to use the platform.
A service dashboard is an extremely blunt tool for conveying service status. It reduces an extremely nuanced situation to "All good, maybe, no, DEAD".
To give a rough example, one service I was familiar with had a "page everyone in the team" level of incident. API availability tanked, badly. It looked atrocious, and seemed like hardly any requests were getting through successfully. You'd have every expectation that they should at least post a yellow alert, if not something approaching red. It turned out that it was one single customer whose requests were failing (I forget the reason why), but due to a bug in the customer's software consuming the API, every time it got a 500 response it would immediately resend the request, every single time, with no timeout or retry limit. It reached such a terrific pace that their requests made up a huge majority of all the requests hitting the endpoint. Every other customer using the service was completely fine. If you'd looked at the API graphs you'd think "POST YELLOW, POST YELLOW, NOW NOW NOW!", but because they took time to figure out the actual impact, they found out that would have been totally the wrong thing to do.
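For what it's worth, the client-side bug in that story is the kind of thing a bounded retry with backoff is meant to prevent. Here's a minimal sketch of the idea, not that customer's code or any AWS SDK's implementation: `ServerError`, `call_with_retries`, and all the default values are hypothetical names and numbers I've made up for illustration.

```python
import random
import time


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP 5xx response."""


def call_with_retries(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call send() until it succeeds or max_attempts calls have failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ServerError:
            if attempt == max_attempts:
                # Give up rather than hammering the endpoint forever.
                raise
            # Exponential backoff, capped at max_delay, with random jitter
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is only there so the wait can be swapped out in tests. The important parts are the attempt cap and the growing delay: either one on its own would have stopped a single client from dominating the endpoint's traffic.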
Service health dashboards are a neat idea, but one that is in desperate need of a rethink and overhaul. They have some value when you're a smaller service, but they just don't scale accurately with the platform.
I'm not sure what the real solution is. They've somehow got to pull together terabytes of logs and/or metrics to make an accurate assessment of the scenario, and do it in a matter of minutes, so as to provide accurate updates without needlessly panicking customers.
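To make the gap between "what the graph says" and "who is actually affected" concrete, here's a toy sketch that computes both a request-level error rate and a customer-level impact rate from the same records. It assumes each log record boils down to a `(customer_id, status_code)` pair, which is my simplification; `summarize_impact` and the sample data are invented for illustration and say nothing about how AWS does this internally or at terabyte scale.

```python
from collections import defaultdict


def summarize_impact(records):
    """records: iterable of (customer_id, status_code) pairs."""
    total = failed = 0
    per_customer = defaultdict(lambda: [0, 0])  # customer -> [requests, failures]
    for customer_id, status in records:
        is_failure = status >= 500
        total += 1
        failed += is_failure
        per_customer[customer_id][0] += 1
        per_customer[customer_id][1] += is_failure
    affected = sum(1 for _, failures in per_customer.values() if failures > 0)
    return {
        "request_error_rate": failed / total if total else 0.0,
        "customers_affected": affected,
        "customer_impact_rate": affected / len(per_customer) if per_customer else 0.0,
    }


# Toy data: 99 healthy customers making 10 calls each,
# plus one customer stuck in a retry loop against a failing call.
records = [(f"cust-{i}", 200) for i in range(99) for _ in range(10)]
records += [("cust-retry-storm", 500)] * 9010
summary = summarize_impact(records)
print(summary)
```

With this made-up data the request-level error rate comes out at about 90%, while only 1 customer in 100 sees any failure at all. The first number is what an availability graph shows; the second is closer to what a status page ought to reflect.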