AWS Cognito is having issues and health dashboards are still green (status.aws.amazon.com)
Is there a status website for AWS Status?
Annoyingly, they expect you to do the leg work to show when the outage happened and supply logs demonstrating that you were impacted.
Might want to do some napkin math first to see if the amount of credit is worth your time. The couple of times my org considered pursuing it, it just wasn't worth the effort. (Though, personally, I think that speaks to a larger problem with the SLA.)
Credit Request Procedure in Kinesis SLA: https://aws.amazon.com/kinesis/sla/#Credit_Request_and_Payme...
> upstream connect error or disconnect/reset before headers. reset reason: overflow
And request timeouts against cognito-idp.us-east-1.amazonaws.com
And the cognito console won't load
Their ETA, 2 hours, and then try contacting again!
You're right that there's definitely some internal coupling though:
> If you want to require HTTPS between viewers and CloudFront, you must change the AWS Region to US East (N. Virginia) in the AWS Certificate Manager console before you request or import a certificate.
From https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...
(I think it's also pretty rare for an already configured cloudfront to suffer from issues on the control planes. Cloudfront configuration updates are painfully slow even under normal circumstances, and that's probably because the configuration is heavily replicated to all POPs)
Happened faster than I thought, but based on reading the comments about people who work(ed) there, this seems cut and dried to me.
This is when you fall back to the Tumblr blog for status updates.
<rimshot>
I guess the lawyers of those who paid for uptime guarantees...
Never trust that. Deploy in multiple regions (and AZs within those regions) if you really cannot tolerate any downtime.
"amazon-cognito-identity-js": "^3.2.2" "aws-amplify": "^2.2.2"
It is reported now in their service health dashboard.
Oh, wait! EY, PWC, and who can forget Arthur Andersen!
But, naturally, technology people can solve this better than anyone else, right?
EDIT: nevermind, the Post is back, and Kinesis is still erroring.
Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.
After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.
FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.
Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.
As a dev oncall, we used to get 20 sev2s per day (an oncall ticket which needs to be handled within 15 mins), so most of the time things are broken; it's just that it's not visible to external customers through the dashboard.
The status page is entirely manually updated.
There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_ . That sounded backwards, so I dug a bit more.
Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.
That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.
We ended up removing the public dashboard and using other mechanisms to notify customers.
I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.
AWS and amazon in general espouse all sorts of values relating to taking responsibility and owning problems.
What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.
I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...
But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?
If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.
Resolving a COE can even be a positive if you know how to spin it; at least that was the case when I was there. I'm not sure whether things have changed.
COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.
Writing a COE is kind of an admission of guilt, and I have definitely seen promotions get delayed. During perf reviews, managers of other teams often raise a COE as a point against the person going for promotion.
Even if you don't know what to say, still post an update saying so, so the rest of us can report to our teams and make decisions about our own work lives and personal lives.
- https://github.com/dexidp/dex
- https://github.com/authelia/authelia
Well, this is a major outage
Before someone replies and says use a different AZ, that's not possible for everyone. If you use a 3rd party service that is hosted on us-east-1 you can't do anything about it. For example, many Heroku services are broken because of this.
[0] https://twitter.com/apgwoz/status/1292519906433306625?s=20
In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.
AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.
AWS will be in deep trouble when/if GCE fixes their customer support.
You seem to have insight on AWS's engineering practices. From your point of view what should be changed?
It really does seem that, more often than not, the status page shows all green traffic lights whenever there is an outage, which makes it useless as a tool to corroborate what's happening.
How did AWS status page compare with status.io/aws?
We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.
Has anyone figured out how to set up Cognito in multiple regions without the hijinx of having the customer set up trusts for each region? Not to mention, while I think multiple trusts are possible with ADFS (not that I've tested it), I'm pretty sure that Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...
Of course you'll have to deal with home realm discovery--really need to go in with open eyes on that one.
Is that not a massive catch-22 for a service dashboard?
Cloudflare does it right for their status page (https://www.cloudflarestatus.com). They don't use Cloudflare itself for it (you can tell because /cdn-cgi/trace returns nothing), the actual backend is Atlassian Statuspage, their TLS certificate is issued by Let's Encrypt instead of Cloudflare itself, and it's on a completely separate domain for DNS purposes.
$ whois cloudflarestatus.com
Registrar: Cloudflare, Inc.
Do you have a link for more details?
Last sentence of the alert at the top of the page.
I think the other explanations sound plausible. There is no technical difficulty here that AWS can't solve -- it's political. Having an outage with a status page makes you liable for your SLAs.
https://downdetector.co.uk/status/visa/map/
I am unable to order my Papa Johns pizza
This is why I prefer 3rd party monitoring systems to track health of my internal monitoring systems.
- 7 cloudfront distributions created today are still in "InProgress", a few already for more than one hour
- The support case I created about it doesn't show up in my support portal. Direct link to it does work though
Was posted 8 minutes ago.
Seems like they fixed Cognito while Kinesis and many other services are still broken - presumably somehow removing the dependency on Kinesis? It’ll be really interesting if their post mortem explains this mitigation.
Then the status page would be almost entirely useless ...
We're like Stripe for SSO/SAML auth. Docs here: https://workos.com/docs
Here's our HN launch: https://news.ycombinator.com/item?id=22607402
Regular Joes like us can use AWS, GCE, on premises, some non-reseller colocation provider, etc., and create failover duplicates, alternative deploy targets, or simply not ever have a complete outage due to the unlikelihood of all of these things failing at once.
Disclosure: I'm an employee of FusionAuth, and while there is a forever free community edition, it is free as in beer, not as in speech.
ory looks like a really good project
Here's a reddit with a bunch of posts you could sift through: https://www.reddit.com/r/KeyCloak/
The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
Heh, maybe they accidentally locked themselves out of IAM, since those are great fun to troubleshoot, also
All on the eve of thanksgiving.
Having lots of services that do one thing and one thing well makes a lot of sense. Breaking them out into separate components brings a level of visibility into the system. And it's AWS's whole business model.
But it does mean that, fundamentally, service X is available when and only when (WAOW?) services A, B, C, etc. are all available. Its uptime is no greater than min(uptime(A), uptime(B), etc)
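To put rough numbers on that bound: if the dependencies fail independently (an assumption, and often an optimistic one), the availability of a service that needs all of them is the product of their availabilities, which is always at or below the min. A minimal sketch, with made-up availability figures purely for illustration:

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial_availability(dependency_availabilities):
    """Availability of a service that is up only when every dependency is up.

    Assumes the dependencies fail independently of one another.
    """
    return prod(dependency_availabilities)


def expected_downtime_hours(availability):
    """Expected hours of downtime per (non-leap) year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical availabilities for services A, B and C (illustrative only).
deps = [0.999, 0.9995, 0.9999]

combined = serial_availability(deps)
upper_bound = min(deps)

print(f"min(uptime) bound : {upper_bound:.6f}")
print(f"product estimate  : {combined:.6f}")
print(f"downtime per year : {expected_downtime_hours(combined):.1f} hours")
```

Every extra hard dependency multiplies in another factor below 1, so the combined figure only goes down as the diagram grows.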
I'm trying to rework the authentication for our application and integrate it with our parent company's systems. As we talk to other teams, I see all these architecture diagrams where the solution to every problem is Yet Another Service, to where you're running a real rube goldberg machine.
No matter how much you value science and engineering, it ultimately doesn't matter to the business unless that aligns directly with their revenue stream. Sometimes it does, sometimes it doesn't.
When you're advertising uptime/availability, you're motivated not to report downtime/unavailability. Then the value of such reports is lost; developers start banging their heads trying to figure out if it's a service outage or a bug in their software (yes, informed by personal experience).
The main change they made in 2017 was the ability to post a message at the top of the page that is independent of the status of the individual items below. IIRC, it was the items they couldn't update. So that is kind of a hack, but it works.
It would be ideal if it was hosted entirely on completely separate infrastructure, and even a separate domain, but I won't hold my breath. Theirs is still more reliable than, for example, the IBM Cloud status page which was hard down during their epic outage back in June.
Luckily my company decided against multi-az for the cost savings so I spent all day firefighting.
Failure happens at the speed of computing but agreeing that something is failing in a way that customers need to be told about is a slower process.
Even when status pages are fully automatic (rather than manually updated), there will tend to be gaming of the metrics that constitute that.
Ideally you would just be monitoring your SLOs and publishing that to customers... that doesn't seem to be how it works, anywhere.
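As a rough idea of what "monitor the SLO and publish it" could look like, here is a minimal sketch that turns a list of pass/fail probe results into a report. The function name, the report fields and the 99.9% target are all assumptions made for illustration, not how any provider actually computes its numbers:

```python
def slo_report(probe_results, target=0.999):
    """Summarise pass/fail probe results against an availability target.

    probe_results: iterable of booleans, True meaning the probe succeeded.
    target: the availability objective as a fraction (hypothetical default).
    """
    results = list(probe_results)
    total = len(results)
    if total == 0:
        raise ValueError("no probe results to evaluate")

    failures = sum(1 for ok in results if not ok)
    achieved = (total - failures) / total
    # Error budget: how many failed probes the target tolerates in this window.
    allowed_failures = (1.0 - target) * total

    return {
        "achieved": achieved,
        "target": target,
        "met": achieved >= target,
        "failures": failures,
        "budget_remaining": allowed_failures - failures,
    }


# Example window: 1000 probes, 2 of which failed.
window = [True] * 998 + [False] * 2
report = slo_report(window)
print(report)
```

The number itself is hard to argue with; the gaming usually happens in deciding what counts as a failed probe.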
Publicly disclosing an incident to a customer is embarrassing and potentially damaging but almost equally as damaging is telling other teams you had an incident. Now anything that goes wrong is your fault by default because “it’s probably related to that incident” and any new security policies are blamed on the other team: “we wouldn’t have to do that if Ops didn’t mess up last month”.
The answer to “is this service suffering an outage” is seriously complex and hard to determine. The answer to “is this a security incident” is 10x harder and 100x more political because the industry is still just so wildly immature.
Admitting that your services are down could be costly to your career progression and bonus. When people know this, they go to great lengths to avoid admitting fault. Updating the status page is the first admission of fault. The longer the status page shows an outage, the worse it gets.
I worked with an ex-Amazon engineer at a previous company. After each outage, he would spend days or weeks writing long reports explaining how the outage was not his fault. He didn't care about downtime so much as not getting blamed for outages. Predictably, this was terrible for team morale and most of his team members ended up quitting.
If anyone else finds themselves in this position, the solution is to have another team responsible for monitoring uptime, and to rate teams on how quickly they acknowledge outages. Once the response time and accuracy of your status page becomes a performance metric, people are less likely to play games with it.
What is an outage? When does an outage reach sufficient scale that updating the status page is the right thing to do?
I used to work for AWS, and now work for another cloud provider.
One thing that's hard to communicate is the sheer scale that these services operate at, what that means architecturally, and how they tend to break.
Outages, even just slight degradation, occurring on a whole service scale are very rare. I would argue from my experiences there that most incidents affect less than 10% of any given service's customers. Whether it gets noticed in part depends on who is encompassed by that percentage.
What is very often the case is that only a subset of customers is impacted to some degree during any given incident. That can be as little as a single-digit percentage of customers or less, and still be an incident that has all hands on deck and the entire management chain of the service aware and involved.
At what percentage do you draw the line and say "Yes, we need this percentage of our customers to be affected before we post a green-i" (AWS terminology for the first stage of failure notification)?
How do you communicate that effectively to customers, in a way that doesn't suggest your service is unreliable when it really isn't?
The moment you post a green-i or above, customers start blaming you and your service for problems with their infrastructure that are not caused by it. If you're looking to use a service and go look at the status history and see it filled with green-i or similar, are you likely to trust it? No. Even if those green-i's were for impacts on a limited subset of customers.
AWS wrestled with this a bunch about 5-6 years ago. There were no end of discussions during the weekly ops meetings with senior leadership, directors and engineers across the company. Everyone wants to do the right thing and make sure customers get an accurate picture about the health of the service, without giving the wrong impression.
In the end they opted to move towards having personal notifications for outages, and build tooling to help services quickly identify which customers are being affected by any particular incident and provide personalised status pages for them that can be way more accurate than any generalised status page.
You'd think they would have learned from that.
If you look at where the content on https://status.aws.amazon.com/ is actually hosted from you'll see things like the status icons are all hosted under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif https://status.aws.amazon.com/images/status0.gif etc.
If you look at the source code for the site, you'll again see that everything is hosted from the same domain.
One of their main goals was to ensure that it could never go wrong that way again.
> You'd think they would have learned from that.
They did.
The page has been updated numerous times since the start of this incident.
Which makes me wonder, why do we all rely on status pages rather than solve the problem ourselves in ways that don't require us to rely on the vendor?
“Don’t be evil”
buys doubleclick
This sort of shit happens all the time at all levels. Companies use each other’s public specs in their competition all the time.
Or capitalizing on features like headphone jacks etc. in their ads before proceeding to remove them from their own products anyway (Samsung and Google) and so on.
They are a formal, in-depth retrospective on customer-impacting service degradations or outages. They include a thorough functional description of how the state of your service evolved into failure, an exhaustively recursive review of the operational decisions and assumptions that contributed to that failure, and a series of action items the team will take to ensure that the service will never fail again for the same reason.
Edit: This list is incomplete, and the link included in the sibling provides a better, more thorough description.
The relevant bit:
>[customers] were texting with their account managers, because the account managers had no access to any internal systems. Reportedly, the corporate VPN was not working. My thesis is... everything was single-tracking through a corporate VPN that itself was subject to this disruption... their traditional tweets have been done through an enterprise social media client called Sprinklr
Cheaper than GCP. Still less crappy than Azure.
During my 10 years, I had multiple opportunities to break and then fix things. The breaking was always looked at as "these things happen" while the fixing was always commended.
In fact, AWS is the least 'blame game' playing company I've worked at. The mindset of fix the problem and not to find some scapegoat is strong at least in my org, I really do appreciate this because it aligns with my personal belief.
For there to be downtime in Bitcoin, there would need to be a rollback, where all (or most) miners go back to a previous block and mine from that point. This has only happened once, as far as I am aware (due to a bug in the protocol itself which needed correction).
> This issue has also affected our ability to post updates to the Service Health Dashboard.
Just seems so ridiculous that they have trouble reporting the impaired status of their system due to... the impaired status of that same system.
As Werner has said before, everything fails all the time, so you need to design your system/architecture to accept that constant. US-east-1 is by far the largest of the regions, and at that scale you can probably assume that at any given point in time there is hardware in there failing that needs to be physically replaced. As a result it's the region best equipped to tolerate that level of constant failure (it's got 6 AZs!). It's also the most popular of the regions, is typically one of the launch regions for new services, and runs a bunch of critical Amazon infra too. If anything it holds a special place in terms of importance for AWS to keep it up because the impact of a widespread problem here is amplified. For the same reason, though, any problem here is much more visible across the entire internet. Which is why the handful of outages are so memorable to people.
Rumor has it, some of the older hardware is moved there and that's why prices are a little cheaper, but I have not been able to confirm that.
Amazon doesn't have a good engineering culture. It's all about shipping things as fast as possible. People get promoted and leave for other teams, and the new folk get burned out due to on-call load while trying to fix crappy software they have inherited.
Why don't the new folk iteratively refactor their systems to remove operational burden? Isn't that part of owning any codebase you didn't write?
You don't hear a lot of people praising AWS, the same way you don't hear a lot of people saying how great it is to have an iPhone. If I am happy, I have little incentive to post about it, since that should be the default state.
But the fact of the matter is simple. If you end up in a team like this, switch and raise complaints afterwards. Nothing stops you from it. There is no "toxic engineering culture" at AWS. The problem is that AWS makes you into an owner and that includes owning your career. That means if you feel something is wrong, YOU are expected to act. No one will do it for you. And there are plenty of mechanisms for you to act.
This is the greatest benefit of working at Amazon but it's also the downfall of people who are not able to own things.
Firing me for correctly telling customers that their services are down is not my idea of making me an owner.
This sort of corporate jargon does not exactly instill confidence. I think I'm more concerned about Amazon's engineering culture now than I was before.
You definitely hear a lot of people praising AWS.
Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
Is right up there with "we don't know it wasn't aliens"
I think it's bigger than just "it's your problem, you own it". There are factors beyond your control.
Our competitors would have a field day with that
So I don't really understand what they gain by doing it. I think maybe I am wrong about it being a marketing concern and that the choice is more related to internal politics and incompetent management.
Few companies really respect their engineering teams/divisions in any sensible form, in my experience, though I'm biased (even in heavy R&D environments). You're simply a means to an end.
I understand your point though (and identify with it), but I find any mechanism/option that provides a way of containing potentially damaging information is going to be pushed by management over the option to release damaging information that a responsible engineer may want to disclose.
You're in a culture where admitting fault or liability is like pulling teeth and ripping finger nails off. It shouldn't be IMHO (we should own up to our mistakes and be reasonably forgiven), but that's unfortunately not the culture we have.
If they self-host it, it signals that they're overconfident in their ability to maintain an accurate status page.
Given these two options, which do you think a budget manager will have an easier time signing off on and defending upward?
(SHD being the Service Health Dashboard)
And the fact remains that currently an outage of AWS's own infrastructure is impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.
I'd be curious to be a fly on the wall during the next Ops meeting when it comes up that yet again the status dashboard got made in a way that makes it hard to update during an outage.
Maybe if it started costing the company actual money, it might make the investments necessary to ensure it doesn't go down in the first place.
You have all the power you need to make the company change its behavior. Vote with your dollar and move to a different platform. I'm sure you have recommendations to share.
With any customer that has SLAs written into their contracts, they're not just going off your status page. They most likely have a direct point of contact and exact reporting will be done in the postmortem.
The status page is for customers for which there aren't significant legal or business complications and exists to provide transparency. In my opinion you do want "random" people at your company to be able to update it in order to provide very stressed out customers with the best information you have.
As an industry we probably should recognize this more explicitly and have more standard status pages that are like "everything might be broken but we're not sure yet"
Exactly. Apparently it's just a marketing tool if you believe parent comments...
Where I was, about 99.9% of the COEs were just a lesson learned and a new process to prevent it. There was one that was basically used as a tool to remove a VERY good engineer who didn't mesh well with new leadership.
A sister org, one I worked a lot with, wouldn't COE anything. If you were the lead engineer on a product or service that had a COE you were going to get a PIP by year end review. I wasn't surprised when all the talent left that group.
Nothing against you personally of course, but I just have to congratulate whoever it was who came up with this gem of a euphemism. It's definitely going up there next to 'career-limiting move'.
In reality it was a task scheduler with some logging and metrics thrown in, which awkwardly tied users' individual code builds to a third-party service where they had to be registered and externally referenced for every build. Virtually all SWF functionality was in the client library, not the service, which was just a data store and API.
Other cool kid services that managers wanted to force teams to use included dynamodb, kinesis, lambda, etc.
I'm not sure that assigning this to a perceived internal power grab aligns with reality.
> obvious to me
> would become very powerful
> cabal
edit: not sure why my question deserved a downvote...
It's a shame Amazon doesn't have thousands of employees to divide these tasks between different people, as it is only these busy operators who could update this status page.
If you're right, why have the status page then? It is useless by your definition yes?
It's even more frustrating when you are aware of problems early on and start talking to support and THEY don't even know about problems yet.
Maybe the thousands of people is what prevents status from being updated, everyone tries to hide their own faults internally even
Which is why, during incident responses, there have to be people in charge of communication. Both internal and external communication, and some of this can be further delegated.
That's a poor excuse.
> It requires escalation up the management chain and careful wording
Careful wording is more important for external stakeholders who might not have the full context. If one is walking on eggshells with internal management too, that's bad management. Incident communication should be factual and concise.
Could not agree more. It's immensely frustrating working with organisations that spend more time trying to cover up the cause of an outage to external stakeholders than actually fixing the root cause.
The same organisations tend to try to blame individuals for outages.
I think both are a symptom of businesses that embrace the "blame culture"
"A top of rack switch let out the blue smoke and it'll be ~30 before we can re-rack it" would impact what fraction of a fraction of a percent of canaries? Irrelevant to me, unless of course my VM lives on a box backed by that switch. ;)
The status dashboard exists for us to laugh at when things break and to convince C*Os that everything is fine. That's it.
EDIT: 15 minutes later and the board is looking worse again.
Status pages are fundamentally boring things. Who wants to work on them?
It's always tempting to complicate something simple because in part "ooh shiny", and you can always find reasons to justify why. It takes some strong engineering leadership to effectively argue against complicating things, and not be just a constant pain in the arse to everyone and every thing.
The kinds of people that are that good, tend to be people that aren't going to want to do something so boring as build and maintain the infrastructure for hosting status pages.
I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. However, what I don't want to deal with is some director of so-and-so I never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem. The status page is a political implement, not a technical one.
The status page shouldn't be figuring out what the status of any service is. It's impossible to do without a lot of contextual information about a service and understanding how to evaluate service impact, something that is continually in flux.
It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.
I considered them this private company subsidized by taxes.
Patently incorrect. A PIP is management telling you that you need to seek alternative employment, now.
Joking/sarcasm aside: I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP. They exit the company or they’re exited from the company. PIPs seem to mark the start of the “we are building formal documentation to fire you” phase of losing a job.
I guess I'm the poster child for having vision insurance as a company benefit.
I have, at both Amazon and AWS. The pattern I have seen is medical. Someone is on some sort of medication that is screwing with their abilities and doesn't realize it. I've seen multiple cases: one where the meds caused liver problems and the person didn't know they were supposed to get regular testing (crappy doctor), and another where the person found out the meds they were on caused short-term memory loss. These surfaced during the PIPs and were fixed, and the folks got out fine.
As you point out, though, the status dashboard isn't truly meant to be either of those things. I don't have any illusions about it ever changing.
I'm afraid you're shifting the complexity to a manual process.
I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.
You're right, that's 100% what I'm doing. Why? Because it shouldn't be that complicated to update an overall health status page during an outage event, and it shouldn't take other tools and services within AWS to do it.
A common pattern in cloud providers (including AWS) is that services have some kind of tiering, whereby you can't pick up a dependency on any service on a lower tier than yourselves. Tier 2 services can't rely on Tier 3 services, etc. Services like, say, IAM, would be right at the very top. It can't rely on EBS, ELB etc. Everything has to be created in-service, because everything ultimately has to rely on authentication working.
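The tiering rule above is simple enough to check mechanically. A minimal sketch, assuming tier 1 is the top tier and a larger number means a lower tier; the service names, tier assignments and dependency edges below are invented for illustration and are not AWS's real dependency graph:

```python
def find_tier_violations(tiers, dependencies):
    """Return (service, dependency) pairs that break the tiering rule.

    tiers: dict mapping service name -> tier number (1 is the top tier).
    dependencies: dict mapping service name -> list of services it relies on.
    A service may only depend on services in its own tier or a higher one,
    i.e. on services whose tier number is less than or equal to its own.
    """
    violations = []
    for service, deps in dependencies.items():
        for dep in deps:
            if tiers[dep] > tiers[service]:
                violations.append((service, dep))
    return violations


# Hypothetical tier assignments and dependency edges.
tiers = {"iam": 1, "status_page": 1, "ebs": 2, "s3": 2, "kinesis": 3}
dependencies = {
    "iam": [],
    "ebs": ["iam"],
    "s3": ["iam"],
    "kinesis": ["iam", "s3"],
    "status_page": ["s3"],  # a top-tier service leaning on a tier 2 service
}

for service, dep in find_tier_violations(tiers, dependencies):
    print(f"{service} (tier {tiers[service]}) must not depend on "
          f"{dep} (tier {tiers[dep]})")
```

In this made-up graph the only violation is the status page relying on a lower-tier storage service, which is the kind of dependency the rule is meant to catch.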
If they're going to keep an overall status page going, it needs to be seen as a top tier service, just like identity is. That's where they were headed when I left AWS about 5 1/2 years ago. It had been spurred by a previous major incident that couldn't be reflected in the status dashboard because of a failure in a dependency.
> I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.
I go in to a bit more detail in another comment within this discussion, but a status page does not even close to accurately capture the ways that cloud environments fail, which are very, very rarely affecting more than a small percentage of customers, and even then often in some very specific way under specific circumstances. That's why AWS built the personalised status page service. They want to ensure that customers have an accurate way of telling what is going on with services they're consuming, rather than the confusing situation of checking an overall status site that doesn't really reflect their experience and never could.
Situations like today's, where it at least (from the outside) seemed like Kinesis was completely down, would be a good example of something that should be reflected in the main overall status page.
The status page should be manual, and should be something the incident management team can do (and have political ability to force it to happen, rather than being subject to the whims of service directors)
Outside of someone protected by a labor union, I’ve very rarely seen anyone recover from a PIP and not be eventually let go. Most commonly employees see them as a 30 or 60 day window to proactively find a new job before they’re terminated.
For example, a friend I have that recently left Facebook knew for a good 6 months he needed to shape up. But they hadn't put him on a PIP in that time. They eventually offered him a decent severance to quit, and he took that rather than continuing to try. If he stayed, he probably would have been put on a PIP fairly shortly. It was the best thing for everyone. He wasn't all that happy there anyways.
Lots of people bad at their jobs blame the PIP system for their failure at Amazon.
People responsible for the product should not have say over the switch being flipped, for obvious reasons (illustrated in other comments in this thread).
A lawyer I spoke with suggested employees regularly visit their doctor about work-related stress so that when they inevitably get PIP'ed they can claim medical leave and work-related illness. In some places it's a war zone and that's what workers have to do.