Cloudflare outage – 24 hours now (news.ycombinator.com)
I understand we do not have the technology for that just yet, and DevOps able to configure TLS terminators on their own are worth their weight in gold.
Hard to imagine how the Internet could ever exist without Cloudflare.
> I understand we do not have the technology for that just yet
I looked at my router, remembered the term "packet-switched network", and wept.
It's not just packet routing, though; many of their other products seem to be affected as well.
Cloudflare Is Down (Again) - https://news.ycombinator.com/item?id=38116892 - Nov 2023 (2 comments)
Cloudflare API Down - https://news.ycombinator.com/item?id=38112515 - Nov 2023 (141 comments)
Cloudflare incident on October 30, 2023 - https://news.ycombinator.com/item?id=38100932 - Nov 2023 (29 comments)
Jokes aside, it must be extremely stressful to be an SRE at CF recently. But something is clearly wrong over there. We have been burned so badly that there is no chance we will touch CF again in the next decade once our migration off of it is complete.
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
We renewed our agreement with them in the middle of the year (~$50k) and they've yet to invoice us for it. Our financial controller noticed and I pinged our account rep a few times. Not a peep back.
Wasn't the previous outage on Oct 30 less than an hour?
Since Shopify's CLI uses Cloudflare tunnels by default to load local resources, Shopify partners are affected by this outage and unable to develop apps unless they use another tunnel:
We moved out to BunnyCDN's stream after waiting for 20 hours.
One side benefit is that our videos are now stored in the EU instead of Cloudflare's <hand wavy> edge location around you.
We still have some accessory features to move over to video on Bunny, such as transcriptions and downloads.
What should our expectations be? The best assumption could be that this is the new normal.
I look back fondly on earlier AWS outages where everything was green on the status page because the red icon hosted on S3 was down...
""" In a nutshell, Cloudflare rolled out a new KV build to production. It turned out that the deployment tool had a bug, and some traffic got diverted to the wrong destination, which triggered a rollback … which failed. The result was that engineers had to manually switch the production route to the previous working version of Workers KV.
The problem is that an awful lot of Cloudflare products and services depend on Workers KV, meaning that when there is a problem with the platform, the blast radius can be impressive. """
We're currently in the Nov 2-3 outage, soon to roll over into Nov 4 in my timezone. This one is the power outage, also mentioned in the article, but unrelated to KV.
https://blog.cloudflare.com/post-mortem-on-cloudflare-contro...
"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.
Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our observability tools were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer why they ran utility power and generator power."
Straight up going on LinkedIn and other socials claiming everything was solved in one hour (actually 37 minutes), even though I and many other companies I know still had issues with their services *16 hours after* the post.
Those are things that make me reconsider my position with Cloudflare. Straight up lying and not verifying whether your customers are able to operate on your platform while impacting their operations but making PR stunts about how good and fast they are at solving critical issues is something that erodes credibility.
Especially after they used the Okta security failure to bash them on their blog for their lack of honest communication to their customers.
This outage (not the current one) was 37 minutes long:
https://blog.cloudflare.com/cloudflare-incident-on-october-3...
Then I realised setting the NS in Namecheap to Cloudflare's nameservers was taking an inordinate amount of time to propagate, and that's when I checked X/Twitter. Set it back to Route53.
The only feature I need to research in new providers is: access to Whois ASN numbers, which I insert into HTTP request headers. I use this to tailor my site for .gov and .edu users.
I assume both Cloudflare and Flexential are on DEFCON 1 right now, but I'm wondering if it might be more than just the building going dark.
There's something about a failover that was attempted and crashed halfway through, but it's unclear if that's what's causing the 24h+ situation.
I certainly prefer that failure mode to the opposite, but I do find the status information on Cloudflare's page to be very confusing about this.
Tunnels as a product is essentially heavily degraded (putting it lightly) and yet it's currently listed as "Cloudflare Tunnel: restored" [0]
[0] https://www.cloudflarestatus.com
Edit: also having used Shopify's CLI a little, one thing I noticed immediately is how opaque the whole thing is. They want to push you down a very specific path, and don't provide a lot of information if you want to take a bit more control over your dev process (as I always want to do) which directly leads to points of failure like this. From your GitHub links it looks like devs are struggling to figure out how to quickly switch to a different reverse proxy.
In that era we also saw the last sysadmin configuring Apache with their bare hands without the help of Cloudflare.
You just can't get that level of reliability if you do it yourself, no matter how hard you try.
That's barely clearing one-nine availability (93%) over the last 30 days for our particular stack on CF. This is insane.
Mind you, the last time we were hit, by a 22h outage on Oct. 9, we didn't get so much as an email from CF either during or after the outage.
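For a sense of scale, here is a back-of-the-envelope sketch of that availability arithmetic. The function names are made up for illustration, and the 24h figure for the current outage is an assumption taken from the thread title, not a measured number for this particular stack.

```python
def availability_pct(downtime_hours: float, window_days: int = 30) -> float:
    """Availability over the window, as a percentage."""
    window_hours = window_days * 24
    return 100.0 * (1 - downtime_hours / window_hours)


def downtime_budget_hours(target_pct: float, window_days: int = 30) -> float:
    """Hours of downtime the window can absorb at the target availability."""
    window_hours = window_days * 24
    return window_hours * (1 - target_pct / 100.0)


if __name__ == "__main__":
    # Assumption: the 22h outage on Oct. 9 plus roughly 24h for the current one.
    assumed_downtime = 22 + 24
    print(round(availability_pct(assumed_downtime), 1))   # 93.6
    # One nine (90%) allows about 72h of downtime in a 30-day window.
    print(round(downtime_budget_hours(90.0), 1))          # 72.0
    # 93% corresponds to about 50.4h of downtime in 30 days.
    print(round(downtime_budget_hours(93.0), 1))          # 50.4
```

In other words, two outages of roughly a day each are enough to land in the low 90s for the month, which is in the same ballpark as the 93% quoted above.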
It continues to amaze me how major infrastructure providers seem to consistently fuck this one up (see also: AWS' status page outage a while ago).
To have 30% of the internet relying on a single building in a single city is hilarious.