Cloudflare outage – 24 hours now

(news.ycombinator.com)

241 points by sanat 2 years ago | 73 comments

sph 2 years ago |

Looking forward to a more decentralised global Internet, with packets being routed through alternative paths, so outages like these become a non-event.

I understand we do not have the technology for that just yet, and DevOps able to configure TLS terminators on their own are worth their weight in gold.

Hard to imagine how the Internet could ever exist without Cloudflare.

renonce 2 years ago | |

The Internet has been decentralized from the beginning. Now I don't want to claim that Cloudflare made something worse (at least it's enabling a lot of websites to exist without fear of DDoS) but the fact is that Cloudflare made it more centralized, as there are lots of websites that cannot be accessed without going through Cloudflare.

belthesar 2 years ago | | |

I think you might have missed the joke on this one.

theideaofcoffee 2 years ago | |

It's such a sad reflection of the state of the devops art when setting up a TLS terminator is considered a black art worthy of vaunted experts being paid huge sums. I've seen this descent over the course of my career, watching the profession go from low-level knowledge to being mere YAML-wiring monkeys, slinging shit over the wall to get functionality working just well enough to make it to the nEXt SprInT. The joke above aside, I think it will continue to get worse, with overall stability reflecting that, until it comes to a head and either people re-learn 'lost' skills, or the ball of baling wire, gum and glue implodes more completely.

TeMPOraL 2 years ago | |

> decentralised global Internet, with packets being routed through alternative paths

> I understand we do not have the technology for that just yet

I looked at my router, remembered the term "packet-switched network", and wept.

dylan604 2 years ago | | |

We have the technology. We can make him better than he was. Better, stronger, faster.

survirtual 2 years ago | |

That technology is far too advanced, unfortunately. Maybe someday, packets will freely roam the cyber plains, untethered by the reins of single-point-of-failure gatekeepers. Until that halcyon day dawns, we'll remain humble supplicants at the towering obelisks of centralization, chanting incantations of redundancy and resilience, and laying burnt offerings of legacy hardware upon the altars of the uptime deities.

barbazoo 2 years ago | |

> Looking forward to a more decentralised global Internet, with packets being routed through alternative paths, so outages like these become a non-event.

It's not just packet routing though, many of their other products seem to be affected as well.

lazydon 2 years ago | |

Missing the /s I hope.

sph 2 years ago | | |

As I said elsewhere, I come from a time when everyone was fluent in sarcasm on the Internet, without needing disclaimers.

arp242 2 years ago |

Cloudflare Continuing to Experience Outages - https://news.ycombinator.com/item?id=38121370 - Nov 2023 (2 comments)

Cloudflare Is Down (Again) - https://news.ycombinator.com/item?id=38116892 - Nov 2023 (2 comments)

Cloudflare API Down - https://news.ycombinator.com/item?id=38112515 - Nov 2023 (141 comments)

Cloudflare incident on October 30, 2023 - https://news.ycombinator.com/item?id=38100932 - Nov 2023 (29 comments)

sailingparrot 2 years ago |

I never experienced an outage longer than 12 hours with any service provider over my ~13-year career (maybe I was lucky). But thanks to Cloudflare I have been able to enjoy not just one, but two ~24h outages in not even a month!

Jokes aside, it must be extremely stressful to be an SRE at CF recently. But something is clearly wrong over there. We have been burned so badly that there is no chance we will touch CF again in the next decade once our migration off of it is complete.

adrr 2 years ago | |

Azure leap year outage is a famous one.

https://azure.microsoft.com/en-us/blog/summary-of-windows-az...

throaway920181 2 years ago | |

> But something is clearly wrong over there

We renewed our agreement with them in the middle of the year (~$50k) and they've yet to invoice us for it. Our financial controller noticed and I pinged our account rep a few times. Not a peep back.

hotnfresh 2 years ago | | |

My limited interaction with their sales & account management org gave me the impression of remarkable levels of disorganization. I know those tend to have a lot of turnover, but it seemed like they also weren't really training or managing them. Really weird vibes.

CodesInChaos 2 years ago | |

> two ~24h outages in not even a month

Wasn't the previous outage on Oct 30 less than an hour?

sailingparrot 2 years ago | | |

Yep, but on Oct 9 they were down for 22h.

samlinnfer 2 years ago |

BTW Cloudflare tunnels are not working (for at least the last 16 hours), but it says "Operational" and "restored" on the ticket.

Since Shopify's CLI uses Cloudflare tunnels by default to load local resources, Shopify partners are affected by this outage and unable to develop apps, unless they use another tunnel:

[0] https://github.com/Shopify/cli/issues/3065

[1] https://github.com/Shopify/cli/issues/3060

lamroger 2 years ago | |

Wanted to hack today but the universe is telling me to go enjoy the sun

perryizgr8 2 years ago | |

Data point of just one, but my tunnels are working just fine.

samlinnfer 2 years ago | | |

If you've previously created a tunnel it will still work, just don't close it because you won't be able to open a new one.

mypastself 2 years ago | | |

Same here, but they’ve been up for a while. Does anyone know if rebooting the machine will kill them?

sanat 2 years ago |

I run hirevire.com, a one-way video interview SaaS, and we were pretty much dead in the water during the Cloudflare Stream outage.

We moved out to BunnyCDN's stream after waiting for 20 hours.

One side benefit is that our videos are now stored in EU instead of Cloudflare's <hand wavy> edge location around you.

summarity 2 years ago | |

I've also been using Bunny's image and video delivery, while using CF for everything else. It's pretty neat - it just works. I like having both in my toolbelt, makes fallbacks like these easy.
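To illustrate the kind of fallback described above, here is a minimal sketch, not anyone's actual setup. It assumes each provider is wrapped in a plain function that takes an asset ID and returns bytes; the names `fetch_with_fallback` and `AllProvidersFailed` are made up for this example, and no real Cloudflare or Bunny API calls are shown.

```python
from typing import Callable, Iterable, List, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the fallback chain has failed."""


def fetch_with_fallback(
    providers: Iterable[Tuple[str, Callable[[str], bytes]]],
    asset_id: str,
) -> Tuple[str, bytes]:
    """Try each (name, fetch) pair in order; return the first success.

    The fetch callables are hypothetical wrappers around whatever client
    each provider needs. Any exception is treated as "provider down".
    """
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch(asset_id)
        except Exception as exc:  # sketch only: real code would be more selective
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The order of the list is the priority order, so swapping the primary provider is a one-line change rather than a migration.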

karlerss 2 years ago | |

How much work was the migration? Were the APIs feature-compatible or did you lose functionality?

sanat 2 years ago | | |

The migration work was only a couple of hours for our core process. Took us 4 in total to restart collecting video.

We still have some accessory features to move over to Bunny, like transcriptions and downloads.

creshal 2 years ago |

I'm really looking forward to the post-mortem on this.

liotier 2 years ago | |

Cloudflare's greatest product is arguably its blog!

dogweather 2 years ago | |

I can't believe we haven't heard anything yet. AFAIK we've only been told, "power outage", which was resolved yesterday.

What should our expectations be? The best assumption could be that this is the new normal.

laluser 2 years ago | |

Power outage + data inconsistency issues.

BillinghamJ 2 years ago | | |

Isn't the real issue that the control plane isn't decentralized/redundant? Entirely dependent on PDX

dixie_land 2 years ago |

A silver lining I take from this is that at least the incidents page is hosted somewhere else :)

I look back fondly on earlier AWS outages where everything was Green on the status page because the Red icon hosted on S3 was down...

dogweather 2 years ago |

Has Cloudflare said anything of substance yet? This is far beyond a simple power outage.

gtirloni 2 years ago | |

https://www.theregister.com/2023/11/02/cloudflare_outage/

""" In a nutshell, Cloudflare rolled out a new KV build to production. It turned out that the deployment tool had a bug, and some traffic got diverted to the wrong destination, which triggered a rollback … which failed. The result was that engineers had to manually switch the production route to the previous working version of Workers KV.

The problem is that an awful lot of Cloudflare products and services depend on Workers KV, meaning that when there is a problem with the platform, the blast radius can be impressive. """

tux3 2 years ago | | |

The KV outage is the previous one, from Nov 1st.

We're currently in the Nov 2-3 outage, soon to roll over into Nov 4 in my timezone. This one is the power outage, also mentioned in the article, but unrelated to KV.

sponaugle 2 years ago |

Cloudflare Postmortem:

https://blog.cloudflare.com/post-mortem-on-cloudflare-contro...

"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.

Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our observability tools were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.

It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer why they ran utility power and generator power."

client4 2 years ago |

They are having issues with the new process spanning global MITM'd traffic to the NSA.

epolanski 2 years ago |

Honestly Cloudflare's PR pissed me off yesterday.

Straight up going on LinkedIn and other socials saying everything was solved in one hour (actually 37 minutes), even though I and many other companies I know still had issues with their services *16 hours after* the post.

Those are things that make me reconsider my position with Cloudflare. Straight up lying and not verifying whether your customers are able to operate on your platform while impacting their operations but making PR stunts about how good and fast they are at solving critical issues is something that erodes credibility.

Especially after they used the Okta security failure to bash them on their blog for their lack of honest communication to their customers.

corobo 2 years ago | |

Is it possible that you're referencing the other outage from the 30th? Just going by the 37 minutes number as it's very specific.

This outage (not the current one) was 37 minutes long:

https://blog.cloudflare.com/cloudflare-incident-on-october-3...

anacrolix 2 years ago | |

They are straight up scumbags.

spacebacon 2 years ago |

Hmm... Who just changed their DNS vs. riding it out?

sirius87 2 years ago | |

I was in the midst of migrating my Namecheap domain from Route53 to Cloudflare. Set up all the DNS records while ignoring the /api/ errors shown at the bottom of the Cloudflare dashboard, thinking some ad block setting in my browser was messed up.

Then I realised setting the NS in Namecheap to Cloudflare's nameservers was taking an inordinate amount of time to propagate, and that's when I checked X/Twitter. Set it back to Route53.

burcs 2 years ago | |

We did, we were slowly working towards migrating to AWS entirely and this just helped expedite it.

issafram 2 years ago | |

It hasn't affected my home network at all. I use their DNS servers and resolving addresses has not stopped working.

0x0000000 2 years ago | | |

Parent comment was likely referring to authoritative DNS, not Cloudflare's public resolvers.

dogweather 2 years ago | |

I'm planning my transition away for 10 or so subdomains and 30 records.

The only feature I need to research in new providers is: access to Whois ASN numbers, which I insert into HTTP request headers. I use this to tailor my site for .gov and .edu users.
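For anyone wondering what that feature amounts to, here is a rough sketch under stated assumptions. The lookup table is a hypothetical stand-in for a real IP-to-ASN data source: it uses documentation-only address ranges and documentation-only AS numbers. The header name `X-Client-ASN` is also made up for illustration and is not any provider's actual header.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical stand-in for a real IP-to-ASN database.
# Networks are documentation ranges; ASNs are from the documentation block.
ASN_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), 64496),
    (ipaddress.ip_network("198.51.100.0/24"), 64497),
]


def lookup_asn(client_ip: str) -> Optional[int]:
    """Return the ASN whose network contains client_ip, or None if unknown."""
    addr = ipaddress.ip_address(client_ip)
    for network, asn in ASN_TABLE:
        if addr in network:
            return asn
    return None


def with_asn_header(headers: Dict[str, str], client_ip: str) -> Dict[str, str]:
    """Return a copy of headers with X-Client-ASN added when the ASN is known."""
    enriched = dict(headers)
    asn = lookup_asn(client_ip)
    if asn is not None:
        enriched["X-Client-ASN"] = str(asn)
    return enriched
```

The origin can then branch on the header value, for example against a list of ASNs known to belong to government or university networks.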

hipadev23 2 years ago |

Is there a summary of what Cloudflare services are operational? Feels like it would be easier to track.

kijin 2 years ago | |

Basic proxying seems to be working fine for me. Existing DNS records continue to be served. Existing files on R2 are accessible. Can't change anything without a bunch of API errors, though. Hope I don't need to turn on "I'm under attack" anytime soon.

l5870uoo9y 2 years ago |

Wonder if this is related to the many product launches recently? Even though my general impression is that they test rigorously with long-running alpha and beta test phases.

stpe 2 years ago | |

It is the consequence of a power outage in the Flexential PDX02 data center.

dogweather 2 years ago | | |

That's the inference, but AFAIK there's been no direct assertion or explanation of why CF has been knocked back to Alpha-status reliability across the board.

tootie 2 years ago |

I've heard tell of massive DDoS attacks against international news sources (AP, Reuters, NY Times). Not sure if this is related.

nkcmr 2 years ago | |

In this case it is not. A power outage in a critical data-center is the root cause here: https://www.cloudflarestatus.com/incidents/hm7491k53ppg

sponaugle 2 years ago | | |

I am in this DC, and we lost power to all of our racks but one. Power was restored about 2 hours later. I would assume Cloudflare had some significant failures in equipment due to the power drop. We lost a couple of servers that didn't come back up, which is not an uncommon problem with hardware that has been running without a power-off for 4-5 years.

tux3 2 years ago | | |

What I'm confused by is we had "power partially restored" 22 hours ago, and no news from PDX02 since.

I assume both Cloudflare and Flexential are on DEFCON 1 right now, but I'm wondering if it might be more than just the building going dark.

There's something about a failover that was attempted and crashed halfway through, but it's unclear if that's what's causing the 24h+ situation.

ta1243 2 years ago | | |

If you can't cope with the loss of a data centre you're not really running a resilient system.

andrewinardeer 2 years ago | |

Any more info on that?

tootie 2 years ago | | |

I can't find any published details, it was just circulating in the media biz.