Ask HN: Cloudflare Workers are down?

elithrar 2 years ago |

This should be resolved. We’re still investigating the underlying root cause, and intend to share a write-up once we have that in hand.

This is not the way we wanted anyone to start their week.

(I am the PM lead for Cloudflare Workers: Databases & Storage)

elithrar 2 years ago | |

Our public postmortem on the incident: https://blog.cloudflare.com/cloudflare-incident-on-october-3...

nkcmr 2 years ago |

https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc

huerlisi1 2 years ago | |

Just noted on HN and the incident has already been upgraded. Much faster "response" than most other companies :-)

All the best to the people fixing!

NicoJuicy 2 years ago | |

Works for me: https://blog.sapico.me/

Seems to have lasted 30 minutes, according to the status page.

Fix is fast. Curious what it was.

codegeek 2 years ago |

3:55 PM Eastern: Our entire website hosted on Cloudflare Pages is returning 500. I also cannot log in to the dashboard (it just spins).

EDIT 4:10 PM Eastern: Now I can log in to the dashboard, but the "Workers and Pages" menu is returning errors and I have no access. Website still down :(

EDIT at 4:23 PM Eastern: RESOLVED. Website (Cloudflare Pages) is back up now for me.

Looks like they took about 25 mins to resolve.

camjohnson26 2 years ago | |

Our prod app and staging just completely died. Bad day for somebody at Cloudflare

codegeek 2 years ago | | |

Our main marketing website, which brings in revenue, is down. No sympathy from me. It has been 20 mins now. Losing money as I type this.

EDIT: I panicked a little. As a dev, I should have been more sympathetic.

acdha 2 years ago | | |

It’s only reasonable to be angry but do try to remember that the people fixing this are people like you who showed up at work to build something and are instead dealing with a fire. Ask their bosses about how they got in that situation but be nice to them, they’re having an even worse day than you are.

codegeek 2 years ago | | |

Fair enough. They resolved it now and I was in a bit of panic considering our revenue depends on the website. As a developer though, I should have been more sympathetic.

andrelaszlo 2 years ago | | |

I'm curious, I definitely get the panic. How much does a 30 minute outage cost you, vs how much would it cost to build a solution with some kind of standby that you could fail over to in scenarios like this?

It could be worth it, but if you do the math and it seems like it's not worth it, it could perhaps give you some equanimity the next time it happens?
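The comparison suggested above can be sketched as a simple break-even check. All numbers and names here are hypothetical placeholders, not figures from anyone in this thread; plug in your own revenue and cost estimates.

```python
def expected_outage_loss(revenue_per_hour: float, outage_hours_per_year: float) -> float:
    """Expected yearly revenue lost to outages (a rough, linear estimate)."""
    return revenue_per_hour * outage_hours_per_year


def standby_pays_off(revenue_per_hour: float,
                     outage_hours_per_year: float,
                     standby_cost_per_year: float) -> bool:
    """True if the expected loss exceeds what a standby would cost per year."""
    return expected_outage_loss(revenue_per_hour, outage_hours_per_year) > standby_cost_per_year


# Hypothetical inputs: $2,000/hour of revenue, one 30-minute outage per year,
# and a standby setup costing $5,000/year to build and maintain.
print(standby_pays_off(2000.0, 0.5, 5000.0))
```

With these made-up inputs the expected loss is $1,000 against a $5,000 standby, so the standby would not pay for itself on outage cost alone.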

midasuni 2 years ago | |

3.55pm eastern. My websites work

4.10pm eastern, still working

4.23 eastern. Yep you guessed it

Half an hour means they’ve lost their five nines for this year based on this outage alone.

EthicalSimilar 2 years ago | |

Us also: prod and staging are down, and the dashboard's API requests are failing (500).

madjam002 2 years ago |

And just 30 minutes ago we were about to flip the switch on a months-long migration to Cloudflare Pages for our new website. I guess some things weren't meant to be :')

codegeek 2 years ago | |

Omg. What timing. I feel your pain. We recently migrated to Cloudflare Pages and I was happy with the speed and everything, and now this :(. Never had any downtime when I self-hosted on my DigitalOcean droplet. Damn. Reconsidering going back to old-school nginx static site hosting.

ikekkdcjkfke 2 years ago | | |

Those might have had downtime that was never reported.

hobs 2 years ago | | |

Well then you haven't used DO that long, I get regular emails about X or Y server needing to go down for maint.

goldinfra 2 years ago | | |

I've used Digital Ocean (and many other hosting providers) for as long as most of them have existed. Most of my servers have been running nearly uninterrupted for many years. Yes, there will be a reboot or move every so often but the uptime is incredibly high.

The idea that a single server can beat the reliability of a massively distributed system is counter-intuitive, and yet it's usually the case.

The average distributed system is a house of cards that can come tumbling down if any one of a number of pieces fails. The average static server is a rock of stability, with very few failure modes.

madjam002 2 years ago | | |

Yep our current marketing site is NextJS hosted on Hetzner fronted by Cloudflare, fortunately that's still up and never has any problems.

We've moved to next-on-pages for our new marketing site and I've spent the whole day on finishing touches ready for switch over at 20:00 UTC, and now this :((

nabakin 2 years ago | | |

Not sure why you're being downvoted

nijave 2 years ago | | |

Did you ever reboot for patches or was it load balanced?

rexreed 2 years ago | | |

Heck even shared hosting for $3/mo works just fine

JohnMakin 2 years ago |

For any terraform users that may be using code like this:

data "cloudflare_ip_ranges" "cloudflare_ipv4_list" {}

This is coming back with an empty list on some fields and causing havoc in terraform.

freedomben 2 years ago | |

It is shocking to me how bad to non-existent the error handling is in most Terraform providers. It leads to some remarkably arcane and esoteric error messages.

mschuster91 2 years ago | | |

Terraform error handling as a whole is nuts anyway. For example, I recently tried to delete an ACM cert that was still in use by a CloudFront distribution. It didn't work, but it took Terraform 20 minutes to recognize that, yes, there's an API error. It shouldn't have gotten that far, given that the API call errors out immediately when tried via the CLI or web console, but instead of erroring out, Terraform retried for 20 minutes until it hit some sort of timeout.

To make it worse, you can't even kill Terraform safely because while it does register your Ctrl+C, it won't interrupt an ongoing process, and if you force kill it you run the very serious risk of corrupting your state file.

Seriously, I'm looking to OpenTofu to light some fire under the ass of HashiCorp. I don't know where all the VC money went, but for what's supposed to be the gold standard of IaC solutions, it's sometimes bloody ridiculous.

(Not to mention it's written in Go of all things which means there's virtually zero tooling and documentation to debug it or to develop anything for... especially when compared to the state of the art in Java, NodeJS or PHP)

nijave 2 years ago | | |

This is usually down to the provider implementation, which switching the core won't help with. The provider controls HTTP calls and errors against the relevant service API.

Here are the retries in the provider code https://github.com/hashicorp/terraform-provider-aws/blob/mai...

It's hard coded to "certificateCrossServicePropagationTimeout" which is 20 minutes here https://github.com/hashicorp/terraform-provider-aws/blob/mai...
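As a general illustration of the behaviour being discussed (not the AWS provider's actual code, which is written in Go), here is a minimal sketch of a retry loop that keeps retrying only errors marked as retryable and fails immediately on anything else. `RetryableError` and `retry_until` are hypothetical names invented for this example.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. propagation delay)."""


def retry_until(operation, timeout_s: float, interval_s: float = 1.0):
    """Call operation until it succeeds, fails permanently, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return operation()
        except RetryableError:
            # Give up if sleeping once more would take us past the deadline.
            if time.monotonic() + interval_s > deadline:
                raise TimeoutError("retry budget exhausted")
            time.sleep(interval_s)
        # Any other exception propagates immediately: retrying would not help.
```

The design point is the classification step: an "in use" style error that will never clear on its own would ideally be treated as permanent rather than waited on for the full timeout.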

mschuster91 2 years ago | | |

Sure, but Terraform Core doesn't provide any way of getting user feedback in case unexpected situations happen, or aborting while saving the current state, both of which would save me serious amounts of time.

nijave 2 years ago | | |

Being able to actually interrupt/cancel would be nice. You can get more feedback by adjusting TF_LOG env var. Logging levels have been getting improvements for a while (it used to just be TRACE that spammed everything)

JohnMakin 2 years ago | | |

There was no error message, that was the really unsettling part.

eddyfromtheblok 2 years ago | | |

It's shocking how desirable a skill it is in DevOps job roles, given its clear deficiencies.

JohnMakin 2 years ago | |

In the time between making this post and now, it's come back. Really wish that would've returned an error and not an empty list; that almost caused a disaster in my automation.
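One way to guard automation against this kind of silent empty result is to validate the list before acting on it. The following is a minimal sketch, assuming the ranges have already been fetched into a plain list of strings; `require_cidrs` and the `"ipv4"` label are hypothetical names for illustration, not part of Terraform or any Cloudflare API.

```python
import ipaddress


def require_cidrs(cidrs: list[str], field: str) -> list[str]:
    """Return cidrs unchanged, or raise if the list is empty or malformed."""
    if not cidrs:
        # An empty list is treated as an upstream failure, not as "no ranges".
        raise ValueError(f"{field}: empty list, refusing to continue")
    for cidr in cidrs:
        # ip_network raises ValueError for anything that is not a valid network.
        ipaddress.ip_network(cidr, strict=False)
    return cidrs


# 192.0.2.0/24 is a documentation-only range, used here as a stand-in.
print(require_cidrs(["192.0.2.0/24"], "ipv4"))
```

Failing loudly here means a firewall or allowlist rule is never rewritten from an empty input.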

TacticalCoder 2 years ago |

Anyone remember big iron and servers with uptimes of 5 or 7 nines?

I mean: it used to be a thing. Now we have the cloud.

fizx 2 years ago | |

7 9's is 3 seconds of downtime per year. That was never a thing.
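For reference, the arithmetic behind these "nines" figures can be sketched as follows, assuming a 365-day year (the function name is just for illustration):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap, 365-day year


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


for n in (3, 5, 7):
    print(f"{n} nines: {downtime_budget_seconds(n):.2f} s/year")
```

Five nines works out to roughly 315 seconds (a little over 5 minutes) per year, so a 30-minute outage alone exceeds it; seven nines is roughly 3 seconds per year.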

vicnov 2 years ago |

Auth0 seems to be down as well

thom 2 years ago | |

Yeah, can confirm this (for those looking at their status pages which claim otherwise).

j-rom 2 years ago |

Complete Pages outage for me. I have several sites hosted on Cloudflare Pages and I can't access any of them, they're all returning 500's.

gsanderson 2 years ago | |

Yep, same for me :(

campbellman 2 years ago |

Fun day to release a blog post[0] about cloudflare page functions, on a site hosted on cloudflare pages.

[0] https://interbolt.org/blog/split-it-and-forget-it

tootie 2 years ago |

Apparently Auth0 as well. Possibly related.

juancampa 2 years ago | |

Most likely related, I see a `cf-ray` header in the 500 response.
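A check like the one described can be sketched as a small heuristic over response headers. This assumes the headers are already available as a plain dict; `served_via_cloudflare` is a hypothetical helper name. The presence of `cf-ray` only suggests the response passed through Cloudflare's network; it does not by itself show what caused the error.

```python
def served_via_cloudflare(headers: dict[str, str]) -> bool:
    """Heuristic: True if the response headers include a cf-ray header."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    return any(name.lower() == "cf-ray" for name in headers)


# Placeholder header values, for illustration only.
print(served_via_cloudflare({"CF-RAY": "placeholder", "Content-Type": "text/html"}))
```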

c22 2 years ago |

It's probably bad that I noticed this just because a large percentage of my regular online habits suddenly broke. I liked the old internet where websites just broke one at a time.

NicoJuicy 2 years ago | |

That was before DDoS attacks became common and cheap to execute.

blintz 2 years ago |

This is preventing new logins to ChatGPT.

toomuchtodo 2 years ago | |

Error 1101: Worker threw exception. Interestingly, they're fronting their Auth0 tenant with CF.

Recursing 2 years ago | | |

Auth0 itself is (edit: was) down: https://auth0.com/

xrd 2 years ago |

I can't log in to my domain dashboard either. Maybe that is a downstream effect of Workers being offline?

dogweather 2 years ago |

Yes — Workaround is to disable your workers. That got my site back up and running.

EDIT 15:07 MDT: People are reporting that Workers are back up. Mine isn't in my site's critical path. So I'm going to leave the Worker disabled (un-routed) until tonight.

codegeek 2 years ago | |

Are you saying that if you disable Workers, your Cloudflare Pages site will come back up?

dogweather 2 years ago | | |

Ah, I don't know about Cloudflare Pages. I think they use Workers underneath. So unfortunately, there's no fix yet. Sorry.

codegeek 2 years ago | | |

Ah well. I cannot access the "Workers and Pages" menu. It returns an error.

gsanderson 2 years ago |

My sites have started coming back up now. Their site has also just started working again (previously got a 500): https://pages.cloudflare.com/

pcblues 2 years ago |

I won't be the first or last to say these three things:

The internet was meant to stop reliance on single sources (in case of nuclear war)

The size of a house of cards increases the number of failure points

Marketers lie

jve 2 years ago | |

> The internet was meant to stop reliance on single sources

You have all the technical means. Your home server possibly won't be reachable, yes.

The global connectivity as-is is really, really, really fault tolerant.

CommonGuy 2 years ago |

Cloudflare Pages aren't working on a few of my sites too

mparnisari 2 years ago |

It is funny that the company that laughed at Okta for a breach just a few days ago, and whose core competency is availability, is now experiencing an outage.

ystad 2 years ago | |

You should probably indicate what you meant by "laughed at Okta". Do you have a link?

mparnisari 2 years ago | | |

https://blog.cloudflare.com/introducing-har-sanitizer-secure...

Maybe "laugh" is not accurate, idk. But their post kinda looked like 'here, we built this tool that should have been made by okta'.

gkfasdfasdf 2 years ago |

Ongoing DDoS attacks are targeting sites that raise funds for Gaza relief efforts: https://twitter.com/arblauvelt/status/1719027920054702363

I wonder if there's a connection.

elithrar 2 years ago | |

Not related.

(I am the PM lead for Workers databases & storage)

sterlind 2 years ago | | |

I really appreciate how you've shown up quickly and given direct answers. It's an admirable level of comms for a company so large.

codegeek 2 years ago | | |

Is there a postmortem coming? Would you be able to tell us what happened at a high level?

elithrar 2 years ago | | |

See my comment here: https://news.ycombinator.com/item?id=38075877

(We’ll share more when we can)

ironmagma 2 years ago |

That's sad, hopefully something comes along that can brighten their day.

TheCleric 2 years ago |

Wonder if this is related to the mini NPM outage I was experiencing earlier:

https://status.npmjs.org/incidents/zdznxkrp22py

imslavko 2 years ago | |

No way to confirm, but I think so, just because NPM threw this error at me:

     KV GET failed: 401 Unauthorized

where KV could refer to the CF KV in workers

Animats 2 years ago |

All of them, or just those in some data centers?

AxiomaticSpace 2 years ago |

Looks like it's working now as of 1:23 PST

ChrisArchitect 2 years ago |

Damn yeah, noticing for last few mins+