Ask HN: Cloudflare Workers are down? Getting too many 500s on a dozen services now. UPD: https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc
This is not the way we wanted anyone to start their week.
(I am the PM lead for Cloudflare Workers: Databases & Storage)
All the best to the people fixing!
Seems to be 30 minutes, according to the status page.
Fix is fast. Curious what it was.
EDIT 4:10 PM Eastern: Now I can log in to the dashboard, but the "Workers and Pages" menu is returning errors and I have no access. Website still down :(
EDIT at 4:23 PM Eastern: RESOLVED. Website (Cloudflare Pages) is back up now for me.
Looks like they took about 25 mins to resolve.
EDIT: I panicked a little. As a dev, I should have been more sympathetic.
4.10pm eastern, still working
4.23 eastern. Yep you guessed it
Half an hour means they’ve lost their five nines for this year based on this outage alone.
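To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes a 365-day year and takes the five-nines (99.999%) target and the 30-minute outage from the comments above as given; the function names are made up for illustration.

```python
# Downtime budget implied by an availability target over one non-leap year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year that still meet the availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def achieved_availability(outage_minutes: float) -> float:
    """Availability over the year if this were the only outage."""
    return 1.0 - outage_minutes / MINUTES_PER_YEAR


five_nines_budget = downtime_budget_minutes(0.99999)  # roughly 5.26 minutes
four_nines_budget = downtime_budget_minutes(0.9999)   # roughly 52.56 minutes
after_outage = achieved_availability(30)              # roughly 0.99994

print(f"five nines allows {five_nines_budget:.2f} min/year")
print(f"four nines allows {four_nines_budget:.2f} min/year")
print(f"a single 30 min outage leaves {after_outage:.5%}")
```

So a single 30-minute outage blows through a five-nines budget several times over, but on its own would still fit within four nines.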
We've moved to next-on-pages for our new marketing site and I've spent the whole day on finishing touches ready for switch over at 20:00 UTC, and now this :((
data "cloudflare_ip_ranges" "cloudflare_ipv4_list" {}
This is coming back with an empty list on some fields and causing havoc in Terraform.
To make it worse, you can't even kill Terraform safely because while it does register your Ctrl+C, it won't interrupt an ongoing process, and if you force kill it you run the very serious risk of corrupting your state file.
Seriously, I'm looking for OpenTofu to light some fire under the ass of HashiCorp. I don't know where all the VC money went, but for what's supposed to be the gold standard of IaC solutions, it's sometimes bloody ridiculous.
(Not to mention it's written in Go of all things which means there's virtually zero tooling and documentation to debug it or to develop anything for... especially when compared to the state of the art in Java, NodeJS or PHP)
I mean: it used to be a thing. Now we have the cloud.
EDIT 15:07 MDT: People are reporting that Workers are back up. Mine isn't in my site's critical path. So I'm going to leave the Worker disabled (un-routed) until tonight.
The internet was meant to stop reliance on single sources (in case of nuclear war)
The size of a house of cards increases the number of failure points
Marketers lie
You have all the technical means. Your home server possibly won't be reachable, yes.
The global connectivity as-is is really, really, really fault tolerant.
Maybe "laugh" is not accurate, idk. But their post kinda looked like 'here, we built this tool that should have been made by Okta'.
I wonder if there's a connection.
(I am the PM lead for Workers databases & storage)
KV GET failed: 401 Unauthorized
where KV could refer to the CF KV in Workers.
The idea that a single server can beat the reliability of a massively distributed system is counter-intuitive, and yet it's usually the case.
The average distributed system is a house of cards that can come tumbling down if any one of a number of pieces fails. The average static server is a rock of stability, with very few failure modes.
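A back-of-the-envelope way to see the "house of cards" effect, as a sketch only: assume every component is required (no redundancy), failures are independent, and each component is up 99.9% of the time. Those numbers and the function name are hypothetical, picked purely for illustration.

```python
import math


def serial_availability(component_availabilities):
    """Availability of a system that is up only when every component is up.

    Assumes independent failures and no redundancy, which real systems
    only approximate.
    """
    return math.prod(component_availabilities)


single_server = serial_availability([0.999])        # one box: 0.999
ten_dependencies = serial_availability([0.999] * 10)  # about 0.990

print(f"1 component:   {single_server:.4f}")
print(f"10 components: {ten_dependencies:.4f}")
```

Under these assumptions, ten hard dependencies turn "three nines" per piece into roughly 99.0% overall. Redundancy pushes the other way, so this only applies to pieces that are genuine single points of failure.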
(We’ll share more when we can)
It could be worth it, but if you do the math and it seems like it's not worth it, it could perhaps give you some equanimity the next time it happens?
Here are the retries in the provider code https://github.com/hashicorp/terraform-provider-aws/blob/mai...
It's hard coded to "certificateCrossServicePropagationTimeout" which is 20 minutes here https://github.com/hashicorp/terraform-provider-aws/blob/mai...
Yep, but often enough that's useless after the fact. In the PHP world, for example, there's Symfony/Monolog's `fingers_crossed` logger [1]. It keeps logs below the normal threshold in memory, but as soon as a single event of a given severity or worse occurs, it dumps all the logs it has buffered so far for this request.
A real lifesaver that one is.
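For anyone who hasn't seen the pattern, here's a rough sketch of the same idea in Python. This is not Monolog's API, just the buffering technique described above; `FingersCrossedHandler`, `trigger_level` and `reset` are names invented for this example.

```python
import logging


class FingersCrossedHandler(logging.Handler):
    """Buffer records in memory until one at or above `trigger_level` arrives.

    When that happens, everything buffered so far is forwarded to `target`,
    and later records pass straight through until `reset()` is called.
    """

    def __init__(self, target, trigger_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target
        self.trigger_level = trigger_level
        self.buffer = []
        self.triggered = False

    def emit(self, record):
        if self.triggered:
            # Already activated: forward immediately.
            self.target.handle(record)
            return
        self.buffer.append(record)
        if record.levelno >= self.trigger_level:
            self.triggered = True
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer.clear()

    def reset(self):
        """Drop buffered records, e.g. at the end of a request."""
        self.buffer.clear()
        self.triggered = False


# Usage: nothing is printed for the debug/info lines unless an error follows.
log = logging.getLogger("fingers_crossed_demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(FingersCrossedHandler(logging.StreamHandler()))

log.debug("loading config")  # buffered
log.info("calling upstream")  # buffered
log.error("upstream failed")  # flushes all three lines
```

The standard library's `logging.handlers.MemoryHandler` with `flushLevel` gets close to this, but it also flushes whenever its buffer fills up, so it isn't quite the same behaviour.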