Cloudflare outage on February 20, 2026(blog.cloudflare.com) |
Cloudflare outage on February 20, 2026(blog.cloudflare.com) |
The consensus was that returning everything is rarely what's desired, for two reasons: first, if the system grows, allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB) and for the user (the same problem on their side). Second, it's easy to forget to specify filters, especially in cases like "let's delete something based on some filters."
So the standard practice now is to return nothing if no filters are provided, and we pay attention to it during code reviews. If the user does really want all the data, you can add pagination to your API. With pagination, it's very unlikely for the user to accidentally fetch everything because they must explicitly work with pagination tokens, etc.
Another option, if you don't want pagination, is to have a separate method named accordingly, like ListAllObjects, without any filters.
If not optional then return 400, otherwise return all the results ( and have pagination ).
You should always have pagination in an API.
You can limit stress on RAM by streaming the data. You should ideally stream rows for any large dataset. Otherwise, like you say you are loading the entire thing into RAM.
Buffering up the entire data set before encoding it to JSON and sending it is one of the biggest sources of latency in API based software. Streaming can get latencies down to tens of microseconds!
For me it's like `filter1=*`
From all the recent outages, it sounds like Cloudflare is barely tested at all. Maybe they have lots of unit tests etc, but they do not seem to test their whole system... I get that their whole setup is vast, but even testing that subtask manually would have surfaced the bug
It's alarming already. Too many outages in the past months. CF should fix it, or it becomes unacceptable and people will leave the platform.
I really hope they will figure things out.
In the meantime, as you say, we’re now going through and evaluating other vendors for each component that CF provides - which is both unfortunate, and a frustrating use of time, as CF’s services “just worked” very well for a very long time.
Not everybody needs cloudflare, but those that need it and aren't major enterprises, have no other option.
They definitely failed big this time.
the explanation makes no sense:
> Because the client is passing pending_delete with no value, the result of Query().Get(“pending_delete”) here will be an empty string (“”), so the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed. The system interpreted this as all returned prefixes being queued for deletion.
client:
resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
server: if v := req.URL.Query().Get("pending_delete"); v != "" {
// ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
if err != nil {
api.RenderError(ctx, w, ErrInternalError)
return
}
api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
return
}
even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that blockThat does mean having contracts with more than one CDN provider however the cost should be negotiated based on monthly volume. i.e. the CDN with the most uptime gets the most money. If an existing CDN under contract refuses to negotiate then move some non critical path services to them and let that contract expire. Instate a company wide policy to never return to a vendor if their contract was intentionally not renewed.
Lmao, iirc long time ago Google's internal system had the same exact bug (treating empty as "all" in the delete call) that took down all their edges. Surprisingly there was little impact as traffic just routed through the next set of proxies.
One thing I do appreciate about cloudflare is their actual use of their status page. That’s not to say these outages are okay. They aren’t. However I’m pretty confident in saying that a lot of providers would have a big paper trail of outages if they were more honest to the same degree or more so than cloudflare. At least from what I’ve noticed, especially this year.
But last few months has been quite rough for Cloudflare, and a few outages on their Workers platform that didn't quite make the headlines too. Can't wait for Code Orange to get to production.
As for your last sentence:
Businesses really do care about the incident reports because they give good insight into whether they can trust the company going forward. Full transparency and a clear path to non-repetition due to process or software changes are called for. You be the judge of whether or not you think that standard has been met.
They said their /v1/prefixes endpoint has this snippet:
if v := req.URL.Query().Get("pending_delete"); v != "" {
// ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
[..snip..]
}
What's implied but not shown here is that endpoint normally returns all prefixes. They modified it to return just those pending deletion when passing a pending_delete query string parameter.The immediate problem of course is this block will never execute if pending_delete has no value:
/v1/prefixes?pending_delete <-- doesn't execute block
This is because Go defaults query params to empty strings and the if statement skips this case. Which makes you wonder, what is the value supposed to be? This is not explained. If it's supposed to be: /v1/prefixes?pending_delete=true <--- executes block
Then this would work, but the implementation fails to validate this value. From this you can infer that no unit test was written to exercise the value: /v1/prefixes?pending_delete=false <-- wrongly executes block
The post explains "initial testing and code review focused on the BYOIP self-service API journey." We can reasonably guess their tests were passing some kind of "true" value for the param, either explicitly or using a client that defaulted param values. What they didn't test was how their new service actually called it.So, while there's plenty to criticize on the testing front, that's first and foremost a basic failure to clearly define an API contract and implement unit tests for it.
But there's a third problem, in my view the biggest one, at the design level. For a critical delete path they chose to overload an existing endpoint that defaults to returning everything. This was a dangerous move. When high stakes data loss bugs are a potential outcome, it's worth considering more restrictive API that is harder to use incorrectly. If they had implemented a dedicated endpoint for pending deletes they would have likely omitted this default behavior meant for non-destructive read paths.
In my experience, these sorts of decisions can stem from team ownership differences. If you owned the prefixes service and were writing an automated agent that could blow away everything, you might write a dedicated endpoint for it. But if you submitted a request to a separate team to enhance their service to returns a subset of X, without explaining the context or use case very much, they may be more inclined to modify the existing endpoint for getting X. The lack of context and communication can end up missing the risks involved.
Final note: It's a little odd that the implementation uses Go's "if with short statement" syntax when v is only ever used once. This isn't wrong per se but it's strange and makes me wonder to what extent an LLM was involved.
maybe go can do (string v, ok bool) for this or add proper sum types...
Just joking, no offence :)
but in short they are changing whether string is empty, and query string "pending_delete" is same as "pending_delete=" and will return empty
Or, if they specified `/v1/prefixes?pending_delete=potato` it would return "correct" list of objects to delete
Or in other words "Go have types safety, fuck it, let's use strings like in '90s PHP apps instead"
If they passed in any value, they would have entered the block and returned early with the results of FetchPrefixesPendingDeletion.
From the post:
> this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them.
They expected to drop into the block of code above, but since they didn't, they returned all routes.
actual explanation: the API server by default returns everything. the client attempted to make a request to return "pending_deletes", but as the request was malformed, the API instead went down the default path, which returned everything. then the client deleted everything.
makes sense now
but is that explanation is even worse
because that means the code path was never tested?
(Other comments have explained the bug so I won't repeat them)
(: phonetically, because 'l's are hard to read.
Even so, it is a strong reminder not to rely on any one vendor for critical stuff, in case that wasn't clear enough yet.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
https://github.com/cloudflare/terraform-provider-cloudflare/...
even the blog, that used to be a respected source of technical content, has morphed into a garbage fire of slop and vaporware announcements since jgc left.
If I was a CF customer I would be migrating off now.
i think i care much more about our SLAs (if any)
**everything breaks**
...
**everything breaks again**
oh fuck! Code Orange! I repeat, Code Orange! we need to rebuild trust(R)(TM)! we've let our customers down!
...
**everything breaks again**
Code Orangier! I repeat, Code Orangier!
Or POST endpoint, with client side just sending serialized object as query rather than relying that the developer remembers the magical query string.
Incident reports themselves are highly technical, so will not reach management because they are most likely simply not equipped to deal with them. But the CTOs of the companies will take notice, especially when their own committed SLAs are endangered and their own management asks them for an explanation. CF makes them all look bad right now.
Becuase management wants to know why the graphs all went to zero, and the engineers have nothing else to do but relay the incident report.
This builds a perception for management of the vendor, and if the perception is that the vendor doesnt tell them shit or doesnt even seem to know theres an outage, then management can decide to shift vendors
But if I was given option to steal one feature out of other languages it would be enums and resulting Result/Optional from Rust
https://xcancel.com/eastdakota/status/2025221270061580453
Rather than be driven by something rational like building a great product or making lots of money he is apparently driven by a desperate fear of being a dinosaur.
Regardless of how competent he is or isn’t as a technologist, a leader leading with fear is a recipe for disaster.
As a dev, you'll need to have those stuff in your knowledge toolbox
cto can't even articulate a sentence without passing it through an LLM, and instead of doing his job he's posting the stupidest shit to his personal bootlicking chat channel. I cringe every time at the brown-nosers that inhabit that hovel.
no words for what the product org is becoming too. they should take their own advice a bit further and just replace all the leadership with an LLM, it would be cheaper and it's the same shit in practice
Just got to keep your head, remember it’s just a job and you get paid regardless. Clock in, clock out, do the work assigned to you but mentally just check out while you look for a new role
Seems like Matthew Prince didn't choose that route.
This is the key individual paradox: they can be a massive asset and make the impossible happen but if and when they leave you've got a real problem unless you can find another individual that is just as capable. Now, I do trust JGC to have created an organization that is as mature as it could be, but at the same time it is next to impossible to quantify your own effect on the whole because you lack objectivity and your underlings may not always tell you the cold hard truth for reasons all their own.
And in this case the problem is even larger: the experience collected by the previous guru does not transfer cleanly to the new one, simply because the new one lacks the experience of seeing the company go from being a tiny player to being a behemoth, and that's something you can do only once.
I've always been of the opinion that without JGC Cloudflare did not stand a chance, irrespective of those hundreds of developers. And that's before we get into things like goodwill.
And of those hundreds of developers you have to wonder how many see the writing on the wall and are thinking of jumping ship. The best ones always leave first.
I would not be surprised at all if this whole saga ends with Google, Microsoft or Amazon absorbing CF at a fraction of its current value.
We tried Stackpath, Imperva (Incapsula back in the day), etc but they were either too expensive or went out of business.
pay peanuts, get monkeys
Management thinks AI tools should make everyone 10x as productive, so they're all trying to run lean teams and load up the remaining engineers with all the work. This will end about as well as the great offshoring of the early 2000s.
at which point the CVEs started to fly in
Auth for clients is also specified in the spec - there is some scope for homeservers to freestyle, but nowadays they have to implement OIDC: https://spec.matrix.org/latest/client-server-api/#client-aut...
I could be quite the kernel developer if making the test green was the only criteria.
I would go so far as to say that you shouldnt be employed in the industry. Malicious actors like you will contribute to an erosion of trust thatll make everything worse
Don’t stop sabotaging AI efforts.
Using LLMs for software development is a safety hazard. It also has a societal risk, because it centralizes more data, more power, more money to tech oligarchs.
It's ethical to fight this. Still not commenting on legality.
Doing unethical things because someone pays you would still be unethical. Opposing those while someone pays you is still ethical.