Cloudflare outage on February 20, 2026

Cloudflare outage on February 20, 2026 (blog.cloudflare.com)

190 points by nomaxx117 131 days ago | 125 comments

kgeist 131 days ago |

It's something we debated in our team: if there's an API that returns data based on filters, what's the better behavior if no filters are provided - return everything or return nothing?

The consensus was that returning everything is rarely what's desired, for two reasons: first, if the system grows, allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB) and for the user (the same problem on their side). Second, it's easy to forget to specify filters, especially in cases like "let's delete something based on some filters."

So the standard practice now is to return nothing if no filters are provided, and we pay attention to it during code reviews. If the user does really want all the data, you can add pagination to your API. With pagination, it's very unlikely for the user to accidentally fetch everything because they must explicitly work with pagination tokens, etc.

Another option, if you don't want pagination, is to have a separate method named accordingly, like ListAllObjects, without any filters.
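
The pagination approach described above can be sketched roughly as follows. This is a minimal illustration, not anyone's production API: `listPage`, `pageSize`, and the offset-as-token encoding are all assumptions made for brevity (real tokens are usually opaque). The point is that a caller has to loop over tokens deliberately to get everything.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageSize is a hypothetical server-side cap on items per response.
const pageSize = 2

// listPage returns one page of items plus the token for the next page.
// An empty token means "start from the beginning"; an empty next token
// means "no more pages". The token is just a stringified offset here,
// which is an assumption made to keep the sketch short.
func listPage(items []string, token string) (page []string, next string, err error) {
	start := 0
	if token != "" {
		start, err = strconv.Atoi(token)
		if err != nil || start < 0 || start > len(items) {
			return nil, "", fmt.Errorf("invalid page token %q", token)
		}
	}
	end := start + pageSize
	if end >= len(items) {
		return items[start:], "", nil
	}
	return items[start:end], strconv.Itoa(end), nil
}

func main() {
	items := []string{"a", "b", "c", "d", "e"}
	token := ""
	for {
		page, next, err := listPage(items, token)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(page)
		if next == "" {
			break
		}
		token = next
	}
}
```

Fetching the whole data set now takes an explicit loop, so a forgotten filter yields one bounded page rather than everything.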

alemanek 131 days ago | |

Returning an empty result in that case may cause a more subtle failure. I would think returning an error would be a bit better as it would clearly communicate that the caller called the API endpoint incorrectly. If it’s HTTP a 400 Bad Request status code would seem appropriate.
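
A rough sketch of the "reject instead of guessing" behavior, assuming a hypothetical `listObjects` handler with a required `owner` filter (neither name comes from the thread). It uses only the standard `net/http` and `net/http/httptest` packages.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// listObjects is a hypothetical handler: the "owner" filter is required,
// so a request without it is rejected with 400 instead of returning
// everything or silently returning nothing.
func listObjects(w http.ResponseWriter, r *http.Request) {
	owner := r.URL.Query().Get("owner")
	if owner == "" {
		http.Error(w, "missing required query parameter: owner", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "objects for %s\n", owner)
}

// statusFor runs the handler against an in-memory request and
// returns the HTTP status code it produced.
func statusFor(target string) int {
	rec := httptest.NewRecorder()
	listObjects(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor("/objects"))            // 400: filter missing
	fmt.Println(statusFor("/objects?owner="))     // 400: filter present but empty
	fmt.Println(statusFor("/objects?owner=acme")) // 200: filter supplied
}
```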

Thaxll 130 days ago | |

Neither of your options is good. The first question you need to ask is whether the filter is optional or not (this is a contract/API question).

If it's not optional, return 400; otherwise return all the results (and have pagination).

You should always have pagination in an API.

Philip-J-Fry 131 days ago | |

>allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB)

You can limit stress on RAM by streaming the data. You should ideally stream rows for any large dataset. Otherwise, like you say, you are loading the entire thing into RAM.
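
As a hedged illustration of streaming, the sketch below encodes one row at a time with `encoding/json` instead of building the full slice first. The `row` type, `streamRows`, `counter`, and `render` are made-up names; `next` stands in for a database cursor such as `sql.Rows`, and the output format (newline-delimited JSON) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// row is a hypothetical record type.
type row struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// streamRows writes rows as newline-delimited JSON, one at a time, so
// only a single row is held in memory. next stands in for a database
// cursor and returns false when there are no more rows.
func streamRows(w io.Writer, next func() (row, bool)) error {
	enc := json.NewEncoder(w)
	for {
		r, ok := next()
		if !ok {
			return nil
		}
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
}

// counter returns a fake cursor that yields n generated rows.
func counter(n int) func() (row, bool) {
	i := 0
	return func() (row, bool) {
		if i >= n {
			return row{}, false
		}
		i++
		return row{ID: i, Name: fmt.Sprintf("row-%d", i)}, true
	}
}

// render is a demo helper that streams n rows into a string.
func render(n int) (string, error) {
	var sb strings.Builder
	err := streamRows(&sb, counter(n))
	return sb.String(), err
}

func main() {
	out, err := render(3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

In an HTTP handler the writer would be the `http.ResponseWriter`, so the first rows can reach the client before the last ones are read.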

jiggawatts 130 days ago | | |

Not to mention the latency reduction!

Buffering up the entire data set before encoding it to JSON and sending it is one of the biggest sources of latency in API based software. Streaming can get latencies down to tens of microseconds!

qwertyuiop_ 131 days ago | |

How about returning an error? 400 is the generic “client sent something wrong” bucket. Missing a required filter param is unambiguously a client mistake according to your own docs/contract → client error → 4xx family → 400 is the safest/default member of that family.

MobileVet 131 days ago | |

I like your thought process around the ‘empty’ case. While the opposite of a filter is no filter, to your point, that is probably not really the desire when it comes to data retrieval. We might have to revisit that ourselves.

PunchyHamster 131 days ago | |

But that query had a parameter. They just fucked up parsing it.

est 131 days ago | |

> to have a separate method named accordingly, like ListAllObjects, without any filters

For me it's like `filter1=*`

CommonGuy 131 days ago |

Insufficient mock data in the staging environment? Like no BYOIP prefixes at all? Since even one prefix should have shown that it would be deleted by that subtask...

From all the recent outages, it sounds like Cloudflare is barely tested at all. Maybe they have lots of unit tests etc, but they do not seem to test their whole system... I get that their whole setup is vast, but even testing that subtask manually would have surfaced the bug

zmj 131 days ago | |

Testing the "whole system" for a mature enterprise product is quite difficult. The combinatorial explosion of account configurations and feature usage becomes intractable on two levels: engineers can't anticipate every scenario they need their tests to cover (because the product is too big to understand as a whole), and even if comprehensive testing were possible, it would be impractical in some combination of time, flakiness, and cost.

dabinat 131 days ago | |

I think Cloudflare does not sufficiently test lesser-used options. I lurk in the R2 Discord and a lot of users seem to have problems with custom domains.

asciii 131 days ago | |

It was also merged 15 days prior to production release...however, you're spot on with the empty test. That's a basic scenario that if it returned all...is like oh no.

suhputt 131 days ago | |

my guess is the company is rotting from the inside and drowning in tech debt

martinald 131 days ago | |

Just crazy. Why does a staging environment matter? They should be running some integration tests against eg an in memory database for these kinds of tasks surely?

otar 131 days ago |

Reliability was/is CF's label.

It's alarming already. Too many outages in the past months. CF should fix it, or it becomes unacceptable and people will leave the platform.

I really hope they will figure things out.

tallytarik 131 days ago | |

We’re still waiting on a solution for https://www.cloudflarestatus.com/incidents/391rky29892m (which actually started a month earlier than the incident reports)

In the meantime, as you say, we’re now going through and evaluating other vendors for each component that CF provides - which is both unfortunate, and a frustrating use of time, as CF’s services “just worked” very well for a very long time.

argestes 131 days ago | |

I have many things dependent on Cloudflare. That makes me root for Cloudflare and I think I'm not the only one. Instead of finding better options we're getting stuck on an already failing HA solution. I wonder what caused this.

slothsarecool 131 days ago | | |

There are no alternatives, and those alternatives that did exist back in the day, had to shut down due to either going out of business or not being able to keep a paygo model.

Not everybody needs cloudflare, but those that need it and aren't major enterprises, have no other option.

arcatech 131 days ago | | |

Do you not feel concern about you and everybody else deciding to put ALL of their eggs into one basket like this?

alansaber 131 days ago |

Not sure why everyone is complaining, new MCP features are more important than uptime

NinjaTrance 131 days ago |

The irony is that the outage was caused by a change from the "Code Orange: Fail Small initiative".

They definitely failed big this time.

vimda 131 days ago |

One has to wonder when the board realises Dane was a bad replacement for JGC. These outages are getting ridiculous

blibble 131 days ago |

is this blog post LLM generated?

the explanation makes no sense:

> Because the client is passing pending_delete with no value, the result of Query().Get(“pending_delete”) here will be an empty string (“”), so the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed. The system interpreted this as all returned prefixes being queued for deletion.

client:

    resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)

server:

    if v := req.URL.Query().Get("pending_delete"); v != "" {
        // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
        prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
        if err != nil {
            api.RenderError(ctx, w, ErrInternalError)
            return
        }

        api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
        return
    }

even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block

atty 131 days ago |

I do not work in the space at all, but it seems like Cloudflare has been having more network disruptions lately than they used to. To anyone who deals with this sort of thing, is that just recency bias?

anurag 131 days ago |

The one redeeming feature of this failure is staged rollouts. As someone advertising routes through CF, we were quite happy to be spared from the initial 25%.

jaboostin 131 days ago |

Hindsight is 20/20 but why not dry run this change in production and monitor the logs/metrics before enabling it? Seems prudent for any new “delete something in prod” change.
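
A dry-run switch for a destructive task can be as small as the sketch below. `removePending` and its `del` callback are hypothetical stand-ins for the real cleanup job and delete call, and the prefixes are documentation ranges.

```go
package main

import (
	"fmt"
	"log"
)

// removePending deletes each prefix unless dryRun is set, in which case
// it only logs what would have been deleted. del stands in for the real
// delete call. It returns the number of prefixes actually deleted.
func removePending(prefixes []string, dryRun bool, del func(string) error) (int, error) {
	deleted := 0
	for _, p := range prefixes {
		if dryRun {
			log.Printf("dry run: would delete %s", p)
			continue
		}
		if err := del(p); err != nil {
			return deleted, fmt.Errorf("delete %s: %w", p, err)
		}
		deleted++
	}
	return deleted, nil
}

func main() {
	prefixes := []string{"192.0.2.0/24", "198.51.100.0/24"}
	del := func(p string) error {
		log.Printf("deleting %s", p)
		return nil
	}

	// First pass in production: log only, then inspect the logs and metrics.
	n, err := removePending(prefixes, true, del)
	fmt.Println(n, err)
}
```

If the dry-run log lists every prefix in the system, that is the signal to stop before the flag is flipped.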

Bender 131 days ago |

Old tech could work around these outages. Set up GSLB at a DNS provider that does health checks, or perform your own health checks against both the origin and the CDNs and use APIs to change DNS. If the origin servers are OK and the CDN is not, automatically change DNS to a different CDN. There should be multiple probes that form a consensus. This process assumes one is managing the configurations of their CDNs through code and APIs so that one can set up and tear down any number of CDNs on a whim.

That does mean having contracts with more than one CDN provider; however, the cost should be negotiated based on monthly volume, i.e. the CDN with the most uptime gets the most money. If an existing CDN under contract refuses to negotiate, then move some non-critical-path services to them and let that contract expire. Institute a company-wide policy to never return to a vendor if their contract was intentionally not renewed.
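
The probe-consensus failover described above might look something like this in outline. It only covers the decision logic; the probe results, CDN names, and quorum size are invented inputs, and the actual DNS update through a provider API is left out.

```go
package main

import "fmt"

// quorumHealthy reports whether at least quorum probes saw the target as up.
func quorumHealthy(probes []bool, quorum int) bool {
	up := 0
	for _, ok := range probes {
		if ok {
			up++
		}
	}
	return up >= quorum
}

// pickCDN returns the first CDN, in preference order, that a quorum of
// probes considers healthy. It only picks one when the origin itself is
// healthy, since switching CDNs does not help if the origin is down.
func pickCDN(origin []bool, cdns []string, cdnProbes map[string][]bool, quorum int) (string, bool) {
	if !quorumHealthy(origin, quorum) {
		return "", false
	}
	for _, name := range cdns {
		if quorumHealthy(cdnProbes[name], quorum) {
			return name, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical results from three probe locations.
	origin := []bool{true, true, true}
	probes := map[string][]bool{
		"cdn-a": {false, false, true}, // primary looks down from two probes
		"cdn-b": {true, true, true},
	}
	name, ok := pickCDN(origin, []string{"cdn-a", "cdn-b"}, probes, 2)
	fmt.Println(name, ok)
}
```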

himata4113 131 days ago |

This blog post is inaccurate: the prefixes were being revoked over and over. To keep your prefixes advertised you had to have a script that would re-add them, or else they would be withdrawn again. The way they worded it seems really dishonest.

dilyevsky 131 days ago |

Lmao, iirc long time ago Google's internal system had the same exact bug (treating empty as "all" in the delete call) that took down all their edges. Surprisingly there was little impact as traffic just routed through the next set of proxies.

boarush 131 days ago |

While neither I nor the company I work for is directly impacted by this outage, I wonder how long Cloudflare can take these hits and keep apologizing for them. I truly appreciate them being transparent about it, but businesses care more about SLAs and uptime than the incident report.

llama052 131 days ago | |

I’ll take clarity and actual RCAs over Microsoft’s approach of not notifying customers and keeping their status page green until enough people notice.

One thing I do appreciate about cloudflare is their actual use of their status page. That’s not to say these outages are okay. They aren’t. However I’m pretty confident in saying that a lot of providers would have a big paper trail of outages if they were more honest to the same degree or more so than cloudflare. At least from what I’ve noticed, especially this year.

boarush 131 days ago | | |

Azure straight up refuses to show me if there's even an incident even if I can literally not access shit.

But the last few months have been quite rough for Cloudflare, with a few outages on their Workers platform that didn't quite make the headlines too. Can't wait for Code Orange to get to production.

jacquesm 131 days ago | |

Bluntly: they expended that credit a while ago. Those that can will move on. Those that can't have a real problem.

As for your last sentence:

Businesses really do care about the incident reports because they give good insight into whether they can trust the company going forward. Full transparency and a clear path to non-repetition due to process or software changes are called for. You be the judge of whether or not you think that standard has been met.

boarush 131 days ago | | |

I might be looking at it differently, but aren't decisions about a service provider made by management? Incident reports don't ever reach them, in my experience.

VirusNewbie 131 days ago |

If you track large SaaS and cloud uptime, it seems to correlate pretty highly with compensation at big companies. Is Cloudflare getting top talent?

bombcar 131 days ago | |

Based on IPO date and lockups, I suspect top talent is moving on.

abalone 131 days ago |

The code they posted doesn't quite explain the root cause. This is a good case study for resilient API design and testing.

They said their /v1/prefixes endpoint has this snippet:

  if v := req.URL.Query().Get("pending_delete"); v != "" {
      // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
      prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
      
      [..snip..]
  }

What's implied but not shown here is that endpoint normally returns all prefixes. They modified it to return just those pending deletion when passing a pending_delete query string parameter.

The immediate problem of course is this block will never execute if pending_delete has no value:

  /v1/prefixes?pending_delete   <-- doesn't execute block

This is because Go's Query().Get returns an empty string both when a parameter is absent and when it is present without a value, and the if statement skips the empty string. Which makes you wonder, what is the value supposed to be? This is not explained. If it's supposed to be:

  /v1/prefixes?pending_delete=true   <--- executes block

Then this would work, but the implementation fails to validate this value. From this you can infer that no unit test was written to exercise the value:

  /v1/prefixes?pending_delete=false   <-- wrongly executes block

The post explains "initial testing and code review focused on the BYOIP self-service API journey." We can reasonably guess their tests were passing some kind of "true" value for the param, either explicitly or using a client that defaulted param values. What they didn't test was how their new service actually called it.

So, while there's plenty to criticize on the testing front, that's first and foremost a basic failure to clearly define an API contract and implement unit tests for it.
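
To make the contract point concrete, here is one way a strict parser for the flag could look. This is a sketch under assumptions, not Cloudflare's code: `pendingDeleteFilter` is a hypothetical helper, and treating the parameter as a boolean is a guess at the intended contract. It relies on `url.Values.Has` (Go 1.17+) and `strconv.ParseBool`, which accepts forms like `true`, `1`, `false`, and `0`.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// pendingDeleteFilter is a hypothetical strict parser for the
// pending_delete flag. It distinguishes an absent parameter from a
// present one and rejects any value that is not a valid boolean,
// including the bare "?pending_delete" form with no value.
func pendingDeleteFilter(rawQuery string) (bool, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, err
	}
	if !q.Has("pending_delete") {
		return false, nil // parameter absent: filter not requested
	}
	v := q.Get("pending_delete")
	enabled, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("pending_delete must be a boolean, got %q", v)
	}
	return enabled, nil
}

func main() {
	cases := []string{
		"",                     // absent
		"pending_delete",       // present, no value
		"pending_delete=true",  // explicit true
		"pending_delete=false", // explicit false
	}
	for _, raw := range cases {
		enabled, err := pendingDeleteFilter(raw)
		fmt.Printf("%q -> enabled=%t err=%v\n", raw, enabled, err)
	}
}
```

With this shape the four cases in `main` double as the unit-test table: the bare form becomes an error the caller sees, rather than a silent fall-through to "return everything".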

But there's a third problem, in my view the biggest one, at the design level. For a critical delete path they chose to overload an existing endpoint that defaults to returning everything. This was a dangerous move. When high-stakes data loss bugs are a potential outcome, it's worth considering a more restrictive API that is harder to use incorrectly. If they had implemented a dedicated endpoint for pending deletes they would have likely omitted this default behavior meant for non-destructive read paths.
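
A dedicated endpoint could be sketched like this. The `/v1/prefixes/pending-deletion` path, the `store` type, and the helper names are all hypothetical. The idea is only that the destructive path has its own route with no parameter to get wrong and no "return everything" fallback.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// store is a hypothetical in-memory stand-in for the prefixes database.
type store struct {
	all     []string
	pending []string
}

// newMux registers two separate routes: the general read path and a
// dedicated route that can only ever return prefixes pending deletion.
func newMux(s store) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/prefixes", func(w http.ResponseWriter, r *http.Request) {
		_ = json.NewEncoder(w).Encode(s.all)
	})
	mux.HandleFunc("/v1/prefixes/pending-deletion", func(w http.ResponseWriter, r *http.Request) {
		_ = json.NewEncoder(w).Encode(s.pending)
	})
	return mux
}

// get issues an in-memory GET request and returns the response body.
func get(h http.Handler, target string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux(store{
		all:     []string{"192.0.2.0/24", "198.51.100.0/24"},
		pending: []string{"198.51.100.0/24"},
	})
	fmt.Print(get(mux, "/v1/prefixes/pending-deletion"))
	fmt.Print(get(mux, "/v1/prefixes?pending_delete")) // the overloaded form still lists everything
}
```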

In my experience, these sorts of decisions can stem from team ownership differences. If you owned the prefixes service and were writing an automated agent that could blow away everything, you might write a dedicated endpoint for it. But if you submitted a request to a separate team to enhance their service to return a subset of X, without explaining the context or use case very much, they may be more inclined to modify the existing endpoint for getting X. The lack of context and communication can end up missing the risks involved.

Final note: It's a little odd that the implementation uses Go's "if with short statement" syntax when v is only ever used once. This isn't wrong per se but it's strange and makes me wonder to what extent an LLM was involved.

fjoaasdfas 130 days ago |

yikes: https://github.com/golang/go/blob/master/src/net/url/url.go#...

maybe go can do (string v, ok bool) for this or add proper sum types...
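
For what it's worth, a comma-ok style lookup is already possible, because `url.Values` is a `map[string][]string`. The `lookup` helper below is hypothetical, not part of the standard library; it just wraps the map access to show how an absent key differs from an empty value.

```go
package main

import (
	"fmt"
	"net/url"
)

// lookup is a hypothetical helper that returns the first value for key
// and whether the key was present at all, in the style of a map's
// comma-ok form. It indexes url.Values directly instead of calling Get.
func lookup(rawQuery, key string) (value string, present bool, err error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", false, err
	}
	vals, ok := q[key] // url.Values is a map[string][]string
	if !ok || len(vals) == 0 {
		return "", false, nil
	}
	return vals[0], true, nil
}

func main() {
	for _, raw := range []string{"", "pending_delete", "pending_delete=true"} {
		v, ok, err := lookup(raw, "pending_delete")
		fmt.Printf("%q -> value=%q present=%t err=%v\n", raw, v, ok, err)
	}
}
```

Since Go 1.17 the standard library also offers `url.Values.Has` for the presence check alone.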

ssiddharth 131 days ago |

The eternal tech outage aphorism: It's always DNS, except for when it's BGP.

subscribed 131 days ago | |

You could argue BGP is like DNS for IPs :)

est 131 days ago |

Bitbucket was down for a while as well. Seems no one noticed.

wa008 131 days ago |

This transparent report can earn my trust

NooneAtAll3 131 days ago |

again?

dryarzeg 131 days ago |

Just joking, no offence :)

logicchains 131 days ago | |

DaaS is good ja

henning 131 days ago |

[flagged]

sp00chy 131 days ago | |

that’s my feeling also. We will get this more and more in future.

djfobbz 131 days ago |

I'm honestly amazed that a company CF's size doesn't have a neat little cluster of Mac Minis running OpenClaw and quietly taking care of this for them.

user205738 131 days ago |

They should have rewritten this code in Rust using these brilliant language models. /jk