When talking of their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks what I've seen elsewhere: quality engineering can't keep up with the production engineering. It's just that I think of CloudFlare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980's. He said in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
I don't think this is really helping the site owners. I suspect it's mainly about AI extortion:
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following Lua exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you; good procedure will.
First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.
As to architecture:
Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
But there’s a more interesting argument in favour of the status quo.
Assuming cloudflare’s uptime is above average, having outages affect everything at once is actually better for the average internet user.
It might not be intuitive but think about it.
How many Internet services does someone depend on to accomplish something such as their work over a given hour? Maybe 10 directly, and another 100 indirectly? (Make up your own answer, but it’s probably quite a few).
If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.
On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.
It’s not really a bad end-user experience that every service uses cloudflare. It’s more a question of why cloudflare’s stability seems to be going downhill.
And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.
Some people go even further, speculating that the original DARPA military network that preceded the modern Internet was designed to ensure continuity of command and control (C&C) for the US military in the event of an all-out nuclear attack during the Cold War.
This is the time for Internet researchers to redefine how Internet applications are built and operated. The local-first paradigm is the first step in the right direction (pardon the pun) [2].
[1] The Real Internet Architecture: Past, Present, and Future Evolution:
https://press.princeton.edu/books/paperback/9780691255804/th...
[2] Local-first software: You own your data, in spite of the cloud:
After some investigation, I realized that none of these routes passed through Cloudflare OWASP. The reported anomalies total 50, exceeding the pre-configured maximum of 40 (Medium).
Despite being simple image or video uploads, the WAF is generating anomalies that make no sense, such as the following:
- 933100: PHP Injection Attack: PHP Open Tag Found (Cloudflare OWASP Core Ruleset Score +5)
- 933180: PHP Injection Attack: Variable Function Call Found (Cloudflare OWASP Core Ruleset Score +5)
For now, I’ve had to raise the OWASP Anomaly Score Threshold to 60 and enable the JS Challenge, but I believe something is wrong with the WAF after today’s outage.
This issue is still not resolved as of this moment.
Every change is a deployment, even if it's config. Treat it as such.
Also you should know that a strongly typed language won't save you from every type of problem. And especially not if you allow things like unwrap().
It is just mind boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests then at least fuzz it.
Yes, this is the second time in a month. Were folks expecting that to have been enough time for them to have made sweeping technical and organizational changes? I say no—this doesn't mean they aren't trying or haven't learned any lessons from the last outage. It's a bit too soon to say that.
I see this event primarily as another example of the #1 class of major outages: bad rapid global configuration change. (The last CloudFlare outage was too, but I'm not just talking about CloudFlare. Google has had many, many such outages. There was an inexplicable multi-year gap between recognizing this and having a good, widely available staged config rollout system for teams to drop into their systems.) Stuff like DoS attack configurations needs to roll out globally quickly. But they really need to make it not quite this quick. Imagine they deployed to one server for one minute, one region for one minute on success, then everywhere on success. Then this would have been a tiny blip rather than a huge deal.
(It can be a bit hard to define "success" when you're doing something like blocking bad requests that may even be a majority of traffic during a DDoS attack, but noticing 100% 5xx errors for 38% of your users due to a parsing bug is doable!)
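In concrete terms, the canary-then-region-then-global idea might look something like this (a minimal Lua sketch; push_config, http_5xx_rate, and baseline_5xx_rate are invented helper names, not real APIs):

-- Sketch of a canary-style rollout: one server, then one region, then global,
-- aborting and reverting if the 5xx rate regresses against a baseline.
local stages = { "single-server", "single-region", "global" }
local soak_seconds = 60

for _, scope in ipairs(stages) do
    push_config(scope, new_config)            -- invented deploy helper
    os.execute("sleep " .. soak_seconds)      -- soak for one minute per stage
    if http_5xx_rate(scope) > 2 * baseline_5xx_rate(scope) then
        push_config(scope, previous_config)   -- revert the affected scope
        error("rollout aborted at stage: " .. scope)
    end
end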
As for the specific bug: meh. They should have had 100% branch coverage on something as critical (and likely small) as the parsing for this config. Arguably a statically typed language would have helped (but the `.unwrap()` error in the previous outage is a bit of a counterargument to that). But it just wouldn't have mattered that much if they caught it before global rollout.
https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-...
So they are aware of some basic mitigation tactics guarding against errors
> This system does not perform gradual rollouts,
They just choose to YOLO
> Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”,
> However, we have never before applied a killswitch to a rule with an action of “execute”.
Do they do no testing? This wouldn't even require fuzzing with “infinite” variations; the set of actions is a short, finite list.
> existed undetected for many years. This type of code error is prevented by languages with strong type systems.
So this solution is also well known, just ignored for years, because "if it’s not broken, don’t fix it?", right?
- prioritize security: get patches ASAP
- prioritize availability: get patches after a cooldown period
Because ultimately, it's a tradeoff that cannot be handled by Cloudflare. It depends on your business, your threat model.
At some point they'll have to admit this React thing ain't working and just use classic server-rendered pages, since their dashboards are simple toggle controls.
If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?
I think the parent post made a different argument:
- Centralizing most of the dependency on Cloudflare results in a major outage when something happens at Cloudflare, it is fragile because Cloudflare becomes the single point of failure. Like: Oh Cloudflare is down... oh, none of my SaaS services work anymore.
- In a world where this is not the case, we might see more outages, but they would be smaller and more contained. Like: oh, Figma is down? Fine, let me pick up another task and come back to Figma once it's back up. It's also easier to work around by having alternative providers as a fallback, as they are less likely to share the same failure point.
As a result, I don't think you'll be blocked 100 hours a year in scenario 2. You may observe 100 non-blocking inconveniences per year, vs a completely blocking Cloudflare outage.
And in observed uptime, I'm not even sure these providers ever won. We're running all our auxiliary services on a decent Hetzner box with a LB. Say what you want, but that uptime is looking pretty good compared to any services relying on AWS (Oct 20, 15 hours), Cloudflare (Dec 5 (half hour), Nov 18 (3 hours)). Easier to reason about as well. Our clients are much more forgiving when we go down due to Azure/GCP/AWS/Cloudflare vs our own setup though...
The point is that it doesn’t matter. A single site going down has a very small chance of impacting a large number of users. Cloudflare going down breaks an appreciable portion of the internet.
If Jim’s Big Blog only maintains 95% uptime, most people won’t care. If BofA were at 95%.. actually same. Most of the world aren’t BofA customers.
If Cloudflare is at 99.95% then the world suffers
Putting Cloudflare in front of a site doesn't mean that site's backend suddenly never goes down. Availability will now be worse - you'll have Cloudflare outages* affecting all the sites they proxy for, along with individual site back-end failures which will of course still happen.
* which are still pretty rare
I’m tired of this sentiment. Imagine if people said, why develop your own cloud offering? Can you really do better than VMWare..?
Innovation in technology has only happened because people dared to do better, rather than giving up before they started…
The problem with pursuing efficiency as the primary value prop is that you will necessarily end up with a brittle result.
The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.
They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me that sounds like there's more to the story; this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
After rolling out a bad ruleset update, they tried a killswitch (rolled out immediately to 100%) which was a code path never executed before:
> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset
> a straightforward error in the code, which had existed undetected for many years
I have mixed feelings about this.
On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.
On the other hand, this is exactly what CloudFlare is good for: protecting sites from malicious requests.
That being said, I think it’s worth a discussion. How much of the last 3 outages was because of JGC (the former CTO) retiring and Dane taking over?
Did JGC have a steady hand that’s missing? Or was it just time for outages that would have happened anyway?
Dane has maintained a culture of transparency which is fantastic, but did something get injected in the culture leading towards these issues? Will it become more or less stable since JGC left?
Curious for anyone with some insight or opinions.
(Also, if it wasn’t clear - huge Cloudflare fan and sending lots of good vibes to the team)
Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?
It still surprises me that there are basically no free alternatives comparable to Cloudflare. Putting everything on CF creates a pretty serious single point of failure.
It's strange that in most industries you have at least two major players, like Coke vs. Pepsi or Nike vs. Adidas. But in the CDN/edge space, there doesn't seem to be a real free competitor that matches Cloudflare's feature set.
It feels very unhealthy for the ecosystem. Does anyone know why this is the case?
Benefit: Earliest uptake of new features and security patches.
Drawback: Higher risk of outages.
I think this should be possible since they already differentiate between free, pro and enterprise accounts. I do not know how the routing for that works but I bet they could do this. Think crowd-sourced beta testers. Also a perk for anything PCI audit or FEDRAMP security prioritized over uptime.
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kind of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.
Say what you want, but I'd prefer to trust CloudFlare who admits and act upon their fuckups, rather than trying to cover them up or downplaying them like some other major cloud providers.
@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems
Now half of the global economy seems to run on the same service provider…
But the distributed system is vulnerable to DDOS.
Is there an architecture that maintains the advantages of both systems? (Distributed resilience with a high-volume failsafe.)
My answer would be that no one product should get this big.
Good to know I'm not the only one
The pay-per-crawl thing is about them thinking ahead to post-AI business/revenue models.
The way AI happened, it removed a big chunk of revenue from news companies, blogs, etc. Because lots of people go to AI instead of reaching the actual 3rd party website.
AI currently gets the content for free from the 3rd party websites, but they have revenue from their users.
So Cloudflare is proposing that AI companies should be paying for their crawling. Cloudflare's solution would give the lost revenue back where it belongs, just through a different mechanism.
The ugly side of the story is that an open-source solution for this already existed, called L402 (L402.org).
Cloudflare wants to be the first to take a piece of the pie, but instead of using the open-source version, they forked it internally and published it as their own, Cloudflare-specific service.
To be completely fair, L402 requires you to solve the payment mechanism yourself, which for Cloudflare is easy because they already deal with payments.
But we run software and configuration changes through three tiers - first stage for the dev-team only, second stage with internal customers and other teams depending on it for integration and internal usage -- and finally production. Some teams have also split production into different rings depending on the criticality of the customers and the number of customers.
This has led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and testing with no or little problem. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong deployment in dev and test just because. If you want to, just go through it in minutes.
But after a few spicy production incidents, even the better and faster teams understood and accepted that once technical velocity exists, actual velocity is a choice, or a throttle if you want an analogy.
If you do good, by all means, promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down, spend more resources on manual testing and improving automated testing, give changes time to simmer in the internally productive environment, spend more time between promotions from production ring to production ring.
And this is on top of considerations of e.g. change risk. Some frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart, and the other could be a multi-hour recovery from backups.
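To make the "velocity is a choice, or a throttle" idea concrete, here is a minimal sketch (ring names, soak times, and the incident penalty are all invented for illustration, not any real policy):

-- Sketch: promotion soak times per ring, stretched for teams that recently
-- caused production incidents ("actual velocity is a choice, or a throttle").
local rings = {
    { name = "dev",         soak_minutes = 15  },
    { name = "internal",    soak_minutes = 120 },
    { name = "prod-ring-1", soak_minutes = 240 },
    { name = "prod-ring-2", soak_minutes = 480 },
}

local function soak_for(ring, team)
    local throttle = 1 + (team.recent_incidents or 0)
    return ring.soak_minutes * throttle
end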
One of the items in the list of procedures is to use types to encode rules of your system.
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
Note that the two deployments were of different components.
Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.
There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.
Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.
https://www.henricodolfing.ch/case-study-4-the-440-million-s...
The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).
One might think a company on the scale of Cloudflare would have a suite of comprehensive tests to cover various scenarios.
Kind of funny that we get something showing the benefits of Rust so soon after everyone was ragging on it about unwrap anyway!
As with any organisation where the CTO is not technical, there will be someone who the 'CTO' has to ask to understand technical situations. In my opinion, that person being asked is the real CTO, for any given situation.
Since attackers might rotate IPs more frequently than once per minute, this effectively means that the whole fleet of servers should be able to quickly react depending on the decisions done centrally.
i.e. it's the difference between "site goes down for a few hours every few months" and "an attacker slammed your site and, through on-demand scaling or serverless cloud fees, blew your entire infrastructure budget for the year."
Doubly so when your service is part of a larger platform and attacks on your service risk harming your reputation for the larger platform.
Another suggestion is to do deployments during the night shift of each country; right now they only take the US night into account.
Once I worked with a team in the anti-abuse space where the policy was that code deployments must happen over 5 days while config updates can take a few minutes. Then an engineer on the team argued that deploying new Python code doesn't count as a code change because the CPython interpreter did not change; it didn't even restart. And indeed, given how dynamic Python is, it is totally possible to import new Python modules that did not exist when the interpreter process was launched.
Cloudflare is down and hundreds of well paid engineers spring into action to resolve the issue. Your server goes down and you can’t get ahold of your Server Person because they’re at a cabin deep in the woods.
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production but surely they don't
What’s more concerning to me is that now we’ve had AWS, Azure, and CloudFlare (and CloudFlare twice) go down recently. My gut says:
1. developers and IT are using LLMs in some part of the process, which will not be 100% reliable.
2. Current culture of I have (some personal activity or problem) or we don’t have staff, AI will replace me, f-this.
3. Pandemic after effects.
4. Political climate / war / drugs; all are intermingled.
This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.
Their timeline:
> 08:47: Configuration change deployed and propagated to the network
> 08:48: Change fully propagated
> 08:50: Automated alerts
> 09:11: Configuration change reverted and propagation start
> 09:12: Revert fully propagated, all traffic restored
2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.
Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.
How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?
Yes, as they explain it's the rollback that was triggered due to seeing these errors that broke stuff.
I feel like the cloud hosting companies have lost the plot. "They can provide better uptime than us" is the entire rationale that a lot of small companies have when choosing to run everything in the cloud.
If they cost more AND they're less reliable, what exactly is the reason to not self host?
I sometimes fancy that I could just take cash, go into the wood, build a small solar array, collect & cleanse river water, and buy a starlink console.
Neither will seatbelts if you drive into the ocean, or helmets if you drink poison. I'm not sure what your point is.
Ouch. Harsh, given that Cloudflare is being over-honest (down to admitting they disabled the internal tool) and the outage's relatively limited impact (time-wise and in number of customers). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 it's Lua's turn with its dynamic typing.
Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...
cf TFA:
if rule_result.action == "execute" then
rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
[0] https://news.ycombinator.com/item?id=44159166
The consequence of some services being offline is much, much worse than a person (or a billion) being bored in front of a screen.
Sure, it’s arguably not Cloudflares fault that these services are cloud-dependent in the first place, but even if service just degrades somewhat gracefully in an ideal case, that’s a lot of global clustering of a lot of exceptional system behavior.
Or another analogy: every person probably passes out for a few minutes in their life at one point or another. Yet I wouldn’t want to imagine what happens if everybody got that over with at the very same time, without warning…
Not really, they're just lying. I mean yes, of course they aren't oracles who discover complex problems in the instant of the first failure, but they know full well when there are problems and significantly underreport them, to the extent that they are less "smoke alarms" and more "your house has burned down and the ashes are still smoldering" alarms. Incidents are intentionally underreported. It's bad enough that there ought to be legislation and civil penalties for the large providers who fail to report known issues promptly.
For a start-up it's much easier to just pay the Cloud tax than it is to hire people with the appropriate skill sets to manage hardware or to front the cost.
Larger companies on the other hand? Yeah, I don't see the reason to not self host.
Shifting liability. You're paying someone else for it to be their problem, and if everyone does it, no one will take flak for continuing to do so. What is the average tenure of a CIO or decision maker electing to move to or remain at a cloud provider? This is why you get picked to talk on stage at cloud provider conferences.
(have been in the meetings where these decisions are made)
https://www.cloudflare.com/careers/jobs/?department=Engineer...
However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".
I won't say never, but a situation where the right answer to avoid a rollback (that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit
During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in "roll back able", "graceful fail", "fail open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.
I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
If there’s indeed a 5 min lag in monitoring dashboard in Cloudflare, I honestly think that's a pretty big concern.
For example, a simple curl script on your top 100 customers' homepage that runs every 30 seconds would have given the warning and notifications within a minute. If you stagger deployments at 5 minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.
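For the sake of illustration, that kind of probe could be as simple as the following (a rough Lua sketch using LuaSocket; the URLs and the alerting are placeholders, and real use would want HTTPS via LuaSec, retries, and paging):

-- Sketch of a synthetic probe: poll a handful of customer pages and flag
-- 5xx responses (or connection failures) every 30 seconds.
local http = require("socket.http")

local probes = {
    "http://customer-one.example/",
    "http://customer-two.example/",
}

while true do
    for _, url in ipairs(probes) do
        local body, code = http.request(url)
        if body == nil or (tonumber(code) or 0) >= 500 then
            print(("ALERT: %s returned %s"):format(url, tostring(code)))
        end
    end
    os.execute("sleep 30")
end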
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective and it's starting to look like a liability for large numbers of people there are obvious solutions for that.
Just speculating based on my experience: it's more likely than not that they refused to invest in fail-safe architectures for cost reasons. Control plane and data plane should be separate; a React patch shouldn't affect traffic forwarding.
Forget manual rollbacks, there should be automated reversion to a known working state.
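As a minimal sketch of that idea (parse_config, validate_config, and log_error are hypothetical helpers; this illustrates the fail-open / known-good pattern, not anyone's actual code):

-- Sketch of fail-open handling: if a new config can't be parsed or validated,
-- log and keep serving with the last known-good config instead of erroring.
local ok, candidate = pcall(parse_config, raw_config)   -- hypothetical parser
if ok and validate_config(candidate) then               -- hypothetical validator
    active_config = candidate
else
    log_error("config rejected; keeping last known-good configuration")
    -- active_config is intentionally left unchanged; traffic keeps flowing
end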
Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.
I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.
It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)
It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.
"Damned if they do, damned if they don't" kind of situation.
There are even lints for the usage of the `unwrap` and `expect` functions.
As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.
Perhaps it's the similar way of not testing the possible error path, which is an organizational problem.
They are probably OK with occasional breaks as long as customers don't mind.
Actual deployments take hours to propagate worldwide.
(Disclosure: former Cloudflare SRE)
The latter is easier to handle, easier to fix, and much more survivable if you do fuck it up a bit. It gives you some leeway to learn from mistakes.
If you make a mistake during the 1000 dog siege, or if you don't have enough guards on standby and ready to go just in case of this rare event, you're just cooked.
It's still a bit silly though, their claimed reasoning probably doesn't really stack up for most of their config changes - I don't see it to be that likely that a 0.1->1->10->100 rollout over the period of 10 minutes would be a catastrophically bad idea for them for _most_ changes.
And to their credit, it does seem they want to change that.
You can easily block ChatGPT and most other AI scrapers if you want:
Which makes it feel that much more special when a service provides open access to all of the infrastructure diagnostics, like e.g. https://status.ppy.sh/
I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.
Vast majority of current monitoring systems are built on Prometheus. (Well okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high" frequency online service monitoring system does a scrape every 30 seconds. Well known reliability engineering practices state that you need a minimum of two consecutive telemetry points to detect any given event - because we're talking about a distributed system and network is not a reliable transport. That in turn means that with near-perfect reliability the maximum time window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will show up only after a delay of just-a-hair-shy-of-three scraping cycle windows.
At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.
As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant the earliest an alert could fire for something failing was effectively 30 seconds after the failure manifested.
Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
I know when I need to reset the clock on my microwave oven.
To me this reads as a form of misdirection, intentional or not. A monopolist has little reason to care about downstream effects, since customers have nowhere else to turn. Framing this as roll your own versus Cloudflare rather than as a monoculture CDN environment versus a diverse CDN ecosystem feels off.
That said, the core problem is not the monopoly itself but its enablers, the collective impulse to align with whatever the group is already doing, the desire to belong and appear to act the "right way", meaning in the way everyone else behaves. There are a gazillion ways of doing CDN, why are we not doing them? Why the focus on one single dominant player?
I don't think this is an entropy issue; it's human error bubbling up, and Cloudflare charges a premium for it.
My faith in Cloudflare is shook for sure. Two major outages weeks apart, and this won't be the last.
Super-procedural code in particular is too complex for humans to follow, much less AI.
Software development is a rare exception to this. We’re often writing from scratch (same with designers, and some other creatives). But these are definitely the exception compared to the broader workforce.
Same concept applies for any app that’s built on top of multiple third-party vendors (increasingly common for critical dependencies of SaaS)
A key part of secure systems is availability...
It really looks like vibe-coding.
Every company that has ignored my following advice has experienced a day-for-day slip in first-quarter scheduling. And that advice is: not much work gets done between Dec 15 and Jan 15. You can rely on a week's worth; more than that is optimistic. People are taking it easy, and they need to verify things with someone who is on vacation, so they are blocked. And when that person gets back, it's two days until their vacation, so it's a crap shoot.
NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.
This is far too dismissive of how disruptive the downtime can be and it sets the bar way too low for a company so deeply entangled in global internet infrastructure.
I don’t think you can make such an assertion with any degree of credibility.
This reads like sarcasm, but I guess it is not. Yes, you are a CDN, a major one at that. 30 minutes of downtime or "whatever" is not acceptable. I worked on traffic teams at social networks that considered themselves that mission-critical. CF is absolutely that critical, and there are definitely lives at stake.
2025-12-05

On December 5, 2025, at 08:47 UTC (all times in this blog are UTC), a portion of Cloudflare’s network began experiencing significant failures. The incident was resolved at 09:12 (~25 minutes total impact), when all services were fully restored.
A subset of customers were impacted, accounting for approximately 28% of all HTTP traffic served by Cloudflare. Several factors needed to combine for an individual customer to be affected as described below.
The issue was not caused, directly or indirectly, by a cyber attack on Cloudflare’s systems or malicious activity of any kind. Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
Any outage of our systems is unacceptable, and we know we have let the Internet down again following the incident on November 18. We will be publishing details next week about the work we are doing to stop these types of incidents from occurring.
The graph below shows HTTP 500 errors served by our network during the incident timeframe (red line at the bottom), compared to unaffected total Cloudflare traffic (green line at the top).

Cloudflare's Web Application Firewall (WAF) provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB.
As part of our ongoing work to protect customers who use React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.
This first change was being rolled out using our gradual deployment system. During rollout, we noticed that our internal WAF testing tool did not support the increased buffer size. As this internal test tool was not needed at that time and had no effect on customer traffic, we made a second change to turn it off.
This second change of turning off our WAF testing tool was implemented using our global configuration system. This system does not perform gradual rollouts, but rather propagates changes within seconds to the entire fleet of servers in our network and is under review following the outage we experienced on November 18.
Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.
As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following Lua exception:
[lua] Failed to run module rulesets callback late_routing: /usr/local/nginx-fl/lua/modules/init.lua:314: attempt to index field 'execute' (a nil value)
resulting in HTTP code 500 errors being issued.
The issue was identified shortly after the change was applied, and was reverted at 09:12, after which all traffic was served correctly.
Customers that have their web assets served by our older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.
Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.
Cloudflare’s rulesets system consists of sets of rules which are evaluated for each request entering our system. A rule consists of a filter, which selects some traffic, and an action which applies an effect to that traffic. Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”, which is used to trigger evaluation of another ruleset.
Our internal logging system uses this feature to evaluate new rules before we make them available to the public. A top level ruleset will execute another ruleset containing test rules. It was these test rules that we were attempting to disable.
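For illustration only (this is not Cloudflare's actual data model, and the filter syntax is only loosely modeled on rule expressions), a ruleset with an "execute" action can be pictured roughly like this:

-- Illustrative shape only: each rule pairs a filter with an action, and an
-- "execute" action points at another ruleset (here, internal test rules).
local managed_ruleset = {
    { id = "r1", filter = 'http.request.uri.path contains "/admin"', action = "block" },
    { id = "r2", filter = "cf.waf.score lt 20",                      action = "log"   },
    { id = "r3", filter = "true",                                    action = "execute",
      execute = { ruleset = "waf-internal-test-rules" } },
}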
We have a killswitch subsystem as part of the rulesets system which is intended to allow a rule which is misbehaving to be disabled quickly. This killswitch system receives information from our global configuration system mentioned in the prior sections. We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident.
However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset:
if rule_result.action == "execute" then
rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
This code expects that, if the ruleset has action=”execute”, the “rule_result.execute” object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua returned an error due to attempting to look up a value in a nil value.
This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.
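For comparison, a minimal defensive guard in the Lua version would look something like this (a sketch of the class of fix, not necessarily the change Cloudflare shipped):

-- Sketch of a defensive guard: only index rule_result.execute when the rule
-- was actually evaluated and the table exists (it is nil when killswitched).
if rule_result.action == "execute" and rule_result.execute ~= nil then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end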
We made an unrelated change that caused a similar, longer availability incident two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base.
We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:
Enhanced Rollouts & Versioning: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.
Streamlined break glass capabilities: Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.
"Fail-Open" Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.
Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.
These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours. On behalf of the team at Cloudflare we want to apologize for the impact and pain this has caused again to our customers and the Internet as a whole.
Time (UTC) | Status | Description |
08:47 | INCIDENT start | Configuration change deployed and propagated to the network |
08:48 | Full impact | Change fully propagated |
08:50 | INCIDENT declared | Automated alerts |
09:11 | Change reverted | Configuration change reverted and propagation start |
09:12 | INCIDENT end | Revert fully propagated, all traffic restored |
I don’t know the answer to all the questions. But here I think it is just a way to avoid responsibility. If someone chooses CDN “number 3” and it goes down, business people *might* put the blame on this person for not choosing “the best”. I am not saying it is the right approach; I have just seen it happen too many times.
I hope that was their #1 priority from the very start given the services they sell...
Anyway, people always tend to overthink those black-swan events. Yes, 2 happened in quick succession, but what is the average frequency overall? Insignificant.
It's like saying that Chipotle having X% chance of tainted food is worse than local burrito places having 2*X% chance of tainted food. It's true in the lens that each individual event affects more people, but if you removed that Chipotle and replaced with all local, the total amount of illness is still strictly higher, it's just tons of small events that are harder to write news articles about.
There are likely emergency services dependent on Cloudflare at this point, so I’m only semi serious.
How? If Github is down how many people are affected? Google?
> Jim’s Big Blog only maintains 95% uptime, most people won’t care
Yeah, and in the world with Cloudflare people don't care if Jim's Blog is down either. So Cloudflare doesn't make things worse.
If the world suffers, those doing the "suffering" need to push that complaint/cost back up the chain - to the website operator, which would push the complaint/cost up to cloudflare.
The fact that nobody did - or just verbally complained without action - is evidence that they didn't really suffer.
In the meantime, BofA saved the cost of achieving 99.95% uptime themselves (presumably cloudflare does it cheaper than they could individually). So the entire system became more efficient as a result.
This is a simplistic opinion. Claiming services like Cloudflare are modeled as single points of failure is like complaining that your use of electricity to power servers is a single point of failure. Cloudflare sells a global network of highly reliable edge servers running services like caching, firewall, image processing, etc., and, more importantly, a global firewall that protects services against global distributed attacks. Until a couple of months ago, it was unthinkable to casual observers that Cloudflare was such an utterly unreliable mess.
If so, is it a good or bad trade to have more overall uptime but when things go down it all goes down together?
Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...
If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.
robots.txt isn't even respected by all of the American companies. Chinese ones (which often also use what are essentially botnets in Latin American and the rest of the world to evade detection) certainly don't care about anything short of dropping their packets.
Sorry but that’s a method you use if you serve 100 requests per second, not when you are at Cloudflare scale. Cloudflare easily have big enough volume that this problem would trigger an instant change in a monitorable failure rate.
Prometheus has an unaddressed flaw [0] where rate functions must span at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won’t reflect the change until a minute after.
If your target is security, then _assuming your patch is actually valid_ you're giving better security coverage to free customers than to your paying ones.
Cloudflare is both, and their tradeoffs seem to be set on maximizing security at cost of availability. And it makes sense. A fully unavailable system is perfectly secure.
Disclosure: I work at Cloudflare, but not on the WAF
And on top of that, Cloudflare's value proposition is "we're smart enough to know that instantaneous global deployments are a bad idea, so trust us to manage services for you so you don't have to rely on in house folks who might not know better"
There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t yet finished the progressive rollout implementation.
That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.
With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough you've got enough developers that half a dozen PRs will have been merged since the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?
Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system in a novel state. You're rolling forwards, not backwards.
If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.
They are separate.
> a react patch shouldn't affect traffic forwarding.
If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?
This was a configuration change to increase the buffered body size from 128 KB to 1 MB.
The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
Give me a break.
Though this is showing the problem with these things: Migrating faster could have reduced the impact of this outage, while increasing the impact of the last outage. Migrating slower could have reduced the impact of the last outage, while increasing the impact of this outage.
This is a hard problem: How fast do you rip old working infrastructure out and risk finding new problems in the new stack, yet, how long do you tolerate shortcomings of the old stack that caused you to build the new stack?
Looking across the errors, it points to some underlying practices: a lack of systems metaphors, modularity, and testability, and a reliance on super-generic configuration instead of software with enforced semantics.
Just because CF is up doesn't mean the site is.
The world can also live a few hours without sewers, water supply, food, cars, air travel, etc.
But "can" and "should" are different words.
What an utterly clueless claim. You're literally posting in a thread with nearly 500 posts of people complaining. Taking action takes time. A business just doesn't switch cloud providers overnight.
I can tell you in no uncertain terms that there are businesses impacted by Cloudflare's frequent outages that started work shedding their dependency on Cloudflare's services. And it's not just because of these outages.
It may have been unthinkable to some casual observers that creating a giant single point of failure for the internet was a bad idea but it was entirely thinkable to others.
Also, if you need to switchover to backup systems for everything at once, then either the backup has to be the same for everything and very easily implementable remotely - which to me seems unlikely for specialty systems, like hospital systems, or for the old tech that so many organizations still rely on (and remember the CrowdStrike BSODs that had to be fixed individually and in person and so took forever to fix?) - or you're gonna need a LOT of well-trained IT people, paid to be on standby constantly, if you want to fix the problems quickly, on account of they can't be everywhere at once.
If the problems are more spread out over time, then you don't need to have quite so many IT people constantly on standby. Saves a lot of $$$, I'd think.
And if problems are smaller and more spread out over time, then an organization can learn how to deal with them regularly, as opposed to potentially beginning to feel and behave as though the problem will never actually happen. And if they DO fuck up their preparedness/response, the consequences are likely less severe.
You are responsible for your dependencies, unless they are specific integrations. Either switch to more reliable dependencies or add redundancy so that you can switch between providers when any one is down.
At scale there's no such thing as "instant". There is distribution of progress over time.
The failure is an event. Collection of events takes time (at scale, going through store-and-forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: non-zero possibility of system collapse. Bad: can progress very quickly towards global outages. Good: faults are detected quickly, response decision-making is easy, and mitigation is fast.
Hyperscale is not just "a very large number of small simple systems".
Denoising alerts is a fact of life for SRE...and is a survival skill.
Most scaled analysis systems provide precise control over the type of aggregation used within the analyzed time slices. There are many possibilities, and different purposes for each.
High frequency events are often collected into distributions and the individual timestamps are thrown away.
Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?
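A one-box style rollout with a baking period is not conceptually complicated. A rough sketch follows, with the stage names, health signal, and thresholds all assumed rather than taken from any real system:

```python
import time

STAGES = ["one-box", "one-datacenter", "one-region", "global"]  # illustrative stage names

def error_rate(stage):
    """Placeholder for a real health signal (e.g. 5xx rate) scoped to the stage."""
    return 0.0

def progressive_deploy(apply_change, revert_change,
                       bake_seconds=1800, max_error_rate=0.001):
    """Roll a change out stage by stage, letting each stage bake before widening it."""
    for stage in STAGES:
        apply_change(stage)
        deadline = time.time() + bake_seconds
        while time.time() < deadline:               # the baking period
            if error_rate(stage) > max_error_rate:
                revert_change()                     # roll back everywhere, immediately
                raise RuntimeError("rollout aborted at stage %r" % stage)
            time.sleep(30)
    print("rollout completed")
```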
Privately disclosed: Nov 29
Fix pushed: Dec 1
Publicly disclosed: Dec 3
This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation of a Nextjs vulnerability when my organization doesn't even own any React app, let alone a Nextjs one?
Also, specious reasoning #2: not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.
There is another name for rolling forward, it's called tripping up.
The short answer is "yes", due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone along the way, but it's possible. Or otherwise revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for the progressive rollout.
There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (Where that doesn't work is if CM previously created files, it won't delete them unless explicitly told to do so.)
This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.
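As a hedged sketch of that kind of containment (the rule interface and fallback policy are invented), each rule can be evaluated in isolation so a crashing rule is logged and skipped while cached rules and global policies keep providing shielding:

```python
import logging

def evaluate_rules(request, rules, global_policies):
    """Evaluate WAF-style rules so one faulty rule cannot take the request path down."""
    verdicts = []
    for rule in rules:
        try:
            verdicts.append(rule(request))          # each rule returns "block" or "allow"
        except Exception:
            # Contain the failure: log it, skip the rule, keep serving traffic.
            logging.exception("rule %r crashed; skipping", getattr(rule, "__name__", rule))
    # Cached/global policies still provide shielding even if some rules failed.
    verdicts.extend(policy(request) for policy in global_policies)
    return "block" if "block" in verdicts else "allow"
```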
You can't ask for Cloudflare to turn on a dime and implement this in this manner. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.
Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.
The fact that no major cloud provider is actually good is not an argument that Cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much they have to expect that people will be upset when they fail.
> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
> Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.
My takeaway is that the body parsing logic is in React or Next.js; is that incorrect? And the WAF rule testing tool (control plane) was interdependent with the WAF's body parsing logic; is that also incorrect?
> This was a configuration change to change the buffered size of a body from 256kb to 1mib.
Yes, and if it was resilient, the body parsing would be done on a discrete forwarding plane. Any config changes should be auto-tested for forwarding failures by the separate control plane and auto-reverted when there are errors. If the WAF rule testing tool was part of that test, then it being down shouldn't have affected the data plane, because it would be a separate system.
Data/control plane separation means the runtimes of the two, and any dependencies they have, are separate. It isn't cheap to do this right; that's why I speculated (and I made clear I was speculating) that it was because they wanted to save costs.
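A rough sketch of what "auto-tested and auto-reverted by a separate control plane" could look like, with the probe, thresholds, and watch window all assumptions rather than anything Cloudflare actually runs:

```python
import time

def data_plane_error_rate():
    """Placeholder: measure forwarding errors from the data plane's own metrics."""
    return 0.0

def apply_with_auto_revert(push_config, revert_config,
                           watch_seconds=300, max_error_rate=0.001):
    """Control-plane wrapper: push a config, watch the data plane, revert on regressions."""
    baseline = data_plane_error_rate()
    push_config()                      # e.g. raise the body buffer from 256 KiB to 1 MiB
    deadline = time.time() + watch_seconds
    while time.time() < deadline:
        if data_plane_error_rate() > max(baseline * 2, max_error_rate):
            revert_config()            # automatic rollback, no human in the loop
            return False
        time.sleep(10)
    return True
```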
> The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
Please tone down the rage a bit and leave room for some discussion. You should take your own pill and be curious about what I meant instead of taking a rage-first approach.
You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.
I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.
I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.
So at this point no, the world can most definitely not “just live without the Internet”. And emergency services aren’t the only important thing that exists to the extent that anything else can just be handwaved away.
It is far worse if all of the competitors are down at once. To some extent you can and should have a little bit of stock at home (water, food, medicine, ways to stay warm, etc) but not everything is practical to do so with (gasoline for example, which could have knock on effects on delivery of other goods).
Is it? I can’t say that my personal server has been (unplanned) down at any time in the past 10 years, and these global outages have just flown right past it.
Yes, there are various bots, and some of the large US companies such as Perplexity do indeed seem to be ignoring robots.txt.
Is that a problem? It's certainly not a problem with cpu or network bandwidth (it's very minimal). Yes, it may be an issue if you are concerned with scraping (which I'm not).
Cloudflare's "solution" is a much bigger problem that affects me multiple times daily (as a user of sites that use it), and those sites don't seem to need protection against scraping.
Obviously it depends on the bot, and you can't block the scammy ones. I was really just referring to the major legitimate companies (which might not include Perplexity).
This seems like an issue with the design of your status page. If the broken dependencies truly had a limited blast radius, that should've been able to be communicated in your indicators and statistics. If not, then the unreliable reputation was deserved, and all you did by removing the status page was hide it.
Users want to do things. If their goal depends on a complex chain of functions (provided by various semi-independent services), then the ideal setup would be to have redundant providers so users could simply "load balance" between them, and for the high-level providers' uptime states to be clustered separately (meaning that when Google is unavailable Bing is up, and when Random Site A goes down its payment provider goes down too, etc.).
So ideally sites would somehow sort themselves neatly into separate availability groups.
Otherwise simply having a lot of uncorrelated downtimes doesn't help (if we count the sum of downtime experienced by people). Though again it gets complicated by the downtime percentage, because there's likely a phase shift between the state where a user can mostly complete their goals and the state where they cannot, because too many failures cascade.
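As a toy illustration of that provider-level "load balancing" (the provider names and failure simulation are placeholders), a client can walk an ordered list and use the first provider that answers; this only helps if the providers' downtimes are uncorrelated, which is exactly the clustering described above.

```python
import random

PROVIDERS = ["provider-a.example", "provider-b.example"]  # hypothetical redundant services

def call(provider, request):
    """Placeholder for an actual RPC/HTTP call; randomly fails to simulate outages."""
    if random.random() < 0.3:
        raise ConnectionError("%s unavailable" % provider)
    return "%s handled %r" % (provider, request)

def call_with_failover(request):
    last_error = None
    for provider in PROVIDERS:
        try:
            return call(provider, request)
        except ConnectionError as exc:
            last_error = exc           # this provider is down; try the next one
    raise RuntimeError("all providers are down") from last_error

print(call_with_failover("checkout"))
```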
I won’t remember this block of code because five other people have touched it. So I need to be able to see what has changed and what it talks to, so I can quickly verify whether my old assumptions still hold true.
Military hardware is produced with engineering design practices that look nothing at all like what most of the HN crowd is used to. There is an extraordinary amount of documentation, requirements, and validation done for everything.
There is a MIL-SPEC for pop tarts which defines all part sizes, tolerances, etc.
Unlike a lot of the software world, military hardware gets DONE with design and then they just manufacture it.
They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.
Checklists work.
While we're here, any other Prometheus or Grafana advice is welcome.
I’m happy to see they’re changing their systems to fail open, which is one of the things I mentioned in the conversation about their last outage.
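For what failing open means concretely, a minimal sketch (not Cloudflare's code; the handler names are invented): if the protective check itself errors, the request is passed through rather than turned into a 500.

```python
def handle(request, waf_check, origin_fetch, fail_open=True):
    """Apply a WAF check in front of the origin, failing open if the check itself breaks.

    origin_fetch is expected to return a (status, body) tuple as well.
    """
    try:
        if waf_check(request) == "block":
            return 403, "blocked by WAF"
    except Exception:
        if not fail_open:
            return 500, "WAF unavailable"   # fail closed: a protection outage becomes an outage
        # fail open: lose the extra protection for a while, but keep serving traffic
    return origin_fetch(request)
```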
As a recovering devops/infra person from a lifetime ago (who has, much to my heartbreak, broken prod more than once), perhaps that is where my grace in this regard comes from. Systems and their components break, systems and processes are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, GitHub, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).
China-nexus cyber threat groups rapidly exploit React2Shell vulnerability (CVE-2025-55182) - https://aws.amazon.com/blogs/security/china-nexus-cyber-thre... - December 4th, 2025
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Could cloudflare do better? Sure, that’s a truism for everyone. Did they make mistakes and continue to make mistakes? Also a truism.
Trust me, they are acutely aware of people getting upset when they fail. Why do you think their CEO and CTO are writing these blog posts?
1. There is an active vulnerability unrelated to Cloudflare where React/Next.JS can be abused via a malicious payload. The payload could be up to 1MB.
2. Cloudflare's buffer size wasn't large enough to prevent that payload from being passed through to the Cloudflare customer.
3. To protect their customers, Cloudflare wanted to increase the buffer size to 1MB (see the sketch after this list).
4. The Internal Testing Tool wasn't able to handle the change to 1MB and started failing.
5. They wanted to stop the Internal Testing Tool from failing, but doing so required disabling a ruleset that an existing system depended on (due to a long-standing bug). This caused the wider incident.
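To make the buffer-size step concrete (the 256 KB and 1 MB figures come from the summary above; everything else is invented), a payload whose malicious part sits past the 256 KB mark is simply never seen by the rules until the buffer is enlarged:

```python
OLD_BUFFER = 256 * 1024        # 256 KiB, the old inspection limit
NEW_BUFFER = 1 * 1024 * 1024   # 1 MiB, the new limit

def inspected_portion(body, buffer_size):
    # Only the buffered prefix of the body is visible to the WAF rules.
    return body[:buffer_size]

# Hypothetical payload: junk padding followed by a malicious marker near the 1 MB mark.
payload = b"A" * (900 * 1024) + b"<malicious-deserialization-gadget>"

print(b"<malicious" in inspected_portion(payload, OLD_BUFFER))  # False: slips past the rules
print(b"<malicious" in inspected_portion(payload, NEW_BUFFER))  # True: now inspected
```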
It does seem like a mess, in the sense that in order to stop the internal testing tool from failing they had to endanger things globally in production, yes. It looks like a legacy, tech-debt mess.
It seems like the result of bad decisions made in the past, though.
I think the question then is how much of the Internet has fungible alternatives, such that uncorrelated downtime can meaningfully have less impact. If you have a "to buy" shopping list, the existence of alternative shopping list products doesn't help you; when the one you use is down, it's just down, and the substitutes cannot substitute on short notice. Obviously for some things there are clear substitutes, but actually I think "has fungible alternatives" is mostly correlated with "being down for 30 minutes doesn't matter"; it seems that the things where you want the one specific site are the ones where availability matters more.
It's never right to leave structural issues even if "they don't happen under normal conditions".
Not all sites can have full caching, we've tried.
I equate it to driving. I'd rather be moving at a normal speed on side streets than sitting in traffic on the expressway, even if the expressway is technically faster.
(and also, rolling your own version of WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical devices company will beat CF at this game.)
It's fine to be upset, and especially rightfully so after the second outage in less than 30 days, but this doesn't justify toxicity.
It might be more maintainable to have leaks instead of elaborate destruction routines, because then you only have to consider the costs of allocations.
Java has a no-op garbage collector (Epsilon GC) for the same reason. If your financial application really needs good performance at any cost and you don't want to rewrite it, you can throw money at the problem to make it go away.
What works much better is having an intentional review step that you come back to.
If there is a memory leak, then this is a flaw. It might not matter so much for a specific product, but I can also easily see it being forgotten if it was only mentioned somewhere in the documentation, maybe not clearly enough, and deadlines and the stress to ship are a thing there as well.
https://www.ailawandpolicy.com/2025/10/anti-circumvention-re...
I think your take is terribly simplistic. In a professional setting, virtually all engineers have no say on whether the company switches platforms or providers. Their responsibility is to maintain and develop services that support business. The call to switch a provider is ultimately a business and strategic call, and is a subject that has extremely high inertia. You hired people specialized in technologies, and now you're just dumping all that investment? Not to mention contracts. Think about the problem this creates.
Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.
https://www.csoonline.com/article/3814810/backdoor-in-chines...
Most hospital and healthcare IT teams are extremely underfunded, undertrained, and overworked, and the software, configurations, and platforms are normally not the most resilient things.
I have a friend at one in the North East right now going through a hell of a security breach for multiple months now and I'm flabbergasted no one is dead yet.
When it comes to tech, I get the impression most organizations are not very "healthy" in the durability of systems.
In this particular case, they seem to be doing two things:
- Phasing out the old proxy (Lua based), which is replaced by FL2 (Rust based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules
and they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue.
A better analogy is that if the restaurant you'll be going to is unexpectedly closed for a little while, you would do an after-dinner errand before dinner instead and then visit the restaurant a bit later. If the problem affects both businesses (like a utility power outage) you're stuck, but you can simply rearrange your schedule if problems are local and uncorrelated.
At some point you have to admit that humans are pretty bad at some things. Keeping documentation up to date and coherent is one of those things, especially in the age of TikTok.
Better to live in the world we have and do the best you can, than to endlessly argue about how things should be but never will become.
We already started looking into moving away from Zoom; I suggested self-hosting http://jitsi.org. Based on their docs, self-hosting is well supported, and a $50-$100 server is probably more than enough, so a lot cheaper than Zoom.
Vendor contracts have 1-3 year terms. We (a financial services firm) re-evaluate tech vendors every year for potential replacement and technologists have direct input into these processes. I understand others may operate under a different vendor strategy. As a vendor customer, your choices are to remain a customer or to leave and find another vendor. These are not feelings, these are facts. If you are unhappy but choose not to leave a vendor, that is a choice, but it is your choice to make, and unless you are a large enough customer that you have leverage over the vendor, these are your only options.
Shouldn't grey beards, grizzled by years of practicing rigorous engineering, be passing this knowledge on to the next generation? How did they learn it when just starting out? They weren't born with it. Maybe engineering has actually improved so much that we only experience outages this frequently, and such feelings of nostalgia are born from never having had to deal with systems with such high degrees of complexity and, realistically, 100% availability expectations on a global scale.
If a missile passes the long hurdles and hoops built into modern Defence T&E procurement it will only ever be considered out of spec once it fails.
For a good portion of platforms they will go into service, be used for a decade or longer, and not once will the design be modified before going end of life and replaced.
If you wanted to progressively iterate or improve on these platforms, then yes continual updates and investing in the eradication of tech debt is well worth the cost.
If you're strapping explosives attached to a rocket engine to your vehicle and pointing it at someone, there is merit in knowing it will behave exactly the same way it has done the past 1000 times.
Neither ethos in modifying a system is necessarily wrong, but you do have to choose which you're going with, and what the merits and drawbacks of that are.
The amount of dedication and meticulous, concentrated work I knew from older engineers when I started working, and that I remember from my grandfathers, is something I very rarely observe these days. Neither in engineering-specific fields nor in general.
Now, there can be tens of thousands of similar considerations to document. And keeping up that documentation with the actual state of the world is a full time job in itself.
You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally out of date and not worth relying on, because it's actively steering you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as it is practiced today. I wish it was, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.
We "should" be nice to each other, I "should" make more money, and it "should" be sunny more often. And we "should" have well written, accurate and reliable docs, but I'm too old to be waiting around for that day to come, especially in the age of zero attention and AI generated shite.
A lot of people are angry about this, and I think it's borderline illegal: https://devforum.zoom.us/t/you-have-exceeded-the-limit-of-li...
You pay for something, and you can't use it.
Of course, it's also possible you signed a contract that basically says "we can just decide not to work and you can't do anything about it" in which case, sucks, and fire whoever negotiates your B2B contracts. But also, those clauses can be void if the violation is serious enough.
What I don't like, is that whenever you contact Zoom, their representatives are taught to say one thing: buy more licenses.
Not only that, but their API/pricing is specifically designed to cover edge-cases that will force you to buy a license.
For example, they don't expose an API to assign a co-host. You can do that via the UI, manually, but not via the API.
Can you share which solution you are moving to?