When talking of their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks what I've seen elsewhere: quality engineering can't keep up with the production engineering. It's just that I think of CloudFlare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980's. He said in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
I don't think this is really helping the site owners. I suspect it's mainly about AI extortion:
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following Lua exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you; good procedure will.
First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.
As to architecture:
Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
But there’s a more interesting argument in favour of the status quo.
Assuming cloudflare’s uptime is above average, having outages affect everything at once is actually better for the average internet user.
It might not be intuitive but think about it.
How many Internet services does someone depend on to accomplish something such as their work over a given hour? Maybe 10 directly, and another 100 indirectly? (Make up your own answer, but it’s probably quite a few).
If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.
On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.
It’s not really a bad end-user experience that every service uses cloudflare. It’s more a question of why cloudflare’s stability seems to be going downhill.
And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.
Some people go even further, speculating that the original DARPA military network that preceded the modern Internet was designed to ensure continuity of command and control (C&C) for the US military in the event of an all-out nuclear attack during the Cold War.
This is the time for Internet researchers to redefine how Internet applications are built and operated. The local-first paradigm is the first step in the right direction (pardon the pun) [2].
[1] The Real Internet Architecture: Past, Present, and Future Evolution:
https://press.princeton.edu/books/paperback/9780691255804/th...
[2] Local-first software: You own your data, in spite of the cloud:
After some investigation, I realized that none of these routes passed through Cloudflare OWASP. The reported anomalies total 50, exceeding the pre-configured maximum of 40 (Medium).
Despite being simple image or video uploads, the WAF is generating anomalies that make no sense, such as the following:
- 933100: PHP Injection Attack: PHP Open Tag Found (Cloudflare OWASP Core Ruleset Score +5)
- 933180: PHP Injection Attack: Variable Function Call Found (Cloudflare OWASP Core Ruleset Score +5)
For now, I’ve had to raise the OWASP Anomaly Score Threshold to 60 and enable the JS Challenge, but I believe something is wrong with the WAF after today’s outage.
This issue is still not resolved as of this moment.
Every change is a deployment, even if it's config. Treat it as such.
Also you should know that a strongly typed language won't save you from every type of problem. And especially not if you allow things like unwrap().
It is just mind boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests then at least fuzz it.
Yes, this is the second time in a month. Were folks expecting that to have been enough time for them to have made sweeping technical and organizational changes? I say no—this doesn't mean they aren't trying or haven't learned any lessons from the last outage. It's a bit too soon to say that.
I see this event primarily as another example of the #1 class of major outages: bad rapid global configuration change. (The last CloudFlare outage was too, but I'm not just talking about CloudFlare. Google has had many, many such outages. There was an inexplicable multi-year gap between recognizing this and having a good, widely available staged config rollout system for teams to drop into their systems.) Stuff like DoS attack configurations needs to roll out globally quickly. But they really need to make it not quite this quick. Imagine they deployed to one server for one minute, one region for one minute on success, then everywhere on success. Then this would have been a tiny blip rather than a huge deal.
(It can be a bit hard to define "success" when you're doing something like blocking bad requests that may even be a majority of traffic during a DDoS attack, but noticing 100% 5xx errors for 38% of your users due to a parsing bug is doable!)
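In concrete terms, the canary-then-region-then-global idea might look something like this (a minimal Lua sketch; push_config, http_5xx_rate, and baseline_5xx_rate are invented helper names, not real APIs):

-- Sketch of a canary-style rollout: one server, then one region, then global,
-- aborting and reverting if the 5xx rate regresses against a baseline.
local stages = { "single-server", "single-region", "global" }
local soak_seconds = 60

for _, scope in ipairs(stages) do
    push_config(scope, new_config)            -- invented deploy helper
    os.execute("sleep " .. soak_seconds)      -- soak for one minute per stage
    if http_5xx_rate(scope) > 2 * baseline_5xx_rate(scope) then
        push_config(scope, previous_config)   -- revert the affected scope
        error("rollout aborted at stage: " .. scope)
    end
end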
As for the specific bug: meh. They should have had 100% branch coverage on something as critical (and likely small) as the parsing for this config. Arguably a statically typed language would have helped (but the `.unwrap()` error in the previous outage is a bit of a counterargument to that). But it just wouldn't have mattered that much if they caught it before global rollout.
https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-...
So they are aware of some basic mitigation tactics guarding against errors
> This system does not perform gradual rollouts,
They just choose to YOLO
> Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”,
> However, we have never before applied a killswitch to a rule with an action of “execute”.
Do they do no testing? This wouldn't even require fuzzing with “infinite” variations; the set of actions is a short, finite list.
> existed undetected for many years. This type of code error is prevented by languages with strong type systems.
So this solution is also well known, just ignored for years, because "if it’s not broken, don’t fix it?", right?
- prioritize security: get patches ASAP
- prioritize availability: get patches after a cooldown period
Because ultimately, it's a tradeoff that cannot be handled by Cloudflare. It depends on your business, your threat model.
At some point they'll have to admit this React thing ain't working and just use classic server-rendered pages, since their dashboards are simple toggle controls.
If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?
I think the parent post made a different argument:
- Centralizing most of the dependency on Cloudflare results in a major outage when something happens at Cloudflare, it is fragile because Cloudflare becomes the single point of failure. Like: Oh Cloudflare is down... oh, none of my SaaS services work anymore.
- In a world where this is not the case, we might see more outages, but they would be smaller and more contained. Like: oh, Figma is down? Fine, let me pick up another task and come back to Figma once it's back up. It's also easier to work around by having alternative providers as a fallback, as they are less likely to share the same failure point.
As a result, I don't think you'll be blocked 100 hours a year in scenario 2. You may observe 100 non-blocking inconveniences per year, vs a completely blocking Cloudflare outage.
And in observed uptime, I'm not even sure these providers ever won. We're running all our auxiliary services on a decent Hetzner box with a LB. Say what you want, but that uptime is looking pretty good compared to any services relying on AWS (Oct 20, 15 hours), Cloudflare (Dec 5 (half hour), Nov 18 (3 hours)). Easier to reason about as well. Our clients are much more forgiving when we go down due to Azure/GCP/AWS/Cloudflare vs our own setup though...
The point is that it doesn’t matter. A single site going down has a very small chance of impacting a large number of users. Cloudflare going down breaks an appreciable portion of the internet.
If Jim’s Big Blog only maintains 95% uptime, most people won’t care. If BofA were at 95%.. actually same. Most of the world aren’t BofA customers.
If Cloudflare is at 99.95% then the world suffers
Putting Cloudflare in front of a site doesn't mean that site's backend suddenly never goes down. Availability will now be worse - you'll have Cloudflare outages* affecting all the sites they proxy for, along with individual site back-end failures which will of course still happen.
* which are still pretty rare
I’m tired of this sentiment. Imagine if people said, why develop your own cloud offering? Can you really do better than VMWare..?
Innovation in technology has only happened because people dared to do better, rather than giving up before they started…
The problem with pursuing efficiency as the primary value prop is that you will necessarily end up with a brittle result.
The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.
They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me that sounds like there's more to the story; this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
After rolling out a bad ruleset update, they tried a killswitch (rolled out immediately to 100%) which was a code path never executed before:
> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset
> a straightforward error in the code, which had existed undetected for many years
I have mixed feelings about this.
On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.
On the other hand, this is exactly what CloudFlare is good for: protecting sites from malicious requests.
That being said, I think it’s worth a discussion. How much of the last 3 outages was because of JGC (the former CTO) retiring and Dane taking over?
Did JGC have a steady hand that’s missing? Or was it just time for outages that would have happened anyway?
Dane has maintained a culture of transparency which is fantastic, but did something get injected in the culture leading towards these issues? Will it become more or less stable since JGC left?
Curious for anyone with some insight or opinions.
(Also, if it wasn’t clear - huge Cloudflare fan and sending lots of good vibes to the team)
Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?
It still surprises me that there are basically no free alternatives comparable to Cloudflare. Putting everything on CF creates a pretty serious single point of failure.
It's strange that in most industries you have at least two major players, like Coke vs. Pepsi or Nike vs. Adidas. But in the CDN/edge space, there doesn't seem to be a real free competitor that matches Cloudflare's feature set.
It feels very unhealthy for the ecosystem. Does anyone know why this is the case?
Benefit: Earliest uptake of new features and security patches.
Drawback: Higher risk of outages.
I think this should be possible since they already differentiate between free, pro and enterprise accounts. I do not know how the routing for that works but I bet they could do this. Think crowd-sourced beta testers. Also a perk for anything PCI audit or FEDRAMP security prioritized over uptime.
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kind of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.
Say what you want, but I'd prefer to trust CloudFlare who admits and act upon their fuckups, rather than trying to cover them up or downplaying them like some other major cloud providers.
@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems
Now half of the global economy seems to run on the same service provider…
But the distributed system is vulnerable to DDOS.
Is there an architecture that maintains the advantages of both systems? (Distributed resilience with a high-volume failsafe.)
My answer would be that no one product should get this big.
Good to know I'm not the only one
The pay-per-crawl thing is about them thinking ahead to post-AI business/revenue models.
The way AI happened, it removed a big chunk of revenue from news companies, blogs, etc. Because lots of people go to AI instead of reaching the actual 3rd party website.
AI currently gets the content for free from the 3rd party websites, but they have revenue from their users.
So Cloudflare is proposing that AI companies should be paying for their crawling. Cloudflare's solution would give the lost revenue back where it belongs, just through a different mechanism.
The ugly side of the story is that an open-source solution for this already existed, called L402 (L402.org).
Cloudflare wants to be the first to take a piece of the pie, but instead of using the open-source version, they forked it internally and published it as their own, Cloudflare-specific service.
To be completely fair, L402 requires you to solve the payment mechanism yourself, which for Cloudflare is easy because they already deal with payments.
But we run software and configuration changes through three tiers - first stage for the dev-team only, second stage with internal customers and other teams depending on it for integration and internal usage -- and finally production. Some teams have also split production into different rings depending on the criticality of the customers and the number of customers.
This has led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and testing with no or little problem. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong deployment in dev and test just because. If you want to, just go through it in minutes.
But after a few spicy production incidents, even the better and faster teams understood and accepted that once technical velocity exists, actual velocity is a choice, or a throttle if you want an analogy.
If you do good, by all means, promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down, spend more resources on manual testing and improving automated testing, give changes time to simmer in the internally productive environment, spend more time between promotions from production ring to production ring.
And this is on top of considerations of e.g. change risk. Some frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart, and the other could be a multi-hour recovery from backups.
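To make the "velocity is a choice, or a throttle" idea concrete, here is a minimal sketch (ring names, soak times, and the incident penalty are all invented for illustration, not any real policy):

-- Sketch: promotion soak times per ring, stretched for teams that recently
-- caused production incidents ("actual velocity is a choice, or a throttle").
local rings = {
    { name = "dev",         soak_minutes = 15  },
    { name = "internal",    soak_minutes = 120 },
    { name = "prod-ring-1", soak_minutes = 240 },
    { name = "prod-ring-2", soak_minutes = 480 },
}

local function soak_for(ring, team)
    local throttle = 1 + (team.recent_incidents or 0)
    return ring.soak_minutes * throttle
end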
One of the items in the list of procedures is to use types to encode rules of your system.
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
Note that the two deployments were of different components.
Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.
There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.
Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.
https://www.henricodolfing.ch/case-study-4-the-440-million-s...
The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).
One might think a company on the scale of Cloudflare would have a suite of comprehensive tests to cover various scenarios.
Kind of funny that we get something showing the benefits of Rust so soon after everyone was ragging on it about unwrap anyway!
As with any organisation where the CTO is not technical, there will be someone who the 'CTO' has to ask to understand technical situations. In my opinion, that person being asked is the real CTO, for any given situation.
Since attackers might rotate IPs more frequently than once per minute, this effectively means that the whole fleet of servers should be able to quickly react depending on the decisions done centrally.
i.e. it's the difference between "site goes down for a few hours every few months" and "an attacker slammed your site and, through on-demand scaling or serverless cloud fees, blew your entire infrastructure budget for the year."
Doubly so when your service is part of a larger platform and attacks on your service risk harming your reputation for the larger platform.
Another suggestion is to do deployments during the night shift of each country; right now they only take the US night into account.
Once I worked with a team in the anti-abuse space where the policy was that code deployments must happen over 5 days while config updates can take a few minutes. Then an engineer on the team argued that deploying new Python code doesn't count as a code change because the CPython interpreter did not change; it didn't even restart. And indeed, given how dynamic Python is, it is totally possible to import new Python modules that did not exist when the interpreter process was launched.
Cloudflare is down and hundreds of well paid engineers spring into action to resolve the issue. Your server goes down and you can’t get ahold of your Server Person because they’re at a cabin deep in the woods.
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production but surely they don't
What’s more concerning to me is that now we’ve had AWS, Azure, and CloudFlare (and CloudFlare twice) go down recently. My gut says:
1. developers and IT are using LLMs in some part of the process, which will not be 100% reliable.
2. Current culture of I have (some personal activity or problem) or we don’t have staff, AI will replace me, f-this.
3. Pandemic after effects.
4. Political climate / war / drugs; all are intermingled.
This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.
Their timeline:
> 08:47: Configuration change deployed and propagated to the network
> 08:48: Change fully propagated
> 08:50: Automated alerts
> 09:11: Configuration change reverted and propagation start
> 09:12: Revert fully propagated, all traffic restored
2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.
Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.
How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?
Yes, as they explain it's the rollback that was triggered due to seeing these errors that broke stuff.
I feel like the cloud hosting companies have lost the plot. "They can provide better uptime than us" is the entire rationale that a lot of small companies have when choosing to run everything in the cloud.
If they cost more AND they're less reliable, what exactly is the reason to not self host?
I sometimes fancy that I could just take cash, go into the wood, build a small solar array, collect & cleanse river water, and buy a starlink console.
Neither will seatbelts if you drive into the ocean, or helmets if you drink poison. I'm not sure what your point is.
Ouch. Harsh, given that Cloudflare is being over-honest (down to admitting they disabled the internal tool) and the outage's relatively limited impact (time-wise and in number of customers). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 it's Lua's turn with its dynamic typing.
Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...
cf TFA:
if rule_result.action == "execute" then
rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
[0] https://news.ycombinator.com/item?id=44159166
The consequence of some services being offline is much, much worse than a person (or a billion) being bored in front of a screen.
Sure, it’s arguably not Cloudflares fault that these services are cloud-dependent in the first place, but even if service just degrades somewhat gracefully in an ideal case, that’s a lot of global clustering of a lot of exceptional system behavior.
Or another analogy: every person probably passes out for a few minutes in their life at one point or another. Yet I wouldn’t want to imagine what happens if everybody got that over with at the very same time, without warning…
Not really, they're just lying. I mean yes, of course they aren't oracles who discover complex problems in the instant of the first failure, but they know full well when there are problems and significantly underreport them, to the extent that they are less "smoke alarms" and more "your house has burned down and the ashes are still smoldering" alarms. Incidents are intentionally underreported. It's bad enough that there ought to be legislation and civil penalties for the large providers who fail to report known issues promptly.
For a start-up it's much easier to just pay the Cloud tax than it is to hire people with the appropriate skill sets to manage hardware or to front the cost.
Larger companies on the other hand? Yeah, I don't see the reason to not self host.
Shifting liability. You're paying someone else for it to be their problem, and if everyone does it, no one will take flak for continuing to do so. What is the average tenure of a CIO or decision maker electing to move to or remain at a cloud provider? This is why you get picked to talk on stage at cloud provider conferences.
(have been in the meetings where these decisions are made)
https://www.cloudflare.com/careers/jobs/?department=Engineer...
However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".
I won't say never, but a situation where the right answer to avoid a rollback (that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit
During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in "roll back able", "graceful fail", "fail open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.
I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
If there’s indeed a 5 min lag in monitoring dashboard in Cloudflare, I honestly think that's a pretty big concern.
For example, a simple curl script on your top 100 customers' homepage that runs every 30 seconds would have given the warning and notifications within a minute. If you stagger deployments at 5 minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.
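For the sake of illustration, that kind of probe could be as simple as the following (a rough Lua sketch using LuaSocket; the URLs and the alerting are placeholders, and real use would want HTTPS via LuaSec, retries, and paging):

-- Sketch of a synthetic probe: poll a handful of customer pages and flag
-- 5xx responses (or connection failures) every 30 seconds.
local http = require("socket.http")

local probes = {
    "http://customer-one.example/",
    "http://customer-two.example/",
}

while true do
    for _, url in ipairs(probes) do
        local body, code = http.request(url)
        if body == nil or (tonumber(code) or 0) >= 500 then
            print(("ALERT: %s returned %s"):format(url, tostring(code)))
        end
    end
    os.execute("sleep 30")
end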
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective and it's starting to look like a liability for large numbers of people there are obvious solutions for that.
Just speculating based on my experience: it's more likely than not that they refused to invest in fail-safe architectures for cost reasons. Control plane and data plane should be separate; a React patch shouldn't affect traffic forwarding.
Forget manual rollbacks, there should be automated reversion to a known working state.
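As a minimal sketch of that idea (parse_config, validate_config, and log_error are hypothetical helpers; this illustrates the fail-open / known-good pattern, not anyone's actual code):

-- Sketch of fail-open handling: if a new config can't be parsed or validated,
-- log and keep serving with the last known-good config instead of erroring.
local ok, candidate = pcall(parse_config, raw_config)   -- hypothetical parser
if ok and validate_config(candidate) then               -- hypothetical validator
    active_config = candidate
else
    log_error("config rejected; keeping last known-good configuration")
    -- active_config is intentionally left unchanged; traffic keeps flowing
end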
Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.
I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.
It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)
It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.
"Damned if they do, damned if they don't" kind of situation.
There are even lints for the usage of the `unwrap` and `expect` functions.
As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.
Perhaps it's the similar way of not testing the possible error path, which is an organizational problem.
They are probably OK with occasional breaks as long as customers don't mind.
Actual deployments take hours to propagate worldwide.
(Disclosure: former Cloudflare SRE)
The latter is easier to handle, easier to fix, and much more survivable if you do fuck it up a bit. It gives you some leeway to learn from mistakes.
If you make a mistake during the 1000 dog siege, or if you don't have enough guards on standby and ready to go just in case of this rare event, you're just cooked.
It's still a bit silly though, their claimed reasoning probably doesn't really stack up for most of their config changes - I don't see it to be that likely that a 0.1->1->10->100 rollout over the period of 10 minutes would be a catastrophically bad idea for them for _most_ changes.
And to their credit, it does seem they want to change that.
You can easily block ChatGPT and most other AI scrapers if you want:
Which makes it feel that much more special when a service provides open access to all of the infrastructure diagnostics, like e.g. https://status.ppy.sh/
I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.
Vast majority of current monitoring systems are built on Prometheus. (Well okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high" frequency online service monitoring system does a scrape every 30 seconds. Well known reliability engineering practices state that you need a minimum of two consecutive telemetry points to detect any given event - because we're talking about a distributed system and network is not a reliable transport. That in turn means that with near-perfect reliability the maximum time window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will show up only after a delay of just-a-hair-shy-of-three scraping cycle windows.
At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.
As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant the earliest an alert could fire for something failing was effectively 30 seconds after the failure manifested.
Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
I know when I need to reset the clock on my microwave oven.
To me this reads as a form of misdirection, intentional or not. A monopolist has little reason to care about downstream effects, since customers have nowhere else to turn. Framing this as roll your own versus Cloudflare rather than as a monoculture CDN environment versus a diverse CDN ecosystem feels off.
That said, the core problem is not the monopoly itself but its enablers, the collective impulse to align with whatever the group is already doing, the desire to belong and appear to act the "right way", meaning in the way everyone else behaves. There are a gazillion ways of doing CDN, why are we not doing them? Why the focus on one single dominant player?
I don't think this is an entropy issue; it's human error bubbling up, and Cloudflare charges a premium for it.
My faith in Cloudflare is shook for sure. Two major outages weeks apart, and this won't be the last.
Super-procedural code in particular is too complex for humans to follow, much less AI.
Software development is a rare exception to this. We’re often writing from scratch (same with designers, and some other creatives). But these are definitely the exception compared to the broader workforce.
Same concept applies for any app that’s built on top of multiple third-party vendors (increasingly common for critical dependencies of SaaS)
A key part of secure systems is availability...
It really looks like vibe-coding.
Every company that has ignored my following advice has experienced a day-for-day slip in first-quarter scheduling. And that advice is: not much work gets done between Dec 15 and Jan 15. You can rely on a week's worth; more than that is optimistic. People are taking it easy, and they need to verify things with someone who is on vacation, so they are blocked. And when that person gets back, it's two days until their vacation, so it's a crap shoot.
NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.
This is far too dismissive of how disruptive the downtime can be and it sets the bar way too low for a company so deeply entangled in global internet infrastructure.
I don’t think you can make such an assertion with any degree of credibility.
This reads like sarcasm, but I guess it is not. Yes, you are a CDN, a major one at that. 30 minutes of downtime or "whatever" is not acceptable. I worked on traffic teams at social networks that considered themselves that mission-critical. CF is absolutely that critical, and there are definitely lives at stake.
2025-12-05

On December 5, 2025, at 08:47 UTC (all times in this blog are UTC), a portion of Cloudflare’s network began experiencing significant failures. The incident was resolved at 09:12 (~25 minutes total impact), when all services were fully restored.
A subset of customers were impacted, accounting for approximately 28% of all HTTP traffic served by Cloudflare. Several factors needed to combine for an individual customer to be affected as described below.
The issue was not caused, directly or indirectly, by a cyber attack on Cloudflare’s systems or malicious activity of any kind. Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
Any outage of our systems is unacceptable, and we know we have let the Internet down again following the incident on November 18. We will be publishing details next week about the work we are doing to stop these types of incidents from occurring.
The graph below shows HTTP 500 errors served by our network during the incident timeframe (red line at the bottom), compared to unaffected total Cloudflare traffic (green line at the top).

Cloudflare's Web Application Firewall (WAF) provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB.
As part of our ongoing work to protect customers who use React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.
This first change was being rolled out using our gradual deployment system. During rollout, we noticed that our internal WAF testing tool did not support the increased buffer size. As this internal test tool was not needed at that time and had no effect on customer traffic, we made a second change to turn it off.
This second change of turning off our WAF testing tool was implemented using our global configuration system. This system does not perform gradual rollouts, but rather propagates changes within seconds to the entire fleet of servers in our network and is under review following the outage we experienced on November 18.
Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.
As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following Lua exception:
[lua] Failed to run module rulesets callback late_routing: /usr/local/nginx-fl/lua/modules/init.lua:314: attempt to index field 'execute' (a nil value)
resulting in HTTP code 500 errors being issued.
The issue was identified shortly after the change was applied, and was reverted at 09:12, after which all traffic was served correctly.
Customers that have their web assets served by our older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.
Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.
Cloudflare’s rulesets system consists of sets of rules which are evaluated for each request entering our system. A rule consists of a filter, which selects some traffic, and an action which applies an effect to that traffic. Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”, which is used to trigger evaluation of another ruleset.
Our internal logging system uses this feature to evaluate new rules before we make them available to the public. A top level ruleset will execute another ruleset containing test rules. It was these test rules that we were attempting to disable.
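For illustration only (this is not Cloudflare's actual data model, and the filter syntax is only loosely modeled on rule expressions), a ruleset with an "execute" action can be pictured roughly like this:

-- Illustrative shape only: each rule pairs a filter with an action, and an
-- "execute" action points at another ruleset (here, internal test rules).
local managed_ruleset = {
    { id = "r1", filter = 'http.request.uri.path contains "/admin"', action = "block" },
    { id = "r2", filter = "cf.waf.score lt 20",                      action = "log"   },
    { id = "r3", filter = "true",                                    action = "execute",
      execute = { ruleset = "waf-internal-test-rules" } },
}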
We have a killswitch subsystem as part of the rulesets system which is intended to allow a rule which is misbehaving to be disabled quickly. This killswitch system receives information from our global configuration system mentioned in the prior sections. We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident.
However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset:
if rule_result.action == "execute" then
rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
This code expects that, if the ruleset has action=”execute”, the “rule_result.execute” object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua returned an error due to attempting to look up a value in a nil value.
This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.
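For comparison, a minimal defensive guard in the Lua version would look something like this (a sketch of the class of fix, not necessarily the change Cloudflare shipped):

-- Sketch of a defensive guard: only index rule_result.execute when the rule
-- was actually evaluated and the table exists (it is nil when killswitched).
if rule_result.action == "execute" and rule_result.execute ~= nil then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end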
We made an unrelated change that caused a similar, longer availability incident two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base.
We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:
Enhanced Rollouts & Versioning: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.
Streamlined break glass capabilities: Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.
"Fail-Open" Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.
Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.
These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours. On behalf of the team at Cloudflare we want to apologize for the impact and pain this has caused again to our customers and the Internet as a whole.
Time (UTC) | Status | Description |
08:47 | INCIDENT start | Configuration change deployed and propagated to the network |
08:48 | Full impact | Change fully propagated |
08:50 | INCIDENT declared | Automated alerts |
09:11 | Change reverted | Configuration change reverted and propagation start |
09:12 | INCIDENT end | Revert fully propagated, all traffic restored |
I don’t know the answer to all the questions. But here I think it is just a way to avoid responsibility. If someone chooses CDN “number 3” and it goes down, business people *might* put the blame on this person for not choosing “the best”. I am not saying it is the right approach; I have just seen it happen too many times.
I hope that was their #1 priority from the very start given the services they sell...
Anyway, people always tend to overthink those black-swan events. Yes, 2 happened in quick succession, but what is the average frequency overall? Insignificant.
It's like saying that Chipotle having X% chance of tainted food is worse than local burrito places having 2*X% chance of tainted food. It's true in the lens that each individual event affects more people, but if you removed that Chipotle and replaced with all local, the total amount of illness is still strictly higher, it's just tons of small events that are harder to write news articles about.
There are likely emergency services dependent on Cloudflare at this point, so I’m only semi serious.
How? If Github is down how many people are affected? Google?
> Jim’s Big Blog only maintains 95% uptime, most people won’t care
Yeah, and in the world with Cloudflare people don't care if Jim's Blog is down either. So Cloudflare doesn't make things worse.
If the world suffers, those doing the "suffering" need to push that complaint/cost back up the chain - to the website operator, which would push the complaint/cost up to cloudflare.
The fact that nobody did - or just verbally complained without action - is evidence that they didn't really suffer.
In the meantime, BofA saved the cost of achieving 99.95% uptime themselves (presumably cloudflare does it cheaper than they could individually). So the entire system became more efficient as a result.
This is a simplistic opinion. Claiming services like Cloudflare are modeled as single points of failure is like complaining that your use of electricity to power servers is a single point of failure. Cloudflare sells a global network of highly reliable edge servers running services like caching, firewall, image processing, etc., and, more importantly, a global firewall that protects services against global distributed attacks. Until a couple of months ago, it was unthinkable to casual observers that Cloudflare was such an utterly unreliable mess.
If so, is it a good or bad trade to have more overall uptime but when things go down it all goes down together?
Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...
If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.
robots.txt isn't even respected by all of the American companies. Chinese ones (which often also use what are essentially botnets in Latin American and the rest of the world to evade detection) certainly don't care about anything short of dropping their packets.
Sorry but that’s a method you use if you serve 100 requests per second, not when you are at Cloudflare scale. Cloudflare easily have big enough volume that this problem would trigger an instant change in a monitorable failure rate.
Prometheus has an unaddressed flaw [0] where rate functions must span at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won’t reflect the change until a minute after.
If your target is security, then _assuming your patch is actually valid_ you're giving better security coverage to free customers than to your paying ones.
Cloudflare is both, and their tradeoffs seem to be set on maximizing security at cost of availability. And it makes sense. A fully unavailable system is perfectly secure.
Disclosure: I work at Cloudflare, but not on the WAF
And on top of that, Cloudflare's value proposition is "we're smart enough to know that instantaneous global deployments are a bad idea, so trust us to manage services for you so you don't have to rely on in house folks who might not know better"
There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t yet finished the progressive rollout implementation.
That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.
With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough you've got enough developers that half a dozen PRs will have been merged since the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?
Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system in a novel state. You're rolling forwards, not backwards.
If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.
They are separate.
> a react patch shouldn't affect traffic forwarding.
If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?
This was a configuration change to increase the buffered body size from 128 KB to 1 MB.
The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
Give me a break.
Though this is showing the problem with these things: Migrating faster could have reduced the impact of this outage, while increasing the impact of the last outage. Migrating slower could have reduced the impact of the last outage, while increasing the impact of this outage.
This is a hard problem: How fast do you rip old working infrastructure out and risk finding new problems in the new stack, yet, how long do you tolerate shortcomings of the old stack that caused you to build the new stack?
Looking across the errors, it points to some underlying practices: a lack of systems metaphors, modularity, and testability, and a reliance on super-generic configuration instead of software with enforced semantics.
Just because CF is up doesn't mean the site is.
The world can also live a few hours without sewers, water supply, food, cars, air travel, etc.
But "can" and "should" are different words.
What an utterly clueless claim. You're literally posting in a thread with nearly 500 posts of people complaining. Taking action takes time. A business just doesn't switch cloud providers overnight.
I can tell you in no uncertain terms that there are businesses impacted by Cloudflare's frequent outages that started work shedding their dependency on Cloudflare's services. And it's not just because of these outages.
It may have been unthinkable to some casual observers that creating a giant single point of failure for the internet was a bad idea but it was entirely thinkable to others.
Also, if you need to switchover to backup systems for everything at once, then either the backup has to be the same for everything and very easily implementable remotely - which to me seems unlikely for specialty systems, like hospital systems, or for the old tech that so many organizations still rely on (and remember the CrowdStrike BSODs that had to be fixed individually and in person and so took forever to fix?) - or you're gonna need a LOT of well-trained IT people, paid to be on standby constantly, if you want to fix the problems quickly, on account of they can't be everywhere at once.
If the problems are more spread out over time, then you don't need to have quite so many IT people constantly on standby. Saves a lot of $$$, I'd think.
And if problems are smaller and more spread out over time, then an organization can learn how to deal with them regularly, as opposed to potentially beginning to feel and behave as though the problem will never actually happen. And if they DO fuck up their preparedness/response, the consequences are likely less severe.
You are responsible for your dependencies, unless they are specific integrations. Either switch to more reliable dependencies or add redundancy so that you can switch between providers when any one is down.
At scale there's no such thing as "instant". There is distribution of progress over time.
The failure is an event. Collection of events takes time (at scale, going through store-and-forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: non-zero possibility of system collapse. Bad: can progress very quickly towards global outages. Good: faults are detected quickly, response decision-making is easy, and mitigation is fast.
Hyperscale is not just "a very large number of small simple systems".
Denoising alerts is a fact of life for SRE...and is a survival skill.
Most scaled analysis systems provide precise control over the type of aggregation used within the analyzed time slices. There are many possibilities, and different purposes for each.
High frequency events are often collected into distributions and the individual timestamps are thrown away.
Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?
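A one-box style rollout with a baking period is not conceptually complicated. A rough sketch follows, with the stage names, health signal, and thresholds all assumed rather than taken from any real system:

```python
import time

STAGES = ["one-box", "one-datacenter", "one-region", "global"]  # illustrative stage names

def error_rate(stage):
    """Placeholder for a real health signal (e.g. 5xx rate) scoped to the stage."""
    return 0.0

def progressive_deploy(apply_change, revert_change,
                       bake_seconds=1800, max_error_rate=0.001):
    """Roll a change out stage by stage, letting each stage bake before widening it."""
    for stage in STAGES:
        apply_change(stage)
        deadline = time.time() + bake_seconds
        while time.time() < deadline:               # the baking period
            if error_rate(stage) > max_error_rate:
                revert_change()                     # roll back everywhere, immediately
                raise RuntimeError("rollout aborted at stage %r" % stage)
            time.sleep(30)
    print("rollout completed")
```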
Privately disclosed: Nov 29
Fix pushed: Dec 1
Publicly disclosed: Dec 3
This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation of a Nextjs vulnerability when my organization doesn't even own any React app, let alone a Nextjs one?
Also, specious reasoning #2: not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.
There is another name for rolling forward, it's called tripping up.
The short answer is "yes", due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone along the way, but it's possible. Or otherwise revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for the progressive rollout.
There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (Where that doesn't work is if CM previously created files, it won't delete them unless explicitly told to do so.)
This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.
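As a hedged sketch of that kind of containment (the rule interface and fallback policy are invented), each rule can be evaluated in isolation so a crashing rule is logged and skipped while cached rules and global policies keep providing shielding:

```python
import logging

def evaluate_rules(request, rules, global_policies):
    """Evaluate WAF-style rules so one faulty rule cannot take the request path down."""
    verdicts = []
    for rule in rules:
        try:
            verdicts.append(rule(request))          # each rule returns "block" or "allow"
        except Exception:
            # Contain the failure: log it, skip the rule, keep serving traffic.
            logging.exception("rule %r crashed; skipping", getattr(rule, "__name__", rule))
    # Cached/global policies still provide shielding even if some rules failed.
    verdicts.extend(policy(request) for policy in global_policies)
    return "block" if "block" in verdicts else "allow"
```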
You can't ask for Cloudflare to turn on a dime and implement this in this manner. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.
Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.
The fact that no major cloud provider is actually good is not an argument that Cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much they have to expect that people will be upset when they fail.
> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
> Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.
My takeaway is that the body parsing logic is in React or Next.js; is that incorrect? And the WAF rule testing tool (control plane) was interdependent with the WAF's body parsing logic; is that also incorrect?
> This was a configuration change to change the buffered size of a body from 256kb to 1mib.
Yes, and if it was resilient, the body parsing would be done on a discrete forwarding plane. Any config changes should be auto-tested for forwarding failures by the separate control plane and auto-reverted when there are errors. If the WAF rule testing tool was part of that test, then it being down shouldn't have affected the data plane, because it would be a separate system.
Data/control plane separation means the runtimes of the two, and any dependencies they have, are separate. It isn't cheap to do this right; that's why I speculated (and I made clear I was speculating) that it was because they wanted to save costs.
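A rough sketch of what "auto-tested and auto-reverted by a separate control plane" could look like, with the probe, thresholds, and watch window all assumptions rather than anything Cloudflare actually runs:

```python
import time

def data_plane_error_rate():
    """Placeholder: measure forwarding errors from the data plane's own metrics."""
    return 0.0

def apply_with_auto_revert(push_config, revert_config,
                           watch_seconds=300, max_error_rate=0.001):
    """Control-plane wrapper: push a config, watch the data plane, revert on regressions."""
    baseline = data_plane_error_rate()
    push_config()                      # e.g. raise the body buffer from 256 KiB to 1 MiB
    deadline = time.time() + watch_seconds
    while time.time() < deadline:
        if data_plane_error_rate() > max(baseline * 2, max_error_rate):
            revert_config()            # automatic rollback, no human in the loop
            return False
        time.sleep(10)
    return True
```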
> The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
Please tone down the rage a bit and leave room for some discussion. You should take your own pill and be curious about what I meant instead of taking a rage-first approach.
You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.
I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.
I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.
So at this point no, the world can most definitely not “just live without the Internet”. And emergency services aren’t the only important thing that exists to the extent that anything else can just be handwaved away.
It is far worse if all of the competitors are down at once. To some extent you can and should have a little bit of stock at home (water, food, medicine, ways to stay warm, etc) but not everything is practical to do so with (gasoline for example, which could have knock on effects on delivery of other goods).
Is it? I can’t say that my personal server has been (unplanned) down at any time in the past 10 years, and these global outages have just flown right past it.
Yes, there are various bots, and some of the large US companies such as Perplexity do indeed seem to be ignoring robots.txt.
Is that a problem? It's certainly not a problem with cpu or network bandwidth (it's very minimal). Yes, it may be an issue if you are concerned with scraping (which I'm not).
Cloudflare's "solution" is a much bigger problem that affects me multiple times daily (as a user of sites that use it), and those sites don't seem to need protection against scraping.
Obviously it depends on the bot, and you can't block the scammy ones. I was really just referring to the major legitimate companies (which might not include Perplexity).
This seems like an issue with the design of your status page. If the broken dependencies truly had a limited blast radius, that should've been able to be communicated in your indicators and statistics. If not, then the unreliable reputation was deserved, and all you did by removing the status page was hide it.
Users want to do things. If their goal depends on a complex chain of functions (provided by various semi-independent services), then the ideal setup would be to have redundant providers so users could simply "load balance" between them, and for the high-level providers' uptime states to be clustered separately (meaning that when Google is unavailable Bing is up, and when Random Site A goes down its payment provider goes down too, etc.).
So ideally sites would somehow sort themselves neatly into separate availability groups.
Otherwise simply having a lot of uncorrelated downtimes doesn't help (if we count the sum of downtime experienced by people). Though again it gets complicated by the downtime percentage, because there's likely a phase shift between the state where a user can mostly complete their goals and the state where they cannot, because too many failures cascade.
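As a toy illustration of that provider-level "load balancing" (the provider names and failure simulation are placeholders), a client can walk an ordered list and use the first provider that answers; this only helps if the providers' downtimes are uncorrelated, which is exactly the clustering described above.

```python
import random

PROVIDERS = ["provider-a.example", "provider-b.example"]  # hypothetical redundant services

def call(provider, request):
    """Placeholder for an actual RPC/HTTP call; randomly fails to simulate outages."""
    if random.random() < 0.3:
        raise ConnectionError("%s unavailable" % provider)
    return "%s handled %r" % (provider, request)

def call_with_failover(request):
    last_error = None
    for provider in PROVIDERS:
        try:
            return call(provider, request)
        except ConnectionError as exc:
            last_error = exc           # this provider is down; try the next one
    raise RuntimeError("all providers are down") from last_error

print(call_with_failover("checkout"))
```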
I won’t remember this block of code because five other people have touched it. So I need to be able to see what has changed and what it talks to, so I can quickly verify whether my old assumptions still hold true.
Military hardware is produced with engineering design practices that look nothing at all like what most of the HN crowd is used to. There is an extraordinary amount of documentation, requirements, and validation done for everything.
There is a MIL-SPEC for pop tarts which defines all part sizes, tolerances, etc.
Unlike a lot of the software world, military hardware gets DONE with design and then they just manufacture it.
They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.
Checklists work.
While we're here, any other Prometheus or Grafana advice is welcome.
I’m happy to see they’re changing their systems to fail open, which is one of the things I mentioned in the conversation about their last outage.
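For what failing open means concretely, a minimal sketch (not Cloudflare's code; the handler names are invented): if the protective check itself errors, the request is passed through rather than turned into a 500.

```python
def handle(request, waf_check, origin_fetch, fail_open=True):
    """Apply a WAF check in front of the origin, failing open if the check itself breaks.

    origin_fetch is expected to return a (status, body) tuple as well.
    """
    try:
        if waf_check(request) == "block":
            return 403, "blocked by WAF"
    except Exception:
        if not fail_open:
            return 500, "WAF unavailable"   # fail closed: a protection outage becomes an outage
        # fail open: lose the extra protection for a while, but keep serving traffic
    return origin_fetch(request)
```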
As a recovering devops/infra person from a lifetime ago (who has, much to my heartbreak, broken prod more than once), perhaps that is where my grace in this regard comes from. Systems and their components break, systems and processes are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, GitHub, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).
China-nexus cyber threat groups rapidly exploit React2Shell vulnerability (CVE-2025-55182) - https://aws.amazon.com/blogs/security/china-nexus-cyber-thre... - December 4th, 2025
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Could cloudflare do better? Sure, that’s a truism for everyone. Did they make mistakes and continue to make mistakes? Also a truism.
Trust me, they are acutely aware of people getting upset when they fail. Why do you think their CEO and CTO are writing these blog posts?
1. There is an active vulnerability unrelated to Cloudflare where React/Next.JS can be abused via a malicious payload. The payload could be up to 1MB.
2. Cloudflare's buffer size wasn't large enough to prevent that payload from being passed through to the Cloudflare customer.
3. To protect their customers, Cloudflare wanted to increase the buffer size to 1MB (see the sketch after this list).
4. The Internal Testing Tool wasn't able to handle the change to 1MB and started failing.
5. They wanted to stop the Internal Testing Tool from failing, but doing so required disabling a ruleset that an existing system depended on (due to a long-standing bug). This caused the wider incident.
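To make the buffer-size step concrete (the 256 KB and 1 MB figures come from the summary above; everything else is invented), a payload whose malicious part sits past the 256 KB mark is simply never seen by the rules until the buffer is enlarged:

```python
OLD_BUFFER = 256 * 1024        # 256 KiB, the old inspection limit
NEW_BUFFER = 1 * 1024 * 1024   # 1 MiB, the new limit

def inspected_portion(body, buffer_size):
    # Only the buffered prefix of the body is visible to the WAF rules.
    return body[:buffer_size]

# Hypothetical payload: junk padding followed by a malicious marker near the 1 MB mark.
payload = b"A" * (900 * 1024) + b"<malicious-deserialization-gadget>"

print(b"<malicious" in inspected_portion(payload, OLD_BUFFER))  # False: slips past the rules
print(b"<malicious" in inspected_portion(payload, NEW_BUFFER))  # True: now inspected
```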
It does seem like a mess, in the sense that in order to stop the internal testing tool from failing they had to endanger things globally in production, yes. It looks like a legacy, tech-debt mess.
It seems like the result of bad decisions made in the past, though.
I think the question then is how much of the Internet has fungible alternatives, such that uncorrelated downtime can meaningfully have less impact. If you have a "to buy" shopping list, the existence of alternative shopping list products doesn't help you; when the one you use is down, it's just down, and the substitutes cannot substitute on short notice. Obviously for some things there are clear substitutes, but actually I think "has fungible alternatives" is mostly correlated with "being down for 30 minutes doesn't matter"; it seems that the things where you want the one specific site are the ones where availability matters more.
It's never right to leave structural issues even if "they don't happen under normal conditions".
Not all sites can have full caching, we've tried.
I equate it to driving. I'd rather be moving at a normal speed on side streets than sitting in traffic on the expressway, even if the expressway is technically faster.
(and also, rolling your own version of WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical devices company will beat CF at this game.)
It's fine to be upset, and especially rightfully so after the second outage in less than 30 days, but this doesn't justify toxicity.
It might be more maintainable to have leaks instead of elaborate destruction routines, because then you only have to consider the costs of allocations.
Java has a no-op garbage collector (Epsilon GC) for the same reason. If your financial application really needs good performance at any cost and you don't want to rewrite it, you can throw money at the problem to make it go away.
What works much better is having an intentional review step that you come back to.
If there is a memory leak, then this is a flaw. It might not matter so much for a specific product, but I can also easily see it being forgotten if it was only mentioned somewhere in the documentation, maybe not clearly enough, and deadlines and the stress to ship are a thing there as well.
https://www.ailawandpolicy.com/2025/10/anti-circumvention-re...
I think your take is terribly simplistic. In a professional setting, virtually all engineers have no say on whether the company switches platforms or providers. Their responsibility is to maintain and develop services that support business. The call to switch a provider is ultimately a business and strategic call, and is a subject that has extremely high inertia. You hired people specialized in technologies, and now you're just dumping all that investment? Not to mention contracts. Think about the problem this creates.
Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.
https://www.csoonline.com/article/3814810/backdoor-in-chines...
Most hospital and healthcare IT teams are extremely underfunded, undertrained, and overworked, and the software, configurations, and platforms are normally not the most resilient things.
I have a friend at one in the North East right now going through a hell of a security breach for multiple months now and I'm flabbergasted no one is dead yet.
When it comes to tech, I get the impression most organizations are not very "healthy" in the durability of systems.
In this particular case, they seem to be doing two things:
- Phasing out the old proxy (Lua based), which is replaced by FL2 (Rust based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules
and they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue.
A better analogy is that if the restaurant you'll be going to is unexpectedly closed for a little while, you would do an after-dinner errand before dinner instead and then visit the restaurant a bit later. If the problem affects both businesses (like a utility power outage) you're stuck, but you can simply rearrange your schedule if problems are local and uncorrelated.
At some point you have to admit that humans are pretty bad at some things. Keeping documentation up to date and coherent is one of those things, especially in the age of TikTok.
Better to live in the world we have and do the best you can, than to endlessly argue about how things should be but never will become.
We already started looking into moving away from Zoom; I suggested self-hosting http://jitsi.org. Based on their docs, self-hosting is well supported, and a $50-$100 server is probably more than enough, so a lot cheaper than Zoom.
Vendor contracts have 1-3 year terms. We (a financial services firm) re-evaluate tech vendors every year for potential replacement and technologists have direct input into these processes. I understand others may operate under a different vendor strategy. As a vendor customer, your choices are to remain a customer or to leave and find another vendor. These are not feelings, these are facts. If you are unhappy but choose not to leave a vendor, that is a choice, but it is your choice to make, and unless you are a large enough customer that you have leverage over the vendor, these are your only options.
Shouldn't grey beards, grizzled by years of practicing rigorous engineering, be passing this knowledge on to the next generation? How did they learn it when just starting out? They weren't born with it. Maybe engineering has actually improved so much that we only experience outages this frequently, and such feelings of nostalgia are born from never having had to deal with systems with such high degrees of complexity and, realistically, 100% availability expectations on a global scale.
If a missile passes the long hurdles and hoops built into modern Defence T&E procurement it will only ever be considered out of spec once it fails.
For a good portion of platforms they will go into service, be used for a decade or longer, and not once will the design be modified before going end of life and replaced.
If you wanted to progressively iterate or improve on these platforms, then yes continual updates and investing in the eradication of tech debt is well worth the cost.
If you're strapping explosives attached to a rocket engine to your vehicle and pointing it at someone, there is merit in knowing it will behave exactly the same way it has done the past 1000 times.
Neither ethos in modifying a system is necessarily wrong, but you do have to choose which you're going with, and what the merits and drawbacks of that are.
The amount of dedication and meticulous, concentrated work I knew from older engineers when I started working, and that I remember from my grandfathers, is something I very rarely observe these days. Neither in engineering-specific fields nor in general.
Now, there can be tens of thousands of similar considerations to document. And keeping up that documentation with the actual state of the world is a full time job in itself.
You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally out of date and not worth relying on, because it's actively steering you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as it is practiced today. I wish it was, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.
We "should" be nice to each other, I "should" make more money, and it "should" be sunny more often. And we "should" have well written, accurate and reliable docs, but I'm too old to be waiting around for that day to come, especially in the age of zero attention and AI generated shite.
A lot of people are angry about this, and I think it's borderline illegal: https://devforum.zoom.us/t/you-have-exceeded-the-limit-of-li...
You pay for something, and you can't use it.
Of course, it's also possible you signed a contract that basically says "we can just decide not to work and you can't do anything about it" in which case, sucks, and fire whoever negotiates your B2B contracts. But also, those clauses can be void if the violation is serious enough.
What I don't like, is that whenever you contact Zoom, their representatives are taught to say one thing: buy more licenses.
Not only that, but their API/pricing is specifically designed to cover edge-cases that will force you to buy a license.
For example, they don't expose an API to assign a co-host. You can do that via the UI, manually, but not via the API.
Can you share which solution you are moving to?