Introduction
On November 18, 2025, at 11:20 UTC, Cloudflare's network experienced a major outage that prevented the delivery of core network traffic for several hours. The incident, Cloudflare's worst since 2019, affected millions of websites and services relying on its CDN infrastructure. Users attempting to access customer sites received HTTP 5xx error pages indicating an internal failure in Cloudflare's network.
The root cause was not a cyber attack or malicious activity, but an error in the permissions management of a ClickHouse database used by the Bot Management system. The error produced a machine learning feature configuration file roughly twice its normal size; once propagated to every server in the network, the oversized file caused the traffic-routing software to crash.
Timeline of the Incident
The outage began at 11:20 UTC when the volume of HTTP 5xx error codes increased dramatically from normal baseline levels. The anomalous system behavior initially led to suspicions of a hyper-scale DDoS attack, delaying identification of the true cause.
A peculiar aspect of the incident was the fluctuating nature of the errors: the system would periodically recover, only to fail again roughly every five minutes. The pattern arose because the configuration file was regenerated every five minutes by a query against a ClickHouse cluster that was being updated gradually. Only the nodes that had already received the update produced bad data, so good and bad files alternated.
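The post-mortem describes this mechanism only in prose, so the following is a minimal sketch, with invented node names and counts, of why the output alternated: each five-minute regeneration could be served by an updated or a not-yet-updated ClickHouse node, and only the updated nodes returned the doubled metadata.

```rust
// Illustrative sketch of the alternating good/bad files during the gradual
// ClickHouse rollout; node names, counts, and rollout state are invented.
struct Node {
    name: &'static str,
    updated: bool, // has this node received the permissions change yet?
}

/// Number of feature rows a query against this node would return: an updated
/// node exposes metadata for both "default" and "r0", doubling the rows.
fn feature_rows(node: &Node, base_features: usize) -> usize {
    if node.updated {
        base_features * 2
    } else {
        base_features
    }
}

fn main() {
    let cluster = [
        Node { name: "ch-1", updated: true },
        Node { name: "ch-2", updated: false },
        Node { name: "ch-3", updated: true },
        Node { name: "ch-4", updated: false },
    ];

    // Every five minutes the file is regenerated; whichever node answers the
    // query determines whether the published file is good or bad.
    for (cycle, node) in cluster.iter().enumerate() {
        let rows = feature_rows(node, 60);
        let verdict = if rows > 60 { "oversized (bad file)" } else { "normal (good file)" };
        println!("cycle {cycle}: queried {} -> {rows} features -> {verdict}", node.name);
    }
}
```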
Core traffic resumed flowing normally at 14:30 UTC after the team correctly identified the issue and replaced the problematic file with an earlier working version. All systems returned to full operation at 17:06 UTC after mitigating increased load across various parts of the network.
Technical Origin of the Problem
Cloudflare's Bot Management system uses a machine learning model to generate bot scores for every request traversing the network. This model relies on a "feature" configuration file that is updated every few minutes to react quickly to variations in Internet traffic and new types of bots.
At 11:05 UTC, a change was implemented to the ClickHouse database permissions management system. The goal was to improve the security and reliability of distributed queries by making explicit the access that users already had implicitly to the underlying tables. Before the change, users saw only tables in the "default" database; after the change, they could also see metadata for tables in the "r0" database, where the data is actually stored.
The SQL query used to generate the feature configuration file did not filter by database name. Consequently, after the change it began returning duplicate rows for each column—one for the "default" database and one for "r0"—effectively doubling the number of features in the final file.
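Cloudflare has not published the exact query or table schema beyond this description, so the sketch below models the effect with hypothetical column names: without a filter on the database name, every column appears once for "default" and once for "r0", and adding the filter restores the expected feature count.

```rust
// Minimal model of the duplicate-feature problem; the database names come from
// the post-mortem, the column names and helper functions are illustrative.
struct ColumnMeta {
    database: String, // "default" or, after the permissions change, also "r0"
    name: String,
}

/// Builds the feature list the way the original query effectively did:
/// no filter on the database, so every column contributes twice.
fn features_unfiltered(rows: &[ColumnMeta]) -> Vec<String> {
    rows.iter().map(|r| r.name.clone()).collect()
}

/// The same extraction with an explicit database filter, so only the
/// "default" entries contribute features.
fn features_filtered(rows: &[ColumnMeta]) -> Vec<String> {
    rows.iter()
        .filter(|r| r.database == "default")
        .map(|r| r.name.clone())
        .collect()
}

fn main() {
    // After the change, metadata for the same columns is visible in both databases.
    let databases = ["default", "r0"];
    let columns = ["feature_a", "feature_b", "feature_c"]; // hypothetical names

    let mut rows = Vec::new();
    for db in &databases {
        for col in &columns {
            rows.push(ColumnMeta {
                database: db.to_string(),
                name: col.to_string(),
            });
        }
    }

    println!("without filter: {} features", features_unfiltered(&rows).len()); // 6 (doubled)
    println!("with filter:    {} features", features_filtered(&rows).len()); // 3 (expected)
}
```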
The Memory Limit and System Panic
Each module in Cloudflare's proxy service has limits set to avoid unbounded memory consumption and to preallocate memory as a performance optimization. The Bot Management system had a hard limit of 200 features, well above the roughly 60 features in normal use.
When the bad file with over 200 features was propagated to servers, this limit was exceeded and the FL2 Rust code panicked: it called unwrap() on an error result instead of handling it, producing the message "thread fl2_worker_thread panicked: called Result::unwrap() on an Err value". The panic surfaced to end users as HTTP 5xx errors.
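The post quotes only the panic message, not the surrounding code, so here is a minimal sketch of the pattern it describes, under assumed names: a loader that preallocates space for a fixed number of features and returns an error for oversized files, and a caller that unwrap()s that Result, turning the oversized file into a thread panic.

```rust
// Illustrative sketch of the failure pattern; the 200-feature limit and the
// ~60-feature normal load come from the post-mortem, the code itself does not.
const FEATURE_LIMIT: usize = 200;

struct FeatureConfig {
    features: Vec<String>,
}

/// Loads a feature file into a preallocated buffer, refusing files that
/// exceed the fixed limit.
fn load_features(names: Vec<String>) -> Result<FeatureConfig, String> {
    if names.len() > FEATURE_LIMIT {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            names.len(),
            FEATURE_LIMIT
        ));
    }
    let mut features = Vec::with_capacity(FEATURE_LIMIT); // preallocation
    features.extend(names);
    Ok(FeatureConfig { features })
}

fn main() {
    // Normal operation: roughly 60 features, well under the limit.
    let good: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    let config = load_features(good).unwrap(); // fine
    println!("loaded {} features", config.features.len());

    // The bad file: duplicated metadata pushes the count past the limit.
    let bad: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();

    // unwrap() on the Err value panics the worker thread, which is what
    // surfaced to users as HTTP 5xx responses.
    let _config = load_features(bad).unwrap();
}
```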
Customers on the new FL2 proxy version observed HTTP 5xx errors, while those on the older FL version did not see errors but received bot scores of zero for all traffic, causing numerous false positives for those with rules configured to block bots.
Services Impacted by the Outage
The outage affected several core services and Cloudflare products:
- Core CDN and security services: HTTP 5xx status codes for customer traffic
- Turnstile: inability to load the anti-bot verification service
- Workers KV: elevated number of HTTP 5xx errors on requests to the front-end gateway
- Dashboard: most users unable to log in due to Turnstile unavailability
- Email Security: temporary loss of access to an IP reputation source, reducing spam detection accuracy
- Access: widespread authentication failures for most users from incident start until rollback
Beyond the HTTP 5xx errors, CDN response latency increased significantly. This was driven by high CPU consumption in the debugging and observability systems, which automatically attach additional debugging information to uncaught errors.
The Solution and Recovery
At 13:05 UTC, the team implemented a bypass for Workers KV and Cloudflare Access, reverting them to a previous version of the core proxy. Although the issue was also present in previous versions, the impact was smaller.
At 14:24 UTC, after confirming that the Bot Management configuration file was the trigger of the incident, the team stopped the creation and propagation of new files. A test with the previous version of the file confirmed successful recovery.
At 14:30 UTC, a correct Bot Management configuration file was deployed globally and most services began operating correctly. Core traffic returned to flowing normally. The following hours were dedicated to restarting remaining services that had entered a bad state, with full restoration achieved at 17:06 UTC.
Corrective Actions and Future Prevention
Cloudflare has announced a series of measures to prevent similar incidents in the future:
- Hardening configuration file ingestion: Cloudflare-generated files will be treated with the same rigor as user-generated files, with more stringent validation
- Enhanced global kill switches: implementation of more effective emergency shutoff switches for features
- System resource management: elimination of the possibility that core dumps or other error reports can overwhelm system resources
- Review of failure modes: thorough analysis of error conditions across all core proxy modules
- Improved error handling: avoiding the use of unwrap() on error results without appropriate handling (see the sketch after this list)
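Cloudflare has not published the corrected code, so the sketch below only illustrates the last point: matching on the Result and falling back to the last known-good configuration instead of calling unwrap(), so that an invalid file is rejected without crashing the worker thread.

```rust
// Illustrative only: a failure-tolerant alternative to unwrap(), keeping the
// previous configuration when a new feature file fails validation.
struct FeatureConfig {
    features: Vec<String>,
}

fn load_features(names: Vec<String>, limit: usize) -> Result<FeatureConfig, String> {
    if names.len() > limit {
        return Err(format!("{} features exceeds the limit of {}", names.len(), limit));
    }
    Ok(FeatureConfig { features: names })
}

/// Applies a new configuration if it is valid; otherwise keeps serving with
/// the previous one and reports the error instead of panicking.
fn apply_or_keep(current: FeatureConfig, candidate: Vec<String>, limit: usize) -> FeatureConfig {
    match load_features(candidate, limit) {
        Ok(new_config) => new_config,
        Err(reason) => {
            eprintln!("rejecting new feature file: {reason}; keeping previous config");
            current
        }
    }
}

fn main() {
    let current = FeatureConfig { features: vec!["feature_a".to_string()] };
    let oversized: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();

    // Instead of a panic, traffic keeps flowing on the old configuration.
    let active = apply_or_keep(current, oversized, 200);
    println!("active features: {}", active.features.len()); // 1
}
```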
Conclusion
The November 18, 2025 outage represents Cloudflare's worst downtime since 2019. A seemingly innocuous change in database permissions management triggered a chain of events that brought down much of the network. The incident underscores the importance of thorough testing, even for changes that appear minor, and of robust error handling in critical systems.
Cloudflare has taken full responsibility for the outage, apologizing for the impact on customers and the broader Internet ecosystem. The company has demonstrated transparency by publishing a detailed technical analysis and concrete action plan to prevent similar incidents. For an infrastructure managing a significant portion of global Internet traffic, even a few hours of downtime have enormous consequences for millions of users and businesses worldwide.
FAQ
How long did the Cloudflare outage on November 18, 2025 last?
The outage began at 11:20 UTC and core traffic returned to normal at 14:30 UTC, just over three hours later. Full restoration of all systems occurred at 17:06 UTC.
What caused the Cloudflare outage?
A change to ClickHouse database permissions caused the Bot Management configuration file to be generated at roughly twice its normal size, exceeding a preallocated memory limit and crashing the routing system.
Was the Cloudflare outage caused by a cyber attack?
No, the incident was not caused by DDoS attacks, cyber attacks, or malicious activity of any kind. It was an internal system configuration error.
Which services were affected by the Cloudflare outage?
Impacted services included core CDN, Bot Management, Turnstile, Workers KV, Dashboard, Access, and Email Security. Users received HTTP 5xx errors when attempting to access sites hosted on Cloudflare.
How did Cloudflare resolve the service outage?
The team stopped propagation of the bad file and replaced the Bot Management configuration file with a previous working version, then forced a restart of the core proxy.
Had Cloudflare ever experienced a similar outage before?
This was Cloudflare's worst outage since 2019. In the past 6+ years there had been no other incidents that prevented core traffic from flowing through the network.
What measures has Cloudflare implemented to prevent future outages?
Cloudflare is implementing more stringent configuration file validation, improved global kill switches, better system resource management, and review of failure modes across all proxy modules.