Gemini summary!
Cloudflare uses a Web Application Firewall (WAF) to inspect incoming traffic for attacks. An engineer deployed a rule to block cross-site scripting (XSS). The rule used a Regular Expression (Regex) to find patterns in the data.
- The Technical Flaw: The Regex contained a pattern that triggered catastrophic backtracking.
- The "Greedy" Loop: When a backtracking Regex engine encounters a pattern with several adjacent wildcards (like .*.*), it tries every possible way to split the text between them. If the text almost matches but not quite, the engine has to exhaust all of those combinations before giving up, so the work grows far faster than the input length, all to check a single string.
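The backtracking behavior described above can be made concrete with a toy matcher that counts its own steps. This is a minimal sketch, not the engine or the rule involved in the incident: the function name `count_steps` and the simplified pattern `.*.*=.*` are assumptions chosen for illustration, and the matcher only understands `.*` and literal characters. For this simplified pattern the step count grows roughly with the square of the input length; patterns with nested quantifiers can grow much faster.

```python
def count_steps(pattern, text):
    """Match `pattern` against all of `text` by naive backtracking.

    Supports only '.*' and literal characters (a toy subset).
    Returns (matched, number_of_match_attempts).
    """
    # Split the pattern into tokens: either ".*" or one literal character.
    tokens = []
    i = 0
    while i < len(pattern):
        if pattern[i:i + 2] == ".*":
            tokens.append(".*")
            i += 2
        else:
            tokens.append(pattern[i])
            i += 1

    steps = 0

    def match(ti, pos):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return pos == len(text)
        tok = tokens[ti]
        if tok == ".*":
            # Greedy: take as much as possible first, then give characters
            # back one at a time whenever the rest of the pattern fails.
            for end in range(len(text), pos - 1, -1):
                if match(ti + 1, end):
                    return True
            return False
        return pos < len(text) and text[pos] == tok and match(ti + 1, pos + 1)

    return match(0, 0), steps


# Input that "almost matches": plenty of characters, but no '=' anywhere.
for n in (10, 20, 40):
    matched, steps = count_steps(".*.*=.*", "x" * n)
    print(n, matched, steps)
```

Each time the input length doubles, the number of attempts more than triples, even though the answer ("no match") never changes.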
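Because the Regex engine was spending so long "guessing" at possible matches on each request, it consumed every bit of processing power available. The loop was not truly infinite, but each evaluation took far longer than the servers could afford.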
- Global Gridlock: Cloudflare’s edge servers are designed to handle thousands of requests per second. Suddenly, the "brains" (CPUs) of these servers were 100% occupied by this one broken Regex rule.
- The 502 Error: Because the CPUs were busy "thinking" about the Regex, they couldn't process actual user traffic. This resulted in the HTTP 502 Bad Gateway error, meaning a proxy in the request path was reachable but could not get a valid response from the server behind it.
At the time, Cloudflare distributed WAF rule changes through Quicksilver, its global configuration store, which was designed for speed. It could push a rule change to every server on the planet in seconds.
- The "Blast Radius": WAF rule changes did not go through a Staged Rollout. Usually, changes should be pushed to a small slice of traffic first, such as 1% (a "Canary" release), to see if things break. Instead, this rule was pushed to 100% of the network simultaneously.
- The Result: CPU usage spiked on servers across the entire global network at nearly the same moment.
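The difference between a global push and a canary-first push can be sketched in a few lines. This is an illustrative simulation only: the stage names, server names, and the `staged_rollout`, `deploy`, `is_healthy`, and `rollback` functions are all hypothetical and do not describe Cloudflare's actual tooling.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, List[str]]  # (stage name, servers in that stage)


def staged_rollout(
    stages: List[Stage],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> str:
    """Deploy stage by stage; stop and undo everything at the first unhealthy stage."""
    touched: List[str] = []
    for name, servers in stages:
        for server in servers:
            deploy(server)
            touched.append(server)
        if not all(is_healthy(server) for server in servers):
            for server in reversed(touched):
                rollback(server)
            return f"halted at {name}"
    return "complete"


def simulate(rule_is_bad: bool):
    """Run a rollout over hypothetical servers and report what was exposed."""
    stages = [
        ("canary", ["edge-1"]),
        ("region", ["edge-2", "edge-3"]),
        ("global", ["edge-4", "edge-5", "edge-6"]),
    ]
    history: List[str] = []  # every server that ever received the rule
    live = set()             # servers currently running the rule

    def deploy(server: str) -> None:
        history.append(server)
        live.add(server)

    def rollback(server: str) -> None:
        live.discard(server)

    def is_healthy(server: str) -> bool:
        # Stand-in for a real health check (CPU, error rate, and so on).
        return not rule_is_bad

    result = staged_rollout(stages, deploy, is_healthy, rollback)
    return result, history, live


print(simulate(rule_is_bad=True))
print(simulate(rule_is_bad=False)[0])
```

With a bad rule, only the single canary server ever receives it before the rollout halts and rolls back; with a global push, all six would have been hit at once.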
The team faced a classic "Circular Dependency" problem during the outage:
- Cloudflare's engineers use internal tools to monitor their network.
- Those internal tools were protected by... Cloudflare.
- Because the network was down, the engineers couldn't log into their own dashboards to see what was happening, slowing down the resolution.
To ensure this never happened again, Cloudflare made three fundamental shifts in their engineering culture:
- Algorithmic Safety: They moved toward using Regex engines (like RE2 or Rust-based engines) that have "runtime guarantees." These engines do not backtrack at all, so matching time grows only linearly with the size of the input.
- Sandboxing: They re-introduced resource limits. If a single rule starts using more than a tiny percentage of CPU, the system now "kills" that specific rule rather than letting it crash the whole server.
- Staged Rollouts: They implemented a mandatory "soak time." Rules now go to a single test city (like Lisbon or Chicago) before moving to a small region, and finally the world.
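The "runtime guarantee" in the first item above comes from a different matching strategy: instead of trying one possibility at a time and backing up, the engine tracks every pattern position it could be in and advances all of them together, one input character at a time. The sketch below shows that idea for a toy pattern language (only `.*` and literal characters). It is written in the spirit of automaton-based engines and is not how RE2 or the Rust regex crate are actually implemented; `match_linear` and its token list format are assumptions made for illustration.

```python
def match_linear(tokens, text):
    """Match a token list against all of `text` without backtracking.

    `tokens` holds ".*" or single literal characters.
    Returns (matched, steps); steps never exceeds
    len(text) * (len(tokens) + 1).
    """

    def closure(states):
        # ".*" may match zero characters, so standing at a ".*" token
        # also means we may already stand at the token after it.
        out = set()
        stack = list(states)
        while stack:
            i = stack.pop()
            if i in out:
                continue
            out.add(i)
            if i < len(tokens) and tokens[i] == ".*":
                stack.append(i + 1)
        return out

    states = closure({0})  # set of token positions we might be at
    steps = 0
    for ch in text:
        nxt = set()
        for i in states:
            steps += 1
            if i == len(tokens):
                continue          # pattern finished but input remains
            if tokens[i] == ".*":
                nxt.add(i)        # ".*" consumes this character and stays
            elif tokens[i] == ch:
                nxt.add(i + 1)    # literal matched, move to the next token
        states = closure(nxt)
    return len(tokens) in states, steps


tokens = [".*", ".*", "=", ".*"]  # the simplified pattern .*.*=.*
for n in (10, 20, 40):
    print(n, match_linear(tokens, "x" * n))
```

Because each input character is examined once per live pattern position, doubling the input only doubles the work, no matter how close the text comes to matching.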