When the Magic Stops: Inside the Byzantine Failure that Brought Down Cloudflare’s Control Plane

Target Audience: Engineers who are comfortable with distributed-systems basics and who followed the earlier discussion of edge speed (the Data Plane).

Summary & Key Takeaways:

This article would analyze the Cloudflare API availability incident on November 2, 2020, focusing on how redundancy failed, how consensus protocols reacted to conflicting information, and the lessons learned about designing for degraded states rather than just total failures.

1. The Fundamental Split (Review): We start by reiterating that while Cloudflare’s edge (the Data Plane) remained massively distributed and fully functional throughout the incident, the problem occurred in the Control Plane (API and dashboard). The Control Plane handles management, configuration, and the storage of strongly consistent data.

2. The Catalyst: A Misbehaving Switch: The cascade began when a network switch started to misbehave, entering a partially working state: its data plane (packet forwarding) was failing, while its control-plane protocols (such as BGP) stayed up. This "degraded state", rather than a clean failure, was the key starting point, as the probe sketch below illustrates.
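
This partial failure mode is exactly what naive health checks miss: routing adjacencies look fine, so peers keep steering traffic into a device that is dropping it. The Go sketch below is a hypothetical probe (not anything Cloudflare describes) that only reports a switch healthy when both its control-plane session is reachable and traffic actually makes it through its forwarding path; the addresses, ports, and timeouts are illustrative assumptions.

```go
// probe.go: a hypothetical health probe illustrating the "degraded switch" case.
// A switch can keep its control plane up (e.g., the BGP session on TCP/179 still
// accepts connections) while its data plane silently drops forwarded traffic.
package main

import (
	"fmt"
	"net"
	"time"
)

// controlPlaneUp checks that the switch still answers on its BGP port.
func controlPlaneUp(switchAddr string) bool {
	conn, err := net.DialTimeout("tcp", switchAddr+":179", 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// dataPlaneUp checks that traffic actually gets *through* the switch by
// reaching a host on the far side of it (illustrative target, not a real one).
func dataPlaneUp(farSideAddr string) bool {
	conn, err := net.DialTimeout("tcp", farSideAddr+":443", 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	ctrl := controlPlaneUp("192.0.2.1")  // switch management address (documentation range)
	data := dataPlaneUp("198.51.100.10") // host reachable only via the switch (example)

	switch {
	case ctrl && data:
		fmt.Println("healthy")
	case ctrl && !data:
		// The dangerous case from the incident: routing protocols look fine,
		// so traffic keeps flowing into a device that is dropping it.
		fmt.Println("degraded: control plane up, data plane failing")
	default:
		fmt.Println("down")
	}
}
```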

3. The Byzantine Fault in etcd (Consensus Failure): The misbehaving switch affected network traffic to and from one server (Node 1) in the core etcd cluster. etcd is used wherever Cloudflare needs strongly consistent data storage replicated reliably across multiple nodes.

    ◦ Because of the partial switch failure, different nodes in the etcd cluster held conflicting views of the network: a condition the original post characterized as a Byzantine fault (a component presenting different symptoms to different observers).

    ◦ Node 1 repeatedly initiated leader elections, voting for itself, while Node 2 repeatedly voted for the existing leader (Node 3), which it could still reach.

    ◦ This conflict resulted in repeated tied elections, so no new leader could be promoted, which effectively made the etcd cluster read-only (the toy vote-count sketch after this list shows the quorum arithmetic).
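
To make the election arithmetic concrete, here is a toy Go sketch of majority counting in a three-node cluster. It is not etcd's implementation and ignores terms, pre-vote, and timeouts; it only shows that with Node 1 voting for itself and Node 2 sticking with Node 3, neither candidate reaches the quorum of two, so no leader emerges and writes stall.

```go
// election.go: a toy illustration of why split votes stall leader election.
// This is not etcd's code; it only demonstrates the quorum arithmetic.
package main

import "fmt"

// winner returns the candidate that received a strict majority of votes, if any.
func winner(votes map[string]string, clusterSize int) (string, bool) {
	quorum := clusterSize/2 + 1 // 2 of 3 in this incident's cluster
	tally := map[string]int{}
	for _, candidate := range votes {
		tally[candidate]++
		if tally[candidate] >= quorum {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	// Node 1 keeps calling elections and voting for itself; Node 2 can still
	// reach the old leader (Node 3) and keeps voting for it; Node 3's vote
	// never reaches Node 1 through the failing switch.
	votes := map[string]string{
		"node1": "node1",
		"node2": "node3",
	}
	if leader, ok := winner(votes, 3); ok {
		fmt.Println("new leader:", leader)
	} else {
		// No majority: the cluster stays leaderless, so no new writes can be
		// committed and it is effectively read-only.
		fmt.Println("no quorum: election fails, cluster stays leaderless")
	}
}
```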

4. The Database Cascade: The Control Plane services rely on high-availability relational databases (specifically PostgreSQL) whose cluster management system uses etcd for discovery and coordination. When etcd became read-only, two database clusters could no longer record that they had a healthy primary, which triggered the automatic promotion of a synchronous replica to a new primary (a hypothetical sketch of this etcd coupling follows this item).

    ◦ Although the promotion itself was instant and lost no data, a defect in the cluster management system required every database replica to be rebuilt. Depending on the size of the database, that rebuild delayed full restoration of the affected services considerably.

    ◦ Ultimately, API availability periodically dipped as low as 75%, and the dashboard was up to 80 times slower than normal, over a window of six hours and 33 minutes.
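
The coupling between the database layer and etcd can be sketched roughly as follows. This is a hypothetical fragment in the spirit of leader-key-based HA managers (the article does not name the specific tool Cloudflare uses); the key names, endpoints, and TTLs are assumptions. The point is that once etcd can no longer commit writes, the primary cannot keep its claim alive, and the HA machinery reacts exactly as if the primary had died.

```go
// leaderkey.go: hypothetical sketch of a Postgres HA manager heartbeating its
// "I am the healthy primary" claim into etcd. Key names, TTLs, and endpoints
// are illustrative assumptions, not Cloudflare's configuration.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The primary writes a leader key bound to a short-lived lease, then keeps
	// the lease alive for as long as it is healthy.
	lease, err := cli.Grant(context.Background(), 10) // 10s TTL (assumed value)
	if err != nil {
		log.Fatal("cannot obtain lease:", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	_, err = cli.Put(ctx, "/service/db-cluster-1/leader", "pg-primary-1",
		clientv3.WithLease(lease.ID))
	cancel()
	if err != nil {
		log.Fatal("cannot claim leader key:", err)
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, err := cli.KeepAliveOnce(ctx, lease.ID)
		cancel()
		if err != nil {
			// If etcd has lost its own leader it cannot commit the lease
			// refresh, even though Postgres itself is healthy. The lease
			// expires, the leader key vanishes, and the HA machinery promotes
			// a replica: the "database cascade" described above.
			log.Println("cannot refresh lease; failover will follow:", err)
			return
		}
		time.Sleep(3 * time.Second)
	}
}
```

The design lesson sits in that error branch: the remediation (promote a replica) is triggered by the health of the coordination layer, not of the database itself, which is how a read-only etcd cascaded into database failovers.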

5. Engineering Takeaways (Avoiding SPOFs in Distributed Redundancy): The incident is instructive precisely because every system involved had redundancy; components degraded partially instead of failing cleanly, which made the failure chain far harder to model.

    ◦ Cloudflare learned that for some auto-remediation processes (like automatic replica promotion), the cure can be worse than the disease, and adjusted the configuration parameters that trigger rapid remediation.

    ◦ The discussion highlights the tradeoff between Byzantine Fault Tolerance (BFT) protocols and simpler consensus algorithms such as Raft (which etcd uses); the simpler protocols are generally preferred for their performance and algorithmic simplicity, even though they are vulnerable to rare failure modes (the cluster-size arithmetic behind this tradeoff is sketched after this list).

    ◦ Postscript Insight: The community later suggested the fault is more accurately characterized as an omission fault than a general Byzantine fault; omission faults can be tolerated without resorting to full BFT protocols. Cloudflare promised a follow-up post on the different fault types in distributed systems.
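
One way to quantify the tradeoff mentioned above: a crash- or omission-fault-tolerant protocol like Raft needs n = 2f + 1 replicas to tolerate f faulty nodes, while Byzantine fault tolerance requires n = 3f + 1 and more message rounds. The tiny Go sketch below just computes those minimum cluster sizes; the formulas are standard results, not anything specific to Cloudflare's deployment.

```go
// quorum.go: minimum cluster sizes to tolerate f faulty nodes under
// crash/omission faults (Raft-style) vs Byzantine faults (BFT protocols).
package main

import "fmt"

func crashTolerantSize(f int) int     { return 2*f + 1 } // majority quorums
func byzantineTolerantSize(f int) int { return 3*f + 1 } // e.g., PBFT-style

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("tolerate f=%d faults: Raft needs %d nodes, BFT needs %d nodes\n",
			f, crashTolerantSize(f), byzantineTolerantSize(f))
	}
	// A 3-node cluster tolerates one crash fault, which is part of why
	// Raft's simplicity wins in practice for systems like etcd.
}
```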

This article would move the conversation from "How do they go fast?" to "How do they stay correct when parts fail?", providing a crucial architectural counterpoint to the initial discussion of edge latency.