
Key Engineering Lessons from the Cloudflare WAF

Ashutosh Malve

1. The Core Constraint: Security Must Be Instantaneous

The primary lesson from the WAF is that security at the edge is only effective if it fits within the strict latency budget of the Data Plane (D-Plane).

• Extreme Low Latency: The WAF engine runs in the request processing pipeline, often called the "hot path". This path demands extremely low tail latency (measured at P99) and massive parallelism, so WAF processing must finish within a few milliseconds (a budget of under 5 ms is often implied) for every request that hits the edge.

• Compiled Rule Engines: To meet this constraint, WAF engineers cannot rely on slow, general-purpose processing. Rules must have a bounded evaluation cost or be pre-compiled. Cloudflare uses an architecture that combines compiled rules with an Abstract Syntax Tree (AST)-based match engine for fast-path evaluation.

• Avoiding Performance Killers: Because every microsecond matters, WAF developers must guard against Regular Expression Denial of Service (ReDoS), in which a crafted input drives an inefficient regex pattern into catastrophic backtracking. Mitigations include restricting regex to a safe subset (one without backtracking constructs), running complex regex in an asynchronous path with timeouts, and enforcing strict complexity limits before deployment.
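The ideas in this section can be illustrated with a minimal sketch, which is not Cloudflare's implementation. It assumes a rule is a small AST (the node types Contains, Regex, And, Or, Not are hypothetical), compiled once into a plain closure so that nothing is parsed on the hot path. A deploy-time gate, check_pattern, rejects regex that exceed a hypothetical length limit or use lookaround and backreferences. Python's re module is itself a backtracking engine, so this gate only illustrates the "complexity limits before deployment" step; a production engine would pair it with a non-backtracking matcher.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Union

Request = Dict[str, str]              # hypothetical flat view of request fields
Matcher = Callable[[Request], bool]

MAX_PATTERN_LEN = 128                 # hypothetical complexity limit
LOOKAROUND = ("(?=", "(?!", "(?<=", "(?<!")
BACKREF = re.compile(r"\\[1-9]|\(\?P=")   # numeric or named backreference


@dataclass(frozen=True)
class Contains:
    field: str
    needle: str


@dataclass(frozen=True)
class Regex:
    field: str
    pattern: str


@dataclass(frozen=True)
class And:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Or:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Not:
    child: "Node"


Node = Union[Contains, Regex, And, Or, Not]


def check_pattern(pattern: str) -> None:
    """Deploy-time gate: reject patterns outside the allowed subset.

    Conservative on purpose: an escaped backslash followed by a digit
    is also rejected.
    """
    if len(pattern) > MAX_PATTERN_LEN:
        raise ValueError("pattern exceeds length limit")
    if any(tok in pattern for tok in LOOKAROUND) or BACKREF.search(pattern):
        raise ValueError("pattern uses a banned construct")


def compile_rule(node: Node) -> Matcher:
    """Walk the AST once and return a closure; no parsing at request time."""
    if isinstance(node, Contains):
        field, needle = node.field, node.needle
        return lambda req: needle in req.get(field, "")
    if isinstance(node, Regex):
        check_pattern(node.pattern)
        rx, field = re.compile(node.pattern), node.field
        return lambda req: rx.search(req.get(field, "")) is not None
    if isinstance(node, And):
        left, right = compile_rule(node.left), compile_rule(node.right)
        return lambda req: left(req) and right(req)
    if isinstance(node, Or):
        left, right = compile_rule(node.left), compile_rule(node.right)
        return lambda req: left(req) or right(req)
    if isinstance(node, Not):
        child = compile_rule(node.child)
        return lambda req: not child(req)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Usage: compile once at deploy time, then call the closure per request.
rule = And(Contains("path", "/login"), Regex("query", r"(?i)union\s+select"))
match = compile_rule(rule)
```

Here compile_rule runs once per rule version, and match is the only thing evaluated per request. Because And and Or compile to short-circuiting closures, a cheap substring test placed on the left can skip the regex entirely for most traffic.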

2. Architectural Design: Layer 7 Defense and Dynamic Scoring

The WAF is primarily a Layer 7 (Application Layer) defense system that inspects HTTP/HTTPS requests and responses.

• Rule Mechanisms: The WAF employs various rulesets to identify threats:

◦ Cloudflare Managed Rulesets: These are pre-configured and routinely updated by the Cloudflare security team to provide fast protection against zero-day vulnerabilities and OWASP Top 10 attack techniques (like SQL injection and cross-site scripting/XSS).

◦ Custom Rules: These allow users to define specific criteria in the Rules language to filter traffic and apply actions such as Block or Managed Challenge, or a Skip action that bypasses other security features.

• Machine Learning for Detection: The WAF uses machine learning to detect and block emerging threats in real time. It assigns each request a WAF Attack Score from 1 to 99 that reflects how likely the request is to be malicious (in Cloudflare's scheme, lower scores indicate a more likely attack). The score is particularly useful for catching variations of known attacks, such as those produced by fuzzing to bypass security policies.

• Security Feature Integration: The WAF doesn't work in isolation; it sits within a clear processing order at the edge: Firewall → WAF → Bot Management → Rate Limiting → Security transforms. It integrates key capabilities like Bot Management (which assigns a bot score used for granular WAF rules), and Advanced Rate Limiting to protect resources against credential stuffing, brute force, and volumetric attacks.
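To show how the two scores can feed one rule, here is a minimal sketch of a score-based decision. It is not Cloudflare's rule evaluation: the Signals type, the decide function, and all thresholds are hypothetical. It assumes that both the attack score and the bot score run from 1 to 99, with lower values meaning "more likely malicious" and "more likely automated" respectively.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per application.
ATTACK_BLOCK_AT_OR_BELOW = 20
ATTACK_CHALLENGE_AT_OR_BELOW = 50
BOT_CHALLENGE_AT_OR_BELOW = 29


@dataclass(frozen=True)
class Signals:
    attack_score: int   # 1..99, assumed: lower = more likely an attack
    bot_score: int      # 1..99, assumed: lower = more likely automated


def decide(signals: Signals) -> str:
    """Map the two scores to an action: block, managed_challenge, or allow."""
    for name in ("attack_score", "bot_score"):
        value = getattr(signals, name)
        if not 1 <= value <= 99:
            raise ValueError(f"{name} out of range: {value}")
    if signals.attack_score <= ATTACK_BLOCK_AT_OR_BELOW:
        return "block"
    if (signals.attack_score <= ATTACK_CHALLENGE_AT_OR_BELOW
            or signals.bot_score <= BOT_CHALLENGE_AT_OR_BELOW):
        return "managed_challenge"
    return "allow"
```

The ordering is a design choice: the most confident signal (a very low attack score) maps to the strictest action, while ambiguous scores fall back to a challenge rather than a block, which limits the cost of a false positive.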

3. Operational Lessons: Safety and Observability

A major lesson for any systems engineer is how to safely manage security policies across more than 330 data centers worldwide.

• Safe Rule Rollout: Rolling out new WAF rules requires extreme caution to prevent high-impact False Positives. Deployments should use versioned policies, phased canary rollouts (a rule is applied to a growing percentage of traffic), and simulation modes that log matches without enforcing them. The goal is to collect True Positive (TP) and False Positive (FP) metrics and trigger an automatic rollback if the FP rate exceeds a safety threshold.

• Security Analytics and Forensics: The WAF is the data source for the Security Operations Center (SOC), providing security events and telemetry. The data pipeline must handle massive volumes (Cloudflare reports processing up to 706 million events per second at peak) and feed them to analytics platforms.

• Fast Search Capability: SOC teams need to investigate alerts quickly, which requires sub-second search latency over recent logs (the hot window). This query performance is typically achieved with columnar databases like ClickHouse and techniques such as inverted indexes and partition pruning.

• Inline Actioning: The most advanced capability is inline actioning from the SOC dashboard. An operator investigating an event can immediately apply a block, challenge, or rate limit to live traffic through an API integration. This action path requires rigorous engineering to ensure idempotency and an immutable audit trail.
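The idempotency and audit-trail requirements for inline actioning can be sketched as follows. This is a minimal in-memory illustration, not a description of Cloudflare's API: ActionService, its method names, and the idempotency-key scheme are all hypothetical. It assumes the dashboard sends a unique key with each operator action, so a retried request replays the stored result instead of applying the action twice. The audit log is hash-chained, which makes tampering detectable; true immutability would additionally need append-only storage. Timestamps are omitted to keep the example deterministic.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List

ALLOWED_ACTIONS = {"block", "challenge", "rate_limit"}
GENESIS = "0" * 64   # hash value preceding the first audit record


def _digest(prev: str, entry: dict) -> str:
    """Hash the previous record's hash together with this entry."""
    body = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


@dataclass
class ActionService:
    applied: Dict[str, dict] = field(default_factory=dict)  # key -> result
    audit: List[dict] = field(default_factory=list)         # hash-chained log

    def apply(self, key: str, operator: str, action: str, target: str) -> dict:
        """Apply an action once per idempotency key; replays return the
        stored result and write no new audit record."""
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        if key in self.applied:
            return self.applied[key]
        result = {"action": action, "target": target, "status": "applied"}
        self.applied[key] = result
        entry = {"key": key, "operator": operator,
                 "action": action, "target": target}
        prev = self.audit[-1]["hash"] if self.audit else GENESIS
        self.audit.append({**entry, "prev": prev, "hash": _digest(prev, entry)})
        return result

    def verify_audit(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        prev = GENESIS
        for record in self.audit:
            entry = {k: v for k, v in record.items()
                     if k not in ("prev", "hash")}
            if record["prev"] != prev or record["hash"] != _digest(prev, entry):
                return False
            prev = record["hash"]
        return True
```

For example, calling apply twice with the same key (as happens when a dashboard request times out and is retried) yields one enforcement and one audit record, and verify_audit returns False if any stored record is later modified.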