The Three Failure Concepts Worth Knowing
Welcome
Distributed systems fail in patterns. Once you learn the patterns, every postmortem becomes a recognition exercise instead of a mystery.
Three concepts cover most of what matters in production failure analysis:
Single point of failure (SPOF): a component whose failure brings down a larger system. Often hidden: the DNS server that everyone depends on; the certificate that everything renews against; the single database master.
Cascading failure: one component's failure triggers another's, which triggers another's. A slow database causes timeouts in the API tier, which causes retries, which load the database further, which causes more timeouts. The blast spreads.
Blast radius: how much of the system goes down when one piece fails. Architectural choices either bound the radius or leave it unbounded. A SPOF's blast radius extends to everything that depends on it. A bulkheaded service has a bounded radius.
By the end of this lesson you will:
- Identify SPOFs in an architecture by inspection
- Recognize cascading failure patterns: thundering herd, retry storm, queue of death
- Read a real timeline & separate the trigger from the latent defect that the trigger surfaced
- Write blameless action items that target systems instead of people, covering prevention / detection / recovery
- Reason about bulkheads & circuit breakers as blast-radius bounding tools
Spot the Single Point of Failure
Layered Architecture Inspection
Consider a small web architecture:
- DNS: api.example.com -> single nameserver IP 203.0.113.10 hosted by one DNS provider
- CDN: single CDN vendor in front of api.example.com
- Ingress: two reverse proxy machines behind a load balancer
- Backend: six API replicas in two availability zones (three per zone)
- Database: one primary + one read replica, in the same availability zone
- Cache: Redis cluster, three nodes spread across the same two availability zones
Question: which components are SPOFs? Hint: SPOFs are not always the obvious 'single machine' kind. A cluster of three machines all in one availability zone is a SPOF for that zone's failure.
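One way to make the inspection mechanical is to count, for each tier, how many independent instances, availability zones, & vendors it spans, then flag any dimension where the count is exactly one. The sketch below is illustrative only: the Tier class & find_spofs function are hypothetical names, & a count of None means the architecture description above does not state it (for example, how many load balancers sit in front of the proxies).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tier:
    name: str
    instances: Optional[int] = None  # None = not stated in the description
    zones: Optional[int] = None
    vendors: Optional[int] = None


# Counts transcribed from the architecture list above.
ARCHITECTURE = [
    Tier("dns", instances=1, vendors=1),
    Tier("cdn", vendors=1),
    Tier("ingress proxies", instances=2),
    Tier("backend", instances=6, zones=2),
    Tier("database", instances=2, zones=1),
    Tier("cache", instances=3, zones=2),
]


def find_spofs(tiers):
    """Map each tier name to the dimensions in which it has no redundancy."""
    findings = {}
    for tier in tiers:
        reasons = []
        dimensions = (
            ("instance", tier.instances),
            ("availability zone", tier.zones),
            ("vendor", tier.vendors),
        )
        for label, count in dimensions:
            if count == 1:
                reasons.append(f"single {label}")
        findings[tier.name] = reasons
    return findings


for name, reasons in find_spofs(ARCHITECTURE).items():
    print(name, reasons or "no single-count dimension found")
```

Counting alone does not settle the question. It misses role asymmetry (a database with two machines still has only one writable primary) & anything the description leaves unstated, so treat an empty result as "needs a closer look", not as "safe".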
Three Classic Cascade Patterns
Failures Travel Through Dependencies
Pattern 1: Thundering herd. A shared resource (cache, lock, database) fails or restarts. Every client that depended on it retries simultaneously. The flood overwhelms whatever comes back up; the retries pile on faster than the recovery can absorb them; recovery never completes.
Pattern 2: Retry storm. A downstream service slows down. Upstream callers, instead of failing, retry. The retries multiply the original load. The slow service slows further, triggering more retries. Eventually the load exceeds what even a healthy instance of the service could handle.
Pattern 3: Queue of death. A processing queue with no backpressure receives messages faster than its consumer processes them. The queue grows unbounded. Memory exhausts; the consumer crashes; restarts; finds a still-larger queue; crashes again.
Common thread: a small initial perturbation triggers a positive-feedback loop. The system's own response amplifies the failure rather than damping it.
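The feedback loop can be seen in a toy simulation. The model below is an assumption made for illustration, not a measurement of any real system: a service handles load up to its capacity, but once overloaded its useful throughput falls as capacity squared divided by offered load (a crude stand-in for time wasted on requests that will time out anyway). All numbers & the simulate function are hypothetical.

```python
def simulate(retry: bool, ticks: int = 30) -> list[float]:
    """Return the number of failed requests per tick.

    A brief capacity dip hits at ticks 2-4. With retry=True, every failed
    request is re-sent on the next tick in addition to the normal load.
    """
    base_load = 80.0          # steady request rate per tick
    healthy_capacity = 100.0  # requests per tick the service can normally serve
    backlog = 0.0             # failed requests waiting to be retried
    failed_per_tick = []
    for tick in range(ticks):
        capacity = 40.0 if 2 <= tick < 5 else healthy_capacity
        offered = base_load + backlog
        if offered <= capacity:
            served = offered
        else:
            # Assumed overload behaviour: goodput shrinks as offered load grows.
            served = capacity * capacity / offered
        failed = offered - served
        backlog = failed if retry else 0.0
        failed_per_tick.append(failed)
    return failed_per_tick


without_retries = simulate(retry=False)
with_retries = simulate(retry=True)
print("failures at final tick, no retries:", without_retries[-1])
print("failures at final tick, retries:   ", round(with_retries[-1], 1))
```

Under these assumptions the run without retries returns to zero failures as soon as capacity is restored, while the run with immediate retries keeps failing & gets worse long after the original dip has ended. The perturbation is the same in both runs; only the system's response differs.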
Damping Mechanisms
Exponential backoff with jitter. Clients that retry wait longer each time, with random offset. Prevents synchronized retry waves.
Circuit breaker. A caller tracks downstream failure rate. Past a threshold, the caller stops calling for a cooldown period & immediately fails its own requests instead. Prevents wasted work, lets the downstream recover.
Bulkhead. Isolate resources per dependency. Connection pool A for database, separate connection pool B for cache. A slow database cannot starve all connections; cache calls continue.
Load shedding. When overloaded, drop requests at the edge instead of accepting them & failing slowly. A 429 in 1 ms is better than a 500 in 30 seconds.
Backpressure. Slow producers down when consumers cannot keep up. Queues become bounded; senders block; the original source of work feels the friction.
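As an illustration of the first mechanism, exponential backoff with jitter, here is a minimal sketch. It uses the "full jitter" variant, where each wait is drawn uniformly between zero & an exponentially growing ceiling. The function name, base delay, & cap are assumptions chosen for the example, not values from any particular client library.

```python
import random
from typing import Optional


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Return one randomized wait (in seconds) per retry attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt but never exceeds the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick anywhere in [0, ceiling] so clients desynchronize.
        delays.append(rng.uniform(0, ceiling))
    return delays


print(backoff_delays(5))
```

Because every client draws its own random waits, a thousand clients that failed at the same instant spread their retries across the whole window instead of arriving together.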
Diagnose a Cascade
A team's API tier melts down during a routine database failover. Timeline:
- 14:00:00 — operator promotes standby database. Expected unavailability: ~10 seconds.
- 14:00:08 — primary unavailable. API tier requests start failing with database connection errors.
- 14:00:08 — API tier retries (default config: 5 retries, no backoff, 100ms apart).
- 14:00:11 — standby promoted, accepting new connections.
- 14:00:11 — API tier opens thousands of fresh database connections simultaneously (every replica × every concurrent request × every retry).
- 14:00:13 — new primary's connection pool exhausted; new connections rejected.
- 14:00:13-14:05:00 — API tier replicas exhaust connection pools, throw exceptions, crash, restart, repeat.
- 14:05:00 — operator manually stops API tier traffic; database stabilizes.
- 14:10:00 — gradual traffic restoration completes. Total outage: ~10 minutes (vs expected ~10 seconds).
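The retry configuration in the timeline can be checked with a little arithmetic. The sketch below computes when each retry fires, relative to a request's first failure, for the configured policy (5 retries, 100 ms apart) & for a hypothetical doubling policy that is not part of the incident's configuration. The 3-second gap is read off the timeline: connection errors start at 14:00:08 & the standby accepts connections at 14:00:11.

```python
def retry_offsets(retries: int, first_delay: float, factor: float) -> list[float]:
    """Seconds after the first failure at which each retry fires."""
    offsets, elapsed, delay = [], 0.0, first_delay
    for _ in range(retries):
        elapsed += delay
        offsets.append(round(elapsed, 3))
        delay *= factor
    return offsets


FAILOVER_GAP = 3.0  # seconds, 14:00:08 to 14:00:11 in the timeline

as_configured = retry_offsets(5, 0.1, 1.0)  # fixed 100 ms spacing
with_doubling = retry_offsets(5, 0.1, 2.0)  # hypothetical alternative

print("configured:", as_configured, "last retry inside gap:", as_configured[-1] < FAILOVER_GAP)
print("doubling:  ", with_doubling, "last retry inside gap:", with_doubling[-1] < FAILOVER_GAP)
```

Under the configured policy, a request that first fails at 14:00:08 has spent all five retries within half a second, so every attempt lands on a database that is still unavailable: the retries add load without adding any chance of success. Doubling the delay stretches the last retry just past the gap, but without jitter & a limit on concurrent reconnects, synchronized clients would still arrive together.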
DNS SERVFAIL: Two Compounding Defects
A Real-Shape Postmortem
What follows is a sanitized version of a real incident. Vendor names changed, IPs anonymized; the shape, the timeline, & the lessons are real.
Summary
Site example.com returned SERVFAIL from all public DNS resolvers for approximately 2 hours (from the zone expiring at ~18:00 UTC to resolution at 20:11 UTC, per the timeline below). The other 46 zones on the same DNS primary were unaffected. Root cause: two compounding defects.
1. Vendor A (a secondary DNS provider) added a new internal sync IP that was not in the primary's allow-axfr-ips allowlist.
2. The example.com zone had a years-old RFC-violating CNAME conflict (demo.example.com had both CNAME & MX/TXT records at the same label) that caused Vendor A to reject the zone on fresh AXFR.
Timeline (UTC)
- ~15:00 — Vendor A adds new sync IP 198.51.100.42 to their infrastructure
- 15:02 — first AXFR-out denied for 198.51.100.42 appears in primary DNS logs (no alerting on this signal)
- ~18:00 — SOA expire window reached; Vendor A drops example.com zone from cache
- ~18:30 — SERVFAIL detected externally
- ~19:45 — root cause identified
- 20:00 — 198.51.100.42 added to allow-axfr-ips; primary restarted
- 20:05 — NOTIFY sent; AXFR initiated; zone STILL SERVFAIL (CNAME conflict)
- 20:07 — check-zone reveals 1 error: CNAME conflict on demo.example.com
- 20:09 — CNAME replaced with A record; zone check clean (0 errors)
- 20:10 — NOTIFY sent; AXFR completes; Vendor A begins serving zone
- 20:11 — dig @8.8.8.8 example.com A returns correct IP — RESOLVED
Why Only example.com?
All 47 zones share the same DNS primary. The AXFR denial for the new sync IP applied to every zone. But only example.com had the CNAME conflict, & only example.com needed a fresh AXFR while the denial was in effect. Other zones had already refreshed before the denial or did not yet need to.
Latent defect
The CNAME conflict at demo.example.com had existed for years. It worked because the primary served the zone from its database (lenient about RFC violations) & Vendor A was serving from stale cached data from before the violation was introduced. When Vendor A dropped its cache & needed fresh data, the violation surfaced.
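A conflict of this kind can be caught by a simple lint over the zone's records. The sketch below assumes records are available as (name, type) pairs; the function name & the sample data are hypothetical, with the demo.example.com entries mirroring the conflict described above. It applies the DNS rule that a name holding a CNAME may hold no other data, with the DNSSEC record types RRSIG & NSEC treated as the permitted exceptions.

```python
from collections import defaultdict

# Record types that may legitimately sit alongside a CNAME.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Return sorted names that hold a CNAME together with other record types."""
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalize case & any trailing dot so equivalent names group together.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    )


# Hypothetical zone contents for illustration.
zone = [
    ("example.com.", "A"),
    ("example.com.", "MX"),
    ("www.example.com.", "CNAME"),
    ("demo.example.com.", "CNAME"),
    ("demo.example.com.", "MX"),
    ("demo.example.com.", "TXT"),
]
print(cname_conflicts(zone))
```

Run against every zone on each change, a check like this would turn a years-old latent defect into a failed validation at the moment it is introduced.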
Trigger
Vendor A silently added a new sync IP. The primary's allowlist did not include it. AXFR denied. Three hours later (SOA expire), Vendor A dropped the zone. The latent defect surfaced when the system tried to recover.
Write Blameless Action Items
Blameless: Target Systems, Not People
A blameless action item names something the system should do differently, not something a person should do differently. 'Train the operator' is blameful. 'Add an automated check that catches this before deploy' is blameless.
Good blameless action items cluster into three dimensions:
- Prevention: make the bad thing harder or impossible
- Detection: notice it sooner if it happens
- Recovery: limit the damage when it happens
Each item should name (1) the specific system change, (2) an owner team, & (3) the dimension it serves.
Compartments That Sink Without the Ship
Borrowed from Naval Engineering
Ships carry watertight bulkheads: vertical walls that divide the hull into compartments. One compartment can flood without sinking the ship; another can fail without affecting the rest.
Distributed systems borrow the same word & the same idea.
Bulkhead pattern: isolate resources per dependency. A service that calls three downstream APIs uses three separate connection pools, three separate thread pools, three separate retry budgets. A slow or failing downstream cannot consume the resources allocated for the other two.
Without bulkheads: one slow dependency exhausts the shared thread pool; calls to other dependencies block waiting for threads; the entire service becomes unresponsive.
With bulkheads: one slow dependency exhausts its own pool; calls to it fail fast; calls to other dependencies continue normally; the blast radius stays bounded to the failing dependency.
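A minimal sketch of the idea, assuming a thread-based service: each dependency gets its own bounded set of concurrency slots, & a call that finds no free slot fails immediately instead of waiting. The Bulkhead class, the BulkheadFull exception, & the slot counts are hypothetical names & values chosen for illustration.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's concurrency slots are all in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: fail fast rather than queue behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name}: no free slots")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream; a stall in one cannot consume the other's slots.
database = Bulkhead("database", max_concurrent=20)
cache = Bulkhead("cache", max_concurrent=50)

print(cache.call(lambda: "cache hit"))
```

If every database slot is occupied by slow calls, further database calls raise BulkheadFull at once, while cache.call continues to work because it draws from its own slots.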
Circuit Breakers
Circuit breaker pattern: a stateful wrapper around a downstream dependency that tracks failure rate. Three states:
- Closed (normal): calls pass through. Failures counted.
- Open (tripped): past a failure threshold (say, 50% failures in the last 30 seconds), the breaker opens. Calls fail immediately without trying the dependency. Saves the caller from wasting work; saves the dependency from receiving load while it is unhealthy.
- Half-open (testing): after a cooldown period, the breaker lets a small fraction of calls through. If they succeed, it closes back to normal. If they fail, it re-opens for another cooldown.
The key insight: the circuit breaker prevents wasted effort during known-unhealthy periods, & gives the downstream a chance to recover without continued load.
Bulkheads bound the blast radius. Circuit breakers prevent the blast from sustaining itself.
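The three states can be sketched in a few dozen lines. This is a simplified, single-threaded illustration, not a production implementation: it measures the failure ratio over the last few calls instead of a time window, & in the half-open state it lets exactly one trial call through instead of a fraction of traffic. The class name, exception, & default thresholds are assumptions.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker is open & the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window=10, cooldown=30.0,
                 clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.cooldown = cooldown
        self.clock = clock
        self.results = deque(maxlen=window)  # True marks a failed call
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            self.state = "half-open"  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.results.append(True)
        window_full = len(self.results) == self.results.maxlen
        ratio = sum(self.results) / len(self.results)
        # A failed trial re-opens at once; otherwise trip on a full, bad window.
        if self.state == "half-open" or (window_full and ratio >= self.failure_ratio):
            self.state = "open"
            self.opened_at = self.clock()
            self.results.clear()

    def _record_success(self):
        if self.state == "half-open":
            self.state = "closed"  # trial succeeded: resume normal operation
            self.results.clear()
        else:
            self.results.append(False)
```

A caller wraps each downstream request, for example breaker.call(fetch_recommendations, user_id) with a hypothetical fetch_recommendations function, & treats CircuitOpenError as a signal to serve a fallback. The clock parameter exists so the cooldown can be driven by a fake clock in tests.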
Bound the Blast Radius
Your API service calls four downstream services: User Service, Recommendation Service, Notification Service, & a third-party Payment API. The team has heard 'the Recommendation Service has been a little flaky' & wants to make sure that when it fails, the rest of the system stays healthy.
Today the service uses a single shared thread pool of 200 threads & a single shared HTTP connection pool. All four downstreams compete for these resources. There are no circuit breakers.
Design a Failure-Mode Review
Synthesis
You have learned to spot SPOFs by inspection, recognize cascading failure patterns, separate trigger from latent defect when reading a postmortem, write blameless action items across prevention / detection / recovery, & bound blast radius with bulkheads + circuit breakers + graceful degradation.
Apply all five.
Your team is launching a new service search.example.com that depends on three downstream services: a primary search index (index.example.com), an analytics service (analytics.example.com), & a recommendations service (recs.example.com). The team wants you to lead a 'failure-mode review' before launch.
Where This Course Goes Next
You can now spot a SPOF, recognize a cascade, read a postmortem productively, write blameless action items, & bound blast radius by design.
The final lesson in this course (cs_distsys_observability_and_capacity) teaches what to measure so you find out a problem is happening before users do. It covers health checks, version endpoints, the four golden signals at a proxy tier, & how surge capacity decisions tie back to observed data.
Companion lesson: geometry_of_failure_modes_and_blast_radius derives betweenness centrality (which graph node is the bottleneck) & min-cut (the bound on blast radius).
Well done. Onward.