Failure Modes & Blast Radius

The Three Failure Concepts Worth Knowing

Welcome

Distributed systems fail in patterns. Once you learn the patterns, every postmortem becomes a recognition exercise instead of a mystery.

Three concepts cover most of what matters in production failure analysis:

Single point of failure (SPOF): a component whose failure brings down a larger system. SPOFs are often hidden: the DNS server everyone depends on, the certificate authority every service renews against, the single database primary.

Cascading failure: one component's failure triggers another's, which triggers another's. A slow database causes timeouts in the API tier, which causes retries, which load the database further, which causes more timeouts. The blast spreads.

Blast radius: how much of the system goes down when one piece fails. Architectural choices either bound the radius or leave it unbounded. An unmitigated SPOF has a blast radius as large as everything that depends on it. A bulkheaded service confines the damage to one compartment.
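One way to make blast radius concrete is to treat the architecture as a dependency graph & ask which components can reach the failed one. The sketch below is a worst-case estimate: it assumes every dependency is hard (no redundancy, no fallback). The component names & the `DEPENDS_ON` map are hypothetical.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every component that transitively depends on `failed`."""
    # Invert the edges: for each component, record who depends on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk outward from the failed component.
    affected = set()
    pending = deque([failed])
    while pending:
        current = pending.popleft()
        for dependent in dependents.get(current, set()):
            if dependent not in affected:
                affected.add(dependent)
                pending.append(dependent)
    return affected


# Hypothetical map: each key depends on every name in its set.
DEPENDS_ON = {
    "web": {"api"},
    "api": {"db", "cache"},
    "reports": {"db"},
    "db": set(),
    "cache": set(),
}

print(sorted(blast_radius(DEPENDS_ON, "db")))     # ['api', 'reports', 'web']
print(sorted(blast_radius(DEPENDS_ON, "cache")))  # ['api', 'web']
```

In this toy graph the database takes three components down with it while the cache takes two. A bulkhead or fallback would show up as a removed edge.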

By the end of this lesson you will:

- Identify SPOFs in an architecture by inspection

- Recognize cascading failure patterns: thundering herd, retry storm, queue of death

- Read a real timeline & separate the trigger from the latent defect that the trigger surfaced

- Write blameless action items that target systems instead of people, covering prevention / detection / recovery

- Reason about bulkheads & circuit breakers as blast-radius bounding tools

Spot the Single Point of Failure

Layered Architecture Inspection

Consider a small web architecture:

- DNS: api.example.com -> single nameserver IP 203.0.113.10 hosted by one DNS provider

- CDN: single CDN vendor in front of api.example.com

- Ingress: two reverse proxy machines behind a load balancer

- Backend: six API replicas in two availability zones (three per zone)

- Database: one primary + one read replica, in the same availability zone

- Cache: Redis cluster, three nodes spread across the same two availability zones

Question: which components are SPOFs? Hint: SPOFs are not always the obvious 'single machine' kind. A cluster of three machines that all sit in one availability zone still goes down together when that zone fails, so the zone itself is the SPOF.

Identify at least three SPOFs in this architecture. For each one, name what fails when it fails, & propose a concrete change that would remove the SPOF (without rewriting the application).

Three Classic Cascade Patterns

Failures Travel Through Dependencies

Pattern 1: Thundering herd. A shared resource (cache, lock, database) fails or restarts. Every client that depended on it retries simultaneously. The flood overwhelms whatever comes back up; the retries pile on faster than the recovery can absorb them; recovery never completes.

Pattern 2: Retry storm. A downstream service slows down. Upstream callers, instead of failing, retry. The retries multiply the original load. The slow service slows further, triggering more retries. Eventually the load exceeds what even a fully healthy instance of the service could handle.

Pattern 3: Queue of death. A processing queue with no backpressure receives messages faster than its consumer can process them. The queue grows without bound. Memory runs out; the consumer crashes, restarts, finds a still-larger queue, & crashes again.

Common thread: a small initial perturbation triggers a positive-feedback loop. The system's own response amplifies the failure rather than damping it.

Damping Mechanisms

Exponential backoff with jitter. Clients that retry wait longer each time, with random offset. Prevents synchronized retry waves.
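A minimal sketch of one common variant, often called "full jitter": the ceiling doubles on each attempt & the actual wait is drawn uniformly between zero & that ceiling. The function name, the `base` & `cap` defaults, & the injectable `rng` are illustrative assumptions, not values from any particular client library.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return one delay (in seconds) per retry attempt, with full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # The random draw spreads clients out so they do not retry in lockstep.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Two clients that failed at the same instant get different schedules.
client_a = backoff_delays(5, rng=random.Random(1))
client_b = backoff_delays(5, rng=random.Random(2))
print(client_a)
print(client_b)
```

The cap matters as much as the jitter: without it, a long outage pushes waits into minutes & clients are slow to notice recovery.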

Circuit breaker. A caller tracks downstream failure rate. Past a threshold, the caller stops calling for a cooldown period & immediately fails its own requests instead. Prevents wasted work, lets the downstream recover.

Bulkhead. Isolate resources per dependency. Connection pool A for database, separate connection pool B for cache. A slow database cannot starve all connections; cache calls continue.

Load shedding. When overloaded, drop requests at the edge instead of accepting them & failing slowly. A 429 in 1 ms is better than a 500 in 30 seconds.

Backpressure. Slow producers when consumers cannot keep up. Queues become bounded; senders block; the original source of work feels the friction.
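A bounded queue is the simplest form of backpressure. In the sketch below, built on the standard library's `queue.Queue`, the producer blocks briefly when the queue is full & is then told to slow down instead of being allowed to grow the backlog. The `offer` helper, the queue size, & the timeout are hypothetical choices.

```python
import queue


def offer(q, item, timeout=0.05):
    """Try to enqueue; return False instead of growing the queue without bound."""
    try:
        # Blocks for up to `timeout` seconds while the queue is full.
        q.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False


work = queue.Queue(maxsize=2)

# No consumer is running, so only the first two items fit.
results = [offer(work, n, timeout=0.01) for n in range(4)]
print(results)  # [True, True, False, False]
```

A `False` return is the friction reaching the source of work: the producer can pause, drop the item, or pass the refusal further upstream.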

Cascading failure: trigger -> amplification -> collapse, with damping mechanisms

Diagnose a Cascade

A team's API tier melts down during a routine database failover. Timeline:

- 14:00:00 — operator promotes standby database. Expected unavailability: ~10 seconds.

- 14:00:08 — primary unavailable. API tier requests start failing with database connection errors.

- 14:00:08 — API tier retries (default config: 5 retries, no backoff, 100ms apart).

- 14:00:11 — standby promoted, accepting new connections.

- 14:00:11 — API tier opens thousands of fresh database connections simultaneously (every replica × every concurrent request × every retry).

- 14:00:13 — new primary's connection pool exhausted; new connections rejected.

- 14:00:13-14:05:00 — API tier replicas exhaust connection pools, throw exceptions, crash, restart, repeat.

- 14:05:00 — operator manually stops API tier traffic; database stabilizes.

- 14:10:00 — gradual traffic restoration completes. Total outage: ~10 minutes (vs expected ~10 seconds).
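The multiplication in the 14:00:11 entry can be made concrete with rough arithmetic. The retry count of 5 comes from the timeline; the replica count & the per-replica concurrency below are invented for illustration, & the result is an upper bound that assumes every request & every retry opens its own connection.

```python
def reconnect_surge(replicas, in_flight_per_replica, retries):
    """Upper bound on connection attempts if each try dials the database."""
    # Each in-flight request makes one original attempt plus its retries.
    return replicas * in_flight_per_replica * (1 + retries)


# Hypothetical fleet: 20 replicas, 50 concurrent requests each, 5 retries.
attempts = reconnect_surge(replicas=20, in_flight_per_replica=50, retries=5)
print(attempts)  # 6000
```

With retries spaced 100 ms apart & no backoff, all of those attempts land inside roughly half a second of each other.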

Identify the cascade pattern at play, name the damping mechanisms that would have prevented it (at least two), & explain why the failover from primary to standby (intended to be a 10-second blip) instead caused a 10-minute outage.

DNS SERVFAIL: Two Compounding Defects

A Real-Shape Postmortem

What follows is a sanitized version of a real incident. Vendor names changed, IPs anonymized; the shape, the timeline, & the lessons are real.

Summary

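Site example.com returned SERVFAIL from all public DNS resolvers for roughly two hours (from the SOA expiry at ~18:00 UTC to resolution at 20:11 UTC, per the timeline below), about five hours after the triggering change. The other 46 zones on the same DNS primary were unaffected. Root cause: two compounding defects.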

1. Vendor A (a secondary DNS provider) added a new internal sync IP that was not in the primary's allow-axfr-ips allowlist.

2. The example.com zone had a years-old RFC-violating CNAME conflict (demo.example.com had both CNAME & MX/TXT records at the same label) that caused Vendor A to reject the zone on fresh AXFR.

Timeline (UTC)

- ~15:00 — Vendor A adds new sync IP 198.51.100.42 to their infrastructure

- 15:02 — first AXFR-out denied for 198.51.100.42 appears in primary DNS logs (no alerting on this signal)

- ~18:00 — SOA expire window reached; Vendor A drops example.com zone from cache

- ~18:30 — SERVFAIL detected externally

- ~19:45 — root cause identified

- 20:00 — 198.51.100.42 added to allow-axfr-ips; primary restarted

- 20:05 — NOTIFY sent; AXFR initiated; zone STILL SERVFAIL (CNAME conflict)

- 20:07 — check-zone reveals 1 error: CNAME conflict on demo.example.com

- 20:09 — CNAME replaced with A record; zone check clean (0 errors)

- 20:10 — NOTIFY sent; AXFR completes; Vendor A begins serving zone

- 20:11 — dig @8.8.8.8 example.com A returns correct IP — RESOLVED

Why Only example.com?

All 47 zones share the same DNS primary. The AXFR IP block affected all zones. But only example.com had the CNAME conflict, & only example.com needed a fresh AXFR at the moment the deny was enforced. Other zones had already refreshed before the deny or did not yet need to.

Latent defect

The CNAME conflict at demo.example.com had existed for years. It worked because the primary served the zone from its database (lenient about RFC violations) & Vendor A was serving from stale cached data from before the violation was introduced. When Vendor A dropped its cache & needed fresh data, the violation surfaced.
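A latent defect of this kind can be caught mechanically. The sketch below scans (owner, type) pairs for names that carry a CNAME alongside other data, the condition that made Vendor A reject the zone. The record list is hypothetical, & the exemption for RRSIG & NSEC reflects the usual DNSSEC allowance as I understand it, so treat that set as an assumption to verify against your own tooling.

```python
from collections import defaultdict

# Types assumed to be allowed beside a CNAME in a signed zone.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Return {owner: other types} for names that mix CNAME with other data."""
    types_by_owner = defaultdict(set)
    for owner, rtype in records:
        # Normalize case & the trailing dot so equivalent names compare equal.
        types_by_owner[owner.lower().rstrip(".")].add(rtype.upper())
    return {
        owner: sorted(types - ALLOWED_WITH_CNAME)
        for owner, types in types_by_owner.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    }


# Hypothetical records mirroring the shape of the incident.
ZONE = [
    ("example.com.", "SOA"),
    ("example.com.", "A"),
    ("demo.example.com.", "CNAME"),
    ("demo.example.com.", "MX"),
    ("demo.example.com.", "TXT"),
    ("www.example.com.", "CNAME"),
]

print(cname_conflicts(ZONE))  # {'demo.example.com': ['MX', 'TXT']}
```

Run on every zone change, a check like this turns a years-old hidden violation into an immediate, visible failure.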

Trigger

Vendor A silently added a new sync IP. The primary's allowlist did not include it. AXFR denied. Three hours later (SOA expire), Vendor A dropped the zone. The latent defect surfaced when the system tried to recover.

Write Blameless Action Items

Blameless: Target Systems, Not People

A blameless action item names something the system should do differently, not something a person should do differently. 'Train the operator' is blameful. 'Add an automated check that catches this before deploy' is blameless.

Good blameless action items cluster into three dimensions:

- Prevention: make the bad thing harder or impossible

- Detection: notice it sooner if it happens

- Recovery: limit the damage when it happens

Each item should name (1) the specific system change, (2) an owner team, & (3) the dimension it serves.
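The three required fields map naturally onto a small record type, which also makes it easy to check that a set of items covers every dimension. This is a sketch only; the class names, the helper, & the sample item (drawn from the failover cascade earlier in the lesson, with an invented team name) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    PREVENTION = "prevention"
    DETECTION = "detection"
    RECOVERY = "recovery"


@dataclass(frozen=True)
class ActionItem:
    system_change: str   # what the system should do differently
    owner_team: str      # a team, never an individual
    dimension: Dimension


def missing_dimensions(items):
    """Return the dimensions that no action item covers yet."""
    return set(Dimension) - {item.dimension for item in items}


# Hypothetical item for the database failover cascade.
items = [
    ActionItem(
        system_change="API client retries use exponential backoff with jitter",
        owner_team="API platform team",
        dimension=Dimension.PREVENTION,
    ),
]

print(sorted(d.value for d in missing_dimensions(items)))  # ['detection', 'recovery']
```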

Write three blameless action items addressing the DNS-SERVFAIL postmortem above. Distribute them across prevention / detection / recovery (one per dimension). Each item must name a specific system change & an owning team. Do NOT target any human as the cause.

Compartments That Sink Without the Ship

Borrowed from Naval Engineering

Ships carry watertight bulkheads: vertical walls that divide the hull into compartments. One compartment can flood without sinking the ship; another can fail without affecting the rest.

Distributed systems borrow the same word & the same idea.

Bulkhead pattern: isolate resources per dependency. A service that calls three downstream APIs uses three separate connection pools, three separate thread pools, three separate retry budgets. A slow or failing downstream cannot consume the resources allocated for the other two.

Without bulkheads: one slow dependency exhausts the shared thread pool; calls to other dependencies block waiting for threads; the entire service becomes unresponsive.

With bulkheads: one slow dependency exhausts its own pool; calls to it fail fast; calls to other dependencies continue normally; the blast radius stays bounded to the failing dependency.
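A minimal sketch of the idea, using one semaphore per dependency as the compartment wall. The `Bulkhead` & `BulkheadFull` names & the slot counts are hypothetical; a production version would usually pair this with per-dependency timeouts & separate connection pools.

```python
import threading


class BulkheadFull(RuntimeError):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    """A fixed number of concurrent call slots reserved for one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing when this compartment is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


database = Bulkhead("database", max_concurrent=1)
cache = Bulkhead("cache", max_concurrent=1)


def while_database_is_busy():
    # Runs while the only database slot is already held.
    try:
        database.call(lambda: "second database call")
        database_outcome = "accepted"
    except BulkheadFull:
        database_outcome = "rejected"
    return database_outcome, cache.call(lambda: "cache ok")


print(database.call(while_database_is_busy))  # ('rejected', 'cache ok')
```

The second database call is turned away immediately, yet the cache call proceeds: saturation in one compartment does not leak into the other.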

Circuit Breakers

Circuit breaker pattern: a stateful wrapper around a downstream dependency that tracks failure rate. Three states:

- Closed (normal): calls pass through. Failures counted.

- Open (tripped): past a failure threshold (say, 50% failures in the last 30 seconds), the breaker opens. Calls fail immediately without trying the dependency. Saves the caller from wasting work; saves the dependency from receiving load while it is unhealthy.

- Half-open (testing): after a cooldown period, the breaker lets a small fraction of calls through. If they succeed, it closes back to normal. If they fail, it re-opens for another cooldown.

The key insight: the circuit breaker prevents wasted effort during known-unhealthy periods, & gives the downstream a chance to recover without continued load.
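The three states can be sketched in a few dozen lines. This version simplifies in two labeled ways: it trips on a count of consecutive failures instead of a failure rate over a time window, & in the half-open state it lets every call through as a trial instead of a small fraction. It is also not thread-safe. All names & defaults are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock          # injectable so the cooldown can be tested
        self._failures = 0
        self._opened_at = None       # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial, or too many failures while closed, opens the breaker.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker & resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each downstream request in `breaker.call(...)` & treats `CircuitOpenError` as the signal to serve a fallback instead of waiting on a dependency that is known to be unhealthy.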

Bulkheads bound the blast radius. Circuit breakers prevent the blast from sustaining itself.

Bound the Blast Radius

Your API service calls four downstream services: User Service, Recommendation Service, Notification Service, & a third-party Payment API. The team has heard 'the Recommendation Service has been a little flaky' & wants to make sure that when it fails, the rest of the system stays healthy.

Today the service uses a single shared thread pool of 200 threads & a single shared HTTP connection pool. All four downstreams compete for these resources. There are no circuit breakers.

Propose a bulkhead + circuit-breaker design for this API service. Be specific: how do you partition the thread / connection pools across the four dependencies, what circuit-breaker thresholds make sense for the flaky Recommendation Service, & what should the user-facing API return while the Recommendation Service's breaker is open?

Design a Failure-Mode Review

Synthesis

You have learned to spot SPOFs by inspection, recognize cascading failure patterns, separate trigger from latent defect when reading a postmortem, write blameless action items across prevention / detection / recovery, & bound blast radius with bulkheads + circuit breakers + graceful degradation.

Apply all five.

Your team is launching a new service search.example.com that depends on three downstream services: a primary search index (index.example.com), an analytics service (analytics.example.com), & a recommendations service (recs.example.com). The team wants you to lead a 'failure-mode review' before launch.

Outline the failure-mode review you would lead. Include: how you would surface SPOFs (one technique), how you would prevent cascading failure between the search service & its three downstreams (two patterns), one concrete action item for the recommendations service (which the team flagged as least reliable), & what monitoring you would require to be in place at launch.

Where This Course Goes Next

You can now spot a SPOF, recognize a cascade, read a postmortem productively, write blameless action items, & bound blast radius by design.

The final lesson in this course (cs_distsys_observability_and_capacity) teaches what to measure so that you find out about a problem before users do. It covers health checks, version endpoints, the four golden signals at a proxy tier, & how surge-capacity decisions tie back to observed data.

Companion lesson: geometry_of_failure_modes_and_blast_radius derives betweenness centrality (which graph node is the bottleneck) & min-cut (the bound on blast radius).

Well done. Onward.