un — Geometry of Failure Modes & Blast Radius

un

guest

1 / ?

back to lessons

The Bottleneck Node Identified Before Traffic Arrives

Betweenness Centrality

For every pair of nodes in a graph, there is a shortest path between them. Betweenness centrality of a node N = the fraction of all shortest paths that pass through N.

A node with high betweenness is on the path between many other pairs. If it slows down, many flows slow down. If it fails, many flows break.

Architectural reading: high-betweenness nodes are the ones every architecture review should pay extra attention to. They are bottlenecks, SPOFs, & capacity-critical components in one. They tend to be:

- The DNS provider (between every client & every service)

- The ingress proxy (between every client & every backend)

- The database primary (between every backend & every read)

- The authentication service (between every user & every authorized action)

Detection without measurement: graph topology alone identifies high-betweenness nodes. You don't need traffic data; you need the architecture diagram. A node that sits between many pairs of others is structurally critical.

Operational consequence: high-betweenness nodes deserve disproportionate investment in (1) capacity headroom, (2) redundancy, (3) observability, & (4) incident response playbooks.

Betweenness centrality: highlighted node sits on most shortest paths

A system has: 100 external clients -> 1 DNS -> 1 CDN vendor -> 3 reverse proxies -> 12 backend replicas -> {1 DB primary, 2 cache nodes, 5 external API endpoints}. Rank these node classes by betweenness centrality (highest first), & explain why the top two ranks deserve special attention.

The Smallest Cut Disconnects the Smallest Slice

Min-Cut Theorem in Plain Terms

The min-cut between two nodes in a graph = the smallest number of edges (or nodes) you must remove to disconnect them.

Operational reading: min-cut bounds the worst-case blast radius. If the min-cut between 'clients' & 'database' is 1 edge (a single proxy), then losing that edge disconnects all clients from the database. If the min-cut is 5, you need to lose 5 components simultaneously to fully disconnect; bad luck, but bounded.

Designing for blast radius: increase the min-cut at every important boundary. Multiple proxies; multiple cache nodes; multiple network paths between DCs. Each addition raises the min-cut by 1.

The bulkhead pattern in graph terms: partition resources into separate sub-graphs that share no min-cut with each other. A failure within one sub-graph cannot propagate to the others because the edges don't exist.

Diameter Sets Failure Propagation Distance

Graph diameter = the longest shortest path between any two nodes.

Failure propagation: when a node fails & retries flow back, they touch upstream nodes up to the diameter distance away. A diameter-3 system (client -> proxy -> backend -> DB) means a DB failure affects 3 upstream layers in a retry storm.

Implication: shorter diameter = faster failure containment but also more node concentration. Each design has its tradeoff.

Min-cut as bound on blast radius; diameter as propagation distance

Compute Min-Cut for a Real Architecture

An architecture: 1 DNS, 1 CDN, 3 reverse proxies, 12 backend replicas, 1 DB primary.

Compute (or estimate) the min-cut at three boundaries: (1) between external clients & the reverse-proxy tier; (2) between the reverse-proxy tier & the backend tier; (3) between the backend tier & the DB primary. For each, name what fails when that min-cut is exceeded.

Failure-Mode Audit via Graph Metrics

Synthesis

You can now identify high-betweenness nodes, compute min-cut at every boundary, & estimate failure-propagation distance via diameter.

Apply all three.

A system: 50 customer endpoints -> 1 DNS -> 2 CDN POPs -> 4 reverse proxies -> 16 backend replicas -> { DB cluster (1 primary + 2 standbys), Redis cluster (5 nodes), 3 external APIs }.

Audit the system: (1) name the highest-betweenness node, (2) compute min-cut at the most concerning boundary, & (3) propose two specific architectural changes (each raising a min-cut, each named with the boundary it strengthens).

Companion Notes

This geometry-of lesson recasts the Failure Modes & Blast Radius main lesson through graph metrics (betweenness, min-cut, diameter).

The final companion, geometry_of_observability_and_capacity, treats Voronoi cells for CDN PoP catchments, the latency triangle's speed-of-light floor, & the queueing curve revisited at the proxy tier.

Well done.