un — Observability & Capacity

Welcome

A web-scale fleet contains many machines. At any moment, some are healthy, some are starting up, some are draining, some are quietly broken. The fleet survives this because every machine answers two simple questions on demand:

- /health — am I currently able to serve real requests?

- /version — what code am I running?

Plus a metrics endpoint (commonly /metrics) that exposes counters & gauges for monitoring tools to scrape.

This lesson teaches how to design these endpoints so they actually reflect reality, what the four golden signals mean at a proxy tier, & how observed data drives capacity decisions.

By the end you will:

- Design a /health endpoint that detects real path failure, not just process liveness

- Design a /version endpoint that lets you verify a deploy landed

- Apply the four golden signals (latency, traffic, errors, saturation) at a proxy tier

- Tie observed surge metrics back to capacity decisions: when to scale up, when to drain, when to page

- Reason about SLOs & error budget burn rate as the operational discipline behind 'how much do we care?'

The Two Kinds of Health Check

Liveness vs Readiness

Liveness: is the process alive at all? Used by orchestrators (Kubernetes, systemd) to decide whether to restart the process.

Readiness: is the process ready to handle real traffic right now? Used by load balancers to decide whether to send requests.

These are different questions. A process that cannot reach its database is alive but not ready. A process that is still starting up is alive but not yet ready.

Shallow vs Deep Health Checks

Shallow: returns {"status": "ok"} if the HTTP handler runs. Trivial. Detects process-down only.

Deep: actually exercises the real request path. Checks that the database connection pool can return a connection, the cache is reachable, & downstream dependencies respond. Detects functional outages that shallow checks miss.

The tradeoff: deep checks cost more (each one is essentially a synthetic request) & can cause cascading failure. If every replica's health check hammers the database, a slow database makes all replicas look unhealthy, which removes them all from rotation & leaves no capacity.

Best practice: a shallow check for liveness (fast, cheap, no external dependencies) & a deeper check for readiness (cached results, throttled to avoid hammering downstreams).
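A minimal sketch of the "cached results, throttled" readiness check described above. The class name `CachedReadiness`, the probe callables, & the 5-second TTL are all illustrative assumptions, not part of any particular framework. Each probe stands in for a real dependency check, such as borrowing a connection from the database pool.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedReadiness:
    """Runs deep dependency probes at most once per `ttl_seconds`.

    Hypothetical helper: `probes` maps a dependency name to a callable
    that returns True when the dependency is usable.
    """

    def __init__(
        self,
        probes: Dict[str, Callable[[], bool]],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probes = probes
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[Tuple[float, dict]] = None

    def check(self) -> Tuple[int, dict]:
        """Return (http_status, body) for a readiness endpoint."""
        now = self._clock()
        if self._cached is not None and now - self._cached[0] < self._ttl:
            # Serve the cached verdict so frequent polling by the load
            # balancer does not turn into load on the dependencies.
            body = self._cached[1]
        else:
            results = {}
            for name, probe in self._probes.items():
                try:
                    results[name] = bool(probe())
                except Exception:
                    # A probe that raises counts as a failed dependency.
                    results[name] = False
            status = "ok" if all(results.values()) else "unavailable"
            body = {"status": status, "checks": results}
            self._cached = (now, body)
        return (200 if body["status"] == "ok" else 503), body
```

A readiness handler would call `check()` & return the status code & JSON body as-is. The liveness handler stays separate & touches no dependencies. The TTL is the knob to tune: longer means less probe load but slower detection of an outage.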

Version Endpoints

/version returns the git commit, build time, & service name. After a deploy, you curl https://service.example.com/version & confirm the returned commit matches what you pushed. If it doesn't, the deploy did not land on the replica that answered, even if the pipeline reported success.

Without /version, a stale deploy can look successful & hide for hours.

Minimum response shape: {"service": "my-api", "git_commit": "abc1234", "build_time": "2026-05-19T10:00:00Z"}.
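As a sketch of the deploy check described above, the snippet below builds the minimum response & compares the reported commit with the one you pushed. The function names `version_payload` & `deploy_landed` are hypothetical, & matching on a hash prefix is an assumption made so that short & full commit hashes compare equal.

```python
import json


def version_payload(service: str, git_commit: str, build_time: str) -> str:
    """Serialize the minimum /version response body."""
    return json.dumps(
        {"service": service, "git_commit": git_commit, "build_time": build_time}
    )


def deploy_landed(response_body: str, expected_commit: str) -> bool:
    """True when the commit reported by /version matches the pushed commit.

    Short & full hashes are treated as equal when one is a prefix of
    the other (an assumption; use exact comparison if you always
    publish full hashes).
    """
    reported = json.loads(response_body).get("git_commit", "")
    if not reported or not expected_commit:
        return False
    return reported.startswith(expected_commit) or expected_commit.startswith(reported)


body = version_payload("my-api", "abc1234", "2026-05-19T10:00:00Z")
print(deploy_landed(body, "abc1234"))  # True
print(deploy_landed(body, "fff0000"))  # False
```

Behind a load balancer, one matching response only proves that one replica is updated, so a real check would repeat the request or query each replica directly.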

A team's load balancer is configured to drop a replica from rotation after 3 consecutive failed health checks. Their current `/health` returns `{"status": "ok"}` immediately. The team is surprised when, during an incident, every replica still showed healthy even though no replica could reach the database. Design a better readiness check that would have caught the database outage, & explain one specific risk your new design introduces.

Latency, Traffic, Errors, Saturation

Four Numbers Cover Most of Operations

The four golden signals come from the Google SRE book. Measure them on every service tier. If you instrument these four well, you catch most production problems before users do.

Latency: how long does a request take? Report distributions, not just averages. The p99 (99th percentile latency) matters more than the mean, because tail latency is what users feel as 'slow'. A service with 50 ms mean & 5,000 ms p99 has a real problem: most requests never show it, but the slowest 1% wait 5 seconds or more.

Traffic: how many requests per second? Total requests, per-endpoint, per-status-code, per-region. Know the baseline & alert on anomalies (sudden drop = ingress problem; sudden spike = surge or attack).

Errors: rate of failed requests. Distinguish 4xx (client errors, usually not your fault) from 5xx (server errors, your fault). Track error rate as a percentage of traffic, not as absolute counts, so alerts work across load levels.

Saturation: how full is the system? CPU utilization, memory, connection pool depth, queue length. The leading indicator. Saturation rises before latency or errors degrade. A tier at 90% saturation is one bad minute away from queue collapse.
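To make the latency point concrete, here is a small sketch comparing the mean with p50 & p99 on made-up samples. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions, & the sample data is invented for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 98% of requests take 40 ms, 2% take 5,000 ms.
latencies_ms = [40] * 980 + [5000] * 20

print(mean(latencies_ms))            # 139.2
print(percentile(latencies_ms, 50))  # 40
print(percentile(latencies_ms, 99))  # 5000
```

The mean of 139.2 ms looks tolerable & the median looks excellent. Only the p99 shows that some requests take five seconds.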

At a Proxy Tier Specifically

Each signal lights up at the edge layer:

- Latency at the proxy: TLS handshake duration, upstream connect time, request-response total. Separately measured because they live on different parts of the path.

- Traffic at the proxy: total requests/sec, per-backend distribution (a hot backend signals load-balancer skew), per-status-code breakdown.

- Errors at the proxy: 4xx from clients (your users hitting bad endpoints), 5xx from backends (your services failing), proxy-internal errors (502 = backend unreachable, 504 = backend timeout).

- Saturation at the proxy: TLS session count, upstream connection pool depth, CPU at the proxy itself (TLS termination is CPU-heavy).

Pro tip: a sudden rise in 502s with low backend latency means the backend is refusing connections or hanging up before it sends a valid response (connection reset, crash, OOM). A rise in 504s means the backend accepted the request but did not answer within the proxy's timeout: it is slow or stuck, not gone. Read the error code; it tells you where the failure lives.
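The per-status-code breakdown above can be sketched as a function that turns a stream of status codes into percentages of traffic, with 502 & 504 split out. The name `error_breakdown` & the sample counts are illustrative assumptions. A real proxy would export these as counters for the monitoring system to aggregate.

```python
from collections import Counter
from typing import Dict, Iterable


def error_breakdown(status_codes: Iterable[int]) -> Dict[str, float]:
    """Return error classes as a percentage of all requests."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0, "502": 0.0, "504": 0.0}

    def pct(n: int) -> float:
        return 100.0 * n / total

    client = sum(n for code, n in counts.items() if 400 <= code < 500)
    server = sum(n for code, n in counts.items() if 500 <= code < 600)
    return {
        "4xx": pct(client),
        "5xx": pct(server),
        # 502 & 504 are also counted inside the 5xx total; they are
        # split out because they point at different failure locations.
        "502": pct(counts[502]),
        "504": pct(counts[504]),
    }


# Invented sample: 1,000 requests, mostly successful.
sample = [200] * 985 + [404] * 3 + [504] * 10 + [502] * 2
print(error_breakdown(sample))
```

Expressing each class as a share of traffic keeps the same alert threshold meaningful at 100 req/s & at 10,000 req/s.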

Figure: four golden signals on a single dashboard (latency, traffic, errors, saturation).

Read the Signals

Your dashboard shows the following over the last 10 minutes:

- Traffic: roughly flat at 800 req/s (no surge)

- Latency: p50 stable at 40ms, p99 climbed from 200ms to 2,500ms over 5 minutes & is still rising

- Errors: 4xx rate stable at 0.3% (normal background); 5xx rate climbed from 0.1% to 1.2% (mostly 504 Gateway Timeout)

- Saturation: backend CPU climbed from 45% to 78% over the same 5 minutes; proxy CPU stable at 30%

Diagnose what's happening. What is the most likely failure mode, what one or two follow-up measurements would confirm or refute your hypothesis, & what action would you take in the next 5 minutes if the trend continues?

When to Scale, When to Drain, When to Page

Capacity Decisions Need Triggers

Observing metrics is easy. Knowing when to act on them is the discipline.

Scale up when: saturation crosses a sustained threshold (e.g., backend CPU >70% for 5 minutes), or queue depth grows beyond a target, or latency p99 exceeds the SLO. The trigger should fire before things break, not at the moment they break.

Drain a replica when: it is consistently slow or error-prone while peers are healthy (one replica running hot is often a host-level problem, not an application problem), or when rolling out a new version, or when retiring a replica gracefully.

Page a human when: an SLO is being burned faster than the error budget can sustain, or a saturation trigger fires without autoscaling absorbing it, or a cascade pattern shows up (error rate + retry rate both climbing).

Don't page when: a single bad minute resolves on its own, or background batch jobs cause expected periodic blips, or noise crosses the threshold (the threshold is wrong, not the system).
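A "sustained threshold" is what separates a scale-up trigger from a single bad minute. The sketch below assumes one CPU reading per minute & a hypothetical helper named `sustained_above`. Real autoscalers implement their own version of this logic.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, min_samples: int) -> bool:
    """True when the most recent `min_samples` readings all exceed `threshold`."""
    if min_samples <= 0 or len(samples) < min_samples:
        return False
    return all(value > threshold for value in samples[-min_samples:])


# One reading per minute (assumed), threshold 70% CPU for 5 minutes.
print(sustained_above([65, 72, 74, 71, 73, 76], 70, 5))  # True: scale up
print(sustained_above([40, 41, 95, 42, 40], 70, 5))      # False: one bad minute
print(sustained_above([72, 74, 69, 73, 76], 70, 5))      # False: dipped below
```

Requiring every recent sample to exceed the threshold is deliberately strict. A looser rule, such as "4 of the last 5", reacts faster but lets more noise through.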

SLOs & Error Budget Burn

An SLO (service level objective) defines acceptable performance: 'success rate >= 99.9% over a 28-day window'. The complement (0.1%) is the error budget.

Burn rate: how fast you are consuming the error budget, relative to the pace that would spend exactly 100% of it over the full window. If you burn 10% of the budget in 1 hour, the burn rate is 67.2x the sustainable pace (1 hour is 1/672 of a 28-day window; burning 10% per hour projects to 10% × 672 = 6720% of the budget over the full window, when only 100% is allowed).
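The burn-rate arithmetic can be checked with a short sketch. Both function names are hypothetical. The first works from the share of budget already spent, the second from an observed error ratio, & both assume errors arrive at a steady pace.

```python
def burn_rate(budget_fraction_spent: float, elapsed_hours: float, window_days: float = 28) -> float:
    """Multiple of the sustainable pace at which the budget is being spent.

    A result of 1.0 means the budget lasts exactly the full window.
    """
    window_hours = window_days * 24
    return budget_fraction_spent * window_hours / elapsed_hours


def burn_rate_from_errors(error_ratio: float, slo_target: float) -> float:
    """Burn rate from an observed error ratio, e.g. 0.002 against a 0.999 SLO."""
    return error_ratio / (1 - slo_target)


# 10% of the budget gone in 1 hour of a 28-day (672-hour) window.
print(round(burn_rate(0.10, 1), 1))                  # 67.2
# 0.2% of requests failing against a 99.9% SLO (0.1% budget).
print(round(burn_rate_from_errors(0.002, 0.999), 1)) # 2.0
```

At a burn rate of 67.2 the whole budget is gone in 672 / 67.2 = 10 hours, which matches the intuition that 10% per hour cannot last.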

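Multi-window burn-rate alerts: page only when the burn rate exceeds a threshold over both a long window & a short window. The pairs recommended in the Google SRE Workbook are 14.4x over 1 hour confirmed by the last 5 minutes, & 6x over 6 hours confirmed by the last 30 minutes. The long window filters out brief blips; the short window confirms the problem is still happening, so the alert clears soon after recovery. Together the pairs catch both fast outages & slow degradations.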
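A sketch of a multi-window paging rule, assuming burn rates per window have already been computed elsewhere. The `BurnRule` type & the rule table are illustrative. The thresholds & window pairs follow the Google SRE Workbook's suggested starting values & should be tuned per service.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening
    threshold: float   # burn-rate multiple that both windows must reach


# Assumed starting values; not a universal standard.
RULES: Sequence[BurnRule] = (
    BurnRule("1h", "5m", 14.4),
    BurnRule("6h", "30m", 6.0),
)


def firing_rules(burn_by_window: Dict[str, float], rules: Sequence[BurnRule] = RULES) -> List[BurnRule]:
    """Return the rules whose long AND short windows both meet the threshold."""
    return [
        rule
        for rule in rules
        if burn_by_window.get(rule.long_window, 0.0) >= rule.threshold
        and burn_by_window.get(rule.short_window, 0.0) >= rule.threshold
    ]


# Fast outage still in progress: the 1h/5m rule fires.
print(firing_rules({"5m": 20.0, "1h": 16.0, "30m": 18.0, "6h": 4.0}))
# Outage already over: the 1h window is still high, but 5m has recovered.
print(firing_rules({"5m": 0.5, "1h": 15.0, "30m": 3.0, "6h": 4.0}))  # []
```

Any non-empty result is a page. The second call shows why the short window matters: without it, the alert would keep firing for most of an hour after the problem was fixed.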

Why this matters for capacity: a service measured at 99.99% against a 99.9% SLO has spent only a tenth of its error budget & can absorb minor blips. A service at 99.93% (just barely meeting SLO) has spent 70% of the budget & is one bad day from violation. Capacity decisions should target a comfortable SLO margin, not the minimum that meets it.

A Capacity Decision Under Observation

Your service has an SLO of 99.9% successful requests over 28 days. Current state from monitoring over the last hour:

- Success rate: 99.5% (sustained for 30 minutes)

- Backend CPU: averaging 82% across the fleet (target 70%)

- p99 latency: 800 ms (SLO target: <500 ms)

- Traffic: 1,400 req/s, up from baseline 1,000 req/s (40% above normal; trend still rising)

- Autoscaling: configured to add replicas when CPU > 80% sustained 5 min; currently in the middle of a scale-up that will add 3 replicas in ~90 seconds

Decide three things: (1) is this an incident that warrants paging a human now, (2) should you take any immediate action beyond letting autoscaling complete, & (3) how would your decision differ if the traffic trend had been flat instead of rising? Justify each.

Design a Launch Observability Plan

Synthesis

You can now design a /health that catches real failures, a /version that lets you verify deploys, four-golden-signal dashboards at a proxy tier, & capacity triggers tied to SLO burn rate.

Apply all four.

Your team is launching search.example.com (the search service from the failure-modes lesson). The team wants to ship observability that catches problems before users do, with a clear page-or-not decision matrix. SLO: 99.9% successful requests, p99 latency < 300 ms, over a 28-day window.

Design the launch observability plan. Address: (1) what `/health` & `/version` return for each backend replica & for each proxy, (2) which four-golden-signal dashboards you would require at the proxy & at the backend tiers, (3) at what threshold(s) autoscaling triggers a scale-up, & (4) at what threshold(s) a human gets paged (use SLO burn rate where it applies).

Closing the Course

You have completed all five lessons:

- Proxies & Origins: the edge-layer shape that almost every public web service uses

- Stateless Horizontal Scaling: why a stateless tier multiplies cheaply & how to size it

- Ingress & Egress Separation: why one box becomes two, & the failure mode that forces it

- Failure Modes & Blast Radius: SPOFs, cascades, postmortems, blameless action items

- Observability & Capacity (this one): what to measure so problems surface before users do

The throughline: a web-scale distributed system is not magic. It is a small set of patterns (reverse proxy, stateless replicas, ingress/egress split, bulkheads & circuit breakers, four golden signals) composed thoughtfully. Once you recognize the patterns, you see them in every production architecture.

Companion lessons: five geometry-of-* lessons recast the same material as graph theory & geometry. They go well in either order.

Well done.