Welcome
A web-scale fleet contains many machines. At any moment, some are healthy, some are starting up, some are draining, some are quietly broken. The fleet survives this because every machine answers two simple questions on demand:
- /health — am I currently able to serve real requests?
- /version — what code am I running?
Each machine also exposes a metrics endpoint (commonly /metrics) with counters & gauges for monitoring tools to scrape.
This lesson teaches how to design these endpoints so they actually reflect reality, what the four golden signals mean at a proxy tier, & how observed data drives capacity decisions.
By the end you will:
- Design a /health endpoint that detects real path failure, not just process liveness
- Design a /version endpoint that lets you verify a deploy landed
- Apply the four golden signals (latency, traffic, errors, saturation) at a proxy tier
- Tie observed surge metrics back to capacity decisions: when to scale up, when to drain, when to page
- Reason about SLOs & error budget burn rate as the operational discipline behind 'how much do we care?'
The Two Kinds of Health Check
Liveness vs Readiness
Liveness: is the process alive at all? Used by orchestrators (Kubernetes, systemd) to decide whether to restart the process.
Readiness: is the process ready to handle real traffic right now? Used by load balancers to decide whether to send requests.
These are different questions. A process that is running but cannot reach its database is alive but not ready. A process that is still starting up is alive but not yet ready.
Shallow vs Deep Health Checks
Shallow: returns {"status": "ok"} if the HTTP handler runs. Trivial. Detects process-down only.
Deep: actually exercises the real request path. Checks that the database connection pool can return a connection, the cache is reachable, & downstream dependencies respond. Detects functional outages that shallow checks miss.
The tradeoff: deep checks cost more (each one is essentially a synthetic request) & can cause cascading failure (if every replica's health check hammers the database, a slow database makes all replicas unhealthy, which removes them from rotation, which removes all capacity).
Best practice: a shallow check for liveness (fast, cheap, no external dependencies) & a deeper check for readiness (cached results, throttled to avoid hammering downstreams).
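The split described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the names liveness and CachedReadiness, the 5-second TTL, and the probe callables are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional


def liveness() -> Dict[str, str]:
    """Shallow check: no external dependencies, so it only proves the process runs."""
    return {"status": "ok"}


class CachedReadiness:
    """Deep check whose verdict is cached so frequent probes do not hammer downstreams."""

    def __init__(
        self,
        checks: Dict[str, Callable[[], bool]],
        ttl_seconds: float = 5.0,  # assumed value; tune to the probe interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._checks = checks
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[Dict[str, object]] = None
        self._cached_at = 0.0

    def check(self) -> Dict[str, object]:
        now = self._clock()
        if self._cached is not None and now - self._cached_at < self._ttl:
            return self._cached  # inside the TTL: reuse the last verdict
        results: Dict[str, bool] = {}
        for name, probe in self._checks.items():
            try:
                results[name] = bool(probe())
            except Exception:
                results[name] = False  # a probe that raises counts as a failure
        ready = all(results.values())
        self._cached = {"status": "ready" if ready else "not_ready", "checks": results}
        self._cached_at = now
        return self._cached
```

A readiness handler would call check() & return HTTP 200 for "ready" or 503 otherwise. Each probe (for example, borrowing & returning one database connection) is supplied by the service; none is defined here.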
Version Endpoints
/version returns the git commit, build time, & service name. After a deploy, you curl https://service.example.com/version & confirm the returned commit matches what you pushed. If it doesn't, the deploy failed silently.
Without /version, a stale deploy can look successful & hide for hours.
Minimum response shape: {"service": "my-api", "git_commit": "abc1234", "build_time": "2026-05-19T10:00:00Z"}.
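A small sketch of both halves: building the response & checking it after a deploy. The environment variable names (SERVICE_NAME, GIT_COMMIT, BUILD_TIME) & the function names are assumptions for illustration; a real build pipeline may inject these values differently.

```python
import json
from typing import Dict


def version_payload(env: Dict[str, str]) -> Dict[str, str]:
    """Build the /version body from values assumed to be injected at build time."""
    return {
        "service": env.get("SERVICE_NAME", "unknown"),
        "git_commit": env.get("GIT_COMMIT", "unknown"),
        "build_time": env.get("BUILD_TIME", "unknown"),
    }


def deploy_landed(response_body: str, expected_commit: str) -> bool:
    """Return True when the commit being served equals the commit that was pushed."""
    served = json.loads(response_body).get("git_commit", "")
    return served == expected_commit
```

In a service, the handler would return json.dumps(version_payload(dict(os.environ))). A deploy script can then fetch /version from each replica & fail the rollout if deploy_landed returns False for any of them.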
Latency, Traffic, Errors, Saturation
Four Numbers Cover Most of Operations
The four golden signals come from the Google SRE book. Measure them on every service tier: if you instrument these four well, you catch most production problems before users do.
Latency: how long does a request take? Report distributions, not just averages. The p99 (99th percentile latency) matters more than the mean, because tail latency is what users feel as 'slow'. A service with a 50 ms median & a 5,000 ms p99 has a real problem that most requests never hit but the slowest 1% absolutely do.
Traffic: how many requests per second? Total requests, per-endpoint, per-status-code, per-region. Know the baseline & alert on anomalies (a sudden drop suggests an ingress problem; a sudden spike suggests a surge or an attack).
Errors: rate of failed requests. Distinguish 4xx (client errors, not your fault) from 5xx (server errors, your fault). Track error rate as a percentage of traffic, not as absolute counts, so alerts work across load levels.
Saturation: how full is the system? CPU utilization, memory, connection pool depth, queue length. The leading indicator. Saturation rises before latency or errors degrade. A tier at 90% saturation is one bad minute away from queue collapse.
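To make the latency, traffic, & error definitions concrete, here is a rough sketch that summarizes one window of request samples. The function names & the nearest-rank percentile method are choices made for this example; monitoring systems usually estimate percentiles from histograms instead.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def summarize(
    latencies_ms: Sequence[float],
    status_codes: Sequence[int],
    window_seconds: float,
) -> Dict[str, float]:
    """Latency distribution, traffic, & 5xx error rate for one observation window."""
    total = len(status_codes)
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "requests_per_second": total / window_seconds,
        # Percentage of traffic, not an absolute count, so it holds across load levels.
        "error_rate_pct": 100.0 * server_errors / total,
    }
```

With 98 requests at 40 ms & 2 at 5,000 ms, the median stays at 40 ms & the mean is only about 139 ms, while the p99 is 5,000 ms. The tail is visible only in the percentile.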
At a Proxy Tier Specifically
Each signal lights up at the edge layer:
- Latency at the proxy: TLS handshake duration, upstream connect time, request-response total. Separately measured because they live on different parts of the path.
- Traffic at the proxy: total requests/sec, per-backend distribution (a hot backend signals load-balancer skew), per-status-code breakdown.
- Errors at the proxy: 4xx from clients (your users hitting bad endpoints), 5xx from backends (your services failing), proxy-internal errors (502 = backend unreachable, 504 = backend timeout).
- Saturation at the proxy: TLS session count, upstream connection pool depth, CPU at the proxy itself (TLS termination is CPU-heavy).
Pro tip: a sudden rise in 502s with low backend latency means the backend is hanging up before responding (connection reset, crash, OOM). A rise in 504s means the backend accepted the request but did not answer within the proxy's timeout: it is slow or stuck, not gone. Read the error code; it tells you where the failure lives.
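One way to put that reading into a dashboard is to bucket proxy responses by where the failure most likely lives. The bucket names below are invented for this sketch, & the mapping is a heuristic, not a guarantee.

```python
from collections import Counter
from typing import Iterable


def failure_location(status: int) -> str:
    """Map an HTTP status seen at the proxy to the likely location of the failure."""
    if status == 502:
        return "backend_connection"  # no valid response: refused, reset, or crashed
    if status == 504:
        return "backend_timeout"  # proxy gave up waiting for the backend
    if 500 <= status <= 599:
        return "server_other"
    if 400 <= status <= 499:
        return "client"
    return "ok"  # 1xx, 2xx, & 3xx are all treated as non-failures here


def breakdown(statuses: Iterable[int]) -> Counter:
    """Count responses per failure bucket for one observation window."""
    return Counter(failure_location(s) for s in statuses)
```

Graphing the backend_connection & backend_timeout buckets separately makes the 502-versus-504 distinction visible at a glance.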
Read the Signals
Your dashboard shows the following over the last 10 minutes:
- Traffic: roughly flat at 800 req/s (no surge)
- Latency: p50 stable at 40ms, p99 climbed from 200ms to 2,500ms over 5 minutes & is still rising
- Errors: 4xx rate stable at 0.3% (normal background); 5xx rate climbed from 0.1% to 1.2% (mostly 504 Gateway Timeout)
- Saturation: backend CPU climbed from 45% to 78% over the same 5 minutes; proxy CPU stable at 30%
When to Scale, When to Drain, When to Page
Capacity Decisions Need Triggers
Observing metrics is easy. Knowing when to act on them is the discipline.
Scale up when: saturation crosses a sustained threshold (e.g., backend CPU >70% for 5 minutes), or queue depth grows beyond a target, or latency p99 exceeds the SLO. The trigger should fire before things break, not at break.
Drain a replica when: it is consistently slow or error-prone while peers are healthy (one replica running hot is often a host-level problem, not an application problem), or when rolling out a new version, or when retiring a replica gracefully.
Page a human when: the error budget is burning faster than the SLO window can sustain, or a saturation trigger fires without autoscaling absorbing it, or a cascade pattern shows up (error rate + retry rate both climbing).
Don't page when: a single bad minute resolves on its own, or background batch jobs cause expected periodic blips, or noise crosses the threshold (the threshold is wrong, not the system).
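The difference between a sustained threshold & a single bad minute can be sketched as follows. The 70% threshold & 5-minute duration come from the scale-up example above; the function names & the can_scale flag are assumptions for illustration.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp_seconds, value), in ascending time order


def sustained_above(samples: Sequence[Sample], threshold: float, duration_seconds: float) -> bool:
    """True when every sample in the trailing duration exceeds the threshold.

    Returns False if the history does not yet cover the full duration.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - duration_seconds
    if samples[0][0] > window_start:
        return False  # not enough history to call it sustained
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)


def decide(cpu_samples: Sequence[Sample], can_scale: bool) -> str:
    """Scale when saturation is sustained; page only if autoscaling cannot absorb it."""
    if not sustained_above(cpu_samples, threshold=70.0, duration_seconds=300.0):
        return "no_action"
    return "scale_up" if can_scale else "page"
```

A lone spike to 95% returns "no_action" because earlier samples in the window are below the threshold, which matches the rule against paging on a single bad minute.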
SLOs & Error Budget Burn
An SLO (service level objective) defines acceptable performance: 'success rate >= 99.9% over a 28-day window'. The complement (0.1%) is the error budget.
Burn rate: how fast you are consuming the error budget, expressed as a multiple of the sustainable pace. If you burn 10% of the budget in 1 hour, the rate is 67.2x the sustainable rate (1 hour is 1/672 of a 28-day window; burning 10% in that hour projects to 10% × 672 = 6,720% over the full window, when only 100% is allowed).
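The arithmetic can be checked with a short sketch. Both function names are made up for this example; the second form computes the same multiple directly from an observed error rate.

```python
def burn_rate(budget_fraction_consumed: float, elapsed_hours: float, window_hours: float) -> float:
    """Multiple of the sustainable pace at which the error budget is being spent.

    1.0 means the budget would run out exactly at the end of the SLO window.
    """
    return budget_fraction_consumed * window_hours / elapsed_hours


def burn_rate_from_errors(error_rate: float, slo_target: float) -> float:
    """Same multiple, computed as observed error rate divided by the budgeted error rate."""
    return error_rate / (1.0 - slo_target)


WINDOW_HOURS = 28 * 24  # 672 hours in a 28-day window

# 10% of the budget gone in 1 hour of a 672-hour window.
print(burn_rate(0.10, 1.0, WINDOW_HOURS))
# A 0.5% error rate against a 99.9% SLO.
print(burn_rate_from_errors(0.005, 0.999))
```

The first call returns about 67.2 & the second about 5.0 (floating-point rounding may show a few trailing digits).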
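Multi-window burn-rate alerts: page only when a long window & a short window both exceed the same burn-rate threshold. The pairs suggested in the Google SRE Workbook (derived for a 30-day window) are 14.4x over both 1 hour & 5 minutes for fast outages, & 6x over both 6 hours & 30 minutes for slower degradations. The long window shows the burn is significant; the short window shows it is still happening.
Why this matters for capacity: a service measuring 99.99% against a 99.9% SLO has spent only a tenth of its error budget & can absorb minor blips. A service at 99.93% (just barely meeting SLO) has spent 70% of its budget & is one bad day from violation. Capacity decisions should target a comfortable SLO margin, not the minimum that meets it.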
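A minimal sketch of the two-window condition follows. WindowCounts, window_burn, & should_page are hypothetical names, & the thresholds are passed in as parameters, not fixed, since the right values depend on the SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed in one lookback window."""
    failed: int
    total: int


def window_burn(counts: WindowCounts, slo_target: float) -> float:
    """Burn rate in a window: observed error rate divided by the budgeted error rate."""
    if counts.total == 0:
        return 0.0  # no traffic means no budget was spent
    return (counts.failed / counts.total) / (1.0 - slo_target)


def should_page(
    short: WindowCounts,
    long: WindowCounts,
    slo_target: float,
    short_threshold: float,
    long_threshold: float,
) -> bool:
    """Page only when both windows burn at or above their thresholds."""
    return (
        window_burn(short, slo_target) >= short_threshold
        and window_burn(long, slo_target) >= long_threshold
    )
```

Requiring both windows keeps a brief spike from paging anyone (the long window stays low) & lets the alert clear soon after recovery (the short window drops first).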
A Capacity Decision Under Observation
Your service has an SLO of 99.9% successful requests over 28 days. Current state from monitoring over the last hour:
- Success rate: 99.5% (sustained for 30 minutes)
- Backend CPU: averaging 82% across the fleet (target 70%)
- p99 latency: 800 ms (SLO target: <500 ms)
- Traffic: 1,400 req/s, up from baseline 1,000 req/s (40% above normal; trend still rising)
- Autoscaling: configured to add replicas when CPU > 80% sustained 5 min; currently in the middle of a scale-up that will add 3 replicas in ~90 seconds
Design a Launch Observability Plan
Synthesis
You can now design a /health that catches real failures, a /version that lets you verify deploys, four-golden-signal dashboards at a proxy tier, & capacity triggers tied to SLO burn rate.
Apply all four.
Your team is launching search.example.com (the search service from the failure-modes lesson). The team wants to ship observability that catches problems before users do, with a clear page-or-not decision matrix. SLO: 99.9% successful requests, p99 latency < 300 ms, over a 28-day window.
Closing the Course
You have completed all five lessons:
- Proxies & Origins: the edge-layer shape that almost every public web service uses
- Stateless Horizontal Scaling: why a stateless tier multiplies cheaply & how to size it
- Ingress & Egress Separation: why one box becomes two, & the failure mode that forces it
- Failure Modes & Blast Radius: SPOFs, cascades, postmortems, blameless action items
- Observability & Capacity (this one): what to measure so problems surface before users do
The throughline: a web-scale distributed system is not magic. It is a small set of patterns (reverse proxy, stateless replicas, ingress/egress split, bulkheads & circuit breakers, four golden signals) composed thoughtfully. Once you recognize the patterns, you see them in every production architecture.
Companion lessons: five geometry-of-* lessons recast the same material as graph theory & geometry. They go well in either order.
Well done.