Two Ways to Carry More Load
Welcome
When a service starts buckling under load, an operator faces a choice. Make the existing machine bigger (more CPU, more RAM, faster disks). Or add more machines that each do the same work.
The first path is called vertical scaling (scale up). The second is called horizontal scaling (scale out).
This lesson teaches why almost every modern web architecture chooses horizontal & what property of the workload makes that choice viable. The answer hides in one word: state.
By the end you will understand:
- The cost curves of vertical vs horizontal scaling & where each makes sense
- What 'stateful' & 'stateless' mean in practice, & why one of them multiplies cheaply
- The math that sizes a replica fleet under expected & surge load
- The headroom rule that keeps a tier from collapsing past the queueing knee
- Where state must live (it never disappears) & how to push it out of the layers that need to scale
Why Horizontal Wins Past a Threshold
Vertical Scaling: One Bigger Box
Pros: simple. No code changes. No coordination. The same process now has more CPU.
Cons: ceiling. The largest commercially available VM has finite RAM & cores. Above that, no money buys more headroom. Costs grow superlinearly past the sweet spot of a vendor's offering. A failure of that one machine takes the entire service down.
Horizontal Scaling: Many Smaller Boxes
Pros: no ceiling (up to your willingness to pay for & coordinate machines). For a workload that supports it, capacity grows roughly linearly & predictably with replica count. A single replica failure removes 1/N of capacity, not 100%.
Cons: requires the workload to support it. Some workloads (a single big database, a stateful game server holding live sessions) resist horizontal scaling. Coordination & load distribution become operational concerns.
The crossover: any production service that needs to survive any single machine failure must run on at least two machines. Once you accept two, you have already chosen horizontal scaling. From there, the question is not 'should we?' but 'how cheaply can we add the next replica?'
The key enabler: a workload that keeps no state on the machine itself between requests. Then any replica can answer any request, & adding a replica adds capacity with no coordination.
Stateful vs Stateless in Practice
State Never Disappears, It Just Moves
Stateful component: holds information whose loss would change behavior. A database holding user accounts. A cache holding session tokens. A worker pinning a long-running streaming connection to a specific user.
Stateless component: holds no information whose loss would matter. A web tier that reads a request, queries a database, & writes a response. Each request stands alone; the tier remembers nothing between requests.
Key insight: state never disappears from a system. It moves to a layer designed to hold it (a database, a Redis cluster, an object store). The layers that face traffic can then go stateless, & stateless layers scale horizontally because any replica can answer any request.
Practical test: if you randomly killed one process in this tier & restarted it, would any user experience a wrong answer or a lost session? If yes, it holds state. If no, it does not.
Examples
- A Python web process that reads requests, queries Postgres, returns JSON: stateless. State lives in Postgres.
- A Python web process that holds user shopping carts in local memory: stateful. Killing the process loses carts.
- A WebSocket server that holds open connections to chat users: stateful in the connection sense. Killing the process drops connections; clients must reconnect. Often these still scale horizontally with care (sticky sessions, consistent hashing).
- A Redis cache fronting Postgres: stateful for the cache contents, but acceptable if cache misses are tolerable. A replica failure means cache miss, not data loss.
Designing for horizontal scaling = pushing state out of the layer that needs to scale.
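The shopping-cart example above can be sketched in a few lines. This is a minimal illustration, not a production design: the class names (StatefulReplica, StatelessReplica, InMemorySharedStore) are hypothetical, & the in-memory store only stands in for an external system such as Redis or Postgres.

```python
from typing import Dict, List, Protocol


class CartStore(Protocol):
    """Anything that can persist carts outside the web process."""

    def get(self, user_id: str) -> List[str]: ...
    def put(self, user_id: str, items: List[str]) -> None: ...


class InMemorySharedStore:
    """Hypothetical stand-in for an external store (Redis, Postgres, ...)."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    def get(self, user_id: str) -> List[str]:
        return list(self._data.get(user_id, []))

    def put(self, user_id: str, items: List[str]) -> None:
        self._data[user_id] = list(items)


class StatefulReplica:
    """Keeps carts in process memory: killing the process loses them."""

    def __init__(self) -> None:
        self._carts: Dict[str, List[str]] = {}

    def add_item(self, user_id: str, item: str) -> List[str]:
        self._carts.setdefault(user_id, []).append(item)
        return list(self._carts[user_id])


class StatelessReplica:
    """Remembers nothing between requests: each call reads and writes the store."""

    def __init__(self, store: CartStore) -> None:
        self._store = store

    def add_item(self, user_id: str, item: str) -> List[str]:
        items = self._store.get(user_id)
        items.append(item)
        self._store.put(user_id, items)
        return items
```

Two StatelessReplica objects built on the same store see the same cart, so a request can land on either one. Two StatefulReplica objects each see only the items they received themselves. A real store would also need atomic updates, since the read-modify-write in this sketch can lose items under concurrent requests.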
Audit a Suspect Tier
A team runs a recommendation API on 6 backend VMs behind a reverse proxy. The application: reads a user ID from the request, fetches the user's recent activity from Postgres, runs a scoring algorithm, returns a list of recommended items. Two non-standard behaviors:
- The application keeps a 'recent user activity' cache in process memory, populated on first request for a user, reused on subsequent requests.
- The application uses sticky sessions: once a user hits VM #3, all their subsequent requests go to VM #3 (the proxy is configured for sticky routing on a cookie).
The Replica Formula
The Simplest Capacity Formula
Once a tier goes stateless, sizing it becomes arithmetic. You need enough replicas to serve requests as fast as they arrive at steady state, with headroom for surge.
The formula:
replicas = ⌈ (peak_load × surge_factor) / per_replica_capacity ⌉ + headroom
Where:
- peak_load: maximum sustained requests/second you expect in normal operation
- surge_factor: a multiplier covering brief bursts above peak (commonly 1.5x to 2x for predictable traffic, 3x or more for viral / unpredictable)
- per_replica_capacity: requests/second one replica handles at acceptable latency & utilization (commonly measured at 70% CPU, not at saturation)
- headroom: extra replicas so a few replica failures do not collapse the tier (commonly 1-2 replicas for small fleets, 10-20% for larger ones)
Worked example: a backend handles 100 req/s at 70% CPU per replica. Peak load is 600 req/s. You expect occasional 2x surges. You want to survive 2 replica failures.
replicas = ⌈ (600 × 2) / 100 ⌉ + 2 = 12 + 2 = 14 replicas
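The same arithmetic as a small function, for checking a sizing by hand. The function name replicas_needed is an assumption for illustration. The parameter names follow the formula's terms.

```python
import math


def replicas_needed(
    peak_load: float,
    surge_factor: float,
    per_replica_capacity: float,
    headroom: int,
) -> int:
    """Replica count per the formula: ceil(peak * surge / capacity) + headroom."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    base = math.ceil((peak_load * surge_factor) / per_replica_capacity)
    return base + headroom


# Worked example from the text: 600 req/s peak, 2x surge,
# 100 req/s per replica, 2 spare replicas.
print(replicas_needed(600, 2, 100, 2))  # 14
```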
The 80% Rule
Per-replica capacity is not the saturation point. Measure capacity at 70-80% CPU, never at 100%.
Past about 80% utilization, queueing delay climbs sharply: a request that completed in 10 ms at 60% utilization can take on the order of 80 ms at 90% utilization. Latency, not throughput, breaks first. (The companion lesson geometry_of_stateless_horizontal_scaling derives this curve mathematically.)
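One way to see the shape of that curve is the textbook M/M/1 single-server queue, where mean response time equals service_time / (1 - utilization). That model is an assumption made here for illustration; the companion lesson may use a different one, & real fleets with bursty traffic will not match it exactly. Treat the output as the shape of the knee, not a prediction.

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue (assumed model), in milliseconds."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


# Hypothetical 4 ms service time; watch how fast latency grows near saturation.
for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"{rho:.0%} utilization: {mm1_response_time(4.0, rho):.1f} ms")
```

Each step toward 100% costs more latency than the one before it, which is why capacity is measured at 70-80% rather than at saturation.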
Autoscaling vs Static Provisioning
Static: provision for peak × surge plus headroom & accept the cost of running at low utilization off-peak.
Autoscaling: a controller adds & removes replicas based on observed utilization, target latency, or queue depth.
Autoscaling caveat: cold-start time matters. If a new replica takes 2 minutes to boot, autoscaling cannot respond to a 30-second surge. Mature autoscaling setups keep a warm pool of pre-provisioned replicas that can take traffic as soon as the scale-up threshold is crossed.
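The caveat reduces to two checks: does a cold replica arrive before the burst ends, & if not, how many warm replicas would cover the gap? A rough sketch, with hypothetical function names & illustrative numbers:

```python
import math


def autoscaling_can_help(burst_seconds: float, cold_start_seconds: float) -> bool:
    """A cold replica only helps if it is ready before the burst is over."""
    return cold_start_seconds < burst_seconds


def warm_replicas_needed(
    burst_load: float, per_replica_capacity: float, running_replicas: int
) -> int:
    """Extra ready-to-serve replicas required to absorb the burst right away."""
    required = math.ceil(burst_load / per_replica_capacity)
    return max(0, required - running_replicas)


# Illustrative numbers: a 30 s burst against a 120 s cold start.
print(autoscaling_can_help(burst_seconds=30, cold_start_seconds=120))  # False
# 1,200 req/s burst, 100 req/s per replica, 8 replicas already serving.
print(warm_replicas_needed(1200, 100, 8))  # 4
```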
Size a Fleet for a New Service
Your team plans to launch a video metadata API. Benchmarks show a single replica handles 250 req/s at 70% CPU & 50 ms p99 latency. Marketing forecasts peak load at 4,000 req/s during prime-time hours. A planned promotional event could surge to 3x peak briefly. You want the service to survive 3 simultaneous replica failures without exceeding 80% utilization on the survivors.
Cold Start, Slow Drain, & Other Real Edges
Real Fleets Have Real Edges
The formula assumes replicas appear instantly, accept traffic instantly, & shed traffic instantly. None of those hold in production.
Cold start: a new replica needs to boot the OS, start the process, load configuration, warm caches, & pass health checks. This takes anywhere from 5 seconds (container restart) to 5 minutes (full VM boot + image pull). Autoscaling cannot respond to bursts shorter than this delay.
Slow drain: a replica being removed from the pool needs time to finish in-flight requests before terminating. Otherwise users see truncated responses. Reverse proxies support drain (stop accepting new requests, finish active ones) but it takes seconds to minutes.
Warm pool: production fleets keep a pool of pre-provisioned but idle replicas ready to take traffic on signal. Trades a small steady cost for fast surge response.
Connection draining vs immediate kill: graceful shutdown matters. A SIGTERM that triggers drain takes longer than SIGKILL but does not break user requests.
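A minimal sketch of drain-on-SIGTERM inside one process, using only the Python standard library. The Drainer class & install_sigterm_drain are hypothetical names. A real server framework usually provides its own shutdown hooks, so this only shows the mechanism: refuse new work, then wait for in-flight work to finish.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin_request(self) -> bool:
        """Return False once draining has started, so the caller can refuse."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def begin_drain(self) -> None:
        # Plain assignment, no lock: safe to call from a signal handler.
        self._draining = True

    def wait_drained(self, timeout: float) -> bool:
        """Block until no request is in flight; False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)


def install_sigterm_drain(drainer: Drainer) -> None:
    """On SIGTERM, stop accepting new requests (call from the main thread)."""
    signal.signal(signal.SIGTERM, lambda signum, frame: drainer.begin_drain())
```

The serving loop would call try_begin_request before handling each request & end_request after it. On shutdown it would call wait_drained with a deadline & then exit. SIGKILL cannot be caught, so it skips all of this.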
Health check window: a replica that just started might pass its first health check before its database connection pool is warm; the proxy then sends real traffic & the first dozen requests are slow. Tune health checks to test the real path, not just process liveness.
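One way to test the real path is a readiness check that runs a probe against each dependency & reports ready only when all of them pass. A sketch under assumptions: the readiness function & the probe names are hypothetical, & each probe is any callable that raises on failure.

```python
from typing import Callable, Dict, List, Tuple

Probe = Callable[[], None]  # a probe signals failure by raising an exception


def readiness(probes: Dict[str, Probe]) -> Tuple[bool, List[str]]:
    """Run every probe; the replica is ready only if none of them raises."""
    failures: List[str] = []
    for name, probe in probes.items():
        try:
            probe()
        except Exception as exc:  # any failure means "not ready yet"
            failures.append(f"{name}: {exc}")
    return (not failures, failures)


def database_probe() -> None:
    # Stand-in for a real query through the connection pool.
    raise ConnectionError("pool not warm")


ok, reasons = readiness({"process": lambda: None, "database": database_probe})
print(ok, reasons)  # False ['database: pool not warm']
```

A liveness-only check corresponds to the "process" probe alone, which would pass here while the database path is still failing.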
Stickiness creep: even nominally stateless tiers acquire stickiness over time (CDN caches, DNS resolver caches, connection pools). Be suspicious of 'identical replicas' that nonetheless behave differently.
Warm Pool or Reactive Autoscaling?
Your video metadata API (the same one from the previous question, sized at 51 replicas for steady peak + surge) experiences a 30-second surge to 5x normal load whenever a new viral video gets uploaded. Autoscaling currently takes 90 seconds to add a new replica from cold (image pull + warmup). During the 90-second gap, latency climbs sharply & some requests time out.
Design a Stateless Tier Under Constraints
Synthesis
You have learned why horizontal scaling wins past a small threshold, what state means in practice, how to size a fleet under expected & surge load, & where horizontal scaling breaks at the edges.
Apply all four.
Design a backend tier for feed.example.com, a social-feed API. Constraints: per-replica capacity 200 req/s at 70% CPU; expected peak load 1500 req/s; surge factor 2.5x (occasional trending stories); survive 2 simultaneous replica failures; cold-start time 60 seconds; bursts can last 45 seconds; budget allows some idle capacity but not 2.5x permanent provisioning.
Where This Course Goes Next
You now have a working mental model of a stateless tier: why it scales, how to size it, what breaks at its edges, & where state must move when you push it out of the layer that needs to grow.
The next lesson in this course (cs_distsys_ingress_egress_separation) tackles a subtler problem: even a perfectly sized stateless tier can fail in surprising ways when incoming & outgoing traffic share the same network path. The classic example involves a proxy trying to connect to itself; the fix involves splitting one tier into two with different responsibilities.
Companion lesson: geometry_of_stateless_horizontal_scaling derives the queueing curve, Little's Law applied to a replica fleet, & the geometric meaning of the 80% utilization knee.
Well done. Onward.