Ingress & Egress Separation


Two Traffic Directions, One Box

Welcome

Most architecture diagrams show traffic going one way: client at top, server at bottom, arrow pointing down. In practice, traffic flows both ways.

Ingress: outside clients reach your services through this path. A reverse proxy at the edge of your network terminates TLS, routes requests, & enforces access policy.

Egress: your services reach outside services through this path, for example when calling a payment processor's API, delivering to a webhook target, or sending a request to a partner. This traffic often passes through a forward proxy or NAT gateway with an allowlist.

Many architectures start with one box handling both. It works, until the day it does not. The failure mode is subtle, surfaces only after enough internal services exist, & teaches an important separation-of-concerns lesson.

By the end of this lesson you will understand:

- Why ingress & egress represent fundamentally different traffic patterns with different scaling axes & different failure modes

- What hairpin NAT is & why a proxy that tries to connect to its own public IP fails

- The architectural fork: one box becomes two, & what each one then owns exclusively

- Security isolation gains: each side can lock down to its real allowed peers

- How to identify when your single-box design has crossed the threshold where the split is necessary

Why The Directions Demand Different Tools

Two Different Workloads at One Network Boundary

Ingress traffic characteristics:

- Initiated by outside parties (the internet at large)

- Volume scales with your user base

- TLS termination, request routing, rate limiting per source

- Defense-in-depth concerns: DDoS, abuse, scraping

- Public IP needs to accept connections from anyone

Egress traffic characteristics:

- Initiated by your own services (a known, small set of clients)

- Volume scales with your service-to-service & external-API call patterns

- Source-IP allowlisting at remote endpoints (you have one fixed outbound IP that partners trust)

- Defense-in-depth concerns: data exfiltration, compromised internal services calling out

- Should reject connections from anyone other than your own services

The key asymmetry: ingress accepts traffic from the world; egress accepts traffic only from your own services. Putting them on the same machine means that machine must simultaneously be reachable from the world (for ingress) & be reachable only from your services (for egress). The firewall rules that satisfy one work against the other.
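
The asymmetry can be sketched as two source-address policies. This is a minimal illustration, not a real firewall: the internal range 10.0.0.0/8 and the function names are assumptions chosen for the example.

```python
import ipaddress

# Hypothetical private range for internal services (an assumption for this sketch).
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


def ingress_allows(src_ip: str) -> bool:
    """Ingress accepts connections from any valid source address."""
    ipaddress.ip_address(src_ip)  # raises ValueError if src_ip is malformed
    return True


def egress_allows(src_ip: str) -> bool:
    """Egress accepts connections only from internal services."""
    return ipaddress.ip_address(src_ip) in INTERNAL_NET


# One outside client and one internal service, checked against both roles.
for src in ("198.51.100.7", "10.2.3.4"):
    print(src, "ingress:", ingress_allows(src), "egress:", egress_allows(src))
```

On a single box both policies have to coexist on one host, usually separated only by port, so a mistake in the per-port rules either exposes the proxy port to the world or blocks real users.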

The growth path: a tiny project can hide both behind one IP & one tool, because the volume is small & the partner-IP-allowlist is short. As the project grows, the friction between the two roles increases, & one day a specific failure mode (hairpin NAT) forces the split.

Ingress vs egress: different sources, different destinations, different requirements

A small startup runs everything (ingress reverse proxy, egress forward proxy / NAT, internal services) on a single VM with one public IP. They are early enough that this seems fine. Name two specific failure modes or operational pains this design will hit as they grow, & for each one, explain the underlying cause.

The Bug That Forces the Split

A Sanitized Outage Story

Picture a real architectural fork that happens in production fleets. The names below have been changed; the shape is identical to what teams hit in the wild.

An organization runs a single proxy server at 203.0.113.5. It handles ingress (port 443 for users) & egress (port 1080 SOCKS5 for internal services calling outbound). Internal services live in private subnets & route all outbound traffic through that SOCKS5 proxy on 203.0.113.5:1080.

One of the services hosted behind the same 203.0.113.5 is api.example.com. Public DNS resolves api.example.com to 203.0.113.5.

Now a different internal service needs to call api.example.com. Its outbound path:

1. Internal service resolves api.example.com → 203.0.113.5

2. Internal service sends the request through the SOCKS5 egress proxy at 203.0.113.5:1080

3. The proxy attempts to open a connection from itself to 203.0.113.5:443

4. The connection fails. The packet would have to exit & re-enter the same NAT, which a NAT without hairpin support does not forward. The proxy cannot connect to itself via its own public IP.

This is hairpin NAT: a packet that exits a NAT & needs to re-enter the same NAT to reach its destination. Without explicit hairpin support in the routing layer, the packet is dropped.
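
The four steps above can be modeled in a few lines. This is a toy simulation under stated assumptions, not a network stack: the `Proxy` class, `connect_via`, and the `PUBLIC_DNS` table are hypothetical names, and the IPs are the ones from the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    public_ip: str
    hairpin_supported: bool = False  # the setup in the story lacks hairpin support


def connect_via(proxy: Proxy, dest_ip: str) -> str:
    """Return the outcome of an outbound connection made through the proxy."""
    if dest_ip == proxy.public_ip and not proxy.hairpin_supported:
        # The packet would have to leave and re-enter the same NAT.
        return "failed: hairpin"
    return "connected"


# Hypothetical stand-in for public DNS, using the addresses from the story.
PUBLIC_DNS = {"api.example.com": "203.0.113.5"}

single_box = Proxy(public_ip="203.0.113.5")
print(connect_via(single_box, PUBLIC_DNS["api.example.com"]))  # failed: hairpin
print(connect_via(single_box, "198.51.100.20"))                # connected
```

The failure depends only on the destination matching the proxy's own public IP, which is why calls to unrelated external hosts keep working while this one path breaks.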

Why It Surfaces Late

Early in the project's life, every internal service either talked to other internal services by private hostname (internal-api.local) or did not call back into its own organization's public services. The hairpin path simply did not exist.

Then a new feature required service A to call api.example.com (a public hostname). The hairpin path activated. Connection refused. Outage.

The immediate fix patched the symptom: the internal resolver was forced to return api.example.com's private IP instead of the public one. The root cause remained: a single box was doing too many jobs.
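
A resolver override of this kind is commonly called split-horizon DNS. The sketch below shows the idea only; the internal range, the private address 10.0.5.10, and the `resolve` function are assumptions made for illustration, not details from the outage.

```python
import ipaddress

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # assumed internal range

PUBLIC_RECORDS = {"api.example.com": "203.0.113.5"}
# Hypothetical private address of the api backend.
PRIVATE_OVERRIDES = {"api.example.com": "10.0.5.10"}


def resolve(name: str, client_ip: str) -> str:
    """Return the private address to internal clients when an override exists."""
    is_internal = ipaddress.ip_address(client_ip) in INTERNAL_NET
    if is_internal and name in PRIVATE_OVERRIDES:
        return PRIVATE_OVERRIDES[name]
    return PUBLIC_RECORDS[name]


print(resolve("api.example.com", "10.2.3.4"))      # 10.0.5.10
print(resolve("api.example.com", "198.51.100.7"))  # 203.0.113.5
```

Every public hostname served from the shared box needs its own override entry, so the patch has to be repeated each time a new hostname is added.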

Hairpin NAT: packet exits & cannot re-enter the same NAT

The Architectural Fork

One Box Becomes Two

The clean fix: separate the proxy into two machines.

Ingress server (public IP 203.0.113.5):

- Caddy / reverse proxy on ports 80, 443

- Public DNS records point here

- Hosts api.example.com, app.example.com, etc.

Egress server (different public IP 203.0.113.99):

- SOCKS5 / forward proxy on port 1080

- Firewall restricts incoming connections to internal subnet IPs only

- Internal services route all outbound through this address

What this buys:

1. Hairpin resolved. An internal service calling api.example.com routes outbound via 203.0.113.99 (egress), which then connects normally to 203.0.113.5 (ingress, a different IP). The NAT loop disappears because the two IPs live on different machines.

2. Security isolation. The egress server's firewall can lock down to a small set of internal IPs. The ingress server's firewall stays open to the world. Two separate rule sets, each expressing one role cleanly.

3. Independent scaling. Ingress bandwidth scales with users; egress bandwidth scales with internal-service activity. Upgrade one without touching the other.

4. Failure isolation. A misconfigured egress no longer breaks the public site. A DDoS against the public site no longer starves egress bandwidth.

5. Clearer mental model. Each machine has one job. Engineers reason about ingress concerns without thinking about egress, & vice versa.

After the split, an internal service still needs to call `api.example.com`. Walk through the new packet path from internal service to the api backend. Include: which IP the internal service connects to first, what that machine does with the request, which IP it sends to next, & where the response goes.

Two Axes, Two Sizing Decisions

Independent Scaling

Before the split, growth in either direction stressed the same machine. After the split, each direction has its own provisioning.

Ingress sizing: scales with users. Capacity decisions live in the public-facing tier (more reverse proxy replicas, larger VMs, CDN in front). Bandwidth budget calculated against user traffic at peak.

Egress sizing: scales with internal service-to-external API call volume. Often dominated by webhook delivery, payment processor calls, or third-party data fetches. Bandwidth budget calculated against internal call patterns.

Failure isolation: a DDoS against the public ingress no longer eats egress bandwidth (payment processor calls still go through). An egress proxy crash no longer takes down the public site (users keep reaching the site; only internal outbound calls fail).

Different SLOs: ingress availability matters to users (visible site outage); egress availability matters to operators (background failures that may take longer to detect). Each side can carry its own SLO.

Multiple Egress Servers

Once the egress role is its own machine, the next obvious move is to run several egress machines behind a load balancer for HA. Each new internal service points at the egress hostname (which resolves to the load-balanced pool) rather than at a single IP.

Same lesson as the rest of distributed systems: once a tier goes stateless & has its own role, it multiplies cheaply.
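
As a rough sketch of what the load-balanced pool does, the snippet below picks egress proxies round-robin & skips unhealthy ones. The pool addresses, `make_picker`, & the health check are all hypothetical; a real deployment would normally rely on an existing load balancer rather than hand-written selection.

```python
from itertools import cycle

# Hypothetical egress pool; the addresses are illustrative only.
EGRESS_POOL = ["10.0.9.1", "10.0.9.2", "10.0.9.3"]


def make_picker(pool, is_healthy):
    """Return a function that yields the next healthy egress proxy, round-robin."""
    ring = cycle(pool)

    def pick():
        # Try each pool member at most once per call.
        for _ in range(len(pool)):
            candidate = next(ring)
            if is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy egress proxy available")

    return pick


down = {"10.0.9.2"}  # pretend one proxy is failing its health check
pick = make_picker(EGRESS_POOL, lambda ip: ip not in down)
print([pick() for _ in range(4)])
```

One caveat worth checking in any such design: unless the pool shares a single outbound address, each egress machine presents its own source IP, & partners that allowlist by source IP need to know all of them.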

A New Partner Integration

Your organization runs the ingress / egress split as designed. The egress server has a fixed public IP (203.0.113.99) that three existing partner APIs (a payment processor, an SMS gateway, an email provider) have allowlisted.

A product team wants to add a fourth integration: a webhook delivery system that calls back into customer endpoints worldwide. Volume forecast: 10,000 calls per minute, with bursts to 30,000.

Decide: does this new integration belong on the existing egress server, or does it need a separate egress path? Reason about bandwidth, failure isolation, & whether the existing partner allowlists need updating either way.
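
As a starting point for the bandwidth part, the forecast can be converted into per-second rates. The call volumes come from the scenario; the average payload size is an assumption made only for this sketch & should be replaced with a measured value.

```python
SECONDS_PER_MINUTE = 60

steady_per_min = 10_000  # from the forecast
burst_per_min = 30_000   # from the forecast
# Assumed average bytes per webhook call (request plus response); not from the forecast.
ASSUMED_BYTES_PER_CALL = 4_096


def per_second(calls_per_minute: int) -> float:
    """Convert a per-minute call rate to calls per second."""
    return calls_per_minute / SECONDS_PER_MINUTE


def megabits_per_second(calls_per_minute: int, bytes_per_call: int) -> float:
    """Estimate bandwidth in Mbit/s for a call rate and an average payload size."""
    return per_second(calls_per_minute) * bytes_per_call * 8 / 1_000_000


print(round(per_second(steady_per_min), 1))  # 166.7
print(round(per_second(burst_per_min), 1))   # 500.0
print(round(megabits_per_second(burst_per_min, ASSUMED_BYTES_PER_CALL), 2))  # 16.38
```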

Design a Network Boundary for a Growing Service

Synthesis

You have learned why ingress & egress demand different tools, the hairpin NAT failure that forces the split in real fleets, & how independent scaling, security isolation, & failure isolation accrue once the split lands.

Apply all of them.

A mid-sized SaaS company runs three product subdomains (app, api, admin) for their users, plus four outbound integrations (Stripe, Twilio, SendGrid, a customer-webhook system). Today everything lives behind a single proxy machine at one public IP. They have started getting reports of intermittent hairpin failures when internal services try to call api.example.com. They want to design a permanent fix.

Propose an ingress / egress architecture for this company. Address: how many machines, which IPs serve which roles, where each subdomain's DNS points, which outbound integrations share an egress path (& which should be split), & one concrete monitoring capability that the new design enables & the old one did not.

Where This Course Goes Next

You have now seen one of the cleanest separation-of-concerns refactors in distributed systems: one box becomes two, each with a clear role, & the system inherits scaling, security, & failure-isolation benefits along the way.

The next lesson (cs_distsys_failure_modes_and_blast_radius) extends the failure-isolation reasoning. You will read a sanitized DNS-SERVFAIL postmortem, identify the cascading failure pattern, & write blameless action items that target systems instead of people.

Companion lesson: geometry_of_ingress_egress_separation recasts the split as a bipartite graph & explores cut vertices, network partitions, & what graph theory tells you about a network boundary.

Well done. Onward.