MOAD-0003: A Leaked Context


ThreadLocal: Correct Idiom, Wrong Era

Java EE Servlet containers, circa 1999: one thread per request. A thread handles exactly one request from start to finish, then terminates. ThreadLocal stores a value keyed to the current thread. With one-thread-per-request, a value stored in ThreadLocal belongs to exactly one request. The idiom: correct.

Thread pools changed the contract. A thread handles request A, stores principal A in ThreadLocal, finishes request A, & returns to the pool. Thread pools do not reset thread state. ThreadLocal.remove() cleans up, but calling it on every exit path requires explicit discipline. When discipline fails, request B runs on the same thread & reads principal A from ThreadLocal.

The 5-step leak:

1. Request A arrives. Server assigns Thread-7.

2. Thread-7 sets ThreadLocal.set(principal_A) at request start.

3. Request A completes. Thread-7 returns to pool. ThreadLocal.remove() not called.

4. Request B arrives. Server assigns Thread-7 (pool reuse).

5. Thread-7 reads ThreadLocal.get() before request B has set a value of its own (or on a code path that never sets one): returns principal_A. Request B runs under the wrong identity.
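The five steps above can be replayed in a few lines. This is a minimal sketch, not a servlet container: a one-thread ExecutorService stands in for the server's pool, a String stands in for the principal, and the class and method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalLeakDemo {
    // Hypothetical per-request identity holder; the name is illustrative.
    static final ThreadLocal<String> PRINCIPAL = new ThreadLocal<>();

    // Replays the five steps on a one-thread pool and returns what request B reads.
    static String principalSeenByRequestB() {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one worker forces reuse
        try {
            // Steps 1-3: request A stores its principal and finishes without remove().
            Runnable requestA = () -> PRINCIPAL.set("principal_A");
            pool.submit(requestA).get();

            // Steps 4-5: request B is assigned the same worker and only reads.
            Callable<String> requestB = () -> PRINCIPAL.get();
            return pool.submit(requestB).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints "Request B sees: principal_A"
        System.out.println("Request B sees: " + principalSeenByRequestB());
    }
}
```

Nothing in request B's code is wrong on its own; the value it reads was left behind by a request that had already finished.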

Why Tests Miss It

Unit tests run in isolation: no thread pool, no reuse. Integration tests use fresh threads or reset state between tests. Load tests warm up with correct users & low concurrency. The defect manifests only when a pooled thread that skipped cleanup is reused for a different user's request. That condition arises in production under normal traffic, but typical test configurations never create it.
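A small comparison makes the gap concrete. The sketch below assumes a hypothetical handler in which authenticated requests set a principal and anonymous requests (credentials == null) do not, with cleanup deliberately missing. The same two requests run once on fresh threads, as a test harness might, and once on a shared worker, as a pool would.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

class FreshThreadVsPool {
    static final ThreadLocal<String> PRINCIPAL = new ThreadLocal<>();

    // Hypothetical handler: authenticated requests set a principal, anonymous
    // ones (credentials == null) do not. Cleanup is deliberately missing.
    static String handle(String credentials) {
        if (credentials != null) {
            PRINCIPAL.set(credentials);
        }
        return PRINCIPAL.get(); // the identity the rest of the request would use
    }

    // Test-style execution: every request gets a brand-new thread.
    static String onFreshThread(String credentials) {
        AtomicReference<String> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> seen.set(handle(credentials)));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    // Production-style execution: requests share pooled workers.
    static String onPool(ExecutorService pool, String credentials) {
        Callable<String> request = () -> handle(credentials);
        try {
            return pool.submit(request).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String anonymousAfterAliceOnFreshThreads() {
        onFreshThread("alice");
        return onFreshThread(null);
    }

    static String anonymousAfterAliceOnPool() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            onPool(pool, "alice");
            return onPool(pool, null);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("fresh threads: " + anonymousAfterAliceOnFreshThreads()); // null
        System.out.println("pooled worker: " + anonymousAfterAliceOnPool());         // alice
    }
}
```

The handler code is identical in both runs; only the threading model differs, which is why a suite that never reuses a worker reports the handler as correct.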

The Security Consequence

User A's principal bleeds into user B's request. Not a crash. Not an exception. A silent security boundary violation: user B performs actions as user A, reads user A's data, or bypasses user B's permissions. The system produces no error. Logs show request B was authorized. Everything looks correct.

The Five Steps

The order of the five steps matters: the defect does not occur at the moment the wrong code runs. It occurs earlier, at the point where a cleanup step is missing.

Walk through the 5 steps. At which step does the defect occur, and why would a test suite miss it?

Scope-Attached Values

ThreadLocal attaches a value to a thread. A pooled thread outlives a request. Mismatch.

Scope-attached values attach a value to a unit of work. When the unit of work ends, the value ends with it. No explicit cleanup. No remove() to forget.

Java 21 (preview API; finalized in Java 25): ScopedValue

// ThreadLocal (DEFECT carrier)
static final ThreadLocal<Principal> PRINCIPAL = new ThreadLocal<>();
PRINCIPAL.set(principal);           // set at request start
try {
    // ... request handling ...
} finally {
    PRINCIPAL.remove();             // MUST run on every exit path; often forgotten
}

// ScopedValue (CORRECT carrier)
static final ScopedValue<Principal> PRINCIPAL = ScopedValue.newInstance();
ScopedValue.where(PRINCIPAL, principal).run(() -> {
    // ... request handling; read with PRINCIPAL.get() ...
    // binding automatically gone when run() returns or throws
});
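On JDKs where ScopedValue is unavailable, the same shape can be approximated by hiding the ThreadLocal behind a helper that owns both the set and the cleanup. The sketch below is an emulation under that assumption, not the ScopedValue API; runWith, current, and the class name are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ScopedBinding {
    private static final ThreadLocal<String> PRINCIPAL = new ThreadLocal<>();

    // Binds the principal only while body runs; the finally block restores the
    // previous state even if body throws, so callers have no remove() to forget.
    static void runWith(String principal, Runnable body) {
        String previous = PRINCIPAL.get();
        PRINCIPAL.set(principal);
        try {
            body.run();
        } finally {
            if (previous == null) {
                PRINCIPAL.remove();
            } else {
                PRINCIPAL.set(previous);
            }
        }
    }

    static String current() {
        return PRINCIPAL.get();
    }

    // Request A binds principal_A and fails midway; request B reuses the worker.
    static String principalSeenByNextRequest() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Runnable requestA = () -> runWith("principal_A", () -> {
                throw new IllegalStateException("request A failed midway");
            });
            try {
                pool.submit(requestA).get();
            } catch (Exception expected) {
                // Request A's failure is not the point; its binding must still be gone.
            }
            Callable<String> requestB = ScopedBinding::current;
            return pool.submit(requestB).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints "Request B sees: null"
        System.out.println("Request B sees: " + principalSeenByNextRequest());
    }
}
```

Because the ThreadLocal is private and only reachable through runWith, the binding's lifetime is tied to the call rather than to the thread, which is the property the scope-attached carriers provide natively.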

Go: context.Context

// context.Context carries values explicitly; scope = function call chain
ctx := context.WithValue(r.Context(), principalKey, principal)
handleRequest(ctx)  // ctx passed explicitly; gone when function returns

Python asyncio: contextvars.ContextVar

# ContextVar scoped to each async task
from contextvars import ContextVar

PRINCIPAL: ContextVar[str] = ContextVar('principal')
token = PRINCIPAL.set(principal)    # visible in this task's context only
try:
    ...                             # task handling
finally:
    PRINCIPAL.reset(token)          # optional inside a task: its context copy ends with it

The property these share: lifetime matches the unit of work. When the request ends (the run() returns, the function returns, the task completes), the value ends. No cleanup to forget. No pool to corrupt.

Identify & Replace

A Java EE application stores tenant ID in a ThreadLocal at request start. Under high load, tenant A's ID appears in requests from tenant B. Tenant B's queries return tenant A's data. No exception gets thrown. The defect only appears in production load testing.

What MOAD does this describe? What carrier made the defect possible? What replaces it, & what property of the replacement prevents the leak?