Twenty-Three Days of Compute on One GPU
A Single Long Run
ANDREA-120M takes ~23 days on an RTX 4090 (FP16, 6 steps/min, 200K steps). Power loss, kernel panics, proxy crashes, & deliberate config changes can all happen across that window. Without checkpoints a single hiccup discards the entire run.
v1 lost 27K steps to one mistake (lr=0.001 too aggressive) because no checkpoint sat closer than the launch point. v2 absorbed that lesson: checkpoint cadence dropped to every 100 steps, & a CUDA signal handler guarantees a checkpoint write on SIGTERM.
Three Roles
A checkpoint serves three jobs at once:
1. Recovery point. Process dies, machine reboots, kernel panics: resume from latest step_NNNNNN.bin.
2. Polish pivot. Step 112,619: change curriculum without retraining. SIGUSR1 forces a clean checkpoint, proxy stops, new caps & floors get submitted, CUDA resumes from the saved point under a new policy.
3. Audit fork. Compare two configs at the same starting weights: copy a checkpoint, run two divergent branches forward, observe which converges.
Every 100 steps gives ~17 minutes of training between writes at 6 steps/min. 100 steps also matches sample_every: every checkpoint corresponds to one fresh sample audit, & every sample audit corresponds to a recoverable point.
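The cadence arithmetic above can be checked in a few lines. A minimal sketch using only the figures quoted in this lesson (6 steps/min, 200K steps, 100-step cadence); the constant & function names are illustrative, not taken from the engine.

```python
STEPS_PER_MIN = 6         # throughput quoted in this lesson
TOTAL_STEPS = 200_000     # run length quoted in this lesson
CHECKPOINT_EVERY = 100    # checkpoint cadence, same value as sample_every


def minutes_between_checkpoints(every, steps_per_min):
    """Training time covered by one checkpoint interval, in minutes."""
    return every / steps_per_min


def run_days(total_steps, steps_per_min):
    """Wall-clock length of an uninterrupted run, in days."""
    return total_steps / steps_per_min / 60 / 24


gap = minutes_between_checkpoints(CHECKPOINT_EVERY, STEPS_PER_MIN)
days = run_days(TOTAL_STEPS, STEPS_PER_MIN)
n_checkpoints = TOTAL_STEPS // CHECKPOINT_EVERY

print(f"{gap:.1f} min between checkpoints")                # 16.7 min
print(f"{days:.1f} days, {n_checkpoints} checkpoints")     # 23.1 days, 2000 checkpoints
```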
Three Roles for One File
Five Regions in One File
The Format
Every step_NNNNNN.bin packs five contiguous regions:
[int32 step] [int32 n_params] [n_params x float32 weights] [n_params x float32 m] [n_params x float32 v]
Region by Region
Header (8 bytes total). A 32-bit step number tells us where in training this snapshot lives; a 32-bit parameter count tells us how big the three trailing arrays each are.
Weights (n_params x 4 bytes). Every learned parameter, flat. Order matches the model's parameter iterator: token & position embeddings, then per-layer attention & MLP weights, then the output head.
Adam first moment m (n_params x 4 bytes). EMA of past gradients (beta1 = 0.9). Same shape as weights. Required for AdamW resumption.
Adam second moment v (n_params x 4 bytes). EMA of past squared gradients (beta2 = 0.999). Same shape as weights. Required for AdamW resumption.
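One way to read & write this layout, as a minimal Python sketch rather than the engine's own code. It assumes little-endian byte order (the lesson does not state one) & uses only the standard library; write_checkpoint & read_checkpoint are hypothetical names.

```python
import os
import struct
import sys
import tempfile
from array import array


def write_checkpoint(path, step, weights, m, v):
    """Write [int32 step][int32 n_params][weights][m][v], float32 arrays."""
    n = len(weights)
    if len(m) != n or len(v) != n:
        raise ValueError("weights, m and v must have the same length")
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", step, n))      # 8-byte header
        for values in (weights, m, v):
            buf = array("f", values)              # 4-byte floats
            if sys.byteorder == "big":
                buf.byteswap()                    # keep the file little-endian
            f.write(buf.tobytes())


def read_checkpoint(path):
    """Return (step, weights, m, v) from a file written by write_checkpoint."""
    with open(path, "rb") as f:
        step, n = struct.unpack("<ii", f.read(8))
        regions = []
        for _ in range(3):
            buf = array("f")
            buf.frombytes(f.read(4 * n))
            if sys.byteorder == "big":
                buf.byteswap()
            if len(buf) != n:
                raise ValueError("truncated checkpoint")
            regions.append(buf)
    weights, m, v = regions
    return step, weights, m, v


# Round trip on a 3-parameter toy model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "step_000100.bin")
    write_checkpoint(path, 100, [0.5, -1.25, 2.0], [0.0, 0.25, 0.5], [1.0, 4.0, 0.125])
    size_on_disk = os.path.getsize(path)          # 8 + 12 * 3 bytes
    step, weights, m, v = read_checkpoint(path)
```

The header's n_params is what lets the reader size all three arrays without any other metadata.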
Total Size
Total bytes = 8 + 12 x n_params. ANDREA-12M (12.8M params): ~154 MB on disk (~147 MiB). ANDREA-120M (~120M params) FP32: ~1.44 GB. Three arrays of identical shape, stacked back to back, with a tiny header.
Why Save m & v
AdamW scales each parameter's update using m & v. Drop them at checkpoint write & a resumed run starts with zero momentum & a zero variance estimate, so its first updates are normalized by one fresh gradient instead of accumulated history & come out mis-scaled. Loss spikes; the model can fall out of its current basin. Saving m & v makes resume bit-equivalent (modulo dataloader randomness) to the never-stopped baseline.
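What the first post-resume update looks like depends on engine details this lesson does not spell out. The toy case below is a sketch under stated assumptions: a single scalar parameter, a constant gradient, weight decay omitted, & a step counter restored from the header while m & v are zeroed. EPS, the learning rate, & the gradient value are illustrative.

```python
import math

BETA1, BETA2 = 0.9, 0.999   # values quoted in this lesson
EPS = 1e-8                  # assumed; not stated in this lesson


def adam_step(g, m, v, t, lr):
    """One scalar Adam update (no weight decay). Returns (delta, m, v)."""
    m = BETA1 * m + (1 - BETA1) * g
    v = BETA2 * v + (1 - BETA2) * g * g
    m_hat = m / (1 - BETA1 ** t)     # bias correction, ~1 for large t
    v_hat = v / (1 - BETA2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + EPS), m, v


lr, g, t = 1e-4, 0.02, 100_000

# Resume with saved moments: after many steps of gradient g, m ~= g and v ~= g*g.
saved_delta, _, _ = adam_step(g, m=g, v=g * g, t=t, lr=lr)

# Resume with dropped moments: m and v restart from zero.
dropped_delta, _, _ = adam_step(g, m=0.0, v=0.0, t=t, lr=lr)

ratio = dropped_delta / saved_delta
print(f"first step is {ratio:.2f}x the saved-state step")
```

Under these assumptions the first step after dropping m & v is roughly (1 - beta1) / sqrt(1 - beta2), about 3.2x the step taken with saved moments. Real gradients are noisy, so the actual disturbance will differ.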
Sizing One Checkpoint
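The total-size formula, 8 + 12 x n_params, as a small calculator. A sketch: the parameter counts are the rounded figures quoted in this lesson, so results can differ slightly from actual on-disk sizes. checkpoint_bytes is an illustrative name.

```python
HEADER_BYTES = 8            # int32 step + int32 n_params
BYTES_PER_PARAM = 3 * 4     # weights, m, v, each float32


def checkpoint_bytes(n_params):
    """Size in bytes of one step_NNNNNN.bin for a model with n_params parameters."""
    return HEADER_BYTES + BYTES_PER_PARAM * n_params


# Rounded parameter counts from this lesson.
for name, n_params in [("ANDREA-12M", 12_800_000), ("ANDREA-120M", 120_000_000)]:
    size = checkpoint_bytes(n_params)
    print(f"{name}: {size / 1e6:.1f} MB = {size / 2**20:.1f} MiB")
```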
SIGTERM & SIGUSR1
Why CUDA Handles Signals
Training runs as a long-lived foreground process. When the proxy or operator wants the GPU to stop, a signal goes to the CUDA engine. Without a handler, a default SIGTERM kills the process immediately: in-flight gradient computation discarded, latest weights since last checkpoint lost. With a handler, the engine writes a checkpoint first then exits cleanly.
SIGTERM: write & exit
Sent by a stop button, a systemctl stop, or a kill from a parent proxy. CUDA finishes the current step, writes step_NNNNNN.bin to disk, then exits. Recovery from this state needs only the latest .bin: zero work lost beyond the partial step in flight.
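Finding the latest checkpoint can be as simple as parsing step numbers out of file names. A minimal sketch, assuming checkpoints sit directly in the run directory & follow the step_NNNNNN.bin pattern; latest_checkpoint is a hypothetical helper, not part of the engine.

```python
import re
from pathlib import Path

_STEP_RE = re.compile(r"^step_(\d+)\.bin$")


def latest_checkpoint(run_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None if there is none."""
    best = None
    for path in Path(run_dir).iterdir():
        match = _STEP_RE.match(path.name)
        if match is None:
            continue                      # sidecars and other files are ignored
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best
```

Comparing parsed integers rather than file names keeps the choice correct even if the zero padding ever changes.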
SIGUSR1: write & continue
Sent on demand by an operator or proxy script. CUDA finishes the current step, writes step_NNNNNN.bin, then continues training as if nothing happened. Useful for: triggering an audit point right before a config change; capturing weights at a known-good moment; aligning a checkpoint with an external sample-quality grading run.
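The two behaviors can be sketched with flag-setting handlers that the training loop checks between steps. This is a Python illustration of the pattern, not the CUDA engine's code; CheckpointSignals, train, run_step, & save_checkpoint are hypothetical names, & SIGUSR1 is only available on POSIX systems.

```python
import os
import signal


class CheckpointSignals:
    """Records SIGTERM / SIGUSR1 so the loop can react once the current step is done."""

    def __init__(self):
        self.stop_requested = False
        self.snapshot_requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)
        signal.signal(signal.SIGUSR1, self._on_sigusr1)

    def _on_sigterm(self, signum, frame):
        self.stop_requested = True        # handlers only set flags; no I/O here

    def _on_sigusr1(self, signum, frame):
        self.snapshot_requested = True


def train(total_steps, run_step, save_checkpoint, every=100, start_step=0):
    """Run steps until done or SIGTERM; returns the last completed step count."""
    signals = CheckpointSignals()
    step = start_step
    while step < total_steps:
        run_step(step)                    # the current step always finishes
        step += 1
        if signals.stop_requested:
            save_checkpoint(step)         # SIGTERM: write & exit
            return step
        if signals.snapshot_requested or step % every == 0:
            save_checkpoint(step)         # SIGUSR1 or regular cadence: write & continue
            signals.snapshot_requested = False
    return step
```

Keeping the handlers down to a flag assignment means the checkpoint write always happens from the main loop, between steps, never in the middle of one.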
The Polish Pivot Sequence (step 112,619)
1. Operator sends SIGUSR1 to CUDA. Checkpoint writes at the next 100-step boundary (step 112,700).
2. Operator stops the proxy.
3. .samples.json & .state.json get archived (sample log & bandit state preserved as historical record).
4. .loss.json stays in place. Cumulative training history; never archived.
5. Proxy restarts under new caps & floors.
6. CUDA resumes from step_112700.bin with a fresh bandit but full weights, m, & v.
Loss history continues unbroken across the pivot. Sample log resets cleanly. Bandit gets a fresh start under new policy.
Picking the Signal
Cumulative Training History
Three Sidecar Files
Alongside every checkpoint, the proxy maintains three JSON sidecars in the run directory:
- .loss.json -- one entry per step, ever. ~200,000 entries by run end. Cumulative training history.
- .samples.json -- recent generated samples for audit. Reset on polish pivots.
- .state.json -- bandit arm pulls, EMA rewards, phase counters. Reset on polish pivots.
What Resets, What Persists
Polish pivots are policy changes, not run resets. The model's weights, m, v, & loss history all continue unbroken. The bandit's accumulated rewards do NOT continue: the new caps & floors define a different policy, & the bandit must re-learn under the new policy from a clean state.
Why .loss.json Stays
Loss history serves as the run's audit trail. Every published claim about ANDREA-120M (loss EMA at step 110K, polish-pivot recovery, convergence at step 112K) traces back to entries in this file. Archiving .loss.json between phases would force readers to stitch together fragments to reconstruct the run; keeping it append-only & untouched preserves provenance.
The Zombie Arm Lesson
Step 112,619 found a repo-docstrings arm in .state.json carrying weight 1.546 from a prior run. The bandit state had been preserved across an earlier restart but the data source was no longer available, producing zombie pulls that distorted exploration accounting. Lesson: bandit state IS allowed to drift across restarts in surprising ways. Loss history is the only file that must remain untouched for the run's full lifetime.
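One guard against zombie arms, sketched under assumptions: the layout of .state.json is not shown in this lesson, so the dict of arm name to weight below is hypothetical, as are the function name & every arm other than repo-docstrings.

```python
def prune_zombie_arms(arm_weights, available_sources):
    """Split saved arms into (live, zombies) by whether their data source still exists."""
    live = {arm: w for arm, w in arm_weights.items() if arm in available_sources}
    zombies = sorted(arm for arm in arm_weights if arm not in available_sources)
    return live, zombies


# arm-a and arm-b are made-up names; 1.546 is the weight quoted above.
saved_arms = {"repo-docstrings": 1.546, "arm-a": 0.8, "arm-b": 1.1}
live, zombies = prune_zombie_arms(saved_arms, available_sources={"arm-a", "arm-b"})
print(zombies)   # ['repo-docstrings']
```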
One Rule to Govern Them All
Archive .samples.json & .state.json freely between phases. Never archive .loss.json. The latest .loss.json is always the canonical training history.
Applying the Rule
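The rule above fits in one small helper. A minimal sketch, assuming the sidecars are literally named .samples.json, .state.json, & .loss.json inside the run directory; the archive/<tag>/ layout & the name archive_sidecars are hypothetical.

```python
import shutil
from pathlib import Path

ARCHIVABLE = (".samples.json", ".state.json")   # reset on polish pivots
NEVER_ARCHIVE = ".loss.json"                    # cumulative history, stays in place


def archive_sidecars(run_dir, tag):
    """Move the resettable sidecars into run_dir/archive/<tag>/ and return their names."""
    if NEVER_ARCHIVE in ARCHIVABLE:
        raise ValueError(f"{NEVER_ARCHIVE} must never be archived")
    run_dir = Path(run_dir)
    dest = run_dir / "archive" / tag
    dest.mkdir(parents=True, exist_ok=False)    # refuse to overwrite an earlier archive
    moved = []
    for name in ARCHIVABLE:
        src = run_dir / name
        if src.exists():
            shutil.move(str(src), str(dest / name))
            moved.append(name)
    return moved
```

A call such as archive_sidecars(run_dir, "pivot_112700") would run after the proxy stops & before it restarts under the new caps & floors; the tag is an illustrative choice.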
What Got Built & Why
Five Decisions
1. Cadence: every 100 steps. Recovery point granularity ~17 minutes. Aligned with sample_every so every checkpoint corresponds to one fresh sample audit.
2. Format: header + 3 arrays. Minimal: 8-byte header tells us how big each trailing array gets. No metadata bloat. Bit-equivalent resumes when m & v get saved.
3. Signals: SIGTERM & SIGUSR1. Two roles, two signals. Default systemd shutdown gets a clean checkpoint via SIGTERM; on-demand audit points get a clean checkpoint via SIGUSR1 without stopping.
4. Loss continuity: never archived. Cumulative training history persists across polish pivots, restarts, & policy changes. One audit trail for the whole run.
5. Bandit state: resets allowed. Bandit policy lives under one config at a time. Polish pivots reset; loss history continues. Two different lifetimes share the same run directory.
What This Lesson Connects To
- Activity 23 (grow_a_language_model_sample_audit). sample_every cadence matches checkpoint cadence; both fire every 100 steps.
- Activity 24 (grow_a_language_model_microgpt_to_andrea). v1 collapse, v2.5 patch, v3 polish pivot all required clean checkpoints to operate.
- Activity 10 (grow_a_language_model_adamw). Saving m & v in the checkpoint matters because AdamW's update rule depends on both. Drop them & resume diverges.
One Last Engineering Truth
Code outlasts authors. Infrastructure outlasts builders. A simple checkpoint format outlasts every fancy resume scheme that promised to skip saving optimizer state. Save the bytes; save the run.