Local Gradients Multiply
The Forward Pass
ANDREA-120M's forward pass walks input through a sequence of operations:
x = embed(token_ids)                    # token embeddings
for layer in range(12):                 # 12 transformer blocks
    x = x + attn(LN(x))                 # attention sublayer
    x = x + mlp(LN(x))                  # MLP sublayer
logits = LN(x) @ embed.T                # tied output projection
loss = cross_entropy(logits, targets)   # single scalar
Each operation reads input tensors & produces output tensors. The forward pass terminates in a single scalar: the cross-entropy loss for this batch.
The Backward Pass
Training updates the weights in the direction that decreases the loss. To get update directions, the engine needs:
dL/dW for every learnable W in the model
The chain rule gives this. For a chain loss = f(g(h(x))):
dL/dx = (dL/df) * (df/dg) * (dg/dh) * (dh/dx)
Each factor is a local gradient: how the output of one operation changes when its input changes by a small amount. Multiplying local gradients backward through the graph propagates the loss signal to every weight.
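A minimal sketch of that multiplication, assuming a toy three-stage chain. The functions h, g, & f below are hypothetical stand-ins, not operations from the ANDREA engine. A finite-difference estimate serves as a sanity check on the product of local gradients.

```python
import math

# Hypothetical chain for illustration only: loss = f(g(h(x)))
def h(x): return 3.0 * x        # local gradient dh/dx = 3
def g(u): return math.sin(u)    # local gradient dg/du = cos(u)
def f(v): return v * v          # local gradient df/dv = 2v

def loss_and_grad(x):
    """Run the forward pass, then multiply local gradients from the loss back to x."""
    u = h(x)
    v = g(u)
    loss = f(v)
    grad = 1.0             # seed: dL/dL
    grad *= 2.0 * v        # back through f
    grad *= math.cos(u)    # back through g
    grad *= 3.0            # back through h
    return loss, grad

def numeric_grad(x, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2.0 * eps)

loss, grad = loss_and_grad(0.5)
print(loss, grad, numeric_grad(0.5))
```

The forward values u & v are reused during the backward walk, which is why a real engine has to keep some forward tensors around.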
Reverse-Mode Differentiation
Backprop computes gradients in reverse order: starting from the seed dL/dL = 1, then walking backward through cross-entropy (which yields dL/dlogits), then the output projection, the final layer norm, the twelve transformer blocks, & the embeddings. At each step, the incoming gradient is multiplied by the local Jacobian.
Reverse-mode is efficient when the output is a single scalar (the loss) & there are many inputs (the weights). One backward pass produces gradients for every weight in the model. Forward-mode would need one pass per weight; for ANDREA-120M with ~120M weights, forward-mode is infeasible.
Why Reverse-Mode
Every Forward Op Gets A Backward Twin
The Pairing Discipline
microgpt_cuda.cu ships two CUDA kernels for every operation: one that computes the forward output, one that computes input gradients given output gradients. The pairing is one-to-one:
| Forward kernel | Backward kernel | Operation |
|---|---|---|
| k_embed_fwd | k_embed_bwd | Token embedding lookup |
| k_layernorm_fwd | k_layernorm_bwd | Layer normalization |
| k_attn_qkv_fwd | k_attn_qkv_bwd | Q, K, V projections |
| k_attn_fwd | k_attn_bwd | Scaled dot-product attention |
| k_attn_out_fwd | k_attn_out_bwd | Output projection W_O |
| k_mlp_fwd | k_mlp_bwd | MLP (with GELU) |
| k_residual_add | k_residual_add_bwd | Residual connection |
| k_loss_fwd | k_loss_bwd | Cross-entropy loss |
Eight operation pairs cover the full transformer. Plus a few utility kernels: k_grad_norm_partial, k_grad_norm_final, k_grad_scale for gradient clipping (see activity 75).
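The three utility kernel names suggest a staged global-norm clip: partial sums of squares, a final reduction, then a scale. The Python sketch below mirrors that reading on a flat list. The stage split is inferred from the kernel names alone, & block_size & max_norm are hypothetical parameters, so treat it as an illustration rather than the engine's actual logic.

```python
import math

def grad_norm_partial(grad, block_size):
    """Stage 1: each block reduces its slice to a partial sum of squares."""
    return [sum(g * g for g in grad[i:i + block_size])
            for i in range(0, len(grad), block_size)]

def grad_norm_final(partials):
    """Stage 2: combine the partial sums into one global L2 norm."""
    return math.sqrt(sum(partials))

def grad_scale(grad, norm, max_norm):
    """Stage 3: scale in place, but only when the norm exceeds max_norm."""
    if norm > max_norm:
        factor = max_norm / norm
        for i in range(len(grad)):
            grad[i] *= factor
    return grad

grads = [3.0, 4.0, 0.0, 12.0]   # toy gradient with global norm 13
norm = grad_norm_final(grad_norm_partial(grads, block_size=2))
grad_scale(grads, norm, max_norm=1.0)
print(norm, grads)
```

Splitting the reduction in two is a common GPU pattern: many blocks reduce in parallel, then a single small pass combines their results.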
A Backward Kernel's Job
Given the gradient flowing in from later layers (grad_output), a backward kernel computes:
1. grad_input: the gradient with respect to the operation's input tensor. This gets passed further backward.
2. grad_weight: the gradient with respect to learnable parameters in the operation. This is added to the matching gradient buffer, which the optimizer reads at step time.
Both are computed in a single kernel launch. CUDA threads cooperate on tiles of the gradient tensor in parallel.
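As an illustration of the two outputs, here is a plain-Python sketch of the backward step for a bias-free linear projection y = x @ W. The helper names (matmul, transpose, linear_bwd) are hypothetical & the nested-list math stands in for what a CUDA kernel would do on tiles.

```python
def matmul(a, b):
    """Nested-list matrix product: a is [n, k], b is [k, m]."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def linear_bwd(x, w, grad_output, grad_w):
    """Backward of y = x @ w.

    Returns grad_input (passed further backward) and ADDS the weight
    gradient into grad_w, leaving any earlier contributions in place.
    """
    grad_input = matmul(grad_output, transpose(w))   # dL/dx = dL/dy @ w^T
    delta = matmul(transpose(x), grad_output)        # dL/dw = x^T @ dL/dy
    for i in range(len(grad_w)):
        for j in range(len(grad_w[0])):
            grad_w[i][j] += delta[i][j]
    return grad_input

x = [[1.0, 2.0]]                          # one token, d_in = 2
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # d_in = 2, d_out = 3
grad_w = [[0.0] * 3 for _ in range(2)]    # zeroed gradient buffer
grad_input = linear_bwd(x, w, [[1.0, 1.0, 1.0]], grad_w)
print(grad_input, grad_w)
```

Note that the weight gradient needs x, the forward input, which is one reason forward tensors get saved.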
Saved Tensors
Backward computation often needs values from the forward pass. For example, k_layernorm_bwd needs the mean & variance computed during forward; k_mlp_bwd needs the GELU pre-activation. The training engine stores these in dedicated buffers during forward, then reads them during backward.
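The GELU case can be sketched in a few lines. This assumes the exact erf form of GELU; the source does not say whether the engine uses that or the tanh approximation. The saved dict is a hypothetical stand-in for a dedicated buffer.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def gelu_fwd(pre_act, saved):
    """Forward: stash the pre-activation, return the activation."""
    saved["pre_act"] = list(pre_act)
    return [gelu(v) for v in pre_act]

def gelu_bwd(grad_output, saved):
    """Backward: needs the saved pre-activation, not the forward output."""
    return [g * gelu_grad(v) for g, v in zip(grad_output, saved["pre_act"])]

saved = {}
out = gelu_fwd([-1.0, 0.0, 2.0], saved)
print(gelu_bwd([1.0, 1.0, 1.0], saved))
```

Without the saved pre-activation, the backward pass would have to recompute it, trading memory for compute.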
Memory cost: each saved tensor is roughly the size of the forward output it was taken from. For ANDREA-120M with batch=8, seq=1024, d_model=768, one FP32 saved tensor is 8 × 1024 × 768 × 4 bytes ≈ 25 MB. Across 12 layers & multiple saved tensors per layer, activations dominate VRAM during training (~5-10 GB on a 24 GB card).
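The per-tensor figure is simple arithmetic, sketched below with a hypothetical helper. The function covers only tensors of shape [batch, seq, d_model]; wider tensors, such as the MLP hidden activations, are larger.

```python
def saved_tensor_bytes(batch, seq, d_model, bytes_per_elem):
    """Bytes for one saved activation of shape [batch, seq, d_model]."""
    return batch * seq * d_model * bytes_per_elem

fp32_bytes = saved_tensor_bytes(8, 1024, 768, 4)
fp16_bytes = saved_tensor_bytes(8, 1024, 768, 2)
print(fp32_bytes / 1e6, "MB at FP32")   # about 25.2 MB
print(fp16_bytes / 1e6, "MB at FP16")   # about 12.6 MB
```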
Tracing One Backward Step
Where Gradients Live In Memory
One Gradient Tensor Per Weight Tensor
Every learnable weight tensor in ANDREA-120M has a matching gradient tensor of identical shape. For each block:
W_Q [768, 768] ↔ grad_W_Q [768, 768]
W_K [768, 768] ↔ grad_W_K [768, 768]
W_V [768, 768] ↔ grad_W_V [768, 768]
W_O [768, 768] ↔ grad_W_O [768, 768]
W_1 [768, 3072] ↔ grad_W_1 [768, 3072]
W_2 [3072, 768] ↔ grad_W_2 [3072, 768]
LN1.gamma [768] ↔ grad_LN1.gamma [768]
LN1.beta [768] ↔ grad_LN1.beta [768]
LN2.gamma [768] ↔ grad_LN2.gamma [768]
LN2.beta [768] ↔ grad_LN2.beta [768]
Plus token embeddings, position embeddings, & a final layer norm. The gradient buffer total memory matches the weight memory: ~120M floats, ~480 MB at FP32, ~240 MB at FP16.
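A quick count of the per-block shapes listed above, as a sketch. Only the tensors named in the list are counted; embedding sizes are left out because the vocabulary size is not stated here.

```python
D_MODEL, D_FF, N_LAYERS = 768, 3072, 12

# Shapes copied from the per-block list above.
block_shapes = {
    "W_Q": (D_MODEL, D_MODEL), "W_K": (D_MODEL, D_MODEL),
    "W_V": (D_MODEL, D_MODEL), "W_O": (D_MODEL, D_MODEL),
    "W_1": (D_MODEL, D_FF),    "W_2": (D_FF, D_MODEL),
    "LN1.gamma": (D_MODEL,),   "LN1.beta": (D_MODEL,),
    "LN2.gamma": (D_MODEL,),   "LN2.beta": (D_MODEL,),
}

def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

per_block = sum(numel(s) for s in block_shapes.values())
all_blocks = per_block * N_LAYERS
print(per_block, all_blocks)                    # 7080960 84971520
print(all_blocks * 4 / 1e6, "MB of FP32 gradients for the blocks alone")
```

The twelve blocks account for roughly 85M of the ~120M parameters; the rest would come from the embeddings & the final layer norm.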
Accumulation Across Microbatches
ANDREA's batch_size = 8 fits in VRAM at FP16. Larger effective batches require gradient accumulation: run multiple forward+backward passes on small batches, summing gradients into the same buffer, then take one optimizer step.
for microbatch in range(n_microbatches):
    forward(microbatch)
    backward()                      # ADDS to grad buffers, doesn't overwrite
scale_grads(1.0 / n_microbatches)   # average across microbatches
optimizer_step()
zero_grads()                        # reset for next training step
Backward kernels use += semantics, not =. Each call adds gradient contributions to the existing buffer; the buffer holds the running sum until zero_grads() clears it.
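A toy check that accumulate-then-scale matches one large batch, assuming equal-sized microbatches & a loss that is a mean over samples. The one-weight model & the data are hypothetical.

```python
def backward(w, batch, grad):
    """Add this microbatch's mean-squared-error gradient into grad[0]."""
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    grad[0] += g                          # += semantics, never =

def accumulated_grad(w, microbatches):
    grad = [0.0]                          # buffer as zero_grads() leaves it
    for mb in microbatches:
        backward(w, mb, grad)
    grad[0] *= 1.0 / len(microbatches)    # scale_grads: average the sums
    return grad[0]

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]   # (x, y) pairs
full = accumulated_grad(0.5, [data])                       # one batch of 4
split = accumulated_grad(0.5, [data[:2], data[2:]])        # two batches of 2
print(full, split)
```

With unequal microbatch sizes the plain 1/n scale would weight samples unevenly, so the equivalence holds only for equal sizes.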
The Optimizer State
AdamW (activity 73) holds two more buffers per weight: first moment m & second moment v. Total training-time memory:
weights: 1× weight count
gradients: 1× weight count
Adam m: 1× weight count
Adam v: 1× weight count
saved acts: ~2-4× depending on layers & batch
──────────────────────────────────────────
total: ~6-8× weight count
ANDREA-120M at FP16: ~240 MB × 4 buffers (weight, grad, m, v) ≈ 1 GB, plus ~5-10 GB of activations, for roughly 6-11 GB total. Comfortably below the RTX 4090's 24 GB ceiling. ANDREA-12M trained in 1.4 GB; the 10× parameter scaling brings a several-fold increase in memory, most of it activations.
Sizing Gradient Buffers
Full Control Of Memory & Precision
What Generic Frameworks Cost
PyTorch & JAX make autograd convenient: write Python code, get gradients automatically. The cost: a generic dispatch layer between your code & CUDA. Every operation goes through Python interpreter overhead, framework bookkeeping, & dynamic kernel selection. For training a small language model on one GPU, that overhead matters.
Concrete costs ANDREA avoids:
1. Python interpreter latency. Every PyTorch op crosses the Python/C++ boundary. For ~100 kernel launches per training step at ~9 steps/min, that's ~900 boundary crossings per minute. C-level dispatch eliminates this.
2. Framework allocator unpredictability. PyTorch's caching allocator gives good throughput on average but unpredictable peak memory. ANDREA's training engine pre-allocates every buffer at startup; no reallocation during training, no fragmentation, no surprise OOMs at step 100K.
3. Generic kernel selection. PyTorch picks kernels at runtime via heuristics. ANDREA picks kernels at compile time, tuned to RTX 4090 tensor core tile sizes.
4. Mixed-precision plumbing. ANDREA-120M's FP16 cuBLAS path & ANDREA's FP8 E4M3 tensor core experiments require precise control over which tensors live at which precision. Generic frameworks expose this control through layered APIs; custom CUDA writes it directly.
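The pre-allocation idea in item 2 can be sketched as a startup-time buffer plan. The single arena, the 256-byte alignment, & every name below are assumptions for illustration; the source says only that every buffer is allocated at startup.

```python
def plan_buffers(sizes, budget_bytes, alignment=256):
    """Give each named buffer a fixed offset in one arena.

    Raises MemoryError at planning time if the total exceeds the budget,
    so an oversized configuration fails before training starts.
    """
    offsets, cursor = {}, 0
    for name, nbytes in sizes.items():
        offsets[name] = cursor
        cursor += -(-nbytes // alignment) * alignment   # round up to alignment
    if cursor > budget_bytes:
        raise MemoryError(f"plan needs {cursor} bytes, budget is {budget_bytes}")
    return offsets, cursor

# Hypothetical toy sizes; a real plan would list every tensor in the model.
sizes = {"weights": 1000, "grads": 1000, "adam_m": 1000, "adam_v": 1000}
offsets, total = plan_buffers(sizes, budget_bytes=8192)
print(offsets, total)
```

Because every offset is fixed before step 0, peak memory is known up front & nothing is allocated or freed inside the training loop.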
The Tradeoff
Custom CUDA costs: more code to write, more bugs to find, no community ecosystem. ANDREA's microgpt_cuda.cu is ~6000 lines of hand-written CUDA that took months to debug. Each new operation requires writing a forward kernel, a backward kernel, & tests.
What ANDREA gains:
- Full reproducibility. The training pipeline is one C binary plus one Python proxy. No version drift across PyTorch releases, no CUDA version mismatches with framework wheels.
- Bit-exact resumes. SIGTERM triggers a checkpoint write that captures every tensor exactly as the GPU sees it. Resume picks up the same loss trajectory the run was on.
- Predictable memory. ANDREA-120M trained for 200K steps with no OOMs. All memory is accounted for at engine startup.
- Direct hardware access. Tensor core tile sizes, FP8 E4M3 settings, asynchronous memory copies: all directly addressable in CUDA, opaque in generic frameworks.
Reproducibility As Mission
The ANDREA whitepaper section 9 lists the full reproducibility stack:
Training engine: microgpt/microgpt_cuda.cu
Training proxy: microgpt/training_proxy.py
Experiment configs: experiments/ANDREA-*-TRAIN.json
Data pipeline: scripts/pull-hermes3.py, scripts/prep-megachat.py
Dashboard: scripts/live-loss-dashboard.html
Bandit specification: docs/FIREHOSE-BANDIT.md
Model documentation: docs/ANDREA.md
Hardware requirement: one NVIDIA GPU with ≥8 GB VRAM (RTX 3060 or better). Anyone can reproduce ANDREA-12M from these artifacts. The custom CUDA path is part of why: no framework version freezes, no dependency surprises five years from now.
Signals & Checkpoints
The CUDA training loop responds to two POSIX signals:
- SIGTERM: write an immediate checkpoint, then exit. Used when stopping training cleanly.
- SIGUSR1: write an immediate checkpoint, continue training. Used during the polish pivot in v3 to capture state without interrupting the run.
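The two behaviors can be sketched in Python with flags set by signal handlers & checked at step boundaries. This is an illustration of the pattern on a POSIX system, not the CUDA engine's code; the class, function, & parameter names are hypothetical.

```python
import signal

class CheckpointSignals:
    """Handlers only set flags; the training loop acts on them between steps."""
    def __init__(self):
        self.want_checkpoint = False
        self.want_exit = False
        signal.signal(signal.SIGTERM, self._on_sigterm)
        signal.signal(signal.SIGUSR1, self._on_sigusr1)

    def _on_sigterm(self, signum, frame):
        self.want_checkpoint = True
        self.want_exit = True        # checkpoint, then stop

    def _on_sigusr1(self, signum, frame):
        self.want_checkpoint = True  # checkpoint, keep going

def train(n_steps, step_fn, write_checkpoint):
    """Run up to n_steps; return the step at which training stopped."""
    signals = CheckpointSignals()
    for step in range(n_steps):
        step_fn(step)
        exit_now = signals.want_exit          # read once per step
        if exit_now or signals.want_checkpoint:
            signals.want_checkpoint = False
            write_checkpoint(step)
        if exit_now:
            return step
    return n_steps
```

Deferring the write to a step boundary means the checkpoint never captures a half-applied update. From a shell, kill -USR1 <pid> would trigger the non-stopping variant.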
Checkpoint format: [int32 step][int32 n_params][n_params × float32 weights][n_params × float32 m][n_params × float32 v]. Step counter, weight count, then weights followed by Adam moments. Resumes bit-exactly. The proxy archives .samples.json & .state.json separately on polish; .loss.json is never archived (it accumulates the full training history).
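A round-trip sketch of that layout using Python's struct module. Little-endian byte order is an assumption (the source does not state it), & the function names are hypothetical.

```python
import struct

def pack_checkpoint(step, weights, m, v):
    """Serialize as [int32 step][int32 n_params][weights][m][v], all float32."""
    n = len(weights)
    if len(m) != n or len(v) != n:
        raise ValueError("weights, m, and v must have the same length")
    blob = struct.pack("<ii", step, n)        # "<" = little-endian (assumed)
    for buf in (weights, m, v):
        blob += struct.pack(f"<{n}f", *buf)
    return blob

def unpack_checkpoint(blob):
    """Inverse of pack_checkpoint; returns (step, weights, m, v)."""
    step, n = struct.unpack_from("<ii", blob, 0)
    fmt = f"<{n}f"
    size = struct.calcsize(fmt)
    offset = struct.calcsize("<ii")           # 8-byte header
    buffers = []
    for _ in range(3):
        buffers.append(list(struct.unpack_from(fmt, blob, offset)))
        offset += size
    return (step, *buffers)

blob = pack_checkpoint(1200, [0.5, -1.25, 2.0], [0.0, 0.25, 0.5], [1.0, 1.5, 4.0])
print(len(blob), unpack_checkpoint(blob))
```

For n_params weights the file is 8 + 12 × n_params bytes. Storing the moments next to the weights is what lets a resumed AdamW step match the one the interrupted run would have taken.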