un — Grow a Language Model: Backprop In Custom CUDA

Local Gradients Multiply

Forward & Backward Kernels

The Forward Pass

ANDREA-120M's forward pass walks input through a sequence of operations:

x = embed(token_ids)         # token embeddings
for layer in range(12):      # 12 transformer blocks
    x = x + attn(LN(x))      # attention sublayer
    x = x + mlp(LN(x))       # MLP sublayer
logits = LN(x) @ embed.T     # tied output projection
loss   = cross_entropy(logits, targets)

Each operation reads input tensors & produces output tensors. The forward pass terminates in a single scalar: the cross-entropy loss for this batch.

The Backward Pass

Training updates the weights in the direction that decreases the loss. To get update directions, the engine needs:

dL/dW for every learnable W in the model

The chain rule gives this. For a chain loss = f(g(h(x))):

dL/dx = (dL/df) * (df/dg) * (dg/dh) * (dh/dx)

Each factor is a local gradient: how the output of one operation changes when its input changes by a small amount. Multiplying local gradients backward through the graph propagates the loss signal to every weight.
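A minimal numeric sketch of the product above. The three scalar functions h, g, & f are hypothetical, chosen only for illustration; they are not ANDREA operations. The product of local gradients is compared against a finite-difference estimate.

```python
import math

# Hypothetical scalar chain, chosen only for illustration: loss = f(g(h(x)))
def h(x):
    return 3.0 * x            # local gradient dh/dx = 3

def g(u):
    return math.tanh(u)       # local gradient dg/du = 1 - tanh(u)^2

def f(v):
    return v * v              # local gradient df/dv = 2v

def chain_grad(x):
    """Multiply local gradients from the output back to the input."""
    u = h(x)
    v = g(u)
    grad = 1.0                            # seed at the loss: dL/dL = 1
    grad *= 2.0 * v                       # backward through f
    grad *= 1.0 - math.tanh(u) ** 2       # backward through g
    grad *= 3.0                           # backward through h
    return grad

def finite_diff(x, eps=1e-6):
    """Central-difference estimate of dL/dx, used only as a check."""
    return (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2.0 * eps)

x0 = 0.2
analytic = chain_grad(x0)
numeric = finite_diff(x0)
print(analytic, numeric)
```

The two printed values should agree to several decimal places. The same kind of check is a common way to validate a hand-written backward kernel against its forward twin.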

Reverse-Mode Differentiation

Backprop computes gradients in reverse order: starting from dL/dL = 1 at the loss, then walking backward through cross-entropy (which yields dL/dlogits), then the output projection, then the final layer norm, then the twelve transformer blocks, then the embeddings. At each step, the incoming gradient is multiplied by the local Jacobian.

Reverse-mode is efficient when the output is a single scalar (the loss) & there are many inputs (the weights). One backward pass produces gradients for every weight in the model. Forward-mode would need one pass per weight; for ANDREA-120M with ~120M weights, forward-mode is infeasible.
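The pass counts can be seen on a toy problem. This sketch assumes a hypothetical loss L(w) = (w · x)^2 with four weights; it is not ANDREA's loss. Forward-mode pushes one tangent direction per pass, so it needs one pass per weight, while reverse-mode returns every weight gradient from a single pass.

```python
# Toy loss L(w) = (w . x)^2 with a handful of weights (hypothetical example).
x = [1.0, -2.0, 0.5, 3.0]
w = [0.1, 0.2, -0.3, 0.4]

def forward_mode_pass(w, x, tangent):
    """One forward-mode pass: derivative of L along a single tangent direction."""
    s = sum(wi * xi for wi, xi in zip(w, x))           # primal value of w . x
    ds = sum(ti * xi for ti, xi in zip(tangent, x))    # tangent of w . x
    return 2.0 * s * ds                                # tangent of s^2

def reverse_mode_pass(w, x):
    """One reverse-mode pass: dL/dw_i for every i at once."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    grad_s = 2.0 * s                                   # dL/ds, seeded by dL/dL = 1
    return [grad_s * xi for xi in x]                   # local gradient ds/dw_i = x_i

# Forward mode: one pass per weight, each with a one-hot tangent.
fwd_grads = []
fwd_passes = 0
for i in range(len(w)):
    tangent = [1.0 if j == i else 0.0 for j in range(len(w))]
    fwd_grads.append(forward_mode_pass(w, x, tangent))
    fwd_passes += 1

# Reverse mode: a single pass.
rev_grads = reverse_mode_pass(w, x)
rev_passes = 1

print(fwd_passes, rev_passes)
```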

Why Reverse-Mode

ANDREA-120M has ~120M weights & produces a single scalar loss per training step. Compare reverse-mode automatic differentiation against forward-mode. State (1) which mode produces all weight gradients in a single backward pass; (2) how many forward-mode passes would be needed to compute all 120M weight gradients; (3) which mode ANDREA uses & why.

Every Forward Op Gets A Backward Twin

The Pairing Discipline

microgpt_cuda.cu ships two CUDA kernels for every operation: one that computes the forward output, one that computes input gradients given output gradients. The pairing is one-to-one:

Forward kernel	Backward kernel	Operation
`k_embed_fwd`	`k_embed_bwd`	Token embedding lookup
`k_layernorm_fwd`	`k_layernorm_bwd`	Layer normalization
`k_attn_qkv_fwd`	`k_attn_qkv_bwd`	Q, K, V projections
`k_attn_fwd`	`k_attn_bwd`	Scaled dot-product attention
`k_attn_out_fwd`	`k_attn_out_bwd`	Output projection W_O
`k_mlp_fwd`	`k_mlp_bwd`	MLP (with GELU)
`k_residual_add`	`k_residual_add_bwd`	Residual connection
`k_loss_fwd`	`k_loss_bwd`	Cross-entropy loss

Eight operation pairs cover the full transformer. A few utility kernels sit alongside them: k_grad_norm_partial, k_grad_norm_final, & k_grad_scale, used for gradient clipping (see activity 75).

A Backward Kernel's Job

Given the gradient flowing in from later layers (grad_output), a backward kernel computes:

1. grad_input: the gradient with respect to the operation's input tensor. This gets passed further backward.

2. grad_weight: the gradient with respect to learnable parameters in the operation. This is accumulated into the weight's gradient buffer, which the optimizer reads when it takes a step.

Both are computed in a single kernel launch. CUDA threads cooperate on tiles of the gradient tensor in parallel.
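A plain-Python sketch of the two outputs, using a bare linear layer y = x @ W as a stand-in. The function names linear_fwd & linear_bwd are hypothetical & do not appear in microgpt_cuda.cu; a real kernel would tile this work across CUDA threads rather than loop.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def linear_fwd(x, W):
    """Forward: y = x @ W, with x [tokens, d_in] and W [d_in, d_out]."""
    return matmul(x, W)

def linear_bwd(x, W, grad_output, grad_W):
    """Backward: returns grad_input and accumulates into grad_W in place."""
    grad_input = matmul(grad_output, transpose(W))     # [tokens, d_in]
    contrib = matmul(transpose(x), grad_output)        # [d_in, d_out]
    for i in range(len(grad_W)):
        for j in range(len(grad_W[0])):
            grad_W[i][j] += contrib[i][j]              # += so repeated calls sum
    return grad_input

x = [[1.0, 2.0], [3.0, 4.0]]                  # 2 tokens, d_in = 2 (saved from forward)
W = [[0.5, -1.0, 0.0], [2.0, 1.0, 1.0]]       # d_in = 2, d_out = 3
grad_output = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad_W = [[0.0] * 3 for _ in range(2)]        # zeroed gradient buffer

y = linear_fwd(x, W)
grad_input = linear_bwd(x, W, grad_output, grad_W)
print(grad_input)
print(grad_W)
```

Note that the backward function needs x, the forward input, to form grad_W. That dependency is why forward values have to be kept around until the backward pass.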

Saved Tensors

Backward computation often needs values from the forward pass. For example, k_layernorm_bwd needs the mean & variance computed during forward; k_mlp_bwd needs the GELU pre-activation. The training engine stores these in dedicated buffers during forward, then reads them during backward.

Memory cost: each saved tensor is roughly the size of the forward output it was saved from. For ANDREA-120M with batch=8, seq=1024, d_model=768, one saved tensor at 4 bytes per element is 8 × 1024 × 768 × 4 bytes ≈ 25 MB. Across 12 layers & multiple saved tensors per layer, activations dominate VRAM during training (~5-10 GB on a 24 GB card).
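The per-tensor figure can be checked directly. The shapes & the 4-byte element size come from the text; the variable names are illustrative.

```python
# Shapes taken from the text; 4 bytes per element as in the figure above.
batch, seq, d_model = 8, 1024, 768
bytes_per_elem = 4

saved_bytes = batch * seq * d_model * bytes_per_elem
saved_mb = saved_bytes / 1e6           # decimal megabytes
saved_mib = saved_bytes / 2**20        # binary mebibytes

print(saved_bytes, round(saved_mb, 1), saved_mib)
```

The result is about 25 MB in decimal units, or exactly 24 MiB in binary units.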

Tracing One Backward Step

ANDREA-120M completes a forward pass through one transformer block. Trace what happens during the backward pass through that same block (in pre-norm structure: `x = x + Attention(LN(x))` then `x = x + MLP(LN(x))`). Name the backward kernels in the order they fire, & state which forward kernel each one pairs with. Cover at least 4 kernels.

Where Gradients Live In Memory

One Gradient Tensor Per Weight Tensor

Every learnable weight tensor in ANDREA-120M has a matching gradient tensor of identical shape. For each block:

W_Q       [768, 768]     ↔   grad_W_Q       [768, 768]
W_K       [768, 768]     ↔   grad_W_K       [768, 768]
W_V       [768, 768]     ↔   grad_W_V       [768, 768]
W_O       [768, 768]     ↔   grad_W_O       [768, 768]
W_1       [768, 3072]    ↔   grad_W_1       [768, 3072]
W_2       [3072, 768]    ↔   grad_W_2       [3072, 768]
LN1.gamma [768]          ↔   grad_LN1.gamma [768]
LN1.beta  [768]          ↔   grad_LN1.beta  [768]
LN2.gamma [768]          ↔   grad_LN2.gamma [768]
LN2.beta  [768]          ↔   grad_LN2.beta  [768]

The model also has token embeddings, position embeddings, & a final layer norm, each with its own gradient tensor. Total gradient memory matches weight memory: ~120M floats, ~480 MB at FP32, ~240 MB at FP16.
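A short tally of the per-block shapes listed above. It counts only the tensors in that list (no biases appear there, so none are counted) & leaves out the embeddings, whose sizes are not given here.

```python
from math import prod

# Per-block weight shapes, copied from the list above.
block_shapes = {
    "W_Q": (768, 768),
    "W_K": (768, 768),
    "W_V": (768, 768),
    "W_O": (768, 768),
    "W_1": (768, 3072),
    "W_2": (3072, 768),
    "LN1.gamma": (768,),
    "LN1.beta": (768,),
    "LN2.gamma": (768,),
    "LN2.beta": (768,),
}

# Each gradient tensor has the same shape as its weight, so the counts match.
per_block = sum(prod(shape) for shape in block_shapes.values())
n_layers = 12
all_blocks = per_block * n_layers

print(per_block, all_blocks)
```

The twelve blocks account for roughly 85M of the ~120M weights; the rest sits in the embeddings & the final layer norm.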

Accumulation Across Microbatches

ANDREA's batch_size = 8 fits in VRAM at FP16. Larger effective batches require gradient accumulation: run multiple forward+backward passes on small batches, summing gradients into the same buffer, then take one optimizer step.

for microbatch in range(n_microbatches):
    forward(microbatch)
    backward()           # ADDS to grad buffers, doesn't overwrite
scale_grads(1.0 / n_microbatches)  # average across microbatches
optimizer_step()
zero_grads()             # reset for next training step

Backward kernels use += semantics, not =. Each call adds gradient contributions to the existing buffer; the buffer holds the running sum until zero_grads() clears it.
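A runnable toy version of the loop above. It assumes a hypothetical one-weight model (prediction = w * x, mean squared error) & a plain gradient step standing in for optimizer_step(); none of it is ANDREA code. With equal-sized microbatches, the accumulated & averaged gradient matches the gradient of the full batch.

```python
# Toy model: prediction = w * x, loss = mean((w * x - y)^2) over a batch.
def backward(w, batch, grad_buf):
    """Adds this batch's mean gradient into grad_buf[0] (+=, not =)."""
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    grad_buf[0] += g

def scale_grads(grad_buf, factor):
    grad_buf[0] *= factor

def zero_grads(grad_buf):
    grad_buf[0] = 0.0

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
microbatches = [data[0:2], data[2:4]]        # equal-sized microbatches
w, lr = 0.5, 0.01
grad_buf = [0.0]

for mb in microbatches:
    backward(w, mb, grad_buf)                # running sum across microbatches
scale_grads(grad_buf, 1.0 / len(microbatches))
accumulated = grad_buf[0]

# Reference: gradient of the same loss over the full batch in one pass.
full_buf = [0.0]
backward(w, data, full_buf)
full_batch = full_buf[0]

w -= lr * accumulated                        # stand-in for optimizer_step()
zero_grads(grad_buf)                         # reset for the next training step
print(accumulated, full_batch)
```

If zero_grads() were skipped, the next step's gradients would be added on top of this step's, which is the usual symptom of a missing reset.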

The Optimizer State

AdamW (activity 73) holds two more buffers per weight: first moment m & second moment v. Total training-time memory:

weights:    1× weight count
gradients:  1× weight count
Adam m:     1× weight count
Adam v:     1× weight count
saved acts: ~2-4× depending on layers & batch
──────────────────────────────────────────
total:      ~6-8× weight count

ANDREA-120M at FP16: ~240 MB × 4 buffers (weight, grad, m, v) ≈ 1 GB, plus ~5-10 GB of activations, gives ~6-11 GB total by this accounting. Comfortably below the RTX 4090's 24 GB ceiling. ANDREA-12M trained in 1.4 GB; the 10× parameter scaling brings roughly 4-8× the memory by this estimate.

Sizing Gradient Buffers

ANDREA-120M holds ~120,000,000 weights & uses gradient accumulation across 4 microbatches per training step. Compute: (a) gradient buffer size in MB at FP16; (b) total memory for weights + gradients + Adam m + Adam v at FP16; (c) how many separate `forward()` + `backward()` calls fire per training step. Show your arithmetic.

Full Control Of Memory & Precision

What Generic Frameworks Cost

PyTorch & JAX make autograd convenient: write Python code, get gradients automatically. The cost: a generic dispatch layer between your code & CUDA. Every operation goes through Python interpreter overhead, framework bookkeeping, & dynamic kernel selection. For training a small language model on one GPU, that overhead matters.

Concrete costs ANDREA avoids:

1. Python interpreter latency. Every PyTorch op crosses the Python/C++ boundary. For ~100 kernel launches per training step at ~9 steps/min, that's ~900 boundary crossings per minute. C-level dispatch eliminates this.

2. Framework allocator unpredictability. PyTorch's caching allocator gives good throughput on average but unpredictable peak memory. ANDREA's training engine pre-allocates every buffer at startup; no reallocation during training, no fragmentation, no surprise OOMs at step 100K.

3. Generic kernel selection. PyTorch picks kernels at runtime via heuristics. ANDREA picks kernels at compile time, tuned to RTX 4090 tensor core tile sizes.

4. Mixed-precision plumbing. ANDREA-120M's FP16 cuBLAS path & ANDREA's FP8 E4M3 tensor core experiments require precise control over which tensors live at which precision. Generic frameworks expose this control through layered APIs; custom CUDA writes it directly.
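The pre-allocation idea in item 2 can be sketched as a bump allocator: one fixed block, carved into named buffers once at startup, so any budget overrun surfaces before training begins. The Arena class, its 256-byte alignment, & the 2 GiB capacity are assumptions made for illustration; how ANDREA's engine actually lays out its buffers is not described here.

```python
class Arena:
    """Hypothetical bump allocator: one fixed block, carved up once at startup."""

    def __init__(self, capacity_bytes, align=256):
        self.capacity = capacity_bytes
        self.align = align
        self.offset = 0
        self.buffers = {}                     # name -> (start offset, size)

    def alloc(self, name, size_bytes):
        # Round the current offset up to the next aligned address.
        start = -(-self.offset // self.align) * self.align
        if start + size_bytes > self.capacity:
            raise MemoryError(f"{name}: budget exceeded at startup")
        self.buffers[name] = (start, size_bytes)
        self.offset = start + size_bytes
        return start

n_params = 120_000_000
arena = Arena(capacity_bytes=2 * 1024**3)     # assumed 2 GiB budget for this sketch
for name in ("weights", "grads", "adam_m", "adam_v"):
    arena.alloc(name, n_params * 2)           # 2 bytes per element, assuming FP16

print(arena.offset)
```

Because every buffer is placed before the first step, there is nothing left to allocate or free during training, so fragmentation cannot build up over a long run.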

The Tradeoff

Custom CUDA costs: more code to write, more bugs to find, no community ecosystem. ANDREA's microgpt_cuda.cu is ~6000 lines of hand-written CUDA that took months to debug. Each new operation requires writing a forward kernel, a backward kernel, & tests.

What ANDREA gains:

- Full reproducibility. The training pipeline is one C binary plus one Python proxy. No version drift across PyTorch releases, no CUDA version mismatches with framework wheels.

- Bit-exact resumes. SIGTERM triggers a checkpoint write that captures every tensor exactly as the GPU sees it. Resume picks up the same loss trajectory the run was on.

- Predictable memory. ANDREA-120M trained for 200K steps with no OOMs. All memory was accounted for at engine startup.

- Direct hardware access. Tensor core tile sizes, FP8 E4M3 settings, asynchronous memory copies: all directly addressable in CUDA, opaque in generic frameworks.

Reproducibility As Mission

The ANDREA whitepaper section 9 lists the full reproducibility stack:

Training engine: microgpt/microgpt_cuda.cu
Training proxy: microgpt/training_proxy.py
Experiment configs: experiments/ANDREA-*-TRAIN.json
Data pipeline: scripts/pull-hermes3.py, scripts/prep-megachat.py
Dashboard: scripts/live-loss-dashboard.html
Bandit specification: docs/FIREHOSE-BANDIT.md
Model documentation: docs/ANDREA.md

Hardware requirement: one NVIDIA GPU with ≥8 GB VRAM (RTX 3060 or better). Anyone can reproduce ANDREA-12M from these artifacts. The custom CUDA path is part of why: no framework version freezes, no dependency surprises five years from now.

Signals & Checkpoints

The CUDA training loop responds to two POSIX signals:

- SIGTERM: write an immediate checkpoint, then exit. Used when stopping training cleanly.

- SIGUSR1: write an immediate checkpoint, continue training. Used during the polish pivot in v3 to capture state without interrupting the run.

Checkpoint format: [int32 step][int32 n_params][n_params × float32 weights][n_params × float32 m][n_params × float32 v]. Step counter, weight count, then weights followed by Adam moments. Resumes bit-exactly. The proxy archives .samples.json & .state.json separately on polish; .loss.json is never archived (it accumulates the full training history).
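A minimal sketch of reading & writing that layout with Python's struct module. Little-endian byte order & the function names pack_checkpoint & unpack_checkpoint are assumptions; the field order & types follow the format string above.

```python
import struct

def pack_checkpoint(step, weights, m, v):
    """Serialize as [int32 step][int32 n_params][weights][m][v], float32 each."""
    n = len(weights)
    if len(m) != n or len(v) != n:
        raise ValueError("weights, m, and v must have the same length")
    header = struct.pack("<ii", step, n)                  # assumed little-endian
    body = struct.pack(f"<{3 * n}f", *weights, *m, *v)
    return header + body

def unpack_checkpoint(blob):
    """Inverse of pack_checkpoint: returns (step, weights, m, v)."""
    step, n = struct.unpack_from("<ii", blob, 0)
    values = struct.unpack_from(f"<{3 * n}f", blob, 8)    # header is 8 bytes
    return step, list(values[:n]), list(values[n:2 * n]), list(values[2 * n:])

# Values chosen to be exactly representable in float32.
weights = [0.5, -1.25, 2.0]
m = [0.0, 0.125, -0.5]
v = [1.0, 0.25, 4.0]

blob = pack_checkpoint(1000, weights, m, v)
step, w2, m2, v2 = unpack_checkpoint(blob)
print(len(blob), step)
```

Re-packing the unpacked values reproduces the same bytes, which is the property a bit-exact resume depends on.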

Why Not PyTorch

ANDREA could have used PyTorch's autograd instead of writing `microgpt_cuda.cu` by hand. Give two distinct engineering reasons why ANDREA chose custom CUDA. One reason should reference memory or precision control; the other should reference reproducibility, framework dependencies, or long-term maintenance.