un — Grow a Language Model: LR Warmup & Cosine Decay

Two Problems at Either End of Training

Early-Step Problem: Fresh Weights Cannot Take Big Steps

At step 0, every weight starts as a small random number sampled from a near-zero distribution. Activations stay near zero. Gradients carry almost no information about a final solution. Apply a peak learning rate to those gradients & a model jumps far away from initialization in directions that do not encode meaningful structure.

ANDREA-120M v1 made this mistake. No warmup. Step 1 used lr = 0.0003 on freshly initialized weights. Result: model landed in a bad parameter basin within a few hundred steps. Loss numbers looked reasonable; samples produced repetition loops by step 80K & never recovered.

Late-Step Problem: Big Steps Cannot Polish a Solution

By step 100K, a model has learned coarse structure. Gradients now carry fine-grained signal: which token weights need a small nudge, which attention head needs slight rebalancing. Applying a peak learning rate at this stage overshoots every fine adjustment, so the weights oscillate around an optimum without settling.

Two problems, opposite ends of training. One schedule, two regions: ramp up gently, decay smoothly down.

Linear Warmup: First 2000 Steps

The Formula

ANDREA-120M v2 uses linear warmup over 2000 steps:

lr(t) = lr_scheduled(t) * min(1, (t + 1) / warmup_steps)

where t is the step number (0-indexed), warmup_steps = 2000, & lr_scheduled(t) is what the cosine schedule would prescribe ignoring warmup.

Reading the formula:

- At t = 0: lr = lr_scheduled(0) * min(1, 1/2000) = lr_scheduled(0) * 0.0005. Tiny first step.

- At t = 1000: lr = lr_scheduled(1000) * min(1, 1001/2000) ≈ lr_scheduled(1000) * 0.5. Half-strength.

- At t = 2000: lr = lr_scheduled(2000) * min(1, 2001/2000) = lr_scheduled(2000) * 1.0. Full strength (the multiplier first reaches 1 at t = 1999).

- At t > 2000: clamp keeps the multiplier at 1, warmup no longer affects anything, cosine decay takes over alone.

Linear ramp from zero gives a model 2000 steps to form coarse representations before AdamW & gradient clipping see full-strength updates. By step 2000, weights have drifted enough that peak lr no longer pushes them into a bad basin.
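The warmup formula above can be sketched in a few lines of Python. This is a minimal illustration, not ANDREA's training code; the function names `warmup_multiplier` & `warmed_up_lr` are hypothetical.

```python
def warmup_multiplier(t: int, warmup_steps: int = 2000) -> float:
    """Linear warmup factor for 0-indexed step t, clamped at 1."""
    return min(1.0, (t + 1) / warmup_steps)


def warmed_up_lr(t: int, lr_scheduled: float, warmup_steps: int = 2000) -> float:
    """Scale the scheduled learning rate by the warmup factor."""
    return lr_scheduled * warmup_multiplier(t, warmup_steps)


# The multiplier grows by 1/warmup_steps per step, then stays at 1.
for t in (0, 999, 1999, 10_000):
    print(f"step {t:>6}: multiplier = {warmup_multiplier(t):.4f}")
```

The `(t + 1)` term is what keeps the very first update from being multiplied by exactly zero.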

Computing LR During Warmup

ANDREA-120M v2 uses `lr_scheduled = 0.0003` (peak) & `warmup_steps = 2000`. Ignore cosine decay during warmup (assume `lr_scheduled` stays flat). Compute the actual learning rate at: (a) step 0, (b) step 500, (c) step 2000, (d) step 5000. Show your arithmetic.

Cosine Decay After Warmup

The Curve

After warmup ends at step 2000, learning rate follows a cosine curve from peak down to a floor lr_min over the remaining steps:

lr(t) = lr_min + (lr_peak - lr_min) * 0.5 * (1 + cos(pi * progress))

where progress = (t - warmup_steps) / (total_steps - warmup_steps). At progress = 0 (just past warmup), cos(0) = 1, lr = peak. At progress = 1 (final step), cos(pi) = -1, lr = lr_min (typically 0 or a tiny floor).
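A minimal Python sketch of the cosine formula, using the lesson's symbols as parameter names. One assumption not stated in the lesson: progress is clamped to [0, 1], so the function returns the peak for steps inside the warmup window (where progress would otherwise be negative) & the floor for any step past the end.

```python
import math


def cosine_lr(t: int, lr_peak: float, lr_min: float,
              warmup_steps: int, total_steps: int) -> float:
    """Cosine-decayed learning rate at 0-indexed step t, ignoring warmup."""
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    # Assumption: clamp so steps outside the decay region stay at peak or floor.
    progress = min(1.0, max(0.0, progress))
    return lr_min + (lr_peak - lr_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Start, middle, & end of a 200K-step run with a 2000-step warmup.
for t in (2000, 101_000, 200_000):
    print(f"step {t:>7}: lr = {cosine_lr(t, 0.0003, 0.0, 2000, 200_000):.3e}")
```

Multiplying this value by the warmup factor min(1, (t + 1) / warmup_steps) gives the full schedule described earlier.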

Why Cosine, Not Linear or Exponential?

Cosine decay starts slow (curve nearly flat near peak), accelerates through the middle, then slows again near zero. Three benefits:

1. Plateau near peak. Early post-warmup steps still get nearly full lr, letting the model use a long stretch of high learning rate to build representations.

2. Smooth transition through middle. No abrupt jumps that AdamW must absorb.

3. Plateau near zero. Final steps get tiny lr for fine polishing, similar to simulated annealing.
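To make the two plateaus concrete, this short sketch compares the cosine factor against a plain linear decay at the same progress values. Both helper names are hypothetical; the factors multiply the (lr_peak - lr_min) range.

```python
import math


def cosine_factor(progress: float) -> float:
    """Fraction of the peak-to-floor range kept by cosine decay."""
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def linear_factor(progress: float) -> float:
    """Fraction of the peak-to-floor range kept by linear decay."""
    return 1.0 - progress


for progress in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"progress={progress:.1f}  "
          f"cosine={cosine_factor(progress):.4f}  "
          f"linear={linear_factor(progress):.4f}")
```

At progress 0.1 cosine keeps about 97.6% of the range versus 90% for linear; at progress 0.9 it keeps about 2.4% versus 10%. Both curves cross at the midpoint.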

ANDREA-120M trains for 200K steps total; 198K of those are cosine decay region after the 2000-step warmup.

ANDREA-12M's Warm Restart at Step 25K

The Plateau

ANDREA-12M trained for 60K steps with cosine decay from lr = 0.0004 peak. Around step 22K, loss plateaued at EMA ~2.4. Cosine decay had taken lr down to ~0.00015. The bandit kept feeding diverse data; the model stopped improving.

Diagnosis: lr had decayed too far for the model to escape its current basin. Hermes data was about to enter the curriculum (step 25K), bringing 590K new conversations. The model needed energy to absorb that data shock.

The Restart

At step 25K, the schedule executed a warm restart: spike lr from 0.00015 (decayed) back up to 0.0004 (original peak), then resume cosine decay over the remaining steps.

Loshchilov & Hutter (2017) named this technique "SGDR" (stochastic gradient descent with warm restarts). The intuition: a high lr adds enough kinetic energy to escape a local basin & explore neighboring ones; subsequent cosine decay re-anneals into a better basin.
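A rough sketch of the post-restart segment only: lr jumps back to peak at the restart step & re-anneals along a fresh cosine curve to the end of training. The function name `restarted_lr` is hypothetical, & a floor of 0 is assumed since the lesson does not state one. The pre-restart schedule is deliberately left out because its exact decay horizon is not given here.

```python
import math


def restarted_lr(t: int, restart_step: int, total_steps: int,
                 lr_peak: float, lr_min: float = 0.0) -> float:
    """Learning rate at step t >= restart_step after a single warm restart."""
    if t < restart_step:
        raise ValueError("this sketch only models steps at or after the restart")
    progress = (t - restart_step) / (total_steps - restart_step)
    progress = min(1.0, progress)
    return lr_min + (lr_peak - lr_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Figures from the lesson: restart at 25K, 60K total steps, peak 0.0004.
for t in (25_000, 35_000, 42_500, 60_000):
    print(f"step {t:>6}: lr = {restarted_lr(t, 25_000, 60_000, 0.0004):.3e}")
```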

Outcome. Loss EMA dropped from 2.40 to 2.10 over the next 10K steps after the restart. Model shipped at step 43.6K with SMMA loss 2.0, demonstrating coherent Q&A turn structure.

ANDREA-120M v2 chose NOT to use warm restarts: with 200K steps available & a much larger parameter count, smooth monotonic decay produced steadier convergence. Restart works best when training is short & a plateau coincides with a known data shift.

Diagnosing v1's Failure

ANDREA-120M v1 used `lr = 0.0003` from step 1, no warmup. By step 80K samples produced `region region region region`. Reason mechanistically: walk through what happens to a freshly initialized weight matrix in steps 1 to 100 under `lr = 0.0003` vs under a 2000-step warmup. Why does the no-warmup path land in a bad basin?

Schedule Choices in Practice

If you were training a 120M model on a noisy dataset for ONLY 50K steps total, would you use a longer or shorter warmup than ANDREA-120M v2's 2000 steps? Justify with one mechanistic argument.

Adjacent Activities

Three siblings link to LR schedule:

- Activity 10: AdamW. Warmup gives AdamW's bias correction time to stabilize. Without warmup, the 10x amplification at step 1 (the first-moment bias correction 1 / (1 - beta1^t), which equals 10 at t = 1 when beta1 = 0.9) multiplies whatever noise gradients carry; with warmup, the multiplier hits real signal.

- Activity 12: Gradient clipping. Clipping caps gradient L2 norm at 1.0 BEFORE AdamW. Warmup damps lr; clipping damps g. Together they keep early steps safe even on shock-prone curricula.

- Activity 22: Checkpointing. A warm restart requires loading optimizer state (m, v, step counter) from a checkpoint, then mutating the schedule mid-run. ANDREA-12M's restart at step 25K demonstrates this; it took two attempts to get the state-loading logic right.
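The first two links above can be checked numerically. This sketch assumes beta1 = 0.9 (the value implied by a 10x factor at step 1) & uses plain Python lists in place of real gradient tensors; both function names are hypothetical.

```python
import math


def bias_correction_factor(step: int, beta: float = 0.9) -> float:
    """AdamW first-moment correction 1 / (1 - beta**step), 1-indexed step."""
    return 1.0 / (1.0 - beta ** step)


def clip_by_l2_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# The correction factor shrinks toward 1 as steps accumulate.
for step in (1, 10, 100):
    print(f"step {step:>3}: correction = {bias_correction_factor(step):.3f}")

# A gradient with norm 5 is rescaled to norm 1 before the optimizer sees it.
print(clip_by_l2_norm([3.0, 4.0]))
```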

Schedule, optimizer, & clipping form a stability triangle. Drop a vertex, watch ANDREA repeat its v1 collapse.