Grow a Language Model: From microGPT to ANDREA-120M

Sixteen Days of region region region

The Run That Ended

ANDREA-120M v1 launched 2026-03-22 & was terminated 2026-04-15 at step 165,000 of 200,000 planned. EMA loss minimum: 3.23 at step 110K (uniform-guess baseline: ln(8449) = 9.04, so the loss looked respectable). Samples did not.
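The baseline quoted above can be checked in a couple of lines. This is a minimal sketch that assumes 8449 is the vocabulary size (the text gives only ln(8449)); the function name is hypothetical.

```python
import math

VOCAB_SIZE = 8449  # assumption: the vocabulary size implied by ln(8449)

def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy in nats of a model that gives every token equal probability."""
    return math.log(vocab_size)

baseline = uniform_baseline_loss(VOCAB_SIZE)
gap = baseline - 3.23  # distance between the baseline & the EMA loss minimum

print(f"uniform baseline: {baseline:.2f} nats")
print(f"gap at step 110K: {gap:.2f} nats")
```

A loss far below the baseline only shows that the model beats uniform guessing on next-token prediction; it says nothing about whether sampled text is usable.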

Step 80K:  region region region region region region region
Step 110K: ''''' ''''' '' ''' '' ''' '''?' ''' ' '' '' '
Step 140K: games, games, games, games, games, games
Step 165K: Budy Budy Budy Budy Budy Budy Budy Budy Budy

Sixteen days of RTX 4090 compute. 130W continuous. Garbage from step 80K onward.

From microGPT to ANDREA-120M

Why microGPT Worked but 120M Did Not

ANDREA-12M used the same training proxy & passed. Smaller weight matrices proved more robust to gradient shocks. Scaling to 120M parameters multiplied every fragility. Five failures compounded.

Five Compounding Failures

Failure 1: No gradient clipping. Source transitions every 7-42 steps produced unbounded gradient spikes. A single bad batch at 120M can push the model into a degenerate attractor the optimizer cannot escape. The 12M model survived because smaller weights tolerated the shocks.

Failure 2: No LR warmup. Learning rate jumped from 0 to peak immediately on freshly initialized weights. The model fell into a bad basin before any representations could form.

Failure 3: No weight decay. Vanilla Adam allowed arbitrarily large weights that amplified repetition patterns at 120M capacity.

Failure 4: No sample quality monitoring. eval_chat_quality() was wired only to the legacy multi-phase runner; the firehose curriculum never invoked it. The model produced garbage from step 80K onward, undetected for 10+ days.

Failure 5: Bandit rewarded repetitive sources. repo-docs, repo-docstrings, & unfirehose-chat scored highest (mean rewards 340-453) because list-structured content reduces cross-entropy trivially. The bandit fed the model more of what made it degenerate.

Compounding

No one failure alone would have collapsed v1. Each amplified the others. Gradient shocks (1) without warmup (2) hit a freshly initialized model whose weights were free to grow without bound (3), producing repetition that the bandit rewarded (5) while no one was watching the output (4). Five intersecting causes, one collapse.

Why Five Failures, Not One

Pick any TWO of the five v1 failures. For each, explain in one sentence: (a) what the failure was; (b) how it specifically interacted with another of the five failures to compound the damage.

One Fix Per Failure

v2 Configuration (2026-04-15)

Fix	Targets failure	Implementation
Gradient clipping	F1 (no clipping)	Global L2 norm, max_norm=1.0; three CUDA kernels (k_grad_norm_partial, k_grad_norm_final, k_grad_scale) compute & apply pre-Adam
LR warmup	F2 (no warmup)	Linear ramp 0 to peak over 2000 steps. lr(t) = lr_scheduled(t) * min(1, (t+1)/warmup_steps)
AdamW	F3 (no weight decay)	Decoupled weight decay (Loshchilov & Hutter 2019), weight_decay=0.01. p -= lr * (m_hat/(sqrt(v_hat)+eps) + weight_decay * p)
Coherence-gated early stopping	F4 (no monitoring)	Score every sample (bigram/trigram/word/char diversity). Auto-halt after 5 consecutive samples scoring below 30
Curriculum warmup	F5 (bandit eats repetition)	First 20K steps restricted to 7 chat/prose sources; firehose activates after; repo-docstrings excluded entirely

Plus sample_every dropped from 200 to 100 steps (audit cadence doubled), & repo-docs cap dropped from 0.5 to 0.3.
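The three optimizer-side fixes in the table (clipping, warmup, AdamW) can be sketched together in plain Python. This illustrates the formulas only; it is not the CUDA kernels named above. beta1, beta2, eps, & the small constant in the clip scale are assumed defaults the text does not state, & all function names are hypothetical.

```python
import math

def warmup_scale(step: int, warmup_steps: int = 2000) -> float:
    """Linear ramp: min(1, (t+1)/warmup_steps)."""
    return min(1.0, (step + 1) / warmup_steps)

def clip_global_norm(grads, max_norm: float = 1.0):
    """Scale all gradients so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))  # 1e-6 avoids division by zero
    return [g * scale for g in grads], total

def adamw_step(params, grads, m, v, step, lr_scheduled,
               beta1=0.9, beta2=0.999, eps=1e-8,
               weight_decay=0.01, warmup_steps=2000, max_norm=1.0):
    """One in-place update: clip first, then warm up the LR, then AdamW."""
    grads, _ = clip_global_norm(grads, max_norm)
    lr = lr_scheduled * warmup_scale(step, warmup_steps)
    t = step + 1  # bias correction uses a 1-based step count
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        # Decoupled decay: the weight_decay term bypasses the Adam scaling.
        params[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * params[i])
    return params

# A gradient spike of norm 5 is scaled down to norm ~1 before Adam sees it.
clipped, raw_norm = clip_global_norm([3.0, 4.0])
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(raw_norm, round(clipped_norm, 4))
```

The order matters: clipping runs before the Adam moments are updated, so a single bad batch cannot inflate m & v for many steps afterwards.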

Back-Test

Coherence gate back-tested on v1: would have triggered at step 132K, saving 3.8 days of compute. The gate alone would have cut v1's wasted compute by ~30%; the other four fixes prevent v1 from ever reaching that gate trigger.
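A coherence gate of this kind can be sketched as below. The real scoring formula is not given in the text, so this version is an assumption: it averages four distinct-ratio measures (word, bigram, trigram, character) onto a 0-100 scale, & the character target of 20 is an arbitrary illustrative constant. Only the threshold of 30 & the patience of 5 come from the table above.

```python
def distinct_ratio(items) -> float:
    """Fraction of items that are unique; 0.0 for an empty sequence."""
    return len(set(items)) / len(items) if items else 0.0

def ngrams(tokens, n: int):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coherence_score(text: str) -> float:
    """Hypothetical 0-100 diversity score; higher means less repetitive."""
    words = text.split()
    chars = [c for c in text if not c.isspace()]
    parts = [
        distinct_ratio(words),
        distinct_ratio(ngrams(words, 2)),
        distinct_ratio(ngrams(words, 3)),
        min(1.0, len(set(chars)) / 20),  # assumed target: 20 distinct characters
    ]
    return 100 * sum(parts) / len(parts)

class CoherenceGate:
    """Signals a halt after `patience` consecutive samples below `threshold`."""

    def __init__(self, threshold: float = 30.0, patience: int = 5):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def observe(self, text: str) -> bool:
        if coherence_score(text) < self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # one healthy sample resets the count
        return self.streak >= self.patience

gate = CoherenceGate()
bad = "Budy " * 9  # the step 165K sample from v1
halts = [gate.observe(bad) for _ in range(5)]
print(halts)
```

Under this assumed formula the step 165K sample scores well below 30, so five such samples in a row trip the gate. A production scorer would need tuning against real samples, since punctuation-only output can look diverse at the n-gram level.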

What v2 Did NOT Fix

Data contamination. v2 trusted hermes3-* sources as 'pre-clean' because they came from an LLM teacher. DEEP_CLEAN_SKIP in the Makefile excluded hermes3-general, hermes3-creative, & hermes3-roleplay from make deep-clean. unfirehose-chat captured agent system prompts as user turns. Those two defects waited at the data layer, ready to surface.

Mapping Fixes to Failures

Three of v2's fixes connect cleanly to one v1 failure each. Match: (a) gradient clipping (max_norm=1.0); (b) LR warmup (2000-step linear ramp); (c) AdamW with weight_decay=0.01. For each, name the v1 failure it addresses & state in one sentence WHY this specific fix counters that failure.

Step 15K: Two Data Defects Surface

What v2 Saw

v2 launched 2026-04-15. By step ~15K of 200K (7.5% complete), samples produced agent-harness ornaments (○ ●) & article-dominance fallback (a = 26% of words at step 14,966; the = 21% at step 14,798). The five v2 stability fixes were working correctly. The failure had moved from architecture to data.
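The article-dominance figures above are word-share measurements. A minimal way to compute one is sketched here, assuming whitespace tokenization & lowercasing (the text does not say how the audit tokenized samples); the function name is hypothetical.

```python
from collections import Counter

def top_word_share(sample: str):
    """Return the most frequent word & its share of all words in the sample."""
    words = sample.lower().split()
    if not words:
        return "", 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)

word, share = top_word_share("a cat a dog a bird the a")
print(word, f"{share:.0%}")
```

A share in the range reported above (21-26% for a single article) is a cheap signal that the model has fallen back to high-frequency function words.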

Two Independent Pipeline Defects

Defect A: unfirehose-chat captured agent system prompts as user turns. unfirehose-chat builds from harness session JSONL files at ~/.claude/, ~/.fetch/, ~/.uncloseai/. The ingest pipeline converted multi-section agent system prompts (# Agent X, ## Identity, ## Rules, etc.) into the user-turn slot of > user / < assistant pairs. The model learned that 'users' speak in multi-section markdown, & reproduced those ornaments in its own outputs.

Defect B: hermes3-* bypassed all filters. DEEP_CLEAN_SKIP in the Makefile excluded hermes3-general, hermes3-creative, & hermes3-roleplay from make deep-clean on the false assumption that LLM-distilled data was pre-clean. An exhaustive scan showed the existing filters, when applied, would reject 87-93% of hermes3 lines (oversize paragraphs >2000 chars overflowing block_size=1024; translation responses in CJK/Cyrillic/Arabic; low-bigram-diversity runs).

v2.5 Patch (commit de24332, 2026-04-18)

Two structural changes.

Change 1: has_system_prompt_shape() in filter-dataset.c. Detects leaked system prompts by SHAPE, not by character matching. Three signals combined:

1. 3+ markdown headers in one turn = drop.

2. 2+ headers with turn length >=500 chars = drop.

3. Agent-shard fingerprint phrases (# Agent , Shadow Clone, Your shard, Read it. Become it, This file defines) combined with any header or length >=400 = drop.

Isolation rule: check only the first user turn, split at the ' / ' separator (a slash with surrounding spaces; a bare / would fragment URL paths), to avoid false positives on legitimate markdown in assistant responses.
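The three signals & the isolation rule can be sketched in Python as follows. This is not the C code in filter-dataset.c; it is an illustration that assumes each record is a single string with real newlines inside turns, ' / ' between turns, & an optional '> ' marker before the user text. The thresholds & fingerprint phrases are taken from the list above.

```python
import re

FINGERPRINTS = (
    "# Agent ", "Shadow Clone", "Your shard",
    "Read it. Become it", "This file defines",
)

# A markdown header at line start, optionally after the '> ' user marker.
HEADER_RE = re.compile(r"^(?:> )?#{1,6} \S", re.MULTILINE)

def first_user_turn(record: str) -> str:
    """Split on ' / ' (with spaces) so URL paths stay intact."""
    return record.split(" / ", 1)[0]

def has_system_prompt_shape(record: str) -> bool:
    turn = first_user_turn(record)
    headers = len(HEADER_RE.findall(turn))
    if headers >= 3:                        # signal 1
        return True
    if headers >= 2 and len(turn) >= 500:   # signal 2
        return True
    has_fingerprint = any(p in turn for p in FINGERPRINTS)
    if has_fingerprint and (headers >= 1 or len(turn) >= 400):  # signal 3
        return True
    return False

leak = "> # Agent X\n## Identity\n## Rules / < ok"
legit = "> how do I use https://example.com/a/b / < # Title\n## A\n## B\n## C"
print(has_system_prompt_shape(leak), has_system_prompt_shape(legit))
```

The second record carries four headers, but all of them sit in the assistant response, so the isolation rule keeps it.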

Change 2: hermes3-* moved out of DEEP_CLEAN_SKIP. Trust nothing unfiltered.

Drop Rates After Patch

source	in-lines	out-lines	dropped
hermes3-general	536,858	67,395	87.4%
hermes3-roleplay	35,191	2,481	93.0%
hermes3-creative	14,258	1,373	90.4%
unfirehose-chat	3,816	2,653	30.5%
chat	45,257	44,538	1.6% (noise)
smoltalk	11,812	11,812	0.0%

The baseline filters alone reject 87-93% of hermes3 lines once they are actually applied; DEEP_CLEAN_SKIP was the load-bearing defect. The new shape detector adds ~0.1% additional rejection overall, concentrated in unfirehose-chat where it removes specific agent-shard leaks existing filters miss.

Why Shape Beats Character

Ornaments evolve. A character-matching filter that drops ○ does nothing about ◇ next week. A shape-based filter (count headers, count chars, recognize fingerprint phrases) generalizes across ornament variants. Pattern: contamination detection must use structural heuristics.

Why Filter By Shape

v2.5 filters agent-shard leaks by SHAPE (header count, length, fingerprint phrases) rather than by CHARACTER (matching specific symbols like ornaments). Give one practical reason this matters & one concrete failure mode that a character-only filter would NOT catch.

A Bandit Arm with No Data

v3 Launched 2026-04-18

Same architecture & hyperparameters as v2; cleaned data after v2.5 patch. Zero ornament leaks in sample audits. v3 ran cleanly through step 112K.

Step 112,619: Sample Audit Catches a Pattern

Sample audit revealed coherent conversational turns (haiku, Q&A, dialogue) but periodic phases focused on knowledge arms (gutenberg, repo-docstrings, repo-docs) leaked code-like fragments & repository tokenization noise. One sample at step 112,080 reached loss 0.13: anomalously low, signaling memorized repo-docs substrings rather than learned chat distribution.

The Zombie Arm

Diagnosis: exclude_sources correctly removed repo-docstrings at training start, but the persisted bandit state carried a residual repo-docstrings arm with weight 1.546 from a prior run. State reload reinstated it into the UCB pool even though no .btok existed to sample from, producing a zombie pull that distorted exploration accounting.

Lesson: bandit state files (.state.json) drift across restarts in surprising ways. Configuration excludes do not erase residual arm memory. Belt-and-suspenders required: cap = 0.0 alongside exclude.
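One way to apply the belt-and-suspenders rule is to reconcile persisted arms against the current configuration on every reload. The sketch below assumes the state is a simple name-to-weight mapping; the real .state.json schema is not described in the text, & the function & argument names are hypothetical.

```python
def reconcile_arms(persisted, excluded, available, caps):
    """Keep only arms that are allowed, have data on disk, & have a nonzero cap.

    persisted: dict of arm name -> weight loaded from the state file
    excluded:  set of arm names removed by configuration
    available: set of arm names that actually have data to sample from
    caps:      dict of arm name -> cap; missing arms are treated as uncapped
    """
    live, dropped = {}, []
    for name, weight in persisted.items():
        if name in excluded or name not in available or caps.get(name, 1.0) <= 0.0:
            dropped.append(name)
            continue
        live[name] = weight
    return live, sorted(dropped)

persisted = {"chat": 1.2, "repo-docstrings": 1.546, "gutenberg": 0.9}
live, dropped = reconcile_arms(
    persisted,
    excluded={"repo-docstrings"},
    available={"chat", "gutenberg"},
    caps={"repo-docs": 0.0},
)
print(live, dropped)
```

Any one of the three checks would have removed the zombie arm here; running all three means a stale state file, a missing data file, or a forgotten exclude cannot reintroduce it alone.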

Polish Configuration

Curriculum perturbation only. Architecture, optimizer state, learning rate schedule, & loss history all preserved from step_112600.bin.

Source	v3 base	v3 polish
repo-docs	cap 0.3	excluded (cap 0.0)
repo-docstrings	excluded	excluded + cap 0.0
repo-commits	cap 0.4	cap 0.2
dictionary	cap 0.5	cap 0.25
gutenberg	cap 0.8 / floor 0.3	cap 0.7 / floor 0.4
irc-qa-strict	--	cap 0.3
unweapon	--	cap 0.3
synthetic-chat	--	cap 0.4
hermes3-general	floor 0.5	floor 0.7
hermes3-creative	floor 0.4	floor 0.55
hermes3-roleplay	floor 0.4	floor 0.5
chat	floor 0.4	floor 0.6
smoltalk	floor 0.3	floor 0.5
oasst	floor 0.3	floor 0.5
dolly	--	floor 0.4
curriculum_warmup_steps	20000	0

Polish Protocol

1. SIGUSR1 sent to the CUDA trainer forces a checkpoint at the next 100-step boundary.

2. Proxy stops.

3. .samples.json & .state.json archived (sample log & bandit state saved as historical record).

4. .loss.json retained -- cumulative training history; never archived.

5. Proxy restarts. Polish config submitted.

6. CUDA resumes from step_112600.bin with fresh bandit state under new caps & floors.

Loss history continues unbroken.
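Step 1 of the protocol follows a common pattern: the signal handler only sets a flag, & the training loop acts on it at the next step boundary. The sketch below shows that pattern in Python; the actual trainer is CUDA code whose handler is not shown in the text, so the class & function names are hypothetical.

```python
import signal

class CheckpointRequest:
    """Remembers a checkpoint request until a step boundary is reached."""

    def __init__(self, interval: int = 100):
        self.interval = interval
        self.pending = False

    def handle(self, signum, frame) -> None:
        # Keep the handler minimal: record the request & return.
        self.pending = True

    def should_checkpoint(self, step: int) -> bool:
        if self.pending and step % self.interval == 0:
            self.pending = False
            return True
        return False

def install(request: CheckpointRequest) -> None:
    """Register the handler. SIGUSR1 exists on POSIX systems only."""
    signal.signal(signal.SIGUSR1, request.handle)

# Simulated loop: the request arrives mid-interval & is honored at step 300.
request = CheckpointRequest(interval=100)
request.handle(None, None)  # stands in for the signal arriving at step 250
saved = [s for s in range(250, 501) if request.should_checkpoint(s)]
print(saved)  # [300]
```

Deferring the write to a boundary keeps checkpoint names aligned with the regular cadence, which makes a later resume from a named step unambiguous.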

Why Polish, Not Restart

Step 112,619 sat at 56% of planned training. The team chose CURRICULUM PERTURBATION (preserve weights, change caps & floors) instead of FULL RESTART (re-initialize at step 0). Give two reasons why polish-pivot was correct & one specific risk that going to step 0 would have introduced.

What Polish Produced

Sample Audit, Hours After Pivot

Within hours of the polish pivot, sample audit confirmed convergence on the four original ANDREA-120M goals.

Factual recall (biology), step 112,584, loss 0.30, ppl 1.

> [extinction prompt]

> < black spider montano is alive, carolina parakeet is extinct.

Two correct facts. External grading: 9.5/10.

Factual recall (signal processing), step 112,668, loss 0.74, ppl 2.

> [Fourier transform prompt]

> < The Fourier transform decomposes signals into frequency components, allowing analysis of signal characteristics & filtering in the frequency domain.

Textbook one-sentence definition. External grading: 9.5/10.

Constraint following (haiku), step 112,248, loss 1.05, ppl 3.

> Write a haiku about rain using three words that start with 'r'.

> < Rain's rhythmic refrain, Rivulets on the window, Respite from life's pain.

Tool knowledge (Python), step 112,500, loss 1.94, ppl 7.

> How do I find outdated packages in a Python project?

> < Use pip list --outdated names to see the pip packages list in your project management...

Right tool emerges; phrasing imprecise.
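The ppl values in these sample headers are consistent with perplexity = exp(loss) rounded to the nearest integer. The rounding rule is an inference from the four pairs above, not something the text states.

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss)

reported = {0.30: 1, 0.74: 2, 1.05: 3, 1.94: 7}  # loss -> ppl from the samples
for loss, ppl in reported.items():
    print(f"loss {loss:.2f} -> exp = {perplexity(loss):.2f}, reported ppl {ppl}")
```

A perplexity near 1 means the model was almost certain of every token in that sample, which is why very low values deserve a memorization check.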

Six Domains in 700 Steps

Biology, signal processing, poetry, Python tools, conversational dialogue, ops dialogue. Seeing six unrelated domains within 700 steps suggests the bandit & model are working in concert. Domain breadth IS the convergence signal.

External Grading

Independent reviewer rated samples 'solid for a 120M param model -- impressive coherence & knowledge retention at this scale,' with the Carolina parakeet & Fourier transform samples rated 9.5/10 & 'punching above its weight on knowledge tasks.'

What Each Phase Taught

v1 taught: five compounding failures collapse training. No single fix rescues the run in isolation; all five must land at once.

v2 taught: architectural fixes are necessary but not sufficient. Data layer can defeat them silently.

v2.5 taught: filter contamination by shape, not character. Patterns are stable; symbols evolve.

v3 base taught: bandit state drifts across restarts in surprising ways. Excludes alone are not enough; cap 0.0 belt-and-suspenders required.

v3 polish taught: when the failure is in policy & the model is healthy, perturb policy. Keep weights. Keep loss history. Move forward.

One Truth

Convergence is not a single event; it is a chain of corrections. Each phase exposed one defect, fixed it, & uncovered the next. ANDREA-120M reads 9.5/10 at step 112,584 because v1, v2, v2.5, v3 base, & v3 polish each did their job.

Which Phase Taught the Hardest Lesson

Of the five phases (v1, v2, v2.5, v3 base, v3 polish), which one would you say taught the most-transferable engineering lesson? Pick one. State the lesson in your own words & give 2-3 sentences explaining why this lesson generalizes beyond language model training.