Sixteen Days of region region region
The Run That Ended
ANDREA-120M v1 launched 2026-03-22 & terminated 2026-04-15 at step 165,000 of 200,000 planned. EMA loss minimum: 3.23 at step 110K (random chance: ln(8449) = 9.04, so loss looked respectable). Samples did not.
Step 80K: region region region region region region region
Step 110K: ''''' ''''' '' ''' '' ''' '''?' ''' ' '' '' '
Step 140K: games, games, games, games, games, games
Step 165K: Budy Budy Budy Budy Budy Budy Budy Budy Budy
Sixteen days of RTX 4090 compute. 130W continuous. Garbage from step 80K onward.
Why microGPT Worked but 120M Did Not
ANDREA-12M used the same training proxy & passed. Smaller weight matrices proved more robust to gradient shocks. Scaling to 120M parameters multiplied every fragility. Five failures compounded.
Five Compounding Failures
Failure 1: No gradient clipping. Source transitions every 7-42 steps produced unbounded gradient spikes. A single bad batch at 120M can push the model into a degenerate attractor the optimizer cannot escape. The 12M model survived because smaller weights tolerated the shocks.
Failure 2: No LR warmup. Learning rate jumped from 0 to peak immediately on freshly initialized weights. The model fell into a bad basin before any representations could form.
Failure 3: No weight decay. Vanilla Adam allowed arbitrarily large weights that amplified repetition patterns at 120M capacity.
Failure 4: No sample quality monitoring. eval_chat_quality() was wired only to the legacy multi-phase runner; the firehose curriculum never invoked it. The model produced garbage from step 80K onward, undetected for 10+ days.
Failure 5: Bandit rewarded repetitive sources. repo-docs, repo-docstrings, & unfirehose-chat scored highest (mean rewards 340-453) because list-structured content reduces cross-entropy trivially. The bandit fed the model more of what made it degenerate.
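To see why list-structured text is cheap to predict, here is a minimal sketch that estimates next-token entropy from bigram counts. It is only an illustration of Failure 5: the function name, the whitespace tokenization, & the toy inputs are assumptions, & the estimate is computed from the text itself rather than from the model's actual cross-entropy.

```python
import math
from collections import Counter

def bigram_conditional_entropy(tokens):
    """Estimate H(next | prev) in nats from the sequence's own bigram counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    total = sum(pair_counts.values())
    entropy = 0.0
    for (prev, _next), count in pair_counts.items():
        # p(prev, next) * log p(next | prev)
        entropy -= (count / total) * math.log(count / prev_counts[prev])
    return entropy

# Toy inputs (hypothetical): a rigid list template vs. a short prose line.
list_like = ["-", "name", ":", "value"] * 50
prose = "the cat sat on the mat and the dog sat on the rug".split()

h_list = bigram_conditional_entropy(list_like)
h_prose = bigram_conditional_entropy(prose)
```

In the templated sequence every token has exactly one successor, so the estimate is zero; a reward based on loss reduction will favor sources like that unless something else checks output quality.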
Compounding
No one failure alone would have collapsed v1. Each amplified the others. Gradient shocks (1) hit a freshly initialized model that had no warmup (2); unconstrained weight growth (3) amplified the resulting repetition, which the bandit rewarded (5) while no one was watching the output (4). Five intersecting causes, one collapse.
Why Five Failures, Not One
One Fix Per Failure
v2 Configuration (2026-04-15)
| Fix | Targets failure | Implementation |
|---|---|---|
| Gradient clipping | F1 (no clipping) | Global L2 norm, max_norm=1.0; three CUDA kernels (k_grad_norm_partial, k_grad_norm_final, k_grad_scale) compute & apply pre-Adam |
| LR warmup | F2 (no warmup) | Linear ramp 0 to peak over 2000 steps. lr(t) = lr_scheduled(t) * min(1, (t+1)/warmup_steps) |
| AdamW | F3 (no weight decay) | Decoupled weight decay (Loshchilov & Hutter 2019), weight_decay=0.01. p -= lr * (m_hat/(sqrt(v_hat)+eps) + weight_decay * p) |
| Coherence-gated early stopping | F4 (no monitoring) | Score every sample (bigram/trigram/word/char diversity). Auto-halt once 5 consecutive samples score below 30 |
| Curriculum warmup | F5 (bandit eats repetition) | First 20K steps restricted to 7 chat/prose sources; firehose activates after; repo-docstrings excluded entirely |
Plus sample_every dropped from 200 to 100 steps (audit cadence doubled), & repo-docs cap dropped from 0.5 to 0.3.
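The first three rows of the table can be sketched in plain Python. This is not the CUDA implementation; it restates the same arithmetic on lists & scalars. The function names, the beta1/beta2/eps defaults, & the small epsilon in the clip scale are assumptions not given in the source.

```python
import math

def clip_global_norm(grads, max_norm=1.0):
    """Scale every gradient so the combined L2 norm is at most max_norm.

    grads is a list of flat lists (one per parameter tensor).
    Returns the scaled gradients and the pre-clip norm.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # 1e-6 is an assumed guard
    return [[g * scale for g in tensor] for tensor in grads], total_norm

def warmup_lr(step, lr_scheduled, warmup_steps=2000):
    """lr(t) = lr_scheduled(t) * min(1, (t+1)/warmup_steps)."""
    return lr_scheduled * min(1.0, (step + 1) / warmup_steps)

def adamw_step(p, g, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t counts from 1.

    beta1, beta2 and eps are common defaults, assumed here.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: weight_decay * p is added outside the adaptive term.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v
```

Clipping runs before the Adam moments are updated, so a single spiky batch cannot inflate m & v; the decay term shrinks weights even when the gradient is zero.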
Back-Test
Coherence gate back-tested on v1: it would have triggered at step 132K, saving 3.8 days of compute. The gate alone would have cut v1's wasted compute by ~30%; the other four fixes would have kept v1 from ever reaching that trigger.
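A minimal sketch of such a gate follows. The source names the four diversity signals, the threshold of 30, & the patience of 5, but not the scoring formula; the equal-weight mean & the 20-character saturation point below are assumptions, as are all function & class names.

```python
def distinct_ratio(items):
    """Fraction of items that are unique; 0.0 for an empty sequence."""
    items = list(items)
    return len(set(items)) / len(items) if items else 0.0

def coherence_score(text):
    """Return a 0-100 score from word, bigram, trigram and char diversity."""
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    trigrams = list(zip(words, words[1:], words[2:]))
    chars = {c for c in text if not c.isspace()}
    parts = [
        distinct_ratio(words),
        distinct_ratio(bigrams),
        distinct_ratio(trigrams),
        min(1.0, len(chars) / 20),  # assumed: 20+ distinct chars counts as full
    ]
    return 100.0 * sum(parts) / len(parts)

class CoherenceGate:
    """Signal a halt after `patience` consecutive low-scoring samples."""

    def __init__(self, threshold=30.0, patience=5):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def update(self, sample_text):
        """Feed one sample; return True when training should halt."""
        if coherence_score(sample_text) < self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any healthy sample resets the count
        return self.streak >= self.patience
```

Requiring consecutive failures keeps one odd sample from stopping a healthy run, at the cost of a few extra sampling intervals before a real collapse is caught.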
What v2 Did NOT Fix
Data contamination. v2 trusted hermes3-* sources as 'pre-clean' because they came from an LLM teacher. DEEP_CLEAN_SKIP in the Makefile excluded hermes3-general, hermes3-creative, & hermes3-roleplay from make deep-clean. unfirehose-chat captured agent system prompts as user turns. Those two defects waited at the data layer, ready to surface.
Mapping Fixes to Failures
Step 15K: Two Data Defects Surface
What v2 Saw
v2 launched 2026-04-15. By step ~15K of 200K (7.5% complete), samples produced agent-harness ornaments (○ ●) & article-dominance fallback ('a' = 26% of words at step 14,966; 'the' = 21% at step 14,798). The five v2 stability fixes were working correctly. The failure had moved from architecture to data.
Two Independent Pipeline Defects
Defect A: unfirehose-chat captured agent system prompts as user turns. unfirehose-chat builds from harness session JSONL files at ~/.claude/, ~/.fetch/, ~/.uncloseai/. The ingest pipeline converted multi-section agent system prompts (# Agent X, ## Identity, ## Rules, etc.) into the user-turn slot of > user / < assistant pairs. The model learned that 'users' speak in multi-section markdown, & reproduced those ornaments in its own outputs.
Defect B: hermes3-* bypassed all filters. DEEP_CLEAN_SKIP in the Makefile excluded hermes3-general, hermes3-creative, & hermes3-roleplay from make deep-clean on the false assumption that LLM-distilled data was pre-clean. An exhaustive scan showed the existing filters, when applied, would reject 87-93% of hermes3 lines (oversize paragraphs >2000 chars overflowing block_size=1024; translation responses in CJK/Cyrillic/Arabic; low-bigram-diversity runs).
v2.5 Patch (commit de24332, 2026-04-18)
Two structural changes.
Change 1: has_system_prompt_shape() in filter-dataset.c. Detects leaked system prompts by SHAPE, not by character matching. Three signals combined:
1. 3+ markdown headers in one turn = drop.
2. 2+ headers with turn length >=500 chars = drop.
3. Agent-shard fingerprint phrases ('# Agent ', 'Shadow Clone', 'Your shard', 'Read it. Become it', 'This file defines') combined with any header or length >=400 = drop.
Isolation rule: check only the first user turn, split at the ' / ' separator (with surrounding spaces; a bare '/' would fragment URL paths), to avoid false positives on legitimate markdown in assistant responses.
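The three signals & the isolation rule can be sketched as follows. The real detector lives in filter-dataset.c; this Python version is an illustration only. The header regex & the assumption that each record is a single line of the form '> user / < assistant' are guesses, not taken from that file.

```python
import re

# Assumed header shape: 1-6 '#' at start or after whitespace, then a space.
HEADER_RE = re.compile(r"(?:^|\s)#{1,6}\s+\S")

FINGERPRINTS = (
    "# Agent ",
    "Shadow Clone",
    "Your shard",
    "Read it. Become it",
    "This file defines",
)

def first_user_turn(line):
    """Split on ' / ' (with spaces) so URL paths stay intact."""
    return line.split(" / ", 1)[0]

def has_system_prompt_shape(line):
    """Return True when the first user turn looks like a leaked system prompt."""
    turn = first_user_turn(line)
    headers = len(HEADER_RE.findall(turn))
    if headers >= 3:                         # signal 1
        return True
    if headers >= 2 and len(turn) >= 500:    # signal 2
        return True
    has_fingerprint = any(phrase in turn for phrase in FINGERPRINTS)
    if has_fingerprint and (headers >= 1 or len(turn) >= 400):  # signal 3
        return True
    return False
```

Because only the user turn is inspected, an assistant reply with several markdown headers passes untouched, & a fingerprint phrase alone in a short question does not trigger a drop.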
Change 2: hermes3-* moved out of DEEP_CLEAN_SKIP. Trust nothing unfiltered.
Drop Rates After Patch
| source | in-lines | out-lines | dropped |
|---|---|---|---|
| hermes3-general | 536,858 | 67,395 | 87.4% |
| hermes3-roleplay | 35,191 | 2,481 | 93.0% |
| hermes3-creative | 14,258 | 1,373 | 90.4% |
| unfirehose-chat | 3,816 | 2,653 | 30.5% |
| chat | 45,257 | 44,538 | 1.6% (noise) |
| smoltalk | 11,812 | 11,812 | 0.0% |
The baseline filters, once actually applied, account for the 87-93% rejection of hermes3 lines; DEEP_CLEAN_SKIP was the load-bearing defect. The new shape detector adds ~0.1% additional rejection overall, concentrated in unfirehose-chat, where it removes specific agent-shard leaks that the existing filters miss.
Why Shape Beats Character
Ornaments evolve. A character-matching filter that drops ○ does nothing about ◇ next week. A shape-based filter (count headers, count chars, recognize fingerprint phrases) generalizes across ornament variants. Pattern: contamination detection must use structural heuristics.
Why Filter By Shape
A Bandit Arm with No Data
v3 Launched 2026-04-18
Same architecture & hyperparameters as v2; cleaned data after v2.5 patch. Zero ornament leaks in sample audits. v3 ran cleanly through step 112K.
Step 112,619: Sample Audit Catches a Pattern
The sample audit showed coherent conversational turns (haiku, Q&A, dialogue), but during periodic phases when the bandit concentrated on knowledge arms (gutenberg, repo-docstrings, repo-docs), samples leaked code-like fragments & repository tokenization noise. One sample at step 112,080 reached loss 0.13: anomalously low, signaling memorized repo-docs substrings rather than a learned chat distribution.
The Zombie Arm
Diagnosis: exclude_sources correctly removed repo-docstrings at training start, but the persisted bandit state carried a residual repo-docstrings arm with weight 1.546 from a prior run. State reload reinstated it into the UCB pool even though no .btok existed to sample from, producing a zombie pull that distorted exploration accounting.
Lesson: bandit state files (.state.json) drift across restarts in surprising ways. Configuration excludes do not erase residual arm memory. Belt-and-suspenders required: cap = 0.0 alongside exclude.
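One way to enforce the belt-and-suspenders rule is to reconcile the persisted arms against the current configuration on every reload. The sketch below assumes the state is a dict of arm name to record; the actual .state.json layout, the function name, & the example values other than the 1.546 weight are hypothetical.

```python
def reconcile_arms(state_arms, available_sources, exclude_sources, caps):
    """Drop persisted arms that cannot or must not be sampled.

    An arm is treated as a zombie when it is excluded by config, has no
    data on disk, or has a cap of 0.0. Returns (kept_arms, dropped_names).
    """
    kept = {}
    dropped = []
    for name, arm in state_arms.items():
        is_zombie = (
            name in exclude_sources
            or name not in available_sources
            or caps.get(name, 1.0) == 0.0
        )
        if is_zombie:
            dropped.append(name)
        else:
            kept[name] = arm
    return kept, sorted(dropped)

# Hypothetical reload: repo-docstrings survives in the state file even
# though the config excludes it and no token file exists for it.
persisted = {
    "chat": {"weight": 1.10},
    "smoltalk": {"weight": 0.95},
    "repo-docstrings": {"weight": 1.546},
}
kept, dropped = reconcile_arms(
    persisted,
    available_sources={"chat", "smoltalk"},
    exclude_sources={"repo-docstrings"},
    caps={"repo-docstrings": 0.0},
)
```

Running the check at load time, rather than only at training start, means a stale state file cannot reintroduce an arm that the current config has removed.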
Polish Configuration
Curriculum perturbation only. Architecture, optimizer state, learning rate schedule, & loss history all preserved from step_112600.bin.
| Source | v3 base | v3 polish |
|---|---|---|
| repo-docs | cap 0.3 | excluded (cap 0.0) |
| repo-docstrings | excluded | excluded + cap 0.0 |
| repo-commits | cap 0.4 | cap 0.2 |
| dictionary | cap 0.5 | cap 0.25 |
| gutenberg | cap 0.8 / floor 0.3 | cap 0.7 / floor 0.4 |
| irc-qa-strict | -- | cap 0.3 |
| unweapon | -- | cap 0.3 |
| synthetic-chat | -- | cap 0.4 |
| hermes3-general | floor 0.5 | floor 0.7 |
| hermes3-creative | floor 0.4 | floor 0.55 |
| hermes3-roleplay | floor 0.4 | floor 0.5 |
| chat | floor 0.4 | floor 0.6 |
| smoltalk | floor 0.3 | floor 0.5 |
| oasst | floor 0.3 | floor 0.5 |
| dolly | -- | floor 0.4 |
| curriculum_warmup_steps | 20000 | 0 |
Polish Protocol
1. SIGUSR1 to the CUDA trainer forces a checkpoint at the next 100-step boundary.
2. Proxy stops.
3. .samples.json & .state.json archived (sample log & bandit state saved as historical record).
4. .loss.json retained -- cumulative training history; never archived.
5. Proxy restarts. Polish config submitted.
6. CUDA resumes from step_112600.bin with fresh bandit state under new caps & floors.
Loss history continues unbroken.
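Step 1 of the protocol, a signal that defers the checkpoint to the next 100-step boundary, can be sketched with a flag that the training loop polls. The class & function names are hypothetical, & the trainer's real handler is not shown in the source.

```python
import signal

class CheckpointRequest:
    """Remember that a checkpoint was requested until a boundary step arrives."""

    def __init__(self, boundary=100):
        self.boundary = boundary
        self.pending = False

    def handler(self, signum, frame):
        # Keep the signal handler trivial: only record the request.
        self.pending = True

    def should_checkpoint(self, step):
        """Return True once, at the first boundary step after a request."""
        if self.pending and step % self.boundary == 0:
            self.pending = False
            return True
        return False

def install(request):
    """Register the handler for SIGUSR1 (Unix only)."""
    signal.signal(signal.SIGUSR1, request.handler)
```

In a training loop, should_checkpoint(step) would be called once per step; a request arriving mid-interval is honored at the next multiple of the boundary, which keeps checkpoint filenames aligned with the sampling cadence.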
Why Polish, Not Restart
What Polish Produced
Sample Audit, Hours After Pivot
Within hours of the polish pivot, sample audit confirmed convergence on the four original ANDREA-120M goals.
Factual recall (biology), step 112,584, loss 0.30, ppl 1.
> [extinction prompt]
> < black spider montano is alive, carolina parakeet is extinct.
Two correct facts. External grading: 9.5/10.
Factual recall (signal processing), step 112,668, loss 0.74, ppl 2.
> [Fourier transform prompt]
> < The Fourier transform decomposes signals into frequency components, allowing analysis of signal characteristics & filtering in the frequency domain.
Textbook one-sentence definition. External grading: 9.5/10.
Constraint following (haiku), step 112,248, loss 1.05, ppl 3.
> Write a haiku about rain using three words that start with 'r'.
> < Rain's rhythmic refrain, Rivulets on the window, Respite from life's pain.
Tool knowledge (Python), step 112,500, loss 1.94, ppl 7.
> How do I find outdated packages in a Python project?
> < Use pip list --outdated names to see the pip packages list in your project management...
Right tool emerges; phrasing imprecise.
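The ppl figures quoted with each sample are consistent with perplexity = exp(loss), rounded to the nearest integer; that reading is an assumption, since the source does not state the formula. A short check, which also reproduces the chance-level loss ln(8449) quoted for v1:

```python
import math

VOCAB_SIZE = 8449  # taken from the ln(8449) chance-loss figure quoted for v1

def chance_loss(vocab_size):
    """Cross-entropy (nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)

def perplexity(loss):
    """Perplexity for a cross-entropy loss given in nats."""
    return math.exp(loss)

# Losses reported for the four audited samples.
sample_losses = [0.30, 0.74, 1.05, 1.94]
rounded_ppl = [round(perplexity(loss)) for loss in sample_losses]
baseline = round(chance_loss(VOCAB_SIZE), 2)
```

A loss of 0.30 corresponds to a perplexity of about 1.35, which is why it appears as 'ppl 1'.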
Six Domains in 700 Steps
Biology, signal processing, poetry, Python tools, conversational dialogue, ops dialogue. Coverage of six unrelated domains within 700 steps suggests the bandit & model are working in concert. Domain breadth IS the convergence signal.
External Grading
Independent reviewer rated samples 'solid for a 120M param model -- impressive coherence & knowledge retention at this scale,' with the Carolina parakeet & Fourier transform samples rated 9.5/10 & 'punching above its weight on knowledge tasks.'
What Each Phase Taught
v1 taught: five compounding failures collapse training. No fix in isolation rescues; all five must land at once.
v2 taught: architectural fixes are necessary but not sufficient. Data layer can defeat them silently.
v2.5 taught: filter contamination by shape, not character. Patterns are stable; symbols evolve.
v3 base taught: bandit state drifts across restarts in surprising ways. Excludes alone are not enough; cap 0.0 belt-and-suspenders required.
v3 polish taught: when the failure is in policy & the model is healthy, perturb policy. Keep weights. Keep loss history. Move forward.
One Truth
Convergence is not a single event; it is a chain of corrections. Each phase exposed one defect, fixed it, & uncovered the next. ANDREA-120M reads 9.5/10 at step 112,584 because v1, v2, v2.5, v3 base, & v3 polish each did their job.