Steps 0-20K: A Restricted Diet
Two Phases, One Run
The v2 firehose curriculum runs in two phases inside a single 200K-step training run:
Phase A (steps 0 to 20K). Bandit pulls only from 7 chat & prose sources:
- hermes3-general
- hermes3-creative
- hermes3-roleplay
- chat
- smoltalk
- oasst
- gutenberg
Phase B (steps 20K to 200K). Bandit pulls from the full mix, all 16 sources, including reference (dictionary), technical (repo-docs, repo-commits), & social (irc, unweapon).
What the Restricted Diet Shares
Six of the seven warmup sources are conversational. One (gutenberg) is paragraph prose. Together they share a common shape: turn structure (prompt then response) or narrative flow. The vocabulary distribution across the 7 sources resembles ordinary English, cross-entropy targets stay in a stable range, and gradient magnitudes stay predictable.
Config Field
"curriculum_warmup_steps": 20000,
"curriculum_warmup_sources": ["hermes3-general", "hermes3-creative",
"hermes3-roleplay", "chat", "smoltalk", "oasst", "gutenberg"]
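A minimal sketch of how a training loop might read these two fields. The `active_sources` helper and the `example_mix` list are hypothetical, introduced only for illustration; the actual loader is not shown here. Treating step 20,000 itself as the first Phase B step is also an assumption.

```python
import json

# The config fragment from above, wrapped in braces so it parses as JSON.
CONFIG = json.loads("""
{
  "curriculum_warmup_steps": 20000,
  "curriculum_warmup_sources": ["hermes3-general", "hermes3-creative",
    "hermes3-roleplay", "chat", "smoltalk", "oasst", "gutenberg"]
}
""")


def active_sources(step, all_sources, config):
    """Return the sources the bandit may pull from at `step`.

    Hypothetical helper: during warmup only the listed sources are active;
    afterwards every source is. The boundary step counts as Phase B here,
    which is an assumption.
    """
    if step < config["curriculum_warmup_steps"]:
        allowed = set(config["curriculum_warmup_sources"])
        return [s for s in all_sources if s in allowed]
    return list(all_sources)


# Illustrative subset of the full mix: the 7 warmup sources plus a few of the
# Phase B sources named in the text (the real mix has 16).
example_mix = CONFIG["curriculum_warmup_sources"] + [
    "dictionary", "repo-docs", "repo-commits", "irc",
]

print(len(active_sources(0, example_mix, CONFIG)))       # 7
print(len(active_sources(19_999, example_mix, CONFIG)))  # 7
print(len(active_sources(20_000, example_mix, CONFIG)))  # 11
```

In this sketch, a config with `curriculum_warmup_steps` set to 0 never enters the warmup branch, so every source is active from the first step.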
What v1 Looked Like Without Warmup
v1: All 16 Sources from Step 0
The first ANDREA-120M training run (March-April 2026) activated the full firehose at step 0: 16 sources, including dictionary (88K word definitions in > define X / < X is... shape), repo-docs (markdown documentation), repo-docstrings (Python docstrings), & repo-commits (git commit messages), alongside chat & prose.
What Went Wrong
A freshly initialized 120M model with random weights cannot model 16 distinct distributions at once. Each batch from a structurally different source produces a different gradient direction. Source transitions every 7-42 steps swung gradient magnitudes wildly; the model hopped between attractors faster than it could form representations.
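One common way to quantify "a different gradient direction" is the cosine similarity between the gradients of two batches. The sketch below uses tiny made-up vectors, not measurements from v1; the variable names are hypothetical and stand in for flattened per-batch gradients.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "gradients", invented for illustration only.
grad_chat_batch = [1.0, 0.5, 0.0]
grad_prose_batch = [0.9, 0.6, 0.1]
grad_dictionary_batch = [-0.2, 0.1, 1.0]

aligned = cosine_similarity(grad_chat_batch, grad_prose_batch)
conflicting = cosine_similarity(grad_chat_batch, grad_dictionary_batch)
print(f"chat vs prose:      {aligned:.2f}")
print(f"chat vs dictionary: {conflicting:.2f}")
```

A value near 1 means two batches push the weights the same way; values near 0 or below mean consecutive updates partly cancel or pull in unrelated directions.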
By step 80K, v1 samples had degenerated into loops such as: region region region region region region region. Hermes3-general teacher distillation rewards (mean 340-453) made repetitive list-structured sources score highest on cross-entropy, which the bandit interpreted as "these arms are easy." The bandit then fed the model more of exactly what made it degenerate.
Why Restricting to 7 Sources Helps
1. Distribution similarity. All 7 warmup sources produce text of similar shape (turn structure or narrative). Gradient directions across batches stay roughly aligned.
2. Coherence first. The model learns vocabulary frequency, syntactic patterns, & turn structure before encountering definition lists, code, or git messages.
3. Stable curriculum. Bandit reward signals across 7 chat/prose sources stay in a comparable range; UCB1 selection does not get hijacked by a single anomalously-rewarding source.
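Point 3 can be illustrated with a textbook UCB1 bandit. This is the standard algorithm, not the project's actual bandit code; the `UCB1` class and the reward values are hypothetical, and the way real rewards are computed or scaled is not specified here.

```python
import math


class UCB1:
    """Textbook UCB1 over a fixed set of arms (here: data sources)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Play every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total_pulls = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(total_pulls) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical rewards: two arms on a comparable scale, one anomalous arm.
fixed_reward = {"chat": 0.5, "gutenberg": 0.4, "dictionary": 5.0}

bandit = UCB1(fixed_reward)
for _ in range(300):
    arm = bandit.select()
    bandit.update(arm, fixed_reward[arm])

print(bandit.counts)
```

With these invented numbers the anomalous arm takes every pull after the initial round, because the exploration bonus of the other arms cannot close a gap that large within 300 steps. When all rewards sit in a comparable range, the bonus term keeps every arm in rotation.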
When Phase B Activates
At step 20K, the model has produced ~40-50 samples (one per 100 steps), shows coherent English in samples, & has built stable bigram & trigram distributions. Now it can absorb dictionary's > define X / < X is... pattern, repo-docs' code blocks, & git commit headers without losing the chat structure underneath.
v3 Polish Sets curriculum_warmup_steps = 0
A Different Starting Point
The v3 polish pivot at step 112,619 resumed training from step_112600.bin with curriculum_warmup_steps set to 0. At first glance this looks like a contradiction: if warmup helped v2, why disable it for the polish phase?
Because the Model Already Learned Coherence
Phase A buys time for a freshly initialized model to learn vocabulary frequency, turn structure, & paragraph coherence. By step 112K, the model has already done all of that. Sample audits at 112K showed coherent conversational turns, haiku, Q&A, & dialogue. The original purpose of warmup (protect a fragile new model from gradient chaos) no longer applies.
Polish Reweights, Does Not Restart
Polish is a curriculum perturbation, not a fresh run. It keeps the same 200K target, architecture, optimizer state, & loss history. What changes: source caps & floors get reweighted to favor conversation over knowledge arms. With the model already coherent, every active source is fair game from step 112,619 onward.
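How caps & floors enter the sampling step is not spelled out here, so the following is only a sketch of one plausible scheme: clamp each source's raw weight into its [floor, cap] range, then normalize. The `sampling_probs` function, the bound values, & the raw weights are all hypothetical.

```python
def sampling_probs(raw_weights, floors, caps):
    """Clamp each raw weight into [floor, cap], then normalize to sum to 1.

    All three dicts are keyed by source name. The bounds apply to the raw
    weights before normalization, which is an assumption of this sketch.
    """
    clamped = {
        src: min(max(w, floors.get(src, 0.0)), caps.get(src, float("inf")))
        for src, w in raw_weights.items()
    }
    total = sum(clamped.values())
    return {src: w / total for src, w in clamped.items()}


# Invented numbers: the bandit currently prefers a knowledge arm.
raw = {"chat": 0.2, "smoltalk": 0.2, "dictionary": 0.6}

before = sampling_probs(raw, floors={}, caps={})
after = sampling_probs(
    raw,
    floors={"chat": 0.4, "smoltalk": 0.4},  # raise conversation floors
    caps={"dictionary": 0.2},               # lower the knowledge cap
)
print(before)
print(after)
```

Nothing about the model or optimizer changes in this picture; only the bounds that shape which source the next batch comes from.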
Summary Table
| Phase | curriculum_warmup_steps | Why |
|---|---|---|
| v1 | (not present) | All 16 sources from step 0 -> collapse |
| v2 (steps 0-200K) | 20,000 | Protect freshly initialized weights from gradient chaos |
| v3 base (steps 0-112K) | 20,000 | Same protection as v2 |
| v3 polish (steps 112K-200K) | 0 | Model already coherent; no fragile-init regime to protect |