Steps 0-20K: A Restricted Diet

Two Phases, One Run

The v2 firehose curriculum runs in two phases inside a single 200K-step training run:


Phase A (steps 0 to 20K). Bandit pulls only from 7 chat & prose sources:


- hermes3-general

- hermes3-creative

- hermes3-roleplay

- chat

- smoltalk

- oasst

- gutenberg


Phase B (steps 20K to 200K). Bandit pulls from the full mix, all 16 sources, including reference (dictionary), technical (repo-docs, repo-commits), & social (irc, unweapon).


Figure: Curriculum warmup timeline


What the Restricted Diet Shares

Six of the seven warmup sources are conversational. One (gutenberg) is paragraph prose. Together they share a common shape: turn structure (prompt then response) or narrative flow. The vocabulary distribution across the 7 sources looks like ordinary English, cross-entropy targets stay in a stable range, and gradient magnitudes stay predictable.


Config Field


"curriculum_warmup_steps": 20000,
"curriculum_warmup_sources": ["hermes3-general", "hermes3-creative",
  "hermes3-roleplay", "chat", "smoltalk", "oasst", "gutenberg"]

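One way to read these two fields is as a gate on the bandit's arm set. The sketch below is a minimal illustration under stated assumptions: the helper name `active_sources` is hypothetical, the boundary is assumed to be a strict `step < curriculum_warmup_steps` test (the lesson does not say how step 20,000 itself is treated), and `FULL_MIX` lists only the sources this lesson names, not all 16.

```python
CONFIG = {
    "curriculum_warmup_steps": 20000,
    "curriculum_warmup_sources": [
        "hermes3-general", "hermes3-creative", "hermes3-roleplay",
        "chat", "smoltalk", "oasst", "gutenberg",
    ],
}

# Partial list: only the sources named in this lesson (the full mix has 16).
FULL_MIX = CONFIG["curriculum_warmup_sources"] + [
    "dictionary", "repo-docs", "repo-docstrings", "repo-commits",
    "irc", "unweapon",
]


def active_sources(step, config, full_mix):
    """Return the sources the bandit may pull from at `step` (hypothetical helper)."""
    # Assumption: warmup covers every step strictly below the configured count.
    if step < config["curriculum_warmup_steps"]:
        return list(config["curriculum_warmup_sources"])
    return list(full_mix)


for step in (5000, 20000):
    arms = active_sources(step, CONFIG, FULL_MIX)
    print(step, len(arms), "repo-docs" in arms)
```

Setting `curriculum_warmup_steps` to 0 in this sketch makes the warmup branch unreachable, so every source is eligible from the first step the trainer sees.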
Identify the Warmup Phase

A training run has run for 18,400 steps. Without looking at the bandit state, can the model have sampled from `dictionary` or `repo-docs`? Explain why or why not & cite the configuration value that determines this.

What v1 Looked Like Without Warmup

v1: All 16 Sources from Step 0

The first ANDREA-120M training run (March-April 2026) activated the full firehose at step 0: 16 sources, including dictionary (88K word definitions in > define X / < X is... shape), repo-docs (markdown documentation), repo-docstrings (Python docstrings), & repo-commits (git commit messages), alongside chat & prose.


What Went Wrong

A freshly initialized 120M model with random weights cannot model 16 distinct distributions at once. Each batch from a structurally different source produces a different gradient direction. Source transitions every 7-42 steps swung gradient magnitudes wildly; the model hopped between attractors faster than it could form representations.


By step 80K, v1 produced: `region region region region region region region`. Hermes3-general teacher distillation rewards (mean 340-453) made repetitive list-structured sources score highest on cross-entropy, which the bandit interpreted as 'these arms are easy.' The bandit fed the model more of what made it degenerate.
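The lesson does not say how such samples were flagged. As a rough illustration only, the sketch below measures how much of a sample is taken up by its single most frequent token; the function name `top_token_share` and the whitespace tokenization are assumptions, not part of the training code.

```python
from collections import Counter


def top_token_share(sample):
    """Fraction of whitespace-separated tokens equal to the most common token."""
    tokens = sample.split()
    if not tokens:
        return 0.0
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)


collapsed = "region region region region region region region"
healthy = "the cat sat on the mat"
print(top_token_share(collapsed), top_token_share(healthy))
```

A share near 1.0 indicates the single-token repetition described above; ordinary English sentences sit far lower.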


Why Restricting to 7 Sources Helps

1. Distribution similarity. All 7 warmup sources produce text of similar shape (turn structure or narrative). Gradient directions across batches stay roughly aligned.

2. Coherence first. The model learns vocabulary frequency, syntactic patterns, & turn structure before encountering definition lists, code, or git messages.

3. Stable curriculum. Bandit reward signals across 7 chat/prose sources stay in a comparable range; UCB1 selection does not get hijacked by a single anomalously-rewarding source.
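The third point can be illustrated with textbook UCB1, which scores each arm as its mean reward plus an exploration bonus of sqrt(2 ln t / n). This is a minimal sketch, not the project's bandit: the reward values are invented for illustration, the arm name `list-like` is hypothetical, and the real curriculum may normalize or clip rewards in ways this lesson does not describe.

```python
import math


def ucb1_select(counts, reward_sums):
    """Pick an arm by textbook UCB1: mean reward + sqrt(2 ln t / n)."""
    # Pull every arm once before the formula applies.
    for arm, n in counts.items():
        if n == 0:
            return arm
    total = sum(counts.values())

    def score(arm):
        mean = reward_sums[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(total) / counts[arm])
        return mean + bonus

    return max(counts, key=score)


def simulate(mean_rewards, rounds=300):
    """Run UCB1 with deterministic per-arm rewards (illustrative values only)."""
    counts = {arm: 0 for arm in mean_rewards}
    sums = {arm: 0.0 for arm in mean_rewards}
    for _ in range(rounds):
        arm = ucb1_select(counts, sums)
        counts[arm] += 1
        sums[arm] += mean_rewards[arm]
    return counts


# Rewards in a comparable range: pulls stay spread across the arms.
comparable = simulate({"chat": 0.50, "oasst": 0.55, "gutenberg": 0.45})
# One arm on a much larger reward scale: the exploration bonus cannot compete.
skewed = simulate({"chat": 0.50, "oasst": 0.55, "gutenberg": 0.45, "list-like": 400.0})
print(comparable)
print(skewed)
```

When rewards share a scale, the exploration bonus keeps every arm in rotation. When one arm's reward is hundreds of times larger, its mean dominates the score and the other arms are effectively never revisited.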


When Phase B Activates

At step 20K, the model has produced ~40-50 samples (one per 100 steps), shows coherent English in samples, & has built stable bigram & trigram distributions. Now it can absorb dictionary's > define X / < X is... pattern, repo-docs' code blocks, & git commit headers without losing the chat structure underneath.

Diagnose v1's Failure

A freshly initialized 120M transformer trains on 16 structurally different sources from step 0. By step 80K, samples read `region region region region region`. Connect the no-warmup design choice to this specific failure mode: name the mechanism by which 16 sources at step 0 makes a model collapse into single-token repetition. One or two sentences.

v3 Polish Sets curriculum_warmup_steps = 0

A Different Starting Point

The v3 polish pivot at step 112,619 resumed training from step_112600.bin with curriculum_warmup_steps set to 0. At first glance this looks like a contradiction: if warmup helped v2, why disable it for the polish phase?


Because the Model Already Learned Coherence

Phase A buys time for a freshly initialized model to learn vocabulary frequency, turn structure, & paragraph coherence. By step 112K, the model has already done all of that. Sample audits at 112K showed coherent conversational turns, haiku, Q&A, & dialogue. The original purpose of warmup (protect a fragile new model from gradient chaos) no longer applies.


Polish Reweights, Does Not Restart

Polish is a curriculum perturbation, not a fresh run: it keeps the same 200K target, the same architecture, the same optimizer state, & the same loss history. What changes is that source caps & floors get reweighted to favor conversation over knowledge arms. With the model already coherent, every active source is fair game from step 112,619 onward.
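The lesson does not show how caps & floors are applied. One plausible reading, sketched below, treats them as per-source bounds on sampling probability: clip each probability into its bounds, then rescale so the result sums to 1. The function name `clip_and_renormalize` and every number here are hypothetical, chosen only to show conversation being favored over a knowledge arm.

```python
def clip_and_renormalize(probs, floors, caps):
    """Clip each source's probability to [floor, cap], then rescale to sum to 1."""
    clipped = {}
    for source, p in probs.items():
        lo = floors.get(source, 0.0)  # no floor configured: lower bound of 0
        hi = caps.get(source, 1.0)    # no cap configured: upper bound of 1
        clipped[source] = min(max(p, lo), hi)
    total = sum(clipped.values())
    if total == 0:
        raise ValueError("all sources were clipped to zero probability")
    return {source: p / total for source, p in clipped.items()}


# Hypothetical values: what the bandit would pick, and polish-phase bounds.
bandit_probs = {"chat": 0.2, "dictionary": 0.5, "repo-docs": 0.3}
polish_floors = {"chat": 0.5}       # raise the conversational arm
polish_caps = {"dictionary": 0.2}   # limit the knowledge arm

polished = clip_and_renormalize(bandit_probs, polish_floors, polish_caps)
print(polished)
```

Rescaling after clipping can push a value slightly back past its bound when the clipped values do not already sum to 1, so a production version would likely need an iterative or constrained approach.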


Summary Table


| Phase | curriculum_warmup_steps | Why |
| --- | --- | --- |
| v1 | (not present) | All 16 sources from step 0 -> collapse |
| v2 (steps 0-200K) | 20,000 | Protect freshly initialized weights from gradient chaos |
| v3 base (steps 0-112K) | 20,000 | Same protection as v2 |
| v3 polish (steps 112K-200K) | 0 | Model already coherent; no fragile-init regime to protect |

Why Disabling Warmup at Polish Is Safe

Argue (in 2-3 sentences) why setting curriculum_warmup_steps = 0 at the v3 polish pivot does NOT recreate the v1 collapse, even though both runs feature 'all sources active from the current step.' Reference the model state at step 112K.