eval_chat_quality() Was Wired to the Wrong Runner
A 10-Day Failure That Should Have Stopped at Day 3
ANDREA-120M v1 trained for 16.1 days on an RTX 4090 at 130W continuous. Sample outputs were stored every 200 steps but never analyzed during the run. By step 80K (day 4), samples read "region region region region region". By step 110K, they read "''''' ''''' '' ''' ''". Training continued for another 11 days before being killed manually at step 165K.
What Went Wrong With the Smoke Alarm
eval_chat_quality() existed in the codebase. It scored sample quality. It even worked correctly. But it was wired only to the legacy multi-phase runner. The v1 firehose curriculum used a different code path and never invoked the eval. The smoke alarm sat in another room with the door closed.
The Cost
16.1 days of compute. 130W continuous. ~50 kWh of electricity. The model produced no usable output at any point. Loss EMA bottomed at 3.23 at step 110K, then climbed back to 4.54 at step 165K when training stopped. Numerically reasonable; semantically empty.
Why Loss Curves Lied
A uniform guess over an 8449-token vocabulary has a cross-entropy of ln(8449) ≈ 9.04. v1 reached an EMA loss of 3.23 while producing "region region region". Loss alone cannot detect coherence collapse. A model that minimizes cross-entropy by repeating one high-frequency token gets numerically rewarded for the failure mode.
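The 9.04 baseline can be checked in a few lines. This is a minimal sketch that assumes the reported loss is cross-entropy in nats (natural log), which is what makes ln(8449) come out at 9.04; the function names are illustrative and not taken from the ANDREA codebase.

```python
import math

VOCAB_SIZE = 8449  # vocabulary size quoted in the text


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy in nats of a model that gives every token equal probability."""
    return math.log(vocab_size)


def perplexity(loss_nats: float) -> float:
    """Effective number of equally likely choices implied by a loss value."""
    return math.exp(loss_nats)


print(round(uniform_loss(VOCAB_SIZE), 2))  # 9.04
print(round(perplexity(3.23), 1))          # 25.3
```

A loss of 3.23 corresponds to a perplexity of roughly 25, far better than the 8449 of a uniform guess, yet it says nothing about whether the generated text is readable.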
Score Every Sample on Four Axes
The Composite Score
v2 ships a coherence gate that scores every sample (taken every 100 steps during firehose training) on four metrics:
| Metric | Range | What it catches |
|---|---|---|
| Bigram diversity | 0-35 | Repetition at the two-token level (region region) |
| Trigram diversity | 0-35 | Repetition at the three-token level (a, b, a, b, a, b) |
| English word presence | 0-20 | Drift into non-English (CJK, Cyrillic, gibberish) |
| Character diversity | 0-10 | Single-character collapse (''''', ... ... ...) |
Total possible: 100. Threshold: 30.
Why Four Metrics, Not One
Each metric catches a different failure mode:
- A model collapsing to one bigram fails Bigram diversity but passes Character diversity.
- A model producing punctuation noise ("''''' ''''' ''") fails Character diversity but might pass Bigram diversity if the punctuation pairs vary.
- A model drifting into non-English (translation training contamination) fails English word presence but passes Bigram and Trigram diversity if it produces grammatical Mandarin.
- A model producing "a, b, a, b, a, b" passes Bigram (a-b and b-a appear) but fails Trigram (a-b-a, b-a-b dominate).
Together, the four metrics span the failure space. A composite score below 30 means at least one axis collapsed badly enough to drag the whole sample down.
Consecutive Counter
Auto-halt fires after 5 consecutive samples score below 30. Single bad samples can occur during phase transitions or rare-source pulls; five in a row mean the model has stopped recovering. With samples taken every 100 steps, 5 consecutive degenerate samples = 500 steps of confirmed coherence collapse.
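The consecutive counter can be sketched as a small stateful object. The class name CoherenceGate and its fields are hypothetical; only the threshold of 30 and the run length of 5 come from the text.

```python
from dataclasses import dataclass


@dataclass
class CoherenceGate:
    """Hypothetical gate: halts after `patience` consecutive low scores."""
    threshold: float = 30.0   # composite score below this counts as degenerate
    patience: int = 5         # consecutive degenerate samples needed to halt
    consecutive_bad: int = 0  # current run length of degenerate samples

    def update(self, score: float) -> bool:
        """Record one sample score; return True when training should halt."""
        if score < self.threshold:
            self.consecutive_bad += 1
        else:
            self.consecutive_bad = 0  # a single passing sample resets the run
        return self.consecutive_bad >= self.patience


gate = CoherenceGate()
scores = [62, 18, 55, 12, 9, 14, 21, 7]  # made-up scores for illustration
halted_at = next((i for i, s in enumerate(scores) if gate.update(s)), None)
print(halted_at)  # 7
```

The isolated bad sample at index 1 is forgiven because the next sample passes; the halt only fires once indices 3 through 7 all fall below the threshold.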
Compute a Score
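One way the four-axis score could be computed is sketched below. Only the point weights (35, 35, 20, 10) and the threshold of 30 come from the table; how each axis is normalised is an assumption made here for illustration, as are the function names, the whitespace tokenisation, the ASCII-word proxy for "English", and the 20-character cap.

```python
import re

# Point weights from the metrics table; everything else below is assumed.
WEIGHTS = {"bigram": 35.0, "trigram": 35.0, "english": 20.0, "chars": 10.0}
ASCII_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def ngram_diversity(tokens, n):
    """Fraction of n-grams that are distinct (0.0 if the sample is too short)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0


def english_presence(tokens):
    """Fraction of tokens that look like ASCII words; a crude stand-in for a dictionary lookup."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if ASCII_WORD.fullmatch(t.strip(".,;:!?\"()")))
    return hits / len(tokens)


def char_diversity(text, cap=20):
    """Distinct non-whitespace characters, saturating at `cap` (assumed value)."""
    distinct = {c for c in text if not c.isspace()}
    return min(len(distinct) / cap, 1.0)


def coherence_score(text):
    """Composite 0-100 score: weighted sum of the four axes."""
    tokens = text.split()
    parts = {
        "bigram": ngram_diversity(tokens, 2),
        "trigram": ngram_diversity(tokens, 3),
        "english": english_presence(tokens),
        "chars": char_diversity(text),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


print(coherence_score("region " * 40) < 30)  # True
```

Under these assumptions a long run of one repeated word falls below 30 because both n-gram axes collapse toward zero. The normalisation choices decide which failure modes actually cross the threshold, so they would need tuning against stored samples before being trusted as a gate.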
What v1 Would Have Looked Like
Back-Tested Trigger
Given v1's stored samples, applying the v2 coherence gate retroactively shows the gate would have triggered at step 132K. v1 ran to step 165K before manual termination. The gate would have stopped training 33,000 steps earlier.
Compute Saved
The RTX 4090 trained at ~6 steps/min in FP16 cuBLAS. 33,000 steps / 6 steps/min = 5,500 minutes = 91.7 hours = 3.8 days of compute saved. At 130W continuous, that's ~12 kWh of electricity, plus 3.8 days of GPU wear.
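The savings arithmetic can be reproduced directly from the figures in the text. The variable names are illustrative; the inputs (step counts, ~6 steps/min, 130W) are the quoted values, and the throughput is treated as exactly 6 for the calculation.

```python
STEPS_SAVED = 165_000 - 132_000  # manual kill step minus back-tested trigger step
STEPS_PER_MIN = 6                # approximate throughput quoted in the text
POWER_W = 130                    # continuous draw in watts

minutes = STEPS_SAVED / STEPS_PER_MIN
hours = minutes / 60
days = hours / 24
kwh = POWER_W * hours / 1000     # watts x hours -> kilowatt-hours

print(f"{minutes:.0f} min, {hours:.1f} h, {days:.1f} days, {kwh:.1f} kWh")
# 5500 min, 91.7 h, 3.8 days, 11.9 kWh
```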
Why Step 132K & Not Step 80K
v1 produced "region region region" at step 80K. Why didn't the gate fire then?
Because intermittent good samples appeared between bad ones. The bandit cycled through sources every 7-42 steps. Even a degenerate model occasionally produced more diverse outputs when sampling from a different source, momentarily resetting the consecutive counter. By step 132K, the model had collapsed deeply enough that 5 consecutive degenerate samples (500 steps) became inevitable.
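The reset effect is easy to reproduce with made-up numbers. In this sketch the score sequences are hypothetical, not v1's stored samples; the function first_halt is an illustrative name, with the threshold of 30 and run length of 5 taken from the text.

```python
def first_halt(scores, threshold=30, patience=5):
    """Index of the sample at which the gate fires, or None if it never does."""
    streak = 0
    for i, score in enumerate(scores):
        streak = streak + 1 if score < threshold else 0
        if streak >= patience:
            return i
    return None


# Hypothetical run: four degenerate samples, then one passable one, repeated.
intermittent = [10, 12, 8, 15, 45] * 20
# Same run followed by five degenerate samples with no recovery in between.
collapsed = intermittent + [10] * 5

print(first_halt(intermittent))  # None
print(first_halt(collapsed))     # 104
```

In the first sequence 80% of samples are degenerate, yet the gate never fires because every fifth sample resets the streak. It fires only once the passable samples stop appearing.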
Lesson: Wire the Smoke Alarm to Every Runner
v2 wires eval_chat_quality() directly into the firehose curriculum's sample-handling code path, not just the legacy runner. Every sample, every run, every code path: the same gate. The fix took ~30 lines of code.
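A minimal sketch of what "the same gate on every code path" could look like follows. The real signature of eval_chat_quality() is not shown in the text, so the scorer here is assumed to take the sample text and return a 0-100 score; HaltTraining, make_sample_hook, run_firehose, and run_legacy are hypothetical names, not the project's actual API.

```python
from typing import Callable, Iterable, Tuple

Scorer = Callable[[str], float]  # assumed shape of the quality eval


class HaltTraining(Exception):
    """Raised when the gate decides the run should stop."""


def make_sample_hook(scorer: Scorer, threshold: float = 30.0, patience: int = 5):
    """Build one hook that every runner calls for every sample."""
    streak = 0

    def on_sample(step: int, text: str) -> float:
        nonlocal streak
        score = scorer(text)
        streak = streak + 1 if score < threshold else 0
        if streak >= patience:
            raise HaltTraining(f"{patience} consecutive samples below {threshold} at step {step}")
        return score

    return on_sample


def run_firehose(samples: Iterable[Tuple[int, str]], on_sample) -> None:
    """Stand-in for the firehose runner's sample-handling path."""
    for step, text in samples:
        on_sample(step, text)


def run_legacy(samples: Iterable[Tuple[int, str]], on_sample) -> None:
    """Stand-in for the legacy runner; it receives the same hook."""
    for step, text in samples:
        on_sample(step, text)
```

Passing the hook into each runner, rather than letting a runner import the eval on its own, makes a missing gate visible at the call site: a runner cannot be started without deciding what on_sample is.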