
eval_chat_quality() Was Wired to the Wrong Runner

A 10-Day Failure That Should Have Stopped at Day 3

ANDREA-120M v1 trained for 16.1 days on an RTX 4090 at 130W continuous. Sample outputs were stored every 200 steps but never analyzed during the run. By step 80K (day 4), samples read `region region region region region`. By step 110K, they read `''''' ''''' '' ''' ''`. Training continued for another 11 days before being killed manually at step 165K.


What Went Wrong With the Smoke Alarm

eval_chat_quality() existed in the codebase. It scored sample quality. It even worked correctly. But it was wired only to the legacy multi-phase runner. The v1 firehose curriculum used a different code path & never invoked the eval. The smoke alarm sat in another room with the door closed.


The Cost

16.1 days of compute. 130W continuous. ~50 kWh of electricity. The model produced no usable output at any point. Loss EMA bottomed at 3.23 at step 110K, then climbed back to 4.54 at step 165K when training stopped. Numerically reasonable; semantically empty.


A uniform guess over an 8449-token vocabulary has a cross-entropy of ln(8449) ≈ 9.04 nats. v1 reached an EMA loss of 3.23 while producing `region region region`. Loss alone cannot detect coherence collapse. A model that minimizes cross-entropy by repeating one high-frequency token gets numerically rewarded for the failure mode.
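The 9.04 baseline, and how far a context-blind predictor can get below it, can be checked in a few lines. This is a minimal sketch: the Zipf-shaped token distribution is an assumption chosen for illustration, not ANDREA's measured token frequencies.

```python
import math

VOCAB_SIZE = 8449  # vocabulary size from the lesson

# Loss of a uniform guess over the vocabulary, in nats.
uniform_loss = math.log(VOCAB_SIZE)

# ASSUMPTION: token frequencies follow a Zipf law (p_k proportional to 1/k).
# This is an illustrative stand-in, not the real training distribution.
weights = [1.0 / rank for rank in range(1, VOCAB_SIZE + 1)]
total = sum(weights)
unigram = [w / total for w in weights]

# A model that ignores context and always predicts the unigram distribution
# has an expected cross-entropy equal to the entropy of that distribution.
unigram_loss = -sum(p * math.log(p) for p in unigram)

print(f"uniform baseline:            {uniform_loss:.2f} nats")
print(f"context-blind unigram model: {unigram_loss:.2f} nats")
```

Under this assumption the context-blind model already lands well below the uniform baseline, so a loss far below 9.04 does not by itself show that the model is using context or producing coherent text.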

Why Loss Curves Lied

v1 reached EMA loss 3.23 (well below random chance 9.04) while producing `region region region region`. Explain in 2-3 sentences how a model can achieve numerically reasonable loss while producing degenerate output. Reference the cross-entropy mechanism.

Score Every Sample on Four Axes

The Composite Score

v2 ships a coherence gate that scores every sample (taken every 100 steps during firehose training) on four metrics:


[Figure: Coherence gate flow]


| Metric | Range | What it catches |
| --- | --- | --- |
| Bigram diversity | 0-35 | Repetition at the two-token level (`region region`) |
| Trigram diversity | 0-35 | Repetition at the three-token level (a, b, a, b, a, b) |
| English word presence | 0-20 | Drift into non-English (CJK, Cyrillic, gibberish) |
| Character diversity | 0-10 | Single-character collapse (`'''''`, `... ... ...`) |

Total possible: 100. Threshold: 30.
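The lesson gives the point ranges and the threshold but not the formulas behind each metric. The sketch below is one possible scoring under stated assumptions: diversity is scaled so that "all n-grams identical" maps to 0 and "all distinct" maps to full marks, English presence is a crude ASCII-letters check, and `CHAR_TARGET` is an invented constant. None of these names come from the v2 codebase.

```python
import string

# Maximum points per metric and the pass threshold, as given in the lesson.
MAX_BIGRAM, MAX_TRIGRAM, MAX_ENGLISH, MAX_CHAR = 35, 35, 20, 10
THRESHOLD = 30

# ASSUMPTION: number of distinct characters that earns full marks.
CHAR_TARGET = 20


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def diversity(items):
    """Map 'all identical' to 0.0 and 'all distinct' to 1.0 (assumed scaling)."""
    if len(items) < 2:
        return 0.0
    return (len(set(items)) - 1) / (len(items) - 1)


def looks_english(token):
    """Crude proxy: the token is ASCII letters once edge punctuation is stripped."""
    core = token.strip(string.punctuation)
    return bool(core) and core.isascii() and core.isalpha()


def coherence_score(text):
    """Score one sample on the four axes; return the parts plus the total."""
    tokens = text.split()
    chars = [c for c in text if not c.isspace()]
    english_frac = sum(map(looks_english, tokens)) / len(tokens) if tokens else 0.0
    parts = {
        "bigram": MAX_BIGRAM * diversity(ngrams(tokens, 2)),
        "trigram": MAX_TRIGRAM * diversity(ngrams(tokens, 3)),
        "english": MAX_ENGLISH * english_frac,
        "char": MAX_CHAR * min(1.0, len(set(chars)) / CHAR_TARGET),
    }
    parts["total"] = sum(parts.values())
    return parts


sample = "region region region region region region region region"
result = coherence_score(sample)
print(result, "below threshold:", result["total"] < THRESHOLD)
```

Very short samples make n-gram statistics noisy, so a real gate would likely score longer generations than the one-line example here.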


Why Four Metrics, Not One

Each metric catches a different failure mode:


- A model collapsing to one bigram fails Bigram diversity but passes Character diversity.

- A model producing punctuation noise (`''''' ''''' ''`) fails Character diversity but might pass Bigram diversity if the punctuation pairs vary.

- A model drifting into non-English (translation training contamination) fails English word presence but passes Bigram & Trigram diversity if it produces grammatical Mandarin.

- A model producing a, b, a, b, a, b passes Bigram (a-b & b-a appear) but fails Trigram (a-b-a, b-a-b dominate).


Together, the four metrics span the failure space. A composite score below 30 means at least one axis collapsed badly enough to drag the whole sample down.


Consecutive Counter

Auto-halt fires after 5 consecutive samples score below 30. Single bad samples can occur during phase transitions or rare-source pulls; five in a row mean the model has stopped recovering. With samples taken every 100 steps, 5 consecutive degenerate samples = 500 steps of confirmed coherence collapse.
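The counter logic is small enough to sketch directly. The class and method names below are hypothetical; only the threshold of 30 and the patience of 5 come from the lesson.

```python
class CoherenceGate:
    """Halt after `patience` consecutive sample scores below `threshold`."""

    def __init__(self, threshold=30, patience=5):
        self.threshold = threshold
        self.patience = patience
        self.consecutive_bad = 0

    def update(self, score):
        """Record one sample score; return True when training should halt."""
        if score < self.threshold:
            self.consecutive_bad += 1
        else:
            # Any passing sample resets the streak.
            self.consecutive_bad = 0
        return self.consecutive_bad >= self.patience


gate = CoherenceGate()
# Hypothetical scores: four bad samples, one good one, then five bad ones.
scores = [10, 12, 8, 15, 55, 9, 11, 7, 14, 6]
decisions = [gate.update(s) for s in scores]
print(decisions)
```

The single passing score in the middle resets the streak, so the gate only fires on the last sample, once five failures in a row have accumulated.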

Compute a Score

A v1 sample at step 80K reads `region region region region region region region region`. Estimate scores: (a) Bigram diversity, (b) Trigram diversity, (c) English word presence, (d) Character diversity. Compute the total. Does the gate trigger on this sample alone?

What v1 Would Have Looked Like

Back-Tested Trigger

Given v1's stored samples, applying the v2 coherence gate retroactively shows the gate would have triggered at step 132K. v1 ran to step 165K before manual termination. The gate would have stopped training 33,000 steps earlier.


Compute Saved

The RTX 4090 trained at ~6 steps/min in FP16 cuBLAS. 33,000 steps / 6 steps/min = 5,500 minutes ≈ 91.7 hours ≈ 3.8 days of compute saved. At 130W continuous, that's ~12 kWh of electricity, plus 3.8 days of GPU wear.
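The arithmetic above can be reproduced as a quick check. All inputs are the figures quoted in the lesson; the variable names are only for illustration.

```python
# Figures quoted in the lesson.
killed_at_step = 165_000
gate_trigger_step = 132_000
steps_per_minute = 6      # approximate training throughput
power_watts = 130         # continuous GPU power draw

steps_saved = killed_at_step - gate_trigger_step
minutes_saved = steps_saved / steps_per_minute
hours_saved = minutes_saved / 60
days_saved = hours_saved / 24
energy_kwh = hours_saved * power_watts / 1000

print(f"{steps_saved} steps, {minutes_saved:.0f} min, "
      f"{hours_saved:.1f} h, {days_saved:.1f} days, {energy_kwh:.1f} kWh")
```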


Why Step 132K & Not Step 80K

v1 produced `region region region` at step 80K. Why didn't the gate fire then?


Because intermittent good samples appeared between bad ones. The bandit cycled through sources every 7-42 steps. Even a degenerate model occasionally produced more diverse outputs when sampling from a different source, momentarily resetting the consecutive counter. By step 132K, the model had collapsed deeply enough that 5 consecutive degenerate samples (500 steps) became inevitable.
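A back-test of this kind amounts to replaying stored sample scores through the counter and reporting the first step at which it would have tripped. The function below is a sketch with a hypothetical name and invented scores; it assumes the first stored sample was taken at step `sample_every`.

```python
def backtest_gate(scores, sample_every=100, threshold=30, patience=5):
    """Return the training step at which the gate would halt, or None."""
    consecutive = 0
    for index, score in enumerate(scores, start=1):
        consecutive = consecutive + 1 if score < threshold else 0
        if consecutive >= patience:
            return index * sample_every
    return None


# Hypothetical scores, not v1's real data: mostly degenerate, but a
# passing sample every fifth position keeps resetting the streak.
intermittent = [10, 10, 10, 10, 60] * 4
# Once the passing samples stop, five failures in a row arrive quickly.
collapsed = intermittent + [10] * 5

print(backtest_gate(intermittent))  # no halt
print(backtest_gate(collapsed))     # halts once the streak reaches five
```

For v1's stored samples, which the lesson says were saved every 200 steps, `sample_every=200` would be the matching setting.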


Lesson: Wire the Smoke Alarm to Every Runner

v2 wires eval_chat_quality() directly into the firehose curriculum's sample-handling code path, not just the legacy runner. Every sample, every run, every code path: the same gate. The fix took ~30 lines of code.
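One way to make this wiring hard to forget is to build a single sample hook and make it a required argument of every runner. The sketch below is not the v2 code: `make_gate_hook`, `run_legacy`, `run_firehose`, and `CoherenceCollapse` are hypothetical names, and `score_fn` stands in for `eval_chat_quality()`, whose real signature the lesson does not show.

```python
from typing import Callable, Iterable, Tuple

SampleHook = Callable[[int, str], None]


class CoherenceCollapse(RuntimeError):
    """Raised by the hook to stop whichever runner is active."""


def make_gate_hook(score_fn: Callable[[str], float],
                   threshold: float = 30, patience: int = 5) -> SampleHook:
    """Build the one hook that every runner must call for every sample."""
    bad_streak = 0

    def hook(step: int, text: str) -> None:
        nonlocal bad_streak
        bad_streak = bad_streak + 1 if score_fn(text) < threshold else 0
        if bad_streak >= patience:
            raise CoherenceCollapse(f"halting at step {step}")

    return hook


# Hypothetical runners: the hook is a required argument, so neither code
# path can be started without the gate attached.
def run_legacy(samples: Iterable[Tuple[int, str]], on_sample: SampleHook) -> None:
    for step, text in samples:
        on_sample(step, text)


def run_firehose(samples: Iterable[Tuple[int, str]], on_sample: SampleHook) -> None:
    for step, text in samples:
        on_sample(step, text)
```

Because both runners take the same hook, adding a new code path means passing the gate in explicitly rather than remembering to call it.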

Generalize the Engineering Pattern

v1 wasted 3.8 days because eval_chat_quality() was wired only to one runner. Argue (in 2-3 sentences) what the v2 coherence gate establishes as an engineering principle for long-running ML training. Reference both the wiring choice & the composite-metric design.