v1's Lesson: Loss Looks Fine, Output Is Garbage

A Cautionary Tale

ANDREA-120M v1 reached EMA loss 3.43 at step 110K, well below random chance (ln(8449) = 9.04). The number looked respectable. The samples did not.


step 80K:  region region region region region region region
step 110K: ''''' ''''' '' ''' '' ''' '''?' ''' ' '' '' '
step 140K: games, games, games, games, games, games
step 165K: Budy Budy Budy Budy Budy Budy Budy Budy

v1 had no sample monitoring wired up. The model produced repetition-loop garbage from step 80K onward & training continued for 85K more steps before someone noticed. 10+ days of compute wasted because nobody read the output.


What Loss Hides

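Cross-entropy loss measures how surprised the model is by the next token of the training text, given the true preceding tokens. It never scores what the model generates on its own. A model that emits 'region region region region' is unsurprised by its own output (it predicts the same word every time), & nothing in the loss penalizes that loop. Numerical loss can stay low while semantic quality collapses.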


The v2 Fix

sample_every = 100 steps. Generate 420 free-form tokens. Coherence-gated early stopping scores every sample on bigram diversity, trigram diversity, English word presence, & character diversity (0-100 scale). Auto-halt after 5 consecutive samples score below 30. Back-tested on v1: would have triggered at step 132K, saving 3.8 days.
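A minimal sketch of what such a gate could look like. Only the four signals, the 0-100 scale, the threshold of 30, & the 5-sample patience come from the description above. The equal weighting, the tiny COMMON_WORDS list, the saturation constants, & the names coherence_score & CoherenceGate are illustrative assumptions, not ANDREA's actual scorer.

```python
import string
from collections import deque

# Hypothetical stand-in for a real English word list (assumption).
COMMON_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it",
                "that", "for", "on", "with", "as", "was", "are"}


def ngram_diversity(words, n):
    """Fraction of distinct n-grams among all n-grams (0.0 to 1.0)."""
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0


def coherence_score(text):
    """Score a sample 0-100 as an equal-weight mean of four signals.

    The equal weights and the saturation constants (0.2 and 15) are assumed.
    """
    raw = text.lower().split()
    # Strip surrounding punctuation so "games," and "games" count as one word.
    words = [w for w in (t.strip(string.punctuation) for t in raw) if w]
    bigram = ngram_diversity(words, 2)
    trigram = ngram_diversity(words, 3)
    if words:
        common = sum(w in COMMON_WORDS for w in words) / len(words)
        english = min(1.0, common / 0.2)  # saturates at 20% common words
    else:
        english = 0.0
    distinct_chars = {c for c in text.lower() if not c.isspace()}
    char_div = min(1.0, len(distinct_chars) / 15)  # saturates at 15 characters
    return 100.0 * (bigram + trigram + english + char_div) / 4


class CoherenceGate:
    """Signal a halt after `patience` consecutive samples score below `threshold`."""

    def __init__(self, threshold=30.0, patience=5):
        self.threshold = threshold
        self.patience = patience
        self.recent = deque(maxlen=patience)

    def update(self, text):
        """Record one sample; return True when training should halt."""
        self.recent.append(coherence_score(text))
        return (len(self.recent) == self.patience
                and all(s < self.threshold for s in self.recent))


gate = CoherenceGate()
for _ in range(5):
    halt = gate.update("region region region region region region region")
print(halt)  # True: five collapsed samples in a row
```

The gate only looks at the last five scores, so a single healthy sample resets the streak. A real scorer would need a proper dictionary & tuning against known-good & known-bad samples.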


Reading samples is not optional. Reading samples is how we know loss means anything.

Loss vs Sample Quality

v1 reached EMA loss 3.43 (well below random 9.04) but emitted 'region region region'. Explain in two parts: (a) HOW can loss stay numerically reasonable while output collapses to repetition? (b) WHAT structural fix in v2 catches this without depending on a human reading every sample?

ppl = exp(loss)

The Conversion

Cross-entropy loss reports in nats. Perplexity reports the equivalent number of equally-likely tokens the model considers at each step. Conversion: ppl = exp(loss).


Random over an 8449-token vocab: loss = ln(8449) = 9.04, ppl = 8449. Memorized perfect prediction: loss = 0, ppl = 1.
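The conversion can be checked in a few lines. This is a sketch using only the standard library; the function names loss_to_ppl & ppl_to_loss are illustrative, & the inputs are the figures quoted in this lesson.

```python
import math

VOCAB_SIZE = 8449  # vocabulary size quoted in the text


def loss_to_ppl(loss_nats):
    """Perplexity is the exponential of cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


def ppl_to_loss(ppl):
    """Inverse conversion: loss in nats is the natural log of perplexity."""
    return math.log(ppl)


random_loss = ppl_to_loss(VOCAB_SIZE)
print(f"{random_loss:.2f}")              # 9.04, uniform guess over the vocab
print(round(loss_to_ppl(random_loss)))   # 8449
print(loss_to_ppl(0.0))                  # 1.0, perfect prediction
print(f"{loss_to_ppl(3.43):.1f}")        # 30.9, v1's EMA loss as perplexity
```

Seen as perplexity, v1's loss of 3.43 meant the model was choosing among roughly 31 equally likely tokens per step, far better than 8449, which is why the number looked respectable.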


A Reference Table


loss | ppl  | Reading
9.04 | 8449 | random over full vocab
5.00 | 148  | early training, basic structure
3.00 | 20   | learning vocabulary distribution
2.00 | 7    | knowledgeable but imprecise
1.00 | 2.7  | constraint-following emerging
0.70 | 2    | textbook one-liner
0.30 | 1    | factual recall, mostly memorized
0.13 | 1    | ALERT: memorized substring
0.00 | 1    | perfect memorization

Per-Sample Loss vs EMA Loss

EMA loss (exponential moving average over many steps) reports overall training health. Per-sample loss reports one specific sample's quality. The two diverge: EMA might sit at 2.0 while individual samples land anywhere from 0.13 to 4.0 depending on which prompt the bandit selected.


Reading individual sample loss is how we catch outliers. EMA loss tells us nothing about whether one sample memorized a repo-docs substring; per-sample loss does.
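A small sketch of why the two views differ. The loss values are made up for illustration, the decay of 0.9 is an assumption (the lesson does not state ANDREA's EMA decay), & the 0.20 cutoff is the memorization threshold used in this lesson.

```python
def ema(values, decay=0.9):
    """Exponential moving average; the first value seeds the average."""
    avg = None
    smoothed = []
    for v in values:
        avg = v if avg is None else decay * avg + (1 - decay) * v
        smoothed.append(avg)
    return smoothed


def low_loss_outliers(per_sample_losses, cutoff=0.20):
    """Return (index, loss) pairs for samples below the memorization cutoff."""
    return [(i, loss) for i, loss in enumerate(per_sample_losses) if loss < cutoff]


# Illustrative per-sample losses, not real ANDREA data.
losses = [2.1, 1.9, 2.0, 0.13, 2.2, 4.0, 1.8]

smoothed = ema(losses)
print([round(s, 2) for s in smoothed])  # stays in a narrow band near 2.0
print(low_loss_outliers(losses))        # [(3, 0.13)]
```

The smoothed curve barely moves when the 0.13 sample arrives, which is the point: the outlier is only visible in the per-sample view.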

Reading Three Loss Values

ANDREA-120M produced three samples in close succession between steps 112,500 & 112,668. (a) loss 0.30 ppl ?; (b) loss 0.74 ppl ?; (c) loss 1.94 ppl ?. Compute each ppl, then state the quality tier each represents (memorized / textbook / constraint / imprecise).

Walking the Audit Window

A 700-Step Window

Between step 111,996 & 112,668, ANDREA-120M produced samples spanning six unrelated knowledge domains. Each lands in a loss tier:


Sample Audit Loss Tiers


Tier 1: Factual Recall (loss 0.30, ppl 1)

step 112,584, biology


> [extinction prompt]

> < black spider montano is alive, carolina parakeet is extinct.


Two correct facts. The Carolina parakeet was declared extinct in 1939; the black spider monkey is extant. exp(0.30) is about 1.35, shown rounded to ppl 1: the model was close to certain of the right token at every step. External grading: 9.5/10.


Tier 2: Textbook One-Liner (loss 0.74, ppl 2)

step 112,668, signal processing


> [Fourier transform prompt]

> < The Fourier transform decomposes signals into frequency components, allowing analysis of signal characteristics & filtering in the frequency domain.


A textbook one-sentence definition. Time-to-frequency decomposition with correct application terms. External grading: 9.5/10.


Tier 3: Constraint Following (loss 1.05, ppl 3)

step 112,248, poetry


> Write a haiku about rain using three words that start with 'r'.

> < Rain's rhythmic refrain, Rivulets on the window, Respite from life's pain.


Each of the three lines opens with an 'r' word (Rain's, Rivulets, Respite), the haiku structure holds, & the lines cohere around rain. ppl 3 means the model weighed a few candidates at each step but stayed inside the constraint.


Tier 4: Knowledgeable but Imprecise (loss 1.94, ppl 7)

step 112,500, python tools


> How do I find outdated packages in a Python project?

> < Use pip list --outdated names to see the pip packages list in your project management, or should be via pyenv to get more automatic packages.


The core command (pip list --outdated) is right & emerges from training data. Phrasing fluency is still developing. Minor mix-up: pyenv manages Python versions, not packages. ppl 7 means the model considered ~7 plausible continuations at each step; not all were precisely correct.


Tier 5: ALERT, Memorized Substring (loss 0.13, ppl 1)

step 112,080, anomalous


Loss 0.13 sits BELOW the textbook tier. The model is too confident on this sample. Diagnosis: a memorized repo-docs substring, not learned chat. The bandit pulled a zombie repo-docstrings arm carrying weight 1.546 from a prior run state, & the sample reproduced a verbatim chunk of training data.


Anomalously low loss is a memorization warning, not a quality signal. Sub-0.20 means: check whether the model is generating or copying.
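The five tiers can be sketched as a lookup. Only the sub-0.20 alert is a threshold stated in this lesson. The other cutoffs (0.50, 0.85, 1.50, 2.50) are assumptions, taken as midpoints between rows of the reference table, & the name classify is illustrative.

```python
import math

MEMORIZATION_ALERT = 0.20  # from the lesson: sub-0.20 means check for copying

# Assumed upper bounds, chosen as midpoints between reference-table rows.
TIER_UPPER_BOUNDS = [
    (0.50, "factual recall"),
    (0.85, "textbook one-liner"),
    (1.50, "constraint following"),
    (2.50, "knowledgeable but imprecise"),
]


def classify(loss):
    """Return (tier name, perplexity) for one per-sample loss in nats."""
    ppl = math.exp(loss)
    if loss < MEMORIZATION_ALERT:
        return "ALERT: possible memorized substring", ppl
    for upper, name in TIER_UPPER_BOUNDS:
        if loss < upper:
            return name, ppl
    return "above the audit tiers", ppl


for loss in (0.30, 0.74, 1.05, 1.94, 0.13):
    tier, ppl = classify(loss)
    print(f"loss {loss:.2f}  ppl {ppl:.2f}  {tier}")
```

The alert check runs first on purpose: a very low loss should never be read as the best tier before someone confirms the model generated the text rather than copied it.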


Six Domains in 700 Steps

Biology (parakeet), signal processing (Fourier), poetry (haiku), python tools (pip), conversational dialogue, ops dialogue. Six unrelated domains within 700 steps tell us the bandit is doing diverse work, not stuck on one source. Domain breadth IS a quality metric.
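Domain breadth is easy to count once each sample carries a domain label. The sketch below assumes those labels already exist (hand-assigned or taken from prompt metadata); the function name breadth_report & the report fields are illustrative.

```python
from collections import Counter


def breadth_report(domains):
    """Summarize how many distinct domains appear and how dominant the top one is."""
    if not domains:
        raise ValueError("need at least one labelled sample")
    counts = Counter(domains)
    top_domain, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "top_domain": top_domain,
        "top_share": top_count / len(domains),
    }


# The six domains listed for the audit window.
window = ["biology", "signal processing", "poetry",
          "python tools", "conversational dialogue", "ops dialogue"]
print(breadth_report(window)["distinct"])   # 6

# A stuck bandit: one domain repeated 7 times.
stuck = ["biology"] * 7
print(breadth_report(stuck)["top_share"])   # 1.0
```

What counts as enough breadth is a judgment call; the lesson's two reference points are six domains (healthy) & one domain repeated seven times (stuck).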

Diagnosing Three Samples

Three new samples land in your audit window. (a) loss 0.40, generates 'photosynthesis converts sunlight into chemical energy in chloroplasts'. (b) loss 0.10, generates a verbatim chunk of a Python docstring. (c) loss 1.30, generates a sonnet that follows ABAB rhyme scheme but with one slightly forced rhyme. For each, name the quality tier & state your action: ACCEPT (genuine learning), INVESTIGATE (anomaly signal), or ACCEPT_WITH_NOTE (imperfect but bandit healthy).

Why Submit Samples to Outside Eyes

What External Grading Caught

Internal sample audit told us the model was producing biology, signal processing, poetry, & python on demand. External chat-quality grading rated those samples '9.5/10' & 'punching above its weight on knowledge tasks at this scale'.


Internal review answers: did the bandit do diverse work? External review answers: would a human reader rate these outputs as good?


Why Both Matter

Internal audit catches structural failures: repetition collapse, memorization spikes, low-diversity zombie arms. Loss tiers, n-gram diversity, & domain breadth are all observable from the proxy.


External grading catches semantic quality failures: confidently-wrong facts, awkward phrasing, missed nuance. None of those show up in loss numbers.


Methodology

ANDREA's training dashboard at training.ai.unturf.com/dashboard is intentionally public & read-only. Anyone can poll .loss.json, .samples.json, & bandit state in real time. External reviewers had access to the same data the operator did.


9.5/10 from an independent reader, on samples drawn at step 112,584 of 200,000, with full provenance: that result is reproducible, auditable, & not gameable. The same samples, the same loss values, the same bandit state are visible to anyone who looks.


Two Independent Signals

Internal: low loss + high diversity + multi-domain coverage = bandit healthy.

External: 9.5/10 from independent reviewer = output rates as good.


Both align: training is converging on factual recall, constraint following, & multi-paragraph coherence. If they diverged (low loss but external rated 3/10), we would have a metric-gaming problem to investigate.
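The comparison of the two signals can be written as a small decision function. This is a sketch under assumptions: the pass mark of 7/10 is invented for illustration, the internal verdict is passed in as a boolean rather than computed, & the case where internal metrics fail but the external rating is good is not discussed in this lesson.

```python
def diagnose(internal_healthy, external_rating, pass_mark=7.0):
    """Combine the internal verdict with an external 0-10 rating.

    pass_mark is an assumed cutoff for "output rates as good".
    """
    external_good = external_rating >= pass_mark
    if internal_healthy and external_good:
        return "aligned: training is converging"
    if internal_healthy and not external_good:
        return "diverged: possible metric gaming, investigate"
    if not internal_healthy and external_good:
        # Not covered by the lesson; treated here as a reason to recheck metrics.
        return "diverged: recheck the internal metrics"
    return "aligned: both signals report a problem"


print(diagnose(True, 9.5))  # aligned: training is converging
print(diagnose(True, 3.0))  # diverged: possible metric gaming, investigate
```

The useful output is the disagreement, not the score: whenever the two signals split, one of them is measuring the wrong thing.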

Two Signals, One Diagnosis

Imagine ANDREA samples get external grading at step 100K & step 150K. (a) Step 100K: internal EMA loss 2.5, n-gram diversity 70/100, external rating 3/10. What does the gap between internal & external suggest? (b) Step 150K: internal EMA loss 2.0, diversity 85/100, external 9/10. What does alignment of internal & external suggest? Give one sentence per scenario.

Five Steps Per Audit Window

One Audit, Five Checks

1. Read the loss tier. ppl = exp(loss). Match against the five-tier table.

2. Check for sub-0.20 outliers. Memorization signal. Investigate before treating as a quality result.

3. Read the actual sample text. Loss numbers cannot tell you what the output says. Read it.

4. Count domain breadth. Six unrelated domains in 700 steps = bandit healthy. One domain repeated 7 times = bandit stuck.

5. Compare with external grading. If your sample looks good to you, ask someone outside the run to read it. Their disagreement is information.


What This Connects To


- Activity 22 (grow_a_language_model_checkpoints). sample_every cadence aligns with checkpoint cadence; both fire every 100 steps.

- Activity 21 (coherence-gated early stopping). Diversity metrics that auto-halt training when samples collapse.

- Activity 24 (grow_a_language_model_microgpt_to_andrea). v1 collapse, v2.5 contamination, v3 polish all caught (or could have been caught) by sample audit.


One Truth

Loss is a number. Reading samples is how we know what the number means.

What Will You Watch?

Of the five audit checks (loss tier, sub-0.20 outliers, sample text, domain breadth, external grading), which one would you put highest priority on if you trained your own model? Pick one with 2-3 sentences of reasoning.