How Surprised Should The Model Be?
From Logits To Probabilities
After 12 transformer blocks, ANDREA-120M produces a vector of vocab_size numbers per token position: the logits. For ANDREA-120M, vocab_size = 8449, so every position outputs 8449 logits. Logits are unnormalized scores: some positive, some negative, with no constraint to sum to 1.
Softmax converts logits into a probability distribution:
p_i = exp(logit_i) / sum_j exp(logit_j)
After softmax, all 8449 numbers sit between 0 & 1, summing to 1. The model assigns probability to every possible next token.
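The softmax formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not ANDREA's actual code: the function name `softmax` and the 4-logit toy input are assumptions, and the max-subtraction step is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of unnormalized logits into probabilities."""
    # Subtracting the max logit leaves the result mathematically unchanged
    # but keeps exp() from overflowing when logits are large.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 4 logits instead of ANDREA-120M's 8449.
probs = softmax([2.0, 1.0, 0.0, -1.0])
print([round(p, 3) for p in probs])
print(sum(probs))  # sums to 1 up to floating-point rounding
```

The largest logit receives the largest probability, & every entry stays strictly between 0 & 1.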
Cross-Entropy Loss
Training requires a loss function: a number that measures how wrong the model's prediction was, given the correct answer. Cross-entropy is the standard choice for language modeling:
loss_t = -log(p_correct_token_t)
Take the model's predicted probability for the actual next token (the one in the training data). Take the negative log of that probability. That's the loss for one position.
Why Negative Log
Three properties make -log(p) a natural loss function:
- -log(1) = 0: When the model predicts the correct token with 100% confidence, loss is zero.
- -log(0) = ∞: When the model assigns zero probability to the correct token, loss is infinite. (Mathematically, softmax never outputs exactly 0, so the loss stays finite but can be very large.)
- Monotonic: As predicted probability for the correct token increases, loss decreases smoothly.
Higher confidence on the correct answer = lower loss. The training objective is straightforward: maximize predicted probability for the actual next token.
Per-Sequence Loss
ANDREA trains on sequences of length 1024 (the context window). Each sequence produces 1024 next-token predictions. The sequence loss averages across all positions:
sequence_loss = mean(-log(p_correct_t)) for t in 0..1023
Then sequence losses get averaged across the batch (ANDREA-120M uses batch_size = 8). One scalar number per training step. That number is what the loss curve plots.
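The two averaging steps can be sketched as follows. This is an illustration under assumptions: the function names `sequence_loss` & `batch_loss` are hypothetical, & the toy batch uses 2 sequences of 3 positions rather than 8 sequences of 1024.

```python
import math

def sequence_loss(p_correct):
    """Mean of -log(p) over every position in one sequence."""
    return sum(-math.log(p) for p in p_correct) / len(p_correct)

def batch_loss(batch):
    """Mean of the per-sequence losses across the batch."""
    return sum(sequence_loss(seq) for seq in batch) / len(batch)

# Each inner list holds the probability the model gave the correct
# next token at each position (toy values, not real training data).
batch = [
    [0.5, 0.25, 0.125],
    [1.0, 0.5, 0.5],
]
step_loss = batch_loss(batch)  # the single scalar for this training step
print(round(step_loss, 4))
```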
Computing Loss For One Position
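A minimal sketch of the single-position calculation, going straight from logits to loss. The function name `position_loss` & the toy logits are assumptions for illustration. It uses the log-sum-exp identity, -log(softmax(logits)[i]) = log(sum_j exp(logit_j)) - logit_i, which is a common way to avoid computing tiny probabilities explicitly; the source does not say how ANDREA computes it.

```python
import math

def position_loss(logits, correct_index):
    """Cross-entropy for one position: -log(softmax(logits)[correct_index])."""
    # Log-sum-exp with the max subtracted, so the log-probability is
    # computed directly without forming a probability that might underflow.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[correct_index]

logits = [2.0, 1.0, 0.0, -1.0]             # toy 4-token vocabulary
loss_confident = position_loss(logits, 0)  # correct token had the top logit
loss_surprised = position_loss(logits, 3)  # correct token had the lowest logit
print(loss_confident, loss_surprised)
```

The same logits give a small loss when the correct token was the model's favorite & a large loss when it was the model's least favorite.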
Perplexity = exp(loss)
A Friendlier Scale
Loss values like 2.0 or 3.43 don't immediately convey what the model can do. Perplexity translates loss onto a more intuitive scale:
perplexity = exp(loss)
Perplexity answers a clean question: among how many equally-likely tokens does the model effectively choose? A perplexity of 7 means the model behaves as if picking from 7 plausible next tokens at each position. A perplexity of 1 means perfect prediction.
Common Loss-Perplexity Pairs
| Loss | Perplexity | What it feels like |
|---|---|---|
| 0.0 | 1.0 | Perfect prediction |
| 1.0 | 2.7 | Choosing among ~3 plausible tokens |
| 2.0 | 7.4 | ANDREA-12M final SMMA territory |
| 3.0 | 20.1 | Reasonable text but uncertain |
| 3.43 | 30.9 | ANDREA-120M v1 minimum (before polish) |
| 5.0 | 148 | Early training, learning vocabulary distribution |
| 9.04 | 8449 | Random-chance baseline for ANDREA-120M's vocab |
Perplexity puts loss values into context: a loss of 2.0 means the model effectively picks from ~7 tokens, not from 8449.
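The table rows can be reproduced with a one-line conversion. A small sketch; the helper name `perplexity` is an assumption, & the loss values are the ones listed in the table.

```python
import math

def perplexity(loss):
    """Effective number of equally likely tokens for a given loss (in nats)."""
    return math.exp(loss)

for loss in (0.0, 1.0, 2.0, 3.0, 3.43, 5.0):
    print(f"loss {loss:4.2f} -> perplexity {perplexity(loss):6.1f}")
```

Because the loss uses the natural log, the conversion uses exp; a loss measured in bits would use 2 ** loss instead.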
The Random-Chance Baseline
A model that knows nothing & guesses uniformly assigns probability 1/V to every token, where V = vocab_size:
p_uniform = 1 / V
loss = -log(1/V) = log(V)
For ANDREA-120M with V = 8449:
loss_uniform = ln(8449) ≈ 9.04
For ANDREA-12M with V = 2305:
loss_uniform = ln(2305) ≈ 7.74
Any loss above this baseline means the model performs worse than random. Any loss below it means the model has learned something: it concentrates probability mass on a smaller subset of tokens than uniform would.
Reading A Loss Value
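One way to read a loss value is to place it next to the uniform-guess baseline & convert it to perplexity. The sketch below does both; the function name `read_loss` & the dictionary keys are hypothetical, chosen only for this example.

```python
import math

def read_loss(loss, vocab_size):
    """Summarize a loss value against the uniform-guess baseline."""
    baseline = math.log(vocab_size)   # loss of a model guessing uniformly
    ppl = math.exp(loss)              # effective number of choices
    return {
        "baseline": baseline,
        "perplexity": ppl,
        "beats_random": loss < baseline,
        # How many times smaller the effective choice set is than the vocab.
        "narrowing_factor": vocab_size / ppl,
    }

summary = read_loss(3.43, 8449)   # ANDREA-120M v1 minimum
for key, value in summary.items():
    print(key, value)
```

For a loss of 3.43 against a vocabulary of 8449, the narrowing factor comes out in the low 270s: the model behaves as if choosing among roughly 31 tokens rather than 8449.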
Smoothing Step-Level Noise
Raw Loss Is Noisy
Per-step loss bounces around. ANDREA's bandit picks a different source every 7-42 steps; some sources (dictionary definitions) produce easy losses; others (gutenberg paragraphs) produce harder losses. Plotting raw step loss against step number produces a chaotic scatter.
Smoothed Modified Moving Average (SMMA) damps the noise & reveals the trend. ANDREA's training proxy computes SMMA as:
SMMA[0] = loss[0]
SMMA[t] = (SMMA[t-1] * (N-1) + loss[t]) / N
With N = 100 (ANDREA's default smoothing window), each new SMMA value mixes 99% of the previous SMMA with 1% of the new step loss. Sudden spikes get absorbed; sustained shifts show up gradually.
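The recurrence above translates directly into code. A minimal sketch, assuming the per-step losses are available as a list; the function name `smma` & the spike example are illustrative, not taken from ANDREA's training proxy.

```python
def smma(losses, n=100):
    """Return the SMMA series for a list of per-step losses."""
    smoothed = []
    for t, loss in enumerate(losses):
        if t == 0:
            current = loss                          # SMMA[0] = loss[0]
        else:
            current = (current * (n - 1) + loss) / n
        smoothed.append(current)
    return smoothed

# A steady loss of 3.0 with a single spike to 6.0 at step 5.
raw = [3.0] * 5 + [6.0] + [3.0] * 5
curve = smma(raw, n=100)
print(max(curve))   # the spike moves the smoothed value by only 1% of its size
```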
Why Not Just Averaging?
A simple moving average over the last 100 steps requires storing 100 loss values. SMMA stores one value (the running average) & one constant (the window size). Memory-cheap, computationally trivial, & smooth enough to read a curve.
Different window sizes answer different questions:
- N = 10: tracks short-term changes; useful during phase transitions
- N = 100: ANDREA's default; tracks medium-term progress
- N = 1000: long-term trend only; useful at the end of training
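The trade-off between these window sizes can be made concrete by measuring how long SMMA takes to catch up after the underlying loss shifts. The sketch below is an illustration only: the function name `steps_to_settle`, the 4.0 to 3.0 shift, & the 0.1 tolerance are all assumptions.

```python
def steps_to_settle(n, old=4.0, new=3.0, tolerance=0.1):
    """Count steps until SMMA is within `tolerance` of a new loss level."""
    current = old
    steps = 0
    while abs(current - new) > tolerance:
        # Same recurrence as SMMA, fed a constant loss at the new level.
        current = (current * (n - 1) + new) / n
        steps += 1
    return steps

for n in (10, 100, 1000):
    print(f"N = {n:4d}: {steps_to_settle(n)} steps to settle")
```

Each tenfold increase in N makes the curve roughly ten times slower to reflect a sustained shift, which is the price paid for the extra smoothness.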
What ANDREA Tracks
Every 100 steps, the training proxy writes loss.json with the current SMMA, raw loss, step number, & per-source breakdowns. The dashboard at training.ai.unturf.com/dashboard polls this file every 10 seconds. External viewers see live progress; the dashboard is read-only.
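A snapshot writer along these lines might look like the sketch below. Everything here is an assumption: the source only says that loss.json holds the current SMMA, raw loss, step number, & per-source breakdowns, so the key names, the function `write_loss_snapshot`, the placeholder values, & the write-then-rename approach are all hypothetical. The rename is included because a file polled every few seconds could otherwise be read half-written.

```python
import json
import os
import tempfile

def write_loss_snapshot(path, step, smma, raw_loss, per_source):
    """Write one snapshot so a poller never sees a partially written file."""
    snapshot = {
        "step": step,              # hypothetical key names throughout
        "smma": smma,
        "raw_loss": raw_loss,
        "per_source": per_source,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(snapshot, f)
    os.replace(tmp_path, path)     # atomic rename within one filesystem

# Placeholder values for illustration only, not real training output.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "loss.json")
write_loss_snapshot(out_path, 43600, 2.0, 2.1, {"dictionary": 1.5, "gutenberg": 2.5})
```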
ANDREA-12M's Actual Curve
The Recipe That Reached SMMA 2.0
| Steps | Avg Loss | Notes |
|---|---|---|
| 0-2.5K | 4.50 | Random init, early learning |
| 2.5K-5K | 3.88 | Fast decline through structure phase |
| 5K-10K | 3.30 | Approaching coherence boundary |
| 10K-20K | 2.80 | Bandit finding optimal mix |
| 20K-25K | 2.40 | Plateau: data starvation |
| 25K-30K | 2.50 | Hermes data introduced + LR restart |
| 30K-35K | 2.35 | Hermes integrated, new lows |
| 35K-40K | 2.10 | 4-arm focus, steep descent |
| 40K-43.6K | 2.00 | Knowledge territory, SMMA below 2.0 |
Three phases stand out:
1. Steep early descent (0-10K). Loss falls from 4.50 to 3.30 as the model learns vocabulary distribution & basic turn structure. Random-chance baseline ln(2305) ≈ 7.74 sits high above this curve; the model concentrates probability mass quickly once embeddings stabilize.
2. Plateau (20K-25K). Loss stalls at 2.40. The bandit had run out of headroom on its current source mix. Adding Hermes data at step 25K, plus an LR warm restart, broke the plateau.
3. Final descent (35K-43.6K). Curriculum narrowed from 16 sources to 4 (hermes3-general + dictionary + gutenberg + chat). The narrowed curriculum produced a steeper loss decline than the full-arm bandit. Final SMMA: 2.0.
ANDREA-120M v1: A Cautionary Curve
Same vocabulary calculation: ln(8449) ≈ 9.04. ANDREA-120M v1 reached SMMA 3.43 at step 110K (its minimum), then diverged:
| Steps | EMA Loss | Trend |
|---|---|---|
| 26K-40K | 4.29 | Converging |
| 70K-85K | 3.60 | Best region |
| 85K-110K | 3.43 | Minimum |
| 110K-125K | 3.54 | Diverging |
| 140K-155K | 4.05 | Diverging |
| 155K-165K | 4.54 | Collapsed |
Loss values stayed numerically reasonable throughout (3.43 sits well below the 9.04 random baseline). But the samples showed repetition collapse: Budy Budy Budy Budy. Loss told a misleading story; sample audits did not.
v2's coherence-gated early stopping (activity 78) added a parallel signal: bigram diversity, trigram diversity, English word presence, character diversity. When all four scores stay below 30 for 5 consecutive samples, training auto-halts. This signal would have caught v1 at step 132K, saving 3.8 days of compute.
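The halting rule described above (all four scores below 30 for 5 consecutive samples) can be sketched as a small stateful gate. This is an assumption-heavy illustration: the class name `CoherenceGate`, the 0-100 score scale, & the example scores are hypothetical, & how v2 actually computes each score is not shown here.

```python
from collections import deque

class CoherenceGate:
    """Signal a halt when every score stays below a threshold for N samples."""

    def __init__(self, threshold=30, patience=5):
        self.threshold = threshold
        # Holds one boolean per recent sample: True if that sample failed.
        self.recent = deque(maxlen=patience)

    def update(self, scores):
        """Record one sample's four scores; return True if training should halt."""
        self.recent.append(all(s < self.threshold for s in scores))
        return len(self.recent) == self.recent.maxlen and all(self.recent)

gate = CoherenceGate()
# Order assumed: bigram, trigram, English-word, character scores.
bad = (12, 8, 5, 20)
good = (70, 65, 80, 60)
halts = [gate.update(s) for s in [bad, bad, good, bad, bad, bad, bad, bad]]
print(halts)   # only the final sample completes 5 consecutive failures
```

A single coherent sample resets the streak, so the gate only fires on sustained collapse rather than on one bad generation.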
What Each Phase Tells You
The 120M v1 Lesson
Two Independent Signals
ANDREA-120M v1 reached SMMA 3.43 at step 110K. Numerically: 273× better than random chance (perplexity 31 vs vocab size 8449). Looks great on the curve.
Sample at step 110K::
''''' ''''' '' ''' '' ''' '''?' ''' ' '' '' '
Loss & coherence are independent signals. Low loss means the model concentrates probability mass effectively. Coherence means the model produces text humans can read. The first does not imply the second.
Why The Gap Exists
A model can lower loss by:
1. Learning real language patterns (good): subject-verb agreement, common phrases, factual associations.
2. Memorizing structural artifacts (bad): high-frequency repetition patterns, dataset-specific quirks, degenerate attractors.
Both reduce loss. The first produces samples humans like. The second produces Budy Budy Budy Budy. The training objective alone cannot distinguish them.
ANDREA-120M v3 Polish: Where Loss & Coherence Met
After v3 polish (step 112K onward), ANDREA-120M produces:
Step 112,584, loss 0.30, ppl 1::
> [extinction prompt]
< black spider montano is alive, carolina parakeet is extinct.
Both facts correct. Carolina parakeet declared extinct 1939; black spider monkey extant. Low loss (0.30) AND coherent factual recall.
Step 112,500, loss 1.94, ppl 7::
> How do I find outdated packages in a Python project?
< Use pip list --outdated names to see the pip packages list...
Higher loss (1.94 → ppl 7) but the right tool emerges from training data. Phrasing fluency still developing at the 56% training mark.
The Two-Signal Discipline
Modern training pipelines monitor BOTH:
- Loss curve. Tells you, quantitatively, whether the model is learning anything.
- Sample audit. Tells you if what the model learned is useful.
v2 added coherence-gated early stopping (activity 78). v3 polish was a curriculum perturbation triggered by sample audits, not by loss values. A good loss value is necessary but never sufficient.