
How Surprised Should The Model Be?

Loss Pipeline: Logits To Cross-Entropy


From Logits To Probabilities

After 12 transformer blocks, ANDREA-120M produces a vector of vocab_size numbers per token position: the logits. For ANDREA-120M, vocab_size = 8449, so every position outputs 8449 logits. Logits are unnormalized scores; some positive, some negative, no constraint to sum to 1.


Softmax converts logits into a probability distribution:


p_i = exp(logit_i) / sum_j exp(logit_j)

After softmax, all 8449 numbers sit between 0 & 1, summing to 1. The model assigns probability to every possible next token.
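The softmax formula above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy 4-token vocabulary, not ANDREA's actual implementation; the `softmax` helper & the example logits are assumptions made for the sketch. Subtracting the largest logit first is a standard trick that leaves the result unchanged but avoids overflow in `exp`.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Shifting by the max does not change the result mathematically,
    # but it keeps exp() from overflowing on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary (illustrative values, not real model output).
probs = softmax([2.0, 1.0, 0.0, -1.0])
print(probs, sum(probs))
```

The same function applies unchanged to a list of 8449 logits; only the length of the input differs.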


Cross-Entropy Loss

Training requires a loss function: a single number that measures how wrong the model's prediction was, given the correct answer. Cross-entropy is the standard choice for language modeling:


loss_t = -log(p_correct_token_t)

Take the model's predicted probability for the actual next token (the one in the training data). Take the negative log of that probability. That's the loss for one position.
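As a small sketch of that recipe, assuming a hypothetical softmax output over a toy 4-token vocabulary (the probabilities & the `position_loss` name are made up for illustration):

```python
import math

def position_loss(probs, target_index):
    """Cross-entropy at one position: -log of the probability
    the model gave to the token that actually came next."""
    return -math.log(probs[target_index])

# Hypothetical softmax output over a toy 4-token vocabulary.
probs = [0.70, 0.20, 0.05, 0.05]

confident = position_loss(probs, 0)  # correct token received 0.70
surprised = position_loss(probs, 2)  # correct token received 0.05
print(round(confident, 3), round(surprised, 3))
```

The same distribution yields a small loss when the likely token turns out to be correct & a much larger loss when an unlikely token does.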


Why Negative Log

Three properties make -log(p) a natural loss function:


- -log(1) = 0: When the model predicts the correct token with 100% confidence, loss is zero.

- -log(0) = ∞: When the model assigns zero probability to the correct token, loss is infinite. (In practice, softmax never outputs exactly 0; the loss stays finite but large.)

- Monotonic: As predicted probability for the correct token increases, loss decreases smoothly.


Higher confidence on the correct answer = lower loss. The training objective is straightforward: maximize predicted probability for the actual next token.
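One way to see why the loss stays finite in practice: the loss can be computed straight from the logits with the log-sum-exp identity, -log(softmax(z)[c]) = logsumexp(z) - z[c], without ever forming the probability. The sketch below is an illustration of that standard technique, not a description of ANDREA's code; `loss_from_logits` & the example logits are assumptions.

```python
import math

def loss_from_logits(logits, target_index):
    """-log(softmax(logits)[target]) via log-sum-exp, never forming p itself."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# Monotonic: raising the correct token's logit lowers the loss.
weaker = loss_from_logits([1.0, 0.0], 0)
stronger = loss_from_logits([2.0, 0.0], 0)

# The correct token's probability would underflow to 0.0 in float64 if
# computed directly, yet the loss computed this way is large but finite.
huge_gap = loss_from_logits([0.0, -800.0], 1)
print(weaker, stronger, huge_gap)
```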


Per-Sequence Loss

ANDREA trains on sequences of length 1024 (the context window). Each sequence produces 1024 next-token predictions. The sequence loss averages across all positions:


sequence_loss = mean(-log(p_correct_t)) for t in 0..1023

Then sequence losses get averaged across the batch (ANDREA-120M uses batch_size = 8). One scalar number per training step. That number is what the loss curve plots.
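The two averaging steps can be sketched as follows, using a toy batch of 2 sequences with 3 positions each in place of the 8 × 1024 shape described above. The function names & probabilities are illustrative assumptions.

```python
import math

def sequence_loss(correct_probs):
    """Mean of -log(p) over every position in one sequence."""
    return sum(-math.log(p) for p in correct_probs) / len(correct_probs)

def batch_loss(batch):
    """Mean of the per-sequence losses: one scalar per training step."""
    return sum(sequence_loss(seq) for seq in batch) / len(batch)

# Each inner list holds the probability assigned to the correct token
# at each position of one sequence (made-up values).
batch = [
    [0.5, 0.25, 0.125],
    [1.0, 0.5, 0.5],
]
step_loss = batch_loss(batch)
print(step_loss)
```

Because every sequence here has the same length, averaging per sequence & then across the batch gives the same number as averaging over all positions at once.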

Computing Loss For One Position

At one training position, ANDREA-120M assigns a softmax probability of 0.4 to the actual next token (the other tokens share the remaining 0.6). Compute the cross-entropy loss for this single position. Show the formula & the arithmetic. Then state in one sentence whether this represents a confident or uncertain prediction.

Perplexity = exp(loss)

A Friendlier Scale

Loss values like 2.0 or 3.43 don't immediately convey what the model can do. Perplexity translates loss onto a more intuitive scale:


perplexity = exp(loss)

Perplexity answers a clean question: among how many equally-likely tokens does the model effectively choose? A perplexity of 7 means the model behaves as if picking from 7 plausible next tokens at each position. A perplexity of 1 means perfect prediction.
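The "equally-likely tokens" reading can be checked directly: a model that spreads probability uniformly over k tokens has loss ln(k), so its perplexity comes back as k. A minimal sketch, with helper names chosen for illustration:

```python
import math

def perplexity(loss):
    """Map a cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

def uniform_loss(k):
    """Loss of a model that assigns probability 1/k to the correct token."""
    return -math.log(1.0 / k)

ppl_7 = perplexity(uniform_loss(7))
ppl_perfect = perplexity(0.0)
print(ppl_7, ppl_perfect)
```

Note that this only works when the loss uses the natural log; a loss measured in bits would need 2 ** loss instead.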


Common Loss-Perplexity Pairs


| Loss | Perplexity | What it feels like |
|------|------------|--------------------|
| 0.0  | 1.0        | Perfect prediction |
| 1.0  | 2.7        | Choosing among ~3 plausible tokens |
| 2.0  | 7.4        | ANDREA-12M final SMMA territory |
| 3.0  | 20.1       | Reasonable text but uncertain |
| 3.43 | 30.9       | ANDREA-120M v1 minimum (before polish) |
| 5.0  | 148        | Early training, learning vocabulary distribution |
| 9.04 | 8449       | Random-chance baseline for ANDREA-120M's vocab |

Perplexity puts loss values into context: a loss of 2.0 means the model effectively picks from ~7 tokens, not from 8449.


The Random-Chance Baseline

A model that knows nothing & guesses uniformly assigns probability 1/V to every token, where V = vocab_size:


p_uniform = 1 / V
loss      = -log(1/V) = log(V)

For ANDREA-120M with V = 8449:


loss_uniform = ln(8449) ≈ 9.04

For ANDREA-12M with V = 2305:


loss_uniform = ln(2305) ≈ 7.74

Any loss above this baseline means the model performs worse than random. Any loss below it means the model has learned something: it concentrates probability mass on a smaller subset of tokens than uniform would.
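A short sketch of the baseline check, using the two vocabulary sizes quoted above. The helper names are assumptions made for illustration.

```python
import math

def uniform_baseline(vocab_size):
    """Loss of a model that guesses uniformly over the vocabulary."""
    return math.log(vocab_size)

def beats_random(loss, vocab_size):
    """True when the loss sits below the uniform-guess baseline."""
    return loss < uniform_baseline(vocab_size)

baseline_120m = uniform_baseline(8449)
baseline_12m = uniform_baseline(2305)
print(round(baseline_120m, 2), round(baseline_12m, 2))
```

A check like `beats_random(loss, 8449)` is a cheap sanity test early in a run: a loss that stays at or above the baseline suggests the model is not learning at all.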

Reading A Loss Value

ANDREA-120M v1 reached its EMA loss minimum of 3.43 at step 110K (before collapsing). Compute: (a) the perplexity at loss 3.43; (b) how many times better than the random-chance baseline (ln(8449) ≈ 9.04) this loss value represents, expressed as a perplexity ratio. Show your arithmetic.

Smoothing Step-Level Noise

Raw Loss Is Noisy

Per-step loss bounces around. ANDREA's bandit picks a different source every 7-42 steps; some sources (dictionary definitions) produce easy losses; others (gutenberg paragraphs) produce harder losses. Plotting raw step loss against step number produces a chaotic scatter.


Smoothed Modified Moving Average (SMMA) damps the noise & reveals the trend. ANDREA's training proxy computes SMMA as:


SMMA[0]  = loss[0]
SMMA[t]  = (SMMA[t-1] * (N-1) + loss[t]) / N

With N = 100 (ANDREA's default smoothing window), each new SMMA value mixes 99% of the previous SMMA with 1% of the new step loss. Sudden spikes get absorbed; sustained shifts show up gradually.
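The recurrence above translates directly into code. This is a minimal sketch on a made-up loss series with a single spike, not the training proxy's actual source; the function name & the data are assumptions.

```python
def smma(losses, n=100):
    """Apply SMMA[t] = (SMMA[t-1] * (N-1) + loss[t]) / N to a loss series."""
    if not losses:
        return []
    smoothed = [losses[0]]  # SMMA[0] = loss[0]
    for loss in losses[1:]:
        smoothed.append((smoothed[-1] * (n - 1) + loss) / n)
    return smoothed

# Flat loss of 4.0 with one spike to 9.0 at step 50 (illustrative data).
raw = [4.0] * 50 + [9.0] + [4.0] * 50
curve = smma(raw, n=100)
print(curve[49], curve[50], curve[51])
```

With N = 100, the spike of 5.0 moves the smoothed value by only 0.05, which then decays back toward 4.0 over the following steps.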


Why Not A Simple Moving Average?

A simple moving average over the last 100 steps requires storing 100 loss values. SMMA stores one value (the running average) & one constant (the window size). Memory-cheap, computationally trivial, & smooth enough to read a curve.


Different window sizes answer different questions:


- N = 10: tracks short-term changes; useful during phase transitions

- N = 100: ANDREA's default; tracks medium-term progress

- N = 1000: long-term trend only; useful at the end of training
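To make the trade-off concrete, the sketch below feeds the same synthetic series (a sustained drop from 4.0 to 3.0) through three window sizes & counts how many steps each takes to cover half of the drop. The series, the midpoint criterion, & the helper names are assumptions chosen for illustration. The generator keeps a single running value as state, which is the memory advantage over a simple moving average.

```python
def smma_stream(losses, n):
    """Yield SMMA values one at a time, keeping a single number of state."""
    value = None
    for loss in losses:
        value = loss if value is None else (value * (n - 1) + loss) / n
        yield value

# Loss drops from 4.0 to 3.0 at step 100 and stays there (synthetic data).
raw = [4.0] * 100 + [3.0] * 200

def steps_to_close_half(n):
    """Steps after the drop until the smoothed value reaches the midpoint 3.5."""
    for step, value in enumerate(smma_stream(raw, n)):
        if step >= 100 and value <= 3.5:
            return step - 99
    return None  # midpoint not reached within this toy series

lag = {n: steps_to_close_half(n) for n in (10, 100, 1000)}
print(lag)
```

On this series, N = 10 covers half the drop in 7 steps, N = 100 needs 69, & N = 1000 has not reached the midpoint after 200 steps. Small windows react quickly but pass more noise through; large windows do the opposite.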


What ANDREA Tracks

Every 100 steps, the training proxy writes loss.json with the current SMMA, raw loss, step number, & per-source breakdowns. The dashboard at training.ai.unturf.com/dashboard polls this file every 10 seconds. External viewers see live progress; the dashboard is read-only.
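A rough sketch of what such a periodic snapshot writer could look like. The key names, function names, & the write-then-rename approach are all assumptions made for illustration; the actual schema of loss.json is not given here. Renaming a finished temporary file into place is one common way to keep a polling reader from seeing a half-written file.

```python
import json
import os
import tempfile

def write_loss_snapshot(path, step, raw_loss, smma, per_source):
    """Write one snapshot so a polling reader never sees a partial file."""
    snapshot = {
        # Hypothetical key names; the real loss.json schema may differ.
        "step": step,
        "raw_loss": raw_loss,
        "smma": smma,
        "per_source": per_source,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(snapshot, handle)
    os.replace(tmp_path, path)  # swap the finished file into place

def maybe_write(path, step, raw_loss, smma, per_source, every=100):
    """Write a snapshot only on steps that are a multiple of `every`."""
    if step % every == 0:
        write_loss_snapshot(path, step, raw_loss, smma, per_source)
        return True
    return False
```

Calling `maybe_write("loss.json", step, raw_loss, smma, per_source)` once per training step would then produce a fresh file every 100 steps.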

ANDREA-12M's Actual Curve

The Recipe That Reached SMMA 2.0


| Steps     | Avg Loss | Notes |
|-----------|----------|-------|
| 0-2.5K    | 4.50     | Random init, early learning |
| 2.5K-5K   | 3.88     | Fast decline through structure phase |
| 5K-10K    | 3.30     | Approaching coherence boundary |
| 10K-20K   | 2.80     | Bandit finding optimal mix |
| 20K-25K   | 2.40     | Plateau: data starvation |
| 25K-30K   | 2.50     | Hermes data introduced + LR restart |
| 30K-35K   | 2.35     | Hermes integrated, new lows |
| 35K-40K   | 2.10     | 4-arm focus, steep descent |
| 40K-43.6K | 2.00     | Knowledge territory, SMMA below 2.0 |

Three phases stand out:


1. Steep early descent (0-10K). Loss falls from 4.50 to 3.30 as the model learns vocabulary distribution & basic turn structure. Random-chance baseline ln(2305) ≈ 7.74 sits high above this curve; the model concentrates probability mass quickly once embeddings stabilize.


2. Plateau (20K-25K). Loss stalls at 2.40. The bandit had run out of headroom on its current source mix. Adding Hermes data at step 25K, plus an LR warm restart, broke the plateau.


3. Final descent (35K-43.6K). Curriculum narrowed from 16 sources to 4 (hermes3-general + dictionary + gutenberg + chat). Steeper loss decline than the full-arm bandit. Final SMMA: 2.0.


ANDREA-120M v1: A Cautionary Curve

Same vocabulary calculation: ln(8449) ≈ 9.04. ANDREA-120M v1 reached SMMA 3.43 at step 110K (its minimum), then diverged:


| Steps     | EMA Loss | Trend |
|-----------|----------|-------|
| 26K-40K   | 4.29     | Converging |
| 70K-85K   | 3.60     | Best region |
| 85K-110K  | 3.43     | Minimum |
| 110K-125K | 3.54     | Diverging |
| 140K-155K | 4.05     | Diverging |
| 155K-165K | 4.54     | Collapsed |

The loss values look numerically reasonable throughout (3.43 sits well below the 9.04 random baseline). But the samples showed repetition collapse: Budy Budy Budy Budy. Loss told a misleading story; sample audits did not.


v2's coherence-gated early stopping (activity 78) added a parallel signal: bigram diversity, trigram diversity, English word presence, character diversity. When all four scores stay below 30 for 5 consecutive samples, training auto-halts. This signal would have caught v1 at step 132K, saving 3.8 days of compute.
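The halting rule described above (all four scores below 30 for 5 consecutive samples) can be sketched as a small stateful gate. How v2 actually computes each score is not specified here, so `ngram_diversity` below is one plausible definition (percent of distinct n-grams) offered as an assumption, & the class & method names are hypothetical.

```python
from collections import deque

def ngram_diversity(tokens, n):
    """Percent of n-grams that are distinct, 0-100 (an assumed definition)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 100.0 * len(set(grams)) / len(grams)

class CoherenceGate:
    """Signal a halt when every score is below threshold for `patience` samples in a row."""

    def __init__(self, threshold=30.0, patience=5):
        self.threshold = threshold
        self.recent = deque(maxlen=patience)  # one bool per recent sample

    def update(self, scores):
        """Record one sample's scores; return True when training should halt."""
        self.recent.append(all(s < self.threshold for s in scores))
        return len(self.recent) == self.recent.maxlen and all(self.recent)

collapsed = "Budy Budy Budy Budy Budy Budy Budy Budy".split()
bigram_score = ngram_diversity(collapsed, 2)
print(round(bigram_score, 1))
```

A single sample with any score at or above the threshold breaks the streak, so one bad sample alone never halts training.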

What Each Phase Tells You

Looking at ANDREA-12M's loss curve (4.50 → 3.30 → 2.40 plateau → 2.00 final), state which phase shows the model learning vocabulary distribution, which phase indicates the bandit has run out of mixed-source headroom, & which phase reflects narrowed-curriculum focus. One sentence per phase, referencing specific step ranges.

The 120M v1 Lesson

Two Independent Signals

ANDREA-120M v1 reached SMMA 3.43 at step 110K. Numerically: 273× better than random chance (perplexity 31 vs vocab size 8449). Looks great on the curve.


Sample at step 110K::


''''' ''''' '' ''' '' ''' '''?' ''' ' '' '' '

Loss & coherence are independent signals. Low loss means the model concentrates probability mass effectively. Coherence means the model produces text humans can read. The first does not imply the second.


Why The Gap Exists

A model can lower loss by:


1. Learning real language patterns (good): subject-verb agreement, common phrases, factual associations.

2. Memorizing structural artifacts (bad): high-frequency repetition patterns, dataset-specific quirks, degenerate attractors.


Both reduce loss. The first produces samples humans like. The second produces Budy Budy Budy Budy. The training objective alone cannot distinguish them.


ANDREA-120M v3 Polish: Where Loss & Coherence Met

After v3 polish (step 112K onward), ANDREA-120M produces:


Step 112,584, loss 0.30, ppl 1::


> [extinction prompt]
< black spider montano is alive, carolina parakeet is extinct.

Both facts correct. Carolina parakeet declared extinct 1939; black spider monkey extant. Low loss (0.30) AND coherent factual recall.


Step 112,500, loss 1.94, ppl 7::


> How do I find outdated packages in a Python project?
< Use pip list --outdated names to see the pip packages list...

Higher loss (1.94 → ppl 7) but the right tool emerges from training data. Phrasing fluency still developing at the 56% training mark.


The Two-Signal Discipline

Modern training pipelines monitor BOTH:


- Loss curve. Tells you, quantitatively, whether the model is learning anything.

- Sample audit. Tells you if what the model learned is useful.


v2 added coherence-gated early stopping (activity 78). v3 polish was a curriculum perturbation triggered by sample audits, not by loss values. Loss is necessary but never sufficient on its own.

Diagnosing A Hypothetical Run

A new training run shows SMMA loss declining from 8.0 → 3.5 → 2.8 over 100K steps. Sample audits at step 100K show: bigram diversity 12 (low), trigram diversity 8 (low), English word presence 18 (high), character diversity 7 (high). What is the model likely doing? Should training continue, halt, or pivot? Justify your answer in 3-4 sentences.