Semantic Distance as Euclidean Distance
A High-Dimensional Vector Space
Every token in ANDREA-120M's 8449-token vocabulary maps to one point in R^768. The token embedding matrix has shape 8449 x 768: 8449 rows, one per vocabulary token; 768 columns, one per embedding dimension.
Three Properties Make This a Vector Space
1. Addition. v_a + v_b lands in R^768. Sum of two embeddings is a valid vector.
2. Scalar multiplication. alpha * v lands in R^768 for any real alpha. Stretch or shrink along the same direction.
3. Linear combinations. alpha v_a + beta v_b lands in R^768 for any real alpha & beta. This follows from properties 1 & 2: linear combinations stay inside the space.
These properties give us geometric tools: distance, angle, projection, basis, orthogonality.
Distance as Semantic Similarity
Cosine similarity of two embeddings measures the angle between them: cos(theta) = (v_a . v_b) / (||v_a|| * ||v_b||). Range: -1 (opposite) to +1 (parallel).
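The section title speaks of Euclidean distance while the formula above uses an angle. A minimal sketch of how the two connect: for unit-length vectors, squared Euclidean distance equals 2 - 2 cos(theta), so ranking by one is ranking by the other. The vectors a & b below are hypothetical toy values, not ANDREA embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return norm([x - y for x, y in zip(a, b)])

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

# Hypothetical toy vectors, chosen only for easy arithmetic.
a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]

cos_ab = cosine_similarity(a, b)
dist_sq_unit = euclidean_distance(normalize(a), normalize(b)) ** 2

# The two printed values agree up to floating-point rounding.
print(dist_sq_unit, 2 - 2 * cos_ab)
```

Without normalization the identity does not hold, because vector length then contributes to distance as well.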
Empirical pattern after training: tokens with similar contexts produce embeddings with high cosine similarity. ANDREA-120M places parakeet & monkey close together (both biological, both animal names). It places Fourier & transform close together (signal processing context). It places parakeet & Fourier far apart: the two tokens share almost no context, so their embeddings sit close to orthogonal (cosine near 0).
Why R^768 Not R^384
ANDREA-12M used d_model = 384. ANDREA-120M doubled it to 768. The doubling matters: a 384-dim space has fewer near-orthogonal directions ('angles') available, & cross-domain disambiguation suffers. The extra dimensions give the model room to resolve bank (river) versus bank (financial) into different basins of the embedding space without one collapsing into the other.
Embedding Updates as Vector Translation
Each gradient step adds delta_v to v_token. Geometrically: small translations in R^768 nudge each token's position toward neighborhoods that reduce loss. Over 200K steps, every token migrates from its random initialization to a learned location.
Computing a Distance
Three trained embeddings (simplified to R^3 for arithmetic):
- v(parakeet) = (1.0, 0.5, 0.0)
- v(monkey) = (1.2, 0.3, 0.1)
- v(Fourier) = (0.0, 0.0, 1.5)
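The arithmetic for these three vectors can be sketched as follows, computing both Euclidean distance & cosine similarity for every pair. The helper names euclidean & cosine are illustrative; only the three vectors come from the list above.

```python
import math

embeddings = {
    "parakeet": (1.0, 0.5, 0.0),
    "monkey": (1.2, 0.3, 0.1),
    "Fourier": (0.0, 0.0, 1.5),
}

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine of the angle between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

pairs = [("parakeet", "monkey"), ("parakeet", "Fourier"), ("monkey", "Fourier")]
for left, right in pairs:
    a, b = embeddings[left], embeddings[right]
    print(f"{left}-{right}: distance {euclidean(a, b):.3f}, cosine {cosine(a, b):.3f}")
```

Result: parakeet-monkey has distance 0.300 & cosine 0.973. parakeet-Fourier has distance 1.871 & cosine 0.000. monkey-Fourier has distance 1.868 & cosine 0.081. The two animals sit about six times closer to each other than either sits to Fourier.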
Projection onto a Query Subspace
What Attention Computes
For a token at position t, attention computes:
softmax(Q K^T / sqrt(d_k)) V
Where Q is the query (this token's question), K is keys (every past token's identifier), V is values (every past token's content). The output mixes V weighted by how much the query relates to each key.
Geometric Reading
Think of K as a list of vectors in R^d_k. Each row is one past token's key. Q is one vector in R^d_k: this token's question.
Q K^T takes the dot product of q with every key. The dot product q . k_i equals ||q|| times the length of k_i's projection onto q's direction, so it measures how much k_i lies along q. Long projection = key strongly relevant to query. Short (or negative) projection = key barely relevant.
softmax normalizes the projections into weights summing to 1. The weighted sum of V is a single vector: a mixture of past content, weighted by relevance to the current query.
Multi-Head Attention as Multi-Subspace Projection
ANDREA-120M uses 12 attention heads. d_model = 768; d_k = 768 / 12 = 64. Each head projects into a different 64-dim subspace of R^768. Twelve heads give twelve independent views of the same sequence: one head might track grammatical role, another semantic similarity, another long-range references.
Geometrically: each head defines a 64-dim oriented subspace (a 'window') through which it views the past.
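A minimal sketch of the dimension bookkeeping, assuming the common implementation in which one 768-dim projected vector is cut into 12 consecutive 64-dim slices, one per head. The text does not say how ANDREA-120M lays out its heads, so the slicing scheme & the name split_heads are assumptions.

```python
D_MODEL = 768
N_HEADS = 12
D_K = D_MODEL // N_HEADS   # 64 dimensions per head

def split_heads(x):
    """Cut one 768-dim vector into 12 consecutive 64-dim slices."""
    if len(x) != D_MODEL:
        raise ValueError(f"expected {D_MODEL} components, got {len(x)}")
    return [x[h * D_K:(h + 1) * D_K] for h in range(N_HEADS)]

# Placeholder for a projected query vector (values are just indices).
x = [float(i) for i in range(D_MODEL)]
heads = split_heads(x)

print(len(heads), len(heads[0]))   # 12 64
```

Each slice then plays the role of q (or k, or v) inside its own head, & the scaled dot products are computed in 64 dimensions rather than 768.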
The Causal Mask
Decoder-only models add a causal mask: every Q K^T entry above the diagonal gets set to -infinity before softmax. Geometrically: the projection onto any future token gets zero weight. Token t can only see tokens 0 through t.
Why this matters: training & inference become symmetric. Same forward pass, same masked projections, no special generation logic.
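The masking step can be sketched as below: entries above the diagonal become -infinity, & softmax then assigns them exactly zero weight. The 3 x 3 score matrix is hypothetical.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability; exp(-inf) is 0.0.
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask entries with column index j > row index t, then softmax each row."""
    n = len(scores)
    masked = [
        [scores[t][j] if j <= t else -math.inf for j in range(n)]
        for t in range(n)
    ]
    return [softmax(row) for row in masked]

# Hypothetical scaled scores: row t is the query at position t.
scores = [
    [0.2, 1.5, -0.3],
    [0.1, 0.4, 2.0],
    [0.0, 0.0, 0.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Row 0 puts all its weight on token 0, however large the masked scores were. Row 2 sees all three tokens & splits its weight evenly because its scores are equal.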
sqrt(d_k) Scaling
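Without scaling, dot products grow with d_k: for components with zero mean & unit variance, q . k has variance d_k, so its typical magnitude grows like sqrt(d_k). Large dot products push softmax into one-hot regions (one weight near 1, rest near 0), where gradients vanish. Dividing by sqrt(d_k) keeps the scores at unit-variance scale, so the softmax stays out of saturation over a wide range of d_k values.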
Geometrically: sqrt(d_k) normalizes the lengths of projections so the softmax sees comparable magnitudes regardless of subspace dimension.
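A small simulation illustrates the scaling argument. It assumes q & k have independent components with zero mean & unit variance (an idealization, not a measurement of ANDREA-120M); the function name dot_product_std is hypothetical.

```python
import math
import random

def dot_product_std(d_k, n_samples=1000, scaled=False, seed=0):
    """Empirical standard deviation of q . k for random Gaussian q & k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        dots.append(s / math.sqrt(d_k) if scaled else s)
    mean = sum(dots) / n_samples
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / n_samples)

for d_k in (16, 64, 256):
    raw = dot_product_std(d_k)
    fixed = dot_product_std(d_k, scaled=True)
    print(f"d_k={d_k}: raw std {raw:.2f}, scaled std {fixed:.2f}")
```

Under these assumptions the raw standard deviation comes out near sqrt(d_k) (about 4, 8, 16), while the scaled version stays near 1 for every d_k.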
Reading a Projection
Three keys & one query in R^4 (simplified for arithmetic):
- q = (1, 0, 1, 0)
- k_1 = (1, 0, 0, 0) [past token 1]
- k_2 = (0, 0, 1, 0) [past token 2]
- k_3 = (0, 1, 0, 1) [past token 3]
d_k = 4, so sqrt(d_k) = 2.
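Carrying the arithmetic through for this query & these three keys gives the attention weights. No value vectors are given above, so this sketch stops at the weights.

```python
import math

q = (1, 0, 1, 0)
keys = [
    (1, 0, 0, 0),   # past token 1
    (0, 0, 1, 0),   # past token 2
    (0, 1, 0, 1),   # past token 3
]
d_k = len(q)

# Dot product of the query with each key.
scores = [sum(a * b for a, b in zip(q, k)) for k in keys]   # [1, 1, 0]

# Divide by sqrt(d_k) = 2.
scaled = [s / math.sqrt(d_k) for s in scores]               # [0.5, 0.5, 0.0]

# Softmax over the scaled scores.
exps = [math.exp(s) for s in scaled]
total = sum(exps)
weights = [e / total for e in exps]

print([round(w, 3) for w in weights])   # [0.384, 0.384, 0.233]
```

k_1 & k_2 each share one active coordinate with q & get equal weight. k_3 is orthogonal to q, yet still gets weight 0.233: softmax never assigns exactly zero to a finite score.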
Gradient Descent as Path on Terrain
A Surface in 120M+1 Dimensions
Every weight configuration of ANDREA-120M is one point in R^120,000,000. Loss L(w) maps each point to a real number: training loss at this configuration. Together, the points (w, L(w)) trace a 120M-dimensional surface sitting inside a (120M+1)-dimensional space: 120M axes for the weights, one more for the loss.
Geometrically impossible to visualize directly. Conceptually: a terrain. Mountains (high loss), valleys (low loss), saddle points, plateaus, ridges, basins.
Gradient as Local Slope
grad L(w) is a vector in R^120M pointing in the direction of steepest INCREASE of L. Negating it: -grad L(w) points in the direction of steepest descent.
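One AdamW step nudges w roughly in the negative gradient direction: the step follows the running gradient average m, rescaled per coordinate by the running squared-gradient average v, plus a small weight-decay pull toward zero. Geometrically: a tiny step in parameter space, downhill on the loss surface, with step size controlled by lr.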
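A minimal sketch of one AdamW update on a two-weight toy problem, following the standard decoupled-weight-decay formulation. Only lr = 0.0003 comes from this lesson; beta1, beta2, eps & weight_decay are assumed defaults, & the weights & gradients are hypothetical.

```python
import math

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. t is the 1-based step count."""
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, g, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g_i          # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g_i * g_i    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)                 # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        step = m_hat / (math.sqrt(v_hat) + eps)        # adaptive, per-coordinate
        w_i = w_i - lr * (step + weight_decay * w_i)   # decoupled weight decay
        new_w.append(w_i)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Hypothetical weights & gradients: one steep coordinate, one nearly flat.
w = [0.5, -0.2]
g = [2.0, -0.01]
w_new, m_new, v_new = adamw_step(w, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print([round(a - b, 7) for a, b in zip(w_new, w)])
```

In this sketch the first step moves both coordinates by about lr, even though their gradients differ by a factor of 200: at t = 1 the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient.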
v1's Bad Basin
v1 took its first step at LR = peak (0.0003) on freshly initialized weights. Geometric picture: w_0 sits in a wildly curved region (random init has high curvature in many directions), & a peak-LR step lands in the wrong basin. Subsequent steps cannot escape. The model gets stuck producing 'region region region' because that basin has the lowest loss the model can find from where it landed.
v2's Warmup Path
v2 takes 2000 small steps with LR ramping from 0 to peak. Geometric picture: w_0 first migrates gently along smooth directions (where curvature is low). By step 2000, w has moved into a more navigable region; peak LR can then drive it toward a better basin without overshooting.
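A sketch of the ramp, assuming it is linear & that the rate simply holds at peak afterward. The text gives only the endpoints (0 to peak over 2000 steps); the linear shape, the absence of later decay, & the name learning_rate are assumptions.

```python
PEAK_LR = 0.0003
WARMUP_STEPS = 2000

def learning_rate(step):
    """Linear ramp from 0 at step 0 to PEAK_LR at WARMUP_STEPS, then hold."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR   # any later decay schedule is not modeled here

for step in (0, 1, 500, 1000, 2000, 5000):
    print(step, learning_rate(step))
```

Under this schedule step 1 runs at 1/2000 of the peak rate, which is the 'small steps from w_0' picture described above. v1's first step ran at the full 0.0003.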
Warmup is a geometry-aware initialization protocol: let the model find a safe local neighborhood before pushing it hard.
Wide vs Narrow Basins
At step 112K, ANDREA-120M sits in a basin. Question: how wide is it?
Wide basin = many neighboring weight configurations also achieve low training loss. Generalization tends to be good (basin width is often associated with better test performance; see PAC-Bayes lesson, Chapter 3).
Narrow basin = only a thin set of weights achieves low loss. Generalization tends to suffer.
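A one-dimensional toy makes the contrast concrete: two quadratic basins with the same minimum but different curvature, probed by perturbing the weight around the minimum. The curvature values & the probe radius are arbitrary illustrations, not measurements of ANDREA-120M.

```python
def loss(w, curvature):
    # Quadratic basin with its minimum (loss 0) at w = 0.
    return 0.5 * curvature * w * w

def mean_loss_increase(curvature, radius=0.5, n=101):
    """Average loss over n evenly spaced perturbations in [-radius, radius]."""
    offsets = [-radius + 2 * radius * i / (n - 1) for i in range(n)]
    return sum(loss(d, curvature) for d in offsets) / n

wide = mean_loss_increase(curvature=0.1)      # flat basin
narrow = mean_loss_increase(curvature=10.0)   # sharp basin
print(wide, narrow, narrow / wide)
```

Both basins reach the same minimum, but the same perturbation costs 100 times more loss in the narrow one. This is one simple way to quantify 'width'.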
v3 polish at step 112,619 nudged the model along the surface (without resetting) to a wider basin via curriculum perturbation: change the loss surface (different bandit, different training mix), then let the optimizer find a nearby flat region under the new policy.
The Zombie Cliff
The anomalous loss 0.13 at step 112,080 was a CLIFF: a sharp, narrow region where one specific input pattern (a memorized repo-docs substring) hits near-zero loss. The model fell off the broader basin into a narrow gully. Polish-pivot's hard-exclusion of repo-docs filled in that gully so the optimizer could no longer find it.
Reading the Terrain
Curriculum Mix as a Walk on a Discrete Simplex
What a Simplex Is
The standard (n-1)-simplex is the set of n-tuples (w_1, w_2, ..., w_n) with each w_i >= 0 & sum(w_i) = 1. It has n vertices but only n-1 dimensions, because the sum-to-1 constraint removes one degree of freedom.
For n = 2: a line segment from (1, 0) to (0, 1). For n = 3: a triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1). For n = 16 (ANDREA's full source list): a 15-dimensional simplex sitting inside R^16.
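The definition translates directly into a membership test. The function names & the tolerance are illustrative choices.

```python
def on_simplex(w, tol=1e-9):
    """True if every component is non-negative & the components sum to 1."""
    return all(w_i >= -tol for w_i in w) and abs(sum(w) - 1.0) <= tol

def simplex_dimension(n_sources):
    # n coordinates minus one degree of freedom lost to the sum-to-1 constraint.
    return n_sources - 1

print(on_simplex([1.0, 0.0]))            # True: a vertex for n = 2
print(on_simplex([0.2, 0.3, 0.5]))       # True: interior of the triangle
print(on_simplex([0.6, 0.6, -0.2]))      # False: sums to 1 but has a negative weight
print(on_simplex([0.5, 0.4]))            # False: sums to 0.9
print(simplex_dimension(16))             # 15
```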
Bandit Weights as Simplex Coordinates
ANDREA's bandit produces a weight vector w over data sources at each phase. Each component w_i is the probability of sampling source i. Probabilities are non-negative & sum to 1: every weight vector lives on the simplex.
Vertices = pure strategies (sample only one source). Interior = mixed strategies (sample multiple sources, each with positive probability). Edges = mixtures of two sources only.
Source Floors as Restricted Region
ANDREA imposes minimum weights: hermes3-general at floor 0.7 (post-polish). This carves out a sub-region of the simplex: only weight vectors with w_hermes3-general >= 0.7 are reachable.
Geometrically: the floor cuts the simplex with a hyperplane. The reachable region is the part of the simplex on the correct side of every floor hyperplane.
Caps as the Other Restriction
ANDREA imposes maximum weights too: dictionary at cap 0.25 (post-polish). Each cap is another hyperplane, & the reachable region must sit on the correct side of every cap hyperplane too.
Excluding a source entirely (cap = 0.0) is the strongest possible cap: the coordinate gets pinned to zero, reducing the effective simplex by one dimension.
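Floors & caps can contradict each other, leaving no reachable region at all. A minimal sketch of the check: a point satisfying every constraint exists exactly when each floor is at most its cap, the floors sum to at most 1, & the caps sum to at least 1. The three-source list is a cut-down illustration (the real run has 16 sources), & the function names are hypothetical.

```python
def constraints_feasible(sources, floors, caps):
    """True if some weight vector satisfies every floor, every cap, & sums to 1."""
    lo = [floors.get(s, 0.0) for s in sources]   # default floor: 0
    hi = [caps.get(s, 1.0) for s in sources]     # default cap: 1
    if any(l > h for l, h in zip(lo, hi)):
        return False
    return sum(lo) <= 1.0 <= sum(hi)

def effective_dimension(sources, caps):
    # Counts only coordinates pinned to zero by a cap of 0.0.
    active = [s for s in sources if caps.get(s, 1.0) > 0.0]
    return len(active) - 1

sources = ["hermes3-general", "gutenberg", "dictionary"]

print(constraints_feasible(sources, {"hermes3-general": 0.7}, {"dictionary": 0.25}))   # True
print(constraints_feasible(sources, {"hermes3-general": 0.7, "gutenberg": 0.7}, {}))   # False
print(effective_dimension(sources, {"dictionary": 0.0}))                               # 1
```

The second call fails because two floors of 0.7 would need total weight 1.4. The third shows a cap of 0.0 turning the triangle into a line segment.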
Phase Transitions as Simplex Walks
Every phase transition (every 7-42 steps) produces a new weight vector. Each new vector is a point on the simplex. Over 200K steps, the bandit traces a long path through the simplex's reachable region.
Random phases = teleport to a uniform-random point inside the reachable region.
Bandit-controlled phases = step toward the UCB-best vertex consistent with the floors & caps.
Polish pivot = redraw the reachable region (new floors, new caps, some sources excluded), & the walk continues from the new starting point.
Why Vertices Are Dangerous
Pure-source phases (one w_i = 1, rest = 0) sit at simplex vertices. Diversity is zero. The model trains on one distribution only. v1's collapse partly traced to the bandit camping near the repo-docs vertex; samples reproduced that source's distribution exclusively.
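Floors prevent vertex-camping on low-priority sources: a floor of 0.7 on hermes3-general says 'never let this source's weight drop below 0.7' (or whatever floor applies to each priority source), which leaves at most 0.3 for all other sources combined & puts their vertices out of reach.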
Walking the Reachable Region
Three sources: hermes3-general (H), gutenberg (G), dictionary (D). Constraints: H floor = 0.5, D cap = 0.25. (Implicit: all weights >= 0, sum to 1, no other constraints.)
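Working this out: the constraints cut the triangle down to a four-cornered region. A sketch that lists the corners & tests a few candidate mixes. The corner list was derived by hand from the constraints above, & the function name reachable is illustrative.

```python
FLOOR_H = 0.5
CAP_D = 0.25

def reachable(h, g, d, tol=1e-9):
    """True if (H, G, D) lies on the simplex & respects the H floor & D cap."""
    on_simplex = min(h, g, d) >= -tol and abs(h + g + d - 1.0) <= tol
    return on_simplex and h >= FLOOR_H - tol and d <= CAP_D + tol

# Corners of the reachable region, as (H, G, D).
corners = [
    (1.0, 0.0, 0.0),      # pure H: the only simplex vertex still reachable
    (0.5, 0.5, 0.0),      # H at its floor, rest on G
    (0.5, 0.25, 0.25),    # H at its floor, D at its cap
    (0.75, 0.0, 0.25),    # D at its cap, no G
]

print(all(reachable(*c) for c in corners))   # True
print(reachable(0.6, 0.3, 0.1))              # True: interior point
print(reachable(0.4, 0.4, 0.2))              # False: H below its floor
print(reachable(0.6, 0.1, 0.3))              # False: D above its cap
print(reachable(0.0, 1.0, 0.0))              # False: pure-G vertex cut off
```

The region covers 3/16 of the full triangle's area. Of the three original vertices only pure H survives, so the floor on H rules out camping on G or D but not on H itself.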
Restricting Dimensions for the First 20K Steps
What v2's Curriculum Warmup Did
v2 set curriculum_warmup_sources to seven sources: hermes3-general, hermes3-creative, hermes3-roleplay, chat, smoltalk, oasst, gutenberg. For the first 20K steps, ONLY those seven sources contributed. After step 20K, the full 16-source firehose activated.
Geometric Reading
The full 16-source simplex sits in R^16. Restricting to 7 sources collapses 9 of the 16 coordinates to zero. The bandit's walk takes place in a 6-dimensional sub-simplex (one less than the source count, by the sum-to-1 constraint).
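A sketch of the restriction as an operation on weight vectors: zero the nine inactive coordinates, then renormalize. Renormalizing is an assumption about how the restriction is applied. Only three non-warmup source names appear in this lesson, so the remaining six are placeholders.

```python
WARMUP_SOURCES = [
    "hermes3-general", "hermes3-creative", "hermes3-roleplay",
    "chat", "smoltalk", "oasst", "gutenberg",
]
# Named in the text: repo-docs, repo-docstrings, dictionary.
# The other six names are hypothetical placeholders.
OTHER_SOURCES = ["repo-docs", "repo-docstrings", "dictionary"] + [
    f"other-{i}" for i in range(1, 7)
]
ALL_SOURCES = WARMUP_SOURCES + OTHER_SOURCES   # 16 in total

def restrict_to_warmup(weights):
    """Zero every non-warmup source, then rescale the rest to sum to 1."""
    kept_total = sum(w for s, w in weights.items() if s in WARMUP_SOURCES)
    return {
        s: (w / kept_total if s in WARMUP_SOURCES else 0.0)
        for s, w in weights.items()
    }

uniform = {s: 1.0 / len(ALL_SOURCES) for s in ALL_SOURCES}
restricted = restrict_to_warmup(uniform)
active = [s for s, w in restricted.items() if w > 0.0]

print(len(ALL_SOURCES), len(active), len(active) - 1)   # 16 7 6
```

Starting from the uniform mix over 16 sources, the restricted mix puts 1/7 on each warmup source & 0 on everything else.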
Geometrically: a FACE of the full simplex (the 7 active vertices span it). Lower-dimensional, with fewer directions to explore, so easier to navigate.
Why This Helps Early Training
Early in training, the model has not learned coherent language at all. Diverse sources confuse it: each source has its own style, its own vocabulary distribution, its own pattern. Mixing 16 sources at random init creates a too-broad target distribution that the model cannot fit.
Restricting to 7 conversational/prose sources gives a more uniform target. The model learns a stable representation first, then expands.
Geometric Path Through Training
1. Steps 0 to 20K (warmup). Walk lives on the 6-D sub-simplex. Stable language patterns emerge in the model.
2. Steps 20K to 112K (full firehose). Walk expands to the 15-D full simplex. Domain breadth emerges.
3. Step 112K onward (polish). Walk restricted again: repo-docs & repo-docstrings excluded, conversational floors raised. Smaller polytope inside the full simplex; conversational quality consolidates.
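The three stages can be summarized as a lookup from step number to active source count & walk dimension. This sketch assumes the firehose begins exactly at step 20,000, that polish begins at step 112,619 (the polish step quoted earlier), & that repo-docs & repo-docstrings are the only sources excluded during polish. Each excluded source pins one coordinate to zero, so the polish walk has at most 14 active coordinates.

```python
WARMUP_END = 20_000       # assumed first step of the full firehose
POLISH_START = 112_619    # polish step quoted in the text
TOTAL_SOURCES = 16
WARMUP_SOURCE_COUNT = 7
POLISH_EXCLUDED = 2       # repo-docs & repo-docstrings; assumes no other exclusions

def regime(step):
    """Return (name, active_sources, walk_dimension) for a training step."""
    if step < WARMUP_END:
        name, active = "warmup", WARMUP_SOURCE_COUNT
    elif step < POLISH_START:
        name, active = "firehose", TOTAL_SOURCES
    else:
        name, active = "polish", TOTAL_SOURCES - POLISH_EXCLUDED
    return name, active, active - 1

for step in (0, 19_999, 20_000, 112_618, 112_619, 199_999):
    print(step, regime(step))
```

Floors & caps shrink the region further inside each stage without changing the dimension counted here.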
Why Polish Sets curriculum_warmup_steps = 0
Polish enters at step 112K. The model already speaks coherent language. Restricting to a sub-simplex now would lose breadth without gaining anything (the warmup benefit is for fresh-init models). Setting warmup_steps = 0 says: stay on the full simplex, but with new caps & floors.
Three Geometries, One Training Run
v2 warmup: low-dimensional sub-simplex.
v2 firehose: full 15-D simplex.
v3 polish: smaller polytope inside the full simplex (more constraints).
Same 200K-step run, three different geometric regimes. Each was tuned for a different phase of model maturity.