
Semantic Distance as Euclidean Distance

A High-Dimensional Vector Space

Every token in ANDREA-120M's 8449-token vocabulary maps to one point in R^768. The token embedding matrix has shape 8449 x 768: 8449 rows, one per vocabulary token; 768 columns, one per embedding dimension.


Geometry of ANDREA Panels


Three Properties Make This a Vector Space

1. Addition. v_a + v_b lands in R^768. Sum of two embeddings is a valid vector.

2. Scalar multiplication. alpha * v lands in R^768 for any real alpha. Stretch or shrink along the same direction.

3. Linear combinations. alpha v_a + beta v_b lands in R^768 for any real alpha & beta. This follows from properties 1 & 2: linear combinations stay inside the space.
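The three closure properties above can be checked mechanically. A minimal sketch in plain Python: the two vectors are hypothetical stand-ins, not real ANDREA-120M embeddings, & only the dimension 768 comes from the text.

```python
from typing import List

Vector = List[float]

D_MODEL = 768  # embedding dimension stated in the text


def add(a: Vector, b: Vector) -> Vector:
    """Component-wise sum; the result has the same dimension as the inputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [x + y for x, y in zip(a, b)]


def scale(alpha: float, v: Vector) -> Vector:
    """Multiply every component by the real number alpha."""
    return [alpha * x for x in v]


def lincomb(alpha: float, a: Vector, beta: float, b: Vector) -> Vector:
    """alpha * a + beta * b, built from the two operations above."""
    return add(scale(alpha, a), scale(beta, b))


# Hypothetical vectors standing in for two token embeddings.
v_a = [0.01 * i for i in range(D_MODEL)]
v_b = [1.0] * D_MODEL

combo = lincomb(2.0, v_a, -0.5, v_b)
print(len(combo))  # still a vector in R^768
```

Whatever alpha & beta are chosen, the result has 768 components: the operations never leave the space.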


These properties give us geometric tools: distance, angle, projection, basis, orthogonality.


Distance as Semantic Similarity

Cosine similarity of two embeddings measures the angle between them: cos(theta) = (v_a . v_b) / (||v_a|| * ||v_b||). Range: -1 (opposite directions) through 0 (orthogonal) to +1 (same direction).
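A minimal sketch of the cosine formula, using only the standard library. The three vectors are hypothetical 2-D examples chosen to hit the ends & the middle of the range; they are not ANDREA-120M embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """(a . b) / (||a|| * ||b||); undefined when either vector is zero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors; lengths differ on purpose, only direction matters.
east = (1.0, 0.0)
north = (0.0, 2.0)
west = (-3.0, 0.0)

print(cosine_similarity(east, east))   # same direction
print(cosine_similarity(east, north))  # orthogonal
print(cosine_similarity(east, west))   # opposite directions
```

Note that vector length cancels out: cosine similarity compares directions only, which is why north (length 2) & west (length 3) behave like unit vectors here.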


Empirical pattern after training: tokens that appear in similar contexts produce embeddings with high cosine similarity. ANDREA-120M places parakeet & monkey close together (both biological, both animal species). It places Fourier & transform close together (signal processing context). It places parakeet & Fourier far apart (cross-domain orthogonality: cosine similarity near 0).


Why R^768 Not R^384

ANDREA-12M used d_model = 384. ANDREA-120M doubled to 768. The doubling matters: a 384-dim space has fewer 'angles' available, & cross-domain disambiguation suffers. Doubling capacity gives the model room to resolve bank (river) versus bank (financial) into different basins of the embedding space without one collapsing into the other.


Embedding Updates as Vector Translation

Each gradient step adds delta_v to v_token. Geometrically: small translations in R^768 nudge each token's position toward neighborhoods that reduce loss. Over 200K steps, every token migrates from its random initialization to a learned location.

Computing a Distance

Three trained embeddings (simplified to R^3 for arithmetic):


- v(parakeet) = (1.0, 0.5, 0.0)

- v(monkey) = (1.2, 0.3, 0.1)

- v(Fourier) = (0.0, 0.0, 1.5)

(a) Compute the Euclidean distance ||v(parakeet) - v(monkey)||. (b) Compute ||v(parakeet) - v(Fourier)||. (c) State which two tokens cluster together & give a geometric reason citing the actual numbers.

Projection onto a Query Subspace

What Attention Computes

For a token at position t, attention computes:


softmax(Q K^T / sqrt(d_k)) V


Where Q is the query (this token's question), K is keys (every past token's identifier), V is values (every past token's content). The output mixes V weighted by how much the query relates to each key.


Geometric Reading

Think of K as a list of vectors in R^d_k. Each row is one past token's key. Q is one vector in R^d_k: this token's question.


Q K^T takes the dot product of Q with every key. The dot product q . k_i = ||q|| * ||k_i|| * cos(theta_i) is the length of k_i's projection onto q's direction, multiplied by ||q||. Long projection = key strongly relevant to query. Short (or negative) projection = key barely relevant.


softmax normalizes the projections into weights summing to 1. The weighted sum of V is a single vector: a mixture of past content, weighted by relevance to the current query.
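The score, softmax, & mixing steps for a single query can be sketched in a few lines of plain Python. The query, keys, & values below are hypothetical 2-D vectors picked for readability; real heads work in R^64.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: Sequence[float],
           keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query vector."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, output


# Hypothetical inputs: one key along q, one orthogonal to q, one opposite to q.
q = (2.0, 0.0)
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
values = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]

weights, output = attend(q, keys, values)
print(weights)
print(output)
```

The key aligned with q gets the largest weight, the orthogonal key a middling one, & the opposed key the smallest; the output is the correspondingly weighted mix of the value rows.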


Multi-Head Attention as Multi-Subspace Projection

ANDREA-120M uses 12 attention heads. d_model = 768; d_k = 768 / 12 = 64. Each head projects into a different 64-dim subspace of R^768. Twelve heads give twelve independent views of the same sequence: one head might track grammatical role, another semantic similarity, another long-range references.


Geometrically: each head defines a 64-dim oriented subspace (a 'window') through which it views the past.
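The head arithmetic can be sketched as follows. This assumes the common implementation in which a full-width projected vector is sliced into contiguous per-head chunks; the text does not say how ANDREA-120M lays out its heads, so treat `split_heads` as a hypothetical helper.

```python
from typing import List, Sequence

D_MODEL = 768   # from the text
N_HEADS = 12    # from the text
D_K = D_MODEL // N_HEADS


def split_heads(vec: Sequence[float], n_heads: int) -> List[List[float]]:
    """Slice one d_model-wide vector into n_heads contiguous chunks."""
    if len(vec) % n_heads != 0:
        raise ValueError("vector length must be divisible by the head count")
    d_k = len(vec) // n_heads
    return [list(vec[h * d_k:(h + 1) * d_k]) for h in range(n_heads)]


# Hypothetical projected query: component i simply holds the value i.
projected_query = [float(i) for i in range(D_MODEL)]
heads = split_heads(projected_query, N_HEADS)

print(D_K, len(heads), len(heads[0]))
```

Each chunk is what one head sees; the learned projection that precedes the slicing is what orients each 64-dim 'window' differently.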


The Causal Mask

Decoder-only models add a causal mask: every Q K^T entry above the diagonal gets set to -infinity before softmax. Geometrically: the projection onto any future token gets zero weight. Token t can only see tokens 0 through t.


Why this matters: training & inference become symmetric. Same forward pass, same masked projections, no special generation logic.
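A minimal sketch of the mask, assuming a square score matrix with one row per query position & one column per key position. The all-zero scores are a hypothetical input that makes the effect of the mask easy to read off.

```python
import math
from typing import List, Sequence


def causal_attention_weights(scores: Sequence[Sequence[float]]) -> List[List[float]]:
    """Row-wise softmax after setting every above-diagonal entry to -inf."""
    n = len(scores)
    weights = []
    for t in range(n):
        masked = [scores[t][j] if j <= t else -math.inf for j in range(n)]
        top = max(masked)  # finite: the diagonal entry is never masked
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical scores: every query relates equally to every key.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]

for row in causal_attention_weights(scores):
    print(row)
```

Row t spreads its weight over positions 0 through t only; every future position gets exactly zero.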


sqrt(d_k) Scaling

Without scaling, the spread of the dot products grows with d_k. Large dot products push softmax into one-hot regions (one weight near 1, rest near 0), where gradients become tiny. Dividing by sqrt(d_k) keeps the scores at unit-variance scale, so softmax stays out of saturation over a wide range of d_k values.


Geometrically: sqrt(d_k) normalizes the lengths of projections so the softmax sees comparable magnitudes regardless of subspace dimension.
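The unit-variance claim can be made concrete with a short derivation. It rests on an assumption the text does not state: the components of q & k are independent, with zero mean & unit variance.

```latex
\operatorname{Var}(q \cdot k)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
  = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1 .
```

For d_k = 64 the unscaled dot products would have standard deviation 8; after dividing by sqrt(64) = 8 it is 1.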

Reading a Projection

Three keys & one query in R^4 (simplified for arithmetic):


- q = (1, 0, 1, 0)

- k_1 = (1, 0, 0, 0) [past token 1]

- k_2 = (0, 0, 1, 0) [past token 2]

- k_3 = (0, 1, 0, 1) [past token 3]


d_k = 4, so sqrt(d_k) = 2.

(a) Compute q . k_i for i = 1, 2, 3 (dot products). (b) Divide each by sqrt(d_k) = 2 to get scaled scores. (c) Without computing softmax explicitly, state which key would receive the LARGEST attention weight & give a geometric reason.

Gradient Descent as Path on Terrain

A Surface in 120M+1 Dimensions

Every weight configuration of ANDREA-120M is one point in R^120,000,000. Loss L(w) maps each point to a real number: training loss at this configuration. Together, the loss values trace a 120M-dimensional surface (the graph of L) sitting in a (120M+1)-dimensional space: 120M parameter axes plus one loss axis.


Geometrically impossible to visualize directly. Conceptually: a terrain. Mountains (high loss), valleys (low loss), saddle points, plateaus, ridges, basins.


Gradient as Local Slope

grad L(w) is a vector in R^120M pointing in the direction of steepest INCREASE of L. Negating it: -grad L(w) points in the direction of steepest descent.
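The uphill/downhill reading of the gradient can be checked numerically. A minimal sketch on a hypothetical two-parameter toy loss, with the gradient estimated by central finite differences rather than backpropagation:

```python
from typing import Callable, List, Sequence


def loss(w: Sequence[float]) -> float:
    """Hypothetical toy loss: a bowl with its minimum at (1, -2)."""
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2


def numerical_gradient(f: Callable[[Sequence[float]], float],
                       w: Sequence[float],
                       h: float = 1e-6) -> List[float]:
    """Central-difference estimate of grad f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


w = [0.0, 0.0]
g = numerical_gradient(loss, w)

step = 0.01
uphill = [wi + step * gi for wi, gi in zip(w, g)]
downhill = [wi - step * gi for wi, gi in zip(w, g)]

print(loss(downhill), loss(w), loss(uphill))
```

A small step along +grad raises the loss; the same step along -grad lowers it.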


One AdamW step nudges w in the negative gradient direction (with adaptive scaling from m & v). Geometrically: a tiny step along the surface, downhill, with step size controlled by lr.
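A minimal sketch of one AdamW update for a single scalar weight, following the standard decoupled-weight-decay formulation. Only lr = 0.0003 comes from the text (it is the peak LR quoted there); beta1, beta2, eps, & weight_decay are common defaults assumed for illustration, & the toy loss L(w) = w^2 is hypothetical.

```python
import math
from typing import Tuple


def adamw_step(w: float, g: float, m: float, v: float, t: int,
               lr: float = 3e-4, beta1: float = 0.9, beta2: float = 0.999,
               eps: float = 1e-8, weight_decay: float = 0.01
               ) -> Tuple[float, float, float]:
    """One AdamW update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment: running mean of g
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment: running mean of g^2
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # Adaptive step along -gradient, plus decoupled weight decay.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


# Toy run: minimise L(w) = w^2 from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    g = 2.0 * w  # gradient of the toy loss
    w, m, v = adamw_step(w, g, m, v, t)

print(w)
```

Because the update divides by sqrt(v_hat), each step moves w by roughly lr regardless of how large the raw gradient is; that is the 'adaptive scaling from m & v'.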


v1's Bad Basin

v1 took its first step at LR = peak (0.0003) on freshly initialized weights. Geometric picture: w_0 sits in a wildly curved region (random init has high curvature in many directions), & a peak-LR step lands in the wrong basin. Subsequent steps cannot escape. The model gets stuck producing 'region region region' because that basin has the lowest loss the model can find from where it landed.


v2's Warmup Path

v2 takes 2000 small steps with LR ramping from 0 to peak. Geometric picture: w_0 first migrates gently along smooth directions (where curvature is low). By step 2000, w has moved into a more navigable region; peak LR can then drive it toward a better basin without overshooting.


Warmup is a geometry-aware initialization protocol: let the model find a safe local neighborhood before pushing it hard.
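A minimal sketch of the warmup schedule. The peak LR (0.0003) & the 2000 warmup steps come from the text; the linear shape of the ramp & the constant LR afterwards are assumptions, since the text only says the LR ramps from 0 to peak.

```python
PEAK_LR = 3e-4        # from the text
WARMUP_STEPS = 2000   # from the text


def learning_rate(step: int) -> float:
    """LR at a given step: linear ramp to PEAK_LR, then held constant."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: hold at peak. Any later decay is not described in the text.
    return PEAK_LR


for s in (0, 500, 1000, 2000):
    print(s, learning_rate(s))
```

Under this schedule the very first update has step size zero, which is the opposite of v1's first step at full peak LR.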


Wide vs Narrow Basins

At step 112K, ANDREA-120M sits in a basin. Question: how wide is it?


Wide basin = many neighboring weight configurations also achieve low training loss. Generalization tends to be good (basin width correlates with test performance; see PAC-Bayes lesson, Chapter 3).


Narrow basin = only a thin set of weights achieves low loss. Generalization tends to suffer.
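Basin width can be probed by sampling: perturb the weights a little & count how often the loss stays low. A minimal sketch on two hypothetical 2-parameter toy losses; the sampling scheme (random direction, radius drawn uniformly from [0, radius]) is an assumption, not a procedure described in the text.

```python
import math
import random
from typing import Callable, Sequence


def fraction_below(loss_fn: Callable[[Sequence[float]], float],
                   w: Sequence[float], radius: float, threshold: float,
                   n_samples: int = 2000, seed: int = 0) -> float:
    """Fraction of random perturbations within `radius` whose loss < threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        direction = [rng.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(d * d for d in direction))
        r = radius * rng.random()
        perturbed = [wi + r * d / norm for wi, d in zip(w, direction)]
        if loss_fn(perturbed) < threshold:
            hits += 1
    return hits / n_samples


def wide_loss(w: Sequence[float]) -> float:
    """Hypothetical flat bowl: loss rises slowly away from the minimum."""
    return 2.0 + sum(x * x for x in w)


def narrow_loss(w: Sequence[float]) -> float:
    """Hypothetical sharp bowl: same minimum, 1000x the curvature."""
    return 2.0 + 1000.0 * sum(x * x for x in w)


w_star = [0.0, 0.0]
wide = fraction_below(wide_loss, w_star, radius=0.1, threshold=2.2)
narrow = fraction_below(narrow_loss, w_star, radius=0.1, threshold=2.2)
print(wide, narrow)
```

Both toy losses have the same value at the minimum; only the fraction of nearby points that stay low tells them apart.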


v3 polish at step 112,619 nudged the model along the surface (without resetting) to a wider basin via curriculum perturbation: change the loss function (different bandit, different training mix), let the optimizer find a nearby flat region under the new policy.


The Zombie Cliff

The anomalous loss 0.13 at step 112,080 was a CLIFF: a sharp, narrow region where one specific input pattern (memorized repo-docs substring) hits near-zero loss. The model fell off the broader basin into a narrow gully. Polish-pivot's hard-exclusion of repo-docs filled in that gully so the optimizer could no longer find it.

Reading the Terrain

Three weight configurations after a polish pivot. (a) Configuration A: training loss 2.0, & 95% of small perturbations within distance 0.1 still produce loss < 2.2. (b) Configuration B: training loss 2.0, & 5% of small perturbations within distance 0.1 still produce loss < 2.2. (c) Configuration C: training loss 0.13 on one specific input but loss 8.0 on average across other inputs. Classify each as WIDE BASIN, NARROW BASIN, or CLIFF, & give a one-sentence geometric reason.

Curriculum Mix as a Walk on a Discrete Simplex

What a Simplex Is

The standard (n-1)-simplex is the set of n-tuples (w_1, w_2, ..., w_n) with each w_i >= 0 & sum(w_i) = 1. It has n vertices but only n-1 dimensions: the sum-to-1 constraint removes one degree of freedom.


For n = 2: a line segment from (1, 0) to (0, 1). For n = 3: a triangle with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1). For n = 16 (ANDREA's full source list): a 15-dimensional simplex sitting inside R^16.
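The definition translates directly into a membership test. A minimal sketch; the tolerance is an assumption added to absorb floating-point rounding.

```python
from typing import Sequence


def on_simplex(w: Sequence[float], tol: float = 1e-9) -> bool:
    """True if every weight is non-negative & the weights sum to 1."""
    return all(x >= -tol for x in w) and abs(sum(w) - 1.0) <= tol


print(on_simplex((0.25, 0.75)))        # a point on the n = 2 segment
print(on_simplex((1.0, 0.0, 0.0)))     # a vertex of the n = 3 triangle
print(on_simplex([1.0 / 16.0] * 16))   # centre of the n = 16 simplex
print(on_simplex((0.6, 0.6)))          # sums to 1.2: off the simplex
print(on_simplex((-0.1, 1.1)))         # negative weight: off the simplex
```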


Bandit Weights as Simplex Coordinates

ANDREA's bandit produces a weight vector w over data sources at each phase. Each component w_i is the probability of sampling source i. Probabilities are non-negative & sum to 1: every weight vector lives on the simplex.


Vertices = pure strategies (sample only one source). Interior = mixed strategies (sample multiple sources, each with positive probability). Edges = mixtures of two sources only.


Source Floors as Restricted Region

ANDREA imposes minimum weights: hermes3-general at floor 0.7 (post-polish). This carves out a sub-region of the simplex: only weight vectors with w_hermes3-general >= 0.7 are reachable.


Geometrically: the floor cuts the simplex with a hyperplane. The reachable region is the part of the simplex on the correct side of every floor hyperplane.


Caps as the Other Restriction

ANDREA imposes maximum weights too: dictionary at cap 0.25 (post-polish). Each cap is another hyperplane, & the reachable region must sit on the correct side of every cap hyperplane too.


Excluding a source entirely (cap = 0.0) is the strongest possible cap: the coordinate gets pinned to zero, reducing the effective simplex by one dimension.
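Floors & caps together define the reachable region, & checking a weight vector against them is a handful of comparisons. A minimal sketch: the floor (0.7), the cap (0.25), & the repo-docs exclusion are the post-polish values quoted in this lesson, while `is_reachable`, the dict layout, & the example weight vectors are hypothetical.

```python
from typing import Dict

FLOORS = {"hermes3-general": 0.7}
CAPS = {"dictionary": 0.25, "repo-docs": 0.0}  # cap 0.0 = hard exclusion


def is_reachable(w: Dict[str, float], floors: Dict[str, float],
                 caps: Dict[str, float], tol: float = 1e-9) -> bool:
    """True if w lies on the simplex & on the correct side of every hyperplane."""
    on_simplex = (all(x >= -tol for x in w.values())
                  and abs(sum(w.values()) - 1.0) <= tol)
    meets_floors = all(w.get(s, 0.0) >= f - tol for s, f in floors.items())
    meets_caps = all(w.get(s, 0.0) <= c + tol for s, c in caps.items())
    return on_simplex and meets_floors and meets_caps


# Hypothetical weight vectors over four of the sources.
ok = {"hermes3-general": 0.8, "gutenberg": 0.1, "dictionary": 0.1, "repo-docs": 0.0}
below_floor = {"hermes3-general": 0.5, "gutenberg": 0.3, "dictionary": 0.2, "repo-docs": 0.0}
excluded_used = {"hermes3-general": 0.8, "gutenberg": 0.1, "dictionary": 0.05, "repo-docs": 0.05}

print(is_reachable(ok, FLOORS, CAPS))
print(is_reachable(below_floor, FLOORS, CAPS))
print(is_reachable(excluded_used, FLOORS, CAPS))
```

Each floor or cap is one inequality; the reachable region is the set of simplex points that pass all of them at once.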


Phase Transitions as Simplex Walks

Every phase transition (every 7-42 steps) produces a new weight vector. Each new vector is a point on the simplex. Over 200K steps, the bandit traces a long path through the simplex's reachable region.


Random phases = teleport to a uniform-random point inside the reachable region.

Bandit-controlled phases = step toward the UCB-best vertex consistent with the floors & caps.

Polish pivot = redraw the reachable region (new floors, new caps, some sources excluded), & the walk continues from the new starting point.
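The 'teleport' in a random phase can be sketched with rejection sampling: draw a uniform point on the whole simplex, keep it only if it satisfies the floors & caps. This is an illustrative assumption about how such a draw could work, not ANDREA's actual sampler; the three-source setup reuses the floor & cap values quoted earlier in this lesson.

```python
import random
from typing import Dict, Sequence


def sample_reachable(sources: Sequence[str], floors: Dict[str, float],
                     caps: Dict[str, float], rng: random.Random,
                     max_tries: int = 100_000) -> Dict[str, float]:
    """Uniform draw from the reachable region via rejection sampling."""
    for _ in range(max_tries):
        # Normalised i.i.d. exponentials are uniform on the simplex.
        raw = [rng.expovariate(1.0) for _ in sources]
        total = sum(raw)
        w = {s: x / total for s, x in zip(sources, raw)}
        if (all(w[s] >= f for s, f in floors.items())
                and all(w[s] <= c for s, c in caps.items())):
            return w
    raise RuntimeError("no reachable point found; the region may be too small")


sources = ("hermes3-general", "gutenberg", "dictionary")
floors = {"hermes3-general": 0.7}
caps = {"dictionary": 0.25}

point = sample_reachable(sources, floors, caps, random.Random(0))
print(point)
```

Rejection sampling keeps the draw uniform over the reachable region, but the acceptance rate shrinks quickly as floors tighten or sources are added, so a real 16-source sampler would likely need a more direct construction.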


Why Vertices Are Dangerous

Pure-source phases (one w_i = 1, rest = 0) sit at simplex vertices. Diversity is zero. The model trains on one distribution only. v1's collapse partly traced to the bandit camping near the repo-docs vertex; samples reproduced that source's distribution exclusively.


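Floors prevent vertex-camping at non-priority sources: a floor at 0.7 on a priority source says 'never let this source's weight drop below 0.7' (or whatever the floor is for that source), so no other source can ever exceed 0.3. Caps do the same job from the other side, keeping the walk away from the capped source's own vertex.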

Walking the Reachable Region

Three sources: hermes3-general (H), gutenberg (G), dictionary (D). Constraints: H floor = 0.5, D cap = 0.25. (Implicit: all weights >= 0, sum to 1, no other constraints.)

(a) Could the bandit pick (H=1.0, G=0, D=0)? Why or why not? (b) Could it pick (H=0.5, G=0.5, D=0)? (c) Could it pick (H=0.5, G=0.25, D=0.25)? (d) Describe geometrically what the reachable region looks like in this 3-source simplex.

Restricting Dimensions for the First 20K Steps

What v2's Curriculum Warmup Did

v2 set curriculum_warmup_sources to seven sources: hermes3-general, hermes3-creative, hermes3-roleplay, chat, smoltalk, oasst, gutenberg. For the first 20K steps, ONLY those seven sources contributed. After step 20K, the full 16-source firehose activated.


Geometric Reading

The full 16-source simplex sits in R^16. Restricting to 7 sources collapses 9 of the 16 coordinates to zero. The bandit's walk takes place in a 6-dimensional sub-simplex (one less than the source count, by the sum-to-1 constraint).


Geometrically: a SUBMANIFOLD of the full simplex (precisely, one of its 6-dimensional faces). Lower-dimensional, fewer directions to wander in, easier to navigate.
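Restricting to the warmup sources amounts to zeroing the other coordinates & renormalizing. A minimal sketch: the seven warmup source names come from this lesson, while `restrict_to`, the renormalization step, & the nine-source demo vector are assumptions (the full 16-source list is not given here).

```python
from typing import Dict, Iterable

WARMUP_SOURCES = ("hermes3-general", "hermes3-creative", "hermes3-roleplay",
                  "chat", "smoltalk", "oasst", "gutenberg")


def restrict_to(weights: Dict[str, float], allowed: Iterable[str]) -> Dict[str, float]:
    """Zero every source outside `allowed`, then rescale to sum to 1."""
    allowed_set = set(allowed)
    kept = {s: (w if s in allowed_set else 0.0) for s, w in weights.items()}
    total = sum(kept.values())
    if total <= 0.0:
        raise ValueError("no weight left on the allowed sources")
    return {s: w / total for s, w in kept.items()}


def sub_simplex_dimension(n_active: int) -> int:
    """n active sources span an (n - 1)-dimensional simplex."""
    return n_active - 1


# Demo over a subset of sources: the seven warmup ones plus two others.
demo_sources = WARMUP_SOURCES + ("dictionary", "repo-docs")
uniform = {s: 1.0 / len(demo_sources) for s in demo_sources}
restricted = restrict_to(uniform, WARMUP_SOURCES)

print(restricted["dictionary"], restricted["chat"])
print(sub_simplex_dimension(len(WARMUP_SOURCES)))
```

After restriction, every non-warmup coordinate is pinned to zero & the walk has only six free directions left.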


Why This Helps Early Training

Early in training, the model has not learned coherent language at all. Diverse sources confuse it: each source has its own style, its own vocabulary distribution, its own pattern. Mixing 16 sources at random init creates a too-broad target distribution that the model cannot fit.


Restricting to 7 conversational/prose sources gives a more uniform target. The model learns a stable representation first, then expands.


Geometric Path Through Training

1. Steps 0 to 20K (warmup). Walk lives on the 6-D sub-simplex. Stable language patterns emerge in the model.

2. Steps 20K to 112K (full firehose). Walk expands to the 15-D full simplex. Domain breadth emerges.

3. Step 112K onward (polish). Walk restricted again: repo-docs & repo-docstrings excluded, conversational floors raised. Smaller polytope inside the full simplex; conversational quality consolidates.
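The three stages above can be expressed as a lookup on the step counter. A minimal sketch: the 20K boundary & the polish step 112,619 are figures quoted in this lesson, while the function name, the regime labels, & the exact inclusive/exclusive handling at each boundary are assumptions.

```python
WARMUP_END = 20_000      # warmup covers steps before this
POLISH_START = 112_619   # polish pivot step quoted in the lesson


def regime(step: int) -> str:
    """Which geometric regime the curriculum walk is in at a given step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_END:
        return "warmup"      # 6-D sub-simplex over the seven warmup sources
    if step < POLISH_START:
        return "firehose"    # full 15-D simplex over all 16 sources
    return "polish"          # exclusions, new floors & caps


for s in (0, 19_999, 20_000, 112_618, 112_619, 200_000):
    print(s, regime(s))
```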


Why Polish Sets curriculum_warmup_steps = 0

Polish enters at step 112K. The model already speaks coherent language. Restricting to a sub-simplex now would lose breadth without gaining anything (the warmup benefit is for fresh-init models). Setting warmup_steps = 0 says: stay on the full simplex, but with new caps & floors.


Three Geometries, One Training Run

v2 warmup: low-dimensional sub-simplex.

v2 firehose: full 15-D simplex.

v3 polish: smaller polytope inside the full simplex (more constraints; excluded sources pinned to zero).


Same 200K-step run, three different geometric regimes. Each was tuned for a different phase of model maturity.

Reading the Submanifold

(a) v2 warmup uses 7 sources from the 16-source full set. What is the dimension of the warmup sub-simplex? Compute & state. (b) ANDREA-120M v3 polish hard-excludes repo-docs & repo-docstrings (cap 0.0) but otherwise allows the remaining 14 sources. What is the dimension of the polish sub-simplex? (c) Geometrically, what does it mean to set curriculum_warmup_steps = 0 in the polish config?