un — Grow a Language Model: Embeddings & Position

An Embedding Is a Lookup, Not a Function

The First Layer After the Tokenizer

A tokenizer hands the model integer IDs: [256, 1842, 7301, ...]. The first thing a transformer does is convert each ID into a vector of d_model floats. That vector lives in d_model-dimensional space (768 dimensions for ANDREA-120M).

An embedding layer is a lookup table, not a function. Imagine a giant matrix:

shape: (V, d_model)
row 0:    [e_0_0, e_0_1, ..., e_0_767]
row 1:    [e_1_0, e_1_1, ..., e_1_767]
...
row 8448: [e_8448_0, e_8448_1, ..., e_8448_767]

Token ID i selects row i. Direct array access. No arithmetic, no activation. Just an index.
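The lookup can be sketched in a few lines of plain Python. The table sizes and values below are toy assumptions chosen for readability (ANDREA-120M uses V = 8,449 and d_model = 768), and the name `embed` is hypothetical.

```python
# Toy sizes so the table fits on screen; ANDREA-120M uses V = 8449, d_model = 768.
V, D_MODEL = 5, 4

# Hypothetical table: entry (i, j) is i + j/10, so each row is easy to recognize.
embedding_table = [[i + j / 10 for j in range(D_MODEL)] for i in range(V)]


def embed(token_ids):
    """Return one row per token ID: plain indexing, no arithmetic on the values."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([3, 0, 3])
print(vectors[0])                        # row 3 of the table
print(vectors[0] is embedding_table[3])  # True: the row itself, not a computed copy
```

The same ID always returns the same row, whatever tokens surround it.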

[Figure: token & position embedding flow]

Trainable Floats

Every entry in that table starts as a small random float (for example, drawn from a normal distribution with standard deviation around 1/sqrt(d_model); the exact scheme varies by codebase). Backpropagation updates each row whenever its token ID appears in a batch. After training, similar tokens (cat, dog, pet) end up with similar vectors; unrelated tokens (cat, Tuesday, xylophone) sit far apart in vector space.
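A minimal sketch of both ideas, assuming plain SGD with no weight decay (optimizers with extra terms can also move rows that did not appear in the batch). The sizes, the batch, the all-ones gradients, and the name `sgd_step` are made up for illustration.

```python
import math
import random

random.seed(0)                    # fixed seed so the sketch is repeatable
V, D_MODEL = 6, 4                 # toy sizes, not ANDREA's
STD = 1 / math.sqrt(D_MODEL)      # assumed initialization scale

# Small random starting values, one row per vocabulary entry.
table = [[random.gauss(0.0, STD) for _ in range(D_MODEL)] for _ in range(V)]
before = [row[:] for row in table]


def sgd_step(table, token_ids, row_grads, lr=0.1):
    """Apply one plain SGD update; only rows whose ID appears get touched."""
    for tid, grad in zip(token_ids, row_grads):
        for j, g in enumerate(grad):
            table[tid][j] -= lr * g


# Hypothetical batch: token 2 appears twice, token 5 once.
batch_ids = [2, 5, 2]
grads = [[1.0] * D_MODEL for _ in batch_ids]
sgd_step(table, batch_ids, grads)

changed = [i for i in range(V) if table[i] != before[i]]
print(changed)  # [2, 5]
```

Token 2 appears twice, so its row receives two gradient contributions; rows 0, 1, 3, and 4 keep their initial values.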

ANDREA-120M Token Embedding Cost

Quantity	Value
V	8,449
d_model	768
Parameters	6,488,832

Roughly 6.5M parameters live in the token embedding table alone, about 5.4% of ANDREA-120M's total. Every vocabulary slot costs 768 floats.
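The count is a single multiplication. The helper below is a hypothetical convenience, and the percentage assumes a nominal total of exactly 120M parameters.

```python
def embedding_params(vocab_size, d_model):
    """Number of floats in a (vocab_size, d_model) embedding table."""
    return vocab_size * d_model


andrea_120m = embedding_params(8449, 768)
print(f"{andrea_120m:,}")            # 6,488,832
print(f"{andrea_120m / 120e6:.1%}")  # 5.4% (assumes a nominal 120M total)
```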

Sizing Embedding Tables

Compute the token embedding parameter count for two future variants. (a) ANDREA-480M: V = 16,641 (a 16,384-segment tokenizer plus 256 bytes plus 1 BOS), d_model = 1536. (b) ANDREA-12M: V = 4,353, d_model = 384. Show the V × d_model arithmetic for each.

Dot Products Measure Similarity

Vectors as Arrows

A 768-dimensional vector lives in a space humans cannot picture, but the same algebra works in any dimension. Two key operations matter for transformers:

Magnitude (length of an arrow):

||v|| = sqrt(v_0² + v_1² + ... + v_767²)

Dot product (alignment between two arrows):

u · v = u_0 × v_0 + u_1 × v_1 + ... + u_767 × v_767
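Both formulas translate directly into code. This sketch uses two-dimensional vectors so the numbers can be checked by hand; the function names are illustrative, and nothing changes at 768 dimensions except the length of the lists.

```python
import math


def dot(u, v):
    """Sum of elementwise products; u and v must have the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def magnitude(v):
    """Euclidean length: the square root of v dotted with itself."""
    return math.sqrt(dot(v, v))


u = [3.0, 4.0]
v = [1.0, 2.0]
print(magnitude(u))  # 5.0, since sqrt(9 + 16) = 5
print(dot(u, v))     # 11.0, since 3*1 + 4*2 = 11
```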

What a Dot Product Tells You

One identity and three consequences that hold in any dimension:

- u · v = ||u|| × ||v|| × cos(theta), where theta is the angle between them.

- Vectors pointing in the same direction give large positive dot products.

- Vectors pointing in opposite directions give large negative dot products.

- Vectors at a right angle give a dot product of zero.
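The three cases can be checked numerically. The vectors below are hand-picked for illustration, and `cosine` is a hypothetical helper that rearranges the identity to solve for cos(theta).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||); undefined for zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a = [1.0, 2.0, 2.0]
same = [2.0, 4.0, 4.0]          # same direction, twice as long
opposite = [-1.0, -2.0, -2.0]   # flipped sign on every component
right_angle = [2.0, -1.0, 0.0]  # 1*2 + 2*(-1) + 2*0 = 0

print(dot(a, same), cosine(a, same))          # 18.0 1.0
print(dot(a, opposite), cosine(a, opposite))  # -9.0 -1.0
print(dot(a, right_angle))                    # 0.0
```

The raw dot product grows with vector length (18 versus 9 in magnitude here), while the cosine depends only on direction.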

Dot product = unnormalized similarity. Two trained token embeddings for cat & dog end up with a high dot product because backpropagation pushed them together (both predict pet-related contexts). cat & Tuesday end up nearly orthogonal because they predict different contexts.

Why a Transformer Cares

Activity 5 (grow_a_language_model_attention) builds attention from dot products: the dot product of a query vector with each key vector yields scores that say which past tokens matter for predicting the next one. Embeddings & dot products together carry every interaction inside a transformer.

Predict Similarity

After training, ANDREA-120M's embedding for `believ` (token row 4287, hypothetical) ends up roughly aligned with `know`, `understand`, `learn`. Without computing exact values, predict the order from largest dot product to smallest: `believ · know`, `believ · stone`, `believ · understand`. Justify your ordering in one phrase per pair.

ANDREA Uses Learned Position Embeddings

The Problem

A token embedding tells the model which token sits at a position. It does not tell the model where that position is in the sequence. Without position information, a transformer treats "the cat sat on a mat" & "mat a on sat cat the" identically: the same set of tokens, no order signal.

Three solutions are common in the transformer literature:

Sinusoidal (Vaswani et al., 2017). A fixed mathematical formula based on sines & cosines. Position 0 gets a specific 768-vector; position 1 gets another; never trained, never updated. Generalizes to any position via the formula.

RoPE (Rotary Position Embedding). Rotates query & key vectors based on position. Used by LLaMA, Qwen. No additional parameters; rotation built into attention.

Learned. A separate embedding table shaped (T, d_model) where T is the context length. Each row trains via backpropagation, just like token embeddings.
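For contrast with a learned table, here is a sketch of the fixed sinusoidal formula from Vaswani et al. (2017): PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name is hypothetical, and ANDREA itself does not use this scheme.

```python
import math


def sinusoidal_position(pos, d_model):
    """Fixed encoding for one position: sin on even indices, cos on odd ones."""
    vec = [0.0] * d_model
    for i in range(0, d_model, 2):
        # i is the even index (2i in the paper), so the exponent is i / d_model.
        angle = pos / (10000 ** (i / d_model))
        vec[i] = math.sin(angle)
        if i + 1 < d_model:
            vec[i + 1] = math.cos(angle)
    return vec


p0 = sinusoidal_position(0, 8)
print(p0)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# No table and no trained parameters, so a position past 1024 is just another call.
far = sinusoidal_position(5000, 768)
print(len(far))  # 768
```

A learned table, by contrast, has a row only for positions 0 through T - 1.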

ANDREA's Choice: Learned

ANDREA inherits the learned-position approach from microGPT, which inherited it from nanoGPT, which inherited it from GPT-2. The reasoning:

- Simplicity. No special math in attention. A position table looks like a token table.

- Compatibility with custom CUDA. ANDREA's microgpt_cuda.cu engine handles the two embedding lookups identically; no sin/cos kernels needed.

- Sufficient for fixed context. ANDREA caps T at 1024. A learned table works fine for fixed-length sequences.

ANDREA-120M Position Embedding Cost

Quantity	Value
T (context)	1,024
d_model	768
Parameters	786,432

0.79M parameters for position. Combined with token embeddings: 6.49M + 0.79M ≈ 7.28M embedding parameters total for ANDREA-120M.

How They Combine

At each position t in an input sequence:

x_t = token_embedding[token_id_t] + position_embedding[t]

Two 768-vectors, summed elementwise. The result, x_t, flows into the first transformer block. The model never separates them again; it learns to use the combined signal.
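The sum above, applied to a whole sequence, might look like the sketch below. The toy tables (V = 3, T = 4, d_model = 2) and the name `embed_sequence` are assumptions for illustration only.

```python
def embed_sequence(token_ids, token_table, position_table):
    """Return x where x[t] = token_table[token_ids[t]] + position_table[t]."""
    if len(token_ids) > len(position_table):
        raise ValueError("sequence is longer than the context length T")
    return [
        [tok + pos for tok, pos in zip(token_table[tid], position_table[t])]
        for t, tid in enumerate(token_ids)
    ]


# Toy tables with hypothetical values: V = 3, T = 4, d_model = 2.
token_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
position_table = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.4, 0.0]]

# Token 2 appears at positions 0 and 2.
x = embed_sequence([2, 1, 2], token_table, position_table)
print(x[0], x[2])    # same token, different positions
print(x[0] != x[2])  # True: the position term makes the two vectors differ
```

A sequence longer than T has no row to look up, which is why a learned table fixes the maximum context length.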

Learned Versus Sinusoidal

Compare two position embedding strategies for a hypothetical ANDREA model. Strategy A: learned, T = 1024. Strategy B: sinusoidal, T arbitrary (works for any sequence length). Name one advantage of each. Then state which one ANDREA picks & one reason from a CUDA / engineering angle.

Where Embedding Parameters Live

The Full ANDREA-120M Embedding Layer

Component	Shape	Parameters
Token embedding table	8,449 × 768	6,488,832
Position embedding table	1,024 × 768	786,432
Total		7,275,264

Roughly 7.3M parameters. ANDREA-120M's total parameter count: ~120M. Embedding layer alone: about 6%. The remaining 94% lives in transformer blocks (attention + MLP, covered in activities 5-7).
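The tally can be reproduced directly from the shapes in the table. The 120,000,000 total below is a nominal assumption; the exact model total is not given here.

```python
V, T, D_MODEL = 8449, 1024, 768
TOTAL_PARAMS = 120_000_000  # nominal figure; the exact total is an assumption

token_params = V * D_MODEL
position_params = T * D_MODEL
embedding_total = token_params + position_params

print(f"{token_params:,} + {position_params:,} = {embedding_total:,}")
# 6,488,832 + 786,432 = 7,275,264
print(f"{embedding_total / TOTAL_PARAMS:.1%}")  # 6.1%
```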

Untied vs Tied Embeddings

Many transformer designs (GPT-2 included) tie the token embedding to the final output projection: the same V × d_model matrix is used at input & at output (logits over the vocabulary). Tying saves V × d_model parameters & often improves quality.

ANDREA uses untied embeddings: the input embedding & the output projection train as separate matrices. Activity 7 (grow_a_language_model_transformer_block) covers the final layer.
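A rough sketch of the difference, using toy matrices with made-up values and ignoring any bias term in the output projection. The function names are hypothetical, and this is not ANDREA's actual final layer.

```python
def output_logits(hidden, projection):
    """One logit per vocabulary row: the dot product of hidden with that row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in projection]


def vocab_matrix_params(vocab_size, d_model, tied):
    """Floats spent on the input table plus the output projection (no bias)."""
    return vocab_size * d_model * (1 if tied else 2)


# Toy sizes and values (hypothetical): V = 3, d_model = 2.
token_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tied_projection = token_table  # tied: the very same matrix object
untied_projection = [[0.5, 0.5], [-1.0, 0.0], [0.0, 2.0]]  # untied: trained separately

hidden = [2.0, 3.0]  # stand-in for the final hidden state at one position
print(output_logits(hidden, tied_projection))    # [2.0, 3.0, 5.0]
print(output_logits(hidden, untied_projection))  # [2.5, -2.0, 6.0]

print(f"{vocab_matrix_params(8449, 768, tied=True):,}")   # 6,488,832
print(f"{vocab_matrix_params(8449, 768, tied=False):,}")  # 12,977,664
```

With tying, a gradient step on the output side also moves the input rows, because both sides share one matrix.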

The Forward Pass So Far

Input: token IDs [256, 1842, 7301, ...] (1024 of them). Each ID looks up a 768-vector. Each position looks up a 768-vector. Sum elementwise. Result: a (1024, 768) matrix x of token-plus-position vectors. x flows into transformer block 1.

Activity 5 (grow_a_language_model_attention) covers what block 1 does: scaled dot-product attention with a causal mask & softmax.

Predict Embedding Structure

Reflect: ANDREA-120M has 8449 token embeddings & 1024 position embeddings, sharing the same 768-dimensional space. After training, what would you expect (a) the token embedding matrix to look like (cluster patterns?), or (b) the position embedding matrix to look like (smooth gradient?). Pick one & predict in one or two sentences. No wrong answer; reasoning matters.