An Embedding Is a Lookup, Not a Function
The First Layer After the Tokenizer
A tokenizer hands the model integer IDs: [256, 1842, 7301, ...]. The first thing a transformer does is convert each ID into a vector of d_model floats. That vector lives in d_model-dimensional space (768 dimensions for ANDREA-120M).
An embedding layer is a lookup table, not a function. Imagine a giant matrix:
shape: (V, d_model)
row 0: [e_0_0, e_0_1, ..., e_0_767]
row 1: [e_1_0, e_1_1, ..., e_1_767]
...
row 8448: [e_8448_0, e_8448_1, ..., e_8448_767]
Token ID i selects row i. Direct array access. No arithmetic, no activation. Just an index.
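The lookup can be sketched in a few lines of plain Python. This is a toy illustration rather than ANDREA's code: the table is shrunk to 5 rows of 4 floats, the initial values use an arbitrary small standard deviation, and the names `embedding_table` and `embed` are made up for the example.

```python
import random

random.seed(0)

V = 5        # toy vocabulary size (ANDREA-120M uses 8,449)
D_MODEL = 4  # toy width (ANDREA-120M uses 768)

# One row of D_MODEL floats per token ID.
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
                   for _ in range(V)]

def embed(token_ids):
    """Return one row per token ID: pure indexing, no arithmetic."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([3, 0, 3])
print(len(vectors), len(vectors[0]))  # 3 4
```

The same ID always returns the same row, so positions 0 and 2 in this example receive identical vectors.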
Trainable Floats
Every entry in that table starts as a small random float (typically drawn from a normal distribution scaled by 1/sqrt(d_model)). Backpropagation updates each row whenever its token ID appears in a batch. After training, similar tokens (cat, dog, pet) end up with similar vectors; unrelated tokens (cat, Tuesday, xylophone) sit far apart in vector space.
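A minimal sketch of that training behavior, assuming plain SGD with no weight decay or momentum (an assumption for simplicity; the text does not say which optimizer ANDREA uses). The gradients here are stand-in values, and `sgd_step` is a hypothetical helper, not part of any library.

```python
import math
import random

random.seed(0)

V, D_MODEL = 6, 4               # toy sizes
STD = 1.0 / math.sqrt(D_MODEL)  # the scale described in the text

table = [[random.gauss(0.0, STD) for _ in range(D_MODEL)] for _ in range(V)]

def sgd_step(table, token_ids, grads, lr=0.1):
    """Apply plain SGD to the rows selected by token_ids.

    grads[k] is the (stand-in) gradient of the loss with respect to the
    vector looked up for token_ids[k]. A token that appears twice in the
    batch receives both gradient contributions.
    """
    for token_id, grad in zip(token_ids, grads):
        row = table[token_id]
        for j in range(len(row)):
            row[j] -= lr * grad[j]

before = [row[:] for row in table]
batch = [2, 5, 2]
fake_grads = [[1.0] * D_MODEL for _ in batch]  # not real gradients
sgd_step(table, batch, fake_grads)

changed = [i for i in range(V) if table[i] != before[i]]
print(changed)  # [2, 5]
```

Rows 0, 1, 3 and 4 are untouched because their token IDs never appeared in the batch.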
ANDREA-120M Token Embedding Cost
| Quantity | Value |
|---|---|
| V | 8,449 |
| d_model | 768 |
| Parameters | 6,488,832 |
Roughly 6.5M parameters live in the token embedding table alone, about 5.4% of ANDREA-120M's total. Every vocabulary slot costs 768 floats.
Sizing Embedding Tables
Dot Products Measure Similarity
Vectors as Arrows
A 768-dimensional vector lives in a space humans cannot picture, but the same algebra works in any dimension. Two key operations matter for transformers:
Magnitude (length of an arrow):
||v|| = sqrt(v_0² + v_1² + ... + v_767²)
Dot product (alignment between two arrows):
u · v = u_0 × v_0 + u_1 × v_1 + ... + u_767 × v_767
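Both formulas translate directly into code. The sketch below uses plain Python lists and 2-dimensional vectors so the numbers can be checked by hand; the function names `dot` and `magnitude` are chosen for this example.

```python
import math

def dot(u, v):
    """Sum of elementwise products; u and v must have equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    """Euclidean length: the square root of v dotted with itself."""
    return math.sqrt(dot(v, v))

u = [3.0, 4.0]
v = [4.0, 3.0]
print(magnitude(u))  # 5.0
print(dot(u, v))     # 24.0
```

Nothing in either function depends on the vector length, so the same code works unchanged for 768 entries.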
What a Dot Product Tells You
Four facts hold in any dimension:
- u · v = ||u|| × ||v|| × cos(theta), where theta is the angle between them.
- Vectors pointing in the same direction give large positive dot products.
- Vectors pointing in opposite directions give large negative dot products.
- Vectors at a right angle give a dot product of zero.
Dot product = unnormalized similarity. The trained token embeddings for cat & dog end up with a high dot product because backpropagation pushed them together (both predict pet-related contexts). cat & Tuesday end up nearly orthogonal because they predict different contexts.
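The three cases (aligned, opposite, right angle) can be checked with toy vectors. The values below are made-up 3-dimensional stand-ins, not real trained embeddings, and `cosine` is a helper written for this example that divides the dot product by both magnitudes to recover cos(theta).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||); undefined for zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up stand-ins, NOT real trained embeddings.
cat = [2.0, 1.0, 0.0]
dog = [4.0, 2.0, 0.0]         # same direction as cat, twice as long
tuesday = [0.0, 0.0, 3.0]     # at a right angle to cat
anti_cat = [-2.0, -1.0, 0.0]  # opposite direction to cat

print(dot(cat, dog), dot(cat, tuesday), dot(cat, anti_cat))  # 10.0 0.0 -5.0
print(round(cosine(cat, dog), 6),
      round(cosine(cat, tuesday), 6),
      round(cosine(cat, anti_cat), 6))                       # 1.0 0.0 -1.0
```

The raw dot product grows with vector length (10.0 for cat and dog), while the cosine stays in the range -1 to 1. That difference is what "unnormalized" refers to.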
Why a Transformer Cares
Activity 5 (grow_a_language_model_attention) builds attention from dot products: a query vector dot-producted with key vectors yields scores that say which past tokens matter for predicting the next one. Embeddings & dot products together carry every interaction inside a transformer.
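As a preview, the scoring step alone might look like the sketch below. It deliberately leaves out the scaling, masking and softmax that a full attention layer adds, and the query and key values are hypothetical 2-dimensional vectors invented for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def raw_scores(query, keys):
    """One dot product per key: higher means better aligned with the query."""
    return [dot(query, k) for k in keys]

# Hypothetical vectors for three past positions.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [1.0, 1.0]

scores = raw_scores(query, keys)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)  # [1.0, 1.0, 4.0] 2
```

Position 2 scores highest because its key points in the same direction as the query and is also the longest of the three.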
Predict Similarity
ANDREA Uses Learned Position Embeddings
The Problem
A token embedding tells the model what word lives at a position. It does not tell the model where that word sits. Without position information, a transformer treats "the cat sat on a mat" & "mat a on sat cat the" identically: the same set of tokens, no order signal.
Three common solutions appear in the transformer literature:
- Sinusoidal (Vaswani et al., 2017). A fixed formula based on sines & cosines. Position 0 gets one specific 768-vector, position 1 gets another; none are trained or updated. Generalizes to any position via the formula.
- RoPE (Rotary Position Embedding). Rotates query & key vectors by an angle that depends on position. Used by LLaMA & Qwen. No additional parameters; the rotation is built into attention.
- Learned. A separate embedding table shaped (T, d_model), where T is the context length. Each row trains via backpropagation, just like token embeddings.
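For contrast with the learned table, here is a sketch of the sinusoidal option. It uses the commonly cited form of the formula from Vaswani et al. (sine on even indices, cosine on odd indices, base 10000); the function name `sinusoidal_position` is made up for this example, and ANDREA itself does not use this scheme.

```python
import math

def sinusoidal_position(pos, d_model, base=10000.0):
    """Fixed position vector: sin on even indices, cos on odd indices."""
    vec = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        vec[i] = math.sin(angle)
        if i + 1 < d_model:
            vec[i + 1] = math.cos(angle)
    return vec

p0 = sinusoidal_position(0, 768)
p5000 = sinusoidal_position(5000, 768)  # beyond a 1,024-token context
print(p0[:4])      # [0.0, 1.0, 0.0, 1.0]
print(len(p5000))  # 768
```

Because the vector is computed rather than stored, position 5000 is as easy to produce as position 0. A learned table has no row for it.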
ANDREA's Choice: Learned
ANDREA inherits the learned-position approach from microGPT, which inherited it from nanoGPT, which inherited it from GPT-2. The reasoning:
- Simplicity. No special math in attention. A position table looks like a token table.
- Compatibility with custom CUDA. ANDREA's microgpt_cuda.cu engine handles two embedding lookups identically; no sin/cos kernels needed.
- Sufficient for fixed context. ANDREA caps T at 1024. A learned table works fine for fixed-length sequences.
ANDREA-120M Position Embedding Cost
| Quantity | Value |
|---|---|
| T (context) | 1,024 |
| d_model | 768 |
| Parameters | 786,432 |
0.79M parameters for position. Combined with token embeddings: 6.49M + 0.79M = 7.28M embedding parameters total for ANDREA-120M.
How They Combine
At each position t in an input sequence:
x_t = token_embedding[token_id_t] + position_embedding[t]
Two 768-vectors, summed elementwise. The result, x_t, flows into the first transformer block. The model never separates them again; it learns to use the combined signal.
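The sum can be sketched for a whole sequence at once. This is an illustration with toy sizes and random tables, not ANDREA's engine; `random_table` and `embed_sequence` are hypothetical helper names, and the standard deviation of 0.02 is an arbitrary choice.

```python
import random

random.seed(0)

V, T, D_MODEL = 10, 6, 4  # toy sizes; ANDREA-120M uses 8,449 / 1,024 / 768

def random_table(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_embedding = random_table(V, D_MODEL)
position_embedding = random_table(T, D_MODEL)

def embed_sequence(token_ids):
    """x_t = token_embedding[token_id_t] + position_embedding[t] for each t."""
    if len(token_ids) > T:
        raise ValueError("sequence longer than the position table")
    return [
        [tok + pos for tok, pos in zip(token_embedding[tid], position_embedding[t])]
        for t, tid in enumerate(token_ids)
    ]

x = embed_sequence([7, 2, 7])
print(len(x), len(x[0]))  # 3 4
print(x[0] == x[2])       # False: same token, different positions
```

Token 7 appears twice, yet its two output vectors differ because a different position row was added to each. That difference is the order signal.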
Learned Versus Sinusoidal
Where Embedding Parameters Live
The Full ANDREA-120M Embedding Layer
| Component | Shape | Parameters |
|---|---|---|
| Token embedding table | 8,449 × 768 | 6,488,832 |
| Position embedding table | 1,024 × 768 | 786,432 |
| Total | | 7,275,264 |
Roughly 7.3M parameters. ANDREA-120M's total parameter count: ~120M. The embedding layer alone: about 6%. The remaining 94% lives outside the embedding layer, mostly in the transformer blocks (attention + MLP, covered in activities 5-7).
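The table's arithmetic can be reproduced directly. The only assumption in the sketch below is the model total: the text gives it as roughly 120M, so the percentage is approximate.

```python
V = 8_449            # vocabulary size
T = 1_024            # context length
D_MODEL = 768        # embedding width
TOTAL = 120_000_000  # approximate; the text only says ~120M

token_params = V * D_MODEL
position_params = T * D_MODEL
embedding_params = token_params + position_params

print(token_params)      # 6488832
print(position_params)   # 786432
print(embedding_params)  # 7275264
print(f"{embedding_params / TOTAL:.1%}")  # 6.1%
```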
Untied vs Tied Embeddings
Many transformer designs (GPT-2 included) tie the token embedding to the final output projection: the same V × d_model matrix is used at input & at output (to produce logits over the vocabulary). Tying saves V × d_model parameters & often improves quality.
ANDREA uses untied embeddings: the input embedding & the output projection train as separate matrices. Activity 7 (grow_a_language_model_transformer_block) covers the final layer.
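The difference between the two designs comes down to whether the output step reuses the input table or owns a second matrix. The sketch below shows both with toy sizes; the `logits` helper, the stand-in hidden state and the row-per-token storage layout are assumptions for illustration, not a description of ANDREA's engine.

```python
import random

random.seed(0)
V, D_MODEL = 5, 3  # toy sizes

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def logits(hidden, output_matrix):
    """Score every vocabulary entry: one dot product per row of output_matrix."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_matrix]

token_embedding = random_matrix(V, D_MODEL)

tied_output = token_embedding              # same object: no new parameters
untied_output = random_matrix(V, D_MODEL)  # separate matrix: V * D_MODEL more

hidden = [0.1, -0.2, 0.3]  # stand-in for a final hidden state
print(len(logits(hidden, tied_output)))  # 5
print(tied_output is token_embedding)    # True
print(untied_output is token_embedding)  # False
```

At ANDREA-120M's sizes, the separate output matrix in the untied design adds another 8,449 × 768 = 6,488,832 parameters.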
The Forward Pass So Far
Input: token IDs [256, 1842, 7301, ...] (1024 of them). Each ID looks up a 768-vector. Each position looks up a 768-vector. Sum elementwise. Result: a (1024, 768) matrix x of token-plus-position vectors. x flows into transformer block 1.
Activity 5 (grow_a_language_model_attention) covers what block 1 does: scaled dot-product attention with causal mask & softmax.