Width Splits Across Heads
A Single Head Sees One Pattern
Activity 67 covered scaled dot-product attention: a query vector Q, a key vector K, a value vector V; compute Q·Kᵀ/√d_k, mask, softmax, weight V. One head learns one relationship pattern: maybe subject-verb agreement, maybe punctuation pairing, maybe nothing useful.
Multi-head attention runs that same operation in parallel, several times, on different slices of a token's representation. Twelve parallel heads. Twelve possible relationship patterns. Heads specialize through training pressure alone; no architect tells head 4 to look at verb tense.
The Split Relation
ANDREA-120M sets d_model = 768 & n_head = 12. Multi-head attention splits 768 into 12 chunks of 64 each:
head_dim = d_model / n_head
64 = 768 / 12
Every head operates on 64-dim vectors. The split applies cleanly: d_model must divide by n_head with zero remainder. Configurations that violate this fail at config validation, not at runtime.
Three Models, Three Splits
| Variant | d_model | n_head | head_dim |
|---|---|---|---|
| ANDREA-12M | 384 | 12 | 32 |
| ANDREA-120M | 768 | 12 | 64 |
| ANDREA-480M | 1536 | 24 | 64 |
Notice: ANDREA-12M & ANDREA-120M keep n_head=12 constant; only d_model & therefore head_dim scale. ANDREA-480M doubles head count to 24, keeping head_dim=64 matched to ANDREA-120M.
Computing head_dim
Three Matrices Per Head, Or One Big Matrix
Per-Head View
Each head needs its own query projection, key projection, & value projection. For head h:
Q_h = X · W_Q^h where W_Q^h has shape [d_model, head_dim]
K_h = X · W_K^h where W_K^h has shape [d_model, head_dim]
V_h = X · W_V^h where W_V^h has shape [d_model, head_dim]
X carries the input shape [batch, seq_len, d_model]. After projection, Q_h, K_h, V_h each carry shape [batch, seq_len, head_dim].
Fused View
Per-head matrices sit alongside each other in memory. A single fused matrix W_Q of shape [d_model, d_model] produces all heads at once:
Q_fused = X · W_Q # [batch, seq_len, d_model]
Q_per_head = reshape(Q_fused) # [batch, n_head, seq_len, head_dim]
The fused matmul ships one BLAS call instead of 12. CUDA tensor cores reach peak throughput on matmuls of this size; per-head matmuls would underuse the hardware.
Parameter Count
Three fused matrices W_Q, W_K, W_V, each d_model × d_model. Plus the output projection W_O, also d_model × d_model. For ANDREA-120M:
params per layer's attention = 4 × 768² = 2,359,296 ≈ 2.36M
params across 12 layers = 12 × 2.36M ≈ 28.3M
Roughly a quarter of ANDREA-120M's total parameters live in attention projections. The remaining three-quarters live in the MLP sublayer & embeddings.
Naming the Projections
Twelve Vectors Become One
After Each Head Computes
Every head produces an output tensor of shape [batch, seq_len, head_dim]. Twelve heads produce twelve such tensors. Concatenation along the feature dimension stacks them back together:
concat_output = concat(head_1, head_2, ..., head_12)
shape = [batch, seq_len, n_head × head_dim]
= [batch, seq_len, 768] # for ANDREA-120M
Concat reverses the split. The total feature dimension returns to d_model. No information loss in the dimensions; the difference lives in what each chunk contains: head 1's chunk reflects head 1's learned attention pattern.
The Output Projection W_O
Concatenation alone leaves heads isolated: head 4's output sits next to head 7's output, neither aware of the other. The output projection W_O of shape [d_model, d_model] mixes them:
attention_output = concat_output · W_O
shape = [batch, seq_len, d_model]
After W_O, every output dimension carries a learned linear combination of all twelve heads. Information flows freely between heads through this single matrix multiply.
Why Heads Specialize
Nothing in the architecture forces head 4 to learn verb tense or head 9 to learn paired punctuation. Specialization emerges from gradient pressure: during training, heads that contribute redundantly receive smaller gradients than heads that contribute uniquely. Over thousands of steps, each head settles into a niche that reduces total loss most effectively.
Empirically, trained transformers show heads that handle: positional patterns (head looks at the previous token), syntactic patterns (head looks at the matching close bracket), semantic patterns (head looks at the most recent named entity). No labels train this specialization. The training signal alone, propagated through W_O, sorts the heads.
Why Twelve Heads, Not One Wider Head
How CUDA Stores The Heads
A Single Tensor, Reshaped
ANDREA's training engine microgpt_cuda.cu does not allocate twelve separate buffers for twelve heads. It allocates one fused tensor & treats the head dimension as a stride pattern:
// after Q = X · W_Q (one matmul, fused across heads)
// Q has shape [batch, seq_len, d_model]
// reshape to [batch, seq_len, n_head, head_dim]
// transpose to [batch, n_head, seq_len, head_dim]
// each head now contiguous in the inner two dimensions
The transpose moves n_head ahead of seq_len. Why? Because the next operation (Q_h · K_h^T) needs each head's seq_len × head_dim slice contiguous in memory. CUDA matmuls run faster on contiguous tensors.
One Kernel, Many Heads
A single attention CUDA kernel runs across all heads in parallel. Each thread block handles one (batch, head) pair; threads inside the block cooperate on the seq_len × head_dim tile. The kernel never knows it processes multiple heads; the launch grid handles the parallelism.
Configuration Reflects the Hardware
ANDREA-120M's choice of n_head=12, head_dim=64 aligns with RTX 4090 tensor cores, which prefer matmul tiles in multiples of 16. head_dim=64 = 4 × 16 matches the tile shape exactly. head_dim=32 (ANDREA-12M) also matches but underuses the tile. head_dim=72 would not match & would force fallback kernels.
Final Picture
| Step | Operation | Output shape |
|---|---|---|
| 1. Project | Q = X · W_Q (likewise K, V) | [batch, seq, d_model] |
| 2. Reshape & transpose | split d_model → (n_head, head_dim) | [batch, n_head, seq, head_dim] |
| 3. Attention per head | scaled dot-product on each head | [batch, n_head, seq, head_dim] |
| 4. Transpose & reshape | merge (n_head, head_dim) → d_model | [batch, seq, d_model] |
| 5. Output projection | output = concat · W_O | [batch, seq, d_model] |
Five steps. Three matmuls touch the input (Q, K, V projections). One matmul touches the concatenated heads (W_O). One attention kernel handles all heads in parallel. ANDREA-120M runs all five steps once per layer, twelve layers deep, every forward pass.