un — Grow a Language Model: Multi-Head Attention

un

guest

1 / ?

back to lessons

Width Splits Across Heads

A Single Head Sees One Pattern

Activity 67 covered scaled dot-product attention: a query vector Q, a key vector K, a value vector V; compute Q·Kᵀ/√d_k, mask, softmax, weight V. One head learns one relationship pattern: maybe subject-verb agreement, maybe punctuation pairing, maybe nothing useful.

Multi-head attention runs that same operation in parallel, several times, on different slices of a token's representation. Twelve parallel heads. Twelve possible relationship patterns. Heads specialize through training pressure alone; no architect tells head 4 to look at verb tense.

The Split Relation

ANDREA-120M sets d_model = 768 & n_head = 12. Multi-head attention splits 768 into 12 chunks of 64 each:

head_dim = d_model / n_head
64       = 768     / 12

Every head operates on 64-dim vectors. The split applies cleanly: d_model must divide by n_head with zero remainder. Configurations that violate this fail at config validation, not at runtime.

Three Models, Three Splits

Variant	d_model	n_head	head_dim
ANDREA-12M	384	12	32
ANDREA-120M	768	12	64
ANDREA-480M	1536	24	64

Notice: ANDREA-12M & ANDREA-120M keep n_head=12 constant; only d_model & therefore head_dim scale. ANDREA-480M doubles head count to 24, keeping head_dim=64 matched to ANDREA-120M.

Computing head_dim

Suppose a hypothetical ANDREA-240M variant uses `d_model=1024` & `n_head=16`. Compute `head_dim`. Then explain in one sentence what would happen at config validation if a colleague proposed `d_model=1024` & `n_head=15`.

Three Matrices Per Head, Or One Big Matrix

Per-Head View

Each head needs its own query projection, key projection, & value projection. For head h:

Q_h = X · W_Q^h    where W_Q^h has shape [d_model, head_dim]
K_h = X · W_K^h    where W_K^h has shape [d_model, head_dim]
V_h = X · W_V^h    where W_V^h has shape [d_model, head_dim]

X carries the input shape [batch, seq_len, d_model]. After projection, Q_h, K_h, V_h each carry shape [batch, seq_len, head_dim].

Fused View

Per-head matrices sit alongside each other in memory. A single fused matrix W_Q of shape [d_model, d_model] produces all heads at once:

Q_fused = X · W_Q              # [batch, seq_len, d_model]
Q_per_head = reshape(Q_fused)  # [batch, n_head, seq_len, head_dim]

The fused matmul ships one BLAS call instead of 12. CUDA tensor cores reach peak throughput on matmuls of this size; per-head matmuls would underuse the hardware.

Parameter Count

Three fused matrices W_Q, W_K, W_V, each d_model × d_model. Plus the output projection W_O, also d_model × d_model. For ANDREA-120M:

params per layer's attention = 4 × 768² = 2,359,296 ≈ 2.36M
params across 12 layers      = 12 × 2.36M ≈ 28.3M

Roughly a quarter of ANDREA-120M's total parameters live in attention projections. The remaining three-quarters live in the MLP sublayer & embeddings.

Naming the Projections

ANDREA-120M has `d_model=768` & `n_head=12`. Name the four projection matrices in one transformer block's attention sublayer & give the shape of each (rows × columns). Then state which of the four uses input X & which uses the concatenated head outputs.

Twelve Vectors Become One

Multi-Head Attention Flow

After Each Head Computes

Every head produces an output tensor of shape [batch, seq_len, head_dim]. Twelve heads produce twelve such tensors. Concatenation along the feature dimension stacks them back together:

concat_output = concat(head_1, head_2, ..., head_12)
shape         = [batch, seq_len, n_head × head_dim]
              = [batch, seq_len, 768]    # for ANDREA-120M

Concat reverses the split. The total feature dimension returns to d_model. No information loss in the dimensions; the difference lives in what each chunk contains: head 1's chunk reflects head 1's learned attention pattern.

The Output Projection W_O

Concatenation alone leaves heads isolated: head 4's output sits next to head 7's output, neither aware of the other. The output projection W_O of shape [d_model, d_model] mixes them:

attention_output = concat_output · W_O
shape            = [batch, seq_len, d_model]

After W_O, every output dimension carries a learned linear combination of all twelve heads. Information flows freely between heads through this single matrix multiply.

Why Heads Specialize

Nothing in the architecture forces head 4 to learn verb tense or head 9 to learn paired punctuation. Specialization emerges from gradient pressure: during training, heads that contribute redundantly receive smaller gradients than heads that contribute uniquely. Over thousands of steps, each head settles into a niche that reduces total loss most effectively.

Empirically, trained transformers show heads that handle: positional patterns (head looks at the previous token), syntactic patterns (head looks at the matching close bracket), semantic patterns (head looks at the most recent named entity). No labels train this specialization. The training signal alone, propagated through W_O, sorts the heads.

Why Twelve Heads, Not One Wider Head

Imagine an alternative: replace 12 heads of 64 dim each with 1 head of 768 dim. Total parameters in the projections stay the same. Give two distinct reasons why the 12-head configuration outperforms the 1-head configuration on language modeling tasks. Reference attention patterns or training dynamics in your answer.

How CUDA Stores The Heads

A Single Tensor, Reshaped

ANDREA's training engine microgpt_cuda.cu does not allocate twelve separate buffers for twelve heads. It allocates one fused tensor & treats the head dimension as a stride pattern:

// after Q = X · W_Q (one matmul, fused across heads)
// Q has shape [batch, seq_len, d_model]
// reshape to    [batch, seq_len, n_head, head_dim]
// transpose to  [batch, n_head, seq_len, head_dim]
// each head now contiguous in the inner two dimensions

The transpose moves n_head ahead of seq_len. Why? Because the next operation (Q_h · K_h^T) needs each head's seq_len × head_dim slice contiguous in memory. CUDA matmuls run faster on contiguous tensors.

One Kernel, Many Heads

A single attention CUDA kernel runs across all heads in parallel. Each thread block handles one (batch, head) pair; threads inside the block cooperate on the seq_len × head_dim tile. The kernel never knows it processes multiple heads; the launch grid handles the parallelism.

Configuration Reflects the Hardware

ANDREA-120M's choice of n_head=12, head_dim=64 aligns with RTX 4090 tensor cores, which prefer matmul tiles in multiples of 16. head_dim=64 = 4 × 16 matches the tile shape exactly. head_dim=32 (ANDREA-12M) also matches but underuses the tile. head_dim=72 would not match & would force fallback kernels.

Final Picture

Step	Operation	Output shape
1. Project	Q = X · W_Q (likewise K, V)	[batch, seq, d_model]
2. Reshape & transpose	split d_model → (n_head, head_dim)	[batch, n_head, seq, head_dim]
3. Attention per head	scaled dot-product on each head	[batch, n_head, seq, head_dim]
4. Transpose & reshape	merge (n_head, head_dim) → d_model	[batch, seq, d_model]
5. Output projection	output = concat · W_O	[batch, seq, d_model]

Five steps. Three matmuls touch the input (Q, K, V projections). One matmul touches the concatenated heads (W_O). One attention kernel handles all heads in parallel. ANDREA-120M runs all five steps once per layer, twelve layers deep, every forward pass.

Reading the Pipeline

Walk through what happens to a single token's representation X[i] (a 768-dim vector) as it flows through one multi-head attention sublayer in ANDREA-120M. Mention: the projections it undergoes, how it gets split, what attention does to it, & how it returns to a 768-dim vector. Show the dimensions at each step.