un — Grow a Language Model: Scaled Dot-Product Attention

Query, Key, Value

Three Linear Maps from the Same Input

After embedding (activity 4), each position carries a 768-dimensional vector x_t. Attention starts by producing three distinct projections of x:

Q (query): what does this position want to know?

K (key): what does this position offer to other positions?

V (value): what content does this position deliver if attended to?

Each projection comes from a learned weight matrix:

Q = x · W_Q     # W_Q shape: (d_model, d_k)
K = x · W_K     # W_K shape: (d_model, d_k)
V = x · W_V     # W_V shape: (d_model, d_k)

Three matrices, all trained via backpropagation. The model learns, for each position: what query best retrieves useful past context? What key advertises this position's content well? What value should it deliver if selected?
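The three projections can be sketched in plain Python. This is a minimal illustration, not ANDREA's actual code: the helper names matvec and random_matrix are hypothetical, the weights are random stand-ins for learned values, and the sizes are shrunk to d_model = 8 and d_k = 4 so the example runs instantly.

```python
import random

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    d_out = len(W[0])
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(d_out)]

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned weight matrix: small random values."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k = 8, 4   # toy sizes (assumption); ANDREA-120M uses 768 and 64

x_t = [rng.gauss(0.0, 1.0) for _ in range(d_model)]   # one position's vector
W_Q = random_matrix(d_model, d_k, rng)
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

q_t = matvec(x_t, W_Q)   # what this position wants to know
k_t = matvec(x_t, W_K)   # what this position offers
v_t = matvec(x_t, W_V)   # what this position delivers if attended to

print(len(q_t), len(k_t), len(v_t))   # 4 4 4
```

The same input x_t feeds all three maps; only the weight matrices differ, which is why the three outputs can play different roles.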

Scaled dot-product attention

A Library Analogy

Imagine a library card catalog. You walk in with a topic in mind (your query). Every card lists keywords (a key). When your topic matches a card's keywords, you grab that book's contents (a value). Attention does this for every token in parallel: every position queries every other position, ranks alignment, & retrieves a weighted combination of value vectors.

ANDREA-120M Dimensions

Quantity	Value	Notes
d_model	768	Vector size at every position
n_head	12	Parallel attention heads
d_k	64	Per-head dim (= d_model / n_head)
T	1024	Context length

d_k = d_model / n_head = 768 / 12 = 64. Each head sees a 64-dimensional slice of the full 768-dimensional space. Activity 6 (grow_a_language_model_multi_head) covers the per-head split in detail.
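The formula is a single integer division, but it only makes sense when n_head divides d_model evenly. A small sketch, with head_dim as a hypothetical helper name:

```python
def head_dim(d_model, n_head):
    """Return the per-head dimension d_k = d_model / n_head."""
    if d_model % n_head != 0:
        raise ValueError("d_model must be divisible by n_head")
    return d_model // n_head

d_k = head_dim(768, 12)   # ANDREA-120M values from the table above
print(d_k)                # 64
```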

Compute d_k

Compute d_k for two ANDREA variants. (a) ANDREA-12M: d_model = 384, n_head = 12. (b) ANDREA-480M: d_model = 1536, n_head = 24. Show your formula d_k = d_model / n_head for each.

Why Divide by sqrt(d_k)

A Score Matrix

Once Q & K exist (each shape (T, d_k)), attention computes a score matrix:

scores = Q · K^T     # shape: (T, T)

scores[i][j] measures how strongly position i's query aligns with position j's key. Every (i, j) pair gets one score: 1024 × 1024 = 1,048,576 scores per attention head per forward pass.
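A toy version of the score matrix, written with plain lists so each entry is visibly one dot product. The sizes (T = 3, d_k = 2) and the numbers are made up for illustration; dot and score_matrix are hypothetical helper names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def score_matrix(Q, K):
    """scores[i][j] = dot(Q[i], K[j]); the result has shape (T, T)."""
    return [[dot(q, k) for k in K] for q in Q]

# Toy inputs (assumption): T = 3 positions, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

scores = score_matrix(Q, K)
for row in scores:
    print(row)
# [1.0, 0.0, 3.0]
# [0.0, 2.0, 1.0]
# [1.0, 2.0, 4.0]
```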

Why Divide

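If the components of a query & a key are independent with zero mean & unit variance, their dot product has zero mean & variance d_k, so its typical magnitude is on order of sqrt(d_k). (Unit-length vectors would not show this growth: their dot product never exceeds 1.) Without scaling, scores grow with d_k: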

- d_k = 64: typical dot products on order of 8.

- d_k = 256: typical dot products on order of 16.

- d_k = 4096: typical dot products on order of 64.
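The growth can be checked empirically. The sketch below assumes every component of q and k is an independent standard-normal draw (an assumption about the inputs, not something measured on ANDREA) and estimates the standard deviation of the dot product; dot_product_std is a hypothetical helper name.

```python
import random
import statistics

def dot_product_std(d, trials=2000, seed=0):
    """Estimate the standard deviation of q . k for random d-dimensional q, k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(samples)

for d in (64, 256):
    # Expect values close to sqrt(d): roughly 8 and roughly 16
    print(d, round(dot_product_std(d, trials=1000), 1))
```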

Big scores produce a peaky softmax: one position takes nearly all probability mass, & gradients vanish for every other position. Training stalls. Scaling restores the magnitude:

scaled_scores = (Q · K^T) / sqrt(d_k)

For ANDREA-120M, sqrt(d_k) = sqrt(64) = 8. Every score gets divided by 8. Magnitudes stay roughly unit-scale regardless of d_k. Softmax stays well-behaved. Gradients flow.
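To see what "peaky" means in numbers, compare softmax on a row of unscaled scores with softmax on the same row divided by sqrt(d_k). The raw scores here are hypothetical values chosen to be on the order of 8, and softmax_naive is a deliberately simple helper that is only safe for small inputs like these.

```python
import math

def softmax_naive(row):
    """Plain softmax; fine for small inputs, not overflow-safe in general."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw = [16.0, 8.0, 0.0, -8.0]                 # hypothetical unscaled scores
scaled = [s / math.sqrt(d_k) for s in raw]   # [2.0, 1.0, 0.0, -1.0]

peaky = softmax_naive(raw)
smooth = softmax_naive(scaled)

print([round(p, 4) for p in peaky])    # first entry takes almost all the mass
print([round(p, 4) for p in smooth])   # mass is spread across positions
```

With the unscaled row, the largest entry receives more than 99% of the probability; after dividing by 8 it receives roughly 64%, so the other positions still contribute and still receive gradient.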

Vaswani's Original Justification

From Attention Is All You Need (2017): 'We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.' The sqrt(d_k) divisor counteracts that growth.

A Code View

Inside microgpt_cuda.cu, this scaling appears as a multiplication by the reciprocal 1 / sqrt(d_k):

scores[i][j] = dot(Q[i], K[j]) * (1.0f / sqrtf(d_k));

One float multiplication per score. Cheap. Critical.

Scale at d_model = 4096

Suppose a research team builds ANDREA-2B with d_model = 4096 & n_head = 32. (a) Compute d_k. (b) Compute sqrt(d_k). (c) Explain in one sentence what would happen if a team forgot to divide by sqrt(d_k) at this scale.

Why Position i Cannot See Position j > i

A Constraint Born from Generation

ANDREA generates one token at a time. At inference, position 0 produces a first token, then position 1 sees position 0's output & produces a second token, & so on. A model never has access to future tokens during generation.

Training must mirror this. If during training position 5 could attend to position 6, a model would learn a shortcut: 'predict token 6 by reading token 6'. At inference, that shortcut disappears (token 6 doesn't exist yet). A model's training-versus-inference behavior would diverge catastrophically.

A Mask

A causal mask blocks attention from any position i to any position j > i. Implementation: set scaled_scores[i][j] = -infinity wherever j > i. After softmax, those entries become exp(-inf) = 0, so the mask zeros out attention to future positions cleanly.

for i in range(T):
    for j in range(i + 1, T):          # future positions only (j > i)
        scaled_scores[i][j] = -1e9     # effectively -inf

After row-wise softmax, each row sums to 1, but only entries 0 through i carry probability mass. Position i mixes information only from itself & past positions.
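The mask and the row-wise softmax can be combined into one small function. This is a sketch under simple assumptions: scores are a list of lists, float("-inf") is used as the mask value, and causal_softmax is a hypothetical name. Because the diagonal entry is never masked, every row has at least one finite score, so the max subtraction below is always well defined.

```python
import math

def causal_softmax(scaled_scores):
    """Mask entries with j > i, then apply softmax to each row."""
    T = len(scaled_scores)
    probs = []
    for i in range(T):
        row = [scaled_scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(row)                           # finite: the diagonal is unmasked
        exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

# Toy input (assumption): all scores equal, so each row is uniform over 0..i
A = causal_softmax([[0.0] * 4 for _ in range(4)])
for row in A:
    print([round(p, 2) for p in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```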

Visualizing a Mask

A score matrix of shape (T, T) with the mask applied has a lower-triangular structure:

scaled_scores after mask, row-wise softmax:

row 0:  [1.0, 0,   0,   0,   ...]   # sees only itself
row 1:  [0.4, 0.6, 0,   0,   ...]   # sees positions 0, 1
row 2:  [0.2, 0.3, 0.5, 0,   ...]   # sees 0, 1, 2
row 3:  [0.1, 0.2, 0.3, 0.4, ...]   # sees 0, 1, 2, 3
...

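Each row is a probability distribution, & the matrix as a whole is lower-triangular. The diagonal stays nonzero because every position attends to itself, so the matrix is not strictly lower-triangular. Future stays invisible.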

Why a Decoder-Only Transformer Needs This

Decoder-only models like ANDREA, GPT, & LLaMA all share one objective: predict the next token from past tokens. A causal mask makes that objective trainable in parallel: every position computes its own next-token prediction at once, & no position cheats by peeking ahead.

Mask & Flavor

Activity 2 (intro) covered three transformer flavors: encoder-only, encoder-decoder, decoder-only. (a) Which flavor uses a causal mask? (b) State in one sentence why a different flavor (encoder-only, like BERT) would NOT use a causal mask. (c) What objective does an unmasked encoder train on instead?

From Scores to Output

Softmax: Scores to Probabilities

Masked, scaled scores still range over real numbers. Softmax converts each row into a probability distribution:

A[i][j] = exp(scaled_scores[i][j]) / sum_k exp(scaled_scores[i][k])

Three properties result:

- A[i][j] >= 0 for all (i, j).

- sum_j A[i][j] = 1 for every row i.

- Larger raw scores produce larger probabilities (monotone).

Row i's probability vector tells the model: how much should position i attend to itself & to each prior position when computing its output?
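One practical detail the formula hides: computing exp directly on large scores overflows (in Python, math.exp(1000.0) raises OverflowError). A common remedy, sketched here as an assumption about implementation rather than a description of ANDREA's code, is to subtract the row maximum first. The result is unchanged because the shift cancels between numerator and denominator.

```python
import math

def softmax(row):
    """Softmax with the row maximum subtracted for numerical stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in probs])   # [0.665, 0.245, 0.09]

# Large scores that would overflow a direct exp() are handled without error
big = softmax([1000.0, 1001.0])
print([round(p, 3) for p in big])     # [0.269, 0.731]
```

All three properties listed above hold for the output: every entry is non-negative, each row sums to 1, and a larger score always maps to a larger probability.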

Weighted V Sum

A final attention output for position i:

output[i] = sum_j A[i][j] · V[j]

Each value vector V[j] gets weighted by attention probability A[i][j], then summed. Position i's output combines value vectors from positions 0 through i (future positions have weight 0), weighted by relevance.
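A worked instance of the weighted sum for a single row. The attention weights and value vectors are made-up numbers, and weighted_value_sum is a hypothetical helper; note that the last position has weight 0, so its value vector does not affect the output.

```python
def weighted_value_sum(a_row, V):
    """output[i] = sum_j a_row[j] * V[j], computed component by component."""
    d = len(V[0])
    return [sum(a_row[j] * V[j][c] for j in range(len(V))) for c in range(d)]

# Toy data (assumption): row 2 of an attention matrix over T = 4 positions
a_row = [0.2, 0.3, 0.5, 0.0]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0],
     [9.0, 9.0]]   # masked out by the zero weight

out = weighted_value_sum(a_row, V)
print([round(v, 2) for v in out])   # [1.2, 1.3]
```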

In matrix form, all positions at once:

Attention(Q, K, V) = softmax(mask(Q · K^T / sqrt(d_k))) · V

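One line. The whole attention mechanism. Vaswani et al. published that formula in 2017 (without the explicit mask term; masking was described separately for decoder self-attention), & the core computation hasn't fundamentally changed since.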
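The matrix formula can be unrolled into a short, dependency-free function for a single head. This is a readable sketch, not an efficient or official implementation: causal_attention is a hypothetical name, and instead of writing -inf into future entries it simply skips them, which gives the same result because masked entries contribute zero after softmax.

```python
import math

def causal_attention(Q, K, V):
    """softmax(mask(Q K^T / sqrt(d_k))) V for one head, using plain lists."""
    T, d_k = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    output = []
    for i in range(T):
        # Scaled scores against positions 0..i only (the causal mask)
        scores = [scale * sum(q * k for q, k in zip(Q[i], K[j]))
                  for j in range(i + 1)]
        # Row-wise softmax with the maximum subtracted for stability
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value vectors
        output.append([sum(weights[j] * V[j][c] for j in range(i + 1))
                       for c in range(len(V[0]))])
    return output

# Toy check (assumption): zero queries give uniform weights over 0..i,
# so each output is the running mean of the value vectors
Q = [[0.0, 0.0]] * 3
K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V = [[1.0], [3.0], [5.0]]
out = causal_attention(Q, K, V)
print([round(row[0], 6) for row in out])   # [1.0, 2.0, 3.0]
```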

Per-Head Output Shape

Output of one attention head: shape (T, d_k). For ANDREA-120M: (1024, 64). All 12 heads compute in parallel; their outputs concatenate to (1024, 768) & feed into a final linear projection (W_O), then on to a transformer block's MLP.
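The shape bookkeeping for concatenation can be checked with a few lines. The sketch uses ANDREA-120M's n_head = 12 and d_k = 64 but shrinks T to 4 to keep it small; the head outputs are dummy constants, concat_heads is a hypothetical helper, and the W_O projection is left out.

```python
def concat_heads(head_outputs):
    """Join n_head matrices of shape (T, d_k) into one of shape (T, n_head * d_k)."""
    T = len(head_outputs[0])
    return [[value for head in head_outputs for value in head[t]]
            for t in range(T)]

n_head, d_k, T = 12, 64, 4   # T shrunk from 1024 for illustration

# Dummy head outputs (assumption): head h is filled with the constant h
heads = [[[float(h)] * d_k for _ in range(T)] for h in range(n_head)]

merged = concat_heads(heads)
print(len(merged), len(merged[0]))   # 4 768
```

Each position's 768-dimensional row is laid out head by head: columns 0 to 63 come from head 0, columns 64 to 127 from head 1, and so on.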

Activity 6 (grow_a_language_model_multi_head) covers a multi-head split. Activity 7 (grow_a_language_model_transformer_block) covers everything that surrounds attention: residual connections, layer norm, MLP.

Synthesize a Pipeline

Synthesize the entire attention pipeline in your own words. Walk through what happens to a single position i (e.g. position 5 in a sequence) from input vector x_5 to attention output[5]. Name the four operations in order: (1) project to Q/K/V, (2) compute scaled scores against all positions, (3) apply causal mask + softmax, (4) sum V vectors weighted by probabilities. One short paragraph.