English· Español· Deutsch· Nederlands· Français· 日本語· ქართული· 繁體中文· 简体中文· Português· Русский· العربية· हिन्दी· Italiano· 한국어· Polski· Svenska· Türkçe· Українська· Tiếng Việt· Bahasa Indonesia

un

guest
1 / ?
back to lessons

Attention Plus MLP, Repeated

Transformer Block Pre-Norm Structure


Two Sublayers

A transformer block contains exactly two sublayers, each operating on a token sequence of shape [batch, seq_len, d_model]:


1. Multi-head attention sublayer. Tokens look at each other. Activity 68 covered this in detail. Output shape matches input shape.

2. Feed-forward MLP sublayer. Each token transforms independently through a two-layer perceptron. No cross-token information flow. Output shape matches input shape.


Both sublayers preserve the [batch, seq_len, d_model] shape. That preservation lets blocks stack: layer N's output feeds layer N+1's input with no shape acrobatics.


What Each Sublayer Contributes

Attention moves information across positions: a token at position 17 can pull information from positions 1 through 16. The MLP transforms information within each position: a token's representation gets reshaped through learned non-linear functions, but never sees its neighbors.


Stacking blocks alternates these two operations. Layer 1 attention mixes positions. Layer 1 MLP reshapes per-position. Layer 2 attention mixes again, now over the reshaped representations. This alternation grows expressive power with depth.


ANDREA's Stack


Variantn_layern_headd_modelmlp_dim
ANDREA-12M6123841536
ANDREA-120M12127683072
ANDREA-480M162415366144

Notice mlp_dim = 4 × d_model across the family. That ratio holds in nearly every modern transformer. Section 3 walks through why.

Naming The Sublayers

A transformer block holds two sublayers. Name them in order, & for each one state whether it moves information across positions (token-to-token) or transforms information within a single position (independent per token). One sentence per sublayer.

Why Skip Connections Matter

The Residual Pattern

Each sublayer wraps its computation in a residual connection. The output adds back the input:


x = x + Attention(LayerNorm(x))     # attention sublayer
x = x + MLP(LayerNorm(x))           # MLP sublayer

Inside each sublayer, the function Attention(...) or MLP(...) produces a delta. The block does not replace the input; it adds a learned correction.


Why This Matters

Three reasons residual connections dominate deep architectures:


1. Gradient flow. During backpropagation, the addition gives gradients a direct path from output back to input, bypassing the sublayer. A 12-layer stack without residuals would lose gradient signal long before reaching the embeddings; with residuals, gradient magnitude stays usable through hundreds of layers.


2. Identity initialization. At initialization, sublayer weights produce small outputs. The residual connection means layer N initially passes through nearly unchanged. Training learns deltas progressively from a working starting point.


3. Compositional learning. Each block learns a refinement, not a replacement. Layer 1 might add positional information. Layer 2 might add syntactic structure. Layer 7 might add semantic relationships. The residual stream accumulates.


Layer Normalization

Before each sublayer, LayerNorm rescales each token's representation to zero mean & unit variance, then applies learned per-feature gain & bias:


y = gamma * (x - mean(x)) / sqrt(var(x) + epsilon) + beta

Mean & variance get computed across the d_model dimension, separately for every token. Two learned vectors (gamma, beta, each shape [d_model]) restore expressive scale. Normalization keeps activations in a numerically stable range; without it, small training instabilities snowball into NaN gradients.

Pre-Norm vs Post-Norm

A Subtle But Critical Choice

Two ways to wire layer norm into a residual sublayer:


Post-norm (original 2017 paper):

x = LayerNorm(x + Attention(x))

Layer norm sits after the residual addition. The residual stream itself gets normalized at every layer.


Pre-norm (modern standard, used in ANDREA):

x = x + Attention(LayerNorm(x))

Layer norm sits before the sublayer, inside the residual branch. The residual stream stays un-normalized; only the input to the sublayer gets rescaled.


Why Pre-Norm Won

Post-norm trains poorly without LR warmup & careful hyperparameter tuning. Gradients explode in early layers because each layer norm scrambles the residual stream's accumulated state. The original 2017 paper used post-norm with extensive tuning; subsequent work (GPT-2, LLaMA, ANDREA) standardized on pre-norm.


Pre-norm trains stably. The residual stream accumulates clean across all layers; only sublayer inputs get normalized for numerical stability. Modern transformers default to pre-norm, & ANDREA inherits that choice.


Final Block Equation

Combining residuals, layer norm in pre-norm position, & both sublayers gives ANDREA's full block:


def block_forward(x):
    x = x + Attention(LayerNorm(x))   # attention sublayer
    x = x + MLP(LayerNorm(x))         # MLP sublayer
    return x

Two sublayers, two residual additions, two layer norms (note: each sublayer has its own layer norm; ANDREA-120M has 24 layer norms across 12 blocks plus a final one at output, so 25 total). Repeat 12 times. That's the trunk of ANDREA-120M.

Why Pre-Norm Stabilizes Training

ANDREA uses pre-norm: `x = x + Attention(LayerNorm(x))`. Compare against post-norm: `x = LayerNorm(x + Attention(x))`. Give one reason from a gradient-flow perspective why pre-norm trains more stably than post-norm in deep stacks. Reference the residual stream in your answer.

Two Linear Layers, One Activation

Three Operations

The MLP sublayer is a two-layer perceptron with a non-linear activation between the layers:


def mlp_forward(x):
    h = x · W_1 + b_1        # expand: d_model → mlp_dim
    h = GELU(h)              # non-linear activation
    y = h · W_2 + b_2        # contract: mlp_dim → d_model
    return y

Three operations. Two linear, one non-linear. The first linear expands width; the second contracts back.


The 4× Expansion Ratio

Modern transformers set mlp_dim = 4 × d_model. ANDREA-120M:


d_model = 768
mlp_dim = 4 × 768 = 3072
W_1 shape = [768, 3072]      # ~2.36M params
W_2 shape = [3072, 768]      # ~2.36M params
MLP params per block = 4.72M (ignoring biases)

Two MLPs sit between every pair of attention sublayers (one per block). Twelve blocks × 4.72M ≈ 56.6M MLP parameters total in ANDREA-120M, roughly half of all parameters.


Why 4×

The 4× ratio emerged empirically. Smaller ratios reduce model capacity. Larger ratios produce diminishing returns per parameter spent. Across decades of architecture search, 4× has held up; it appears in GPT, BERT, T5, LLaMA, & ANDREA.


Recent work (PaLM, Chinchilla) found that gating mechanisms (SwiGLU) can use 8/3× expansion with comparable capacity at less cost; ANDREA stays with classic GELU + 4× for simplicity.

GELU: A Smooth Activation

What GELU Computes

GELU (Gaussian Error Linear Unit) is the standard activation between MLP layers in modern transformers. Its formula:


GELU(x) = x · Φ(x)

Φ(x) is the cumulative distribution function of the standard normal: the probability that a standard Gaussian random variable falls at or below x. Computed numerically:


Φ(x) ≈ 0.5 × (1 + tanh(sqrt(2/π) × (x + 0.044715 × x³)))

Behavior By Region

- For large positive x: Φ(x) ≈ 1, so GELU(x) ≈ x. Like ReLU.

- For large negative x: Φ(x) ≈ 0, so GELU(x) ≈ 0. Like ReLU.

- Near x = 0: Φ(x) ≈ 0.5, so GELU(0) = 0 exactly. Smooth transition through the origin.


Unlike ReLU, GELU lets some negative inputs through, weighted by Φ(x). A small negative input still contributes a small negative output, just less than the full input would.


Why GELU Outperformed ReLU

Empirically, transformers trained with GELU reach lower loss than transformers trained with ReLU at the same parameter count. The smoothness around zero matters: ReLU's hard cutoff at zero produces gradient discontinuities; GELU's smooth curve provides cleaner gradients for backpropagation.


ANDREA's training engine microgpt_cuda.cu ships a hand-written GELU CUDA kernel. The kernel uses the tanh approximation above; modern GPUs include tanh as a single-instruction op.

Computing MLP Parameters

ANDREA-120M has `d_model=768`, `mlp_dim=3072`, & `n_layer=12`. Compute the total parameters in MLP weight matrices (`W_1` & `W_2`) across all 12 blocks. Ignore biases. Show your arithmetic. Then state what fraction of ANDREA-120M's ~120M total parameters this represents (round to one decimal).

Twelve Blocks Compose Into ANDREA-120M

From Block to Model

ANDREA-120M's full forward pass:


def model_forward(token_ids):
    x = token_embed(token_ids) + position_embed(positions)
    for block_idx in range(n_layer):       # 12 blocks
        x = block_forward(x)               # attention + MLP w/ residuals
    x = LayerNorm(x)                       # final norm
    logits = x · token_embed.T             # tied weights for output projection
    return logits

Six lines. The bulk lives inside block_forward, called twelve times. Embeddings start the pipeline; tied unembedding (the same matrix used for input lookup, transposed for output projection) ends it.


Depth As Composition

Each block reads the residual stream, computes a delta, & adds it back. After twelve passes, the stream contains accumulated contributions from every block. Internally, layers tend to specialize:


- Early layers (1-3): syntactic patterns, positional structure

- Middle layers (4-8): word relationships, phrase boundaries

- Late layers (9-12): semantic content, factual recall


This specialization emerges from training pressure, not from architectural choices. The same uniform block design produces specialized layers when trained on language.


Total Block Parameters


ComponentPer blockAcross 12 blocks
Attention projections (4×W)2.36M28.3M
MLP weights (W_1 + W_2)4.72M56.6M
Layer norms (gamma, beta)~3K (negligible)~37K
Total per block~7.1M~85M

85M parameters in the trunk. Add ~13M in token embeddings (8449 vocab × 768 d_model × 2 for tied input/output) plus a final layer norm, & ANDREA-120M's parameter count lands at roughly 120M. The block design accounts for two-thirds; embeddings the rest.

Tracing One Token Through One Block

A 768-dim token vector enters block 7 of ANDREA-120M. Walk through what happens to it inside the block (in pre-norm structure). Mention: both layer norms, both sublayers, both residual additions, & the final shape. State at least one place where the residual stream is left untouched & one place where it gets modified.