What a Language Model Predicts
A Probabilistic Continuation Engine
A language model takes a sequence of tokens & assigns a probability distribution over what token comes next. Feed it "the cat sat on the" & it outputs probabilities across an entire vocabulary: high mass on "mat", "floor", "couch"; low mass on "xylophone", "Tuesday".
Sample from that distribution, append the chosen token, & feed the extended sequence back in: that loop generates text one token at a time. This is autoregressive generation, so named because each step regresses on the model's own prior output.
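The loop above can be sketched in a few lines. This is a toy under stated assumptions, not ANDREA's code: `toy_next_token_probs` is a hypothetical stand-in for a real forward pass, & the five-word vocabulary & its probabilities are made up for illustration.

```python
import random

# Hypothetical toy vocabulary; a real model has thousands of tokens.
VOCAB = ["mat", "floor", "couch", "xylophone", "Tuesday"]

def toy_next_token_probs(tokens):
    """Stand-in for a model forward pass: one probability per VOCAB entry.

    A real model computes this from `tokens`; the toy ignores its input.
    """
    return [0.50, 0.30, 0.17, 0.02, 0.01]

def generate(prompt_tokens, n_new, seed=0):
    """Autoregressive loop: sample, append, feed back."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = toy_next_token_probs(tokens)                     # distribution over the next token
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]   # sample from it
        tokens.append(next_token)                                # feed it back as context
    return tokens

out = generate(["the", "cat", "sat", "on", "the"], n_new=3)
print(" ".join(out))
```

Swapping the toy function for a real model changes nothing about the loop itself; only the source of the probabilities changes.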
Three Numbers Define a Language Model
Vocabulary size (V). How many distinct tokens a model can produce. ANDREA-12M used 4,353 tokens; ANDREA-120M uses 8,449.
Context window (T). How many tokens fit in one forward pass. ANDREA models use T = 1,024.
Parameter count (P). How many learned weights live inside. The 12M, 120M, & 480M suffixes name each family member by P.
A Family of Three
| Variant | d_model | Heads | Layers | Context | Params |
|---|---|---|---|---|---|
| ANDREA-12M | 384 | 12 | 6 | 1024 | 12.8M |
| ANDREA-120M | 768 | 12 | 12 | 1024 | ~120M |
| ANDREA-480M | 1536 | 24 | 16 | 1024 | ~480M |
Three knobs scale: d_model (width of every internal vector), n_layer (depth of stacked transformer blocks), n_head (parallel attention projections). Context stays fixed across the family at T = 1,024. Vocabulary does not: ANDREA-12M used 4,353 tokens, ANDREA-120M uses 8,449.
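The Params column can be roughly sanity-checked from the other columns. The sketch below uses the common GPT-2-style approximation & rests on assumptions the table does not confirm: a 4× MLP expansion, learned position embeddings, an output head tied to the token embedding, & no biases or layer-norm weights. `estimate_params` is a hypothetical helper, not part of the ANDREA code.

```python
def estimate_params(d_model, n_layer, vocab_size, context):
    """Rough GPT-2-style parameter count; ignores biases and layer-norm weights."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 8 * d_model * d_model         # two linear layers, 4x hidden width (assumed)
    blocks = n_layer * (attention + mlp)
    token_embedding = vocab_size * d_model
    position_embedding = context * d_model
    return blocks + token_embedding + position_embedding

# ANDREA-12M row: d_model=384, 6 layers, 4,353-token vocabulary, T=1,024.
andrea_12m = estimate_params(d_model=384, n_layer=6, vocab_size=4353, context=1024)
print(f"{andrea_12m / 1e6:.1f}M")  # prints 12.7M
```

The estimate lands close to the listed 12.8M. n_head does not appear in the formula because heads split d_model among themselves rather than adding parameters.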
Reading the Family Table
Why Small Matters
Constraint as Liberation
Large language models with hundreds of billions of parameters require thousands of GPUs, proprietary datasets, & corporate budgets. Few people get to train one. Few people get to repair one.
A small language model on one GPU flips that. Anyone with a 4090 (or a 3060) can reproduce ANDREA from open data. The training recipe doubles as the model card. Open code, open weights, open data; full provenance in 72 hours of compute.
Capacity vs Quality
Smaller models cannot memorize their training corpus. ANDREA-12M, at 12.8M parameters, lacks the capacity to store factual content; it learns turn structure, vocabulary distribution, & response shape. ANDREA-120M, at roughly 10× the capacity, learns factual recall, multi-paragraph coherence, & domain breadth (verified through external grading at 9.5/10 on biology & signal-processing samples).
The takeaway: capacity sets a ceiling. Curriculum decides whether the ceiling gets reached. Activities 14-23 cover curriculum.
Three Transformer Flavors
Encoder, Decoder, Both
The original Transformer (Vaswani et al., 2017) shipped an encoder & a decoder, glued together for translation. Three architectural lineages descend from that paper:
Encoder-only (BERT lineage). Bidirectional attention, no causal mask. Optimized for classification, not generation. A token sees both its past & its future during training.
Encoder-decoder (T5, BART). Encoder reads the input; decoder generates the output, attending to the encoder via cross-attention. Used for translation, summarization.
Decoder-only (GPT, ANDREA). Causal mask: every token sees only its past. Trained to predict the next token. Generation comes free; training & inference share the same forward pass.
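The difference between the bidirectional & causal cases comes down to which positions may attend to which. A minimal sketch, using plain Python lists rather than any particular tensor library; `causal_mask` & `bidirectional_mask` are illustrative names, not functions from the ANDREA code:

```python
def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def bidirectional_mask(T):
    """Encoder-style attention: every position sees every other position."""
    return [[True] * T for _ in range(T)]

for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Each row is one token; the lower-triangular shape is what "every token sees only its past" looks like in practice.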
Why Decoder-Only Wins on One GPU
Three reasons:
1. Single objective. Next-token prediction works on any text. No paired source/target needed.
2. Training & inference symmetry. Same forward pass, no special generation logic.
3. Memory simplicity. No cross-attention; one stack of layers; one flow of activations.
ANDREA inherits the decoder-only choice from microGPT, which inherited from nanoGPT, which inherited from GPT-2. The lineage stays standard; what changes lives in tokenization, training infrastructure, & curriculum.
Why Decoder-Only for ANDREA
What Fits in 24 GB
Bytes Per Parameter
An RTX 4090 ships with 24 GB of VRAM. ANDREA-12M training used 1.4 GB. ANDREA-120M used substantially more. The gap traces to a simple accounting exercise: every parameter shows up multiple times in memory during training.
For each parameter, training holds:
- The weight itself (1× weight)
- Adam first moment (m): same shape as weight (1× weight)
- Adam second moment (v): same shape as weight (1× weight)
- Gradients: same shape as weight (1× weight)
- Activations & temporaries: ~2-4× weight (varies with batch & context)
Total: ~6-8× the weight count, in bytes determined by precision.
Precision Multiplies Everything
| Precision | Bytes/param | Weights only, 120M params | Notes |
|---|---|---|---|
| FP32 | 4 | 480 MB | Baseline; safest, slowest |
| FP16 | 2 | 240 MB | cuBLAS, half memory |
| FP8 E4M3 | 1 | 120 MB | Tensor cores, NaN risk |
Multiply by 6-8× for the full training-time footprint. ANDREA-120M trains comfortably in FP16 (~2 GB at the 8× upper bound: weights, optimizer state, gradients, & activations); FP8 E4M3 halves training time on RTX 4090 tensor cores.
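The accounting above reduces to a few multiplications. The sketch below reproduces the table & the 6-8× range for a rounded 120M parameters. It assumes every copy (weights, Adam moments, gradients, activations) is stored at the same precision, which is a simplification: mixed-precision setups often keep some state in FP32. `training_footprint_mb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8 E4M3": 1}

def training_footprint_mb(n_params, bytes_per_param, activation_multiple):
    """Weights + Adam m + Adam v + gradients (4 copies) plus activations, in MB (10^6 bytes)."""
    copies = 4 + activation_multiple   # activation_multiple is ~2-4 per the list above
    return n_params * bytes_per_param * copies / 1e6

N = 120_000_000  # ANDREA-120M, rounded
for name, nbytes in BYTES_PER_PARAM.items():
    weights_mb = N * nbytes / 1e6
    low = training_footprint_mb(N, nbytes, activation_multiple=2)    # 6x total
    high = training_footprint_mb(N, nbytes, activation_multiple=4)   # 8x total
    print(f"{name}: weights {weights_mb:.0f} MB, training {low:.0f}-{high:.0f} MB")
```

Under these assumptions FP16 lands between 1,440 & 1,920 MB, well inside 24 GB; measured usage will differ with batch size & context length.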
Activity 13 (grow_a_language_model_precision) walks through the floating-point precision tradeoffs in detail.
Sizing ANDREA-120M's Optimizer State
Twenty-Five Activities
Two Halves
This course splits cleanly. The first half covers what microGPT taught the field: a transformer that runs on one GPU. The second half covers ANDREA's actual contribution: a curriculum that learns.
Half 1: A Transformer on One GPU (activities 2-13)
| # | Activity | Beat |
|---|---|---|
| 2 | Harris morpheme tokenizer | distributional segmentation, 256+N+1 vocab |
| 3 | Tokenizer-diet alignment | saturation point, why 12M wasted 63.6% |
| 4 | Embeddings & position | learned token + position lookup |
| 5 | Scaled dot-product attention | Q·Kᵀ/√d, causal mask, softmax |
| 6 | Multi-head attention | head splits, parallel projections |
| 7 | Transformer block | MLP, residuals, layer norm |
| 8 | Cross-entropy & perplexity | log-likelihood, SMMA loss |
| 9 | Backprop in custom CUDA | chain rule across microgpt_cuda.cu |
| 10 | AdamW | decoupled weight decay; why vanilla Adam died |
| 11 | LR warmup + cosine decay | 2000-step ramp; why instant peak destroys 120M |
| 12 | Gradient clipping | global L2 norm; 3 CUDA kernels |
| 13 | FP32 / FP16 / FP8 E4M3 | precision tradeoffs; tensor cores |
Half 2: A Curriculum That Learns (activities 14-24)
| # | Activity | Beat |
|---|---|---|
| 14 | Multi-armed bandits | UCB1, exploration vs exploitation |
| 15 | Phase-based dice control | 7/14/21/28/42 phases, 1d3/1d4 dice |
| 16 | Reward attribution & EMA | per-source loss EMA, 1000× scaling |
| 17 | Source floors & epoch penalty | 1/(1+epochs) prevents memorization |
| 18 | Coverage bonus | doc-level tracking, 1.3× freshness |
| 19 | Curriculum warmup | 7 chat/prose sources first 20K steps |
| 20 | Filtering by shape, not chars | has_system_prompt_shape() |
| 21 | Coherence-gated early stopping | bigram/trigram/word/char auto-halt |
| 22 | Checkpoint, resume, signals | format, SIGTERM/SIGUSR1, loss.json continuity |
| 23 | Sample audit & external grading | reading a run, 9.5/10 territory |
| 24 | From microGPT to ANDREA-120M | v1 collapse, v2 fixes, v2.5 patch, v3 polish |
Plus a companion: geometry_of_andrea views every layer as geometry (embedding space, attention as projection, loss surface, bandit as a walk on a discrete simplex).
Suggested Order
Activities 2-13 build a working transformer. Skip ahead to half 2 if you've trained transformers before; come back when curiosity strikes.
Each activity stands alone where possible. Math beats reference earlier activities by name (see activity 5: scaled dot-product attention). Code references point at microgpt/microgpt_cuda.cu & microgpt/training_proxy.py in ~/git/uncloseai-cli/.