What a Language Model Predicts

A Probabilistic Continuation Engine

A language model takes a sequence of tokens & assigns a probability distribution over what token comes next. Feed it "the cat sat on the" & it outputs probabilities across the entire vocabulary: high mass on "mat", "floor", "couch"; low mass on "xylophone", "Tuesday".


Sample from that distribution, append the sampled token, & feed the longer sequence back in: that loop generates text one token at a time. This is autoregressive generation, so named because each step regresses on the model's own prior output.
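The loop can be sketched in a few lines of Python. This is a toy illustration, not ANDREA's code: `toy_logits` is a hypothetical stand-in for a model forward pass that returns fixed scores, & the five-word vocabulary is invented for the example.

```python
import math
import random

# Toy vocabulary for illustration; a real model has thousands of tokens.
VOCAB = ["mat", "floor", "couch", "xylophone", "Tuesday"]


def toy_logits(context):
    """Hypothetical stand-in for a forward pass: one raw score per vocabulary token."""
    # Fixed scores chosen by hand; a real model computes these from the context.
    return [3.0, 2.5, 2.0, -2.0, -3.0]


def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(context, steps, rng):
    """Autoregressive loop: predict, sample, append, repeat."""
    tokens = list(context)
    for _ in range(steps):
        probs = softmax(toy_logits(tokens))
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


prompt = ["the", "cat", "sat", "on", "the"]
probs = softmax(toy_logits(prompt))
sample = generate(prompt, steps=3, rng=random.Random(0))
```

Here `probs` puts most of its mass on "mat", "floor", & "couch", & `sample` holds the prompt plus three sampled tokens.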


Three Numbers Define a Language Model


Vocabulary size (V). How many distinct tokens a model can produce. ANDREA-12M used 4,353 tokens; ANDREA-120M uses 8,449.


Context window (T). How many tokens fit in one forward pass. ANDREA models use T = 1,024.


Parameter count (P). How many learned weights live inside. 12M, 120M, & 480M name a family by P.


A Family of Three


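| Variant | d_model | Heads | Layers | Context | Params |
|---|---|---|---|---|---|
| ANDREA-12M | 384 | 12 | 6 | 1024 | 12.8M |
| ANDREA-120M | 768 | 12 | 12 | 1024 | ~120M |
| ANDREA-480M | 1536 | 24 | 16 | 1024 | ~480M |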

Three knobs scale: d_model (width of every internal vector), n_layer (depth of stacked transformer blocks), n_head (parallel attention projections). Context stays fixed at T = 1,024 across the family; vocabulary does not (4,353 tokens for ANDREA-12M, 8,449 for ANDREA-120M).
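One way to hold the family table in code is a small config record. The sketch below is an assumption about layout, not ANDREA's actual config format: the `Variant` class & its field names are hypothetical, & `head_dim` assumes the standard convention that d_model splits evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical config record mirroring one row of the family table."""
    name: str
    d_model: int
    n_head: int
    n_layer: int
    context: int = 1024  # fixed across the family

    @property
    def head_dim(self) -> int:
        # Assumption: standard multi-head attention, d_model split evenly across heads.
        assert self.d_model % self.n_head == 0
        return self.d_model // self.n_head


FAMILY = [
    Variant("ANDREA-12M", d_model=384, n_head=12, n_layer=6),
    Variant("ANDREA-120M", d_model=768, n_head=12, n_layer=12),
    Variant("ANDREA-480M", d_model=1536, n_head=24, n_layer=16),
]

head_dims = {v.name: v.head_dim for v in FAMILY}
```

Under that even-split assumption, each head works on 32 dimensions in ANDREA-12M & 64 in the two larger variants.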

Reading the Family Table

Compare ANDREA-12M (d_model=384, 6 layers, 12 heads) against ANDREA-120M (d_model=768, 12 layers, 12 heads). Name two architectural axes that scale from 12M to 120M, & one that stays constant. A one-sentence reason for each scaling choice helps.

Why Small Matters

Constraint as Liberation

Large language models with hundreds of billions of parameters require thousands of GPUs, proprietary datasets, & corporate budgets. Few people get to train one. Few people get to repair one.


A small language model on one GPU flips that. Anyone with a 4090 (or a 3060) can reproduce ANDREA from open data. The training recipe doubles as the model card. Open code, open weights, open data; full provenance in 72 hours of compute.


Capacity vs Quality

Smaller models cannot memorize their training corpus. ANDREA-12M, at 12.8M parameters, lacks the capacity to store factual content; it learns turn structure, vocabulary distribution, & response shape. ANDREA-120M, at roughly 10× the capacity, learns factual recall, multi-paragraph coherence, & domain breadth (verified through external grading at 9.5/10 on biology & signal-processing samples).


The takeaway: capacity sets a ceiling. Curriculum decides whether the ceiling gets reached. Activities 14-23 cover curriculum.

Three Transformer Flavors

Encoder, Decoder, Both

The original Transformer (Vaswani et al., 2017) shipped an encoder & a decoder, glued together for translation. Three architectural lineages descend from that paper:


Encoder-only (BERT lineage). Bidirectional attention, no causal mask. Optimized for classification, not generation. A token sees both its past & its future during training.


Encoder-decoder (T5, BART). Encoder reads the input; decoder generates the output, attending to the encoder via cross-attention. Used for translation, summarization.


Decoder-only (GPT, ANDREA). Causal mask: every token sees only its past. Trained to predict the next token. Generation comes free; training & inference share the same forward pass.
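The difference between the encoder-only & decoder-only attention patterns comes down to which positions each token may look at. A minimal sketch in plain Python (the function names are illustrative, not from any library):

```python
def causal_mask(t):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(t)] for i in range(t)]


def bidirectional_mask(t):
    """Encoder-style mask: every position may attend to every other position."""
    return [[True] * t for _ in range(t)]


def render(mask):
    """Draw a mask as text: 'x' = visible, '.' = hidden."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(render(causal_mask(4)))
# x...
# xx..
# xxx.
# xxxx
```

Each row is one token; the lower-triangular shape is what "every token sees only its past" looks like. `bidirectional_mask(4)` renders as four rows of `xxxx`.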


Why Decoder-Only Wins on One GPU

Three reasons:


1. Single objective. Next-token prediction works on any text. No paired source/target needed.

2. Training & inference symmetry. Same forward pass, no special generation logic.

3. Memory simplicity. No cross-attention; one stack of layers; one flow of activations.


ANDREA inherits the decoder-only choice from microGPT, which inherited from nanoGPT, which inherited from GPT-2. The lineage stays standard; what changes lives in tokenization, training infrastructure, & curriculum.

Why Decoder-Only for ANDREA

Give one reason from a training-data perspective & one reason from an inference-behavior perspective why ANDREA uses a decoder-only transformer instead of an encoder-decoder like T5.

What Fits in 24 GB

Bytes Per Parameter

An RTX 4090 ships with 24 GB of VRAM. ANDREA-12M training used 1.4 GB. ANDREA-120M used substantially more. The gap traces to a simple accounting exercise: every parameter shows up multiple times in memory during training.


For each parameter, training holds:

- The weight itself (1× weight)

- Adam first moment (m): same shape as weight (1× weight)

- Adam second moment (v): same shape as weight (1× weight)

- Gradients: same shape as weight (1× weight)

- Activations & temporaries: ~2-4× weight (varies with batch & context)


Total: ~6-8× the weight count, in bytes determined by precision.
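The accounting above can be written as a short function. This is a sketch under one simplifying assumption: every copy (weights, Adam moments, gradients, activations) is stored at the same precision. The 10M-parameter model in the example is hypothetical.

```python
def training_multiple(activation_multiple):
    """Copies of the weight count held in memory during training."""
    weights, adam_m, adam_v, grads = 1, 1, 1, 1
    return weights + adam_m + adam_v + grads + activation_multiple


def training_bytes(n_params, bytes_per_param, activation_multiple):
    """Estimated training footprint in bytes, assuming uniform precision."""
    return n_params * bytes_per_param * training_multiple(activation_multiple)


low = training_multiple(2)   # activations at the low end of ~2-4x
high = training_multiple(4)  # activations at the high end

# Hypothetical 10M-parameter model in FP32 (4 bytes per parameter).
example_mb = training_bytes(10_000_000, 4, 2) / 1_000_000
```

With activations at 2× & 4×, the multiple lands on 6 & 8, matching the total above; the hypothetical 10M-parameter FP32 model comes out at 240 MB at the low end.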


Precision Multiplies Everything


| Precision | Bytes/param | Total for 120M weights | Notes |
|---|---|---|---|
| FP32 | 4 | 480 MB | Baseline; safest, slowest |
| FP16 | 2 | 240 MB | cuBLAS, half memory |
| FP8 E4M3 | 1 | 120 MB | Tensor cores, NaN risk |

Multiply by 6-8× for the full training-time footprint. ANDREA-120M trains comfortably in FP16 (240 MB of weights, so roughly 1.4-1.9 GB by the 6-8× rule); FP8 E4M3 halves training time on RTX 4090 tensor cores.
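The table & the 6-8× rule combine into a quick estimate. The sketch below assumes decimal megabytes (1 MB = 1,000,000 bytes), which matches the table's figures, & a round 120,000,000 parameters; the function names are illustrative.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8 E4M3": 1}
N_PARAMS = 120_000_000  # round figure for ANDREA-120M
MB = 1_000_000          # decimal megabytes, as in the table


def weights_mb(precision):
    """Memory for one copy of the weights, in MB."""
    return N_PARAMS * BYTES_PER_PARAM[precision] / MB


def footprint_range_mb(precision):
    """Low & high training-time estimates from the 6-8x rule, in MB."""
    w = weights_mb(precision)
    return (6 * w, 8 * w)


for precision in BYTES_PER_PARAM:
    lo, hi = footprint_range_mb(precision)
    print(f"{precision:9s} weights {weights_mb(precision):5.0f} MB, "
          f"training {lo / 1000:.2f}-{hi / 1000:.2f} GB")
```

The FP16 row comes out at 240 MB of weights & 1.44-1.92 GB for training, well inside a 24 GB card.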


Activity 13 (grow_a_language_model_precision) walks through the floating-point precision tradeoffs in detail.

Sizing ANDREA-120M's Optimizer State

ANDREA-120M holds ~120,000,000 parameters. Each FP32 weight occupies 4 bytes. AdamW stores two extra optimizer-state floats per weight (m, v). Compute: (a) weights only in FP32, in MB; (b) weights + optimizer state in FP32, in MB; (c) weights + optimizer state in FP16, in MB. Show your arithmetic.

Twenty-Five Activities

Two Halves

This course splits cleanly. The first half covers what microGPT taught the field: a transformer that runs on one GPU. The second half covers ANDREA's actual contribution: a curriculum that learns.


Half 1: A Transformer on One GPU (activities 2-13)


| # | Activity | Beat |
|---|---|---|
| 2 | Harris morpheme tokenizer | distributional segmentation, 256+N+1 vocab |
| 3 | Tokenizer-diet alignment | saturation point, why 12M wasted 63.6% |
| 4 | Embeddings & position | learned token + position lookup |
| 5 | Scaled dot-product attention | Q·Kᵀ/√d, causal mask, softmax |
| 6 | Multi-head attention | head splits, parallel projections |
| 7 | Transformer block | MLP, residuals, layer norm |
| 8 | Cross-entropy & perplexity | log-likelihood, SMMA loss |
| 9 | Backprop in custom CUDA | chain rule across microgpt_cuda.cu |
| 10 | AdamW | decoupled weight decay; why vanilla Adam died |
| 11 | LR warmup + cosine decay | 2000-step ramp; why instant peak destroys 120M |
| 12 | Gradient clipping | global L2 norm; 3 CUDA kernels |
| 13 | FP32 / FP16 / FP8 E4M3 | precision tradeoffs; tensor cores |
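As a preview of activity 11, a linear warmup followed by cosine decay can be sketched as below. Only the 2000-step ramp comes from the table; the peak learning rate & total step count in the example are hypothetical placeholders, & this is a generic schedule, not necessarily the exact one ANDREA uses.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear ramp up, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical settings: only warmup_steps=2000 comes from the activity table.
PEAK, WARMUP, TOTAL = 3e-4, 2000, 100_000
first = lr_at(0, PEAK, WARMUP, TOTAL)
at_peak = lr_at(WARMUP - 1, PEAK, WARMUP, TOTAL)
final = lr_at(TOTAL, PEAK, WARMUP, TOTAL)
```

The first step runs at a small fraction of the peak, the ramp tops out at step 2000, & the rate falls smoothly to `min_lr` by the end of training.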

Half 2: A Curriculum That Learns (activities 14-24)


| # | Activity | Beat |
|---|---|---|
| 14 | Multi-armed bandits | UCB1, exploration vs exploitation |
| 15 | Phase-based dice control | 7/14/21/28/42 phases, 1d3/1d4 dice |
| 16 | Reward attribution & EMA | per-source loss EMA, 1000× scaling |
| 17 | Source floors & epoch penalty | 1/(1+epochs) prevents memorization |
| 18 | Coverage bonus | doc-level tracking, 1.3× freshness |
| 19 | Curriculum warmup | 7 chat/prose sources first 20K steps |
| 20 | Filtering by shape, not chars | has_system_prompt_shape() |
| 21 | Coherence-gated early stopping | bigram/trigram/word/char auto-halt |
| 22 | Checkpoint, resume, signals | format, SIGTERM/SIGUSR1, loss.json continuity |
| 23 | Sample audit & external grading | reading a run, 9.5/10 territory |
| 24 | From microGPT to ANDREA-120M | v1 collapse, v2 fixes, v2.5 patch, v3 polish |

Plus a companion: geometry_of_andrea views every layer as geometry (embedding space, attention as projection, loss surface, bandit as a walk on a discrete simplex).
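Two beats from the half 2 table have compact textbook forms: the UCB1 selection rule (activity 14) & the 1/(1+epochs) penalty (activity 17). The sketch below shows each on its own. How ANDREA wires them together, & the idea that each bandit arm is a data source, are assumptions here; activities 14-17 hold the actual design.

```python
import math


def ucb1_pick(counts, mean_rewards):
    """Textbook UCB1: return the index of the arm to pull next."""
    # Pull every arm once before trusting any statistics.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    # Mean reward plus an exploration bonus that shrinks as an arm gets pulled.
    scores = [
        mean + math.sqrt(2.0 * math.log(total) / n)
        for mean, n in zip(mean_rewards, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def epoch_penalty(epochs_seen):
    """1/(1+epochs): weight shrinks each time a source has been fully replayed."""
    return 1.0 / (1.0 + epochs_seen)


untried = ucb1_pick([0, 5], [0.0, 0.9])           # arm 0 has never been pulled
underexplored = ucb1_pick([100, 2], [0.5, 0.4])   # arm 1 gets a large bonus
```

An arm with few pulls can win despite a lower mean reward; that is the exploration half of the tradeoff. A source seen for three full epochs gets `epoch_penalty(3)`, a quarter of its original weight.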


Suggested Order

Activities 2-13 build a working transformer. Skip ahead to half 2 if you've trained transformers before; come back when curiosity strikes.


Each activity stands alone where possible. Math beats reference earlier activities by name (see activity 5: scaled dot-product attention). Code references point at microgpt/microgpt_cuda.cu & microgpt/training_proxy.py in ~/git/uncloseai-cli/.

Where Will You Start?

Looking at the 24 activities + geometry companion, name one activity you want to start with & one reason: prior knowledge gap, professional relevance, or pure curiosity. There's no wrong answer; the path through the course belongs to you.