un — Grow a Language Model: One GPU, One Model

What a Language Model Predicts

A Probabilistic Continuation Engine

A language model takes a sequence of tokens & assigns a probability distribution over what token comes next. Feed it the cat sat on the & it outputs probabilities across an entire vocabulary: high mass on mat, floor, couch; low mass on xylophone, Tuesday.

Sample from that distribution, append the chosen token, & feed the longer sequence back in: that loop generates text one token at a time. This is autoregressive generation, so named because each step regresses on the model's own prior output.
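The loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not ANDREA's code: `VOCAB`, `toy_logits`, & `generate` are hypothetical names, & the hand-written scores stand in for a trained network.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "floor", "."]

def toy_logits(context):
    """Stand-in for a trained model: one raw score per vocabulary entry."""
    last = context[-1]
    # Hand-written preference: discourage repeating the last token.
    return [0.0 if tok == last else 1.0 for tok in VOCAB]

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, n_new, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        probs = softmax(toy_logits(tokens))
        next_tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_tok)  # the output becomes part of the next input
    return tokens

print(" ".join(generate(["the", "cat"], n_new=5)))
```

Swapping `toy_logits` for a real forward pass leaves the loop unchanged; only the source of the scores differs.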

Three Numbers Define a Language Model

Vocabulary size (V). How many distinct tokens a model can produce. ANDREA-12M used 4,353 tokens; ANDREA-120M uses 8,449.

Context window (T). How many tokens fit in one forward pass. ANDREA models use T = 1,024.

Parameter count (P). How many learned weights live inside. 12M, 120M, & 480M name a family by P.

A Family of Three

Variant	d_model	Heads	Layers	Context	Params
ANDREA-12M	384	12	6	1024	12.8M
ANDREA-120M	768	12	12	1024	~120M
ANDREA-480M	1536	24	16	1024	~480M

Three knobs scale: d_model (width of every internal vector), n_layer (depth of stacked transformer blocks), n_head (parallel attention projections). Context stays fixed at 1,024 across the family; vocabulary does not (4,353 tokens for ANDREA-12M, 8,449 for ANDREA-120M).
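To see how those knobs feed the parameter count, here is a rough sketch using a common rule of thumb for decoder-only transformers. It assumes a GPT-2-style block (four attention projections plus a 4×-wide MLP) & ignores biases & layer norms; ANDREA's real layout may differ, & `estimate_params` & `head_dim` are hypothetical helpers.

```python
def estimate_params(d_model, n_layer, vocab, context):
    """Rough decoder-only parameter count under an assumed GPT-2-style layout."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 8 * d_model * d_model         # two matrices with a 4x hidden width
    blocks = n_layer * (attention + mlp)
    embeddings = vocab * d_model + context * d_model  # token + position tables
    return blocks + embeddings

def head_dim(d_model, n_head):
    """Width of each attention head when d_model is split evenly."""
    return d_model // n_head

# Vocabulary sizes are the ones quoted earlier for the 12M and 120M variants.
print(estimate_params(384, 6, 4353, 1024))    # ANDREA-12M row
print(estimate_params(768, 12, 8449, 1024))   # ANDREA-120M row
print(head_dim(384, 12), head_dim(768, 12), head_dim(1536, 24))
```

Under these assumptions the estimate lands near 12.7M for the smallest variant, close to the table's 12.8M. It gives about 92M for the ANDREA-120M row, below the table's ~120M, so treat it as an order-of-magnitude guide rather than a reconstruction of the real architecture.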

Reading the Family Table

Compare ANDREA-12M (d_model=384, 6 layers, 12 heads) against ANDREA-120M (d_model=768, 12 layers, 12 heads). Name two architectural axes that scale from 12M to 120M, & one that stays constant. A one-sentence reason for each scaling choice helps.

Why Small Matters

Constraint as Liberation

Large language models with hundreds of billions of parameters require thousands of GPUs, proprietary datasets, & corporate budgets. Few people get to train one. Few people get to repair one.

A small language model on one GPU flips that. Anyone with a 4090 (or a 3060) can reproduce ANDREA from open data. The training recipe doubles as the model card. Open code, open weights, open data; full provenance in 72 hours of compute.

Capacity vs Quality

Smaller models cannot memorize their training corpus. ANDREA-12M, at 12.8M parameters, lacks the capacity to store factual content; it learns turn structure, vocabulary distribution, & response shape. ANDREA-120M, at roughly 10× the capacity, learns factual recall, multi-paragraph coherence, & domain breadth (verified through external grading at 9.5/10 on biology & signal-processing samples).

The takeaway: capacity sets a ceiling. Curriculum decides whether the ceiling gets reached. Activities 14-23 cover curriculum.

Three Transformer Flavors

Encoder, Decoder, Both

The original Transformer (Vaswani et al., 2017) shipped an encoder & a decoder, glued together for translation. Three architectural lineages descend from that paper:

Encoder-only (BERT lineage). Bidirectional attention, no causal mask. Optimized for classification, not generation. A token sees both its past & its future during training.

Encoder-decoder (T5, BART). Encoder reads the input; decoder generates the output, attending to the encoder via cross-attention. Used for translation, summarization.

Decoder-only (GPT, ANDREA). Causal mask: every token sees only its past. Trained to predict the next token. Generation comes free; training & inference share the same forward pass.
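The difference between bidirectional & causal attention comes down to which positions each token may look at. A minimal sketch in plain Python (the function names are illustrative, not taken from ANDREA's code):

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style mask: every position sees every other position."""
    return [[True] * n for _ in range(n)]

def show(mask):
    """Render a mask as text: 'x' = visible, '.' = hidden."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print("causal (decoder-only):")
print(show(causal_mask(4)))
print("bidirectional (encoder-only):")
print(show(bidirectional_mask(4)))
```

Row i of the causal mask shows what token i can see: itself & everything before it, nothing after. In a real attention layer the hidden positions are typically set to negative infinity before the softmax so they receive zero weight.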

Why Decoder-Only Wins on One GPU

Three reasons:

1. Single objective. Next-token prediction works on any text. No paired source/target needed.

2. Training & inference symmetry. Same forward pass, no special generation logic.

3. Memory simplicity. No cross-attention; one stack of layers; one flow of activations.

ANDREA inherits the decoder-only choice from microGPT, which inherited from nanoGPT, which inherited from GPT-2. The lineage stays standard; what changes lives in tokenization, training infrastructure, & curriculum.

Why Decoder-Only for ANDREA

Give one reason from a training-data perspective & one reason from an inference-behavior perspective why ANDREA uses a decoder-only transformer instead of an encoder-decoder like T5.

What Fits in 24 GB

Bytes Per Parameter

An RTX 4090 ships with 24 GB of VRAM. ANDREA-12M training used 1.4 GB. ANDREA-120M used substantially more. The gap traces to a simple accounting exercise: every parameter shows up multiple times in memory during training.

For each parameter, training holds:

- The weight itself (1× weight)

- Adam first moment (m): same shape as weight (1× weight)

- Adam second moment (v): same shape as weight (1× weight)

- Gradients: same shape as weight (1× weight)

- Activations & temporaries: ~2-4× weight (varies with batch & context)

Total: ~6-8× the weight count, in bytes determined by precision.

Precision Multiplies Everything

Precision	Bytes/param	Weights only (120M params)	Notes
FP32	4	480 MB	Baseline; safest, slowest
FP16	2	240 MB	cuBLAS, half memory
FP8 E4M3	1	120 MB	Tensor cores, NaN risk

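Multiply by 6-8× for the full training-time footprint. By that accounting ANDREA-120M trains comfortably in FP16: about 1 GB for weights + optimizer state + gradients (4 × 240 MB), or roughly 1.4-1.9 GB once activations are included. FP8 E4M3 halves training time on RTX 4090 tensor cores.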
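The per-parameter accounting & the precision table combine into one small calculation. The sketch below follows the simplified model used here, where every tensor is stored at the same precision; real mixed-precision setups often keep some FP32 copies, so actual numbers can run higher. `training_footprint_mb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8 E4M3": 1}

def training_footprint_mb(n_params, precision, activation_mult):
    """Weights + Adam m + Adam v + gradients + activations, in MB (10^6 bytes)."""
    copies = 4 + activation_mult  # four weight-shaped tensors plus activations
    return n_params * BYTES_PER_PARAM[precision] * copies / 1e6

# Activation multiplier of 2 to 4 gives the 6-8x range quoted above.
for precision in BYTES_PER_PARAM:
    low = training_footprint_mb(120_000_000, precision, 2)
    high = training_footprint_mb(120_000_000, precision, 4)
    print(f"{precision:9s} {low:6.0f} - {high:6.0f} MB")
```

The activation multiplier is the soft part of the estimate: it grows with batch size & context length, so measure on your own run before trusting it.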

Activity 13 (grow_a_language_model_precision) walks through the floating-point precision tradeoffs in detail.

Sizing ANDREA-120M's Optimizer State

ANDREA-120M holds ~120,000,000 parameters. Each FP32 weight occupies 4 bytes. AdamW stores two extra optimizer-state floats per weight (m, v). Compute: (a) weights only in FP32, in MB; (b) weights + optimizer state in FP32, in MB; (c) weights + optimizer state in FP16, in MB. Show your arithmetic.

Twenty-Five Activities

Two Halves

This course splits cleanly. The first half covers what microGPT taught the field: a transformer that runs on one GPU. The second half covers ANDREA's actual contribution: a curriculum that learns.

Half 1: A Transformer on One GPU (activities 2-13)

#	Activity	Beat
2	Harris morpheme tokenizer	distributional segmentation, 256+N+1 vocab
3	Tokenizer-diet alignment	saturation point, why 12M wasted 63.6%
4	Embeddings & position	learned token + position lookup
5	Scaled dot-product attention	Q·Kᵀ/√d, causal mask, softmax
6	Multi-head attention	head splits, parallel projections
7	Transformer block	MLP, residuals, layer norm
8	Cross-entropy & perplexity	log-likelihood, SMMA loss
9	Backprop in custom CUDA	chain rule across `microgpt_cuda.cu`
10	AdamW	decoupled weight decay; why vanilla Adam died
11	LR warmup + cosine decay	2000-step ramp; why instant peak destroys 120M
12	Gradient clipping	global L2 norm; 3 CUDA kernels
13	FP32 / FP16 / FP8 E4M3	precision tradeoffs; tensor cores

Half 2: A Curriculum That Learns (activities 14-24)

#	Activity	Beat
14	Multi-armed bandits	UCB1, exploration vs exploitation
15	Phase-based dice control	7/14/21/28/42 phases, 1d3/1d4 dice
16	Reward attribution & EMA	per-source loss EMA, 1000× scaling
17	Source floors & epoch penalty	1/(1+epochs) prevents memorization
18	Coverage bonus	doc-level tracking, 1.3× freshness
19	Curriculum warmup	7 chat/prose sources first 20K steps
20	Filtering by shape, not chars	`has_system_prompt_shape()`
21	Coherence-gated early stopping	bigram/trigram/word/char auto-halt
22	Checkpoint, resume, signals	format, SIGTERM/SIGUSR1, loss.json continuity
23	Sample audit & external grading	reading a run, 9.5/10 territory
24	From microGPT to ANDREA-120M	v1 collapse, v2 fixes, v2.5 patch, v3 polish

Plus a companion: geometry_of_andrea views every layer as geometry (embedding space, attention as projection, loss surface, bandit as a walk on a discrete simplex).

Suggested Order

Activities 2-13 build a working transformer. Skip ahead to Half 2 if you've trained transformers before; come back when curiosity strikes.

Each activity stands alone where possible. Math beats reference earlier activities by name (see activity 5: scaled dot-product attention). Code references point at microgpt/microgpt_cuda.cu & microgpt/training_proxy.py in ~/git/uncloseai-cli/.

Where Will You Start?

Looking at the 24 activities + geometry companion, name one activity you want to start with & one reason: prior knowledge gap, professional relevance, or pure curiosity. There's no wrong answer; the path through the course belongs to you.