What a Tokenizer Eats Becomes What It Knows
Tokenizer Diet: A Definition
A Harris tokenizer trains on a corpus sample. It runs distributional analysis across that sample, picks the N segments that recur most strongly, & writes them into a vocabulary. After training, those N segments become the fixed alphabet the language model uses for everything: training, inference, every input, every output.
Tokenizer diet = the sample of text the tokenizer trains on.
Training diet = the corpus the language model trains on.
When the two diets differ, the tokenizer learns segments tuned for text the model will never see. Embedding capacity (one slot per vocabulary entry) gets spent on segments that earn no reward during training.
ANDREA-12M's Mistake
ANDREA-12M trained its Harris tokenizer on the raw head of megachat-v8.txt. That head contained code samples & tool-call data. The training curriculum, however, excluded code & tool calls; ANDREA-12M only saw conversational text.
Result: the tokenizer learned segments from Python keywords, JSON braces, & shell flags, while the model trained on dictionary entries & dialogue. Only 36.4% of segments overlapped a curriculum-weighted sample. The remaining 63.6% of segment slots got allocated to segments the model would never encounter at training time.
Why That Matters
Each vocabulary entry consumes embedding parameters: one row of the embedding matrix shaped V × d_model (covered in activity 4). At V = 4353 & d_model = 384, every vocab slot costs 384 floats. Wasting 63.6% of the segment slots wastes the same share of the segment rows on data the model never sees.
State a Diet Rule
How Big Should N Get
A Vocab Science Sweep
ANDREA-120M ran a vocab science experiment: train Harris tokenizers at different N values (segments requested) on the same 1.25B-character firehose corpus, measure how many segments each tokenizer actually finds, & plot the results.
| Requested N | Actual segments found | Status |
|---|---|---|
| 2,048 | 2,048 | Unsaturated (room to grow) |
| 4,096 | 4,096 | Unsaturated |
| 8,192 | 8,192 | Saturation point |
| 16,384 | 13,106 | Corpus exhausted |
What Saturation Means
At small N, the corpus has plenty of recurring patterns; the tokenizer fills every slot it asks for. At large N, the tokenizer runs out of statistically meaningful boundaries. The 1.25B-character corpus contains roughly 13,106 distinct morpheme-shaped segments above the frequency threshold. Asking for 16,384 yields 13,106; the remaining 3,278 slots get padded or left empty.
Saturation: the largest requested N that the corpus still fills completely (requested N = found N). In this sweep that is 8,192, the last setting tested before found N falls short of requested N. Past the corpus ceiling of roughly 13,106 segments, the tokenizer cannot discover more segments without diluting quality (lowering frequency thresholds & accepting noise).
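The sweep table can be checked with a few lines of Python. This is a minimal sketch that only re-reads the four rows above; the names `SWEEP`, `largest_filled_n`, & `unfilled_slots` are hypothetical helpers, not part of any ANDREA tooling.

```python
# Sweep results copied from the table: requested N -> segments actually found.
SWEEP = {2048: 2048, 4096: 4096, 8192: 8192, 16384: 13106}


def largest_filled_n(sweep):
    """Return the largest requested N whose slots were all filled, or None."""
    filled = [n for n, found in sweep.items() if found == n]
    return max(filled) if filled else None


def unfilled_slots(sweep):
    """Map each requested N to the number of slots the corpus could not fill."""
    return {n: n - found for n, found in sweep.items()}


print(largest_filled_n(SWEEP))        # 8192
print(unfilled_slots(SWEEP)[16384])   # 3278
```

The sweep only tests four settings, so this locates the saturation point among tested values; it says nothing about untested N between 8,192 & 16,384.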
Sweet Spot at 8192
ANDREA-120M chose N = 8192. The reasoning:
- Below 8192 (e.g. 4096): the vocabulary undercaptures common morphemes; sequences fragment into more tokens; throughput drops.
- At 8192: every segment slot maps to a real, recurrent pattern in the corpus.
- Above 8192: diminishing returns; at the next setting tested, 13,106 < 16,384 means slots get wasted.
Final ANDREA-120M vocabulary: 256 + 8192 + 1 = 8449 tokens. Average compression: 5.91 UTF-8 bytes per token, meaning each token replaces ~5.9 bytes of raw text. That ratio sets the model's effective context: at 1024 tokens × 5.91 bytes/token, ANDREA-120M reads roughly 6,050 bytes of context per forward pass (about 6,050 characters for mostly-ASCII text).
Above or Below Saturation
Where 63.6% Came From
Counting the Wasted Slots
ANDREA-12M's tokenizer trained on raw megachat-v8.txt (4,096 segments requested, 4,096 found). The team then sampled a curriculum-weighted subset: a corpus weighted by how often each source got pulled by a bandit. They re-ran the Harris analysis on that weighted sample & asked: how many of the original 4,096 segments still appear?
Result: 36.4% overlap. 1,491 of 4,096 segments matched the curriculum weighting. The remaining 2,605 segments came from sources the curriculum excluded.
63.6% of segment slots got allocated to bytes the model never saw.
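The overlap count amounts to a set intersection. The sketch below assumes each tokenizer's segments are available as a collection of strings; `diet_overlap` & the toy segment sets are made up for illustration, & only the counts 1,491 & 4,096 come from the text.

```python
def diet_overlap(tokenizer_segments, curriculum_segments):
    """Fraction of the tokenizer's segments that also appear in the curriculum sample."""
    tokenizer_segments = set(tokenizer_segments)
    if not tokenizer_segments:
        raise ValueError("tokenizer vocabulary is empty")
    shared = tokenizer_segments & set(curriculum_segments)
    return len(shared) / len(tokenizer_segments)


# Toy, invented segments: a code-heavy raw vocabulary vs. a chat-only sample.
raw_vocab = {"def ", "import", '{"', "--", "hello", "thank"}
chat_vocab = {"hello", "thank", "please", "sorry"}
toy_overlap = diet_overlap(raw_vocab, chat_vocab)  # 2 of 6 segments shared

# Counts reported in the text for ANDREA-12M.
matched, total = 1491, 4096
overlap_pct = round(100 * matched / total, 1)
wasted_pct = round(100 * (total - matched) / total, 1)
print(overlap_pct, wasted_pct)  # 36.4 63.6
```

The ratio is taken over the tokenizer's vocabulary, not the curriculum's, because the question is how many allocated slots ever get used.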
Embedding Cost
Every vocabulary entry occupies one row of an embedding matrix shaped (V, d_model). For ANDREA-12M:
- V = 4353 (256 + 4096 + 1)
- d_model = 384
- Embedding params = V × d_model = 4353 × 384 = 1,671,552 parameters
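The 2,605 unmatched segments occupy 2,605 × 384 = 1,000,320 of those parameters: 63.6% of the 4,096 segment rows, or about 59.8% of the whole matrix once the 256 byte rows & the one extra row are counted. Those parameters were allocated but earned 0 reward signal from conversational training. ANDREA-12M survives because the 256 base bytes always cover any character, but capacity per parameter dropped sharply.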
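The embedding arithmetic can be reproduced directly. This sketch assumes the vocabulary is always 256 byte tokens + N segments + 1 extra token, as in the sums quoted in the text (the text does not say what the extra token is); the function names are hypothetical. The last lines count only the rows belonging to the 2,605 unmatched segments, leaving out the byte rows & the extra row.

```python
BYTE_TOKENS = 256   # one token per possible byte value
EXTRA_TOKENS = 1    # the "+ 1" in the text; its role is not stated there


def vocab_size(n_segments):
    """Total vocabulary size V for a tokenizer with n_segments learned segments."""
    return BYTE_TOKENS + n_segments + EXTRA_TOKENS


def embedding_params(n_segments, d_model):
    """Parameter count of a (V, d_model) embedding matrix."""
    return vocab_size(n_segments) * d_model


andrea_12m = embedding_params(4096, 384)
andrea_120m = embedding_params(8192, 768)
print(andrea_12m, andrea_120m)  # 1671552 6488832

# Rows for segments that never matched the curriculum-weighted sample.
unmatched_params = 2605 * 384
print(unmatched_params)                          # 1000320
print(round(100 * unmatched_params / andrea_12m, 1))  # 59.8
```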
How ANDREA-120M Fixed It
ANDREA-120M's tokenizer trained on the full firehose (1.25B chars, 21 sources) at saturation N = 8192. The training corpus = the same firehose. Diet alignment: 100%. Resulting overlap on a chat-weighted sample: 36.5%. (Note: 36.5% is overlap, not coverage; chat alone is a subset of the full firehose, so this number behaves differently from 12M's 36.4%.)
Effective compression: 5.91 UTF-8 bytes per token. ANDREA-120M's embedding matrix: 8449 × 768 = 6,488,832 parameters. Every parameter earns a reward signal because every segment maps to text the model actually trains on.
Coverage Versus Overlap
Why 5.91 Bytes Per Token Matters
A Compression Ratio
Average UTF-8 bytes per token measures how much raw text each token covers on average. ANDREA-120M averages 5.91. A model with shorter pieces (3 bytes/token) reads less context per forward pass; a model with longer pieces (8 bytes/token) reads more but trains slower (each piece needs more samples to learn well).
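For a concrete feel of the metric, here is a minimal sketch that averages UTF-8 byte lengths over a list of token strings. The function name & the sample segmentation are invented for illustration; the real 5.91 figure would come from running the actual tokenizer over a large sample.

```python
def bytes_per_token(tokens):
    """Average UTF-8 bytes per token for a list of string pieces."""
    if not tokens:
        raise ValueError("need at least one token")
    total_bytes = sum(len(t.encode("utf-8")) for t in tokens)
    return total_bytes / len(tokens)


# Hypothetical segmentation of "tokenizer diets": 5 + 4 + 5 + 1 bytes over 4 tokens.
pieces = ["token", "izer", " diet", "s"]
print(bytes_per_token(pieces))  # 3.75

# Non-ASCII characters take more than one byte: "é" is 2 bytes in UTF-8.
print(bytes_per_token(["café"]))  # 5.0
```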
Effective Context
| Quantity | Value |
|---|---|
| Token context window | 1,024 tokens |
| Average bytes per token | 5.91 |
| Effective context (UTF-8 bytes) | 1024 × 5.91 ≈ 6,050 |
Roughly 6,000 UTF-8 bytes fit in one ANDREA-120M forward pass, which is about 6,000 characters of mostly-ASCII English. A page of dense English prose runs ~3,000-4,000 characters; ANDREA reads about one & a half to two pages per pass.
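The same multiplication, written out so other compression ratios can be compared. This is a sketch with a hypothetical helper name; the 3 & 8 bytes/token cases are the illustrative values used earlier in this section, not measurements.

```python
def effective_context_bytes(context_tokens, avg_bytes_per_token):
    """Approximate raw bytes covered by one full context window."""
    return context_tokens * avg_bytes_per_token


ctx = effective_context_bytes(1024, 5.91)
print(round(ctx))  # 6052

# Page estimate, assuming 3,000-4,000 mostly-ASCII characters per page.
pages_low, pages_high = ctx / 4000, ctx / 3000
print(round(pages_low, 1), round(pages_high, 1))  # 1.5 2.0

# Same window at the shorter & longer piece sizes mentioned earlier.
print(effective_context_bytes(1024, 3), effective_context_bytes(1024, 8))  # 3072 8192
```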
Diet Tightens Compression
A well-aligned tokenizer compresses better. When the tokenizer learns segments that recur in the training corpus, more text fits per token. ANDREA-12M's poorly aligned tokenizer compressed worse on chat (more text fell back to byte-level fragments because chat segments were sparser in the vocab). ANDREA-120M's diet-aligned tokenizer keeps chat-shaped pieces on the fast path & rare scripts on the byte fallback.
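The effect of byte fallback on compression can be illustrated with a toy encoder. This sketch assumes greedy longest-match encoding over UTF-8 bytes, which is a stand-in chosen for simplicity & not necessarily how the Harris tokenizer encodes text; both vocabularies are invented.

```python
def encode(text, segments):
    """Greedy longest-match over UTF-8 bytes, with single-byte fallback."""
    data = text.encode("utf-8")
    vocab = {s.encode("utf-8") for s in segments}
    max_len = max((len(v) for v in vocab), default=1)
    tokens, i = [], 0
    while i < len(data):
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_len, len(data) - i), 1, -1):
            if data[i:i + length] in vocab:
                tokens.append(data[i:i + length])
                i += length
                break
        else:
            # No learned segment matched: emit one raw byte.
            tokens.append(data[i:i + 1])
            i += 1
    return tokens


chat = "thank you so much"
aligned = ["thank", " you", " so", " much"]   # hypothetical chat-shaped segments
misaligned = ["import", "def ", "--", '{"']   # hypothetical code-shaped segments

for name, vocab in [("aligned", aligned), ("misaligned", misaligned)]:
    toks = encode(chat, vocab)
    print(name, len(toks), len(chat.encode("utf-8")) / len(toks))
# aligned 4 4.25
# misaligned 17 1.0
```

With the code-shaped vocabulary, every byte of the chat text falls through to the fallback, so the same sentence costs 17 tokens instead of 4.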
Activity 4 Continues
Activity 4 (grow_a_language_model_embeddings) covers what happens to those 8449 vocab entries: they become rows of an embedding matrix shaped V × d_model, then have learned position embeddings added before flowing into the first transformer block.