un — Grow a Language Model: Tokenizer Diet

What a Tokenizer Eats Becomes What It Knows

Tokenizer Diet: A Definition

A Harris tokenizer trains on a corpus sample. It runs distributional analysis across that sample, picks the N segments that recur most strongly, and writes them into a vocabulary. After training, those N segments become the fixed alphabet the language model uses for everything: training, inference, every input, every output.

Tokenizer diet = the sample of text the tokenizer trains on.

Training diet = the corpus the language model trains on.

When the two diets differ, the tokenizer learns segments tuned for text the model will never see. Embedding capacity (one row per vocabulary entry) gets spent on segments that earn no reward during training.

Tokenizer diet & saturation

ANDREA-12M's Mistake

ANDREA-12M trained its Harris tokenizer on the raw head of megachat-v8.txt. That head contained code samples and tool-call data. The training curriculum, however, excluded code and tool calls; ANDREA-12M only saw conversational text.

Result: the tokenizer learned segments from Python keywords, JSON braces, and shell flags, while the model trained on dictionary entries and dialogue. Only 36.4% of the learned segments overlapped a curriculum-weighted sample. The remaining 63.6% of segment slots went to segments the model would never encounter at training time.

Why That Matters

Each vocabulary entry consumes embedding parameters: one row of an embedding matrix shaped V × d_model (covered in activity 4). At V = 4353 and d_model = 384, every vocab slot costs 384 floats. Wasting 63.6% of the learned segments wastes the matching rows of the embedding matrix on data the model never sees.

State the Diet Rule

Explain the tokenizer-diet rule in one sentence. Then describe a worst case: a researcher trains a Harris tokenizer on Wikipedia (formal prose, citations) but trains the model on Twitter (slang, emoji, hashtags). What goes wrong?

How Big Should N Get

A Vocab Science Sweep

ANDREA-120M ran a vocab science experiment: train Harris tokenizers at different N values (segments requested) on the same 1.25B-character firehose corpus, measure how many segments each tokenizer actually finds, and plot the results.

Requested N	Actual segments found	Status
2,048	2,048	Unsaturated (room to grow)
4,096	4,096	Unsaturated
8,192	8,192	Saturation point
16,384	13,106	Corpus exhausted

What Saturation Means

At small N, the corpus has plenty of recurring patterns; the tokenizer fills every slot it asks for. At large N, the tokenizer runs out of statistically meaningful boundaries. The 1.25B-character corpus contains roughly 13,106 distinct morpheme-shaped segments above the frequency threshold. Asking for 16,384 yields 13,106; the remaining 3,278 slots get padded or left empty.

Saturation: the largest requested N at which found N still equals requested N. In this sweep that is 8,192, the last tested value before the found count falls short of the request. Beyond saturation, the tokenizer cannot discover more segments without diluting quality (lowering frequency thresholds and accepting noise).
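The sweep table can be read off programmatically. The sketch below is only an illustration built from the four rows above; the names `SWEEP`, `largest_filled_n`, and `wasted_slots` are hypothetical and not part of any ANDREA tooling.

```python
# Sweep results copied from the table above: requested N -> segments found.
SWEEP = {2048: 2048, 4096: 4096, 8192: 8192, 16384: 13106}


def largest_filled_n(sweep):
    """Return the largest requested N whose slots were all filled."""
    filled = [n for n, found in sweep.items() if found == n]
    return max(filled)


def wasted_slots(requested, found):
    """Slots that were requested but never backed by a real segment."""
    return max(requested - found, 0)


saturation_n = largest_filled_n(SWEEP)
print(saturation_n)                        # 8192
print(wasted_slots(16384, SWEEP[16384]))   # 3278
```

Because the sweep only tested four values, this reports the largest tested N that filled completely, not the exact ceiling of the corpus.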

Sweet Spot at 8192

ANDREA-120M chose N = 8192. The reasoning:

- Below 8192 (e.g. 4096): the vocabulary undercaptures common morphemes; sequences fragment into more tokens; throughput drops.

- At 8192: every segment slot maps to a real, recurrent pattern in the corpus.

- Above 8192: diminishing returns; 13,106 < 16,384 means slots get wasted.

Final ANDREA-120M vocabulary: 256 + 8192 + 1 = 8449 tokens. Average compression: 5.91 UTF-8 bytes per token, meaning each token replaces ~5.9 bytes of raw text. That ratio sets the model's effective context: at 1024 tokens × 5.91 bytes/token, ANDREA-120M reads roughly 6,050 bytes of context per forward pass, which is about 6,050 characters for mostly ASCII text.
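The arithmetic above is small enough to check by hand, but a short sketch makes the two formulas explicit. The function and constant names are hypothetical; the lesson does not say what the "+ 1" token is, so the code only labels it as an extra slot.

```python
BASE_BYTES = 256    # one token per possible byte value
EXTRA_TOKENS = 1    # the "+ 1" in the lesson's formula (purpose not stated here)


def vocab_size(n_segments):
    """Total vocabulary entries for a tokenizer with n_segments learned segments."""
    return BASE_BYTES + n_segments + EXTRA_TOKENS


def effective_context_bytes(context_tokens, bytes_per_token):
    """Average number of raw UTF-8 bytes covered by one full context window."""
    return context_tokens * bytes_per_token


print(vocab_size(8192))                             # 8449
print(round(effective_context_bytes(1024, 5.91)))   # 6052
```

The lesson rounds the second figure to roughly 6,050.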

Above or Below Saturation

Suppose a researcher considers two N values for a future ANDREA model: N = 6144 (below saturation) versus N = 12288 (above saturation, where actual found segments = 13106 still applies because the corpus is fixed). For each: (a) compute the final vocab size (256 + N + 1), and (b) state in one phrase whether the setting wastes vocab capacity, captures all available signal, or undercaptures. Show your work.

Where 63.6% Came From

Counting the Wasted Slots

ANDREA-12M's tokenizer trained on raw megachat-v8.txt (4096 segments requested, 4096 found). The team sampled a curriculum-weighted subset: a corpus weighted by how often each source got pulled by a bandit. They re-ran the Harris analysis on that weighted sample and asked: how many of the original 4096 segments still appear?

Result: 36.4% overlap. 1,491 of 4,096 segments matched the curriculum weighting. The remaining 2,605 segments came from sources the curriculum excluded.

63.6% of the learned segment slots went to text the model never saw.
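One way to picture the overlap measurement is as a set intersection between the tokenizer's segments and the segments found in the curriculum-weighted sample. The sketch below assumes segments can be compared as plain strings; `overlap_fraction` and the toy segment lists are hypothetical and do not come from the ANDREA codebase.

```python
def overlap_fraction(tokenizer_segments, curriculum_segments):
    """Fraction of tokenizer segments that also appear in the curriculum sample."""
    tokenizer_set = set(tokenizer_segments)
    if not tokenizer_set:
        return 0.0
    shared = tokenizer_set & set(curriculum_segments)
    return len(shared) / len(tokenizer_set)


# Toy example: half of the tokenizer's segments are code-shaped (hypothetical data).
tokenizer_segments = ["ing", "tion", "hello", "def ", '{"', "--v"]
curriculum_segments = ["ing", "tion", "hello", "you"]
print(overlap_fraction(tokenizer_segments, curriculum_segments))  # 0.5

# The lesson's own counts: 1,491 matching segments out of 4,096.
print(round(1491 / 4096 * 100, 1))  # 36.4
print(4096 - 1491)                  # 2605
```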

Embedding Cost

Every vocabulary entry occupies one row of an embedding matrix shaped (V, d_model). For ANDREA-12M:

- V = 4353 (256 + 4096 + 1)

- d_model = 384

- Embedding params = V × d_model = 4353 × 384 = 1,671,552 parameters

The 2,605 unmatched segment rows hold 2,605 × 384 = 1,000,320 parameters, about 59.8% of the matrix (63.6% of the 4,096 learned-segment rows), and none of them received a reward signal during conversational training. ANDREA-12M survives because the 256 base bytes always cover any character, but capacity per parameter dropped sharply.
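The embedding cost can be recomputed from the counts given in this section. This is a sketch with hypothetical names. It counts only the 2,605 unmatched segment rows as unused, which gives a slightly smaller figure than applying 63.6% to the whole matrix, because the 256 base-byte rows and the extra token stay in use.

```python
def embedding_params(vocab_size, d_model):
    """Parameter count of an embedding matrix shaped (vocab_size, d_model)."""
    return vocab_size * d_model


V = 256 + 4096 + 1       # ANDREA-12M vocabulary size
D_MODEL = 384
SEGMENTS = 4096
MATCHED = 1491           # segments that overlapped the curriculum-weighted sample

total_params = embedding_params(V, D_MODEL)
unused_rows = SEGMENTS - MATCHED
unused_params = embedding_params(unused_rows, D_MODEL)

print(total_params)                              # 1671552
print(unused_rows)                               # 2605
print(unused_params)                             # 1000320
print(round(unused_params / total_params, 3))    # 0.598
```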

How ANDREA-120M Fixed It

ANDREA-120M's tokenizer trained on the full firehose (1.25B chars, 21 sources) at saturation N = 8192. The training corpus = the same firehose. Diet alignment: 100%. Resulting overlap on the chat-weighted sample: 36.5%. (Note: 36.5% is overlap, not coverage; chat alone is a subset of the full firehose, so this number behaves differently from 12M's 36.4%.)

Effective compression: 5.91 UTF-8 bytes per token. ANDREA-120M's embedding matrix: 8449 × 768 = 6,488,832 parameters. Every parameter earns a reward signal because every segment maps to text the model actually trains on.

Coverage Versus Overlap

ANDREA-120M's tokenizer corpus matches its training corpus. Yet the 'segment coverage on chat-weighted sample' still landed at 36.5%, similar to 12M's 36.4%. Why is 36.5% not a problem for 120M when 36.4% was a problem for 12M? Use a phrase about which subset is which.

Why 5.91 Bytes Per Token Matters

A Compression Ratio

Average UTF-8 bytes per token measures how much raw text each vocabulary entry compresses. ANDREA-120M averages 5.91. A model with shorter pieces (3 bytes/token) reads less context per forward pass; a model with longer pieces (8 bytes/token) reads more but trains slower (each piece needs more samples to learn well).
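Bytes per token is simply total encoded bytes divided by token count. The sketch below assumes a tokenization is available as a list of `bytes` objects, one per token; `bytes_per_token` and the two example tokenizations are hypothetical.

```python
def bytes_per_token(tokens):
    """Average UTF-8 bytes per token for a list of bytes objects."""
    if not tokens:
        raise ValueError("need at least one token")
    return sum(len(token) for token in tokens) / len(tokens)


# Two hypothetical tokenizations of the same 11-byte text "hello there".
coarse = [b"hello", b" there"]
fine = [b"he", b"llo", b" th", b"ere"]

print(bytes_per_token(coarse))  # 5.5
print(bytes_per_token(fine))    # 2.75

# Bytes and characters differ for non-ASCII text: "café" is 4 characters, 5 bytes.
print(len("café"), len("café".encode("utf-8")))  # 4 5
```

The last line is why the ratio is stated in bytes: for text with many non-ASCII characters, the character count per token is lower than the byte count.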

Effective Context

Quantity	Value
Token context window	1,024 tokens
Average bytes per token	5.91
Effective character context	1024 × 5.91 ≈ 6,050

Roughly 6,000 UTF-8 bytes fit in one ANDREA-120M forward pass, which is about 6,000 characters of mostly ASCII English. A page of dense English prose runs ~3,000-4,000 characters; ANDREA reads about a page and a half per pass.

Diet Tightens Compression

A well-aligned tokenizer compresses better. When the tokenizer learns segments that recur in the training corpus, more text fits per token. ANDREA-12M's poorly aligned tokenizer compressed worse on chat: more text fell through to byte-fallback fragments because chat segments were sparser in the vocab. ANDREA-120M's diet-aligned tokenizer keeps chat-shaped pieces on the fast path and rare scripts on the byte fallback.
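The effect of alignment on compression can be seen with a toy encoder. The sketch below uses greedy longest-match with single-byte fallback, which is an assumption made for illustration and not necessarily how a Harris tokenizer encodes text; `encode` and both vocabularies are hypothetical.

```python
def encode(text, segments):
    """Greedy longest-match over byte segments, falling back to single bytes."""
    data = text.encode("utf-8")
    segment_set = set(segments)
    longest = max((len(s) for s in segment_set), default=1)
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, down to 2 bytes.
        for size in range(min(longest, len(data) - i), 1, -1):
            piece = data[i:i + size]
            if piece in segment_set:
                tokens.append(piece)
                i += size
                break
        else:
            # Byte fallback: no multi-byte segment matched at this position.
            tokens.append(data[i:i + 1])
            i += 1
    return tokens


chat_vocab = [b"hello", b" there", b" how", b" are", b" you"]
code_vocab = [b"def ", b"return", b'{"', b"--"]
text = "hello there how are you"

aligned = encode(text, chat_vocab)
misaligned = encode(text, code_vocab)
print(len(aligned), len(misaligned))  # 5 23
```

On this 23-byte chat sentence the chat-shaped vocabulary averages 4.6 bytes per token, while the code-shaped vocabulary falls back to one byte per token.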

Activity 4 Continues

Activity 4 (grow_a_language_model_embeddings) covers what happens to those 8449 vocab entries: they become rows of an embedding matrix shaped V × d_model, then add learned position embeddings before flowing into the first transformer block.

Pick an N

Reflect on the tradeoff: should a future ANDREA model use N = 4096 (smaller embedding matrix, every slot filled, but fewer bytes per token and so a shorter effective context) or N = 16384 (longer-but-rarer segments, fewer tokens per piece of text, but past saturation so some slots are wasted)? Pick one and give a one-sentence reason. No wrong answer.