
What a Tokenizer Eats Becomes What It Knows

Tokenizer Diet: A Definition

A Harris tokenizer trains on a corpus sample. It runs distributional analysis across that sample, picks the N segments that recur most strongly, & writes them into a vocabulary. After training, those N segments become the fixed alphabet the language model uses for everything: training, inference, every input, every output.


Tokenizer diet = the sample of text the tokenizer trains on.


Training diet = the corpus the language model trains on.


When the two diets differ, the tokenizer learns segments tuned for text the model will never see. Embedding capacity (one slot per vocabulary entry) gets spent on segments that earn no reward during training.


Tokenizer diet & saturation


ANDREA-12M's Mistake

ANDREA-12M trained its Harris tokenizer on the raw head of megachat-v8.txt. That head contained code samples & tool-call data. The training curriculum, however, excluded code & tool calls; ANDREA-12M only saw conversational text.


Result: the tokenizer learned segments from Python keywords, JSON braces, & shell flags, while the model trained on dictionary entries & dialogue. Only 36.4% of segments overlapped a curriculum-weighted sample. The remaining 63.6% of learned-segment slots got allocated to segments the model would never encounter at training time.


Why That Matters

Each vocabulary entry consumes embedding parameters: one row of an embedding matrix shaped V × d_model (covered in activity 4). At V = 4353 & d_model = 384, every vocab slot costs 384 floats. Wasting 63.6% of the learned segments wastes the same share of the learned-segment rows on data the model never sees.

State the Diet Rule

Explain the tokenizer-diet rule in one sentence. Then describe a worst case: a researcher trains a Harris tokenizer on Wikipedia (formal prose, citations) but trains the model on Twitter (slang, emoji, hashtags). What goes wrong?

How Big Should N Get

A Vocab Science Sweep

ANDREA-120M ran a vocab science experiment: train Harris tokenizers at different N values (segments requested) on the same 1.25B-character firehose corpus, measure how many segments each tokenizer actually finds, & plot the results.


| Requested N | Actual segments found | Status |
| --- | --- | --- |
| 2,048 | 2,048 | Unsaturated (room to grow) |
| 4,096 | 4,096 | Unsaturated |
| 8,192 | 8,192 | Saturation point |
| 16,384 | 13,106 | Corpus exhausted |

What Saturation Means

At small N, the corpus has plenty of recurring patterns; the tokenizer fills every slot it asks for. At large N, the tokenizer runs out of statistically meaningful boundaries. The 1.25B-character corpus contains roughly 13,106 distinct morpheme-shaped segments above the frequency threshold. Asking for 16,384 yields 13,106; the remaining 3,278 slots get padded or left empty.


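Saturation: the point past which found N stops keeping up with requested N. In this sweep, N = 8192 is the last tested value where requested N = found N. Beyond saturation, the tokenizer cannot discover more segments without diluting quality (lowering frequency thresholds & accepting noise).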
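
The sweep table can be read programmatically. The sketch below is only an illustration: the dictionary copies the four rows of the table, and the helper names `last_filled_n` and `empty_slots` are hypothetical, not part of any ANDREA tooling.

```python
# Sweep results copied from the table: requested N -> segments actually found.
sweep = {2048: 2048, 4096: 4096, 8192: 8192, 16384: 13106}


def last_filled_n(results):
    """Return the largest tested N whose slots were all filled (found == requested)."""
    filled = [n for n, found in results.items() if found == n]
    return max(filled) if filled else None


def empty_slots(results):
    """Map each requested N to the number of slots the tokenizer could not fill."""
    return {n: n - found for n, found in results.items()}


saturation_n = last_filled_n(sweep)
unfilled = empty_slots(sweep)

print(saturation_n)     # 8192
print(unfilled[16384])  # 3278
```

With only four tested values, this identifies the last tested N that filled every slot; a finer sweep between 8,192 & 16,384 would be needed to locate the exact crossover.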


Sweet Spot at 8192

ANDREA-120M chose N = 8192. The reasoning:


- Below 8192 (e.g. 4096): vocabulary undercaptures common morphemes; sequences fragment into more tokens; throughput drops.

- At 8192: every segment slot maps to a real, recurrent pattern in the corpus.

- Above 8192: diminishing returns; 13,106 < 16,384 means slots get wasted.


Final ANDREA-120M vocabulary: 256 + 8192 + 1 = 8449 tokens. Average compression: 5.91 UTF-8 bytes per token, meaning each token replaces ~5.9 bytes of raw text. That ratio sets the model's effective context: at 1024 tokens × 5.91 bytes/token, ANDREA-120M reads roughly 6,050 bytes of context per forward pass (about 6,050 characters for mostly-ASCII text).
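
As a quick check of that arithmetic, here is a minimal sketch. The function names are hypothetical; the constants come from the formula 256 + N + 1, and the purpose of the "+ 1" term is not stated in this lesson, so the code only labels it as an extra token.

```python
BASE_BYTES = 256   # one token per raw byte value
EXTRA_TOKENS = 1   # the "+ 1" term in the formula; its role is not specified here


def vocab_size(n_segments):
    """Total vocabulary size: base bytes + learned segments + extra token."""
    return BASE_BYTES + n_segments + EXTRA_TOKENS


def effective_context_bytes(context_tokens, bytes_per_token):
    """Approximate bytes of raw text covered by one full context window."""
    return context_tokens * bytes_per_token


v = vocab_size(8192)
ctx = effective_context_bytes(1024, 5.91)

print(v)           # 8449
print(round(ctx))  # 6052
```

The product is about 6,052 bytes, which the lesson rounds to roughly 6,050.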

Above or Below Saturation

Suppose a researcher considers two N values for a future ANDREA model: N = 6144 (below saturation) versus N = 12288 (above saturation, where actual found segments = 13106 still applies because the corpus is fixed). For each: (a) compute the final vocab size (256 + N + 1), & (b) state in one phrase whether the setting wastes vocab capacity, captures all available signal, or undercaptures. Show your work.

Where 63.6% Came From

Counting the Wasted Slots

ANDREA-12M's tokenizer trained on raw megachat-v8.txt (4096 segments requested, 4096 found). The team sampled a curriculum-weighted subset: a corpus weighted by how often each source got pulled by a bandit. They re-ran the Harris analysis on that weighted sample & asked: how many of the original 4096 segments still appear?


Result: 36.4% overlap. 1,491 of 4,096 segments also appeared in the curriculum-weighted sample. The remaining 2,605 segments came from sources the curriculum excluded.


63.6% of learned-segment slots got allocated to text the model never saw.
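
The overlap figure can be sketched as a set intersection. This is an assumption about how the comparison works (exact segment matches between two Harris runs); the function name `overlap_fraction` & the toy segments are hypothetical, & only the counts 1,491 & 4,096 come from the lesson.

```python
def overlap_fraction(tokenizer_segments, curriculum_segments):
    """Fraction of tokenizer segments that also appear in the curriculum sample."""
    tokenizer_segments = set(tokenizer_segments)
    if not tokenizer_segments:
        return 0.0
    matched = tokenizer_segments & set(curriculum_segments)
    return len(matched) / len(tokenizer_segments)


# Toy example: half the tokenizer's segments are code-shaped & never show up in chat.
tok_vocab = {"hello", "thanks", "you", "ing", "def ", "import ", '{"', "--flag"}
curriculum = {"hello", "thanks", "you", "ing", "please"}
toy = overlap_fraction(tok_vocab, curriculum)
print(toy)  # 0.5

# The counts reported in this section.
reported = 1491 / 4096
print(f"{reported:.1%}")      # 36.4%
print(f"{1 - reported:.1%}")  # 63.6%
```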


Embedding Cost

Every vocabulary entry occupies one row of the embedding matrix shaped (V, d_model). For ANDREA-12M:


- V = 4353 (256 + 4096 + 1)

- d_model = 384

- Embedding params = V × d_model = 4353 × 384 = 1,671,552 parameters


The 2,605 unmatched segments occupy 2,605 × 384 = 1,000,320 of those parameters: 63.6% of the learned-segment rows, or about 59.8% of the whole matrix, allocated with 0 reward signal from conversational training. ANDREA-12M still works because the 256 base bytes can always encode any character, but capacity per parameter dropped sharply.
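
A minimal sketch of the embedding arithmetic, using only figures quoted in this section. It counts rows directly: each of the 2,605 unmatched segments owns one row of 384 floats. Applying 63.6% to the whole matrix instead, byte rows included, gives the larger figure of about 1,063,107. The variable names are illustrative.

```python
D_MODEL = 384
N_SEGMENTS = 4096
V = 256 + N_SEGMENTS + 1          # 4353 vocabulary entries
UNMATCHED_SEGMENTS = 4096 - 1491  # 2605 segments absent from the curriculum sample

embedding_params = V * D_MODEL
unmatched_params = UNMATCHED_SEGMENTS * D_MODEL
share_of_matrix = unmatched_params / embedding_params
whole_matrix_estimate = round(0.636 * embedding_params)

print(embedding_params)           # 1671552
print(unmatched_params)           # 1000320
print(f"{share_of_matrix:.1%}")   # 59.8%
print(whole_matrix_estimate)      # 1063107
```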


How ANDREA-120M Fixed It

ANDREA-120M's tokenizer trained on the full firehose (1.25B chars, 21 sources) at saturation N = 8192. The training corpus = the same firehose. Diet alignment: 100%. Resulting overlap on the chat-weighted sample: 36.5%. (Note: 36.5% is overlap, not coverage; chat alone is a subset of the full firehose, so this number behaves differently from 12M's 36.4%.)


Effective compression: 5.91 UTF-8 bytes per token. ANDREA-120M's embedding matrix: 8449 × 768 = 6,488,832 parameters. Every parameter earns a reward signal because every segment maps to text the model actually trains on.

Coverage Versus Overlap

ANDREA-120M's tokenizer corpus matches its training corpus. Yet the 'segment coverage on chat-weighted sample' still landed at 36.5%, similar to 12M's 36.4%. Why is 36.5% not a problem for 120M when 36.4% was a problem for 12M? Use a phrase about which subset is which.

Why 5.91 Bytes Per Token Matters

A Compression Ratio

Average UTF-8 bytes per token measures how much raw text each vocabulary entry compresses. ANDREA-120M averages 5.91. A model with shorter pieces (3 bytes/token) reads less context per forward pass; a model with longer pieces (8 bytes/token) reads more but trains slower (each piece needs more samples to learn well).


Effective Context


| Quantity | Value |
| --- | --- |
| Token context window | 1,024 tokens |
| Average bytes per token | 5.91 |
| Effective character context | 1024 × 5.91 ≈ 6,050 |

Roughly 6,000 bytes of UTF-8 text, or about 6,000 characters of mostly-ASCII English, fit in one ANDREA-120M forward pass. A page of dense English prose runs ~3,000-4,000 characters; ANDREA-120M reads about a page & a half per pass.


Diet Tightens Compression

A well-aligned tokenizer compresses better. When a tokenizer learns segments that recur in the training corpus, more text fits per token. ANDREA-12M's poorly aligned tokenizer compressed worse on chat: chat segments were sparse in its vocabulary, so more text fell through to single-byte fallback tokens. ANDREA-120M's diet-aligned tokenizer keeps chat-shaped pieces on the fast path & rare scripts on the byte fallback.
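
The effect can be illustrated with a toy encoder. The lesson does not say how a Harris tokenizer encodes text, so greedy longest-match with single-byte fallback is an assumption here, & both vocabularies are invented for the example.

```python
def encode(text, segments):
    """Greedy longest-match over UTF-8 bytes, falling back to single bytes."""
    data = text.encode("utf-8")
    max_len = max((len(s) for s in segments), default=1)
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first; stop before length 1 (the fallback case).
        for length in range(min(max_len, len(data) - i), 1, -1):
            piece = data[i:i + length]
            if piece in segments:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(data[i:i + 1])  # byte fallback: always available
            i += 1
    return tokens


def bytes_per_token(text, segments):
    """Average UTF-8 bytes covered by each emitted token."""
    tokens = encode(text, segments)
    return len(text.encode("utf-8")) / len(tokens)


# Hypothetical vocabularies: one chat-shaped, one code-shaped.
chat_vocab = {b"thanks", b" for", b" the", b" help", b" you"}
code_vocab = {b"def ", b"import ", b"return", b"--", b'{"'}

msg = "thanks for the help"
aligned = bytes_per_token(msg, chat_vocab)
misaligned = bytes_per_token(msg, code_vocab)

print(aligned)     # 4.75
print(misaligned)  # 1.0
```

On this chat message, the chat-shaped vocabulary covers 19 bytes in 4 tokens, while the code-shaped vocabulary matches nothing & emits 19 single-byte tokens.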


Activity 4 Continues

Activity 4 (grow_a_language_model_embeddings) covers what happens to those 8449 vocab entries: they become rows of an embedding matrix shaped V × d_model, then have learned position embeddings added before flowing into the first transformer block.

Pick an N

Reflect on the tradeoff: should a future ANDREA model use N = 4096 (faster training, but fewer bytes per token = shorter effective context) or N = 16384 (longer-but-rarer segments, fewer tokens per piece of text, but past saturation so wasted slots)? Pick one & give a one-sentence reason. No wrong answer.