Words to Numbers
A Translator at a Border
A language model never sees text. It sees integers. A tokenizer sits at a border crossing: human words flow in, integer IDs flow out. Generation reverses the flow: integer IDs come back, and the tokenizer renders them as text.
Three jobs:
1. Segment. Cut a string into pieces (tokens).
2. Map. Assign each piece a unique integer ID from a fixed vocabulary.
3. Reverse. Reconstruct text from IDs at generation time.
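All three jobs fit in a few lines. What follows is a minimal sketch, not ANDREA's tokenizer: the tiny vocabulary, the greedy longest-match rule, and the names `VOCAB`, `segment`, `encode`, `decode` are all assumptions made for illustration.

```python
# Toy vocabulary (hypothetical): a few multi-letter pieces plus single letters.
VOCAB = ["un", "believ", "able", "a", "b", "e", "i", "l", "n", "u", "v"]
PIECE_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}


def segment(text):
    """Job 1: cut a string into pieces, always taking the longest match."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in PIECE_TO_ID:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece covers {text[i]!r}")
    return pieces


def encode(text):
    """Job 2: map each piece to its integer ID."""
    return [PIECE_TO_ID[piece] for piece in segment(text)]


def decode(ids):
    """Job 3: rebuild text from integer IDs."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("unbelievable")
print(segment("unbelievable"), ids, decode(ids))
```

A round trip through `encode` then `decode` returns the original string, which is the property any real tokenizer must also hold.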
Why Pieces, Not Whole Words
A whole-word vocabulary explodes. English alone has hundreds of thousands of forms. Worse, a model trained on whole words cannot handle a typo, a new name, or a foreign phrase: any unseen word maps to a single <UNK> slot.
Subword tokenization fixes that. A vocabulary of common pieces composes into any word, including ones never seen during training. Two strategies dominate: BPE (byte pair encoding) & distributional segmentation. ANDREA picks the second.
Where Does a Word Break
Zellig Harris, 1955
A linguist named Zellig Harris noticed something. Inside a word, the count of distinct letters that can follow a given letter sequence (its successor variety) varies sharply. After un you can find many letters: a, b, c, d, e ... After unbel only a tiny set follows: i (then ievable).
A spike in successor variety marks a likely morpheme boundary. After un (a prefix), variety jumps because many roots can follow. Inside a root like believ, variety stays low because letters predict each other. At a transition between morphemes, variety jumps again.
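Successor variety can be counted directly. A small sketch, assuming a hand-picked word list (`WORDS`) and a hypothetical helper `successor_variety`; a real corpus would be far larger, and counts would differ.

```python
from collections import defaultdict


def successor_variety(words):
    """For each prefix seen in `words`, count the distinct letters that follow it."""
    followers = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            followers[word[:i]].add(word[i])  # prefix word[:i] is followed by word[i]
    return {prefix: len(letters) for prefix, letters in followers.items()}


# Illustrative word list (an assumption, not ANDREA's training corpus).
WORDS = ["unable", "unbelievable", "unbroken", "undo",
         "unfair", "unkind", "unlock", "untie"]

variety = successor_variety(WORDS)

# Walk through one word and print the variety after each prefix.
word = "unbelievable"
for i in range(1, len(word)):
    print(f"{word[:i]:<12} {variety[word[:i]]}")
```

On this list, variety after `un` is 7, then drops to 2 after `unb` and to 1 deeper inside the root: a spike at the prefix boundary, low values inside the root.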
From Variety Spikes to Segments
Run that detector across a training corpus. Every word donates statistical evidence. A tokenizer collects high-frequency segments that recur at morpheme-shaped boundaries: un, re, pre, believ, know, ing, able, ly, tion, ed.
No labels. No linguist hand-tags morphemes. A statistic of letter co-occurrence does the work.
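A spike detector can be turned into a segmenter with one rule: cut wherever variety crosses a threshold. This is a simplified sketch under stated assumptions: the fixed `threshold=3`, the eight-word corpus, and the helper names `build_variety` and `cut_at_spikes` are illustrative, not ANDREA's actual procedure.

```python
from collections import defaultdict


def build_variety(words):
    """Count distinct next letters for every non-empty prefix in the corpus."""
    followers = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return {prefix: len(letters) for prefix, letters in followers.items()}


def cut_at_spikes(word, variety, threshold=3):
    """Cut `word` after every prefix whose successor variety reaches `threshold`."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if variety.get(word[:i], 0) >= threshold:  # unseen prefixes count as 0
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


# Tiny illustrative corpus (an assumption): two prefixes, four shared roots.
CORPUS = ["undo", "untie", "unlock", "unpack",
          "redo", "retie", "relock", "repack"]
variety = build_variety(CORPUS)

print(cut_at_spikes("untie", variety))   # a word from the corpus
print(cut_at_spikes("unlike", variety))  # a word never seen in the corpus
```

No morpheme labels appear anywhere in that code. Only prefix counts decide where `un` and `re` split off, and an unseen word such as `unlike` still gets cut after `un`.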
Harris vs BPE
| Property | Harris | BPE |
|---|---|---|
| Boundary criterion | Successor variety spike | Pair frequency |
| Linguistic shape | Morpheme-aligned (prefix, root, suffix) | Frequent byte pairs |
| Example: unbelievably | un + believ + abl + y | unb + eli + eva + bly |
| Generalization | Strong (root + affix recombines) | Weaker (pairs need not align) |
Both produce subword pieces. Harris pieces tend to align with what a linguist would call a morpheme: the smallest meaningful unit. BPE pieces optimize compression: the most frequent byte pair gets merged, regardless of meaning.
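For contrast, one BPE merge step looks like this. A minimal sketch: it works on characters rather than raw bytes for readability, and the corpus and the helper names `most_frequent_pair` and `merge_pair` are assumptions for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count every adjacent pair of symbols and return the most common one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace each occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Illustrative corpus (an assumption), split into single characters.
corpus = [list("unable"), list("unbelievable"), list("undo")]

pair = most_frequent_pair(corpus)
corpus = [merge_pair(seq, pair) for seq in corpus]
print(pair, corpus[2])
```

Here a winning pair happens to be `u` + `n`, a real prefix. Nothing in a frequency count guarantees that; a full BPE trainer simply repeats this step until a target vocabulary size is reached.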
Three Vocab Slabs
Anatomy of an ANDREA Vocabulary
Harris tokenization produces a vocabulary with three slabs:
Slab 1: 256 base bytes. Every possible byte value (0x00 through 0xFF) gets its own token ID. A safety net: any character a corpus contains, a tokenizer can represent as a sequence of UTF-8 bytes. No <UNK> ever fires.
Slab 2: N morpheme segments. Common pieces discovered through distributional analysis. ANDREA-12M was trained with N = 4096; ANDREA-120M with N = 8192. Each segment compresses a recurring multi-byte string into a single token.
Slab 3: 1 BOS token. A special marker placed at the start of every training sequence. Lets a model learn 'this position has no past'. ANDREA-12M & ANDREA-120M both reserve exactly one ID for BOS.
Vocabulary Sizes
| Model | Base bytes | Morpheme segments (N) | BOS | Vocab size |
|---|---|---|---|---|
| ANDREA-12M | 256 | 4096 | 1 | 4353 |
| ANDREA-120M | 256 | 8192 | 1 | 8449 |
256 + N + 1 = vocabulary size. Simple. Reproducible. Open.
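That formula can be checked against both rows of a table above. A small sketch; the constant and function names are hypothetical.

```python
BASE_BYTES = 256  # slab 1: one ID per byte value
BOS_TOKENS = 1    # slab 3: a single BOS marker


def vocab_size(n_segments):
    """Vocabulary size = 256 base bytes + N morpheme segments + 1 BOS."""
    return BASE_BYTES + n_segments + BOS_TOKENS


sizes = {
    "ANDREA-12M": vocab_size(4096),
    "ANDREA-120M": vocab_size(8192),
}
for model, size in sizes.items():
    print(model, size)
```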
Why a Byte Slab Matters
A byte fallback guarantees coverage. If a model encounters 日本語 & a tokenizer has no Japanese morphemes, individual UTF-8 bytes carry the sequence through. The model then trains on bytes; quality on rare scripts depends on capacity & exposure, but no input ever crashes the tokenizer.
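Byte fallback in miniature. This sketch assumes each base-byte token ID equals its byte value (0 through 255), which a source text does not state outright; function names are hypothetical.

```python
def byte_fallback_encode(text):
    """Encode text as UTF-8 and use each byte value as a token ID (assumed layout)."""
    return list(text.encode("utf-8"))


def byte_fallback_decode(ids):
    """Rebuild text from byte-valued token IDs."""
    return bytes(ids).decode("utf-8")


ids = byte_fallback_encode("日本語")
print(len(ids), [hex(i) for i in ids])
print(byte_fallback_decode(ids))
```

Three characters become nine byte tokens, three per character. Coverage costs sequence length, which is why slab 2 exists.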
Beginning of Sequence
Why a Sequence Needs a Marker
A decoder-only transformer predicts a next token from prior context. Position 0 has no prior context. Without a marker, position 0 sits in a logical hole: a model has nothing to attend to.
BOS fills that hole. A single special token (ID = 256 + N) sits at every sequence start during training. A model learns:
- 'When you see BOS, predict a likely first token of natural text.'
- 'When you see BOS followed by a word, that word is a sequence beginning, not a continuation.'
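A sketch of BOS at work in next-token training. BOS ID follows 256 + N as stated above; token IDs 300, 301, 302 and the helper names are made up for illustration.

```python
def bos_id(n_segments):
    """BOS takes the ID right after the byte slab and the segment slab."""
    return 256 + n_segments


def prepend_bos(token_ids, n_segments):
    """Put BOS in front of a token sequence."""
    return [bos_id(n_segments)] + list(token_ids)


def next_token_pairs(sequence):
    """Pair each position with the token it should predict."""
    return list(zip(sequence[:-1], sequence[1:]))


N = 4096                    # ANDREA-12M segment count
tokens = [300, 301, 302]    # hypothetical IDs for three pieces of text
sequence = prepend_bos(tokens, N)

for seen, target in next_token_pairs(sequence):
    print(f"after {seen} predict {target}")
```

A first pair reads 'after BOS predict a first real token', so position 0 gets a training signal instead of a hole.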
One Token, Many Uses
BOS shows up at:
- Training time: prepended to every chunk of text fed into a model.
- Inference time: prepended to a prompt so a model sees a familiar 'fresh start' signal.
- Boundary marking: in some pipelines, a separator between concatenated documents.
ANDREA reserves exactly one ID for BOS. No EOS, no PAD, no special tokens beyond what a vocabulary needs. Simplicity stays a permacomputer value: every token earns its slot.
Activity 3 Continues
Activity 3 (grow_a_language_model_tokenizer_diet) covers what happens when N is too large or a tokenizer corpus diverges from a training corpus. ANDREA-12M wasted 63.6% of its vocabulary; ANDREA-120M fixed it. Read on.