un — Grow a Language Model: Harris Morpheme Tokenizer

Words to Numbers

A Translator at a Border

A language model never sees text. It sees integers. A tokenizer sits at a border crossing: human words flow in, integer IDs flow out. Generation reverses the flow: integer IDs come back, & the tokenizer renders them as text.

Three jobs:

1. Segment. Cut a string into pieces (tokens).

2. Map. Assign each piece a unique integer ID from a fixed vocabulary.

3. Reverse. Reconstruct text from IDs at generation time.
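
A minimal sketch of those three jobs, assuming a tiny hand-written vocabulary & greedy longest-match segmentation. The pieces, IDs, & function names here are invented for illustration; they are not ANDREA's actual vocabulary or algorithm.

```python
# Toy vocabulary: piece -> ID. Pieces and IDs are hypothetical.
VOCAB = {"un": 0, "believ": 1, "abl": 2, "y": 3}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}

def segment(text):
    """Job 1: cut a string into pieces (greedy longest match, an assumption)."""
    pieces, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for end in range(len(text), pos, -1):
            if text[pos:end] in VOCAB:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no piece covers {text[pos:]!r}")
    return pieces

def encode(text):
    """Job 2: map each piece to its integer ID."""
    return [VOCAB[p] for p in segment(text)]

def decode(ids):
    """Job 3: reconstruct text from IDs."""
    return "".join(ID_TO_PIECE[i] for i in ids)

ids = encode("unbelievably")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # unbelievably
```

This toy raises an error on any text its four pieces cannot cover; a real tokenizer needs a fallback for that case.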

Why Pieces, Not Whole Words

A whole-word vocabulary explodes. English alone has hundreds of thousands of forms. Worse, a model trained on whole words cannot handle a typo, a new name, or a foreign phrase: any unseen word maps to a single <UNK> slot.

Subword tokenization fixes that. A vocabulary of common pieces composes into any word, including ones never seen during training. Two strategies matter here: BPE (byte pair encoding) & distributional segmentation. ANDREA picks the second.

Why Subword

A whole-word tokenizer fails on the rare word `proporian` (a word ANDREA-12M produced at step 43,100). Name two distinct problems that a subword tokenizer (BPE or Harris) avoids that a whole-word tokenizer cannot.

Where Does a Word Break

Zellig Harris, 1955

A linguist named Zellig Harris noticed something. Inside a word, the count of distinct letters that can follow a given letter sequence varies sharply. After un, almost any letter can follow: a, b, c, d, e ... After unbel, only a tiny set follows, such as i (as in unbelievable).

A spike in successor variety marks a likely morpheme boundary. After un (a prefix), variety jumps because many roots can follow. Inside a root like believ, variety stays low because letters predict each other. At a transition between morphemes, variety jumps again.
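
Successor variety can be counted directly. A rough sketch, assuming a small invented word list stands in for a corpus; `successor_variety` & `WORDS` are hypothetical names, not part of ANDREA.

```python
from collections import defaultdict

def successor_variety(words):
    """Map each word prefix to the set of distinct letters that follow it."""
    followers = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return followers

# Invented toy corpus; a real run would use far more words.
WORDS = [
    "unbelievable", "unbelievably", "unbeliever", "unbelieving",
    "unable", "uncut", "undo", "unfit", "unkind", "unlock", "untie",
]

followers = successor_variety(WORDS)
word = "unbelievable"
for i in range(1, len(word)):
    prefix = word[:i]
    print(f"{prefix:<12} {len(followers[prefix])}")
```

On this toy list, variety is 8 after `un`, drops to 1 after `unbel`, & rises to 3 after `unbeliev`: a spike at each morpheme boundary, a trough inside the root.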

From Variety Spikes to Segments

Run that detector across a training corpus. Every word donates statistical evidence. A tokenizer collects high-frequency segments that recur at morpheme-shaped boundaries: un, re, pre, believ, know, ing, able, ly, tion, ed.

No labels. No linguist hand-tags morphemes. Letter co-occurrence statistics do the work.
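
One way to picture that collection step: cut each word wherever successor variety reaches a cutoff, then count the resulting segments. This is a simplified sketch under stated assumptions: a forward-only variety count, a fixed cutoff of 3, & an invented word list. ANDREA's actual procedure may differ.

```python
from collections import Counter, defaultdict

# Invented toy corpus and cutoff; both are assumptions for illustration.
WORDS = [
    "unbelievable", "unbelievably", "unbeliever", "unbelieving",
    "unable", "uncut", "undo", "unfit", "unkind", "unlock", "untie",
]
THRESHOLD = 3

def build_followers(words):
    """Map each prefix to the set of distinct letters seen after it."""
    followers = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return followers

def cut(word, followers, threshold=THRESHOLD):
    """Split a word after every prefix whose successor variety meets the cutoff."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if len(followers[word[:i]]) >= threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

followers = build_followers(WORDS)
counts = Counter(piece for w in WORDS for piece in cut(w, followers))

print(cut("unbelievable", followers))  # ['un', 'believ', 'able']
print(counts.most_common(3))           # [('un', 11), ('believ', 4), ('able', 2)]
```

No labels enter anywhere: only the word list & a count of which letters follow which prefixes.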

Harris vs BPE

Property	Harris	BPE
Boundary criterion	Successor variety spike	Pair frequency
Linguistic shape	Morpheme-aligned (prefix, root, suffix)	Frequent byte pairs
Example: `unbelievably`	`un` + `believ` + `abl` + `y`	`unb` + `eli` + `eva` + `bly`
Generalization	Strong (root + affix recombines)	Weaker (pairs need not align)

Both produce subword pieces. Harris pieces tend to align with what a linguist would call a morpheme: the smallest meaningful unit. BPE pieces optimize compression: the most frequent adjacent pair gets merged, regardless of meaning.
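
For contrast, one BPE merge step can be sketched as follows. This version works on characters rather than raw bytes for readability, & uses an invented four-word corpus; the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count every adjacent pair across all sequences and return the top one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge(seq, pair):
    """Replace each occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Invented toy corpus, split into single characters.
corpus = [list(w) for w in ["undo", "untie", "unfit", "fit"]]

pair = most_frequent_pair(corpus)
print(pair)                        # ('u', 'n')
print(merge(list("untie"), pair))  # ['un', 't', 'i', 'e']
```

Here the winning pair happens to be a prefix. Nothing in the criterion requires that: frequency alone decides, so a merge can just as easily straddle a morpheme boundary.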

Segment a Word

Apply Harris-style reasoning to the word `replayed`. Propose three morpheme segments & justify each in one phrase (what role does it play: prefix, root, or suffix).

Three Vocab Slabs

Anatomy of an ANDREA Vocabulary

Harris tokenization produces a vocabulary with three slabs:

Slab 1: 256 base bytes. Every possible byte value (0x00 through 0xFF) gets its own token ID. A safety net: any character a corpus contains, the tokenizer can represent as a sequence of UTF-8 bytes. No <UNK> ever fires.

Slab 2: N morpheme segments. Common pieces discovered through distributional analysis. ANDREA-12M trained with N = 4096; ANDREA-120M trained with N = 8192. Each segment compresses a recurring multi-byte string into a single token.

Slab 3: 1 BOS token. A special marker placed at the start of every training sequence. Lets a model learn 'this position has no past'. ANDREA-12M & ANDREA-120M both reserve exactly one ID for BOS.

Vocabulary Sizes

Model	Base bytes	Morpheme segments (N)	BOS	Vocab size
ANDREA-12M	256	4096	1	4353
ANDREA-120M	256	8192	1	8449

256 + N + 1 = vocabulary size. Simple. Reproducible. Open.
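
That formula, written out as a small check against both rows of the table. The constant & function names are illustrative only.

```python
BASE_BYTES = 256  # slab 1: one ID per byte value
BOS_TOKENS = 1    # slab 3: a single BOS marker

def vocab_size(n_segments):
    """Vocabulary size = base bytes + morpheme segments + BOS."""
    return BASE_BYTES + n_segments + BOS_TOKENS

for name, n in [("ANDREA-12M", 4096), ("ANDREA-120M", 8192)]:
    print(name, vocab_size(n))
# ANDREA-12M 4353
# ANDREA-120M 8449
```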

Why a Byte Slab Matters

A byte fallback guarantees coverage. If a model encounters 日本語 & the tokenizer has no Japanese morphemes, the individual UTF-8 bytes carry the text through. The model then trains on those bytes; quality on rare scripts depends on capacity & exposure, but no input ever crashes the tokenizer.
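
A quick illustration of that fallback, assuming each byte's token ID equals its byte value (the text does not state the exact ID layout, so treat that as an assumption).

```python
text = "日本語"

# Fallback path: every character becomes its UTF-8 bytes, one token per byte.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [230, 151, 165, 230, 156, 172, 232, 170, 158]

# Reversal is lossless: the same bytes decode back to the original text.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)  # True
```

Three characters cost nine tokens here. Coverage is guaranteed; compression is not.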

Compute a Vocabulary

ANDREA-480M (the third model in the family; a future activity, 24, covers it) plans to train a Harris tokenizer with N = 16,384 segments on a larger corpus. Compute its vocabulary size. Show the formula. Then explain in one sentence why the byte slab stays at 256 even when N grows.

Beginning of Sequence

Why a Sequence Needs a Marker

A decoder-only transformer predicts a next token from prior context. Position 0 has no prior context. Without a marker, position 0 sits in a logical hole: a model has nothing to attend to.

BOS fills that hole. A single special token (ID = 256 + N) sits at every sequence start during training. A model learns:

- 'When you see BOS, predict a likely first token of natural text.'

- 'When you see BOS followed by a word, that word is a sequence beginning, not a continuation.'

One Token, Many Uses

BOS shows up at:

- Training time: prepended to every chunk of text fed into a model.

- Inference time: prepended to a prompt so a model sees a familiar 'fresh start' signal.

- Boundary marking: in some pipelines, a separator between concatenated documents.
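
Those uses can be sketched in a few lines, taking the BOS ID as 256 + N as stated above. The helper names & the sample token IDs are hypothetical.

```python
def bos_id(n_segments):
    """BOS sits right after the byte slab and the N segments."""
    return 256 + n_segments

def with_bos(token_ids, n_segments):
    """Training or inference: prepend BOS to one sequence."""
    return [bos_id(n_segments)] + list(token_ids)

def pack_documents(docs, n_segments):
    """Boundary marking: BOS opens each document in a concatenated stream."""
    packed = []
    for doc in docs:
        packed.extend(with_bos(doc, n_segments))
    return packed

N = 4096  # ANDREA-12M
print(with_bos([104, 105], N))                       # [4352, 104, 105]
print(pack_documents([[104, 105], [111, 107]], N))   # [4352, 104, 105, 4352, 111, 107]
```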

ANDREA reserves exactly one ID for BOS. No EOS, no PAD, no special tokens beyond what a vocabulary needs. Simplicity stays a permacomputer value: every token earns its slot.

Activity 3 Continues

Activity 3 (grow_a_language_model_tokenizer_diet) covers what happens when N is too large or a tokenizer corpus diverges from a training corpus. ANDREA-12M wasted 63.6% of its vocabulary; ANDREA-120M fixed it. Read on.

BOS-Only Tradeoffs

Reflect on a design choice ANDREA makes: only one special token (BOS), no EOS, no PAD. Name one tradeoff this creates. A tradeoff can be a benefit (simpler engine, fewer wasted slots) or a constraint (some training tricks need extra tokens). One sentence is plenty.