
Words to Numbers

A Translator at a Border

A language model never sees text. It sees integers. A tokenizer sits at the border crossing: human words flow in, integer IDs flow out. Generation reverses the flow: the model emits integer IDs, and the tokenizer renders them back into text.


Three jobs:


1. Segment. Cut a string into pieces (tokens).

2. Map. Assign each piece a unique integer ID from a fixed vocabulary.

3. Reverse. Reconstruct text from IDs at generation time.
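
The three jobs above can be sketched in a few lines of Python. This is a minimal illustration, not ANDREA's tokenizer: the piece list `PIECES`, the greedy longest-match rule, and the function names are all assumptions made for the example.

```python
# Hypothetical toy vocabulary, chosen only for illustration.
PIECES = ["un", "believ", "abl", "y"]
PIECE_TO_ID = {piece: idx for idx, piece in enumerate(PIECES)}


def segment(text):
    """Job 1: cut a string into pieces using greedy longest match."""
    pieces, i = [], 0
    while i < len(text):
        candidates = (p for p in PIECES if text.startswith(p, i))
        match = max(candidates, key=len, default=None)
        if match is None:
            # A real tokenizer would fall back to smaller units here.
            raise ValueError(f"no piece covers position {i} of {text!r}")
        pieces.append(match)
        i += len(match)
    return pieces


def encode(text):
    """Job 2: map each piece to its integer ID."""
    return [PIECE_TO_ID[p] for p in segment(text)]


def decode(ids):
    """Job 3: reconstruct text from IDs."""
    return "".join(PIECES[i] for i in ids)


ids = encode("unbelievably")
print(ids)          # one ID per piece
print(decode(ids))  # round-trips to the original string
```

Greedy longest match is only one possible segmentation rule; it is used here because it is short to write and easy to trace by hand.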


Why Pieces, Not Whole Words

A whole-word vocabulary explodes. English alone has hundreds of thousands of forms. Worse, a model trained on whole words cannot handle a typo, a new name, or a foreign phrase: any unseen word maps to a single <UNK> slot.


Subword tokenization fixes that. A vocabulary of common pieces composes into any word, including ones never seen during training. Two strategies dominate: BPE (byte pair encoding) & distributional segmentation. ANDREA picks the second strategy.


Harris vs BPE

Why Subword

A whole-word tokenizer fails on the rare word `proporian` (a word ANDREA-12M produced at step 43,100). Name two distinct problems that a subword tokenizer (BPE or Harris) avoids but a whole-word tokenizer cannot.

Where Does a Word Break

Zellig Harris, 1955

A linguist named Zellig Harris noticed something. Inside a word, the count of distinct letters that can follow a given letter sequence varies sharply. After un, many different letters can follow: a, b, c, d, e ... After unbel, only one follows: i (as in unbelievable). This count is called successor variety.


A spike in successor variety marks a likely morpheme boundary. After un (a prefix), variety jumps because many roots can follow. Inside a root like believ, variety stays low because letters predict each other. At a transition between morphemes, variety jumps again.
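
Successor variety can be counted directly from a word list. The sketch below is a minimal illustration under stated assumptions: the eight-word list `WORDS` is made up for the example, and the helper names are hypothetical. A real corpus would give much richer counts.

```python
from collections import defaultdict


def successor_sets(words):
    """Map each proper prefix to the set of distinct letters that follow it."""
    followers = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return followers


# Hypothetical toy corpus, chosen only for illustration.
WORDS = [
    "unable", "uncut", "undo", "uneven", "unfair",
    "unbelievable", "unbeliever", "unbelief",
]
FOLLOWERS = successor_sets(WORDS)


def variety(prefix):
    """Number of distinct letters seen after `prefix` in the toy corpus."""
    return len(FOLLOWERS.get(prefix, ()))


def profile(word):
    """Successor variety after each proper prefix of `word`."""
    return [variety(word[:i]) for i in range(1, len(word))]


print("un    ->", variety("un"))     # high: many roots follow the prefix
print("unbel ->", variety("unbel"))  # low: inside the root
print(profile("unbelievable"))
```

In this toy list the count after un is 6 and the count after unbel is 1, which mirrors the pattern described above: high at a morpheme boundary, low inside a root.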


From Variety Spikes to Segments

Run that detector across a training corpus. Every word donates statistical evidence. A tokenizer collects high-frequency segments that recur at morpheme-shaped boundaries: un, re, pre, believ, know, ing, able, ly, tion, ed.


No labels. No linguist hand-tags morphemes. A statistic of letter co-occurrence does the work.


Harris vs BPE


| Property | Harris | BPE |
| --- | --- | --- |
| Boundary criterion | Successor variety spike | Pair frequency |
| Linguistic shape | Morpheme-aligned (prefix, root, suffix) | Frequent byte pairs |
| Example: unbelievably | un + believ + abl + y | unb + eli + eva + bly |
| Generalization | Strong (root + affix recombines) | Weaker (pairs need not align) |

Both produce subword pieces. Harris pieces tend to align with what a linguist would call a morpheme: the smallest meaningful unit. BPE pieces optimize compression: the most frequent byte pair gets merged, regardless of meaning.
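
For contrast, a single BPE merge step (count adjacent pairs, merge the most frequent one) can be sketched as follows. This is a simplified illustration: it works on characters rather than raw bytes, the two-string corpus is made up, and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the adjacent symbol pair that occurs most often across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


# Hypothetical toy corpus: each word is a list of single-character symbols.
corpus = [list("abab"), list("abc")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(seq, pair) for seq in corpus]
print(pair, corpus)
```

Nothing in this step looks at meaning. The pair wins on frequency alone, which is why BPE pieces need not line up with prefixes, roots, or suffixes.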

Segment a Word

Apply Harris-style reasoning to the word `replayed`. Propose three morpheme segments & justify each in one phrase (what role does it play: prefix, root, or suffix).

Three Vocab Slabs

Anatomy of an ANDREA Vocabulary

Harris tokenization produces a vocabulary with three slabs:


Slab 1: 256 base bytes. Every possible byte value (0x00 through 0xFF) gets its own token ID. A safety net: any character a corpus contains, a tokenizer can represent as a sequence of UTF-8 bytes. No <UNK> ever fires.


Slab 2: N morpheme segments. Common pieces discovered through distributional analysis. ANDREA-12M trained with N = 4096; ANDREA-120M trained with N = 8192. Each segment compresses a recurring multi-byte string into a single token.


Slab 3: 1 BOS token. A special marker placed at the start of every training sequence. Lets a model learn 'this position has no past'. ANDREA-12M & ANDREA-120M both reserve exactly one ID for BOS.


Vocabulary Sizes


| Model | Base bytes | Morpheme segments (N) | BOS | Vocab size |
| --- | --- | --- | --- | --- |
| ANDREA-12M | 256 | 4096 | 1 | 4353 |
| ANDREA-120M | 256 | 8192 | 1 | 8449 |

256 + N + 1 = vocabulary size. Simple. Reproducible. Open.
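
The formula can be checked against the table with a short Python sketch. The function names are hypothetical, and the ID layout (bytes first, then segments, then BOS) is an assumption chosen to agree with the BOS ID of 256 + N given later in this lesson.

```python
NUM_BASE_BYTES = 256  # slab 1
NUM_BOS = 1           # slab 3


def vocab_size(n_segments):
    """256 + N + 1, as stated in the lesson."""
    return NUM_BASE_BYTES + n_segments + NUM_BOS


def id_ranges(n_segments):
    """Assumed ID layout: bytes, then morpheme segments, then BOS."""
    first_segment = NUM_BASE_BYTES
    bos = NUM_BASE_BYTES + n_segments
    return {
        "bytes": range(0, first_segment),
        "segments": range(first_segment, bos),
        "bos": range(bos, bos + NUM_BOS),
    }


for name, n in [("ANDREA-12M", 4096), ("ANDREA-120M", 8192)]:
    print(name, vocab_size(n), id_ranges(n)["bos"][0])
```

For N = 4096 this gives 4353, and for N = 8192 it gives 8449, matching both rows of the table.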


Why a Byte Slab Matters

A byte fallback guarantees coverage. If a model encounters 日本語 & its tokenizer has no Japanese segments, individual UTF-8 bytes carry the sequence through. The model then trains on bytes; quality on rare scripts depends on capacity & exposure, but no input ever crashes the tokenizer.
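
A minimal sketch of byte fallback, assuming each byte token's ID equals its byte value (0 through 255) and that no learned segment matches the input. The function names are hypothetical.

```python
def encode_bytes(text):
    """Fallback path: one token ID per UTF-8 byte (assumed ID == byte value)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    """Reverse the fallback: rebuild the byte string, then decode it as UTF-8."""
    return bytes(ids).decode("utf-8")


ids = encode_bytes("日本語")
print(len(ids), ids)      # three characters, three bytes each
print(decode_bytes(ids))  # round-trips to the original text
```

Each of these three characters takes three bytes in UTF-8, so the fallback spends nine tokens on them. Coverage is guaranteed, but sequences get longer, which is the cost learned segments are meant to reduce.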

Compute a Vocabulary

ANDREA-480M (the third model in the family; a future activity, 24, covers it) plans to train a Harris tokenizer with N = 16,384 segments on a larger corpus. Compute its vocabulary size. Show the formula. Then explain in one sentence why the byte slab stays at 256 even when N grows.

Beginning of Sequence

Why a Sequence Needs a Marker

A decoder-only transformer predicts a next token from prior context. Position 0 has no prior context. Without a marker, position 0 sits in a logical hole: a model has nothing to attend to.


BOS fills that hole. A single special token (ID = 256 + N) sits at every sequence start during training. A model learns:


- 'When you see BOS, predict a likely first token of natural text.'

- 'When you see BOS followed by a word, that word is a sequence beginning, not a continuation.'


One Token, Many Uses


BOS shows up at:


- Training time: prepended to every chunk of text fed into a model.

- Inference time: prepended to a prompt so a model sees a familiar 'fresh start' signal.

- Boundary marking: in some pipelines, a separator between concatenated documents.
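
The three uses above can be sketched with plain lists of token IDs. This is an illustration only: the helper names are hypothetical, the token IDs in the example are arbitrary, and the only detail taken from the lesson is that the BOS ID equals 256 + N.

```python
def bos_id(n_segments):
    """BOS sits after the byte slab and the segment slab: ID = 256 + N."""
    return 256 + n_segments


def with_bos(token_ids, n_segments):
    """Training or inference: prepend BOS to one sequence of token IDs."""
    return [bos_id(n_segments)] + list(token_ids)


def join_documents(documents, n_segments):
    """Boundary marking: start every document with BOS, then concatenate."""
    joined = []
    for doc in documents:
        joined.extend(with_bos(doc, n_segments))
    return joined


N = 4096  # ANDREA-12M
print(with_bos([104, 105], N))
print(join_documents([[1, 2], [3]], N))
```

With N = 4096 the BOS ID is 4352, so the joined stream reads 4352, 1, 2, 4352, 3. Whether a given pipeline concatenates documents this way is a design choice, not something this lesson specifies.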


ANDREA reserves exactly one ID for BOS. No EOS, no PAD, no special tokens beyond what a vocabulary needs. Simplicity stays a permacomputer value: every token earns its slot.


Activity 3 Continues

Activity 3 (grow_a_language_model_tokenizer_diet) covers what happens when N is too large or a tokenizer corpus diverges from a training corpus. ANDREA-12M wasted 63.6% of its vocabulary; ANDREA-120M fixed it. Read on.

BOS-Only Tradeoffs

Reflect on a design choice ANDREA makes: only one special token (BOS), no EOS, no PAD. Name one tradeoff this creates. The tradeoff can be a benefit (simpler engine, fewer wasted slots) or a constraint (some training tricks need extra tokens). One sentence is plenty.