Grammar as Graph Structure
A compiler translates source code by building a parse tree: a rooted tree in which each node represents the application of a grammar rule, & each leaf represents a terminal symbol (variable name, literal, operator).
Tree Geometry
A tree with n nodes has n−1 edges. Depth d: the maximum distance, in edges, from the root to any leaf. A balanced binary tree of depth d has up to 2^d leaves & 2^(d+1)−1 total nodes.
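A quick sanity check of those counts, summing nodes level by level (a minimal sketch; the loop bound is arbitrary):

```python
# Verify leaf/node counts for a perfect binary tree of depth d.
# Depth is counted in edges from the root; a lone root has depth 0.
for d in range(6):
    leaves = 2 ** d
    total = 2 ** (d + 1) - 1
    # Levels 0..d hold 2^0, 2^1, ..., 2^d nodes; the sum matches the closed form.
    assert total == sum(2 ** level for level in range(d + 1))
    print(f"depth {d}: {leaves} leaves, {total} nodes")
```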
Parse trees for typical expressions are not balanced: operator precedence creates right-skewed or left-skewed trees. A + B * C produces a tree where * is deeper than +, encoding that * binds more tightly.
BNF & ALGOL
Backus-Naur Form (BNF), invented to specify ALGOL, formalizes grammar as production rules:
```
<expr>   ::= <term> | <expr> + <term>
<term>   ::= <factor> | <term> * <factor>
<factor> ::= id | ( <expr> )
```
Each production rule defines one level of the tree. The grammar itself is a directed graph on the non-terminal symbols; parsing a sentence unfolds that graph into a tree, from the start symbol down to the terminal leaves.
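The productions map directly onto a recursive-descent parser: one function per non-terminal. A minimal sketch (the left recursion in the grammar is rewritten as loops, since naive recursive descent cannot handle it; function names and the tuple tree encoding are illustrative choices, not from the text):

```python
# Recursive-descent parser for the expression grammar above.
# Each non-terminal becomes one function; the call tree mirrors the parse tree.

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tree, rest = parse_expr(tokens)
    assert not rest, f"unconsumed input: {rest}"
    return tree

def parse_expr(tokens):
    # <expr> ::= <term> | <expr> + <term>   (left recursion turned into a loop)
    node, tokens = parse_term(tokens)
    while tokens and tokens[0] == "+":
        right, tokens = parse_term(tokens[1:])
        node = ("+", node, right)
    return node, tokens

def parse_term(tokens):
    # <term> ::= <factor> | <term> * <factor>
    node, tokens = parse_factor(tokens)
    while tokens and tokens[0] == "*":
        right, tokens = parse_factor(tokens[1:])
        node = ("*", node, right)
    return node, tokens

def parse_factor(tokens):
    # <factor> ::= id | ( <expr> )
    if tokens[0] == "(":
        node, tokens = parse_expr(tokens[1:])
        assert tokens[0] == ")", "missing closing parenthesis"
        return node, tokens[1:]
    return tokens[0], tokens[1:]

print(parse(tokenize("A + B * C")))   # ('+', 'A', ('*', 'B', 'C'))
```

Note how the result nests * below +: precedence falls out of which non-terminal calls which, not from any explicit priority table.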
Parse Tree Depth & Expression Complexity
For the expression (A + B) * (C + D), standard operator precedence puts the * node at the root, with the two parenthesized sums as its subtrees; the parentheses force the additions deeper in the tree than the multiplication.
The depth of a parse tree affects compiler performance: deeper trees require more stack frames during recursive-descent parsing.
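A short sketch makes the depth concrete, using a nested-tuple encoding of the parse tree (the encoding is an assumption for illustration): each level of nesting corresponds to one stack frame in a recursive-descent parser.

```python
# Depth of a parse tree encoded as nested tuples ("op", left, right); leaves are ids.
def depth(tree):
    if isinstance(tree, str):          # leaf: an identifier
        return 0
    _, left, right = tree
    return 1 + max(depth(left), depth(right))

skewed   = ("+", ("+", ("+", "A", "B"), "C"), "D")   # A + B + C + D, left-associative
balanced = ("*", ("+", "A", "B"), ("+", "C", "D"))   # (A + B) * (C + D)
print(depth(skewed), depth(balanced))   # 3 2
```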
Shannon Entropy & Redundancy
Hamming noted that spoken language is ~60% redundant, written language ~40%. This has a precise information-theoretic meaning.
Shannon Entropy
For a source generating symbols from alphabet A with probabilities {p₁, p₂, ..., pₙ}:
H = −Σ pᵢ log₂(pᵢ) (bits per symbol)
Maximum entropy: uniform distribution (all symbols equally likely). H_max = log₂(n) for n symbols.
Redundancy R: fraction of bits that carry no information (could be removed without losing content):
R = 1 − H / H_max
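The three formulas in code, applied to an illustrative 4-symbol source (the distribution is made up for the example):

```python
import math

def entropy(probs):
    # H = -sum p_i * log2(p_i), skipping zero-probability symbols
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]   # illustrative 4-symbol source
H = entropy(probs)                   # 1.75 bits/symbol
H_max = math.log2(len(probs))        # 2.0 bits/symbol: the uniform case
R = 1 - H / H_max                    # 0.125: 12.5% of the bits are redundant
print(H, H_max, R)
```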
If R = 0.4 (40% redundant): 40% of bits can be predicted from context. A channel carrying English text at 8 bits/character is using only 60% of its capacity for information; 40% is structure the receiver already knows.
Error detection uses redundancy: if 40% of bits are predictable, a transmission error may produce a sequence that violates the redundancy structure — detectable even without error-correcting codes.
APL sits near zero redundancy: a single-character change is almost never detectable from context alone. English's 40–60% redundancy means you can often reconstruct a word from surrounding context even if a letter is missing or wrong.
Computing Redundancy
English letter frequencies (approximate): 'e' appears ~12.7% of the time; 'z' ~0.07%. Letter frequencies alone bring the entropy down only modestly, to roughly 4.1–4.2 bits/character; accounting for word-level & longer-range context, Shannon estimated the entropy of English at approximately H ≈ 1.0 bit/character. Maximum entropy for a 26-letter alphabet: H_max = log₂(26) ≈ 4.70 bits/character.
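Plugging those figures into R = 1 − H/H_max (the 4.2 bits/character value is an approximate first-order figure, an assumption of this sketch) shows how redundancy depends on how much context the model sees; the ~40% figure quoted earlier corresponds to intermediate-range structure between these two extremes:

```python
import math

H_max = math.log2(26)   # ≈ 4.70 bits/character: all 26 letters equally likely

# Redundancy R = 1 - H/H_max at two levels of modeling:
for label, H in [("letter frequencies only (H ≈ 4.2 bits)", 4.2),
                 ("with long-range context (H ≈ 1.0 bit)", 1.0)]:
    print(f"{label}: R ≈ {1 - H / H_max:.2f}")
```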
Growth Curves as Geometry
Compiler algorithms process programs of size n (tokens, lines, or nodes). The choice of algorithm determines how compile time scales with program size.
Common Complexity Classes
| Class | Runtime | Example |
|---|---|---|
| O(n) | linear | lexical scanning |
| O(n log n) | quasi-linear | optimal sorting |
| O(n²) | quadratic | naive duplicate detection |
| O(n³) | cubic | matrix multiply, CYK parsing |
| O(2ⁿ) | exponential | naive theorem proving |
The geometric picture: plot runtime vs n. O(n) is a line. O(n²) is a parabola. O(2ⁿ) is an exponential curve that looks flat near n=0 and nearly vertical near n=50.
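The same geometry is visible in raw operation counts (a quick table; counts, not seconds):

```python
import math

# Operation counts for each complexity class at a few input sizes.
print(f"{'n':>6} {'n log n':>10} {'n^2':>10} {'n^3':>12} {'2^n':>18}")
for n in (10, 20, 50):
    print(f"{n:>6} {int(n * math.log2(n)):>10} {n**2:>10} {n**3:>12} {2**n:>18}")
```

By n = 50 the exponential column has outgrown the cubic one by ten orders of magnitude, which is the "nearly vertical" part of the curve.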
General context-free grammar parsing (CYK algorithm) runs in O(n³). For most practical programming languages, the grammar is designed to be LR(1)-parseable, allowing O(n) parsing. ALGOL's grammar was more complex than FORTRAN's, contributing to slower compilers — a practical friction that mattered in 1958 when compiling took hours.
Crossover Points
A naive symbol-table lookup uses O(n²) total operations for a program of n identifiers (linear scan for each of n lookups). A hash-table symbol table uses O(n) expected total operations.
Suppose the O(n²) algorithm runs at 10⁶ operations/sec on 1958 hardware, & the O(n) algorithm runs at the same speed but requires 100n setup operations (hash-table construction).
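Working out the crossover under those assumptions (treating each lookup as one operation, an assumption of this sketch), the naive scan actually wins for small programs:

```python
# Crossover point under the stated cost model:
#   naive:  n^2 operations total (linear scan per lookup)
#   hashed: 100*n setup + n lookups = 101*n operations total
def naive_ops(n):
    return n * n

def hashed_ops(n):
    return 100 * n + n

crossover = next(n for n in range(1, 10_000) if naive_ops(n) > hashed_ops(n))
print(crossover)   # 102: below ~100 identifiers, the naive scan is cheaper
```

This is why asymptotic superiority alone did not settle 1950s engineering choices: for the small programs of the era, constant factors & setup costs could dominate.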