Decision Boundaries as Hyperplanes
A binary classifier assigns each input to one of two classes. The classifier's decision boundary divides input space into two regions: one per class. The geometry of that boundary determines what patterns the classifier can learn.
A hyperplane in ℝ^n: the set of all points x satisfying w·x + b = 0, where w is a weight vector in ℝ^n and b is a scalar bias. A hyperplane has n−1 dimensions.
In 2D: a hyperplane is a line. In 3D: a flat plane. In n-D: a flat (n−1)-dimensional affine subspace (it need not pass through the origin when b ≠ 0).
A perceptron classifies by computing w·x + b and returning class 1 if positive, class 0 if negative. Its decision boundary is a hyperplane.
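The decision rule above can be sketched in a few lines of Python (the function name and example weights are illustrative, not from the text):

```python
def perceptron_predict(w, b, x):
    """Return class 1 if w.x + b is positive, else class 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0

# Illustrative weights: w = (1, 1), b = -1.5 implements AND on {0,1} inputs,
# because only the input (1, 1) pushes the activation above zero.
print(perceptron_predict((1, 1), -1.5, (1, 1)))  # class 1
```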
Linear Separability
A dataset is linearly separable in ℝ^n if there exists a hyperplane that puts all class-0 points on one side and all class-1 points on the other. This is a purely geometric property of the dataset.
Testing Linear Separability
The AND gate dataset in 2D: class-0 points at (0,0), (1,0), (0,1); class-1 point at (1,1). This dataset is linearly separable: the line x₁ + x₂ = 1.5, for example, puts (1,1) on one side and the other three points on the other.
The XOR dataset in 2D: class-0 points at (0,0) and (1,1); class-1 points at (1,0) and (0,1). The two classes lie on opposite diagonals of the unit square, and no single line can put each class entirely on its own side.
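One concrete way to test separability is to run the classic perceptron learning rule on each dataset: it drives the training error to zero exactly when a separating hyperplane exists. A minimal sketch (the helper name and epoch budget are my own choices):

```python
def train_perceptron(data, epochs=100):
    """Perceptron learning rule in 2D; returns (w, b, errors in final epoch)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), label in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != label:
                errors += 1
                delta = label - pred            # +1 or -1
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
        if errors == 0:                         # found a separating hyperplane
            break
    return w, b, errors

AND = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((1, 1), 0), ((1, 0), 1), ((0, 1), 1)]
```

On AND the rule converges; on XOR the error count never reaches zero, no matter how long it runs.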
Lifting to Higher Dimensions
XOR is not linearly separable in 2D. The solution: map the data to a higher-dimensional space where it becomes linearly separable. This is the core idea of the kernel trick.
Feature map: a function φ: ℝ^n → ℝ^m (m > n) that transforms each input point into a higher-dimensional representation.
For XOR, one useful feature map: φ(x₁, x₂) = (x₁, x₂, x₁x₂)
This adds a third dimension z = x₁ × x₂. The XOR points transform to:
- (0,0) → (0, 0, 0), class 0
- (1,0) → (1, 0, 0), class 1
- (0,1) → (0, 1, 0), class 1
- (1,1) → (1, 1, 1), class 0
In 3D: the class-0 points are at (0,0,0) and (1,1,1); the class-1 points are at (1,0,0) and (0,1,0). Now find a separating plane.
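The lift can be checked directly; `phi` below is just the feature map φ from the text written out in Python:

```python
def phi(x1, x2):
    """Feature map: lift a 2D point to 3D by appending z = x1 * x2."""
    return (x1, x2, x1 * x2)

xor_data = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}
lifted = {phi(*point): label for point, label in xor_data.items()}
# The two class-0 points land at (0,0,0) and (1,1,1) on the main diagonal;
# the two class-1 points stay in the z = 0 plane.
```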
Separating Plane in 3D
After the feature map φ(x₁, x₂) = (x₁, x₂, x₁x₂), the XOR data lives in 3D, where a hyperplane has equation w₁x₁ + w₂x₂ + w₃z + b = 0. One choice that works: x₁ + x₂ − 2z = 0.5. Both class-1 points, (1,0,0) and (0,1,0), evaluate to 1; both class-0 points, (0,0,0) and (1,1,1), evaluate to 0, so the threshold 0.5 separates them.
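A quick numeric check of the plane x₁ + x₂ − 2z = 0.5, one valid choice found by inspection (the text does not prescribe a particular plane):

```python
def plane_side(point):
    """Side of the plane x1 + x2 - 2z = 0.5, i.e. w = (1, 1, -2), b = -0.5."""
    x1, x2, z = point
    return 1 if x1 + x2 - 2 * z - 0.5 > 0 else 0

lifted = {(0, 0, 0): 0, (1, 0, 0): 1, (0, 1, 0): 1, (1, 1, 1): 0}
# Every lifted point falls on the correct side of the plane:
assert all(plane_side(p) == label for p, label in lifted.items())
```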
Cover's Theorem: Why High Dimensions Help
Cover's theorem (1965): a complex classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space, provided the space is not densely populated.
Informal statement: for n data points in general position, the fraction of labelings that are linearly separable grows with the dimension d, and reaches 1 (every labeling separable) once d ≥ n − 1.
Formal version: for n points in general position in ℝ^d, the number of linearly separable dichotomies (class assignments) is exactly 2 × Σ_{k=0}^{d} C(n−1, k); whenever d ≥ n − 1 this equals 2^n, i.e. every dichotomy is separable.
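The counting formula is easy to evaluate (a sketch; `cover_count` is my own name for it):

```python
from math import comb

def cover_count(n, d):
    """Cover's count of linearly separable dichotomies of n points
    in general position in R^d (affine hyperplanes)."""
    return 2 * sum(comb(n - 1, k) for k in range(d + 1))

# Four points in 2D: 14 of the 16 dichotomies are separable; the two
# missing ones are exactly the XOR labelings.
print(cover_count(4, 2))  # 14
print(cover_count(4, 3))  # 16 == 2**4: in 3D every labeling is separable
```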
Practical implication: the feature map φ that lifts XOR to 3D is a special case of this general principle. Lifting to higher dimensions increases the chance of separability. The cost: more parameters to fit, higher risk of overfitting.
The Bias-Variance Tradeoff as Geometry
- Low-dimensional decision boundary (few parameters): high bias (cannot capture complex patterns), low variance (stable across samples).
- High-dimensional boundary (many parameters): low bias, high variance (can overfit to noise in the training data).
VC Dimension: How Expressive Is a Classifier?
The Vapnik-Chervonenkis (VC) dimension of a hypothesis class H measures how complex the class is: the largest number of points that H can shatter (correctly classify in all 2^n possible labelings).
Perceptron in ℝ^d: VC dimension = d + 1. A hyperplane classifier with bias in ℝ^d can shatter some set of d + 1 points in general position, but no set of d + 2 points.
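The d + 1 claim can be checked by brute force in 2D. The helper below (my own construction) uses the perceptron rule as a separability test, which is valid because the rule converges exactly on separable labelings given enough epochs:

```python
from itertools import product

def separable(points, labels, epochs=200):
    """Perceptron rule as a separability test for tiny 2D datasets."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != y:
                errors += 1
                w[0] += (y - pred) * x1
                w[1] += (y - pred) * x2
                b += y - pred
        if errors == 0:
            return True
    return False

three = [(0, 0), (1, 0), (0, 1)]        # d + 1 = 3 points, general position
assert all(separable(three, ls) for ls in product([0, 1], repeat=3))

four = [(0, 0), (1, 0), (0, 1), (1, 1)]
assert not separable(four, [0, 1, 1, 0])   # the XOR labeling defeats 4 points
```

All 8 labelings of the three points succeed; the XOR labeling of the four corners fails, consistent with VC dimension 3 in 2D.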
The VC dimension determines the sample complexity: to learn a hypothesis with generalization error ε with probability 1 − δ, you need roughly n ≥ (d × log(1/ε) + log(1/δ)) / ε samples, where d is the VC dimension.
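Plugging numbers into this rough bound (natural log assumed, since the text leaves the base unspecified; `rough_sample_bound` is a name of my own):

```python
from math import ceil, log

def rough_sample_bound(vc_dim, eps, delta):
    """n >= (d * log(1/eps) + log(1/delta)) / eps, rounded up."""
    return ceil((vc_dim * log(1 / eps) + log(1 / delta)) / eps)

# A perceptron in R^2 has VC dimension 3; for eps = 0.1 and delta = 0.05
# this estimate calls for on the order of a hundred samples.
print(rough_sample_bound(3, 0.1, 0.05))  # 100
```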
Decision Boundaries & Machine Capability Limits
The geometry of decision boundaries connects directly to Hamming's machine reasoning limits.
A single-layer perceptron (hyperplane classifier) cannot solve XOR. This was Minsky & Papert's critique of early perceptrons in 1969. The geometric argument: XOR is not linearly separable. The machine cannot solve it, not because of lack of computing power, but because of a fundamental geometric incompatibility between the hypothesis class and the problem.
The resolution: multi-layer networks can represent non-linear boundaries. The hidden layers implement the feature map φ, lifting the data into a space where linear separation becomes possible. Each hidden neuron computes one hyperplane; combining several of these half-space responses yields piecewise-linear boundaries that can approximate curved ones.
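A hand-built two-layer network makes this concrete (weights chosen by hand for illustration; nothing here is learned):

```python
def step(a):
    """Threshold activation: the perceptron's decision rule."""
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    """Each hidden unit is one hyperplane in the input plane."""
    h1 = step(x1 + x2 - 0.5)    # fires when x1 OR x2  (line x1 + x2 = 0.5)
    h2 = step(x1 + x2 - 1.5)    # fires when x1 AND x2 (line x1 + x2 = 1.5)
    return step(h1 - 2 * h2 - 0.5)   # h1 AND NOT h2, i.e. XOR

print([xor_net(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])  # [0, 1, 1, 0]
```

The hidden layer plays the role of φ: the pair (h1, h2) is a lifted representation in which the output unit's single hyperplane suffices.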
This history maps onto Hamming's observation: every limitation of machine reasoning has a geometric structure underneath it. The task is not to argue about whether machines 'can think' but to identify the geometric constraints and find ways to work around them.