Decision Boundaries as Hyperplanes
A binary classifier assigns each input to one of two classes. The classifier's decision boundary divides input space into two regions: one per class. The geometry of that boundary determines what patterns the classifier can learn.
A hyperplane in ℝ^n: the set of all points x satisfying w·x + b = 0, where w is a weight vector in ℝ^n and b is a scalar bias. A hyperplane has n−1 dimensions.
In 2D: a hyperplane is a line. In 3D: a flat plane. In n-D: a flat (n−1)-dimensional affine subspace (it need not pass through the origin when b ≠ 0).
A perceptron classifies by computing w·x + b and returning class 1 if positive, class 0 if negative. Its decision boundary is a hyperplane.
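The decision rule above can be sketched in a few lines of Python (the function name and example weights are illustrative, not from the text):

```python
def perceptron_predict(w, b, x):
    """Return class 1 if w.x + b is positive, else class 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0

# Illustrative weights: w = (1, 1), b = -1.5 implements AND on {0,1} inputs,
# because only the input (1, 1) pushes the activation above zero.
print(perceptron_predict((1, 1), -1.5, (1, 1)))  # class 1
```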
Linear Separability
A dataset is linearly separable in ℝ^n if there exists a hyperplane that puts all class-0 points on one side and all class-1 points on the other. This is a purely geometric property of the dataset.
Testing Linear Separability
The AND gate dataset in 2D: class-0 points at (0,0), (1,0), (0,1); class-1 point at (1,1). This dataset is linearly separable: the line x₁ + x₂ = 1.5, for example, puts (1,1) on one side and the other three points on the other.
The XOR dataset in 2D: class-0 points at (0,0) and (1,1); class-1 points at (1,0) and (0,1). The two classes lie on opposite diagonals of the unit square, and no single line can put each class entirely on its own side.
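One concrete way to test separability is to run the classic perceptron learning rule on each dataset: it drives the training error to zero exactly when a separating hyperplane exists. A minimal sketch (the helper name and epoch budget are my own choices):

```python
def train_perceptron(data, epochs=100):
    """Perceptron learning rule in 2D; returns (w, b, errors in final epoch)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), label in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != label:
                errors += 1
                delta = label - pred            # +1 or -1
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
        if errors == 0:                         # found a separating hyperplane
            break
    return w, b, errors

AND = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((1, 1), 0), ((1, 0), 1), ((0, 1), 1)]
```

On AND the rule converges; on XOR the error count never reaches zero, no matter how long it runs.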
Lifting to Higher Dimensions
XOR is not linearly separable in 2D. The solution: map the data to a higher-dimensional space where it becomes linearly separable. This is the core idea of the kernel trick.
Feature map: a function φ: ℝ^n → ℝ^m (m > n) that transforms each input point into a higher-dimensional representation.
For XOR, one useful feature map: φ(x₁, x₂) = (x₁, x₂, x₁x₂)
This adds a third dimension z = x₁ × x₂. The XOR points transform to:
- (0,0) → (0, 0, 0), class 0
- (1,0) → (1, 0, 0), class 1
- (0,1) → (0, 1, 0), class 1
- (1,1) → (1, 1, 1), class 0
In 3D: the class-0 points are at (0,0,0) and (1,1,1); the class-1 points are at (1,0,0) and (0,1,0). Now find a separating plane.
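The lift can be checked directly; `phi` below is just the feature map φ from the text written out in Python:

```python
def phi(x1, x2):
    """Feature map: lift a 2D point to 3D by appending z = x1 * x2."""
    return (x1, x2, x1 * x2)

xor_data = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}
lifted = {phi(*point): label for point, label in xor_data.items()}
# The two class-0 points land at (0,0,0) and (1,1,1) on the main diagonal;
# the two class-1 points stay in the z = 0 plane.
```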
Separating Plane in 3D
After the feature map φ(x₁, x₂) = (x₁, x₂, x₁x₂), the XOR data lives in 3D, where a hyperplane has equation w₁x₁ + w₂x₂ + w₃z + b = 0. One choice that works: x₁ + x₂ − 2z = 0.5. Both class-1 points, (1,0,0) and (0,1,0), evaluate to 1; both class-0 points, (0,0,0) and (1,1,1), evaluate to 0, so the threshold 0.5 separates them.
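A quick numeric check of the plane x₁ + x₂ − 2z = 0.5, one valid choice found by inspection (the text does not prescribe a particular plane):

```python
def plane_side(point):
    """Side of the plane x1 + x2 - 2z = 0.5, i.e. w = (1, 1, -2), b = -0.5."""
    x1, x2, z = point
    return 1 if x1 + x2 - 2 * z - 0.5 > 0 else 0

lifted = {(0, 0, 0): 0, (1, 0, 0): 1, (0, 1, 0): 1, (1, 1, 1): 0}
# Every lifted point falls on the correct side of the plane:
assert all(plane_side(p) == label for p, label in lifted.items())
```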
Cover's Theorem: Why High Dimensions Help
Cover's theorem (1965): a complex classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space, provided the space is not densely populated.
Informal statement: for n data points in general position, the fraction of labelings that are linearly separable grows with the dimension d, and reaches 1 (every labeling separable) once d ≥ n − 1.
Formal version: for n points in general position in ℝ^d, the number of linearly separable dichotomies (class assignments) is exactly 2 × Σ_{k=0}^{d} C(n−1, k); whenever d ≥ n − 1 this equals 2^n, i.e. every dichotomy is separable.
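The counting formula is easy to evaluate (a sketch; `cover_count` is my own name for it):

```python
from math import comb

def cover_count(n, d):
    """Cover's count of linearly separable dichotomies of n points
    in general position in R^d (affine hyperplanes)."""
    return 2 * sum(comb(n - 1, k) for k in range(d + 1))

# Four points in 2D: 14 of the 16 dichotomies are separable; the two
# missing ones are exactly the XOR labelings.
print(cover_count(4, 2))  # 14
print(cover_count(4, 3))  # 16 == 2**4: in 3D every labeling is separable
```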
Practical implication: the feature map φ that lifts XOR to 3D is a special case of this general principle. Lifting to higher dimensions increases the chance of separability. The cost: more parameters to fit, higher risk of overfitting.
The Bias-Variance Tradeoff as Geometry
- Low-dimensional decision boundary (few parameters): high bias (cannot capture complex patterns), low variance (stable across samples).
- High-dimensional boundary (many parameters): low bias, high variance (can overfit to noise in the training data).
VC Dimension: How Expressive Is a Classifier?
The Vapnik-Chervonenkis (VC) dimension of a hypothesis class H measures how complex the class is: the largest number of points that H can shatter (correctly classify in all 2^n possible labelings).
Perceptron in ℝ^d: VC dimension = d + 1. A hyperplane classifier with bias in ℝ^d can shatter some set of d + 1 points in general position, but no set of d + 2 points.
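The d + 1 claim can be checked by brute force in 2D. The helper below (my own construction) uses the perceptron rule as a separability test, which is valid because the rule converges exactly on separable labelings given enough epochs:

```python
from itertools import product

def separable(points, labels, epochs=200):
    """Perceptron rule as a separability test for tiny 2D datasets."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != y:
                errors += 1
                w[0] += (y - pred) * x1
                w[1] += (y - pred) * x2
                b += y - pred
        if errors == 0:
            return True
    return False

three = [(0, 0), (1, 0), (0, 1)]        # d + 1 = 3 points, general position
assert all(separable(three, ls) for ls in product([0, 1], repeat=3))

four = [(0, 0), (1, 0), (0, 1), (1, 1)]
assert not separable(four, [0, 1, 1, 0])   # the XOR labeling defeats 4 points
```

All 8 labelings of the three points succeed; the XOR labeling of the four corners fails, consistent with VC dimension 3 in 2D.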
The VC dimension determines the sample complexity: to learn a hypothesis with generalization error ε with probability 1 − δ, you need roughly n ≥ (d × log(1/ε) + log(1/δ)) / ε samples, where d is the VC dimension.
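Plugging numbers into this rough bound (natural log assumed, since the text leaves the base unspecified; `rough_sample_bound` is a name of my own):

```python
from math import ceil, log

def rough_sample_bound(vc_dim, eps, delta):
    """n >= (d * log(1/eps) + log(1/delta)) / eps, rounded up."""
    return ceil((vc_dim * log(1 / eps) + log(1 / delta)) / eps)

# A perceptron in R^2 has VC dimension 3; for eps = 0.1 and delta = 0.05
# this estimate calls for on the order of a hundred samples.
print(rough_sample_bound(3, 0.1, 0.05))  # 100
```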
Decision Boundaries & Machine Capability Limits
The geometry of decision boundaries connects directly to Hamming's machine reasoning limits.
A single-layer perceptron (hyperplane classifier) cannot solve XOR. This was Minsky & Papert's critique of early perceptrons in 1969. The geometric argument: XOR is not linearly separable. The machine cannot solve it, not because of lack of computing power, but because of a fundamental geometric incompatibility between the hypothesis class and the problem.
The resolution: multi-layer networks can represent non-linear boundaries. The hidden layers implement the feature map φ, lifting the data into a space where linear separation becomes possible. Each hidden neuron computes one hyperplane; combining several of these half-space responses yields piecewise-linear boundaries that can approximate curved ones.
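A hand-built two-layer network makes this concrete (weights chosen by hand for illustration; nothing here is learned):

```python
def step(a):
    """Threshold activation: the perceptron's decision rule."""
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    """Each hidden unit is one hyperplane in the input plane."""
    h1 = step(x1 + x2 - 0.5)    # fires when x1 OR x2  (line x1 + x2 = 0.5)
    h2 = step(x1 + x2 - 1.5)    # fires when x1 AND x2 (line x1 + x2 = 1.5)
    return step(h1 - 2 * h2 - 0.5)   # h1 AND NOT h2, i.e. XOR

print([xor_net(a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])  # [0, 1, 1, 0]
```

The hidden layer plays the role of φ: the pair (h1, h2) is a lifted representation in which the output unit's single hyperplane suffices.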
This history maps onto Hamming's observation: every limitation of machine reasoning has a geometric structure underneath it. The task is not to argue about whether machines 'can think' but to identify the geometric constraints and find ways to work around them.