Sign, Exponent, Mantissa
IEEE 754 Floating-Point Format
Every floating-point number stores three fields:
- Sign bit (1 bit): positive or negative
- Exponent (E bits): the magnitude scale, an integer power of 2
- Mantissa (M bits): the fractional precision; with the implicit leading 1, the significand (1 + mantissa) lies between 1.0 & ~2.0
Total bits = 1 + E + M. For normal numbers the value equals (-1)^sign × (1 + mantissa) × 2^(exponent - bias), where the mantissa is read as a fraction in [0, 1) & bias = 2^(E-1) - 1.
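The three fields can be inspected directly. The following is a minimal sketch using Python's `struct` module, whose `e` format code packs IEEE 754 half precision (FP16). The helper names `decode_fp16` and `fields_to_value` are illustrative, and the reconstruction only covers normal numbers. The stored mantissa is an integer, so it is divided by 2^M to get the fraction used in the formula.

```python
import struct

def decode_fp16(x: float):
    """Round x to FP16 and split the 16 bits into (sign, exponent, mantissa) fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15                 # top bit
    exponent = (bits >> 10) & 0x1F    # next 5 bits, stored with a bias of 15
    mantissa = bits & 0x3FF           # low 10 bits, stored as an integer
    return sign, exponent, mantissa

def fields_to_value(sign, exponent, mantissa, exp_bits=5, man_bits=10):
    """Rebuild a normal number: (-1)^sign * (1 + mantissa / 2^M) * 2^(exponent - bias)."""
    bias = 2 ** (exp_bits - 1) - 1
    return (-1) ** sign * (1 + mantissa / 2 ** man_bits) * 2.0 ** (exponent - bias)

sign, exponent, mantissa = decode_fp16(-6.5)
print(sign, exponent, mantissa)
print(fields_to_value(sign, exponent, mantissa))
```

Here -6.5 is -1.625 × 2^2, so the stored exponent is 2 + 15 = 17 and the stored mantissa is 0.625 × 1024 = 640.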
Two properties matter for training:
Dynamic range = 2^(2^E) (roughly): the ratio between the largest & smallest representable magnitudes. More exponent bits means representing tinier & huger numbers without underflow or overflow.
Precision = 2^M distinct values per power of 2. More mantissa bits means finer-grained representation between consecutive powers of 2.
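Both properties follow from E and M alone. The sketch below is illustrative (the function name `precision_summary` is hypothetical). The decimal-digit estimate assumes the usual convention of counting the implicit leading 1, giving (M + 1) × log10(2) digits.

```python
import math

def precision_summary(exp_bits: int, man_bits: int) -> dict:
    """Summarize range and precision for a format with E exponent and M mantissa bits."""
    return {
        # Representable values inside each interval [2^k, 2^(k+1)).
        "values_per_power_of_2": 2 ** man_bits,
        # Gap between neighbours, relative to the bottom of the interval.
        "relative_step": 2.0 ** -man_bits,
        # Equivalent decimal digits, counting the implicit leading 1 bit.
        "decimal_digits": (man_bits + 1) * math.log10(2),
        # Number of codes in the exponent field; range grows as roughly 2^(this).
        "exponent_codes": 2 ** exp_bits,
    }

formats = {"FP32": (8, 23), "FP16": (5, 10), "FP8 E4M3": (4, 3)}
for name, (e, m) in formats.items():
    s = precision_summary(e, m)
    print(f"{name}: {s['values_per_power_of_2']} values per power of 2, "
          f"~{s['decimal_digits']:.1f} decimal digits, "
          f"{s['exponent_codes']} exponent codes")
```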
The Three Formats
| Format | Total bits | Sign | Exp | Mant | Dynamic range | Precision |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ~10^-38 to ~10^38 | ~7 digits |
| FP16 | 16 | 1 | 5 | 10 | ~6×10^-5 to ~6.5×10^4 | ~3 digits |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~2^-9 to ~448 | ~1 digit |
FP8 E4M3 reads as "4 exponent bits, 3 mantissa bits". An alternative, FP8 E5M2, trades precision for range; ANDREA experiments use E4M3 because transformer activations stay in narrow magnitude bands where extra precision wins over extra range.
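The range limits can be derived from the bit widths. This is a sketch under stated assumptions: FP32 and FP16 follow the IEEE 754 convention of reserving the all-ones exponent for infinities and NaN, while E4M3 is assumed to follow the common FP8 convention in which only the single all-ones bit pattern encodes NaN, which is what lifts its maximum to 448. The function name `format_range` and the `e4m3_style` flag are illustrative.

```python
def format_range(exp_bits: int, man_bits: int, e4m3_style: bool = False):
    """Return (largest finite, smallest normal, smallest subnormal) magnitudes."""
    bias = 2 ** (exp_bits - 1) - 1
    if e4m3_style:
        # Assumption: all-ones exponent is usable; only the all-ones mantissa
        # at that exponent is NaN, so the top finite mantissa is 2^M - 2.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    else:
        # IEEE 754: the all-ones exponent is reserved for inf and NaN.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    largest = (1 + max_mantissa / 2 ** man_bits) * 2.0 ** (max_exp_field - bias)
    smallest_normal = 2.0 ** (1 - bias)
    smallest_subnormal = 2.0 ** (1 - bias - man_bits)
    return largest, smallest_normal, smallest_subnormal

for name, args in {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "FP8 E4M3": (4, 3, True),
}.items():
    largest, normal, subnormal = format_range(*args)
    print(f"{name}: max {largest:.4g}, min normal {normal:.4g}, "
          f"min subnormal {subnormal:.4g}")
```

The table's lower bound for E4M3 (2^-9) is the smallest subnormal, while the FP32 lower bound (~10^-38) is the smallest normal value.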
Bytes Per Parameter
Why Lower Precision Runs Faster
Memory Bandwidth Dominates Training Speed
Modern GPUs spend more time waiting on memory than computing. The RTX 4090 has 1008 GB/s of memory bandwidth & 165 TFLOPS of FP16 compute. A typical layer reads weights from VRAM, multiplies them with activations, & writes results back. Unless the layer performs well over a hundred FLOPs per byte read, bandwidth, not compute, decides throughput.
Halving precision halves bytes per parameter, so reading the same weights uses half the memory bandwidth. For a bandwidth-bound layer, throughput roughly doubles.
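The bandwidth argument can be put into numbers. The sketch below uses the two RTX 4090 figures quoted above and a 120M-parameter model; it is back-of-envelope arithmetic, not a measurement, and the function names are illustrative. It ignores caches, activations, and kernel launch overhead.

```python
BANDWIDTH_BYTES_PER_S = 1008e9   # RTX 4090 memory bandwidth quoted above
PEAK_FP16_FLOPS = 165e12         # RTX 4090 FP16 compute quoted above

def weight_read_seconds(n_params: float, bytes_per_param: int) -> float:
    """Time to stream every weight from VRAM once at full bandwidth."""
    return n_params * bytes_per_param / BANDWIDTH_BYTES_PER_S

def ridge_flops_per_byte() -> float:
    """FLOPs a kernel must do per byte read before compute becomes the limit."""
    return PEAK_FP16_FLOPS / BANDWIDTH_BYTES_PER_S

n_params = 120e6
for name, nbytes in {"FP32": 4, "FP16": 2, "FP8": 1}.items():
    ms = weight_read_seconds(n_params, nbytes) * 1e3
    print(f"{name}: {n_params * nbytes / 1e6:.0f} MB of weights, {ms:.3f} ms per full read")

print(f"ridge point: ~{ridge_flops_per_byte():.0f} FLOPs per byte")
```

A kernel below the ridge point is bandwidth-bound, which is the regime where halving bytes per parameter translates most directly into speed.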
Tensor Cores: Hardware-Accelerated Matrix Multiply
The RTX 4090 ships with dedicated tensor core units that compute matrix multiplies directly in FP16 or FP8. A single tensor core operation multiplies a small block (e.g. 16x16) as one hardware operation, dramatically faster than scalar FP32 multiplies.
Empirical numbers from ANDREA-120M:
| Precision | Steps/min | Notes |
|---|---|---|
| FP32 | ~3 | baseline; no tensor core acceleration |
| FP16 | ~6 | cuBLAS tensor cores; 2x speedup |
| FP8 E4M3 | ~6 | tensor cores; comparable to FP16 |
FP8 did not beat FP16 in throughput on this workload because compute throughput stopped being the bottleneck; memory bandwidth & launch overhead became binding. ANDREA-120M v3 ships on FP16 cuBLAS at 6 steps/min for a comfortable safety margin without losing throughput.
NaN Risk at FP8
FP8 E4M3 represents magnitudes from ~2^-9 to ~448. Activations or gradients outside that range overflow to NaN (not a number) or underflow to zero. A single NaN poisons every downstream computation: a matrix multiply with a NaN input returns NaN in every output that depends on it; NaN gradients corrupt the AdamW state; AdamW with NaN m & v outputs NaN updates; the weights become NaN; the entire training run dies.
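The poisoning chain is easy to reproduce in plain Python floats. This is a toy sketch: the `adamw_step` helper is a simplified scalar AdamW update without bias correction, and feeding a layer output straight in as the gradient is a shortcut taken only to show propagation.

```python
import math

def matvec(matrix, vector):
    """Plain matrix-vector product; every output row sums over all inputs."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adamw_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One simplified scalar AdamW update (no bias correction)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - lr * (m / (math.sqrt(v) + eps) + weight_decay * w)
    return w, m, v

weights = [[0.5, -1.0], [2.0, 0.25]]
activations = [1.0, float("nan")]        # one poisoned input

outputs = matvec(weights, activations)   # every output depends on the NaN
print(outputs)

# Toy shortcut: treat the first output as a gradient for one weight.
w, m, v = adamw_step(0.5, outputs[0], m=0.0, v=0.0)
print(w, m, v)                           # weight and both moments are now NaN
```

Once m and v hold NaN, every later update is NaN as well, because the moving averages never forget it.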
ANDREA's FP8 experiments produced occasional NaN cliffs requiring loss scaling, scheduled precision switching, or fallback paths. FP16 dynamic range (~6×10^-5 to ~6.5×10^4) is wide enough that NaN events stay rare without intricate scaling tricks.
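Loss scaling is commonly implemented as a dynamic scaler: multiply the loss by a large factor before the backward pass, divide the gradients by it afterwards, and skip the step while shrinking the factor whenever a non-finite gradient appears. The class below is a generic sketch of that idea, not ANDREA's implementation; the class name, defaults, and growth rule are assumptions.

```python
import math

class DynamicLossScaler:
    """Generic dynamic loss scaler (illustrative, not tied to any library)."""

    def __init__(self, init_scale: float = 2.0 ** 16, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def unscale(self, scaled_grads):
        """Undo the scaling applied to the loss before the backward pass."""
        return [g / self.scale for g in scaled_grads]

    def update(self, grads) -> bool:
        """Return True if the optimizer step should run; adjust the scale either way."""
        if not all(math.isfinite(g) for g in grads):
            self.scale /= 2.0        # overflow seen: back off and skip this step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= 2.0        # long clean streak: try a larger scale
            self.good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.1, float("inf")]), scaler.scale)  # step skipped, scale halved
print(scaler.update([0.1, 0.2]), scaler.scale)           # clean step
print(scaler.update([0.1, 0.2]), scaler.scale)           # second clean step, scale doubles
```

Skipping the step is what keeps a transient overflow from reaching the optimizer state.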
Choosing Precision for a New Run
Fitting 120M on a Single 4090
The 6-8x Multiplier from the Intro Lesson
Recall from grow_a_language_model_intro that training memory equals roughly 6-8x the raw weight count, accounting for:
- Weights (1x)
- Adam first moment m (1x)
- Adam second moment v (1x)
- Gradient buffer (1x)
- Activations & temporaries (~2-4x, depends on batch & context)
ANDREA-120M at FP16 with batch_size=8, context=1024:
| Component | FP16 size |
|---|---|
| Weights | 240 MB |
| m (first moment) | 240 MB |
| v (second moment) | 240 MB |
| Gradients | 240 MB |
| Activations | ~2-4 GB (batch, ctx) |
| Total | ~3.5 GB |
The RTX 4090 has 24 GB of VRAM, so ANDREA-120M uses ~14% of it at FP16. That leaves plenty of room for larger batch sizes or longer context windows. ANDREA-12M used only 1.4 GB total.
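The budget table can be reproduced with a few lines of arithmetic. In this sketch the function name `training_budget_gb` is illustrative, and the 2.5 GB activation figure is an assumption picked from inside the stated ~2-4 GB band so that the total lands near the table's ~3.5 GB.

```python
def training_budget_gb(n_params: float, bytes_per_param: int, activations_gb: float):
    """Return (per-component sizes in GB, total in GB) for a single-precision budget."""
    per_copy_gb = n_params * bytes_per_param / 1e9
    components = {
        "weights": per_copy_gb,
        "m (first moment)": per_copy_gb,
        "v (second moment)": per_copy_gb,
        "gradients": per_copy_gb,
        "activations": activations_gb,   # assumed value, depends on batch & context
    }
    return components, sum(components.values())

VRAM_GB = 24.0  # RTX 4090

components, total_gb = training_budget_gb(120e6, bytes_per_param=2, activations_gb=2.5)
for name, size in components.items():
    print(f"{name:>18}: {size:.2f} GB")
print(f"{'total':>18}: {total_gb:.2f} GB ({total_gb / VRAM_GB:.0%} of {VRAM_GB:.0f} GB)")
```

Changing `activations_gb` across the 2-4 GB band moves the total between roughly 3 and 5 GB, which is why batch size and context length dominate the budget at this model size.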
Where Mixed Precision Lives
ANDREA does NOT keep everything at one precision. Mixed-precision training assigns a precision to each piece:
- Master weights: FP32 (preserves training stability)
- Forward & backward compute: FP16 (uses tensor cores)
- AdamW optimizer state: FP32 (m & v need precision for long-tail updates)
- Gradient buffer: FP16 (compute side)
The final memory budget mixes both. ANDREA's actual footprint sits between pure FP16 (720 MB for weights, m & v) & pure FP32 (1.44 GB for weights, m & v), closer to FP32 because the master weights, m & v stay in FP32.
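The comparison can be sketched by assigning a dtype to each persistent buffer. The assumptions here are loud ones: the gradient buffer is counted alongside weights, m, and v; activations are excluded; and no separate FP16 working copy of the weights is counted (keeping one would add another 240 MB to the mixed case). The helper name `state_mb` is illustrative.

```python
BYTES = {"fp32": 4, "fp16": 2}

def state_mb(n_params: float, weights: str, m: str, v: str, grads: str) -> dict:
    """Size in MB of each persistent buffer for the given dtype assignment."""
    dtypes = {"weights": weights, "m": m, "v": v, "grads": grads}
    return {name: n_params * BYTES[dtype] / 1e6 for name, dtype in dtypes.items()}

N = 120e6
configs = {
    "pure FP16": state_mb(N, "fp16", "fp16", "fp16", "fp16"),
    "pure FP32": state_mb(N, "fp32", "fp32", "fp32", "fp32"),
    # Mixed: FP32 master weights and AdamW state, FP16 gradient buffer.
    "mixed": state_mb(N, "fp32", "fp32", "fp32", "fp16"),
}
for name, sizes in configs.items():
    print(f"{name}: {sum(sizes.values()):.0f} MB total, {sizes}")
```

Under these assumptions the mixed configuration saves memory only on the gradient buffer, which is why it lands much nearer the FP32 total than the FP16 one.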
Sizing a Budget for ANDREA-480M
Precision in Practice
Adjacent Activities
Three sibling activities link to precision:
- Activity 1: Intro / VRAM budget. Precision multiplies every term in the memory budget arithmetic. The 6-8x multiplier rule of thumb is unitless; bytes-per-param gives it units.
- Activity 10: AdamW. Optimizer state (m & v) typically stays at FP32 even when forward/backward compute runs at FP16. Reason: long-tail accumulator precision matters more than runtime speed for the optimizer.
- Activity 12: Gradient clipping. Clipping caps gradient magnitudes before optimizer state updates. With FP16 forward/backward & FP32 optimizer, clipping happens at the boundary where precision changes & where overflow risk concentrates.
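The precision boundary described for gradient clipping can be sketched as one function: unscale the FP16-side gradients into full precision, check for overflow, then clip by global norm before the optimizer sees them. This is a generic illustration using Python floats as the stand-in for FP32; the function name and the convention of returning `None` to signal a skipped step are assumptions.

```python
import math

def unscale_and_clip(scaled_grads, loss_scale: float, max_norm: float):
    """Return (clipped grads or None, pre-clip global norm).

    None means a non-finite gradient was found and the step should be skipped.
    """
    # Boundary: gradients leave the low-precision side and are unscaled.
    grads = [g / loss_scale for g in scaled_grads]
    total_norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total_norm):
        return None, total_norm
    if total_norm > max_norm:
        factor = max_norm / total_norm
        grads = [g * factor for g in grads]   # rescale so the norm equals max_norm
    return grads, total_norm

clipped, norm = unscale_and_clip([3072.0, 4096.0], loss_scale=1024.0, max_norm=1.0)
print(clipped, norm)

skipped, bad_norm = unscale_and_clip([float("inf"), 1.0], loss_scale=1024.0, max_norm=1.0)
print(skipped, bad_norm)
```

Unscaling before clipping matters: clipping the still-scaled gradients would compare them against a threshold that is off by the loss scale.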
Precision is a nearly free knob: lower it, & the model trains faster & uses less memory. The cost is numerical care: NaN handling, loss scaling, mixed-precision discipline. ANDREA-120M v3 demonstrates the payoff: 120M parameters trained on consumer hardware in 23 days because FP16 halved the bytes per parameter on the compute path.