
Sign, Exponent, Mantissa

IEEE 754 Floating-Point Format

Every floating-point number stores three fields:


- Sign bit (1 bit): positive or negative

- Exponent (E bits): the magnitude scale, an integer power of 2

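- Mantissa (M bits): the fractional precision, stored as the fraction of a significand between 1.0 & just under 2.0


Total bits = 1 + E + M. The value equals roughly (-1)^sign × (1 + mantissa) × 2^(exponent - bias), where mantissa is the stored fraction in [0, 1) & bias is a fixed offset (2^(E-1) - 1 in IEEE 754 formats) that lets the unsigned exponent field encode negative powers of 2.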
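
The field layout can be checked directly on FP16 values. The following is a minimal sketch using Python's standard `struct` module, whose `e` format code packs IEEE 754 half precision; the helper names `decode_fp16` & `fp16_value` are made up for this illustration, & the sketch handles normal numbers only.

```python
import struct

def decode_fp16(x):
    """Split a value, rounded to FP16, into its sign, exponent, and mantissa fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))  # raw 16-bit pattern
    sign = bits >> 15               # 1 bit
    exponent = (bits >> 10) & 0x1F  # 5 bits, stored with a bias of 15
    mantissa = bits & 0x3FF         # 10 bits, the fraction after the implicit leading 1
    return sign, exponent, mantissa

def fp16_value(sign, exponent, mantissa, bias=15):
    """Rebuild a normal FP16 value from its fields (subnormals, inf, NaN are ignored)."""
    return (-1) ** sign * (1 + mantissa / 2 ** 10) * 2.0 ** (exponent - bias)

fields = decode_fp16(-6.5)
print(fields)               # (1, 17, 640)
print(fp16_value(*fields))  # -6.5
```

Here -6.5 = -1.625 × 2^2, so the stored exponent is 2 + 15 = 17 & the stored fraction is 0.625 × 1024 = 640.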


Two properties matter for training:


Dynamic range = 2^(2^E) (roughly), the ratio between the largest & smallest representable magnitudes. More exponent bits means representing smaller & larger numbers without underflow or overflow.


Precision = 2^M distinct values per power of 2. More mantissa bits means finer-grained representation between consecutive powers of 2.
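
One way to see what "2^M values per power of 2" means in practice is to compute the gap between neighboring representable values. The sketch below is illustrative only: `spacing` & `round_to_fp16` are hypothetical helper names, & the formula covers normal numbers, not subnormals.

```python
import math
import struct

def spacing(x, mantissa_bits):
    """Gap between adjacent representable values near x (normal range only)."""
    return 2.0 ** (math.floor(math.log2(abs(x))) - mantissa_bits)

def round_to_fp16(x):
    """Round a Python float to the nearest FP16 value via the struct module."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(spacing(1.0, 10))     # FP16 gap just above 1.0: 2**-10
print(spacing(1024.0, 10))  # FP16 gap just above 1024: 1.0
print(spacing(1.0, 23))     # FP32 gap just above 1.0: 2**-23

# An increment well below the gap is lost when the sum is stored in FP16.
print(round_to_fp16(1.0 + 2 ** -12) == 1.0)  # True
```

The gap grows with magnitude: at 1024 an FP16 value cannot register a change smaller than about 1, which is why small weight updates can vanish at low precision.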


The Three Formats


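| Format | Total bits | Sign | Exp | Mant | Dynamic range | Precision |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ~10^-38 to ~10^38 | ~7 digits |
| FP16 | 16 | 1 | 5 | 10 | ~10^-5 to ~10^5 | ~3 digits |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~2^-9 to ~448 | ~1 digit |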

The name FP8 E4M3 reads as "4 exponent bits, 3 mantissa bits". An alternative, FP8 E5M2, trades precision for range; ANDREA experiments use E4M3 because transformer activations stay in narrow magnitude bands where extra precision wins over extra range.
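
The range limits quoted for each format follow from the bit counts. The sketch below derives them under two stated assumptions: IEEE 754 formats reserve the top exponent code for infinity & NaN, while E4M3 is assumed to follow the common convention of keeping that exponent usable & reserving only its all-ones mantissa pattern for NaN. The function name `format_limits` is hypothetical.

```python
def format_limits(exp_bits, man_bits, ieee_specials=True):
    """Return (smallest subnormal, smallest normal, largest finite) for a float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        # IEEE 754 style: the all-ones exponent code is reserved for inf and NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_sig = 2 - 2.0 ** -man_bits
    else:
        # E4M3 style (assumed): the top exponent stays usable, but its
        # all-ones mantissa pattern is NaN, so the largest mantissa is lost.
        max_exp = (2 ** exp_bits - 1) - bias
        max_sig = 2 - 2 * 2.0 ** -man_bits
    smallest_subnormal = 2.0 ** (1 - bias - man_bits)
    smallest_normal = 2.0 ** (1 - bias)
    largest = max_sig * 2.0 ** max_exp
    return smallest_subnormal, smallest_normal, largest

formats = [("FP32", 8, 23, True), ("FP16", 5, 10, True),
           ("FP8 E4M3", 4, 3, False), ("FP8 E5M2", 5, 2, True)]
for name, e, m, ieee in formats:
    sub, norm, top = format_limits(e, m, ieee)
    print(f"{name:9s} subnormal {sub:.3g}  normal {norm:.3g}  max {top:.6g}")
```

Under these assumptions E4M3 spans 2^-9 to 448, while E5M2 reaches 57344 with one fewer mantissa bit, which is the precision-for-range trade described above.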

Bytes Per Parameter

ANDREA-120M holds approximately 120,000,000 parameters. Compute the storage footprint of just the weight matrices in (a) FP32, (b) FP16, (c) FP8. Show your arithmetic in MB. Then compute (d) the storage footprint with weight + Adam first moment + Adam second moment (3x the weight count) at FP16.

Why Lower Precision Runs Faster

Memory Bandwidth Dominates Training Speed

Modern GPUs often spend more time waiting on memory than computing. The RTX 4090 has 1008 GB/s memory bandwidth & 165 TFLOPS of FP16 compute. A typical layer reads weights from VRAM, multiplies them with activations, & writes results back. Unless the layer performs many FLOPs per byte moved, bandwidth, not compute, decides throughput.


Halving precision halves bytes per parameter, so reading the same weights uses half the memory bandwidth. For a bandwidth-bound workload, throughput roughly doubles.
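
A back-of-the-envelope calculation makes the bandwidth argument concrete. This sketch uses only the figures quoted in this lesson (120M parameters, 1008 GB/s, 165 TFLOPS); the function name `weight_read_ms` is hypothetical, & real kernels add caching & launch overheads that this ignores.

```python
PARAMS = 120_000_000            # ANDREA-120M parameter count from this lesson
BANDWIDTH_BYTES_PER_S = 1008e9  # RTX 4090 memory bandwidth, 1008 GB/s
FP16_FLOPS = 165e12             # RTX 4090 FP16 compute, 165 TFLOPS

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8": 1}

def weight_read_ms(params, bytes_per_param, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Milliseconds needed to stream every weight from VRAM once."""
    return params * bytes_per_param / bandwidth * 1e3

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_read_ms(PARAMS, nbytes):.3f} ms per full weight read")

# FLOPs the GPU can perform in the time it takes to move one byte.
print(f"break-even intensity: {FP16_FLOPS / BANDWIDTH_BYTES_PER_S:.0f} FLOPs/byte")
```

A layer that performs fewer FLOPs per byte than the break-even figure leaves the compute units idle while it waits on memory, so shrinking the bytes is what speeds it up.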


Tensor Cores: Hardware-Accelerated Matrix Multiply

The RTX 4090 ships dedicated tensor core units that compute matrix multiplies directly at FP16 or FP8. A single tensor core operation multiplies a small block (e.g. 16x16) at once, dramatically faster than issuing scalar FP32 multiplies one by one.


Empirical numbers from ANDREA-120M:


| Precision | Steps/min | Notes |
|---|---|---|
| FP32 | ~3 | baseline; no tensor core acceleration |
| FP16 | ~6 | cuBLAS tensor cores; 2x speedup |
| FP8 E4M3 | ~6 | tensor cores; comparable to FP16 |

FP8 did not beat FP16 in throughput on this workload because compute throughput stopped being the bottleneck; memory bandwidth & launch overhead became binding. ANDREA-120M v3 ships on FP16 cuBLAS at 6 steps/min for a comfortable safety margin without losing throughput.


NaN Risk at FP8

FP8 E4M3 represents magnitudes from ~2^-9 to ~448. Activations or gradients outside that range overflow to NaN (not a number) or underflow to zero. A single NaN poisons every downstream computation: a matrix multiply with a NaN input returns NaN in every output it touches; NaN gradients corrupt AdamW state; AdamW with NaN m & v outputs NaN updates; the weights become NaN; the entire training run dies.
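
The poisoning chain can be reproduced in a few lines. The sketch below is a toy model, not real FP8 arithmetic: `clamp_to_e4m3_range` is a hypothetical helper that only applies the range limits quoted in this lesson & does not model rounding to 3 mantissa bits, & some hardware saturates on overflow instead of producing NaN.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value, from the lesson
E4M3_MIN = 2.0 ** -9  # smallest nonzero E4M3 value, from the lesson

def clamp_to_e4m3_range(x):
    """Toy range check only: overflow becomes NaN, underflow becomes 0.
    Rounding to 3 mantissa bits is deliberately not modeled."""
    if math.isnan(x) or abs(x) > E4M3_MAX:
        return math.nan
    if abs(x) < E4M3_MIN:
        return 0.0
    return x

def dot(a, b):
    """Plain dot product; one NaN input makes the whole sum NaN."""
    return sum(x * y for x, y in zip(a, b))

activations = [clamp_to_e4m3_range(v) for v in [0.5, 600.0, 0.001, 12.0]]
weights = [0.1, 0.2, 0.3, 0.4]
print(activations)                # 600.0 became nan, 0.001 became 0.0
print(dot(activations, weights))  # nan
```

One out-of-range activation is enough to turn the whole dot product into NaN, even though the other three terms are perfectly ordinary.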


ANDREA's FP8 experiments produced occasional NaN cliffs requiring loss scaling, scheduled precision switching, or fallback paths. FP16 dynamic range (~10^-5 to ~10^5) is wide enough that NaN events stay rare without intricate scaling tricks.
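
Loss scaling, mentioned above, multiplies the loss by a large factor before the backward pass so that small gradients stay representable, then divides the gradients by the same factor before the optimizer step. The following is a minimal sketch of the dynamic variant in plain Python. It is not ANDREA's implementation: the class name `DynamicLossScaler` is hypothetical, & the default constants are assumptions chosen to resemble commonly used values.

```python
import math

class DynamicLossScaler:
    """Hypothetical minimal loss scaler: shrink on overflow, grow after a clean streak."""

    def __init__(self, scale=2.0 ** 16, growth=2.0, backoff=0.5, growth_interval=2000):
        self.scale = scale
        self.growth = growth
        self.backoff = backoff
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None when the step should be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale *= self.backoff  # overflow: shrink the scale, skip the update
            self.clean_steps = 0
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= self.growth   # long clean streak: try a larger scale
            self.clean_steps = 0
        return unscaled

scaler = DynamicLossScaler(scale=1024.0, growth_interval=2)
print(scaler.step([2048.0, math.inf]))  # None, scale halves to 512.0
print(scaler.step([512.0, 1024.0]))     # [1.0, 2.0]
print(scaler.scale)                     # 512.0
```

Skipping the update on overflow is the key design choice: the non-finite gradients never reach the optimizer state, so one bad step costs a wasted iteration rather than the whole run.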


Precision Comparison: FP32 vs FP16 vs FP8

Choosing Precision for a New Run

You are starting a new ANDREA-style training run on an RTX 4090. You have two priorities in conflict: (1) maximize steps/min, (2) avoid debugging NaN crashes mid-training. ANDREA-120M v3 chose FP16 cuBLAS over FP8 E4M3 despite both running at ~6 steps/min. Reason about why FP16 won this decision. Reference dynamic range AND tensor core support in your answer.

Fitting 120M on a Single 4090

The 6-8x Multiplier from the Intro Lesson

Recall from grow_a_language_model_intro that training memory equals roughly 6-8x the raw weight count, accounting for:


- Weight (1x)

- Adam first moment m (1x)

- Adam second moment v (1x)

- Gradient buffer (1x)

- Activations & temporaries (~2-4x, depends on batch & context)


ANDREA-120M at FP16 with batch_size=8, context=1024:


| Component | FP16 size |
|---|---|
| Weights | 240 MB |
| m (first moment) | 240 MB |
| v (second moment) | 240 MB |
| Gradients | 240 MB |
| Activations | ~2-4 GB (depends on batch, ctx) |
| Total | ~3.5 GB |

The RTX 4090 has 24 GB of VRAM. ANDREA-120M uses ~14% of it at FP16, leaving plenty of room for larger batch sizes or longer context windows. ANDREA-12M used only 1.4 GB total.
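
The budget arithmetic can be written as a small calculator. This is a sketch under stated assumptions: the function name `training_budget_mb` is hypothetical, 1 MB is taken as 10^6 bytes to match the figures in this lesson, & the 2500 MB activation figure is an assumed point inside the quoted 2-4 GB range.

```python
def training_budget_mb(params, bytes_per_value, activations_mb):
    """Per-component training memory in MB (1 MB = 10**6 bytes)."""
    per_copy = params * bytes_per_value / 1e6
    budget = {
        "weights": per_copy,
        "adam_m": per_copy,
        "adam_v": per_copy,
        "gradients": per_copy,
        "activations": activations_mb,
    }
    budget["total"] = sum(budget.values())
    return budget

# 2500 MB of activations is an assumed point inside the lesson's 2-4 GB range.
budget = training_budget_mb(120_000_000, 2, activations_mb=2500)
for name, mb in budget.items():
    print(f"{name:12s} {mb:8.0f} MB")
print(f"share of 24 GB VRAM: {budget['total'] / 24_000:.1%}")
```

With these inputs the four parameter-sized buffers total 960 MB, & the whole budget lands near 3.5 GB, roughly 14% of the card.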


Where Mixed Precision Lives

ANDREA does NOT keep everything at one precision. Mixed-precision training stores:


- Master weights: FP32 (preserves training stability)

- Forward & backward compute: FP16 (uses tensor cores)

- AdamW optimizer state: FP32 (m & v need precision for long-tail updates)

- Gradient buffer: FP16 (compute side)


The final memory budget mixes both. For weights + m + v, ANDREA's actual footprint sits between pure FP16 (720 MB) & pure FP32 (1.44 GB), closer to FP32 because the master weights, m, & v all stay in FP32.
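
The mixed-precision plan listed above can be compared against the two pure plans with a short calculation. This is an illustrative sketch: the names `MIXED_PLAN` & `plan_mb` are hypothetical, activations are left out, & any temporary FP16 copy of the weights used during compute is not counted.

```python
PARAMS = 120_000_000
BYTES = {"FP32": 4, "FP16": 2}

# Precision per stored tensor, following the mixed-precision list in this lesson.
MIXED_PLAN = {
    "master_weights": "FP32",
    "adam_m": "FP32",
    "adam_v": "FP32",
    "gradients": "FP16",
}

def plan_mb(plan, params=PARAMS):
    """Map each component to its size in MB (1 MB = 10**6 bytes)."""
    return {name: params * BYTES[fmt] / 1e6 for name, fmt in plan.items()}

mixed = plan_mb(MIXED_PLAN)
pure_fp16 = plan_mb({name: "FP16" for name in MIXED_PLAN})
pure_fp32 = plan_mb({name: "FP32" for name in MIXED_PLAN})

for label, plan in [("pure FP16", pure_fp16), ("mixed", mixed), ("pure FP32", pure_fp32)]:
    print(f"{label:10s} {sum(plan.values()):6.0f} MB")
```

Only the gradient buffer shrinks in the mixed plan, which is why its total sits much nearer the pure FP32 figure than the pure FP16 one.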

Sizing a Budget for ANDREA-480M

ANDREA-480M (the planned third member of the family) holds ~480 million parameters. Estimate (a) FP16 weights only in MB, (b) FP16 weights + m + v in MB (assume m & v also FP16 for simplicity), & (c) given the 6-8x multiplier rule of thumb, total training-time footprint at FP16. Does ANDREA-480M fit on a single RTX 4090 (24 GB)?

Precision in Practice

Suppose you discovered mid-training that ANDREA-120M was producing occasional NaN losses every ~5000 steps at FP16, & each NaN required restarting from a checkpoint. What ONE change would you try first to reduce NaN frequency without leaving FP16? Justify with a one-sentence mechanism.

Adjacent Activities

Three siblings link to precision:


- Activity 1: Intro / VRAM budget. Precision multiplies every term in the memory budget arithmetic. The 6-8x multiplier rule of thumb is unitless; bytes-per-param gives it units.

- Activity 10: AdamW. Optimizer state (m & v) typically stays at FP32 even when forward/backward compute runs at FP16. Reason: long-tail accumulator precision matters more than runtime speed for the optimizer.

- Activity 12: Gradient clipping. Clipping caps gradient magnitudes before optimizer state updates. With FP16 forward/backward & FP32 optimizer, clipping happens at the boundary where precision changes & where overflow risk concentrates.


Precision is a powerful knob: lower it, & the model trains faster & uses less memory. The cost is numerical care: NaN handling, loss scaling, mixed-precision discipline. ANDREA-120M v3 demonstrates the payoff: 120M parameters trained on consumer hardware in 23 days because FP16 halved the bytes per parameter & doubled steps/min relative to FP32.