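FP8 Training
DeepSeek-V3 models adopt Blockwise FP8 training, which differs from per-tensor FP8 in two ways:
- Finer-grained scaling (tile-wise for activations, block-wise for weights) replaces per-tensor quantisation, so an outlier only affects the scale of its own tile or block and quantisation noise drops.
- Amax statistics are taken from the current tensor, which avoids the distribution-shift error of delayed scaling, where scales are derived from earlier steps.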
This section lists the feature flags / environment variables required to turn the scheme on in LoongForge, gives a proven recipe, and collects troubleshooting hints.
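To make the granularity argument concrete, here is a minimal pure-Python sketch comparing one scale per tensor with one scale per tile. The helper names, the tile size of 4, and the sample values are illustrative assumptions, not LoongForge or Transformer Engine code; the only fixed fact used is that the largest finite E4M3 magnitude is 448.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(values):
    """One scale for the whole tensor: the global amax maps onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax


def per_tile_scales(values, tile):
    """One scale per contiguous tile of `tile` elements (hypothetical helper)."""
    scales = []
    for start in range(0, len(values), tile):
        amax = max(abs(v) for v in values[start:start + tile])
        scales.append(E4M3_MAX / amax)
    return scales


# First tile holds small values, second tile holds an outlier (-250.0).
row = [0.01, -0.02, 0.015, 0.005, 3.0, -250.0, 1.0, 2.0]

print("per-tensor scale:", per_tensor_scale(row))
print("per-tile scales: ", per_tile_scales(row, tile=4))
```

With a single scale the outlier dictates how the small values are quantised; with per-tile scales the first tile keeps a much larger scale and therefore far more resolution. The sketch deliberately avoids all-zero tiles, which is the case the epsilon variables in section 1.2 guard against.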
0. Prerequisites
| Item | Requirement |
|---|---|
| Hardware | Native FP8 support |
| Software | Transformer Engine enabled in the framework |
| Care | FP8 is numerically stricter, so keep NaN/Inf/overflow monitors active while you dial in the setup |
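As a rough idea of what a NaN/Inf monitor can look like, the sketch below scans named statistics for non-finite values. It is a stand-alone illustration: `find_non_finite` and the statistic names are hypothetical, and a real run would feed it values pulled from the framework's own logging rather than hand-written lists.

```python
import math


def find_non_finite(named_values):
    """Return the names whose values contain NaN or +/-Inf, in input order."""
    return [
        name
        for name, values in named_values.items()
        if any(not math.isfinite(v) for v in values)
    ]


# Hypothetical per-step statistics collected from a training loop.
step_stats = {
    "loss": [2.31],
    "grad_norm": [0.87],
    "layer3.grad": [0.1, float("nan"), -0.2],
    "layer7.act": [1.0, float("inf")],
}

bad = find_non_finite(step_stats)
if bad:
    print("non-finite values in:", ", ".join(bad))
```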
1. Feature switches
1.1 CLI arguments
| Argument | Meaning |
|---|---|
| `--fp8-format e4m3` | Use E4M3 (4-bit exponent, 3-bit mantissa) for FP8 tensors. Must be combined with `--fp8-recipe blockwise`. |
| `--fp8-recipe blockwise` | Turn on block-wise / tile-wise quantisation and per-block/tile amax tracking. Requires `--fp8-format e4m3`. |
| `--fp8-param-gather` | Keep weights in FP8 during distributed gather/communication and throughout the param buffer. Lowers memory and traffic, but needs a full convergence and checkpoint regression test. |
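The flag dependencies can be checked before a job is submitted. The following `argparse` sketch mirrors the three flags used in this section and reports inconsistent combinations. It is not LoongForge's actual parser: the restricted `choices`, the `validate` helper, and the rule that `--fp8-param-gather` needs the blockwise recipe are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the three FP8 flags discussed above."""
    parser = argparse.ArgumentParser(description="FP8 flag sketch")
    parser.add_argument("--fp8-format", choices=["e4m3"], default=None)
    parser.add_argument("--fp8-recipe", choices=["blockwise"], default=None)
    parser.add_argument("--fp8-param-gather", action="store_true")
    return parser


def validate(args):
    """Return a list of problems; an empty list means the flags are consistent."""
    problems = []
    # The format and recipe flags are only meaningful as a pair.
    if (args.fp8_format is None) != (args.fp8_recipe is None):
        problems.append("--fp8-format e4m3 and --fp8-recipe blockwise must be set together")
    # Assumption: FP8 parameter gather presupposes the blockwise recipe.
    if args.fp8_param_gather and args.fp8_recipe is None:
        problems.append("--fp8-param-gather needs the blockwise FP8 recipe enabled")
    return problems


parser = build_parser()
for argv in (["--fp8-format", "e4m3", "--fp8-recipe", "blockwise"], ["--fp8-format", "e4m3"]):
    print(argv, "->", validate(parser.parse_args(argv)) or "ok")
```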
1.2 Environment variables
| Variable | Purpose |
|---|---|
| `FP8_QUANT_FWD_INP_AMAX_EPS` | Epsilon clamp for forward activation amax (avoids division by zero and the resulting NaN). Default 0, recommended 1e-12. |
| `FP8_QUANT_FWD_WEIGHT_AMAX_EPS` | Same for forward weight amax. |
| `FP8_QUANT_BWD_GRAD_AMAX_EPS` | Same for backward gradient amax. If NaN appears in back-prop, check this first. |
|  | Store scaling factors in FP32 instead of E8M0 when enabled. |
|  | Force E8M0 scales for forward activations when enabled. |
|  | Force E8M0 scales for forward weights when enabled. |
|  | Force E8M0 scales for backward grads when enabled. |
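The effect of both groups of variables can be sketched in a few lines of plain Python. The first helper shows why an amax of exactly zero is dangerous without an epsilon clamp; the second shows what restricting a scale to a power of two, as an E8M0 value requires, does to an FP32 scale. Both helpers are hypothetical, and the round-down direction is an assumption chosen so that the scaled amax cannot exceed the E4M3 maximum; the framework's actual rounding may differ.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def scale_from_amax(amax, eps):
    """Scale that maps the clamped amax onto E4M3_MAX."""
    clamped = max(amax, eps)
    if clamped == 0.0:
        # Python raises on float division by zero; IEEE hardware yields inf here.
        return math.inf
    return E4M3_MAX / clamped


def to_power_of_two(scale):
    """Round a positive finite scale down to a power of two (E8M0-style value)."""
    _, exponent = math.frexp(scale)  # scale == m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(1.0, exponent - 1)


all_zero_tile_amax = 0.0
unguarded = scale_from_amax(all_zero_tile_amax, eps=0.0)
guarded = scale_from_amax(all_zero_tile_amax, eps=1e-12)

print(unguarded, 0.0 * unguarded)  # inf and nan: the zero tile poisons the result
print(guarded, 0.0 * guarded)      # finite scale, product stays 0.0
print(to_power_of_two(1.792))      # an FP32 scale of 1.792 becomes 1.0
```

Power-of-two scales are cheaper to store and apply but coarser than FP32 scales, which is the trade-off the last four variables expose.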
2. Recommended recipe
Stage 1 – baseline (prove stability)
--fp8-format e4m3 \
--fp8-recipe blockwise
Train until loss/metrics match the BF16 reference.
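One simple way to quantify "match" is the largest relative gap between the two loss curves at the same steps. The sketch below assumes both runs logged a non-zero loss at identical steps; the function name, the sample curves, and the 2% tolerance are illustrative choices, not a LoongForge acceptance criterion.

```python
def max_relative_gap(fp8_losses, bf16_losses):
    """Largest |fp8 - bf16| / |bf16| over steps logged by both runs."""
    if len(fp8_losses) != len(bf16_losses):
        raise ValueError("loss curves must cover the same steps")
    return max(abs(f - b) / abs(b) for f, b in zip(fp8_losses, bf16_losses))


# Hypothetical loss values sampled at the same training steps.
bf16 = [2.50, 2.10, 1.80, 1.60]
fp8 = [2.51, 2.12, 1.81, 1.60]

TOLERANCE = 0.02  # illustrative threshold; pick one that suits your model
gap = max_relative_gap(fp8, bf16)
print(f"max relative gap: {gap:.4f}", "ok" if gap <= TOLERANCE else "investigate")
```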
Stage 2 – optimise (save memory)
--fp8-format e4m3 \
--fp8-recipe blockwise \
--fp8-param-gather
Re-run the full convergence test, the downstream evaluation, and a checkpoint save/reload round-trip.
Universal epsilon guard (add at the top of your launch script)
export FP8_QUANT_FWD_INP_AMAX_EPS=1e-12
export FP8_QUANT_FWD_WEIGHT_AMAX_EPS=1e-12
export FP8_QUANT_BWD_GRAD_AMAX_EPS=1e-12
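If jobs are launched from Python rather than a shell script, the same recipe can be assembled programmatically. The sketch below only combines the flags and variables shown in this section; `fp8_launch_config` is a hypothetical helper and says nothing about the rest of a real LoongForge command line.

```python
import os

EPS_VARS = (
    "FP8_QUANT_FWD_INP_AMAX_EPS",
    "FP8_QUANT_FWD_WEIGHT_AMAX_EPS",
    "FP8_QUANT_BWD_GRAD_AMAX_EPS",
)


def fp8_launch_config(stage, eps="1e-12", base_env=None):
    """Return (env, args) for Stage 1 or Stage 2 of the recipe above."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    # Start from the caller's environment unless an explicit base is given.
    env = dict(os.environ if base_env is None else base_env)
    for name in EPS_VARS:
        env[name] = eps
    args = ["--fp8-format", "e4m3", "--fp8-recipe", "blockwise"]
    if stage == 2:
        args.append("--fp8-param-gather")
    return env, args


env, args = fp8_launch_config(stage=2, base_env={})
print(" ".join(args))
```

The returned `env` and `args` can then be appended to whatever launcher command you already use, for example through `subprocess.run(..., env=env)`.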
3. Quick troubleshooting checklist
| Symptom | Likely fix |
|---|---|
| NaN/Inf in loss or grads | Raise the three `*_AMAX_EPS` values. |
| Divergence vs. BF16 | Disable `--fp8-param-gather`. |
| Checkpoint reload failure | Use the same FP8 flags and epsilon values that were in effect when the checkpoint was saved. |
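For the checkpoint case, a quick diff of the FP8 settings recorded at save time against the current launch settings narrows things down. The sketch assumes you keep such a record yourself (for example, a small dictionary stored next to the checkpoint); the helper name and the dictionary layout are hypothetical.

```python
def fp8_config_mismatches(saved, current):
    """Map each differing key to its (saved, current) pair of values."""
    keys = sorted(set(saved) | set(current))
    return {
        key: (saved.get(key), current.get(key))
        for key in keys
        if saved.get(key) != current.get(key)
    }


# Hypothetical record written at save time versus the settings of the reload job.
saved = {
    "--fp8-format": "e4m3",
    "--fp8-recipe": "blockwise",
    "--fp8-param-gather": True,
    "FP8_QUANT_BWD_GRAD_AMAX_EPS": "1e-12",
}
current = dict(saved, **{"--fp8-param-gather": False, "FP8_QUANT_BWD_GRAD_AMAX_EPS": "0"})

for key, (old, new) in fp8_config_mismatches(saved, current).items():
    print(f"{key}: saved={old!r} current={new!r}")
```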
With the above switches and epsilon guards, Blockwise FP8 training in LoongForge is ready for production-scale runs.