FP8 Training

DeepSeek-V3 models adopt Blockwise FP8 training:

  • Finer-grained scaling (tile-wise for activations, block-wise for weights) replaces per-tensor quantisation, cutting quantisation noise.

  • Amax statistics are computed from the current tensor, which avoids the distribution-shift error that affects delayed (history-based) scale updates.
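The difference between per-tensor and fine-grained scaling can be sketched in a few lines of plain Python. This is an illustration only, not Transformer Engine or LoongForge code: the helper names are hypothetical, and the 128-element tile and 128 x 128 block sizes are assumptions taken from the DeepSeek-V3 technical report rather than from this page.

```python
# Illustrative sketch only: pure-Python scale computation, not framework code.
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
TILE = 128        # assumed tile/block edge (DeepSeek-V3 report), not confirmed here


def per_tensor_scale(matrix):
    """One scale for the whole tensor: a single outlier shrinks it for every element."""
    amax = max(abs(v) for row in matrix for v in row)
    return E4M3_MAX / amax  # an all-zero tensor would divide by zero here


def tile_scales(matrix, tile=TILE):
    """Activation-style scaling: one scale per 1 x `tile` slice of each row."""
    scales = []
    for row in matrix:
        row_scales = []
        for start in range(0, len(row), tile):
            amax = max(abs(v) for v in row[start:start + tile])
            row_scales.append(E4M3_MAX / amax)
        scales.append(row_scales)
    return scales


def block_scales(matrix, block=TILE):
    """Weight-style scaling: one scale per `block` x `block` sub-matrix."""
    scales = []
    for r in range(0, len(matrix), block):
        row_scales = []
        for c in range(0, len(matrix[0]), block):
            amax = max(abs(v) for row in matrix[r:r + block] for v in row[c:c + block])
            row_scales.append(E4M3_MAX / amax)
        scales.append(row_scales)
    return scales


# A 2 x 256 activation with a single outlier in the first tile of row 0.
acts = [[1.0] * 256 for _ in range(2)]
acts[0][3] = 448.0
global_scale = per_tensor_scale(acts)  # dominated by the outlier
local_scales = tile_scales(acts)       # only the outlier's tile is affected
```

With per-tensor scaling the outlier forces every element to share the small scale, whereas the tile-wise version keeps the full FP8 range for the three tiles that do not contain it.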

This section lists the feature flags / environment variables required to turn the scheme on in LoongForge, gives a proven recipe, and collects troubleshooting hints.


0. Prerequisites

| Item | Requirement |
| --- | --- |
| Hardware | Native FP8 support |
| Software | Transformer Engine enabled in the framework |
| Care | FP8 is numerically stricter → keep NaN/Inf/overflow monitors active while you dial in the setup |


1. Feature switches

1.1 CLI arguments

| Argument | Meaning |
| --- | --- |
| `--fp8-format e4m3` | Use E4M3 (4-bit exponent, 3-bit mantissa) for FP8 tensors. Must be combined with `--fp8-recipe blockwise`. |
| `--fp8-recipe blockwise` | Turn on block-wise / tile-wise quantisation and per-block/tile amax tracking. Requires `--fp8-format e4m3`. |
| `--fp8-param-gather` | Keep weights in FP8 during distributed gather/communication and throughout the param buffer. Lowers memory and traffic, but needs a full convergence & checkpoint regression test. |
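The pairing rule between `--fp8-format e4m3` and `--fp8-recipe blockwise` is easy to violate in a long launch script, so a small pre-flight check can help. The sketch below uses a hypothetical stand-in parser that only mirrors the three flag names listed here; it is not LoongForge's actual argument parser, and `check_fp8_args` is an illustrative name.

```python
import argparse


def build_parser():
    # Hypothetical stand-in: mirrors only the three flags described above.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--fp8-format", default=None)
    parser.add_argument("--fp8-recipe", default=None)
    parser.add_argument("--fp8-param-gather", action="store_true")
    return parser


def check_fp8_args(argv):
    """Parse argv and raise ValueError if the blockwise pairing rule is broken."""
    # parse_known_args lets unrelated training flags pass through untouched.
    args, _unknown = build_parser().parse_known_args(argv)
    uses_blockwise = args.fp8_recipe == "blockwise"
    uses_e4m3 = args.fp8_format == "e4m3"
    if uses_blockwise != uses_e4m3:
        raise ValueError(
            "--fp8-format e4m3 and --fp8-recipe blockwise must be given together"
        )
    return args


# Example: the flag combination described in this section.
args = check_fp8_args(["--fp8-format", "e4m3", "--fp8-recipe", "blockwise"])
```

Running the check before the job is submitted surfaces a missing flag immediately instead of after the model has been built.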

1.2 Environment variables

| Variable | Purpose |
| --- | --- |
| `FP8_QUANT_FWD_INP_AMAX_EPS` | Epsilon clamp for forward activation amax (avoid div-by-zero → NaN). Default 0, recommended 1e-12. |
| `FP8_QUANT_FWD_WEIGHT_AMAX_EPS` | Same for forward weight amax. |
| `FP8_QUANT_BWD_GRAD_AMAX_EPS` | Same for backward gradient amax. If NaN appears in back-prop, check this first. |
| `NVTE_FP8_BLOCK_SCALING_FP32_SCALES` | Store scaling factors in FP32 instead of E8M0 when set to 1. Do not enable on Blackwell. |
| `NVTE_FP8_BLOCK_SCALING_FWD_INP_POWER2` | Force E8M0 scales for forward activations when set to 1. |
| `NVTE_FP8_BLOCK_SCALING_FWD_WEIGHT_POWER2` | Force E8M0 scales for forward weights when set to 1. |
| `NVTE_FP8_BLOCK_SCALING_BWD_GRAD_POWER2` | Force E8M0 scales for backward grads when set to 1. |
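To make the effect of these variables concrete, the following sketch shows what an epsilon clamp and a power-of-two (E8M0-style) scale do to a single amax value, together with a helper that assembles the environment for a launch. Everything here is illustrative: `compute_scale` and `fp8_env` are hypothetical names, rounding the scale down to a power of two is an assumption about the kernel behaviour, and the string form `"1e-12"` for the epsilon values is assumed rather than documented.

```python
import math
import os

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

EPS_VARS = (
    "FP8_QUANT_FWD_INP_AMAX_EPS",
    "FP8_QUANT_FWD_WEIGHT_AMAX_EPS",
    "FP8_QUANT_BWD_GRAD_AMAX_EPS",
)
POWER2_VARS = (
    "NVTE_FP8_BLOCK_SCALING_FWD_INP_POWER2",
    "NVTE_FP8_BLOCK_SCALING_FWD_WEIGHT_POWER2",
    "NVTE_FP8_BLOCK_SCALING_BWD_GRAD_POWER2",
)


def compute_scale(amax, eps=0.0, power2=False):
    """Scale mapping `amax` onto the E4M3 range, with optional guards."""
    amax = max(amax, eps)  # epsilon clamp keeps the divisor away from zero
    if amax == 0.0:
        return float("inf")  # unguarded all-zero tile: inf scale, NaN downstream
    scale = E4M3_MAX / amax
    if power2:
        # Assumed behaviour: round down to a power of two so the scale fits an
        # exponent-only format and scaled values stay within range.
        scale = 2.0 ** math.floor(math.log2(scale))
    return scale


def fp8_env(eps="1e-12", power2=False, base=None):
    """Return a copy of the environment with the FP8 variables filled in."""
    env = dict(os.environ if base is None else base)
    for name in EPS_VARS:
        env[name] = eps
    if power2:
        for name in POWER2_VARS:
            env[name] = "1"
    return env


unguarded = compute_scale(0.0)            # infinite scale
guarded = compute_scale(0.0, eps=1e-12)   # large but finite scale
pow2 = compute_scale(3.0, power2=True)    # 448 / 3 rounded down to 128.0
```

The dictionary returned by `fp8_env` can be passed as the `env` argument of `subprocess.run` when starting the training command.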



3. Quick troubleshooting checklist

| Symptom | Likely fix |
| --- | --- |
| NaN/Inf in loss or grads | Raise the three `*_AMAX_EPS` values gradually (1e-12 → 1e-10). |
| Divergence vs. BF16 | Disable `--fp8-param-gather` first; if still diverging, lower LR 10-20%. |
| Checkpoint reload failure | Ensure the same FP8 flags & epsilon values were used when the checkpoint was saved. |
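Two of the checklist items lend themselves to small helpers: stepping the epsilon values after a non-finite loss, and comparing the FP8 settings recorded with a checkpoint against the current ones. The sketch below is a minimal illustration with hypothetical names; the intermediate `1e-11` step is an assumption about what "gradually" means, and how the settings are stored alongside a checkpoint is left to the reader.

```python
import math

# Assumed escalation steps within the range given in the checklist.
EPS_SCHEDULE = ["1e-12", "1e-11", "1e-10"]


def all_finite(values):
    """True if no value is NaN or +/-Inf (e.g. the loss and gradient norms of a step)."""
    return all(math.isfinite(v) for v in values)


def next_eps(current):
    """Return the next larger epsilon, or None once the schedule is exhausted."""
    idx = EPS_SCHEDULE.index(current)
    return EPS_SCHEDULE[idx + 1] if idx + 1 < len(EPS_SCHEDULE) else None


def fp8_config_diff(saved, current):
    """Names of FP8 settings whose values differ between two setting dicts."""
    keys = sorted(set(saved) | set(current))
    return [k for k in keys if saved.get(k) != current.get(k)]


# Example: a step produced a NaN loss, so move to the next epsilon.
step_ok = all_finite([float("nan"), 0.25])
suggested = None if step_ok else next_eps("1e-12")

# Example: settings stored with a checkpoint versus the current run.
mismatch = fp8_config_diff(
    {"--fp8-recipe": "blockwise", "FP8_QUANT_BWD_GRAD_AMAX_EPS": "1e-12"},
    {"--fp8-recipe": "blockwise", "FP8_QUANT_BWD_GRAD_AMAX_EPS": "1e-10"},
)
```

An empty list from `fp8_config_diff` indicates that the recorded and current settings agree; any returned name is a candidate cause for a reload failure.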

With the above switches and epsilon guards, Blockwise FP8 training in LoongForge is ready for production-scale runs.