Parallelism Strategies & Optimization Guide#
LoongForge is built on Megatron-LM and is fully compatible with all existing Megatron-LM optimization strategies.
On top of that we have added several enhancements. This document describes the basic parallelism strategies and how to enable their optimizations.
They can be combined as needed to efficiently train billion- to trillion-parameter models on hundreds to thousands of GPUs.
1. Parallelism Strategies#
Strategy |
Parallel Dimension |
Primary Use-Case |
|---|---|---|
Data Parallel (DP) |
Batch dimension |
Standard training; enabled by default |
Tensor Parallel (TP) |
Single-layer ops / weights |
Large hidden-size or memory-bound cases |
Pipeline Parallel (PP) |
Model depth |
Ultra-deep models with many layers |
Context Parallel (CP) |
Sequence length |
Long-sequence training (8 K+) |
Expert Parallel (EP) |
MoE experts |
Mixture-of-Experts models |
1.1 Data Parallelism (DP)#
What is parallelised: different mini-batch samples
Key idea: every rank keeps a full copy of the model; gradients are synchronised
In DP each GPU processes a subset of the batch. Depending on the configuration, model-related states can be fully replicated or sharded across the DP dimension to save memory.
Standard DP (no sharding)#
torchrun --nproc_per_node=8 pretrain_gpt.py \
--data-parallel-sharding-strategy no_shard
Each GPU stores full parameters, gradients and optimizer states
Each GPU processes only part of the batch
Gradients are synchronised with All-Reduce
1.2 Tensor Parallelism (TP)#
What is parallelised: matrix computations inside one layer
Key idea: split large matrices along a dimension across GPUs
--tensor-model-parallel-size 4 # 4-way TP
--sequence-parallel # recommended
--sequence-parallel shards the sequence dimension in LayerNorm and Dropout to reduce activation memory; usually used together with TP.
1.3 Pipeline Parallelism (PP)#
What is parallelised: model depth (layer dimension)
Key idea: different GPUs own different stages; execute with micro-batch pipeline
--pipeline-model-parallel-size 8 # 8 stages
--num-layers-per-virtual-pipeline-stage 4 # virtual stages for load balance
1.4 Context Parallelism (CP)#
What is parallelised: sequence length (token dimension)
Key idea: split a long sequence across GPUs
--context-parallel-size 2 # 2-way CP
--cp-comm-type p2p # communication flavour
1.5 Expert Parallelism (EP)#
What is parallelised: experts in an MoE layer
Key idea: different GPUs hold different experts; tokens are dispatched with All-to-All
--expert-model-parallel-size 8 # 8-way EP
2. Performance Optimisations#
2.1 Communication Optimisations#
Gradient-Reduction Overlap
--overlap-grad-reduce
Overlaps All-Reduce of gradients with backward compute.
Parameter-Gather Overlap
--overlap-param-gather
Overlaps All-Gather of parameters with forward compute.
TP Communication Overlap
--tp-comm-overlap
Overlaps Tensor-Parallel communication with compute.
EP Communication Overlap
--overlap-moe-expert-parallel-comm
Overlaps MoE All-to-All with compute.
DeepEP optimisation
DeepEP is a high-performance MoE token dispatch/combine library from DeepSeek. It greatly reduces scheduling and sync overhead for cross-node All-to-All, and is the recommended setup for DeepSeek-V3 and similar models.--moe-token-dispatcher-type=flex # {allgather | alltoall | flex} --moe-enable-deepep --moe-deepep-num-sms N # number of SMs DeepEP may use
DeepEP only works with the
flexdispatcher.allgather(default): collect tokens with All-Gatheralltoall: exchange tokens directly between expertsflex: allows high-performance back-ends such as DeepEP
2.2 Pipeline Load Balancing#
Pipeline-load-balancing is an advanced partitioning mechanism for PP / VPP that lets users specify exactly how each layer is mapped to pipeline stages via an explicit layout string.
It solves:
Uneven model structure (e.g. decoder layers not divisible, MTP/loss layers)
Load imbalance among stages with default cutting
Large pipeline bubbles and low GPU utilisation
Use --pipeline-model-parallel-layout to assign layer types and counts per stage.
--pipeline-model-parallel-size 16
--pipeline-model-parallel-layout "Et*3|(tt|)*29,m|L"
Layout is split by
|per stage*Nrepeats a block N timesSupported symbols
E: Embeddingt: Transformer decoderm: MTP layerL: Loss computation
2.3 Operator Fusion#
MoE permute fusion#
--moe-permute-fusion
Fuses token-reordering kernels to reduce memory traffic.
3. Memory Optimisations#
3.1 Re-computation (Activation Checkpointing)#
Trade extra backward compute for lower activation memory when GPU memory is tight.
Controlled by three orthogonal knobs:
Dimension |
Flag |
Choices |
|---|---|---|
Method |
|
|
Granularity |
|
|
Layers |
|
positive integer |
Method#
--recompute-method uniform # split model into equal checkpoint units
--recompute-method block # only recompute selected Transformer layers
Granularity#
--recompute-granularity full # checkpoint whole Transformer layer
--recompute-granularity selective # checkpoint only listed sub-modules
Number of layers#
--recompute-num-layers N
uniform: layers per checkpoint unitblock: number of layers to checkpoint on each rank / PP stage
Selective sub-modules (only with selective)#
--recompute-modules core_attn moe_act mlp
Supported modules:
core_attn, mlp, moe, moe_act, shared_experts, routed_experts,
layernorm, mla_up_proj,
a2a_overlap_attn, a2a_overlap_post_attn, a2a_overlap_mlp (last three require EP A2A overlap)
3.2 Activation Offloading#
Offloads selected activation tensors to CPU during forward and brings them back on demand during backward to reduce peak GPU memory.
Four controlling flags:
Dimension |
Flag |
Choices |
|---|---|---|
Enable |
|
on / off |
Modules |
|
module list |
Tensors |
|
tensor tag list |
Min size |
|
bytes (int) |
--fine-grained-activation-offloading
--offload-modules expert_fc1 core_attn
--offload-tensors dispatched_input
--min-offloaded-tensor-size 1048576
Supported modules:
attn_norm, core_attn, attn_proj, mlp_norm, expert_fc1, moe_act
Supported tensor tags:
dispatched_input(MoE token-dispatch output)pre_mlp_layernorm_output(pre-MLP LayerNorm output)
3.3 Optimiser-State CPU Offload#
Moves optimiser states (e.g. Adam momentum/variance) from GPU to CPU memory, greatly reducing GPU memory at the cost of extra CPU↔GPU traffic.
Can be combined with recompute, activation offload and communication overlap.
--optimizer-cpu-offload
--optimizer-offload-fraction 1.0 # (0, 1]; 1.0 = offload everything
fraction < 1.0allows a trade-off between memory saving and overhead
3.4 Fused Linear Cross Entropy#
Fuses the output-layer linear projection (hidden @ weight.T) with the cross-entropy loss into a single operation, combined with chunked computation along the vocabulary dimension, to eliminate the peak memory spike from the full logits tensor. For a typical configuration (num_tokens=16384, vocab_size=129280), this can save up to ~40 GB of logits-related memory.
The framework automatically selects the implementation based on GPU architecture:
GPUs except for Blackwell: pure PyTorch implementation with buffer reuse and online Softmax — significantly reduces peak memory while outperforming the native Torch implementation
--cross-entropy-loss-fusion \
--cross-entropy-fusion-impl linear
The vocabulary chunk size can be tuned via environment variable (default 3072):
export LCE_GENERIC_FWD_VOCAB_SPLIT_SIZE=3072
export LCE_GENERIC_BWD_VOCAB_SPLIT_SIZE=3072
See Fused Linear Cross Entropy for details.