Parallelism Strategies & Optimization Guide#

LoongForge is built on Megatron-LM and is fully compatible with all existing Megatron-LM optimization strategies.
On top of that we have added several enhancements. This document describes the basic parallelism strategies and how to enable their optimizations.
They can be combined as needed to efficiently train billion- to trillion-parameter models on hundreds to thousands of GPUs.

1. Parallelism Strategies#

Strategy	Parallel Dimension	Primary Use-Case
Data Parallel (DP)	Batch dimension	Standard training; enabled by default
Tensor Parallel (TP)	Single-layer ops / weights	Large hidden-size or memory-bound cases
Pipeline Parallel (PP)	Model depth	Ultra-deep models with many layers
Context Parallel (CP)	Sequence length	Long-sequence training (8 K+)
Expert Parallel (EP)	MoE experts	Mixture-of-Experts models

1.1 Data Parallelism (DP)#

What is parallelised: different mini-batch samples
Key idea: every rank keeps a full copy of the model; gradients are synchronised

In DP each GPU processes a subset of the batch. Depending on the configuration, model-related states can be fully replicated or sharded across the DP dimension to save memory.

Standard DP (no sharding)#

torchrun --nproc_per_node=8 pretrain_gpt.py \
    --data-parallel-sharding-strategy no_shard

Each GPU stores full parameters, gradients and optimizer states
Each GPU processes only part of the batch
Gradients are synchronised with All-Reduce

Sharded Data Parallel#

Use --data-parallel-sharding-strategy to shard some states along the DP dimension and reduce per-GPU memory.

--data-parallel-sharding-strategy {no_shard | optim | optim_grads | optim_grads_params}

Strategy	What remains on each DP rank
`no_shard` (default)	params + grads + opt state
`optim`	params + grads + sharded opt state
`optim_grads`	params + sharded grads + sharded opt state
`optim_grads_params`	sharded params + sharded grads + sharded opt state

1.2 Tensor Parallelism (TP)#

What is parallelised: matrix computations inside one layer
Key idea: split large matrices along a dimension across GPUs

--tensor-model-parallel-size 4   # 4-way TP
--sequence-parallel              # recommended

--sequence-parallel shards the sequence dimension in LayerNorm and Dropout to reduce activation memory; usually used together with TP.

1.3 Pipeline Parallelism (PP)#

What is parallelised: model depth (layer dimension)
Key idea: different GPUs own different stages; execute with micro-batch pipeline

--pipeline-model-parallel-size 8              # 8 stages
--num-layers-per-virtual-pipeline-stage 4     # virtual stages for load balance

1.4 Context Parallelism (CP)#

What is parallelised: sequence length (token dimension)
Key idea: split a long sequence across GPUs

--context-parallel-size 2   # 2-way CP
--cp-comm-type p2p          # communication flavour

1.5 Expert Parallelism (EP)#

What is parallelised: experts in an MoE layer
Key idea: different GPUs hold different experts; tokens are dispatched with All-to-All

--expert-model-parallel-size 8   # 8-way EP

2. Performance Optimisations#

2.1 Communication Optimisations#

Gradient-Reduction Overlap
```
--overlap-grad-reduce
```
Overlaps All-Reduce of gradients with backward compute.
Parameter-Gather Overlap
```
--overlap-param-gather
```
Overlaps All-Gather of parameters with forward compute.
TP Communication Overlap
```
--tp-comm-overlap
```
Overlaps Tensor-Parallel communication with compute.
EP Communication Overlap
```
--overlap-moe-expert-parallel-comm
```
Overlaps MoE All-to-All with compute.
DeepEP optimisation
DeepEP is a high-performance MoE token dispatch/combine library from DeepSeek. It greatly reduces scheduling and sync overhead for cross-node All-to-All, and is the recommended setup for DeepSeek-V3 and similar models.
```
--moe-token-dispatcher-type=flex   # {allgather | alltoall | flex}
--moe-enable-deepep
--moe-deepep-num-sms N             # number of SMs DeepEP may use
```
DeepEP only works with the flex dispatcher.
- allgather (default): collect tokens with All-Gather
- alltoall: exchange tokens directly between experts
- flex: allows high-performance back-ends such as DeepEP

2.2 Pipeline Load Balancing#

Pipeline-load-balancing is an advanced partitioning mechanism for PP / VPP that lets users specify exactly how each layer is mapped to pipeline stages via an explicit layout string.
It solves:

Uneven model structure (e.g. decoder layers not divisible, MTP/loss layers)
Load imbalance among stages with default cutting
Large pipeline bubbles and low GPU utilisation

Use --pipeline-model-parallel-layout to assign layer types and counts per stage.

--pipeline-model-parallel-size 16
--pipeline-model-parallel-layout "Et*3|(tt|)*29,m|L"

Layout is split by | per stage
*N repeats a block N times
Supported symbols
- E : Embedding
- t : Transformer decoder
- m : MTP layer
- L : Loss computation

2.3 Operator Fusion#

MoE permute fusion#

--moe-permute-fusion

Fuses token-reordering kernels to reduce memory traffic.

3. Memory Optimisations#

3.1 Re-computation (Activation Checkpointing)#

Trade extra backward compute for lower activation memory when GPU memory is tight.

Controlled by three orthogonal knobs:

Dimension	Flag	Choices
Method	`--recompute-method`	`uniform` / `block`
Granularity	`--recompute-granularity`	`full` / `selective`
Layers	`--recompute-num-layers`	positive integer

Method#

--recompute-method uniform   # split model into equal checkpoint units
--recompute-method block     # only recompute selected Transformer layers

Granularity#

--recompute-granularity full       # checkpoint whole Transformer layer
--recompute-granularity selective  # checkpoint only listed sub-modules

Number of layers#

--recompute-num-layers N

uniform: layers per checkpoint unit
block: number of layers to checkpoint on each rank / PP stage

Selective sub-modules (only with `selective`)#

--recompute-modules core_attn moe_act mlp

Supported modules:
core_attn, mlp, moe, moe_act, shared_experts, routed_experts,
layernorm, mla_up_proj,
a2a_overlap_attn, a2a_overlap_post_attn, a2a_overlap_mlp (last three require EP A2A overlap)

3.2 Activation Offloading#

Offloads selected activation tensors to CPU during forward and brings them back on demand during backward to reduce peak GPU memory.

Four controlling flags:

Dimension	Flag	Choices
Enable	`--fine-grained-activation-offloading`	on / off
Modules	`--offload-modules`	module list
Tensors	`--offload-tensors`	tensor tag list
Min size	`--min-offloaded-tensor-size`	bytes (int)

--fine-grained-activation-offloading
--offload-modules expert_fc1 core_attn
--offload-tensors dispatched_input
--min-offloaded-tensor-size 1048576

Supported modules:
attn_norm, core_attn, attn_proj, mlp_norm, expert_fc1, moe_act

Supported tensor tags:

dispatched_input (MoE token-dispatch output)
pre_mlp_layernorm_output (pre-MLP LayerNorm output)

3.3 Optimiser-State CPU Offload#

Moves optimiser states (e.g. Adam momentum/variance) from GPU to CPU memory, greatly reducing GPU memory at the cost of extra CPU↔GPU traffic.
Can be combined with recompute, activation offload and communication overlap.

--optimizer-cpu-offload
--optimizer-offload-fraction 1.0   # (0, 1]; 1.0 = offload everything

fraction < 1.0 allows a trade-off between memory saving and overhead

3.4 Fused Linear Cross Entropy#

Fuses the output-layer linear projection (hidden @ weight.T) with the cross-entropy loss into a single operation, combined with chunked computation along the vocabulary dimension, to eliminate the peak memory spike from the full logits tensor. For a typical configuration (num_tokens=16384, vocab_size=129280), this can save up to ~40 GB of logits-related memory.

The framework automatically selects the implementation based on GPU architecture:

GPUs except for Blackwell: pure PyTorch implementation with buffer reuse and online Softmax — significantly reduces peak memory while outperforming the native Torch implementation

--cross-entropy-loss-fusion \
--cross-entropy-fusion-impl linear

The vocabulary chunk size can be tuned via environment variable (default 3072):

export LCE_GENERIC_FWD_VOCAB_SPLIT_SIZE=3072
export LCE_GENERIC_BWD_VOCAB_SPLIT_SIZE=3072

See Fused Linear Cross Entropy for details.

Parallelism Strategies & Optimization Guide

Contents