Quick Start: LLM SFT#
This guide walks you through launching a Supervised Fine-Tuning (SFT) job for Large-Language Models (LLM) in the LoongForge framework.
0. Resource Preparation#
Before starting, download the required model weights, tokenizer, and datasets. All downloads use HuggingFace. Install the CLI first:
pip install "huggingface_hub[cli]"
0.1 Download Model Weights#
hf download deepseek-ai/DeepSeek-V3.1 --local-dir ./deepseek-v3.1
Note: This model requires approximately 1.37 TB of disk space (163 safetensor shards, FP8 weights).
0.2 Download Tokenizer#
The tokenizer is included in the model weights downloaded above (./deepseek-v3.1/).
0.3 Download Dataset#
We use the Alpaca-Cleaned dataset (~30 MB, 51,742 samples) for SFT. It already matches LoongForge’s built-in default Alpaca format.
hf download yahma/alpaca-cleaned --repo-type dataset --local-dir ./data/alpaca-cleaned
Export the dataset to a JSON file that LoongForge can read:
from datasets import load_dataset
import json
ds = load_dataset("yahma/alpaca-cleaned", split="train")
records = [{"instruction": item["instruction"], "input": item["input"], "output": item["output"]} for item in ds]
with open("./data/alpaca_cleaned.json", "w") as f:
json.dump(records, f, ensure_ascii=False, indent=2)
After export, use --data-path ./data/alpaca_cleaned.json --sft-dataset default in your training script.
1. Prepare the data#
1.1 Dataset format & configuration#
For instruction tuning, two dialogue styles are common: Alpaca-style and ShareGPT-style.
LoongForge currently supports Alpaca-style JSON, one sample per line:
[
{
"instruction": "User instruction",
"input": "User question",
"output": "Model answer"
}
]
Field names may differ across datasets, so you must supply a dataset config file that tells the loader how to map your columns.
A template is shipped at
/workspace/LoongForge/configs/data/sft_dataset_config.yaml.
(1) File format#
We follow the community convention used by LlamaFactory.
(2) Default section#
If your file already uses the canonical Alpaca names, simply rely on the built-in default block:
default:
format: alpaca
columns:
prompt: instruction
query: input
response: output
(3) Adding a custom dataset#
Suppose your file is called custom_dataset_name.json and contains extra fields:
[
{
"instruction": "...",
"input": "...",
"output": "...",
"system": "System prompt",
"history": [
["Q1", "A1"],
["Q2", "A2"]
]
}
]
Append a new section to the YAML:
custom_dataset_name:
format: alpaca
columns:
prompt: instruction
query: input
response: output
system: system
history: history
1.2 Dataset arguments#
Argument |
Meaning |
|---|---|
|
Path to the JSON file(s). Multiple datasets are supported; give sampling weights with colon separator: |
|
Train/valid/test ratio, e.g. |
|
Path to the YAML file described above. Default: |
|
Name of the dataset entry inside the YAML. Must correspond 1-to-1 with the order in |
1.3 Pre-processing modes#
LoongForge supports on-the-fly and offline pre-processing.
Pick offline when the dataset is large; it removes tokenisation overhead from the critical path.
(1) On-the-fly (default)#
--data-path /path/to/custom_dataset_name.json \
--sft-dataset custom_dataset_name \
--sft-data-streaming # optional: stream large files
(2) Offline#
Run the helper script once:
#!/bin/bash
MEGATRON_PATH=${MEGATRON_PATH:-"/workspace/Loong-Megatron"}
export LOONGFORGE_PATH=${LOONGFORGE_PATH:-"/workspace/LoongForge"}
TOKENIZER_PATH=/path/to/hf/tokenizer
input_data=/path/to/custom_dataset_name.json
output_path=/path/to/save_dir
PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
python ${LOONGFORGE_PATH}/tools/data_preprocess/llm/preprocess_sft_data.py \
--input ${input_data} \
--output ${output_path} \
--seq-length 2048 \
--chat-template ${chat_template} \
--tokenizer-type HFTokenizer \
--hf-tokenizer-path $TOKENIZER_PATH \
--workers 50 \
--split 100,0,0
--packing-sft-data
Key flags
--seq-length– max token length; longer samples are truncated.--chat-template– must match the template you will use during training.--split– pre-split the data; you still need--splitin training for sanity checks.--packing-sft-data– pack multiple samples into one sequence (no padding).
After pre-processing, pass the directory to training:
--data-path /path/to/save_dir \
--is-tokenized-data \
--sft-dataset custom_dataset_name
2. Prepare the checkpoint#
Same as pre-training – see Quick Start: LLM Pre-training.
3. Launch SFT training#
3.1 Extra LoongForge arguments (beyond native Megatron)#
Argument |
Purpose |
|---|---|
|
Switch to fine-tuning stage. |
|
Which conversation template to apply ( |
|
Dataset name(s) in the YAML file. |
|
Enable sample packing to increase throughput. |
|
Stream large JSON files instead of loading everything into RAM. |
|
CPU workers for on-the-fly tokenisation. |
|
Pad p2p buffers to fixed length (recommended for SFT). |
|
Offload optimiser states to CPU to save GPU memory. |
All FP8, pipeline-parallel, recompute, MoE, MTP flags keep the same meaning as in pre-training.
3.2 Example SFT script#
Ready-to-run scripts are located under examples/{model}/finetuning/.
Below is the FP8 SFT script for DeepSeek-V3.1 (comments added in English):
#!/bin/bash
# DeepSeek-V3 FP8 supervised fine-tuning
export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
MEGATRON_PATH=${MEGATRON_PATH:-"/workspace/Loong-Megatron"}
export LOONGFORGE_PATH=${LOONGFORGE_PATH:-"/workspace/LoongForge"}
# ------------- data -------------
DATA_PATH=/path/to/your/data
DATA_CACHE_PATH=/path/to/your/data/cache
DATASET_CONFIG_PATH=/path/to/your/dataset_config # optional
# ------------- tokenizer & checkpoint -------------
TOKENIZER_PATH=/path/to/hf/tokenizer
CHECKPOINT_PATH=/path/to/mcore/checkpoint
CHECKPOINT_PATH_SAVE=/path/to/save_dir
# ------------- logging -------------
TENSORBOARD_PATH=/path/to/tensorboard
# ------------- FP8 quantisation -------------
export FP8_QUANT_FWD_INP_AMAX_EPS=1e-12
export FP8_QUANT_FWD_WEIGHT_AMAX_EPS=1e-12
export FP8_QUANT_BWD_GRAD_AMAX_EPS=1e-12
GPUS_PER_NODE=8
# ------------- NCCL & NVSHMEM -------------
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_GID_INDEX=3
# choose HCA list according to your cluster
export NVSHMEM_HCA_LIST=mlx5_4,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_13
export NVSHMEM_BOOTSTRAP=UID
export NVSHMEM_IB_TRAFFIC_CLASS=130
export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=bond0
export NVSHMEM_BOOTSTRAP_UID_SOCK_FAMILY=AF_INET
export NVSHMEM_IB_GID_INDEX=3
# ------------- Transformer Engine -------------
export NVTE_FWD_LAYERNORM_SM_MARGIN=8
export NVTE_BWD_LAYERNORM_SM_MARGIN=24
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
# ------------- CUDA / PyTorch -------------
export CUDA_DEVICE_MAX_CONNECTIONS=1
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# ------------- distributed -------------
MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_PORT=${MASTER_PORT:-"6000"}
NNODES=${WORLD_SIZE:-"1"}
NODE_RANK=${RANK:-"0"}
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
# ------------- model -------------
MODEL_ARGS=(
--model-name deepseek-v3
--multi-latent-attention
--rotary-base 10000
--original-max-position-embeddings 4096
--mscale 1.0
--mscale-all-dim 1.0
--norm-epsilon 1e-6
--rotary-scaling-factor 40
--enable-fa-within-mla
--use-fp32-dtype-for-param-pattern expert_bias
)
# ------------- data loader -------------
DATA_ARGS=(
--tokenizer-type HFTokenizer
--hf-tokenizer-path $TOKENIZER_PATH
--data-path $DATA_PATH
--split 90,8,2
)
# ------------- SFT specific -------------
SFT_ARGS=(
--chat-template no-template
--sft-num-preprocess-workers 16
--no-check-for-nan-in-loss-and-grad
--packing-sft-data
--sft-dataset sharegpt
)
# ------------- training hyper-params -------------
TRAINING_ARGS=(
--training-phase sft
--seq-length 65536
--max-position-embeddings 163840
--init-method-std 0.02
--no-masked-softmax-fusion
--micro-batch-size 1
--global-batch-size 128
--lr 1e-06
--train-iters 1500
--lr-decay-iters 5000
--lr-decay-style cosine
--min-lr 1.0e-7
--weight-decay 0.1
--lr-warmup-fraction 0.002
--clip-grad 1.0
--bf16
--load $CHECKPOINT_PATH
--save $CHECKPOINT_PATH_SAVE
--save-interval 1000
--eval-interval 10
--eval-iters 1
--no-load-optim
--no-load-rng
--recompute-granularity full
--recompute-method block
--custom-pipeline-layers 8,7,8,8,8,8,8,6
--custom-pipeline-recompute-layers 8,7,8,8,8,8,8,6
--num-virtual-stages-per-pipeline-rank 2
--reduce-variable-seq-shape-p2p-comm
--fp8-format e4m3
--fp8-recipe blockwise
--fp8-param-gather
--enable-fp8-comm
--distributed-timeout-minutes 60
--optimizer-cpu-offload
--optimizer-offload-fraction 1.0
--enable-experimental
)
# ------------- MoE -------------
MOE_ARGS=(
--moe-router-load-balancing-type seq_aux_loss
--moe-router-topk 8
--moe-aux-loss-coeff 1e-3
--moe-grouped-gemm
--moe-router-enable-expert-bias
--moe-router-bias-update-rate 0.001
--moe-router-num-groups 8
--moe-router-group-topk 4
--moe-router-score-function sigmoid
--moe-router-topk-scaling-factor 2.5
--moe-router-dtype fp32
--empty-unused-memory-level 2
)
# ------------- parallelism & optimiser -------------
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 8
--expert-model-parallel-size 32
--expert-tensor-parallel-size 1
--sequence-parallel
--moe-token-dispatcher-type flex
--moe-enable-deepep
--use-precision-aware-optimizer
--exp-avg-dtype bf16
--exp-avg-sq-dtype bf16
--use-distributed-optimizer
--moe-permute-fusion
--cross-entropy-loss-fusion
--overlap-grad-reduce
--overlap-param-gather
)
# ------------- MTP -------------
MTP_ARGS=(
--mtp-loss-scaling-factor 0.1
)
# ------------- logging -------------
LOGGING_ARGS=(
--log-interval 1
--tensorboard-dir ${TENSORBOARD_PATH}
--log-timers-to-tensorboard
--log-memory-to-tensorboard
--log-validation-ppl-to-tensorboard
--check-weight-hash-across-dp-replicas-interval 30
)
# ------------- launch -------------
PYTHONPATH=$MEGATRON_PATH:$LOONGFORGE_PATH:$PYTHONPATH \
torchrun ${DISTRIBUTED_ARGS[@]} \
$LOONGFORGE_PATH/loongforge/train.py \
${MODEL_ARGS[@]} \
${DATA_ARGS[@]} \
${TRAINING_ARGS[@]} \
${SFT_ARGS[@]} \
${MOE_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${LOGGING_ARGS[@]} \
${MTP_ARGS[@]}
4. Monitoring#
TensorBoard logs are written to the directory specified by TENSORBOARD_PATH.
Open TensorBoard to inspect loss, perplexity, memory, throughput, etc.