Offline Packing#
This module provides an “offline sequence-packing” pipeline: it reads source WebDataset tar shards directly, groups and re-orders the samples according to max_token_len, and finally produces a packed WebDataset (pretrain-*.tar plus Energon meta files).
By concatenating variable-length sequences up to the target length we reduce padding and increase training throughput.
Entry script:
tools/data_preprocess/vlm/offline_packing/scripts/pack_wds.sh (4 steps, see below).
1. Supported packing scenarios (sample.sample_type)#
We currently support packing for single-sample captioning, VQA, and multi-modal mixed-QA formats.
| Scenario | sample.sample_type | Description |
|---|---|---|
| Offline packed image/video/text mixed QA | packed_multi_mix_qa | Input WDS JSON must declare media/media_type |
2. Input requirements (data.wds_dir)#
The implementation reads uncompressed *.tar shards directly from data.wds_dir.
It does not unpack source shards into a flat directory.
Notes:
- scan_wds_manifest.py reads the message list from the field specified by data.template_text_key; it also accepts the common keys messages and texts.
- If the JSON files come from tools/data_preprocess/vlm/convert_to_webdataset.py (multi-scenario writes texts by default) you usually need to set data.template_text_key to texts.
- packed_multi_mix_qa: JSON must declare media/media_type (text, image, or video). Image/video samples should supply name/media_files; if absent, media members are inferred from WDS parts by extension.
- .tgz input is not supported in V1 because efficient byte-range reads require uncompressed tar.
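To illustrate why byte-range reads need uncompressed tar, here is a minimal sketch using only Python's standard tarfile module. It is not the pipeline's own code: the member names and the helpers build_demo_tar, index_members and read_range are hypothetical.

```python
import io
import tarfile


def build_demo_tar() -> bytes:
    """Create a tiny uncompressed tar in memory (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("s0.json", b'{"texts": []}'), ("s0.jpg", b"\xff\xd8fake-jpeg")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def index_members(tar_bytes: bytes) -> dict:
    """Map member name -> (payload offset, payload size) inside the tar stream."""
    # mode "r:" opens the archive for reading with no compression.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


def read_range(tar_bytes: bytes, offset: int, size: int) -> bytes:
    """Fetch a payload with a plain seek + read, without re-parsing the tar."""
    stream = io.BytesIO(tar_bytes)
    stream.seek(offset)
    return stream.read(size)


tar_bytes = build_demo_tar()
locators = index_members(tar_bytes)
offset, size = locators["s0.jpg"]
payload = read_range(tar_bytes, offset, size)
print(offset, size, payload)
```

In a gzip-compressed archive the same offsets would refer to the decompressed stream, so reaching one member would require decompressing everything before it.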
3. Quick start#
cd tools/data_preprocess/vlm/offline_packing
# 1) Edit config.yaml (or copy packed_vqa_demo.yaml)
# 2) Run the 4-step pipeline (reads config.yaml by default)
bash scripts/pack_wds.sh
To switch to another config:
- Option 1: overwrite/copy it to config.yaml
- Option 2: run each script manually with --config your.yaml (see next section)
4. Pipeline details (mirrors pack_wds.sh)#
Step 1: Scan WDS manifest and compute per-sample token length (scan_wds_manifest.py)#
- Input: *.tar shards under data.wds_dir
- Process: read WDS samples directly from tar, pick the template (utils.TEMPLATES) according to sample.sample_type + model.model_type, tokenise text+vision inputs with AutoProcessor or AutoTokenizer, and record tar byte locators
- Output: {data.work_dir}/sample_manifest.sqlite, {data.work_dir}/sample_manifest.jsonl, and per-media token reports
Manual run:
python scan_wds_manifest.py --config config.yaml
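As a rough picture of what a per-sample manifest could look like, the sketch below stores one row per sample in SQLite and derives JSONL lines and per-media length lists from it. The table name, column names and demo rows are assumptions for illustration only; the real schema written by scan_wds_manifest.py may differ.

```python
import json
import sqlite3

# Hypothetical demo rows: (sample_key, shard, byte_offset, byte_size, media_type, token_len)
rows = [
    ("s0", "shard-0.tar", 512, 2048, "image", 731),
    ("s1", "shard-0.tar", 3072, 96, "text", 42),
]

conn = sqlite3.connect(":memory:")  # a real run would use a file under data.work_dir
conn.execute(
    "CREATE TABLE manifest (sample_key TEXT PRIMARY KEY, shard TEXT, "
    "byte_offset INTEGER, byte_size INTEGER, media_type TEXT, token_len INTEGER)"
)
conn.executemany("INSERT INTO manifest VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Mirror the table as JSONL, one JSON object per sample.
cur = conn.execute("SELECT * FROM manifest ORDER BY sample_key")
columns = [c[0] for c in cur.description]
jsonl_lines = [json.dumps(dict(zip(columns, r))) for r in cur]

# Group (sample_key, token_len) pairs by media type, as a per-media length report might.
by_media = {}
query = "SELECT media_type, sample_key, token_len FROM manifest ORDER BY sample_key"
for media_type, key, token_len in conn.execute(query):
    by_media.setdefault(media_type, []).append((key, token_len))

print(by_media)
```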
Step 2: Length bucketing & packing groups by media type (do_hashbacket.py)#
- Input: token_len/sample_len_report_{text,image,video}.txt
- Process: build hash buckets separately for text/image/video, pack samples into “boxes” under sample.max_token_len
- Output: {data.work_dir}/bins/bins_boxs_{text,image,video}.pkl
Manual run:
python do_hashbacket.py --config config.yaml
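The idea of filling “boxes” up to sample.max_token_len can be sketched with a plain first-fit-decreasing heuristic. This is a stand-in technique chosen for brevity, not the hash-bucket algorithm the script uses; the function name and the sample lengths are hypothetical.

```python
from __future__ import annotations


def pack_first_fit_decreasing(lengths: dict[str, int], max_token_len: int) -> list[list[str]]:
    """Greedily place each sample (longest first) into the first box with room."""
    boxes: list[list] = []  # each entry is [remaining_capacity, [sample_ids]]
    for sample_id, n in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        if n > max_token_len:
            continue  # this sketch simply skips samples that cannot fit at all
        for box in boxes:
            if box[0] >= n:
                box[0] -= n
                box[1].append(sample_id)
                break
        else:
            boxes.append([max_token_len - n, [sample_id]])
    return [ids for _, ids in boxes]


lengths = {"a": 5000, "b": 3000, "c": 4000, "d": 4000, "e": 192}
boxes = pack_first_fit_decreasing(lengths, max_token_len=8192)
print(boxes)  # [['a', 'b', 'e'], ['c', 'd']]
```

With these demo lengths, five individually padded sequences would occupy 5 × 8192 = 40960 token slots for 16192 real tokens, whereas the two packed boxes occupy 2 × 8192 = 16384 slots.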
Step 3: Generate pack plan (build_pack_plan.py)#
- Input: per-media bins_boxs_*.pkl + sample_manifest.sqlite
- Process: convert hashbucket boxes into stable packed sample plans
- Output: {data.work_dir}/pack_plan.jsonl
Manual run:
python build_pack_plan.py --config config.yaml
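One way to read “stable” is that the plan does not depend on the order in which boxes arrive. The sketch below makes that assumption explicit by sorting before assigning ids; the record fields pack_id, sample_ids and total_tokens are hypothetical, not the actual pack_plan.jsonl schema.

```python
import json


def build_plan_lines(boxes, lengths):
    """Turn boxes of sample ids into deterministic JSONL plan lines."""
    # Sort ids inside each box, then sort the boxes themselves,
    # so the same set of boxes always yields the same plan.
    normalized = sorted(sorted(box) for box in boxes)
    lines = []
    for pack_id, ids in enumerate(normalized):
        record = {
            "pack_id": pack_id,
            "sample_ids": ids,
            "total_tokens": sum(lengths[i] for i in ids),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines


lengths = {"a": 5000, "b": 3000, "c": 4000}
plan_a = build_plan_lines([["b", "a"], ["c"]], lengths)
plan_b = build_plan_lines([["c"], ["a", "b"]], lengths)
print(plan_a == plan_b)
```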
Step 4: Write packed samples back to WebDataset (packed_to_wds.py)#
- Input: pack_plan.jsonl + sample_manifest.sqlite; media bytes are read from source tar byte offsets
- Output: data.packed_wds_dir/pretrain-*.tar plus Energon meta (.nv-meta/dataset.yaml + tar indexes)
Manual run:
python packed_to_wds.py --config config.yaml
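How packed_wds.maxcount and packed_wds.maxsize might split the output into shards can be sketched as follows. The rollover rule (start a new shard when the current one already holds maxcount samples, or when adding the next sample would exceed maxsize bytes) and the zero-padded file name pattern are assumptions, not confirmed behaviour of packed_to_wds.py.

```python
def plan_shards(sample_sizes, maxcount, maxsize):
    """Assign sample indexes to shards under a count limit and a byte limit."""
    shards, current, current_bytes = [], [], 0
    for idx, size in enumerate(sample_sizes):
        too_many = len(current) >= maxcount
        too_big = bool(current) and current_bytes + size > maxsize
        if too_many or too_big:
            shards.append(current)
            current, current_bytes = [], 0
        # A sample larger than maxsize still gets a shard of its own.
        current.append(idx)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


sizes = [400, 400, 300, 900, 100]  # hypothetical packed-sample sizes in bytes
shards = plan_shards(sizes, maxcount=3, maxsize=1000)
names = [f"pretrain-{i:06d}.tar" for i in range(len(shards))]  # hypothetical naming
print(list(zip(names, shards)))
```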
5. Configuration (config.yaml)#
Key fields:
- data.input_format – set to wds for WDS-native packing
- data.wds_dir – input WebDataset directory containing uncompressed *.tar shards
- data.template_text_key – message field name in JSON (messages or texts)
- data.work_dir – working directory for manifest, token reports, bins and pack plan
- data.packed_wds_dir – final packed WDS output directory
- sample.max_token_len – target packing length (e.g. 8192 / 16384)
- sample.sample_type – V1 supports packed_multi_mix_qa
- model.model_type – model identifier used to pick the template
- model.processor_loader – auto_processor for VLM processors, or auto_tokenizer for text-only smoke tests
- model.processor_kwargs.* – HF processor arguments passed to transformers.AutoProcessor.from_pretrained
- packed_wds.maxcount / maxsize – tar-shard splitting strategy
Example (excerpt; see config.yaml for the full set of fields):
data:
  input_format: "wds"
  wds_dir: "/mnt/cluster/.../wds/"
  template_text_key: "texts"
  work_dir: "/mnt/cluster/.../packing_work/"
  packed_wds_dir: "/mnt/cluster/.../packed_wds/"
sample:
  max_token_len: 8192
  sample_type: packed_multi_mix_qa
6. Switching models / tuning image processing#
Step 1’s token counts depend on the actual AutoProcessor logic, so you can change the model or image-preprocessing parameters via config:
- Change model: set model.processor_kwargs.pretrained_model_name_or_path to the desired HF model/processor; update model.model_type accordingly.
- Adjust image-token budget / resolution: add processor-supported arguments under model.processor_kwargs (e.g. Qwen-VL’s min_pixels / max_pixels).
- Template alignment: if you add a new model.model_type, make sure tools/data_preprocess/vlm/offline_packing/utils.py contains the corresponding entry in TEMPLATES[sample_type][model_type]; otherwise Step 1 will raise “No template found for model_type …”.
- Media pre-processing: under media_preprocess you can assign pre-processing function names per modality (implementations in tools/data_preprocess/vlm/offline_packing/media_preprocess_utils.py) to control resize/crop/frame-reading behaviour.
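The two-level TEMPLATES[sample_type][model_type] lookup can be pictured with the sketch below. Only the nesting comes from the description above; the demo_model key, the template string, the get_template helper and the exact wording of the error after the model type are hypothetical.

```python
# Hypothetical stand-in for the real TEMPLATES mapping in utils.py.
TEMPLATES = {
    "packed_multi_mix_qa": {
        "demo_model": "<user>{question}</user><assistant>{answer}</assistant>",
    }
}


def get_template(sample_type: str, model_type: str) -> str:
    """Return the template for a (sample_type, model_type) pair or fail loudly."""
    by_model = TEMPLATES.get(sample_type, {})
    if model_type not in by_model:
        known = ", ".join(sorted(by_model)) or "none"
        raise ValueError(
            f"No template found for model_type {model_type!r} (known: {known})"
        )
    return by_model[model_type]


template = get_template("packed_multi_mix_qa", "demo_model")
print(template.format(question="What is shown?", answer="A cat."))
```

Registering a new model_type then amounts to adding one more key under the relevant sample_type before running Step 1.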