Offline Packing#

This module provides an “offline sequence-packing” pipeline: it reads source WebDataset tar shards directly, groups and re-orders the samples according to max_token_len, and finally produces a packed WebDataset (pretrain-*.tar plus Energon meta files). By concatenating variable-length sequences up to the target length we reduce padding and increase training throughput.

Entry script:
tools/data_preprocess/vlm/offline_packing/scripts/pack_wds.sh (4 steps, see below).

1. Supported packing scenarios (sample.sample_type)#

We currently support packing for single-sample captioning, VQA, and multi-modal mixed-QA formats.

Scenario

sample_type

Description

Offline packed image/video/text mixed QA

packed_multi_mix_qa

Input WDS JSON must declare media/media_type; packs are homogeneous by media type.

2. Input requirements (data.wds_dir)#

The implementation reads uncompressed *.tar shards directly from data.wds_dir. It does not unpack source shards into a flat directory.

Notes:

  • scan_wds_manifest.py reads the message list from the field specified by data.template_text_key; it also accepts the common keys messages and texts.

  • If the JSON files come from tools/data_preprocess/vlm/convert_to_webdataset.py (multi-scenario writes texts by default) you usually need to set data.template_text_key to texts.

  • packed_multi_mix_qa: JSON must declare media/media_type (text, image, or video). Image/video samples should supply name/media_files; if absent, media members are inferred from WDS parts by extension.

  • .tgz input is not supported in V1 because efficient byte-range reads require uncompressed tar.

3. Quick start#

cd tools/data_preprocess/vlm/offline_packing

# 1) Edit config.yaml (or copy packed_vqa_demo.yaml)
# 2) Run the 4-step pipeline (reads config.yaml by default)
bash scripts/pack_wds.sh

To switch to another config:

  • Option 1: overwrite/copy it to config.yaml

  • Option 2: run each script manually with --config your.yaml (see next section)

4. Pipeline details (mirrors pack_wds.sh)#

Step 1: Scan WDS manifest and compute per-sample token length (scan_wds_manifest.py)#

  • Input: *.tar shards under data.wds_dir

  • Process: read WDS samples directly from tar, pick the template (utils.TEMPLATES) according to sample.sample_type + model.model_type, tokenise text+vision inputs with AutoProcessor or AutoTokenizer, and record tar byte locators

  • Output: {data.work_dir}/sample_manifest.sqlite, {data.work_dir}/sample_manifest.jsonl, and per-media token reports

Manual run:

python scan_wds_manifest.py --config config.yaml

Step 2: Length bucketing & packing groups by media type (do_hashbacket.py)#

  • Input: token_len/sample_len_report_{text,image,video}.txt

  • Process: build hash buckets separately for text/image/video, pack samples into “boxes” under sample.max_token_len

  • Output: {data.work_dir}/bins/bins_boxs_{text,image,video}.pkl

Manual run:

python do_hashbacket.py --config config.yaml

Step 3: Generate pack plan (build_pack_plan.py)#

  • Input: per-media bins_boxs_*.pkl + sample_manifest.sqlite

  • Process: convert hashbucket boxes into stable packed sample plans

  • Output: {data.work_dir}/pack_plan.jsonl

Manual run:

python build_pack_plan.py --config config.yaml

Step 4: Write packed samples back to WebDataset (packed_to_wds.py)#

  • Input: pack_plan.jsonl + sample_manifest.sqlite; media bytes are read from source tar byte offsets

  • Output: data.packed_wds_dir/pretrain-*.tar plus Energon meta (.nv-meta/dataset.yaml + tar indexes)

Manual run:

python packed_to_wds.py --config config.yaml

5. Configuration (config.yaml)#

Key fields:

  • data.input_format – set to wds for WDS-native packing

  • data.wds_dir – input WebDataset directory containing uncompressed *.tar shards

  • data.template_text_key – message field name in JSON (messages or texts)

  • data.work_dir – working directory for manifest, token reports, bins and pack plan

  • data.packed_wds_dir – final packed WDS output directory

  • sample.max_token_len – target packing length (e.g. 8192 / 16384)

  • sample.sample_type – V1 supports packed_multi_mix_qa

  • model.model_type – model identifier used to pick the template

  • model.processor_loaderauto_processor for VLM processors, or auto_tokenizer for text-only smoke tests

  • model.processor_kwargs.* – HF processor arguments passed to transformers.AutoProcessor.from_pretrained

  • packed_wds.maxcount / maxsize – tar-shard splitting strategy

Example (excerpt, full fields see config.yaml):

data:
  input_format: "wds"
  wds_dir: "/mnt/cluster/.../wds/"
  template_text_key: "texts"
  work_dir: "/mnt/cluster/.../packing_work/"
  packed_wds_dir: "/mnt/cluster/.../packed_wds/"

sample:
  max_token_len: 8192
  sample_type: packed_multi_mix_qa

6. Switching models / tuning image processing#

Step 1’s token counts depend on the actual AutoProcessor logic, so you can change the model or image-preprocessing parameters via config:

  • Change model: set model.processor_kwargs.pretrained_model_name_or_path to the desired HF model/processor; update model.model_type accordingly.

  • Adjust image-token budget / resolution: add processor-supported arguments under model.processor_kwargs (e.g. Qwen-VL’s min_pixels/max_pixels).

  • Template alignment: if you add a new model.model_type, make sure tools/data_preprocess/vlm/offline_packing/utils.py contains the corresponding entry in TEMPLATES[sample_type][model_type]; otherwise Step 1 will raise “No template found for model_type …”.

  • Media pre-processing: under media_preprocess you can assign pre-processing function names per modality (implementations in tools/data_preprocess/vlm/offline_packing/media_preprocess_utils.py) to control resize/crop/frame-reading behaviour.