Offline Packing#
This module provides an “offline sequence-packing” pipeline: it takes a sample-level directory (one *.json plus its media files per sample), groups and re-orders the samples according to max_token_len, and finally produces a packed WebDataset (pretrain-*.tar plus Energon meta files).
By concatenating variable-length sequences up to the target length we reduce padding and increase training throughput.
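To make the padding saving concrete, the following is a small illustrative calculation. The sample lengths and the grouping are made-up numbers chosen for the example, not output of the pipeline.

```python
def padding_ratio(row_lengths, max_token_len):
    """Fraction of token slots that are padding when every row is padded to max_token_len."""
    total_slots = len(row_lengths) * max_token_len
    return 1.0 - sum(row_lengths) / total_slots


# Hypothetical token lengths of five samples.
lengths = [3000, 5000, 2000, 6000, 4000]
max_token_len = 8192

# Unpacked: every sample occupies its own padded row.
unpacked = padding_ratio(lengths, max_token_len)

# Packed: samples are concatenated into rows that stay within max_token_len.
packed_rows = [3000 + 5000, 2000 + 6000, 4000]
packed = padding_ratio(packed_rows, max_token_len)

print(f"padding without packing: {unpacked:.1%}")
print(f"padding with packing:    {packed:.1%}")
```

With these numbers the padding share drops from roughly 51% to roughly 19%, because the same 20000 tokens fit in three rows instead of five.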
Entry script:
tools/data_preprocess/vlm/offline_packing/scripts/pack_wds.sh (4 steps, see below).
1. Supported packing scenarios (sample.sample_type)#
We currently support packing for single-sample captioning, VQA, and multi-modal mixed-QA formats.
| Scenario | sample.sample_type | Description |
|---|---|---|
| Offline packed caption | packed_captioning | Produces a … |
| Offline packed single-image QA | packed_vqa | Same as above. |
| Offline packed image+video mixed QA | packed_multi_mix_qa | Same as above (input JSON must declare media types and file lists). |
2. Input requirements (data.wds_dir)#
The implementation does NOT read tar shards directly; it expects a flat, randomly accessible directory containing:
- Many *.json files (each file = one sample / one WDS json payload).
- Media files (images/videos) sitting in the same directory, or referenced via a relative path resolvable from that directory.
If your data are already pretrain-*.tar shards produced by convert_to_webdataset.py, unpack them first:
mkdir -p /path/to/wds_flat
for t in /path/to/wds/pretrain-*.tar; do tar -xf "$t" -C /path/to/wds_flat; done
Notes:
- get_sample_len.py reads the message list from the field specified by data.template_text_key; it also accepts the common keys messages and texts.
- If the JSON files come from tools/data_preprocess/vlm/convert_to_webdataset.py (multi-scenario writes texts by default) you usually need to set data.template_text_key to texts.
- packed_vqa / packed_captioning: if the JSON does not contain an explicit media_files / name field, the code tries to find a media file with the same stem (e.g. 0001.json → 0001.jpg).
- packed_multi_mix_qa: JSON must declare media / media_type (image or video) and supply name / media_files list (nested lists allowed).
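The same-stem fallback mentioned in the notes can be pictured with a minimal sketch like the one below. The function name `find_media_by_stem` and the extension list are assumptions made for illustration; the actual script may search a different set of extensions or apply a different priority.

```python
from pathlib import Path

# Assumed set of media extensions; the real script may accept a different list.
MEDIA_EXTS = (".jpg", ".jpeg", ".png", ".webp", ".mp4")


def find_media_by_stem(json_path, exts=MEDIA_EXTS):
    """Return the first sibling file that shares the JSON's stem, or None."""
    json_path = Path(json_path)
    for ext in exts:
        candidate = json_path.with_suffix(ext)  # 0001.json -> 0001.jpg, ...
        if candidate.exists():
            return candidate
    return None
```

Calling `find_media_by_stem("/path/to/wds_flat/0001.json")` would return the path of `0001.jpg` if that file exists next to the JSON, and `None` if no candidate is found.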
3. Quick start#
cd tools/data_preprocess/vlm/offline_packing
# 1) Edit config.yaml (or copy packed_vqa_demo.yaml)
# 2) Run the 4-step pipeline (reads config.yaml by default)
bash scripts/pack_wds.sh
To switch to another config:
- Option 1: overwrite/copy it to config.yaml
- Option 2: run each script manually with --config your.yaml (see next section)
4. Pipeline details (mirrors pack_wds.sh)#
Step 1: Compute per-sample token length (get_sample_len.py)#
- Input: *.json + media files under data.wds_dir
- Process: pick the template (utils.TEMPLATES) according to sample.sample_type + model.model_type, tokenise text+vision inputs with AutoProcessor, record token length for every sample
- Output: {data.wds_dir}/.temp/sample_len_report.txt (sample_id: token_len)
Manual run:
python get_sample_len.py --config config.yaml
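For inspecting the Step 1 output, a small parser along the following lines may be handy. It assumes one `sample_id: token_len` pair per line, which is how the report format is described above; `parse_len_report` is a hypothetical helper, not part of the module.

```python
def parse_len_report(lines):
    """Parse 'sample_id: token_len' lines into a dict, skipping blank lines."""
    lengths = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # rpartition splits on the last ':' so ids containing ':' still parse.
        sample_id, _, token_len = line.rpartition(":")
        lengths[sample_id.strip()] = int(token_len)
    return lengths
```

Typical use would be `with open(report_path) as f: lengths = parse_len_report(f)`, after which `max(lengths.values())` quickly shows whether any sample exceeds sample.max_token_len.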
Step 2: Length bucketing & packing groups (do_hashbacket.py)#
- Input: sample_len_report.txt
- Process: build hash buckets, pack samples into “boxes” under sample.max_token_len
- Output: {data.packed_json_dir}/bins_boxs.pkl (each box = list of sample ids that will be concatenated into one packed sample)
Manual run:
python do_hashbacket.py --config config.yaml
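The idea of filling boxes up to sample.max_token_len can be illustrated with a plain first-fit-decreasing heuristic. This is a stand-in for explanation only: it is not the hash-bucket algorithm used by do_hashbacket.py, and the handling of over-long samples (skipped here) is an assumption.

```python
def pack_first_fit_decreasing(sample_lens, max_token_len):
    """Group sample ids into boxes whose total length stays within max_token_len."""
    boxes = []      # each box is a list of sample ids
    remaining = []  # free capacity of each box, in the same order as boxes
    # Place the longest samples first.
    by_length = sorted(sample_lens.items(), key=lambda kv: kv[1], reverse=True)
    for sample_id, length in by_length:
        if length > max_token_len:
            continue  # does not fit in any box; skipped in this sketch
        for i, free in enumerate(remaining):
            if length <= free:
                boxes[i].append(sample_id)
                remaining[i] -= length
                break
        else:
            # No existing box has room, so open a new one.
            boxes.append([sample_id])
            remaining.append(max_token_len - length)
    return boxes
```

Each returned list plays the role of one box: a set of sample ids that would be concatenated into a single packed sample.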
Step 3: Generate packed intermediate JSON (prepare_raw_samples.py)#
- Input: bins_boxs.pkl + original *.json / media
- Process: aggregate samples per box, produce packed-json with fields such as prompts / captions / media_files / media_type
- Output: {data.packed_json_dir}/row_packing_jsons/*.json
Manual run:
python prepare_raw_samples.py --config config.yaml
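As a rough picture of the aggregation in this step, the sketch below merges the samples of one box into a single record with the list fields named above. The per-sample input keys (`prompt`, `caption`, `media_file`, `media_type`) and the function `merge_box` are hypothetical; the real input schema depends on sample.sample_type and is not reproduced here.

```python
def merge_box(box, samples):
    """Concatenate the per-sample fields of one box into a single packed record."""
    packed = {"prompts": [], "captions": [], "media_files": [], "media_type": []}
    for sample_id in box:
        sample = samples[sample_id]  # hypothetical per-sample dict
        packed["prompts"].append(sample["prompt"])
        packed["captions"].append(sample["caption"])
        packed["media_files"].append(sample["media_file"])
        packed["media_type"].append(sample["media_type"])
    return packed
```

The order of entries follows the order of ids in the box, so the i-th prompt, caption, and media file always belong to the same original sample.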
Step 4: Write packed JSON back to WebDataset (packed_to_wds.py)#
- Input: row_packing_jsons/*.json + media (looked up under {data.wds_dir} or {data.packed_json_dir}/row_packing_images)
- Output: data.packed_wds_dir/pretrain-*.tar (or {data.packed_json_dir}/packed_wds if not configured) plus Energon meta (.wds/dataset.yaml + index)
Manual run:
python packed_to_wds.py --config config.yaml
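To show what a WebDataset-style shard looks like on disk, here is a minimal sketch using only the Python standard library. It relies on the general WebDataset convention that tar members sharing a key prefix form one sample; the helper `add_sample` and the member naming are assumptions, and the sketch does not cover shard splitting or the Energon metadata that packed_to_wds.py also writes.

```python
import io
import json
import tarfile


def add_sample(tar, key, record, media=None):
    """Add one sample: '<key>.json' plus optional media members sharing the key prefix."""
    payload = json.dumps(record).encode("utf-8")
    info = tarfile.TarInfo(name=f"{key}.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
    # media maps an extension (e.g. "jpg") to raw bytes.
    for ext, data in (media or {}).items():
        info = tarfile.TarInfo(name=f"{key}.{ext}")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
```

A shard could then be written with `with tarfile.open("pretrain-000000.tar", "w") as tar: add_sample(tar, "000000", record, {"jpg": image_bytes})`, where the file name is only an example of the pretrain-*.tar pattern.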
5. Configuration (config.yaml)#
Key fields:
- data.wds_dir – input sample directory (*.json + media)
- data.template_text_key – message field name in JSON (messages or texts)
- data.packed_json_dir – working directory for intermediate pkl/json
- data.packed_wds_dir – final packed WDS output directory
- sample.max_token_len – target packing length (e.g. 8192 / 16384)
- sample.sample_type – see Section 1
- model.model_type – model identifier used to pick the template
- model.processor_kwargs.* – HF processor arguments passed to transformers.AutoProcessor.from_pretrained
- packed_wds.maxcount / maxsize – tar-shard splitting strategy
Example (excerpt; see config.yaml for the full set of fields):
data:
  wds_dir: "/mnt/cluster/.../wds_flat/"
  template_text_key: "messages"
  packed_json_dir: "/mnt/cluster/.../packed_json/"
  packed_wds_dir: "/mnt/cluster/.../packed_wds/"
sample:
  max_token_len: 8192
  sample_type: packed_multi_mix_qa
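Step 4 falls back to {data.packed_json_dir}/packed_wds when data.packed_wds_dir is not configured. A minimal sketch of that resolution, assuming the config has already been loaded into a nested dict (the function name `resolve_packed_wds_dir` is hypothetical):

```python
from pathlib import Path


def resolve_packed_wds_dir(cfg):
    """Return data.packed_wds_dir, or packed_json_dir/packed_wds if it is unset."""
    data = cfg["data"]
    configured = data.get("packed_wds_dir")
    if configured:
        return Path(configured)
    return Path(data["packed_json_dir"]) / "packed_wds"
```

An empty string is treated the same as a missing key here; whether the real script does so is not confirmed by this document.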
6. Switching models / tuning image processing#
Step 1’s token counts depend on the actual AutoProcessor logic, so you can change the model or image-preprocessing parameters via config:
- Change model: set model.processor_kwargs.pretrained_model_name_or_path to the desired HF model/processor; update model.model_type accordingly.
- Adjust image-token budget / resolution: add processor-supported arguments under model.processor_kwargs (e.g. Qwen-VL’s min_pixels / max_pixels).
- Template alignment: if you add a new model.model_type, make sure tools/data_preprocess/vlm/offline_packing/utils.py contains the corresponding entry in TEMPLATES[sample_type][model_type]; otherwise Step 1 will raise “No template found for model_type …”.
- Media pre-processing: under media_preprocess you can assign pre-processing function names per modality (implementations in tools/data_preprocess/vlm/offline_packing/media_preprocess_utils.py) to control resize/crop/frame-reading behaviour.
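The two-level lookup TEMPLATES[sample_type][model_type] described above can be sketched as follows. The table contents, the model_type name `my_model`, the exception type, and the exact message wording are placeholders for illustration; the real entries and error handling live in utils.py.

```python
# Hypothetical template table; the real entries live in utils.TEMPLATES.
TEMPLATES = {
    "packed_vqa": {"my_model": "<placeholder template>"},
}


def get_template(sample_type, model_type, templates=TEMPLATES):
    """Look up templates[sample_type][model_type] and fail loudly if it is missing."""
    by_model = templates.get(sample_type, {})
    if model_type not in by_model:
        raise ValueError(
            f"No template found for model_type {model_type!r} "
            f"(sample_type {sample_type!r})"
        )
    return by_model[model_type]
```

Adding a new model therefore means adding one entry per sample_type you intend to pack, before running Step 1.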