VLM Dataset Conversion
1. Dataset Format and Processing
Given the diversity of multimodal datasets, this project uses the Energon loader to improve data-processing performance. Energon requires datasets to be stored in the standard WebDataset format. WebDataset keeps each file in its native format (jpg, mp4, etc.) inside tar shards, so existing multimodal datasets can be packed into WebDataset format with little effort and then read by Energon.
Reference Documentation:
This directory provides tools/data_preprocess/vlm/convert_to_webdataset.py for converting .json/.jsonl annotation files + original media files (images/videos) into WebDataset directories that Energon can directly read (while generating Energon-required indexes and dataset.yaml).
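To make the target layout concrete, the following is a minimal sketch of writing one WebDataset-style shard with Python's standard tarfile module. It is not the conversion script itself: add_member and write_shard are hypothetical helper names, and the real script additionally produces the Energon index files and dataset.yaml, which this sketch omits.

```python
import io
import tarfile


def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Append one in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shard(path: str, samples) -> int:
    """Write samples to one uncompressed tar shard and return the sample count.

    `samples` is an iterable of (key, {extension: bytes}) pairs. All files of
    one sample share the key as their basename prefix (hypothetical helper).
    """
    count = 0
    with tarfile.open(path, "w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                add_member(tar, f"{key}.{ext}", payload)
            count += 1
    return count
```

With this layout, a sample with key 000000 that has an image and an annotation is stored as the two tar members 000000.jpg and 000000.json.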
2. Supported Data Types (--sample_type)
Different dataset.yaml files are written based on --sample_type, determining the field organization of samples inside tar files:
Common pretraining datasets can use the caption format, while SFT datasets can use the VQA format. For multi-image or mixed SFT data, the multi_mix_qa format is recommended.
| sample_type | Applicable Scenario | Description |
|---|---|---|
| vqa | Single image VQA | Generates |
| caption | Single image caption | Generates |
| multi_mix_qa | Multi-image/video mixed QA | Uses |
| multi_vid_vqa | Multi-video VQA | Same as above |
| | Offline-packed data | Usually generated by |
| Other strings | Custom scenarios | Still writes |
Notes:
- --media is only used to write dataset metadata (to distinguish image/video/mix). Whether a sample actually contains images or videos is determined by whether its entry contains image(s)/video(s).
- If an entry has neither image(s) nor video(s), it is written as a "pure text sample" (containing only json).
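The per-entry rule in these notes can be sketched as a small function. The name entry_modality and the returned labels are illustrative assumptions, as is treating an empty string or empty list as absent; the actual script may handle those cases differently.

```python
def _has(entry: dict, singular: str, plural: str) -> bool:
    """True if the entry carries a non-empty singular or plural media field."""
    return bool(entry.get(singular)) or bool(entry.get(plural))


def entry_modality(entry: dict) -> str:
    """Classify one annotation entry by the media fields it contains.

    The result depends only on the entry itself, not on --media
    (hypothetical helper; labels are chosen for illustration).
    """
    has_image = _has(entry, "image", "images")
    has_video = _has(entry, "video", "videos")
    if has_image and has_video:
        return "mix"
    if has_image:
        return "image"
    if has_video:
        return "video"
    return "text"  # pure text sample: only the json part is written
```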
3. Conversion Script Usage
Supported input files:
- --json_file: .json (list[dict]) or .jsonl (one dict per line)
- --image_dir/--video_dir: root directory of the original media files (entries store paths relative to it)
python tools/data_preprocess/vlm/convert_to_webdataset.py \
--output_dir /workspace/wds_data/ \
--json_file tests/datasets/vlm/mllm_demo.json \
--image_dir tests/datasets/vlm/ \
--video_dir tests/datasets/vlm/ \
--media mix \
--columns_messages messages \
--maxcount 10000 \
--maxsize 3000000000 \
--sample_type multi_mix_qa
Parameter Description:
| Parameter | Default | Description |
|---|---|---|
| --output_dir | - | Output directory (generates |
| --json_file | - | Input |
| --image_dir | - | Image root directory (required when samples contain |
| --video_dir | - | Video root directory (required when samples contain |
| --media | | |
| --columns_messages | messages | Key for dialogue/text field in entry |
| --maxcount | | Maximum number of samples per shard (tar) |
| --maxsize | | Maximum byte size per shard (tar) |
| --sample_type | - | Data type (see table above) |
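How the two shard limits interact can be illustrated with a short sketch. The rollover rule used here (start a new shard once the current one already holds maxcount samples or at least maxsize bytes) is an assumption made for illustration, and plan_shards is a hypothetical helper; the script's exact behavior may differ.

```python
from typing import Iterable, Iterator, List


def plan_shards(sizes: Iterable[int], maxcount: int, maxsize: int) -> Iterator[List[int]]:
    """Group sample indices into shards, given each sample's size in bytes.

    Assumed rule: before adding a sample, close the current shard if it
    already holds `maxcount` samples or at least `maxsize` bytes.
    """
    shard: List[int] = []
    shard_bytes = 0
    for index, size in enumerate(sizes):
        if shard and (len(shard) >= maxcount or shard_bytes >= maxsize):
            yield shard
            shard, shard_bytes = [], 0
        shard.append(index)
        shard_bytes += size
    if shard:
        yield shard  # last, possibly partial shard
```

Under this rule a shard can exceed maxsize by up to one sample, because the size check happens before the next sample is added rather than after.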
Output Description:
- The output directory contains pretrain-0.tar, pretrain-1.tar, … (each tar stores the files belonging to each __key__ according to the WebDataset specification, such as xxx.jpg/xxx.json/xxx.0_a.mp4).
- An Energon-required metadata directory (usually named .wds/) is also generated, containing dataset.yaml and index files; during training, --data-path typically points to --output_dir.
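A converted shard can be sanity-checked by grouping its members by __key__. The sketch below uses only the standard tarfile module and assumes the usual WebDataset naming rule, in which the key ends at the first dot of the file's basename and everything after that dot is the extension. split_member and list_samples are hypothetical helper names.

```python
import posixpath
import tarfile
from typing import Dict, List, Tuple


def split_member(name: str) -> Tuple[str, str]:
    """Split 'dir/xxx.0_a.mp4' into the key 'dir/xxx' and extension '0_a.mp4'."""
    dirname, basename = posixpath.split(name)
    stem, _, ext = basename.partition(".")  # split at the first dot only
    key = f"{dirname}/{stem}" if dirname else stem
    return key, ext


def list_samples(tar_path: str) -> Dict[str, List[str]]:
    """Map each sample key in a shard to the extensions stored for it."""
    samples: Dict[str, List[str]] = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file members
            key, ext = split_member(member.name)
            samples.setdefault(key, []).append(ext)
    return samples
```

For a shard holding xxx.jpg, xxx.json and xxx.0_a.mp4, all three members are grouped under the single key xxx.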
4. Input JSON Conventions (Common Fields)
Each entry supports the following field combinations (all paths are relative and are joined with --image_dir/--video_dir to read the binary files):
- Images: image: "a/b.jpg" or images: ["a/b.jpg", "c/d.jpg"]
- Videos: video: "a/b.mp4" or videos: ["a/b.mp4", "c/d.mp4"]
- Text/Dialogue: read from messages by default (can be changed with --columns_messages)
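As a rough illustration of these conventions, the sketch below loads an annotation file in either accepted layout and resolves an entry's relative media paths against the root directories. load_entries, media_paths and _as_list are hypothetical helpers written for this example, not functions of the conversion script, and raising an error when a root directory is missing is an assumed behavior.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_entries(json_file: str) -> List[dict]:
    """Read annotations from .json (a list of dicts) or .jsonl (one dict per line)."""
    path = Path(json_file)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError(f"{json_file}: expected a JSON list of dicts")
    return data


def _as_list(value) -> List[str]:
    """Normalize a missing value, a single path, or a list of paths to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def media_paths(entry: dict, image_dir: Optional[str] = None,
                video_dir: Optional[str] = None) -> Dict[str, List[Path]]:
    """Join the entry's relative media paths with their root directories."""
    images = _as_list(entry.get("image")) + _as_list(entry.get("images"))
    videos = _as_list(entry.get("video")) + _as_list(entry.get("videos"))
    # Assumed behavior: fail early when media is referenced without a root.
    if images and image_dir is None:
        raise ValueError("entry references images but image_dir is not set")
    if videos and video_dir is None:
        raise ValueError("entry references videos but video_dir is not set")
    return {
        "images": [Path(image_dir) / p for p in images],
        "videos": [Path(video_dir) / p for p in videos],
    }
```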
Text field requirements for different sample_type (aligned with script-generated dataset.yaml):
- vqa: messages should support reading json[0][content] and json[1][content] (commonly a list of length ≥ 2 whose elements contain content)
- caption: messages should support reading json[captions][0][content] (e.g., a dict containing captions: [{content: ...}])
- multi_mix_qa/multi_vid_vqa, etc.: the script writes a structured json (containing texts/media/name), which downstream code parses with the cooker for the corresponding sample_type
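Before converting a large file, it can help to check that the text field has the shape the chosen sample_type reads. The function below is a minimal sketch of such a check for vqa and caption only; check_text_field is a hypothetical name, and the checks simply mirror the access paths listed above rather than the script's own validation.

```python
def check_text_field(sample_type: str, value) -> None:
    """Raise ValueError if `value` lacks the paths read for `sample_type`."""
    if sample_type == "vqa":
        # Needs value[0]["content"] and value[1]["content"].
        ok = (
            isinstance(value, list)
            and len(value) >= 2
            and all(isinstance(m, dict) and "content" in m for m in value[:2])
        )
        if not ok:
            raise ValueError("vqa: expected a list whose first two items contain 'content'")
    elif sample_type == "caption":
        # Needs value["captions"][0]["content"].
        captions = value.get("captions") if isinstance(value, dict) else None
        ok = (
            isinstance(captions, list)
            and len(captions) >= 1
            and isinstance(captions[0], dict)
            and "content" in captions[0]
        )
        if not ok:
            raise ValueError("caption: expected a dict with captions[0]['content']")
    # Other sample types are not checked in this sketch.
```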
5. Offline Packing Data Processing
For multimodal scenarios, offline sequence packing is also supported.
See the Offline Data Packing Guide for details.