Supervised fine-tuning framework for MOSS-VL, built on HuggingFace transformers.Trainer.
mossvl_finetune/
├── train.py              # Training entry point
├── data.py               # Dataset and data collator
├── arguments.py          # Argument dataclasses
├── scripts/
│   ├── run_sft.sh        # Full-parameter SFT launch script
│   └── run_sft_lora.sh   # LoRA SFT launch script
└── demo/
    └── sft_data.json     # Example training data
Use the same environment as the model checkpoint:
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt

For LoRA training, additionally install:
pip install peft

Training data must be a JSON list. Each item should use exactly one of the following formats.
[
  {
    "prompt": "Describe this image.",
    "response": "A beautiful landscape with mountains and a sunset.",
    "images": ["path/to/image.jpg"],
    "videos": [],
    "system_prompt": "You are a helpful assistant."
  }
]

Use this format for single-turn supervised fine-tuning data.
- prompt: user input text.
- response: target assistant output text.
- images: optional list of image paths.
- videos: optional list of video entries. See Video Entries for the supported formats.
- system_prompt: optional system instruction.
Automatic Media Placement
Media placeholders (<|image|> and <|video|>) are automatically prepended to the user message according to the following rules:
- Images: each image consumes a single <|image|> placeholder.
- Videos:
  - Plain paths: one <|video|> placeholder per video.
  - Segmented videos: one <|video|> placeholder per segment when using the dictionary format: {"video_path": "...", "segments": [...]}
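
The placement rules above can be sketched in a few lines of Python. This is an illustration only, not the loader's actual code: the function names `count_video_units` and `build_user_message` are hypothetical, and the ordering (images first, then videos) and the newline separator are assumptions.

```python
IMAGE_TOKEN = "<|image|>"
VIDEO_TOKEN = "<|video|>"


def count_video_units(videos):
    """Count placeholders needed: 1 per plain path, 1 per segment for dict entries."""
    total = 0
    for entry in videos:
        if isinstance(entry, dict):
            total += len(entry["segments"])
        else:
            total += 1
    return total


def build_user_message(sample):
    """Prepend one placeholder per image and per video unit to the prompt.

    Assumption: images come first, then videos, one placeholder per line.
    """
    n_images = len(sample.get("images") or [])
    n_videos = count_video_units(sample.get("videos") or [])
    placeholders = [IMAGE_TOKEN] * n_images + [VIDEO_TOKEN] * n_videos
    return "\n".join(placeholders + [sample["prompt"]])
```

For a sample with one image, one plain video, and one dictionary entry with two segments, this sketch would prepend one <|image|> and three <|video|> placeholders.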
[
  {
    "conversations": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "<|image|>\nDescribe this image."},
      {"role": "assistant", "content": "A beautiful landscape."},
      {"role": "user", "content": "What is the dominant color?"},
      {"role": "assistant", "content": "Green."}
    ],
    "images": ["path/to/image.jpg"],
    "videos": []
  }
]

Use this format for multi-turn chat data.
- conversations: required list of chat messages.
  - Each message should be an object like {"role": "...", "content": "..."}.
- images: optional list of image paths.
- videos: optional list of video entries. See Video Entries for the supported formats.
Multimodal Placeholder Rules
When using conversations, you must explicitly include <|image|> or <|video|> placeholders in the message content:
- Images: each image requires exactly one <|image|> placeholder.
- Videos: each plain video path consumes one <|video|> placeholder.
- Segmented videos: each segment within a video dictionary consumes one <|video|> placeholder.
Note
If a sample provides fewer <|video|> placeholders than it has video segments, the loader will expand them during preprocessing.
After this expansion, the final positions of the <|video|> placeholders may differ from what you intended.
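
To catch such mismatches before training, a sample can be checked against the placeholder rules. The sketch below is a hypothetical pre-flight check, not part of the framework: the function names are made up for illustration, and it simply counts placeholders across all message contents.

```python
def expected_placeholders(sample):
    """Return how many placeholders of each kind the sample's media require."""
    n_images = len(sample.get("images") or [])
    n_videos = sum(
        len(v["segments"]) if isinstance(v, dict) else 1
        for v in (sample.get("videos") or [])
    )
    return {"<|image|>": n_images, "<|video|>": n_videos}


def placeholder_mismatches(sample):
    """Map each mismatched placeholder to a (found, expected) pair."""
    text = "".join(m["content"] for m in sample["conversations"])
    report = {}
    for token, expected in expected_placeholders(sample).items():
        found = text.count(token)
        if found != expected:
            report[token] = (found, expected)
    return report
```

An empty result means the counts agree; any entry points at a sample whose placeholders would be adjusted during preprocessing.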
Relative media paths in the JSON are resolved relative to the JSON file's parent directory (or the --data_dir argument if provided).
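
The resolution rule can be pictured as follows. This is a minimal sketch, assuming absolute paths are used unchanged; `resolve_media_path` is a hypothetical helper, not a function exported by the framework.

```python
from pathlib import Path


def resolve_media_path(media_path, json_path, data_dir=None):
    """Resolve a media path the way the text describes.

    Relative paths are joined to data_dir when it is given, otherwise to the
    directory containing the JSON file. Assumption: absolute paths pass through.
    """
    path = Path(media_path)
    if path.is_absolute():
        return path
    base = Path(data_dir) if data_dir else Path(json_path).parent
    return base / path
```

For example, with a JSON file at /data/train/sft.json and no --data_dir, the entry img/a.jpg would resolve to /data/train/img/a.jpg under this sketch.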
Each item in videos can use one of the following formats.
1. Plain video path
{
  "videos": [
    "path/to/video.mp4"
  ]
}

This represents one full video and consumes one <|video|> placeholder.
2. Segmented videos
{
  "videos": [
    {
      "video_path": "path/to/video_1.mp4",
      "segments": [[0, 10]]
    },
    {
      "video_path": "path/to/video_2.mp4",
      "segments": [[20, 30]]
    }
  ]
}

In the segmented format:
- video_path is the path to the source video file.
- segments is a list of time segments in seconds.
- Each segment is written as [start, end], using a left-closed, right-open interval: [start, end).
- Each segment expands to one video unit and therefore consumes one <|video|> placeholder during training text construction.
In the example above, there are two segmented video entries and each entry has one segment, so the sample expands to two video units and needs two <|video|> placeholders.
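
The expansion into video units can be sketched as below. `expand_video_units` is a hypothetical name, and the tuple layout (path, start, end) and the validation of segment bounds are assumptions made for illustration.

```python
def expand_video_units(videos):
    """Flatten video entries into (path, start, end) units.

    A plain path becomes one unit covering the full video (start and end are
    None). A dictionary entry becomes one unit per [start, end) segment.
    """
    units = []
    for entry in videos:
        if isinstance(entry, str):
            units.append((entry, None, None))
            continue
        for start, end in entry["segments"]:
            # Assumption: a segment must be non-empty and start at or after 0.
            if not 0 <= start < end:
                raise ValueError(f"invalid segment [{start}, {end})")
            units.append((entry["video_path"], start, end))
    return units
```

The length of the returned list is the number of <|video|> placeholders the sample needs.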
Note
Run from the repository root.
Full-parameter SFT:
bash mossvl_finetune/scripts/run_sft.sh

LoRA SFT:
bash mossvl_finetune/scripts/run_sft_lora.sh

Or call the training script directly:
python mossvl_finetune/train.py \
--model_name_or_path /path/to/checkpoint \
--data_path mossvl_finetune/demo/sft_data.json \
--output_dir ./checkpoints/test \
--bf16 True \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--num_train_epochs 1 \
--dataloader_num_workers 0 \
--gradient_checkpointing True \
--report_to none

| Argument | Default | Description |
|---|---|---|
| --model_name_or_path | (required) | Path to the MOSS-VL checkpoint |
| --tune_vision | False | Train the vision encoder |
| --tune_language | True | Train the language model layers |
| --tune_lm_head | True | Train the LM head projection |

| Argument | Default | Description |
|---|---|---|
| --data_path | (required) | Path to the training data JSON file |
| --data_dir | auto | Base directory for relative media paths |
| --max_length | 4096 | Maximum token sequence length |

| Argument | Default | Description |
|---|---|---|
| --vision_chunked_length | 64 | Chunk size for vision encoding (saves VRAM) |
| --lora_enable | False | Enable LoRA training |
| --lora_r | 64 | LoRA rank |
| --lora_alpha | 128 | LoRA alpha |
| --lora_dropout | 0.0 | LoRA dropout |
| --lora_target_modules | q_proj,k_proj,v_proj,o_proj | Comma-separated LoRA target modules |

All standard HuggingFace TrainingArguments (--learning_rate, --num_train_epochs, --deepspeed, etc.) are also accepted.
By default the vision encoder is frozen while the language model and LM head are trained:
tune_vision=False → vision encoder frozen
tune_language=True → all decoder layers trained
tune_lm_head=True → output projection trained
When LoRA is enabled (--lora_enable True), all base parameters are frozen and only the LoRA adapters are trained.
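
The freezing rules above amount to a per-parameter decision. The sketch below expresses that decision as a pure function over parameter names. The three name prefixes are hypothetical placeholders, since the real MOSS-VL module names are not given here, and treating unmatched parameters as frozen is also an assumption. The one detail taken from peft is that adapter parameter names contain "lora_".

```python
# Hypothetical module-name prefixes; the real checkpoint may name these differently.
VISION_PREFIX = "visual."
LANGUAGE_PREFIX = "language_model."
LM_HEAD_PREFIX = "lm_head."


def is_trainable(name, tune_vision=False, tune_language=True,
                 tune_lm_head=True, lora_enable=False):
    """Decide whether a parameter should receive gradients."""
    if lora_enable:
        # With LoRA, all base weights are frozen and only adapters train.
        return "lora_" in name
    if name.startswith(VISION_PREFIX):
        return tune_vision
    if name.startswith(LANGUAGE_PREFIX):
        return tune_language
    if name.startswith(LM_HEAD_PREFIX):
        return tune_lm_head
    # Assumption: anything not covered by a flag stays frozen.
    return False
```

With a PyTorch model, such a predicate could be applied by iterating over model.named_parameters() and setting requires_grad on each parameter.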
Pass a DeepSpeed config via --deepspeed:
torchrun --nproc_per_node=8 mossvl_finetune/train.py \
... \
--deepspeed ds_config_zero2.json

The loss is computed only on assistant tokens. Training labels are masked as follows:
- Training Targets: Only the Assistant's responses are used as active training labels.
- Masked Content: System prompts, user queries, and all vision-related tokens (e.g., <|image_pad|>) are assigned ignore_index=-100 to exclude them from loss calculation.
- EOS Learning: The trailing <|im_end|> token at the end of each Assistant turn is explicitly included in the labels, ensuring the model learns when to stop generating.
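
A minimal sketch of this masking strategy follows, operating on already tokenized spans. It is not the collator's actual code: `build_labels` is a hypothetical helper, the token ids in the usage example are invented, and it assumes the caller has split the sequence so that each assistant span holds only the response tokens followed by the <|im_end|> id.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def build_labels(segments, vision_token_ids=frozenset()):
    """Build (input_ids, labels) from a list of (role, token_ids) spans.

    Assistant spans keep their token ids as labels, including the trailing
    end-of-turn id. All other spans, and any vision token id wherever it
    appears, are set to IGNORE_INDEX.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        for tok in token_ids:
            if role == "assistant" and tok not in vision_token_ids:
                labels.append(tok)
            else:
                labels.append(IGNORE_INDEX)
    return input_ids, labels
```

For instance, with an invented id 900 standing in for <|image_pad|> and 99 for <|im_end|>, the spans system [1, 2], user [3, 900, 900, 4], assistant [5, 6, 99] would yield labels that are -100 for the first six positions and 5, 6, 99 for the last three.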