Fork Note (AMD ROCm Adaptation)
This branch (`feat/unified-amd`) adapts 6 unified multimodal models (understanding + generation) into the VeOmni framework for ablation experiments. All model implementations are aligned with the official code: inference outputs and training losses are verified to match.

Environment:
- GPU: AMD Instinct MI308X (192GB HBM3)
- Platform: ROCm 7.0 + PyTorch 2.10.0+rocm7.0
- Python: 3.11
- Flash Attention: flash_attn 2.7.3 (ROCm)
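
As a quick way to confirm that an installed PyTorch build matches the environment listed above, the ROCm tag can be read from the version string. This is a minimal sketch: `rocm_version_from_torch` is a hypothetical helper, not part of VeOmni, and it assumes the `+rocmX.Y` local-version suffix shown in the list.

```python
import re
from typing import Optional


def rocm_version_from_torch(version: str) -> Optional[str]:
    """Return the ROCm version embedded in a PyTorch version string, or None.

    Assumes the "+rocmX.Y" local-version suffix, e.g. "2.10.0+rocm7.0".
    """
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", version)
    return match.group(1) if match else None


# Version string taken from the environment list above.
print(rocm_version_from_torch("2.10.0+rocm7.0"))
```

On a real install, passing `torch.__version__` to the helper and checking that `torch.cuda.is_available()` returns `True` gives a similar sanity check; ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` namespace.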

| Model | model_type | LLM Backbone | Generation Method | Params |
|---|---|---|---|---|
| Bagel | bagel | Qwen2.5-7B | MoVQGAN (Flow-Matching) | 14.6B |
| ThinkMorph | thinkmorph | Qwen2.5-7B | MoVQGAN (Flow-Matching + CoT) | 14.6B |
| BLIP3o | blip3o_qwen | Qwen2.5-7B | Diffusion (DIT + VAE) | 14.1B |
| SenseNova-U1 | neo_chat | NEO (Qwen3-arch, 42L) | Flow-Matching (MoT) | 17.6B |
| LatentUM | latentum | InternVL3.5-4B | MoT Discrete Tokens (AR Head) | 8.8B |
| Janus-Pro | janus | DeepSeek-LLM-7B | VQ-16 Discrete Tokens (AR) | 7.4B |

All 6 models produce identical outputs to their official repos when given the same inputs. Tested on VisPuzzle benchmark (multiple-choice VQA):
| Model | Alignment | Test Samples | Official Accuracy |
|---|---|---|---|
| Bagel | 100% | 20/20 exact match | 33.8% |
| ThinkMorph | 100% | 5/5 exact match | 36.5% |
| BLIP3o | 100% | 10/10 exact match | 36.8% |
| SenseNova-U1 | 100% | 10/10 exact match | 68.5% |
| LatentUM | 100% | 3/3 exact match | 38.5% |
| Janus-Pro | 100% | 3/3 exact match | 33.2% |
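
The "Alignment" column above is an exact-match rate between official and VeOmni outputs on the same inputs. The comparison script itself is not shown in this note; the sketch below illustrates one way such a rate could be computed, with `exact_match_rate` and the sample answers being purely illustrative.

```python
from typing import Sequence, Tuple


def exact_match_rate(official: Sequence[str], veomni: Sequence[str]) -> Tuple[int, int, float]:
    """Count positions where both implementations produced identical outputs.

    Returns (matches, total, rate). Raises if the lists are empty or differ in length.
    """
    if len(official) != len(veomni):
        raise ValueError("both runs must cover the same samples")
    if not official:
        raise ValueError("no samples to compare")
    matches = sum(a == b for a, b in zip(official, veomni))
    return matches, len(official), matches / len(official)


# Illustrative multiple-choice answers, not real benchmark outputs.
matches, total, rate = exact_match_rate(["A", "C", "B"], ["A", "C", "B"])
print(f"{matches}/{total} exact match ({rate:.0%})")
```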
Usage:
```bash
# Unified entry for all 6 models
python tasks/infer/infer_unified.py \
    --model_type bagel \
    --mode understand \
    --prompt "Describe this image" \
    --image img.jpg

python tasks/infer/infer_unified.py \
    --model_type u1 \
    --mode understand \
    --prompt "What is shown?" \
    --image img.jpg
```

All 6 models produce valid generated images through VeOmni's unified pipeline, aligned with official implementations under the same prompt and seed.
Prompt: "Add a blue circle in the center of this image"
[Figure: side-by-side Official vs. VeOmni generated images for Bagel, ThinkMorph, BLIP3o, SenseNova-U1, LatentUM, and Janus-Pro.]
Usage:
```bash
# Text-to-Image generation
python tasks/infer/infer_unified.py \
    --model_type bagel \
    --mode generate \
    --prompt "A cat sitting on a windowsill" \
    --output output.png

python tasks/infer/infer_unified.py \
    --model_type u1 \
    --mode generate \
    --prompt "A blue circle on white background" \
    --output output.png
```

Three models (Bagel, ThinkMorph, U1) support image editing: given a source image and an editing instruction, the model generates the edited image. VeOmni outputs are aligned with official implementations under the same input and seed.
Note: LatentUM does not support general image editing; it is a navigation world model (next-frame prediction) and is excluded from this section.
Edit prompt: "Add a red arrow pointing from the center of the image to the top-right corner."
[Figure: input image, official output, and VeOmni output for Bagel (1024×1024), ThinkMorph (1024×1024), and SenseNova-U1 (1280×1280).]
Usage:
```python
from draw_to_understand.models.veomni_bagel import VeOmniBagelGenerationBackend
from PIL import Image

# Load the Bagel generation backend on the first visible GPU
backend = VeOmniBagelGenerationBackend(
    model_path="/path/to/BAGEL-7B-MoT",
    device="cuda:0",
)

# Edit the source image according to the instruction and save the result
input_image = Image.open("input.png").convert("RGB")
output_image = backend.draw(input_image, "Add a red arrow to the top-right corner.")
output_image.save("edited.png")
```

All 6 models produce CE losses aligned with their official training code given the same inputs.
Verified by running official model code from ablation_experiment/ repos with same seed & sequence.
| Model | VeOmni CE Loss | Official CE Loss | Diff | Grad Params |
|---|---|---|---|---|
| Bagel | 14.7230 | 14.7230 | 9.54e-7 | 395 |
| ThinkMorph | 14.7230 | 14.7230 | 9.54e-7 | 395 |
| BLIP3o | 13.3495 | 13.3964 | 0.047 | 311 |
| SenseNova-U1 | 13.0480 | 13.0480 | 0.000 | 549 |
| LatentUM | 12.7729 | 12.7839 | 0.011 | 399 |
| Janus-Pro | 13.6453 | N/A (no official training code) | N/A | 282 |
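
The 9.54e-7 difference reported for Bagel and ThinkMorph equals the spacing between adjacent float32 values near 14.72 (2^-20 ≈ 9.54e-7). That is consistent with the two implementations differing by a single rounding step, although this note does not say which precision the losses were logged in; float32 is an assumption here. A minimal standard-library sketch for checking the spacing, where `float32_ulp` is a hypothetical helper and not part of VeOmni:

```python
import struct


def float32_ulp(x: float) -> float:
    """Spacing between x (rounded to float32) and the next larger float32.

    Assumes x is positive and finite.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lower = struct.unpack("<f", struct.pack("<I", bits))[0]
    upper = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return upper - lower


# Loss value taken from the Bagel / ThinkMorph rows above.
print(float32_ulp(14.7230))  # 9.5367431640625e-07
```

The larger gaps for BLIP3o (0.047) and LatentUM (0.011) are many orders of magnitude above this spacing, so they cannot be explained by float32 rounding of the final loss alone.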
For understanding-only training, set mse_weight: 0.0 in the config to disable the generation loss.
```bash
# Bagel understanding-only (2 GPU, FSDP2)
GPUS=0,1 MODEL=bagel \
TRAIN_DATA=/path/to/understanding_data.jsonl \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

# U1 understanding-only (2 GPU, FSDP2)
GPUS=0,1 MODEL=u1 \
TRAIN_DATA=/path/to/understanding_data.jsonl \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh
```

Bagel and SenseNova-U1 support joint training of understanding (CE loss) and image generation (MSE/FM loss) in a single forward pass. The training data contains interleaved conversations where:
- Images in user messages are treated as input (understanding path, ViT encoder)
- Images in assistant messages are treated as output (generation path, MSE loss)
This enables conditional image generation / image editing tasks alongside standard visual QA.
| Model | Generation Method | Loss | Step 1 | Step 2 | Step 3 |
|---|---|---|---|---|---|
| Bagel | MoVQGAN latent-space flow-matching | CE + MSE | ce: 13.78, mse: 1.46 | ce: 11.88, mse: 0.90 | ce: 11.81, mse: 2.62 |
| SenseNova-U1 | Pixel-space flow-matching (MoT) | CE + FM | ce: 13.66, fm: 1.46 | ce: 11.88, fm: 0.90 | ce: 11.78, fm: 2.62 |
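
The `ce_weight` and `mse_weight` options used in this fork suggest that the two losses are combined as a weighted sum. The exact reduction inside the training script is not shown here, so the sketch below is an assumption, and `combined_loss` is a hypothetical helper.

```python
def combined_loss(ce: float, gen: float, ce_weight: float = 1.0, mse_weight: float = 1.0) -> float:
    """Weighted sum of the understanding (CE) and generation (MSE/FM) losses.

    Assumption: the total loss is ce_weight * ce + mse_weight * gen.
    """
    return ce_weight * ce + mse_weight * gen


# Bagel step 1 values from the table above.
joint = combined_loss(ce=13.78, gen=1.46)
# Setting mse_weight to 0.0 leaves only the CE term (understanding-only).
und_only = combined_loss(ce=13.78, gen=1.46, mse_weight=0.0)
print(f"joint: {joint:.2f}, understanding-only: {und_only:.2f}")
```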
Bagel / ThinkMorph unified training:
```bash
# Bagel: understanding + generation (CE + MSE loss)
# Config: mse_weight=1.0, ce_weight=1.0
GPUS=0,1 MODEL=bagel \
TRAIN_DATA=/path/to/interleaved_data.jsonl \
IMAGE_ROOT=/path/to/images \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh
```

Key config options (`configs/multimodal/bagel/sft_amd.yaml`):

```yaml
train:
  mse_weight: 1.0    # generation loss weight (0.0 = understanding-only)
  ce_weight: 1.0     # understanding loss weight
  freeze_vit: true   # freeze ViT encoder
  freeze_vae: true   # freeze VAE encoder/decoder
  freeze_und: false  # keep understanding path trainable
  # To switch to understanding-only: freeze_gen_modules: true, mse_weight: 0.0
  # To switch to generation-only: freeze_und: true, ce_weight: 0.0
```

SenseNova-U1 unified training:
```bash
# U1: understanding + generation (CE + FM loss)
# Config: mse_weight=1.0, ce_weight=1.0, flex_attention enabled
GPUS=0,1 MODEL=u1 STAGE=gen \
TRAIN_DATA=/path/to/interleaved_data.jsonl \
IMAGE_ROOT=/path/to/images \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh
```

Key config options (`configs/multimodal/u1/sft_gen_amd.yaml`):

```yaml
train:
  mse_weight: 1.0            # flow-matching loss weight
  ce_weight: 1.0             # CE loss weight
  freeze_vit: true           # freeze understanding ViT
  freeze_llm: true           # freeze LLM backbone
  freeze_gen_modules: false  # keep FM modules trainable
  unfreeze_mot_gen: true     # unfreeze MoT generation branch (*_mot_gen params)
  unfreeze_vit_layers: -4    # unfreeze last 4 ViT encoder layers
  unfreeze_lm_head: true     # unfreeze lm_head
```

Training data is JSONL with interleaved conversations. Each `<image>` marker maps to the next entry in the `images` list.
```json
{
  "id": "sample_001",
  "images": ["input.jpg", "output.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
    {"from": "gpt", "value": "Here is the edited image:\n<image>"}
  ]
}
```

- `<image>` in the `human` turn + `images[0]` = understanding input (ViT)
- `<image>` in the `gpt` turn + `images[1]` = generation target (VAE/FM, MSE loss)

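
The marker-to-image mapping described above can be sketched as a small routing function. This is illustrative only: `route_images` is a hypothetical helper, not the loader used by VeOmni, and it assumes that markers are consumed strictly in conversation order.

```python
from typing import Dict, List, Tuple

IMAGE_TOKEN = "<image>"


def route_images(sample: Dict) -> List[Tuple[str, str]]:
    """Pair each <image> marker with the next entry in sample["images"].

    Markers in "human" turns are tagged "input" (understanding path);
    markers in any other turn are tagged "target" (generation path).
    """
    images = iter(sample["images"])
    routed = []
    for turn in sample["conversations"]:
        role = "input" if turn["from"] == "human" else "target"
        for _ in range(turn["value"].count(IMAGE_TOKEN)):
            try:
                path = next(images)
            except StopIteration:
                raise ValueError("more <image> markers than images") from None
            routed.append((path, role))
    if list(images):
        raise ValueError("more images than <image> markers")
    return routed


sample = {
    "id": "sample_001",
    "images": ["input.jpg", "output.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
        {"from": "gpt", "value": "Here is the edited image:\n<image>"},
    ],
}
print(route_images(sample))  # [('input.jpg', 'input'), ('output.jpg', 'target')]
```

A sample whose assistant turns contain no marker yields only `"input"` entries, which corresponds to understanding-only data.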
For understanding-only data, simply omit <image> from assistant responses:
```json
{
  "id": "vqa_001",
  "images": ["photo.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nWhat is shown in this image?"},
    {"from": "gpt", "value": "A cat sitting on a windowsill."}
  ]
}
```

Quick sanity-check with 3 training steps on 2 GPUs:
```bash
# Bagel debug
DEBUG=1 MAX_STEPS=3 GPUS=2,3 MODEL=bagel bash scripts/mdl/amd/veomni_unified_sft_amd.sh

# U1 debug (generation mode)
DEBUG=1 MAX_STEPS=3 GPUS=2,3 MODEL=u1 STAGE=gen bash scripts/mdl/amd/veomni_unified_sft_amd.sh
```

See per-model branches: `feat/bagel-amd`, `feat/thinkmorph-amd`, `feat/blip3o-amd`, `feat/sensenova-u1-amd`, `feat/latentum-amd`

VeOmni is a versatile framework for both single- and multi-modal pre-training and post-training. It empowers users to seamlessly scale models of any modality across various accelerators, offering both flexibility and user-friendliness.
Our guiding principles when building VeOmni are:
- Flexibility and Modularity: VeOmni is built with a modular design, allowing users to decouple most components and replace them with their own implementations as needed.
- Trainer-free: VeOmni supports linear training scripts that avoid rigid, structured trainer classes (e.g., PyTorch-Lightning or HuggingFace Trainer). These training scripts expose the entire training logic to users for maximum transparency and control. VeOmni also provides a basic trainer for text-only and VLM/omni model training, and an RL trainer that serves as a trainer backend for reinforcement learning.
- Omni model native: VeOmni enables users to effortlessly scale any omni-model across devices and accelerators.
- Torch native: VeOmni is designed to leverage PyTorch's native functions to the fullest extent, ensuring maximum compatibility and performance.

- [2025/11] Our paper OmniScale: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo was accepted by AAAI 2026.
- [2025/09] We released the first official release, v0.1.0, of VeOmni.
- [2025/08] We release VeOmni Tech report and open the WeChat group. Feel free to join us!
- [2025/04] We release VeOmni!
- FSDP, FSDP2 backend for training.
- Sequence Parallelism with DeepSpeed Ulysses, in both non-async and async modes.
- Expert Parallelism to support training of large MoE models, such as Qwen3-MoE.
- Efficient GroupGemm kernel for MoE models, Liger-Kernel.
- Compatible with HuggingFace Transformers models such as Qwen3, Qwen3-VL, and Qwen3-MoE.
- Dynamic batching strategy, Omnidata processing.
- Torch Distributed Checkpoint for checkpointing.
- Support for both Nvidia-GPU and Ascend-NPU training.
- Experiment tracking with wandb
- VeOmni v0.2 Roadmap ByteDance-Seed#268, ByteDance-Seed#271
- Vit balance tool ByteDance-Seed#280
- Validation dataset during training ByteDance-Seed#247
- RL post training for omni-modality models with VeRL ByteDance-Seed#262
| Model | Model size | Example config File |
|---|---|---|
| DeepSeek2.5/3/R1 | 236B/671B | deepseek.yaml |
| Llama3-3.3 | 1B/3B/8B/70B | llama3.yaml |
| Qwen2-3 | 0.5B/1.5B/3B/7B/14B/32B/72B | qwen2_5.yaml |
| Qwen2-3 VL/QVQ | 2B/3B/7B/32B/72B | qwen3_vl_dense.yaml |
| Qwen3-VL MoE | 30BA3B/235BA22B | qwen3_vl_moe.yaml |
| Qwen3-MoE | 30BA3B/235BA22B | qwen3-moe.yaml |
| Qwen2-3 Omni | 7B/30BA3B | qwen25_omni.yaml |
| Wan | Wan2.1-I2V-14B-480P | wan_sft.yaml |
| Omni Model | Any Modality Training | seed_omni.yaml |
To add support for new models to VeOmni, see Support New Models.
For more details, please refer to our paper.
- dFactory: Easy and Efficient dLLM Fine-Tuning
- LMMs-Engine
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- OpenHA: A Series of Open-Source Hierarchical Agentic Models in Minecraft
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- Open-dLLM: Open Diffusion Large Language Models
- LingBot-VLA: A Pragmatic VLA Foundation Model
Contributions from the community are welcome! Please check out CONTRIBUTING.md and our project roadmap (to be updated).

If you find VeOmni useful for your research and applications, feel free to give us a star or cite us using:
```bibtex
@article{ma2025veomni,
  title={VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo},
  author={Ma, Qianli and Zheng, Yaowei and Shi, Zhelun and Zhao, Zhongkai and Jia, Bin and Huang, Ziyue and Lin, Zhiqi and Li, Youjie and Yang, Jiacheng and Peng, Yanghua and others},
  journal={arXiv preprint arXiv:2508.02317},
  year={2025}
}
```

Thanks to the following projects for their excellent work:
About ByteDance Seed Team
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society. You can get to know ByteDance Seed better through the following channels.
























