GitHub - 86MaxCao/VeOmni: VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Fork Note (AMD ROCm Adaptation)

This branch (feat/unified-amd) adapts 6 unified multimodal models (understanding + generation) into the VeOmni framework for ablation experiments. All model implementations are aligned with official code — inference outputs and training losses are verified to match.

Environment:

GPU: AMD Instinct MI308X (192GB HBM3)

Platform: ROCm 7.0 + PyTorch 2.10.0+rocm7.0

Python: 3.11

Flash Attention: flash_attn 2.7.3 (ROCm)

1. Supported Models

Model	model_type	LLM Backbone	Generation Method	Params
Bagel	`bagel`	Qwen2.5-7B	MoVQGAN (Flow-Matching)	14.6B
ThinkMorph	`thinkmorph`	Qwen2.5-7B	MoVQGAN (Flow-Matching + CoT)	14.6B
BLIP3o	`blip3o_qwen`	Qwen2.5-7B	Diffusion (DIT + VAE)	14.1B
SenseNova-U1	`neo_chat`	NEO (Qwen3-arch, 42L)	Flow-Matching (MoT)	17.6B
LatentUM	`latentum`	InternVL3.5-4B	MoT Discrete Tokens (AR Head)	8.8B
Janus-Pro	`janus`	DeepSeek-LLM-7B	VQ-16 Discrete Tokens (AR)	7.4B

2. Multimodal Understanding Inference Alignment

All 6 models produce outputs identical to those of their official repos when given the same inputs. Tested on the VisPuzzle benchmark (multiple-choice VQA):

Model	Alignment	Test Samples	Official Accuracy
Bagel	100%	20/20 exact match	33.8%
ThinkMorph	100%	5/5 exact match	36.5%
BLIP3o	100%	10/10 exact match	36.8%
SenseNova-U1	100%	10/10 exact match	68.5%
LatentUM	100%	3/3 exact match	38.5%
Janus-Pro	100%	3/3 exact match	33.2%
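
As a rough illustration of how an exact-match alignment figure like those in the table could be computed, the sketch below compares two lists of answers. The function name and the sample answers are hypothetical; the actual evaluation script is not shown in this README.

```python
def exact_match_rate(official: list[str], veomni: list[str]) -> float:
    """Fraction of samples where both implementations give the same answer string."""
    if len(official) != len(veomni):
        raise ValueError("both runs must cover the same samples")
    if not official:
        raise ValueError("no samples to compare")
    # Ignore surrounding whitespace, but otherwise require identical text.
    matches = sum(a.strip() == b.strip() for a, b in zip(official, veomni))
    return matches / len(official)


# Hypothetical multiple-choice answers for three samples.
official_answers = ["A", "C", "B"]
veomni_answers = ["A", "C", "B"]
print(f"alignment: {exact_match_rate(official_answers, veomni_answers):.0%}")
```

Note that alignment (agreement with the official implementation) is independent of benchmark accuracy: two implementations can agree on every sample and still both answer incorrectly.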

Usage:

# Unified entry for all 6 models
python tasks/infer/infer_unified.py \
    --model_type bagel \
    --mode understand \
    --prompt "Describe this image" \
    --image img.jpg

python tasks/infer/infer_unified.py \
    --model_type u1 \
    --mode understand \
    --prompt "What is shown?" \
    --image img.jpg

3. Text-to-Image Generation Alignment

All 6 models produce valid generated images through VeOmni's unified pipeline, aligned with official implementations under the same prompt and seed.

Prompt: "Add a blue circle in the center of this image"

[Image comparison table, Official vs. VeOmni, one row each for Bagel, ThinkMorph, BLIP3o, SenseNova-U1, LatentUM, and Janus-Pro.]

Usage:

# Text-to-Image generation
python tasks/infer/infer_unified.py \
    --model_type bagel \
    --mode generate \
    --prompt "A cat sitting on a windowsill" \
    --output output.png

python tasks/infer/infer_unified.py \
    --model_type u1 \
    --mode generate \
    --prompt "A blue circle on white background" \
    --output output.png

4. Image Editing (it2i) Alignment

Three models (Bagel, ThinkMorph, and SenseNova-U1) support image editing: given a source image and an editing instruction, the model generates the edited image. VeOmni outputs are aligned with official implementations under the same input and seed.

Note: LatentUM does not support general image editing — it is a navigation world model (next-frame prediction) and is excluded from this section.

Edit prompt: "Add a red arrow pointing from the center of the image to the top-right corner."

[Image comparison table with columns Input, Official, and VeOmni, one row each for Bagel (1024×1024), ThinkMorph (1024×1024), and SenseNova-U1 (1280×1280).]

Usage:

from draw_to_understand.models.veomni_bagel import VeOmniBagelGenerationBackend
from PIL import Image

backend = VeOmniBagelGenerationBackend(
    model_path="/path/to/BAGEL-7B-MoT",
    device="cuda:0",
)

input_image = Image.open("input.png").convert("RGB")
output_image = backend.draw(input_image, "Add a red arrow to the top-right corner.")
output_image.save("edited.png")

5. SFT Training

5.1 Understanding-Only SFT (CE Loss)

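For each model with official training code, VeOmni's CE loss is aligned with the official implementation given the same inputs; the largest gap is 0.047 (BLIP3o). Janus-Pro has no official training code, so only its VeOmni loss is reported. Verification ran the official model code from the ablation_experiment/ repos with the same seed and sequence.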

Model	VeOmni CE Loss	Official CE Loss	Diff	Grad Params
Bagel	14.7230	14.7230	9.54e-7	395
ThinkMorph	14.7230	14.7230	9.54e-7	395
BLIP3o	13.3495	13.3964	0.047	311
SenseNova-U1	13.0480	13.0480	0.000	549
LatentUM	12.7729	12.7839	0.011	399
Janus-Pro	13.6453	N/A (no official training code)	—	282
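
To put the Diff column in context, the sketch below recomputes the absolute differences for BLIP3o and LatentUM from the table and, as an aside, computes the spacing between adjacent float32 values near 14.7230. That spacing is 2^-20, about 9.54e-7, the same size as the Bagel and ThinkMorph difference. Treating that difference as a single float32 rounding step is an assumption, since the README does not state the precision in which losses were logged. The helper name `float32_ulp` is hypothetical.

```python
import struct


def float32_ulp(x: float) -> float:
    """Gap between x (rounded to float32) and the next larger float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lo = struct.unpack("<f", struct.pack("<I", bits))[0]
    hi = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return hi - lo


# (VeOmni CE loss, official CE loss) copied from the table above.
losses = {
    "BLIP3o": (13.3495, 13.3964),
    "LatentUM": (12.7729, 12.7839),
}
diffs = {name: abs(a - b) for name, (a, b) in losses.items()}
for name, d in diffs.items():
    print(f"{name}: |diff| = {d:.3f}")

# One float32 step near the Bagel / ThinkMorph loss value.
print(f"float32 ULP at 14.7230: {float32_ulp(14.7230):.3e}")
```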

For understanding-only training, set mse_weight: 0.0 in the config to disable the generation loss.

# Bagel understanding-only (2 GPU, FSDP2)
GPUS=0,1 MODEL=bagel \
TRAIN_DATA=/path/to/understanding_data.jsonl \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

# U1 understanding-only (2 GPU, FSDP2)
GPUS=0,1 MODEL=u1 \
TRAIN_DATA=/path/to/understanding_data.jsonl \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

5.2 Unified Understanding + Generation SFT (CE + Generation Loss)

Bagel and SenseNova-U1 support joint training of understanding (CE loss) and image generation (MSE/FM loss) in a single forward pass. The training data contains interleaved conversations where:

Images in user messages are treated as input (understanding path, ViT encoder)
Images in assistant messages are treated as output (generation path, MSE loss)

This enables conditional image generation / image editing tasks alongside standard visual QA.

Model	Generation Method	Loss	Step 1	Step 2	Step 3
Bagel	MoVQGAN latent-space flow-matching	CE + MSE	ce: 13.78, mse: 1.46	ce: 11.88, mse: 0.90	ce: 11.81, mse: 2.62
SenseNova-U1	Pixel-space flow-matching (MoT)	CE + FM	ce: 13.66, fm: 1.46	ce: 11.88, fm: 0.90	ce: 11.78, fm: 2.62

Bagel / ThinkMorph unified training:

# Bagel: understanding + generation (CE + MSE loss)
# Config: mse_weight=1.0, ce_weight=1.0
GPUS=0,1 MODEL=bagel \
TRAIN_DATA=/path/to/interleaved_data.jsonl \
IMAGE_ROOT=/path/to/images \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

Key config options (configs/multimodal/bagel/sft_amd.yaml):

train:
  mse_weight: 1.0    # generation loss weight (0.0 = understanding-only)
  ce_weight: 1.0     # understanding loss weight
  freeze_vit: true   # freeze ViT encoder
  freeze_vae: true   # freeze VAE encoder/decoder
  freeze_und: false   # keep understanding path trainable
  # To switch to understanding-only:  freeze_gen_modules: true, mse_weight: 0.0
  # To switch to generation-only:     freeze_und: true, ce_weight: 0.0

SenseNova-U1 unified training:

# U1: understanding + generation (CE + FM loss)
# Config: mse_weight=1.0, ce_weight=1.0, flex_attention enabled
GPUS=0,1 MODEL=u1 STAGE=gen \
TRAIN_DATA=/path/to/interleaved_data.jsonl \
IMAGE_ROOT=/path/to/images \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

Key config options (configs/multimodal/u1/sft_gen_amd.yaml):

train:
  mse_weight: 1.0              # flow-matching loss weight
  ce_weight: 1.0               # CE loss weight
  freeze_vit: true             # freeze understanding ViT
  freeze_llm: true             # freeze LLM backbone
  freeze_gen_modules: false    # keep FM modules trainable
  unfreeze_mot_gen: true       # unfreeze MoT generation branch (*_mot_gen params)
  unfreeze_vit_layers: -4      # unfreeze last 4 ViT encoder layers
  unfreeze_lm_head: true       # unfreeze lm_head

5.3 Training Data Format

Training data is JSONL with interleaved conversations. Each <image> marker maps to the next entry in the images list.

{
  "id": "sample_001",
  "images": ["input.jpg", "output.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
    {"from": "gpt", "value": "Here is the edited image:\n<image>"}
  ]
}

<image> in human turn + images[0] = understanding input (ViT)
<image> in gpt turn + images[1] = generation target (VAE/FM, MSE loss)
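
The marker-to-image mapping described above can be sketched as follows. This is an illustrative parser rather than VeOmni's actual data loader; `map_images` and the role labels are hypothetical names.

```python
import json

IMAGE_MARKER = "<image>"


def map_images(sample: dict) -> list[tuple[str, str]]:
    """Pair each <image> marker, in conversation order, with its file and role."""
    images = iter(sample["images"])
    pairs = []
    for turn in sample["conversations"]:
        # Human-turn images are inputs; assistant-turn images are generation targets.
        role = "understanding_input" if turn["from"] == "human" else "generation_target"
        for _ in range(turn["value"].count(IMAGE_MARKER)):
            try:
                pairs.append((next(images), role))
            except StopIteration:
                raise ValueError(f"{sample['id']}: more <image> markers than images") from None
    if next(images, None) is not None:
        raise ValueError(f"{sample['id']}: more images than <image> markers")
    return pairs


# One JSONL line equivalent to the sample above.
line = json.dumps({
    "id": "sample_001",
    "images": ["input.jpg", "output.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
        {"from": "gpt", "value": "Here is the edited image:\n<image>"},
    ],
})
sample = json.loads(line)
pairs = map_images(sample)
print(pairs)
```

Checking that the marker count equals the length of the images list catches misaligned samples before training starts.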

For understanding-only data, simply omit <image> from assistant responses:

{
  "id": "vqa_001",
  "images": ["photo.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nWhat is shown in this image?"},
    {"from": "gpt", "value": "A cat sitting on a windowsill."}
  ]
}
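
Because the two kinds of samples differ only in whether an assistant turn contains an <image> marker, a dataset can be split with a small filter. The sketch below is a hypothetical helper, not part of VeOmni; it might be used to separate data before choosing between an understanding-only config (mse_weight: 0.0) and a unified one.

```python
import json


def is_understanding_only(sample: dict) -> bool:
    """True when no assistant ("gpt") turn contains an <image> marker."""
    return not any(
        turn["from"] == "gpt" and "<image>" in turn["value"]
        for turn in sample["conversations"]
    )


def split_jsonl(lines):
    """Split JSONL lines into (understanding_only, unified) sample lists."""
    und, unified = [], []
    for line in lines:
        if line.strip():  # skip blank lines
            sample = json.loads(line)
            (und if is_understanding_only(sample) else unified).append(sample)
    return und, unified


# Two in-memory JSONL lines mirroring the samples in this section.
demo_lines = [
    json.dumps({
        "id": "vqa_001",
        "images": ["photo.jpg"],
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A cat sitting on a windowsill."},
        ],
    }),
    json.dumps({
        "id": "sample_001",
        "images": ["input.jpg", "output.jpg"],
        "conversations": [
            {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
            {"from": "gpt", "value": "Here is the edited image:\n<image>"},
        ],
    }),
]
und_samples, unified_samples = split_jsonl(demo_lines)
print([s["id"] for s in und_samples], [s["id"] for s in unified_samples])
```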

5.4 Debug Mode

Quick sanity-check with 3 training steps on 2 GPUs:

# Bagel debug
DEBUG=1 MAX_STEPS=3 GPUS=2,3 MODEL=bagel bash scripts/mdl/amd/veomni_unified_sft_amd.sh

# U1 debug (generation mode)
DEBUG=1 MAX_STEPS=3 GPUS=2,3 MODEL=u1 STAGE=gen bash scripts/mdl/amd/veomni_unified_sft_amd.sh

See per-model branches: feat/bagel-amd, feat/thinkmorph-amd, feat/blip3o-amd, feat/sensenova-u1-amd, feat/latentum-amd

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

🍪 Overview

VeOmni is a versatile framework for both single- and multi-modal pre-training and post-training. It empowers users to seamlessly scale models of any modality across various accelerators, offering both flexibility and user-friendliness.

Our guiding principles when building VeOmni are:

Flexibility and Modularity: VeOmni is built with a modular design, allowing users to decouple most components and replace them with their own implementations as needed.
Trainer-free: VeOmni supports linear training scripts that avoid rigid, structured trainer classes (e.g., PyTorch-Lightning or HuggingFace Trainer). These training scripts expose the entire training logic to users for maximum transparency and control. In addition, VeOmni provides a basic trainer for text-only and VLM/omni model training, and an RL trainer that serves as the training backend for reinforcement learning.
Omni model native: VeOmni enables users to effortlessly scale any omni-model across devices and accelerators.
Torch native: VeOmni is designed to leverage PyTorch’s native functions to the fullest extent, ensuring maximum compatibility and performance.

🔥 Latest News

[2025/11] Our paper OmniScale: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo was accepted to AAAI 2026.
[2025/09] We released v0.1.0, the first official release of VeOmni.
[2025/08] We released the VeOmni tech report and opened the WeChat group. Feel free to join us!
[2025/04] We released VeOmni!

📚 Key Features

FSDP and FSDP2 backends for training.
Sequence parallelism with DeepSpeed Ulysses, in both non-async and async modes.
Expert parallelism for training large MoE models such as Qwen3-MoE.
Efficient GroupGemm kernel for MoE models, plus Liger-Kernel.
Compatible with HuggingFace Transformers models (Qwen3, Qwen3-VL, Qwen3-MoE, etc.).
Dynamic batching strategy and Omnidata processing.
Torch Distributed Checkpoint for checkpointing.
Support for training on both NVIDIA GPUs and Ascend NPUs.
Experiment tracking with wandb.

📝 Upcoming Features and Changes

VeOmni v0.2 Roadmap ByteDance-Seed#268, ByteDance-Seed#271
Vit balance tool ByteDance-Seed#280
Validation dataset during training ByteDance-Seed#247
RL post training for omni-modality models with VeRL ByteDance-Seed#262

🚀 Getting Started

Documentation

Quick Start

✏️ Supported Models

Model	Model size	Example config File
DeepSeek2.5/3/R1	236B/671B	deepseek.yaml
Llama3-3.3	1B/3B/8B/70B	llama3.yaml
Qwen2-3	0.5B/1.5B/3B/7B/14B/32B/72B	qwen2_5.yaml
Qwen2-3 VL/QVQ	2B/3B/7B/32B/72B	qwen3_vl_dense.yaml
Qwen3-VL MoE	30BA3B/235BA22B	qwen3_vl_moe.yaml
Qwen3-MoE	30BA3B/235BA22B	qwen3-moe.yaml
Qwen2-3 Omni	7B/30BA3B	qwen25_omni.yaml
Wan	Wan2.1-I2V-14B-480P	wan_sft.yaml
Omni Model	Any Modality Training	seed_omni.yaml

To add support for new models in VeOmni, see Support New Models.

⛰️ Performance

For more details, please refer to our paper.

💡 Awesome work using VeOmni

🎨 Contributing

Contributions from the community are welcome! Please check out CONTRIBUTING.md and our project roadmap (to be updated).

📝 Citation and Acknowledgement

If you find VeOmni useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@article{ma2025veomni,
  title={VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo},
  author={Ma, Qianli and Zheng, Yaowei and Shi, Zhelun and Zhao, Zhongkai and Jia, Bin and Huang, Ziyue and Lin, Zhiqi and Li, Youjie and Yang, Jiacheng and Peng, Yanghua and others},
  journal={arXiv preprint arXiv:2508.02317},
  year={2025}
}

Thanks to the following projects for their excellent work:

Star History

🌱 About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society. You can get to know ByteDance Seed better through the following channels👇
