86MaxCao/VeOmni

Fork Note (AMD ROCm Adaptation)

This branch (feat/unified-amd) adapts 6 unified multimodal models (understanding + generation) into the VeOmni framework for ablation experiments. All model implementations are aligned with the official code: inference outputs and training losses are verified to match.

Environment:

  • GPU: AMD Instinct MI308X (192GB HBM3)
  • Platform: ROCm 7.0 + PyTorch 2.10.0+rocm7.0
  • Python: 3.11
  • Flash Attention: flash_attn 2.7.3 (ROCm)

1. Supported Models

Model | model_type | LLM Backbone | Generation Method | Params
Bagel | bagel | Qwen2.5-7B | MoVQGAN (Flow-Matching) | 14.6B
ThinkMorph | thinkmorph | Qwen2.5-7B | MoVQGAN (Flow-Matching + CoT) | 14.6B
BLIP3o | blip3o_qwen | Qwen2.5-7B | Diffusion (DIT + VAE) | 14.1B
SenseNova-U1 | neo_chat | NEO (Qwen3-arch, 42L) | Flow-Matching (MoT) | 17.6B
LatentUM | latentum | InternVL3.5-4B | MoT Discrete Tokens (AR Head) | 8.8B
Janus-Pro | janus | DeepSeek-LLM-7B | VQ-16 Discrete Tokens (AR) | 7.4B

2. Multimodal Understanding Inference Alignment

All 6 models produce outputs identical to those of their official repos when given the same inputs. Tested on the VisPuzzle benchmark (multiple-choice VQA):

Model | Alignment | Test Samples | Official Accuracy
Bagel | 100% | 20/20 exact match | 33.8%
ThinkMorph | 100% | 5/5 exact match | 36.5%
BLIP3o | 100% | 10/10 exact match | 36.8%
SenseNova-U1 | 100% | 10/10 exact match | 68.5%
LatentUM | 100% | 3/3 exact match | 38.5%
Janus-Pro | 100% | 3/3 exact match | 33.2%
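
The "exact match" figures can be reproduced with a simple string comparison between the two sets of decoded outputs. The following is a minimal sketch, not the repository's evaluation script: the function name and the sample answers are hypothetical, and it assumes both pipelines were run on the same inputs in the same order.

```python
def exact_match_rate(official, candidate):
    """Return (matches, total) for two equal-length lists of decoded outputs."""
    if len(official) != len(candidate):
        raise ValueError("output lists must have the same length")
    matches = sum(1 for a, b in zip(official, candidate) if a == b)
    return matches, len(official)


# Hypothetical decoded answers for three multiple-choice samples.
official_outputs = ["B", "A", "D"]
veomni_outputs = ["B", "A", "D"]

matches, total = exact_match_rate(official_outputs, veomni_outputs)
print(f"{matches}/{total} exact match ({100 * matches / total:.0f}%)")
# prints: 3/3 exact match (100%)
```

Note that alignment (VeOmni vs. official output) is a separate quantity from accuracy (output vs. ground truth), which is why a model can be 100% aligned while scoring 33.8% on the benchmark.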

Usage:

# Unified entry for all 6 models
python tasks/infer/infer_unified.py \
    --model_type bagel \
    --mode understand \
    --prompt "Describe this image" \
    --image img.jpg

python tasks/infer/infer_unified.py \
    --model_type u1 \
    --mode understand \
    --prompt "What is shown?" \
    --image img.jpg

3. Text-to-Image Generation Alignment

All 6 models produce valid generated images through VeOmni's unified pipeline, aligned with official implementations under the same prompt and seed.

Prompt: "Add a blue circle in the center of this image"

[Image comparison table: official vs. VeOmni generated images for Bagel, ThinkMorph, BLIP3o, SenseNova-U1, LatentUM, and Janus-Pro.]

Usage:

# Text-to-Image generation
python tasks/infer/infer_unified.py \
    --model_type bagel \
    --mode generate \
    --prompt "A cat sitting on a windowsill" \
    --output output.png

python tasks/infer/infer_unified.py \
    --model_type u1 \
    --mode generate \
    --prompt "A blue circle on white background" \
    --output output.png

4. Image Editing (it2i) Alignment

Three models (Bagel, ThinkMorph, SenseNova-U1) support image editing: given a source image and an editing instruction, the model generates the edited image. VeOmni outputs are aligned with the official implementations under the same input and seed.

Note: LatentUM does not support general image editing. It is a navigation world model (next-frame prediction) and is excluded from this section.

Edit prompt: "Add a red arrow pointing from the center of the image to the top-right corner."

[Image comparison table: input, official output, and VeOmni output for Bagel (1024×1024), ThinkMorph (1024×1024), and SenseNova-U1 (1280×1280).]

Usage:

from draw_to_understand.models.veomni_bagel import VeOmniBagelGenerationBackend
from PIL import Image

backend = VeOmniBagelGenerationBackend(
    model_path="/path/to/BAGEL-7B-MoT",
    device="cuda:0",
)

input_image = Image.open("input.png").convert("RGB")
output_image = backend.draw(input_image, "Add a red arrow to the top-right corner.")
output_image.save("edited.png")

5. SFT Training

5.1 Understanding-Only SFT (CE Loss)

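VeOmni CE losses are compared against the official training code on the same inputs, with the differences listed below. The official losses were obtained by running the official model code from the ablation_experiment/ repos with the same seed and sequence. Janus-Pro has no official training code, so only its VeOmni loss is reported.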

Model | VeOmni CE Loss | Official CE Loss | Diff | Grad Params
Bagel | 14.7230 | 14.7230 | 9.54e-7 | 395
ThinkMorph | 14.7230 | 14.7230 | 9.54e-7 | 395
BLIP3o | 13.3495 | 13.3964 | 0.047 | 311
SenseNova-U1 | 13.0480 | 13.0480 | 0.000 | 549
LatentUM | 12.7729 | 12.7839 | 0.011 | 399
Janus-Pro | 13.6453 | N/A (no official training code) | - | 282
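
For scale, the 9.54e-7 difference reported for Bagel and ThinkMorph equals the spacing between adjacent float32 values at a magnitude of about 14.7, so it is consistent with a single rounding step. The sketch below illustrates this using only the standard library; it assumes the losses were compared in float32, which the table does not state, and the helper name is hypothetical.

```python
import struct


def float32_ulp(x: float) -> float:
    """Gap between float32(x) and the next larger float32 value (for x > 0)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]       # float32 bit pattern
    current = struct.unpack("<f", struct.pack("<I", bits))[0]  # x rounded to float32
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current


print(f"{float32_ulp(14.7230):.3e}")
# prints: 9.537e-07
```

The larger gaps for BLIP3o (0.047) and LatentUM (0.011) are several orders of magnitude above this spacing, so they are not explained by float32 rounding alone.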

For understanding-only training, set mse_weight: 0.0 in the config to disable the generation loss.

# Bagel understanding-only (2 GPU, FSDP2)
GPUS=0,1 MODEL=bagel \
TRAIN_DATA=/path/to/understanding_data.jsonl \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

# U1 understanding-only (2 GPU, FSDP2)
GPUS=0,1 MODEL=u1 \
TRAIN_DATA=/path/to/understanding_data.jsonl \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

5.2 Unified Understanding + Generation SFT (CE + Generation Loss)

Bagel and SenseNova-U1 support joint training of understanding (CE loss) and image generation (MSE/FM loss) in a single forward pass. The training data contains interleaved conversations where:

  • Images in user messages are treated as input (understanding path, ViT encoder)
  • Images in assistant messages are treated as output (generation path, MSE loss)

This enables conditional image generation / image editing tasks alongside standard visual QA.

Model | Generation Method | Loss | Step 1 | Step 2 | Step 3
Bagel | MoVQGAN latent-space flow-matching | CE + MSE | ce: 13.78, mse: 1.46 | ce: 11.88, mse: 0.90 | ce: 11.81, mse: 2.62
SenseNova-U1 | Pixel-space flow-matching (MoT) | CE + FM | ce: 13.66, fm: 1.46 | ce: 11.88, fm: 0.90 | ce: 11.78, fm: 2.62

Bagel / ThinkMorph unified training:

# Bagel: understanding + generation (CE + MSE loss)
# Config: mse_weight=1.0, ce_weight=1.0
GPUS=0,1 MODEL=bagel \
TRAIN_DATA=/path/to/interleaved_data.jsonl \
IMAGE_ROOT=/path/to/images \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

Key config options (configs/multimodal/bagel/sft_amd.yaml):

train:
  mse_weight: 1.0    # generation loss weight (0.0 = understanding-only)
  ce_weight: 1.0     # understanding loss weight
  freeze_vit: true   # freeze ViT encoder
  freeze_vae: true   # freeze VAE encoder/decoder
  freeze_und: false   # keep understanding path trainable
  # To switch to understanding-only:  freeze_gen_modules: true, mse_weight: 0.0
  # To switch to generation-only:     freeze_und: true, ce_weight: 0.0
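
To make the role of the two weights concrete, here is a minimal sketch of how ce_weight and mse_weight could combine the two losses. It assumes a plain weighted sum, which the config comments suggest but do not confirm; the function name is hypothetical and the actual VeOmni training loop may differ.

```python
def combine_losses(ce_loss, gen_loss, ce_weight=1.0, mse_weight=1.0):
    """Weighted sum of understanding (CE) and generation (MSE/FM) losses.

    Assumption: the total loss is a simple weighted sum of the two terms.
    """
    return ce_weight * ce_loss + mse_weight * gen_loss


# Step 1 values for Bagel from the table above.
print(round(combine_losses(13.78, 1.46), 2))                  # prints: 15.24
# Understanding-only: mse_weight=0.0 removes the generation term.
print(combine_losses(13.78, 1.46, mse_weight=0.0))            # prints: 13.78
```

Under this assumption, setting a weight to 0.0 removes that term from the gradient, while the freeze_* flags control which parameters can receive updates at all.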

SenseNova-U1 unified training:

# U1: understanding + generation (CE + FM loss)
# Config: mse_weight=1.0, ce_weight=1.0, flex_attention enabled
GPUS=0,1 MODEL=u1 STAGE=gen \
TRAIN_DATA=/path/to/interleaved_data.jsonl \
IMAGE_ROOT=/path/to/images \
bash scripts/mdl/amd/veomni_unified_sft_amd.sh

Key config options (configs/multimodal/u1/sft_gen_amd.yaml):

train:
  mse_weight: 1.0              # flow-matching loss weight
  ce_weight: 1.0               # CE loss weight
  freeze_vit: true             # freeze understanding ViT
  freeze_llm: true             # freeze LLM backbone
  freeze_gen_modules: false    # keep FM modules trainable
  unfreeze_mot_gen: true       # unfreeze MoT generation branch (*_mot_gen params)
  unfreeze_vit_layers: -4      # unfreeze last 4 ViT encoder layers
  unfreeze_lm_head: true       # unfreeze lm_head
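
The config above freezes the ViT as a whole (freeze_vit: true) and then re-enables part of it (unfreeze_vit_layers: -4). As a minimal sketch of how a negative value could select the last N encoder layers, assuming Python-style negative slicing: the function name and the 24-layer depth are hypothetical, and the actual VeOmni logic may differ.

```python
def last_n_layer_indices(num_layers, unfreeze_vit_layers):
    """Indices of encoder layers selected by a negative 'last N' setting.

    Assumption: only negative values select layers; 0 or positive selects none.
    """
    if unfreeze_vit_layers >= 0:
        return []
    return list(range(num_layers))[unfreeze_vit_layers:]


# Hypothetical 24-layer encoder with unfreeze_vit_layers: -4
print(last_n_layer_indices(24, -4))
# prints: [20, 21, 22, 23]
```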

5.3 Training Data Format

Training data is JSONL with interleaved conversations. Each <image> marker maps to the next entry in the images list.

{
  "id": "sample_001",
  "images": ["input.jpg", "output.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
    {"from": "gpt", "value": "Here is the edited image:\n<image>"}
  ]
}
  • <image> in human turn + images[0] = understanding input (ViT)
  • <image> in gpt turn + images[1] = generation target (VAE/FM, MSE loss)
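
The mapping described above can be sketched as a small parser that walks the conversation in order and pairs each <image> marker with the next entry in images. This is an illustration of the data format only, not the repository's data loader; the function name and the "input"/"target" labels are hypothetical.

```python
import json

IMAGE_TOKEN = "<image>"


def assign_image_roles(sample):
    """Pair each image path with 'input' or 'target' based on the turn it appears in."""
    images = sample["images"]
    roles = []
    next_image = 0  # index of the next unused entry in the images list
    for turn in sample["conversations"]:
        role = "input" if turn["from"] == "human" else "target"
        for _ in range(turn["value"].count(IMAGE_TOKEN)):
            if next_image >= len(images):
                raise ValueError("more <image> markers than images")
            roles.append((images[next_image], role))
            next_image += 1
    return roles


sample = {
    "id": "sample_001",
    "images": ["input.jpg", "output.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nEdit this image: add a blue circle in the center."},
        {"from": "gpt", "value": "Here is the edited image:\n<image>"},
    ],
}

line = json.dumps(sample)  # one JSONL line
print(assign_image_roles(json.loads(line)))
# prints: [('input.jpg', 'input'), ('output.jpg', 'target')]
```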

For understanding-only data, simply omit <image> from assistant responses:

{
  "id": "vqa_001",
  "images": ["photo.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\nWhat is shown in this image?"},
    {"from": "gpt", "value": "A cat sitting on a windowsill."}
  ]
}
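
Before launching a run, it can help to check that each sample's <image> markers agree with its images list and to confirm which samples are understanding-only. A minimal sketch, assuming the one-marker-per-image convention described in this section; the helper names are hypothetical.

```python
IMAGE_TOKEN = "<image>"


def count_markers(sample, speaker=None):
    """Count <image> markers, optionally restricted to one speaker ('human' or 'gpt')."""
    return sum(
        turn["value"].count(IMAGE_TOKEN)
        for turn in sample["conversations"]
        if speaker is None or turn["from"] == speaker
    )


def markers_match_images(sample):
    """True if the number of markers equals the number of listed images."""
    return count_markers(sample) == len(sample["images"])


def is_understanding_only(sample):
    """True if no assistant turn contains an <image> marker."""
    return count_markers(sample, "gpt") == 0


vqa = {
    "id": "vqa_001",
    "images": ["photo.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this image?"},
        {"from": "gpt", "value": "A cat sitting on a windowsill."},
    ],
}

print(markers_match_images(vqa), is_understanding_only(vqa))
# prints: True True
```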

5.4 Debug Mode

Quick sanity-check with 3 training steps on 2 GPUs:

# Bagel debug
DEBUG=1 MAX_STEPS=3 GPUS=2,3 MODEL=bagel bash scripts/mdl/amd/veomni_unified_sft_amd.sh

# U1 debug (generation mode)
DEBUG=1 MAX_STEPS=3 GPUS=2,3 MODEL=u1 STAGE=gen bash scripts/mdl/amd/veomni_unified_sft_amd.sh

See per-model branches: feat/bagel-amd, feat/thinkmorph-amd, feat/blip3o-amd, feat/sensenova-u1-amd, feat/latentum-amd


VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo


🍪 Overview

VeOmni is a versatile framework for both single- and multi-modal pre-training and post-training. It empowers users to seamlessly scale models of any modality across various accelerators, offering both flexibility and user-friendliness.

Our guiding principles when building VeOmni are:

  • Flexibility and Modularity: VeOmni is built with a modular design, allowing users to decouple most components and replace them with their own implementations as needed.

  • Trainer-free: VeOmni supports linear training scripts that avoid rigid, structured trainer classes (e.g., PyTorch-Lightning or HuggingFace Trainer). These training scripts expose the entire training logic to users for maximum transparency and control. VeOmni also provides a basic trainer for text-only and VLM/omni model training, and an RL trainer that serves as a trainer backend for reinforcement learning.

  • Omni model native: VeOmni enables users to effortlessly scale any omni-model across devices and accelerators.

  • Torch native: VeOmni is designed to leverage PyTorch’s native functions to the fullest extent, ensuring maximum compatibility and performance.

🔥 Latest News

📚 Key Features

  • FSDP, FSDP2 backend for training.
  • Sequence Parallelism with DeepSpeed Ulysses, in both non-async and async modes.
  • Expert Parallelism for training large MoE models such as Qwen3-MoE.
  • Efficient GroupGemm kernels for MoE models, plus Liger-Kernel.
  • Compatible with HuggingFace Transformers models (Qwen3, Qwen3-VL, Qwen3-MoE, etc.).
  • Dynamic batching strategy and Omnidata processing.
  • Torch Distributed Checkpoint for checkpointing.
  • Support for both NVIDIA GPU and Ascend NPU training.
  • Experiment tracking with wandb

📝 Upcoming Features and Changes

🚀 Getting Started

Documentation

Quick Start

✏️ Supported Models

Model | Model size | Example config file
DeepSeek2.5/3/R1 | 236B/671B | deepseek.yaml
Llama3-3.3 | 1B/3B/8B/70B | llama3.yaml
Qwen2-3 | 0.5B/1.5B/3B/7B/14B/32B/72B/ | qwen2_5.yaml
Qwen2-3 VL/QVQ | 2B/3B/7B/32B/72B | qwen3_vl_dense.yaml
Qwen3-VL MoE | 30BA3B/235BA22B | qwen3_vl_moe.yaml
Qwen3-MoE | 30BA3B/235BA22B | qwen3-moe.yaml
Qwen2-3 Omni | 7B/30BA3B | qwen25_omni.yaml
Wan | Wan2.1-I2V-14B-480P | wan_sft.yaml
Omni Model | Any Modality Training | seed_omni.yaml

To add support for new models in VeOmni, see Support New Models.

⛰️ Performance

For more details, please refer to our paper.

💡 Awesome work using VeOmni

🎨 Contributing

Contributions from the community are welcome! Please check out CONTRIBUTING.md and our project roadmap (to be updated).

📝 Citation and Acknowledgement

If you find VeOmni useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@article{ma2025veomni,
  title={VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo},
  author={Ma, Qianli and Zheng, Yaowei and Shi, Zhelun and Zhao, Zhongkai and Jia, Bin and Huang, Ziyue and Lin, Zhiqi and Li, Youjie and Yang, Jiacheng and Peng, Yanghua and others},
  journal={arXiv preprint arXiv:2508.02317},
  year={2025}
}

Thanks to the following projects for their excellent work:


🌱 About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society. You can get to know ByteDance Seed better through the following channels 👇
