model: qwen2.5 omni (thinker only) #4969
Conversation
continue reviewing...
@mickqian Hugging Face has released a new version. We can rebase this.
@mickqian is it ready to review and merge?
Please resolve and rebase.
MMMU accuracy: 0.503
I ran
MMMU acc with HF: 0.399

Modified bench_hf.py

Thanks to @mickqian, the test is now running successfully after the modification below:

import argparse

import PIL.Image
import torch
from data_utils import save_json
from eval_utils import (
    EvalArgs,
    eval_result,
    get_sampling_params,
    prepare_samples,
    process_result,
)
from tqdm import tqdm
from transformers import (
    AutoModel,
    AutoProcessor,
    GenerationConfig,
    Qwen2_5OmniForConditionalGeneration,
    Qwen2_5OmniProcessor,
)


@torch.no_grad()
def eval_mmmu(args):
    eval_args = EvalArgs.from_cli_args(args)
    try:
        from transformers import AutoModelForImageTextToText

        model = AutoModelForImageTextToText.from_pretrained(
            args.model_path,
            torch_dtype="auto",
            trust_remote_code=True,
        )
    except Exception as first_exception:
        try:
            # model = AutoModel.from_pretrained(
            #     args.model_path,
            #     torch_dtype="auto",
            #     trust_remote_code=True,
            #     init_tts=False,
            # )
            model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
                args.model_path,
                torch_dtype="auto",
                trust_remote_code=True,
                enable_audio_output=False,  # add this argument to disable tts
            )
        except Exception as second_exception:
            raise RuntimeError(
                f"Failed to load model: first attempt failed with {first_exception}, "
                f"second attempt failed with {second_exception}"
            ) from second_exception

    model = model.eval().cuda()
    processor = Qwen2_5OmniProcessor.from_pretrained(
        args.model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
    )

    samples = prepare_samples(eval_args)
    out_samples = dict()
    sampling_params = get_sampling_params(eval_args)
    generation_config = GenerationConfig(
        max_new_tokens=sampling_params["max_new_tokens"],
        do_sample=False,
    )

    answer_dict = {}
    for sample in tqdm(samples):
        prompt = sample["final_input_prompt"]
        image = sample["image"]
        prefix = prompt.split("<")[0]
        suffix = prompt.split(">")[1]
        assert image is not None

        contents = []
        if prefix:
            contents += [{"type": "text", "text": prefix}]
        contents += [
            {
                "type": "image",
                "image": sample["image_path"],
            }
        ]
        if suffix:
            contents += [{"type": "text", "text": suffix}]
        messages = [{"role": "user", "content": contents}]

        try:
            model_inputs = processor.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                return_dict=True,
                add_generation_prompt=True,
                return_tensors="pt",
            ).to(model.device)
            input_len = model_inputs["input_ids"].shape[-1]
            generation = model.generate(
                **model_inputs, generation_config=generation_config
            )
            generation = generation[0][input_len:]
            response = processor.decode(generation, skip_special_tokens=True)
        except Exception:
            # Fallback path for models that expose a chat() interface instead.
            contents = []
            if prefix:
                contents += [prefix]
            image = PIL.Image.open(sample["image_path"])
            contents += [image]
            if suffix:
                contents += [suffix]
            messages = [{"role": "user", "content": contents}]
            response = model.chat(
                msgs=messages,
                tokenizer=processor.tokenizer,
                sampling=False,
                max_new_tokens=sampling_params["max_new_tokens"],
                use_tts_template=False,
                generate_audio=False,
                temperature=0.0,
            )
        print(f"response: {response}")
        process_result(response, sample, answer_dict, out_samples)

    args.output_path = f"{args.model_path}_val_hf.json"
    save_json(args.output_path, out_samples)
    eval_result(model_answer_path=args.output_path, answer_dict=answer_dict)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-path",
        type=str,
        help="The path of the model weights. This can be a local folder or a Hugging Face repo ID.",
        required=True,
    )
    EvalArgs.add_cli_args(parser)
    args = parser.parse_args()
    eval_mmmu(args)
/gemini review
Warning: Gemini is unable to generate a review due to a potential policy violation.
@mickqian Any plan to merge into the main branch?
/gemini review
Code Review
This pull request adds support for the qwen2.5 omni model. The changes include adding the model definition, updating conversation templates, and modifying multimodal processing logic. While the core implementation seems mostly correct, there are several critical issues that need to be addressed. These include a typo in a special token, usage of an undefined attribute, and hardcoded CUDA device specifications which will break execution on other hardware. Additionally, there are some opportunities for code cleanup and performance improvements.
self.IMAGE_TOKEN_REGEX = re.compile(
    r"<\|vision_bos\|>(?:<\|IMAGE\|>)+<\|vision_eos\|>"
)
self.image_token = "<|vision_bos|><|IMAGE|><|vision_eo|>"
There is a typo in the image_token. It's set to "<|vision_bos|><|IMAGE|><|vision_eo|>", but it should likely be "<|vision_bos|><|IMAGE|><|vision_eos|>" to match the conversation template. This will cause issues with tokenization and prompt formatting.
- self.image_token = "<|vision_bos|><|IMAGE|><|vision_eo|>"
+ self.image_token = "<|vision_bos|><|IMAGE|><|vision_eos|>"
"im_end_id": self.IM_END_TOKEN_ID,
"im_start_id": self.image_start_id,
"im_end_id": self.image_end_id,
"im_token_id": self.IM_TOKEN_ID,
The returned dictionary uses self.IM_TOKEN_ID, but this attribute is not set for the Qwen2_5OmniModel architecture in the __init__ method. This will raise an AttributeError. You should use self.image_token_id instead, which is correctly initialized.
- "im_token_id": self.IM_TOKEN_ID,
+ "im_token_id": self.image_token_id,
image_grid_thw is not None or video_grid_thw is not None
):
total_input_ids = input_ids
attention_mask = torch.ones_like(total_input_ids).to("cuda")
The device is hardcoded to 'cuda' here and in several other places in this function (e.g., lines 955, 1016, 1030). This will cause issues when running on other devices like CPU or MPS. The device should be inferred from the input tensors, for example, by using input_ids.device.
- attention_mask = torch.ones_like(total_input_ids).to("cuda")
+ attention_mask = torch.ones_like(total_input_ids).to(input_ids.device)
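To illustrate the suggested pattern, here is a minimal, self-contained sketch (not the actual sglang code) of deriving the device from an input tensor instead of hardcoding it:

import torch

def build_attention_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # Infer the device from the input so the same code runs on CPU, CUDA, or MPS.
    return torch.ones_like(input_ids, device=input_ids.device)

mask = build_attention_mask(torch.tensor([[1, 2, 3]]))
print(mask.device)  # prints "cpu" when the input lives on the CPU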
.flatten()
.long()
)
_llm_pos_ids = torch.stack([t_index, h_index, w_index]).to("cuda")
The device is hardcoded to 'cuda'. This should be changed to use the device from an input tensor (e.g., grid_hs.device) to support execution on different hardware.
- _llm_pos_ids = torch.stack([t_index, h_index, w_index]).to("cuda")
+ _llm_pos_ids = torch.stack([t_index, h_index, w_index]).to(grid_hs.device)
futures.append(
    self.io_executor.submit(
        BaseMultimodalProcessor._load_single_item,
        data,
        is_video,
        False,
        frame_count_limit,
        None,
        discard_alpha_channel,
    )
)
task_info.append((Modality.IMAGE, data, frame_count_limit))
image_index += 1
task_info.append((Modality.IMAGE, data, None))
The logic for frame_count_limit based on estimated_frames_list and scaling_factor has been removed, and _load_single_item is now called with None for the frame limit. This seems to disable frame limiting for videos in the base processor, which could be a significant breaking change affecting all models. If this is intentional, it should be documented. Otherwise, it might be a bug.
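To make the concern concrete, here is a standalone sketch of the kind of frame-budget logic being referred to; the function name, the 30-frame budget, and the scaling rule are illustrative assumptions, not the actual sglang implementation:

from typing import List, Optional

def compute_frame_limits(
    estimated_frames_list: List[int],
    total_frame_budget: int = 30,
) -> List[Optional[int]]:
    """Scale per-video frame counts so their sum stays under a global budget."""
    total_estimated = sum(estimated_frames_list)
    if total_estimated <= total_frame_budget:
        # Every video fits as-is; no limiting needed.
        return [None] * len(estimated_frames_list)
    scaling_factor = total_frame_budget / total_estimated
    # Keep at least one frame per video after scaling.
    return [max(1, int(n * scaling_factor)) for n in estimated_frames_list]

# Three videos estimated at 10, 40, and 50 frames with a budget of 30
# are limited to roughly 3, 12, and 15 frames respectively.
print(compute_frame_limits([10, 40, 50]))  # [3, 12, 15]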
# TODO: remove this
multimodal_tokens.compile_regex()
There's a TODO to remove the explicit call to compile_regex(). This suggests the current implementation is temporary. It would be cleaner to handle regex compilation within the MultimodalSpecialTokens class, for example in its __post_init__ or another initialization method, to avoid these explicit calls in the processor.
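A minimal sketch of the pattern being suggested, with regex compilation moved into the dataclass itself; the fields shown here are illustrative and do not match the real MultimodalSpecialTokens definition:

import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MultimodalSpecialTokens:
    image_token: Optional[str] = None
    image_token_regex: Optional[re.Pattern] = field(default=None, init=False)

    def __post_init__(self):
        # Compile once at construction time so callers never need an
        # explicit compile_regex() step.
        if self.image_token is not None:
            self.image_token_regex = re.compile(re.escape(self.image_token))

tokens = MultimodalSpecialTokens(image_token="<|vision_bos|><|IMAGE|><|vision_eos|>")
assert tokens.image_token_regex.search("a<|vision_bos|><|IMAGE|><|vision_eos|>b")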
ret = self.process_mm_data(
    input_text=base_output.input_text,
    images=None if images_are_preprocessed else base_output.images,
    audio=base_output.audios,
)

input_ids = ret["input_ids"].flatten()
image_offsets = self.get_mm_items_offset(
    input_ids=input_ids, mm_token_id=self.image_token_id
)

image_grid_thw = None
video_grid_thw = None  # TODO

combined_mm_item, input_ids = self.process_and_combine_mm_data(base_output)
The method self.process_mm_data is called, and its return value ret is used later. However, self.process_and_combine_mm_data is also called, which internally calls self.process_mm_data again. This results in redundant processing and is inefficient. The logic should be refactored to avoid the duplicate call.
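As a toy illustration of the refactor being asked for (these functions are simplified stand-ins, not the real processor API), the combining step can accept an already-computed result so the expensive call runs only once:

def process_mm_data(input_text, images=None, audio=None):
    # Stand-in for the expensive tokenization / feature-extraction call.
    print("expensive processing ran")
    return {"input_ids": [101, 7, 8, 9, 102]}

def process_and_combine_mm_data(base_output, precomputed=None):
    # Reuse the precomputed result when given, instead of recomputing it.
    ret = precomputed if precomputed is not None else process_mm_data(**base_output)
    combined_mm_item = {"modality": "image", "num_tokens": len(ret["input_ids"])}
    return combined_mm_item, ret["input_ids"]

base_output = {"input_text": "hi <|IMAGE|>", "images": None, "audio": None}
ret = process_mm_data(**base_output)  # runs once
combined_mm_item, input_ids = process_and_combine_mm_data(base_output, precomputed=ret)
# "expensive processing ran" is printed exactly once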
# Skip loading extra bias for GPTQ models.
if name.endswith(".bias") and name not in params_dict:
    print(f"skipping {name}")
continue
param = params_dict[name]
except KeyError:
    print(params_dict.keys())
class TestOpenAIOmniServer(TestOpenAIVisionServer):
    @classmethod
    def setUpClass(cls):
        cls.model = "openbmb/MiniCPM-o-2_6"
This test file is for the new omni model support, but it's using openbmb/MiniCPM-o-2_6 instead of a qwen2.5 omni model. To ensure the changes for qwen2.5 omni are correctly tested, a model from that family should be used.
- cls.model = "openbmb/MiniCPM-o-2_6"
+ cls.model = "Qwen/Qwen2.5-0.5B-Omni-Instruct"
We can directly move to #10911.


Motivation
Modifications
Checklist