
Conversation

@mickqian (Collaborator) commented Apr 1, 2025

Motivation

Modifications

Checklist

@yizhang2077 (Collaborator) left a comment

continue reviewing...

@mickqian mickqian marked this pull request as draft April 1, 2025 08:19
@mickqian mickqian marked this pull request as ready for review April 1, 2025 13:22
@zhaochenyang20 (Collaborator)

@mickqian Hugging Face has released a new version. We can rebase this.

@zhaochenyang20 (Collaborator)

@mickqian is it ready to review and merge?

@zhaochenyang20 (Collaborator)

please resolve and rebase

@mickqian (Collaborator, Author)

MMMU accuracy: 0.503

@Othame (Contributor) commented Apr 25, 2025

I ran bench_hf.py to test MMMU with transformers and encountered the following issues:

  • AutoModel Import Issue with Qwen2.5-Omni-7B:
    The model couldn't be imported correctly using AutoModel.
    Thanks to @GeLee-Q and @liwenju0, this issue can be resolved by modifying bench_hf.py as follows:
    Change this part:

    model = AutoModel.from_pretrained(

    To:

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        args.model_path,
        torch_dtype="auto",
        trust_remote_code=True,
    )

    Alternatively, you can modify the transformers repo by adding the corresponding configurations here (see the sketch after this list for a way to get the same effect without editing the source tree).

  • CUDA Device-Side Assert Error:
    After applying the changes mentioned above, I still encountered a "CUDA device-side assert failed" error when running bench_hf.py. This issue seems to occur randomly, and it doesn't appear to be caused by CUDA OOM.
    Logs for reference: hf_bench.log
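
On the first point, here is a minimal sketch of the registration alternative, assuming the installed transformers release already ships the Qwen2_5OmniConfig and Qwen2_5OmniForConditionalGeneration classes; the exist_ok flag and class names are assumptions on my side, not something verified in this thread:

from transformers import (
    AutoModel,
    Qwen2_5OmniConfig,
    Qwen2_5OmniForConditionalGeneration,
)

# Map the Omni config onto its generation class so that a plain
# AutoModel.from_pretrained(...) resolves Qwen2.5-Omni checkpoints
# without editing the transformers source.
AutoModel.register(
    Qwen2_5OmniConfig,
    Qwen2_5OmniForConditionalGeneration,
    exist_ok=True,
)

model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    trust_remote_code=True,
)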

@mickqian mickqian force-pushed the qwen2.5-omni branch 2 times, most recently from 4f25037 to 1779fef on April 26, 2025 02:51
@Othame (Contributor) commented Apr 26, 2025

MMMU accuracy with HF: 0.399
{
    "Overall-Art and Design": {
        "num": 120,
        "acc": 0.492
    },
    "Art": {
        "num": 30,
        "acc": 0.433
    },
    "Art_Theory": {
        "num": 30,
        "acc": 0.467
    },
    "Design": {
        "num": 30,
        "acc": 0.5
    },
    "Music": {
        "num": 30,
        "acc": 0.567
    },
    "Overall-Business": {
        "num": 150,
        "acc": 0.36
    },
    "Accounting": {
        "num": 30,
        "acc": 0.3
    },
    "Economics": {
        "num": 30,
        "acc": 0.367
    },
    "Finance": {
        "num": 30,
        "acc": 0.3
    },
    "Manage": {
        "num": 30,
        "acc": 0.467
    },
    "Marketing": {
        "num": 30,
        "acc": 0.367
    },
    "Overall-Science": {
        "num": 150,
        "acc": 0.34
    },
    "Biology": {
        "num": 30,
        "acc": 0.233
    },
    "Chemistry": {
        "num": 30,
        "acc": 0.233
    },
    "Geography": {
        "num": 30,
        "acc": 0.5
    },
    "Math": {
        "num": 30,
        "acc": 0.433
    },
    "Physics": {
        "num": 30,
        "acc": 0.3
    },
    "Overall-Health and Medicine": {
        "num": 150,
        "acc": 0.36
    },
    "Basic_Medical_Science": {
        "num": 30,
        "acc": 0.5
    },
    "Clinical_Medicine": {
        "num": 30,
        "acc": 0.4
    },
    "Diagnostics_and_Laboratory_Medicine": {
        "num": 30,
        "acc": 0.3
    },
    "Pharmacy": {
        "num": 30,
        "acc": 0.3
    },
    "Public_Health": {
        "num": 30,
        "acc": 0.3
    },
    "Overall-Humanities and Social Science": {
        "num": 119,
        "acc": 0.555
    },
    "History": {
        "num": 30,
        "acc": 0.467
    },
    "Literature": {
        "num": 29,
        "acc": 0.724
    },
    "Sociology": {
        "num": 30,
        "acc": 0.6
    },
    "Psychology": {
        "num": 30,
        "acc": 0.433
    },
    "Overall-Tech and Engineering": {
        "num": 196,
        "acc": 0.352
    },
    "Agriculture": {
        "num": 16,
        "acc": 0.5
    },
    "Architecture_and_Engineering": {
        "num": 30,
        "acc": 0.4
    },
    "Computer_Science": {
        "num": 30,
        "acc": 0.4
    },
    "Electronics": {
        "num": 30,
        "acc": 0.3
    },
    "Energy_and_Power": {
        "num": 30,
        "acc": 0.267
    },
    "Materials": {
        "num": 30,
        "acc": 0.267
    },
    "Mechanical_Engineering": {
        "num": 30,
        "acc": 0.4
    },
    "Overall": {
        "num": 885,
        "acc": 0.399
    }
}
Modified bench_hf.py

Thanks to @mickqian, the test now runs successfully after adding enable_audio_output=False to disable TTS.

import argparse

import PIL
import torch
from data_utils import save_json
from eval_utils import (
    EvalArgs,
    eval_result,
    get_sampling_params,
    prepare_samples,
    process_result,
)
from tqdm import tqdm
from transformers import (
    AutoModel,
    AutoProcessor,
    GenerationConfig,
    Qwen2_5OmniForConditionalGeneration,
    Qwen2_5OmniProcessor,
)


@torch.no_grad()
def eval_mmmu(args):
    eval_args = EvalArgs.from_cli_args(args)
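    # Prefer the generic image-text-to-text auto class; fall back to the
    # Omni-specific class (with audio output disabled) if that fails.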
    try:
        from transformers import AutoModelForImageTextToText

        model = AutoModelForImageTextToText.from_pretrained(
            args.model_path,
            torch_dtype="auto",
            trust_remote_code=True,
        )
    except Exception as first_exception:
        try:
            # model = AutoModel.from_pretrained(
            #     args.model_path,
            #     torch_dtype="auto",
            #     trust_remote_code=True,
            #     init_tts=False,
            # )
            model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
                args.model_path,
                torch_dtype="auto",
                trust_remote_code=True,
                enable_audio_output=False,  # add this argument to disable TTS
            )
        except Exception as second_exception:
            raise RuntimeError(
                f"Failed to load model: First attempt failed with {first_exception}, "
                f"second attempt failed with {second_exception}"
            ) from second_exception

    model = model.eval().cuda()

    processor = Qwen2_5OmniProcessor.from_pretrained(
        args.model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
    )

    samples = prepare_samples(eval_args)
    out_samples = dict()

    sampling_params = get_sampling_params(eval_args)
    generation_config = GenerationConfig(
        max_new_tokens=sampling_params["max_new_tokens"],
        do_sample=False,
    )

    answer_dict = {}
    for sample in tqdm(samples):
        prompt = sample["final_input_prompt"]
        image = sample["image"]
        prefix = prompt.split("<")[0]
        suffix = prompt.split(">")[1]
        assert image is not None
        contents = []
        if prefix:
            contents += [{"type": "text", "text": prefix}]
        contents += [
            {
                "type": "image",
                "image": sample["image_path"],
            }
        ]
        if suffix:
            contents += [{"type": "text", "text": suffix}]
        messages = [{"role": "user", "content": contents}]
        try:
            model_inputs = processor.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                return_dict=True,
                add_generation_prompt=True,
                return_tensors="pt",
            ).to(model.device)
            input_len = model_inputs["input_ids"].shape[-1]
            generation = model.generate(
                **model_inputs, generation_config=generation_config
            )
            generation = generation[0][input_len:]
            response = processor.decode(generation, skip_special_tokens=True)
        except Exception:
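            # Fall back to the model's own chat() interface if the
            # chat-template path above fails for this checkpoint.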
            contents = []
            if prefix:
                contents += [prefix]
            image = PIL.Image.open(sample["image_path"])
            contents += [image]
            if suffix:
                contents += [suffix]
            messages = [{"role": "user", "content": contents}]
            response = model.chat(
                msgs=messages,
                tokenizer=processor.tokenizer,
                sampling=False,
                max_new_tokens=sampling_params["max_new_tokens"],
                use_tts_template=False,
                generate_audio=False,
                temperature=0.0,
            )
        print(f"response: {response}")
        process_result(response, sample, answer_dict, out_samples)

    args.output_path = f"{args.model_path}_val_hf.json"
    save_json(args.output_path, out_samples)
    eval_result(model_answer_path=args.output_path, answer_dict=answer_dict)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-path",
        type=str,
        help="The path of the model weights. This can be a local folder or a Hugging Face repo ID.",
        required=True,
    )
    EvalArgs.add_cli_args(parser)
    args = parser.parse_args()

    eval_mmmu(args)

@mickqian mickqian force-pushed the qwen2.5-omni branch 2 times, most recently from 7818f67 to d3e2af2 on April 26, 2025 13:13
@mickqian mickqian force-pushed the qwen2.5-omni branch 4 times, most recently from 40391f7 to 64f5896 on May 24, 2025 17:40
@zhyncs zhyncs mentioned this pull request May 24, 2025
@mickqian mickqian force-pushed the qwen2.5-omni branch 3 times, most recently from 82cb4cf to d846622 on May 27, 2025 03:39
@JustinTong0323 (Collaborator)

/gemini review

@gemini-code-assist (Contributor)

Warning

Gemini is unable to generate a review due to a potential policy violation.

@b8zhong b8zhong mentioned this pull request Jun 21, 2025
@UnlceYang

@mickqian Any plan to merge into the main branch?

@vincentzed (Contributor)

/gemini review
Still want this.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for the Qwen2.5-Omni model. The changes include adding the model definition, updating conversation templates, and modifying multimodal processing logic. While the core implementation seems mostly correct, there are several critical issues that need to be addressed. These include a typo in a special token, use of an undefined attribute, and hardcoded CUDA device specifications that will break execution on other hardware. Additionally, there are some opportunities for code cleanup and performance improvements.

self.IMAGE_TOKEN_REGEX = re.compile(
    r"<\|vision_bos\|>(?:<\|IMAGE\|>)+<\|vision_eos\|>"
)
self.image_token = "<|vision_bos|><|IMAGE|><|vision_eo|>"

critical

There is a typo in the image_token. It's set to "<|vision_bos|><|IMAGE|><|vision_eo|>", but it should likely be "<|vision_bos|><|IMAGE|><|vision_eos|>" to match the conversation template. This will cause issues with tokenization and prompt formatting.

Suggested change
self.image_token = "<|vision_bos|><|IMAGE|><|vision_eo|>"
self.image_token = "<|vision_bos|><|IMAGE|><|vision_eos|>"

"im_end_id": self.IM_END_TOKEN_ID,
"im_start_id": self.image_start_id,
"im_end_id": self.image_end_id,
"im_token_id": self.IM_TOKEN_ID,

critical

The returned dictionary uses self.IM_TOKEN_ID, but this attribute is not set for the Qwen2_5OmniModel architecture in the __init__ method. This will raise an AttributeError. You should use self.image_token_id instead, which is correctly initialized.

Suggested change
"im_token_id": self.IM_TOKEN_ID,
"im_token_id": self.image_token_id,

image_grid_thw is not None or video_grid_thw is not None
):
total_input_ids = input_ids
attention_mask = torch.ones_like(total_input_ids).to("cuda")

high

The device is hardcoded to 'cuda' here and in several other places in this function (e.g., lines 955, 1016, 1030). This will cause issues when running on other devices like CPU or MPS. The device should be inferred from the input tensors, for example, by using input_ids.device.

Suggested change
attention_mask = torch.ones_like(total_input_ids).to("cuda")
attention_mask = torch.ones_like(total_input_ids).to(input_ids.device)

.flatten()
.long()
)
_llm_pos_ids = torch.stack([t_index, h_index, w_index]).to("cuda")

high

The device is hardcoded to 'cuda'. This should be changed to use the device from an input tensor (e.g., grid_hs.device) to support execution on different hardware.

Suggested change
_llm_pos_ids = torch.stack([t_index, h_index, w_index]).to("cuda")
_llm_pos_ids = torch.stack([t_index, h_index, w_index]).to(grid_hs.device)

Comment on lines 246 to +256
futures.append(
self.io_executor.submit(
BaseMultimodalProcessor._load_single_item,
data,
is_video,
False,
frame_count_limit,
None,
discard_alpha_channel,
)
)
task_info.append((Modality.IMAGE, data, frame_count_limit))
image_index += 1
task_info.append((Modality.IMAGE, data, None))

high

The logic for frame_count_limit based on estimated_frames_list and scaling_factor has been removed, and _load_single_item is now called with None for the frame limit. This seems to disable frame limiting for videos in the base processor, which could be a significant breaking change affecting all models. If this is intentional, it should be documented. Otherwise, it might be a bug.

Comment on lines +302 to +303
# TODO: remove this
multimodal_tokens.compile_regex()

medium

There's a TODO to remove the explicit call to compile_regex(). This suggests the current implementation is temporary. It would be cleaner to handle regex compilation within the MultimodalSpecialTokens class, for example in its __post_init__ or another initialization method, to avoid these explicit calls in the processor.
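
A minimal sketch of that suggestion, assuming MultimodalSpecialTokens is (or can be made) a dataclass; the field names and defaults below are illustrative rather than the actual SGLang definition:

import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MultimodalSpecialTokens:
    image_token: Optional[str] = None
    image_token_regex: Optional[re.Pattern] = field(default=None, init=False)

    def __post_init__(self):
        # Compile once at construction time so processors never need an
        # explicit compile_regex() call.
        if self.image_token is not None:
            self.image_token_regex = re.compile(re.escape(self.image_token))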

Comment on lines +156 to 170
ret = self.process_mm_data(
input_text=base_output.input_text,
images=None if images_are_preprocessed else base_output.images,
audio=base_output.audios,
)

input_ids = ret["input_ids"].flatten()
image_offsets = self.get_mm_items_offset(
input_ids=input_ids, mm_token_id=self.image_token_id
)

image_grid_thw = None
video_grid_thw = None # TODO

combined_mm_item, input_ids = self.process_and_combine_mm_data(base_output)

medium

The method self.process_mm_data is called, and its return value ret is used later. However, self.process_and_combine_mm_data is also called, which internally calls self.process_mm_data again. This results in redundant processing and is inefficient. The logic should be refactored to avoid the duplicate call.
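
A minimal sketch of the refactor this comment asks for, reusing the names from the snippet above; whether process_and_combine_mm_data exposes everything that was previously read from ret is an assumption:

# Process the multimodal inputs exactly once and reuse the result.
combined_mm_item, input_ids = self.process_and_combine_mm_data(base_output)

# Derive the image offsets from the already-computed input_ids instead of
# calling self.process_mm_data a second time.
image_offsets = self.get_mm_items_offset(
    input_ids=input_ids, mm_token_id=self.image_token_id
)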


# Skip loading extra bias for GPTQ models.
if name.endswith(".bias") and name not in params_dict:
    print(f"skipping {name}")

medium

This print() statement appears to be for debugging purposes. It should be removed or replaced with a proper logger call (e.g., logger.debug(...)) before merging.
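
For example, a hedged sketch of the logger-based variant; the helper function and its signature are hypothetical, only the skip condition comes from the snippet above:

import logging

logger = logging.getLogger(__name__)

def _should_skip_extra_bias(name: str, params_dict: dict) -> bool:
    """Return True when a GPTQ extra-bias weight should be skipped."""
    if name.endswith(".bias") and name not in params_dict:
        logger.debug("Skipping extra bias weight %s", name)
        return True
    return False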

continue
param = params_dict[name]
except KeyError:
print(params_dict.keys())

medium

This print() statement appears to be for debugging and should be removed from the final code.

class TestOpenAIOmniServer(TestOpenAIVisionServer):
    @classmethod
    def setUpClass(cls):
        cls.model = "openbmb/MiniCPM-o-2_6"

medium

This test file is for the new omni model support, but it's using openbmb/MiniCPM-o-2_6 instead of a qwen2.5 omni model. To ensure the changes for qwen2.5 omni are correctly tested, a model from that family should be used.

Suggested change
cls.model = "openbmb/MiniCPM-o-2_6"
cls.model = "Qwen/Qwen2.5-0.5B-Omni-Instruct"

@mickqian mickqian marked this pull request as draft September 22, 2025 07:54
@mickqian (Collaborator, Author)

We can directly move to #10911

@mickqian mickqian closed this Sep 26, 2025
