model: qwen2.5 omni (thinker only) #4969
Conversation
continue reviewing...
@mickqian Hugging Face has released a new version. We can rebase this.
@mickqian is it ready to review and merge?
Please resolve and rebase.
MMMU accuracy: 0.503
I ran
MMMU acc with HF: 0.399

Modified bench_hf.py

Thanks to @mickqian, the test is now running successfully after the modification below:

import argparse

import PIL.Image
import torch
from data_utils import save_json
from eval_utils import (
    EvalArgs,
    eval_result,
    get_sampling_params,
    prepare_samples,
    process_result,
)
from tqdm import tqdm
from transformers import (
    AutoModel,
    AutoProcessor,
    GenerationConfig,
    Qwen2_5OmniForConditionalGeneration,
    Qwen2_5OmniProcessor,
)


@torch.no_grad()
def eval_mmmu(args):
    eval_args = EvalArgs.from_cli_args(args)
    try:
        from transformers import AutoModelForImageTextToText

        model = AutoModelForImageTextToText.from_pretrained(
            args.model_path,
            torch_dtype="auto",
            trust_remote_code=True,
        )
    except Exception as first_exception:
        try:
            # model = AutoModel.from_pretrained(
            #     args.model_path,
            #     torch_dtype="auto",
            #     trust_remote_code=True,
            #     init_tts=False,
            # )
            model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
                args.model_path,
                torch_dtype="auto",
                trust_remote_code=True,
                enable_audio_output=False,  # add this argument to disable tts
            )
        except Exception as second_exception:
            raise RuntimeError(
                f"Failed to load model: first attempt failed with {first_exception}, "
                f"second attempt failed with {second_exception}"
            ) from second_exception

    model = model.eval().cuda()
    processor = Qwen2_5OmniProcessor.from_pretrained(
        args.model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
    )

    samples = prepare_samples(eval_args)
    out_samples = dict()
    sampling_params = get_sampling_params(eval_args)
    generation_config = GenerationConfig(
        max_new_tokens=sampling_params["max_new_tokens"],
        do_sample=False,
    )

    answer_dict = {}
    for sample in tqdm(samples):
        prompt = sample["final_input_prompt"]
        image = sample["image"]
        prefix = prompt.split("<")[0]
        suffix = prompt.split(">")[1]
        assert image is not None

        contents = []
        if prefix:
            contents += [{"type": "text", "text": prefix}]
        contents += [
            {
                "type": "image",
                "image": sample["image_path"],
            }
        ]
        if suffix:
            contents += [{"type": "text", "text": suffix}]
        messages = [{"role": "user", "content": contents}]

        try:
            model_inputs = processor.tokenizer.apply_chat_template(
                messages,
                tokenize=True,
                return_dict=True,
                add_generation_prompt=True,
                return_tensors="pt",
            ).to(model.device)
            input_len = model_inputs["input_ids"].shape[-1]
            generation = model.generate(
                **model_inputs, generation_config=generation_config
            )
            generation = generation[0][input_len:]
            response = processor.decode(generation, skip_special_tokens=True)
        except Exception:
            # Fallback path for models that expose a chat() interface instead.
            contents = []
            if prefix:
                contents += [prefix]
            image = PIL.Image.open(sample["image_path"])
            contents += [image]
            if suffix:
                contents += [suffix]
            messages = [{"role": "user", "content": contents}]
            response = model.chat(
                msgs=messages,
                tokenizer=processor.tokenizer,
                sampling=False,
                max_new_tokens=sampling_params["max_new_tokens"],
                use_tts_template=False,
                generate_audio=False,
                temperature=0.0,
            )
        print(f"response: {response}")
        process_result(response, sample, answer_dict, out_samples)

    args.output_path = f"{args.model_path}_val_hf.json"
    save_json(args.output_path, out_samples)
    eval_result(model_answer_path=args.output_path, answer_dict=answer_dict)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-path",
        type=str,
        help="The path of the model weights. This can be a local folder or a Hugging Face repo ID.",
        required=True,
    )
    EvalArgs.add_cli_args(parser)
    args = parser.parse_args()
    eval_mmmu(args)
/gemini review
Warning: Gemini is unable to generate a review due to a potential policy violation.
@mickqian Any plan to merge into the main branch?
/gemini review
Code Review
This pull request adds support for the qwen2.5 omni model. The changes include adding the model definition, updating conversation templates, and modifying multimodal processing logic. While the core implementation seems mostly correct, there are several critical issues that need to be addressed. These include a typo in a special token, usage of an undefined attribute, and hardcoded CUDA device specifications which will break execution on other hardware. Additionally, there are some opportunities for code cleanup and performance improvements.
self.IMAGE_TOKEN_REGEX = re.compile(
    r"<\|vision_bos\|>(?:<\|IMAGE\|>)+<\|vision_eos\|>"
)
self.image_token = "<|vision_bos|><|IMAGE|><|vision_eo|>"
There is a typo in the image_token. It's set to "<|vision_bos|><|IMAGE|><|vision_eo|>", but it should likely be "<|vision_bos|><|IMAGE|><|vision_eos|>" to match the conversation template. This will cause issues with tokenization and prompt formatting.
- self.image_token = "<|vision_bos|><|IMAGE|><|vision_eo|>"
+ self.image_token = "<|vision_bos|><|IMAGE|><|vision_eos|>"
"im_end_id": self.IM_END_TOKEN_ID,
"im_start_id": self.image_start_id,
"im_end_id": self.image_end_id,
"im_token_id": self.IM_TOKEN_ID,
The returned dictionary uses self.IM_TOKEN_ID, but this attribute is not set for the Qwen2_5OmniModel architecture in the __init__ method. This will raise an AttributeError. You should use self.image_token_id instead, which is correctly initialized.
- "im_token_id": self.IM_TOKEN_ID,
+ "im_token_id": self.image_token_id,
image_grid_thw is not None or video_grid_thw is not None
):
total_input_ids = input_ids
attention_mask = torch.ones_like(total_input_ids).to("cuda")
The device is hardcoded to 'cuda' here and in several other places in this function (e.g., lines 955, 1016, 1030). This will cause issues when running on other devices like CPU or MPS. The device should be inferred from the input tensors, for example, by using input_ids.device.
- attention_mask = torch.ones_like(total_input_ids).to("cuda")
+ attention_mask = torch.ones_like(total_input_ids).to(input_ids.device)
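To illustrate the suggested pattern, here is a minimal, self-contained sketch (not the actual sglang code) of deriving the device from an input tensor instead of hardcoding it:

import torch

def build_attention_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # Infer the device from the input so the same code runs on CPU, CUDA, or MPS.
    return torch.ones_like(input_ids, device=input_ids.device)

mask = build_attention_mask(torch.tensor([[1, 2, 3]]))
print(mask.device)  # prints "cpu" when the input lives on the CPU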
.flatten()
.long()
)
_llm_pos_ids = torch.stack([t_index, h_index, w_index]).to("cuda")
The device is hardcoded to 'cuda'. This should be changed to use the device from an input tensor (e.g., grid_hs.device) to support execution on different hardware.
- _llm_pos_ids = torch.stack([t_index, h_index, w_index]).to("cuda")
+ _llm_pos_ids = torch.stack([t_index, h_index, w_index]).to(grid_hs.device)
futures.append(
    self.io_executor.submit(
        BaseMultimodalProcessor._load_single_item,
        data,
        is_video,
        False,
        frame_count_limit,
        None,
        discard_alpha_channel,
    )
)
task_info.append((Modality.IMAGE, data, frame_count_limit))
image_index += 1
task_info.append((Modality.IMAGE, data, None))
The logic for frame_count_limit based on estimated_frames_list and scaling_factor has been removed, and _load_single_item is now called with None for the frame limit. This seems to disable frame limiting for videos in the base processor, which could be a significant breaking change affecting all models. If this is intentional, it should be documented. Otherwise, it might be a bug.
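To make the concern concrete, here is a standalone sketch of the kind of frame-budget logic being referred to; the function name, the 30-frame budget, and the scaling rule are illustrative assumptions, not the actual sglang implementation:

from typing import List, Optional

def compute_frame_limits(
    estimated_frames_list: List[int],
    total_frame_budget: int = 30,
) -> List[Optional[int]]:
    """Scale per-video frame counts so their sum stays under a global budget."""
    total_estimated = sum(estimated_frames_list)
    if total_estimated <= total_frame_budget:
        # Every video fits as-is; no limiting needed.
        return [None] * len(estimated_frames_list)
    scaling_factor = total_frame_budget / total_estimated
    # Keep at least one frame per video after scaling.
    return [max(1, int(n * scaling_factor)) for n in estimated_frames_list]

# Three videos estimated at 10, 40, and 50 frames with a budget of 30
# are limited to roughly 3, 12, and 15 frames respectively.
print(compute_frame_limits([10, 40, 50]))  # [3, 12, 15]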
# TODO: remove this
multimodal_tokens.compile_regex()
There's a TODO to remove the explicit call to compile_regex(). This suggests the current implementation is temporary. It would be cleaner to handle regex compilation within the MultimodalSpecialTokens class, for example in its __post_init__ or another initialization method, to avoid these explicit calls in the processor.
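A minimal sketch of the pattern being suggested, with regex compilation moved into the dataclass itself; the fields shown here are illustrative and do not match the real MultimodalSpecialTokens definition:

import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MultimodalSpecialTokens:
    image_token: Optional[str] = None
    image_token_regex: Optional[re.Pattern] = field(default=None, init=False)

    def __post_init__(self):
        # Compile once at construction time so callers never need an
        # explicit compile_regex() step.
        if self.image_token is not None:
            self.image_token_regex = re.compile(re.escape(self.image_token))

tokens = MultimodalSpecialTokens(image_token="<|vision_bos|><|IMAGE|><|vision_eos|>")
assert tokens.image_token_regex.search("a<|vision_bos|><|IMAGE|><|vision_eos|>b")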
ret = self.process_mm_data(
    input_text=base_output.input_text,
    images=None if images_are_preprocessed else base_output.images,
    audio=base_output.audios,
)

input_ids = ret["input_ids"].flatten()
image_offsets = self.get_mm_items_offset(
    input_ids=input_ids, mm_token_id=self.image_token_id
)

image_grid_thw = None
video_grid_thw = None  # TODO

combined_mm_item, input_ids = self.process_and_combine_mm_data(base_output)
The method self.process_mm_data is called, and its return value ret is used later. However, self.process_and_combine_mm_data is also called, which internally calls self.process_mm_data again. This results in redundant processing and is inefficient. The logic should be refactored to avoid the duplicate call.
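As a toy illustration of the refactor being asked for (these functions are simplified stand-ins, not the real processor API), the combining step can accept an already-computed result so the expensive call runs only once:

def process_mm_data(input_text, images=None, audio=None):
    # Stand-in for the expensive tokenization / feature-extraction call.
    print("expensive processing ran")
    return {"input_ids": [101, 7, 8, 9, 102]}

def process_and_combine_mm_data(base_output, precomputed=None):
    # Reuse the precomputed result when given, instead of recomputing it.
    ret = precomputed if precomputed is not None else process_mm_data(**base_output)
    combined_mm_item = {"modality": "image", "num_tokens": len(ret["input_ids"])}
    return combined_mm_item, ret["input_ids"]

base_output = {"input_text": "hi <|IMAGE|>", "images": None, "audio": None}
ret = process_mm_data(**base_output)  # runs once
combined_mm_item, input_ids = process_and_combine_mm_data(base_output, precomputed=ret)
# "expensive processing ran" is printed exactly once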
# Skip loading extra bias for GPTQ models.
if name.endswith(".bias") and name not in params_dict:
    print(f"skipping {name}")
continue
param = params_dict[name]
except KeyError:
    print(params_dict.keys())
class TestOpenAIOmniServer(TestOpenAIVisionServer):
    @classmethod
    def setUpClass(cls):
        cls.model = "openbmb/MiniCPM-o-2_6"
This test file is for the new omni model support, but it's using openbmb/MiniCPM-o-2_6 instead of a qwen2.5 omni model. To ensure the changes for qwen2.5 omni are correctly tested, a model from that family should be used.
- cls.model = "openbmb/MiniCPM-o-2_6"
+ cls.model = "Qwen/Qwen2.5-0.5B-Omni-Instruct"
We can directly move to #10911.


Motivation
Modifications
Checklist