
Conversation

@lifuhuang
Collaborator

@lifuhuang lifuhuang commented May 21, 2025

Motivation

Support Phi4-MM model with text + vision.

Modifications

This change introduces basic text + image support.

It's worth noting that the current MMMU run (without LoRA) is lower than advertised, because Phi4MM relies on LoRA for full image understanding capabilities. However, LoRA support requires refactoring/generalizing the existing SGL LoRA handling, which will be addressed in a separate PR: #6585

Example: degraded image understanding without LoRA (MMMU is only 38). For comparison, in our local branch (#6585) with LoRA, MMMU improves to ~50.

TODO in this PR:

  • add unit tests
  • clean up styling issues

TODO in follow-up PR (ordered by priority):

  1. Precomputed feature support
  2. LoRA support (required for full multi-image understanding)
  3. SGLang LoRA compatibility with CUDA Graph and Radix Attention
  4. Refactor SGL MM processor logic to support the original variable image tokens (e.g., <image_1>)
  5. Performance optimization
  6. Audio support
  7. Pipeline parallelism support

Tracked in #6544
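To illustrate the MM-processor refactor mentioned in the TODO list: Phi4MM prompts use numbered image tokens (<image_1>, <image_2>, …), while a generic processor typically expects a single repeated placeholder. The sketch below is a hypothetical illustration, not the actual SGLang implementation; the placeholder string and function name are assumptions.

```python
import re

# Hypothetical sketch: normalize Phi4MM-style numbered image tokens
# (<image_1>, <image_2>, ...) into a single canonical placeholder,
# while recording the original indices so image features can be
# matched back to the right image. Not the actual SGLang code.

IMAGE_TOKEN_RE = re.compile(r"<image_(\d+)>")

def normalize_image_tokens(prompt: str, placeholder: str = "<|image|>"):
    """Return (rewritten prompt, image indices in order of appearance)."""
    indices = [int(m.group(1)) for m in IMAGE_TOKEN_RE.finditer(prompt)]
    normalized = IMAGE_TOKEN_RE.sub(placeholder, prompt)
    return normalized, indices

prompt = "Compare <image_1> with <image_2>. Which is brighter?"
normalized, order = normalize_image_tokens(prompt)
# normalized == "Compare <|image|> with <|image|>. Which is brighter?"
# order == [1, 2]
```

The recorded indices matter because users may reference images out of order (e.g., <image_2> before <image_1>), so features must be reordered to match the prompt rather than the upload order.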

Checklist

@lifuhuang lifuhuang mentioned this pull request May 23, 2025
@zhaochenyang20
Collaborator

@mickqian @yizhang2077

@mickqian
Collaborator

Better to be merged after #4969 , due to some change to the omni model processing and testing

@lifuhuang
Collaborator Author

Better to be merged after #4969 , due to some change to the omni model processing and testing

Hi @mickqian, thank you so much for reviewing my PR :)

Can you share more details about the concerns you have, so that I can test them locally? JFYI, I was able to merge your branch mickqian:qwen2.5-omni locally without conflicts and got a green TestOpenAIVisionServer run for phi4mm.

@mickqian
Collaborator

mickqian commented May 23, 2025

Can you share more details

It's mostly that, for omni models, there's a new TestOpenaiOmniServer. And yes, you can cherry-pick it.

I just noticed audio input is not supported this time.

@lifuhuang lifuhuang requested a review from mickqian May 24, 2025 00:36
@zhyncs zhyncs merged commit 022012a into sgl-project:main May 25, 2025
1 of 19 checks passed
@lifuhuang lifuhuang self-assigned this May 25, 2025
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
@lifuhuang lifuhuang mentioned this pull request Jun 23, 2025
@lifuhuang lifuhuang added new-model Multi-modal multi-modal language model labels Jul 14, 2025