ci: reduce and refactor vlm ut and combine test files #11062
Conversation
Summary of Changes

Hello @mickqian, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request aims to optimize the Continuous Integration (CI) process by reducing the overall number of Vision-Language Model (VLM) unit tests and consolidating the remaining tests into fewer files. This streamlining effort enhances the efficiency and maintainability of the test suite, ensuring faster feedback cycles while maintaining essential test coverage.
Code Review
This pull request refactors the vision language model (VLM) unit tests, likely to reduce CI execution time. It combines tests from test_vision_openai_server_b.py into test_vision_openai_server_a.py, removes the former file, and also removes several test classes. The changes are logical and align with the goal of optimizing CI. My main feedback is on improving maintainability by reducing code duplication in the test setup methods, as new test classes with duplicated logic are being added.
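The duplication the review points at could be factored into a shared base class along these lines. This is a minimal sketch, assuming a unittest-style suite; the class names and the server-launch wiring are illustrative, not taken from the PR (sglang's test utilities already provide their own launch helpers):

```python
import subprocess
import sys
import unittest


class BaseVLMServerTest(unittest.TestCase):
    """Hypothetical base class: concrete test classes only declare `model` and `extra_args`."""

    model: str = ""
    extra_args: list = []

    @classmethod
    def setUpClass(cls):
        if not cls.model:
            raise unittest.SkipTest("abstract base class, no model configured")
        # Single shared launch path instead of copy-pasted setup in every test class.
        cls.process = subprocess.Popen(
            [sys.executable, "-m", "sglang.launch_server", "--model-path", cls.model, *cls.extra_args]
        )

    @classmethod
    def tearDownClass(cls):
        cls.process.terminate()
        cls.process.wait()


class TestQwen25VLServer(BaseVLMServerTest):
    model = "Qwen/Qwen2.5-VL-7B-Instruct"
    extra_args = ["--enable-multimodal", "--cuda-graph-max-bs", "4"]
```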
```python
if self.forward_mode.is_decode_or_idle():
    attention_backend_str = global_server_args_dict["decode_attention_backend"]
else:
    attention_backend_str = global_server_args_dict["prefill_attention_backend"]
# Create seq_lens_cpu when needed
if (
    not global_server_args_dict["use_mla_backend"]
    and attention_backend_str == "flashinfer"
):
    # for meta-llama/Llama-3.2-11B-Vision-Instruct
    seq_lens_cpu = None
else:
    seq_lens_cpu = (
        seq_lens_cpu_cache
        if seq_lens_cpu_cache is not None
        else self.seq_lens.cpu()
    )
```
This is ugly, bro. Why do we need this?
Previously mllama-3.1 was not included in the PR tests; when I added it back in, the bug above appeared.
/gemini why is there a bug?
The bug arises specifically for the meta-llama/Llama-3.2-11B-Vision-Instruct model (which I believe is what you're referring to as mllama-3.1) when it's used with the flashinfer attention backend and without the MLA backend. For this particular model and configuration, seq_lens_cpu needs to be None for correct operation. The added if condition handles this edge case, ensuring compatibility for Llama-3.2-11B-Vision-Instruct.
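If the concern is mainly readability, the guard could be pulled out into a named predicate. This is just a sketch; the helper name is mine and the condition simply restates the diff above:

```python
def _needs_none_seq_lens_cpu(server_args: dict, attention_backend: str) -> bool:
    """meta-llama/Llama-3.2-11B-Vision-Instruct on the flashinfer backend without
    the MLA backend requires seq_lens_cpu to be None (see the explanation above)."""
    return not server_args["use_mla_backend"] and attention_backend == "flashinfer"


# The call site then reduces to a single named check:
# seq_lens_cpu = (
#     None
#     if _needs_none_seq_lens_cpu(global_server_args_dict, attention_backend_str)
#     else (seq_lens_cpu_cache if seq_lens_cpu_cache is not None else self.seq_lens.cpu())
# )
```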
| "--mem-fraction-static", | ||
| "0.70", | ||
| "--enable-multimodal", | ||
| "--cuda-graph-max-bs", |
Could we add `--cuda-graph-max-bs 4` to the base class?
Better not; otherwise, some models that fail to load with default settings in real deployments would still load in CI.
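A small illustration of the trade-off (class names are hypothetical, not from the PR): the cap stays an explicit per-model opt-in rather than a base-class default.

```python
import unittest


class TestVLMWithGraphCap(unittest.TestCase):
    # This model's CI configuration explicitly caps the CUDA-graph batch size.
    extra_args = ["--enable-multimodal", "--cuda-graph-max-bs", "4"]


class TestVLMDefaultGraph(unittest.TestCase):
    # No cap: if the model cannot load with default settings outside CI,
    # the CI run should fail the same way rather than silently pass.
    extra_args = []
```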
LGTM
| model = "Qwen/Qwen2.5-VL-7B-Instruct" | ||
| extra_args = [ | ||
| "--mem-fraction-static", | ||
| "0.35", |
- Can we remove all `--mem-fraction-static` flags in this file (and the nightly tests) so that we can test our auto mem-fraction-static logic?
- Can you tune our auto mem-fraction-static logic to match these values? Maybe we need to reserve a little bit more for VLMs.
sglang/docs/advanced_features/hyperparameter_tuning.md
Lines 43 to 50 in af96ca1
```
[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB
```
Check the `available_gpu_mem` value.
- If it is between 5–8 GB, the setting is good.
- If it is too high (e.g., 10–20 GB), increase `--mem-fraction-static` to allocate more memory to the KV cache.
- If it is too low, you risk out-of-memory (OOM) errors later, so decrease `--mem-fraction-static`.
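As a side note, the quoted rule is mechanical enough to script. The following is my own sketch (the helper is not part of sglang); it only applies the thresholds from the excerpt above to a startup log line:

```python
import re

LOG_LINE = (
    "[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, "
    "max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, "
    "available_gpu_mem=13.50 GB"
)


def suggest_mem_fraction_change(log_line: str) -> str:
    """Return a tuning hint based on the available_gpu_mem reported at startup."""
    match = re.search(r"available_gpu_mem=([\d.]+) GB", log_line)
    if match is None:
        return "no available_gpu_mem found in this line"
    free_gb = float(match.group(1))
    if 5 <= free_gb <= 8:
        return "setting looks good"
    if free_gb > 8:
        return "increase --mem-fraction-static (too much memory left idle)"
    return "decrease --mem-fraction-static (risk of OOM)"


print(suggest_mem_fraction_change(LOG_LINE))  # -> increase --mem-fraction-static (...)
```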
Yes, we should reserve more memory for VLMs.
The most frequent reason for specifying a low `--mem-fraction-static` is long inputs, e.g. video inputs. Aside from video input, we can probably find a strategy that sets `--mem-fraction-static` well automatically.
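One possible shape for such a strategy, purely as a sketch: reserve extra headroom for the vision encoder and multimodal feature buffers before handing the rest to the KV cache. The function and every reserve size below are assumptions of mine, not sglang's actual auto-tuning logic:

```python
def auto_mem_fraction_static(
    total_gpu_mem_gb: float,
    model_weights_gb: float,
    is_vlm: bool,
    expects_video: bool = False,
) -> float:
    """Hypothetical heuristic: leave more headroom for VLMs (ViT activations,
    image/video feature buffers) before giving the rest to the KV cache."""
    headroom_gb = 5.0                 # baseline reserve (CUDA graphs, activations)
    if is_vlm:
        headroom_gb += 3.0            # vision encoder + multimodal feature cache
    if expects_video:
        headroom_gb += 6.0            # long video inputs need much more slack
    usable_gb = total_gpu_mem_gb - model_weights_gb - headroom_gb
    # Fraction of total memory handed to static allocation (weights + KV cache).
    return max(0.3, min(0.9, (model_weights_gb + usable_gb) / total_gpu_mem_gb))


# Example: 80 GB GPU, 16 GB of weights, VLM without video input.
print(round(auto_mem_fraction_static(80, 16, is_vlm=True), 2))  # 0.9 (clamped)
```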
All `--mem-fraction-static` flags in this file have been removed.
Force-pushed from 055d0d1 to e1bef17 (Compare)
Force-pushed from 473689a to cd9c610 (Compare)
Motivation

Modifications
- Combine tests from test_vision_openai_server_b.py into test_vision_openai_server_a.py and remove the former file
- Remove several redundant test classes (e.g., test_mixed_batch)

Accuracy Tests

Benchmarking and Profiling

Checklist