ci: reduce and refactor vlm ut and combine test files #11062
Conversation
Summary of Changes

Hello @mickqian, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request aims to optimize the Continuous Integration (CI) process by reducing the overall number of Vision-Language Model (VLM) unit tests and consolidating the remaining tests into fewer files. This streamlining effort enhances the efficiency and maintainability of the test suite, ensuring faster feedback cycles while maintaining essential test coverage.
Code Review
This pull request refactors the vision language model (VLM) unit tests, likely to reduce CI execution time. It combines tests from test_vision_openai_server_b.py into test_vision_openai_server_a.py, removes the former file, and also removes several test classes. The changes are logical and align with the goal of optimizing CI. My main feedback is on improving maintainability by reducing code duplication in the test setup methods, as new test classes with duplicated logic are being added.
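The duplication the review points at could be factored into a shared base class along these lines. This is a minimal sketch, assuming a unittest-style suite; the class names and the server-launch wiring are illustrative, not taken from the PR (sglang's test utilities already provide their own launch helpers):

```python
import subprocess
import sys
import unittest


class BaseVLMServerTest(unittest.TestCase):
    """Hypothetical base class: concrete test classes only declare `model` and `extra_args`."""

    model: str = ""
    extra_args: list = []

    @classmethod
    def setUpClass(cls):
        if not cls.model:
            raise unittest.SkipTest("abstract base class, no model configured")
        # Single shared launch path instead of copy-pasted setup in every test class.
        cls.process = subprocess.Popen(
            [sys.executable, "-m", "sglang.launch_server", "--model-path", cls.model, *cls.extra_args]
        )

    @classmethod
    def tearDownClass(cls):
        cls.process.terminate()
        cls.process.wait()


class TestQwen25VLServer(BaseVLMServerTest):
    model = "Qwen/Qwen2.5-VL-7B-Instruct"
    extra_args = ["--enable-multimodal", "--cuda-graph-max-bs", "4"]
```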
```python
if self.forward_mode.is_decode_or_idle():
    attention_backend_str = global_server_args_dict["decode_attention_backend"]
else:
    attention_backend_str = global_server_args_dict["prefill_attention_backend"]
# Create seq_lens_cpu when needed
if (
    not global_server_args_dict["use_mla_backend"]
    and attention_backend_str == "flashinfer"
):
    # for meta-llama/Llama-3.2-11B-Vision-Instruct
    seq_lens_cpu = None
else:
    seq_lens_cpu = (
        seq_lens_cpu_cache
        if seq_lens_cpu_cache is not None
        else self.seq_lens.cpu()
    )
```
This is ugly, bro. Why do we need this?
Previously mllama-3.1 was not included in the PR tests; when I added it back in, the bug above appeared.
/gemini why is there a bug?
The bug arises specifically for the meta-llama/Llama-3.2-11B-Vision-Instruct model (which I believe is what you're referring to as mllama-3.1) when it's used with the flashinfer attention backend and without the MLA backend. For this particular model and configuration, seq_lens_cpu needs to be None for correct operation. The added if condition handles this edge case, ensuring compatibility for Llama-3.2-11B-Vision-Instruct.
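If the concern is mainly readability, the guard could be pulled out into a named predicate. This is just a sketch; the helper name is mine and the condition simply restates the diff above:

```python
def _needs_none_seq_lens_cpu(server_args: dict, attention_backend: str) -> bool:
    """meta-llama/Llama-3.2-11B-Vision-Instruct on the flashinfer backend without
    the MLA backend requires seq_lens_cpu to be None (see the explanation above)."""
    return not server_args["use_mla_backend"] and attention_backend == "flashinfer"


# The call site then reduces to a single named check:
# seq_lens_cpu = (
#     None
#     if _needs_none_seq_lens_cpu(global_server_args_dict, attention_backend_str)
#     else (seq_lens_cpu_cache if seq_lens_cpu_cache is not None else self.seq_lens.cpu())
# )
```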
| "--mem-fraction-static", | ||
| "0.70", | ||
| "--enable-multimodal", | ||
| "--cuda-graph-max-bs", |
Could we add `--cuda-graph-max-bs 4` to the base class?
Better not; otherwise, some models that fail to load with default settings in real deployments would still load in CI.
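A small illustration of the trade-off (class names are hypothetical, not from the PR): the cap stays an explicit per-model opt-in rather than a base-class default.

```python
import unittest


class TestVLMWithGraphCap(unittest.TestCase):
    # This model's CI configuration explicitly caps the CUDA-graph batch size.
    extra_args = ["--enable-multimodal", "--cuda-graph-max-bs", "4"]


class TestVLMDefaultGraph(unittest.TestCase):
    # No cap: if the model cannot load with default settings outside CI,
    # the CI run should fail the same way rather than silently pass.
    extra_args = []
```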
LGTM
| model = "Qwen/Qwen2.5-VL-7B-Instruct" | ||
| extra_args = [ | ||
| "--mem-fraction-static", | ||
| "0.35", |
- Can we remove all `--mem-fraction-static` flags in this file (and the nightly tests) so that we can test our auto mem-fraction-static logic?
- Can you tune our auto mem-fraction-static logic to match these values? Maybe we need to reserve a little bit more for VLMs.
sglang/docs/advanced_features/hyperparameter_tuning.md
Lines 43 to 50 in af96ca1
```
[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB
```
Check the `available_gpu_mem` value.
- If it is between 5–8 GB, the setting is good.
- If it is too high (e.g., 10–20 GB), increase `--mem-fraction-static` to allocate more memory to the KV cache.
- If it is too low, you risk out-of-memory (OOM) errors later, so decrease `--mem-fraction-static`.
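As a side note, the quoted rule is mechanical enough to script. The following is my own sketch (the helper is not part of sglang); it only applies the thresholds from the excerpt above to a startup log line:

```python
import re

LOG_LINE = (
    "[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, "
    "max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, "
    "available_gpu_mem=13.50 GB"
)


def suggest_mem_fraction_change(log_line: str) -> str:
    """Return a tuning hint based on the available_gpu_mem reported at startup."""
    match = re.search(r"available_gpu_mem=([\d.]+) GB", log_line)
    if match is None:
        return "no available_gpu_mem found in this line"
    free_gb = float(match.group(1))
    if 5 <= free_gb <= 8:
        return "setting looks good"
    if free_gb > 8:
        return "increase --mem-fraction-static (too much memory left idle)"
    return "decrease --mem-fraction-static (risk of OOM)"


print(suggest_mem_fraction_change(LOG_LINE))  # -> increase --mem-fraction-static (...)
```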
Yes, we should reserve more memory for VLMs.
The most frequent reason for specifying a low `--mem-fraction-static` is long inputs, e.g. video inputs. Aside from video input, we can probably find a strategy that sets `--mem-fraction-static` well automatically.
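One possible shape for such a strategy, purely as a sketch: reserve extra headroom for the vision encoder and multimodal feature buffers before handing the rest to the KV cache. The function and every reserve size below are assumptions of mine, not sglang's actual auto-tuning logic:

```python
def auto_mem_fraction_static(
    total_gpu_mem_gb: float,
    model_weights_gb: float,
    is_vlm: bool,
    expects_video: bool = False,
) -> float:
    """Hypothetical heuristic: leave more headroom for VLMs (ViT activations,
    image/video feature buffers) before giving the rest to the KV cache."""
    headroom_gb = 5.0                 # baseline reserve (CUDA graphs, activations)
    if is_vlm:
        headroom_gb += 3.0            # vision encoder + multimodal feature cache
    if expects_video:
        headroom_gb += 6.0            # long video inputs need much more slack
    usable_gb = total_gpu_mem_gb - model_weights_gb - headroom_gb
    # Fraction of total memory handed to static allocation (weights + KV cache).
    return max(0.3, min(0.9, (model_weights_gb + usable_gb) / total_gpu_mem_gb))


# Example: 80 GB GPU, 16 GB of weights, VLM without video input.
print(round(auto_mem_fraction_static(80, 16, is_vlm=True), 2))  # 0.9 (clamped)
```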
All `--mem-fraction-static` flags in this file have been removed.
Force-pushed from 055d0d1 to e1bef17 (Compare)
Force-pushed from 473689a to cd9c610 (Compare)
Motivation

Modifications
- Combine tests from test_vision_openai_server_b.py into test_vision_openai_server_a.py and remove the former file
- Remove several redundant test classes (e.g., test_mixed_batch)

Accuracy Tests

Benchmarking and Profiling

Checklist