Motivation.
Build E2E CI for vllm-omni to strengthen quality protection. Currently, the CI pipeline needs to be expanded to cover the latest omni-modal and diffusion-based models. This update ensures robust validation for both online (real-time inference) and offline (batch/development) scenarios.
Proposed Change.
This testing system aims to build a complete, efficient, and well-structured quality assurance framework for the development, integration, and release of model services. It draws on the concept of the test pyramid from modern software engineering, progressively expanding testing activities from basic code logic verification to complex end-to-end (E2E) functionality, performance, accuracy, and even long-term stability validation.
Tiered testing structure:
| Level | Scope & Focus | Time Cost | Test Dir | Doc | Frequency | Hardware |
|---|---|---|---|---|---|---|
| Common | Contribution guideline & PR checklist | / | / | docs/contributing/ci/README.md, .github/PULL_REQUEST_TEMPLATE.md, docs/contributing/ci/tests_style.md | / | / |
| Common | CI failure description | / | / | docs/contributing/ci/failures.md | / | / |
| L1 (Unit & Logic) | Unit tests for components like entrypoints, models | <15 min | /tests/{component_name}/test_xxx.py | docs/contributing/ci/CI_5levels.md | PR with ready label (can also run locally) | CPU |
| L2 (E2E across models & GPU-required UT) | Online & offline (basic deployment scenarios): dummy models, normal inference function (output format, stream), some instance-startup UT | | /tests/e2e/online_serving/test_{model_name}.py, /tests/e2e/offline_inference/test_{model_name}.py | docs/contributing/ci/CI_5levels.md Section 1 (L1 & L2): purpose, test content, directory location, example | PR with ready label | GPU |
| L3 (Important Perf & Integration & Accuracy) | Online & offline (multiple deployment scenarios): real model, normal inference function, normal accuracy | <30 min | /tests/e2e/online_serving/test_{model_name}_expansion.py, /tests/e2e/offline_inference/test_{model_name}_expansion.py | docs/contributing/ci/CI_5levels.md Section 2 (L3): purpose, test content, directory location, example | PR merged (also runs L1 & L2 tests) | GPU |
| L4 (Perf & Integration & Accuracy) | Online & offline: full functional scenarios + performance test + doc test | <3 hours | Full function: /tests/e2e/online_serving/test_{model_name}_expansion.py, /tests/e2e/offline_inference/test_{model_name}_expansion.py; Performance: /tests/e2e/perf/nightly.json; Doc test: tests/example/online_serving/test_{model_name}.py, tests/example/offline_inference/test_{model_name}.py | docs/contributing/ci/CI_5levels.md Section 3 (L4): purpose, test content, directory location, example | Nightly | GPU |
| L5 (Stability & Reliability) | Online & offline: long-term stability test + reliability test | Depends on scenario | Stability: tests/e2e/stability/weekly.json; Reliability: tests/e2e/reliability/test_{model_name}.py | docs/contributing/ci/CI_5levels.md Section 4 (L5): purpose, test content, directory location, example | Weekly / days before release | GPU |
Detailed Design for Each Level
Common Specifications
Before entering specific testing levels, the project establishes two common specifications aimed at standardizing the development process and quickly locating issues.
- PR Checklist (Tests Style): This template defines the self-check items that must be completed before submitting a code review (Pull Request). It ensures that each code change meets basic requirements such as code style, dependency updates, and documentation synchronization before entering the automated testing pipeline, serving as the first manual line of defense for quality assurance.
- CI Failure Explanation (CI Failures): This document archives and explains common failure patterns in the Continuous Integration (CI) pipeline, error log interpretation, and preliminary troubleshooting steps. It helps developers and testers quickly diagnose the causes of automated test failures, improving problem-solving efficiency.
L1 & L2 Level Testing - Unit Testing and Basic End-to-End Verification
1.1 Testing Purpose
L1 and L2 level testing form the foundation of the quality assurance system. L1 level testing focuses on verifying the internal logic correctness of code units (e.g., functions, classes), ensuring each independent component behaves as designed.
L2 level testing builds upon L1 by introducing GPU resources and verifying that the end-to-end (E2E) flow of a model in basic deployment scenarios runs smoothly. For example, it uses dummy models to confirm that core interfaces like the inference pipeline, output format, and streaming response work properly. The common goal of these two levels is to provide developers with rapid feedback, discovering and fixing issues early in the development cycle.
1.2 Testing Content and Scope
- L1 (Unit & Logic Testing):
  - Scope: Tests internal functions and methods of core components such as `entrypoints` and `models`.
  - Focus: Branch coverage, exception handling, and algorithm-logic correctness. Does not involve external dependencies or the complete service stack. (A minimal example is sketched after this list.)
  - Time Cost: Execution time is controlled within 15 minutes to ensure fast feedback.
- L2 (Basic End-to-End Testing):
  - Scope: Covers two basic deployment scenarios: `online` (serving) and `offline` (inference).
  - Focus: Uses `dummy` models or lightweight real models to verify that the entire chain from request input to result output works normally, including output data structure, streaming (stream) support, etc. Also includes some unit tests that require launching independent service instances.
  - Characteristic: Requires GPU resources to perform model computations.
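To make the L1 scope concrete, here is a minimal pytest-style sketch. The module path and the `normalize_prompt` helper are hypothetical stand-ins, not actual vllm-omni code; a real L1 test would import the function under test from its component module.

```python
# tests/entrypoints/test_request_parsing.py  (illustrative path and names only)
import pytest

# Hypothetical helper: in a real test this would be imported from the
# component under test, e.g. a request-normalization utility in entrypoints.
def normalize_prompt(prompt: str) -> str:
    if not prompt:
        raise ValueError("prompt must be non-empty")
    return prompt.strip()

def test_normalize_prompt_strips_whitespace():
    # Core logic branch: surrounding whitespace is trimmed, content preserved.
    assert normalize_prompt("  hello  ") == "hello"

def test_normalize_prompt_rejects_empty_input():
    # Exception-handling branch: empty input is rejected explicitly.
    with pytest.raises(ValueError):
        normalize_prompt("")
```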
1.3 Test Directory and Execution Files
A clear directory structure is key to managing test cases efficiently.
- L1 Test Directory: `/tests/{component_name}/test_xxx.py`
  - Here, `{component_name}` corresponds to modules in the source code, such as `distributed`, `entrypoints`, etc., and `test_xxx.py` is the specific test file.
- L2 Test Directories:
  - Online Serving: `/tests/e2e/online_serving/test_{model_name}.py`
  - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}.py`
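For L2, an offline-inference E2E case might look like the following sketch. It assumes the upstream vLLM offline API (`LLM`, `SamplingParams`) and uses `load_format="dummy"` so no real weights are loaded; the actual vllm-omni entrypoint, model name, and fixture setup may differ.

```python
# tests/e2e/offline_inference/test_dummy_model.py  (illustrative)
import pytest

# Assumes the upstream vLLM offline API; the actual vllm-omni entrypoint may differ.
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"  # placeholder lightweight model, for illustration only

@pytest.fixture(scope="module")
def llm():
    # load_format="dummy" skips real weight loading, matching the L2 goal of
    # exercising the pipeline rather than model quality.
    return LLM(model=MODEL, load_format="dummy")

def test_basic_generation_shape(llm):
    params = SamplingParams(max_tokens=8, temperature=0.0)
    outputs = llm.generate(["Hello, world"], params)
    # L2 checks structure, not content: one output per prompt, text field present.
    assert len(outputs) == 1
    assert isinstance(outputs[0].outputs[0].text, str)
```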
L3 Level Testing - Core Integration, Performance, and Accuracy Verification
2.1 Testing Purpose
L3 level testing executes after code is merged into the main branch. Its core purpose is to verify the integration behavior, key performance indicators, and output accuracy of real models across multiple deployment scenarios. It acts as the "quality gatekeeper" for the main branch, ensuring that no merge breaks the core capabilities of the model service. Testing needs to provide clear conclusions within a relatively short time (<30 min), balancing test depth with feedback speed.
2.2 Testing Content and Scope
- Deployment Scenarios: Covers richer `online` and `offline` deployment configurations, which may include different hardware configurations, batch sizes, concurrency levels, etc.
- Core Verification:
  - Inference Functionality: Ensures real models can perform forward computation normally and return results.
  - Accuracy Compliance: Verifies that the model's evaluation metrics (e.g., accuracy) meet the expected baseline, preventing code changes from introducing accuracy regressions.
  - Important Performance: Verifies whether performance (e.g., P99 latency, throughput) in core scenarios meets preset thresholds. (An illustrative threshold check is sketched below.)
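As an illustration of the accuracy and latency checks above, an L3 case could assert against baselines roughly as follows. The `run_eval` and `generate_one` fixtures, the dataset name, and the threshold values are hypothetical; real baselines would be tuned per model and hardware.

```python
# tests/e2e/offline_inference/test_model_expansion.py  (illustrative excerpt)
import math
import time

# Hypothetical baselines; real values would live alongside the test or in a shared config.
ACCURACY_BASELINE = 0.85    # minimum acceptable aggregate eval score
P99_LATENCY_BUDGET_S = 2.0  # per-request latency budget

def test_accuracy_meets_baseline(run_eval):
    # run_eval is a hypothetical fixture that runs a small eval set through the
    # real model and returns an aggregate score.
    score = run_eval(dataset="smoke_eval", num_samples=32)
    assert score >= ACCURACY_BASELINE, f"accuracy {score:.3f} below baseline"

def test_p99_latency_within_budget(generate_one):
    # generate_one is a hypothetical fixture issuing one real inference request.
    latencies = []
    for _ in range(20):
        start = time.perf_counter()
        generate_one("describe this image")
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # Nearest-rank p99; on a sample of 20 this is effectively the worst case.
    p99 = latencies[math.ceil(0.99 * len(latencies)) - 1]
    assert p99 <= P99_LATENCY_BUDGET_S
```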
2.3 Test Directory and Execution Files
- Functional Testing:
  - Online Serving: `/tests/e2e/online_serving/test_{model_name}_expansion.py`
  - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}_expansion.py`
  - (Note: the `_expansion.py` suffix likely means the file contains more comprehensive scenario cases compared to the L2 tests.)
L4 Level Testing - Full Functionality, Performance, and Documentation Testing
3.1 Testing Purpose
L4 level testing is a comprehensive quality audit before a version release. It expands upon L3, executing full functional scenarios, conducting systematic performance stress tests, and simultaneously verifying the correctness of accompanying example documentation. Its purpose is to perform deep validation of the system during off-peak nighttime hours, providing quality trend reports for daytime development and data support for release decisions.
3.2 Testing Content and Scope
- Full Functionality Testing: Executes all test cases defined in `test_{model_name}_expansion.py`, covering all implemented features, positive flows, boundary conditions, and exception handling.
- Performance Testing: Uses the `/tests/e2e/perf/nightly.json` configuration file to drive performance-testing tools for stress, load, and endurance tests, collecting metrics such as throughput, response time, and resource utilization. (A sketch of how such a configuration might be consumed follows this list.)
- Documentation Testing: Verifies that the example code provided to users is runnable and that its results match the description.
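The exact schema of `nightly.json` is not specified in this RFC. As a hedged sketch, a driver might read a list of scenario entries and feed them to a benchmark harness; the assumed schema and the `run_benchmark` callable are noted in the comments and are not the project's actual API.

```python
# Illustrative driver for the nightly perf run; schema and helper are assumptions.
import json
from pathlib import Path

def load_scenarios(config_path: str) -> list[dict]:
    # Assumed shape: {"scenarios": [{"model": "...", "batch_size": 8,
    #                                "concurrency": 4, "duration_s": 300}, ...]}
    with Path(config_path).open() as f:
        return json.load(f)["scenarios"]

def run_nightly(run_benchmark, config_path: str = "tests/e2e/perf/nightly.json") -> list[dict]:
    """run_benchmark(**scenario) is expected to launch the serving stack with the
    scenario's settings and return a metrics dict (throughput, latency, utilization)."""
    results = []
    for scenario in load_scenarios(config_path):
        metrics = run_benchmark(**scenario)
        results.append({"scenario": scenario, "metrics": metrics})
    return results
```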
3.3 Test Directory and Execution Files
- Functional Testing: Same directories as L3.
- Performance Test Configuration: `/tests/e2e/perf/nightly.json`
- Documentation Example Tests:
  - `tests/example/online_serving/test_{model_name}.py`
  - `tests/example/offline_inference/test_{model_name}.py`
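A documentation example test can be as simple as executing each published example script and asserting a clean exit, leaving output correctness to the functional levels. The `examples/offline_inference` path below is an assumption about the repository layout.

```python
# tests/example/offline_inference/test_example_scripts.py  (illustrative)
import subprocess
import sys
from pathlib import Path

import pytest

# Hypothetical location of the user-facing example scripts; the real layout may differ.
EXAMPLES_DIR = Path("examples/offline_inference")

@pytest.mark.parametrize("script", sorted(EXAMPLES_DIR.glob("*.py")),
                         ids=lambda p: p.name)
def test_example_runs_cleanly(script):
    # A doc test only asserts that the published example executes end to end.
    result = subprocess.run([sys.executable, str(script)],
                            capture_output=True, text=True, timeout=600)
    assert result.returncode == 0, result.stderr
```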
L5 Level Testing - Stability and Reliability Testing
4.1 Testing Purpose
L5 level testing focuses on the performance of model services under long-running and abnormal fault scenarios. It aims to uncover deep-seated issues that only manifest under sustained pressure or extreme conditions, such as memory leaks, resource contention, gradual performance degradation, and lack of fault tolerance mechanisms. This is the final, yet crucial, line of defense for ensuring service high availability and production environment robustness.
4.2 Testing Content and Scope
- Long-term Stability Testing: Uses the `tests/e2e/stability/weekly.json` configuration to run the service under moderate load for an extended period (e.g., over 12 hours), monitoring whether metrics like memory/VRAM usage, response time, and throughput degrade over time, and whether the service process remains stable.
- Reliability Testing: Uses `tests/e2e/reliability/test_{model_name}.py` to actively simulate various fault and abnormal scenarios, such as dependent-service interruption, abnormal input data, network flicker, and hardware resource preemption, verifying the system's fault tolerance, self-healing, and graceful degradation capabilities. (A sketch of an abnormal-input case follows this list.)
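As a sketch of a reliability case for abnormal input, the test below sends malformed requests to an OpenAI-compatible endpoint and checks that the service rejects them cleanly and keeps serving. The endpoint URL, model name, and expected status codes are illustrative assumptions, and a running server (e.g. started by a session-scoped fixture) is presumed.

```python
# tests/e2e/reliability/test_abnormal_input.py  (illustrative)
import pytest
import requests

# Placeholder address and model; a real test would obtain these from fixtures/config.
BASE_URL = "http://localhost:8000/v1"
MODEL = "dummy-omni-model"

@pytest.mark.parametrize("bad_payload", [
    {"model": MODEL, "messages": []},                                    # empty conversation
    {"model": MODEL, "messages": [{"role": "user"}]},                    # missing content
    {"model": "nonexistent-model",
     "messages": [{"role": "user", "content": "hi"}]},                   # unknown model
])
def test_abnormal_requests_fail_gracefully(bad_payload):
    # Reliability check: malformed requests should yield a clean 4xx error,
    # not a hang or a crashed worker.
    resp = requests.post(f"{BASE_URL}/chat/completions", json=bad_payload, timeout=30)
    assert 400 <= resp.status_code < 500

def test_service_recovers_after_abnormal_traffic():
    # After the abnormal requests above, a well-formed request should still succeed.
    ok = {"model": MODEL, "messages": [{"role": "user", "content": "hello"}]}
    resp = requests.post(f"{BASE_URL}/chat/completions", json=ok, timeout=60)
    assert resp.status_code == 200
```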
4.3 Test Directory and Execution Files
- Stability Test Configuration: `tests/e2e/stability/weekly.json`
- Reliability Test Suite: `tests/e2e/reliability/test_{model_name}.py`
Detailed Implementation Roadmap & To-Do
| Priority | Category | Task | Description |
|---|---|---|---|
| P0 | Documentation | Five-Level CI Test Documentation | Update documentation [#1167] |
| P0 | Build & Automation | Nightly Build Script Implementation | Create test-nightly build script [#867] |
| P0 | Build & Test Organization | L2/L3 Test Case Refactoring | Configure test-merge build script and split existing test cases into L2 and L3 levels [RFC: #1218] [PR: #1272] |
| P0 | Test Capabilities | Performance Test Framework | Develop a public framework for performance tests [RFC: #1313] [PR: #1321] |
| P1 | Test Capabilities | Stability Test Framework | Develop a public framework for stability tests |
| P1 | Test Capabilities | Add E2E & Example Test Cases | Supplement E2E (end-to-end) and example test cases for various models |
| P1 | Test Capabilities | Add UT Test Cases | Supplement unit tests for various components |
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response