Motivation.
Build E2E CI for vllm-omni to strengthen quality protection. Currently, the CI pipeline needs to be expanded to cover the latest omni-modal and diffusion-based models. This update ensures robust validation for both online (real-time inference) and offline (batch/development) scenarios.
Proposed Change.
This testing system aims to build a complete, efficient, and well-structured quality assurance framework for the development, integration, and release of model services. It draws on the concept of the test pyramid from modern software engineering, progressively expanding testing activities from basic code logic verification to complex end-to-end (E2E) functionality, performance, accuracy, and even long-term stability validation.
Tiered testing structure:
| Level | Scope & Focus | Time Cost | Test Dir | Doc | Frequency | Hardware |
|---|---|---|---|---|---|---|
| Common | Contribution guideline & PR checklist | / | / | docs/contributing/ci/README.md, .github/PULL_REQUEST_TEMPLATE.md, docs/contributing/ci/tests_style.md | / | / |
| Common | CI failure description | / | / | docs/contributing/ci/failures.md | / | / |
| L1 (Unit & Logic) | Unit tests for components like entrypoints, models | <15 min | /tests/{component_name}/test_xxx.py | docs/contributing/ci/CI_5levels.md | PR with ready label (can also run locally) | CPU |
| L2 (E2E across models & GPU-required UT) | Online & offline (basic deployment scenarios): dummy models, normal inference function (output format, stream), some instance-startup UT | | /tests/e2e/online_serving/test_{model_name}.py, /tests/e2e/offline_inference/test_{model_name}.py | docs/contributing/ci/CI_5levels.md Section 1 (L1 & L2): purpose, test content, directory location, example | PR with ready label | GPU |
| L3 (Important Perf & Integration & Accuracy) | Online & offline (multiple deployment scenarios): real model, normal inference function, normal accuracy | <30 min | /tests/e2e/online_serving/test_{model_name}_expansion.py, /tests/e2e/offline_inference/test_{model_name}_expansion.py | docs/contributing/ci/CI_5levels.md Section 2 (L3): purpose, test content, directory location, example | PR merged (also runs L1 & L2 tests) | GPU |
| L4 (Perf & Integration & Accuracy) | Online & offline: full functional scenarios + performance test + doc test | <3 hours | Full function: /tests/e2e/online_serving/test_{model_name}_expansion.py, /tests/e2e/offline_inference/test_{model_name}_expansion.py; Performance: /tests/e2e/perf/nightly.json; Doc test: tests/example/online_serving/test_{model_name}.py, tests/example/offline_inference/test_{model_name}.py | docs/contributing/ci/CI_5levels.md Section 3 (L4): purpose, test content, directory location, example | Nightly | GPU |
| L5 (Stability & Reliability) | Online & offline: long-term stability test + reliability test | Depends on scenario | Stability: tests/e2e/stability/weekly.json; Reliability: tests/e2e/reliability/test_{model_name}.py | docs/contributing/ci/CI_5levels.md Section 4 (L5): purpose, test content, directory location, example | Weekly / days before release | GPU |
Detailed Design for Each Level
Common Specifications
Before entering specific testing levels, the project establishes two common specifications aimed at standardizing the development process and quickly locating issues.
- PR Checklist (Tests Style): This template defines the self-check items that must be completed before submitting a code review (Pull Request). It ensures that each code change meets basic requirements such as code style, dependency updates, and documentation synchronization before entering the automated testing pipeline, serving as the first manual line of defense for quality assurance.
- CI Failure Explanation (CI Failures): This document archives and explains common failure patterns in the Continuous Integration (CI) pipeline, error log interpretation, and preliminary troubleshooting steps. It helps developers and testers quickly diagnose the causes of automated test failures, improving problem-solving efficiency.
L1 & L2 Level Testing - Unit Testing and Basic End-to-End Verification
1.1 Testing Purpose
L1 and L2 level testing form the foundation of the quality assurance system. L1 level testing focuses on verifying the internal logic correctness of code units (e.g., functions, classes), ensuring each independent component behaves as designed.
L2 level testing builds upon L1 by introducing GPU resources and verifying that the end-to-end (E2E) flow of a model in basic deployment scenarios runs smoothly. For example, it uses dummy models to confirm that core interfaces like the inference pipeline, output format, and streaming response work properly. The common goal of these two levels is to provide developers with rapid feedback, discovering and fixing issues early in the development cycle.
1.2 Testing Content and Scope
- L1 (Unit & Logic Testing):
  - Scope: Tests internal functions and methods of core components such as `entrypoints` and `models`.
  - Focus: Branch coverage, exception handling, and algorithm-logic correctness. Does not involve external dependencies or the complete service stack. (A minimal example is sketched after this list.)
  - Time Cost: Execution time is controlled within 15 minutes to ensure fast feedback.
- L2 (Basic End-to-End Testing):
  - Scope: Covers two basic deployment scenarios: `online` (serving) and `offline` (inference).
  - Focus: Uses `dummy` models or lightweight real models to verify that the entire chain from request input to result output works normally, including output data structure, streaming (stream) support, etc. Also includes some unit tests that require launching independent service instances.
  - Characteristic: Requires GPU resources to perform model computations.
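To make the L1 scope concrete, here is a minimal pytest-style sketch. The module path and the `normalize_prompt` helper are hypothetical stand-ins, not actual vllm-omni code; a real L1 test would import the function under test from its component module.

```python
# tests/entrypoints/test_request_parsing.py  (illustrative path and names only)
import pytest

# Hypothetical helper: in a real test this would be imported from the
# component under test, e.g. a request-normalization utility in entrypoints.
def normalize_prompt(prompt: str) -> str:
    if not prompt:
        raise ValueError("prompt must be non-empty")
    return prompt.strip()

def test_normalize_prompt_strips_whitespace():
    # Core logic branch: surrounding whitespace is trimmed, content preserved.
    assert normalize_prompt("  hello  ") == "hello"

def test_normalize_prompt_rejects_empty_input():
    # Exception-handling branch: empty input is rejected explicitly.
    with pytest.raises(ValueError):
        normalize_prompt("")
```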
1.3 Test Directory and Execution Files
A clear directory structure is key to managing test cases efficiently.
- L1 Test Directory: `/tests/{component_name}/test_xxx.py`
  - Here, `{component_name}` corresponds to modules in the source code, such as `distributed`, `entrypoints`, etc., and `test_xxx.py` is the specific test file.
- L2 Test Directories:
  - Online Serving: `/tests/e2e/online_serving/test_{model_name}.py`
  - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}.py`
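For L2, an offline-inference E2E case might look like the following sketch. It assumes the upstream vLLM offline API (`LLM`, `SamplingParams`) and uses `load_format="dummy"` so no real weights are loaded; the actual vllm-omni entrypoint, model name, and fixture setup may differ.

```python
# tests/e2e/offline_inference/test_dummy_model.py  (illustrative)
import pytest

# Assumes the upstream vLLM offline API; the actual vllm-omni entrypoint may differ.
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"  # placeholder lightweight model, for illustration only

@pytest.fixture(scope="module")
def llm():
    # load_format="dummy" skips real weight loading, matching the L2 goal of
    # exercising the pipeline rather than model quality.
    return LLM(model=MODEL, load_format="dummy")

def test_basic_generation_shape(llm):
    params = SamplingParams(max_tokens=8, temperature=0.0)
    outputs = llm.generate(["Hello, world"], params)
    # L2 checks structure, not content: one output per prompt, text field present.
    assert len(outputs) == 1
    assert isinstance(outputs[0].outputs[0].text, str)
```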
L3 Level Testing - Core Integration, Performance, and Accuracy Verification
2.1 Testing Purpose
L3 level testing executes after code is merged into the main branch. Its core purpose is to verify the integration behavior, key performance indicators, and output accuracy of real models across multiple deployment scenarios. It acts as the "quality gatekeeper" for the main branch, ensuring that no merge breaks the core capabilities of the model service. Testing needs to provide clear conclusions within a relatively short time (<30 min), balancing test depth with feedback speed.
2.2 Testing Content and Scope
- Deployment Scenarios: Covers richer `online` and `offline` deployment configurations, which may include different hardware configurations, batch sizes, concurrency levels, etc.
- Core Verification:
  - Inference Functionality: Ensures real models can perform forward computation normally and return results.
  - Accuracy Compliance: Verifies that the model's evaluation metrics (e.g., accuracy) meet the expected baseline, preventing code changes from introducing accuracy regressions.
  - Important Performance: Verifies whether performance (e.g., P99 latency, throughput) in core scenarios meets preset thresholds. (An illustrative threshold check is sketched below.)
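As an illustration of the accuracy and latency checks above, an L3 case could assert against baselines roughly as follows. The `run_eval` and `generate_one` fixtures, the dataset name, and the threshold values are hypothetical; real baselines would be tuned per model and hardware.

```python
# tests/e2e/offline_inference/test_model_expansion.py  (illustrative excerpt)
import math
import time

# Hypothetical baselines; real values would live alongside the test or in a shared config.
ACCURACY_BASELINE = 0.85    # minimum acceptable aggregate eval score
P99_LATENCY_BUDGET_S = 2.0  # per-request latency budget

def test_accuracy_meets_baseline(run_eval):
    # run_eval is a hypothetical fixture that runs a small eval set through the
    # real model and returns an aggregate score.
    score = run_eval(dataset="smoke_eval", num_samples=32)
    assert score >= ACCURACY_BASELINE, f"accuracy {score:.3f} below baseline"

def test_p99_latency_within_budget(generate_one):
    # generate_one is a hypothetical fixture issuing one real inference request.
    latencies = []
    for _ in range(20):
        start = time.perf_counter()
        generate_one("describe this image")
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # Nearest-rank p99; on a sample of 20 this is effectively the worst case.
    p99 = latencies[math.ceil(0.99 * len(latencies)) - 1]
    assert p99 <= P99_LATENCY_BUDGET_S
```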
2.3 Test Directory and Execution Files
- Functional Testing:
  - Online Serving: `/tests/e2e/online_serving/test_{model_name}_expansion.py`
  - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}_expansion.py`
  - (Note: the `_expansion.py` suffix likely means the file contains more comprehensive scenario cases compared to the L2 tests.)
L4 Level Testing - Full Functionality, Performance, and Documentation Testing
3.1 Testing Purpose
L4 level testing is a comprehensive quality audit before a version release. It expands upon L3, executing full functional scenarios, conducting systematic performance stress tests, and simultaneously verifying the correctness of accompanying example documentation. Its purpose is to perform deep validation of the system during off-peak nighttime hours, providing quality trend reports for daytime development and data support for release decisions.
3.2 Testing Content and Scope
- Full Functionality Testing: Executes all test cases defined in `test_{model_name}_expansion.py`, covering all implemented features, positive flows, boundary conditions, and exception handling.
- Performance Testing: Uses the `/tests/e2e/perf/nightly.json` configuration file to drive performance-testing tools for stress, load, and endurance tests, collecting metrics such as throughput, response time, and resource utilization. (A sketch of how such a configuration might be consumed follows this list.)
- Documentation Testing: Verifies that the example code provided to users is runnable and that its results match the description.
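The exact schema of `nightly.json` is not specified in this RFC. As a hedged sketch, a driver might read a list of scenario entries and feed them to a benchmark harness; the assumed schema and the `run_benchmark` callable are noted in the comments and are not the project's actual API.

```python
# Illustrative driver for the nightly perf run; schema and helper are assumptions.
import json
from pathlib import Path

def load_scenarios(config_path: str) -> list[dict]:
    # Assumed shape: {"scenarios": [{"model": "...", "batch_size": 8,
    #                                "concurrency": 4, "duration_s": 300}, ...]}
    with Path(config_path).open() as f:
        return json.load(f)["scenarios"]

def run_nightly(run_benchmark, config_path: str = "tests/e2e/perf/nightly.json") -> list[dict]:
    """run_benchmark(**scenario) is expected to launch the serving stack with the
    scenario's settings and return a metrics dict (throughput, latency, utilization)."""
    results = []
    for scenario in load_scenarios(config_path):
        metrics = run_benchmark(**scenario)
        results.append({"scenario": scenario, "metrics": metrics})
    return results
```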
3.3 Test Directory and Execution Files
- Functional Testing: Same directories as L3.
- Performance Test Configuration: `/tests/e2e/perf/nightly.json`
- Documentation Example Tests:
  - `tests/example/online_serving/test_{model_name}.py`
  - `tests/example/offline_inference/test_{model_name}.py`
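A documentation example test can be as simple as executing each published example script and asserting a clean exit, leaving output correctness to the functional levels. The `examples/offline_inference` path below is an assumption about the repository layout.

```python
# tests/example/offline_inference/test_example_scripts.py  (illustrative)
import subprocess
import sys
from pathlib import Path

import pytest

# Hypothetical location of the user-facing example scripts; the real layout may differ.
EXAMPLES_DIR = Path("examples/offline_inference")

@pytest.mark.parametrize("script", sorted(EXAMPLES_DIR.glob("*.py")),
                         ids=lambda p: p.name)
def test_example_runs_cleanly(script):
    # A doc test only asserts that the published example executes end to end.
    result = subprocess.run([sys.executable, str(script)],
                            capture_output=True, text=True, timeout=600)
    assert result.returncode == 0, result.stderr
```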
L5 Level Testing - Stability and Reliability Testing
4.1 Testing Purpose
L5 level testing focuses on the performance of model services under long-running and abnormal fault scenarios. It aims to uncover deep-seated issues that only manifest under sustained pressure or extreme conditions, such as memory leaks, resource contention, gradual performance degradation, and lack of fault tolerance mechanisms. This is the final, yet crucial, line of defense for ensuring service high availability and production environment robustness.
4.2 Testing Content and Scope
- Long-term Stability Testing: Uses the `tests/e2e/stability/weekly.json` configuration to run the service under moderate load for an extended period (e.g., over 12 hours), monitoring whether metrics like memory/VRAM usage, response time, and throughput degrade over time, and whether the service process remains stable.
- Reliability Testing: Uses `tests/e2e/reliability/test_{model_name}.py` to actively simulate various fault and abnormal scenarios, such as dependent-service interruption, abnormal input data, network flicker, and hardware resource preemption, verifying the system's fault tolerance, self-healing, and graceful degradation capabilities. (A sketch of an abnormal-input case follows this list.)
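As a sketch of a reliability case for abnormal input, the test below sends malformed requests to an OpenAI-compatible endpoint and checks that the service rejects them cleanly and keeps serving. The endpoint URL, model name, and expected status codes are illustrative assumptions, and a running server (e.g. started by a session-scoped fixture) is presumed.

```python
# tests/e2e/reliability/test_abnormal_input.py  (illustrative)
import pytest
import requests

# Placeholder address and model; a real test would obtain these from fixtures/config.
BASE_URL = "http://localhost:8000/v1"
MODEL = "dummy-omni-model"

@pytest.mark.parametrize("bad_payload", [
    {"model": MODEL, "messages": []},                                    # empty conversation
    {"model": MODEL, "messages": [{"role": "user"}]},                    # missing content
    {"model": "nonexistent-model",
     "messages": [{"role": "user", "content": "hi"}]},                   # unknown model
])
def test_abnormal_requests_fail_gracefully(bad_payload):
    # Reliability check: malformed requests should yield a clean 4xx error,
    # not a hang or a crashed worker.
    resp = requests.post(f"{BASE_URL}/chat/completions", json=bad_payload, timeout=30)
    assert 400 <= resp.status_code < 500

def test_service_recovers_after_abnormal_traffic():
    # After the abnormal requests above, a well-formed request should still succeed.
    ok = {"model": MODEL, "messages": [{"role": "user", "content": "hello"}]}
    resp = requests.post(f"{BASE_URL}/chat/completions", json=ok, timeout=60)
    assert resp.status_code == 200
```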
4.3 Test Directory and Execution Files
- Stability Test Configuration: `tests/e2e/stability/weekly.json`
- Reliability Test Suite: `tests/e2e/reliability/test_{model_name}.py`
Detailed Implementation Roadmap & To-Do
| Priority | Category | Task | Description |
|---|---|---|---|
| P0 | Documentation | Five-Level CI Test Documentation | Update documentation [#1167] |
| P0 | Build & Automation | Nightly Build Script Implementation | Create test-nightly build script [#867] |
| P0 | Build & Test Organization | L2/L3 Test Case Refactoring | Configure test-merge build script and split existing test cases into L2 and L3 levels [RFC: #1218] [PR: #1272] |
| P0 | Test Capabilities | Performance Test Framework | Develop a public framework for performance tests [RFC: #1313] [PR: #1321] |
| P1 | Test Capabilities | Stability Test Framework | Develop a public framework for stability tests |
| P1 | Test Capabilities | Add E2E & Example Test Cases | Supplement E2E (end-to-end) and example test cases for various models |
| P1 | Test Capabilities | Add UT Test Cases | Supplement unit tests for various components |
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response