[Perf][V1] Fully overlap model execution #2783
base: main
Conversation
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Code Review
This pull request introduces asynchronous model execution to overlap CPU and NPU operations for a performance boost. The core changes involve a new AsyncNPUModelRunnerOutput class to handle non-blocking data transfers and modifications to the model execution pipeline to support this. While the changes are promising for performance, I've identified a critical issue with state management in the asynchronous path that could lead to incorrect model outputs, an unused attribute that should be removed, and a hardcoded path in an example file that hinders usability. Addressing these points will be crucial for the stability and correctness of this new feature.
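For context, the sketch below shows the general shape of such an async output wrapper. It is illustrative only, not the PR's AsyncNPUModelRunnerOutput: the class and attribute names are made up, and it uses CUDA stream/event calls for familiarity, while the PR targets Ascend NPUs whose torch_npu streams and events mirror this interface.

# Minimal sketch of the async-output pattern, assuming CUDA-style streams/events.
# Not the PR's AsyncNPUModelRunnerOutput: it only illustrates launching a
# non-blocking device-to-host copy and deferring the sync until the result is read.
import torch

class AsyncOutputSketch:
    def __init__(self, sampled_token_ids: torch.Tensor,
                 copy_stream: torch.cuda.Stream):
        # Pinned host buffer so the device-to-host copy can run asynchronously.
        self._host_ids = torch.empty(sampled_token_ids.shape,
                                     dtype=sampled_token_ids.dtype,
                                     device="cpu", pin_memory=True)
        self._copy_done = torch.cuda.Event()
        # Order the copy after the forward/sampling work already queued on the
        # current (compute) stream, then launch it on the side stream.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            self._host_ids.copy_(sampled_token_ids, non_blocking=True)
            # Keep the device buffer valid until the side-stream copy completes.
            sampled_token_ids.record_stream(copy_stream)
            self._copy_done.record(copy_stream)

    def get_output(self) -> list[list[int]]:
        # Block only when the sampled tokens are actually consumed; until then
        # the CPU is free to prepare inputs for the next step.
        self._copy_done.synchronize()
        return self._host_ids.tolist()

The key property is that execute_model can return such an object immediately after launching the copy, and the engine only pays the synchronization cost when it actually reads the sampled tokens.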
         self,
         scheduler_output: "SchedulerOutput",
         intermediate_tensors: Optional[IntermediateTensors] = None,
-    ) -> Union[ModelRunnerOutput, torch.Tensor]:
+    ) -> Union[ModelRunnerOutput, AsyncModelRunnerOutput, IntermediateTensors]:
To correctly handle asynchronous scheduling, the worker's CPU-side state must be updated with the actual token IDs from the previous step. This should happen at the beginning of the current step, before preparing inputs. Without this, features like repetition penalty will use stale or incorrect token history.
Please add state update logic at the start of execute_model to synchronize prev_sampled_token_ids and update the CPU-side token history. Here is a code snippet to illustrate the required logic:
if self.use_async_scheduling and self.input_batch.prev_sampled_token_ids is not None:
    # Sync and update state from the previous async step.
    prev_sampled_token_ids_cpu = self.input_batch.prev_sampled_token_ids.tolist()
    prev_req_id_to_index = self.input_batch.prev_req_id_to_index
    assert prev_req_id_to_index is not None
    for req_id, prev_req_idx in prev_req_id_to_index.items():
        if req_id not in self.requests:
            # Request finished or was aborted since the previous step.
            continue
        req_state = self.requests[req_id]
        req_idx = self.input_batch.req_id_to_index.get(req_id)
        if req_idx is None:
            # Request is no longer in the current input batch.
            continue
        sampled_ids = prev_sampled_token_ids_cpu[prev_req_idx]
        if not sampled_ids:
            continue
        # Append the actual sampled tokens to the CPU-side token history.
        req_state.output_token_ids.extend(sampled_ids)
        start_idx = self.input_batch.num_tokens_no_spec[req_idx]
        end_idx = start_idx + len(sampled_ids)
        assert end_idx <= self.model_config.max_model_len
        self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids
        self.input_batch.num_tokens_no_spec[req_idx] = end_idx
        self.input_batch.num_tokens[req_idx] = end_idx
    # Clear the prev step's data.
    self.input_batch.prev_sampled_token_ids = None
    self.input_batch.prev_req_id_to_index = None
@@ -37,7 +37,7 @@ def main():
     # Create a sampling params object.
     sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
     # Create an LLM.
-    llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
+    llm = LLM(model="/home/jp/model/Qwen2.5-0.5B-Instruct",
The model path is hardcoded to a local directory. This will cause the example to fail for other users. Please use a model identifier from a public hub, like Hugging Face Hub, so that the example is runnable out of the box.
llm = LLM(model="/home/jp/model/Qwen2.5-0.5B-Instruct", | |
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", |
self.input_batch.prev_sampled_token_ids_invalid_indices = \
    invalid_req_indices_set
@@ -262,6 +262,11 @@ def __init__(

         self.pooling_params: dict[str, PoolingParams] = {}

+        # Cached reference to the GPU tensor of previously sampled tokens
+        self.prev_sampled_token_ids: Optional[torch.Tensor] = None
+        self.prev_sampled_token_ids_invalid_indices: Optional[set[int]] = None
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed from 19e7592 to 703b715
Force-pushed from e083282 to 74abecc
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: jiangpeng36 <[email protected]>
Signed-off-by: Ronald1995 <[email protected]>
Co-authored-by: Ronald1995 <[email protected]>
Force-pushed from 74abecc to b8caedf
This PR is based on top of #23569 and #24219.
What this PR does / why we need it?
This PR allows the model runner to function asynchronously when async scheduling is used, enabling full overlap of the CPU operations (including prepare_inputs) with the model forward pass. This diff is functional but does not support speculative decoding, PP (pipeline parallelism), or guided decoding.
Expected speedup is 5-10% over the current async scheduling.
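To make the intended overlap concrete, the schematic below sketches an engine loop under async scheduling. All names here (engine_loop, scheduler, worker, pending) are hypothetical and not the PR's actual code; the point is only that the CPU-side work for step N+1 runs before step N's output is resolved.

# Schematic only: hypothetical names, not the PR's implementation.
def engine_loop(scheduler, worker, num_steps: int) -> None:
    pending = None  # async output handle from the previous step, if any
    for _ in range(num_steps):
        # CPU work for this step (scheduling, prepare_inputs) proceeds while
        # the device may still be executing the previous step's forward pass.
        scheduler_output = scheduler.schedule()
        current = worker.execute_model(scheduler_output)  # returns an async handle
        if pending is not None:
            # Resolve the previous step only now, after this step's CPU-side
            # preparation has already overlapped with its device execution.
            scheduler.update_from_output(pending.get_output())
        pending = current
    if pending is not None:
        scheduler.update_from_output(pending.get_output())

In this scheme, reconciling the previously sampled token ids into the CPU-side per-request state, as discussed in the review comment above, naturally happens at the start of the next execute_model call.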
Does this PR introduce any user-facing change?
How was this patch tested?
server
client
Benchmark test based on Qwen3-32B, TPOT result:
Benchmark test based on Qwen2.5-VL-7B-Instruct, TPOT result: