[Perf][V1] Fully overlap model execution #2783
base: main
Conversation
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Code Review
This pull request introduces asynchronous model execution to overlap CPU and NPU operations for a performance boost. The core changes involve a new AsyncNPUModelRunnerOutput class to handle non-blocking data transfers and modifications to the model execution pipeline to support this. While the changes are promising for performance, I've identified a critical issue with state management in the asynchronous path that could lead to incorrect model outputs, an unused attribute that should be removed, and a hardcoded path in an example file that hinders usability. Addressing these points will be crucial for the stability and correctness of this new feature.
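For context, the sketch below shows the general shape of such an async output wrapper. It is illustrative only, not the PR's AsyncNPUModelRunnerOutput: the class and attribute names are made up, and it uses CUDA stream/event calls for familiarity, while the PR targets Ascend NPUs whose torch_npu streams and events mirror this interface.

# Minimal sketch of the async-output pattern, assuming CUDA-style streams/events.
# Not the PR's AsyncNPUModelRunnerOutput: it only illustrates launching a
# non-blocking device-to-host copy and deferring the sync until the result is read.
import torch

class AsyncOutputSketch:
    def __init__(self, sampled_token_ids: torch.Tensor,
                 copy_stream: torch.cuda.Stream):
        # Pinned host buffer so the device-to-host copy can run asynchronously.
        self._host_ids = torch.empty(sampled_token_ids.shape,
                                     dtype=sampled_token_ids.dtype,
                                     device="cpu", pin_memory=True)
        self._copy_done = torch.cuda.Event()
        # Order the copy after the forward/sampling work already queued on the
        # current (compute) stream, then launch it on the side stream.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            self._host_ids.copy_(sampled_token_ids, non_blocking=True)
            # Keep the device buffer valid until the side-stream copy completes.
            sampled_token_ids.record_stream(copy_stream)
            self._copy_done.record(copy_stream)

    def get_output(self) -> list[list[int]]:
        # Block only when the sampled tokens are actually consumed; until then
        # the CPU is free to prepare inputs for the next step.
        self._copy_done.synchronize()
        return self._host_ids.tolist()

The key property is that execute_model can return such an object immediately after launching the copy, and the engine only pays the synchronization cost when it actually reads the sampled tokens.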
         self,
         scheduler_output: "SchedulerOutput",
         intermediate_tensors: Optional[IntermediateTensors] = None,
-    ) -> Union[ModelRunnerOutput, torch.Tensor]:
+    ) -> Union[ModelRunnerOutput, AsyncModelRunnerOutput, IntermediateTensors]:
To correctly handle asynchronous scheduling, the worker's CPU-side state must be updated with the actual token IDs from the previous step. This should happen at the beginning of the current step, before preparing inputs. Without this, features like repetition penalty will use stale or incorrect token history.
Please add state update logic at the start of execute_model to synchronize prev_sampled_token_ids and update the CPU-side token history. Here is a code snippet to illustrate the required logic:
if self.use_async_scheduling and self.input_batch.prev_sampled_token_ids is not None:
    # Sync and update state from the previous async step.
    prev_sampled_token_ids_cpu = self.input_batch.prev_sampled_token_ids.tolist()
    prev_req_id_to_index = self.input_batch.prev_req_id_to_index
    assert prev_req_id_to_index is not None
    for req_id, prev_req_idx in prev_req_id_to_index.items():
        if req_id not in self.requests:
            # Request finished or was aborted since the previous step.
            continue
        req_state = self.requests[req_id]
        req_idx = self.input_batch.req_id_to_index.get(req_id)
        if req_idx is None:
            # Request is no longer in the current input batch.
            continue
        sampled_ids = prev_sampled_token_ids_cpu[prev_req_idx]
        if not sampled_ids:
            continue
        # Append the actual sampled tokens to the CPU-side token history.
        req_state.output_token_ids.extend(sampled_ids)
        start_idx = self.input_batch.num_tokens_no_spec[req_idx]
        end_idx = start_idx + len(sampled_ids)
        assert end_idx <= self.model_config.max_model_len
        self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids
        self.input_batch.num_tokens_no_spec[req_idx] = end_idx
        self.input_batch.num_tokens[req_idx] = end_idx
    # Clear the prev step's data.
    self.input_batch.prev_sampled_token_ids = None
    self.input_batch.prev_req_id_to_index = None
@@ -37,7 +37,7 @@ def main():
     # Create a sampling params object.
     sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
     # Create an LLM.
-    llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
+    llm = LLM(model="/home/jp/model/Qwen2.5-0.5B-Instruct",
The model path is hardcoded to a local directory. This will cause the example to fail for other users. Please use a model identifier from a public hub, like Hugging Face Hub, so that the example is runnable out of the box.
llm = LLM(model="/home/jp/model/Qwen2.5-0.5B-Instruct", | |
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", |
self.input_batch.prev_sampled_token_ids_invalid_indices = \
    invalid_req_indices_set
@@ -262,6 +262,11 @@ def __init__(

         self.pooling_params: dict[str, PoolingParams] = {}

+        # Cached reference to the GPU tensor of previously sampled tokens
+        self.prev_sampled_token_ids: Optional[torch.Tensor] = None
+        self.prev_sampled_token_ids_invalid_indices: Optional[set[int]] = None
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed from 19e7592 to 703b715
Force-pushed from e083282 to 74abecc
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: jiangpeng36 <[email protected]>
Signed-off-by: Ronald1995 <[email protected]>
Co-authored-by: Ronald1995 <[email protected]>
Force-pushed from 74abecc to b8caedf
This PR is based on top of #23569 and #24219.
What this PR does / why we need it?
This PR allows the model runner to function asynchronously when async scheduling is used, enabling full overlap of the CPU operations (including prepare_inputs) with the model forward pass. This diff is functional but does not support speculative decoding, PP (pipeline parallelism), or guided decoding.
Expected speedup is 5-10% over the current async scheduling.
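To make the intended overlap concrete, the schematic below sketches an engine loop under async scheduling. All names here (engine_loop, scheduler, worker, pending) are hypothetical and not the PR's actual code; the point is only that the CPU-side work for step N+1 runs before step N's output is resolved.

# Schematic only: hypothetical names, not the PR's implementation.
def engine_loop(scheduler, worker, num_steps: int) -> None:
    pending = None  # async output handle from the previous step, if any
    for _ in range(num_steps):
        # CPU work for this step (scheduling, prepare_inputs) proceeds while
        # the device may still be executing the previous step's forward pass.
        scheduler_output = scheduler.schedule()
        current = worker.execute_model(scheduler_output)  # returns an async handle
        if pending is not None:
            # Resolve the previous step only now, after this step's CPU-side
            # preparation has already overlapped with its device execution.
            scheduler.update_from_output(pending.get_output())
        pending = current
    if pending is not None:
        scheduler.update_from_output(pending.get_output())

In this scheme, reconciling the previously sampled token ids into the CPU-side per-request state, as discussed in the review comment above, naturally happens at the start of the next execute_model call.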
Does this PR introduce any user-facing change?
How was this patch tested?
server
client
Benchmark test based on Qwen3-32B, TPOT result:
Benchmark test based on Qwen2.5-VL-7B-Instruct, TPOT result: