Add Node to update KV cache in Stateful LLM model #872
base: ovep-develop
Conversation
```cpp
    }
  }
} else if (key == "kvcache_reorder") {
  // Convert kvcache_reorder value format "1,2,3;4,5,6" into two vectors
```
Pull request overview
This PR adds support for KV cache reordering in the OpenVINO stateful LLM model to enable tree-based speculative decoding. It introduces a new subgraph pattern (Gather + ScatterElementsUpdate) that allows OpenVINO to perform KV cache reordering during inference, which can be optimized out by the GPU if not needed.
Key changes:
- Adds new graph nodes (`src_idx`/`dst_idx` parameters and `Gather`/`ScatterElementsUpdate` operations) to enable KV cache manipulation
- Implements the `ReorderKVCache` API across the backend stack with parsing logic for comma-separated index pairs
- Stores reorder indices in `StatefulOVInferRequest` for processing during inference
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| ov_stateful_patch_utils.h | Adds opset12 include for ScatterElementsUpdate operation |
| ov_stateful_patch_utils.cc | Implements the new KV cache reorder subgraph with src_idx/dst_idx parameters and Gather/ScatterElementsUpdate nodes |
| ov_interface.h | Declares ReorderKVCache method and adds member variables for storing reorder indices |
| ov_interface.cc | Implements ReorderKVCache with index validation and tensor population logic using hardcoded shape values |
| openvino_execution_provider.cc | Adds kvcache_reorder option parsing to convert semicolon-delimited string format into index vectors |
| ibackend.h | Adds virtual ReorderKVCache method to IBackend interface |
| basic_backend.h/cc | Implements ReorderKVCache to propagate calls to inference request pool |
| backend_manager.h/cc | Implements ReorderKVCache as pass-through to concrete backend |
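The semicolon/comma format mentioned above ("1,2,3;4,5,6") could be parsed roughly as follows. This is a minimal sketch, not the PR's actual implementation; `ParseKVCacheReorder` and its exact error handling are hypothetical, though the throw-on-negative behavior mirrors the review feedback later in this thread.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not the PR's exact code): splits a value such as
// "1,2,3;4,5,6" at the ';' into source and destination index vectors.
// Negative indices are rejected by throwing, mirroring the review feedback
// that the parser should not merely log a warning.
std::pair<std::vector<size_t>, std::vector<size_t>>
ParseKVCacheReorder(const std::string& value) {
  auto parse_list = [](const std::string& part) {
    std::vector<size_t> indices;
    std::stringstream stream(part);
    std::string token;
    while (std::getline(stream, token, ',')) {
      long long index = std::stoll(token);  // skips leading whitespace
      if (index < 0) {
        throw std::invalid_argument("kvcache_reorder index is < 0");
      }
      indices.push_back(static_cast<size_t>(index));
    }
    return indices;
  };
  const auto sep = value.find(';');
  if (sep == std::string::npos) {
    throw std::invalid_argument("kvcache_reorder expects 'src;dst' format");
  }
  return {parse_list(value.substr(0, sep)), parse_list(value.substr(sep + 1))};
}
```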
```cpp
  src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
}
ovInfReq.set_tensor("src_idx", src_idx_tensor);
ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
```
Copilot AI (Dec 9, 2025):
The hardcoded dimensions {1, 32, kv_dst_indices.size(), 96} in dst_idx_tensor creation appear as magic numbers. These values (32 and 96) should be extracted from the model's KV cache shape or defined as named constants to improve maintainability and prevent issues if model dimensions change.
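The comment could be addressed along these lines. A minimal sketch with hypothetical names; the values 32 (KV heads) and 96 (head size) are simply lifted from the PR's hardcoded literals, and a real fix would query them from the model's KV cache shape rather than fixing them at compile time.

```cpp
#include <cstddef>
#include <vector>

// Named constants for the dimensions flagged in the review. The values are
// assumptions taken from the PR's {1, 32, n, 96} literals, not derived from
// any model; in practice they should come from the KV cache tensor shape.
constexpr size_t kBatchSize = 1;
constexpr size_t kNumKVHeads = 32;
constexpr size_t kHeadSize = 96;

// Builds the dst_idx tensor shape so every call site stays consistent.
std::vector<size_t> MakeDstIdxShape(size_t num_dst_indices) {
  return {kBatchSize, kNumKVHeads, num_dst_indices, kHeadSize};
}
```

With such a helper, both the populated case and the empty fallback (`{1, 32, 0, 96}`) would share one definition of the head dimensions.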
```cpp
for (int i = 0; i < kv_src_indices.size(); ++i) {
  src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
}
ovInfReq.set_tensor("src_idx", src_idx_tensor);
ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
for (int i = 0; i < kv_dst_indices.size(); ++i) {
```
Copilot AI (Dec 9, 2025):
Loop variable i should be size_t instead of int to match the type of kv_src_indices.size() and avoid signed/unsigned comparison warnings.
Suggested change:
```diff
-for (int i = 0; i < kv_src_indices.size(); ++i) {
+for (size_t i = 0; i < kv_src_indices.size(); ++i) {
   src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
 }
 ovInfReq.set_tensor("src_idx", src_idx_tensor);
 ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
-for (int i = 0; i < kv_dst_indices.size(); ++i) {
+for (size_t i = 0; i < kv_dst_indices.size(); ++i) {
```
```cpp
}
ovInfReq.set_tensor("src_idx", src_idx_tensor);
ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
for (int i = 0; i < kv_dst_indices.size(); ++i) {
```
Copilot AI (Dec 9, 2025):
Loop variable i should be size_t instead of int to match the type of kv_dst_indices.size() and avoid signed/unsigned comparison warnings.
Suggested change:
```diff
-for (int i = 0; i < kv_dst_indices.size(); ++i) {
+for (size_t i = 0; i < kv_dst_indices.size(); ++i) {
```
```cpp
kv_src_indices.clear();
kv_dst_indices.clear();
for (int i = 0; i < src_indices.size(); ++i) {
```
Copilot AI (Dec 9, 2025):
Loop variable i should be size_t instead of int to match the type of src_indices.size() and avoid signed/unsigned comparison warnings.
Suggested change:
```diff
-for (int i = 0; i < src_indices.size(); ++i) {
+for (size_t i = 0; i < src_indices.size(); ++i) {
```
```cpp
ovInfReq.set_tensor("dst_idx", dst_idx_tensor);
} else {
  FillTensor("src_idx", ov::element::i32, {0}, 0);
  FillTensor("dst_idx", ov::element::i32, {1, 32, 0, 96}, 0);
```
Copilot AI (Dec 9, 2025):
The hardcoded shape {1, 32, 0, 96} contains magic numbers (32 and 96) that should match the constants used in line 432. These should be extracted as named constants to ensure consistency and improve maintainability.
```cpp
src_token.erase(0, src_token.find_first_not_of(" \t"));
src_token.erase(src_token.find_last_not_of(" \t") + 1);
```
Copilot AI (Dec 9, 2025):
The whitespace trimming logic is duplicated for both src_token and dst_token. Consider extracting this into a helper function (e.g., TrimWhitespace(std::string&)) to reduce code duplication and improve maintainability.
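The suggested helper might look like the following. A minimal sketch; the name `TrimWhitespace` comes from the review comment, and the all-whitespace edge case is handled explicitly here.

```cpp
#include <string>

// Hypothetical TrimWhitespace helper, as suggested by the review, replacing
// the duplicated erase() calls for src_token and dst_token. Trims leading
// and trailing spaces and tabs in place.
void TrimWhitespace(std::string& token) {
  const size_t first = token.find_first_not_of(" \t");
  if (first == std::string::npos) {
    token.clear();  // token was entirely whitespace
    return;
  }
  token.erase(0, first);
  token.erase(token.find_last_not_of(" \t") + 1);
}
```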
```cpp
dst_token.erase(0, dst_token.find_first_not_of(" \t"));
dst_token.erase(dst_token.find_last_not_of(" \t") + 1);
```
Copilot AI (Dec 9, 2025):
The whitespace trimming logic is duplicated for both src_token and dst_token. Consider extracting this into a helper function (e.g., TrimWhitespace(std::string&)) to reduce code duplication and improve maintainability.
```cpp
}

} catch (const std::exception& e) {
  LOGS_DEFAULT(WARNING) << "Conversion for kvcache_reorder string value to int64_t indices failed. "
```
I think here we should actually return an error / throw an exception.
done
```cpp
if (index >= 0) {
  src_indices.push_back(static_cast<size_t>(index));
} else {
  LOGS_DEFAULT(WARNING) << "kvcache_reorder src_index is < 0: " << index;
```
We should throw an exception here.
done
```cpp
if (index >= 0) {
  dst_indices.push_back(static_cast<size_t>(index));
} else {
  LOGS_DEFAULT(WARNING) << "kvcache_reorder dst_index is < 0: " << index;
```
We should throw an exception here.
done
```cpp
  src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
}
ovInfReq.set_tensor("src_idx", src_idx_tensor);
ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
```
The hardcoded 32 and 96 values for this block -- could they be derived by something instead of fixing them as a magic number?
done
```cpp
ov_model->add_parameters({beam_idx});
not_kv_inputs.push_back(beam_idx->get_friendly_name());

auto src_idx = std::make_shared<ov::opset13::Parameter>(ov::element::i32, ov::PartialShape({update_shape[2]}));
```
So do I understand correctly that stateful flow will always add src_idx / dst_idx input tensors to the model?
Yes, it will always be added to the model. For the OV GPU, it will be optimized out if the inputs are all 0s. For the NPU, a flag is added to bypass the KV cache reorder logic.
```cpp
while (std::getline(dst_stream, dst_token, ',')) {
  // Trim whitespace
  dst_token.erase(0, dst_token.find_first_not_of(" \t"));
  dst_token.erase(dst_token.find_last_not_of(" \t") + 1);
```
Almost identical branches for src and dst, consider refactoring
done
```cpp
if (kv_src_indices.size() > 0) {
  ov::Tensor src_idx_tensor = ov::Tensor(ov::element::i32, {kv_src_indices.size()});
  for (int i = 0; i < kv_src_indices.size(); ++i) {
```
Signed-unsigned mismatch
Done, all loop variables switched to auto.
```cpp
  return ovInfReq;
}
virtual void RewindKVCache([[maybe_unused]] size_t index) {}
virtual void ReorderKVCache([[maybe_unused]] const std::vector<size_t>& src_indices, [[maybe_unused]] const std::vector<size_t>& dst_indices) {}
```
Is [[maybe_unused]] really necessary here?
It is not necessary for functionality; the intention is to follow the original style.
Description
This PR adds a small subgraph (`Gather` + `ScatterElementsUpdate`) for the KV cache to allow OpenVINO to reorder the KV cache during model inference. This pattern will be optimized out by the OV GPU if no related information is provided (done in OV 33114). The graph below shows how the PR impacts an ONNX model when the makeStateful() path is triggered.
Motivation and Context
The Microsoft Phi-Silica application leverages tree-based speculative decoding to accelerate LLM inference. This technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering), because only a single branch of the speculative draft tree is accepted after verification.
On the other hand, the current KV Cache API available in OV is very slow and cannot meet MSFT requirements (details in CVS-174809). As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph; this PR serves that purpose.
Open
Does this feature go into the new ABI?
Yes
Jira Ticket :
CVS-176367