Add Node to update KV cache in Stateful LLM model #872

Kotomi-Du · 2025-12-03T20:15:19Z

Description

This PR adds a small subgraph (Gather + ScatterElementsUpdate) for the KV cache so that OpenVINO can reorder the KV cache during model inference. OV GPU optimizes this pattern out when no reorder information is provided (done in OV 33114).
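
As a rough illustration of what a Gather followed by a ScatterElementsUpdate computes for a reorder, the sketch below applies the same idea to a plain host-side buffer. It assumes a KV cache laid out as [batch, heads, seq_len, head_dim] with the sequence axis at index 2, which is an assumption based on the 4-D dst_idx shape used in this PR. `ReorderKVCacheReference` and its parameters are hypothetical names for illustration and are not part of OpenVINO or of this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Host-side reference model of the reorder (hypothetical helper).
// Rows at src[i] along the sequence axis are gathered first, then written
// to dst[i]. Assumed layout: flat [batch, heads, seq_len, head_dim] buffer.
void ReorderKVCacheReference(std::vector<float>& cache,
                             std::size_t batch, std::size_t heads,
                             std::size_t seq_len, std::size_t head_dim,
                             const std::vector<std::size_t>& src,
                             const std::vector<std::size_t>& dst) {
  if (src.size() != dst.size()) {
    throw std::invalid_argument("src and dst must have the same length");
  }
  if (cache.size() != batch * heads * seq_len * head_dim) {
    throw std::invalid_argument("cache size does not match the given shape");
  }
  for (std::size_t i = 0; i < src.size(); ++i) {
    if (src[i] >= seq_len || dst[i] >= seq_len) {
      throw std::out_of_range("index exceeds seq_len");
    }
  }

  std::vector<float> gathered(src.size() * head_dim);
  for (std::size_t plane = 0; plane < batch * heads; ++plane) {
    float* base = cache.data() + plane * seq_len * head_dim;
    // Gather: copy every source row before any destination row is written,
    // so overlapping src/dst ranges do not read already-overwritten data.
    for (std::size_t i = 0; i < src.size(); ++i) {
      std::copy_n(base + src[i] * head_dim, head_dim,
                  gathered.data() + i * head_dim);
    }
    // Scatter: write the gathered rows to their destination positions.
    for (std::size_t i = 0; i < dst.size(); ++i) {
      std::copy_n(gathered.data() + i * head_dim, head_dim,
                  base + dst[i] * head_dim);
    }
  }
}
```

Gathering into a temporary buffer before scattering mirrors the two-node structure of the subgraph: a naive in-place copy would give different results whenever a destination row is also a later source row.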

The graph below shows how the PR changes an ONNX model when the makeStateful() path is triggered.

Motivation and Context

The Microsoft Phi-Silica application leverages tree-based speculative decoding to accelerate LLM inference. This technique requires frequent manipulation of past KV cache states (e.g. trimming, reordering). This is because only a single branch of the speculative draft tree is accepted after verification.

On the other hand, the KV cache API currently available in OV is too slow to meet MSFT requirements (details in CVS-174809). As the OV team suggested, the only way to support the reorder feature is to add specific nodes to the original graph, which is what this PR does.

Open

If NPU does not want this path, a device-specific flag has to be added.

If feature goes to new ABI?

Yes

Jira Ticket :

CVS-176367

onnxruntime/core/providers/openvino/ov_interface.cc

RyanMetcalfeInt8 · 2025-12-09T17:16:56Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

        }
      }
+    } else if (key == "kvcache_reorder") {
+    // Convert kvcache_reorder value format "1,2,3;4,5,6" into two vectors


Suggested change

// Convert kvcache_reorder value format "1,2,3;4,5,6" into two vectors

// Convert kvcache_reorder value format "1,2,3;4,5,6" into two vectors

Copilot

Pull request overview

This PR adds support for KV cache reordering in the OpenVINO stateful LLM model to enable tree-based speculative decoding. It introduces a new subgraph pattern (Gather + ScatterElementsUpdate) that allows OpenVINO to perform KV cache reordering during inference, which can be optimized out by the GPU if not needed.

Key changes:

Adds new graph nodes (src_idx, dst_idx parameters and Gather/ScatterElementsUpdate operations) to enable KV cache manipulation
Implements ReorderKVCache API across the backend stack with parsing logic for comma-separated index pairs
Stores reorder indices in StatefulOVInferRequest for processing during inference

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

File	Description
ov_stateful_patch_utils.h	Adds opset12 include for ScatterElementsUpdate operation
ov_stateful_patch_utils.cc	Implements the new KV cache reorder subgraph with src_idx/dst_idx parameters and Gather/ScatterElementsUpdate nodes
ov_interface.h	Declares ReorderKVCache method and adds member variables for storing reorder indices
ov_interface.cc	Implements ReorderKVCache with index validation and tensor population logic using hardcoded shape values
openvino_execution_provider.cc	Adds kvcache_reorder option parsing to convert semicolon-delimited string format into index vectors
ibackend.h	Adds virtual ReorderKVCache method to IBackend interface
basic_backend.h/cc	Implements ReorderKVCache to propagate calls to inference request pool
backend_manager.h/cc	Implements ReorderKVCache as pass-through to concrete backend

Copilot · 2025-12-09T17:18:03Z

onnxruntime/core/providers/openvino/ov_interface.cc

+      src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
+    }
+    ovInfReq.set_tensor("src_idx", src_idx_tensor);
+    ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});


The hardcoded dimensions {1, 32, kv_dst_indices.size(), 96} in dst_idx_tensor creation appear as magic numbers. These values (32 and 96) should be extracted from the model's KV cache shape or defined as named constants to improve maintainability and prevent issues if model dimensions change.

Copilot · 2025-12-09T17:18:03Z

onnxruntime/core/providers/openvino/ov_interface.cc

+    for (int i = 0; i < kv_src_indices.size(); ++i) {
+      src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
+    }
+    ovInfReq.set_tensor("src_idx", src_idx_tensor);
+    ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
+    for (int i = 0; i < kv_dst_indices.size(); ++i) {


Loop variable i should be size_t instead of int to match the type of kv_src_indices.size() and avoid signed/unsigned comparison warnings.

Suggested change

-    for (int i = 0; i < kv_src_indices.size(); ++i) {
-      src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
-    }
-    ovInfReq.set_tensor("src_idx", src_idx_tensor);
-    ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
-    for (int i = 0; i < kv_dst_indices.size(); ++i) {
+    for (size_t i = 0; i < kv_src_indices.size(); ++i) {
+      src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
+    }
+    ovInfReq.set_tensor("src_idx", src_idx_tensor);
+    ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
+    for (size_t i = 0; i < kv_dst_indices.size(); ++i) {

Copilot · 2025-12-09T17:18:03Z

onnxruntime/core/providers/openvino/ov_interface.cc

+    }
+    ovInfReq.set_tensor("src_idx", src_idx_tensor);
+    ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});
+    for (int i = 0; i < kv_dst_indices.size(); ++i) {


Loop variable i should be size_t instead of int to match the type of kv_dst_indices.size() and avoid signed/unsigned comparison warnings.

Suggested change

-    for (int i = 0; i < kv_dst_indices.size(); ++i) {
+    for (size_t i = 0; i < kv_dst_indices.size(); ++i) {

Copilot · 2025-12-09T17:18:04Z

onnxruntime/core/providers/openvino/ov_interface.cc

+
+  kv_src_indices.clear();
+  kv_dst_indices.clear();
+  for (int i = 0; i < src_indices.size(); ++i) {


Loop variable i should be size_t instead of int to match the type of src_indices.size() and avoid signed/unsigned comparison warnings.

Suggested change

-  for (int i = 0; i < src_indices.size(); ++i) {
+  for (size_t i = 0; i < src_indices.size(); ++i) {

Copilot · 2025-12-09T17:18:04Z

onnxruntime/core/providers/openvino/ov_interface.cc

+    ovInfReq.set_tensor("dst_idx", dst_idx_tensor);
+  } else {
+    FillTensor("src_idx", ov::element::i32, {0}, 0);
+    FillTensor("dst_idx", ov::element::i32, {1, 32, 0, 96}, 0);


The hardcoded shape {1, 32, 0, 96} contains magic numbers (32 and 96) that should match the constants used in line 432. These should be extracted as named constants to ensure consistency and improve maintainability.

Copilot · 2025-12-09T17:18:04Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

+          src_token.erase(0, src_token.find_first_not_of(" \t"));
+          src_token.erase(src_token.find_last_not_of(" \t") + 1);


The whitespace trimming logic is duplicated for both src_token and dst_token. Consider extracting this into a helper function (e.g., TrimWhitespace(std::string&)) to reduce code duplication and improve maintainability.

Copilot · 2025-12-09T17:18:04Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

+          dst_token.erase(0, dst_token.find_first_not_of(" \t"));
+          dst_token.erase(dst_token.find_last_not_of(" \t") + 1);


The whitespace trimming logic is duplicated for both src_token and dst_token. Consider extracting this into a helper function (e.g., TrimWhitespace(std::string&)) to reduce code duplication and improve maintainability.
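
A minimal sketch of the helper suggested above, assuming only spaces and tabs need trimming as in the original code. `TrimWhitespace` is the name proposed in the review comment; the body here is illustrative and is not the code that was merged.

```cpp
#include <string>

// Trims leading and trailing spaces and tabs in place.
// A string made up only of whitespace becomes empty.
void TrimWhitespace(std::string& s) {
  const char* kWhitespace = " \t";
  const auto first = s.find_first_not_of(kWhitespace);
  if (first == std::string::npos) {
    s.clear();
    return;
  }
  const auto last = s.find_last_not_of(kWhitespace);
  s = s.substr(first, last - first + 1);
}
```

With such a helper, both the src_token and dst_token branches could call `TrimWhitespace(token)` instead of repeating the two `erase` calls.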

RyanMetcalfeInt8 · 2025-12-09T17:19:41Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

+        }
+
+      } catch (const std::exception& e) {
+        LOGS_DEFAULT(WARNING) << "Conversion for kvcache_reorder string value to int64_t indices failed. "


I think here we should actually return an error / throw an exception.

RyanMetcalfeInt8 · 2025-12-09T17:20:07Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

+            if (index >= 0) {
+              src_indices.push_back(static_cast<size_t>(index));
+            } else {
+              LOGS_DEFAULT(WARNING) << "kvcache_reorder src_index is < 0: " << index;


We should throw an exception here.

RyanMetcalfeInt8 · 2025-12-09T17:20:18Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

+            if (index >= 0) {
+              dst_indices.push_back(static_cast<size_t>(index));
+            } else {
+              LOGS_DEFAULT(WARNING) << "kvcache_reorder dst_index is < 0: " << index;


We should throw an exception here.
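
One way to act on these comments is to parse both lists through a single function that throws instead of logging a warning. The sketch below is an illustration under assumptions: `ParseIndexList` and `ParseKVCacheReorder` are hypothetical names, the choice of `std::invalid_argument` is arbitrary (the provider may prefer its own error macros), and the requirement that both lists have the same length is assumed rather than taken from the PR.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Parses one comma-separated list such as "1,2,3" (hypothetical helper).
// Throws std::invalid_argument on empty, malformed, or negative entries.
std::vector<std::size_t> ParseIndexList(const std::string& list,
                                        const std::string& name) {
  std::vector<std::size_t> indices;
  std::istringstream stream(list);
  std::string token;
  while (std::getline(stream, token, ',')) {
    // Trim spaces and tabs around the token.
    const auto first = token.find_first_not_of(" \t");
    if (first == std::string::npos) {
      throw std::invalid_argument("kvcache_reorder: empty " + name + " entry");
    }
    const auto last = token.find_last_not_of(" \t");
    token = token.substr(first, last - first + 1);

    std::size_t consumed = 0;
    long long value = 0;
    try {
      value = std::stoll(token, &consumed);
    } catch (const std::exception&) {
      throw std::invalid_argument("kvcache_reorder: invalid " + name +
                                  " index '" + token + "'");
    }
    if (consumed != token.size() || value < 0) {
      throw std::invalid_argument("kvcache_reorder: invalid " + name +
                                  " index '" + token + "'");
    }
    indices.push_back(static_cast<std::size_t>(value));
  }
  return indices;
}

// Parses "src;dst", e.g. "1,2,3;4,5,6", into (src_indices, dst_indices).
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
ParseKVCacheReorder(const std::string& value) {
  const auto sep = value.find(';');
  if (sep == std::string::npos ||
      value.find(';', sep + 1) != std::string::npos) {
    throw std::invalid_argument("kvcache_reorder: expected exactly one ';'");
  }
  auto src = ParseIndexList(value.substr(0, sep), "src");
  auto dst = ParseIndexList(value.substr(sep + 1), "dst");
  if (src.size() != dst.size()) {  // assumed requirement, see lead-in
    throw std::invalid_argument("kvcache_reorder: src/dst length mismatch");
  }
  return {std::move(src), std::move(dst)};
}
```

Routing both halves through `ParseIndexList` also removes the near-identical src and dst branches.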

RyanMetcalfeInt8 · 2025-12-09T17:21:30Z

onnxruntime/core/providers/openvino/ov_interface.cc

+      src_idx_tensor.data<int32_t>()[i] = int32_t(kv_src_indices[i]);
+    }
+    ovInfReq.set_tensor("src_idx", src_idx_tensor);
+    ov::Tensor dst_idx_tensor = ov::Tensor(ov::element::i32, {1, 32, kv_dst_indices.size(), 96});


The hardcoded 32 and 96 values for this block -- could they be derived by something instead of fixing them as a magic number?
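
A possible direction, sketched with plain vectors rather than OpenVINO types: take the head count and head size from the KV cache shape and substitute only the sequence dimension. `MakeDstIdxShape` and `ExpandDstIndices` are hypothetical helpers, and the sketch assumes a static 4-D cache shape [batch, heads, seq_len, head_dim] and that each destination index is broadcast across all heads and head_dim elements. How the shape is actually obtained from the model is not shown in this PR excerpt.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Builds the dst_idx shape from the KV cache shape (hypothetical helper),
// replacing the hardcoded {1, 32, N, 96}.
std::vector<std::size_t> MakeDstIdxShape(const std::vector<std::size_t>& kv_shape,
                                         std::size_t num_indices) {
  if (kv_shape.size() != 4) {
    throw std::invalid_argument("expected a 4-D KV cache shape");
  }
  return {kv_shape[0], kv_shape[1], num_indices, kv_shape[3]};
}

// Expands dst_indices to a flat buffer of the given 4-D shape so that every
// element in row i (along axis 2) holds dst_indices[i].
std::vector<std::int32_t> ExpandDstIndices(const std::vector<std::size_t>& shape,
                                           const std::vector<std::size_t>& dst_indices) {
  if (shape.size() != 4 || shape[2] != dst_indices.size()) {
    throw std::invalid_argument("shape does not match dst_indices");
  }
  const std::size_t planes = shape[0] * shape[1];
  const std::size_t rows = shape[2];
  const std::size_t cols = shape[3];
  const auto max_index =
      static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  std::vector<std::int32_t> out(planes * rows * cols);
  for (std::size_t p = 0; p < planes; ++p) {
    for (std::size_t i = 0; i < rows; ++i) {
      if (dst_indices[i] > max_index) {
        throw std::out_of_range("index does not fit in int32");
      }
      for (std::size_t d = 0; d < cols; ++d) {
        out[(p * rows + i) * cols + d] = static_cast<std::int32_t>(dst_indices[i]);
      }
    }
  }
  return out;
}
```

The same `MakeDstIdxShape` call with `num_indices = 0` would cover the empty-reorder branch, so the two code paths cannot drift apart.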

RyanMetcalfeInt8 · 2025-12-09T17:25:35Z

onnxruntime/core/providers/openvino/ov_stateful_patch_utils.cc

  ov_model->add_parameters({beam_idx});
  not_kv_inputs.push_back(beam_idx->get_friendly_name());

+  auto src_idx = std::make_shared<ov::opset13::Parameter>(ov::element::i32, ov::PartialShape({update_shape[2]}));


So do I understand correctly that stateful flow will always add src_idx / dst_idx input tensors to the model?

Yes, they will always be added to the model. For OV GPU, the subgraph is optimized out if the inputs are all 0s. For NPU, a flag is added to bypass the KV cache reorder logic.

mklimenk · 2025-12-11T10:07:05Z

onnxruntime/core/providers/openvino/openvino_execution_provider.cc

+        while (std::getline(dst_stream, dst_token, ',')) {
+          // Trim whitespace
+          dst_token.erase(0, dst_token.find_first_not_of(" \t"));
+          dst_token.erase(dst_token.find_last_not_of(" \t") + 1);


Almost identical branches for src and dst, consider refactoring

mklimenk · 2025-12-11T10:07:17Z

onnxruntime/core/providers/openvino/ov_interface.cc


+  if (kv_src_indices.size() > 0) {
+    ov::Tensor src_idx_tensor = ov::Tensor(ov::element::i32, {kv_src_indices.size()});
+    for (int i = 0; i < kv_src_indices.size(); ++i) {


Signed-unsigned mismatch

All switched to auto.

mklimenk · 2025-12-11T10:08:50Z

onnxruntime/core/providers/openvino/ov_interface.h

    return ovInfReq;
  }
  virtual void RewindKVCache([[maybe_unused]] size_t index) {}
+  virtual void ReorderKVCache([[maybe_unused]] const std::vector<size_t>& src_indices,  [[maybe_unused]] const std::vector<size_t>& dst_indices) {}


Is [[maybe_unused]] really necessary here?

It is not necessary for functionality. The intention is mainly to follow the existing style.

mdvoretc-intel and others added 6 commits December 2, 2025 03:51

Reorder KV cache using the new gather_by_axis API

0762470

Do a ScatterElementsUpdate-based reorder during execution

1426c2a

Get variable update lengths from incoming indices

2945283

Make changes to support new KVCache fusion

5d00226

Add proper include

d884cd5

add reorder KV cache API

7676b30

Kotomi-Du marked this pull request as draft December 3, 2025 20:15

mdvoretc-intel reviewed Dec 4, 2025

onnxruntime/core/providers/openvino/ov_interface.cc (outdated)

Kotomi-Du force-pushed the update_kvcache_node branch from 2a0d722 to 899feb5 December 6, 2025 01:31

clean up code

5432bd4

Kotomi-Du force-pushed the update_kvcache_node branch from 899feb5 to 5432bd4 December 6, 2025 01:32

Kotomi-Du marked this pull request as ready for review December 9, 2025 05:03

Kotomi-Du requested review from MayureshV1 and RyanMetcalfeInt8 December 9, 2025 05:03

add post process for internal handled inputs

c7f57bb

RyanMetcalfeInt8 requested a review from Copilot December 9, 2025 17:16

RyanMetcalfeInt8 reviewed Dec 9, 2025

Copilot AI reviewed Dec 9, 2025

RyanMetcalfeInt8 reviewed Dec 9, 2025

mklimenk reviewed Dec 11, 2025

Kotomi-Du added 3 commits December 11, 2025 21:45

disable update_kvcache for npu + pass kv info

8f464d6

refactor code

203ee33

minor change

7d201fa
