Commit
Overlap KV cache update for WindowedKeyValueCache in DecoderOnlyPipelineState (#1222)

# Description

Add support for overlapping the KV cache update for a pipelined model part with the graph execution of other pipelined model parts. It only applies to `DecoderOnlyPipelineState` with `WindowedKeyValueCache`.

For example, consider a model with two parts (graph[1] and graph[2]) that have KV caches. This is the approach in this PR:

```
iter 1 graph[1] run | -
iter 1 graph[2] run | iter 1 graph[1] KV cache update
iter 2 graph[1] run | iter 1 graph[2] KV cache update
iter 2 graph[2] run | iter 2 graph[1] KV cache update
iter 3 graph[1] run | iter 2 graph[2] KV cache update
```

For comparison, this is the existing approach:

```
iter 1 graph[1] run
iter 1 graph[2] run
iter 1 graph[1] KV cache update
iter 1 graph[2] KV cache update
iter 2 graph[1] run
iter 2 graph[2] run
iter 2 graph[1] KV cache update
iter 2 graph[2] KV cache update
```

# Measurements

Token generation rate with QNN EP 3-part Llama3.2 3B:

Baseline: 15.5328 tokens/sec
Updated: 18.2981 tokens/sec

Prompt processing logic is unchanged.
Showing 20 changed files with 718 additions and 350 deletions.
@@ -0,0 +1,18 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#pragma once

#include <string>
#include <sstream>

namespace Generators {

template <typename... Args>
inline std::string MakeString(Args&&... args) {
  std::ostringstream s;
  (s << ... << std::forward<Args>(args));
  return s.str();
}

} // namespace Generators
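As a usage illustration, `MakeString` streams every argument into a single `std::ostringstream` through a C++17 fold expression, so anything with an `operator<<` overload can be mixed in one call. The sketch below repeats the helper so that it compiles on its own; the `<utility>` include (for `std::forward`) and the `DescribeCacheUpdate` function are additions made for this example and are not part of the commit.

```cpp
#include <sstream>
#include <string>
#include <utility>  // std::forward (added for this standalone sketch)

namespace Generators {

template <typename... Args>
inline std::string MakeString(Args&&... args) {
  std::ostringstream s;
  // Binary left fold: expands to ((s << a1) << a2) << ... for all arguments.
  (s << ... << std::forward<Args>(args));
  return s.str();
}

}  // namespace Generators

// Hypothetical caller: formats one entry of the schedule shown in the
// commit description.
inline std::string DescribeCacheUpdate(int iter, int graph_index) {
  return Generators::MakeString("iter ", iter, " graph[", graph_index,
                                "] KV cache update");
}
```

For instance, `DescribeCacheUpdate(1, 2)` yields `iter 1 graph[2] KV cache update`, and calling `MakeString()` with no arguments returns an empty string because the fold collapses to the stream itself.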