feat: DIS-373 dynamo KVBM connector API integration with TRTLLM #2544
Merged
Changes from all commits (17 commits):
- 147dc1b trtllm integration connector api (richardhuo-nv)
- 1bdfd37 fix computed tokens (richardhuo-nv)
- 85a7563 fix layout (richardhuo-nv)
- af20c6f fmt and rebase (richardhuo-nv)
- cf771ce fix fmt (richardhuo-nv)
- cba8a85 fix tests (richardhuo-nv)
- 63aa8f3 resolve comments (richardhuo-nv)
- 7b5aed3 fix fmt (richardhuo-nv)
- 48ae78e integrate vllm (richardhuo-nv)
- e9651ee fix tests (richardhuo-nv)
- 5bf7c50 resolve comments (richardhuo-nv)
- 8727bbc fix repo checkout (richardhuo-nv)
- 8c7dd17 fix doc (richardhuo-nv)
- 04c3cdb fix (richardhuo-nv)
- c68ecc8 fix doc (richardhuo-nv)
- 16c4621 remove metrics in readme (richardhuo-nv)
- 8c7068a use dynamo in the readme example (richardhuo-nv)
New file (101 additions):

<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Running KVBM in TensorRT-LLM

This guide explains how to leverage KVBM (KV Block Manager) to manage the KV cache and perform KV offloading in TensorRT-LLM (trtllm).

To learn more about KVBM, see the [KVBM introduction](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html).

> [!Note]
> - Ensure that `etcd` and `nats` are running before starting.
> - KVBM does not currently support CUDA graphs in TensorRT-LLM.
> - KVBM only supports TensorRT-LLM's PyTorch backend.
> - To enable disk cache offloading, you must also enable CPU memory cache offloading.
> - Set `enable_partial_reuse: false` under `kv_cache_config` in the LLM API config to increase offloading cache hits (the Quick Start below shows a full config).
> - KVBM requires TensorRT-LLM at commit ce580ce4f52af3ad0043a800b3f9469e1f1109f6 or newer.
> - Enabling KVBM metrics with TensorRT-LLM is still a work in progress.

## Quick Start

To use KVBM in TensorRT-LLM, follow the steps below:

```bash
# start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d

# Build a container that includes TensorRT-LLM and KVBM.
# Note: KVBM integration is only available in TensorRT-LLM commit ce580ce4f52af3ad0043a800b3f9469e1f1109f6 or newer.
./container/build.sh --framework trtllm --tensorrtllm-commit ce580ce4f52af3ad0043a800b3f9469e1f1109f6 --enable-kvbm

# launch the container
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds

# enable KV offloading to CPU memory
# (60 means 60 GB of pinned CPU memory will be used)
export DYN_KVBM_CPU_CACHE_GB=60

# enable KV offloading to disk
# (20 means 20 GB of disk will be used)
# Note: disk cache offloading requires CPU memory cache offloading to be enabled as well.
export DYN_KVBM_DISK_CACHE_GB=20

# Allocating memory and disk storage can take some time, so we recommend
# setting a higher timeout for leader-worker initialization.
# (1200 means a 1200-second timeout)
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
```

```bash
# write an example LLM API config
# Note: set "enable_partial_reuse: false" under "kv_cache_config" to increase offloading cache hits.
cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
kv_connector_config:
  connector_module: dynamo.llm.trtllm_integration.connector
  connector_scheduler_class: DynamoKVBMConnectorLeader
  connector_worker_class: DynamoKVBMConnectorWorker
EOF

# start the dynamo frontend
python3 -m dynamo.frontend --http-port 8000 &

# serve an LLM model with dynamo
python3 -m dynamo.trtllm \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --extra-engine-args /tmp/kvbm_llm_api_config.yaml &

# make a call to the LLM
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [
    {
      "role": "user",
      "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
  ],
  "stream": false,
  "max_tokens": 30
}'

# Optionally, serve the model with trtllm-serve instead to use the KVBM feature:
trtllm-serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --host localhost --port 8001 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (7 changes: 5 additions and 2 deletions)

```diff
@@ -1,6 +1,9 @@
 // SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0

-pub fn get_barrier_id() -> String {
-    std::env::var("DYN_KVBM_BARRIER_ID").unwrap_or("kvbm".to_string())
+pub fn get_barrier_id_prefix() -> String {
+    std::env::var("DYN_KVBM_BARRIER_ID_PREFIX")
+        .ok()
+        .filter(|s| !s.trim().is_empty())
+        .unwrap_or_else(|| "kvbm".to_string())
 }
```
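
For context, a minimal, self-contained sketch of how the renamed helper behaves. The consumer in `main` and the `-leader` suffix are hypothetical illustrations; the real call sites are outside this diff.

```rust
// Copy of the new helper from utils.rs, inlined so the sketch compiles on its own.
fn get_barrier_id_prefix() -> String {
    std::env::var("DYN_KVBM_BARRIER_ID_PREFIX")
        .ok()
        .filter(|s| !s.trim().is_empty()) // empty/whitespace-only values fall back too
        .unwrap_or_else(|| "kvbm".to_string())
}

fn main() {
    // With DYN_KVBM_BARRIER_ID_PREFIX unset (or set to "" or "   "),
    // the helper returns the default prefix "kvbm".
    let prefix = get_barrier_id_prefix();

    // Hypothetical consumer: derive a namespaced barrier id from the prefix,
    // so multiple KVBM deployments sharing one etcd can use distinct barriers.
    let barrier_id = format!("{prefix}-leader");
    println!("{barrier_id}"); // e.g. "kvbm-leader"
}
```

Relative to the old `get_barrier_id`, the added `.ok().filter(...)` chain also treats an empty or whitespace-only environment value as unset instead of producing an empty barrier id.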