Commit 6d62fc7

feat: Deployment for EPP as a static library (#3314)
Signed-off-by: Anna Tchernych <[email protected]>
1 parent af7a41c commit 6d62fc7

6 files changed

+1127
-81
lines changed

deploy/inference-gateway/README.md

Lines changed: 78 additions & 21 deletions
@@ -5,7 +5,7 @@ This guide demonstrates two setups.
 - The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
 - The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.

-EPP’s default approach is token-aware only `by approximation` because it relies on the non-tokenized text in the prompt. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
+EPP’s default KV-routing approach is token-aware only by approximation, because it tokenizes the prompt with a generic tokenizer that is unaware of the deployed model. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).

 Currently, these setups are only supported with the kGateway based Inference Gateway.

@@ -87,6 +87,28 @@ kubectl apply -f agg.yaml -n my-model
 ```
 Take note of, or change, the DYNAMO_IMAGE in the model deployment file.

+
+Do not forget to create a Docker registry secret if needed.
+```bash
+kubectl create secret docker-registry docker-imagepullsecret \
+  --docker-server=$DOCKER_SERVER \
+  --docker-username=$DOCKER_USERNAME \
+  --docker-password=$DOCKER_PASSWORD \
+  --namespace=$NAMESPACE
+```
+
+Do not forget to include the HuggingFace token if required.
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
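Both `kubectl create secret` commands above expand shell variables, and an unset variable expands to an empty string rather than raising an error. A minimal sketch of a guard that could run first is shown below. The helper name `check_secret_env` is hypothetical and not part of the Dynamo repo; the variable names are the ones used in the commands above.

```bash
# Hypothetical helper (not part of the Dynamo repo): verify that every variable
# used by the two `kubectl create secret` commands is set and non-empty.
# The function body runs in a subshell, so a failed check does not terminate
# the calling script; it only returns a non-zero status.
check_secret_env() (
  : "${DOCKER_SERVER:?set DOCKER_SERVER to your registry host}"
  : "${DOCKER_USERNAME:?set DOCKER_USERNAME}"
  : "${DOCKER_PASSWORD:?set DOCKER_PASSWORD}"
  : "${HF_TOKEN:?set HF_TOKEN to your HuggingFace token}"
  : "${NAMESPACE:?set NAMESPACE to the target Kubernetes namespace}"
)

# Usage: only create the secrets once the check passes.
# check_secret_env && kubectl create secret generic hf-token-secret \
#   --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}
```

The `${VAR:?message}` expansion is standard shell syntax: it prints the message to stderr and fails when the variable is unset or empty.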
+
+Create a model configuration file similar to `vllm_agg_qwen.yaml` for your model.
+This file demonstrates the values needed for the vLLM Agg setup in [agg.yaml](../../components/backends/vllm/deploy/agg.yaml).
+Take note of the model's block size provided in the model card.
+
 ### 4. Install Dynamo GAIE helm chart ###

 The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
@@ -95,7 +117,7 @@ Deploy the Inference Gateway resources to your Kubernetes cluster by running one

 #### Basic Black Box Integration ####

-For the basic black box integration run:
+The basic black box integration uses the standard EPP image `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. To install it, run:

 ```bash
 cd deploy/inference-gateway
@@ -104,9 +126,13 @@ helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml

 #### EPP-aware Integration with the custom Dynamo Plugin ####

+Dynamo provides a custom routing plugin, `pkg/epp/scheduling/plugins/dynamo_kv_scorer/plugin.go`, to perform efficient KV routing.
+The Dynamo router is built as a static library that the EPP router calls to provide fast inference.
+You can either use the image `nvcr.io/nvstaging/ai-dynamo/epp-inference-extension-dynamo:v0.6.0-1` as the `EPP_IMAGE` in the Helm deployment command and proceed to step 2, or build the image yourself by following the steps below.
+
 ##### 1. Build the custom EPP image #####

-We provide git patches for you to use.
+If you choose to build your own image, use the steps below. Otherwise, proceed to step 2 to deploy with Helm.

 ##### 1.1 Clone the official GAIE repo in a separate folder #####

@@ -116,44 +142,74 @@ cd gateway-api-inference-extension
 git checkout v0.5.1
 ```

-##### 1.2 Apply patch(es) #####
+##### 1.2 Build the Dynamo Custom EPP #####
+
+
+
+###### 1.2.1 Clone the official EPP repo ######
+
+```bash
+# Clone the official GAIE repo in a separate folder
+git clone git@github.com:kubernetes-sigs/gateway-api-inference-extension.git
+cd gateway-api-inference-extension
+git checkout v0.5.1
+```
+
+###### 1.2.2 Run the script to build the EPP image ######
+
+The script applies a custom patch to the code in your GAIE repo and builds the image for you to use.

 ```bash
-git apply <dynamo-folder>/deploy/inference-gateway/epp-patches/v0.5.1-1/epp-v0.5.1-dyn1.patch
+# Use your custom paths
+export DYNAMO_DIR=/path/to/dynamo
+export EPP_DIR=/path/to/gateway-api-inference-extension
+
+# Run the script
+cd deploy/inference-gateway
+./build-epp-dynamo.sh
 ```
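The `build-epp-dynamo.sh` script added in this commit invokes `cargo`, `cbindgen`, `git`, and `make`, so a missing tool only surfaces when the script reaches the step that needs it. A minimal pre-flight sketch is shown below. The helper `missing_tools` is hypothetical and not part of the script; the tool list is taken from the commands the script runs.

```bash
# Hypothetical pre-flight helper (not part of build-epp-dynamo.sh):
# print the name of every listed tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: the tool list mirrors the commands the build script invokes.
# missing=$(missing_tools cargo cbindgen git make)
# [ -z "$missing" ] || { echo "Install first: $missing" >&2; exit 1; }
```

Note that the script already falls back to a bundled header when `cbindgen` fails, so treating `cbindgen` as required is a stricter choice than the script itself makes.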

-##### 1.3 Build the custom EPP image #####
+Under the hood, the script builds the Dynamo router static library, applies the Dynamo patch to the EPP code base, and builds a custom EPP image with the library.
+Re-tag the freshly built image and push it to your registry.

 ```bash
-# Build the image <your-docker-registry/dynamo-custom-epp:<your-tag> and then manually push
-make image-local-load \
-  IMAGE_REGISTRY=<your-docker-registry> \
-  IMAGE_NAME=dynamo-custom-epp \
-  EXTRA_TAG=<your-tag>
-
-# Or run the command below to build push to your registry
-make image-local-push \
-  IMAGE_REGISTRY=<your-docker-registry> \
-  IMAGE_NAME=dynamo-custom-epp \
-  EXTRA_TAG=<your-tag>
+docker images
+docker tag <your-new-id> <your-image-tag>
+docker push <your-image-tag>
 ```

-##### 2. Install through helm #####
+##### 2. Deploy through helm #####

 ```bash
 cd deploy/inference-gateway

 # Export the Dynamo image you have used when deploying your model in Step 3.
 export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
-export EPP_IMAGE=<the-epp-image-you-built> # i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
+# Export the image tag you used when building the EPP, e.g. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-2
+export EPP_IMAGE=<the-epp-image-you-built>
+```

+**Configuration**
+You can configure the plugin by setting environment variables in your [values-epp-aware.yaml].
+- Overwrite the `DYNAMO_NAMESPACE` env var if needed to match your model's Dynamo namespace.
+- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative, meaning this check is disabled.
+- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher that keeps multiple router instances in sync (important if you run more than one KV router per component).
+- By default the Dynamo plugin uses KV routing. Set `DYNAMO_USE_KV_ROUTING=false` in your [values-epp-aware.yaml] if you prefer round-robin routing.
+- If using KV routing:
+  - Overwrite `DYNAMO_KV_BLOCK_SIZE` in your [values-epp-aware.yaml](./values-epp-aware.yaml) to match your model's block size. The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
+  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). A higher weight biases routing toward workers with similar cached prefixes.
+  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. A low temperature makes the router pick the top candidate deterministically; a higher temperature lets lower-scoring workers through more often (exploration).
+  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using KV routing.
+  - See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details.
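To build intuition for `DYNAMO_ROUTER_TEMPERATURE`, the sketch below computes softmax selection probabilities over a few worker scores. This is an illustration only: the README does not state the formula the Dynamo router uses, so the softmax form, the "higher score is better" convention, and the helper name `softmax_probs` are all assumptions.

```bash
# Illustrative only, NOT Dynamo's selection code: softmax over worker scores.
# Usage: softmax_probs <temperature> <score>...   (temperature must be > 0)
# Prints one selection probability per score, in input order.
softmax_probs() {
  local temperature="$1"
  shift
  printf '%s\n' "$@" | awk -v t="$temperature" '
    { score[NR] = $1; if (NR == 1 || $1 > max) max = $1 }
    END {
      # Subtract the maximum before exponentiating for numerical stability.
      for (i = 1; i <= NR; i++) { w[i] = exp((score[i] - max) / t); sum += w[i] }
      for (i = 1; i <= NR; i++) printf "%.3f\n", w[i] / sum
    }'
}

# Hypothetical scores for two workers:
# softmax_probs 0.01 2 1   # prints 1.000 then 0.000 (near-deterministic pick)
# softmax_probs 1 1 1      # prints 0.500 then 0.500 (equal scores, even split)
```

Under these assumptions, lowering the temperature concentrates probability on the top-scoring worker, while raising it spreads probability toward lower-scoring workers, which matches the behavior the bullet describes.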
+
+
+```bash
 helm upgrade --install dynamo-gaie ./helm/dynamo-gaie \
   -n my-model \
   -f ./vllm_agg_qwen.yaml \
   -f ./values-epp-aware.yaml \
   --set eppAware.enabled=true \
-  --set-string eppAware.eppImage=$EPP_IMAGE \
-  --set-string eppAware.sidecar.image=$DYNAMO_IMAGE
+  --set-string eppAware.eppImage=$EPP_IMAGE
 ```


@@ -162,6 +218,7 @@ Key configurations include:
 - A service for the inference gateway
 - Required RBAC roles and bindings
 - RBAC permissions
+- `values-epp-aware.yaml` sets `eppAware.dynamoNamespace=vllm-agg` for the bundled example. Point it at your actual Dynamo namespace by editing that file or adding `--set eppAware.dynamoNamespace=<namespace>` (and likewise for `dynamoComponent` and `dynamoKvBlockSize` if they differ).

 ### 5. Verify Installation ###

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+set -e # Exit on any error
+
+# Configuration - Set these environment variables before running
+if [[ -z "${DYNAMO_DIR}" ]]; then
+    echo "DYNAMO_DIR environment variable must be set"
+    echo " Example: export DYNAMO_DIR=/path/to/dynamo"
+    exit 1
+fi
+
+if [[ -z "${EPP_DIR}" ]]; then
+    echo "EPP_DIR environment variable must be set"
+    echo " Example: export EPP_DIR=/path/to/gateway-api-inference-extension-dynamo"
+    exit 1
+fi
+DYNAMO_LIB_DIR="${EPP_DIR}/pkg/epp/scheduling/plugins/dynamo_kv_scorer/lib"
+DYNAMO_INCLUDE_DIR="${EPP_DIR}/pkg/epp/scheduling/plugins/dynamo_kv_scorer/include"
+
+echo "🏗️ Building Dynamo KV Router C Library..."
+
+# Step 1: Build the static library
+echo "📦 Building static library..."
+cd "${DYNAMO_DIR}"
+cargo build --release -p libdynamo_llm
+
+# Step 2: Generate header file (with fallback)
+echo "📝 Generating C header..."
+HEADER_OUTPUT="${DYNAMO_DIR}/lib/bindings/c/include/nvidia/dynamo_llm/llm_engine.h"
+
+if ! cbindgen --config lib/bindings/c/cbindgen.toml --crate libdynamo_llm --output "${HEADER_OUTPUT}"; then
+    echo "cbindgen failed, using fallback header..."
+    cp "${DYNAMO_DIR}/lib/bindings/c/src/fallback_header.h" "${HEADER_OUTPUT}"
+fi
+
+# Step 3: Ensure EPP directories exist
+echo "Preparing EPP directories..."
+mkdir -p "${DYNAMO_LIB_DIR}"
+mkdir -p "${DYNAMO_INCLUDE_DIR}"
+
+# Step 4: Copy files to EPP
+echo "Copying files to EPP..."
+cp "${HEADER_OUTPUT}" "${DYNAMO_INCLUDE_DIR}/"
+cp "${DYNAMO_DIR}/target/release/libdynamo_llm_capi.a" "${DYNAMO_LIB_DIR}/"
+
+# Verify files were copied
+if [[ ! -f "${DYNAMO_INCLUDE_DIR}/llm_engine.h" ]]; then
+    echo "Header file copy failed!"
+    exit 1
+fi
+
+if [[ ! -f "${DYNAMO_LIB_DIR}/libdynamo_llm_capi.a" ]]; then
+    echo "Library file copy failed!"
+    exit 1
+fi
+
+echo "Files copied successfully:"
+echo " Header: ${DYNAMO_INCLUDE_DIR}/llm_engine.h"
+echo " Library: ${DYNAMO_LIB_DIR}/libdynamo_llm_capi.a"
+
+# Step 5: Apply Dynamo patch (if it exists)
+echo "🔧 Applying Dynamo patch..."
+cd "${EPP_DIR}"
+
+PATCH_FILE="${DYNAMO_DIR}/deploy/inference-gateway/epp-patches/v0.5.1-2/epp-v0.5.1-dyn2.patch"
+if [[ -f "${PATCH_FILE}" ]]; then
+    if git apply --check "${PATCH_FILE}" 2>/dev/null; then
+        git apply "${PATCH_FILE}"
+        echo "Patch applied successfully"
+    else
+        echo "Patch doesn't apply cleanly - may already be applied or need manual resolution"
+    fi
+else
+    echo "No patch file found at ${PATCH_FILE}"
+fi
+
+# Step 6: Build the EPP image
+echo "Building the EPP image..."
+make dynamo-image-local-load
+
+echo "EPP with Dynamo KV routing built"
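
Step 5 of the script prints the same message whether the patch is already applied or genuinely conflicts, and then continues to the image build in both cases. A sketch of how the two cases could be told apart is shown below. It relies on `git apply --reverse --check`, which succeeds when the patch's changes are already present in the work tree. The helper name `patch_state` is hypothetical and not part of the script.

```bash
# Hypothetical helper (not part of build-epp-dynamo.sh): classify a patch
# against the current work tree.
# Prints one of: applicable, already-applied, conflict.
patch_state() {
  local patch_file="$1"
  if git apply --check "$patch_file" 2>/dev/null; then
    echo "applicable"
  elif git apply --reverse --check "$patch_file" 2>/dev/null; then
    echo "already-applied"
  else
    echo "conflict"
  fi
}

# Usage inside the EPP checkout:
# case "$(patch_state "${PATCH_FILE}")" in
#   applicable)      git apply "${PATCH_FILE}" ;;
#   already-applied) echo "Patch already applied, skipping" ;;
#   conflict)        echo "Patch conflicts, resolve manually" >&2; exit 1 ;;
# esac
```

Whether a conflict should abort the build is a design choice; the original script deliberately continues, so exiting here is an assumption about the stricter behavior some users may prefer.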
