deploy/inference-gateway/README.md
78 additions & 21 deletions
@@ -5,7 +5,7 @@ This guide demonstrates two setups.
- The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
-
EPP’s default approach is token-aware only `by approximation` because it relies on the non-tokenized text in the prompt. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
+
EPP’s default KV-routing approach is token-aware only by approximation, because it tokenizes the prompt with a generic tokenizer that is unaware of the deployed model. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Currently, these setups are only supported with the kGateway-based Inference Gateway.
Do not forget to include the HuggingFace token if required.
+
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n "${NAMESPACE}"
```
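Before creating the secret, it can help to confirm that both variables are actually set, so that a missing value fails early rather than surfacing later as a harder-to-diagnose problem. The following is a minimal sketch; the `require_env` helper is hypothetical and not part of this repository.

```bash
# Hypothetical helper (not part of the Dynamo repo): return non-zero if any
# of the named environment variables is unset or empty.
require_env() {
  _missing=0
  for _name in "$@"; do
    # Look up the variable whose name is stored in _name.
    eval "_value=\${${_name}:-}"
    if [ -z "${_value}" ]; then
      echo "error: ${_name} is not set" >&2
      _missing=1
    fi
  done
  return "${_missing}"
}

# Example use, placed before the kubectl command above:
#   require_env HF_TOKEN NAMESPACE || exit 1
```

Running the check first keeps the `kubectl create secret` step from being attempted with incomplete input.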
+
+
Create a model configuration file for your model, similar to `vllm_agg_qwen.yaml`.
+
That file demonstrates the values needed for the vLLM agg setup in [agg.yaml](../../components/backends/vllm/deploy/agg.yaml).
+
Take note of the model's block size provided in the model card.
+
### 4. Install Dynamo GAIE helm chart ###
The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
@@ -95,7 +117,7 @@ Deploy the Inference Gateway resources to your Kubernetes cluster by running one
#### Basic Black Box Integration ####
-
For the basic black box integration run:
+
The basic black box integration uses the standard EPP image `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. For the basic black box integration, run:
#### EPP-aware Integration with the custom Dynamo Plugin ####
+
Dynamo provides a custom routing plugin, `pkg/epp/scheduling/plugins/dynamo_kv_scorer/plugin.go`, to perform efficient KV routing.
+
The Dynamo router is built as a static library that the EPP router calls to provide fast inference.
+
You can either use the image `nvcr.io/nvstaging/ai-dynamo/epp-inference-extension-dynamo:v0.6.0-1` as the `EPP_IMAGE` in the Helm deployment command and proceed to step 2, or build the image yourself by following the steps below.
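If you take the prebuilt route, the variable can be set along these lines. This is a small sketch: the `:-` default leaves an `EPP_IMAGE` you have already exported (for example, a self-built image) untouched, which is a convenience chosen here rather than something the chart requires.

```bash
# Use the prebuilt image named above unless EPP_IMAGE is already set.
export EPP_IMAGE="${EPP_IMAGE:-nvcr.io/nvstaging/ai-dynamo/epp-inference-extension-dynamo:v0.6.0-1}"
echo "Using EPP image: ${EPP_IMAGE}"
```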
+
##### 1. Build the custom EPP image #####
-
We provide git patches for you to use.
+
If you choose to build your own image, follow the steps below. Otherwise, proceed to step 2 to deploy with Helm.
##### 1.1 Clone the official GAIE repo in a separate folder #####
@@ -116,44 +142,74 @@ cd gateway-api-inference-extension
git checkout v0.5.1
```
-
##### 1.2 Apply patch(es) #####
+
##### 1.2 Build the Dynamo Custom EPP #####
+
+
+
+
###### 1.2.1 Clone the official EPP repo ######
+
+
```bash
# Clone the official GAIE repo in a separate folder
export EPP_IMAGE=<the-epp-image-you-built># i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
# Export the image tag you used when building the EPP, e.g. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-2
export EPP_IMAGE=<the-epp-image-you-built>
```
+
**Configuration**
+
You can configure the plugin by setting environment variables in your [values-epp-aware.yaml](./values-epp-aware.yaml).
+
- Overwrite the `DYNAMO_NAMESPACE` env var if needed to match your model's Dynamo namespace.
+
- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from `kv_active_blocks` or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next-best candidate. The default value is negative, which disables the check.
+
- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher that keeps multiple router instances in sync (important if you run more than one KV router per component).
+
- By default, the Dynamo plugin uses KV routing. Set `DYNAMO_USE_KV_ROUTING=false` in your [values-epp-aware.yaml](./values-epp-aware.yaml) if you prefer round-robin routing.
+
- If using KV routing:
+
  - Overwrite `DYNAMO_KV_BLOCK_SIZE` in your [values-epp-aware.yaml](./values-epp-aware.yaml) to match your model's block size. The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
+
  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to control how heavily the score weighs token overlap (predicted KV cache hits) against other factors (load, historical hit rate). A higher weight biases routing toward workers with similar cached prefixes.
+
  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. A low temperature makes the router pick the top candidate deterministically; a higher temperature lets lower-scoring workers through more often (exploration).
+
  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using KV routing.
+
  - See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details.
- `values-epp-aware.yaml` sets `eppAware.dynamoNamespace=vllm-agg` for the bundled example. Point it at your actual Dynamo namespace by editing that file or adding `--set eppAware.dynamoNamespace=<namespace>` (and likewise for `dynamoComponent` and `dynamoKvBlockSize` if they differ).
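
As an illustration of the `--set` overrides described above, the sketch below prints a Helm command for review instead of running it. The release name, the default block size, the placeholder namespace, and the `run` dry-run wrapper are all assumptions made for this example; the chart path and values file are the ones referenced in this README. Treat the deployment command in this README as the authoritative form.

```bash
# Placeholder values: replace with your own.
DYNAMO_NS="${DYNAMO_NS:-my-dynamo-namespace}"  # hypothetical Dynamo namespace
KV_BLOCK_SIZE="${KV_BLOCK_SIZE:-16}"           # hypothetical; use your model's block size
RELEASE="${RELEASE:-dynamo-gaie}"              # hypothetical Helm release name

# Hypothetical wrapper: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run helm upgrade --install "${RELEASE}" ./helm/dynamo-gaie \
  -n "${NAMESPACE:-default}" \
  -f ./values-epp-aware.yaml \
  --set "eppAware.dynamoNamespace=${DYNAMO_NS}" \
  --set "eppAware.dynamoKvBlockSize=${KV_BLOCK_SIZE}"
```

Printing first makes it easy to confirm that the namespace and block size match your model before anything is applied to the cluster.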