diff --git a/components/backends/trtllm/kv-cache-transfer.md b/components/backends/trtllm/kv-cache-transfer.md
index 43a7e39618..99ab98deae 100644
--- a/components/backends/trtllm/kv-cache-transfer.md
+++ b/components/backends/trtllm/kv-cache-transfer.md
@@ -24,10 +24,10 @@ In disaggregated serving architectures, KV cache must be transferred between pre
 ## Default Method: UCX
 By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
 
-## Experimental Method: NIXL
-TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
+## Beta Method: NIXL
+TensorRT-LLM also supports using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
 
-**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet.
+**Note:** NIXL support in TensorRT-LLM is currently beta and may have some sharp edges.
 
 ## Using NIXL for KV Cache Transfer
 
@@ -61,4 +61,4 @@ To enable NIXL for KV cache transfer in disaggregated serving:
 
 4. **Send the request:** See [client](./README.md#client) section to learn how to send the request to deployment.
 
-**Important:** Ensure that ETCD and NATS services are running before starting the service.
\ No newline at end of file
+**Important:** Ensure that ETCD and NATS services are running before starting the service.
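
The final hunk keeps the note that ETCD and NATS must be running before the service starts. As a quick pre-flight sanity check, a sketch like the one below can confirm both are reachable; the hostnames and ports are assumptions (default etcd client port 2379 and default NATS client port 4222 on localhost), not values taken from the deployment itself, so adjust them to match your environment.

```python
# Pre-flight sketch: verify ETCD and NATS are reachable before starting the
# disaggregated serving components. Endpoints below assume a local deployment
# with default ports (etcd: 2379, NATS: 4222) -- adjust as needed.
import socket
import sys

SERVICES = {
    "etcd": ("localhost", 2379),
    "NATS": ("localhost", 4222),
}

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

missing = [name for name, (host, port) in SERVICES.items()
           if not is_reachable(host, port)]
if missing:
    sys.exit(f"Not reachable: {', '.join(missing)} -- start these services first.")
print("ETCD and NATS are reachable; ready to start the service.")
```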