In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers.

## Default Method: UCX
By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
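Since UCX is the default, no extra configuration is needed to use it. As a minimal sketch only, assuming the launch environment exposes an explicit toggle for the UCX backend (the variable name `TRTLLM_USE_UCX_KVCACHE` is an assumption, not confirmed by this document), the default could be pinned like this:

```bash
# Hypothetical toggle: pin the default UCX backend for KV cache transfer.
# TRTLLM_USE_UCX_KVCACHE is an assumed variable name, not confirmed by this doc.
export TRTLLM_USE_UCX_KVCACHE=1
```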

## Beta Method: NIXL
TensorRT-LLM also supports using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

**Note:** NIXL support in TensorRT-LLM is currently in beta and may have some sharp edges.
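
As a minimal sketch, assuming the backend is selected through environment variables before the workers are launched (both variable names here, `TRTLLM_USE_NIXL_KVCACHE` and `TRTLLM_USE_UCX_KVCACHE`, are assumptions rather than confirmed by this document), switching from UCX to NIXL might look like:

```bash
# Hypothetical: select NIXL instead of the default UCX for KV cache transfer.
# Variable names are assumptions; follow the steps below for the supported path.
export TRTLLM_USE_NIXL_KVCACHE=1
unset TRTLLM_USE_UCX_KVCACHE
```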

## Using NIXL for KV Cache Transfer

To enable NIXL for KV cache transfer in disaggregated serving:
4. **Send the request:**
   See the [client](./README.md#client) section to learn how to send a request to the deployment; a sketch of such a request follows below.
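
As an illustrative sketch only (the host, port, endpoint path, and model name are assumptions; the [client](./README.md#client) section is authoritative), a request to an OpenAI-compatible frontend might look like:

```bash
# Hypothetical request to the deployment's OpenAI-compatible endpoint.
# Host, port, and model name are assumptions; adjust to your deployment.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Explain KV cache transfer in one sentence."}],
    "max_tokens": 64
  }'
```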

**Important:** Ensure that ETCD and NATS services are running before starting the service.
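
As a minimal sketch for a local setup (image tags and flags are assumptions; if your deployment ships a compose file for these services, prefer it), both services can be started with Docker:

```bash
# Start NATS with JetStream enabled.
docker run -d --name nats -p 4222:4222 nats:latest --jetstream

# Start a single-node etcd (authentication disabled for local testing only).
docker run -d --name etcd -p 2379:2379 \
  -e ALLOW_NONE_AUTHENTICATION=yes \
  bitnami/etcd:latest
```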