diff --git a/README.md b/README.md index 21d70bac93..5a621a7ebb 100644 --- a/README.md +++ b/README.md @@ -30,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative ## Latest News -- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md) +- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md) ## The Era of Multi-GPU, Multi-Node @@ -65,9 +65,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa To learn more about each framework and their capabilities, check out each framework's README! -- **[vLLM](components/backends/vllm/README.md)** -- **[SGLang](components/backends/sglang/README.md)** -- **[TensorRT-LLM](components/backends/trtllm/README.md)** +- **[vLLM](docs/backends/vllm/README.md)** +- **[SGLang](docs/backends/sglang/README.md)** +- **[TensorRT-LLM](docs/backends/trtllm/README.md)** Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach. diff --git a/components/README.md b/components/README.md index f0f72aa452..18b88c9ee7 100644 --- a/components/README.md +++ b/components/README.md @@ -23,9 +23,9 @@ This directory contains the core components that make up the Dynamo inference fr Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with their own deployment configurations and capabilities: -- **[vLLM](backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms -- **[SGLang](backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication -- **[TensorRT-LLM](backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration +- **[vLLM](/docs/backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms +- **[SGLang](/docs/backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication +- **[TensorRT-LLM](/docs/backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories. diff --git a/components/backends/sglang/slurm_jobs/README.md b/components/backends/sglang/slurm_jobs/README.md index 3d4b2ad436..800b1b2c43 100644 --- a/components/backends/sglang/slurm_jobs/README.md +++ b/components/backends/sglang/slurm_jobs/README.md @@ -17,7 +17,7 @@ For this example, we will make some assumptions about your SLURM cluster: If your cluster supports similar container based plugins, you may be able to modify the template to use that instead. 3. We assume you have already built a recent Dynamo+SGLang container image as - described [here](../docs/dsr1-wideep-gb200.md#instructions). + described [here](../../../../docs/backends/sglang/dsr1-wideep-gb200.md#instructions). This is the image that can be passed to the `--container-image` argument in later steps. 
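For context on the container image referenced above: it is built from the Dynamo source tree and its tag is then handed to the job script's `--container-image` argument. A minimal sketch, assuming `container/build.sh` accepts the same `--framework` flag used in the TensorRT-LLM build command quoted later in this diff (the tag shown is illustrative):

```bash
# Build a Dynamo+SGLang image from the repo root; --framework sglang mirrors
# the trtllm build example elsewhere in these docs and is an assumption here.
./container/build.sh --framework sglang

# The resulting image tag (illustrative) is what later steps expect:
#   ... --container-image dynamo:latest-sglang
```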
## Scripts Overview

diff --git a/components/backends/trtllm/deploy/README.md b/components/backends/trtllm/deploy/README.md
index 8cd13184db..315e518d01 100644
--- a/components/backends/trtllm/deploy/README.md
+++ b/components/backends/trtllm/deploy/README.md
@@ -232,7 +232,7 @@ envs:
 
 ## Testing the Deployment
 
-Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.
+Send a test request to verify your deployment. See the [client section](../../../../docs/backends/vllm/README.md#client) for detailed instructions.
 
 **Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend `.
 
@@ -254,7 +254,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
 - **UCX** (default): Standard method for KV cache transfer
 - **NIXL** (experimental): Alternative transfer method
 
-For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-transfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
 
 ## Request Migration
 
@@ -282,8 +282,8 @@ Configure the `model` name and `host` based on your deployment.
 
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
-- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
-- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
+- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
+- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
 
 ## Troubleshooting

diff --git a/components/backends/trtllm/performance_sweeps/README.md b/components/backends/trtllm/performance_sweeps/README.md
index cda70cc7ae..aaec28f543 100644
--- a/components/backends/trtllm/performance_sweeps/README.md
+++ b/components/backends/trtllm/performance_sweeps/README.md
@@ -41,7 +41,7 @@ Please note that:
 3. `post_process.py` - Scan the genai-perf results to produce a json with entries to each config point.
 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
 
-For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
+For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 SLURM, please refer to [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide makes similar assumptions to the multinode examples guide.
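The sweep scripts above ultimately drive genai-perf against the frontend. A sketch of a single measurement point, assuming a frontend at `localhost:8000`; exact flag names vary across genai-perf versions:

```bash
# Measure one concurrency point against the chat endpoint (flags illustrative).
genai-perf profile \
  -m deepseek-ai/DeepSeek-R1 \
  --url localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --concurrency 8 \
  --synthetic-input-tokens-mean 3000 \
  --output-tokens-mean 150
```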
## Usage diff --git a/docs/_includes/dive_in_examples.rst b/docs/_includes/dive_in_examples.rst index 60eb9048fb..261e896d77 100644 --- a/docs/_includes/dive_in_examples.rst +++ b/docs/_includes/dive_in_examples.rst @@ -11,20 +11,20 @@ The examples below assume you build the latest image yourself from source. If us Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph - .. grid-item-card:: :doc:`vLLM <../components/backends/vllm/README>` - :link: ../components/backends/vllm/README + .. grid-item-card:: :doc:`vLLM <../backends/vllm/README>` + :link: ../backends/vllm/README :link-type: doc Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM. - .. grid-item-card:: :doc:`SGLang <../components/backends/sglang/README>` - :link: ../components/backends/sglang/README + .. grid-item-card:: :doc:`SGLang <../backends/sglang/README>` + :link: ../backends/sglang/README :link-type: doc Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang. - .. grid-item-card:: :doc:`TensorRT-LLM <../components/backends/trtllm/README>` - :link: ../components/backends/trtllm/README + .. grid-item-card:: :doc:`TensorRT-LLM <../backends/trtllm/README>` + :link: ../backends/trtllm/README :link-type: doc Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM. diff --git a/docs/_sections/backends.rst b/docs/_sections/backends.rst index 4b6b294b71..653f2ee770 100644 --- a/docs/_sections/backends.rst +++ b/docs/_sections/backends.rst @@ -37,6 +37,6 @@ Dynamo currently supports the following high-performance inference backends: .. toctree:: :maxdepth: 1 - vLLM <../components/backends/vllm/README> - SGLang <../components/backends/sglang/README> - TensorRT-LLM <../components/backends/trtllm/README> + vLLM <../backends/vllm/README> + SGLang <../backends/sglang/README> + TensorRT-LLM <../backends/trtllm/README> diff --git a/docs/architecture/kvbm_intro.rst b/docs/architecture/kvbm_intro.rst index 4c6cb0d227..d32e1fe2c5 100644 --- a/docs/architecture/kvbm_intro.rst +++ b/docs/architecture/kvbm_intro.rst @@ -63,4 +63,4 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes KVBM Architecture Understanding KVBM components KVBM Further Reading - LMCache Integration <../components/backends/vllm/LMCache_Integration.md> + LMCache Integration <../backends/vllm/LMCache_Integration> diff --git a/components/backends/sglang/README.md b/docs/backends/sglang/README.md similarity index 89% rename from components/backends/sglang/README.md rename to docs/backends/sglang/README.md index a66b23f6d3..267ad87178 100644 --- a/components/backends/sglang/README.md +++ b/docs/backends/sglang/README.md @@ -35,13 +35,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | Feature | SGLang | Notes | |---------|--------|-------| -| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | | -| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) | -| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | | -| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | | -| [**Multimodal EPD Disaggregation**](docs/multimodal_epd.md) | ✅ | | -| [**Load Based 
Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
+| [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ | |
+| [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
+| [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ | |
+| [**SLA-Based Planner**](../../architecture/sla_planner.md) | ✅ | |
+| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
+| [**Load Based Planner**](../../architecture/load_planner.md) | ❌ | Planned |
+| [**KVBM**](../../architecture/kvbm_architecture.md) | ❌ | Planned |
 
 ### Large Scale P/D and WideEP Features
 
@@ -229,7 +229,7 @@ cd $DYNAMO_HOME/components/backends/sglang
 ./launch/disagg_dp_attn.sh
 ```
 
-When using MoE models, you can also use the our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
+When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, sets the environment variable that controls the expert distribution recording directory, and sets the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](expert-distribution-eplb.md).
 
 ### Testing the Deployment
 
@@ -266,24 +266,24 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 
 ### Run a multi-node sized model
-- **[Run a multi-node model](docs/multinode-examples.md)**
+- **[Run a multi-node model](multinode-examples.md)**
 
 ### Large scale P/D disaggregation with WideEP
-- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1-FP8 on GB200s](docs/dsr1-wideep-gb200.md)**
+- **[Run DeepSeek-R1 on 104+ H100s](dsr1-wideep-h100.md)**
+- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**
 
 ### Hierarchical Cache (HiCache)
-- **[Enable SGLang Hierarchical Cache (HiCache)](docs/sgl-hicache-example.md)**
+- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**
 
 ### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
-- **[Run a multimodal model with EPD Disaggregation](docs/multimodal_epd.md)**
+- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**
 
 ## Deployment
 
 We currently provide deployment examples for Kubernetes and SLURM.
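The hunk context above references a migration limit of 3. A sketch of how that limit is typically set on a worker, assuming the `--migration-limit` flag described in the Request Migration guide; the module path and model flag mirror the launch scripts and are illustrative:

```bash
# Allow an in-flight request to be migrated up to 3 times before failing.
# Flags other than --migration-limit are illustrative placeholders.
python3 -m dynamo.sglang \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --migration-limit 3
```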
## Kubernetes
-- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
+- **[Deploying Dynamo with SGLang on Kubernetes](../../../components/backends/sglang/deploy/README.md)**
 
 ## SLURM
-- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
+- **[Deploying Dynamo with SGLang on SLURM](../../../components/backends/sglang/slurm_jobs/README.md)**
diff --git a/components/backends/sglang/docs/dsr1-wideep-gb200.md b/docs/backends/sglang/dsr1-wideep-gb200.md
similarity index 100%
rename from components/backends/sglang/docs/dsr1-wideep-gb200.md
rename to docs/backends/sglang/dsr1-wideep-gb200.md
diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/docs/backends/sglang/dsr1-wideep-h100.md
similarity index 100%
rename from components/backends/sglang/docs/dsr1-wideep-h100.md
rename to docs/backends/sglang/dsr1-wideep-h100.md
diff --git a/components/backends/sglang/docs/expert-distribution-eplb.md b/docs/backends/sglang/expert-distribution-eplb.md
similarity index 100%
rename from components/backends/sglang/docs/expert-distribution-eplb.md
rename to docs/backends/sglang/expert-distribution-eplb.md
diff --git a/components/backends/sglang/gpt-oss.md b/docs/backends/sglang/gpt-oss.md
similarity index 96%
rename from components/backends/sglang/gpt-oss.md
rename to docs/backends/sglang/gpt-oss.md
index 8860380450..7ab7eb1866 100644
--- a/components/backends/sglang/gpt-oss.md
+++ b/docs/backends/sglang/gpt-oss.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
 
 # Running gpt-oss-120b Disaggregated with SGLang
 
-The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/components/backends/vllm/gpt-oss.md),
-please ues the vLLM guide as a reference with the different deployment steps as highlighted below:
+The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/docs/backends/vllm/gpt-oss.md),
+please use the vLLM guide as a reference, with the different deployment steps highlighted below:
 
 # Launch the Deployment
diff --git a/components/backends/sglang/docs/multimodal_epd.md b/docs/backends/sglang/multimodal_epd.md
similarity index 98%
rename from components/backends/sglang/docs/multimodal_epd.md
rename to docs/backends/sglang/multimodal_epd.md
index bb3e725fc9..0131804ae9 100644
--- a/components/backends/sglang/docs/multimodal_epd.md
+++ b/docs/backends/sglang/multimodal_epd.md
@@ -31,7 +31,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
 
-Its MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../README.md) example.
+The MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
 
 By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the MultimodalEncodeWorker independently from the prefill and decode workers if needed.
 
@@ -116,7 +116,7 @@ You should see a response similar to this:
 
 For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing. The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
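Requests to the Qwen2.5-VL EPD deployment described above follow the standard OpenAI multimodal shape. A minimal sketch; host, model name, and image URL are illustrative:

```bash
# Send one image-plus-text request to the frontend.
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "http://example.com/image.jpg"}}
          ]
        }],
        "max_tokens": 64
      }'
```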
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../README.md) example.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.
 
 This figure illustrates the workflow:
 ```mermaid
diff --git a/components/backends/sglang/docs/multinode-examples.md b/docs/backends/sglang/multinode-examples.md
similarity index 100%
rename from components/backends/sglang/docs/multinode-examples.md
rename to docs/backends/sglang/multinode-examples.md
diff --git a/components/backends/sglang/docs/sgl-hicache-example.md b/docs/backends/sglang/sgl-hicache-example.md
similarity index 100%
rename from components/backends/sglang/docs/sgl-hicache-example.md
rename to docs/backends/sglang/sgl-hicache-example.md
diff --git a/components/backends/trtllm/README.md b/docs/backends/trtllm/README.md
similarity index 96%
rename from components/backends/trtllm/README.md
rename to docs/backends/trtllm/README.md
index 33628876d8..f5d5fa1d1b 100644
--- a/components/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -186,11 +186,11 @@ For comprehensive instructions on multinode serving, see the [multinode-examples
 
 ### Kubernetes Deployment
 
-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../components/backends/trtllm/deploy/README.md).
 
 ### Client
 
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
+See the [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send a request to the deployment.
 
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend `.
 
@@ -230,7 +230,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 
 ## Client
 
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
+See the [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send a request to the deployment.
 
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend `.
 
@@ -302,7 +302,7 @@ sampling_params.logits_processor = create_trtllm_adapters(processors)
 
 ## Performance Sweep
 
-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../components/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
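The Client sections above point at the SGLang testing guide; for convenience, a minimal sketch of such a request, assuming the frontend listens on `localhost:8000` (the model name is illustrative and must match what your deployment serves):

```bash
# Smoke-test the OpenAI-compatible frontend.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```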
## Dynamo KV Block Manager Integration diff --git a/components/backends/trtllm/gemma3_sliding_window_attention.md b/docs/backends/trtllm/gemma3_sliding_window_attention.md similarity index 85% rename from components/backends/trtllm/gemma3_sliding_window_attention.md rename to docs/backends/trtllm/gemma3_sliding_window_attention.md index 80929cd355..5f9cca904c 100644 --- a/components/backends/trtllm/gemma3_sliding_window_attention.md +++ b/docs/backends/trtllm/gemma3_sliding_window_attention.md @@ -23,9 +23,9 @@ VSWA is a mechanism in which a model’s layers alternate between multiple slidi > [!Note] > - Ensure that required services such as `nats` and `etcd` are running before starting. > - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication. -> - It’s recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA. +> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA. -### Aggregated Serving +## Aggregated Serving ```bash cd $DYNAMO_HOME/components/backends/trtllm export MODEL_PATH=google/gemma-3-1b-it @@ -34,7 +34,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml ./launch/agg.sh ``` -### Aggregated Serving with KV Routing +## Aggregated Serving with KV Routing ```bash cd $DYNAMO_HOME/components/backends/trtllm export MODEL_PATH=google/gemma-3-1b-it @@ -43,7 +43,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml ./launch/agg_router.sh ``` -#### Disaggregated Serving +## Disaggregated Serving ```bash cd $DYNAMO_HOME/components/backends/trtllm export MODEL_PATH=google/gemma-3-1b-it @@ -53,7 +53,7 @@ export DECODE_ENGINE_ARGS=engine_configs/gemma3/vswa_decode.yaml ./launch/disagg.sh ``` -#### Disaggregated Serving with KV Routing +## Disaggregated Serving with KV Routing ```bash cd $DYNAMO_HOME/components/backends/trtllm export MODEL_PATH=google/gemma-3-1b-it diff --git a/components/backends/trtllm/gpt-oss.md b/docs/backends/trtllm/gpt-oss.md similarity index 100% rename from components/backends/trtllm/gpt-oss.md rename to docs/backends/trtllm/gpt-oss.md diff --git a/components/backends/trtllm/kv-cache-transfer.md b/docs/backends/trtllm/kv-cache-transfer.md similarity index 100% rename from components/backends/trtllm/kv-cache-transfer.md rename to docs/backends/trtllm/kv-cache-transfer.md diff --git a/components/backends/trtllm/llama4_plus_eagle.md b/docs/backends/trtllm/llama4_plus_eagle.md similarity index 100% rename from components/backends/trtllm/llama4_plus_eagle.md rename to docs/backends/trtllm/llama4_plus_eagle.md diff --git a/components/backends/trtllm/multimodal_epd.md b/docs/backends/trtllm/multimodal_epd.md similarity index 96% rename from components/backends/trtllm/multimodal_epd.md rename to docs/backends/trtllm/multimodal_epd.md index 7a40122548..9900a168a8 100644 --- a/components/backends/trtllm/multimodal_epd.md +++ b/docs/backends/trtllm/multimodal_epd.md @@ -1,8 +1,8 @@ -## Encode-Prefill-Decode (EPD) Flow with NIXL +# Encode-Prefill-Decode (EPD) Flow with NIXL For high-performance multimodal inference with large embeddings, Dynamo supports a specialized 
**Encode-Prefill-Decode (EPD)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
 
-### Enabling the Feature
+## Enabling the Feature
 
 This is an experimental feature that requires using a specific TensorRT-LLM commit.
-To enable it build the dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
+To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag, followed by the commit hash:
@@ -11,14 +11,14 @@ To enable it build the dynamo container with the `--tensorrtllm-commit` flag, fo
 ./container/build.sh --framework trtllm --tensorrtllm-commit b4065d8ca64a64eee9fdc64b39cb66d73d4be47c
 ```
 
-### Key Features
+## Key Features
 
 - **High Performance**: Zero-copy RDMA transfer for embeddings
 - **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
 - **Multi-Format Support**: Works with tensor files (`.pt`) and dictionary-based embeddings
 - **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON
 
-### How to use
+## How to use
 
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
@@ -27,7 +27,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
 ./launch/epd_disagg.sh
 ```
 
-### Configuration
+## Configuration
 
 The EPD flow uses a dedicated **Encode Worker** that runs separately from the Prefill and Decode workers. The `ENCODE_ENDPOINT` environment variable specifies how the Prefill worker communicates with the Encode worker:
 
@@ -49,7 +49,7 @@ For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"
 export MAX_FILE_SIZE_MB=50
 ```
 
-### Architecture Overview
+## Architecture Overview
 
 The EPD flow implements a **3-worker architecture** for high-performance multimodal inference:
 
@@ -57,9 +57,9 @@ The EPD flow implements a **3-worker architecture** for high-performance multimo
 - **Prefill Worker**: Handles initial context processing and KV-cache generation
 - **Decode Worker**: Performs streaming token generation
 
-### Request Flow Diagrams
+## Request Flow Diagrams
 
-#### Prefill-First Disaggregation Strategy
+### Prefill-First Disaggregation Strategy
 
 ```mermaid
 sequenceDiagram
@@ -103,7 +103,7 @@ sequenceDiagram
     Gateway->>Client: Final response + [DONE]
 ```
 
-#### Decode-First Disaggregation Strategy
+### Decode-First Disaggregation Strategy
 
 ```mermaid
 sequenceDiagram
@@ -155,7 +155,7 @@ sequenceDiagram
     Gateway->>Client: Final response + [DONE]
 ```
 
-### How the System Works
+## How the System Works
 
 1. **Request Processing**: Multimodal requests containing embedding file paths OR urls are routed based on disaggregation strategy
 2. **Multimodal Loading**: EncodeWorker loads large embedding files and extracts auxiliary metadata
@@ -163,7 +163,7 @@ sequenceDiagram
 4. **Dynamic Allocation**: Consumer workers allocate tensors with exact shapes received from EncodeWorker
 5. **Reconstruction**: Original embedding format (dictionary or tensor) is reconstructed for model processing
 
-### Example Request
+## Example Request
 
 The request format is identical to regular multimodal requests:
 
diff --git a/components/backends/trtllm/multimodal_support.md b/docs/backends/trtllm/multimodal_support.md
similarity index 97%
rename from components/backends/trtllm/multimodal_support.md
rename to docs/backends/trtllm/multimodal_support.md
index 5fb29038a4..a8cb246f41 100644
--- a/components/backends/trtllm/multimodal_support.md
+++ b/docs/backends/trtllm/multimodal_support.md
@@ -21,7 +21,7 @@ TRTLLM supports multimodal models with dynamo. You can provide multimodal inputs
 
 Please note that you should provide **either image URLs or embedding file paths** in a single request.
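Pulling the EPD configuration pieces above together, a minimal launch sketch drawn from the Configuration section (the launch script sets `ENCODE_ENDPOINT` itself per the docs; only the tensor-file size cap is exported here):

```bash
# Launch the 3-worker EPD flow with a 50 MB tensor-file cap.
cd $DYNAMO_HOME/components/backends/trtllm
export MAX_FILE_SIZE_MB=50
./launch/epd_disagg.sh
```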
-### Aggregated
+## Aggregated
 
 Here are quick steps to launch Llama-4 Maverick BF16 in aggregated mode
 ```bash
@@ -32,9 +32,9 @@ export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
 export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
 ./launch/agg.sh
 ```
-### Example Requests
+## Example Requests
 
-#### With Image URL
+### With Image URL
 
 Below is an example of an image being sent to `Llama-4-Maverick-17B-128E-Instruct` model
 
@@ -69,7 +69,7 @@ Response :
 {"id":"unknown-id","choices":[{"index":0,"message":{"content":"The image depicts a serene landscape featuring a large rock formation, likely El Capitan in Yosemite National Park, California. The scene is characterized by a winding road that curves from the bottom-right corner towards the center-left of the image, with a few rocks and trees lining its edge.\n\n**Key Features:**\n\n* **Rock Formation:** A prominent, tall, and flat-topped rock formation dominates the center of the image.\n* **Road:** A paved road winds its way through the landscape, curving from the bottom-right corner towards the center-left.\n* **Trees and Rocks:** Trees are visible on both sides of the road, with rocks scattered along the left side.\n* **Sky:** The sky above is blue, dotted with white clouds.\n* **Atmosphere:** The overall atmosphere of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753322607,"model":"meta-llama/Llama-4-Maverick-17B-128E-Instruct","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
 ```
-### Disaggregated
+## Disaggregated
 
 Here are quick steps to launch in disaggregated mode.
 
@@ -93,11 +93,11 @@ In general, disaggregated serving can run on a single node, provided the model f
 
 To deploy `Llama-4-Maverick-17B-128E-Instruct` in disaggregated mode, you will need to follow the multi-node setup instructions, which can be found [here](./multinode/multinode-multimodal-example.md).
 
-### Using Pre-computed Embeddings (Experimental)
+## Using Pre-computed Embeddings (Experimental)
 
 Dynamo with TensorRT-LLM supports providing pre-computed embeddings directly in an inference request. This bypasses the need for the model to process an image and generate embeddings itself, which is useful for performance optimization or when working with custom, pre-generated embeddings.
 
-#### How to Use
+### How to Use
 
 Once the container is built, you can send requests with paths to local embedding files.
 
@@ -107,7 +107,7 @@ Once the container is built, you can send requests with paths to local embedding
 
 When a request with a supported embedding file is received, Dynamo will load the tensor from the file and pass it directly to the model for inference, skipping the image-to-embedding pipeline.
 
-#### Example Request
+### Example Request
 
 Here is an example of how to send a request with a pre-computed embedding file.
 
@@ -135,7 +135,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
 "max_tokens": 160
 }'
 ```
-### Encode-Prefill-Decode (EPD) Flow with NIXL
+## Encode-Prefill-Decode (EPD) Flow with NIXL
 
-Dynamo with the TensorRT-LLM backend supports multimodal models in Encode -> Decode -> Prefill fashion, enabling you to process embeddings seperately in a seperate worker. For detailed setup instructions, example requests, and best practices, see the [Multimodal EPD Support Guide](./multimodal_epd.md).
+Dynamo with the TensorRT-LLM backend supports multimodal models in an Encode -> Prefill -> Decode fashion, enabling you to process embeddings separately in a separate worker. For detailed setup instructions, example requests, and best practices, see the [Multimodal EPD Support Guide](./multimodal_epd.md).
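To exercise the pre-computed-embeddings path above, you first need a `.pt` file for the request to reference. A sketch, where the tensor shape and dtype are placeholders rather than values derived from any particular model:

```bash
# Write a dummy embedding tensor to the path a request would reference.
python3 - <<'EOF'
import torch

# Placeholder shape: [num_image_tokens, hidden_size]; real embeddings
# come from your encoder, not from torch.randn.
emb = torch.randn(576, 5120, dtype=torch.float16)
torch.save(emb, "/tmp/image_embedding.pt")
EOF
```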
diff --git a/components/backends/trtllm/multinode/multinode-examples.md b/docs/backends/trtllm/multinode/multinode-examples.md
similarity index 100%
rename from components/backends/trtllm/multinode/multinode-examples.md
rename to docs/backends/trtllm/multinode/multinode-examples.md
diff --git a/components/backends/trtllm/multinode/multinode-multimodal-example.md b/docs/backends/trtllm/multinode/multinode-multimodal-example.md
similarity index 99%
rename from components/backends/trtllm/multinode/multinode-multimodal-example.md
rename to docs/backends/trtllm/multinode/multinode-multimodal-example.md
index 723a9b1f8b..fe050efd3c 100644
--- a/components/backends/trtllm/multinode/multinode-multimodal-example.md
+++ b/docs/backends/trtllm/multinode/multinode-multimodal-example.md
@@ -44,7 +44,7 @@ Before you begin, ensure you have completed the initial environment configuratio
 
 The following sections provide specific instructions for deploying `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, including environment variable setup and launch commands. These steps can be adapted for other large multimodal models.
 
-### Environment Variable Setup
+## Environment Variable Setup
 
 Assuming you have already allocated your nodes via `salloc`, and are inside an interactive shell on one of the allocated nodes, set the
diff --git a/components/backends/vllm/LMCache_Integration.md b/docs/backends/vllm/LMCache_Integration.md
similarity index 100%
rename from components/backends/vllm/LMCache_Integration.md
rename to docs/backends/vllm/LMCache_Integration.md
diff --git a/components/backends/vllm/README.md b/docs/backends/vllm/README.md
similarity index 94%
rename from components/backends/vllm/README.md
rename to docs/backends/vllm/README.md
index ceff111abd..981e9226e3 100644
--- a/components/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
 
 # LLM Deployment using vLLM
 
-This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
+This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
 
 ## Use the Latest Release
 
@@ -153,7 +153,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 
 ### Kubernetes Deployment
 
-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](deploy/README.md)
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [vLLM Kubernetes Deployment Guide](/components/backends/vllm/deploy/README.md).
 
 ## Configuration
 
diff --git a/components/backends/vllm/deepseek-r1.md b/docs/backends/vllm/deepseek-r1.md
similarity index 99%
rename from components/backends/vllm/deepseek-r1.md
rename to docs/backends/vllm/deepseek-r1.md
index 9170c4159c..c859695e6f 100644
--- a/components/backends/vllm/deepseek-r1.md
+++ b/docs/backends/vllm/deepseek-r1.md
@@ -7,7 +7,7 @@ SPDX-License-Identifier: Apache-2.0
 
 Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism.
-Each data parallel attention rank is a seperate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
+Each data parallel attention rank is a separate Dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
 
-# Instructions
+## Instructions
 
-The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
+The following script can be adapted to run Deepseek R1 with a variety of different configurations. The current configuration uses 2 nodes, 16 GPUs, and a DP of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
diff --git a/components/backends/vllm/gpt-oss.md b/docs/backends/vllm/gpt-oss.md
similarity index 98%
rename from components/backends/vllm/gpt-oss.md
rename to docs/backends/vllm/gpt-oss.md
index 09e69e2a6c..02fee37bd7 100644
--- a/components/backends/vllm/gpt-oss.md
+++ b/docs/backends/vllm/gpt-oss.md
@@ -16,7 +16,7 @@ This deployment uses disaggregated serving in vLLM where:
 
 ## Prerequisites
 
-This guide assumes readers already knows how to deploy Dynamo disaggregated serving with vLLM as illustrated in [README.md](/components/backends/vllm/README.md)
+This guide assumes readers already know how to deploy Dynamo disaggregated serving with vLLM, as illustrated in [README.md](/docs/backends/vllm/README.md).
 
 ## Instructions
 
diff --git a/components/backends/vllm/multi-node.md b/docs/backends/vllm/multi-node.md
similarity index 100%
rename from components/backends/vllm/multi-node.md
rename to docs/backends/vllm/multi-node.md
diff --git a/docs/components/backends/sglang/README.md b/docs/components/backends/sglang/README.md
deleted file mode 120000
index c481015d87..0000000000
--- a/docs/components/backends/sglang/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../components/backends/sglang/README.md
\ No newline at end of file
diff --git a/docs/components/backends/sglang/docs/multinode-examples.md b/docs/components/backends/sglang/docs/multinode-examples.md
deleted file mode 120000
index 9929f08b4a..0000000000
--- a/docs/components/backends/sglang/docs/multinode-examples.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../../components/backends/sglang/docs/multinode-examples.md
\ No newline at end of file
diff --git a/docs/components/backends/trtllm/README.md b/docs/components/backends/trtllm/README.md
deleted file mode 120000
index 15969304d0..0000000000
--- a/docs/components/backends/trtllm/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../components/backends/trtllm/README.md
\ No newline at end of file
diff --git a/docs/components/backends/trtllm/multinode/multinode-examples.md b/docs/components/backends/trtllm/multinode/multinode-examples.md
deleted file mode 120000
index 495f44690b..0000000000
--- a/docs/components/backends/trtllm/multinode/multinode-examples.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../../components/backends/trtllm/multinode/multinode-examples.md
\ No newline at end of file
diff --git a/docs/components/backends/vllm/LMCache_Integration.md b/docs/components/backends/vllm/LMCache_Integration.md
deleted file mode 120000
index 117bf4be15..0000000000
--- a/docs/components/backends/vllm/LMCache_Integration.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../components/backends/vllm/LMCache_Integration.md
\ No newline at end of file
diff --git a/docs/components/backends/vllm/README.md b/docs/components/backends/vllm/README.md
deleted file mode 120000
index ec40eb5e49..0000000000
--- a/docs/components/backends/vllm/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../components/backends/vllm/README.md
\ No newline at end of file
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index f32d1e2a9d..32b492dd47
100644 --- a/docs/hidden_toctree.rst +++ b/docs/hidden_toctree.rst @@ -42,8 +42,22 @@ architecture/request_migration.md architecture/request_cancellation.md - components/backends/trtllm/multinode/multinode-examples.md - components/backends/sglang/docs/multinode-examples.md + backends/trtllm/multinode/multinode-examples.md + backends/trtllm/multinode/multinode-multimodal-example.md + backends/trtllm/llama4_plus_eagle.md + backends/trtllm/kv-cache-transfer.md + backends/trtllm/multimodal_support.md + backends/trtllm/multimodal_epd.md + backends/trtllm/gemma3_sliding_window_attention.md + backends/trtllm/gpt-oss.md + + backends/sglang/multinode-examples.md + backends/sglang/dsr1-wideep-gb200.md + backends/sglang/dsr1-wideep-h100.md + backends/sglang/expert-distribution-eplb.md + backends/sglang/gpt-oss.md + backends/sglang/multimodal_epd.md + backends/sglang/sgl-hicache-example.md examples/README.md examples/runtime/hello_world/README.md @@ -51,6 +65,10 @@ architecture/distributed_runtime.md architecture/dynamo_flow.md + backends/vllm/deepseek-r1.md + backends/vllm/gpt-oss.md + backends/vllm/multi-node.md + .. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md have some outdated names/references and need a refresh. diff --git a/docs/kubernetes/fluxcd.md b/docs/kubernetes/fluxcd.md index 248b135450..013e346056 100644 --- a/docs/kubernetes/fluxcd.md +++ b/docs/kubernetes/fluxcd.md @@ -1,6 +1,6 @@ # GitOps Deployment with FluxCD -This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](/components/backends/vllm/README.md) to demonstrate the workflow. +This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](/docs/backends/vllm/README.md) to demonstrate the workflow. ## Prerequisites diff --git a/docs/kubernetes/metrics.md b/docs/kubernetes/metrics.md index 628cb779aa..dfb135e0d9 100644 --- a/docs/kubernetes/metrics.md +++ b/docs/kubernetes/metrics.md @@ -64,7 +64,7 @@ This will create two components: - A Worker component exposing metrics on its system port Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about: -- Deployment configuration: See the [vLLM README](/components/backends/vllm/README.md) +- Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md) - Available metrics: See the [metrics guide](/docs/guides/metrics.md) ### Validate the Deployment @@ -87,7 +87,7 @@ curl localhost:8000/v1/chat/completions \ }' ``` -For more information about validating the deployment, see the [vLLM README](../../components/backends/vllm/README.md). +For more information about validating the deployment, see the [vLLM README](../backends/vllm/README.md). 
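To spot-check the `/metrics` endpoints discussed above before wiring up a scraper, a quick sketch (the port is illustrative; each component serves OpenMetrics on its own system port):

```bash
# Peek at a component's OpenMetrics output.
curl -s localhost:8081/metrics | head -n 20
```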
## Set Up Metrics Collection

diff --git a/examples/basics/disaggregated_serving/README.md b/examples/basics/disaggregated_serving/README.md
index 8ed2d84d43..83085c8944 100644
--- a/examples/basics/disaggregated_serving/README.md
+++ b/examples/basics/disaggregated_serving/README.md
@@ -37,8 +37,8 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
 
 ## Components
 
 - [Frontend](/components/src/dynamo/frontend/README.md) - HTTP API endpoint that receives requests and forwards them to the decode worker
-- [vLLM Prefill Worker](/components/backends/vllm/README.md) - Specialized worker for prefill phase execution
-- [vLLM Decode Worker](/components/backends/vllm/README.md) - Specialized worker that handles requests and decides between local/remote prefill
+- [vLLM Prefill Worker](/docs/backends/vllm/README.md) - Specialized worker for prefill phase execution
+- [vLLM Decode Worker](/docs/backends/vllm/README.md) - Specialized worker that handles requests and decides between local/remote prefill
 
 ```mermaid
 ---
diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md
index 7a32898e15..fed574774c 100644
--- a/examples/basics/multinode/README.md
+++ b/examples/basics/multinode/README.md
@@ -88,7 +88,7 @@ Install Dynamo with [SGLang](https://docs.sglang.ai/) support:
 pip install ai-dynamo[sglang]
 ```
 
-For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../../components/backends/sglang/README.md).
+For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../../docs/backends/sglang/README.md).
 
 ### 3. Network Requirements
diff --git a/examples/basics/quickstart/README.md b/examples/basics/quickstart/README.md
index af68687f07..0a325ef037 100644
--- a/examples/basics/quickstart/README.md
+++ b/examples/basics/quickstart/README.md
@@ -18,7 +18,7 @@ docker compose -f deploy/docker-compose.yml up -d
 
 ## Components
 
 - [Frontend](/components/src/dynamo/frontend/README.md) - A built-in component that launches an OpenAI compliant HTTP server, a pre-processor, and a router in a single process
-- [vLLM Backend](/components/backends/vllm/README.md) - A built-in component that runs vLLM within the Dynamo runtime
+- [vLLM Backend](/docs/backends/vllm/README.md) - A built-in component that runs vLLM within the Dynamo runtime
 
 ```mermaid
 ---
diff --git a/examples/multimodal/README.md b/examples/multimodal/README.md
index bf7de1709a..65cac21358 100644
--- a/examples/multimodal/README.md
+++ b/examples/multimodal/README.md
@@ -44,7 +44,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 In this workflow, we have two workers, [VllmEncodeWorker](components/encode_worker.py) and [VllmPDWorker](components/worker.py). The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the VllmPDWorker via a combination of NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
 
-Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../../components/backends/vllm/README.md) example.
+The VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../../docs/backends/vllm/README.md) example.
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the VllmEncodeWorker independently from the prefill and decode workers if needed.
 
@@ -122,7 +122,7 @@ For the Llava model, embeddings are only required during the prefill stage. As s
 
 The VllmEncodeWorker is responsible for encoding the image and passing the embeddings to the prefill worker via a combination of NATS and RDMA. Its work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding.
 
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../../components/backends/vllm/README.md) example.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../../docs/backends/vllm/README.md) example.
 
 This figure illustrates the workflow:
 ```mermaid
@@ -203,7 +203,7 @@ of the model per node.
 
 #### Workflow
 
-In this workflow, we have [VllmPDWorker](components/worker.py) which will encode the image, prefill and decode the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
+In this workflow, we have [VllmPDWorker](components/worker.py), which will encode the image, prefill, and decode the prompt, just like the [LLM aggregated serving](/docs/backends/vllm/README.md) example.
 
 This figure illustrates the workflow:
 ```mermaid
@@ -267,7 +267,7 @@ You should see a response similar to this:
 
 In this workflow, we have two workers, [VllmDecodeWorker](components/worker.py), and [VllmPDWorker](components/worker.py). The prefill worker performs the encoding and prefilling steps and forwards the KV cache to the decode worker for decoding.
 
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/docs/backends/vllm/README.md) example.
 
 This figure illustrates the workflow:
 ```mermaid
@@ -342,7 +342,7 @@ This example demonstrates deploying an aggregated multimodal model that can proc
 
 In this workflow, we have two workers, [VideoEncodeWorker](components/video_encode_worker.py) and [VllmPDWorker](components/worker.py). The VideoEncodeWorker is responsible for decoding the video into a series of frames. Unlike the image pipeline which generates embeddings, this pipeline passes the raw frames directly to the VllmPDWorker via a combination of NATS and RDMA.
 
-Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
+The VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/docs/backends/vllm/README.md) example.
 
 By separating the video processing from the prefill and decode stages, we can have a more flexible deployment and scale the VideoEncodeWorker independently from the prefill and decode workers if needed.
 
@@ -431,7 +431,7 @@ In this workflow, we have three workers, [VideoEncodeWorker](components/video_en
 
 For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. As such, the VideoEncodeWorker is connected directly to the prefill worker. The VideoEncodeWorker is responsible for decoding the video into a series of frames and passing them to the prefill worker via RDMA.
The prefill worker performs the prefilling step and forwards the KV cache to the decode worker for decoding. -For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example. +For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/docs/backends/vllm/README.md) example. This figure illustrates the workflow: ```mermaid