8 changes: 4 additions & 4 deletions README.md
@@ -30,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative

## Latest News

-- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
+- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./docs/backends/trtllm/gpt-oss.md)

## The Era of Multi-GPU, Multi-Node

@@ -65,9 +65,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa

To learn more about each framework and its capabilities, check out each framework's README!

-- **[vLLM](components/backends/vllm/README.md)**
-- **[SGLang](components/backends/sglang/README.md)**
-- **[TensorRT-LLM](components/backends/trtllm/README.md)**
+- **[vLLM](docs/backends/vllm/README.md)**
+- **[SGLang](docs/backends/sglang/README.md)**
+- **[TensorRT-LLM](docs/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, OSS (Open Source Software)-first development approach.

6 changes: 3 additions & 3 deletions components/README.md
@@ -23,9 +23,9 @@ This directory contains the core components that make up the Dynamo inference fr

Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with their own deployment configurations and capabilities:

-- **[vLLM](backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
-- **[SGLang](backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
-- **[TensorRT-LLM](backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
+- **[vLLM](/docs/backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
+- **[SGLang](/docs/backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
+- **[TensorRT-LLM](/docs/backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration

Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories.
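For orientation, listing one backend's launch directory shows the deployment patterns at a glance (a sketch — script names vary by backend; `agg.sh` and `disagg.sh` are the ones that appear later in this diff):

```bash
# Illustrative only: inspect the launch scripts that ship with a backend.
# Exact script names differ per engine — check each backend's README.
ls components/backends/trtllm/launch/
# e.g. agg.sh  agg_router.sh  disagg.sh
```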

2 changes: 1 addition & 1 deletion components/backends/sglang/slurm_jobs/README.md
@@ -17,7 +17,7 @@ For this example, we will make some assumptions about your SLURM cluster:
If your cluster supports similar container-based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
-described [here](../docs/dsr1-wideep-gb200.md#instructions).
+described [here](../../../../docs/backends/sglang/dsr1-wideep-gb200.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.

## Scripts Overview
8 changes: 4 additions & 4 deletions components/backends/trtllm/deploy/README.md
@@ -232,7 +232,7 @@ envs:

## Testing the Deployment

-Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.
+Send a test request to verify your deployment. See the [client section](../../../../docs/backends/vllm/README.md#client) for detailed instructions.

**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
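For a quick smoke test, something like the following works against the OpenAI-compatible frontend (a minimal sketch — it assumes the frontend listens on `localhost:8000`, and the model name is a placeholder to replace with the one your deployment serves):

```bash
# Minimal smoke test against the OpenAI-compatible frontend.
# Assumptions: frontend on localhost:8000; "openai/gpt-oss-120b" is a
# placeholder — use the model name your deployment actually serves.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```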

Expand All @@ -254,7 +254,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
- **UCX** (default): Standard method for KV cache transfer
- **NIXL** (experimental): Alternative transfer method

-For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-transfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
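In practice, switching transfer methods is an environment toggle applied before launching the workers — a minimal sketch, assuming the experimental NIXL flag described in that guide (verify the variable name against your Dynamo version):

```bash
# Sketch: opt into the experimental NIXL KV cache transfer path.
# TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL is assumed from the KV cache
# transfer guide — confirm it for your Dynamo release.
export TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL=1
./launch/disagg.sh   # UCX remains the default when the variable is unset
```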

## Request Migration

@@ -282,8 +282,8 @@ Configure the `model` name and `host` based on your deployment.
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
-- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
-- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
+- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
+- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting
2 changes: 1 addition & 1 deletion components/backends/trtllm/performance_sweeps/README.md
@@ -41,7 +41,7 @@ Please note that:
3. `post_process.py` - Scans the genai-perf results and produces a JSON file with an entry for each config point.
4. `plot_performance_comparison.py` - Takes the JSON result file for disaggregated and/or aggregated configuration sweeps and plots a Pareto line for easier visual comparison.

-For finer-grained details on how to launch TRT-LLM backend workers with DeepSeek R1 on GB200 SLURM, please refer to [multinode-examples.md](../multinode/multinode-examples.md). This guide makes similar assumptions to the multinode examples guide.
+For finer-grained details on how to launch TRT-LLM backend workers with DeepSeek R1 on GB200 SLURM, please refer to [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide makes similar assumptions to the multinode examples guide.
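A hypothetical end-to-end pass over the sweep outputs could look like this (argument names are placeholders, not the scripts' documented CLI — check each script's `--help`):

```bash
# Hypothetical invocation — flag names below are illustrative placeholders.
python3 post_process.py --results-dir ./genai_perf_results --output sweep.json
python3 plot_performance_comparison.py sweep.json --output pareto.png
```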

## Usage

12 changes: 6 additions & 6 deletions docs/_includes/dive_in_examples.rst
@@ -11,20 +11,20 @@ The examples below assume you build the latest image yourself from source. If us

Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph

-.. grid-item-card:: :doc:`vLLM <../components/backends/vllm/README>`
-:link: ../components/backends/vllm/README
+.. grid-item-card:: :doc:`vLLM <../backends/vllm/README>`
+:link: ../backends/vllm/README
:link-type: doc

Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with vLLM.

-.. grid-item-card:: :doc:`SGLang <../components/backends/sglang/README>`
-:link: ../components/backends/sglang/README
+.. grid-item-card:: :doc:`SGLang <../backends/sglang/README>`
+:link: ../backends/sglang/README
:link-type: doc

Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.

-.. grid-item-card:: :doc:`TensorRT-LLM <../components/backends/trtllm/README>`
-:link: ../components/backends/trtllm/README
+.. grid-item-card:: :doc:`TensorRT-LLM <../backends/trtllm/README>`
+:link: ../backends/trtllm/README
:link-type: doc

Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.
6 changes: 3 additions & 3 deletions docs/_sections/backends.rst
@@ -37,6 +37,6 @@ Dynamo currently supports the following high-performance inference backends:
.. toctree::
:maxdepth: 1

-vLLM <../components/backends/vllm/README>
-SGLang <../components/backends/sglang/README>
-TensorRT-LLM <../components/backends/trtllm/README>
+vLLM <../backends/vllm/README>
+SGLang <../backends/sglang/README>
+TensorRT-LLM <../backends/trtllm/README>
2 changes: 1 addition & 1 deletion docs/architecture/kvbm_intro.rst
@@ -63,4 +63,4 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes
KVBM Architecture <kvbm_architecture.md>
Understanding KVBM components <kvbm_components.md>
KVBM Further Reading <kvbm_reading>
-LMCache Integration <../components/backends/vllm/LMCache_Integration.md>
+LMCache Integration <../backends/vllm/LMCache_Integration>
components/backends/sglang/README.md → docs/backends/sglang/README.md (renamed)
@@ -35,13 +35,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | SGLang | Notes |
|---------|--------|-------|
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
-| [**Multimodal EPD Disaggregation**](docs/multimodal_epd.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
+| [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ | |
+| [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
+| [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ | |
+| [**SLA-Based Planner**](../../architecture/sla_planner.md) | ✅ | |
+| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
+| [**Load Based Planner**](../../architecture/load_planner.md) | ❌ | Planned |
+| [**KVBM**](../../architecture/kvbm_architecture.md) | ❌ | Planned |

### Large Scale P/D and WideEP Features

@@ -229,7 +229,7 @@ cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_dp_attn.sh
```

-When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, sets the environment variable that controls the expert distribution recording directory, and sets the recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
+When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, sets the environment variable that controls the expert distribution recording directory, and sets the recording mode to `stat`. You can learn more about expert parallelism load balancing [here](expert-distribution-eplb.md).
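Recording can then be driven over HTTP — a sketch assuming SGLang's native expert-distribution endpoints and a server on `localhost:30000` (both the port and the endpoint names should be verified against your SGLang version):

```bash
# Sketch: drive expert distribution recording via the SGLang HTTP server.
# Port and endpoint names are assumptions — verify against your SGLang build.
curl -X POST localhost:30000/start_expert_distribution_record
# ... send inference traffic ...
curl -X POST localhost:30000/stop_expert_distribution_record
curl -X POST localhost:30000/dump_expert_distribution_record   # writes to the configured directory
```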

### Testing the Deployment

@@ -266,24 +266,24 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Run a multi-node sized model
-- **[Run a multi-node model](docs/multinode-examples.md)**
+- **[Run a multi-node model](multinode-examples.md)**

### Large scale P/D disaggregation with WideEP
-- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1-FP8 on GB200s](docs/dsr1-wideep-gb200.md)**
+- **[Run DeepSeek-R1 on 104+ H100s](dsr1-wideep-h100.md)**
+- **[Run DeepSeek-R1-FP8 on GB200s](dsr1-wideep-gb200.md)**

### Hierarchical Cache (HiCache)
-- **[Enable SGLang Hierarchical Cache (HiCache)](docs/sgl-hicache-example.md)**
+- **[Enable SGLang Hierarchical Cache (HiCache)](sgl-hicache-example.md)**

### Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
-- **[Run a multimodal model with EPD Disaggregation](docs/multimodal_epd.md)**
+- **[Run a multimodal model with EPD Disaggregation](multimodal_epd.md)**

## Deployment

We currently provide deployment examples for Kubernetes and SLURM.

## Kubernetes
-- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
+- **[Deploying Dynamo with SGLang on Kubernetes](../../../components/backends/sglang/deploy/README.md)**

## SLURM
-- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
+- **[Deploying Dynamo with SGLang on SLURM](../../../components/backends/sglang/slurm_jobs/README.md)**
components/backends/sglang/gpt-oss.md → docs/backends/sglang/gpt-oss.md (renamed)
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0

# Running gpt-oss-120b Disaggregated with SGLang

-The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/components/backends/vllm/gpt-oss.md),
+The gpt-oss-120b guide for SGLang is largely identical to the [guide for vLLM](/docs/backends/vllm/gpt-oss.md),
please use the vLLM guide as a reference, with the differing deployment steps highlighted below:

# Launch the Deployment
components/backends/sglang/docs/multimodal_epd.md → docs/backends/sglang/multimodal_epd.md (renamed)
@@ -31,7 +31,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

The MultimodalEncodeWorker is responsible for encoding the image and passing the embeddings to the MultimodalWorker via a combination of NATS and RDMA.
The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
-The MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../README.md) example.
+The MultimodalWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](README.md) example.
By separating the encode from the prefill and decode stages, we can have a more flexible deployment and scale the
MultimodalEncodeWorker independently from the prefill and decode workers if needed.

@@ -116,7 +116,7 @@ You should see a response similar to this:

For the Qwen2.5-VL model, embeddings are only required during the prefill stage. As such, the image embeddings are transferred using a NIXL descriptor from the encode worker to the worker and then passed to the prefill worker for processing.
The prefill worker performs the prefilling step and forwards the KV cache to the worker for decoding.
-For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../README.md) example.
+For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](README.md) example.

This figure illustrates the workflow:
```mermaid
components/backends/trtllm/README.md → docs/backends/trtllm/README.md (renamed)
@@ -186,11 +186,11 @@ For comprehensive instructions on multinode serving, see the [multinode-examples

### Kubernetes Deployment

-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../components/backends/trtllm/deploy/README.md).

### Client

-See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
+See the [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node running `python3 -m dynamo.frontend <args>`.

@@ -230,7 +230,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ

## Client

-See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
+See the [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node running `python3 -m dynamo.frontend <args>`.

@@ -302,7 +302,7 @@ sampling_params.logits_processor = create_trtllm_adapters(processors)

## Performance Sweep

-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../components/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.

## Dynamo KV Block Manager Integration

Gemma 3 VSWA guide, components/backends/trtllm/ → docs/backends/trtllm/ (renamed; filename not shown)
@@ -23,9 +23,9 @@ VSWA is a mechanism in which a model’s layers alternate between multiple slidi
> [!Note]
> - Ensure that required services such as `nats` and `etcd` are running before starting.
> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
-> - Its recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
+> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.

-### Aggregated Serving
+## Aggregated Serving
```bash
cd $DYNAMO_HOME/components/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
@@ -34,7 +34,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
./launch/agg.sh
```

-### Aggregated Serving with KV Routing
+## Aggregated Serving with KV Routing
```bash
cd $DYNAMO_HOME/components/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
@@ -43,7 +43,7 @@ export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
./launch/agg_router.sh
```

-#### Disaggregated Serving
+## Disaggregated Serving
```bash
cd $DYNAMO_HOME/components/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
@@ -53,7 +53,7 @@ export DECODE_ENGINE_ARGS=engine_configs/gemma3/vswa_decode.yaml
./launch/disagg.sh
```

-#### Disaggregated Serving with KV Routing
+## Disaggregated Serving with KV Routing
```bash
cd $DYNAMO_HOME/components/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
# … (remaining lines of this block collapsed in the diff)
```
File renamed without changes.