8 changes: 4 additions & 4 deletions README.md
@@ -59,9 +59,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
| [**Load Based Planner**](docs/planner/load_planner.md) | 🚧 | 🚧 | 🚧 |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | ✅ | 🚧 | ✅ |

To learn more about each framework and their capabilities, check out each framework's README!

@@ -74,7 +74,7 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
# Installation

The following examples require a few system level packages.
Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)
Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/reference/support-matrix.md](docs/reference/support-matrix.md)

## 1. Initial setup

4 changes: 2 additions & 2 deletions components/backends/sglang/prometheus.md
@@ -10,7 +10,7 @@ When running SGLang through Dynamo, SGLang engine metrics are automatically pass

For the complete and authoritative list of all SGLang metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../../docs/guides/metrics.md).
Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).

## Metric Reference

@@ -91,7 +91,7 @@ sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
- [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)

### Dynamo Metrics
- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
2 changes: 1 addition & 1 deletion components/backends/vllm/deploy/README.md
@@ -237,7 +237,7 @@ args:
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/kubernetes/sla_planner_quickstart.md)
- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)

4 changes: 2 additions & 2 deletions components/backends/vllm/prometheus.md
@@ -10,7 +10,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t

For the complete and authoritative list of all vLLM metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../../docs/guides/metrics.md).
Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).

## Metric Reference

@@ -96,7 +96,7 @@ vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/engine/metrics)

### Dynamo Metrics
- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
2 changes: 1 addition & 1 deletion components/src/dynamo/planner/README.md
@@ -15,4 +15,4 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

Please refer to [planner docs](../../docs/architecture/planner_intro.rst) for planner documentation.
Please refer to [planner docs](../../../../docs/planner/planner_intro.rst) for planner documentation.
2 changes: 1 addition & 1 deletion deploy/metrics/README.md
@@ -3,7 +3,7 @@
This directory contains configuration for visualizing metrics from the metrics aggregation service using Prometheus and Grafana.

> [!NOTE]
> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](../../docs/guides/metrics.md).
> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](../../docs/observability/metrics.md).

## Overview

2 changes: 1 addition & 1 deletion deploy/tracing/README.md
@@ -19,7 +19,7 @@ Dynamo supports OpenTelemetry-based distributed tracing, allowing you to visuali

## Environment Variables

Dynamo's tracing is configured via environment variables. For complete logging documentation, see [docs/guides/logging.md](../../docs/guides/logging.md).
Dynamo's tracing is configured via environment variables. For complete logging documentation, see [docs/observability/logging.md](../../docs/observability/logging.md).

### Required Environment Variables

File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/architecture/architecture.md
@@ -54,8 +54,8 @@ The following diagram outlines Dynamo's high-level architecture. To enable large

- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](kv_cache_routing.md)
- [Dynamo KV Cache Block Manager](kvbm_intro.rst)
- [Planner](planner_intro.rst)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
- [Planner](../planner/planner_intro.rst)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)

Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
4 changes: 2 additions & 2 deletions docs/architecture/kv_cache_routing.md
@@ -154,7 +154,7 @@ For improved fault tolerance, you can launch multiple frontend + router replicas

### Router State Management

The KV Router tracks two types of state (see [KV Router Architecture](../components/router/README.md) for details):
The KV Router tracks two types of state (see [KV Router Architecture](../router/README.md) for details):

1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.

@@ -506,4 +506,4 @@ This approach gives you complete control over routing decisions, allowing you to
- **Maximize cache reuse**: Use `best_worker_id()` which considers both prefill and decode loads
- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together

See [KV Router Architecture](../components/router/README.md) for performance tuning details.
See [KV Router Architecture](../router/README.md) for performance tuning details.
4 changes: 2 additions & 2 deletions docs/backends/sglang/README.md
@@ -37,9 +37,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../architecture/sla_planner.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
| [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
| [**KVBM**](../../architecture/kvbm_architecture.md) | ❌ | Planned |
| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |


## Dynamo SGLang Integration
8 changes: 4 additions & 4 deletions docs/backends/trtllm/README.md
@@ -55,9 +55,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | 🚧 | Planned |

### Large Scale P/D and WideEP Features

@@ -308,4 +308,4 @@ For detailed instructions on running comprehensive performance sweeps across bot

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/guides/run_kvbm_in_trtllm.md) .
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
6 changes: 3 additions & 3 deletions docs/backends/vllm/README.md
@@ -38,9 +38,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
| [**LMCache**](./LMCache_Integration.md) | ✅ | |

### Large Scale P/D and WideEP Features
4 changes: 2 additions & 2 deletions docs/benchmarks/pre_deployment_profiling.md
@@ -1,7 +1,7 @@
# Pre-Deployment Profiling

> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

## Profiling Script

@@ -99,7 +99,7 @@ SLA planner can work with any interpolation data that follows the above format.
## Detailed Kubernetes Profiling Instructions

> [!TIP]
> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

This section provides detailed technical information for advanced users who need to customize the profiling process.

1 change: 0 additions & 1 deletion docs/deploy/metrics/docker-compose.yml

This file was deleted.

File renamed without changes.
File renamed without changes.
File renamed without changes.
36 changes: 18 additions & 18 deletions docs/hidden_toctree.rst
@@ -11,18 +11,18 @@
:maxdepth: 2
:hidden:

runtime/README.md
API/nixl_connect/connector.md
API/nixl_connect/descriptor.md
API/nixl_connect/device.md
API/nixl_connect/device_kind.md
API/nixl_connect/operation_status.md
API/nixl_connect/rdma_metadata.md
API/nixl_connect/readable_operation.md
API/nixl_connect/writable_operation.md
API/nixl_connect/read_operation.md
API/nixl_connect/write_operation.md
API/nixl_connect/README.md
development/runtime-guide.md
api/nixl_connect/connector.md
api/nixl_connect/descriptor.md
api/nixl_connect/device.md
api/nixl_connect/device_kind.md
api/nixl_connect/operation_status.md
api/nixl_connect/rdma_metadata.md
api/nixl_connect/readable_operation.md
api/nixl_connect/writable_operation.md
api/nixl_connect/read_operation.md
api/nixl_connect/write_operation.md
api/nixl_connect/README.md

kubernetes/api_reference.md
kubernetes/create_deployment.md
@@ -32,14 +32,14 @@
kubernetes/grove.md
kubernetes/model_caching_with_fluid.md
kubernetes/README.md
guides/dynamo_run.md
guides/metrics.md
guides/run_kvbm_in_vllm.md
guides/run_kvbm_in_trtllm.md
guides/tool_calling.md
reference/cli.md
observability/metrics.md
kvbm/vllm-setup.md
kvbm/trtllm-setup.md
guides/tool-calling.md

architecture/kv_cache_routing.md
architecture/load_planner.md
planner/load_planner.md
architecture/request_migration.md
architecture/request_cancellation.md

20 changes: 10 additions & 10 deletions docs/index.rst
@@ -42,7 +42,7 @@ Quickstart

Quickstart <self>
Installation <_sections/installation>
Support Matrix <support_matrix.md>
Support Matrix <reference/support-matrix.md>
Architecture <_sections/architecture>
Examples <_sections/examples>

@@ -63,18 +63,18 @@ Quickstart
:caption: Components

Backends <_sections/backends>
Router <components/router/README>
Planner <architecture/planner_intro>
KVBM <architecture/kvbm_intro>
Router <router/README>
Planner <planner/planner_intro>
KVBM <kvbm/kvbm_intro>

.. toctree::
:hidden:
:caption: Developer Guide

Benchmarking Guide <benchmarks/benchmarking.md>
SLA Planner (Autoscaling) Quickstart <kubernetes/sla_planner_quickstart>
Logging <guides/logging.md>
Health Checks <guides/health_check.md>
Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md>
Writing Python Workers in Dynamo <guides/backend.md>
Glossary <dynamo_glossary.md>
SLA Planner (Autoscaling) Quickstart <planner/sla_planner_quickstart>
Logging <observability/logging.md>
Health Checks <observability/health-checks.md>
Tuning Disaggregated Serving Performance <performance/tuning.md>
Writing Python Workers in Dynamo <development/backend-guide.md>
Glossary <reference/glossary.md>
2 changes: 1 addition & 1 deletion docs/kubernetes/create_deployment.md
@@ -90,7 +90,7 @@ Consult the corresponding sh file. Each of the python commands to launch a compo

The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
If you are a Dynamo contributor the [dynamo run guide](/docs/guides/dynamo_run.md) for details on how to run this command.
If you are a Dynamo contributor the [dynamo run guide](/docs/reference/cli.md) for details on how to run this command.


## Step 3: Key Customization Points
2 changes: 1 addition & 1 deletion docs/kubernetes/installation_guide.md
@@ -196,7 +196,7 @@ kubectl get pods -n ${NAMESPACE}

3. **Optional:**
- [Set up Prometheus & Grafana](metrics.md)
- [SLA Planner Quickstart Guide](sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
- [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)

## Troubleshooting

2 changes: 1 addition & 1 deletion docs/kubernetes/metrics.md
@@ -65,7 +65,7 @@ This will create two components:

Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md)
- Available metrics: See the [metrics guide](/docs/guides/metrics.md)
- Available metrics: See the [metrics guide](/docs/observability/metrics.md)

### Validate the Deployment

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -19,7 +19,7 @@ limitations under the License.

This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in TensorRT-LLM (trtllm).

To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)

> [!Note]
> - Ensure that `etcd` and `nats` are running before starting.
@@ -19,7 +19,7 @@ limitations under the License.

This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in vLLM.

To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)

## Quick Start

@@ -197,4 +197,4 @@ date: Wed, 03 Sep 2025 13:42:45 GMT

- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Backend Guide](backend.md)
- [Backend Guide](../development/backend-guide.md)
2 changes: 1 addition & 1 deletion docs/guides/logging.md → docs/observability/logging.md
@@ -187,5 +187,5 @@ curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":

- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Backend Guide](backend.md)
- [Backend Guide](../development/backend-guide.md)
- [Log Aggregation in Kubernetes](../kubernetes/logging.md)
2 changes: 1 addition & 1 deletion docs/guides/metrics.md → docs/observability/metrics.md
@@ -96,6 +96,6 @@ The metrics system includes a pre-configured Grafana dashboard for visualizing s

- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Backend Guide](backend.md)
- [Backend Guide](../development/backend-guide.md)
- [Metrics Implementation Examples](../../deploy/metrics/README.md#implementation-examples)
- [Complete Metrics Setup Guide](../../deploy/metrics/README.md)
File renamed without changes.
File renamed without changes.