diff --git a/README.md b/README.md
index 410eadbbc4..a8b10a0e2c 100644
--- a/README.md
+++ b/README.md
@@ -59,9 +59,9 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 | [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
 | [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
 | [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
-| [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
-| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ |
-| [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
+| [**Load Based Planner**](docs/planner/load_planner.md) | 🚧 | 🚧 | 🚧 |
+| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
+| [**KVBM**](docs/kvbm/kvbm_architecture.md) | ✅ | 🚧 | ✅ |

 To learn more about each framework and their capabilities, check out each framework's README!

@@ -74,7 +74,7 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
 # Installation

 The following examples require a few system level packages.
-Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)
+Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/reference/support-matrix.md](docs/reference/support-matrix.md)

 ## 1. Initial setup

diff --git a/components/backends/sglang/prometheus.md b/components/backends/sglang/prometheus.md
index 30a1c38ba8..68d3d34e44 100644
--- a/components/backends/sglang/prometheus.md
+++ b/components/backends/sglang/prometheus.md
@@ -10,7 +10,7 @@ When running SGLang through Dynamo, SGLang engine metrics are automatically pass

 For the complete and authoritative list of all SGLang metrics, always refer to the official documentation linked above.

-Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../../docs/guides/metrics.md).
+Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).

 ## Metric Reference

@@ -91,7 +91,7 @@ sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
 - [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)

 ### Dynamo Metrics
-- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
+- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics
 - **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
   - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
   - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
diff --git a/components/backends/vllm/deploy/README.md b/components/backends/vllm/deploy/README.md
index 7e726eec0b..a188d44b92 100644
--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -237,7 +237,7 @@ args:
 - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md)
 - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
-- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/kubernetes/sla_planner_quickstart.md)
+- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)

diff --git a/components/backends/vllm/prometheus.md b/components/backends/vllm/prometheus.md
index fce3f5eb6d..b479fe8e0b 100644
--- a/components/backends/vllm/prometheus.md
+++ b/components/backends/vllm/prometheus.md
@@ -10,7 +10,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t

 For the complete and authoritative list of all vLLM metrics, always refer to the official documentation linked above.

-Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../../docs/guides/metrics.md).
+Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).

 ## Metric Reference

@@ -96,7 +96,7 @@ vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
 - [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/engine/metrics)

 ### Dynamo Metrics
-- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
+- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics
 - **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
   - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
   - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
diff --git a/components/src/dynamo/planner/README.md b/components/src/dynamo/planner/README.md
index f666881bb1..fdd5503e90 100644
--- a/components/src/dynamo/planner/README.md
+++ b/components/src/dynamo/planner/README.md
@@ -15,4 +15,4 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-Please refer to [planner docs](../../docs/architecture/planner_intro.rst) for planner documentation.
+Please refer to [planner docs](../../../../docs/planner/planner_intro.rst) for planner documentation.
diff --git a/deploy/metrics/README.md b/deploy/metrics/README.md
index 41b8610fdc..993b6bcd84 100644
--- a/deploy/metrics/README.md
+++ b/deploy/metrics/README.md
@@ -3,7 +3,7 @@
 This directory contains configuration for visualizing metrics from the metrics aggregation service using Prometheus and Grafana.

 > [!NOTE]
-> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](../../docs/guides/metrics.md).
+> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](../../docs/observability/metrics.md).

 ## Overview

diff --git a/deploy/tracing/README.md b/deploy/tracing/README.md
index f3893d2803..299136273d 100644
--- a/deploy/tracing/README.md
+++ b/deploy/tracing/README.md
@@ -19,7 +19,7 @@ Dynamo supports OpenTelemetry-based distributed tracing, allowing you to visuali

 ## Environment Variables

-Dynamo's tracing is configured via environment variables. For complete logging documentation, see [docs/guides/logging.md](../../docs/guides/logging.md).
+Dynamo's tracing is configured via environment variables. For complete logging documentation, see [docs/observability/logging.md](../../docs/observability/logging.md).
 ### Required Environment Variables

diff --git a/docs/API/nixl_connect/README.md b/docs/api/nixl_connect/README.md
similarity index 100%
rename from docs/API/nixl_connect/README.md
rename to docs/api/nixl_connect/README.md
diff --git a/docs/API/nixl_connect/connector.md b/docs/api/nixl_connect/connector.md
similarity index 100%
rename from docs/API/nixl_connect/connector.md
rename to docs/api/nixl_connect/connector.md
diff --git a/docs/API/nixl_connect/descriptor.md b/docs/api/nixl_connect/descriptor.md
similarity index 100%
rename from docs/API/nixl_connect/descriptor.md
rename to docs/api/nixl_connect/descriptor.md
diff --git a/docs/API/nixl_connect/device.md b/docs/api/nixl_connect/device.md
similarity index 100%
rename from docs/API/nixl_connect/device.md
rename to docs/api/nixl_connect/device.md
diff --git a/docs/API/nixl_connect/device_kind.md b/docs/api/nixl_connect/device_kind.md
similarity index 100%
rename from docs/API/nixl_connect/device_kind.md
rename to docs/api/nixl_connect/device_kind.md
diff --git a/docs/API/nixl_connect/operation_status.md b/docs/api/nixl_connect/operation_status.md
similarity index 100%
rename from docs/API/nixl_connect/operation_status.md
rename to docs/api/nixl_connect/operation_status.md
diff --git a/docs/API/nixl_connect/rdma_metadata.md b/docs/api/nixl_connect/rdma_metadata.md
similarity index 100%
rename from docs/API/nixl_connect/rdma_metadata.md
rename to docs/api/nixl_connect/rdma_metadata.md
diff --git a/docs/API/nixl_connect/read_operation.md b/docs/api/nixl_connect/read_operation.md
similarity index 100%
rename from docs/API/nixl_connect/read_operation.md
rename to docs/api/nixl_connect/read_operation.md
diff --git a/docs/API/nixl_connect/readable_operation.md b/docs/api/nixl_connect/readable_operation.md
similarity index 100%
rename from docs/API/nixl_connect/readable_operation.md
rename to docs/api/nixl_connect/readable_operation.md
diff --git a/docs/API/nixl_connect/writable_operation.md b/docs/api/nixl_connect/writable_operation.md
similarity index 100%
rename from docs/API/nixl_connect/writable_operation.md
rename to docs/api/nixl_connect/writable_operation.md
diff --git a/docs/API/nixl_connect/write_operation.md b/docs/api/nixl_connect/write_operation.md
similarity index 100%
rename from docs/API/nixl_connect/write_operation.md
rename to docs/api/nixl_connect/write_operation.md
diff --git a/docs/architecture/architecture.md b/docs/architecture/architecture.md
index f21e1ed639..55aae12435 100644
--- a/docs/architecture/architecture.md
+++ b/docs/architecture/architecture.md
@@ -54,8 +54,8 @@ The following diagram outlines Dynamo's high-level architecture. To enable large

 - [Dynamo Disaggregated Serving](disagg_serving.md)
 - [Dynamo Smart Router](kv_cache_routing.md)
-- [Dynamo KV Cache Block Manager](kvbm_intro.rst)
-- [Planner](planner_intro.rst)
+- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
+- [Planner](../planner/planner_intro.rst)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)

 Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
diff --git a/docs/architecture/kv_cache_routing.md b/docs/architecture/kv_cache_routing.md
index 7f4e22ad9b..45edb97c2d 100644
--- a/docs/architecture/kv_cache_routing.md
+++ b/docs/architecture/kv_cache_routing.md
@@ -154,7 +154,7 @@ For improved fault tolerance, you can launch multiple frontend + router replicas

 ### Router State Management

-The KV Router tracks two types of state (see [KV Router Architecture](../components/router/README.md) for details):
+The KV Router tracks two types of state (see [KV Router Architecture](../router/README.md) for details):

 1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.

@@ -506,4 +506,4 @@ This approach gives you complete control over routing decisions, allowing you to
 - **Maximize cache reuse**: Use `best_worker_id()` which considers both prefill and decode loads
 - **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together

-See [KV Router Architecture](../components/router/README.md) for performance tuning details.
+See [KV Router Architecture](../router/README.md) for performance tuning details.
diff --git a/docs/backends/sglang/README.md b/docs/backends/sglang/README.md
index 6604edb5ee..4a399da6a3 100644
--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -37,9 +37,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../architecture/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
 | [**KV-Aware Routing**](../../architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../architecture/sla_planner.md) | ✅ | |
+| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
 | [**Multimodal EPD Disaggregation**](multimodal_epd.md) | ✅ | |
-| [**KVBM**](../../architecture/kvbm_architecture.md) | ❌ | Planned |
+| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |

 ## Dynamo SGLang Integration

diff --git a/docs/backends/trtllm/README.md b/docs/backends/trtllm/README.md
index 12fbb9e5a9..6bd8338fca 100644
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -55,9 +55,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
 | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
+| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
+| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
+| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | 🚧 | Planned |

 ### Large Scale P/D and WideEP Features

@@ -308,4 +308,4 @@ For detailed instructions on running comprehensive performance sweeps across bot

 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

-Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/guides/run_kvbm_in_trtllm.md) .
+Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
diff --git a/docs/backends/vllm/README.md b/docs/backends/vllm/README.md
index 981e9226e3..06b98fa639 100644
--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -38,9 +38,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
 | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ✅ | |
+| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
+| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
+| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
 | [**LMCache**](./LMCache_Integration.md) | ✅ | |

 ### Large Scale P/D and WideEP Features
diff --git a/docs/benchmarks/pre_deployment_profiling.md b/docs/benchmarks/pre_deployment_profiling.md
index 74ca4df2b3..1d094caf9b 100644
--- a/docs/benchmarks/pre_deployment_profiling.md
+++ b/docs/benchmarks/pre_deployment_profiling.md
@@ -1,7 +1,7 @@
 # Pre-Deployment Profiling

 > [!TIP]
-> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

 ## Profiling Script

@@ -99,7 +99,7 @@ SLA planner can work with any interpolation data that follows the above format.
 ## Detailed Kubernetes Profiling Instructions

 > [!TIP]
-> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

 This section provides detailed technical information for advanced users who need to customize the profiling process.

diff --git a/docs/deploy/metrics/docker-compose.yml b/docs/deploy/metrics/docker-compose.yml
deleted file mode 120000
index f7c658ffff..0000000000
--- a/docs/deploy/metrics/docker-compose.yml
+++ /dev/null
@@ -1 +0,0 @@
-../../../deploy/metrics/docker-compose.yml
\ No newline at end of file
diff --git a/docs/guides/backend.md b/docs/development/backend-guide.md
similarity index 100%
rename from docs/guides/backend.md
rename to docs/development/backend-guide.md
diff --git a/docs/runtime/README.md b/docs/development/runtime-guide.md
similarity index 100%
rename from docs/runtime/README.md
rename to docs/development/runtime-guide.md
diff --git a/docs/guides/tool_calling.md b/docs/guides/tool-calling.md
similarity index 100%
rename from docs/guides/tool_calling.md
rename to docs/guides/tool-calling.md
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index d0dedad0b4..5223ce5355 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -11,18 +11,18 @@
    :maxdepth: 2
    :hidden:

-   runtime/README.md
-   API/nixl_connect/connector.md
-   API/nixl_connect/descriptor.md
-   API/nixl_connect/device.md
-   API/nixl_connect/device_kind.md
-   API/nixl_connect/operation_status.md
-   API/nixl_connect/rdma_metadata.md
-   API/nixl_connect/readable_operation.md
-   API/nixl_connect/writable_operation.md
-   API/nixl_connect/read_operation.md
-   API/nixl_connect/write_operation.md
-   API/nixl_connect/README.md
+   development/runtime-guide.md
+   api/nixl_connect/connector.md
+   api/nixl_connect/descriptor.md
+   api/nixl_connect/device.md
+   api/nixl_connect/device_kind.md
+   api/nixl_connect/operation_status.md
+   api/nixl_connect/rdma_metadata.md
+   api/nixl_connect/readable_operation.md
+   api/nixl_connect/writable_operation.md
+   api/nixl_connect/read_operation.md
+   api/nixl_connect/write_operation.md
+   api/nixl_connect/README.md

    kubernetes/api_reference.md
    kubernetes/create_deployment.md
@@ -32,14 +32,14 @@
    kubernetes/grove.md
    kubernetes/model_caching_with_fluid.md
    kubernetes/README.md
-   guides/dynamo_run.md
-   guides/metrics.md
-   guides/run_kvbm_in_vllm.md
-   guides/run_kvbm_in_trtllm.md
-   guides/tool_calling.md
+   reference/cli.md
+   observability/metrics.md
+   kvbm/vllm-setup.md
+   kvbm/trtllm-setup.md
+   guides/tool-calling.md

    architecture/kv_cache_routing.md
-   architecture/load_planner.md
+   planner/load_planner.md
    architecture/request_migration.md
    architecture/request_cancellation.md

diff --git a/docs/index.rst b/docs/index.rst
index d68399e9f7..9266656936 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -42,7 +42,7 @@ Quickstart
    Quickstart
    Installation <_sections/installation>
-   Support Matrix
+   Support Matrix
    Architecture <_sections/architecture>
    Examples <_sections/examples>
@@ -63,18 +63,18 @@ Quickstart
    :caption: Components

    Backends <_sections/backends>
-   Router
-   Planner
-   KVBM
+   Router
+   Planner
+   KVBM

 .. toctree::
    :hidden:
    :caption: Developer Guide

    Benchmarking Guide
-   SLA Planner (Autoscaling) Quickstart
-   Logging
-   Health Checks
-   Tuning Disaggregated Serving Performance
-   Writing Python Workers in Dynamo
-   Glossary
+   SLA Planner (Autoscaling) Quickstart
+   Logging
+   Health Checks
+   Tuning Disaggregated Serving Performance
+   Writing Python Workers in Dynamo
+   Glossary
diff --git a/docs/kubernetes/create_deployment.md b/docs/kubernetes/create_deployment.md
index 23b37d357a..4997a87a89 100644
--- a/docs/kubernetes/create_deployment.md
+++ b/docs/kubernetes/create_deployment.md
@@ -90,7 +90,7 @@ Consult the corresponding sh file. Each of the python commands to launch a compo
 The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
 Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
-If you are a Dynamo contributor the [dynamo run guide](/docs/guides/dynamo_run.md) for details on how to run this command.
+If you are a Dynamo contributor the [dynamo run guide](/docs/reference/cli.md) for details on how to run this command.

 ## Step 3: Key Customization Points

diff --git a/docs/kubernetes/installation_guide.md b/docs/kubernetes/installation_guide.md
index 49158c6fab..64eaea8260 100644
--- a/docs/kubernetes/installation_guide.md
+++ b/docs/kubernetes/installation_guide.md
@@ -196,7 +196,7 @@ kubectl get pods -n ${NAMESPACE}

 3. **Optional:**
    - [Set up Prometheus & Grafana](metrics.md)
-   - [SLA Planner Quickstart Guide](sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
+   - [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)

 ## Troubleshooting

diff --git a/docs/kubernetes/metrics.md b/docs/kubernetes/metrics.md
index dfb135e0d9..a7e31572ee 100644
--- a/docs/kubernetes/metrics.md
+++ b/docs/kubernetes/metrics.md
@@ -65,7 +65,7 @@ This will create two components:
 Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
 - Deployment configuration: See the [vLLM README](/docs/backends/vllm/README.md)
-- Available metrics: See the [metrics guide](/docs/guides/metrics.md)
+- Available metrics: See the [metrics guide](/docs/observability/metrics.md)

 ### Validate the Deployment

diff --git a/docs/architecture/kvbm_architecture.md b/docs/kvbm/kvbm_architecture.md
similarity index 100%
rename from docs/architecture/kvbm_architecture.md
rename to docs/kvbm/kvbm_architecture.md
diff --git a/docs/architecture/kvbm_components.md b/docs/kvbm/kvbm_components.md
similarity index 100%
rename from docs/architecture/kvbm_components.md
rename to docs/kvbm/kvbm_components.md
diff --git a/docs/architecture/kvbm_intro.rst b/docs/kvbm/kvbm_intro.rst
similarity index 100%
rename from docs/architecture/kvbm_intro.rst
rename to docs/kvbm/kvbm_intro.rst
diff --git a/docs/architecture/kvbm_motivation.md b/docs/kvbm/kvbm_motivation.md
similarity index 100%
rename from docs/architecture/kvbm_motivation.md
rename to docs/kvbm/kvbm_motivation.md
diff --git a/docs/architecture/kvbm_reading.md b/docs/kvbm/kvbm_reading.md
similarity index 100%
rename from docs/architecture/kvbm_reading.md
rename to docs/kvbm/kvbm_reading.md
diff --git a/docs/guides/run_kvbm_in_trtllm.md b/docs/kvbm/trtllm-setup.md
similarity index 99%
rename from docs/guides/run_kvbm_in_trtllm.md
rename to docs/kvbm/trtllm-setup.md
index a6307d2b1b..aabc3e43c7 100644
--- a/docs/guides/run_kvbm_in_trtllm.md
+++ b/docs/kvbm/trtllm-setup.md
@@ -19,7 +19,7 @@ limitations under the License.

 This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in TensorRT-LLM (trtllm).

-To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
+To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)

 > [!Note]
 > - Ensure that `etcd` and `nats` are running before starting.
diff --git a/docs/guides/run_kvbm_in_vllm.md b/docs/kvbm/vllm-setup.md
similarity index 99%
rename from docs/guides/run_kvbm_in_vllm.md
rename to docs/kvbm/vllm-setup.md
index f1be7472cb..00e7b80f4e 100644
--- a/docs/guides/run_kvbm_in_vllm.md
+++ b/docs/kvbm/vllm-setup.md
@@ -19,7 +19,7 @@ limitations under the License.

 This guide explains how to leverage KVBM (KV Block Manager) to mange KV cache and do KV offloading in vLLM.

-To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html)
+To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/kvbm/kvbm_intro.html)

 ## Quick Start

diff --git a/docs/guides/health_check.md b/docs/observability/health-checks.md
similarity index 99%
rename from docs/guides/health_check.md
rename to docs/observability/health-checks.md
index 302d21fb10..8405c964eb 100644
--- a/docs/guides/health_check.md
+++ b/docs/observability/health-checks.md
@@ -197,4 +197,4 @@ date: Wed, 03 Sep 2025 13:42:45 GMT

 - [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
 - [Dynamo Architecture Overview](../architecture/architecture.md)
-- [Backend Guide](backend.md)
+- [Backend Guide](../development/backend-guide.md)
diff --git a/docs/guides/logging.md b/docs/observability/logging.md
similarity index 99%
rename from docs/guides/logging.md
rename to docs/observability/logging.md
index 1065482881..668902bc5f 100644
--- a/docs/guides/logging.md
+++ b/docs/observability/logging.md
@@ -187,5 +187,5 @@ curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":

 - [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
 - [Dynamo Architecture Overview](../architecture/architecture.md)
-- [Backend Guide](backend.md)
+- [Backend Guide](../development/backend-guide.md)
 - [Log Aggregation in Kubernetes](../kubernetes/logging.md)
diff --git a/docs/guides/metrics.md b/docs/observability/metrics.md
similarity index 99%
rename from docs/guides/metrics.md
rename to docs/observability/metrics.md
index c2bc00b874..adbc636187 100644
--- a/docs/guides/metrics.md
+++ b/docs/observability/metrics.md
@@ -96,6 +96,6 @@ The metrics system includes a pre-configured Grafana dashboard for visualizing s

 - [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
 - [Dynamo Architecture Overview](../architecture/architecture.md)
-- [Backend Guide](backend.md)
+- [Backend Guide](../development/backend-guide.md)
 - [Metrics Implementation Examples](../../deploy/metrics/README.md#implementation-examples)
 - [Complete Metrics Setup Guide](../../deploy/metrics/README.md)
\ No newline at end of file
diff --git a/docs/guides/disagg_perf_tuning.md b/docs/performance/tuning.md
similarity index 100%
rename from docs/guides/disagg_perf_tuning.md
rename to docs/performance/tuning.md
diff --git a/docs/architecture/load_planner.md b/docs/planner/load_planner.md
similarity index 100%
rename from docs/architecture/load_planner.md
rename to docs/planner/load_planner.md
diff --git a/docs/architecture/planner_intro.rst b/docs/planner/planner_intro.rst
similarity index 93%
rename from docs/architecture/planner_intro.rst
rename to docs/planner/planner_intro.rst
index 17f8ef4f0b..8326b2a4b0 100644
--- a/docs/architecture/planner_intro.rst
+++ b/docs/planner/planner_intro.rst
@@ -29,7 +29,7 @@ Key features include:
 .. admonition:: 🚀 Quick Start
    :class: seealso

-   **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md) for a complete, step-by-step workflow.
+   **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for a complete, step-by-step workflow.

    **Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need.
@@ -77,6 +77,6 @@ Key features include:
    :hidden:

    Overview
-   SLA Planner Quick Start <../kubernetes/sla_planner_quickstart>
+   SLA Planner Quick Start
    Pre-Deployment Profiling <../benchmarks/pre_deployment_profiling.md>
    SLA-based Planner
diff --git a/docs/architecture/sla_planner.md b/docs/planner/sla_planner.md
similarity index 98%
rename from docs/architecture/sla_planner.md
rename to docs/planner/sla_planner.md
index 119482600c..26f1f90103 100644
--- a/docs/architecture/sla_planner.md
+++ b/docs/planner/sla_planner.md
@@ -1,7 +1,7 @@
 # SLA-based Planner

 > [!TIP]
-> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

 This document covers information regarding the SLA-based planner in `examples/common/utils/planner_core.py`.

@@ -129,7 +129,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill

 ## Deploying

-For complete deployment instructions, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
+For complete deployment instructions, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

 > [!NOTE]
 > The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
diff --git a/docs/kubernetes/sla_planner_quickstart.md b/docs/planner/sla_planner_quickstart.md
similarity index 99%
rename from docs/kubernetes/sla_planner_quickstart.md
rename to docs/planner/sla_planner_quickstart.md
index b540552308..9a71c5ee29 100644
--- a/docs/kubernetes/sla_planner_quickstart.md
+++ b/docs/planner/sla_planner_quickstart.md
@@ -246,7 +246,7 @@ This is because the `subComponentType` field has only been added in newer versio

 ## Next Steps

-- **Architecture Details**: See [SLA-based Planner Architecture](/docs/architecture/sla_planner.md) for technical details
+- **Architecture Details**: See [SLA-based Planner Architecture](/docs/planner/sla_planner.md) for technical details
 - **Performance Tuning**: See [Pre-Deployment Profiling Guide](/docs/benchmarks/pre_deployment_profiling.md) for advanced profiling options
 - **Load Testing**: See [SLA Planner Load Test](/tests/planner/README.md) for comprehensive testing tools

diff --git a/docs/guides/dynamo_run.md b/docs/reference/cli.md
similarity index 100%
rename from docs/guides/dynamo_run.md
rename to docs/reference/cli.md
diff --git a/docs/dynamo_glossary.md b/docs/reference/glossary.md
similarity index 100%
rename from docs/dynamo_glossary.md
rename to docs/reference/glossary.md
diff --git a/docs/support_matrix.md b/docs/reference/support-matrix.md
similarity index 100%
rename from docs/support_matrix.md
rename to docs/reference/support-matrix.md
diff --git a/docs/components/router/README.md b/docs/router/README.md
similarity index 100%
rename from docs/components/router/README.md
rename to docs/router/README.md
diff --git a/lib/bindings/python/README.md b/lib/bindings/python/README.md
index 641b15c47f..4022760aa1 100644
--- a/lib/bindings/python/README.md
+++ b/lib/bindings/python/README.md
@@ -50,7 +50,7 @@ maturin develop --uv

 ### Prerequisite

-See [README.md](../../../docs/runtime/README.md#prerequisites).
+See [README.md](../../../docs/development/runtime-guide.md#prerequisites).
### Hello World Example