[codex] Rewrite InferenceServer docs for Ray Serve and Dynamo #2147

Open
lbliii wants to merge 2 commits into NVIDIA-NeMo:main from lbliii:codex/docs-inference-server-dynamo
@lbliii lbliii commented Jul 1, 2026

Summary

  • replace the obsolete InferenceModelConfig guide with typed Ray Serve and NVIDIA Dynamo configuration
  • add runnable Ray Serve, Dynamo aggregated, and Dynamo disaggregated examples
  • document model/server validation, multi-model behavior, static replicas, role overrides, router modes, KV-event constraints, multimodal routing, runtime environments, placement, lifecycle, and HAProxy ingress
  • add a concrete migration from InferenceModelConfig
  • update the LLM client, NeMo Data Designer, and release-note references to current typed APIs

Why

The published quickstart imported InferenceModelConfig, which no longer exists on main, and described Ray Serve as the only backend. Users could not run the examples or discover the current Dynamo serving surface.

User impact

Users can now choose RayServeModelConfig or DynamoVLLMModelConfig, size aggregated or disaggregated deployments, configure routing and runtime environments, and understand the tested architecture/dependency boundaries before starting a cluster.

Validation

  • fern check — 0 errors
  • fern docs broken-links — no errors in changed pages; 22 pre-existing errors remain in older API-reference pages
  • parsed all 32 Python fences in the changed guides with ast.parse
  • instantiated the documented Ray Serve, aggregated Dynamo, disaggregated Dynamo, role, and router configurations against current main
  • focused serve/config/runtime-env unit tests — 22 passed, 1 skipped
    • excluded the repository-wide Ray fixture because this sandbox hostname does not resolve; no focused assertion failed
  • git diff --check
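
The fence-parsing check listed above can be reproduced with a short script. This is a minimal sketch, not the script used for this PR: the regular expression, the function names, and the assumption that every fence starts at column zero are all illustrative.

```python
import ast
import re

# Build the fence marker programmatically so this example can itself live in a fence.
FENCE = "`" * 3

# Assumption: fences open with the marker plus "python" at the start of a line.
FENCE_RE = re.compile(
    rf"^{FENCE}python[^\n]*\n(.*?)^{FENCE}",
    re.DOTALL | re.MULTILINE,
)


def python_fences(text: str) -> list[str]:
    """Return the body of every Python fence found in a Markdown/MDX string."""
    return FENCE_RE.findall(text)


def syntax_errors(text: str) -> list[str]:
    """Run ast.parse on each fence and describe the ones that fail to parse."""
    errors = []
    for index, body in enumerate(python_fences(text), start=1):
        try:
            ast.parse(body)
        except SyntaxError as exc:
            errors.append(f"fence {index}: {exc.msg}")
    return errors
```

`ast.parse` only proves that a fence is syntactically valid; it does not import anything, which is why the config instantiation step is listed separately.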

Closes #2146

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot Bot commented Jul 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii self-assigned this Jul 1, 2026
@lbliii lbliii marked this pull request as ready for review July 1, 2026 21:00
@lbliii lbliii requested a review from a team as a code owner July 1, 2026 21:00
@lbliii lbliii requested review from abhinavg4 and removed request for a team July 1, 2026 21:00
greptile-apps Bot commented Jul 1, 2026

Greptile Summary

This PR rewrites the InferenceServer documentation to replace the removed InferenceModelConfig with typed RayServeModelConfig and DynamoVLLMModelConfig backends, and updates the LLM client and NeMo Data Designer guides to match.

  • inference-server.mdx is expanded from ~200 to ~470 lines, adding Ray Serve and Dynamo quickstarts, per-backend configuration reference tables, disaggregated prefill/decode examples, routing and KV-events semantics, runtime-env merging rules, resource-placement constraints, HAProxy ingress notes, a migration guide from InferenceModelConfig, and a troubleshooting section.
  • llm-client.mdx and nemo-data-designer.mdx each swap one import and one constructor from InferenceModelConfig to RayServeModelConfig, keeping their code examples runnable against current main.
  • release-notes/index.mdx updates the feature bullet to name both new typed config classes; .secrets.baseline reflects the changed line number for the existing api_key="unused" string.

Confidence Score: 5/5

Documentation-only change; no executable code paths are touched. The new content is accurate against current main and both previously-threaded doc discrepancies have been corrected.

All five changed files are documentation or a generated secrets baseline. The code examples in the three MDX files were validated by the author (ast.parse + config instantiation) and the imports correctly reflect the removed InferenceModelConfig. The two issues noted in prior review threads are now addressed in the new text.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/synthetic/inference-server.mdx | Major rewrite from ~200 to ~470 lines: introduces typed RayServeModelConfig and DynamoVLLMModelConfig backends, adds aggregated/disaggregated Dynamo examples, routing config, runtime-env guidance, resource placement, HAProxy ingress, migration guide, and troubleshooting. Both previously-threaded issues (kv_events auto-enable, DynamoRoleConfig vs DynamoVLLMModelConfig constraint attribution) are now accurately documented. |
| fern/versions/main/pages/curate-text/synthetic/llm-client.mdx | Minimal change: imports and constructor swapped from InferenceModelConfig to RayServeModelConfig. Surrounding prose still refers to InferenceServer as "(Ray Serve + vLLM)" without mentioning Dynamo, though the context is a Ray Serve-specific example. |
| fern/versions/main/pages/curate-text/synthetic/nemo-data-designer.mdx | Single import/constructor rename from InferenceModelConfig to RayServeModelConfig in the end-to-end example. Rest of file unchanged and internally consistent. |
| fern/versions/main/pages/about/release-notes/index.mdx | Single-line release note update replacing the InferenceModelConfig-specific description with the new typed backend language (RayServeModelConfig / DynamoVLLMModelConfig). Clean and accurate. |
| .github/workflows/config/.secrets.baseline | Line number for the api_key="unused" string updated from 173 to 373 to reflect the grown inference-server.mdx; generated_at timestamp refreshed. Same hash, so no new secrets introduced. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["InferenceServer(models, backend)"] --> B{backend type?}
    B -- "RayServeServerConfig (default)" --> C["Ray Serve backend"]
    B -- "DynamoServerConfig" --> D["Dynamo backend"]

    C --> E["RayServeModelConfig\n(model_identifier, deployment_config,\nengine_kwargs, runtime_env)"]
    E --> F["Ray Serve autoscaling\n(min/max replicas)"]

    D --> G{mode?}
    G -- "aggregated" --> H["DynamoVLLMModelConfig\n(num_replicas, engine_kwargs)"]
    G -- "disagg" --> I["DynamoVLLMModelConfig\n(prefill=DynamoRoleConfig,\ndecode=DynamoRoleConfig)"]

    H --> J["Static aggregated replicas\n(multi-node TP supported)"]
    I --> K["Prefill workers + Decode workers\n(single-node TP per role)"]

    D --> L["DynamoRouterConfig\n(mode: kv/round_robin/random/direct)"]
    L --> M{mode=None?}
    M -- "any disagg model" --> N["Auto-select KV routing\n+ enable kv_events"]
    M -- "aggregated only" --> O["Round-robin default"]
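
The router branch of the flowchart can be read as a small decision function. The sketch below is a hedged illustration only: `resolve_router_mode` is a hypothetical name, and the string values are taken from the flowchart labels rather than from the library source.

```python
from typing import Optional, Sequence


def resolve_router_mode(requested: Optional[str], model_modes: Sequence[str]) -> str:
    """Pick a router mode the way the flowchart describes it.

    `requested` stands in for DynamoRouterConfig.mode; `model_modes` holds the
    mode of each configured model. Only "disagg" is inspected here, so the exact
    spelling used for aggregated models does not matter for this sketch.
    """
    if requested is not None:
        # An explicit mode is kept as-is.
        return requested
    if any(mode == "disagg" for mode in model_modes):
        # Any disaggregated model makes the auto-selected mode "kv".
        return "kv"
    # Aggregated-only deployments fall back to round-robin.
    return "round_robin"
```

For example, `resolve_router_mode(None, ["aggregated", "disagg"])` yields `"kv"`, while an explicit `"random"` is returned unchanged.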

Reviews (2): Last reviewed commit: "docs: clarify Dynamo routing behavior"

Comment on lines +217 to +219
model_identifier="HuggingFaceTB/SmolLM2-135M-Instruct",
mode="disagg",
engine_kwargs={

P2 Disaggregated mode silently overrides kv_events=False

The docs state that kv_events=False "uses approximate tree-based tracking," implying the default applies for disaggregated serving. In practice, DynamoBackend._resolve_effective_router computes kv_events = mode == "kv" and (mode_was_auto_picked or router.kv_events). When auto-routing selects "kv" for any disaggregated model (mode_was_auto_picked=True), kv_events is forced to True regardless of the DynamoRouterConfig default of kv_events=False. A user who relies on the default router config expecting tree-based tracking with disaggregated serving will actually get event-backed KV routing. The only exception is when an HMA publisher is detected and the user explicitly left kv_events=False. This auto-enable behavior should be documented here to avoid surprises.
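
To make the override concrete, here is the quoted boolean expression lifted into a standalone function. It is a sketch that only mirrors the expression cited above: the function name is hypothetical, and the HMA-publisher exception is deliberately not modeled.

```python
def effective_kv_events(mode: str, mode_was_auto_picked: bool, router_kv_events: bool) -> bool:
    """Mirror `mode == "kv" and (mode_was_auto_picked or router.kv_events)`."""
    return mode == "kv" and (mode_was_auto_picked or router_kv_events)


# Auto-picked "kv" routing: the kv_events=False default is overridden.
auto_picked = effective_kv_events("kv", mode_was_auto_picked=True, router_kv_events=False)

# Explicit mode="kv" with the default kv_events=False: events stay off.
explicit_default = effective_kv_events("kv", mode_was_auto_picked=False, router_kv_events=False)

# Non-KV routing never enables KV events, whatever the router config says.
non_kv = effective_kv_events("round_robin", mode_was_auto_picked=False, router_kv_events=True)
```

Under these assumptions `auto_picked` is `True` while `explicit_default` and `non_kv` are `False`, which is the surprise the comment describes: leaving both `mode` and `kv_events` at their defaults with a disaggregated model still yields event-backed routing.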


| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `num_replicas` | `int` | `1` | Number of workers for this role. Disaggregated models require at least one prefill and one decode replica. |

P2 DynamoRoleConfig.num_replicas validation described inaccurately

The table says the constraint is "at least one prefill and one decode replica," presenting it as a DynamoRoleConfig-level rule. In the code, DynamoRoleConfig.__post_init__ only enforces >= 0; the >= 1 requirement is checked by DynamoVLLMModelConfig.__post_init__. A user can successfully construct DynamoRoleConfig(num_replicas=0) and only hit the error when that config is embedded in a DynamoVLLMModelConfig. Attributing the constraint to the model config rather than the role config is more accurate.

Suggested change (current row first, proposed row second):
| `num_replicas` | `int` | `1` | Number of workers for this role. Disaggregated models require at least one prefill and one decode replica. |
| `num_replicas` | `int` | `1` | Number of workers for this role. Must be `>= 0`; `DynamoVLLMModelConfig` enforces that both prefill and decode are `>= 1` for disaggregated mode. |
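
The two-level validation can be illustrated with stand-in dataclasses. These are hypothetical sketches (`RoleConfigSketch` and `ModelConfigSketch` are not the real classes) that only reproduce the split described in this comment: the role accepts `0`, and the enclosing model config rejects it.

```python
from dataclasses import dataclass


@dataclass
class RoleConfigSketch:
    """Stand-in for DynamoRoleConfig: only rejects negative replica counts."""

    num_replicas: int = 1

    def __post_init__(self) -> None:
        if self.num_replicas < 0:
            raise ValueError("num_replicas must be >= 0")


@dataclass
class ModelConfigSketch:
    """Stand-in for a disaggregated DynamoVLLMModelConfig."""

    prefill: RoleConfigSketch
    decode: RoleConfigSketch

    def __post_init__(self) -> None:
        # The >= 1 rule lives here, not on the role config.
        for name, role in (("prefill", self.prefill), ("decode", self.decode)):
            if role.num_replicas < 1:
                raise ValueError(f"{name}.num_replicas must be >= 1 in disaggregated mode")


# Constructing the role alone succeeds even with zero replicas...
empty_role = RoleConfigSketch(num_replicas=0)

# ...and the error only surfaces once it is embedded in the model config.
try:
    ModelConfigSketch(prefill=empty_role, decode=RoleConfigSketch())
    embedded_error = None
except ValueError as exc:
    embedded_error = str(exc)
```

This is why attributing the constraint to the model config is the more accurate wording: a reader testing the role config in isolation would not see any failure.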

@lbliii lbliii force-pushed the codex/docs-inference-server-dynamo branch from a98ec45 to 10e59b1 on July 2, 2026 02:10
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Development

Successfully merging this pull request may close these issues.

[Docs] Rewrite InferenceServer docs for typed Ray Serve and Dynamo backends

1 participant