[codex] Rewrite InferenceServer docs for Ray Serve and Dynamo by lbliii · Pull Request #2147 · NVIDIA-NeMo/Curator

lbliii · 2026-07-01T17:46:31Z

Summary

- replace the obsolete InferenceModelConfig guide with typed Ray Serve and NVIDIA Dynamo configuration
- add runnable Ray Serve, Dynamo aggregated, and Dynamo disaggregated examples
- document model/server validation, multi-model behavior, static replicas, role overrides, router modes, KV-event constraints, multimodal routing, runtime environments, placement, lifecycle, and HAProxy ingress
- add a concrete migration from InferenceModelConfig
- update the LLM client, NeMo Data Designer, and release-note references to current typed APIs

Why

The published quickstart imported InferenceModelConfig, which no longer exists on main, and described Ray Serve as the only backend. Users could not run the examples or discover the current Dynamo serving surface.

User impact

Users can now choose RayServeModelConfig or DynamoVLLMModelConfig, size aggregated or disaggregated deployments, configure routing and runtime environments, and understand the tested architecture/dependency boundaries before starting a cluster.

Validation

- fern check: 0 errors
- fern docs broken-links: no errors in changed pages; 22 pre-existing errors remain in older API-reference pages
- parsed all 32 Python fences in the changed guides with ast.parse
- instantiated the documented Ray Serve, aggregated Dynamo, disaggregated Dynamo, role, and router configurations against current main
- focused serve/config/runtime-env unit tests: 22 passed, 1 skipped
  - excluded the repository-wide Ray fixture because this sandbox hostname does not resolve; no focused assertion failed
- git diff --check
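
The fence-parsing check listed above can be approximated with the standard library alone. The following is a minimal sketch, not the script that was actually run for this PR: it assumes code blocks are marked with unindented triple-backtick `python` fences, and the names `check_python_fences`, `FENCE_RE`, and `SAMPLE` are hypothetical. It only confirms that each fence is syntactically valid Python; it does not execute anything.

```python
import ast
import re

# Assumption: fences start at column 0 and are tagged "python".
# The backticks are written as `{3} so this example can itself sit inside a fence.
FENCE_RE = re.compile(r"^`{3}python[^\n]*\n(.*?)^`{3}", re.DOTALL | re.MULTILINE)


def check_python_fences(text: str) -> list[tuple[int, str]]:
    """Return (fence_index, error_message) for every fence that fails ast.parse."""
    failures = []
    for index, match in enumerate(FENCE_RE.finditer(text), start=1):
        try:
            ast.parse(match.group(1))
        except SyntaxError as exc:
            failures.append((index, f"line {exc.lineno}: {exc.msg}"))
    return failures


# Hypothetical sample document: the first fence is valid, the second is not.
FENCE = "`" * 3
SAMPLE = "\n".join([
    "Intro text.",
    "",
    FENCE + "python",
    "x = 1",
    "print(x)",
    FENCE,
    "",
    FENCE + "python",
    "def broken(:",
    "    pass",
    FENCE,
    "",
])

if __name__ == "__main__":
    for index, message in check_python_fences(SAMPLE):
        print(f"fence {index} failed: {message}")
```

A syntax-only check like this catches truncated or mis-pasted examples cheaply, which is why it pairs well with the separate step of instantiating the documented configurations against main.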

Closes #2146

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-07-01T17:46:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-07-01T21:04:31Z

Greptile Summary

This PR rewrites the InferenceServer documentation to replace the removed InferenceModelConfig with typed RayServeModelConfig and DynamoVLLMModelConfig backends, and updates the LLM client and NeMo Data Designer guides to match.

inference-server.mdx is expanded from ~200 to ~470 lines, adding Ray Serve and Dynamo quickstarts, per-backend configuration reference tables, disaggregated prefill/decode examples, routing and KV-events semantics, runtime-env merging rules, resource-placement constraints, HAProxy ingress notes, a migration guide from InferenceModelConfig, and a troubleshooting section.
llm-client.mdx and nemo-data-designer.mdx each swap one import and one constructor from InferenceModelConfig to RayServeModelConfig, keeping their code examples runnable against current main.
release-notes/index.mdx updates the feature bullet to name both new typed config classes; .secrets.baseline reflects the changed line number for the existing api_key="unused" string.

Confidence Score: 5/5

Documentation-only change; no executable code paths are touched. The new content is accurate against current main and both previously-threaded doc discrepancies have been corrected.

All five changed files are documentation or a generated secrets baseline. The code examples in the three MDX files were validated by the author (ast.parse + config instantiation) and the imports correctly reflect the removed InferenceModelConfig. The two issues noted in prior review threads are now addressed in the new text.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/main/pages/curate-text/synthetic/inference-server.mdx | Major rewrite from ~200 to ~470 lines: introduces typed RayServeModelConfig and DynamoVLLMModelConfig backends, adds aggregated/disaggregated Dynamo examples, routing config, runtime-env guidance, resource placement, HAProxy ingress, migration guide, and troubleshooting. Both previously-threaded issues (kv_events auto-enable, DynamoRoleConfig vs DynamoVLLMModelConfig constraint attribution) are now accurately documented. |
| fern/versions/main/pages/curate-text/synthetic/llm-client.mdx | Minimal change: imports and constructor swapped from InferenceModelConfig to RayServeModelConfig. Surrounding prose still refers to InferenceServer as "(Ray Serve + vLLM)" without mentioning Dynamo, though the context is a Ray Serve-specific example. |
| fern/versions/main/pages/curate-text/synthetic/nemo-data-designer.mdx | Single import/constructor rename from InferenceModelConfig to RayServeModelConfig in the end-to-end example. Rest of file unchanged and internally consistent. |
| fern/versions/main/pages/about/release-notes/index.mdx | Single-line release note update replacing the InferenceModelConfig-specific description with the new typed backend language (RayServeModelConfig / DynamoVLLMModelConfig). Clean and accurate. |
| .github/workflows/config/.secrets.baseline | Line number for the api_key="unused" string updated from 173 to 373 to reflect the grown inference-server.mdx; generated_at timestamp refreshed. Same hash; no new secrets introduced. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["InferenceServer(models, backend)"] --> B{backend type?}
    B -- "RayServeServerConfig (default)" --> C["Ray Serve backend"]
    B -- "DynamoServerConfig" --> D["Dynamo backend"]

    C --> E["RayServeModelConfig\n(model_identifier, deployment_config,\nengine_kwargs, runtime_env)"]
    E --> F["Ray Serve autoscaling\n(min/max replicas)"]

    D --> G{mode?}
    G -- "aggregated" --> H["DynamoVLLMModelConfig\n(num_replicas, engine_kwargs)"]
    G -- "disagg" --> I["DynamoVLLMModelConfig\n(prefill=DynamoRoleConfig,\ndecode=DynamoRoleConfig)"]

    H --> J["Static aggregated replicas\n(multi-node TP supported)"]
    I --> K["Prefill workers + Decode workers\n(single-node TP per role)"]

    D --> L["DynamoRouterConfig\n(mode: kv/round_robin/random/direct)"]
    L --> M{mode=None?}
    M -- "any disagg model" --> N["Auto-select KV routing\n+ enable kv_events"]
    M -- "aggregated only" --> O["Round-robin default"]
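
The router branch of the flowchart can be read as a small decision function. The sketch below is an illustration of that reading only: `pick_router_mode` and `VALID_ROUTER_MODES` are hypothetical names, and the per-model mode strings `"aggregated"` and `"disagg"` are taken from the flowchart labels rather than from the library source.

```python
from typing import Optional, Sequence

# Router modes as listed in the flowchart node for DynamoRouterConfig.
VALID_ROUTER_MODES = {"kv", "round_robin", "random", "direct"}


def pick_router_mode(
    requested: Optional[str], model_modes: Sequence[str]
) -> tuple[str, bool]:
    """Return (effective_mode, was_auto_picked) for a set of served models.

    requested   -- the router mode the user set, or None to let it be chosen
    model_modes -- one entry per model, e.g. "aggregated" or "disagg"
    """
    if requested is not None:
        if requested not in VALID_ROUTER_MODES:
            raise ValueError(f"unknown router mode: {requested!r}")
        return requested, False
    # mode=None: any disaggregated model pulls the whole server to KV routing.
    if any(mode == "disagg" for mode in model_modes):
        return "kv", True
    # mode=None with aggregated models only: round-robin default.
    return "round_robin", False


if __name__ == "__main__":
    print(pick_router_mode(None, ["aggregated"]))
    print(pick_router_mode(None, ["aggregated", "disagg"]))
    print(pick_router_mode("random", ["disagg"]))
```

Returning the "was auto-picked" flag alongside the mode matters because, per the flowchart, auto-selected KV routing also enables kv_events.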

Reviews (2): Last reviewed commit: "docs: clarify Dynamo routing behavior"

greptile-apps · 2026-07-01T21:04:35Z

+    model_identifier="HuggingFaceTB/SmolLM2-135M-Instruct",
+    mode="disagg",
+    engine_kwargs={


Disaggregated mode silently overrides kv_events=False

The docs state that kv_events=False "uses approximate tree-based tracking," implying the default applies for disaggregated serving. In practice, DynamoBackend._resolve_effective_router computes kv_events = mode == "kv" and (mode_was_auto_picked or router.kv_events). When auto-routing selects "kv" for any disaggregated model (mode_was_auto_picked=True), kv_events is forced to True regardless of the DynamoRouterConfig default of kv_events=False. A user who relies on the default router config expecting tree-based tracking with disaggregated serving will actually get event-backed KV routing. The only exception is when an HMA publisher is detected and the user explicitly left kv_events=False. This auto-enable behavior should be documented here to avoid surprises.
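
To make the override concrete, the expression quoted above can be evaluated for each combination of inputs. This is a standalone sketch of that one boolean expression, not the body of `DynamoBackend._resolve_effective_router`: the function name `resolve_kv_events` is hypothetical, and the HMA-publisher exception mentioned above is deliberately not modeled.

```python
from itertools import product


def resolve_kv_events(
    mode: str, mode_was_auto_picked: bool, router_kv_events: bool
) -> bool:
    """Mirror of the quoted expression:
    kv_events = mode == "kv" and (mode_was_auto_picked or router.kv_events)
    """
    return mode == "kv" and (mode_was_auto_picked or router_kv_events)


if __name__ == "__main__":
    # Print the full truth table so the surprising row is easy to spot:
    # mode="kv", auto-picked, router default kv_events=False -> effective True.
    print(f"{'mode':<12} {'auto':<6} {'cfg':<6} effective")
    for mode, auto, cfg in product(("kv", "round_robin"), (True, False), (True, False)):
        effective = resolve_kv_events(mode, auto, cfg)
        print(f"{mode:<12} {auto!s:<6} {cfg!s:<6} {effective}")
```

The row to look for is auto-picked KV routing with the configured value left at False: the effective value is still True, which is the behavior this comment asks the docs to call out.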

greptile-apps · 2026-07-01T21:04:36Z

+
+| Parameter | Type | Default | Description |
+| --- | --- | --- | --- |
+| `num_replicas` | `int` | `1` | Number of workers for this role. Disaggregated models require at least one prefill and one decode replica. |


DynamoRoleConfig.num_replicas validation described inaccurately

The table says the constraint is "at least one prefill and one decode replica," presenting it as a DynamoRoleConfig-level rule. In the code, DynamoRoleConfig.__post_init__ only enforces >= 0; the >= 1 requirement is checked by DynamoVLLMModelConfig.__post_init__. A user can successfully construct DynamoRoleConfig(num_replicas=0) and only hit the error when that config is embedded in a DynamoVLLMModelConfig. Attributing the constraint to the model config rather than the role config is more accurate.

Suggested change

| `num_replicas` | `int` | `1` | Number of workers for this role. Disaggregated models require at least one prefill and one decode replica. |

| `num_replicas` | `int` | `1` | Number of workers for this role. Must be `>= 0`; `DynamoVLLMModelConfig` enforces that both prefill and decode are `>= 1` for disaggregated mode. |
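
The two-level validation described in this comment can be illustrated with stand-in dataclasses. `RoleSketch` and `ModelSketch` below are hypothetical and only mimic the split between `DynamoRoleConfig` (`>= 0`) and `DynamoVLLMModelConfig` (`>= 1` for both roles in disaggregated mode); the real classes have more fields and may differ in error types and messages.

```python
from dataclasses import dataclass, field


@dataclass
class RoleSketch:
    """Hypothetical stand-in for a per-role config."""

    num_replicas: int = 1

    def __post_init__(self) -> None:
        # Role-level rule: only negative values are rejected.
        if self.num_replicas < 0:
            raise ValueError("num_replicas must be >= 0")


@dataclass
class ModelSketch:
    """Hypothetical stand-in for a model config that embeds two roles."""

    mode: str = "disagg"
    prefill: RoleSketch = field(default_factory=RoleSketch)
    decode: RoleSketch = field(default_factory=RoleSketch)

    def __post_init__(self) -> None:
        # Model-level rule: disaggregated serving needs both roles populated.
        if self.mode != "disagg":
            return
        for name in ("prefill", "decode"):
            if getattr(self, name).num_replicas < 1:
                raise ValueError(f"disagg mode requires {name}.num_replicas >= 1")


if __name__ == "__main__":
    zero = RoleSketch(num_replicas=0)  # accepted on its own
    try:
        ModelSketch(prefill=zero)  # rejected only once embedded in a model
    except ValueError as exc:
        print(f"model-level error: {exc}")
```

In this sketch the zero-replica role constructs without error and the failure surfaces only at model construction, which is the attribution the suggested table wording makes explicit.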

Signed-off-by: Lawrence Lane <llane@nvidia.com>

docs: rewrite inference server guide

1be4548

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii self-assigned this Jul 1, 2026

lbliii mentioned this pull request Jul 1, 2026

[Docs epic] Close post-v1.2.0 documentation gaps for 26.06 #2118

Open

21 tasks

lbliii marked this pull request as ready for review July 1, 2026 21:00

lbliii requested a review from a team as a code owner July 1, 2026 21:00

lbliii requested review from abhinavg4 and removed request for a team July 1, 2026 21:00

greptile-apps Bot reviewed Jul 1, 2026

lbliii force-pushed the codex/docs-inference-server-dynamo branch from a98ec45 to 10e59b1 on July 2, 2026 02:10

docs: clarify Dynamo routing behavior

ab81966

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii force-pushed the codex/docs-inference-server-dynamo branch from 10e59b1 to ab81966 on July 2, 2026 02:13

lbliii requested a review from a team as a code owner July 2, 2026 02:13

This was referenced Jul 2, 2026

[Docs] Integrate 26.06 documentation PRs and stage the version train #2160

Open

[codex] publish 26.06 release notes and migration checklist #2143

Open

lbliii wants to merge 2 commits into NVIDIA-NeMo:main from lbliii:codex/docs-inference-server-dynamo
