[codex] Rewrite InferenceServer docs for Ray Serve and Dynamo #2147
lbliii wants to merge 2 commits into
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Greptile Summary
This PR rewrites the InferenceServer documentation to replace the removed
Confidence Score: 5/5
Documentation-only change; no executable code paths are touched. The new content is accurate against current `main`, and both previously threaded doc discrepancies have been corrected.
All five changed files are documentation or a generated secrets baseline. The code examples in the three MDX files were validated by the author (`ast.parse` + config instantiation), and the imports correctly reflect the removed `InferenceModelConfig`. The two issues noted in prior review threads are now addressed in the new text.
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["InferenceServer(models, backend)"] --> B{backend type?}
B -- "RayServeServerConfig (default)" --> C["Ray Serve backend"]
B -- "DynamoServerConfig" --> D["Dynamo backend"]
C --> E["RayServeModelConfig\n(model_identifier, deployment_config,\nengine_kwargs, runtime_env)"]
E --> F["Ray Serve autoscaling\n(min/max replicas)"]
D --> G{mode?}
G -- "aggregated" --> H["DynamoVLLMModelConfig\n(num_replicas, engine_kwargs)"]
G -- "disagg" --> I["DynamoVLLMModelConfig\n(prefill=DynamoRoleConfig,\ndecode=DynamoRoleConfig)"]
H --> J["Static aggregated replicas\n(multi-node TP supported)"]
I --> K["Prefill workers + Decode workers\n(single-node TP per role)"]
D --> L["DynamoRouterConfig\n(mode: kv/round_robin/random/direct)"]
L --> M{mode=None?}
M -- "any disagg model" --> N["Auto-select KV routing\n+ enable kv_events"]
M -- "aggregated only" --> O["Round-robin default"]
Reviews (2): Last reviewed commit: "docs: clarify Dynamo routing behavior"
    model_identifier="HuggingFaceTB/SmolLM2-135M-Instruct",
    mode="disagg",
    engine_kwargs={
Disaggregated mode silently overrides `kv_events=False`
The docs state that `kv_events=False` "uses approximate tree-based tracking," implying the default applies for disaggregated serving. In practice, `DynamoBackend._resolve_effective_router` computes `kv_events = mode == "kv" and (mode_was_auto_picked or router.kv_events)`. When auto-routing selects `"kv"` for any disaggregated model (`mode_was_auto_picked=True`), `kv_events` is forced to `True` regardless of the `DynamoRouterConfig` default of `kv_events=False`. A user who relies on the default router config expecting tree-based tracking with disaggregated serving will actually get event-backed KV routing. The only exception is when an HMA publisher is detected and the user explicitly left `kv_events=False`. This auto-enable behavior should be documented here to avoid surprises.
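The resolution rule quoted above can be sketched in isolation. This is a hypothetical stand-in, not the body of `DynamoBackend._resolve_effective_router`: the function name `resolve_router_sketch` and its parameters are assumptions made for illustration, the auto-pick branch follows the PR's flowchart (`"kv"` when any model is disaggregated, otherwise round-robin), and the HMA publisher exception is deliberately left out.

```python
from typing import Optional, Tuple


def resolve_router_sketch(
    mode: Optional[str],
    any_disagg_model: bool,
    kv_events: bool = False,
) -> Tuple[str, bool]:
    """Hypothetical sketch of the routing rule described in this comment.

    Returns (effective_mode, effective_kv_events). The HMA publisher
    exception mentioned in the comment is NOT modeled here.
    """
    # mode=None means the user did not choose a routing mode.
    mode_was_auto_picked = mode is None
    if mode_was_auto_picked:
        # Per the flowchart: any disaggregated model selects KV routing,
        # an aggregated-only deployment falls back to round-robin.
        mode = "kv" if any_disagg_model else "round_robin"

    # The expression quoted in the review comment.
    effective_kv_events = mode == "kv" and (mode_was_auto_picked or kv_events)
    return mode, effective_kv_events


# Default router config with a disaggregated model: events are switched on.
auto_disagg = resolve_router_sketch(None, any_disagg_model=True)
# Explicit mode="kv" keeps the user's kv_events=False.
explicit_kv = resolve_router_sketch("kv", any_disagg_model=True)
# Aggregated-only deployment never enables events under auto-selection.
auto_agg = resolve_router_sketch(None, any_disagg_model=False, kv_events=True)
```

Under these assumptions, the only way to keep tree-based tracking with a disaggregated model is to set `mode="kv"` explicitly rather than relying on auto-selection.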
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `num_replicas` | `int` | `1` | Number of workers for this role. Disaggregated models require at least one prefill and one decode replica. |
DynamoRoleConfig.num_replicas validation described inaccurately
The table says the constraint is "at least one prefill and one decode replica," presenting it as a `DynamoRoleConfig`-level rule. In the code, `DynamoRoleConfig.__post_init__` only enforces `>= 0`; the `>= 1` requirement is checked by `DynamoVLLMModelConfig.__post_init__`. A user can successfully construct `DynamoRoleConfig(num_replicas=0)` and only hit the error when that config is embedded in a `DynamoVLLMModelConfig`. Attributing the constraint to the model config rather than the role config is more accurate.
Suggested change:
- | `num_replicas` | `int` | `1` | Number of workers for this role. Disaggregated models require at least one prefill and one decode replica. |
+ | `num_replicas` | `int` | `1` | Number of workers for this role. Must be `>= 0`; `DynamoVLLMModelConfig` enforces that both prefill and decode are `>= 1` for disaggregated mode. |
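The two-level validation described in this thread can be illustrated with stand-in dataclasses. `RoleConfigSketch` and `ModelConfigSketch` are hypothetical names, not the real `DynamoRoleConfig` or `DynamoVLLMModelConfig`, and their fields and error messages are assumptions; only the split between the `>= 0` check on the role and the `>= 1` check on the model is taken from the comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoleConfigSketch:
    """Hypothetical stand-in for a per-role config (prefill or decode)."""

    num_replicas: int = 1

    def __post_init__(self) -> None:
        # Role-level rule: only negative replica counts are rejected.
        if self.num_replicas < 0:
            raise ValueError("num_replicas must be >= 0")


@dataclass
class ModelConfigSketch:
    """Hypothetical stand-in for a model config that embeds two roles."""

    mode: str = "aggregated"
    prefill: Optional[RoleConfigSketch] = None
    decode: Optional[RoleConfigSketch] = None

    def __post_init__(self) -> None:
        # Model-level rule: disaggregated mode needs >= 1 replica per role.
        if self.mode == "disagg":
            for name, role in (("prefill", self.prefill), ("decode", self.decode)):
                if role is None or role.num_replicas < 1:
                    raise ValueError(f"{name} needs num_replicas >= 1 in disagg mode")


# Constructing the role alone succeeds with zero replicas.
empty_role = RoleConfigSketch(num_replicas=0)

# The error only appears once the role is embedded in a disagg model config.
try:
    ModelConfigSketch(mode="disagg", prefill=empty_role, decode=RoleConfigSketch())
    model_rejected = False
except ValueError:
    model_rejected = True
```

In this sketch the failure surfaces at model construction, which is why the suggested table wording attributes the `>= 1` rule to the model config.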
a98ec45 to 10e59b1
Signed-off-by: Lawrence Lane <llane@nvidia.com>
10e59b1 to ab81966
Summary
- … `InferenceModelConfig` guide with typed Ray Serve and NVIDIA Dynamo configuration
- … `InferenceModelConfig`

Why
The published quickstart imported `InferenceModelConfig`, which no longer exists on `main`, and described Ray Serve as the only backend. Users could not run the examples or discover the current Dynamo serving surface.

User impact
Users can now choose `RayServeModelConfig` or `DynamoVLLMModelConfig`, size aggregated or disaggregated deployments, configure routing and runtime environments, and understand the tested architecture/dependency boundaries before starting a cluster.

Validation
- `fern check`: 0 errors
- `fern docs broken-links`: no errors in changed pages; 22 pre-existing errors remain in older API-reference pages
- … `ast.parse` … `main`
- `git diff --check`

Closes #2146