feat: pre-deployment profiling automatically generates and deploys optimized DGD with planner #3441

tedzhouhk · 2025-10-06T18:58:41Z

Pre-deployment profiling now generates a DGD that uses the best parallelization mapping found and includes the planner service; a flag lets you deploy it automatically once the sweep finishes.
Adds pass-through of planner flags when launching pre-deployment profiling.
Changes dynamo-pvc to ReadWriteMany.
Updates the docs; the quick start is now quicker.

close https://linear.app/nvidia/issue/DEP-456/profiling-job-outputs-k8s-dgd-crd-yaml-for-both-online-profiler-and
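For context on the `ReadWriteMany` change: that is the Kubernetes access mode that allows a volume to be mounted read-write by many nodes at once. The snippet below is a minimal sketch of what such a claim could look like, built as a Python dict. The storage size is an assumption chosen for illustration, and the actual manifest in the repository may differ.

```python
import json


def make_pvc(name: str, storage: str = "50Gi") -> dict:
    """Build a PersistentVolumeClaim manifest that requests ReadWriteMany access.

    The default storage size is a placeholder, not a value taken from this PR.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": storage}},
        },
    }


pvc = make_pvc("dynamo-pvc")
print(json.dumps(pvc, indent=2))
```

Whether a cluster can satisfy this claim depends on the storage class, since not every provisioner supports `ReadWriteMany`.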

Summary by CodeRabbit

New Features
- Option to auto-deploy the optimized deployment with planner after profiling.
- Planner options available via CLI; supports AI Configurator flags for backend and version.
- Generates a config_with_planner.yaml in profiling results.
Documentation
- Expanded quickstart with prerequisites and an end-to-end profiling-to-deployment flow.
- Updated profiling guide with new flags, examples, and file structure.
Chores
- Added ignore pattern for profiling results.
- Updated planner container working directories across manifests and tests.
- pvc-access utility pod no longer auto-terminates.
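The auto-deploy option summarized above can be pictured as a boolean flag that gates a deploy step after the sweep. This is only a sketch under stated assumptions: `run_profiling` and the `deploy` callable are hypothetical stand-ins, not the functions in `profile_sla.py`; only the flag name `--deploy-after-profile` comes from this PR.

```python
import argparse
from typing import Callable


def run_profiling(args: argparse.Namespace) -> dict:
    # Hypothetical placeholder for the sweep; returns a made-up "best" config.
    return {"metadata": {"name": "optimized-dgd"}, "spec": {"services": {}}}


def main(argv: list[str], deploy: Callable[[dict], None]) -> dict:
    parser = argparse.ArgumentParser()
    parser.add_argument("--deploy-after-profile", action="store_true")
    args = parser.parse_args(argv)

    config = run_profiling(args)
    # Deployment is opt-in: without the flag, the config is only returned.
    if args.deploy_after_profile:
        deploy(config)
    return config


deployed: list[dict] = []
main([], deployed.append)                          # profile only
main(["--deploy-after-profile"], deployed.append)  # profile, then deploy
```

Passing the deploy step in as a callable keeps the sketch testable without a cluster.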

Signed-off-by: hongkuanz <[email protected]>

coderabbitai · 2025-10-06T19:06:11Z

Walkthrough

Adds planner integration to profiling: new planner args parsing/building, component-type-aware config handling, PVC/planner service config models, optional deploy-after-profile flow, and generation of config_with_planner.yaml. Updates YAML workingDir paths for Planner, removes Pod activeDeadlineSeconds, tweaks .gitignore, and revises docs and examples accordingly.
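As a rough illustration of the "generation of config_with_planner.yaml" step, the sketch below adds a planner service to an existing deployment dict. The service field names (`componentType`, `pvc`, `extraPodSpec.mainContainer`) and the example planner argument are assumptions for illustration; the real models are `DgdPlannerServiceConfig` and `PVCConfig` in `benchmarks/profiler/utils/config.py`, whose fields are not shown in this thread. Only the working directory path is taken from the PR.

```python
import copy
import json

# Planner working directory as updated in this PR's manifests.
PLANNER_WORKDIR = "/workspace/components/src/dynamo/planner"


def add_planner_service(dgd: dict, planner_args: list[str], pvc_name: str) -> dict:
    """Return a copy of `dgd` with a planner service added.

    All field names below are assumed, not confirmed by the source.
    """
    out = copy.deepcopy(dgd)
    services = out.setdefault("spec", {}).setdefault("services", {})
    services["Planner"] = {
        "componentType": "planner",
        "replicas": 1,
        "pvc": {"name": pvc_name, "create": False},
        "extraPodSpec": {
            "mainContainer": {"workingDir": PLANNER_WORKDIR, "args": planner_args}
        },
    }
    return out


base = {"spec": {"services": {"Frontend": {"replicas": 1}}}}
with_planner = add_planner_service(base, ["--backend=vllm"], "dynamo-pvc")
# JSON output is used here to stay within the standard library;
# YAML parsers can generally read it as well.
text = json.dumps(with_planner, indent=2)
```

Working on a deep copy means the profiled base config stays untouched if the planner-enabled variant is discarded.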

Changes

| Cohort / File(s) | Change summary |
| --- | --- |
| Profiler planner integration `benchmarks/profiler/profile_sla.py`, `benchmarks/profiler/utils/planner_utils.py`, `benchmarks/profiler/utils/config.py` | Adds planner CLI arg injection and arg building utilities; introduces PVC and planner service config models; extends backend config modifiers to be component-type-aware; generates planner-enabled DGD config (config_with_planner.yaml); optional deploy-after-profile; updates backend/aic-backend handling and KV cache parsing. |
| Planner container workingDir updates `components/backends/sglang/deploy/disagg_planner.yaml`, `components/backends/vllm/deploy/disagg_planner.yaml`, `tests/planner/scaling/disagg_planner.yaml`, `tests/planner/perf_test_configs/disagg_8b_planner.yaml` | Changes Planner container workingDir to `/workspace/components/src/dynamo/planner` across backend deploy manifests and planner test configs. |
| PVC access pod behavior `deploy/utils/manifests/pvc-access-pod.yaml` | Removes `activeDeadlineSeconds` from PodSpec to eliminate automatic termination deadline. |
| Docs: profiling and planner workflow `docs/benchmarks/pre_deployment_profiling.md`, `docs/kubernetes/sla_planner_quickstart.md` | Documents new deploy-after-profile option, planner-enabled config file, updated flags (`--aic-backend`, `--aic-backend-version`), revised step-by-step workflow and commands. |
| Ignore patterns `.gitignore` | Adds `profiling_results*` pattern; retains `benchmarks/results`. |
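To make the PVC access pod change concrete: `activeDeadlineSeconds` is the PodSpec field that bounds how long a pod may stay active before Kubernetes terminates it, so dropping it removes the automatic deadline. A minimal sketch on a dict-shaped manifest follows; the pod name, image, and the 300-second value are hypothetical and not read from `pvc-access-pod.yaml`.

```python
import copy


def remove_active_deadline(pod: dict) -> dict:
    """Return a copy of a Pod manifest without spec.activeDeadlineSeconds."""
    out = copy.deepcopy(pod)
    out.get("spec", {}).pop("activeDeadlineSeconds", None)
    return out


# Hypothetical manifest; values are illustrative only.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "pvc-access-pod"},
    "spec": {
        "activeDeadlineSeconds": 300,
        "containers": [{"name": "shell", "image": "ubuntu:22.04"}],
    },
}
cleaned = remove_active_deadline(pod)
```

Without the deadline, the pod stays up until it is deleted explicitly, so cleanup becomes the operator's responsibility.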

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant PS as profile_sla.py
  participant PU as planner_utils
  participant CFG as Config/PVC/Planner Service
  participant K8s as Kubernetes API
  participant AIC as AI Configurator (optional)

  U->>PS: Run profiling with backend + planner flags
  PS->>PU: add_planner_arguments_to_parser()
  note right of PU: Inject prefixed planner args into parser
  PS->>PS: Execute profiling and collect metrics
  PS->>PU: build_planner_args_from_namespace(args)
  PU-->>PS: Ordered planner CLI args
  PS->>CFG: Build planner-enabled DGD config<br/>(PVC, planner service, component types)
  CFG-->>PS: config_with_planner.yaml
  alt --deploy-after-profile
    PS->>AIC: Deploy via AI Configurator (if provided)
    PS->>K8s: Apply manifests / monitor rollout
    K8s-->>PS: Deployment status
  else no deploy
    PS-->>U: Profiling results + config_with_planner.yaml
  end
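The first half of the diagram (inject prefixed planner args, then rebuild the planner's own argument list) can be sketched with `argparse` as below. This is an assumption-laden illustration: the option table, the `planner-` prefix, and the helper names are hypothetical, and the real `add_planner_arguments_to_parser` / `build_planner_args_from_namespace` in `planner_utils.py` may work differently.

```python
import argparse

# Hypothetical planner options: (flag name, type, default). Not the real flags.
PLANNER_OPTIONS = [
    ("adjustment-interval", int, 60),
    ("ttft", float, 0.5),
]
PREFIX = "planner-"  # assumed prefix used to avoid clashes with profiler flags


def add_prefixed_planner_args(parser: argparse.ArgumentParser) -> None:
    """Expose each planner option on the profiler parser under a prefix."""
    group = parser.add_argument_group("planner")
    for name, typ, default in PLANNER_OPTIONS:
        group.add_argument(f"--{PREFIX}{name}", type=typ, default=default)


def build_planner_args(ns: argparse.Namespace) -> list[str]:
    """Strip the prefix and rebuild an ordered planner argument list."""
    out = []
    for name, _typ, _default in PLANNER_OPTIONS:
        dest = f"{PREFIX}{name}".replace("-", "_")
        out.append(f"--{name}={getattr(ns, dest)}")
    return out


parser = argparse.ArgumentParser()
parser.add_argument("--backend", default="vllm")
add_prefixed_planner_args(parser)
ns = parser.parse_args(["--planner-ttft", "0.2"])
planner_args = build_planner_args(ns)
```

Iterating over the option table rather than the namespace keeps the output order stable, which matches the "ordered planner CLI args" step in the diagram.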

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibbled logs and configs bright,
Tweaked my burrow’s working site,
Spun a plan from flags so neat,
Hopped to deploy on thumping feet—
PVCs snug, charts in a row,
After the profile, off we go!
🥕✨

Pre-merge checks

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Description Check | ⚠️ Warning | The pull request description does not follow the required repository template because it omits the mandatory section headings (“Overview”, “Details”, “Where should the reviewer start?”, and “Related Issues”) and the related issue is not presented under a “Related Issues” section in the prescribed format. | Please revise the PR description to include the four template sections—“#### Overview:”, “#### Details:”, “#### Where should the reviewer start?”, and “#### Related Issues:”—and format the issue reference under “Related Issues” using the expected syntax (for example, “- closes GitHub issue: #123”). |
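For readers who want to reproduce a docstring-coverage figure locally, one rough way is to count definitions with and without docstrings using the standard `ast` module. This is only a sketch of the general idea; how CodeRabbit actually computes its percentage (which nodes it counts, which files it scans) is not stated here.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Percentage of functions and classes in `source` that have a docstring."""
    tree = ast.parse(source)
    nodes = [
        n
        for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return 100.0 * documented / len(nodes)


sample = '''
def documented():
    """Has a docstring."""

def undocumented():
    return 1
'''
coverage = docstring_coverage(sample)
print(f"{coverage:.2f}%")
```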

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title clearly summarizes the main feature of the pull request by stating that pre-deployment profiling now automatically generates and deploys an optimized DGD with a planner, focusing on the core change without extraneous detail and using an accepted conventional prefix. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

benchmarks/profiler/utils/config.py (1)
508-535: Fix: mutate the Config instance you return

set_config_tp_size (and the related set_config_tep_size / set_config_dep_size variants) fetch the worker via get_worker_service_from_config(config, …). That helper re-validates the original dict, so you mutate a separate Config instance. The cfg you ultimately model_dump() never sees those edits, meaning none of the TP/TEP/DEP adjustments or arg rewrites actually persist. Please grab the service from the cfg you just built (or otherwise ensure the same object is mutated) before returning, and apply the same fix across the other setters.
         cfg = Config.model_validate(config)
-        worker_service = get_worker_service_from_config(
-            config, backend="vllm", sub_component_type=component_type
-        )
+        service_name = get_service_name_by_type(
+            cfg.model_dump(), "vllm", component_type
+        )
+        worker_service = cfg.spec.services[service_name]
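The failure mode described in this comment can be reproduced in miniature. The sketch below uses plain dataclasses as a stand-in for the pydantic models, so `model_validate` and `model_dump` here are simplified imitations, and the service name and argument are hypothetical. It shows why mutating an object obtained from a second validation of the raw dict never reaches the instance that gets dumped.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    args: list[str] = field(default_factory=list)


@dataclass
class Config:
    services: dict[str, Service]

    @classmethod
    def model_validate(cls, raw: dict) -> "Config":
        # Stand-in for validation: builds fresh objects from the raw dict.
        return cls({k: Service(list(v["args"])) for k, v in raw["services"].items()})

    def model_dump(self) -> dict:
        return {"services": {k: {"args": list(s.args)} for k, s in self.services.items()}}


def get_worker_service_from_config(raw: dict, name: str) -> Service:
    # Re-validates the raw dict, so the result belongs to a *different* Config.
    return Config.model_validate(raw).services[name]


def set_tp_buggy(raw: dict, tp: int) -> dict:
    cfg = Config.model_validate(raw)
    worker = get_worker_service_from_config(raw, "worker")
    worker.args += ["--tensor-parallel-size", str(tp)]  # edit is lost
    return cfg.model_dump()


def set_tp_fixed(raw: dict, tp: int) -> dict:
    cfg = Config.model_validate(raw)
    worker = cfg.services["worker"]  # same instance that gets dumped
    worker.args += ["--tensor-parallel-size", str(tp)]
    return cfg.model_dump()


raw = {"services": {"worker": {"args": []}}}
buggy = set_tp_buggy(raw, 4)
fixed = set_tp_fixed(raw, 4)
```

In the buggy variant the returned dict still has an empty `args` list; in the fixed variant the tensor-parallel flag is present.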

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 069434b and 82581d5.

📒 Files selected for processing (11)

.gitignore (1 hunks)
benchmarks/profiler/profile_sla.py (8 hunks)
benchmarks/profiler/utils/config.py (10 hunks)
benchmarks/profiler/utils/planner_utils.py (1 hunks)
components/backends/sglang/deploy/disagg_planner.yaml (1 hunks)
components/backends/vllm/deploy/disagg_planner.yaml (1 hunks)
deploy/utils/manifests/pvc-access-pod.yaml (0 hunks)
docs/benchmarks/pre_deployment_profiling.md (3 hunks)
docs/kubernetes/sla_planner_quickstart.md (5 hunks)
tests/planner/perf_test_configs/disagg_8b_planner.yaml (1 hunks)
tests/planner/scaling/disagg_planner.yaml (1 hunks)

💤 Files with no reviewable changes (1)

deploy/utils/manifests/pvc-access-pod.yaml

🧰 Additional context used

🧬 Code graph analysis (3)

benchmarks/profiler/utils/planner_utils.py (2)
- tests/planner/test_replica_calculation.py (1): planner (43-101)
- components/src/dynamo/planner/utils/planner_argparse.py (1): create_sla_planner_parser (21-126)

benchmarks/profiler/utils/config.py (1)
- components/src/dynamo/planner/defaults.py (1): SubComponentType (142-144)

benchmarks/profiler/profile_sla.py (4)
- benchmarks/profiler/utils/config.py (15): Config (92-95), DgdPlannerServiceConfig (102-117), PVCConfig (75-78), set_config_tp_size (377-378, 508-534, 746-767, 1043-1077), set_config_tep_size (381-384, 537-546, 770-802, 1080-1089), set_config_dep_size (387-390, 549-558, 805-837, 1092-1101)
- components/src/dynamo/planner/defaults.py (1): SubComponentType (142-144)
- benchmarks/profiler/utils/planner_utils.py (2): add_planner_arguments_to_parser (21-95), build_planner_args_from_namespace (98-162)
- deploy/utils/dynamo_deployment.py (2): DynamoDeploymentClient (98-475), create_deployment (220-265)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3441/merge) by tedzhouhk.

docs/benchmarks/pre_deployment_profiling.md

[error] 128-128: Trailing whitespace detected and corrected by pre-commit hook 'trailing-whitespace'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: trtllm (amd64)
GitHub Check: trtllm (arm64)
GitHub Check: vllm (amd64)
GitHub Check: sglang
GitHub Check: vllm (arm64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (1)

.gitignore (1)

102-103: Ignore entry looks good.

Adding profiling_results* keeps newly generated profiling artifacts out of version control. Nice catch.
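To see roughly which names the new pattern catches, shell-style matching from Python's `fnmatch` module is a close approximation for a simple trailing-star pattern like this one. Git's ignore rules have extra semantics (slashes, negation, directory-only patterns) that this sketch does not model, and the example names are made up.

```python
from fnmatch import fnmatchcase

PATTERN = "profiling_results*"

# Hypothetical file and directory names for illustration.
names = [
    "profiling_results",
    "profiling_results_h100",
    "benchmarks",
    "my_profiling_results",
]
ignored = [n for n in names if fnmatchcase(n, PATTERN)]
```

Only names that start with `profiling_results` match, because the wildcard is at the end of the pattern.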

tedzhouhk added 11 commits October 2, 2025 13:55

stage

53565b5

Signed-off-by: hongkuanz <[email protected]>

Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…

e7ab645

…/auto-gen-dgd

stage

703984e

Signed-off-by: hongkuanz <[email protected]>

bug fix

b6805fa

Signed-off-by: hongkuanz <[email protected]>

pc

a206d0e

Signed-off-by: hongkuanz <[email protected]>

update gitignore

ef58b3a

Signed-off-by: hongkuanz <[email protected]>

bugfix

2e15168

Signed-off-by: hongkuanz <[email protected]>

fix

d2958c4

Signed-off-by: hongkuanz <[email protected]>

Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…

ff3fa68

…/auto-gen-dgd

update doc

159f281

Signed-off-by: hongkuanz <[email protected]>

doc

82581d5

Signed-off-by: hongkuanz <[email protected]>

tedzhouhk requested review from a team as code owners October 6, 2025 18:58

pull-request-size bot added the size/L label Oct 6, 2025

github-actions bot added the feat label Oct 6, 2025

pc

96923a6

Signed-off-by: hongkuanz <[email protected]>

coderabbitai bot reviewed Oct 6, 2025

benchmarks/profiler/profile_sla.py (resolved)

hhzhang16 reviewed Oct 6, 2025

benchmarks/profiler/profile_sla.py (resolved)

jasonqinzhou reviewed Oct 6, 2025

docs/kubernetes/sla_planner_quickstart.md (resolved)

docs/kubernetes/sla_planner_quickstart.md (outdated, resolved)

benchmarks/profiler/profile_sla.py (outdated, resolved)

mypy

b2a1589

Signed-off-by: hongkuanz <[email protected]>

pull-request-size bot added size/XL and removed size/L labels Oct 6, 2025

copy-pr-bot bot temporarily deployed to GITLAB October 6, 2025 22:40 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 6, 2025 22:41 Inactive

address PR comment

357ea62

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 6, 2025 23:09 Inactive

jasonqinzhou reviewed Oct 6, 2025

docs/benchmarks/pre_deployment_profiling.md (resolved)

benchmarks/profiler/profile_sla.py (outdated, resolved)

docs/kubernetes/sla_planner_quickstart.md (resolved)

add TODO

5f3f2fe

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 6, 2025 23:59 Inactive

coderabbit

9bb5462

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 00:00 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 00:05 Inactive

add todo

82b446c

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 00:27 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 00:32 Inactive

jasonqinzhou approved these changes Oct 7, 2025

move to helper func

d13949c

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 20:26 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 20:28 Inactive

Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…

fff651b

…/auto-gen-dgd Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 21:26 Inactive

add todo

0648e58

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 21:27 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 21:32 Inactive

hhzhang16 approved these changes Oct 7, 2025

pc

3df0699

Signed-off-by: hongkuanz <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 22:12 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 22:21 Inactive

tedzhouhk merged commit 83e259a into main Oct 7, 2025
22 of 24 checks passed

tedzhouhk deleted the hzhou/auto-gen-dgd branch October 7, 2025 23:52
