Skip to content

Conversation

tedzhouhk
Copy link
Contributor

@tedzhouhk tedzhouhk commented Oct 6, 2025

  • pre-deployment profiling now generates a DGD with the best parallelization mapping and adds the planner service, with a flag to automatically deploy it after sweeping.
  • add feature to pass through planner flags when launching pre-deployment profiling
  • change dynamo-pvc to ReadWriteMany
  • update docs, now quick start is quicker

close https://linear.app/nvidia/issue/DEP-456/profiling-job-outputs-k8s-dgd-crd-yaml-for-both-online-profiler-and

Summary by CodeRabbit

  • New Features

    • Option to auto-deploy the optimized deployment with planner after profiling.
    • Planner options available via CLI; supports AI Configurator flags for backend and version.
    • Generates a config_with_planner.yaml in profiling results.
  • Documentation

    • Expanded quickstart with prerequisites and an end-to-end profiling-to-deployment flow.
    • Updated profiling guide with new flags, examples, and file structure.
  • Chores

    • Added ignore pattern for profiling results.
    • Updated planner container working directories across manifests and tests.
    • pvc-access utility pod no longer auto-terminates.

Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
@tedzhouhk tedzhouhk requested review from a team as code owners October 6, 2025 18:58
@github-actions github-actions bot added the feat label Oct 6, 2025
Signed-off-by: hongkuanz <[email protected]>
Copy link
Contributor

coderabbitai bot commented Oct 6, 2025

Walkthrough

Adds planner integration to profiling: new planner args parsing/building, component-type-aware config handling, PVC/planner service config models, optional deploy-after-profile flow, and generation of config_with_planner.yaml. Updates YAML workingDir paths for Planner, removes Pod activeDeadlineSeconds, tweaks .gitignore, and revises docs and examples accordingly.

Changes

Cohort / File(s) Change summary
Profiler planner integration
benchmarks/profiler/profile_sla.py, benchmarks/profiler/utils/planner_utils.py, benchmarks/profiler/utils/config.py
Adds planner CLI arg injection and arg building utilities; introduces PVC and planner service config models; extends backend config modifiers to be component-type-aware; generates planner-enabled DGD config (config_with_planner.yaml); optional deploy-after-profile; updates backend/aic-backend handling and KV cache parsing.
Planner container workingDir updates
components/backends/sglang/deploy/disagg_planner.yaml, components/backends/vllm/deploy/disagg_planner.yaml, tests/planner/scaling/disagg_planner.yaml, tests/planner/perf_test_configs/disagg_8b_planner.yaml
Changes Planner container workingDir to /workspace/components/src/dynamo/planner across backend deploy manifests and planner test configs.
PVC access pod behavior
deploy/utils/manifests/pvc-access-pod.yaml
Removes activeDeadlineSeconds from PodSpec to eliminate automatic termination deadline.
Docs: profiling and planner workflow
docs/benchmarks/pre_deployment_profiling.md, docs/kubernetes/sla_planner_quickstart.md
Documents new deploy-after-profile option, planner-enabled config file, updated flags (--aic-backend, --aic-backend-version), revised step-by-step workflow and commands.
Ignore patterns
.gitignore
Adds profiling_results* pattern; retains benchmarks/results.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant PS as profile_sla.py
  participant PU as planner_utils
  participant CFG as Config/PVC/Planner Service
  participant K8s as Kubernetes API
  participant AIC as AI Configurator (optional)

  U->>PS: Run profiling with backend + planner flags
  PS->>PU: add_planner_arguments_to_parser()
  note right of PU: Inject prefixed planner args into parser
  PS->>PS: Execute profiling and collect metrics
  PS->>PU: build_planner_args_from_namespace(args)
  PU-->>PS: Ordered planner CLI args
  PS->>CFG: Build planner-enabled DGD config<br/>(PVC, planner service, component types)
  CFG-->>PS: config_with_planner.yaml
  alt --deploy-after-profile
    PS->>AIC: Deploy via AI Configurator (if provided)
    PS->>K8s: Apply manifests / monitor rollout
    K8s-->>PS: Deployment status
  else no deploy
    PS-->>U: Profiling results + config_with_planner.yaml
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibbled logs and configs bright,
Tweaked my burrow’s working site,
Spun a plan from flags so neat,
Hopped to deploy on thumping feet—
PVCs snug, charts in a row,
After the profile, off we go!
🥕✨

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check ⚠️ Warning The pull request description does not follow the required repository template because it omits the mandatory section headings (“Overview”, “Details”, “Where should the reviewer start?”, and “Related Issues”) and the related issue is not presented under a “Related Issues” section in the prescribed format. Please revise the PR description to include the four template sections—“#### Overview:”, “#### Details:”, “#### Where should the reviewer start?”, and “#### Related Issues:”—and format the issue reference under “Related Issues” using the expected syntax (for example, “- closes GitHub issue: #123”).
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title clearly summarizes the main feature of the pull request by stating that pre-deployment profiling now automatically generates and deploys an optimized DGD with a planner, focusing on the core change without extraneous detail and using an accepted conventional prefix.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/profiler/utils/config.py (1)

508-535: Fix: mutate the Config instance you return

set_config_tp_size (and the related set_config_tep_size / set_config_dep_size variants) fetch the worker via get_worker_service_from_config(config, …). That helper re-validates the original dict, so you mutate a separate Config instance. The cfg you ultimately model_dump() never sees those edits, meaning none of the TP/TEP/DEP adjustments or arg rewrites actually persist. Please grab the service from the cfg you just built (or otherwise ensure the same object is mutated) before returning, and apply the same fix across the other setters.

         cfg = Config.model_validate(config)
-        worker_service = get_worker_service_from_config(
-            config, backend="vllm", sub_component_type=component_type
-        )
+        service_name = get_service_name_by_type(
+            cfg.model_dump(), "vllm", component_type
+        )
+        worker_service = cfg.spec.services[service_name]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 069434b and 82581d5.

📒 Files selected for processing (11)
  • .gitignore (1 hunks)
  • benchmarks/profiler/profile_sla.py (8 hunks)
  • benchmarks/profiler/utils/config.py (10 hunks)
  • benchmarks/profiler/utils/planner_utils.py (1 hunks)
  • components/backends/sglang/deploy/disagg_planner.yaml (1 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (1 hunks)
  • deploy/utils/manifests/pvc-access-pod.yaml (0 hunks)
  • docs/benchmarks/pre_deployment_profiling.md (3 hunks)
  • docs/kubernetes/sla_planner_quickstart.md (5 hunks)
  • tests/planner/perf_test_configs/disagg_8b_planner.yaml (1 hunks)
  • tests/planner/scaling/disagg_planner.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • deploy/utils/manifests/pvc-access-pod.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
benchmarks/profiler/utils/planner_utils.py (2)
tests/planner/test_replica_calculation.py (1)
  • planner (43-101)
components/src/dynamo/planner/utils/planner_argparse.py (1)
  • create_sla_planner_parser (21-126)
benchmarks/profiler/utils/config.py (1)
components/src/dynamo/planner/defaults.py (1)
  • SubComponentType (142-144)
benchmarks/profiler/profile_sla.py (4)
benchmarks/profiler/utils/config.py (15)
  • Config (92-95)
  • DgdPlannerServiceConfig (102-117)
  • PVCConfig (75-78)
  • set_config_tp_size (377-378)
  • set_config_tp_size (508-534)
  • set_config_tp_size (746-767)
  • set_config_tp_size (1043-1077)
  • set_config_tep_size (381-384)
  • set_config_tep_size (537-546)
  • set_config_tep_size (770-802)
  • set_config_tep_size (1080-1089)
  • set_config_dep_size (387-390)
  • set_config_dep_size (549-558)
  • set_config_dep_size (805-837)
  • set_config_dep_size (1092-1101)
components/src/dynamo/planner/defaults.py (1)
  • SubComponentType (142-144)
benchmarks/profiler/utils/planner_utils.py (2)
  • add_planner_arguments_to_parser (21-95)
  • build_planner_args_from_namespace (98-162)
deploy/utils/dynamo_deployment.py (2)
  • DynamoDeploymentClient (98-475)
  • create_deployment (220-265)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3441/merge) by tedzhouhk.
docs/benchmarks/pre_deployment_profiling.md

[error] 128-128: Trailing whitespace detected and corrected by pre-commit hook 'trailing-whitespace'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.gitignore (1)

102-103: Ignore entry looks good.

Adding profiling_results* keeps newly generated profiling artifacts out of version control. Nice catch.

Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
@tedzhouhk tedzhouhk merged commit 83e259a into main Oct 7, 2025
22 of 24 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/auto-gen-dgd branch October 7, 2025 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants