-
Notifications
You must be signed in to change notification settings - Fork 630
feat: pre-deployment profiling automatically generates and deploys optimized DGD with planner #3441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
WalkthroughAdds planner integration to profiling: new planner args parsing/building, component-type-aware config handling, PVC/planner service config models, optional deploy-after-profile flow, and generation of config_with_planner.yaml. Updates YAML workingDir paths for Planner, removes Pod activeDeadlineSeconds, tweaks .gitignore, and revises docs and examples accordingly. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor U as User
participant PS as profile_sla.py
participant PU as planner_utils
participant CFG as Config/PVC/Planner Service
participant K8s as Kubernetes API
participant AIC as AI Configurator (optional)
U->>PS: Run profiling with backend + planner flags
PS->>PU: add_planner_arguments_to_parser()
note right of PU: Inject prefixed planner args into parser
PS->>PS: Execute profiling and collect metrics
PS->>PU: build_planner_args_from_namespace(args)
PU-->>PS: Ordered planner CLI args
PS->>CFG: Build planner-enabled DGD config<br/>(PVC, planner service, component types)
CFG-->>PS: config_with_planner.yaml
alt --deploy-after-profile
PS->>AIC: Deploy via AI Configurator (if provided)
PS->>K8s: Apply manifests / monitor rollout
K8s-->>PS: Deployment status
else no deploy
PS-->>U: Profiling results + config_with_planner.yaml
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/profiler/utils/config.py (1)
508-535
: Fix: mutate the Config instance you return
set_config_tp_size
(and the relatedset_config_tep_size
/set_config_dep_size
variants) fetch the worker viaget_worker_service_from_config(config, …)
. That helper re-validates the original dict, so you mutate a separate Config instance. Thecfg
you ultimatelymodel_dump()
never sees those edits, meaning none of the TP/TEP/DEP adjustments or arg rewrites actually persist. Please grab the service from thecfg
you just built (or otherwise ensure the same object is mutated) before returning, and apply the same fix across the other setters.cfg = Config.model_validate(config) - worker_service = get_worker_service_from_config( - config, backend="vllm", sub_component_type=component_type - ) + service_name = get_service_name_by_type( + cfg.model_dump(), "vllm", component_type + ) + worker_service = cfg.spec.services[service_name]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.gitignore
(1 hunks)benchmarks/profiler/profile_sla.py
(8 hunks)benchmarks/profiler/utils/config.py
(10 hunks)benchmarks/profiler/utils/planner_utils.py
(1 hunks)components/backends/sglang/deploy/disagg_planner.yaml
(1 hunks)components/backends/vllm/deploy/disagg_planner.yaml
(1 hunks)deploy/utils/manifests/pvc-access-pod.yaml
(0 hunks)docs/benchmarks/pre_deployment_profiling.md
(3 hunks)docs/kubernetes/sla_planner_quickstart.md
(5 hunks)tests/planner/perf_test_configs/disagg_8b_planner.yaml
(1 hunks)tests/planner/scaling/disagg_planner.yaml
(1 hunks)
💤 Files with no reviewable changes (1)
- deploy/utils/manifests/pvc-access-pod.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
benchmarks/profiler/utils/planner_utils.py (2)
tests/planner/test_replica_calculation.py (1)
planner
(43-101)components/src/dynamo/planner/utils/planner_argparse.py (1)
create_sla_planner_parser
(21-126)
benchmarks/profiler/utils/config.py (1)
components/src/dynamo/planner/defaults.py (1)
SubComponentType
(142-144)
benchmarks/profiler/profile_sla.py (4)
benchmarks/profiler/utils/config.py (15)
Config
(92-95)DgdPlannerServiceConfig
(102-117)PVCConfig
(75-78)set_config_tp_size
(377-378)set_config_tp_size
(508-534)set_config_tp_size
(746-767)set_config_tp_size
(1043-1077)set_config_tep_size
(381-384)set_config_tep_size
(537-546)set_config_tep_size
(770-802)set_config_tep_size
(1080-1089)set_config_dep_size
(387-390)set_config_dep_size
(549-558)set_config_dep_size
(805-837)set_config_dep_size
(1092-1101)components/src/dynamo/planner/defaults.py (1)
SubComponentType
(142-144)benchmarks/profiler/utils/planner_utils.py (2)
add_planner_arguments_to_parser
(21-95)build_planner_args_from_namespace
(98-162)deploy/utils/dynamo_deployment.py (2)
DynamoDeploymentClient
(98-475)create_deployment
(220-265)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3441/merge) by tedzhouhk.
docs/benchmarks/pre_deployment_profiling.md
[error] 128-128: Trailing whitespace detected and corrected by pre-commit hook 'trailing-whitespace'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.gitignore (1)
102-103
: Ignore entry looks good.Adding
profiling_results*
keeps newly generated profiling artifacts out of version control. Nice catch.
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
…/auto-gen-dgd Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
dynamo-pvc
toReadWriteMany
close https://linear.app/nvidia/issue/DEP-456/profiling-job-outputs-k8s-dgd-crd-yaml-for-both-online-profiler-and
Summary by CodeRabbit
New Features
Documentation
Chores