fix: Implement scaling Grove subresources #2531
Conversation
LGTM!
Walkthrough

Adds Kubernetes Scale subresource support for Grove resources in the operator. Wires a Scale client into the reconciler, introduces a generic scaling helper, updates the reconciliation flow to scale PodClique/PodCliqueScalingGroup, and expands RBAC. Bumps Helm chart versions to 0.4.1 across the platform and operator charts.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Operator as Operator Process
    participant M as Manager
    participant S as ScalesGetter (client)
    participant R as DynamoGraphDeploymentReconciler
    participant K as Kubernetes API Server

    Operator->>M: Start manager
    Operator->>M: createScalesGetter(cfg)
    M->>K: Build discovery, RESTMapper, RESTClient
    M-->>Operator: ScalesGetter
    Operator->>R: Initialize Reconciler(ScaleClient=S)

    note over R: Reconcile loop for DynamoGraphDeployment
    R->>K: Sync GroveGangSet (unchanged flow)
    R->>R: reconcileGroveScaling(dynamoDeployment)

    alt Multi-node service
        R->>R: name = {dynamo}-{i}-{svc}
        R->>S: ScaleResource(GVR PodCliqueScalingGroups, ns, name, replicas)
    else Single-node service
        R->>R: name = {dynamo}-{i}-{svc}
        R->>S: ScaleResource(GVR PodCliques, ns, name, replicas)
    end

    S->>K: GET .../scale
    alt NotFound
        S-->>R: Skip, retry on next reconcile
    else Needs change
        S->>K: UPDATE .../scale (replicas)
        K-->>S: Result
        S-->>R: Success/Failure
    end

    R-->>Operator: Reconcile result (error if scaling fails)
```
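The generic scaling helper the walkthrough refers to (ScaleResource in controller_common/scale.go) is not shown in full on this page. Below is a minimal sketch of such a helper using the client-go scale API, assuming the call signature visible in the review diffs further down; the PR's actual implementation may differ.

```go
// Sketch only: read the /scale subresource for the given GVR and update
// spec.replicas when it differs from the desired value.
package controller_common

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/scale"
)

func ScaleResource(ctx context.Context, scales scale.ScalesGetter, gvr schema.GroupVersionResource, namespace, name string, replicas int32) error {
	gr := gvr.GroupResource()

	current, err := scales.Scales(namespace).Get(ctx, gr, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Grove has not created the subresource yet; skip and retry on the next reconcile.
		return nil
	}
	if err != nil {
		return fmt.Errorf("getting scale for %s %q: %w", gr.String(), name, err)
	}

	if current.Spec.Replicas == replicas {
		// Already at the desired replica count; nothing to do.
		return nil
	}

	current.Spec.Replicas = replicas
	if _, err := scales.Scales(namespace).Update(ctx, gr, current, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("updating scale for %s %q: %w", gr.String(), name, err)
	}
	return nil
}
```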
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Warning: there were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.

🔧 golangci-lint (2.2.2): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/product/migration-guide for migration instructions.
Actionable comments posted: 3
🧹 Nitpick comments (4)
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
Lines 131-139: Helm RBAC: scale subresource permissions correctly added.

The rule grants get/patch/update on podcliques/scale and podcliquescalinggroups/scale, matching the controller changes. Good consistency with config/rbac/role.yaml.

Note: This template already includes a very broad wildcard rule (apiGroups: '*', resources: '*', verbs: '*'). That makes the new granular rule technically redundant at render time. If the goal is to move toward least privilege (recommended), consider gating or removing the wildcard rule in a future PR and relying on explicit rules like these.
deploy/cloud/operator/internal/controller_common/scale.go (1)
Lines 42-51: Differentiate scale-subresource absence vs. target object absence

NotFound here likely indicates the target object hasn't been created yet (perfectly fine for a first reconcile). However, "no scale subresource" will typically surface as a mapping or Forbidden error. Consider treating Forbidden as an actionable RBAC issue to improve operator diagnosability (a sketch follows the bullets below).
- Add a special case for errors.IsForbidden(err) to point users to RBAC for “/scale”.
- Keep current NotFound handling as-is to allow retries.
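A hedged sketch of that differentiation, using a hypothetical classifyScaleError helper (not part of the PR) wrapped around the Get/Update calls in the scale helper:

```go
package controller_common

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// classifyScaleError is hypothetical: it tolerates NotFound (the target object
// has not been created yet) but turns Forbidden into an actionable RBAC hint
// for the /scale subresource.
func classifyScaleError(err error, gvr schema.GroupVersionResource, name string) error {
	switch {
	case err == nil:
		return nil
	case apierrors.IsNotFound(err):
		// Keep the current behavior: skip and retry on the next reconcile.
		return nil
	case apierrors.IsForbidden(err):
		return fmt.Errorf("scaling %s %q is forbidden; check RBAC for the %s/scale subresource: %w",
			gvr.Resource, name, gvr.GroupResource().String(), err)
	default:
		return err
	}
}
```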
deploy/cloud/operator/cmd/main.go (1)
Lines 354-360: Optional: lazily create the Scale client only when Grove is enabled

You can defer Scale client creation until after Grove availability detection to avoid unnecessary work in clusters without Grove. Not a blocker.

- Move createScalesGetter after DetectGroveAvailability and guard the wiring with if groveEnabled (see the sketch below).
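For context, createScalesGetter wiring typically looks like the following with client-go (the function name comes from the walkthrough; this is the standard discovery/RESTMapper pattern and may differ from the PR's exact code). It would live in cmd/main.go alongside the existing main():

```go
package main

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/scale"
)

func createScalesGetter(cfg *rest.Config) (scale.ScalesGetter, error) {
	// Discovery is used both for the RESTMapper and to resolve each
	// resource's scale subresource version.
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	cached := memory.NewMemCacheClient(dc)

	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cached)
	resolver := scale.NewDiscoveryScaleKindResolver(cached)

	return scale.NewForConfig(cfg, mapper, dynamic.LegacyAPIPathResolverFunc, resolver)
}
```

Deferring this call until after DetectGroveAvailability, as suggested above, only changes where it is invoked, not the wiring itself.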
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
Lines 179-195: Avoid stringly-typed resourceType; pass GVR directly

Using a string switch ("PodClique", "PodCliqueScalingGroup") is brittle. Pass the schema.GroupVersionResource directly to reduce errors and simplify call sites.
Apply this diff:
```diff
-// scaleGroveResource scales a Grove resource using the generic scaling function
-func (r *DynamoGraphDeploymentReconciler) scaleGroveResource(ctx context.Context, resourceName, namespace string, newReplicas int32, resourceType string) error {
-	// Determine the GroupVersionResource based on resource type
-	var gvr schema.GroupVersionResource
-	switch resourceType {
-	case "PodClique":
-		gvr = podCliqueGVR
-	case "PodCliqueScalingGroup":
-		gvr = podCliqueScalingGroupGVR
-	default:
-		return fmt.Errorf("unsupported Grove resource type: %s", resourceType)
-	}
-
-	// Use the generic scaling function
-	return commonController.ScaleResource(ctx, r.ScaleClient, gvr, namespace, resourceName, newReplicas)
-}
+// scaleGroveResource scales a Grove resource using the generic scaling function
+func (r *DynamoGraphDeploymentReconciler) scaleGroveResource(ctx context.Context, gvr schema.GroupVersionResource, resourceName, namespace string, newReplicas int32) error {
+	return commonController.ScaleResource(ctx, r.ScaleClient, gvr, namespace, resourceName, newReplicas)
+}
```

And update the call sites below accordingly (see subsequent comment).
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- deploy/cloud/helm/platform/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1 hunks)
- deploy/cloud/operator/cmd/main.go (3 hunks)
- deploy/cloud/operator/config/rbac/role.yaml (1 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (5 hunks)
- deploy/cloud/operator/internal/controller_common/scale.go (1 hunks)
- deploy/helm/chart/Chart.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code Graph Analysis (2)
deploy/cloud/operator/cmd/main.go (1)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
  - DynamoGraphDeploymentReconciler (75-81)

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
- deploy/cloud/operator/internal/controller_common/resource.go (1)
  - Resource (474-477)
- deploy/cloud/operator/internal/controller_common/scale.go (1)
  - ScaleResource (33-77)
🔇 Additional comments (9)
deploy/helm/chart/Chart.yaml (1)
Lines 20-21: Version bump is consistent and scoped correctly.

Chart version and appVersion are both updated to 0.4.1, matching the operator/platform chart bumps in this PR. LGTM.
deploy/cloud/operator/config/rbac/role.yaml (1)
Lines 101-109: RBAC for Grove scale subresources is correct; please verify the CRDs on your cluster

Granting get, patch, and update on podcliques/scale and podcliquescalinggroups/scale matches the Scale client's needs. We didn't find any grove.io CRDs vendored in the repo; please confirm on your cluster:
- kubectl api-resources --api-group=grove.io | grep -E 'podclique|podcliquescalinggroup'
- kubectl get --raw "/apis/grove.io/v1alpha1/namespaces/<namespace>/podcliques/<name>/scale" | jq .
deploy/cloud/helm/platform/components/operator/Chart.yaml (1)
Lines 30-35: Operator chart/appVersion bump looks good.

The minor version aligns with the new capability and with the platform chart dependency update. No other changes needed here.
deploy/cloud/helm/platform/Chart.yaml (1)
Lines 22-26: Platform chart version bump is consistent

The version: 0.4.1 and the dynamo-operator dependency at 0.4.1 align with the rest of this PR. However, I didn't find a Chart.lock at deploy/cloud/helm/platform/Chart.lock. If you're managing Helm dependencies for this chart, please regenerate and commit the lock file to avoid drift:

```sh
cd deploy/cloud/helm/platform
helm dependency update
git add Chart.lock
git commit -m "chore(platform): update Chart.lock for v0.4.1"
```

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (5)
Lines 56-68: LGTM: clear GVR definitions for Grove resources

The GVRs for PodClique and PodCliqueScalingGroup are explicit and correct for grove.io/v1alpha1. This makes the scaling code straightforward.
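For reference, the GVR variables used in the diffs above plausibly look like the following (a sketch inferred from the grove.io group, the v1alpha1 version, and the resource names in the RBAC rules; the exact declarations in the controller may differ):

```go
package controller

import "k8s.io/apimachinery/pkg/runtime/schema"

var (
	// Grove PodClique: scaled directly for single-node services.
	podCliqueGVR = schema.GroupVersionResource{
		Group:    "grove.io",
		Version:  "v1alpha1",
		Resource: "podcliques",
	}
	// Grove PodCliqueScalingGroup: scaled for multi-node services.
	podCliqueScalingGroupGVR = schema.GroupVersionResource{
		Group:    "grove.io",
		Version:  "v1alpha1",
		Resource: "podcliquescalinggroups",
	}
)
```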
Lines 87-89: RBAC covers Scale subresources for Grove

Granting get/update/patch on podcliques/scale and podcliquescalinggroups/scale is necessary and sufficient for the Scale client calls.
Lines 211-237: Replica naming index stays constant at 0; verify the intended Grove naming pattern

replicaIndex is initialized to 0 and never incremented. As a result, all scaled resources are named {DGD.name}-0-{serviceName}. If Grove naming really expects index 0 for the subresources you scale, this is fine; otherwise, it may collide with or miss the target resource.
- Confirm the expected naming convention for PodClique and PodCliqueScalingGroup instances created by the PodGangSet for multi/single-node services.
- If a different index is required (e.g., per-rank or per-shard), increment or derive replicaIndex accordingly.
If adopting the earlier refactor to pass GVR directly, also apply this call-site simplification:
```diff
-	resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
-	err := r.scaleGroveResource(ctx,
-		resourceName,
-		dynamoDeployment.Namespace,
-		*component.Replicas,
-		"PodCliqueScalingGroup")
+	resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
+	err := r.scaleGroveResource(ctx, podCliqueScalingGroupGVR, resourceName, dynamoDeployment.Namespace, *component.Replicas)
```

```diff
-	resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
-	err := r.scaleGroveResource(ctx,
-		resourceName,
-		dynamoDeployment.Namespace,
-		*component.Replicas,
-		"PodClique")
+	resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
+	err := r.scaleGroveResource(ctx, podCliqueGVR, resourceName, dynamoDeployment.Namespace, *component.Replicas)
```
Lines 196-242: Good: idempotent scaling with NotFound tolerance

The flow scales after syncing the PodGangSet and tolerates NotFound (in ScaleResource), which is appropriate given Grove's asynchronous creation of subresources.
Lines 244-271: Integrating scaling after the structural sync is correct

Running reconcileGroveScaling immediately after syncing the GangSet is a logical place. The error propagation with a specific reason ("grove_scaling_failed") improves debuggability.
Signed-off-by: Hannah Zhang <[email protected]>
Overview:
Implement scaling Grove subresources
The Grove PodGangSet is a template: changing replicas in the template doesn't update its subresources (PodCliques / PodCliqueScalingGroups).
The Dynamo operator therefore needs to scale these subresources directly.
Discussed with the Grove team.
Summary by CodeRabbit
New Features: scaling support for Grove subresources (PodClique / PodCliqueScalingGroup) via the Kubernetes Scale subresource.
Chores: Helm chart versions bumped to 0.4.1 across the platform and operator charts.