
Conversation

julienmancuso
Contributor

@julienmancuso julienmancuso commented Aug 19, 2025

Overview:

Implement scaling of Grove subresources.

The Grove PodGangSet is a template: changing replicas in the template doesn't update the subresources (podcliques / podcliquescalinggroups).

The Dynamo operator therefore needs to scale these subresources directly.

Discussed with the Grove team.

Summary by CodeRabbit

  • New Features

    • Automatic scaling of Grove resources based on service replica settings within deployments, improving elasticity and coordination.
    • More robust synchronization flow with clearer logging during scaling and networking resource updates.
  • Chores

    • Updated permissions to allow scaling operations on Grove resources.
    • Aligned chart and app versions to 0.4.1 across platform and operator components, including dependency updates.

Contributor

@nvrohanv nvrohanv left a comment

LGTM!

Contributor

coderabbitai bot commented Aug 19, 2025

Walkthrough

Adds Kubernetes Scale subresource support for Grove resources in the operator. Wires a Scale client into the reconciler, introduces a generic scaling helper, updates the reconciliation flow to scale PodClique/PodCliqueScalingGroup, and expands RBAC. Bumps Helm chart versions to 0.4.1 across platform and operator charts.

Changes

  • Helm chart version bumps (deploy/cloud/helm/platform/Chart.yaml, deploy/cloud/helm/platform/components/operator/Chart.yaml, deploy/helm/chart/Chart.yaml): increment chart/appVersion from 0.4.0 to 0.4.1; update the dynamo-operator dependency to 0.4.1.
  • RBAC updates for Scale subresources (deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/operator/config/rbac/role.yaml): add grove.io permissions for podcliques/scale and podcliquescalinggroups/scale with the get, patch, and update verbs.
  • Operator wiring for Scale client (deploy/cloud/operator/cmd/main.go): add createScalesGetter; initialize a ScalesGetter using discovery and a REST mapper; inject it into DynamoGraphDeploymentReconciler; propagate errors.
  • Reconciler scaling integration (deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go): add a ScaleClient field; implement scaleGroveResource and reconcileGroveScaling; integrate scaling into reconcileGroveResources; adjust error handling around SyncResource; add logging.
  • Generic scaling helper (deploy/cloud/operator/internal/controller_common/scale.go): add ScaleResource to get/update autoscalingv1.Scale via the Scale subresource; handle NotFound; no-op when replicas are unchanged; log transitions.
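
For orientation, here is a minimal sketch of what a generic helper like ScaleResource in controller_common/scale.go could look like, based on the summary above (fetch the /scale subresource, no-op when replicas are unchanged, tolerate NotFound). Names, logging, and error wrapping are assumptions, not the PR's verbatim code.

package controller_common

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/scale"
    "sigs.k8s.io/controller-runtime/pkg/log"
)

// ScaleResource reads the /scale subresource of the target object and updates
// spec.replicas when it differs from newReplicas. NotFound is tolerated so the
// caller can retry on the next reconcile once Grove has created the object.
func ScaleResource(ctx context.Context, scales scale.ScalesGetter, gvr schema.GroupVersionResource, namespace, name string, newReplicas int32) error {
    logger := log.FromContext(ctx)

    current, err := scales.Scales(namespace).Get(ctx, gvr.GroupResource(), name, metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        logger.Info("scale target not found yet, skipping", "resource", gvr.Resource, "name", name)
        return nil
    }
    if err != nil {
        return fmt.Errorf("getting scale for %s %s/%s: %w", gvr.Resource, namespace, name, err)
    }

    // No-op when the desired replica count is already set.
    if current.Spec.Replicas == newReplicas {
        return nil
    }

    logger.Info("scaling resource", "resource", gvr.Resource, "name", name,
        "from", current.Spec.Replicas, "to", newReplicas)
    current.Spec.Replicas = newReplicas
    if _, err := scales.Scales(namespace).Update(ctx, gvr.GroupResource(), current, metav1.UpdateOptions{}); err != nil {
        return fmt.Errorf("updating scale for %s %s/%s: %w", gvr.Resource, namespace, name, err)
    }
    return nil
}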

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Operator as Operator Process
  participant M as Manager
  participant S as ScalesGetter (client)
  participant R as DynamoGraphDeploymentReconciler
  participant K as Kubernetes API Server

  Operator->>M: Start manager
  Operator->>M: createScalesGetter(cfg)
  M->>K: Build discovery, RESTMapper, RESTClient
  M-->>Operator: ScalesGetter
  Operator->>R: Initialize Reconciler(ScaleClient=S)

  note over R: Reconcile loop for DynamoGraphDeployment

  R->>K: Sync GroveGangSet (unchanged flow)
  R->>R: reconcileGroveScaling(dynamoDeployment)

  alt Multi-node service
    R->>R: name = {dynamo}-{i}-{svc}
    R->>S: ScaleResource(GVR PodCliqueScalingGroups, ns, name, replicas)
  else Single-node service
    R->>R: name = {dynamo}-{i}-{svc}
    R->>S: ScaleResource(GVR PodCliques, ns, name, replicas)
  end

  S->>K: GET .../scale
  alt NotFound
    S-->>R: Skip, retry on next reconcile
  else Needs change
    S->>K: UPDATE .../scale (replicas)
    K-->>S: Result
    S-->>R: Success/Failure
  end

  R-->>Operator: Reconcile result (error if scaling fails)
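
The diagram shows createScalesGetter building the Scale client from discovery and a REST mapper. A minimal sketch of that wiring with client-go follows (the function name comes from the change summary; the exact construction is an assumption):

package main

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/discovery/cached/memory"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/restmapper"
    "k8s.io/client-go/scale"
)

// createScalesGetter builds a scale.ScalesGetter from the manager's REST config,
// using discovery to resolve which group/version serves each resource's /scale.
func createScalesGetter(cfg *rest.Config) (scale.ScalesGetter, error) {
    discoveryClient, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        return nil, fmt.Errorf("creating discovery client: %w", err)
    }
    // Cached REST mapper so repeated lookups don't hit the API server every time.
    mapper := restmapper.NewDeferredDiscoveryRESTMapper(memory.NewMemCacheClient(discoveryClient))
    resolver := scale.NewDiscoveryScaleKindResolver(discoveryClient)
    return scale.NewForConfig(cfg, mapper, dynamic.LegacyAPIPathResolverFunc, resolver)
}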

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

A rabbit taps the cluster’s scale,
With whiskered math and tidy trail.
PodCliques grow, then gently rest,
Groups align at your behest.
Charts hop to point-one, neat—
RBAC keys unlock the feat.
Thump-thump: reconcile complete. 🥕🐇

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

131-139: Helm RBAC: scale subresource permissions correctly added.

The rule grants get/patch/update on podcliques/scale and podcliquescalinggroups/scale, matching the controller changes. Good consistency with config/rbac/role.yaml.

Note: This template already includes a very broad wildcard rule (apiGroups: '*', resources: '*', verbs: '*'). That makes the new granular rule technically redundant at render time. If the goal is to move toward least privilege (recommended), consider gating or removing the wildcard rule in a future PR and relying on explicit rules like these.

deploy/cloud/operator/internal/controller_common/scale.go (1)

42-51: Differentiate scale-subresource absence vs. target object absence

NotFound here likely indicates the target object hasn’t been created yet (perfectly fine for first reconcile). However, “no scale subresource” will typically surface as a mapping or Forbidden error. Consider treating Forbidden as an actionable RBAC issue to improve operator diagnosability.

  • Add a special case for errors.IsForbidden(err) to point users to RBAC for “/scale”.
  • Keep current NotFound handling as-is to allow retries.
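
A small sketch of what that special case could look like (shown as a standalone helper; the name wrapScaleError and its placement are hypothetical, not code from this PR):

package controller_common

import (
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// wrapScaleError illustrates the suggested special case: surface Forbidden as an
// actionable RBAC hint while leaving other errors (and the existing NotFound
// skip) untouched.
func wrapScaleError(err error, resource, namespace, name string) error {
    if apierrors.IsForbidden(err) {
        return fmt.Errorf("scale subresource %s/scale forbidden for %s/%s; grant get/update/patch on it via RBAC: %w",
            resource, namespace, name, err)
    }
    return err
}
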
deploy/cloud/operator/cmd/main.go (1)

354-360: Optional: Lazily create Scale client only when Grove is enabled

You can defer scale client creation until after Grove availability detection to avoid unnecessary work in clusters without Grove. Not a blocker.

  • Move createScalesGetter after DetectGroveAvailability and guard wiring with if groveEnabled.
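
A rough sketch of the suggested guard in cmd/main.go (groveEnabled, DetectGroveAvailability's signature, and the surrounding main.go variables are assumptions based on this comment, not the PR's actual code):

// Detect Grove first, then only build the Scale client when it is needed.
groveEnabled := commonController.DetectGroveAvailability(ctx, mgr.GetConfig())

var scaleClient scale.ScalesGetter
if groveEnabled {
    sc, err := createScalesGetter(mgr.GetConfig())
    if err != nil {
        setupLog.Error(err, "unable to create scales getter")
        os.Exit(1)
    }
    scaleClient = sc
}

reconciler := &controller.DynamoGraphDeploymentReconciler{
    Client:      mgr.GetClient(),
    ScaleClient: scaleClient, // nil when Grove is absent; scaling is then skipped
    // ... other fields unchanged ...
}
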
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

179-195: Avoid stringly-typed resourceType; pass GVR directly

Using a string switch (“PodClique”, “PodCliqueScalingGroup”) is brittle. Pass the schema.GroupVersionResource directly to reduce errors and simplify call sites.

Apply this diff:

-// scaleGroveResource scales a Grove resource using the generic scaling function
-func (r *DynamoGraphDeploymentReconciler) scaleGroveResource(ctx context.Context, resourceName, namespace string, newReplicas int32, resourceType string) error {
-  // Determine the GroupVersionResource based on resource type
-  var gvr schema.GroupVersionResource
-  switch resourceType {
-  case "PodClique":
-    gvr = podCliqueGVR
-  case "PodCliqueScalingGroup":
-    gvr = podCliqueScalingGroupGVR
-  default:
-    return fmt.Errorf("unsupported Grove resource type: %s", resourceType)
-  }
-
-  // Use the generic scaling function
-  return commonController.ScaleResource(ctx, r.ScaleClient, gvr, namespace, resourceName, newReplicas)
-}
+// scaleGroveResource scales a Grove resource using the generic scaling function
+func (r *DynamoGraphDeploymentReconciler) scaleGroveResource(ctx context.Context, gvr schema.GroupVersionResource, resourceName, namespace string, newReplicas int32) error {
+  return commonController.ScaleResource(ctx, r.ScaleClient, gvr, namespace, resourceName, newReplicas)
+}

And update the call sites below accordingly (see subsequent comment).

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d9aef67 and e7b7e56.

📒 Files selected for processing (8)
  • deploy/cloud/helm/platform/Chart.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/Chart.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1 hunks)
  • deploy/cloud/operator/cmd/main.go (3 hunks)
  • deploy/cloud/operator/config/rbac/role.yaml (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (5 hunks)
  • deploy/cloud/operator/internal/controller_common/scale.go (1 hunks)
  • deploy/helm/chart/Chart.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code Graph Analysis (2)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
  • DynamoGraphDeploymentReconciler (75-81)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
deploy/cloud/operator/internal/controller_common/resource.go (1)
  • Resource (474-477)
deploy/cloud/operator/internal/controller_common/scale.go (1)
  • ScaleResource (33-77)
🔇 Additional comments (9)
deploy/helm/chart/Chart.yaml (1)

20-21: Version bump is consistent and scoped correctly.

Chart version and appVersion both updated to 0.4.1. Matches the operator/platform chart bumps in this PR. LGTM.

deploy/cloud/operator/config/rbac/role.yaml (1)

101-109: RBAC for Grove scale subresources is correct; please verify CRDs on your cluster

Granting get, patch and update on podcliques/scale and podcliquescalinggroups/scale matches the Scale client’s needs. We didn’t find any grove.io CRDs vendored in the repo—please confirm on your cluster:

  • kubectl api-resources --api-group=grove.io | grep -E 'podclique|podcliquescalinggroup'
  • kubectl get --raw "/apis/grove.io/v1alpha1/namespaces/<namespace>/podcliques/<name>/scale" | jq .
deploy/cloud/helm/platform/components/operator/Chart.yaml (1)

30-35: Operator chart/appVersion bump looks good.

Minor version aligns with the new capability and with the platform chart dependency update. No other changes needed here.

deploy/cloud/helm/platform/Chart.yaml (1)

22-26: Platform chart version bump is consistent

The version: 0.4.1 and dynamo-operator dependency at 0.4.1 align with the rest of this PR.

However, I didn’t find a Chart.lock at deploy/cloud/helm/platform/Chart.lock. If you’re managing Helm dependencies for this chart, please regenerate and commit the lock file to avoid drift:

cd deploy/cloud/helm/platform
helm dependency update
git add Chart.lock
git commit -m "chore(platform): update Chart.lock for v0.4.1"
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (5)

56-68: LGTM: Clear GVR definitions for Grove resources

The GVRs for PodClique and PodCliqueScalingGroup are explicit and correct for grove.io/v1alpha1. This makes the scaling code straightforward.


87-89: RBAC covers Scale subresources for Grove

Granting get/update/patch on podcliques/scale and podcliquescalinggroups/scale is necessary and sufficient for the Scale client calls.


211-237: Replicas naming index stays constant at 0; verify intended Grove naming pattern

replicaIndex is initialized to 0 and never incremented. As a result, all scaled resources are named: {DGD.name}-0-{serviceName}. If the Grove naming really expects index 0 for the subresources you scale, this is fine; otherwise, it may collide or miss the target resource.

  • Confirm the expected naming convention for PodClique and PodCliqueScalingGroup instances created by the PodGangSet for multi/single-node services.
  • If a different index is required (e.g., per-rank or per-shard), increment or derive replicaIndex accordingly.

If adopting the earlier refactor to pass GVR directly, also apply this call-site simplification:

- resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
- err := r.scaleGroveResource(ctx,
-   resourceName,
-   dynamoDeployment.Namespace,
-   *component.Replicas,
-   "PodCliqueScalingGroup")
+ resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
+ err := r.scaleGroveResource(ctx, podCliqueScalingGroupGVR, resourceName, dynamoDeployment.Namespace, *component.Replicas)
- resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
- err := r.scaleGroveResource(ctx,
-   resourceName,
-   dynamoDeployment.Namespace,
-   *component.Replicas,
-   "PodClique")
+ resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
+ err := r.scaleGroveResource(ctx, podCliqueGVR, resourceName, dynamoDeployment.Namespace, *component.Replicas)

196-242: Good: Idempotent scaling with NotFound tolerance

The flow scales after syncing the PodGangSet and tolerates NotFound (in ScaleResource), which is appropriate given Grove’s async creation of subresources.


244-271: Integrating scaling after structural sync is correct

Running reconcileGroveScaling immediately after syncing the GangSet is a logical place. The error propagation with a specific reason ("grove_scaling_failed") improves debuggability.

@julienmancuso julienmancuso enabled auto-merge (squash) August 19, 2025 18:36
@julienmancuso julienmancuso merged commit a97602a into main Aug 19, 2025
9 of 11 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-323 branch August 19, 2025 18:54
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025