
Conversation

julienmancuso
Contributor

@julienmancuso julienmancuso commented Aug 19, 2025

Overview:

Implement scaling of Grove subresources.

The Grove PodGangSet is a template: changing replicas in the template doesn't update the subresources (podcliques / podcliquescalinggroups).

The Dynamo operator therefore needs to scale these subresources directly.

Discussed with the Grove team.

Summary by CodeRabbit

  • New Features

    • Automatic scaling of Grove resources based on service replica settings within deployments, improving elasticity and coordination.
    • More robust synchronization flow with clearer logging during scaling and networking resource updates.
  • Chores

    • Updated permissions to allow scaling operations on Grove resources.
    • Aligned chart and app versions to 0.4.1 across platform and operator components, including dependency updates.

Contributor

@nvrohanv nvrohanv left a comment

LGTM!

Contributor

coderabbitai bot commented Aug 19, 2025

Walkthrough

Adds Kubernetes Scale subresource support for Grove resources in the operator. Wires a Scale client into the reconciler, introduces a generic scaling helper, updates the reconciliation flow to scale PodClique/PodCliqueScalingGroup, and expands RBAC. Bumps Helm chart versions to 0.4.1 across platform and operator charts.

Changes

  • Helm chart version bumps (deploy/cloud/helm/platform/Chart.yaml, deploy/cloud/helm/platform/components/operator/Chart.yaml, deploy/helm/chart/Chart.yaml): increment chart/appVersion from 0.4.0 to 0.4.1; update the dynamo-operator dependency to 0.4.1.
  • RBAC updates for Scale subresources (deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/operator/config/rbac/role.yaml): add grove.io permissions for podcliques/scale and podcliquescalinggroups/scale with the get, patch, and update verbs.
  • Operator wiring for Scale client (deploy/cloud/operator/cmd/main.go): add createScalesGetter; initialize a ScalesGetter using discovery and a REST mapper; inject it into DynamoGraphDeploymentReconciler; propagate errors.
  • Reconciler scaling integration (deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go): add a ScaleClient field; implement scaleGroveResource and reconcileGroveScaling; integrate scaling into reconcileGroveResources; adjust error handling around SyncResource; add logging.
  • Generic scaling helper (deploy/cloud/operator/internal/controller_common/scale.go): add ScaleResource to get/update autoscalingv1.Scale via the Scale subresource; handle NotFound; no-op when replicas are unchanged; log transitions.
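
For orientation, here is a minimal sketch of what a generic helper like ScaleResource in controller_common/scale.go could look like, based on the summary above (fetch the /scale subresource, no-op when replicas are unchanged, tolerate NotFound). Names, logging, and error wrapping are assumptions, not the PR's verbatim code.

package controller_common

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/scale"
    "sigs.k8s.io/controller-runtime/pkg/log"
)

// ScaleResource reads the /scale subresource of the target object and updates
// spec.replicas when it differs from newReplicas. NotFound is tolerated so the
// caller can retry on the next reconcile once Grove has created the object.
func ScaleResource(ctx context.Context, scales scale.ScalesGetter, gvr schema.GroupVersionResource, namespace, name string, newReplicas int32) error {
    logger := log.FromContext(ctx)

    current, err := scales.Scales(namespace).Get(ctx, gvr.GroupResource(), name, metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        logger.Info("scale target not found yet, skipping", "resource", gvr.Resource, "name", name)
        return nil
    }
    if err != nil {
        return fmt.Errorf("getting scale for %s %s/%s: %w", gvr.Resource, namespace, name, err)
    }

    // No-op when the desired replica count is already set.
    if current.Spec.Replicas == newReplicas {
        return nil
    }

    logger.Info("scaling resource", "resource", gvr.Resource, "name", name,
        "from", current.Spec.Replicas, "to", newReplicas)
    current.Spec.Replicas = newReplicas
    if _, err := scales.Scales(namespace).Update(ctx, gvr.GroupResource(), current, metav1.UpdateOptions{}); err != nil {
        return fmt.Errorf("updating scale for %s %s/%s: %w", gvr.Resource, namespace, name, err)
    }
    return nil
}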

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Operator as Operator Process
  participant M as Manager
  participant S as ScalesGetter (client)
  participant R as DynamoGraphDeploymentReconciler
  participant K as Kubernetes API Server

  Operator->>M: Start manager
  Operator->>M: createScalesGetter(cfg)
  M->>K: Build discovery, RESTMapper, RESTClient
  M-->>Operator: ScalesGetter
  Operator->>R: Initialize Reconciler(ScaleClient=S)

  note over R: Reconcile loop for DynamoGraphDeployment

  R->>K: Sync GroveGangSet (unchanged flow)
  R->>R: reconcileGroveScaling(dynamoDeployment)

  alt Multi-node service
    R->>R: name = {dynamo}-{i}-{svc}
    R->>S: ScaleResource(GVR PodCliqueScalingGroups, ns, name, replicas)
  else Single-node service
    R->>R: name = {dynamo}-{i}-{svc}
    R->>S: ScaleResource(GVR PodCliques, ns, name, replicas)
  end

  S->>K: GET .../scale
  alt NotFound
    S-->>R: Skip, retry on next reconcile
  else Needs change
    S->>K: UPDATE .../scale (replicas)
    K-->>S: Result
    S-->>R: Success/Failure
  end

  R-->>Operator: Reconcile result (error if scaling fails)
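
The diagram shows createScalesGetter building the Scale client from discovery and a REST mapper. A minimal sketch of that wiring with client-go follows (the function name comes from the change summary; the exact construction is an assumption):

package main

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/discovery/cached/memory"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/restmapper"
    "k8s.io/client-go/scale"
)

// createScalesGetter builds a scale.ScalesGetter from the manager's REST config,
// using discovery to resolve which group/version serves each resource's /scale.
func createScalesGetter(cfg *rest.Config) (scale.ScalesGetter, error) {
    discoveryClient, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        return nil, fmt.Errorf("creating discovery client: %w", err)
    }
    // Cached REST mapper so repeated lookups don't hit the API server every time.
    mapper := restmapper.NewDeferredDiscoveryRESTMapper(memory.NewMemCacheClient(discoveryClient))
    resolver := scale.NewDiscoveryScaleKindResolver(discoveryClient)
    return scale.NewForConfig(cfg, mapper, dynamic.LegacyAPIPathResolverFunc, resolver)
}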

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

A rabbit taps the cluster’s scale,
With whiskered math and tidy trail.
PodCliques grow, then gently rest,
Groups align at your behest.
Charts hop to point-one, neat—
RBAC keys unlock the feat.
Thump-thump: reconcile complete. 🥕🐇

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

131-139: Helm RBAC: scale subresource permissions correctly added.

The rule grants get/patch/update on podcliques/scale and podcliquescalinggroups/scale, matching the controller changes. Good consistency with config/rbac/role.yaml.

Note: This template already includes a very broad wildcard rule (apiGroups: '*', resources: '*', verbs: '*'). That makes the new granular rule technically redundant at render time. If the goal is to move toward least privilege (recommended), consider gating or removing the wildcard rule in a future PR and relying on explicit rules like these.

deploy/cloud/operator/internal/controller_common/scale.go (1)

42-51: Differentiate scale-subresource absence vs. target object absence

NotFound here likely indicates the target object hasn’t been created yet (perfectly fine for first reconcile). However, “no scale subresource” will typically surface as a mapping or Forbidden error. Consider treating Forbidden as an actionable RBAC issue to improve operator diagnosability.

  • Add a special case for errors.IsForbidden(err) to point users to RBAC for “/scale”.
  • Keep current NotFound handling as-is to allow retries.
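
A small sketch of what that special case could look like (shown as a standalone helper; the name wrapScaleError and its placement are hypothetical, not code from this PR):

package controller_common

import (
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// wrapScaleError illustrates the suggested special case: surface Forbidden as an
// actionable RBAC hint while leaving other errors (and the existing NotFound
// skip) untouched.
func wrapScaleError(err error, resource, namespace, name string) error {
    if apierrors.IsForbidden(err) {
        return fmt.Errorf("scale subresource %s/scale forbidden for %s/%s; grant get/update/patch on it via RBAC: %w",
            resource, namespace, name, err)
    }
    return err
}
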
deploy/cloud/operator/cmd/main.go (1)

354-360: Optional: Lazily create Scale client only when Grove is enabled

You can defer scale client creation until after Grove availability detection to avoid unnecessary work in clusters without Grove. Not a blocker.

  • Move createScalesGetter after DetectGroveAvailability and guard wiring with if groveEnabled.
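
A rough sketch of the suggested guard in cmd/main.go (groveEnabled, DetectGroveAvailability's signature, and the surrounding main.go variables are assumptions based on this comment, not the PR's actual code):

// Detect Grove first, then only build the Scale client when it is needed.
groveEnabled := commonController.DetectGroveAvailability(ctx, mgr.GetConfig())

var scaleClient scale.ScalesGetter
if groveEnabled {
    sc, err := createScalesGetter(mgr.GetConfig())
    if err != nil {
        setupLog.Error(err, "unable to create scales getter")
        os.Exit(1)
    }
    scaleClient = sc
}

reconciler := &controller.DynamoGraphDeploymentReconciler{
    Client:      mgr.GetClient(),
    ScaleClient: scaleClient, // nil when Grove is absent; scaling is then skipped
    // ... other fields unchanged ...
}
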
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

179-195: Avoid stringly-typed resourceType; pass GVR directly

Using a string switch (“PodClique”, “PodCliqueScalingGroup”) is brittle. Pass the schema.GroupVersionResource directly to reduce errors and simplify call sites.

Apply this diff:

-// scaleGroveResource scales a Grove resource using the generic scaling function
-func (r *DynamoGraphDeploymentReconciler) scaleGroveResource(ctx context.Context, resourceName, namespace string, newReplicas int32, resourceType string) error {
-  // Determine the GroupVersionResource based on resource type
-  var gvr schema.GroupVersionResource
-  switch resourceType {
-  case "PodClique":
-    gvr = podCliqueGVR
-  case "PodCliqueScalingGroup":
-    gvr = podCliqueScalingGroupGVR
-  default:
-    return fmt.Errorf("unsupported Grove resource type: %s", resourceType)
-  }
-
-  // Use the generic scaling function
-  return commonController.ScaleResource(ctx, r.ScaleClient, gvr, namespace, resourceName, newReplicas)
-}
+// scaleGroveResource scales a Grove resource using the generic scaling function
+func (r *DynamoGraphDeploymentReconciler) scaleGroveResource(ctx context.Context, gvr schema.GroupVersionResource, resourceName, namespace string, newReplicas int32) error {
+  return commonController.ScaleResource(ctx, r.ScaleClient, gvr, namespace, resourceName, newReplicas)
+}

And update the call sites below accordingly (see subsequent comment).

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d9aef67 and e7b7e56.

📒 Files selected for processing (8)
  • deploy/cloud/helm/platform/Chart.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/Chart.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1 hunks)
  • deploy/cloud/operator/cmd/main.go (3 hunks)
  • deploy/cloud/operator/config/rbac/role.yaml (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (5 hunks)
  • deploy/cloud/operator/internal/controller_common/scale.go (1 hunks)
  • deploy/helm/chart/Chart.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code Graph Analysis (2)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
  • DynamoGraphDeploymentReconciler (75-81)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
deploy/cloud/operator/internal/controller_common/resource.go (1)
  • Resource (474-477)
deploy/cloud/operator/internal/controller_common/scale.go (1)
  • ScaleResource (33-77)
🔇 Additional comments (9)
deploy/helm/chart/Chart.yaml (1)

20-21: Version bump is consistent and scoped correctly.

Chart version and appVersion both updated to 0.4.1. Matches the operator/platform chart bumps in this PR. LGTM.

deploy/cloud/operator/config/rbac/role.yaml (1)

101-109: RBAC for Grove scale subresources is correct; please verify CRDs on your cluster

Granting get, patch and update on podcliques/scale and podcliquescalinggroups/scale matches the Scale client’s needs. We didn’t find any grove.io CRDs vendored in the repo—please confirm on your cluster:

  • kubectl api-resources --api-group=grove.io | grep -E 'podclique|podcliquescalinggroup'
  • kubectl get --raw "/apis/grove.io/v1alpha1/namespaces/<namespace>/podcliques/<name>/scale" | jq .
deploy/cloud/helm/platform/components/operator/Chart.yaml (1)

30-35: Operator chart/appVersion bump looks good.

Minor version aligns with the new capability and with the platform chart dependency update. No other changes needed here.

deploy/cloud/helm/platform/Chart.yaml (1)

22-26: Platform chart version bump is consistent

The version: 0.4.1 and dynamo-operator dependency at 0.4.1 align with the rest of this PR.

However, I didn’t find a Chart.lock at deploy/cloud/helm/platform/Chart.lock. If you’re managing Helm dependencies for this chart, please regenerate and commit the lock file to avoid drift:

cd deploy/cloud/helm/platform
helm dependency update
git add Chart.lock
git commit -m "chore(platform): update Chart.lock for v0.4.1"
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (5)

56-68: LGTM: Clear GVR definitions for Grove resources

The GVRs for PodClique and PodCliqueScalingGroup are explicit and correct for grove.io/v1alpha1. This makes the scaling code straightforward.


87-89: RBAC covers Scale subresources for Grove

Granting get/update/patch on podcliques/scale and podcliquescalinggroups/scale is necessary and sufficient for the Scale client calls.


211-237: Replicas naming index stays constant at 0; verify intended Grove naming pattern

replicaIndex is initialized to 0 and never incremented. As a result, all scaled resources are named: {DGD.name}-0-{serviceName}. If the Grove naming really expects index 0 for the subresources you scale, this is fine; otherwise, it may collide or miss the target resource.

  • Confirm the expected naming convention for PodClique and PodCliqueScalingGroup instances created by the PodGangSet for multi/single-node services.
  • If a different index is required (e.g., per-rank or per-shard), increment or derive replicaIndex accordingly.

If adopting the earlier refactor to pass GVR directly, also apply this call-site simplification:

- resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
- err := r.scaleGroveResource(ctx,
-   resourceName,
-   dynamoDeployment.Namespace,
-   *component.Replicas,
-   "PodCliqueScalingGroup")
+ resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
+ err := r.scaleGroveResource(ctx, podCliqueScalingGroupGVR, resourceName, dynamoDeployment.Namespace, *component.Replicas)
- resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
- err := r.scaleGroveResource(ctx,
-   resourceName,
-   dynamoDeployment.Namespace,
-   *component.Replicas,
-   "PodClique")
+ resourceName := fmt.Sprintf("%s-%d-%s", dynamoDeployment.Name, replicaIndex, strings.ToLower(serviceName))
+ err := r.scaleGroveResource(ctx, podCliqueGVR, resourceName, dynamoDeployment.Namespace, *component.Replicas)

196-242: Good: Idempotent scaling with NotFound tolerance

The flow scales after syncing the PodGangSet and tolerates NotFound (in ScaleResource), which is appropriate given Grove’s async creation of subresources.


244-271: Integrating scaling after structural sync is correct

Running reconcileGroveScaling immediately after syncing the GangSet is a logical place. The error propagation with a specific reason ("grove_scaling_failed") improves debuggability.

@julienmancuso julienmancuso enabled auto-merge (squash) August 19, 2025 18:36
@julienmancuso julienmancuso merged commit a97602a into main Aug 19, 2025
9 of 11 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-323 branch August 19, 2025 18:54
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025