feat(gcp): GCP-503: Implement OrphanDeleter#8884
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@thetechnick: This pull request references GCP-503 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds 🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: thetechnick The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go (1)
420-472: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider adding a case for the
ValidCredentials == trueearly-return path.The test only exercises the invalid-credentials cleanup path. Adding a case where
hchas valid WIF/credentials conditions set (soValidCredentialsreturns true) would confirm the early-returnniland that finalizers are left untouched, closing an easy-to-miss regression gap.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go` around lines 420 - 472, Add coverage for the ValidCredentials early-return in DeleteOrphanedMachines by extending TestDeleteOrphanedMachines with a HostedCluster state where WIF/credentials are valid and ValidCredentials returns true. Use the existing platform.DeleteOrphanedMachines and validHostedCluster helpers to set up that case, then assert the call returns nil and that GCPMachine finalizers remain unchanged for both deleted and non-deleted objects.hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go (1)
547-561: 🧹 Nitpick | 🔵 TrivialAll finalizers are wiped indiscriminately, not just WIF/credential-related ones.
Any finalizer present on a terminating
GCPMachineis cleared, including ones unrelated to WIF credential validity (e.g. finalizers owned by other controllers). Since the underlying GCP compute resources can't be cleaned up while credentials are invalid, this can leak actual cloud resources (VMs/disks) that CAPG never got to delete, and also removes any other controller's cleanup guarantees on this object. This may be an accepted tradeoff given the goal of unblocking stuck teardown, but worth calling out for operational awareness (e.g. monitoring/alerting on leaked GCP resources after this path fires).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go` around lines 547 - 561, The finalizer-clearing path in gcp.go currently removes every finalizer from terminating GCPMachine objects, not just the credential/WIF-related ones. Update the cleanup logic around the GCPMachine loop to either preserve unrelated finalizers or explicitly document and surface the broad wipe as an intentional tradeoff; use the gcpMachine.Finalizers assignment and the c.Update call as the key spots to adjust, and add a clear warning in the logger.Info/error path so operators can detect possible leaked resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go`:
- Line 545: The log message around the post-cleanup path in gcp.go is stale
copy-pasted wording: it says “skipping cleanup” and mentions AWS even though the
cleanup already happened in the GCP machine flow. Update the message emitted
near the logger := ctrl.LoggerFrom(ctx) path and the surrounding
finalizer/machine update logic to describe the actual completed cleanup, use the
GCP platform name, and ensure any related log strings in the same block
(including the later lines referenced in the comment) are consistent with the
successful cleanup action.
---
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go`:
- Around line 420-472: Add coverage for the ValidCredentials early-return in
DeleteOrphanedMachines by extending TestDeleteOrphanedMachines with a
HostedCluster state where WIF/credentials are valid and ValidCredentials returns
true. Use the existing platform.DeleteOrphanedMachines and validHostedCluster
helpers to set up that case, then assert the call returns nil and that
GCPMachine finalizers remain unchanged for both deleted and non-deleted objects.
In `@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go`:
- Around line 547-561: The finalizer-clearing path in gcp.go currently removes
every finalizer from terminating GCPMachine objects, not just the
credential/WIF-related ones. Update the cleanup logic around the GCPMachine loop
to either preserve unrelated finalizers or explicitly document and surface the
broad wipe as an intentional tradeoff; use the gcpMachine.Finalizers assignment
and the c.Update call as the key spots to adjust, and add a clear warning in the
logger.Info/error path so operators can detect possible leaked resources.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b896aebb-48e0-4882-b50a-0b4e227ce71d
📒 Files selected for processing (2)
hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.gohypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8884 +/- ##
=======================================
Coverage 43.28% 43.28%
=======================================
Files 771 771
Lines 95503 95527 +24
=======================================
+ Hits 41335 41347 +12
- Misses 51284 51293 +9
- Partials 2884 2887 +3
... and 1 file with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
ae6a05d to
c6bbad4
Compare
|
The commit message in the repository is Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe gitlint check failed because the commit message title Root CauseThe commit The gitlint workflow ( The fix is to amend the commit message to use a valid conventional commit prefix, e.g.:
Recommendations
Evidence
|
6600f3a to
8700832
Compare
| errs = append(errs, fmt.Errorf("failed to delete machine %s/%s: %w", gcpMachine.Namespace, gcpMachine.Name, err)) | ||
| continue | ||
| } | ||
| logger.Info("removed finalizers of gcpmachine because of invalid AWS identity provider", "machine", client.ObjectKeyFromObject(gcpMachine)) |
There was a problem hiding this comment.
This should probably read "invalid GCP Credentials" instead.
There was a problem hiding this comment.
Fixed! Sorry for the c/p error, should have seen that myself.
Removes finalizers from orphaned GCPMachines when WIF credentials become invalid to prevent cluster teardown getting stuck. The implementation is similar to the existing AWS implementation.
8700832 to
5c40158
Compare
|
@thetechnick: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
cblecker
left a comment
There was a problem hiding this comment.
Overall this is a clean implementation that correctly follows the established OrphanDeleter pattern from AWS. The extra len(Finalizers) == 0 guard, index-based iteration, and error aggregation are all good.
One blocking concern: the credential condition staleness during deletion (see inline comment on gcp.go). The remaining comments are non-blocking suggestions.
Since this PR adds GCP as an OrphanDeleter implementer, consider adding compile-time interface satisfaction checks in platform.go alongside the existing Platform checks:
var _ OrphanDeleter = aws.AWS{}
var _ OrphanDeleter = gcp.GCP{}This way if the method signature drifts, the build breaks instead of the runtime type assertion silently returning false.
| @@ -527,6 +529,34 @@ func (p GCP) validateWorkloadIdentityConfiguration(hcluster *hyperv1.HostedClust | |||
| return nil | |||
| } | |||
There was a problem hiding this comment.
DeleteOrphanedMachines relies on ValidCredentials(hc), but ValidGCPWorkloadIdentity and ValidGCPCredentials are only set during Phase 6a (ReconcileCredentials) of normal reconciliation — a path that's never reached during deletion.
AWS handles this by refreshing ValidAWSIdentityProvider in Phase 1 (hostedcluster_controller.go:423-452) before the deletion branch, with the comment: "We set this condition even if the HC is being deleted." GCP has no equivalent, so this function is making its decision based on condition data that could be stale from a transient error, or never set at all.
Three scenarios this creates:
- Transient API server error sets
ValidGCPCredentialstoFalseduring normal reconciliation → deletion starts → condition frozen → finalizers stripped unnecessarily - Cluster deleted before conditions ever set →
ValidCredentialsreturnsfalse(nil conditions) → finalizers stripped - Both are safe in the AWS path because the condition is refreshed before
delete()runs
Suggestion: mirror the AWS pattern. validateWorkloadIdentityConfiguration is a pure spec check (no network calls) — extract it into a standalone method, call it from Phase 1 to refresh the condition before the deletion branch, and tighten the guard here to require the condition to be explicitly False rather than just absent.
| @@ -416,3 +416,57 @@ func TestReconcileGCPClusterPreservesServerDefaultedFields(t *testing.T) { | |||
| g.Expect(gcpCluster.Spec.Network.Subnets[0].Name).To(Equal("test-subnet")) | |||
| g.Expect(gcpCluster.Spec.Network.Subnets[0].Region).To(Equal("us-central1")) | |||
| } | |||
There was a problem hiding this comment.
The test only exercises the invalid-credentials path (no status conditions on the HostedCluster). Consider adding a subtest where both ValidGCPWorkloadIdentity and ValidGCPCredentials are set to True, with GCPMachines that have DeletionTimestamp and Finalizers, and assert the finalizers remain unchanged. This protects the ValidCredentials guard — if it were accidentally inverted, the current test would still pass.
| len(gcpMachine.Finalizers) == 0 { | ||
| continue | ||
| } | ||
|
|
There was a problem hiding this comment.
Nit: the error says "failed to delete machine" but the operation is a client.Update to clear finalizers. Something like "failed to remove finalizers from GCPMachine %s/%s" would be more precise for operators debugging stuck teardowns. (AWS has the same wording — could be a follow-up for both.)
What this PR does / why we need it:
Removes finalizers from orphaned GCPMachines when WIF credentials become invalid to prevent cluster teardown getting stuck.
Which issue(s) this PR fixes:
Fixes #GCP-503
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit