feat(gcp): GCP-503: Implement OrphanDeleter#8884

Open
thetechnick wants to merge 1 commit into
openshift:mainfrom
thetechnick:gcp-503-machine-orphan-delete

Conversation

@thetechnick

@thetechnick thetechnick commented Jul 1, 2026

What this PR does / why we need it:

Removes finalizers from orphaned GCPMachines when WIF credentials become invalid, to prevent cluster teardown from getting stuck.

Which issue(s) this PR fixes:

Fixes #GCP-503

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes
    • Improved cleanup by automatically clearing stale finalizers on GCP machine resources that are already marked for deletion, reducing the chance of stuck machine deletions.
    • Cleanup failures are now reported as a single aggregated error to simplify diagnosing issues during teardown.
  • New Features
    • Added logic to remove finalizers when GCP credentials are invalid, with clear audit-style logging for each affected machine.
  • Tests
    • Added unit coverage to confirm finalizers are removed only for machines pending deletion.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot

openshift-ci-robot commented Jul 1, 2026

@thetechnick: This pull request references GCP-503 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 1, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2026
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e8822237-0bde-47b7-b2b8-9ca4fff985bc

📥 Commits

Reviewing files that changed from the base of the PR and between 8700832 and 5c40158.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go

📝 Walkthrough

Walkthrough

Adds DeleteOrphanedMachines to the GCP platform handler. When credentials are invalid, it lists GCPMachine objects in the control plane namespace, clears finalizers on machines with a non-zero DeletionTimestamp, updates them, and aggregates update errors. A unit test covers finalizer removal for terminating and non-terminating machines.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title matches the main change: adding orphaned GCP machine finalizer cleanup logic for invalid credentials.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | The added test uses a static Go test name; no Ginkgo titles or dynamic values appear in the changed file.
Test Structure And Quality | ✅ Passed | The new fake-client unit test is self-contained, targets one behavior, needs no cleanup/timeouts, and matches existing repo test style.
Topology-Aware Scheduling Compatibility | ✅ Passed | Only orphaned GCPMachine finalizer cleanup was added; no pod affinity, node selectors, topology spread, or replica scheduling logic changed.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Added test is a unit test with fake client, no Ginkgo e2e code, IPv4 literals, or external connectivity requirements.
No-Weak-Crypto | ✅ Passed | The diff only adds GCPMachine finalizer cleanup; gcp.go/test add no crypto imports, weak ciphers, or secret/token comparisons.
Container-Privileges | ✅ Passed | No privileged settings appear in the changed GCP files; the provider container sets allowPrivilegeEscalation=false, drops ALL, and runs as non-root.
No-Sensitive-Data-In-Logs | ✅ Passed | The only new log entry records a GCPMachine namespace/name; no passwords, tokens, keys, PII, or session data are logged.
@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/gcp PR/issue for GCP (GCPPlatform) platform and removed do-not-merge/needs-area labels Jul 1, 2026
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thetechnick
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (2)
hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go (1)

420-472: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider adding a case for the ValidCredentials == true early-return path.

The test only exercises the invalid-credentials cleanup path. Adding a case where hc has valid WIF/credentials conditions set (so ValidCredentials returns true) would confirm the early-return nil and that finalizers are left untouched, closing an easy-to-miss regression gap.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go`
around lines 420 - 472, Add coverage for the ValidCredentials early-return in
DeleteOrphanedMachines by extending TestDeleteOrphanedMachines with a
HostedCluster state where WIF/credentials are valid and ValidCredentials returns
true. Use the existing platform.DeleteOrphanedMachines and validHostedCluster
helpers to set up that case, then assert the call returns nil and that
GCPMachine finalizers remain unchanged for both deleted and non-deleted objects.
hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go (1)

547-561: 🧹 Nitpick | 🔵 Trivial

All finalizers are wiped indiscriminately, not just WIF/credential-related ones.

Any finalizer present on a terminating GCPMachine is cleared, including ones unrelated to WIF credential validity (e.g. finalizers owned by other controllers). Since the underlying GCP compute resources can't be cleaned up while credentials are invalid, this can leak actual cloud resources (VMs/disks) that CAPG never got to delete, and also removes any other controller's cleanup guarantees on this object. This may be an accepted tradeoff given the goal of unblocking stuck teardown, but worth calling out for operational awareness (e.g. monitoring/alerting on leaked GCP resources after this path fires).
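
A narrower alternative would be to drop only the one finalizer the cleanup is responsible for and keep the rest. The sketch below shows that idea on a plain string slice; `capgFinalizer` and `dropFinalizer` are hypothetical names chosen for illustration, and the real finalizer string would have to come from the provider. In a controller-runtime code base, `controllerutil.RemoveFinalizer` offers the same behavior on objects.

```go
package main

import (
	"fmt"
	"slices"
)

// capgFinalizer is a hypothetical finalizer name used only for illustration.
const capgFinalizer = "gcpmachine.example.com/finalizer"

// dropFinalizer returns a copy of finalizers without target, leaving every
// other entry in place and in its original order.
func dropFinalizer(finalizers []string, target string) []string {
	return slices.DeleteFunc(slices.Clone(finalizers), func(f string) bool {
		return f == target
	})
}

func main() {
	before := []string{capgFinalizer, "other-controller.example.com/cleanup"}
	after := dropFinalizer(before, capgFinalizer)
	fmt.Println(after)
}
```

Whether the narrower removal is enough to unblock teardown depends on which controllers own the remaining finalizers, so the broad wipe may still be the intended tradeoff.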

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go`
around lines 547 - 561, The finalizer-clearing path in gcp.go currently removes
every finalizer from terminating GCPMachine objects, not just the
credential/WIF-related ones. Update the cleanup logic around the GCPMachine loop
to either preserve unrelated finalizers or explicitly document and surface the
broad wipe as an intentional tradeoff; use the gcpMachine.Finalizers assignment
and the c.Update call as the key spots to adjust, and add a clear warning in the
logger.Info/error path so operators can detect possible leaked resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go`:
- Line 545: The log message around the post-cleanup path in gcp.go is stale
copy-pasted wording: it says “skipping cleanup” and mentions AWS even though the
cleanup already happened in the GCP machine flow. Update the message emitted
near the logger := ctrl.LoggerFrom(ctx) path and the surrounding
finalizer/machine update logic to describe the actual completed cleanup, use the
GCP platform name, and ensure any related log strings in the same block
(including the later lines referenced in the comment) are consistent with the
successful cleanup action.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b896aebb-48e0-4882-b50a-0b4e227ce71d

📥 Commits

Reviewing files that changed from the base of the PR and between d6ec188 and 7c5f487.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/gcp/gcp_test.go

@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 57.14286% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.28%. Comparing base (df4e94a) to head (5c40158).
⚠️ Report is 15 commits behind head on main.

Files with missing lines | Patch % | Lines
...rollers/hostedcluster/internal/platform/gcp/gcp.go | 57.14% | 6 Missing and 3 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8884   +/-   ##
=======================================
  Coverage   43.28%   43.28%           
=======================================
  Files         771      771           
  Lines       95503    95527   +24     
=======================================
+ Hits        41335    41347   +12     
- Misses      51284    51293    +9     
- Partials     2884     2887    +3     
Files with missing lines | Coverage Δ
...rollers/hostedcluster/internal/platform/gcp/gcp.go | 82.12% <57.14%> (-1.56%) ⬇️

... and 1 file with indirect coverage changes

Flag | Coverage Δ
cmd-support | 36.67% <ø> (ø)
cpo-hostedcontrolplane | 45.31% <ø> (ø)
cpo-other | 45.10% <ø> (ø)
hypershift-operator | 53.58% <57.14%> (-0.01%) ⬇️
other | 31.69% <ø> (ø)


@thetechnick thetechnick force-pushed the gcp-503-machine-orphan-delete branch 2 times, most recently from ae6a05d to c6bbad4 Compare July 1, 2026 09:51
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jul 1, 2026


The commit message in the repository is GCP-503: Implement OrphanDeleter for GCP — it does NOT start with a conventional commit prefix. The PR API shows feat: [GCP-503](https://redhat.atlassian.net/browse/GCP-503) Implement OrphanDeleter for GCP as the headline, which appears to be a squash title set differently, but the actual commit ae6a05df that gitlint checked has the non-conforming title.

Test Failure Analysis Complete

Job Information

  • Prow Job: gitlint / Gitlint
  • Build ID: GitHub Actions run 28508774540 / job 84503657290
  • PR: #8884 GCP-503: Implement OrphanDeleter for GCP
  • Commit: ae6a05df41dc01060dc0b9e1099e7fb096cd8edc

Test Failure Analysis

Error

1: CT1 Title does not start with one of fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build: "GCP-503: Implement OrphanDeleter for GCP"
make: *** [Makefile:624: run-gitlint] Error 1

Summary

The gitlint check failed because the commit message title "GCP-503: Implement OrphanDeleter for GCP" does not follow the Conventional Commits format required by the repository. The title must start with one of the allowed type prefixes (fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build) followed by a colon and space. The JIRA ticket ID GCP-503 is not a valid conventional commit type.

Root Cause

The commit ae6a05df has the message title "GCP-503: Implement OrphanDeleter for GCP", which starts with a JIRA ticket ID (GCP-503) instead of a conventional commit type prefix. The repository's .gitlint configuration enforces the contrib-title-conventional-commits rule, which requires commit titles to match the pattern <type>: <description> where <type> is one of: fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build.

The gitlint workflow (gitlint-reusable.yaml) runs make run-gitlint and lints all commits in the range PULL_BASE_SHA..PULL_PULL_SHA. The commit's title uses GCP-503: as a prefix, which gitlint's CT1 rule does not recognize as a valid conventional commit type.

The fix is to amend the commit message to use a valid conventional commit prefix, e.g.:

  • feat: GCP-503 Implement OrphanDeleter for GCP
  • feat(gcp): implement OrphanDeleter for GCP

Recommendations

  1. Amend the commit message to use a conventional commit prefix. The most appropriate type for this change is feat:

    feat: GCP-503 Implement OrphanDeleter for GCP

    or, following the convention of putting the JIRA ID in the scope or body:

    feat(gcp): implement OrphanDeleter for GCP

  2. Force-push the amended commit to the PR branch to trigger a re-run of the gitlint check.

  3. For future commits, always prefix the title with a valid conventional commit type. The JIRA ticket ID can be placed after the type prefix, in the scope, or in the commit body.
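
To check a title locally before pushing, the rule quoted in the error can be approximated with a regular expression. This is a sketch of the idea only, not gitlint's implementation: `titlePattern` and `validTitle` are hypothetical names, and the optional `(scope)` part is an assumption based on the `feat(gcp): ...` form suggested above.

```go
package main

import (
	"fmt"
	"regexp"
)

// titlePattern approximates the rule from the error message: an allowed
// type, an optional "(scope)", then ": " and a non-empty description.
var titlePattern = regexp.MustCompile(
	`^(fix|feat|chore|docs|style|refactor|perf|test|revert|ci|build)(\([^)]+\))?: .+`)

// validTitle reports whether a commit title matches the approximated rule.
func validTitle(title string) bool {
	return titlePattern.MatchString(title)
}

func main() {
	titles := []string{
		"GCP-503: Implement OrphanDeleter for GCP",
		"feat: GCP-503: Implement OrphanDeleter for GCP",
		"feat(gcp): implement OrphanDeleter for GCP",
	}
	for _, t := range titles {
		fmt.Printf("%-5t %s\n", validTitle(t), t)
	}
}
```

The first title fails because `GCP-503` is not in the list of allowed types; the other two pass because the ticket ID only appears after a valid prefix.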

Evidence
Evidence | Detail
Failing commit | ae6a05df41dc01060dc0b9e1099e7fb096cd8edc
Commit title | GCP-503: Implement OrphanDeleter for GCP
Gitlint rule violated | CT1 (contrib-title-conventional-commits)
Allowed types | fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build
Config file | .gitlint: contrib=contrib-title-conventional-commits
Workflow file | .github/workflows/gitlint-reusable.yaml (runs make run-gitlint)
Error line in logs | 1: CT1 Title does not start with one of fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build: "GCP-503: Implement OrphanDeleter for GCP"

@thetechnick thetechnick changed the title GCP-503: Implement OrphanDeleter for GCP feat: GCP-503: Implement OrphanDeleter for GCP Jul 1, 2026
@thetechnick thetechnick force-pushed the gcp-503-machine-orphan-delete branch 2 times, most recently from 6600f3a to 8700832 Compare July 3, 2026 07:23
@thetechnick thetechnick changed the title feat: GCP-503: Implement OrphanDeleter for GCP feat(gcp): GCP-503: Implement OrphanDeleter Jul 3, 2026
@thetechnick thetechnick marked this pull request as ready for review July 3, 2026 07:24
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2026
@openshift-ci openshift-ci Bot requested review from jimdaga and sdminonne July 3, 2026 07:24
errs = append(errs, fmt.Errorf("failed to delete machine %s/%s: %w", gcpMachine.Namespace, gcpMachine.Name, err))
continue
}
logger.Info("removed finalizers of gcpmachine because of invalid AWS identity provider", "machine", client.ObjectKeyFromObject(gcpMachine))

Contributor

This should probably read "invalid GCP Credentials" instead.

Author

Fixed! Sorry for the c/p error, should have seen that myself.

Removes finalizers from orphaned GCPMachines when WIF credentials become
invalid, to prevent cluster teardown from getting stuck.

The implementation is similar to the existing AWS implementation.
@thetechnick thetechnick force-pushed the gcp-503-machine-orphan-delete branch from 8700832 to 5c40158 Compare July 3, 2026 13:55
@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Contributor

@thetechnick: all tests passed!


@cblecker cblecker left a comment

Member

Overall this is a clean implementation that correctly follows the established OrphanDeleter pattern from AWS. The extra len(Finalizers) == 0 guard, index-based iteration, and error aggregation are all good.

One blocking concern: the credential condition staleness during deletion (see inline comment on gcp.go). The remaining comments are non-blocking suggestions.

Since this PR adds GCP as an OrphanDeleter implementer, consider adding compile-time interface satisfaction checks in platform.go alongside the existing Platform checks:

var _ OrphanDeleter = aws.AWS{}
var _ OrphanDeleter = gcp.GCP{}

This way if the method signature drifts, the build breaks instead of the runtime type assertion silently returning false.
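
The difference between the two checks can be seen in a small standalone sketch. The types here are simplified, hypothetical stand-ins (the real `OrphanDeleter` signature is not shown in this conversation), so only the mechanism carries over.

```go
package main

import "fmt"

// OrphanDeleter is a simplified, hypothetical version of the interface;
// the real method signature may differ.
type OrphanDeleter interface {
	DeleteOrphanedMachines() error
}

// GCP implements OrphanDeleter.
type GCP struct{}

func (GCP) DeleteOrphanedMachines() error { return nil }

// NoOp is a platform that does not implement OrphanDeleter.
type NoOp struct{}

// Compile-time assertion: if GCP stops satisfying OrphanDeleter (for
// example after a signature change), this line fails to compile.
var _ OrphanDeleter = GCP{}

// implementsOrphanDeleter mirrors a runtime type assertion, which returns
// false without any build error when the interface is not satisfied.
func implementsOrphanDeleter(p any) bool {
	_, ok := p.(OrphanDeleter)
	return ok
}

func main() {
	fmt.Println(implementsOrphanDeleter(GCP{}), implementsOrphanDeleter(NoOp{}))
}
```

Adding `var _ OrphanDeleter = NoOp{}` to this file would stop it from compiling, which is exactly the early signal the suggestion is after.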

@@ -527,6 +529,34 @@ func (p GCP) validateWorkloadIdentityConfiguration(hcluster *hyperv1.HostedClust
return nil
}

Member

DeleteOrphanedMachines relies on ValidCredentials(hc), but ValidGCPWorkloadIdentity and ValidGCPCredentials are only set during Phase 6a (ReconcileCredentials) of normal reconciliation — a path that's never reached during deletion.

AWS handles this by refreshing ValidAWSIdentityProvider in Phase 1 (hostedcluster_controller.go:423-452) before the deletion branch, with the comment: "We set this condition even if the HC is being deleted." GCP has no equivalent, so this function is making its decision based on condition data that could be stale from a transient error, or never set at all.

Two scenarios this creates:

  1. Transient API server error sets ValidGCPCredentials to False during normal reconciliation → deletion starts → condition frozen → finalizers stripped unnecessarily
  2. Cluster deleted before conditions ever set → ValidCredentials returns false (nil conditions) → finalizers stripped

Both are safe in the AWS path because the condition is refreshed before delete() runs.

Suggestion: mirror the AWS pattern. validateWorkloadIdentityConfiguration is a pure spec check (no network calls) — extract it into a standalone method, call it from Phase 1 to refresh the condition before the deletion branch, and tighten the guard here to require the condition to be explicitly False rather than just absent.
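
The tightened guard could look roughly like the sketch below, which distinguishes an explicitly False condition from one that is missing or Unknown. `Condition`, `explicitlyFalse` and `shouldStripFinalizers` are hypothetical stand-ins, and treating either condition being False as sufficient is an assumption, not something stated in this review.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// explicitlyFalse reports whether a condition of the given type exists and
// has status "False". A missing or "Unknown" condition yields false.
func explicitlyFalse(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "False"
		}
	}
	return false
}

// shouldStripFinalizers only allows cleanup when a credential condition
// was explicitly evaluated as False (assumption: either one suffices).
func shouldStripFinalizers(conds []Condition) bool {
	return explicitlyFalse(conds, "ValidGCPWorkloadIdentity") ||
		explicitlyFalse(conds, "ValidGCPCredentials")
}

func main() {
	fmt.Println(shouldStripFinalizers(nil)) // no conditions set yet
	fmt.Println(shouldStripFinalizers([]Condition{{Type: "ValidGCPCredentials", Status: "False"}}))
}
```

With this shape, a cluster deleted before any condition was recorded no longer qualifies for finalizer removal; the stale-condition case still needs the refresh before the deletion branch.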

@@ -416,3 +416,57 @@ func TestReconcileGCPClusterPreservesServerDefaultedFields(t *testing.T) {
g.Expect(gcpCluster.Spec.Network.Subnets[0].Name).To(Equal("test-subnet"))
g.Expect(gcpCluster.Spec.Network.Subnets[0].Region).To(Equal("us-central1"))
}

Member

The test only exercises the invalid-credentials path (no status conditions on the HostedCluster). Consider adding a subtest where both ValidGCPWorkloadIdentity and ValidGCPCredentials are set to True, with GCPMachines that have DeletionTimestamp and Finalizers, and assert the finalizers remain unchanged. This protects the ValidCredentials guard — if it were accidentally inverted, the current test would still pass.
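
A table-driven shape makes it cheap to cover both sides of the guard. The sketch below uses plain Go instead of the repository's test helpers: `machine`, `cleanup` and `checkGuard` are hypothetical stand-ins, and in the actual test file each case would more likely be a `t.Run` subtest against the fake client.

```go
package main

import "fmt"

// machine is a hypothetical stand-in for a GCPMachine.
type machine struct {
	deleting   bool
	finalizers []string
}

// cleanup is a hypothetical stand-in for DeleteOrphanedMachines: it strips
// finalizers from terminating machines only when credentials are invalid.
func cleanup(credentialsValid bool, ms []machine) {
	if credentialsValid {
		return
	}
	for i := range ms {
		if ms[i].deleting {
			ms[i].finalizers = nil
		}
	}
}

// checkGuard runs both guard cases and returns the first mismatch, if any.
func checkGuard() error {
	cases := []struct {
		name           string
		valid          bool
		wantFinalizers int
	}{
		{"invalid credentials strip finalizers", false, 0},
		{"valid credentials leave finalizers", true, 1},
	}
	for _, tc := range cases {
		ms := []machine{{deleting: true, finalizers: []string{"example.com/finalizer"}}}
		cleanup(tc.valid, ms)
		if got := len(ms[0].finalizers); got != tc.wantFinalizers {
			return fmt.Errorf("%s: got %d finalizers, want %d", tc.name, got, tc.wantFinalizers)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkGuard())
}
```

If the guard in `cleanup` were inverted, the second case would fail, which is the regression the single invalid-credentials test cannot catch.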

len(gcpMachine.Finalizers) == 0 {
continue
}

Member

Nit: the error says "failed to delete machine" but the operation is a client.Update to clear finalizers. Something like "failed to remove finalizers from GCPMachine %s/%s" would be more precise for operators debugging stuck teardowns. (AWS has the same wording — could be a follow-up for both.)

Labels

area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/gcp PR/issue for GCP (GCPPlatform) platform jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants