CNTRLPLANE-507: Add HCP finalizer to AWSEndpointService reconciler #8499

Open
hypershift-jira-solve-ci[bot] wants to merge 1 commit into openshift:main from hypershift-community:fix-CNTRLPLANE-507

Conversation

hypershift-jira-solve-ci Bot commented May 13, 2026

What this PR does / why we need it:

Adds a finalizer on the HostedControlPlane resource from the AWSEndpointService reconciler to prevent HCP deletion before AWS PrivateLink resources are cleaned up.

Problem: When the CPO restarts during deletion of a SharedVPC cluster, the clientBuilder is uninitialized and the HCP (with its cross-account role ARNs) may already be deleted. This causes the reconciler to fail creating AWS clients, and after a 10-minute grace period the hypershift-operator force-removes the CPO finalizer — orphaning VPC endpoints, security groups, and DNS records in the shared VPC account.

Solution: The new HCP finalizer (hypershift.openshift.io/aws-private-link-endpoint-cleanup) follows the same pattern used by the Azure PLS controller:

  • Adds the finalizer to the HCP during normal reconciliation
  • When HCP deletion is detected, initializes AWS clients from the still-available HCP
  • Cleans up each AWSEndpointService's AWS resources and removes CR finalizers
  • Removes the HCP finalizer only after all AWSEndpointService CRs are cleaned up
  • Extends the HCP watch handler (enqueueOnHCPChange) to also trigger reconciliation when an HCP is being deleted with the finalizer present
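The ordering described in the list above can be sketched with plain Go types. This is a simplified model under stated assumptions, not the controller's code: the `hcp` and `endpointService` structs and the `reconcile` function are hypothetical stand-ins, and the real AWS cleanup is reduced to setting a flag.

```go
package main

import "fmt"

// finalizerName is the HCP finalizer named in the PR description.
const finalizerName = "hypershift.openshift.io/aws-private-link-endpoint-cleanup"

// hcp and endpointService are hypothetical stand-ins for the real
// HostedControlPlane and AWSEndpointService objects.
type hcp struct {
	deleting   bool
	finalizers map[string]bool
}

type endpointService struct {
	name      string
	cleanedUp bool // true once AWS resources and the CR finalizer are gone
}

// reconcile models one reconcile pass for a single endpoint service.
func reconcile(h *hcp, all []*endpointService, self *endpointService) string {
	if !h.deleting {
		// Normal path: make sure the HCP carries the finalizer.
		h.finalizers[finalizerName] = true
		return "finalizer ensured"
	}
	// Deletion path: clean up this CR while the HCP is still readable.
	self.cleanedUp = true
	for _, es := range all {
		if !es.cleanedUp {
			return "waiting for " + es.name
		}
	}
	// Every CR is done, so the HCP may now be released.
	delete(h.finalizers, finalizerName)
	return "finalizer removed"
}

func main() {
	h := &hcp{finalizers: map[string]bool{}}
	a, b := &endpointService{name: "a"}, &endpointService{name: "b"}
	all := []*endpointService{a, b}

	fmt.Println(reconcile(h, all, a)) // finalizer ensured
	h.deleting = true
	fmt.Println(reconcile(h, all, a)) // waiting for b
	fmt.Println(reconcile(h, all, b)) // finalizer removed
}
```

The point of the sketch is the invariant: the HCP finalizer is released only by the pass that observes no remaining endpoint services.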

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-507

Special notes for your reviewer:

  • This follows the same finalizer pattern already established by the Azure PLS controller
  • The enqueueOnHCPChange handler (renamed from enqueueOnAccessChange) now triggers on both EndpointAccess changes and HCP deletions with the finalizer
  • AWS client initialization during HCP deletion reuses the existing getAWSClient helper, sourcing credentials from the still-available HCP spec

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Always review AI generated responses prior to use.
Generated with Claude Code via /jira:solve [CNTRLPLANE-507](https://redhat.atlassian.net/browse/CNTRLPLANE-507)


Note: This PR was auto-generated by the jira-agent periodic CI job in response to CNTRLPLANE-507. See the full report for token usage, cost breakdown, and detailed phase output.

Summary by CodeRabbit

  • Bug Fixes

    • Improved AWS PrivateLink deletion cleanup by coordinating finalizers between HostedControlPlane (HCP) and related endpoint service CRs.
    • Ensured HCP deletion reconciliation runs reliably across controller restarts, without racing the endpoint-service deletion path.
    • Added safer requeue behavior on Kubernetes conflicts and dependency-violation scenarios to avoid premature finalizer removal.
  • Tests

    • Added unit tests validating HCP finalizer management, deletion cleanup coordination, reconciliation request mapping/enqueue logic, and error/requeue handling for Kubernetes and AWS failures.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot

openshift-ci-robot commented May 13, 2026

@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-507 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) May 13, 2026
openshift-ci Bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and do-not-merge/needs-area labels May 13, 2026
@coderabbitai

coderabbitai Bot commented May 13, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This change adds an HCP-scoped AWS PrivateLink finalizer, updates HostedControlPlane event handling to enqueue AWSEndpointService reconciliations, and splits reconciliation into normal and HCP-deletion paths. The deletion path initializes AWS clients from the HCP, cleans up AWS resources, removes the AWSEndpointService finalizer, and clears the HCP finalizer after dependent CRs are done. Tests cover finalizer patching, deletion handling, client errors, and mapping behavior.

Possibly related PRs

  • openshift/hypershift#7868: Also changes awsprivatelink_controller.go deletion reconciliation and AWS cleanup/retry behavior around DependencyViolation.

Suggested reviewers

  • devguyio
  • enxebre
  • muraee

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name | Status | Explanation | Resolution
Container-Privileges | ❌ Error | New YAML manifests add hostPID/hostNetwork/privileged and allowPrivilegeEscalation=true in kubelet-config, kubevirt CSI, and e2e pod files. | Remove or justify the privileged settings; use least-privilege securityContext where possible, or isolate these manifests with explicit exemption.

✅ Passed checks (10 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly matches the main change: adding an HCP finalizer to the AWSEndpointService reconciler.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | Added test titles are static and descriptive; no new titles embed generated names, timestamps, UUIDs, or other run-to-run values.
Test Structure And Quality | ✅ Passed | Tests are table-driven fake-client unit tests, with no Ginkgo waits or cluster resources; they follow the repo’s existing testing style.
Topology-Aware Scheduling Compatibility | ✅ Passed | Only AWSEndpointService reconciliation/finalizer logic changed; no node selectors, affinity, spread constraints, replicas, or manifests were added.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Only Go unit tests were added; no new Ginkgo e2e tests or external-network dependencies were introduced.
No-Weak-Crypto | ✅ Passed | No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret/token comparisons appear in the changed controller/test code.
No-Sensitive-Data-In-Logs | ✅ Passed | The added HCP-finalizer logs are generic status messages; I found no passwords, tokens, PII, or other clearly sensitive data in the new logging.

@openshift-ci

openshift-ci Bot commented May 13, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci Bot added the area/control-plane-operator (indicates the PR includes changes for the control plane operator in an OCP release) and area/platform/aws (PR/issue for AWS (AWSPlatform) platform) labels and removed the do-not-merge/needs-area label May 13, 2026
@codecov

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 73.54839% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.10%. Comparing base (bc3bda9) to head (7355f0e).
⚠️ Report is 135 commits behind head on main.

Files with missing lines | Patch % | Lines
...ollers/awsprivatelink/awsprivatelink_controller.go | 73.54% | 39 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8499      +/-   ##
==========================================
+ Coverage   42.55%   43.10%   +0.54%     
==========================================
  Files         768      773       +5     
  Lines       95297    96302    +1005     
==========================================
+ Hits        40558    41507     +949     
+ Misses      51932    51911      -21     
- Partials     2807     2884      +77     
Files with missing lines | Coverage Δ
...ollers/awsprivatelink/awsprivatelink_controller.go | 47.25% <73.54%> (+5.60%) ⬆️

... and 52 files with indirect coverage changes

Flag | Coverage Δ
cmd-support | 36.08% <ø> (+0.62%) ⬆️
cpo-hostedcontrolplane | 45.31% <ø> (+0.47%) ⬆️
cpo-other | 45.54% <73.54%> (+0.83%) ⬆️
hypershift-operator | 53.58% <ø> (+0.53%) ⬆️
other | 31.68% <ø> (-0.01%) ⬇️

@hypershift-jira-solve-ci

Author

Re: #8499 (comment)

Thanks for the walkthrough summary — the description and sequence diagram accurately capture the changes.

Regarding the docstring coverage warning: noted. This is a new controller flow with several helper functions; docstrings can be added in a follow-up if the team considers it valuable for this area.


AI-assisted response via Claude Code

@bryan-cox bryan-cox left a comment

Member

Staff Engineer Review: Add HCP finalizer to AWSEndpointService reconciler

The core idea of this PR is correct and addresses a real operational problem — preventing AWS PrivateLink resource orphaning when the HCP is deleted before the AWSEndpointService cleanup runs. However, the implementation diverges from the Azure PLS pattern in ways that introduce correctness risks.

Blocking Issues (3)

See inline comments for details on:

  1. Dual deletion paths compete — existing CR deletion path and new HCP deletion path both remove the CR finalizer
  2. Multi-CR coordination under concurrency — convergent but produces unnecessary work with MaxConcurrentReconciles: 10
  3. UpdateFunc misses HCP deletions on controller restart — defeats the purpose of the PR

Open Questions (2)

  • Does the hypershift-operator's force-finalizer-removal logic (10-minute grace) know about this new aws-private-link-endpoint-cleanup finalizer? If not, the HCP could get stuck indefinitely.
  • The finalizer is added for ALL AWS PrivateLink clusters, not just SharedVPC. Is the broader scope intentional?

Praise

Test coverage is excellent: 784 lines of well-structured table-driven tests with gomock and client interceptors covering all new paths. Replacing context.Background() with ctx in the handler is a good improvement.

      MaxConcurrentReconciles: 10,
  }).
- Watches(&hyperv1.HostedControlPlane{}, handler.Funcs{UpdateFunc: r.enqueueOnAccessChange(mgr)}).
+ Watches(&hyperv1.HostedControlPlane{}, handler.Funcs{UpdateFunc: r.enqueueOnHCPChange(mgr)}).

Member

[blocking] UpdateFunc misses HCP deletions on controller restart

Using handler.Funcs{UpdateFunc: ...} means only Update events trigger this handler. If the CPO restarts while an HCP is being deleted (DeletionTimestamp already set), the informer cache sync generates a Create event — not an Update — so this handler never fires.

The Azure PLS controller avoids this by using handler.EnqueueRequestsFromMapFunc(...), which receives all event types (Create, Update, Delete) from the informer. On restart, it gets a Create event for the HCP with DeletionTimestamp set and correctly enqueues the CRs.

With the current approach, if the CPO restarts mid-HCP-deletion, the new handler will NOT fire. The reconciler would fall through to the existing AWSEndpointService CR deletion path — exactly the scenario this PR is trying to fix.

Recommendation: Switch to handler.EnqueueRequestsFromMapFunc(...) to match the Azure PLS pattern.
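The difference between the two handler styles can be illustrated with a small stdlib-only model. Everything here is a hypothetical stand-in (`hcpEvent`, `updateOnlyHandler`, `mapFuncHandler`); it does not call controller-runtime and only mimics which event kinds each style reacts to. The finalizer check in the map function is an assumption about how such a function might filter.

```go
package main

import "fmt"

// eventType mirrors the three informer event kinds named in the review.
type eventType int

const (
	createEvent eventType = iota
	updateEvent
	deleteEvent
)

// hcpEvent is a hypothetical, simplified event that carries only the
// fields the two handlers inspect.
type hcpEvent struct {
	kind         eventType
	deleting     bool // DeletionTimestamp is set on the HCP
	hasFinalizer bool // the PrivateLink cleanup finalizer is present
}

// updateOnlyHandler models handler.Funcs{UpdateFunc: ...}: every event
// kind other than Update is ignored. It returns true when it would enqueue.
func updateOnlyHandler(e hcpEvent) bool {
	if e.kind != updateEvent {
		return false
	}
	return e.deleting && e.hasFinalizer
}

// mapFuncHandler models a map function that runs for every event kind
// and filters only on finalizer presence.
func mapFuncHandler(e hcpEvent) bool {
	return e.hasFinalizer
}

func main() {
	// After a restart, the cache sync delivers a Create event for an HCP
	// that is already being deleted.
	restart := hcpEvent{kind: createEvent, deleting: true, hasFinalizer: true}

	fmt.Println(updateOnlyHandler(restart)) // false
	fmt.Println(mapFuncHandler(restart))    // true
}
```

In this model the update-only handler never sees the post-restart event, which is the gap the recommendation is meant to close.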

Author

Done. Switched from handler.Funcs{UpdateFunc: ...} to handler.EnqueueRequestsFromMapFunc(...) matching the Azure PLS pattern. The new mapHCPToAWSEndpointService() function receives all event types (Create, Update, Delete), so on controller restart the informer cache sync's Create event now correctly triggers cleanup for an HCP with DeletionTimestamp already set.

The EndpointAccess change detection (previously via old/new comparison in UpdateFunc) is dropped from the handler — those changes are picked up by the reconciler's existing 5-minute periodic requeue, which is acceptable since EndpointAccess changes are rare operational events.


AI-assisted response via Claude Code

Author

Done. Switched from handler.Funcs{UpdateFunc: ...} to handler.EnqueueRequestsFromMapFunc(r.mapHCPToAWSEndpointService()) matching the Azure PLS pattern exactly. The new mapHCPToAWSEndpointService() MapFunc receives all event types (Create, Update, Delete), so on controller restart the informer cache sync's Create event correctly triggers cleanup for an HCP with DeletionTimestamp already set.

The MapFunc filters by finalizer presence (controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName)) to avoid unnecessary reconciliations, matching the Azure PLS approach. EndpointAccess change detection is dropped from the handler — those changes are picked up by the reconciler's existing 5-minute periodic requeue.

Tests updated: replaced TestEnqueueOnHCPChange (which tested the old UpdateFunc) with TestMapHCPToAWSEndpointService (which tests the new MapFunc directly).


AI-assisted response via Claude Code

// Handle HCP deletion: clean up AWS resources while HCP credentials are still valid.
if !hcp.DeletionTimestamp.IsZero() {
return r.reconcileHCPDeletion(ctx, awsEndpointService, hcp, log)
}

Member

[blocking] Dual deletion paths can compete

The existing AWSEndpointService CR deletion path (lines 466-486 in the diff) runs when the CR itself has a DeletionTimestamp and also removes the CR finalizer + calls r.delete(). This new HCP deletion path at line 534 also removes the CR finalizer + calls r.delete().

These two paths can activate simultaneously during namespace deletion or HCP ownership-based cascading. Consider:

  1. HCP deletion triggers enqueueOnHCPChange, enqueuing all CRs
  2. Namespace/owner cascade sets DeletionTimestamp on the CRs themselves
  3. A reconcile fires for a CR that has BOTH its own DeletionTimestamp AND the HCP is being deleted
  4. The CR enters the existing deletion path (step 1), which initializes from HCP and cleans up
  5. Another reconcile enters this HCP deletion path

The existing CR deletion path (line 466) does return early before reaching this check, so they are technically exclusive within a single reconcile call. But with MaxConcurrentReconciles: 10, two concurrent reconciles for the same CR could race.

Suggestion: Add an explicit guard here: if !awsEndpointService.DeletionTimestamp.IsZero() { return ctrl.Result{}, nil } to make the exclusion explicit and defend against concurrent reconciles.

Author

Done. Added explicit guard at the top of reconcileHCPDeletion:

if !awsEndpointService.DeletionTimestamp.IsZero() {
    return ctrl.Result{}, nil
}

This makes the exclusion between the two deletion paths explicit and defends against concurrent reconciles under MaxConcurrentReconciles: 10. If the CR itself is being deleted, we defer to the existing CR deletion path at the top of Reconcile.
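How the guard interacts with the branch order in Reconcile can be sketched as follows. This is a minimal model, assuming the branch order described in the thread; both functions are hypothetical and return a label instead of doing any work.

```go
package main

import "fmt"

// reconcileHCPDeletion models the new HCP deletion path with its guard:
// if the CR itself is being deleted, it does nothing and defers to the
// CR deletion path.
func reconcileHCPDeletion(crDeleting bool) string {
	if crDeleting {
		return "deferred to CR deletion path"
	}
	return "hcp-deletion cleanup"
}

// reconcile models the branch order: the CR deletion path is checked
// first and returns early, then the HCP deletion path, then the normal path.
func reconcile(crDeleting, hcpDeleting bool) string {
	if crDeleting {
		return "cr-deletion cleanup"
	}
	if hcpDeleting {
		return reconcileHCPDeletion(crDeleting)
	}
	return "normal reconcile"
}

func main() {
	fmt.Println(reconcile(true, true))      // cr-deletion cleanup
	fmt.Println(reconcile(false, true))     // hcp-deletion cleanup
	fmt.Println(reconcileHCPDeletion(true)) // deferred to CR deletion path
}
```

Within a single call the early return already keeps the paths apart; the guard covers the case where reconcileHCPDeletion is reached with a CR that is itself being deleted.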


AI-assisted response via Claude Code

// our finalizer blocks HCP deletion.
r.awsClientBuilder.initializeWithHCP(log, hcp)
ec2Client, route53Client, err := r.awsClientBuilder.getClients(ctx)
if err != nil {

Member

[blocking] Multi-CR coordination needs documentation or simplification

With MaxConcurrentReconciles: 10 and enqueueOnHCPChange enqueuing ALL CRs, multiple reconcilers race through reconcileHCPDeletion concurrently. Each one cleans up its own CR, then checks if all others are done. The last one to finish removes the HCP finalizer, while earlier finishers return RequeueAfter: 5s and re-enter this path only to find the HCP finalizer already removed.

This convergent pattern is functionally correct, but:

  1. It produces unnecessary requeues and reconcile loops
  2. It is not documented, making it hard for future maintainers to reason about
  3. The Azure PLS controller avoids this entirely because it has MaxConcurrentReconciles: 1 and only one CR per namespace

Suggestion: At minimum, add a comment explaining the convergent behavior. Alternatively, consider having only the CR whose cleanup triggers len(pendingCRs) == 0 remove the HCP finalizer, and have all others simply return ctrl.Result{} after their own cleanup.
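The convergent behavior described in this comment can be modeled with goroutines and an in-memory state. This is a sketch under stated assumptions: the `cluster` type and `reconcileDeletion` method are hypothetical, a mutex stands in for API-server optimistic locking, and the requeue delay is replaced by an immediate retry.

```go
package main

import (
	"fmt"
	"sync"
)

// cluster is a hypothetical in-memory stand-in for the API server state.
type cluster struct {
	mu              sync.Mutex
	pending         map[string]bool // CRs that still hold their finalizer
	hcpHasFinalizer bool
	removals        int // how many times the HCP finalizer was removed
}

// reconcileDeletion cleans up one CR and reports whether a requeue is needed.
func (c *cluster) reconcileDeletion(name string) (requeue bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.hcpHasFinalizer {
		return false // another reconcile already finished the job
	}
	delete(c.pending, name) // this CR's AWS resources and finalizer are gone
	if len(c.pending) > 0 {
		return true // other CRs are still pending; check again later
	}
	c.hcpHasFinalizer = false
	c.removals++
	return false
}

func main() {
	c := &cluster{
		pending:         map[string]bool{"a": true, "b": true, "c": true},
		hcpHasFinalizer: true,
	}
	var wg sync.WaitGroup
	for _, name := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			for c.reconcileDeletion(n) {
				// A real controller would wait for RequeueAfter here.
			}
		}(name)
	}
	wg.Wait()
	fmt.Println(c.removals, c.hcpHasFinalizer, len(c.pending)) // 1 false 0
}
```

Whichever worker finishes last removes the HCP finalizer exactly once; the others retry, see it gone, and stop, which matches the "unnecessary but harmless requeues" the comment describes.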

Author

Done. Added comprehensive documentation on the reconcileHCPDeletion function explaining the convergent multi-CR coordination pattern:

  • Multiple reconcilers run concurrently (one per AWSEndpointService CR)
  • Each cleans up its own CR, then checks if all CRs are done
  • Only the last reconciler to finish removes the HCP finalizer
  • Earlier finishers see pending CRs, return RequeueAfter, and on re-entry find the finalizer already removed

The comment explains this produces a small number of no-op requeues but is correct and self-healing. This is functionally similar to how the Azure PLS controller works, but documented explicitly because the AWS controller has MaxConcurrentReconciles: 10 and multiple CRs per namespace (unlike Azure PLS's MaxConcurrentReconciles: 1 with one CR per namespace).


AI-assisted response via Claude Code

Author

Done. Added comprehensive documentation on reconcileHCPDeletion explaining the convergent multi-CR coordination pattern:

  • Multiple reconcilers run concurrently (one per AWSEndpointService CR) under MaxConcurrentReconciles: 10
  • Each cleans up its own CR's AWS resources, removes the CR finalizer, then checks if all CRs are done
  • The last reconciler to finish (seeing len(pendingCRs) == 0) removes the HCP finalizer
  • Earlier finishers see pending CRs, return RequeueAfter, and on re-entry find the HCP finalizer already removed

The comment explicitly contrasts this with the Azure PLS controller (MaxConcurrentReconciles: 1, one CR per namespace) to explain why this convergent pattern is necessary for the AWS controller.


AI-assisted response via Claude Code

controllerutil.AddFinalizer(hcp, hcpAWSPrivateLinkFinalizerName)
if err := r.Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})); err != nil {
if apierrors.IsConflict(err) {
return ctrl.Result{Requeue: true}, nil

Member

[suggestion] Use RequeueAfter: time.Second instead of Requeue: true on conflicts

The Azure PLS equivalent returns ctrl.Result{RequeueAfter: time.Second} on conflict (see controller.go line 371). Using Requeue: true risks a tight retry loop under contention when multiple AWSEndpointService reconcilers are concurrently trying to patch the same HCP.

Same applies to the conflict handling in ensureHCPFinalizer (line 558).
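A delayed retry on conflict can be sketched without controller-runtime as follows. The `result` struct is a hypothetical stand-in for ctrl.Result, `errConflict` stands in for an optimistic-lock conflict from the API server, and `patchFinalizer` is an illustrative helper, not a function from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// result is a hypothetical stand-in for controller-runtime's ctrl.Result.
type result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// errConflict stands in for an optimistic-lock conflict from the API server.
var errConflict = errors.New("conflict")

// patchFinalizer maps the outcome of a patch attempt to a reconcile result.
// Conflicts are retried after a fixed delay instead of immediately.
func patchFinalizer(patch func() error) (result, error) {
	err := patch()
	switch {
	case err == nil:
		return result{}, nil
	case errors.Is(err, errConflict):
		return result{RequeueAfter: time.Second}, nil
	default:
		return result{}, err
	}
}

func main() {
	res, err := patchFinalizer(func() error { return errConflict })
	fmt.Println(res.RequeueAfter, err) // 1s <nil>
}
```

The fixed delay spaces out retries when several reconciles patch the same HCP at once, at the cost of up to one second of added latency per conflict.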

Author

Done. Changed both conflict-handling sites to ctrl.Result{RequeueAfter: time.Second}:

  1. ensureHCPFinalizer (adding finalizer)
  2. reconcileHCPDeletion (removing finalizer)

Both now include a comment explaining the rationale: avoiding tight retry loops when multiple AWSEndpointService reconcilers concurrently try to patch the same HCP.

Note: the Azure PLS controller also uses Requeue: true for conflicts (controller.go line 371), so this change makes the AWS controller stricter than Azure PLS.


AI-assisted response via Claude Code

Author

Done. Changed both conflict-handling sites to ctrl.Result{RequeueAfter: time.Second}:

  1. ensureHCPFinalizer (adding finalizer)
  2. reconcileHCPDeletion (removing finalizer)

Both now include a comment explaining the rationale: avoiding tight retry loops when multiple AWSEndpointService reconcilers concurrently try to patch the same HCP. Tests updated to assert RequeueAfter > 0 instead of Requeue == true.


AI-assisted response via Claude Code

// TestReconcileDeletionSharedVPC for details.
// The HCP finalizer (hcpAWSPrivateLinkFinalizerName) added during normal
// reconciliation ensures the HCP remains available during this cleanup.
// For SharedVPC clusters, this guarantees the cross-account role ARNs can

Member

[suggestion] Comment overstates the guarantee

This comment claims the HCP finalizer "ensures the HCP remains available during this cleanup." That is only true after a successful normal reconciliation has added the finalizer. If a cluster is newly created and the controller has not yet reconciled (e.g., controller was down), the HCP can still be deleted before the AWSEndpointService cleanup runs — the old scenario.

Consider acknowledging this edge case rather than stating the guarantee unconditionally.

Author

Done. Updated the comment to acknowledge the edge case. The new wording states that the finalizer "when present, blocks HCP deletion" and explicitly notes that it's only added after a successful normal reconciliation — if the controller hasn't reconciled yet (e.g., was down since cluster creation), the HCP may be deleted before the finalizer is placed, and the best-effort initialization is the only protection in that case.


AI-assisted response via Claude Code

}

// Enqueue when EndpointAccess changes (existing behavior).
if newHCP.Spec.Platform.AWS != nil && oldHCP.Spec.Platform.AWS != nil && newHCP.Spec.Platform.AWS.EndpointAccess != oldHCP.Spec.Platform.AWS.EndpointAccess {

Member

[suggestion] Filter deletion trigger to transition only

Once the HCP finalizer is added, ANY HCP update with a DeletionTimestamp will re-enqueue all CRs. During HCP deletion, status updates from other controllers will repeatedly trigger this, producing unnecessary list+enqueue cycles.

Consider adding oldHCP.DeletionTimestamp.IsZero() to the condition so it only fires on the transition to deletion:

if oldHCP.DeletionTimestamp.IsZero() && !newHCP.DeletionTimestamp.IsZero() && controllerutil.ContainsFinalizer(newHCP, hcpAWSPrivateLinkFinalizerName) {
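The transition-only condition can be isolated as a small predicate. This is a sketch with hypothetical types: `hcpMeta` reduces the HCP to two booleans, and `deletionStarted` is an illustrative name, not a function from the PR.

```go
package main

import "fmt"

// hcpMeta is a hypothetical, minimal view of the HCP fields the
// predicate inspects.
type hcpMeta struct {
	deleting     bool // DeletionTimestamp is set
	hasFinalizer bool // the PrivateLink cleanup finalizer is present
}

// deletionStarted reports whether an update is the transition into
// deletion, rather than a later update to an HCP already being deleted.
func deletionStarted(oldHCP, newHCP hcpMeta) bool {
	return !oldHCP.deleting && newHCP.deleting && newHCP.hasFinalizer
}

func main() {
	live := hcpMeta{hasFinalizer: true}
	deleting := hcpMeta{deleting: true, hasFinalizer: true}

	fmt.Println(deletionStarted(live, deleting))     // true
	fmt.Println(deletionStarted(deleting, deleting)) // false
}
```

A predicate like this needs both the old and the new object, so it only fits handlers that receive update events with both versions.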

Author

Addressed by switching to `EnqueueRequestsFromMapFunc` (per comment 1). With the new `mapHCPToAWSEndpointService()` MapFunc, we can no longer detect the deletion transition (old vs new DeletionTimestamp) because MapFunc only receives the current object, not old/new.

However, the filtering is achieved differently: the MapFunc only fires when the HCP has our finalizer (`controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName)`). Once all CRs are cleaned up and the HCP finalizer is removed, subsequent HCP updates no longer trigger CR enqueues. During the short deletion window, the repeated enqueues from status updates are harmless since the reconciler is idempotent — CRs that are already cleaned up return early. This matches the Azure PLS pattern exactly.


AI-assisted response via Claude Code

@hypershift-jira-solve-ci

Author

Re: #8499 (review)

@bryan-cox Thank you for the thorough staff engineer review. All blocking issues and suggestions have been addressed — see inline replies for details on each change.

Summary of changes

# | Comment | Action
1 | [blocking] UpdateFunc misses HCP deletions on restart | Switched to EnqueueRequestsFromMapFunc matching Azure PLS pattern
2 | [blocking] Dual deletion paths can compete | Added DeletionTimestamp.IsZero() guard in reconcileHCPDeletion
3 | [blocking] Multi-CR coordination needs documentation | Added comprehensive convergent-pattern docs on reconcileHCPDeletion
4 | [suggestion] Use RequeueAfter on conflicts | Changed to RequeueAfter: time.Second in both conflict handlers
5 | [suggestion] Comment overstates guarantee | Acknowledged edge case where finalizer isn't yet placed
6 | [suggestion] Filter deletion trigger | Handled via finalizer-based filtering in new MapFunc (matches Azure PLS)

Answers to open questions

Q: Does the hypershift-operator's force-finalizer-removal logic know about this new finalizer?
The HO's deleteAWSEndpointServices function handles the CPO finalizer on AWSEndpointService CRs (the 10-minute grace period), not HCP-level finalizers. The new aws-private-link-endpoint-cleanup finalizer is on the HCP itself. There is no HO-side force-removal for this finalizer — it relies on the CPO being healthy to clean up and remove it. This is the same design as the Azure PLS controller's azure-pls-endpoint-cleanup HCP finalizer, which also has no HO-side fallback. If the CPO is permanently down, the HCP would get stuck. For production safety, this could be addressed separately by adding HCP-level finalizer timeout logic to the HO, but that would be a cross-platform concern affecting both AWS and Azure.

Q: Is the broader scope (all AWS PrivateLink clusters, not just SharedVPC) intentional?
Yes, intentional. While the orphan risk is most acute for SharedVPC clusters (where AWS clients fail entirely without the HCP's role ARNs), non-SharedVPC clusters also benefit: the finalizer ensures the clientBuilder can always be re-initialized from the HCP on controller restart during deletion, rather than relying on the best-effort initialization path. The Azure PLS controller applies its HCP finalizer to all Azure PLS clusters regardless of configuration, which is the pattern we're following.


AI-assisted response via Claude Code

@hypershift-jira-solve-ci

Author

Re: #8499 (review)

Thank you for the thorough staff engineer review, @bryan-cox. All 6 inline comments have been addressed:

Blocking Issues (3) — All Fixed

  1. UpdateFunc misses HCP deletions on controller restart — Switched from handler.Funcs{UpdateFunc: ...} to handler.EnqueueRequestsFromMapFunc(r.mapHCPToAWSEndpointService()) matching the Azure PLS pattern. Create events from informer cache sync now correctly trigger cleanup.

  2. Dual deletion paths can compete — Added explicit DeletionTimestamp guard at the top of reconcileHCPDeletion. If the CR itself is being deleted, we defer to the existing CR deletion path.

  3. Multi-CR coordination needs documentation — Added comprehensive comment block on reconcileHCPDeletion documenting the convergent pattern: multiple reconcilers run concurrently, each cleans up its own CR, and the last one to finish removes the HCP finalizer. Explicitly contrasts with Azure PLS's simpler model.
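
The convergent pattern from item 3 can be sketched as a small in-memory model. The types and the `reconcileHCPDeletion` signature here are simplified assumptions (no Kubernetes client, no AWS calls); the sketch only illustrates the coordination rule that each reconcile cleans its own CR and the last one to finish removes the HCP finalizer.

```go
package main

// endpointService is a hypothetical, simplified model of an AWSEndpointService CR.
type endpointService struct {
	Name         string
	HasFinalizer bool
}

// cluster is a hypothetical model of one HCP namespace.
type cluster struct {
	HCPHasFinalizer bool
	CRs             map[string]*endpointService
}

// reconcileHCPDeletion models one reconcile of the CR called name while the
// HCP is being deleted. It returns requeue=true when other CRs are still
// pending, so this reconcile must wait for them.
func reconcileHCPDeletion(c *cluster, name string, cleanup func(string) error) (requeue bool, err error) {
	if !c.HCPHasFinalizer {
		return false, nil // nothing left to coordinate
	}
	// Clean up this reconcile's own CR, if it still holds its finalizer.
	if cr, ok := c.CRs[name]; ok && cr.HasFinalizer {
		if err := cleanup(cr.Name); err != nil {
			return false, err
		}
		cr.HasFinalizer = false
	}
	// If any CR is still pending, leave the HCP finalizer in place.
	for _, other := range c.CRs {
		if other.HasFinalizer {
			return true, nil
		}
	}
	// Last one out removes the HCP finalizer.
	c.HCPHasFinalizer = false
	return false, nil
}
```

Reconciles that arrive after the HCP finalizer is gone return immediately, which is where the small number of no-op requeues comes from.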

Suggestions (3) — All Fixed

  1. Use RequeueAfter: time.Second on conflicts — Changed both ensureHCPFinalizer and reconcileHCPDeletion conflict handlers. Tests updated accordingly.

  2. Comment overstates the guarantee — Updated comment to say "when present, blocks HCP deletion" and acknowledges the controller-hasn't-reconciled-yet edge case.

  3. Filter deletion trigger to transition only — Handled by the switch to EnqueueRequestsFromMapFunc. The MapFunc filters by finalizer presence, and once the finalizer is removed, no more enqueues occur.

All tests pass. Ready for re-review.


AI-assisted response via Claude Code

@bryan-cox

Member

/test e2e-aws

@bryan-cox

Member

e2e-aws evidence that the HCP finalizer is working

Job: pull-ci-openshift-hypershift-main-e2e-aws/2067264178355179520

Finalizer added during reconciliation (CPO logs)

Both private cluster tests show the CPO logging "Adding HCP finalizer for AWS PrivateLink cleanup":

TestCreateClusterPrivateCPO log (grep for Adding HCP finalizer):

{"level":"info","ts":"2026-06-17T17:02:21Z","msg":"Adding HCP finalizer for AWS PrivateLink cleanup","controller":"awsendpointservice","AWSEndpointService":{"name":"kube-apiserver-private","namespace":"e2e-clusters-5w84p-private-xcn67"}}

TestCreateClusterPrivateWithRouteKASCPO log (grep for Adding HCP finalizer):

{"level":"info","ts":"2026-06-17T17:03:11Z","msg":"Adding HCP finalizer for AWS PrivateLink cleanup","controller":"awsendpointservice","AWSEndpointService":{"name":"private-router","namespace":"e2e-clusters-5jz4x-private-22vkq"}}

Finalizer present on HCP resources (YAML dumps)

All three HCPs in the dump have hypershift.openshift.io/aws-private-link-endpoint-cleanup in their finalizers:

Clean teardown (destroy logs)

Both private cluster tests destroyed successfully with no errors — the finalizer did not block teardown:

Limitation

The reconcileHCPDeletion path ("HCP is being deleted" / "All AWS PrivateLink resources cleaned up, removing HCP finalizer") is not captured in the dump because the dump is collected pre-destroy and the CPO pod is terminated during destroy. The clean teardown is indirect evidence the finalizer was properly removed. The deletion path is covered by unit tests.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2026
@openshift-merge-bot

Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, hypershift-jira-solve-ci[bot]

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2026
@bryan-cox

Member

/lgtm cancel

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2026
@bryan-cox

Member

/rebase

@github-actions

🤖 Rebasing PR onto main: workflow run

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2026
@csrwng csrwng marked this pull request as ready for review June 24, 2026 14:20
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and sjenning June 24, 2026 14:21
@bryan-cox

Member

lint is failing

@bryan-cox

Member

/address-review-comments

@github-actions

🤖 Addressing review comments: workflow run

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 24, 2026

Author

Now I have a complete understanding. Here's my analysis:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go:435:1:
cyclomatic complexity 31 of func `(*AWSEndpointServiceReconciler).Reconcile` is high (> 30) (gocyclo)

Summary

The golangci-lint gocyclo linter reports that the Reconcile method in awsprivatelink_controller.go has a cyclomatic complexity of 31, exceeding the configured maximum of 30. The PR added two new branching code paths inside Reconcile: an HCP deletion check (if !hcp.DeletionTimestamp.IsZero()) and a call to ensureHCPFinalizer (if result, err := r.ensureHCPFinalizer(...)). Together they pushed a function that was already at or near the limit to 31.

Root Cause

The PR adds HCP finalizer lifecycle management directly into the Reconcile function body at line 435 of awsprivatelink_controller.go. Specifically, two new if branches were inserted:

  1. HCP deletion check (around line 533): if !hcp.DeletionTimestamp.IsZero() { return r.reconcileHCPDeletion(...) } — adds 1 decision point.
  2. HCP finalizer check (around line 549): if result, err := r.ensureHCPFinalizer(...); err != nil || !result.IsZero() { return result, err } — adds 1–2 decision points (the || counts as an additional branch).

While the new helper functions (ensureHCPFinalizer, reconcileHCPDeletion, mapHCPToAWSEndpointService) are properly extracted into separate methods, the dispatch logic inside Reconcile itself still adds branching. The Reconcile function was already at or near the complexity limit of 30 before this PR, and these additions pushed it to 31.

The project's .golangci.yml configures gocyclo with a max complexity of 30. The main (non-API) lint run processed 1,077 raw issues down to exactly 1 — this single gocyclo violation — causing golangci-lint to exit with code 1 and make lint to fail with exit code 1 (escalated to exit code 2 by the shell).

Recommendations
  1. Extract early-return blocks from Reconcile into a helper: Move the HCP-lookup, deletion-timestamp-check, paused-check, and client-initialization block into a single setup/dispatch helper (e.g., reconcileSetup or prepareReconciliation) that returns the HCP, clients, and whether to short-circuit. This would remove 3–5 decision points from Reconcile.

  2. Consolidate the deletion-path dispatch: The Reconcile function currently has two separate deletion code paths — one for the AWSEndpointService CR deletion (early in the function) and one for HCP deletion (newly added). Consider unifying these into a single handleDeletion dispatch at the top of Reconcile that routes to the appropriate cleanup path, reducing the top-level branching.

  3. Simplest fix: move the HCP deletion check and the ensureHCPFinalizer call into a reconcileNonDeletion helper that wraps the normal reconciliation path. The Reconcile function would become: handle CR deletion → call reconcileNonDeletion. This is the minimal change to stay under the complexity limit.
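
As a rough illustration of recommendation 3, the sketch below shows how moving branches into a helper lowers the complexity of the top-level function. All names and the string return values are hypothetical; the real Reconcile returns ctrl.Result and error and does far more work per branch.

```go
package main

// object is a hypothetical, simplified model of a Kubernetes object:
// Deleting stands for a non-zero DeletionTimestamp.
type object struct {
	Deleting     bool
	HasFinalizer bool
}

// reconcile keeps a single decision point at the top level: CR deletion
// versus everything else.
func reconcile(cr, hcp *object) string {
	if cr.Deleting {
		return reconcileCRDeletion(cr)
	}
	return reconcileNonDeletion(cr, hcp)
}

// reconcileCRDeletion stands in for the CR deletion path.
func reconcileCRDeletion(cr *object) string {
	cr.HasFinalizer = false
	return "cr-deletion"
}

// reconcileNonDeletion owns the HCP deletion check and the HCP finalizer
// step, so their branches no longer count against reconcile.
func reconcileNonDeletion(cr, hcp *object) string {
	if hcp.Deleting {
		return "hcp-deletion"
	}
	if !hcp.HasFinalizer {
		hcp.HasFinalizer = true
		return "added-hcp-finalizer"
	}
	return "normal"
}
```

gocyclo scores each function separately, so the total number of branches is unchanged but no single function carries all of them.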

Evidence
| Evidence | Detail |
|----------|--------|
| Lint error | cyclomatic complexity 31 of func (*AWSEndpointServiceReconciler).Reconcile is high (> 30) (gocyclo) |
| File | control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go:435:1 |
| Linter | gocyclo (via golangci-lint v2.11.4) |
| Threshold | Max cyclomatic complexity: 30 (configured in .golangci.yml) |
| Violation amount | Complexity 31, exceeds limit by 1 |
| New branches added | !hcp.DeletionTimestamp.IsZero() check + ensureHCPFinalizer error/result check with \|\| |
| Issues processed | 1,077 raw → 1 after filtering (this single violation) |
| Exit code | make lint exited with code 1 → shell exit code 2 |
| PR files changed | awsprivatelink_controller.go (+219/-33), awsprivatelink_controller_test.go (+736/-0) |

@cblecker cblecker left a comment

Member

Good progress on the HCP finalizer approach. The convergent multi-CR coordination pattern is well-designed and well-documented. One critical ordering issue below that can deadlock HCP deletion.

}

controllerutil.RemoveFinalizer(awsEndpointService, finalizer)
if err := r.Update(ctx, awsEndpointService); err != nil {

Member

nit: Since you're already using Patch + MergeFromWithOptimisticLock + explicit conflict handling for the HCP finalizer removal at line 706, consider using the same approach here for consistency within reconcileHCPDeletion. The awsEndpointService object is the one fetched at Reconcile entry and could be stale after the AWS cleanup operations above.

Done. Switched to Patch + MergeFromWithOptimisticLock with explicit conflict handling, consistent with the HCP finalizer removal pattern.


AI-assisted response via Claude Code

Member

The fix was applied to the AES finalizer removal within reconcileHCPDeletion (line 683) but the original CR deletion path here still uses r.Update. This path has the same concurrent-reconciler race window — the AES object is fetched at line 448, then AWS cleanup runs before we reach this Update at line 494.

Done. Switched the original CR deletion path to use Patch + MergeFromWithOptimisticLock with explicit conflict handling, matching the pattern in reconcileHCPDeletion. Also extracted the CR deletion logic into a reconcileCRDeletion helper to reduce Reconcile cyclomatic complexity.


AI-assisted response via Claude Code

return ctrl.Result{}, fmt.Errorf("unexpected number of HostedControlPlanes in namespace, expected: 1, actual: %d", len(hcpList.Items))

// Handle HCP deletion: clean up AWS resources while HCP credentials are still valid.
if !hcp.DeletionTimestamp.IsZero() {

Member

The CR-level finalizer is added unconditionally at line 502 (before the serviceName check at line 512), but this HCP deletion check is only reachable when serviceName != "" — the early return at line 512-516 blocks entry.

During HCP deletion, reconcileHCPDeletion lists all CRs in the namespace (line 693-694) and keeps the HCP finalizer if any CR still has the CR finalizer. A CR that was created but not yet populated with EndpointServiceName by hypershift-operator will have the CR finalizer (added at line 502) but can never reach this HCP deletion check — the serviceName guard returns first. This permanently blocks HCP deletion.

Consider either:

  • Moving this HCP fetch + deletion check before the serviceName guard (line 512), so CRs without serviceName can still enter the HCP deletion cleanup path, or
  • Moving the CR finalizer addition (lines 502-510) to after the serviceName check, so CRs without serviceName don't get the finalizer and don't block the HCP finalizer removal.
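
A minimal sketch of the second option, using simplified stand-in types: the CR finalizer is only added once EndpointServiceName is populated, so an unpopulated CR can never hold up HCP finalizer removal. The finalizer string and both function names are hypothetical placeholders, not the controller's identifiers.

```go
package main

// endpointServiceCR is a hypothetical, simplified model of an AWSEndpointService CR.
type endpointServiceCR struct {
	EndpointServiceName string
	Finalizers          []string
}

// crFinalizer is a placeholder; the real CR finalizer name is not shown in this thread.
const crFinalizer = "example.test/cr-cleanup"

// ensureCRFinalizer adds the CR finalizer only after the service name guard
// passes. It reports whether it changed the CR.
func ensureCRFinalizer(cr *endpointServiceCR) bool {
	if cr.EndpointServiceName == "" {
		return false // not populated yet: nothing to clean up, so no finalizer
	}
	for _, f := range cr.Finalizers {
		if f == crFinalizer {
			return false // already present
		}
	}
	cr.Finalizers = append(cr.Finalizers, crFinalizer)
	return true
}

// hcpDeletionBlocked mirrors the pending-CRs check: the HCP finalizer stays
// while any CR still carries the CR finalizer.
func hcpDeletionBlocked(crs []*endpointServiceCR) bool {
	for _, cr := range crs {
		for _, f := range cr.Finalizers {
			if f == crFinalizer {
				return true
			}
		}
	}
	return false
}
```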

Done. Moved the CR finalizer addition to after the serviceName check so CRs not yet populated by hypershift-operator don't get a finalizer that would block HCP deletion.


AI-assisted response via Claude Code

@bryan-cox

Member

/address-review-comments

@github-actions

🤖 Addressing review comments: workflow run

@cblecker cblecker left a comment

Member

Minor consistency note (not in the diff, so noting here): Line 515 still uses Requeue: true for the CR finalizer addition conflict, while all the new HCP finalizer operations (L612, L685, L716) use RequeueAfter: time.Second. Since this line was reorganized in this PR (moved after the serviceName check), it'd be a good opportunity to align it with the rest of the conflict handling in this file.

return mockBuilder
},
clientInterceptors: interceptor.Funcs{
Patch: func(ctx context.Context, c crclient.WithWatch, obj crclient.Object, patch crclient.Patch, opts ...crclient.PatchOption) error {

Member

This interceptor catches all Patch calls, not just HCP patches. It works because the test AES has no CR finalizer so the AES Patch at L683 is never reached — but if someone adds a finalizer to this test fixture later, the test would pass for the wrong reason. Consider filtering by type to match the pattern at L2232:

if _, ok := obj.(*hyperv1.HostedControlPlane); ok {
    return apierrors.NewConflict(...)
}
return c.Patch(ctx, obj, patch, opts...)

Done. Added HCP type filter to the patch interceptor, consistent with the AES interceptor pattern at L2232.


AI-assisted response via Claude Code

// path (which returns early at the top of Reconcile) handle cleanup. This prevents
// the two deletion paths from racing under MaxConcurrentReconciles > 1, where the
// CR deletion path and HCP deletion path could both try to clean up the same CR.
if !awsEndpointService.DeletionTimestamp.IsZero() {

Member

This guard is the main defense against the dual-deletion race (HCP deletion path vs CR deletion path under concurrent reconciliation), but none of the TestReconcileHCPDeletion test cases exercise it — all AES objects are created without DeletionTimestamp. A test case with a DeletionTimestamp-bearing AES (and a CR finalizer) would protect this guard against accidental removal. The bare mock (no expectations) would also catch any leaked AWS client calls, acting as a double safety net.

Done. Added test case "When AWSEndpointService is being deleted it should return early and let the CR deletion path handle cleanup" with a DeletionTimestamp-bearing AES that has the CR finalizer. The bare mock (no expectations) catches any leaked AWS client calls.


AI-assisted response via Claude Code

@bryan-cox

Member

/address-review-comments

@github-actions

🤖 Addressing review comments: workflow run

@github-actions

Minor consistency note (not in the diff, so noting here): Line 515 still uses Requeue: true for the CR finalizer addition conflict, while all the new HCP finalizer operations (L612, L685, L716) use RequeueAfter: time.Second. Since this line was reorganized in this PR (moved after the serviceName check), it'd be a good opportunity to align it with the rest of the conflict handling in this file.

Done. Aligned the CR finalizer conflict handling at L515 to use RequeueAfter: time.Second, consistent with all HCP finalizer operations in this file.


AI-assisted response via Claude Code

@cblecker cblecker left a comment

Member

Pre-existing nit (F2): In setFromHCP (line ~368), the if branch sets three fields (assumeSharedVPCEndpointRoleARN, assumeSharedVPCRoute53RoleARN, localZoneID) but the else branch only clears the two role ARNs — localZoneID is left stale. Unlikely to matter in practice (SharedVPC config isn't removed from a live HCP), but asymmetric cleanup is easy to fix: add b.localZoneID = "" at line 370.
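
For illustration, a symmetric version could look like the sketch below. The three field names come from the comment above; `clientBuilder`, `sharedVPCConfig`, and the setFromHCP signature are simplified assumptions, since the real method reads these values from the HCP spec.

```go
package main

// sharedVPCConfig is a hypothetical stand-in for the SharedVPC section of the HCP spec.
type sharedVPCConfig struct {
	EndpointRoleARN string
	Route53RoleARN  string
	LocalZoneID     string
}

// clientBuilder is a simplified model holding only the three fields under discussion.
type clientBuilder struct {
	assumeSharedVPCEndpointRoleARN string
	assumeSharedVPCRoute53RoleARN  string
	localZoneID                    string
}

// setFromHCP sets all three fields when SharedVPC config is present and
// clears all three when it is absent, so no stale value survives.
func (b *clientBuilder) setFromHCP(cfg *sharedVPCConfig) {
	if cfg != nil {
		b.assumeSharedVPCEndpointRoleARN = cfg.EndpointRoleARN
		b.assumeSharedVPCRoute53RoleARN = cfg.Route53RoleARN
		b.localZoneID = cfg.LocalZoneID
	} else {
		b.assumeSharedVPCEndpointRoleARN = ""
		b.assumeSharedVPCRoute53RoleARN = ""
		b.localZoneID = "" // the field the else branch previously left untouched
	}
}
```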


originalAES := awsEndpointService.DeepCopy()
controllerutil.RemoveFinalizer(awsEndpointService, finalizer)
if err := r.Patch(ctx, awsEndpointService, client.MergeFromWithOptions(originalAES, client.MergeFromWithOptimisticLock{})); err != nil {

Member

The conflict handler here (returning RequeueAfter: time.Second) doesn't have a corresponding test case. The existing "removing AWSEndpointService finalizer fails" test at line 2237 uses fmt.Errorf, which exercises the generic error path (line 687), not the conflict path. The symmetric HCP Patch conflict is tested at line 2296.

Worth adding a test case with apierrors.NewConflict for AES objects — assert RequeueAfter == time.Second and err == nil.

Done. Added a test case "When removing AWSEndpointService finalizer returns conflict error it should requeue" to TestReconcileHCPDeletionClientErrors — it returns apierrors.NewConflict for AES Patch calls and asserts RequeueAfter == time.Second with err == nil.


AI-assisted response via Claude Code

// and return early at the top of this function. This convergent pattern produces a small
// number of no-op requeues but is correct and self-healing.
func (r *AWSEndpointServiceReconciler) reconcileHCPDeletion(ctx context.Context, awsEndpointService *hyperv1.AWSEndpointService, hcp *hyperv1.HostedControlPlane, log logr.Logger) (ctrl.Result, error) {
if !controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName) {

Member

Consider adding a defensive guard at the top: if hcp.DeletionTimestamp.IsZero() { return ctrl.Result{}, nil }. The caller checks this at line 530, but the function name implies the precondition without enforcing it. If a future call site skips the check, this would clean up resources on a live cluster.

Done. Added if hcp.DeletionTimestamp.IsZero() { return ctrl.Result{}, nil } guard at the top of reconcileHCPDeletion to enforce the precondition explicitly.


AI-assisted response via Claude Code

Member

+1 on the guard. One follow-up: the diff version included a "When HCP is not being deleted it should return early" test case that exercised this guard, but it appears to have been dropped during a rebase. All current TestReconcileHCPDeletion cases construct the HCP with DeletionTimestamp: &now, so this guard path has no coverage. Worth adding back to protect it from accidental removal.

Done. Added "When HCP is not being deleted it should return early" test case to TestReconcileHCPDeletion that constructs the HCP without DeletionTimestamp, exercising the hcp.DeletionTimestamp.IsZero() guard.


AI-assisted response via Claude Code

@cblecker

Member

/address-review-comments

@github-actions

🤖 Addressing review comments: workflow run

@github-actions github-actions Bot left a comment

@cblecker Re: the pre-existing nit in setFromHCP — Done. Added b.localZoneID = "" in the else branch to symmetrically clear all three fields.


AI-assisted response via Claude Code

@cblecker cblecker left a comment

Member

Good pattern overall — the convergent multi-CR coordination is well-designed and the switch to EnqueueRequestsFromMapFunc correctly fixes the controller-restart scenario. A few ordering issues in the reconcile flow and some tests from the diff that appear to have been lost during rebasing.


// Initialize AWS clients from the HCP — guaranteed to be available because
// our finalizer blocks HCP deletion.
r.awsClientBuilder.initializeWithHCP(log, hcp)

Member

initializeWithHCP and getClients are called unconditionally here, before the CR finalizer check at L687. If getClients fails (transient STS/credential issue), the error return at L683 prevents the pending-CRs check from ever being reached — blocking HCP finalizer removal even when all CRs are already cleaned up and no AWS API calls are needed.

Moving the client initialization inside the if controllerutil.ContainsFinalizer(awsEndpointService, finalizer) block would let already-cleaned-up CRs proceed straight to the pending-CRs check.

Done. Moved initializeWithHCP and getClients inside the if controllerutil.ContainsFinalizer(awsEndpointService, finalizer) block so already-cleaned-up CRs proceed straight to the pending-CRs check without needing AWS API calls.


AI-assisted response via Claude Code

return false
}

// reconcileCRDeletion handles the AWSEndpointService CR deletion path.

Member

The diff includes a TestReconcileCRDeletion with cases for the Patch conflict requeue path, but it doesn't appear in the actual test file — looks like it was dropped during a rebase. The refactoring from Update to Patch+MergeFromWithOptimisticLock with the RequeueAfter: time.Second conflict handling is a meaningful behavioral change worth covering directly.

Done. Added TestReconcileCRDeletion covering: no-finalizer early return, successful cleanup with GC, AWS client init failure, Patch conflict requeue (asserts RequeueAfter == time.Second and err == nil), and Patch non-conflict error.


AI-assisted response via Claude Code

@@ -518,6 +505,13 @@ func (r *AWSEndpointServiceReconciler) Reconcile(ctx context.Context, req ctrl.R
return ctrl.Result{}, err

Member

Nit: this getClients error is returned bare while the other two call sites (L588, L683) wrap it with context identifying the reconciliation path. Wrapping here too ("failed to get AWS clients for endpoint reconciliation: %w") would help with production triage.

Done. Wrapped with "failed to get AWS clients for endpoint reconciliation: %w" to match the other call sites.


AI-assisted response via Claude Code

// Ensure the awsEndpointService has a finalizer for cleanup.
// This is placed after the serviceName check so that CRs not yet populated
// by hypershift-operator don't get a finalizer that would block HCP deletion.
if !controllerutil.ContainsFinalizer(awsEndpointService, finalizer) {

Member

The CR finalizer is added here before the HCP deletion check at L493. During HCP deletion, after reconcileHCPDeletion removes a CR's finalizer and requeues (waiting for other CRs), the next reconcile re-adds the CR finalizer at this line before discovering the HCP is being deleted. This causes reconcileHCPDeletion to re-run r.delete on already-cleaned-up resources each cycle.

Moving the HCP fetch and deletion check before the CR finalizer addition would avoid the re-addition loop:

hcp, err := r.getHostedControlPlane(ctx, req.Namespace)
...
if hcp != nil && !hcp.DeletionTimestamp.IsZero() {
    return r.reconcileHCPDeletion(ctx, awsEndpointService, hcp, log)
}
// Only add CR finalizer during normal reconciliation
if !controllerutil.ContainsFinalizer(awsEndpointService, finalizer) {

Done. Moved the HCP fetch and deletion check before the CR finalizer addition so that during HCP deletion, reconcileHCPDeletion handles cleanup without re-adding the CR finalizer.


AI-assisted response via Claude Code

@cblecker

cblecker commented Jul 4, 2026

Member

/address-review-comments

@github-actions

github-actions Bot commented Jul 4, 2026

🤖 Addressing review comments: workflow run

…reconciler

Add a finalizer on the HostedControlPlane to block HCP deletion until
all AWS PrivateLink resources (VPC endpoints, security groups, DNS
records) are cleaned up by the AWSEndpointService reconciler.

- Add hcpAWSPrivateLinkFinalizerName finalizer, placed after client
  initialization succeeds on the normal reconciliation path
- Add reconcileHCPDeletion to clean up AWS resources for each
  AWSEndpointService CR before removing the HCP finalizer
- Replace handler.Funcs{UpdateFunc: ...} HCP watch with
  EnqueueRequestsFromMapFunc so Create/Delete events also trigger
  reconciliation (critical for CPO restarts during HCP deletion)
- Use convergent multi-CR coordination: each reconciler cleans its
  own CR, only the last one to finish removes the HCP finalizer
- Move HCP deletion check before CR finalizer addition to prevent
  re-addition loop during HCP deletion cleanup cycles
- Move AWS client initialization inside CR finalizer check in
  reconcileHCPDeletion so already-cleaned-up CRs proceed to the
  pending-CRs check without needing AWS API calls
- Add comprehensive unit tests covering finalizer lifecycle,
  HCP deletion cleanup, CR deletion cleanup, concurrent reconciler
  coordination, and SharedVPC scenarios

Previously, if the HCP was deleted before AWSEndpointService cleanup,
the controller could not construct valid AWS clients — particularly for
SharedVPC clusters where cross-account role ARNs are sourced from the
HCP spec — orphaning AWS resources.

Signed-off-by: OpenShift CI Bot <ci-bot@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jul 4, 2026

Contributor

@hypershift-jira-solve-ci[bot]: all tests passed!

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants