CNTRLPLANE-507: Add HCP finalizer to AWSEndpointService reconciler#8499
CNTRLPLANE-507: Add HCP finalizer to AWSEndpointService reconciler#8499hypershift-jira-solve-ci[bot] wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-507 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis change adds an HCP-scoped AWS PrivateLink finalizer, updates HostedControlPlane event handling to enqueue AWSEndpointService reconciliations, and splits reconciliation into normal and HCP-deletion paths. The deletion path initializes AWS clients from the HCP, cleans up AWS resources, removes the AWSEndpointService finalizer, and clears the HCP finalizer after dependent CRs are done. Tests cover finalizer patching, deletion handling, client errors, and mapping behavior. Possibly related PRs
Suggested reviewers
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Skipping CI for Draft Pull Request. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8499 +/- ##
==========================================
+ Coverage 42.55% 43.10% +0.54%
==========================================
Files 768 773 +5
Lines 95297 96302 +1005
==========================================
+ Hits 40558 41507 +949
+ Misses 51932 51911 -21
- Partials 2807 2884 +77
... and 52 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
Re: #8499 (comment) Thanks for the walkthrough summary — the description and sequence diagram accurately capture the changes. Regarding the docstring coverage warning: noted. This is a new controller flow with several helper functions; docstrings can be added in a follow-up if the team considers it valuable for this area. AI-assisted response via Claude Code |
bryan-cox
left a comment
There was a problem hiding this comment.
Staff Engineer Review: Add HCP finalizer to AWSEndpointService reconciler
The core idea of this PR is correct and addresses a real operational problem — preventing AWS PrivateLink resource orphaning when the HCP is deleted before the AWSEndpointService cleanup runs. However, the implementation diverges from the Azure PLS pattern in ways that introduce correctness risks.
Blocking Issues (3)
See inline comments for details on:
- Dual deletion paths compete — existing CR deletion path and new HCP deletion path both remove the CR finalizer
- Multi-CR coordination under concurrency — convergent but produces unnecessary work with
MaxConcurrentReconciles: 10 UpdateFuncmisses HCP deletions on controller restart — defeats the purpose of the PR
Open Questions (2)
- Does the hypershift-operator's force-finalizer-removal logic (10-minute grace) know about this new
aws-private-link-endpoint-cleanupfinalizer? If not, the HCP could get stuck indefinitely. - The finalizer is added for ALL AWS PrivateLink clusters, not just SharedVPC. Is the broader scope intentional?
Praise
Test coverage is excellent — 784 lines of well-structured table-driven tests with gomock and client interceptors covering all new paths. The context.Background() → ctx fix in the handler is a good improvement.
| MaxConcurrentReconciles: 10, | ||
| }). | ||
| Watches(&hyperv1.HostedControlPlane{}, handler.Funcs{UpdateFunc: r.enqueueOnAccessChange(mgr)}). | ||
| Watches(&hyperv1.HostedControlPlane{}, handler.Funcs{UpdateFunc: r.enqueueOnHCPChange(mgr)}). |
There was a problem hiding this comment.
[blocking] UpdateFunc misses HCP deletions on controller restart
Using handler.Funcs{UpdateFunc: ...} means only Update events trigger this handler. If the CPO restarts while an HCP is being deleted (DeletionTimestamp already set), the informer cache sync generates a Create event — not an Update — so this handler never fires.
The Azure PLS controller avoids this by using handler.EnqueueRequestsFromMapFunc(...), which receives all event types (Create, Update, Delete) from the informer. On restart, it gets a Create event for the HCP with DeletionTimestamp set and correctly enqueues the CRs.
With the current approach, if the CPO restarts mid-HCP-deletion, the new handler will NOT fire. The reconciler would fall through to the existing AWSEndpointService CR deletion path — exactly the scenario this PR is trying to fix.
Recommendation: Switch to handler.EnqueueRequestsFromMapFunc(...) to match the Azure PLS pattern.
There was a problem hiding this comment.
Done. Switched from handler.Funcs{UpdateFunc: ...} to handler.EnqueueRequestsFromMapFunc(...) matching the Azure PLS pattern. The new mapHCPToAWSEndpointService() function receives all event types (Create, Update, Delete), so on controller restart the informer cache sync's Create event now correctly triggers cleanup for an HCP with DeletionTimestamp already set.
The EndpointAccess change detection (previously via old/new comparison in UpdateFunc) is dropped from the handler — those changes are picked up by the reconciler's existing 5-minute periodic requeue, which is acceptable since EndpointAccess changes are rare operational events.
AI-assisted response via Claude Code
There was a problem hiding this comment.
Done. Switched from handler.Funcs{UpdateFunc: ...} to handler.EnqueueRequestsFromMapFunc(r.mapHCPToAWSEndpointService()) matching the Azure PLS pattern exactly. The new mapHCPToAWSEndpointService() MapFunc receives all event types (Create, Update, Delete), so on controller restart the informer cache sync's Create event correctly triggers cleanup for an HCP with DeletionTimestamp already set.
The MapFunc filters by finalizer presence (controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName)) to avoid unnecessary reconciliations, matching the Azure PLS approach. EndpointAccess change detection is dropped from the handler — those changes are picked up by the reconciler's existing 5-minute periodic requeue.
Tests updated: replaced TestEnqueueOnHCPChange (which tested the old UpdateFunc) with TestMapHCPToAWSEndpointService (which tests the new MapFunc directly).
AI-assisted response via Claude Code
| // Handle HCP deletion: clean up AWS resources while HCP credentials are still valid. | ||
| if !hcp.DeletionTimestamp.IsZero() { | ||
| return r.reconcileHCPDeletion(ctx, awsEndpointService, hcp, log) | ||
| } |
There was a problem hiding this comment.
[blocking] Dual deletion paths can compete
The existing AWSEndpointService CR deletion path (lines 466-486 in the diff) runs when the CR itself has a DeletionTimestamp and also removes the CR finalizer + calls r.delete(). This new HCP deletion path at line 534 also removes the CR finalizer + calls r.delete().
These two paths can activate simultaneously during namespace deletion or HCP ownership-based cascading. Consider:
- HCP deletion triggers
enqueueOnHCPChange, enqueuing all CRs - Namespace/owner cascade sets DeletionTimestamp on the CRs themselves
- A reconcile fires for a CR that has BOTH its own DeletionTimestamp AND the HCP is being deleted
- The CR enters the existing deletion path (step 1), which initializes from HCP and cleans up
- Another reconcile enters this HCP deletion path
The existing CR deletion path (line 466) does return early before reaching this check, so they are technically exclusive within a single reconcile call. But with MaxConcurrentReconciles: 10, two concurrent reconciles for the same CR could race.
Suggestion: Add an explicit guard here: if !awsEndpointService.DeletionTimestamp.IsZero() { return ctrl.Result{}, nil } to make the exclusion explicit and defend against concurrent reconciles.
There was a problem hiding this comment.
Done. Added explicit guard at the top of reconcileHCPDeletion:
if !awsEndpointService.DeletionTimestamp.IsZero() {
return ctrl.Result{}, nil
}This makes the exclusion between the two deletion paths explicit and defends against concurrent reconciles under MaxConcurrentReconciles: 10. If the CR itself is being deleted, we defer to the existing CR deletion path at the top of Reconcile.
AI-assisted response via Claude Code
There was a problem hiding this comment.
Done. Added explicit guard at the top of reconcileHCPDeletion:
if !awsEndpointService.DeletionTimestamp.IsZero() {
return ctrl.Result{}, nil
}This makes the exclusion between the two deletion paths explicit and defends against concurrent reconciles under MaxConcurrentReconciles: 10. If the CR itself is being deleted, we defer to the existing CR deletion path at the top of Reconcile.
AI-assisted response via Claude Code
| // our finalizer blocks HCP deletion. | ||
| r.awsClientBuilder.initializeWithHCP(log, hcp) | ||
| ec2Client, route53Client, err := r.awsClientBuilder.getClients(ctx) | ||
| if err != nil { |
There was a problem hiding this comment.
[blocking] Multi-CR coordination needs documentation or simplification
With MaxConcurrentReconciles: 10 and enqueueOnHCPChange enqueuing ALL CRs, multiple reconcilers race through reconcileHCPDeletion concurrently. Each one cleans up its own CR, then checks if all others are done. The last one to finish removes the HCP finalizer, while earlier finishers return RequeueAfter: 5s and re-enter this path only to find the HCP finalizer already removed.
This convergent pattern is functionally correct, but:
- It produces unnecessary requeues and reconcile loops
- It is not documented, making it hard for future maintainers to reason about
- The Azure PLS controller avoids this entirely because it has
MaxConcurrentReconciles: 1and only one CR per namespace
Suggestion: At minimum, add a comment explaining the convergent behavior. Alternatively, consider having only the CR whose cleanup triggers len(pendingCRs) == 0 remove the HCP finalizer, and have all others simply return ctrl.Result{} after their own cleanup.
There was a problem hiding this comment.
Done. Added comprehensive documentation on the reconcileHCPDeletion function explaining the convergent multi-CR coordination pattern:
- Multiple reconcilers run concurrently (one per AWSEndpointService CR)
- Each cleans up its own CR, then checks if all CRs are done
- Only the last reconciler to finish removes the HCP finalizer
- Earlier finishers see pending CRs, return RequeueAfter, and on re-entry find the finalizer already removed
The comment explains this produces a small number of no-op requeues but is correct and self-healing. This is functionally similar to how the Azure PLS controller works, but documented explicitly because the AWS controller has MaxConcurrentReconciles: 10 and multiple CRs per namespace (unlike Azure PLS's MaxConcurrentReconciles: 1 with one CR per namespace).
AI-assisted response via Claude Code
There was a problem hiding this comment.
Done. Added comprehensive documentation on reconcileHCPDeletion explaining the convergent multi-CR coordination pattern:
- Multiple reconcilers run concurrently (one per AWSEndpointService CR) under
MaxConcurrentReconciles: 10 - Each cleans up its own CR's AWS resources, removes the CR finalizer, then checks if all CRs are done
- The last reconciler to finish (seeing
len(pendingCRs) == 0) removes the HCP finalizer - Earlier finishers see pending CRs, return
RequeueAfter, and on re-entry find the HCP finalizer already removed
The comment explicitly contrasts this with the Azure PLS controller (MaxConcurrentReconciles: 1, one CR per namespace) to explain why this convergent pattern is necessary for the AWS controller.
AI-assisted response via Claude Code
| controllerutil.AddFinalizer(hcp, hcpAWSPrivateLinkFinalizerName) | ||
| if err := r.Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})); err != nil { | ||
| if apierrors.IsConflict(err) { | ||
| return ctrl.Result{Requeue: true}, nil |
There was a problem hiding this comment.
[suggestion] Use RequeueAfter: time.Second instead of Requeue: true on conflicts
The Azure PLS equivalent returns ctrl.Result{RequeueAfter: time.Second} on conflict (see controller.go line 371). Using Requeue: true risks a tight retry loop under contention when multiple AWSEndpointService reconcilers are concurrently trying to patch the same HCP.
Same applies to the conflict handling in ensureHCPFinalizer (line 558).
There was a problem hiding this comment.
Done. Changed both conflict-handling sites to ctrl.Result{RequeueAfter: time.Second}:
ensureHCPFinalizer(adding finalizer)reconcileHCPDeletion(removing finalizer)
Both now include a comment explaining the rationale: avoiding tight retry loops when multiple AWSEndpointService reconcilers concurrently try to patch the same HCP.
Note: the Azure PLS controller also uses Requeue: true for conflicts (controller.go line 371), so this change makes the AWS controller stricter than Azure PLS.
AI-assisted response via Claude Code
There was a problem hiding this comment.
Done. Changed both conflict-handling sites to ctrl.Result{RequeueAfter: time.Second}:
ensureHCPFinalizer(adding finalizer)reconcileHCPDeletion(removing finalizer)
Both now include a comment explaining the rationale: avoiding tight retry loops when multiple AWSEndpointService reconcilers concurrently try to patch the same HCP. Tests updated to assert RequeueAfter > 0 instead of Requeue == true.
AI-assisted response via Claude Code
| // TestReconcileDeletionSharedVPC for details. | ||
| // The HCP finalizer (hcpAWSPrivateLinkFinalizerName) added during normal | ||
| // reconciliation ensures the HCP remains available during this cleanup. | ||
| // For SharedVPC clusters, this guarantees the cross-account role ARNs can |
There was a problem hiding this comment.
[suggestion] Comment overstates the guarantee
This comment claims the HCP finalizer "ensures the HCP remains available during this cleanup." That is only true after a successful normal reconciliation has added the finalizer. If a cluster is newly created and the controller has not yet reconciled (e.g., controller was down), the HCP can still be deleted before the AWSEndpointService cleanup runs — the old scenario.
Consider acknowledging this edge case rather than stating the guarantee unconditionally.
There was a problem hiding this comment.
Done. Updated the comment to acknowledge the edge case. The new wording states that the finalizer "when present, blocks HCP deletion" and explicitly notes that it's only added after a successful normal reconciliation — if the controller hasn't reconciled yet (e.g., was down since cluster creation), the HCP may be deleted before the finalizer is placed, and the best-effort initialization is the only protection in that case.
AI-assisted response via Claude Code
There was a problem hiding this comment.
Done. Updated the comment to acknowledge the edge case. The new wording states that the finalizer, "when present, blocks HCP deletion" and explicitly notes that it's only added after a successful normal reconciliation — if the controller hasn't reconciled yet (e.g., was down since cluster creation), the HCP may be deleted before the finalizer is placed, and the best-effort initialization is the only protection in that case.
AI-assisted response via Claude Code
| } | ||
|
|
||
| // Enqueue when EndpointAccess changes (existing behavior). | ||
| if newHCP.Spec.Platform.AWS != nil && oldHCP.Spec.Platform.AWS != nil && newHCP.Spec.Platform.AWS.EndpointAccess != oldHCP.Spec.Platform.AWS.EndpointAccess { |
There was a problem hiding this comment.
[suggestion] Filter deletion trigger to transition only
Once the HCP finalizer is added, ANY HCP update with a DeletionTimestamp will re-enqueue all CRs. During HCP deletion, status updates from other controllers will repeatedly trigger this, producing unnecessary list+enqueue cycles.
Consider adding oldHCP.DeletionTimestamp.IsZero() to the condition so it only fires on the transition to deletion:
if oldHCP.DeletionTimestamp.IsZero() && !newHCP.DeletionTimestamp.IsZero() && controllerutil.ContainsFinalizer(newHCP, hcpAWSPrivateLinkFinalizerName) {There was a problem hiding this comment.
Addressed by switching to `EnqueueRequestsFromMapFunc` (per comment 1). With the new `mapHCPToAWSEndpointService()` MapFunc, we can no longer detect the deletion transition (old vs new DeletionTimestamp) because MapFunc only receives the current object, not old/new.
However, the filtering is achieved differently: the MapFunc only fires when the HCP has our finalizer (`controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName)`). Once all CRs are cleaned up and the HCP finalizer is removed, subsequent HCP updates no longer trigger CR enqueues. During the short deletion window, the repeated enqueues from status updates are harmless since the reconciler is idempotent — CRs that are already cleaned up return early. This matches the Azure PLS pattern exactly.
AI-assisted response via Claude Code
There was a problem hiding this comment.
Addressed by switching to EnqueueRequestsFromMapFunc (per comment 1). With the new mapHCPToAWSEndpointService() MapFunc, we can no longer detect the deletion transition (old vs new DeletionTimestamp) because MapFunc only receives the current object, not old/new.
However, the filtering is achieved differently: the MapFunc only fires when the HCP has our finalizer (controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName)). Once all CRs are cleaned up and the HCP finalizer is removed, subsequent HCP updates no longer trigger CR enqueues. During the short deletion window, the repeated enqueues from status updates are harmless since the reconciler is idempotent — CRs that are already cleaned up return early. This matches the Azure PLS pattern exactly.
AI-assisted response via Claude Code
|
Re: #8499 (review) @bryan-cox Thank you for the thorough staff engineer review. All blocking issues and suggestions have been addressed — see inline replies for details on each change. Summary of changes
Answers to open questionsQ: Does the hypershift-operator's force-finalizer-removal logic know about this new finalizer? Q: Is the broader scope (all AWS PrivateLink clusters, not just SharedVPC) intentional? AI-assisted response via Claude Code |
|
Re: #8499 (review) Thank you for the thorough staff engineer review, @bryan-cox. All 6 inline comments have been addressed: Blocking Issues (3) — All Fixed
Suggestions (3) — All Fixed
All tests pass. Ready for re-review. AI-assisted response via Claude Code |
|
/test e2e-aws |
e2e-aws evidence that the HCP finalizer is workingJob: pull-ci-openshift-hypershift-main-e2e-aws/2067264178355179520 Finalizer added during reconciliation (CPO logs)Both private cluster tests show the CPO logging TestCreateClusterPrivate — CPO log (grep for {"level":"info","ts":"2026-06-17T17:02:21Z","msg":"Adding HCP finalizer for AWS PrivateLink cleanup","controller":"awsendpointservice","AWSEndpointService":{"name":"kube-apiserver-private","namespace":"e2e-clusters-5w84p-private-xcn67"}}TestCreateClusterPrivateWithRouteKAS — CPO log (grep for {"level":"info","ts":"2026-06-17T17:03:11Z","msg":"Adding HCP finalizer for AWS PrivateLink cleanup","controller":"awsendpointservice","AWSEndpointService":{"name":"private-router","namespace":"e2e-clusters-5jz4x-private-22vkq"}}Finalizer present on HCP resources (YAML dumps)All three HCPs in the dump have
Clean teardown (destroy logs)Both private cluster tests destroyed successfully with no errors — the finalizer did not block teardown:
LimitationThe |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, hypershift-jira-solve-ci[bot] The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm cancel |
|
/rebase |
|
🤖 Rebasing PR onto main: workflow run |
3c58530 to
9877433
Compare
|
lint is failing |
|
/address-review-comments |
|
🤖 Addressing review comments: workflow run |
|
Now I have a complete understanding. Here's my analysis: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseThe PR adds HCP finalizer lifecycle management directly into the
While the new helper functions ( The project's Recommendations
Evidence
|
3286f0b to
ebb3676
Compare
cblecker
left a comment
There was a problem hiding this comment.
Good progress on the HCP finalizer approach. The convergent multi-CR coordination pattern is well-designed and well-documented. One critical ordering issue below that can deadlock HCP deletion.
| } | ||
|
|
||
| controllerutil.RemoveFinalizer(awsEndpointService, finalizer) | ||
| if err := r.Update(ctx, awsEndpointService); err != nil { |
There was a problem hiding this comment.
nit: Since you're already using Patch + MergeFromWithOptimisticLock + explicit conflict handling for the HCP finalizer removal at line 706, consider using the same approach here for consistency within reconcileHCPDeletion. The awsEndpointService object is the one fetched at Reconcile entry and could be stale after the AWS cleanup operations above.
There was a problem hiding this comment.
Done. Switched to Patch + MergeFromWithOptimisticLock with explicit conflict handling, consistent with the HCP finalizer removal pattern.
AI-assisted response via Claude Code
There was a problem hiding this comment.
The fix was applied to the AES finalizer removal within reconcileHCPDeletion (line 683) but the original CR deletion path here still uses r.Update. This path has the same concurrent-reconciler race window — the AES object is fetched at line 448, then AWS cleanup runs before we reach this Update at line 494.
There was a problem hiding this comment.
Done. Switched the original CR deletion path to use Patch + MergeFromWithOptimisticLock with explicit conflict handling, matching the pattern in reconcileHCPDeletion. Also extracted the CR deletion logic into a reconcileCRDeletion helper to reduce Reconcile cyclomatic complexity.
AI-assisted response via Claude Code
| return ctrl.Result{}, fmt.Errorf("unexpected number of HostedControlPlanes in namespace, expected: 1, actual: %d", len(hcpList.Items)) | ||
|
|
||
| // Handle HCP deletion: clean up AWS resources while HCP credentials are still valid. | ||
| if !hcp.DeletionTimestamp.IsZero() { |
There was a problem hiding this comment.
The CR-level finalizer is added unconditionally at line 502 (before the serviceName check at line 512), but this HCP deletion check is only reachable when serviceName != "" — the early return at line 512-516 blocks entry.
During HCP deletion, reconcileHCPDeletion lists all CRs in the namespace (line 693-694) and keeps the HCP finalizer if any CR still has the CR finalizer. A CR that was created but not yet populated with EndpointServiceName by hypershift-operator will have the CR finalizer (added at line 502) but can never reach this HCP deletion check — the serviceName guard returns first. This permanently blocks HCP deletion.
Consider either:
- Moving this HCP fetch + deletion check before the serviceName guard (line 512), so CRs without serviceName can still enter the HCP deletion cleanup path, or
- Moving the CR finalizer addition (lines 502-510) to after the serviceName check, so CRs without serviceName don't get the finalizer and don't block the HCP finalizer removal.
There was a problem hiding this comment.
Done. Moved the CR finalizer addition to after the serviceName check so CRs not yet populated by hypershift-operator don't get a finalizer that would block HCP deletion.
AI-assisted response via Claude Code
|
/address-review-comments |
|
🤖 Addressing review comments: workflow run |
ebb3676 to
6e58fca
Compare
cblecker
left a comment
There was a problem hiding this comment.
Minor consistency note (not in the diff, so noting here): Line 515 still uses Requeue: true for the CR finalizer addition conflict, while all the new HCP finalizer operations (L612, L685, L716) use RequeueAfter: time.Second. Since this line was reorganized in this PR (moved after the serviceName check), it'd be a good opportunity to align it with the rest of the conflict handling in this file.
| return mockBuilder | ||
| }, | ||
| clientInterceptors: interceptor.Funcs{ | ||
| Patch: func(ctx context.Context, c crclient.WithWatch, obj crclient.Object, patch crclient.Patch, opts ...crclient.PatchOption) error { |
There was a problem hiding this comment.
This interceptor catches all Patch calls, not just HCP patches. It works because the test AES has no CR finalizer so the AES Patch at L683 is never reached — but if someone adds a finalizer to this test fixture later, the test would pass for the wrong reason. Consider filtering by type to match the pattern at L2232:
if _, ok := obj.(*hyperv1.HostedControlPlane); ok {
return apierrors.NewConflict(...)
}
return c.Patch(ctx, obj, patch, opts...)There was a problem hiding this comment.
Done. Added HCP type filter to the patch interceptor, consistent with the AES interceptor pattern at L2232.
AI-assisted response via Claude Code
| // path (which returns early at the top of Reconcile) handle cleanup. This prevents | ||
| // the two deletion paths from racing under MaxConcurrentReconciles > 1, where the | ||
| // CR deletion path and HCP deletion path could both try to clean up the same CR. | ||
| if !awsEndpointService.DeletionTimestamp.IsZero() { |
There was a problem hiding this comment.
This guard is the main defense against the dual-deletion race (HCP deletion path vs CR deletion path under concurrent reconciliation), but none of the TestReconcileHCPDeletion test cases exercise it — all AES objects are created without DeletionTimestamp. A test case with a DeletionTimestamp-bearing AES (and a CR finalizer) would protect this guard against accidental removal. The bare mock (no expectations) would also catch any leaked AWS client calls, acting as a double safety net.
There was a problem hiding this comment.
Done. Added test case "When AWSEndpointService is being deleted it should return early and let the CR deletion path handle cleanup" with a DeletionTimestamp-bearing AES that has the CR finalizer. The bare mock (no expectations) catches any leaked AWS client calls.
AI-assisted response via Claude Code
|
/address-review-comments |
|
🤖 Addressing review comments: workflow run |
6e58fca to
ce69e69
Compare
Done. Aligned the CR finalizer conflict handling at L515 to use AI-assisted response via Claude Code |
cblecker
left a comment
There was a problem hiding this comment.
Pre-existing nit (F2): In setFromHCP (line ~368), the if branch sets three fields (assumeSharedVPCEndpointRoleARN, assumeSharedVPCRoute53RoleARN, localZoneID) but the else branch only clears the two role ARNs — localZoneID is left stale. Unlikely to matter in practice (SharedVPC config isn't removed from a live HCP), but asymmetric cleanup is easy to fix: add b.localZoneID = "" at line 370.
|
|
||
| originalAES := awsEndpointService.DeepCopy() | ||
| controllerutil.RemoveFinalizer(awsEndpointService, finalizer) | ||
| if err := r.Patch(ctx, awsEndpointService, client.MergeFromWithOptions(originalAES, client.MergeFromWithOptimisticLock{})); err != nil { |
There was a problem hiding this comment.
The conflict handler here (returning RequeueAfter: time.Second) doesn't have a corresponding test case. The existing "removing AWSEndpointService finalizer fails" test at line 2237 uses fmt.Errorf, which exercises the generic error path (line 687), not the conflict path. The symmetric HCP Patch conflict is tested at line 2296.
Worth adding a test case with apierrors.NewConflict for AES objects — assert RequeueAfter == time.Second and err == nil.
There was a problem hiding this comment.
Done. Added a test case "When removing AWSEndpointService finalizer returns conflict error it should requeue" to TestReconcileHCPDeletionClientErrors — it returns apierrors.NewConflict for AES Patch calls and asserts RequeueAfter == time.Second with err == nil.
AI-assisted response via Claude Code
| // and return early at the top of this function. This convergent pattern produces a small | ||
| // number of no-op requeues but is correct and self-healing. | ||
| func (r *AWSEndpointServiceReconciler) reconcileHCPDeletion(ctx context.Context, awsEndpointService *hyperv1.AWSEndpointService, hcp *hyperv1.HostedControlPlane, log logr.Logger) (ctrl.Result, error) { | ||
| if !controllerutil.ContainsFinalizer(hcp, hcpAWSPrivateLinkFinalizerName) { |
There was a problem hiding this comment.
Consider adding a defensive guard at the top: if hcp.DeletionTimestamp.IsZero() { return ctrl.Result{}, nil }. The caller checks this at line 530, but the function name implies the precondition without enforcing it. If a future call site skips the check, this would clean up resources on a live cluster.
There was a problem hiding this comment.
Done. Added if hcp.DeletionTimestamp.IsZero() { return ctrl.Result{}, nil } guard at the top of reconcileHCPDeletion to enforce the precondition explicitly.
AI-assisted response via Claude Code
There was a problem hiding this comment.
+1 on the guard. One follow-up: the diff version included a "When HCP is not being deleted it should return early" test case that exercised this guard, but it appears to have been dropped during a rebase. All current TestReconcileHCPDeletion cases construct the HCP with DeletionTimestamp: &now, so this guard path has no coverage. Worth adding back to protect it from accidental removal.
There was a problem hiding this comment.
Done. Added "When HCP is not being deleted it should return early" test case to TestReconcileHCPDeletion that constructs the HCP without DeletionTimestamp, exercising the hcp.DeletionTimestamp.IsZero() guard.
AI-assisted response via Claude Code
|
/address-review-comments |
|
🤖 Addressing review comments: workflow run |
ce69e69 to
09f6bd8
Compare
There was a problem hiding this comment.
@cblecker Re: the pre-existing nit in setFromHCP — Done. Added b.localZoneID = "" in the else branch to symmetrically clear all three fields.
AI-assisted response via Claude Code
cblecker
left a comment
There was a problem hiding this comment.
Good pattern overall — the convergent multi-CR coordination is well-designed and the switch to EnqueueRequestsFromMapFunc correctly fixes the controller-restart scenario. A few ordering issues in the reconcile flow and some tests from the diff that appear to have been lost during rebasing.
|
|
||
| // Initialize AWS clients from the HCP — guaranteed to be available because | ||
| // our finalizer blocks HCP deletion. | ||
| r.awsClientBuilder.initializeWithHCP(log, hcp) |
There was a problem hiding this comment.
initializeWithHCP and getClients are called unconditionally here, before the CR finalizer check at L687. If getClients fails (transient STS/credential issue), the error return at L683 prevents the pending-CRs check from ever being reached — blocking HCP finalizer removal even when all CRs are already cleaned up and no AWS API calls are needed.
Moving the client initialization inside the if controllerutil.ContainsFinalizer(awsEndpointService, finalizer) block would let already-cleaned-up CRs proceed straight to the pending-CRs check.
There was a problem hiding this comment.
Done. Moved initializeWithHCP and getClients inside the if controllerutil.ContainsFinalizer(awsEndpointService, finalizer) block so already-cleaned-up CRs proceed straight to the pending-CRs check without needing AWS API calls.
AI-assisted response via Claude Code
| return false | ||
| } | ||
|
|
||
| // reconcileCRDeletion handles the AWSEndpointService CR deletion path. |
There was a problem hiding this comment.
The diff includes a TestReconcileCRDeletion with cases for the Patch conflict requeue path, but it doesn't appear in the actual test file — looks like it was dropped during a rebase. The refactoring from Update to Patch+MergeFromWithOptimisticLock with the RequeueAfter: time.Second conflict handling is a meaningful behavioral change worth covering directly.
There was a problem hiding this comment.
Done. Added TestReconcileCRDeletion covering: no-finalizer early return, successful cleanup with GC, AWS client init failure, Patch conflict requeue (asserts RequeueAfter == time.Second and err == nil), and Patch non-conflict error.
AI-assisted response via Claude Code
| @@ -518,6 +505,13 @@ func (r *AWSEndpointServiceReconciler) Reconcile(ctx context.Context, req ctrl.R | |||
| return ctrl.Result{}, err | |||
There was a problem hiding this comment.
Nit: this getClients error is returned bare while the other two call sites (L588, L683) wrap it with context identifying the reconciliation path. Wrapping here too ("failed to get AWS clients for endpoint reconciliation: %w") would help with production triage.
There was a problem hiding this comment.
Done. Wrapped with "failed to get AWS clients for endpoint reconciliation: %w" to match the other call sites.
AI-assisted response via Claude Code
| // Ensure the awsEndpointService has a finalizer for cleanup. | ||
| // This is placed after the serviceName check so that CRs not yet populated | ||
| // by hypershift-operator don't get a finalizer that would block HCP deletion. | ||
| if !controllerutil.ContainsFinalizer(awsEndpointService, finalizer) { |
There was a problem hiding this comment.
The CR finalizer is added here before the HCP deletion check at L493. During HCP deletion, after reconcileHCPDeletion removes a CR's finalizer and requeues (waiting for other CRs), the next reconcile re-adds the CR finalizer at this line before discovering the HCP is being deleted. This causes reconcileHCPDeletion to re-run r.delete on already-cleaned-up resources each cycle.
Moving the HCP fetch and deletion check before the CR finalizer addition would avoid the re-addition loop:
hcp, err := r.getHostedControlPlane(ctx, req.Namespace)
...
if hcp != nil && !hcp.DeletionTimestamp.IsZero() {
return r.reconcileHCPDeletion(ctx, awsEndpointService, hcp, log)
}
// Only add CR finalizer during normal reconciliation
if !controllerutil.ContainsFinalizer(awsEndpointService, finalizer) {There was a problem hiding this comment.
Done. Moved the HCP fetch and deletion check before the CR finalizer addition so that during HCP deletion, reconcileHCPDeletion handles cleanup without re-adding the CR finalizer.
AI-assisted response via Claude Code
|
/address-review-comments |
|
🤖 Addressing review comments: workflow run |
…reconciler
Add a finalizer on the HostedControlPlane to block HCP deletion until
all AWS PrivateLink resources (VPC endpoints, security groups, DNS
records) are cleaned up by the AWSEndpointService reconciler.
- Add hcpAWSPrivateLinkFinalizerName finalizer, placed after client
initialization succeeds on the normal reconciliation path
- Add reconcileHCPDeletion to clean up AWS resources for each
AWSEndpointService CR before removing the HCP finalizer
- Replace handler.Funcs{UpdateFunc: ...} HCP watch with
EnqueueRequestsFromMapFunc so Create/Delete events also trigger
reconciliation (critical for CPO restarts during HCP deletion)
- Use convergent multi-CR coordination: each reconciler cleans its
own CR, only the last one to finish removes the HCP finalizer
- Move HCP deletion check before CR finalizer addition to prevent
re-addition loop during HCP deletion cleanup cycles
- Move AWS client initialization inside CR finalizer check in
reconcileHCPDeletion so already-cleaned-up CRs proceed to the
pending-CRs check without needing AWS API calls
- Add comprehensive unit tests covering finalizer lifecycle,
HCP deletion cleanup, CR deletion cleanup, concurrent reconciler
coordination, and SharedVPC scenarios
Previously, if the HCP was deleted before AWSEndpointService cleanup,
the controller could not construct valid AWS clients — particularly for
SharedVPC clusters where cross-account role ARNs are sourced from the
HCP spec — orphaning AWS resources.
Signed-off-by: OpenShift CI Bot <ci-bot@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
09f6bd8 to
7355f0e
Compare
|
@hypershift-jira-solve-ci[bot]: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
Adds a finalizer on the HostedControlPlane resource from the AWSEndpointService reconciler to prevent HCP deletion before AWS PrivateLink resources are cleaned up.
Problem: When the CPO restarts during deletion of a SharedVPC cluster, the
clientBuilderis uninitialized and the HCP (with its cross-account role ARNs) may already be deleted. This causes the reconciler to fail creating AWS clients, and after a 10-minute grace period the hypershift-operator force-removes the CPO finalizer — orphaning VPC endpoints, security groups, and DNS records in the shared VPC account.Solution: The new HCP finalizer (
hypershift.openshift.io/aws-private-link-endpoint-cleanup) follows the same pattern used by the Azure PLS controller:enqueueOnHCPChange) to also trigger reconciliation when an HCP is being deleted with the finalizer presentWhich issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-507
Special notes for your reviewer:
enqueueOnHCPChangehandler (renamed fromenqueueOnAccessChange) now triggers on both EndpointAccess changes and HCP deletions with the finalizergetAWSClienthelper, sourcing credentials from the still-available HCP specChecklist:
Always review AI generated responses prior to use.
Generated with Claude Code via
/jira:solve [CNTRLPLANE-507](https://redhat.atlassian.net/browse/CNTRLPLANE-507)Summary by CodeRabbit
Bug Fixes
Tests