fix: recover from stale-cache conflict on ConstraintTemplatePodStatus update#4596
fix: recover from stale-cache conflict on ConstraintTemplatePodStatus update#4596a7i wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds conflict-retry handling for updating ConstraintTemplatePodStatus when the controller-runtime cached client serves a stale resourceVersion, and introduces unit tests that reproduce the stale-cache race.
Changes:
- Inject an uncached
APIReaderinto the reconciler and use it to refetch on409 Conflictduring PodStatus updates. - Add
updatePodStatusWithRetryhelper and switch existing update call sites to use it. - Add tests that cover conflict recovery, non-conflict error passthrough, and the happy path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/controller/constrainttemplate/podstatus_retry_test.go | Adds tests simulating stale cache vs live store to validate conflict retry behavior. |
| pkg/controller/constrainttemplate/constrainttemplate_controller.go | Adds apiReader and updatePodStatusWithRetry and wires it into reconciliation flows. |
7ba1e7d to
3e3a712
Compare
… update The ConstraintTemplate reconciler reads the per-pod ConstraintTemplatePodStatus through the controller-runtime informer cache and then writes it back with the delegating client. When the same ConstraintTemplate is reconciled twice in rapid succession (e.g. a CT update followed by the owned-CRD update event re-enqueuing the CT), the cache can briefly serve a stale resourceVersion and the next Update fails with a 409 Conflict, producing a noisy "update ct pod status error" log on every occurrence. Wrap the three PodStatus Update call sites in a retry helper that, on Conflict, refetches the PodStatus via mgr.GetAPIReader() (uncached) to avoid looping against the same stale cache view, re-applies the desired Status, and retries. Mirrors the retry-on-conflict pattern already used in generateCRD for the ConstraintTemplate annotation update. Fixes open-policy-agent#4595 Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
3e3a712 to
d25d37c
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4596 +/- ##
===========================================
- Coverage 54.49% 44.49% -10.01%
===========================================
Files 134 282 +148
Lines 12329 20733 +8404
===========================================
+ Hits 6719 9225 +2506
- Misses 5116 10711 +5595
- Partials 494 797 +303
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
What this PR does / why we need it:
The
ConstraintTemplatereconciler reads the per-podConstraintTemplatePodStatusthrough the controller-runtime informer cache and then writes it back with the delegating client (r.Update). When the sameConstraintTemplateis reconciled twice in rapid succession (e.g. a CT update followed by the owned-CRD update event re-enqueuing the CT), the cache briefly serves a staleresourceVersionand the subsequentUpdatefails with a 409 Conflict, producing a noisyupdate ct pod status errorlog on every occurrence:The reconciler already requeues on this error, but the spam is still high-volume because the same Reconcile pass re-enqueues itself and hits the same stale cache. Same race was previously documented for the mutator reconciler in #2459 (closed as stale, never fixed).
This change wraps the three PodStatus
Updatecall sites (Reconcile,reportErrorOnCTStatus,handleUpdate) in a smallupdatePodStatusWithRetryhelper. OnConflict, it refetches thePodStatusviamgr.GetAPIReader()(uncached) to avoid looping against the same stale cache view, re-applies the desiredStatusonto the latestresourceVersion, and retries viaretry.RetryOnConflict. Mirrors the existing retry-on-conflict pattern already used ingenerateCRDfor theConstraintTemplateannotation update.Why the uncached
APIReader(notr.Get): the cache is exactly what is stale.retry.RetryOnConflictusesDefaultBackoff(~50ms total across 5 steps), which is much shorter than typical informer relist/event latency under load, so re-reading from the same cache would just return the same staleresourceVersionand produce the same conflict. Going around the cache breaks the loop deterministically.Which issue(s) this PR fixes:
Fixes #4595
Special notes for your reviewer:
resourceVersion.statuspointer is updated in place with the latest object on retry so callers can continue to mutate it after a successful return.pkg/controller/constrainttemplate/podstatus_retry_test.go:TestUpdatePodStatusWithRetry_StaleCacheRecoversreproduces the bug with a wrapper client that servesGetfrom a pinned olderresourceVersionwhile lettingUpdatego to the live store. Without this fix the test fails to even compile against the new field; with the fix the helper recovers via the uncachedapiReaderand the desired status lands on the latest version.TestUpdatePodStatusWithRetry_NonConflictErrorReturnedverifiesNotFound(and other non-conflict errors) are surfaced and not silently retried.TestUpdatePodStatusWithRetry_HappyPathverifies the no-retry path.pkg/controller/constrainttemplatetest suite passes;make lintclean.