fix: recover from stale-cache conflict on ConstraintTemplatePodStatus update #4596

Open
a7i wants to merge 1 commit into open-policy-agent:master from a7i:fix/ct-podstatus-conflict-retry

a7i commented May 26, 2026

What this PR does / why we need it:

The ConstraintTemplate reconciler reads the per-pod ConstraintTemplatePodStatus through the controller-runtime informer cache and then writes it back with the delegating client (r.Update). When the same ConstraintTemplate is reconciled twice in rapid succession (e.g. a CT update followed by the owned-CRD update event re-enqueuing the CT), the cache briefly serves a stale resourceVersion and the subsequent Update fails with a 409 Conflict, producing a noisy "update ct pod status error" log on every occurrence:

{"level":"error","msg":"update ct pod status error","error":"Operation cannot be fulfilled on constrainttemplatepodstatuses.status.gatekeeper.sh \"gatekeeper--controller--manager--<pod>-<template>\": the object has been modified; please apply your changes to the latest version and try again"}

The reconciler already requeues on this error, but the log volume stays high because the requeued Reconcile pass reads from the same stale cache and hits the same conflict. The same race was previously documented for the mutator reconciler in #2459 (closed as stale, never fixed).
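To make the race concrete, here is a minimal sketch of the failure mode using only the standard library. It is not Gatekeeper or controller-runtime code: `PodStatus`, `Store`, and `ErrConflict` are hypothetical stand-ins for ConstraintTemplatePodStatus, the API server's optimistic concurrency check, and the 409 Conflict response.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's 409 Conflict (assumption).
var ErrConflict = errors.New("conflict: the object has been modified")

// PodStatus is a simplified, hypothetical stand-in for ConstraintTemplatePodStatus.
type PodStatus struct {
	ResourceVersion int
	Status          string
}

// Store models the live API server object with optimistic concurrency.
type Store struct{ current PodStatus }

// Get returns a copy of the live object.
func (s *Store) Get() PodStatus { return s.current }

// Update succeeds only if the caller holds the live ResourceVersion.
// On success it bumps the version on both the store and the caller's copy.
func (s *Store) Update(obj *PodStatus) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	obj.ResourceVersion++
	s.current = *obj
	return nil
}

func main() {
	store := &Store{current: PodStatus{ResourceVersion: 1, Status: "created"}}

	// The cache snapshot is taken before the first reconcile writes.
	cached := store.Get()

	// First reconcile: holds the live version, so the update is accepted.
	first := store.Get()
	first.Status = "first"
	fmt.Println("first update error:", store.Update(&first))

	// Second reconcile: the cache has not observed the write yet, so it
	// still hands out the old version and the update is rejected.
	second := cached
	second.Status = "second"
	fmt.Println("second update error:", store.Update(&second))
}
```

The second update is rejected no matter how quickly it is retried, as long as the object it starts from comes from the same snapshot.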

This change wraps the three PodStatus Update call sites (Reconcile, reportErrorOnCTStatus, handleUpdate) in a small updatePodStatusWithRetry helper. On Conflict, it refetches the PodStatus via mgr.GetAPIReader() (uncached) to avoid looping against the same stale cache view, re-applies the desired Status onto the latest resourceVersion, and retries via retry.RetryOnConflict. Mirrors the existing retry-on-conflict pattern already used in generateCRD for the ConstraintTemplate annotation update.
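The shape of such a helper can be sketched as follows. This is a standard-library model under stated assumptions, not the PR's updatePodStatusWithRetry: `Store`, `StaleCache`, `Reader`, and `updateWithRetry` are hypothetical names, the retry loop is a plain counter rather than retry.RetryOnConflict, and there is no backoff. It contrasts refetching from a stale cache with refetching from the live store.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's 409 Conflict (assumption).
var ErrConflict = errors.New("conflict: the object has been modified")

// PodStatus is a simplified, hypothetical stand-in for ConstraintTemplatePodStatus.
type PodStatus struct {
	ResourceVersion int
	Status          string
}

// Reader is the read side shared by the cached and uncached views.
type Reader interface{ Get() PodStatus }

// Store models the live API server object; it doubles as the uncached reader.
type Store struct{ current PodStatus }

func (s *Store) Get() PodStatus { return s.current }

// Update succeeds only if the caller holds the live ResourceVersion.
func (s *Store) Update(obj *PodStatus) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	obj.ResourceVersion++
	s.current = *obj
	return nil
}

// StaleCache always returns the snapshot it was built with.
type StaleCache struct{ snapshot PodStatus }

func (c StaleCache) Get() PodStatus { return c.snapshot }

// updateWithRetry attempts the update at least once. On conflict it refetches
// from reader, re-applies the desired status onto the refetched object, and
// retries, giving up after maxAttempts. Other errors are returned unchanged.
func updateWithRetry(store *Store, reader Reader, obj *PodStatus, maxAttempts int) error {
	desired := obj.Status
	for attempt := 1; ; attempt++ {
		err := store.Update(obj)
		if err == nil || !errors.Is(err, ErrConflict) || attempt >= maxAttempts {
			return err
		}
		latest := reader.Get()
		latest.Status = desired
		*obj = latest // caller's pointer now tracks the refetched object
	}
}

func main() {
	store := &Store{current: PodStatus{ResourceVersion: 1}}
	cache := StaleCache{snapshot: store.Get()}

	// Another writer moves the live object past the cached version.
	other := store.Get()
	other.Status = "other"
	_ = store.Update(&other)

	// Refetching from the stale cache repeats the same conflict.
	viaCache := cache.Get()
	viaCache.Status = "desired"
	fmt.Println("cache refetch:", updateWithRetry(store, cache, &viaCache, 5))

	// Refetching from the live store lets the retry succeed.
	viaLive := cache.Get()
	viaLive.Status = "desired"
	fmt.Println("live refetch:", updateWithRetry(store, store, &viaLive, 5))
}
```

In this model the only difference between the two calls is which reader is used for the refetch, which is the design choice the PR makes by passing the uncached reader.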

Why the uncached APIReader (not r.Get): the cache is exactly what is stale. retry.RetryOnConflict uses DefaultBackoff (~50ms total across 5 steps), which is much shorter than typical informer relist/event latency under load, so re-reading from the same cache would just return the same stale resourceVersion and produce the same conflict. Going around the cache breaks the loop deterministically.

Which issue(s) this PR fixes:
Fixes #4595

Special notes for your reviewer:

  • No external API or behavior change. The status field set written to the API server is identical, just guaranteed to land on the latest resourceVersion.
  • The supplied status pointer is updated in place with the latest object on retry so callers can continue to mutate it after a successful return.
  • Non-conflict errors are returned as-is and are not retried.
  • New unit tests in pkg/controller/constrainttemplate/podstatus_retry_test.go:
    • TestUpdatePodStatusWithRetry_StaleCacheRecovers reproduces the bug with a wrapper client that serves Get from a pinned older resourceVersion while letting Update go to the live store. Without this fix the test fails to even compile against the new field; with the fix the helper recovers via the uncached apiReader and the desired status lands on the latest version.
    • TestUpdatePodStatusWithRetry_NonConflictErrorReturned verifies NotFound (and other non-conflict errors) are surfaced and not silently retried.
    • TestUpdatePodStatusWithRetry_HappyPath verifies the no-retry path.
  • Full pkg/controller/constrainttemplate test suite passes; make lint clean.
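The three test cases listed above boil down to checking how many update attempts and refetches happen for each kind of error. A rough, standard-library-only sketch of that idea follows; it is not the PR's test file, and `scriptedWriter`, `countingReader`, `updateWithRetry`, `ErrConflict`, and `ErrNotFound` are all hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the API server's Conflict and NotFound errors (assumptions).
var (
	ErrConflict = errors.New("conflict")
	ErrNotFound = errors.New("not found")
)

// scriptedWriter returns the queued errors one per Update call, then nil.
type scriptedWriter struct {
	errs    []error
	updates int
}

func (w *scriptedWriter) Update() error {
	w.updates++
	if len(w.errs) == 0 {
		return nil
	}
	err := w.errs[0]
	w.errs = w.errs[1:]
	return err
}

// countingReader records how often the refetch path is taken.
type countingReader struct{ gets int }

func (r *countingReader) Get() { r.gets++ }

// updateWithRetry retries only on ErrConflict, refetching before each retry.
func updateWithRetry(w *scriptedWriter, r *countingReader, maxAttempts int) error {
	for attempt := 1; ; attempt++ {
		err := w.Update()
		if err == nil || !errors.Is(err, ErrConflict) || attempt >= maxAttempts {
			return err
		}
		r.Get()
	}
}

func main() {
	cases := []struct {
		name        string
		errs        []error
		wantErr     error
		wantUpdates int
		wantGets    int
	}{
		{"happy path", nil, nil, 1, 0},
		{"conflict then success", []error{ErrConflict}, nil, 2, 1},
		{"not found is not retried", []error{ErrNotFound}, ErrNotFound, 1, 0},
	}
	for _, tc := range cases {
		w := &scriptedWriter{errs: tc.errs}
		r := &countingReader{}
		err := updateWithRetry(w, r, 5)
		ok := err == tc.wantErr && w.updates == tc.wantUpdates && r.gets == tc.wantGets
		fmt.Printf("%s: ok=%v\n", tc.name, ok)
	}
}
```

Counting refetches separately from updates is what distinguishes the happy path (no refetch at all) from a recovered conflict (exactly one).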

Copilot AI review requested due to automatic review settings May 26, 2026 14:18
@a7i a7i requested a review from a team as a code owner May 26, 2026 14:18
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds conflict-retry handling for updating ConstraintTemplatePodStatus when the controller-runtime cached client serves a stale resourceVersion, and introduces unit tests that reproduce the stale-cache race.

Changes:

  • Inject an uncached APIReader into the reconciler and use it to refetch on 409 Conflict during PodStatus updates.
  • Add updatePodStatusWithRetry helper and switch existing update call sites to use it.
  • Add tests that cover conflict recovery, non-conflict error passthrough, and the happy path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/controller/constrainttemplate/podstatus_retry_test.go | Adds tests simulating stale cache vs live store to validate conflict retry behavior. |
| pkg/controller/constrainttemplate/constrainttemplate_controller.go | Adds apiReader and updatePodStatusWithRetry and wires it into reconciliation flows. |

Comment thread pkg/controller/constrainttemplate/constrainttemplate_controller.go Outdated
Comment thread pkg/controller/constrainttemplate/podstatus_retry_test.go Outdated
@a7i a7i force-pushed the fix/ct-podstatus-conflict-retry branch from 7ba1e7d to 3e3a712 Compare May 26, 2026 14:28
… update

The ConstraintTemplate reconciler reads the per-pod ConstraintTemplatePodStatus
through the controller-runtime informer cache and then writes it back with the
delegating client. When the same ConstraintTemplate is reconciled twice in
rapid succession (e.g. a CT update followed by the owned-CRD update event
re-enqueuing the CT), the cache can briefly serve a stale resourceVersion and
the next Update fails with a 409 Conflict, producing a noisy
"update ct pod status error" log on every occurrence.

Wrap the three PodStatus Update call sites in a retry helper that, on
Conflict, refetches the PodStatus via mgr.GetAPIReader() (uncached) to avoid
looping against the same stale cache view, re-applies the desired Status, and
retries. Mirrors the retry-on-conflict pattern already used in generateCRD
for the ConstraintTemplate annotation update.

Fixes open-policy-agent#4595

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
@a7i a7i force-pushed the fix/ct-podstatus-conflict-retry branch from 3e3a712 to d25d37c Compare May 26, 2026 19:11
@codecov-commenter
Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.49%. Comparing base (3350319) to head (d25d37c).
⚠️ Report is 714 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...onstrainttemplate/constrainttemplate_controller.go | 73.68% | 2 Missing and 3 partials ⚠️ |

❗ There is a different number of reports uploaded between BASE (3350319) and HEAD (d25d37c).

HEAD has 1 upload less than BASE

| Flag | BASE (3350319) | HEAD (d25d37c) |
| --- | --- | --- |
| unittests | 2 | 1 |

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4596       +/-   ##
===========================================
- Coverage   54.49%   44.49%   -10.01%     
===========================================
  Files         134      282      +148     
  Lines       12329    20733     +8404     
===========================================
+ Hits         6719     9225     +2506     
- Misses       5116    10711     +5595     
- Partials      494      797      +303     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 44.49% <73.68%> (-10.01%) ⬇️ |

Development

Successfully merging this pull request may close these issues.

Conflict on ConstraintTemplatePodStatus update due to stale informer cache in ConstraintTemplate reconciler

3 participants