fix: recover from stale-cache conflict on ConstraintTemplatePodStatus update by a7i · Pull Request #4596 · open-policy-agent/gatekeeper

a7i · 2026-05-26T14:18:14Z

What this PR does / why we need it:

The ConstraintTemplate reconciler reads the per-pod ConstraintTemplatePodStatus through the controller-runtime informer cache and then writes it back with the delegating client (r.Update). When the same ConstraintTemplate is reconciled twice in rapid succession (e.g. a CT update followed by the owned-CRD update event re-enqueuing the CT), the cache briefly serves a stale resourceVersion and the subsequent Update fails with a 409 Conflict, producing a noisy "update ct pod status error" log on every occurrence:

{"level":"error","msg":"update ct pod status error","error":"Operation cannot be fulfilled on constrainttemplatepodstatuses.status.gatekeeper.sh \"gatekeeper--controller--manager--<pod>-<template>\": the object has been modified; please apply your changes to the latest version and try again"}

The reconciler already requeues on this error, but the log spam is still high-volume because the requeued Reconcile pass hits the same stale cache. The same race was previously documented for the mutator reconciler in #2459 (closed as stale, never fixed).

This change wraps the three PodStatus Update call sites (Reconcile, reportErrorOnCTStatus, handleUpdate) in a small updatePodStatusWithRetry helper. On Conflict, the helper refetches the PodStatus via mgr.GetAPIReader() (uncached) to avoid looping against the same stale cache view, re-applies the desired Status onto the latest resourceVersion, and retries via retry.RetryOnConflict. This mirrors the existing retry-on-conflict pattern already used in generateCRD for the ConstraintTemplate annotation update.
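The refetch-and-reapply loop described above can be sketched with the standard library alone. This is a simplified simulation, not the PR's code: `podStatus`, `store`, `errConflict`, and `updateWithRetry` are hypothetical stand-ins for ConstraintTemplatePodStatus, the API server, a 409 Conflict, and updatePodStatusWithRetry, and the backoff sleeps are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a 409 Conflict from the API server (assumption).
var errConflict = errors.New("conflict: the object has been modified")

// podStatus is a simplified stand-in for ConstraintTemplatePodStatus.
type podStatus struct {
	ResourceVersion int
	Status          string
}

// store plays the role of the API server: the single source of truth.
type store struct{ obj podStatus }

// Get is the uncached read path, comparable to what an APIReader provides.
func (s *store) Get() podStatus { return s.obj }

// Update applies optimistic concurrency: the write is rejected unless the
// caller's resourceVersion matches the stored one.
func (s *store) Update(in podStatus) (podStatus, error) {
	if in.ResourceVersion != s.obj.ResourceVersion {
		return podStatus{}, errConflict
	}
	in.ResourceVersion++
	s.obj = in
	return in, nil
}

// updateWithRetry attempts the update; on conflict it refetches from the
// uncached reader, re-applies the desired status, and tries again.
func updateWithRetry(s *store, obj *podStatus, maxAttempts int) error {
	desired := obj.Status
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var updated podStatus
		updated, err = s.Update(*obj)
		if err == nil {
			*obj = updated // the caller's pointer now holds the latest object
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // non-conflict errors are returned as-is
		}
		latest := s.Get() // bypass the stale cache view
		latest.Status = desired
		*obj = latest
	}
	return err
}

func main() {
	s := &store{obj: podStatus{ResourceVersion: 2, Status: "old"}}
	// The cache handed out resourceVersion 1, which is already stale.
	stale := podStatus{ResourceVersion: 1, Status: "new"}
	err := updateWithRetry(s, &stale, 5)
	fmt.Println(err, s.obj.Status, s.obj.ResourceVersion) // prints: <nil> new 3
}
```

The first attempt conflicts, the refetch picks up resourceVersion 2, and the second attempt lands the desired status.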

Why the uncached APIReader (not r.Get): the cache is exactly what is stale. retry.RetryOnConflict uses DefaultBackoff (~50ms total across 5 steps), which is much shorter than typical informer relist/event latency under load, so re-reading from the same cache would just return the same stale resourceVersion and produce the same conflict. Going around the cache breaks the loop deterministically.
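To make the timing argument concrete, the following sketch adds up the nominal sleep time of a backoff configuration. The `backoff` type and `nominalBudget` function are hypothetical, jitter is ignored, and the parameter values (5 steps, 10ms base, constant factor) are assumptions chosen to land near the figure quoted above, not values read from the PR. It also assumes one sleep between consecutive attempts and none after the last.

```go
package main

import (
	"fmt"
	"time"
)

// backoff is a hypothetical, simplified backoff configuration:
// number of attempts, initial delay, and multiplicative growth factor.
type backoff struct {
	Steps    int
	Duration time.Duration
	Factor   float64
}

// nominalBudget returns the total time spent sleeping when every attempt
// fails, assuming one sleep between each pair of consecutive attempts.
func nominalBudget(b backoff) time.Duration {
	var total time.Duration
	d := b.Duration
	for i := 0; i < b.Steps-1; i++ {
		total += d
		d = time.Duration(float64(d) * b.Factor)
	}
	return total
}

func main() {
	// Assumed values: 5 attempts, 10ms between them, no growth.
	b := backoff{Steps: 5, Duration: 10 * time.Millisecond, Factor: 1.0}
	fmt.Println(nominalBudget(b)) // prints: 40ms
}
```

Under these assumptions the whole retry window is tens of milliseconds, so a cache that takes longer than that to observe the latest write would return the same stale object on every refetch.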

Which issue(s) this PR fixes:
Fixes #4595

Special notes for your reviewer:

- No external API or behavior change. The status field set written to the API server is identical, just guaranteed to land on the latest resourceVersion.
- The supplied status pointer is updated in place with the latest object on retry so callers can continue to mutate it after a successful return.
- Non-conflict errors are returned as-is and are not retried.
- New unit tests in pkg/controller/constrainttemplate/podstatus_retry_test.go:
  - TestUpdatePodStatusWithRetry_StaleCacheRecovers reproduces the bug with a wrapper client that serves Get from a pinned older resourceVersion while letting Update go to the live store. Without this fix the test fails to even compile against the new field; with the fix the helper recovers via the uncached apiReader and the desired status lands on the latest version.
  - TestUpdatePodStatusWithRetry_NonConflictErrorReturned verifies NotFound (and other non-conflict errors) are surfaced and not silently retried.
  - TestUpdatePodStatusWithRetry_HappyPath verifies the no-retry path.
- Full pkg/controller/constrainttemplate test suite passes; make lint clean.
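The stale-cache test double described in the notes above can be illustrated without any Kubernetes dependencies. In this sketch every name (`object`, `reader`, `liveStore`, `pinnedReader`, `retryUpdate`) is hypothetical and the real tests use a wrapper client instead. It contrasts refetching from a pinned snapshot with refetching from the live store.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a 409 Conflict (assumption).
var errConflict = errors.New("conflict")

type object struct {
	ResourceVersion int
	Status          string
}

// reader abstracts where a refetch comes from: a cache or the live store.
type reader interface{ Get() object }

// liveStore is the source of truth and enforces resourceVersion checks.
type liveStore struct{ obj object }

func (s *liveStore) Get() object { return s.obj }

func (s *liveStore) Update(in object) error {
	if in.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	in.ResourceVersion++
	s.obj = in
	return nil
}

// pinnedReader always returns the same old snapshot, like a cache that has
// not yet observed the latest write.
type pinnedReader struct{ snapshot object }

func (p pinnedReader) Get() object { return p.snapshot }

// retryUpdate writes obj, refetching from refetch on each conflict.
// It returns the number of attempts made and the final error.
func retryUpdate(s *liveStore, refetch reader, obj object, maxAttempts int) (int, error) {
	desired := obj.Status
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = s.Update(obj); err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errConflict) {
			return attempt, err
		}
		obj = refetch.Get()
		obj.Status = desired
	}
	return maxAttempts, err
}

func main() {
	stale := object{ResourceVersion: 1, Status: "ready"}

	// Refetching from the pinned snapshot repeats the same conflict.
	s1 := &liveStore{obj: object{ResourceVersion: 2}}
	n, err := retryUpdate(s1, pinnedReader{snapshot: object{ResourceVersion: 1}}, stale, 5)
	fmt.Println(n, err) // prints: 5 conflict

	// Refetching from the live store recovers on the second attempt.
	s2 := &liveStore{obj: object{ResourceVersion: 2}}
	n, err = retryUpdate(s2, s2, stale, 5)
	fmt.Println(n, err, s2.obj.Status) // prints: 2 <nil> ready
}
```

The first run exhausts all attempts against the pinned snapshot, which is the loop the uncached reader is meant to break.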

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds conflict-retry handling for updating ConstraintTemplatePodStatus when the controller-runtime cached client serves a stale resourceVersion, and introduces unit tests that reproduce the stale-cache race.

Changes:

- Inject an uncached APIReader into the reconciler and use it to refetch on 409 Conflict during PodStatus updates.
- Add updatePodStatusWithRetry helper and switch existing update call sites to use it.
- Add tests that cover conflict recovery, non-conflict error passthrough, and the happy path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
pkg/controller/constrainttemplate/podstatus_retry_test.go	Adds tests simulating stale cache vs live store to validate conflict retry behavior.
pkg/controller/constrainttemplate/constrainttemplate_controller.go	Adds `apiReader` and `updatePodStatusWithRetry` and wires it into reconciliation flows.

… update

The ConstraintTemplate reconciler reads the per-pod ConstraintTemplatePodStatus through the controller-runtime informer cache and then writes it back with the delegating client. When the same ConstraintTemplate is reconciled twice in rapid succession (e.g. a CT update followed by the owned-CRD update event re-enqueuing the CT), the cache can briefly serve a stale resourceVersion and the next Update fails with a 409 Conflict, producing a noisy "update ct pod status error" log on every occurrence.

Wrap the three PodStatus Update call sites in a retry helper that, on Conflict, refetches the PodStatus via mgr.GetAPIReader() (uncached) to avoid looping against the same stale cache view, re-applies the desired Status, and retries. Mirrors the retry-on-conflict pattern already used in generateCRD for the ConstraintTemplate annotation update.

Fixes open-policy-agent#4595

Signed-off-by: Amir Alavi <amiralavi7@gmail.com>

codecov-commenter · 2026-06-01T23:33:12Z

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.49%. Comparing base (3350319) to head (d25d37c).
⚠️ Report is 714 commits behind head on master.

Files with missing lines	Patch %	Lines
...onstrainttemplate/constrainttemplate_controller.go	73.68%	2 Missing and 3 partials ⚠️

❗ There is a different number of reports uploaded between BASE (3350319) and HEAD (d25d37c).

HEAD has 1 upload less than BASE

Flag	BASE (3350319)	HEAD (d25d37c)
unittests	2	1

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #4596       +/-   ##
===========================================
- Coverage   54.49%   44.49%   -10.01%     
===========================================
  Files         134      282      +148     
  Lines       12329    20733     +8404     
===========================================
+ Hits         6719     9225     +2506     
- Misses       5116    10711     +5595     
- Partials      494      797      +303

Flag	Coverage Δ
unittests	`44.49% <73.68%> (-10.01%)`	⬇️

Copilot AI review requested due to automatic review settings May 26, 2026 14:18

a7i requested a review from a team as a code owner May 26, 2026 14:18

Copilot AI reviewed May 26, 2026

Comment thread pkg/controller/constrainttemplate/constrainttemplate_controller.go Outdated

Comment thread pkg/controller/constrainttemplate/podstatus_retry_test.go Outdated

a7i force-pushed the fix/ct-podstatus-conflict-retry branch from 7ba1e7d to 3e3a712 Compare May 26, 2026 14:28

a7i force-pushed the fix/ct-podstatus-conflict-retry branch from 3e3a712 to d25d37c Compare May 26, 2026 19:11

a7i wants to merge 1 commit into open-policy-agent:master from a7i:fix/ct-podstatus-conflict-retry
