Summary
The /v1/batch/data/... handler maintains its own LRU of
*rego.PreparedEvalQuery (pkg/batchquery/handler.go) but never invalidates
entries when policies change. As a result, after a policy update via bundle
plugin, opal-client, or PUT /v1/policies/..., the batch endpoint continues
to evaluate against the previous compiler and returns stale responses.
The main /v1/data/... server is not affected because upstream OPA registers
a store.Register(OnCommit: s.reload) trigger in v1/server/server.go that
clears its prepared-query cache on every commit. The batch handler has no
equivalent.
Observed behavior (real dev environment)
Originally surfaced in a deployment using opal-client to sync policies into
EOPA:
- Updating a policy is visible via GET /v1/policies/... (sync confirmed).
- POST /v1/data/<path> returns results consistent with the new policy.
- POST /v1/batch/data/<path> keeps returning results consistent with the
  pre-update policy. A subsequent call to /v1/data/... does not unblock
  the batch path: another call to /v1/batch/data/... is still stale.
The stale entry survives until LRU eviction (capacity 100, keyed only by URL
path + flags) or process restart. With few hot paths it effectively never
evicts.
Verification
I have not reproduced the full HTTP scenario end-to-end against a
running EOPA binary in this investigation. The bug is confirmed by the
real-world deployment described above and programmatically, through the same
code path used by HTTP clients:
Expected behavior
/v1/batch/data/... should reflect policy changes consistently with
/v1/data/....
Actual behavior
/v1/batch/data/... returns results computed against the pre-update compiler
until the cache entry is evicted or the process restarts. GET /v1/policies/...
correctly shows the new policy version, so the staleness is only on the
batch evaluation path.
Environment
- EOPA: current main
- OPA dependency: github.com/open-policy-agent/opa v1.12.1
- Triggered in production via opal-client policy sync, but reproducible with
  any policy-write path.
Root cause
pkg/batchquery/handler.go builds the LRU in Handler() and inserts on miss
in ensurePreparedEvalQueryIsCached(), but SetManager() does not register
any store trigger or compiler trigger. The file contains no Purge or
Remove calls. A PreparedEvalQuery snapshots the *ast.Compiler it was
built against, so when Manager.setCompiler() swaps in a new compiler the
cached entries keep referencing the old one.
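The mechanism can be illustrated with a small stand-alone model. This is only a sketch under assumptions: `compiler`, `preparedQuery`, `manager`, and `batchCache` are hypothetical toy types that mimic the relationships described above. They are not the EOPA or OPA types, and the real cache is an LRU rather than a plain map.

```go
package main

import "fmt"

// compiler is a toy stand-in for *ast.Compiler; it only records a policy version.
type compiler struct{ version int }

// preparedQuery is a toy stand-in for rego.PreparedEvalQuery. As described in
// the issue, it keeps a reference to the compiler it was built against.
type preparedQuery struct{ c *compiler }

// eval returns the policy version the query was prepared against.
func (q preparedQuery) eval() int { return q.c.version }

// manager is a toy stand-in for the plugin manager that owns the current compiler.
type manager struct{ current *compiler }

// batchCache models the handler's prepared-query cache, keyed by URL path.
type batchCache struct {
	m       *manager
	entries map[string]preparedQuery
}

// get returns the cached query for path and prepares a new one only on a miss.
func (b *batchCache) get(path string) preparedQuery {
	if q, ok := b.entries[path]; ok {
		return q
	}
	q := preparedQuery{c: b.m.current}
	b.entries[path] = q
	return q
}

// purge drops every cached entry so the next get re-prepares the query.
func (b *batchCache) purge() { b.entries = map[string]preparedQuery{} }

func main() {
	m := &manager{current: &compiler{version: 1}}
	cache := &batchCache{m: m, entries: map[string]preparedQuery{}}

	fmt.Println(cache.get("/authz/allow").eval()) // prepared against version 1

	m.current = &compiler{version: 2}             // policy update swaps the compiler
	fmt.Println(cache.get("/authz/allow").eval()) // cache hit: still version 1 (stale)

	cache.purge()
	fmt.Println(cache.get("/authz/allow").eval()) // re-prepared against version 2
}
```

In this model, nothing but a purge (or an eviction) makes the cached entry pick up the new compiler, which matches the behavior seen on the batch endpoint.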
Suggested fix
Hook into Manager.RegisterCompilerTrigger in SetManager and Purge()
the LRU when the compiler is swapped:
    // pkg/batchquery/handler.go, inside SetManager, after `h.manager = m`:
    m.RegisterCompilerTrigger(h.onCompilerChange)

    // onCompilerChange drops all cached prepared queries so that the next
    // request re-prepares them against the new compiler.
    func (h *hndl) onCompilerChange(_ storage.Transaction) {
        if h.preparedEvalQueryCache != nil {
            h.preparedEvalQueryCache.Purge()
        }
    }
RegisterCompilerTrigger is preferred over Store.Register(OnCommit:)
because the batch handler only caches the PreparedEvalQuery (PEQ), and
data-only commits do not invalidate it. Mirroring the main server's
OnCommit trigger would also be correct and arguably more defensive.
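To make the trade-off concrete, here is a toy comparison of how often each option would purge the cache. It is a sketch, not OPA code: `commit`, `cache`, `store`, and `countPurges` are hypothetical names, and the model assumes that compiler triggers fire only on commits that change policy while commit triggers fire on every commit.

```go
package main

import "fmt"

// commit is a toy description of a store commit.
type commit struct{ policyChanged bool }

// cache counts how many times it has been purged.
type cache struct{ purges int }

// Purge records one purge; a real cache would also drop its entries here.
func (c *cache) Purge() { c.purges++ }

// store is a toy model of the two hook points. It assumes commit triggers run
// on every commit and compiler triggers run only when policy changed.
type store struct {
	onCommit   []func()
	onCompiler []func()
}

// apply runs the registered hooks for a single commit.
func (s *store) apply(c commit) {
	for _, f := range s.onCommit {
		f()
	}
	if c.policyChanged {
		for _, f := range s.onCompiler {
			f()
		}
	}
}

// countPurges wires the cache to one kind of trigger and replays the commits.
func countPurges(useCompilerTrigger bool, commits []commit) int {
	s := &store{}
	c := &cache{}
	if useCompilerTrigger {
		s.onCompiler = append(s.onCompiler, c.Purge)
	} else {
		s.onCommit = append(s.onCommit, c.Purge)
	}
	for _, cm := range commits {
		s.apply(cm)
	}
	return c.purges
}

func main() {
	// Three data-only commits and one policy change.
	commits := []commit{{false}, {false}, {true}, {false}}
	fmt.Println(countPurges(true, commits), countPurges(false, commits)) // prints "1 4"
}
```

Under these assumptions both options purge on the policy change, so both fix the staleness. The commit-based hook simply purges more often, which costs extra query preparation on data-heavy workloads.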
Context
Discussed on the OPA community Slack
with @srenatus and @philipaconrad, who suggested this was the likely root
cause. The code investigation above appears to confirm it.
I will open a PR once I have feedback on this issue.