Batch Data API (/v1/batch/data/...) serves stale results after policy updates: prepared-query cache is never invalidated #450

@disaverio

Summary

The /v1/batch/data/... handler maintains its own LRU of
*rego.PreparedEvalQuery (pkg/batchquery/handler.go) but never invalidates
entries when policies change. As a result, after a policy update via the
bundle plugin, opal-client, or PUT /v1/policies/..., the batch endpoint
continues to evaluate against the previous compiler and returns stale responses.

The main /v1/data/... server is not affected because upstream OPA registers
a store.Register(OnCommit: s.reload) trigger in v1/server/server.go that
clears its prepared-query cache on every commit. The batch handler has no
equivalent.

Observed behavior (real dev environment)

Originally surfaced in a deployment using opal-client to sync policies into
EOPA:

  • Updating a policy is visible via GET /v1/policies/... (sync confirmed).
  • POST /v1/data/<path> returns results consistent with the new policy.
  • POST /v1/batch/data/<path> keeps returning results consistent with the
    pre-update policy. A subsequent call to /v1/data/... does not unblock
    the batch path: another call to /v1/batch/data/... is still stale.

The stale entry survives until LRU eviction (capacity 100, keyed only by URL
path + flags) or process restart. With few hot paths it effectively never
evicts.
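To illustrate why eviction alone rarely clears a stale entry, here is a minimal sketch of an LRU with the same capacity. It is a toy model, not the cache library the handler uses: the `lru` type, `replay`, `distinctKeys`, and the `/hot/` and `/cold/` key strings are all hypothetical.

```go
package main

import (
	"container/list"
	"strconv"
)

// lru is a toy least-recently-used set of cache keys. It is NOT the library
// the batch handler uses; it only models the eviction rule.
type lru struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> its node in order
}

func newLRU(capacity int) *lru {
	return &lru{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// touch records one request for key and reports whether it was a cache hit.
// On a miss the key is inserted and, if the cache is then over capacity, the
// least recently used key is evicted.
func (c *lru) touch(key string) bool {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return true
	}
	c.items[key] = c.order.PushFront(key)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
	return false
}

// contains reports whether key is cached, without changing recency.
func (c *lru) contains(key string) bool {
	_, ok := c.items[key]
	return ok
}

// replay requests every key in keys, rounds times over, and returns the
// number of cache misses.
func replay(c *lru, keys []string, rounds int) int {
	misses := 0
	for i := 0; i < rounds; i++ {
		for _, k := range keys {
			if !c.touch(k) {
				misses++
			}
		}
	}
	return misses
}

// distinctKeys returns n hypothetical cache keys sharing a prefix.
func distinctKeys(prefix string, n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = prefix + strconv.Itoa(i)
	}
	return keys
}
```

In this model, replaying three hot keys against `newLRU(100)` any number of times produces exactly three misses and no evictions. The original entries only disappear once roughly a full cache's worth of other keys has been requested.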

Verification

I have not reproduced the full HTTP scenario end-to-end against a
running EOPA binary in this investigation. The bug was observed in a real
deployment and confirmed programmatically through the same code path used
by HTTP clients.

Expected behavior

/v1/batch/data/... should reflect policy changes consistently with
/v1/data/....

Actual behavior

/v1/batch/data/... returns results computed against the pre-update compiler
until the cache entry is evicted or the process restarts. GET /v1/policies/...
correctly shows the new policy version, so the staleness is only on the
batch evaluation path.

Environment

  • EOPA: current main
  • OPA dependency: github.com/open-policy-agent/opa v1.12.1
  • Triggered in production via opal-client policy sync, but reproducible with
    any policy-write path.

Root cause

pkg/batchquery/handler.go builds the LRU in Handler() and inserts on miss
in ensurePreparedEvalQueryIsCached(), but SetManager() does not register
any store trigger or compiler trigger. The file contains no Purge or
Remove calls. A PreparedEvalQuery snapshots the *ast.Compiler it was
built against, so when Manager.setCompiler() swaps in a new compiler the
cached entries keep referencing the old one.
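The snapshot behavior described above can be modeled without OPA at all. The following is a minimal sketch under stated assumptions: `compiler`, `preparedQuery`, `manager`, and `handler` are hypothetical stand-ins for *ast.Compiler, rego.PreparedEvalQuery, the plugin manager, and the batch handler, and the cache is a plain map rather than an LRU.

```go
package main

import "sync"

// compiler stands in for *ast.Compiler: an immutable snapshot of the policy.
type compiler struct{ allow bool }

// preparedQuery stands in for rego.PreparedEvalQuery. It keeps a pointer to
// the compiler it was prepared against and never consults the manager again.
type preparedQuery struct{ c *compiler }

func (p preparedQuery) eval() bool { return p.c.allow }

// manager holds the current compiler, which is swapped on policy updates.
type manager struct {
	mu sync.RWMutex
	c  *compiler
}

func (m *manager) setCompiler(c *compiler) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.c = c
}

func (m *manager) getCompiler() *compiler {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.c
}

// handler caches prepared queries by path and, like the handler described in
// this issue, has no invalidation hook.
type handler struct {
	m     *manager
	cache map[string]preparedQuery
}

func newHandler(m *manager) *handler {
	return &handler{m: m, cache: make(map[string]preparedQuery)}
}

// query prepares on a cache miss and evaluates the cached entry otherwise.
func (h *handler) query(path string) bool {
	pq, ok := h.cache[path]
	if !ok {
		pq = preparedQuery{c: h.m.getCompiler()}
		h.cache[path] = pq
	}
	return pq.eval()
}
```

In this model, a path queried before `setCompiler` keeps returning the old decision afterwards, while a path first queried after the swap sees the new policy. That matches the observed split between cached batch paths and everything else.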

Suggested fix

Hook into Manager.RegisterCompilerTrigger in SetManager and Purge()
the LRU when the compiler is swapped:

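// pkg/batchquery/handler.go, inside SetManager, after `h.manager = m`:
m.RegisterCompilerTrigger(h.onCompilerChange)

// onCompilerChange drops every cached prepared query so that the next
// request re-prepares against the newly installed compiler.
func (h *hndl) onCompilerChange(_ storage.Transaction) {
    if h.preparedEvalQueryCache != nil {
        h.preparedEvalQueryCache.Purge()
    }
}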

RegisterCompilerTrigger is preferred over Store.Register(OnCommit:)
because the batch handler caches only the PreparedEvalQuery, which data-only
commits do not invalidate. Mirroring the main server's OnCommit trigger
would also be correct, and arguably more defensive.
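The trade-off between the two hooks can be sketched with a toy model. It assumes, as described above, that compiler triggers fire only on commits that change policy while commit triggers fire on every commit. The `manager`, `event`, and `cache` types and their methods are hypothetical and do not reproduce OPA's real signatures.

```go
package main

import "sync"

// event describes one store commit in this toy model.
type event struct{ policyChanged bool }

// manager is a hypothetical stand-in that fans commits out to two kinds of
// listeners. Assumption: compiler triggers run only when policy changed.
type manager struct {
	commitTriggers   []func()
	compilerTriggers []func()
}

func (m *manager) registerCommitTrigger(f func())   { m.commitTriggers = append(m.commitTriggers, f) }
func (m *manager) registerCompilerTrigger(f func()) { m.compilerTriggers = append(m.compilerTriggers, f) }

// commit notifies commit triggers unconditionally and compiler triggers only
// for policy-changing commits.
func (m *manager) commit(e event) {
	for _, f := range m.commitTriggers {
		f()
	}
	if e.policyChanged {
		for _, f := range m.compilerTriggers {
			f()
		}
	}
}

// cache is a mutex-guarded map that counts how often it has been purged.
type cache struct {
	mu      sync.Mutex
	entries map[string]string
	purges  int
}

func newCache() *cache { return &cache{entries: make(map[string]string)} }

func (c *cache) put(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = v
}

func (c *cache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.entries)
}

// purge empties the cache, as an invalidation trigger would.
func (c *cache) purge() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = make(map[string]string)
	c.purges++
}
```

With one cache registered on each hook, three data-only commits followed by one policy commit purge the commit-trigger cache four times and the compiler-trigger cache once. Both are empty after the policy commit, so under this assumption the compiler trigger gives the same correctness with fewer re-prepares.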

Context

Discussed on the OPA community Slack
with @srenatus and @philipaconrad, who suggested this was the likely root
cause. Code investigation appears to confirm it.

I will open a PR once I have feedback on this issue.
