fix(cache): enforce fail-closed Redis backend for distributed mode#199
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRefactors cache backend into an explicit mode-driven system, adds CacheBackendMode and CacheBackendUnavailableError, implements a TTL+LRU VerificationCache with introspection, rewrites RedisCache to enforce STRICT_DISTRIBUTED or EXPLICIT_DEGRADED semantics, adds reset_redis_state, and includes comprehensive tests for startup/runtime failures and recovery. ChangesCache backend mode enforcement
Sequence Diagram(s)sequenceDiagram
participant Client
participant Redis
participant VerificationCache
participant SecurityLogger
Client->>Redis: get/set/invalidate/scan
Redis-->>Client: success (distributed path)
Redis--x Client: runtime failure
Client->>VerificationCache: fallback read/write (EXPLICIT_DEGRADED)
VerificationCache-->>Client: return cached result with _degraded_mode/_cache_backend
Client->>SecurityLogger: emit BACKEND_DEGRADED / BACKEND_RECOVERED events
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac6ba661a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_pr189_redis_failclosed.py (2)
31-42: ⚡ Quick winRemove the dead
_make_redis_cachehelper.This helper is never called (each test class defines its own
_make_*helper), and the firstwithblock is a no-op: it patchesRedisCache.__init__.__globals__(a read-only function attribute) and then immediatelypasses. Leaving it in is confusing and would error if ever invoked. Recommend deleting it.♻️ Proposed removal
-def _make_redis_cache(mode: CacheBackendMode, client=None): - """ - Build a RedisCache with a fully mocked Redis client. - Patches get_redis_client() so no real Redis is needed. - """ - with patch("qwed_new.core.redis_config.get_redis_client", return_value=client), \ - patch("qwed_new.core.cache.RedisCache.__init__.__globals__", {}): - pass - - # Patch at the point of import inside RedisCache.__init__ - with patch("qwed_new.core.redis_config.get_redis_client", return_value=client): - return RedisCache(mode=mode) - -🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_pr189_redis_failclosed.py` around lines 31 - 42, Remove the unused helper _make_redis_cache entirely: delete the function and its two with blocks (including the noop patch of RedisCache.__init__.__globals__ and the second patch that returns RedisCache(mode=mode)); any tests that need a mocked Redis client should define their own _make_* helpers as already done, so ensure no references to _make_redis_cache, RedisCache.__init__.__globals__, get_redis_client patching, or the return RedisCache(mode=mode) remain in the file.
14-14: 💤 Low valueDrop the unused
PropertyMockimport.CodeQL flags
PropertyMockas unused; onlyMagicMockandpatchare referenced.♻️ Proposed cleanup
-from unittest.mock import MagicMock, patch, PropertyMock +from unittest.mock import MagicMock, patch🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_pr189_redis_failclosed.py` at line 14, The import line includes an unused symbol PropertyMock; remove PropertyMock from the import so only MagicMock and patch are imported (i.e., update the statement that currently imports MagicMock, patch, PropertyMock), then run lint/tests to confirm no references remain to PropertyMock; references to MagicMock and patch should be untouched.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/cache.py`:
- Around line 228-233: The __contains__ method currently checks presence in
self._cache without acquiring _lock or validating TTL, so it can return True for
expired entries; update __contains__(dsl_code) to acquire self._lock, compute
key via _generate_key(dsl_code), look up the entry and verify it is not expired
using the same TTL validation logic as get() (and remove/return False if
expired) to mirror get() behavior; also make __len__ thread-safe by acquiring
self._lock and counting only non-expired entries (or returning the length of
self._cache after pruning expired entries) so both methods are consistent and
safe with _lock, _cache, _generate_key, and get().
- Line 450: The clear method currently has a default None but uses an implicit
Optional in its annotation; update the signature of clear to explicitly annotate
the parameter as Optional[str] (e.g., def clear(self, pattern: Optional[str]) ->
int: or def clear(self, pattern: Optional[str] = None) -> int:), add the
necessary import from typing (Optional) if not already present, and ensure any
references inside the clear method that assume a str handle the None case
safely.
- Line 574: The parameter type for verify_fn in the function signature uses
lowercase callable and an implicit Optional; update the annotation to use
typing.Callable and typing.Optional explicitly (e.g., Optional[Callable[...,
Any]]), and add the corresponding imports (Callable, Optional, Any) to the
module imports so the signature for verify_fn is properly typed; locate the
verify_fn parameter in src/qwed_new/core/cache.py and adjust its annotation and
imports accordingly.
---
Nitpick comments:
In `@tests/test_pr189_redis_failclosed.py`:
- Around line 31-42: Remove the unused helper _make_redis_cache entirely: delete
the function and its two with blocks (including the noop patch of
RedisCache.__init__.__globals__ and the second patch that returns
RedisCache(mode=mode)); any tests that need a mocked Redis client should define
their own _make_* helpers as already done, so ensure no references to
_make_redis_cache, RedisCache.__init__.__globals__, get_redis_client patching,
or the return RedisCache(mode=mode) remain in the file.
- Line 14: The import line includes an unused symbol PropertyMock; remove
PropertyMock from the import so only MagicMock and patch are imported (i.e.,
update the statement that currently imports MagicMock, patch, PropertyMock),
then run lint/tests to confirm no references remain to PropertyMock; references
to MagicMock and patch should be untouched.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a675bc6c-baa2-494b-9a22-10604acfd4a8
📒 Files selected for processing (3)
src/qwed_new/core/cache.pysrc/qwed_new/core/redis_config.pytests/test_pr189_redis_failclosed.py
Greptile SummaryThis PR replaces the silent Redis-to-in-memory fallback in
Confidence Score: 5/5Safe to merge after acknowledging the cold-cache window that follows deployment due to the key-format change. The new findings are observability and deployment concerns rather than correctness bugs: stats() reports backend_error=True during the post-failure cooldown window even after Redis has recovered, and the corrected key-generation format invalidates all existing Redis entries on deploy. Neither breaks the fail-closed guarantee that is the core purpose of this PR. src/qwed_new/core/cache.py — specifically the stats() error path and the deployment impact of the _generate_key format change. Important Files Changed
Reviews (26): Last reviewed commit: "fix(cache): exclude demo block from cove..." | Re-trigger Greptile |
ac6ba66 to
e43a351
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/cache.py`:
- Around line 545-556: The exception handler in the stats path currently sets
"degraded" using the backend mode (self.mode !=
CacheBackendMode.STRICT_DISTRIBUTED) which can be misleading; change it to
reflect whether the fallback cache is actually being used (e.g., check
self._fallback_cache or an existing runtime flag like self._using_fallback) so
"degraded" is True only when a fallback cache exists/is active and False
otherwise, updating the dict construction in the except block of the stats
method (or the surrounding Cache class) to use that runtime check instead of the
mode comparison.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 438e1ad1-1d0b-4028-9b8c-7f5faefb703e
📒 Files selected for processing (3)
src/qwed_new/core/cache.pysrc/qwed_new/core/redis_config.pytests/test_pr189_redis_failclosed.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/qwed_new/core/redis_config.py
e43a351 to
ab67b74
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/cache.py`:
- Around line 406-430: The get() (and likewise set()) path can stay degraded
forever because it checks self._client without attempting to reacquire it; call
self._try_get_client() at the start of Cache.get and Cache.set (before checking
self._client) so the client is reinitialized when Redis returns, then proceed
with the existing logic (keep existing calls to _recover_from_fallback() and
_handle_runtime_redis_error()). Ensure this change avoids altering fallback
semantics (still use self._fallback_cache when _client is None) and preserves
invalidate()/clear() behavior.
- Around line 154-169: The cached verification results are stored and returned
by reference in VerificationCache (methods set() and get()), allowing caller
mutations (e.g., cached_verify() tagging or nested model edits) to poison future
cache entries; fix by performing a deep copy of the result object at the cache
boundary: when storing in VerificationCache.set() deep-copy the value before
placing it into the cache entry, and when returning from VerificationCache.get()
return a deep copy of entry.result (use copy.deepcopy) so internal stored
objects are never mutated by callers; ensure both places (existing get() block
and the corresponding set() code path) are updated.
- Around line 591-599: get_cache currently reuses a single module-global
VerificationCache instance for LOCAL_ONLY or when use_redis=False which allows
cross-tenant leakage because VerificationCache._generate_key does not include
tenant context; fix by scoping local caches per tenant (e.g., maintain a mapping
of tenant_id -> VerificationCache and return/instantiate a tenant-specific
instance when mode==CacheBackendMode.LOCAL_ONLY or use_redis is False) or
alternatively ensure VerificationCache._generate_key incorporates tenant_id into
generated keys; update get_cache and any module globals (_verification_cache) to
use tenant-specific storage and mirror the same change for the other affected
branch (lines ~620-623) so local-only paths never share cache across tenant_ids.
- Around line 322-327: The _recover_from_fallback helper currently emits
BACKEND_RECOVERED always; change it so the event is only emitted when a real
state transition occurs by checking _fallback_cache under _fallback_lock and
emitting only if it was non-None. In practice, inside _recover_from_fallback
(and using self._fallback_lock) verify if self._fallback_cache is not None, set
it to None, and then call self._emit_security_event("BACKEND_RECOVERED",
MSG_BACKEND_RECOVERED); do not call _emit_security_event when there was no
fallback active. Ensure you reference the existing symbols
_recover_from_fallback, _fallback_cache, _fallback_lock, _emit_security_event,
and MSG_BACKEND_RECOVERED.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 803a0b15-15be-420d-8901-f8854615f6e1
📒 Files selected for processing (3)
src/qwed_new/core/cache.pysrc/qwed_new/core/redis_config.pytests/test_pr189_redis_failclosed.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/qwed_new/core/redis_config.py
- tests/test_pr189_redis_failclosed.py
ab67b74 to
a0a64fd
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/cache.py`:
- Around line 168-170: Replace the JSON round-trip deep-copy with copy.deepcopy
to avoid coercing non-JSON types: import copy in src/qwed_new/core/cache.py and
change the return statements that currently do
json.loads(json.dumps(entry.result)) (the ones around the uses of entry.result
at the shown spots) to return copy.deepcopy(entry.result); make the same
replacement for the similar code block around lines 192-194 so cached values
preserve types like Decimal, sympy objects, and tuples instead of being
converted by JSON.
- Around line 413-427: The Redis I/O try block currently wraps JSON
(de)serialization causing payload errors to be treated as backend outages;
narrow the exception scope so only the client.get()/client.set() call invokes
_handle_runtime_redis_error("get"/"set", exc) and let json.loads/json.dumps
raise normally (or handle them separately). Concretely, call client.get(key)
inside a small try that catches Redis/runtime errors and calls
_handle_runtime_redis_error, then after recovering from fallback run
json.loads(cached) outside that try (and similarly for the set path with
json.dumps), keeping calls to _recover_from_fallback(), _hits/_misses updates,
and return logic unchanged.
- Around line 629-642: get_cache() currently uses a single mutable global
_redis_cache that can be swapped by concurrent callers and return a RedisCache
for the wrong tenant/mode; replace this last-writer-wins singleton with a keyed
cache map (e.g. _redis_caches keyed by (tenant_id, mode)) and guard access with
a module-level lock to make lookup/creation atomic. Specifically, keep the
existing _verification_caches logic for LOCAL_ONLY, but for the Redis path
change uses of _redis_cache to look up in _redis_caches[(tenant_id, mode)] and,
under the lock, create and store a new RedisCache(tenant_id=tenant_id,
mode=mode) only if missing; ensure callers return the cached instance without
releasing the lock early so no interleaving can swap namespaces for the returned
RedisCache.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 26aacf39-807e-4d2a-8980-e5e22374bb79
📒 Files selected for processing (3)
src/qwed_new/core/cache.pysrc/qwed_new/core/redis_config.pytests/test_pr189_redis_failclosed.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/qwed_new/core/redis_config.py
a0a64fd to
2a21c62
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/cache.py`:
- Around line 130-135: The cache key construction is ambiguous because the code
concatenates json.dumps(variables) directly onto the normalized DSL (the local
variable normalized), causing collisions; update the key builder used by
VerificationCache/RedisCache to produce an unambiguous encoding instead of
simple concatenation—e.g., build a single structured string (or JSON object)
that explicitly labels fields (like {"dsl": normalized, "variables":
<json-dumped vars>}) or insert a clear delimiter/prefix-length before the
variables so boundaries are preserved; ensure you change the code that currently
does json.dumps(variables, sort_keys=True) and string-append to normalized to
use this structured encoding everywhere the key is computed.
- Around line 183-191: When inserting into the cache, avoid evicting an
unrelated LRU entry if the key already exists: change the eviction logic inside
the with self._lock block to only pop oldest items when len(self._cache) >=
self.max_size AND the key is not already in self._cache (i.e., only evict for a
new key that would push size over max_size). Update the block around
self._cache, self.max_size, and the CacheEntry insertion so existing keys are
updated in-place without causing an unnecessary popitem(last=False).
- Around line 375-398: The runtime Redis error handler
(_handle_runtime_redis_error) currently leaves self._client intact so callers
(get/set/invalidate/clear) keep using the broken client; mark the client dead by
setting self._client = None inside _handle_runtime_redis_error (while holding
_fallback_lock) when handling EXPLICIT_DEGRADED failures so the existing
client-retry/cooldown path is triggered; ensure the get/set/invalidate/clear
code paths that check for client is None and the STRICT_DISTRIBUTED behavior
remain unchanged (i.e., still fail-closed for STRICT_DISTRIBUTED) so a fresh
client will be created or the fallback used instead of repeatedly hitting a
broken client.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 76b111fb-660c-44d6-b572-94c2a2267cfb
📒 Files selected for processing (3)
src/qwed_new/core/cache.pysrc/qwed_new/core/redis_config.pytests/test_pr189_redis_failclosed.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/qwed_new/core/redis_config.py
2a21c62 to
8c5fa59
Compare
df1adf0 to
287593f
Compare
dc252b8 to
7963b3f
Compare
7963b3f to
2a3934e
Compare
2a3934e to
25ddce4
Compare
…loses #189) Issue #189: RedisCache silently fell back to node-local VerificationCache when Redis was unavailable, breaking distributed consistency and enabling policy/rate-limit bypass by routing requests to cold nodes. Root cause: RedisCache.__init__ unconditionally created a VerificationCache fallback when _client is None, and get/set/invalidate/clear silently routed to it. Infrastructure failure became an implicit security-mode downgrade. Fix: Introduce explicit CacheBackendMode enum with three modes -- STRICT_DISTRIBUTED (new default for RedisCache and get_cache): Redis unavailable at startup -> raises CacheBackendUnavailableError. Redis error at runtime (get/set/invalidate/clear) -> raises CacheBackendUnavailableError. No silent node-local fallback, ever. Use this for rate limits, replay protection, policy gates. EXPLICIT_DEGRADED (opt-in): Redis unavailable -> activates VerificationCache fallback, but only after emitting a structured JSON security event to qwed.cache.security logger. Every result from the fallback carries _degraded_mode=True and _cache_backend=local_degraded so callers and auditors can detect the downgrade. Cross-node divergence becomes explicit, not silent. LOCAL_ONLY: Never touches Redis. Pure in-memory VerificationCache. Used by cached_verify() (backward compatible -- its existing behavior was already effectively local-only via silent fallback). Additional changes: - _emit_security_event(): structured JSON log (event, mode, tenant_id, timestamp) emitted on BACKEND_UNAVAILABLE_AT_STARTUP and BACKEND_UNAVAILABLE_AT_RUNTIME - stats() now exposes mode and degraded fields for observability - redis_config.py: add reset_redis_state() for test isolation - VerificationCache: zero changes (backward compatible) - cached_verify(): explicitly LOCAL_ONLY with doc comment explaining why Tests: 25 new tests in tests/test_pr189_redis_failclosed.py covering startup-down, runtime-loss, security event emission, JSON event validity, _degraded_mode tagging, cross-node isolation documentation, get_cache() default mode contract, and cached_verify() backward compatibility. Full regression: 1198 passed, 82 skipped -- no regressions.
25ddce4 to
b1da90e
Compare
c8a2a4b to
b1da90e
Compare
50c163b to
b1da90e
Compare
|



Problem
RedisCachesilently fell back to a node-localVerificationCachewhen Rediswas unavailable. In a multi-node deployment this means:
This breaks distributed consistency at the enforcement boundary. Rate limits,
replay protection, and policy gates built on top of the shared cache can be
bypassed simply by routing requests to a cold or restarting node. Infrastructure
failure was silently converted into a security-mode downgrade with no operator
signal.
Closes #189.
What Changed
CacheBackendModeenum (new)Three explicit operating modes — no hidden behavior:
STRICT_DISTRIBUTEDCacheBackendUnavailableErrorEXPLICIT_DEGRADEDLOCAL_ONLYcached_verify()CacheBackendUnavailableError(new)Explicit, informative exception raised in
STRICT_DISTRIBUTEDmode. Messageincludes guidance toward
EXPLICIT_DEGRADEDorLOCAL_ONLYso callers knowtheir options.
RedisCache(modified)mode: CacheBackendMode = STRICT_DISTRIBUTED_fallback_cache = VerificationCache()on Redis unavailableSTRICT_DISTRIBUTED+ no Redis at startup → raises immediately at constructionSTRICT_DISTRIBUTED+ Redis error at runtime (get/set/invalidate/clear) → raisesCacheBackendUnavailableErrorEXPLICIT_DEGRADEDfallback activation emits a structured JSON security event toqwed.cache.securitylogger_degraded_mode=Trueand_cache_backend="local_degraded"— cross-node divergence is explicit, not silentstats()now exposesmodeanddegradedfieldsget_cache()(modified)modeparameter, defaults toSTRICT_DISTRIBUTEDuse_redis=FalseorLOCAL_ONLY→ returns plainVerificationCachecached_verify()(modified)Explicitly uses
LOCAL_ONLY. Its existing behaviour was already effectivelylocal-only via the silent fallback — this makes the intent clear and documented.
No breaking change for existing callers.
redis_config.py(modified)Added
reset_redis_state()— resets module-level Redis singleton between testsso Redis availability can be simulated independently per test case.
Tests
tests/test_pr189_redis_failclosed.py— 25 new tests:TestStrictDistributedStartupDown— raises on startup, error message guides caller, no fallback createdTestStrictDistributedRuntimeError— raises on get/set/invalidate/clear, never activates fallbackTestExplicitDegradedStartupDown— activates fallback, emits valid JSON security event, tags resultsTestExplicitDegradedRuntimeLoss— runtime Redis loss activates fallback lazily, emits security event, documents cross-node isolationTestLocalOnlyMode— never callsget_redis_client, consistent within processTestGetCacheContract— default mode isSTRICT_DISTRIBUTED, each mode returns correct typeTestCachedVerifyBackwardCompat— never touches Redis, caches correctly, no_degraded_modetagFull regression: 1198 passed, 82 skipped — zero regressions.
Backward Compatibility
VerificationCachedirectlycached_verify()LOCAL_ONLY— same behaviorRedisCache()(no mode arg)CacheBackendUnavailableError— breaking for distributed callers who relied on silent fallbackget_cache(use_redis=True)Summary by CodeRabbit
Bug Fixes
New Features
Refactor
Tests / Chores