refactor: eager-write lot cache for credit_balances#1122
Conversation
The previous credit_balances table was a single integer sum per address, recomputed lazily inside the read path and written back from there. Every get_credit_balance call could trigger a full credit_history FIFO walk plus an UPDATE, which made the read path expensive and made transactions take write locks on cache rows they only meant to inspect. This change replaces the table with a per-lot cache: one row per granting credit_history entry, keyed by (address, credit_ref, credit_index), with an amount_remaining column. Writers (distribution, expense, transfer) maintain it eagerly: - distributions insert one lot per recipient; - expenses drain the address's still-valid lots in emission order ((message_timestamp, credit_ref, credit_index) ASC) under FOR UPDATE, decrementing amount_remaining in place; - transfers drain the sender the same way and insert recipient lots with each portion capped at min(source_expiration, requested_expiration). Reads collapse to a SUM over rows whose expiration_date is NULL or in the future, with no FIFO walk and no write-back. The repair_node startup hook rebuilds the lot cache from credit_history by replaying it in emission order, so the migration can drop the old cache without backfilling: the empty new table is populated on the next start. Emission-first FIFO and the min(source, requested) transfer cap are preserved verbatim. No consensus-level behaviour change. The drain-order change to expiration-first is intentionally out of scope; it will land in a follow-up PR.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
This PR replaces the lazy-write credit_balances cache with an eager-write per-lot cache, making reads a simple SUM over still-valid lots instead of a FIFO walk. The logic is carefully preserved: emission-order FIFO, expiration filtering on read, and the min(source_expiration, requested_expiration) transfer cap are all maintained. The migration is correct (drop old table, create new one with FK to credit_history) and the repair hook ensures safe bootstrap. The test updates verify the key behavioral change (lots persist after expiration, read-time filtering handles it). No bugs, security issues, or semantic regressions were found. The code quality is high with clear documentation throughout.
CI surfaced two failure modes: 1. ForeignKeyViolation on the eager-write path. The lot insert ran inside the write functions before _bulk_insert_credit_history committed the referenced row, so the FK from credit_balances to credit_history exploded on every distribution and transfer. Drop the FK. The lot cache is a derived projection of credit_history, not authoritative state. Integrity is preserved by the eager writers (lot + history land in the same transaction) and by _repair_credit_balances on startup; a DB-level FK adds ordering coupling without runtime correctness benefit. 2. Tests that seed credit_history directly via session.add and then read through get_credit_balance / get_credit_balance_with_details saw 0 because the lot cache was never populated. Update the two test helpers (_insert_credit_history_entries in test_credit_balances.py, plus the two direct AlephCreditHistoryDb adds in test_process_stores.py) to call _rebuild_credit_lots_for_address on each touched address after seeding, mirroring what repair_node does on startup.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Well-structured refactor from lazy single-row cache to eager per-lot cache. The core FIFO logic and expiration semantics are preserved correctly. Main concern: get_credit_balances and count_credit_balances now exclude zero-balance addresses even with min_balance=0 (behaviour change from old code). Minor: missing type annotation, aggregate FILTER() syntax may be novel in this codebase.
src/aleph/db/accessors/balances.py (line 447): .having(balance_expr >= max(min_balance, 1)) — the old code returned all addresses when min_balance=0 (no filter). The new code excludes addresses with balance=0. If any caller relies on listing zero-balance addresses, this is a breaking change. Either preserve the old semantics (drop the max(..., 1) when min_balance=0) or document the change.
src/aleph/db/accessors/balances.py (line 471): Same max(min_balance, 1) issue as get_credit_balances — count_credit_balances with min_balance=0 now excludes zero-balance addresses, changing old behaviour.
src/aleph/db/accessors/balances.py (line 359): Missing type annotation on cutoff parameter. It accepts both dt.datetime and SQLAlchemy column expressions (via func.now()). A Union[dt.datetime, ColumnElement] or similar annotation would help readability.
src/aleph/db/accessors/balances.py (line 417): func.sum().filter(_valid_lot_filter(func.now())) uses SQLAlchemy's aggregate filter() clause. This generates SUM(...) FILTER (WHERE ...), which is PostgreSQL-specific and may not be supported by all SQLAlchemy versions. Consider using a subquery with a standard WHERE clause instead for better portability, or confirm this is tested against the target SQLAlchemy version.
deployment/migrations/versions/0057_a8c3d9f1b2e4_credit_balances_lot_cache.py (line 31): The migration drops and recreates credit_balances. Between migration completion and the repair hook running on next startup, reads against the new table will return 0 for all addresses. The PR description documents this (repair runs on startup), but it's worth ensuring the deploy order guarantees repair_node runs before any balance-dependent processing.
Drop max(min_balance, 1) in get_credit_balances / count_credit_balances so the default min_balance=0 keeps returning addresses with zero remaining credit (matching the pre-refactor behaviour). The public API still gates min_balance >= 1 at the OpenAPI schema layer. Annotate _valid_lot_filter's cutoff so callers can see it accepts both dt.datetime and a SQLAlchemy column expression.
|
Thanks for the review. Pushed d4b5c4a and notes on the others below. 1, 2 — Dropped External API access is still gated by the 3 — Missing type annotation on Annotated as 4 — pyaleph is PostgreSQL-only across the board ( 5 — Empty-cache window after migration — Acknowledged, no code change True that between |
foxpatch-aleph
left a comment
There was a problem hiding this comment.
This PR cleanly replaces the lazy-write credit_balances cache with an eager-write per-lot table. The FIFO consumption order, expiration capping on transfers, and whitelisted-sender bypass are all preserved verbatim. The code is well-documented, the migration is clean (drop-and-recreate with repair-from-history), locking via FOR UPDATE is correct, and ON CONFLICT DO NOTHING provides idempotent replay. The change set is focused and the PR description accurately scopes what's in vs. out (emission-order drain preserved, expiration-order drain deferred to a follow-up). No bugs found. Minor nits: the int(result or 0) redundancy in get_credit_balance, and adding an explicit test for _repair_credit_balances (the coordinator, not just _rebuild_credit_lots_for_address) would be nice-to-have.
src/aleph/db/accessors/balances.py (line 379): int(result or 0) is redundant — func.coalesce(func.sum(...), 0) already guarantees a non-None result. int(result) suffices.
src/aleph/repair.py (line 115): Consider adding a test for the _repair_credit_balances coordinator function (not just the per-address _rebuild_credit_lots_for_address). For example, verify that calling it rebuilds multiple addresses' lots correctly from history.
Summary
Replace the
credit_balancescache with a per-lot table maintained eagerly by writers, so reads collapse to aSUMand never walkcredit_historyor write back to the cache.credit_historyentry, keyed by(address, credit_ref, credit_index), withamount_remaining,expiration_date,message_timestamp.FOR UPDATE; transfers drain the sender and insert recipient lots capped atmin(source_expiration, requested_expiration).get_credit_balancebecomesSUM(amount_remaining) WHERE expiration_date IS NULL OR expiration_date > now().repair_noderebuilds the lot cache fromcredit_historyon startup, so the migration can drop the old cache without backfilling.Emission-first FIFO and the
min(source, requested)transfer cap are preserved verbatim. No consensus-level behaviour change. Switching to expiration-first drain is intentionally out of scope and will land in a follow-up PR.Why not the bucket cache from #1121
That PR couples the cache architecture (good) with a drain-order change from emission-first to expiration-first (consensus-level). This PR splits the two so the operational fix can ship now and the policy change can take its own review cycle.
Scope comparison at the schema layer:
infinity.py(PG sentinel adapter)Test plan
hatch run linting:allcleanhatch run testing:test tests/db/test_credit_balances.py -vcleancredit_historycredit_balancesfor writes