Skip to content

refactor: eager-write lot cache for credit_balances#1122

Merged
odesenfans merged 3 commits into
mainfrom
od/credit-lot-cache
May 12, 2026
Merged

refactor: eager-write lot cache for credit_balances#1122
odesenfans merged 3 commits into
mainfrom
od/credit-lot-cache

Conversation

@odesenfans
Copy link
Copy Markdown
Collaborator

Summary

Replace the credit_balances cache with a per-lot table maintained eagerly by writers, so reads collapse to a SUM and never walk credit_history or write back to the cache.

  • One row per granting credit_history entry, keyed by (address, credit_ref, credit_index), with amount_remaining, expiration_date, message_timestamp.
  • Distributions insert one lot per recipient; expenses drain the address's still-valid lots in emission order under FOR UPDATE; transfers drain the sender and insert recipient lots capped at min(source_expiration, requested_expiration).
  • get_credit_balance becomes SUM(amount_remaining) WHERE expiration_date IS NULL OR expiration_date > now().
  • repair_node rebuilds the lot cache from credit_history on startup, so the migration can drop the old cache without backfilling.

Emission-first FIFO and the min(source, requested) transfer cap are preserved verbatim. No consensus-level behaviour change. Switching to expiration-first drain is intentionally out of scope and will land in a follow-up PR.

Why not the bucket cache from #1121

That PR couples the cache architecture (good) with a drain-order change from emission-first to expiration-first (consensus-level). This PR splits the two so the operational fix can ship now and the policy change can take its own review cycle.

Scope comparison at the schema layer:

This PR (lot cache) #1121 (bucket cache)
Lines changed ~440 / 308 ~1058 / 395
New toolkit modules none infinity.py (PG sentinel adapter)
Test delta ~30 lines ~647 lines (new corner cases)
Behavior change none drain order flips

Test plan

  • hatch run linting:all clean
  • hatch run testing:test tests/db/test_credit_balances.py -v clean
  • Full suite green
  • Manual: start a fresh node, confirm the repair hook rebuilds lots from existing credit_history
  • Manual: confirm reads no longer touch credit_balances for writes

The previous credit_balances table was a single integer sum per address,
recomputed lazily inside the read path and written back from there. Every
get_credit_balance call could trigger a full credit_history FIFO walk plus
an UPDATE, which made the read path expensive and made transactions take
write locks on cache rows they only meant to inspect.

This change replaces the table with a per-lot cache: one row per granting
credit_history entry, keyed by (address, credit_ref, credit_index), with
an amount_remaining column. Writers (distribution, expense, transfer)
maintain it eagerly:

- distributions insert one lot per recipient;
- expenses drain the address's still-valid lots in emission order
  ((message_timestamp, credit_ref, credit_index) ASC) under FOR UPDATE,
  decrementing amount_remaining in place;
- transfers drain the sender the same way and insert recipient lots with
  each portion capped at min(source_expiration, requested_expiration).

Reads collapse to a SUM over rows whose expiration_date is NULL or in the
future, with no FIFO walk and no write-back. The repair_node startup hook
rebuilds the lot cache from credit_history by replaying it in emission
order, so the migration can drop the old cache without backfilling: the
empty new table is populated on the next start.

Emission-first FIFO and the min(source, requested) transfer cap are
preserved verbatim. No consensus-level behaviour change. The drain-order
change to expiration-first is intentionally out of scope; it will land in
a follow-up PR.
foxpatch-aleph
foxpatch-aleph previously approved these changes May 12, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR replaces the lazy-write credit_balances cache with an eager-write per-lot cache, making reads a simple SUM over still-valid lots instead of a FIFO walk. The logic is carefully preserved: emission-order FIFO, expiration filtering on read, and the min(source_expiration, requested_expiration) transfer cap are all maintained. The migration is correct (drop old table, create new one with FK to credit_history) and the repair hook ensures safe bootstrap. The test updates verify the key behavioral change (lots persist after expiration, read-time filtering handles it). No bugs, security issues, or semantic regressions were found. The code quality is high with clear documentation throughout.

CI surfaced two failure modes:

1. ForeignKeyViolation on the eager-write path. The lot insert ran inside
   the write functions before _bulk_insert_credit_history committed the
   referenced row, so the FK from credit_balances to credit_history
   exploded on every distribution and transfer.

   Drop the FK. The lot cache is a derived projection of credit_history,
   not authoritative state. Integrity is preserved by the eager writers
   (lot + history land in the same transaction) and by _repair_credit_balances
   on startup; a DB-level FK adds ordering coupling without runtime
   correctness benefit.

2. Tests that seed credit_history directly via session.add and then read
   through get_credit_balance / get_credit_balance_with_details saw 0
   because the lot cache was never populated.

   Update the two test helpers (_insert_credit_history_entries in
   test_credit_balances.py, plus the two direct AlephCreditHistoryDb adds
   in test_process_stores.py) to call _rebuild_credit_lots_for_address
   on each touched address after seeding, mirroring what repair_node does
   on startup.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured refactor from lazy single-row cache to eager per-lot cache. The core FIFO logic and expiration semantics are preserved correctly. Main concern: get_credit_balances and count_credit_balances now exclude zero-balance addresses even with min_balance=0 (behaviour change from old code). Minor: missing type annotation, aggregate FILTER() syntax may be novel in this codebase.

src/aleph/db/accessors/balances.py (line 447): .having(balance_expr >= max(min_balance, 1)) — the old code returned all addresses when min_balance=0 (no filter). The new code excludes addresses with balance=0. If any caller relies on listing zero-balance addresses, this is a breaking change. Either preserve the old semantics (drop the max(..., 1) when min_balance=0) or document the change.

src/aleph/db/accessors/balances.py (line 471): Same max(min_balance, 1) issue as get_credit_balancescount_credit_balances with min_balance=0 now excludes zero-balance addresses, changing old behaviour.

src/aleph/db/accessors/balances.py (line 359): Missing type annotation on cutoff parameter. It accepts both dt.datetime and SQLAlchemy column expressions (via func.now()). A Union[dt.datetime, ColumnElement] or similar annotation would help readability.

src/aleph/db/accessors/balances.py (line 417): func.sum().filter(_valid_lot_filter(func.now())) uses SQLAlchemy's aggregate filter() clause. This generates SUM(...) FILTER (WHERE ...), which is PostgreSQL-specific and may not be supported by all SQLAlchemy versions. Consider using a subquery with a standard WHERE clause instead for better portability, or confirm this is tested against the target SQLAlchemy version.

deployment/migrations/versions/0057_a8c3d9f1b2e4_credit_balances_lot_cache.py (line 31): The migration drops and recreates credit_balances. Between migration completion and the repair hook running on next startup, reads against the new table will return 0 for all addresses. The PR description documents this (repair runs on startup), but it's worth ensuring the deploy order guarantees repair_node runs before any balance-dependent processing.

Drop max(min_balance, 1) in get_credit_balances / count_credit_balances so
the default min_balance=0 keeps returning addresses with zero remaining
credit (matching the pre-refactor behaviour). The public API still gates
min_balance >= 1 at the OpenAPI schema layer.

Annotate _valid_lot_filter's cutoff so callers can see it accepts both
dt.datetime and a SQLAlchemy column expression.
@odesenfans
Copy link
Copy Markdown
Collaborator Author

Thanks for the review. Pushed d4b5c4a and notes on the others below.

1, 2 — min_balance=0 HAVING filter ✅ Fixed in d4b5c4a

Dropped max(min_balance, 1) in both get_credit_balances and count_credit_balances. With the default min_balance=0 they now use having balance_expr >= 0, which keeps addresses whose lots have all been drained to zero. That matches the pre-refactor behaviour where the lazy cache often held rows with balance=0 after a full drain.

External API access is still gated by the min_balance: 1 minimum on the OpenAPI schema, so this only affects internal Python callers using the default.

3 — Missing type annotation on cutoff ✅ Fixed in d4b5c4a

Annotated as Union[dt.datetime, ColumnElement[dt.datetime]].

4 — func.sum().filter() portability — Not changing

pyaleph is PostgreSQL-only across the board (pg_insert, JSONB, COPY-from-STDIN, partial indexes). SUM(...) FILTER (WHERE ...) is standard SQL since 2003 and SQLAlchemy 2.0.49 (what this repo runs) supports aggregate.filter() cleanly. A subquery rewrite would just add a wrapper without buying anything.

5 — Empty-cache window after migration — Acknowledged, no code change

True that between alembic upgrade head and the daemon's repair_node running, the cache is empty. But repair_node runs synchronously during startup before the HTTP listener accepts traffic, so on a single-instance deploy there is no read-serving window with an empty cache. Multi-instance rolling restarts on schema changes are out of scope for this PR (the migration is destructive anyway). The deploy order is the standard one: run migration, restart daemon, traffic resumes once startup completes.

Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR cleanly replaces the lazy-write credit_balances cache with an eager-write per-lot table. The FIFO consumption order, expiration capping on transfers, and whitelisted-sender bypass are all preserved verbatim. The code is well-documented, the migration is clean (drop-and-recreate with repair-from-history), locking via FOR UPDATE is correct, and ON CONFLICT DO NOTHING provides idempotent replay. The change set is focused and the PR description accurately scopes what's in vs. out (emission-order drain preserved, expiration-order drain deferred to a follow-up). No bugs found. Minor nits: the int(result or 0) redundancy in get_credit_balance, and adding an explicit test for _repair_credit_balances (the coordinator, not just _rebuild_credit_lots_for_address) would be nice-to-have.

src/aleph/db/accessors/balances.py (line 379): int(result or 0) is redundant — func.coalesce(func.sum(...), 0) already guarantees a non-None result. int(result) suffices.

src/aleph/repair.py (line 115): Consider adding a test for the _repair_credit_balances coordinator function (not just the per-address _rebuild_credit_lots_for_address). For example, verify that calling it rebuilds multiple addresses' lots correctly from history.

@odesenfans odesenfans merged commit 5fc98b3 into main May 12, 2026
4 checks passed
@odesenfans odesenfans deleted the od/credit-lot-cache branch May 12, 2026 22:27
@odesenfans odesenfans restored the od/credit-lot-cache branch May 12, 2026 22:27
@odesenfans odesenfans deleted the od/credit-lot-cache branch May 12, 2026 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants