Skip to content

perf(metrics): persist scoring metrics in dedicated tables#1150

Merged
odesenfans merged 12 commits into
mainfrom
od/persist-node-metrics
May 21, 2026
Merged

perf(metrics): persist scoring metrics in dedicated tables#1150
odesenfans merged 12 commits into
mainfrom
od/persist-node-metrics

Conversation

@odesenfans
Copy link
Copy Markdown
Collaborator

Summary

  • Replace the on-the-fly JSON-unnesting views (crn_metric_view, ccn_metric_view) with persisted tables crn_metrics / ccn_metrics indexed on (node_id, measured_at DESC). Queries become O(log n) instead of scanning every scoring message and re-unnesting its JSONB array on each API call.
  • Rows are inserted inside the same DB transaction that stores the scoring message, via a new branch in PostMessageHandler.process_post gated on (channel, sender, post_type). aleph.scoring.{addresses, channel, metrics_post_type} is a new config block.
  • Migration 0059 creates the tables, backfills them from the existing views, then drops the views. FK to messages.item_hash with ON DELETE CASCADE means FORGETs cascade automatically.

Design and plan

  • Spec: docs/superpowers/specs/2026-05-18-persist-node-metrics-design.md
  • Plan: docs/superpowers/plans/2026-05-18-persist-node-metrics.md

Test plan

  • 7 unit tests for the row builders (full payload, missing optional fields, missing required fields, non-numeric coercion, empty array, CCN full payload, non-numeric pending_messages)
  • 4 integration tests for PostMessageHandler (pass case, wrong post type, non-allowlisted sender, wrong channel)
  • 1 FORGET cascade test (ON DELETE CASCADE removes metric rows when source message is deleted)
  • 3 API contract tests for /api/v0/compute/{node_id}/metrics and /api/v0/core/{node_id}/metrics (200 + 404)
  • 2 DB-backed tests for the rewritten query_metric_crn / query_metric_ccn accessors
  • Existing tests/api/test_new_metric.py updated path through the new tables (fixture loader extended)
  • ruff, isort, black clean
  • alembic upgrade head then downgrade -1 then upgrade head cycles cleanly

Deploy notes

  • The migration drops the views as its final step. Any binary still serving traffic after the migration runs but before being restarted will 500 on the metrics endpoints. Run during a maintenance window or use a blue/green flip where new binaries are queued up before the migration completes.
  • The backfill cost scales with the volume of scoring messages currently in messages. Time it against a staging clone before rolling to production.
  • Downgrade is supported: the views are recreated verbatim and the new tables are dropped. No data loss; scoring payloads remain in messages.content JSONB.

foxpatch-aleph
foxpatch-aleph previously approved these changes May 19, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR that replaces expensive JSON-unnesting views with persisted tables for node scoring metrics. The design is sound, the migration is clean and reversible, the production code handles edge cases defensively (missing fields, non-numeric coercion, empty arrays), and test coverage is comprehensive across all layers (unit, integration, API contract, FORGET cascade). The conftest fixture loader is properly updated to backfill the new tables for existing integration tests. No security concerns or correctness bugs found.

src/aleph/db/accessors/metrics.py (line 101): _parse_ccn_result returns {} when result is empty (zip(*[]) produces nothing). The controller handles this via if not ccn: raise HTTPNotFound(), so this is safe, but worth noting if new callers are added.

src/aleph/db/accessors/metrics.py (line 127): query_metric_ccn sort_order parameter type changed from SortOrder to SortOrderForMetrics. This is consistent with how the controllers call it (via MetricsQueryParams.sort which uses SortOrderForMetrics), but differs from the old code. The runtime behavior is identical since SortOrderForMetrics is just an annotated SortOrder.

src/aleph/db/accessors/metrics.py (line 163): query_metric_crn signature keeps node_id: str (non-optional), consistent with old code. The function is only called from the API controller which always has a node_id from the URL path.

tests/web/controllers/test_metrics_api.py (line 31): Good workaround for the falsy start_date=0 issue. The comment clearly explains why start_date=1 is used instead of 0.

foxpatch-aleph
foxpatch-aleph previously approved these changes May 19, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-designed, thoroughly tested PR that replaces expensive JSONB-unnesting views with persisted tables for node scoring metrics. The architecture is clean: row builders tolerate malformed input, insert happens atomically with message processing, FK cascades handle FORGETs automatically, and the migration is fully reversible. The test coverage is excellent across unit, integration, API contract, and cascade scenarios. The only notable behavioral change is a new unconditional default ORDER BY ASC on the query functions, and the production migration may benefit from batched backfill for large datasets.

src/aleph/db/accessors/metrics.py (line 153): The old _build_metric_filter only applied ORDER BY when sort_order was explicitly provided. The new code unconditionally orders by measured_at ASC when sort_order is not DESCENDING. This changes the default return order from "database-dependent" (previously no ORDER BY when sort_order was None) to ascending. While arguably better, this is a behavioral change not called out in the PR description.

deployment/migrations/versions/0059_b9c4f1e6a2d7_persist_node_metrics.py (line 75): The backfill INSERT...SELECT runs in a single transaction and holds locks on the new tables for the duration. For large scoring corpora, consider batching (e.g., using a cursor with FETCH FORWARD) to avoid long-running locks. The deploy notes correctly recommend timing on staging first, which is sufficient for an initial rollout.

tests/db/test_metrics_persistence.py (line 247): The query_metric_crn and query_metric_ccn tests seed data and assert results, but there's no test for the empty-result case where _parse_*_result receives an empty list and returns {}. The controller handles this correctly (checks not crn and not crn["item_hash"]) but the accessor's empty-return behavior is only implicitly tested through API integration tests.

@odesenfans odesenfans force-pushed the od/persist-node-metrics branch 2 times, most recently from 3351ebb to 0bd0d35 Compare May 19, 2026 22:12
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-structured PR that replaces JSON-unnesting views with persisted tables for scoring metrics, significantly improving query performance. The code is clean, well-tested (17 new tests across unit, integration, and API contract levels), and the design choices are sound. The migration handles backfill and downgrade correctly. A few minor observations are noted below.

src/aleph/db/accessors/metrics.py (line 154): The sort_order == SortOrder.DESCENDING comparison works correctly since SortOrder is an IntEnum, but using sort_order is SortOrder.DESCENDING would be more idiomatic for enum comparison. Non-blocking.

src/aleph/db/accessors/metrics.py (line 163): Signature changed query_metric_crn from node_id: Optional[str] to node_id: str. All call sites pass a non-optional string, but if this function is ever imported externally with node_id=None, it could break at runtime. Consider keeping the Optional type to avoid breaking callers outside this commit's scope.

src/aleph/handlers/content/post.py (line 344): The scoring metrics gate uses isinstance(content.content, dict) while the balance gate above (line 294) uses content.content (truthiness). Both are functionally equivalent for the expected payload, but they're inconsistent. Consider using content.content (truthiness) here too for consistency with the existing pattern.

src/aleph/db/accessors/metrics.py (line 206): metrics = content.get("metrics") or {} — if metrics is {}, the or branch is taken (empty dict is falsy), which is a no-op downstream. Intentional and correct, but worth a brief comment explaining that an empty dict is a no-op.

foxpatch-aleph
foxpatch-aleph previously approved these changes May 19, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR replaces JSON-unnesting views with persisted tables (crn_metrics, ccn_metrics) for node scoring metrics. The approach is sound: rows are inserted in the same transaction as the scoring message via PostMessageHandler, backfilled from existing views in migration 0059, and benefit from FK cascade on FORGET. Proper whitelisting (scoring_addresses, channel, post_type) gates insertion. New composite indexes on (node_id, measured_at DESC) replace full-view scans. Tests cover builders, handler gating, FORGET cascade, API contracts, and accessors comprehensively. No correctness or security issues found.

foxpatch-aleph
foxpatch-aleph previously approved these changes May 19, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean PR that replaces expensive JSON-unnesting DB views with persisted tables. The migration strategy is well-thought-out (backfill before index build, NOT VALID FK added post-backfill). The row builders properly handle coercion failures and missing fields. Test coverage is comprehensive across unit tests, integration tests, API contract tests, and the FORGET cascade. All existing tests are updated for the new table-based accessors.

src/aleph/db/models/metrics.py (line 9): The composite index ix_crn_metrics_node_id_measured_at (with measured_at DESC) is created in the migration but not declared in the CrnMetricDb ORM model. Similarly for CcnMetricDb. If anyone generates the schema via Base.metadata.create_all (e.g. in isolated test harnesses), this index won't exist, making the query pattern in query_metric_crn/query_metric_ccn scan instead of seek. Consider adding __table_args__ with the composite index.

src/aleph/db/accessors/metrics.py (line 201): metrics = content.get("metrics") or {} — if metrics is an empty Mapping {}, it gets replaced by a new {} via the or {} clause. The subsequent isinstance(metrics, Mapping) check catches this anyway, so or {} is redundant and slightly misleading. Consider metrics = content.get("metrics") alone.

tests/web/controllers/test_metrics_api.py (line 77): The test_get_crn_metrics_returns_persisted_rows test uses the ccn_api_client fixture. This works if the fixture provides a generic HTTP client, but the name is confusing since it's testing the CRN endpoint (/api/v0/compute/{node_id}/metrics). Naming nit only.

src/aleph/handlers/content/post.py (line 344): The scoring metrics branch uses isinstance(content.content, dict) while the adjacent balances/credit_balances branches use a truthiness check content.content. Consider being consistent — either use isinstance everywhere or truthiness everywhere.

odesenfans added 10 commits May 20, 2026 14:05
Add _build_crn_rows and _build_ccn_rows helpers that convert a scoring
message's JSON content into lists of dicts ready for bulk insert. Both
helpers skip entries missing required fields (node_id, measured_at) and
coerce non-numeric values to None instead of raising.
Switch query_metric_crn and query_metric_ccn from select_from(text("*_metric_view"))
to ORM column selects against CrnMetricDb / CcnMetricDb, and remove the now-dead
_build_metric_filter helper.
Extend PostMessageHandler with three new constructor params (scoring_addresses,
scoring_channel, scoring_metrics_post_type) and a scoring branch in process_post
that calls insert_node_metrics when a POST matches the allowlist and post type.
Wire the config values through BaseMessageHandler. Update all test call sites.
Add API contract tests for /api/v0/compute/{node_id}/metrics (CRN) and
/api/v0/core/{node_id}/metrics (CCN) confirming that both endpoints serve
rows from the new CrnMetricDb/CcnMetricDb tables and return 404 for unknown
nodes.

Fix two related regressions introduced when Task 6 switched the query
accessors from the old JSONB views to the new tables:
- Add default ORDER BY measured_at ASC to query_metric_ccn and
  query_metric_crn so that callers without an explicit sort parameter get
  deterministic, ascending results (matching the implicit ordering the old
  views provided).
- Extend _load_fixtures in tests/api/conftest.py to call insert_node_metrics
  for aleph-network-metrics messages so that the existing test_new_metric.py
  tests can find data in the new tables.
The original ordering created both tables with their indexes and FK
to messages first, then ran the INSERT. On a populated CCN, the
backfill fans out via jsonb_array_elements to roughly
scoring_messages * nodes_per_msg rows, and every insert paid for
two B-tree updates plus an FK lookup against the 5M-row messages
table. Single-transaction alembic ran the whole thing without
checkpoint relief, so WAL growth and random IO dominated.

Reorder upgrade() to:
  1. create the two heap tables (no indexes, no FK)
  2. run both INSERT...SELECT against bare tables
  3. CREATE INDEX in one shot on populated tables (sorted bulk build)
  4. ADD CONSTRAINT ... NOT VALID then VALIDATE CONSTRAINT

The end state is identical: same indexes, same FK semantics
(VALIDATE-d FK enforces existing rows and future inserts), same
ON DELETE CASCADE. Tested via existing metrics-persistence and
metrics-API suites; the FK cascade test still passes.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is well-structured and thoroughly tested. The migration is carefully ordered for performance (heap insert → index → FK). The row builders handle edge cases (missing optional fields, non-numeric coercion, non-dict entries) correctly. The integration tests cover the gate conditions (post type, sender, channel) and the FORGET cascade. A few minor observations: (1) SortOrderForMetrics is imported but never used in the accessors' type annotations — they still use SortOrder directly in the type hints of some tests and the actual comparison logic. (2) The API endpoint controller will return a 200 with empty metrics lists rather than a 404 when no matching rows exist; currently only the 404 test verifies a genuinely empty non-existent node_id, but there's no test for a node with zero metric rows returning 200 with empty arrays. (3) The migration's backfill INSERT copies from the views before dropping them, but the views filter by hardcoded sender/channel — this is correct for the design but means the backfill won't pick up rows inserted by future insert_node_metrics calls during the migration window (which is fine since those tables don't exist yet).

src/aleph/db/accessors/metrics.py (line 9): SortOrderForMetrics is imported but the function signatures use Optional[SortOrderForMetrics] which resolves to Optional[SortOrder] at runtime. This is fine but a minor inconsistency — the import hints at a custom type that isn't actually used as a standalone type in these functions.

src/aleph/db/accessors/metrics.py (line 128): Comparing sort_order == SortOrder.DESCENDING using an IntEnum works correctly, but the type annotation says SortOrderForMetrics which expands to Annotated[SortOrder, BeforeValidator(...)]. The runtime value won't be the annotated type — it'll be the parsed SortOrder enum. No bug, but the annotation is slightly misleading.

src/aleph/handlers/content/post.py (line 344): The check isinstance(content.content, dict) guards against None/non-dict content. However insert_node_metrics already handles non-dict content defensively. The double check is harmless but worth noting for future simplification.

tests/web/controllers/test_metrics_api.py (line 126): The test_get_crn_metrics_404_when_node_unknown test passes a non-existent node_id and expects 404. This is correct for the existing controller behavior. However there are no tests for the 200-with-empty-metrics case (node exists but has zero metric rows). Worth adding for completeness.

deployment/migrations/versions/0059_b9c4f1e6a2d7_persist_node_metrics.py (line 1): The migration is well-structured with the performance-aware ordering (heap insert → index → FK NOT VALID → VALIDATE). The item_hash columns in the migration use sa.String without length, while the ORM model also uses String without length — consistent. One minor note: the crn_metrics and ccn_metrics tables created by the migration have item_hash columns without an explicit nullable=False constraint on the sa.Column definitions (lines 33, 44), though the model classes do have nullable=False. Looking more carefully, sa.Column(..., nullable=False) IS actually used on item_hash in both tables — my mistake. No issue here.

tests/messages/test_scoring_metrics.py (line 161): Comprehensive test coverage for the message handler gating logic. The test_forget_cascades_to_metric_rows test verifies cascade both before and after commit. However, note that the FK cascade happens at the DB level during session.delete(existing) + session.flush() — the test's second session_factory() context re-reads the cascade outcome. This correctly validates the ON DELETE CASCADE behavior.

…d_at

The unpartitioned tables produced by the earlier 0059 grew to ~88M
rows / 30GB+ on epyc, covering 3 years of scoring history with no way
to expire old data short of a full DELETE + VACUUM. Reshape the tables
to time-range partitioning so retention becomes a DROP PARTITION
instead of a long DELETE, and bound the initial backfill to a 12-month
horizon.

Schema
------
* crn_metrics and ccn_metrics are now PARTITION BY RANGE (measured_at)
* measured_at is TIMESTAMPTZ (was DOUBLE PRECISION epoch); PK is
  composite (id, measured_at) because PG requires the partition key
  in any PK on a partitioned table
* Initial coverage: 13 monthly partitions [now-12mo, now+1mo) plus a
  DEFAULT catch-all
* FK to messages(item_hash) ON DELETE CASCADE is added after backfill.
  PG does not yet support NOT VALID FKs on partitioned tables so the
  ADD CONSTRAINT validates in one pass — still cheaper than the
  per-INSERT FK check we'd pay with the constraint present during
  the bulk SELECT
* The 0059 migration also seeds a 'metrics_partition' row in cron_jobs

Cron
----
* New MetricsPartitionCronJob runs daily and (a) pre-creates the next
  N months of partitions, (b) DETACHes + DROPs partitions whose upper
  bound is past the retention cutoff, (c) warns if the DEFAULT
  partition ever gets rows (operational signal that lookahead is too
  short)
* Both responsibilities are idempotent; steady-state runs are no-ops
* aleph.toolkit.partitions holds the shared month-bounds / naming
  helpers used by both migration and cron so they agree on
  partition_name() formatting

Application layer
-----------------
* CrnMetricDb/CcnMetricDb.measured_at switched to dt.datetime
* _build_crn_rows/_build_ccn_rows convert epoch float -> UTC datetime
* query_metric_*() still accept and return epoch floats (unchanged API
  contract); conversion happens at the accessor boundary
* config.aleph.scoring grows retention_months and
  partition_lookahead_months keys

Tests
-----
* 27 metrics-related tests pass: builder/accessor adapted for
  TIMESTAMPTZ; new cron tests cover steady-state no-op, create-missing-
  future, drop-past-cutoff, DEFAULT-fills warning, partition routing;
  FK CASCADE still verified through tests/messages/test_scoring_metrics
foxpatch-aleph
foxpatch-aleph previously approved these changes May 20, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR that replaces JSON-unnesting views with partitioned persistent tables for scoring metrics. The design is clean with proper ON DELETE CASCADE, a cron job for partition maintenance, and thorough test coverage including unit, integration, API contract, and migration tests. The only real issue is a misleading docstring in the migration that claims NOT VALID + VALIDATE approach when the code actually adds the FK directly (because PG 17 doesn't support NOT VALID on partitioned tables). Minor style inconsistencies like query_metric_crn using a non-optional node_id parameter while query_metric_ccn uses Optional[str] are non-blocking.

deployment/migrations/versions/0059_b9c4f1e6a2d7_persist_node_metrics.py (line 24): Docstring says FK is added "NOT VALID then VALIDATE" but the actual code (lines 182-199) uses plain ADD CONSTRAINT without NOT VALID. Lines 178-180 explain that PG 17 doesn't support NOT VALID on partitioned tables. Please update the docstring to reflect reality (e.g. remove "added NOT VALID then VALIDATE" or replace with "added directly after backfill").

src/aleph/db/accessors/metrics.py (line 204): node_id is declared as str (non-optional) here, but query_metric_ccn at line 160 declares it as Optional[str] = None. Consider making both signatures consistent since the web controller always provides a node_id. Non-blocking.

src/aleph/db/accessors/metrics.py (line 65): The node_id check is if node_id is None but could also get an empty string "" which would pass through and be stored. The migration's backfill had WHERE node_id IS NOT NULL which wouldn't catch empty strings. Consider if not node_id for consistency. Minor edge case.

Two fixes from review of the partitioning rework:

* Migration docstring claimed FK was added NOT VALID + VALIDATE, but
  PG (as of 17) does not support NOT VALID on partitioned tables and
  the code does plain ADD CONSTRAINT after backfill. Update the
  docstring to match the code.

* _build_crn_rows / _build_ccn_rows skipped entries with node_id IS
  NULL but accepted empty-string node_ids, which the API can never
  look up. Tighten to 'if not node_id' so missing and empty are
  treated the same way. Add a regression test.

Pre-existing typing asymmetry between query_metric_crn (node_id: str)
and query_metric_ccn (node_id: Optional[str]) is left alone here;
not introduced by this PR and out of scope for the partitioning work.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ autodev could not generate a review for this PR (empty output from the model).

Click Re-request review to retry.

@odesenfans odesenfans merged commit 7bd4e50 into main May 21, 2026
4 checks passed
@odesenfans odesenfans deleted the od/persist-node-metrics branch May 21, 2026 13:09
odesenfans added a commit that referenced this pull request May 21, 2026
…re (#1159)

PR #1150 added required scoring_addresses, scoring_channel, and
scoring_metrics_post_type args to PostMessageHandler.__init__. The
forget_handler fixture in tests/permissions/test_forget_permissions.py
was added in PR #1156 against a base that already had those required
args but did not pass them, so mypy on main fails with:

  Missing positional arguments "scoring_addresses", "scoring_channel",
  "scoring_metrics_post_type" in call to "PostMessageHandler"

Match the placeholder values used by the equivalent fixture in
tests/message_processing/test_process_forgets.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants