Transparent INSERT into compressed partitions (DML P1) #34

Draft
SferaDev wants to merge 3 commits into fix/epoch-encoding from feat/compressed-insert

SferaDev commented Jun 12, 2026

Member

What

Removes the INSERT-side read-only limitation on compressed partitions. INSERTs (direct or routed through the parent) land in the partition heap at native heap speed — no decompression, no segment rewrite — and every read path unions the compressed segments with the uncompressed heap tail:

  • DeltaXDecompress/DeltaXAppend gain a Phase 3 heap-tail scan (leader-only in parallel plans), with the full plan qual re-enabled when a tail exists (heap rows are never batch-filtered). A begin-time guard errors when a parallel plan could not drive the leader's copy (parallel_leader_participation=off) rather than dropping the tail.
  • DeltaXCount/DeltaXMinMax fold loose heap rows into their metadata-derived results at exec time (under the executor snapshot), so those pushdowns stay enabled.
  • DeltaXAgg bails to plain Agg over heap-tail-aware scans when any involved compressed partition has loose rows; a begin-time guard errors on the stale-cached-plan race instead of dropping rows.
  • Top-N pushdown / pathkey claims are disabled at plan time when a tail exists; a custom-private marker + relcache invalidation on the heap's empty→non-empty transition catch stale cached plans (pathkeys → error, Top-N → silent fallback to the full scan).
  • INSERT ... ON CONFLICT stays rejected (conflict inference cannot see rows inside segments) — including when smuggled in through a data-modifying CTE under a top-level SELECT. UPDATE/DELETE behavior is unchanged — they land in the follow-up P2/P2.5 PR.
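
A minimal sketch of the read-side model described in the list above, written as a toy in-memory Python model and not as the extension's executor code. `Segment`, `Partition`, and their methods are hypothetical names chosen for illustration. The sketch only shows the invariant: every read path returns the union of the compressed segments and the heap tail, and metadata-derived answers (count, min/max) fold the loose rows in.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional, Tuple


@dataclass
class Segment:
    """Toy stand-in for a compressed segment with compress-time metadata."""
    rows: List[int]

    @property
    def count(self) -> int:
        return len(self.rows)

    @property
    def minmax(self) -> Tuple[int, int]:
        return (min(self.rows), max(self.rows))


@dataclass
class Partition:
    """Toy stand-in for a compressed partition plus its uncompressed heap tail."""
    segments: List[Segment] = field(default_factory=list)
    heap_tail: List[int] = field(default_factory=list)

    def insert(self, value: int) -> None:
        # INSERT only appends to the heap; no segment is read or rewritten.
        self.heap_tail.append(value)

    def scan(self, qual: Callable[[int], bool]) -> Iterator[int]:
        # Segments first, then the heap tail. The full qual is applied to
        # heap rows because they were never batch-filtered.
        for seg in self.segments:
            yield from (v for v in seg.rows if qual(v))
        yield from (v for v in self.heap_tail if qual(v))

    def count(self) -> int:
        # Metadata-derived count plus the loose rows.
        return sum(seg.count for seg in self.segments) + len(self.heap_tail)

    def minmax(self) -> Optional[Tuple[int, int]]:
        lows = [seg.minmax[0] for seg in self.segments] + self.heap_tail
        highs = [seg.minmax[1] for seg in self.segments] + self.heap_tail
        if not lows:
            return None
        return (min(lows), max(highs))


part = Partition(segments=[Segment([1, 5, 9]), Segment([20, 30])])
part.insert(-4)   # a late-arriving row
part.insert(42)
```

In this model `part.count()` and `part.minmax()` stay answerable from metadata plus a pass over the tail, which is the reason the count/min-max pushdowns can stay enabled while a tail exists.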

deltax_compact_partition() folds loose rows into new segments appended to the existing companion tables (colstats/blooms/text_lengths/valbitmap kept exact or dropped, never under-covering) and truncates the loose region; the background worker compacts automatically each cycle. Compaction re-validates the catalog row under its AccessExclusive lock (closes a compact-vs-concurrent-decompress race), self-heals dead loose-row pages left behind by a REPEATABLE READ compaction (autovacuum is off on compressed partitions), and refuses json_extract tables with a clear remedy (decompress + recompress) instead of a raw SQL error.
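
The compaction flow above can be sketched as a toy Python model. Everything here is an assumption made for illustration: the dictionary-based catalog, `new_partition`, `compact_partition`, the `segment_size` parameter, and the use of `threading.Lock` as a stand-in for the AccessExclusive lock. The point it illustrates is the re-validation of the catalog row after the lock is granted, followed by appending new segments and emptying the loose region.

```python
import threading
from typing import Dict, List


def new_partition(segments: List[List[int]], heap_tail: List[int],
                  is_compressed: bool = True) -> dict:
    """Build a toy catalog entry (hypothetical layout)."""
    return {
        "is_compressed": is_compressed,
        "lock": threading.Lock(),   # stand-in for the AccessExclusive lock
        "segments": segments,
        "heap_tail": heap_tail,
    }


def compact_partition(catalog: Dict[str, dict], name: str,
                      segment_size: int = 3) -> int:
    """Fold loose rows into new segments; return how many rows were folded."""
    part = catalog[name]
    if not part["is_compressed"]:          # cheap unlocked check
        return 0
    with part["lock"]:
        # Re-validate under the lock: a concurrent decompress may have
        # committed between the unlocked check and the lock grant, in which
        # case the heap holds real data, not loose rows.
        if not part["is_compressed"]:
            return 0
        loose = list(part["heap_tail"])
        if not loose:
            return 0
        # Append new segments after the existing ones; never rewrite them.
        for start in range(0, len(loose), segment_size):
            part["segments"].append(loose[start:start + segment_size])
        part["heap_tail"].clear()          # stand-in for truncating the loose region
        return len(loose)


catalog = {
    "p1": new_partition([[1, 2, 3]], [4, 5, 6, 7]),
    "p2": new_partition([], [8, 9], is_compressed=False),
}
folded_p1 = compact_partition(catalog, "p1")
folded_p2 = compact_partition(catalog, "p2")
```

In the toy model, `p2` is left untouched because it is not marked compressed, which mirrors the case the re-validation is meant to protect.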

Why

Compressed partitions were read-only; out-of-range or late-arriving data either errored or required manual decompress/recompress. With this change, a user inserting into a deltax table doesn't need to know whether the target partition is compressed — INSERT latency is plain-Postgres heap latency.

Design

dev/docs/COMPRESSED_DML.md (ships in this PR) carries the full DML program design (P1 heap-tail, P2 decompose-on-write, P2.5 tombstones).

Known tradeoff

Compaction ends with TRUNCATE of the loose region — like compress-time TRUNCATE, this is not MVCC-safe against a concurrent REPEATABLE READ reader whose snapshot predates the compaction commit (PostgreSQL's documented TRUNCATE caveat). This is the same tradeoff the existing compress path already makes; called out here for visibility.

Stacking

Based on fix/epoch-encoding (#33): this feature's plain-PG twin-table validation tests exposed those latent epoch bugs, and the heap-tail tests require the fixes to pass. Retarget to main once #33 merges.

Testing

  • make build clean; make clippy zero warnings; cargo fmt --check clean
  • make test (PG 17) and make test PG_MAJOR=18 green — 568 unit tests each
  • make integration-test PG_VERSIONS=17: 361/361 passed, including the new tests/test_compressed_insert.py (twin-table validation against plain PostgreSQL: routed + direct inserts, parallel scans, count/minmax folding, wCTE ON CONFLICT rejection, compaction, worker auto-compaction)

Provenance

Extracted from a longer perf session (branch perf/clickhouse-gap-session); UPDATE/DELETE (P2) and tombstone DELETE (P2.5) follow in a separate stacked PR. An independent correctness/cleanup pass tightened the stale-plan and compaction concurrency guards (see commit history).

SferaDev marked this pull request as ready for review June 12, 2026 13:29
tsg commented Jun 12, 2026

Member

This is a big one and we need it, but I want to do a bit of research myself on the possible approaches.

SferaDev added 3 commits June 12, 2026 23:11
Remove the INSERT-side read-only limitation on compressed partitions.
INSERTs (direct or parent-routed) land in the partition heap at native
heap speed — no decompression, no segment rewrite. Every read path
unions the compressed segments with the uncompressed heap tail:

- DeltaXDecompress/DeltaXAppend gain a Phase 3 heap-tail scan (leader-
  only in parallel plans), with the full plan qual re-enabled when a
  tail exists (heap rows are never batch-filtered).
- DeltaXCount/DeltaXMinMax fold loose heap rows into their metadata-
  derived results at exec time, so those pushdowns stay enabled.
- DeltaXAgg bails to plain Agg over heap-tail-aware scans when any
  involved compressed partition has loose rows (its columnar
  accumulators cannot ingest row-form tuples); a begin-time guard
  errors on the stale-cached-plan race instead of dropping rows.
- Top-N pushdown and pathkey claims are disabled at plan time when a
  tail exists; a [-4, flag] custom-private marker + relcache
  invalidation on the heap's empty->non-empty transition (sent by the
  leaf trigger's INSERT branch) catch stale cached plans.
- INSERT ... ON CONFLICT stays rejected (conflict inference cannot see
  rows inside segments). UPDATE/DELETE stay rejected unchanged.

deltax_compact_partition() folds loose rows into new segments appended
to the existing companion tables (colstats/blooms/text_lengths/
valbitmap kept exact or dropped, never under-covering; catalog
counters, column_minmax, HLL/ndistinct refreshed) and truncates the
loose region; the background worker compacts automatically each cycle.

Extracted from a longer perf session (DML P1, validated against a
plain-PostgreSQL twin table); UPDATE/DELETE (P2) and tombstones (P2.5)
follow in a separate PR. dev/docs/COMPRESSED_DML.md carries the full
program design.

Independent correctness/cleanup pass over the P1 transparent-INSERT work:

- Close the data-modifying-CTE bypass of the INSERT ... ON CONFLICT
  rejection: a wCTE hides the ModifyTable in PlannedStmt.subplans under a
  top-level CMD_SELECT, so the ExecutorStart hook never saw it. Conflict
  inference against empty leaf indexes would silently miss conflicts with
  segment rows (duplicate inserts under ON CONFLICT DO NOTHING). New
  integration test covers it.

- Guard the leader-only Phase 3 heap tail against
  parallel_leader_participation=off: Gather never drives the leader's plan
  copy when workers launch, so the tail would be silently dropped. The
  begin-time guard errors with a remedy instead.

- Re-validate the partition catalog row under the AccessExclusive lock in
  deltax_compact_partition(): a concurrent decompress committing between
  the unlocked is_compressed check and the lock grant would have made
  compaction treat the fully-restored heap as loose rows and truncate
  real data.

- Self-heal dead loose-row pages in the loose_rows == 0 compaction branch:
  a REPEATABLE READ compaction deletes (not truncates) and autovacuum is
  off on compressed partitions, so the heap would otherwise stay at
  nonzero blocks forever — pinning scans on the heap-tail path and
  re-taking the AEL every worker cycle.

- Reject json_extract tables in compaction with a clear remedy (synthetic
  columns have no heap presence — the cursor would fail with a raw SQL
  error) and skip them in worker auto-compaction to avoid an error loop.

- Disambiguate partition_oid_for_companion() by is_compressed: two
  deltatables with the same table name in different schemas both register
  same-named partitions in the catalog; the reverse lookup could return
  the wrong schema's heap and union foreign rows into the scan.

- Use the executor snapshot (es_snapshot) instead of GetActiveSnapshot()
  for the DeltaXCount/DeltaXMinMax heap-tail fold, matching the
  DeltaXDecompress/Append Phase 3 scan.

- Tighten the Phase 3 deform guard to also check the scan-slot tupdesc:
  for DeltaXAppend the slot is the parent's layout, and a dropped column
  on either side would silently misalign the positional deform.
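
The data-modifying-CTE bypass from the first item of the list above can be illustrated with a toy plan walk. This is a sketch in Python, not the hook's actual code: `PlanNode`, `PlannedStmt`, `top_level_only_check`, and `has_on_conflict_insert` are hypothetical stand-ins that borrow the PostgreSQL names only to mirror the shape of the problem (a ModifyTable hidden in the statement's subplans under a top-level SELECT).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    kind: str                          # e.g. "ModifyTable", "CteScan", "SeqScan"
    on_conflict: bool = False
    children: List["PlanNode"] = field(default_factory=list)


@dataclass
class PlannedStmt:
    command: str                       # "SELECT", "INSERT", ...
    plan_tree: PlanNode
    subplans: List[PlanNode] = field(default_factory=list)


def _walk(node: PlanNode) -> bool:
    """True if this subtree contains a ModifyTable with ON CONFLICT."""
    if node.kind == "ModifyTable" and node.on_conflict:
        return True
    return any(_walk(child) for child in node.children)


def top_level_only_check(stmt: PlannedStmt) -> bool:
    # Pre-fix shape: only looks at statements whose command is INSERT.
    return stmt.command == "INSERT" and _walk(stmt.plan_tree)


def has_on_conflict_insert(stmt: PlannedStmt) -> bool:
    # Post-fix shape: also descend into every subplan, regardless of the
    # top-level command type.
    return _walk(stmt.plan_tree) or any(_walk(sp) for sp in stmt.subplans)


# WITH ins AS (INSERT ... ON CONFLICT DO NOTHING RETURNING *) SELECT * FROM ins
wcte = PlannedStmt(
    command="SELECT",
    plan_tree=PlanNode("CteScan"),
    subplans=[PlanNode("ModifyTable", on_conflict=True)],
)
```

Here `top_level_only_check(wcte)` misses the insert while `has_on_conflict_insert(wcte)` finds it.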

The pre-existing correctness test asserted the P0 contract (parent-routed
INSERT into a compressed partition raises). DML P1 intentionally accepts
plain INSERTs — rows land in the partition heap and are unioned with
segment data at scan time. Update the test to insert into both the plain
and deltax tables, verify the loose row is visible through the deltax
read path (count + total parity), and keep the join-equality check.
UPDATE/DELETE rejection coverage is unchanged.
SferaDev force-pushed the feat/compressed-insert branch from 821622a to a7fd85d Compare June 12, 2026 21:12
SferaDev marked this pull request as draft June 12, 2026 22:41
SferaDev added a commit that referenced this pull request Jun 15, 2026
* fix: valbitmap correctness — fail-safe pruning, retention leak, cross-schema companion lookup

Three pre-existing correctness bugs found during review of #31/#34/#35:

1. Text valbitmap wrong-empty-results on segment overflow. A segment with
   more than VALBITMAP_MAX_DISTINCT (32) distinct text values writes no
   valbitmap row and contributes nothing to the partition-level
   column_valmap. When the union of the remaining segments stayed <= 32,
   querying a value unique to the overflowed segment missed the valmap and
   took the 'prune every segment without reading the bitmap table'
   shortcut — returning zero rows instead of the overflowed segment's
   rows. Pruning now always goes through the per-segment bitmap rows:
   a valmap miss (empty wanted_bits) prunes exactly the segments that
   wrote a bitmap row; segments without one are never prunable by valmap
   logic. The now-redundant ValbitmapCheck::prune_all field is removed.

2. Retention drop leaked _valbitmap companions. auto_drop_partitions
   dropped blobs/blooms/text_lengths/colstats/meta but not valbitmap,
   leaving orphaned tables in _deltax_compressed forever. (decompress and
   the empty-partition cleanup already covered all six.)

3. check_compressed_partition matched companions by partition NAME only.
   Companion tables live in the single shared _deltax_compressed schema
   and embed only the table name, so a same-named partition in another
   schema (or any same-named plain table) was treated as compressed and
   served the other partition's data. The lookup now confirms via
   deltax.deltax_partition that the exact (schema, table) pair is the
   compressed one — at most one partition of a given name can be
   compressed, so is_compressed disambiguates.
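
The fail-safe pruning rule from item 1 above can be sketched as a toy Python model. The data layout is an assumption for illustration (`build_valbitmaps` and `segments_to_scan` are hypothetical names, bitmaps are plain Python integers, and the partition-level valmap is left uncapped for simplicity); only the constant 32 and the rule itself come from the commit message: a valmap miss prunes exactly the segments that wrote a bitmap row, and a segment without one is never pruned by valmap logic.

```python
from typing import Dict, List, Tuple

VALBITMAP_MAX_DISTINCT = 32  # limit named in the commit message


def build_valbitmaps(
    segments: List[List[str]],
) -> Tuple[Dict[str, int], Dict[int, int]]:
    """Return (valmap: value -> bit index, bitmaps: segment index -> bitmask).

    A segment with too many distinct values writes no bitmap row and
    contributes nothing to the valmap.
    """
    valmap: Dict[str, int] = {}
    bitmaps: Dict[int, int] = {}
    for idx, rows in enumerate(segments):
        distinct = sorted(set(rows))
        if len(distinct) > VALBITMAP_MAX_DISTINCT:
            continue  # overflowed segment: no bitmap row
        mask = 0
        for value in distinct:
            bit = valmap.setdefault(value, len(valmap))
            mask |= 1 << bit
        bitmaps[idx] = mask
    return valmap, bitmaps


def segments_to_scan(n_segments: int, valmap: Dict[str, int],
                     bitmaps: Dict[int, int], value: str) -> List[int]:
    """Segments that must still be read for `column = value`."""
    # A valmap miss yields empty wanted bits, not a "prune everything" shortcut.
    wanted = (1 << valmap[value]) if value in valmap else 0
    keep = []
    for idx in range(n_segments):
        mask = bitmaps.get(idx)
        if mask is None or (mask & wanted):
            keep.append(idx)  # no bitmap row means never prunable here
    return keep


segments = [["a", "b", "a"], [f"ov{i:02d}" for i in range(40)]]
valmap, bitmaps = build_valbitmaps(segments)
```

With this shape, a query for `"ov07"` misses the valmap yet still scans the overflowed segment, while the bitmap-covered segment is skipped.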

Regression tests: tests/test_valbitmap.py (one-segment-overflow shape),
tests/test_worker.py (no orphaned companions after retention drop),
tests/test_compression.py (two-schema same-name partition).

* test: make overflow regression probe robust to minmax/dictionary segment skipping

The plan-shape assertions (vb_skipped=1, segments=1) now ride on the
present-value 'ov07' query — where only the valbitmap can skip the
bitmap-covered segment and the overflowed segment must be scanned. The
absent-value probe keeps only the row-count assertion: an absent constant
can legitimately eliminate the overflowed segment through exact
per-segment evidence (text minmax, dictionary value check) regardless of
the valbitmap.

* rustfmt: format worker.rs launcher fan-out (inherited unformatted from #38 merge)