UPDATE/DELETE on compressed partitions (DML P2 + tombstone DELETE) #36

Draft
SferaDev wants to merge 5 commits into feat/compressed-insert from feat/compressed-update-delete

@SferaDev SferaDev commented Jun 12, 2026

UPDATE/DELETE on compressed partitions (DML P2 + tombstone DELETE)

Stacks on #34 (DML P1, feat/compressed-insert), which stacks on #33: #33 ← #34 ← this PR. Base branch is feat/compressed-insert, not main.

What

Compressed partitions accept plain UPDATE / DELETE / MERGE (including data-modifying CTE forms). Two mechanisms:

P2: decompose-on-write (UPDATE, and DELETEs that don't qualify for tombstones). The ExecutorStart interceptor walks every ModifyTable node, locates candidate segments with the read path's pruning pipeline (dml_candidate_segments → load_segments_heap), and decomposes them back into ordinary heap rows (decompose_segments_for_dml in src/compress.rs) inside the user's transaction, before the planned DML executes. Everything is plain WAL-logged heap DML in one transaction, so MVCC, rollback, and crash recovery need no extension-side machinery: a crash mid-decompose leaves the segment intact, because the meta-row delete and the heap inserts roll back together.

A whole-segment DELETE fast path drops segments without restoring rows. It applies only when every qual is batch-provable AllPass and nothing must observe the deleted rows. The guards that disable it are: RETURNING on that ModifyTable (checked per node, so CTE-level DELETE ... RETURNING decomposes too), row-level DELETE triggers, and REFERENCING OLD TABLE transition relations on the leaf or the named target. An ExecutorRun hook folds dropped rows into es_processed so command tags stay truthful. It does so for top-level statements only, since CTE-deleted rows never count toward the outer tag in PostgreSQL.
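The whole-segment fast-path decision can be sketched as a single predicate. This is an illustrative Python model, not the extension's Rust code: `QualVerdict`, `DeleteNode`, and `can_drop_whole_segment` are hypothetical names, and the real check works on planner and executor structures rather than plain booleans.

```python
from dataclasses import dataclass
from enum import Enum


class QualVerdict(Enum):
    """Hypothetical result of proving one qual against a segment's batch metadata."""
    ALL_PASS = "all_pass"    # every row in the segment provably matches
    NONE_PASS = "none_pass"  # no row in the segment can match
    UNKNOWN = "unknown"      # cannot be proven either way


@dataclass
class DeleteNode:
    """Hypothetical summary of one DELETE ModifyTable node."""
    has_returning: bool             # RETURNING on this node, not on the top-level statement
    has_row_delete_triggers: bool   # row-level DELETE triggers
    has_old_transition_table: bool  # REFERENCING OLD TABLE on the leaf or the named target


def can_drop_whole_segment(node: DeleteNode, verdicts: list[QualVerdict]) -> bool:
    """Return True only if no one observes the deleted rows and every qual is AllPass."""
    if node.has_returning or node.has_row_delete_triggers or node.has_old_transition_table:
        return False
    # An empty qual list means "delete everything", which trivially passes.
    return all(v is QualVerdict.ALL_PASS for v in verdicts)


quiet = DeleteNode(False, False, False)
cte_returning = DeleteNode(True, False, False)
print(can_drop_whole_segment(quiet, [QualVerdict.ALL_PASS]))          # fast path
print(can_drop_whole_segment(quiet, [QualVerdict.UNKNOWN]))           # must decompose
print(can_drop_whole_segment(cte_returning, [QualVerdict.ALL_PASS]))  # must decompose
```

Any segment that fails the predicate falls back to decompose-on-write, so the fast path only ever removes work and never changes the result.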

P2.5: tombstones-as-rows (DELETE-only fast layer). Qualifying DELETEs append (segment_id, row_offset) tombstone rows to a per-partition _tombstones companion instead of decomposing. Scans subtract tombstones on every read path: decode paths AND them into the selection vector (sequential, Top-N, text, and parallel-worker paths; workers load the tombstone map under the shared parallel snapshot), DeltaXCount counts live rows, and DeltaXMinMax/DeltaXAgg carry stale-plan guards that error (forcing a replan) if tombstones appear after planning. Tombstones are never visible to user queries.

Compaction folds tombstoned segments in. The catalog max_segment_id high-water mark is raised before any meta delete, so compaction can never reuse a decomposed or dropped segment's id. This matters because the shared blob/decompressed caches and the colstats cache are keyed by (companion_oid, segment_id, ...), and id reuse would poison them.
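Subtracting tombstones from a scan amounts to masking the selection vector. The sketch below is a simplified Python model under stated assumptions: the tombstone map is a dict from segment id to a set of row offsets, offsets are unique and in range, and `apply_tombstones` and `live_row_count` are hypothetical names rather than functions from this PR.

```python
def apply_tombstones(selection, segment_id, tombstones):
    """AND the tombstone mask into a per-row selection vector.

    selection  -- list of bools, one per row in the segment (True = row passed the quals)
    tombstones -- dict mapping segment_id -> set of deleted row offsets
    """
    dead = tombstones.get(segment_id, set())
    return [keep and offset not in dead for offset, keep in enumerate(selection)]


def live_row_count(row_count, segment_id, tombstones):
    """Count rows still visible in a segment, as a count-only scan would."""
    return row_count - len(tombstones.get(segment_id, set()))


# Segment 7 holds four rows; offsets 0 and 2 were deleted via tombstones.
tombstones = {7: {0, 2}}
selection = [True, True, False, True]  # row 2 already failed the query's quals

print(apply_tombstones(selection, 7, tombstones))  # [False, True, False, True]
print(live_row_count(4, 7, tombstones))            # 2
print(apply_tombstones(selection, 8, tombstones))  # segment without tombstones is unchanged
```

Because the mask is only ever ANDed in, a tombstone can hide a row but never resurrect one that the quals already rejected.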

Why / numbers

Requirement: "the user should almost not notice it's not normal postgres by wait time" for DELETE.

  • Tombstone DELETE: 0.53 ms vs 0.42 ms on the plain-PostgreSQL twin table — user-indistinguishable latency.
  • The same statements under decompose-on-write: 118–173× slower.
  • UPDATE uses decompose-on-write — transactional and correct, but slower; closing that gap is the remaining P3 work (the honest blocker: a tombstone-fast UPDATE must still materialize old row versions into the heap for SET expressions / triggers / RETURNING, which costs the same full-column decode the decompose performs).

Design

dev/docs/COMPRESSED_DML.md (shipped/updated in this stack) — §5 decompose-on-write, §5.4 whole-segment DELETE, the P2.5 tombstone fast-layer section, and the P2 deviations list (AccessExclusive partition lock + meta-row delete-first claim protocol; the decompose-cap GUC is deferred to P3).

Testing

  • make build — clean.
  • make clippy — zero warnings.
  • make test (pgrx unit tests) — green on PG 17 and PG 18 (575 each; the PG 18 run covers the version-gated ExecutorRun hook).
  • make integration-test PG_VERSIONS=17 — green (396 tests), including the new tests/test_compressed_dml.py (31 tests: decompose UPDATE/DELETE across query shapes, twin-table equality, rollback, concurrency, tombstone fast path, reads-with-tombstones on every query shape, tombstone rollback, compaction folding with segment-id high-water mark, mixed tombstones + heap tail, repeated DELETE accumulation, CTE DELETE ... RETURNING, CTE command-tag integrity, transition-table observation).

Provenance

Extracted from a longer perf-session branch (perf/clickhouse-gap-session); this PR carves out exactly the DML P2 + P2.5 work. Sibling features from that session (dual-mode segment files #32, partition bloom sentinels #35, count sidecars #31) are separate PRs; where their interaction rules matter for this code, comments flag them explicitly as not yet on this branch.

@SferaDev SferaDev marked this pull request as ready for review June 12, 2026 13:51
@SferaDev SferaDev force-pushed the feat/compressed-update-delete branch 3 times, most recently from e967118 to 3018ae8 Compare June 12, 2026 15:33
@SferaDev SferaDev force-pushed the feat/compressed-insert branch from b1e0989 to 821622a Compare June 12, 2026 20:26
@SferaDev SferaDev force-pushed the feat/compressed-update-delete branch from 3018ae8 to f3df30a Compare June 12, 2026 21:06
@SferaDev SferaDev force-pushed the feat/compressed-insert branch from 821622a to a7fd85d Compare June 12, 2026 21:12
SferaDev added 5 commits June 12, 2026 23:21
P2 decompose-on-write: the ExecutorStart interceptor walks every
ModifyTable node (UPDATE/DELETE/MERGE, incl. data-modifying CTEs),
locates candidate segments via the read path's pruning pipeline
(dml_candidate_segments), and decomposes them back into ordinary heap
rows (decompose_segments_for_dml) inside the user's transaction before
the planned DML runs — plain WAL-logged heap DML, so MVCC/rollback/
crash recovery need no extension-side machinery. Whole-segment DELETE
drops segments without restoring rows (guards: no RETURNING, no user
row DELETE triggers, all quals batch-provable AllPass), with an
ExecutorRun hook folding dropped rows into es_processed.

P2.5 tombstones-as-rows (DELETE-only fast layer): qualifying DELETEs
append (segment_id, row_offset) tombstone rows to a per-partition
_tombstones companion instead of decomposing — near-native DELETE
latency. Scans AND tombstones into the selection vector on every read
path (incl. parallel workers, which load the map under the shared
snapshot); DeltaXCount counts live rows; DeltaXMinMax/DeltaXAgg carry
stale-plan guards that error when tombstones appeared after planning.
Compaction folds tombstoned segments in; the catalog max_segment_id
high-water mark is raised before any meta delete so segment ids are
never reused (shared caches are keyed by (companion_oid, segment_id)).

UPDATE stays on decompose-on-write (correct, transactional, slower).

Extracted from a longer perf session; stacks on DML P1
(feat/compressed-insert).
…ETEs

Review fixes for the P2/P2.5 fast paths (whole-segment drop + tombstone),
which must be disabled whenever something observes the deleted rows:

- RETURNING is now checked per ModifyTable (returningLists), not via
  planned_stmt.hasReturning, which only reflects the top-level query: a
  data-modifying CTE's DELETE ... RETURNING previously took the fast path
  and returned no rows to the outer query.
- AFTER DELETE triggers with OLD TABLE transition relations (REFERENCING
  OLD TABLE) now disable the fast paths too — checked on the leaf and on
  the statement's named target, since transition capture on a partitioned
  parent collects rows from every leaf.
- es_processed credit is only applied when the ModifyTable is the
  statement's top plan node: CTE-deleted rows never count toward the outer
  command tag in PostgreSQL.
- PENDING_DML_EXTRA_ROWS is a Vec now so nested statements (trigger or
  function-body DML on another compressed partition) can't drop the outer
  statement's pending credit.

Also: reuse ddl_if_not_exists in copy.rs, doc update, and three new
integration tests covering CTE RETURNING, CTE tag integrity, and
transition-table observation.
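
One way to read the es_processed and PENDING_DML_EXTRA_ROWS bullets above is as a per-statement stack of pending credits. The Python sketch below models that reading only. The class and method names are hypothetical, and the Rust code may manage the Vec differently.

```python
class PendingDmlExtraRows:
    """Hypothetical stand-in for PENDING_DML_EXTRA_ROWS, modeled as a stack."""

    def __init__(self) -> None:
        self._stack: list[int] = []

    def on_executor_start(self, dropped_rows: int, is_top_plan_node: bool) -> None:
        # Rows dropped by a CTE-level ModifyTable earn no credit, because they
        # never count toward the outer command tag in PostgreSQL.
        credit = dropped_rows if is_top_plan_node else 0
        self._stack.append(credit)

    def on_executor_run(self, es_processed: int) -> int:
        # Each statement pops exactly its own credit, so a nested statement
        # cannot consume or overwrite the outer statement's entry.
        return es_processed + self._stack.pop()


pending = PendingDmlExtraRows()
pending.on_executor_start(dropped_rows=100, is_top_plan_node=True)  # outer DELETE
pending.on_executor_start(dropped_rows=5, is_top_plan_node=True)    # nested DELETE from a trigger
nested_tag = pending.on_executor_run(es_processed=0)
outer_tag = pending.on_executor_run(es_processed=3)
print(nested_tag, outer_tag)  # 5 103
```

With a single slot instead of a stack, the nested statement's start would overwrite the outer credit of 100 and the outer command tag would under-report.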
…stack

Partition bloom sentinels (#35) and dual-mode segment files (#32) are
separate PRs; the new decompose comments referenced them as if present.
Rephrase as forward-looking interaction notes (matching the existing
'PERF #47, not yet on this branch' convention) or drop the reference.
…s succeeds

The correctness suite still asserted the P0/P1 contract (UPDATE/DELETE
on compressed partitions raises), which this PR obsoletes. Rewrite both
tests to assert the new transparent-DML contract, mirroring the
accepts_parent_routed_insert pattern: run the same statement against the
plain twin and the deltax table, commit, and require parity through the
deltax read path.

- test_partition_edge_compressed_partition_rejects_dml ->
  test_partition_edge_compressed_partition_applies_dml: direct
  partition-targeted UPDATE/DELETE on a compressed partition; asserts
  rowcount parity, post-UPDATE value visibility, post-DELETE row absence
  with the partition still compressed (tombstone-aware read path), total
  count parity, and full ordered result-set parity.
- test_rtabench_compressed_partition_rejects_parent_routed_update_delete ->
  test_rtabench_compressed_partition_accepts_parent_routed_update_delete:
  parent-routed UPDATE/DELETE across all four COPY layouts; asserts
  rowcount parity, per-order row parity (update) / absence on both sides
  with the touched partition still compressed (delete), total count
  parity, and join-case equality.
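
The twin-table parity pattern these tests follow can be sketched in a few lines. This is an assumption-heavy illustration: it uses Python's built-in sqlite3 with two ordinary tables as stand-ins for the plain PostgreSQL twin and the deltax table, and `assert_twin_parity` and the table names are hypothetical, not taken from the test suite.

```python
import sqlite3


def assert_twin_parity(conn, statement, twin, deltax, order_by):
    """Run the same DML against both tables, commit, and require identical results."""
    counts = []
    for table in (twin, deltax):
        cur = conn.execute(statement.format(table=table))
        counts.append(cur.rowcount)
    conn.commit()
    assert counts[0] == counts[1], f"rowcount mismatch: {counts}"

    # Full ordered result-set parity, read back after the commit.
    rows = [
        conn.execute(f"SELECT * FROM {table} ORDER BY {order_by}").fetchall()
        for table in (twin, deltax)
    ]
    assert rows[0] == rows[1], "result sets diverged"
    return counts[0]


conn = sqlite3.connect(":memory:")
for table in ("orders_plain", "orders_deltax"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(1, "new"), (2, "new"), (3, "done")],
    )
conn.commit()

updated = assert_twin_parity(
    conn, "UPDATE {table} SET status = 'done' WHERE status = 'new'",
    "orders_plain", "orders_deltax", "id",
)
deleted = assert_twin_parity(
    conn, "DELETE FROM {table} WHERE id = 3",
    "orders_plain", "orders_deltax", "id",
)
print(updated, deleted)  # 2 1
```

In the real suite the second table is a compressed partition, so the comparison also exercises the tombstone-aware read path instead of two identical heaps.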

INSERT ... ON CONFLICT rejection coverage is unchanged.

The new correctness DML tests exposed two jsonb bugs:

- restore_segment_rows (shared by decompress_partition, decompose-on-write
  UPDATE/DELETE, and compaction) decoded jsonb blobs through the UTF-8
  text codec path. jsonb blobs hold binary jsonb varlena payloads, so any
  restore of a jsonb-bearing segment panicked with 'invalid UTF-8 in LZ4
  data'. Decode them byte-safe (decompress_column_byte_values) and convert
  back to JSON text via jsonb_out (jsonb_binary_to_text, the inverse of
  jsonb_text_to_binary) for the INSERT literal.

- The SPI SELECT builders in compress_partition_impl and the compaction
  path did not cast jsonb columns to ::text, but the accumulate path reads
  jsonb as String (canonical JSON text) — pgrx rejects the oid mismatch.
  Include ColumnKind::Jsonb in the ::text cast.

Adds a pg_test covering compress -> UPDATE (decompose) -> compact ->
decompress with jsonb payloads incl. NULLs.
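
The first bug comes down to treating a binary payload as UTF-8 text. A minimal Python illustration follows, assuming a hypothetical `decode_column` dispatcher. The sample bytes are arbitrary and do not follow the real jsonb on-disk layout.

```python
def decode_column(kind: str, raw: bytes):
    """Decode one decompressed column value according to its column kind."""
    if kind == "jsonb":
        # Binary payload: keep it as bytes and let a jsonb-aware step
        # turn it into JSON text later.
        return raw
    # Text-like columns are stored as UTF-8.
    return raw.decode("utf-8")


text_raw = "héllo".encode("utf-8")
jsonb_raw = bytes([0x01, 0x00, 0x00, 0x20, 0xFF, 0xFE])  # arbitrary binary, not valid UTF-8

print(decode_column("text", text_raw))    # héllo
print(decode_column("jsonb", jsonb_raw))  # bytes pass through untouched

try:
    decode_column("text", jsonb_raw)      # what the old restore path effectively did
except UnicodeDecodeError as exc:
    print("text codec rejects binary payload:", exc.reason)
```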