UPDATE/DELETE on compressed partitions (DML P2 + tombstone DELETE) by SferaDev · Pull Request #36 · xataio/deltax

SferaDev · 2026-06-12T13:23:04Z

UPDATE/DELETE on compressed partitions (DML P2 + tombstone DELETE)

Stacks on #34 (DML P1, feat/compressed-insert), which stacks on #33: #33 ← #34 ← this PR. Base branch is feat/compressed-insert, not main.

What

Compressed partitions accept plain UPDATE / DELETE / MERGE (including data-modifying CTE forms). Two mechanisms:

P2: decompose-on-write (UPDATE, and DELETEs that don't qualify for tombstones). The ExecutorStart interceptor walks every ModifyTable node, locates candidate segments with the read path's pruning pipeline (dml_candidate_segments → load_segments_heap), and decomposes them back into ordinary heap rows (decompose_segments_for_dml in src/compress.rs) inside the user's transaction, before the planned DML executes. Everything is plain WAL-logged heap DML in one transaction, so MVCC, rollback, and crash recovery need no extension-side machinery: a crash mid-decompose leaves the segment intact (the meta-row delete and the heap inserts roll back together).

A whole-segment DELETE fast path drops segments without restoring rows. It applies only when every qual is batch-provable AllPass, and its guards disable it whenever something must observe the deleted rows: RETURNING on that ModifyTable (checked per node, so CTE-level DELETE ... RETURNING decomposes too), row-level DELETE triggers, or REFERENCING OLD TABLE transition relations on the leaf or the named target. An ExecutorRun hook folds dropped rows into es_processed so command tags stay accurate. The credit applies to top-level statements only, since CTE-deleted rows never count toward the outer tag in PostgreSQL.
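The guard logic for the whole-segment DELETE fast path can be sketched as a single predicate. This is a minimal illustration, not the code in src/compress.rs: `DeleteContext`, `QualVerdict`, and `can_drop_whole_segment` are hypothetical names, and the three-valued verdict is an assumption about how "batch-provable AllPass" might be modelled.

```rust
/// Hypothetical verdict of evaluating one qual against segment-level metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QualVerdict {
    /// Metadata proves every row in the segment satisfies the qual.
    AllPass,
    /// Some rows may match and some may not; a row-level check is needed.
    Partial,
    /// Metadata proves no row in the segment matches.
    NonePass,
}

/// Hypothetical summary of one DELETE ModifyTable node (illustrative fields).
#[derive(Debug, Clone)]
pub struct DeleteContext {
    /// RETURNING on this ModifyTable node (checked per node, not per statement).
    pub has_returning: bool,
    /// Row-level DELETE triggers exist on the target.
    pub has_row_delete_trigger: bool,
    /// REFERENCING OLD TABLE transition relation on the leaf or named target.
    pub has_old_table_transition: bool,
    /// One verdict per qual for the candidate segment.
    pub qual_verdicts: Vec<QualVerdict>,
}

/// True when the segment may be dropped without restoring its rows.
/// An empty qual list (DELETE without WHERE) trivially passes the qual check.
pub fn can_drop_whole_segment(ctx: &DeleteContext) -> bool {
    let rows_are_observed =
        ctx.has_returning || ctx.has_row_delete_trigger || ctx.has_old_table_transition;
    !rows_are_observed && ctx.qual_verdicts.iter().all(|v| *v == QualVerdict::AllPass)
}
```

Any observer or any qual that is not provably AllPass sends the statement back to decompose-on-write, which keeps the fast path a pure optimization.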

P2.5: tombstones-as-rows (DELETE-only fast layer). Qualifying DELETEs append (segment_id, row_offset) tombstone rows to a per-partition _tombstones companion instead of decomposing. Scans subtract tombstones on every read path: decode paths AND them into the selection vector (sequential, Top-N, text, and parallel-worker paths; workers load the tombstone map under the shared parallel snapshot), DeltaXCount counts live rows, and DeltaXMinMax/DeltaXAgg carry stale-plan guards that error (forcing a replan) if tombstones appear after planning. Tombstones are never visible to user queries.

Compaction folds tombstoned segments in. The catalog max_segment_id high-water mark is raised before any meta delete so compaction can never reuse a decomposed/dropped segment's id: the shared blob/decompressed caches and the colstats cache are keyed by (companion_oid, segment_id, ...), so id reuse would poison them.
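As a rough illustration of how a scan might subtract tombstones, the sketch below clears selection-vector slots for tombstoned row offsets, which is equivalent to ANDing the selection with a liveness mask. The `TombstoneMap` type, `apply_tombstones`, and `count_live` are hypothetical names chosen for this example, and a `Vec<bool>` selection vector is an assumption; the extension's real representation may differ.

```rust
use std::collections::HashMap;

/// Hypothetical tombstone map: segment_id -> deleted row offsets.
pub type TombstoneMap = HashMap<u64, Vec<u32>>;

/// Clears the selection bit of every tombstoned row in one segment.
/// Equivalent to `selection[i] &= !tombstoned[i]` for each row `i`.
/// Offsets beyond the segment length are ignored.
pub fn apply_tombstones(selection: &mut [bool], segment_id: u64, tombstones: &TombstoneMap) {
    if let Some(offsets) = tombstones.get(&segment_id) {
        for &offset in offsets {
            if let Some(slot) = selection.get_mut(offset as usize) {
                *slot = false;
            }
        }
    }
}

/// Number of rows still selected, i.e. rows that passed the quals and are live.
pub fn count_live(selection: &[bool]) -> usize {
    selection.iter().filter(|&&selected| selected).count()
}
```

Because the mask is applied after qual evaluation, a row filtered out by the query stays filtered out, and a tombstoned row can never reappear regardless of which read path produced the selection.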

Why / numbers

Requirement: "the user should almost not notice it's not normal postgres by wait time" for DELETE.

Tombstone DELETE: 0.53 ms vs 0.42 ms on the plain-PostgreSQL twin table — user-indistinguishable latency.
The same statements under decompose-on-write: 118–173× slower.
UPDATE uses decompose-on-write — transactional and correct, but slower; closing that gap is the remaining P3 work (the honest blocker: a tombstone-fast UPDATE must still materialize old row versions into the heap for SET expressions / triggers / RETURNING, which costs the same full-column decode the decompose performs).

Design

dev/docs/COMPRESSED_DML.md (shipped/updated in this stack) — §5 decompose-on-write, §5.4 whole-segment DELETE, the P2.5 tombstone fast-layer section, and the P2 deviations list (AccessExclusive partition lock + meta-row delete-first claim protocol; the decompose-cap GUC is deferred to P3).

Testing

make build — clean.
make clippy — zero warnings.
make test (pgrx unit tests) — green on PG 17 and PG 18 (575 each; the PG 18 run covers the version-gated ExecutorRun hook).
make integration-test PG_VERSIONS=17 — green (396 tests), including the new tests/test_compressed_dml.py (31 tests: decompose UPDATE/DELETE across query shapes, twin-table equality, rollback, concurrency, tombstone fast path, reads-with-tombstones on every query shape, tombstone rollback, compaction folding with segment-id high-water mark, mixed tombstones + heap tail, repeated DELETE accumulation, CTE DELETE ... RETURNING, CTE command-tag integrity, transition-table observation).

Provenance

Extracted from a longer perf-session branch (perf/clickhouse-gap-session); this PR carves out exactly the DML P2 + P2.5 work. Sibling features from that session (dual-mode segment files #32, partition bloom sentinels #35, count sidecars #31) are separate PRs; where their interaction rules matter for this code, comments flag them explicitly as not yet on this branch.

P2 decompose-on-write: the ExecutorStart interceptor walks every ModifyTable node (UPDATE/DELETE/MERGE, incl. data-modifying CTEs), locates candidate segments via the read path's pruning pipeline (dml_candidate_segments), and decomposes them back into ordinary heap rows (decompose_segments_for_dml) inside the user's transaction before the planned DML runs. This is plain WAL-logged heap DML, so MVCC/rollback/crash recovery need no extension-side machinery. Whole-segment DELETE drops segments without restoring rows (guards: no RETURNING, no user row DELETE triggers, all quals batch-provable AllPass), with an ExecutorRun hook folding dropped rows into es_processed.

P2.5 tombstones-as-rows (DELETE-only fast layer): qualifying DELETEs append (segment_id, row_offset) tombstone rows to a per-partition _tombstones companion instead of decomposing, giving near-native DELETE latency. Scans AND tombstones into the selection vector on every read path (incl. parallel workers, which load the map under the shared snapshot); DeltaXCount counts live rows; DeltaXMinMax/DeltaXAgg carry stale-plan guards that error when tombstones appeared after planning. Compaction folds tombstoned segments in; the catalog max_segment_id high-water mark is raised before any meta delete so segment ids are never reused (shared caches are keyed by (companion_oid, segment_id)).

UPDATE stays on decompose-on-write (correct, transactional, slower). Extracted from a longer perf session; stacks on DML P1 (feat/compressed-insert).
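The reason for raising max_segment_id before the meta delete can be seen in a small model. The sketch below is an assumption-laden stand-in, not the extension's catalog: `SegmentCatalog` and its methods are hypothetical, and it assumes the next id is derived from both the recorded mark and the surviving meta rows.

```rust
/// Hypothetical in-memory stand-in for the per-partition segment catalog.
pub struct SegmentCatalog {
    /// Ids of segments that still have a meta row.
    live_ids: Vec<u64>,
    /// Recorded high-water mark; may lag behind the largest live id.
    max_segment_id: u64,
}

impl SegmentCatalog {
    pub fn new(live_ids: Vec<u64>, max_segment_id: u64) -> Self {
        Self { live_ids, max_segment_id }
    }

    /// Next id to hand out: strictly above the mark and every live id.
    pub fn next_segment_id(&self) -> u64 {
        let max_live = self.live_ids.iter().copied().max().unwrap_or(0);
        self.max_segment_id.max(max_live) + 1
    }

    /// Drops a segment, raising the mark BEFORE the meta row disappears,
    /// so the dropped id stays reserved forever.
    pub fn drop_segment(&mut self, id: u64) {
        self.max_segment_id = self.max_segment_id.max(id);
        self.live_ids.retain(|&live| live != id);
    }

    /// Counter-example: deleting the meta row without touching the mark.
    pub fn drop_segment_without_mark(&mut self, id: u64) {
        self.live_ids.retain(|&live| live != id);
    }
}
```

In the counter-example, dropping the highest-numbered segment makes its id available again, so a cache entry keyed by (companion_oid, segment_id) could be served for a different segment's data.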

…ETEs

Review fixes for the P2/P2.5 fast paths (whole-segment drop + tombstone), which must be disabled whenever something observes the deleted rows:

- RETURNING is now checked per ModifyTable (returningLists), not via planned_stmt.hasReturning, which only reflects the top-level query: a data-modifying CTE's DELETE ... RETURNING previously took the fast path and returned no rows to the outer query.
- AFTER DELETE triggers with OLD TABLE transition relations (REFERENCING OLD TABLE) now disable the fast paths too. This is checked on the leaf and on the statement's named target, since transition capture on a partitioned parent collects rows from every leaf.
- es_processed credit is only applied when the ModifyTable is the statement's top plan node: CTE-deleted rows never count toward the outer command tag in PostgreSQL.
- PENDING_DML_EXTRA_ROWS is a Vec now so nested statements (trigger or function-body DML on another compressed partition) can't drop the outer statement's pending credit.

Also: reuse ddl_if_not_exists in copy.rs, doc update, and three new integration tests covering CTE RETURNING, CTE tag integrity, and transition-table observation.
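One way to picture the last two fixes is a per-statement stack of pending row credits. The sketch below assumes the Vec is used as a stack with one entry per executing statement; `PendingCredits` and its methods are hypothetical names, and the real PENDING_DML_EXTRA_ROWS handling may be organised differently.

```rust
/// Hypothetical stack of pending es_processed credits, one entry per
/// statement currently executing (outermost first).
pub struct PendingCredits {
    stack: Vec<u64>,
}

impl PendingCredits {
    pub fn new() -> Self {
        Self { stack: Vec::new() }
    }

    /// Called when a statement starts executing (including nested ones
    /// fired from triggers or function bodies).
    pub fn begin_statement(&mut self) {
        self.stack.push(0);
    }

    /// Records rows removed by a fast path. Credit is kept only when the
    /// ModifyTable is the statement's top plan node, since CTE-deleted rows
    /// never count toward the outer command tag.
    pub fn credit(&mut self, rows: u64, is_top_plan_node: bool) {
        if !is_top_plan_node {
            return;
        }
        if let Some(current) = self.stack.last_mut() {
            *current += rows;
        }
    }

    /// Called when the statement finishes: folds its own credit into the
    /// executor's processed-row count and leaves outer entries untouched.
    pub fn end_statement(&mut self, es_processed: u64) -> u64 {
        es_processed + self.stack.pop().unwrap_or(0)
    }
}
```

With a single shared slot, a nested statement finishing first would consume or reset the outer statement's credit; with one entry per statement, each command tag only ever sees its own rows.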

…stack

Partition bloom sentinels (#35) and dual-mode segment files (#32) are separate PRs; the new decompose comments referenced them as if present. Rephrase as forward-looking interaction notes (matching the existing 'PERF #47, not yet on this branch' convention) or drop the reference.

…s succeeds

The correctness suite still asserted the P0/P1 contract (UPDATE/DELETE on compressed partitions raises), which this PR obsoletes. Rewrite both tests to assert the new transparent-DML contract, mirroring the accepts_parent_routed_insert pattern: run the same statement against the plain twin and the deltax table, commit, and require parity through the deltax read path.

- test_partition_edge_compressed_partition_rejects_dml -> test_partition_edge_compressed_partition_applies_dml: direct partition-targeted UPDATE/DELETE on a compressed partition; asserts rowcount parity, post-UPDATE value visibility, post-DELETE row absence with the partition still compressed (tombstone-aware read path), total count parity, and full ordered result-set parity.
- test_rtabench_compressed_partition_rejects_parent_routed_update_delete -> test_rtabench_compressed_partition_accepts_parent_routed_update_delete: parent-routed UPDATE/DELETE across all four COPY layouts; asserts rowcount parity, per-order row parity (update) / absence on both sides with the touched partition still compressed (delete), total count parity, and join-case equality.

INSERT ... ON CONFLICT rejection coverage is unchanged.

The new correctness DML tests exposed two jsonb bugs:

- restore_segment_rows (shared by decompress_partition, decompose-on-write UPDATE/DELETE, and compaction) decoded jsonb blobs through the UTF-8 text codec path. jsonb blobs hold binary jsonb varlena payloads, so any restore of a jsonb-bearing segment panicked with 'invalid UTF-8 in LZ4 data'. Decode them byte-safe (decompress_column_byte_values) and convert back to JSON text via jsonb_out (jsonb_binary_to_text, the inverse of jsonb_text_to_binary) for the INSERT literal.
- The SPI SELECT builders in compress_partition_impl and the compaction path did not cast jsonb columns to ::text, but the accumulate path reads jsonb as String (canonical JSON text), so pgrx rejects the oid mismatch. Include ColumnKind::Jsonb in the ::text cast.

Adds a pg_test covering compress -> UPDATE (decompose) -> compact -> decompress with jsonb payloads incl. NULLs.
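The first bug comes down to treating arbitrary bytes as UTF-8. The sketch below reproduces that failure mode with the standard library only; `ColumnKind`, `Decoded`, and `decode_value` here are simplified stand-ins defined for the example, not the extension's types, and the byte payload is an arbitrary non-UTF-8 sequence rather than a real jsonb varlena.

```rust
/// Simplified stand-in for the column kinds relevant to this bug.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    Text,
    Jsonb,
}

/// Decoded column value: text columns become Strings, binary payloads
/// stay as raw bytes until a type-specific conversion runs.
#[derive(Debug, PartialEq, Eq)]
pub enum Decoded {
    Text(String),
    Bytes(Vec<u8>),
}

/// Routes a decompressed value by column kind. Only text goes through
/// UTF-8 validation; jsonb is kept byte-safe because its binary form is
/// not guaranteed to be valid UTF-8.
pub fn decode_value(
    kind: ColumnKind,
    raw: Vec<u8>,
) -> Result<Decoded, std::string::FromUtf8Error> {
    match kind {
        ColumnKind::Text => String::from_utf8(raw).map(Decoded::Text),
        ColumnKind::Jsonb => Ok(Decoded::Bytes(raw)),
    }
}
```

Sending the same binary payload through the text branch fails validation, which is the class of error the restore path hit; the byte-safe branch defers interpretation to the jsonb-specific conversion step.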

SferaDev marked this pull request as ready for review June 12, 2026 13:51

SferaDev force-pushed the feat/compressed-update-delete branch 3 times, most recently from e967118 to 3018ae8 Compare June 12, 2026 15:33

SferaDev force-pushed the feat/compressed-insert branch from b1e0989 to 821622a Compare June 12, 2026 20:26

SferaDev mentioned this pull request Jun 12, 2026

Fix pre-existing valbitmap correctness bugs #42

Merged

SferaDev force-pushed the feat/compressed-update-delete branch from 3018ae8 to f3df30a Compare June 12, 2026 21:06

SferaDev force-pushed the feat/compressed-insert branch from 821622a to a7fd85d Compare June 12, 2026 21:12

SferaDev added 5 commits June 12, 2026 23:21

SferaDev force-pushed the feat/compressed-update-delete branch from f3df30a to 1897f4c Compare June 12, 2026 21:21

SferaDev mentioned this pull request Jun 12, 2026

rustfmt: format worker.rs launcher fan-out #43

Open

SferaDev marked this pull request as draft June 12, 2026 22:41

SferaDev mentioned this pull request Jun 12, 2026

Dev docs: perf research notes (round-2 optimization candidates, PG18/19 opportunities) #29

Closed

SferaDev wants to merge 5 commits into feat/compressed-insert from feat/compressed-update-delete
