UPDATE/DELETE on compressed partitions (DML P2 + tombstone DELETE)#36
Draft
SferaDev wants to merge 5 commits into
P2 decompose-on-write: the ExecutorStart interceptor walks every ModifyTable node (UPDATE/DELETE/MERGE, incl. data-modifying CTEs), locates candidate segments via the read path's pruning pipeline (dml_candidate_segments), and decomposes them back into ordinary heap rows (decompose_segments_for_dml) inside the user's transaction before the planned DML runs. This is plain WAL-logged heap DML, so MVCC/rollback/crash recovery need no extension-side machinery.

Whole-segment DELETE drops segments without restoring rows (guards: no RETURNING, no user row DELETE triggers, all quals batch-provable AllPass), with an ExecutorRun hook folding dropped rows into es_processed.

P2.5 tombstones-as-rows (DELETE-only fast layer): qualifying DELETEs append (segment_id, row_offset) tombstone rows to a per-partition _tombstones companion instead of decomposing, giving near-native DELETE latency. Scans AND tombstones into the selection vector on every read path (incl. parallel workers, which load the map under the shared snapshot); DeltaXCount counts live rows; DeltaXMinMax/DeltaXAgg carry stale-plan guards that error when tombstones appeared after planning. Compaction folds tombstoned segments in; the catalog max_segment_id high-water mark is raised before any meta delete so segment ids are never reused (shared caches are keyed by (companion_oid, segment_id)).

UPDATE stays on decompose-on-write (correct, transactional, slower).

Extracted from a longer perf session; stacks on DML P1 (feat/compressed-insert).
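The segment-id rule in this commit can be pictured with a small model. This is a sketch under stated assumptions, not the extension's code: `SegmentCatalog`, `allocate`, `drop_segment`, and `naive_next_id` are hypothetical names, and an in-memory `Vec` stands in for the catalog and the meta rows. It only shows why raising the `max_segment_id` high-water mark before a meta delete keeps ids from being reused.

```rust
/// Hypothetical in-memory stand-in for the catalog plus segment meta rows.
#[derive(Debug, Default)]
pub struct SegmentCatalog {
    /// High-water mark: the largest segment id ever handed out or observed.
    max_segment_id: u64,
    /// Ids of segments that currently have a meta row.
    live: Vec<u64>,
}

impl SegmentCatalog {
    /// Allocate a fresh id strictly above the high-water mark.
    pub fn allocate(&mut self) -> u64 {
        self.max_segment_id += 1;
        self.live.push(self.max_segment_id);
        self.max_segment_id
    }

    /// Drop a segment: raise the high-water mark first, then delete the meta row.
    pub fn drop_segment(&mut self, segment_id: u64) {
        self.max_segment_id = self.max_segment_id.max(segment_id);
        self.live.retain(|&id| id != segment_id);
    }

    /// What an allocator would pick if it only looked at surviving meta rows.
    /// Shown for contrast: this can hand a dropped segment's id out again.
    pub fn naive_next_id(&self) -> u64 {
        self.live.iter().copied().max().unwrap_or(0) + 1
    }
}
```

In this model, dropping the newest segment makes `naive_next_id` return the id that was just dropped, while `allocate` keeps moving forward. With caches keyed by `(companion_oid, segment_id)`, a reused id could serve entries that belong to the old segment.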
…ETEs

Review fixes for the P2/P2.5 fast paths (whole-segment drop + tombstone), which must be disabled whenever something observes the deleted rows:

- RETURNING is now checked per ModifyTable (returningLists), not via planned_stmt.hasReturning, which only reflects the top-level query: a data-modifying CTE's DELETE ... RETURNING previously took the fast path and returned no rows to the outer query.
- AFTER DELETE triggers with OLD TABLE transition relations (REFERENCING OLD TABLE) now disable the fast paths too. This is checked on the leaf and on the statement's named target, since transition capture on a partitioned parent collects rows from every leaf.
- es_processed credit is only applied when the ModifyTable is the statement's top plan node: CTE-deleted rows never count toward the outer command tag in PostgreSQL.
- PENDING_DML_EXTRA_ROWS is a Vec now so nested statements (trigger or function-body DML on another compressed partition) can't drop the outer statement's pending credit.

Also: reuse ddl_if_not_exists in copy.rs, doc update, and three new integration tests covering CTE RETURNING, CTE tag integrity, and transition-table observation.
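The review fixes above amount to a few small decisions per ModifyTable node. The following is a minimal sketch of that logic, assuming the relevant facts have already been extracted from the plan and the relation's triggers. Every name here (`DeleteNode`, `observers_present`, `whole_segment_drop_allowed`, `command_tag_credit`, `PendingCredits`) is hypothetical and does not come from the PR.

```rust
/// Hypothetical summary of one DELETE ModifyTable node.
#[derive(Debug, Clone, Copy, Default)]
pub struct DeleteNode {
    /// RETURNING present on this node (not just on the top-level query).
    pub has_returning: bool,
    /// Row-level DELETE trigger on the target.
    pub has_row_delete_trigger: bool,
    /// REFERENCING OLD TABLE transition relation on the leaf or the named target.
    pub has_old_table_transition: bool,
    /// Every qual is batch-provable AllPass for the candidate segments.
    pub all_quals_all_pass: bool,
    /// This node is the statement's top plan node (not inside a CTE).
    pub is_top_plan_node: bool,
}

/// True when something must observe the deleted rows, so no fast path may run.
pub fn observers_present(node: &DeleteNode) -> bool {
    node.has_returning || node.has_row_delete_trigger || node.has_old_table_transition
}

/// Whole-segment drop needs no observers and all quals provably passing.
pub fn whole_segment_drop_allowed(node: &DeleteNode) -> bool {
    !observers_present(node) && node.all_quals_all_pass
}

/// Rows dropped by a fast path count toward the command tag only at the top level.
pub fn command_tag_credit(node: &DeleteNode, dropped_rows: u64) -> u64 {
    if node.is_top_plan_node { dropped_rows } else { 0 }
}

/// Stack of pending row credits, one slot per statement nesting level.
#[derive(Debug, Default)]
pub struct PendingCredits {
    stack: Vec<u64>,
}

impl PendingCredits {
    /// Open a slot when a statement (outer or nested) starts.
    pub fn begin_statement(&mut self) {
        self.stack.push(0);
    }

    /// Credit rows to the innermost running statement.
    pub fn add(&mut self, rows: u64) {
        if let Some(top) = self.stack.last_mut() {
            *top += rows;
        }
    }

    /// Close the innermost statement and return its credit.
    pub fn end_statement(&mut self) -> u64 {
        self.stack.pop().unwrap_or(0)
    }
}
```

The stack is the reason a `Vec` helps in this sketch: a nested statement pushes its own slot, so finishing it pops only that slot and leaves the outer statement's credit untouched.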
…s succeeds

The correctness suite still asserted the P0/P1 contract (UPDATE/DELETE on compressed partitions raises), which this PR obsoletes. Rewrite both tests to assert the new transparent-DML contract, mirroring the accepts_parent_routed_insert pattern: run the same statement against the plain twin and the deltax table, commit, and require parity through the deltax read path.

- test_partition_edge_compressed_partition_rejects_dml -> test_partition_edge_compressed_partition_applies_dml: direct partition-targeted UPDATE/DELETE on a compressed partition; asserts rowcount parity, post-UPDATE value visibility, post-DELETE row absence with the partition still compressed (tombstone-aware read path), total count parity, and full ordered result-set parity.
- test_rtabench_compressed_partition_rejects_parent_routed_update_delete -> test_rtabench_compressed_partition_accepts_parent_routed_update_delete: parent-routed UPDATE/DELETE across all four COPY layouts; asserts rowcount parity, per-order row parity (update) / absence on both sides with the touched partition still compressed (delete), total count parity, and join-case equality.

INSERT ... ON CONFLICT rejection coverage is unchanged.
The new correctness DML tests exposed two jsonb bugs:

- restore_segment_rows (shared by decompress_partition, decompose-on-write UPDATE/DELETE, and compaction) decoded jsonb blobs through the UTF-8 text codec path. jsonb blobs hold binary jsonb varlena payloads, so any restore of a jsonb-bearing segment panicked with 'invalid UTF-8 in LZ4 data'. Decode them byte-safe (decompress_column_byte_values) and convert back to JSON text via jsonb_out (jsonb_binary_to_text, the inverse of jsonb_text_to_binary) for the INSERT literal.
- The SPI SELECT builders in compress_partition_impl and the compaction path did not cast jsonb columns to ::text, but the accumulate path reads jsonb as String (canonical JSON text), and pgrx rejects the oid mismatch. Include ColumnKind::Jsonb in the ::text cast.

Adds a pg_test covering compress -> UPDATE (decompose) -> compact -> decompress with jsonb payloads incl. NULLs.
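The first bug is an instance of a general hazard: arbitrary binary payloads are not guaranteed to be valid UTF-8, so a text codec path can fail on them. The sketch below illustrates only that point with the Rust standard library. The function names are hypothetical, the sample bytes are made up and are not a real jsonb varlena layout, and the actual fix goes through the helpers named in the commit message.

```rust
use std::string::FromUtf8Error;

/// Text-path decode: succeeds only if the payload happens to be valid UTF-8.
pub fn decode_as_text(payload: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(payload.to_vec())
}

/// Byte-safe decode: keeps the payload as raw bytes, so it cannot fail on
/// non-UTF-8 content. Conversion to JSON text is a separate, later step.
pub fn decode_byte_safe(payload: &[u8]) -> Vec<u8> {
    payload.to_vec()
}

/// Made-up binary payload containing bytes (0xFF, 0xFE) that never appear in UTF-8.
pub fn sample_binary_payload() -> Vec<u8> {
    vec![0x01, 0x00, 0x00, 0x20, 0xFF, 0xFE]
}
```

A text payload such as `{"a": 1}` passes through both functions unchanged, which is why a bug of this kind can stay hidden until a test restores a segment that carries binary data.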
UPDATE/DELETE on compressed partitions (DML P2 + tombstone DELETE)
Stacks on #34 (DML P1, `feat/compressed-insert`), which stacks on #33: #33 ← #34 ← this PR. Base branch is `feat/compressed-insert`, not `main`.

What
Compressed partitions accept plain `UPDATE`/`DELETE`/`MERGE` (including data-modifying CTE forms). Two mechanisms:

P2: decompose-on-write (UPDATE, and DELETEs that don't qualify for tombstones). The ExecutorStart interceptor walks every ModifyTable node, locates candidate segments with the read path's pruning pipeline (`dml_candidate_segments` → `load_segments_heap`), and decomposes them back into ordinary heap rows (`decompose_segments_for_dml` in `src/compress.rs`) inside the user's transaction, before the planned DML executes. Everything is plain WAL-logged heap DML in one transaction, so MVCC, rollback, and crash recovery need no extension-side machinery: a crash mid-decompose leaves the segment intact (the meta-row delete and the heap inserts roll back together).

A whole-segment DELETE fast path drops segments without restoring rows. Its guards disable it whenever something must observe the deleted rows: RETURNING on that ModifyTable (checked per node, so CTE-level `DELETE ... RETURNING` decomposes too), row-level DELETE triggers, or `REFERENCING OLD TABLE` transition relations on the leaf or the named target; and every qual must be batch-provable AllPass. An ExecutorRun hook folds dropped rows into `es_processed` so command tags stay truthful (top-level statements only, since CTE-deleted rows never count toward the outer tag in PostgreSQL).

P2.5: tombstones-as-rows (DELETE-only fast layer). Qualifying DELETEs append `(segment_id, row_offset)` tombstone rows to a per-partition `_tombstones` companion instead of decomposing. Scans subtract tombstones on every read path: decode paths AND them into the selection vector (sequential, Top-N, text, and parallel-worker paths; workers load the tombstone map under the shared parallel snapshot), `DeltaXCount` counts live rows, and `DeltaXMinMax`/`DeltaXAgg` carry stale-plan guards that error (forcing a replan) if tombstones appear after planning. Tombstones are never visible to user queries. Compaction folds tombstoned segments in; the catalog `max_segment_id` high-water mark is raised before any meta delete so compaction can never reuse a decomposed/dropped segment's id (the shared blob/decompressed caches and the colstats cache are keyed by `(companion_oid, segment_id, ...)`; id reuse would poison them).

Why / numbers
Requirement: "the user should almost not notice it's not normal postgres by wait time" for DELETE.
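The tombstone layer is the part of this PR aimed at that requirement: a qualifying DELETE writes a few small `(segment_id, row_offset)` rows instead of decomposing a segment, and readers mask those offsets out. The sketch below models both sides in memory. It is an illustration under assumptions, not the extension's implementation: `tombstone_rows`, `build_tombstone_map`, `apply_tombstones`, and `live_rows` are hypothetical names, and a `Vec<bool>` stands in for the selection vector.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory tombstone map: segment id -> deleted row offsets.
pub type TombstoneMap = HashMap<i64, HashSet<u32>>;

/// Write side: a DELETE that matched these offsets in one segment produces
/// one tombstone row per deleted offset.
pub fn tombstone_rows(segment_id: i64, matching_offsets: &[u32]) -> Vec<(i64, u32)> {
    matching_offsets.iter().map(|&off| (segment_id, off)).collect()
}

/// Read side, step 1: load all tombstone rows of a partition into a map.
pub fn build_tombstone_map(rows: &[(i64, u32)]) -> TombstoneMap {
    let mut map = TombstoneMap::new();
    for &(segment_id, row_offset) in rows {
        map.entry(segment_id).or_default().insert(row_offset);
    }
    map
}

/// Read side, step 2: AND the tombstones into a segment's selection vector.
/// Rows already filtered out stay out; tombstoned rows are cleared.
pub fn apply_tombstones(selection: &mut [bool], segment_id: i64, tombstones: &TombstoneMap) {
    if let Some(deleted) = tombstones.get(&segment_id) {
        for &offset in deleted {
            // Offsets beyond the vector are ignored rather than panicking.
            if let Some(slot) = selection.get_mut(offset as usize) {
                *slot = false;
            }
        }
    }
}

/// Number of rows a scan would still return for this selection vector.
pub fn live_rows(selection: &[bool]) -> usize {
    selection.iter().filter(|&&keep| keep).count()
}
```

In this model, repeated DELETEs simply add more rows to the same map, and a segment without tombstones is left untouched by `apply_tombstones`.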
Design
`dev/docs/COMPRESSED_DML.md` (shipped/updated in this stack): §5 decompose-on-write, §5.4 whole-segment DELETE, the P2.5 tombstone fast-layer section, and the P2 deviations list (AccessExclusive partition lock + meta-row delete-first claim protocol; the decompose-cap GUC is deferred to P3).

Testing
- `make build`: clean.
- `make clippy`: zero warnings.
- `make test` (pgrx unit tests): green on PG 17 and PG 18 (575 each; the PG 18 run covers the version-gated ExecutorRun hook).
- `make integration-test PG_VERSIONS=17`: green (396 tests), including the new `tests/test_compressed_dml.py` (31 tests: decompose UPDATE/DELETE across query shapes, twin-table equality, rollback, concurrency, tombstone fast path, reads-with-tombstones on every query shape, tombstone rollback, compaction folding with segment-id high-water mark, mixed tombstones + heap tail, repeated DELETE accumulation, CTE `DELETE ... RETURNING`, CTE command-tag integrity, transition-table observation).

Provenance
Extracted from a longer perf-session branch (`perf/clickhouse-gap-session`); this PR carves out exactly the DML P2 + P2.5 work. Sibling features from that session (dual-mode segment files #32, partition bloom sentinels #35, count sidecars #31) are separate PRs; where their interaction rules matter for this code, comments flag them explicitly as not yet on this branch.