Transparent INSERT into compressed partitions (DML P1) #34
Draft
SferaDev wants to merge 3 commits into
This is a big one, we need it, but I want to do a bit of research myself on the possible approaches.
9b59034 to b579349
b1e0989 to 821622a
Remove the INSERT-side read-only limitation on compressed partitions. INSERTs (direct or parent-routed) land in the partition heap at native heap speed: no decompression, no segment rewrite. Every read path unions the compressed segments with the uncompressed heap tail:

- DeltaXDecompress/DeltaXAppend gain a Phase 3 heap-tail scan (leader-only in parallel plans), with the full plan qual re-enabled when a tail exists (heap rows are never batch-filtered).
- DeltaXCount/DeltaXMinMax fold loose heap rows into their metadata-derived results at exec time, so those pushdowns stay enabled.
- DeltaXAgg bails to plain Agg over heap-tail-aware scans when any involved compressed partition has loose rows (its columnar accumulators cannot ingest row-form tuples); a begin-time guard errors on the stale-cached-plan race instead of dropping rows.
- Top-N pushdown and pathkey claims are disabled at plan time when a tail exists; a [-4, flag] custom-private marker + relcache invalidation on the heap's empty->non-empty transition (sent by the leaf trigger's INSERT branch) catch stale cached plans.
- INSERT ... ON CONFLICT stays rejected (conflict inference cannot see rows inside segments). UPDATE/DELETE stay rejected unchanged.

deltax_compact_partition() folds loose rows into new segments appended to the existing companion tables (colstats/blooms/text_lengths/valbitmap kept exact or dropped, never under-covering; catalog counters, column_minmax, HLL/ndistinct refreshed) and truncates the loose region; the background worker compacts automatically each cycle.

Extracted from a longer perf session (DML P1, validated against a plain-PostgreSQL twin table); UPDATE/DELETE (P2) and tombstones (P2.5) follow in a separate PR. dev/docs/COMPRESSED_DML.md carries the full program design.
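The count/min-max folding described in this commit can be illustrated with a small model. This is only a sketch under stated assumptions: `SegmentMeta`, `fold_count`, and `fold_minmax` are hypothetical names invented here, the column is modelled as plain integers, and `loose_rows` stands in for the heap-tail rows visible to the executor snapshot. It is not the extension's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SegmentMeta:
    """Hypothetical per-segment metadata: row count plus min/max of one column."""
    row_count: int
    col_min: int
    col_max: int


def fold_count(segments: List[SegmentMeta], loose_rows: List[int]) -> int:
    """COUNT(*) = metadata-derived segment total + rows in the heap tail."""
    return sum(s.row_count for s in segments) + len(loose_rows)


def fold_minmax(segments: List[SegmentMeta],
                loose_rows: List[int]) -> Optional[Tuple[int, int]]:
    """MIN/MAX from segment metadata, widened by any loose heap values."""
    lows = [s.col_min for s in segments] + list(loose_rows)
    highs = [s.col_max for s in segments] + list(loose_rows)
    if not lows:
        return None  # empty partition: no segments and no loose rows
    return min(lows), max(highs)


# Two compressed segments plus three loose rows inserted after compression.
segments = [SegmentMeta(1000, 10, 99), SegmentMeta(500, 100, 250)]
loose = [5, 300, 42]

total = fold_count(segments, loose)
bounds = fold_minmax(segments, loose)
```

The point of the model is that the metadata pushdown stays valid as long as the loose rows are folded in at execution time; segment data itself is never touched.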
Independent correctness/cleanup pass over the P1 transparent-INSERT work:

- Close the data-modifying-CTE bypass of the INSERT ... ON CONFLICT rejection: a wCTE hides the ModifyTable in PlannedStmt.subplans under a top-level CMD_SELECT, so the ExecutorStart hook never saw it. Conflict inference against empty leaf indexes would silently miss conflicts with segment rows (duplicate inserts under ON CONFLICT DO NOTHING). New integration test covers it.
- Guard the leader-only Phase 3 heap tail against parallel_leader_participation=off: Gather never drives the leader's plan copy when workers launch, so the tail would be silently dropped. The begin-time guard errors with a remedy instead.
- Re-validate the partition catalog row under the AccessExclusive lock in deltax_compact_partition(): a concurrent decompress committing between the unlocked is_compressed check and the lock grant would have made compaction treat the fully-restored heap as loose rows and truncate real data.
- Self-heal dead loose-row pages in the loose_rows == 0 compaction branch: a REPEATABLE READ compaction deletes (not truncates) and autovacuum is off on compressed partitions, so the heap would otherwise stay at nonzero blocks forever, pinning scans on the heap-tail path and re-taking the AEL every worker cycle.
- Reject json_extract tables in compaction with a clear remedy (synthetic columns have no heap presence; the cursor would fail with a raw SQL error) and skip them in worker auto-compaction to avoid an error loop.
- Disambiguate partition_oid_for_companion() by is_compressed: two deltatables with the same table name in different schemas both register same-named partitions in the catalog; the reverse lookup could return the wrong schema's heap and union foreign rows into the scan.
- Use the executor snapshot (es_snapshot) instead of GetActiveSnapshot() for the DeltaXCount/DeltaXMinMax heap-tail fold, matching the DeltaXDecompress/Append Phase 3 scan.
- Tighten the Phase 3 deform guard to also check the scan-slot tupdesc: for DeltaXAppend the slot is the parent's layout, and a dropped column on either side would silently misalign the positional deform.
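The data-modifying-CTE bypass in the first bullet can be modelled in a few lines. The sketch below is an assumption-laden toy: `PlanNode`, `PlannedStmt`, `naive_check`, and `has_on_conflict_insert` are hypothetical Python stand-ins for the planner structures, not PostgreSQL or extension APIs. It only shows why a check that looks at the top-level command misses a ModifyTable that lives in the subplan list.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class PlanNode:
    kind: str                    # e.g. "ModifyTable", "CteScan", "Result"
    on_conflict: bool = False    # only meaningful for ModifyTable nodes
    children: List["PlanNode"] = field(default_factory=list)


@dataclass
class PlannedStmt:
    command_type: str            # "SELECT", "INSERT", ...
    plan_tree: PlanNode
    subplans: List[PlanNode] = field(default_factory=list)


def _walk(node: PlanNode) -> Iterator[PlanNode]:
    """Depth-first traversal of one plan tree."""
    yield node
    for child in node.children:
        yield from _walk(child)


def naive_check(stmt: PlannedStmt) -> bool:
    """Pre-fix behaviour (modelled): only inspect top-level INSERT statements."""
    if stmt.command_type != "INSERT":
        return False
    return any(n.kind == "ModifyTable" and n.on_conflict
               for n in _walk(stmt.plan_tree))


def has_on_conflict_insert(stmt: PlannedStmt) -> bool:
    """Post-fix behaviour (modelled): walk the main tree and every subplan."""
    roots = [stmt.plan_tree, *stmt.subplans]
    return any(n.kind == "ModifyTable" and n.on_conflict
               for root in roots for n in _walk(root))


# WITH ins AS (INSERT ... ON CONFLICT DO NOTHING RETURNING ...) SELECT ... FROM ins
wcte = PlannedStmt(
    command_type="SELECT",
    plan_tree=PlanNode("CteScan"),
    subplans=[PlanNode("ModifyTable", on_conflict=True,
                       children=[PlanNode("Result")])],
)

missed = naive_check(wcte)
caught = has_on_conflict_insert(wcte)
```

Here `missed` is False because the statement is a SELECT at the top level, while `caught` is True because the subplan list is searched as well.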
The pre-existing correctness test asserted the P0 contract (parent-routed INSERT into a compressed partition raises). DML P1 intentionally accepts plain INSERTs — rows land in the partition heap and are unioned with segment data at scan time. Update the test to insert into both the plain and deltax tables, verify the loose row is visible through the deltax read path (count + total parity), and keep the join-equality check. UPDATE/DELETE rejection coverage is unchanged.
821622a to a7fd85d
SferaDev added a commit that referenced this pull request Jun 15, 2026
* fix: valbitmap correctness (fail-safe pruning, retention leak, cross-schema companion lookup)

Three pre-existing correctness bugs found during review of #31/#34/#35:

1. Text valbitmap wrong-empty-results on segment overflow. A segment with more than VALBITMAP_MAX_DISTINCT (32) distinct text values writes no valbitmap row and contributes nothing to the partition-level column_valmap. When the union of the remaining segments stayed <= 32, querying a value unique to the overflowed segment missed the valmap and took the 'prune every segment without reading the bitmap table' shortcut, returning zero rows instead of the overflowed segment's rows. Pruning now always goes through the per-segment bitmap rows: a valmap miss (empty wanted_bits) prunes exactly the segments that wrote a bitmap row; segments without one are never prunable by valmap logic. The now-redundant ValbitmapCheck::prune_all field is removed.
2. Retention drop leaked _valbitmap companions. auto_drop_partitions dropped blobs/blooms/text_lengths/colstats/meta but not valbitmap, leaving orphaned tables in _deltax_compressed forever. (decompress and the empty-partition cleanup already covered all six.)
3. check_compressed_partition matched companions by partition NAME only. Companion tables live in the single shared _deltax_compressed schema and embed only the table name, so a same-named partition in another schema (or any same-named plain table) was treated as compressed and served the other partition's data. The lookup now confirms via deltax.deltax_partition that the exact (schema, table) pair is the compressed one; at most one partition of a given name can be compressed, so is_compressed disambiguates.

Regression tests: tests/test_valbitmap.py (one-segment-overflow shape), tests/test_worker.py (no orphaned companions after retention drop), tests/test_compression.py (two-schema same-name partition).

* test: make overflow regression probe robust to minmax/dictionary segment skipping

The plan-shape assertions (vb_skipped=1, segments=1) now ride on the present-value 'ov07' query, where only the valbitmap can skip the bitmap-covered segment and the overflowed segment must be scanned. The absent-value probe keeps only the row-count assertion: an absent constant can legitimately eliminate the overflowed segment through exact per-segment evidence (text minmax, dictionary value check) regardless of the valbitmap.

* rustfmt: format worker.rs launcher fan-out (inherited unformatted from #38 merge)
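The fail-safe pruning rule from item 1 can be sketched as a small model. Everything below is illustrative: `build_segment_bitmaps`, `old_segments_to_scan`, and `segments_to_scan` are hypothetical names, bitmaps are modelled as Python integers, and the model ignores the case where the union of non-overflowed segments itself exceeds the limit. Only the constant `VALBITMAP_MAX_DISTINCT = 32` comes from the commit message.

```python
VALBITMAP_MAX_DISTINCT = 32  # limit named in the commit message


def build_segment_bitmaps(segments):
    """Return (valmap, bitmaps): value -> bit index, and one bitmask or None per segment."""
    valmap = {}
    bitmaps = []
    for values in segments:
        distinct = set(values)
        if len(distinct) > VALBITMAP_MAX_DISTINCT:
            bitmaps.append(None)  # overflow: no bitmap row, nothing added to valmap
            continue
        mask = 0
        for v in sorted(distinct):
            bit = valmap.setdefault(v, len(valmap))
            mask |= 1 << bit
        bitmaps.append(mask)
    return valmap, bitmaps


def old_segments_to_scan(valmap, bitmaps, wanted):
    """Buggy shortcut (modelled): a valmap miss prunes every segment."""
    if wanted not in valmap:
        return []
    bit = 1 << valmap[wanted]
    return [i for i, m in enumerate(bitmaps) if m is None or m & bit]


def segments_to_scan(valmap, bitmaps, wanted):
    """Fail-safe rule: only segments that wrote a bitmap row may be pruned."""
    wanted_bits = (1 << valmap[wanted]) if wanted in valmap else 0
    keep = []
    for idx, mask in enumerate(bitmaps):
        if mask is None or mask & wanted_bits:
            keep.append(idx)  # no bitmap row, or bitmap says the value may be present
    return keep


# Segment 0 has two distinct values; segment 1 overflows with 40 distinct values.
segs = [["a", "b", "a"], [f"ov{i:02d}" for i in range(40)]]
valmap, bitmaps = build_segment_bitmaps(segs)

buggy = old_segments_to_scan(valmap, bitmaps, "ov07")
fixed = segments_to_scan(valmap, bitmaps, "ov07")
```

In this model `buggy` is empty (the wrong-empty-results shape), while `fixed` keeps only the overflowed segment: one segment skipped by the bitmap and one scanned.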
What
Removes the INSERT-side read-only limitation on compressed partitions. INSERTs (direct or routed through the parent) land in the partition heap at native heap speed — no decompression, no segment rewrite — and every read path unions the compressed segments with the uncompressed heap tail:
parallel_leader_participation=off) rather than dropping the tail.
INSERT ... ON CONFLICT stays rejected (conflict inference cannot see rows inside segments), including when smuggled in through a data-modifying CTE under a top-level SELECT. UPDATE/DELETE behavior is unchanged; they land in the follow-up P2/P2.5 PR.
deltax_compact_partition() folds loose rows into new segments appended to the existing companion tables (colstats/blooms/text_lengths/valbitmap kept exact or dropped, never under-covering) and truncates the loose region; the background worker compacts automatically each cycle. Compaction re-validates the catalog row under its AccessExclusive lock (closes a compact-vs-concurrent-decompress race), self-heals dead loose-row pages left behind by a REPEATABLE READ compaction (autovacuum is off on compressed partitions), and refuses json_extract tables with a clear remedy (decompress + recompress) instead of a raw SQL error.
Why
Compressed partitions were read-only; out-of-range or late-arriving data either errored or required manual decompress/recompress. With this change, a user inserting into a deltax table doesn't need to know whether the target partition is compressed — INSERT latency is plain-Postgres heap latency.
Design
dev/docs/COMPRESSED_DML.md (ships in this PR) carries the full DML program design (P1 heap-tail, P2 decompose-on-write, P2.5 tombstones).
Known tradeoff
Compaction ends with TRUNCATE of the loose region. Like compress-time TRUNCATE, this is not MVCC-safe against a concurrent REPEATABLE READ reader whose snapshot predates the compaction commit (PostgreSQL's documented TRUNCATE caveat). This is the same tradeoff the existing compress path already makes; called out here for visibility.
Stacking
Based on fix/epoch-encoding (#33): this feature's plain-PG twin-table validation tests exposed those latent epoch bugs, and the heap-tail tests require the fixes to pass. Retarget to main once #33 merges.
Testing
- make build clean; make clippy zero warnings; cargo fmt --check clean
- make test (PG 17) and make test PG_MAJOR=18 green, 568 unit tests each
- make integration-test PG_VERSIONS=17: 361/361 passed, including the new tests/test_compressed_insert.py (twin-table validation against plain PostgreSQL: routed + direct inserts, parallel scans, count/minmax folding, wCTE ON CONFLICT rejection, compaction, worker auto-compaction)
Provenance
Extracted from a longer perf session (branch perf/clickhouse-gap-session); UPDATE/DELETE (P2) and tombstone DELETE (P2.5) follow in a separate stacked PR. An independent correctness/cleanup pass tightened the stale-plan and compaction concurrency guards (see commit history).