Compact parquet in size-budgeted batches (#933 followup)#959
Merged
erikdarlingdata merged 2 commits intoMay 18, 2026
Conversation
After PR #955 the main connection cap and Binder fix landed, but the reporter's next nightly hit a different failure mode: query_snapshots compaction OOM'd at "failed to pin block of size 102.9 MiB (3.6 GiB/3.7 GiB used)" with a 72-file backlog. Not the pre-reservation bug — DuckDB has legitimately consumed nearly the full 4 GB cap doing real work. Wide-VARCHAR plan XML expands ~10x on read; merging 72 files into a single COPY needs more than 4 GB in memory. The existing pairwise merge accumulates: by step 70 of 72, the accumulator is the merge of 71 files combined being read alongside one more. Bounded if file count is bounded; unbounded otherwise. This change caps a merge by total on-disk input bytes, not file count: - BuildSizeBudgetedBatches greedily groups smallest-first-sorted paths into batches whose total bytes don't exceed MaxBatchInputBytes (200 MB). - A group that fits in one batch keeps the existing YYYYMM_table.parquet output name — fully backward compatible. - A group that needs multiple batches produces YYYYMM_table_ptNNN.parquet part files (numbered from 001). Archive views already glob *_table.parquet, so readers see all parts as one logical month. - New regex case in the file-recognition pass recognizes _ptNNN suffixes so subsequent compactions round-trip part files correctly. - MergeBatchToFile factors out the single-pass-vs-pairwise logic so each batch is a standalone call; the cleanup orchestration (delete originals, promote temps) runs once after all batches succeed. For the reporter's specific 72-file backlog at ~5 MB each (~360 MB total), this produces roughly two batches of ~36 files each. Memory demand of each batch is half what the un-batched merge needed, well within the 4 GB compaction cap. Naturally adaptive: narrow tables (wait_stats, perfmon_stats) fit hundreds of files per batch; wide tables (query_snapshots, query_store_stats with plan XML) get fewer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
After PR #955 landed (main connection cap, raise-for-COPY wrapper, per-pair Binder fix), the reporter's next nightly hit a new failure mode:
This is not the pre-reservation bug from before — DuckDB has legitimately consumed nearly the full 4 GB compaction cap doing real work. The 72-file backlog (accumulated during the broken nightlies) was too large for the pairwise-accumulator merge to handle in one pass: by step 70 of 72, the accumulator is the merge of 71 files combined.
Fix
Cap merges by total on-disk input bytes, not file count. Wide-VARCHAR rows (query_snapshots' plan XML expands ~10x on read) get fewer files per batch; narrow tables fit hundreds.
MaxBatchInputBytes = 200 MB— sized so even ~10x expansion stays well under the 4 GB compaction capBuildSizeBudgetedBatches(sortedPaths, maxBytes)— greedy smallest-first bucketingMergeBatchToFile(table, sourcePaths, outputPath, spillDirSql)— extracted the existing single-pass-vs-pairwise merge logic so each batch is a standalone callYYYYMM_table.parquetoutput (backward compatible)YYYYMM_table_pt001.parquet,_pt002.parquet, etc.; archive views already glob*_table.parquetso readers see them as one logical month_ptNNNsuffix so subsequent compactions round-trip part files correctlyWhy this is naturally adaptive
The cap is dimensional, not numeric: 200 MB of on-disk compressed parquet is a lot of rows for narrow tables (
wait_stats,perfmon_stats) and few rows for wide ones. No table-specific tuning needed.What this means for the reporter's 72-file backlog
At ~5 MB per file × 72 ≈ 360 MB total on disk. Splits into roughly two batches of ~36 files each. Each batch's in-memory peak is half what the un-batched merge needed — well within 4 GB.
Test plan
dotnet build Lite/PerformanceMonitorLite.csproj -c Release)ArchiveSizeThresholdMb = 200(matches our PR Fix #933: cap main DuckDB memory_limit, per-pair compaction exclude detection #955 validation pattern), populate archive with a synthetic backlog of wide-VARCHAR files, trigger compaction, verify_pt001/_pt002outputs and zero OOM🤖 Generated with Claude Code