feat(parquet): add ParquetPushDecoder::swap_strategy for adaptive scans#9
Open
feat(parquet): add ParquetPushDecoder::swap_strategy for adaptive scans#9
Conversation
11 tasks
Add a small surface to the push decoder so callers can swap the RowFilter, ProjectionMask, and/or RowSelectionPolicy at row-group boundaries without rebuilding the decoder: - new pub fn `can_swap_strategy() -> bool` — true between row groups (outer state `ReadingRowGroup`, inner state `Finished`) - new pub fn `swap_strategy(StrategySwap) -> Result<()>` — rejected with `ParquetError::General` when called mid-row-group - new `pub struct StrategySwap` (`#[non_exhaustive]`) with builder methods `with_projection`, `with_filter`, `with_row_selection_policy` - new pub fn `row_groups_remaining() -> usize` for diagnostics `PushBuffers` carries through the swap, so bytes already fetched for columns that survive into the new strategy are reused — only bytes the new strategy needs but that aren't already buffered get requested via `NeedsData`. Adaptive callers should drive the decoder with `try_next_reader` rather than `try_decode`: handing the active reader off transitions the decoder back to `ReadingRowGroup` immediately, giving the caller a clean swap window between two consecutive returns. `try_decode` loops past row-group boundaries internally and is unsuitable for in-flight strategy changes. Tests cover: filter swap between row groups, mid-row-group rejection, swap-while-iterating-handed-off-reader, and projection narrowing that reuses already-buffered bytes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`PushBuffers` is `pub(crate)` in the arrow-rs workspace, so the intra-doc link `[\`PushBuffers\`]: crate::util::push_buffers::PushBuffers` in `ParquetPushDecoder::swap_strategy`'s public docs failed the `-D rustdoc::private-intra-doc-links` check on `cargo +nightly doc --document-private-items`. The rest of the doc is unchanged; the prose now refers to the type by name without a link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b633dd6 to
403183a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Not for upstream merge — opened to run CI before landing the DataFusion-side change that depends on this branch (adriangb/datafusion#11).
Summary
Adds a small surface to the push decoder so callers can swap the
RowFilter,ProjectionMask, and/orRowSelectionPolicyat row-group boundaries without rebuilding the decoder:pub fn can_swap_strategy(&self) -> bool— true between row groups (outer stateReadingRowGroup, inner stateFinished)pub fn swap_strategy(&mut self, swap: StrategySwap) -> Result<()>— rejected withParquetError::Generalwhen called mid-row-grouppub struct StrategySwap(#[non_exhaustive]) with builder methodswith_projection,with_filter,with_row_selection_policypub fn row_groups_remaining(&self) -> usizefor diagnosticsPushBufferscarries through the swap, so bytes already fetched for columns that survive into the new strategy are reused — only bytes the new strategy needs but that aren't already buffered get requested viaNeedsData.Adaptive callers should drive the decoder with
try_next_readerrather thantry_decode: handing the active reader off transitions the decoder back toReadingRowGroupimmediately, giving the caller a clean swap window between two consecutive returns.try_decodeloops past row-group boundaries internally and is unsuitable for in-flight strategy changes.Test plan
cargo test -p parquet --lib --features arrow,async,object_store— 1121 passedpush_decoder/mod.rs::test:test_swap_strategy_installs_filter_between_row_groupstest_swap_strategy_rejected_mid_row_grouptest_swap_strategy_allowed_while_iterating_handed_off_readertest_swap_strategy_preserves_buffered_bytes🤖 Generated with Claude Code