@yeya24 yeya24 commented Sep 11, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

What changes are included in this PR?

There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.

Are these changes tested?

We typically require tests for all PRs in order to:

  1. Prevent the code from being accidentally broken by subsequent changes
  2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 11, 2025

@alamb alamb left a comment

Thank you @yeya24 -- this change makes sense to me, but I would like to request a slightly different API

@@ -172,7 +172,7 @@ where
 /// [`RowSelection`]: crate::arrow::arrow_reader::RowSelection
 pub struct RowFilter {
     /// A list of [`ArrowPredicate`]
-    pub(crate) predicates: Vec<Box<dyn ArrowPredicate>>,
+    pub predicates: Vec<Box<dyn ArrowPredicate>>,

Instead of making this pub, can you please make an accessor so we can change the internals of RowFilter without causing a breaking API change?

Perhaps something like

impl RowFilter {
    // borrow the predicates without exposing the field itself
    pub fn predicate(&self) -> &Vec<Box<dyn ArrowPredicate>> { &self.predicates }
    // and convert into the inner
    pub fn into_predicates(self) -> Vec<Box<dyn ArrowPredicate>> { self.predicates }
}
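
To illustrate why an accessor is preferable to a `pub` field, here is a minimal, self-contained sketch of the pattern. The `Predicate` trait, `GreaterThan`, and `Filter` are hypothetical stand-ins invented for this example; they are not the parquet crate's `ArrowPredicate` or `RowFilter`.

```rust
// Hypothetical stand-ins for illustration only; NOT the parquet crate's types.
pub trait Predicate {
    /// Returns true if the row value should be kept.
    fn keep(&self, value: i64) -> bool;
}

pub struct GreaterThan(pub i64);

impl Predicate for GreaterThan {
    fn keep(&self, value: i64) -> bool {
        value > self.0
    }
}

pub struct Filter {
    // Private field: the storage can change later without breaking callers,
    // as long as the two accessors below keep their signatures.
    predicates: Vec<Box<dyn Predicate>>,
}

impl Filter {
    pub fn new(predicates: Vec<Box<dyn Predicate>>) -> Self {
        Self { predicates }
    }

    /// Read-only view of the predicates.
    pub fn predicates(&self) -> &Vec<Box<dyn Predicate>> {
        &self.predicates
    }

    /// Consume the filter and hand ownership of the predicates to the caller.
    pub fn into_predicates(self) -> Vec<Box<dyn Predicate>> {
        self.predicates
    }
}
```

Callers can inspect or take the predicates, but they cannot name the field directly, so a later change to how the predicates are stored would not be a breaking API change.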

Fixed in latest commit

yeya24 commented Sep 11, 2025

Thanks @alamb, I have addressed your comments by exposing predicates via methods.

For our use case, we basically need a ParquetRecordBatchStreamReader-like reader. Instead of reading and materializing the final results, it only filters the rows based on the RowFilter and RowSelection and returns the rows that match the predicates in the Parquet file. The final result can be either a RowSelection or a boolean array. Our own customized reader can then read and materialize the rows based on the filter results.

If the project is open to adding a reader for this use case, we would be happy with that. If not, we have to reimplement a reader, which requires a lot of other structs and methods to be exposed as public. I can list some:

Would you be open to exposing those specific fields, or to adding a new reader that only filters and returns a boolean array? Thanks
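
The two-phase flow described above (filter first, materialize later) can be sketched with plain in-memory slices. This is only an illustration under simplifying assumptions: `evaluate_mask` and `materialize` are hypothetical helper names, predicates are ordinary closures, and nothing here uses the parquet crate.

```rust
/// Phase 1: evaluate every predicate against a column and AND the results
/// into a single boolean mask. No output rows are built in this phase.
/// (Hypothetical helper, not a parquet crate API.)
pub fn evaluate_mask(column: &[i64], predicates: &[Box<dyn Fn(i64) -> bool>]) -> Vec<bool> {
    column
        .iter()
        .map(|&value| predicates.iter().all(|p| p(value)))
        .collect()
}

/// Phase 2: a separate reader materializes only the rows the mask kept.
/// (Hypothetical helper, not a parquet crate API.)
pub fn materialize<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

In this sketch the mask is the only thing that crosses the boundary between the two phases, which is the role a boolean array or RowSelection would play for a custom reader.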

@alamb alamb left a comment

> Thanks @alamb, I have addressed your comments by exposing predicates via methods.

Thank you @yeya24. It looks like this PR needs a cargo fmt to get CI passing, but then I think we can merge it.

> For our use case, we basically need a ParquetRecordBatchStreamReader-like reader. Instead of reading and materializing the final results, it only filters the rows based on the RowFilter and RowSelection and returns the rows that match the predicates in the Parquet file. The final result can be either a RowSelection or a boolean array. Our own customized reader can then read and materialize the rows based on the filter results.

This makes sense -- it sounds a lot like what a ReadPlan is today, except that ReadPlan currently expands the BooleanArray back into a RowSelection

As you have probably also noticed, converting BooleanArray back to RowSelection is quite inefficient sometimes, and we have discussed improving this (see #5523 and linked PRs -- especially #7454)

  • There is more discussion here #8000
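
As a rough illustration of the conversion cost mentioned above, the sketch below turns a boolean mask into run-length encoded select/skip runs. `Run` and `mask_to_runs` are hypothetical names for this example, loosely modeled on the run-length shape of a row selection; this is not the crate's implementation.

```rust
/// Hypothetical stand-in for one run in a run-length encoded selection.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Run {
    pub row_count: usize,
    /// true = skip these rows, false = select them
    pub skip: bool,
}

/// Collapse a per-row mask into consecutive select/skip runs.
pub fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &keep in mask {
        let skip = !keep;
        // Extend the previous run when this row has the same kind.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise start a new run of length one.
        runs.push(Run { row_count: 1, skip });
    }
    runs
}
```

When the mask alternates between kept and skipped rows, this encoding produces one run per row, which is one way to see why a run-length representation can cost more than the boolean mask it came from.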

> If the project is open to adding a reader for this use case, we would be happy with that. If not, we have to reimplement a reader, which requires a lot of other structs and methods to be exposed as public. I can list some:

Thank you for the offer. I would love to try and find some way to work together

What you are describing sounds like effectively being able to access the ReadPlan after evaluating the RowFilters but before decoding the data. The way the code is currently structured, it is not easy to get at this state.

However, if we could find some way to stop the decoder before it starts decoding data, your use case could use the existing code too.

What I am currently trying to do is to separate out the IO from the parquet decoder, by explicitly making the push decoder (see #7983 and code in #7997)

If you have a chance to comment / review that structure I would love some more feedback.

cc @XiangpengHao and @zhuqi-lucas who may also have interest in this

@zhuqi-lucas zhuqi-lucas left a comment

LGTM after CI passes, thanks!

@XiangpengHao XiangpengHao left a comment

LGTM, thank you! I also wanted this in LiquidCache

Signed-off-by: Ben Ye

yeya24 commented Sep 12, 2025

Lint issue should be fixed in the latest commit.

@etseidl etseidl left a comment

LGTM!

@alamb alamb merged commit 7b8f1f1 into apache:main Sep 12, 2025
16 checks passed

alamb commented Sep 12, 2025

Thanks everyone!

Successfully merging this pull request may close these issues.

[Parquet] Expose predicates from RowFilter