-
Notifications
You must be signed in to change notification settings - Fork 1k
Expose predicates from RowFilter #8315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Ben Ye <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @yeya24 -- this change makes sense to me, but I would like to request a slightly different API
@@ -172,7 +172,7 @@ where | |||
/// [`RowSelection`]: crate::arrow::arrow_reader::RowSelection | |||
pub struct RowFilter { | |||
/// A list of [`ArrowPredicate`] | |||
pub(crate) predicates: Vec<Box<dyn ArrowPredicate>>, | |||
pub predicates: Vec<Box<dyn ArrowPredicate>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making this pub, can you please make an accessor so we can change the internals of RowFilter without causing a breaking API change?
Perhaps something like
impl RowFilter {
pub fn predicate(&self) -> &Vec<Box<dyn ArrowPredicate>> { .. }
// and convert into the innner
pub fn into_predicates(self) -> Vec<Box<dyn ArrowPredicate>> { .. }
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in latest commit
Signed-off-by: Ben Ye <[email protected]>
Thanks @alamb, I have addressed your comments by exposing predicates via methods. For our usecase, we basically need a If we are open to adding a reader for this usecase, then we are happy. If not, then we have to reimplement a reader which requires a lot of other structs and methods to be exposed as public. I can list some:
Would you be open to expose those specific fields or we are open to add a new reader that only filters and returns a boolean array? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb, I have addressed your comments by exposing predicates via methods.
Thank you @yeya24 . It looks like this PR needs a cargo fmt
to get Ci passing but then I think we can merge it
For our usecase, we basically need a
ParquetRecordBatchStreamReader
like reader. Instead of reading and materializing the final results, it only filters the rows based on Row Filter and Row selection and returns the final rows that match the predicates in the Parquet file. The final result can be eitherRowSelection
or a boolean array. Then our own customized reader can read and materialize the rows based on the filter results.
This makes sense -- it sounds a lot like what a ReadPlan
is today, except that ReadPlan currently expands the BooleanArray back into a RowSelection
As you have probably also noticed, converting BooleanArray
back to RowSelection
is quite inefficient sometimes, and we have discussed improving this (see #5523 and linked PRs -- especially #7454)
- There is more discussion here #8000
If we are open to adding a reader for this usecase, then we are happy. If not, then we have to reimplement a reader which requires a lot of other structs and methods to be exposed as public. I can list some:
Thank you for the offer. I would love to try and find some way to work together
What you are describing sounds like effectively being able to access the ReadPlan
after evaluating the RowFilters but before decoding the data. The way the code is currently structured, it is not easy to get at this state.
However, if we could make some way to stop the decoder before it started decoding data, your usecase could use the existing code too.
What I am currently trying to do is to separate out the IO from the parquet decoder, by explicitly making the push decoder (see #7983 and code in #7997)
If you have a chance to comment / review that structure I would love some more feedback.
cc @XiangpengHao and @zhuqi-lucas who may also have interest in this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM After CI passed, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you! I also wanted this in LiquidCache
Signed-off-by: Ben Ye <[email protected]>
Lint issue should be fixed in the latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks everyone! |
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Rationale for this change
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
What changes are included in this PR?
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
Are these changes tested?
We typically require tests for all PRs in order to:
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.