-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Wrap immutable plan parts into Arc #19893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b9caf1d to
bb97763
Compare
|
run benchmark reset_plan_states |
|
🤖 Hi @askalt, thanks for the request (#19893 (comment)). |
|
@xudong963 could you please |
|
run benchmark reset_plan_states |
|
🤖 |
|
🤖: Benchmark completed Details
|
70a396a to
3614687
Compare
3614687 to
bb97763
Compare
bb97763 to
f2d829a
Compare
- Closes apache#19852 Improve performance of query planning and plan state re-set by making node clone cheap. - Store projection as `Option<Arc<[usize]>>` instead of `Option<Vec<usize>>` in `FilterExec`, `HashJoinExec`, `NestedLoopJoinExec`. - Store exprs as `Arc<[ProjectionExpr]>` instead of Vec in `ProjectionExprs`. - Store arced aggregation, filter, group by expressions within `AggregateExec`.
f2d829a to
050e6bb
Compare
|
I understand that we wanna avoid breaking changes and hence are kinda liberal with Hence I would favor if we could avoid these |
Actually, I also think that using a generic here isn't justified, since |
Indeed my concern is the impact on (all) downstream users during the upgrade. I believe the initial PR from @askalt was substantially larger because it required changing a bunch of callsites. If we can avoid the use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @askalt and for the review @crepererum and @martin-g
I think the performance results are quite compelling and this is quite close.
I agree it would be find to avoid the AsRef impls, if you are willing to do so @askalt . I also think it would be fine to leave this as is.
In any event, I think we should add an example to project_schema that shows how to call it with Option<&Vec<usize>> (what is passed by TableProvider::scan)
Before we merge this PR I think we
- should also add a note to the upgrade guide with a note about this change (I can help with this, but I didn't want to push changes to this PR before we have consensus approach)
- Clean up the API for
OptionProjectionRef(move it, and makeOption<ProjectionRef>)
| pub fn project_schema( | ||
| schema: &SchemaRef, | ||
| projection: Option<&Vec<usize>>, | ||
| projection: Option<&impl AsRef<[usize]>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @crepererum 's suggestion, I tried out what it would look like and came up with
It does seem to be reasonable
The original signature is Option<&Vec<..>> I think to align with TableProvider::scan (which also shouldn't have a owned Vec, but that I think is a historic accident)
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| pub struct ProjectionExprs { | ||
| exprs: Vec<ProjectionExpr>, | ||
| exprs: Arc<[ProjectionExpr]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth commenting here the rationale for using Arc<[...]>, namely that it makes this structure inexpensive to copy as happens during PhysicalPlanning
| mode: AggregateMode, | ||
| /// Group by expressions | ||
| group_by: PhysicalGroupBy, | ||
| group_by: Arc<PhysicalGroupBy>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here I think it would be good to note they are Arc to make clone/plan rewriting faster
| /// Returns an internal error if existing projection contains index that is | ||
| /// greater than len of the passed `projection`. | ||
| /// | ||
| pub fn apply_projection<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is basically refactored out of NestedLoops join (and maybe elsewhere)
| } | ||
| } | ||
|
|
||
| /// Describes an option immutable reference counted shared projection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why you need something like this (as it changes the signture of TableProvider) but this seems inconsistent with the other representations of projections in DataFusion/Arrow I am familiar with, which are represented with an Option rather than having the Option internally
I recommend:
- Move this into
datafusion/physical-expr/src/projection.rs(not in physical plan) so it is nearProjectionExprs - Move the Option outside (so the signature is
Option<ProjectionRef>rather thanOptionProjectionRef)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend documenting why this structure exists (because it is cheap to clone)
Something like
/// This structure represents projecting a set of columns by index.
/// It uses an `Arc` internally to make it cheap to clone
Which issue does this PR close?
Arc#19852Rationale for this change
Improve performance of query planning and plan state re-set by making node clone cheap.
What changes are included in this PR?
Option<Arc<[usize]>>instead ofOption<Vec<usize>>inFilterExec,HashJoinExec,NestedLoopJoinExec.Arc<[ProjectionExpr]>instead of Vec inProjectionExprs.AggregateExec.