-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix duplicate field name error in Join::try_new_with_project_input during physical planning #16454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
97caf0f
to
fb9d758
Compare
/// This is especially useful for queries that come as Substrait, since Substrait doesn't currently allow specifying | ||
/// aliases, neither for columns nor for tables. DataFusion requires columns to be uniquely identifiable, in some | ||
/// places (see e.g. DFSchema::check_names). | ||
pub fn requalify_sides_if_needed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this helper function to the logical plan builder
module since now its not used only by the substrait consumer but also by plan.rs.
fb9d758
to
6244060
Compare
// Re-qualify the join schema only if the inputs were previously requalified in | ||
// `try_new_with_project_input`. This ensures that when building the Projection | ||
// it can correctly resolve field nullability and data types | ||
// by disambiguating fields from the left and right sides of the join. | ||
let qualified_join_schema = if requalified { | ||
Arc::new(qualify_join_schema_sides( | ||
join_schema, | ||
original_left, | ||
original_right, | ||
)?) | ||
} else { | ||
Arc::clone(join_schema) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rationale for qualifying the schema is that when building the logical Projection
after, it will build the fields out of the expression names in exprlist_to_fields so it will look in new_join.schema()
and try to match each expr to a field in the schema, if the expr::Column does not have a qualifier and there are multiple candidates Fields
that could correspond to this expr::Column , we will get an ambiguity error, qualifying the schema allows us to prevent this.
Which issue does this PR close?
Rationale for this change
There have been some occasions where substrait queries fail during logical planning because a node's resulting schema contains duplicate field names. This is not allowed by either Arrow or DataFusion (always verified on check_names upon schema creation). This occurred most commonly when constructing the schema of joins, in inner joins where both the left and right inputs have a field with the same name, resulting in duplicate fields in the output schema.
This was previously addressed in the Substrait consumer path in` Fix duplicate unqualified Field name (schema error) on join queries and there is an existing function that handles this kind of situations on joins rels: requalify_sides_if_needed
We are now encountering the same error again, but this time during physical planning while mutating the Logical plan, rather than in logical planning phase itself.
Specifically it arises on this line when the join key is an non
Column
expression, since it has to create aProjection
with aColumn
expr on top (by callingwrap_projection_for_join_if_necessary
) to ensure correct execution iiuc. The Join logical node is then updated to use this new column. The issue occurs when the new logical node is built using Join::try_new_with_project_input as the function that constructs the join schema checks that names are unique as well.What changes are included in this PR?
My proposed workaround is to qualify the sides of the join on
try_new_with_project_input
by calling the same function the substrait consumer uses:requalify_sides_if_needed
, but I'm open to suggestions 🙇♀️Are these changes tested?
Are there any user-facing changes?
No.