-
Notifications
You must be signed in to change notification settings - Fork 125
Support string column identifiers for sort/aggregate/window and stricter Expr validation #1221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Refactor expression handling and `_simplify_expression` for stronger type checking and clearer error handling - Improve type annotations for `file_sort_order` and `order_by` to support string inputs - Refactor DataFrame `filter` method to better validate predicates - Replace internal error message variable with public constant - Clarify usage of `col()` and `column()` in DataFrame examples
…handling - Update `order_by` handling in Window class for better type support - Improve type checking in DataFrame expression handling - Replace `Expr`/`SortExpr` with `SortKey` in file_sort_order and related functions - Simplify file_sort_order handling in SessionContext - Rename `_EXPR_TYPE_ERROR` → `EXPR_TYPE_ERROR` for consistency - Clarify usage of `col()` vs `column()` in DataFrame examples - Enhance documentation for file_sort_order in SessionContext
…ng, sorting, and docs - Introduce `ensure_expr` helper and improve internal expression validation - Update error messages and tests to consistently use `EXPR_TYPE_ERROR` - Refactor expression handling with `_to_raw_expr`, `_ensure_expr`, and `SortKey` - Improve type safety and consistency in sort key definitions and file sort order - Add parameterized parquet sorting tests - Enhance DataFrame docstrings with clearer guidance and usage examples - Fix minor typos and error message clarity
…ation - Introduced `ensure_expr_list` to validate and flatten nested expressions, treating strings as atomic - Updated expression utilities to improve consistency across aggregation and window functions - Consolidated and expanded parameterized tests for string equivalence in ranking and window functions - Exposed `EXPR_TYPE_ERROR` for consistent error messaging across modules and tests - Improved internal sort logic using `expr_internal.SortExpr` - Clarified expectations for `join_on` expressions in documentation - Standardized imports and improved test clarity for maintainability
Hi, thanks for the PR.
One thing to check, |
hi @HeWhoHeWho Yes. Comparisons and arithmetic on an Expr automatically coerce plain Python values to literals, so you can write: df.filter(col("A") > 123)
df.filter(col("B") == "Jack") without explicitly wrapping 123 or "Jack" in lit()/literal(). Internally, each operator checks whether the right‑hand side is already an Expr; if not, it calls Expr.literal to convert the value before performing the operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good improvement overall, and making our approach explicit is a good idea. I do think there is precedent in plenty of other libraries where some functions take strings and assume column names and others require you to provide expressions, so I feel comfortable in what we have here. I do have a couple of suggestions about places where the documentation feels a little awkward. Overall, a very nice addition.
.. code-block:: python | ||
from datafusion import col, lit | ||
df.filter(col('age') > lit(21)) | ||
Without ``lit()`` DataFusion would treat ``21`` as a column name rather than a | ||
constant value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this statement true? df.filter(col('age') > 21)
would treat 21
as a column name? I think that's a change in how the comparison operator works.
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of operations: | |||
# Drop columns | |||
df = df.drop("temporary_column") | |||
String Columns and Expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the title here is misleading. "String Columns" to me would mean columns that contain string values. I think maybe we should call this something like "Function arguments taking column names" or "Column names as function arguments"
String Columns and Expressions | ||
------------------------------ | ||
|
||
Some ``DataFrame`` methods accept plain strings when an argument refers to an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend "plain strings" -> "column names"
------------------------------ | ||
|
||
Some ``DataFrame`` methods accept plain strings when an argument refers to an | ||
existing column. These include: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a note to see the full function documentation for details on any specific function.
:func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not | ||
accepted. If more complex logic is required, see the logical operations in | ||
:py:mod:`~datafusion.functions`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this part that says "plain strings are not accepted". When you view it from the context of this PR and Issue, the statement makes sense. But if you are an external user coming across this function definition for the first time, I don't think the statement makes sense. I think you can just remove the change here.
:func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not | ||
accepted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before.
pass named expressions use the form name=Expr. | ||
By passing expressions, iterables of expressions, or named expressions. | ||
All expressions must be :class:`~datafusion.expr.Expr` objects created via | ||
:func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before
On expressions are used to support in-equality predicates. Equality | ||
predicates are correctly optimized | ||
Join predicates must be :class:`~datafusion.expr.Expr` objects, typically | ||
built with :func:`datafusion.col`; plain strings are not accepted. On |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before
def expr_list_to_raw_expr_list( | ||
expr_list: Optional[list[Expr] | Expr], | ||
expr_list: Optional[_typing.Union[Sequence[_typing.Union[Expr, str]], Expr, str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change from |
to _typing.Union
? I'm not completely against it but it seems we're inconsistent in our approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I find the _typing.Union
to be an eye sore. Would it be possible to just import Union if we're going in this direction?
Also the |
is the preferred syntax in Python 3.10 and later and 3.9 reaches end of life next month. Maybe we just stick with |
and at the end of October update our minimum version to 3.10. What do you think?
Which issue does this PR close?
Rationale for this change
Users reported inconsistent behavior when using plain string column identifiers vs. explicit
col()
/column()
andlit()
/literal()
expressions. SomeDataFrame
methods accepted bare strings (e.g.select
,sort
,drop
), while expression-returning APIs (e.g.filter
,with_column
) requiredExpr
objects. This led to confusion and surprising runtime errors.This PR clarifies and documents the intended behavior by:
sort
,aggregate
grouping keys, manyorder_by
parameters, file sort metadata).Expr
objects for APIs that expect an expression (e.g.filter
,with_column
,with_columns
elements,join_on
predicates). When a non-Expr
is passed to such APIs, a clearTypeError
message guides the user to usecol()
/column()
orlit()
/literal()
.This makes usages consistent and explicit while preserving ergonomics where obvious (accepting column-name strings).
What changes are included in this PR?
High-level summary
Add robust helpers to validate and convert user-provided values:
ensure_expr(value)
— validate and return the internal expression object or raiseTypeError
with a helpful message.ensure_expr_list(iterable)
— flatten and validate nested iterables ofExpr
objects._to_raw_expr(value)
— convert anExpr
orstr
column name to the internal raw expression.SortKey = Expr | SortExpr | str
type alias to represent items accepted by sort-like APIs.Allow column-name strings in many APIs that logically take column identifiers:
DataFrame.sort
,DataFrame.aggregate
(group-by keys),Window.order_by
, variousfunctions.*
order_by
parameters, andSessionContext
file sort-order metadata.Enforce explicit
Expr
values in expression-only APIs and provide clear error messages referencingcol()/column()
andlit()/literal()
helpers.Update Python doc (
docs/source/user-guide/dataframe/index.rst
) to explain which methods accept strings and which requirecol()
/lit()
.Add tests covering string acceptance and error cases (extensive additions to
python/tests/test_dataframe.py
andpython/tests/test_expr.py
).Files changed (high level)
python/datafusion/expr.py
ensure_expr
,ensure_expr_list
,_to_raw_expr
.expr_list_to_raw_expr_list
acceptstr
in addition toExpr
and convert accordingly.sort_list_to_raw_sort_list
acceptstr
andSortKey
and convert entries to rawSortExpr
.SortKey
alias andEXPR_TYPE_ERROR
constant for consistent error messages.python/datafusion/dataframe.py
ensure_expr
/ensure_expr_list
to validate expressions infilter
,with_column(s)
,join_on
,aggregate
, and other places.sort
,aggregate
group-by, and related APIs via conversions.python/datafusion/context.py
file_sort_order
as nestedSortKey
sequences and added_convert_file_sort_order
to create the low-level representation for the Rust bindings.expr
namespace asexpr_internal
where required.python/datafusion/functions.py
SortKey
fororder_by
parameters.docs/source/user-guide/dataframe/index.rst
col()
/lit()
expressions, with examples.python/tests/test_dataframe.py
,python/tests/test_expr.py
Notable implementation details
order_by
orgroup_by
, we convert the string intoExpr.column(name)
prior to calling the lower-level bindings.filter
,with_column
,join_on
), passing a plain string now raisesTypeError
with the message:Use col()/column() or lit()/literal() to construct expressions
(exposed viaEXPR_TYPE_ERROR
). Tests assert on that message to ensure consistent behavior.Are these changes tested?
Yes. This PR adds and updates unit tests to cover:
sort
,aggregate
group-by,window.order_by
,array_agg(..., order_by=...)
,first_value/last_value/nth_value(..., order_by=...)
,lead/lag
,row_number
/rank
/dense_rank
/percent_rank
/cume_dist
/ntile
, andSessionContext
file sort order conversions.Expr
types to expression-only APIs (filter
,with_column
,with_columns
,join_on
, etc.).col("...")
counterparts when allowed.Specifically modified/added tests include (non-exhaustive list):
python/tests/test_dataframe.py
— many new tests:test_select_unsupported
,test_sort_string_and_expression_equivalent
,test_sort_unsupported
,test_aggregate_string_and_expression_equivalent
,test_aggregate_tuple_group_by
,test_filter_string_unsupported
,test_with_column_invalid_expr
,test_with_columns_invalid_expr
,test_join_on_invalid_expr
,test_aggregate_invalid_aggs
,test_order_by_string_equivalence
, and file sort-order tests forregister_parquet
,register_listing_table
, andread_parquet
.python/tests/test_expr.py
— tests forensure_expr
andensure_expr_list
behavior and error messages.Are there any user-facing changes?
Yes.
Behavioral / UX changes
Methods that naturally take column identifiers now accept plain
str
values (for convenience):DataFrame.sort("col")
DataFrame.aggregate(group_by="col", ...)
(and group keys can be sequences/tuples of strings)order_by
parameters in window and aggregate functions now accept string column names.SessionContext
file sort-order metadata accepts column name strings.Methods that inherently require expressions (computation or predicate) now enforce
Expr
values and raise a clearTypeError
directing the user tocol()
/column()
orlit()
/literal()
:DataFrame.filter
— must pass anExpr
(e.g.col("x") > lit(1)
).DataFrame.with_column
,with_columns
— items must beExpr
objects.DataFrame.join_on
— ON predicates must beExpr
(equality or other expressions).Documentation
Compatibility / API surface
SortKey
was introduced in the codebase to express the union of accepted types for sort-like parameters. This is an internal typing convenience and should not affect external users directly.Example usage
Notes for reviewers
expr.py
(ensure_expr
,ensure_expr_list
,_to_raw_expr
) and the conversions indataframe.py
andcontext.py
that call them. These are the core safety/validation changes.