kosiew
Contributor

@kosiew kosiew commented Sep 2, 2025

Which issue does this PR close?

Rationale for this change

Users reported inconsistent behavior when using plain string column identifiers vs. explicit col()/column() and lit()/literal() expressions. Some DataFrame methods accepted bare strings (e.g. select, sort, drop), while APIs that take expressions (e.g. filter, with_column) required Expr objects. This led to confusion and surprising runtime errors.

This PR clarifies and documents the intended behavior by:

  • Accepting plain string column identifiers where passing a column name is natural (e.g. sort, aggregate grouping keys, many order_by parameters, file sort metadata).
  • Requiring explicit Expr objects for APIs that expect an expression (e.g. filter, with_column, with_columns elements, join_on predicates). When a non-Expr is passed to such APIs, a clear TypeError message guides the user to use col()/column() or lit()/literal().
  • Updating documentation and tests to reflect the semantics and show examples.

This makes usage consistent and explicit while preserving the ergonomics of column-name strings where their meaning is unambiguous.

What changes are included in this PR?

High-level summary

  • Add robust helpers to validate and convert user-provided values:

    • ensure_expr(value) — validate and return the internal expression object or raise TypeError with a helpful message.
    • ensure_expr_list(iterable) — flatten and validate nested iterables of Expr objects.
    • _to_raw_expr(value) — convert an Expr or str column name to the internal raw expression.
    • Introduce SortKey = Expr | SortExpr | str type alias to represent items accepted by sort-like APIs.
  • Allow column-name strings in many APIs that logically take column identifiers: DataFrame.sort, DataFrame.aggregate (group-by keys), Window.order_by, various functions.* order_by parameters, and SessionContext file sort-order metadata.

  • Enforce explicit Expr values in expression-only APIs and provide clear error messages referencing col()/column() and lit()/literal() helpers.

  • Update Python doc (docs/source/user-guide/dataframe/index.rst) to explain which methods accept strings and which require col()/lit().

  • Add tests covering string acceptance and error cases (extensive additions to python/tests/test_dataframe.py and python/tests/test_expr.py).
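
The validation helpers summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in python/datafusion/expr.py: the `Expr` class is a hypothetical stand-in, the `.expr` attribute name is an assumption, and only the error message text is taken from this PR description.

```python
from collections.abc import Iterable
from typing import Any

# Message text quoted from the PR description (exposed there as EXPR_TYPE_ERROR).
EXPR_TYPE_ERROR = "Use col()/column() or lit()/literal() to construct expressions"


class Expr:
    """Hypothetical stand-in for datafusion.Expr wrapping a fake raw expression."""

    def __init__(self, raw: str) -> None:
        self.expr = raw  # assumed attribute name for the internal object


def ensure_expr(value: Any) -> str:
    """Return the internal expression, or raise TypeError for non-Expr input."""
    if not isinstance(value, Expr):
        raise TypeError(EXPR_TYPE_ERROR)
    return value.expr


def ensure_expr_list(exprs: Any) -> list:
    """Flatten nested iterables of Expr, treating strings as atomic values."""

    def _flatten(item: Any):
        # Strings are iterable, but they must be rejected as a whole rather
        # than walked character by character.
        if isinstance(item, Iterable) and not isinstance(item, (Expr, str, bytes)):
            for sub in item:
                yield from _flatten(sub)
        else:
            yield ensure_expr(item)

    return list(_flatten(exprs))
```

Treating strings as atomic means a stray `"a"` anywhere in the nested input fails with the same message as a top-level string, which is the behavior the tests in this PR are described as asserting.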

Files changed (high level)

  • python/datafusion/expr.py

    • Added helpers: ensure_expr, ensure_expr_list, _to_raw_expr.
    • Made expr_list_to_raw_expr_list accept str in addition to Expr and convert accordingly.
    • Made sort_list_to_raw_sort_list accept str and SortKey and convert entries to raw SortExpr.
    • Added SortKey alias and EXPR_TYPE_ERROR constant for consistent error messages.
  • python/datafusion/dataframe.py

    • Use ensure_expr/ensure_expr_list to validate expressions in filter, with_column(s), join_on, aggregate, and other places.
    • Allow column name strings for sort, aggregate group-by, and related APIs via conversions.
    • Added user-facing docstrings clarifying expected types for parameters and examples.
  • python/datafusion/context.py

    • Accept file_sort_order as nested SortKey sequences and added _convert_file_sort_order to create the low-level representation for the Rust bindings.
    • Import the internal expr namespace as expr_internal where required.
  • python/datafusion/functions.py

    • Updated numerous function signatures to accept SortKey for order_by parameters.
    • Updated docstrings and added usage examples where helpful.
  • docs/source/user-guide/dataframe/index.rst

    • New section: "String Columns and Expressions" documenting which methods accept plain strings and which require explicit col()/lit() expressions, with examples.
  • python/tests/test_dataframe.py, python/tests/test_expr.py

    • Added many tests exercising both permissive string-accepting behavior and strict expression validation.
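
To make the file_sort_order change in context.py more concrete, here is a rough sketch of converting nested sort keys into a low-level form. Everything in it is assumed for illustration: `RawSortExpr` is a hypothetical stand-in for the binding-level sort expression, the defaults (ascending, nulls first) and the handling of `None` are assumptions, and the `SortKey` alias here is narrower than the PR's `Expr | SortExpr | str`.

```python
from typing import NamedTuple, Optional, Sequence, Union


class RawSortExpr(NamedTuple):
    """Hypothetical stand-in for the low-level sort expression."""

    column: str
    ascending: bool = True    # assumed default
    nulls_first: bool = True  # assumed default


# Narrower than the PR's SortKey; Expr is left out to keep the sketch short.
SortKey = Union[RawSortExpr, str]


def convert_file_sort_order(
    file_sort_order: Optional[Sequence[Sequence[SortKey]]],
) -> Optional[list]:
    """Convert each inner sequence of sort keys, preserving the nesting."""
    if file_sort_order is None:
        return None  # assumption: "no ordering declared" is passed through
    return [
        [RawSortExpr(key) if isinstance(key, str) else key for key in group]
        for group in file_sort_order
    ]
```

The outer list keeps one entry per declared ordering, so a caller can mix bare column names and explicit sort expressions inside the same group.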

Notable implementation details

  • When a user supplies a plain string for sort/aggregate/window order_by or group_by, we convert the string into Expr.column(name) prior to calling the lower-level bindings.
  • For APIs where an expression is required (e.g., filter, with_column, join_on), passing a plain string now raises TypeError with the message: Use col()/column() or lit()/literal() to construct expressions (exposed via EXPR_TYPE_ERROR). Tests assert on that message to ensure consistent behavior.
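
The string-to-column conversion described above might look like the following sketch. The `Expr` and `SortExpr` classes are hypothetical stand-ins, the function names `to_raw_expr` and `sort_keys_to_sort_exprs` are illustrative rather than the PR's actual names, and the default sort direction is an assumption; only `Expr.column(name)`, the `SortKey` union, and the error message come from this description.

```python
from dataclasses import dataclass
from typing import Any, Sequence, Union

EXPR_TYPE_ERROR = "Use col()/column() or lit()/literal() to construct expressions"


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for datafusion.Expr."""

    raw: str  # stands in for the internal raw expression

    @staticmethod
    def column(name: str) -> "Expr":
        return Expr(f"column({name})")


@dataclass(frozen=True)
class SortExpr:
    """Hypothetical stand-in for datafusion.expr.SortExpr."""

    expr: Expr
    ascending: bool = True    # assumed default
    nulls_first: bool = True  # assumed default


SortKey = Union[Expr, SortExpr, str]


def to_raw_expr(value: Any) -> str:
    """Accept an Expr or a column-name string; reject everything else."""
    if isinstance(value, str):
        return Expr.column(value).raw
    if isinstance(value, Expr):
        return value.raw
    raise TypeError(EXPR_TYPE_ERROR)


def sort_keys_to_sort_exprs(keys: Union[SortKey, Sequence[SortKey]]) -> list:
    """Normalize one sort key or a sequence of them into SortExpr objects."""
    if isinstance(keys, (Expr, SortExpr, str)):
        keys = [keys]  # a single key is treated as a one-element list
    out = []
    for key in keys:
        if isinstance(key, SortExpr):
            out.append(key)  # explicit sort expressions pass through unchanged
        elif isinstance(key, str):
            out.append(SortExpr(Expr.column(key)))
        elif isinstance(key, Expr):
            out.append(SortExpr(key))
        else:
            raise TypeError(EXPR_TYPE_ERROR)
    return out
```

Under these assumptions, `sort_keys_to_sort_exprs("id")` and `sort_keys_to_sort_exprs([Expr.column("id")])` produce equal results, which mirrors the string/expression equivalence the new tests check.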

Are these changes tested?

Yes. This PR adds and updates unit tests to cover:

  • Accepting column-name strings for sort, aggregate group-by, window.order_by, array_agg(..., order_by=...), first_value/last_value/nth_value(..., order_by=...), lead/lag, row_number/rank/dense_rank/percent_rank/cume_dist/ntile, and SessionContext file sort order conversions.
  • Error paths for passing non-Expr types to expression-only APIs (filter, with_column, with_columns, join_on, etc.).
  • Equivalence of string-based usages with their col("...") counterparts where strings are allowed.

Specifically modified/added tests include (non-exhaustive list):

  • python/tests/test_dataframe.py — many new tests: test_select_unsupported, test_sort_string_and_expression_equivalent, test_sort_unsupported, test_aggregate_string_and_expression_equivalent, test_aggregate_tuple_group_by, test_filter_string_unsupported, test_with_column_invalid_expr, test_with_columns_invalid_expr, test_join_on_invalid_expr, test_aggregate_invalid_aggs, test_order_by_string_equivalence, and file sort-order tests for register_parquet, register_listing_table, and read_parquet.
  • python/tests/test_expr.py — tests for ensure_expr and ensure_expr_list behavior and error messages.

Are there any user-facing changes?

Yes.

Behavioral / UX changes

  • Methods that naturally take column identifiers now accept plain str values (for convenience):

    • DataFrame.sort("col")
    • DataFrame.aggregate(group_by="col", ...) (and group keys can be sequences/tuples of strings)
    • Many order_by parameters in window and aggregate functions now accept string column names.
    • SessionContext file sort-order metadata accepts column name strings.
  • Methods that inherently require expressions (computation or predicate) now enforce Expr values and raise a clear TypeError directing the user to col()/column() or lit()/literal():

    • DataFrame.filter — must pass an Expr (e.g. col("x") > lit(1)).
    • DataFrame.with_column, with_columns — items must be Expr objects.
    • DataFrame.join_on — ON predicates must be Expr (equality or other expressions).

Documentation

  • The user guide now contains a new section "String Columns and Expressions" explaining the above and showing usage examples. This should reduce confusion for users migrating from other libraries like Polars and make the library semantics explicit.

Compatibility / API surface

  • This is backwards-compatible for most users: previously-accepted code that used strings where allowed will continue to work. Code that accidentally passed strings to expression-only APIs will now be rejected with clearer errors instead of producing surprising behavior.
  • A new type alias SortKey was introduced in the codebase to express the union of accepted types for sort-like parameters. This is an internal typing convenience and should not affect external users directly.

Example usage

from datafusion import col, lit, functions as f

# Allowed: passing a column name string to sort or aggregate grouping
df.sort("id")
df.aggregate("id", [f.count(col("value"))])

# Required: expressions for predicate/transform
# Bad: df.filter("age > 21")  # raises TypeError
# Good:
df.filter(col("age") > lit(21))

# Window example: order_by accepts string
f.first_value(col("a"), order_by="ts")

Notes for reviewers

  • Focus review on the helpers in expr.py (ensure_expr, ensure_expr_list, _to_raw_expr) and the conversions in dataframe.py and context.py that call them. These are the core safety/validation changes.
  • The bulk of the diff is tests and doc changes which should be reviewed for correctness and clarity.

- Refactor expression handling and `_simplify_expression` for stronger
  type checking and clearer error handling
- Improve type annotations for `file_sort_order` and `order_by` to
  support string inputs
- Refactor DataFrame `filter` method to better validate predicates
- Replace internal error message variable with public constant
- Clarify usage of `col()` and `column()` in DataFrame examples
…handling

- Update `order_by` handling in Window class for better type support
- Improve type checking in DataFrame expression handling
- Replace `Expr`/`SortExpr` with `SortKey` in file_sort_order and
  related functions
- Simplify file_sort_order handling in SessionContext
- Rename `_EXPR_TYPE_ERROR` → `EXPR_TYPE_ERROR` for consistency
- Clarify usage of `col()` vs `column()` in DataFrame examples
- Enhance documentation for file_sort_order in SessionContext
…ng, sorting, and docs

- Introduce `ensure_expr` helper and improve internal expression
  validation
- Update error messages and tests to consistently use `EXPR_TYPE_ERROR`
- Refactor expression handling with `_to_raw_expr`, `_ensure_expr`, and
  `SortKey`
- Improve type safety and consistency in sort key definitions and file
  sort order
- Add parameterized parquet sorting tests
- Enhance DataFrame docstrings with clearer guidance and usage examples
- Fix minor typos and error message clarity
…ation

- Introduced `ensure_expr_list` to validate and flatten nested
  expressions, treating strings as atomic
- Updated expression utilities to improve consistency across aggregation
  and window functions
- Consolidated and expanded parameterized tests for string equivalence
  in ranking and window functions
- Exposed `EXPR_TYPE_ERROR` for consistent error messaging across
  modules and tests
- Improved internal sort logic using `expr_internal.SortExpr`
- Clarified expectations for `join_on` expressions in documentation
- Standardized imports and improved test clarity for maintainability
@HeWhoHeWho

HeWhoHeWho commented Sep 3, 2025

Hi, thanks for the PR.

# Required: expressions for predicate/transform ... # Good: df.filter(col("age") > lit(21))

One thing to check: lit() or literal() will remain flexible, as in the examples below, right?
Example: df.filter(col('A') > 123) or df.filter(col('B') == 'Jack') # These should be Good as well
This behaviour is allowed in Polars, and the current DataFusion version also supports it, hence checking whether this PR changes it.

@kosiew
Contributor Author

kosiew commented Sep 3, 2025

hi @HeWhoHeWho

Yes. Comparisons and arithmetic on an Expr automatically coerce plain Python values to literals, so you can write:

df.filter(col("A") > 123)
df.filter(col("B") == "Jack")

without explicitly wrapping 123 or "Jack" in lit()/literal().

Internally, each operator checks whether the right‑hand side is already an Expr; if not, it calls Expr.literal to convert the value before performing the operation.
Consequently, lit() and literal() remain available but are optional for simple constants.
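
As an illustration of the coercion described in this reply, here is a self-contained sketch. The `Expr` class is a hypothetical stand-in that builds strings instead of real expression objects, and the `_binary` helper is an invented name; only the idea of checking for an `Expr` and falling back to `Expr.literal` comes from the explanation above.

```python
from typing import Any


class Expr:
    """Hypothetical stand-in that renders expressions as strings."""

    def __init__(self, raw: str) -> None:
        self.raw = raw

    @staticmethod
    def column(name: str) -> "Expr":
        return Expr(name)

    @staticmethod
    def literal(value: Any) -> "Expr":
        return Expr(repr(value))

    def _binary(self, op: str, rhs: Any) -> "Expr":
        # Plain Python values are wrapped as literals; Expr values pass through.
        if not isinstance(rhs, Expr):
            rhs = Expr.literal(rhs)
        return Expr(f"({self.raw} {op} {rhs.raw})")

    def __gt__(self, rhs: Any) -> "Expr":
        return self._binary(">", rhs)

    def __eq__(self, rhs: Any) -> "Expr":
        # Returns an expression rather than a bool, like the real operators.
        return self._binary("=", rhs)

    def __add__(self, rhs: Any) -> "Expr":
        return self._binary("+", rhs)


print((Expr.column("A") > 123).raw)
print((Expr.column("A") > Expr.literal(123)).raw)
```

Both print statements render the same expression in this sketch, which is the sense in which an explicit literal wrapper is optional for simple constants.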

Contributor

@timsaucer timsaucer left a comment

I think this is a good improvement overall, and making our approach explicit is a good idea. I do think there is precedent in plenty of other libraries where some functions take strings and assume column names and others require you to provide expressions, so I feel comfortable in what we have here. I do have a couple of suggestions about places where the documentation feels a little awkward. Overall, a very nice addition.

Comment on lines +168 to +174
.. code-block:: python
from datafusion import col, lit
df.filter(col('age') > lit(21))
Without ``lit()`` DataFusion would treat ``21`` as a column name rather than a
constant value.
Contributor

Is this statement true? df.filter(col('age') > 21) would treat 21 as a column name? I think that's a change in how the comparison operator works.

@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of operations:
# Drop columns
df = df.drop("temporary_column")
String Columns and Expressions
Contributor

I think the title here is misleading. "String Columns" to me would mean columns that contain string values. I think maybe we should call this something like "Function arguments taking column names" or "Column names as function arguments"

String Columns and Expressions
------------------------------

Some ``DataFrame`` methods accept plain strings when an argument refers to an
Contributor

recommend "plain strings" -> "column names"

------------------------------

Some ``DataFrame`` methods accept plain strings when an argument refers to an
existing column. These include:
Contributor

We should probably add a note to see the full function documentation for details on any specific function.

Comment on lines +424 to +426
:func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
accepted. If more complex logic is required, see the logical operations in
:py:mod:`~datafusion.functions`.
Contributor

I would remove this part that says "plain strings are not accepted". When you view it from the context of this PR and Issue, the statement makes sense. But if you are an external user coming across this function definition for the first time, I don't think the statement makes sense. I think you can just remove the change here.

Comment on lines +448 to +449
:func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
accepted.
Contributor

Same comment as before.

pass named expressions use the form name=Expr.
By passing expressions, iterables of expressions, or named expressions.
All expressions must be :class:`~datafusion.expr.Expr` objects created via
:func:`datafusion.col` or :func:`datafusion.lit`; plain strings are not
Contributor

Same comment as before

On expressions are used to support in-equality predicates. Equality
predicates are correctly optimized
Join predicates must be :class:`~datafusion.expr.Expr` objects, typically
built with :func:`datafusion.col`; plain strings are not accepted. On
Contributor

Same comment as before

def expr_list_to_raw_expr_list(
expr_list: Optional[list[Expr] | Expr],
expr_list: Optional[_typing.Union[Sequence[_typing.Union[Expr, str]], Expr, str]],
Contributor

Why the change from | to _typing.Union? I'm not completely against it, but it seems we're inconsistent in our approach.

Contributor

FWIW I find the _typing.Union to be an eyesore. Would it be possible to just import Union if we're going in this direction?

Also the | is the preferred syntax in Python 3.10 and later and 3.9 reaches end of life next month. Maybe we just stick with | and at the end of October update our minimum version to 3.10. What do you think?

Successfully merging this pull request may close these issues.

Potential bug(?): Inconsistent usage of column() / col() and literal() / lit()