Support string column identifiers for sort/aggregate/window and stricter Expr validation #1221
base: main
Changes from all commits
f9cafb8
91167b0
54687a2
f591617
31a648f
37307b0
05cd237
28619d9
9adbf4f
0a27617
92bc68e
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -126,6 +126,53 @@ DataFusion's DataFrame API offers a wide range of operations: | |
# Drop columns | ||
df = df.drop("temporary_column") | ||
|
||
String Columns and Expressions | ||
------------------------------ | ||
|
||
Some ``DataFrame`` methods accept plain strings when an argument refers to an | ||
Review comment: recommend "plain strings" -> "column names"
||
existing column. These include: | ||
Review comment: We should probably add a note to see the full function documentation for details on any specific function.
||
|
||
* :py:meth:`~datafusion.DataFrame.select` | ||
* :py:meth:`~datafusion.DataFrame.sort` | ||
* :py:meth:`~datafusion.DataFrame.drop` | ||
* :py:meth:`~datafusion.DataFrame.join` (``on`` argument) | ||
* :py:meth:`~datafusion.DataFrame.aggregate` (grouping columns) | ||
|
||
Note that :py:meth:`~datafusion.DataFrame.join_on` expects ``col()``/``column()`` expressions rather than plain strings. | ||
|
||
For such methods, you can pass column names directly: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion import col, functions as f | ||
|
||
df.sort('id') | ||
df.aggregate('id', [f.count(col('value'))]) | ||
|
||
The same operation can also be written with explicit column expressions, using either ``col()`` or ``column()``: | ||
|
||
.. code-block:: python | ||
|
||
from datafusion import col, column, functions as f | ||
|
||
df.sort(col('id')) | ||
df.aggregate(column('id'), [f.count(col('value'))]) | ||
|
||
Note that ``column()`` is an alias of ``col()``, so you can use either name; the example above shows both in action. | ||
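As a rough mental model of how a method can accept either a column name or an explicit column expression, the self-contained sketch below normalizes both forms to one representation and rejects anything else. The ``Column`` class and the ``ensure_expr`` helper are hypothetical names invented for this illustration; they are not part of the DataFusion API and make no claim about how DataFusion implements this internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Toy stand-in for a column expression (hypothetical, not DataFusion's Expr)."""

    name: str


def col(name: str) -> Column:
    """Build a toy column reference from a column name."""
    return Column(name)


# Alias, mirroring the col()/column() pair described above.
column = col


def ensure_expr(arg):
    """Accept a column name or a Column; reject every other type."""
    if isinstance(arg, Column):
        return arg
    if isinstance(arg, str):
        return col(arg)
    raise TypeError(
        f"expected a column name or Column, got {type(arg).__name__}"
    )


# All three spellings normalize to the same toy expression.
assert ensure_expr("id") == ensure_expr(col("id")) == ensure_expr(column("id"))
```

In this toy model, passing ``'id'`` and passing ``col('id')`` are interchangeable, while a value such as ``21`` raises ``TypeError`` instead of being silently reinterpreted.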
|
||
Whenever an argument represents an expression—such as in | ||
:py:meth:`~datafusion.DataFrame.filter` or | ||
:py:meth:`~datafusion.DataFrame.with_column`—use ``col()`` to reference columns | ||
and wrap constant values with ``lit()`` (also available as ``literal()``): | ||
|
||
.. code-block:: python | ||
|
||
from datafusion import col, lit | ||
df.filter(col('age') > lit(21)) | ||
|
||
Wrapping the value in ``lit()`` makes it explicit that ``21`` is a constant | |
value rather than a reference to a column. | |
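To make the difference between a column reference and a constant concrete, here is a small toy model that is independent of DataFusion. ``Column``, ``Literal``, ``GreaterThan``, and ``evaluate`` are hypothetical names used only for this sketch; it shows that a column is looked up by name in each row, while a literal is used as is.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Column:
    """Toy column reference: resolved by name against a row."""

    name: str


@dataclass(frozen=True)
class Literal:
    """Toy constant: carries its value directly."""

    value: Any


@dataclass(frozen=True)
class GreaterThan:
    """Toy binary expression comparing two sub-expressions."""

    left: Any
    right: Any


def evaluate(expr, row: dict):
    """Evaluate a toy expression against one row (a plain dict)."""
    if isinstance(expr, Column):
        return row[expr.name]  # name lookup in the row
    if isinstance(expr, Literal):
        return expr.value  # constant, no lookup
    if isinstance(expr, GreaterThan):
        return evaluate(expr.left, row) > evaluate(expr.right, row)
    raise TypeError(f"unsupported expression: {type(expr).__name__}")


# Toy analogue of col('age') > lit(21)
predicate = GreaterThan(Column("age"), Literal(21))
rows = [{"age": 30}, {"age": 18}]
adults = [row for row in rows if evaluate(predicate, row)]
```

Here ``adults`` keeps only the row whose ``age`` exceeds the constant, because ``Column("age")`` is resolved per row and ``Literal(21)`` is not.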
Review comment on lines +168 to +174: Is this statement true?
||
|
||
Terminal Operations | ||
------------------- | ||
|
||
|
Review comment: I think the title here is misleading. "String Columns" to me would mean columns that contain string values. I think maybe we should call this something like "Function arguments taking column names" or "Column names as function arguments".