DOC: User Guide Page on user-defined functions #61195

Open: 16 commits to be merged into base: main

@arthurlw
Contributor Author

Currently writing this, so I would appreciate any feedback on it!

Member

@rhshadrach left a comment

Thanks for the PR! I'm not opposed to a dedicated page on UDFs, but I am opposed to duplicating documentation that exists elsewhere in the user guide, as I think much of this does. Instead of e.g. examples of apply, I recommend linking to the appropriate section. This page can then focus on recommendations of when to use apply vs other methods.

Comment on lines 16 to 17
Why Use User-Defined Functions?
-------------------------------
Member

I think we should lead with Why _not_ User-Defined Functions. While performance is called out down below, I think the poor behavior of UDFs should be mentioned as well: namely, pandas has no information on what a UDF is doing, so it has to infer (guess) how to handle the result.

In particular, I think it should be mentioned that none of the examples on this page should be UDFs in practice.
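
As a rough illustration of this point (the frame, column names, and lambda are made up for the sketch, not taken from the PR), a row-wise UDF and the equivalent built-in expression give the same values, but only the second tells pandas what the operation is:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# UDF route: pandas calls the lambda once per row and has to
# infer how to assemble the returned scalars into a result
via_udf = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Built-in route: the same computation stated directly
vectorized = df["a"] + df["b"]
```

Both results hold the same values; the UDF version exists here only to show what should be avoided when a built-in operation is available.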

@rhshadrach added the Apply (Apply, Aggregate, Transform, Map) and Docs labels on Mar 29, 2025
@arthurlw
Contributor Author

Hi @rhshadrach thanks for the feedback! I agree with you and will push updates soon

Member

@rhshadrach left a comment

I think this is looking a lot better. Can we also link to https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation at the very bottom, in a section titled something like "Improving Performance with UDFs"?

ways to apply UDFs across different pandas data structures.

.. note::
    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
Member

Can you also mention resample, rolling, expanding, and ewm? Perhaps link to each section in the User Guide.

Member

Can we add the other objects to this note? It seems to me they all belong together.

Suggested change
- Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
+ Some of these methods can also be applied to groupby, resample, and various window objects. See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`, and :ref:`ewm()<window>` for details.
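
For context, a minimal sketch of a UDF passed to some of these objects. The series and the `value_range` helper are hypothetical, not part of the PR. `ewm` is left out of the sketch because, as far as I know, it does not expose a general `apply` method.

```python
import pandas as pd

# Hypothetical daily series, for illustration only
s = pd.Series(
    [1.0, 3.0, 6.0, 10.0],
    index=pd.date_range("2025-01-01", periods=4, freq="D"),
)

def value_range(window):
    # Hypothetical UDF: spread between the largest and smallest value
    return window.max() - window.min()

rolled = s.rolling(window=2).apply(value_range)   # fixed-size windows
expanded = s.expanding().apply(value_range)       # growing windows
resampled = s.resample("2D").apply(value_range)   # two-day bins
```

In each case the UDF receives the values of one window or bin and returns a scalar.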

pandas comes with a set of built-in functions for data manipulation, UDFs offer
flexibility when built-in methods are not sufficient. These functions can be
applied at different levels: element-wise, row-wise, column-wise, or group-wise,
and change the data differently, depending on the method used.
Member

nit: "change the data differently" sounds very close to mutating in a UDF, which we explicitly do not support. What do you think of "behave differently"?

Contributor Author

@arthurlw, Apr 12, 2025

“Behave differently” sounds clearer and avoids implying mutation. I'll update it!

Comment on lines 63 to 64
* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series,
DataFrames, or groups of data.
Member

I'm thinking we should remove groups of data here. DataFrame.apply that you're referencing doesn't operate on groups, and you mention groupby below.
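
To make the distinction concrete, a small sketch of what `DataFrame.apply` does operate on, namely columns or rows. The frame and the `spread` helper are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def spread(values):
    # Hypothetical UDF: receives one column or one row as a Series
    return values.max() - values.min()

per_column = df.apply(spread)          # axis=0 (default): one call per column
per_row = df.apply(spread, axis=1)     # axis=1: one call per row
```

Neither call involves groups; group-wise application goes through `groupby`.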

Comment on lines 129 to 130
When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
functions across groups.
Member

Things like .agg(["sum", "mean"]) aren't UDFs, so I don't think they should be mentioned here, and it could make users think these kinds of usage are slow (they are not).

Suggested change
- When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
- functions across groups.
+ When to use: Use :meth:`DataFrame.agg` for performing custom aggregations, where the operation returns a scalar value on each input.
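
A minimal sketch of such a custom aggregation, assuming a made-up frame and a hypothetical `second_largest` helper for which no built-in string alias exists:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def second_largest(column):
    # Hypothetical UDF: reduces one column to a single scalar
    return column.sort_values().iloc[-2]

result = df.agg(second_largest)
```

The UDF is called once per column and each call returns one scalar, so the result is a Series indexed by column name.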

})

# Using transform with mean
df['Mean_Transformed'] = df.groupby('Category')['Values'].transform('mean')
Member

This isn't an example of a UDF. I really like your example of using linear regression - can we do that here? It's a bit unfortunate that groupby.transform does not allow operating on the entire group (only works column-by-column) here.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x': [1, 2, 3, 1, 2, 3],
    'y': [2, 4, 6, 1, 2, 1.5]
}).set_index("x")

# Fit a linear model to one group's column and return its predictions;
# the x values are taken from the index
def fit_model(group):
    x = group.index.to_frame()
    y = group
    model = LinearRegression()
    model.fit(x, y)
    pred = model.predict(x)
    return pred

result = df.groupby('group').transform(fit_model)
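
For readers without scikit-learn installed, the same idea can be sketched with `numpy.polyfit` swapped in for `LinearRegression`. This is a substitute for illustration, not the version proposed in the review; `fit_line` is a hypothetical name, and the `y` column is selected explicitly so the UDF always receives a single Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [2, 4, 6, 1, 2, 1.5],
}).set_index("x")

def fit_line(y):
    # y is one group's "y" column; its index carries the x values
    x = y.index.to_numpy()
    slope, intercept = np.polyfit(x, y.to_numpy(), deg=1)
    # Return one fitted value per input row, as transform requires
    return slope * x + intercept

fitted = df.groupby("group")["y"].transform(fit_line)
```

Because `transform` only hands the UDF one column at a time, the x values have to travel in the index, which is the limitation noted above.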

Member

@datapythonista left a comment

Excellent job here @arthurlw, thanks for taking care of this. I added a general comment about using examples to incrementally illustrate what is explained here, and about changing the order of the sections a bit.

Please let me know if it doesn't make sense or if you have any comments. I'll review in more depth after the proposed changes are implemented or discussed. But at first look, this is really nice.

@@ -88,3 +88,4 @@ Guides
sparse
gotchas
cookbook
user_defined_functions
Member

I would move this before the groupby section. It feels more natural to me to explain first Series.apply and later explain groupby("col").apply.

and change the data differently, depending on the method used.

Why Not To Use User-Defined Functions
-----------------------------------------
Member

Not sure if Sphinx is more flexible now, but this had to be the same exact length as the title before.

{{ header }}

**************************************
Introduction to User-Defined Functions
Member

Suggested change
- Introduction to User-Defined Functions
+ User-Defined Functions (UDFs)

This is what will be shown in the index too, so it's better to be concise. Also, for consistency it's better to remove the "Introduction to", which we could otherwise have in every other user guide too.

applied at different levels: element-wise, row-wise, column-wise, or group-wise,
and change the data differently, depending on the method used.

Why Not To Use User-Defined Functions
Member

Maybe just personal opinion, but to me it makes more sense to explain what UDFs are in pandas before explaining when not to use them. This order seems reasonable assuming users already know what pandas UDFs are in practice, but I'd personally prefer not to assume that in the user guide for UDFs.

In my opinion, after the previous introduction which is great, I'd show a very simple example so we make sure users reading this understand the very basics.

Something like:

def add_one(x):
    return x + 1

my_series = pd.Series([1, 2, 3])

my_series.map(add_one)

Building on top of this, like then showing the same with a DataFrame, at some point showing UDFs that receive the whole column with .apply... should help make sure users are following and understanding all the information provided here.
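
One possible shape for that progression, as a sketch only: the frame and the `center` helper are hypothetical, and the final page may order or word things differently.

```python
import pandas as pd

def add_one(x):
    return x + 1

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Step 1: element-wise on a DataFrame column; add_one receives one scalar per call
col_result = df["a"].map(add_one)

def center(column):
    # Hypothetical column-wise UDF: receives a whole column as a Series
    return column - column.mean()

# Step 2: column-wise on the whole DataFrame; center is called once per column
centered = df.apply(center)
```

The difference between the two steps is what the UDF receives: a scalar with `Series.map`, a whole Series with `DataFrame.apply`.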

Labels: Apply (Apply, Aggregate, Transform, Map), Docs
Projects: None yet
Development

Successfully merging this pull request may close these issues.

DOC: Write user guide page on apply/map/transform methods
3 participants