ENH: Add warning when DataFrame.groupby drops NA keys
#61339
Labels
Enhancement
Groupby
Missing-data
np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Needs Discussion
Requires discussion from core team before further action
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Currently, pandas DataFrame.groupby will default to dropping NA values in any of the keys. In v1.1, a dropna argument was added which allows users to retain NA values, but its default value is set to True (#3729). In that discussion, there were several requests to add a warning when NA values are dropped (#3729 (comment)). This issue raises that request to provide additional visibility and a single place for discussion.
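As a minimal illustration of the behavior described above (the frame and column names are made up for this example), the default call silently omits the row whose key is missing, while dropna=False keeps it as its own group:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one of the four rows has a missing key.
df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "value": [1, 2, 3, 4]})

# Default behavior (dropna=True): the row with the NaN key is left out,
# and nothing tells the user that it happened.
default_sums = df.groupby("key")["value"].sum()

# Explicit dropna=False: the NaN key is kept as a separate group.
kept_sums = df.groupby("key", dropna=False)["value"].sum()

print(default_sums)
print(kept_sums)
```

The two totals differ by the value attached to the missing key, which is the kind of silent discrepancy this issue is about.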
Feature Description
Add a warning to the user when a groupby contains NA keys and dropna is not explicitly passed. This warning should also be emitted in other aggregation functions that drop missing keys (e.g., pivot). I think this is the best option as it warns the user in multiple scenarios:
User is unaware of pandas default behavior.
User is aware of pandas default behavior, but forgot to include the argument.
User is aware of pandas default behavior, but is unaware that their data contains missing values (prompting a bug fix or data quality check upstream).
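To make the proposal concrete, here is a rough user-level sketch of the intended semantics. It is not how pandas would implement it internally: groupby_warn_on_na and the _UNSET sentinel are hypothetical names invented for this example, and the helper only handles keys given as column names.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical sentinel used to tell "dropna not passed" apart from dropna=True.
_UNSET = object()


def groupby_warn_on_na(df, by, dropna=_UNSET, **kwargs):
    """Hypothetical wrapper: warn if NA keys would be dropped implicitly."""
    keys = [by] if isinstance(by, str) else list(by)
    if dropna is _UNSET:
        # Count rows with a missing value in any of the grouping columns.
        n_dropped = int(df[keys].isna().any(axis=1).sum())
        if n_dropped:
            warnings.warn(
                f"groupby dropped {n_dropped} row(s) with NA keys; "
                "pass dropna explicitly to silence this warning",
                UserWarning,
                stacklevel=2,
            )
        dropna = True  # keep the current pandas default
    return df.groupby(by, dropna=dropna, **kwargs)


df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "value": [1, 2, 3, 4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    groupby_warn_on_na(df, "key")["value"].sum()                # dropna omitted
    groupby_warn_on_na(df, "key", dropna=False)["value"].sum()  # explicit choice

na_warnings = [w for w in caught if "NA keys" in str(w.message)]
print(len(na_warnings))
```

Only the first call triggers the warning, since the second passes dropna explicitly. A user who really wants the default can pass dropna=True to acknowledge it, which covers all three scenarios above without changing any results.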
Alternative Solutions
Here are some other ideas for discussion, but I think the downsides of these all outweigh the benefits.
Alternative 1: Set default dropna value to be user-configurable via pandas settings
This would allow the user to decide if they prefer "SQL-style" grouping globally. This could work in conjunction with the user warning above. Cons: Still requires the user to remember to specify the option in their code. Options would affect the results, which complicates debugging and collaboration and goes against good code guidelines.
Alternative 2: Change the default value of dropna
This would bring pandas in line with SQL and Polars, but would likely break user code. This doesn't preclude the warning above, as it would be required as part of a deprecation plan. Cons: Would need to be rolled out very slowly.
Alternative 3: Change the default value of dropna for multi-key groupings only
Assumes users doing multi-key grouping are more likely to want to retain missing values. Cons: Would add confusion and still break user code.
Additional Context
This has been a known source of confusion and a difference from SQL and Polars (See 1 2 3).
Even for experienced pandas users, it's easy to forget to add dropna=False or not realize there are missing values in your grouping keys. The current behavior adds mental overhead for developers, steepens the learning curve (especially for those coming from SQL), and introduces a source of bugs.