ENH: Add warning when `DataFrame.groupby` drops NA keys #61339

tehunter · 2025-04-22T15:40:59Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Currently, pandas DataFrame.groupby drops rows with NA values in any of the keys by default. In v1.1, a dropna argument was added that allows users to retain NA values, but its default value is True (#3729). In that discussion, there were several requests to add a warning when NA values are dropped (#3729 (comment)).
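For reference, a minimal sketch of the current behavior described above. The frame and its column names are made up for illustration; only `groupby` and its existing `dropna` argument are used.

```python
import pandas as pd

# Illustrative data: one of the three rows has a missing grouping key.
df = pd.DataFrame({"key": ["a", None, "a"], "value": [1, 2, 3]})

# Default (dropna=True): the row with the missing key is silently excluded.
default_result = df.groupby("key")["value"].sum()

# dropna=False: the missing key is kept as its own group.
keep_na_result = df.groupby("key", dropna=False)["value"].sum()

print(default_result)
print(keep_na_result)
```

The default result contains a single group, while the `dropna=False` result contains two, so the totals of the two results differ by the value in the dropped row.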

This issue raises that request to provide additional visibility and a single place for discussion.

Feature Description

Add a warning to the user when a groupby contains NA keys and dropna is not explicitly passed. This warning should also be emitted in other aggregation functions that drop missing keys (e.g., pivot, .

>>> df = pd.DataFrame({"key1": ["a", "a"], "key2": ["b", None], "value": [1, 2]})
>>> df.groupby(["key1", "key2"])["value"].sum()
MissingKeyWarning: `groupby` encountered missing keys which will be dropped from result. Please specify `dropna=True` to hide warning and retain default behavior, or `dropna=False` to include missing values.
key1  key2
a     b       1
Name: value, dtype: int64

I think this is the best option as it warns the user in multiple scenarios:

User is unaware of pandas default behavior.
User is aware of pandas default behavior, but forgot to include the argument.
User is aware of pandas default behavior, but is unaware that their data contains missing values (prompting a bug fix or data quality check upstream).
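Until something like this exists in pandas, the proposed behavior can be approximated in user code. The following is only a sketch under stated assumptions: `MissingKeyWarning` and `groupby_warn_na` are hypothetical names that are not part of pandas, and the helper only handles keys given as column labels.

```python
import warnings

import pandas as pd


class MissingKeyWarning(UserWarning):
    """Hypothetical warning category; pandas does not define this."""


def groupby_warn_na(df, by, dropna=None, **kwargs):
    """Hypothetical wrapper around DataFrame.groupby.

    Warns when `dropna` is not passed explicitly and the key columns
    contain missing values. Assumes `by` is a column label or a list
    of column labels.
    """
    keys = [by] if isinstance(by, str) else list(by)
    if dropna is None:
        # Only warn when rows would actually be dropped.
        if df[keys].isna().any().any():
            warnings.warn(
                "groupby encountered missing keys which will be dropped "
                "from the result. Pass dropna=True or dropna=False "
                "explicitly to silence this warning.",
                MissingKeyWarning,
                stacklevel=2,
            )
        dropna = True  # keep the current pandas default
    return df.groupby(by, dropna=dropna, **kwargs)


df = pd.DataFrame({"key1": ["a", "a"], "key2": ["b", None], "value": [1, 2]})

# Emits MissingKeyWarning and drops the row whose key2 is missing.
warned = groupby_warn_na(df, ["key1", "key2"])["value"].sum()

# No warning: the caller made an explicit choice.
explicit = groupby_warn_na(df, ["key1", "key2"], dropna=False)["value"].sum()
```

Using `None` as the "not passed" marker is a simplification for this sketch; it lets the wrapper distinguish an explicit `dropna=True` from the default, which is the distinction the proposal relies on.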

Alternative Solutions

Here are some other ideas for discussion, but I think the downsides of these all outweigh the benefits.

Alternative 1: Set default `dropna` value to be user-configurable via pandas settings

This would allow the user to decide if they prefer "SQL-style" grouping globally. This could work in conjunction with the user warning above. Cons: Still requires user to remember to specify the option in their code. Options would affect the results, which complicates debugging and collaboration and goes against good code guidelines.

Alternative 2: Change the default value of `dropna`

This would bring pandas in line with SQL and Polars, but would likely break user code. This doesn't preclude the warning above, as it would be required as part of a deprecation plan. Cons: Would need to be rolled out very slowly.

Alternative 3: Change the default value of `dropna` for multi-key groupings only.

Assumes users doing multi-key grouping are more likely to want to retain missing values. Cons: Would add confusion and still break user code.

Additional Context

This has been a known source of confusion and a difference from SQL and Polars (See 1 2 3).

Even for experienced pandas users, it's easy to forget to add dropna=False or to not realize there are missing values in the grouping keys. The current behavior adds mental overhead for developers, steepens the learning curve (especially for those coming from SQL), and introduces a source of bugs.


rhshadrach · 2025-04-22T21:08:14Z

Thanks for the request. I'm negative on:

Adding a warning when dropna=True. This is noisy: as far as pandas can tell, the user is telling pandas to drop NA values, so it should not warn when that happens.
Adding a global underride for dropna. That makes the behavior of pandas non-local: you cannot look at a piece of code and know what it does, because it depends on global state.
Having the default value of dropna depend on other arguments or the data itself.

However I'm positive on changing the default of dropna to False, and then even deprecating the parameter entirely. I am planning to start the deprecation after 3.0 is released (as long as there are no objections from the core team).

Related: #53094. This PDEP is stalled currently because there are many improvements that need to be done to pivot_table first.

rhshadrach · 2025-04-22T21:11:57Z

Ah, I did not see

when a groupby contains NA keys and dropna is not explicitly passed.

While I'd be positive on this, I think we should just deprecate dropna=True as the default. This will also give a warning when dropna is not passed and the groupby keys contain an NA value, so it's much the same.

tehunter added Enhancement, Needs Triage labels Apr 22, 2025

tehunter changed the title ~~ENH: Add warning when DataFrame.groupby drop's NA keys~~ ENH: Add warning when DataFrame.groupby drops NA keys Apr 22, 2025

rhshadrach added Groupby, Missing-data, Needs Discussion and removed Needs Triage labels Apr 22, 2025

tehunter linked a pull request Apr 24, 2025 that will close this issue

Add warning to .groupby when null keys would be dropped due to default dropna #61351

Draft

