lib/datautils: groupby_agg on string column with missing values introduces 0 #3515

Closed
jorisvandenbossche opened this issue Nov 7, 2024 · 2 comments

Comments

@jorisvandenbossche
Contributor

While running the tests with the next version of pandas, I ran into this case:

    def test_default_aggregate_with_some_nans_ignored_different_types_and_more_nans(
        self,
    ):
        df_in = pd.DataFrame(
            {
                "year": [2001, 2002, 2002, 2003, 2003, 2003],
                "value_01": [np.nan, 2, np.nan, 4, 5, 6],
                "value_02": [np.nan, "b", np.nan, "d", "e", "f"],
                "value_03": [np.nan, False, False, True, True, np.nan],
            }
        )
        df_out = pd.DataFrame(
            {
                "year": [2001, 2002, 2003],
                "value_01": [0.0, 2.0, 15.0],
                "value_02": [0, "b", "def"],
                "value_03": [0, 0, 2],
            }
        ).set_index("year")
        df_out["value_03"] = df_out["value_03"].astype(object)
        assert dataframes.groupby_agg(

where the expected result df_out column "value_02" is a mixed integer/string column, because of a zero being introduced through the groupby's sum on a group with only missing values.

Although the test encodes this as the expected result, I am wondering whether getting a 0 there is intended or desired behaviour (I assume it is not). Or does this just come from using dummy data in the test, rather than being something you encounter or want in the actual datasets? (For example, I don't know if you ever use the "sum" operation on a string column in practice.)

I ran into this case because the test fails with pandas' future string dtype enabled. The result no longer has an integer 0 but a string "0", so the assert equals(..) fails, which I think is even worse. I opened an upstream issue to fix this in pandas (pandas-dev/pandas#60229), thanks to running your test suite!
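
For reference, here is a minimal sketch that isolates the string column from the test and calls pandas' groupby sum directly, without going through `groupby_agg`. It assumes the column is held as object dtype (the default in current pandas releases) and only illustrates the `min_count` parameter of `sum`; it is not a proposal for how the library should behave.

```python
import numpy as np
import pandas as pd

# Only the string column from the test above; 2001 has no non-missing value.
df = pd.DataFrame(
    {
        "year": [2001, 2002, 2002, 2003, 2003, 2003],
        "value_02": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)

# Default sum (min_count=0): the all-missing 2001 group still gets a value.
# With object dtype this is the integer 0 described in this issue.
default_sum = df.groupby("year")["value_02"].sum()

# min_count=1 requires at least one non-missing value per group,
# so the all-missing 2001 group stays missing instead.
strict_sum = df.groupby("year")["value_02"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With `min_count=1` the groups that do contain strings are still concatenated ("b" and "def"), and only the all-missing group is left as a missing value.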

@larsyencken
Contributor

@pabloarosado will take a quick look. It may not be trivial for us to change this behaviour, since it affects a bunch of our steps.

@Marigold
Collaborator

It took me a while to understand what’s going on. Fortunately, this is dummy data and isn’t very representative of our typical use case. In fact, I’d say running sum on strings would generally be considered bad practice for us.
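
If string concatenation per group were ever actually needed, one option that avoids "sum" altogether is an explicit join. This is only a sketch: `join_non_missing` is a hypothetical helper name, not something in this repository, and returning an empty string for an all-missing group is an assumption about what would be wanted.

```python
import numpy as np
import pandas as pd


def join_non_missing(values: pd.Series) -> str:
    """Concatenate the non-missing strings of one group (hypothetical helper)."""
    # dropna() removes missing entries, so an all-missing group joins to "".
    return "".join(values.dropna())


df = pd.DataFrame(
    {
        "year": [2001, 2002, 2002, 2003, 2003, 2003],
        "value_02": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)

# Apply the explicit join per group instead of relying on sum for strings.
joined = df.groupby("year")["value_02"].agg(join_non_missing)
print(joined)
```

This keeps the result column purely string-valued, so it does not depend on what sum returns for an empty group.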

I’m closing this for now. If it gets fixed upstream, we’ll address it on our side. Thanks, Joris!
