Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: MultiIndex union/difference not commutative #60642

Open
2 of 3 tasks
ssche opened this issue Jan 2, 2025 · 8 comments
Open
2 of 3 tasks

BUG: MultiIndex union/difference not commutative #60642

ssche opened this issue Jan 2, 2025 · 8 comments
Labels
Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference

Comments

@ssche
Copy link
Contributor

ssche commented Jan 2, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

I wasn't able to extract the data for this example to set up the test case more programmatically, but I managed to reduce the data significantly without compromising the behaviour. I hope the below is somewhat portable (please let me know if it isn't). I used numpy==2.2.1 and pandas==2.2.3

>>> import pickle

>>> ix1_pickled = [redacted]
>>> ix2_pickled = [redacted]
ix1 = pickle.loads(ix1_pickled)
ix2 = pickle.loads(ix2_pickled)

>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>>
>>> # expected - both indices are the same - should yield same result
>>> ix2.union(ix1)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> 
>>> # it seems each row is considered different
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> 
>>> # expected
>>> ix1.difference(ix2)
MultiIndex([], names=['dim1', 'dim2'])
>>> 
>>> # not expected
>>> ix2.difference(ix1)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

>>> # some diagnostics to show the values (the dates other than `2018-06-01` could come from my attempt to minimise the example as those values were previously contained)
>>> ix1.levels
FrozenList([[nan], [2018-06-01 00:00:00, 2018-07-01 00:00:00, 2018-08-01 00:00:00, 2018-09-01 00:00:00, 2018-10-01 00:00:00, 2018-11-01 00:00:00, 2018-12-01 00:00:00]])
>>> ix2.levels
FrozenList([[nan], [2018-06-01 00:00:00, 2018-07-01 00:00:00, 2018-08-01 00:00:00, 2018-09-01 00:00:00, 2018-10-01 00:00:00, 2018-11-01 00:00:00, 2018-12-01 00:00:00]])
>>> ix1.levels[1] == ix2.levels[1]
array([ True,  True,  True,  True,  True,  True,  True])

Issue Description

Creating the union of two indices with a nan level causes the union result to depend on the order of the call (index1.union(index2) vs. index2.union(index1)). With other words, one of the calls yields the wrong result as the call deems every row to be distinct. I'm fairly certain that is is due to nan value in dim1, but if I recreate the example programmatically, the behaviour is as expected.

>>> ix3 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]])
>>> ix3
MultiIndex([(nan, '2018-06-01')],
           )
>>> ix4 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]])
>>> 
>>> ix4.dtypes
level_0           float64
level_1    datetime64[ns]
dtype: object
>>> ix3.dtypes
level_0           float64
level_1    datetime64[ns]
>>> ix3.union(ix4)
MultiIndex([(nan, '2018-06-01')],
           )
>>> ix4.union(ix3)
MultiIndex([(nan, '2018-06-01')],
           )

However, in test cases for a rather large application, I arrive at the state from the pickle example. I'm not sure what's different to the working example

Expected Behavior

I would expect the difference of the two indices from the pickled example to be empty and the union to be the same as the two indices.

I am also at a loss as to why I can't reproduce the wrong behaviour programmatically.

Installed Versions


INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.13.1
python-bits           : 64
OS                    : Linux
OS-release            : 6.12.6-200.fc41.x86_64
Version               : #1 SMP PREEMPT_DYNAMIC Thu Dec 19 21:06:34 UTC 2024
machine               : x86_64
processor             : 
byteorder             : little
LC_ALL                : None
LANG                  : en_AU.UTF-8
LOCALE                : en_AU.UTF-8

pandas                : 2.2.3
numpy                 : 2.2.1
pytz                  : 2020.4
dateutil              : 2.9.0.post0
pip                   : 24.3.1
Cython                : 3.0.11
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : 1.4.2
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : 2.10.2
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : None
pyarrow               : 18.1.0
pyreadstat            : None
pytest                : 8.3.4
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.14.1
sqlalchemy            : None
tables                : 3.10.1
tabulate              : None
xarray                : None
xlrd                  : 2.0.1
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
@ssche ssche added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Index Related to the Index class or subclasses Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference labels Jan 2, 2025
@rhshadrach
Copy link
Member

rhshadrach commented Jan 4, 2025

pickle can perform arbitrary code execution and thus presents security issues. Can you post your example without using pickle.

ix1 = pd.MultiIndex(levels=ix1.levels, codes=ix1.codes, names=ix1.names, verify_integrity=False)

@rhshadrach rhshadrach added the Needs Info Clarification about behavior needed to assess issue label Jan 4, 2025
@ssche
Copy link
Contributor Author

ssche commented Jan 6, 2025

Thanks @rhshadrach - that's really helpful for boiling down the essential parameters of the multiindex to reproduce the issue without using pickle.

When I extracted the level, codes, names from ix1 and ix2, I noticed a difference in ix2.codes. ix2 doesn't seem to contain a -1 code at a nan level value. ix2 originates from an aggregation function (first in this case, but I don't think this matters) for which groupby does not assign -1 codes for nan values (not sure if that's expected; I suppose that's not correct). Pandas 2.1.1 was doing the same thing, but a subsequent ix1.union(ix2) for which ix2 had 0 codes for nan levels still worked. That seems to be regression to 2.2.3.

Compare the following two snippets:

>>> pd.__version__
'2.1.1'
>>> np.__version__
'1.24.4'
>>> ix1 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'])
>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix1.codes
FrozenList([[-1], [0]])
>>> 
>>> df = pd.DataFrame({'foo': [42]}, index=ix1)
>>> df = df.reset_index(drop=False)
>>> ix2 = df.groupby(['dim1', 'dim2'], dropna=False, observed=True).first().index
>>> 
>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix2.codes
FrozenList([[0], [0]])
             ^
             supposedly wrong code (should be -1?)
>>> 
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

2.1.1 has a 0-code for a nan level after the groupby, but the union is still correct. In 2.2.3

>>> pd.__version__
'2.2.3'
>>> np.__version__
'2.2.1'
>>> ix1 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'])
>>> ix1
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix1.codes
FrozenList([[-1], [0]])
>>> 
>>> df = pd.DataFrame({'foo': [42]}, index=ix1)
>>> df = df.reset_index(drop=False)
>>> ix2 = df.groupby(['dim1', 'dim2'], dropna=False, observed=True).first().index
>>> 
>>> ix2
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix2.codes
FrozenList([[0], [0]])
>>> 
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')],  # <- wrong - double up of rows (should be only 1)
           names=['dim1', 'dim2'])

>>> 

# let's try to fix `ix2` (`from_frame` seems to be "fixing" the index)
>>> ix2_fixed = pd.MultiIndex.from_frame(pd.DataFrame(index=ix2).reset_index())
>>> ix2_fixed.codes
FrozenList([[-1], [0]])
            # ^
            # expected for nan
>>> 
>>> ix1.union(ix2_fixed)
MultiIndex([(nan, '2018-06-01')],  # <- correct result, 1 row only
           names=['dim1', 'dim2'])

The question is - is groupby already wrong in assigning 0 values for nan levels or should union (and potentially others) not rely on index.codes?

@ssche ssche removed the Needs Info Clarification about behavior needed to assess issue label Jan 6, 2025
@ssche
Copy link
Contributor Author

ssche commented Jan 6, 2025

I went down the avenue of assuming that groupby should have -1 codes for nan values (so other nan level indices are compatible). This lead me to Grouping which calculates the codes.

class Grouping:
        ...
    def _codes_and_uniques(self):
       ...
            uniques = self._uniques
        else:
            # GH35667, replace dropna=False with use_na_sentinel=False
            # error: Incompatible types in assignment (expression has type "Union[
            # ndarray[Any, Any], Index]", variable has type "Categorical")
            codes, uniques = algorithms.factorize(  # type: ignore[assignment]
                self.grouping_vector, sort=self._sort, use_na_sentinel=self._dropna
            )
        return codes, uniques

Since use_na_sentinel is tied to _dropna (which is set to False because the nan rows should be kept), algorithms.factorize will calculate non-negative integer code values for nan values which differs to the codes in ix1.

However, I don't think that matters much as the codes may not be consistent across indices (i.e. the code-level relationship can be different for different indices). Ideally, .union() (which uses .difference()) should account for different indices, in my opinion.

Strangely enough, one cannot set up the above example via the default constructor of MultiIndex.

ix3 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[-1], [0]])
ix4 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[0], [0]])
ix3.union(ix4)

works as expected as ix4's codes are adjusted to [[-1], [0]] after the constructor is called.

I'm still puzzled as to whether there's an implicit assumption that nan values should be denoted as -1 codes (groupby doesn't adhere to this assumption, but union() seems to assume that and the fact that the constructor corrects the nan code supports that further).

@rhshadrach
Copy link
Member

Strangely enough, one cannot set up the above example via the default constructor of MultiIndex.

You need to pass verify_integrity=False, otherwise the validation process will modify your data so that the codes are -1.

The Cython groupby implementation is heavily tied to allowing nonnegative codes for NA values to implement dropna=False. These codes are used to determine the the label, which defines the group. E.g.

for i in range(N):
lab = labels[i]
if lab < 0:
continue
for j in range(K):
val = values[i, j]

I agree this appears to be a bug in .union. It should be able to handle codes as -1 or codes as nonnegative values whose corresponding level value is NA.

@ssche
Copy link
Contributor Author

ssche commented Feb 7, 2025

This defect causes all sorts of other issues for me which impact basic pandas functionality such as simple addition of 2 dataframes. I found that index.get_indexer(other_index) doesn't work reliably when the codes for nan values aren't negative in index (other_index.get_indexer(index) with other_index not having -1-nan-codes seems to work fine).

In the example below, a commutative operation such addition produces different results when the dataframes are swapped.

>>> import pandas as pd
>>> pd.__version__
'2.1.1'
>>> import numpy as np

>>> ix3 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[-1], [0]])
>>> ix3
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])
>>> ix3.codes
FrozenList([[-1], [0]])

>>> ix4 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[0], [0]], verify_integrity=False)
>>> ix4.codes
FrozenList([[0], [0]])

>>> df3 = pd.DataFrame({'a': [1]}, index=ix3)
>>> df3
                 a
dim1 dim2
NaN  2018-06-01  1
>>> df4 = pd.DataFrame({'a': [5]}, index=ix4)
>>> df4
                 a
dim1 dim2
NaN  2018-06-01  5


# try addition df3 + df4 (working fine)
>>> df3 + df4
                 a
dim1 dim2
NaN  2018-06-01  6

# as expected (both indices are equal, so union is same)
>>> ix3.union(ix4)
MultiIndex([(nan, '2018-06-01')],
           names=['dim1', 'dim2'])

# error! this means the ix4[0] is not in ix3 which is not the case
>>> ix3.get_indexer(ix4)
array([-1])

# as a result union fails
>>> ix4.union(ix3)
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')], # <- this is a duplicate entry which shouldn't be there
           names=['dim1', 'dim2'])

# same does addition this way around (`df3 + df4` worked as expected)
>>> df4 + df3
                  a
dim1 dim2
NaN  2018-06-01 NaN
     2018-06-01 NaN  # <- duplicate and no values

I would expect this to be causing issues for other users too (or perhaps nan indices aren't all that common).

Can someone suggest a workaround? I tried the following, but am unsure whether this works in all cases. The index can be "cleaned" by

ix3_clean = pd.MultiIndex(ix3.levels, ix3.codes, names=ix3.names)

Can I just set this to df3.index = ix3_clean or would I need to call df3.set_index(ix3_clean) (or even reindex()?)?

@ssche
Copy link
Contributor Author

ssche commented Feb 7, 2025

This also happened in pandas 1.5.3.

Python 3.8.20 (default, Sep  9 2024, 00:00:00) 
[GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.5.3'
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> ix3 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[-1], [0]])
>>> ix4 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[0], [0]], verify_integrity=False)
>>> print(ix4.codes)
[[0], [0]]
>>> 
>>> ixer = ix3.get_indexer(ix4)
>>> ixer
array([-1])  # [0] expected as both row indices are the same
>>> print(ix4.union(ix3))
MultiIndex([(nan, '2018-06-01'),
            (nan, '2018-06-01')],  # duplicate
           names=['dim1', 'dim2'])

@rhshadrach
Copy link
Member

(or perhaps nan indices aren't all that common).

I would recommend keeping nan out of an index. While pandas has worked around issues as they've arisen in the past, as you can see it is currently a troublesome area. In general, storing nan as keys in a dict-like data structure is typically a bad idea as nan != nan.

I actually wonder if pandas should raising any time an index was created with a nan value. groupby with nan values is one place that would be hard to support (even with as_index=False, that's typically just a call to reset_index under the hood). Maybe joins too.

@ssche
Copy link
Contributor Author

ssche commented Feb 19, 2025

I would recommend keeping nan out of an index

Too late - this is a feature I rely on heavily on in a production system. We used to convert to another sentinel value, but since issues were fixed I assumed this is supported now and we removed those cumbersome sentinel replacement strategies. I'd rather help to support this as a stable feature.

But I get your point that it's not 100% perfect (yet).

typically a bad idea as nan != nan

That argument doesn't hold when codes are used (at least for the MI case) for comparison. It may be counter-intuitive (or perhaps not) for python/pandas/scientific programming purists, but in real world scenarios it's quite useful to assume equality (esp. for indices).

I'd rather get this fixed (nan != nan or nan == nan could be an option as done for other operations).

I think there's an inconsistency between -1 codes and nans in MultiIndexes. I've seen cases where the MulitIndex levels (which is a pd.Index()) didn't contain nan values and -1 was used as code to denote nan, and other cases where the position of the nan value in the level was used as a code instead. The latter case (no -1 codes, position of nan as code) seems to work best, given some of the code I've seen which checks for -1 (and applies special handling).

I've since resorted to converting MultiIndex codes which seems to be working fine.

>>> # ix3 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[-1], [0]])
>>> ix3 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[0], [0]], verify_integrity=False)
>>> ix4 = pd.MultiIndex(levels=[[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'], codes=[[0], [0]], verify_integrity=False)
>>>
>>> df3 = pd.DataFrame({'a': [1]}, index=ix3)
>>> df4 = pd.DataFrame({'a': [5]}, index=ix4)

>>> df3 = pd.DataFrame({'a': [1]}, index=ix3)
>>> df4 + df3
                 a
dim1 dim2
NaN  2018-06-01  6
>>> df3 + df4
                 a
dim1 dim2
NaN  2018-06-01  6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Triage Issue that has not been reviewed by a pandas team member setops union, intersection, difference, symmetric_difference
Projects
None yet
Development

No branches or pull requests

2 participants