BUG: ValueError: Length of values (5) does not match length of index (4)
when subtracting two series with MultiIndex and Index and NaN values
#60908
Comments
take
@rit4rosa Thanks for having a crack at this. I just wanted to share some of my insights as they may help you or others.

I believe there's something fundamentally broken with the assignment of -1 codes to NaN values. I think the reason is that -1 is used ambiguously for two seemingly similar, but different, purposes. Sometimes it denotes values that are truly missing (as in, not existing), and sometimes it denotes NaN values (or their type-specific versions), which are also called "missing values". Pandas traditionally calls NaN values "missing values", but I think in recent years a distinction crept in (rightly so in my opinion) which should be acknowledged. I think for the above

Supporting my hypothesis is that when the -1 codes are deliberately changed to some other value, NaN index values are handled just fine.

# same as before
import numpy as np
import pandas as pd

ix1 = pd.MultiIndex.from_arrays(
    [
        [np.nan, 81, 81, 82, 82],
        [np.nan, np.nan, np.nan, np.nan, np.nan],
        pd.to_datetime([np.nan, '2018-06-01', '2018-07-01', '2018-07-01', '2018-08-01']),
    ],
    names=['foo', 'bar', 'date'],
)
s1 = pd.Series([np.nan, 25.058969, 22.519751, 20.847981, 21.625236], index=ix1)
ix2 = pd.Index([81, 82, 83, 84, 85, 86, 87], name='foo')
s2 = pd.Series([28.2800, 25.2500, 22.2200, 16.7660, 14.0087, 14.9480, 29.2900], index=ix2)
# here we change the -1 NaN codes to some other code value
# (the patch_* helpers are defined further down)
s1 = patch_series_multiindex_nan_codes(s1)
s2 = patch_series_multiindex_nan_codes(s2)
print(s1 - s2)

which yields the expected result:
def patch_series_multiindex_nan_codes(series):
    import pandas as pd
    if not (
        isinstance(series, pd.Series) and
        isinstance(series.index, pd.MultiIndex)
    ):
        return series
    new_index, ixer = patch_multiindex_nan_codes(series.index)
    values = series.values[ixer]
    new_series = pd.Series(values, index=new_index, name=series.name)
    return new_series


def patch_multiindex_nan_codes(index):
    import pandas as pd
    import numpy as np
    new_codes = []
    new_levels = []
    for codes, levels in zip(index.codes, index.levels):
        # copy to create new (detached from index) and writeable codes.
        new_level_codes = np.copy(codes)
        new_codes.append(new_level_codes)
        level_values = levels
        # Add null element if it's missing (so the code can index the null
        # element).
        if -1 in codes and not pd.isnull(level_values).any():
            level_values = level_values.values
            # Add null value
            if np.issubdtype(level_values.dtype, np.datetime64):
                nan_value = np.datetime64('nat')
            else:
                nan_value = np.nan
            level_values = np.concatenate([level_values, [nan_value]])
        new_levels.append(level_values)
        # "fix" codes whose levels are nan and replace them with non -1 code.
        null_ix = np.where(pd.isnull(level_values))[0]
        if len(null_ix):
            assert len(null_ix) == 1, 'only 1 null value in the level index'
            null_ix = null_ix[0]
            new_level_codes[new_level_codes == -1] = null_ix
    # Create new MI without verification, so the null level codes are not reset
    # to -1.
    new_index = pd.MultiIndex(
        new_levels, codes=new_codes, names=index.names, verify_integrity=False
    )
    # not really required.
    new_index, ixer = new_index.sortlevel(0, sort_remaining=True)
    return new_index, ixer

I also believe that this could be the reason for a whole raft of other issues, such as #60642 (and others I've seen). If I was to address this, I would start at the
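For readers less familiar with how a MultiIndex stores its labels, here is a minimal sketch of the -1 convention discussed in this comment. The index is made up for illustration (the 'bar' labels are arbitrary and not from the issue). It only shows that a NaN label is encoded as code -1 and is not stored in the level itself.

```python
import numpy as np
import pandas as pd

# Illustrative index (an assumption, not the one from the issue):
# the first 'foo' label is NaN.
mi = pd.MultiIndex.from_arrays(
    [[np.nan, 81, 81, 82, 82], ["a", "b", "c", "d", "e"]],
    names=["foo", "bar"],
)

foo_codes = mi.codes[0]   # integer positions into mi.levels[0]
foo_level = mi.levels[0]  # unique non-null labels of 'foo'

print(list(foo_codes))         # the NaN row shows up as code -1
print(list(foo_level))         # NaN is not part of the level
print(foo_level.isna().any())  # whether the level holds any null label
```

Because -1 is not a valid position into the level, any code that treats it as "no entry" rather than "NaN label" will silently drop that row.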
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
It is possible to carry out arithmetic operations on two series with "mixed" indices when at least 1 level is the same. However, in my case, s1 - s2, s1 contains an all-NaN index row, which raises a ValueError: Length of values (5) does not match length of index (4). I found that this could be an error in how the two series are aligned.
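For contrast, here is a minimal sketch of the working case, where a Series with a MultiIndex and a Series with a plain Index share the level name 'foo' and no label is NaN. The data and the 'bar' level are invented for illustration and are not taken from the issue.

```python
import pandas as pd

# Made-up data: s1 has a two-level index, s2 a single-level index.
# The shared level name 'foo' is what lets pandas align them.
ix1 = pd.MultiIndex.from_arrays(
    [[81, 81, 82], ["a", "b", "a"]], names=["foo", "bar"]
)
s1 = pd.Series([1.0, 2.0, 3.0], index=ix1)
s2 = pd.Series([10.0, 20.0], index=pd.Index([81, 82], name="foo"))

# s2 is matched to every s1 row through the 'foo' level.
result = s1 - s2
print(result)
```

The result keeps one row per row of s1; the failure described in this issue only appears once NaN labels enter the MultiIndex.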
I traced the origin of the mismatching codes to pandas.core.indexes.base.py:Index._join_level, which blatantly ignores missing values when constructing a new index. This is all possible because verify_integrity is set to False (and not passed down). If I set verify_integrity=True, the join_index = MultiIndex(...) call fails much earlier with ValueError: Length of levels and codes must match. NOTE: this index is in an inconsistent state.
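To make the 5-versus-4 mismatch concrete, here is a small NumPy sketch. It is not the pandas code path itself. The arrays are hand-written stand-ins (an assumption) for the 'foo' level codes and the values of s1, and the sketch only shows that masking out -1 codes shortens the code array while the values keep their original length.

```python
import numpy as np

# Hypothetical codes for the 'foo' level of s1: the first row is NaN and is
# therefore encoded as -1; 81 maps to 0 and 82 maps to 1.
old_codes = np.array([-1, 0, 0, 1, 1])
values = np.array([np.nan, 25.058969, 22.519751, 20.847981, 21.625236])

# Dropping the -1 entries removes the NaN row from the codes only.
taker = old_codes[old_codes != -1]

print(len(values), len(taker))  # 5 4
```

An index built from the shortened codes has 4 entries, while the series still carries 5 values, which is one plausible source of the reported ValueError.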
I tried to fix this by changing taker = old_codes[old_codes != -1] to taker = old_codes. This alleviates the initial ValueError (just tested for my case). If I also comment out the -1 handling, I get the desired expected behaviour.
Expected Behavior
Installed Versions
Also happens with pandas==2.2.3