BUG: Inconsistent RollingGroupby Behaviour #58124

Open
@hsorsky

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime

import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp(datetime.date(2000, 1, 1)),
            pd.Timestamp(datetime.date(2000, 1, 2)),
            pd.Timestamp(datetime.date(2000, 1, 3)),
            pd.Timestamp(datetime.date(2001, 1, 1)),
        ],
    }
)

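# ----------------------------------------------------------------------
# Examples one to three: default groupby arguments. Columns are selected
# after .rolling, before .rolling, then on both sides.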

print(
    X.groupby(["groupby_col"])
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

print(
    X.groupby(["groupby_col"])[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)

print(
    X.groupby(["groupby_col"])[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

# ----------------------------------------------------------------------
# Examples four to six: the same three selections with as_index=False.

print(
    X.groupby(["groupby_col"], as_index=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

print(
    X.groupby(["groupby_col"], as_index=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)

print(
    X.groupby(["groupby_col"], as_index=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

# ----------------------------------------------------------------------
# Examples seven to nine: the same three selections with as_index=False
# and sort=False.

print(
    X.groupby(["groupby_col"], as_index=False, sort=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

print(
    X.groupby(["groupby_col"], as_index=False, sort=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)

print(
    X.groupby(["groupby_col"], as_index=False, sort=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

Issue Description

Behaviour is inconsistent depending on whether we select columns on the DataFrameGroupBy or on the RollingGroupby.

In the first example, behaviour is as expected and we get

                        agg_col
groupby_col date               
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
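
As a cross-check on the values above, the same per-group rolling sums can be computed without RollingGroupby at all. This is only a minimal sketch, not how pandas computes the result internally; `reference_rolling_sum` is a hypothetical helper name, and it assumes dates are already sorted within each group, as they are in the reproducer.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-01-03", "2001-01-01"]
        ),
    }
)


def reference_rolling_sum(df):
    """Hypothetical helper: 5-day rolling sum of agg_col, one group at a time."""
    pieces = {}
    for key, group in df.groupby("groupby_col"):
        # A plain Series with a DatetimeIndex supports offset windows directly.
        by_date = group.set_index("date")["agg_col"]
        pieces[key] = by_date.rolling("5D").sum()
    # Stack the per-group results under a (groupby_col, date) MultiIndex.
    stacked = pd.concat(pieces, names=["groupby_col", "date"])
    return stacked.to_frame("agg_col")


expected = reference_rolling_sum(X)
print(expected)
```

The result has the (groupby_col, date) index layout of the first example, so it can serve as a baseline when comparing the other eight calls.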

In the second, we get

               agg_col       date
groupby_col                      
0           0      1.0 2000-01-01
            1      3.0 2000-01-02
            3      4.0 2001-01-01
1           2      3.0 2000-01-03

i.e. the original index shows up as the second index level and date stays a column. I think I have seen a comment about this in another issue or in the docs, but I can't seem to find it 🙃. Maybe related to #56705?

The third example gives us the same as the first

                        agg_col
groupby_col date               
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0

How about if we try as_index=False?

The fourth example shows that this doesn't work as expected (as_index=False has no effect)

                        agg_col
groupby_col date               
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0

But if we select before calling .rolling, as in example five, we see that as_index=False does seem to work

   groupby_col  agg_col       date
0            0      1.0 2000-01-01
1            0      3.0 2000-01-02
3            0      4.0 2001-01-01
2            1      3.0 2000-01-03

but I'm not sure whether this is just a happy coincidence related to the weirdness from example two.

Example six shows that selecting both pre- and post-.rolling is effectively the same as only selecting post-.rolling

                        agg_col
groupby_col date               
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0

Adding sort=False doesn't seem to do anything (as noted in other issues, e.g. #50296). Examples seven through nine exhibit the same behaviour as the earlier examples with respect to pre-rolling, post-rolling, and both pre- and post-rolling column selection.
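
One caveat when checking sort: in the reproducer the group keys first appear in the order 0, 1, which is already sorted, so sorted and unsorted output would look identical on that data. The sketch below uses a different, made-up frame `Y` (an assumption, not part of the original report) in which group 1 appears first. It shows what sort=False does for a plain groupby aggregation, then prints the group order from the rolling call so the two can be compared.

```python
import pandas as pd

# Hypothetical data: group 1 appears before group 0.
Y = pd.DataFrame(
    {
        "groupby_col": [1, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04"]
        ),
    }
)

# Plain groupby: sort=True orders the keys, sort=False keeps first-appearance order.
sorted_keys = Y.groupby("groupby_col", sort=True)["agg_col"].sum().index.tolist()
unsorted_keys = Y.groupby("groupby_col", sort=False)["agg_col"].sum().index.tolist()
print(sorted_keys, unsorted_keys)

# Rolling groupby with sort=False: print the group order actually produced.
rolled = (
    Y.groupby("groupby_col", sort=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .sum()
)
rolled_keys = rolled.index.get_level_values(0).unique().tolist()
print(rolled_keys)
```

If sort were honoured by the rolling call, `rolled_keys` would be expected to match `unsorted_keys`.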

Expected Behavior

  1. We would see consistent behaviour between pre-rolling col selection and post-rolling col selection
  2. as_index would always work and, if False, return a DataFrame in which the groupby keys are in the columns rather than the index, presumably leaving the resulting index as a (potentially) unsorted version of the original index
  3. sort would work at all
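
Until the behaviour is made consistent, downstream code can normalise whichever shape comes back. The following is a rough workaround sketch, not a proposed fix: `to_flat` is a hypothetical helper, and the three input frames are built by hand to mirror the outputs printed above rather than produced by calling pandas, so the snippet does not depend on the installed version.

```python
import pandas as pd

dates = pd.to_datetime(["2000-01-01", "2000-01-02", "2001-01-01", "2000-01-03"])
sums = [1.0, 3.0, 4.0, 3.0]
groups = [0, 0, 0, 1]
original_index = [0, 1, 3, 2]

# Shape reported for examples one, three, four and six.
post_select = pd.DataFrame(
    {"agg_col": sums},
    index=pd.MultiIndex.from_arrays(
        [groups, dates], names=["groupby_col", "date"]
    ),
)

# Shape reported for example two: original index kept as an unnamed level.
pre_select = pd.DataFrame(
    {"agg_col": sums, "date": dates},
    index=pd.MultiIndex.from_arrays(
        [groups, original_index], names=["groupby_col", None]
    ),
)

# Shape reported for example five: flat frame on the original index.
pre_select_flat = pd.DataFrame(
    {"groupby_col": groups, "agg_col": sums, "date": dates},
    index=original_index,
)


def to_flat(result):
    """Hypothetical helper: move index levels to columns, keep a fixed layout."""
    flat = result.reset_index()
    return flat[["groupby_col", "date", "agg_col"]]


flat_frames = [to_flat(f) for f in (post_select, pre_select, pre_select_flat)]
print(flat_frames[0])
```

All three inputs collapse to the same three-column frame on a fresh RangeIndex. Note that this discards the original index, which differs from point 2 above.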

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.10.11.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.3
pytz : 2021.1
dateutil : 2.8.2
setuptools : 69.1.0
pip : 23.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.2.0
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Metadata

Assignees

No one assigned

    Labels

    Bug; Needs Triage (issue that has not been reviewed by a pandas team member); Window (rolling, ewma, expanding)
