Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import datetime

import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp(datetime.date(2000, 1, 1)),
            pd.Timestamp(datetime.date(2000, 1, 2)),
            pd.Timestamp(datetime.date(2000, 1, 3)),
            pd.Timestamp(datetime.date(2001, 1, 1)),
        ],
    }
)

# ----------------------------------------------------------------------
# Examples 1-3: default as_index=True
# Example 1: select columns after .rolling
print(
    X.groupby(["groupby_col"])
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
# Example 2: select columns before .rolling
print(
    X.groupby(["groupby_col"])[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)
# Example 3: select columns both before and after .rolling
print(
    X.groupby(["groupby_col"])[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

# ----------------------------------------------------------------------
# Examples 4-6: as_index=False
# Example 4: select columns after .rolling
print(
    X.groupby(["groupby_col"], as_index=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
# Example 5: select columns before .rolling
print(
    X.groupby(["groupby_col"], as_index=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)
# Example 6: select columns both before and after .rolling
print(
    X.groupby(["groupby_col"], as_index=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)

# ----------------------------------------------------------------------
# Examples 7-9: as_index=False, sort=False
# Example 7: select columns after .rolling
print(
    X.groupby(["groupby_col"], as_index=False, sort=False)
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
# Example 8: select columns before .rolling
print(
    X.groupby(["groupby_col"], as_index=False, sort=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")
    .agg("sum")
)
# Example 9: select columns both before and after .rolling
print(
    X.groupby(["groupby_col"], as_index=False, sort=False)[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum")
)
Issue Description
Behaviour is inconsistent depending on whether we select columns on the DataFrameGroupBy or on the RollingGroupby.
In the first example, behaviour is as expected and we get
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
In the second, we get
               agg_col       date
groupby_col
0           0      1.0 2000-01-01
            1      3.0 2000-01-02
            3      4.0 2001-01-01
1           2      3.0 2000-01-03
i.e. the original index is kept as an extra index level and date stays a column. I think I have seen a comment about this in another issue before, or in the docs, but I can't seem to find it 🙃. Maybe related to #56705?
The third example gives us the same as the first
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
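The difference between the first three examples can also be checked without reading the printed frames, by comparing the index level names of each result. The following is only a minimal sketch: the names `gb`, `results` and `index_names` are introduced here for illustration and are not part of the original report, and the printed names will depend on the pandas version in use.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp("2000-01-01"),
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-03"),
            pd.Timestamp("2001-01-01"),
        ],
    }
)

gb = X.groupby(["groupby_col"])  # default as_index=True

# Same three selection styles as examples one to three.
results = {
    "post": gb.rolling(window="5D", on="date")[["agg_col"]].agg("sum"),
    "pre": gb[["agg_col", "date"]].rolling(window="5D", on="date").agg("sum"),
    "both": gb[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum"),
}

# Collect the index level names of each result so they can be compared.
index_names = {name: list(res.index.names) for name, res in results.items()}
print(index_names)
```

If the three selection styles were consistent, all three entries would list the same level names.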
How about if we try as_index=False?
The fourth example shows that this doesn't work as expected (it has no effect)
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
But if we select before we do .rolling, as in example five, we see that this does seem to work
   groupby_col  agg_col       date
0            0      1.0 2000-01-01
1            0      3.0 2000-01-02
3            0      4.0 2001-01-01
2            1      3.0 2000-01-03
but I'm not sure if this is just a happy coincidence related to the weirdness from example two?
Example six shows that selecting both pre and post .rolling is effectively the same as only selecting post .rolling
                        agg_col
groupby_col date
0           2000-01-01      1.0
            2000-01-02      3.0
            2001-01-01      4.0
1           2000-01-03      3.0
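Whether as_index=False took effect in examples four to six can be tested directly by asking if groupby_col ends up among the result's columns. This is a rough sketch only: `gb`, `results` and `by_in_columns` are names made up for this illustration, and the booleans it prints may differ between pandas versions.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp("2000-01-01"),
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-03"),
            pd.Timestamp("2001-01-01"),
        ],
    }
)

gb = X.groupby(["groupby_col"], as_index=False)

# Same three selection styles as examples four to six.
results = {
    "post": gb.rolling(window="5D", on="date")[["agg_col"]].agg("sum"),
    "pre": gb[["agg_col", "date"]].rolling(window="5D", on="date").agg("sum"),
    "both": gb[["agg_col", "date"]]
    .rolling(window="5D", on="date")[["agg_col"]]
    .agg("sum"),
}

# True means the grouping key is a regular column, i.e. as_index=False was honoured.
by_in_columns = {
    name: "groupby_col" in res.columns for name, res in results.items()
}
print(by_in_columns)
```

A consistent implementation would report the same value for all three entries.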
Throwing sort in there doesn't seem to do anything (as noted in other issues, e.g. #50296), and exhibits the same behaviour as the other examples with respect to pre-rolling, post-rolling, and both pre- and post-rolling column selection, as shown by examples seven through nine.
Expected Behavior
- We would see consistent behaviour between pre-rolling col selection and post-rolling col selection
- as_index would always work and, if False, return a DataFrame where the by from the groupby are in the columns, and not the index, presumably leaving the resulting index as a (potentially) unsorted version of the original index
- sort would work at all
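Until the behaviour is consistent, one possible workaround is to flatten whatever index comes back and keep only the columns of interest. The sketch below is not a pandas API: `flatten_rolling_result` and `keep` are hypothetical names introduced for this illustration, and it assumes the grouping key and the on column each appear either in the index or in the columns of the result, as in the outputs shown above.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "groupby_col": [0, 0, 1, 0],
        "agg_col": [1, 2, 3, 4],
        "date": [
            pd.Timestamp("2000-01-01"),
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-03"),
            pd.Timestamp("2001-01-01"),
        ],
    }
)


def flatten_rolling_result(result, keep):
    """Hypothetical helper: move every index level into the columns,
    then keep only the requested columns with a fresh RangeIndex."""
    flat = result.reset_index()
    return flat[keep].reset_index(drop=True)


keep = ["groupby_col", "date", "agg_col"]
gb = X.groupby(["groupby_col"])

# Post-rolling selection (example one) and pre-rolling selection (example two).
post = flatten_rolling_result(
    gb.rolling(window="5D", on="date")[["agg_col"]].agg("sum"), keep
)
pre = flatten_rolling_result(
    gb[["agg_col", "date"]].rolling(window="5D", on="date").agg("sum"), keep
)

print(post)
print(pre)
```

This only normalises the shape of the output; it does not make sort=False preserve the original group order.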
Installed Versions
INSTALLED VERSIONS
commit : bdc79c1
python : 3.10.11.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.3
pytz : 2021.1
dateutil : 2.8.2
setuptools : 69.1.0
pip : 23.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.2.0
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None