Description
Probably the most fundamental question
Say we have
- `dfpd`, which is a pandas DataFrame
- `PandasStandardDataFrame`, which is the pandas implementation of the Standard. It takes a pandas DataFrame and returns a class which only has the methods supported by the Standard. Say this is available on PyPI as `pandas-standard`
- (...and similarly for modin / cudf / vaex / anyone else)
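To make the idea of "a class which only has the methods supported by the Standard" concrete, here is a minimal sketch. It wraps a plain dict of columns rather than a real pandas DataFrame, and `ToyStandardDataFrame` is a made-up name for illustration. Only `get_column_names`, `rename` and `.dataframe` are taken from the example in this issue; everything else is an assumption.

```python
from typing import Dict, List


class ToyStandardDataFrame:
    """Toy wrapper exposing only a small, Standard-like surface.

    Hypothetical: wraps a dict of column name -> values, not a pandas object.
    """

    def __init__(self, columns: Dict[str, list]) -> None:
        # The wrapped native object stays reachable, mirroring `.dataframe`.
        self.dataframe = columns

    def get_column_names(self) -> List[str]:
        return list(self.dataframe)

    def rename(self, mapping: Dict[str, str]) -> "ToyStandardDataFrame":
        # Columns missing from the mapping keep their current name.
        renamed = {
            mapping.get(name, name): values
            for name, values in self.dataframe.items()
        }
        return ToyStandardDataFrame(renamed)


df_standard = ToyStandardDataFrame({"A": [1, 2], "B": [3, 4]})
lowered = df_standard.rename(
    {name: name.lower() for name in df_standard.get_column_names()}
)
print(lowered.get_column_names())  # ['a', 'b']
```

The wrapper returns a new wrapper from `rename` instead of mutating in place, so the caller decides when to unwrap via `.dataframe`.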
Right. Say I want to write a function which can accept any DataFrame, like
def clean_column_names(df: DataFrame) -> DataFrame:
    df_standard = <get the relevant Standard implementation>
    mapping = {}
    for column in df_standard.get_column_names():
        mapping[column] = column.lower()
    df_standard = df_standard.rename(mapping)
    return df_standard.dataframe
How do we do the first line, i.e. getting `df_standard`?
I think the ideal place to get to would be
df_standard = df.__dataframe_standard__()
but this can't happen overnight, especially if we want to stick to the mantra @jbrockmendel had mentioned here
> I'd like to find an alternative that fits with the "assume pandas changes nothing" mantra
Phase 1
This is hacky, but it allows for quick experimentation without depending on pandas' approval, on pandas' relatively slow release cycle, or on those of any other library.
Consumers of the Standard would need to write something like
def enable_standard(df: DataFrame):
    # type(df) is a class, not a string, so look at the module that defines it
    if type(df).__module__.split('.')[0] == 'pandas':
        from pandas_standard import PandasStandardDataFrame  # TODO raise if not installed
        return PandasStandardDataFrame(df)
    # and similarly for any other package which might not
    # want to introduce `__dataframe_standard__` right away
    try:
        return df.__dataframe_standard__()
    except AttributeError:
        raise TypeError(f'Expected DataFrame Standard compliant DataFrame, got {type(df)}')
At the moment the Consortium is still relatively small, so enumerating the options in if-then statements should be manageable.
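If the if-then chain does start to grow, one possible variation is a small registry keyed by top-level package name. This is only a sketch: `register_wrapper` and `_WRAPPERS` are hypothetical names, and `pandas_standard` is the package assumed earlier in this issue (its import is deferred, so the block runs without it installed).

```python
from typing import Any, Callable, Dict

# Hypothetical registry: top-level package name -> factory that wraps a
# native DataFrame in its Standard-compliant class.
_WRAPPERS: Dict[str, Callable[[Any], Any]] = {}


def register_wrapper(package: str, factory: Callable[[Any], Any]) -> None:
    _WRAPPERS[package] = factory


def _wrap_pandas(df: Any) -> Any:
    # Imported lazily; `pandas_standard` is an assumed package name.
    from pandas_standard import PandasStandardDataFrame
    return PandasStandardDataFrame(df)


register_wrapper("pandas", _wrap_pandas)


def enable_standard(df: Any) -> Any:
    # Libraries that have not adopted the dunder yet go through the registry.
    package = type(df).__module__.split(".")[0]
    if package in _WRAPPERS:
        return _WRAPPERS[package](df)
    # Everything else is expected to opt in via `__dataframe_standard__`.
    opt_in = getattr(df, "__dataframe_standard__", None)
    if opt_in is None:
        raise TypeError(
            f"Expected DataFrame Standard compliant DataFrame, got {type(df)}"
        )
    return opt_in()
```

Using `getattr` instead of a broad `except AttributeError` avoids masking an `AttributeError` raised inside a library's own `__dataframe_standard__` implementation.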
Phase 2
Once we've seen that some libraries are actually able to use it to write portable code, then pandas could add a method like
def __dataframe_standard__(self):
    # import_optional_dependency lives in pandas.compat._optional
    import_optional_dependency("pandas_standard")  # this will raise if not installed
    from pandas_standard import PandasStandardDataFrame
    return PandasStandardDataFrame(self)
, and similarly for other libraries.
I'd like to think that something so small would be a relatively easy sell to the pandas-dev team - @jbrockmendel, @jorisvandenbossche, do you agree? Would you be OK with this? Do you have other suggestions for how to opt in to the Standard?
Note that consumers would need to have `pandas-standard` installed for this to work.
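For libraries without a helper like pandas' `import_optional_dependency`, the same opt-in pattern can be sketched with the standard library alone. `ToyDataFrame`, `toy_standard` and `ToyStandardDataFrame` below are hypothetical stand-ins, not real packages or classes.

```python
import importlib
from typing import Any


class ToyDataFrame:
    """Stand-in for a library's native DataFrame (hypothetical, not pandas)."""

    # Hypothetical name of the separately installed companion package.
    standard_module = "toy_standard"

    def __dataframe_standard__(self) -> Any:
        try:
            module = importlib.import_module(self.standard_module)
        except ImportError as exc:
            # Fail with an actionable message when the extra is missing.
            raise ImportError(
                "Opting in to the Standard requires the optional package "
                f"'{self.standard_module}'; please install it."
            ) from exc
        # Assumes the companion package exposes a wrapper class by this name.
        return module.ToyStandardDataFrame(self)
```

The library itself gains only this one method; all Standard logic stays in the companion package, which keeps the change small for the host library.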
(Optional) Phase 3
`dfpd.__dataframe_standard__()` would work without requiring extra dependencies (either because `pandas_standard` has become a runtime dependency of pandas, or because the Standard is implemented within pandas).
I think this is what some, such as @aregm, would like to see happen.
Usual reminder that I think this is unlikely to pass - nonetheless, it's not off the table, and some participants would find it desirable, so I've kept it in.