How to enable the Standard? #115

Closed
MarcoGorelli opened this issue Mar 20, 2023 · 4 comments · Fixed by #156

Comments

@MarcoGorelli
Contributor

Probably the most fundamental question

Say we have

  • dfpd, which is a pandas DataFrame
  • PandasStandardDataFrame, which is the pandas implementation of the Standard. It takes a pandas DataFrame and returns a class which only has the methods supported by the Standard. Say this is available on PyPI as pandas-standard

(...and similarly for modin / cudf / vaex / anyone else)

Right. Say I want to write a function which can accept any DataFrame, like

def clean_column_names(df: DataFrame) -> DataFrame:
    df_standard = <get the relevant Standard implementation>
    mapping = {}
    for column in df_standard.get_column_names():
        mapping[column] = column.lower()
    df_standard = df_standard.rename(mapping)
    return df_standard.dataframe

How do we do the first line, i.e. getting df_standard?

I think the ideal place to get to would be

df_standard = df.__dataframe_standard__()
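
To make that end state concrete, here is a minimal, self-contained sketch. It assumes a toy dataframe (`ToyDataFrame`, just a dict of columns) in place of pandas and a hypothetical wrapper `ToyStandardDataFrame`. Only `get_column_names`, `rename`, `.dataframe`, and `__dataframe_standard__` come from the discussion here; everything else is illustrative.

```python
class ToyDataFrame:
    """Hypothetical stand-in for a real dataframe library's class."""

    def __init__(self, data: dict) -> None:
        self.data = data  # column name -> list of values

    def __dataframe_standard__(self) -> "ToyStandardDataFrame":
        # The library hands out its own Standard-compliant wrapper.
        return ToyStandardDataFrame(self)


class ToyStandardDataFrame:
    """Hypothetical wrapper exposing only what the consumer below needs."""

    def __init__(self, df: ToyDataFrame) -> None:
        self.dataframe = df  # the wrapped native object

    def get_column_names(self) -> list:
        return list(self.dataframe.data)

    def rename(self, mapping: dict) -> "ToyStandardDataFrame":
        # Columns missing from the mapping keep their original name.
        renamed = {mapping.get(k, k): v for k, v in self.dataframe.data.items()}
        return ToyStandardDataFrame(ToyDataFrame(renamed))


def clean_column_names(df):
    df_standard = df.__dataframe_standard__()
    mapping = {c: c.lower() for c in df_standard.get_column_names()}
    return df_standard.rename(mapping).dataframe


cleaned = clean_column_names(ToyDataFrame({"A": [1, 2], "Price": [3.0, 4.0]}))
print(list(cleaned.data))  # ['a', 'price']
```

The consumer function never names the underlying library; it only relies on the dunder and the wrapper's methods.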

but this can't happen overnight, especially if we want to stick to the mantra @jbrockmendel had mentioned here

I'd like to find an alternative that fits with the "assume pandas changes nothing" mantra

Phase 1

This is hacky, but it allows for quick experimentation without needing to depend on pandas' approval, on pandas' relatively slow release cycle, or on those of any other library.

Consumers of the Standard would need to write something like

def enable_standard(df: DataFrame):
    if type(df).__module__.split('.')[0] == 'pandas':  # top-level package that defines the class
        from pandas_standard import PandasStandardDataFrame  # TODO raise if not installed
        return PandasStandardDataFrame(df)
    # and similarly for any other package which might not
    # want to introduce `__dataframe_standard__` right away
    try:
        return df.__dataframe_standard__()
    except AttributeError:
        raise TypeError(f'Expected DataFrame Standard compliant DataFrame, got {type(df)}')
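
If the if-chain grows, one option is a small registry keyed on the top-level module of the input's type. The sketch below is only an illustration: `FakePandasFrame`, `FakeStandardFrame`, and `_REGISTRY` are hypothetical stand-ins so that it runs without pandas or `pandas_standard` installed. Note that the check reads `type(df).__module__`, since a class object itself has no `split` method.

```python
from typing import Any, Callable, Dict


class FakeStandardFrame:
    """Hypothetical Standard wrapper; stands in for PandasStandardDataFrame."""

    def __init__(self, df: Any) -> None:
        self.dataframe = df


class FakePandasFrame:
    """Hypothetical dataframe that pretends to live in the pandas package."""


# Pretend the class was defined inside pandas (illustration only).
FakePandasFrame.__module__ = "pandas.core.frame"

# Top-level package name -> factory returning the Standard wrapper.
# In real code the factory would lazily import e.g. pandas_standard.
_REGISTRY: Dict[str, Callable[[Any], Any]] = {
    "pandas": FakeStandardFrame,
}


def enable_standard(df: Any) -> Any:
    top_level = type(df).__module__.split(".")[0]
    factory = _REGISTRY.get(top_level)
    if factory is not None:
        return factory(df)
    # Fall back to the dunder for libraries that already provide it.
    try:
        return df.__dataframe_standard__()
    except AttributeError:
        raise TypeError(
            f"Expected DataFrame Standard compliant DataFrame, got {type(df)}"
        ) from None


wrapped = enable_standard(FakePandasFrame())
```

Adding another library then means adding one registry entry rather than another branch.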

At the moment the Consortium is still relatively small, so enumerating the options in if-then statements should be manageable.

Phase 2

Once we've seen that some libraries are actually able to use it to write portable code, then pandas could add a method like

def __dataframe_standard__(self):
    import_optional_dependency("pandas_standard")  # this will raise if not installed
    from pandas_standard import PandasStandardDataFrame
    return PandasStandardDataFrame(self)

and similarly for other libraries.
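
`import_optional_dependency` is a pandas-internal helper. For a library without such a helper, the same "raise if not installed" behaviour can be approximated with `importlib`. A hedged sketch follows; `require_optional`, its error message, and `ToyFrame` are hypothetical names, not an existing API.

```python
import importlib
from types import ModuleType


def require_optional(name: str) -> ModuleType:
    """Import `name`, raising a descriptive ImportError if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"Missing optional dependency '{name}'. "
            "Install it to use the dataframe Standard."
        ) from err


class ToyFrame:
    """Hypothetical dataframe class, standing in for pd.DataFrame."""

    def __dataframe_standard__(self):
        # Imported lazily so the dependency stays optional.
        pandas_standard = require_optional("pandas_standard")
        return pandas_standard.PandasStandardDataFrame(self)
```

Because the import happens inside the method, users who never call `__dataframe_standard__` are unaffected by the missing package.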

I'd like to think that something so small would be a relatively easy sell to the pandas-dev team - @jbrockmendel, @jorisvandenbossche, do you agree? Would you be OK with this? Do you have other suggestions for how to opt in to the Standard?

Note that consumers would need to have pandas-standard installed for this to work.

(Optional) Phase 3

dfpd.__dataframe_standard__() would work without requiring extra dependencies (either because pandas_standard has become a runtime dependency of pandas, or because the standard is implemented within pandas).
I think this is what some, such as @aregm, would like to see happen.

Usual reminder that I think this is unlikely to pass - nonetheless, it's not off the table, and some participants would find it desirable, so I've kept it in.

@jbrockmendel
Contributor

I'd be fine with a dunder method if we can bikeshed away from "standard" to something more neutral e.g. "consortium"

@rgommers
Member

I suggest __dataframe_namespace__, in analogy to https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__array_namespace__.html. We've already invented this wheel I'd say, and I can't think of a reason why it shouldn't be copy-paste.

For phase 1 the approach you suggested works, or just monkeypatch __dataframe_namespace__ onto pd.DataFrame.

@MarcoGorelli
Contributor Author

or just monkeypatch __dataframe_namespace__ onto pd.DataFrame.

Sure, but this would require that pandas be a dependency of the consumer, right? They might not necessarily want that.

@jorisvandenbossche
Member

The package implementing the standard on top of pandas can do this monkey-patching (in the short term), so that package would depend on pandas anyway.
