How to enable the Standard? #115

Closed
@MarcoGorelli

Probably the most fundamental question

Say we have

  • dfpd, which is a pandas DataFrame
  • PandasStandardDataFrame, which is the pandas implementation of the Standard. It takes a pandas DataFrame and returns a class which only has the methods supported by the Standard. Say this is available on PyPI as pandas-standard

(...and similarly for modin / cudf / vaex / anyone else)
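To make "a class which only has the methods supported by the Standard" concrete, here is a minimal sketch of such a wrapper. Everything in it is an assumption for illustration: `ToyStandardDataFrame` is a hypothetical name, the wrapped "DataFrame" is just a plain dict of column name to list of values, and only the members used later in this issue (`get_column_names`, `rename`, `dataframe`) are shown. It is not how `PandasStandardDataFrame` is implemented.

```python
class ToyStandardDataFrame:
    """Hypothetical Standard-style wrapper around a dict of columns."""

    def __init__(self, data):
        # `data` stands in for a library DataFrame: {column name: list of values}.
        self.dataframe = data

    def get_column_names(self):
        # Column names, in their original order.
        return list(self.dataframe)

    def rename(self, mapping):
        # Return a new wrapper; names missing from `mapping` are kept unchanged.
        renamed = {
            mapping.get(name, name): values
            for name, values in self.dataframe.items()
        }
        return ToyStandardDataFrame(renamed)


wrapped = ToyStandardDataFrame({"A": [1, 2], "b": [3, 4]})
renamed = wrapped.rename({"A": "a"})
```

The wrapper deliberately does not forward other attributes of the underlying object, so code written against it can only rely on what the wrapper itself exposes.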

Right. Say I want to write a function which can accept any DataFrame, like

def clean_column_names(df: DataFrame) -> DataFrame:
    df_standard = <get the relevant Standard implementation>
    mapping = {}
    for column in df_standard.get_column_names():
        mapping[column] = column.lower()
    df_standard = df_standard.rename(mapping)
    return df_standard.dataframe

How do we do the first line, i.e. getting df_standard?

I think the ideal place to get to would be

df_standard = df.__dataframe_standard__()

but this can't happen overnight, especially if we want to stick to the mantra @jbrockmendel had mentioned here

id like to find an alternative that fits with the "assume pandas changes nothing" mantra

Phase 1

This is hacky, but it allows for quick experimentation without depending on pandas' approval, on pandas' relatively slow release cycle, or on those of any other library.

Consumers of the Standard would need to write something like

def enable_standard(df: DataFrame):
    # type(df).__module__ is a dotted path such as 'pandas.core.frame'
    if type(df).__module__.split('.')[0] == 'pandas':
        from pandas_standard import PandasStandardDataFrame  # TODO raise if not installed
        return PandasStandardDataFrame(df)
    # and similarly for any other package which might not
    # want to introduce `__dataframe_standard__` right away
    try:
        return df.__dataframe_standard__()
    except AttributeError:
        raise TypeError(f'Expected DataFrame Standard compliant DataFrame, got {type(df)}')

At the moment the Consortium is still relatively small, so enumerating the options in if-then statements should be manageable.
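If the if-then chain does start to grow, one possible way to keep it tidy is a lookup table keyed by top-level package name. This is only a sketch under assumptions: the `_STANDARD_WRAPPERS` table is hypothetical, and `pandas_standard` / `PandasStandardDataFrame` are the hypothetical names from above. The wrapper module is imported lazily, so nothing is required until a DataFrame from that package is actually passed in.

```python
import importlib

# Hypothetical registry: top-level package -> (wrapper module, wrapper class).
_STANDARD_WRAPPERS = {
    "pandas": ("pandas_standard", "PandasStandardDataFrame"),
}


def enable_standard(df):
    # type(df).__module__ is a dotted path such as "pandas.core.frame".
    package = type(df).__module__.split(".")[0]
    if package in _STANDARD_WRAPPERS:
        module_name, class_name = _STANDARD_WRAPPERS[package]
        try:
            module = importlib.import_module(module_name)
        except ImportError as exc:
            raise ImportError(
                f"{module_name!r} must be installed to use the Standard "
                f"with {package!r}"
            ) from exc
        return getattr(module, class_name)(df)
    # Libraries that already opt in expose the method themselves.
    try:
        opt_in = df.__dataframe_standard__
    except AttributeError:
        raise TypeError(
            f"Expected DataFrame Standard compliant DataFrame, got {type(df)}"
        ) from None
    return opt_in()
```

Looking up `__dataframe_standard__` separately from calling it means an `AttributeError` raised inside a library's own implementation is not mistaken for "this object doesn't support the Standard".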

Phase 2

Once we've seen that some libraries are actually able to use it to write portable code, then pandas could add a method like

def __dataframe_standard__(self):
    # import_optional_dependency is pandas' internal helper (pandas.compat._optional)
    import_optional_dependency("pandas_standard")  # this will raise if not installed
    from pandas_standard import PandasStandardDataFrame
    return PandasStandardDataFrame(self)

and similarly for other libraries.

I'd like to think that something so small would be a relatively easy sell to the pandas-dev team. @jbrockmendel, @jorisvandenbossche, do you agree? Would you be OK with this? Do you have other suggestions for how to opt in to the Standard?

Note that consumers would need to have pandas-standard installed for this to work.

Phase 3 (optional)

dfpd.__dataframe_standard__() would work without requiring extra dependencies (either because pandas_standard has become a runtime dependency of pandas, or because the standard is implemented within pandas).
I think this is what some, such as @aregm, would like to see happen.

Usual reminder that I think this is unlikely to pass - nonetheless, it's not off the table, and some participants would find it desirable, so I've kept it in.
