Method to get underlying object #108
Comments
I think this is gh-85?
Or if it's about regular API methods/functions, then those should already give back instances of the correct dataframe type, right? There's no separate object to convert to/from.
My understanding was that the standard would be implemented as a separate class which would wrap the DataFrame, something like

    class PandasStandardDataFrame:
        def __init__(self, df):
            _validate_df(df)  # check all columns are strings, no duplicate columns
            self.df = df

        def drop_column(self, label):
            # wrap the result again so that standard methods can be chained
            return PandasStandardDataFrame(self.df.drop(label, axis=1))

        def get_columns_by_name(self, names):
            if not isinstance(names, list) or not all(isinstance(name, str) for name in names):
                raise TypeError("Expected list of str")
            return PandasStandardDataFrame(self.df.loc[:, names])

    df  # pandas dataframe
    df_standard = PandasStandardDataFrame(df)  # enable standard mode
    df_standard = df_standard.drop_column("y")  # use some method from the standard
    df_standard = df_standard.get_columns_by_name(["x_0", "x_1"])  # keep using methods from the standard
    df = df_standard.df  # go back to having a pandas dataframe

If

    df  # pandas dataframe
    df = PandasStandardDataFrame(df).drop_column("y")  # use some method from the standard
    df = PandasStandardDataFrame(df).get_columns_by_name(["x_0", "x_1"])  # use another method from the standard

?
Ah fair enough, you are right here. I was applying my array intuition too much - we really need the separate dataframe class because we design with methods, not functions. So the question is how to spell what you need here. You suggested
Is there a viable way to do this going through the interchange protocol?
That's pretty expensive though, having to iterate through memory. This is within a single library so I'd just use a private attribute:

    class DataFrame():  # pd.DataFrame
        def __init__(...):
            if hasattr(df, '_df_pandasbase'):
                return df._df_pandasbase
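A runnable variant of the private-attribute idea might look like the sketch below. Since `__init__` cannot return a value in Python, this version does the unwrapping in `__new__` instead, which is a swapped-in technique rather than what the comment literally wrote. `NativeFrame`, `StandardFrame` and `_df_base` are hypothetical stand-ins (no pandas involved), used only to illustrate the mechanism.

```python
class NativeFrame:
    """Hypothetical stand-in for a library's native dataframe class."""

    def __new__(cls, data):
        # If `data` is a standard wrapper carrying a native frame,
        # hand that frame back instead of building a new one (no copy).
        base = getattr(data, "_df_base", None)
        if isinstance(base, NativeFrame):
            return base
        return super().__new__(cls)

    def __init__(self, data):
        # __init__ also runs on the reused instance; leave it untouched.
        if getattr(data, "_df_base", None) is self:
            return
        self.columns = dict(data)


class StandardFrame:
    """Hypothetical standard-compliant wrapper holding a private reference."""

    def __init__(self, native):
        self._df_base = native  # private attribute, assumed name


native = NativeFrame({"x": [1, 2]})
wrapped = StandardFrame(native)
unwrapped = NativeFrame(wrapped)  # returns the original object
```

Note that this approach requires changing the native constructor, which is the drawback raised in the next comment.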
I'd like to find an alternative that fits with the "assume pandas changes nothing" mantra
I think accessing it from the object might be easier (with an attribute or method), because otherwise you need to know which namespace and function to use? For pandas it could be
Unless it would be a method in the "standard namespace" (if we will have something like that)
That bakes in the assumption though that the "native dataframe" exists, and that there's a 1:1 relationship between any implementer of the standard and some other underlying dataframe object within the same library. I'm not sure that that assumption will hold - say you write a new library that only implements the standard, natively, plus the interchange protocol to transform itself into any other library's df object. Or if you'd have a
Yes, I think if the standard dataframe is the "native" object itself, it can just return itself, that doesn't seem like a problem (similar to how the interchange object also returns itself in But maybe we should also first think about the question: as a user of the standard API, how do you get a "standard" object given a random dataframe?
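The "it can just return itself" case can be sketched as follows. `StandardOnlyFrame` is a hypothetical library type that implements the standard natively, and the attribute name `dataframe` is an assumption borrowed from the issue description; nothing here is an agreed API.

```python
class StandardOnlyFrame:
    """Hypothetical dataframe that implements the standard natively."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def get_columns_by_name(self, names):
        return StandardOnlyFrame({n: self._columns[n] for n in names})

    @property
    def dataframe(self):
        # There is no separate native object, so "opting out" is the identity.
        return self


def select_x(standard_df):
    # Library code written against the standard only.
    return standard_df.get_columns_by_name(["x"]).dataframe


df = StandardOnlyFrame({"x": [1, 2], "y": [3, 4]})
result = select_x(df)
```

Code that calls `.dataframe` works unchanged whether or not a distinct underlying object exists.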
Let's keep the question of opting into the standard for a separate issue
Sounds fine
In order for this to be usable, I don't see how it can not be part of the standard - otherwise how can a library implement a function like

    def my_fancy_function(df: AnyDataFrame):
        standard_df = dataframe_standard(df)  # we still need to agree on how to opt-in to the standard
        standard_df = standard_df.get_columns_by_name(...)  # bunch of operations which use the standard
        return standard_df.df

and be guaranteed that it'll work for any DataFrame?
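To make the point concrete, here is a minimal sketch with two toy backends. `ColumnFrame`, `RowFrame`, their wrappers and the `_WRAPPERS` registry are all hypothetical; `dataframe_standard` and `.df` follow the spelling used in the comment above, and the opt-in mechanism is still undecided in this thread.

```python
class ColumnFrame:
    """Hypothetical native type A: stores a dict of columns."""
    def __init__(self, data):
        self.data = dict(data)


class RowFrame:
    """Hypothetical native type B: stores a list of row dicts."""
    def __init__(self, rows):
        self.rows = list(rows)


class StandardColumnFrame:
    def __init__(self, df):
        self.df = df

    def get_columns_by_name(self, names):
        picked = {n: self.df.data[n] for n in names}
        return StandardColumnFrame(ColumnFrame(picked))


class StandardRowFrame:
    def __init__(self, df):
        self.df = df

    def get_columns_by_name(self, names):
        picked = [{n: row[n] for n in names} for row in self.df.rows]
        return StandardRowFrame(RowFrame(picked))


# Assumed opt-in mechanism: a simple lookup from native type to wrapper.
_WRAPPERS = {ColumnFrame: StandardColumnFrame, RowFrame: StandardRowFrame}


def dataframe_standard(df):
    return _WRAPPERS[type(df)](df)


def my_fancy_function(df):
    standard_df = dataframe_standard(df)                  # opt in
    standard_df = standard_df.get_columns_by_name(["x"])  # standard-only operations
    return standard_df.df                                 # opt out, same spelling for every backend


a = my_fancy_function(ColumnFrame({"x": [1, 2], "y": [3, 4]}))
b = my_fancy_function(RowFrame([{"x": 1, "y": 3}, {"x": 2, "y": 4}]))
```

`my_fancy_function` never names a backend; it only works because both wrappers expose the underlying object under the same attribute.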
We can certainly discuss it separately, but I think the exact answer for this issue could depend on it. For example, if we define a namespace, we could also have a function in that namespace instead of a method or attribute.
Sure, but it would still need to be the same for all DataFrame libraries taking part, right? Otherwise, in the example in #108 (comment), how does one write DataFrame-agnostic code? I'd have thought this was essential, not just useful
Yes, to be clear I fully agree with that. Updated my comment above to not use the mere "useful" ;)
Okay, so there is agreement we do need this in some form. Do we think this is always an O(1) operation? If so, an attribute seems reasonable. If it can trigger computation, it should be either a method, or a way to retrieve a constructor function as in gh-85 (that's more complex, which is probably justified for the interchange protocol but not here).
Discussed today: folks agreed that this should exist and be cheap. Hence: an attribute
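A cheap attribute in this sense could be a read-only property that hands back a stored reference. The sketch below assumes the name `dataframe` (taken from the issue description, not from the decision recorded here) and uses a plain dict as a stand-in for a native dataframe.

```python
class StandardFrame:
    """Hypothetical standard-compliant wrapper."""

    def __init__(self, native):
        self._native = native

    @property
    def dataframe(self):
        # O(1): returns the stored reference, no copy and no computation.
        return self._native


native = {"x": [1, 2, 3]}  # stand-in for a native dataframe
wrapped = StandardFrame(native)
same_object = wrapped.dataframe is native

# The property has no setter, so the underlying object cannot be swapped out.
try:
    wrapped.dataframe = {}
    read_only = False
except AttributeError:
    read_only = True
```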
Does there need to be a way to get back the underlying object?

I'm thinking about the pyjanitor clean_names example.

Some user starts with a DataFrame (say, a pandas one) df, and calls clean_names(df). They would probably expect to get back what they started with, without caring that PyJanitor internally used the standard.

For example, PyJanitor could do

So, should some .dataframe property be added, so that the library can "opt-out" of the standard once it has done all its work?
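The clean_names scenario could be sketched roughly as below. This is not pyjanitor's implementation: `NativeFrame`, `StandardFrame`, `get_column_names` and `rename_columns` are hypothetical names, and the cleaning rule is a simplified stand-in for what `clean_names` really does.

```python
class NativeFrame:
    """Hypothetical stand-in for the user's dataframe (e.g. a pandas one)."""
    def __init__(self, data):
        self.data = dict(data)


class StandardFrame:
    """Hypothetical standard-compliant wrapper."""
    def __init__(self, native):
        self._native = native

    def get_column_names(self):
        return list(self._native.data)

    def rename_columns(self, mapping):
        renamed = {mapping.get(k, k): v for k, v in self._native.data.items()}
        return StandardFrame(NativeFrame(renamed))

    @property
    def dataframe(self):
        return self._native


def clean_names(df):
    std = StandardFrame(df)  # opt in to the standard internally
    mapping = {
        name: name.strip().lower().replace(" ", "_")
        for name in std.get_column_names()
    }
    # Opt out again so the caller gets back the type they passed in.
    return std.rename_columns(mapping).dataframe


out = clean_names(NativeFrame({"First Name": ["a"], " Age ": [1]}))
```

The caller passes a `NativeFrame` and receives a `NativeFrame`, without ever seeing the wrapper.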