Restrictions on column labels #7
I think those are very good points to discuss. My preference would be:
The reason is the same in both cases: I think there is a trade-off between complexity of the standard/implementation and flexibility of the tool. From my own experience with the kind of data and projects I've worked on, the increase in complexity is not worth it. I do see use cases, for example:
But even if cases like this would be a bit trickier, I still think that being able to assume string types and uniqueness will simplify enough things that it's worth it. Of course, we can leave this out of the standard and let dataframe implementations decide. But I think consumers of dataframes would then also face an IMHO unreasonable increase in complexity. Think of … For the types: not my preference, but I would be OK accepting a small subset of types (e.g. …).
I agree. I think that limitation is fair, not too limiting, and makes things much easier for users and library authors. If you think about going from dataframes to other libraries, e.g. visualization, where a column name can be a label in a plot, or to JSON-like structures, where it can be a key, it's going to be messy if we don't require that.
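The JSON point can be made concrete with a small sketch using only the standard-library `json` module (no dataframe API involved): JSON object keys must be strings, so a non-string column label such as a tuple cannot become a key without some lossy conversion.

```python
import json

# A string column label maps cleanly onto a JSON object key.
print(json.dumps({"price": [1.0, 2.0]}))  # -> {"price": [1.0, 2.0]}

# A tuple-valued label has no JSON-key equivalent; json.dumps rejects it.
try:
    json.dumps({("year", 1999): [1.0]})
except TypeError as exc:
    print("tuple label rejected:", exc)
```

This is exactly the kind of impedance mismatch that a string-only requirement avoids at the standard level.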
Unless you want to read a dataset that has duplicate labels :) Though perhaps we just require that any IO routine that reads from a store that allows duplicates (e.g. …)
I'm not sure I understand how the pivot-table point was addressed. Does that mean that pivoting will convert a value to a string for the column name?
I think the string-only column names requirement hasn't been addressed / discussed much.
That's a good point. My opinion: I don't think pivoting should be part of the standard. Surely a nice feature for some users, but I'm personally fine with it differing between implementations, whether built into each package or provided as third-party extensions. I think it should be quite easy to add a wrapper to a dataframe that maps any (hashable) value to a string, and shows the original values to the user (so the user doesn't see the actual string labels). So even if the underlying dataframe has this restriction, implementations or third-party packages can provide something "fancier" to the user. It's surely somewhat tricky, but moving complexity out of the core dataframe standard into implementations and third-party packages seems like a good deal to me.
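The wrapper idea could look roughly like this. All names here (`LabelMappingFrame`, the `col_N` mangling scheme) are hypothetical, not anything from the standard: the core frame stores only unique string labels, and the wrapper keeps a bijection back to the user's arbitrary hashable labels.

```python
# Sketch only: a thin layer mapping hashable labels <-> unique strings,
# assuming the "core" dataframe accepts only unique string column names.
class LabelMappingFrame:
    def __init__(self):
        self._to_str = {}    # hashable label -> unique string label
        self._to_label = {}  # unique string label -> hashable label
        self._columns = {}   # string label -> column data (stands in for the core frame)

    def _mangle(self, label):
        if label not in self._to_str:
            s = f"col_{len(self._to_str)}"  # any unique-string scheme would do
            self._to_str[label] = s
            self._to_label[s] = label
        return self._to_str[label]

    def set_column(self, label, data):
        self._columns[self._mangle(label)] = data

    def get_column(self, label):
        return self._columns[self._to_str[label]]

    def labels(self):
        # users see their original labels, never the internal strings
        return [self._to_label[s] for s in self._columns]
```

A third-party package could wrap any compliant dataframe this way, keeping the string-only restriction out of user-facing code.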
+1 as well for string-only and uniqueness.
That is a separate decision that is unnecessary to mix in here.
That seems like a decent solution. Any implementation can ensure the resulting strings are unique (e.g. append …).
It seems like the preference here is to require that column labels must be unique strings.
We'll want to specify / provide guidance on when and how to mangle duplicate columns should they arise (e.g. when reading from a CSV file). And we'll want to specify what should happen when a dataframe operation introduces duplicate labels (each of these should probably raise).
Tom, I appreciate you trying to move things forward. I also agree that column labels should be unique and strings. This also makes saving a DataFrame/Dataset/Table cross-platform much easier. What happens with a column name collision? I suggest we have a kwarg (perhaps 'collide'?) that determines what happens. By default we might append '_1', '_2', '_3', etc. for each collision. If the user specifies: …
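A sketch of the suffix scheme suggested above. The `collide` keyword and its mode names are my assumptions for illustration, not an agreed-on API:

```python
# Hypothetical duplicate-label mangling: on collision, append '_1', '_2', ...
# (Edge case not handled here: a generated name like 'a_1' colliding with an
# existing 'a_1' label; a real implementation would need to guard against that.)
def mangle_duplicates(labels, collide="rename"):
    if collide == "raise":
        if len(set(labels)) != len(labels):
            raise ValueError("duplicate column labels")
        return list(labels)
    seen = {}
    out = []
    for label in labels:
        n = seen.get(label, 0)
        out.append(label if n == 0 else f"{label}_{n}")
        seen[label] = n + 1
    return out

print(mangle_duplicates(["a", "b", "a", "a"]))  # -> ['a', 'b', 'a_1', 'a_2']
```

An I/O routine reading a CSV with repeated headers could call something like this by default, while `collide="raise"` matches the stricter behavior proposed for in-library operations.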
At least for data IO methods like …. I'm less certain of the need for it in methods that might introduce duplicate column labels in the course of normal operation, like ….
Just checking: requiring unique all-string columns means "to satisfy the spec, you must support unique all-string columns", not "to satisfy the spec, you must support only unique all-string columns", right?
@jbrockmendel the intent was the latter (only). That seems to be preferred from both a usability and an "avoid complexity" point of view. This issue is quite old, but IIRC in more recent conversations there was pretty universal agreement on this. And the interchange protocol has that requirement as well: https://data-apis.org/dataframe-protocol/latest/design_requirements.html#protocol-design-requirements
so to be spec-compliant pandas would have to deprecate support for non-unique columns and non-string columns? |
For context: there is a significant tension between backwards compatibility in libraries and not simply standardizing the way things work now for behavior/features that many maintainers don't like. As a result, library maintainers are not planning to implement the whole standard in their main namespace (or perhaps only with some kind of switch, see gh-79). I'd expect this to be one of those things where pandas would either not want to deprecate this at all, or would do so quite slowly.
I expect this is one of many things I'll have to get used to, but I find this confusing. Saying "implementation X must support Y" seems reasonable. Continuing with "and it must not support Z" seems unnecessary and counterproductive. If I apply the same reasoning to the description of arrays being contiguous, that means you're not allowed to support strided arrays, so e.g. …
Agreed - I think there are very few cases of this though. In general, we'd expect libraries to offer a superset of the functionality of what's in a standard. So that means that if a user or downstream package author restricts themselves to the standardized set of APIs and to inputs that are supported, they have portable code. And if they go beyond it, they don't. Maybe I was wrong above. It's possible for a library to support non-unique/non-string columns, as long as the behavior is compliant for any methods/functions in the standard. Additionally, it'd be good to have sane defaults outside of that, so for example all I/O routines and other standard ways of creating dataframes would default to producing unique string names. Otherwise it's too easy to write non-portable code. But then an explicit …
No, that's certainly not intended. The copy is only necessary in ….
This has generally been the guiding principle - so pandas need not forbid non-string column names, but anyone writing something like

```python
df = data.__dataframe_consortium_standard__()
df.assign((df.col('a') + df.col('b')).rename(1999))
```

can't expect the above to produce dataframe-agnostic code.

I think the opening issue / question has been addressed then, so closing, but please let me know if I've misunderstood and I'll reopen.
One of the uncontroversial points from #2 is that DataFrames have column labels / names. I'd like to discuss two specific points on this before merging the results into that issue.
I'm a bit unsure whether these are getting too far into the implementation side of things. Should we just take no stance on either of these?
My responses:
Operations like `crosstab` / `pivot` place a column from the input dataframe into the column labels of the output.

We'll need to be careful with how this interacts with the indexing API, since a label like the tuple `('my', 'label')` might introduce ambiguities (e.g. if the full list of labels is `['my', 'label', ('my', 'label')]`).

Is it reasonable to require each label to be hashable? Pandas requires this, to facilitate lookup in a hashtable.
Dataframes are commonly used to wrangle real-world data into shape, and real-world data is messy. If an implementation wants to ensure uniqueness (perhaps on a per-object basis), then it can offer that separately. But the API should at least allow for duplicates.