ENH: generalize __init__ on a dict to abc.collections.Mapping and __getitem__ on a list to abc.collections.Sequence #58803
Comments
I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do support. |
Ah, there would be a performance penalty? I would have thought it would just be changing an |
Slightly unrelated, but have you tried the Julia pandas wrapper? I believe it automatically does the conversion from Julia Vector/Dict to a Python list/dict for you |
Thanks, sadly that one looks to use the older PyCall.jl instead of the newer PythonCall.jl, so not (yet) compatible. It's not a big deal if not possible. I guess it's a bit of a sharp edge, especially as it doesn't throw an error, but for users who google this, the workaround does work. I just thought it seemed like it might be more duck-typey/pythonic if any dict-like input was acceptable for initializing from a dict (and similar for sequences) rather than only explicit dicts – I could imagine other cases where it might be useful to have this. But I understand this is totally subjective! |
@jbrockmendel I checked the performance degradation by accepting Note that I don't have much knowledge of the pandas internals so this patch can be insufficient to accept |
In the case of Line 763 in 2aa155a
data.keys() and data[key] for each key and converts them to some other internal data structure - so it doesn't matter whether the source is a dict or any other Mapping .
AFAICT the only performance concern is in the actual type check I haven't looked at the indexing code to see if the same conclusions hold for |
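A rough way to put a number on the type-check concern is a micro-benchmark like the sketch below. It only compares the two `isinstance` checks in isolation, not pandas itself. Checks against an abstract base class go through `__instancecheck__`, so they are typically slower than a check against the concrete `dict` type, and the absolute numbers depend on the machine and Python version.

```python
import timeit
from collections.abc import Mapping

# Sample input; any plain dict works for this comparison.
data = {"a": [1, 2], "b": [3, 4]}

n = 200_000  # number of repetitions per measurement

# Time the concrete-type check and the abstract-base-class check separately.
t_dict = timeit.timeit(lambda: isinstance(data, dict), number=n)
t_mapping = timeit.timeit(lambda: isinstance(data, Mapping), number=n)

print(f"isinstance(data, dict):    {t_dict / n * 1e9:.0f} ns per call")
print(f"isinstance(data, Mapping): {t_mapping / n * 1e9:.0f} ns per call")
```

Whether the difference matters would depend on how often the check runs relative to the cost of building the DataFrame.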
I made a pull request proposing Mapping support in DataFrame construction: #58814. |
#58814 only fixes the issue of constructing a DataFrame from a Mapping, but using a Mapping can still work unexpectedly for other methods. Take this example from the docs (using #58814's DictWrapper class):
df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]}, index=["falcon", "dog"])
values = {"num_wings": [0, 3]}
my_dict = DictWrapper(values)  # <-- Mapping
print(df.isin(values))  # Correct result
print(df.isin(my_dict))  # Wrong result
A quick search shows 100+ results of |
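Until every such method accepts a general Mapping, the caller-side workaround discussed in this thread is to convert before calling pandas. A minimal sketch follows, assuming a hypothetical helper named `as_plain_dict` and a stand-in `DictWrapper` (the real class lives in #58814 and may differ):

```python
from collections.abc import Mapping


class DictWrapper(Mapping):
    """Hypothetical stand-in for the DictWrapper class from #58814."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def as_plain_dict(obj):
    """Copy any non-dict Mapping into a real dict; return other objects unchanged."""
    if isinstance(obj, Mapping) and not isinstance(obj, dict):
        return dict(obj)
    return obj


values = {"num_wings": [0, 3]}
my_dict = DictWrapper(values)
plain = as_plain_dict(my_dict)
print(type(plain).__name__, plain)
# Intended use (requires pandas): df.isin(as_plain_dict(my_dict))
```

Because the conversion happens at the call site, it works the same way for any method that checks for an exact `dict`.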
@MilesCranmer Regarding This means we cannot support |
I guess the following change, skipping the call to the object if an error occurs, could be acceptable for improving pandas's interoperability with other libraries.
diff --git a/pandas/core/common.py b/pandas/core/common.py
index 9629199122..31442d9f40 100644
--- a/pandas/core/common.py
+++ b/pandas/core/common.py
@@ -385,7 +385,10 @@ def apply_if_callable(maybe_callable, obj, **kwargs):
     **kwargs
     """
     if callable(maybe_callable):
-        return maybe_callable(obj, **kwargs)
+        try:
+            return maybe_callable(obj, **kwargs)
+        except Exception:
+            pass
     return maybe_callable
After passing
|
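To make the effect of that patch concrete, here is a standalone sketch of the patched function, copied from the diff rather than imported from pandas. `CallableKeys` and `broken_selector` are hypothetical names used only for illustration; the first assumes an object that is list-like but also defines `__call__`.

```python
def apply_if_callable(maybe_callable, obj, **kwargs):
    """Patched variant from the diff: fall back to the object itself if calling it fails."""
    if callable(maybe_callable):
        try:
            return maybe_callable(obj, **kwargs)
        except Exception:
            pass
    return maybe_callable


class CallableKeys(list):
    """Hypothetical list-like wrapper that also happens to define __call__."""

    def __call__(self, *args, **kwargs):
        raise TypeError("not meant to be called with a DataFrame")


def broken_selector(df):
    """A genuine user callable that contains a bug."""
    raise KeyError("typo")


keys = CallableKeys(["a", "b"])

# The wrapper is returned unchanged because calling it raised.
print(apply_if_callable(keys, object()) is keys)

# A working callable is still applied as before.
print(apply_if_callable(lambda df: 42, None))

# Side effect: the bug in broken_selector is silently swallowed too.
print(apply_if_callable(broken_selector, None) is broken_selector)
```

The last call shows the trade-off: a broad `except Exception` also hides errors raised by ordinary user callables.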
Definitely -1 on this. The request adds a lot of maintenance burden for a use case that isn't remotely supported. |
I'm -1 on this entirely. The point I wanted to emphasize is that allowing Mapping in the DataFrame constructor does not fully solve this problem, so why bother fixing only some parts? |
I think the easiest approach would just be to convert it to a I would of course not expect you to support Mapping throughout the entire library, my apologies if that is what it came off like.
I think if you simply convert incoming types to a
While I of course don't expect support for my use case, I just want to gently suggest that this sort of change seems like the "right thing to do" at some level. The fact that Julia is running into errors (despite implementing the abstract class correctly according to the Python docs) is more a signal of a general interface mismatch than a request to support one specific library. For example, consider an incoming object that inherits from Python's abstract Mapping class:
import pandas as pd
from collections.abc import Mapping

class MyMapping(Mapping):
    # Minimal Mapping: only the three required abstract methods are defined.
    def __init__(self, **kwargs):
        self.d = dict(**kwargs)

    def __getitem__(self, k):
        return self.d[k]

    def __iter__(self):
        return iter(self.d)

    def __len__(self):
        return len(self.d)

d = MyMapping(a=[1, 2], b=[3, 4])
df = pd.DataFrame(d)
print(df)
This results in:
which is unexpected. The Python docs describe this class as:
So I think it's reasonable to check for this, or at the very least to throw an error instead of producing unexpected behavior like that seen above. Of course, as the maintainers, the decision is up to you. |
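The two options mentioned above (accept any Mapping, or at least raise) can be sketched without pandas. The function names `coerce_mapping` and `reject_non_dict_mapping` are hypothetical and are not part of pandas; they only illustrate where an exact `dict` check and a `Mapping` check diverge.

```python
from collections.abc import Mapping


class MyMapping(Mapping):
    """Same minimal Mapping as in the example above."""

    def __init__(self, **kwargs):
        self.d = dict(**kwargs)

    def __getitem__(self, k):
        return self.d[k]

    def __iter__(self):
        return iter(self.d)

    def __len__(self):
        return len(self.d)


def coerce_mapping(data):
    """Lenient option (hypothetical): turn any Mapping into a plain dict."""
    return dict(data) if isinstance(data, Mapping) else data


def reject_non_dict_mapping(data):
    """Strict option (hypothetical): fail loudly instead of misreading the input."""
    if isinstance(data, Mapping) and not isinstance(data, dict):
        raise TypeError(
            f"{type(data).__name__} is a Mapping but not a dict; convert it with dict(...) first"
        )
    return data


d = MyMapping(a=[1, 2], b=[3, 4])
print(isinstance(d, dict))     # the exact-type check misses it
print(isinstance(d, Mapping))  # the abstract check recognises it
print(coerce_mapping(d))
```

Either option would replace silent misbehavior with something predictable; which one is preferable is the design question debated in this thread.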
Looks like there was also a thread a while ago on this: #6792 (thanks for linking @mroeschke) |
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I wish that DataFrame was more compatible with initializing from more general classes of dictionaries, and with being indexed by more general classes of sequences.
Feature Description
Initialization
This is motivated by attempting to use pandas from Julia via PythonCall.jl, where I found the following strange behavior:
would result in the following dataframe:
As @cjdoris discovered, this is due to the fact that "pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it becomes a juliacall.DictValue in Python)."
The way to fix this would be to have __init__ check for its input being an instance of collections.abc.Mapping instead, which includes dict as well as other mapping types such as juliacall.DictValue.
Indexing
The second incompatibility here is that indexing using lists that are not exactly list does not work as expected. For example, if I try to index a pandas dataframe using a sequence of strings from Julia, I get this error:
As @cjdoris found, this is due to __getitem__ checking for list.
The solution would be to check for the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
Alternative Solutions
Alternative solution 1
There is https://github.com/JuliaData/DataFrames.jl which implements a dataframe type from scratch, but pandas is more actively maintained, and I find it easier to use, so would like to call it from Julia.
Alternative solution 2
The current workaround is to use these operations as follows (thanks to @mrkn)
where we explicitly convert to a Python dict object. However, this is not obvious, and there is no error when you omit pydict; only by checking the contents do you see that it resulted in an unexpected dataframe.
Similarly, for indexing:
However, this is not immediately obvious, and is a bit verbose. My immediate thought is that passing a list-like object to pandas getitem should work for indexing, and passing a dict-like object should work for initialization. Perhaps this is subjective but I thought I'd see what others think.
Checking the more general class of types would fix this, and also be compatible with any other list-like or dict-like inputs.
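One caveat with checking the more general types is that `str` and `bytes` are themselves `Sequence` instances, and so is `tuple`, which pandas uses as a single key in some contexts. A broader check would probably need to exclude them. The sketch below uses hypothetical names: `is_list_like_key` is not a pandas function, and `VectorLike` merely stands in for a foreign sequence wrapper such as juliacall.VectorValue.

```python
from collections.abc import Sequence


class VectorLike(Sequence):
    """Hypothetical stand-in for a foreign sequence wrapper."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, i):
        return self._items[i]

    def __len__(self):
        return len(self._items)


def is_list_like_key(key):
    """Hypothetical check: treat any Sequence as a list of labels, except str, bytes and tuple."""
    return isinstance(key, Sequence) and not isinstance(key, (str, bytes, tuple))


for key in (["a", "b"], VectorLike(["a", "b"]), "ab", ("a", "b")):
    print(type(key).__name__, is_list_like_key(key))
```

Without the exclusions, a plain column name such as "ab" would be interpreted as the two labels "a" and "b".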
Alternative solution 3
As brought up by @cjdoris in JuliaPy/PythonCall.jl#501, one option is to change PythonCall so that it automatically converts a Julia Dict to a Python dict whenever it is passed to a Python object.
However, this would prevent people from mutating Julia dicts from within Python code, so would be a massive change in behavior. And converting Julia Vectors to Python lists automatically would also prevent mutation from Python, so is basically a non-option.
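The mutation concern can be illustrated in pure Python. In this sketch, `ForeignDict` is a hypothetical wrapper standing in for a dictionary owned by another runtime: writes through the wrapper reach the backing store, while writes to a converted `dict` copy do not.

```python
from collections.abc import MutableMapping


class ForeignDict(MutableMapping):
    """Hypothetical wrapper around a dictionary owned by another runtime."""

    def __init__(self, backing):
        self._backing = backing

    def __getitem__(self, key):
        return self._backing[key]

    def __setitem__(self, key, value):
        self._backing[key] = value

    def __delitem__(self, key):
        del self._backing[key]

    def __iter__(self):
        return iter(self._backing)

    def __len__(self):
        return len(self._backing)


backing = {"a": 1}
wrapper = ForeignDict(backing)

wrapper["b"] = 2        # goes through to the backing store
copied = dict(wrapper)  # what automatic conversion would hand to Python code
copied["c"] = 3         # only changes the copy

print(backing)
print(copied)
```

This is why converting automatically at the language boundary would change behavior for code that relies on mutating the original container.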
Additional Context
Discussed in JuliaPy/PythonCall.jl#501