Pandas compatibility #501

Open
2 tasks
MilesCranmer opened this issue May 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@MilesCranmer
Contributor

MilesCranmer commented May 18, 2024

Affects: PythonCall

Describe the bug

I have been trying to use pandas from PythonCall.jl and wanted to document a few calls that do not translate directly to Julia. This might just mean we need a PythonPandas package to translate calls, but I wonder if there are any missing methods that could be implemented to fix things automatically.

First, the preamble for this:

using PythonCall

pd = pyimport("pandas")

1. Constructing pandas.DataFrame:

Using a similar syntax to Python:

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

which results in the following dataframe:

julia> df
Python:
   0
0  b
1  a

i.e., it seems to have a single column named "0" and rows for a and b.

If I instead write this as a vector of pairs, I get:

julia> pd.DataFrame([
           "a" => [1, 2, 3],
           "b" => [4, 5, 6]
       ])
Python:
   0          1
0  a  [1, 2, 3]
1  b  [4, 5, 6]

I suppose this one makes sense.

I was able to get it working with the following syntax instead:

julia> df = pd.DataFrame([
            1   4
            2   5
            3   6
       ], columns=["a", "b"])
Python:
   a  b
0  1  4
1  2  5
2  3  6

2. Selecting multiple columns:

So, selecting a single column works:

julia> df["a"]
Python:
0    1
1    2
2    3
Name: a, dtype: int64

but selecting multiple columns does not:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

I got around this by inserting a pylist call:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6
@MilesCranmer MilesCranmer added the bug Something isn't working label May 18, 2024
@mrkn

mrkn commented May 21, 2024

As you can see in the documentation, AbstractArray and AbstractDict are implicitly converted to wrapper objects when passed to a Python call.

In the first case, you should use the pydict function to convert a Julia Dict to a Python dict.

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

As in the first case, an explicit call to the pylist function is required in the second case.

@MilesCranmer
Contributor Author

Thanks, that makes sense! I didn’t see pydict.

So should this be closed or is there anything that can be done automatically?

@cjdoris
Collaborator

cjdoris commented May 21, 2024

The issue is that pandas.DataFrame.__init__ explicitly checks whether its argument is a dict, and Py(::Dict) is not a dict (it's a juliacall.DictValue). The two options to make this work automatically are:

  • Change the PythonCall conversion rules to convert Julia Dict to Python dict. I'm not inclined to change this.
  • Change pandas.DataFrame.__init__ to check if the argument is a collections.abc.Mapping instead, which includes both dict and juliacall.DictValue.
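
The difference between the two checks can be sketched in plain Python. This is only an illustration: `WrappedDict` is a hypothetical stand-in for a wrapper type such as juliacall.DictValue, not the real class, and nothing here reproduces pandas' actual constructor logic.

```python
from collections.abc import Mapping


class WrappedDict(Mapping):
    """Hypothetical stand-in for a dict-like wrapper (NOT the real juliacall class)."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


wrapped = WrappedDict({"a": [1, 2, 3], "b": [4, 5, 6]})

# A strict dict check rejects the wrapper...
is_dict = isinstance(wrapped, dict)
# ...while the abstract Mapping check accepts both the wrapper and a real dict.
is_mapping = isinstance(wrapped, Mapping)

# Code that falls back to treating the object as a generic iterable only
# sees its keys, since iterating a mapping yields keys.
keys_only = list(wrapped)

print(is_dict, is_mapping, keys_only)
```

Seeing only the keys when iterating would be consistent with the single-column frame in the original report, although that is an inference from the output rather than a statement about pandas internals.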

@cjdoris
Collaborator

cjdoris commented May 21, 2024

I think requiring pylist to do the indexing is a similar issue: pandas checks for list rather than the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
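
The same distinction applies to sequences. In the sketch below, `WrappedVector` is a hypothetical stand-in for a wrapper such as juliacall.VectorValue; it is not the real class, and the checks are illustrative rather than a copy of what pandas does.

```python
from collections.abc import Sequence


class WrappedVector(Sequence):
    """Hypothetical stand-in for a list-like wrapper (NOT the real juliacall class)."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


cols = WrappedVector(["a", "b"])

# A strict list check rejects the wrapper; the abstract Sequence check accepts it.
is_list = isinstance(cols, list)
is_sequence = isinstance(cols, Sequence)

# Copying into a real list is the Python-side analogue of the pylist(...)
# workaround shown earlier in the thread.
as_list = list(cols)

print(is_list, is_sequence, as_list)
```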

@MilesCranmer
Contributor Author

I think the solutions on the pandas side sound like better options to me. I'm not sure if there are edge cases that prevent the checks from being more general... Like maybe some collections.abc.Sequence acting as a single key?
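
One such edge case can be shown with the standard library alone: str and tuple are both registered as collections.abc.Sequence, yet a string is normally a single column label, and pandas generally treats a tuple as a single label too (for example with MultiIndex columns). The helper `looks_like_key_list` below is hypothetical and only sketches what a broader check would have to exclude; it is not how pandas decides.

```python
from collections.abc import Sequence


def looks_like_key_list(key):
    """Hypothetical check: a Sequence that should be read as several keys.

    str, bytes and tuple are Sequences too, but are excluded here because
    they usually act as one key rather than a collection of keys.
    """
    return isinstance(key, Sequence) and not isinstance(key, (str, bytes, tuple))


# All three are Sequences, so a bare Sequence check cannot tell them apart.
all_sequences = all(isinstance(k, Sequence) for k in ("ab", ("a", "b"), ["a", "b"]))

results = {
    "str": looks_like_key_list("ab"),
    "tuple": looks_like_key_list(("a", "b")),
    "list": looks_like_key_list(["a", "b"]),
}
print(all_sequences, results)
```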

@MilesCranmer
Contributor Author

cross-posted here: pandas-dev/pandas#58803
