Remove pyarrow as required dependency, relying on Arrow PyCapsule Interface #1227

@kylebarron

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

PyArrow is a massive dependency. Unpacked, it tends to be over 100 MB, and until the latest versions (I think?) it also required numpy as a non-optional dependency.

It's also, in effect, the only current dependency:

dependencies = ["pyarrow>=11.0.0", "typing-extensions;python_version<'3.13'"]

It would be great if we could remove it; doing so would greatly reduce the minimal environment size for datafusion python.

Many other Python Arrow libraries implement the PyCapsule Interface, so users could pick nanoarrow, arro3, Polars, DuckDB, etc., or pyarrow, whichever works best for them.

Describe the solution you'd like

The Arrow PyCapsule Interface is a lightweight, decentralized protocol for sharing Arrow data between Python libraries. We already implement the PyCapsule Interface, so it's just a matter of removing places where we hard-code use of pyarrow.
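To illustrate what "not hard-coding pyarrow" could look like, here is a minimal sketch of accepting input by protocol instead of by `isinstance` checks against pyarrow types. The dunder method names come from the Arrow PyCapsule Interface; the helper `arrow_export_kind` and the `FakeTable` class are hypothetical names used only for illustration, not existing datafusion-python code.

```python
from typing import Any

# Dunder names defined by the Arrow PyCapsule Interface.
_STREAM = "__arrow_c_stream__"
_ARRAY = "__arrow_c_array__"
_SCHEMA = "__arrow_c_schema__"


def arrow_export_kind(obj: Any) -> str:
    """Hypothetical helper: report which PyCapsule export an object offers.

    Duck typing on the protocol methods means no Arrow library has to be
    imported just to decide whether an input is acceptable.
    """
    if hasattr(obj, _STREAM):
        return "stream"
    if hasattr(obj, _ARRAY):
        return "array"
    if hasattr(obj, _SCHEMA):
        return "schema"
    raise TypeError(
        f"{type(obj).__name__} does not implement the Arrow PyCapsule Interface"
    )


class FakeTable:
    """Stand-in for a table from any Arrow library (assumption for the demo).

    A real producer would return a PyCapsule wrapping an ArrowArrayStream.
    """

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(arrow_export_kind(FakeTable()))
```

With a check like this, a table from pyarrow, nanoarrow, arro3, or Polars would be handled the same way, provided it exposes one of the protocol methods.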

Describe alternatives you've considered

Keep pyarrow dependency.

Additional context
