fix: better import exception for numpy #2397

Open
wants to merge 3 commits into
base: devel

Conversation

zilto
Collaborator

@zilto zilto commented Mar 11, 2025

Related Issues

Fixes #2380

Context

When converting arbitrary row data to pyarrow, we try to use pandas to transpose row-wise data into column-wise data. If pandas is unavailable, we fall back to numpy.

As of pyarrow 18, numpy is no longer a dependency of pyarrow. Our pyarrow code assumed that numpy was always available. This is not an error or behavior change introduced by our recent refactoring.
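For readers unfamiliar with the pivot step, here is a minimal sketch of the "pandas first, numpy as fallback" transposition described above. It is not the dlt implementation: the function name `pivot_rows` and its return shape are assumptions made for illustration, and it ignores details such as null handling and type coercion.

```python
from typing import Any, Dict, List, Sequence, Tuple


def pivot_rows(
    rows: Sequence[Tuple[Any, ...]], column_names: Sequence[str]
) -> Dict[str, List[Any]]:
    """Transpose row tuples into a mapping of column name -> column values.

    Hypothetical helper for illustration only, not part of dlt.
    """
    if not rows:
        return {name: [] for name in column_names}

    # Preferred path: let pandas build the columns.
    try:
        import pandas as pd
    except ImportError:
        pd = None
    if pd is not None:
        frame = pd.DataFrame.from_records(list(rows), columns=list(column_names))
        return {name: frame[name].tolist() for name in column_names}

    # Fallback path: numpy object array, transposed so each row is a column.
    try:
        import numpy as np
    except ImportError as exc:
        # With pyarrow >= 18 numpy may be absent, so this branch is reachable.
        raise ImportError(
            "numpy is required to pivot rows when pandas is not installed"
        ) from exc
    pivoted = np.asarray(list(rows), dtype=object).T
    return {name: pivoted[i].tolist() for i, name in enumerate(column_names)}
```

With either library installed, `pivot_rows([(1, "a"), (2, "b")], ["id", "name"])` yields the two columns keyed by name. With neither installed, the final `ImportError` is the failure mode this PR is about.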

Problem

A user reported (#2380) that a pipeline with a SQL source fails with an import error when using pyarrow >= 18 because of the missing numpy dependency.

Note: the reported configuration (pyarrow==18, python==3.12) doesn't match the pyproject.toml constraint pyarrow < 18 for python < 3.13. I would expect the package manager to enforce the constraints and prevent this error.

Solution

Pyarrow is not a dependency of dlt[sql_database], so numpy probably shouldn't be either. Also, apart from this code, only some LanceDB-related functions seem to assume that numpy is available.

The exception handling now raises a message similar to the one raised when pyarrow is missing.

E               dlt.common.exceptions.MissingDependencyException: 
E               You must install additional dependencies to run dlt pyarrow helpers. If you use pip you may do the following:
E               
E               pip install "numpy"
E               
E               Numpy is required for this pyarrow operation

Unfortunately, it's hard to raise earlier than DBApiCursor's arrow-related methods.
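As a rough illustration of the pattern, the sketch below wraps an import and re-raises a missing module as an error with install instructions. `MissingDependencyError` and `require_module` are hypothetical names invented for this example; they only imitate the message shown above and are not dlt's `MissingDependencyException` or its actual signature.

```python
import importlib
from types import ModuleType
from typing import Sequence


class MissingDependencyError(ImportError):
    """Hypothetical stand-in that imitates the message format shown above."""

    def __init__(
        self, caller: str, dependencies: Sequence[str], appendix: str = ""
    ) -> None:
        self.caller = caller
        self.dependencies = list(dependencies)
        installs = "\n".join(f'pip install "{dep}"' for dep in self.dependencies)
        message = (
            f"You must install additional dependencies to run {caller}. "
            f"If you use pip you may do the following:\n\n{installs}\n"
        )
        if appendix:
            message += f"\n{appendix}"
        super().__init__(message)


def require_module(name: str, caller: str, appendix: str = "") -> ModuleType:
    """Import `name`, converting a missing module into an actionable error."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        raise MissingDependencyError(caller, [name], appendix) from exc
```

A call such as `require_module("numpy", "dlt pyarrow helpers", "Numpy is required for this pyarrow operation")` would be placed at the point where numpy is first needed, which matches the limitation noted above: the error surfaces only when the arrow-related code path runs.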


netlify bot commented Mar 11, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 1bed9ef
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67d1a3f3e8cea20008135af8

@sh-rp
Collaborator

sh-rp commented Mar 12, 2025

I think the correct fix might be creating a new arrow extra that has pyarrow and numpy as dependencies and using that extra in all other extra dependencies, but I am not sure. It looks like we only use numpy for converting tuples to arrow tables. Maybe this can be done in a fast way with some arrow interface on arrow 18, so that we only need numpy for arrow versions where it is part of the dependencies.
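One possible shape for a numpy-free conversion, sketched under assumptions: pivot the tuples in plain Python with `zip(*rows)` and hand the resulting columns to `pyarrow.table`, which accepts a mapping of column name to values. The helper names are hypothetical, type inference is left entirely to pyarrow, and whether this is fast enough compared to the numpy path has not been measured here.

```python
from typing import Any, Dict, List, Sequence, Tuple


def columns_from_rows(
    rows: Sequence[Tuple[Any, ...]], column_names: Sequence[str]
) -> Dict[str, List[Any]]:
    """Pivot row tuples into columns without pandas or numpy (hypothetical helper)."""
    if not rows:
        return {name: [] for name in column_names}
    columns = list(zip(*rows))  # zip(*rows) transposes rows into columns
    if len(columns) != len(column_names):
        raise ValueError("row width does not match the number of column names")
    return {name: list(column) for name, column in zip(column_names, columns)}


def rows_to_arrow(rows: Sequence[Tuple[Any, ...]], column_names: Sequence[str]):
    """Build an arrow table from row tuples; pyarrow is imported lazily."""
    import pyarrow as pa

    return pa.table(columns_from_rows(rows, column_names))
```

Only `rows_to_arrow` needs pyarrow, so the pivot itself can be exercised without any optional dependency installed.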

@zilto
Collaborator Author

zilto commented Mar 13, 2025

I think the correct fix might be creating a new arrow extra that has pyarrow and numpy as dependencies and using that extra in all other extra dependencies, but I am not sure.

I thought about this, but the other code paths that use pyarrow do not need numpy. Searching the codebase for np and numpy suggests that only this pivot operation and LanceDB use numpy.

Also, dlt[sql_table] doesn't depend on pyarrow (and probably shouldn't). The numpy exception follows the pyarrow exception that is raised when you try to use sql_table(table_format="pyarrow") without having pyarrow installed.

Development

Successfully merging this pull request may close these issues.

Missing numpy dependency when running with PyArrow >= 18
2 participants