fix: better import exception for numpy #2397

zilto · 2025-03-11T19:59:46Z

Related Issues

Context

When converting arbitrary row data to pyarrow, we try to use pandas to transpose row-wise data to column-wise data. If pandas is unavailable, we fall back to numpy.
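For readers unfamiliar with the pivot, here is a minimal pure-Python sketch of what "row-wise to column-wise" means. It is not dlt's implementation (which relies on pandas or numpy as described above); the function name `rows_to_columns` and the sample rows are made up for illustration.

```python
from typing import Any, List, Sequence, Tuple


def rows_to_columns(rows: Sequence[Tuple[Any, ...]], n_columns: int) -> List[List[Any]]:
    """Transpose row-wise tuples into one list per column (illustrative only)."""
    if not rows:
        # zip(*[]) yields nothing, so build the empty columns explicitly
        return [[] for _ in range(n_columns)]
    # zip(*rows) groups the i-th value of every row together
    return [list(column) for column in zip(*rows)]


# hypothetical rows as a DB-API cursor might return them
rows = [(1, "alice", 3.5), (2, "bob", None)]
columns = rows_to_columns(rows, n_columns=3)
print(columns)  # [[1, 2], ['alice', 'bob'], [3.5, None]]
```

The pandas and numpy paths do the same reshaping, only on typed arrays instead of Python lists.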

As of pyarrow >= 18, numpy is no longer a dependency of pyarrow, but our pyarrow code assumed that numpy was available. This is not an error or change related to our recent refactoring.

Problem

A user reported (#2380) that a pipeline with a SQL source fails with an import error when using pyarrow >= 18 because of the missing numpy dependency.

Note: the reported config (pyarrow==18, python==3.12) doesn't match the pyproject.toml constraint pyarrow < 18 for python < 3.13. I would expect the package manager to enforce the constraints and prevent this error.

Solution

Pyarrow is not a dependency of dlt[sql_database], so numpy probably shouldn't be either. Also, apart from this code, only some LanceDB-related functions seem to assume that numpy is available.

The exception handling now raises an error with a message similar to the one shown when pyarrow is missing:

E               dlt.common.exceptions.MissingDependencyException: 
E               You must install additional dependencies to run dlt pyarrow helpers. If you use pip you may do the following:
E               
E               pip install "numpy"
E               
E               Numpy is required for this pyarrow operation

Unfortunately, it's hard to raise earlier than DBApiCursor's arrow-related methods.
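The pattern behind the message above can be sketched as a lazy import that converts `ImportError` into an install hint. This is an assumption-laden illustration, not the code in this PR: `MissingDependencyError` is a hypothetical stand-in for dlt's `MissingDependencyException`, and `require_module` is a made-up helper name.

```python
import importlib
from types import ModuleType
from typing import Sequence


class MissingDependencyError(Exception):
    """Hypothetical stand-in for dlt's MissingDependencyException."""

    def __init__(self, caller: str, dependencies: Sequence[str], appendix: str = "") -> None:
        self.caller = caller
        self.dependencies = list(dependencies)
        pip_lines = "\n".join(f'pip install "{dep}"' for dep in self.dependencies)
        message = (
            f"You must install additional dependencies to run {caller}. "
            f"If you use pip you may do the following:\n\n{pip_lines}\n"
        )
        if appendix:
            message += "\n" + appendix
        super().__init__(message)


def require_module(module_name: str, caller: str, appendix: str = "") -> ModuleType:
    """Import a module lazily, converting ImportError into an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # keep the original error as __cause__ for debugging
        raise MissingDependencyError(caller, [module_name], appendix) from exc


try:
    np = require_module(
        "numpy", "dlt pyarrow helpers", "Numpy is required for this pyarrow operation"
    )
except MissingDependencyError as err:
    print(err)
```

Because the import happens inside the call, the error can only surface when the arrow-related code path actually runs, which matches the limitation noted above.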

netlify · 2025-03-11T20:00:01Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`1bed9ef`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/67d1a3f3e8cea20008135af8

sh-rp · 2025-03-12T16:34:40Z

I think the correct fix might be creating a new arrow extra that has pyarrow and numpy as dependencies and using that extra in all other extra dependencies, but I am not sure. It looks like we only use numpy for converting tuples to arrow tables. Maybe this can be done in a fast way with some arrow interface on arrow 18, so that we only need numpy for arrow versions where it is part of the dependencies.
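One possible shape of the numpy-free conversion suggested here, as a rough sketch only. It assumes pyarrow's type inference is acceptable for the column values, it has not been benchmarked against the numpy path, and `rows_to_arrow_table` is a hypothetical name, not a dlt function.

```python
from typing import Any, List, Sequence, Tuple

try:
    import pyarrow as pa
except ImportError:  # keep the sketch importable without pyarrow
    pa = None


def rows_to_arrow_table(rows: Sequence[Tuple[Any, ...]], column_names: List[str]) -> "pa.Table":
    """Build an arrow table from row tuples using only pyarrow and the stdlib."""
    if pa is None:
        raise ImportError("pyarrow is required for rows_to_arrow_table")
    # transpose in pure Python; fall back to empty columns when there are no rows
    columns = [list(col) for col in zip(*rows)] or [[] for _ in column_names]
    # pa.array infers the arrow type of each column from the Python values
    arrays = [pa.array(col) for col in columns]
    return pa.Table.from_arrays(arrays, names=column_names)
```

Whether this is "fast" in the sense asked above is an open question: the transpose runs in Python, so it may well be slower than the numpy route on large result sets.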

zilto · 2025-03-13T12:59:26Z

> I think the correct fix might be creating a new arrow extra that has pyarrow and numpy as dependencies and using that extra in all other extra dependencies, but I am not sure.

I thought about this, but the other code paths that use pyarrow do not need numpy. Searching for np and numpy in the codebase, it seems that only this pivot operation and LanceDB use numpy.

Also, dlt[sql_database] doesn't depend on pyarrow (and probably shouldn't). The numpy exception follows the pyarrow exception raised when you try to use sql_table(table_format="pyarrow") but you don't have pyarrow installed.

better import exception for numpy

bf73875

zilto added 2 commits March 11, 2025 16:00

format

ca6d1c4

fix import order

1bed9ef
