Introduce `TableProvider` wrapper & unified `register_table` API; deprecate `register_table_provider` #1243
Open
kosiew wants to merge 10 commits into apache:main from kosiew:table-provider-1239
+726
−179
docs/tests, add DataFrame view support, and improve Send/concurrency support.

Migrates the codebase from using `Table` to a `TableProvider`-based API, refactors registration and access paths to simplify catalog/context interactions, and updates documentation and examples. DataFrame view handling is improved (`into_view` is now public), the test suite is expanded to cover new registration and async SQL scenarios, and `TableProvider` now supports the `Send` trait across modules for safer concurrency. Minor import cleanup and utility adjustments (including a refined `pyany_to_table_provider`) are included.
DataFrame→TableProvider conversion, plus tests and FFI/pycapsule improvements.

-- Registration logic & API
* Refactor of table provider registration logic for improved clarity and simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces surprising side effects).
* Update table registration method to call `register_table` instead of `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more provider types can be registered uniformly.
* Improve error messages related to registration failures (missing PyCapsule name and DataFrame registration errors).

-- DataFrame ↔ TableProvider conversions
* Introduce utility functions to simplify table provider conversions and centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

-- FFI / PyCapsule handling
* Update `FFI_TableProvider` initialization to accept an optional parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help debugging.

-- DeltaTable & specific integrations
* Update TableProvider registration for `DeltaTable` to use the correct registration method (matches the new API surface).

-- Tests, docs & minor fixes
* Add tests for registering a `TableProvider` from a `DataFrame` and from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility; review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:
* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
utilities, docs, and robustness fixes

* Normalized table-provider handling and simplified registration flow across the codebase; multiple commits centralize provider coercion and normalization.
* Introduced utility helpers (`coerce_table_provider`, `extract_table_provider`, `_normalize_table_provider`) to centralize extraction, error handling, and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer implementations, direct returns of DataFrame views where appropriate, and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for invalid registrations; added handling for `DataFrameWrapper` by converting it to a view.
* Added tests, including a schema registration test using a PyArrow dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with parameter details, added usage examples for `into_view`, and documented deprecations (e.g., `register_table_provider` → `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so warnings point to caller code; improved `__dir__` to return sorted, unique attributes.
* Cleanup: removed unused imports (including an unused error import from `utils.rs`) and other dead code to reduce noise.
…dating method calls
c47b0f1 to ea2973c
ea2973c to 1872a7f
…d avoid documentation duplication
Which issue does this PR close?
Rationale for this change
This change unifies how table-like objects are represented and registered in DataFusion's Python bindings. Historically there were multiple ad-hoc ways to register tables (direct `Table` objects, FFI pycapsules exposed by Rust providers, `DataFrame` views, and the `register_table_provider` API). That fragmentation made the code harder to maintain, made FFI integration awkward, and caused subtle API surface inconsistencies.

This patch introduces a single, high-level `TableProvider` wrapper in Python (backed by a `PyTableProvider` Rust type) and centralizes the logic that coerces various supported inputs into a concrete provider. It also:
* Makes `SessionContext.register_table(...)` the single, preferred entrypoint for table registration.
* Deprecates `SessionContext.register_table_provider(...)` in favor of `register_table` while preserving backward compatibility (it forwards to `register_table` and emits a `DeprecationWarning`).
* Coerces supported inputs (`Table`, the new `TableProvider` wrapper, PyCapsule-based foreign providers, and PyArrow datasets) into the expected Rust `TableProvider` implementation.

Overall this reduces duplication, clarifies documentation and examples, and provides a clearer path for FFI authors to expose table providers to Python.
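
The deprecate-and-forward behavior described above can be illustrated with a minimal, standard-library-only sketch. `MiniContext` is a hypothetical stand-in for `SessionContext`, and the warning text is a placeholder; neither is taken from the datafusion code. It only shows the pattern: the deprecated method warns with `stacklevel=2` so the warning is attributed to the caller, then delegates to `register_table`.

```python
import warnings


class MiniContext:
    """Hypothetical stand-in for SessionContext (not the real datafusion class)."""

    def __init__(self):
        self.tables = {}

    def register_table(self, name, table):
        # Single preferred entrypoint: every registration ends up here.
        self.tables[name] = table

    def register_table_provider(self, name, provider):
        # Deprecated alias: warn, then forward to register_table.
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn(
            "register_table_provider is deprecated; use register_table instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.register_table(name, provider)


ctx = MiniContext()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    ctx.register_table_provider("t", object())

print(caught[0].category.__name__)  # prints: DeprecationWarning
print("t" in ctx.tables)            # prints: True
```

Existing callers keep working under this pattern, and the warning tells them which call site to migrate.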
What changes are included in this PR?
High-level summary
* `datafusion.TableProvider` wrapper (python/datafusion/table_provider.py).
* `PyTableProvider` type and module (src/table.rs) exposing from-capsule/from-dataframe helpers and `__datafusion_table_provider__`.
* `coerce_table_provider` and `table_provider_from_pycapsule` (src/utils.rs).
* `datafusion.utils._normalize_table_provider` (python/datafusion/utils.py).
* `SessionContext.register_table(...)` now accepts `Table | TableProvider | objects exporting __datafusion_table_provider__` (Python + Rust).
* `register_table_provider(...)` and `TableProvider.from_view()` (Python + Rust) deprecated with warnings, while preserving behavior by delegating to the new API where appropriate.
* `DataFrame.into_view()` returns a `TableProvider` (Python), and the Rust `into_view` returns `PyTableProvider`.
* `EXPECTED_PROVIDER_MSG` to give clearer errors when users pass unsupported objects.
* … `TableProvider` + `register_table`.
* … `SessionContext.register_table_provider`.

Files added
* `python/datafusion/table_provider.py`: high-level Python wrapper around the internal table provider.
* `python/datafusion/utils.py`: helper `_normalize_table_provider` and pyarrow dataset handling.
* `src/table.rs`: `PyTableProvider` Rust implementation.

Files modified (representative, not exhaustive)
* `__init__.py`, `catalog.py`, `context.py`, `dataframe.py`, `io/table_provider.rst`, `data-sources.rst`, examples and tests under `examples/` and `python/tests/`.
* `src/utils.rs`, `src/catalog.rs`, `src/context.rs`, `src/dataframe.rs`, `src/udtf.rs`, `src/lib.rs`, and other modules adjusted to use the new table provider helpers.

Behavioral changes
* `SessionContext.register_table(name, table)` now accepts:
  * `datafusion.catalog.Table` (existing behavior preserved),
  * `datafusion.TableProvider` (new wrapper),
  * objects exporting `__datafusion_table_provider__()` (pycapsule-based FFI providers),
  * `pyarrow.dataset.Dataset` instances.
* `SessionContext.register_table_provider(...)` is deprecated and will warn; it forwards to `register_table` for backwards compatibility.
* `TableProvider.from_view()` is deprecated in favor of `DataFrame.into_view()` and `TableProvider.from_dataframe()`; calling the deprecated method emits a `DeprecationWarning`.
* `DataFrame.into_view()` now returns a `TableProvider` wrapper rather than the older internal table representation exposed directly to Python.

A common, clearer error message (`EXPECTED_PROVIDER_MSG`) is provided and exported for tests and user-facing errors.

Are these changes tested?
Yes. The PR includes unit and integration test updates and additions in `python/tests/` to cover:
* `TableProvider` created via `from_capsule`, `from_dataframe`, and via `DataFrame.into_view()`.
* `Dataset` objects via `Schema.register_table` and `SessionContext.register_table`.
* `DataFrame` objects raise a clear `TypeError` when passed directly to `register_table` (guiding users to `into_view()` / `from_dataframe()`).
* `DeprecationWarning` behavior for `from_view` and `register_table_provider`.

If any tests still need to be added, they should exercise cross-language FFI flows (Rust-provided pycapsule -> Python `TableProvider.from_capsule` -> `register_table`).

Are there any user-facing changes?
Yes.
API additions / changes
* `datafusion.TableProvider` (Python).
* `DataFrame.into_view()` returns a `TableProvider` (Python).
* `SessionContext.register_table(name, table)` accepts broader inputs and is the canonical registration API.
* `SessionContext.register_table_provider` is deprecated (will emit `DeprecationWarning` and forward to `register_table`).
* `TableProvider.from_view()` is deprecated in favor of `DataFrame.into_view()` and `TableProvider.from_dataframe()`.
* `datafusion._internal.EXPECTED_PROVIDER_MSG` (re-exported as `datafusion.EXPECTED_PROVIDER_MSG`) provides a stable error message for consumers and tests.

Documentation
* Documentation and examples updated for `TableProvider` and `register_table` usage patterns.

Compatibility
* Code using `register_table_provider()` will continue to work but will receive a deprecation warning.
* Code passing `DataFrame` objects directly to `register_table` will now get a clear error directing it to `into_view()` / `from_dataframe()`.

Breaking changes
* Deprecated APIs keep working and emit `DeprecationWarning`s. However, code that relied on internal implementation details of the old table provider representation (rather than the stable public APIs) may require updates.

Notes for reviewers
* Coercion logic (`coerce_table_provider` / `_normalize_table_provider`): does it accept the right set of inputs and provide clear errors? Are there additional types we should accept?
* … (`TableProvider` + `register_table`).
* Whether the `EXPECTED_PROVIDER_MSG` wording is acceptable and stable for users and tests.