@kosiew kosiew commented Sep 18, 2025

Which issue does this PR close?


Rationale for this change

This change unifies how table-like objects are represented and registered in DataFusion's Python bindings. Historically there were multiple ad-hoc ways to register tables (direct Table objects, FFI pycapsules exposed by Rust providers, DataFrame views, and the register_table_provider API). That fragmentation made the code harder to maintain, made FFI integration awkward, and caused subtle API surface inconsistencies.

This patch introduces a single, high-level TableProvider wrapper in Python (backed by a PyTableProvider Rust type) and centralizes the logic that coerces various supported inputs into a concrete provider. It also:

  • Makes SessionContext.register_table(...) the single, preferred entrypoint for table registration.
  • Deprecates SessionContext.register_table_provider(...) in favor of register_table while preserving backward compatibility (it forwards to register_table and emits a DeprecationWarning).
  • Adds utilities to normalize/coerce supported inputs (native Table, the new TableProvider wrapper, PyCapsule-based foreign providers, and PyArrow datasets) into the expected Rust TableProvider implementation.

Overall this reduces duplication, clarifies documentation and examples, and provides a clearer path for FFI authors to expose table providers to Python.


What changes are included in this PR?

High-level summary

  • New Python public API: datafusion.TableProvider wrapper (python/datafusion/table_provider.py)
  • New Rust PyTableProvider type and module (src/table.rs) exposing from_capsule and from_dataframe helpers and __datafusion_table_provider__
  • Centralized coercion helpers on the Rust side: coerce_table_provider and table_provider_from_pycapsule (src/utils.rs)
  • New Python helper utilities: datafusion.utils._normalize_table_provider (python/datafusion/utils.py)
  • Update SessionContext.register_table(...) to accept Table | TableProvider | objects exporting __datafusion_table_provider__ (Python + Rust)
  • Deprecate register_table_provider(...) and TableProvider.from_view() (Python + Rust) with warnings, while preserving behavior by delegating to new API where appropriate.
  • Make DataFrame.into_view() return a TableProvider (Python) and return PyTableProvider from Rust into_view.
  • Export a helpful error message constant EXPECTED_PROVIDER_MSG to give clearer errors when users pass unsupported objects.
  • Update docs and user-guide examples to use TableProvider + register_table.
  • Add/modify tests to cover the new APIs and coercion rules.
  • Changelog entry documenting the deprecation of SessionContext.register_table_provider.

Files added

  • python/datafusion/table_provider.py — high-level Python wrapper around the internal table provider.
  • python/datafusion/utils.py — helper _normalize_table_provider and pyarrow dataset handling.
  • src/table.rs — PyTableProvider Rust implementation.

Files modified (representative, not exhaustive)

  • Python: __init__.py, catalog.py, context.py, dataframe.py, io/table_provider.rst, data-sources.rst, examples and tests under examples/ and python/tests/.
  • Rust: src/utils.rs, src/catalog.rs, src/context.rs, src/dataframe.rs, src/udtf.rs, src/lib.rs, and other modules adjusted to use the new table provider helpers.

Behavioral changes

  • SessionContext.register_table(name, table) now accepts:

    • datafusion.catalog.Table (existing behavior preserved),
    • datafusion.TableProvider (new wrapper),
    • Objects exporting __datafusion_table_provider__() (pycapsule-based FFI providers),
    • pyarrow.dataset.Dataset instances.
  • SessionContext.register_table_provider(...) is deprecated and will warn; it forwards to register_table for backwards compatibility.

  • TableProvider.from_view() is deprecated in favor of DataFrame.into_view() and TableProvider.from_dataframe(); calling the deprecated method emits a DeprecationWarning.

  • DataFrame.into_view() now returns a TableProvider wrapper rather than the older internal table representation exposed directly to Python.

  • A common, clearer error message (EXPECTED_PROVIDER_MSG) is provided and exported for tests and user-facing errors.
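
The coercion rules listed above can be sketched in plain Python. This is only an illustration under stated assumptions: `Table`, `TableProvider`, and `DataFrame` below are stand-in classes rather than the real datafusion types, `normalize_table_provider` is a hypothetical name, the wording of `EXPECTED_PROVIDER_MSG` is invented for the example, and the PyArrow dataset branch is left out to keep the sketch dependency-free.

```python
"""Stand-in sketch of provider coercion; none of these are the real datafusion types."""

# Assumed wording. The real text lives in the exported EXPECTED_PROVIDER_MSG constant.
EXPECTED_PROVIDER_MSG = (
    "Expected a Table, a TableProvider, or an object exporting "
    "__datafusion_table_provider__()"
)


class Table:
    """Stand-in for datafusion.catalog.Table."""


class TableProvider:
    """Stand-in for the new datafusion.TableProvider wrapper."""

    def __init__(self, source):
        self.source = source


class DataFrame:
    """Stand-in for datafusion.DataFrame."""

    def into_view(self):
        # The description says into_view() returns a TableProvider.
        return TableProvider(self)


def normalize_table_provider(obj):
    """Return something registrable, or raise TypeError with a clear message."""
    if isinstance(obj, DataFrame):
        # DataFrames are rejected so that users convert them explicitly.
        raise TypeError(
            "DataFrame cannot be registered directly; "
            "use DataFrame.into_view() or TableProvider.from_dataframe()"
        )
    if isinstance(obj, (Table, TableProvider)):
        return obj
    if hasattr(obj, "__datafusion_table_provider__"):
        # Foreign (PyCapsule-exporting) providers get wrapped.
        return TableProvider(obj)
    raise TypeError(EXPECTED_PROVIDER_MSG)
```

Checking the DataFrame case first means a DataFrame gets the targeted hint rather than the generic message, which matches the error behavior described in this PR.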


Are these changes tested?

Yes — the PR includes unit and integration test updates and additions in python/tests/ to cover:

  • Registering a table from a TableProvider created via from_capsule, from_dataframe, and via DataFrame.into_view().
  • Registering PyArrow Dataset objects via Schema.register_table and SessionContext.register_table.
  • Ensuring DataFrame objects raise a clear TypeError when passed directly to register_table (guiding users to into_view() / from_dataframe()).
  • Tests asserting proper DeprecationWarning behavior for from_view and register_table_provider.

If any tests still need to be added, they should exercise cross-language FFI flows (Rust-provided pycapsule -> Python TableProvider.from_capsule -> register_table).


Are there any user-facing changes?

Yes.

API additions / changes

  • New public API: datafusion.TableProvider (Python).
  • DataFrame.into_view() returns a TableProvider (Python).
  • SessionContext.register_table(name, table) accepts broader inputs and is the canonical registration API.
  • SessionContext.register_table_provider is deprecated (will emit DeprecationWarning and forward to register_table).
  • TableProvider.from_view() is deprecated in favor of DataFrame.into_view() and TableProvider.from_dataframe().
  • A new exported constant datafusion._internal.EXPECTED_PROVIDER_MSG (re-exported as datafusion.EXPECTED_PROVIDER_MSG) provides a stable error message for consumers and tests.

Documentation

  • User guide snippets and examples updated to show the new TableProvider and register_table usage patterns.
  • A changelog deprecation entry has been added.

Compatibility

  • Backwards compatibility is preserved where feasible: existing code that calls register_table_provider() will continue to work but will receive a deprecation warning.
  • Users passing DataFrame objects directly to register_table will now get a clear error directing them to into_view()/from_dataframe().
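
The deprecate-and-forward pattern described above looks roughly like the following. `SessionContextSketch` is a hypothetical stand-in that stores tables in a dict; it is not the real `SessionContext`, and the warning text is an assumption.

```python
import warnings


class SessionContextSketch:
    """Hypothetical stand-in for SessionContext; stores registrations in a dict."""

    def __init__(self):
        self.tables = {}

    def register_table(self, name, table):
        # Preferred entrypoint.
        self.tables[name] = table

    def register_table_provider(self, name, provider):
        # Deprecated alias: warn, then forward so existing callers keep working.
        warnings.warn(
            "register_table_provider is deprecated; use register_table instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.register_table(name, provider)
```

Existing callers keep their behavior; the only visible difference is the `DeprecationWarning`.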

Breaking changes

  • This PR is designed to be minimally breaking. It intentionally deprecates rather than removes prior APIs and issues DeprecationWarnings. However, code that relied on internal implementation details of the old table provider representation (rather than the stable public APIs) may require updates.

Notes for reviewers

  • Focus on the coercion logic (coerce_table_provider / _normalize_table_provider): does it accept the right set of inputs and provide clear errors? Are there additional types we should accept?
  • Verify deprecation warning messaging and stacklevels to ensure they point at user code rather than library internals.
  • Confirm the documentation examples and user-guide reflect the recommended patterns (using TableProvider + register_table).
  • Ensure the exported EXPECTED_PROVIDER_MSG wording is acceptable and stable for users and tests.
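
For the stacklevel check, one way to confirm that a warning is attributed to the caller rather than to library internals is to compare the recorded line number against the call site. The function names here are hypothetical and only illustrate the check.

```python
import inspect
import warnings


def old_api():
    # stacklevel=2 attributes the warning to whoever called old_api().
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)


def user_code():
    # Record the line number of the old_api() call on the next line.
    call_line = inspect.currentframe().f_lineno + 1
    old_api()
    return call_line


def warning_points_at_caller():
    """True when the single recorded warning carries the caller's line number."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        call_line = user_code()
    return len(caught) == 1 and caught[0].lineno == call_line
```

With `stacklevel=1` the recorded line would be the `warnings.warn` call inside `old_api`, so the same check would fail.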

docs/tests, add DataFrame view support, and improve Send/concurrency
support.

Migrates the codebase from using `Table` to a
`TableProvider`-based API, refactors registration and access paths to
simplify catalog/context interactions, and updates documentation and
examples. DataFrame view handling is improved (`into_view` is now
public), the test-suite is expanded to cover new registration and async
SQL scenarios, and `TableProvider` now supports the `Send` trait across
modules for safer concurrency. Minor import cleanup and utility
adjustments (including a refined `pyany_to_table_provider`) are
included.

DataFrame→TableProvider conversion, plus tests and FFI/pycapsule
improvements.

-- Registration logic & API

* Refactor of table provider registration logic for improved clarity and
  simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces
  surprising side effects).
* Update table registration method to call `register_table` instead of
  `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more
  provider types can be registered uniformly.
* Improve error messages related to registration failures (missing
  PyCapsule name and DataFrame registration errors).
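
A structural check for exportable providers can be sketched with `typing.Protocol`. Whether the real `TableProviderExportable` is declared `runtime_checkable` is an assumption made for this example, and `FakeForeignProvider` and `is_exportable` are hypothetical names.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TableProviderExportable(Protocol):
    """Anything that can export a table provider capsule."""

    def __datafusion_table_provider__(self) -> Any: ...


class FakeForeignProvider:
    """Hypothetical provider; a real one would return a PyCapsule."""

    def __datafusion_table_provider__(self) -> Any:
        return object()


def is_exportable(obj: Any) -> bool:
    # runtime_checkable protocols only test that the method exists,
    # not its signature or return type.
    return isinstance(obj, TableProviderExportable)
```

Because the check is structural, providers written in other packages do not need to import or subclass anything from datafusion to be accepted.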

-- DataFrame ↔ TableProvider conversions

* Introduce utility functions to simplify table provider conversions and
  centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update
  `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

-- FFI / PyCapsule handling

* Update `FFI_TableProvider` initialization to accept an optional
  parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize
  pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help
  debugging.

-- DeltaTable & specific integrations

* Update TableProvider registration for `DeltaTable` to use the correct
  registration method (matches the new API surface).

-- Tests, docs & minor fixes

* Add tests for registering a `TableProvider` from a `DataFrame` and
  from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the
  error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around
  `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility — review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:

* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.

utilities, docs, and robustness fixes

* Normalized table-provider handling and simplified registration flow
  across the codebase; multiple commits centralize provider coercion and
  normalization.
* Introduced utility helpers (`coerce_table_provider`,
  `extract_table_provider`, `_normalize_table_provider`) to centralize
  extraction, error handling, and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer
  implementations, direct returns of DataFrame views where appropriate,
  and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for
  invalid registrations; added handling for `DataFrameWrapper` by
  converting it to a view.
* Added tests, including a schema registration test using a PyArrow
  dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with
  parameter details, added usage examples for `into_view`, and
  documented deprecations (e.g., `register_table_provider` →
  `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so
  warnings point to caller code; improved `__dir__` to return sorted,
  unique attributes.
* Cleanup: removed unused imports (including an unused error import from
  `utils.rs`) and other dead code to reduce noise.
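
As a rough illustration of the `__dir__` change mentioned above, a delegating wrapper can merge its own attributes with those of the wrapped object and return them sorted and de-duplicated. `Inner` and `Wrapper` are hypothetical classes, not the real `TableProvider` implementation.

```python
class Inner:
    """Hypothetical wrapped object."""

    def scan(self):
        return "scan"


class Wrapper:
    """Hypothetical wrapper that forwards unknown attributes to the wrapped object."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Only called when normal lookup fails; delegate to the wrapped object.
        return getattr(self._inner, name)

    def __dir__(self):
        # Merge both attribute sets so each name appears once, in sorted order.
        own = set(super().__dir__())
        return sorted(own | set(dir(self._inner)))
```

This keeps tab completion useful on the wrapper, since forwarded names such as `scan` show up alongside the wrapper's own attributes.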
@kosiew kosiew force-pushed the table-provider-1239 branch from c47b0f1 to ea2973c Compare September 18, 2025 09:47
@kosiew kosiew force-pushed the table-provider-1239 branch from ea2973c to 1872a7f Compare September 18, 2025 09:51
@kosiew kosiew marked this pull request as ready for review September 20, 2025 06:17

Successfully merging this pull request may close these issues.

A single common PyTableProvider that can be created either via a pycapsule or into_view