How to make a future dataframe API available? #79
This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.
A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?
Options include:

- a separate namespace (like `.array_api` in NumPy/CuPy, with the namespace retrieved via `__array_namespace__`),
- a context manager,
- an environment variable (like NumPy used for `__array_function__` and more recently for dtype casting rules changes),
- a `from __future__ import new_behavior` type import (i.e., new features on a per-module basis).

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little more tricky of course: it can be done based on an environment variable set at import time, but it's more annoying with a context manager.
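For illustration, the import-time approach could look roughly like this (a minimal sketch; the variable name `DATAFRAME_API_STANDARD` and the method names are hypothetical):

```python
import os

# Opt-in flag, read once at import time (hypothetical variable name).
_USE_STANDARD_API = os.environ.get("DATAFRAME_API_STANDARD", "0") == "1"

class DataFrame:
    def _unique_standard(self):
        """Standard-compliant implementation, hidden by default."""
        ...

if _USE_STANDARD_API:
    # Unhide the standardized method under its public name.
    DataFrame.unique = DataFrame._unique_standard
```

A context manager, by contrast, would have to add and remove methods dynamically on enter/exit, which is what makes it the more awkward fit for new API surface.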
For behavior changes it's kind of the opposite: likely not all code will work with the new behavior, so granular control helps, and a context manager is probably better.
Experiences with a separate namespace for the array API standard
The short summary of this is: the plan is to make the main `numpy` namespace itself converge to the array API standard. This takes time because of backwards compatibility constraints, but will avoid the "double namespaces" problem and have multiple other benefits, for example solving long-standing issues that Numba, CuPy etc. are running into.

Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.
Using a context manager
Pandas already has a context manager, namely `pandas.option_context`. This is used for existing options, see `pd.describe_option()`. While most options are related to display, styling and I/O, some of the behaviors that can be controlled are quite large and similar in style to what we'd expect to see in a dataframe API standard. Examples:

- `mode.chained_assignment` (raise, warn, or ignore)
- `mode.data_manager` (`"block"` or `"array"`)
- `mode.use_inf_as_null` (bool)

It could be used similarly to the currently available options, with one option per feature:
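For example (a sketch; `mode.future.standard_setitem` is a hypothetical option name, not an existing pandas option):

```python
import pandas as pd

# One hypothetical option per standardized feature:
with pd.option_context("mode.future.standard_setitem", True):
    ...  # code in this block gets the standard-compliant behavior
```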
Or there could be a single option to switch to "API-compliant mode":
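Again as a sketch, with a hypothetical option name:

```python
import pandas as pd

# A single switch for everything the standard specifies:
with pd.option_context("mode.standard_compliant", True):
    ...  # all API-standard behavior enabled within this block
```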
Or both of those together.
Question: do other dataframe libraries have a similar context manager?
Using a `from __future__` import

It looks like it's possible to implement features with a literal `from __future__` import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a `from dflib.__future__ import X` is easier (no import hooks needed), however it runs into the problem also described in Ref 3: it is not desirable to propagate options to nested scopes:
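For example (a sketch; `dflib` and `helper_lib` are hypothetical libraries): the opt-in should apply to the module that requests it, but not leak into code it calls:

```python
# my_analysis.py -- opts in to new behavior (all names hypothetical)
from dflib.__future__ import new_unique

import dflib
import helper_lib  # internally also calls dflib

df = dflib.DataFrame({"a": [2, 1, 2]})
df.unique()           # new behavior here: exactly what this module asked for
helper_lib.clean(df)  # helper_lib's internal dflib calls should keep the
                      # old behavior; if the option propagated into nested
                      # scopes, they would silently change too
```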
is easier (no import hooks), however it runs into the problem also described in Ref 3: it is not desirable to propagate options to nested scopes:Now of course this scope propagation is also what a context manager does. However, the point of a
from __future__
import and jumping through the hoops required to make that work (= more esoteric than a context manager) is to gain a switch that is local to the Python module in which it is used.Comparing a context manager and a
from __future__
importFor new functions, methods and objects both are pretty much equivalent, since they will only be used on purpose (the scope propagation issue above is irrelevant)
- For changes to existing functions or methods, both will work too. The module-local behavior of a `from __future__` import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.
- For behavior changes there's an issue with the `from __future__` import: the import hooks will rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax. So it seems like there will be no good way to toggle that behavior on a module-scope level.
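To make the trigger concrete: an import hook would scan a module's source for the opt-in statement, roughly like this (a sketch; `dflib.__future__` is a hypothetical module name):

```python
import ast

def module_opts_in(source: str) -> bool:
    """Detect whether a module opts in via `from dflib.__future__ import ...`."""
    tree = ast.parse(source)
    return any(
        isinstance(node, ast.ImportFrom) and node.module == "dflib.__future__"
        for node in ast.walk(tree)
    )

# module_opts_in("from dflib.__future__ import new_unique")  -> True
```

A new function or method gives the subsequent transform concrete syntax to rewrite; implicit behavior like casting rules leaves no such node in the AST, which is the limitation described above.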
My current impression

`from __future__ import xxx` is perhaps best for adoption of changes to existing functions or methods: it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

References
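Comments

Here's a summary of some of the feedback/discussion on this in a call last week:

- There is a trade-off between what is easier for end users vs. for dataframe-consuming libraries vs. for dataframe implementers.
- We should have a better idea in terms of how this will all work once we actually see how extensive the differences are between, e.g., pandas and a standardized dataframe object.
- @shwina says that he expects cuDF to go with a separate dataframe object, because it will be hard to (for example) support a missing/optional …
- @vnlitvinov says that for Modin he'd probably prefer future imports.

Other points made: …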