RFC, feat: LazyFrame.collect kwargs #1734
Nice, I like the look of this. It may help to simplify Lines 77 to 80 in 8c9525a
to just
agree, I think arrow's a good default for duckdb (also, as far as I can tell, collecting into Polars from duckdb requires pyarrow anyway, suggesting they first collect into pyarrow anyway?). to check my understanding then, this would be compatible with the |
Yes exactly!
🤔 now that you mention it (and completely unrelated to this PR), we could do the same for pyspark: see Ritchie's SO answer
Yes indeed, as long as we intend the default to be PyArrow
yes, nice! 🙌
I think this looks good, just going to give others the chance to weigh in
@MarcoGorelli I just added |
from narwhals.utils import Implementation

return PolarsDataFrame(
    df=self._native_frame.pl(),
Unrelated, but should we change this in PolarsDataFrame:
- df: pl.DataFrame,
+ native_dataframe: pl.DataFrame,
polars_kwargs: dict[str, Any] | None = None,
dask_kwargs: dict[str, Any] | None = None,
duckdb_kwargs: dict[str, str] | None = None,
These could all be TypedDicts 👀
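A minimal sketch of what that TypedDict suggestion might look like. The class names and the specific keys are illustrative assumptions, not the PR's actual definitions or a complete list of what each backend accepts; `return_type` is the option name floated in the PR description.

```python
from typing import TypedDict


class PolarsCollectKwargs(TypedDict, total=False):
    # Hypothetical subset of options that would be forwarded to Polars' collect.
    streaming: bool


class DaskComputeKwargs(TypedDict, total=False):
    # Hypothetical subset of options that would be forwarded to Dask's compute.
    scheduler: str


class DuckDBCollectKwargs(TypedDict, total=False):
    # Hypothetical narwhals-defined option naming the target dataframe backend.
    return_type: str


# total=False makes every key optional, so partial dicts still type-check.
duckdb_kwargs: DuckDBCollectKwargs = {"return_type": "pyarrow"}
```

Compared with `dict[str, Any]`, a type checker could then flag misspelled or unsupported keys at the call site.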
Niiiice! 🙌🙌 A couple of questions/ideas:
Hey @EdAbati, thanks for your feedback! That's exactly the purpose of an RFC 👌
Considering the dataframe ecosystem as of today:
Not sure what we aim to support in the future 😉 but I would try not to rack our brains too soon here!
Those are definitely fair concerns to think about! Thanks for pointing those out!
That's when they can branch out, we would just need to add additional
def collect(
    self: Self,
    *,
    polars_kwargs: dict[str, Any] | None = None,
    dask_kwargs: dict[str, Any] | None = None,
    duckdb_kwargs: dict[str, str] | None = None,
    **kwargs: Any,
):
    from narwhals.utils import Implementation

    # Prefer backend-specific kwargs when given for the active backend.
    if self.implementation is Implementation.POLARS and polars_kwargs is not None:
        kwargs_ = polars_kwargs
    elif ...:  # analogous branches for the other backends
        ...
    else:
        # Otherwise fall back to the generic kwargs.
        kwargs_ = kwargs
    return self._dataframe(
        # Forward the selected kwargs to the compliant frame.
        self._compliant_frame.collect(**kwargs_),
        level="full",
    )
I would be ok deferring the responsibility to the users. But coming back to personal preference, if I were to use an external library to find that I have to do a lot of branching myself, I wouldn't say it is particularly ergonomic 😅 |
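For illustration, the selection logic proposed in this comment can be run on its own. This is a sketch under stated assumptions: the `Implementation` enum below is a local stand-in (not imported from narwhals), and `resolve_collect_kwargs` is a hypothetical helper name.

```python
from enum import Enum, auto
from typing import Any, Dict, Optional


class Implementation(Enum):
    """Local stand-in for narwhals.utils.Implementation (assumed members)."""

    POLARS = auto()
    DASK = auto()
    DUCKDB = auto()
    OTHER = auto()


def resolve_collect_kwargs(
    implementation: Implementation,
    *,
    polars_kwargs: Optional[Dict[str, Any]] = None,
    dask_kwargs: Optional[Dict[str, Any]] = None,
    duckdb_kwargs: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Pick the backend-specific kwargs if provided, else the generic ones."""
    per_backend = {
        Implementation.POLARS: polars_kwargs,
        Implementation.DASK: dask_kwargs,
        Implementation.DUCKDB: duckdb_kwargs,
    }
    specific = per_backend.get(implementation)
    # Copy so callers' dicts are never mutated downstream.
    return dict(specific) if specific is not None else dict(kwargs)


# Polars-specific kwargs win for a Polars frame and are ignored for Dask.
selected = resolve_collect_kwargs(
    Implementation.POLARS, polars_kwargs={"streaming": True}, foo=1
)
```

With this shape, library authors who need different options per backend would not have to branch on the implementation themselves.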
Ah! I was under the impression that there was a request to make Also FYI PySpark 4.0.0 will support |
there is indeed a request to be able to collect to specific backends (e.g. someone via work asked to be able to collect duckdb-backed lazyframe into polars-backed dataframe), but I think this would still be backend-specific - e.g. not all lazy backends would necessarily have a way to collect to polars as in, which might not have a way to do Having said that, should we:
small note, but this can now be done as |
Been thinking about this a bit more, and in the read functions we have lf.collect(collect_kwargs[lf.implementation]) We could also have |
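A rough sketch of the user-side pattern mentioned in this comment, where one mapping keyed by implementation holds the kwargs. `Backend`, `FakeLazyFrame`, and `collect_with` are hypothetical stand-ins for illustration only, and unpacking the selected dict with `**` is an assumption about how the call would be made.

```python
from enum import Enum
from typing import Any, Dict, Mapping


class Backend(Enum):
    """Hypothetical stand-in for the implementation enum."""

    POLARS = "polars"
    DASK = "dask"
    DUCKDB = "duckdb"


class FakeLazyFrame:
    """Stand-in lazy frame that echoes the kwargs it was collected with."""

    def __init__(self, implementation: Backend) -> None:
        self.implementation = implementation

    def collect(self, **kwargs: Any) -> Dict[str, Any]:
        # A real frame would execute the query here.
        return {"backend": self.implementation.value, "kwargs": kwargs}


def collect_with(
    lf: FakeLazyFrame, collect_kwargs: Mapping[Backend, Dict[str, Any]]
) -> Dict[str, Any]:
    # Look up kwargs for the frame's backend; default to none if absent.
    return lf.collect(**collect_kwargs.get(lf.implementation, {}))


collect_kwargs = {Backend.POLARS: {"streaming": True}}
result = collect_with(FakeLazyFrame(Backend.POLARS), collect_kwargs)
```

Using `.get(..., {})` rather than indexing avoids a `KeyError` for backends the user did not configure.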
Related issues: collect for lazy-only libraries #1479
This is a proposal for #1479. It is becoming more relevant now with DuckDB support, since we need to decide how to collect a DuckDB table.
For Polars and Dask, the collect kwargs would follow the native collect and compute respectively. For DuckDB we could come up with our own and document them properly. Specifically, I would suggest letting the user decide which dataframe backend to collect to (return_type?), with Arrow as the default.
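A minimal sketch of how such a `return_type` option could dispatch for DuckDB. It assumes the relation object exposes `arrow()`, `pl()`, and `df()` conversion methods, as DuckDB's Python relation does to my knowledge. `collect_duckdb` and the accepted string values are hypothetical, not the PR's final API.

```python
from typing import Any


def collect_duckdb(rel: Any, return_type: str = "pyarrow") -> Any:
    """Materialize a DuckDB-like relation into the requested backend.

    `rel` is assumed to provide arrow(), pl() and df() methods.
    """
    if return_type == "pyarrow":
        return rel.arrow()  # Arrow table (the proposed default)
    if return_type == "polars":
        return rel.pl()  # Polars DataFrame
    if return_type == "pandas":
        return rel.df()  # pandas DataFrame
    msg = (
        f"Unsupported return_type {return_type!r}; "
        "expected 'pyarrow', 'polars' or 'pandas'."
    )
    raise ValueError(msg)
```

Raising on unknown values keeps a typo in `return_type` from silently falling back to the default.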