Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC, feat: LazyFrame.collect kwargs #1734

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Jan 6, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

This is a proposal for #1479.

As it gets more relevant now due to DuckDB support and to decide how we could collect a DuckDB table.

For polars and dask, collect kwargs would follow native collect and compute respectively. For DuckDB we could come up with our own and document it properly. Specifically I would suggest to let the user decide to which dataframe backend to collect to (return_type?), with Arrow as default.

@FBruzzesi FBruzzesi added the enhancement New feature or request label Jan 6, 2025
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 6, 2025

Nice, I like the look of this

it may help to simplify

narwhals/tests/utils.py

Lines 77 to 80 in 8c9525a

if result.implementation is Implementation.POLARS and os.environ.get(
"NARWHALS_POLARS_GPU", False
): # pragma: no cover
result = result.to_native().collect(engine="gpu")

to just result.collect(polars_kwargs=dict(engine="gpu"))?


I would suggest to let the user decide to which dataframe backend to collect to (return_type?), with Arrow as default.

agree, I think arrow's a good default for duckdb (also, as far as I can tell, collecting into Polars from duckdb requires pyarrow anyway, suggesting they first collect into pyarrow anyway?). to check my understanding then, this would be compatibe with the collect in #1725, and we can add extra kwargs later?

@FBruzzesi
Copy link
Member Author

to just result.collect(polars_kwargs=dict(engine="gpu"))?

Yes exactly!

agree, I think arrow's a good default for duckdb (also, as far as I can tell, collecting into Polars from duckdb requires pyarrow anyway, suggesting they first collect into pyarrow anyway?).

🤔 now that you mention (and completely unrelated from this PR), we could do the same for pyspark: see SO Ritchie answer

to check my understanding then, this would be compatibe with the collect in #1725, and we can add extra kwargs later?

Yes indeed as long as we intend to have the default to be PyArrow

@MarcoGorelli
Copy link
Member

we could do the same for pyspark: see SO Ritchie answer

yes, nice! 🙌

@MarcoGorelli
Copy link
Member

I think this looks good, just going to give the chance to others to weigh in

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Jan 6, 2025

@MarcoGorelli I just added duckdb_kwargs to specify DuckDB return_type in the last commit. Can revert it if we want to sleep on it

from narwhals.utils import Implementation

return PolarsDataFrame(
df=self._native_frame.pl(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated but.. should we change in PolarsDataFrame:

- df: pl.DataFrame,
+ native_dataframe: pl.DataFrame,

Comment on lines +3616 to +3618
polars_kwargs: dict[str, Any] | None = None,
dask_kwargs: dict[str, Any] | None = None,
duckdb_kwargs: dict[str, str] | None = None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could all be TypedDict's 👀

@EdAbati
Copy link
Collaborator

EdAbati commented Jan 9, 2025

Niiiice! 🙌🙌

A couple of questions/idea:

  • wouldn't return_type eventually become an argument for every Lazy backend? e.g. one may want to always collect to polars regardless of which Lazy dataframe they are starting from.
    What do you think about changing the signature to:

    def collect(self, return_type, ...) -> ...
  • still not 100% sold to the idea of having one specific kwargs per Lazy backend, I think it would give us a bit less flexibility. The signature would change if we add more lazy backends (e.g. pyspark). Or I think it'd become a problem if:

    • we remove a lazy backend, for example moving it to an separate integration library
    • people wants to create their own Lazy backends and want to pass args to their collect. (probably a niche use case though)

    What do you think?

    Also I wouldn't find particularly ugly something like this (but that's personal preference :D ):

     lazy_df = df.lazy().select(nw_v1.all().sum())
     if lazy_df.implementation == Implementation.POLARS:
         eager_df = lazy_df.collect(no_optimization=True)
     elif lazy_df.implementation == Implementation.DASK:
         eager_df = lazy_df.collect(optimize_graph=False)
     elif lazy_df.implementation == Implementation.DUCKDB:
         eager_df = lazy_df.collect(return_type="pyarrow")

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Jan 9, 2025

Hey @EdAbati, thanks for your feedback! that's exactly the purpose of a RFC 👌

  • wouldn't return_type eventually become an argument for every Lazy backend? e.g. one may want to always collect to polars regardless of which Lazy dataframe they are starting from.

Considering the dataframe ecosystem as of today:

  • Polars would always collect to polars
  • Dask would always collect to pandas
  • DuckDB (and maybe Ibis) could collect to polars, arrow and pandas
  • Spark natively to pandas (but I see the appeal of collecting to others)

Not sure what we are aim to support in the future 😉 but I would try not to break our head too soon here!

  • still not 100% sold to the idea of having one specific kwargs per Lazy backend, I think it would give us a bit less flexibility. The signature would change if we add more lazy backends (e.g. pyspark). Or I think it'd become a problem if:

    • we remove a lazy backend, for example moving it to an separate integration library

Those are definitly fair concerns to think about! Thanks for pointing those out!

  • people wants to create their own Lazy backends and want to pass args to their collect. (probably a niche use case though)

That's when they can branch out, we would just need to add additional **kwargs to be passed to all .collect if the implementation is outside of what we cover with the dedicated arguments. In code:

def collect(
    self: Self,
    *,
    polars_kwargs: dict[str, Any] | None = None,
    dask_kwargs: dict[str, Any] | None = None,
    duckdb_kwargs: dict[str, str] | None = None,
    **kwargs: Any,
):
    from narwhals.utils import Implementation

    if self.implementation is Implementation.POLARS and polars_kwargs is not None:
        kwargs_ = polars_kwargs
    elif ...:
        ...
    else:
        kwargs_ = kwargs

    return self._dataframe(
        self._compliant_frame.collect(**kwargs),
        level="full",
    )

What do you think?
Also I wouldn't find particularly ugly something like this (but that's personal preference :D ):

 lazy_df = df.lazy().select(nw_v1.all().sum())
 if lazy_df.implementation == Implementation.POLARS:
     eager_df = lazy_df.collect(no_optimization=True)
 elif lazy_df.implementation == Implementation.DASK:
     eager_df = lazy_df.collect(optimize_graph=False)
 elif lazy_df.implementation == Implementation.DUCKDB:
     eager_df = lazy_df.collect(return_type="pyarrow")

I would be ok deferring the responsibility to the users. But coming back to personal preference, if I were to use an external library to find that I have to do a lot of branching myself, I wouldn't say it is particularly ergonomic 😅

@EdAbati
Copy link
Collaborator

EdAbati commented Jan 11, 2025

Not sure what we are aim to support in the future 😉 but I would try not to break our head too soon here!

Ah! I was under the impression that there was a request to make LazyFrames able to collect to any backend. Similarly to what the methods to_pandas(), to_arrow etc would do for eager frames. Maybe I misunderstood then :D

Also FYI PySpark 4.0.0 will support toArrow

@MarcoGorelli
Copy link
Member

there is indeed a request to be able to collect to specific backends (e.g. someone via work asked to be able to collect duckdb-backed lazyframe into polars-backed dataframe), but I think this would still be backend-specific - e.g. not all lazy backends would necessarily have a way to collect to polars

as in, which might not have a way to do lazy_df.collect(eager_backend='polars') and know that it will work for all lazy backends...but if we know that say duckdb supports it, we can do lazy_df.collect(duckdb_kwargs={'eager_backend': 'polars'})

Having said that, should we:

  • use eager_backend instead of return_dtype? return_type kind of sounds like it affects the return type of the function, although that is always nw.DataFrame
  • should we document a couple of common kwargs for polars / dask backends? e.g. streaming, engine='gpu'. then for duckdb_kwargs / pyspark_kwargs we can document 'eager_backend'

if lazy_df.implementation == Implementation.POLARS:

small note, but this can now be done as if lazy_df.implementation.is_polars()

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 16, 2025

Been thinking about this a bit more, and in the read functions we have **kwargs, perhaps we should be consistent with that? To avoid branching, users could make a collect_kwargs dictionary which maps implementations to kwargs and then do

lf.collect(collect_kwargs[lf.implementation])

We could also have eager_backend as an argument in collect, and lazy_backend in lazy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: collect for lazy-only libraries
3 participants