
[Enh]: Spark Expr missing methods #1714

Open
18 of 37 tasks
FBruzzesi opened this issue Jan 3, 2025 · 6 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers, but anyone is welcome to submit a pull request!), help wanted (Extra attention is needed), pyspark (Issue is related to pyspark backend)

Comments

@FBruzzesi
Member

FBruzzesi commented Jan 3, 2025

Methods that are row-order dependent are not included.
Methods within a namespace are listed individually only if the namespace itself already exists; otherwise, it means all of that namespace's methods are missing.

Methods marked with one asterisk (*) change the length but don't aggregate; these are deprioritized for now. A rough sketch of how a few of the high-priority methods map onto plain PySpark expressions follows the lists below.

High priority:

  • abs
  • all
  • any
  • clip
  • is_between
  • is_duplicated
  • is_finite
  • is_in
  • is_nan
  • is_unique
  • len
  • median
  • n_unique
  • null_count
  • over
  • round
  • skew
  • drop_nulls (*)
  • fill_null (if a strategy is provided, otherwise it is order dependent)
  • filter (*)
  • mode (*)
  • quantile
  • replace_strict
  • unique (*)

Namespaces:

  • name (**)
  • cat (**)
  • list (**)
  • dt (**)
    • to_string
    • total_microseconds
    • total_milliseconds
    • total_minutes
    • total_nanoseconds
    • total_seconds
  • str (**)
    • replace
    • to_datetime
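
Purely as an illustration for anyone picking these up (this is not the narwhals implementation, and the column name and literals are made up), a few of the high-priority methods map fairly directly onto plain PySpark column expressions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (-5.0,), (None,)], ["a"])

    df.select(
        F.abs(F.col("a")).alias("abs"),                                          # abs
        F.col("a").between(-2, 2).alias("is_between"),                           # is_between, inclusive bounds
        F.least(F.greatest(F.col("a"), F.lit(-1.0)), F.lit(1.0)).alias("clip"),  # clip(-1.0, 1.0)
    ).show()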
FBruzzesi added the enhancement, help wanted, and good first issue labels on Jan 3, 2025
@lucas-nelson-uiuc
Contributor

lucas-nelson-uiuc commented Jan 4, 2025

Hey @FBruzzesi ,

Working on implementing scalar methods like any and all - should be ready to push later today.

Planning to work on the following methods; I want to first check whether my thought process is "correct".

  • arg_true
  • drop_nulls
  • filter
  • gather_every
  • sort
  • unique

Thinking of implementing two patterns for these methods:

# if predicate-based (e.g. drop_nulls, which uses predicate function `F.isnull`)
def method(self) -> Self:
    def _method(_input: Column) -> Column:
        from pyspark.sql import functions as F  # noqa: N812

        return F.explode(F.filter(F.array(_input), <predicate_func>))

    return self._from_call(_method, "method", returns_scalar=False)


# if not predicate-based (e.g. unique, which uses array function `F.array_distinct`)
def method(self) -> Self:
    def _method(_input: Column) -> Column:
        from pyspark.sql import functions as F  # noqa: N812

        return F.explode(<array_func>(F.array(_input)))

    return self._from_call(_method, "method", returns_scalar=False)

Not sure how expensive this is or whether it collides with future API developments - let me know what you think.
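
To make the first (predicate-based) pattern concrete, here is a standalone sketch of what it computes for drop_nulls, assuming a local SparkSession; note that the predicate presumably has to be negated (~F.isnull) so that nulls are actually dropped:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (None,), (3,)], ["a"])

    # wrap the column in a one-element array, drop null elements, then explode back to rows
    df.select(
        F.explode(F.filter(F.array(F.col("a")), lambda x: ~F.isnull(x))).alias("a")
    ).show()  # the row where "a" is null disappears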

@MarcoGorelli
Member

Thanks @lucas-nelson-uiuc for your efforts here.

Can we leave the row-order-dependent ones out for now and make sure we've got everything done from the others first? There are some broader API decisions we need to make for those.

@lucas-nelson-uiuc
Contributor

lucas-nelson-uiuc commented Jan 10, 2025

Got a working version for the following; all support the Polars examples and the expr_and_series tests:

  • filter
  • drop_nulls
  • replace_strict
  • fill_null (only strategy='zero' and strategy='one' seem like v1 additions)
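
For reference, a minimal standalone sketch (not necessarily what the PR does) of an order-independent fill_null, e.g. strategy='zero', in plain PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (None,)], ["a"])

    # coalesce falls back to the literal wherever "a" is null
    df.select(F.coalesce(F.col("a"), F.lit(0.0)).alias("a")).show()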

@FBruzzesi
Member Author

Amazing stuff @lucas-nelson-uiuc! Looking forward to those as well!
Notice that we have now merged the pyspark tests into the main test suite; to run a test you will just need to remove the following snippet from the dedicated feature test:

    if "pyspark" in str(constructor):
        request.applymarker(pytest.mark.xfail)

@FBruzzesi
Member Author

FYI I am working on SparkLikeNamespace methods

@lucas-nelson-uiuc
Contributor

Tried adding is_nan in #1802 but noticed two things:

  • nw._spark_like.expr.cast is not yet fully developed - this causes the tests to fail
  • Spark handles division by zero by returning null instead of NaN - this also causes the test to fail
    • should the Spark implementation of is_nan check for both NaN and NULL?

Let me know if I'm missing something.
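
For what it's worth, here is a small standalone sketch of the two candidate behaviours (assuming a local SparkSession); whether narwhals should use the NULL-inclusive version is exactly the open question above:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(float("nan"),), (None,), (1.0,)], ["a"])

    df.select(
        F.isnan(F.col("a")).alias("is_nan_strict"),                           # flags NaN only
        (F.isnan(F.col("a")) | F.col("a").isNull()).alias("is_nan_or_null"),  # flags NaN or NULL
    ).show()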
