Median overhaul #1122

Merged
merged 13 commits into from
Apr 23, 2025
Conversation

Jolanrensen
Collaborator

Helps #961

WIP

@Jolanrensen Jolanrensen self-assigned this Apr 9, 2025
@Jolanrensen Jolanrensen added this to the 1.0.0-Beta1 (0.16) milestone Apr 9, 2025
@Jolanrensen Jolanrensen added the enhancement New feature or request label Apr 9, 2025
@Jolanrensen Jolanrensen mentioned this pull request Apr 9, 2025
9 tasks
@Jolanrensen
Collaborator Author

https://youtrack.jetbrains.com/issue/KT-76683 is holding me back

@Jolanrensen
Collaborator Author

@zaleslaw https://github.com/Kotlin/dataframe/blob/median/plugins/kotlin-dataframe/testData/box/groupBy_median.kt now fails because the return type of median changed to:

comparable -> itself or null
primitive number -> double or null

Shall I ignore the test for now so you can cover it after it's merged?

@Jolanrensen Jolanrensen marked this pull request as ready for review April 17, 2025 14:46
@zaleslaw
Collaborator

Yes, you could ignore, I will fix soon

}

// propagate NaN to return if they are not to be skipped
if (type.canBeNaN && !skipNaN && any { it.isNaN }) return Double.NaN

Collaborator

The logic here and in the next 3 lines is really hard to understand and follow. Are you sure it covers all the required mixed situations, such as Double.NaN + null? Could that combination occur here?

Collaborator Author

null cannot occur in this function because of the type T : Comparable<T>; the aggregator integration already filters nulls out. So yes, it covers all the cases. I think it reads like a sentence: if the type of the values can be NaN and NaNs are not skipped, then if any value is NaN, return NaN. In all other cases we continue and ignore NaNs. How would you rewrite it?
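
For readers following the thread, the rule described above can be sketched on a plain list of doubles. This is a minimal illustration, not the code from the PR: the function name `medianOrNaN` is hypothetical, nulls are assumed to be filtered out beforehand (as the comment says), and averaging the two middle values for an even count is an assumption of the sketch.

```kotlin
// Hypothetical helper for illustration only; not the library's implementation.
fun medianOrNaN(values: List<Double>, skipNaN: Boolean): Double? {
    // Propagate NaN when NaNs are not to be skipped and at least one is present.
    if (!skipNaN && values.any { it.isNaN() }) return Double.NaN

    // From here on, NaNs are either absent or meant to be ignored.
    val sorted = values.filterNot { it.isNaN() }.sorted()
    if (sorted.isEmpty()) return null

    val mid = sorted.size / 2
    // Assumption: an even count averages the two middle values.
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}
```

With the input `listOf(3.0, Double.NaN, 1.0, 2.0)`, the sketch returns NaN when `skipNaN = false` and 2.0 when `skipNaN = true`.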

where C : Comparable<C & Any>?, C : Number? =
Aggregators.medianNumbers<C>(skipNaN).aggregateAll(this, columns)

public fun <T> DataFrame<T>.median(vararg columns: String, skipNaN: Boolean = skipNaNDefault): Any =

Collaborator


Having Any as a return type defeats the purpose of strong typing, do we really need this?

Contributor

Agree, this should be Number at least, no?

Collaborator Author

Median works for any comparable, also Strings, datetimes etc. So Number would be wrong.

The only correct return type would be R where R : Comparable<R>, but since we don't know R, Any comes closest to the actual return type for the string-api. We use the same for the column accessors in describe().

We also cannot return Comparable<Any?> because that would imply you could call df.median("someCol").compareTo(anything) and that will likely break. We also cannot give Comparable<*>/Comparable<Nothing> because a) we cannot return Nothing and b) df.median("someCol").compareTo(anything) will then also break.

We could, however, force users to give a specific type R, how do you like that idea?

Contributor

@AndrewKis AndrewKis Apr 22, 2025

> We could, however, force users to give a specific type R, how do you like that idea?

Why not, but please write an example first, not sure how it looks.

Collaborator Author

sure, something like:

val medianAge: Int = df.median<Int>("age")
val medianLapTime: Double = df.median<Double>("lap1", "lap2", skipNaN = true)
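
To make the idea of a caller-supplied type R more concrete, here is a rough sketch on a stand-in data structure. Everything in it is hypothetical: `TinyFrame` and `medianOf` are invented for illustration and are not part of the DataFrame API. The sketch also returns R for every type (so it ignores the "numbers become Double" rule discussed in this PR) and assumes that an even count yields the lower middle value.

```kotlin
// Hypothetical stand-in for a dataframe: column name -> values.
class TinyFrame(val columns: Map<String, List<Any?>>)

// Sketch: the caller names R, so the result is typed R instead of Any.
inline fun <reified R : Comparable<R>> TinyFrame.medianOf(vararg names: String): R? {
    val values = names
        .flatMap { name -> columns[name] ?: error("No column named '$name'") }
        .filterNotNull() // nulls are dropped before computing the median
        .map { it as? R ?: error("Value $it is not of type ${R::class.simpleName}") }
        .sorted()
    // Assumption: for an even count, take the lower middle value so the result stays an R.
    return if (values.isEmpty()) null else values[(values.size - 1) / 2]
}
```

The cost of this design is visible in the cast: a wrong type argument only fails at runtime, which is the trade-off against returning Any.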

)

type == typeOf<Long>() ->
logger.warn { "Converting Longs to Doubles to calculate the median, loss of precision may occur." }

Collaborator


Probably log this at debug level instead?

Collaborator Author

It's just a logged warning, right? Users can disable it by changing the log level for org.jetbrains.kotlinx.dataframe.math.MedianKt on their end, or by converting their columns to Double manually first. It's not a heavy check, so I don't think debug mode is needed.
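
As background on why the warning exists at all: a Double has a 53-bit significand, so some Long values above 2^53 cannot be represented exactly. The small check below illustrates that; the helper name `survivesDoubleConversion` is hypothetical and not part of the library.

```kotlin
import java.math.BigDecimal

// Hypothetical helper: true if the Long has an exact Double representation.
// BigDecimal(Double) captures the exact binary value of the Double, so the comparison is exact.
fun survivesDoubleConversion(value: Long): Boolean =
    BigDecimal(value.toDouble()).compareTo(BigDecimal.valueOf(value)) == 0
```

For example, `1L shl 53` converts exactly, while `(1L shl 53) + 1` does not, which is the kind of loss the warning refers to.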

@zaleslaw
Collaborator

I have a strong feeling that we need fewer overloads/median* methods

@@ -36,12 +36,12 @@ public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipN

 public inline fun <T, reified R : Comparable<R & Any>?> DataColumn<T>.maxBy(
     skipNaN: Boolean = skipNaNDefault,
-    noinline selector: (T) -> R,
+    crossinline selector: (T) -> R,

Contributor


Are you sure about crossinline here?

Collaborator Author

crossinline is more efficient than noinline, and unfortunately we cannot leave the modifier off entirely, so yes.
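
A small sketch of why a modifier is needed at all: when an inline function calls its lambda parameter from inside another lambda or object, Kotlin requires the parameter to be `crossinline` (or `noinline`), because non-local returns are not possible from that nested context. The function `maxBySketch` below is hypothetical and much simpler than the library's `maxBy`.

```kotlin
// Hypothetical helper for illustration; not the library's maxBy.
inline fun <T, R : Comparable<R>> List<T>.maxBySketch(crossinline selector: (T) -> R): T? {
    // The selector is invoked inside the comparator lambda, a nested execution context.
    // Without crossinline (or noinline) this line does not compile.
    val comparator = Comparator<T> { a, b -> selector(a).compareTo(selector(b)) }
    return this.maxWithOrNull(comparator)
}
```

With `crossinline`, the selector body can still be inlined into the generated comparator, whereas `noinline` would pass it around as a function object.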

@Jolanrensen
Collaborator Author

> I have a strong feeling that we need fewer overloads/median* methods

Please enlighten me if you know how :)
We have the following restrictions:

  • When we encounter specific number columns, the return type should be Double
  • When we encounter other comparables, the return type should be the same as the input
  • For float/double columns, we need a skipNaN argument.
  • Just like other statistics we have all the X, of/by/for, row, and -orNull overloads for columns, dataframes, pivots, groupBy, and pivotGroupBy. Nothing really groundbreaking here

I already tried to combine overloads where possible. If the return type of the function does not change depending on the input type (like for DataFrame.medianFor {} -> DataRow), I just use one overload that accepts any comparable, number or not, and has the skipNaN argument. In other cases, where the return type does change, I made two variants: one that returns Double and another that returns T.
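
The "two variants" rule can be sketched on plain lists as follows. Both function names are hypothetical and deliberately distinct to keep the sketch unambiguous (the PR uses overloads of `median` instead). NaN handling is left out, and taking the lower middle value for an even count of non-numeric comparables is an assumption of this sketch.

```kotlin
// Numbers: the result is always a Double (mean of the two middle values for an even count).
fun <T : Number> medianOfNumbers(values: List<T>): Double? {
    val sorted = values.map { it.toDouble() }.sorted()
    if (sorted.isEmpty()) return null
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// Other comparables: the result has the same type as the input.
// Assumption: an even count yields the lower middle value, since e.g. Strings cannot be averaged.
fun <T : Comparable<T>> medianOfComparables(values: List<T>): T? {
    val sorted = values.sorted()
    return if (sorted.isEmpty()) null else sorted[(sorted.size - 1) / 2]
}
```

Because the two return types differ (Double? versus T?), a single generic signature cannot express both without falling back to a wider type, which is the constraint described above.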

@Jolanrensen Jolanrensen merged commit b8043c6 into master Apr 23, 2025
6 checks passed
@Jolanrensen Jolanrensen linked an issue Apr 23, 2025 that may be closed by this pull request
Successfully merging this pull request may close these issues.

median is broken for "mixed" number types
3 participants