Median overhaul #1122

Jolanrensen · 2025-04-09T13:35:16Z

Helps #961

WIP

Jolanrensen · 2025-04-11T10:33:49Z

https://youtrack.jetbrains.com/issue/KT-76683 is holding me back


Jolanrensen · 2025-04-17T14:46:11Z

@zaleslaw https://github.com/Kotlin/dataframe/blob/median/plugins/kotlin-dataframe/testData/box/groupBy_median.kt now fails because the return type of median changed to:

comparable -> itself or null
primitive number -> double or null

Shall I ignore the test for now so you can cover it after it's merged?

zaleslaw · 2025-04-17T15:35:00Z

Yes, you can ignore it; I will fix it soon.

core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/statistics.kt

zaleslaw · 2025-04-22T09:58:11Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/math/median.kt

+    }
+
+    // propagate NaN to return if they are not to be skipped
+    if (type.canBeNaN && !skipNaN && any { it.isNaN }) return Double.NaN


The logic here and in the next 3 rows is really hard to understand and follow. Are you sure that it covers all the required mixed situations, such as Double.NaN + null? Could that occur here?

null cannot occur in this function because of the type bound T : Comparable<T>; the aggregator integration already filters nulls out. So yes, it covers all the cases. I think it reads like a sentence: if the type of the values can be NaN and NaNs are not skipped, then if any value is NaN, return NaN. In all other cases we continue and ignore NaNs. How would you rewrite it?
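To make the rule above concrete, here is a minimal standalone sketch for a plain List<Double>. The name medianSketch and the averaging of the two middle values are assumptions made for illustration; this is not the code in math/median.kt.

```kotlin
// Hypothetical helper for illustration only, not the library's implementation.
// Nulls are assumed to have been filtered out already, so the element type is non-null.
fun List<Double>.medianSketch(skipNaN: Boolean): Double? {
    // Propagate NaN: if NaNs are not to be skipped and any value is NaN, the result is NaN.
    if (!skipNaN && any { it.isNaN() }) return Double.NaN

    // In all other cases NaNs are ignored (when skipNaN is false, none are left at this point).
    val values = filterNot { it.isNaN() }.sorted()

    // An empty input has no median.
    if (values.isEmpty()) return null

    val mid = values.size / 2
    return if (values.size % 2 == 1) {
        values[mid]
    } else {
        // Assumption: an even count averages the two middle values.
        (values[mid - 1] + values[mid]) / 2.0
    }
}
```

With this sketch, listOf(1.0, Double.NaN, 3.0).medianSketch(skipNaN = false) is NaN, while the same list with skipNaN = true gives 2.0.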

zaleslaw · 2025-04-22T10:06:34Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/median.kt

+    where C : Comparable<C & Any>?, C : Number? =
+    Aggregators.medianNumbers<C>(skipNaN).aggregateAll(this, columns)
+
+public fun <T> DataFrame<T>.median(vararg columns: String, skipNaN: Boolean = skipNaNDefault): Any =


Having Any as a return type defeats the purpose of strong typing, do we really need this?

Agreed, this should be Number at least, no?

Median works for any comparable, also Strings, datetimes etc. So Number would be wrong.

The only correct return type would be R where R : Comparable<R>, but since we don't know R, Any comes closest to the actual return type for the string-api. We use the same for the column accessors in describe().

We also cannot return Comparable<Any?> because that would imply you could call df.median("someCol").compareTo(anything) and that will likely break. We also cannot give Comparable<*>/Comparable<Nothing> because a) we cannot return Nothing and b) df.median("someCol").compareTo(anything) will then also break.

We could, however, force users to give a specific type R, how do you like that idea?

We could, however, force users to give a specific type R, how do you like that idea?

Why not, but please write an example first, not sure how it looks.

sure, something like:

val medianAge: Int = df.median<Int>("age")
val medianLapTime: Double = df.median<Double>("lap1", "lap2", skipNaN = true)
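For comparison, a self-contained toy version of the two options could look like the sketch below. ToyFrame, medianAny, and medianAs are hypothetical names invented for this illustration and are not part of the DataFrame API. For simplicity the sketch returns the lower middle element instead of averaging.

```kotlin
// Toy stand-in for a data frame, keyed by column name. NOT the Kotlin DataFrame API.
class ToyFrame(val columns: Map<String, List<Any?>>)

// Untyped variant: the caller gets Any? back and has to cast it.
@Suppress("UNCHECKED_CAST")
fun ToyFrame.medianAny(column: String): Any? {
    val values = columns.getValue(column).filterNotNull()
    // Assumes all values in the column are mutually comparable.
    val sorted = values.sortedWith(compareBy { it as Comparable<Any> })
    // Lower middle element; null when the column is empty.
    return sorted.getOrNull((sorted.size - 1) / 2)
}

// Typed variant: the caller names R, so the result is statically an R?.
inline fun <reified R : Comparable<R>> ToyFrame.medianAs(column: String): R? {
    // Because R is reified, the cast is checked and throws ClassCastException on a mismatch.
    val sorted = columns.getValue(column).filterNotNull().map { it as R }.sorted()
    return sorted.getOrNull((sorted.size - 1) / 2)
}
```

The typed variant moves the possible type mismatch to a single, explicit place (the type argument) instead of leaving a cast to every caller.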

zaleslaw · 2025-04-22T10:07:38Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/math/median.kt

+            )
+
+        type == typeOf<Long>() ->
+            logger.warn { "Converting Longs to Doubles to calculate the median, loss of precision may occur." }


probably debug mode?

It's just a logged warning, right? Users can disable it by changing the log level for org.jetbrains.kotlinx.dataframe.math.MedianKt on their end, or by converting their columns to Double manually first. It's not a heavy check, so I think debug mode is not needed.
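As background on why the warning exists at all: a Double has a 53-bit significand, so not every Long survives the conversion. A small sketch, with helper names that are made up for this example:

```kotlin
// Hypothetical helpers for illustration only.

// True if converting the Long to Double and back yields the same value.
// (Only meaningful away from Long.MAX_VALUE, where toLong() saturates.)
fun roundTripsExactly(value: Long): Boolean = value.toDouble().toLong() == value

// Median of two Longs computed in Double space, as a median over Doubles would do it.
fun medianOfTwoAsDouble(a: Long, b: Long): Double = (a.toDouble() + b.toDouble()) / 2.0
```

Here roundTripsExactly(1L shl 53) holds, but roundTripsExactly((1L shl 53) + 1) does not, because 2^53 + 1 cannot be represented exactly as a Double. For small values such as medianOfTwoAsDouble(2, 4) the result is exact.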

zaleslaw · 2025-04-22T10:09:17Z

I have a strong feeling that we need fewer overloads/median* methods

AndrewKis · 2025-04-22T09:55:09Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/max.kt

@@ -36,12 +36,12 @@ public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipN

 public inline fun <T, reified R : Comparable<R & Any>?> DataColumn<T>.maxBy(
    skipNaN: Boolean = skipNaNDefault,
-    noinline selector: (T) -> R,
+    crossinline selector: (T) -> R,


Are you sure about crossinline here?

crossinline is more efficient than noinline and unfortunately we cannot do it without, so yes.
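A standalone sketch of the situation, assuming a simplified maxBy on List (maxBySketch is a made-up name, not the function in max.kt): the selector is called from inside another lambda, so inside an inline function it has to be marked either noinline or crossinline.

```kotlin
// Hypothetical, simplified stand-in for illustration only.
inline fun <T, R : Comparable<R>> List<T>.maxBySketch(crossinline selector: (T) -> R): T? {
    // The selector is used inside the comparator lambda, i.e. in a nested execution context.
    // Without `crossinline` (or `noinline`) the compiler rejects this, because the selector
    // could otherwise contain a non-local return.
    val comparator = Comparator<T> { a, b -> selector(a).compareTo(selector(b)) }
    return maxWithOrNull(comparator)
}
```

With crossinline the selector body can still be inlined into the generated comparator, whereas noinline keeps it as a separate function object that is invoked indirectly.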

AndrewKis · 2025-04-22T10:17:54Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/median.kt

+    where C : Comparable<C & Any>?, C : Number? =
+    Aggregators.medianNumbers<C>(skipNaN).aggregateAll(this, columns)
+
+public fun <T> DataFrame<T>.median(vararg columns: String, skipNaN: Boolean = skipNaNDefault): Any =


Agreed, this should be Number at least, no?

Jolanrensen · 2025-04-22T11:45:57Z

I have a strong feeling that we need fewer overloads/median* methods

Please enlighten me if you know how :)
We have the following restrictions:

When we encounter specific number columns, the return type should be Double
When we encounter other comparables, the return type should be the same as the input
For float/double columns, we need a skipNaN argument.
Just like other statistics we have all the X, of/by/for, row, and -orNull overloads for columns, dataframes, pivots, groupBy, and pivotGroupBy. Nothing really groundbreaking here

I already tried to combine overloads where possible, so if the return type of the function does not change depending on the input type (like for DataFrame.medianFor {} -> DataRow) I just use one overload that accepts any comparable, number or not, and has the skipNaN argument. In other cases, where the return type does change, I made two variants; one that returns Double and another that returns T.
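To illustrate why the return type forces two variants, here is a reduced sketch on plain lists. The names medianOfNumbers and medianOfComparables are hypothetical and exist only in this example, and returning the lower middle element for non-numeric comparables is an assumption of the sketch.

```kotlin
// Hypothetical names for illustration only, not the library's overloads.

// Variant 1: number input, the return type is always Double (or null when empty).
fun <T> List<T>.medianOfNumbers(): Double? where T : Number, T : Comparable<T> {
    if (isEmpty()) return null
    val sorted = map { it.toDouble() }.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// Variant 2: any other comparable, the return type is the input type (or null when empty).
// Values such as Strings cannot be averaged, so this sketch picks the lower middle element.
fun <T : Comparable<T>> List<T>.medianOfComparables(): T? {
    val sorted = sorted()
    return sorted.getOrNull((sorted.size - 1) / 2)
}
```

For example, listOf(1, 2, 3, 4).medianOfNumbers() yields 2.5 as a Double, while listOf("b", "a", "c").medianOfComparables() yields "b" as a String. A single signature cannot express both result types.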

Jolanrensen added 5 commits April 8, 2025 13:27

small extra type conversion check for min/max

b3363ff

initial rework of median without percentile

f6d3309

wip median

fc88cfc

Merge branch 'master' into median

c40e642

updated from master

2e86d2f

Jolanrensen self-assigned this Apr 9, 2025

Jolanrensen added this to the 1.0.0-Beta1 (0.16) milestone Apr 9, 2025

Jolanrensen added the enhancement New feature or request label Apr 9, 2025

Jolanrensen mentioned this pull request Apr 9, 2025

☂ Statistics streamlining #961

Open

9 tasks

overhauling median

3f43197

Jolanrensen added 5 commits April 17, 2025 10:45

stuck on issue KT-76683

817d570

added medianBy overloads

9b356c8

added median tests, added HybridAggregationHandler for median, median now always returns null when empty

8effc33

Merge branch 'master' into median

da07f3b

Merge branch 'master' into median

1082586

Jolanrensen force-pushed the median branch from dda36f9 to 1082586 April 17, 2025 14:16

fixed tests aside from the compiler plugin

c889f30

Jolanrensen marked this pull request as ready for review April 17, 2025 14:46

Jolanrensen requested review from zaleslaw and AndrewKis April 17, 2025 14:46

ignoring testGroupBy_median test for now

fe5c4a9

zaleslaw reviewed Apr 22, 2025

core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/statistics.kt

zaleslaw reviewed Apr 22, 2025

zaleslaw approved these changes Apr 22, 2025

AndrewKis reviewed Apr 22, 2025

Jolanrensen merged commit b8043c6 into master Apr 23, 2025
6 checks passed

Jolanrensen linked an issue Apr 23, 2025 that may be closed by this pull request

median is broken for "mixed" number types #566

Closed
