-
Notifications
You must be signed in to change notification settings - Fork 71
Median overhaul #1122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Median overhaul #1122
Conversation
https://youtrack.jetbrains.com/issue/KT-76683 is holding me back |
@zaleslaw https://github.com/Kotlin/dataframe/blob/median/plugins/kotlin-dataframe/testData/box/groupBy_median.kt now fails because the return type of median changed to:
Shall I ignore the test for now so you can cover it after it's merged? |
Yes, you could ignore, I will fix soon |
} | ||
|
||
// propagate NaN to return if they are not to be skipped | ||
if (type.canBeNaN && !skipNaN && any { it.isNaN }) return Double.NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic here and next 3 rows is really hard to understand and follow, are you sure, that it covers all the required mixed situation Double.NaN + null, could it be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null
cannot occur in this function because of the type T : Comparable<T>
. The aggregator integration already filters them out. So yes, it covers all the cases. I think it reads like a sentence: if the type of the values cán be nan and nans are not skipped, then if any is nan, return nan. In all other cases we continue and ignore nans. How would you rewrite it?
where C : Comparable<C & Any>?, C : Number? = | ||
Aggregators.medianNumbers<C>(skipNaN).aggregateAll(this, columns) | ||
|
||
public fun <T> DataFrame<T>.median(vararg columns: String, skipNaN: Boolean = skipNaNDefault): Any = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having Any as a return type defeats the purpose of strong typing, do we really need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, the should be Number
at least, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Median works for any comparable, also Strings, datetimes etc. So Number
would be wrong.
The only correct return type would be R
where R : Comparable<R>
, but since we don't know R
, Any
comes closest to the actual return type for the string-api. We use the same for the column accessors in describe()
.
We also cannot return Comparable<Any?>
because that would imply you could call df.median("someCol").compareTo(anything)
and that will likely break. We also cannot give Comparable<*>/Comparable<Nothing>
because a) we cannot return Nothing
and b) df.median("someCol").compareTo(anything)
will then also break.
We could, however, force users to give a specific type R
, how do you like that idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, however, force users to give a specific type R, how do you like that idea?
Why not, but please write an example first, not sure how it looks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, something like:
val medianAge: Int = df.median<Int>("age")
val medianLapTime: Double = df.median<Double>("lap1", "lap2", skipNaN = true)
) | ||
|
||
type == typeOf<Long>() -> | ||
logger.warn { "Converting Longs to Doubles to calculate the median, loss of precision may occur." } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably debug mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a warning logged right? users can disable it by changing the log level for org.jetbrains.kotlinx.dataframe.math.MedianKt
on their end or by converting their columns to Double
manually first. It's not a heavy check, so I think debug mode is not needed.
I have a strong feeling, that we need less overloads/median* methods |
@@ -36,12 +36,12 @@ public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipN | |||
|
|||
public inline fun <T, reified R : Comparable<R & Any>?> DataColumn<T>.maxBy( | |||
skipNaN: Boolean = skipNaNDefault, | |||
noinline selector: (T) -> R, | |||
crossinline selector: (T) -> R, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about crossinline
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crossline
is more efficient than noinline
and unfortunately we cannot do it without, so yes.
where C : Comparable<C & Any>?, C : Number? = | ||
Aggregators.medianNumbers<C>(skipNaN).aggregateAll(this, columns) | ||
|
||
public fun <T> DataFrame<T>.median(vararg columns: String, skipNaN: Boolean = skipNaNDefault): Any = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, the should be Number
at least, no?
Please enlighten me if you know how :)
I already tried to combine overloads where possible, so if the return type of the function does not change depending on the input type (like for |
Helps #961
WIP