-
Notifications
You must be signed in to change notification settings - Fork 45
Make covariance and correlation work for iterators, skipmissing in particular. #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| @@ -1,5 +1,4 @@ | |||
| name = "Statistics" | |||
| uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a hack to make sure julia knows to load this folder, it's described here for Pkg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally the Travis script does that automatically, so you can revert this: https://github.com/JuliaLang/Statistics.jl/blob/master/.travis.yml#L24
Though you need it to run tests locally.
|
I have added the functionality we want and added tests. What's left, assuming what I've written is okay, is to disallow some things that only kind of work at the moment. The above works, but we can't add any of the |
collects everywhere
src/Statistics.jl
Outdated
| Return the number one. | ||
| """ | ||
| cor(itr::Any) = one(real(eltype(collect(itr)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better first check whether Base.IteratorEltype(itr) isa Base.HasEltype && isconcrete(eltype(itr)), and in that case avoid calling collect.
Also remove the docstring for AbstractVector below, which is just a special case of this one.
src/Statistics.jl
Outdated
|
|
||
| # corzm (non-exported, with centered data) | ||
|
|
||
| corzm(x::Any) = corzm(collect(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark here and for corm as for cor about using the eltype when it's known.
src/Statistics.jl
Outdated
| Compute the covariance between the iterators `x` and `y`. If `corrected` is `true` (the | ||
| default), computes ``\\frac{1}{n-1}\\sum_{i=1}^n (x_i-\\bar x) (y_i-\\bar y)^*`` where | ||
| ``*`` denotes the complex conjugate and `n = length(collect(x)) = length(collect(y))`. If `corrected` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ``*`` denotes the complex conjugate and `n = length(collect(x)) = length(collect(y))`. If `corrected` is | |
| ``*`` denotes the complex conjugate and ``n`` the number of elements. If `corrected` is |
src/Statistics.jl
Outdated
| Compute the variance of the iterator `itr`. If `corrected` is `true` (the default) then the sum | ||
| is scaled with `n-1`, whereas the sum is scaled with `n` if `corrected` is `false` where | ||
| `n = length(collect(itr))`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `n = length(collect(itr))`. | |
| ``n`` is the number of elements. |
src/Statistics.jl
Outdated
| """ | ||
| function cov(itr::Any; corrected::Bool=true) | ||
| x = collect(itr) | ||
| covm(x, mean(x); corrected=corrected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better call covzm directly to avoid an additional copy:
| covm(x, mean(x); corrected=corrected) | |
| covzm(map!(t -> t - xmean, x, x); corrected=corrected) |
Same for the two-argument method.
src/Statistics.jl
Outdated
| ## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector} | ||
| ## which can't be handled by broadcast | ||
| covm(itr::Any, itrmean; corrected::Bool=true) = | ||
| @show covm(collect(itr), itrmean; corrected=corrected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @show covm(collect(itr), itrmean; corrected=corrected) | |
| covm(collect(itr), itrmean; corrected=corrected) |
| @@ -1,5 +1,4 @@ | |||
| name = "Statistics" | |||
| uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally the Travis script does that automatically, so you can revert this: https://github.com/JuliaLang/Statistics.jl/blob/master/.travis.yml#L24
Though you need it to run tests locally.
nalimilan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
If these methods give correct results, then it's OK to allow it and only add keyword arguments later if needed. But they should be tested.
Can you also add tests for cov and cor?
src/Statistics.jl
Outdated
|
|
||
| # corm | ||
|
|
||
| corm(x::Any, xmean) = corm(collect(x), xmean) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also apply the eltype check here.
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
|
I have added many more tests. Everything is covered. The rules for Are that the rows of I'm not sure how to best put that into a doscstring. Future PRs should
|
src/Statistics.jl
Outdated
|
|
||
| # corzm (non-exported, with centered data) | ||
|
|
||
| function corzm(itr::Any) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this code in an internal method which will be called by all functions that need it? It's repeated three times.
Also:
| function corzm(itr::Any) | |
| function corzm(itr::Any) |
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
…ics.jl into iterator_cor2
|
The methods mutate the result to subtract the mean, so calling
Yes that's something that bothered me too. Probably only vectors should be allowed to be mixed with other iterators. For other cases it's not clear what should happen so better throw an error for now. |
I do not think they always do. Have a look at |
|
Ah right. So we need |
|
I was under the impression that |
|
@bkamins you are right about non-allocations. But adding methods results in tons of method ambiguity errors. To resolve this without re-thinking the whole dispatch scheme, I implemented just in places where we don't modify This result feels hacky, but it's better than method ambiguities. This is ready for review, by Milan and hopefully by Triage. |
| end | ||
|
|
||
| """ | ||
| cor(x::AbstractVector, y::AbstractVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this docstring which is a special case of the previous one.
src/Statistics.jl
Outdated
| _lazycollect(x::Any) = collect(x) | ||
| _lazycollect(x::AbstractVector) = x | ||
|
|
||
| function _matrix_error(x, y, fun) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just throw an error from _lazycollect if passed a matrix? The error message will be less precise but that's not a big deal. And then you can also throw an error for any AbstractArray that isn't an AbstractVector, which is a case which isn't allowed currently and should probably remain an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I would just add a special method to _lazycollect for other types. Actually I would do it for AbstractArray not only AbstractMatrix. The only thing to think about if we want to allow 0-dimensional AbstractArrays (they would produce NaN anyway).
I think that collecting 2 or more dimensional arrays in places where we expect vectors is not useful (but we can discuss this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work because we don't use it all the time, for instance when we collect and use map!.
I can add a _collect_if_itr_or_vec method for that scenario.
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Yes, but these "many functions" have also a defined meaningful behaviour for Actually I would have preferred a completely different design for Now regarding |
Current implementation excludes
Yes. The original attempt at #30 was essentially this, working only with iterators. I wold prefer an implementation which doesn't know about matrix inputs and only cares about iterators. Taking advantage of BLAS etc. for
This PR is motivated by the new |
|
Ah - then I would also prefer iterators as you do 😄. Actually @nalimilan proposed And I fully agree that having Regarding |
Say |
|
But the typical use case is the following setting: and you want to calculate correlation matrix of the columns using "pairwise complete observations". |
|
There is currently no |
|
We discussed this a bit on triage and while we don't feel super qualified to comment, we felt that having independent iterators for the two arguments was likely problematic, because iterators don't in general have a strong guarantee over their ordering. E.g. |
|
@Keno - this is exactly what needs to be done in an ideal world in my opinion (and then you can have several rules what you "pass down" to the But doing it right would require a significant API change (i.e. be breaking) - so what should be a practical approach to this? Should a breaking PR be put on a table so that it can be judged for inclusion in 2.0 release. An alternative is to leave What would be the preference here? |
|
I'd say design the interface you want and then figure out whether it's possible to do most of it backwards compatible. If not, stdlibs can also go to 2.0 before base Julia (that's the whole reason why we introduced them). |
|
So let me put my proposal on a table (it is essentially what I think @pdeffebach had in mind in #30). I write it for Here The second form is: That does the same but between In this design we can treat everything inside To be less breaking we can keep methods: (note that I have added Alternatively we could allow |
|
Thank you for your comments. Here are my thoughts: I think it's important to understand the purpose Now they move onto the rest of their data, which now has So That said, if the researcher wrote their Therefore, I would want |
|
See previous discussion at JuliaStats/StatsBase.jl#343. I think there are legitimate use cases for the current design of Also, allowing to pass any iterator to Now, regarding the So an alternative I had in mind is to introduce Probably this shouldn't be discussed here... :-) BTW, another design challenge is to allow combining this with weights to allow computing weighted pairwise correlation while skipping missing values. Composability would be great to have in that case. |
| A .= A .* b | ||
| return A | ||
| end | ||
| function covzm(x::Any, y::Any; corrected::Bool = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case covzm can get Any? It is an internal method and I thought it can only get already processed data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same question applies to covm below.
| is scaled with `n-1`, whereas the sum is scaled with `n` if `corrected` is `false` where | ||
| ``n`` is the number of elements. | ||
| """ | ||
| function cov(itr::Any; corrected::Bool=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to allow 0 or more than 2 dimensional arrays here?
| Return the number one. | ||
| """ | ||
| cor(itr::Any) = _return_one(itr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we touch this part of code then I do not understand the following (this is in general a separate PR, but we implement this method here, so I would like to clarify what is intended):
julia> x = rand(10, 2)
10×2 Array{Float64,2}:
0.281236 0.0338547
0.691944 0.830649
0.627939 0.62187
0.251539 0.162161
0.649065 0.627302
0.67754 0.227709
0.904292 0.481443
0.768511 0.439196
0.56268 0.885131
0.520348 0.0185026
julia> y = collect(eachrow(x))
10-element Array{SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true},1}:
[0.28123641854321346, 0.03385469973171129]
[0.6919437304763723, 0.8306494675149141]
[0.6279387501997036, 0.6218700109315964]
[0.2515386883173856, 0.16216070470557398]
[0.6490648176650291, 0.6273015737594985]
[0.6775400549397448, 0.22770867432380815]
[0.9042917317032804, 0.4814426050177454]
[0.7685107010600403, 0.4391959677108144]
[0.562680390776801, 0.8851305746997942]
[0.5203479821836228, 0.018502575073925165]
julia> cov(x)
2×2 Array{Float64,2}:
0.0409994 0.0321085
0.0321085 0.0983311
julia> cov(y)
2×2 Array{Float64,2}:
0.0409994 0.0321085
0.0321085 0.0983311
julia> cor(x)
2×2 Array{Float64,2}:
1.0 0.505691
0.505691 1.0
julia> cor(y)
ERROR: MethodError: no method matching zero(::Type{SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true}})
and I do not understand why cov and cor behave differently in how they handle x and y.
|
OK - if we want to go forward with the API proposal in this PR I have left some comments related only to it. |
|
Ah - and a general question, since |
I just became aware of this method of """
cor(x::AbstractVector)
Return the number one.
"""This basically torpedoes any idea of having a itr = ((1, 4), (3, 2), (5, 8), (7, 6))
cor(itr)itr = [(1, 4), (3, 2), (5, 8), (7, 6)]
cor(itr)Given this unfortunate situation, perhaps triage would reconsider allowing |
So, maybe removing single-argument For functions where arguments are fundamentally coupled (like |
Currently
covandcorfail with iterators that are not vectors, e.g.skipmissingiterators or vectors withIterators.Filterapplied to them. This is part of the plan I have commented on at JuliaLang/julia#35050 (comment) to improve quality of life issues withmissings.Thanks to
Missings.skipmissings(JuliaData/Missings.jl#111), this allows computing the correlation without missing values viacor(skipmissings(x, y)...).Supersedes #30 because it is a more minimal implementation.