Add some hypothesis test functions #315

mdahlin · 2025-01-02T16:24:22Z

Adds functions for

Mann-Whitney U test (per https://users.rust-lang.org/t/mann-whitney-u-test/95005)
One Way ANOVA F-test
One Sample t-test

Functions are generally based on the scipy version. I tried to align with existing formatting/setup from the fisher test, but wanted to make sure I was on the right track (in terms of structure, level of documentation, desire for this capability, etc.) before thinking about adding more tests.

YeungOnion · 2025-01-12T00:24:49Z

Sorry that I've taken a bit to reply here.

So far, these are great. We have mentioned the idea of a NaN policy with regard to analytical functions (as opposed to empirical functions), and I think following the scipy approach is good because of developer expectations.

We don't really have enough tests yet to have a sense of a uniform API for them, and having the policy as an argument is useful for establishing that.
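For readers unfamiliar with the scipy convention, a policy-as-argument API might look roughly like the sketch below. The enum name `NanPolicy`, its variants (named after scipy's `propagate`, `raise`, and `omit` options), the error type, and `mean_with_policy` are all hypothetical names chosen for illustration; they are not taken from this PR or from statrs.

```rust
/// Hypothetical policy enum, modeled on scipy's `nan_policy` options.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NanPolicy {
    /// Let NaN flow through; the result comes out as NaN.
    Propagate,
    /// Reject input that contains NaN.
    Raise,
    /// Drop NaN values before computing.
    Omit,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum MeanError {
    ContainsNan,
    Empty,
}

/// Computes the mean of `data`, handling NaN according to `policy`.
pub fn mean_with_policy(data: &[f64], policy: NanPolicy) -> Result<f64, MeanError> {
    let has_nan = data.iter().any(|x| x.is_nan());
    let kept: Vec<f64> = match policy {
        NanPolicy::Raise if has_nan => return Err(MeanError::ContainsNan),
        NanPolicy::Omit => data.iter().copied().filter(|x| !x.is_nan()).collect(),
        // Propagate, or Raise on clean input: use the data unchanged.
        _ => data.to_vec(),
    };
    if kept.is_empty() {
        return Err(MeanError::Empty);
    }
    Ok(kept.iter().sum::<f64>() / kept.len() as f64)
}
```

Taking the policy as an argument keeps the choice visible at each call site, which is the property that would help when comparing APIs across tests.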

I think the direction you're going in is good and I would approve this once I look into why the nightly-dependent workflow in the CI won't compile. I'm open to you continuing on this PR or opening a dependent PR.

However, regarding the license, to what degree would you say you referred to the scipy source? I don't wish to complicate the license we distribute with, nor do I want to use a license that's not typical for crates on crates.io.

mdahlin · 2025-01-12T15:01:06Z

Hey thanks for the response and feedback.

As for the nightly piece, I hit the same error locally. A day or so later I updated nightly and everything worked just fine, so it seems to have been an issue specific to that nightly build.

In terms of how much I "referred to the scipy source", it's been a pretty loose reference for the most part but I'll provide some relevant links if you want to form your own opinion.

one-way ANOVA F-test

relevant part of the implementation in this PR:

statrs/src/stats_tests/f_oneway.rs

Line 114 in 100e726

let n = n_i.iter().sum::<usize>();
scipy: https://github.com/scipy/scipy/blob/92d2a8592782ee19a1161d0bf3fc2241ba78bb63/scipy/stats/_stats_py.py#L4173

My conclusion: commonality with scipy is mainly just the function signature as I leveraged a statsdirect page for logic
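For context, the textbook one-way ANOVA F statistic is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch of that formula only; the function name `f_statistic` and its signature are assumptions for illustration and do not reproduce the code in this PR or in scipy.

```rust
/// Returns the one-way ANOVA F statistic for the given groups, or `None`
/// when it is undefined (fewer than two groups, an empty group, or no
/// within-group degrees of freedom).
pub fn f_statistic(groups: &[Vec<f64>]) -> Option<f64> {
    let k = groups.len();
    let n: usize = groups.iter().map(|g| g.len()).sum();
    if k < 2 || n <= k || groups.iter().any(|g| g.is_empty()) {
        return None;
    }

    let grand_mean = groups.iter().flatten().sum::<f64>() / n as f64;

    let mut ss_between = 0.0;
    let mut ss_within = 0.0;
    for g in groups {
        let mean = g.iter().sum::<f64>() / g.len() as f64;
        // Each group contributes its size times its squared distance
        // from the grand mean.
        ss_between += g.len() as f64 * (mean - grand_mean).powi(2);
        // Squared deviations of observations from their own group mean.
        ss_within += g.iter().map(|x| (x - mean).powi(2)).sum::<f64>();
    }

    let ms_between = ss_between / (k - 1) as f64;
    let ms_within = ss_within / (n - k) as f64;
    // Note: if every group has zero variance, ms_within is 0 and the
    // result is infinite or NaN.
    Some(ms_between / ms_within)
}
```

The p-value would then come from the upper tail of an F distribution with `k - 1` and `n - k` degrees of freedom.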

One Sample t-test

this PR:

statrs/src/stats_tests/ttest_onesample.rs

Line 78 in 100e726

let samplemean = a.iter().sum::<f64>() / (n as f64);
scipy: https://github.com/scipy/scipy/blob/6e246d0b54dd55dc69232a0caae6772228a7ac25/scipy/stats/_stats_py.py#L6092

My conclusion: again mainly just the function signature as I used the logic from this jpm page
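For context, the standard one-sample t statistic is the distance between the sample mean and the hypothesized mean, divided by the standard error of the mean. A minimal sketch of that formula follows; the name `t_statistic` and the returned tuple are assumptions for illustration, not the signature used in this PR.

```rust
/// Returns `(t, degrees_of_freedom)` for a one-sample t-test of `sample`
/// against the hypothesized population mean `popmean`, or `None` if there
/// are fewer than two observations.
pub fn t_statistic(sample: &[f64], popmean: f64) -> Option<(f64, usize)> {
    let n = sample.len();
    if n < 2 {
        return None;
    }
    let mean = sample.iter().sum::<f64>() / n as f64;
    // Unbiased sample variance (divides by n - 1).
    let var = sample.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
    let std_err = (var / n as f64).sqrt();
    // Note: a constant sample gives std_err == 0, so t is infinite or NaN.
    Some(((mean - popmean) / std_err, n - 1))
}
```

The p-value would then be read from a Student's t distribution with `n - 1` degrees of freedom.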

Mann Whitney U

Here we'd probably want to look separately at the two main pieces of logic, namely the two methods for calculating the test's statistic.

Exact

this PR:

statrs/src/stats_tests/mannwhitneyu.rs

Line 156 in 100e726

fn calc_mwu_exact_pvalue(u: f64, n1: usize, n2: usize) -> f64 {
scipy: https://github.com/scipy/scipy/blob/v1.15.0/scipy/stats/_mannwhitneyu.py#L83

My conclusion: These are very different from each other. The scipy version is doing a lot of 2-d array stuff and matrix operations that I didn't get into in my implementation.
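To illustrate what an exact method has to compute, here is a sketch of the textbook recurrence for the null distribution of U when there are no ties. It is neither the PR's `calc_mwu_exact_pvalue` nor scipy's array-based version; the names `mwu_null_pmf` and `mwu_exact_sf` are hypothetical, and counts are held in `f64`, so very large samples would lose exactness.

```rust
/// Null distribution of U for sample sizes `n1` and `n2`, assuming no ties.
/// Element `u` of the result is P(U = u) for `u` in `0..=n1 * n2`.
pub fn mwu_null_pmf(n1: usize, n2: usize) -> Vec<f64> {
    let max_u = n1 * n2;
    // counts[i][j][u] = number of orderings of i x's and j y's with U = u,
    // where U counts the (x, y) pairs in which x is larger.
    let mut counts = vec![vec![vec![0.0_f64; max_u + 1]; n2 + 1]; n1 + 1];
    for i in 0..=n1 {
        for j in 0..=n2 {
            if i == 0 || j == 0 {
                // With one sample empty there are no pairs, so U = 0.
                counts[i][j][0] = 1.0;
                continue;
            }
            for u in 0..=i * j {
                // Case 1: the largest element is a y; it adds nothing to U.
                let mut c = counts[i][j - 1][u];
                // Case 2: the largest element is an x; it beats all j y's.
                if u >= j {
                    c += counts[i - 1][j][u - j];
                }
                counts[i][j][u] = c;
            }
        }
    }
    let total: f64 = counts[n1][n2].iter().sum();
    counts[n1][n2].iter().map(|c| c / total).collect()
}

/// One-sided exact p-value P(U >= u_obs) under the null.
pub fn mwu_exact_sf(u_obs: usize, n1: usize, n2: usize) -> f64 {
    mwu_null_pmf(n1, n2).iter().skip(u_obs).sum()
}
```

For `n1 = n2 = 2`, the six possible orderings give U values 0 through 4 with counts 1, 1, 2, 1, 1, so the most extreme outcome has a one-sided p-value of 1/6.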

Asymptotic

this PR:

statrs/src/stats_tests/mannwhitneyu.rs

Line 126 in 100e726

fn calc_mwu_asymptotic_pvalue(
scipy: https://github.com/scipy/scipy/blob/92d2a8592782ee19a1161d0bf3fc2241ba78bb63/scipy/stats/_mannwhitneyu.py#L149

My conclusion: This is probably the only case worth your review/thoughts. The scipy version is ~10 LOC and the version in this PR is basically a 1:1 copy of those lines. There isn't too much room for alternative implementation here, but happy to re-write it to avoid any potential issues. Also for reference, the scipy function references this section in the Mann Whitney U wiki article.
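For reference, the normal approximation described in that Wikipedia section standardizes U using its null mean and a tie-corrected standard deviation. The sketch below computes only the z-score from those published formulas; the function name `mwu_z`, its parameters, and the choice to pass tie group sizes as a slice are assumptions for illustration rather than the PR's actual `calc_mwu_asymptotic_pvalue`.

```rust
/// z-score of `u` under the normal approximation to the Mann-Whitney U
/// null distribution.
///
/// `tie_counts` holds the size of each group of tied values in the pooled
/// sample (groups of size 1 contribute nothing and may be omitted).
/// When `continuity` is true, 0.5 is subtracted from the numerator.
pub fn mwu_z(u: f64, n1: usize, n2: usize, tie_counts: &[usize], continuity: bool) -> f64 {
    let (n1, n2) = (n1 as f64, n2 as f64);
    let n = n1 + n2;
    let mu = n1 * n2 / 2.0;

    // Sum of t^3 - t over all tie groups.
    let tie_term: f64 = tie_counts
        .iter()
        .map(|&t| {
            let t = t as f64;
            t.powi(3) - t
        })
        .sum();

    // Tie-corrected standard deviation of U.
    let sigma = (n1 * n2 / 12.0 * ((n + 1.0) - tie_term / (n * (n - 1.0)))).sqrt();

    let mut numerator = u - mu;
    if continuity {
        numerator -= 0.5;
    }
    numerator / sigma
}
```

A one-sided p-value would then be the standard normal survival function evaluated at this z-score. Because the formula is this short, independent implementations will inevitably look alike.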

codecov · 2025-01-14T03:34:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.26%. Comparing base (a8fe65c) to head (15ae850).
Report is 9 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #315      +/-   ##
==========================================
+ Coverage   93.81%   94.26%   +0.45%     
==========================================
  Files          53       58       +5     
  Lines       11996    12943     +947     
==========================================
+ Hits        11254    12201     +947     
  Misses        742      742


YeungOnion · 2025-01-14T03:44:39Z

re: nightly
looks good there

Thanks for the well-structured reply.

I took a look at each and I agree on your points regarding the implementations. To be fair, had you not done this I would still trust your assessment completely since the author has the best idea of what level they referred to sources.

As a non-expert in licensing, I believe we're in the clear here, especially since it's laid out so clearly in the Wikipedia article.

Regarding test cases, I'm not so sure here. I'll reach out to softwarefreedom.org and update you.

YeungOnion · 2025-01-14T04:01:41Z

src/stats_tests/mannwhitneyu.rs

+/// ranks data and accounts for ties to calculate the U statistic
+fn rankdata_mwu<T: PartialOrd>(xy: Vec<T>) -> Result<(Vec<f64>, Vec<usize>), MannWhitneyUError> {
+    let mut j = (0..xy.len()).collect::<Vec<usize>>();
+    let mut y = xy;


Hoping this comes off less as a nitpick and more as an "in case you didn't know": you can bind arguments as mut in the function header without affecting the signature, which removes the need for this kind of line.

mdahlin · 2025-01-15

Wasn't aware of this. I'm still pretty new to Rust, so I appreciate any comments on the more idiomatic things I might be missing. Should be addressed in 50dcaab.
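A small illustration of the point about `mut` bindings, using hypothetical helper functions that are unrelated to the PR's `rankdata_mwu`: both versions take ownership of the vector and have the same type from the caller's point of view, but the second needs no rebinding line.

```rust
/// Rebinds the argument to a mutable local before sorting.
fn sorted_copy_rebind(xs: Vec<i32>) -> Vec<i32> {
    let mut ys = xs; // extra line just to gain mutability
    ys.sort();
    ys
}

/// Declares the parameter binding as `mut` directly. The function's type
/// is still `fn(Vec<i32>) -> Vec<i32>`; `mut` here is part of the pattern,
/// not of the signature.
fn sorted_copy(mut xs: Vec<i32>) -> Vec<i32> {
    xs.sort();
    xs
}
```

Callers can pass a vector held in an immutable binding to either function, since ownership moves into the callee.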

mdahlin · 2025-01-15T23:22:50Z

> I'm open to you continuing on this PR or opening a dependent PR.

Regarding this, I just added a couple more tests. I'm happy to keep working on this PR and adding more tests for a bit longer, or I can open a new PR for further additions. I don't have much of a preference, and it doesn't seem like you do either.

YeungOnion · 2025-01-16T18:39:32Z

I think I now prefer that the next things be a new PR since these tests could be used for sampling verification which I was working on a bit ago.

I forgot that we have some pub use in the module header, which I'm not certain how I feel about. The names are unlikely to be shadowed. But I can certainly say that leaving it out for now is okay, since adding it later won't be a breaking change.
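To make the `pub use` question concrete, here is a toy module layout; the module and function names are placeholders, not the crate's real items. The re-export gives callers a shorter path while the longer path keeps working, which is why adding such a line later is purely additive.

```rust
mod stats_tests {
    pub mod f_oneway {
        /// Placeholder standing in for a real test function.
        pub fn f_oneway_stub() -> &'static str {
            "f_oneway"
        }
    }

    // Re-export in the module header: `stats_tests::f_oneway_stub` now
    // resolves in addition to `stats_tests::f_oneway::f_oneway_stub`.
    pub use self::f_oneway::f_oneway_stub;
}
```

Removing a re-export after release would break callers who use the short path, so starting without it leaves more room to decide.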

Thanks for adding these hypothesis tests!

YeungOnion · 2025-01-16T18:51:43Z

Oh and I didn't get back to you on due diligence for attributing data.

As much as possible, you should provide the source for the dataset where scipy is specifying what that source is; I believe it demonstrates good faith toward the primary source. But after that I'll definitely accept the PR.

mdahlin · 2025-01-16T22:25:09Z

> Oh and I didn't get back to you on due diligence for attributing data.
>
> As much as possible, you should provide the source for the dataset where scipy is specifying what that source is; I believe it demonstrates good faith toward the primary source. But after that I'll definitely accept the PR.

Sure thing. Are you looking for a full reference section?

YeungOnion · 2025-01-17T22:14:25Z

I think adding a line comment with an inline reference on the relevant tests would suffice.

After all, scipy had their references within their docs, so consuming that information is more common. This is for appropriate attribution, so I think it's fine if it's a little hairy.

YeungOnion

Thanks for making the attribution changes!

YeungOnion · 2025-01-19T15:51:21Z

Thank you for your contribution! Having tests will be great!

mdahlin added 3 commits January 1, 2025 21:35

feat(stats_tests): implement f_oneway

057b610

feat(stats_tests): implement ttest_onesample

9a7185a

feat(stats_tests): implement mannwhitneyu

100e726

mdahlin added 2 commits January 12, 2025 10:20

feat(stats_tests): implement skewtest

e226a1c

feat(stats_tests): implement chisquare

e8f8c80

YeungOnion reviewed Jan 14, 2025

mdahlin added 3 commits January 15, 2025 13:05

refactor: mut in function header instead of in function

50dcaab

test: more coverage for mannwhitneyu

0b3cf3d

test: more coverage for f_oneway

21113fc

docs(stats_test): better attribution of sources

15ae850

YeungOnion approved these changes Jan 19, 2025

YeungOnion merged commit f4136d5 into statrs-dev:master Jan 19, 2025
10 checks passed

YeungOnion mentioned this pull request Jan 20, 2025

crate revival and discussion #206

Open
