Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some hypothesis test functions #315

Merged
merged 9 commits into from
Jan 19, 2025

Conversation

mdahlin
Copy link
Contributor

@mdahlin mdahlin commented Jan 2, 2025

Adds functions for

Functions are generally based on the scipy version. I tried to align with existing formatting/setup from the fisher test, but wanted to make sure I was on the right track (in terms of structure, level of documentation, desire for this capability, etc.) before thinking about adding more tests.

@YeungOnion
Copy link
Contributor

Sorry that I've taken a bit to reply here.

So far, these are great. We have mentioned the idea of a nan policy in regards to analytical functions (as opposed to empirical functions) and I think following the scipy approach is good because of developer expectations.

We don't really have enough tests to have a sense of uniform API for tests and having the policy as an argument is useful for establishing that.

I think the direction you're going in is good and I would approve this once I look into why the nightly-dependent workflow in the CI won't compile. I'm open to you continuing on this PR or opening a dependent PR.

However, regarding license, to what degree would you say you referred to the scipy source? I don't wish to complicate the license we distribute with, nor do I want to use a license that's not typical for crates on crates.io

@mdahlin
Copy link
Contributor Author

mdahlin commented Jan 12, 2025

Hey thanks for the response and feedback.

In terms of the nightly piece. I found the same error locally. A day or so later I updated nightly and everything worked just fine, so it seems like it was an issue specific to nightly.


In terms of how much I "referred to the scipy source", it's been a pretty loose reference for the most part but I'll provide some relevant links if you want to form your own opinion.

one-way ANOVA F-test

My conclusion: commonality with scipy is mainly just the function signature as I leveraged a statsdirect page for logic

One Sample t-test

My conclusion: again mainly just the function signature as I used the logic from this jpm page

Mann Whitney U

Here we'd probably want to look at the two main pieces of logic, being the different methods for calculating the test's statistic, separately

Exact

My conclusion: These are very different from each other. The scipy version is doing a lot of 2-d array stuff and matrix operations that I didn't get into in my implementation.

Asymptotic

My conclusion: This is probably the only case worth your review/thoughts. The scipy version is ~10 LOC and the version in this PR is basically a 1:1 copy of those lines. There isn't too much room for alternative implementation here, but happy to re-write it to avoid any potential issues. Also for reference, the scipy function references this section in the Mann Whitney U wiki article.

Copy link

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.26%. Comparing base (a8fe65c) to head (15ae850).
Report is 9 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #315      +/-   ##
==========================================
+ Coverage   93.81%   94.26%   +0.45%     
==========================================
  Files          53       58       +5     
  Lines       11996    12943     +947     
==========================================
+ Hits        11254    12201     +947     
  Misses        742      742              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@YeungOnion
Copy link
Contributor

YeungOnion commented Jan 14, 2025

re: nightly
looks good there


Thanks for the well-structured reply.

I took a look at each and I agree on your points regarding the implementations. To be fair, had you not done this I would still trust your assessment completely since the author has the best idea of what level they referred to sources.

As a non-expert in licensing, I believe we're in the clear here, especially since it's laid out so clearly in the Wikipedia article.

Regarding test cases, I'm not so sure here. I'll reach out to softwarefreedom.org and update you.

/// ranks data and accounts for ties to calculate the U statistic
fn rankdata_mwu<T: PartialOrd>(xy: Vec<T>) -> Result<(Vec<f64>, Vec<usize>), MannWhitneyUError> {
let mut j = (0..xy.len()).collect::<Vec<usize>>();
let mut y = xy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hoping this comes off less of a nitpick and more of an "in case you didn't know"
You can bind arguments to be mut in the function header without affecting the signature and it can remove this kind of line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn't aware of this. Still pretty new to Rust so appreciate any comments on the more idiomatic things I might be missing. Should be addressed in 50dcaab

@mdahlin
Copy link
Contributor Author

mdahlin commented Jan 15, 2025

I'm open to you continuing on this PR or opening a dependent PR.

In regards to this, I just added a couple of more tests. I'm happy to keep working on this PR adding more tests for a bit longer. Or I can just up a new PR for more additions. I don't have too much a preference, and it doesn't seem like you do either.

@YeungOnion
Copy link
Contributor

I think I now prefer that the next things be a new PR since these tests could be used for sampling verification which I was working on a bit ago.

I forgot that we have some pub use in the module header, which I'm not certain how I feel about. The names are unlikely to be shadowed. But I can certainly say that not having it is okay as adding it won't break.

Thanks for adding these hypothesis tests!

@YeungOnion
Copy link
Contributor

Oh and I didn't get back to you on due diligence for attributing data.

As much as possible, you should provide the source for the dataset where scipy is specifying what that source is; I believe it demonstrates good faith toward the primary source. But after that I'll definitely accept the PR.

@mdahlin
Copy link
Contributor Author

mdahlin commented Jan 16, 2025

Oh and I didn't get back to you on due diligence for attributing data.

As much as possible, you should provide the source for the dataset where scipy is specifying what that source is; I believe it demonstrates good faith toward the primary source. But after that I'll definitely accept the PR.

Sure thing. Are you looking for a full reference section?

@YeungOnion
Copy link
Contributor

I think adding a line comment with an inline reference on the relevant tests would suffice.

After all, scipy had their references within their docs, so consuming that information is more common. This is for appropriate attribution, so I think it's fine if it's a little hairy.

Copy link
Contributor

@YeungOnion YeungOnion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the attribution changes!

@YeungOnion
Copy link
Contributor

Thank you for your contribution! Having tests will be great!

@YeungOnion YeungOnion merged commit f4136d5 into statrs-dev:master Jan 19, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants