-
Notifications
You must be signed in to change notification settings - Fork 77
Check mutation parents on tree sequence init #3212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Nice - hopefully this isn't too intrusive perf-wise. Note for compatibilty, I think we should add a flag to treeseq_init like TSK_COMPUTE_MUTATION_PARENTS or something, that just calls the function unconditionally before init. That'll make mopping up the test suite failures easier. |
So for a tree sequence with 4M total nodes, 3M mutations the I think this is good enough for us to use for now, it can always be improved later with a non-breaking change. I'm going to check all the other places that integrity is checked to see if they need this new check flag, then I'll add |
I think this is a good path. We can whittle the time down another bit later, as you say. |
Open questions that occur as I'm fixing up tests:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3212 +/- ##
=======================================
Coverage 89.63% 89.64%
=======================================
Files 28 28
Lines 32000 32050 +50
Branches 5877 5892 +15
=======================================
+ Hits 28684 28731 +47
- Misses 1886 1888 +2
- Partials 1430 1431 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…grity flag to public method
Python tests fixed - less than I expected. One wrinkle is the confusing |
Looks very nice, that's about as non-intrusive as it'll get I'd say. Maybe we should try this out in a SLiM build and see if it breaks anything @petrelharp? |
Added C tests - realised there are some interactions with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Spotted a few thing along the way.
I think the thing to do now is to get C API users to try this out and see what breaks.
c/tskit/tables.c
Outdated
/* Set the mutation parent to TSK_NULL so that we don't error | ||
in the integrity check on data we're about to overwite. | ||
*/ | ||
tsk_memset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do this, and we should't update the data if we error. TSK_CHECK_TREES doesn't imply TSK_CHECK_MUTATION_PARENTS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about not changing if we error. Here I was thinking about the fact that TSK_CHECK_TREES
implies TSK_CHECK_MUTATION_ORDERING
which checks order based on the existing parent column.
Fixed in 829c9ce
So actually not performing a double check makes this work much faster. We're now at 0.7s for |
Excellent - with #3218 we'll get this down to something negligible. Looks like there's a few error conditions to be hit in the C tests? (Or is this just codecov not updating quickly enough or something?) |
a45be79
to
5158ce1
Compare
5158ce1
to
b6f7b4d
Compare
Ah, I'd covered the functionality in Python, but not in C. Fixed now, the only uncovered lines are memory errors. |
One more thought here - there will/may be tree sequences on disk that will now error on load. They will have to be loaded as tables and the parents calculated. This process should be documented in the error message I guess. |
Yes, good point I think the main thing is now to get some feedback on how much this is going to break in the real world. @molpopgen @bhaller would it be possible to try out this branch downstream in your test suites and see if there's much breakage? It would really help to get a sense of how intrusive this change is going to be. |
@petrelharp can this ball fall in your court? I'm presently quite swamped getting ready for SLiM workshop teaching, and I'm not really sure what this would entail anyway – I suspect you have a better grasp of it than I do. (I don't recall being aware that mutations even have parents, lol.) @jeromekelleher what's the timeframe for needing feedback on this? Is there urgency here? |
No major hurry, we can leave this sit for a few weeks. |
I'm not sure how I could test this? |
Looking at the diff to the changelog, I don't expect this to affect anything downstream. The only possible nuisance is that I'll have to remember to make the new option flags public in tskit-rust once there is a new release of the C API and I upgrade that package. |
One question:
What does "implies" mean in this context? Does it mean that if "check mut parents" is 1, then the internal code will also check trees? Or does it mean that if "check mut parents" is 1, then "check trees MUST ALSO be 1, and therefore the flags in a bitfield are no longer additive? (I'd have an issue with the latter from a design perspective.) |
It means this. |
Thanks -- in that case, I think adding a test to tskit-rust that inits a treeseq w/o mutation parents being calculated should suffice. If I understand correctly, it will pass now but fail later once the C API gets updated? The failing test should serve as a reminder of what to do. |
That's correct, setting all the parents to |
I also note that #2923 suggests that all SINGER-generated files will have all-NULL mutation parents. |
WIP
This additional check causes 658 test failures in python and 6 in C, I think all of which would be solved by computing mutation parents.
Will do a perf check of this next.