Check mutation parents on tree sequence init #3212

Open
wants to merge 14 commits into
base: main
from

benjeffery
Member

WIP

This additional check causes 658 test failures in Python and 6 in C, all of which I think would be solved by computing mutation parents.

Will do a perf check of this next.

@jeromekelleher
Member

Nice - hopefully this isn't too intrusive perf-wise.

Note for compatibility, I think we should add a flag to treeseq_init like TSK_COMPUTE_MUTATION_PARENTS or something, that just calls the function unconditionally before init. That'll make mopping up the test suite failures easier.

@benjeffery
Member Author

So for a tree sequence with 4M total nodes, 3M mutations the main tskit.load time is 0.7s. With this PR that rises to 1.2s.

I think this is good enough for us to use for now, it can always be improved later with a non-breaking change.

I'm going to check all the other places that integrity is checked to see if they need this new check flag, then I'll add TSK_COMPUTE_MUTATION_PARENTS to tsk_treeseq_init and start fixing up the tests. I also need to add tests for the new code.

@jeromekelleher
Member

I think this is a good path. We can whittle the time down another bit later, as you say.

@benjeffery
Member Author

benjeffery commented Jun 11, 2025

Open questions that occur as I'm fixing up tests:

  • Should we sort after computing parents inside tsk_treeseq_init? One test fails as it specifies mutations, then when the parents are assigned the sort order is incorrect. I lean towards checking the order and sorting if needed.
    (This was nonsense - I forgot that compute_mutation_parents needs them sorted beforehand!)

  • tree_diff_iter checks integrity, but does it need to check mutation parents? It doesn't use them, so doesn't need to strictly, but anyone writing something using it should be checking if using the parents.


codecov bot commented Jun 11, 2025

Codecov Report

Attention: Patch coverage is 94.73684% with 6 lines in your changes missing coverage. Please review.

Project coverage is 89.64%. Comparing base (abe75c2) to head (b6f7b4d).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| c/tskit/tables.c | 93.87% | 4 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3212   +/-   ##
=======================================
  Coverage   89.63%   89.64%           
=======================================
  Files          28       28           
  Lines       32000    32050   +50     
  Branches     5877     5892   +15     
=======================================
+ Hits        28684    28731   +47     
- Misses       1886     1888    +2     
- Partials     1430     1431    +1     
| Flag | Coverage Δ |
| --- | --- |
| c-tests | 86.68% <94.73%> (+0.01%) ⬆️ |
| lwt-tests | 80.38% <ø> (ø) |
| python-c-tests | 88.18% <ø> (ø) |
| python-tests | 98.86% <ø> (+<0.01%) ⬆️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| c/tskit/core.c | 95.52% <100.00%> (+0.01%) ⬆️ |
| c/tskit/core.h | 100.00% <ø> (ø) |
| c/tskit/trees.c | 90.63% <100.00%> (+0.01%) ⬆️ |
| python/tskit/tables.py | 99.35% <ø> (+<0.01%) ⬆️ |
| c/tskit/tables.c | 83.24% <93.87%> (+0.03%) ⬆️ |
@benjeffery
Member Author

Python tests fixed - less than I expected.

One wrinkle is the confusing TSK_ERR_MUTATION_PARENT_AFTER_CHILD error message from init and compute_mutation_parents when the mutations have no parent column set but are in the wrong order for the parent that will eventually be assigned.
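
To illustrate the wrinkle, here is a toy Python sketch; it is not tskit's algorithm. It assumes a single tree stored as a child-to-parent dict, at most one mutation per (site, node), and hypothetical helper names (`compute_parents`, `parent_after_child`). With every stored parent left as NULL an ordering check has nothing to object to, but once parents are computed, a mutation listed before the one above it ends up with a parent at a later index.

```python
NULL = -1  # stand-in for TSK_NULL in this toy model


def nearest_mutation_above(node, site, tree_parent, mutation_at):
    """Index of the closest mutation at `site` on a strict ancestor of `node`."""
    u = tree_parent.get(node, NULL)
    while u != NULL:
        if (site, u) in mutation_at:
            return mutation_at[(site, u)]
        u = tree_parent.get(u, NULL)
    return NULL


def compute_parents(mutations, tree_parent):
    """mutations is a list of (site, node) pairs in table order.

    Toy assumption: at most one mutation per (site, node).
    """
    mutation_at = {key: j for j, key in enumerate(mutations)}
    return [
        nearest_mutation_above(node, site, tree_parent, mutation_at)
        for site, node in mutations
    ]


def parent_after_child(parents):
    """Indexes of mutations whose parent sits later in the table."""
    return [j for j, p in enumerate(parents) if p != NULL and p > j]


# Single tree: root 2 has child 1, which has child 0.
tree_parent = {0: 1, 1: 2, 2: NULL}

good_order = [(0, 1), (0, 0)]  # mutation on node 1 listed before the one below it
bad_order = [(0, 0), (0, 1)]   # same two mutations, listed child first

good_parents = compute_parents(good_order, tree_parent)
bad_parents = compute_parents(bad_order, tree_parent)

print(good_parents, parent_after_child(good_parents))
print(bad_parents, parent_after_child(bad_parents))
```

In this picture the ordering problem only appears in the computed parents, not in anything the user stored, which is one way to see why the message can look confusing.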

@jeromekelleher
Member

Looks very nice, that's about as non-intrusive as it'll get I'd say. Maybe we should try this out in a SLiM build and see if it breaks anything @petrelharp?

@benjeffery
Member Author

Added C tests - realised there are some interactions with TSK_TAKE_OWNERSHIP and tested those, but they seem to make sense to me.

@benjeffery benjeffery marked this pull request as ready for review June 12, 2025 14:44
Member

@jeromekelleher jeromekelleher left a comment

Looks good. Spotted a few things along the way.

I think the thing to do now is to get C API users to try this out and see what breaks.

c/tskit/tables.c Outdated
/* Set the mutation parent to TSK_NULL so that we don't error
in the integrity check on data we're about to overwrite.
*/
tsk_memset(
Member

I don't think we need to do this, and we shouldn't update the data if we error. TSK_CHECK_TREES doesn't imply TSK_CHECK_MUTATION_PARENTS.

Member Author

Good point about not changing if we error. Here I was thinking about the fact that TSK_CHECK_TREES implies TSK_CHECK_MUTATION_ORDERING which checks order based on the existing parent column.

Fixed in 829c9ce

@benjeffery
Member Author

benjeffery commented Jun 13, 2025

So actually not performing a double check makes this work much faster.

We're now at 0.7s for main and 0.9s for this branch!

@jeromekelleher
Member

> We're now at 0.7s for main and 0.9s for this branch!

Excellent - with #3218 we'll get this down to something negligible.

Looks like there's a few error conditions to be hit in the C tests? (Or is this just codecov not updating quickly enough or something?)

@benjeffery benjeffery force-pushed the check-mutation-parents branch from a45be79 to 5158ce1 Compare June 13, 2025 11:35
@benjeffery benjeffery force-pushed the check-mutation-parents branch from 5158ce1 to b6f7b4d Compare June 13, 2025 11:39
@benjeffery
Member Author

Ah, I'd covered the functionality in Python, but not in C. Fixed now, the only uncovered lines are memory errors.

@benjeffery
Member Author

One more thought here - there will/may be tree sequences on disk that will now error on load. They will have to be loaded as tables and the parents calculated. This process should be documented in the error message I guess.

@jeromekelleher
Member

> One more thought here - there will/may be tree sequences on disk that will now error on load. They will have to be loaded as tables and the parents calculated. This process should be documented in the error message I guess.

Yes, good point

I think the main thing is now to get some feedback on how much this is going to break in the real world.

@molpopgen @bhaller would it be possible to try out this branch downstream in your test suites and see if there's much breakage? It would really help to get a sense of how intrusive this change is going to be.

@bhaller

bhaller commented Jun 13, 2025

> I think the main thing is now to get some feedback on how much this is going to break in the real world.
>
> @molpopgen @bhaller would it be possible to try out this branch downstream in your test suites and see if there's much breakage? It would really help to get a sense of how intrusive this change is going to be.

@petrelharp can this ball fall in your court? I'm presently quite swamped getting ready for SLiM workshop teaching, and I'm not really sure what this would entail anyway – I suspect you have a better grasp of it than I do. (I don't recall being aware that mutations even have parents, lol.) @jeromekelleher what's the timeframe for needing feedback on this? Is there urgency here?

@jeromekelleher
Member

> @jeromekelleher what's the timeframe for needing feedback on this? Is there urgency here?

No major hurry, we can leave this sit for a few weeks.

@molpopgen
Member

I'm not sure how I could test this?

@molpopgen
Member

Looking at the diff to the changelog, I don't expect this to affect anything downstream. The only possible nuisance is that I'll have to remember to make the new option flags public in tskit-rust once there is a new release of the C API and I upgrade that package.

@molpopgen
Member

One question:

- Add ``TSK_CHECK_MUTATION_PARENTS`` option to ``tsk_table_collection_check_integrity``
  to check that mutation parents are consistent with the tree sequence topology.
  This option implies ``TSK_CHECK_TREES``.
  (:user:`benjeffery`, :issue:`2729`, :issue:`2732`, :pr:`3212`).

What does "implies" mean in this context? Does it mean that if "check mut parents" is 1, then the internal code will also check trees? Or does it mean that if "check mut parents" is 1, then "check trees" MUST ALSO be 1, and therefore the flags in a bitfield are no longer additive? (I'd have an issue with the latter from a design perspective.)

@benjeffery
Member Author

> Does it mean that if "check mut parents" is 1, then the internal code will also check trees

It means this.
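
A minimal sketch of this reading of "implies", written in Python rather than C. The flag names and bit values are hypothetical stand-ins, not the constants from the tskit headers, and `effective_checks` is an illustrative helper. The caller's flags stay independent bits; the checker simply ORs in the tree check when the mutation-parent check is requested.

```python
from enum import IntFlag


class Check(IntFlag):
    # Hypothetical bit values, for illustration only.
    TREES = 1 << 0
    MUTATION_PARENTS = 1 << 1


def effective_checks(options):
    """Return the checks that actually run for the options the caller passed."""
    if options & Check.MUTATION_PARENTS:
        # Requesting the mutation-parent check switches the tree check on
        # internally; the caller is not required to set both bits.
        options |= Check.TREES
    return options


requested = Check.MUTATION_PARENTS
ran = effective_checks(requested)
print(repr(requested), repr(ran))
```

The caller's `requested` value is left untouched, so combining flags with `|` keeps working as before.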

@molpopgen
Member

Thanks -- in that case, I think adding a test to tskit-rust that inits a treeseq w/o mutation parents being calculated should suffice. If I understand correctly, it will pass now but fail later once the C API gets updated? The failing test should serve as a reminder of what to do.

@benjeffery
Member Author

That's correct, setting all the parents to TSK_NULL will fire the error - but only if there is topology and mutations inconsistent with this.
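
As a toy illustration of when all-NULL parents are consistent (a sketch under simplifying assumptions, not the check in this PR): it models a single tree as a child-to-parent dict and mutations as (site, node) pairs, and the function name `all_null_parents_consistent` is hypothetical. All-NULL parents hold up only when no mutation has another mutation at the same site on its own node or on an ancestor.

```python
NULL = -1  # stand-in for TSK_NULL in this toy model


def all_null_parents_consistent(mutations, tree_parent):
    """True if leaving every mutation parent as NULL agrees with the topology."""
    occupied = set()
    for site, node in mutations:
        if (site, node) in occupied:
            # Two mutations at the same site on one node: one must parent the other.
            return False
        occupied.add((site, node))
    for site, node in mutations:
        u = tree_parent.get(node, NULL)
        while u != NULL:
            if (site, u) in occupied:
                # A mutation at the same site sits above this one.
                return False
            u = tree_parent.get(u, NULL)
    return True


# Single tree: leaves 0 and 1 both hang off root 2.
tree_parent = {0: 2, 1: 2, 2: NULL}

siblings = [(0, 0), (0, 1)]          # same site, separate branches
stacked = [(0, 2), (0, 0)]           # same site, one above the other
different_sites = [(0, 2), (1, 0)]   # one above the other, but different sites

print(all_null_parents_consistent(siblings, tree_parent))
print(all_null_parents_consistent(stacked, tree_parent))
print(all_null_parents_consistent(different_sites, tree_parent))
```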

@benjeffery
Member Author

I also note that #2923 suggests that all SINGER-generated files will have all-NULL mutation parents.
