Skip to content

Speed up identical molecule detection #2025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 24, 2025
Merged

Speed up identical molecule detection #2025

merged 8 commits into from
Apr 24, 2025

Conversation

j-wags
Copy link
Member

@j-wags j-wags commented Feb 23, 2025

Using the excellent reproducing code snippet from #2008 (though restricting mol size to 10^3 instead of 10^4) we see about a 50x speedup

Before
Screenshot 2025-02-23 at 1 32 11 PM

After
Screenshot 2025-02-23 at 1 31 53 PM

Copy link

codecov bot commented Feb 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.60%. Comparing base (f4fff54) to head (86047af).
Report is 1 commits behind head on main.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mattwthompson
Copy link
Member

I'd like to volunteer be the reviewer once you think it's ready for feedback

@j-wags
Copy link
Member Author

j-wags commented Feb 25, 2025

Thanks @mattwthompson - I'm gonna wait for feedback from @hannaomi and/or @timbernat to ensure this addresses the performance bottleneck they were seeing. So this won't make it into today's release but I can cut another one whenever this is ready.

@hannaomi
Copy link

Thanks @j-wags ,

I am re-running the same benchmarks at the moment. It might take a couple of days to get the full dataset due to supercomputing constraints. I will post the results here ASAP.

@hannaomi
Copy link

hannaomi commented Mar 3, 2025

Hi both,
before_commit
after_commit

Many thanks for your patience whilst I collected the rest of the data. As you can see I did not see a significant difference in the create interchange run time at higher unique molecule numbers, despite initially seeing a time saving with smaller topologies. Perhaps there is a significant difference in the way I am preparing my topologies vs. the reproducing code?

@j-wags happy to talk this over in our meeting on Thursday if that would work for you?

@j-wags j-wags self-assigned this Mar 4, 2025
@j-wags
Copy link
Member Author

j-wags commented Mar 21, 2025

As an update - @hannaomi and I met after the last message and she handed off a reproducing example of the whole workflow to me. @mattwthompson took a look at this example and opened several issues for performance improvement with the polymer-performance label. Those issues are getting triaged for our upcoming sprints.

@mattwthompson
Copy link
Member

Is there a downside of moving forward with this change now?

@j-wags
Copy link
Member Author

j-wags commented Mar 28, 2025

I'd put this together quickly as a proof of concept solution to this use case, without considering its effect on anything else. I'll give this a once-over today and will open for review shortly if it seems safe.

@j-wags j-wags marked this pull request as ready for review March 28, 2025 20:05
Copy link
Collaborator

@Yoshanuikabundi Yoshanuikabundi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This speedup is great!

I think the approach of checking for equality by comparing Python builtin hash()es of the relevant information is very dangerous. As written, this PR doesn't actually do the same thing as the original code because of the possibility of a hash collision. Python's hash() builtin doesn't use a cryptographic hash function and is quite vulnerable to collisons - The canonical example is hash(-2) == hash(-1) (eg, see https://stackoverflow.com/a/68048420). A hash in, say, a dict is only used to narrow down the search, the keys are still directly compared - the actual datastructure is like dict[int, list[tuple[KeyType, ValueType]], with the list hopefully being length 1. The builtin hash function also salts string inputs, so hashes are not consistent from one Python process to the next even in the same environment (try running python -c 'print(hash("hi"))' a few times in a row, and see https://docs.python.org/3/reference/datamodel.html#object.__hash__) - this might cause a serious issue if a Molecule gets pickled.

I think the chances of a hash collision are worth taking seriously - partly because the hash, on my machine, is only 64 bits long, much shorter than the string that would be fed into it, but more because it would be such an infuriating and embarrassing bug to run into in the unlikely case that it came up.

I've done some testing and it seems that the vast majority of the speedup comes from looping over the atoms and assigning their _molecule_atom_index attribute. Once that's done, the difference between comparing hashes or just comparing the big id string that gets hashed is basically nil. I've also found that comparing tuples of the attributes you want to compare is much faster than iterating over the actual attributes; in other words,

    def _is_exactly_the_same_as(self, other):
        self_id = (
            tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in self.atoms),
            tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in self.bonds),
        )
        other_id = (
            tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in other.atoms),
            tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in other.bonds),
        )
        return self_id == other_id

Is about 5x faster (in this case where the molecules are large and the same) than

    def _is_exactly_the_same_as(self, other):
        for atom1, atom2 in zip(self.atoms, other.atoms):
            if (
                (atom1.atomic_number != atom2.atomic_number)
                or (atom1.formal_charge != atom2.formal_charge)
                or (atom1.is_aromatic != atom2.is_aromatic)
                or (atom1.stereochemistry != atom2.stereochemistry)
            ):
                return False
        for bond1, bond2 in zip(self.bonds, other.bonds):
            if (
                (bond1.atom1_index != bond2.atom1_index)
                or (bond1.atom2_index != bond2.atom2_index)
                or (bond1.is_aromatic != bond2.is_aromatic)
                or (bond1.stereochemistry != bond2.stereochemistry)
            ):
                return False
        return True

In fact, that first implementation if _is_exactly_the_same_as() is just as fast as computing and comparing the hashes if the molecule atom indices are already assigned.

So I'd propose just assigning _molecule_atom_indices in Molecule._add_atom() (the value is already computed in the return statement!), and then changing _is_exactly_the_same_as to construct and compare big tuples. This would mean that there would not need to be any change in the ordered_connection_table_hash() function, though I would suggest we still update its docstring to say that hashes cannot be used across different Python processes.

If that approach is no good, I think this is at least blocked by needing a collision-resistant hash function like hashlib.sha3_224. Incidentally, this would not involve any salting, so hashes could be compared across Python processes (within a given version of the Toolkit) - however it would mean changing the return type from int to bytes. It feels like overkill, but since it didn't slow the benchmark down at all on my machine, I think it's worth it.

@mattwthompson mattwthompson added the polymer-performance Runtime of loading and/or parametrizing (bio)polymers label Apr 7, 2025
Copy link
Collaborator

@Yoshanuikabundi Yoshanuikabundi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more change to make sure we're still getting the main optimization and this is good to go!

@j-wags j-wags merged commit e735c2b into main Apr 24, 2025
17 checks passed
@j-wags j-wags deleted the faster-ident-mol-grps branch April 24, 2025 20:42
@hannaomi
Copy link

hannaomi commented May 12, 2025

Hi all,

I have reran my reproducing example with this new commit - much faster performance parameterizing 50 unique polymer chains in one topology. For reference I was seeing a runtime of approx 40'000 seconds for 50 chains previously.

Many thanks!
1879c22

@j-wags
Copy link
Member Author

j-wags commented May 12, 2025

Fantastic, thanks for the feedback! There's lots to unpack in the shape of that plot, but a 40x speedup for the worst case is good news. This PR is now in the OpenFF Toolkit 0.16.9 release, so you shouldn't need to install from the branch/commit any more :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
polymer-performance Runtime of loading and/or parametrizing (bio)polymers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants