Speed up identical molecule detection #2025

j-wags · 2025-02-23T20:39:36Z

Addresses Topology.identical_molecule_groups scales poorly with many large molecules #2008
Document change in Molecule.ordered_connection_table_hash
Add tests
Update docstrings/documentation, if applicable
Lint codebase
Update changelog

Using the excellent reproducing code snippet from #2008 (though restricting mol size to 10^3 instead of 10^4) we see about a 50x speedup

Before

After

codecov · 2025-02-23T20:45:52Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.60%. Comparing base (f4fff54) to head (86047af).
Report is 1 commits behind head on main.

mattwthompson · 2025-02-24T17:28:05Z

I'd like to volunteer to be the reviewer once you think it's ready for feedback.

j-wags · 2025-02-25T16:30:16Z

Thanks @mattwthompson - I'm gonna wait for feedback from @hannaomi and/or @timbernat to ensure this addresses the performance bottleneck they were seeing. So this won't make it into today's release but I can cut another one whenever this is ready.

hannaomi · 2025-02-25T16:34:15Z

Thanks @j-wags ,

I am re-running the same benchmarks at the moment. It might take a couple of days to get the full dataset due to supercomputing constraints. I will post the results here ASAP.

hannaomi · 2025-03-03T13:35:35Z

Hi both,

Many thanks for your patience whilst I collected the rest of the data. As you can see I did not see a significant difference in the create interchange run time at higher unique molecule numbers, despite initially seeing a time saving with smaller topologies. Perhaps there is a significant difference in the way I am preparing my topologies vs. the reproducing code?

@j-wags happy to talk this over in our meeting on Thursday if that would work for you?

j-wags · 2025-03-21T20:22:01Z

As an update - @hannaomi and I met after the last message and she handed off a reproducing example of the whole workflow to me. @mattwthompson took a look at this example and opened several issues for performance improvement with the polymer-performance label. Those issues are getting triaged for our upcoming sprints.

mattwthompson · 2025-03-28T15:12:39Z

Is there a downside of moving forward with this change now?

j-wags · 2025-03-28T17:42:43Z

I'd put this together quickly as a proof of concept solution to this use case, without considering its effect on anything else. I'll give this a once-over today and will open for review shortly if it seems safe.

Yoshanuikabundi

This speedup is great!

I think the approach of checking for equality by comparing Python builtin hash()es of the relevant information is very dangerous. As written, this PR doesn't actually do the same thing as the original code because of the possibility of a hash collision. Python's hash() builtin doesn't use a cryptographic hash function and is quite vulnerable to collisions; the canonical example is hash(-2) == hash(-1) (e.g., see https://stackoverflow.com/a/68048420). A hash in, say, a dict is only used to narrow down the search, and the keys are still directly compared: the actual data structure is like dict[int, list[tuple[KeyType, ValueType]]], with the list hopefully being length 1. The builtin hash function also salts string inputs, so hashes are not consistent from one Python process to the next even in the same environment (try running python -c 'print(hash("hi"))' a few times in a row, and see https://docs.python.org/3/reference/datamodel.html#object.__hash__). This might cause a serious issue if a Molecule gets pickled.
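To make the collision concern concrete, here is a minimal sketch (not code from this PR). It relies on a CPython implementation detail, and the helper names `same_by_hash` and `same_by_value` are hypothetical, introduced only for illustration.

```python
# CPython detail: -1 is reserved as an error return value in the C hashing
# API, so hash(-1) is remapped to -2 and therefore collides with hash(-2).
a, b = -1, -2
collide = hash(a) == hash(b)

# A dict uses the hash only to choose where to look; the keys themselves are
# then compared with ==, so colliding keys still get separate entries.
table = {a: "minus one", b: "minus two"}


def same_by_hash(x, y):
    """Unsafe equality check (hypothetical helper): trusts the hash alone."""
    return hash(x) == hash(y)


def same_by_value(x, y):
    """Safe equality check (hypothetical helper): compares the values."""
    return x == y
```

On CPython, `same_by_hash(-1, -2)` reports a match even though the values differ, while the dict keeps both keys apart. An equality check built only on `hash()` lacks that second comparison step.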

I think the chances of a hash collision are worth taking seriously - partly because the hash, on my machine, is only 64 bits long, much shorter than the string that would be fed into it, but more because it would be such an infuriating and embarrassing bug to run into in the unlikely case that it came up.

I've done some testing and it seems that the vast majority of the speedup comes from looping over the atoms and assigning their _molecule_atom_index attribute. Once that's done, the difference between comparing hashes or just comparing the big id string that gets hashed is basically nil. I've also found that comparing tuples of the attributes you want to compare is much faster than iterating over the actual attributes; in other words,

    def _is_exactly_the_same_as(self, other):
        self_id = (
            tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in self.atoms),
            tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in self.bonds),
        )
        other_id = (
            tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in other.atoms),
            tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in other.bonds),
        )
        return self_id == other_id

Is about 5x faster (in this case where the molecules are large and the same) than

    def _is_exactly_the_same_as(self, other):
        for atom1, atom2 in zip(self.atoms, other.atoms):
            if (
                (atom1.atomic_number != atom2.atomic_number)
                or (atom1.formal_charge != atom2.formal_charge)
                or (atom1.is_aromatic != atom2.is_aromatic)
                or (atom1.stereochemistry != atom2.stereochemistry)
            ):
                return False
        for bond1, bond2 in zip(self.bonds, other.bonds):
            if (
                (bond1.atom1_index != bond2.atom1_index)
                or (bond1.atom2_index != bond2.atom2_index)
                or (bond1.is_aromatic != bond2.is_aromatic)
                or (bond1.stereochemistry != bond2.stereochemistry)
            ):
                return False
        return True

In fact, that first implementation of _is_exactly_the_same_as() is just as fast as computing and comparing the hashes if the molecule atom indices are already assigned.

So I'd propose just assigning each atom's _molecule_atom_index in Molecule._add_atom() (the value is already computed in the return statement!), and then changing _is_exactly_the_same_as to construct and compare big tuples. This would mean that there would not need to be any change in the ordered_connection_table_hash() function, though I would suggest we still update its docstring to say that hashes cannot be used across different Python processes.
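A rough, self-contained sketch of that proposal follows. It uses stand-in classes rather than the real toolkit: `ToyAtom`, `ToyBond`, `ToyMolecule`, `_identity`, and `build_water` are all hypothetical names. One simplification to note is that the toy bonds store atom indices directly, whereas the real `Bond` presumably derives them from its atoms.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyAtom:
    atomic_number: int
    formal_charge: int = 0
    stereochemistry: Optional[str] = None
    # Assigned once by ToyMolecule.add_atom(), so no later loop is needed.
    _molecule_atom_index: Optional[int] = None


@dataclass
class ToyBond:
    atom1_index: int
    atom2_index: int
    bond_order: int
    stereochemistry: Optional[str] = None


@dataclass
class ToyMolecule:
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)

    def add_atom(self, atomic_number, formal_charge=0, stereochemistry=None):
        # The index is already known here, so cache it on the atom.
        index = len(self.atoms)
        self.atoms.append(
            ToyAtom(atomic_number, formal_charge, stereochemistry, index)
        )
        return index

    def add_bond(self, atom1_index, atom2_index, bond_order, stereochemistry=None):
        self.bonds.append(
            ToyBond(atom1_index, atom2_index, bond_order, stereochemistry)
        )

    def _identity(self):
        # One big tuple per molecule; tuple == compares lengths and contents.
        return (
            tuple(
                (a.atomic_number, a.formal_charge, a.stereochemistry)
                for a in self.atoms
            ),
            tuple(
                (b.bond_order, b.stereochemistry, b.atom1_index, b.atom2_index)
                for b in self.bonds
            ),
        )

    def is_exactly_the_same_as(self, other):
        return self._identity() == other._identity()


def build_water():
    """Build a toy water molecule: O bonded to two H atoms."""
    mol = ToyMolecule()
    o = mol.add_atom(8)
    h1 = mol.add_atom(1)
    h2 = mol.add_atom(1)
    mol.add_bond(o, h1, 1)
    mol.add_bond(o, h2, 1)
    return mol
```

Because tuple equality compares lengths as well as elements, molecules with different atom or bond counts compare unequal without a separate length check. A bare `zip()` loop would not catch that unless the counts are compared beforehand.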

If that approach is no good, I think this is at least blocked by needing a collision-resistant hash function like hashlib.sha3_224. Incidentally, this would not involve any salting, so hashes could be compared across Python processes (within a given version of the Toolkit) - however it would mean changing the return type from int to bytes. It feels like overkill, but since it didn't slow the benchmark down at all on my machine, I think it's worth it.
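For illustration only, a collision-resistant variant might look like the sketch below. The function name `connection_table_digest` and the choice to serialize with `repr()` are assumptions made for this example, not the toolkit's implementation; only `hashlib.sha3_224` comes from the comment above.

```python
import hashlib


def connection_table_digest(atom_records, bond_records):
    """Hypothetical helper: hash plain-data atom and bond records with SHA3-224.

    The records are assumed to be tuples of ints, strings, and None, so that
    repr() yields the same text in every Python process.
    """
    payload = repr((tuple(atom_records), tuple(bond_records))).encode("utf-8")
    # digest() returns bytes (28 bytes for SHA3-224), not an int like hash().
    return hashlib.sha3_224(payload).digest()


# (atomic_number, formal_charge, stereochemistry) for each atom
water_atoms = [(8, 0, None), (1, 0, None), (1, 0, None)]
# (bond_order, stereochemistry, atom1_index, atom2_index) for each bond
water_bonds = [(1, None, 0, 1), (1, None, 0, 2)]

digest = connection_table_digest(water_atoms, water_bonds)
```

Unlike the builtin `hash()`, this digest is not salted per process, so the same records give the same bytes in a later run. The cost is the change of return type from `int` to `bytes` noted above.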

openff/toolkit/topology/molecule.py

Yoshanuikabundi

Just one more change to make sure we're still getting the main optimization and this is good to go!

openff/toolkit/topology/molecule.py

Co-authored-by: Josh A. Mitchell <[email protected]>

hannaomi · 2025-05-12T14:52:39Z

Hi all,

I have rerun my reproducing example with this new commit and see much faster performance when parameterizing 50 unique polymer chains in one topology. For reference, I was previously seeing a runtime of approx. 40,000 seconds for 50 chains.

Many thanks!

j-wags · 2025-05-12T15:03:22Z

Fantastic, thanks for the feedback! There's lots to unpack in the shape of that plot, but a 40x speedup for the worst case is good news. This PR is now in the OpenFF Toolkit 0.16.9 release, so you shouldn't need to install from the branch/commit any more :-)

speed up molecule comparison (should help with #2008)

2a106c0

j-wags self-assigned this Mar 4, 2025

j-wags added 3 commits March 28, 2025 12:18

Merge branch 'main' into faster-ident-mol-grps

e18de9d

update docstring and releasehistory

21ea1db

remove trailing whitespace

b40cb6c

j-wags marked this pull request as ready for review March 28, 2025 20:05

jameseastwood unassigned j-wags Mar 28, 2025

Yoshanuikabundi requested changes Apr 2, 2025

mattwthompson added the polymer-performance Runtime of loading and/or parametrizing (bio)polymers label Apr 7, 2025

j-wags and others added 2 commits April 21, 2025 10:04

Apply some suggestions from code review

4301d1f

Merge branch 'main' into faster-ident-mol-grps

1879c22

Yoshanuikabundi approved these changes Apr 24, 2025

j-wags and others added 2 commits April 24, 2025 08:46

Apply suggestions from code review

97b6cce

Co-authored-by: Josh A. Mitchell <[email protected]>

fix whitespace

86047af

j-wags merged commit e735c2b into main Apr 24, 2025
17 checks passed

j-wags deleted the faster-ident-mol-grps branch April 24, 2025 20:42

hannaomi mentioned this pull request May 21, 2025

Poor scaling when creating an interchange with multiple unique components vs. individual components openforcefield/openff-interchange#1156

Open
