-
Notifications
You must be signed in to change notification settings - Fork 98
Speed up identical molecule detection #2025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
🚀 New features to boost your workflow:
|
I'd like to volunteer be the reviewer once you think it's ready for feedback |
Thanks @mattwthompson - I'm gonna wait for feedback from @hannaomi and/or @timbernat to ensure this addresses the performance bottleneck they were seeing. So this won't make it into today's release but I can cut another one whenever this is ready. |
Thanks @j-wags , I am re-running the same benchmarks at the moment. It might take a couple of days to get the full dataset due to supercomputing constraints. I will post the results here ASAP. |
Many thanks for your patience whilst I collected the rest of the data. As you can see I did not see a significant difference in the create interchange run time at higher unique molecule numbers, despite initially seeing a time saving with smaller topologies. Perhaps there is a significant difference in the way I am preparing my topologies vs. the reproducing code? @j-wags happy to talk this over in our meeting on Thursday if that would work for you? |
As an update - @hannaomi and I met after the last message and she handed off a reproducing example of the whole workflow to me. @mattwthompson took a look at this example and opened several issues for performance improvement with the |
Is there a downside of moving forward with this change now? |
I'd put this together quickly as a proof of concept solution to this use case, without considering its effect on anything else. I'll give this a once-over today and will open for review shortly if it seems safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This speedup is great!
I think the approach of checking for equality by comparing Python builtin hash()
es of the relevant information is very dangerous. As written, this PR doesn't actually do the same thing as the original code because of the possibility of a hash collision. Python's hash()
builtin doesn't use a cryptographic hash function and is quite vulnerable to collisons - The canonical example is hash(-2) == hash(-1)
(eg, see https://stackoverflow.com/a/68048420). A hash in, say, a dict is only used to narrow down the search, the keys are still directly compared - the actual datastructure is like dict[int, list[tuple[KeyType, ValueType]]
, with the list hopefully being length 1. The builtin hash function also salts string inputs, so hashes are not consistent from one Python process to the next even in the same environment (try running python -c 'print(hash("hi"))'
a few times in a row, and see https://docs.python.org/3/reference/datamodel.html#object.__hash__) - this might cause a serious issue if a Molecule
gets pickled.
I think the chances of a hash collision are worth taking seriously - partly because the hash, on my machine, is only 64 bits long, much shorter than the string that would be fed into it, but more because it would be such an infuriating and embarrassing bug to run into in the unlikely case that it came up.
I've done some testing and it seems that the vast majority of the speedup comes from looping over the atoms and assigning their _molecule_atom_index
attribute. Once that's done, the difference between comparing hashes or just comparing the big id
string that gets hashed is basically nil. I've also found that comparing tuples of the attributes you want to compare is much faster than iterating over the actual attributes; in other words,
def _is_exactly_the_same_as(self, other):
self_id = (
tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in self.atoms),
tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in self.bonds),
)
other_id = (
tuple((atom.atomic_number, atom.formal_charge.magnitude, atom.stereochemistry) for atom in other.atoms),
tuple((bond.bond_order, bond.stereochemistry, bond.atom1_index, bond.atom2_index) for bond in other.bonds),
)
return self_id == other_id
Is about 5x faster (in this case where the molecules are large and the same) than
def _is_exactly_the_same_as(self, other):
for atom1, atom2 in zip(self.atoms, other.atoms):
if (
(atom1.atomic_number != atom2.atomic_number)
or (atom1.formal_charge != atom2.formal_charge)
or (atom1.is_aromatic != atom2.is_aromatic)
or (atom1.stereochemistry != atom2.stereochemistry)
):
return False
for bond1, bond2 in zip(self.bonds, other.bonds):
if (
(bond1.atom1_index != bond2.atom1_index)
or (bond1.atom2_index != bond2.atom2_index)
or (bond1.is_aromatic != bond2.is_aromatic)
or (bond1.stereochemistry != bond2.stereochemistry)
):
return False
return True
In fact, that first implementation if _is_exactly_the_same_as()
is just as fast as computing and comparing the hashes if the molecule atom indices are already assigned.
So I'd propose just assigning _molecule_atom_indices
in Molecule._add_atom()
(the value is already computed in the return statement!), and then changing _is_exactly_the_same_as
to construct and compare big tuples. This would mean that there would not need to be any change in the ordered_connection_table_hash()
function, though I would suggest we still update its docstring to say that hashes cannot be used across different Python processes.
If that approach is no good, I think this is at least blocked by needing a collision-resistant hash function like hashlib.sha3_224
. Incidentally, this would not involve any salting, so hashes could be compared across Python processes (within a given version of the Toolkit) - however it would mean changing the return type from int
to bytes
. It feels like overkill, but since it didn't slow the benchmark down at all on my machine, I think it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one more change to make sure we're still getting the main optimization and this is good to go!
Co-authored-by: Josh A. Mitchell <[email protected]>
Fantastic, thanks for the feedback! There's lots to unpack in the shape of that plot, but a 40x speedup for the worst case is good news. This PR is now in the OpenFF Toolkit 0.16.9 release, so you shouldn't need to install from the branch/commit any more :-) |
Topology.identical_molecule_groups
scales poorly with many large molecules #2008Using the excellent reproducing code snippet from #2008 (though restricting mol size to 10^3 instead of 10^4) we see about a 50x speedup
Before

After
