The executor fails with multiple ligands #237
Thanks for writing in. I don't have an immediate workaround, but @jthorton or @Yoshanuikabundi might chime in. One rough tip might be to set the convergence criteria to something less strict.

Could you clarify whether the result retrieval fails for all molecules, or just the ones with convergence failures? Since psi4/xtb/qcengine tools are upstream, we can't directly implement fixes for them, and if the convergence failures are indeed due to randomness stemming from memory availability, successes and failures could both be valid outputs of those tools for the same input. So I think this will be a longer-term usability improvement on our end.

There are a number of things that can cause a BespokeFit run to fail, and they fall on a spectrum from "controlled failures" (like raising an error because bonds broke in QM, or because elements are out of scope for downstream tools) to "uncontrolled failures" (like a single unconverged optimization bringing down the entire database). Controlled failures are intentional and should come up in situations where we want to communicate that users are better off using the normal FF parameters than the outputs of BespokeFit. We may want to document these better so that people like you who build workflows know what to expect and can choose how to handle them.

So I think there are a few actions we can take here:
cc #224
Hey Jeff - thanks for the quick response! I just want to add that this is currently a pretty big roadblock for us benchmarking/testing this at redesign.

The reason we think it's a memory / thread allocation problem is that, apart from the error message above, we've also frequently observed hung jobs and seg-faulted jobs that complete upon restarting. Additionally, if we provide a list of ~10 ligands it will fail, but splitting the 10 ligands into 2 folders with ~5 ligands each and running two instantiations on the same hardware concurrently leads to both completing without issue. These are not particularly strange ligands with poor geometries; they're from FEP benchmark sets with crystal or crystal-adjacent poses. The hardware that we're testing on should also have plenty of resources (96 cores), but if there's a specific hardware config that's known to work we can probably get whatever is needed.

I wonder if there are some sub-optimized or hardware-agnostic celery calls that are working in the test environment but failing or causing issues on our end? If there's a debugging protocol we could run on our end, maybe that would tell us more.

Regardless, in the current setup it takes up enough time / babysitting that unfortunately we can't use it for any practical benchmarking. The few sets that we've completed have pretty significant changes and have provided interesting insights, so we're keen on getting this scaled up. Thanks!
Hi all, I think this is an issue with the ForceBalance optimisation stage (going by the error message), and unfortunately the error reporting is minimal, as this stage does not often fail to converge. Ideally we want the ForceBalance log file to be printed or saved somewhere for better debugging; we could look into adding that. If this is a memory issue you could try setting the ... There should also be some log files in a directory called ...
Thank you @jthorton. Here is a typical error message:
I checked the bespoke-executor directory and did not find much. The above message appears in the celery-optimizer.log file. The message indicates that the issue may be with the celery tool.
I had a brainstorming session on this issue with colleagues at redesign, and we came to the conclusion that we would like to try new strategies in an attempt to avoid executor/celery/redis issues. So I have a few questions/proposals:
Hi @ognjenperisic,

The new version of BespokeFit (0.2.1) includes an option to preserve more information from the ForceBalance run - try running the Executor with the ... You can definitely run the fragmenter directly! Something like:

import openff.bespokefit.executor.services._settings  # Workaround for a bug I just discovered
from openff.bespokefit.workflows import BespokeWorkflowFactory
from openff.fragmenter.fragment import WBOFragmenter
from openff.toolkit import Molecule
parent = Molecule.from_smiles("CNC(=O)C1=NN=C(C=C1NC2=CC=CC(=C2OC)C3=NN(C=N3)C)NC(=O)C4CC4")
workflow_factory = BespokeWorkflowFactory()
fragmenter = workflow_factory.fragmentation_engine # or instantiate WBOFragmenter directly
target_torsion_smirks = workflow_factory.target_torsion_smirks
fragmentation_result = fragmenter.fragment(parent, target_bond_smarts=target_torsion_smirks)
fragment_molecules = [fragment.molecule for fragment in fragmentation_result.fragments]

See https://docs.openforcefield.org/projects/fragmenter/en/stable/api.html

Unfortunately I'm not aware of any good way to run individual stages. You could try running the celery workers directly, but they're not documented and take raw JSON strings as input, so I wouldn't recommend it.
Hi @Yoshanuikabundi, thank you very much for your assistance. I managed to run BespokeFit (xtb code) individually, in parallel, by setting the BEFLOW_OPTIMIZER_KEEP_FILES, BEFLOW_GATEWAY_PORT, and BEFLOW_REDIS_PORT environment variables. Here is how I did that.
The ligand_index variable has a double role: it is used to read a ligand from the SDF file, but it also adds an integer value to the ...
The environment variables are set before declaring the factory. I prepare a directory for each ligand and then copy the script and parameters.yaml files to each of the directories. In each directory, I just change the ligand_index in the .yaml file and start the executor. This method makes it possible to run multiple executors/redis databases on the same machine at the same time. It also enables better control of the execution, but the drawback is that calculations are repeated if some segments are shared among the ligands.
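The per-ligand environment setup described above can be sketched as follows. The base port numbers and the hard-coded index are illustrative assumptions, not values from the original post (in the original setup the index is read from parameters.yaml):

```python
import os

# Index of the ligand handled by this directory (read from parameters.yaml
# in the original setup; hard-coded here for illustration).
ligand_index = 3

# Keep the ForceBalance files for debugging, and give each executor/redis
# pair unique ports so several instances can share one machine.
# The base ports (8000 and 6379) are assumptions for illustration.
os.environ["BEFLOW_OPTIMIZER_KEEP_FILES"] = "True"
os.environ["BEFLOW_GATEWAY_PORT"] = str(8000 + ligand_index)
os.environ["BEFLOW_REDIS_PORT"] = str(6379 + ligand_index)
```

These must be set before the workflow factory and executor are created, as noted above, so that each instance picks up its own ports.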
Dear members of the bespokeFit community,
I need your help/advice. I have a problem processing multiple ligands (ligand sets in parallel) with BespokeFit. Jobs often fail, usually at the optimization stage (see the error message below). A failed task can usually be finished if it is restarted, or if the ligand is processed as a single BespokeFit job. I tested BespokeFit with the fep-benchmark set and with proprietary ligands, mostly with the semiempirical xtb code and a few with psi4.
I followed @jthorton's advice given here previously (#215).
I have a feeling that the issue has something to do with memory management. I gave the scripts I use to a colleague of mine and his jobs failed multiple times on a Linux machine he uses as a personal computer.
I would appreciate any advice you can share.
I run BespokeFit with 1 or 2 cores (with psi4 jobs I use 4):
When the jobs are done I extract the individual FFs with the following code:
and merge them with
Error message:
Hardware:
OS: