Biomolecular Emulator (BioEmu for short) is a model that samples from the approximated equilibrium distribution of structures for a protein monomer, given its amino acid sequence.
For more information, see our preprint.
This repository contains inference code and model weights.
Run setup.sh to create a conda environment named 'bioemu' with bioemu and its dependencies installed. The script installs and patches ColabFold, creates the 'bioemu' conda environment with the dependencies that pip does not handle, and then pip-installs the bioemu package inside that environment.
You can sample structures for a given protein sequence using the script sample.py. To run a tiny test using the default model parameters and denoising settings:
python -m bioemu.sample --sequence GYDPETGTWG --num_samples 10 --output_dir ~/test-chignolin
The model parameters are downloaded automatically from Hugging Face. See sample.py for more options.
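To sample several sequences in one go, the command above can be wrapped in a small script. The following is a minimal sketch, not part of the repository: build_sample_command and sample_all are hypothetical helper names, and only the --sequence, --num_samples, and --output_dir flags shown above are used. It assumes it is run inside the 'bioemu' conda environment.

```python
import subprocess
import sys
from pathlib import Path


def build_sample_command(sequence: str, num_samples: int, output_dir: Path) -> list[str]:
    """Return the argument list for one `python -m bioemu.sample` call."""
    return [
        sys.executable, "-m", "bioemu.sample",
        "--sequence", sequence,
        "--num_samples", str(num_samples),
        "--output_dir", str(output_dir),
    ]


def sample_all(sequences: dict[str, str], num_samples: int, root: Path) -> None:
    """Run sampling once per named sequence, each into its own subdirectory."""
    for name, sequence in sequences.items():
        cmd = build_sample_command(sequence, num_samples, root / name)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

For example, sample_all({"chignolin": "GYDPETGTWG"}, num_samples=10, root=Path.home() / "test-bioemu") would write the chignolin samples to a chignolin subdirectory.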
Sampling times depend on sequence length and available infrastructure. The following table gives the time needed to collect 1000 samples on an A100 GPU with 80 GB VRAM for sequences of different lengths (using a batch_size_100=20 setting in sample.py):
| sequence length | time (min) |
|---|---|
| 100 | 4 |
| 300 | 40 |
| 600 | 150 |
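For sequence lengths between the tabulated values, a rough estimate can be obtained by interpolating the table. The sketch below interpolates linearly in log-log space, which is an assumption on our part rather than a measured scaling law, and it further assumes that time scales linearly with the number of samples. estimate_minutes is a hypothetical helper, and timings on other hardware will differ.

```python
import math

# (sequence length, minutes per 1000 samples) from the table above (A100, 80 GB).
MEASURED = [(100, 4.0), (300, 40.0), (600, 150.0)]


def estimate_minutes(length: int, num_samples: int = 1000) -> float:
    """Rough sampling-time estimate by log-log interpolation of MEASURED.

    Lengths outside the tabulated range are rejected rather than extrapolated.
    """
    lo, hi = MEASURED[0][0], MEASURED[-1][0]
    if not lo <= length <= hi:
        raise ValueError(f"length must be in [{lo}, {hi}], got {length}")
    for (x0, y0), (x1, y1) in zip(MEASURED, MEASURED[1:]):
        if x0 <= length <= x1:
            # Fraction of the way from x0 to x1 on a logarithmic axis.
            frac = (math.log(length) - math.log(x0)) / (math.log(x1) - math.log(x0))
            per_1000 = math.exp(math.log(y0) + frac * (math.log(y1) - math.log(y0)))
            # Assumption: cost is proportional to the number of samples.
            return per_1000 * num_samples / 1000
    raise AssertionError("unreachable: length is within the tabulated range")
```

Treat the result as an order-of-magnitude guide for planning runs, not as a benchmark.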
You can use this code together with code from bioemu-benchmarks to approximately reproduce results from our preprint.
The bioemu-v1.0 checkpoint contains the model weights used to produce the results in the preprint. Due to simplifications made in the embedding computation and a more efficient sampler, the results obtained with this code are not identical to those in the preprint, but they are consistent with the statistics shown there, i.e., mode coverage and free energy errors averaged over the proteins in a test set. Results for individual proteins may differ. For more details, please check the BIOEMU_RESULTS.md document in the bioemu-benchmarks repository.
If you are using our code or model, please consider citing our work:
@article {BioEmu2024,
author = {Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew Y. K. and Satorras, Victor Garc{\'\i}a and Abdin, Osama and Veeling, Bastiaan S. and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Clementi, Cecilia and No{\'e}, Frank},
title = {Scalable emulation of protein equilibrium ensembles with generative deep learning},
year = {2024},
doi = {10.1101/2024.12.05.626885},
journal = {bioRxiv}
}
BioEmu outputs structures in backbone frame representation. To reconstruct the side-chains, several tools are available. As an example, we provide a script to conduct side-chain reconstruction with HPacker (https://github.com/gvisani/hpacker), and provide an interface for running a short molecular dynamics (MD) equilibration. HPacker is a method for protein side-chain packing based on holographic rotationally equivariant convolutional neural networks (https://arxiv.org/abs/2311.09312).
This code is experimental and is provided for research purposes only. Further testing and development are needed before considering its application in real-world scenarios or production environments.
Clone and install the HPacker code and other dependencies with
./setup_sidechain_relax.sh
This will install some additional dependencies for running MD relaxation in the bioemu environment. It will also install HPacker in a separate conda environment called hpacker.
Inside the bioemu environment, run side-chain reconstruction with:
python -m bioemu.sidechain_relax --pdb-path path/to/topology.pdb --xtc-path path/to/samples.xtc
By default, side-chain reconstruction and local energy minimization are performed (no full MD integration for efficiency reasons). Note that the runtime of this code scales with the size of the system. We suggest running this code on a selection of samples rather than the full set.
There are two other options:
- To only run side-chain reconstruction without MD equilibration, add --no-md-equil.
- To run a short NVT equilibration (0.1 ns), add --md-protocol nvt_equil.
To see the full list of options, call python -m bioemu.sidechain_relax --help.
The script saves reconstructed all-heavy-atom structures in samples_sidechain_rec.{pdb,xtc} and MD-equilibrated structures in samples_md_equil.{pdb,xtc} (the base filename can be changed with --outname other_name).
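Because the runtime grows with system size, it can help to thin the trajectory before side-chain reconstruction, in line with the suggestion to run on a selection of samples. The following sketch picks evenly spaced frame indices and writes them to a new XTC file with MDTraj. It assumes MDTraj is importable in the active environment; select_evenly and write_subset are hypothetical helpers, not part of bioemu, and all file names are placeholders.

```python
def select_evenly(num_frames: int, num_keep: int) -> list[int]:
    """Return up to num_keep evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or num_keep <= 0:
        return []
    if num_keep >= num_frames:
        return list(range(num_frames))
    # Integer arithmetic keeps the indices strictly increasing and in range.
    return [(i * num_frames) // num_keep for i in range(num_keep)]


def write_subset(pdb_path: str, xtc_path: str, out_path: str, num_keep: int) -> int:
    """Save an evenly spaced subset of frames; returns how many were written."""
    import mdtraj  # imported lazily so select_evenly works without MDTraj

    traj = mdtraj.load(xtc_path, top=pdb_path)
    subset = traj[select_evenly(traj.n_frames, num_keep)]
    subset.save_xtc(out_path)
    return subset.n_frames
```

The thinned file can then be passed to the reconstruction script through --xtc-path in place of the full samples.xtc, with the same --pdb-path as before.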
The code in the openfold subdirectory is copied from openfold with minor modifications. The modifications are described in the relevant source files.
If you have any questions not covered here, please create an issue or contact the BioEmu team by writing to the corresponding author on our preprint.