Inference code for scalable emulation of protein equilibrium ensembles with generative deep learning

microsoft/bioemu

DOI: 10.1101/2024.12.05.626885. Requires Python 3.10+.

Biomolecular Emulator (BioEmu)

Biomolecular Emulator (BioEmu for short) is a model that samples from the approximated equilibrium distribution of structures for a protein monomer, given its amino acid sequence.

For more information, see our preprint.

This repository contains inference code and model weights.

Installation

Run setup.sh to create a conda environment named 'bioemu' with bioemu and its dependencies installed. The script installs and patches ColabFold, creates the conda environment with the dependencies that pip does not handle, and then pip-installs the bioemu package inside that environment.

Sampling structures

You can sample structures for a given protein sequence using the script sample.py. To run a tiny test using the default model parameters and denoising settings:

python -m bioemu.sample --sequence GYDPETGTWG --num_samples 10 --output_dir ~/test-chignolin

The model parameters will be downloaded automatically from Hugging Face. See sample.py for more options.
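When sampling is launched from another Python script, the command above can be assembled programmatically. The following is a minimal sketch, not part of bioemu: the helper name build_sample_command and the check against the 20 standard one-letter amino acid codes are assumptions made for illustration, and only the three flags shown in the command above are used.

```python
import shlex
import sys
from pathlib import Path

# Assumption for this sketch: restrict input to the 20 standard
# one-letter amino acid codes. bioemu itself may accept other input.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def build_sample_command(sequence, num_samples, output_dir):
    """Return the argument list for `python -m bioemu.sample` (hypothetical helper)."""
    sequence = sequence.strip().upper()
    unknown = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if not sequence or unknown:
        raise ValueError(f"empty sequence or unexpected residue codes: {unknown}")
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return [
        sys.executable, "-m", "bioemu.sample",
        "--sequence", sequence,
        "--num_samples", str(num_samples),
        # Expand '~' here because no shell will do it for us.
        "--output_dir", str(Path(output_dir).expanduser()),
    ]


if __name__ == "__main__":
    cmd = build_sample_command("GYDPETGTWG", 10, "~/test-chignolin")
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the bioemu environment.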

Sampling times will depend on sequence length and available infrastructure. The following table gives times for collecting 1000 samples measured on an A100 GPU with 80 GB VRAM for sequences of different lengths (using a batch_size_100=20 setting in sample.py):

sequence length    time / min
100                4
300                40
600                150
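As a rough reading of the table, the short sketch below computes the log-log slope between consecutive rows, that is, the exponent p in time ∝ length^p. The function name loglog_slopes is hypothetical, and three measurements on one GPU do not establish a scaling law, so treat the result as a planning aid only.

```python
import math

# Timings copied from the table above:
# sequence length -> minutes to collect 1000 samples.
TIMINGS_MIN = {100: 4, 300: 40, 600: 150}


def loglog_slopes(timings):
    """Return the log-log slope between each pair of consecutive table rows."""
    points = sorted(timings.items())
    slopes = []
    for (len_a, t_a), (len_b, t_b) in zip(points, points[1:]):
        slopes.append(math.log(t_b / t_a) / math.log(len_b / len_a))
    return slopes


if __name__ == "__main__":
    for slope in loglog_slopes(TIMINGS_MIN):
        print(f"{slope:.2f}")
```

Both slopes come out close to 2, so in this measured range the sampling time grows roughly quadratically with sequence length.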

Reproducing results from the preprint

You can use this code together with code from bioemu-benchmarks to approximately reproduce results from our preprint.

The bioemu-v1.0 checkpoint contains the model weights used to produce the results in the preprint. Due to simplifications made in the embedding computation and a more efficient sampler, the results obtained with this code are not identical to those in the preprint, but they are consistent with the statistics shown there, i.e., mode coverage and free energy errors averaged over the proteins in a test set. Results for individual proteins may differ. For more details, please check the BIOEMU_RESULTS.md document in the bioemu-benchmarks repository.

Citation

If you are using our code or model, please consider citing our work:

@article {BioEmu2024,
    author = {Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew Y. K. and Satorras, Victor Garc{\'\i}a and Abdin, Osama and Veeling, Bastiaan S. and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Clementi, Cecilia and No{\'e}, Frank},
    title = {Scalable emulation of protein equilibrium ensembles with generative deep learning},
    year = {2024},
    doi = {10.1101/2024.12.05.626885},
    journal = {bioRxiv}
}

Side-chain reconstruction and MD-relaxation

BioEmu outputs structures in backbone frame representation. To reconstruct the side-chains, several tools are available. As an example, we provide a script to conduct side-chain reconstruction with HPacker (https://github.com/gvisani/hpacker), and provide an interface for running a short molecular dynamics (MD) equilibration. HPacker is a method for protein side-chain packing based on holographic rotationally equivariant convolutional neural networks (https://arxiv.org/abs/2311.09312).

This code is experimental and is provided for research purposes only. Further testing and development are needed before considering its application in real-world scenarios or production environments.

Install side-chain reconstruction tools

Clone and install the HPacker code and other dependencies with

./setup_sidechain_relax.sh

This will install some additional dependencies for running MD relaxation in the bioemu environment. It will also install HPacker in a separate conda environment called hpacker.

Use side-chain reconstruction tools

Inside the bioemu environment, run side-chain reconstruction with:

python -m bioemu.sidechain_relax --pdb-path path/to/topology.pdb --xtc-path path/to/samples.xtc

By default, side-chain reconstruction and local energy minimization are performed (no full MD integration for efficiency reasons). Note that the runtime of this code scales with the size of the system. We suggest running this code on a selection of samples rather than the full set.
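One simple way to pick such a selection is to take evenly spaced frames from the sampled trajectory. The sketch below is an illustration only: the function name evenly_spaced_indices is hypothetical, and even spacing is just one possible selection strategy, not one prescribed by bioemu.

```python
def evenly_spaced_indices(n_frames, n_select):
    """Return up to n_select frame indices spread evenly over range(n_frames)."""
    if n_frames < 1 or n_select < 1:
        raise ValueError("n_frames and n_select must both be at least 1")
    if n_select >= n_frames:
        # Nothing to thin out: keep every frame.
        return list(range(n_frames))
    if n_select == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids
    # floating-point rounding surprises.
    return [i * (n_frames - 1) // (n_select - 1) for i in range(n_select)]


if __name__ == "__main__":
    print(evenly_spaced_indices(1000, 5))  # [0, 249, 499, 749, 999]
```

The resulting indices can then be used to slice samples.xtc with a trajectory library of your choice before passing the smaller file to --xtc-path.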

There are two other options:

  • To only run side-chain reconstruction without MD equilibration, add --no-md-equil.
  • To run a short NVT equilibration (0.1 ns), add --md-protocol nvt_equil.

To see the full list of options, call python -m bioemu.sidechain_relax --help.

The script saves the reconstructed all-heavy-atom structures in samples_sidechain_rec.{pdb,xtc} and the MD-equilibrated structures in samples_md_equil.{pdb,xtc}. To change the output filename, pass --outname other_name.
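After a run, a quick check that the expected files exist can catch a failed or partial job. The sketch below covers only the default filenames listed above. The helper name missing_outputs is hypothetical, the directory to inspect is whichever one your run wrote to, and names changed with --outname are not handled.

```python
from pathlib import Path

# Default output filenames as listed above; --outname changes are not covered.
DEFAULT_OUTPUTS = {
    "sidechain_rec": ["samples_sidechain_rec.pdb", "samples_sidechain_rec.xtc"],
    "md_equil": ["samples_md_equil.pdb", "samples_md_equil.xtc"],
}


def missing_outputs(output_dir, md_equil=True):
    """Return the default output filenames that are absent from output_dir."""
    names = list(DEFAULT_OUTPUTS["sidechain_rec"])
    if md_equil:
        # Skip these when the run used --no-md-equil.
        names += DEFAULT_OUTPUTS["md_equil"]
    base = Path(output_dir)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    print(missing_outputs(".", md_equil=True))
```

An empty list means every expected file was found.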

Third-party code

The code in the openfold subdirectory is copied from openfold with minor modifications. The modifications are described in the relevant source files.

Get in touch

If you have any questions not covered here, please create an issue or contact the BioEmu team by writing to the corresponding author on our preprint.
