Comparing changes

base repository: PaddlePaddle/PaddleHelix
base: dev
head repository: YaoYinYing/PaddleHelix
compare: patch-hydra
Can’t automatically merge. Don’t worry, you can still create the pull request.

Commits on Aug 17, 2024

  1. chore: hydra and pip

    YaoYinYing committed Aug 17, 2024
    662bbff
  2. use reduced bfd

    YaoYinYing committed Aug 17, 2024
    b4fce8a
  3. disable no MSA mode (#312)

    RyanGarciaLI authored and YaoYinYing committed Aug 17, 2024
    d2aa1b9
  4. add override flag

    YaoYinYing committed Aug 17, 2024
    d528300
  5. add configs

    YaoYinYing committed Aug 17, 2024
    a3d62e2
  6. fix:config

    YaoYinYing committed Aug 17, 2024
    6e39caa
  7. fix: maxit path

    YaoYinYing committed Aug 17, 2024
    29ef9c4
  8. fix: maxit run with env

    YaoYinYing committed Aug 17, 2024
    6ef4cd8
  9. fix: maxit run with env

    YaoYinYing committed Aug 17, 2024
    5568284
  10. copy of license

    YaoYinYing committed Aug 17, 2024
    ae9146a
  11. fix: obabel bin resolve

    YaoYinYing committed Aug 17, 2024
    5aae769
  12. doc: use conda prefix

    YaoYinYing committed Aug 17, 2024
    b63e722
  13. doc: merge to readme

    YaoYinYing committed Aug 17, 2024
    0863ecf
  14. fix: imports

    YaoYinYing committed Aug 17, 2024
    84a550c
  15. fix: config overriden

    YaoYinYing committed Aug 17, 2024
    21d4108
  16. doc: update for cli

    YaoYinYing committed Aug 17, 2024
    7561674
  17. doc: update for cli

    YaoYinYing committed Aug 17, 2024
    e8582de

Commits on Aug 18, 2024

  1. fix: msa parallel for bfd

    YaoYinYing committed Aug 18, 2024
    cd92429
  2. fix: multimer msa parallel

    YaoYinYing committed Aug 18, 2024
    8c22e15
  3. fix: multimer msa parallel

    YaoYinYing committed Aug 18, 2024
    8ce13ca
  4. drop: buildin logger

    YaoYinYing committed Aug 18, 2024
    8548570

Commits on Aug 19, 2024

  1. d2feac5
  2. refactor: deduplicate code

    YaoYinYing committed Aug 19, 2024
    b3153de
  3. cd68f00
  4. 7180fe7
  5. 330e455
  6. fix: output logs

    YaoYinYing committed Aug 19, 2024
    dce79f8
  7. 6968f69
  8. docs: sm file inputs

    YaoYinYing committed Aug 19, 2024
    90f731a

Commits on Aug 20, 2024

  1. drop: buildin logger

    YaoYinYing committed Aug 20, 2024
    2bd81ad
  2. drop: buildin logger

    YaoYinYing committed Aug 20, 2024
    0720311
  3. add: use_3d opt for ligand

    YaoYinYing committed Aug 20, 2024
    d7bd9ee
  4. 96a3de8

Commits on Aug 23, 2024

  1. dev:major: covalent bond

    YaoYinYing committed Aug 23, 2024
    c39a3e4
  2. fix: covalent bond

    YaoYinYing committed Aug 23, 2024
    3bba852
  3. doc: covalently

    YaoYinYing committed Aug 23, 2024
    7653585

Commits on Aug 24, 2024

  1. Update README.md

    YaoYinYing committed Aug 24, 2024
    6be8aa6
  2. db42183
  3. feat: disulfide bonding

    YaoYinYing committed Aug 24, 2024
    e4d6821
  4. feat: disulfide bonding

    YaoYinYing committed Aug 24, 2024
    43698f6
  5. case: metalc

    YaoYinYing committed Aug 24, 2024
    d7ee1da

Commits on Aug 26, 2024

  1. 1fd11ba

Commits on Aug 28, 2024

  1. 4daee6c
  2. docs: batch run mode

    YaoYinYing committed Aug 28, 2024
    b23e7eb
  3. chore: db to ramdisk

    YaoYinYing committed Aug 28, 2024
    7dc26a7
  4. fix: ramdisk permission

    YaoYinYing committed Aug 28, 2024
    484e3bc
  5. db27a25
  6. a832241

Commits on Aug 29, 2024

  1. 6dc4554

Commits on Aug 30, 2024

  1. revert: script links

    YaoYinYing committed Aug 30, 2024
    5e3824f
Showing with 3,504 additions and 1,934 deletions.
  1. +219 −66 apps/protein_folding/helixfold3/README.md
  2. +155 −0 apps/protein_folding/helixfold3/data/7s69_glycan.sdf
  3. +20 −0 apps/protein_folding/helixfold3/data/demo_3fap_protein_sm.json
  4. +45 −0 apps/protein_folding/helixfold3/data/demo_4Fe-4S.json
  5. +1 −0 apps/protein_folding/helixfold3/data/demo_6zcy_smiles.json
  6. +21 −0 apps/protein_folding/helixfold3/data/demo_7s69_coval.json
  7. +19 −0 apps/protein_folding/helixfold3/data/demo_7u7w_protein_nucleic.json
  8. +24 −0 apps/protein_folding/helixfold3/data/demo_7u7w_protein_nucleic_sm.json
  9. +23 −0 apps/protein_folding/helixfold3/data/demo_E2-Ub.json
  10. +14 −0 apps/protein_folding/helixfold3/data/demo_disulf.json
  11. +33 −0 apps/protein_folding/helixfold3/data/demo_disulf_homodimer.json
  12. +20 −0 apps/protein_folding/helixfold3/data/demo_p450_heme.json
  13. +26 −0 apps/protein_folding/helixfold3/data/demo_p450_heme_coval.json
  14. +20 −0 apps/protein_folding/helixfold3/data/demo_p450_heme_sdf.json
  15. +27 −0 apps/protein_folding/helixfold3/data/demo_p450_heme_smiles.json
  16. +21 −0 apps/protein_folding/helixfold3/data/demo_pUb.json
  17. +23 −0 apps/protein_folding/helixfold3/data/demo_pUb_modres.json
  18. +35 −0 apps/protein_folding/helixfold3/data/demo_pUb_modres_typ.json
  19. +30 −0 apps/protein_folding/helixfold3/data/demo_phos.json
  20. +68 −0 apps/protein_folding/helixfold3/data/typ.mol2
  21. +36 −0 apps/protein_folding/helixfold3/helixfold/LICENSE
  22. +35 −5 apps/protein_folding/helixfold3/helixfold/common/all_atom_pdb_save.py
  23. +88 −0 apps/protein_folding/helixfold3/helixfold/config/helixfold.yaml
  24. +1 −2 apps/protein_folding/helixfold3/helixfold/data/mmcif_parsing_paddle.py
  25. +214 −22 apps/protein_folding/helixfold3/helixfold/data/pipeline_conf_bonds.py
  26. +25 −10 apps/protein_folding/helixfold3/helixfold/data/pipeline_multimer_parallel.py
  27. +99 −85 apps/protein_folding/helixfold3/helixfold/data/pipeline_parallel.py
  28. +43 −0 apps/protein_folding/helixfold3/helixfold/data/pipeline_residue_replacement.py
  29. +23 −14 apps/protein_folding/helixfold3/helixfold/data/pipeline_token_feature.py
  30. +51 −36 apps/protein_folding/helixfold3/helixfold/data/templates.py
  31. +3 −0 apps/protein_folding/helixfold3/helixfold/data/tools/hhblits.py
  32. +37 −1 apps/protein_folding/helixfold3/helixfold/data/tools/utils.py
  33. +765 −0 apps/protein_folding/helixfold3/helixfold/inference.py
  34. +2 −3 apps/protein_folding/helixfold3/helixfold/model/config.py
  35. +8 −5 apps/protein_folding/helixfold3/helixfold/model/modules_all_atom.py
  36. 0 apps/protein_folding/helixfold3/{ → helixfold}/utils/__init__.py
  37. +521 −0 apps/protein_folding/helixfold3/helixfold/utils/feature_processing_aa.py
  38. +169 −0 apps/protein_folding/helixfold3/helixfold/utils/mmcif_writer.py
  39. +2 −3 apps/protein_folding/helixfold3/{ → helixfold}/utils/model.py
  40. +485 −0 apps/protein_folding/helixfold3/helixfold/utils/preprocess.py
  41. 0 apps/protein_folding/helixfold3/{ → helixfold}/utils/utils.py
  42. +0 −492 apps/protein_folding/helixfold3/infer_scripts/feature_processing_aa.py
  43. +1 −0 apps/protein_folding/helixfold3/infer_scripts/feature_processing_aa.py
  44. +0 −293 apps/protein_folding/helixfold3/infer_scripts/preprocess.py
  45. +1 −0 apps/protein_folding/helixfold3/infer_scripts/preprocess.py
  46. +0 −169 apps/protein_folding/helixfold3/infer_scripts/tools/mmcif_writer.py
  47. +1 −0 apps/protein_folding/helixfold3/infer_scripts/tools/mmcif_writer.py
  48. +0 −637 apps/protein_folding/helixfold3/inference.py
  49. +50 −0 apps/protein_folding/helixfold3/pyproject.toml
  50. +0 −13 apps/protein_folding/helixfold3/requirements.txt
  51. +0 −39 apps/protein_folding/helixfold3/run_infer.sh
  52. +0 −39 apps/protein_folding/helixfold3/utils/misc.py
285 changes: 219 additions & 66 deletions apps/protein_folding/helixfold3/README.md
@@ -36,25 +36,34 @@ Those settings are recommended as they are the same as we used in our A100 machi
### Installation

HelixFold3 depends on [PaddlePaddle](https://github.com/paddlepaddle/paddle). Python dependencies available through `pip`
are provided in `requirements.txt`. `kalign`, the [`HH-suite`](https://github.com/soedinglab/hh-suite) and `jackhmmer` are
are provided in `pyproject.toml`. `kalign`, the [`HH-suite`](https://github.com/soedinglab/hh-suite) and `jackhmmer` are
also needed to produce multiple sequence alignments. The download scripts require `aria2c`.

Go to the directory that contains `helixfold`, then run:

```bash
# Install py env
conda create -n helixfold -c conda-forge python=3.9
conda install -y -c bioconda aria2 hmmer==3.3.2 kalign2==2.04 hhsuite==3.3.0 -n helixfold
conda install -y -c conda-forge openbabel -n helixfold

# activate the conda environment
conda activate helixfold

# adjust these version numbers to match your setup
conda install -y cudnn=8.4.1 cudatoolkit=11.7 nccl=2.14.3 -c conda-forge -c nvidia
conda install -y -c bioconda aria2 hmmer==3.3.2 kalign2==2.04 hhsuite==3.3.0
conda install -y -c conda-forge openbabel

# install paddlepaddle
python3 -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# or lower version: https://paddle-wheel.bj.bcebos.com/2.5.1/linux/linux-gpu-cuda11.7-cudnn8.4.1-mkl-gcc8.2-avx/paddlepaddle_gpu-2.5.1.post117-cp39-cp39-linux_x86_64.whl

python3 -m pip install -r requirements.txt
# pin pip below version 24 (this may downgrade it)
pip install --upgrade 'pip<24'

# edit configuration file at `./helixfold/config/helixfold.yaml` to set your databases and binaries correctly.

# install HF3 as a python library
pip install . --no-cache-dir
```

Note: If you have a different version of python3 and cuda, please refer to [here](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) for the compatible PaddlePaddle `dev` package.
@@ -63,14 +72,22 @@ Note: If you have a different version of python3 and cuda, please refer to [here
#### Install Maxit
The conversion between `.cif` and `.pdb` relies on [Maxit](https://sw-tools.rcsb.org/apps/MAXIT/index.html).
Download Maxit source code from https://sw-tools.rcsb.org/apps/MAXIT/maxit-v11.100-prod-src.tar.gz. Untar and follow
its `README` to complete installation.
its `README` to complete installation. If you encounter an error saying that your GCC version is not supported (9.4.0, for example), edit `etc/platform.sh` and rerun the compilation, as shown below:

```bash
# Check if it is a Linux platform
Linux)
# Check if it is GCC version 4.x
gcc_ver=`gcc --version | grep -e " 4\."` # edit `4\.` to `9\.`
if [[ -z $gcc_ver ]]
```
### Usage
In order to run HelixFold3, the genetic databases and model parameters are required.
The parameters of HelixFold3 can be downloaded [here](https://paddlehelix.bd.bcebos.com/HelixFold3/params/HelixFold3-params-240814.zip),
please place the downloaded checkpoint in ```./init_models/ ```directory.
please set the path of the downloaded checkpoint as `weight_path` in the `helixfold/config/helixfold.yaml` configuration file before installing HF3 as a Python module.
The script `scripts/download_all_data.sh` can be used to download and set up all genetic databases with the following configs:
@@ -96,10 +113,11 @@ The script `scripts/download_all_data.sh` can be used to download and set up all
There are some demo inputs under `./data/` for your test and reference. Data input is in the form of JSON containing
several entities such as `protein`, `ligand`, `nucleic acids`, and `ion`. Protein and nucleic acid inputs are given as sequences.
HelixFold3 supports input ligand as SMILES or CCD id, please refer to `/data/demo_6zcy_smiles.json` and `demo_output/demo_6zcy_smiles/`
for more details about SMILES input. More flexible input will come in soon.
HelixFold3 accepts ligands as SMILES, CCD IDs, or small-molecule files; please refer to `/data/demo_6zcy_smiles.json` and `data/demo_p450_heme_sdf.json`
for examples of SMILES and file input. To list the small-molecule file formats that can be read, run `obabel -L formats |grep -v 'Write-only'`.
An example of input data is as follows:
```json
{
"entities": [
@@ -117,74 +135,209 @@ A example of input data is as follows:
}
```
Another example of **covalently modified** input:
```json
{
"entities": [
{
"type": "protein",
"sequence": "MDALYKSTVAKFNEVIQLDCSTEFFSIALSSIAGILLLLLLFRSKRHSSLKLPPGKLGIPFIGESFIFLRALRSNSLEQFFDERVKKFGLVFKTSLIGHPTVVLCGPAGNRLILSNEEKLVQMSWPAQFMKLMGENSVATRRGEDHIVMRSALAGFFGPGALQSYIGKMNTEIQSHINEKWKGKDEVNVLPLVRELVFNISAILFFNIYDKQEQDRLHKLLETILVGSFALPIDLPGFGFHRALQGRAKLNKIMLSLIKKRKEDLQSGSATATQDLLSVLLTFRDDKGTPLTNDEILDNFSSLLHASYDTTTSPMALIFKLLSSNPECYQKVVQEQLEILSNKEEGEEITWKDLKAMKYTWQVAQETLRMFPPVFGTFRKAITDIQYDGYTIPKGWKLLWTTYSTHPKDLYFNEPEKFMPSRFDQEGKHVAPYTFLPFGGGQRSCVGWEFSKMEILLFVHHFVKTFSSYTPVDPDEKISGDPLPPLPSKGFSIKLFPRP",
"count": 1
},
{
"type": "ligand",
"ccd": "HEM",
"count": 1
},
{
"type": "ligand",
"smiles": "CC1=C2CC[C@@]3(CCCC(=C)[C@H]3C[C@@H](C2(C)C)CC1)C",
"name": "lig",
"count": 1
},
{
"type": "bond",
"bond": "A,CYS,445,SG,B,HEM,1,FE,covale,2.3",
"_comment": "<chain-id>,<residue name>,<residue index>,<atom id>,<chain-id>,<residue name>,<residue index>,<atom id>,<bond type>,<bond length>",
"_another_comment": "use semicolon to separate multiple bonds",
"_also_comment": "For ccd input, use CCD key as residue name; for smiles and file input, use `name` identifier"
}
]
}
```
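The `bond` string above packs ten comma-separated fields per bond, with semicolons between bonds, as described in its `_comment` entries. The following is a minimal Python sketch of how such a string could be split and type-checked before running a job. The names `BondEnd`, `Bond` and `parse_bonds` are hypothetical helpers written for illustration; they are not part of HelixFold3, and HelixFold3's own parsing may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondEnd:
    """One side of a bond: chain, residue name, residue index, atom ID."""
    chain_id: str
    residue_name: str
    residue_index: int
    atom_id: str


@dataclass(frozen=True)
class Bond:
    first: BondEnd
    second: BondEnd
    bond_type: str
    length: float


def parse_bonds(spec: str) -> list:
    """Split a `bond` string on ';' and parse each ten-field record.

    Hypothetical helper: the field order follows the `_comment` in the
    example JSON, nothing more.
    """
    bonds = []
    for record in spec.split(";"):
        record = record.strip()
        if not record:
            continue  # tolerate a trailing semicolon
        fields = [field.strip() for field in record.split(",")]
        if len(fields) != 10:
            raise ValueError(
                f"expected 10 comma-separated fields, got {len(fields)}: {record!r}"
            )
        c1, r1, i1, a1, c2, r2, i2, a2, bond_type, length = fields
        bonds.append(
            Bond(
                BondEnd(c1, r1, int(i1), a1),
                BondEnd(c2, r2, int(i2), a2),
                bond_type,
                float(length),
            )
        )
    return bonds


bonds = parse_bonds("A,CYS,445,SG,B,HEM,1,FE,covale,2.3")
print(bonds[0].first.atom_id, bonds[0].second.atom_id, bonds[0].length)
```

Running a check like this on each input file catches a missing field or a non-numeric index before any MSA search starts.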
*EXPERIMENTAL* Example input of residue replacements (non-canonical amino acids):
```json
{
"entities": [
{
"type": "protein",
"sequence": "MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGG",
"count": 1,
"_case_from": "https://www.rcsb.org/structure/5K9P",
"_note": "Ub"
},
{
"type": "ncaa",
"ccd": "SEP",
"_comment": "Register `SEP` as `sep`",
"_also_a_comment": "All ccd inputs will be processed as lowercase characters to avoid collisions with standard protein residues"
},
{
"type": "ncaa",
"name": "typ",
"mol2": "/repo/PaddleHelix/apps/protein_folding/helixfold3/data/typ.mol2",
"_comment": "Register this mol2 file as `typ`. Only MOL2 files are allowed. for NCAA, one must name all atoms with individual atom labels in the way that used by standard amino acid species"
},
{
"type": "modres",
"modres": "A,20,SER,sep",
"_note": "Ser20 phosphorylated ubiquitin",
"_comment": "Use `sep` to replace `SER`"
},
{
"type": "modres",
"modres": "A,59,TYR,typ",
"_note": "all phosphotyrosines, 59",
"_comment": "Use `typ` to replace `TYR`"
}
]
}
```
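Each `modres` entry above names a chain, a residue position, the residue being replaced, and the registered replacement. A small sanity check is sketched below in Python. It assumes that positions are 1-based, which is consistent with the example (position 20 of the sequence is `S` and position 59 is `Y`). `check_modres` is a hypothetical helper and does not describe how HelixFold3 itself validates its input.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Sequence copied from the example input above.
UB_SEQUENCE = (
    "MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGG"
)


def check_modres(spec, sequence):
    """Parse '<chain>,<position>,<old residue>,<new residue>' and verify
    that the sequence really has <old residue> at that position.

    Assumption: positions are 1-based, as the example suggests.
    """
    chain_id, index, old_name, new_name = [f.strip() for f in spec.split(",")]
    position = int(index)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    expected = THREE_TO_ONE[old_name.upper()]
    found = sequence[position - 1]
    if found != expected:
        raise ValueError(
            f"position {position} is {found!r}, but {old_name} ({expected!r}) was requested"
        )
    return chain_id, position, old_name, new_name


for spec in ("A,20,SER,sep", "A,59,TYR,typ"):
    print(check_modres(spec, UB_SEQUENCE))
```

An off-by-one position or a wrong residue name then fails immediately with a readable message.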
**The atom IDs of the bonded atoms in a covalent bond input are critical and must be exact.**
To look up all atom IDs of a ligand in the CCD database:
```shell
helixfold_check_ligand +ligand=HEM
```
The output of this command looks like the following:
```text
[2024-08-31 10:55:10,131][absl][WARNING] - Using resolved obabel: /mnt/data/envs/conda_env/envs/helixfold/bin/obabel
[2024-08-31 10:55:10,132][absl][INFO] - Started Loading CCD dataset from /mnt/db/ccd/ccd_preprocessed_etkdg.pkl.gz
[2024-08-31 10:55:18,752][absl][INFO] - Finished Loading CCD dataset from /mnt/db/ccd/ccd_preprocessed_etkdg.pkl.gz in 8.620 seconds
[2024-08-31 10:55:18,752][absl][INFO] - CCD dataset contains 43488 entries.
[2024-08-31 10:55:18,752][absl][INFO] - Atoms in HEM: ['CHA', 'CHB', 'CHC', 'CHD', 'C1A', 'C2A', 'C3A', 'C4A', 'CMA', 'CAA', 'CBA', 'CGA', 'O1A', 'O2A', 'C1B', 'C2B', 'C3B', 'C4B', 'CMB', 'CAB', 'CBB', 'C1C', 'C2C', 'C3C', 'C4C', 'CMC', 'CAC', 'CBC', 'C1D', 'C2D', 'C3D', 'C4D', 'CMD', 'CAD', 'CBD', 'CGD', 'O1D', 'O2D', 'NA', 'NB', 'NC', 'ND', 'FE']
```
To look up all atom IDs in a given `sdf`/`mol2`/`smiles` input:
```shell
# smiles
helixfold_check_ligand '+ligand="CNC(=O)c1nn(C)c2ccc(Nc3nccc(n3)n4cc(N[C@@H]5CCNC5)c(C)n4)cc12"'
# sdf file
helixfold_check_ligand +ligand=./60119277-3d.sdf
# mol2 file
helixfold_check_ligand +ligand=./60119277-3d.mol2
# multiple ccd ids:
helixfold_check_ligand +ligand=[SER,SEP,TYR,TYP,THR,THP]
```
HF3 prints output like the following:
```text
helixfold_check_ligand '+ligand="CNC(=O)c1nn(C)c2ccc(Nc3nccc(n3)n4cc(N[C@@H]5CCNC5)c(C)n4)cc12"'
[2024-08-31 10:56:16,445][absl][WARNING] - Using resolved obabel: /mnt/data/envs/conda_env/envs/helixfold/bin/obabel
[2024-08-31 10:56:16,445][absl][INFO] - Guessed ligand input type: smiles
[2024-08-31 10:56:16,567][absl][INFO] - Started converting smiles to mol2: CNC(=O)c1nn(C)c2ccc(Nc3nccc(n3)n4cc(N[C@@H]5CCNC5)c(C)n4)cc12
[2024-08-31 10:56:16,567][absl][WARNING] - This takes a while ...
[2024-08-31 10:56:16,851][absl][INFO] - Finished converting smiles to mol2: CNC(=O)c1nn(C)c2ccc(Nc3nccc(n3)n4cc(N[C@@H]5CCNC5)c(C)n4)cc12 in 0.283 seconds
[2024-08-31 10:56:16,857][absl][INFO] - Atoms in CNC(=O)c1nn(C)c2ccc(Nc3nccc(n3)n4cc(N[C@@H]5CCNC5)c(C)n4)cc12 (smiles): {'UNK-': {'atom_symbol': ['C', 'N', 'C', 'O', 'C', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'N', 'C', 'N', 'C', 'C', 'C', 'N', 'N', 'C', 'C', 'N', 'C', 'C', 'C', 'N', 'C', 'C', 'C', 'N', 'C', 'C'], 'charge': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'atom_ids': ['C1', 'N1', 'C2', 'O1', 'C3', 'N2', 'N3', 'C4', 'C5', 'C6', 'C7', 'C8', 'N4', 'C9', 'N5', 'C10', 'C11', 'C12', 'N6', 'N7', 'C13', 'C14', 'N8', 'C15', 'C16', 'C17', 'N9', 'C18', 'C19', 'C20', 'N10', 'C21', 'C22'], 'coval_bonds': [('C1', 'N1', 'SING'), ('N1', 'C2', 'SING'), ('C2', 'O1', 'DOUB'), ('C2', 'C3', 'SING'), ('C3', 'N2', 'AROM'), ('N2', 'N3', 'AROM'), ('N3', 'C4', 'SING'), ('N3', 'C5', 'AROM'), ('C5', 'C6', 'AROM'), ('C6', 'C7', 'AROM'), ('C7', 'C8', 'AROM'), ('C8', 'N4', 'SING'), ('N4', 'C9', 'SING'), ('C9', 'N5', 'AROM'), ('N5', 'C10', 'AROM'), ('C10', 'C11', 'AROM'), ('C11', 'C12', 'AROM'), ('C12', 'N6', 'AROM'), ('C9', 'N6', 'AROM'), ('C12', 'N7', 'SING'), ('N7', 'C13', 'AROM'), ('C13', 'C14', 'AROM'), ('C14', 'N8', 'SING'), ('N8', 'C15', 'SING'), ('C15', 'C16', 'SING'), ('C16', 'C17', 'SING'), ('C17', 'N9', 'SING'), ('N9', 'C18', 'SING'), ('C15', 'C18', 'SING'), ('C14', 'C19', 'AROM'), ('C19', 'C20', 'SING'), ('C19', 'N10', 'AROM'), ('N7', 'N10', 'AROM'), ('C8', 'C21', 'AROM'), ('C21', 'C22', 'AROM'), ('C3', 'C22', 'AROM'), ('C5', 'C22', 'AROM')], 'position': array([[ 2.8929, -1.0132, 1.1086],
[ 3.7391, -1.0171, -0.0475],
[ 4.0952, 0.1618, -0.6747],
[ 3.7451, 1.2639, -0.2632],
[ 4.9273, -0.0255, -1.8697],
[ 5.2505, -1.2502, -2.3477],
[ 6.0037, -0.9802, -3.4344],
[ 6.4721, -2.0524, -4.2816],
[ 6.2059, 0.3548, -3.6608],
[ 6.918 , 1.0385, -4.6439],
[ 6.8886, 2.4361, -4.5918],
[ 6.2178, 3.1502, -3.589 ],
[ 6.223 , 4.5552, -3.5106],
[ 7.2064, 5.4461, -3.9552],
[ 7.2928, 6.6137, -3.3032],
[ 8.3377, 7.4081, -3.6083],
[ 9.3053, 7.0807, -4.5284],
[ 9.1092, 5.8865, -5.1784],
[ 8.0557, 5.0867, -4.9217],
[ 10.0465, 5.4369, -6.1549],
[ 9.8984, 4.3439, -6.9834],
[ 10.9956, 4.3038, -7.8014],
[ 11.2377, 3.4349, -8.8185],
[ 10.147 , 2.7603, -9.5397],
[ 10.643 , 1.4653, -10.1679],
[ 11.0877, 1.8617, -11.5619],
[ 10.2177, 2.9614, -11.9789],
[ 9.6687, 3.5755, -10.7532],
[ 11.8347, 5.3372, -7.3132],
[ 13.2229, 5.6823, -7.7052],
[ 11.2358, 6.0634, -6.3575],
[ 5.4856, 2.4209, -2.6404],
[ 5.4943, 1.0102, -2.6657]], dtype=float32)}}
```
`atom_ids` is what we are looking for.
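For CCD lookups, the atom list is printed as a plain Python list on the `Atoms in <name>: [...]` log line shown earlier, so it can be picked out of captured output. The sketch below is a minimal example under that assumption: `atoms_from_log_line` is a hypothetical helper, the sample line is shortened from the `HEM` output above, and the dictionary printed for SMILES or file input (which contains a NumPy `array(...)`) is deliberately not handled.

```python
import ast
import re

# Matches the CCD-style line: "... - Atoms in HEM: ['CHA', 'CHB', ...]"
ATOMS_LINE = re.compile(r"Atoms in (?P<name>\S+): (?P<atoms>\[.*\])\s*$")


def atoms_from_log_line(line):
    """Return (ligand name, list of atom IDs), or None if the line does not match."""
    match = ATOMS_LINE.search(line)
    if match is None:
        return None
    return match.group("name"), ast.literal_eval(match.group("atoms"))


# Shortened sample line; the real HEM line lists 43 atoms.
sample = "[2024-08-31 10:55:18,752][absl][INFO] - Atoms in HEM: ['CHA', 'NA', 'NB', 'FE']"
name, atom_ids = atoms_from_log_line(sample)

# Check that the atom used in a bond definition really exists in the ligand.
wanted = "FE"
if wanted not in atom_ids:
    raise SystemExit(f"{wanted} is not an atom of {name}")
print(f"{wanted} found in {name}")
```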
#### Running HelixFold for Inference
To run inference on a sequence or multiple sequences using HelixFold3's pretrained parameters, run e.g.:
* Inference on single GPU (change the settings in script BEFORE you run it)
##### Run from default config
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold \
input=./data/demo_8ecx.json \
output=. \
CONFIG_DIFFS.preset=allatom_demo
```
sh run_infer.sh
##### Run with customized configuration dir and file (`./myfold.yaml`, for example):
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold --config-dir=. --config-name=myfold \
input=./data/demo_6zcy_smiles.json \
output=. \
CONFIG_DIFFS.preset=allatom_demo
```
The script is as follows,
```bash
#!/bin/bash
PYTHON_BIN="PATH/TO/YOUR/PYTHON"
ENV_BIN="PATH/TO/YOUR/ENV"
MAXIT_SRC="PATH/TO/MAXIT/SRC"
DATA_DIR="PATH/TO/DATA"
export OBABEL_BIN="PATH/TO/OBABEL/BIN"
export PATH="$MAXIT_BIN/bin:$PATH"
CUDA_VISIBLE_DEVICES=0 "$PYTHON_BIN" inference.py \
--maxit_binary "$MAXIT_SRC/bin/maxit" \
--jackhmmer_binary_path "$ENV_BIN/jackhmmer" \
--hhblits_binary_path "$ENV_BIN/hhblits" \
--hhsearch_binary_path "$ENV_BIN/hhsearch" \
--kalign_binary_path "$ENV_BIN/kalign" \
--hmmsearch_binary_path "$ENV_BIN/hmmsearch" \
--hmmbuild_binary_path "$ENV_BIN/hmmbuild" \
--nhmmer_binary_path "$ENV_BIN/nhmmer" \
--preset='reduced_dbs' \
--bfd_database_path "$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt" \
--small_bfd_database_path "$DATA_DIR/small_bfd/bfd-first_non_consensus_sequences.fasta" \
--bfd_database_path "$DATA_DIR/small_bfd/bfd-first_non_consensus_sequences.fasta" \
--uniclust30_database_path "$DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08" \
--uniprot_database_path "$DATA_DIR/uniprot/uniprot.fasta" \
--pdb_seqres_database_path "$DATA_DIR/pdb_seqres/pdb_seqres.txt" \
--uniref90_database_path "$DATA_DIR/uniref90/uniref90.fasta" \
--mgnify_database_path "$DATA_DIR/mgnify/mgy_clusters_2018_12.fa" \
--template_mmcif_dir "$DATA_DIR/pdb_mmcif/mmcif_files" \
--obsolete_pdbs_path "$DATA_DIR/pdb_mmcif/obsolete.dat" \
--ccd_preprocessed_path "$DATA_DIR/ccd_preprocessed_etkdg.pkl.gz" \
--rfam_database_path "$DATA_DIR/Rfam-14.9_rep_seq.fasta" \
--max_template_date=2020-05-14 \
--input_json data/demo_protein_ligand.json \
--output_dir ./output \
--model_name allatom_demo \
--init_model ./init_models/checkpoints.pdparams \
--infer_times 3 \
--precision "fp32"
```
##### Run with additional configuration terms
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold \
input=./data/demo_6zcy.json \
output=. \
CONFIG_DIFFS.preset=allatom_demo \
+CONFIG_DIFFS.model.global_config.subbatch_size=192 \
+CONFIG_DIFFS.model.num_recycle=10
```
The descriptions of the above script are as follows:
* Replace `MAXIT_SRC` with your installed `maxit`'s root path.
* Replace `DATA_DIR` with your downloaded data path.
* Replace `OBABEL_BIN` with your installed `openbabel` path.
* Replace `ENV_BIN` with your conda virtual environment or any environment where `hhblits`, `hmmsearch` and other dependencies have been installed.
* `--preset` - Set `'reduced_dbs'` to use small bfd or `'full_dbs'` to use full bfd.
* `--*_database_path` - Path to datasets you have downloaded.
* `--input_json` - Input data in the form of JSON. Input pattern in `./data/demo_*.json` for your reference.
* `--output_dir` - Model output path. The output will be in a folder named the same as your `--input_json` under this path.
* `--model_name` - Model name in `./helixfold/model/config.py`. Different model names specify different configurations. Minor modifications to the configuration can be specified in `CONFIG_DIFFS` in `config.py` without changing the full configuration in `CONFIG_ALLATOM`.
* `--infer_times` - The number of inferences executed by the model for a single input. In each inference, the model will infer `5` times (`diff_batch_size`) for the same input by default. This hyperparameter can be changed via `model.head.diffusion_module.test_diff_batch_size` in `./helixfold/model/config.py`.
* `--precision` - Either `bf16` or `fp32`. Please check whether your machine supports `bf16` before changing it. For example, `bf16` is supported by A100, H100 and newer GPUs, while V100 only supports `fp32`.
* `LD_LIBRARY_PATH` - This is required to load the `libcudnn.so` library if you encounter issue like `RuntimeError: (PreconditionNotMet) Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion.`
* `config-dir` - The directory that contains the alternative configuration file you would like to use.
* `config-name` - The name of the configuration file you would like to use.
* `input` - Input data in the form of JSON or directory that contains such JSON file(s). For file input, check content pattern in `./data/demo_*.json` for your reference.
* `output` - Model output path. The output will be in a folder named the same as your input json file under this path.
* `CONFIG_DIFFS.preset` - Name of the model config preset defined in `CONFIG_DIFFS` in `./helixfold/model/config.py`. The preset is merged into `CONFIG_ALLATOM` to form the final model configuration.
* `CONFIG_DIFFS.*` - Override any model configuration in `CONFIG_ALLATOM`.
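When many inputs are run with different overrides, the command lines shown above can be assembled programmatically. The Python sketch below only builds the argument list and prints it; it does not launch anything. `helixfold_command` is a hypothetical helper, and it uses only the argument forms that appear in the examples above (`input=`, `output=`, `CONFIG_DIFFS.preset=`, `--config-dir=`, `--config-name=` and `+CONFIG_DIFFS.<key>=<value>`).

```python
import shlex


def helixfold_command(input_path, output_dir, preset="allatom_demo",
                      overrides=None, config_dir=None, config_name=None):
    """Build the argument list for one `helixfold` run (hypothetical helper)."""
    argv = ["helixfold"]
    if config_dir is not None:
        argv.append(f"--config-dir={config_dir}")
    if config_name is not None:
        argv.append(f"--config-name={config_name}")
    argv += [
        f"input={input_path}",
        f"output={output_dir}",
        f"CONFIG_DIFFS.preset={preset}",
    ]
    # Extra model settings are appended as `+CONFIG_DIFFS.<key>=<value>`.
    for key, value in (overrides or {}).items():
        argv.append(f"+CONFIG_DIFFS.{key}={value}")
    return argv


argv = helixfold_command(
    "./data/demo_6zcy.json",
    ".",
    overrides={
        "model.global_config.subbatch_size": 192,
        "model.num_recycle": 10,
    },
)
print(shlex.join(argv))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` together with the `LD_LIBRARY_PATH` setting described above.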
### Understanding Model Output
The outputs will be in a subfolder of `output_dir`, including the computed MSAs, predicted structures,
ranked structures, and evaluation metrics. For a task of inferring twice with diffusion batch size 3,
assume your input JSON is named `demo_data.json`, the `output_dir` directory will have the following structure:
```text
<output_dir>/
└── demo_data/
├── demo_data-pred-1-1/
@@ -208,9 +361,10 @@ assume your input JSON is named `demo_data.json`, the `output_dir` directory wil
└── ...
```
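Prediction folders in the tree above follow the `<name>-pred-X-Y` pattern, where X is the inference round and Y the diffusion batch. A short Python sketch for collecting them in order is given below. `list_predictions` is a hypothetical helper that relies only on this naming pattern; it makes no assumption about the files inside each folder.

```python
import re
from pathlib import Path

# <name>-pred-<inference index>-<diffusion batch index>
PRED_DIR = re.compile(r"^(?P<name>.+)-pred-(?P<infer>\d+)-(?P<batch>\d+)$")


def list_predictions(output_dir, job_name):
    """Return sorted (inference, batch, path) tuples for one job's prediction folders."""
    job_dir = Path(output_dir) / job_name
    found = []
    for child in job_dir.iterdir():
        match = PRED_DIR.match(child.name)
        if child.is_dir() and match and match.group("name") == job_name:
            found.append((int(match.group("infer")), int(match.group("batch")), child))
    return sorted(found)


if __name__ == "__main__":
    # Example: output written to ./output for an input named demo_data.json.
    if (Path("./output") / "demo_data").is_dir():
        for infer, batch, path in list_predictions("./output", "demo_data"):
            print(infer, batch, path)
```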
The contents of each output file are as follows:
* `final_features.pkl` – A `pickle` file containing the input feature NumPy arrays
used by the models to predict the structures.
used by the models to predict the structures. If you need to re-run an inference without re-building the MSAs, delete this file.
* `msas/` - A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `demo_data-pred-X-Y` - Prediction results of `demo_data.json` in X-th inference and Y-th diffusion batch,
@@ -224,8 +378,7 @@ We suggest a single GPU for inference has at least 32G available memory. The max
single V100-32G with precision `fp32` is up to 1000. Inferring longer tokens or entities with larger atom numbers
per token than normal protein residues like nucleic acids may cost more GPU memory.
For samples with larger tokens, you can reduce `model.global_config.subbatch_size` in `CONFIG_DIFFS` in `helixfold/model/config.py` to save more GPU memory but suffer from slower inference. `model.global_config.subbatch_size` is set as `96` by default. You can also
reduce the number of additional recycles by changing `model.num_recycle` in the same place.
For samples with more tokens, you can override `model.global_config.subbatch_size` in `CONFIG_ALLATOM` by passing `+CONFIG_DIFFS.model.global_config.subbatch_size=X` on the command line, where `X` is smaller than `96`; this saves GPU memory at the cost of slower inference. Additionally, you can reduce the number of additional recycles by setting `+CONFIG_DIFFS.model.num_recycle=Y`, where `Y` is smaller than `3`.
We are keen to support inference on longer token sequences; it will come soon.