Update README.md
YaoYinYing committed Aug 24, 2024
1 parent 7653585 commit 6be8aa6
Showing 1 changed file with 112 additions and 10 deletions.
122 changes: 112 additions & 10 deletions apps/protein_folding/helixfold3/README.md
@@ -72,14 +72,22 @@ Note: If you have a different version of python3 and cuda, please refer to [here
#### Install Maxit
The conversion between `.cif` and `.pdb` relies on [Maxit](https://sw-tools.rcsb.org/apps/MAXIT/index.html).
Download Maxit source code from https://sw-tools.rcsb.org/apps/MAXIT/maxit-v11.100-prod-src.tar.gz. Untar and follow
its `README` to complete installation.
its `README` to complete installation. If you encounter an error saying that your GCC version is not supported (9.4.0, for example), edit `etc/platform.sh` and rerun the compilation. See below:

```bash
# Check if it is a Linux platform
Linux)
# Check if it is GCC version 4.x
gcc_ver=`gcc --version | grep -e " 4\."` # edit `4\.` to `9\.`
if [[ -z $gcc_ver ]]
```
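The manual edit above can also be scripted. The following is a minimal sketch, assuming `etc/platform.sh` contains the `grep -e " 4\."` pattern exactly as shown in the excerpt; `gcc_major` and `patch_gcc_pattern` are hypothetical helper names and are not part of Maxit or HelixFold3.

```python
import re


def gcc_major(version_output: str) -> str:
    """Return the major version found on the first line of `gcc --version` output."""
    first_line = version_output.splitlines()[0]
    match = re.search(r"\s(\d+)\.\d+", first_line)
    if match is None:
        raise ValueError(f"no version number found in: {first_line!r}")
    return match.group(1)


def patch_gcc_pattern(script_text: str, major: str) -> str:
    """Replace the hard-coded ` 4\\.` grep pattern with the given major version.

    Assumption: the script text contains the pattern `" 4\\."` verbatim,
    as in the excerpt above. Text without that pattern is returned unchanged.
    """
    return script_text.replace(r'" 4\."', f'" {major}\\."')
```

To apply it, read `etc/platform.sh`, pass its text through `patch_gcc_pattern` with the value returned by `gcc_major`, write the result back, and rerun the compilation.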
### Usage
In order to run HelixFold3, the genetic databases and model parameters are required.
The parameters of HelixFold3 can be downloaded [here](https://paddlehelix.bd.bcebos.com/HelixFold3/params/HelixFold3-params-240814.zip),
please place the downloaded checkpoint in ```./init_models/ ```directory.
please set `weight_path` in the `helixfold/config/helixfold.yaml` configuration file to the path of the downloaded checkpoint before installing HF3 as a Python module.
The script `scripts/download_all_data.sh` can be used to download and set up all genetic databases with the following configs:
@@ -109,6 +117,7 @@ HelixFold3 supports input ligand as SMILES, CCD id or small molecule files, plea
for more details about SMILES input. Flexible input from small molecule files is now supported; run `obabel -L formats | grep -v 'Write-only'` to list the formats that can be read.
An example of input data is as follows:
```json
{
"entities": [
@@ -127,6 +136,7 @@ A example of input data is as follows:
```
Another example of **covalently modified** input:
```json
{
"entities": [
@@ -149,18 +159,21 @@ Another example of **covalently modified** input:
"type": "bond",
"bond": "A,CYS,445,SG,B,HEM,1,FE,covale,2.3",
"_comment": "<chain-id>,<residue name>,<residue index>,<atom id>,<chain-id>,<residue name>,<residue index>,<atom id>,<bond type>,<bond length>",
"_also_comment": "For ccd input, use CCD key as residue name; for smiles and file input, use `UNK-<index>` where index is the chain order you input"
"_another_comment": "use semicolon to separate multiple bonds",
"_also_comment": "For ccd input, use CCD key as residue name; for smiles and file input, use `UNK-<index>` where index is the chain order you input, e.g. `UNK-1` for the first ligand chain (count #1), `UNK-2` for the second (count #2)."
}
]
}
```
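The `bond` field layout described in `_comment` can be checked before submitting a job. The following is a minimal sketch of a parser for that layout, assuming ten comma-separated fields per bond and semicolons between bonds as stated above; the `Bond` and `BondEnd` classes and `parse_bonds` are hypothetical names for illustration and are not part of the HelixFold3 code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BondEnd:
    chain_id: str
    residue_name: str
    residue_index: int
    atom_id: str


@dataclass(frozen=True)
class Bond:
    first: BondEnd
    second: BondEnd
    bond_type: str
    length: float


def parse_bonds(bond_field: str) -> List[Bond]:
    """Parse a `bond` string; multiple bonds are separated by semicolons."""
    bonds = []
    for item in bond_field.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        parts = [p.strip() for p in item.split(",")]
        if len(parts) != 10:
            raise ValueError(
                f"expected 10 comma-separated fields, got {len(parts)}: {item!r}"
            )
        bonds.append(
            Bond(
                first=BondEnd(parts[0], parts[1], int(parts[2]), parts[3]),
                second=BondEnd(parts[4], parts[5], int(parts[6]), parts[7]),
                bond_type=parts[8],
                length=float(parts[9]),
            )
        )
    return bonds
```

For the example above, `parse_bonds("A,CYS,445,SG,B,HEM,1,FE,covale,2.3")` yields one bond between atom `SG` of `CYS` 445 in chain `A` and atom `FE` of `HEM` 1 in chain `B`.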
To look up all atom ids of an entry in the CCD database:
```shell
helixfold_show_ccd +ccd_id=HEM
```
The command prints output like:
```text
# output:
[2024-08-23 22:44:36,324][absl][INFO] - Started Loading CCD dataset from /mnt/db/ccd/ccd_preprocessed_etkdg.pkl.gz
@@ -169,10 +182,97 @@ This command outputs like:
[2024-08-23 22:44:43,237][absl][INFO] - Atoms in HEM: ['CHA', 'CHB', 'CHC', 'CHD', 'C1A', 'C2A', 'C3A', 'C4A', 'CMA', 'CAA', 'CBA', 'CGA', 'O1A', 'O2A', 'C1B', 'C2B', 'C3B', 'C4B', 'CMB', 'CAB', 'CBB', 'C1C', 'C2C', 'C3C', 'C4C', 'CMC', 'CAC', 'CBC', 'C1D', 'C2D', 'C3D', 'C4D', 'CMD', 'CAD', 'CBD', 'CGD', 'O1D', 'O2D', 'NA', 'NB', 'NC', 'ND', 'FE']
```
To look up all atom ids in a given `sdf`/`mol2` file, note that the atom list follows the same order as the atoms in the file.
Atom ids as parsed by HF3:
```text
['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'N1', 'O1', 'O2', 'O3', 'O4', 'O5', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'N2', 'O6', 'O7', 'O8', 'O9', 'O10', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'O11', 'O12', 'O13', 'O14', 'O15', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'O16', 'O17', 'O18', 'O19', 'O20', 'C29', 'C30', 'C31', 'C32', 'C33', 'C34', 'O21', 'O22', 'O23', 'O24', 'O25', 'C35', 'C36', 'C37', 'C38', 'C39', 'C40', 'O26', 'O27', 'O28', 'O29', 'O30']
```
while the corresponding atom block in the `SDF` file is:
```text
29.7340 3.2540 76.7430 C 0 0 0 0 0 2 0 0 0 0 0 0
29.8160 4.4760 77.6460 C 0 0 1 0 0 3 0 0 0 0 0 0
28.5260 5.2840 77.5530 C 0 0 2 0 0 3 0 0 0 0 0 0
28.1780 5.5830 76.1020 C 0 0 1 0 0 3 0 0 0 0 0 0
28.2350 4.3240 75.2420 C 0 0 1 0 0 3 0 0 0 0 0 0
28.1040 4.6170 73.7650 C 0 0 0 0 0 2 0 0 0 0 0 0
31.3020 3.8250 79.4830 C 0 0 0 0 0 0 0 0 0 0 0 0
31.3910 3.4410 80.9280 C 0 0 0 0 0 1 0 0 0 0 0 0
30.0760 4.0880 79.0210 N 0 0 0 0 0 2 0 0 0 0 0 0
28.6870 6.5050 78.2670 O 0 0 0 0 0 1 0 0 0 0 0 0
26.8490 6.0910 76.0350 O 0 0 0 0 0 0 0 0 0 0 0 0
29.4950 3.6650 75.4130 O 0 0 0 0 0 0 0 0 0 0 0 0
29.3670 4.5550 73.1150 O 0 0 0 0 0 1 0 0 0 0 0 0
32.2950 3.8940 78.7640 O 0 0 0 0 0 0 0 0 0 0 0 0
26.7420 7.4140 75.6950 C 0 0 1 0 0 3 0 0 0 0 0 0
25.2700 7.7830 75.6110 C 0 0 1 0 0 3 0 0 0 0 0 0
25.1290 9.2300 75.1610 C 0 0 2 0 0 3 0 0 0 0 0 0
25.9180 10.1440 76.0880 C 0 0 1 0 0 3 0 0 0 0 0 0
27.3630 9.6720 76.2210 C 0 0 1 0 0 3 0 0 0 0 0 0
28.1310 10.4360 77.2730 C 0 0 0 0 0 2 0 0 0 0 0 0
23.8820 5.8170 75.1400 C 0 0 0 0 0 0 0 0 0 0 0 0
23.1980 5.0100 74.0810 C 0 0 0 0 0 1 0 0 0 0 0 0
24.5530 6.8930 74.7160 N 0 0 0 0 0 2 0 0 0 0 0 0
23.7530 9.5950 75.1670 O 0 0 0 0 0 1 0 0 0 0 0 0
25.9170 11.4700 75.5730 O 0 0 0 0 0 0 0 0 0 0 0 0
27.4050 8.2900 76.6040 O 0 0 0 0 0 0 0 0 0 0 0 0
29.5300 10.4030 77.0280 O 0 0 0 0 0 1 0 0 0 0 0 0
23.8300 5.5110 76.3290 O 0 0 0 0 0 0 0 0 0 0 0 0
25.3940 12.4250 76.4090 C 0 0 1 0 0 3 0 0 0 0 0 0
25.9490 13.7680 75.9090 C 0 0 2 0 0 3 0 0 0 0 0 0
25.1320 14.9560 76.4900 C 0 0 2 0 0 3 0 0 0 0 0 0
23.6130 14.6900 76.6390 C 0 0 1 0 0 3 0 0 0 0 0 0
23.3700 13.3000 77.2280 C 0 0 1 0 0 3 0 0 0 0 0 0
21.9020 12.9360 77.3500 C 0 0 0 0 0 2 0 0 0 0 0 0
25.9010 13.8490 74.4810 O 0 0 0 0 0 1 0 0 0 0 0 0
25.3420 16.1410 75.7110 O 0 0 0 0 0 0 0 0 0 0 0 0
23.0420 15.6520 77.5170 O 0 0 0 0 0 1 0 0 0 0 0 0
23.9910 12.3690 76.3570 O 0 0 0 0 0 0 0 0 0 0 0 0
21.3660 12.8480 76.0500 O 0 0 0 0 0 0 0 0 0 0 0 0
20.8090 11.6500 75.6780 C 0 0 2 0 0 3 0 0 0 0 0 0
20.6800 11.6410 74.1740 C 0 0 2 0 0 3 0 0 0 0 0 0
19.5510 12.5850 73.8180 C 0 0 2 0 0 3 0 0 0 0 0 0
18.2370 12.0940 74.4540 C 0 0 1 0 0 3 0 0 0 0 0 0
18.4030 11.9240 75.9810 C 0 0 1 0 0 3 0 0 0 0 0 0
17.2710 11.1260 76.6120 C 0 0 0 0 0 2 0 0 0 0 0 0
20.2900 10.3510 73.7080 O 0 0 0 0 0 1 0 0 0 0 0 0
19.4280 12.7380 72.4110 O 0 0 0 0 0 0 0 0 0 0 0 0
17.2120 13.0460 74.2030 O 0 0 0 0 0 1 0 0 0 0 0 0
19.6260 11.2000 76.3010 O 0 0 0 0 0 0 0 0 0 0 0 0
16.0670 11.4490 75.9360 O 0 0 0 0 0 1 0 0 0 0 0 0
20.2190 13.6280 71.7260 C 0 0 2 0 0 3 0 0 0 0 0 0
19.6090 14.0000 70.3810 C 0 0 2 0 0 3 0 0 0 0 0 0
19.6360 12.7820 69.4880 C 0 0 2 0 0 3 0 0 0 0 0 0
21.0860 12.3100 69.3240 C 0 0 1 0 0 3 0 0 0 0 0 0
21.7030 12.0240 70.7120 C 0 0 1 0 0 3 0 0 0 0 0 0
23.1940 11.7460 70.6620 C 0 0 0 0 0 2 0 0 0 0 0 0
20.4080 14.9810 69.7000 O 0 0 0 0 0 1 0 0 0 0 0 0
19.0310 13.0500 68.2340 O 0 0 0 0 0 1 0 0 0 0 0 0
21.1060 11.1280 68.5380 O 0 0 0 0 0 1 0 0 0 0 0 0
21.5380 13.1700 71.5840 O 0 0 0 0 0 0 0 0 0 0 0 0
23.8240 12.5210 71.6820 O 0 0 0 0 0 1 0 0 0 0 0 0
26.0070 17.3020 76.0200 C 0 0 2 0 0 3 0 0 0 0 0 0
27.0750 17.5250 74.9350 C 0 0 2 0 0 3 0 0 0 0 0 0
28.3660 16.8320 75.3290 C 0 0 2 0 0 3 0 0 0 0 0 0
28.7820 17.2470 76.7510 C 0 0 1 0 0 3 0 0 0 0 0 0
27.6930 16.8120 77.7320 C 0 0 1 0 0 3 0 0 0 0 0 0
27.9770 17.2020 79.1710 C 0 0 0 0 0 2 0 0 0 0 0 0
27.3990 18.9140 74.8010 O 0 0 0 0 0 1 0 0 0 0 0 0
29.4060 17.0990 74.3950 O 0 0 0 0 0 1 0 0 0 0 0 0
30.0160 16.6410 77.0930 O 0 0 0 0 0 1 0 0 0 0 0 0
26.4610 17.4820 77.3520 O 0 0 0 0 0 0 0 0 0 0 0 0
27.3660 18.4620 79.4040 O 0 0 0 0 0 1 0 0 0 0 0 0
```
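Comparing the two listings above suggests that each atom id is the element symbol followed by a running per-element count, in file order. The sketch below reproduces that naming from SDF atom-block lines. It is an inference from this example only, not taken from the HF3 source, and `atom_ids_from_atom_block` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List


def atom_ids_from_atom_block(atom_lines: Iterable[str]) -> List[str]:
    """Build ids such as C1, C2, N1 from SDF (V2000) atom-block lines.

    Each atom line is expected to look like `x y z SYMBOL ...`, so the
    element symbol is the fourth whitespace-separated field.
    Assumption: ids are numbered per element in file order, as the
    listings above suggest.
    """
    counts = Counter()
    ids = []
    for line in atom_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        element = fields[3]
        counts[element] += 1
        ids.append(f"{element}{counts[element]}")
    return ids
```

Applied to the atom block above, the first eight carbon lines map to `C1` through `C8`, the first nitrogen line to `N1`, and the next five oxygen lines to `O1` through `O5`, which matches the list parsed by HF3.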
#### Running HelixFold for Inference
To run inference on a sequence or multiple sequences using HelixFold3's pretrained parameters, run e.g.:
##### Run from default config
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold \
@@ -182,6 +282,7 @@ helixfold \
```
##### Run with a customized configuration dir and file (`./myfold.yaml`, for example):
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold --config-dir=. --config-name=myfold \
@@ -191,14 +292,15 @@ helixfold --config-dir=. --config-name=myfold \
```
##### Run with additional configuration terms
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold \
input=./data/demo_6zcy.json \
output=. \
CONFIG_DIFFS.model.heads.confidence_head.weight=0.01 \
CONFIG_DIFFS.model.global_config.subbatch_size=192 \
CONFIG_DIFFS.model.num_recycle=10
CONFIG_DIFFS.preset=allatom_demo \
+CONFIG_DIFFS.model.global_config.subbatch_size=192 \
+CONFIG_DIFFS.model.num_recycle=10
```
The descriptions of the above script are as follows:
@@ -216,7 +318,7 @@ The outputs will be in a subfolder of `output_dir`, including the computed MSAs,
ranked structures, and evaluation metrics. For a task that runs inference twice with diffusion batch size 3,
and an input JSON named `demo_data.json`, the `output_dir` directory will have the following structure:
```
```text
<output_dir>/
└── demo_data/
├── demo_data-pred-1-1/
@@ -240,9 +342,10 @@ assume your input JSON is named `demo_data.json`, the `output_dir` directory wil
└── ...
```
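To collect the prediction folders from such a tree, a small script can match the `<name>-pred-X-Y` naming shown above. This is a minimal sketch that relies only on that folder naming; `list_predictions` is a hypothetical helper, and it makes no assumption about the files inside each prediction folder.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches folder names like `demo_data-pred-1-3` (X-th inference, Y-th diffusion batch).
PRED_DIR = re.compile(r"^(?P<name>.+)-pred-(?P<infer>\d+)-(?P<batch>\d+)$")


def list_predictions(job_dir) -> List[Tuple[int, int, Path]]:
    """Return (inference index, diffusion batch index, path), sorted by both indices."""
    found = []
    for child in Path(job_dir).iterdir():
        match = PRED_DIR.match(child.name)
        if child.is_dir() and match:
            found.append((int(match["infer"]), int(match["batch"]), child))
    return sorted(found, key=lambda item: (item[0], item[1]))
```

For the task described above (two inferences, diffusion batch size 3), calling `list_predictions("<output_dir>/demo_data")` would be expected to return six entries, from `demo_data-pred-1-1` to `demo_data-pred-2-3`.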
The contents of each output file are as follows:
* `final_features.pkl` – A `pickle` file containing the input feature NumPy arrays
used by the models to predict the structures.
used by the models to predict the structures. If you need to re-run an inference without re-building the MSAs, delete this file.
* `msas/` - A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `demo_data-pred-X-Y` - Prediction results of `demo_data.json` in the X-th inference and Y-th diffusion batch,
@@ -256,8 +359,7 @@ We suggest a single GPU for inference has at least 32G available memory. The max
single V100-32G with precision `fp32` is up to 1000. Inferring more tokens, or entities such as nucleic acids that have more atoms
per token than normal protein residues, may cost more GPU memory.
For samples with larger tokens, you can reduce `model.global_config.subbatch_size` in `CONFIG_DIFFS` in `helixfold/model/config.py` to save more GPU memory but suffer from slower inference. `model.global_config.subbatch_size` is set as `96` by default. You can also
reduce the number of additional recycles by changing `model.num_recycle` in the same place.
For samples with more tokens, you can override `model.global_config.subbatch_size` in `CONFIG_ALLATOM` by passing `+CONFIG_DIFFS.model.global_config.subbatch_size=X` on the command line, where `X` is smaller than `96`; this saves GPU memory at the cost of slower inference. You can also reduce the number of additional recycles by setting `+CONFIG_DIFFS.model.num_recycle=Y`, where `Y` is smaller than `3`.
We are keen to support inference on longer token sequences; it will come soon.
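When launching many jobs from a script, the memory-related overrides can be assembled programmatically. The following is a minimal sketch that only formats the override strings shown in this README; `memory_saving_overrides` and `build_command` are hypothetical helpers, and the preset `allatom_demo` and the input path are copied from the example command above rather than required values.

```python
from typing import List, Optional


def memory_saving_overrides(
    subbatch_size: Optional[int] = None,
    num_recycle: Optional[int] = None,
) -> List[str]:
    """Format the `+CONFIG_DIFFS...` overrides described above."""
    overrides = []
    if subbatch_size is not None:
        if subbatch_size <= 0:
            raise ValueError("subbatch_size must be positive")
        overrides.append(
            f"+CONFIG_DIFFS.model.global_config.subbatch_size={subbatch_size}"
        )
    if num_recycle is not None:
        if num_recycle < 0:
            raise ValueError("num_recycle must not be negative")
        overrides.append(f"+CONFIG_DIFFS.model.num_recycle={num_recycle}")
    return overrides


def build_command(input_json: str, output_dir: str, overrides: List[str]) -> List[str]:
    """Assemble an argv list mirroring the example command in this README."""
    return [
        "helixfold",
        f"input={input_json}",
        f"output={output_dir}",
        "CONFIG_DIFFS.preset=allatom_demo",
        *overrides,
    ]
```

For example, `build_command("./data/demo_6zcy.json", ".", memory_saving_overrides(subbatch_size=48, num_recycle=2))` returns a list that can be passed to `subprocess.run`, with `LD_LIBRARY_PATH` set in the environment as in the shell examples.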