Merge pull request #7 from steineggerlab/foldseek_default
Foldseek default
pskvins authored Feb 11, 2025
2 parents 77863df + e932bfc commit 10f5992
Showing 5 changed files with 43 additions and 600 deletions.
42 changes: 16 additions & 26 deletions README.md
@@ -11,8 +11,7 @@ Kim, D., Park, S., & Steinegger, M. (2024). Unicore enables scalable and accurat
## Table of Contents
- [Unicore](#unicore)
- [Quick Start with Conda](#quick-start-with-conda)
- [GPU acceleration with CUDA](#gpu-acceleration-with-cuda)
- [GPU acceleration with Foldseek-ProstT5 (beta)](#gpu-acceleration-with-foldseek-prostt5-beta)
- [GPU acceleration with Foldseek-ProstT5](#gpu-acceleration-with-foldseek-prostt5)
- [Tutorial](#tutorial)
- [Manual](#manual)
- [Input](#input)
@@ -29,24 +28,16 @@ conda install -c bioconda unicore
unicore -v
```

### GPU acceleration with CUDA
The `createdb` module can be greatly accelerated with ProstT5-GPU.
If you have a Linux machine with a CUDA-compatible GPU, please install this additional package:
```
conda install -c conda-forge pytorch-gpu
```
### GPU acceleration with Foldseek-ProstT5
Foldseek features GPU acceleration for ProstT5 prediction under the following requirements:
* Turing or newer NVIDIA GPU
* `foldseek` ≥10
* `glibc` ≥2.17
* `nvidia-driver` ≥525.60.13
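
A minimal sketch for verifying these requirements on a Linux machine; `foldseek version` is assumed to print the installed release (mirroring the MMseqs2 CLI), and the other two commands are standard system tools:
```
# GPU model and NVIDIA driver version (Turing or newer, driver >= 525.60.13)
nvidia-smi --query-gpu=name,driver_version --format=csv
# glibc version (>= 2.17)
ldd --version | head -n 1
# Foldseek release (>= 10)
foldseek version
```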

### GPU acceleration with Foldseek-ProstT5 (beta)
> Note. This feature is under development and may not work in some environments. We will provide an update after the stable release of Foldseek-ProstT5.
Foldseek provides a GPU-compatible static binary for ProstT5 prediction (requires Linux with AVX2 support, `glibc` ≥2.29, and `nvidia-driver` ≥525.60.13)<br>
To use it, please install it by running the following command:
```
wget https://mmseqs.com/foldseek/foldseek-linux-gpu.tar.gz; tar xvfz foldseek-linux-gpu.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
Apply the `--gpu` option to either the `easy-core` or `createdb` module to use it, e.g.
```
Then, add the `--use-foldseek` and `--gpu` options to either the `easy-core` or `createdb` module to use the Foldseek implementation of ProstT5-GPU:
```
unicore easy-core --use-foldseek --gpu <INPUT> <OUTPUT> <MODEL> <TMP>
unicore easy-core --gpu <INPUT> <OUTPUT> <MODEL> <TMP>
```
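
For example, with the tutorial dataset unpacked into a `data` folder (see the tutorial below) and the ProstT5 weights already downloaded, a GPU run could look like this; the output and tmp paths are illustrative:
```
unicore easy-core --gpu data results /path/to/prostt5/weights tmp
```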

<hr>
@@ -61,7 +52,7 @@ unzip unicore_example.zip
If you cloned the repository, you can find the example dataset in the `example/data` folder.

### Download ProstT5 weights
You need to first download the ProstT5 weights to run the `createdb` module.
You can download the ProstT5 weights required by the `createdb` module in advance (otherwise `createdb` will download them automatically):
```
foldseek databases ProstT5 weights tmp
```
@@ -142,13 +133,14 @@ This module runs much faster with GPU. Please install `cuda` for GPU acceleration.

To run the module, please use the following command:
```
# Download ProstT5 weights as below if you haven't already
# foldseek databases ProstT5 /path/to/prostt5/weights tmp
unicore createdb data db/proteome_db /path/to/prostt5/weights
```
This will create a Foldseek database in the `db` folder.

If you have Foldseek installed with CUDA, you can run ProstT5 within the module through Foldseek by adding the `--use-foldseek` option.
If you want to select the GPU devices, please use the `CUDA_VISIBLE_DEVICES` environment variable.

* `CUDA_VISIBLE_DEVICES=0` to use GPU 0.
* `CUDA_VISIBLE_DEVICES=0,1` to use GPU 0 and 1.
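
For instance, to pin the `createdb` run from above to the first GPU (same paths as in the earlier command, with the `--gpu` flag placed after the module name as in the `easy-core` example):
```
CUDA_VISIBLE_DEVICES=0 unicore createdb --gpu data db/proteome_db /path/to/prostt5/weights
```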

#### cluster
The `cluster` module takes a `createdb` output database, runs Foldseek clustering, and outputs the cluster results.
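
A hedged sketch of a `cluster` invocation, assuming it follows the `<INPUT DB> <OUTPUT> <TMP>` argument pattern of the other modules; the exact arguments are an assumption, so consult `unicore cluster --help` for the authoritative usage:
```
# Argument order is an assumption based on the other modules
unicore cluster db/proteome_db out/clu tmp
```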
@@ -217,11 +209,9 @@ unicore gene-tree --realign --threshold 30 --name /path/to/hashed/gene/names tre
## Build from Source
### Minimum requirements
* [Cargo](https://www.rust-lang.org/tools/install) (Rust)
* [Foldseek](https://foldseek.com) (version ≥ 9)
* [Foldseek](https://foldseek.com) (version ≥ 10)
* [Foldmason](https://foldmason.foldseek.com)
* [IQ-TREE](http://www.iqtree.org/)
* pytorch, transformers, sentencepiece, protobuf
- These are required for users who cannot build foldseek with CUDA. Please install them with `pip install torch transformers sentencepiece protobuf`.
### Optional requirements
* [MAFFT](https://mafft.cbrc.jp/alignment/software/)
* [Fasttree](http://www.microbesonline.org/fasttree/) or [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/)
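
Most of these tools are also distributed through Conda; the snippet below is a sketch of a one-shot install, assuming the bioconda channel carries current builds under these package names:
```
# Package names are assumptions; check bioconda for the exact names
conda install -c conda-forge -c bioconda foldseek foldmason iqtree mafft fasttree
```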
@@ -240,5 +230,5 @@ With these tools installed, you can install and run `unicore` by:
git clone https://github.com/steineggerlab/unicore.git
cd unicore
cargo build --release
bin/unicore help
bin/unicore -v
```
9 changes: 0 additions & 9 deletions src/envs/variables.rs
@@ -74,15 +74,6 @@ pub fn locate_path_cfg() -> String {
err::error(err::ERR_GENERAL, Some("Could not locate path.cfg".to_string()));
}
}
pub fn locate_encoder_py() -> String {
if File::open(format!("{}{}etc{}predict_3Di_encoderOnly.py", parent_dir(), SEP, SEP)).is_ok() {
format!("{}{}etc{}predict_3Di_encoderOnly.py", parent_dir(), SEP, SEP)
} else if File::open(format!("{}{}src{}py{}predict_3Di_encoderOnly.py", src_parent_dir(), SEP, SEP, SEP)).is_ok() {
format!("{}{}src{}py{}predict_3Di_encoderOnly.py", src_parent_dir(), SEP, SEP, SEP)
} else {
err::error(err::ERR_GENERAL, Some("Could not locate path.cfg".to_string()));
}
}

// binary paths
pub const VALID_BINARY: [&str; 8] = [
118 changes: 26 additions & 92 deletions src/modules/createdb.rs
@@ -26,18 +26,11 @@ pub fn run(args: &Args, bin: &var::BinaryPaths) -> Result<(), Box<dyn std::error
let overwrite = args.createdb_overwrite.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - overwrite".to_string())); });
let max_len = args.createdb_max_len.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - max_len".to_string())); });
let gpu = args.createdb_gpu.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - gpu".to_string())); });
let use_python = args.createdb_use_python.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - use_python".to_string())); });
let use_foldseek = args.createdb_use_foldseek.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - use_foldseek".to_string())); });
let afdb_lookup = args.createdb_afdb_lookup.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - afdb_lookup".to_string())); });
let afdb_local = args.createdb_afdb_local.clone().unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - afdb_local".to_string())); });
let threads = crate::envs::variables::threads();
let foldseek_verbosity = (match var::verbosity() { 4 => 3, 3 => 2, _ => var::verbosity() }).to_string();

// Either use_foldseek or use_python must be true
if !use_foldseek && !use_python {
err::error(err::ERR_ARGPARSE, Some("Either use_foldseek or use_python must be true".to_string()));
}

// Check afdb_lookup
let afdb_local = if afdb_lookup && !afdb_local.is_some() {
err::error(err::ERR_ARGPARSE, Some("afdb-lookup is provided but afdb-local is not given".to_string()));
@@ -135,44 +128,37 @@ pub fn run(args: &Args, bin: &var::BinaryPaths) -> Result<(), Box<dyn std::error
fasta::write_fasta(&combined_aa, &fasta_data)?;
}

if use_foldseek {
// Added use_foldseek temporarily.
// TODO: Remove use_foldseek when foldseek is ready
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
_none => { err::error(err::ERR_BINARY_NOT_FOUND, Some("foldseek".to_string())); }
};

// Check if old weights exist
if Path::new(&model).join("cnn.safetensors").exists() || Path::new(&model).join(format!("model{}cnn.safetensors", SEP)).exists() {
err::error(err::ERR_GENERAL, Some("Old weight files detected from the given path. Please provide different path for the model weights".to_string()));
}
// Check if weights exist
if !Path::new(&model).join("prostt5-f16.gguf").exists() {
// Download the model
std::fs::create_dir_all(format!("{}{}tmp", model, SEP))?;
let mut cmd = std::process::Command::new(foldseek_path);
let mut cmd = cmd
.arg("databases").arg("ProstT5").arg(&model).arg(format!("{}{}tmp", model, SEP)).arg("--threads").arg(threads.to_string());
cmd::run(&mut cmd);
}
// Use foldseek to create the database
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
_none => { err::error(err::ERR_BINARY_NOT_FOUND, Some("foldseek".to_string())); }
};

// Run foldseek createdb
// Check if old weights exist
if Path::new(&model).join("cnn.safetensors").exists() || Path::new(&model).join(format!("model{}cnn.safetensors", SEP)).exists() {
err::error(err::ERR_GENERAL, Some("Old weight files detected from the given path. Please provide different path for the model weights".to_string()));
}
// Check if weights exist
if !Path::new(&model).join("prostt5-f16.gguf").exists() {
// Download the model
std::fs::create_dir_all(format!("{}{}tmp", model, SEP))?;
let mut cmd = std::process::Command::new(foldseek_path);
let cmd = cmd
.arg("createdb").arg(&combined_aa).arg(&output)
.arg("--prostt5-model").arg(&model)
.arg("--threads").arg(threads.to_string());
let mut cmd = if gpu {
cmd.arg("--gpu").arg("1")
} else { cmd };
let mut cmd = cmd
.arg("databases").arg("ProstT5").arg(&model).arg(format!("{}{}tmp", model, SEP)).arg("--threads").arg(threads.to_string());
cmd::run(&mut cmd);
} else if use_python {
let _ = _run_python(&combined_aa, &curr_dir, &parent, &output, &model, keep, bin, threads.to_string());
} else {
err::error(err::ERR_GENERAL, Some("Either use_foldseek or use_python must be true".to_string()));
}

// Run foldseek createdb
let mut cmd = std::process::Command::new(foldseek_path);
let cmd = cmd
.arg("createdb").arg(&combined_aa).arg(&output)
.arg("--prostt5-model").arg(&model)
.arg("--threads").arg(threads.to_string());
let mut cmd = if gpu {
cmd.arg("--gpu").arg("1")
} else { cmd };
cmd::run(&mut cmd);

if afdb_lookup {
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
@@ -221,57 +207,5 @@ pub fn run(args: &Args, bin: &var::BinaryPaths) -> Result<(), Box<dyn std::error
chkpnt::write_checkpoint(&checkpoint_file, "1")?;


Ok(())
}

fn _run_python(combined_aa: &String, curr_dir: &str, parent: &str, output: &str, model: &str, keep: bool, bin: &crate::envs::variables::BinaryPaths, threads: String) -> Result<(), Box<dyn std::error::Error>> {
let input_3di = format!("{}{}{}{}combined_3di.fasta", curr_dir, SEP, parent, SEP);
let inter_prob = format!("{}{}{}{}output_probabilities.csv", curr_dir, SEP, parent, SEP);
let output_3di = format!("{}{}{}_ss", curr_dir, SEP, output);
let foldseek_verbosity = (match var::verbosity() { 4 => 3, 3 => 2, _ => var::verbosity() }).to_string();

// Run python script
let mut cmd = std::process::Command::new("python");
let mut cmd = cmd
.arg(var::locate_encoder_py())
.arg("-i").arg(&combined_aa)
.arg("-o").arg(&input_3di)
.arg("--model").arg(&model)
.arg("--half").arg("0")
.arg("--threads").arg(threads);
cmd::run(&mut cmd);

// Build foldseek db
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
_none => { err::error(err::ERR_BINARY_NOT_FOUND, Some("foldseek".to_string())); }
};
let mut cmd = std::process::Command::new(foldseek_path);
let mut cmd = cmd
.arg("base:createdb").arg(&combined_aa).arg(&output)
.arg("--shuffle").arg("0")
.arg("-v").arg(foldseek_verbosity.as_str());

cmd::run(&mut cmd);

// Build foldseek 3di db
let mut cmd = std::process::Command::new(foldseek_path);
let mut cmd = cmd
.arg("base:createdb").arg(&input_3di).arg(&output_3di)
.arg("--shuffle").arg("0")
.arg("-v").arg(foldseek_verbosity.as_str());
cmd::run(&mut cmd);

// Delete intermediate files
if !keep {
// std::fs::remove_file(mapping_file)?;
// std::fs::remove_file(combined_aa)?;
std::fs::remove_file(input_3di)?;
std::fs::remove_file(inter_prob)?;
}

// // Write the checkpoint file
// chkpnt::write_checkpoint(&format!("{}/createdb.chk", parent), "1")?;

Ok(())
}