Unicore is a method for scalable and accurate phylogenetic reconstruction with structural core genes using Foldseek and ProstT5, universally applicable to any given set of taxa.
Kim, D., Park, S., & Steinegger, M. (2024). Unicore enables scalable and accurate phylogenetic reconstruction with structural core genes. bioRxiv, 2024.12.22.629535. doi.org/10.1101/2024.12.22.629535
conda install -c bioconda unicore
unicore -v
The createdb module can be greatly accelerated with ProstT5-GPU.
If you have a Linux machine with CUDA-compatible GPU, please install this additional package:
conda install -c conda-forge pytorch-gpu
Note. This feature is under development and may not work in some environments. We will provide an update after the stable release of Foldseek-ProstT5.
Foldseek provides a GPU-compatible static binary for ProstT5 prediction (requires Linux with AVX2 support, glibc ≥2.29, and nvidia-driver ≥525.60.13).
To use it, please install it by running the following command:
wget https://mmseqs.com/foldseek/foldseek-linux-gpu.tar.gz; tar xvfz foldseek-linux-gpu.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
Then, add the --use-foldseek and --gpu options to either the easy-core or the createdb module to use the Foldseek implementation of ProstT5-GPU:
unicore easy-core --use-foldseek --gpu <INPUT> <OUTPUT> <MODEL> <TMP>
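The requirements listed above for the GPU binary can be checked before downloading it. The following is a minimal sketch, assuming a Linux machine with GNU coreutils (for sort -V); the helper name version_ge is made up for this example and is not part of Foldseek or Unicore.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# AVX2 support is advertised in the CPU flags on Linux.
if grep -q -m 1 avx2 /proc/cpuinfo 2>/dev/null; then
    echo "AVX2: ok"
else
    echo "AVX2: not detected"
fi

# getconf prints e.g. "glibc 2.31" on glibc-based systems.
glibc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}' || true)
if [ -n "$glibc_version" ] && version_ge "$glibc_version" 2.29; then
    echo "glibc $glibc_version: ok"
else
    echo "glibc ${glibc_version:-unknown}: needs >= 2.29"
fi

# nvidia-smi ships with the NVIDIA driver and is absent without one.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1 || true)
    if [ -n "$driver_version" ] && version_ge "$driver_version" 525.60.13; then
        echo "nvidia-driver $driver_version: ok"
    else
        echo "nvidia-driver ${driver_version:-unknown}: needs >= 525.60.13"
    fi
else
    echo "nvidia-driver: nvidia-smi not found"
fi
```

If any line reports a problem, the conda-based installation described earlier may be the safer route.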
If you are using the conda package, you can download the example dataset from the following link:
wget https://unicore.steineggerlab.workers.dev/unicore_example.zip
unzip unicore_example.zip
If you cloned the repository, you can find the example dataset in the example/data
folder.
You first need to download the ProstT5 weights to run the createdb module.
foldseek databases ProstT5 weights tmp
The easy-core module runs the whole pipeline, from the input proteomes to a phylogenetic tree built on their structural core genes.
Use the following command to run the easy-core module:
unicore easy-core example/data example/results weights tmp
If you have a CUDA-compatible GPU, add the --gpu flag to run ProstT5 with GPU acceleration.
After running the easy-core
module, you can find the results in the example/results
folder.
- The proteome folder contains the proteome information parsed from the input files.
- The cluster folder contains the Foldseek clustering result (clust.tsv).
- The profile folder contains the taxonomic profiling results and metadata of defined structural core genes.
- The tree folder contains the results from the phylogenetic inference.
example/results/tree/iqtree.treefile is the concatenated structural core gene tree represented in Newick format.
Each leaf in the tree represents a species, labeled with its input proteome file name.
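To confirm that every input proteome made it into the tree, the leaf labels can be pulled out of the Newick file with standard text tools. This is a rough sketch, not part of Unicore: the function name newick_leaves is made up, and it assumes unquoted labels that contain none of the characters ( ) , : ;.

```shell
# Print one leaf label per line from a Newick file.
# Leaves are the tokens that directly follow "(" or ",";
# internal node labels follow ")" and are therefore discarded.
newick_leaves() {
    tr -d '\n' < "$1" | tr '(,' '\n\n' | sed -e 's/[:);].*$//' | grep -v '^$'
}

# Toy tree for demonstration (labels are placeholders).
demo_tree=$(mktemp)
printf '((Proteome1:0.1,Proteome2:0.2)95:0.05,Proteome3:0.3);\n' > "$demo_tree"
newick_leaves "$demo_tree"
# prints:
# Proteome1
# Proteome2
# Proteome3
rm -f "$demo_tree"
```

On the example run, the same function could be pointed at example/results/tree/iqtree.treefile.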
example/results/tree/fasta
folder contains subfolders named after the defined structural core genes.
Each subfolder contains the amino acid sequences (aa.fasta) and their 3Di representations (3di.fasta) of the core genes.
We provide an easy workflow module that automatically runs all the modules in order.
- easy-core - Easy core gene phylogeny workflow, from fasta files to phylogenetic tree
Unicore has four main modules, which can be run sequentially to infer the phylogenetic tree of the given species.
- createdb - Create 3Di structural alphabet database from input species
- cluster - Cluster Foldseek database
- profile - Taxonomic profiling and core gene identification
- tree - Phylogenetic inference using structural core genes
Run each module with unicore <module> help
to see the detailed usage.
Unicore requires a set of proteomes as input to infer the phylogenetic tree. Please prepare the input proteomes in a folder.
You can also refer to the example dataset in the example/data
folder or download it from here.
Note. Currently, only proteomes in .fasta format are supported as input. We will try to support more types and formats that can represent species.
Example dataset:
data/
└─┬ Proteome1.fasta
  ├ Proteome2.fasta
  ...
  └ ProteomeN.fasta
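Before starting a run, it can help to confirm that the input folder looks as expected. The sketch below counts the sequences in each .fasta file of a folder; the function name count_proteome_sequences is made up for this example, and the toy files stand in for real proteomes.

```shell
# Print "<file name>\t<number of sequences>" for every .fasta file in folder $1.
count_proteome_sequences() {
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Each FASTA record starts with a ">" header line.
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}

# Toy input folder for demonstration.
demo_dir=$(mktemp -d)
printf '>p1\nMKV\n>p2\nMAL\n' > "$demo_dir/Proteome1.fasta"
printf '>p3\nMTT\n' > "$demo_dir/Proteome2.fasta"
count_proteome_sequences "$demo_dir"
rm -rf "$demo_dir"
```

A file reporting 0 sequences, or a folder printing nothing at all, is worth fixing before running any module.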
The easy-core workflow module orchestrates the four modules in order, processing everything from the input proteomes to the phylogenetic tree.
Example command:
# Download ProstT5 weights as below if you haven't already
# foldseek databases ProstT5 /path/to/prostt5/weights tmp
unicore easy-core data results /path/to/prostt5/weights tmp
This will create a results/tree
folder with phylogenetic trees built with the structural core genes identified from the input proteomes.
The easy-core module will also create folders named results/proteome, results/cluster, and results/profile with intermediate results for the createdb, cluster, and profile modules, respectively.
The createdb module takes a folder with input species and outputs their 3Di structural alphabet sequences predicted with ProstT5.
This module runs much faster on a GPU. Please install CUDA for GPU acceleration.
To run the module, please use the following command:
# Download ProstT5 weights as below if you haven't already
# foldseek databases ProstT5 /path/to/prostt5/weights tmp
unicore createdb data db/proteome_db /path/to/prostt5/weights
This will create a Foldseek database in the db
folder.
If you have Foldseek installed with CUDA, you can run ProstT5 through Foldseek by adding the --use-foldseek option.
The cluster module takes a createdb output database, runs Foldseek clustering, and outputs the cluster results.
By default, clustering is done with 80% bidirectional coverage (-c 0.8).
You can pass custom clustering parameters to Foldseek via the --cluster-options option.
Example command:
unicore cluster db/proteome_db out/clu tmp
This will create a clu.tsv
output file in the out
folder.
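A quick way to inspect the clustering is to count the members of each cluster. The sketch below assumes that clu.tsv follows the usual Foldseek/MMseqs2 cluster TSV layout, with the cluster representative in column 1 and one member per row in column 2; that layout is an assumption here, and the function name cluster_sizes is made up for this example.

```shell
# Print "<representative>\t<number of members>", largest clusters first.
cluster_sizes() {
    awk -F'\t' '
        { size[$1]++ }                                   # one row per member
        END { for (rep in size) printf "%s\t%d\n", rep, size[rep] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Toy cluster table for demonstration (identifiers are placeholders).
demo_clu=$(mktemp)
printf 'g1\tg1\ng1\tg2\ng1\tg3\ng4\tg4\n' > "$demo_clu"
cluster_sizes "$demo_clu"
# prints:
# g1	3
# g4	1
rm -f "$demo_clu"
```

With the example command above, the input would be out/clu.tsv.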
The profile module takes the database (createdb output) and the cluster results (cluster output) to find structural core genes.
By default, the module reports the genes that are present as a single copy in 80% of the species. You can change this threshold with the -t option.
Example command:
# 85% coverage
unicore profile -t 85 db/proteome_db out/clu.tsv result
This will create a result
folder with the core genes and their occurrences in the species.
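The single-copy criterion can be illustrated on a toy table. This sketch assumes a hypothetical two-column TSV (cluster ID, species) with one row per gene, and it reads the criterion as "single copy in at least t percent of the species". It is neither the profile module's input format nor its implementation, and the function name core_genes is made up for this example.

```shell
# $1: TSV with columns cluster ID and species, one row per gene (hypothetical format)
# $2: total number of species
# $3: threshold in percent
# Prints the clusters that occur as a single copy in at least $3 percent of the species.
core_genes() {
    awk -F'\t' -v n="$2" -v t="$3" '
        { copies[$1 FS $2]++ }                 # gene copies per (cluster, species)
        END {
            for (key in copies) {
                split(key, parts, FS)
                if (copies[key] == 1) single[parts[1]]++
            }
            for (c in single)
                if (single[c] * 100 >= t * n) print c
        }' "$1" | sort
}

# Toy table with 5 species: c1 is single copy in all of them,
# c2 is duplicated in sp1, and c3 occurs in only two species.
demo_tsv=$(mktemp)
for sp in sp1 sp2 sp3 sp4 sp5; do
    printf 'c1\t%s\nc2\t%s\n' "$sp" "$sp"
done > "$demo_tsv"
printf 'c2\tsp1\nc3\tsp1\nc3\tsp2\n' >> "$demo_tsv"

core_genes "$demo_tsv" 5 80   # prints c1 and c2 (c2 is single copy in 4 of 5 species)
core_genes "$demo_tsv" 5 85   # prints c1 only
rm -f "$demo_tsv"
```

Raising the threshold, as in the -t 85 example above, keeps fewer but more widely shared genes.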
The tree module takes the core genes and the species proteomes to infer the phylogenetic tree using the alignments of the structural core genes.
By default, the alignment is generated by foldmason and trimmed with 50% gap filtering, followed by phylogenetic inference using iqtree.
Example command:
unicore tree db/proteome_db result tree
This will create a tree
folder with the resulting phylogenetic trees in Newick format.
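Gap filtering of an alignment can be pictured as dropping the columns that are mostly gaps. The sketch below shows one plausible reading, in which a column is kept when at most the given percentage of its characters are "-". The exact rule applied by the tree module may differ, and the function name filter_gappy_columns is made up for this example.

```shell
# $1: aligned FASTA file (all sequences of equal length)
# $2: maximum percentage of gaps allowed in a kept column
filter_gappy_columns() {
    awk -v max="$2" '
        /^>/ { name[++n] = $0; next }
        { seq[n] = seq[n] $0 }                 # join wrapped sequence lines
        END {
            len = length(seq[1])
            for (i = 1; i <= len; i++) {
                gaps = 0
                for (s = 1; s <= n; s++)
                    if (substr(seq[s], i, 1) == "-") gaps++
                keep[i] = (gaps * 100 <= max * n)
            }
            for (s = 1; s <= n; s++) {
                out = ""
                for (i = 1; i <= len; i++)
                    if (keep[i]) out = out substr(seq[s], i, 1)
                print name[s]
                print out
            }
        }' "$1"
}

# Toy alignment: column 3 has 75% gaps and is removed at a 50% cutoff.
demo_aln=$(mktemp)
printf '>a\nMK-A\n>b\nM--A\n>c\nMR-A\n>d\nM-LA\n' > "$demo_aln"
filter_gappy_columns "$demo_aln" 50
rm -f "$demo_aln"
```

Column 2 of the toy alignment sits exactly at 50% gaps and is kept under this reading; a stricter rule would drop it.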
The gene-tree module takes the output folder of the tree module and infers a phylogenetic tree for each core gene.
Each phylogenetic tree will be saved in the tree/fasta/{gene_name} directory.
By default, the module reuses the alignment computed by the tree module.
Example command:
unicore gene-tree tree
If you want to recompute the alignment for each core gene, you can add the --realign option, which will build and filter the MSA again.
You can also use the --name option to provide a subset of hashed gene names for which to infer the phylogenetic tree.
A list of hashed gene names can be created and passed to --name by running the following commands:
# Create a list of hashed gene names
awk -F"\t" 'NR==FNR {a[$1];next} ($3 in a) {print $1}' /path/to/original/gene/names db/proteome_db.map > /path/to/hashed/gene/names
# Run gene-tree with the list of hashed gene names
# Also optionally use --realign option to recompute the alignment and --threshold option to filter the MSA
unicore gene-tree --realign --threshold 30 --name /path/to/hashed/gene/names tree
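The awk command above reads two files: it first loads the original gene names into a lookup table, then prints column 1 of every .map row whose column 3 is in that table. The toy run below shows the idea. The file contents are invented, the middle column is a placeholder because only columns 1 and 3 are used by the command, and the wrapper name hashed_names is made up for this example.

```shell
# $1: file with original gene names, one per line
# $2: tab-separated map file (column 1: hashed name, column 3: original name)
hashed_names() {
    # NR==FNR is true only while the first file is being read.
    awk -F"\t" 'NR==FNR {a[$1];next} ($3 in a) {print $1}' "$1" "$2"
}

demo_dir=$(mktemp -d)
printf 'geneA\ngeneC\n' > "$demo_dir/names.txt"
printf 'h0001\tplaceholder\tgeneA\nh0002\tplaceholder\tgeneB\nh0003\tplaceholder\tgeneC\n' > "$demo_dir/toy.map"
hashed_names "$demo_dir/names.txt" "$demo_dir/toy.map"
# prints:
# h0001
# h0003
rm -rf "$demo_dir"
```

Names that do not occur in column 3 of the map file are silently skipped, so comparing the line counts of the two lists is a cheap sanity check.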
- Cargo (Rust)
- Foldseek (version ≥ 9)
- Foldmason
- IQ-TREE
- pytorch, transformers, sentencepiece, protobuf
  - These are required for users who cannot build foldseek with CUDA. Please install them with pip install torch transformers sentencepiece protobuf.
Please install the latest version of Rust from here.
Foldseek can be installed from here.
You have to pre-download the ProstT5 model weights. Run foldseek databases ProstT5 <dir> tmp to download the weights to <dir>. If this doesn't work, make sure you have the latest version of Foldseek.
Foldmason and IQ-TREE are the default tools for alignment and phylogenetic inference. You can download Foldmason from here and IQ-TREE from here.
With these tools installed, you can install and run unicore
by:
git clone https://github.com/steineggerlab/unicore.git
cd unicore
cargo build --release
bin/unicore help