A pipeline based on sourmash and KEGG database to obtain functional profiles for metagenomic sequences.
The associated publication for this tool is accessible via: https://doi.org/10.1093/bioinformatics/btae397
- pre-built reference data from KEGG
- metagenomic sequences (fasta or fastq file)
- List of KEGG Orthology (KOs) with abundance
- Standard
sourmash prefetchoutput with more KO-based statistics (including containment, ANI etc.)
With conda or mamba installed, create the environment from the provided environment.yaml:
conda env create -f environment.yaml
conda activate fmhfunprofiler
pip install git+https://github.com/KoslickiLab/fmh-funprofiler.git
Install the conda dependencies manually, then pip install the package from GitHub:
mamba create -n fmhfunprofiler
conda activate fmhfunprofiler
mamba install -c bioconda -c conda-forge sourmash pandas biom-format
pip install git+https://github.com/KoslickiLab/fmh-funprofiler.git
git clone https://github.com/KoslickiLab/fmh-funprofiler.git
cd fmh-funprofiler
pip install .
cd demo
# Obtain reference data (pre-built sketches for the KEGG database)
wget https://zenodo.org/records/10045253/files/KOs_sketched_scaled_1000.sig.zip
# profile the example fastq file
funcprofiler metagenome_example.fastq KOs_sketched_scaled_1000.sig.zip 11 1000 ko_profiles -p prefetch_out
# change result to biom format
sed 's/,/\t/g' ko_profiles > temp.tsv
biom convert -i temp.tsv -o ko_profiles.biom --table-type="Ortholog table" --to-hdf5
ko_profiles: a csv file with abundance for all identified KOs in the sampleprefetch_out: the fullsourmash prefetchoutput containing more KO-based statistics, check here for more detailsko_profiles.biom: a biom format of KO profiles. Please note that there is no metadata added now.
The followings are detailed procedures for this pipeline.
https://zenodo.org/records/10045253
- The sbt's are sequence-bloom-trees. They make the pipeline run faster. We have made three sbt's available to use: scaled=1000 and k=7, 11, and 15.
- We also have made two sig.zip files available: one with scaled=500, and another with scaled=1000. Each of these files have three k values: 7, 11, and 15.
As the reference database, one can use either the sig.zip files, or the sbt's. The results will be the same.
You may use your own reference data instead of KEGG, here is how to build your own reference data
# if you have a list of files, each file represents 1 gene (or gene cluster)
find <path of foler containg FASTA/FASTQ files> > dataset.csv
sourmash sketch fromfile -p protein,k=7,k=11,k=15,abund,scaled=1000 -o ref_sketch.sig.zip dataset.csv
# if you have a single file will all genes stored together and each record represents 1 gene (cluster)
sourmash sketch -p protein,k=7,k=11,k=15,abund,scaled=1000 --singleton -o ref_sketch.sig.zip ${input_file}
https://zenodo.org/records/10981414
To accommodate analyses at varying sequencing depths, we offer a range of scale factors for sketches of the KEGG database. You may download any or all of these scale factors as needed for your research.
- k value: 7, 11, and 15
- scale factor: 10, 50, 100, 150, 200, and 500
funcprofiler --help
A metagenome example fastq file is available with this repository. Obtain the KO sketches from the link given above. Then, run the following command:
cd demo
funcprofiler metagenome_example.fastq KOs_sbt_scaled_1000_k_11.sbt.zip 11 1000 ko_profiles
The output is a csv file named ko_profiles, which lists the KOs that are present in the sample, and the second column gives their relative abundances.
| Name | Description |
|---|---|
| mg_filename | Path to the input metagenomic sequences (fasta or fastq) |
| ko_sketch | Path to the reference sketch |
| ksize | Protein kmer size (7, 11, or 15) to use, corresponding to (21,33,45) for DNA |
| scaled | The scale factor in the FracMinHash technique, 1000 is suitable for KEGG KO. |
| output | Output filename |
| -t THRESHOLD_BP | Least bp of overlap for a reference gene (cluster) to be present, default 1000 |
| -p PREFETCH_FILE | Output filename for the sourmash prefetch output |
One can run many instances of funcprofiler simultaneously in parallel for many metagenomes using the funcprofiler-many command. This command takes a file_list as input, which should contain a list of all input metagenome files, and their target output files.
funcprofiler-many <KO_REF_DB> <KSIZE> <SCALED> <FILE_LIST> <THRESHOLD_BP>
- KO_REF_DB: the KO reference db (sig.zip or sbt, detailed above)
- KSIZE: proper protein k-mer size
- SCALED: proper scaled value (our pre-built ref dbs support 1000 or 500)
- FILE_LIST: a text csv file, containing no headers, two columns, first column being the metagenome file paths, and the second column being the corresponding target ko profile output names
- THRESHOLD_BP: the threshold bp to use in sourmash. 50/100/500 etc. are typical values. Larger value is faster with reduced sensitivity.
cd demo
funcprofiler-many KOs_sbt_scaled_1000_k_11.sbt.zip 11 1000 list_of_files 100
fmhfunprofiler is published in Bioinformatics, please cite the following.
Hera MR, Liu S, Wei W, Rodriguez JS, Ma C, Koslicki D. Metagenomic functional profiling: to sketch or not to sketch? Bioinformatics. 2024 Sep 1;40(Suppl 2):ii165-ii173. doi: 10.1093/bioinformatics/btae397. PMID: 39230701; PMCID: PMC11373326.