RogerLab/MetagenomeBinningWorkflow

Binning Workflow for Recovering Eukaryotic Genomes from Metagenomes

Overview

This workflow is designed to guide you through recovering eukaryotic genomes from metagenomic datasets based on the contigs classified as eukaryotic or unknown (EUnk) by Eukfinder.

Summary of Approach:

  1. Input Selection: Contigs classified as eukaryotic or unknown (EUnk) are extracted from Eukfinder results for further analysis.
  2. Multiple MyCC Analyses: Three separate MyCC analyses are conducted using different k-mer settings (4-mer, 5-mer, and combined 5-mer and 6-mer, written 5&6-mer).
  3. Data Mapping: Read coverage depth, identified rRNA sequences, and taxonomic identities of contigs are mapped to the corresponding MyCC bins.
  4. Filtering and Classification: Contigs are filtered based on multiple criteria and assigned to final eukaryotic bins or marked as mitochondrial genomes.

Inclusion Criteria for Eukaryotic Bins:

  • Depth of Coverage: A contig's depth of coverage cannot exceed that of the SSU rRNA gene.
  • Taxonomic Identity: The best PLAST hit cannot be a prokaryote or virus with >90% identity over an aligned length ≥1000 bp.
  • Mitochondrial Classification: Contigs identified as mitochondrial by Metaxa2 and BLAST are marked as mitochondrial genomes.
  • Eukaryotic Classification: Contigs identified as eukaryotic by Metaxa2, Centrifuge, and/or PLAST are considered for eukaryotic bins.
  • MyCC Clustering: Contigs must appear in potential eukaryotic clusters in at least two of the three MyCC k-mer analyses. A potential eukaryotic cluster is defined such that over 50% of the contigs in it are classified as Eukaryotic.
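
The depth and taxonomy criteria above can be sketched in Python as follows. This is a minimal illustration, not the logic of Binning.py: the PlastHit record, the domain labels in NON_EUK_DOMAINS, and the choice to let contigs without a PLAST hit pass are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical domain labels; the real labels depend on the taxonomy annotation used.
NON_EUK_DOMAINS = {"Bacteria", "Archaea", "Viruses"}


@dataclass
class PlastHit:
    """Best PLAST hit of a contig (illustrative record, not a PLAST data structure)."""
    domain: str          # domain of the best hit
    identity: float      # percent identity of the alignment
    aligned_length: int  # aligned length in bp


def passes_depth(contig_depth: float, ssu_depth: float) -> bool:
    """A contig's depth of coverage may not exceed that of the SSU rRNA gene."""
    return contig_depth <= ssu_depth


def passes_taxonomy(hit: Optional[PlastHit]) -> bool:
    """Reject contigs whose best hit is a confident prokaryotic or viral match."""
    if hit is None:
        # Assumption: a contig without any PLAST hit is not rejected here.
        return True
    confident = hit.identity > 90.0 and hit.aligned_length >= 1000
    return not (hit.domain in NON_EUK_DOMAINS and confident)
```

A contig would then be kept for further consideration only if both checks return True.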

Schematic Explanation of Supervised Binning

Panels (a)–(c) illustrate cluster maps generated by MyCC based on marker genes, k-mer composition, and depth of coverage for three different k-mer analyses: 4-mer, 5-mer, and a combination of 5-mer and 6-mer (denoted as 5&6-mer). Panel (d) summarizes the decision-making process for including or excluding contigs in the final eukaryotic genome bin.

Panels (a–c):

  • Geometric shapes (triangles, squares, and circles) represent contigs grouped into clusters.
  • The numerical labels indicate the cluster IDs.
  • Contigs with a hit to eukaryotes by Centrifuge or PLAST are shaded gray.
  • Potential eukaryotic clusters, where more than 50% of contigs are classified as eukaryotic, are highlighted in yellow.
  • The letters (A–F) mark specific contigs that appear at least once in a potential eukaryotic cluster.

Panel (d):

The table shows the inclusion criteria for contigs in the final eukaryotic genome bin.

Columns Explained:

  • Contig: Identifier for individual contigs.
  • Centrifuge results: Taxonomic classification of contigs by Centrifuge.
  • PLAST results: Taxonomic classification of contigs by PLAST.
  • Cluster Number of the Contig in MyCC: The cluster IDs for each k-mer analysis (4-mer, 5-mer, and 5&6-mer).
  • Times Hit Potential Euk Bin: Number of times a contig appears in potential eukaryotic clusters across the k-mer analyses.
  • Included/Excluded from Final Euk Genome: Indicates whether the contig meets the inclusion criteria for the eukaryotic genome.

Key Decision Rule:

To be included in the final eukaryotic genome bin, a contig must appear at least twice in potential eukaryotic clusters across the three MyCC k-mer analyses (e.g., Contigs A–C). Contigs not meeting this criterion are excluded.
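
The decision rule can be sketched in Python as shown here. The function names and the toy data are hypothetical and only illustrate the counting. Each analysis is represented as a dictionary mapping contig IDs to MyCC cluster IDs, and euk_contigs is the set of contigs classified as eukaryotic.

```python
from collections import defaultdict


def potential_euk_clusters(cluster_of, euk_contigs):
    """Return cluster IDs in which more than 50% of the contigs are eukaryotic."""
    members = defaultdict(list)
    for contig, cluster in cluster_of.items():
        members[cluster].append(contig)
    return {
        cluster
        for cluster, contigs in members.items()
        if sum(c in euk_contigs for c in contigs) > len(contigs) / 2
    }


def times_in_euk_clusters(analyses, euk_contigs):
    """Count, per contig, how many analyses place it in a potential eukaryotic cluster."""
    counts = defaultdict(int)
    for cluster_of in analyses:
        euk_clusters = potential_euk_clusters(cluster_of, euk_contigs)
        for contig, cluster in cluster_of.items():
            if cluster in euk_clusters:
                counts[contig] += 1
    return dict(counts)


def final_bin(analyses, euk_contigs, min_hits=2):
    """Keep contigs found in potential eukaryotic clusters at least min_hits times."""
    counts = times_in_euk_clusters(analyses, euk_contigs)
    return {contig for contig, n in counts.items() if n >= min_hits}


# Toy data (made up): contig -> cluster ID for the 4-mer, 5-mer, and 5&6-mer analyses.
four_mer = {"A": 1, "B": 1, "X": 2, "Y": 2}
five_mer = {"A": 1, "B": 2, "X": 2, "Y": 3}
five_six_mer = {"A": 1, "B": 1, "X": 1, "Y": 2}
euk_contigs = {"A", "B"}

selected = final_bin([four_mer, five_mer, five_six_mer], euk_contigs)
```

In this toy example, contig X lands in a potential eukaryotic cluster only once (in the 5&6-mer analysis), so it is excluded, while A and B are retained.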

Note: The combination of 5-mer and 6-mer is referred to as "5&6-mer" in this figure.

Workflow

  1. Prepare Input Files

You need the following input files before starting:

  • A FASTA file of assembled contigs (EUnk.fasta from Eukfinder_short or Eukfinder_long).
  • Results from various tools, including Centrifuge, PLAST, BLAST, Metaxa2, MetaBAT2, and MyCC.

  2. Run Parsing Scripts

Scripts are provided to process results from the tools mentioned above and integrate them into a single analysis pipeline.

  3. Generate Output

The pipeline produces two main outputs:

  • A FASTA file containing the recovered eukaryotic nuclear genome.
  • A FASTA file containing the recovered mitochondrial genome.

Step-by-Step Instructions

Step 1. Prepare the Environment

Ensure that you have the following tools and Python libraries installed:

  • Python (version 3.6+)

  • pandas, Centrifuge, PLAST, acc2tax, Bowtie2 (included in Eukfinder)

  • Biopython

    Install via pip install biopython.

  • SAMtools, BLAST, MyCC, MetaBAT2, and Metaxa2.


Step 2. Prepare Input Files

Ensure the assembled contigs file (EUnk.fasta) is located in the TempEukfinder directory.

Copy and rename the file:

cp TempEukfinder/EUnk.fasta Eukfinder_long.fasta

2.1 Run PLAST against the nt database

Run the shell script run_Plast.sh:

source activate eukfinder
query=Eukfinder_long.fasta
# Run plast
DB=/scratch5/db/Eukfinder/nt2021/nt.fasta
plast -e 1E-5 -max-hit-per-query 1 -outfmt 1 -a 48 -p plastn  \
      -i $query -d $DB -force-query-order 1000  \
      -o ${query::-6}.PLAST_nt.tsv

result file: Eukfinder_long.PLAST_nt.tsv

2.2 Run BLAST against Mitochondrial database to detect mitochondrial contigs

Run Blast_mito.sh

query=Eukfinder_long.fasta
source activate blast
export BLASTDB=/scratch5/db/Eukfinder/Mitochondrial
DB=mito_blast_db
blastn -db $DB -query $query -out ${query::-6}_BLAST4Mit.out -num_threads 30 \
       -outfmt "6 qseqid sseqid stitle evalue pident qcovhsp nident mismatch length slen qlen qstart qend sstart send staxids sscinames sskingdoms"  \
       -evalue 1E-5 -max_hsps 1
conda deactivate

result file: Eukfinder_long_BLAST4Mit.out
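
The tabular result can be read in Python with the column names taken from the -outfmt string above. This is only a sketch: the function names are hypothetical, and the identity and length thresholds are placeholders, not the cut-offs applied by Binning.py.

```python
import csv

# Column order copied from the -outfmt "6 ..." string in Blast_mito.sh.
BLAST_COLUMNS = (
    "qseqid sseqid stitle evalue pident qcovhsp nident mismatch length "
    "slen qlen qstart qend sstart send staxids sscinames sskingdoms"
).split()


def read_blast_hits(handle):
    """Parse tab-separated BLAST outfmt 6 lines into a list of dictionaries."""
    reader = csv.DictReader(
        handle, fieldnames=BLAST_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def candidate_mito_contigs(hits, min_pident=90.0, min_length=500):
    """Return contigs with a hit above the given (hypothetical) thresholds."""
    return sorted(
        {
            hit["qseqid"]
            for hit in hits
            if float(hit["pident"]) >= min_pident and int(hit["length"]) >= min_length
        }
    )
```

For example, `with open("Eukfinder_long_BLAST4Mit.out") as fh: hits = read_blast_hits(fh)` loads the hits, which can then be passed to candidate_mito_contigs.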

2.3 Use Metaxa2 to detect LSU and SSU rDNA sequences

Run Metaxa2_detection.sh

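mkdir -p Metaxa2_results
metaxa2 --cpu 20 -g SSU -i Eukfinder_long.fasta -o Eukfinder_long_metaxa2_SSU
metaxa2 --cpu 20 -g LSU -i Eukfinder_long.fasta -o Eukfinder_long_metaxa2_LSU
# Delete empty Metaxa2 output files (restricted to this directory so unrelated empty files are kept)
find . -maxdepth 1 -type f -name '*_metaxa2_*' -size 0 -delete
mv *_metaxa2_* Metaxa2_results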

result folder: Metaxa2_results

2.4 Map reads to the EUnk contigs (Eukfinder_long.fasta) to obtain the depth-of-coverage file for binning

Run Depth.sh

result file: Eukfinder_long_EUnk.depth.txt

The Eukfinder_long_EUnk.depth.txt file has five columns:

contigName, contigLen, totalAvgDepth, Eukfinder_long_sorted.bam, Eukfinder_long_sorted.bam-var
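
To look up per-contig depth in Python, the table can be loaded as sketched here. The sketch assumes the file is tab-separated and starts with a header row containing the column names listed above. The function name read_depth is hypothetical.

```python
import csv


def read_depth(handle):
    """Return {contigName: totalAvgDepth} from a tab-separated depth table.

    Assumes the first row is a header that contains the column names
    'contigName' and 'totalAvgDepth'.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    name_col = header.index("contigName")
    depth_col = header.index("totalAvgDepth")
    # Skip blank lines; convert the depth value to float.
    return {row[name_col]: float(row[depth_col]) for row in reader if row}
```

For example, `with open("Eukfinder_long_EUnk.depth.txt") as fh: depth = read_depth(fh)` gives a dictionary keyed by contig name.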

2.5 Run MyCC

export PATH=/opt/perun/myCC/Tools:$PATH
source  /scratch2/software/python2-packages/bin/activate
cut -f 1,3 Eukfinder_long_EUnk.depth.txt > Eukfinder_long_EUnk_depth_for_binning.txt
MyCC.py Eukfinder_long.fasta  -a Eukfinder_long_EUnk_depth_for_binning.txt 4mer
MyCC.py Eukfinder_long.fasta  -a Eukfinder_long_EUnk_depth_for_binning.txt 5mer
MyCC.py Eukfinder_long.fasta  -a Eukfinder_long_EUnk_depth_for_binning.txt 56mer

result folders:

Eukfinder_long_20210919_1721_4mer_0.7_cov

Eukfinder_long_20210919_1732_5mer_0.7_cov

Eukfinder_long_20210919_1748_56mer_0.7_cov


Step 3. Parse Centrifuge Results

Use the Parsing_centrifuge_results.py script to process Centrifuge results and translate TaxIDs to taxonomy.

source activate python36-generic
python3 Parsing_centrifuge_results.py

Explanation:

-c: Path to the Centrifuge results file.

-o: Output file for parsed results.

The output will contain the eukaryotic species detected and their corresponding counts.

OUTPUT:

Eukaryotic species with more than 10 contigs detected by Centrifuge:

                      species  centrifuge_count
0  Blastocystis sp. subtype 4              3300
1     Cyclospora cayetanensis                15
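
The summary above amounts to counting contigs per species and keeping species with more than 10 contigs. A minimal Python sketch of that count follows. The function name and the input mapping (contig ID to species name) are assumptions, not the interface of Parsing_centrifuge_results.py.

```python
from collections import Counter


def species_above_threshold(contig_species, min_contigs=10):
    """Return (species, count) pairs for species assigned to more than min_contigs contigs.

    contig_species maps each contig ID to the species name assigned to it.
    Results are ordered from the most to the least frequent species.
    """
    counts = Counter(contig_species.values())
    return [(species, n) for species, n in counts.most_common() if n > min_contigs]
```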

Step 4. Parse PLAST Results

Use the Parsing_Plast_results.py script to process PLAST results and annotate them with taxonomy using the acc2tax database.

source activate python36-generic
python3 Parsing_Plast_results.py

Explanation:

-i: Input PLAST results file.

-d: Path to the acc2tax database.

-o: Output file for parsed PLAST results.

This step annotates PLAST results with domain, phylum, genus, and species.


Step 5. Parse MyCC Binning Results

Use the Reading_binning_results.py script to process MyCC binning results and combine the bins into one table.

source activate python36-generic
python3 Reading_binning_results.py

Explanation:

-i: Input fasta file.

-m: Path to the folder containing MyCC results.


Step 6. Combine All Results and Perform Binning

Run the main script, Binning.py, to combine all parsed results and generate two FASTA files: the nuclear genome and the mitochondrial genome.

6.1 Combine Results: Combine MyCC, PLAST, BLAST, Metaxa2, and depth-of-coverage results into a single table.

6.2 Filter Contigs: Apply the inclusion criteria outlined in the overview to filter contigs.

source activate python36-generic
python3 Binning.py

Explanation:

-i: Input FASTA file.

-c: Parsed Centrifuge results.

-p: Parsed PLAST results.

-d: Depth of coverage results.

-b: Binning results.

Outputs:

Eukfinder_long_Blastocystis.fas: Recovered eukaryotic nuclear genome.

Eukfinder_long_Mito_Blastocystis.fas: Recovered mitochondrial genome.


Step 7. Validate the Results

Check the generated FASTA files:

Nuclear genome (Eukfinder_long_Blastocystis.fas).

Mitochondrial genome (Eukfinder_long_Mito_Blastocystis.fas).

Review the combined results in Combined_binning_results_CPB_Mito.tsv.
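
A quick sanity check on the two FASTA files is to count the contigs and the total sequence length. The helper below is a dependency-free sketch with a hypothetical function name. Biopython's SeqIO.parse could be used instead.

```python
def fasta_stats(handle):
    """Return (number of sequences, total sequence length) for a FASTA file handle."""
    n_contigs = 0
    total_bp = 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            n_contigs += 1          # header line starts a new sequence
        else:
            total_bp += len(line)   # sequence line (blank lines add 0)
    return n_contigs, total_bp
```

For example, `with open("Eukfinder_long_Blastocystis.fas") as fh: print(fasta_stats(fh))` reports the contig count and total length of the recovered nuclear genome.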
