This workflow recovers eukaryotic genomes from metagenomic datasets, starting from the contigs that Eukfinder classifies as eukaryotic or unknown (EUnk).
- Input Selection: Contigs classified as eukaryotic or unknown (EUnk) are extracted from Eukfinder results for further analysis.
- Multiple MyCC Analyses: Three separate MyCC analyses are run with different k-mer settings (4-mer, 5-mer, and a combination of 5-mer and 6-mer, written 5&6-mer).
- Data Mapping: Read coverage depth, identified rRNA sequences, and taxonomic identities of contigs are mapped to the corresponding MyCC bins.
- Filtering and Classification: Contigs are filtered based on multiple criteria and assigned to final eukaryotic bins or marked as mitochondrial genomes.
- Depth of Coverage: A contig's depth of coverage cannot exceed that of the SSU rRNA gene.
- Taxonomic Identity: The best PLAST hit cannot be a prokaryote or virus with >90% identity over an aligned length ≥1000 bp.
- Mitochondrial Classification: Contigs identified as mitochondrial by Metaxa2 and BLAST are marked as mitochondrial genomes.
- Eukaryotic Classification: Contigs identified as eukaryotic by Metaxa2, Centrifuge, and/or PLAST are considered for eukaryotic bins.
- MyCC Clustering: Contigs must appear in potential eukaryotic clusters in at least two of the three MyCC k-mer analyses. A cluster counts as a potential eukaryotic cluster when more than 50% of its contigs are classified as eukaryotic.
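The filtering criteria above can be expressed as one predicate per contig. The following is a minimal sketch, not the logic of Binning.py itself: the `Contig` record, its field names, and the `ssu_depth` argument are hypothetical names introduced for illustration, and the sketch assumes the SSU rRNA gene depth has already been determined elsewhere.

```python
from dataclasses import dataclass
from typing import Optional

# Domains whose confident best hits disqualify a contig (criterion from the list above).
PROKARYOTE_OR_VIRUS = {"Bacteria", "Archaea", "Viruses"}


@dataclass
class Contig:
    """Hypothetical per-contig record; field names are assumptions."""
    name: str
    depth: float                 # mean depth of coverage of the contig
    plast_domain: Optional[str]  # domain of the best PLAST hit, None if no hit
    plast_identity: float = 0.0  # percent identity of the best PLAST hit
    plast_aln_len: int = 0       # aligned length (bp) of the best PLAST hit
    euk_cluster_hits: int = 0    # MyCC analyses placing it in a potential eukaryotic cluster


def passes_filters(contig: Contig, ssu_depth: float) -> bool:
    """Return True if the contig satisfies the three filtering criteria."""
    # Depth of coverage: must not exceed the depth of the SSU rRNA gene.
    if contig.depth > ssu_depth:
        return False
    # Taxonomic identity: reject a prokaryotic or viral best hit
    # with >90% identity over an aligned length of at least 1000 bp.
    if (contig.plast_domain in PROKARYOTE_OR_VIRUS
            and contig.plast_identity > 90.0
            and contig.plast_aln_len >= 1000):
        return False
    # MyCC clustering: at least two of the three k-mer analyses.
    return contig.euk_cluster_hits >= 2
```

A weak bacterial hit (for example 85% identity) does not exclude a contig under this reading, because the identity threshold is not met.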
Panels (a)–(c) illustrate cluster maps generated by MyCC based on marker genes, k-mer composition, and depth of coverage for three different k-mer analyses: 4-mer, 5-mer, and a combination of 5-mer and 6-mer (denoted as 5&6-mer). Panel (d) summarizes the decision-making process for including or excluding contigs in the final eukaryotic genome bin.
Panels (a–c):
- Geometric shapes (triangles, squares, and circles) represent contigs grouped into clusters.
- The numerical labels indicate the cluster IDs.
- Contigs with a hit to eukaryotes by Centrifuge or PLAST are shaded gray.
- Potential eukaryotic clusters, where more than 50% of contigs are classified as eukaryotic, are highlighted in yellow.
- Letters (A–F) mark specific contigs that appear at least once in a potential eukaryotic cluster.
Panel (d):
The table shows the inclusion criteria for contigs in the final eukaryotic genome bin.
Columns Explained:
- Contig: Identifier for individual contigs.
- Centrifuge results: Taxonomic classification of contigs by Centrifuge.
- PLAST results: Taxonomic classification of contigs by PLAST.
- Cluster Number of the Contig in MyCC: The cluster IDs for each k-mer analysis (4-mer, 5-mer, and 5&6-mer).
- Times Hit Potential Euk Bin: Number of times a contig appears in potential eukaryotic clusters across the k-mer analyses.
- Included/Excluded from Final Euk Genome: Indicates whether the contig meets the inclusion criteria for the eukaryotic genome.
Key Decision Rule:
To be included in the final eukaryotic genome bin, a contig must appear at least twice in potential eukaryotic clusters across the three MyCC k-mer analyses (e.g., Contigs A–C). Contigs not meeting this criterion are excluded.
Note: The combination of 5-mer and 6-mer is referred to as "5&6-mer" in this figure.
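The decision rule in panel (d) can be sketched as a small counting routine. This is an illustration under stated assumptions: the input layout (one contig-to-cluster mapping per k-mer analysis plus a set of contigs classified as eukaryotic) and the toy data are hypothetical, and the function names are not taken from the pipeline's scripts.

```python
from collections import defaultdict


def potential_euk_clusters(cluster_of, euk_contigs):
    """Return the cluster IDs in which more than 50% of contigs are eukaryotic."""
    total = defaultdict(int)
    euk = defaultdict(int)
    for contig, cluster in cluster_of.items():
        total[cluster] += 1
        if contig in euk_contigs:
            euk[cluster] += 1
    return {c for c in total if euk[c] / total[c] > 0.5}


def times_in_euk_clusters(analyses, euk_contigs):
    """Count, per contig, how many analyses place it in a potential eukaryotic cluster."""
    hits = defaultdict(int)
    for cluster_of in analyses.values():
        euk_clusters = potential_euk_clusters(cluster_of, euk_contigs)
        for contig, cluster in cluster_of.items():
            if cluster in euk_clusters:
                hits[contig] += 1
    return dict(hits)


# Toy data (hypothetical): contig -> cluster ID for each k-mer analysis.
analyses = {
    "4mer": {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2},
    "5mer": {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3},
    "5&6mer": {"A": 1, "B": 2, "C": 1, "D": 2, "E": 2},
}
euk_contigs = {"A", "B"}  # contigs with a eukaryotic classification

hits = times_in_euk_clusters(analyses, euk_contigs)
included = sorted(c for c, n in hits.items() if n >= 2)
print(included)  # ['A', 'B']
```

In the toy data, contig C lands in a potential eukaryotic cluster only once (in the 4-mer analysis), so it is excluded. A cluster with exactly half of its contigs eukaryotic does not qualify, because the threshold is strictly more than 50%.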
- Prepare Input Files
You need the following input files before starting:
- A FASTA file of assembled contigs (EUnk.fasta from Eukfinder_short or Eukfinder_long).
- Results from Centrifuge, PLAST, BLAST, Metaxa2, Metabat2, and MyCC.
- Run Parsing Scripts
Scripts are provided to process results from the tools mentioned above and integrate them into a single analysis pipeline.
- Generate Output
The pipeline produces two main outputs:
- A FASTA file containing the recovered eukaryotic nuclear genome.
- A FASTA file containing the recovered mitochondrial genome.
Ensure that you have the following tools and Python libraries installed:
- Python (version 3.6+)
- pandas, Centrifuge, Plast, acc2tax, Bowtie2 (included in Eukfinder)
- Biopython. Install via pip install biopython.
- SAMtools, Blast, MyCC, Metabat2, and Metaxa2.
Ensure the assembled contigs file (EUnk.fasta) is located in the TempEukfinder directory.
Copy and rename the file
cp TempEukfinder/EUnk.fasta Eukfinder_long.fasta
Run the shell script run_Plast.sh
source activate eukfinder
query=Eukfinder_long.fasta
# Run plast
DB=/scratch5/db/Eukfinder/nt2021/nt.fasta
plast -e 1E-5 -max-hit-per-query 1 -outfmt 1 -a 48 -p plastn \
-i $query -d $DB -force-query-order 1000 \
-o ${query::-6}.PLAST_nt.tsv
result file: Eukfinder_long.PLAST_nt.tsv
Run Blast_mito.sh
query=Eukfinder_long.fasta
source activate blast
export BLASTDB=/scratch5/db/Eukfinder/Mitochondrial
DB=mito_blast_db
blastn -db $DB -query $query -out ${query::-6}_BLAST4Mit.out -num_threads 30 \
-outfmt "6 qseqid sseqid stitle evalue pident qcovhsp nident mismatch length slen qlen qstart qend sstart send staxids sscinames sskingdoms" \
-evalue 1E-5 -max_hsps 1
conda deactivate
result file: Eukfinder_long_BLAST4Mit.out
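The BLAST output has no header row, so the column names have to come from the -outfmt string used above. A minimal sketch of reading the table with the standard library follows. The function names and the example row are hypothetical, and treating every query with a reported hit as a mitochondrial candidate is an assumption; the actual parsing script may apply further checks.

```python
import csv
import io

# Column names copied from the -outfmt string of the blastn command.
BLAST_COLUMNS = (
    "qseqid sseqid stitle evalue pident qcovhsp nident mismatch length "
    "slen qlen qstart qend sstart send staxids sscinames sskingdoms"
).split()


def read_blast_table(handle):
    """Read tab-separated BLAST output into a list of dicts keyed by column name."""
    reader = csv.DictReader(
        handle, fieldnames=BLAST_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def contigs_with_hit(rows):
    """Return the sorted, de-duplicated query IDs that have at least one hit."""
    return sorted({row["qseqid"] for row in rows})


# Hypothetical example row with one value per column (18 fields).
example_row = "\t".join([
    "contig_1", "ref_1", "hypothetical mitochondrion sequence", "1e-50", "97.5",
    "80", "975", "25", "1000", "30000", "1250", "1", "1000", "501", "1500",
    "0", "N/A", "N/A",
])
rows = read_blast_table(io.StringIO(example_row + "\n"))
print(contigs_with_hit(rows))
```

For the real file, pass an open handle instead, for example `open("Eukfinder_long_BLAST4Mit.out", newline="")`.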
Run Metaxa2_detection.sh
mkdir Metaxa2_results
metaxa2 --cpu 20 -g SSU -i Eukfinder_long.fasta -o Eukfinder_long_metaxa2_SSU
metaxa2 --cpu 20 -g LSU -i Eukfinder_long.fasta -o Eukfinder_long_metaxa2_LSU
find . -type f -size 0 -delete
mv *_metaxa2_* Metaxa2_results
result folder: Metaxa2_results
Run Depth.sh
OUTPUT FILE: Eukfinder_long_EUnk.depth.txt
The Eukfinder_long_EUnk.depth.txt file has five columns:
contigName, contigLen, totalAvgDepth, Eukfinder_long_sorted.bam, Eukfinder_long_sorted.bam-var
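Only the first and third columns (contigName and totalAvgDepth) are needed for binning. As a sketch, the same two columns can be loaded into a dictionary in Python; this assumes the file is tab-separated and begins with a header row carrying the five column names listed above, and `read_depths` is a hypothetical helper name.

```python
import csv
import io


def read_depths(handle):
    """Map contigName -> totalAvgDepth from a tab-separated depth table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["contigName"]: float(row["totalAvgDepth"]) for row in reader}


# Hypothetical two-contig example in the assumed layout.
example = (
    "contigName\tcontigLen\ttotalAvgDepth\t"
    "Eukfinder_long_sorted.bam\tEukfinder_long_sorted.bam-var\n"
    "contig_1\t1500\t12.5\t12.5\t3.1\n"
    "contig_2\t2400\t48.0\t48.0\t9.7\n"
)
depths = read_depths(io.StringIO(example))
print(depths)
```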
export PATH=/opt/perun/myCC/Tools:$PATH
source /scratch2/software/python2-packages/bin/activate
cat Eukfinder_long_EUnk.depth.txt | cut -f 1,3 > Eukfinder_long_EUnk_depth_for_binning.txt
MyCC.py Eukfinder_long.fasta -a Eukfinder_long_EUnk_depth_for_binning.txt 4mer
MyCC.py Eukfinder_long.fasta -a Eukfinder_long_EUnk_depth_for_binning.txt 5mer
MyCC.py Eukfinder_long.fasta -a Eukfinder_long_EUnk_depth_for_binning.txt 56mer
result folders:
Eukfinder_long_20210919_1721_4mer_0.7_cov
Eukfinder_long_20210919_1732_5mer_0.7_cov
Eukfinder_long_20210919_1748_56mer_0.7_cov
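Because the folder names embed the date and time of each run, later steps cannot hard-code them. One way to find the folder belonging to each k-mer setting is to match on the name pattern. This sketch infers the pattern from the three example names above; the regular expression and the function name are assumptions, not part of MyCC.

```python
import re

# Pattern inferred from the example names: <prefix>_<YYYYMMDD>_<HHMM>_<kmer>_<suffix>
FOLDER_RE = re.compile(
    r"^(?P<prefix>.+)_(?P<date>\d{8})_(?P<time>\d{4})_(?P<kmer>4mer|5mer|56mer)_(?P<suffix>.+)$"
)


def folders_by_kmer(names):
    """Map each k-mer label (4mer, 5mer, 56mer) to the matching folder name."""
    found = {}
    for name in names:
        match = FOLDER_RE.match(name)
        if match:
            found[match.group("kmer")] = name
    return found


names = [
    "Eukfinder_long_20210919_1721_4mer_0.7_cov",
    "Eukfinder_long_20210919_1732_5mer_0.7_cov",
    "Eukfinder_long_20210919_1748_56mer_0.7_cov",
]
print(folders_by_kmer(names))
```

With real data, the list of names could come from `os.listdir(".")`. If several runs exist for the same k-mer setting, this sketch keeps the last one encountered.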
Use the Parsing_centrifuge_results.py script to process Centrifuge results and translate TaxIDs to taxonomy.
source activate python36-generic
python3 Parsing_centrifuge_results.py
Explanation:
-c: Path to the Centrifuge results file.
-o: Output file for parsed results.
The output will contain the eukaryotic species detected and their corresponding counts.
OUTPUT:
Eukaryotic species with more than 10 contigs detected by Centrifuge:
species centrifuge_count
0 Blastocystis sp. subtype 4 3300
1 Cyclospora cayetanensis 15
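The summary above keeps only species assigned to more than 10 contigs. A minimal sketch of that counting step is shown below. It uses the standard library rather than pandas, the function name and input layout (one species label per classified contig) are hypothetical, and the placeholder species names are not real results.

```python
from collections import Counter


def species_over_threshold(species_per_contig, min_contigs=10):
    """Return (species, count) pairs for species seen on more than min_contigs contigs."""
    counts = Counter(species_per_contig)
    return [(sp, n) for sp, n in counts.most_common() if n > min_contigs]


# Placeholder labels, one per contig.
labels = ["species_a"] * 12 + ["species_b"] * 10 + ["species_c"] * 3
print(species_over_threshold(labels))
```

Note that the comparison is strict: a species found on exactly 10 contigs is not reported.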
Use the Parsing_Plast_results.py script to process Plast results and annotate them with taxonomy using the acc2tax database.
source activate python36-generic
python3 Parsing_Plast_results.py
Explanation:
-i: Input Plast results file.
-d: Path to the acc2tax database.
-o: Output file for parsed Plast results.
This step annotates Plast results with domain, phylum, genus, and species.
Use the Reading_binning_results.py script to process MyCC binning results and combine the bins into one table.
source activate python36-generic
python3 Reading_binning_results.py
Explanation:
-i: Input fasta file.
-m: Path to the folder containing MyCC results.
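Combining the bins into one table amounts to turning each analysis's cluster-to-contigs listing into one row per contig with one column per k-mer setting. The sketch below illustrates that reshaping on in-memory data only. The input layout and function name are hypothetical, and it does not reproduce how Reading_binning_results.py reads the MyCC folders.

```python
def bins_to_table(bins_by_kmer):
    """Turn {kmer: {cluster_id: [contigs]}} into {contig: {kmer: cluster_id}}."""
    table = {}
    for kmer, bins in bins_by_kmer.items():
        for cluster_id, contigs in bins.items():
            for contig in contigs:
                table.setdefault(contig, {})[kmer] = cluster_id
    return table


# Hypothetical bins for two of the three analyses.
bins_by_kmer = {
    "4mer": {1: ["A", "B"], 2: ["C"]},
    "5mer": {1: ["A"], 3: ["B", "C"]},
}
table = bins_to_table(bins_by_kmer)
print(table["B"])
```

A contig that MyCC left unbinned in one analysis simply lacks that k-mer key in its row.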
Run the main script, Binning.py, to combine all parsed results and generate two FASTA files: the nuclear genome and the mitochondrial genome.
6.1 Combine Results: Combine MyCC, Plast, BLAST, Metaxa2, and depth-of-coverage results into a single table.
6.2 Filter Contigs: Apply the inclusion criteria outlined in the overview to filter contigs.
source activate python36-generic
python3 Binning.py
Explanation:
-i: Input FASTA file.
-c: Parsed Centrifuge results.
-p: Parsed Plast results.
-d: Depth of coverage results.
-b: Binning results.
Outputs:
Eukfinder_long_Blastocystis.fas: Recovered eukaryotic nuclear genome.
Eukfinder_long_Mito_Blastocystis.fas: Recovered mitochondrial genome.
Check the generated FASTA files:
Nuclear genome (Eukfinder_long_Blastocystis.fas).
Mitochondrial genome (Eukfinder_long_Mito_Blastocystis.fas).
Review the combined results in Combined_binning_results_CPB_Mito.tsv.
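A quick sanity check on the two FASTA outputs is to count the sequences and sum their lengths. The following is a small dependency-free sketch; `fasta_stats` is a hypothetical helper, and Biopython's `SeqIO.parse` could be used for the same purpose.

```python
import io


def fasta_stats(handle):
    """Return (number of records, total sequence length) for a FASTA file handle."""
    n_records = 0
    total_len = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # header line starts a new record
        else:
            total_len += len(line)  # sequence lines may wrap across several rows
    return n_records, total_len


# Hypothetical two-record example.
example_fasta = ">contig_1\nACGT\nAC\n>contig_2\nGG\n"
print(fasta_stats(io.StringIO(example_fasta)))  # (2, 8)
```

For the real outputs, open each file and pass the handle, for example `fasta_stats(open("Eukfinder_long_Blastocystis.fas"))`.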