Home
Following genome assembly, the contigs will be binned into MAGs. Binning the single contig file of a sample yields multiple MAGs, so the artifact generated in this process needs to support storing multiple MAG files per sample.
input | output |
---|---|
MAG de-replication, gene prediction, index building | genome binning |
sample1: mag1.fa, mag2.fa, mag3.fa ...
sample2: mag1.fa, mag2.fa, mag3.fa ...
...
>k129_5480
TTATTTTCAAGATAATGAGCCAATTTAAGCGGTGTCTGGCCGCCAAGCTGCACGATCACA
CCTTTAACTTTCCCATGCTCATTTTCTGCTTCAATCAATGACAATACATCTTCGCCTGTG
AGCGGCTCGAAATATAATCTGTCAGAGGTATCATAATCCGTTGAAACGGTTTCAGGATTA
CAATTAACCATGATTGTTTCATAACCCGCCTCTTTTAGTGCATAGGCGGCATGGACACAG
CAATAATCAAATTCAATACCTT
>k129_5488
TCACCGACTGACTTCATTGCTGTGGTTAAGGTGTTATCAGAGCCTTTAAATTTCTCGAAA
GCAAAACGAGGCACTTTTGTCACGACATAATCAATGGATGGCTCAAAGGCTGCGGGTGTT
TTGCCGCCTGTAATATCATTGCCTAATTCATCAAGTGTATACCCTACCGCCAATTTCGCT
GCCACTTTAGCAATCGGAAAACCTGTAGCTTTTGAGGCTAAAGCAGAAGAACGAGACACA
CGAGGGTTCATCTCAATCA
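The per-sample layout above can be sketched in a few lines of Python. This is only an illustration, not the format implementation: the function name `collect_mags`, the `.fa` extension and the use of the directory name as the sample ID are assumptions taken from the example listing.

```python
from collections import defaultdict
from pathlib import Path


def collect_mags(root):
    """Map each sample directory under `root` to its MAG FASTA file names."""
    mags = defaultdict(list)
    # Assumed layout: <root>/<sample>/<mag>.fa, as in the listing above.
    for fasta in sorted(Path(root).glob("*/*.fa")):
        # The parent directory name is taken as the sample ID (an assumption).
        mags[fasta.parent.name].append(fasta.name)
    return dict(mags)
```

Calling `collect_mags("path/to/artifact/data")` would return a dictionary with one entry per sample, each holding the list of MAG files found for that sample.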
Genome binning requires the original reads to be mapped to the assembled contigs first, so that coverage can be evaluated. Read mapping will also be needed for MAG de-replication. As there are multiple MAGs per sample, multiple index files per sample need to be supported.
input | output |
---|---|
read mapping | contig/MAG indexing |
sample1: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
sample2: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
...
The metagenome assembler will output one file containing all the assembled contigs per sample. We need a new type to handle that contig data.
input | output |
---|---|
genome binning, contig indexing | genome assembly |
Following genome binning, resulting MAGs from all samples can be de-replicated to produce a list of unique genomes. The result of this process would be FeatureData[MAG], where one MAG comprises multiple contigs (in a single FASTA file).
input | output |
---|---|
gene prediction | MAG de-replication |
feature1: contigs.fa
feature2: contigs.fa
...
>k129_5480
TTATTTTCAAGATAATGAGCCAATTTAAGCGGTGTCTGGCCGCCAAGCTGCACGATCACA
CCTTTAACTTTCCCATGCTCATTTTCTGCTTCAATCAATGACAATACATCTTCGC
>k129_5481
TCACCGACTGACTTCATTGCTGTGGTTAAGGTGTTATCAGAGCCTTTAAATTTCTCGAAA
GCAAAACGAGGCACTTTTGTCACGACATAATCAATGGATGGCTCAAAGGCTGCGGGTGTT
TTGCCGCCTGTAATATCA
>k129_5482
AGCGGCTCGAAATATAATCTGTCAGAGGTATCATAATCCGTTGAAACGGTTTCAGGATTA
CAATTAACCATGATTGTTTCATAACCCGCCTCTTTTAGTGCATAGGCGGCATGGACACAG
CAATAATCAAATTCAATACCTT
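Because one MAG is a single FASTA file holding several contigs, a reader for this type only needs to split the file on header lines. The following is a minimal sketch under that assumption; `read_fasta` is a hypothetical helper, not part of any existing plugin.

```python
def read_fasta(lines):
    """Return {contig_id: sequence} from an iterable of FASTA lines."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig ID is the first whitespace-delimited token after '>'.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            # Sequence lines are collected and joined once at the end.
            records[current].append(line)
    return {cid: "".join(parts) for cid, parts in records.items()}
```

With the example above, `read_fasta(open("contigs.fa"))` would return three entries, keyed `k129_5480`, `k129_5481` and `k129_5482`.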
The functional annotation step will use FeatureData[Sequence | ProteinSequence] as input to generate a list of functional categories. At least initially we will be using eggNOG, which produces a large TSV-like table of all kinds of annotations.
input | output |
---|---|
? | functional annotation |
one TSV file with one row per feature (annotation)
# Fri Jan 15 15:36:47 2021
# emapper-2.0.4-rf1-5-g5d81570
# /home/mziemski/metagenome-tests/metagenome_assembly_tutorial/eggnog-mapper/emapper.py -i fixed-bins.1.faa -o /home/mziemski/metagenome-tests/metagenome_assembly_tutorial/results/functional_annotation/fixed-bins.1.faa --data_dir /scratch/mziemski/pfam_db/ --cpu 6
#
#query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score eggNOG OGs narr_og_name narr_og_cat narr_og_desc best_og_name best_og_cat best_og_desc Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs
k129_5480_1 856793.MICA_1589 0.0 982.2 COG0458@1|root,COG0458@2|Bacteria,1MUDZ@1224|Proteobacteria,2TQZU@28211|Alphaproteobacteria,4BPM5@82117|unclassified Alphaproteobacteria 4BPM5@82117|unclassified Alphaproteobacteria F PFAM Carbamoyl-phosphate synthase L chain, ATP binding domain 2TQZU@28211|Alphaproteobacteria F Belongs to the CarB family carB - 6.3.5.5 ko:K01955 ko00240,ko00250,ko01100,map00240,map00250,map01100 M00051 R00256,R00575,R01395,R10948,R10949 RC00002,RC00010,RC00043,RC02750,RC02798,RC03314 ko00000,ko00001,ko00002,ko01000 - - - ATP-grasp,ATP-grasp_3,ATP-grasp_4,CPSase_L_D2,CPSase_L_D3,Dala_Dala_lig_C,GARS_A,MGS,RimK
k129_5480_2 1469613.JT55_09850 1.6e-67 209.1 COG0782@1|root,COG0782@2|Bacteria,1RCXW@1224|Proteobacteria,2U5JU@28211|Alphaproteobacteria,3FD6X@34008|Rhodovulum 3FD6X@34008|Rhodovulum K Necessary for efficient RNA polymerase transcription elongation past template-encoded arresting sites. The arresting sites in DNA have the property of trapping a certain fraction of elongating RNA polymerases that pass through, resulting in locked ternary complexes. Cleavage of the nascent transcript by cleavage factors such as GreA or GreB allows the resumption of elongation from the new 3'terminus. GreA releases sequences of 2 to 3 nucleotides 2U5JU@28211|Alphaproteobacteria K Necessary for efficient RNA polymerase transcription elongation past template-encoded arresting sites. The arresting sites in DNA have the property of trapping a certain fraction of elongating RNA polymerases that pass through, resulting in locked ternary complexes. Cleavage of the nascent transcript by cleavage factors such as GreA or GreB allows the resumption of elongation from the new 3'terminus. GreA releases sequences of 2 to 3 nucleotides greA - - ko:K03624 - - - - ko00000,ko03021 - - - GreA_GreB,GreA_GreB_N
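The table is only "TSV-like" because the run metadata and the column header are both written as `#`-prefixed lines. A reader therefore has to treat the `#query_name` line as the header and skip the other comment lines. The sketch below assumes the columns are tab-separated, as in the example output; `read_annotations` is a hypothetical name.

```python
import csv


def read_annotations(handle):
    """Yield one dict per annotated feature from an eggNOG-mapper-style table."""
    header = None
    # QUOTE_NONE: free-text descriptions may contain quote characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or not row[0]:
            continue
        if row[0].startswith("#query_name"):
            # The column header is itself a comment line; drop the leading '#'.
            header = [row[0].lstrip("#")] + row[1:]
        elif row[0].startswith("#"):
            continue  # run metadata: date, version, command line
        elif header is not None:
            yield dict(zip(header, row))
```

Each yielded dictionary corresponds to one row of the table, so `row["query_name"]` gives the feature ID and `row["KEGG_ko"]` the KEGG orthology assignment.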
As Clusters of Orthologous Groups (COGs) are widely used for genome functional annotation, we need to introduce a new type that will specifically handle those. It would be based on a TSV file containing a few OG-specific fields (inspired by the eggNOG annotator) that would most likely be derivable by other tools in the future.
input | output |
---|---|
? | functional annotation |
one TSV file with one row per feature (annotation)
The gene prediction step will produce one GFF file per genome in a sample (multiple GFF files per sample in total). We need a new type to handle those GFF files.
input | output |
---|---|
? | gene prediction |
genome1: loci.gff
genome2: loci.gff
...
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM19595v2
#!genome-build-accession NCBI_Assembly:GCA_000195955.2
##sequence-region AL123456.3 1 4411532
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=83332
AL123456.3 EMBL region 1 4411532 . + . ID=AL123456.3:1..4411532;Dbxref=taxon:83332;gbkey=Src;mol_type=genomic DNA;strain=H37Rv;type-material=type strain of Mycobacterium tuberculosis
AL123456.3 EMBL gene 1 1524 . + . ID=gene-Rv0001;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=Rv0001
AL123456.3 EMBL CDS 1 1524 . + 0 ID=cds-CCP42723.1;Parent=gene-Rv0001;Dbxref=EnsemblGenomes-Gn:Rv0001,EnsemblGenomes-Tr:CCP42723,GOA:P9WNW3,InterPro:IPR001957,InterPro:IPR003593,InterPro:IPR010921,InterPro:IPR013159,InterPro:IPR013317,InterPro:IPR018312,InterPro:IPR020591,InterPro:IPR027417,NCBI_GP:CCP42723.1;Name=CCP42723.1;Note=Rv0001%2C (MT0001%2C MTV029.01%2C P49993)%2C len: 507 aa. dnaA%2C chromosomal replication initiator protein (see citations below)%2C equivalent to other Mycobacterial chromosomal replication initiator proteins. Also highly similar to others except in N-terminus e.g. Q9ZH75|DNAA_STRCH chromosomal replication initiator protein from Streptomyces chrysomallus (624 aa). Contains PS00017 ATP/GTP-binding site motif A (P-loop) and PS01008 DnaA protein signature. Belongs to the DnaA family. Note that the first base of this gene has been taken as base 1 of the Mycobacterium tuberculosis H37Rv genomic sequence.;experiment=EXISTENCE: identified in proteomics study;gbkey=CDS;gene=dnaA;inference=protein motif:PROSITE:PS01008;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;protein_id=CCP42723.1;transl_table=11
AL123456.3 EMBL gene 2052 3260 . + . ID=gene-Rv0002;Name=dnaN;gbkey=Gene;gene=dnaN;gene_biotype=protein_coding;locus_tag=Rv0002
AL123456.3 EMBL CDS 2052 3260 . + 0 ID=cds-CCP42724.1;Parent=gene-Rv0002;Dbxref=EnsemblGenomes-Gn:Rv0002,EnsemblGenomes-Tr:CCP42724,GOA:P9WNU1,InterPro:IPR001001,InterPro:IPR022634,InterPro:IPR022635,InterPro:IPR022637,PDB:3P16,PDB:3RB9,NCBI_GP:CCP42724.1;Name=CCP42724.1;Note=Rv0002%2C (MTV029.02%2C MTCY10H4.0)%2C len: 402 aa. DnaN%2CDNA polymerase III (beta chain) (see citations below)%2Cequivalent to other Mycobacterial DNA polymerases III beta chain. Also highly similar to others e.g. P27903|DP3B_STRCO DNA polymerase III beta chain from Streptomyces coelicolor (376 aa). Overlaps and extends CDS in neighbouring cosmid MTCY10H4.01.;experiment=EXISTENCE: identified in proteomics study;gbkey=CDS;gene=dnaN;locus_tag=Rv0002;product=DNA polymerase III (beta chain) DnaN (DNA nucleotidyltransferase);protein_id=CCP42724.1;transl_table=11
AL123456.3 EMBL gene 3280 4437 . + . ID=gene-Rv0003;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=Rv0003
AL123456.3 EMBL CDS 3280 4437 . + 0 ID=cds-CCP42725.1;Parent=gene-Rv0003;Dbxref=EnsemblGenomes-Gn:Rv0003,EnsemblGenomes-Tr:CCP42725,GOA:P9WHI9,InterPro:IPR001238,InterPro:IPR003395,InterPro:IPR018078,InterPro:IPR027417,NCBI_GP:CCP42725.1;Name=CCP42725.1;Note=Rv0003%2C (MTCY10H4.01)%2C len: 385 aa. RecF%2C DNA replication and repair protein (see citations below)%2Cequivalent to other mycobacterial DNA replication and repair proteins. Also highly similar to many others. Contains PS00017 ATP/GTP-binding site motif A (P-loop)%2CPS00617 RecF protein signature 1%2C and PS00618 RecF protein signature 2. Belongs to the RecF family.;experiment=EXISTENCE: identified in proteomics study;gbkey=CDS;gene=recF;inference=protein motif:PROSITE:PS00618;locus_tag=Rv0003;product=DNA replication and repair protein RecF (single-strand DNA binding protein);protein_id=CCP42725.1;transl_table=11
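A validator or transformer for this type would need to separate directive lines (`##...`, `#!...`) from the nine-column feature lines and decode the attribute column. The sketch below shows one way to do that for a single line. It assumes tab-separated GFF3 columns and percent-encoded attribute values (such as `%2C` in the example); `parse_gff_line` is a hypothetical helper.

```python
from urllib.parse import unquote


def parse_gff_line(line):
    """Parse one GFF3 feature line; return None for directives and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # Reserved characters are percent-encoded, e.g. %2C for a comma.
            attributes[key] = unquote(value)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }
```

Filtering the parsed features on `feature["type"] == "CDS"` would give the loci that the downstream steps work with.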
Based on the gene positional information from the GFF files, the gene prediction step will extract gene sequences from the corresponding MAGs and store them in a FASTA file.
input | output |
---|---|
functional annotation | gene prediction |
genome1: genes.fasta
genome2: genes.fasta
...
>k129_5480_1 # 3 # 1988 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.442
ATGCCTAAACGTACAGATATTTCTTCTATTTGTATTATTGGAGCTGGACCTATTGTAATT
GGACAAGCTTGTGAATTTGATTATTCTGGAGCTCAAGCTTGTAAAGCTTTAAAAGAAGAA
GGATATCGTGTAGTATTAATTAATTCTAATCCTGCTACAATTATGACAGATCCTAATATG
GCTGATGCTACATATATTGAACCTATTACACCTGAAATTGTAGCTAAAATTTTAGAAAAA
GAACGTCCTGATGCTTTATTACCTACAATGGGAGGACAAACAGCTTTAAATGCTGCTTTA
GCTTTAGATAAAATGGGAGTATTAAAACGTTTAAATATTGAATTAATTGGAGCTAATAAA
GAAGCTATTGAAAAAGCTGAAGATCGTCAATTATTTAAAGATTGTATGGAAAAAATTGGA
TTAGAATCTCCTAAATCTGCTGTAGTACATTCTATGGAAGAAGCTCGTGAAGCTTTAAAA
CATACAGGATTACCTGCTATTATTCGTCCTTCTTTTACAATGGGAGGATCTGGAGGAGGA
GTAGCTTATAATAAAGATGAATTTGAACAAATTATTCGTGAAGGATTAGATGCTTCTCCT
ACAAATGAAGTATTAATTGATGAATCTTTATTAGGATGGAAAGAATATGAAATGGAAGTA
GTACGTGATACAAAAGATAATGCTATTATTATTTGTTCTATTGAAAATATTGATCCTATG
GGAGTACATACAGGAGATTCTATTACAGTAGCTCCTGCTTTAACATTAACAGATAAAGAA
TATCAAATTATGCGTAATGCTTCTTTAGCTGTATTACGTGTAATTGGAGTAGAAACAGGA
GGATCTAATGTACAATTTGGAATGGATCCTGAAACAGGACGTATGGTAGTAATTGAAATG
AATCCTCGTGTATCTCGTTCTTCTGCTTTAGCTTCTAAAGCTACAGGATTTCCTATTGCT
AAAGTAGCTGCTAAATTAGCTGTAGGATATACATTAGATGAATTAGGAAATGATATTACA
GGAGGAAAAACACCTGCTGCTTTTGAACCTTCTATTGATTATGTAGTAACAAAAGTACCT
CGTTTTGCTTTTGAAAAATTTAAAGGATCTGATAATACATTAACAACAGCTATGAAATCT
GTAGGAGAAGCTATGGCTATTGGACGTTCTTTTGAAGAATCTTTACAAAAAGCTTTACGT
TCTTTAGAAAAAGGATTAGAAGGATTATCTTCTATTCCTATTGCTGGAAAATCTGAACCT
GATATGGATGATATTCGTGCTGCTTTATCTCGTCCTACACCTGCTCGTTTATTATATGCT
GCTGAAGCTATGCGTCATGGAATGGATTTAGAAACAATTTATCAATTAACAAAATTTGAT
ATGTGGTATTTAGAACGTATTAAATCTTTAATTGATACAGAAGCTTCTATTAAAAAAAAT
GGATTACCTAAAGATCCTCAAGGATGGATGGCTTTAAAACGTGCTGGATTTTCTGATGCT
CGTTTAGCTGAATTAGTAGGAGTAGCTGAAGCTACAATTCGTAAAACACGTTTAATGCAT
AAAGTAAAACCTGTATATAAACGTGTAGATTCTTGTGCTGCTGAAATTCCTTCTTTAACA
TCTTATATGTATGGAACATATGAAACATTAAATTCTACATCTGAAATTACAGCTACAAAA
AAAGATAAAATTGTAATTTTAGGAGGAGGACCTAATCGTATTGGACAAGGAATTGAATTT
GATTATTGTTGTGTACATGCTGCTTATGCTTTAAAAGAAGCTGGATATGAAACAATTATG
GTAAATTGTAATCCTGAAACAGTATCTACAGATTATGATACATCTGATCGTTTATATTTT
GAACCTTTAACAGGAGAAGATGTATTATCTTTAATTGAAGCTGAAAATGAACATGGAAAA
GTAAAAGGAGTAATTGTACAATTAGGAGGACAAACACCTTTAAAATTAGCTCATTATTTA
GAAAAT
>k129_5480_2 # 2150 # 2623 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.426
ATGCAAAAAATTCCTTTAACAAAACAAGGACATACAGATTTAGAAGCTGAATTAAAAGAT
TTAAAACATCGTCAACGTCCTGCTGTAATTGCTGCTATTTCTGAAGCTCGTGAACATGGA
GATTTATCTGAAAATGCTGAATATCATGCTGCTCGTGAACAACAATCTTTTATTGAAGGA
CGTATTGAACAAGTAGAAGCTATTTTATCTTTAGCTGAAATTATTGATCCTGCTAAAATT
TCTGGAGATACAGTAAAATTTGCTGCTACAGTAAAAGTAGTAGATTGTGATACAGATGAT
GAACATATTTATCAAATTGTAGGAGATGAAGAATCTGATATTGAAACAGGAAAATTAGCT
ATTTCTTCTCCTGTAGCTCGTGCTTTAATTGGAAAAAAAGTAGAAGATTCTGTAGAAGTA
CGTACACCTAAAGGAACACGTGAATATGAAATTTTAGAAATTTTATATAAATAA
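The extraction itself amounts to slicing the contig at the predicted coordinates and reverse-complementing genes on the minus strand. The sketch below assumes 1-based, inclusive coordinates (as in GFF3) and that the `# start # end # strand #` fields in the headers above follow the same convention; `extract_gene` is a hypothetical helper, and ambiguous bases other than A, C, G and T are left unchanged.

```python
# Translation table that complements A/C/G/T; other characters pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_gene(contig, start, end, strand):
    """Return the gene sequence at 1-based inclusive [start, end] on `contig`."""
    if not 1 <= start <= end <= len(contig):
        raise ValueError("coordinates fall outside the contig")
    gene = contig[start - 1:end]
    if strand in ("-", -1):
        # Minus-strand genes are reported as the reverse complement.
        gene = gene.translate(COMPLEMENT)[::-1]
    return gene
```

The strand can be passed either as in GFF (`"+"` / `"-"`) or as in the FASTA headers (`1` / `-1`).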
Based on the gene positional information from the GFF files, the gene prediction step will generate protein translations from the corresponding MAGs and store them in a FASTA file.
input | output |
---|---|
functional annotation | gene prediction |
genome1: proteins.fasta
genome2: proteins.fasta
...
>k129_5480_1 # 3 # 1988 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.442
MPKRTDISSICIIGAGPIVIGQACEFDYSGAQACKALKEEGYRVVLINSNPATIMTDPNM
ADATYIEPITPEIVAKILEKERPDALLPTMGGQTALNAALALDKMGVLKRLNIELIGANK
EAIEKAEDRQLFKDCMEKIGLESPKSAVVHSMEEAREALKHTGLPAIIRPSFTMGGSGGG
VAYNKDEFEQIIREGLDASPTNEVLIDESLLGWKEYEMEVVRDTKDNAIIICSIENIDPM
GVHTGDSITVAPALTLTDKEYQIMRNASLAVLRVIGVETGGSNVQFGMDPETGRMVVIEM
NPRVSRSSALASKATGFPIAKVAAKLAVGYTLDELGNDITGGKTPAAFEPSIDYVVTKVP
RFAFEKFKGSDNTLTTAMKSVGEAMAIGRSFEESLQKALRSLEKGLEGLSSIPIAGKSEP
DMDDIRAALSRPTPARLLYAAEAMRHGMDLETIYQLTKFDMWYLERIKSLIDTEASIKKN
GLPKDPQGWMALKRAGFSDARLAELVGVAEATIRKTRLMHKVKPVYKRVDSCAAEIPSLT
SYMYGTYETLNSTSEITATKKDKIVILGGGPNRIGQGIEFDYCCVHAAYALKEAGYETIM
VNCNPETVSTDYDTSDRLYFEPLTGEDVLSLIEAENEHGKVKGVIVQLGGQTPLKLAHYL
EN
>k129_5480_2 # 2150 # 2623 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.426
MQKIPLTKQGHTDLEAELKDLKHRQRPAVIAAISEAREHGDLSENAEYHAAREQQSFIEG
RIEQVEAILSLAEIIDPAKISGDTVKFAATVKVVDCDTDDEHIYQIVGDEESDIETGKLA
ISSPVARALIGKKVEDSVEVRTPKGTREYEILEILYK*
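The relation between the gene and protein FASTA files can be illustrated with a small translation sketch. It assumes the standard codon assignments, which translation table 11 (the `transl_table=11` seen in the GFF example) shares for all amino acids, and it does not handle alternative start codons. `translate` is a hypothetical helper, and the input must already be in coding orientation.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(gene):
    """Translate a coding sequence codon by codon; '*' marks a stop codon."""
    gene = gene.upper()
    usable = len(gene) - len(gene) % 3  # ignore a trailing partial codon
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(
        CODON_TABLE.get(gene[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

For instance, the first codons of `k129_5480_2` above (`ATGCAAAAAATTCCT...`) translate to `MQKIP...`, which agrees with the start of the corresponding protein record.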