This repository has been archived by the owner on May 21, 2024. It is now read-only.
Michal Ziemski edited this page May 3, 2021 · 1 revision

Semantic types for genomics

SampleData[MAGs]

Description

Following genome assembly, the contigs are binned into MAGs (metagenome-assembled genomes). Binning the single contig file of a sample yields multiple MAGs, so the artifact generated in this process needs to support storing multiple MAG files per sample.

Applicable to:

input: MAG de-replication, gene prediction, index building
output: genome binning

Artifact structure

sample1: mag1.fa, mag2.fa, mag3.fa ...
sample2: mag1.fa, mag2.fa, mag3.fa ...
...
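
The per-sample grouping above could be collected from disk roughly as follows. This is only a sketch: the directory layout `<root>/<sample_id>/<mag>.fa` and the function name `collect_mags` are assumptions made for illustration, not part of the proposed format.

```python
from collections import defaultdict
from pathlib import Path


def collect_mags(root):
    """Map each sample ID to the sorted names of its MAG FASTA files.

    Assumes a hypothetical layout <root>/<sample_id>/<mag>.fa.
    """
    mags = defaultdict(list)
    # One directory level per sample, one .fa file per MAG.
    for fasta in sorted(Path(root).glob("*/*.fa")):
        mags[fasta.parent.name].append(fasta.name)
    return dict(mags)
```

Keeping the sample ID as the directory name means a sample with no MAGs is simply absent from the result rather than mapped to an empty list.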

Sample data: mag1.fa

>k129_5480
TTATTTTCAAGATAATGAGCCAATTTAAGCGGTGTCTGGCCGCCAAGCTGCACGATCACA
CCTTTAACTTTCCCATGCTCATTTTCTGCTTCAATCAATGACAATACATCTTCGCCTGTG
AGCGGCTCGAAATATAATCTGTCAGAGGTATCATAATCCGTTGAAACGGTTTCAGGATTA
CAATTAACCATGATTGTTTCATAACCCGCCTCTTTTAGTGCATAGGCGGCATGGACACAG
CAATAATCAAATTCAATACCTT
>k129_5488
TCACCGACTGACTTCATTGCTGTGGTTAAGGTGTTATCAGAGCCTTTAAATTTCTCGAAA
GCAAAACGAGGCACTTTTGTCACGACATAATCAATGGATGGCTCAAAGGCTGCGGGTGTT
TTGCCGCCTGTAATATCATTGCCTAATTCATCAAGTGTATACCCTACCGCCAATTTCGCT
GCCACTTTAGCAATCGGAAAACCTGTAGCTTTTGAGGCTAAAGCAGAAGAACGAGACACA
CGAGGGTTCATCTCAATCA
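
A MAG file like the one above is a plain multi-record FASTA, one record per contig. A minimal reader, assuming only the standard `>` header convention (the name `read_fasta` is hypothetical), might look like this:

```python
def read_fasta(lines):
    """Yield (contig_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            # The ID is the first whitespace-delimited token of the header.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_lengths(lines):
    """Return {contig_id: length} for all records in a FASTA stream."""
    return {name: len(seq) for name, seq in read_fasta(lines)}
```

Passing an open file handle for `mag1.fa` to `contig_lengths` would give the length of each contig in that MAG.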

SampleData[MultiBowtie2Index]

Description

Genome binning requires the original reads to first be mapped to the assembled contigs in order to evaluate coverage. Read mapping is also needed for MAG de-replication. As there are multiple MAGs per sample, the type needs to support multiple sets of index files per sample.

Applicable to:

input: read mapping
output: contig/MAG indexing

Artifact structure

sample1: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
sample2: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
...
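
Each MAG directory has to hold a complete set of six index files. The check below is a sketch of how that could be validated. It assumes the file naming that `bowtie2-build` normally produces (`<base>.1.bt2` through `<base>.rev.2.bt2`, with `.bt2l` for large indexes) rather than the schematic idx/ref/rev names used in the structure above; the function name `missing_index_files` is hypothetical.

```python
from pathlib import Path

# Parts of a Bowtie 2 index; each becomes "<basename>.<part>.<ext>".
BT2_PARTS = ("1", "2", "3", "4", "rev.1", "rev.2")


def missing_index_files(mag_dir, basename, ext="bt2"):
    """Return the sorted names of index files absent from mag_dir.

    ext is "bt2" for a small index and "bt2l" for a large one.
    """
    expected = {f"{basename}.{part}.{ext}" for part in BT2_PARTS}
    present = {p.name for p in Path(mag_dir).iterdir()}
    return sorted(expected - present)
```

An empty list means the index for that MAG is complete.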

SampleData[Contigs]

Description

The metagenome assembler will output one file containing all the assembled contigs per sample. We need a new type to handle that contig data.

Applicable to:

input: genome binning, contig indexing
output: genome assembly

Artifact structure

Sample data

FeatureData[MAG]

Description

Following genome binning, the resulting MAGs from all samples can be de-replicated to produce a list of unique genomes. The result of this process would be FeatureData[MAG], where each MAG comprises multiple contigs stored in a single FASTA file.

Applicable to:

input: gene prediction
output: MAG de-replication

Artifact structure

feature1: contigs.fa
feature2: contigs.fa
...

Sample data: contigs.fa

>k129_5480
TTATTTTCAAGATAATGAGCCAATTTAAGCGGTGTCTGGCCGCCAAGCTGCACGATCACA
CCTTTAACTTTCCCATGCTCATTTTCTGCTTCAATCAATGACAATACATCTTCGC
>k129_5481
TCACCGACTGACTTCATTGCTGTGGTTAAGGTGTTATCAGAGCCTTTAAATTTCTCGAAA
GCAAAACGAGGCACTTTTGTCACGACATAATCAATGGATGGCTCAAAGGCTGCGGGTGTT
TTGCCGCCTGTAATATCA
>k129_5482
AGCGGCTCGAAATATAATCTGTCAGAGGTATCATAATCCGTTGAAACGGTTTCAGGATTA
CAATTAACCATGATTGTTTCATAACCCGCCTCTTTTAGTGCATAGGCGGCATGGACACAG
CAATAATCAAATTCAATACCTT

FeatureData[NOG]

Description

The functional annotation step will use FeatureData[Sequence | ProteinSequence] as input to generate a list of functional categories. At least initially we will be using eggNOG, which produces a large TSV-like table containing many kinds of annotations.

Applicable to:

input: ?
output: functional annotation

Artifact structure

one TSV file with one row per feature (annotation)

Sample data: annotations.tsv

# Fri Jan 15 15:36:47 2021
# emapper-2.0.4-rf1-5-g5d81570
# /home/mziemski/metagenome-tests/metagenome_assembly_tutorial/eggnog-mapper/emapper.py -i fixed-bins.1.faa -o /home/mziemski/metagenome-tests/metagenome_assembly_tutorial/results/functional_annotation/fixed-bins.1.faa --data_dir /scratch/mziemski/pfam_db/ --cpu 6
#
#query_name	seed_eggNOG_ortholog	seed_ortholog_evalue	seed_ortholog_score	eggNOG OGs	narr_og_name	narr_og_cat	narr_og_desc	best_og_name	best_og_cat	best_og_desc	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs
k129_5480_1	856793.MICA_1589	0.0	982.2	COG0458@1|root,COG0458@2|Bacteria,1MUDZ@1224|Proteobacteria,2TQZU@28211|Alphaproteobacteria,4BPM5@82117|unclassified Alphaproteobacteria	4BPM5@82117|unclassified Alphaproteobacteria	F	PFAM Carbamoyl-phosphate synthase L chain, ATP binding domain	2TQZU@28211|Alphaproteobacteria	F	Belongs to the CarB family	carB	-	6.3.5.5	ko:K01955	ko00240,ko00250,ko01100,map00240,map00250,map01100	M00051	R00256,R00575,R01395,R10948,R10949	RC00002,RC00010,RC00043,RC02750,RC02798,RC03314	ko00000,ko00001,ko00002,ko01000	-	-	-	ATP-grasp,ATP-grasp_3,ATP-grasp_4,CPSase_L_D2,CPSase_L_D3,Dala_Dala_lig_C,GARS_A,MGS,RimK
k129_5480_2	1469613.JT55_09850	1.6e-67	209.1	COG0782@1|root,COG0782@2|Bacteria,1RCXW@1224|Proteobacteria,2U5JU@28211|Alphaproteobacteria,3FD6X@34008|Rhodovulum	3FD6X@34008|Rhodovulum	K	Necessary for efficient RNA polymerase transcription elongation past template-encoded arresting sites. The arresting sites in DNA have the property of trapping a certain fraction of elongating RNA polymerases that pass through, resulting in locked ternary complexes. Cleavage of the nascent transcript by cleavage factors such as GreA or GreB allows the resumption of elongation from the new 3'terminus. GreA releases sequences of 2 to 3 nucleotides	2U5JU@28211|Alphaproteobacteria	K	Necessary for efficient RNA polymerase transcription elongation past template-encoded arresting sites. The arresting sites in DNA have the property of trapping a certain fraction of elongating RNA polymerases that pass through, resulting in locked ternary complexes. Cleavage of the nascent transcript by cleavage factors such as GreA or GreB allows the resumption of elongation from the new 3'terminus. GreA releases sequences of 2 to 3 nucleotides	greA	-	-	ko:K03624	-	-	-	-	ko00000,ko03021	-	-	-	GreA_GreB,GreA_GreB_N
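
The table is "TSV-like" because it mixes `#` comment lines with a header line that is itself prefixed with `#`. A reader along the following lines could handle that. It is a sketch based only on the sample shown here, and the name `read_annotations` is hypothetical.

```python
import csv


def read_annotations(lines):
    """Yield one dict per annotated query from emapper-style TSV lines."""
    header = None
    # QUOTE_NONE keeps quote characters in free-text descriptions literal.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue
        if fields[0].startswith("#"):
            # The column names live on the "#query_name ..." line;
            # every other "#" line is a plain comment.
            if fields[0] == "#query_name":
                header = ["query_name"] + fields[1:]
            continue
        if header is None:
            raise ValueError("data row found before the #query_name header")
        yield dict(zip(header, fields))
```

Each yielded dict can then be looked up by column name, for example `row["KEGG_ko"]` or `row["eggNOG OGs"]`.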

FeatureData[OG]

Description

As Clusters of Orthologous Groups (COGs) are widely used for functional annotation of genomes, we need to introduce a new type that will specifically handle them. It would be based on a TSV file containing a few OG-specific fields (inspired by the eggNOG annotator) that other tools would most likely be able to produce in the future.

Applicable to:

input: ?
output: functional annotation

Artifact structure

one TSV file with one row per feature (annotation)
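
Which OG-specific fields the TSV will carry is not settled here. As one possible starting point, the `eggNOG OGs` column of the eggNOG output could be split into its components. The sketch below assumes the `OG@taxid|taxon` layout seen in the eggNOG sample on this page and assumes taxon names contain no commas; `OrthologousGroup` and `parse_ogs` are hypothetical names.

```python
from typing import NamedTuple


class OrthologousGroup(NamedTuple):
    og_id: str
    tax_id: int
    taxon: str


def parse_ogs(value):
    """Split a cell such as 'COG0458@1|root,COG0458@2|Bacteria'."""
    groups = []
    if value in ("", "-"):
        # "-" marks an empty field in the eggNOG sample.
        return groups
    for item in value.split(","):
        og_id, _, rest = item.partition("@")
        tax_id, _, taxon = rest.partition("|")
        groups.append(OrthologousGroup(og_id, int(tax_id), taxon))
    return groups
```

The result is ordered as in the input, which in the sample runs from the root level down to the narrowest taxon.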

Sample data

FeatureData[KEGG]

Description

Applicable to:

input: ?
output: functional annotation

Artifact structure

one TSV file with one row per feature (annotation)

Sample data file

GenomeData[Loci]

Description

The gene prediction step will produce one GFF file per genome in a sample, so each sample ends up with multiple GFF files. We need a new type to handle those GFF files.

Applicable to:

input: ?
output: gene prediction

Artifact structure

genome1: loci.gff
genome2: loci.gff
...

Sample data: loci.gff

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM19595v2
#!genome-build-accession NCBI_Assembly:GCA_000195955.2
##sequence-region AL123456.3 1 4411532
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=83332
AL123456.3	EMBL	region	1	4411532	.	+	.	ID=AL123456.3:1..4411532;Dbxref=taxon:83332;gbkey=Src;mol_type=genomic DNA;strain=H37Rv;type-material=type strain of Mycobacterium tuberculosis
AL123456.3	EMBL	gene	1	1524	.	+	.	ID=gene-Rv0001;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=Rv0001
AL123456.3	EMBL	CDS	1	1524	.	+	0	ID=cds-CCP42723.1;Parent=gene-Rv0001;Dbxref=EnsemblGenomes-Gn:Rv0001,EnsemblGenomes-Tr:CCP42723,GOA:P9WNW3,InterPro:IPR001957,InterPro:IPR003593,InterPro:IPR010921,InterPro:IPR013159,InterPro:IPR013317,InterPro:IPR018312,InterPro:IPR020591,InterPro:IPR027417,NCBI_GP:CCP42723.1;Name=CCP42723.1;Note=Rv0001%2C (MT0001%2C MTV029.01%2C P49993)%2C len: 507 aa. dnaA%2C chromosomal replication initiator protein (see citations below)%2C equivalent to other Mycobacterial chromosomal replication initiator proteins. Also highly similar to others except in N-terminus e.g. Q9ZH75|DNAA_STRCH chromosomal replication initiator protein from Streptomyces chrysomallus (624 aa). Contains PS00017 ATP/GTP-binding site motif A (P-loop) and PS01008 DnaA protein signature. Belongs to the DnaA family. Note that the first base of this gene has been taken as base 1 of the Mycobacterium tuberculosis H37Rv genomic sequence.;experiment=EXISTENCE: identified in proteomics study;gbkey=CDS;gene=dnaA;inference=protein motif:PROSITE:PS01008;locus_tag=Rv0001;product=Chromosomal replication initiator protein DnaA;protein_id=CCP42723.1;transl_table=11
AL123456.3	EMBL	gene	2052	3260	.	+	.	ID=gene-Rv0002;Name=dnaN;gbkey=Gene;gene=dnaN;gene_biotype=protein_coding;locus_tag=Rv0002
AL123456.3	EMBL	CDS	2052	3260	.	+	0	ID=cds-CCP42724.1;Parent=gene-Rv0002;Dbxref=EnsemblGenomes-Gn:Rv0002,EnsemblGenomes-Tr:CCP42724,GOA:P9WNU1,InterPro:IPR001001,InterPro:IPR022634,InterPro:IPR022635,InterPro:IPR022637,PDB:3P16,PDB:3RB9,NCBI_GP:CCP42724.1;Name=CCP42724.1;Note=Rv0002%2C (MTV029.02%2C MTCY10H4.0)%2C len: 402 aa. DnaN%2CDNA polymerase III (beta chain) (see citations below)%2Cequivalent to other Mycobacterial DNA polymerases III beta chain. Also highly similar to others e.g. P27903|DP3B_STRCO DNA polymerase III beta chain from Streptomyces coelicolor (376 aa). Overlaps and extends CDS in neighbouring cosmid MTCY10H4.01.;experiment=EXISTENCE: identified in proteomics study;gbkey=CDS;gene=dnaN;locus_tag=Rv0002;product=DNA polymerase III (beta chain) DnaN (DNA nucleotidyltransferase);protein_id=CCP42724.1;transl_table=11
AL123456.3	EMBL	gene	3280	4437	.	+	.	ID=gene-Rv0003;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=Rv0003
AL123456.3	EMBL	CDS	3280	4437	.	+	0	ID=cds-CCP42725.1;Parent=gene-Rv0003;Dbxref=EnsemblGenomes-Gn:Rv0003,EnsemblGenomes-Tr:CCP42725,GOA:P9WHI9,InterPro:IPR001238,InterPro:IPR003395,InterPro:IPR018078,InterPro:IPR027417,NCBI_GP:CCP42725.1;Name=CCP42725.1;Note=Rv0003%2C (MTCY10H4.01)%2C len: 385 aa. RecF%2C DNA replication and repair protein (see citations below)%2Cequivalent to other mycobacterial DNA replication and repair proteins. Also highly similar to many others. Contains PS00017 ATP/GTP-binding site motif A (P-loop)%2CPS00617 RecF protein signature 1%2C and PS00618 RecF protein signature 2. Belongs to the RecF family.;experiment=EXISTENCE: identified in proteomics study;gbkey=CDS;gene=recF;inference=protein motif:PROSITE:PS00618;locus_tag=Rv0003;product=DNA replication and repair protein RecF (single-strand DNA binding protein);protein_id=CCP42725.1;transl_table=11
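
Each feature line in the sample has the nine tab-separated GFF3 columns, with the last column holding `key=value` pairs whose values are percent-encoded (for example `%2C` for a comma). A minimal parser for a single line could look like the sketch below; the name `parse_gff_line` is hypothetical and multi-valued attributes are left as raw strings.

```python
from urllib.parse import unquote

GFF_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff_line(line):
    """Return a dict for one GFF3 feature line, or None for '#' lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # Decode percent-escapes such as %2C only after splitting.
                attributes[key] = unquote(value)
    record["attributes"] = attributes
    return record
```

Decoding after splitting on `;` and `=` matters, because an escaped separator inside a value must not be mistaken for a real one.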

GenomeData[Sequence]

Description

Based on the gene positional information from the GFF files, the gene prediction step will extract gene sequences from the corresponding MAGs and store them in a FASTA file.

Applicable to:

input: functional annotation
output: gene prediction

Artifact structure

genome1: genes.fasta
genome2: genes.fasta
...

Sample data: genes.fasta

>k129_5480_1 # 3 # 1988 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.442
ATGCCTAAACGTACAGATATTTCTTCTATTTGTATTATTGGAGCTGGACCTATTGTAATT
GGACAAGCTTGTGAATTTGATTATTCTGGAGCTCAAGCTTGTAAAGCTTTAAAAGAAGAA
GGATATCGTGTAGTATTAATTAATTCTAATCCTGCTACAATTATGACAGATCCTAATATG
GCTGATGCTACATATATTGAACCTATTACACCTGAAATTGTAGCTAAAATTTTAGAAAAA
GAACGTCCTGATGCTTTATTACCTACAATGGGAGGACAAACAGCTTTAAATGCTGCTTTA
GCTTTAGATAAAATGGGAGTATTAAAACGTTTAAATATTGAATTAATTGGAGCTAATAAA
GAAGCTATTGAAAAAGCTGAAGATCGTCAATTATTTAAAGATTGTATGGAAAAAATTGGA
TTAGAATCTCCTAAATCTGCTGTAGTACATTCTATGGAAGAAGCTCGTGAAGCTTTAAAA
CATACAGGATTACCTGCTATTATTCGTCCTTCTTTTACAATGGGAGGATCTGGAGGAGGA
GTAGCTTATAATAAAGATGAATTTGAACAAATTATTCGTGAAGGATTAGATGCTTCTCCT
ACAAATGAAGTATTAATTGATGAATCTTTATTAGGATGGAAAGAATATGAAATGGAAGTA
GTACGTGATACAAAAGATAATGCTATTATTATTTGTTCTATTGAAAATATTGATCCTATG
GGAGTACATACAGGAGATTCTATTACAGTAGCTCCTGCTTTAACATTAACAGATAAAGAA
TATCAAATTATGCGTAATGCTTCTTTAGCTGTATTACGTGTAATTGGAGTAGAAACAGGA
GGATCTAATGTACAATTTGGAATGGATCCTGAAACAGGACGTATGGTAGTAATTGAAATG
AATCCTCGTGTATCTCGTTCTTCTGCTTTAGCTTCTAAAGCTACAGGATTTCCTATTGCT
AAAGTAGCTGCTAAATTAGCTGTAGGATATACATTAGATGAATTAGGAAATGATATTACA
GGAGGAAAAACACCTGCTGCTTTTGAACCTTCTATTGATTATGTAGTAACAAAAGTACCT
CGTTTTGCTTTTGAAAAATTTAAAGGATCTGATAATACATTAACAACAGCTATGAAATCT
GTAGGAGAAGCTATGGCTATTGGACGTTCTTTTGAAGAATCTTTACAAAAAGCTTTACGT
TCTTTAGAAAAAGGATTAGAAGGATTATCTTCTATTCCTATTGCTGGAAAATCTGAACCT
GATATGGATGATATTCGTGCTGCTTTATCTCGTCCTACACCTGCTCGTTTATTATATGCT
GCTGAAGCTATGCGTCATGGAATGGATTTAGAAACAATTTATCAATTAACAAAATTTGAT
ATGTGGTATTTAGAACGTATTAAATCTTTAATTGATACAGAAGCTTCTATTAAAAAAAAT
GGATTACCTAAAGATCCTCAAGGATGGATGGCTTTAAAACGTGCTGGATTTTCTGATGCT
CGTTTAGCTGAATTAGTAGGAGTAGCTGAAGCTACAATTCGTAAAACACGTTTAATGCAT
AAAGTAAAACCTGTATATAAACGTGTAGATTCTTGTGCTGCTGAAATTCCTTCTTTAACA
TCTTATATGTATGGAACATATGAAACATTAAATTCTACATCTGAAATTACAGCTACAAAA
AAAGATAAAATTGTAATTTTAGGAGGAGGACCTAATCGTATTGGACAAGGAATTGAATTT
GATTATTGTTGTGTACATGCTGCTTATGCTTTAAAAGAAGCTGGATATGAAACAATTATG
GTAAATTGTAATCCTGAAACAGTATCTACAGATTATGATACATCTGATCGTTTATATTTT
GAACCTTTAACAGGAGAAGATGTATTATCTTTAATTGAAGCTGAAAATGAACATGGAAAA
GTAAAAGGAGTAATTGTACAATTAGGAGGACAAACACCTTTAAAATTAGCTCATTATTTA
GAAAAT
>k129_5480_2 # 2150 # 2623 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.426
ATGCAAAAAATTCCTTTAACAAAACAAGGACATACAGATTTAGAAGCTGAATTAAAAGAT
TTAAAACATCGTCAACGTCCTGCTGTAATTGCTGCTATTTCTGAAGCTCGTGAACATGGA
GATTTATCTGAAAATGCTGAATATCATGCTGCTCGTGAACAACAATCTTTTATTGAAGGA
CGTATTGAACAAGTAGAAGCTATTTTATCTTTAGCTGAAATTATTGATCCTGCTAAAATT
TCTGGAGATACAGTAAAATTTGCTGCTACAGTAAAAGTAGTAGATTGTGATACAGATGAT
GAACATATTTATCAAATTGTAGGAGATGAAGAATCTGATATTGAAACAGGAAAATTAGCT
ATTTCTTCTCCTGTAGCTCGTGCTTTAATTGGAAAAAAAGTAGAAGATTCTGTAGAAGTA
CGTACACCTAAAGGAACACGTGAATATGAAATTTTAGAAATTTTATATAAATAA
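
The FASTA headers in this sample carry the gene coordinates and strand alongside the ID, separated by `#`, followed by `;`-separated attributes. A sketch of how such a header could be decoded, assuming exactly the layout shown above (the name `parse_gene_header` is hypothetical):

```python
def parse_gene_header(header):
    """Decode '>id # start # end # strand # key=value;...' into a dict."""
    header = header.lstrip(">").strip()
    parts = [part.strip() for part in header.split("#", 4)]
    if len(parts) != 5:
        raise ValueError("header does not have five '#'-separated parts")
    gene_id, start, end, strand, raw_attributes = parts
    attributes = {}
    for pair in raw_attributes.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return {
        "gene_id": gene_id,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),  # 1 or -1 in the sample
        "attributes": attributes,
    }
```

For the second record above, `end - start + 1` gives 474, which matches the length of the sequence that follows its header.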

GenomeData[ProteinSequence]

Description

Based on the gene positional information from the GFF files, the gene prediction step will generate protein translations of the genes found in the corresponding MAGs and store them in a FASTA file.

Applicable to:

input: functional annotation
output: gene prediction

Artifact structure

genome1: proteins.fasta
genome2: proteins.fasta
...

Sample data: proteins.fasta

>k129_5480_1 # 3 # 1988 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.442
MPKRTDISSICIIGAGPIVIGQACEFDYSGAQACKALKEEGYRVVLINSNPATIMTDPNM
ADATYIEPITPEIVAKILEKERPDALLPTMGGQTALNAALALDKMGVLKRLNIELIGANK
EAIEKAEDRQLFKDCMEKIGLESPKSAVVHSMEEAREALKHTGLPAIIRPSFTMGGSGGG
VAYNKDEFEQIIREGLDASPTNEVLIDESLLGWKEYEMEVVRDTKDNAIIICSIENIDPM
GVHTGDSITVAPALTLTDKEYQIMRNASLAVLRVIGVETGGSNVQFGMDPETGRMVVIEM
NPRVSRSSALASKATGFPIAKVAAKLAVGYTLDELGNDITGGKTPAAFEPSIDYVVTKVP
RFAFEKFKGSDNTLTTAMKSVGEAMAIGRSFEESLQKALRSLEKGLEGLSSIPIAGKSEP
DMDDIRAALSRPTPARLLYAAEAMRHGMDLETIYQLTKFDMWYLERIKSLIDTEASIKKN
GLPKDPQGWMALKRAGFSDARLAELVGVAEATIRKTRLMHKVKPVYKRVDSCAAEIPSLT
SYMYGTYETLNSTSEITATKKDKIVILGGGPNRIGQGIEFDYCCVHAAYALKEAGYETIM
VNCNPETVSTDYDTSDRLYFEPLTGEDVLSLIEAENEHGKVKGVIVQLGGQTPLKLAHYL
EN
>k129_5480_2 # 2150 # 2623 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.426
MQKIPLTKQGHTDLEAELKDLKHRQRPAVIAAISEAREHGDLSENAEYHAAREQQSFIEG
RIEQVEAILSLAEIIDPAKISGDTVKFAATVKVVDCDTDDEHIYQIVGDEESDIETGKLA
ISSPVARALIGKKVEDSVEVRTPKGTREYEILEILYK*
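
The relationship between the gene and protein records can be illustrated with a small translation routine. This is a sketch using the standard genetic code only: it assumes the gene sequence is already in coding orientation and does not handle alternative start codons, so it is not a substitute for the gene predictor's own translation. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA sequence codon by codon.

    Stop codons become '*', codons with ambiguous bases become 'X',
    and a trailing partial codon is ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

Applied to the start of the second gene in the GenomeData[Sequence] sample, `translate("ATGCAAAAAATTCCT")` gives `MQKIP`, the first five residues of the matching protein record.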