3. Assembly based analyses
- Metagenomic assembly
- Assembly evaluation
- Read mapping
- Metagenomic assembly annotation
- Proxy to gene abundances
Next Generation Sequencing changed the way we understand microbial communities, giving us the opportunity to take a global view of the different biological processes. As we saw previously, we can extract valuable information from analyses at the read level, but assembling these sequences (constructing longer fragments from the reads) provides additional benefits. First, it reduces the size of our datasets: OSD102 is a small metagenome with ~650K reads, but some metagenomes from TARA have more than 400M reads. Second, once we assemble the reads into contigs, we recover many fragments at least the size of a gene, which improves our gene predictions and annotations. Third, with longer fragments we obtain the genomic context of many of our predicted genes; for example, we can start looking for clusters of biosynthetic genes.
The assembly process is computationally intensive and, most of the time, challenging. In recent years many algorithmic advances have made metagenomic assemblies accessible to those without large computational resources, and specifically without big-memory machines. Some of these advances are based on using 'de Bruijn graphs' for the assembly process; here you can find a nice review about their application to genome assembly. Although de Bruijn graphs are really useful, they consume large amounts of memory. 'Distributed de Bruijn graphs' allow the assembly process to be spread across many different machines. In parallel, new data structures have been developed; one of the most impressive achievements is the 'succinct de Bruijn graph', which can represent the same graphs using just a few GB of memory. An interesting read about the evolution of de Bruijn graphs applied to assembly, and a detailed explanation of the succinct ones, can be found here.
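To make the idea more concrete, a de Bruijn graph is built from the k-mers (words of length k) extracted from the reads. As a toy illustration (the sequence below is made up and not part of the course data), we can list the k-mers of a short sequence with awk:
# Enumerate all 4-mers of a toy sequence; these would become the nodes of a de Bruijn graph
echo "ATGGCGTGCA" | awk -v k=4 '{for (i = 1; i <= length($0) - k + 1; i++) print substr($0, i, k)}'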
There are plenty of assemblers out there that can be applied to metagenomics: SOAPdenovo, Ray, SPAdes, Omega, IDBA, PRICE and many more. In December 2014 a new assembler, MEGAHIT, was published. It uses the succinct representation of de Bruijn graphs together with other algorithmic improvements. This assembler is really fast and allows the assembly of massive datasets that until now were only possible after special data treatments, such as digital normalization, which removes redundant and erroneous data.
For the hands-on we will play with MEGAHIT and IDBA to assemble our OSD102 metagenome. One of the most interesting features of these assemblers is that they follow a multi k-mer approach. Depending on the assembler, you have to select the length of the word (k-mer) into which your sequences are split in order to build the de Bruijn graph. Before multi k-mer assemblers were available, you needed to test different k-mer sizes until you found the optimal one for your data; this blog has a nice explanation of this.
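If you want to control the k-mer sizes yourself instead of relying on a preset, MEGAHIT accepts an explicit list of k values via --k-list. A minimal sketch (the read file names and the k values below are just an illustration, not a recommendation for OSD102):
megahit --k-list 21,41,61,81 -1 reads.R1.fastq -2 reads.R2.fastq -o assembly_multik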
Due to the limitations of the Virtual Machine, the metagenome has already been assembled using the following commands:
cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly
megahit -1 ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R1.qc.fastq -2 ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R2.qc.fastq -r ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.SR.qc.fastq --presets meta -o OSD102_assembly-megahit --min-contig-len 500 --out-prefix OSD102
Let's have a look at the parameters used. If the paired-end reads are split into two files we use -1 and -2; if the reads are interleaved we use --12 instead. The single (unpaired) reads are passed with -r. The --presets option applies different pre-defined parameter sets depending on the kind of data we have; have a look here to check the different options. We set --min-contig-len to 500 so that only contigs at least 500 nucleotides long are written to the output.
Note: With the above parameters, MEGAHIT was able to assemble the metagenome in the VM in just 37 minutes.
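As a quick sanity check on the result (the output path follows from the -o and --out-prefix values used above), we can count how many contigs were produced:
grep -c ">" OSD102_assembly-megahit/OSD102.contigs.fa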
In a similar way we can run IDBA:
cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly
idba_ud -r OSD102.pe.fasta -l OSD102.me.fasta --min_contig 500 -o OSD102_assembly-idba
Evaluating the quality of metagenomic assemblies from short reads is a difficult task, basically because, unlike with a genomic assembly, we don't have a reference to compare to. One possibility to evaluate the quality of a metagenomic assembly, although usually not available to everyone, is to use long reads (Sanger, 454) to test the co-linearity between the contigs and the long reads, as described here.
The cheapest option is to test different assemblers and inspect different properties such as the number of reads recovered by the assembly, the length of the assembly in bp, the distribution of contig lengths, the number of genes predicted, etc. (see the quick awk sketch below for some of these numbers). We can easily perform this kind of evaluation using MetaQUAST.
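For a first impression, some of these basic properties can be computed directly from the contig FASTA file. A minimal sketch with awk (here OSD102.contigs.fa stands for the MEGAHIT contig file; MetaQUAST reports these and many more metrics):
# Count contigs, total assembly length and longest contig from a FASTA file
awk '/^>/{n++; next} {len[n]+=length($0); total+=length($0)} END{max=0; for (i in len) if (len[i]>max) max=len[i]; print "contigs:", n, "total bp:", total, "longest contig:", max}' OSD102.contigs.fa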
Another option to compare different assemblies in a systematic way is to use the method proposed by Clark et al. 2012, which performs an assembly likelihood evaluation integrating read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency.
For the hands-on we will use MetaQUAST:
cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/1-assembly-evaluation/0-metaquast-evaluation/
/usr/local/bioinf/quast-3.1/metaquast.py -f --meta --max-ref-number 0 -l "megahit,idba" -o OSD102-metaquast ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-idba/contig.fa
cd OSD102-metaquast
firefox report.html &
Which is the longest assembly? Which one has more contigs? And which one has more genes?
If we want to know how many reads were used in the assembly and what the read coverage of each contig is, we have to map the reads back to the assembled contigs. For this purpose we will use a read mapper, which will align all the short reads to our contigs. There are many read mappers and they are usually really fast; however, some of them will not work for metagenomic samples because they require large amounts of memory. Some of the most widely used read mappers are Bowtie2, BBMap, CLC and BWA, among others. Murat Eren has a nice comparison of different read mappers. For our hands-on we will use Heng Li's BWA, a fast and accurate mapper.
First we will link the contig file to our working folder:
cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/1-assembly-evaluation/1-contig-read-coverage/0-megahit-assembly
ln -s ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa .
The next step is to construct the FM-index for the contigs and use BWA's mem algorithm on our quality-trimmed paired-end and single reads:
bwa index OSD102.contigs.fa
bwa mem -M OSD102.contigs.fa ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.SR.qc.fastq > OSD102.assembly.SR.sam
bwa mem -M OSD102.contigs.fa ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R1.qc.fastq ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R2.qc.fastq > OSD102.assembly.PE.sam
Now we need to merge both mappings into one SAM file. SAM is a text format to store sequence alignments/mappings and has a well defined specification. First we need to index the contig FASTA file using samtools; note that we will use a fork of the original samtools that uses libraries from Facebook (rocksdb) to improve performance:
samtools faidx OSD102.contigs.fa
Then we will convert the paired-end and single-read SAM files to BAM (a binary version of SAM: smaller and faster to access) and sort the files:
samtools view -@ 1 -q 10 -F 4 -bt OSD102.contigs.fa.fai OSD102.assembly.SR.sam | samtools rocksort -@ 1 -m 1 - OSD102.assembly.SR
samtools view -@ 1 -q 10 -F 4 -bt OSD102.contigs.fa.fai OSD102.assembly.PE.sam | samtools rocksort -@ 1 -m 1 - OSD102.assembly.PE
We use -F 4 to keep only the mapped reads and -q 10 to keep only the alignments with a mapping quality greater than 10. -@ 1 specifies that we use a single thread, and -m 1 specifies 1GB of memory per thread.
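If you want to quickly check how many alignments survived these filters before merging, samtools view -c counts records instead of printing them (a small optional check, not part of the original workflow):
samtools view -c OSD102.assembly.PE.bam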
Now it is time to merge the BAM files:
samtools merge -f -@ 1 OSD102.assembly.bam OSD102.assembly.PE.bam OSD102.assembly.SR.bam
Once the merging is done, we will sort and index the merged file:
samtools rocksort -@ 1 -m 1 OSD102.assembly.bam OSD102.assembly.sorted
samtools index OSD102.assembly.sorted.bam
Now it is time to remove duplicated reads. These duplicates can result from artefacts during PCR amplification and sequencing. If we inspect the alignments we will observe many exact duplicates of a read, i.e. reads with identical 5' coordinates and orientation. MarkDuplicates from Picard tools will identify those reads and keep only the one with the highest quality:
java -jar /usr/local/bioinf/picard-tools-1.140/picard.jar MarkDuplicates \
INPUT=OSD102.assembly.sorted.bam \
OUTPUT=OSD102.assembly.sorted.markdup.bam \
METRICS_FILE=OSD102.assembly.sorted.markdup.metrics \
AS=TRUE \
VALIDATION_STRINGENCY=LENIENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
REMOVE_DUPLICATES=TRUE
We will sort and index the BAM file without the duplicates and we will get some statistics:
samtools rocksort -@ 1 -m 1 OSD102.assembly.sorted.markdup.bam OSD102.assembly.sorted.markdup.sorted
samtools index OSD102.assembly.sorted.markdup.sorted.bam
samtools flagstat OSD102.assembly.sorted.markdup.sorted.bam > OSD102.assembly.sorted.markdup.sorted.flagstat
How many reads are mapped? And how many are properly paired?
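Those numbers can be pulled directly from the flagstat output we just wrote, for example:
grep -E "mapped|properly paired" OSD102.assembly.sorted.markdup.sorted.flagstat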
We can easily calculate the mean coverage per contig and the fraction of the contig covered with BEDTOOLS and some awk code:
First we get the coverage using BEDTOOLS:
genomeCoverageBed -ibam OSD102.assembly.sorted.markdup.sorted.bam > OSD102.assembly.sorted.markdup.sorted.coverage
And some awk magic:
# Read the histogram produced by genomeCoverageBed and report, for each contig,
# the fraction of the contig covered (as a percentage) and its mean coverage.
# Columns: 1 = contig, 2 = depth, 3 = bases at that depth, 4 = contig length, 5 = fraction of the contig at that depth
BEGIN{
    FS="\t"
}
{
    # Skip the genome-wide summary lines
    if ($0 ~ /^genome/){
        next;
    }else{
        # Mean coverage per contig = sum(depth * fraction of contig at that depth)
        c[$1] = c[$1] + ($2*$5)
        # Fraction of the contig covered by at least one read
        if ($2 > 0){
            d[$1] = d[$1] + $5
        }
    }
}END{
    for (var in c){
        print var"\t"d[var]*100"\t"c[var]
    }
}
We will create a file and paste this code into it:
gedit ~/Desktop/microbeco_course/scripts/get_mean_coverage_from_bam.awk &
After this, we will calculate the mean coverage per contig and the fraction covered:
awk -f ~/Desktop/microbeco_course/scripts/get_mean_coverage_from_bam.awk OSD102.assembly.sorted.markdup.sorted.coverage > OSD102.assembly.sorted.markdup.sorted.coverage.percontig
Inspect the file using less:
less OSD102.assembly.sorted.markdup.sorted.coverage.percontig
There are many different ways to annotate the assemblies; for example, we could use MG-RAST or EBI-MG (as discussed previously). Just as an example, we will use PROKKA for this purpose. PROKKA will annotate our contigs using different databases; in our case we will use Pfam and COG. The annotations are already provided, as they took ~25 minutes in the Virtual Machine.
First we need to setup the DBs:
prokka --setupdb
prokka --listdb
And then we will perform the annotation:
Note: In PROKKA we modified the Prodigal command to allow genes to run off the contig edges
cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/2-assembly-annotation
prokka ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa --outdir OSD102_contigs_megahit_annotation --norrna --notrna --metagenome --cpus 2
Once the annotation is done we can start to analyse the results in the GFF file generated by PROKKA:
cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/2-assembly-annotation/OSD102_contigs_megahit_annotation
awk '$0 ~ "eC_number"{split($9,a,";");gsub("ID=","",a[1]); gsub("eC_number=", "", a[2]); print a[1]"\t"a[2]}' PROKKA_11012015.gff > OSD102.ec_numbers.txt
awk '$0 ~ "COG"{split($12,a,";"); sub("motif:COG:","",a[1]); sub("locus_tag=","",a[2]); print a[2]"\t"a[1]}' PROKKA_11012015.gff > OSD102.cog.txt
awk '$0 ~ "pfam"{split($12,a,";"); sub("motif:pfam:pfam","PF",a[1]); sub("locus_tag=","",a[2]); print a[2]"\t"a[1]}' PROKKA_11012015.gff > OSD102.pfam.txt
Then, for example, we can get all amino acid sequences predicted by PROKKA with a COG annotation using the script filterbyname.sh from BBTools:
filterbyname.sh -Xmx1g in=PROKKA_11012015.faa out=stdout.fasta names=<(cut -f1 OSD102.cog.txt) include=t | awk '{$0 ~ "^>"?$0=$1:$0;print $0}' > OSD102.cog.aa.fasta
During the assembly process we lose the quantitative information provided by the reads. To estimate the number of times a gene might be present in a given metagenome, we can use read mapping to get the coverage per gene using BEDTOOLS.
First, we will transform the GFF generated by PROKKA to BED format using gff2bed from BEDOPS:
gff2bed < ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/2-assembly-annotation/OSD102_contigs_megahit_annotation/PROKKA_11012015.gff | sort --parallel=8 -k1,1 -k2,2n > OSD102.assembly.bed
We do the same for the BAM formatted file:
bamToBed -i ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/1-assembly-evaluation/1-contig-read-coverage/0-megahit-assembly/OSD102.assembly.sorted.markdup.sorted.bam | sort --parallel=8 -k1,1 -k2,2n > OSD102.assembly.sorted.markdup.sorted.bed
Note: We do these transformations to avoid using large amounts of memory with BEDTOOLS in case we have very large metagenomes
bedtools coverage -hist -sorted -b OSD102.assembly.sorted.markdup.sorted.bed -a OSD102.assembly.bed > OSD102.assembly.coverage.hist
Now that we have the coverage histogram for each feature in the GFF file, we can aggregate it into a per-gene coverage with this awk script:
#Read a hist file from bedtools coverage -hist and create a coverage per orf file.
#Coverage per each gene is calculated as: coverage_orfA = sum(depth_of_coverage * fraction_covered)
BEGIN{
FS="\t"
}
{
if ($0 ~ /^all/){
next;
}else{
c[$4]=c[$4] + ($11*$14)
}
}END{
for (var in c){
print var"\t"c[var]
}
}
As before, save this code into a file and put it all together:
awk -f ~/Desktop/microbeco_course/scripts/get_orf_coverage_from_bam.awk OSD102.assembly.coverage.hist
This gives us a quantification of each gene, which we can use for our downstream analyses.
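As a final example of how this quantification can be used, the per-gene coverages can be combined with the COG annotations extracted earlier to get a rough functional abundance profile. A minimal sketch (the file names OSD102.gene.coverage.txt and OSD102.cog.coverage.txt are just placeholders, and we assume the gene IDs in OSD102.cog.txt match those reported by the coverage script, as they should for PROKKA locus tags):
# Save the per-gene coverage to a file
awk -f ~/Desktop/microbeco_course/scripts/get_orf_coverage_from_bam.awk OSD102.assembly.coverage.hist > OSD102.gene.coverage.txt
# Join gene -> COG with gene -> coverage, then sum the coverage per COG and sort by abundance
join -t $'\t' <(sort -k1,1 OSD102.cog.txt) <(sort -k1,1 OSD102.gene.coverage.txt) | awk -F"\t" '{cov[$2]+=$3} END{for (c in cov) print c"\t"cov[c]}' | sort -k2,2nr > OSD102.cog.coverage.txt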