
3. Assembly based analyses


Session 3: 16:00-18:00

Table of contents

  1. Metagenomic assembly
  2. Assembly evaluation
  3. Read mapping
  4. Metagenomic assembly annotation
  5. Proxy to gene abundances

Metagenomic assembly

Next Generation Sequencing changed the way we understand microbial communities, giving us the opportunity to have a global view of the different biological processes. As we saw previously, we can extract valuable information from analyses at the read level, but assembling these sequences (constructing longer fragments from the reads) provides additional benefits. First, it reduces the size of our datasets: OSD102 is a small metagenome with ~650K reads, but some metagenomes from TARA, for example, have more than 400M reads. Second, once we assemble the reads into contigs, we recover many fragments at least the size of a gene, which allows us to improve our gene predictions and annotations. Third, longer fragments give us the genomic context of many of our predicted genes; for example, we can start looking for clusters of biosynthetic genes.

The assembly process is computationally intensive and most of the time challenging. In recent years, many algorithmic advances have made metagenomic assembly more accessible to those without big computational resources, in particular big-memory machines. Some of these advances are based on using 'de Bruijn graphs' for the assembly process. Here you can find a nice review about them applied to genome assembly. Although de Bruijn graphs are really useful, they consume large amounts of memory. Using 'distributed de Bruijn graphs' makes it possible to spread the assembly process across many different machines. Alongside the development of de Bruijn graphs, new data structures have been developed; one of the most impressive achievements is the 'succinct de Bruijn graph', which makes it possible to represent the same graph using just a few GB of memory. An interesting read about the evolution of de Bruijn graphs applied to assembly, including a detailed explanation of the succinct ones, can be found here.

There are plenty of assemblers out there that can be applied to metagenomics: SOAPdenovo, RAY, SPAdes, Omega, IDBA, Price and many more. In December 2014 a new assembler was published, MEGAHIT. It uses the succinct representation of de Bruijn graphs together with other algorithmic improvements. This assembler is really fast and allows the assembly of massive datasets that until now were only possible after special data treatments, such as digital normalization, which removes redundant and erroneous data.

For the hands-on we will play with MEGAHIT and IDBA to assemble our OSD102 metagenome. Both assemblers have interesting features; one of the most useful is that they follow a multi k-mer approach. Depending on the assembler, you have to select the length of the word (k-mer) into which your sequences are split in order to build the de Bruijn graph. Before multi k-mer assemblers were available, you needed to test different k-mers until you found the optimal one for your data. This blog has a nice explanation of this.
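As an illustration of the multi k-mer idea, MEGAHIT lets you control the k-mer range explicitly with --k-list (a minimal sketch; the k values and file names below are placeholders, not the settings used for OSD102):

megahit -1 reads.R1.fastq -2 reads.R2.fastq --k-list 21,41,61,81,99 -o assembly_multi_k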

Due to the limitations of the Virtual Machine, the metagenome has been already assembled using the following commands:

cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly

megahit -1 ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R1.qc.fastq -2 ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R2.qc.fastq -r ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.SR.qc.fastq --presets meta -o OSD102_assembly-megahit --min-contig-len 500 --out-prefix OSD102

Let's have a look at the parameters used. If the paired-end reads are split into two files we use -1 and -2; if the reads are interleaved we use --12 instead. The --presets option applies different pre-defined parameter sets depending on the kind of data we have; have a look here to check the different options. We set --min-contig-len to 500 so that only contigs of at least 500 nucleotides are written to the output.
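For completeness, this is how the same assembly would be launched from a hypothetical interleaved file (a sketch only; our OSD102 reads are kept in separate R1/R2 files):

megahit --12 OSD102.interleaved.qc.fastq --presets meta -o OSD102_assembly-megahit-interleaved --min-contig-len 500 --out-prefix OSD102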

Note: Using the above parameters, MEGAHIT was able to assemble the metagenome in the VM in just 37 minutes

In a similar way we can run IDBA:

cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly

idba_ud -r OSD102.pe.fasta -l OSD102.me.fasta --min_contig 500 -o OSD102_assembly-idba
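Note: IDBA expects FASTA input, so the quality-trimmed FASTQ pairs need to be converted and interleaved first, e.g. with fq2fa, which is bundled with IDBA (a minimal sketch of that conversion; the single reads can be converted in the same way):

fq2fa --merge --filter ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R1.qc.fastq ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R2.qc.fastq OSD102.pe.fasta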
Assembly evaluation

Evaluating the quality of metagenomic assemblies from short reads is a difficult task, basically because, unlike for a genomic assembly, we do not have a reference to compare against. One possibility, although usually not available to everyone, is to use long reads (Sanger, 454) to test the co-linearity between the contigs and the long reads, as described here.

The cheapest option is to test different assemblers and inspect different properties such as the number of reads recovered by the assembly, the total assembly length in bp, the distribution of contig lengths, the number of genes predicted, etc. We can easily perform this kind of evaluation using MetaQUAST.
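Some of these numbers can also be pulled quickly from the contig FASTA with standard shell tools; for example, counting the contigs and the total assembly length in bp (paths assume the MEGAHIT output described above):

grep -c ">" ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa

grep -v ">" ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa | tr -d '\n' | wc -c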

Another option for comparing different assemblies in a systematic way is the method proposed by Clark et al. 2012, which performs an assembly likelihood evaluation that integrates read quality, mate-pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency.

For the hands-on we will use MetaQUAST:

cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/1-assembly-evaluation/0-metaquast-evaluation/

/usr/local/bioinf/quast-3.1/metaquast.py -f --meta --max-ref-number 0 -l "megahit,idba" -o OSD102-metaquast ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-idba/contig.fa

cd OSD102-metaquast

firefox report.html &

Which is the longest assembly? Which one has more contigs? And which one has more genes?

Read mapping

If we want to know how many reads were used in the assembly and what the read coverage of each contig is, we have to map the reads back to the assembled contigs. For this purpose we will use a read mapper, which aligns all the short reads to our contigs. There are many read mappers and most of them are really fast; however, some will not work for metagenomic samples because they require large amounts of memory. Among the most widely used are Bowtie2, BBMAP, CLC and BWA. Murat Eren has a nice comparison of different read mappers. For our hands-on we will use Heng Li's BWA, a fast and accurate mapper.

First we will link the contig file to our working folder:

cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/1-assembly-evaluation/1-contig-read-coverage/0-megahit-assembly

ln -s ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa .

The next step is to construct the FM-index for the contigs and use BWA's mem algorithm to map our quality-trimmed paired-end and single reads:

bwa index OSD102.contigs.fa 

bwa mem -M OSD102.contigs.fa ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.SR.qc.fastq > OSD102.assembly.SR.sam

bwa mem -M OSD102.contigs.fa ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R1.qc.fastq ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/paired/OSD102.R2.qc.fastq > OSD102.assembly.PE.sam

Now we need to merge both mappings into one file. SAM is a text format for storing sequence alignments/mappings and has a well-defined specification. First we need to index the contig FASTA file using samtools (we will use a fork of the original samtools that relies on a library from Facebook, rocksdb, to improve sorting performance):

samtools faidx OSD102.contigs.fa

Then we will convert the paired-end and single-read SAM files to BAM (a binary version of SAM that takes less space and is faster to access) and sort the resulting files.

samtools view -@ 1 -q 10 -F 4 -bt OSD102.contigs.fa.fai OSD102.assembly.SR.sam | samtools rocksort -@ 1 -m 1 - OSD102.assembly.SR

samtools view -@ 1 -q 10 -F 4 -bt OSD102.contigs.fa.fai OSD102.assembly.PE.sam |  samtools rocksort -@ 1 -m 1 - OSD102.assembly.PE

We use -F 4 to keep only mapped reads and -q 10 to keep only alignments with a mapping quality greater than 10. -@ 1 specifies that we use one thread, and -m 1 sets the memory used per thread to 1 GB.
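As a quick sanity check (not part of the pipeline), samtools view -c counts the records that made it into the filtered BAM files:

samtools view -c OSD102.assembly.PE.bam

samtools view -c OSD102.assembly.SR.bam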

Now it is time to merge the BAM files:

samtools merge -f -@ 1 OSD102.assembly.bam OSD102.assembly.PE.bam OSD102.assembly.SR.bam

Once the merging is done, we sort the merged file again and index it:

samtools rocksort -@ 1 -m 1 OSD102.assembly.bam OSD102.assembly.sorted

samtools index OSD102.assembly.sorted.bam

Now it is time to remove duplicated reads. These duplicates can result from artefacts during PCR amplification and sequencing. If we inspect the alignments we will observe many exact duplicates of a read, i.e. reads with identical (5') mapping coordinates and orientation. MarkDuplicates from Picard tools identifies those reads and keeps only the one with the highest quality:

java -jar /usr/local/bioinf/picard-tools-1.140/picard.jar MarkDuplicates \
INPUT=OSD102.assembly.sorted.bam \
OUTPUT=OSD102.assembly.sorted.markdup.bam \
METRICS_FILE=OSD102.assembly.sorted.markdup.metrics \
AS=TRUE \
VALIDATION_STRINGENCY=LENIENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
REMOVE_DUPLICATES=TRUE

We will sort and index the BAM file without the duplicates and we will get some statistics:

samtools rocksort -@ 1 -m 1 OSD102.assembly.sorted.markdup.bam OSD102.assembly.sorted.markdup.sorted

samtools index OSD102.assembly.sorted.markdup.sorted.bam

samtools flagstat OSD102.assembly.sorted.markdup.sorted.bam > OSD102.assembly.sorted.markdup.sorted.flagstat

How many reads are mapped? And with proper mapping?
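A quick way to pull the relevant lines out of the flagstat report:

grep -E "mapped \(|properly paired" OSD102.assembly.sorted.markdup.sorted.flagstat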

We can easily calculate the mean coverage per contig and the fraction of the contig covered with BEDTOOLS and some awk code:

First we get the coverage using BEDTOOLS:

genomeCoverageBed -ibam OSD102.assembly.sorted.markdup.sorted.bam > OSD102.assembly.sorted.markdup.sorted.coverage

And some awk magic:

# Read the output of genomeCoverageBed and report, for each contig, the
# fraction of the contig covered (%) and the mean coverage, computed as
# sum(depth * fraction_of_contig_at_that_depth).
BEGIN{
FS="\t"
}
{
    # skip the genome-wide summary lines
    if ($0 ~ /^genome/){
        next;
    }else{
        # $1 = contig, $2 = depth, $5 = fraction of the contig at that depth
        c[$1]=c[$1] + ($2*$5)
        if ($2 > 0){
            d[$1] = d[$1] + $5
        }
    }
}END{
    for (var in c){
        # contig, percentage of the contig covered, mean coverage
        print var"\t"d[var]*100"\t"c[var]
    }
}

We will create a file and paste this code into it:

gedit ~/Desktop/microbeco_course/scripts/get_mean_coverage_from_bam.awk &

After this, we will calculate the mean coverage per contig and the fraction covered:

awk -f ~/Desktop/microbeco_course/scripts/get_mean_coverage_from_bam.awk OSD102.assembly.sorted.markdup.sorted.coverage > OSD102.assembly.sorted.markdup.sorted.coverage.percontig

Inspect the file using less:

less OSD102.assembly.sorted.markdup.sorted.coverage.percontig
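To quickly find the contigs with the highest mean coverage, we can sort the file on the third column (the mean coverage computed by the awk script above):

sort -t$'\t' -k3,3gr OSD102.assembly.sorted.markdup.sorted.coverage.percontig | head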
Metagenomic assembly annotation

There are many different ways to annotate the assemblies; for example, we could use MG-RAST or EBI-MG (as discussed previously). Just as an example, we will use PROKKA for this purpose. PROKKA will annotate our contigs using different databases; in our case we will use PFAM and COG. The annotations are already provided, as they took ~25 minutes to compute in the Virtual Machine.

First we need to set up the DBs:

prokka --setupdb

prokka --listdb

And then we will perform the annotation:

Note: In PROKKA we modified the Prodigal command to allow genes to run off contig edges

cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/2-assembly-annotation

prokka ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/0-metagenomic-assembly/OSD102_assembly-megahit/OSD102.contigs.fa --outdir OSD102_contigs_megahit_annotation --norrna --notrna --metagenome --cpus 2

Once the annotation is done we can start to analyse the results in the GFF file generated by PROKKA:

cd ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/2-assembly-annotation/OSD102_contigs_megahit_annotation

awk '$0 ~ "eC_number"{split($9,a,";");gsub("ID=","",a[1]); gsub("eC_number=", "", a[2]); print a[1]"\t"a[2]}' PROKKA_11012015.gff > OSD102.ec_numbers.txt
awk '$0 ~ "COG"{split($12,a,";"); sub("motif:COG:","",a[1]); sub("locus_tag=","",a[2]); print a[2]"\t"a[1]}' PROKKA_11012015.gff > OSD102.cog.txt
awk '$0 ~ "pfam"{split($12,a,";"); sub("motif:pfam:pfam","PF",a[1]); sub("locus_tag=","",a[2]); print a[2]"\t"a[1]}' PROKKA_11012015.gff > OSD102.pfam.txt

Then, for example, we can extract all amino acid sequences predicted by PROKKA that have a COG annotation, using the filterbyname.sh script from BBtools:

filterbyname.sh -Xmx1g in=PROKKA_11012015.faa out=stdout.fasta names=<(cut -f1 OSD102.cog.txt) include=t | awk '{$0 ~ "^>"?$0=$1:$0;print $0}' > OSD102.cog.aa.fasta
Proxy to gene abundances

During the assembly process we lose the quantitative information provided by the reads. To estimate the number of times a gene might be present in a given metagenome, we can use the read mapping to compute the coverage per gene with BEDTOOLS.

First, we will transform the GFF generated by PROKKA to BED format using gff2bed from BEDOPS:

gff2bed < ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/2-assembly-annotation/OSD102_contigs_megahit_annotation/PROKKA_11012015.gff | sort --parallel=8 -k1,1 -k2,2n > OSD102.assembly.bed

We do the same for the BAM formatted file:

bamToBed -i ~/Desktop/microbeco_course/analysis/2-assembly-level-analysis/1-assembly-evaluation/1-contig-read-coverage/0-megahit-assembly/OSD102.assembly.sorted.markdup.sorted.bam | sort --parallel=8 -k1,1 -k2,2n > OSD102.assembly.sorted.markdup.sorted.bed

Note: We do these transformations to avoid using large amounts of memory when running BEDTOOLS on very large metagenomes

bedtools coverage -hist -sorted  -b OSD102.assembly.sorted.markdup.sorted.bed -a OSD102.assembly.bed > OSD102.assembly.coverage.hist 

Now that we have calculated the coverage histogram for each feature in the GFF file, we can aggregate it into a per-gene coverage with this awk script:

#Read a hist file from bedtools coverage -hist and create a coverage per ORF file.
#Coverage per gene is calculated as: coverage_orfA = sum(depth_of_coverage * fraction_covered)

BEGIN{
FS="\t"
}
{
    # skip the "all" summary lines appended by bedtools
    if ($0 ~ /^all/){
        next;
    }else{
        # $4 = gene (feature) name, $11 = depth, $14 = fraction of the feature at that depth
        c[$4]=c[$4] + ($11*$14)
    }
}END{
    for (var in c){
        print var"\t"c[var]
    }
}

Putting this together:

awk -f ~/Desktop/microbeco_course/scripts/get_orf_coverage_from_bam.awk OSD102.assembly.coverage.hist

Then we will have the quantification of each gene and we can use this information for our downstream analyses.
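For example, assuming the output of the awk command above was redirected to a (hypothetical) file OSD102.gene.coverage.txt and that OSD102.cog.txt from the annotation step is in the working directory, the per-gene coverages can be aggregated into abundances per COG:

sort -k1,1 OSD102.gene.coverage.txt > coverage.sorted.txt

sort -k1,1 OSD102.cog.txt > cog.sorted.txt

join -t$'\t' cog.sorted.txt coverage.sorted.txt | awk -F"\t" '{a[$2]+=$3}END{for (cog in a) print cog"\t"a[cog]}' | sort -k2,2gr | head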
