
Sequence Data Pre-Processing


Overview:

Definition: pre-processing is the creation of a subset of the original raw sequence data that fulfills defined sequence quality criteria and is suitable for further analyses.

Intent: The goal of formalizing this workflow is to provide all OSD participants (and the whole scientific community) with a single, quality-controlled dataset in order to ensure comparability and repeatability of analysis results.

Scope: The workflow covers both amplicon (i.e. 16S/18S rDNA) and shotgun (i.e. metagenome) data sequenced with Illumina MiSeq.

Input: Sequences as delivered by the sequencing centers, i.e. without any prior filtering (e.g. by quality or length).

Data delivery by sequencing center

Sequence data were delivered by three different sequencing centers. See also Guide to OSD 2014 Data.

Illumina pre-processing workflow

Sketch of the sequence pre-processing steps of OSD 2014

Caption: Quality control (using FastQC) is done after each individual step, except shotgun step 3. The QC information is only used for actual filtering before submitting the ‘workable’ datasets; otherwise it has a purely informative function.
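As a rough illustration of these per-step checks (the exact commands are in the Bash scripts linked below), a FastQC call on the output of a single step might look like the sketch below; the sample and directory names are placeholders, not the actual OSD file names.

```bash
# Illustrative only: run FastQC on the output of one pre-processing step.
# Sample and directory names are placeholders, not the actual OSD file names.
mkdir -p qc/step4_primer_clipped
fastqc sampleX_R1.fastq.gz sampleX_R2.fastq.gz --outdir qc/step4_primer_clipped
```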

Workflow steps

All pre-processing was done using Bash scripts. The code is available at https://colab.mpi-bremen.de/micro-b3/svn/analysis-scripts/trunk/osd-analysis/osd-pre-processing/

A graphical overview is shown above. In short, the steps of each pre-processing pipeline are listed below; an illustrative command sketch follows each list:

16S/18S rDNA Amplicon Data

  1. demultiplexing

    Sequence data from all sequencing centers were already de-multiplexed.

  2. sort reads (cutadapt)

    Already done by LifeWatch.

  3. clip adapters (only done on LGC sequence reads; Trimmomatic)

  4. clip primers (cutadapt)

    Submitted to ENA.

  5. merge reads

  6. trim by quality

  7. filter by length

  8. final QC

    Submitted to ENA.
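As an illustration of steps 4–8 for a single sample, the sketch below uses cutadapt, PEAR, and FastQC as named above. The primer sequences, quality threshold, and minimum length are placeholders, and cutadapt's built-in quality/length options are used as stand-ins for steps 6–7; the exact tools and parameters are defined in the Bash scripts.

```bash
#!/bin/bash
# Sketch of amplicon steps 4-8 for one sample. Primer sequences, quality
# threshold, and minimum length are placeholders; see the OSD scripts for
# the parameters actually used.
set -euo pipefail

FWD_PRIMER="NNNNNNNNNNNNNNNNN"   # forward 16S/18S primer (placeholder)
REV_PRIMER="NNNNNNNNNNNNNNNNN"   # reverse primer (placeholder)

# 4. clip primers from both reads with cutadapt (paired-end mode)
cutadapt -g "$FWD_PRIMER" -G "$REV_PRIMER" \
    -o sampleX_clipped_R1.fastq -p sampleX_clipped_R2.fastq \
    sampleX_R1.fastq sampleX_R2.fastq

# 5. merge overlapping read pairs with PEAR
pear -f sampleX_clipped_R1.fastq -r sampleX_clipped_R2.fastq -o sampleX
# merged reads end up in sampleX.assembled.fastq

# 6. + 7. trim 3' ends by quality and discard reads shorter than 200 bp
# (cutadapt's -q/-m options used here as stand-ins)
cutadapt -q 20 -m 200 -o sampleX_workable.fastq sampleX.assembled.fastq

# 8. final QC with FastQC
mkdir -p qc/amplicon_final
fastqc sampleX_workable.fastq --outdir qc/amplicon_final
```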

Metagenome (shotgun) data

  1. demultiplexing

    Sequence data from all sequencing centers were already de-multiplexed.

  2. clip adapters (Trimmomatic)

    Submitted to ENA.

  3. trim by quality

  4. filter by length

  5. final QC
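A comparable single-sample sketch of shotgun steps 2–5 is shown below. The adapter file, quality threshold, and minimum length are placeholder values, and the three Trimmomatic operations are combined into one call here for brevity, whereas the workflow above treats them as separate steps.

```bash
#!/bin/bash
# Sketch of shotgun steps 2-5 for one sample. Adapter file and thresholds are
# placeholders; the OSD workflow runs adapter clipping, quality trimming, and
# length filtering as separate steps with QC in between.
set -euo pipefail

R1=sampleY_R1.fastq.gz
R2=sampleY_R2.fastq.gz

# 2.-4. clip adapters, trim by quality, filter by length (Trimmomatic, paired-end).
# 'trimmomatic' assumes a wrapper script; otherwise call 'java -jar trimmomatic.jar PE ...'
trimmomatic PE -phred33 "$R1" "$R2" \
    sampleY_R1_paired.fastq.gz sampleY_R1_unpaired.fastq.gz \
    sampleY_R2_paired.fastq.gz sampleY_R2_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
    SLIDINGWINDOW:4:20 \
    MINLEN:50

# 5. final QC with FastQC
mkdir -p qc/shotgun_final
fastqc sampleY_R1_paired.fastq.gz sampleY_R2_paired.fastq.gz --outdir qc/shotgun_final
```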

Output

‘raw’ and ‘workable’ datasets

We provide two versions for both amplicon and metagenome datasets:

  1. The ‘raw’ version only includes the initial Quality Control (QC) and demultiplexing. It is intended for advanced users who might (1) have their own pre-processing pipelines or (2) prefer to use different tools and parameters than the ones chosen by the OSD Analysis Team.

  2. The ‘workable’ version is the result of the full pre-processing workflow and will be used for all further analyses in the consortium. This version is recommended, as it guarantees comparability of results.

The output of the pre-processing workflow is a set of quality-controlled datasets ready for analysis (e.g. with SILVAngs, MG-Traits, or EMG (MG-Portal), among others).

  • For amplicon data the output files per sample are:

    • raw: non-merged

    • workable: merged

  • For shotgun data the output files per sample are:

    • raw: non-merged (used e.g. for EMG)

    • workable output files

      • merged (used e.g. by mg-traits)

      • non-merged (used e.g. for assemblies)

All workable data sets are downloadable here.

Background

We chose PEAR for merging reads. Antonio did an extensive comparison of FLASH, BBMerge, SeqPrep, and PEAR. In a nutshell, FLASH produces weird results, whereas BBMerge, SeqPrep, and PEAR have comparable outputs. PEAR was chosen because it is already published and thus citable. Detailed results and discussion of Antonio's test can be found here; a detailed forum discussion on the different tools and their performance is also available.
