
A hybrid pipeline for reconstruction & analysis
of viral and host genomes at multi-organ level.
TRACESPipe is a next-generation sequencing pipeline for identification, assembly, and analysis of viral and human-host genomes at multi-organ level. The identification and assembly of viral genomes rely on cooperation between three modalities:
- compression-based predictors;
- sequence alignments;
- de-novo assembly.
For human-host variant calling, the same procedure is followed but starts directly at the second step (sequence alignment), because the same reference (the revised Cambridge Reference Sequence) is used in all cases.
The previous image shows the architecture of TRACESPipe, where the green line represents the human mitochondrial line. The pipeline has been tested on the Illumina HiSeq and NovaSeq platforms. It requires Linux. On Windows, use Cygwin (https://www.cygwin.com/) and make sure the installation includes cmake, make, zcat, unzip, wget, tr, and grep (and any dependencies); installing the complete Cygwin package set installs all of them. After that, all steps are the same as on Linux.
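Before installing, it can help to confirm that the tools listed above are on the PATH. The following is a minimal sketch and not part of TRACESPipe: `missing_tools` is a hypothetical helper name, and the tool list is simply the one given above.

```shell
#!/bin/sh
# Sketch only: report which of the required command-line tools are missing.
# missing_tools is a hypothetical helper, not a TRACESPipe command.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the tool can be found on the PATH
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools cmake make zcat unzip wget tr grep)
if [ -n "$missing" ]; then
  echo "Missing tools:"
  echo "$missing"
else
  echo "All listed tools were found."
fi
```

The check only reports what is missing; it does not install anything and does not verify tool versions.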
TRACESPipe includes methods for ancient DNA authentication, namely the quantification of damage (at the ends of the reads) relative to a reference. Another feature is the quantification of Y-chromosome presence through compression-based predictors.
Additionally, TRACESPipe includes read trimming and filtering, PhiX removal, and redundancy controls (at the database level and for each candidate reference genome) to improve the consistency and quality of the data.
Conda is needed for installation.
To install Conda, use the following steps:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Additional instructions can be found here:
https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
To install TRACESPipe, run the following commands in a Linux OS:
git clone https://github.com/viromelab/tracespipe.git
cd tracespipe/src/
chmod +x TRACES*.sh
./TRACESPipe.sh --install
./TRACESPipe.sh --get-all-aux
The Install, Update, Version, and Check scripts, as well as this README, have sections that are automatically generated from the dependencies using:
make all
As a user, you should not need to run this command, but you may if you have yq (a CLI YAML parser) installed.
As a developer, run it whenever the system_files/dependencies.yml file,
the generator scripts, or the relevant files change. One suggestion is to add it to a Git pre-commit hook.
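The pre-commit suggestion could be sketched as follows. This is a hypothetical hook, not a file shipped with TRACESPipe. It assumes that `make all` works from the directory in which Git runs the hook (the top of the work tree) and, for brevity, it only reacts to staged changes of system_files/dependencies.yml; `touches_dependencies` is an illustrative helper name.

```shell
#!/bin/sh
# Hypothetical .git/hooks/pre-commit sketch; not shipped with TRACESPipe.
# touches_dependencies reads a list of paths (one per line) on stdin and
# succeeds if system_files/dependencies.yml is among them.
touches_dependencies() {
  grep -qxF 'system_files/dependencies.yml'
}

# Only act inside a Git work tree; otherwise do nothing.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  if git diff --cached --name-only | touches_dependencies; then
    echo "dependencies.yml is staged: regenerating files with 'make all'" >&2
    # Assumption: the Makefile is reachable from the hook's working directory.
    make all || exit 1
  fi
fi
```

To try it, the script would be saved as .git/hooks/pre-commit and made executable. The regenerated files are not staged automatically, so they still need to be reviewed and added before committing.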
In the tracespipe/ folder the following structure exists:
tracespipe/
│
├── meta_data/                          # information about the filenames in input_data/ and organ names
│   └── meta_info.txt                   # see Configuration section for this file.
│
├── input_data/                         # where the NGS reads must be placed (and compressed with gzip)
│
├── output_data/                        # where the results will appear using the following subfolders:
│   │
│   ├── TRACES_preprocessed_reads/      # trimmed and adapter removed fastq files
│   ├── TRACES_results/                 # where the files regarding the metagenomic
│   │                                   # analysis, redundancy (complexity) and control will appear
│   ├── TRACES_results/profiles/        # where the redundancy (complexity) profiles appear
│   │
│   ├── TRACES_viral_alignments/        # where viral alignments and index will appear
│   ├── TRACES_viral_consensus/         # where viral consensus (FASTA) will appear
│   ├── TRACES_viral_bed/               # where viral BED files will appear (SNPs and Coverage)
│   ├── TRACES_viral_statistics/        # where viral statistics appear (depth/breadth coverage)
│   │
│   ├── TRACES_mtdna_alignments/        # where mtdna alignments and index will appear
│   ├── TRACES_mtdna_consensus/         # where mtdna consensus (FASTA) will appear
│   ├── TRACES_mtdna_bed/               # where mtdna BED files will appear (SNPs and Coverage)
│   ├── TRACES_mtdna_statistics/        # where mtdna statistics appear (depth/breadth coverage)
│   ├── TRACES_mtdna_authentication/    # where mtdna species and population authentication appears
│   │
│   ├── TRACES_cy_alignments/           # where cy alignments and index will appear
│   ├── TRACES_cy_consensus/            # where cy consensus (FASTA) will appear
│   ├── TRACES_cy_bed/                  # where cy BED files will appear (SNPs and Coverage)
│   ├── TRACES_cy_statistics/           # where cy statistics appear (depth/breadth coverage)
│   │
│   ├── TRACES_specific_alignments/     # where specific alignments and index will appear
│   ├── TRACES_specific_consensus/      # where specific consensus (FASTA) will appear
│   ├── TRACES_specific_bed/            # where specific BED files will appear
│   ├── TRACES_specific_statistics/     # where specific statistics appear (depth/breadth coverage)
│   │
│   ├── TRACES_mtdna_damage_<ORGAN>/    # where the mtdna damage estimation files will appear
│   │
│   ├── TRACES_denovo_<ORGAN>/          # where the output of de-novo assembly appears
│   │
│   ├── TRACES_hybrid_alignments/       # where the hybrid data appears
│   ├── TRACES_hybrid_consensus/        # where the hybrid data appears
│   ├── TRACES_hybrid_bed/              # where the hybrid data appears
│   │
│   ├── TRACES_hybrid_R2_alignments/    # where the second round hybrid data appears
│   ├── TRACES_hybrid_R2_consensus/     # where the second round hybrid data appears
│   ├── TRACES_hybrid_R2_bed/           # where the second round hybrid data appears
│   │
│   ├── TRACES_hybrid_R3_alignments/    # where the third round hybrid data appears
│   ├── TRACES_hybrid_R3_consensus/     # where the third round hybrid data appears
│   ├── TRACES_hybrid_R3_bed/           # where the third round hybrid data appears
│   │
│   ├── TRACES_hybrid_R4_alignments/    # where the fourth round hybrid data appears
│   ├── TRACES_hybrid_R4_consensus/     # where the fourth round hybrid data appears
│   ├── TRACES_hybrid_R4_bed/           # where the fourth round hybrid data appears
│   │
│   ├── TRACES_hybrid_R5_consensus/     # where the automatically chosen hybrid consensus
│   │                                   # appears (diff will be made using this data)
│   │
│   ├── TRACES_multiorgan_alignments/   # where the multi-organ alignments data appears
│   ├── TRACES_multiorgan_consensus/    # where the multi-organ consensus data appears
│   │
│   ├── TRACES_diff/                    # where the dnadiff results appear (identity & SNPs)
│   ├── TRACES_specific_diff/           # where the specific dnadiff results appear
│   │
│   └── TRACES_blasts/                  # where the specific blast results appear
│
├── to_encrypt_data/                    # where the NGS files to encrypt must be before encryption
├── encrypted_data/                     # where the encrypted data will appear
├── decrypted_data/                     # where the decrypted data will appear
│
├── logs/                               # where the logs (stdout, stderr, and system) will appear
│
├── src/                                # where the bash code is and where the commands must be called
│
└── imgs/                               # images related to the pipeline

To configure TRACESPipe, add your gzipped FASTQ files to the folder
input_data/
Then add a file named exactly meta_info.txt to the folder
meta_data/
This file must specify, one line per sample, the organ type (as a single-word name) and the filenames of the paired-end reads, separated by colons. An example of the content of meta_info.txt is the following:
skin:V1_S44_R1_001.fastq.gz:V1_S44_R2_001.fastq.gz
brain:V2_S29_R1_001.fastq.gz:V2_S29_R2_001.fastq.gz
colon:V3_S45_R1_001.fastq.gz:V3_S45_R2_001.fastq.gz
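A quick sanity check of this layout can catch typos before a long run. The sketch below is not part of TRACESPipe: `valid_meta_line` is a hypothetical helper, and the rules it enforces (exactly three colon-separated fields, a single-word organ name, read filenames ending in .gz) are inferred from the description above rather than taken from the pipeline's own parser.

```shell
#!/bin/sh
# Sketch only: check that a line follows "organ:forward.gz:reverse.gz".
# valid_meta_line is a hypothetical helper, not a TRACESPipe command.
valid_meta_line() {
  line=$1
  organ=${line%%:*}                      # text before the first colon
  rest=${line#*:}                        # text after the first colon
  [ "$rest" != "$line" ] || return 1     # no colon at all
  fwd=${rest%%:*}
  rev=${rest#*:}
  [ "$rev" != "$rest" ] || return 1      # only two fields
  case "$rev" in *:*) return 1 ;; esac   # more than three fields
  case "$organ" in ''|*[[:space:]]*) return 1 ;; esac   # single word only
  case "$fwd" in *.gz) ;; *) return 1 ;; esac
  case "$rev" in *.gz) ;; *) return 1 ;; esac
  return 0
}

# Assumed location when run from src/; pass another path as the first argument.
meta=${1:-../meta_data/meta_info.txt}
if [ -f "$meta" ]; then
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    [ -n "$line" ] || continue
    valid_meta_line "$line" || echo "line $n looks malformed: $line" >&2
  done < "$meta"
fi
```

The check looks only at the text of each line; it does not verify that the named files exist in input_data/.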
Then, in the src/ folder, run:
./TRACESPipe.sh --get-all-aux
To run TRACESPipe, use the following command:
./TRACESPipe.sh <parameters>
There are many parameters and configurations that can be used.
See the next section for more information about the usage.
./TRACESPipe.sh -h
βββββββββ βββββββ ββββββ βββββββ ββββββββ ββββββββ
βββββββββ ββββββββ ββββββββ ββββββββ ββββββββ ββββββββ
βββ ββββββββ ββββββββ βββ ββββββ ββββββββ
βββ ββββββββ ββββββββ βββ ββββββ ββββββββ
βββ βββ βββ βββ βββ ββββββββ ββββββββ ββββββββ
βββ βββ βββ βββ βββ βββββββ ββββββββ ββββββββ
P I P E L I N E
| A hybrid pipeline for reconstruction & analysis |
| of viral and host genomes at multi-organ level. |
Usage: ./TRACESPipe.sh [options]
=========== GENERAL OPTIONS ==========
-h, --help Show this help message and exit,
-v, --version Show the version and some information,
-flog, --flush-logs Flush logs (delete logs),
-fout, --flush-output Flush output data (delete all output_data),
-t <THREADS>, --threads <THREADS> Number of threads to use,
=========== SETUP COMMANDS ==========
-i, --install Installation of all the tools,
-up, --update Update all the tools in TRACESPipe,
-spv, --show-prog-ver Show included programs versions,
-st, --sample Creates human ref. VDB and sample organ,
-gmt, --get-max-threads Get the number of maximum machine threads,
-dec, --decrypt Decrypt (all files in ../encrypted_data),
-enc, --encrypt Encrypt (all files in ../to_encrypt_data),
-vdb, --build-viral Build viral database (all) [Recommended],
-vdbr, --build-viral-r Build viral database (references only),
-udb, --build-unviral Build non viral database (control),
-lcm, --lcr-mask-vdb Construct an LCR masked viral database
Uses alt-vdb if specified
-afs <FASTA>, --add-fasta <FASTA>
Add a FASTA sequence to the viral database,
-aes <ID>, --add-extra-seq <ID>
Add extra sequence to the viral database,
-gx, --get-extra-vir Downloads/appends (VDB) extra viral seq,
-gad, --gen-adapters Generate FASTA file with adapters,
-gp, --get-phix Extracts PhiX genomes (Needs viral DB),
-gm, --get-mito Downloads human Mitochondrial genome,
-dwms, --download-mito-species
Downloads the complete NCBI mitogenomes
database containing the existing species,
-dwmp, --download-mito-population
Downloads two complete mitogenome databases
with healthy and pathogenic sequences,
-aums, --auth-mito-species
Autheticate the mitogenome species,
-aump, --auth-mito-population
Authenticate closest population,
-cmt <ID>, --change-mito <ID>
Set any Mitochondrial genome by ID,
-gy, --get-y-chromo Downloads human Y-chromosome,
-gax, --get-all-aux Runs -gad -gp -gm -gy,
-cbn, --create-blast-db It creates a nucleotide blast database,
-ubn, --update-blast-db It updates a nucleotide blast database,
=========== ANALYSIS COMMANDS ==========
--- Some commands can only be run with or after others.
At the bottom of this section is a dependency tree
-ra, --run-analysis Run data analysis (core),
-all, --run-all Run all the options (excluding the specific).
-proc, --run-preprocess Run adapter removal, quality trimming,
length filtering, base correction,
and poly-g tail removal with fastp
-sfs <FASTA>, --search-blast-db <FASTA>
It blasts the nucleotide (nt) blast DB,
-sfrs <FASTA>, --search-blast-remote-db <FASTA>
It blasts remotly thenucleotide (nt) blast
database (it requires internet connection),
-gbb, --best-of-bests Identifies the best of bests references
between multiple organs [similar reference],
-rm, --run-meta Run viral metagenomic identification,
-ro, --run-meta-nv Run NON-viral metagenomic identification,
-rpro, --run-profiles Run complexity and relative profiles (control),
-rpgi <ID>, --run-gid-complexity-profile <ID>
Run complexity profiles by GID,
-rava, --run-all-v-alig Run all viral align/sort/consensus seqs
from a specific list,
-rsr <ID>, --run-specific <ID/PATTERN>
Run specific reference align/consensus,
-rsx <ID>, --run-extreme <ID/PATTERN>
Run specific reference align/consensus
using extreme sensitivity;
Retained for backwards compatibility;
Now an alias for -vhs -rsr <ID/PATTERN>,
-rmt, --run-mito Run Mito align and consensus seq,
-rmtd, --run-mito-dam Run Mito damage only,
-rgid <ID>, --run-gid-damage <ID>
Run damage pattern analysis by GID,
-rya, --run-cy-align Run CY align and consensus seq,
-ryq, --run-cy-quant Estimate the quantity of CY DNA,
-rda, --run-de-novo Run de-novo assembly,
-rhyb, --run-hybrid Run hybrid assembly (align/de-novo),
-rsd <ID>, --run-de-novo-specific <ID/PATTERN>
Run specific alignments of the de-novo
to the reference genome,
-cast, --compile-aln-stats <TYPE>
Combine breadth, depth, similarity,
and selected mapping statistics into a
report. Valid types are: viral, mtdna
cy, specific, and all. Auto-enabled
with any alignment command.
-rmhc, --run-multiorgan-consensus
Run alignments/consensus between all the
reconstructed organ sequences,
-vis, --visual-align Run Visualization tool for alignments,
-covl, --coverage-latex Run coverage table in Latex format,
-covc, --coverage-csv Run coverage table in CSV format,
-covp <NAME>, --coverage-profile <BED_NAME_FILE>
Run coverage profile for specific BED file,
-diff, --run-diff Run diff -> reference and hybrid (ident/SNPs),
-sdiff <V_NAME> <ID/PATTERN>, --run-specific-diff <V_NAME> <ID/PATTERN>
Run specific diff of reconstructed to a virus
pattern of ID. Example: -sdiff B19 AY386330.1,
-brec, --blast-reconstructed
Run local blast over reconstructed genomes,
--- Dependency Tree ---
run-preprocess
├───run-meta
│   ├───run-profiles
│   └───run-all-v-alig
│       ├───compile-aln-stats viral
│       ├───coverage-latex
│       ├───coverage-csv
│       └───run-diff
├───run-specific
│   └───compile-aln-stats specific
├───run-mito
│   └───compile-aln-stats mtdna
├───run-cy-align
│   └───compile-aln-stats cy
├───run-cy-quant
└───run-de-novo
    ├───run-hybrid
    │   ├───run-multiorgan-consensus
    │   ├───run-diff
    │   ├───run-specific-diff
    │   └───blast-reconstructed
    └───run-de-novo-specific
search-blast-db
search-blast-remote-db
best-of-bests
run-gid-complexity-profile
run-mito-dam
run-gid-damage
visual-align
coverage-profile
=========== ANALYSIS OPTIONS ==========
-avdb <FASTA>, --alt-viral-db <FASTA>
Specify a path to fasta file containing
viral sequences
Sequence names must include the accession
as the first field, either whitespace or
underscore (_) delimited (NC_ handled)
-vdbm <PATH>, --viral-db-metadata <PATH>
Specify a path to a tab del file which
has sequence GID/ACC in the first column
and a Name representing a virus label
in the second
This changes the meta-analysis behaviour:
rather than using internal virus names
and inclusion criteria, the relationships
defined in the provided file are used.
This allows the user to define groupings
of interest in the default, or user
provided viral database.
-rdup, --remove-dup Remove duplications (e.g. PCR dup),
-vhs, --very-sensitive Aligns with very high sensitivity (slower),
-adr, --attempt-denovo-restart
TRACESPipe will attempt to run previous
de-novo assemblies from their last
checkpoint, useful if an external fault
halted assembly
Warning: If resuming initiates but later
fails, assembly will restart from scratch
-dmef <VALUE>, --denovo-mem-estimate-factor <VALUE> Default:50
A value multiplied by the file size of
compressed raw foward reads to estimate
the spades memory usage. Controls trigger
for bbnorm digital normalization.
A value of 0 disables bbnorm.
-mdm, --max-denovo-mem Default:350
Maximum memory in GB to allocate for
de novo assemly. A value of 0 removes the
limit and disables bbnorm.
-ulcm, --use-lcr-masked-vdb
Use the pre-constructed lcr masked vdb
for input to FALCON; disabled unless -lcm
has been run
-pdep <FILE>, --pattern-depletion <FILE>
A path to a file containing regular
expressions. Matches are filtered out
of FALCON input. Primary use is to filter
patterns which exist in both a virus and
background sequences (eq. telomeres)
-iss <SIZE>, --inter-sim-size <SIZE>
Inter-genome similarity top size (control),
-cpwi <VALUE>, --complexity-profile-window <VALUE>
Complexity profile window size,
-cple <VALUE>, --complexity-profile-level <VALUE>
Complexity profile compression level [1;10],
-mis <VALUE>, --min-similarity <VALUE>
Minimum similarity value to consider the
sequence for alignment-consensus (filter),
-misl <VALUE>, --min-similarity-len <VALUE>
Minimum product of similarity value and
best hit sequence length for
alignment-consensus (filter),
-misv <PATH>, --min-similarity-virus <PATH>
Path to a tab sep file with tow columns
containing the virus and min sim values
Any values lower than --misl are ignored
-top <VALUE>, --view-top <VALUE>
Display the top <VALUE> with the highest
similarity (by descending order),
-amax <VALUE>, --max-alignments <VALUE>
The maximum number of alignments to
report for any one read; 0 = no limit,
-c <VALUE>, --cache <VALUE>
Cache to be used by FALCON-meta,
-tsv <VALUE>, --top-size-virus <VALUE>
Top size to be used by FALCON-meta when
using TRACES_metagenomic_viral.sh;
default:0 -> seq count in viral db
-ts <VALUE>, --top-size <VALUE>
Top size to be used by FALCON-meta when
using TRACES_metagenomic.sh;
default:0 -> seq count in non-viral db
-cmax <MAX>, --max-coverage <MAX_COVERAGE>
Maximum depth coverage (depth normalization),
-clog <VALUE>, --coverage-log-scale <VALUE>
Coverage profile logarithmic scale VALUE=Base,
-cwis <VALUE>, --coverage-window-size <VALUE>
Coverage window size for low-pass filter,
-cdro <VALUE>, --coverage-drop <VALUE>
Coverage drop size (sampling),
-covm <VALUE>, --coverage-min-x <VALUE>
Coverage minimum value for x-axis
=========== EXAMPLES ==========
Ex: ./TRACESPipe.sh --flush-output --flush-logs --run-mito --run-meta
--remove-dup --run-de-novo --run-hybrid --min-similarity 1 --run-diff
--very-sensitive --best-of-bests --run-multiorgan-consensus
Add the file meta_info.txt at ../meta_data/ folder. Example:
meta_info.txt -> 'organ:reads_forward.fa.gz:reads_reverse.fa.gz'
The reads must be GZIPed in the ../input_data/ folder.
The output results are at ../output_data/ folder.
Contact: tracespipe@gmail.com
A common way to run TRACESPipe is:
./TRACESPipe.sh \
--flush-logs \
--run-preprocess \
--run-meta \
--inter-sim-size 2 \
--run-all-v-alig \
--run-mito \
--remove-dup \
--run-de-novo \
--run-hybrid \
--min-similarity 1.5 \
--view-top 5 \
--best-of-bests \
--very-sensitive \
--run-multiorgan-consensus \
--run-diff
All output of the run is written to the output_data folder and can be inspected manually using IGV.
Some examples of specific runs are described below.
5.1 Building viral consensus sequences with a fixed reference sequence in all organs (if it exists in the FASTQ samples):
./TRACESPipe.sh --run-meta --run-all-v-alig --remove-dup --min-similarity 3 --best-of-bests
The output consensus sequence is included at
output_data/TRACES_viral_consensus
while the alignments at
output_data/TRACES_viral_alignments
and the BED files at
output_data/TRACES_viral_bed
./TRACESPipe.sh --run-mito --remove-dup
The output consensus sequence is included at
output_data/TRACES_mtdna_consensus
while the alignments at
output_data/TRACES_mtdna_alignments
and the BED files at
output_data/TRACES_mtdna_bed
TRACESPipe supports secure encryption of genomic data. This allows outsourcing of the sequencing service while maintaining secure transmission and storage of the files.
Place the files from sequencing (e.g. gzipped FASTQ files) in the to_encrypt_data folder and then run:
./TRACESPipe.sh --encrypt
Enter a strong password.
The encrypted files are written to the encrypted_data folder.
Place the encrypted files in the encrypted_data folder and then run:
./TRACESPipe.sh --decrypt
Enter the password that was used for encryption.
The decrypted files are written to the decrypted_data folder.
./TRACESPipe.sh --run-meta --run-all-v-alig
The output consensus sequence is included at
output_data/TRACES_viral_consensus
while the alignments at
output_data/TRACES_viral_alignments
and the BED files at
output_data/TRACES_viral_bed
./TRACESPipe.sh --run-cy-quant
The output quantification is included at
output_data/TRACES_results/REP_CY_<organ_name>.txt
./TRACESPipe.sh --run-meta
The output is included at
../output_data/TRACES_results/REPORT_META_VIRAL_ALL.txt
./TRACESPipe.sh --run-meta-nv
The output is included at
../output_data/TRACES_results/REPORT_META_NON_VIRAL_<organ_name>.txt
./TRACESPipe.sh --run-de-novo
The outputs are included at
../output_data/TRACES_denovo_alignments
../output_data/TRACES_denovo_consensus
../output_data/TRACES_denovo_bed
5.9 Run specific viral alignment (AF037218.1) for all organs using extreme sensitivity without duplications:
./TRACESPipe.sh --remove-dup --run-extreme AF037218.1
The output is included at
../output_data/TRACES_specific_alignments
and the depth and breadth coverage values at
../output_data/TRACES_specific_statistics
./TRACESPipe.sh --run-mito-dam
The output is included at
../output_data/TRACES_mtdna_damage_<organ_name>
./TRACESPipe.sh --search-blast-remote-db AF037218.1
The output is included at
../output_data/TRACES_blastn
This approach assumes that the reconstruction has already been processed:
./TRACES_normalized_depth.sh ../output_data/TRACES_viral_bed/B19-coverage-blood.bed 200
The output is written to stdout.
TRACES Pipeline uses a combination of the following tools:
| Tool | Tested version | Article |
|---|---|---|
| AdapterRemoval | 2.3.4 | |
| ART_illumina | 2016.06.05 | |
| BBNorm | 39.79 | NA |
| BFCtools | 1.21 | |
| BEDOPS | 2.4.42 | |
| BEDTools | v2.31.1 | |
| BLASTn | 2.16.0+ | |
| Bowtie2 | 2.5.4 | |
| BWA | 0.7.18-r1243-dirty | |
| Cryfa | v18.06 | |
| dnadiff | 1.3 | |
| efetch | 24.4 | NA |
| FALCON | 2.3 | |
| fastp | 0.23.4 | |
| grepq | 1.5.4 | |
| GTO | v1.5.9 | |
| IGV | 2.19.3 | |
| iVar | 1.4.4 | |
| MAGNET | 19.4 | |
| mapDamage2 | 2.2.3 | |
| SAMtools | 1.21 | |
| Sdust | 0.1 | |
| SPAdes | 4.2.0 | |
| Tabix | 1.21 | |
| Trimmomatic | 0.39 | |
If you use this pipeline, please cite:
Pratas, D., Toppinen, M., Pyöriä, L., Hedman, K., Sajantila, A. and Perdomo, M.F., 2020.
A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level.
GigaScience, 9(8), p.giaa086.
For any issue, let us know through the issue tracker of the repository.
GPL v3.
For more information see LICENSE file or visit
http://www.gnu.org/licenses/gpl-3.0.html
