IRMA-core Aligner README

Motivation and Goals

IRMA-core's aligner provides an efficient and exact local sequence alignment routine using Striped Smith-Waterman. This is particularly useful for aligning Next Generation Sequencing (NGS) reads and performing full alignment between short genomes.

Multithreading

aligner uses rayon to perform multithreading to enable higher throughput. To specify the number of threads, set the RAYON_NUM_THREADS environmental variable as described here. Or, to limit to a single worker thread, pass --single-thread to aligner.

For benchmarking or scenarios where a single thread is always used, the dev_no_rayon feature can be enabled in IRMA-core to remove the use of channels. This feature may be removed in future releases, and so should not be relied upon except for testing.

Inputs and Outputs

The first positional argument is a FASTA file containing the reference sequence(s), and the second argument is a FASTA or FASTQ file containing the queries. Either file may be gzip-compressed, in which case it is assumed to end in .gz. The output is in the SAM alignment format. The score is reported with the AS tag for mapped reads, and the MAPQ field is not used (it is set to 255). The output file is specified with --out or --output flags. If not specified, output is directed to STDOUT. If the provided file ends in .gz, the output will be zipped.

As an example, consider the following inputs:

reference.fasta

>reference
GACTCAGTAAGACACGGTCTAGCTGACTGT

query.fastq

@query1
AGTAACACGGTC
+
IIIIIIIIIIII
@query2
GTCTAGGTGACTA
+
IIIIIIIIIIIII

Aligner could be run with a command such as:

irma-core aligner \
    reference.fasta query.fastq \
    --out alignments.sam \
    -m 1 -x 1 -o 2 -e 1

This produces:

@query1 0       reference       6       255     5M2D7M  *       0       0       AGTAACACGGTC    IIIIIIIIIIII    AS:i:9
@query2 0       reference       17      255     12M1S   *       0       0       GTCTAGGTGACTA   IIIIIIIIIIIII   AS:i:10

Scoring

IRMA-core aligner supports both DNA and amino acid alignments. For DNA, alignment is case-insensitive over the alphabet ACGTN, with all other symbols being treated as N. For amino acid alignment, the case-insensitive alphabet is ACDEFGHIKLMNPQRSTVWY*BJZX, with all other symbols being treated as X. The alphabet is specified with --alphabet dna (default) or --alphabet aa.

Scoring is done with either:

A simple weight matrix, where a fixed score is used for matching residues (--matching or -m) and mismatching residues (--mismatch or -x). For DNA alignment, --ignore-n can be passed to make all comparisons involving an N have a score of 0.
A protein substitution matrix can be specified using --matrix <MAT>, where <MAT> may be:
- BLOSUM: blosum30, blosum35, blosum40, blosum45, blosum50, blosum55, blosum60, blosum62, blosum65, blosum70, blosum75, blosum80, blosum85, blosum90, blosum95, blosum100
- PAM: pam30, pam40, pam70, pam120, pam200, pam250

All other combinations are invalid and will result in an error. Some more examples:

Example	Scoring Scheme
Nothing	A simple DNA weight matrix with default weights (`--matching 2 --mismatch -5`)
`-m 1 -x 1 -o 2 -e 1`	A simple DNA weight matrix with some weights and penalties specified
`--alphabet aa`	The default protein substitution matrix `BLOSUM_62`
`--matrix pam250`	A named protein weight matrix from Zoe, in this case `PAM250`
`--alphabet aa -m 5 -x 2`	A simple protein weight matrix with user-specified match score and mismatch penalty

If only one of --matching or --mismatch are specified, then the other weights are set to the default.

An affine gap penalty is used with --gap-open and --gap-extend, which default to 10 and 1 respectively. Note that the gap open penalty must be at least the gap extend penalty. We use the affine gap formula, $W(k) = u(k-1) + v$, where $k$ is the gap length, $u$ is the gap extend penalty, $v$ is the gap open penalty, and $W(k)$ is the total penalty for the gap. In order to use $W(k) = uk + v$, simply pass gap open as the gap open plus gap extension.

Parameter	Default	Kind	Description
`--matching` (`-m`)	2	integer $0\leq x\leq 127$	The score for matching residues in a simple weight matrix
`--mismatch` (`-x`)	5	integer $0\leq x\leq 127$	The penalty for mismatching residues in a simple weight matrix
`--gap-open` (`-o`)	10	integer $0\leq x\leq 127$	The penalty for opening a gap
`--gap-extend` (`-e`)	1	integer $0\leq x\leq 127$	The penalty for extending a gap
`--ignore-n`	False		Use a score of 0 when `N` is being compared for the DNA alphabet
`--matrix`	`blosum62`	[`blosum30`, `blosum35`, ..., `pam30`, `pam40`, ...]	The protein substitution matrix to use for scoring
`--alphabet`	`dna`	[`dna`, `aa`]	The alphabet to interpret the inputs as

Performance Options

When trying to optimize the runtime or memory usage of aligner, there are two configuration options that can be considered. Using --method 1pass (default) or --method 3pass, the underlying method for computing the alignments can be altered. The one-pass algorithm builds the full traceback matrix, as is traditional with Striped Smith Waterman. For aligning against a long reference sequence, or aligning against two long full-length sequences, this can use a significant amount of memory (and cache misses can impact runtime). To improve this, the three-pass algorithm uses three passes to avoid building the full traceback matrix:

A single pass to determine the ending position of the alignment within the query and reference
A second truncated pass to determine the starting position of the alignment within the query and reference
A final restricted alignment pass using banded Smith Waterman, automatically increasing the band width until an optimal alignment is found

A second configuration option which can alter performance is passing either --profile-from-query (default) or --profile-from-ref. These mutually exclusive flags toggle whether the striped profile is built from the query or the reference, respectively. For the one-pass algorithm, --profile-from-ref can often offer faster performance by reusing profiles between multiple alignments. For the three-pass algorithm, --profile-from-query is often fastest (since the reverse profile in the second pass is cheaper for smaller sequences), but under certain scoring schemes, --profile-from-ref may perform better.

When in doubt, benchmarking on data reflective of the use-case can be informative.

Parameter	Default	Kind	Description
`--method`	`1pass`	`1pass` or `3pass`	The alignment method to use
`--profile-from-ref`	False		Builds the striped profiles from the reference sequence(s)
`--profile-from-query`	True		Builds the striped profiles from the query sequences

Other Options

For DNA alignments, passing --rev-comp or -r will also check the alignment against the reverse complement and return whichever is better. SAM uses the 5th bit (16 or 0b0001 0000) to indicate that the best alignment was against the reverse complement of the reference. To exclude unmapped (zero-scoring) alignments from the output, use --exclude-unmapped.

By default, aligner will align all references against all queries and output each result. To instead only output the best match for each query, use --best-match.

Parameter	Description
`--rev-comp` (`-r`)	Also checks alignments against the reverse complement, outputting whichever has the highest score
`--exclude-unmapped`	Excludes unmapped alignments from the output file
`--best-match`	The best matching alignment for each query is output, instead of all of them
`--single-thread`	Sets the number of `rayon` threads to 1. See here for more details
`--header`	Includes a SAM header in the output, containing the `HD` and `SQ` lines

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IRMA-core Aligner README

Motivation and Goals

Multithreading

Inputs and Outputs

Scoring

Performance Options

Other Options

FilesExpand file tree

ALIGNER.md

Latest commit

History

ALIGNER.md

File metadata and controls

IRMA-core Aligner README

Motivation and Goals

Multithreading

Inputs and Outputs

Scoring

Performance Options

Other Options