IRMA-core's aligner provides an efficient and exact local sequence alignment routine using Striped Smith-Waterman. This is particularly useful for aligning Next Generation Sequencing (NGS) reads and performing full alignment between short genomes.
aligner uses rayon to perform multithreading to enable higher throughput. To specify the number of threads, set the RAYON_NUM_THREADS environmental variable as described here. Or, to limit to a single worker thread, pass --single-thread to aligner.
For benchmarking or scenarios where a single thread is always used, the dev_no_rayon feature can be enabled in IRMA-core to remove the use of channels. This feature may be removed in future releases, and so should not be relied upon except for testing.
The first positional argument is a FASTA file containing the reference sequence(s), and the second argument is a FASTA or FASTQ file containing the queries. Either file may be gzip-compressed, in which case it is assumed to end in .gz. The output is in the SAM alignment format. The score is reported with the AS tag for mapped reads, and the MAPQ field is not used (it is set to 255). The output file is specified with --out or --output flags. If not specified, output is directed to STDOUT. If the provided file ends in .gz, the output will be zipped.
As an example, consider the following inputs:
-
reference.fasta>reference GACTCAGTAAGACACGGTCTAGCTGACTGT -
query.fastq@query1 AGTAACACGGTC + IIIIIIIIIIII @query2 GTCTAGGTGACTA + IIIIIIIIIIIII
Aligner could be run with a command such as:
irma-core aligner \
reference.fasta query.fastq \
--out alignments.sam \
-m 1 -x 1 -o 2 -e 1This produces:
@query1 0 reference 6 255 5M2D7M * 0 0 AGTAACACGGTC IIIIIIIIIIII AS:i:9
@query2 0 reference 17 255 12M1S * 0 0 GTCTAGGTGACTA IIIIIIIIIIIII AS:i:10
IRMA-core aligner supports both DNA and amino acid alignments. For DNA, alignment is case-insensitive over the alphabet ACGTN, with all other symbols being treated as N. For amino acid alignment, the case-insensitive alphabet is ACDEFGHIKLMNPQRSTVWY*BJZX, with all other symbols being treated as X. The alphabet is specified with --alphabet dna (default) or --alphabet aa.
Scoring is done with either:
- A simple weight matrix, where a fixed score is used for matching residues (
--matchingor-m) and mismatching residues (--mismatchor-x). For DNA alignment,--ignore-ncan be passed to make all comparisons involving anNhave a score of 0. - A protein substitution matrix can be specified using
--matrix <MAT>, where<MAT>may be:- BLOSUM:
blosum30,blosum35,blosum40,blosum45,blosum50,blosum55,blosum60,blosum62,blosum65,blosum70,blosum75,blosum80,blosum85,blosum90,blosum95,blosum100 - PAM:
pam30,pam40,pam70,pam120,pam200,pam250
- BLOSUM:
All other combinations are invalid and will result in an error. Some more examples:
| Example | Scoring Scheme |
|---|---|
| Nothing | A simple DNA weight matrix with default weights (--matching 2 --mismatch -5) |
-m 1 -x 1 -o 2 -e 1 |
A simple DNA weight matrix with some weights and penalties specified |
--alphabet aa |
The default protein substitution matrix BLOSUM_62 |
--matrix pam250 |
A named protein weight matrix from Zoe, in this case PAM250 |
--alphabet aa -m 5 -x 2 |
A simple protein weight matrix with user-specified match score and mismatch penalty |
If only one of --matching or --mismatch are specified, then the other weights are set to the default.
An affine gap penalty is used with --gap-open and --gap-extend, which default to 10 and 1 respectively. Note that the gap open penalty must be at least the gap extend penalty. We use the affine gap formula,
| Parameter | Default | Kind | Description |
|---|---|---|---|
--matching (-m) |
2 | integer |
The score for matching residues in a simple weight matrix |
--mismatch (-x) |
5 | integer |
The penalty for mismatching residues in a simple weight matrix |
--gap-open (-o) |
10 | integer |
The penalty for opening a gap |
--gap-extend (-e) |
1 | integer |
The penalty for extending a gap |
--ignore-n |
False | Use a score of 0 when N is being compared for the DNA alphabet |
|
--matrix |
blosum62 |
[blosum30, blosum35, ..., pam30, pam40, ...] |
The protein substitution matrix to use for scoring |
--alphabet |
dna |
[dna, aa] |
The alphabet to interpret the inputs as |
When trying to optimize the runtime or memory usage of aligner, there are two configuration options that can be considered. Using --method 1pass (default) or --method 3pass, the underlying method for computing the alignments can be altered. The one-pass algorithm builds the full traceback matrix, as is traditional with Striped Smith Waterman. For aligning against a long reference sequence, or aligning against two long full-length sequences, this can use a significant amount of memory (and cache misses can impact runtime). To improve this, the three-pass algorithm uses three passes to avoid building the full traceback matrix:
- A single pass to determine the ending position of the alignment within the query and reference
- A second truncated pass to determine the starting position of the alignment within the query and reference
- A final restricted alignment pass using banded Smith Waterman, automatically increasing the band width until an optimal alignment is found
A second configuration option which can alter performance is passing either --profile-from-query (default) or --profile-from-ref. These mutually exclusive flags toggle whether the striped profile is built from the query or the reference, respectively. For the one-pass algorithm, --profile-from-ref can often offer faster performance by reusing profiles between multiple alignments. For the three-pass algorithm, --profile-from-query is often fastest (since the reverse profile in the second pass is cheaper for smaller sequences), but under certain scoring schemes, --profile-from-ref may perform better.
When in doubt, benchmarking on data reflective of the use-case can be informative.
| Parameter | Default | Kind | Description |
|---|---|---|---|
--method |
1pass |
1pass or 3pass |
The alignment method to use |
--profile-from-ref |
False | Builds the striped profiles from the reference sequence(s) | |
--profile-from-query |
True | Builds the striped profiles from the query sequences |
For DNA alignments, passing --rev-comp or -r will also check the alignment against the reverse complement and return whichever is better. SAM uses the 5th bit (16 or 0b0001 0000) to indicate that the best alignment was against the reverse complement of the reference. To exclude unmapped (zero-scoring) alignments from the output, use --exclude-unmapped.
By default, aligner will align all references against all queries and output each result. To instead only output the best match for each query, use --best-match.
| Parameter | Description |
|---|---|
--rev-comp (-r) |
Also checks alignments against the reverse complement, outputting whichever has the highest score |
--exclude-unmapped |
Excludes unmapped alignments from the output file |
--best-match |
The best matching alignment for each query is output, instead of all of them |
--single-thread |
Sets the number of rayon threads to 1. See here for more details |
--header |
Includes a SAM header in the output, containing the HD and SQ lines |