Skip to content

HISAT-3N: silently ignores read pairs when --repeat is set #465

Description

@ankebusch

Hello,

I am using HISAT-3N to map paired-end eTAMseq data containing A-to-G transitions. For some samples, HISAT-3N stops writing output after one specific read pair. All reads, which are in the input FASTQ files before this pair, are correctly processed, while all reads after the pair (and the pair itself) are skipped. HISAT-3N does not crash but continues running for several hours, however it does not write any additional output neither to the output SAM file nor the unmapped reads files. When it finishes running, it does not write any error or warning message, but a normal log file, in which is reports having analyzed a total number of X reads, where X is the number of reads before the problematic pair in the input FASTQ. In my case, it processed only approx. 60 million read pairs out of 90 million.

I'm using HISAT-3N version 2.2.1-3n-0.0.3 with the following settings:

hisat-3n --base-change A,G --repeat --repeat-limit 1000 --bowtie2-dp 0 --no-unal --seed 123 --directional-mapping -x INDEXFOLDER/NAMEBASE --rna-strandness FR -1 SAMPLE.R1.fastq.gz -2 SAMPLE.R2.fastq.gz --un-conc-gz SAMPLE.not_properly_mapped

The mapping index was created with the following settings:

hisat-3n-build --base-change A,G --repeat-index 100-300 -p 16 --exon genome.exon --ss genome.ss GRCh38.primary_assembly.genome.fa INDEXFOLDER/NAMEBASE

genome.ss and genome.exon were created as follows:

extract_splice_sites.py gencode.v45.primary_assembly.annotation.gtf > genome.ss
extract_exons.py gencode.v45.primary_assembly.annotation.gtf > genome.exon

Both GRCh38.primary_assembly.genome.fa and gencode.v45.primary_assembly.annotation.gtf were downloaded from the GENCODE website.

The issue is most severe when HISAT-3N is run using just one thread. Depending on where the problematic pair is located in the FASTQ input files, most reads might be lost. When using multiple threads, less reads are skipped. It seems that only reads are skipped, which are processes on the same thread as the problematic pair.

I reduced the issue down to a minimal example of just three read pairs. The second pair is the problematic one:

SAMPLE.R1.fastq:

@pair1
TTGCTTTTTGGGGTGTCTGTCGTTGGGGGGGGTTGGTCTTGCCCTTGGGGTCGGGGGCGCGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@pair2
GGTGGTGTGCTTCTGTTTCTTGGTTTTTCGGCTTTGGGCGTTGTTGGTTTGTTTCTGTGTTGTGTTTGGGTTTGGCTGTTTTGCTCGTTTTCCCTCTCTCTCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-I9II-IIIIIIIIIIIIII
@pair3
CCGGTCGGGCGGGAGGTGGCGGGCGCTGGGAGGGCCTGGGGCTTGTTGCGGGGCTCCTCTTGCCGCGCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

SAMPLE.R2.fastq:

@pair1
ACGCGCCCCCGACCCCAAGGGCAAGACCAACCCCCCCCAACGACAACCACCCCAAAAAGCAA
+
9IIIIIIIIIIIIIIIIIIIIII-IIIIIII9IIIIII-IIIIII--9-99IIIII-9I9I9
@pair2
AGTCAATGGGAGCAGCACACCAGCATGGCACACGCACACACACGCAACCAACCCGCACACCGCGCACACGCACCCCAAAACCCAAAGCACAACAACAACAAAA
+
IIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@pair3
GGCGCGGCAAGAGGAGCCCCGCAACAAGCCCCAGGCCCTCCCAGCGCCCGCCACCTCCCGCCCGACCGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

When these are used for mapping with the parameters mentioned above, HISAT-3N writes the following log:

1 reads; of these:
  1 (100.00%) were paired; of these:
    0 (0.00%) aligned concordantly 0 times
    1 (100.00%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    0 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    ----
    0 pairs aligned 0 times concordantly or discordantly; of these:
      0 mates make up the pairs; of these:
        0 (0.00%) aligned 0 times
        0 (0.00%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
100.00% overall alignment rate

The second and third pair are ignored and not mentioned in the output. The output files of unmapped reads are empty, while the output SAM file includes only pair1.

When I repeat the mapping without setting the --repeat parameter, all three pairs are processed and HISAT-3N writes the following log:

3 reads; of these:
  3 (100.00%) were paired; of these:
    0 (0.00%) aligned concordantly 0 times
    3 (100.00%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    0 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    ----
    0 pairs aligned 0 times concordantly or discordantly; of these:
      0 mates make up the pairs; of these:
        0 (0.00%) aligned 0 times
        0 (0.00%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
100.00% overall alignment rate

I'm not sure what feature of pair2 causes the behavior of HISAT-3N when --repeat is set. Can anyone explain this?

I noticed that when I modify the last base pair of R2 of pair2 to a C or T, HISAT-3N processes all pairs correctly also when --repeat is set. However, if I modify it to a G, the issue remains. Maybe this helps to narrow down the problem?

Thanks and best,
Anke.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions