ParaDISM: Paralog Disambiguating Mapper

A read-mapping and refinement workflow for highly homologous genomic regions. The pipeline aligns paired-end or single-end reads, maps the alignments back to a multiple-sequence alignment (MSA), refines them to uniquely supported mappings, and produces gene-specific FASTQ/BAM outputs.

Prerequisites

Environment Setup

To create a conda environment with all required dependencies, run:

conda env create -f paradism_env.yml
conda activate paradism_env

Usage

Interactive Mode

Run without arguments to launch the guided CLI:

python paradism.py

Or specify a custom input directory:

python paradism.py --input-dir /path/to/data

You'll be prompted to:

Select sequencing mode (Paired-End or Single-End)
Pick FASTQ file(s) accordingly (R1/R2 or a single FASTQ)
Choose reference FASTA
Optionally use an existing SAM (skips alignment)
Pick an aligner (bowtie2, bwa-mem2, minimap2)
Configure threads and minimap2 profile (if using minimap2)
Set alignment score threshold
Configure iterations for iterative refinement
Set variant calling options (when iterations > 1):
- Minimum alternate allele count
- Quality filters (QUAL, DP, AF thresholds)

Note: By default, interactive mode scans the current working directory for input files. Use --input-dir to specify a different directory to scan.

Argument-Driven Mode

python paradism.py --read1 <forward_reads.fq> \
                 [--read2 <reverse_reads.fq>] \
                 --reference <reference.fa> \
                 [--aligner ALIGNER] \
                 [--threads N] \
                 [--minimap2-profile PROFILE] \
                 [--sam existing.sam] \
                 [--output-dir OUTPUT] \
                 [--prefix PREFIX] \
                 [--iterations N] \
                 [--threshold THRESHOLD] \
                 [--min-alternate-count N] \
                 [--add-quality-filters] \
                 [--qual-threshold N] \
                 [--dp-threshold N] \
                 [--af-threshold F]

Required:

--read1: R1 FASTQ file (or single-end reads)
--reference: Reference FASTA

Optional - General:

--read2: R2 FASTQ file (for paired-end mode)
--aligner: bowtie2 (default), bwa-mem2, or minimap2
--threads: Number of alignment threads (default: 4)
--minimap2-profile (required with minimap2): one of
- short (short-read Illumina)
- pacbio-hifi
- pacbio-clr
- ont-q20
- ont-standard
--sam: Existing alignment (skips alignment stage)
--output-dir: Destination directory (default: ./output)
--prefix: Prefix for output files (default: derived from output directory name)
--threshold: Alignment score threshold (aligner-specific). For example: G,40,40 (bowtie2) or 240 (bwa-mem2/minimap2).

Optional - Iterative Refinement:

--iterations: Total ParaDISM runs (default: 1). Use >1 to enable iterative refinement.
--min-alternate-count: Minimum alternate allele count for FreeBayes variant calling (default: 5)
--add-quality-filters: Enable quality filtering during variant calling
--qual-threshold: Minimum QUAL score for quality filtering (default: 20)
--dp-threshold: Minimum depth (DP) for quality filtering (default: 10)
--af-threshold: Minimum allele frequency (AF) for quality filtering (default: 0.05)
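
The three quality thresholds above can be read as a conjunction of minimums. The following is a minimal sketch, not ParaDISM's internal code: it assumes each variant has already been reduced to a dictionary with numeric QUAL, DP and AF values, and that a variant passes when it meets every minimum inclusively. The function name `passes_quality_filters` and the record layout are hypothetical.

```python
def passes_quality_filters(record, qual_threshold=20.0, dp_threshold=10,
                           af_threshold=0.05):
    """Return True if a variant record meets all three minimum thresholds.

    `record` is a hypothetical dict with numeric 'QUAL', 'DP' and 'AF' keys.
    The defaults mirror the documented --qual-threshold, --dp-threshold and
    --af-threshold defaults; the inclusive comparison is an assumption.
    """
    return (
        record["QUAL"] >= qual_threshold
        and record["DP"] >= dp_threshold
        and record["AF"] >= af_threshold
    )


# Toy records for illustration only (not real calls).
variants = [
    {"id": "v1", "QUAL": 35.0, "DP": 42, "AF": 0.48},  # meets every minimum
    {"id": "v2", "QUAL": 12.0, "DP": 42, "AF": 0.48},  # QUAL below minimum
    {"id": "v3", "QUAL": 35.0, "DP": 4, "AF": 0.48},   # depth below minimum
    {"id": "v4", "QUAL": 35.0, "DP": 42, "AF": 0.01},  # AF below minimum
]
kept = [v["id"] for v in variants if passes_quality_filters(v)]
print(kept)
```

Only the first toy record survives here, because each of the others fails exactly one threshold.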

Paired-end vs single-end:

For paired-end runs, provide both --read1 and --read2.
If --read2 is omitted, the pipeline runs in single-end mode.
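
When ParaDISM is driven from a script, the paired-end/single-end rule and the minimap2 profile requirement can be encoded once in a small wrapper. This is a sketch under assumptions: `build_paradism_command` is a hypothetical helper, it covers only a few of the documented flags, and it assumes `paradism.py` sits in the current working directory.

```python
import shlex


def build_paradism_command(read1, reference, read2=None, aligner="bowtie2",
                           minimap2_profile=None, iterations=1):
    """Assemble an argv list for paradism.py from the documented flags."""
    # The README states that --minimap2-profile is required with minimap2.
    if aligner == "minimap2" and minimap2_profile is None:
        raise ValueError("--minimap2-profile is required with minimap2")
    cmd = ["python", "paradism.py", "--read1", read1]
    # Omitting --read2 selects single-end mode.
    if read2 is not None:
        cmd += ["--read2", read2]
    cmd += ["--reference", reference, "--aligner", aligner]
    if minimap2_profile is not None:
        cmd += ["--minimap2-profile", minimap2_profile]
    # Only pass --iterations when iterative refinement is requested.
    if iterations > 1:
        cmd += ["--iterations", str(iterations)]
    return cmd


paired = build_paradism_command("r1.fq", "ref.fa", read2="r2.fq")
single = build_paradism_command("hifi.fq", "ref.fa", aligner="minimap2",
                                minimap2_profile="pacbio-hifi")
print(shlex.join(paired))
print(shlex.join(single))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with unusual file names.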

Examples

# Paired-end short reads, default aligner (Bowtie2)
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa

# Single-end long reads with minimap2 (PacBio HiFi)
python paradism.py --read1 hifi.fq --reference ref.fa \
  --aligner minimap2 --minimap2-profile pacbio-hifi

# Using custom output directory (prefix auto-derived)
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa \
  --output-dir sample_001

# Two pipeline runs with iterative refinement
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa \
  --iterations 2

# Iterative refinement with variant calling parameters set explicitly (values shown are the defaults)
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa \
  --iterations 5 \
  --min-alternate-count 5 \
  --add-quality-filters \
  --qual-threshold 20 \
  --dp-threshold 10 \
  --af-threshold 0.05

SAM Requirements

When skipping alignment via --sam, ensure the SAM:

Contains a valid header
Has at least one mapped read
Includes MD tags (samtools calmd can add them)
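
The three requirements above can be checked before launching a run. The sketch below is a plain-Python pre-flight check, not part of ParaDISM; `check_sam_text` is a hypothetical name, and requiring an MD tag on every mapped read is an assumption (the requirement only says the SAM must include MD tags).

```python
def check_sam_text(sam_text):
    """Check SAM text for a header, a mapped read, and MD tags.

    Returns a dict of booleans, one per requirement.
    """
    has_header = False
    mapped = 0
    mapped_with_md = 0
    for line in sam_text.splitlines():
        if not line:
            continue
        if line.startswith("@"):  # header lines start with '@'
            has_header = True
            continue
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record (11 mandatory fields)
        flag = int(fields[1])
        if flag & 0x4:  # FLAG bit 0x4 marks an unmapped segment
            continue
        mapped += 1
        # Optional tags follow the 11 mandatory columns.
        if any(tag.startswith("MD:Z:") for tag in fields[11:]):
            mapped_with_md += 1
    return {
        "has_header": has_header,
        "has_mapped_read": mapped > 0,
        "has_md_tags": mapped > 0 and mapped_with_md == mapped,
    }


# Tiny synthetic SAM: one mapped read with an MD tag, one unmapped read.
example = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:gene1\tLN:1000",
    "read1\t0\tgene1\t101\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0\tMD:Z:10",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
])
print(check_sam_text(example))
```

For a real file, the text can be read with `pathlib.Path(path).read_text()`; very large SAM files would be better processed line by line. If `has_md_tags` is false, `samtools calmd` can add the tags, as noted above.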

Output Layout

By default, output files are prefixed with the output directory name. For example, with --output-dir SAMPLE_NAME, all final outputs are prefixed with SAMPLE_NAME_. Use --prefix to override this.

output_dir/
├── prefix_pipeline_YYYYMMDD_HHMMSS.log      # Run log
├── iteration_1/                             # Iteration outputs (when iterations > 1)
├── iteration_2/                             # Additional iterations
└── final_outputs/
    ├── prefix_fastq/                        # Gene-specific FASTQs
    │   ├── prefix_gene1.fq
    │   ├── prefix_gene2.fq
    │   └── ...
    └── prefix_bam/                          # Gene-specific BAMs
        ├── prefix_gene1.sorted.bam
        ├── prefix_gene1.sorted.bam.bai
        └── ...

If unresolved reads remain (NONE assignments), ParaDISM also writes:

final_outputs/<prefix>_none/<prefix>_NONE_r1.fq
final_outputs/<prefix>_none/<prefix>_NONE_r2.fq (paired-end only)
final_outputs/<prefix>_none/<prefix>_NONE_all_refs.sorted.bam
final_outputs/<prefix>_none/<prefix>_NONE_all_refs.sorted.bam.bai

Note: Intermediate files (MSA, SAM, indices) are created during processing but cleaned up automatically. Only final outputs are retained.
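
Downstream scripts often need the final output paths without listing the directory. The helpers below reproduce the documented layout as a sketch; `expected_outputs` and `none_outputs` are hypothetical names, the gene names must be supplied by the caller, and deriving the prefix from the last path component of the output directory is an assumption about how "derived from output directory name" is implemented.

```python
from pathlib import Path


def expected_outputs(output_dir, genes, prefix=None):
    """List per-gene FASTQ/BAM/BAI paths following the documented layout."""
    out = Path(output_dir)
    prefix = prefix or out.name  # default prefix: output directory name
    final = out / "final_outputs"
    paths = []
    for gene in genes:
        paths.append(final / f"{prefix}_fastq" / f"{prefix}_{gene}.fq")
        paths.append(final / f"{prefix}_bam" / f"{prefix}_{gene}.sorted.bam")
        paths.append(final / f"{prefix}_bam" / f"{prefix}_{gene}.sorted.bam.bai")
    return paths


def none_outputs(output_dir, prefix=None, paired_end=True):
    """List the files written when unresolved (NONE) reads remain."""
    out = Path(output_dir)
    prefix = prefix or out.name
    none_dir = out / "final_outputs" / f"{prefix}_none"
    names = [f"{prefix}_NONE_r1.fq"]
    if paired_end:  # the R2 file is documented as paired-end only
        names.append(f"{prefix}_NONE_r2.fq")
    names += [
        f"{prefix}_NONE_all_refs.sorted.bam",
        f"{prefix}_NONE_all_refs.sorted.bam.bai",
    ]
    return [none_dir / name for name in names]


for path in expected_outputs("sample_001", ["gene1"]):
    print(path.as_posix())
for path in none_outputs("sample_001", paired_end=False):
    print(path.as_posix())
```

Checking `path.exists()` on each returned entry is a simple way to confirm that a run completed and to detect whether any NONE reads were left over.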

Iterative Refinement

Set --iterations to a value greater than 1 to run ParaDISM more than once. The first run produces mappings and per-gene BAMs. Each subsequent refinement iteration:

Calls variants from previously mapped (non-NONE) reads using FreeBayes
Optionally applies quality filters (QUAL, DP, AF thresholds)
Updates the reference with called variants
Re-aligns only reads that were labeled NONE against the updated reference
Merges results so prior successful mappings remain unchanged

The loop stops early if:

No reads remain to rescue
No variants to apply
No reads were reassigned in the latest iteration
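
The control flow described above, including the three early-stop conditions, can be sketched as a small loop. This is an illustration under assumptions, not ParaDISM's implementation: `run_refinement` is a hypothetical name, `call_variants` and `realign` are hypothetical callables standing in for FreeBayes variant calling and re-alignment against the updated reference, and counting the initial mapping as iteration 1 follows the description of --iterations as the total number of runs.

```python
def run_refinement(total_iterations, none_reads, call_variants, realign):
    """Drive the refinement loop with the documented early-stop rules.

    `call_variants()` returns a list of variants; `realign(reads, variants)`
    returns the set of reads it managed to assign. Only reads currently
    labeled NONE are re-aligned, so earlier mappings are never touched.
    Returns (iterations_run, remaining_none_reads, stop_reason).
    """
    remaining = set(none_reads)
    iterations_run = 1  # the first run is the initial mapping
    while iterations_run < total_iterations:
        if not remaining:
            return iterations_run, remaining, "no reads to rescue"
        variants = call_variants()
        if not variants:
            return iterations_run, remaining, "no variants to apply"
        rescued = realign(remaining, variants) & remaining
        iterations_run += 1
        if not rescued:
            return iterations_run, remaining, "no reads reassigned"
        remaining -= rescued
    return iterations_run, remaining, "iteration limit reached"


# Toy stand-ins: one placeholder variant, and one read rescued per pass.
def toy_call_variants():
    return ["toy_variant"]


def toy_realign(reads, variants):
    return set(sorted(reads)[:1])


result = run_refinement(5, {"readA", "readB"}, toy_call_variants, toy_realign)
print(result)
```

With the toy stand-ins, both reads are rescued after two refinement passes, so the loop stops before reaching the requested five runs.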

Analysis and Experiments

See the README file in each directory:

giab_benchmark/ - GIAB HG002 benchmarking
gnaq_analysis/ - GNAQ samples analysis
hts_analysis/ - HTS clinical samples
simulation/ - Simulated reads analysis
