A read-mapping and refinement workflow for highly homologous genomic regions. The pipeline aligns reads (paired-end or single-end), maps alignments back to a multiple-sequence alignment (MSA), refines to uniquely supported mappings, and produces gene-specific FASTQ/BAM outputs.
To create a conda environment with all required dependencies, run:
conda env create -f paradism_env.yml
conda activate paradism_env
Run without arguments to launch the guided CLI:
python paradism.py
Or specify a custom input directory:
python paradism.py --input-dir /path/to/data
You'll be prompted to:
- Select sequencing mode (Paired-End or Single-End)
- Pick FASTQ file(s) accordingly (R1/R2 or a single FASTQ)
- Choose reference FASTA
- Optionally use an existing SAM (skips alignment)
- Pick an aligner (bowtie2, bwa-mem2, minimap2)
- Configure threads and minimap2 profile (if using minimap2)
- Set alignment score threshold
- Configure iterations for iterative refinement
- Set variant calling options (when iterations > 1):
  - Minimum alternate allele count
  - Quality filters (QUAL, DP, AF thresholds)
Note: By default, interactive mode scans the current working directory for input files. Use --input-dir to specify a different directory to scan.
python paradism.py --read1 <forward_reads.fq> \
[--read2 <reverse_reads.fq>] \
--reference <reference.fa> \
[--aligner ALIGNER] \
[--threads N] \
[--minimap2-profile PROFILE] \
[--sam existing.sam] \
[--output-dir OUTPUT] \
[--prefix PREFIX] \
[--iterations N] \
[--threshold THRESHOLD] \
[--min-alternate-count N] \
[--add-quality-filters] \
[--qual-threshold N] \
[--dp-threshold N] \
[--af-threshold F]
Required:
- --read1: R1 FASTQ file (or single-end reads)
- --reference: Reference FASTA
Optional - General:
- --read2: R2 FASTQ file (for paired-end mode)
- --aligner: bowtie2 (default), bwa-mem2, or minimap2
- --threads: Number of alignment threads (default: 4)
- --minimap2-profile (required with minimap2): one of
  - short (short-read Illumina)
  - pacbio-hifi
  - pacbio-clr
  - ont-q20
  - ont-standard
- --sam: Existing alignment (skips alignment stage)
- --output-dir: Destination directory (default: ./output)
- --prefix: Prefix for output files (default: derived from output directory name)
- --threshold: Alignment score threshold (aligner-specific). For example: G,40,40 (bowtie2) or 240 (bwa-mem2/minimap2).
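The bowtie2 example G,40,40 uses Bowtie2's function notation (type, constant, coefficient), where G selects the natural logarithm, so the minimum score for a read of length L is 40 + 40 * ln(L). The sketch below evaluates such a string. It assumes ParaDISM hands the value to Bowtie2 unchanged, which this README does not confirm; the helper name min_score is hypothetical.

```python
import math

# Bowtie2 function types: C = constant, L = linear, S = square root, G = natural log.
_TRANSFORMS = {
    "C": lambda x: 0.0,
    "L": lambda x: float(x),
    "S": math.sqrt,
    "G": math.log,
}


def min_score(spec: str, read_length: int) -> float:
    """Evaluate a Bowtie2-style 'TYPE,CONST,COEF' function at a read length.

    Hypothetical helper for illustration; not part of ParaDISM.
    """
    kind, const, coef = (part.strip() for part in spec.split(","))
    return float(const) + float(coef) * _TRANSFORMS[kind.upper()](read_length)


if __name__ == "__main__":
    # Minimum score implied by G,40,40 for a 150 bp read.
    print(round(min_score("G,40,40", 150), 1))
```

Evaluating the function for your read length is a quick way to pick a comparable integer threshold when switching to bwa-mem2 or minimap2.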
Optional - Iterative Refinement:
- --iterations: Total ParaDISM runs (default: 1). Use >1 to enable iterative refinement.
- --min-alternate-count: Minimum alternate allele count for FreeBayes variant calling (default: 5)
- --add-quality-filters: Enable quality filtering during variant calling
- --qual-threshold: Minimum QUAL score for quality filtering (default: 20)
- --dp-threshold: Minimum depth (DP) for quality filtering (default: 10)
- --af-threshold: Minimum allele frequency (AF) for quality filtering (default: 0.05)
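To illustrate how the three quality thresholds could combine, the sketch below keeps a variant only when it meets all of them. This is an assumption about the filter logic, not ParaDISM's actual implementation; the Variant record and passes_filters function are hypothetical, and the defaults mirror the values listed above.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal variant record (not a ParaDISM type)."""
    qual: float  # VCF QUAL column
    dp: int      # read depth (DP)
    af: float    # alternate allele frequency (AF)


def passes_filters(v: Variant,
                   qual_threshold: float = 20.0,
                   dp_threshold: int = 10,
                   af_threshold: float = 0.05) -> bool:
    """Assumed rule: a variant is kept only if every minimum is met."""
    return (v.qual >= qual_threshold
            and v.dp >= dp_threshold
            and v.af >= af_threshold)


if __name__ == "__main__":
    calls = [Variant(35.0, 42, 0.48), Variant(12.0, 42, 0.48), Variant(35.0, 6, 0.48)]
    kept = [v for v in calls if passes_filters(v)]
    print(len(kept), "of", len(calls), "variants kept")
```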
Paired-end vs single-end:
- For paired-end runs, provide both --read1 and --read2.
- If --read2 is omitted, the pipeline runs in single-end mode.
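For scripted or batch runs, the argument rules above can be encoded in a small wrapper that builds the command line. This is a sketch under stated assumptions: build_command is a hypothetical helper, it uses only flags documented in this README, and it checks just two rules (--read2 is optional, and --minimap2-profile is required with minimap2).

```python
from typing import List, Optional

# Profile names as listed in this README.
MINIMAP2_PROFILES = {"short", "pacbio-hifi", "pacbio-clr", "ont-q20", "ont-standard"}


def build_command(read1: str,
                  reference: str,
                  read2: Optional[str] = None,
                  aligner: str = "bowtie2",
                  minimap2_profile: Optional[str] = None) -> List[str]:
    """Build an argument list for paradism.py (hypothetical wrapper)."""
    cmd = ["python", "paradism.py", "--read1", read1, "--reference", reference]
    if read2:
        # Supplying --read2 selects paired-end mode; omitting it means single-end.
        cmd += ["--read2", read2]
    cmd += ["--aligner", aligner]
    if aligner == "minimap2":
        if minimap2_profile not in MINIMAP2_PROFILES:
            raise ValueError("--minimap2-profile is required with minimap2")
        cmd += ["--minimap2-profile", minimap2_profile]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("hifi.fq", "ref.fa",
                                 aligner="minimap2",
                                 minimap2_profile="pacbio-hifi")))
```

The returned list can be passed directly to subprocess.run without shell quoting.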
# Paired-end short reads, default aligner (Bowtie2)
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa
# Single-end long reads with minimap2 (PacBio HiFi)
python paradism.py --read1 hifi.fq --reference ref.fa \
--aligner minimap2 --minimap2-profile pacbio-hifi
# Using custom output directory (prefix auto-derived)
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa \
--output-dir sample_001
# Two pipeline runs with iterative refinement
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa \
--iterations 2
# Iterative refinement with custom variant calling parameters
python paradism.py --read1 r1.fq --read2 r2.fq --reference ref.fa \
--iterations 5 \
--min-alternate-count 5 \
--add-quality-filters \
--qual-threshold 20 \
--dp-threshold 10 \
--af-threshold 0.05
When skipping alignment via --sam, ensure the SAM:
- Contains a valid header
- Has at least one mapped read
- Includes MD tags (samtools calmd can add them)
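A quick pre-flight check of these three requirements can be done on the SAM text itself. The sketch below is a hypothetical helper, not ParaDISM's own validation: it only confirms that header lines are present (it does not validate them), counts reads whose FLAG lacks the unmapped bit (0x4), and counts how many of those carry an MD:Z: tag.

```python
def check_sam(path: str) -> dict:
    """Summarise header presence, mapped reads, and MD tags in a SAM file."""
    has_header = False
    mapped = 0
    mapped_with_md = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("@"):
                has_header = True  # header line such as @HD or @SQ
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 11:
                continue  # not a complete alignment record
            if int(fields[1]) & 0x4:
                continue  # FLAG bit 0x4: read is unmapped
            mapped += 1
            # Optional tags start at the 12th column.
            if any(tag.startswith("MD:Z:") for tag in fields[11:]):
                mapped_with_md += 1
    return {
        "has_header": has_header,
        "mapped_reads": mapped,
        "mapped_with_md": mapped_with_md,
    }
```

A file is likely usable when has_header is true, mapped_reads is at least 1, and mapped_with_md equals mapped_reads.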
By default, output files are prefixed with the output directory name. For example, if --output-dir SAMPLE_NAME, all final outputs will be prefixed with SAMPLE_NAME_.
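The default can be pictured as taking the last component of the output path. This is a minimal sketch, assuming the prefix is simply the directory's base name; default_prefix is a hypothetical helper and ParaDISM may derive the name differently.

```python
from pathlib import Path


def default_prefix(output_dir: str) -> str:
    """Return the directory's base name, used here as the assumed default prefix."""
    # resolve() normalises inputs such as "./output" or a trailing slash.
    return Path(output_dir).resolve().name


if __name__ == "__main__":
    print(default_prefix("runs/SAMPLE_NAME/") + "_gene1.fq")
```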
output_dir/
├── prefix_pipeline_YYYYMMDD_HHMMSS.log # Run log
├── iteration_1/ # Iteration outputs (when iterations > 1)
├── iteration_2/ # Additional iterations
└── final_outputs/
    ├── prefix_fastq/ # Gene-specific FASTQs
    │   ├── prefix_gene1.fq
    │   ├── prefix_gene2.fq
    │   └── ...
    └── prefix_bam/ # Gene-specific BAMs
        ├── prefix_gene1.sorted.bam
        ├── prefix_gene1.sorted.bam.bai
        └── ...
If unresolved reads remain (NONE assignments), ParaDISM also writes:
- final_outputs/<prefix>_none/<prefix>_NONE_r1.fq
- final_outputs/<prefix>_none/<prefix>_NONE_r2.fq (paired-end only)
- final_outputs/<prefix>_none/<prefix>_NONE_all_refs.sorted.bam
- final_outputs/<prefix>_none/<prefix>_NONE_all_refs.sorted.bam.bai
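When post-processing many samples, it can help to compute the expected file locations from the layout described above. The sketch below is a hypothetical helper that follows the naming shown in this README; the gene names are placeholders supplied by the caller.

```python
from pathlib import Path
from typing import Iterable, List


def gene_outputs(output_dir: str, prefix: str, genes: Iterable[str]) -> List[Path]:
    """Per-gene FASTQ, BAM, and BAM index paths under final_outputs/."""
    final = Path(output_dir) / "final_outputs"
    paths = []
    for gene in genes:
        paths.append(final / f"{prefix}_fastq" / f"{prefix}_{gene}.fq")
        bam = final / f"{prefix}_bam" / f"{prefix}_{gene}.sorted.bam"
        paths.append(bam)
        paths.append(bam.with_name(bam.name + ".bai"))
    return paths


def none_outputs(output_dir: str, prefix: str, paired: bool = True) -> List[Path]:
    """Paths written when unresolved (NONE) reads remain."""
    base = Path(output_dir) / "final_outputs" / f"{prefix}_none"
    paths = [base / f"{prefix}_NONE_r1.fq"]
    if paired:
        paths.append(base / f"{prefix}_NONE_r2.fq")  # paired-end only
    bam = base / f"{prefix}_NONE_all_refs.sorted.bam"
    paths += [bam, bam.with_name(bam.name + ".bai")]
    return paths


if __name__ == "__main__":
    for path in gene_outputs("sample_001", "sample_001", ["gene1", "gene2"]):
        print(path.as_posix(), "found" if path.exists() else "missing")
```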
Note: Intermediate files (MSA, SAM, indices) are created during processing but cleaned up automatically. Only final outputs are retained.
Set --iterations to run ParaDISM multiple times. The first run produces mappings and per-gene BAMs. Each refinement iteration:
- Calls variants from previously mapped (non-NONE) reads using FreeBayes
- Optionally applies quality filters (QUAL, DP, AF thresholds)
- Updates the reference with called variants
- Re-aligns only reads that were labeled NONE against the updated reference
- Merges results so prior successful mappings remain unchanged
The loop stops early if:
- No reads remain to rescue
- No variants to apply
- No reads were reassigned in the latest iteration
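The control flow of the refinement loop and its three early-exit conditions can be sketched as follows. This illustrates the behaviour described above and is not ParaDISM's code: refine, call_variants, and realign are hypothetical names, and the two callbacks stand in for the FreeBayes and re-alignment steps.

```python
from typing import Callable, Iterable, List, Set, Tuple


def refine(iterations: int,
           none_reads: Iterable[str],
           call_variants: Callable[[], List[str]],
           realign: Callable[[Set[str], List[str]], Set[str]]) -> Tuple[Set[str], str]:
    """Run refinement passes 2..iterations; return unresolved reads and stop reason."""
    unresolved = set(none_reads)
    for _ in range(2, iterations + 1):  # run 1 is the initial mapping
        if not unresolved:
            return unresolved, "no reads remain to rescue"
        variants = call_variants()  # called from previously mapped reads
        if not variants:
            return unresolved, "no variants to apply"
        rescued = realign(unresolved, variants)  # only NONE reads are re-aligned
        if not rescued:
            return unresolved, "no reads were reassigned"
        unresolved -= rescued  # earlier assignments are left untouched
    return unresolved, "iteration limit reached"


if __name__ == "__main__":
    # Toy stand-ins: one variant is always found, and only read "a" is rescuable.
    left, reason = refine(3, {"a", "b"},
                          call_variants=lambda: ["chr1:100A>G"],
                          realign=lambda reads, variants: reads & {"a"})
    print(sorted(left), reason)
```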
See individual README files in each directory:
- giab_benchmark/ - GIAB HG002 benchmarking
- gnaq_analysis/ - GNAQ samples analysis
- hts_analysis/ - HTS clinical samples
- simulation/ - Simulated reads analysis