A genomic interval annotation tool built on Ensembl GTF files, designed for read/peak/interval datasets (e.g. CLASH, eCLIP, ChIP peaks) where transcript ambiguity is common.
This repository implements a 2-step workflow:
- Step 1 (evidence-only): compute all transcript overlaps and build a per-nt UNION region matrix across transcripts (no transcript-selection assumptions).
- Step 2 (assumption-bearing): choose a transcript using an explicit policy (e.g. CLASH-friendly) and summarize region composition (dominant + multi-region).
- Step 1 preserves evidence: what regions are supported by any transcript?
- Step 2 makes assumptions explicit: which transcript policy are you applying and why?
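
As a rough illustration of the per-nt UNION idea in Step 1, the sketch below collects every region label that any transcript assigns to a position. The input layout (`{transcript_id: {position: region}}`), the function name `union_regions`, and the transcript IDs are hypothetical; the tool's real matrix format may differ.

```python
from collections import defaultdict


def union_regions(per_transcript):
    """Collect, for each position, every region label assigned by any transcript.

    per_transcript maps a transcript ID to {position: region_label}.
    This layout is an assumption made for illustration only.
    """
    union = defaultdict(set)
    for labels in per_transcript.values():
        for pos, region in labels.items():
            union[pos].add(region)
    # Sort positions and labels so the result is deterministic.
    return {pos: sorted(regions) for pos, regions in sorted(union.items())}


# Two hypothetical transcripts that disagree on position 101.
example = {
    "TX_A": {100: "CDS", 101: "CDS"},
    "TX_B": {100: "CDS", 101: "UTR3"},
}
union = union_regions(example)
```

No transcript is chosen here: position 101 keeps both `CDS` and `UTR3`, which is the evidence Step 2 later narrows down with an explicit policy.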

Install with conda and pip:

    conda env create -f environment.yml
    conda activate genomic-region-annotator
    pip install -e .

Input must be a TSV with:

| column | description |
|---|---|
| chr | Chromosome (e.g. 1, chr1) |
| start | Start coordinate |
| end | End coordinate |
| strand | + / - (or . if unknown) |

Use `--coords` to declare the input coordinate system:

- `1-based` (default): 1-based inclusive (GTF/SAM-like)
- `bed`: 0-based half-open (BED)

Internally, coordinates are normalized to 1-based inclusive for overlap math.
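
The normalization amounts to shifting a BED start by one. A minimal sketch, assuming a helper named `to_one_based_inclusive` (hypothetical, not necessarily the function used in this package):

```python
def to_one_based_inclusive(start, end, coords="1-based"):
    """Return (start, end) as 1-based inclusive coordinates.

    coords mirrors the --coords values described above.
    """
    if coords == "bed":
        # BED is 0-based half-open: shift the start, keep the end.
        return start + 1, end
    if coords == "1-based":
        return start, end
    raise ValueError(f"unknown coords value: {coords!r}")


# The BED interval [99, 200) covers the same bases as 100..200 inclusive.
converted = to_one_based_inclusive(99, 200, coords="bed")
```

The interval length is preserved: `200 - 99` in BED equals `200 - 100 + 1` in 1-based inclusive coordinates.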
Annotate an input TSV and produce the Step 1 outputs:

    genomic-region-annotator annotate \
      --input data/raw/file_with_intervals.tsv \
      --release <version_of_ensembl> \
      --output data/processed/output_name.tsv

If your output stem is `output`, you get:

- `data/processed/step1/output_annotated_input_with_ids.tsv`
- `data/processed/step1/output_annotated_matrix.tsv`
- `data/processed/step1/output_annotated_transcripts.tsv`
- `data/processed/step1/output_annotated_step1_stats.tsv`

By default, Step 1 keeps only transcripts where the read is 100% contained inside the transcript span.
You can relax this with `--min-overlap-nt`. For example, `--min-overlap-nt 30` keeps transcripts whose span overlaps the read by ≥ 30 nt.
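
The two filters can be sketched as follows, assuming 1-based inclusive coordinates on both sides. The function names are hypothetical, and the real implementation may handle strand or spliced structure differently.

```python
def overlap_nt(read_start, read_end, tx_start, tx_end):
    """Number of shared bases between two 1-based inclusive intervals."""
    return max(0, min(read_end, tx_end) - max(read_start, tx_start) + 1)


def keep_transcript(read_start, read_end, tx_start, tx_end, min_overlap_nt=None):
    """Default: require full containment; otherwise require a minimum overlap."""
    ov = overlap_nt(read_start, read_end, tx_start, tx_end)
    if min_overlap_nt is None:
        # 100% contained: every base of the read lies inside the transcript span.
        return ov == read_end - read_start + 1
    return ov >= min_overlap_nt


# A 50 nt read (100..149) that hangs off the start of a transcript spanning 120..500.
strict = keep_transcript(100, 149, 120, 500)
relaxed = keep_transcript(100, 149, 120, 500, min_overlap_nt=30)
```

Here the read shares 30 nt with the transcript span, so it is dropped under the default rule and kept once the threshold is relaxed to 30.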
Run transcript selection + site summary:

    genomic-region-annotator summarize-sites \
      --transcripts data/processed/step1/output_annotated_transcripts.tsv \
      --matrix data/processed/step1/output_annotated_matrix.tsv \
      --policy clash_utr3_first \
      --dominance coverage \
      --report

Outputs:

- `data/processed/step2/output_site_summary.tsv`
- `data/processed/step2/output_step2_stats.tsv`

Step 2 outputs both:

- `dominant_region_selected`
- `dominant_region_union`

Dominance modes:

- `--dominance coverage` (recommended): label = region with most bp overlap (ties broken by priority)
- `--dominance priority`: label = first region present by priority order (UTR3 > CDS > ...)
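
The difference between the two modes can be sketched as below. The function `dominant_region`, its arguments, and the bp counts are hypothetical; the priority order is passed in explicitly, and only `UTR3` and `CDS` are used because the rest of the order is not spelled out here.

```python
def dominant_region(bp_by_region, priority, mode="coverage"):
    """Pick one region label from per-region overlap counts (in bp).

    Regions missing from `priority` are ignored in this sketch.
    """
    # Regions with any overlap, listed in priority order.
    present = [r for r in priority if bp_by_region.get(r, 0) > 0]
    if not present:
        return None
    if mode == "priority":
        return present[0]
    if mode == "coverage":
        # max() returns the first maximal item, so ties fall back to priority order.
        return max(present, key=lambda r: bp_by_region[r])
    raise ValueError(f"unknown dominance mode: {mode!r}")


priority = ("UTR3", "CDS")  # assumed prefix of the priority order
site = {"CDS": 30, "UTR3": 10}

by_coverage = dominant_region(site, priority, mode="coverage")
by_priority = dominant_region(site, priority, mode="priority")
tie = dominant_region({"CDS": 20, "UTR3": 20}, priority, mode="coverage")
```

For this site, coverage mode labels it `CDS` (30 bp beats 10 bp), while priority mode labels it `UTR3` because any UTR3 overlap wins. With equal coverage, the tie goes to `UTR3`.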

Multi-region columns:

- `regions_present_selected`
- `regions_present_union`

Example: `CDS|UTR3` means the site spans both CDS and 3′UTR.

`--policy clash_utr3_first` selects one transcript per read by applying these criteria in order:
- maximize UTR3 overlap bp
- then CDS overlap bp
- then UTR5 overlap bp
- then EXON_OTHER bp (exon excluding CDS/UTRs)
- then INTRON bp
- then EXON bp
- then TX overlap bp
- tie-breakers: contained_100pct, then transcript_id
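
The ordering above behaves like a lexicographic sort key. A minimal sketch, assuming each candidate is a dict with a `bp` mapping, a `contained_100pct` flag, and a `transcript_id` (a hypothetical record layout). It also assumes that contained transcripts are preferred and that the final tie-break takes the smallest `transcript_id`; neither direction is stated explicitly above.

```python
ORDER = ("UTR3", "CDS", "UTR5", "EXON_OTHER", "INTRON", "EXON", "TX")


def select_transcript(candidates):
    """Return the candidate that wins the clash_utr3_first-style comparison."""

    def key(c):
        # Negate overlaps so that min() prefers larger values, region by region.
        overlaps = tuple(-c["bp"].get(region, 0) for region in ORDER)
        # Assumed tie-breakers: contained first (False sorts before True),
        # then ascending transcript_id.
        return overlaps + (not c["contained_100pct"], c["transcript_id"])

    return min(candidates, key=key)


# Hypothetical candidates for one read.
candidates = [
    {"transcript_id": "TX_1", "contained_100pct": True, "bp": {"UTR3": 0, "CDS": 50}},
    {"transcript_id": "TX_2", "contained_100pct": True, "bp": {"UTR3": 20, "CDS": 5}},
    {"transcript_id": "TX_3", "contained_100pct": True, "bp": {"UTR3": 20, "CDS": 10}},
]
chosen = select_transcript(candidates)
```

`TX_3` is chosen: it ties with `TX_2` on UTR3 overlap and wins on CDS overlap, while `TX_1` loses at the first criterion despite having the largest CDS overlap.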