This repo contains a genome-wide tandem repeat (TR) catalog with 4.9 million loci.
It stratifies TRs into two groups:
- isolated TRs suitable for traditional repeat copy number analysis using short-read or long-read data
- TRs embedded within wider polymorphic regions (i.e. variation clusters) that are best studied through sequence-level analysis
 
Release v1.0 is available for download, and is described in:
Defining a tandem repeat catalog and variation clusters for genome-wide analyses and population databases
Ben Weisburd, Egor Dolzhenko, Mark F. Bennett, Matt C. Danzi, Adam English, Laurel Hiatt, Hope Tanudisastro, Nehir Edibe Kurtas, Helyaneh Ziaei Jam, Harrison Brand, Fritz J. Sedlazeck, Melissa Gymrek, Harriet Dashnow, Michael A. Eberle, Heidi L. Rehm
bioRxiv 2024.10.04.615514; doi: https://doi.org/10.1101/2024.10.04.615514
Tandem repeats (TRs) are regions of the genome that consist of consecutive copies of some motif sequence. For example, CAGCAGCAG is a tandem repeat of the CAG motif. Many types of genomic studies require annotations of tandem repeats in the reference genome, called repeat catalogs, which specify the genomic start and end coordinates of each tandem repeat region, as well as the one or more motifs that repeat there.
For example, if a hypothetical region at the beginning of chrX consisted of the following nucleotide sequence:
ATCAGTAGA ATATATATAT CAGACAGCAGCAG TGAGTGCGTAC...
it could be represented in a repeat catalog as two entries:
chrX:10-19 (AT)*
chrX:20-32 (CAG)*
indicating that a repeat of the AT motif occurs between positions 10 and 19 (inclusive), and of the CAG motif between positions 20 and 32.
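The coordinate convention in this example (1-based, inclusive on both ends) can be illustrated with a minimal Python sketch. The helper names `parse_entry` and `extract` are hypothetical and exist only for this illustration. The entry syntax `chrX:10-19 (AT)*` is the notation used in the example above, not a file format defined by this repo.

```python
import re

# The hypothetical chrX sequence from the example above, with spaces removed.
SEQUENCE = "ATCAGTAGA" "ATATATATAT" "CAGACAGCAGCAG" "TGAGTGCGTAC"

# Matches entries written like "chrX:10-19 (AT)*" (illustrative notation only).
ENTRY_PATTERN = re.compile(r"^(\w+):(\d+)-(\d+) \(([ACGTN]+)\)\*$")


def parse_entry(entry):
    """Split an entry into (chrom, start, end, motif) with integer coordinates."""
    match = ENTRY_PATTERN.match(entry)
    if match is None:
        raise ValueError(f"Unexpected entry format: {entry!r}")
    chrom, start, end, motif = match.groups()
    return chrom, int(start), int(end), motif


def extract(sequence, start, end):
    """Return the bases between 1-based inclusive positions start and end."""
    return sequence[start - 1:end]


for entry in ("chrX:10-19 (AT)*", "chrX:20-32 (CAG)*"):
    chrom, start, end, motif = parse_entry(entry)
    locus_size = end - start + 1  # both ends are inclusive
    print(entry, extract(SEQUENCE, start, end), f"{locus_size} bp")
```

With inclusive coordinates, the first entry spans 10 bp and the second spans 13 bp.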
A genome-wide catalog would contain such entries for all repeat regions of interest found anywhere in the genome.
The genome-wide TR catalog was created by combining four source catalogs in the following order:
- Known disease-associated loci
- Illumina catalog of 174k polymorphic repeats
- All perfect repeats in hg38 that span ≥ 9 bp and consist of at least 3 repeats of any motif between 1 and 1000 bp in size. These were identified using ColabRepeatFinder.
- Catalog of polymorphic loci computed by applying the methods described in [Weisburd et al. 2023] to 78 haplotype-resolved T2T assemblies from the HPRC and HGSVC
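
The selection criteria for the third source catalog (perfect repeats spanning at least 9 bp with at least 3 copies of the motif) can be illustrated with a naive Python sketch. This is not the algorithm used by ColabRepeatFinder. The function name `find_perfect_repeats` and its parameters are hypothetical, and as a simplifying assumption it only counts whole copies of the motif, ignoring any trailing partial copy.

```python
def find_perfect_repeats(sequence, min_span=9, min_repeats=3, max_motif_size=1000):
    """Return (start, end, motif) tuples for perfect repeats in the sequence.

    Coordinates are 0-based, half-open. Only whole motif copies are counted
    (a simplification), and the scan is quadratic, so this is only practical
    for short sequences.
    """
    results = []
    n = len(sequence)
    for motif_size in range(1, min(max_motif_size, n // min_repeats) + 1):
        i = 0
        while i + motif_size <= n:
            motif = sequence[i:i + motif_size]
            # Skip motifs that are themselves repeats of a shorter motif
            # (e.g. ATAT), since they are reported under the shorter motif.
            if (motif + motif).find(motif, 1) != motif_size:
                i += 1
                continue
            # Extend to the right while whole copies of the motif continue.
            j = i + motif_size
            while sequence[j:j + motif_size] == motif:
                j += motif_size
            repeats = (j - i) // motif_size
            if repeats >= min_repeats and j - i >= min_span:
                results.append((i, j, motif))
                i = j
            else:
                i += 1
    return results


example = "GGC" + "A" * 9 + "TC" + "CAG" * 3 + "TT"
for start, end, motif in find_perfect_repeats(example):
    print(start, end, motif)
```

Note that the 9 bp minimum span is the binding constraint for short motifs: a dinucleotide repeat with only 3 copies spans 6 bp and is therefore not reported.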
 
The table below shows how many loci from each source catalog were added to the combined catalog, out of the total number of loci in that source catalog:
           83 out of        83 (100.0%) TRs from source 1: known disease-associated loci as well as 20 adjacent or historical candidate loci 
      174,244 out of   174,286 (100.0%) TRs from source 2: Illumina catalog of 174k polymorphic loci
    4,391,197 out of 4,558,281 ( 96.3%) TRs from source 3: perfect repeats in hg38
      297,517 out of 1,937,805 ( 15.4%) TRs from source 4: polymorphic loci in 78 haplotype-resolved T2T assemblies
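As a quick consistency check, the counts in the table above can be summed and the percentages recomputed. The snippet below is a small arithmetic sketch that uses only the numbers listed above. The dictionary keys are shortened labels chosen for this illustration.

```python
# (loci added to the combined catalog, total loci in the source catalog),
# copied from the table above.
sources = {
    "known disease-associated loci": (83, 83),
    "Illumina catalog": (174_244, 174_286),
    "perfect repeats in hg38": (4_391_197, 4_558_281),
    "polymorphic loci in T2T assemblies": (297_517, 1_937_805),
}

total_added = sum(added for added, _ in sources.values())

for name, (added, source_size) in sources.items():
    percent = 100 * added / source_size
    print(f"{added:>10,} out of {source_size:>9,} ({percent:5.1f}%) from {name}")

print(f"{total_added:,} loci added in total")
```

The four added counts sum to 4,863,041, which matches the total number of loci reported in the catalog stats.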
The merging procedure took all loci from the first catalog, then all non-duplicate loci from the second catalog, then from the third, and so on, in the order listed above. A locus was considered a duplicate if it overlapped a previously added locus by at least 66% and the two loci had the same motif after cyclic shift and/or reverse complement (i.e. CAG, AGC, GCA, CTG, TGC, and GCT were considered to be the same motif).
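The duplicate rule described above can be sketched in Python as follows. This is an illustration, not the code used to build the catalog. The function names are hypothetical, loci are represented as `(chrom, start, end, motif)` tuples with 0-based half-open coordinates, and two points the text leaves open are filled in by assumption: the 66% overlap is measured as intersection over union, and each locus is compared only against loci added from earlier catalogs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Return the alphabetically first motif among all cyclic shifts of the
    motif and of its reverse complement, so equivalent motifs compare equal."""
    reverse_complement = motif.translate(COMPLEMENT)[::-1]
    candidates = []
    for m in (motif, reverse_complement):
        candidates.extend(m[i:] + m[:i] for i in range(len(m)))
    return min(candidates)


def overlap_fraction(locus_a, locus_b):
    """Intersection over union of two loci (an assumed definition of overlap)."""
    chrom_a, start_a, end_a, _ = locus_a
    chrom_b, start_b, end_b, _ = locus_b
    if chrom_a != chrom_b:
        return 0.0
    intersection = min(end_a, end_b) - max(start_a, start_b)
    if intersection <= 0:
        return 0.0
    union = max(end_a, end_b) - min(start_a, start_b)
    return intersection / union


def merge_catalogs(catalogs, min_overlap=0.66):
    """Merge catalogs in order, skipping loci that duplicate earlier ones."""
    merged = []
    for catalog in catalogs:
        previously_added = list(merged)  # loci from earlier catalogs only
        for locus in catalog:
            is_duplicate = any(
                overlap_fraction(locus, other) >= min_overlap
                and canonical_motif(locus[3]) == canonical_motif(other[3])
                for other in previously_added
            )
            if not is_duplicate:
                merged.append(locus)
    return merged


first = [("chr1", 100, 130, "CAG")]
second = [("chr1", 102, 129, "GCT"), ("chr1", 100, 130, "AT"), ("chr1", 200, 220, "CAG")]
print(merge_catalogs([first, second]))
```

In this example, the GCT locus from the second catalog is dropped because it overlaps the CAG locus by 90% and GCT is equivalent to CAG, while the AT locus and the non-overlapping CAG locus are kept. The linear scan over previously added loci is for clarity only. A catalog with millions of loci would need an interval index.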
The following catalog stats for v1.0 were computed using str_analysis/compute_catalog_stats.py:
Stats for repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz:
    4,863,041 total loci
   65,678,112 base pairs spanned by all loci (2.127% of the genome)
    3,210,115 out of  4,863,041 ( 66.0%) repeat interval size is an integer multiple of the motif size (aka. trimmed)
    1,567,337 out of  4,863,041 ( 32.2%) repeat intervals are homopolymers
       18,340 out of  4,863,041 (  0.4%) repeat intervals overlap each other by at least two motif lengths
           11 out of  4,863,041 (  0.0%) repeat intervals have non-ACGT motifs
Examples of overlapping repeats: chr1:82008141-82008152, chr3:78937990-78938032, chr4:1046750-1046794, chr5:52437153-52437201, chr6:34683425-34683453, chr4:107646310-107646327, chr18:52413438-52413450, chr6:150149299-150149323, chr7:40295582-40295597, chr9:35561915-35561931
Ranges:
   Motif size range:  1-833bp
   Locus size range:  1-2523bp
   Num repeats range: 1-300x repeats
   Max locus size =   2,523bp               @ chrX:71520430-71522953 (CCAGCACTTTGGGAGGCCGAGGCAGGCTGATCACTAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCCGTCTCTACTAAAAATACAAAAATTACCTGGGTGTGGGGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGCTGGGGAGGCAGGAGAATTGCCTGAACCTGAGAGGCAGAGGCTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGCGACAGAGTGAGACTCAGTCTCAAAACAAAAAAAAAAAAAGATTTTAGTAACTTTTATCCTGTTTTAATAATACTGACTCAGAAACTATAATGTGTACTTTATAATTTACTTCCTAGATGACACTTGATTTTCTTCAAGAGCAAGATAGCTGCCCTGTGCAGTTGGTCTCCTTGAAAACTATTTTAGTTCTATCATAATTTCCTGTGATAAATATTTTGACCTTCTAAAATTTCAGAATATTGCACCAAGTAGAAAGAAAATAGGTTTTTTCTCTTTTCTTCTTCTTCCTTTTTTTTTTCTGAGAAAGAGGGAATGAGAACTTTAGTGTTCTTTCAATAGCGTTCTTATTTGTAGAAATGCATAATAGTGTCCTAGTAAGGCTTGACAATAACTCTGGTCTTCATCATATTTTGTGATAAAACTTTTGATTTAAAAAAACCTCTGATCTATTTATCATGGCAAATGGATAGAGCTTTCCTGCCTGTTTTCTTTCTTTTCTTTTTTCTTTCTTTCCTTTTTTTTCCTTTGAGCTTAGATTTTTAGAAGCACATATTTAAAAATCAGGTATAAGACTGGATGCAGTGGCTCACGCCTGTAATC)
   Min reference repeat purity   =  0.43    @ chr3:112804380-112804514 (TCT)
   Min overall mappability       =  0.00    @ chrY:56887882-56887891 (TGA)
   Base-level   purity   median: 1.000,  mean: 0.999
          chrX:    244,191 out of  4,863,041 (  5.0%) repeat intervals
          chrY:     39,257 out of  4,863,041 (  0.8%) repeat intervals
          chrM:         14 out of  4,863,041 (  0.0%) repeat intervals
   alt contigs:          0 out of  4,863,041 (  0.0%) repeat intervals
Motif size distribution:
          1bp:  1,567,337 out of  4,863,041 ( 32.2%) repeat intervals
          2bp:    978,972 out of  4,863,041 ( 20.1%) repeat intervals
          3bp:  1,432,117 out of  4,863,041 ( 29.4%) repeat intervals
          4bp:    590,787 out of  4,863,041 ( 12.1%) repeat intervals
          5bp:    177,422 out of  4,863,041 (  3.6%) repeat intervals
          6bp:     56,731 out of  4,863,041 (  1.2%) repeat intervals
       7-24bp:     43,996 out of  4,863,041 (  0.9%) repeat intervals
        25+bp:     15,679 out of  4,863,041 (  0.3%) repeat intervals
Num repeats in reference:
           1x:     10,443 out of  4,863,041 (  0.2%) repeat intervals
           2x:     38,922 out of  4,863,041 (  0.8%) repeat intervals
           3x:  1,799,189 out of  4,863,041 ( 37.0%) repeat intervals
           4x:    650,397 out of  4,863,041 ( 13.4%) repeat intervals
           5x:    356,525 out of  4,863,041 (  7.3%) repeat intervals
           6x:    151,893 out of  4,863,041 (  3.1%) repeat intervals
           7x:     85,760 out of  4,863,041 (  1.8%) repeat intervals
           8x:    257,475 out of  4,863,041 (  5.3%) repeat intervals
           9x:    352,993 out of  4,863,041 (  7.3%) repeat intervals
       10-15x:    759,188 out of  4,863,041 ( 15.6%) repeat intervals
       16-25x:    348,837 out of  4,863,041 (  7.2%) repeat intervals
       26-35x:     45,478 out of  4,863,041 (  0.9%) repeat intervals
       36-50x:      5,610 out of  4,863,041 (  0.1%) repeat intervals
         51+x:        331 out of  4,863,041 (  0.0%) repeat intervals
Reference repeat purity distribution:
          0.0:          0 out of  4,863,041 (  0.0%) repeat intervals
          0.1:          0 out of  4,863,041 (  0.0%) repeat intervals
          0.2:          0 out of  4,863,041 (  0.0%) repeat intervals
          0.3:          0 out of  4,863,041 (  0.0%) repeat intervals
          0.4:          3 out of  4,863,041 (  0.0%) repeat intervals
          0.5:         14 out of  4,863,041 (  0.0%) repeat intervals
          0.6:         44 out of  4,863,041 (  0.0%) repeat intervals
          0.7:      1,570 out of  4,863,041 (  0.0%) repeat intervals
          0.8:     12,336 out of  4,863,041 (  0.3%) repeat intervals
          0.9:     21,126 out of  4,863,041 (  0.4%) repeat intervals
          1.0:  4,827,948 out of  4,863,041 ( 99.3%) repeat intervals
Mappability distribution:
          0.0:    154,279 out of  4,863,041 (  3.2%) loci
          0.1:    214,471 out of  4,863,041 (  4.4%) loci
          0.2:    246,877 out of  4,863,041 (  5.1%) loci
          0.3:    236,856 out of  4,863,041 (  4.9%) loci
          0.4:    391,388 out of  4,863,041 (  8.0%) loci
          0.5:    561,639 out of  4,863,041 ( 11.5%) loci
          0.6:    352,273 out of  4,863,041 (  7.2%) loci
          0.7:    306,208 out of  4,863,041 (  6.3%) loci
          0.8:    337,715 out of  4,863,041 (  6.9%) loci
          0.9:    626,047 out of  4,863,041 ( 12.9%) loci
          1.0:  1,435,288 out of  4,863,041 ( 29.5%) loci
Locus sizes at each motif size:
     1bp motifs: locus size range:      1 bp to      90 bp  (median:   11 bp) based on  1,567,337 loci. Mean base purity: 1.00.  Mean mappability: 0.66
     2bp motifs: locus size range:      2 bp to     600 bp  (median:   10 bp) based on    978,972 loci. Mean base purity: 1.00.  Mean mappability: 0.76
     3bp motifs: locus size range:      3 bp to     632 bp  (median:    9 bp) based on  1,432,117 loci. Mean base purity: 1.00.  Mean mappability: 0.75
     4bp motifs: locus size range:      4 bp to     533 bp  (median:   14 bp) based on    590,787 loci. Mean base purity: 1.00.  Mean mappability: 0.68
     5bp motifs: locus size range:      5 bp to     400 bp  (median:   18 bp) based on    177,422 loci. Mean base purity: 1.00.  Mean mappability: 0.61
     6bp motifs: locus size range:      6 bp to   1,103 bp  (median:   20 bp) based on     56,731 loci. Mean base purity: 1.00.  Mean mappability: 0.62
     7bp motifs: locus size range:      7 bp to     151 bp  (median:   22 bp) based on     15,083 loci. Mean base purity: 1.00.  Mean mappability: 0.58
     8bp motifs: locus size range:      8 bp to     312 bp  (median:   25 bp) based on      7,107 loci. Mean base purity: 1.00.  Mean mappability: 0.57
     9bp motifs: locus size range:      9 bp to     153 bp  (median:   28 bp) based on      3,231 loci. Mean base purity: 1.00.  Mean mappability: 0.51
    10bp motifs: locus size range:     10 bp to     150 bp  (median:   31 bp) based on      2,713 loci. Mean base purity: 1.00.  Mean mappability: 0.50
Additional stats can be found in the [run log].
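Several of the summary counts above follow directly from their definitions, and the Python sketch below shows how they could be computed for a small list of loci. It is not the implementation in str_analysis/compute_catalog_stats.py. The function name `summarize` and the `(start, end, motif)` tuple layout are assumptions made for this illustration. Locus size is taken as end minus start, which matches the 2,523 bp reported above for chrX:71520430-71522953.

```python
def summarize(loci):
    """Compute a few catalog-level counts for (start, end, motif) tuples."""
    return {
        "total_loci": len(loci),
        # Total base pairs spanned, with locus size taken as end - start.
        "total_bp": sum(end - start for start, end, _ in loci),
        # "Trimmed": the interval size is an integer multiple of the motif size.
        "trimmed": sum(1 for start, end, motif in loci
                       if (end - start) % len(motif) == 0),
        # Homopolymers are loci with a 1 bp motif.
        "homopolymers": sum(1 for _, _, motif in loci if len(motif) == 1),
        # Motifs containing any character other than A, C, G, or T.
        "non_acgt_motifs": sum(1 for _, _, motif in loci
                               if set(motif) - set("ACGT")),
    }


example_loci = [(0, 9, "A"), (10, 20, "CAG"), (30, 42, "CAG"), (50, 56, "AN")]
for name, value in summarize(example_loci).items():
    print(f"{name}: {value}")
```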