-
Notifications
You must be signed in to change notification settings - Fork 2
03. Create profiles with popdel profile
Overview
Sampling options
Filtering options
Examples
The first step of PopDel is the creation of insert size profiles for each individual sample. This profile stores positions and insert sizes of read pairs that align confidently to the reference genome. In addition, the profile contains meta information, including a distribution of insert sizes across the sample and an index, which allows for jumping to specific genomic positions in the profile. Insert size is defined as the distance between the leftmost alignment position of the forward read to the rightmost alignment position of the reverse read in the pair, expanded by potentially clipped bases.
The sample can be given as a (single-sample) BAM-file or CRAM-file. The file may contain multiple read groups; PopDel writes separate profile data per read group to the output file. The BAM file may also be compressed using gzip. The BAI-index of the BAM-File must be located in the same directory and have the same filename as the BAM-file, followed by '.bai'. The default output filename is <INPUT-FILE>.profile. The output name and path can be changed using the option -o. In the simplest case, the profiling can be called as:
popdel profile sample.bam
The profiling can also be limited to one or more regions by adding them in the samtools style to the end of the program call:
popdel profile sample.bam chr21 chr1:2000000-5000000 chr4:30000000
This creates a single profile file for the complete chr21, chr1 from 2000000-5000000 bp and chr4 from 30000000 bp till the end of the chromosome.
Note: It is however recommended to create the profiles on the whole genomes, as you can always limit the calling to the desired regions and this will eliminate the risk that you later have to recreate a profile if you are interested in another region. Only the profiles are needed for the calling. Each profile should have a size about 1-2% of the original file size. They are compressed binary files and can be viewed in human readable format using the popdel view command (see 06. Inspect or convert profiles with popdel view).
PopDel assumes that GRCh38 was used as a reference genome to create the BAM/CRAM-file. The default sampling regions (see next section) and the default chromosome names are set accordingly. If you used another build of the human genome, you can specify this via the option -r or --reference. Supported builds (some of which are synonymous) are 'T2T, 'GRCh38', 'hg38', 'GRCh37' and 'hg19'. While the sampling regions for GRCh37 and hg19 are the same, PopDel assumes the chromosomes of the hg19 build to be labeled "chr1", "chr2", ... and the chromosomes of the GRCh37 build to be labeled "1", "2", ...
For other chromosome names or organisms other than human, please refer to the following section to define your own sampling regions.
CRAM-files further require the presence of the reference genome used for creating the CRAM file. If the header of the CRAM file does not correctly point to the reference genome, the path to the reference can by given via the -R or --reference-file option.
PopDel only considers a small portion of the genome for estimating the insert size distribution. Per default it uses one well-behaved region of each chromosome (see Sampling Intervals below). If the first region does not contain sufficient reads, PopDel continues sampling from the next interval until the number of reads is sufficient.
The following intervals have been selected with the target in mind that they don't contain an excess amount of irregular sequence, such as centromeric or telomeric sequence or regions of low mappability. They have been selected for the reference genome GRCh38. For the analysis of alignments to a reference genome other than T2T, GRCh38/hg38, GRCh37/hg19 (for selecting the reference build, please refer to the previous section) it is necessary to use user-defined sampling intervals (see option -i).
chr1:35000000-36000000
chr2:174000000-175000000
chr3:36500000-37500000
chr4:88000000-89000000
chr5:38000000-39000000
chr6:38000000-39000000
chr7:37000000-38000000
chr8:19000000-20000000
chr9:19000000-20000000
chr10:19000000-20000000
chr11:19000000-20000000
chr12:19000000-20000000
chr13:30000000-31000000
chr14:26000000-27000000
chr15:26000000-27000000
chr16:25000000-26000000
chr17:31000000-32000000
chr18:31000000-32000000
chr19:31000000-32000000
chr20:33000000-34000000
chr21:21000000-22000000
chr22:34000000-35000000
For other reference genomes or chromosome names it might be necessary to use different sampling intervals. A file containing user-defined intervals can be given to PopDel by using the option -i. The intervals in the file have to follow above samtools-style notation. There may only be one interval per line. For reliable results, the regions should not include regions containing abnormal sequence, like telomeric or centromeric regions. It is important, that the contig names used in the file are exactly the same as in the BAM-file (or a subset thereof). If PopDel can not sample from the default or user-defined regions, please check if the chromosomes of the BAM-file are named chrNUM:START-END (without leading 0's). If they don't follow this exact naming pattern, the user has to define the intervals as described above.
The amount of required read-pairs for parameter estimation defaults to 50,000. This value can be modified using the option -n. Please note that, if the profiling is restricted to certain regions of the genome (see previous section), the sampling is only performed on the chromosomes of those regions. This behavior can be overwritten by specifying user-defined sampling intervals.
| Flag | Extended flag | Meaning | Default |
|---|---|---|---|
-i |
--intervals |
File with genomic intervals for parameter estimation instead of default intervals. One closed interval per line, formatted as 'CHROM:START-END', 1-based coordinates. | |
-n |
--min-read-num |
Minimum number of read pairs for parameter estimation (per read group) | 50000 |
The filters used by PopDel profile are preconfigured for most use cases, but can be fine-tuned by the user:
| Flag | Extended flag | Meaning | Default |
|---|---|---|---|
-d |
--max-deletion-size |
Maximum size of deletions | 10000 |
-f |
--flags-set |
Only use reads with all bits of NUM set in the bam flag | 33 |
-F |
--flags-unset |
Only use reads with all bits of NUM unset in the bam flag | 3868 |
-mq |
--min-mapping-qual |
Only use reads with a mapping quality of at least NUM | 1 |
-u |
--min-unclipped |
Only use reads of which at least NUM bases are not clipped | 50 |
-s |
--min-align-score |
Only use reads with an alignment score relative to read length above NUM | 0.8 |
It is possible to create profiles of BAM/CRAM-files that only contain a part of the genome, e.g. only chr21. However, there is one thing to keep in mind to avoid errors: When simply called with the standard command line and default sampling regions, PopDel will try to sample reads from the whole genome, which will result in a very low and wrong coverage estimate since the BAM-file only contains reads for chromosome 21. To avoid this, there are two possibilities. The first is to specify the region the BAM/CRAM-file contains, in this case 'chr21'. The other option is to use user defined sampling intervals that lie in adequate regions of the chromosomes in the BAM/CRAM-file. If you are not interested in the coverage, you can also just ignore it, as PopDel itself won't use the value later on.
In case your BAM/CRAM-file contains multiple read groups of the individual and prepared with the same library, PopDel can be prompted to treat all reads of a file as if they originated from the same read group, thereby effectively merging all read groups. This is especially useful, if the coverage of the individual read groups is very low. Please be sure that your read groups share the same distribution of insert-sizes, especially the mean insert-size and standard deviation of the insert-size distribution. Also the read length must be the same. PopDel will NOT check, if the merged read groups follow the same distributions.
| Flag | Extended flag | Meaning |
|---|---|---|
-mrg |
--merge-read-groups |
Merge all read groups of the sample |
All read groups will be merged into the first read group of the sample.
Run PopDel profile on sample.bam, creating the output sample.bam.profile.
popdel profile sample.bam
Run PopDel profile on sample.bam, creating the output otherfolder/othername.profile. Use the sampling intervals given in somewhereelse/samplingIntervals.txt
popdel profile -i somewhereelse/samplingIntervals.txt -o otherfolder/othername.profile sample.bam
Run PopDel profile on the compressed sample.bam.gz, creating the output sample.bam.gz.profile. Filter reads with mapping quality below 30.
popdel profile -mq 30 sample.bam.gz
Next page → PopDel Call