This pipeline uses the pbmm2 tool from PacBio's SMRT Link to align reads to a reference genome, then uses Google's DeepVariant to call variants, GLnexus to do joint calling, and the PacBio pbsv tools to call structural variants (SVs).
You need a FASTA file that represents the reference genome. It is used by the tools above, and because they expect different index formats, it has to be indexed twice:
- Index this first using pbmm2, e.g. `pbmm2 index ref38.fasta`. This produces a file with an `mmi` suffix.
- Index using samtools, e.g. `samtools faidx ref38.fasta`. This produces a file with an `fai` suffix.
The index files should be in the same directory as the FASTA file. If necessary, use symbolic links.
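A quick pre-flight check can catch a missing or misplaced index before the pipeline starts. The sketch below is illustrative only: `check_ref_dir` is a hypothetical helper, not part of the pipeline. It assumes the samtools index is named `<fasta>.fai` (the samtools default) and treats any file ending in `.mmi` in the same directory as the pbmm2 index.

```shell
# Hypothetical helper (not part of the pipeline): verify that the FASTA file
# and both of its indexes are in the same directory.
check_ref_dir() {
    local ref_dir="$1"
    local ref="$2"
    local status=0
    local mmi

    if [ ! -e "$ref_dir/$ref" ]; then
        echo "missing FASTA: $ref_dir/$ref" >&2
        status=1
    fi
    # samtools faidx writes its index next to the FASTA as <fasta>.fai
    if [ ! -e "$ref_dir/$ref.fai" ]; then
        echo "missing samtools index: $ref_dir/$ref.fai" >&2
        status=1
    fi
    # Assumption: any *.mmi file in the directory counts as the pbmm2 index.
    # -e follows symbolic links, so a dangling link is reported as missing.
    for mmi in "$ref_dir"/*.mmi; do
        [ -e "$mmi" ] && return "$status"
    done
    echo "missing pbmm2 index (*.mmi) in $ref_dir" >&2
    return 1
}
```

For example, `check_ref_dir /path/to/ref_dir ref38.fasta || exit 1` stops a wrapper script early. If an index was built elsewhere, a symbolic link such as `ln -s /elsewhere/ref38.fasta.fai /path/to/ref_dir/` (placeholder paths) is enough to satisfy the check.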
The following parameters are specified on the command line or in the config file:
- `--ref_dir`: the full path to the reference genome directory.
- `--ref`: the name of the reference genome file. It should be in the `ref_dir` directory, and the index files should be there too.
- `--fqs`: a glob matching the bgzipped FastQ files, one file per sample. You don't need to provide this if your input is a BAM file (see `no_align` below).
- `--fast_qc`: directory where the QCed FastQ files will be placed. You don't need to provide this if your input is a BAM file (see `no_align` below).
- `--exclude_regions`: BED files of regions which should be ignored.
- `--output_vcf`: directory where per-sample VCFs should be placed.
- `--output_gvcf`: directory where gVCF files should go.
- `--joint_dir`: where the jointly called BCFs go.
- `--joint_name`: what the name of the file should be.
- `--no_align`: if this is set to true, the BAM files provided are used as input.
- `--tandem_example`: needed for `pbsv`.
- `--par_regions_bed`: PAR regions.
- `--bam`: where the BAM files should be placed. Usually an output directory, but see `no_align`.
- `--bamify_cpus`: how many cores creating the BAM file uses (default is 16).
- `--bamify_mem`: how much memory BAM creation needs (default 32GB).
- `--call_cpus`: how many cores calling requires (default is 16).
- `--call_mem`: how much memory calling requires (default 48GB).
- `--output`: the name of the jointly called VCF file output by `pbsv`.
- `--chrom_prefix`: the default is `chr`. BAM/VCF files can refer to a chromosome as `chr7` or just as `7`, and the various tools in the pipeline need to know which. For build 38, `chr7` is more common and this is the default, but if your data is different you need to set this.
- `--skip-sv`: default is true (don't call SVs).
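If you are unsure which naming convention your data uses, the header of an existing BAM file will tell you. The sketch below is an illustration, not part of the pipeline: `detect_chrom_prefix` is a hypothetical helper that reads a SAM header on standard input and prints `chr` when the `@SQ` sequence names carry that prefix, or an empty line otherwise.

```shell
# Hypothetical helper: read a SAM/BAM header on stdin and report whether
# the @SQ sequence names start with "chr".
detect_chrom_prefix() {
    # @SQ lines are tab-separated, e.g. "@SQ<TAB>SN:chr1<TAB>LN:248956422"
    if grep -q '^@SQ.*[[:space:]]SN:chr'; then
        echo "chr"
    else
        echo ""
    fi
}

# Typical use (requires samtools; sample.bam is a placeholder name):
#   samtools view -H sample.bam | detect_chrom_prefix
```

An empty result means the data uses bare names such as `7`, in which case `--chrom_prefix` needs to be changed from its default.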
Google's DeepVariant pipeline relies on TensorFlow, which in turn relies on computers with AVX instruction support. A few of the older nodes on the cluster do not support AVX instructions, so you need to make sure that SLURM gives you the nodes you need. Add the following option. (If you see jobs failing with a 252 error, you may have overlooked this.)
--constraint=avx2
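To confirm that a job actually landed on a suitable node, the CPU flags can be inspected from inside the job. This is a minimal sketch under assumptions: `has_cpu_flag` is a hypothetical helper, and it relies on the Linux x86 `/proc/cpuinfo` layout, where a `flags` line lists the supported instruction sets.

```shell
# Hypothetical helper: succeed if the CPU described in a cpuinfo file
# advertises the given instruction-set flag (Linux x86 /proc/cpuinfo format).
has_cpu_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Only the first "flags" line is needed; -w stops "avx" matching inside "avx2"
    grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}
```

Calling `has_cpu_flag avx2 || echo "no AVX2 on $(hostname)" >&2` at the start of a job script makes a misplaced job easier to recognise than a bare exit code.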
Used to take a trio and run bcftools through a parameter sweep to find Mendelian error rates.