Snakemake pipeline to do variant calling, that is, get from fastq files from the sequencing company to a vcf file.
-
If you haven't yet, install mamba or micromamba. (conda works as well, but I would recommend mamba/micromamba as they are a lot faster when installing)
-
Clone this repository and cd into it:
git clone https://github.com/gmanthey/variant-calling.git cd variant-calling
-
Create a new environment from the environment specs file:
Using mamba:
mamba env create -f environment.yml
Using micromamba:
micromamba env create -f environment.yml
-
Copy the
config.yml.template
file toconfig.yml
-
Adjust the paths to the genome and the raw fastq directory in the
config.yml
file.Depending on your setup and the sequencing technology used, change the path to the adapter sequences (the one provided uses illumina adapters provided by bbduk if you installed using conda, mamba or micromamba). Also adjust the optical duplicates distance depending on the sequencer used (e.g. 2500 for NovaSeq, 100 for MiSeq).
-
Create a chromosomes file from your reference genome:
samtools faidx <reference.fasta> cut -f 1 <reference.fasta.fai> > resources/chromosomes.txt
If the reference contains many scaffolds, it may be more efficient to group scaffolds together (as this pipeline parallelizes by chromosomes). For this, change the chromosome file to contain a group name in the first column, followed by whitespace seperated sequence names. You can also use the provided script
helpers/group_scaffolds
to create a chromosome file with balanced groups bases on the reference genome or a fai file of the reference genome. This format may also be combined with single sequence names for large sequences, i.e.:group1 seq1 seq2 seq3 group2 seq4 seq5 seq6 seq7
And adjust the path in the
config.yml
(or place it in the resources folder) -
Create a individuals.txt file from your list of fastq files/sample sheet. The individuals.txt file needs to be a tab seperated file with 2 columns, the first one being the individual ids that should be in the final vcf and the second one the path to the raw fastq file. Each line should only contain a single fastq file, but an individual can appear in multiple lines. The program automatically figures out R1 and R2 reads, as long as the first read in both files has the same fastq id.
For example:
We have 2 individuals I1 and I2, with I1 being run on 2 lanes, and therefor resulting in files:
- I1_L1_R1_001.fastq.gz
- I1_L1_R2_001.fastq.gz
- I1_L2_R1_001.fastq.gz
- I1_L2_R2_001.fastq.gz
- I2_L1_R1_001.fastq.gz
- I2_L1_R2_001.fastq.gz
Then the individuals.txt file needs to look like:
I1 I1_L1_R1_001.fastq.gz I1 I1_L1_R2_001.fastq.gz I1 I1_L2_R1_001.fastq.gz I1 I1_L2_R2_001.fastq.gz I2 I2_L1_R1_001.fastq.gz I2 I2_L1_R2_001.fastq.gz
Then adjust the path in the
config.yml
file for the individuals file (or place it in the resources folder).This allows for (almost) arbitrary files to be associated with any individual and a mapping to the actual individual ids early in the process. The trimmed fastq files will already contain combined sequences for an individual and are already named after the individual instead of after the files.
If the files already contain the sample names, as for example in the above case, you can create such a file with the following handy one-liner:
ls *.fastq.gz | cut -d "_" -f 1 | paste - <(ls *.fastq.gz)
-
Run the pipeline.
I suggest to open a screen or tmux window, as snakemake needs to run throughout the whole calculation in the background, but doesn't use many resources.
Activate the environment:
with mamba:
mamba activate variant-calling
with micromamba:
micromamba activate variant-calling
Run
snakemake --profile profile/default/
in the variant-calling folder to start this pipeline on the rosa hpc cluster at the UOL. The profile sets sensible default resources for most use cases and limits the number of currently running jobs to 100, as there is a limit on number of concurrent jobs for a single group.
If you want to only create bams and don't run variant calling, run
snakemake --profile profile/default/ bams