This repository hosts code for computing the mutation dynamics of S-protein in SARS-Cov-2 based on phylogenetic and statistical analyses.
git clone https://github.com/hzi-bifo/corona_protein_dynamics.git
cd corona_protein_dynamics
conda env create -f environment.yml
conda activate corona_sd_plots
If the pre-built binaries in libs/phylogeo-tools do not work, they should be build with the following commands .
rm -rf libs/phylogeo-tools
git clone https://github.com/hzi-bifo/phylogeo-tools.git libs/phylogeo-tools
cd libs/phylogeo-tools
mkdir -p build/
conda activate corona_sd_plots
make
cd ../../
Usage: bash corona_sd_plot.sh -r <root_path> \
[-s <sample_size>] \
[-t <time_period>] \
[-g <location>] \
-c <genomes_file> \
-l <lineage_file>
Options:
-r Path to the root (reference sequence)
-s Number of sampled sequences (if not set, no sampling)
-t Analyze by given time period (year, month, day, season)
-g Sample per geographic location (continent, country). If fasta is from one location, no need to specify
-c Get coding sequences first (specify genomes file)
-l Map substitutions to pangolin lineages (specify lineage file)
-h, --help Show help
After uncompress the test data in test_data/Germany
with:
gunzip test_data/Germany/DE.metadata.tsv.gz
gunzip test_data/Germany/DE.fasta.gz
The test data were created using sequences downloaded from NCBI, and the EPI_ISL IDs were artificially generated only for testing purposes.
You can run the pipeline on test data:
bash corona_sd_plot.sh \
<output dir> \
-r root_seq/Asia_root_cds.fa \
-t month \
-c test_data/Germany/DE.fasta \
-l test_data/Germany/DE.metadata.tsv
This will start the pipeline in the <output dir>
folder you have created, use the root sequence from file root_seq/Asia_root_cds.fa
and test data for Germany, and make a plot with monthly time periods. The runtime on this test data is around 25 minutes.
The -l
option requires a metadata file as input (e.g., -l DE.metadata.tsv
) to map amino acid substitutions to pangolin lineages. The metadata.tsv
file should adhere to the format exemplified in test_data/Germany/DE.metadata.tsv
The script requires sequences to have header in the following format:
>Germany/BY-ChVir-1017/2020|EPI_ISL_450209|2020-01-30
- Pull the singularity image
singularity remote add --no-login cloud https://cloud.sylabs.io
singularity pull --arch amd64 library://zldeng/collection/corona_protein_dynamics:latest
- Create the output directory
mkdir output
- Download the test data
mkdir -p testdata
cd testdata
wget https://raw.githubusercontent.com/hzi-bifo/corona_protein_dynamics/refs/heads/new_sd_plots/test_data/Germany/DE.fasta.gz
wget https://raw.githubusercontent.com/hzi-bifo/corona_protein_dynamics/refs/heads/new_sd_plots/test_data/Germany/DE.metadata.tsv.gz
gunzip DE.fasta.gz
gunzip DE.metadata.tsv.gz
- Run the container:
singularity exec corona_protein_dynamics_latest.sif \
bash /opt/corona_protein_dynamics/corona_sd_plot.sh output \
-r /opt/corona_protein_dynamics/root_seq/Asia_root_cds.fa \
-t month \
-c testdata/DE.fasta \
-l testdata/DE.metadata.tsv