Skip to content

Latest commit

 

History

History
104 lines (85 loc) · 3.31 KB

README.md

File metadata and controls

104 lines (85 loc) · 3.31 KB

Pipeline for S-protein dynamics

This repository hosts code for computing the mutation dynamics of S-protein in SARS-Cov-2 based on phylogenetic and statistical analyses.

Installation

git clone https://github.com/hzi-bifo/corona_protein_dynamics.git
cd corona_protein_dynamics
conda env create -f environment.yml
conda activate corona_sd_plots

If the pre-built binaries in libs/phylogeo-tools do not work, they should be build with the following commands .

rm -rf libs/phylogeo-tools
git clone https://github.com/hzi-bifo/phylogeo-tools.git libs/phylogeo-tools
cd libs/phylogeo-tools
mkdir -p build/
conda activate corona_sd_plots
make
cd ../../

Usage

Usage: bash corona_sd_plot.sh -r <root_path> \
  [-s <sample_size>] \
  [-t <time_period>] \
  [-g <location>] \
  -c <genomes_file> \
  -l <lineage_file>
Options:
  -r          Path to the root (reference sequence)
  -s          Number of sampled sequences (if not set, no sampling)
  -t          Analyze by given time period (year, month, day, season)
  -g          Sample per geographic location (continent, country). If fasta is from one location, no need to specify
  -c          Get coding sequences first (specify genomes file)
  -l          Map substitutions to pangolin lineages (specify lineage file)
  -h, --help  Show help

After uncompress the test data in test_data/Germany with:

gunzip test_data/Germany/DE.metadata.tsv.gz
gunzip test_data/Germany/DE.fasta.gz

The test data were created using sequences downloaded from NCBI, and the EPI_ISL IDs were artificially generated only for testing purposes.

You can run the pipeline on test data:

bash corona_sd_plot.sh \
  <output dir> \
  -r root_seq/Asia_root_cds.fa \
  -t month \
  -c test_data/Germany/DE.fasta \
  -l test_data/Germany/DE.metadata.tsv

This will start the pipeline in the <output dir> folder you have created, use the root sequence from file root_seq/Asia_root_cds.fa and test data for Germany, and make a plot with monthly time periods. The runtime on this test data is around 25 minutes.

The -l option requires a metadata file as input (e.g., -l DE.metadata.tsv) to map amino acid substitutions to pangolin lineages. The metadata.tsv file should adhere to the format exemplified in test_data/Germany/DE.metadata.tsv

The script requires sequences to have header in the following format:

>Germany/BY-ChVir-1017/2020|EPI_ISL_450209|2020-01-30

Use singularity container

  • Pull the singularity image
singularity remote add --no-login cloud https://cloud.sylabs.io
singularity pull --arch amd64 library://zldeng/collection/corona_protein_dynamics:latest
  • Create the output directory
mkdir output
  • Download the test data
mkdir -p testdata
cd testdata
wget https://raw.githubusercontent.com/hzi-bifo/corona_protein_dynamics/refs/heads/new_sd_plots/test_data/Germany/DE.fasta.gz
wget https://raw.githubusercontent.com/hzi-bifo/corona_protein_dynamics/refs/heads/new_sd_plots/test_data/Germany/DE.metadata.tsv.gz
gunzip DE.fasta.gz
gunzip DE.metadata.tsv.gz
  • Run the container:
singularity exec corona_protein_dynamics_latest.sif \
  bash /opt/corona_protein_dynamics/corona_sd_plot.sh output \
  -r /opt/corona_protein_dynamics/root_seq/Asia_root_cds.fa \
  -t month \
  -c testdata/DE.fasta \
  -l testdata/DE.metadata.tsv