Pangenome variation graph (PVG) analysis of sheeppox (SPPV) and goatpox (GTPV) viruses 🧬

This repository contains the R, Python, and Shell scripts used for a comprehensive pangenome variation graph (PVG) analysis of Sheeppox Virus (SPPV) and Goatpox Virus (GTPV), focusing on analyses of genomic diversity, phylogenetic relationships, and variant distribution patterns.

Repository Structure

The repository is organised into seven main directories:

  • 📁 FIGURES/: The output plots and figures generated by the analysis scripts in PDF and PNG format, organised by virus type.
  • 📁 FIGURES/GTPV/GENOME: Phylogenetic and PCA visualisations of each GTPV gene / putative ORF.
  • 📁 FIGURES/SPPV/GENOME: Phylogenetic and PCA visualisations of each SPPV gene / putative ORF.
  • 📁 scripts/: Contains the analysis scripts used to process data and generate the figures.
  • 📁 FASTA_FILES/: Contains the GTPV and SPPV genome FASTA files used for sequence analysis.
  • 📁 XG_FILES/: Contains the GTPV and SPPV XG files used for conversion to PG format.
  • 📁 PG_FILES/: Contains the GTPV and SPPV PG files used for visualisation with sequenceTubeMap, including the representative PVGs with stems "GTPV.3" and "SPPV.4", respectively.
  • 📁 GFA_FILES/: Contains the GTPV and SPPV GFA files used for Bandage visualisation.
  • 📁 TABLES/: Contains tabular data outputs including variant comparison analyses.

Scripts 🔬

This section details the purpose, dependencies, required inputs, and outputs for each script in the scripts/ directory.

01_phylogeny_PCA.GTPV.R and 01_phylogeny_PCA.SPPV.R

These scripts perform phylogenetic analysis and Principal Component Analysis (PCA) for Goatpox (GTPV) and Sheeppox (SPPV) viruses, respectively.

  • Purpose:
    1. Phylogenetic Tree: Reads a RAxML-generated tree file with bootstrap support, midpoint roots it, and visualises it. It annotates the tree with bootstrap values (≥90%) and colours tips based on predefined genetic groups.
    2. PCA: Reads a multiple sequence alignment, converts it to a numeric matrix, and performs PCA to visualise the genetic clustering of the viral isolates.
    3. Combined Plot: Merges the phylogenetic tree and the PCA plot into a single comparative figure.
  • Dependencies (R Packages): ape, phangorn, ggplot2, ggtree, treeio, dplyr, RColorBrewer, seqinr, ggrepel, gridExtra.
  • Input Files:
    • T14.raxml.supportTBE: A Newick tree file with bootstrap support values.
    • genomes.aln: A FASTA-formatted multiple sequence alignment of the viral genomes.
  • Output Files:
    • goatpox_phylogeny.pdf/.png or sheeppox_phylogeny.pdf/.png
    • goatpox_pca.pdf or sheeppox_pca.pdf
    • sheeppox_pca.PC3.PC4.pdf (for SPPV only)
    • goatpox_combined_analysis.pdf/.png or sheeppox_combined_analysis.pdf/.png
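
The PCA step described above (alignment to numeric matrix, then PCA) can be illustrated with a minimal, dependency-free Python sketch. This is not the code in the R scripts: the base-to-number encoding, the function names and the toy sequences are all assumptions made for illustration, and only the first principal component is computed, using power iteration.

```python
import math

def encode_alignment(seqs):
    """Map aligned bases to numbers; gaps and ambiguous bases become 0."""
    codes = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}  # assumed encoding
    return [[codes.get(base, 0.0) for base in seq.upper()] for seq in seqs]

def first_pc_scores(matrix, n_iter=200):
    """Project each row onto the first principal component (power iteration)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centred = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    vec = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary non-zero start
    for _ in range(n_iter):
        scores = [sum(c * v for c, v in zip(row, vec)) for row in centred]
        new = [sum(centred[i][j] * scores[i] for i in range(n_rows))
               for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:  # no variation in the alignment
            break
        vec = [v / norm for v in new]
    return [sum(c * v for c, v in zip(row, vec)) for row in centred]

# Toy alignment (made-up sequences): two isolates per group.
alignment = {"A1": "AAAAAAAA", "A2": "AAAAAAAC",
             "B1": "TTTTAAAA", "B2": "TTTTAAAC"}
pc1 = dict(zip(alignment, first_pc_scores(encode_alignment(alignment.values()))))
for name, score in pc1.items():
    print(f"{name}\t{score:.3f}")
```

In this toy case the two groups fall on opposite sides of the first component. The sign of a principal component is arbitrary, so only the separation is meaningful.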

02_count_vcf.R

  • Purpose: Counts the occurrences of different variant types (SNP, DEL, INS, MNP, COMPLEX, COMPOUND) within specified genomic regions from Variant Call Format (VCF) files, with particular focus on terminal regions containing Inverted Terminal Repeats (ITRs).
  • Dependencies (R Packages): Base R functions for file I/O and data manipulation.
  • Input File:
    • A VCF file (e.g., gfavariants.vcf).
  • Output: Quantitative summary of variant types and their genomic distribution.
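
As a rough illustration of the counting idea, the Python sketch below classifies each REF/ALT pair by allele length and tallies the types inside a coordinate range. The classification rules, function names and toy records are assumptions, not the logic of 02_count_vcf.R. The COMPOUND type is left out because its definition is not given here, and symbolic alleles are not handled.

```python
from collections import Counter

def classify_variant(ref, alt):
    """Assign a coarse variant type from allele lengths (assumed rules)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(alt) == 1 and ref.startswith(alt):
        return "DEL"
    if len(ref) == 1 and alt.startswith(ref):
        return "INS"
    return "COMPLEX"

def count_variant_types(vcf_lines, start, end):
    """Count variant types for records with start <= POS <= end (1-based)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alts = int(fields[1]), fields[3], fields[4]
        if not start <= pos <= end:
            continue
        for alt in alts.split(","):  # one count per alternate allele
            counts[classify_variant(ref, alt)] += 1
    return counts

# Toy VCF records (made up for illustration).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ref\t100\t.\tA\tG\t.\t.\t.",
    "ref\t250\t.\tAT\tA\t.\t.\t.",
    "ref\t400\t.\tC\tCGG\t.\t.\t.",
    "ref\t900\t.\tAC\tGT\t.\t.\t.",
    "ref\t6000\t.\tA\tT\t.\t.\t.",
]
terminal_counts = count_variant_types(toy_vcf, 1, 5000)
print(dict(terminal_counts))
```

To apply it to a real file, pass an open file handle (for example one opened on gfavariants.vcf) in place of the toy list.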

03_SNP_density.R

  • Purpose: Generates a comprehensive visualisation of SNP density across the entire genomes of SPPV and GTPV. It plots SNV density in sliding windows and aligns this with a map of the Coding Sequences (CDS) from corresponding GenBank files, highlighting specific regions of interest.
  • Dependencies (R Packages): ggplot2, dplyr, readr, genbankr, patchwork.
  • Input Files:
    • results_SPPV/vcf/gfavariants.vcf: VCF file for SPPV.
    • results_GTPV/vcf/gfavariants.vcf: VCF file for GTPV.
    • nc_004002.gb: GenBank file for the SPPV reference genome.
    • nc_004003.gb: GenBank file for the GTPV reference genome.
  • Output Files:
    • Figure_3_snv_density_with_cds.pdf/.png: A combined plot showing SNP density and CDS maps for both viruses.
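
The sliding-window calculation behind a density plot of this kind can be sketched as follows. The window and step sizes, the function name and the toy positions are assumptions for illustration; the R script may use different settings.

```python
from bisect import bisect_left, bisect_right

def window_density(positions, genome_length, window=1000, step=500):
    """Return (start, end, SNVs per kb) for each sliding window (1-based)."""
    ordered = sorted(positions)
    rows = []
    for start in range(1, genome_length + 1, step):
        end = min(start + window - 1, genome_length)
        # Count positions with start <= p <= end using binary search.
        n = bisect_right(ordered, end) - bisect_left(ordered, start)
        rows.append((start, end, n * 1000 / (end - start + 1)))
        if end == genome_length:
            break  # the last window has reached the genome end
    return rows

# Toy SNV positions on a made-up 3 kb genome.
density = window_density([10, 20, 600, 1200, 2900], genome_length=3000)
for start, end, per_kb in density:
    print(f"{start}\t{end}\t{per_kb:.2f}")
```

Overlapping windows (step smaller than window) smooth the curve at the cost of correlated neighbouring values.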

03_SNP_density.ITR.R

  • Purpose: This is a specialised version of the SNP density script that creates a focused comparison of SNV density within the terminal regions (first and last 5 Kb) of the SPPV and GTPV genomes, which include the Inverted Terminal Repeats (ITRs).
  • Dependencies (R Packages): ggplot2, dplyr, readr, genbankr, patchwork.
  • Input Files:
    • results_SPPV/vcf/gfavariants.vcf: VCF file for SPPV.
    • results_GTPV/vcf/gfavariants.vcf: VCF file for GTPV.
    • nc_004002.gb: GenBank file for the SPPV reference genome.
    • nc_004003.gb: GenBank file for the GTPV reference genome.
  • Output Files:
    • Figure_S3_snv_density_terminal_regions.pdf: A side-by-side plot comparing SNV density in the initial and terminal 5 Kb regions of the genomes.
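
Selecting the variants that fall in the first and last 5 Kb can be sketched as a simple filter. The function name, genome length and positions below are made up for illustration.

```python
def split_terminal(positions, genome_length, flank=5000):
    """Return the positions in the first and the last `flank` bases (1-based)."""
    left = [p for p in positions if p <= flank]
    right = [p for p in positions if p > genome_length - flank]
    return left, right

# Toy positions on a hypothetical 150,000 bp genome.
left_end, right_end = split_terminal(
    [12, 4999, 5001, 75000, 145001, 149990], genome_length=150000
)
print(len(left_end), len(right_end))
```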

04_phylogeny_PCA.GTPV.R and 04_phylogeny_PCA.SPPV.R

Extended versions of the phylogenetic and PCA analysis scripts with additional functionality for strain-specific analyses and enhanced visualisation options.

  • Purpose: Provides advanced phylogenetic reconstruction and multivariate analysis capabilities with strain-specific filtering and enhanced graphical outputs.
  • Dependencies (R Packages): ape, phangorn, ggplot2, ggtree, treeio, dplyr, RColorBrewer, seqinr, ggrepel, gridExtra.
  • Output Files: Enhanced phylogenetic trees and PCA plots with strain-specific annotations.

05_genome_analysis.GTPV.py and 05_genome_analysis.SPPV.py

Note: These are Python scripts for advanced genomic analysis.

  • Purpose: Perform comprehensive genome-wide analysis including pangenome construction, core genome identification, and accessory gene analysis using Python-based bioinformatics tools.
  • Dependencies: Python packages for bioinformatics analysis (e.g., BioPython, pandas, numpy).
  • Output Files: Pangenome statistics, core gene analyses, and genome growth curves.
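
For readers unfamiliar with growth curves, the concept can be illustrated with a small Python sketch that tracks how the pangenome (union) and core (intersection) sizes change as genomes are added. This is an illustration under assumptions, not the method used by the scripts: each genome is represented by a hypothetical set of graph node IDs, and the genomes are added in one fixed order.

```python
def growth_curve(genome_nodes):
    """Cumulative pangenome and core sizes as genomes are added in order."""
    pan, core, rows = set(), None, []
    for name, nodes in genome_nodes:
        pan |= nodes                                    # union so far
        core = set(nodes) if core is None else core & nodes  # shared by all
        rows.append((name, len(pan), len(core)))
    return rows

# Toy node sets (made up): node IDs present in each genome.
toy_genomes = [
    ("g1", {1, 2, 3, 4}),
    ("g2", {1, 2, 3, 5}),
    ("g3", {1, 2, 6}),
]
curve = growth_curve(toy_genomes)
for name, pan_size, core_size in curve:
    print(f"{name}\tpan={pan_size}\tcore={core_size}")
```

The result depends on the order in which genomes are added, so growth curves are commonly averaged over many random orderings.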

06_seqs.sh and 07_network.sh

Note: These are Bash scripts and should be run in a Unix-like terminal.

  • Purpose:
    • 06_seqs.sh: Processes sequence files and prepares them for downstream analysis.
    • 07_network.sh: Constructs phylogenetic networks and performs network-based analyses.
  • Dependencies: Standard Unix command-line tools and bioinformatics software.
  • Output Files: Processed sequence files and network analysis results.

08_diversity.R

  • Purpose: Calculates and visualises genetic diversity metrics across SPPV and GTPV genomes, including nucleotide diversity, mutation rates, and population genetic statistics.
  • Dependencies (R Packages): ggplot2, dplyr, statistical packages for diversity calculations.
  • Output Files: Diversity plots and statistical summaries.
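
One common definition of nucleotide diversity is the mean proportion of differing sites over all pairs of sequences. The sketch below implements that definition on a toy alignment. It is an assumption that 08_diversity.R uses a comparable definition, and the function name and sequences are illustrative only.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Mean pairwise proportion of differing sites, skipping gaps and Ns."""
    valid = set("ACGT")
    per_pair = []
    for a, b in combinations(seqs, 2):
        compared = diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in valid and y in valid:  # ignore sites with gaps or Ns
                compared += 1
                diffs += x != y
        if compared:
            per_pair.append(diffs / compared)
    return sum(per_pair) / len(per_pair) if per_pair else 0.0

# Toy alignment (made-up sequences of equal length).
pi = nucleotide_diversity(["ACGTACGT", "ACGTACGA", "ACGAACGA"])
print(f"pi = {pi:.4f}")
```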

09_create_genbank_table.sh

  • Purpose: Creates a table of aligned genes from GenBank file input.
  • Output Files: TSV tables of gene coordinates.
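
The general idea of turning GenBank CDS features into a coordinate table can be sketched in Python as below. This is not the shell script's implementation: the regular expressions, function name and toy record are assumptions, and only simple locations (optionally wrapped in complement(...)) are handled, so joined locations are skipped.

```python
import re

CDS_LINE = re.compile(r"^ {5}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$")
FEATURE_LINE = re.compile(r"^ {5}\S")
TAG_LINE = re.compile(r'^\s+/locus_tag="([^"]+)"')

def cds_table(genbank_text):
    """Return (locus_tag, start, end, strand) for each simple CDS feature."""
    rows, pending = [], None
    for line in genbank_text.splitlines():
        cds = CDS_LINE.match(line)
        if cds:
            strand = "-" if cds.group(1) else "+"
            pending = (int(cds.group(2)), int(cds.group(3)), strand)
        elif FEATURE_LINE.match(line):
            pending = None  # another feature (or unsupported location) starts
        elif pending is not None:
            tag = TAG_LINE.match(line)
            if tag:
                rows.append((tag.group(1), *pending))
                pending = None
    return rows

# Toy feature table (made up); GenBank uses fixed-width indentation.
KEY, QUAL = " " * 5, " " * 21
toy_genbank = "\n".join([
    "FEATURES             Location/Qualifiers",
    KEY + "gene            100..400",
    QUAL + '/locus_tag="GENE_A"',
    KEY + "CDS             100..400",
    QUAL + '/locus_tag="GENE_A"',
    QUAL + '/product="hypothetical protein"',
    KEY + "CDS             complement(500..800)",
    QUAL + '/locus_tag="GENE_B"',
])
table = cds_table(toy_genbank)
for row in table:
    print("\t".join(str(value) for value in row))
```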

10_get_genbank_info.sh

  • Purpose: Retrieves GenBank information for LSDV, GTPV and SPPV.
  • Output Files: GenBank tabular data for each species.

11_diversity_metrics.py

  • Purpose: Compute SNPs/Kb and mutations/Kb for each gene in GTPV and SPPV.
  • Output Files: Tables of diversity metrics (CSV format).
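
A per-gene SNPs/Kb value is the number of SNPs inside the gene divided by the gene length in kilobases. The sketch below shows that arithmetic with made-up gene coordinates and SNP positions; the function name is hypothetical. A mutations/Kb value would presumably follow the same pattern with a different count, which is an assumption here.

```python
def snps_per_kb(genes, snp_positions):
    """genes: (name, start, end), 1-based inclusive. Returns {name: SNPs/kb}."""
    result = {}
    for name, start, end in genes:
        n_snps = sum(start <= p <= end for p in snp_positions)
        result[name] = n_snps * 1000 / (end - start + 1)
    return result

# Toy coordinates and SNP positions (made up for illustration).
toy_genes = [("LSDV001", 1, 500), ("LSDV002", 601, 1600)]
metrics = snps_per_kb(toy_genes, [10, 250, 499, 700, 2000])
for gene, value in metrics.items():
    print(f"{gene},{value:.2f}")
```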

Data Files 📊

Input Sequence Data

  • FASTA_FILES/GTPV.fasta: Goatpox virus genome sequences
  • FASTA_FILES/SPPV.fasta: Sheeppox virus genome sequences

Graph Format Files

  • XG_FILES/GTPV.xg & SPPV.xg: Compressed graph representations
  • PG_FILES/GTPV.pg & SPPV.pg: PackedGraph format for visualisation
  • GFA_FILES/GTPV.gfa & SPPV.gfa: Graphical Fragment Assembly format for Bandage
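
GFA is a tab-delimited text format in which the first field of each line gives the record type (S for segments, L for links, P for paths, and W for walks in GFA 1.1). A quick way to inspect a graph before loading it into Bandage is to count these records, as in the sketch below. The function name and toy graph are illustrative assumptions.

```python
from collections import Counter

def summarise_gfa(lines):
    """Count segments, links and paths/walks, and sum segment lengths."""
    counts = Counter()
    segment_bp = 0
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[0]] += 1
        # S lines carry the sequence in the third field ("*" means absent).
        if fields[0] == "S" and len(fields) > 2 and fields[2] != "*":
            segment_bp += len(fields[2])
    return {
        "segments": counts["S"],
        "links": counts["L"],
        "paths": counts["P"] + counts["W"],
        "segment_bp": segment_bp,
    }

# Toy graph (made up): three segments joined in a line, one path.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTTA",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "P\tgenomeA\t1+,2+,3+\t*",
]
summary = summarise_gfa(toy_gfa)
print(summary)
```

For a real file, pass an open file handle (for example one opened on GFA_FILES/GTPV.gfa) instead of the toy list.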

Reference Genomes

  • scripts/nc_004002.gb: SPPV reference genome (GenBank format)
  • scripts/nc_004003.gb: GTPV reference genome (GenBank format)

Analysis Results

  • TABLES/SNV_comparison_GTPV_vs_SPPV.csv: Comparative analysis of single nucleotide variants between virus types
  • TABLES/LSDV_GTPV_SPPV.tsv: Table of aligned genes' coordinates across LSDV, GTPV and SPPV.
  • TABLES/diversity_metrics.csv: Table of the LSDV-linked genes, showing the GTPV_SNPs/Kb, GTPV_Mutations/Kb, SPPV_SNPs/Kb and SPPV_Mutations/Kb.

Figures 📈

The FIGURES/ directory contains all outputs organised by analysis type and virus:

Main Figures

  • Figure_3_snv_density_with_cds.pdf: Genome-wide SNV density plotted against gene locations for SPPV and GTPV.
  • Figure_S3_snv_density_terminal_regions.pdf: Focused view of SNV density in the terminal regions of the genomes.
  • Figure_snps_kb_correlation.png: The association of SNPs/Kb in GTPV vs SPPV for each gene with a linear model displayed (blue).
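
A linear model of this kind can be fitted by ordinary least squares. The sketch below shows the calculation on made-up values; the function name is hypothetical, and real input would be the per-gene SNPs/Kb columns from TABLES/diversity_metrics.csv.

```python
def linear_fit(x, y):
    """Ordinary least squares fit y = slope * x + intercept, with R squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Toy per-gene values (made up): x for one virus, y for the other.
slope, intercept, r_squared = linear_fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r_squared:.3f}")
```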

GTPV-Specific Analyses

The FIGURES/GTPV/ directory contains:

  • GENOME/: A folder containing info for each gene as LSDVXXX:
    • LSDVXXX_GTPV_combined.pdf: Phylogenetic & PCA visualisations of diversity for gene whose LSDV ID is XXX.
    • LSDVXXX_sliding_window.png: Diversity per clade across gene whose LSDV ID is XXX.
    • similarity_matrices_LSDVXXX.csv: Pairwise matrix of percent similarity across samples for gene whose LSDV ID is XXX.
    • region_LSDVXXX.aln: Sequences for samples at gene whose LSDV ID is XXX.
  • goatpox_combined_analysis.pdf: Combined phylogenetic tree and PCA plot
  • pangrowth.pdf: Pangenome growth curve analysis
  • p_core.pdf: Core genome size estimates
  • growth.pdf & heaps.pdf: Genome accumulation curves
  • histgrowth.node.pdf: Node-based growth statistics
  • mutation_density.pdf: Mutation density distributions
  • GTPV_communities.pdf: Phylogenetic community structure
  • Strain-specific analyses: LSDV009_, LSDV013_, LSDV026_, LSDV132_, LSDV136_ prefixed files

SPPV-Specific Analyses

The FIGURES/SPPV/ directory contains:

  • GENOME/: A folder containing info for each gene as LSDVXXX:
    • LSDVXXX_GTPV_combined.pdf: Phylogenetic & PCA visualisations of diversity for gene whose LSDV ID is XXX.
    • LSDVXXX_sliding_window.png: Diversity per clade across gene whose LSDV ID is XXX.
    • similarity_matrices_LSDVXXX.csv: Pairwise matrix of percent similarity across samples for gene whose LSDV ID is XXX.
    • region_LSDVXXX.aln: Sequences for samples at gene whose LSDV ID is XXX.
  • sheeppox_combined_analysis.pdf: Combined phylogenetic tree and PCA plot
  • sheeppox_pca.PC3.PC4.pdf: PCA plot using the 3rd and 4th principal components
  • pangrowth.pdf: Pangenome growth curve analysis
  • p_core.pdf: Core genome size estimates
  • growth.pdf & heaps.pdf: Genome accumulation curves
  • histgrowth.node.pdf: Node-based growth statistics
  • SPPV_communities.pdf: Phylogenetic community structure
  • Strain-specific analyses: LSDV009_, LSDV013_, LSDV026_, LSDV132_, LSDV136_ prefixed files

Getting Started

Prerequisites

  1. R: Ensure you have R installed on your system (version ≥4.0 recommended).
  2. Python: Python 3.7+ with bioinformatics packages for running the .py scripts.
  3. R Packages: You will need several R packages. You can install them all by running the following command in an R console:
    # CRAN packages:
    install.packages(c("ape", "phangorn", "ggplot2", "dplyr", "RColorBrewer",
                       "seqinr", "ggrepel", "gridExtra", "readr", "patchwork"))
    
    # Bioconductor packages (ggtree, treeio and genbankr are not on CRAN):
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(c("ggtree", "treeio", "genbankr"))
  4. System Requirements: Unix-like environment (Linux, macOS, or WSL on Windows) for running shell scripts.

Running the Scripts

  1. Clone the repository:

    git clone https://github.com/downingtim/GTPV-SPPV-PVG
    cd GTPV-SPPV-PVG
  2. Prepare Input Data: Ensure all required input files are present. The repository includes reference genomes and example data files.

  3. Execute scripts in sequence:

    # Phylogenetic and PCA analysis
    Rscript scripts/01_phylogeny_PCA.SPPV.R
    Rscript scripts/01_phylogeny_PCA.GTPV.R
    
    # Variant analysis
    Rscript scripts/02_count_vcf.R
    
    # SNP density analysis
    Rscript scripts/03_SNP_density.R
    Rscript scripts/03_SNP_density.ITR.R
    
    # Sequence processing 
    bash scripts/06_seqs.sh
    
    # Advanced analyses
    python scripts/05_genome_analysis.SPPV.py
    python scripts/05_genome_analysis.GTPV.py
    
    # Phylogeny and PCA
    Rscript scripts/04_phylogeny_PCA.GTPV.R
    Rscript scripts/04_phylogeny_PCA.SPPV.R
    
    # Network analysis
    bash scripts/07_network.sh
    
    # Diversity analysis
    Rscript scripts/08_diversity.R
    
    # genbank table (run in order shown)
    sh scripts/10_get_genbank_info.sh
    sh scripts/09_create_genbank_table.sh
    
    

Citation

If you use this code or analysis pipeline in your research, please cite:
   Downing T. Insights into goatpox virus and sheeppox virus genomes from pangenome graphs.
   github.com/downingtim/GTPV-SPPV-PVG/

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contact

For questions or issues, please open an issue on GitHub or contact the repository maintainer.

About

Pangenome variation graph (PVG) analysis of all goatpox virus (GTPV) and sheeppox virus (SPPV) genomes
