This repository contains the R, Python, and Shell scripts used for a comprehensive pangenome variation graph (PVG) analysis of Sheeppox Virus (SPPV) and Goatpox Virus (GTPV), focusing on analyses of genomic diversity, phylogenetic relationships, and variant distribution patterns.
## Repository Structure

The repository is organised into seven main directories:

- `FIGURES/`: Output plots and figures generated by the analysis scripts, in PDF and PNG format, organised by virus type.
  - `FIGURES/GTPV/GENOME`: Phylogenetic and PCA visualisations of each GTPV gene / putative ORF.
  - `FIGURES/SPPV/GENOME`: Phylogenetic and PCA visualisations of each SPPV gene / putative ORF.
- `scripts/`: The analysis scripts used to process data and generate the figures.
- `FASTA_FILES/`: The GTPV and SPPV genome FASTA files used for sequence analysis.
- `XG_FILES/`: The GTPV and SPPV XG files used for conversion to PG format.
- `PG_FILES/`: The GTPV and SPPV PG files used for visualisation with sequenceTubeMap, including the representative PVGs with stems "GTPV.3" and "SPPV.4", respectively.
- `GFA_FILES/`: The GTPV and SPPV GFA files used for Bandage visualisation.
- `TABLES/`: Tabular data outputs, including variant comparison analyses.
## Scripts

This section details the purpose, dependencies, required inputs, and outputs for each script in the `scripts/` directory.
### `01_phylogeny_PCA.GTPV.R` and `01_phylogeny_PCA.SPPV.R`

These scripts perform phylogenetic analysis and Principal Component Analysis (PCA) for goatpox virus (GTPV) and sheeppox virus (SPPV), respectively.
- Purpose:
  - Phylogenetic Tree: Reads a RAxML-generated tree file with bootstrap support, midpoint-roots it, and visualises it. It annotates the tree with bootstrap values (≥90%) and colours tips by predefined genetic groups.
  - PCA: Reads a multiple sequence alignment, converts it to a numeric matrix, and performs PCA to visualise the genetic clustering of the viral isolates.
  - Combined Plot: Merges the phylogenetic tree and the PCA plot into a single comparative figure.
- Dependencies (R Packages): `ape`, `phangorn`, `ggplot2`, `ggtree`, `treeio`, `dplyr`, `RColorBrewer`, `seqinr`, `ggrepel`, `gridExtra`.
- Input Files:
  - `T14.raxml.supportTBE`: A Newick tree file with bootstrap support values.
  - `genomes.aln`: A FASTA-formatted multiple sequence alignment of the viral genomes.
- Output Files:
  - `goatpox_phylogeny.pdf`/`.png` or `sheeppox_phylogeny.pdf`/`.png`
  - `goatpox_pca.pdf` or `sheeppox_pca.pdf`
  - `sheeppox_pca.PC3.PC4.pdf` (for SPPV only)
  - `goatpox_combined_analysis.pdf`/`.png` or `sheeppox_combined_analysis.pdf`/`.png`
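The alignment-to-matrix and PCA steps described above can be illustrated in Python, although the repository's own scripts do this in R. The following is a minimal sketch only: the integer coding of bases (`BASE_CODES`) is an assumption, since the coding used by the R scripts is not stated here, and the function names are hypothetical. It uses only the standard library and finds the first principal component by power iteration rather than a full decomposition.

```python
import math

# Assumed integer coding; the coding used by the R scripts is not stated here.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


def read_fasta(text):
    """Parse FASTA-formatted text into {sequence_name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def alignment_to_matrix(records):
    """Encode aligned sequences as rows of integers (0 = gap or ambiguous base)."""
    names = list(records)
    matrix = [[BASE_CODES.get(base, 0) for base in records[n]] for n in names]
    return names, matrix


def pc1_scores(matrix, iterations=200):
    """Project samples onto the first principal component (power iteration)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centred = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    vec = [float(j + 1) for j in range(n_cols)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centred]
        vec = [sum(centred[i][j] * scores[i] for i in range(n_rows))
               for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0.0:  # no variable sites: every sample scores zero
            return [0.0] * n_rows
        vec = [v / norm for v in vec]
    return [sum(x * v for x, v in zip(row, vec)) for row in centred]
```

Given the text of an alignment such as `genomes.aln`, `pc1_scores(alignment_to_matrix(read_fasta(text))[1])` returns one score per isolate; identical sequences receive identical scores, and the sign of the axis is arbitrary.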
### `02_count_vcf.R`

- Purpose: Counts the occurrences of different variant types (SNP, DEL, INS, MNP, COMPLEX, COMPOUND) within specified genomic regions from Variant Call Format (VCF) files, with particular focus on terminal regions containing Inverted Terminal Repeats (ITRs).
- Dependencies (R Packages): Base R functions for file I/O and data manipulation.
- Input File: A VCF file (e.g., `gfavariants.vcf`).
- Output: Quantitative summary of variant types and their genomic distribution.
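As an illustration of region-restricted variant counting, here is a minimal Python sketch. It is not the logic of `02_count_vcf.R`: the classification by REF/ALT allele length is an assumption, the function names are hypothetical, and the COMPOUND category is not handled because its definition is not given here.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length (an assumed scheme)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(ref) < len(alt) and alt.startswith(ref):
        return "INS"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "DEL"
    return "COMPLEX"


def count_variants(vcf_lines, start, end):
    """Count variant types for records with start <= POS <= end (1-based)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alts = int(fields[1]), fields[3], fields[4]
        if not start <= pos <= end:
            continue
        for alt in alts.split(","):  # multi-allelic sites list ALTs with commas
            counts[classify_variant(ref, alt)] += 1
    return counts
```

For example, `count_variants(open("gfavariants.vcf"), 1, 5000)` would tally the variants in the first 5 Kb, assuming the file is an uncompressed, tab-delimited VCF.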
### `03_SNP_density.R`

- Purpose: Generates a comprehensive visualisation of SNP density across the entire genomes of SPPV and GTPV. It plots SNV density in sliding windows and aligns this with a map of the Coding Sequences (CDS) from corresponding GenBank files, highlighting specific regions of interest.
- Dependencies (R Packages): `ggplot2`, `dplyr`, `readr`, `genbankr`, `patchwork`.
- Input Files:
  - `results_SPPV/vcf/gfavariants.vcf`: VCF file for SPPV.
  - `results_GTPV/vcf/gfavariants.vcf`: VCF file for GTPV.
  - `nc_004002.gb`: GenBank file for the SPPV reference genome.
  - `nc_004003.gb`: GenBank file for the GTPV reference genome.
- Output Files:
  - `Figure_3_snv_density_with_cds.pdf`/`.png`: A combined plot showing SNP density and CDS maps for both viruses.
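The sliding-window density calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in `03_SNP_density.R`: the window and step sizes are placeholder defaults (the values used by the script are not given here), and `window_density` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def window_density(positions, genome_length, window=1000, step=500):
    """Return (start, end, count) per sliding window; coordinates are 1-based inclusive."""
    ordered = sorted(positions)
    results = []
    start = 1
    while start <= genome_length:
        end = min(start + window - 1, genome_length)
        # Number of positions p with start <= p <= end, via binary search.
        count = bisect_right(ordered, end) - bisect_left(ordered, start)
        results.append((start, end, count))
        if end == genome_length:
            break  # the last window reached the end of the genome
        start += step
    return results
```

Dividing each count by the window length in Kb gives a per-Kb density that can be plotted against the window midpoint.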
### `03_SNP_density.ITR.R`

- Purpose: This is a specialised version of the SNP density script that creates a focused comparison of SNV density within the terminal regions (first and last 5 Kb) of the SPPV and GTPV genomes, which include the Inverted Terminal Repeats (ITRs).
- Dependencies (R Packages): `ggplot2`, `dplyr`, `readr`, `genbankr`, `patchwork`.
- Input Files:
  - `results_SPPV/vcf/gfavariants.vcf`: VCF file for SPPV.
  - `results_GTPV/vcf/gfavariants.vcf`: VCF file for GTPV.
  - `nc_004002.gb`: GenBank file for the SPPV reference genome.
  - `nc_004003.gb`: GenBank file for the GTPV reference genome.
- Output Files:
  - `Figure_S3_snv_density_terminal_regions.pdf`: A side-by-side plot comparing SNV density in the initial and terminal 5 Kb regions of the genomes.
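Selecting the variants that fall in the first and last 5 Kb can be sketched in Python as below. The function name is hypothetical and the genome length must be supplied by the caller; this is only an illustration of the coordinate arithmetic, not the script's implementation.

```python
def terminal_region_positions(positions, genome_length, flank=5000):
    """Split 1-based positions into those in the first and last `flank` bases."""
    left = [p for p in positions if 1 <= p <= flank]
    right_start = genome_length - flank + 1  # first base of the right-hand flank
    right = [p for p in positions if right_start <= p <= genome_length]
    return left, right
```

For a genome of length 150000 (a round number chosen purely for illustration), the right-hand region spans positions 145001 to 150000.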
### `04_phylogeny_PCA.GTPV.R` and `04_phylogeny_PCA.SPPV.R`

Extended versions of the phylogenetic and PCA analysis scripts with additional functionality for strain-specific analyses and enhanced visualisation options.

- Purpose: Provides advanced phylogenetic reconstruction and multivariate analysis capabilities with strain-specific filtering and enhanced graphical outputs.
- Dependencies (R Packages): `ape`, `phangorn`, `ggplot2`, `ggtree`, `treeio`, `dplyr`, `RColorBrewer`, `seqinr`, `ggrepel`, `gridExtra`.
- Output Files: Enhanced phylogenetic trees and PCA plots with strain-specific annotations.
### `05_genome_analysis.GTPV.py` and `05_genome_analysis.SPPV.py`

Note: These are Python scripts for advanced genomic analysis.
- Purpose: Perform comprehensive genome-wide analysis including pangenome construction, core genome identification, and accessory gene analysis using Python-based bioinformatics tools.
- Dependencies: Python packages for bioinformatics analysis (e.g., BioPython, pandas, numpy).
- Output Files: Pangenome statistics, core gene analyses, and genome growth curves.
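To make the idea of a growth curve concrete, here is a generic Python sketch that tracks cumulative pangenome and core sizes as genomes are added. It is not taken from the `05_genome_analysis` scripts: what they count (genes, k-mers, or graph nodes) is not stated here, so each genome is represented as an abstract set of item labels, and `growth_curves` is a hypothetical name.

```python
def growth_curves(genomes):
    """Return cumulative pangenome and core sizes as genomes are added in order."""
    pan, core = set(), None
    pan_sizes, core_sizes = [], []
    for items in genomes:
        items = set(items)
        pan |= items  # pangenome: union of everything seen so far
        core = items if core is None else core & items  # core: shared by all so far
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes
```

Because the curves depend on the order in which genomes are added, analyses of this kind typically average over many random orderings.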
### `06_seqs.sh` and `07_network.sh`

Note: These are Bash scripts and should be run in a Unix-like terminal.

- Purpose:
  - `06_seqs.sh`: Processes sequence files and prepares them for downstream analysis.
  - `07_network.sh`: Constructs phylogenetic networks and performs network-based analyses.
- Dependencies: Standard Unix command-line tools and bioinformatics software.
- Output Files: Processed sequence files and network analysis results.
### `08_diversity.R`

- Purpose: Calculates and visualises genetic diversity metrics across SPPV and GTPV genomes, including nucleotide diversity, mutation rates, and population genetic statistics.
- Dependencies (R Packages): `ggplot2`, `dplyr`, and statistical packages for diversity calculations.
- Output Files: Diversity plots and statistical summaries.
### `09_create_genbank_table.sh`

- Purpose: Creates a table of aligned genes from GenBank file input.
- Output Files: TSV tables of gene coordinates.
### `10_get_genbank_info.sh`

- Purpose: Retrieves GenBank information for LSDV, GTPV and SPPV.
- Output Files: GenBank tabular data for each species.
### Per-gene diversity metrics

- Purpose: Computes SNPs/Kb and mutations/Kb for each gene in GTPV and SPPV.
- Output Files: Tables of diversity metrics (CSV format).
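The SNPs/Kb metric is the number of SNPs inside a gene divided by the gene length in kilobases. A minimal Python sketch follows, assuming 1-based inclusive gene coordinates; the function name and the gene names in the example are hypothetical, and this is not the repository's implementation.

```python
def snps_per_kb(genes, snp_positions):
    """Map each gene name to its SNP count per kilobase.

    `genes` maps a gene name to (start, end), 1-based and inclusive.
    """
    result = {}
    for name, (start, end) in genes.items():
        length = end - start + 1
        n_snps = sum(1 for p in snp_positions if start <= p <= end)
        result[name] = 1000.0 * n_snps / length
    return result
```

For instance, a 500 bp gene containing 3 SNPs scores 6.0 SNPs/Kb, and a 2000 bp gene containing 1 SNP scores 0.5 SNPs/Kb.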
## Data Files

- `FASTA_FILES/GTPV.fasta`: Goatpox virus genome sequences
- `FASTA_FILES/SPPV.fasta`: Sheeppox virus genome sequences
- `XG_FILES/GTPV.xg` & `SPPV.xg`: Compressed graph representations
- `PG_FILES/GTPV.pg` & `SPPV.pg`: PackedGraph format for visualisation
- `GFA_FILES/GTPV.gfa` & `SPPV.gfa`: Graphical Fragment Assembly format for Bandage
- `scripts/nc_004002.gb`: SPPV reference genome (GenBank format)
- `scripts/nc_004003.gb`: GTPV reference genome (GenBank format)
- `TABLES/SNV_comparison_GTPV_vs_SPPV.csv`: Comparative analysis of single nucleotide variants between virus types.
- `TABLES/LSDV_GTPV_SPPV.tsv`: Table of aligned genes' coordinates across LSDV, GTPV and SPPV.
- `TABLES/diversity_metrics.csv`: Table of the LSDV-linked genes, showing the GTPV_SNPs/Kb, GTPV_Mutations/Kb, SPPV_SNPs/Kb and SPPV_Mutations/Kb.
## Figures

The `FIGURES/` directory contains all outputs organised by analysis type and virus:

- `Figure_3_snv_density_with_cds.pdf`: Genome-wide SNV density plotted against gene locations for SPPV and GTPV.
- `Figure_S3_snv_density_terminal_regions.pdf`: Focused view of SNV density in the terminal regions of the genomes.
- `Figure_snps_kb_correlation.png`: The association of SNPs/Kb in GTPV vs SPPV for each gene, with a linear model displayed (blue).
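The linear model shown in the correlation figure can be illustrated with an ordinary least-squares fit. The Python sketch below is a generic example, not the plotting code used for the figure; which virus is placed on which axis is not stated here, so `xs` and `ys` are simply paired per-gene values, and `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The inputs would be the paired per-gene SNPs/Kb columns from `TABLES/diversity_metrics.csv`, one list per virus.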
The `FIGURES/GTPV/` directory contains:

- `GENOME/`: A folder containing information for each gene, labelled by LSDV ID as `LSDVXXX`:
  - `LSDVXXX_GTPV_combined.pdf`: Phylogenetic & PCA visualisations of diversity for the gene whose LSDV ID is XXX.
  - `LSDVXXX_sliding_window.png`: Diversity per clade across the gene whose LSDV ID is XXX.
  - `similarity_matrices_LSDVXXX.csv`: Pairwise matrix of percent similarity across samples for the gene whose LSDV ID is XXX.
  - `region_LSDVXXX.aln`: Sequences for samples at the gene whose LSDV ID is XXX.
- `goatpox_combined_analysis.pdf`: Combined phylogenetic tree and PCA plot
- `pangrowth.pdf`: Pangenome growth curve analysis
- `p_core.pdf`: Core genome size estimates
- `growth.pdf` & `heaps.pdf`: Genome accumulation curves
- `histgrowth.node.pdf`: Node-based growth statistics
- `mutation_density.pdf`: Mutation density distributions
- `GTPV_communities.pdf`: Phylogenetic community structure
- Strain-specific analyses: `LSDV009_`, `LSDV013_`, `LSDV026_`, `LSDV132_`, `LSDV136_` prefixed files
The `FIGURES/SPPV/` directory contains:

- `GENOME/`: A folder containing information for each gene, labelled by LSDV ID as `LSDVXXX`:
  - `LSDVXXX_GTPV_combined.pdf`: Phylogenetic & PCA visualisations of diversity for the gene whose LSDV ID is XXX.
  - `LSDVXXX_sliding_window.png`: Diversity per clade across the gene whose LSDV ID is XXX.
  - `similarity_matrices_LSDVXXX.csv`: Pairwise matrix of percent similarity across samples for the gene whose LSDV ID is XXX.
  - `region_LSDVXXX.aln`: Sequences for samples at the gene whose LSDV ID is XXX.
- `sheeppox_combined_analysis.pdf`: Combined phylogenetic tree and PCA plot
- `sheeppox_pca.PC3.PC4.pdf`: PCA plot using the 3rd and 4th principal components
- `pangrowth.pdf`: Pangenome growth curve analysis
- `p_core.pdf`: Core genome size estimates
- `growth.pdf` & `heaps.pdf`: Genome accumulation curves
- `histgrowth.node.pdf`: Node-based growth statistics
- `SPPV_communities.pdf`: Phylogenetic community structure
- Strain-specific analyses: `LSDV009_`, `LSDV013_`, `LSDV026_`, `LSDV132_`, `LSDV136_` prefixed files
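A pairwise percent-similarity matrix of the kind stored in the `similarity_matrices_LSDVXXX.csv` files can be computed from an alignment along the following lines. This Python sketch is an assumption-laden illustration: it skips alignment columns where either sequence has a gap, which may differ from how the repository's files were produced, and the function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percent of compared columns with identical bases (gap columns skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumed gap handling: ignore the column
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def similarity_matrix(records):
    """Return (names, matrix) of pairwise percent identity for aligned records."""
    names = list(records)
    matrix = [[percent_identity(records[a], records[b]) for b in names]
              for a in names]
    return names, matrix
```

The result is symmetric with 100.0 on the diagonal for any sequence that contains at least one non-gap base.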
## Prerequisites

- R: Ensure you have R installed on your system (version ≥4.0 recommended).
- Python: Python 3.7+ with bioinformatics packages for running the `.py` scripts.
- R Packages: You will need several R packages. The CRAN packages are installed with `install.packages()`, and `ggtree`, `treeio` and `genbankr` are installed from Bioconductor. Run the following commands in an R console:

      # CRAN packages
      install.packages(c("ape", "phangorn", "ggplot2", "dplyr", "RColorBrewer",
                         "seqinr", "ggrepel", "gridExtra", "readr", "patchwork"))

      # Bioconductor packages
      if (!require("BiocManager", quietly = TRUE))
          install.packages("BiocManager")
      BiocManager::install(c("ggtree", "treeio", "genbankr"))

- System Requirements: Unix-like environment (Linux, macOS, or WSL on Windows) for running shell scripts.
## Usage

- Clone the repository:

      git clone https://github.com/downingtim/GTPV-SPPV-PVG
      cd GTPV-SPPV-PVG

- Prepare Input Data: Ensure all required input files are present. The repository includes reference genomes and example data files.

- Execute scripts in sequence:

      # Phylogenetic and PCA analysis
      Rscript scripts/01_phylogeny_PCA.SPPV.R
      Rscript scripts/01_phylogeny_PCA.GTPV.R

      # Variant analysis
      Rscript scripts/02_count_vcf.R

      # SNP density analysis
      Rscript scripts/03_SNP_density.R
      Rscript scripts/03_SNP_density.ITR.R

      # Sequence processing
      bash scripts/06_seqs.sh

      # Advanced analyses
      python scripts/05_genome_analysis.SPPV.py
      python scripts/05_genome_analysis.GTPV.py

      # Phylogeny and PCA
      Rscript scripts/04_phylogeny_PCA.GTPV.R
      Rscript scripts/04_phylogeny_PCA.SPPV.R

      # Network analysis
      bash scripts/07_network.sh

      # Diversity analysis
      Rscript scripts/08_diversity.R

      # GenBank table (run in the order shown)
      sh scripts/10_get_genbank_info.sh
      sh scripts/09_create_genbank_table.sh
***
## Citation
If you use this code or analysis pipeline in your research, please cite:
Downing T. Insights into goatpox virus and sheeppox virus genomes from pangenome graphs.
github.com/downingtim/GTPV-SPPV-PVG/
***
## License
This project is licensed under the MIT License; see the LICENSE file for details.
***
## Contact
For questions or issues, please open an issue on GitHub or contact the repository maintainer.