A tool for predicting the effects of genomic variants on protein-coding genes, focusing on synonymous and nonsynonymous changes.
This tool analyzes Variant Call Format (VCF) files to determine the effects of SNPs (Single Nucleotide Polymorphisms) on protein-coding genes. It uses a reference genome (FASTA) and gene annotations (GenBank) to:
- Identify if variants fall within coding regions
- Determine if coding variants cause amino acid changes
- Classify variants as:
  - Synonymous (no amino acid change)
  - Nonsynonymous (amino acid change)
  - Stop gain (introducing a premature stop codon)
  - Stop loss (removing a stop codon)
  - Intergenic (not in a coding region)
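
The classification of a coding SNP can be illustrated by comparing the translation of the reference and alternate codons. The following is a minimal sketch, not the tool's actual code: it assumes the standard genetic code, and `classify_effect` and `CODON_TABLE` are hypothetical names introduced for illustration. A stop-to-stop change is treated as synonymous here because the amino acid symbol does not change.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position
# (assumption: the tool may use a different translation table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_effect(ref_codon, alt_codon):
    """Return the effect label for a codon change inside a coding region."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gain"
    if ref_aa == "*":
        return "stop_loss"
    return "nonsynonymous"


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAA", "GAT"), ("GAA", "TAA"), ("TAA", "CAA")]:
        print(ref, alt, classify_effect(ref, alt))
```

Variants that fall outside every annotated coding region never reach this step and would be labelled intergenic.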
- Python 3.6+
- BioPython
- pandas (for batch processing)
- matplotlib and seaborn (for visualization)
pip install biopython pandas matplotlib seaborn
# use conda instead
python vep.py INPUT.vcf REFERENCE.fasta ANNOTATION.gbk -o OUTPUT.tsv
- INPUT.vcf: Input variant file in VCF format
- REFERENCE.fasta: Reference genome in FASTA format
- ANNOTATION.gbk: Gene annotation file in GenBank format
- -o, --output: Output file name (default: variant_effects.tsv)
python batch.py --vcf_dir /path/to/vcfs --pattern "*.vcf.gz" --fasta REFERENCE.fasta --genbank ANNOTATION.gbk --output_dir results
- --vcf_dir: Directory containing VCF files
- --fasta: Reference genome in FASTA format
- --genbank: Gene annotation file in GenBank format
- --output_dir: Output directory (default: variant_results)
- --pattern: Pattern to match VCF files (default: *.vcf)
The tool generates a tab-delimited file with the following columns:
- CHROM: Chromosome/contig name
- POS: Variant position
- REF: Reference base
- ALT: Alternate base
- EFFECT: Effect type (synonymous, nonsynonymous, stop_gain, stop_loss, intergenic)
- GENE: Gene name
- PRODUCT: Gene product description
- REF_CODON: Reference codon
- ALT_CODON: Alternate codon
- REF_AA: Reference amino acid
- ALT_AA: Alternate amino acid
- CODON_POS: Position in the codon (1-based)
- AA_POS: Amino acid position (1-based)
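
To make the two 1-based position columns concrete, the sketch below derives CODON_POS and AA_POS from a variant position and the boundaries of a coding sequence. It is an illustration under stated assumptions rather than the tool's implementation: the CDS is assumed to be contiguous (no introns or joins), coordinates are 1-based and inclusive, and `codon_coordinates` is a hypothetical helper name.

```python
def codon_coordinates(pos, cds_start, cds_end, strand=1):
    """Return (CODON_POS, AA_POS), both 1-based, for a SNP inside a CDS.

    pos, cds_start and cds_end are 1-based inclusive genomic coordinates.
    strand is 1 for the forward strand and -1 for the reverse strand.
    """
    if not cds_start <= pos <= cds_end:
        raise ValueError("position lies outside the coding sequence")
    # Distance (0-based) from the first base of the CDS in reading direction.
    offset = pos - cds_start if strand == 1 else cds_end - pos
    codon_pos = offset % 3 + 1   # 1, 2 or 3
    aa_pos = offset // 3 + 1     # first codon is amino acid 1
    return codon_pos, aa_pos


if __name__ == "__main__":
    print(codon_coordinates(104, 100, 399))        # forward-strand gene
    print(codon_coordinates(397, 100, 399, -1))    # reverse-strand gene
```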
The batch processor generates:
- Individual TSV files for each processed VCF
- A combined results file (combined_effects.tsv)
- A summary text file with statistics (variant_summary.txt)
- Visualization plots in the "plots" directory:
  - Distribution of variant effects
  - Variant effects by sample
  - Top affected genes
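
The combined results can also be inspected outside the batch processor. The following sketch tallies effect types from a tab-delimited file, assuming that combined_effects.tsv carries a header row with the same column names as the single-file output (in particular EFFECT). `count_effects` is a hypothetical helper, and the rows in the example are made-up placeholder data, not real results.

```python
import csv
import io
from collections import Counter


def count_effects(handle):
    """Count rows per EFFECT value in a tab-delimited results file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["EFFECT"] for row in reader)


if __name__ == "__main__":
    # Placeholder data standing in for combined_effects.tsv.
    example = io.StringIO(
        "CHROM\tPOS\tREF\tALT\tEFFECT\tGENE\n"
        "contig1\t1200\tA\tG\tsynonymous\tgeneA\n"
        "contig1\t1534\tC\tT\tnonsynonymous\tgeneA\n"
        "contig1\t9001\tG\tA\tintergenic\t\n"
        "contig1\t9450\tT\tC\tsynonymous\tgeneB\n"
    )
    for effect, n in count_effects(example).items():
        print(effect, n)
```

For a real run, replace the `io.StringIO` object with `open("results/combined_effects.tsv", newline="")`, adjusting the path to the chosen output directory.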
Using the provided test data:
# Process single file
python vep.py test1.vcf data/KX894508/KX894508.1_155920_2012_Israel_19_Dec_2012.fna data/KX894508/genes.gbk
# Process multiple files
python batch.py --vcf_dir GIRAFFE_1 --fasta data/KX894508/KX894508.1_155920_2012_Israel_19_Dec_2012.fna --genbank data/KX894508/genes.gbk
The provided test case uses:
- The KX894508 genome (Lumpy skin disease virus isolate)
- Gene annotations from genes.gbk
- Variants from test1.vcf
- This tool currently only handles simple SNPs (no indels or complex variants)
- Focuses on protein-coding regions (doesn't analyze UTRs, introns, etc.)
- Assumes the GenBank annotation and the FASTA reference describe the same sequence, so that annotation coordinates map directly onto the reference
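
Because only simple SNPs are supported, it can help to check which VCF records qualify before running the tool. The sketch below shows one plausible definition of a simple SNP based on the REF and ALT fields; it is an assumption about what such a filter might look like, not the check the tool itself performs, and `is_simple_snp` is a hypothetical name.

```python
def is_simple_snp(ref, alt):
    """Return True when REF and ALT are single, different A/C/G/T bases.

    Indels, multi-base substitutions, multi-allelic ALT fields such as
    "G,T", and symbolic alleles such as "<DEL>" all return False.
    """
    bases = {"A", "C", "G", "T"}
    ref, alt = ref.upper(), alt.upper()
    return ref in bases and alt in bases and ref != alt


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "AT"), ("AT", "A"), ("A", "G,T")]:
        print(ref, alt, is_simple_snp(ref, alt))
```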