ryanjosephkamp/the-digital-mutator

The Digital Mutator — Deep Mutational Scanning Simulator

Week 20, Project 2 · Biophysics Portfolio · CS Research Self-Study

A computational deep mutational scanning (DMS) system that uses a structure-conditioned message-passing neural network (MPNN) to predict the fitness effect of every possible single amino acid substitution in a protein. It scores all $19L$ single substitutions of a length-$L$ protein from backbone structure alone, validates predictions against fixed benchmark DMS datasets (GB1, GFP), and provides interactive visualization through a Streamlit dashboard.

Pretrained weights are included from both GPU-accelerated PyTorch training (Colab) and the NumPy finite-difference engine; all were transferred from the companion inverse folding project (Week 20, Project 1).

The project incorporates six model improvement phases: real experimental DMS validation data; improved NumPy training with the Adam optimizer and mini-batching; real PDB training data infrastructure; PyTorch training improvements (cosine annealing, label smoothing, gradient clipping); structural noise data augmentation; and autoregressive decoding for context-dependent sequence design.


Overview

| Feature | Description |
|---|---|
| DMS Scoring | $\Delta\log P = \log P(\text{mutant} \mid \text{structure}) - \log P(\text{wildtype} \mid \text{structure})$ |
| Autoregressive Decoding | Context-dependent sequence probabilities with configurable decoding order |
| MPNN Architecture | $k$-nearest-neighbor graph with 29-dimensional edge features, gated message passing |
| Pretrained Models | 6 model presets: 3 PyTorch GPU-trained + 3 NumPy-trained weight sets |
| NumPy Training | Improved Adam optimizer with mini-batching, proper gradient accumulation |
| Data Augmentation | Structural noise injection for training robustness |
| Real Benchmark Data | Fixed GB1 and GFP DMS datasets (non-circular validation) |
| Virtual Mutagenesis | Select any residue; see all 19 substitution effects instantly |
| Epistasis Analysis | Pairwise deviation from additive model |
| Experimental Validation | Spearman ρ against fixed benchmark DMS data (GB1, GFP) |
| Conservation Profiling | Shannon entropy, perplexity, and conservation scores |
| PDB Training Infrastructure | Parse real PDB files and create training datasets |
| 8 Preset Structures | Alpha helix, beta hairpin, Trp-cage, villin, GB1, GFP, crambin, coiled coil |
| 3D Structure Viewer | Interactive backbone visualizations colored by mean mutational effect |
| Dual Renderers | Plotly (interactive) + Matplotlib (publication) |
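
The $k$-nearest-neighbor graph listed in the overview can be illustrated with a minimal NumPy sketch. It assumes neighbors are chosen by Cα–Cα Euclidean distance, which is a common choice but is not confirmed by this README; the function name `knn_graph` and the toy coordinates are hypothetical, and the project's 29-dimensional edge features are not reproduced here.

```python
import numpy as np


def knn_graph(ca_coords: np.ndarray, k: int = 30):
    """Return (neighbor_idx, neighbor_dist) for a k-nearest-neighbor graph.

    ca_coords: (L, 3) array of C-alpha coordinates in angstroms (assumed input).
    k is capped at L - 1 so that short chains still work.
    """
    n = ca_coords.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances, shape (L, L).
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Exclude self-edges by pushing the diagonal to infinity.
    np.fill_diagonal(dist, np.inf)
    neighbor_idx = np.argsort(dist, axis=1)[:, :k]
    neighbor_dist = np.take_along_axis(dist, neighbor_idx, axis=1)
    return neighbor_idx, neighbor_dist


# Toy input: 10 residues spaced 3.8 angstroms apart along the x-axis.
coords = np.zeros((10, 3))
coords[:, 0] = 3.8 * np.arange(10)
idx, d = knn_graph(coords, k=3)

# Distances are unchanged by a rigid rotation plus translation,
# which is the property an SE(3)-invariant edge feature relies on.
theta = np.pi / 6
rot = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = coords @ rot.T + np.array([5.0, -2.0, 1.0])
_, d_moved = knn_graph(moved, k=3)
```

Because only inter-residue distances enter the graph, `d` and `d_moved` agree up to floating-point error even though the raw coordinates differ.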

Preset Protein Structures

| Preset | Residues | Structure | DMS Data |
|---|---|---|---|
| gb1_domain_56 | 56 | α/β (B1 domain of protein G) | yes |
| gfp_barrel_25 | 25 | β-barrel fragment | yes |
| trp_cage_20 | 20 | α/β/PPII miniprotein | no |
| villin_headpiece_35 | 35 | α-helical bundle | no |
| alpha_helix_20 | 20 | Ideal α-helix | no |
| beta_hairpin_16 | 16 | Two-strand β-hairpin | no |
| crambin_46 | 46 | Mixed α/β plant protein | no |
| coiled_coil_28 | 28 | α-helical coiled coil | no |

Key Results

  • Negatively skewed DFE: Most mutations are deleterious, consistent with evolutionary optimization of natural sequences
  • Conservation–entropy anticorrelation: Structurally critical positions show low entropy and high conservation
  • Epistasis detection: Structurally proximal pairs exhibit stronger non-additive effects
  • Real benchmark validation: Non-circular GB1 (1,064 mutations) and GFP (475 mutations) fixed datasets replace simulated experimental data
  • Autoregressive decoding: Context-dependent sequence probabilities that account for previously placed amino acids
  • Improved NumPy training: An Adam optimizer with momentum, mini-batching, and proper gradient accumulation replaces the naive finite-difference method
  • Data augmentation: Gaussian structural noise injection ($\sigma$ = 0.1–0.5 Å) improves training robustness
  • PDB infrastructure: Real PDB file parser and dataset builder for training on experimental structures
  • PyTorch training pipeline: Cosine annealing LR scheduler, label smoothing, gradient clipping, mixed-precision support
  • PyTorch models: 3 GPU-trained models (128/192/256 hidden, 3/4/6 layers, up to 2.8M parameters) transferred from W20P1
  • NumPy models: 3 CPU-trained models (64/128 hidden, 2/3/4 layers) using stochastic finite-difference gradients
  • Architecture alignment: MPNN architecture aligned with P1 (3-way message concat, element-wise gating) for direct weight transfer

Project Structure

week_20_project_2/
├── app.py                          # Streamlit dashboard (10 pages)
├── main.py                         # CLI entry point (6 modes)
├── requirements.txt
├── .gitignore
├── README.md
├── week_20_project_2_outline.md
├── src/
│   ├── __init__.py                 # Package re-exports
│   ├── dms_engine.py               # Core DMS engine + MPNN (~2,980 lines)
│   ├── analysis.py                 # Analysis pipelines (~1,120 lines)
│   ├── visualization.py            # Plotly + Matplotlib renderers (~2,010 lines)
│   └── pytorch_weights.py          # Pretrained weight loader (PyTorch → NumPy)
├── benchmarks/                     # Fixed experimental DMS validation data
│   ├── __init__.py
│   ├── gb1_dms.csv                 # GB1 domain: 1,064 single-point mutations
│   └── gfp_dms.csv                 # GFP barrel: 475 single-point mutations
├── training/                       # Model training infrastructure
│   ├── __init__.py
│   ├── pdb_dataset.py              # PDB file parser + dataset builder
│   └── train_mpnn.py               # PyTorch training script (Colab-ready)
├── mpnn_weights/                   # Pretrained model weights (.npz + meta .json)
│   ├── pytorch_{small,medium,large}.npz
│   ├── numpy_{quick,default,deep}.npz
│   └── ... (metadata, training history)
├── tests/
│   └── test_digital_mutator.py     # 29 test classes, 221 methods
├── docs/
│   ├── scientific_report.md
│   ├── w20p2_digital_mutator_ieee.tex
│   └── w20p2_digital_mutator_ieee.pdf
└── figures/                        # Generated, gitignored

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run the CLI

python main.py                                                 # Default DMS scan (GB1)
python main.py --dms --preset alpha_helix_20 --save            # Save figures
python main.py --validate --benchmark gfp                      # Experimental validation
python main.py --epistasis --preset beta_hairpin_16             # Epistasis analysis
python main.py --gallery --save --verbose                      # All presets
python main.py --compare --model pytorch_medium                # Compare pretrained vs random
python main.py --dms --model numpy_default --preset crambin_46  # DMS with pretrained weights
python main.py --retrain --epochs 20 --lr 0.001                # Retrain MPNN with NumPy

3. Launch the Streamlit Dashboard

streamlit run app.py

4. Run Tests

pytest tests/ -v

Theory — Deep Mutational Scanning in Brief

The mutational effect of substituting wildtype amino acid $a_{\text{wt}}$ with mutant $a_{\text{mut}}$ at position $i$ is:

$$\Delta\log P_i = \log P(a_{\text{mut}} \mid \mathbf{X}) - \log P(a_{\text{wt}} \mid \mathbf{X})$$

where $\mathbf{X}$ is the 3D backbone structure and $P(a \mid \mathbf{X})$ is computed by a message-passing neural network conditioned on $k$-nearest-neighbor protein graphs with SE(3)-invariant edge features.
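
As a minimal sketch of the formula above, the following computes the full $L \times 20$ matrix of $\Delta\log P$ values from a matrix of per-position log-probabilities. The random logits stand in for MPNN output, and the alphabetical amino-acid ordering in `AMINO_ACIDS` is an assumption rather than the project's confirmed convention; `delta_log_p` is a hypothetical helper name.

```python
import numpy as np

# Assumed ordering: the 20 standard residues, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def delta_log_p(log_probs: np.ndarray, wildtype: str) -> np.ndarray:
    """Return an (L, 20) matrix of log P(mutant | X) - log P(wildtype | X)."""
    wt_idx = np.array([AMINO_ACIDS.index(a) for a in wildtype])
    # Log-probability of the wildtype residue at each position, shape (L,).
    wt_logp = log_probs[np.arange(len(wildtype)), wt_idx]
    return log_probs - wt_logp[:, None]


# Stand-in for model output: random logits turned into log-probabilities.
rng = np.random.default_rng(42)
logits = rng.normal(size=(4, 20))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

scores = delta_log_p(log_probs, "MKTA")
```

The wildtype column of each row is zero by construction, leaving 19 informative entries per position. That is the $19L$ count from the introduction: $19 \times 56 = 1{,}064$ for GB1 and $19 \times 25 = 475$ for the GFP fragment.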


CLI Options

| Flag | Default | Description |
|---|---|---|
| --dms | | Full deep mutational scanning |
| --validate | | Validate against experimental data |
| --epistasis | | Pairwise epistasis analysis |
| --gallery | | Scan all preset structures |
| --preset | gb1_domain_56 | Preset protein structure |
| --benchmark | gb1 | Benchmark for validation (gb1, gfp) |
| --temperature | 1.0 | Softmax temperature |
| --k-neighbors | 30 | K nearest neighbors |
| --hidden-dim | 128 | MPNN hidden dimension |
| --num-layers | 3 | MPNN layers |
| --seed | 42 | Random seed |
| --top-k | 15 | Top K epistatic pairs |
| --compare | | Compare pretrained vs random-init MPNN |
| --retrain | | Retrain MPNN weights on preset structures |
| --model | | Pretrained model name (e.g., pytorch_medium) |
| --epochs | 10 | Training epochs for --retrain |
| --lr | 0.001 | Learning rate for --retrain |
| --save | | Save figures to figures/ |
| --verbose | | Additional output |
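
To illustrate what a softmax temperature does to the amino-acid distribution, here is a small sketch. It assumes the conventional definition (logits divided by the temperature before normalization); whether the engine applies `--temperature` exactly this way is not stated in this README, and the function name is hypothetical.

```python
import numpy as np


def log_softmax_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable log-softmax of logits / temperature along the last axis."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = logits / temperature
    # Subtract the row maximum so that exp() cannot overflow.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    return scaled - np.log(np.exp(scaled).sum(axis=-1, keepdims=True))


logits = np.array([2.0, 1.0, 0.0])
sharp = np.exp(log_softmax_with_temperature(logits, 0.5))    # T < 1 sharpens
neutral = np.exp(log_softmax_with_temperature(logits, 1.0))  # T = 1 is plain softmax
flat = np.exp(log_softmax_with_temperature(logits, 5.0))     # T > 1 flattens
```

Lower temperatures concentrate probability on the top-scoring residue, which widens the spread of $\Delta\log P$ values; higher temperatures push the distribution toward uniform.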

Streamlit Dashboard Pages

  1. 🏠 Home — Project overview and preset summary
  2. 🔬 The Mutation Scanner — Full DMS heatmap with effect categories
  3. 🧬 Virtual Mutagenesis — Interactive single-site exploration (all 19 substitutions)
  4. 📊 Effect Landscape — Effect distribution, mean profile, entropy
  5. 🧪 Predict Then Validate — Spearman/Pearson correlation scatter plots
  6. 🔗 Epistasis Explorer — Pairwise epistasis heatmap and table
  7. 📐 Conservation Profiler — Entropy and conservation per position
  8. ⚖️ NumPy vs PyTorch — Compare random-init vs pretrained MPNN weights side-by-side
  9. 🏗️ Preset Gallery — 3D backbone visualizations colored by mutational effect for all 8 presets
  10. 📚 Theory & Mathematics — Mathematical foundations (12 expanders)
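
The per-position quantities shown on the Conservation Profiler page can be sketched as follows. Shannon entropy $H = -\sum_a p_a \log p_a$ and perplexity $e^{H}$ are standard definitions; the conservation score used here, $1 - H/\log 20$, is an assumed normalization, since this README does not define the project's exact formula. The function name is hypothetical.

```python
import numpy as np


def conservation_profile(probs: np.ndarray):
    """Per-position entropy, perplexity, and an assumed conservation score.

    probs: (L, 20) array whose rows are amino-acid probability distributions.
    """
    # Clip only inside the log so that zero-probability entries contribute 0.
    safe = np.clip(probs, 1e-12, 1.0)
    entropy = -(probs * np.log(safe)).sum(axis=-1)
    perplexity = np.exp(entropy)
    # Assumption: conservation = 1 - H / log(20), so 0 is uniform and 1 is fixed.
    conservation = 1.0 - entropy / np.log(probs.shape[-1])
    return entropy, perplexity, conservation


uniform = np.full(20, 1.0 / 20.0)  # no preference at this position
one_hot = np.zeros(20)
one_hot[0] = 1.0                   # a single residue is fully preferred
entropy, perplexity, conservation = conservation_profile(np.stack([uniform, one_hot]))
```

The two rows mark the extremes: a uniform position has entropy $\log 20$ and perplexity 20, while a fully determined position has entropy 0 and perplexity 1.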

Pretrained Models

Six pretrained weight sets from the companion inverse folding project (Week 20, Project 1):

| Model | Backend | Hidden | Layers | Parameters | Source |
|---|---|---|---|---|---|
| pytorch_small | PyTorch (Colab GPU) | 128 | 3 | ~530K | W20P1 |
| pytorch_medium | PyTorch (Colab GPU) | 192 | 4 | ~1.2M | W20P1 |
| pytorch_large | PyTorch (Colab GPU) | 256 | 6 | ~2.8M | W20P1 |
| numpy_quick | NumPy (CPU) | 64 | 2 | ~67K | W20P1 |
| numpy_default | NumPy (CPU) | 128 | 3 | ~530K | W20P1 |
| numpy_deep | NumPy (CPU) | 128 | 4 | ~660K | W20P1 |

Weights are stored as .npz arrays with _meta.json metadata in mpnn_weights/.
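
A minimal sketch of reading such a weight set with plain NumPy and `json` follows. The helper `load_weight_set`, the `<name>_meta.json` file naming, the array keys (`W_embed`, `b_out`), and the metadata fields are all assumptions made for illustration; the project's own loader lives in `src/pytorch_weights.py` and may differ. The example writes a dummy archive to a temporary directory so that it runs without the repository.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_weight_set(directory, name: str):
    """Load <name>.npz and <name>_meta.json (assumed naming) from a directory."""
    directory = Path(directory)
    with np.load(directory / f"{name}.npz") as archive:
        # Materialize every array before the archive is closed.
        weights = {key: archive[key] for key in archive.files}
    meta = json.loads((directory / f"{name}_meta.json").read_text())
    return weights, meta


with tempfile.TemporaryDirectory() as tmp:
    # Dummy weight set with hypothetical keys, not the project's real layout.
    np.savez(Path(tmp) / "demo.npz", W_embed=np.zeros((20, 8)), b_out=np.ones(20))
    (Path(tmp) / "demo_meta.json").write_text(
        json.dumps({"hidden_dim": 8, "num_layers": 1})
    )
    weights, meta = load_weight_set(tmp, "demo")
```

Inspecting `weights.keys()` and the array shapes against the metadata is a quick sanity check before passing a weight set to the model.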


Testing

29 test classes with 221 methods covering:

  • Constants: Amino acid tables, thresholds, preset registry
  • Backbone geometry: Coordinate generation, bond lengths
  • Graph construction: KNN graph, edge/node features
  • MPNN: Weight initialization, log-probability normalization
  • DMS pipeline: Full scan, site scan, single mutation
  • Epistasis: Pairwise epistasis, additive model
  • Validation: Spearman/Pearson range, real benchmark data, simulated fallback
  • Entropy: Shannon entropy, perplexity, conservation
  • Utilities: Sequence validation, molecular weight, notation
  • Pretrained weights: Weight loading, metadata parsing, model listing, shape validation
  • Model comparison: Random vs pretrained analysis, effect correlation, entropy comparison
  • Benchmark data: Real DMS CSV loading (GB1, GFP), fallback to simulated
  • NumPy training: Adam optimizer, mini-batching, gradient accumulation, loss decrease
  • PDB dataset: PDB file parser, dataset creation, structure extraction
  • Data augmentation: Structural noise injection, noise-augmented training
  • Autoregressive decoding: AR probability computation, order-dependent results, DMS integration
  • Analysis: All pipeline functions and result containers
  • Visualization: Every Plotly and Matplotlib renderer method (including comparison charts)
  • CLI: Argument parsing, mode dispatch, --compare, --model, and --retrain flags
  • Edge cases: Tiny proteins, extreme temperatures, different seeds
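
The validation tests above check Spearman correlations, which compare rankings rather than raw values. As a sketch of why that suits DMS validation, the snippet below computes Spearman's ρ as the Pearson correlation of ranks. The numbers are made up for illustration and are not taken from the GB1 or GFP benchmark files; the helper assumes there are no tied values (a library routine such as `scipy.stats.spearmanr` handles ties).

```python
import numpy as np


def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation, assuming no tied values in x or y."""
    # argsort of argsort gives each element's 0-based rank.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Illustrative values only: predicted delta-log-P vs. a measured fitness score.
predicted = np.array([-2.1, -0.3, -1.2, 0.4, -3.0])
measured = np.array([0.10, 0.80, 0.35, 1.20, 0.02])

rho = spearman_rho(predicted, measured)
rho_reversed = spearman_rho(predicted, -measured)
```

The two arrays are on different scales, yet they order the five mutations identically, so ρ is 1 up to floating-point error. Negating one side reverses the ranking and gives −1.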

Dependencies

  • numpy >= 1.24
  • scipy >= 1.10
  • matplotlib >= 3.7
  • plotly >= 5.14
  • streamlit >= 1.28
  • pandas >= 2.0
  • pytest >= 7.3

Model Improvement Phases

Six improvement phases move the DMS engine from a pedagogical baseline toward production readiness:

| Phase | Improvement | Key Changes |
|---|---|---|
| 1 | Real experimental DMS validation | Fixed GB1/GFP benchmark CSVs replace circular simulated data |
| 2 | Improved NumPy training | Adam optimizer with momentum, mini-batch gradient accumulation |
| 3 | PDB training data infrastructure | Parse real PDB files, create training datasets with augmentation |
| 4 | PyTorch training improvements | Cosine annealing LR, label smoothing, gradient clipping, mixed precision |
| 5 | Data augmentation | Gaussian structural noise injection ($\sigma$ = 0.1–0.5 Å) |
| 6 | Autoregressive decoding | Context-dependent run_dms_ar() with configurable decoding order |
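
Phase 5 can be illustrated with a short sketch of Gaussian coordinate noise. It assumes independent, isotropic noise added to every backbone atom and a noise level drawn uniformly from the stated 0.1–0.5 Å range; the function name, the `(L, 4, 3)` array layout, and the sampling scheme are hypothetical and may not match the project's implementation.

```python
import numpy as np


def add_structural_noise(coords: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of coords with i.i.d. Gaussian noise (std = sigma, in angstroms)."""
    return coords + rng.normal(loc=0.0, scale=sigma, size=coords.shape)


rng = np.random.default_rng(0)

# Hypothetical layout: 56 residues x 4 backbone atoms (N, CA, C, O) x xyz.
backbone = np.zeros((56, 4, 3))

# Draw one noise level per training example from the 0.1-0.5 angstrom range.
sigma = rng.uniform(0.1, 0.5)
noisy = add_structural_noise(backbone, sigma, rng)
```

The function returns a new array, so the clean structure remains available for drawing a fresh perturbation on each epoch.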

Author

Ryan Kamp · Department of Computer Science, University of Cincinnati · kamprj@mail.uc.edu · GitHub

About

Week 20 Project 2: In silico deep mutational scanning simulator powered by a structure-conditioned MPNN. It predicts fitness effects of all single amino acid substitutions from backbone geometry and validates against GB1/GFP benchmarks. Includes a 10-page interactive Streamlit dashboard with 3D visualization, epistasis analysis, and pretrained model comparison.
