The Digital Mutator — Deep Mutational Scanning Simulator

Week 20, Project 2 · Biophysics Portfolio · CS Research Self-Study

A computational deep mutational scanning (DMS) system that uses a structure-conditioned message-passing neural network (MPNN) to predict the fitness effect of every possible single amino acid substitution in a protein. For a protein of length $L$, it scores all $19L$ single substitutions from backbone structure alone, validates the predictions against fixed benchmark DMS datasets (GB1, GFP), and provides interactive visualization through a Streamlit dashboard. The repository includes pretrained weights from both GPU-accelerated PyTorch training (Colab) and the NumPy finite-difference engine, transferred from the companion inverse folding project (Week 20, Project 1). It also incorporates six model improvement phases: real experimental DMS validation data, improved NumPy training with the Adam optimizer and mini-batching, real PDB training data infrastructure, PyTorch training improvements (cosine annealing, label smoothing, gradient clipping), structural noise data augmentation, and autoregressive decoding for context-dependent sequence design.

Overview

Feature	Description
DMS Scoring	ΔlogP = log P(mutant | structure) − log P(wildtype | structure)
Autoregressive Decoding	Context-dependent sequence probabilities with configurable decoding order
MPNN Architecture	$k$-nearest-neighbor graph with 29-dim edge features, gated message passing
Pretrained Models	6 model presets: 3 PyTorch GPU-trained + 3 NumPy-trained weight sets
NumPy Training	Improved Adam optimizer with mini-batching, proper gradient accumulation
Data Augmentation	Structural noise injection for training robustness
Real Benchmark Data	Fixed GB1 and GFP DMS datasets (non-circular validation)
Virtual Mutagenesis	Select any residue; see all 19 substitution effects instantly
Epistasis Analysis	Pairwise deviation from additive model
Experimental Validation	Spearman ρ against fixed benchmark DMS data (GB1, GFP)
Conservation Profiling	Shannon entropy, perplexity, and conservation scores
PDB Training Infrastructure	Parse real PDB files and create training datasets
8 Preset Structures	Alpha helix, beta hairpin, Trp-cage, villin, GB1, GFP, crambin, coiled coil
3D Structure Viewer	Interactive backbone visualizations colored by mean mutational effect
Dual Renderers	Plotly (interactive) + Matplotlib (publication)
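The MPNN row above mentions a $k$-nearest-neighbor graph. As a rough illustration only, the sketch below builds such a graph from Cα coordinates with plain NumPy. The function name `knn_graph`, the use of Cα atoms alone, and the toy helix coordinates are assumptions made for this example; the project's 29-dimensional edge features are not reproduced here.

```python
import numpy as np

def knn_graph(ca_coords: np.ndarray, k: int = 30):
    """Return (neighbor_idx, neighbor_dist) for a C-alpha trace.

    ca_coords: (L, 3) array of positions in angstroms.
    k is capped at L - 1 so short peptides still work.
    """
    n = ca_coords.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances, shape (L, L)
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-edges
    idx = np.argsort(dist, axis=1)[:, :k]     # k closest residues per row
    nbr_dist = np.take_along_axis(dist, idx, axis=1)
    return idx, nbr_dist

# Toy input: a helix-like curve (hypothetical coordinates, not a preset)
t = np.arange(20)
coords = np.stack(
    [2.3 * np.cos(1.745 * t), 2.3 * np.sin(1.745 * t), 1.5 * t], axis=1
)
idx, d = knn_graph(coords, k=5)
print(idx.shape, d.shape)
```

Because the graph is built from inter-residue distances only, rotating or translating `coords` leaves the neighbor distances unchanged, which is the kind of invariance the edge features are described as having.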

Preset Protein Structures

Preset	Residues	Structure	DMS Data
`gb1_domain_56`	56	α/β (B1 domain of protein G)	yes
`gfp_barrel_25`	25	β-barrel fragment	yes
`trp_cage_20`	20	α/β/PPII miniprotein	no
`villin_headpiece_35`	35	α-helical bundle	no
`alpha_helix_20`	20	Ideal α-helix	no
`beta_hairpin_16`	16	Two-strand β-hairpin	no
`crambin_46`	46	Mixed α/β plant protein	no
`coiled_coil_28`	28	α-helical coiled coil	no
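The residue counts in this table determine the size of each scan: with 19 alternative amino acids per position, a protein of length $L$ has $19L$ single substitutions. The sketch below enumerates them and checks the count for the two presets that have DMS data. The helper `enumerate_single_mutants` and the poly-alanine stand-in sequences are illustrative assumptions, not the project's API or the real preset sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def enumerate_single_mutants(sequence):
    """Yield (wildtype, 1-based position, mutant) for every single substitution."""
    for pos, wt in enumerate(sequence, start=1):
        for mut in AMINO_ACIDS:
            if mut != wt:
                yield wt, pos, mut

# Lengths taken from the preset table; only the length matters for counting
preset_lengths = {"gb1_domain_56": 56, "gfp_barrel_25": 25}
counts = {}
for name, length in preset_lengths.items():
    placeholder = "A" * length  # stand-in sequence, not the real protein
    counts[name] = sum(1 for _ in enumerate_single_mutants(placeholder))
    print(name, counts[name])
# prints: gb1_domain_56 1064
#         gfp_barrel_25 475
```

These counts match the benchmark sizes quoted for GB1 (1,064 mutations) and GFP (475 mutations).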

Key Results

Negatively skewed distribution of fitness effects (DFE): Most mutations are deleterious, consistent with evolutionary optimization of natural sequences
Conservation–entropy anticorrelation: Structurally critical positions show low entropy and high conservation
Epistasis detection: Structurally proximal pairs exhibit stronger non-additive effects
Real benchmark validation: Non-circular GB1 (1,064 mutations) and GFP (475 mutations) fixed datasets replace simulated experimental data
Autoregressive decoding: Context-dependent sequence probabilities that account for previously placed amino acids
Improved NumPy training: Adam optimizer with momentum, mini-batching, and proper gradient accumulation replaces naive finite-difference method
Data augmentation: Gaussian structural noise injection ($\sigma$ = 0.1–0.5 Å) improves training robustness
PDB infrastructure: Real PDB file parser and dataset builder for training on experimental structures
PyTorch training pipeline: Cosine annealing LR scheduler, label smoothing, gradient clipping, mixed-precision support
PyTorch models: 3 GPU-trained models (128/192/256 hidden, 3/4/6 layers, up to 2.8M parameters) transferred from W20P1
NumPy models: 3 CPU-trained models (64/128/128 hidden, 2/3/4 layers) using stochastic finite-difference gradients
Architecture alignment: MPNN architecture aligned with P1 (3-way message concat, element-wise gating) for direct weight transfer
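The epistasis result above rests on the deviation of a double mutant's effect from the sum of its two single-mutant effects. A minimal sketch of that bookkeeping follows, assuming the single and double effects have already been scored and are collapsed to one value per position or position pair. The function names and the toy numbers are hypothetical, and how the project obtains double-mutant scores is not shown here.

```python
import numpy as np

def epistasis_matrix(single, double):
    """eps[i, j] = double[i, j] - (single[i] + single[j])."""
    additive = single[:, None] + single[None, :]   # additive expectation
    eps = double - additive
    np.fill_diagonal(eps, 0.0)                      # a site has no epistasis with itself
    return eps

def top_pairs(eps, top_k=15):
    """Return the top_k position pairs ranked by absolute epistasis."""
    iu, ju = np.triu_indices_from(eps, k=1)         # each unordered pair once
    order = np.argsort(-np.abs(eps[iu, ju]))[:top_k]
    return [(int(iu[o]), int(ju[o]), float(eps[iu[o], ju[o]])) for o in order]

# Toy effects for three positions (made-up values)
single = np.array([-1.0, -0.5, -2.0])
double = np.array([[0.0, -1.5, -2.0],
                   [-1.5, 0.0, -2.5],
                   [-2.0, -2.5, 0.0]])
eps = epistasis_matrix(single, double)
print(top_pairs(eps, top_k=1))
# prints: [(0, 2, 1.0)]
```

In this toy case only the pair (0, 2) deviates from additivity: the double mutant scores -2.0 where the additive model predicts -3.0, giving positive epistasis of 1.0.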

Project Structure

week_20_project_2/
├── app.py                          # Streamlit dashboard (10 pages)
├── main.py                         # CLI entry point (6 modes)
├── requirements.txt
├── .gitignore
├── README.md
├── week_20_project_2_outline.md
├── src/
│   ├── __init__.py                 # Package re-exports
│   ├── dms_engine.py               # Core DMS engine + MPNN (~2,980 lines)
│   ├── analysis.py                 # Analysis pipelines (~1,120 lines)
│   ├── visualization.py            # Plotly + Matplotlib renderers (~2,010 lines)
│   └── pytorch_weights.py          # Pretrained weight loader (PyTorch → NumPy)
├── benchmarks/                     # Fixed experimental DMS validation data
│   ├── __init__.py
│   ├── gb1_dms.csv                 # GB1 domain: 1,064 single-point mutations
│   └── gfp_dms.csv                 # GFP barrel: 475 single-point mutations
├── training/                       # Model training infrastructure
│   ├── __init__.py
│   ├── pdb_dataset.py              # PDB file parser + dataset builder
│   └── train_mpnn.py               # PyTorch training script (Colab-ready)
├── mpnn_weights/                   # Pretrained model weights (.npz + meta .json)
│   ├── pytorch_{small,medium,large}.npz
│   ├── numpy_{quick,default,deep}.npz
│   └── ... (metadata, training history)
├── tests/
│   └── test_digital_mutator.py     # 29 test classes, 221 methods
├── docs/
│   ├── scientific_report.md
│   ├── w20p2_digital_mutator_ieee.tex
│   └── w20p2_digital_mutator_ieee.pdf
└── figures/                        # Generated, gitignored

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run the CLI

python main.py                                                 # Default DMS scan (GB1)
python main.py --dms --preset alpha_helix_20 --save            # Save figures
python main.py --validate --benchmark gfp                      # Experimental validation
python main.py --epistasis --preset beta_hairpin_16             # Epistasis analysis
python main.py --gallery --save --verbose                      # All presets
python main.py --compare --model pytorch_medium                # Compare pretrained vs random
python main.py --dms --model numpy_default --preset crambin_46  # DMS with pretrained weights
python main.py --retrain --epochs 20 --lr 0.001                # Retrain MPNN with NumPy

3. Launch the Streamlit Dashboard

streamlit run app.py

4. Run Tests

pytest tests/ -v

Theory — Deep Mutational Scanning in Brief

The mutational effect of substituting wildtype amino acid $a_{\text{wt}}$ with mutant $a_{\text{mut}}$ at position $i$ is:

$$\Delta\log P_i = \log P(a_{\text{mut}} \mid \mathbf{X}) - \log P(a_{\text{wt}} \mid \mathbf{X})$$

where $\mathbf{X}$ is the 3D backbone structure and $P(a \mid \mathbf{X})$ is computed by a message-passing neural network conditioned on $k$-nearest-neighbor protein graphs with SE(3)-invariant edge features.
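To make the formula concrete, the sketch below turns a per-position logit matrix into a full $L \times 20$ table of $\Delta\log P$ values. Random logits stand in for the MPNN output, and the names `log_softmax` and `dms_scores` as well as the five-residue sequence are assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def dms_scores(log_probs, wt_sequence):
    """Return an (L, 20) matrix of log P(mutant | X) - log P(wildtype | X)."""
    wt_idx = np.array([AMINO_ACIDS.index(a) for a in wt_sequence])
    wt_lp = log_probs[np.arange(len(wt_sequence)), wt_idx]
    return log_probs - wt_lp[:, None]

rng = np.random.default_rng(42)
wt = "MTYKL"                                   # hypothetical wildtype
logits = rng.normal(size=(len(wt), 20))        # stand-in for MPNN output
scores = dms_scores(log_softmax(logits), wt)

# The wildtype column at each position is zero by construction
print(scores.shape)
```

Negative entries mark substitutions the model considers less compatible with the structure than the wildtype residue; positive entries mark substitutions it prefers.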

CLI Options

Flag	Default	Description
`--dms`	✓	Full deep mutational scanning
`--validate`		Validate against experimental data
`--epistasis`		Pairwise epistasis analysis
`--gallery`		Scan all preset structures
`--preset`	`gb1_domain_56`	Preset protein structure
`--benchmark`	`gb1`	Benchmark for validation (`gb1`, `gfp`)
`--temperature`	`1.0`	Softmax temperature
`--k-neighbors`	`30`	K nearest neighbors
`--hidden-dim`	`128`	MPNN hidden dimension
`--num-layers`	`3`	MPNN layers
`--seed`	`42`	Random seed
`--top-k`	`15`	Top K epistatic pairs
`--compare`		Compare pretrained vs random-init MPNN
`--retrain`		Retrain MPNN weights on preset structures
`--model`		Pretrained model name (e.g., `pytorch_medium`)
`--epochs`	`10`	Training epochs for `--retrain`
`--lr`	`0.001`	Learning rate for `--retrain`
`--save`		Save figures to `figures/`
`--verbose`		Additional output
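The `--temperature` option is described as a softmax temperature. As a hedged illustration of what such a parameter typically does, the sketch below rescales logits before the softmax and reports the Shannon entropy and perplexity of the result. The conservation score used here, one minus entropy divided by its maximum, is an assumed definition for this example and may differ from the project's.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis after dividing logits by the temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)     # stabilize the exponent
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def shannon_entropy(p, eps=1e-12):
    """Entropy in nats; eps guards against log(0)."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def conservation(p):
    """Assumed definition: 1 - H / log(number of classes)."""
    return 1.0 - shannon_entropy(p) / np.log(p.shape[-1])

logits = np.array([[2.0, 1.0, 0.0, -1.0]])    # toy logits, four classes
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    h = float(shannon_entropy(p)[0])
    print(f"T={T}: entropy={h:.3f} perplexity={np.exp(h):.3f} "
          f"conservation={float(conservation(p)[0]):.3f}")
```

Lower temperatures sharpen the distribution (lower entropy, higher conservation), while higher temperatures flatten it toward uniform.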

Streamlit Dashboard Pages

🏠 Home — Project overview and preset summary
🔬 The Mutation Scanner — Full DMS heatmap with effect categories
🧬 Virtual Mutagenesis — Interactive single-site exploration (all 19 substitutions)
📊 Effect Landscape — Effect distribution, mean profile, entropy
🧪 Predict Then Validate — Spearman/Pearson correlation scatter plots
🔗 Epistasis Explorer — Pairwise epistasis heatmap and table
📐 Conservation Profiler — Entropy and conservation per position
⚖️ NumPy vs PyTorch — Compare random-init vs pretrained MPNN weights side-by-side
🏗️ Preset Gallery — 3D backbone visualizations colored by mutational effect for all 8 presets
📚 Theory & Mathematics — Mathematical foundations (12 expanders)

Pretrained Models

Six pretrained weight sets are available, all transferred from the companion inverse folding project (Week 20, Project 1):

Model	Backend	Hidden	Layers	Parameters	Source
`pytorch_small`	PyTorch (Colab GPU)	128	3	~530K	W20P1
`pytorch_medium`	PyTorch (Colab GPU)	192	4	~1.2M	W20P1
`pytorch_large`	PyTorch (Colab GPU)	256	6	~2.8M	W20P1
`numpy_quick`	NumPy (CPU)	64	2	~67K	W20P1
`numpy_default`	NumPy (CPU)	128	3	~530K	W20P1
`numpy_deep`	NumPy (CPU)	128	4	~660K	W20P1

Weights are stored as `.npz` arrays with `_meta.json` metadata in `mpnn_weights/`.
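For readers who want to inspect a weight file directly, the following is a minimal sketch of reading an `.npz` archive and a JSON sidecar with NumPy and the standard library. The `<name>_meta.json` naming pattern, the array key `W_out`, and the metadata field `hidden_dim` are assumptions for this demo; the project's own loader in `src/pytorch_weights.py` may organize things differently.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def load_weights(weights_dir, name):
    """Load every array from <name>.npz plus optional <name>_meta.json."""
    weights_dir = Path(weights_dir)
    with np.load(weights_dir / f"{name}.npz") as archive:
        weights = {key: archive[key] for key in archive.files}
    meta_path = weights_dir / f"{name}_meta.json"   # assumed naming pattern
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return weights, meta

# Demo on a throwaway directory so the sketch runs without the repository
with tempfile.TemporaryDirectory() as tmp:
    np.savez(Path(tmp) / "toy_model.npz", W_out=np.zeros((128, 20)))
    (Path(tmp) / "toy_model_meta.json").write_text(json.dumps({"hidden_dim": 128}))
    weights, meta = load_weights(tmp, "toy_model")

print({k: v.shape for k, v in weights.items()}, meta)
```

Listing the keys and shapes this way is a quick check that a weight set matches the hidden dimension and layer count in the table above.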

Testing

29 test classes with 221 methods covering:

Constants: Amino acid tables, thresholds, preset registry
Backbone geometry: Coordinate generation, bond lengths
Graph construction: KNN graph, edge/node features
MPNN: Weight initialization, log-probability normalization
DMS pipeline: Full scan, site scan, single mutation
Epistasis: Pairwise epistasis, additive model
Validation: Spearman/Pearson range, real benchmark data, simulated fallback
Entropy: Shannon entropy, perplexity, conservation
Utilities: Sequence validation, molecular weight, notation
Pretrained weights: Weight loading, metadata parsing, model listing, shape validation
Model comparison: Random vs pretrained analysis, effect correlation, entropy comparison
Benchmark data: Real DMS CSV loading (GB1, GFP), fallback to simulated
NumPy training: Adam optimizer, mini-batching, gradient accumulation, loss decrease
PDB dataset: PDB file parser, dataset creation, structure extraction
Data augmentation: Structural noise injection, noise-augmented training
Autoregressive decoding: AR probability computation, order-dependent results, DMS integration
Analysis: All pipeline functions and result containers
Visualization: Every Plotly and Matplotlib renderer method (including comparison charts)
CLI: Argument parsing, mode dispatch, `--compare`, `--model`, and `--retrain` flags
Edge cases: Tiny proteins, extreme temperatures, different seeds
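The validation tests listed above check that Spearman and Pearson values stay in range. A small sketch of that kind of check, using `scipy.stats` from the dependency list, is given below. The data are synthetic and generated in place, not the GB1 or GFP benchmark, and the function name `validate` is hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def validate(predicted, measured):
    """Rank and linear correlation between predicted and measured effects."""
    rho, _ = spearmanr(predicted, measured)
    r, _ = pearsonr(predicted, measured)
    return {"spearman": float(rho), "pearson": float(r), "n": len(predicted)}

rng = np.random.default_rng(42)
measured = rng.normal(size=200)                            # synthetic "experiment"
predicted = 0.6 * measured + 0.8 * rng.normal(size=200)    # noisy synthetic "model"
result = validate(predicted, measured)
print(result)

# Spearman depends only on ranks, so any monotone transform scores 1.0
monotone = validate(np.exp(measured), measured)
print(monotone["spearman"])
```

Spearman ρ is the natural headline metric here because predicted ΔlogP values and measured fitness are on different scales, and only their ordering is expected to agree.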

Dependencies

numpy >= 1.24
scipy >= 1.10
matplotlib >= 3.7
plotly >= 5.14
streamlit >= 1.28
pandas >= 2.0
pytest >= 7.3

Model Improvement Phases

Six improvement phases move the DMS engine from a pedagogical baseline toward production readiness:

Phase	Improvement	Key Changes
1	Real experimental DMS validation	Fixed GB1/GFP benchmark CSVs replace circular simulated data
2	Improved NumPy training	Adam optimizer with momentum, mini-batch gradient accumulation
3	PDB training data infrastructure	Parse real PDB files, create training datasets with augmentation
4	PyTorch training improvements	Cosine annealing LR, label smoothing, gradient clipping, mixed precision
5	Data augmentation	Gaussian structural noise injection ($\sigma$ = 0.1–0.5 Å)
6	Autoregressive decoding	Context-dependent `run_dms_ar()` with configurable decoding order
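Phase 5 in the table describes Gaussian noise added to backbone coordinates. One way such an augmentation could look is sketched below, assuming isotropic per-atom noise and a σ drawn uniformly from the 0.1–0.5 Å range for each copy. The sampling scheme and the function names are assumptions, not the project's confirmed implementation.

```python
import numpy as np

def add_structural_noise(coords, sigma, rng):
    """Return a copy of coords with isotropic Gaussian noise (sigma in angstroms)."""
    return coords + rng.normal(scale=sigma, size=coords.shape)

def augment(coords, n_copies=4, sigma_range=(0.1, 0.5), seed=42):
    """Create n_copies noisy variants, each with its own sigma."""
    rng = np.random.default_rng(seed)
    sigmas = rng.uniform(*sigma_range, size=n_copies)   # assumed: uniform in range
    noisy = [add_structural_noise(coords, s, rng) for s in sigmas]
    return noisy, sigmas

# Toy backbone: 16 residues x 4 backbone atoms x 3 coordinates (made-up values)
backbone = np.zeros((16, 4, 3))
copies, sigmas = augment(backbone)
for s, c in zip(sigmas, copies):
    # The empirical standard deviation should be close to the sampled sigma
    print(f"sigma={s:.2f}  empirical std={c.std():.2f}")
```

The original coordinates are left untouched, so the clean structure can still be used alongside its noisy copies during training.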

Author

Ryan Kamp · Department of Computer Science, University of Cincinnati · kamprj@mail.uc.edu · GitHub
