Week 20, Project 2 · Biophysics Portfolio · CS Research Self-Study
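A computational deep mutational scanning (DMS) system that uses a structure-conditioned message-passing neural network (MPNN) to predict the fitness effect of every possible single amino acid substitution in a protein. Each substitution is scored as the log-probability difference between the mutant and the wildtype residue given the structure.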
| Feature | Description |
|---|---|
| DMS Scoring | $\Delta \log P = \log P(\text{mutant} \mid \text{structure}) - \log P(\text{wildtype} \mid \text{structure})$ |
| Autoregressive Decoding | Context-dependent sequence probabilities with configurable decoding order |
| MPNN Architecture | |
| Pretrained Models | 6 model presets: 3 PyTorch GPU-trained + 3 NumPy-trained weight sets |
| NumPy Training | Improved Adam optimizer with mini-batching, proper gradient accumulation |
| Data Augmentation | Structural noise injection for training robustness |
| Real Benchmark Data | Fixed GB1 and GFP DMS datasets (non-circular validation) |
| Virtual Mutagenesis | Select any residue; see all 19 substitution effects instantly |
| Epistasis Analysis | Pairwise deviation from additive model |
| Experimental Validation | Spearman ρ against fixed benchmark DMS data (GB1, GFP) |
| Conservation Profiling | Shannon entropy, perplexity, and conservation scores |
| PDB Training Infrastructure | Parse real PDB files and create training datasets |
| 8 Preset Structures | Alpha helix, beta hairpin, Trp-cage, villin, GB1, GFP, crambin, coiled coil |
| 3D Structure Viewer | Interactive backbone visualizations colored by mean mutational effect |
| Dual Renderers | Plotly (interactive) + Matplotlib (publication) |
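The Conservation Profiling row above lists Shannon entropy, perplexity, and conservation scores. As a minimal sketch of how these three quantities relate for a single position, the snippet below uses only the standard library. The function names and the normalization of the conservation score (1 minus entropy divided by the maximum entropy) are assumptions for illustration, not necessarily what `src/analysis.py` does.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one position's amino acid distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def perplexity(probs):
    """exp(H): the effective number of amino acids tolerated at the position."""
    return math.exp(shannon_entropy(probs))

def conservation(probs):
    """Assumed definition: 1 - H / H_max, so 1.0 is fully conserved, 0.0 is uniform."""
    return 1.0 - shannon_entropy(probs) / math.log(len(probs))

# Two hypothetical positions: one unconstrained, one dominated by a single residue.
uniform = [1 / 20] * 20
peaked = [0.81] + [0.01] * 19

for name, dist in [("uniform", uniform), ("peaked", peaked)]:
    print(name, round(perplexity(dist), 2), round(conservation(dist), 2))
```

Low entropy corresponds to high conservation under this definition, which is the anticorrelation the project reports for structurally critical positions.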
| Preset | Residues | Structure | DMS Data |
|---|---|---|---|
| `gb1_domain_56` | 56 | α/β (B1 domain of protein G) | yes |
| `gfp_barrel_25` | 25 | β-barrel fragment | yes |
| `trp_cage_20` | 20 | α/β/PPII miniprotein | no |
| `villin_headpiece_35` | 35 | α-helical bundle | no |
| `alpha_helix_20` | 20 | Ideal α-helix | no |
| `beta_hairpin_16` | 16 | Two-strand β-hairpin | no |
| `crambin_46` | 46 | Mixed α/β plant protein | no |
| `coiled_coil_28` | 28 | α-helical coiled coil | no |
- Negatively skewed DFE: Most mutations are deleterious, consistent with evolutionary optimization of natural sequences
- Conservation–entropy anticorrelation: Structurally critical positions show low entropy and high conservation
- Epistasis detection: Structurally proximal pairs exhibit stronger non-additive effects
- Real benchmark validation: Non-circular GB1 (1,064 mutations) and GFP (475 mutations) fixed datasets replace simulated experimental data
- Autoregressive decoding: Context-dependent sequence probabilities that account for previously placed amino acids
- Improved NumPy training: Adam optimizer with momentum, mini-batching, and proper gradient accumulation replaces naive finite-difference method
- Data augmentation: Gaussian structural noise injection ($\sigma$ = 0.1–0.5 Å) improves training robustness
- PDB infrastructure: Real PDB file parser and dataset builder for training on experimental structures
- PyTorch training pipeline: Cosine annealing LR scheduler, label smoothing, gradient clipping, mixed-precision support
- PyTorch models: 3 GPU-trained models (128/192/256 hidden, 3/4/6 layers, up to 2.8M parameters) transferred from W20P1
- NumPy models: 3 CPU-trained models (64/128/128 hidden, 2/3/4 layers) using stochastic finite-difference gradients
- Architecture alignment: MPNN architecture aligned with P1 (3-way message concat, element-wise gating) for direct weight transfer
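The epistasis bullet above describes non-additive effects as the deviation of a double mutant from the additive model. A minimal sketch of that calculation follows, assuming single and double mutant effects are already available as ΔlogP values. The dictionary layout, the function names, and the toy numbers are hypothetical; only the default of 15 pairs mirrors the `--top-k` flag.

```python
def epistasis(effect_double, effect_i, effect_j):
    """Deviation of the double-mutant effect from the additive expectation."""
    return effect_double - (effect_i + effect_j)

def top_epistatic_pairs(singles, doubles, top_k=15):
    """Rank site pairs by the magnitude of their epistasis.

    singles: {site: effect}, doubles: {(site_i, site_j): effect}
    """
    scored = [
        (pair, epistasis(effect, singles[pair[0]], singles[pair[1]]))
        for pair, effect in doubles.items()
    ]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_k]

# Toy effects for three sites (illustrative values, not model output).
singles = {3: -1.0, 7: -0.5, 20: -2.0}
doubles = {(3, 7): -2.5, (3, 20): -3.0, (7, 20): -2.0}

ranked = top_epistatic_pairs(singles, doubles, top_k=2)
for pair, eps in ranked:
    print(pair, eps)
```

A value of zero means the two mutations combine additively; negative values mean the pair is worse than the sum of its parts, positive values mean it is better.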
week_20_project_2/
├── app.py # Streamlit dashboard (10 pages)
├── main.py # CLI entry point (6 modes)
├── requirements.txt
├── .gitignore
├── README.md
├── week_20_project_2_outline.md
├── src/
│ ├── __init__.py # Package re-exports
│ ├── dms_engine.py # Core DMS engine + MPNN (~2,980 lines)
│ ├── analysis.py # Analysis pipelines (~1,120 lines)
│ ├── visualization.py # Plotly + Matplotlib renderers (~2,010 lines)
│ └── pytorch_weights.py # Pretrained weight loader (PyTorch → NumPy)
├── benchmarks/ # Fixed experimental DMS validation data
│ ├── __init__.py
│ ├── gb1_dms.csv # GB1 domain: 1,064 single-point mutations
│ └── gfp_dms.csv # GFP barrel: 475 single-point mutations
├── training/ # Model training infrastructure
│ ├── __init__.py
│ ├── pdb_dataset.py # PDB file parser + dataset builder
│ └── train_mpnn.py # PyTorch training script (Colab-ready)
├── mpnn_weights/ # Pretrained model weights (.npz + meta .json)
│ ├── pytorch_{small,medium,large}.npz
│ ├── numpy_{quick,default,deep}.npz
│ └── ... (metadata, training history)
├── tests/
│ └── test_digital_mutator.py # 29 test classes, 221 methods
├── docs/
│ ├── scientific_report.md
│ ├── w20p2_digital_mutator_ieee.tex
│ └── w20p2_digital_mutator_ieee.pdf
└── figures/ # Generated, gitignored
pip install -r requirements.txt
python main.py # Default DMS scan (GB1)
python main.py --dms --preset alpha_helix_20 --save # Save figures
python main.py --validate --benchmark gfp # Experimental validation
python main.py --epistasis --preset beta_hairpin_16 # Epistasis analysis
python main.py --gallery --save --verbose # All presets
python main.py --compare --model pytorch_medium # Compare pretrained vs random
python main.py --dms --model numpy_default --preset crambin_46 # DMS with pretrained weights
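python main.py --retrain --epochs 20 --lr 0.001 # Retrain MPNN with NumPy
streamlit run app.py
pytest tests/ -v

The mutational effect of substituting wildtype amino acid $a_{\text{wt}}$ with mutant amino acid $a_{\text{mut}}$ at position $i$ is

$$\Delta \log P_i(a_{\text{wt}} \to a_{\text{mut}}) = \log P(a_{\text{mut}} \mid \text{structure}) - \log P(a_{\text{wt}} \mid \text{structure})$$

where $P(a \mid \text{structure})$ is the MPNN's structure-conditioned probability of amino acid $a$ at position $i$.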
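The DMS score is a difference of log-probabilities between the mutant and the wildtype residue, given the structure. The sketch below shows one way to compute it for all 19 substitutions at a single site, starting from per-position logits and a softmax temperature (the role of `--temperature`). The helper names and the example logits are hypothetical; the real engine obtains its logits from the MPNN.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]

def site_scan(logits, wildtype, temperature=1.0):
    """Delta log P for each of the 19 substitutions away from the wildtype."""
    log_p = log_softmax(logits, temperature)
    wt_log_p = log_p[AMINO_ACIDS.index(wildtype)]
    return {
        aa: lp - wt_log_p
        for aa, lp in zip(AMINO_ACIDS, log_p)
        if aa != wildtype
    }

# Hypothetical logits for one position that strongly prefers tryptophan.
logits = [0.0] * 20
logits[AMINO_ACIDS.index("W")] = 3.0
logits[AMINO_ACIDS.index("F")] = 1.0

effects = site_scan(logits, wildtype="W")
print(len(effects), effects["F"], effects["A"])
```

Because the normalizing constant cancels in the difference, the score reduces to the logit gap divided by the temperature, so raising the temperature shrinks every effect toward zero.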
| Flag | Default | Description |
|---|---|---|
| `--dms` | ✓ | Full deep mutational scanning |
| `--validate` | | Validate against experimental data |
| `--epistasis` | | Pairwise epistasis analysis |
| `--gallery` | | Scan all preset structures |
| `--preset` | `gb1_domain_56` | Preset protein structure |
| `--benchmark` | `gb1` | Benchmark for validation (gb1, gfp) |
| `--temperature` | 1.0 | Softmax temperature |
| `--k-neighbors` | 30 | K nearest neighbors |
| `--hidden-dim` | 128 | MPNN hidden dimension |
| `--num-layers` | 3 | MPNN layers |
| `--seed` | 42 | Random seed |
| `--top-k` | 15 | Top K epistatic pairs |
| `--compare` | | Compare pretrained vs random-init MPNN |
| `--retrain` | | Retrain MPNN weights on preset structures |
| `--model` | | Pretrained model name (e.g., pytorch_medium) |
| `--epochs` | 10 | Training epochs for `--retrain` |
| `--lr` | 0.001 | Learning rate for `--retrain` |
| `--save` | | Save figures to figures/ |
| `--verbose` | | Additional output |
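For readers who want to script against these flags, the following is a sketch of how the flag table could be declared with `argparse`. Flag names and defaults are taken from the table; the `build_parser` and `selected_mode` helpers, the `choices` restriction on `--benchmark`, and the rule that DMS runs when no other mode flag is set are assumptions, and the actual `main.py` may be organized differently.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Computational deep mutational scanning (sketch)"
    )
    # Boolean switches: absent means False.
    switches = {
        "--dms": "Full deep mutational scanning",
        "--validate": "Validate against experimental data",
        "--epistasis": "Pairwise epistasis analysis",
        "--gallery": "Scan all preset structures",
        "--compare": "Compare pretrained vs random-init MPNN",
        "--retrain": "Retrain MPNN weights on preset structures",
        "--save": "Save figures to figures/",
        "--verbose": "Additional output",
    }
    for flag, text in switches.items():
        parser.add_argument(flag, action="store_true", help=text)
    # Valued options with the defaults listed in the table.
    parser.add_argument("--preset", default="gb1_domain_56")
    parser.add_argument("--benchmark", default="gb1", choices=["gb1", "gfp"])
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--k-neighbors", type=int, default=30)
    parser.add_argument("--hidden-dim", type=int, default=128)
    parser.add_argument("--num-layers", type=int, default=3)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--top-k", type=int, default=15)
    parser.add_argument("--model", default=None)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser

def selected_mode(args):
    """Assumption: DMS is the fallback when no other mode flag is given."""
    for mode in ("validate", "epistasis", "gallery", "compare", "retrain"):
        if getattr(args, mode):
            return mode
    return "dms"

args = build_parser().parse_args(["--validate", "--benchmark", "gfp"])
print(selected_mode(args), args.benchmark, args.k_neighbors)  # validate gfp 30
```

Note that `argparse` converts hyphens in option names to underscores, so `--k-neighbors` is read as `args.k_neighbors`.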
- 🏠 Home — Project overview and preset summary
- 🔬 The Mutation Scanner — Full DMS heatmap with effect categories
- 🧬 Virtual Mutagenesis — Interactive single-site exploration (all 19 substitutions)
- 📊 Effect Landscape — Effect distribution, mean profile, entropy
- 🧪 Predict Then Validate — Spearman/Pearson correlation scatter plots
- 🔗 Epistasis Explorer — Pairwise epistasis heatmap and table
- 📐 Conservation Profiler — Entropy and conservation per position
- ⚖️ NumPy vs PyTorch — Compare random-init vs pretrained MPNN weights side-by-side
- 🏗️ Preset Gallery — 3D backbone visualizations colored by mutational effect for all 8 presets
- 📚 Theory & Mathematics — Mathematical foundations (12 expanders)
6 pretrained weight sets from the companion inverse folding project (Week 20, Project 1):
| Model | Backend | Hidden | Layers | Parameters | Source |
|---|---|---|---|---|---|
| `pytorch_small` | PyTorch (Colab GPU) | 128 | 3 | ~530K | W20P1 |
| `pytorch_medium` | PyTorch (Colab GPU) | 192 | 4 | ~1.2M | W20P1 |
| `pytorch_large` | PyTorch (Colab GPU) | 256 | 6 | ~2.8M | W20P1 |
| `numpy_quick` | NumPy (CPU) | 64 | 2 | ~67K | W20P1 |
| `numpy_default` | NumPy (CPU) | 128 | 3 | ~530K | W20P1 |
| `numpy_deep` | NumPy (CPU) | 128 | 4 | ~660K | W20P1 |
Weights are stored as .npz arrays with _meta.json metadata in mpnn_weights/.
29 test classes with 221 methods covering:
- Constants: Amino acid tables, thresholds, preset registry
- Backbone geometry: Coordinate generation, bond lengths
- Graph construction: KNN graph, edge/node features
- MPNN: Weight initialization, log-probability normalization
- DMS pipeline: Full scan, site scan, single mutation
- Epistasis: Pairwise epistasis, additive model
- Validation: Spearman/Pearson range, real benchmark data, simulated fallback
- Entropy: Shannon entropy, perplexity, conservation
- Utilities: Sequence validation, molecular weight, notation
- Pretrained weights: Weight loading, metadata parsing, model listing, shape validation
- Model comparison: Random vs pretrained analysis, effect correlation, entropy comparison
- Benchmark data: Real DMS CSV loading (GB1, GFP), fallback to simulated
- NumPy training: Adam optimizer, mini-batching, gradient accumulation, loss decrease
- PDB dataset: PDB file parser, dataset creation, structure extraction
- Data augmentation: Structural noise injection, noise-augmented training
- Autoregressive decoding: AR probability computation, order-dependent results, DMS integration
- Analysis: All pipeline functions and result containers
- Visualization: Every Plotly and Matplotlib renderer method (including comparison charts)
- CLI: Argument parsing, mode dispatch, `--compare`, `--model`, and `--retrain` flags
- Edge cases: Tiny proteins, extreme temperatures, different seeds
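The validation tests in the list above check that Spearman and Pearson correlations fall in the valid range. As an illustration of what such a check exercises, here is a dependency-free sketch of Spearman's ρ computed as the Pearson correlation of average ranks. The function names and sample values are hypothetical, and the project itself may well rely on SciPy for this.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def spearman_rho(predicted, measured):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))

# Illustrative predicted effects vs. measured fitness (not benchmark data).
predicted = [-3.1, -0.2, -1.5, 0.4]
measured = [-2.0, -0.1, -0.9, 0.3]
rho = spearman_rho(predicted, measured)
print(rho)
```

Because only the ordering matters, ρ rewards a model that ranks mutations correctly even when its ΔlogP scale differs from the experimental fitness scale.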
- numpy >= 1.24
- scipy >= 1.10
- matplotlib >= 3.7
- plotly >= 5.14
- streamlit >= 1.28
- pandas >= 2.0
- pytest >= 7.3
Six systematic improvement phases take the DMS engine from a pedagogical baseline toward production readiness:
| Phase | Improvement | Key Changes |
|---|---|---|
| 1 | Real experimental DMS validation | Fixed GB1/GFP benchmark CSVs replace circular simulated data |
| 2 | Improved NumPy training | Adam optimizer with momentum, mini-batch gradient accumulation |
| 3 | PDB training data infrastructure | Parse real PDB files, create training datasets with augmentation |
| 4 | PyTorch training improvements | Cosine annealing LR, label smoothing, gradient clipping, mixed precision |
| 5 | Data augmentation | Gaussian structural noise injection ($\sigma$ = 0.1–0.5 Å) |
| 6 | Autoregressive decoding | Context-dependent run_dms_ar() with configurable decoding order |
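Phase 2 pairs an Adam optimizer with mini-batch gradient accumulation. The sketch below illustrates the idea on a toy linear regression in pure Python: per-example gradients are accumulated over each mini-batch, then a single bias-corrected Adam update is applied. The class, the toy data, and the hyperparameters are assumptions chosen for a small demo; they do not reproduce the project's NumPy trainer.

```python
import math

class Adam:
    """Adam with bias-corrected first and second moment estimates."""

    def __init__(self, n_params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params
        self.v = [0.0] * n_params
        self.t = 0

    def step(self, params, grads):
        """Update params in place from one accumulated gradient."""
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

def predict(params, x):
    return sum(w * xi for w, xi in zip(params, x))

def accumulated_gradient(params, batch):
    """Mean gradient of 0.5 * (w.x - y)^2, accumulated over a mini-batch."""
    grads = [0.0] * len(params)
    for x, y in batch:
        err = predict(params, x) - y
        for i, xi in enumerate(x):
            grads[i] += err * xi / len(batch)
    return grads

def mean_loss(params, data):
    return sum(0.5 * (predict(params, x) - y) ** 2 for x, y in data) / len(data)

# Toy data generated by y = 2*x0 - 1*x1, split into two mini-batches.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
batches = [data[:2], data[2:]]

params = [0.0, 0.0]
optimizer = Adam(n_params=2, lr=0.05)  # larger than the CLI default, for a short demo
initial_loss = mean_loss(params, data)
for epoch in range(200):
    for batch in batches:
        optimizer.step(params, accumulated_gradient(params, batch))
final_loss = mean_loss(params, data)
print(initial_loss, final_loss)
```

Averaging the gradient over the batch before the update keeps the step size independent of the batch size, which is what makes the learning rate transferable across batch configurations.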
Ryan Kamp Department of Computer Science, University of Cincinnati kamprj@mail.uc.edu · GitHub