
CRC-Screen

Screens DNA-synthesis orders for biosecurity-relevant proteins.

Three signals: k-mer Jaccard homology, a 5-LLM judge panel, and an embedding distance to known hazards. A monotone logistic regression combines them; Conformal Risk Control (CRC) picks the threshold. Tested leave-one-taxonomic-family-out (LOTO) on UniProt KW-0800. Homology and embedding are recomputed each fold so the held-out family can't leak into its own scoring.

from code.screening.crc import crc_threshold, empirical_fnr

res = crc_threshold(cal_scores, cal_labels, alpha=0.05)
fnr = empirical_fnr(test_scores, test_labels, res.tau_hat)

Figure: per-fold FNR vs. the CRC bound.

On a 200-hazard / 400-benign subsample of UniProt KW-0800, LOTO at α = 0.05: test FNR is 0% on every fold; test FPR is 0% on 9 of 10 folds (Actiniidae has one false flag out of 20 benigns). The k-mer Jaccard baseline by itself goes to 100% FPR. The bound is loose at this scale: n_cal_haz averages 55.5 per fold, giving a slack floor of 1.77%, so the 0% FNR is empirical, not certified.

Paper: arXiv:XXXX.XXXXX.


Install

git clone https://github.com/najmulhasan-code/crc-screen.git
cd crc-screen
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

The LLM panel and embedding signals go through OpenRouter; calls are cached on disk.
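The disk cache means re-running a stage never repeats a paid API call. A minimal sketch of that idea (hypothetical helper and cache path, not the repo's actual implementation):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache/llm")  # hypothetical location

def cached_call(prompt: str, model: str, call_fn) -> str:
    """Return a cached response if this (model, prompt) pair was seen
    before; otherwise invoke call_fn once and persist the result."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_fn(model=model, prompt=prompt)
    path.write_text(json.dumps({"model": model, "response": response}))
    return response
```

Keying on a hash of (model, prompt) makes the cache insensitive to run order and safe to share across stages.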


Quickstart

python -m code.run --stage data
python -m code.run --stage signal1 --n_sample 600
python -m code.run --stage signal2 --n_sample 600
python -m code.run --stage signal3 --n_sample 600
python recompute_signal2.py
python recompute_eval.py
python -m code.figures._compute_demo_fold
python -m code.figures.make_all

--stage all chains data through eval but does not run the refusal-aware reaggregation; run recompute_signal2.py and recompute_eval.py afterwards before building the figures.


Usage

Pick a CRC threshold from a calibration set and apply it to test scores. res.tau_hat is the largest tau satisfying R_hat(tau) + B/(n+1) ≤ alpha. residual_swap_tv is a histogram proxy for the calibration-vs-test score shift.

from code.screening.crc import crc_threshold, empirical_fnr, empirical_fpr, residual_swap_tv

res = crc_threshold(cal_scores, cal_labels, alpha=0.05, B=1.0)
fnr = empirical_fnr(test_scores, test_labels, res.tau_hat)
fpr = empirical_fpr(test_scores, test_labels, res.tau_hat)
tv  = residual_swap_tv(cal_scores, test_scores)
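The threshold rule itself is small enough to sketch. A minimal, illustrative version (not the repo's crc_threshold), assuming higher score = more hazardous and "flag if score ≥ tau":

```python
import numpy as np

def crc_threshold_sketch(cal_scores, cal_labels, alpha=0.05, B=1.0):
    """Largest tau with R_hat(tau) + B/(n+1) <= alpha, where R_hat is
    the empirical FNR on calibration hazards (class-conditional)."""
    haz = np.asarray(cal_scores)[np.asarray(cal_labels) == 1]
    n = len(haz)
    slack = B / (n + 1)
    # Candidate taus: the hazard scores themselves, plus -inf (flag everything).
    candidates = np.sort(np.concatenate(([-np.inf], haz)))
    tau_hat = -np.inf
    for tau in candidates:
        fnr = np.mean(haz < tau)  # hazards scoring below tau are missed
        if fnr + slack <= alpha:
            tau_hat = tau  # candidates ascend, so this keeps the largest feasible
    return tau_hat
```

Because the FNR is non-decreasing in tau, a linear sweep over sorted candidates finds the largest feasible threshold; when alpha < B/(n+1) nothing is feasible and the sketch falls back to flagging everything.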

Fit the monotone aggregator on a signal matrix S of shape (n, k). If a learned coefficient comes back negative, that signal is dropped and the model refit; this repeats until all coefficients are non-negative. CRC needs the loss monotone in tau, which means the score has to be monotone in every signal.

from code.screening.aggregator import fit_aggregator

agg = fit_aggregator(S_train, y_train)
scores = agg.score(S)
print(agg.coef, agg.intercept)
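The drop-and-refit loop can be sketched with a plain gradient-descent logistic fit. Both helpers below are illustrative, not the repo's fit_aggregator:

```python
import numpy as np

def fit_logistic(S, y, lr=0.5, steps=2000):
    """Plain logistic regression via gradient descent (illustrative only)."""
    n, k = S.shape
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(S @ w + b)))
        g = p - y
        w -= lr * (S.T @ g) / n
        b -= lr * g.mean()
    return w, b

def fit_monotone_aggregator(S, y):
    """Refit after dropping any signal whose learned weight is negative,
    until every remaining coefficient is non-negative."""
    keep = list(range(S.shape[1]))
    while keep:
        w, b = fit_logistic(S[:, keep], y)
        if (w >= 0).all():
            return keep, w, b
        keep = [j for j, wj in zip(keep, w) if wj >= 0]
    return [], np.zeros(0), 0.0
```

The loop terminates because each refit either returns or strictly shrinks the kept set, and the returned weights are non-negative by construction, which is what keeps the aggregate score monotone in every surviving signal.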

Run LOTO end-to-end. s2 is the LLM panel score vector computed earlier; embs is the raw embedding matrix. s_hom and s_emb are recomputed inside each fold from train-fold hazards only, so the held-out family never sees itself.

from code.screening.loto import run_loto, run_signal_ablation
from code.data.corpus import loto_splits

splits = loto_splits(df)
loto = run_loto(df, sequences, s2, embs, splits, alphas=[0.10, 0.05, 0.01])
abl  = run_signal_ablation(df, sequences, s2, embs, splits, alpha=0.05)

How it works

Pipeline

Two signals (s_LLM, s_emb) are computed from the public UniProt annotation text; one (s_hom) is computed from the sequence. A monotone logistic regression combines them. CRC picks the threshold over the hazard subset (class-conditional).

| Signal | Reads | How |
| --- | --- | --- |
| s_hom | Sequence | k-mer Jaccard (k=5), self-match dropped, rank-normalised. Uses DIAMOND if you have it; otherwise pure Python. |
| s_LLM | Text + model priors | 5 models × 2 runs at T=0.7. Median over runs, trimmed mean across models (drop the min and max). One model per provider; five Claudes agreeing is one vote, not five. |
| s_emb | Annotation as labelled fields | text-embedding-3-large, L2-normalised. K-means on train-fold hazards (K = min(8, n_train_haz/5)), max cosine to any centroid. |
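The s_LLM aggregation described above (median over a model's runs, then a trimmed mean across models) is a few lines; this sketch assumes a 5×2 score matrix and is not the repo's code:

```python
import numpy as np

def aggregate_panel(scores):
    """Collapse a (5 models x 2 runs) score matrix into one panel score:
    median over each model's runs, then a trimmed mean across models
    that drops the single min and max."""
    scores = np.asarray(scores, dtype=float)
    per_model = np.median(scores, axis=1)  # robust to one flaky run
    trimmed = np.sort(per_model)[1:-1]     # drop min and max model
    return trimmed.mean()
```

Dropping the extremes means no single model, however confident, can flip the panel on its own.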

Both s_hom and s_emb depend on which hazards are in the reference. If you compute them once on the full corpus and reuse those values across folds, the held-out family bleeds into its own scoring through the reference. They get recomputed per fold.
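The per-fold recomputation of s_hom, in pure-Python form (an illustrative sketch: the repo additionally rank-normalises and can use DIAMOND):

```python
def kmer_set(seq, k=5):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def s_hom_fold(query_seqs, train_hazard_seqs, k=5):
    """Score each query by its best k-mer Jaccard against TRAIN-fold
    hazards only, with exact self-matches dropped so a sequence never
    scores against itself."""
    ref = [(s, kmer_set(s, k)) for s in train_hazard_seqs]
    scores = []
    for q in query_seqs:
        qk = kmer_set(q, k)
        best = 0.0
        for ref_seq, rk in ref:
            if ref_seq == q:  # self-match dropped
                continue
            union = qk | rk
            if union:
                best = max(best, len(qk & rk) / len(union))
        scores.append(best)
    return scores
```

Passing only the train-fold hazards as the reference is the whole anti-leakage mechanism: the held-out family's sequences simply are not in `ref`.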

Signal ablation

Across the 7 non-empty signal subsets, only LLM + embedding hits 0% FNR / 0% FPR. Adding homology back makes it slightly worse (0.5% mean FPR, no recall gain): out-of-family, the homology score is mostly noise, and CRC pays for that noise with a slightly lower threshold.

CRC slack frontier

The CRC bound's finite-sample slack is 1/(n_cal + 1). At the n=600 demo, n_cal_haz averages 55.5 per fold, so the certifiable α floors at 1.77%. Reaching α=10⁻³ needs n_cal_haz ≥ 999, roughly 18× more calibration data than we have. The full reviewed UniProt KW-0800 corpus (~6,000 toxins) is large enough.
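The slack arithmetic above as two illustrative helpers (not repo API):

```python
import math

def certifiable_alpha_floor(n_cal_haz, B=1.0):
    """Smallest alpha the CRC bound can certify with n hazard calibration
    points: even at zero empirical FNR the bound pays B/(n+1) slack."""
    return B / (n_cal_haz + 1)

def n_needed(alpha, B=1.0):
    """Calibration hazards needed before alpha is certifiable at all."""
    return math.ceil(B / alpha - 1)
```

At the demo's average of 55.5 calibration hazards, certifiable_alpha_floor gives 1/56.5 ≈ 1.77%, and n_needed(1e-3) gives 999, matching the numbers above.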


Scope

The demo runs on n=600 (200 hazards, 400 benigns) of UniProt KW-0800 with one random seed. Production scale is ~6,000 hazards in the full reviewed corpus, which is enough to certify α=10⁻³.

The k-mer Jaccard baseline reported here is a research stand-in for "what a naive sequence-similarity screener does on this corpus." It is not a comparison to SecureDNA, IGSC member screening, or any deployed system; those use curated reference sets and expert-tuned thresholds we cannot inspect.

The code does not generate, design, or modify biological sequences. No hazardous sequence data is shipped. The thresholds and weights here are tied to the demo and will not transfer to a production screener.


Citation

@misc{hasan2026crcscreen,
  author       = {Najmul Hasan},
  title        = {{CRC-Screen}: Certified {DNA}-Synthesis Hazard Screening Under Taxonomic Shift},
  year         = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}

License

MIT.
