This repo contains a complete small-scale mechanistic interpretability project on refusal behavior in Qwen2.5-0.5B-Instruct. The core question is whether sparse feature families can separately control correct refusal on unsafe prompts and spurious over-refusal on benign prompts with dangerous surface forms. On a 400-example benchmark, the model refuses 70% of unsafe prompts and 12% of over-refusal prompts. Sparse autoencoders recover high-fidelity feature dictionaries, and amplifying one refusal-feature family raises unsafe refusal on the held-out split from 0.80 to 0.85. The strongest negative result is also the most important: suppressing the top over-refusal family does not improve benign compliance.
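As a rough illustration of what "amplifying" or "suppressing" a feature family can mean, the sketch below steers a residual vector along the decoder directions of selected sparse-autoencoder features. It is a minimal pure-Python sketch under stated assumptions, not the code in experiments/run_interventions.py: the names `encode` and `steer_family`, the ReLU encoder, and the one-row-per-feature weight layout are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def encode(x: Sequence[float], w_enc: Matrix, b_enc: Sequence[float]) -> Vector:
    """Sparse feature activations: ReLU(w_enc @ x + b_enc).

    Assumption: w_enc has one row per feature.
    """
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def steer_family(
    x: Sequence[float],
    w_enc: Matrix,
    b_enc: Sequence[float],
    w_dec: Matrix,
    family: Sequence[int],
    scale: float,
) -> Vector:
    """Rescale the features in `family` by `scale` and write the change
    back into the residual vector.

    Assumption: w_dec has one row per feature (its decoder direction).
    scale > 1 amplifies, scale = 0 suppresses, scale = 1 is a no-op.
    Only the *difference* is added, so the part of x that the SAE does
    not reconstruct is left untouched.
    """
    feats = encode(x, w_enc, b_enc)
    out = list(x)
    for i in family:
        delta = (scale - 1.0) * feats[i]
        for j, d in enumerate(w_dec[i]):
            out[j] += delta * d
    return out


# Toy example: 2-d residual, 2 features, identity encoder and decoder.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
amplified = steer_family([1.0, 2.0], identity, zeros, identity, family=[0], scale=2.0)
suppressed = steer_family([1.0, 2.0], identity, zeros, identity, family=[0], scale=0.0)
```

In this toy setup, amplifying feature 0 doubles its contribution to the first coordinate, and suppressing it removes that contribution. A feature that is inactive on a given input (zero after the ReLU) is unaffected by either operation.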
- Apple M1 Pro or similar
- 16 GB RAM
- ~8 GB free disk
- Python 3.11
- TeX Live / pdflatex for paper and poster builds
From a fresh clone on a machine with the packages in requirements.txt and pdflatex available:
make all

Step-by-step:
make data
make baseline
make activations
make train_sae
make experiments
make figures
make paper
make poster

Notes:
- The benchmark builder is deterministic with seed 7.
- Baseline generation uses MPS when available.
- Activation extraction is more stable on CPU on this machine because MPS produced NaNs in saved activation arrays.
- Large activation arrays and SAE checkpoints are excluded from git via .gitignore.
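Because MPS produced NaNs in saved activation arrays, it may be worth validating a dump before training on it. The following is a minimal standard-library sketch of such a check; `count_non_finite` and `check_activations` are hypothetical helper names, not functions from this repo. On real arrays, NumPy's `numpy.isfinite` would be the natural equivalent.

```python
import math
from typing import Iterable


def count_non_finite(values: Iterable[float]) -> int:
    """Count NaN and +/-inf entries in a flat sequence of floats."""
    return sum(1 for v in values if not math.isfinite(v))


def check_activations(values: Iterable[float], name: str = "activations") -> None:
    """Raise if any value is NaN or infinite, otherwise do nothing."""
    bad = count_non_finite(values)
    if bad:
        raise ValueError(
            f"{name}: {bad} non-finite value(s); consider re-extracting on CPU"
        )


# Hypothetical flattened activation values, one of them corrupted.
sample = [0.12, -1.5, float("nan"), 3.0]
n_bad = count_non_finite(sample)
```

Running a check like this right after extraction makes a corrupted dump fail early instead of surfacing later as a diverging SAE loss.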
- data/eval_set.json: full 400-example labeled benchmark
- data/eval_set_splits.json: 80/20 split
- experiments/build_eval_set.py: benchmark construction
- experiments/run_baseline.py: baseline refusal behavior
- experiments/collect_activations.py: residual activation dump
- experiments/train_sae.py: sparse autoencoder training
- experiments/run_interventions.py: feature ranking + causal interventions
- experiments/run_specificity.py: paraphrase robustness + unrelated-task specificity
- experiments/render_figures.py: checks figure presence
- results/: metrics, baseline generations, feature rankings, intervention outputs
- figures/: publication-ready figures used in the paper and poster
- paper/paper.tex: NeurIPS-style paper source
- paper/paper.pdf: compiled paper
- poster/poster.tex: A0 landscape poster source
- poster/poster.pdf: compiled poster
- DECISIONS.md: execution log and deviations from the initial plan
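For readers who want to see what a deterministic 80/20 split with seed 7 could look like, here is a minimal sketch. It assumes a plain seeded shuffle over example indices. Whether experiments/build_eval_set.py shuffles this way or stratifies by prompt category is not stated here, and `split_indices` is a hypothetical name.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, train_fraction: float = 0.8, seed: int = 7
) -> Tuple[List[int], List[int]]:
    """Return (train, held_out) index lists for n examples.

    A private Random instance keeps the split reproducible without
    touching the global random state.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    cut = round(n * train_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])


train_idx, held_out_idx = split_indices(400)
```

With 400 examples this yields 320 training indices and 80 held-out indices, and calling the function again with the same seed returns the same partition.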
- Unsafe true refusal rate: 0.70
- Over-refusal rate: 0.12
- Best intervention: amplify top refusal features at layer 18
  - unsafe refusal 0.80 -> 0.85
- Negative result: suppressing top over-refusal features on the over-refusal split
  - refusal 0.05 -> 0.10
- Specificity on unrelated tasks stays flat at 0.789
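The rates above are fractions of prompts whose response counts as a refusal. As a hedged illustration of that arithmetic only, the sketch below uses a simple keyword heuristic. The marker list and the names `is_refusal` and `refusal_rate` are assumptions made for this example; the criterion actually used in experiments/run_baseline.py may differ.

```python
from typing import Sequence

# Hypothetical refusal markers; not taken from the repo.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """True if the response contains any marker (case-insensitive)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


# Made-up responses, for illustration only.
examples = [
    "I can't help with that request.",
    "Sure, here is a short overview.",
    "I'm sorry, but I cannot assist with this.",
    "Here are the steps you asked for.",
]
rate = refusal_rate(examples)
```

An intervention effect such as 0.80 -> 0.85 is then the difference between this rate computed on the steered and unsteered generations for the same prompts.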
If you use this repo, cite:
@misc{rathod2026_refusal_features,
title = {Disentangling Correct Refusal from Over-Refusal via Sparse Feature Interventions},
author = {Kunj Rathod},
year = {2026},
note = {Mechanistic interpretability project repository},
howpublished = {\url{https://github.com/rathodkunj2005/refusal-feature-interp}}
}

- arXiv paper: TBD
- HuggingFace dataset card: TBD