Disentangling Correct Refusal from Over-Refusal via Sparse Feature Interventions

This repo contains a complete small-scale mechanistic interpretability project on refusal behavior in Qwen2.5-0.5B-Instruct. The core question is whether sparse feature families can separately control correct refusal on unsafe prompts and spurious over-refusal on benign prompts with dangerous surface forms. On the full 400-example benchmark, the model refuses 70% of unsafe prompts and 12% of over-refusal prompts. Sparse autoencoders recover high-fidelity feature dictionaries, and amplifying one refusal-feature family raises unsafe refusal on the held-out split from 0.80 to 0.85. The strongest negative result is also the most important: suppressing the top over-refusal family does not improve benign compliance, with refusal on the over-refusal split moving from 0.05 to 0.10.
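The two headline rates above are per-category refusal fractions. A minimal sketch of that metric follows, assuming a simple phrase-matching refusal detector; the marker phrases and the `category` / `response` field names are hypothetical, and the detector actually used in experiments/run_baseline.py may differ.

```python
from collections import defaultdict

# Hypothetical marker phrases (assumption, not taken from the repo).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate_by_category(records):
    """Compute the fraction of refused responses for each prompt category.

    Each record is a dict with hypothetical keys "category" and "response".
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        refusals[rec["category"]] += int(is_refusal(rec["response"]))
    return {cat: refusals[cat] / totals[cat] for cat in totals}


records = [
    {"category": "unsafe", "response": "I'm sorry, but I can't help with that."},
    {"category": "unsafe", "response": "Sure, here is an overview."},
    {"category": "over_refusal", "response": "To kill a Python process, use the kill command."},
]
rates = refusal_rate_by_category(records)
```

Under this reading, a high rate on the unsafe category is desirable, while any nonzero rate on the over-refusal category counts as spurious refusal.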

Hardware requirements

Apple M1 Pro or similar
16 GB RAM
~8 GB free disk
Python 3.11
TeX Live / pdflatex for paper and poster builds

Reproduction

From a fresh clone on a machine with the packages in requirements.txt and pdflatex available:

make all

Step-by-step:

make data
make baseline
make activations
make train_sae
make experiments
make figures
make paper
make poster

Notes:

The benchmark builder is deterministic with seed 7.
Baseline generation uses MPS when available.
Activation extraction runs on CPU: on the development machine, MPS produced NaNs in the saved activation arrays.
Large activation arrays and SAE checkpoints are excluded from git via .gitignore.
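The device notes above can be sketched as a small helper plus a sanity check on saved activations. This is an illustration under assumptions: `pick_device` and `count_non_finite` are hypothetical names, not functions from this repo, and the helper falls back to CPU when PyTorch or its MPS backend is missing.

```python
import math


def pick_device(prefer_mps: bool = True) -> str:
    """Return "mps" when requested and available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    if prefer_mps and mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def count_non_finite(values) -> int:
    """Count NaN or infinite entries in a flat iterable of floats."""
    return sum(1 for v in values if not math.isfinite(v))


# Baseline generation may use MPS; activation extraction is pinned to CPU.
baseline_device = pick_device(prefer_mps=True)
activation_device = pick_device(prefer_mps=False)

# Check a flattened activation dump before saving it.
bad_entries = count_non_finite([0.1, float("nan"), float("inf"), 2.0])
```

Running a check like `count_non_finite` before writing activations to disk would surface the NaN problem at extraction time instead of during SAE training.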

File structure

data/eval_set.json — full 400-example labeled benchmark
data/eval_set_splits.json — 80/20 split
experiments/build_eval_set.py — benchmark construction
experiments/run_baseline.py — baseline refusal behavior
experiments/collect_activations.py — residual activation dump
experiments/train_sae.py — sparse autoencoder training
experiments/run_interventions.py — feature ranking + causal interventions
experiments/run_specificity.py — paraphrase robustness + unrelated-task specificity
experiments/render_figures.py — checks figure presence
results/ — metrics, baseline generations, feature rankings, intervention outputs
figures/ — publication-ready figures used in the paper and poster
paper/paper.tex — NeurIPS-style paper source
paper/paper.pdf — compiled paper
poster/poster.tex — A0 landscape poster source
poster/poster.pdf — compiled poster
DECISIONS.md — execution log and deviations from the initial plan
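For orientation, a seeded 80/20 split like the one stored in data/eval_set_splits.json could be produced as sketched below. This is a guess at the procedure: the README only states that the benchmark builder uses seed 7, so reusing that seed for the split, the plain unstratified shuffle, and the function name `split_indices` are all assumptions.

```python
import random


def split_indices(n_examples: int, held_out_fraction: float = 0.2, seed: int = 7):
    """Shuffle example indices with a fixed seed and split them into two sets.

    Returns (train_indices, held_out_indices), each sorted.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_held_out = round(n_examples * held_out_fraction)
    held_out = sorted(indices[:n_held_out])
    train = sorted(indices[n_held_out:])
    return train, held_out


train_idx, held_out_idx = split_indices(400)
```

With 400 examples this yields 320 training and 80 held-out indices, and repeated calls with the same seed return the same partition.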

Main result snapshot

Unsafe true refusal rate (full benchmark): 0.70
Over-refusal rate (full benchmark): 0.12
Best intervention: amplify top refusal features at layer 18
- unsafe refusal on the held-out split 0.80 -> 0.85
Negative result: suppressing top over-refusal features on the over-refusal split
- refusal 0.05 -> 0.10 (refusal rises, so benign compliance does not improve)
Specificity on unrelated tasks stays flat at 0.789
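
Amplifying or suppressing a feature family can be read as rescaling selected SAE activations at one layer and patching the change back into the residual stream. The sketch below assumes a standard ReLU SAE and a "decoded delta" patch that leaves the SAE reconstruction error untouched. The function names, weight layout, and patching strategy are illustrative assumptions, not the code in experiments/run_interventions.py.

```python
def encode(resid, w_enc, b_enc):
    """ReLU SAE encoder: one activation per feature row in w_enc."""
    return [
        max(0.0, sum(w * r for w, r in zip(row, resid)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def steer(resid, w_enc, b_enc, w_dec, family, scale):
    """Rescale the activations of the features in `family` by `scale`.

    scale > 1 amplifies the family, scale = 0 suppresses (ablates) it.
    w_dec[i] is the decoder direction of feature i in residual space.
    Only the decoded change is added, so the rest of resid is unaltered.
    """
    acts = encode(resid, w_enc, b_enc)
    delta = [
        (scale - 1.0) * a if i in family else 0.0
        for i, a in enumerate(acts)
    ]
    return [
        r + sum(d * w_dec[i][j] for i, d in enumerate(delta))
        for j, r in enumerate(resid)
    ]


# Toy two-feature SAE with identity weights (assumed values for illustration).
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]

amplified = steer([1.0, 2.0], w_enc, b_enc, w_dec, family={0}, scale=2.0)
suppressed = steer([1.0, 2.0], w_enc, b_enc, w_dec, family={0}, scale=0.0)
```

In a real run, the patched vector would replace the residual activation at the chosen layer through a forward hook, and the refusal rate would be re-measured on the held-out split.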

Citation

If you use this repo, cite:

@misc{rathod2026_refusal_features,
  title        = {Disentangling Correct Refusal from Over-Refusal via Sparse Feature Interventions},
  author       = {Kunj Rathod},
  year         = {2026},
  note         = {Mechanistic interpretability project repository},
  howpublished = {\url{https://github.com/rathodkunj2005/refusal-feature-interp}}
}

Links

arXiv paper: TBD
HuggingFace dataset card: TBD
