rathodkunj2005/refusal-feature-interp

Disentangling Correct Refusal from Over-Refusal via Sparse Feature Interventions

This repo contains a complete small-scale mechanistic interpretability project on refusal behavior in Qwen2.5-0.5B-Instruct. The core question is whether sparse feature families can separately control correct refusal on unsafe prompts and spurious over-refusal on benign prompts with dangerous surface forms. On a 400-example benchmark, the model refuses 70% of unsafe prompts and 12% of benign over-refusal prompts. Sparse autoencoders recover high-fidelity feature dictionaries, and amplifying one refusal-feature family at layer 18 raises the unsafe refusal rate on the held-out split from 0.80 to 0.85. The strongest negative result is also the most important: suppressing the top over-refusal family does not improve benign compliance. Refusal on the over-refusal split moves from 0.05 to 0.10 instead of falling.
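To make "amplifying a feature family" concrete, here is a minimal sketch of one common way to rescale sparse autoencoder features in a residual vector. It assumes a ReLU encoder and a linear decoder; the names `encode`, `scale_features`, `w_enc`, `b_enc`, `w_dec`, the toy dimensions, and the scale factor are all illustrative assumptions and are not taken from experiments/run_interventions.py. Plain Python lists are used so the sketch runs without dependencies; real code would operate on tensors.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU encoder: one non-negative activation per dictionary feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def scale_features(
    x: Vector,
    w_enc: Matrix,
    b_enc: Vector,
    w_dec: Matrix,
    family: Sequence[int],
    scale: float,
) -> Vector:
    """Return x with the selected features' decoder contributions rescaled.

    scale > 1 amplifies the family, scale = 0 ablates it, scale = 1 is a no-op.
    Only the change in the selected features is added to x, so whatever the
    autoencoder fails to reconstruct is left untouched.
    """
    acts = encode(x, w_enc, b_enc)
    out = list(x)
    for j in family:
        delta = (scale - 1.0) * acts[j]
        # w_dec[j] is the direction feature j writes into the residual space.
        for d, w in enumerate(w_dec[j]):
            out[d] += delta * w
    return out


# Toy example (hypothetical numbers): 2-dimensional residual, 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [2.0, -1.0]

steered = scale_features(x, w_enc, b_enc, w_dec, family=[0], scale=3.0)
print(steered)  # [6.0, -1.0]
```

Only feature 0 is active on this input (activation 2.0), so tripling it adds 2 × 2.0 along its decoder direction and leaves the second coordinate unchanged.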

Hardware requirements

  • Apple M1 Pro or similar
  • 16 GB RAM
  • ~8 GB free disk
  • Python 3.11
  • TeX Live / pdflatex for paper and poster builds

Reproduction

From a fresh clone on a machine with the packages in requirements.txt and pdflatex available:

make all

Step-by-step:

make data
make baseline
make activations
make train_sae
make experiments
make figures
make paper
make poster

Notes:

  • The benchmark builder is deterministic with seed 7.
  • Baseline generation uses MPS when available.
  • Activation extraction is more stable on CPU on this machine because MPS produced NaNs in saved activation arrays.
  • Large activation arrays and SAE checkpoints are excluded from git via .gitignore.
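The fixed seed in the first note is what makes the benchmark reproducible across runs. As a minimal sketch of a seeded 80/20 split: the function name `split_examples`, the integer ids, and the choice to treat the 20% side as held-out are assumptions for illustration, and experiments/build_eval_set.py may seed and split differently.

```python
import random
from typing import List, Sequence, Tuple

SEED = 7  # the seed stated in the notes above


def split_examples(
    ids: Sequence[int], held_out_fraction: float = 0.2, seed: int = SEED
) -> Tuple[List[int], List[int]]:
    """Shuffle ids with a local seeded RNG and cut off a held-out slice."""
    rng = random.Random(seed)  # local RNG, leaves global random state alone
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_held_out = round(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


main_ids, held_out_ids = split_examples(range(400))
print(len(main_ids), len(held_out_ids))  # 320 80
```

Using a local `random.Random` instance rather than `random.seed` keeps the split independent of any other code that draws from the global generator.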

File structure

  • data/eval_set.json — full 400-example labeled benchmark
  • data/eval_set_splits.json — 80/20 split
  • experiments/build_eval_set.py — benchmark construction
  • experiments/run_baseline.py — baseline refusal behavior
  • experiments/collect_activations.py — residual activation dump
  • experiments/train_sae.py — sparse autoencoder training
  • experiments/run_interventions.py — feature ranking + causal interventions
  • experiments/run_specificity.py — paraphrase robustness + unrelated-task specificity
  • experiments/render_figures.py — checks figure presence
  • results/ — metrics, baseline generations, feature rankings, intervention outputs
  • figures/ — publication-ready figures used in the paper and poster
  • paper/paper.tex — NeurIPS-style paper source
  • paper/paper.pdf — compiled paper
  • poster/poster.tex — A0 landscape poster source
  • poster/poster.pdf — compiled poster
  • DECISIONS.md — execution log and deviations from the initial plan

Main result snapshot

  • Unsafe true refusal rate: 0.70
  • Over-refusal rate: 0.12
  • Best intervention: amplify top refusal features at layer 18
    • unsafe refusal on the held-out split 0.80 -> 0.85
  • Negative result: suppressing top over-refusal features on the over-refusal split
    • refusal 0.05 -> 0.10 (refusal rises instead of falling)
  • Specificity on unrelated tasks stays flat at 0.789
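Each rate above is the fraction of model responses classified as refusals. As a rough illustration of how such a rate can be computed with a keyword rule, here is a sketch; the marker phrases, the `is_refusal` helper, and the example responses are hypothetical and are not the classifier used in experiments/run_baseline.py.

```python
from typing import Iterable

# Hypothetical refusal markers; a real rule set would need tuning.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "i am unable",
)


def is_refusal(response: str) -> bool:
    """True if the opening of the response contains a refusal marker."""
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals by the keyword rule."""
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


examples = [
    "I'm sorry, but I can't help with that.",
    "Sure. To kill a Python process, run `kill <pid>`.",
    "I cannot assist with this request.",
    "Here is a summary of the article.",
]
print(refusal_rate(examples))  # 0.5
```

The second example is the kind of benign prompt with a dangerous surface form ("kill") that an over-refusal split is meant to probe. A keyword rule like this only inspects the response, so it can disagree with a judge model on borderline outputs.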

Citation

If you use this repo, cite:

@misc{rathod2026_refusal_features,
  title        = {Disentangling Correct Refusal from Over-Refusal via Sparse Feature Interventions},
  author       = {Kunj Rathod},
  year         = {2026},
  note         = {Mechanistic interpretability project repository},
  howpublished = {\url{https://github.com/rathodkunj2005/refusal-feature-interp}}
}

Links

  • arXiv paper: TBD
  • HuggingFace dataset card: TBD

About

Mechanistic interpretability of refusal behavior in Qwen2.5 models: sparse feature interventions, residual steering, judge-vs-rule analysis, and 1.5B replication
