Skip to content

Latest commit

 

History

History
51 lines (37 loc) · 917 Bytes

File metadata and controls

51 lines (37 loc) · 917 Bytes

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Taboo models on HF

schema

Installation

pip install uv
uv sync --dev

Taboo models training

sh run_training.sh

Eliciting secret words from models

Adversarial Prompts

python evaluate_adversarial_prompts.py

Guessing Secret Words by another model

python guess_secret_word.py

Token forcing pregame

python prefill_guess_secret_word.py

Token forcing postgame

python prefill_with_prompts.py

Logit Lens

python evaluate_logit_lens.py

SAE

python evaluate_sae_weighted.py