Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Installation

Install uv, then sync the project's dependencies, including the development group:

pip install uv
uv sync --dev

Training

sh run_training.sh

Evaluation

python evaluate_adversarial_prompts.py

python guess_secret_word.py

python prefill_guess_secret_word.py

python prefill_with_prompts.py

python evaluate_logit_lens.py

python evaluate_sae_weighted.py