Understanding how Whisper's encoder represents acoustic and linguistic information through counterfactual activation patching experiments.
This repository contains two interpretability experiments on OpenAI's Whisper model, inspired by techniques from mechanistic interpretability research like PatchScope. We use activation patching to probe how Whisper's encoder layers process and represent speech.
Key Findings:
- Layer-specific representations: Same-layer patching achieves ~95% override accuracy vs ~54% for cross-layer patching
- Language-agnostic encoding: 66.6% cross-lingual transfer rate suggests encoder representations are largely language-independent
- Asymmetric layer influence: Later layers show stronger forward transfer to adjacent layers
Experiment A (monolingual bidirectional patching) measures how encoder representations from one word can override the processing of a different word.
Methodology:
- Record encoder hidden states from Word A (e.g., "bat")
- Patch these states into encoder processing of Word B (e.g., "cat")
- Measure if output changes toward Word A
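The "did the output move toward Word A" step can be made concrete with a small scorer. This is an illustrative sketch, not code from the notebooks; the helper names `classify_override` and `override_accuracy` are assumptions, and the substring matching is deliberately simple:

```python
def classify_override(transcript: str, word_a: str, word_b: str) -> str:
    """Classify a patched-run transcription as the patched-in source word (A),
    the original input word (B), or something else."""
    text = transcript.lower().strip(" .,!?")
    if word_a.lower() in text:
        return "A"      # patch overrode the input: output moved toward Word A
    if word_b.lower() in text:
        return "B"      # patch had no effect: output still matches Word B
    return "other"

def override_accuracy(results):
    """Fraction of (transcript, word_a, word_b) trials where the patch won."""
    wins = sum(classify_override(t, a, b) == "A" for t, a, b in results)
    return wins / len(results)

# Hypothetical trials for the pair bat/cat
trials = [("Bat.", "bat", "cat"), ("cat", "bat", "cat"), ("bad", "bat", "cat")]
print(override_accuracy(trials))  # -> 0.3333333333333333 (1 of 3 trials overridden)
```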
Dataset: 869 English word pairs including:
- Minimal pairs (consonant contrasts): bat/pat, big/pig, etc.
- Vowel contrasts: bit/bat, ship/sheep, etc.
- Semantic pairs: apple/orange, hot/cold, etc.
Results:
| Metric | Value |
|---|---|
| Diagonal (same-layer) | 0.951 |
| Off-diagonal (cross-layer) | 0.536 |
| Max override | Layer 3→3 (0.954) |
Override accuracy matrix (rows: source layer, columns: target layer):

| Source \ Target | L0 | L1 | L2 | L3 |
|---|---|---|---|---|
| L0 | 0.95 | 0.50 | 0.50 | 0.50 |
| L1 | 0.50 | 0.95 | 0.50 | 0.51 |
| L2 | 0.50 | 0.64 | 0.95 | 0.52 |
| L3 | 0.50 | 0.52 | 0.74 | 0.95 |
Interpretation: The strong diagonal pattern indicates layer-specific representations—each layer encodes information in a format most compatible with the same layer position.
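As a sanity check, the diagonal and off-diagonal summary statistics can be recomputed from the matrix values (small discrepancies against the results table come from the matrix being rounded to two decimals):

```python
import numpy as np

# Override accuracy matrix from Experiment A (rows: source layer, cols: target layer)
acc = np.array([
    [0.95, 0.50, 0.50, 0.50],
    [0.50, 0.95, 0.50, 0.51],
    [0.50, 0.64, 0.95, 0.52],
    [0.50, 0.52, 0.74, 0.95],
])

diag_mask = np.eye(4, dtype=bool)
same_layer = acc[diag_mask].mean()    # ~0.95  (table reports 0.951 before rounding)
cross_layer = acc[~diag_mask].mean()  # ~0.536
print(same_layer, cross_layer)
```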
Experiment B (cross-lingual transfer) tests whether English encoder representations can influence the processing of Spanish audio.
Methodology:
- Process English word (e.g., "telephone") and capture encoder states
- Patch English states into Spanish cognate processing (e.g., "teléfono")
- Check if English word appears in output
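Transfer success here is a binary check: does the English word surface anywhere in the transcription of the patched Spanish audio? A minimal sketch of that check (the helper names are illustrative, not from the notebooks); accent stripping matters so that an unpatched "teléfono" is not mistaken for a match:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Lowercase, strip accents, and drop punctuation so "Teléfono." -> "telefono"
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z\s]", "", text)

def transfer_success(transcript: str, english_word: str) -> bool:
    """Binary metric: did the patched English representation surface in the output?"""
    return normalize(english_word) in normalize(transcript).split()

print(transfer_success("Telephone.", "telephone"))  # True
print(transfer_success("Teléfono.", "telephone"))   # False
```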
Dataset: 162 English-Spanish pairs including:
- Cognates: telephone/teléfono, hospital/hospital, chocolate/chocolate
- Non-cognates: cat/gato, house/casa, water/agua
Results:
| Metric | Value |
|---|---|
| Overall transfer rate | 0.666 |
| Same-layer transfer | 0.708 |
| Cross-layer transfer | 0.652 |
| Best transfer | Layer 1→0 (1.000) |
| Worst transfer | Layer 1→3 (0.019) |
Transfer success matrix (rows: source layer, columns: target layer):

| Source \ Target | L0 | L1 | L2 | L3 |
|---|---|---|---|---|
| L0 | 0.72 | 0.93 | 0.55 | 0.12 |
| L1 | 1.00 | 0.71 | 0.87 | 0.02 |
| L2 | 1.00 | 0.33 | 0.70 | 0.16 |
| L3 | 0.99 | 0.99 | 0.87 | 0.70 |
Interpretation: High transfer rates suggest Whisper's encoder learns largely language-agnostic acoustic representations, with early/mid layers being most transferable.
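The reported transfer rates, and the best/worst source-to-target cells, can be cross-checked against the rounded matrix values:

```python
import numpy as np

# Transfer success matrix from Experiment B (rows: source layer, cols: target layer)
tr = np.array([
    [0.72, 0.93, 0.55, 0.12],
    [1.00, 0.71, 0.87, 0.02],
    [1.00, 0.33, 0.70, 0.16],
    [0.99, 0.99, 0.87, 0.70],
])

diag = np.eye(4, dtype=bool)
print(tr.mean())         # overall     ~0.666
print(tr[diag].mean())   # same-layer  ~0.708
print(tr[~diag].mean())  # cross-layer ~0.652

# Best and worst cells; argmax ties (1.00 appears twice) resolve to the first hit
print(np.unravel_index(tr.argmax(), tr.shape))  # (1, 0): Layer 1 -> 0
print(np.unravel_index(tr.argmin(), tr.shape))  # (1, 3): Layer 1 -> 3
```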
```
pip install openai-whisper torch torchaudio matplotlib numpy gTTS pydub seaborn
```

Run in Google Colab (recommended for GPU access):

```
# Experiment A - Monolingual patching (~5 hours on T4 GPU)
# Open experiment_a_bidirectional_patching.ipynb

# Experiment B - Cross-lingual transfer (~1 hour on T4 GPU)
# Open experiment_b_crosslingual_transfer.ipynb
```

```
├── experiment_a_bidirectional_patching.ipynb  # Monolingual patching experiment
├── experiment_b_crosslingual_transfer.ipynb   # Cross-lingual transfer experiment
├── README.md
└── figures/
    ├── experiment_a_heatmap.png
    └── experiment_b_figure.png
```
Model: Whisper Tiny (37.2M parameters, 4 encoder layers, 4 decoder layers)
Patching Mechanism:

```python
def patch_hook(module, inputs, output):
    # Replace this block's output with the saved source-word hidden state
    return source_state.to(output.device).to(output.dtype)

hook = model.encoder.blocks[target_layer].register_forward_hook(patch_hook)
# ... run the model on the target audio ...
hook.remove()  # detach the hook so later runs are unaffected
```

Audio Generation: Google Text-to-Speech (gTTS) for consistent synthetic audio
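The capture-then-patch hook mechanics can be illustrated end to end with a toy two-block module standing in for Whisper's encoder (this is a self-contained sketch, not the experiment code):

```python
import torch
import torch.nn as nn

# Toy two-block "encoder" standing in for model.encoder.blocks
blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])

def run(x):
    for blk in blocks:
        x = blk(x)
    return x

torch.manual_seed(0)
source_audio, target_audio = torch.randn(1, 4), torch.randn(1, 4)

# 1) Capture: record the source run's hidden state at layer 0
captured = {}
h = blocks[0].register_forward_hook(lambda m, i, o: captured.update(state=o.detach()))
run(source_audio)
h.remove()

# 2) Patch: a forward hook that returns a tensor replaces the layer's output
h = blocks[0].register_forward_hook(lambda m, i, o: captured["state"])
patched_out = run(target_audio)
h.remove()

# Downstream layers saw the source state, so the output matches the source run
assert torch.allclose(patched_out, run(source_audio))
```

Returning a non-`None` value from a PyTorch forward hook overrides the module's output, which is exactly what the experiment's `patch_hook` relies on.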
- Layer Specialization: Each encoder layer develops specialized representations incompatible with other layer positions (diagonal dominance in Experiment A)
- Asymmetric Transfer: Later source layers (2, 3) show elevated transfer to adjacent target layers, suggesting hierarchical processing
- Language Independence: The encoder appears to build language-agnostic acoustic representations that transfer across languages
- Early Layer Universality: Layers 0-1 show the highest cross-lingual transfer, possibly encoding more universal acoustic features
- Uses Whisper Tiny; larger models may show different patterns
- Synthetic TTS audio may not reflect natural speech characteristics
- Limited to single-word stimuli
- Transfer metric is binary (word presence) rather than graded
- Extend to larger Whisper models (Base, Small, Medium, Large)
- Test with natural speech recordings
- Analyze decoder layer interactions
- Probe specific phonetic feature representations
- Compare with other multilingual ASR models
If you use this code in your research, please cite:
```bibtex
@software{whisper_patching_2024,
  title={Whisper Encoder Interpretability via Activation Patching},
  year={2024},
  url={https://github.com/YOUR_USERNAME/whisper-activation-patching}
}
```

MIT License
- OpenAI for the Whisper model
- Inspired by PatchScope and mechanistic interpretability research