Understanding how Whisper's encoder represents acoustic and linguistic information through counterfactual activation patching experiments.
This repository contains two interpretability experiments on OpenAI's Whisper model, inspired by techniques from mechanistic interpretability research like PatchScope. We use activation patching to probe how Whisper's encoder layers process and represent speech.
Key Findings:
- Layer-specific representations: Same-layer patching achieves ~95% override accuracy vs ~54% for cross-layer patching
- Language-agnostic encoding: 66.6% cross-lingual transfer rate suggests encoder representations are largely language-independent
- Asymmetric layer influence: Later layers show stronger forward transfer to adjacent layers
Experiment A (monolingual bidirectional patching) measures how encoder representations from one word can override the processing of a different word.
Methodology:
- Record encoder hidden states from Word A (e.g., "bat")
- Patch these states into encoder processing of Word B (e.g., "cat")
- Measure if output changes toward Word A
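The "did the output move toward Word A" step can be made concrete with a small scorer. This is an illustrative sketch, not code from the notebooks; the helper names `classify_override` and `override_accuracy` are assumptions, and the substring matching is deliberately simple:

```python
def classify_override(transcript: str, word_a: str, word_b: str) -> str:
    """Classify a patched-run transcription as the patched-in source word (A),
    the original input word (B), or something else."""
    text = transcript.lower().strip(" .,!?")
    if word_a.lower() in text:
        return "A"      # patch overrode the input: output moved toward Word A
    if word_b.lower() in text:
        return "B"      # patch had no effect: output still matches Word B
    return "other"

def override_accuracy(results):
    """Fraction of (transcript, word_a, word_b) trials where the patch won."""
    wins = sum(classify_override(t, a, b) == "A" for t, a, b in results)
    return wins / len(results)

# Hypothetical trials for the pair bat/cat
trials = [("Bat.", "bat", "cat"), ("cat", "bat", "cat"), ("bad", "bat", "cat")]
print(override_accuracy(trials))  # -> 0.3333333333333333 (1 of 3 trials overridden)
```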
Dataset: 869 English word pairs including:
- Minimal pairs (consonant contrasts): bat/pat, big/pig, etc.
- Vowel contrasts: bit/bat, ship/sheep, etc.
- Semantic pairs: apple/orange, hot/cold, etc.
Results:
| Metric | Value |
|---|---|
| Diagonal (same-layer) | 0.951 |
| Off-diagonal (cross-layer) | 0.536 |
| Max override | Layer 3→3 (0.954) |
Override accuracy matrix (rows: source layer, columns: target layer):

| Source \ Target | L0 | L1 | L2 | L3 |
|---|---|---|---|---|
| L0 | 0.95 | 0.50 | 0.50 | 0.50 |
| L1 | 0.50 | 0.95 | 0.50 | 0.51 |
| L2 | 0.50 | 0.64 | 0.95 | 0.52 |
| L3 | 0.50 | 0.52 | 0.74 | 0.95 |
Interpretation: The strong diagonal pattern indicates layer-specific representations—each layer encodes information in a format most compatible with the same layer position.
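As a sanity check, the diagonal and off-diagonal summary statistics can be recomputed from the matrix values (small discrepancies against the results table come from the matrix being rounded to two decimals):

```python
import numpy as np

# Override accuracy matrix from Experiment A (rows: source layer, cols: target layer)
acc = np.array([
    [0.95, 0.50, 0.50, 0.50],
    [0.50, 0.95, 0.50, 0.51],
    [0.50, 0.64, 0.95, 0.52],
    [0.50, 0.52, 0.74, 0.95],
])

diag_mask = np.eye(4, dtype=bool)
same_layer = acc[diag_mask].mean()    # ~0.95  (table reports 0.951 before rounding)
cross_layer = acc[~diag_mask].mean()  # ~0.536
print(same_layer, cross_layer)
```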
Experiment B (cross-lingual transfer) tests whether English encoder representations can influence the processing of Spanish audio.
Methodology:
- Process English word (e.g., "telephone") and capture encoder states
- Patch English states into Spanish cognate processing (e.g., "teléfono")
- Check if English word appears in output
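Transfer success here is a binary check: does the English word surface anywhere in the transcription of the patched Spanish audio? A minimal sketch of that check (the helper names are illustrative, not from the notebooks); accent stripping matters so that an unpatched "teléfono" is not mistaken for a match:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Lowercase, strip accents, and drop punctuation so "Teléfono." -> "telefono"
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z\s]", "", text)

def transfer_success(transcript: str, english_word: str) -> bool:
    """Binary metric: did the patched English representation surface in the output?"""
    return normalize(english_word) in normalize(transcript).split()

print(transfer_success("Telephone.", "telephone"))  # True
print(transfer_success("Teléfono.", "telephone"))   # False
```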
Dataset: 162 English-Spanish pairs including:
- Cognates: telephone/teléfono, hospital/hospital, chocolate/chocolate
- Non-cognates: cat/gato, house/casa, water/agua
Results:
| Metric | Value |
|---|---|
| Overall transfer rate | 0.666 |
| Same-layer transfer | 0.708 |
| Cross-layer transfer | 0.652 |
| Best transfer | Layer 1→0 (1.000) |
| Worst transfer | Layer 1→3 (0.019) |
Transfer success matrix (rows: source layer, columns: target layer):

| Source \ Target | L0 | L1 | L2 | L3 |
|---|---|---|---|---|
| L0 | 0.72 | 0.93 | 0.55 | 0.12 |
| L1 | 1.00 | 0.71 | 0.87 | 0.02 |
| L2 | 1.00 | 0.33 | 0.70 | 0.16 |
| L3 | 0.99 | 0.99 | 0.87 | 0.70 |
Interpretation: High transfer rates suggest Whisper's encoder learns largely language-agnostic acoustic representations, with early/mid layers being most transferable.
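The reported transfer rates, and the best/worst source-to-target cells, can be cross-checked against the rounded matrix values:

```python
import numpy as np

# Transfer success matrix from Experiment B (rows: source layer, cols: target layer)
tr = np.array([
    [0.72, 0.93, 0.55, 0.12],
    [1.00, 0.71, 0.87, 0.02],
    [1.00, 0.33, 0.70, 0.16],
    [0.99, 0.99, 0.87, 0.70],
])

diag = np.eye(4, dtype=bool)
print(tr.mean())         # overall     ~0.666
print(tr[diag].mean())   # same-layer  ~0.708
print(tr[~diag].mean())  # cross-layer ~0.652

# Best and worst cells; argmax ties (1.00 appears twice) resolve to the first hit
print(np.unravel_index(tr.argmax(), tr.shape))  # (1, 0): Layer 1 -> 0
print(np.unravel_index(tr.argmin(), tr.shape))  # (1, 3): Layer 1 -> 3
```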
```
pip install openai-whisper torch torchaudio matplotlib numpy gTTS pydub seaborn
```

Run in Google Colab (recommended for GPU access):

```
# Experiment A - Monolingual patching (~5 hours on T4 GPU)
# Open experiment_a_bidirectional_patching.ipynb

# Experiment B - Cross-lingual transfer (~1 hour on T4 GPU)
# Open experiment_b_crosslingual_transfer.ipynb
```

```
├── experiment_a_bidirectional_patching.ipynb  # Monolingual patching experiment
├── experiment_b_crosslingual_transfer.ipynb   # Cross-lingual transfer experiment
├── README.md
└── figures/
    ├── experiment_a_heatmap.png
    └── experiment_b_figure.png
```
Model: Whisper Tiny (37.2M parameters, 4 encoder layers, 4 decoder layers)
Patching Mechanism:

```python
def patch_hook(module, inputs, output):
    # Replace this block's output with the saved source-word hidden state
    return source_state.to(output.device).to(output.dtype)

hook = model.encoder.blocks[target_layer].register_forward_hook(patch_hook)
# ... run the model on the target audio ...
hook.remove()  # detach the hook so later runs are unaffected
```

Audio Generation: Google Text-to-Speech (gTTS) for consistent synthetic audio
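The capture-then-patch hook mechanics can be illustrated end to end with a toy two-block module standing in for Whisper's encoder (this is a self-contained sketch, not the experiment code):

```python
import torch
import torch.nn as nn

# Toy two-block "encoder" standing in for model.encoder.blocks
blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])

def run(x):
    for blk in blocks:
        x = blk(x)
    return x

torch.manual_seed(0)
source_audio, target_audio = torch.randn(1, 4), torch.randn(1, 4)

# 1) Capture: record the source run's hidden state at layer 0
captured = {}
h = blocks[0].register_forward_hook(lambda m, i, o: captured.update(state=o.detach()))
run(source_audio)
h.remove()

# 2) Patch: a forward hook that returns a tensor replaces the layer's output
h = blocks[0].register_forward_hook(lambda m, i, o: captured["state"])
patched_out = run(target_audio)
h.remove()

# Downstream layers saw the source state, so the output matches the source run
assert torch.allclose(patched_out, run(source_audio))
```

Returning a non-`None` value from a PyTorch forward hook overrides the module's output, which is exactly what the experiment's `patch_hook` relies on.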
- Layer Specialization: Each encoder layer develops specialized representations incompatible with other layer positions (diagonal dominance in Experiment A)
- Asymmetric Transfer: Later source layers (2, 3) show elevated transfer to adjacent target layers, suggesting hierarchical processing
- Language Independence: The encoder appears to build language-agnostic acoustic representations that transfer across languages
- Early Layer Universality: Layers 0-1 show the highest cross-lingual transfer, possibly encoding more universal acoustic features
- Uses Whisper Tiny; larger models may show different patterns
- Synthetic TTS audio may not reflect natural speech characteristics
- Limited to single-word stimuli
- Transfer metric is binary (word presence) rather than graded
- Extend to larger Whisper models (Base, Small, Medium, Large)
- Test with natural speech recordings
- Analyze decoder layer interactions
- Probe specific phonetic feature representations
- Compare with other multilingual ASR models
If you use this code in your research, please cite:
```bibtex
@software{whisper_patching_2024,
  title={Whisper Encoder Interpretability via Activation Patching},
  year={2024},
  url={https://github.com/YOUR_USERNAME/whisper-activation-patching}
}
```

MIT License
- OpenAI for the Whisper model
- Inspired by PatchScope and mechanistic interpretability research