Quantifying the internal processing modes that large language models shift between depending on task type.
A framework for measuring and visualizing the distinct computational patterns that emerge across different LLM tasks. We extract attribution graphs from model internals (via transcoders/SAEs) and compute metrics that characterize each computational regime.
After controlling for text length, we found:
| Metric | What It Measures | After Length Control |
|---|---|---|
| Influence | Causal strength between features | d=1.08 (genuine signal) |
| Concentration | Focused vs diffuse computation | d=0.87 (genuine signal) |
| N_active | Feature count | d=0.07 (COLLAPSES - was length artifact) |
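The d values above are Cohen's d, the standardized mean difference between task groups. A minimal sketch of how such an effect size can be computed; the `grammar` and `reasoning` arrays are hypothetical stand-ins for per-sample, length-controlled metric values, not the repo's actual data:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical length-controlled mean_influence values per sample
grammar = [0.42, 0.38, 0.45, 0.40, 0.44]
reasoning = [0.31, 0.29, 0.35, 0.30, 0.33]
print(f"d = {cohens_d(grammar, reasoning):.2f}")
```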
The pattern:
- Grammar tasks (CoLA): High influence, high concentration = focused computation
- Reasoning tasks (WinoGrande, HellaSwag): Low influence, low concentration = diffuse computation
- Truthfulness (TruthfulQA): No signal (d=0.05) - true/false statements look identical internally
This repo documents our research journey:
- Started: Hallucination detection via feature spectroscopy (Dec 2025)
- Realized: Most "signal" was text length confounding (Jan 2026)
- Pivoted: Task-type diagnostics with length control (Feb 2026)
- Found: Genuine computational regime differences
See archive/disproved/ for our early experiments with honest disclaimers about what didn't work.
```
notebooks/                    # START HERE - 5-part narrative series
  01_introduction.ipynb       # What this project discovers
  02_the_journey.ipynb        # From hallucination detection to task diagnostics
  03_methodology.ipynb        # How we extract and analyze metrics
  04_core_results.ipynb       # Main findings with visualizations
  05_negative_results.ipynb   # What doesn't work (and why that matters)
experiments/                  # Reproducible analysis code
  core/                       # Main validated analyses
  statistics/                 # Statistical tests
  visualization/              # Figure generation
  utilities/                  # Shared code
  _archive/                   # Historical experiments
figures/                      # Generated visualizations
  paper/                      # Core figures
data/results/                 # Computed metrics (JSON)
scripts/                      # Modal GPU runners
archive/disproved/            # Early work with honest disclaimers
```
Read the notebooks first - they tell the complete story:
```bash
jupyter notebook notebooks/01_introduction.ipynb
```

Run the analysis:

```bash
pip install -e .

# Generate figures (all length-controlled)
python experiments/visualization/generate_figures.py

# Compute attribution metrics (parallel, crash-safe)
modal run scripts/modal_general_attribution.py \
  --input-file data/domain_analysis/domain_samples.json \
  --output-file data/results/domain_attribution_metrics.json
```

Works:
- Task type classification (grammar vs reasoning vs paraphrase)
- Computational complexity estimation
- Anomaly/adversarial input detection
Doesn't Work:
- Hallucination detection
- Truthfulness detection
- Output correctness prediction
The model processes hallucinations and false statements with the same internal structure as truthful ones. This is a fundamental limitation: we measure computation type, not output quality.
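A minimal sketch of how the "works" side could be operationalized: a linear classifier over the length-controlled metrics. The feature values below are hypothetical illustrations (not measured data), and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical length-controlled (influence, concentration) pairs per sample
X = np.array([
    [0.45, 0.62], [0.48, 0.60], [0.44, 0.65],   # grammar-like: focused
    [0.28, 0.35], [0.30, 0.33], [0.27, 0.38],   # reasoning-like: diffuse
])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = grammar, 1 = reasoning

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.46, 0.61], [0.29, 0.34]]))  # expect one of each class
```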
- Attribution Graphs: We use circuit-tracer to extract causal graphs showing how features influence each other during inference. Each node is a feature (a sparse autoencoder direction); each edge is a causal influence.
- Metrics: From these graphs we extract:
  - mean_influence: average edge weight (how strongly features drive each other)
  - concentration: how focused the influence is (Gini coefficient)
  - mean_activation: average feature activation strength
- Length Control: Raw feature counts correlate at r=0.98 with text length - longer inputs trivially activate more features. We residualize metrics against length to isolate genuine computational differences.
- The Signal: After length control, grammar tasks still show d=1.08 higher influence than reasoning tasks. This isn't length - it's a genuine regime difference.
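The metric extraction and length control described above can be sketched as follows. The dense edge-weight matrix and all variable names are illustrative assumptions, not the repo's actual API:

```python
import numpy as np

def mean_influence(edges):
    """Average absolute weight over the nonzero edges of an attribution graph."""
    w = np.abs(edges[edges != 0])
    return w.mean() if w.size else 0.0

def gini(values):
    """Gini coefficient: 0 = perfectly diffuse influence, near 1 = concentrated."""
    v = np.sort(np.abs(np.asarray(values, float)))
    n = v.size
    cum = np.cumsum(v)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def residualize(metric, length):
    """Remove the linear effect of text length via least-squares regression."""
    X = np.column_stack([np.ones_like(length, dtype=float), length])
    beta, *_ = np.linalg.lstsq(X, metric, rcond=None)
    return metric - X @ beta

# Toy example: a per-sample metric that is mostly length-driven
rng = np.random.default_rng(0)
lengths = rng.integers(10, 200, size=50).astype(float)
raw_metric = 0.01 * lengths + rng.normal(0, 0.1, size=50)
controlled = residualize(raw_metric, lengths)
print(np.corrcoef(controlled, lengths)[0, 1])  # ~0 after residualization
```

Least-squares residuals are orthogonal to the regressors by construction, which is why the residualized metric is uncorrelated with length.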
- Requires model internals - Only works on models with SAE/transcoder access (currently Gemma 2 via Goodfire)
- Compute intensive - ~30 sec/sample on A100
- Measures structure, not correctness - Can't detect hallucinations or factual errors
- Must control for length - Raw n_active is confounded (r=0.98 with token count)
MIT