Quantifying the internal processing modes that large language models shift between depending on task type.
A framework for measuring and visualizing the distinct computational patterns that emerge across different LLM tasks. We extract attribution graphs from model internals (via transcoders/SAEs) and compute metrics that characterize each computational regime.
After controlling for text length, we found:
| Metric | What It Measures | After Length Control |
|---|---|---|
| Influence | Causal strength between features | d=1.08 (genuine signal) |
| Concentration | Focused vs diffuse computation | d=0.87 (genuine signal) |
| N_active | Feature count | d=0.07 (COLLAPSES - was length artifact) |
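The d values above are Cohen's d, the standardized mean difference between task groups. A minimal sketch of how such an effect size can be computed; the `grammar` and `reasoning` arrays are hypothetical stand-ins for per-sample, length-controlled metric values, not the repo's actual data:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical length-controlled mean_influence values per sample
grammar = [0.42, 0.38, 0.45, 0.40, 0.44]
reasoning = [0.31, 0.29, 0.35, 0.30, 0.33]
print(f"d = {cohens_d(grammar, reasoning):.2f}")
```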
The pattern:
- Grammar tasks (CoLA): High influence, high concentration = focused computation
- Reasoning tasks (WinoGrande, HellaSwag): Low influence, low concentration = diffuse computation
- Truthfulness (TruthfulQA): No signal (d=0.05) - true/false statements look identical internally
This repo documents our research journey:
- Started: Hallucination detection via feature spectroscopy (Dec 2025)
- Realized: Most "signal" was text length confounding (Jan 2026)
- Pivoted: Task-type diagnostics with length control (Feb 2026)
- Found: Genuine computational regime differences
See archive/disproved/ for our early experiments with honest disclaimers about what didn't work.
```
notebooks/                    # START HERE - 5-part narrative series
  01_introduction.ipynb       # What this project discovers
  02_the_journey.ipynb        # From hallucination detection to task diagnostics
  03_methodology.ipynb        # How we extract and analyze metrics
  04_core_results.ipynb       # Main findings with visualizations
  05_negative_results.ipynb   # What doesn't work (and why that matters)
experiments/                  # Reproducible analysis code
  core/                       # Main validated analyses
  statistics/                 # Statistical tests
  visualization/              # Figure generation
  utilities/                  # Shared code
  _archive/                   # Historical experiments
figures/                      # Generated visualizations
  paper/                      # Core figures
data/results/                 # Computed metrics (JSON)
scripts/                      # Modal GPU runners
archive/disproved/            # Early work with honest disclaimers
```
Read the notebooks first - they tell the complete story:
```bash
jupyter notebook notebooks/01_introduction.ipynb
```

Run the analysis:

```bash
pip install -e .

# Generate figures (all length-controlled)
python experiments/visualization/generate_figures.py

# Compute attribution metrics (parallel, crash-safe)
modal run scripts/modal_general_attribution.py \
  --input-file data/domain_analysis/domain_samples.json \
  --output-file data/results/domain_attribution_metrics.json
```

Works:
- Task type classification (grammar vs reasoning vs paraphrase)
- Computational complexity estimation
- Anomaly/adversarial input detection
Doesn't Work:
- Hallucination detection
- Truthfulness detection
- Output correctness prediction
The model processes hallucinations and false statements with the same internal structure as truthful ones. This is a fundamental limitation: we measure computation type, not output quality.
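A minimal sketch of how the "works" side could be operationalized: a linear classifier over the length-controlled metrics. The feature values below are hypothetical illustrations (not measured data), and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical length-controlled (influence, concentration) pairs per sample
X = np.array([
    [0.45, 0.62], [0.48, 0.60], [0.44, 0.65],   # grammar-like: focused
    [0.28, 0.35], [0.30, 0.33], [0.27, 0.38],   # reasoning-like: diffuse
])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = grammar, 1 = reasoning

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.46, 0.61], [0.29, 0.34]]))  # expect one of each class
```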
- Attribution Graphs: We use circuit-tracer to extract causal graphs showing how features influence each other during inference. Each node is a feature (a sparse autoencoder direction); each edge is a causal influence.
- Metrics: From these graphs we extract:
  - mean_influence: average edge weight (how strongly features drive each other)
  - concentration: how focused the influence is (Gini coefficient)
  - mean_activation: average feature activation strength
- Length Control: Raw feature counts correlate at r=0.98 with text length - longer inputs trivially activate more features. We residualize metrics against length to isolate genuine computational differences.
- The Signal: After length control, grammar tasks still show d=1.08 higher influence than reasoning tasks. This isn't length - it's a genuine regime difference.
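The metric extraction and length control described above can be sketched as follows. The dense edge-weight matrix and all variable names are illustrative assumptions, not the repo's actual API:

```python
import numpy as np

def mean_influence(edges):
    """Average absolute weight over the nonzero edges of an attribution graph."""
    w = np.abs(edges[edges != 0])
    return w.mean() if w.size else 0.0

def gini(values):
    """Gini coefficient: 0 = perfectly diffuse influence, near 1 = concentrated."""
    v = np.sort(np.abs(np.asarray(values, float)))
    n = v.size
    cum = np.cumsum(v)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def residualize(metric, length):
    """Remove the linear effect of text length via least-squares regression."""
    X = np.column_stack([np.ones_like(length, dtype=float), length])
    beta, *_ = np.linalg.lstsq(X, metric, rcond=None)
    return metric - X @ beta

# Toy example: a per-sample metric that is mostly length-driven
rng = np.random.default_rng(0)
lengths = rng.integers(10, 200, size=50).astype(float)
raw_metric = 0.01 * lengths + rng.normal(0, 0.1, size=50)
controlled = residualize(raw_metric, lengths)
print(np.corrcoef(controlled, lengths)[0, 1])  # ~0 after residualization
```

Least-squares residuals are orthogonal to the regressors by construction, which is why the residualized metric is uncorrelated with length.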
- Requires model internals - Only works on models with SAE/transcoder access (currently Gemma 2 via Goodfire)
- Compute intensive - ~30 sec/sample on A100
- Measures structure, not correctness - Can't detect hallucinations or factual errors
- Must control for length - Raw n_active is confounded (r=0.98 with token count)
MIT