CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Git Commit Practices

Commit Structure

Make small, atomic commits - one logical change per commit
Each commit should be functional and not break the build
Run code formatter (black for Python) after each change
Always run scripts/fix_whitespace_issues.py on all files
Test that code runs successfully before committing

Commit Messages

MANDATORY: Always use this exact format for ALL commits:

file.py: brief description of change

Detailed explanation of what was changed and why.
Include technical details about the implementation.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

LINE LENGTH: Maximum 70 characters per line in commit messages
- Subject line (first line): 70 characters max
- Body paragraphs: 70 characters max per line
- Ensures proper display in git log, email patches, and terminal output
CRITICAL: Never use "🤖 Generated with [Claude Code]" or "Co-Authored-By: Claude"
REQUIRED: Every commit MUST have both "Generated-by: Claude AI" and "Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>"
NO EXCEPTIONS: This format is mandatory for ALL commits, no matter how small
STYLE: Be terse and to the point. NO shopping-list style bullet points. Write in paragraphs explaining the change, rationale, and technical details concisely. Avoid verbose enumeration unless absolutely necessary for clarity.
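
The format rules above can be checked mechanically before committing. The following is a minimal sketch, not a script that exists in this repository: the function name check_commit_message and its return format are hypothetical, and the subject check only looks for a "file: description" shape.

```python
MAX_LINE = 70
REQUIRED_TRAILERS = (
    "Generated-by: Claude AI",
    "Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>",
)
FORBIDDEN = ("Generated with [Claude Code]", "Co-Authored-By: Claude")


def check_commit_message(message: str) -> list[str]:
    """Return a list of rule violations; an empty list means the message passes."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing subject line"]

    problems = []

    # Subject must look like "file.py: brief description of change".
    if ": " not in lines[0]:
        problems.append("subject is not in 'file: description' form")

    # Every line, subject and body alike, is limited to 70 characters.
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE:
            problems.append(f"line {number} is {len(line)} chars (max {MAX_LINE})")

    # Both trailers must appear verbatim on their own lines.
    for trailer in REQUIRED_TRAILERS:
        if trailer not in lines:
            problems.append(f"missing trailer: {trailer}")

    for text in FORBIDDEN:
        if text in message:
            problems.append(f"forbidden text present: {text}")

    return problems
```

An empty return value means the message satisfies the length, trailer, and forbidden-text rules; the STYLE rule about paragraphs still needs human judgment.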

Cross-Agent Access

Some automation looks for agent-specific instruction files (e.g., CODEX.md) instead of CLAUDE.md. To avoid future assistants missing these guidelines, ensure every agent entrypoint symlinks back to this document. For Codex runs, CODEX.md must always be a symlink to CLAUDE.md; add additional symlinks if new agent names are introduced.
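
A quick way to confirm the symlink requirement is a small check like the one below. This is a sketch under assumptions: check_agent_symlinks is a hypothetical helper, and the default list of agent files contains only CODEX.md because that is the only name this document requires.

```python
from pathlib import Path


def check_agent_symlinks(repo_root, agent_files=("CODEX.md",), target="CLAUDE.md"):
    """Return a list of problems; empty means every agent file links to target."""
    root = Path(repo_root)
    expected = (root / target).resolve()
    problems = []
    for name in agent_files:
        path = root / name
        if not path.is_symlink():
            # Covers both a missing file and a regular-file copy.
            problems.append(f"{name} is not a symlink to {target}")
        elif path.resolve() != expected:
            problems.append(f"{name} points somewhere other than {target}")
    return problems
```

When a new agent name is introduced, add its entrypoint file to agent_files so the check covers it.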

Development Workflow

Make a single focused change
Run black formatter on Python files
Test that the code runs without errors
For architectural changes, run make check to validate
Commit with detailed message
Repeat for next change

Architectural changes include:

New attention or MLP mechanisms
Modified forward/backward pass logic
Changes to model patching or wrapper classes
New ablation steps or configurations
Updates to reciprocity/context flow

Code Style

Python

Use black formatter for all Python code
Follow PEP 8 conventions (handled by black)
No manual formatting - always use black

Defconfig Files

CRITICAL: Defconfig files must use exact Kconfig syntax: CONFIG_XXX=y (no spaces around =)
CRITICAL: NO inline comments allowed - comments MUST be on separate lines starting with #
- ✅ CORRECT:
```
# This is a comment
CONFIG_SOMETHING=y
```
- ❌ WRONG (breaks Kconfig parser):
```
CONFIG_SOMETHING=y  # This breaks everything
CONFIG_SOMETHING=y # This also breaks
```
DO NOT apply black formatter to defconfig files or .config files
Kconfig parser silently ignores lines with spaces around equals signs
After any edit to defconfigs, verify syntax: grep " = " defconfigs/* should return nothing
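
The grep above only catches a space on both sides of the equals sign; it misses inline comments and one-sided spaces such as CONFIG_X =y. A stricter check can be sketched in Python. This is an illustration, not an existing repository script: check_defconfig_text is a hypothetical name, and the accepted value syntax (a double-quoted string or a bare token) is an assumption about what the defconfigs contain.

```python
import re

# CONFIG_NAME=value with nothing else on the line. A value is either a
# double-quoted string or a bare token with no whitespace and no '#'.
ASSIGNMENT = re.compile(r'^CONFIG_[A-Za-z0-9_]+=("[^"]*"|[^\s#"]+)$')


def check_defconfig_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that violate the defconfig rules."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are fine
        if line.startswith("#"):
            continue  # full-line comment, including "# CONFIG_X is not set"
        if not ASSIGNMENT.match(line):
            bad.append((number, line))
    return bad
```

Passing the contents of each file under defconfigs/ through this function and failing on any non-empty result gives the same guarantee as the grep, plus inline-comment detection.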

Markdown / Documentation Files

CRITICAL: Write documentation as technical prose, not as an AI-flavored outline.
Start by stating plainly what the document is for and why it exists.
Pull the key motivation into the top of the document. Do not bury the use case deep in the file.
Use a short Table of Contents for longer documents.
Prefer narrative paragraphs over shopping-list bullet dumps.
Use bullets only when they genuinely improve readability: short file lists, compact result summaries, or small enumerations.
Do not create sections like "What this is" if that content belongs in the intro.
Do not write prompt-y filler like "This document stands on its own" or hedged internal commentary like "the evidence is real and already public".
Do not apologize for confusing structure inside the main doc. Fix the structure.
Standalone docs must stand on their own. Do not make the reader chase older internal notes just to understand the actual result.
If lineage or provenance matters, split it into a separate lineage/provenance doc and link to it. Keep the main doc focused on the result itself.
When older work matters, summarize the result directly in the main doc. Put historical breadcrumbs in the lineage doc, not in the main narrative.
Use direct links for referenced docs and scripts. Do not write bare paths when a real markdown link is more useful.
Avoid weird audience markers like "public narrative" or "public scripts" unless the distinction truly matters.
Avoid robotic imperative spam like repeated "Use X" / "Do Y" / "Start with". Mix in natural phrasing such as "You can use..." when it reads better.
If a term is used repeatedly, define it once before leaning on it. Example: define "lane" before saying "lane configs".
If a CLI flag can be misunderstood, explain it where it is introduced. Example: --gpu all means all configured tracks, not local GPU autodetect.
When documenting examples or use cases, prefer short narrative subsections to giant bullet farms.
If a path name is historically misleading, fix the repo layout if practical. Do not leave the doc to carry the whole burden of explaining a bad structure.
Documentation edits should reduce confusion, not relocate it.

GPT2 Model Naming Convention

Overview

All GPT-2 model variants follow a consistent hierarchical naming pattern that makes the research intent and feature composition clear. Models are automatically discovered by scripts using convention-based introspection.

Naming Pattern

Base architecture with incremental features:

GPT2              # Baseline GPT-2 (no modifications)
GPT2_RA           # + Reciprocal Attention (learned alternation)
GPT2_MLA          # + Multi-head Latent Attention (cache compression)
GPT2_MLA_RA       # + Both MLA and RA
GPT2_MLA_KV       # + MLA with KV compression (KVSplice)
GPT2_MLA_RA_KV    # + MLA + RA + KV compression
GPT2_MLA_KV2      # + MLA with 2-latent (separate K/V latents)
GPT2_MLA_KV2M     # + KV2 + MLP compression (MLPSplice)
GPT2_MLA_RA_KVM   # + RA + MLA + KV + MLP compression

Feature abbreviations:

_RA: Reciprocal Attention (alternating Q@K.T and K@Q.T)
_MLA: Multi-head Latent Attention (DeepSeek-style cache compression)
_KV: KV compression via KVSplice
_KV2: 2-latent MLA (separate K and V latent spaces)
M suffix: MLP compression via MLPSplice

Adding New Models

To add a new GPT-2 model variant, follow these conventions. Scripts discover conforming classes automatically, so no manual registration is needed:

Naming: Class name must start with GPT2
- Example: GPT2_MyFeature or GPT2_MLA_MyFeature
Location: Define in ra.py (for MLA-based) or gpt2/model.py (for baseline variants)
Inheritance: Must inherit from nn.Module
Configuration Parameter:
- Use config: GPTConfig for GPT-2 baseline architecture
- Use cfg: RA_MLA_Config for MLA-based architectures
- Must be the first parameter after self in __init__
Interface: Must implement get_num_params() method
Example:

class GPT2_MLA_MyFeature(nn.Module):
    def __init__(self, cfg: RA_MLA_Config, vocab_size: int = 50257):
        super().__init__()
        self.cfg = cfg
        # ... implementation ...

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())

    def forward(self, idx, targets=None):
        # ... implementation ...
        return logits, loss

That's it! The model will be automatically discovered by:

scripts/compare_inference.py --list (inference benchmarking)
Future scripts using the discover_gpt2_models() pattern
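
Convention-based discovery of this kind can be sketched with the standard inspect module. The code below is an illustration only: the real discover_gpt2_models() may be implemented differently, discover_models is a hypothetical name, and a stand-in base class replaces nn.Module so the example runs without PyTorch.

```python
import inspect
import types


def discover_models(module, base_class, prefix="GPT2"):
    """Return {class_name: class} for classes in module that follow the conventions."""
    found = {}
    for name, obj in inspect.getmembers(module, inspect.isclass):
        if not name.startswith(prefix):
            continue  # naming convention
        if not issubclass(obj, base_class):
            continue  # inheritance convention
        if not callable(getattr(obj, "get_num_params", None)):
            continue  # interface convention
        found[name] = obj
    return found


# Stand-in demo. In the repository the base class would be torch.nn.Module
# and the module would be ra or gpt2.model.
class FakeModule:
    pass


class GPT2_Demo(FakeModule):
    def __init__(self, cfg):
        self.cfg = cfg

    def get_num_params(self):
        return 0


class Helper(FakeModule):  # ignored: name does not start with GPT2
    pass


demo = types.ModuleType("demo")
demo.FakeModule = FakeModule
demo.GPT2_Demo = GPT2_Demo
demo.Helper = Helper

print(sorted(discover_models(demo, FakeModule)))  # ['GPT2_Demo']
```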

Ablation Step Naming

Ablation steps use short prefixes that map to model architectures:

B         → GPT2 baseline
RA        → GPT2_RA (fixed reciprocal)
RALEARN   → GPT2_RA (learned reciprocal)
MLA       → GPT2_MLA
RAMLA     → GPT2_MLA_RA
RAMLAKV   → GPT2_MLA_RA_KV
RAMLAKVM  → GPT2_MLA_RA_KVM
MLAKV     → GPT2_MLA_KV
MLAKV2    → GPT2_MLA_KV2
MLAKV2M   → GPT2_MLA_KV2M

Legacy step names with "0" or "1" suffixes (for learning rate ablation) are still supported for backwards compatibility but are deprecated. New experiments should use the architecture name directly without suffixes.

GPU Optimization Preferences

Training Optimizations

When optimizing PyTorch training for AMD GPUs:

Increase batch size to utilize GPU memory
Enable cuDNN benchmark mode
Use mixed precision training (AMP)
Add multiple data loader workers with pinned memory
Include GPU warmup routine
Use torch.compile() for graph optimization
Enable TensorFloat32 for matrix operations
Add comprehensive timing and metrics
Save trained models after completion

Performance Monitoring

Display GPU info at startup
Show per-epoch timing
Track test accuracy after each epoch
Report total training time and average per epoch

Hardware

Primary GPU: AMD Radeon Pro W7900 (48GB)
Optimize for maximum GPU utilization

Testing Requirements

Always verify code runs before committing
Check for linting/formatting issues
Ensure no syntax errors

R&D Phases and Workflow

Research proceeds in two phases with different tooling:

Phase 1: Rapid R&D (current default)

During active research, write standalone Python scripts, eval harnesses, and benchmarks and run them directly. Kconfig adds overhead that slows iteration at this stage:

python3 scripts/my_experiment.py
python3 eval_v28.py --phase 0

Results go to local directories. Once experiments converge and produce publishable results, move to Phase 2.

Phase 2: Reproducible Kconfig Workflow

Once R&D stabilizes and we have results worth preserving, lock the experiment into the Kconfig build system for reproducibility:

Load configuration: make defconfig-<name>
Build and run: make
Results are saved to the configured output directory

The Kconfig system ensures exact reproducibility of finalized experiments. Do NOT use Kconfig during rapid iteration — it slows things down without adding value until the experiment design is stable.

Ralph Loop for Multi-Phase Tasks

We use the Ralph Loop plugin (ralph-loop from claude-plugins-official) for large experiments that span multiple phases. The loop uses a Stop hook to prevent Claude from exiting — instead it feeds the same prompt back, creating a self-referential iteration loop where each pass sees the previous work in files and git history.

How it works:

User writes a task file (e.g., BPA-v42.txt) with numbered phases and clear completion criteria
User invokes /ralph-loop with the task and a --completion-promise (typically COMPLETE)
Claude reads the task file, executes phases in order, commits results, and outputs <promise>COMPLETE</promise> when genuinely done
If Claude tries to exit before completion, the stop hook blocks and re-feeds the prompt — Claude sees its own prior work and continues from where it left off

Typical invocation:

/ralph-loop Read BPA-v42.txt and execute all phases in order \
  --completion-promise COMPLETE --max-iterations 2000

Task file conventions (e.g., BPA-v42.txt):

Number all phases/tasks clearly (Task 1, Task 2, ...)
Include concrete success criteria per phase
Specify what to commit and when
Include a final task that summarizes results
Keep tasks independent enough that resumption works if context compresses mid-run

When to use Ralph Loop:

BPA experiment versions (multi-phase GPU experiments)
Any task with >3 sequential phases
Tasks that may exceed a single context window
Overnight/unattended experiment runs

When NOT to use Ralph Loop:

Quick one-shot edits or fixes
Interactive design discussions
Tasks requiring human judgment between steps

Monitoring: head -10 .claude/ralph-loop.local.md shows current iteration count and state.

Cancelling: /cancel-ralph removes the state file and stops the loop.

Repository Layout

`/data/knlp/` — Code Repository (this repo)

The main working tree. Contains bleeding-edge code, scripts, model implementations, eval harnesses, and configuration. Keep this repo slim. Do NOT commit large artifacts, result directories, checkpoints, plots, or bulk experimental output here.

What belongs here:

Model code (gpt2/, gnn/)
Eval harnesses (eval_v*.py)
Scripts (scripts/)
Defconfigs and Kconfig files
Documentation (docs/)
Small CSV/JSON files that are actively used by scripts

What does NOT belong here:

Result directories (bpa_v*_results/, results/)
Artifact directories (artifacts/)
Training checkpoints (.pt, .pth)
Generated plots and PNGs (except images/ for README)
Final reports and scoreboards (archive to key-results)
Large experiment output

`/data/knlp-key-results/` — Artifact Archive

Stores all historical results, artifacts, and bulk data. Commit freely here — size is not a concern. Organized by research area:

bpa/ — BPA (Bit Precision Allocation) results v1-v28+, artifacts, figures, scoreboards, reports, branch trees
key_results/ — Training matrix results with W&B configs

When an experiment version is complete, copy its artifacts:

cp -a results/v29/ /data/knlp-key-results/bpa/results/v29/
cp bpa_v29_final_report.md /data/knlp-key-results/bpa/
cd /data/knlp-key-results && git add bpa/ && git commit

`/data/paper-memory-decode/` — Paper Repository (example)

Each paper gets its own git repository. The paper repo contains LaTeX source, figures, and any data needed to reproduce the paper build. See "Paper Workflow" below.

BPA Experiment Instructions

Each BPA experiment version is defined in an instruction file (e.g., BPA-v48.txt) that specifies the experiment phases, models, metrics, and completion criteria. These files are given to Claude via the Ralph Loop plugin for autonomous execution.

Naming Convention

Use BPA-vNN.txt (all-caps prefix, dash, lowercase v, number)
Examples: BPA-v46.txt, BPA-v47.txt, BPA-v48.txt

Storage

During execution: instruction file lives in /data/knlp/
After completion: move to /data/knlp-key-results/bpa-instructions/
All historical instructions are archived there (v2 through v48+)

Result Collection Convention

Each BPA experiment stores results in a versioned directory under /data/knlp-key-results/bpaNN/ with a consistent structure:

/data/knlp-key-results/bpa48/
├── json/           # All metrics as JSON files
├── plots/          # Generated PNG figures (300 DPI)
├── logs/           # Execution logs
├── models/         # Model checkpoints (if any)
├── fim_maps/       # Fisher sensitivity maps
├── interaction_maps/  # Cross-layer interaction data
└── bpa48_summary.md   # Final summary report

Required conventions:

All metrics saved as JSON (machine-readable)
All plots saved as PNG at 300 DPI
Summary report in markdown with tables, plot references, and interpretation
Execution script committed to /data/knlp/scripts/ as bpa_vNN_w7900.py (or appropriate GPU name)
Results committed to key-results repo separately from code
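
The directory layout and the JSON convention above can be set up with a few lines of standard-library Python. This is a sketch, not an existing helper: init_results_dir and save_metrics are hypothetical names, and the placeholder summary heading is an assumption.

```python
import json
from pathlib import Path

SUBDIRS = ("json", "plots", "logs", "models", "fim_maps", "interaction_maps")


def init_results_dir(root, version: int) -> Path:
    """Create the bpaNN/ layout under root and return the version directory."""
    base = Path(root) / f"bpa{version}"
    for sub in SUBDIRS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    summary = base / f"bpa{version}_summary.md"
    if not summary.exists():
        # Placeholder only; the real summary carries tables and interpretation.
        summary.write_text(f"# BPA v{version} summary\n")
    return base


def save_metrics(base, name: str, metrics: dict) -> Path:
    """Write one metrics dictionary to json/<name>.json in a stable key order."""
    path = Path(base) / "json" / f"{name}.json"
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return path
```

For the plot convention, matplotlib's savefig accepts dpi=300, so figures written into plots/ can meet the 300 DPI requirement directly.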

Script Pattern

BPA experiment scripts follow a consistent pattern established in v46+:

Load model with attn_implementation='eager' and .to('cuda')
Use BF16 dtype (FP16 causes NaN with large-vocab models)
CPU-offload logits to avoid OOM during dual-pass evaluation
Cache prior results from JSON for incremental re-runs
OOM try/except handling around all GPU-intensive loops
Early collapse detection before full evaluation
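
The result-caching item in the list above can be sketched as follows. This is a minimal illustration rather than the code used in the v46+ scripts: run_cached is a hypothetical helper, and it assumes each result is JSON-serializable and identified by a string key.

```python
import json
from pathlib import Path


def run_cached(cache_path, key, compute):
    """Return the cached result for key, or call compute() and persist it."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key in cache:
        return cache[key]  # skip the expensive evaluation on re-runs

    result = compute()  # must return something JSON-serializable
    cache[key] = result
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cache, indent=2, sort_keys=True) + "\n")
    return result
```

Writing the cache after every new result means an interrupted run, for example one that hits OOM, can resume without repeating completed evaluations.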

Paper Workflow

Each paper lives in a separate git tree (e.g., /data/paper-memory-decode/). The paper is written and built from the knlp working tree where all code and key-results are accessible.

Convention

Paper repos are independent git trees, one per paper
The paper repo must be self-contained for building: LaTeX, figures, BibTeX, and any data referenced by the paper
Experiment scripts live in knlp, not the paper repo
When experiments produce figures or tables for the paper, copy them into the paper repo and commit there
If the paper references specific data (CSV, JSON), copy the relevant subset into the paper repo so make works without external dependencies

Workflow

Run experiments in knlp (Phase 1 rapid R&D)
Archive results to knlp-key-results
Copy figures/tables/data into the paper repo
Write LaTeX in the paper repo
Commit both the paper repo and knlp independently

This ensures the paper repo can build on any machine with just a git clone and make, while knlp stays focused on code and knlp-key-results holds the full artifact history.

Configuration System Internals

Type Handling

.config files use string values: "y", "n", "value"
config.py converts to Python types: True, False, integers, floats

When checking config values in Python code, handle both types:

# Good - handles both string and boolean
if value in ("y", True):

# Bad - only works with one type
if value == "y":
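
One way to keep this check consistent is a single helper that every call site uses. The name config_enabled is hypothetical; it is a sketch of the pattern above, not a function that config.py is known to provide.

```python
def config_enabled(value) -> bool:
    """True when a config value means 'enabled', whether raw ("y") or converted (True)."""
    # Note: 1 == True in Python, so the integer 1 also counts as enabled here.
    return value in ("y", True)


# Raw .config string and converted Python value both work.
print(config_enabled("y"), config_enabled(True))    # True True
print(config_enabled("n"), config_enabled(False))   # False False
```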

Test Matrix vs Ablation Mode

Mutually exclusive: Cannot enable both CONFIG_TEST_MATRIX_MODE and CONFIG_RA_MLA_ABLATION_MODE
Test matrix mode: Tests optimizer/pruning combinations
Ablation mode: Tests architectural variations (RA, MLA, RA-CT, etc.)
Always verify which mode is active when debugging unexpected test counts
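
When debugging unexpected test counts, a small guard can report the active mode and reject the invalid combination. This is a sketch under assumptions: active_mode is a hypothetical helper, and it assumes the configuration is available as a dictionary keyed by the full CONFIG_ names.

```python
def active_mode(config: dict) -> str:
    """Return 'test_matrix', 'ablation', or 'none'; raise if both modes are on."""

    def enabled(key):
        # Accept both the raw string and the converted boolean.
        return config.get(key) in ("y", True)

    matrix = enabled("CONFIG_TEST_MATRIX_MODE")
    ablation = enabled("CONFIG_RA_MLA_ABLATION_MODE")
    if matrix and ablation:
        raise ValueError(
            "CONFIG_TEST_MATRIX_MODE and CONFIG_RA_MLA_ABLATION_MODE "
            "are mutually exclusive"
        )
    if matrix:
        return "test_matrix"
    if ablation:
        return "ablation"
    return "none"
```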

Ablation Study Requirements

Multi-File Synchronization

When extending ablation studies with new steps, THREE files must be updated in sync:

defconfigs/gpt2-ratio-ablation: Add step descriptions in comments
gpt2/train_ra_mla.py: Add step configurations (elif step == "N" blocks)
scripts/run_test_matrix.py: Update step_descriptions dictionary

Missing any of these causes:

Defconfig only: Steps run but have no description
train_ra_mla.py only: Steps fail to execute
run_test_matrix.py only: Descriptions show but steps don't run

Ablation Step Checklist

When adding a new ablation step:

Add step config block to train_ra_mla.py (around line 500+)
Update step_descriptions dict in run_test_matrix.py (around line 2095)
Document step in defconfig comments
Update CONFIG_RA_MLA_ABLATION_STEPS string to include new step number
REQUIRED: Validate with dry-run: ./scripts/validate_ablation_steps.sh

Dry-Run Validation

Architecture Validation Before GPU Training

CRITICAL: Always validate architectural changes with dry-run before committing GPU resources. Recent bugs wasted 7+ hours of GPU time that dry-run would have caught in 60 seconds.

When to Use Dry-Run

Run dry-run validation before:

Committing architectural changes (new attention mechanisms, MLP modifications)
Adding new ablation steps
Modifying forward/backward pass logic
Changing wrapper classes or patching code
After fixing bugs that affected multiple configurations

Dry-Run Tools

Quick Check (Recommended)

# Run full architecture validation via Makefile
make check

# Completes in ~97 seconds (19 steps @ ~5s each)
# Loads gpt2-ratio-ablation config with DRY_RUN=1
# Tests all ablation steps automatically
# Exit code 0: all pass, 1: failures detected

ALWAYS run make check before committing architectural changes that may affect runtime behavior.

Single Step Validation

# Test specific ablation step
python3 gpt2/train_ra_mla.py --ra-mla-ablation-step N \
  --optimizer adamwspam --dataset finewebedu --dry-run

# Exit code 0: architecture valid
# Exit code 1: error (prints stack trace)

Manual All Steps Validation

# Test all 19 RATIO ablation steps (manual script)
./scripts/validate_ablation_steps.sh

# Completes in ~60 seconds
# Reports which steps pass/fail
# Provides commands to debug failures

What Dry-Run Catches

Configuration errors (wrong test mode, invalid parameters)
Architecture errors (TypeError from wrong arguments)
Assertion failures (missing required data)
Forward pass failures (dimension mismatches)
Backward pass failures (gradient computation errors)
Optimizer step failures (parameter update errors)

What Dry-Run Misses

OOM errors (uses small batch on CPU)
Multi-GPU/DDP issues (runs single CPU)
Data loading errors (uses dummy data)
Long-term training instabilities
Performance regressions

Recent Bugs Caught by Dry-Run

RA_MLA_Block argument passing: 17/19 steps failed with TypeError when MLP received unexpected kwargs
Assertion strictness: 6/19 steps failed when first block had no context from previous block

Both would have been caught before GPU training with dry-run.

Defensive Programming

Assertions for Optional Features

When implementing optional/conditional features that depend on data flow:

Add assertions for data that MUST be present (e.g., within a single component)
Avoid assertions for data that may legitimately be None (e.g., first block in sequence)
Silent failures waste GPU time - better to fail fast with clear error messages

Pattern for required data within component:

if self.cfg.feature_enabled:
    assert required_data is not None, "feature_enabled but no required_data"

Pattern for optional data from other blocks:

if self.cfg.feature_enabled and data_from_prev_block is not None:
    # use the data

Examples from RA+MLA:

ReciprocalMLP asserts attn_weights/attn_latent are provided by RA_MLA_Block (same component, always required)
RA_MLA_Attention handles None mlp_gate_context gracefully (from previous block, None for first block)
Use dry-run validation to catch assertion failures before GPU training

Context Flow for Multi-Block Architectures

When implementing bidirectional information flow between transformer blocks:

Use wrapper classes (e.g., RA_MLA_Block) to manage context state across blocks
Store contexts in instance variable (e.g., self._ctx = {})
Pass contexts as keyword arguments (enables detection of missing connections)
Produce contexts for the next block at the end of forward pass
Never assume contexts exist - always check with assertions when used

Wrapper Class Adaptability

When creating wrapper classes for mixed configurations:

Check wrapped component type at runtime: Use hasattr() or isinstance() to detect capabilities
Conditionally pass arguments: Standard components may not accept extended keyword arguments
Graceful degradation: Support both enhanced and standard components in same wrapper

Pattern:

# Good - adapts to component type
is_enhanced = hasattr(self.component, "enhanced_method")
if is_enhanced:
    out = self.component(x, extra_arg=value)
else:
    out = self.component(x)

# Bad - assumes all components are enhanced
out = self.component(x, extra_arg=value)  # crashes on standard components

Example: RA_MLA_Block wraps either ReciprocalMLP (accepts attn_weights/attn_latent) or standard MLP (does not). Runtime check prevents TypeError when ablation steps disable reciprocity mechanisms.

Architectural Pattern Guidelines

Feature Independence and Composability

When adding new attention/MLP mechanisms:

Keep features orthogonal: RA-CT (attention-only gating) vs MLP mechanisms (cross-layer flow)
Use clear naming: ra_cross_token for attention features, mlp_attn_gate for MLP features
Enable ablation: Each feature should be independently testable
Avoid coupling: RA-CT doesn't require MLA/RA, can be tested on baseline GPT-2

Per-Head Learnable Parameters

For per-head gating mechanisms:

Initialize to near-identity: bias ≈ 2.0 for sigmoid gates (pass-through initially)
Use affine transforms: sigmoid(stat * scale + bias) for numerical stability
Shape: [n_head] for per-head parameters, expandable to [B,H,T] when needed
Consider head_average=True option for cheaper computation
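
The near-identity initialization can be checked numerically: sigmoid(2.0) is about 0.88, so a freshly initialized gate passes most of the signal. The sketch below uses plain Python instead of tensors; the statistic values and the initial scale of 1.0 are made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def head_gates(stats, scale, bias):
    """Per-head gate sigmoid(stat * scale + bias), one value per head."""
    return [sigmoid(s * sc + b) for s, sc, b in zip(stats, scale, bias)]


n_head = 4
stats = [0.0, 0.1, -0.1, 0.2]  # made-up per-head attention statistics
scale = [1.0] * n_head         # assumed initial scale
bias = [2.0] * n_head          # near-identity initialization from the list above

gates = head_gates(stats, scale, bias)
print([round(g, 3) for g in gates])
```

In a model, scale and bias would be learnable parameters of shape [n_head], and the resulting gate would be broadcast to [B,H,T] where needed.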

Statistics-Based Gating

When implementing gating based on attention statistics:

Support multiple modes: topk, max, entropy, rms
Provide detach_stats option to compute under no_grad() for memory savings
Apply gate at multiple points: weights (pre-softmax) or output (post-aggregation)
Use alpha mixing parameter for smooth interpolation: (1-α)·x + α·(x⊙gate)
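
The alpha mixing formula interpolates between the ungated value (alpha = 0) and the fully gated value (alpha = 1). A minimal element-wise sketch in plain Python, with mix_gate as a hypothetical name:

```python
def mix_gate(x, gate, alpha):
    """Blend ungated and gated values: (1 - alpha) * x + alpha * (x * gate)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * xi + alpha * (xi * gi) for xi, gi in zip(x, gate)]


x = [2.0, 4.0]
gate = [0.5, 0.25]
print(mix_gate(x, gate, 0.0))  # [2.0, 4.0]  gate has no effect
print(mix_gate(x, gate, 1.0))  # [1.0, 1.0]  fully gated
print(mix_gate(x, gate, 0.5))  # [1.5, 2.5]  halfway between
```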

GPU Memory Management

Memory Optimization Strategies

Enable expandable segments: PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
Disable expensive metrics logging during training (e.g., entropy computation on attention weights)
Use @torch.no_grad() for statistics computation that doesn't need gradients
Monitor for OOM errors in attention mechanisms - often caused by extra allocations for metrics
Batch size × gradient accumulation = effective batch size (keep constant when adjusting for memory)

A10G-Specific Considerations

24GB VRAM per GPU requires careful batch size tuning
For GPT-2 124M with RA+MLA: batch_size=8, gradient_accumulation=8 (effective=64)
Tensor dimensions should be multiples of 64 for optimal tensor core utilization
Disable metrics logging for attention mechanisms to prevent OOM during entropy computation
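
Keeping the effective batch size constant while shrinking the per-step batch is simple arithmetic. The helpers below are a sketch with hypothetical names; the num_gpus factor follows the batch × grad_acc × num_gpus definition used in the publishing checklist.

```python
def effective_batch_size(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    return batch_size * grad_accum * num_gpus


def grad_accum_for(target_effective: int, batch_size: int, num_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to hit target_effective exactly."""
    per_step = batch_size * num_gpus
    if target_effective % per_step != 0:
        raise ValueError(
            f"effective batch {target_effective} is not a multiple of {per_step}"
        )
    return target_effective // per_step


# The A10G setting above: 8 x 8 = 64.
print(effective_batch_size(8, 8))  # 64
# Halving the batch to fit memory means doubling accumulation.
print(grad_accum_for(64, 4))       # 16
```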

WandB Helper Scripts

When analyzing experiment results or comparing GPU performance across runs, use the W&B query scripts in the scripts/ directory. These require the micromamba environment.

Environment Setup

Before running any W&B query scripts:

source ~/bin/wl700-ml  # Activates w7900-ml micromamba environment

This provides wandb, pandas, and other dependencies needed for querying experiment data.

Available Scripts

scripts/inspect_wandb_keys.py: Discover available metrics in a run

Usage for inspecting what data is available:

python scripts/inspect_wandb_keys.py \
  --entity mcgrof-citizen \
  --project gpt2-bitter9-compiled-b200x4 \
  --run-name gpt2_adamwprune_bitter9_state_50

scripts/query_wandb_gpu.py: Query GPU metrics from training history

Usage for checking GPU memory and compute utilization:

python scripts/query_wandb_gpu.py \
  --entity mcgrof-citizen \
  --project gpt2-bitter9-compiled-b200x4 \
  --run-name gpt2_adamwprune_bitter9_state_50

scripts/query_wandb_gpu_full.py: Query detailed GPU metrics from system events

Usage for detailed system metrics including power and temperature:

python scripts/query_wandb_gpu_full.py \
  --entity mcgrof-citizen \
  --project gpt2-bitter9-compiled-b200x4 \
  --run-name gpt2_adamwprune_bitter9_state_50

scripts/plot_torch_compile_impact.py: Generate publication-quality visualizations comparing GPU performance across runs

This is a reusable visualization script that queries W&B and generates four graphs showing performance comparisons. Used to prove torch.compile() was the bottleneck.

Usage:

source ~/bin/wl700-ml
python scripts/plot_torch_compile_impact.py

The script is hardcoded to query mcgrof-citizen/gpt2-bitter8-nocompile-w7900 but can be easily adapted for other projects by editing the project variable in main().

Generated graphs (300 DPI, publication quality):

torch_compile_comparison.png: Side-by-side memory and compute comparison
torch_compile_grouped.png: All runs in grouped bar chart with color coding
torch_compile_before_after.png: Dramatic before/after horizontal bars with annotations
bitter8_vs_baseline.png: Spotlight showing minimal overhead of state-based pruning

The script demonstrates the pattern for:

Querying W&B API for multiple runs
Extracting system.gpu.* metrics from event stream
Computing averages across runs
Creating matplotlib visualizations with annotations
Using color coding (red=bad, green=good) for clarity

When to use this script:

After GPU profiling reveals performance differences
To prove bottleneck hypotheses with visual evidence
To compare optimization variants systematically
To generate graphs for documentation or papers

Customization tips:

Edit project variable to query different W&B project
Modify fetch_wandb_data() to extract different metrics
Update graph functions to change visual style
Add new graph types by creating new functions following existing patterns

Comparing Runs

To compare GPU performance across multiple runs (baseline vs optimizations), write a custom Python script using the W&B API. See docs/tracker.md for detailed examples.

Pattern for comparing runs:

import wandb

api = wandb.Api()
project = "mcgrof-citizen/gpt2-bitter9-compiled-b200x4"

run_names = ["baseline", "bitter8", "bitter9"]

for name in run_names:
    runs = api.runs(project, filters={"config.run_name": name})
    if runs:
        run = runs[0]
        history = run.history(
            keys=["gpu/memory_util_avg", "gpu/compute_util_avg"],
            samples=1000
        )
        if not history.empty:
            print(f"{name}:")
            print(f"  Memory: {history['gpu/memory_util_avg'].mean():.2f}%")
            print(f"  Compute: {history['gpu/compute_util_avg'].mean():.2f}%")

Key Metrics to Check

When analyzing GPU performance issues:

gpu/memory_util_avg: Memory bandwidth utilization (%)
gpu/compute_util_avg: Compute utilization (%)
gpu/memory_used_avg_gb: Average memory per GPU (GB)

Low memory utilization (<20%) means the run is not saturating memory bandwidth, and low compute utilization (<50%) means the compute units are underused; in either case, look for what is stalling the GPU. Compare optimization runs to baseline to verify improvements.

Publishing Results

Before publishing experimental results in documentation, papers, or public communications, perform rigorous verification to ensure reproducibility and fairness.

Verification Checklist

When publishing statistics or performance comparisons:

Use W&B API to verify hyperparameters: Query all runs via W&B API to confirm consistent hyperparameters across comparisons. Verify batch size, gradient accumulation, learning rate, warmup steps, and all optimizer-specific settings match exactly.
Verify git commit exists and is public: Confirm the exact git commit SHA used for training exists in the public repository. Document the commit ID in published results so others can reproduce experiments with identical code.
Perform apples-to-apples sanity checks: Before claiming performance differences, verify:
- Equal training time (CONFIG_GPT2_MAX_TIME) across all methods
- Same effective batch size (batch × grad_acc × num_gpus)
- Same hardware configuration (GPU type, count, memory)
- Same torch.compile status (all enabled or all disabled)
- Same dataset and preprocessing
- Same evaluation protocol (samples, intervals)
Check for confounding variables: Verify no unintended differences like:
- Different torch.compile status (one compiled, one not)
- Different batch sizes due to GPU-specific configs
- Different stopping conditions (time vs iterations)
- Different random seeds causing outlier results
- Different CUDA/PyTorch/GPU driver versions

W&B Verification Script Pattern

Use this pattern to verify hyperparameter consistency:

import wandb

api = wandb.Api()
project = "mcgrof-citizen/your-project"
run_names = ["baseline", "method_a", "method_b"]

configs = {}
for name in run_names:
    runs = api.runs(project, filters={"display_name": name})
    if runs:
        run = runs[0]
        configs[name] = {
            "batch_size": run.config.get("batch_size"),
            "gradient_accumulation": run.config.get("gradient_accumulation"),
            "learning_rate": run.config.get("learning_rate"),
            "max_time": run.config.get("max_time"),
            "compile": run.config.get("compile_model"),
            "commit": run.config.get("git_commit"),
        }

# Verify all configs match on critical hyperparameters
for key in ["batch_size", "gradient_accumulation", "learning_rate"]:
    values = {name: c[key] for name, c in configs.items()}
    if len(set(values.values())) > 1:
        # Report only the mismatching key, per run, so the diff is readable
        print(f"WARNING: {key} differs across runs: {values}")

Publication Requirements

Published results MUST include:

Git commit SHA for exact code version
W&B project and run names for verification
Hardware specification (GPU model, count, memory)
Training time allocation per method
Effective batch size calculation
torch.compile status
Dataset and preprocessing details

This enables independent verification and reproduction of published claims. Do not publish results without completing the verification checklist.

KVSplice Verification

KVSplice is a learned KV cache compression layer that achieves 12x total compression (6x from MLA multiplied by 2x from KVSplice). Before claiming compression ratios or memory savings, verify both training quality and inference memory reduction.

Training Verification

When evaluating KVSplice training results:

Compare across GPU types: Run ablation on multiple GPUs (W7900, A100, H100) to verify consistency and detect hardware-specific issues
Check transform parameter learning: Extract scale/shift values from checkpoints to verify the learned monotonic transform is actually training (not stuck at initialization)
```
python scripts/extract_kvsplice_params.py \
  --checkpoint path/to/checkpoint.pt
```
Monitor KVSplice metrics in W&B: Verify that scale_mean, scale_std, shift_mean, shift_std are logged during training. If missing, check architecture detection in _compute_kvsplice_param_metrics()
Verify compression ratio setting: Confirm CONFIG_MLA_COMPRESSION_RATIO is set correctly in defconfig and matches W&B config. Default is 0.5 (2x compression on top of MLA)
Quality degradation tolerance: KVSplice should add only 0.5-1.4% quality loss compared to MLA alone. Larger degradation indicates a bug
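The quality tolerance above can be checked mechanically. A minimal sketch, assuming quality is a lower-is-better metric such as validation loss or perplexity (the text above does not name the metric); the function names and example numbers are illustrative:

```python
def kvsplice_quality_loss_pct(mla_metric: float, kvsplice_metric: float) -> float:
    """Relative degradation vs MLA alone, in percent (lower-is-better metric)."""
    if mla_metric <= 0:
        raise ValueError("baseline metric must be positive")
    return (kvsplice_metric - mla_metric) / mla_metric * 100.0


def within_tolerance(loss_pct: float, max_pct: float = 1.4) -> bool:
    """True when degradation stays at or below the documented upper bound."""
    return loss_pct <= max_pct


# Made-up validation losses, not measured results.
pct_ok = kvsplice_quality_loss_pct(mla_metric=3.50, kvsplice_metric=3.535)
pct_bad = kvsplice_quality_loss_pct(mla_metric=3.50, kvsplice_metric=3.60)
print(within_tolerance(pct_ok), within_tolerance(pct_bad))
```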

Inference Verification

Before publishing inference memory savings claims:

Run direct cache measurement: Use scripts/verify_kvsplice_memory.py to measure actual cache tensor sizes across sequence lengths
```
python scripts/verify_kvsplice_memory.py
```
Verify cache tensor shapes: Inspect returned cache objects to confirm dimensions:
- MLA: [B, T, d_latent] where d_latent=256
- KVSplice: [B, T, d_compressed] where d_compressed=128 (ratio=0.5)
Check compression ratio accuracy: Memory savings should match theoretical predictions within 5%:
- Expected savings: (1 - compression_ratio) * 100%
- Example: ratio=0.5 should give 50% cache reduction vs MLA
Test multiple sequence lengths: Verify compression holds across 256, 512, and 1024 token sequences. Savings should scale linearly
Calculate production throughput: Estimate how many parallel sequences fit in GPU memory with compressed cache vs standard cache. Include model weights in calculation
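The theoretical side of these checks is plain arithmetic. A rough sketch, assuming one cache tensor per layer with the shapes listed above, 2 bytes per element, and 12 layers; the layer count, GPU size, and weight size are placeholders, and the estimate ignores activations and allocator overhead:

```python
def cache_bytes(
    batch: int, seq_len: int, dim: int, n_layers: int, bytes_per_elem: int = 2
) -> int:
    """Bytes held by per-layer cache tensors of shape [batch, seq_len, dim]."""
    return batch * seq_len * dim * n_layers * bytes_per_elem


def savings_pct(baseline_bytes: int, compressed_bytes: int) -> float:
    """Cache reduction relative to the baseline, in percent."""
    return (1.0 - compressed_bytes / baseline_bytes) * 100.0


def max_parallel_sequences(
    gpu_bytes: int, weight_bytes: int, per_seq_cache_bytes: int
) -> int:
    """Sequences whose caches fit once model weights are resident."""
    return max(0, (gpu_bytes - weight_bytes) // per_seq_cache_bytes)


D_LATENT, D_COMPRESSED = 256, 128  # shapes listed above (ratio=0.5)
N_LAYERS = 12  # assumption for illustration

# Savings should be the same at every sequence length.
savings_by_len = {}
for seq_len in (256, 512, 1024):
    mla = cache_bytes(1, seq_len, D_LATENT, N_LAYERS)
    kvs = cache_bytes(1, seq_len, D_COMPRESSED, N_LAYERS)
    savings_by_len[seq_len] = savings_pct(mla, kvs)

# Hypothetical budget: 24 GiB GPU, 1 GiB of weights, 1024-token sequences.
GIB = 1024**3
fit_mla = max_parallel_sequences(
    24 * GIB, 1 * GIB, cache_bytes(1, 1024, D_LATENT, N_LAYERS)
)
fit_kvs = max_parallel_sequences(
    24 * GIB, 1 * GIB, cache_bytes(1, 1024, D_COMPRESSED, N_LAYERS)
)
print(savings_by_len, fit_mla, fit_kvs)
```

Compare these predictions against the sizes reported by scripts/verify_kvsplice_memory.py; a gap larger than the 5% tolerance points to a shape or dtype mismatch.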

Transform Parameter Analysis

KVSplice uses a learned monotonic transform before low-rank projection. To verify it's learning:

Extract parameters from checkpoint:

python scripts/extract_kvsplice_params.py \
  --checkpoint test_matrix_results_*/checkpoint.pt

Check for variance across dimensions: If all scale values are identical and all shift values are zero, parameters are not learning
Initial values to expect:
- Scale: softplus(1.0) ≈ 1.3133 (initialization)
- Shift: 0.0 (initialization)
- After training: should show variance across 256 dimensions
Pruning candidates: Dimensions with scale < 0.1 after training are low-importance and candidates for pruning
LayerNorm impact: If transform parameters don't learn, try adding LayerNorm to latent space to stabilize gradients
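The checks above can be scripted once the scale and shift values are available as plain lists. A minimal sketch; the output format of scripts/extract_kvsplice_params.py is not shown here, so the list inputs, function names, and tolerance are assumptions:

```python
import math

INIT_SCALE = math.log1p(math.exp(1.0))  # softplus(1.0)
INIT_SHIFT = 0.0


def transform_is_stuck(scales, shifts, atol=1e-4):
    """True when all scales are identical and all shifts are (near) zero."""
    scales_identical = max(scales) - min(scales) <= atol
    shifts_zero = all(abs(b) <= atol for b in shifts)
    return scales_identical and shifts_zero


def pruning_candidates(scales, threshold=0.1):
    """Indices of dimensions whose trained scale falls below the threshold."""
    return [i for i, s in enumerate(scales) if s < threshold]


untrained = transform_is_stuck([INIT_SCALE] * 256, [INIT_SHIFT] * 256)
trained_scales = [INIT_SCALE] * 254 + [0.05, 2.0]  # made-up values
trained = transform_is_stuck(trained_scales, [0.0] * 255 + [0.3])
print(round(INIT_SCALE, 4), untrained, trained, pruning_candidates(trained_scales))
```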

Known Issues

Transform parameters not learning: Current experiments show KVSplice transform parameters remain at initialization values (scale ≈ 1.3133, shift = 0.0) even after 1000+ iterations. This means KVSplice is working purely via low-rank projection (compress/expand layers), not the learned transform. This may be optimal if the compress/expand layers can learn the mapping directly.

Architecture detection for metrics: Early versions failed to log KVSplice metrics because the code only checked for raw_model.transformer (standard GPT-2), while MLA uses raw_model.blocks. Fixed in the commit that added dual architecture detection.

Memory measurement pitfalls: Don't measure cache memory by running full forward passes (passing all previous tokens). This defeats the purpose of caching. Instead, extract cache objects from blocks with use_cache=True and measure tensor sizes directly.
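Direct measurement of tensor sizes can be sketched as below. It assumes the cache returned by a block is a tensor, or a nested list/tuple/dict of tensors, exposing `numel()` and `element_size()` as `torch.Tensor` does. How the cache is pulled out of the blocks depends on the model code and is not shown. `FakeTensor` is a stand-in so the sketch runs without torch:

```python
from math import prod


def cache_nbytes(cache) -> int:
    """Sum the storage size of every tensor in a nested cache structure."""
    if cache is None:
        return 0
    if hasattr(cache, "numel") and hasattr(cache, "element_size"):
        return cache.numel() * cache.element_size()
    if isinstance(cache, dict):
        return sum(cache_nbytes(v) for v in cache.values())
    if isinstance(cache, (list, tuple)):
        return sum(cache_nbytes(v) for v in cache)
    raise TypeError(f"unsupported cache entry: {type(cache).__name__}")


class FakeTensor:
    """Stand-in exposing the two tensor methods used above."""

    def __init__(self, shape, itemsize):
        self.shape, self.itemsize = shape, itemsize

    def numel(self):
        return prod(self.shape)

    def element_size(self):
        return self.itemsize


# Hypothetical 12-layer caches with the shapes documented above, 2 bytes/elem.
mla_cache = [FakeTensor((1, 1024, 256), 2) for _ in range(12)]
kvsplice_cache = [FakeTensor((1, 1024, 128), 2) for _ in range(12)]
print(cache_nbytes(mla_cache), cache_nbytes(kvsplice_cache))
```

Because only tensor metadata is read, this does not trigger a forward pass and is unaffected by allocator fragmentation.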

Verification Scripts

scripts/verify_kvsplice_memory.py: Measure cache tensor sizes
scripts/extract_kvsplice_params.py: Extract learned transform parameters
scripts/compare_kvsplice_gpus.py: Compare results across GPU types
scripts/plot_kvsplice_inference_memory.py: Generate visualization plots

Documentation Updates

After verification, update documentation with plots and results:

Add inference verification section to docs/kvsplice.md:
- Include cache memory comparison plots
- Show compression breakdown visualization
- Document cache tensor shapes
- Provide memory savings table
Update GPU comparison summary in docs/kvsplice/gpu-comparison-summary.md:
- Add inference verification results
- Compare theoretical vs actual compression
- Document production implications

Generate publication-quality plots (300 DPI):

python scripts/plot_kvsplice_inference_memory.py

See docs/kvsplice.md for complete inference verification results with plots showing 50% cache reduction (12 MB → 6 MB at 1024 tokens) and 83.3% total reduction vs standard GPT-2 (36 MB → 6 MB).

Documentation

Keep changes well-documented in commit messages
Explain technical rationale for optimizations
Include performance impact where applicable

Avoid silly language

You are not allowed to use the word "comprehensive". It is overused and does not explain anything. We prefer to be terse and to the point.

Companion Repositories for paper-memory-decode

The paper "Memory-Traffic Saturation in Autoregressive Transformer Decode" lives at knlp.io/decode and depends on three modified serving-stack repos plus the paper LaTeX repo. All four are public on GitHub:

| Repo | GitHub | Branch | What it contains |
| --- | --- | --- | --- |
| vllm-asym | `github.com/mcgrof/vllm` | `asymmetric-kv-plumbing` | vLLM v1 with tuple K/V cache, FlashAttn writer patch, asym dtype plumbing |
| flashinfer-asym | `github.com/mcgrof/flashinfer` | `asym-prefill-refactor-stage` | FlashInfer with FI-1..FI-5 CUDA template refactor for independent K/V dtypes in prefill+decode |
| lmcache | `github.com/mcgrof/LMCache` | `asymmetric-kv-codec` | LMCache with K16/V8 codec, split-tier placement, serde, 74 CPU unit tests |
| paper | `github.com/mcgrof/paper-memory-decode` | `main` | LaTeX source, figures, data, generate scripts |

On monster (the primary workstation), the local clones live at:

| Repo | Path | Branch |
| --- | --- | --- |
| vllm-asym | `/home/mcgrof/devel/vllm-asym` | `asymmetric-kv-plumbing` |
| flashinfer-asym | `/home/mcgrof/devel/flashinfer-asym` | `asym-prefill-refactor-stage` |
| lmcache | `/home/mcgrof/devel/lmcache` | `asymmetric-kv-codec` |
| paper | `/home/mcgrof/devel/paper-memory-decode` | `main` |

On prune (the storage server), mirrors live under /data/. Push with dated branch refs to avoid disturbing prune's checked-out branch:

git push prune branch:refs/heads/branch-monster-YYYY-MM-DD
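The dated refspec in the command above can be generated rather than typed by hand. A small sketch; the helper name is illustrative, and the example date is arbitrary:

```python
from datetime import date
from typing import Optional


def dated_refspec(
    branch: str, host: str = "monster", day: Optional[date] = None
) -> str:
    """Build '<branch>:refs/heads/<branch>-<host>-YYYY-MM-DD'."""
    day = day or date.today()
    return f"{branch}:refs/heads/{branch}-{host}-{day.isoformat()}"


refspec = dated_refspec("asymmetric-kv-codec", day=date(2026, 4, 27))
push_cmd = ["git", "push", "prune", refspec]
print(" ".join(push_cmd))
```

Pushing to a new dated ref never updates the branch that prune has checked out, so the remote working tree is left alone.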

Key results archive: prune:/data/knlp-key-results/flashinfer-asym-e2e-20260427/

Building the asym serving stack on a GPU pod

The vLLM asym branch requires torch >= 2.10, cmake >= 4.0, and the FlashInfer cutlass submodule initialized. The tested recipe (H100 SECURE pod, RunPod):

# 1. FlashInfer
cd /root && git clone --branch asym-prefill-refactor-stage \
    https://github.com/mcgrof/flashinfer.git flashinfer-src
cd flashinfer-src && git submodule update --init --recursive
pip install --no-build-isolation -e .

# 2. vLLM (pulls torch and rebuilds _C; ~60 min CUDA compile)
cd /root && git clone --branch asymmetric-kv-plumbing \
    https://github.com/mcgrof/vllm.git vllm-src
cd vllm-src && MAX_JOBS=32 NVCC_THREADS=2 \
    pip install --no-build-isolation -e .

# 3. Reinstall flashinfer editable (vllm pip overwrites with PyPI 0.6.6)
cd /root/flashinfer-src && pip install --no-build-isolation -e .

# 4. Verify
FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "import vllm, flashinfer"
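The version floors stated above (torch >= 2.10, cmake >= 4.0) are worth checking before starting the long CUDA compile. A minimal sketch with illustrative helper names. It compares numeric components, because comparing the raw strings would rank "2.9" above "2.10":

```python
import re


def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '2.10.0+cu128' -> (2, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Example version strings, not taken from a real pod.
checks = {
    "torch 2.10.0+cu128": meets_minimum("2.10.0+cu128", "2.10"),
    "torch 2.9.1": meets_minimum("2.9.1", "2.10"),
    "cmake 4.0.3": meets_minimum("4.0.3", "4.0"),
}
print(checks)
```

The installed strings could come from `importlib.metadata.version("torch")` and the first line of `cmake --version`.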

The asym K16/V8 production recipe in Python:

from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="bfloat16",
    kv_cache_dtype=("auto", "fp8_e4m3"),
    attention_config={"backend": "FLASHINFER"},
)

The VLLM_ATTENTION_BACKEND environment variable is not honored in this vLLM build; pass attention_config={"backend": "FLASHINFER"} to the LLM constructor instead. Auto-selection picks FlashAttention, which lacks the asym tuple writer.
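For a sense of what the K16/V8 setting saves, the per-token KV cache footprint can be worked out by hand. A rough sketch under several assumptions: "auto" resolves to the model dtype (bfloat16), the model shape is 28 layers, 4 KV heads, and head dimension 128 (check these against the model's config.json), and fp8 scale metadata is ignored:

```python
DTYPE_BYTES = {"bfloat16": 2, "fp8_e4m3": 1}


def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, k_dtype: str, v_dtype: str
) -> int:
    """Bytes of K plus V stored per token across all layers."""
    per_layer = n_kv_heads * head_dim * (DTYPE_BYTES[k_dtype] + DTYPE_BYTES[v_dtype])
    return n_layers * per_layer


# Assumed Qwen2.5-7B-Instruct shape; confirm against the model config.
shape = dict(n_layers=28, n_kv_heads=4, head_dim=128)
sym = kv_bytes_per_token(**shape, k_dtype="bfloat16", v_dtype="bfloat16")
asym = kv_bytes_per_token(**shape, k_dtype="bfloat16", v_dtype="fp8_e4m3")
print(sym, asym, asym / sym)
```

Keeping K at 16 bits and storing V at 8 bits moves 3 bytes per element pair instead of 4, a 25% reduction in KV bytes regardless of model shape.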

Paper build

cd /home/mcgrof/devel/paper-memory-decode && make

Generates figures via Python scripts, then runs pdflatex (3 passes for cross-refs). Always verify the rendered PDF with:

pdftotext paper.pdf - | grep -nE '<pattern>'

Source-level grep misses issues in figure PDFs and broken LaTeX label resolution (e.g., Table V-C0c from a \label inside \begin{center} instead of \begin{table}).

Reproducibility System (paper-memory-decode)

The knlp defconfig system is being extended with paper reproduction profiles. The planned targets:

make defconfig-decode       # Core asym claims (1×H100, 4-8h warm)
make defconfig-decode-sat   # Saturation model (1×H100, 18-36h)
make defconfig-decode-full  # Everything (multi-GPU, days)

After selecting a defconfig, make runs:

decode-doctor → decode-fetch → decode-build →
decode-run → decode-report → decode-upload (optional)

The orchestrator lives under tools/reproduce/paper_memory_decode/. Each stage writes results to results/decode/<run_id>/stages/<stage>/ with DONE, metrics.jsonl, stdout.log, stderr.log. Rerunning make resumes from the first missing DONE.
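The resume rule can be sketched as a scan for the first stage without a DONE marker. This is not the orchestrator's code. It assumes stage directories are named after the make targets, and it leaves out the optional decode-upload stage:

```python
import tempfile
from pathlib import Path
from typing import Optional

STAGES = [
    "decode-doctor",
    "decode-fetch",
    "decode-build",
    "decode-run",
    "decode-report",
]


def first_incomplete_stage(run_dir: Path, stages=STAGES) -> Optional[str]:
    """Return the first stage lacking stages/<stage>/DONE, or None if all done."""
    for stage in stages:
        if not (run_dir / "stages" / stage / "DONE").is_file():
            return stage
    return None


# Simulate a run where doctor and fetch finished but build did not.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "results" / "decode" / "example-run"
    for stage in STAGES[:2]:
        stage_dir = run_dir / "stages" / stage
        stage_dir.mkdir(parents=True)
        (stage_dir / "DONE").touch()
    resume_from = first_incomplete_stage(run_dir)
print(resume_from)
```

Writing DONE only after a stage's outputs are complete is what makes an interrupted run safe to restart.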

Telemetry: local JSONL is mandatory and canonical. W&B and trackerio are optional mirrors controlled by .config flags and env vars (WANDB_API_KEY, HF_TOKEN).
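Treating local JSONL as canonical means every metric is appended to metrics.jsonl before any mirror is attempted. A minimal sketch of the append-and-read-back pattern; the record fields and values are made up, and the helper names are not from the repository:

```python
import json
import tempfile
from pathlib import Path


def append_metric(stage_dir: Path, record: dict) -> None:
    """Append one JSON object as a single line to <stage_dir>/metrics.jsonl."""
    stage_dir.mkdir(parents=True, exist_ok=True)
    with open(stage_dir / "metrics.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_metrics(stage_dir: Path) -> list:
    """Load every record back, skipping blank lines."""
    with open(stage_dir / "metrics.jsonl", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    stage_dir = Path(tmp) / "stages" / "decode-run"
    append_metric(stage_dir, {"metric": "tokens_per_s", "value": 1234.5})
    append_metric(stage_dir, {"metric": "kv_bytes_per_token", "value": 43008})
    records = read_metrics(stage_dir)
print(records)
```

Append-only lines survive a crash mid-run, and an optional mirror can be rebuilt later by replaying the file.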

The defconfigs pin exact git refs for vllm, flashinfer, lmcache, and paper-memory-decode, and clone/fetch them into ../ (the parent directory).

Memory

I want you to remember most of our conversations about this project.
