CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Git Commit Practices

Commit Structure

  • Make small, atomic commits - one logical change per commit
  • Each commit should be functional and not break the build
  • Run code formatter (black for Python) after each change
  • Always run scripts/fix_whitespace_issues.py on all changed files
  • Test that code runs successfully before committing

Commit Messages

  • MANDATORY: Always use this exact format for ALL commits:

    file.py: brief description of change
    
    Detailed explanation of what was changed and why.
    Include technical details about the implementation.
    
    Generated-by: Claude AI
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
    
  • LINE LENGTH: Maximum 70 characters per line in commit messages

    • Subject line (first line): 70 characters max
    • Body paragraphs: 70 characters max per line
    • Ensures proper display in git log, email patches, and terminal output
  • CRITICAL: Never use "🤖 Generated with [Claude Code]" or "Co-Authored-By: Claude"

  • REQUIRED: Every commit MUST have both "Generated-by: Claude AI" and "Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>"

  • NO EXCEPTIONS: This format is mandatory for ALL commits, no matter how small

  • STYLE: Be terse and to the point. NO shopping-list style bullet points. Write in paragraphs explaining the change, rationale, and technical details concisely. Avoid verbose enumeration unless absolutely necessary for clarity.
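
The mechanical parts of the format above (line length, required trailers, forbidden text) can be checked before committing. The following is a minimal sketch; `check_commit_message` and its constants are hypothetical names for illustration and are not part of this repository.

```python
# Hypothetical helper: checks only the mechanical rules listed above.
REQUIRED_TRAILERS = (
    "Generated-by: Claude AI",
    "Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>",
)
FORBIDDEN = ("Generated with [Claude Code]", "Co-Authored-By: Claude")
MAX_LINE = 70


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems; an empty list means the message passes."""
    problems = []
    lines = message.splitlines()
    # Subject and body lines share the same 70-character limit.
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE:
            problems.append(f"line {number} is {len(line)} chars (max {MAX_LINE})")
    # Each trailer must appear as a full line of its own.
    for trailer in REQUIRED_TRAILERS:
        if trailer not in lines:
            problems.append(f"missing trailer: {trailer}")
    for text in FORBIDDEN:
        if text in message:
            problems.append(f"forbidden text: {text}")
    return problems
```

A check like this could run from a commit-msg hook, but it does not replace reading the message: the style rule about paragraphs versus bullet lists still needs human judgment.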

Cross-Agent Access

Some automation looks for agent-specific instruction files (e.g., CODEX.md) instead of CLAUDE.md. To avoid future assistants missing these guidelines, ensure every agent entrypoint symlinks back to this document. For Codex runs, CODEX.md must always be a symlink to CLAUDE.md; add additional symlinks if new agent names are introduced.

Development Workflow

  1. Make a single focused change
  2. Run black formatter on Python files
  3. Test that the code runs without errors
  4. If architectural changes: Run make check to validate
  5. Commit with detailed message
  6. Repeat for next change

Architectural changes include:

  • New attention or MLP mechanisms
  • Modified forward/backward pass logic
  • Changes to model patching or wrapper classes
  • New ablation steps or configurations
  • Updates to reciprocity/context flow

Code Style

Python

  • Use black formatter for all Python code
  • Follow PEP 8 conventions (handled by black)
  • No manual formatting - always use black

Defconfig Files

  • CRITICAL: Defconfig files must use exact Kconfig syntax: CONFIG_XXX=y (no spaces around =)
  • CRITICAL: NO inline comments allowed - comments MUST be on separate lines starting with #
    • ✅ CORRECT:
      # This is a comment
      CONFIG_SOMETHING=y
      
    • ❌ WRONG (breaks Kconfig parser):
      CONFIG_SOMETHING=y  # This breaks everything
      CONFIG_SOMETHING=y # This also breaks
      
  • DO NOT apply black formatter to defconfig files or .config files
  • Kconfig parser silently ignores lines with spaces around equals signs
  • After any edit to defconfigs, verify syntax: grep " = " defconfigs/* should return nothing
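
The grep above only catches spaces around the equals sign. A stricter per-line check that also rejects inline comments might look like the sketch below. The function name and the regular expression are assumptions made for illustration, and the pattern is deliberately conservative about what counts as a valid value.

```python
import re

# CONFIG_NAME=value with no whitespace around "=" and nothing after the value.
# Values are either a double-quoted string or a bare token without spaces or "#".
ASSIGNMENT = re.compile(r'^CONFIG_[A-Za-z0-9_]+=("[^"]*"|[^\s#"]+)$')


def check_defconfig_line(line: str) -> bool:
    """Return True for a blank line, a full-line comment, or a clean assignment."""
    stripped = line.rstrip("\n")
    if not stripped.strip() or stripped.startswith("#"):
        return True
    return ASSIGNMENT.match(stripped) is not None
```

Applied to the examples above, the separate-line comment and `CONFIG_SOMETHING=y` pass, while both inline-comment variants are rejected.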

Markdown / Documentation Files

  • CRITICAL: Write documentation as technical prose, not as an AI-flavored outline.
  • Start by stating plainly what the document is for and why it exists.
  • Pull the key motivation into the top of the document. Do not bury the use case deep in the file.
  • Use a short Table of Contents for longer documents.
  • Prefer narrative paragraphs over shopping-list bullet dumps.
  • Use bullets only when they genuinely improve readability: short file lists, compact result summaries, or small enumerations.
  • Do not create sections like "What this is" if that content belongs in the intro.
  • Do not write prompt-y filler like "This document stands on its own" or hedged internal commentary like "the evidence is real and already public".
  • Do not apologize for confusing structure inside the main doc. Fix the structure.
  • Standalone docs must stand on their own. Do not make the reader chase older internal notes just to understand the actual result.
  • If lineage or provenance matters, split it into a separate lineage/provenance doc and link to it. Keep the main doc focused on the result itself.
  • When older work matters, summarize the result directly in the main doc. Put historical breadcrumbs in the lineage doc, not in the main narrative.
  • Use direct links for referenced docs and scripts. Do not write bare paths when a real markdown link is more useful.
  • Avoid weird audience markers like "public narrative" or "public scripts" unless the distinction truly matters.
  • Avoid robotic imperative spam like repeated "Use X" / "Do Y" / "Start with". Mix in natural phrasing such as "You can use..." when it reads better.
  • If a term is used repeatedly, define it once before leaning on it. Example: define "lane" before saying "lane configs".
  • If a CLI flag can be misunderstood, explain it where it is introduced. Example: --gpu all means all configured tracks, not local GPU autodetect.
  • When documenting examples or use cases, prefer short narrative subsections to giant bullet farms.
  • If a path name is historically misleading, fix the repo layout if practical. Do not leave the doc to carry the whole burden of explaining a bad structure.
  • Documentation edits should reduce confusion, not relocate it.

GPT2 Model Naming Convention

Overview

All GPT-2 model variants follow a consistent hierarchical naming pattern that makes the research intent and feature composition clear. Models are automatically discovered by scripts using convention-based introspection.

Naming Pattern

Base architecture with incremental features:

GPT2              # Baseline GPT-2 (no modifications)
GPT2_RA           # + Reciprocal Attention (learned alternation)
GPT2_MLA          # + Multi-head Latent Attention (cache compression)
GPT2_MLA_RA       # + Both MLA and RA
GPT2_MLA_KV       # + MLA with KV compression (KVSplice)
GPT2_MLA_RA_KV    # + MLA + RA + KV compression
GPT2_MLA_KV2      # + MLA with 2-latent (separate K/V latents)
GPT2_MLA_KV2M     # + KV2 + MLP compression (MLPSplice)
GPT2_MLA_RA_KVM   # + RA + MLA + KV + MLP compression

Feature abbreviations:

  • _RA: Reciprocal Attention (alternating Q@K.T and K@Q.T)
  • _MLA: Multi-head Latent Attention (DeepSeek-style cache compression)
  • _KV: KV compression via KVSplice
  • _KV2: 2-latent MLA (separate K and V latent spaces)
  • M suffix: MLP compression via MLPSplice

Adding New Models

To add a new GPT-2 model variant, follow these conventions (automatically discovered by scripts, no manual registration needed):

  1. Naming: Class name must start with GPT2

    • Example: GPT2_MyFeature or GPT2_MLA_MyFeature
  2. Location: Define in ra.py (for MLA-based) or gpt2/model.py (for baseline variants)

  3. Inheritance: Must inherit from nn.Module

  4. Configuration Parameter:

    • Use config: GPTConfig for GPT-2 baseline architecture
    • Use cfg: RA_MLA_Config for MLA-based architectures
    • Must be the first parameter after self in __init__
  5. Interface: Must implement get_num_params() method

  6. Example:

class GPT2_MLA_MyFeature(nn.Module):
    def __init__(self, cfg: RA_MLA_Config, vocab_size: int = 50257):
        super().__init__()
        self.cfg = cfg
        # ... implementation ...

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())

    def forward(self, idx, targets=None):
        # ... implementation ...
        return logits, loss

That's it! The model will be automatically discovered by:

  • scripts/compare_inference.py --list (inference benchmarking)
  • Future scripts using the discover_gpt2_models() pattern
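
For readers unfamiliar with convention-based discovery, the idea can be sketched with the standard `inspect` module. This is not the repository's `discover_gpt2_models()` implementation, which may differ. `discover_models` is a hypothetical stand-in, and it skips the `nn.Module` inheritance check so that the example runs without PyTorch.

```python
import inspect
import types


def discover_models(module, prefix="GPT2", required_method="get_num_params"):
    """Return {class_name: class} for classes in `module` that follow the convention."""
    found = {}
    for name, obj in inspect.getmembers(module, inspect.isclass):
        if not name.startswith(prefix):
            continue  # naming rule: class name must start with GPT2
        if not callable(getattr(obj, required_method, None)):
            continue  # interface rule: must implement get_num_params()
        found[name] = obj
    return found


# Stand-in module for demonstration; real models live in ra.py or gpt2/model.py.
demo = types.ModuleType("demo")


class GPT2_Demo:
    def get_num_params(self):
        return 0


class Helper:
    pass


demo.GPT2_Demo = GPT2_Demo
demo.Helper = Helper
```

Here `discover_models(demo)` returns only `GPT2_Demo`, because `Helper` fails the naming rule.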

Ablation Step Naming

Ablation steps use short prefixes that map to model architectures:

B         → GPT2 baseline
RA        → GPT2_RA (fixed reciprocal)
RALEARN   → GPT2_RA (learned reciprocal)
MLA       → GPT2_MLA
RAMLA     → GPT2_MLA_RA
RAMLAKV   → GPT2_MLA_RA_KV
RAMLAKVM  → GPT2_MLA_RA_KVM
MLAKV     → GPT2_MLA_KV
MLAKV2    → GPT2_MLA_KV2
MLAKV2M   → GPT2_MLA_KV2M

Legacy step names with "0" or "1" suffixes (for learning rate ablation) are still supported for backwards compatibility but are deprecated. New experiments should use the architecture name directly without suffixes.
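
The prefix table can be expressed as a lookup. The sketch below copies the mapping from this section; `resolve_step` is a hypothetical helper, and its handling of legacy names assumes they are simply the architecture prefix followed by a single "0" or "1".

```python
STEP_TO_MODEL = {
    "B": "GPT2",
    "RA": "GPT2_RA",
    "RALEARN": "GPT2_RA",
    "MLA": "GPT2_MLA",
    "RAMLA": "GPT2_MLA_RA",
    "RAMLAKV": "GPT2_MLA_RA_KV",
    "RAMLAKVM": "GPT2_MLA_RA_KVM",
    "MLAKV": "GPT2_MLA_KV",
    "MLAKV2": "GPT2_MLA_KV2",
    "MLAKV2M": "GPT2_MLA_KV2M",
}


def resolve_step(step: str) -> str:
    """Map an ablation step name to its model class name."""
    # Exact names win, so "MLAKV2" is never mistaken for a suffixed "MLAKV".
    if step in STEP_TO_MODEL:
        return STEP_TO_MODEL[step]
    # Assumed legacy form: architecture name plus a trailing "0" or "1".
    if step[-1:] in ("0", "1") and step[:-1] in STEP_TO_MODEL:
        return STEP_TO_MODEL[step[:-1]]
    raise KeyError(f"unknown ablation step: {step}")
```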

GPU Optimization Preferences

Training Optimizations

When optimizing PyTorch training for AMD GPUs:

  • Increase batch size to utilize GPU memory
  • Enable cuDNN benchmark mode
  • Use mixed precision training (AMP)
  • Add multiple data loader workers with pinned memory
  • Include GPU warmup routine
  • Use torch.compile() for graph optimization
  • Enable TensorFloat32 for matrix operations on GPUs that support it
  • Add comprehensive timing and metrics
  • Save trained models after completion

Performance Monitoring

  • Display GPU info at startup
  • Show per-epoch timing
  • Track test accuracy after each epoch
  • Report total training time and average per epoch

Hardware

  • Primary GPU: AMD Radeon Pro W7900 (48GB)
  • Optimize for maximum GPU utilization

Testing Requirements

  • Always verify code runs before committing
  • Check for linting/formatting issues
  • Ensure no syntax errors

R&D Phases and Workflow

Research proceeds in two phases with different tooling:

Phase 1: Rapid R&D (current default)

During active research, Kconfig adds overhead that slows iteration. Write standalone Python scripts, eval harnesses, and benchmarks, and run them directly:

python3 scripts/my_experiment.py
python3 eval_v28.py --phase 0

Results go to local directories. Once experiments converge and produce publishable results, move to Phase 2.

Phase 2: Reproducible Kconfig Workflow

Once R&D stabilizes and we have results worth preserving, lock the experiment into the Kconfig build system for reproducibility:

  1. Load configuration: make defconfig-<name>
  2. Build and run: make
  3. Results are saved to the configured output directory

The Kconfig system ensures exact reproducibility of finalized experiments. Do NOT use Kconfig during rapid iteration — it slows things down without adding value until the experiment design is stable.

Ralph Loop for Multi-Phase Tasks

We use the Ralph Loop plugin (ralph-loop from claude-plugins-official) for large experiments that span multiple phases. The loop uses a Stop hook to prevent Claude from exiting — instead it feeds the same prompt back, creating a self-referential iteration loop where each pass sees the previous work in files and git history.

How it works:

  1. User writes a task file (e.g., BPA-v42.txt) with numbered phases and clear completion criteria
  2. User invokes /ralph-loop with the task and a --completion-promise (typically COMPLETE)
  3. Claude reads the task file, executes phases in order, commits results, and outputs <promise>COMPLETE</promise> when genuinely done
  4. If Claude tries to exit before completion, the stop hook blocks and re-feeds the prompt — Claude sees its own prior work and continues from where it left off

Typical invocation:

/ralph-loop Read BPA-v42.txt and execute all phases in order \
  --completion-promise COMPLETE --max-iterations 2000

Task file conventions (e.g., BPA-v42.txt):

  • Number all phases/tasks clearly (Task 1, Task 2, ...)
  • Include concrete success criteria per phase
  • Specify what to commit and when
  • Include a final task that summarizes results
  • Keep tasks independent enough that resumption works if context compresses mid-run

When to use Ralph Loop:

  • BPA experiment versions (multi-phase GPU experiments)
  • Any task with >3 sequential phases
  • Tasks that may exceed a single context window
  • Overnight/unattended experiment runs

When NOT to use Ralph Loop:

  • Quick one-shot edits or fixes
  • Interactive design discussions
  • Tasks requiring human judgment between steps

Monitoring: head -10 .claude/ralph-loop.local.md shows current iteration count and state.

Cancelling: /cancel-ralph removes the state file and stops the loop.

Repository Layout

/data/knlp/ — Code Repository (this repo)

The main working tree. Contains bleeding-edge code, scripts, model implementations, eval harnesses, and configuration. Keep this repo slim. Do NOT commit large artifacts, result directories, checkpoints, plots, or bulk experimental output here.

What belongs here:

  • Model code (gpt2/, gnn/)
  • Eval harnesses (eval_v*.py)
  • Scripts (scripts/)
  • Defconfigs and Kconfig files
  • Documentation (docs/)
  • Small CSV/JSON files that are actively used by scripts

What does NOT belong here:

  • Result directories (bpa_v*_results/, results/)
  • Artifact directories (artifacts/)
  • Training checkpoints (.pt, .pth)
  • Generated plots and PNGs (except images/ for README)
  • Final reports and scoreboards (archive to key-results)
  • Large experiment output

/data/knlp-key-results/ — Artifact Archive

Stores all historical results, artifacts, and bulk data. Commit freely here — size is not a concern. Organized by research area:

  • bpa/ — BPA (Bit Precision Allocation) results v1-v28+, artifacts, figures, scoreboards, reports, branch trees
  • key_results/ — Training matrix results with W&B configs

When an experiment version is complete, copy its artifacts:

cp -a results/v29/ /data/knlp-key-results/bpa/results/v29/
cp bpa_v29_final_report.md /data/knlp-key-results/bpa/
cd /data/knlp-key-results && git add bpa/ && git commit

/data/paper-memory-decode/ — Paper Repository (example)

Each paper gets its own git repository. The paper repo contains LaTeX source, figures, and any data needed to reproduce the paper build. See "Paper Workflow" below.

BPA Experiment Instructions

Each BPA experiment version is defined in an instruction file (e.g., BPA-v48.txt) that specifies the experiment phases, models, metrics, and completion criteria. These files are given to Claude via the Ralph Loop plugin for autonomous execution.

Naming Convention

  • Use BPA-vNN.txt (all-caps prefix, dash, lowercase v, number)
  • Examples: BPA-v46.txt, BPA-v47.txt, BPA-v48.txt
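
The naming rule is simple enough to validate with a regular expression. This is a small sketch; `instruction_version` is a hypothetical helper, not an existing script.

```python
import re

# All-caps prefix, dash, lowercase v, decimal version number, .txt extension.
INSTRUCTION_NAME = re.compile(r"^BPA-v(\d+)\.txt$")


def instruction_version(filename: str):
    """Return the version number for a valid instruction filename, else None."""
    match = INSTRUCTION_NAME.match(filename)
    return int(match.group(1)) if match else None
```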

Storage

  • During execution: instruction file lives in /data/knlp/
  • After completion: move to /data/knlp-key-results/bpa-instructions/
  • All historical instructions are archived there (v2 through v48+)

Result Collection Convention

Each BPA experiment stores results in a versioned directory under /data/knlp-key-results/bpaNNN/ with a consistent structure:

/data/knlp-key-results/bpa48/
├── json/           # All metrics as JSON files
├── plots/          # Generated PNG figures (300 DPI)
├── logs/           # Execution logs
├── models/         # Model checkpoints (if any)
├── fim_maps/       # Fisher sensitivity maps
├── interaction_maps/  # Cross-layer interaction data
└── bpa48_summary.md   # Final summary report

Required conventions:

  • All metrics saved as JSON (machine-readable)
  • All plots saved as PNG at 300 DPI
  • Summary report in markdown with tables, plot references, and interpretation
  • Execution script committed to /data/knlp/scripts/ as bpa_vNN_w7900.py (or appropriate GPU name)
  • Results committed to key-results repo separately from code
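
A script can create this layout up front so that every later write has a known destination. The sketch below follows the directory names shown above. `create_result_layout` is a hypothetical helper, and the placeholder summary heading it writes is an assumption.

```python
from pathlib import Path

SUBDIRS = ("json", "plots", "logs", "models", "fim_maps", "interaction_maps")


def create_result_layout(root: Path, version: int) -> Path:
    """Create <root>/bpa<version>/ with the standard subdirectories."""
    base = root / f"bpa{version}"
    for name in SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    # Seed the summary report only if it does not exist yet.
    summary = base / f"bpa{version}_summary.md"
    if not summary.exists():
        summary.write_text(f"# BPA v{version} summary\n")
    return base
```

Calling `create_result_layout(Path("/data/knlp-key-results"), 48)` would produce the `bpa48/` tree shown above. The function is safe to call again on an existing directory.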

Script Pattern

BPA experiment scripts follow a consistent pattern established in v46+:

  • Load model with attn_implementation='eager' and .to('cuda')
  • Use BF16 dtype (FP16 causes NaN with large-vocab models)
  • CPU-offload logits to avoid OOM during dual-pass evaluation
  • Cache prior results from JSON for incremental re-runs
  • OOM try/except handling around all GPU-intensive loops
  • Early collapse detection before full evaluation
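
The JSON caching item is the easiest of these to get subtly wrong, so here is a minimal sketch of one way to do it. `cached_result` is a hypothetical helper; the real scripts may organize their cache files differently.

```python
import json
from pathlib import Path


def cached_result(path: Path, key: str, compute):
    """Return results[key] from the JSON file, computing and saving it if absent."""
    results = json.loads(path.read_text()) if path.exists() else {}
    if key not in results:
        # Only pay for the expensive evaluation when no prior result exists.
        results[key] = compute()
        path.write_text(json.dumps(results, indent=2))
    return results[key]
```

Because each new result is written immediately, a run that dies partway through (for example on OOM) keeps everything it finished, and the re-run skips those entries.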

Paper Workflow

Each paper lives in a separate git tree (e.g., /data/paper-memory-decode/). The paper is written and built from the knlp working tree where all code and key-results are accessible.

Convention

  • Paper repos are independent git trees, one per paper
  • The paper repo must be self-contained for building: LaTeX, figures, BibTeX, and any data referenced by the paper
  • Experiment scripts live in knlp, not the paper repo
  • When experiments produce figures or tables for the paper, copy them into the paper repo and commit there
  • If the paper references specific data (CSV, JSON), copy the relevant subset into the paper repo so make works without external dependencies

Workflow

  1. Run experiments in knlp (Phase 1 rapid R&D)
  2. Archive results to knlp-key-results
  3. Copy figures/tables/data into the paper repo
  4. Write LaTeX in the paper repo
  5. Commit both the paper repo and knlp independently

This ensures the paper repo can build on any machine with just a git clone and make, while knlp stays focused on code and knlp-key-results holds the full artifact history.

Configuration System Internals

Type Handling

  • .config files use string values: "y", "n", "value"
  • config.py converts to Python types: True, False, integers, floats
  • When checking config values in Python code, handle both types:
    # Good - handles both string and boolean
    if value in ("y", True):
    
    # Bad - only works with one type
    if value == "y":
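
Centralizing the check in one helper avoids repeating it at every call site. The sketch below uses a hypothetical name, `config_enabled`, and is slightly stricter than `value in ("y", True)`: in Python `1 == True`, so the tuple form would also accept the integer 1.

```python
def config_enabled(value) -> bool:
    """True for "y" from a raw .config and for True after config.py conversion."""
    # "is True" rather than "== True" so that the integer 1 is not accepted.
    return value == "y" or value is True
```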

Test Matrix vs Ablation Mode

  • Mutually exclusive: Cannot enable both CONFIG_TEST_MATRIX_MODE and CONFIG_RA_MLA_ABLATION_MODE
  • Test matrix mode: Tests optimizer/pruning combinations
  • Ablation mode: Tests architectural variations (RA, MLA, RA-CT, etc.)
  • Always verify which mode is active when debugging unexpected test counts

Ablation Study Requirements

Multi-File Synchronization

When extending ablation studies with new steps, THREE files must be updated in sync:

  1. defconfigs/gpt2-ratio-ablation: Add step descriptions in comments
  2. gpt2/train_ra_mla.py: Add step configurations (elif step == "N" blocks)
  3. scripts/run_test_matrix.py: Update step_descriptions dictionary

Missing any of these causes:

  • Defconfig only: Steps run but have no description
  • train_ra_mla.py only: Steps fail to execute
  • run_test_matrix.py only: Descriptions show but steps don't run

Ablation Step Checklist

When adding a new ablation step:

  • Add step config block to train_ra_mla.py (around line 500+)
  • Update step_descriptions dict in run_test_matrix.py (around line 2095)
  • Document step in defconfig comments
  • Update CONFIG_RA_MLA_ABLATION_STEPS string to include new step number
  • REQUIRED: Validate with dry-run: ./scripts/validate_ablation_steps.sh

Dry-Run Validation

Architecture Validation Before GPU Training

CRITICAL: Always validate architectural changes with dry-run before committing GPU resources. Recent bugs wasted 7+ hours of GPU time that dry-run would have caught in 60 seconds.

When to Use Dry-Run

Run dry-run validation before:

  • Committing architectural changes (new attention mechanisms, MLP modifications)
  • Adding new ablation steps
  • Modifying forward/backward pass logic
  • Changing wrapper classes or patching code
  • After fixing bugs that affected multiple configurations

Dry-Run Tools

Quick Check (Recommended)

# Run full architecture validation via Makefile
make check

# Completes in ~97 seconds (19 steps @ ~5s each)
# Loads gpt2-ratio-ablation config with DRY_RUN=1
# Tests all ablation steps automatically
# Exit code 0: all pass, 1: failures detected

ALWAYS run make check before committing architectural changes that may affect runtime behavior.

Single Step Validation

# Test specific ablation step
python3 gpt2/train_ra_mla.py --ra-mla-ablation-step N \
  --optimizer adamwspam --dataset finewebedu --dry-run

# Exit code 0: architecture valid
# Exit code 1: error (prints stack trace)

Manual All Steps Validation

# Test all 19 RATIO ablation steps (manual script)
./scripts/validate_ablation_steps.sh

# Completes in ~60 seconds
# Reports which steps pass/fail
# Provides commands to debug failures

What Dry-Run Catches

  • Configuration errors (wrong test mode, invalid parameters)
  • Architecture errors (TypeError from wrong arguments)
  • Assertion failures (missing required data)
  • Forward pass failures (dimension mismatches)
  • Backward pass failures (gradient computation errors)
  • Optimizer step failures (parameter update errors)

What Dry-Run Misses

  • OOM errors (uses small batch on CPU)
  • Multi-GPU/DDP issues (runs single CPU)
  • Data loading errors (uses dummy data)
  • Long-term training instabilities
  • Performance regressions

Recent Bugs Caught by Dry-Run

  1. RA_MLA_Block argument passing: 17/19 steps failed with TypeError when MLP received unexpected kwargs
  2. Assertion strictness: 6/19 steps failed when first block had no context from previous block

Both would have been caught before GPU training with dry-run.

Defensive Programming

Assertions for Optional Features

When implementing optional/conditional features that depend on data flow:

  • Add assertions for data that MUST be present (e.g., within a single component)
  • Avoid assertions for data that may legitimately be None (e.g., first block in sequence)
  • Silent failures waste GPU time - better to fail fast with clear error messages
  • Pattern for required data within component:
    if self.cfg.feature_enabled:
        assert required_data is not None, "feature_enabled but no required_data"
  • Pattern for optional data from other blocks:
    if self.cfg.feature_enabled and data_from_prev_block is not None:
        # use the data

Examples from RA+MLA:

  • ReciprocalMLP asserts attn_weights/attn_latent are provided by RA_MLA_Block (same component, always required)
  • RA_MLA_Attention handles None mlp_gate_context gracefully (from previous block, None for first block)
  • Use dry-run validation to catch assertion failures before GPU training
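
The two patterns can be seen side by side in a toy block. This is a sketch with placeholder arithmetic and hypothetical names (`Config`, `Block`); it is not taken from RA_MLA_Block.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    feature_enabled: bool = True


class Block:
    def __init__(self, cfg: Config):
        self.cfg = cfg

    def forward(self, x: float, required_data: Optional[float] = None,
                data_from_prev_block: Optional[float] = None) -> float:
        if self.cfg.feature_enabled:
            # Produced inside this component, so absence is a bug: fail fast.
            assert required_data is not None, "feature_enabled but no required_data"
            x = x + required_data
        if self.cfg.feature_enabled and data_from_prev_block is not None:
            # Legitimately None for the first block: skip instead of asserting.
            x = x + data_from_prev_block
        return x
```

Note that Python drops `assert` statements when run with `-O`, so checks that must always fire should raise an explicit exception instead.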

Context Flow for Multi-Block Architectures

When implementing bidirectional information flow between transformer blocks:

  • Use wrapper classes (e.g., RA_MLA_Block) to manage context state across blocks
  • Store contexts in instance variable (e.g., self._ctx = {})
  • Pass contexts as keyword arguments (enables detection of missing connections)
  • Produce contexts for the next block at the end of forward pass
  • Never assume contexts exist - always check with assertions when used
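
As a rough illustration of this flow, the sketch below passes a context dictionary from each block to the next as keyword arguments. `ContextBlock` and `run_blocks` are hypothetical, and the gating arithmetic is a placeholder rather than the real RA+MLA computation.

```python
class ContextBlock:
    """Stand-in for a wrapper such as RA_MLA_Block; the math is a placeholder."""

    def __init__(self):
        self._ctx = {}

    def forward(self, x, *, mlp_gate_context=None):
        # The first block receives no context and passes x through unchanged.
        if mlp_gate_context is not None:
            x = x * mlp_gate_context
        # Produce the context the next block will consume.
        self._ctx = {"mlp_gate_context": 0.5}
        return x


def run_blocks(blocks, x):
    ctx = {}
    for block in blocks:
        # Keyword passing means a misnamed context raises TypeError immediately.
        x = block.forward(x, **ctx)
        ctx = block._ctx
    return x
```

With three blocks and an input of 8.0, the first block leaves the value alone and the next two each halve it.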

Wrapper Class Adaptability

When creating wrapper classes for mixed configurations:

  • Check wrapped component type at runtime: Use hasattr() or isinstance() to detect capabilities
  • Conditionally pass arguments: Standard components may not accept extended keyword arguments
  • Graceful degradation: Support both enhanced and standard components in same wrapper
  • Pattern:
    # Good - adapts to component type
    is_enhanced = hasattr(self.component, "enhanced_method")
    if is_enhanced:
        out = self.component(x, extra_arg=value)
    else:
        out = self.component(x)
    
    # Bad - assumes all components are enhanced
    out = self.component(x, extra_arg=value)  # crashes on standard components

Example: RA_MLA_Block wraps either ReciprocalMLP (accepts attn_weights/attn_latent) or standard MLP (does not). Runtime check prevents TypeError when ablation steps disable reciprocity mechanisms.

Architectural Pattern Guidelines

Feature Independence and Composability

When adding new attention/MLP mechanisms:

  • Keep features orthogonal: RA-CT (attention-only gating) vs MLP mechanisms (cross-layer flow)
  • Use clear naming: ra_cross_token for attention features, mlp_attn_gate for MLP features
  • Enable ablation: Each feature should be independently testable
  • Avoid coupling: RA-CT doesn't require MLA/RA, can be tested on baseline GPT-2

Per-Head Learnable Parameters

For per-head gating mechanisms:

  • Initialize to near-identity: bias ≈ 2.0 for sigmoid gates (pass-through initially)
  • Use affine transforms: sigmoid(stat * scale + bias) for numerical stability
  • Shape: [n_head] for per-head parameters, expandable to [B,H,T] when needed
  • Consider head_average=True option for cheaper computation

Statistics-Based Gating

When implementing gating based on attention statistics:

  • Support multiple modes: topk, max, entropy, rms
  • Provide detach_stats option to compute under no_grad() for memory savings
  • Apply gate at multiple points: weights (pre-softmax) or output (post-aggregation)
  • Use alpha mixing parameter for smooth interpolation: (1-α)·x + α·(x⊙gate)
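
The gate and mixing formulas from the last two subsections can be checked numerically with scalar arithmetic. In the sketch below the function names are hypothetical and the default `scale=1.0` is an assumption; only the bias of 2.0 comes from the guidance above.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def head_gate(stat: float, scale: float = 1.0, bias: float = 2.0) -> float:
    """Affine-then-sigmoid gate for one head: sigmoid(stat * scale + bias)."""
    return sigmoid(stat * scale + bias)


def mix(x: float, gate: float, alpha: float) -> float:
    """Interpolate between x and the gated value: (1 - alpha) * x + alpha * (x * gate)."""
    return (1.0 - alpha) * x + alpha * (x * gate)
```

With a zero statistic the gate starts at sigmoid(2.0), roughly 0.88, which is why a bias near 2.0 behaves as near pass-through at initialization. Setting alpha to 0 disables the gate entirely and alpha of 1 applies it fully.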

GPU Memory Management

Memory Optimization Strategies

  • Enable expandable segments: PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
  • Disable expensive metrics logging during training (e.g., entropy computation on attention weights)
  • Use @torch.no_grad() for statistics computation that doesn't need gradients
  • Monitor for OOM errors in attention mechanisms - often caused by extra allocations for metrics
  • Batch size × gradient accumulation = effective batch size (keep constant when adjusting for memory)

A10G-Specific Considerations

  • 24GB VRAM per GPU requires careful batch size tuning
  • For GPT-2 124M with RA+MLA: batch_size=8, gradient_accumulation=8 (effective=64)
  • Tensor dimensions should be multiples of 64 for optimal tensor core utilization
  • Disable metrics logging for attention mechanisms to prevent OOM during entropy computation
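
When the per-device batch size has to shrink to fit in memory, gradient accumulation should grow to compensate. The arithmetic is sketched below with hypothetical helper names.

```python
def effective_batch_size(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return batch_size * grad_accum * num_gpus


def rebalance(target_effective: int, new_batch_size: int, num_gpus: int = 1) -> int:
    """Return the gradient accumulation that keeps the effective batch size fixed."""
    per_step = new_batch_size * num_gpus
    if target_effective % per_step:
        raise ValueError("target is not divisible by batch_size * num_gpus")
    return target_effective // per_step
```

For the A10G setting above, `effective_batch_size(8, 8)` gives 64. Halving the batch size to 4 requires `rebalance(64, 4)`, which is 16 accumulation steps.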

WandB Helper Scripts

When analyzing experiment results or comparing GPU performance across runs, use the W&B query scripts in the scripts/ directory. These require the micromamba environment.

Environment Setup

Before running any W&B query scripts:

source ~/bin/wl700-ml  # Activates w7900-ml micromamba environment

This provides wandb, pandas, and other dependencies needed for querying experiment data.

Available Scripts

scripts/inspect_wandb_keys.py: Discover available metrics in a run

Usage for inspecting what data is available:

python scripts/inspect_wandb_keys.py \
  --entity mcgrof-citizen \
  --project gpt2-bitter9-compiled-b200x4 \
  --run-name gpt2_adamwprune_bitter9_state_50

scripts/query_wandb_gpu.py: Query GPU metrics from training history

Usage for checking GPU memory and compute utilization:

python scripts/query_wandb_gpu.py \
  --entity mcgrof-citizen \
  --project gpt2-bitter9-compiled-b200x4 \
  --run-name gpt2_adamwprune_bitter9_state_50

scripts/query_wandb_gpu_full.py: Query detailed GPU metrics from system events

Usage for detailed system metrics including power and temperature:

python scripts/query_wandb_gpu_full.py \
  --entity mcgrof-citizen \
  --project gpt2-bitter9-compiled-b200x4 \
  --run-name gpt2_adamwprune_bitter9_state_50

scripts/plot_torch_compile_impact.py: Generate publication-quality visualizations comparing GPU performance across runs

This is a reusable visualization script that queries W&B and generates four graphs showing performance comparisons. Used to prove torch.compile() was the bottleneck.

Usage:

source ~/bin/wl700-ml
python scripts/plot_torch_compile_impact.py

The script is hardcoded to query mcgrof-citizen/gpt2-bitter8-nocompile-w7900 but can be easily adapted for other projects by editing the project variable in main().

Generated graphs (300 DPI, publication quality):

  • torch_compile_comparison.png: Side-by-side memory and compute comparison
  • torch_compile_grouped.png: All runs in grouped bar chart with color coding
  • torch_compile_before_after.png: Dramatic before/after horizontal bars with annotations
  • bitter8_vs_baseline.png: Spotlight showing minimal overhead of state-based pruning

The script demonstrates the pattern for:

  1. Querying W&B API for multiple runs
  2. Extracting system.gpu.* metrics from event stream
  3. Computing averages across runs
  4. Creating matplotlib visualizations with annotations
  5. Using color coding (red=bad, green=good) for clarity

When to use this script:

  • After GPU profiling reveals performance differences
  • To prove bottleneck hypotheses with visual evidence
  • To compare optimization variants systematically
  • To generate graphs for documentation or papers

Customization tips:

  • Edit project variable to query different W&B project
  • Modify fetch_wandb_data() to extract different metrics
  • Update graph functions to change visual style
  • Add new graph types by creating new functions following existing patterns

Comparing Runs

To compare GPU performance across multiple runs (baseline vs optimizations), write a custom Python script using the W&B API. See docs/tracker.md for detailed examples.

Pattern for comparing runs:

import wandb

api = wandb.Api()
project = "mcgrof-citizen/gpt2-bitter9-compiled-b200x4"

run_names = ["baseline", "bitter8", "bitter9"]

for name in run_names:
    runs = api.runs(project, filters={"config.run_name": name})
    if runs:
        run = runs[0]
        history = run.history(
            keys=["gpu/memory_util_avg", "gpu/compute_util_avg"],
            samples=1000
        )
        if not history.empty:
            print(f"{name}:")
            print(f"  Memory: {history['gpu/memory_util_avg'].mean():.2f}%")
            print(f"  Compute: {history['gpu/compute_util_avg'].mean():.2f}%")

Key Metrics to Check

When analyzing GPU performance issues:

  • gpu/memory_util_avg: Memory bandwidth utilization (%)
  • gpu/compute_util_avg: Compute utilization (%)
  • gpu/memory_used_avg_gb: Average memory per GPU (GB)

Low memory utilization (<20%) indicates memory bandwidth bottleneck. Low compute utilization (<50%) indicates compute bottleneck. Compare optimization runs to baseline to verify improvements.

Publishing Results

Before publishing experimental results in documentation, papers, or public communications, perform rigorous verification to ensure reproducibility and fairness.

Verification Checklist

When publishing statistics or performance comparisons:

  1. Use W&B API to verify hyperparameters: Query all runs via W&B API to confirm consistent hyperparameters across comparisons. Verify batch size, gradient accumulation, learning rate, warmup steps, and all optimizer-specific settings match exactly.

  2. Verify git commit exists and is public: Confirm the exact git commit SHA used for training exists in the public repository. Document the commit ID in published results so others can reproduce experiments with identical code.

  3. Perform apples-to-apples sanity checks: Before claiming performance differences, verify:

    • Equal training time (CONFIG_GPT2_MAX_TIME) across all methods
    • Same effective batch size (batch × grad_acc × num_gpus)
    • Same hardware configuration (GPU type, count, memory)
    • Same torch.compile status (all enabled or all disabled)
    • Same dataset and preprocessing
    • Same evaluation protocol (samples, intervals)
  4. Check for confounding variables: Verify no unintended differences like:

    • Different torch.compile status (one compiled, one not)
    • Different batch sizes due to GPU-specific configs
    • Different stopping conditions (time vs iterations)
    • Different random seeds causing outlier results
    • Different CUDA/PyTorch/GPU driver versions

W&B Verification Script Pattern

Use this pattern to verify hyperparameter consistency:

import wandb

api = wandb.Api()
project = "mcgrof-citizen/your-project"
run_names = ["baseline", "method_a", "method_b"]

configs = {}
for name in run_names:
    runs = api.runs(project, filters={"display_name": name})
    if runs:
        run = runs[0]
        configs[name] = {
            "batch_size": run.config.get("batch_size"),
            "gradient_accumulation": run.config.get("gradient_accumulation"),
            "learning_rate": run.config.get("learning_rate"),
            "max_time": run.config.get("max_time"),
            "compile": run.config.get("compile_model"),
            "commit": run.config.get("git_commit"),
        }

# Verify all configs match on critical hyperparameters
for key in ["batch_size", "gradient_accumulation", "learning_rate"]:
    # Collect this key per run so the warning shows only the relevant values
    values = {name: c[key] for name, c in configs.items()}
    if len(set(values.values())) > 1:
        print(f"WARNING: {key} differs across runs: {values}")

Publication Requirements

Published results MUST include:

  • Git commit SHA for exact code version
  • W&B project and run names for verification
  • Hardware specification (GPU model, count, memory)
  • Training time allocation per method
  • Effective batch size calculation
  • torch.compile status
  • Dataset and preprocessing details

This enables independent verification and reproduction of published claims. Do not publish results without completing the verification checklist.
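
One way to enforce the list above is to check a results manifest before publishing. The sketch below is an assumption about how such a check could look: the field names in `REQUIRED_FIELDS` and the manifest layout are hypothetical, not a format this repository defines.

```python
import re

# Hypothetical field names, one per publication requirement.
REQUIRED_FIELDS = [
    "git_commit",            # exact code version
    "wandb_project",         # W&B project for verification
    "wandb_runs",            # W&B run names
    "hardware",              # GPU model, count, memory
    "max_time",              # training time allocation per method
    "effective_batch_size",  # batch x grad_acc x num_gpus
    "torch_compile",         # True or False, must be stated explicitly
    "dataset",               # dataset and preprocessing details
]


def missing_fields(manifest: dict) -> list:
    """Fields that are absent or empty. False and 0 count as present."""
    return [k for k in REQUIRED_FIELDS if manifest.get(k) in (None, "", [])]


def looks_like_full_sha(sha: str) -> bool:
    """True for a 40-character lowercase hex string (full SHA-1 commit ID)."""
    return re.fullmatch(r"[0-9a-f]{40}", sha) is not None


# Made-up example manifest with the commit left out.
manifest = {
    "wandb_project": "mcgrof-citizen/your-project",
    "wandb_runs": ["baseline", "method_a"],
    "hardware": "1x H100 80GB",
    "max_time": 3600,
    "effective_batch_size": 256,
    "torch_compile": False,
    "dataset": "example dataset, example preprocessing",
}
print("missing:", missing_fields(manifest))
```

The shape check only confirms the string looks like a commit ID. Whether the commit exists in the public repository still has to be verified against the remote.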

KVSplice Verification

KVSplice is a learned KV cache compression layer that achieves 12x total compression (6x from MLA × 2x from KVSplice). Before claiming compression ratios or memory savings, verify both training quality and inference memory reduction.

Training Verification

When evaluating KVSplice training results:

  1. Compare across GPU types: Run ablation on multiple GPUs (W7900, A100, H100) to verify consistency and detect hardware-specific issues

  2. Check transform parameter learning: Extract scale/shift values from checkpoints to verify the learned monotonic transform is actually training (not stuck at initialization)

    python scripts/extract_kvsplice_params.py \
      --checkpoint path/to/checkpoint.pt
  3. Monitor KVSplice metrics in W&B: Verify that scale_mean, scale_std, shift_mean, shift_std are logged during training. If missing, check architecture detection in _compute_kvsplice_param_metrics()

  4. Verify compression ratio setting: Confirm CONFIG_MLA_COMPRESSION_RATIO is set correctly in defconfig and matches W&B config. Default is 0.5 (2x compression on top of MLA)

  5. Quality degradation tolerance: KVSplice should add only 0.5-1.4% quality loss compared to MLA alone. Larger degradation indicates a bug
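
The degradation tolerance in step 5 can be checked with a few lines of arithmetic. This sketch assumes the quality metric is a lower-is-better value such as validation perplexity; the metric choice, function names, and example numbers are assumptions, not results from this project.

```python
MAX_DEGRADATION_PCT = 1.4  # upper end of the 0.5-1.4% range stated above


def relative_degradation_pct(mla_metric: float, kvsplice_metric: float) -> float:
    """Percent increase of a lower-is-better metric relative to MLA alone."""
    return (kvsplice_metric - mla_metric) / mla_metric * 100.0


def within_tolerance(mla_metric: float, kvsplice_metric: float) -> bool:
    return relative_degradation_pct(mla_metric, kvsplice_metric) <= MAX_DEGRADATION_PCT


# Made-up metric values for illustration only.
for mla, kvs in [(20.0, 20.2), (20.0, 21.0)]:
    pct = relative_degradation_pct(mla, kvs)
    verdict = "ok" if within_tolerance(mla, kvs) else "investigate: likely a bug"
    print(f"MLA={mla} KVSplice={kvs} degradation={pct:.2f}% ({verdict})")
```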

Inference Verification

Before publishing inference memory savings claims:

  1. Run direct cache measurement: Use scripts/verify_kvsplice_memory.py to measure actual cache tensor sizes across sequence lengths

    python scripts/verify_kvsplice_memory.py
  2. Verify cache tensor shapes: Inspect returned cache objects to confirm dimensions:

    • MLA: [B, T, d_latent] where d_latent=256
    • KVSplice: [B, T, d_compressed] where d_compressed=128 (ratio=0.5)
  3. Check compression ratio accuracy: Memory savings should match theoretical predictions within 5%:

    • Expected savings: (1 - compression_ratio) * 100%
    • Example: ratio=0.5 should give 50% cache reduction vs MLA
  4. Test multiple sequence lengths: Verify compression holds across 256, 512, and 1024 token sequences. Savings should scale linearly

  5. Calculate production throughput: Estimate how many parallel sequences fit in GPU memory with compressed cache vs standard cache. Include model weights in calculation
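
The theoretical side of steps 2 through 5 can be sketched as plain arithmetic on the cache shapes. The layer count and element size below are assumptions chosen for illustration, and "within 5%" is interpreted here as 5 percentage points. Measured numbers should come from scripts/verify_kvsplice_memory.py, not from this sketch.

```python
def theoretical_cache_bytes(batch, seq_len, d_cache, n_layers, bytes_per_elem):
    """Size in bytes of a [B, T, d] cache kept for every layer."""
    return batch * seq_len * d_cache * n_layers * bytes_per_elem


def savings_pct(baseline_bytes, compressed_bytes):
    """Percent reduction of the compressed cache relative to the baseline."""
    return (1.0 - compressed_bytes / baseline_bytes) * 100.0


def sequences_that_fit(gpu_bytes, weight_bytes, per_seq_cache_bytes):
    """Parallel sequences that fit after subtracting model weights."""
    return max(0, (gpu_bytes - weight_bytes) // per_seq_cache_bytes)


N_LAYERS = 12        # assumption for illustration
BYTES_PER_ELEM = 4   # assumption: float32 cache entries
D_LATENT, D_COMPRESSED = 256, 128  # shapes listed in step 2

expected = (1.0 - D_COMPRESSED / D_LATENT) * 100.0
for seq_len in (256, 512, 1024):
    mla = theoretical_cache_bytes(1, seq_len, D_LATENT, N_LAYERS, BYTES_PER_ELEM)
    kvs = theoretical_cache_bytes(1, seq_len, D_COMPRESSED, N_LAYERS, BYTES_PER_ELEM)
    saved = savings_pct(mla, kvs)
    ok = abs(saved - expected) <= 5.0  # tolerance in percentage points
    print(f"T={seq_len}: MLA {mla / 2**20:.2f} MiB, "
          f"KVSplice {kvs / 2**20:.2f} MiB, saved {saved:.1f}% "
          f"({'ok' if ok else 'CHECK'})")
```

Because the cache size is linear in T, the savings percentage is the same at every sequence length. A measured value that drifts with T points at overhead that is not part of the cache tensors.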

Transform Parameter Analysis

KVSplice uses a learned monotonic transform before low-rank projection. To verify it's learning:

  1. Extract parameters from checkpoint:

    python scripts/extract_kvsplice_params.py \
      --checkpoint test_matrix_results_*/checkpoint.pt
  2. Check for variance across dimensions: If all scale values are identical and all shift values are zero, parameters are not learning

  3. Initial values to expect:

    • Scale: softplus(1.0) ≈ 1.3133 (initialization)
    • Shift: 0.0 (initialization)
    • After training: should show variance across 256 dimensions
  4. Pruning candidates: Dimensions with scale < 0.1 after training are low-importance and candidates for pruning

  5. LayerNorm impact: If transform parameters don't learn, try adding LayerNorm to latent space to stabilize gradients
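
The checks in steps 2 through 4 can be sketched on plain lists of extracted values. This assumes the scale and shift values have already been pulled out of a checkpoint as Python floats. The function names and the tolerance are hypothetical and are not part of scripts/extract_kvsplice_params.py.

```python
import math
from statistics import pstdev

# softplus(1.0) = ln(1 + e), the initialization value quoted above
INIT_SCALE = math.log1p(math.exp(1.0))


def is_stuck_at_init(scale, shift, tol=1e-4):
    """True when all scale values are identical and all shift values are zero."""
    scale_flat = pstdev(scale) <= tol
    shift_zero = all(abs(s) <= tol for s in shift)
    return scale_flat and shift_zero


def pruning_candidates(scale, threshold=0.1):
    """Indices of dimensions whose scale fell below the threshold."""
    return [i for i, s in enumerate(scale) if s < threshold]


# Untrained parameters across 256 dimensions.
untrained_scale = [INIT_SCALE] * 256
untrained_shift = [0.0] * 256
print(f"init scale = {INIT_SCALE:.4f}")
print("stuck at init:", is_stuck_at_init(untrained_scale, untrained_shift))

# Made-up trained values for four dimensions.
trained_scale = [1.2, 0.05, 0.3, 0.01]
trained_shift = [0.0, 0.1, -0.2, 0.0]
print("stuck at init:", is_stuck_at_init(trained_scale, trained_shift))
print("pruning candidates:", pruning_candidates(trained_scale))
```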

Known Issues

Transform parameters not learning: Current experiments show KVSplice transform parameters remain at initialization values (scale ≈ 1.3133, shift = 0.0) even after 1000+ iterations. This means KVSplice is working purely via low-rank projection (compress/expand layers), not the learned transform. This may be optimal if the compress/expand layers can learn the mapping directly.

Architecture detection for metrics: Early versions failed to log KVSplice metrics because the code only checked for raw_model.transformer (standard GPT-2), whereas MLA uses raw_model.blocks. Fixed in the commit that added dual architecture detection.

Memory measurement pitfalls: Don't measure cache memory by running full forward passes (passing all previous tokens). This defeats the purpose of caching. Instead, extract cache objects from blocks with use_cache=True and measure tensor sizes directly.
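
Measuring tensor sizes directly can be sketched as below. The nested cache layout (a tensor, or a list, tuple, or dict of tensors, possibly with None entries) is an assumption about what the blocks return. `FakeTensor` is a stand-in so the sketch runs without PyTorch; a real `torch.Tensor` exposes the same `numel()` and `element_size()` methods.

```python
import math


class FakeTensor:
    """Stand-in for torch.Tensor exposing only numel() and element_size()."""

    def __init__(self, shape, bytes_per_elem):
        self.shape = tuple(shape)
        self._bytes_per_elem = bytes_per_elem

    def numel(self):
        return math.prod(self.shape)

    def element_size(self):
        return self._bytes_per_elem


def measured_cache_bytes(cache) -> int:
    """Sum numel * element_size over every tensor in a nested cache."""
    if cache is None:
        return 0
    if isinstance(cache, dict):
        return sum(measured_cache_bytes(v) for v in cache.values())
    if isinstance(cache, (list, tuple)):
        return sum(measured_cache_bytes(c) for c in cache)
    return cache.numel() * cache.element_size()


# Hypothetical per-layer caches of shape [B, T, d_compressed] in float32.
per_layer = [FakeTensor((1, 1024, 128), 4) for _ in range(12)]
print(f"{measured_cache_bytes(per_layer) / 2**20:.2f} MiB")
```

This counts only the cache tensors, so activations and allocator overhead from the forward pass do not inflate the result.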

Verification Scripts

  • scripts/verify_kvsplice_memory.py: Measure cache tensor sizes
  • scripts/extract_kvsplice_params.py: Extract learned transform parameters
  • scripts/compare_kvsplice_gpus.py: Compare results across GPU types
  • scripts/plot_kvsplice_inference_memory.py: Generate visualization plots

Documentation Updates

After verification, update documentation with plots and results:

  1. Add inference verification section to docs/kvsplice.md:

    • Include cache memory comparison plots
    • Show compression breakdown visualization
    • Document cache tensor shapes
    • Provide memory savings table
  2. Update GPU comparison summary in docs/kvsplice/gpu-comparison-summary.md:

    • Add inference verification results
    • Compare theoretical vs actual compression
    • Document production implications
  3. Generate publication-quality plots (300 DPI):

    python scripts/plot_kvsplice_inference_memory.py

See docs/kvsplice.md for complete inference verification results with plots showing 50% cache reduction (12 MB → 6 MB at 1024 tokens) and 83.3% total reduction vs standard GPT-2 (36 MB → 6 MB).

Documentation

  • Keep changes well-documented in commit messages
  • Explain technical rationale for optimizations
  • Include performance impact where applicable

Avoid silly language

You are not allowed to use the word "comprehensive". It is overused and does not explain anything. We prefer to be terse and to the point.

Companion Repositories for paper-memory-decode

The paper "Memory-Traffic Saturation in Autoregressive Transformer Decode" lives at knlp.io/decode and depends on three modified serving-stack repos plus the paper LaTeX repo. All four are public on GitHub:

| Repo | GitHub | Branch | What it contains |
|------|--------|--------|------------------|
| vllm-asym | github.com/mcgrof/vllm | asymmetric-kv-plumbing | vLLM v1 with tuple K/V cache, FlashAttn writer patch, asym dtype plumbing |
| flashinfer-asym | github.com/mcgrof/flashinfer | asym-prefill-refactor-stage | FlashInfer with FI-1..FI-5 CUDA template refactor for independent K/V dtypes in prefill+decode |
| lmcache | github.com/mcgrof/LMCache | asymmetric-kv-codec | LMCache with K16/V8 codec, split-tier placement, serde, 74 CPU unit tests |
| paper | github.com/mcgrof/paper-memory-decode | main | LaTeX source, figures, data, generate scripts |

On monster (the primary workstation), the local clones live at:

| Repo | Path | Branch |
|------|------|--------|
| vllm-asym | /home/mcgrof/devel/vllm-asym | asymmetric-kv-plumbing |
| flashinfer-asym | /home/mcgrof/devel/flashinfer-asym | asym-prefill-refactor-stage |
| lmcache | /home/mcgrof/devel/lmcache | asymmetric-kv-codec |
| paper | /home/mcgrof/devel/paper-memory-decode | main |

On prune (the storage server), mirrors live under /data/. Push with dated branch refs to avoid disturbing prune's checked-out branch:

git push prune branch:refs/heads/branch-monster-YYYY-MM-DD
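
The dated refspec can be generated so the date is never typed by hand. A minimal sketch follows; `dated_refspec` is a hypothetical helper, not an existing script in this repository.

```python
from datetime import date
from typing import Optional


def dated_refspec(branch: str, host: str = "monster",
                  day: Optional[date] = None) -> str:
    """Build '<branch>:refs/heads/<branch>-<host>-YYYY-MM-DD'."""
    day = day or date.today()
    return f"{branch}:refs/heads/{branch}-{host}-{day.isoformat()}"


# Print the push command for today's date instead of running it.
print("git push prune", dated_refspec("asymmetric-kv-codec"))
```

Pushing to a dated ref creates a new branch on prune each time, so the branch checked out there is never updated underneath a working tree.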

Key results archive: prune:/data/knlp-key-results/flashinfer-asym-e2e-20260427/

Building the asym serving stack on a GPU pod

The vLLM asym branch requires torch >= 2.10, cmake >= 4.0, and the FlashInfer cutlass submodule initialized. The tested recipe (H100 SECURE pod, RunPod):

# 1. FlashInfer
cd /root && git clone --branch asym-prefill-refactor-stage \
    https://github.com/mcgrof/flashinfer.git flashinfer-src
cd flashinfer-src && git submodule update --init --recursive
pip install --no-build-isolation -e .

# 2. vLLM (pulls torch and rebuilds _C; ~60 min CUDA compile)
cd /root && git clone --branch asymmetric-kv-plumbing \
    https://github.com/mcgrof/vllm.git vllm-src
cd vllm-src && MAX_JOBS=32 NVCC_THREADS=2 \
    pip install --no-build-isolation -e .

# 3. Reinstall flashinfer editable (vllm pip overwrites with PyPI 0.6.6)
cd /root/flashinfer-src && pip install --no-build-isolation -e .

# 4. Verify
FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "import vllm, flashinfer"
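
Before starting the roughly 60 minute CUDA compile, the stated minimums (torch >= 2.10, cmake >= 4.0) can be checked up front. The sketch below is an assumption about how such a preflight could look. It compares versions as integer tuples because string or float comparison would rank 2.10 below 2.9.

```python
import re
import shutil
import subprocess
from importlib import metadata


def parse_major_minor(text: str):
    """Extract the first 'MAJOR.MINOR' pair from a version string."""
    m = re.search(r"(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"no version found in {text!r}")
    return int(m.group(1)), int(m.group(2))


def meets_minimum(found: str, minimum) -> bool:
    # Tuple comparison: (2, 10) >= (2, 9) is True, unlike "2.10" >= "2.9"
    return parse_major_minor(found) >= minimum


def installed_torch_version():
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def installed_cmake_version():
    exe = shutil.which("cmake")
    if exe is None:
        return None
    out = subprocess.run([exe, "--version"], capture_output=True,
                         text=True, check=False)
    return out.stdout.splitlines()[0] if out.stdout else None


for name, found, minimum in [
    ("torch", installed_torch_version(), (2, 10)),
    ("cmake", installed_cmake_version(), (4, 0)),
]:
    if found is None:
        print(f"{name}: not found")
    else:
        status = "ok" if meets_minimum(found, minimum) else "too old"
        print(f"{name}: {found} ({status}, need >= {minimum[0]}.{minimum[1]})")
```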

The asym K16/V8 production recipe in Python:

from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="bfloat16",
    kv_cache_dtype=("auto", "fp8_e4m3"),
    attention_config={"backend": "FLASHINFER"},
)

The VLLM_ATTENTION_BACKEND env var is not honored in this vLLM build; pass attention_config={"backend": "FLASHINFER"} to the LLM constructor instead. Without it, auto-selection picks FlashAttention, which lacks the asym tuple writer.

Paper build

cd /home/mcgrof/devel/paper-memory-decode && make

Generates figures via Python scripts, then runs pdflatex (3 passes for cross-refs). Always verify the rendered PDF with:

pdftotext paper.pdf - | grep -nE '<pattern>'

Source-level grep misses issues in figure PDFs and broken LaTeX label resolution (e.g., Table V-C0c from a \label inside \begin{center} instead of \begin{table}).
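
The rendered-PDF check can also be scripted. The sketch below assumes `pdftotext` is on PATH and that the output file is named paper.pdf; `grep_lines` and `pdf_text` are hypothetical helpers. The example pattern looks for "??", which is how LaTeX renders an unresolved reference.

```python
import re
import subprocess


def grep_lines(text: str, pattern: str):
    """Return (line_number, line) pairs matching the regex, like grep -nE."""
    rx = re.compile(pattern)
    return [(n, line) for n, line in enumerate(text.splitlines(), start=1)
            if rx.search(line)]


def pdf_text(pdf_path: str) -> str:
    """Extract text with 'pdftotext <file> -', which writes to stdout."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Demo on inline text so the sketch runs without a PDF present.
# For the real check use: grep_lines(pdf_text("paper.pdf"), r"\?\?")
sample = "Results are in Table II.\nSee Table ?? for details.\nEnd."
for number, line in grep_lines(sample, r"\?\?"):
    print(f"{number}:{line}")
```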

Reproducibility System (paper-memory-decode)

The knlp defconfig system is being extended with paper reproduction profiles. The planned targets:

make defconfig-decode       # Core asym claims (1×H100, 4-8h warm)
make defconfig-decode-sat   # Saturation model (1×H100, 18-36h)
make defconfig-decode-full  # Everything (multi-GPU, days)

After selecting a defconfig, make runs:

decode-doctor → decode-fetch → decode-build →
decode-run → decode-report → decode-upload (optional)

The orchestrator lives under tools/reproduce/paper_memory_decode/. Each stage writes results to results/decode/<run_id>/stages/<stage>/ with DONE, metrics.jsonl, stdout.log, stderr.log. Rerunning make resumes from the first missing DONE.
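
The resume rule can be sketched as a scan for the first stage without a DONE marker. This is an illustration, not the orchestrator's code: it assumes the stage directory names equal the make target names, and it leaves out the optional decode-upload stage.

```python
import tempfile
from pathlib import Path

# Assumed stage directory names, in pipeline order.
STAGES = ["decode-doctor", "decode-fetch", "decode-build",
          "decode-run", "decode-report"]


def first_incomplete_stage(run_dir: Path, stages=STAGES):
    """Return the first stage missing its DONE file, or None if all are done."""
    for stage in stages:
        if not (run_dir / "stages" / stage / "DONE").exists():
            return stage
    return None


# Demo in a temporary directory: mark the first two stages as finished.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "results" / "decode" / "example-run"
    for stage in STAGES[:2]:
        stage_dir = run_dir / "stages" / stage
        stage_dir.mkdir(parents=True)
        (stage_dir / "DONE").touch()
    print(first_incomplete_stage(run_dir))  # decode-build
```

Writing DONE as the last step of each stage means an interrupted stage is rerun from its start, which keeps partial metrics.jsonl output from being treated as complete.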

Telemetry: local JSONL is mandatory and canonical. W&B and trackerio are optional mirrors controlled by .config flags and env vars (WANDB_API_KEY, HF_TOKEN).

The defconfigs pin exact git refs for vllm, flashinfer, lmcache, and paper-memory-decode, and clone/fetch them into ../ (the parent directory).

Memory

I want you to remember most of our conversations about this project.