This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
- Make small, atomic commits - one logical change per commit
- Each commit should be functional and not break the build
- Run code formatter (black for Python) after each change
- Always run scripts/fix_whitespace_issues.py on all files
- Test that code runs successfully before committing
- MANDATORY: Always use this exact format for ALL commits:

  file.py: brief description of change

  Detailed explanation of what was changed and why. Include
  technical details about the implementation.

  Generated-by: Claude AI
  Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

- LINE LENGTH: Maximum 70 characters per line in commit messages
  - Subject line (first line): 70 characters max
  - Body paragraphs: 70 characters max per line
  - Ensures proper display in git log, email patches, and terminal output
- CRITICAL: Never use "🤖 Generated with [Claude Code]" or "Co-Authored-By: Claude"
- REQUIRED: Every commit MUST have both "Generated-by: Claude AI" and "Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>"
- NO EXCEPTIONS: This format is mandatory for ALL commits, no matter how small
- STYLE: Be terse and to the point. NO shopping-list style bullet points. Write in paragraphs explaining the change, rationale, and technical details concisely. Avoid verbose enumeration unless absolutely necessary for clarity.
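The commit rules above are mechanical enough to check before committing. The following is a minimal sketch, assuming the message is available as a plain string; `check_commit_message` is a hypothetical helper and not a script that exists in this repository.

```python
MAX_LINE = 70
REQUIRED_TRAILERS = (
    "Generated-by: Claude AI",
    "Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>",
)
FORBIDDEN = ("Generated with [Claude Code]", "Co-Authored-By: Claude")


def check_commit_message(message: str) -> list[str]:
    """Return rule violations; an empty list means the message passes."""
    problems = []
    lines = message.splitlines()
    # Every line, subject and body alike, is limited to 70 characters.
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE:
            problems.append(f"line {number} is {len(line)} chars (max {MAX_LINE})")
    # Both trailers must appear as complete lines.
    for trailer in REQUIRED_TRAILERS:
        if trailer not in lines:
            problems.append(f"missing trailer: {trailer}")
    # The default Claude Code attribution strings are not allowed.
    for text in FORBIDDEN:
        if text in message:
            problems.append(f"forbidden text: {text}")
    return problems
```

A check like this could run from a `commit-msg` hook, though wiring it up that way is a suggestion rather than an existing part of the workflow.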
Some automation looks for agent-specific instruction files (e.g., CODEX.md)
instead of CLAUDE.md. To avoid future assistants missing these guidelines,
ensure every agent entrypoint symlinks back to this document. For Codex runs,
CODEX.md must always be a symlink to CLAUDE.md; add additional symlinks if
new agent names are introduced.
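One way to confirm the symlink requirement is a small check like the one below. It is a sketch under the assumption that both files live in the repository root; `agent_symlink_ok` is a hypothetical name.

```python
import os


def agent_symlink_ok(repo_root: str, entrypoint: str = "CODEX.md",
                     target: str = "CLAUDE.md") -> bool:
    """True if repo_root/entrypoint is a symlink that resolves to repo_root/target."""
    link = os.path.join(repo_root, entrypoint)
    if not os.path.islink(link):
        # A regular file copy would drift from CLAUDE.md over time.
        return False
    expected = os.path.realpath(os.path.join(repo_root, target))
    return os.path.realpath(link) == expected
```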
- Make a single focused change
- Run `black` formatter on Python files
- Test that the code runs without errors
- If architectural changes: Run `make check` to validate
- Commit with detailed message
- Repeat for next change
Architectural changes include:
- New attention or MLP mechanisms
- Modified forward/backward pass logic
- Changes to model patching or wrapper classes
- New ablation steps or configurations
- Updates to reciprocity/context flow
- Use `black` formatter for all Python code
- Follow PEP 8 conventions (handled by black)
- No manual formatting - always use black
- CRITICAL: Defconfig files must use exact Kconfig syntax: `CONFIG_XXX=y` (no spaces around `=`)
- CRITICAL: NO inline comments allowed - comments MUST be on separate lines starting with `#`
  - ✅ CORRECT:

    # This is a comment
    CONFIG_SOMETHING=y

  - ❌ WRONG (breaks Kconfig parser):

    CONFIG_SOMETHING=y # This breaks everything
    CONFIG_SOMETHING=y  # This also breaks

- DO NOT apply `black` formatter to defconfig files or `.config` files
- Kconfig parser silently ignores lines with spaces around equals signs
- After any edit to defconfigs, verify syntax: `grep " = " defconfigs/*` should return nothing
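The `grep` check only covers spaces around the equals sign. A slightly broader check that also flags inline comments could look like the sketch below. `lint_defconfig` is a hypothetical helper, and the treatment of `#` inside double-quoted values is an assumption made for illustration.

```python
import re

# "CONFIG_FOO = y", "CONFIG_FOO =y" and "CONFIG_FOO= y" all count as spaced.
SPACED_ASSIGNMENT = re.compile(r"^\s*CONFIG_\w+(\s+=|=\s+)")
# Quoted values may legitimately contain '#', so blank them before searching.
QUOTED = re.compile(r'"[^"]*"')


def lint_defconfig(text: str) -> list[str]:
    """Return one message per offending line of a defconfig file."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and full-line comments are fine
        if SPACED_ASSIGNMENT.match(line):
            problems.append(f"line {number}: spaces around '='")
        if "#" in QUOTED.sub('""', stripped):
            problems.append(f"line {number}: inline comment")
    return problems
```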
- CRITICAL: Write documentation as technical prose, not as an AI-flavored outline.
- Start by stating plainly what the document is for and why it exists.
- Pull the key motivation into the top of the document. Do not bury the use case deep in the file.
- Use a short Table of Contents for longer documents.
- Prefer narrative paragraphs over shopping-list bullet dumps.
- Use bullets only when they genuinely improve readability: short file lists, compact result summaries, or small enumerations.
- Do not create sections like "What this is" if that content belongs in the intro.
- Do not write prompt-y filler like "This document stands on its own" or hedged internal commentary like "the evidence is real and already public".
- Do not apologize for confusing structure inside the main doc. Fix the structure.
- Standalone docs must stand on their own. Do not make the reader chase older internal notes just to understand the actual result.
- If lineage or provenance matters, split it into a separate lineage/provenance doc and link to it. Keep the main doc focused on the result itself.
- When older work matters, summarize the result directly in the main doc. Put historical breadcrumbs in the lineage doc, not in the main narrative.
- Use direct links for referenced docs and scripts. Do not write bare paths when a real markdown link is more useful.
- Avoid weird audience markers like "public narrative" or "public scripts" unless the distinction truly matters.
- Avoid robotic imperative spam like repeated "Use X" / "Do Y" / "Start with". Mix in natural phrasing such as "You can use..." when it reads better.
- If a term is used repeatedly, define it once before leaning on it. Example: define "lane" before saying "lane configs".
- If a CLI flag can be misunderstood, explain it where it is introduced. Example: `--gpu all` means all configured tracks, not local GPU autodetect.
- When documenting examples or use cases, prefer short narrative subsections to giant bullet farms.
- If a path name is historically misleading, fix the repo layout if practical. Do not leave the doc to carry the whole burden of explaining a bad structure.
- Documentation edits should reduce confusion, not relocate it.
All GPT-2 model variants follow a consistent hierarchical naming pattern that makes the research intent and feature composition clear. Models are automatically discovered by scripts using convention-based introspection.
Base architecture with incremental features:
GPT2 # Baseline GPT-2 (no modifications)
GPT2_RA # + Reciprocal Attention (learned alternation)
GPT2_MLA # + Multi-head Latent Attention (cache compression)
GPT2_MLA_RA # + Both MLA and RA
GPT2_MLA_KV # + MLA with KV compression (KVSplice)
GPT2_MLA_RA_KV # + MLA + RA + KV compression
GPT2_MLA_KV2 # + MLA with 2-latent (separate K/V latents)
GPT2_MLA_KV2M # + KV2 + MLP compression (MLPSplice)
GPT2_MLA_RA_KVM # + RA + MLA + KV + MLP compression
Feature abbreviations:
- `_RA`: Reciprocal Attention (alternating Q@K.T and K@Q.T)
- `_MLA`: Multi-head Latent Attention (DeepSeek-style cache compression)
- `_KV`: KV compression via KVSplice
- `_KV2`: 2-latent MLA (separate K and V latent spaces)
- `M` suffix: MLP compression via MLPSplice
To add a new GPT-2 model variant, follow these conventions (automatically discovered by scripts, no manual registration needed):
- Naming: Class name must start with `GPT2`
  - Example: `GPT2_MyFeature` or `GPT2_MLA_MyFeature`
- Location: Define in `ra.py` (for MLA-based) or `gpt2/model.py` (for baseline variants)
- Inheritance: Must inherit from `nn.Module`
- Configuration Parameter:
  - Use `config: GPTConfig` for GPT-2 baseline architecture
  - Use `cfg: RA_MLA_Config` for MLA-based architectures
  - Must be the first parameter after `self` in `__init__`
- Interface: Must implement `get_num_params()` method

Example:
class GPT2_MLA_MyFeature(nn.Module):
    def __init__(self, cfg: RA_MLA_Config, vocab_size: int = 50257):
        super().__init__()
        self.cfg = cfg
        # ... implementation ...

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())

    def forward(self, idx, targets=None):
        # ... implementation ...
        return logits, loss

That's it! The model will be automatically discovered by:

- `scripts/compare_inference.py --list` (inference benchmarking)
- Future scripts using the `discover_gpt2_models()` pattern
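The conventions above are what make introspection-based discovery possible. The sketch below shows one way such discovery could work. It is not the repository's `discover_gpt2_models()` implementation: the `module` and `base` parameters are assumptions, and in real use `base` would be `torch.nn.Module`.

```python
import inspect
import types


def discover_gpt2_models(module: types.ModuleType, base: type) -> dict[str, type]:
    """Collect classes in `module` that follow the GPT2 naming conventions."""
    found = {}
    for name, obj in inspect.getmembers(module, inspect.isclass):
        if not name.startswith("GPT2"):
            continue
        if not issubclass(obj, base):
            continue
        if not callable(getattr(obj, "get_num_params", None)):
            continue
        try:
            params = list(inspect.signature(obj.__init__).parameters)
        except (TypeError, ValueError):
            continue
        # The config object must be the first parameter after self.
        if len(params) >= 2 and params[1] in ("config", "cfg"):
            found[name] = obj
    return found
```

Filtering on the class name, base class, constructor signature, and `get_num_params()` mirrors the four conventions listed above, which is why no manual registration step is needed.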
Ablation steps use short prefixes that map to model architectures:
B → GPT2 baseline
RA → GPT2_RA (fixed reciprocal)
RALEARN → GPT2_RA (learned reciprocal)
MLA → GPT2_MLA
RAMLA → GPT2_MLA_RA
RAMLAKV → GPT2_MLA_RA_KV
RAMLAKVM → GPT2_MLA_RA_KVM
MLAKV → GPT2_MLA_KV
MLAKV2 → GPT2_MLA_KV2
MLAKV2M → GPT2_MLA_KV2M
Legacy step names with "0" or "1" suffixes (for learning rate ablation) are still supported for backwards compatibility but are deprecated. New experiments should use the architecture name directly without suffixes.
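The prefix table can be expressed as a lookup with a fallback for the deprecated suffixes. This is a sketch only: `resolve_step` is a hypothetical helper, and the assumption that a legacy name is an architecture prefix followed by a single "0" or "1" is inferred from the description above.

```python
STEP_TO_MODEL = {
    "B": "GPT2",
    "RA": "GPT2_RA",
    "RALEARN": "GPT2_RA",
    "MLA": "GPT2_MLA",
    "RAMLA": "GPT2_MLA_RA",
    "RAMLAKV": "GPT2_MLA_RA_KV",
    "RAMLAKVM": "GPT2_MLA_RA_KVM",
    "MLAKV": "GPT2_MLA_KV",
    "MLAKV2": "GPT2_MLA_KV2",
    "MLAKV2M": "GPT2_MLA_KV2M",
}


def resolve_step(step: str) -> str:
    """Map an ablation step prefix to its model class name."""
    # Exact names win, so "MLAKV2" is never mistaken for a suffixed "MLAKV".
    if step in STEP_TO_MODEL:
        return STEP_TO_MODEL[step]
    # Deprecated form: architecture name plus a trailing "0" or "1".
    if step[-1:] in ("0", "1") and step[:-1] in STEP_TO_MODEL:
        return STEP_TO_MODEL[step[:-1]]
    raise KeyError(f"unknown ablation step: {step}")
```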
When optimizing PyTorch training for AMD GPUs:
- Increase batch size to utilize GPU memory
- Enable cuDNN benchmark mode
- Use mixed precision training (AMP)
- Add multiple data loader workers with pinned memory
- Include GPU warmup routine
- Use torch.compile() for graph optimization
- Enable TensorFloat32 for matrix operations
- Add comprehensive timing and metrics
- Save trained models after completion
- Display GPU info at startup
- Show per-epoch timing
- Track test accuracy after each epoch
- Report total training time and average per epoch
- Primary GPU: AMD Radeon Pro W7900 (48GB)
- Optimize for maximum GPU utilization
- Always verify code runs before committing
- Check for linting/formatting issues
- Ensure no syntax errors
Research proceeds in two phases with different tooling:
During active research, run scripts directly. Kconfig adds overhead that slows iteration. Write standalone Python scripts, eval harnesses, and benchmarks. Run them directly:
python3 scripts/my_experiment.py
python3 eval_v28.py --phase 0

Results go to local directories. Once experiments converge and produce publishable results, move to Phase 2.
Once R&D stabilizes and we have results worth preserving, lock the experiment into the Kconfig build system for reproducibility:
- Load configuration: `make defconfig-<name>`
- Build and run: `make`
- Results are saved to the configured output directory
The Kconfig system ensures exact reproducibility of finalized experiments. Do NOT use Kconfig during rapid iteration — it slows things down without adding value until the experiment design is stable.
We use the Ralph Loop plugin (ralph-loop from
claude-plugins-official) for large experiments that span
multiple phases. The loop uses a Stop hook to prevent Claude
from exiting — instead it feeds the same prompt back,
creating a self-referential iteration loop where each pass
sees the previous work in files and git history.
How it works:
- User writes a task file (e.g., `BPA-v42.txt`) with numbered phases and clear completion criteria
- User invokes `/ralph-loop` with the task and a `--completion-promise` (typically `COMPLETE`)
- Claude reads the task file, executes phases in order, commits results, and outputs `<promise>COMPLETE</promise>` when genuinely done
- If Claude tries to exit before completion, the stop hook blocks and re-feeds the prompt; Claude sees its own prior work and continues from where it left off
Typical invocation:
/ralph-loop Read BPA-v42.txt and execute all phases in order \
--completion-promise COMPLETE --max-iterations 2000
Task file conventions (e.g., BPA-v42.txt):
- Number all phases/tasks clearly (Task 1, Task 2, ...)
- Include concrete success criteria per phase
- Specify what to commit and when
- Include a final task that summarizes results
- Keep tasks independent enough that resumption works if context compresses mid-run
When to use Ralph Loop:
- BPA experiment versions (multi-phase GPU experiments)
- Any task with >3 sequential phases
- Tasks that may exceed a single context window
- Overnight/unattended experiment runs
When NOT to use Ralph Loop:
- Quick one-shot edits or fixes
- Interactive design discussions
- Tasks requiring human judgment between steps
Monitoring: head -10 .claude/ralph-loop.local.md
shows current iteration count and state.
Cancelling: /cancel-ralph removes the state file and
stops the loop.
The main working tree. Contains bleeding-edge code, scripts, model implementations, eval harnesses, and configuration. Keep this repo slim. Do NOT commit large artifacts, result directories, checkpoints, plots, or bulk experimental output here.
What belongs here:
- Model code (`gpt2/`, `gnn/`)
- Eval harnesses (`eval_v*.py`)
- Scripts (`scripts/`)
- Defconfigs and Kconfig files
- Documentation (`docs/`)
- Small CSV/JSON files that are actively used by scripts
What does NOT belong here:
- Result directories (`bpa_v*_results/`, `results/`)
- Artifact directories (`artifacts/`)
- Training checkpoints (`.pt`, `.pth`)
- Generated plots and PNGs (except `images/` for README)
- Final reports and scoreboards (archive to key-results)
- Large experiment output
Stores all historical results, artifacts, and bulk data. Commit freely here — size is not a concern. Organized by research area:
- `bpa/`: BPA (Bit Precision Allocation) results v1-v28+, artifacts, figures, scoreboards, reports, branch trees
- `key_results/`: Training matrix results with W&B configs
When an experiment version is complete, copy its artifacts:
cp -a results/v29/ /data/knlp-key-results/bpa/results/v29/
cp bpa_v29_final_report.md /data/knlp-key-results/bpa/
cd /data/knlp-key-results && git add bpa/ && git commit

Each paper gets its own git repository. The paper repo contains LaTeX source, figures, and any data needed to reproduce the paper build. See "Paper Workflow" below.
Each BPA experiment version is defined in an instruction file
(e.g., BPA-v48.txt) that specifies the experiment phases,
models, metrics, and completion criteria. These files are given
to Claude via the Ralph Loop plugin for autonomous execution.
- Use `BPA-vNN.txt` (all-caps prefix, dash, lowercase v, number)
- Examples: `BPA-v46.txt`, `BPA-v47.txt`, `BPA-v48.txt`
- During execution: instruction file lives in `/data/knlp/`
- After completion: move to `/data/knlp-key-results/bpa-instructions/`
- All historical instructions are archived there (v2 through v48+)
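The naming rule is strict enough to encode as a regular expression. A minimal sketch follows; `instruction_version` is a hypothetical helper.

```python
import re

# All-caps prefix, dash, lowercase v, decimal version number, .txt extension.
INSTRUCTION_NAME = re.compile(r"^BPA-v(\d+)\.txt$")


def instruction_version(filename: str):
    """Return the version number of a BPA instruction file, or None if misnamed."""
    match = INSTRUCTION_NAME.match(filename)
    return int(match.group(1)) if match else None
```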
Each BPA experiment stores results in a versioned directory under
/data/knlp-key-results/bpaNNN/ with a consistent structure:
/data/knlp-key-results/bpa48/
├── json/ # All metrics as JSON files
├── plots/ # Generated PNG figures (300 DPI)
├── logs/ # Execution logs
├── models/ # Model checkpoints (if any)
├── fim_maps/ # Fisher sensitivity maps
├── interaction_maps/ # Cross-layer interaction data
└── bpa48_summary.md # Final summary report
Required conventions:
- All metrics saved as JSON (machine-readable)
- All plots saved as PNG at 300 DPI
- Summary report in markdown with tables, plot references, and interpretation
- Execution script committed to `/data/knlp/scripts/` as `bpa_vNN_w7900.py` (or appropriate GPU name)
- Results committed to key-results repo separately from code
BPA experiment scripts follow a consistent pattern established in v46+:
- Load model with `attn_implementation='eager'` and `.to('cuda')`
- Use BF16 dtype (FP16 causes NaN with large-vocab models)
- CPU-offload logits to avoid OOM during dual-pass evaluation
- Cache prior results from JSON for incremental re-runs
- OOM try/except handling around all GPU-intensive loops
- Early collapse detection before full evaluation
Each paper lives in a separate git tree (e.g.,
/data/paper-memory-decode/). The paper is written and built
from the knlp working tree where all code and key-results
are accessible.
- Paper repos are independent git trees, one per paper
- The paper repo must be self-contained for building: LaTeX, figures, BibTeX, and any data referenced by the paper
- Experiment scripts live in knlp, not the paper repo
- When experiments produce figures or tables for the paper, copy them into the paper repo and commit there
- If the paper references specific data (CSV, JSON), copy the relevant subset into the paper repo so `make` works without external dependencies
- Run experiments in knlp (Phase 1 rapid R&D)
- Archive results to knlp-key-results
- Copy figures/tables/data into the paper repo
- Write LaTeX in the paper repo
- Commit both the paper repo and knlp independently
This ensures the paper repo can build on any machine with
just a git clone and make, while knlp stays focused on
code and knlp-key-results holds the full artifact history.
- `.config` files use string values: `"y"`, `"n"`, `"value"`
- `config.py` converts to Python types: `True`, `False`, integers, floats
- When checking config values in Python code, handle both types:

  # Good - handles both string and boolean
  if value in ("y", True):

  # Bad - only works with one type
  if value == "y":
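Centralizing the check avoids repeating the two-type comparison at every call site. The helper below is a sketch with a hypothetical name. It is slightly stricter than `value in ("y", True)`, because that expression also accepts the integer `1` (in Python, `1 == True`).

```python
def config_enabled(value) -> bool:
    """True for the raw .config string "y" or the converted Python True."""
    # `is True` rejects truthy look-alikes such as 1 or a non-empty string.
    return value is True or value == "y"
```

A call site would then read `if config_enabled(getattr(config, "SOME_OPTION", None)):`, where `SOME_OPTION` stands in for a real option name.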
- Mutually exclusive: Cannot enable both `CONFIG_TEST_MATRIX_MODE` and `CONFIG_RA_MLA_ABLATION_MODE`
- Test matrix mode: Tests optimizer/pruning combinations
- Ablation mode: Tests architectural variations (RA, MLA, RA-CT, etc.)
- Always verify which mode is active when debugging unexpected test counts
When extending ablation studies with new steps, THREE files must be updated in sync:
- defconfigs/gpt2-ratio-ablation: Add step descriptions in comments
- gpt2/train_ra_mla.py: Add step configurations (elif step == "N" blocks)
- scripts/run_test_matrix.py: Update `step_descriptions` dictionary
Missing any of these causes:
- Defconfig only: Steps run but have no description
- train_ra_mla.py only: Steps fail to execute
- run_test_matrix.py only: Descriptions show but steps don't run
When adding a new ablation step:
- Add step config block to train_ra_mla.py (around line 500+)
- Update step_descriptions dict in run_test_matrix.py (around line 2095)
- Document step in defconfig comments
- Update CONFIG_RA_MLA_ABLATION_STEPS string to include new step number
- REQUIRED: Validate with dry-run:
./scripts/validate_ablation_steps.sh
CRITICAL: Always validate architectural changes with dry-run before committing GPU resources. Recent bugs wasted 7+ hours of GPU time that dry-run would have caught in 60 seconds.
Run dry-run validation before:
- Committing architectural changes (new attention mechanisms, MLP modifications)
- Adding new ablation steps
- Modifying forward/backward pass logic
- Changing wrapper classes or patching code
- After fixing bugs that affected multiple configurations
# Run full architecture validation via Makefile
make check
# Completes in ~97 seconds (19 steps @ ~5s each)
# Loads gpt2-ratio-ablation config with DRY_RUN=1
# Tests all ablation steps automatically
# Exit code 0: all pass, 1: failures detected

ALWAYS run `make check` before committing architectural changes that may affect runtime behavior.
# Test specific ablation step
python3 gpt2/train_ra_mla.py --ra-mla-ablation-step N \
--optimizer adamwspam --dataset finewebedu --dry-run
# Exit code 0: architecture valid
# Exit code 1: error (prints stack trace)

# Test all 19 RATIO ablation steps (manual script)
./scripts/validate_ablation_steps.sh
# Completes in ~60 seconds
# Reports which steps pass/fail
# Provides commands to debug failures

Dry-run validation catches:

- Configuration errors (wrong test mode, invalid parameters)
- Architecture errors (TypeError from wrong arguments)
- Assertion failures (missing required data)
- Forward pass failures (dimension mismatches)
- Backward pass failures (gradient computation errors)
- Optimizer step failures (parameter update errors)
Dry-run validation does not catch:

- OOM errors (uses small batch on CPU)
- Multi-GPU/DDP issues (runs single CPU)
- Data loading errors (uses dummy data)
- Long-term training instabilities
- Performance regressions
Recent bugs:

- RA_MLA_Block argument passing: 17/19 steps failed with TypeError when MLP received unexpected kwargs
- Assertion strictness: 6/19 steps failed when first block had no context from previous block
Both would have been caught before GPU training with dry-run.
When implementing optional/conditional features that depend on data flow:
- Add assertions for data that MUST be present (e.g., within a single component)
- Avoid assertions for data that may legitimately be None (e.g., first block in sequence)
- Silent failures waste GPU time - better to fail fast with clear error messages
- Pattern for required data within component:

  if self.cfg.feature_enabled:
      assert required_data is not None, "feature_enabled but no required_data"

- Pattern for optional data from other blocks:

  if self.cfg.feature_enabled and data_from_prev_block is not None:
      # use the data
Examples from RA+MLA:
- ReciprocalMLP asserts `attn_weights`/`attn_latent` are provided by RA_MLA_Block (same component, always required)
- RA_MLA_Attention handles None `mlp_gate_context` gracefully (from previous block, None for first block)
- Use dry-run validation to catch assertion failures before GPU training
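The two assertion patterns can be seen side by side in a small standalone example. `FeatureBlock` is a hypothetical class that uses plain numbers instead of tensors; it is not the RA+MLA code.

```python
class FeatureBlock:
    """Hypothetical block illustrating required versus optional inputs."""

    def __init__(self, feature_enabled: bool):
        self.feature_enabled = feature_enabled

    def forward(self, x, required_data=None, data_from_prev_block=None):
        if self.feature_enabled:
            # Produced inside the same component, so absence is a bug: fail fast.
            assert required_data is not None, "feature_enabled but no required_data"
            x = x + required_data
        if self.feature_enabled and data_from_prev_block is not None:
            # Legitimately None for the first block in a sequence: just skip it.
            x = x + data_from_prev_block
        return x
```

Calling `FeatureBlock(True).forward(1.0)` raises an `AssertionError` immediately, which is the kind of failure a dry run surfaces before any GPU time is spent.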
When implementing bidirectional information flow between transformer blocks:
- Use wrapper classes (e.g., `RA_MLA_Block`) to manage context state across blocks
- Store contexts in instance variable (e.g., `self._ctx = {}`)
- Pass contexts as keyword arguments (enables detection of missing connections)
- Produce contexts for the next block at the end of forward pass
- Never assume contexts exist - always check with assertions when used
When creating wrapper classes for mixed configurations:
- Check wrapped component type at runtime: Use `hasattr()` or `isinstance()` to detect capabilities
- Conditionally pass arguments: Standard components may not accept extended keyword arguments
- Graceful degradation: Support both enhanced and standard components in same wrapper
- Pattern:

  # Good - adapts to component type
  is_enhanced = hasattr(self.component, "enhanced_method")
  if is_enhanced:
      out = self.component(x, extra_arg=value)
  else:
      out = self.component(x)

  # Bad - assumes all components are enhanced
  out = self.component(x, extra_arg=value)  # crashes on standard components
Example: RA_MLA_Block wraps either ReciprocalMLP (accepts attn_weights/attn_latent) or standard MLP (does not). Runtime check prevents TypeError when ablation steps disable reciprocity mechanisms.
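A runnable miniature of the wrapper pattern, using plain Python callables in place of `nn.Module` components. All class names and the `accepts_attn_context` marker attribute are hypothetical stand-ins chosen for illustration.

```python
class StandardMLP:
    """Stand-in for a component that accepts only the input."""

    def __call__(self, x):
        return x * 2


class ContextAwareMLP:
    """Stand-in for an enhanced component that needs an extra keyword argument."""

    accepts_attn_context = True  # capability marker checked by the wrapper

    def __call__(self, x, attn_weights=None):
        assert attn_weights is not None, "enhanced MLP requires attn_weights"
        return x * 2 + attn_weights


class BlockWrapper:
    def __init__(self, mlp):
        self.mlp = mlp
        self._ctx = {}  # contexts produced for the next block

    def forward(self, x, attn_weights):
        # Only pass the extra argument to components that can accept it.
        if hasattr(self.mlp, "accepts_attn_context"):
            out = self.mlp(x, attn_weights=attn_weights)
        else:
            out = self.mlp(x)
        self._ctx = {"mlp_out": out}
        return out
```

The same wrapper serves both configurations, so an ablation step that disables the enhanced component does not need a separate code path.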
When adding new attention/MLP mechanisms:
- Keep features orthogonal: RA-CT (attention-only gating) vs MLP mechanisms (cross-layer flow)
- Use clear naming: `ra_cross_token` for attention features, `mlp_attn_gate` for MLP features
- Enable ablation: Each feature should be independently testable
- Avoid coupling: RA-CT doesn't require MLA/RA, can be tested on baseline GPT-2
For per-head gating mechanisms:
- Initialize to near-identity: `bias ≈ 2.0` for sigmoid gates (pass-through initially)
- Use affine transforms: `sigmoid(stat * scale + bias)` for numerical stability
- Shape: `[n_head]` for per-head parameters, expandable to `[B,H,T]` when needed
- Consider `head_average=True` option for cheaper computation
When implementing gating based on attention statistics:
- Support multiple modes: `topk`, `max`, `entropy`, `rms`
- Provide `detach_stats` option to compute under `no_grad()` for memory savings
- Apply gate at multiple points: `weights` (pre-softmax) or `output` (post-aggregation)
- Use `alpha` mixing parameter for smooth interpolation: `(1-α)·x + α·(x⊙gate)`
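The gate initialization and the alpha mixing formula are easy to sanity-check with scalars. The sketch below uses only the standard library; the function names and the default `scale=1.0` are assumptions, and a real implementation would operate on per-head tensors.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate_value(stat: float, scale: float = 1.0, bias: float = 2.0) -> float:
    """Affine transform of an attention statistic followed by a sigmoid."""
    return sigmoid(stat * scale + bias)


def mix(x: float, gate: float, alpha: float) -> float:
    """Interpolate between the ungated value (alpha=0) and the gated one (alpha=1)."""
    return (1.0 - alpha) * x + alpha * (x * gate)
```

With `bias=2.0` and a zero statistic the gate starts at sigmoid(2.0), roughly 0.88, so most of the signal passes through at initialization. Setting `alpha=0` recovers the ungated path exactly.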
- Enable expandable segments: `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"`
- Disable expensive metrics logging during training (e.g., entropy computation on attention weights)
- Use `@torch.no_grad()` for statistics computation that doesn't need gradients
- Monitor for OOM errors in attention mechanisms - often caused by extra allocations for metrics
- Batch size × gradient accumulation = effective batch size (keep constant when adjusting for memory)
- 24GB VRAM per GPU requires careful batch size tuning
- For GPT-2 124M with RA+MLA: batch_size=8, gradient_accumulation=8 (effective=64)
- Tensor dimensions should be multiples of 64 for optimal tensor core utilization
- Disable metrics logging for attention mechanisms to prevent OOM during entropy computation
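When the per-step batch size has to shrink to fit memory, the accumulation count should grow so the effective batch size stays fixed. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def grad_accum_for(effective_batch: int, batch_size: int) -> int:
    """Gradient accumulation steps needed to keep the effective batch size constant."""
    if batch_size <= 0 or effective_batch % batch_size != 0:
        raise ValueError(
            f"batch_size={batch_size} does not divide effective batch {effective_batch}"
        )
    return effective_batch // batch_size
```

For the GPT-2 124M RA+MLA setting above, `grad_accum_for(64, 8)` gives 8. Halving the batch size to 4 requires 16 accumulation steps.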
When analyzing experiment results or comparing GPU performance across runs, use the W&B query scripts in the scripts/ directory. These require the micromamba environment.
Before running any W&B query scripts:
source ~/bin/wl700-ml  # Activates w7900-ml micromamba environment

This provides wandb, pandas, and other dependencies needed for querying experiment data.
scripts/inspect_wandb_keys.py: Discover available metrics in a run
Usage for inspecting what data is available:
python scripts/inspect_wandb_keys.py \
--entity mcgrof-citizen \
--project gpt2-bitter9-compiled-b200x4 \
--run-name gpt2_adamwprune_bitter9_state_50

scripts/query_wandb_gpu.py: Query GPU metrics from training history
Usage for checking GPU memory and compute utilization:
python scripts/query_wandb_gpu.py \
--entity mcgrof-citizen \
--project gpt2-bitter9-compiled-b200x4 \
--run-name gpt2_adamwprune_bitter9_state_50

scripts/query_wandb_gpu_full.py: Query detailed GPU metrics from system events
Usage for detailed system metrics including power and temperature:
python scripts/query_wandb_gpu_full.py \
--entity mcgrof-citizen \
--project gpt2-bitter9-compiled-b200x4 \
--run-name gpt2_adamwprune_bitter9_state_50

scripts/plot_torch_compile_impact.py: Generate publication-quality visualizations comparing GPU performance across runs
This is a reusable visualization script that queries W&B and generates four graphs showing performance comparisons. Used to prove torch.compile() was the bottleneck.
Usage:
source ~/bin/wl700-ml
python scripts/plot_torch_compile_impact.py

The script is hardcoded to query
mcgrof-citizen/gpt2-bitter8-nocompile-w7900 but can be easily
adapted for other projects by editing the `project` variable in
`main()`.
Generated graphs (300 DPI, publication quality):
- `torch_compile_comparison.png`: Side-by-side memory and compute comparison
- `torch_compile_grouped.png`: All runs in grouped bar chart with color coding
- `torch_compile_before_after.png`: Dramatic before/after horizontal bars with annotations
- `bitter8_vs_baseline.png`: Spotlight showing minimal overhead of state-based pruning
The script demonstrates the pattern for:
- Querying W&B API for multiple runs
- Extracting system.gpu.* metrics from event stream
- Computing averages across runs
- Creating matplotlib visualizations with annotations
- Using color coding (red=bad, green=good) for clarity
When to use this script:
- After GPU profiling reveals performance differences
- To prove bottleneck hypotheses with visual evidence
- To compare optimization variants systematically
- To generate graphs for documentation or papers
Customization tips:
- Edit `project` variable to query different W&B project
- Modify `fetch_wandb_data()` to extract different metrics
- Update graph functions to change visual style
- Add new graph types by creating new functions following existing patterns
To compare GPU performance across multiple runs (baseline vs optimizations), write a custom Python script using the W&B API. See docs/tracker.md for detailed examples.
Pattern for comparing runs:
import wandb

api = wandb.Api()
project = "mcgrof-citizen/gpt2-bitter9-compiled-b200x4"
run_names = ["baseline", "bitter8", "bitter9"]

for name in run_names:
    runs = api.runs(project, filters={"config.run_name": name})
    if runs:
        run = runs[0]
        history = run.history(
            keys=["gpu/memory_util_avg", "gpu/compute_util_avg"],
            samples=1000
        )
        if not history.empty:
            print(f"{name}:")
            print(f"  Memory: {history['gpu/memory_util_avg'].mean():.2f}%")
            print(f"  Compute: {history['gpu/compute_util_avg'].mean():.2f}%")

When analyzing GPU performance issues:
- `gpu/memory_util_avg`: Memory bandwidth utilization (%)
- `gpu/compute_util_avg`: Compute utilization (%)
- `gpu/memory_used_avg_gb`: Average memory per GPU (GB)
Low memory utilization (<20%) indicates memory bandwidth bottleneck. Low compute utilization (<50%) indicates compute bottleneck. Compare optimization runs to baseline to verify improvements.
Before publishing experimental results in documentation, papers, or public communications, perform rigorous verification to ensure reproducibility and fairness.
When publishing statistics or performance comparisons:
- Use W&B API to verify hyperparameters: Query all runs via W&B API to confirm consistent hyperparameters across comparisons. Verify batch size, gradient accumulation, learning rate, warmup steps, and all optimizer-specific settings match exactly.
- Verify git commit exists and is public: Confirm the exact git commit SHA used for training exists in the public repository. Document the commit ID in published results so others can reproduce experiments with identical code.
- Perform apples-to-apples sanity checks: Before claiming performance differences, verify:
  - Equal training time (CONFIG_GPT2_MAX_TIME) across all methods
  - Same effective batch size (batch × grad_acc × num_gpus)
  - Same hardware configuration (GPU type, count, memory)
  - Same torch.compile status (all enabled or all disabled)
  - Same dataset and preprocessing
  - Same evaluation protocol (samples, intervals)
- Check for confounding variables: Verify no unintended differences like:
  - Different torch.compile status (one compiled, one not)
  - Different batch sizes due to GPU-specific configs
  - Different stopping conditions (time vs iterations)
  - Different random seeds causing outlier results
  - Different CUDA/PyTorch/GPU driver versions
Use this pattern to verify hyperparameter consistency:
import wandb

api = wandb.Api()
project = "mcgrof-citizen/your-project"
run_names = ["baseline", "method_a", "method_b"]

configs = {}
for name in run_names:
    runs = api.runs(project, filters={"display_name": name})
    if runs:
        run = runs[0]
        configs[name] = {
            "batch_size": run.config.get("batch_size"),
            "gradient_accumulation": run.config.get("gradient_accumulation"),
            "learning_rate": run.config.get("learning_rate"),
            "max_time": run.config.get("max_time"),
            "compile": run.config.get("compile_model"),
            "commit": run.config.get("git_commit"),
        }

# Verify all configs match on critical hyperparameters
for key in ["batch_size", "gradient_accumulation", "learning_rate"]:
    values = [c[key] for c in configs.values()]
    if len(set(values)) > 1:
        print(f"WARNING: {key} differs across runs: {configs}")

Published results MUST include:
- Git commit SHA for exact code version
- W&B project and run names for verification
- Hardware specification (GPU model, count, memory)
- Training time allocation per method
- Effective batch size calculation
- torch.compile status
- Dataset and preprocessing details
This enables independent verification and reproduction of published claims. Do not publish results without completing verification checklist.
KVSplice is a learned KV cache compression layer that achieves 12x total compression (6x from MLA multiplied by 2x from KVSplice). Before claiming compression ratios or memory savings, verify both training quality and inference memory reduction.
When evaluating KVSplice training results:
- Compare across GPU types: Run ablation on multiple GPUs (W7900, A100, H100) to verify consistency and detect hardware-specific issues
- Check transform parameter learning: Extract scale/shift values from checkpoints to verify the learned monotonic transform is actually training (not stuck at initialization)

  python scripts/extract_kvsplice_params.py \
      --checkpoint path/to/checkpoint.pt

- Monitor KVSplice metrics in W&B: Verify that scale_mean, scale_std, shift_mean, shift_std are logged during training. If missing, check architecture detection in `_compute_kvsplice_param_metrics()`
- Verify compression ratio setting: Confirm CONFIG_MLA_COMPRESSION_RATIO is set correctly in defconfig and matches W&B config. Default is 0.5 (2x compression on top of MLA)
- Quality degradation tolerance: KVSplice should add only 0.5-1.4% quality loss compared to MLA alone. Larger degradation indicates a bug
Before publishing inference memory savings claims:
- Run direct cache measurement: Use `scripts/verify_kvsplice_memory.py` to measure actual cache tensor sizes across sequence lengths

  python scripts/verify_kvsplice_memory.py

- Verify cache tensor shapes: Inspect returned cache objects to confirm dimensions:
  - MLA: `[B, T, d_latent]` where d_latent=256
  - KVSplice: `[B, T, d_compressed]` where d_compressed=128 (ratio=0.5)
- Check compression ratio accuracy: Memory savings should match theoretical predictions within 5%:
  - Expected savings: `(1 - compression_ratio) * 100%`, where compression_ratio is d_compressed / d_latent
  - Example: ratio=0.5 should give 50% cache reduction vs MLA
- Test multiple sequence lengths: Verify compression holds across 256, 512, and 1024 token sequences. Savings should scale linearly
- Calculate production throughput: Estimate how many parallel sequences fit in GPU memory with compressed cache vs standard cache. Include model weights in calculation
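The expected cache sizes follow directly from the tensor shapes, so the theoretical side of the comparison can be computed up front. The sketch below assumes one `[B, T, width]` cache tensor per layer, 12 layers, and 2 bytes per value. Those defaults and both function names are illustrative assumptions, so absolute sizes may not match what `scripts/verify_kvsplice_memory.py` reports. It also assumes the compression ratio means d_compressed / d_latent, so expected savings are `1 - ratio`, and it reads "within 5%" as 5 percentage points.

```python
def cache_bytes(batch: int, seq_len: int, width: int,
                n_layers: int = 12, bytes_per_value: int = 2) -> int:
    """Theoretical size of a [batch, seq_len, width] cache held by every layer."""
    return batch * seq_len * width * n_layers * bytes_per_value


def savings_match(base_bytes: float, compressed_bytes: float,
                  compression_ratio: float, tolerance: float = 0.05) -> bool:
    """Check that measured savings are within `tolerance` of the expected savings."""
    expected = 1.0 - compression_ratio
    actual = 1.0 - compressed_bytes / base_bytes
    return abs(actual - expected) <= tolerance
```

Because the size is linear in `seq_len`, the ratio between the MLA and KVSplice caches should be identical at 256, 512, and 1024 tokens. A measured ratio that drifts with sequence length points at a measurement problem rather than at the compression layer.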
KVSplice uses a learned monotonic transform before low-rank projection. To verify it's learning:
-
Extract parameters from checkpoint:
python scripts/extract_kvsplice_params.py \ --checkpoint test_matrix_results_*/checkpoint.pt -
Check for variance across dimensions: If all scale values are identical and all shift values are zero, parameters are not learning
-
Initial values to expect:
- Scale: softplus(1.0) ≈ 1.3133 (initialization)
- Shift: 0.0 (initialization)
- After training: should show variance across 256 dimensions
-
Pruning candidates: Dimensions with scale < 0.1 after training are low-importance and candidates for pruning
-
LayerNorm impact: If transform parameters don't learn, try adding LayerNorm to latent space to stabilize gradients
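The checks in this list reduce to a few lines once the scale and shift values have been extracted. The following is a minimal sketch under stated assumptions: the function names and the 1e-4 tolerance are hypothetical, and it takes plain lists of floats rather than the output format of scripts/extract_kvsplice_params.py, which is not shown here.

```python
import math
from statistics import pstdev


def softplus(x):
    """softplus(x) = ln(1 + e^x)."""
    return math.log1p(math.exp(x))


INIT_SCALE = softplus(1.0)  # approximately 1.3133, the documented initial scale
INIT_SHIFT = 0.0            # documented initial shift


def transform_is_learning(scales, shifts, atol=1e-4):
    """False when all scales are identical and all shifts are zero.

    atol is an assumed tolerance for floating-point noise.
    """
    scales_vary = pstdev(scales) > atol
    shifts_nonzero = any(abs(s - INIT_SHIFT) > atol for s in shifts)
    return scales_vary or shifts_nonzero


def pruning_candidates(scales, threshold=0.1):
    """Indices of dimensions whose trained scale falls below the threshold."""
    return [i for i, s in enumerate(scales) if s < threshold]


# Parameters still at initialization across 256 dimensions: not learning.
print(transform_is_learning([INIT_SCALE] * 256, [INIT_SHIFT] * 256))
```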
Transform parameters not learning: Current experiments show KVSplice transform parameters remain at initialization values (scale ≈ 1.3133, shift = 0.0) even after 1000+ iterations. This means KVSplice is working purely via low-rank projection (compress/expand layers), not the learned transform. This may be optimal if the compress/expand layers can learn the mapping directly.
Architecture detection for metrics: Early versions failed to log
KVSplice metrics because code only checked for raw_model.transformer
(standard GPT-2) but MLA uses raw_model.blocks. Fixed in commit
that added dual architecture detection.
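Dual architecture detection of this kind can be sketched as follows. This is an illustration, not the code from that commit: find_blocks is a hypothetical helper, and the assumption that GPT-2-style models keep their block list under transformer.h is not stated in the text above.

```python
from types import SimpleNamespace


def find_blocks(raw_model):
    """Return the block list for either supported architecture."""
    # MLA-style models expose blocks directly on the model.
    blocks = getattr(raw_model, "blocks", None)
    if blocks is not None:
        return blocks
    # Standard GPT-2: blocks live under raw_model.transformer.
    transformer = getattr(raw_model, "transformer", None)
    if transformer is not None:
        return transformer.h  # assumed attribute name for the block list
    raise AttributeError("model has neither .blocks nor .transformer")


# Stand-in objects for the two layouts, for illustration only.
mla_model = SimpleNamespace(blocks=["mla_block_0", "mla_block_1"])
gpt2_model = SimpleNamespace(transformer=SimpleNamespace(h=["gpt2_block_0"]))
print(len(find_blocks(mla_model)), len(find_blocks(gpt2_model)))
```

Raising on an unknown layout, rather than silently skipping, makes a missing metric show up as an error instead of an empty W&B panel.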
Memory measurement pitfalls: Don't measure cache memory by running
full forward passes (passing all previous tokens). This defeats the
purpose of caching. Instead, extract cache objects from blocks with
use_cache=True and measure tensor sizes directly.
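Measuring tensor sizes directly can be sketched as below. This is a minimal sketch, assuming the per-block cache objects are tensors or nested tuples/lists of tensors; how they are obtained from a block is not shown because the call signature is not given here. numel() and element_size() are real torch.Tensor methods; FakeTensor is a stand-in so the example runs without torch.

```python
import math


def cache_nbytes(cache):
    """Sum the bytes held by a cache tensor or nested tuple/list of tensors."""
    if cache is None:
        return 0
    if isinstance(cache, (list, tuple)):
        return sum(cache_nbytes(item) for item in cache)
    # torch.Tensor provides numel() and element_size().
    return cache.numel() * cache.element_size()


class FakeTensor:
    """Stand-in with the two torch.Tensor methods used above."""

    def __init__(self, shape, itemsize):
        self.shape = shape
        self.itemsize = itemsize

    def numel(self):
        return math.prod(self.shape)

    def element_size(self):
        return self.itemsize


# One [B, T, d_compressed] tensor per block, 12 blocks assumed.
per_block = [FakeTensor((1, 1024, 128), itemsize=2) for _ in range(12)]
print(cache_nbytes(per_block))
```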
- scripts/verify_kvsplice_memory.py: Measure cache tensor sizes
- scripts/extract_kvsplice_params.py: Extract learned transform parameters
- scripts/compare_kvsplice_gpus.py: Compare results across GPU types
- scripts/plot_kvsplice_inference_memory.py: Generate visualization plots
After verification, update documentation with plots and results:
-
Add inference verification section to docs/kvsplice.md:
- Include cache memory comparison plots
- Show compression breakdown visualization
- Document cache tensor shapes
- Provide memory savings table
-
Update GPU comparison summary in docs/kvsplice/gpu-comparison-summary.md:
- Add inference verification results
- Compare theoretical vs actual compression
- Document production implications
-
Generate publication-quality plots (300 DPI):
python scripts/plot_kvsplice_inference_memory.py
See docs/kvsplice.md for complete inference verification results
with plots showing 50% cache reduction (12 MB → 6 MB at 1024 tokens)
and 83.3% total reduction vs standard GPT-2 (36 MB → 6 MB).
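The two percentages follow directly from the quoted sizes. A short worked check, with reduction_pct as a hypothetical helper name:

```python
def reduction_pct(before_mb, after_mb):
    """Percentage reduction going from before_mb to after_mb."""
    return 100.0 * (before_mb - after_mb) / before_mb


print(f"vs MLA: {reduction_pct(12, 6):.1f}%")             # 50.0%
print(f"vs standard GPT-2: {reduction_pct(36, 6):.1f}%")  # 83.3%
```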
- Keep changes well-documented in commit messages
- Explain technical rationale for optimizations
- Include performance impact where applicable
You are not allowed to use the word "comprehensive". It is overused and does not explain anything. We prefer to be terse and to the point.
The paper "Memory-Traffic Saturation in Autoregressive Transformer
Decode" lives at knlp.io/decode and depends on three modified
serving-stack repos plus the paper LaTeX repo. All four are public
on GitHub:
| Repo | GitHub | Branch | What it contains |
|---|---|---|---|
| vllm-asym | github.com/mcgrof/vllm | asymmetric-kv-plumbing | vLLM v1 with tuple K/V cache, FlashAttn writer patch, asym dtype plumbing |
| flashinfer-asym | github.com/mcgrof/flashinfer | asym-prefill-refactor-stage | FlashInfer with FI-1..FI-5 CUDA template refactor for independent K/V dtypes in prefill+decode |
| lmcache | github.com/mcgrof/LMCache | asymmetric-kv-codec | LMCache with K16/V8 codec, split-tier placement, serde, 74 CPU unit tests |
| paper | github.com/mcgrof/paper-memory-decode | main | LaTeX source, figures, data, generate scripts |
On monster (the primary workstation), the local clones live at:
| Repo | Path | Branch |
|---|---|---|
| vllm-asym | /home/mcgrof/devel/vllm-asym | asymmetric-kv-plumbing |
| flashinfer-asym | /home/mcgrof/devel/flashinfer-asym | asym-prefill-refactor-stage |
| lmcache | /home/mcgrof/devel/lmcache | asymmetric-kv-codec |
| paper | /home/mcgrof/devel/paper-memory-decode | main |
On prune (the storage server), mirrors live under /data/. Push
with dated branch refs to avoid disturbing prune's checked-out branch:
git push prune branch:refs/heads/branch-monster-YYYY-MM-DD
Key results archive: prune:/data/knlp-key-results/flashinfer-asym-e2e-20260427/
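The dated refspec used when pushing to prune can be built mechanically. This is a minimal sketch with hypothetical helper names; it only constructs the git push argument list and does not run git.

```python
import datetime


def dated_refspec(branch, host="monster", day=None):
    """Build branch:refs/heads/branch-<host>-YYYY-MM-DD."""
    day = day or datetime.date.today()
    return f"{branch}:refs/heads/{branch}-{host}-{day:%Y-%m-%d}"


def push_command(remote, branch, day=None):
    """Argument list suitable for subprocess.run(); not executed here."""
    return ["git", "push", remote, dated_refspec(branch, day=day)]


print(" ".join(push_command("prune", "asymmetric-kv-codec")))
```

Because the destination ref carries the date, the push never updates the branch that prune currently has checked out.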
The vLLM asym branch requires torch >= 2.10, cmake >= 4.0, and the FlashInfer cutlass submodule initialized. The tested recipe (H100 SECURE pod, RunPod):
# 1. FlashInfer
cd /root && git clone --branch asym-prefill-refactor-stage \
https://github.com/mcgrof/flashinfer.git flashinfer-src
cd flashinfer-src && git submodule update --init --recursive
pip install --no-build-isolation -e .
# 2. vLLM (pulls torch and rebuilds _C; ~60 min CUDA compile)
cd /root && git clone --branch asymmetric-kv-plumbing \
https://github.com/mcgrof/vllm.git vllm-src
cd vllm-src && MAX_JOBS=32 NVCC_THREADS=2 \
pip install --no-build-isolation -e .
# 3. Reinstall flashinfer editable (vllm pip overwrites with PyPI 0.6.6)
cd /root/flashinfer-src && pip install --no-build-isolation -e .
# 4. Verify
FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "import vllm, flashinfer"
The asym K16/V8 production recipe in Python:
from vllm import LLM
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
dtype="bfloat16",
kv_cache_dtype=("auto", "fp8_e4m3"),
attention_config={"backend": "FLASHINFER"},
)
The VLLM_ATTENTION_BACKEND env var is not honored in this vLLM
build; pass attention_config={"backend": "FLASHINFER"} to the LLM
constructor instead. Auto-selection picks FlashAttention, which lacks the
asym tuple writer.
cd /home/mcgrof/devel/paper-memory-decode && make
This generates figures via Python scripts, then runs pdflatex (3 passes for cross-refs). Always verify the rendered PDF with:
pdftotext paper.pdf - | grep -nE '<pattern>'
Source-level grep misses issues in figure PDFs and broken LaTeX label
resolution (e.g., Table V-C0c from a \label inside \begin{center}
instead of \begin{table}).
The knlp defconfig system is being extended with paper reproduction profiles. The planned targets:
make defconfig-decode # Core asym claims (1×H100, 4-8h warm)
make defconfig-decode-sat # Saturation model (1×H100, 18-36h)
make defconfig-decode-full # Everything (multi-GPU, days)
After selecting a defconfig, make runs:
decode-doctor → decode-fetch → decode-build →
decode-run → decode-report → decode-upload (optional)
The orchestrator lives under tools/reproduce/paper_memory_decode/.
Each stage writes results to results/decode/<run_id>/stages/<stage>/
with DONE, metrics.jsonl, stdout.log, stderr.log. Rerunning
make resumes from the first missing DONE.
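The resume rule can be sketched as a scan for the first stage without a DONE marker. This is an illustration, not the orchestrator's code: the function names are hypothetical, and using the make target names as stage directory names is an assumption.

```python
import tempfile
from pathlib import Path

# Assumed stage directory names, taken from the pipeline order above.
STAGES = [
    "decode-doctor",
    "decode-fetch",
    "decode-build",
    "decode-run",
    "decode-report",
]


def first_incomplete_stage(run_dir, stages=STAGES):
    """Return the first stage lacking stages/<stage>/DONE, or None."""
    for stage in stages:
        if not (Path(run_dir) / "stages" / stage / "DONE").exists():
            return stage
    return None


def mark_done(run_dir, stage):
    """Create the DONE marker for a finished stage."""
    stage_dir = Path(run_dir) / "stages" / stage
    stage_dir.mkdir(parents=True, exist_ok=True)
    (stage_dir / "DONE").touch()


with tempfile.TemporaryDirectory() as tmp:
    mark_done(tmp, "decode-doctor")
    mark_done(tmp, "decode-fetch")
    print(first_incomplete_stage(tmp))  # decode-build
```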
Telemetry: local JSONL is mandatory and canonical. W&B and trackerio
are optional mirrors controlled by .config flags and env vars
(WANDB_API_KEY, HF_TOKEN).
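One way to keep local JSONL canonical is to write it first and treat every mirror as best-effort. This is a minimal sketch under stated assumptions: log_metrics and select_mirrors are hypothetical names, mirrors are plain callables rather than real W&B or trackerio clients, and the .config flags are not modeled.

```python
import json
import os
import tempfile


def log_metrics(jsonl_path, record, mirrors=()):
    """Append record to the local JSONL, then forward to optional mirrors."""
    # The local write happens first and is never skipped.
    with open(jsonl_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    for mirror in mirrors:
        try:
            mirror(record)
        except Exception as exc:  # a mirror failure must not lose local data
            print(f"mirror failed, local JSONL kept: {exc}")


def select_mirrors(wandb_mirror, env=os.environ):
    """Enable the W&B mirror only when WANDB_API_KEY is set."""
    return [wandb_mirror] if env.get("WANDB_API_KEY") else []


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metrics.jsonl")
    record = {"stage": "decode-run", "step": 1}
    log_metrics(path, record, mirrors=select_mirrors(print))
```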
The defconfigs pin exact git refs for vllm, flashinfer, lmcache, and
paper-memory-decode, and clone/fetch them into ../ (the parent
directory).
I want you to remember most of our conversations about this project.