kai-scheduler · enoodle · Jun 2, 2026 · Jun 2, 2026 · Jun 5, 2026 · Jun 5, 2026
diff --git a/.agents/README.md b/.agents/README.md
@@ -0,0 +1,36 @@
+# Agent Skills
+
+This repository keeps repo-local agent skills under `.agents/skills/`.
+
+## Consuming Repo Skills
+
+Each harness below reads skills from `.agents/skills/`, either directly or through a link or mirror.
+
+### Codex
+
+Codex can consume skills from `.agents/skills/` when working inside this repository.
+
+Example prompts:
+
+```text
+Use the snapshots skill to capture a snapshot from the current cluster.
+Take a scheduler snapshot and compare it on v0.13.0 and v0.14.0.
+```
+
+### Claude Code
+
+Claude Code can use the same repo-owned skill content. If your setup does not scan `.agents/skills/` directly, expose the skill through Claude's configured skill directory.
+
+Example:
+
+```bash
+ln -s /path/to/KAI-Scheduler/.agents/skills/snapshots ~/.claude/skills/snapshots
+```
+
+### Other Harnesses
+
+Other agent harnesses should treat `.agents/skills/` as the source of truth and either scan it directly or mirror the needed skills into their own configured skill directory.
+
+## Current Skills
+
+- [`snapshots`](skills/snapshots/SKILL.md): capture KAI Scheduler snapshots, inspect archives, replay them with `snapshot-tool`, and compare behavior across refs.
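
Note on the "Other Harnesses" guidance in `.agents/README.md`: mirroring can be done with one symlink per skill, which generalizes the single `ln -s` shown for Claude Code. The following is a minimal sketch only. The helper name `link_skills` and the destination path in the usage comment are assumptions for illustration and are not part of this change.

```bash
# Hypothetical helper, not part of this PR: symlink every skill directory
# under SRC into DEST so a harness that cannot scan .agents/skills/ still
# sees the repo-owned content.
link_skills() {
  local src
  local dest
  local skill
  src="$1"
  dest="$2"
  mkdir -p "$dest"
  for skill in "$src"/*/; do
    # If SRC has no subdirectories the glob stays literal; skip it.
    [ -d "$skill" ] || continue
    skill="${skill%/}"
    # -n replaces an existing symlink instead of descending into it.
    ln -sfn "$(cd "$skill" && pwd)" "$dest/$(basename "$skill")"
  done
}

# Example (destination path is an assumption):
#   link_skills .agents/skills "$HOME/.claude/skills"
```

Links point at absolute paths, so they keep working from any directory. The sketch does not handle a real directory that already exists at the destination with the same name as a skill.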
diff --git a/.agents/skills/snapshots/SKILL.md b/.agents/skills/snapshots/SKILL.md
@@ -0,0 +1,55 @@
+---
+name: snapshots
+description: Use when investigating KAI Scheduler behavior with captured cluster state, especially to replay scheduler decisions on specific refs or compare behavior across versions.
+license: MIT
+compatibility: Requires bash and git; kubectl and curl for capture; make/docker or a prebuilt snapshot-tool for replay.
+metadata:
+  author: KAI Scheduler maintainers
+  version: "1.0"
+---
+
+# Snapshots
+
+Use this skill when investigating KAI Scheduler behavior with captured cluster state, especially for reproducing scheduling bugs, comparing behavior across KAI versions, or gathering evidence for issues like `kai-scheduler/KAI-Scheduler#1517`.
+
+## Facts
+
+- `docs/plugins/snapshot.md` is the source of truth for capture.
+- The snapshot endpoint is `/get-snapshot` on plugin port `8081`, not the scheduler `--listen-address` port. In the observed clusters here, remote `8080` returned `404` while remote `8081` worked.
+- Snapshot files are ZIP archives containing `snapshot.json`, even when named `.gzip`.
+- `cmd/snapshot-tool/main.go` rebuilds fake clients from `snapshot.json` and replays the configured scheduler actions.
+- Replay is a simulation of scheduler behavior, not a full cluster reproduction.
+
+## Commands
+
+Run scripts from the repository root:
+
+```bash
+.agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/issue-123.gzip
+.agents/skills/snapshots/scripts/inspect-snapshot.sh snapshots/issue-123.gzip
+.agents/skills/snapshots/scripts/run-snapshot.sh --snapshot snapshots/issue-123.gzip --verbosity 8
+.agents/skills/snapshots/scripts/run-snapshot.sh --ref v0.14.2 --snapshot snapshots/issue-123.gzip
+.agents/skills/snapshots/scripts/compare-snapshot-refs.sh --snapshot snapshots/issue-123.gzip --refs main,v0.14.2
+```
+
+- `capture-snapshot.sh`: port-forward the scheduler and download `/get-snapshot`. Default target is `deployment/kai-scheduler-default` in namespace `kai-scheduler` on local/remote port `8081`. The script inherits `KUBECONFIG`, for example:
+
+```bash
+KUBECONFIG=$HOME/.kube/engine-scale-test \
+  .agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/example.gzip
+```
+
+- `inspect-snapshot.sh`: validate that the archive contains `snapshot.json` and print top-level structure. Run this before replaying user-provided artifacts.
+- `run-snapshot.sh`: build `snapshot-tool` with `make build-go SERVICE_NAME=snapshot-tool` and replay on the current checkout, or use `--ref` to switch to one Git ref, replay, and restore the original branch or commit. For large snapshots, start with `--verbosity 2`. For reruns, prefer `--no-build --tool bin/snapshot-tool-amd64` to reuse the existing binary. If a ref-based run is killed before the shell trap can run (for example with `SIGKILL`), the repository can be left on a detached HEAD; check with `git status --short --branch` and restore with `git switch <branch>`.
+- `compare-snapshot-refs.sh`: run the same snapshot against several Git refs and save one log per ref plus `summary.tsv`.
+
+## Workflow
+
+1. Capture or receive the snapshot. Avoid committing snapshot artifacts unless the user explicitly asks.
+2. Inspect the archive and confirm it contains `snapshot.json`.
+3. If capture fails, verify the scheduler pod is running, verify the scheduler ConfigMap includes `- name: snapshot`, and verify the scheduler logs contain `Snapshot plugin registering get-snapshot`.
+4. Replay on the reported KAI version first, aligned to the exact tag or commit.
+5. Use `--verbosity 2` first. Compare timing from action timestamps inside the logs, not whole-command wall clock, because builds and verbosity can dominate.
+6. Replay on candidate fixed or regressed refs only after the reported version is understood.
+7. If a version appears stuck, do not wait indefinitely: stop the run and keep the partial log as evidence. In the runs here, `v0.14.0` completed `reclaim` materially faster than `v0.13.0`, while `v0.14.4` appeared to stall in `reclaim` past an interactive timeout.
+8. Report refs, commands, log paths, action timings, errors, and whether the issue reproduced.
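
Note on the "Facts" entry stating that snapshot files are ZIP archives even when named `.gzip`: a quick signature check can confirm this before any tool is pointed at a user-provided artifact. The sketch below is an assumption-laden illustration. The helper name `looks_like_zip` is hypothetical, and it is a separate pre-check rather than a description of what `inspect-snapshot.sh` does. It only recognizes the local-file-header signature of a non-empty ZIP archive, which is what an archive containing `snapshot.json` should start with.

```bash
# Hypothetical pre-check, separate from inspect-snapshot.sh: succeed only if
# FILE begins with the ZIP local-file-header signature "PK\003\004"
# (hex 50 4b 03 04), regardless of its extension.
looks_like_zip() {
  local magic
  magic="$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')"
  [ "$magic" = "504b0304" ]
}

# Example (path taken from the Commands section):
#   looks_like_zip snapshots/issue-123.gzip || echo "not a ZIP archive" >&2
```

A real gzip stream starts with `1f 8b` instead, so the check also catches files that were recompressed under the same name.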
diff --git a/.agents/skills/snapshots/scripts/capture-snapshot.sh b/.agents/skills/snapshots/scripts/capture-snapshot.sh
@@ -0,0 +1,159 @@
+#!/usr/bin/env bash
+# Copyright 2026 NVIDIA CORPORATION
+# SPDX-License-Identifier: Apache-2.0
+
+set -euo pipefail
+
+namespace="kai-scheduler"
+deployment="kai-scheduler-default"
+pod=""
+local_port="8081"
+remote_port="8081"
+output=""
+context=""
+kubectl_bin="kubectl"
+curl_bin="curl"
+
+usage() {
+  cat <<'USAGE'
+Capture a KAI Scheduler snapshot through the snapshot plugin endpoint.
+
+Usage:
+  capture-snapshot.sh [options]
+
+Options:
+  --output PATH          Output snapshot archive path. Defaults to snapshot-<timestamp>.gzip.
+  --namespace NAME       Kubernetes namespace. Default: kai-scheduler.
+  --deployment NAME      Scheduler deployment name. Default: kai-scheduler-default.
+  --pod NAME             Scheduler pod name. Overrides --deployment.
+  --local-port PORT      Local port for kubectl port-forward. Default: 8081.
+  --remote-port PORT     Remote snapshot plugin port. Default: 8081.
+  --context NAME         Kubernetes context.
+  --kubectl PATH         kubectl binary. Default: kubectl.
+  --curl PATH            curl binary. Default: curl.
+  -h, --help             Show this help.
+USAGE
+}
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --output)
+      output="$2"
+      shift 2
+      ;;
+    --namespace)
+      namespace="$2"
+      shift 2
+      ;;
+    --deployment)
+      deployment="$2"
+      shift 2
+      ;;
+    --pod)
+      pod="$2"
+      shift 2
+      ;;
+    --local-port)
+      local_port="$2"
+      shift 2
+      ;;
+    --remote-port)
+      remote_port="$2"
+      shift 2
+      ;;
+    --context)
+      context="$2"
+      shift 2
+      ;;
+    --kubectl)
+      kubectl_bin="$2"
+      shift 2
+      ;;
+    --curl)
+      curl_bin="$2"
+      shift 2
+      ;;
+    -h|--help)
+      usage
+      exit 0
+      ;;
+    *)
+      echo "Unknown argument: $1" >&2
+      usage >&2
+      exit 2
+      ;;
+  esac
+done
+
+if [[ -z "$output" ]]; then
+  output="snapshot-$(date -u +%Y%m%dT%H%M%SZ).gzip"
+fi
+
+command -v "$kubectl_bin" >/dev/null 2>&1 || {
+  echo "kubectl binary not found: $kubectl_bin" >&2
+  exit 1
+}
+command -v "$curl_bin" >/dev/null 2>&1 || {
+  echo "curl binary not found: $curl_bin" >&2
+  exit 1
+}
+
+mkdir -p "$(dirname "$output")"
+
+kubectl_args=()
+if [[ -n "$context" ]]; then
+  kubectl_args+=(--context "$context")
+fi
+kubectl_args+=(-n "$namespace")
+
+target="deployment/$deployment"
+if [[ -n "$pod" ]]; then
+  target="pod/$pod"
+fi
+
+tmpdir="$(mktemp -d)"
+port_forward_log="$tmpdir/port-forward.log"
+port_forward_pid=""
+
+cleanup() {
+  if [[ -n "$port_forward_pid" ]] && kill -0 "$port_forward_pid" >/dev/null 2>&1; then
+    kill "$port_forward_pid" >/dev/null 2>&1 || true
+    # Kill the port-forward if wait does not return after SIGTERM.
+    (
+      sleep 2
+      kill -0 "$port_forward_pid" >/dev/null 2>&1 && kill -9 "$port_forward_pid" >/dev/null 2>&1 || true
+    ) &
+    wait_timeout_pid="$!"
+
+    wait "$port_forward_pid" >/dev/null 2>&1 || true
+    kill "$wait_timeout_pid" >/dev/null 2>&1 || true
+    wait "$wait_timeout_pid" >/dev/null 2>&1 || true
+  fi
+  rm -rf "$tmpdir"
+}
+trap cleanup EXIT
+
+"$kubectl_bin" "${kubectl_args[@]}" port-forward "$target" "${local_port}:${remote_port}" >"$port_forward_log" 2>&1 &
+port_forward_pid="$!"
+
+ready="false"
+for _ in {1..40}; do
+  if ! kill -0 "$port_forward_pid" >/dev/null 2>&1; then
+    echo "kubectl port-forward exited before it became ready:" >&2
+    cat "$port_forward_log" >&2
+    exit 1
+  fi
+  if "$curl_bin" -fsS "http://127.0.0.1:${local_port}/get-snapshot" -o "$output"; then
+    ready="true"
+    break
+  fi
+  sleep 0.25
+done
+
+if [[ "$ready" != "true" ]]; then
+  echo "Timed out waiting for snapshot endpoint. Port-forward log:" >&2
+  cat "$port_forward_log" >&2
+  exit 1
+fi
+
+echo "Snapshot written to $output"
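
Note on option parsing in `capture-snapshot.sh`: each value-taking option reads `"$2"` directly. With `set -u` active, passing such an option as the last argument should abort with bash's generic "unbound variable" error rather than the usage text. One possible guard is sketched below. The helper name `require_value` is hypothetical and not part of this change.

```bash
# Hypothetical helper, not in capture-snapshot.sh: fail with a clear message
# when an option that needs a value is the last argument.
#   $1 = option name
#   $2 = count of remaining arguments, including the option itself
require_value() {
  if [ "$2" -lt 2 ]; then
    echo "Missing value for $1" >&2
    return 2
  fi
}

# Intended use inside the existing case statement (sketch):
#   --output)
#     require_value "$1" "$#" || { usage >&2; exit 2; }
#     output="$2"
#     shift 2
#     ;;
```

Returning `2` matches the exit status the script already uses for unknown arguments.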
diff --git a/.agents/skills/snapshots/scripts/compare-snapshot-refs.sh b/.agents/skills/snapshots/scripts/compare-snapshot-refs.sh
@@ -0,0 +1,115 @@
+#!/usr/bin/env bash
+# Copyright 2026 NVIDIA CORPORATION
+# SPDX-License-Identifier: Apache-2.0
+
+set -euo pipefail
+
+snapshot_file=""
+refs_csv=""
+verbosity="4"
+output_dir=""
+allow_dirty="false"
+arch="amd64"
+
+usage() {
+  cat <<'USAGE'
+Run the same KAI Scheduler snapshot against multiple git refs.
+
+Usage:
+  compare-snapshot-refs.sh --snapshot PATH --refs REF1,REF2[,REF3...] [options]
+
+Options:
+  --snapshot PATH        Snapshot archive to replay. Required.
+  --refs LIST            Comma-separated refs to test. Required.
+  --verbosity LEVEL      snapshot-tool verbosity. Default: 4.
+  --output-dir PATH      Output directory. Defaults to snapshot-runs/<timestamp>.
+  --arch ARCH            Built binary arch suffix. Default: amd64.
+  --allow-dirty          Permit switching refs with dirty tracked files.
+  -h, --help             Show this help.
+USAGE
+}
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --snapshot)
+      snapshot_file="$2"
+      shift 2
+      ;;
+    --refs)
+      refs_csv="$2"
+      shift 2
+      ;;
+    --verbosity)
+      verbosity="$2"
+      shift 2
+      ;;
+    --output-dir)
+      output_dir="$2"
+      shift 2
+      ;;
+    --arch)
+      arch="$2"
+      shift 2
+      ;;
+    --allow-dirty)
+      allow_dirty="true"
+      shift
+      ;;
+    -h|--help)
+      usage
+      exit 0
+      ;;
+    *)
+      echo "Unknown argument: $1" >&2
+      usage >&2
+      exit 2
+      ;;
+  esac
+done
+
+if [[ -z "$snapshot_file" || -z "$refs_csv" ]]; then
+  echo "--snapshot and --refs are required" >&2
+  usage >&2
+  exit 2
+fi
+
+# Resolve the snapshot before cd so relative paths use the caller's directory.
+snapshot_abs="$(cd "$(dirname "$snapshot_file")" && pwd)/$(basename "$snapshot_file")"
+if [[ ! -f "$snapshot_abs" ]]; then
+  echo "Snapshot not found: $snapshot_file" >&2
+  exit 1
+fi
+repo_root="$(git rev-parse --show-toplevel)"
+cd "$repo_root"
+
+if [[ -z "$output_dir" ]]; then
+  output_dir="snapshot-runs/$(date -u +%Y%m%dT%H%M%SZ)"
+fi
+mkdir -p "$output_dir"
+
+summary="$output_dir/summary.tsv"
+printf 'ref\tstatus\tlog\n' >"$summary"
+
+IFS=',' read -r -a refs <<<"$refs_csv"
+for ref in "${refs[@]}"; do
+  if [[ -z "$ref" ]]; then
+    continue
+  fi
+
+  safe_ref="$(printf '%s' "$ref" | tr '/: ' '___')"
+  log_file="$output_dir/${safe_ref}.log"
+  status="pass"
+
+  args=(--ref "$ref" --snapshot "$snapshot_abs" --verbosity "$verbosity" --arch "$arch" --log-file "$log_file")
+  if [[ "$allow_dirty" == "true" ]]; then
+    args+=(--allow-dirty)
+  fi
+
+  if ! .agents/skills/snapshots/scripts/run-snapshot.sh "${args[@]}"; then
+    status="fail"
+  fi
+
+  printf '%s\t%s\t%s\n' "$ref" "$status" "$log_file" >>"$summary"
+done
+
+echo "Comparison summary written to $summary"
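
Note on `summary.tsv`: the script writes a header row `ref`, `status`, `log` followed by one tab-separated row per ref, and `status` is `fail` for any non-zero exit from `run-snapshot.sh` (build, checkout, or replay). A small filter makes the failing refs easy to pick out. This is a sketch, and the helper name `failed_refs` is hypothetical and not part of this change.

```bash
# Hypothetical helper, not part of this PR: print the refs whose status
# column is "fail" in a summary.tsv with the header "ref<TAB>status<TAB>log".
failed_refs() {
  awk -F '\t' 'NR > 1 && $2 == "fail" { print $1 }' "$1"
}

# Example (the timestamped directory name is an assumption):
#   failed_refs snapshot-runs/20260605T120000Z/summary.tsv
```

Because a failure can come from the build as well as the replay, the per-ref log listed in the third column is still the place to confirm the cause.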