36 changes: 36 additions & 0 deletions .agents/README.md
@@ -0,0 +1,36 @@
# Agent Skills

This repository keeps repo-local agent skills under `.agents/skills/`.

## Consuming Repo Skills

Treat `.agents/skills/` as the canonical location for shared agent skills; harness-specific skill directories should point back to it.

### Codex

Codex can consume skills from `.agents/skills/` when working inside this repository.
Collaborator

And how will Codex know where to look for this skill? You do have an `ln` command for Claude.

Collaborator Author

Codex knows to look in `.agents` (https://developers.openai.com/codex/skills#where-to-save-skills); it is a common location for such skills (see NemoClaw, for example).


Example prompts:

```text
Use the snapshots skill to capture a snapshot from the current cluster.
Take a scheduler snapshot and compare it on v0.13.0 and v0.14.0.
```

### Claude Code

Claude Code can use the same repo-owned skill content. If the current setup does not scan `.agents/skills/` directly, expose the skill through Claude's configured skill directory, for example with a symlink.

Example:

```bash
ln -s /path/to/KAI-Scheduler/.agents/skills/snapshots ~/.claude/skills/snapshots
```
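When more than one skill needs to be exposed, the single `ln -s` above can be generalized. The following is a minimal sketch, not part of this repository: `link_skills` is a hypothetical helper name, and the paths in the usage comment are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: symlink every skill directory under a source tree into a target
# skill directory. link_skills is a hypothetical helper, not a repo script.
link_skills() {
  local src="$1" dest="$2" skill name
  mkdir -p "$dest"
  for skill in "$src"/*/; do
    [ -d "$skill" ] || continue   # glob matched nothing: skip the literal pattern
    skill="${skill%/}"
    name="$(basename "$skill")"
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing link that points at a directory
    ln -sfn "$skill" "$dest/$name"
  done
}

# Usage (placeholder paths; pass an absolute source so the links resolve):
# link_skills /path/to/KAI-Scheduler/.agents/skills "$HOME/.claude/skills"
```

Re-running the helper replaces existing links instead of nesting new ones, so it can be repeated after new skills are added.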

### Other Harnesses

Other agent harnesses should treat `.agents/skills/` as the source of truth and either scan it directly or mirror the needed skills into their own configured skill directory.

## Current Skills

- [`snapshots`](skills/snapshots/SKILL.md): capture KAI Scheduler snapshots, inspect archives, replay them with `snapshot-tool`, and compare behavior across refs.
55 changes: 55 additions & 0 deletions .agents/skills/snapshots/SKILL.md
@@ -0,0 +1,55 @@
---
name: snapshots
description: Use when investigating KAI Scheduler behavior with captured cluster state, especially to replay scheduler decisions on specific refs or compare behavior across versions.
license: MIT
compatibility: Requires bash, git, kubectl for capture, curl for capture, make/docker or a prebuilt snapshot-tool for replay.
metadata:
  author: KAI Scheduler maintainers
  version: "1.0"
---

# Snapshots

Use this skill when investigating KAI Scheduler behavior with captured cluster state, especially for reproducing scheduling bugs, comparing behavior across KAI versions, or gathering evidence for issues like `kai-scheduler/KAI-Scheduler#1517`.

## Facts

- `docs/plugins/snapshot.md` is the source of truth for capture.
- The snapshot endpoint is `/get-snapshot` on plugin port `8081`, not the scheduler `--listen-address` port. In the observed clusters here, remote `8080` returned `404` while remote `8081` worked.
- Snapshot files are ZIP archives containing `snapshot.json`, even when named `.gzip`.
- `cmd/snapshot-tool/main.go` rebuilds fake clients from `snapshot.json` and replays the configured scheduler actions.
- Replay is a simulation of scheduler behavior, not a full cluster reproduction.

## Commands

Run scripts from the repository root:

```bash
.agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/issue-123.gzip
.agents/skills/snapshots/scripts/inspect-snapshot.sh snapshots/issue-123.gzip
.agents/skills/snapshots/scripts/run-snapshot.sh --snapshot snapshots/issue-123.gzip --verbosity 8
.agents/skills/snapshots/scripts/run-snapshot.sh --ref v0.14.2 --snapshot snapshots/issue-123.gzip
.agents/skills/snapshots/scripts/compare-snapshot-refs.sh --snapshot snapshots/issue-123.gzip --refs main,v0.14.2
```

- `capture-snapshot.sh`: port-forward the scheduler and download `/get-snapshot`. Default target is `deployment/kai-scheduler-default` in namespace `kai-scheduler` on local/remote port `8081`. The script inherits `KUBECONFIG`, for example:

```bash
KUBECONFIG=$HOME/.kube/engine-scale-test \
.agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/example.gzip
```

- `inspect-snapshot.sh`: validate that the archive contains `snapshot.json` and print top-level structure. Run this before replaying user-provided artifacts.
- `run-snapshot.sh`: build `snapshot-tool` with `make build-go SERVICE_NAME=snapshot-tool` and replay on the current checkout, or use `--ref` to switch to one Git ref, replay, and restore the original branch or commit. For large snapshots, start with `--verbosity 2`. For reruns, prefer `--no-build --tool bin/snapshot-tool-amd64`. If a ref-based run is interrupted hard enough that the shell trap does not execute, the repo can stay detached; check with `git status --short --branch` and restore with `git switch <branch>`.
- `compare-snapshot-refs.sh`: run the same snapshot against several git refs and save one log per ref plus `summary.tsv`.
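The `summary.tsv` written by `compare-snapshot-refs.sh` has a header row followed by tab-separated `ref`, `status`, and `log` columns, so failed refs can be pulled out with a one-line filter. A minimal sketch, where `failed_refs` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: print the refs whose status column is not "pass".
# failed_refs is a hypothetical helper; the column order (ref, status, log)
# follows the header that compare-snapshot-refs.sh writes.
failed_refs() {
  local summary="$1"
  # NR > 1 skips the header row; $2 is the status column.
  awk -F '\t' 'NR > 1 && $2 != "pass" { print $1 }' "$summary"
}

# Usage (placeholder path):
# failed_refs snapshot-runs/<timestamp>/summary.tsv
```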

## Workflow

1. Capture or receive the snapshot. Avoid committing snapshot artifacts unless the user explicitly asks.
2. Inspect the archive and confirm it contains `snapshot.json`.
3. If capture fails, verify the scheduler pod is running, verify the scheduler ConfigMap includes `- name: snapshot`, and verify the scheduler logs contain `Snapshot plugin registering get-snapshot`.
4. Replay on the reported KAI version first, aligned to the exact tag or commit.
5. Use `--verbosity 2` first. Compare timing from action timestamps inside the logs, not whole-command wall clock, because builds and verbosity can dominate.
6. Replay on candidate fixed or regressed refs only after the reported version is understood.
7. If a version appears stuck, stop waiting indefinitely and keep the partial log as evidence. In the runs here, `v0.14.0` completed `reclaim` materially faster than `v0.13.0`, while `v0.14.4` appeared to stall in `reclaim` past an interactive timeout.
8. Report refs, commands, log paths, action timings, errors, and whether the issue reproduced.
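The two text checks in step 3 can be sketched as small filters over files that already hold the ConfigMap contents and the scheduler logs. This is an illustration only: the function names are hypothetical, and how the ConfigMap and logs are fetched (for example with `kubectl get` and `kubectl logs`) is left to the caller because the resource names are not stated here.

```bash
#!/usr/bin/env bash
# Sketch of the step-3 text checks, run against files saved by the caller.
# Both function names are hypothetical and not part of the skill's scripts.

# Succeeds if the config text contains a YAML list entry "- name: snapshot".
config_enables_snapshot() {
  grep -Eq '^[[:space:]]*-[[:space:]]+name:[[:space:]]*snapshot[[:space:]]*$' "$1"
}

# Succeeds if the log text contains the plugin registration message.
logs_show_snapshot_plugin() {
  grep -Fq 'Snapshot plugin registering get-snapshot' "$1"
}

# Usage (placeholder file names):
# config_enables_snapshot scheduler-config.yaml || echo "snapshot plugin not configured"
# logs_show_snapshot_plugin scheduler.log || echo "registration message not found"
```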
159 changes: 159 additions & 0 deletions .agents/skills/snapshots/scripts/capture-snapshot.sh
@@ -0,0 +1,159 @@
#!/usr/bin/env bash
# Copyright 2026 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

set -euo pipefail

namespace="kai-scheduler"
deployment="kai-scheduler-default"
pod=""
local_port="8081"
remote_port="8081"
output=""
context=""
kubectl_bin="kubectl"
curl_bin="curl"

usage() {
  cat <<'USAGE'
Capture a KAI Scheduler snapshot through the snapshot plugin endpoint.

Usage:
  capture-snapshot.sh [options]

Options:
  --output PATH        Output snapshot archive path. Defaults to snapshot-<timestamp>.gzip.
  --namespace NAME     Kubernetes namespace. Default: kai-scheduler.
  --deployment NAME    Scheduler deployment name. Default: kai-scheduler-default.
  --pod NAME           Scheduler pod name. Overrides --deployment.
  --local-port PORT    Local port for kubectl port-forward. Default: 8081.
  --remote-port PORT   Remote scheduler HTTP port. Default: 8081.
  --context NAME       Kubernetes context.
  --kubectl PATH       kubectl binary. Default: kubectl.
  --curl PATH          curl binary. Default: curl.
  -h, --help           Show this help.
USAGE
}

while [[ $# -gt 0 ]]; do
  case "$1" in
    --output)
      output="$2"
      shift 2
      ;;
    --namespace)
      namespace="$2"
      shift 2
      ;;
    --deployment)
      deployment="$2"
      shift 2
      ;;
    --pod)
      pod="$2"
      shift 2
      ;;
    --local-port)
      local_port="$2"
      shift 2
      ;;
    --remote-port)
      remote_port="$2"
      shift 2
      ;;
    --context)
      context="$2"
      shift 2
      ;;
    --kubectl)
      kubectl_bin="$2"
      shift 2
      ;;
    --curl)
      curl_bin="$2"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "Unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "$output" ]]; then
  output="snapshot-$(date -u +%Y%m%dT%H%M%SZ).gzip"
fi

command -v "$kubectl_bin" >/dev/null 2>&1 || {
  echo "kubectl binary not found: $kubectl_bin" >&2
  exit 1
}
command -v "$curl_bin" >/dev/null 2>&1 || {
  echo "curl binary not found: $curl_bin" >&2
  exit 1
}

mkdir -p "$(dirname "$output")"

kubectl_args=()
if [[ -n "$context" ]]; then
  kubectl_args+=(--context "$context")
fi
kubectl_args+=(-n "$namespace")

target="deployment/$deployment"
if [[ -n "$pod" ]]; then
  target="pod/$pod"
fi

tmpdir="$(mktemp -d)"
port_forward_log="$tmpdir/port-forward.log"
port_forward_pid=""

cleanup() {
  if [[ -n "$port_forward_pid" ]] && kill -0 "$port_forward_pid" >/dev/null 2>&1; then
    kill "$port_forward_pid" >/dev/null 2>&1 || true
    # Kill the port-forward if wait does not return after SIGTERM.
    (
      sleep 2
      kill -0 "$port_forward_pid" >/dev/null 2>&1 && kill -9 "$port_forward_pid" >/dev/null 2>&1 || true
    ) &
    wait_timeout_pid="$!"

    wait "$port_forward_pid" >/dev/null 2>&1 || true
Collaborator

wait with some timeout?

Collaborator Author

Because of how we set up the port-forward, it is not expected to resist the kill command. I can add a timeout anyway.

    kill "$wait_timeout_pid" >/dev/null 2>&1 || true
    wait "$wait_timeout_pid" >/dev/null 2>&1 || true
  fi
  rm -rf "$tmpdir"
}
trap cleanup EXIT

"$kubectl_bin" "${kubectl_args[@]}" port-forward "$target" "${local_port}:${remote_port}" >"$port_forward_log" 2>&1 &
port_forward_pid="$!"

ready="false"
for _ in {1..40}; do
  if ! kill -0 "$port_forward_pid" >/dev/null 2>&1; then
    echo "kubectl port-forward exited before it became ready:" >&2
    cat "$port_forward_log" >&2
    exit 1
  fi
  if "$curl_bin" -fsS "http://127.0.0.1:${local_port}/get-snapshot" -o "$output"; then
    ready="true"
    break
  fi
  sleep 0.25
done

if [[ "$ready" != "true" ]]; then
  echo "Timed out waiting for snapshot endpoint. Port-forward log:" >&2
  cat "$port_forward_log" >&2
  exit 1
fi

echo "Snapshot written to $output"
115 changes: 115 additions & 0 deletions .agents/skills/snapshots/scripts/compare-snapshot-refs.sh
@@ -0,0 +1,115 @@
#!/usr/bin/env bash
# Copyright 2026 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

set -euo pipefail

snapshot_file=""
refs_csv=""
verbosity="4"
output_dir=""
allow_dirty="false"
arch="amd64"

usage() {
  cat <<'USAGE'
Run the same KAI Scheduler snapshot against multiple git refs.

Usage:
  compare-snapshot-refs.sh --snapshot PATH --refs REF1,REF2[,REF3...] [options]

Options:
  --snapshot PATH      Snapshot archive to replay. Required.
  --refs LIST          Comma-separated refs to test. Required.
  --verbosity LEVEL    snapshot-tool verbosity. Default: 4.
  --output-dir PATH    Output directory. Defaults to snapshot-runs/<timestamp>.
  --arch ARCH          Built binary arch suffix. Default: amd64.
  --allow-dirty        Permit switching refs with dirty tracked files.
  -h, --help           Show this help.
USAGE
}

while [[ $# -gt 0 ]]; do
  case "$1" in
    --snapshot)
      snapshot_file="$2"
      shift 2
      ;;
    --refs)
      refs_csv="$2"
      shift 2
      ;;
    --verbosity)
      verbosity="$2"
      shift 2
      ;;
    --output-dir)
      output_dir="$2"
      shift 2
      ;;
    --arch)
      arch="$2"
      shift 2
      ;;
    --allow-dirty)
      allow_dirty="true"
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "Unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "$snapshot_file" || -z "$refs_csv" ]]; then
  echo "--snapshot and --refs are required" >&2
  usage >&2
  exit 2
fi

repo_root="$(git rev-parse --show-toplevel)"
cd "$repo_root"

snapshot_abs="$(cd "$(dirname "$snapshot_file")" && pwd)/$(basename "$snapshot_file")"
if [[ ! -f "$snapshot_abs" ]]; then
  echo "Snapshot not found: $snapshot_file" >&2
  exit 1
fi

if [[ -z "$output_dir" ]]; then
  output_dir="snapshot-runs/$(date -u +%Y%m%dT%H%M%SZ)"
fi
mkdir -p "$output_dir"

summary="$output_dir/summary.tsv"
printf 'ref\tstatus\tlog\n' >"$summary"

IFS=',' read -r -a refs <<<"$refs_csv"
for ref in "${refs[@]}"; do
  if [[ -z "$ref" ]]; then
    continue
  fi

  safe_ref="$(printf '%s' "$ref" | tr '/: ' '___')"
  log_file="$output_dir/${safe_ref}.log"
  status="pass"

  args=(--ref "$ref" --snapshot "$snapshot_abs" --verbosity "$verbosity" --arch "$arch" --log-file "$log_file")
  if [[ "$allow_dirty" == "true" ]]; then
    args+=(--allow-dirty)
  fi

  if ! .agents/skills/snapshots/scripts/run-snapshot.sh "${args[@]}"; then
    status="fail"
  fi

  printf '%s\t%s\t%s\n' "$ref" "$status" "$log_file" >>"$summary"
done

echo "Comparison summary written to $summary"