feat: snapshots skill #1659
Open
enoodle wants to merge 5 commits into kai-scheduler:main from enoodle:skill/kai-scheduler-snapshots
+595
−0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| # Agent Skills | ||
|
|
||
| This repository keeps repo-local agent skills under `.agents/skills/`. | ||
|
|
||
| ## Consuming Repo Skills | ||
|
|
||
| `.agents/skills/` is the canonical repo-local location for shared agent skills in this repository. | ||
|
|
||
| ### Codex | ||
|
|
||
| Codex can consume skills from `.agents/skills/` when working inside this repository. | ||
|
|
||
| Example prompts: | ||
|
|
||
| ```text | ||
| Use the snapshots skill to capture a snapshot from the current cluster. | ||
| Take a scheduler snapshot and compare it on v0.13.0 and v0.14.0. | ||
| ``` | ||
|
|
||
| ### Claude Code | ||
|
|
||
| Claude Code can use the same repo-owned skill content, but may require exposing the skill through Claude's configured skill directory if the current setup does not scan `.agents/skills/` directly. | ||
|
|
||
| Example: | ||
|
|
||
| ```bash | ||
| ln -s /path/to/KAI-Scheduler/.agents/skills/snapshots ~/.claude/skills/snapshots | ||
| ``` | ||
|
|
||
| ### Other Harnesses | ||
|
|
||
| Other agent harnesses should treat `.agents/skills/` as the source of truth and either scan it directly or mirror the needed skills into their own configured skill directory. | ||
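For harnesses that do not scan `.agents/skills/` directly, the mirroring step might look like the sketch below. It generalizes the single `ln -s` shown for Claude Code. The function name `mirror_skills` and the rule that a skill is any directory containing a `SKILL.md` are assumptions made for illustration, not part of this PR.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: symlink every skill directory under SRC into DEST.
# Assumption: a "skill" is any direct subdirectory that contains SKILL.md.
mirror_skills() {
  local src="$1" dest="$2" skill
  mkdir -p "$dest"
  for skill in "$src"/*/; do
    # Skip non-skill directories (and the unexpanded glob when SRC is empty).
    [[ -f "${skill}SKILL.md" ]] || continue
    # -n replaces an existing link instead of descending into it.
    ln -sfn "$(cd "$skill" && pwd)" "$dest/$(basename "$skill")"
  done
}
```

Called as `mirror_skills .agents/skills ~/.claude/skills`, this would keep `.agents/skills/` as the source of truth, because the destination only ever holds links.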
|
|
||
| ## Current Skills | ||
|
|
||
| - [`snapshots`](skills/snapshots/SKILL.md): capture KAI Scheduler snapshots, inspect archives, replay them with `snapshot-tool`, and compare behavior across refs. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| --- | ||
| name: snapshots | ||
| description: Use when investigating KAI Scheduler behavior with captured cluster state, especially to replay scheduler decisions on specific refs or compare behavior across versions. | ||
| license: MIT | ||
| compatibility: Requires bash, git, kubectl for capture, curl for capture, make/docker or a prebuilt snapshot-tool for replay. | ||
| metadata: | ||
| author: KAI Scheduler maintainers | ||
| version: "1.0" | ||
| --- | ||
|
|
||
| # Snapshots | ||
|
|
||
| Use this skill when investigating KAI Scheduler behavior with captured cluster state, especially for reproducing scheduling bugs, comparing behavior across KAI versions, or gathering evidence for issues like `kai-scheduler/KAI-Scheduler#1517`. | ||
|
|
||
| ## Facts | ||
|
|
||
| - `docs/plugins/snapshot.md` is the source of truth for capture. | ||
| - The snapshot endpoint is `/get-snapshot` on plugin port `8081`, not the scheduler `--listen-address` port. In the observed clusters here, remote `8080` returned `404` while remote `8081` worked. | ||
| - Snapshot files are ZIP archives containing `snapshot.json`, even when named `.gzip`. | ||
| - `cmd/snapshot-tool/main.go` rebuilds fake clients from `snapshot.json` and replays the configured scheduler actions. | ||
| - Replay is a simulation of scheduler behavior, not a full cluster reproduction. | ||
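Because the file extension does not match the container format, a quick check of the leading bytes can confirm what an artifact really is. The sketch below is illustrative only and is not the PR's `inspect-snapshot.sh`; the helper name `archive_kind` is an assumption. It relies on the standard ZIP local-file-header signature (`50 4b 03 04`) and the gzip magic number (`1f 8b`).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: classify a file by its magic bytes, ignoring its name.
archive_kind() {
  local magic
  # Hex-dump the first four bytes and strip od's spacing.
  magic="$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')"
  case "$magic" in
    504b0304*) echo "zip" ;;     # "PK\x03\x04": ZIP local file header
    1f8b*)     echo "gzip" ;;    # gzip member header
    *)         echo "unknown" ;;
  esac
}
```

For a snapshot captured as described above, `archive_kind snapshots/issue-123.gzip` would be expected to report `zip`.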
|
|
||
| ## Commands | ||
|
|
||
| Run scripts from the repository root: | ||
|
|
||
| ```bash | ||
| .agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/issue-123.gzip | ||
| .agents/skills/snapshots/scripts/inspect-snapshot.sh snapshots/issue-123.gzip | ||
| .agents/skills/snapshots/scripts/run-snapshot.sh --snapshot snapshots/issue-123.gzip --verbosity 8 | ||
| .agents/skills/snapshots/scripts/run-snapshot.sh --ref v0.14.2 --snapshot snapshots/issue-123.gzip | ||
| .agents/skills/snapshots/scripts/compare-snapshot-refs.sh --snapshot snapshots/issue-123.gzip --refs main,v0.14.2 | ||
| ``` | ||
|
|
||
| - `capture-snapshot.sh`: port-forward the scheduler and download `/get-snapshot`. Default target is `deployment/kai-scheduler-default` in namespace `kai-scheduler` on local/remote port `8081`. The script inherits `KUBECONFIG`, for example: | ||
|
|
||
| ```bash | ||
| KUBECONFIG=$HOME/.kube/engine-scale-test \ | ||
| .agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/example.gzip | ||
| ``` | ||
|
|
||
| - `inspect-snapshot.sh`: validate that the archive contains `snapshot.json` and print top-level structure. Run this before replaying user-provided artifacts. | ||
| - `run-snapshot.sh`: build `snapshot-tool` with `make build-go SERVICE_NAME=snapshot-tool` and replay on the current checkout, or use `--ref` to switch to one Git ref, replay, and restore the original branch or commit. For large snapshots, start with `--verbosity 2`. For reruns, prefer `--no-build --tool bin/snapshot-tool-amd64`. If a ref-based run is interrupted hard enough that the shell trap does not execute, the repo can stay detached; check with `git status --short --branch` and restore with `git switch <branch>`. | ||
| - `compare-snapshot-refs.sh`: run the same snapshot against several git refs and save one log per ref plus `summary.tsv`. | ||
|
|
||
| ## Workflow | ||
|
|
||
| 1. Capture or receive the snapshot. Avoid committing snapshot artifacts unless the user explicitly asks. | ||
| 2. Inspect the archive and confirm it contains `snapshot.json`. | ||
| 3. If capture fails, verify the scheduler pod is running, verify the scheduler ConfigMap includes `- name: snapshot`, and verify the scheduler logs contain `Snapshot plugin registering get-snapshot`. | ||
| 4. Replay on the reported KAI version first, aligned to the exact tag or commit. | ||
| 5. Use `--verbosity 2` first. Compare timing from action timestamps inside the logs, not whole-command wall clock, because builds and verbosity can dominate. | ||
| 6. Replay on candidate fixed or regressed refs only after the reported version is understood. | ||
| 7. If a version appears stuck, stop waiting indefinitely and keep the partial log as evidence. In the runs here, `v0.14.0` completed `reclaim` materially faster than `v0.13.0`, while `v0.14.4` appeared to stall in `reclaim` past an interactive timeout. | ||
| 8. Report refs, commands, log paths, action timings, errors, and whether the issue reproduced. |
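The text checks in step 3 can be sketched as two small filters. This is an assumption-laden illustration, not part of the PR: the function names are hypothetical, and both helpers read from stdin so that no ConfigMap name or log command has to be guessed. The config check assumes the plugin entry appears exactly as `- name: snapshot` on its own line.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helpers; both read text on stdin so they stay independent of
# how the ConfigMap YAML or the scheduler logs were fetched.

# Succeeds if the config lists the snapshot plugin as "- name: snapshot".
config_enables_snapshot() {
  grep -Eq '^[[:space:]]*-[[:space:]]*name:[[:space:]]*snapshot[[:space:]]*$'
}

# Succeeds if the logs contain the registration message quoted in step 3.
logs_show_snapshot_plugin() {
  grep -Fq 'Snapshot plugin registering get-snapshot'
}
```

Feeding each helper a saved copy of the scheduler config or logs (for example `config_enables_snapshot <config.yaml`) keeps the check reproducible and lets the same files be attached to the report in step 8.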
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,159 @@ | ||
| #!/usr/bin/env bash | ||
| # Copyright 2026 NVIDIA CORPORATION | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| namespace="kai-scheduler" | ||
| deployment="kai-scheduler-default" | ||
| pod="" | ||
| local_port="8081" | ||
| remote_port="8081" | ||
| output="" | ||
| context="" | ||
| kubectl_bin="kubectl" | ||
| curl_bin="curl" | ||
|
|
||
| usage() { | ||
| cat <<'USAGE' | ||
| Capture a KAI Scheduler snapshot through the snapshot plugin endpoint. | ||
|
|
||
| Usage: | ||
| capture-snapshot.sh [options] | ||
|
|
||
| Options: | ||
| --output PATH Output snapshot archive path. Defaults to snapshot-<timestamp>.gzip. | ||
| --namespace NAME Kubernetes namespace. Default: kai-scheduler. | ||
| --deployment NAME Scheduler deployment name. Default: kai-scheduler-default. | ||
| --pod NAME Scheduler pod name. Overrides --deployment. | ||
| --local-port PORT Local port for kubectl port-forward. Default: 8081. | ||
| --remote-port PORT Remote scheduler HTTP port. Default: 8081. | ||
| --context NAME Kubernetes context. | ||
| --kubectl PATH kubectl binary. Default: kubectl. | ||
| --curl PATH curl binary. Default: curl. | ||
| -h, --help Show this help. | ||
| USAGE | ||
| } | ||
|
|
||
| while [[ $# -gt 0 ]]; do | ||
| case "$1" in | ||
| --output) | ||
| output="$2" | ||
| shift 2 | ||
| ;; | ||
| --namespace) | ||
| namespace="$2" | ||
| shift 2 | ||
| ;; | ||
| --deployment) | ||
| deployment="$2" | ||
| shift 2 | ||
| ;; | ||
| --pod) | ||
| pod="$2" | ||
| shift 2 | ||
| ;; | ||
| --local-port) | ||
| local_port="$2" | ||
| shift 2 | ||
| ;; | ||
| --remote-port) | ||
| remote_port="$2" | ||
| shift 2 | ||
| ;; | ||
| --context) | ||
| context="$2" | ||
| shift 2 | ||
| ;; | ||
| --kubectl) | ||
| kubectl_bin="$2" | ||
| shift 2 | ||
| ;; | ||
| --curl) | ||
| curl_bin="$2" | ||
| shift 2 | ||
| ;; | ||
| -h|--help) | ||
| usage | ||
| exit 0 | ||
| ;; | ||
| *) | ||
| echo "Unknown argument: $1" >&2 | ||
| usage >&2 | ||
| exit 2 | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| if [[ -z "$output" ]]; then | ||
| output="snapshot-$(date -u +%Y%m%dT%H%M%SZ).gzip" | ||
| fi | ||
|
|
||
| command -v "$kubectl_bin" >/dev/null 2>&1 || { | ||
| echo "kubectl binary not found: $kubectl_bin" >&2 | ||
| exit 1 | ||
| } | ||
| command -v "$curl_bin" >/dev/null 2>&1 || { | ||
| echo "curl binary not found: $curl_bin" >&2 | ||
| exit 1 | ||
| } | ||
|
|
||
| mkdir -p "$(dirname "$output")" | ||
|
|
||
| kubectl_args=() | ||
| if [[ -n "$context" ]]; then | ||
| kubectl_args+=(--context "$context") | ||
| fi | ||
| kubectl_args+=(-n "$namespace") | ||
|
|
||
| target="deployment/$deployment" | ||
| if [[ -n "$pod" ]]; then | ||
| target="pod/$pod" | ||
| fi | ||
|
|
||
| tmpdir="$(mktemp -d)" | ||
| port_forward_log="$tmpdir/port-forward.log" | ||
| port_forward_pid="" | ||
|
|
||
| cleanup() { | ||
| if [[ -n "$port_forward_pid" ]] && kill -0 "$port_forward_pid" >/dev/null 2>&1; then | ||
| kill "$port_forward_pid" >/dev/null 2>&1 || true | ||
| # Kill the port-forward if wait does not return after SIGTERM. | ||
| ( | ||
| sleep 2 | ||
| kill -0 "$port_forward_pid" >/dev/null 2>&1 && kill -9 "$port_forward_pid" >/dev/null 2>&1 || true | ||
| ) & | ||
| wait_timeout_pid="$!" | ||
|
|
||
| wait "$port_forward_pid" >/dev/null 2>&1 || true | ||
|
Collaborator: wait with some timeout?
Author (Collaborator): Because of how we set up the port-forward, it is not expected to resist the kill command. I can add a timeout anyway.
||
| kill "$wait_timeout_pid" >/dev/null 2>&1 || true | ||
| wait "$wait_timeout_pid" >/dev/null 2>&1 || true | ||
| fi | ||
| rm -rf "$tmpdir" | ||
| } | ||
| trap cleanup EXIT | ||
|
|
||
| "$kubectl_bin" "${kubectl_args[@]}" port-forward "$target" "${local_port}:${remote_port}" >"$port_forward_log" 2>&1 & | ||
| port_forward_pid="$!" | ||
|
|
||
| ready="false" | ||
| for _ in {1..40}; do | ||
| if ! kill -0 "$port_forward_pid" >/dev/null 2>&1; then | ||
| echo "kubectl port-forward exited before it became ready:" >&2 | ||
| cat "$port_forward_log" >&2 | ||
| exit 1 | ||
| fi | ||
| if "$curl_bin" -fsS "http://127.0.0.1:${local_port}/get-snapshot" -o "$output"; then | ||
| ready="true" | ||
| break | ||
| fi | ||
| sleep 0.25 | ||
| done | ||
|
|
||
| if [[ "$ready" != "true" ]]; then | ||
| echo "Timed out waiting for snapshot endpoint. Port-forward log:" >&2 | ||
| cat "$port_forward_log" >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| echo "Snapshot written to $output" | ||
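The review thread on `cleanup()` asks about waiting with a timeout. One way to express that without a background watchdog subshell is a bounded polling loop, sketched below. This is not the PR's code; the helper name `stop_with_timeout` and the default of 2 seconds are assumptions, and the sketch relies on `sleep` accepting fractional seconds, as the script's own `sleep 0.25` already does.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: SIGTERM a child, poll for up to TIMEOUT_S seconds,
# then fall back to SIGKILL and reap it.
stop_with_timeout() {
  local pid="$1" timeout_s="${2:-2}" i
  # Nothing to do if the process is already gone.
  kill "$pid" 2>/dev/null || return 0
  for ((i = 0; i < timeout_s * 10; i++)); do
    kill -0 "$pid" 2>/dev/null || break
    sleep 0.1
  done
  # Still alive after the grace period: force it.
  if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid" 2>/dev/null || true
  fi
  # Reap the child so no zombie is left behind.
  wait "$pid" 2>/dev/null || true
}
```

In `cleanup()`, a call such as `stop_with_timeout "$port_forward_pid" 2` would cover both the normal case, where `kubectl port-forward` exits on SIGTERM, and the unexpected case where it does not.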
.agents/skills/snapshots/scripts/compare-snapshot-refs.sh (115 changes: 115 additions & 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,115 @@ | ||
| #!/usr/bin/env bash | ||
| # Copyright 2026 NVIDIA CORPORATION | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| snapshot_file="" | ||
| refs_csv="" | ||
| verbosity="4" | ||
| output_dir="" | ||
| allow_dirty="false" | ||
| arch="amd64" | ||
|
|
||
| usage() { | ||
| cat <<'USAGE' | ||
| Run the same KAI Scheduler snapshot against multiple git refs. | ||
|
|
||
| Usage: | ||
| compare-snapshot-refs.sh --snapshot PATH --refs REF1,REF2[,REF3...] [options] | ||
|
|
||
| Options: | ||
| --snapshot PATH Snapshot archive to replay. Required. | ||
| --refs LIST Comma-separated refs to test. Required. | ||
| --verbosity LEVEL snapshot-tool verbosity. Default: 4. | ||
| --output-dir PATH Output directory. Defaults to snapshot-runs/<timestamp>. | ||
| --arch ARCH Built binary arch suffix. Default: amd64. | ||
| --allow-dirty Permit switching refs with dirty tracked files. | ||
| -h, --help Show this help. | ||
| USAGE | ||
| } | ||
|
|
||
| while [[ $# -gt 0 ]]; do | ||
| case "$1" in | ||
| --snapshot) | ||
| snapshot_file="$2" | ||
| shift 2 | ||
| ;; | ||
| --refs) | ||
| refs_csv="$2" | ||
| shift 2 | ||
| ;; | ||
| --verbosity) | ||
| verbosity="$2" | ||
| shift 2 | ||
| ;; | ||
| --output-dir) | ||
| output_dir="$2" | ||
| shift 2 | ||
| ;; | ||
| --arch) | ||
| arch="$2" | ||
| shift 2 | ||
| ;; | ||
| --allow-dirty) | ||
| allow_dirty="true" | ||
| shift | ||
| ;; | ||
| -h|--help) | ||
| usage | ||
| exit 0 | ||
| ;; | ||
| *) | ||
| echo "Unknown argument: $1" >&2 | ||
| usage >&2 | ||
| exit 2 | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| if [[ -z "$snapshot_file" || -z "$refs_csv" ]]; then | ||
| echo "--snapshot and --refs are required" >&2 | ||
| usage >&2 | ||
| exit 2 | ||
| fi | ||
|
|
||
| repo_root="$(git rev-parse --show-toplevel)" | ||
| cd "$repo_root" | ||
|
|
||
| snapshot_abs="$(cd "$(dirname "$snapshot_file")" && pwd)/$(basename "$snapshot_file")" | ||
| if [[ ! -f "$snapshot_abs" ]]; then | ||
| echo "Snapshot not found: $snapshot_file" >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| if [[ -z "$output_dir" ]]; then | ||
| output_dir="snapshot-runs/$(date -u +%Y%m%dT%H%M%SZ)" | ||
| fi | ||
| mkdir -p "$output_dir" | ||
|
|
||
| summary="$output_dir/summary.tsv" | ||
| printf 'ref\tstatus\tlog\n' >"$summary" | ||
|
|
||
| IFS=',' read -r -a refs <<<"$refs_csv" | ||
| for ref in "${refs[@]}"; do | ||
| if [[ -z "$ref" ]]; then | ||
| continue | ||
| fi | ||
|
|
||
| safe_ref="$(printf '%s' "$ref" | tr '/: ' '___')" | ||
| log_file="$output_dir/${safe_ref}.log" | ||
| status="pass" | ||
|
|
||
| args=(--ref "$ref" --snapshot "$snapshot_abs" --verbosity "$verbosity" --arch "$arch" --log-file "$log_file") | ||
| if [[ "$allow_dirty" == "true" ]]; then | ||
| args+=(--allow-dirty) | ||
| fi | ||
|
|
||
| if ! .agents/skills/snapshots/scripts/run-snapshot.sh "${args[@]}"; then | ||
| status="fail" | ||
| fi | ||
|
|
||
| printf '%s\t%s\t%s\n' "$ref" "$status" "$log_file" >>"$summary" | ||
| done | ||
|
|
||
| echo "Comparison summary written to $summary" |
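The `summary.tsv` written by the loop above has a header row followed by one `ref`, `status`, `log` row per ref, so it is easy to post-process. As a minimal sketch (the helper name `failed_refs` is hypothetical and not part of the PR), the failing refs and their logs could be listed like this:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print "ref<TAB>log" for every row whose status is "fail".
failed_refs() {
  # NR > 1 skips the "ref status log" header written by the compare script.
  awk -F '\t' 'NR > 1 && $2 == "fail" { print $1 "\t" $3 }' "$1"
}
```

Running `failed_refs snapshot-runs/<timestamp>/summary.tsv` after a comparison would point directly at the logs worth reading first.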
And how will Codex know where to look for this skill? You do have an `ln` command for Claude.
Codex knows to look in `.agents` (https://developers.openai.com/codex/skills#where-to-save-skills); it is a common location for such skills (see NemoClaw, for example).