
Add trigger testing capability #22

@dacharyc

Description

Background

Anthropic's updated skill-creator includes trigger testing: given a skill and a set of queries, it verifies the skill fires for relevant queries and stays silent for irrelevant ones. This is a useful quality signal that skill-validator doesn't currently cover.

What trigger testing answers

"Given my skill's name and description, will an agent correctly decide when to invoke it?"

This catches problems like:

  • Descriptions that are too vague (triggers for everything)
  • Descriptions that are too narrow (misses valid use cases)
  • Name/description mismatch with actual skill content
  • Keyword-stuffed descriptions that trigger for the wrong reasons

How Anthropic does it

Their run_eval.py:

  1. Takes an eval set of { query, expected: "trigger" | "no-trigger" } pairs
  2. For each query, runs claude -p <query> --output-format stream-json --verbose
  3. Parses the stream to detect if the Skill tool was called
  4. Runs each query multiple times (default 3) with a trigger threshold (default 0.5)
  5. Reports pass/fail per query and overall accuracy

This is tightly coupled to the Claude Code CLI and tests Claude's actual skill selection behavior.
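The detection loop above can be sketched roughly as follows. The CLI flags come straight from the steps listed, but the stream-json event shape (tool_use content blocks named "Skill" inside assistant messages) is an assumption about the output format, not a documented spec, and this is a sketch of the approach rather than Anthropic's run_eval.py verbatim:

```python
import json
import subprocess


def query_triggers_skill(query: str) -> bool:
    """Run one query through Claude Code and report whether the Skill
    tool was invoked. Assumes the `claude` CLI is on PATH and that
    stream-json output is one JSON object per line."""
    proc = subprocess.run(
        ["claude", "-p", query, "--output-format", "stream-json", "--verbose"],
        capture_output=True, text=True, check=True,
    )
    for line in proc.stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        # Assumed event shape: assistant messages carry content blocks,
        # and a tool_use block named "Skill" means the skill fired.
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use" and block.get("name") == "Skill":
                return True
    return False


def query_passes(trigger_rate: float, expected: str, threshold: float = 0.5) -> bool:
    """Per-query pass/fail: across multiple runs (Anthropic's defaults:
    3 runs, 0.5 threshold), compare the observed trigger rate against
    the threshold and check it matches the expectation."""
    return (trigger_rate >= threshold) == (expected == "trigger")
```

With 3 runs per query, a query expected to trigger passes if it fired in at least 2 of the 3 runs (rate 2/3 ≥ 0.5), which is why repeated runs matter: a single run can't distinguish a flaky trigger from a reliable one.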

Design options for skill-validator

Option A: Shell out to claude (like Anthropic)

Run claude -p as a subprocess and parse the output.

Pros: Tests real-world behavior; highest fidelity
Cons: Requires Claude Code CLI installed; not usable as a library; won't work with the enterprise Bedrock setup without Claude Code being configured for it; slow (spawns a full claude process per query per run)

Option B: LLM-based trigger simulation

Send the skill's name + description (and optionally a list of other available skills) to an LLM via the existing judge/ client, along with a query, and ask: "Would you invoke this skill for this query? Why or why not?"

Pros: Works with any model provider (Anthropic, OpenAI, Bedrock); usable as a library; can provide reasoning about why a trigger decision was made; fits naturally alongside the existing scoring system
Cons: Simulates rather than tests actual behavior; different models may have different trigger thresholds; doesn't account for Claude Code's specific skill selection logic
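Since the judge/ client interface is project-specific, a sketch of Option B can only cover the two provider-agnostic pieces: building the simulation prompt and parsing the verdict. The prompt wording, the JSON response schema, and the function names here are all hypothetical:

```python
from __future__ import annotations

import json


def build_trigger_prompt(name: str, description: str, query: str,
                         other_skills: list[tuple[str, str]] | None = None) -> str:
    """Assemble a prompt asking the judge model to simulate the
    skill-selection step. Optionally includes other available skills
    so the judge must differentiate between similar candidates."""
    lines = [
        "You are simulating an agent's skill-selection step.",
        f"Candidate skill: {name}",
        f"Description: {description}",
    ]
    if other_skills:
        lines.append("Other available skills:")
        lines += [f"- {n}: {d}" for n, d in other_skills]
    lines += [
        f'User query: "{query}"',
        "Would you invoke the candidate skill for this query?",
        'Answer with JSON: {"would_invoke": true|false, "reasoning": "..."}',
    ]
    return "\n".join(lines)


def parse_trigger_verdict(raw: str) -> tuple[bool, str]:
    """Extract the verdict and its reasoning from the judge's reply.
    The reasoning string is what enables actionable feedback."""
    data = json.loads(raw)
    return bool(data["would_invoke"]), data.get("reasoning", "")
```

Asking for structured JSON with a reasoning field is what makes this option more than a yes/no check: the reasoning can be surfaced to the author as feedback on why the description did or didn't match.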

Option C: Heuristic-based trigger analysis (no LLM)

Analyze the skill's name and description against a query using keyword overlap, semantic similarity, or TF-IDF style scoring. No LLM call needed.

Pros: Fast, deterministic, no API costs
Cons: Much less accurate than an LLM-based check; doesn't capture semantic understanding; probably not worth the complexity
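For concreteness, the simplest version of Option C is plain token overlap between the query and the description; a minimal sketch (illustrative thresholds and stopword list, nothing from the library):

```python
import re

# Minimal stopword list for illustration only
STOPWORDS = {"the", "a", "an", "for", "to", "of", "and", "or", "in", "is"}


def tokens(text: str) -> set[str]:
    """Lowercase word tokens, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def trigger_score(description: str, query: str) -> float:
    """Fraction of query tokens that appear in the skill description.
    A crude stand-in for the TF-IDF or embedding-similarity approaches
    mentioned above."""
    q, d = tokens(query), tokens(description)
    if not q:
        return 0.0
    return len(q & d) / len(q)
```

The weakness is visible immediately: "convert this document" scores zero against a description that says "transform files between formats" even though the meaning matches, which is the semantic gap this option can't close.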

Option D: Don't add to skill-validator; build into review-skill instead

The review-skill skill (planned) would run inside Claude Code, which means it could invoke claude -p or use subagents to test trigger behavior as part of its review workflow, without the library needing to support it.

Pros: Natural fit since it already runs inside Claude Code; no new library dependency
Cons: Doesn't benefit users of the library directly; trigger testing isn't available as a standalone command

Recommendation

Option B is probably the best fit for the library — it aligns with the existing judge/ client architecture, works across model providers (including enterprise Bedrock), and could provide actionable feedback ("your description mentions X but the query is about Y").

Option D is a good complement — the review-skill skill could do real trigger testing against Claude Code while the library provides the simulated version.

Input format

Either way, trigger testing needs an eval set. This could be:

  • A triggers.yaml file in the skill directory with test queries
  • Auto-generated queries based on the skill's name/description (LLM generates "should trigger" and "should not trigger" examples)
  • Provided at invocation time via CLI flags

Auto-generation is appealing because it spares authors from maintaining an eval set, but authored eval sets would be higher quality.
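As a strawman for the first option, a triggers.yaml might look like this (the file name and keys are illustrative, not an existing skill-validator convention):

```yaml
# triggers.yaml — hypothetical schema mirroring the
# { query, expected: "trigger" | "no-trigger" } pairs described above
queries:
  - query: "Help me validate my skill before publishing"
    expected: trigger
  - query: "What's the weather in Boston?"
    expected: no-trigger
```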

Open questions

  • Should this block v1.0.0 or be a post-1.0 feature?
  • For Option B, should the trigger simulation include other skill names/descriptions as context (to test differentiation between similar skills)?
  • Is there value in both Option B (library) and Option D (review-skill) or would one suffice?
