Background
Anthropic's updated skill-creator includes trigger testing: given a skill and a set of queries, it verifies the skill fires for relevant queries and stays silent for irrelevant ones. This is a useful quality signal that skill-validator doesn't currently cover.
What trigger testing answers
"Given my skill's name and description, will an agent correctly decide when to invoke it?"
This catches problems like:
- Descriptions that are too vague (triggers for everything)
- Descriptions that are too narrow (misses valid use cases)
- Name/description mismatch with actual skill content
- Keyword-stuffed descriptions that trigger for the wrong reasons
How Anthropic does it
Their run_eval.py:
- Takes an eval set of `{ query, expected: "trigger" | "no-trigger" }` pairs
- For each query, runs `claude -p <query> --output-format stream-json --verbose`
- Parses the stream to detect whether the `Skill` tool was called
- Runs each query multiple times (default 3) with a trigger threshold (default 0.5)
- Reports pass/fail per query and overall accuracy
This is tightly coupled to the Claude Code CLI and tests Claude's actual skill selection behavior.
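A minimal sketch of that loop might look like the following. The stream-json event shape and the `Skill` tool-use block are assumptions inferred from the description above, not the actual `run_eval.py` source; only the CLI flags named in this issue are used.

```python
import json
import subprocess


def stream_used_skill(stream_lines):
    """Return True if any stream-json line contains a Skill tool_use block.

    The event shape (message -> content -> tool_use blocks) is an assumed
    structure for Claude Code's stream-json output.
    """
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON chatter in the stream
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use" and block.get("name") == "Skill":
                return True
    return False


def run_query(query):
    """Spawn one `claude -p` process and report whether the skill triggered."""
    proc = subprocess.run(
        ["claude", "-p", query, "--output-format", "stream-json", "--verbose"],
        capture_output=True,
        text=True,
    )
    return stream_used_skill(proc.stdout.splitlines())


def evaluate(case, runs=3, threshold=0.5):
    """Run one {query, expected} case several times; pass if the observed
    trigger rate matches the expected label at the given threshold."""
    rate = sum(run_query(case["query"]) for _ in range(runs)) / runs
    triggered = rate >= threshold
    return triggered == (case["expected"] == "trigger")
```

The repeated runs plus threshold smooth over nondeterministic skill selection, which is presumably why Anthropic defaults to 3 runs at 0.5 rather than a single run.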
Design options for skill-validator
Option A: Shell out to claude (like Anthropic)
Run claude -p as a subprocess and parse the output.
Pros: Tests real-world behavior; highest fidelity
Cons: Requires Claude Code CLI installed; not usable as a library; won't work with the enterprise Bedrock setup without Claude Code being configured for it; slow (spawns a full claude process per query per run)
Option B: LLM-based trigger simulation
Send the skill's name + description (and optionally a list of other available skills) to an LLM via the existing judge/ client, along with a query, and ask: "Would you invoke this skill for this query? Why or why not?"
Pros: Works with any model provider (Anthropic, OpenAI, Bedrock); usable as a library; can provide reasoning about why a trigger decision was made; fits naturally alongside the existing scoring system
Cons: Simulates rather than tests actual behavior; different models may have different trigger thresholds; doesn't account for Claude Code's specific skill selection logic
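To make Option B concrete, here is a hedged sketch of the prompt-building and answer-parsing halves of a trigger simulation. The prompt wording and function names are illustrative, not the library's actual `judge/` API; the model call itself is left to whatever client the library already has.

```python
def build_trigger_prompt(skill_name, description, query, other_skills=()):
    """Assemble a judge prompt asking whether a skill should fire for a query.

    other_skills is an optional iterable of (name, description) pairs, so the
    judge can differentiate between similar skills.
    """
    others = "\n".join(f"- {name}: {desc}" for name, desc in other_skills)
    parts = [
        f"A skill named '{skill_name}' is described as:\n{description}\n",
    ]
    if others:
        parts.append(f"Other available skills:\n{others}\n")
    parts.append(
        f"User query: {query}\n\n"
        "Would an agent invoke this skill for this query? "
        "Answer TRIGGER or NO-TRIGGER on the first line, then explain why."
    )
    return "\n".join(parts)


def parse_decision(reply):
    """Map the judge's first line back to the eval-set labels."""
    first = reply.strip().splitlines()[0].upper()
    return "trigger" if first.startswith("TRIGGER") else "no-trigger"
```

Asking for the explanation after the verdict is what makes this option attractive: the reasoning text can be surfaced directly as feedback on the skill's description.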
Option C: Heuristic-based trigger analysis (no LLM)
Analyze the skill's name and description against a query using keyword overlap, semantic similarity, or TF-IDF style scoring. No LLM call needed.
Pros: Fast, deterministic, no API costs
Cons: Much less accurate than an LLM-based check; doesn't capture semantic understanding; probably not worth the complexity
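For comparison, the whole of Option C could be little more than a keyword-overlap score like this sketch (stopword list and scoring are illustrative):

```python
import re

STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "or", "in", "with", "is"}


def keywords(text):
    """Lowercase alphanumeric tokens minus a small stopword list."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def trigger_score(name, description, query):
    """Fraction of query keywords that appear in the skill's name/description."""
    skill_terms = keywords(name) | keywords(description)
    query_terms = keywords(query)
    if not query_terms:
        return 0.0
    return len(skill_terms & query_terms) / len(query_terms)
```

Its brittleness is visible immediately: a paraphrased query ("combine these documents" against a PDF-merging skill) scores zero, which is the semantic gap noted above.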
Option D: Don't add to skill-validator; build into review-skill instead
The review-skill skill (planned) would run inside Claude Code, which means it could invoke claude -p or use subagents to test trigger behavior as part of its review workflow, without the library needing to support it.
Pros: Natural fit since it already runs inside Claude Code; no new library dependency
Cons: Doesn't benefit users of the library directly; trigger testing isn't available as a standalone command
Recommendation
Option B is probably the best fit for the library — it aligns with the existing judge/ client architecture, works across model providers (including enterprise Bedrock), and could provide actionable feedback ("your description mentions X but the query is about Y").
Option D is a good complement — the review-skill skill could do real trigger testing against Claude Code while the library provides the simulated version.
Input format
Either way, trigger testing needs an eval set. This could be:
- A `triggers.yaml` file in the skill directory with test queries
- Auto-generated queries based on the skill's name/description (LLM generates "should trigger" and "should not trigger" examples)
- Provided at invocation time via CLI flags
Auto-generation is appealing because it spares authors from maintaining an eval set, but authored eval sets would be higher quality.
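If the file-based route is taken, the schema could simply mirror Anthropic's `{ query, expected }` pairs plus the run parameters. This is a hypothetical layout, not a format the library currently reads:

```yaml
# Hypothetical triggers.yaml schema; field names are illustrative.
queries:
  - query: "merge these two PDF files into one"
    expected: trigger
  - query: "what's the weather in Boston?"
    expected: no-trigger
runs: 3          # times to run each query
threshold: 0.5   # minimum trigger rate to count as "triggered"
```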
Open questions
- Should this block v1.0.0 or be a post-1.0 feature?
- For Option B, should the trigger simulation include other skill names/descriptions as context (to test differentiation between similar skills)?
- Is there value in both Option B (library) and Option D (review-skill) or would one suffice?