
Add trigger testing capability #22

@dacharyc

Description

Background

Anthropic's updated skill-creator includes trigger testing: given a skill and a set of queries, it verifies the skill fires for relevant queries and stays silent for irrelevant ones. This is a useful quality signal that skill-validator doesn't currently cover.

What trigger testing answers

"Given my skill's name and description, will an agent correctly decide when to invoke it?"

This catches problems like:

  • Descriptions that are too vague (triggers for everything)
  • Descriptions that are too narrow (misses valid use cases)
  • Name/description mismatch with actual skill content
  • Keyword-stuffed descriptions that trigger for the wrong reasons

How Anthropic does it

Their run_eval.py:

  1. Takes an eval set of { query, expected: "trigger" | "no-trigger" } pairs
  2. For each query, runs claude -p <query> --output-format stream-json --verbose
  3. Parses the stream to detect if the Skill tool was called
  4. Runs each query multiple times (default 3) with a trigger threshold (default 0.5)
  5. Reports pass/fail per query and overall accuracy

This is tightly coupled to the Claude Code CLI and tests Claude's actual skill selection behavior.
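The detection loop above can be sketched roughly as follows. The CLI flags come straight from the steps listed, but the stream-json event shape (tool_use content blocks named "Skill" inside assistant messages) is an assumption about the output format, not a documented spec, and this is a sketch of the approach rather than Anthropic's run_eval.py verbatim:

```python
import json
import subprocess


def query_triggers_skill(query: str) -> bool:
    """Run one query through Claude Code and report whether the Skill
    tool was invoked. Assumes the `claude` CLI is on PATH and that
    stream-json output is one JSON object per line."""
    proc = subprocess.run(
        ["claude", "-p", query, "--output-format", "stream-json", "--verbose"],
        capture_output=True, text=True, check=True,
    )
    for line in proc.stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        # Assumed event shape: assistant messages carry content blocks,
        # and a tool_use block named "Skill" means the skill fired.
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use" and block.get("name") == "Skill":
                return True
    return False


def query_passes(trigger_rate: float, expected: str, threshold: float = 0.5) -> bool:
    """Per-query pass/fail: across multiple runs (Anthropic's defaults:
    3 runs, 0.5 threshold), compare the observed trigger rate against
    the threshold and check it matches the expectation."""
    return (trigger_rate >= threshold) == (expected == "trigger")
```

With 3 runs per query, a query expected to trigger passes if it fired in at least 2 of the 3 runs (rate 2/3 ≥ 0.5), which is why repeated runs matter: a single run can't distinguish a flaky trigger from a reliable one.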

Design options for skill-validator

Option A: Shell out to claude (like Anthropic)

Run claude -p as a subprocess and parse the output.

Pros: Tests real-world behavior; highest fidelity
Cons: Requires Claude Code CLI installed; not usable as a library; won't work with the enterprise Bedrock setup without Claude Code being configured for it; slow (spawns a full claude process per query per run)

Option B: LLM-based trigger simulation

Send the skill's name + description (and optionally a list of other available skills) to an LLM via the existing judge/ client, along with a query, and ask: "Would you invoke this skill for this query? Why or why not?"

Pros: Works with any model provider (Anthropic, OpenAI, Bedrock); usable as a library; can provide reasoning about why a trigger decision was made; fits naturally alongside the existing scoring system
Cons: Simulates rather than tests actual behavior; different models may have different trigger thresholds; doesn't account for Claude Code's specific skill selection logic
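Since the judge/ client interface is project-specific, a sketch of Option B can only cover the two provider-agnostic pieces: building the simulation prompt and parsing the verdict. The prompt wording, the JSON response schema, and the function names here are all hypothetical:

```python
from __future__ import annotations

import json


def build_trigger_prompt(name: str, description: str, query: str,
                         other_skills: list[tuple[str, str]] | None = None) -> str:
    """Assemble a prompt asking the judge model to simulate the
    skill-selection step. Optionally includes other available skills
    so the judge must differentiate between similar candidates."""
    lines = [
        "You are simulating an agent's skill-selection step.",
        f"Candidate skill: {name}",
        f"Description: {description}",
    ]
    if other_skills:
        lines.append("Other available skills:")
        lines += [f"- {n}: {d}" for n, d in other_skills]
    lines += [
        f'User query: "{query}"',
        "Would you invoke the candidate skill for this query?",
        'Answer with JSON: {"would_invoke": true|false, "reasoning": "..."}',
    ]
    return "\n".join(lines)


def parse_trigger_verdict(raw: str) -> tuple[bool, str]:
    """Extract the verdict and its reasoning from the judge's reply.
    The reasoning string is what enables actionable feedback."""
    data = json.loads(raw)
    return bool(data["would_invoke"]), data.get("reasoning", "")
```

Asking for structured JSON with a reasoning field is what makes this option more than a yes/no check: the reasoning can be surfaced to the author as feedback on why the description did or didn't match.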

Option C: Heuristic-based trigger analysis (no LLM)

Analyze the skill's name and description against a query using keyword overlap, semantic similarity, or TF-IDF style scoring. No LLM call needed.

Pros: Fast, deterministic, no API costs
Cons: Much less accurate than an LLM-based check; doesn't capture semantic understanding; probably not worth the complexity
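For concreteness, the simplest version of Option C is plain token overlap between the query and the description; a minimal sketch (illustrative thresholds and stopword list, nothing from the library):

```python
import re

# Minimal stopword list for illustration only
STOPWORDS = {"the", "a", "an", "for", "to", "of", "and", "or", "in", "is"}


def tokens(text: str) -> set[str]:
    """Lowercase word tokens, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def trigger_score(description: str, query: str) -> float:
    """Fraction of query tokens that appear in the skill description.
    A crude stand-in for the TF-IDF or embedding-similarity approaches
    mentioned above."""
    q, d = tokens(query), tokens(description)
    if not q:
        return 0.0
    return len(q & d) / len(q)
```

The weakness is visible immediately: "convert this document" scores zero against a description that says "transform files between formats" even though the meaning matches, which is the semantic gap this option can't close.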

Option D: Don't add to skill-validator; build into review-skill instead

The review-skill skill (planned) would run inside Claude Code, which means it could invoke claude -p or use subagents to test trigger behavior as part of its review workflow, without the library needing to support it.

Pros: Natural fit since it already runs inside Claude Code; no new library dependency
Cons: Doesn't benefit users of the library directly; trigger testing isn't available as a standalone command

Recommendation

Option B is probably the best fit for the library — it aligns with the existing judge/ client architecture, works across model providers (including enterprise Bedrock), and could provide actionable feedback ("your description mentions X but the query is about Y").

Option D is a good complement — the review-skill skill could do real trigger testing against Claude Code while the library provides the simulated version.

Input format

Either way, trigger testing needs an eval set. This could be:

  • A triggers.yaml file in the skill directory with test queries
  • Auto-generated queries based on the skill's name/description (LLM generates "should trigger" and "should not trigger" examples)
  • Provided at invocation time via CLI flags

Auto-generation is appealing because it spares authors from maintaining an eval set, but authored eval sets would be higher quality.
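As a strawman for the first option, a triggers.yaml might look like this (the file name and keys are illustrative, not an existing skill-validator convention):

```yaml
# triggers.yaml — hypothetical schema mirroring the
# { query, expected: "trigger" | "no-trigger" } pairs described above
queries:
  - query: "Help me validate my skill before publishing"
    expected: trigger
  - query: "What's the weather in Boston?"
    expected: no-trigger
```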

Open questions

  • Should this block v1.0.0 or be a post-1.0 feature?
  • For Option B, should the trigger simulation include other skill names/descriptions as context (to test differentiation between similar skills)?
  • Is there value in both Option B (library) and Option D (review-skill) or would one suffice?
