This guide explains how to test whether skills work correctly in the everything-copilot-cli repository.
Skills are promptware: Markdown prompt files that shape model behavior, not deterministic code. That means “testing” is less like unit testing and more like structured evaluation, and “does it work?” breaks down into three questions:
- Does it trigger at the right time?
- Does it produce useful, correct output for the intended task?
- Does it avoid false positives (triggering when it shouldn’t)?
Because model behavior is probabilistic and context-dependent, skills aren’t purely unit-testable. Instead, testing combines human evaluation, structured comparisons, and trigger coverage.
Trigger testing ensures the skill activates when it should, and stays out of the way when it shouldn’t.
- Should-trigger queries: prompts that SHOULD invoke this skill
- Should-NOT-trigger queries: prompts that should NOT invoke this skill (prevents false positives)
How to test:
- Run the query with the skill installed.
- Run the same query without the skill installed.
- Compare:
  - Did the assistant follow the skill’s intended workflow when installed?
  - Did the assistant avoid that workflow (or behave differently) without the skill?
If “with skill” looks identical to “without skill,” the skill may not be triggering or may not be adding value.
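To make the with/without comparison repeatable, a small harness can diff saved transcripts. Below is a minimal sketch, assuming runs are saved to the per-skill `_workspace/` sandbox described later and that the skill’s workflow leaves recognizable markers (section headings) in the output; the marker strings and the example path are hypothetical.

```python
from pathlib import Path

# Hypothetical markers the skill's workflow is expected to leave in a
# transcript (e.g. section headings its output format mandates).
WORKFLOW_MARKERS = ["## Scan Summary", "## Remediation Steps"]  # adjust per skill

def followed_workflow(transcript: str) -> bool:
    """True if the output contains the skill's expected structure."""
    return all(marker in transcript for marker in WORKFLOW_MARKERS)

def compare_runs(workspace: Path) -> None:
    baseline = (workspace / "baseline.md").read_text()      # run WITHOUT skill
    with_skill = (workspace / "with-skill.md").read_text()  # run WITH skill

    if followed_workflow(with_skill) and not followed_workflow(baseline):
        print("OK: skill triggered and changed the output")
    elif with_skill.strip() == baseline.strip():
        print("WARN: outputs identical -- skill may not be triggering")
    else:
        print("CHECK: outputs differ but workflow markers missing -- review manually")

if __name__ == "__main__":
    compare_runs(Path("skills/security/security-scan/_workspace"))  # example path
```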
Output quality testing checks whether the skill improves the result, not just whether it triggered.
- Baseline comparison: run the task without the skill (raw AI), then with the skill — compare structure, completeness, and correctness
- Checklist method: define expected output elements (sections, format, specific content) and verify they’re present
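A minimal sketch of the checklist method, assuming expected elements can be expressed as literal strings to search for; the element names and file path are illustrative, not taken from a real skill.

```python
from pathlib import Path

# Illustrative expected elements for a hypothetical skill's output format.
EXPECTED_ELEMENTS = [
    "## Summary",     # required section heading
    "## Next steps",  # required section heading
    "Severity:",      # each finding should carry a severity label
]

def check_output(path: str) -> list[str]:
    """Return the expected elements that are missing from the output."""
    text = Path(path).read_text()
    return [element for element in EXPECTED_ELEMENTS if element not in text]

missing = check_output("_workspace/with-skill.md")
print("All expected elements present" if not missing else f"Missing: {missing}")
```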
Simple scoring rubric (1–5 scale):
- Trigger accuracy (did it activate appropriately?)
- Output structure (clear sections, consistent formatting)
- Actionability (concrete next steps, commands, checklists, decisions)
- No hallucinations (facts grounded in the repo/context; uncertainty stated)
Tip: Keep the evaluation prompts and the compared outputs side-by-side so reviewers can score quickly.
Regression testing ensures skill edits don’t silently degrade behavior.
- Keep a `_workspace/` dir (gitignored) per skill for test artifacts
- Re-run the baseline comparison after editing a skill’s description or workflow
Focus regression testing on:
- Trigger behavior changes (new false positives / missed triggers)
- Output format drift (skill claims it outputs X, but produces Y)
- Accuracy regressions (new incorrect claims, missing safety checks)
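One cheap way to catch format drift is to diff only the heading skeleton of the output before and after a skill edit. A sketch, assuming both transcripts are saved in the `_workspace/` sandbox; the `with-skill-new.md` filename is a hypothetical convention.

```python
import difflib
from pathlib import Path

def headings(path: Path) -> list[str]:
    """Extract the Markdown heading skeleton of a transcript."""
    return [line for line in path.read_text().splitlines() if line.startswith("#")]

def format_drift(before: Path, after: Path) -> list[str]:
    """Unified diff of the two heading skeletons (empty list = no drift)."""
    return list(difflib.unified_diff(headings(before), headings(after),
                                     fromfile=str(before), tofile=str(after),
                                     lineterm=""))

drift = format_drift(Path("_workspace/with-skill.md"),      # output before the edit
                     Path("_workspace/with-skill-new.md"))  # output after the edit
print("\n".join(drift) if drift else "No format drift in headings")
```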
Use this per-skill sandbox to store test prompts, before/after outputs, and evaluator notes.
```
skills/<category>/<skill-name>/
  SKILL.md
  _workspace/          ← gitignored test sandbox
    baseline.md        ← output WITHOUT skill
    with-skill.md      ← output WITH skill
    test-queries.md    ← should-trigger + should-not-trigger list
    notes.md           ← evaluator observations
```
Add `_workspace/` to `.gitignore` (already present in the repo).
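If you create the sandbox often, a small scaffold script saves typing. A sketch that mirrors the layout above; the example skill path is hypothetical.

```python
from pathlib import Path

FILES = ["baseline.md", "with-skill.md", "test-queries.md", "notes.md"]

def scaffold(skill_dir: str) -> None:
    """Create the gitignored _workspace/ sandbox with empty artifact files."""
    workspace = Path(skill_dir) / "_workspace"
    workspace.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (workspace / name).touch(exist_ok=True)

scaffold("skills/security/security-scan")  # example skill path
```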
For repeatable evaluation, use a “Grader” agent prompt to score outputs consistently. This is inspired by a Grader/Comparator/Analyzer style pattern, but kept pragmatic for day-to-day work.
```
You are evaluating whether a Copilot CLI skill output is high quality.

Skill name: [skill-name]
Skill description: [description field]
User query: [the original user prompt]

Output to evaluate:
---
[output]
---

Score 1–5 on each dimension. Return structured JSON only:
{
  "trigger_accuracy": { "score": 1-5, "reasoning": "..." },
  "output_structure": { "score": 1-5, "reasoning": "..." },
  "actionability": { "score": 1-5, "reasoning": "..." },
  "accuracy": { "score": 1-5, "reasoning": "..." },
  "overall": 1-5
}
```
Notes:
- Keep the grader prompt stable over time so scores remain comparable.
- Provide the same user query and the full output you want graded.
- If needed, add a short “repo context” block (but keep it consistent across runs).
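Because the grader returns JSON only, its responses can be validated and averaged mechanically. A sketch, assuming each response is saved as a file; the `grade-*.json` naming is a hypothetical convention, while the schema matches the prompt above.

```python
import json
from pathlib import Path

DIMENSIONS = ["trigger_accuracy", "output_structure", "actionability", "accuracy"]

def load_grade(path: Path) -> dict:
    """Parse a grader response and check it matches the expected schema."""
    grade = json.loads(path.read_text())
    for dim in DIMENSIONS:
        score = grade[dim]["score"]
        assert 1 <= score <= 5, f"{dim} score out of range: {score}"
    assert 1 <= grade["overall"] <= 5
    return grade

def mean_scores(paths: list[Path]) -> dict[str, float]:
    """Average each dimension across several graded runs."""
    grades = [load_grade(p) for p in paths]
    if not grades:
        return {}
    return {dim: sum(g[dim]["score"] for g in grades) / len(grades)
            for dim in DIMENSIONS}

print(mean_scores(list(Path("_workspace").glob("grade-*.json"))))
```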
- At least 8–10 should-trigger queries written and tested (aim for diversity: different phrasings, contexts, urgency levels)
- At least 8–10 should-NOT-trigger queries verified (no false positives), including near-miss queries
- Description field follows `skill-writing-best-practices.md` (trigger-first pattern)
- Output format section in `SKILL.md` matches what the skill actually produces
- Tested in at least one real project context (not just synthetic examples)
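The query-count items above can be checked mechanically, assuming a simple (hypothetical) `test-queries.md` convention: a `## Should trigger` and a `## Should NOT trigger` section with one query per bullet.

```python
from pathlib import Path

def count_queries(path: str) -> dict[str, int]:
    """Count bulleted queries under each ## section of test-queries.md."""
    counts, section = {}, None
    for line in Path(path).read_text().splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            counts[section] = 0
        elif section and line.lstrip().startswith("- "):
            counts[section] += 1
    return counts

counts = count_queries("_workspace/test-queries.md")
for section in ("Should trigger", "Should NOT trigger"):
    n = counts.get(section, 0)
    print(f"{section}: {n} {'OK' if n >= 8 else '(need at least 8)'}")
```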
A near-miss is a query that sounds like it should trigger the skill but shouldn't — because a simpler or more specific skill is the right choice.
How to design near-misses:
- Take a valid trigger query and weaken/reframe it: change scope ("just one file" vs "the whole repo"), reduce specificity ("check for issues" vs "run a full security scan"), or shift to a different skill's domain.
- Verify the near-miss does NOT trigger the skill (if it does, tighten the “NOT when…” clause in the description).
Example near-miss pairs (for security-scan):
| Trigger (should fire) | Near-miss (should NOT fire) |
|---|---|
| "Run a full security scan before our release" | "Is this function safe?" → use code-review |
| "Check all dependencies for CVEs" | "What does this package do?" → use chat |
| "Audit our auth module for OWASP issues" | "Review this PR for security" → use pr-multi-perspective-review |
See `guides/skill-writing-best-practices.md`; the description is the highest-leverage field to test.
Method:
- Paste the description into a blank chat and ask: “Given this description, when would you use this skill?”
- If the AI’s answer doesn’t match your intended triggers → revise the description
This catches the most common failure mode: a well-written workflow that rarely triggers (or triggers too broadly).
- `guides/skill-writing-best-practices.md` — description quality guide
- `orchestration/patterns/producer-reviewer.md` — iterative refinement for skill output