Summary
Create a Pi extension that can run evaluation scenarios to compare how the agent/harness/model performs on the same task with and without a specific skill enabled.
The goal is to catch regressions where a skill that is meant to improve behavior actually makes outcomes worse.
Motivation
Skills change agent behavior, but today there is no lightweight way to measure whether a skill improves task performance. A simple eval runner would make skill changes safer by producing comparable results and a reviewable report.
Acceptance Criteria