Create skill comparison eval extension with HTML report #296

## Summary

Create a Pi extension that can run evaluation scenarios to compare how the agent/harness/model performs on the same task with and without a specific skill enabled.

The goal is to catch regressions where a skill that is meant to improve behavior actually makes outcomes worse.

## Motivation

Skills change agent behavior, but today there is no lightweight way to measure whether a skill improves task performance. A simple eval runner would make skill changes safer by producing comparable results and a reviewable report.

## Acceptance Criteria

- [ ] Add an extension that can run a task/eval scenario in two modes: with a target skill and without that skill.
- [ ] Capture enough output and metadata from both runs (for example, final output, pass/fail status, and duration) to compare them meaningfully.
- [ ] Generate a simple self-contained HTML report summarizing the findings for human review.
- [ ] Keep the first version intentionally small and easy to extend with more eval scenarios later.
- [ ] Document how to run the eval extension and where to find the HTML output.
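
To make the criteria above concrete, here is a minimal sketch of the comparison and report steps. It is only an illustration under stated assumptions: `Scenario`, `RunResult`, `Runner`, `verdict`, `renderReport`, and `compareSkill` are hypothetical names, not part of Pi's extension API, and the `Runner` callback stands in for however the extension would actually launch the agent with or without the skill.

```typescript
// Hypothetical shapes: none of these names come from Pi's extension API.
interface Scenario {
  name: string;
  prompt: string;
}

interface RunResult {
  skillEnabled: boolean;
  passed: boolean; // did the scenario's own check succeed?
  durationMs: number;
  output: string; // final agent output, kept for human review
}

// Assumed runner signature; a real extension would start the agent here.
type Runner = (scenario: Scenario, skillEnabled: boolean) => Promise<RunResult>;

type Verdict = "improved" | "regressed" | "unchanged";

// Compare pass/fail only; richer metrics can be added later.
function verdict(withSkill: RunResult, withoutSkill: RunResult): Verdict {
  if (withSkill.passed === withoutSkill.passed) return "unchanged";
  return withSkill.passed ? "improved" : "regressed";
}

// Escape agent output so it cannot break the report markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a single HTML string with no external assets.
function renderReport(
  scenario: Scenario,
  skill: string,
  withSkill: RunResult,
  withoutSkill: RunResult,
): string {
  const row = (label: string, r: RunResult): string =>
    `<tr><td>${label}</td><td>${r.passed ? "pass" : "fail"}</td>` +
    `<td>${r.durationMs} ms</td><td><pre>${escapeHtml(r.output)}</pre></td></tr>`;
  return [
    "<!DOCTYPE html>",
    `<html><head><meta charset="utf-8"><title>${escapeHtml(scenario.name)}</title></head><body>`,
    `<h1>${escapeHtml(scenario.name)}: ${escapeHtml(skill)}</h1>`,
    `<p>Verdict: ${verdict(withSkill, withoutSkill)}</p>`,
    "<table><tr><th>Mode</th><th>Result</th><th>Duration</th><th>Output</th></tr>",
    row("with skill", withSkill),
    row("without skill", withoutSkill),
    "</table></body></html>",
  ].join("\n");
}

// Run the same scenario in both modes and return the verdict plus report.
async function compareSkill(
  scenario: Scenario,
  skill: string,
  run: Runner,
): Promise<{ verdict: Verdict; html: string }> {
  const withoutSkill = await run(scenario, false);
  const withSkill = await run(scenario, true);
  return {
    verdict: verdict(withSkill, withoutSkill),
    html: renderReport(scenario, skill, withSkill, withoutSkill),
  };
}
```

Passing the runner in as a callback keeps the comparison and report logic testable with stubbed results, and adding a scenario later only means supplying another `Scenario` value. Writing the returned `html` string to disk is left to the extension.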
