Create skill comparison eval extension with HTML report #296

@ladislas

Description

Summary

Create a Pi extension that can run evaluation scenarios to compare how the agent/harness/model performs on the same task with and without a specific skill enabled.

The goal is to catch regressions where a skill that is meant to improve behavior actually makes outcomes worse.

Motivation

Skills change agent behavior, but today there is no lightweight way to measure whether a skill improves task performance. A simple eval runner would make skill changes safer by producing comparable results and a reviewable report.

Acceptance Criteria

  • Add an extension that can run a task/eval scenario in two modes: with a target skill and without that skill.
  • Capture enough output/metadata to compare the two runs meaningfully.
  • Generate a simple self-contained HTML report summarizing the findings for human review.
  • Keep the first version intentionally small and easy to extend with more eval scenarios later.
  • Document how to run the eval extension and where to find the HTML output.
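The criteria above could be sketched roughly as follows. This is a minimal illustration, not a proposed design: every name here (`RunResult`, `ScenarioRunner`, `compareSkill`, `renderReport`) is hypothetical and nothing is taken from Pi's actual extension API. How a scenario is really executed is left to a caller-supplied function, and the regression rule (baseline passes, skill run fails) is an assumed placeholder for whatever comparison the first version settles on.

```typescript
// Hypothetical shapes for illustration only; not part of any Pi API.
interface RunResult {
  output: string;
  success: boolean;
  durationMs: number;
}

// The caller decides how a task is executed; `skill` is null for the baseline run.
type ScenarioRunner = (task: string, skill: string | null) => Promise<RunResult>;

interface Comparison {
  task: string;
  skill: string;
  withSkill: RunResult;
  withoutSkill: RunResult;
  regression: boolean;
}

// Run the same task twice: once without the skill, once with it.
async function compareSkill(
  task: string,
  skill: string,
  run: ScenarioRunner,
): Promise<Comparison> {
  const withoutSkill = await run(task, null);
  const withSkill = await run(task, skill);
  // Assumed rule: a regression is a baseline pass that turns into a failure.
  const regression = withoutSkill.success && !withSkill.success;
  return { task, skill, withSkill, withoutSkill, regression };
}

// Escape text before embedding it in HTML so run output cannot break the page.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a single HTML string with inline CSS and no external assets.
function renderReport(c: Comparison): string {
  const row = (label: string, r: RunResult): string =>
    `<tr><td>${label}</td><td>${r.success ? "pass" : "fail"}</td>` +
    `<td>${r.durationMs} ms</td><td><pre>${escapeHtml(r.output)}</pre></td></tr>`;
  return [
    "<!doctype html>",
    `<html><head><meta charset="utf-8"><title>Skill eval: ${escapeHtml(c.skill)}</title>`,
    "<style>table{border-collapse:collapse}td,th{border:1px solid #999;padding:4px}</style></head><body>",
    `<h1>${escapeHtml(c.task)}</h1>`,
    `<p>Verdict: ${c.regression ? "regression" : "no regression detected"}</p>`,
    "<table><tr><th>Mode</th><th>Result</th><th>Duration</th><th>Output</th></tr>",
    row("without skill", c.withoutSkill),
    row(`with ${escapeHtml(c.skill)}`, c.withSkill),
    "</table></body></html>",
  ].join("\n");
}
```

Injecting the runner keeps the comparison and report logic independent of the harness, so more scenarios or richer metadata can be added later without changing the report code.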

Metadata


Assignees

No one assigned

    Labels

    enhancement: New feature or request
    pac:ready_for_agent: pac state: fully specified and ready for an AFK agent

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
