piyushbhavsarr/agentpytest

agentpytest

Pytest for AI agents. Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor — no server, no SaaS, no vendor lock.

pip install agentpytest

What it is

agentpytest is a pytest plugin that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama — anything LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.

It works for any agent in any domain — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point agentpytest at it, pick a judge, and you have regression tests.

60-second example

from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    # my_agent is your own agent entry point; it is not provided by agentpytest
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1

pytest                           # diffs against cassette, fails on regression
pytest --update-trajectories     # regenerate cassette after intentional changes
agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json   # CI gate with p-value + effect size
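
To make "p-value + effect size" concrete, here is a minimal sketch of a paired sign-flip permutation test over per-task scores. It uses only the standard library and is not agentpytest's implementation; the function name, the defaults, and the choice of paired Cohen's d as the effect size are all assumptions made for illustration.

```python
import math
import random
import statistics


def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test (illustrative sketch).

    baseline and candidate are per-task scores for the same tasks, in the
    same order. Returns (mean_diff, effect_size, p_value), where effect_size
    is the mean difference divided by the standard deviation of differences.
    """
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must score the same tasks")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = statistics.fmean(diffs)
    rng = random.Random(seed)

    # Under the null hypothesis the sign of each paired difference is
    # arbitrary, so flip signs at random and count how often the resampled
    # mean is at least as extreme as the observed one.
    extreme = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)

    spread = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    if spread > 0:
        effect_size = observed / spread
    elif observed == 0:
        effect_size = 0.0  # identical runs: no difference at all
    else:
        effect_size = math.copysign(float("inf"), observed)
    return observed, effect_size, p_value
```

A CI gate built on this idea would fail only when the p-value is below a chosen threshold and the mean difference is negative, so a single noisy run does not block a merge.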

Why it exists

The agent-eval ecosystem has converged everywhere except the test layer:

| Layer | Owned by |
| --- | --- |
| Model abstraction | LiteLLM |
| Trace storage + UI | Phoenix, Langfuse, Weave |
| Benchmarks / leaderboards | Inspect AI, SWE-bench, τ-bench |
| Hosted eval platforms | Braintrust, LangSmith, Patronus |
| The pytest layer in your repo | agentpytest |

agentpytest fills the missing slot. It's the thing you pip install and run locally — like pytest, like mypy, like any other dev tool you'd commit alongside your code.

What makes it different

  • Pytest-native, vendorable, no server. pip install, write tests, commit cassettes to git. No telemetry, no account, runs offline.
  • Any judge model. Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
  • Agent-model decoupled from judge-model. Your agent runs OpenAI; judge it with Gemini or local Qwen. Independent configs by design.
  • Statistical regression detection in core. Bootstrap and paired-permutation tests with effect size + CI ship in the OSS library, not behind a paywall.
  • Counterfactual replay. fork_from(cassette, span_id, mutate) replays a trajectory up to step N, mutates one tool response, and runs the agent forward — debug "what if this tool had failed" without rerunning the whole trace.
  • Repo harness — eval on YOUR PRs. Point at your repo and a list of past merged PRs; the agent attempts each one, the lib scores its diff against what your team actually shipped. Nothing else does this.
  • TRAIL failure-mode detectors. ~20 named failure modes (tool-call repetition, goal deviation, format error, retry storms, hallucinated tool output, ...) shipped as scorers.
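
The counterfactual replay idea in the list above can be sketched in plain Python. This is not the library's fork_from; the step layout, the fork_and_replay name, and the call_tool callback convention are assumptions chosen to show the mechanism: recorded responses are served up to the fork point, one response is mutated, and every later call goes to live tools.

```python
from typing import Any, Callable

# Hypothetical step layout: {"span_id", "tool", "args", "response"}.
Step = dict[str, Any]


def fork_and_replay(
    steps: list[Step],
    span_id: str,
    mutate: Callable[[Any], Any],
    agent: Callable[[Callable[[str, dict], Any]], Any],
    live_tools: dict[str, Callable[..., Any]],
) -> Any:
    """Replay recorded steps up to span_id, mutate that response, then go live."""
    fork_at = next(i for i, s in enumerate(steps) if s["span_id"] == span_id)
    cursor = 0

    def call_tool(name: str, args: dict) -> Any:
        nonlocal cursor
        i = cursor
        cursor += 1
        if i <= fork_at:
            recorded = steps[i]
            # The agent must retrace the recording until the fork point.
            if (recorded["tool"], recorded["args"]) != (name, args):
                raise RuntimeError(f"agent diverged from the recording at step {i}")
            if i < fork_at:
                return recorded["response"]
            return mutate(recorded["response"])  # the counterfactual
        # Past the fork the recording no longer applies, so call real tools.
        return live_tools[name](**args)

    return agent(call_tool)


def toy_agent(call_tool: Callable[[str, dict], Any]) -> Any:
    """Toy agent: read a file, then either report a failure or write it back."""
    content = call_tool("read_file", {"path": "auth.py"})
    if content is None:
        return call_tool("report_error", {"message": "auth.py missing"})
    return call_tool("write_file", {"path": "auth.py", "content": content.upper()})


RECORDED = [
    {"span_id": "s1", "tool": "read_file",
     "args": {"path": "auth.py"}, "response": "token = 1"},
    {"span_id": "s2", "tool": "write_file",
     "args": {"path": "auth.py", "content": "TOKEN = 1"}, "response": "ok"},
]
```

Forking RECORDED at "s1" with a mutate function that returns None answers "what if the read had failed" while only the error-reporting tool runs live.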

Comparison

Feature agentpytest Inspect AI Phoenix Braintrust DeepEval LangSmith Ragas
Pytest-native, vendorable, no server ⚠️ harness ❌ SaaS ❌ SaaS
Cassette record/replay with tool-lock ⚠️ cache
Mutate-and-replay debugging ⚠️ prompt-only
Statistical CI gate ⚠️ epochs ✅ paid ⚠️ paid
Eval on YOUR repo's past PRs
Any judge (~100 providers) ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Judge ensemble + variance signal
TRAIL failure-mode detectors

Works for any domain

Same library, different rubrics:

| Domain | Example test |
| --- | --- |
| Coding agents | `assert diff_minimality(r, judge=judge).value > 0.8` |
| SRE / incident response | `assert dependency_order(r, expected_dag=runbook).value == 1.0` |
| Outbound sales | `assert can_spam_compliance(r, judge=judge).value > 0.95` |
| Healthcare scheduling | `assert phi_leak_check(r, judge=judge).value == 1.0` |
| Finance / spend | `assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99` |
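
A deterministic scorer such as dependency_order needs no judge at all. The sketch below shows one way such a check could work, assuming the runbook is a mapping from each step to the steps that must precede it. The function name, the input shapes, and the scoring rule (fraction of dependency edges respected) are assumptions, not the library's definition.

```python
def dependency_order_score(calls, expected_dag):
    """Fraction of dependency edges respected by the observed call order.

    calls is the ordered list of tool names the agent invoked.
    expected_dag maps each step to the steps that must run before it.
    An edge counts as satisfied only if both steps ran and the prerequisite
    came first; a step that never ran leaves its edges unsatisfied.
    """
    first_seen = {}
    for position, name in enumerate(calls):
        first_seen.setdefault(name, position)  # keep the earliest occurrence

    edges = [(dep, step) for step, deps in expected_dag.items() for dep in deps]
    if not edges:
        return 1.0  # nothing to violate

    satisfied = sum(
        1
        for dep, step in edges
        if dep in first_seen
        and step in first_seen
        and first_seen[dep] < first_seen[step]
    )
    return satisfied / len(edges)


# Hypothetical incident runbook: restart only after draining and snapshotting.
RUNBOOK = {
    "restart_service": ["drain_traffic", "snapshot_db"],
    "restore_traffic": ["restart_service"],
}
```

With this rule, a score of 1.0 means every prerequisite ran before the step that depends on it, which is why the SRE row in the table asserts equality rather than a threshold.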

A coding-agent harness is built in from day one. The Harness plugin interface lets SRE, sales, healthcare, and finance harnesses (pagerduty, salesforce, epic, netsuite) ship as separate packages.

What you need to provide

| Required | Optional |
| --- | --- |
| An agent function | Expected tool sequence |
| A task / prompt | Custom rubric text |
| A judge model + key | Past PR list (for repo harness) |
| | Domain policy documents |

No labeled corpus. The cassette is the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.
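
To illustrate what "rubrics, not strings" can mean in practice, here is a small sketch of rubric-based judging with an optional ensemble. It does not use agentpytest's Judge class. The model is injected as a plain callable (prompt in, text out) so the example runs offline, and the prompt wording, the SCORE: reply format, and all function names are assumptions.

```python
import re
import statistics
from typing import Callable

# Any function that takes a prompt and returns the model's text reply.
# In real use this could wrap whichever provider client you prefer.
LLM = Callable[[str], str]


def build_prompt(rubric: str, transcript: str) -> str:
    """Ask the judge to grade a transcript against a free-text rubric."""
    return (
        "You are grading an AI agent's run against a rubric.\n"
        f"Rubric: {rubric}\n"
        f"Transcript:\n{transcript}\n"
        "Reply with a single line of the form 'SCORE: <number between 0 and 1>'."
    )


def judge_score(llm: LLM, rubric: str, transcript: str) -> float:
    """Return the judge's score, clamped to the range [0, 1]."""
    reply = llm(build_prompt(rubric, transcript))
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"judge reply had no parseable score: {reply!r}")
    return min(1.0, max(0.0, float(match.group(1))))


def ensemble_score(llms: list[LLM], rubric: str, transcript: str) -> tuple[float, float]:
    """Mean score across judges plus their population standard deviation."""
    scores = [judge_score(llm, rubric, transcript) for llm in llms]
    return statistics.fmean(scores), statistics.pstdev(scores)
```

A large spread across judges is a useful signal in its own right: it suggests the rubric is ambiguous for that run, independent of whether the mean passes the threshold.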

Install & quickstart

pip install agentpytest

export ANTHROPIC_API_KEY=...    # or OPENAI_API_KEY, GEMINI_API_KEY, etc.

# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    # my_agent is your own agent entry point; it is not provided by agentpytest
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8

pytest tests/test_my_agent.py

First run records the cassette. Future runs diff against it.
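
The record-then-diff behavior can be pictured with a few lines of standard-library Python. This is a sketch of the general pattern, not agentpytest's cassette format. The JSON layout (a single "tool_calls" list) and the record_or_diff name are assumptions, and the sketch leaves out regeneration of an existing cassette.

```python
import json
from pathlib import Path


def record_or_diff(cassette_path, tool_calls):
    """Record tool_calls on the first run; on later runs return the differences.

    tool_calls is a JSON-serializable list describing the run, for example
    [{"tool": "search", "args": {"q": "greeting"}}].
    """
    path = Path(cassette_path)
    if not path.exists():
        # First accepted run: this becomes the ground truth.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"tool_calls": tool_calls}, indent=2))
        return []

    recorded = json.loads(path.read_text())["tool_calls"]
    diffs = []
    # Compare step by step; a missing step on either side shows up as None.
    for i in range(max(len(recorded), len(tool_calls))):
        old = recorded[i] if i < len(recorded) else None
        new = tool_calls[i] if i < len(tool_calls) else None
        if old != new:
            diffs.append({"step": i, "recorded": old, "observed": new})
    return diffs
```

An empty list means the run matches the cassette; anything else pinpoints the first step where the trajectory changed, which is usually the most informative place to start debugging.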

Documentation

Domain harnesses

Domain harnesses ship as separate packages — install only the ones you need. Each implements the Harness Protocol from agentpytest.harnesses.base and exposes a convenience evaluate_on_* function that scores agents against the recorded ground truth via the configured Judge.
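
For readers unfamiliar with structural typing, the sketch below shows how a Protocol-based harness and an evaluate_on_* helper might fit together. Everything here is hypothetical: SketchHarness is a stand-in rather than the real Protocol in agentpytest.harnesses.base, its method names are invented for illustration, and a plain score callable takes the place of the configured Judge.

```python
from typing import Any, Callable, Iterable, Protocol, runtime_checkable

Task = dict[str, Any]


@runtime_checkable
class SketchHarness(Protocol):
    """Hypothetical stand-in, NOT the real agentpytest.harnesses.base Protocol."""

    def tasks(self) -> Iterable[Task]: ...

    def ground_truth(self, task: Task) -> Any: ...


class TicketHarness:
    """Toy domain harness: each task is a ticket with a recorded resolution."""

    def __init__(self, tickets: list[Task]) -> None:
        self._tickets = tickets

    def tasks(self) -> Iterable[Task]:
        return list(self._tickets)

    def ground_truth(self, task: Task) -> Any:
        return task["resolution"]


def evaluate_on_tickets(
    harness: SketchHarness,
    agent: Callable[[str], Any],
    score: Callable[[Any, Any], float],
) -> list[float]:
    """Run the agent on every task and score its output against ground truth."""
    return [
        score(agent(task["prompt"]), harness.ground_truth(task))
        for task in harness.tasks()
    ]
```

Because Protocol matching is structural, TicketHarness never imports or subclasses SketchHarness; any class with the same methods qualifies, which is what lets domain harnesses live in separate packages.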

Examples

Runnable example repos, one per domain:

Status

v0.1.0-alpha — core (trajectory_test, cassettes), 5 scorers (goal_completion, right_tool, right_args, redundant_call, dependency_order), regression_test, OTel export. Roadmap in ROADMAP.md.

License

MIT. No CLA. Fork freely.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md. One rule: no PR that adds a server, dashboard, or hosted feature. This stays a library.

Acknowledgements

Built on the shoulders of LiteLLM, agent-vcr, pytest, and OpenTelemetry. TRAIL detector taxonomy inspired by Patronus AI's published research.
