Pytest for AI agents. Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor — no server, no SaaS, no vendor lock.
```bash
pip install agentpytest
```

agentpytest is a pytest plugin that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama, or anything else LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.
It works for any agent in any domain — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point agentpytest at it, pick a judge, and you have regression tests.
```python
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

# `my_agent` is your own agent callable; import it from your project.

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1
```

```bash
pytest                          # diffs against cassette, fails on regression
pytest --update-trajectories    # regenerate cassette after intentional changes
agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json  # CI gate with p-value + effect size
```

The agent-eval ecosystem has converged everywhere except the test layer:

| Layer | Owned by |
|---|---|
| Model abstraction | LiteLLM |
| Trace storage + UI | Phoenix, Langfuse, Weave |
| Benchmarks / leaderboards | Inspect AI, SWE-bench, τ-bench |
| Hosted eval platforms | Braintrust, LangSmith, Patronus |
| The pytest layer in your repo | agentpytest |
agentpytest fills the missing slot. It's the thing you pip install and run locally — like pytest, like mypy, like any other dev tool you'd commit alongside your code.
- Pytest-native, vendorable, no server. `pip install`, write tests, commit cassettes to git. No telemetry, no account, runs offline.
- Any judge model. Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
- Agent-model decoupled from judge-model. Your agent runs OpenAI; judge it with Gemini or local Qwen. Independent configs by design.
- Statistical regression detection in core. Bootstrap and paired-permutation tests with effect sizes and confidence intervals ship in the OSS library, not behind a paywall.
- Counterfactual replay. `fork_from(cassette, span_id, mutate)` replays a trajectory up to step N, mutates one tool response, and runs the agent forward, so you can debug "what if this tool had failed" without rerunning the whole trace.
- Repo harness: eval on YOUR PRs. Point at your repo and a list of past merged PRs; the agent attempts each one, and the library scores its diff against what your team actually shipped. Nothing else does this.
- TRAIL failure-mode detectors. ~20 named failure modes (tool-call repetition, goal deviation, format error, retry storms, hallucinated tool output, ...) shipped as scorers.
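
To make the statistical regression bullet concrete, here is a minimal standard-library sketch of a paired-permutation test plus a percentile bootstrap interval on per-task scores. It is not agentpytest's implementation: the function names, the 0.05 threshold, and the sample scores are all assumptions chosen for illustration.

```python
import random
from statistics import fmean


def paired_differences(baseline, candidate):
    """Per-task score differences (candidate minus baseline)."""
    if len(baseline) != len(candidate):
        raise ValueError("paired comparison needs one score per task in both runs")
    return [c - b for b, c in zip(baseline, candidate)]


def paired_permutation_pvalue(diffs, n_resamples=10_000, seed=0):
    """Two-sided p-value: how often random sign flips give a mean at least
    as extreme as the observed mean difference."""
    rng = random.Random(seed)
    observed = abs(fmean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(fmean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Made-up per-task scores, for illustration only.
baseline = [0.92, 0.88, 0.95, 0.90, 0.91, 0.89, 0.93, 0.94, 0.90, 0.92]
candidate = [0.61, 0.58, 0.66, 0.60, 0.63, 0.57, 0.64, 0.62, 0.59, 0.65]

diffs = paired_differences(baseline, candidate)
effect = fmean(diffs)  # effect size: mean score change per task
p_value = paired_permutation_pvalue(diffs)
ci_low, ci_high = bootstrap_ci(diffs)
is_regression = p_value < 0.05 and ci_high < 0
```

Pairing scores by task removes between-task variance, which is why a paired test can separate a real drop from run-to-run noise with relatively few tasks.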
| Feature | agentpytest | Inspect AI | Phoenix | Braintrust | DeepEval | LangSmith | Ragas |
|---|---|---|---|---|---|---|---|
| Pytest-native, vendorable, no server | ✅ | ❌ | ❌ SaaS | ✅ | ❌ SaaS | ✅ | |
| Cassette record/replay with tool-lock | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | |
| Mutate-and-replay debugging | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | |
| Statistical CI gate | ✅ | ❌ | ✅ paid | ❌ | ❌ | | |
| Eval on YOUR repo's past PRs | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Any judge (~100 providers) | ✅ | ✅ | | | | | |
| Judge ensemble + variance signal | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| TRAIL failure-mode detectors | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
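
The mutate-and-replay row is easiest to picture with a toy model. The sketch below is not the library's `fork_from`: the cassette layout (a `steps` list with `span_id`, `tool`, and `response` keys), the extra `continue_agent` parameter, and `toy_agent` are all hypothetical, chosen only to show the fork, mutate, and continue sequence.

```python
import copy


def fork_from_sketch(cassette, span_id, mutate, continue_agent):
    """Replay recorded steps up to `span_id`, mutate that step's tool
    response, then let the agent continue from the altered history."""
    steps = cassette["steps"]
    index = next(
        (i for i, step in enumerate(steps) if step["span_id"] == span_id), None
    )
    if index is None:
        raise KeyError(f"span_id {span_id!r} not found in cassette")
    # Deep-copy so the recorded cassette is never modified in place.
    history = copy.deepcopy(steps[: index + 1])
    history[-1]["response"] = mutate(history[-1]["response"])
    # Recorded steps after the fork point are dropped; the agent redoes them.
    return history + continue_agent(history)


def toy_agent(history):
    """Stand-in agent: reports an error if the last tool call failed."""
    if history[-1]["response"]["ok"]:
        return [{"span_id": "next", "tool": "write_file", "response": {"ok": True}}]
    return [{"span_id": "next", "tool": "report_error", "response": {"ok": True}}]


recorded = {
    "steps": [
        {"span_id": "s1", "tool": "read_file", "response": {"ok": True}},
        {"span_id": "s2", "tool": "write_file", "response": {"ok": True}},
    ]
}

# What if the first tool call had failed?
forked = fork_from_sketch(
    recorded, "s1", lambda response: {**response, "ok": False}, toy_agent
)
```

Here the forked trajectory ends in `report_error` instead of `write_file`, while the recorded cassette stays untouched.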
Same library, different rubrics:
| Domain | Example test |
|---|---|
| Coding agents | assert diff_minimality(r, judge=judge).value > 0.8 |
| SRE / incident response | assert dependency_order(r, expected_dag=runbook).value == 1.0 |
| Outbound sales | assert can_spam_compliance(r, judge=judge).value > 0.95 |
| Healthcare scheduling | assert phi_leak_check(r, judge=judge).value == 1.0 |
| Finance / spend | assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99 |
A coding-agent harness is built in from day one. Harnesses for SRE, sales, healthcare, and finance (pagerduty, salesforce, epic, netsuite) plug in through the Harness plugin interface and ship as separate packages.
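
Not every rubric needs a judge. As a rough idea of what a rule-based scorer such as the `dependency_order` check in the SRE row might compute, the sketch below scores a tool-call sequence against a runbook DAG. The function name, the DAG format (step name mapped to its prerequisites), and the scoring rule are assumptions, not the library's definition.

```python
def dependency_order_sketch(calls, expected_dag):
    """Fraction of calls that ran only after all of their prerequisites."""
    if not calls:
        return 1.0
    seen = set()
    satisfied = 0
    for name in calls:
        prerequisites = expected_dag.get(name, ())
        if all(p in seen for p in prerequisites):
            satisfied += 1
        seen.add(name)
    return satisfied / len(calls)


# Hypothetical runbook: each step lists the steps that must precede it.
runbook = {
    "restart_service": ["drain_traffic"],
    "restore_traffic": ["restart_service"],
}

in_order = dependency_order_sketch(
    ["drain_traffic", "restart_service", "restore_traffic"], runbook
)
out_of_order = dependency_order_sketch(
    ["restart_service", "drain_traffic", "restore_traffic"], runbook
)
```

With this rule, the in-order run scores 1.0 and the run that restarts before draining loses credit for that one step.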
| Required | Optional |
|---|---|
| An agent function | Expected tool sequence |
| A task / prompt | Custom rubric text |
| A judge model + key | Past PR list (for repo harness) |
| | Domain policy documents |
No labeled corpus. The cassette is the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.
```bash
pip install agentpytest
export ANTHROPIC_API_KEY=...   # or OPENAI_API_KEY, GEMINI_API_KEY, etc.
```

```python
# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

# `my_agent` is your own agent callable; import it from your project.

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8
```

```bash
pytest tests/test_my_agent.py
```

First run records the cassette. Future runs diff against it.
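
The record-then-diff behavior can be sketched in a few lines of standard-library Python. This is an illustration of the idea only: the `record_or_diff` helper, the `tool_calls` JSON layout, and the positional comparison are assumptions, and the real cassette format may differ.

```python
import json
import tempfile
from pathlib import Path


def record_or_diff(cassette_path, tool_calls, update=False):
    """Write the cassette on first run (or when updating); otherwise
    return a list of (index, recorded, current) mismatches."""
    path = Path(cassette_path)
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"tool_calls": tool_calls}, indent=2))
        return []
    recorded = json.loads(path.read_text())["tool_calls"]
    diffs = []
    for i in range(max(len(recorded), len(tool_calls))):
        old = recorded[i] if i < len(recorded) else None
        new = tool_calls[i] if i < len(tool_calls) else None
        if old != new:
            diffs.append((i, old, new))
    return diffs


with tempfile.TemporaryDirectory() as tmp:
    cassette = Path(tmp) / "cassettes" / "hello.json"
    greet = [{"tool": "greet", "args": {"name": "Ada"}}]
    first = record_or_diff(cassette, greet)    # records, nothing to diff
    same = record_or_diff(cassette, greet)     # matches the recording
    changed = record_or_diff(cassette, [{"tool": "search", "args": {"q": "Ada"}}])
```

The `update=True` path plays the role that `pytest --update-trajectories` plays in the commands shown earlier: it overwrites the accepted recording after an intentional change.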
- Quickstart — first test in 5 minutes
- Concepts — trajectories, cassettes, scorers, judges
- Scorers reference — every built-in scorer
- Repo harness guide — eval on your own PRs
- Domain cookbooks — SRE, sales, healthcare, finance, coding
- CI integration — GitHub Actions, GitLab, CircleCI
Domain harnesses ship as separate packages — install only the ones
you need. Each implements the Harness Protocol from
agentpytest.harnesses.base and exposes a convenience evaluate_on_*
function that scores agents against the recorded ground truth via the
configured Judge.
- `agentpytest-harness-pagerduty` (SRE): past resolved incidents → recorded resolutions / post-mortems.
- `agentpytest-harness-salesforce` (sales): past closed-won opportunities → recorded touch sequences.
- `agentpytest-harness-netsuite` (finance): audited transactions → recorded journal entries.
- `agentpytest-harness-fhir` (healthcare): past completed encounters → recorded outcomes. No-egress mode is on by default.
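
For readers who have not used `typing.Protocol`, the sketch below shows the general shape a harness plugin could take. It does not reproduce the actual Protocol in `agentpytest.harnesses.base`: the `load_cases` and `score` method names, the `Case` dataclass, and the exact-match stand-in for the judge are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass
class Case:
    task: str          # what the agent is asked to do
    ground_truth: str  # what was actually recorded for this case


class HarnessSketch(Protocol):
    """Hypothetical interface: any class with these methods qualifies."""

    def load_cases(self) -> Iterable[Case]: ...

    def score(self, case: Case, agent_output: str) -> float: ...


class InMemoryIncidentHarness:
    """Toy harness backed by a list instead of a real incident system."""

    def __init__(self, cases):
        self._cases = list(cases)

    def load_cases(self):
        return self._cases

    def score(self, case, agent_output):
        # Stand-in for a judge call: exact match after normalization.
        same = agent_output.strip().lower() == case.ground_truth.strip().lower()
        return 1.0 if same else 0.0


def evaluate_on_cases(harness: HarnessSketch, agent: Callable[[str], str]) -> float:
    """Mean score of the agent across every case the harness provides."""
    cases = list(harness.load_cases())
    if not cases:
        return 0.0
    return sum(harness.score(c, agent(c.task)) for c in cases) / len(cases)


harness = InMemoryIncidentHarness(
    [
        Case("API latency spike", "restart service"),
        Case("Disk full on db-1", "rotate logs"),
    ]
)
mean_score = evaluate_on_cases(harness, lambda task: "Restart service")
```

Because `Protocol` uses structural typing, a harness package does not need to inherit from a base class; it only needs to provide methods with matching signatures.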
Runnable example repos, one per domain:
- `agentpytest-examples-coding`: Claude Code-style agent
- `agentpytest-examples-sre`: incident-response agent
- `agentpytest-examples-sales`: outbound prospecting
- `agentpytest-examples-healthcare`: appointment booking
- `agentpytest-examples-finance`: expense classification
v0.1.0-alpha — core (trajectory_test, cassettes), 5 scorers (goal_completion, right_tool, right_args, redundant_call, dependency_order), regression_test, OTel export. Roadmap in ROADMAP.md.
MIT. No CLA. Fork freely.
Issues and PRs welcome. See CONTRIBUTING.md. One rule: no PR that adds a server, dashboard, or hosted feature. This stays a library.
Built on the shoulders of LiteLLM, agent-vcr, pytest, and OpenTelemetry. TRAIL detector taxonomy inspired by Patronus AI's published research.