agentpytest

Pytest for AI agents. Record trajectories, score them with any LLM judge, replay counterfactuals, and gate your CI with statistical rigor. No server, no SaaS, no vendor lock-in.

pip install agentpytest

What it is

agentpytest is a pytest plugin that turns agent behavior into regression tests. It records each run as a JSON cassette, scores trajectories with an LLM judge of your choice (Anthropic, OpenAI, Gemini, Groq, Llama, local Ollama — anything LiteLLM supports), and tells you whether a change is a real regression or stochastic noise.

It works for any agent in any domain — coding agents, SRE runbooks, sales outreach, healthcare scheduling, finance automation. You write the agent once, point agentpytest at it, pick a judge, and you have regression tests.

60-second example

from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion, right_tool, redundant_call

from my_project.agent import my_agent  # hypothetical path; import your own agent function here

judge = Judge("gemini/gemini-2.5-pro")

@trajectory_test(cassette="cassettes/refactor.json", epochs=5)
def test_refactor():
    result = my_agent("Refactor auth.py to use TokenStore")
    assert goal_completion(result, judge=judge).value > 0.8
    assert right_tool(result, judge=judge).value > 0.9
    assert redundant_call(result).value < 0.1

pytest                           # diffs against cassette, fails on regression
pytest --update-trajectories     # regenerate cassette after intentional changes
agentpytest regress \
  --baseline runs/main.json \
  --candidate runs/pr-412.json   # CI gate with p-value + effect size
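
The README does not spell out how the gate computes its p-value and effect size. As a rough illustration only, a paired sign-flip permutation test over per-task scores could look like the sketch below. The function name, the score lists, and the choice of Cohen's d_z as the effect size are assumptions for this example, not agentpytest's actual implementation.

```python
import math
import random
from statistics import mean, stdev


def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-task scores."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = mean(diffs)
    rng = random.Random(seed)  # seeded so the result is reproducible in CI

    # Under the null hypothesis the sign of each paired difference is arbitrary,
    # so flip signs at random and count how often the mean is at least as extreme.
    extreme = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)

    # Paired effect size (Cohen's d_z): mean difference over its standard deviation.
    sd = stdev(diffs)
    if sd > 0:
        effect_size = observed / sd
    else:
        effect_size = 0.0 if observed == 0 else math.copysign(math.inf, observed)
    return {"mean_diff": observed, "p_value": p_value, "effect_size": effect_size}


# Hypothetical per-task scores for the same eight tasks on two branches.
baseline = [0.90, 0.85, 0.88, 0.92, 0.87, 0.91, 0.89, 0.86]
candidate = [0.70, 0.66, 0.71, 0.75, 0.65, 0.72, 0.69, 0.68]
result = paired_permutation_test(baseline, candidate)
regressed = result["p_value"] < 0.05 and result["mean_diff"] < 0
print(result, "regression" if regressed else "no significant change")
```

Pairing by task matters here: comparing the same task across the two runs removes task-to-task difficulty from the noise, which is what lets a small number of epochs still separate a real drop from stochastic variation.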

Why it exists

The agent-eval ecosystem has converged everywhere except the test layer:

Layer	Owned by
Model abstraction	LiteLLM
Trace storage + UI	Phoenix, Langfuse, Weave
Benchmarks / leaderboards	Inspect AI, SWE-bench, τ-bench
Hosted eval platforms	Braintrust, LangSmith, Patronus
The pytest layer in your repo	agentpytest

agentpytest fills the missing slot. It's the thing you pip install and run locally — like pytest, like mypy, like any other dev tool you'd commit alongside your code.

What makes it different

Pytest-native, vendorable, no server. pip install, write tests, commit cassettes to git. No telemetry, no account, runs offline.
Any judge model. Anthropic, OpenAI, Gemini, Groq, xAI, DeepSeek, Mistral, local vLLM/Ollama. Swap with one config line. Ensembles supported.
Agent-model decoupled from judge-model. Your agent runs OpenAI; judge it with Gemini or local Qwen. Independent configs by design.
Statistical regression detection in core. Bootstrap and paired-permutation tests with effect size + CI ship in the OSS library, not behind a paywall.
Counterfactual replay. fork_from(cassette, span_id, mutate) replays a trajectory up to step N, mutates one tool response, and runs the agent forward — debug "what if this tool had failed" without rerunning the whole trace.
Repo harness (eval on YOUR PRs). Point at your repo and a list of past merged PRs; the agent attempts each one, and the library scores its diff against what your team actually shipped. None of the tools in the comparison below offer this.
TRAIL failure-mode detectors. ~20 named failure modes (tool-call repetition, goal deviation, format error, retry storms, hallucinated tool output, ...) shipped as scorers.
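
To make the counterfactual-replay idea concrete, here is a minimal, self-contained sketch of the mechanism described above: replay recorded steps up to a chosen span, mutate that span's tool response, then let the agent run forward. The cassette layout (a list of dicts with `span_id`, `tool`, `args`, `response`), the `next_step` callback, and the name `fork_from_sketch` are all assumptions for illustration; the real `fork_from` may be structured quite differently.

```python
from typing import Any, Callable, Optional

Span = dict[str, Any]
ToolCall = tuple[str, dict[str, Any]]


def fork_from_sketch(
    cassette: list[Span],
    span_id: str,
    mutate: Callable[[Any], Any],
    next_step: Callable[[list[Span]], Optional[ToolCall]],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 20,
) -> list[Span]:
    """Replay recorded spans up to span_id, mutate that response, then run live."""
    index = next((i for i, s in enumerate(cassette) if s["span_id"] == span_id), None)
    if index is None:
        raise KeyError(f"span {span_id!r} not found in cassette")

    # Steps before the fork point come from the recording and are not re-executed.
    history = [dict(span) for span in cassette[:index]]
    forked = dict(cassette[index])
    forked["response"] = mutate(forked["response"])
    history.append(forked)

    # From here on the agent picks each next call and the tools actually run.
    while len(history) < max_steps:
        call = next_step(history)
        if call is None:
            break
        name, args = call
        history.append({"span_id": f"fork-{len(history)}", "tool": name,
                        "args": args, "response": tools[name](**args)})
    return history


def toy_agent(history: list[Span]) -> Optional[ToolCall]:
    # Hypothetical policy: roll back after a failed deploy, otherwise stop.
    last = history[-1]
    if last["tool"] == "deploy" and not last["response"]["ok"]:
        return ("rollback", {"env": last["args"]["env"]})
    return None


recorded = [
    {"span_id": "s1", "tool": "run_tests", "args": {}, "response": {"ok": True}},
    {"span_id": "s2", "tool": "deploy", "args": {"env": "staging"}, "response": {"ok": True}},
]
forked_run = fork_from_sketch(
    recorded, "s2",
    mutate=lambda response: {"ok": False, "error": "timeout"},
    next_step=toy_agent,
    tools={"rollback": lambda env: {"ok": True, "env": env}},
)
print([span["tool"] for span in forked_run])
```

In this toy run the recorded deploy succeeded, but the forked trajectory sees a timeout instead, so the agent takes the rollback branch that the original trace never exercised.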

Comparison

Feature	agentpytest	Inspect AI	Phoenix	Braintrust	DeepEval	LangSmith	Ragas
Pytest-native, vendorable, no server	✅	⚠️ harness	❌	❌ SaaS	✅	❌ SaaS	✅
Cassette record/replay with tool-lock	✅	⚠️ cache	❌	❌	❌	❌	❌
Mutate-and-replay debugging	✅	❌	⚠️ prompt-only	❌	❌	❌	❌
Statistical CI gate	✅	⚠️ epochs	❌	✅ paid	❌	⚠️ paid	❌
Eval on YOUR repo's past PRs	✅	❌	❌	❌	❌	❌	❌
Any judge (~100 providers)	✅	✅	⚠️	⚠️	⚠️	⚠️	⚠️
Judge ensemble + variance signal	✅	❌	❌	❌	❌	❌	❌
TRAIL failure-mode detectors	✅	❌	❌	❌	❌	❌	❌
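
The "judge ensemble + variance signal" row is not explained further in this README. One plausible reading, sketched here under that assumption, is to average the scores from several judges and flag the result when they disagree by more than some threshold. The function name, the judge labels, and the 0.15 threshold are hypothetical.

```python
from statistics import mean, pstdev


def ensemble_score(scores_by_judge: dict[str, float], max_spread: float = 0.15) -> dict:
    """Combine per-judge scores into a mean plus a disagreement flag."""
    values = list(scores_by_judge.values())
    if not values:
        raise ValueError("need at least one judge score")
    spread = pstdev(values)  # population standard deviation across judges
    return {
        "value": mean(values),
        "spread": spread,
        "judges_disagree": spread > max_spread,
    }


# Three judges that roughly agree, then two that do not.
agreeing = ensemble_score({"judge_a": 0.90, "judge_b": 0.85, "judge_c": 0.95})
split = ensemble_score({"judge_a": 0.90, "judge_b": 0.20})
print(agreeing)
print(split)
```

A high spread is useful as a signal in its own right: it suggests the rubric is ambiguous for that trajectory, so a pass or fail based on the mean alone deserves less trust.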

Works for any domain

Same library, different rubrics:

Domain	Example test
Coding agents	`assert diff_minimality(r, judge=judge).value > 0.8`
SRE / incident response	`assert dependency_order(r, expected_dag=runbook).value == 1.0`
Outbound sales	`assert can_spam_compliance(r, judge=judge).value > 0.95`
Healthcare scheduling	`assert phi_leak_check(r, judge=judge).value == 1.0`
Finance / spend	`assert policy_adherence(r, judge=judge, policy=spend_policy).value > 0.99`
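
As an illustration of what a deterministic, judge-free rubric such as the `dependency_order` row might check, the sketch below scores a sequence of tool calls against a runbook expressed as a DAG of prerequisites. The function name, the DAG encoding (step mapped to its set of prerequisites), and the fraction-of-edges scoring rule are assumptions made for this example rather than the library's documented behavior.

```python
def dependency_order_score(calls: list[str], expected_dag: dict[str, set[str]]) -> float:
    """Fraction of prerequisite edges that the call sequence respects."""
    # Position of the first occurrence of each tool call.
    first_seen: dict[str, int] = {}
    for position, name in enumerate(calls):
        first_seen.setdefault(name, position)

    edges = [(pre, step) for step, pres in expected_dag.items() for pre in pres]
    if not edges:
        return 1.0

    # An edge is respected when both steps ran and the prerequisite came first.
    respected = sum(
        1
        for pre, step in edges
        if pre in first_seen and step in first_seen and first_seen[pre] < first_seen[step]
    )
    return respected / len(edges)


# Hypothetical incident runbook: drain traffic, restart, then restore traffic.
runbook = {
    "restart_service": {"drain_traffic"},
    "restore_traffic": {"restart_service"},
}
in_order = dependency_order_score(
    ["drain_traffic", "restart_service", "restore_traffic"], runbook
)
out_of_order = dependency_order_score(
    ["restart_service", "drain_traffic", "restore_traffic"], runbook
)
print(in_order, out_of_order)
```

Under this scoring rule, an assertion of `== 1.0` as in the SRE row passes only when every prerequisite in the runbook was honored.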

A coding-agent harness is built in from day one. Harnesses for SRE, sales, healthcare, and finance (pagerduty, salesforce, fhir, netsuite) implement the Harness plugin interface and ship as separate packages.

What you need to provide

Required	Optional
An agent function	Expected tool sequence
A task / prompt	Custom rubric text
A judge model + key	Past PR list (for repo harness)
	Domain policy documents

No labeled corpus. The cassette is the ground truth, captured from the first run you accept. The judge evaluates against rubrics, not strings.

Install & quickstart

pip install agentpytest

export ANTHROPIC_API_KEY=...    # or OPENAI_API_KEY, GEMINI_API_KEY, etc.

# tests/test_my_agent.py
from agentpytest import trajectory_test, Judge
from agentpytest.scorers import goal_completion

from my_project.agent import my_agent  # hypothetical path; import your own agent function here

judge = Judge("anthropic/claude-sonnet-4-6")

@trajectory_test(cassette="cassettes/hello.json", epochs=3)
def test_hello():
    result = my_agent("greet the user politely")
    assert goal_completion(result, judge=judge).value > 0.8

pytest tests/test_my_agent.py

First run records the cassette. Future runs diff against it.
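
The record-then-diff behavior can be pictured with a small standalone sketch. This is not agentpytest's cassette format or code; the JSON layout, the `record_or_diff` name, and the `update` flag (loosely mirroring `--update-trajectories`) are assumptions chosen to show the idea.

```python
import json
import tempfile
from pathlib import Path


def record_or_diff(cassette_path: Path, tool_calls: list[dict], update: bool = False) -> list[str]:
    """Record the trajectory on first run; afterwards return step-level differences."""
    if update or not cassette_path.exists():
        cassette_path.parent.mkdir(parents=True, exist_ok=True)
        cassette_path.write_text(json.dumps({"tool_calls": tool_calls}, indent=2))
        return []

    recorded = json.loads(cassette_path.read_text())["tool_calls"]
    differences = []
    for step in range(max(len(recorded), len(tool_calls))):
        old = recorded[step] if step < len(recorded) else None
        new = tool_calls[step] if step < len(tool_calls) else None
        if old != new:
            differences.append(f"step {step}: recorded={old!r} current={new!r}")
    return differences


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cassettes" / "hello.json"
    run = [{"tool": "greet", "args": {"name": "Ada"}}]
    first = record_or_diff(path, run)      # no cassette yet, so this records it
    second = record_or_diff(path, run)     # same trajectory, no differences
    changed = run + [{"tool": "greet", "args": {"name": "Ada"}}]
    third = record_or_diff(path, changed)  # one extra call shows up as a difference

print(first, second, third)
```

Because the cassette is a plain JSON file, an intentional behavior change shows up as an ordinary diff in code review when the file is regenerated and committed.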

Documentation

Quickstart — first test in 5 minutes
Concepts — trajectories, cassettes, scorers, judges
Scorers reference — every built-in scorer
Repo harness guide — eval on your own PRs
Domain cookbooks — SRE, sales, healthcare, finance, coding
CI integration — GitHub Actions, GitLab, CircleCI

Domain harnesses

Domain harnesses ship as separate packages — install only the ones you need. Each implements the Harness Protocol from agentpytest.harnesses.base and exposes a convenience evaluate_on_* function that scores agents against the recorded ground truth via the configured Judge.

agentpytest-harness-pagerduty — SRE: past resolved incidents → recorded resolutions / post-mortems.
agentpytest-harness-salesforce — sales: past closed-won opportunities → recorded touch sequences.
agentpytest-harness-netsuite — finance: audited transactions → recorded journal entries.
agentpytest-harness-fhir — healthcare: past completed encounters → recorded outcomes. No-egress mode is on by default.
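
The README does not show the Harness Protocol's methods, so the following is only a guess at its general shape: a structural `typing.Protocol` that yields cases with recorded ground truth, plus a convenience function in the spirit of `evaluate_on_*`. Every name here (`Case`, `HarnessSketch`, `load_cases`, `evaluate_on_cases`) is hypothetical, and the exact-match scorer is a stand-in for a configured Judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass
class Case:
    case_id: str
    prompt: str
    ground_truth: str


class HarnessSketch(Protocol):
    """Anything that can produce cases with recorded ground truth."""

    def load_cases(self) -> Iterable[Case]: ...


class InMemoryIncidentHarness:
    """Toy harness backed by a list; a real one would read from an external system."""

    def __init__(self, cases: list[Case]) -> None:
        self._cases = list(cases)

    def load_cases(self) -> Iterable[Case]:
        return iter(self._cases)


def evaluate_on_cases(
    harness: HarnessSketch,
    agent: Callable[[str], str],
    score: Callable[[str, str], float],
) -> dict[str, float]:
    """Run the agent on each case and score its answer against the ground truth."""
    return {
        case.case_id: score(agent(case.prompt), case.ground_truth)
        for case in harness.load_cases()
    }


harness = InMemoryIncidentHarness([
    Case("inc-1", "disk full on db-1", "rotate logs"),
    Case("inc-2", "cert expired on api", "renew certificate"),
])
results = evaluate_on_cases(
    harness,
    agent=lambda prompt: "rotate logs",  # toy agent that always gives one answer
    score=lambda answer, truth: 1.0 if answer == truth else 0.0,  # stand-in for a judge
)
print(results)
```

Using a structural protocol means a harness package does not need to inherit from a base class; it only has to provide the expected methods, which keeps each package installable on its own.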

Examples

Runnable example repos, one per domain:

agentpytest-examples-coding — Claude Code-style agent
agentpytest-examples-sre — incident-response agent
agentpytest-examples-sales — outbound prospecting
agentpytest-examples-healthcare — appointment booking
agentpytest-examples-finance — expense classification

Status

v0.1.0-alpha — core (trajectory_test, cassettes), 5 scorers (goal_completion, right_tool, right_args, redundant_call, dependency_order), regression_test, OTel export. Roadmap in ROADMAP.md.

License

MIT. No CLA. Fork freely.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md. One rule: no PR that adds a server, dashboard, or hosted feature. This stays a library.

Acknowledgements

Built on the shoulders of LiteLLM, agent-vcr, pytest, and OpenTelemetry. TRAIL detector taxonomy inspired by Patronus AI's published research.
