Benchmark harness for evaluating how reliably different agent architectures follow a required sequential execution order as complexity (number of steps N) increases.
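As a rough illustration of what "following a required sequential execution order" could mean in practice, the sketch below scores an observed trace of step numbers against the required sequence 1..N. The function names and the integer trace format are hypothetical and are not taken from this repo; the harness may record and score runs differently.

```python
def follows_required_order(observed: list[int], n_steps: int) -> bool:
    """Return True only if the observed calls are exactly steps 1..N, in order."""
    return observed == list(range(1, n_steps + 1))


def longest_correct_prefix(observed: list[int]) -> int:
    """Count how many leading calls match the required sequence 1, 2, 3, ..."""
    count = 0
    for expected, actual in enumerate(observed, start=1):
        if actual != expected:
            break  # first deviation from the required order
        count += 1
    return count


if __name__ == "__main__":
    trace = [1, 2, 4, 3]  # hypothetical trace: the agent skipped ahead at step 3
    print(follows_required_order(trace, n_steps=4))
    print(longest_correct_prefix(trace))
```

A strict pass/fail check and a prefix length answer different questions: the first says whether a run succeeded, the second says how far the agent got before deviating as N grows.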
Current focus:
- Agent with tools: a single orchestrator agent delegating to N sub-agent tools
- Agent graph: an N-node pydantic-graph where an LLM routes between nodes
- Deep agents: planned

Requirements:
- Python 3.12+
- An OpenAI API key
From the repo root, create a virtual environment:
python -m venv .venv
Activate the virtual environment in your shell, then install the package in editable mode:
pip install -e .
Create .env from .env.example and set environment variables:
OPENAI_API_KEY
Optional (for Logfire telemetry):
LOGFIRE_KEY
LOGFIRE_ENVIRONMENT
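Before a run, it can help to confirm that these variables are actually visible to the process. The following is a minimal sketch using only the standard library; the helper name `missing_vars` is hypothetical, and the sketch does not load `.env` itself, so it assumes the variables have already been exported or loaded by whatever mechanism the harness uses.

```python
import os
from collections.abc import Iterable, Mapping

REQUIRED_VARS = ("OPENAI_API_KEY",)
OPTIONAL_VARS = ("LOGFIRE_KEY", "LOGFIRE_ENVIRONMENT")  # Logfire telemetry only


def missing_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing_required = missing_vars(REQUIRED_VARS)
    if missing_required:
        print("Missing required variables:", ", ".join(missing_required))
    missing_optional = missing_vars(OPTIONAL_VARS)
    if missing_optional:
        print("Logfire telemetry variables not set:", ", ".join(missing_optional))
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary instead of the real process environment.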