Summary
A standardized benchmark system that tests and compares different AI CLI tools on collaborative tasks.
Usage
squad bench --suite api-crud --agents claude,gemini,codex
Output
┌────────────┬──────────┬────────┬───────────┐
│ Agent      │ Time     │ Tests  │ Tokens    │
├────────────┼──────────┼────────┼───────────┤
│ Claude     │ 2m 13s   │ 8/8    │ 12,400    │
│ Gemini     │ 1m 47s   │ 7/8    │ 8,200     │
│ Codex      │ 3m 02s   │ 8/8    │ 15,100    │
└────────────┴──────────┴────────┴───────────┘
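Internally, each row could come from a small per-agent result record. A minimal sketch of that record and the table rendering, assuming simple fields; every name here is illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    """Per-agent outcome for one suite run (illustrative fields)."""
    agent: str
    duration_s: float
    tests_passed: int
    tests_total: int
    tokens: int | None  # None when the tool does not report usage

def render_table(results: list[AgentResult]) -> str:
    """Format results as an aligned text table like the output above."""
    header = f"{'Agent':<10} {'Time':>8} {'Tests':>6} {'Tokens':>8}"
    rows = [header, "-" * len(header)]
    for r in results:
        mins, secs = divmod(int(r.duration_s), 60)
        tokens = f"{r.tokens:,}" if r.tokens is not None else "n/a"
        rows.append(
            f"{r.agent:<10} {f'{mins}m {secs:02d}s':>8} "
            f"{f'{r.tests_passed}/{r.tests_total}':>6} {tokens:>8}"
        )
    return "\n".join(rows)

print(render_table([AgentResult("Claude", 133, 8, 8, 12_400)]))
```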
Benchmark Suites
Predefined task sets (a definition sketch follows the list), e.g.:
- api-crud: Build a REST API with CRUD operations
- bug-fix: Fix a set of known bugs in a test repo
- refactor: Refactor messy code to clean patterns
- collab: Multi-agent task where manager + worker must coordinate
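Each suite could be declared as data: the task prompt, a fixture repo to work in, and the command that grades the result. A minimal sketch of such a registry; the field names and fixture paths are assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Suite:
    """One benchmark suite: what to ask the agent and how to grade it."""
    name: str
    prompt: str               # task handed to each agent verbatim
    fixture_repo: str         # template repo the agent works in
    test_cmd: list[str]       # command whose exit status / report scores the run
    agents_required: int = 1  # >1 for collaboration suites

SUITES = {
    "api-crud": Suite(
        name="api-crud",
        prompt="Build a REST API with CRUD operations for a `tasks` resource.",
        fixture_repo="fixtures/api-crud",
        test_cmd=["pytest", "tests/", "--tb=short"],
    ),
    "collab": Suite(
        name="collab",
        prompt="Coordinate as manager + worker to implement the spec in SPEC.md.",
        fixture_repo="fixtures/collab",
        test_cmd=["pytest", "tests/"],
        agents_required=2,
    ),
}
```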
Metrics
- Completion time
- Test pass rate (predefined test cases per suite)
- Code quality (lint score, complexity)
- Token consumption (if the tool exposes usage data)
- Collaboration efficiency (for multi-agent suites; one possible definition is sketched below)
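Most of these reduce to numbers the harness can compute per run. A sketch of one possible metrics record, including a candidate definition of collaboration efficiency; the fields and that definition are assumptions, not settled design:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    duration_s: float
    tests_passed: int
    tests_total: int
    lint_errors: int
    tokens: int | None = None           # only when the tool reports usage
    messages_sent: int | None = None    # multi-agent suites only
    messages_useful: int | None = None  # e.g. messages that changed worker state

    @property
    def pass_rate(self) -> float:
        return self.tests_passed / self.tests_total if self.tests_total else 0.0

    @property
    def collab_efficiency(self) -> float | None:
        """Share of inter-agent messages that did useful work (one possible definition)."""
        if self.messages_sent in (None, 0) or self.messages_useful is None:
            return None
        return self.messages_useful / self.messages_sent
```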
Why
There is no established benchmark for multi-agent CLI collaboration yet, so this fills a real gap in the ecosystem and produces high-value comparison content.
Complexity
Medium-to-large. Requires automated agent launching, result collection, and test harness integration.
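The core of that harness is a loop that copies the fixture repo, runs one agent CLI against it under a timeout, and then grades the working tree with the suite's test command. A rough sketch, assuming each tool gets an adapter that maps the prompt onto its actual flags (the invocation shown is hypothetical):

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

def run_agent(agent_cmd: list[str], suite, timeout_s: int = 900):
    """Run one agent on one suite in an isolated copy of the fixture repo.

    `suite` is the Suite record from the sketch above (prompt,
    fixture_repo, test_cmd).
    """
    workdir = Path(tempfile.mkdtemp(prefix="squad-bench-"))
    shutil.copytree(suite.fixture_repo, workdir, dirs_exist_ok=True)

    start = time.monotonic()
    try:
        # Hypothetical invocation: each real tool needs its own adapter
        # mapping (prompt, workdir) onto its actual CLI flags.
        subprocess.run(agent_cmd + [suite.prompt], cwd=workdir,
                       timeout=timeout_s, check=False,
                       capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        pass  # score whatever state the agent left behind
    duration = time.monotonic() - start

    # Grade the working tree; exit code 0 means all tests passed.
    graded = subprocess.run(suite.test_cmd, cwd=workdir,
                            capture_output=True, text=True)
    return duration, graded.returncode == 0, graded.stdout
```

A full harness would run this once per (agent, suite) pair and feed the outputs into the metrics record above.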