Thank you for your interest in contributing! This repository is designed to qualitatively compare coding agents and orchestration paradigms across diverse, realistic use cases.
- Ways to Contribute
- Getting Started
- Contributing Agent Runs
- Contributing New Use Cases
- Code Quality Standards
- Evaluation Guidelines
- Pull Request Process
- Community Guidelines
## Ways to Contribute

There are two primary ways to contribute to this project:
- Add Agent/Orchestration Runs: Execute a use case with a specific agent or orchestration paradigm and submit your results
- Add New Use Cases: Design and propose new challenges that test different aspects of agent capabilities
Both types of contributions are valuable and help build a comprehensive comparison framework.
## Getting Started

You'll need:

- **Git**: Version control
- **Python 3.11+**: For Python-based use cases and tooling
- **Node.js 20+**: For JavaScript/TypeScript use cases
- **uv**: Python package manager (`pip install uv`)
- Familiarity with the agent or orchestration tool you plan to use
1. Fork the repository on GitHub
2. Clone your fork locally:

   ```bash
   git clone https://github.com/YOUR-USERNAME/agent-comparison.git
   cd agent-comparison
   ```

3. Add upstream remote:

   ```bash
   git remote add upstream https://github.com/pyros-projects/agent-comparison.git
   ```

4. Explore existing use cases to understand the structure and expectations
## Contributing Agent Runs

Agent runs demonstrate how a specific coding agent or orchestration paradigm handles a use case. Each run is isolated in its own folder.
The fastest way to start a new agent run is using the included setup tool:
```bash
python setup-run.py
```

This interactive CLI will:
- Show available use cases
- Let you choose coding agent vs. orchestration
- Optionally select an orchestration paradigm
- Create a properly structured folder
- Generate a git branch name
- Copy the base files
- Create a feature branch automatically
If you prefer manual setup:

1. Choose a use case (e.g., `01-research-scraper`)
2. Copy the base folder:

   ```bash
   cp -r 01-research-scraper/_base 01-research-scraper/coding_agents/my-agent-name
   ```

   Or for orchestration:

   ```bash
   cp -r 01-research-scraper/_base 01-research-scraper/orchestration/paradigm/agent-name
   ```

3. Create a feature branch:

   ```bash
   git checkout -b run/01-research-scraper/my-agent-name
   ```
1. **Read the prompt carefully**:
   - Located at `_base/prompt.md`
   - Contains goal, requirements, constraints, and success criteria
   - Treat this as your specification
2. **Use your chosen agent/tool**:
   - Execute the task using your coding agent or orchestration approach
   - Keep all work within your run folder
   - Do not modify `_base/` or other runs
3. **Capture the process**:
   - If your tool supports session logs (like `.data/` folders), commit them
   - Document any manual interventions in `EVALUATION_REPORT_*.md`
   - Take screenshots of final UI/outputs
4. **Test thoroughly**:
   - Ensure all Core Requirements are met
   - Test Stretch Goals if implemented
   - Run any included test suites
   - Verify the application works end-to-end
5. **Complete evaluation**:
   - Copy the appropriate evaluation template:

     ```bash
     cp ../../EVALUATION_REPORT_CODING_AGENT.md ./EVALUATION_REPORT.md
     # or
     cp ../../../EVALUATION_REPORT_ORCHESTRATION.md ./EVALUATION_REPORT.md
     ```

   - Fill out all sections honestly:
     - Human interventions (context provided, guidance given)
     - Functional requirements completion
     - Code quality assessment
     - Overall scoring and observations
✅ Do:
- Follow the prompt requirements closely
- Document all human interventions transparently
- Include session logs if available (`.data/` folders, chat transcripts)
- Test the implementation thoroughly
- Complete the evaluation report honestly
- Keep dependencies scoped to your run folder
- Add a README if setup/usage is non-obvious
❌ Don't:
- Cherry-pick the best of multiple attempts (submit your first complete run)
- Modify the base prompt or requirements
- Touch other agents' runs
- Skip documenting manual interventions
- Inflate scores in evaluation
Your run folder should look like:

```
01-research-scraper/coding_agents/my-agent/
├── prompt.md                # Copied from _base
├── pyproject.toml           # Copied from _base
├── EVALUATION_REPORT.md     # Your completed evaluation
├── README.md                # Optional: setup/usage notes
├── .data/                   # Optional: session logs
│   └── 2024-11-23-session.md
├── src/                     # Your implementation
│   └── researcher/
│       ├── __init__.py
│       ├── main.py
│       └── ...
├── tests/                   # Your tests
└── ...                      # Other implementation files
```
## Contributing New Use Cases

New use cases expand the suite's coverage and test different agent capabilities. Great use cases are:
- Realistic: Mirrors real-world development tasks
- Non-trivial: Requires problem-solving, not just boilerplate
- Focused: Tests specific skills (e.g., system design, API integration, algorithm implementation)
- Completable: Achievable in 2-6 hours for a competent developer
- Diverse: Adds something the existing 14 use cases don't cover
- Review existing use cases to avoid duplication
- Identify the skill gap your use case will test
- Consider implementation complexity - should be challenging but reasonable
- Think about evaluation - can success be assessed qualitatively?
Answer these questions:
- What skill does this test? (e.g., async programming, database design, algorithm optimization)
- What's the end deliverable? (CLI tool, web app, library, data pipeline)
- What makes it interesting? (novel constraints, integration challenges, performance requirements)
- How will success be evaluated? (functional criteria, quality expectations)
- Use the next available number (currently 15+)
- Format: `XX-descriptive-name` (e.g., `15-streaming-pipeline`)

Create the directory structure:

```bash
mkdir -p XX-use-case-name/_base
mkdir -p XX-use-case-name/coding_agents
mkdir -p XX-use-case-name/orchestration
```

Copy the template and fill it out:

```bash
cp PROMPT_TEMPLATE.md XX-use-case-name/_base/prompt.md
```

Follow the 7-section structure:
Clear, concise pitch of what's being built.

```markdown
## Goal
Build a real-time WebSocket chat server supporting multiple rooms, user presence,
and message history. The system should handle 100+ concurrent connections and
persist chat history to SQLite.
```

Technical stack, time estimate, hardware assumptions, scope boundaries.
```markdown
## Context & Constraints
**Stack**: Python 3.11+, WebSocket library of choice, SQLite
**Time Estimate**: 3-4 hours
**Scope**: Server-side only; assume clients exist
**Constraints**: Must handle concurrent connections gracefully, store last 100 messages per room
```

Split into Core (must-have) and Stretch (nice-to-have).
```markdown
## Requirements
### Core Requirements
- [ ] WebSocket server accepting connections on configurable port
- [ ] Support for multiple chat rooms (create, join, leave)
- [ ] Real-time message broadcasting within rooms
- [ ] Persist messages to SQLite
- [ ] User presence tracking (online/offline status)

### Stretch Goals
- [ ] Private direct messages between users
- [ ] Typing indicators
- [ ] Message search/filtering
- [ ] Rate limiting per user
```

Architecture, testing, UX, documentation standards.
```markdown
## Quality Expectations
**Architecture**: Clean separation between WebSocket handling, business logic, and persistence
**Testing**: Unit tests for message handling, integration tests for WebSocket flows
**Error Handling**: Graceful handling of disconnects, invalid messages, storage failures
**Documentation**: README with setup, usage examples, architecture overview
```

Research or design steps for complex use cases.
```markdown
## Process
1. **Research**: Compare WebSocket libraries (websockets, aiohttp, Tornado)
2. **Design**: Sketch message protocol and room management logic
3. **Implement**: Start with basic connection handling, then add rooms and persistence
4. **Test**: Write tests as you go, verify concurrent connection handling
```

Concrete checklist of what to submit.
```markdown
## Deliverables
- [ ] Working WebSocket server implementation
- [ ] SQLite schema and persistence layer
- [ ] Unit and integration tests (>70% coverage)
- [ ] README with setup and usage instructions
- [ ] Example client script demonstrating connections
```

3-5 indicators of a successful implementation.
```markdown
## Success Criteria
1. **Functionality**: Server handles 100+ concurrent connections across multiple rooms
2. **Reliability**: Messages are never lost, even during disconnects
3. **Code Quality**: Clean architecture, well-tested, properly handles edge cases
4. **Usability**: Clear README, easy to run and test locally
```

**Prompt Writing Tips:**
- Be specific about requirements but flexible about implementation
- Provide context without prescribing solutions
- Include concrete acceptance criteria where relevant (e.g., "support 100+ connections")
- Balance clarity with creative freedom - agents should problem-solve, not follow a recipe
- Keep it concise (30-50 lines ideal) - avoid over-prescription
- Test different skills than existing use cases
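To make the sample concrete, the "store last 100 messages per room" constraint from the example prompt could be met with a small SQLite layer like this (one possible sketch, not a prescribed solution; table and function names are invented for illustration):

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the messages table if it doesn't exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " room TEXT NOT NULL, sender TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn

def add_message(conn: sqlite3.Connection, room: str, sender: str, body: str) -> None:
    """Insert a message, then trim history beyond the last 100 for this room."""
    conn.execute(
        "INSERT INTO messages (room, sender, body) VALUES (?, ?, ?)",
        (room, sender, body),
    )
    conn.execute(
        "DELETE FROM messages WHERE room = ? AND id NOT IN ("
        " SELECT id FROM messages WHERE room = ? ORDER BY id DESC LIMIT 100)",
        (room, room),
    )

conn = init_db()
for i in range(105):
    add_message(conn, "general", "alice", f"msg {i}")
count = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE room = 'general'"
).fetchone()[0]
print(count)  # 100
```

An agent submitting a run would design its own schema; the point is that the constraint is concrete enough to verify.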
In `_base/`, include only essential boilerplate:
For Python use cases:

```toml
# pyproject.toml
[project]
name = "agent_comparison_XX_use_case"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = []

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv]
dev-dependencies = ["pytest>=7.0", "ruff>=0.1.0"]
```

For Node.js use cases:
```json
{
  "name": "agent-comparison-XX-use-case",
  "version": "0.1.0",
  "type": "module",
  "scripts": {
    "test": "vitest",
    "dev": "vite"
  },
  "devDependencies": {
    "vite": "^5.0.0",
    "vitest": "^1.0.0"
  }
}
```

Include:
- Basic `src/` directory structure (empty or minimal `__init__.py`)
- Config files if stack is predetermined (`.eslintrc`, `tsconfig.json`, etc.)
- `.gitignore` for the language/framework
Don't include:
- Implementation code
- Detailed examples or hints
- Step-by-step guides
Add `XX-use-case-name/README.md` explaining:
```markdown
# XX: Use Case Name

## What This Tests
- **Primary Skills**: [e.g., WebSocket handling, concurrent programming, state management]
- **Secondary Skills**: [e.g., database design, API design, error handling]
- **Complexity**: [Low/Medium/High]

## Why This Is Interesting
[2-3 sentences on what makes this challenging or valuable for comparison]

## Evaluation Criteria
When reviewing implementations, consider:

### Functional Completeness
- Are all Core Requirements met?
- How many Stretch Goals were completed?
- Does the implementation handle edge cases?

### Code Quality
- Is the architecture clean and maintainable?
- Are tests comprehensive and meaningful?
- Is error handling robust?

### Problem-Solving Approach
- How were design decisions made?
- Were appropriate libraries/patterns chosen?
- Is the solution over-engineered or under-engineered?

### Documentation
- Can someone else run and understand the code?
- Are setup instructions clear?
- Is the architecture documented?

## Expected Time Investment
**Typical**: 3-4 hours
**Experienced developers**: 2-3 hours
**Includes**: Research, implementation, testing, documentation
```

For use cases with visual output:
```bash
mkdir -p docs/assets
# Add screenshots showing expected UI/output
```

Reference them in the README:

```markdown
## Expected Output
```
Before submitting:
- Attempt it yourself (or have someone else try)
- Verify the prompt is clear - no ambiguities or missing context
- Check time estimate - is it actually achievable in the stated time?
- Ensure it's different - does it test something the existing use cases don't?
- Validate scaffolding - is there enough to start but not too much?
Add your use case to the list in `/README.md`:

```markdown
| 15 | [WebSocket Chat](15-websocket-chat) | Real-time chat server with rooms, presence, persistence | Python | Medium | 3-4h |
```

See Pull Request Process below.
✅ Good Examples:
- Database-backed REST API - tests CRUD, validation, error handling, persistence
- CLI data processing pipeline - tests file I/O, streaming, error recovery, performance
- Real-time collaborative editor - tests CRDTs, concurrency, conflict resolution
- Code analysis tool - tests AST manipulation, pattern matching, refactoring logic
❌ Avoid:
- Too simple: "Build a TODO list" (too basic, well-worn)
- Too complex: "Build a distributed database" (not completable in reasonable time)
- Too vague: "Create a useful utility" (no clear success criteria)
- Too prescriptive: Step-by-step implementation guide (removes problem-solving)
Before submitting, verify:
- Directory structure created (`XX-name/_base/`, `coding_agents/`, `orchestration/`)
- Prompt follows `PROMPT_TEMPLATE.md` structure (7 sections)
- Prompt is clear, concise (30-50 lines), not overly prescriptive
- Core Requirements are specific and testable
- Stretch Goals are genuinely optional enhancements
- Minimal scaffolding in `_base/` (no implementation code)
- README explains what skills are tested and why it's interesting
- Time estimate is realistic (tested by attempting it yourself)
- Root README updated with use case entry
- Use case tests skills not heavily covered by existing cases
## Code Quality Standards

All contributions (agent runs and use cases) should follow these standards:
Python:
- Python 3.11+ syntax and features
- Type hints for function signatures
- 4-space indentation
- Follow PEP 8 (use `ruff` or `black` for formatting)
- Modules: `snake_case.py`
- Classes: `PascalCase`
- Functions/variables: `snake_case`
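A tiny, made-up module illustrating these conventions (the names are hypothetical, not from the repo):

```python
# report_builder.py -- module file names use snake_case

class ReportBuilder:  # classes use PascalCase
    """Collects lines and renders a titled report."""

    def __init__(self, title: str) -> None:
        self.title = title          # variables use snake_case
        self.lines: list[str] = []  # type hints on signatures and fields

    def add_line(self, text: str) -> None:  # functions use snake_case
        self.lines.append(text)

    def render(self) -> str:
        return "\n".join([self.title, *self.lines])

report = ReportBuilder("Weekly Summary")
report.add_line("All tests passing.")
print(report.render())
```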
JavaScript/TypeScript:
- ES modules (`import`/`export`)
- 2-space indentation
- Use `const`/`let`, avoid `var`
- TypeScript: strict mode, explicit types
- Components: `PascalCase`
- Functions/variables: `camelCase`
- Use ESLint/Prettier if configured
General:
- Clear, intention-revealing names
- Functions should do one thing
- Keep files focused and reasonably sized
- Comments for "why", not "what"
- No commented-out code in final submission
- Unit tests for core logic and utilities
- Integration tests for key workflows
- Coverage: Aim for >70% on critical paths
- Test naming: Descriptive, follows convention (`test_*` for Python, `*.test.ts` for JS)
- Assertions: Clear failure messages
- Mocking: Use mocks/fakes for external dependencies in tests
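For example, a test following these conventions might fake an external HTTP dependency like so (function and client names are illustrative, not from the repo):

```python
from unittest.mock import Mock

# Core logic with the HTTP client injected, so tests can substitute
# a fake instead of hitting the network.
def fetch_title(client, url: str) -> str:
    response = client.get(url)
    response.raise_for_status()
    return response.text.split("<title>")[1].split("</title>")[0]

# Descriptive test name, clear assertion, mocked external dependency.
def test_fetch_title_extracts_text_between_title_tags():
    fake_client = Mock()
    fake_client.get.return_value.text = "<html><title>Hello</title></html>"
    assert fetch_title(fake_client, "https://example.org") == "Hello"

test_fetch_title_extracts_text_between_title_tags()
print("ok")
```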
Every run should include:

1. **README** (if setup is non-obvious):
   - What was built
   - How to install dependencies
   - How to run the application
   - How to run tests
   - Any configuration needed

2. **Code comments** where logic is complex:
   - Why a particular approach was chosen
   - Gotchas or edge cases
   - TODO items for Stretch Goals (if applicable)

3. **EVALUATION_REPORT.md**:
   - Honest assessment of completion
   - Documentation of interventions
   - Scores and observations
- Separation of concerns: Business logic separate from I/O, UI, persistence
- Dependency management: Keep dependencies minimal and scoped to your run
- Configuration: Use environment variables or config files, not hardcoded values
- Error handling: Graceful failures, meaningful error messages
- Security: No committed secrets, tokens, API keys (use `.env.example` instead)
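A minimal sketch of configuration via environment variables (the `MYAPP_*` names are hypothetical; document whatever you use in your `.env.example`):

```python
import os

def load_settings() -> dict:
    """Read configuration from the environment, never from hardcoded values."""
    return {
        "api_key": os.environ.get("MYAPP_API_KEY", ""),       # secret, never in git
        "db_path": os.environ.get("MYAPP_DB_PATH", "app.db"),
        "port": int(os.environ.get("MYAPP_PORT", "8000")),
    }

os.environ["MYAPP_PORT"] = "9001"  # e.g., exported from a local .env file
print(load_settings()["port"])     # 9001
```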
## Evaluation Guidelines

Evaluation is qualitative and focuses on:
1. **Functional Completeness**:
   - Core Requirements met?
   - Stretch Goals attempted/completed?
   - Edge cases handled?

2. **Code Quality**:
   - Architecture: Clean separation, appropriate patterns
   - Testing: Coverage and quality of tests
   - Readability: Clear naming, appropriate comments
   - Error Handling: Robust and user-friendly

3. **Human Intervention**:
   - How much context was needed?
   - How much guidance was provided?
   - Were there manual fixes?

4. **Overall Execution**:
   - Time to completion
   - Number of iterations
   - Quality of problem-solving
Use the provided evaluation templates (`EVALUATION_REPORT_CODING_AGENT.md` or `EVALUATION_REPORT_ORCHESTRATION.md`) and fill them out honestly.
Use cases are evaluated on:
- Clarity: Is the prompt clear and unambiguous?
- Scope: Is it completable in the stated time?
- Value: Does it test skills not covered by other use cases?
- Quality: Is the scaffolding appropriate (minimal but sufficient)?
- Realism: Does it reflect real-world development tasks?
## Pull Request Process

1. **Ensure your branch is up to date**:

   ```bash
   git fetch upstream
   git rebase upstream/main
   ```

2. **Commit your changes**:

   ```bash
   git add .
   git commit -m "feat(use-case-XX): add agent-name run"
   # or
   git commit -m "feat: add use case XX - descriptive name"
   ```

3. **Push to your fork**:

   ```bash
   git push origin your-branch-name
   ```

4. **Open a PR** on GitHub from your fork to `pyros-projects/agent-comparison:main`
Follow Conventional Commits:

For agent runs:

```
feat(01-research-scraper): add codex-cli run
feat(03-image-gen): add orchestration/agent-swarm/gpt4 run
fix(02-text-gen): correct dependency in gemini-cli run
```

For new use cases:

```
feat: add use case 15 - websocket chat server
```
Use this template for your PR description:

## Type of Contribution
- [ ] Agent Run (coding agent or orchestration)
- [ ] New Use Case
- [ ] Documentation/Tooling Fix
## Summary
[Brief description of what you're adding]
## Use Case
**Number/Name**: 01 - Research Scraper
**Agent/Paradigm**: Codex CLI / Agent Swarm + GPT-4
## Checklist
### For Agent Runs:
- [ ] All Core Requirements completed
- [ ] Tests written and passing
- [ ] EVALUATION_REPORT.md completed honestly
- [ ] Session logs included (if available)
- [ ] Changes scoped to my run folder only
- [ ] Dependencies are scoped (not repo-wide)
### For New Use Cases:
- [ ] Prompt follows PROMPT_TEMPLATE.md (7 sections)
- [ ] Prompt is clear and concise (~30-50 lines)
- [ ] Minimal scaffolding only (no implementation)
- [ ] README explains what's tested and why
- [ ] Time estimate is realistic (tested myself)
- [ ] Root README updated
- [ ] Tests a skill gap in existing use cases
## Testing Done
[Commands you ran, what you verified]
Example:
```bash
cd 01-research-scraper/coding_agents/my-agent
uv sync
uv run pytest
uv run python -m researcher.main --query "machine learning"
```

[Add screenshots for UI-based use cases]
[Anything reviewers should know - known limitations, interesting decisions, etc.]
### Review Process
1. **Automated checks** will run (if configured)
2. **Manual review** will assess:
- Code quality and testing
- Adherence to standards
- Completeness of evaluation
- Scope (changes only your run/use case)
3. **Feedback** may request changes
4. **Merge** once approved
### After Your PR is Merged
- Your run/use case is now part of the comparison suite
- Consider trying other use cases with different agents
- Share your experience and insights
---
## Community Guidelines
### Code of Conduct
- **Be respectful**: Constructive feedback only
- **Be honest**: Accurate evaluation and reporting
- **Be collaborative**: Help others when you can
- **Be transparent**: Document interventions and challenges
### Scope Discipline
**Critical for multi-contributor repos:**
- **Work in your own folder** - never modify other runs or `_base/`
- **Scoped dependencies** - install packages in your run, not repo-wide
- **Isolated branches** - one run or use case per branch
- **No cross-contamination** - don't copy/reference other agents' implementations
### Session Logs and Transparency
If your agent/tool generates session logs (e.g., `.data/` folders with conversation transcripts):
- **Do commit them** - valuable for understanding the process
- **Include context** - what prompts you gave, what guidance you provided
- **Document interventions** - if you manually fixed something, note it in EVALUATION_REPORT.md
### Honest Evaluation
This suite is for **qualitative comparison**, not competition. Honest reporting is essential:
- Submit your **first complete run**, not the best of multiple attempts
- Document **all human interventions** (context, prompts, fixes)
- Score **realistically** in evaluations
- Note **what went wrong** as well as what went right
### Getting Help
- **Questions about use cases?** Open an issue labeled `question`
- **Bug in setup tool?** Open an issue labeled `bug`
- **Stuck on implementation?** This is expected - document the challenge in your evaluation
- **Unclear prompt?** Open an issue suggesting clarifications
---
## Examples and References
### Example Agent Run PR
Title: feat(01-research-scraper): add cursor-composer run
Description: Added a coding agent run using Cursor Composer on the Research Scraper use case.
Use Case: 01 - Research Scraper
Agent: Cursor with Composer mode
Checklist:
- ✓ All Core Requirements completed
- ✓ 2 of 4 Stretch Goals completed (graph view, filters)
- ✓ Tests passing (pytest, 75% coverage)
- ✓ EVALUATION_REPORT.md filled out
- ✓ Session logs in .data/ folder
- ✓ Changes scoped to run folder
Testing:
- Ran pytest suite: 12 passed
- Tested end-to-end with real arXiv queries
- Verified all three views render correctly
Notes:
- Agent struggled with graph visualization initially, needed guidance on library choice
- Documented in evaluation report
### Example New Use Case PR
Title: feat: add use case 15 - rate limiter library
Description: New use case testing concurrency, algorithm implementation, and API design.
Summary: Build a generic rate limiting library supporting multiple algorithms (token bucket, sliding window, etc.). Tests concurrent programming, algorithm correctness, and clean API design.
Checklist:
- ✓ Prompt follows template (7 sections, 42 lines)
- ✓ Minimal scaffolding (pyproject.toml only)
- ✓ README explains testing focus
- ✓ Time estimate realistic (3-4h, tested myself)
- ✓ Root README updated
- ✓ Tests algorithm implementation (not covered by other use cases)
Why This Use Case: Current suite lacks low-level algorithm challenges and concurrency testing. Rate limiting is a practical problem requiring correct algorithm implementation, thread safety, and clean API design.
Testing:
- Attempted implementation myself in 3.5 hours
- Prompt is clear and unambiguous
- Core Requirements are achievable, Stretch Goals are genuinely optional
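For a flavor of the algorithm this example use case targets, here is a minimal token-bucket sketch (purely illustrative; use case `_base/` folders must not include implementation code):

```python
import time

class TokenBucket:
    """Allow up to `capacity` burst requests; refill `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Replenish tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Spend one token if available.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2.0)
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

A full submission would also need thread safety and the other algorithms (sliding window, etc.) named in the summary.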
---
## Quick Reference
### Agent Run Workflow
```bash
# 1. Setup
python setup-run.py
# 2. Implement
cd XX-use-case/coding_agents/my-agent
# ... work on implementation ...
# 3. Test
uv run pytest # or npm test
# 4. Evaluate
cp ../../EVALUATION_REPORT_CODING_AGENT.md ./EVALUATION_REPORT.md
# ... fill out evaluation ...
# 5. Submit
git add .
git commit -m "feat(XX-use-case): add my-agent run"
git push origin run/XX-use-case/my-agent
# ... open PR on GitHub ...
```

### New Use Case Workflow

```bash
# 1. Create structure
mkdir -p XX-use-case-name/{_base,coding_agents,orchestration}

# 2. Write prompt
cp PROMPT_TEMPLATE.md XX-use-case-name/_base/prompt.md
# ... fill out 7 sections ...

# 3. Add scaffolding
# Create minimal pyproject.toml or package.json in _base/

# 4. Document
# Create XX-use-case-name/README.md

# 5. Update root
# Add entry to README.md use case table

# 6. Test
# Attempt the use case yourself to validate

# 7. Submit
git checkout -b feat/use-case-XX-name
git add .
git commit -m "feat: add use case XX - descriptive name"
git push origin feat/use-case-XX-name
# ... open PR on GitHub ...
```

- **General questions**: Open an issue with label `question`
- **Prompt clarifications**: Open an issue referencing the use case
- **Tool bugs**: Open an issue with label `bug` and steps to reproduce
- **Use case suggestions**: Open an issue with label `enhancement` describing the idea
Thank you for contributing to the Agent Comparison Suite! Your runs and use cases help build a valuable resource for understanding agent capabilities across diverse real-world challenges.