Thank you for your interest in contributing! This repository is designed to qualitatively compare coding agents and orchestration paradigms across diverse, realistic use cases.
- Ways to Contribute
- Getting Started
- Contributing Agent Runs
- Contributing New Use Cases
- Code Quality Standards
- Evaluation Guidelines
- Pull Request Process
- Community Guidelines
## Ways to Contribute

There are two primary ways to contribute to this project:
- Add Agent/Orchestration Runs: Execute a use case with a specific agent or orchestration paradigm and submit your results
- Add New Use Cases: Design and propose new challenges that test different aspects of agent capabilities
Both types of contributions are valuable and help build a comprehensive comparison framework.
## Getting Started

You'll need:

- **Git**: Version control
- **Python 3.11+**: For Python-based use cases and tooling
- **Node.js 20+**: For JavaScript/TypeScript use cases
- **uv**: Python package manager (`pip install uv`)
- Familiarity with the agent or orchestration tool you plan to use
1. Fork the repository on GitHub
2. Clone your fork locally:

   ```bash
   git clone https://github.com/YOUR-USERNAME/agent-comparison.git
   cd agent-comparison
   ```

3. Add upstream remote:

   ```bash
   git remote add upstream https://github.com/pyros-projects/agent-comparison.git
   ```

4. Explore existing use cases to understand the structure and expectations
## Contributing Agent Runs

Agent runs demonstrate how a specific coding agent or orchestration paradigm handles a use case. Each run is isolated in its own folder.
The fastest way to start a new agent run is using the included setup tool:
```bash
python setup-run.py
```

This interactive CLI will:
- Show available use cases
- Let you choose coding agent vs. orchestration
- Optionally select an orchestration paradigm
- Create a properly structured folder
- Generate a git branch name
- Copy the base files
- Create a feature branch automatically
If you prefer manual setup:

1. Choose a use case (e.g., `01-research-scraper`)
2. Copy the base folder:

   ```bash
   cp -r 01-research-scraper/_base 01-research-scraper/coding_agents/my-agent-name
   ```

   Or for orchestration:

   ```bash
   cp -r 01-research-scraper/_base 01-research-scraper/orchestration/paradigm/agent-name
   ```

3. Create a feature branch:

   ```bash
   git checkout -b run/01-research-scraper/my-agent-name
   ```
1. **Read the prompt carefully**:
   - Located at `_base/prompt.md`
   - Contains goal, requirements, constraints, and success criteria
   - Treat this as your specification
2. **Use your chosen agent/tool**:
   - Execute the task using your coding agent or orchestration approach
   - Keep all work within your run folder
   - Do not modify `_base/` or other runs
3. **Capture the process**:
   - If your tool supports session logs (like `.data/` folders), commit them
   - Document any manual interventions in `EVALUATION_REPORT_*.md`
   - Take screenshots of final UI/outputs
4. **Test thoroughly**:
   - Ensure all Core Requirements are met
   - Test Stretch Goals if implemented
   - Run any included test suites
   - Verify the application works end-to-end
5. **Complete evaluation**:
   - Copy the appropriate evaluation template:

     ```bash
     cp ../../EVALUATION_REPORT_CODING_AGENT.md ./EVALUATION_REPORT.md
     # or
     cp ../../../EVALUATION_REPORT_ORCHESTRATION.md ./EVALUATION_REPORT.md
     ```

   - Fill out all sections honestly:
     - Human interventions (context provided, guidance given)
     - Functional requirements completion
     - Code quality assessment
     - Overall scoring and observations
✅ Do:
- Follow the prompt requirements closely
- Document all human interventions transparently
- Include session logs if available (`.data/` folders, chat transcripts)
- Test the implementation thoroughly
- Complete the evaluation report honestly
- Keep dependencies scoped to your run folder
- Add a README if setup/usage is non-obvious
❌ Don't:
- Cherry-pick the best of multiple attempts (submit your first complete run)
- Modify the base prompt or requirements
- Touch other agents' runs
- Skip documenting manual interventions
- Inflate scores in evaluation
Your run folder should look like:

```
01-research-scraper/coding_agents/my-agent/
├── prompt.md                # Copied from _base
├── pyproject.toml           # Copied from _base
├── EVALUATION_REPORT.md     # Your completed evaluation
├── README.md                # Optional: setup/usage notes
├── .data/                   # Optional: session logs
│   └── 2024-11-23-session.md
├── src/                     # Your implementation
│   └── researcher/
│       ├── __init__.py
│       ├── main.py
│       └── ...
├── tests/                   # Your tests
└── ...                      # Other implementation files
```
## Contributing New Use Cases

New use cases expand the suite's coverage and test different agent capabilities. Great use cases are:
- Realistic: Mirrors real-world development tasks
- Non-trivial: Requires problem-solving, not just boilerplate
- Focused: Tests specific skills (e.g., system design, API integration, algorithm implementation)
- Completable: Achievable in 2-6 hours for a competent developer
- Diverse: Adds something the existing 14 use cases don't cover
- Review existing use cases to avoid duplication
- Identify the skill gap your use case will test
- Consider implementation complexity - should be challenging but reasonable
- Think about evaluation - can success be assessed qualitatively?
Answer these questions:
- What skill does this test? (e.g., async programming, database design, algorithm optimization)
- What's the end deliverable? (CLI tool, web app, library, data pipeline)
- What makes it interesting? (novel constraints, integration challenges, performance requirements)
- How will success be evaluated? (functional criteria, quality expectations)
- Use the next available number (currently 15+)
- Format: `XX-descriptive-name` (e.g., `15-streaming-pipeline`)

Create the directory structure:

```bash
mkdir -p XX-use-case-name/_base
mkdir -p XX-use-case-name/coding_agents
mkdir -p XX-use-case-name/orchestration
```

Copy the template and fill it out:

```bash
cp PROMPT_TEMPLATE.md XX-use-case-name/_base/prompt.md
```

Follow the 7-section structure:
Clear, concise pitch of what's being built.

```markdown
## Goal
Build a real-time WebSocket chat server supporting multiple rooms, user presence,
and message history. The system should handle 100+ concurrent connections and
persist chat history to SQLite.
```

Technical stack, time estimate, hardware assumptions, scope boundaries.
```markdown
## Context & Constraints
**Stack**: Python 3.11+, WebSocket library of choice, SQLite
**Time Estimate**: 3-4 hours
**Scope**: Server-side only; assume clients exist
**Constraints**: Must handle concurrent connections gracefully, store last 100 messages per room
```

Split into Core (must-have) and Stretch (nice-to-have).
```markdown
## Requirements
### Core Requirements
- [ ] WebSocket server accepting connections on configurable port
- [ ] Support for multiple chat rooms (create, join, leave)
- [ ] Real-time message broadcasting within rooms
- [ ] Persist messages to SQLite
- [ ] User presence tracking (online/offline status)

### Stretch Goals
- [ ] Private direct messages between users
- [ ] Typing indicators
- [ ] Message search/filtering
- [ ] Rate limiting per user
```

Architecture, testing, UX, documentation standards.
```markdown
## Quality Expectations
**Architecture**: Clean separation between WebSocket handling, business logic, and persistence
**Testing**: Unit tests for message handling, integration tests for WebSocket flows
**Error Handling**: Graceful handling of disconnects, invalid messages, storage failures
**Documentation**: README with setup, usage examples, architecture overview
```

Research or design steps for complex use cases.
```markdown
## Process
1. **Research**: Compare WebSocket libraries (websockets, aiohttp, Tornado)
2. **Design**: Sketch message protocol and room management logic
3. **Implement**: Start with basic connection handling, then add rooms and persistence
4. **Test**: Write tests as you go, verify concurrent connection handling
```

Concrete checklist of what to submit.
```markdown
## Deliverables
- [ ] Working WebSocket server implementation
- [ ] SQLite schema and persistence layer
- [ ] Unit and integration tests (>70% coverage)
- [ ] README with setup and usage instructions
- [ ] Example client script demonstrating connections
```

3-5 indicators of a successful implementation.
```markdown
## Success Criteria
1. **Functionality**: Server handles 100+ concurrent connections across multiple rooms
2. **Reliability**: Messages are never lost, even during disconnects
3. **Code Quality**: Clean architecture, well-tested, properly handles edge cases
4. **Usability**: Clear README, easy to run and test locally
```

**Prompt Writing Tips:**
- Be specific about requirements but flexible about implementation
- Provide context without prescribing solutions
- Include concrete acceptance criteria where relevant (e.g., "support 100+ connections")
- Balance clarity with creative freedom - agents should problem-solve, not follow a recipe
- Keep it concise (30-50 lines ideal) - avoid over-prescription
- Test different skills than existing use cases
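To make the sample concrete, the "store last 100 messages per room" constraint from the example prompt could be met with a small SQLite layer like this (one possible sketch, not a prescribed solution; table and function names are invented for illustration):

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the messages table if it doesn't exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " room TEXT NOT NULL, sender TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn

def add_message(conn: sqlite3.Connection, room: str, sender: str, body: str) -> None:
    """Insert a message, then trim history beyond the last 100 for this room."""
    conn.execute(
        "INSERT INTO messages (room, sender, body) VALUES (?, ?, ?)",
        (room, sender, body),
    )
    conn.execute(
        "DELETE FROM messages WHERE room = ? AND id NOT IN ("
        " SELECT id FROM messages WHERE room = ? ORDER BY id DESC LIMIT 100)",
        (room, room),
    )

conn = init_db()
for i in range(105):
    add_message(conn, "general", "alice", f"msg {i}")
count = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE room = 'general'"
).fetchone()[0]
print(count)  # 100
```

An agent submitting a run would design its own schema; the point is that the constraint is concrete enough to verify.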
In `_base/`, include only essential boilerplate:
For Python use cases:

```toml
# pyproject.toml
[project]
name = "agent_comparison_XX_use_case"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = []

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv]
dev-dependencies = ["pytest>=7.0", "ruff>=0.1.0"]
```

For Node.js use cases:
```json
{
  "name": "agent-comparison-XX-use-case",
  "version": "0.1.0",
  "type": "module",
  "scripts": {
    "test": "vitest",
    "dev": "vite"
  },
  "devDependencies": {
    "vite": "^5.0.0",
    "vitest": "^1.0.0"
  }
}
```

Include:
- Basic `src/` directory structure (empty or minimal `__init__.py`)
- Config files if stack is predetermined (`.eslintrc`, `tsconfig.json`, etc.)
- `.gitignore` for the language/framework
Don't include:
- Implementation code
- Detailed examples or hints
- Step-by-step guides
Add `XX-use-case-name/README.md` explaining:
```markdown
# XX: Use Case Name

## What This Tests
- **Primary Skills**: [e.g., WebSocket handling, concurrent programming, state management]
- **Secondary Skills**: [e.g., database design, API design, error handling]
- **Complexity**: [Low/Medium/High]

## Why This Is Interesting
[2-3 sentences on what makes this challenging or valuable for comparison]

## Evaluation Criteria
When reviewing implementations, consider:

### Functional Completeness
- Are all Core Requirements met?
- How many Stretch Goals were completed?
- Does the implementation handle edge cases?

### Code Quality
- Is the architecture clean and maintainable?
- Are tests comprehensive and meaningful?
- Is error handling robust?

### Problem-Solving Approach
- How were design decisions made?
- Were appropriate libraries/patterns chosen?
- Is the solution over-engineered or under-engineered?

### Documentation
- Can someone else run and understand the code?
- Are setup instructions clear?
- Is the architecture documented?

## Expected Time Investment
**Typical**: 3-4 hours
**Experienced developers**: 2-3 hours
**Includes**: Research, implementation, testing, documentation
```

For use cases with visual output:
```bash
mkdir -p docs/assets
# Add screenshots showing expected UI/output
```

Reference them in the README:

```markdown
## Expected Output
```
Before submitting:
- Attempt it yourself (or have someone else try)
- Verify the prompt is clear - no ambiguities or missing context
- Check time estimate - is it actually achievable in the stated time?
- Ensure it's different - does it test something the existing use cases don't?
- Validate scaffolding - is there enough to start but not too much?
Add your use case to the list in `/README.md`:

```markdown
| 15 | [WebSocket Chat](15-websocket-chat) | Real-time chat server with rooms, presence, persistence | Python | Medium | 3-4h |
```

See Pull Request Process below.
✅ Good Examples:
- Database-backed REST API - tests CRUD, validation, error handling, persistence
- CLI data processing pipeline - tests file I/O, streaming, error recovery, performance
- Real-time collaborative editor - tests CRDTs, concurrency, conflict resolution
- Code analysis tool - tests AST manipulation, pattern matching, refactoring logic
❌ Avoid:
- Too simple: "Build a TODO list" (too basic, well-worn)
- Too complex: "Build a distributed database" (not completable in reasonable time)
- Too vague: "Create a useful utility" (no clear success criteria)
- Too prescriptive: Step-by-step implementation guide (removes problem-solving)
Before submitting, verify:
- Directory structure created (`XX-name/_base/`, `coding_agents/`, `orchestration/`)
- Prompt follows `PROMPT_TEMPLATE.md` structure (7 sections)
- Prompt is clear, concise (30-50 lines), not overly prescriptive
- Core Requirements are specific and testable
- Stretch Goals are genuinely optional enhancements
- Minimal scaffolding in `_base/` (no implementation code)
- README explains what skills are tested and why it's interesting
- Time estimate is realistic (tested by attempting it yourself)
- Root README updated with use case entry
- Use case tests skills not heavily covered by existing cases
## Code Quality Standards

All contributions (agent runs and use cases) should follow these standards:
Python:
- Python 3.11+ syntax and features
- Type hints for function signatures
- 4-space indentation
- Follow PEP 8 (use `ruff` or `black` for formatting)
- Modules: `snake_case.py`
- Classes: `PascalCase`
- Functions/variables: `snake_case`
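A tiny, made-up module illustrating these conventions (the names are hypothetical, not from the repo):

```python
# report_builder.py -- module file names use snake_case

class ReportBuilder:  # classes use PascalCase
    """Collects lines and renders a titled report."""

    def __init__(self, title: str) -> None:
        self.title = title          # variables use snake_case
        self.lines: list[str] = []  # type hints on signatures and fields

    def add_line(self, text: str) -> None:  # functions use snake_case
        self.lines.append(text)

    def render(self) -> str:
        return "\n".join([self.title, *self.lines])

report = ReportBuilder("Weekly Summary")
report.add_line("All tests passing.")
print(report.render())
```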
JavaScript/TypeScript:
- ES modules (`import`/`export`)
- 2-space indentation
- Use `const`/`let`, avoid `var`
- TypeScript: strict mode, explicit types
- Components: `PascalCase`
- Functions/variables: `camelCase`
- Use ESLint/Prettier if configured
General:
- Clear, intention-revealing names
- Functions should do one thing
- Keep files focused and reasonably sized
- Comments for "why", not "what"
- No commented-out code in final submission
- Unit tests for core logic and utilities
- Integration tests for key workflows
- Coverage: Aim for >70% on critical paths
- Test naming: Descriptive, follows convention (`test_*` for Python, `*.test.ts` for JS)
- Assertions: Clear failure messages
- Mocking: Use mocks/fakes for external dependencies in tests
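For example, a test following these conventions might fake an external HTTP dependency like so (function and client names are illustrative, not from the repo):

```python
from unittest.mock import Mock

# Core logic with the HTTP client injected, so tests can substitute
# a fake instead of hitting the network.
def fetch_title(client, url: str) -> str:
    response = client.get(url)
    response.raise_for_status()
    return response.text.split("<title>")[1].split("</title>")[0]

# Descriptive test name, clear assertion, mocked external dependency.
def test_fetch_title_extracts_text_between_title_tags():
    fake_client = Mock()
    fake_client.get.return_value.text = "<html><title>Hello</title></html>"
    assert fetch_title(fake_client, "https://example.org") == "Hello"

test_fetch_title_extracts_text_between_title_tags()
print("ok")
```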
Every run should include:

1. **README** (if setup is non-obvious):
   - What was built
   - How to install dependencies
   - How to run the application
   - How to run tests
   - Any configuration needed

2. **Code comments** where logic is complex:
   - Why a particular approach was chosen
   - Gotchas or edge cases
   - TODO items for Stretch Goals (if applicable)

3. **EVALUATION_REPORT.md**:
   - Honest assessment of completion
   - Documentation of interventions
   - Scores and observations
- Separation of concerns: Business logic separate from I/O, UI, persistence
- Dependency management: Keep dependencies minimal and scoped to your run
- Configuration: Use environment variables or config files, not hardcoded values
- Error handling: Graceful failures, meaningful error messages
- Security: No committed secrets, tokens, API keys (use `.env.example` instead)
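A minimal sketch of configuration via environment variables (the `MYAPP_*` names are hypothetical; document whatever you use in your `.env.example`):

```python
import os

def load_settings() -> dict:
    """Read configuration from the environment, never from hardcoded values."""
    return {
        "api_key": os.environ.get("MYAPP_API_KEY", ""),       # secret, never in git
        "db_path": os.environ.get("MYAPP_DB_PATH", "app.db"),
        "port": int(os.environ.get("MYAPP_PORT", "8000")),
    }

os.environ["MYAPP_PORT"] = "9001"  # e.g., exported from a local .env file
print(load_settings()["port"])     # 9001
```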
## Evaluation Guidelines

Evaluation is qualitative and focuses on:
1. **Functional Completeness**:
   - Core Requirements met?
   - Stretch Goals attempted/completed?
   - Edge cases handled?

2. **Code Quality**:
   - Architecture: Clean separation, appropriate patterns
   - Testing: Coverage and quality of tests
   - Readability: Clear naming, appropriate comments
   - Error Handling: Robust and user-friendly

3. **Human Intervention**:
   - How much context was needed?
   - How much guidance was provided?
   - Were there manual fixes?

4. **Overall Execution**:
   - Time to completion
   - Number of iterations
   - Quality of problem-solving
Use the provided evaluation templates (`EVALUATION_REPORT_CODING_AGENT.md` or `EVALUATION_REPORT_ORCHESTRATION.md`) and fill them out honestly.
Use cases are evaluated on:
- Clarity: Is the prompt clear and unambiguous?
- Scope: Is it completable in the stated time?
- Value: Does it test skills not covered by other use cases?
- Quality: Is the scaffolding appropriate (minimal but sufficient)?
- Realism: Does it reflect real-world development tasks?
## Pull Request Process

1. **Ensure your branch is up to date**:

   ```bash
   git fetch upstream
   git rebase upstream/main
   ```

2. **Commit your changes**:

   ```bash
   git add .
   git commit -m "feat(use-case-XX): add agent-name run"
   # or
   git commit -m "feat: add use case XX - descriptive name"
   ```

3. **Push to your fork**:

   ```bash
   git push origin your-branch-name
   ```

4. **Open a PR** on GitHub from your fork to `pyros-projects/agent-comparison:main`
Follow Conventional Commits:

For agent runs:

```
feat(01-research-scraper): add codex-cli run
feat(03-image-gen): add orchestration/agent-swarm/gpt4 run
fix(02-text-gen): correct dependency in gemini-cli run
```

For new use cases:

```
feat: add use case 15 - websocket chat server
```
Use this template for your PR description:

## Type of Contribution
- [ ] Agent Run (coding agent or orchestration)
- [ ] New Use Case
- [ ] Documentation/Tooling Fix
## Summary
[Brief description of what you're adding]
## Use Case
**Number/Name**: 01 - Research Scraper
**Agent/Paradigm**: Codex CLI / Agent Swarm + GPT-4
## Checklist
### For Agent Runs:
- [ ] All Core Requirements completed
- [ ] Tests written and passing
- [ ] EVALUATION_REPORT.md completed honestly
- [ ] Session logs included (if available)
- [ ] Changes scoped to my run folder only
- [ ] Dependencies are scoped (not repo-wide)
### For New Use Cases:
- [ ] Prompt follows PROMPT_TEMPLATE.md (7 sections)
- [ ] Prompt is clear and concise (~30-50 lines)
- [ ] Minimal scaffolding only (no implementation)
- [ ] README explains what's tested and why
- [ ] Time estimate is realistic (tested myself)
- [ ] Root README updated
- [ ] Tests a skill gap in existing use cases
## Testing Done
[Commands you ran, what you verified]
Example:
```bash
cd 01-research-scraper/coding_agents/my-agent
uv sync
uv run pytest
uv run python -m researcher.main --query "machine learning"
```

[Add screenshots for UI-based use cases]
[Anything reviewers should know - known limitations, interesting decisions, etc.]
### Review Process
1. **Automated checks** will run (if configured)
2. **Manual review** will assess:
- Code quality and testing
- Adherence to standards
- Completeness of evaluation
- Scope (changes only your run/use case)
3. **Feedback** may request changes
4. **Merge** once approved
### After Your PR is Merged
- Your run/use case is now part of the comparison suite
- Consider trying other use cases with different agents
- Share your experience and insights
---
## Community Guidelines
### Code of Conduct
- **Be respectful**: Constructive feedback only
- **Be honest**: Accurate evaluation and reporting
- **Be collaborative**: Help others when you can
- **Be transparent**: Document interventions and challenges
### Scope Discipline
**Critical for multi-contributor repos:**
- **Work in your own folder** - never modify other runs or `_base/`
- **Scoped dependencies** - install packages in your run, not repo-wide
- **Isolated branches** - one run or use case per branch
- **No cross-contamination** - don't copy/reference other agents' implementations
### Session Logs and Transparency
If your agent/tool generates session logs (e.g., `.data/` folders with conversation transcripts):
- **Do commit them** - valuable for understanding the process
- **Include context** - what prompts you gave, what guidance you provided
- **Document interventions** - if you manually fixed something, note it in EVALUATION_REPORT.md
### Honest Evaluation
This suite is for **qualitative comparison**, not competition. Honest reporting is essential:
- Submit your **first complete run**, not the best of multiple attempts
- Document **all human interventions** (context, prompts, fixes)
- Score **realistically** in evaluations
- Note **what went wrong** as well as what went right
### Getting Help
- **Questions about use cases?** Open an issue labeled `question`
- **Bug in setup tool?** Open an issue labeled `bug`
- **Stuck on implementation?** This is expected - document the challenge in your evaluation
- **Unclear prompt?** Open an issue suggesting clarifications
---
## Examples and References
### Example Agent Run PR
Title: feat(01-research-scraper): add cursor-composer run
Description: Added a coding agent run using Cursor Composer on the Research Scraper use case.
Use Case: 01 - Research Scraper
Agent: Cursor with Composer mode
Checklist:
- ✓ All Core Requirements completed
- ✓ 2 of 4 Stretch Goals completed (graph view, filters)
- ✓ Tests passing (pytest, 75% coverage)
- ✓ EVALUATION_REPORT.md filled out
- ✓ Session logs in .data/ folder
- ✓ Changes scoped to run folder
Testing:
- Ran pytest suite: 12 passed
- Tested end-to-end with real arXiv queries
- Verified all three views render correctly
Notes:
- Agent struggled with graph visualization initially, needed guidance on library choice
- Documented in evaluation report
### Example New Use Case PR
Title: feat: add use case 15 - rate limiter library
Description: New use case testing concurrency, algorithm implementation, and API design.
Summary: Build a generic rate limiting library supporting multiple algorithms (token bucket, sliding window, etc.). Tests concurrent programming, algorithm correctness, and clean API design.
Checklist:
- ✓ Prompt follows template (7 sections, 42 lines)
- ✓ Minimal scaffolding (pyproject.toml only)
- ✓ README explains testing focus
- ✓ Time estimate realistic (3-4h, tested myself)
- ✓ Root README updated
- ✓ Tests algorithm implementation (not covered by other use cases)
Why This Use Case: Current suite lacks low-level algorithm challenges and concurrency testing. Rate limiting is a practical problem requiring correct algorithm implementation, thread safety, and clean API design.
Testing:
- Attempted implementation myself in 3.5 hours
- Prompt is clear and unambiguous
- Core Requirements are achievable, Stretch Goals are genuinely optional
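For a flavor of the algorithm this example use case targets, here is a minimal token-bucket sketch (purely illustrative; use case `_base/` folders must not include implementation code):

```python
import time

class TokenBucket:
    """Allow up to `capacity` burst requests; refill `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Replenish tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Spend one token if available.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2.0)
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

A full submission would also need thread safety and the other algorithms (sliding window, etc.) named in the summary.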
---
## Quick Reference
### Agent Run Workflow
```bash
# 1. Setup
python setup-run.py
# 2. Implement
cd XX-use-case/coding_agents/my-agent
# ... work on implementation ...
# 3. Test
uv run pytest # or npm test
# 4. Evaluate
cp ../../EVALUATION_REPORT_CODING_AGENT.md ./EVALUATION_REPORT.md
# ... fill out evaluation ...
# 5. Submit
git add .
git commit -m "feat(XX-use-case): add my-agent run"
git push origin run/XX-use-case/my-agent
# ... open PR on GitHub ...
```

### New Use Case Workflow

```bash
# 1. Create structure
mkdir -p XX-use-case-name/{_base,coding_agents,orchestration}

# 2. Write prompt
cp PROMPT_TEMPLATE.md XX-use-case-name/_base/prompt.md
# ... fill out 7 sections ...

# 3. Add scaffolding
# Create minimal pyproject.toml or package.json in _base/

# 4. Document
# Create XX-use-case-name/README.md

# 5. Update root
# Add entry to README.md use case table

# 6. Test
# Attempt the use case yourself to validate

# 7. Submit
git checkout -b feat/use-case-XX-name
git add .
git commit -m "feat: add use case XX - descriptive name"
git push origin feat/use-case-XX-name
# ... open PR on GitHub ...
```

- **General questions**: Open an issue with label `question`
- **Prompt clarifications**: Open an issue referencing the use case
- **Tool bugs**: Open an issue with label `bug` and steps to reproduce
- **Use case suggestions**: Open an issue with label `enhancement` describing the idea
Thank you for contributing to the Agent Comparison Suite! Your runs and use cases help build a valuable resource for understanding agent capabilities across diverse real-world challenges.