Code and data for our paper SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving
Please refer to our website for the public leaderboard.
- [June 5, 2024]: We have released SwingArena!
SwingArena is a realistic, CI-driven evaluation framework for LLMs that simulates real-world software development by pairing models as patch submitters and reviewers, enhanced with retrieval-augmented code generation for multi-language support and long-context handling.
SwingArena employs an advanced containerized evaluation architecture that ensures cross-platform reproducibility and consistency. The system core relies on Docker for isolated environment management, combined with CI tools (such as GitHub Actions simulated through act) to achieve real-world software development workflow evaluation.
SwingArena consists of five core modules that work together to create a complete software engineering benchmark pipeline:
```mermaid
graph LR
    A[collect] --> B[prepare]
    B --> C[inference]
    C --> D[harness]
    D --> E[statistics]
    E --> A

    subgraph "Data Pipeline"
        A
        B
    end

    subgraph "Evaluation Pipeline"
        C
        D
        E
    end
```
**collect**
- Purpose: Mine and filter high-quality GitHub repositories and pull requests
- Key Functions: Repository selection from top PyPI packages, PR collection with CI test validation, LLM-based quality filtering, expert rule-based validation
- Outputs: Task instances with issues, patches, and test cases
**prepare**
- Purpose: Process and index collected data for efficient retrieval
- Key Functions: Repository cloning and management, BM25 search index construction, multi-stage quality filtering (CI, annotation, content), dataset validation and testing
- Integration: Builds indexes used by `inference` for context-aware generation
**inference**
- Purpose: Generate patches and solutions using various AI models
- Key Functions: API model support (e.g. OpenAI GPT, Anthropic Claude), local Llama model inference, live GitHub issue solving, retrieval-augmented code generation
- Integration: Uses prepared datasets and indexes from `prepare`
**harness**
- Purpose: Evaluate model performance through CI-driven testing
- Key Functions: Dual-agent battle mode (patch submitter vs reviewer), CI workflow simulation, patch and test validation, Docker-based isolated execution
- Integration: Validates patches through real CI environments, similar to `collect` filtering
**statistics**
- Purpose: Analyze results and provide insights for dataset improvement
- Key Functions: Performance metric analysis, difficulty and clarity assessment, token usage and cost tracking, dataset quality reporting
- Integration: Provides feedback to improve `collect` filtering criteria (quality loop)
Before getting started, please ensure your system meets the following requirements:
- Docker: Follow the official Docker installation guide to install Docker Engine. Linux users should also complete the post-installation steps for the best experience.
- Hardware Configuration: Recommended `x86_64` architecture machine with at least 120GB available storage, 16GB RAM, and 8 CPU cores (`arm64` support is still experimental)
- Python Environment: Python 3.8+ and related dependency packages
SwingArena integrates multiple cutting-edge technologies:
AI Model Integration: Supports various large language model APIs (OpenAI, Anthropic, etc.) and local model serving, with a flexible model proxy system for seamless switching.
Retrieval-Augmented Generation: Built-in BM25 retriever provides precise relevant information retrieval for long-context code generation, supporting multi-language codebase indexing (Python, Rust, C++, Go, JavaScript, TypeScript, PHP, etc.).
Distributed Evaluation: Adopts multi-process parallel evaluation architecture with Modal cloud execution support, dynamically adjusting worker processes based on system resources (recommended not to exceed min(0.75 * os.cpu_count(), 24)).
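The recommended worker cap can be computed directly from that formula; a small sketch (the formula is from the text above, the helper name is ours):

```python
import os

def recommended_workers(cap: int = 24, fraction: float = 0.75) -> int:
    """Recommended number of parallel evaluation workers:
    min(0.75 * cpu_count, 24), clamped to at least one worker."""
    cpus = os.cpu_count() or 1
    return max(1, min(int(fraction * cpus), cap))

print(recommended_workers())
```

On an 8-core machine this yields 6 workers; on a 64-core machine the cap of 24 applies.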
Arena Mechanism: Pioneering dual-agent battle evaluation mode where one agent acts as a patch submitter and another as a code reviewer, simulating real collaborative development scenarios.
Data Processing Pipeline: Complete data collection, annotation, and evaluation pipeline with automated GitHub repository issue collection and PR analysis, multi-round annotation quality control, CI-driven validation, and detailed performance metrics analysis.
To build SwingArena from source, follow these steps:
```bash
git clone https://github.com/menik1126/Swing-Bench.git
cd Swing-Bench
pip install -e .
```

For complete SwingArena functionality including agent battles and CI simulation:

```bash
pip install -e ".[ci-tools]"
```

This single command will:
- ✅ Install all Python dependencies (including Docker SDK, YAML parser)
- 🐳 Automatically install Docker (on supported Linux distributions)
- 🔧 Automatically install `act` (GitHub Actions local runner)
- 🔗 Set up pre-commit hooks
💡 How it works:
- First installs Python packages
- Then automatically detects the `[ci-tools]` extra and installs system tools
- On macOS/Windows, uses Homebrew/Chocolatey when available
To skip system tools installation:
```bash
pip install -e ".[ci-tools]" --install-option="--skip-ci-tools"
# Then install manually later:
python install_ci_tools.py
```
⚠️ Note: On macOS/Windows, you may need to install Docker Desktop manually if package managers (brew/choco) are not available.
If you plan to use BM25 retrieval for code search (used by the prepare and inference modules), you'll need Java 21+:
Installation:
```bash
# Using conda (recommended)
conda install openjdk=21

# Set environment variables (add to ~/.bashrc or ~/.zshrc)
export JVM_PATH=$CONDA_PREFIX/lib/jvm/lib/server/libjvm.so
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/jvm/lib/server:$LD_LIBRARY_PATH
```

Alternative installation methods:
- Ubuntu/Debian: `sudo apt-get install openjdk-21-jdk`
- macOS: `brew install openjdk@21`
- Windows: Download from Adoptium or use `choco install openjdk21`
💡 Note: Java is required for the `pyserini` library used in BM25 indexing and retrieval. Without it, you can still use other SwingArena features but won't be able to build search indexes or use retrieval-augmented generation.
Prerequisites:
- Git (required for repository operations)
- Docker (required for act to run GitHub Actions and containerized environments)
- sudo/admin privileges (for system-level tool installation)
Alternative Installation Methods:
If the automatic installation doesn't work, use the dedicated installer:
```bash
python install_ci_tools.py
```

Manual Installation (if automatic fails):
Docker Installation:
- Linux (Ubuntu/Debian):

```bash
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
```

- Linux (CentOS/RHEL):

```bash
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce docker-ce-cli containerd.io
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker $USER
```

- macOS: Download Docker Desktop or `brew install --cask docker`
- Windows: Download Docker Desktop or use Chocolatey/winget
act Installation:
- Linux: `curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash`
- macOS: `brew install act`
- Windows: `choco install act-cli` or `winget install nektos.act`
Verify CI tools installation:
```bash
python install_ci_tools.py --check
```

Expected output after successful CI tools installation:

```
🔍 Checking CI tools installation status...
act (GitHub Actions): ✅ Installed
Docker: ✅ Installed
Git: ✅ Installed
Python docker: ✅ Installed
Python yaml: ✅ Installed
📊 Overall status: ✅ All tools ready
```
SwingArena uses environment variables for API keys, paths, and configuration. All variables can be set in a .env file, via shell export, or passed inline when running scripts.
```bash
cp .env.example .env
```

The core battle / inference features require an OpenAI-compatible LLM endpoint:
```bash
# LLM endpoint (any OpenAI-compatible API)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-xxx
LLM_MODEL=gpt-4

# Tokenizer for token counting (HuggingFace model name)
# Use "gpt2" for OpenAI models, or match the model family for others
LLM_TOK_MODEL=gpt2
```

Common provider examples:
| Provider | `LLM_BASE_URL` | `LLM_MODEL` example |
|---|---|---|
| OpenAI | `https://api.openai.com/v1` | `gpt-4` |
| DashScope | `https://dashscope.aliyuncs.com/compatible-mode/v1` | `qwen-max-latest` |
| DeepSeek | `https://api.deepseek.com/v1` | `deepseek-chat` |
| Local vLLM | `http://localhost:8000/v1` | model path or name |
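All of these endpoints speak the same chat-completions protocol, which is what makes them interchangeable behind `LLM_BASE_URL`. As a rough illustration (this is not SwingArena's internal client code, just a sketch of the request shape), the configured variables map onto a request like:

```python
import json
import os

def build_chat_request(prompt: str):
    """Assemble an OpenAI-compatible /chat/completions request from
    the same variables configured in .env. Illustrative sketch only."""
    base_url = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": "Bearer " + os.environ.get("LLM_API_KEY", "sk-xxx"),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": os.environ.get("LLM_MODEL", "gpt-4"),
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

url, headers, body = build_chat_request("Summarize this GitHub issue.")
print(url)
```

Switching providers only changes the URL and credentials; the payload stays the same.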
```bash
# Temporary workspace for CI evaluation runs
SWING_TESTBED_PATH=/path/to/testbed

# Directory containing cloned repositories
SWING_REPOS_DIR_PATH=/path/to/repos

# Directory containing BM25 search indexes
SWING_INDEXES_PATH=/path/to/indexes

# CI tool: "act" (local GitHub Actions) or "cargo" (Rust only)
CI_TOOL_NAME=act
```

Path recommendations:
- Use absolute paths
- Ensure sufficient disk space (~10 GB per language for repos)
- Directories are auto-created by `scripts/setup_env.sh`, or create them manually: `mkdir -p /path/to/{testbed,repos,indexes}`
```bash
# Java (required by pyserini for BM25 retrieval; auto-detected if on PATH)
# JAVA_HOME=/path/to/jdk-21

# Direct path to libjvm.so (overrides JAVA_HOME; used by pyjnius).
# Only needed when pyjnius cannot auto-detect the JVM, e.g. in conda envs.
# JVM_PATH=/usr/lib/jvm/java-21-openjdk-amd64/lib/server/libjvm.so

# API keys for inference / collect modules (not needed for agent_battle)
# OPENAI_API_KEY=sk-xxx
# ANTHROPIC_API_KEY=sk-ant-xxx
# GITHUB_TOKEN=ghp_xxx
# GITHUB_TOKENS=ghp_token1,ghp_token2,ghp_token3
```

Git author/committer identity is used when applying patches to repositories during battle evaluation. If not set, `git commit` may fail in clean environments (e.g. fresh containers) where no global git config exists.
```bash
GIT_AUTHOR_NAME=SwingBench
GIT_AUTHOR_EMAIL=swingbench@local
GIT_COMMITTER_NAME=SwingBench
GIT_COMMITTER_EMAIL=swingbench@local
```

These are already set with default values in `.env.example`. Override them if you need commits attributed to a specific identity.
💡 Tip: Run `source scripts/setup_env.sh` to load `.env` and auto-detect Java. The script only sets defaults for variables that are not already exported.
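That "defaults only for unset variables" behavior is the same contract as `os.environ.setdefault`; a toy illustration (not the script itself):

```python
import os

def apply_defaults(defaults: dict) -> None:
    """Set each variable only if it is not already exported,
    mirroring how scripts/setup_env.sh treats .env defaults."""
    for key, value in defaults.items():
        os.environ.setdefault(key, value)

os.environ["CI_TOOL_NAME"] = "cargo"   # the user's explicit choice wins
apply_defaults({"CI_TOOL_NAME": "act", "TURNS": "1"})
print(os.environ["CI_TOOL_NAME"], os.environ["TURNS"])
```

Here the pre-exported `CI_TOOL_NAME=cargo` survives, while the unset `TURNS` picks up its default.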
🔒 Security: Never commit your `.env` file to version control. It contains sensitive API keys.
SwingArena automatically downloads datasets from Hugging Face when needed. You can also load them manually:
```python
from datasets import load_dataset

# Load the main SwingBench dataset
dataset = load_dataset('SwingBench/SwingBench', split='test')

# Or load language-specific datasets
languages = ['rust', 'cpp', 'python', 'go', 'java', 'javascript', 'php']
swingbench = {}
for lang in languages:
    swingbench[lang] = load_dataset('SwingBench/SwingBench-data', split=lang)
```

Now let's run a simple evaluation to verify everything works. This requires two steps:
⚠️ Important Prerequisites:
- Docker must be running (check with `docker ps`)
- This will download the dataset from HuggingFace (~500MB) and clone repositories from GitHub (~100MB)
- First run will build Docker images and may take 5-10 minutes
First, clone the repositories needed for evaluation:
```bash
python swingarena/prepare/swing_clone_repos.py \
    --dataset_path SwingBench/SwingBench \
    --repo_root_dir ./repos
```

What this does:
- Downloads SwingBench dataset from HuggingFace
- Clones all repositories from the dataset to `./repos`
- Checks out the correct commits for each repository
Note: This will clone all repositories in the dataset (~10GB total). The cloning process may take 10-30 minutes depending on network speed.
Now run the evaluation harness:
```bash
python -m swingarena.harness.run_evaluation \
    --dataset_name SwingBench/SwingBench \
    --split test \
    --predictions_path gold \
    --src_folder ./repos \
    --target_dir ./testbed \
    --report_dir ./report \
    --concurrent_workers 1 \
    --instance_ids pypa__pipenv-6240
```

What this does:
- Loads the `pypa__pipenv-6240` instance from SwingBench (Python project with 22 CI jobs)
- Copies the repository from `./repos` to an isolated testbed
- Applies the gold patch (correct fix) and test patch
- Runs CI tests using GitHub Actions (via the `act` tool)
Expected output:
```
Loading dataset...
Copying repository to testbed...
Running CI tests...
✅ Evaluation complete - results in ./report/
```
If successful, you're ready to use SwingArena! 🎉
💡 Note: All 100 instances in SwingBench include both patches and test patches, with full CI configurations across 4 languages (Python, Rust, Go, C++).
⚠️ Prerequisites: You must first complete the Data Preparation step to clone repositories.
Evaluate model predictions on SwingArena using the evaluation harness:
```bash
python -m swingarena.harness.run_evaluation \
    --dataset_name SwingBench/SwingBench \
    --split test \
    --predictions_path <path_to_predictions> \
    --src_folder ./repos \
    --target_dir ./testbed \
    --report_dir ./report \
    --concurrent_workers <num_workers>
# use --predictions_path 'gold' to verify the gold patches
```

Key Parameters:
- `--dataset_name`: Dataset to use (default: `SwingBench/SwingBench`)
- `--split`: Dataset split to use (test, train, etc.)
- `--predictions_path`: Path to predictions file, or `gold` for gold patches
- `--src_folder`: Directory containing cloned repositories (from prepare step)
- `--target_dir`: Isolated testbed directory for running evaluations
- `--report_dir`: Directory for evaluation results and logs
- `--concurrent_workers`: Number of parallel workers (recommended: `min(0.75 * os.cpu_count(), 24)`)
- `--instance_ids`: Specific instance IDs to evaluate (space-separated)
- `--timeout`: Timeout in seconds for each instance (default: 600)
Output: This command generates:
- Docker build logs in `logs/build_images/`
- Evaluation logs in `logs/run_evaluation/`
- Final results in `evaluation_results/`
To see all available options:
```bash
python -m swingarena.harness.run_evaluation --help
```

Warning
Resource Requirements
- Recommended: `x86_64` machine with at least 120GB free storage, 16GB RAM, 8 CPU cores
- For Docker Desktop: Increase virtual disk space to ~120GB
- Adjust `--concurrent_workers` based on available resources
- `arm64` support is experimental
The SwingArena repository can help you:
- Train your own models on our pre-processed datasets
- Run inference on existing models (local models like LLaMA, or API models like GPT-4)
- Run SwingArena's data collection procedure on your own repositories
⚠️ IMPORTANT: This step is required before running evaluations. The harness needs pre-cloned repositories to run CI tests.
The prepare module helps you clone repositories and build search indexes. This is required for:
- All evaluation runs (harness needs local repositories)
- Arena Battle mode (retrieval-augmented patch generation)
- Model inference with code search
- Working with custom datasets
- Java 21+ (for BM25 index building, see installation guide)
- Sufficient disk space (repos can be large, ~10GB per language)
Clone repositories from the SwingBench dataset or your custom task instances:
```bash
cd swingarena/prepare

# Clone from SwingBench dataset
python swing_clone_repos.py \
    --dataset_path SwingBench/SwingBench \
    --repo_root_dir /path/to/repos

# Or from a local .jsonl file
python swing_clone_repos.py \
    --dataset_path /path/to/task-instances.jsonl \
    --repo_root_dir /path/to/repos
```

What this does:
- Downloads repositories from GitHub based on task instances
- Checks out the correct commit for each instance
- Organizes repos by the `owner__repo` naming convention
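The naming convention is a direct mapping from the GitHub slug; a hypothetical helper (not part of the SwingArena codebase) makes it concrete:

```python
def repo_dir_name(repo: str) -> str:
    """Map a GitHub 'owner/repo' slug to the 'owner__repo' directory
    naming convention used for cloned repositories.
    Illustrative helper, not SwingArena's actual code."""
    owner, _, name = repo.partition("/")
    return f"{owner}__{name}"

print(repo_dir_name("pypa/pipenv"))  # → pypa__pipenv
```

The same convention appears in instance IDs such as `pypa__pipenv-6240`, which append the PR number.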
Build search indexes for fast code retrieval:
```bash
cd swingarena/prepare

# Build indexes for SwingBench dataset
python swing_build_index.py \
    --dataset_path SwingBench/SwingBench \
    --repo_root_dir /path/to/repos \
    --output_dir /path/to/indexes

# Or specify a language/subset
python swing_build_index.py \
    --dataset_path /path/to/task-instances.jsonl \
    --repo_root_dir /path/to/repos \
    --output_dir /path/to/indexes \
    --sub_dataset_identifier Python
```

Parameters:
- `--dataset_path`: Path to dataset or HuggingFace dataset name
- `--repo_root_dir`: Directory containing cloned repositories
- `--output_dir`: Where to save the BM25 indexes
- `--sub_dataset_identifier`: Optional language filter (`python`, `rust`, `go`, `cpp` - case insensitive)
What this does:
- Parses source code files in each repository
- Builds BM25 indexes for fast text search
- Saves indexes to disk for use by inference/arena modules
Index Structure:
```
indexes/
├── python_index/
├── rust_index/
└── ...
```
💡 Note: Index building can take 1-2 hours for the full SwingBench dataset. You can build indexes for specific languages to save time.
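In production SwingArena builds and queries these indexes with `pyserini`. To show what BM25 ranking actually does at retrieval time, here is a toy pure-Python version of the scoring function (illustrative only, not the project's code):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score tokenized docs against a tokenized query with BM25:
    term-frequency saturation (k1) plus document-length
    normalization (b), weighted by inverse document frequency."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()
    for d in docs:
        df.update(set(d))                      # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(s)
    return scores

docs = [
    "def parse_lockfile".split(),
    "fn resolve dependency graph".split(),
    "def parse_lockfile fix parse_lockfile errors".split(),
]
scores = bm25_scores("parse_lockfile".split(), docs)
print(scores.index(max(scores)))  # index of the best-matching document
```

The document with more query-term occurrences ranks first, while documents without the term score zero; real indexes apply the same formula over whole repositories.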
The inference module generates patches/solutions using AI models. This step comes after data preparation if you're using retrieval-augmented generation.
Generate solutions with OpenAI, Anthropic, or other API providers:
```bash
cd swingarena/inference

python -m swingarena.inference.run_api \
    --dataset_name_or_path SwingBench/SwingBench \
    --split test \
    --model_name_or_path gpt-4 \
    --output_dir /path/to/output \
    --max_cost 1.0
```

Key Parameters:
- `--dataset_name_or_path`: Dataset to use (HuggingFace name or local .jsonl)
- `--model_name_or_path`: Model identifier (gpt-4, claude-3-opus, etc.)
- `--output_dir`: Where to save generated patches
- `--max_cost`: Maximum API cost in USD (stops when reached)
- `--instance_ids`: Specific instances to run (optional)
Run inference with local models like LLaMA:
```bash
python -m swingarena.inference.run_llama \
    --dataset_name_or_path SwingBench/SwingBench \
    --model_name_or_path /path/to/llama-model \
    --output_dir /path/to/output
```

To use code search for better context (requires prepared data):
Prerequisites: Configure environment variables in your `.env` file (see Environment Configuration)
```bash
# Run inference with retrieval
python -m swingarena.inference.run_api \
    --dataset_name_or_path SwingBench/SwingBench \
    --model_name_or_path gpt-4 \
    --output_dir /path/to/output \
    --use_retrieval
```

SwingArena will automatically use `SWING_REPOS_DIR_PATH` and `SWING_INDEXES_PATH` from your `.env` file.
For more details, see the inference README.
SwingArena's dual-agent battle evaluation mode allows you to compare two AI models in a competitive programming environment.
Prerequisites:
- Complete Data Preparation (see Data Preparation section above)
- Configure `.env` (see Environment Configuration section)
Quick Start (recommended):
```bash
# 1. Load environment (auto-detects Java, creates dirs)
source scripts/setup_env.sh

# 2. Run battle — all config comes from .env / env vars
bash scripts/run_battle.sh
```

Override any parameter via environment variables:
```bash
LLM_MODEL=gpt-4 LLM_BASE_URL=https://api.openai.com/v1 \
DATASET_NAME=SwingBench/SwingBench BATTLE_LANGUAGE=rust \
bash scripts/run_battle.sh
```

Running directly (without the script):
```bash
python swingarena/harness/agent_battle.py \
    --dataset_name SwingBench/SwingBench \
    --split test \
    --src_folder $SWING_REPOS_DIR_PATH \
    --retriever_index_dir $SWING_INDEXES_PATH \
    --workdir $SWING_TESTBED_PATH \
    --ci_tool_name act \
    --base_url_lhs https://api.openai.com/v1 \
    --api_key_lhs $LLM_API_KEY \
    --model_lhs gpt-4 \
    --tok_model_lhs gpt2 \
    --base_url_rhs https://api.openai.com/v1 \
    --api_key_rhs $LLM_API_KEY \
    --model_rhs gpt-4 \
    --tok_model_rhs gpt2 \
    --turns 1
```

Battle Parameters:
| Parameter | Description | Default |
|---|---|---|
| `--dataset_name` | HuggingFace dataset name or local `.jsonl` path | `SwingBench/SwingBench` |
| `--language` | Language filter | `rust` |
| `--split` | Dataset split | `test` |
| `--src_folder` | Directory containing cloned repositories | `$SWING_REPOS_DIR_PATH` |
| `--retriever_index_dir` | Directory containing BM25 search indexes | `$SWING_INDEXES_PATH` |
| `--workdir` | Temporary workspace for CI runs | `$SWING_TESTBED_PATH` |
| `--ci_tool_name` | CI tool (`act` or `cargo`) | `act` |
| `--model_lhs/rhs` | LLM model names for the two agents | — |
| `--base_url_lhs/rhs` | OpenAI-compatible API endpoints | — |
| `--api_key_lhs/rhs` | API keys for the two agents | — |
| `--tok_model_lhs/rhs` | HuggingFace tokenizer names | — |
| `--turns` | Number of battle rounds | `1` |
| `--port_range` | Port range for `act` artifact server | `10000-11000` |
| `--retrieve_file_num` | Number of files retrieved via BM25 for context | `10` |
| `--agent_retry_times` | Max retry attempts when an agent LLM call fails | `3` |
| `--max_chunk_num` | Max code chunks kept after reranking for LLM context | `16` |
| `--max_instances` | Max dataset instances to process (`0` = all) | `0` |
When using run_battle.sh, these parameters are configured via environment variables:
| Env Variable | Maps to | Default |
|---|---|---|
| `LLM_BASE_URL` | `--base_url_lhs/rhs` | `http://localhost:8000/v1` |
| `LLM_API_KEY` | `--api_key_lhs/rhs` | `no-api-key` |
| `LLM_MODEL` | `--model_lhs/rhs` | `Qwen/Qwen2.5-Coder-7B-Instruct` |
| `LLM_TOK_MODEL` | `--tok_model_lhs/rhs` | `Qwen/Qwen2.5-7B-Instruct` |
| `DATASET_NAME` | `--dataset_name` | `SwingBench/SwingBench` |
| `BATTLE_LANGUAGE` | `--language` | `python` |
| `SPLIT` | `--split` | `test` |
| `CI_TOOL` | `--ci_tool_name` | `act` |
| `TURNS` | `--turns` | `1` |
| `PORT_RANGE` | `--port_range` | `10000-11000` |
| `RETRIEVE_FILE_NUM` | `--retrieve_file_num` | `10` |
| `AGENT_RETRY_TIMES` | `--agent_retry_times` | `3` |
| `MAX_CHUNK_NUM` | `--max_chunk_num` | `16` |
| `MAX_INSTANCES` | `--max_instances` | `1` |
| `RERANKER_GPU` | GPU id for CodeBERT reranker | `0` |
| `ACT_TIMEOUT_SECONDS` | Timeout per `act` CI job (for matrix jobs) | `7200` (2h) |
| `ACT_MATRIX_FILTER` | Additional `--matrix` filters for `act` (e.g. `os:ubuntu-latest,python-version:3.10`) | empty (run full workflow matrix) |
| `ACT_PLATFORM_OVERRIDES` | Extra `-P` image mappings for `act` (e.g. `node:16-bullseye-slim=my/node:16-with-tools`) | empty |
Many GitHub Actions workflows in SwingBench use matrix jobs. A matrix job means:
- You describe a set of parameters (for example different OS and Python versions)
- CI automatically runs the same job once for each parameter combination
For example, a matrix like this:
```yaml
strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
    python-version: [3.8, 3.9, 3.10, 3.11, 3.12]
```

expands to many jobs like:
- Ubuntu + Python 3.8
- Ubuntu + Python 3.9
- ...
- Windows + Python 3.12
Each combination runs the same CI steps, but on a different environment. This is great for compatibility, but very slow to simulate locally with act.
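The expansion is just a Cartesian product over the matrix dimensions; a short sketch using the example matrix above:

```python
from itertools import product

# The example matrix from the workflow above.
matrix = {
    "os": ["ubuntu-latest", "macos-latest", "windows-latest"],
    "python-version": ["3.8", "3.9", "3.10", "3.11", "3.12"],
}

# CI runs the same job once per combination of dimension values.
jobs = [dict(zip(matrix, combo)) for combo in product(*matrix.values())]

print(len(jobs))   # 3 OS values x 5 Python versions = 15 jobs
print(jobs[0])
```

Fifteen jobs from a four-line matrix is exactly why simulating the full matrix locally with `act` is slow.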
ACT_MATRIX_FILTER lets you override the full workflow matrix at runtime and tell act to only run a subset of matrix combinations when SwingBench calls it.
- The value is a comma-separated list of `key:value` pairs, where `key` matches a matrix dimension (for example `os` or `python-version`)
- For each `key:value` pair, SwingBench adds a corresponding `--matrix key:value` flag to the `act` command
For example, setting:
```bash
ACT_MATRIX_FILTER=os:ubuntu-latest,python-version:3.10
```

results in an `act` invocation like:
```bash
act ... \
    --matrix os:ubuntu-latest \
    --matrix python-version:3.10
```

In practice this means:
- The GitHub Actions workflow can still define a large matrix (many OS × Python versions)
- But when running under SwingBench with `ACT_MATRIX_FILTER` set, `act` will only execute the single filtered combination instead of the full matrix
- This is very useful for speeding up local evaluation or debugging, while keeping the workflow itself unchanged
If you want full matrix coverage (all combinations defined in the workflow), simply leave ACT_MATRIX_FILTER empty or comment it out; SwingBench will then run the workflow’s complete matrix without additional --matrix filters.
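The translation from the env variable to `act` flags described above can be sketched as follows (an illustrative reimplementation, not SwingBench's actual code):

```python
def matrix_filter_to_flags(filter_value: str) -> list:
    """Turn an ACT_MATRIX_FILTER value such as
    'os:ubuntu-latest,python-version:3.10' into the --matrix flags
    appended to the act command line. An empty value yields no flags,
    so act runs the workflow's full matrix."""
    flags = []
    for pair in filter_value.split(","):
        pair = pair.strip()
        if pair:
            flags += ["--matrix", pair]
    return flags

print(matrix_filter_to_flags("os:ubuntu-latest,python-version:3.10"))
```

An unset or empty filter produces an empty flag list, matching the full-matrix default.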
Some workflows define a custom container: image for their jobs (e.g. node:16-bullseye-slim). On GitHub-hosted runners the surrounding VM already has tools like curl, git, and bash pre-installed, so the workflow works fine. However, when act runs the job it executes everything inside that container image, and minimal images often lack these tools, causing failures like curl: command not found.
ACT_PLATFORM_OVERRIDES lets you remap those images to your own versions that have the missing tools installed — without modifying the workflow files themselves.
How to use:
- Build a custom base image once on your machine (Docker must be installed):
```bash
cd Swing-Bench
bash scripts/build_act_base_image.sh
```

By default this uses `BASE_IMAGE=node:16-bullseye-slim` and produces an image tagged `swingbench/base-with-tools` that has extra tools (curl, git, ca-certificates, build-essential, etc.) installed. You can override the base image or target tag:
```bash
BASE_IMAGE=python:3.11-slim TARGET_TAG=swingbench/python311-with-tools \
bash scripts/build_act_base_image.sh
```

- Set the environment variable (in `.env` or `.env.example`):
```bash
ACT_PLATFORM_OVERRIDES=node:16-bullseye-slim=swingbench/base-with-tools
```

Multiple mappings can be comma-separated, for example:
```bash
ACT_PLATFORM_OVERRIDES=node:16-bullseye-slim=my/node:16-tools,python:3.9-slim=my/python:3.9-tools
```

SwingBench converts each pair into an `act` `-P original=replacement` flag, so `act` will use your enhanced image instead of the original.
If ACT_PLATFORM_OVERRIDES is empty or not set, no extra -P flags are added and the default image mappings are used.
Run evaluations on the cloud using Modal to avoid local setup:
```bash
# Note: Modal evaluation requires using the modal_eval module
python -m swingarena.harness.modal_eval.run_evaluation_modal \
    --predictions_path gold \
    --instance_ids tokio-rs__tokio-6978
```

Note
--instance_ids tokio-rs__tokio-6978Note
Modal for SwingArena is currently experimental and may not be fully supported.
This workflow shows how to use all five SwingArena modules to create and evaluate custom datasets. Follow these steps in order:
```mermaid
graph LR
    A[collect] --> B[prepare]
    B --> C[inference]
    C --> D[harness]
    D --> E[statistics]
    E -.feedback.-> A
```
Mine GitHub repositories and create task instances:
```bash
# Set your GitHub token
export GITHUB_TOKEN=$(gh auth token)  # Or set it manually

python swingarena/collect/get_tasks_pipeline.py \
    --repos owner/repo-name \
    --path_prs ./collected_data/prs \
    --path_tasks ./collected_data/tasks \
    --max_pulls 100
```

Key Parameters:
- `--repos`: GitHub repository to collect from (format: `owner/repo-name`)
- `--path_prs`: Directory to save PR data
- `--path_tasks`: Directory to save task instances
- `--max_pulls`: Maximum number of PRs to process (optional)
What this does:
- Collects pull requests from specified GitHub repositories
- Filters PRs with passing CI tests
- Extracts problem statements, patches, and test cases
- Saves task instances in `.jsonl` format
Output: task-instances.jsonl containing collected issues
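In `.jsonl` output, each line is one JSON object describing a task instance. The exact SwingBench schema may differ; the field names below are purely illustrative, based on the artifacts this section says are extracted (problem statement, patch, test case):

```python
import json
import os
import tempfile

# Hypothetical task-instance record (field names are assumptions,
# not the real SwingBench schema).
instance = {
    "instance_id": "pypa__pipenv-6240",          # owner__repo-PRnumber
    "problem_statement": "Issue text describing the bug...",
    "patch": "diff --git a/... (gold fix)",
    "test_patch": "diff --git a/tests/... (CI tests)",
}

path = os.path.join(tempfile.gettempdir(), "task-instances.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(instance) + "\n")         # one instance per line

# Reading it back:
with open(path) as f:
    records = [json.loads(line) for line in f]
print(records[0]["instance_id"])
```

Downstream modules (`prepare`, `inference`, `harness`) all accept such a local `.jsonl` file via their `--dataset_path` / `--dataset_name_or_path` flags.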
Environment Variables:
- `GITHUB_TOKEN`: Required for GitHub API access (get it from `gh auth token` or github.com/settings/tokens)
For more details, see the collect README.
See the Data Preparation section above for detailed instructions.
Quick commands:
```bash
cd swingarena/prepare

# Clone repositories
python swing_clone_repos.py \
    --dataset_path /path/to/task-instances.jsonl \
    --repo_root_dir /path/to/repos

# Build BM25 indexes
python swing_build_index.py \
    --dataset_path /path/to/task-instances.jsonl \
    --repo_root_dir /path/to/repos \
    --output_dir /path/to/indexes
```

Output: Cloned repositories and BM25 search indexes
See the Model Inference section above for detailed instructions.
Quick commands:
```bash
cd swingarena/inference

# Generate patches with API models
python -m swingarena.inference.run_api \
    --dataset_name_or_path /path/to/task-instances.jsonl \
    --model_name_or_path gpt-4 \
    --output_dir /path/to/predictions \
    --max_cost 1.0
```

Output: `predictions.jsonl` containing model-generated patches
Evaluate the generated patches using CI-driven testing:
```bash
python -m swingarena.harness.run_evaluation \
    --dataset_name /path/to/task-instances.jsonl \
    --predictions_path /path/to/predictions.jsonl \
    --src_folder /path/to/repos \
    --target_dir /path/to/testbed \
    --report_dir /path/to/report \
    --concurrent_workers 4
```

What this does:
- Copies repositories from `src_folder` to an isolated testbed
- Applies model-generated patches
- Runs CI tests (GitHub Actions via `act`, or Cargo tests)
- Records pass/fail results
Output: Evaluation results in `report_dir`
See Basic Usage for more evaluation options.
Generate performance metrics and insights:
```bash
cd swingarena/statistics
python arena_stats.py --arena_log_dir /path/to/evaluation_results
```

What this does:
- Calculates pass rates and success metrics
- Analyzes difficulty and clarity correlations
- Tracks token usage and API costs
- Generates reports for dataset quality assessment
Output: Statistical reports and visualizations
Use insights from the analysis (step 5) to improve your data collection criteria (step 1):
- Adjust difficulty thresholds
- Filter by clarity scores
- Refine repository selection
- Update quality criteria
This creates a feedback loop for continuous dataset improvement.
We've also written the following blog posts on how to use different parts of SwingBench. If you'd like to see a post about a particular topic, please let us know via an issue.
- [Nov 1, 2023] Collecting Evaluation Tasks for SwingArena (🔗)
1. "act: command not found"
- Ensure `/usr/local/bin` is in your PATH
- Reinstall: `python install_ci_tools.py --force`
2. "Docker daemon not running"
- Start the Docker service: `sudo systemctl start docker` (Linux)
- Start Docker Desktop (macOS/Windows)
3. Permission denied errors
- Add your user to the docker group: `sudo usermod -aG docker $USER`
- Log out and back in
For detailed troubleshooting, see CI_TOOLS_SETUP.md.
If you find our work helpful, please use the following citation.
```bibtex
@article{xu2025swingarena,
  title={SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving},
  author={Xu, Wendong and Xiong, Jing and Zhao, Chenyang and Chen, Qiujiang and Wang, Haoran and Shen, Hui and Wan, Zhongwei and Dai, Jianbo and Wu, Taiqiang and Xiao, He and others},
  journal={arXiv preprint arXiv:2505.23932},
  year={2025}
}
```
MIT. See LICENSE.md.

