A workspace is the portable unit of an Agent-SWE benchmark. It contains enough information to recreate the task, run the evaluator, and inspect the oracle solution, while keeping the evaluated agent away from hidden files.
The usual shape is:
tasks/
└── owner-repo-1234/
├── workspace.yaml
├── patch.diff
├── deletion_patch.diff
├── test_patch.diff
├── tests/
├── run_tests.sh
├── evaluate.sh
└── README.md
Not every task has every file. Synthetic tasks usually have deletion_patch.diff. Real PR tasks usually do not.
workspace.yaml is the task manifest. It tells the evaluator which repository to use, which commit to start from, how to install dependencies, and which commands define success.
Important sections:
task_id: owner-repo-1234
repo:
url: https://github.com/owner/repo.git
base_commit: abc123
merge_commit: abc123
path: /workspace/repo
language: python
prompt: Restore the deleted behavior...
environment:
image: platformnetwork/swe-forge:owner-repo-1234
repo_path: /workspace/repo
forge_path: /workspace/forge
tests_path: /workspace/forge/tests
install:
commands:
- pip install -e .
working_dir: /workspace/repo
tests:
fail_to_pass:
- pytest tests/test_target.py -v
pass_to_pass:
- pytest tests/ -v
working_dir: /workspace/repo
synthetic:
source_type: synthetic_feature_deletion
deletion_patch_file: deletion_patch.diff
strategy: feature_deletionFor real PR tasks, the synthetic block may be missing. For synthetic tasks, it records how the broken state is created.
patch.diff is the oracle solution.
For a real PR task, it usually comes from the original pull request. For a synthetic feature-deletion task, it is the inverse of deletion_patch.diff.
This file must be hidden from the solving agent. It is useful for:
- validating that the task is solvable;
- checking the oracle score;
- exporting training data;
- comparing a model patch with the known repair.
It is not part of the agent's context.
deletion_patch.diff exists for synthetic tasks. It is the patch that creates the bug by deleting or disabling a known behavior.
The evaluator applies it before running the before/after checks:
- checkout the base commit;
- apply
deletion_patch.diff; - confirm
fail_to_passtests fail; - apply the candidate patch;
- confirm all required tests pass.
This file must also be hidden from the agent. It can reveal exactly what was removed.
The tests/ directory contains generated or extracted benchmark tests. test_patch.diff stores the original test patch when the task came from a PR or test-generation flow.
Depending on the task, tests may be referenced directly from the repository, copied into /workspace/forge/tests, or extracted from a diff.
The important rule is that tests define the scoring contract:
fail_to_passtests prove the target behavior was broken and then restored;pass_to_passtests protect existing behavior from regressions.
evaluate.sh is the local binary scorer. It prints:
{"score": 1}or:
{"score": 0}Score 1 means the expected fail-to-pass transition happened and regression tests passed. Score 0 means at least one phase failed.
The script is intentionally simple so validators can inspect it, run it locally, or wrap it in a larger scoring system.
run_tests.sh is a helper script for humans. It prints the task configuration and can run a basic Docker verification mode.
Use it when debugging a workspace manually. Use evaluate.sh or the Python evaluator for official scoring.
Hidden vs Visible Files
In a real benchmark run, the agent should work in the repository checkout, not in the forge directory.
Recommended visibility:
| Path | Agent visibility | Reason |
|---|---|---|
/workspace/repo |
Yes | The agent needs the codebase. |
/workspace/forge/patch.diff |
No | Contains the solution. |
/workspace/forge/deletion_patch.diff |
No | Reveals the synthetic mutation. |
/workspace/forge/workspace.yaml |
Usually no | Can contain task metadata and test commands. |
/workspace/forge/tests |
Depends on benchmark mode | Public tests may be visible; hidden reward tests should not be. |
For Platform-style challenge validation, the safe default is to hide the entire forge directory from the agent and let the validator mount it only during scoring.