
Commit 1ea6d38

committed
Add notes on bash-only eval
1 parent 80fced6 commit 1ea6d38

1 file changed

+17
-0
lines changed

templates/pages/bash-only.html

Lines changed: 17 additions & 0 deletions
@@ -29,6 +29,23 @@ <h2>Overview</h2>
<p>The original SWE-bench benchmark aims to evaluate arbitrary systems on their ability to resolve GitHub issues. Currently, top-performing systems represent a wide variety of AI scaffolds, from simple LM agent loops, to RAG systems, to multi-rollout and review-type systems. Each of these systems is a valid solution to the problem of resolving GitHub issues.</p>

<p>However, when we first created SWE-bench, we were primarily interested in evaluating LMs. To make an apples-to-apples comparison of LMs easier, we've introduced the <strong>SWE-bench Bash Only</strong> leaderboard. In this setting, we use our <a href="https://github.com/SWE-agent/mini-swe-agent">mini-SWE-agent</a> package to evaluate LMs in a minimal bash environment. No tools, no special scaffold structure; just a simple <a href="https://arxiv.org/abs/2210.03629">ReAct</a> agent loop. Results on SWE-bench Bash Only represent state-of-the-art LM performance when given just a bash shell and a problem.</p>

<details>
<summary>Details</summary>
<ul>
<li>We use <a href="https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/extra/swebench.yaml">this configuration</a> for all models.</li>
<li>The LM temperature is set to 0.0 if the temperature parameter is supported.</li>
<li><a href="https://mini-swe-agent.com/latest/usage/swebench/">This guide</a> shows how to run the evaluation yourself.</li>
<li>
Small changes in the setup and configuration are captured by the version number in the leaderboard.
Version numbers correspond to tags in the mini-SWE-agent repository.
Because the mini-SWE-agent repository also contains other components, a new version number does not necessarily mean that anything relevant to the bash-only leaderboard setting has changed.
We do <em>not</em> aim to tune the configuration and setup to reach ever-higher scores.
Instead, we only make general fixes to the framework and clarify the prompt, so that the evaluation setup is as fair as possible for the LMs.</li>
</ul>
</details>
</section>

</div>
