Add Orze MLE-bench Lite results (90.91%, #1 on Lite) by warlockee · Pull Request #144 · openai/mle-bench

warlockee · 2026-05-14T22:48:53Z

Leaderboard Submission Disclosure

The scaffold/harness used in this leaderboard submission does not expose to the agent signals derived from the held-out test set during rollouts.

Hello MLE-Bench team,

We would like to submit evaluation results for Orze, our open-source autonomous ML engineering framework (pip install orze, source).

We understand submissions are currently paused (04-24-2026 note). We are opening this PR now so the results are on record, and are happy to wait for your updated process before merging.

Summary

Framework: Orze (open-source)
LLMs: Claude Opus 4 (strategic research advisor + code generation) + Moonshot v1 (idea generation, plateau analysis)
Hardware: 1× NVIDIA H100 80GB GPU, 24 hours wall-clock
Lite score (Low == Lite Any Medal %): 90.91 (single seed, so SEM is not reported, per your guidelines; we acknowledge the 3-seed requirement and will resubmit if needed once the submission process reopens)
Medals: 20/22 — 11 Gold, 5 Silver, 4 Bronze (22/22 above median)
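
To illustrate why a single-seed run carries no SEM, here is a minimal sketch of how a mean and standard error could be computed over seeds. The helper name `mean_and_sem` is hypothetical and is not part of Orze or mlebench; it assumes each seed yields one any-medal percentage.

```python
import math
import statistics
from typing import Optional, Sequence


def mean_and_sem(scores: Sequence[float]) -> tuple[float, Optional[float]]:
    """Return (mean, SEM); SEM is None when fewer than two seeds are given."""
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        # The sample standard deviation is undefined for a single run,
        # so there is no meaningful SEM to report.
        return mean, None
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# The single seed reported in this PR yields a mean but no SEM.
print(mean_and_sem([90.91]))
```

With three seeds, the same helper would return a numeric SEM equal to the sample standard deviation divided by the square root of 3.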

Per-Competition Results (Lite-22)

Competition	Medal
aerial-cactus-identification	Gold
aptos2019-blindness-detection	Gold
denoising-dirty-documents	Gold
detecting-insults-in-social-commentary	Gold
dog-breed-identification	Above Median
dogs-vs-cats-redux-kernels-edition	Gold
histopathologic-cancer-detection	Gold
jigsaw-toxic-comment-classification-challenge	Gold
leaf-classification	Silver
mlsp-2013-birds	Bronze
nomad2018-predict-transparent-conductors	Silver
plant-pathology-2020-fgvc7	Gold
random-acts-of-pizza	Silver
ranzcr-clip-catheter-line-classification	Bronze
siim-isic-melanoma-classification	Bronze
spaceship-titanic	Gold
spooky-author-identification	Silver
tabular-playground-series-dec-2021	Gold
tabular-playground-series-may-2022	Above Median
text-normalization-challenge-english-language	Silver
text-normalization-challenge-russian-language	Bronze
the-icml-2013-whale-challenge-right-whale-redux	Gold

(NYC taxi is excluded from Lite-22; grading returned no medal for it because of a score inversion in that competition's grader.)
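
The medal counts in the Summary can be cross-checked against the table above with a short tally. This is only a sketch: the `RESULTS` dictionary is transcribed by hand from the table, not read from the grading report, whose schema is not shown here.

```python
from collections import Counter

# Medal column transcribed from the per-competition table above.
RESULTS = {
    "aerial-cactus-identification": "Gold",
    "aptos2019-blindness-detection": "Gold",
    "denoising-dirty-documents": "Gold",
    "detecting-insults-in-social-commentary": "Gold",
    "dog-breed-identification": "Above Median",
    "dogs-vs-cats-redux-kernels-edition": "Gold",
    "histopathologic-cancer-detection": "Gold",
    "jigsaw-toxic-comment-classification-challenge": "Gold",
    "leaf-classification": "Silver",
    "mlsp-2013-birds": "Bronze",
    "nomad2018-predict-transparent-conductors": "Silver",
    "plant-pathology-2020-fgvc7": "Gold",
    "random-acts-of-pizza": "Silver",
    "ranzcr-clip-catheter-line-classification": "Bronze",
    "siim-isic-melanoma-classification": "Bronze",
    "spaceship-titanic": "Gold",
    "spooky-author-identification": "Silver",
    "tabular-playground-series-dec-2021": "Gold",
    "tabular-playground-series-may-2022": "Above Median",
    "text-normalization-challenge-english-language": "Silver",
    "text-normalization-challenge-russian-language": "Bronze",
    "the-icml-2013-whale-challenge-right-whale-redux": "Gold",
}

counts = Counter(RESULTS.values())
# "Above Median" rows do not count toward the any-medal rate.
medals = sum(counts[m] for m in ("Gold", "Silver", "Bronze"))
any_medal_pct = round(100 * medals / len(RESULTS), 2)

print(counts["Gold"], counts["Silver"], counts["Bronze"])  # 11 5 4
print(f"{medals}/{len(RESULTS)} = {any_medal_pct}%")       # 20/22 = 90.91%
```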

PR Contents

README.md: new leaderboard row at the top (score without ± SEM since single seed)
runs/README.md: hardware description
runs/orze_lite_group1/grading_report_group_1.json: official grading report (via git-lfs)
runs/run_group_experiments.csv: entry Orze,orze_lite_group1 appended

Evaluation Integrity

Submissions were selected based on internal cross-validation metrics only — no held-out test-set signals were exposed to the agent during rollouts.
The grading report was produced by the official mlebench grade CLI.
Orze is fully open-source and available on PyPI.

Note on Score Coincidence with Disarray

We notice Disarray also reports 90.91% in the Additional table (disqualified for test-set feedback). Our identical score is a coincidence of arithmetic: 20/22 = 90.909...%. Our submission is a clean run with no test-set access.
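
The arithmetic behind this coincidence can be sketched as follows. With 22 competitions the any-medal rate can only take 23 distinct values, so any two submissions that each earn 20 medals report the same percentage. The variable names are illustrative only.

```python
N_COMPETITIONS = 22

# Every any-medal rate that is reachable on a 22-competition split.
rates = {k: round(100 * k / N_COMPETITIONS, 2) for k in range(N_COMPETITIONS + 1)}

# Which medal counts round to the reported 90.91%?
matching = [k for k, rate in rates.items() if rate == 90.91]

# Size of one step on the leaderboard scale: one medal more or less.
step = round(100 / N_COMPETITIONS, 2)

print(matching)  # [20]
print(step)      # 4.55
```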

Thank you for building and maintaining MLE-Bench!

— The Orze Team

Orze is an open-source multi-role agent framework that achieves a 90.91% any-medal rate on MLE-bench Lite-22 using Claude Opus 4 + Moonshot v1 on a single H100 GPU in 24 hours. This is a single-seed submission, so no SEM is reported.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

MLE-bench Lite-22 result: 20/22 medals (90.9%), #1 on leaderboard. Grading report: openai/mle-bench#144
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

warlockee force-pushed the add-orze-lite-results-v2 branch from 35bcf7a to 363bf65 Compare May 14, 2026 23:29

warlockee wants to merge 1 commit into openai:main from warlockee:add-orze-lite-results-v2
