Add Orze MLE-bench Lite results (90.91%, #1 on Lite) #144

Open
warlockee wants to merge 1 commit into openai:main from warlockee:add-orze-lite-results-v2

@warlockee commented May 14, 2026

Leaderboard Submission Disclosure

  • The scaffold/harness used in this leaderboard submission does not expose to the agent signals derived from the held-out test set during rollouts.

Hello MLE-Bench team,

We would like to submit evaluation results for Orze, our open-source autonomous ML engineering framework (pip install orze, source).

We understand submissions are currently paused (04-24-2026 note). We are opening this PR now so the results are on record, and are happy to wait for your updated process before merging.

Summary

  • Framework: Orze (open-source)
  • LLMs: Claude Opus 4 (strategic research advisor + code generation) + Moonshot v1 (idea generation, plateau analysis)
  • Hardware: 1× NVIDIA H100 80GB GPU, 24 hours wall-clock
  • Lite score (the Low split, i.e. the Lite Any Medal %): 90.91%, from a single seed. SEM is not reported for a single seed, per your guidelines; we acknowledge the 3-seed requirement and will resubmit if needed once the submission process reopens.
  • Medals: 20/22 — 11 Gold, 5 Silver, 4 Bronze (22/22 above median)
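
As a minimal sketch of why no SEM accompanies a single-seed score: the standard error of the mean depends on the sample standard deviation, which is undefined for one observation. The helper name `sem` is ours for illustration and is not part of Orze or mlebench.

```python
import math
import statistics


def sem(scores):
    """Standard error of the mean, or None when fewer than two seeds exist."""
    if len(scores) < 2:
        # The sample standard deviation is undefined for n = 1,
        # so there is no SEM to report (it is not 0.00).
        return None
    return statistics.stdev(scores) / math.sqrt(len(scores))


# The single-seed Lite score from this submission.
print(sem([90.91]))  # None
```

With three seeds, the same helper would return a finite value that could be reported alongside the mean.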

Per-Competition Results (Lite-22)

| Competition | Result |
| --- | --- |
| aerial-cactus-identification | Gold |
| aptos2019-blindness-detection | Gold |
| denoising-dirty-documents | Gold |
| detecting-insults-in-social-commentary | Gold |
| dog-breed-identification | Above Median |
| dogs-vs-cats-redux-kernels-edition | Gold |
| histopathologic-cancer-detection | Gold |
| jigsaw-toxic-comment-classification-challenge | Gold |
| leaf-classification | Silver |
| mlsp-2013-birds | Bronze |
| nomad2018-predict-transparent-conductors | Silver |
| plant-pathology-2020-fgvc7 | Gold |
| random-acts-of-pizza | Silver |
| ranzcr-clip-catheter-line-classification | Bronze |
| siim-isic-melanoma-classification | Bronze |
| spaceship-titanic | Gold |
| spooky-author-identification | Silver |
| tabular-playground-series-dec-2021 | Gold |
| tabular-playground-series-may-2022 | Above Median |
| text-normalization-challenge-english-language | Silver |
| text-normalization-challenge-russian-language | Bronze |
| the-icml-2013-whale-challenge-right-whale-redux | Gold |

(NYC taxi is excluded from Lite-22; our grader returned no medal for that competition because of a score inversion in the grader.)
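
The summary counts and the headline score can be re-derived from the per-competition table. The sketch below hard-codes the table as a plain dictionary (it does not read the grading report, whose schema is not shown here) and tallies the results; the variable names are illustrative only.

```python
from collections import Counter

# Per-competition results copied from the Lite-22 table above.
RESULTS = {
    "aerial-cactus-identification": "Gold",
    "aptos2019-blindness-detection": "Gold",
    "denoising-dirty-documents": "Gold",
    "detecting-insults-in-social-commentary": "Gold",
    "dog-breed-identification": "Above Median",
    "dogs-vs-cats-redux-kernels-edition": "Gold",
    "histopathologic-cancer-detection": "Gold",
    "jigsaw-toxic-comment-classification-challenge": "Gold",
    "leaf-classification": "Silver",
    "mlsp-2013-birds": "Bronze",
    "nomad2018-predict-transparent-conductors": "Silver",
    "plant-pathology-2020-fgvc7": "Gold",
    "random-acts-of-pizza": "Silver",
    "ranzcr-clip-catheter-line-classification": "Bronze",
    "siim-isic-melanoma-classification": "Bronze",
    "spaceship-titanic": "Gold",
    "spooky-author-identification": "Silver",
    "tabular-playground-series-dec-2021": "Gold",
    "tabular-playground-series-may-2022": "Above Median",
    "text-normalization-challenge-english-language": "Silver",
    "text-normalization-challenge-russian-language": "Bronze",
    "the-icml-2013-whale-challenge-right-whale-redux": "Gold",
}

counts = Counter(RESULTS.values())
medals = sum(counts[m] for m in ("Gold", "Silver", "Bronze"))
rate = 100 * medals / len(RESULTS)

print(dict(counts))
print(f"{medals}/{len(RESULTS)} = {rate:.2f}%")  # 20/22 = 90.91%
```

The tally gives 11 Gold, 5 Silver, 4 Bronze, and 2 Above Median, matching the Summary section.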

PR Contents

  • README.md: new leaderboard row at the top (score without ± SEM since single seed)
  • runs/README.md: hardware description
  • runs/orze_lite_group1/grading_report_group_1.json: official grading report (via git-lfs)
  • runs/run_group_experiments.csv: entry Orze,orze_lite_group1 appended

Evaluation Integrity

  • Submissions were selected based on internal cross-validation metrics only — no held-out test-set signals were exposed to the agent during rollouts.
  • The grading report was produced by the official mlebench grade CLI.
  • Orze is fully open-source and available on PyPI.

Note on Score Coincidence with Disarray

We notice Disarray also reports 90.91% in the Additional table (disqualified for test-set feedback). Our identical score is a coincidence of arithmetic: 20/22 = 90.909...%. Our submission is a clean run with no test-set access.

Thank you for building and maintaining MLE-Bench!

— The Orze Team

Orze is an open-source multi-role agent framework that achieves a 90.91%
any-medal rate on MLE-bench Lite-22 using Claude Opus 4 + Moonshot v1
on a single H100 GPU in 24 hours. This is a single-seed submission,
so no SEM is reported.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@warlockee warlockee force-pushed the add-orze-lite-results-v2 branch from 35bcf7a to 363bf65 Compare May 14, 2026 23:29
warlockee pushed a commit to orze-ai/nips-2026-submission that referenced this pull request May 15, 2026
MLE-bench Lite-22 result: 20/22 medals (90.9%), #1 on leaderboard.
Grading report: openai/mle-bench#144

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>