Add Orze MLE-bench Lite results (90.91%, #1 on Lite) #144
Open
warlockee wants to merge 1 commit into
Orze is an open-source multi-role agent framework that achieves a 90.91% any-medal rate on MLE-bench Lite-22 using Claude Opus 4 + Moonshot v1 on a single H100 GPU in 24 hours. This is a single-seed submission (± 0.00 SEM).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
35bcf7a to 363bf65
warlockee pushed a commit to orze-ai/nips-2026-submission that referenced this pull request (May 15, 2026)
MLE-bench Lite-22 result: 20/22 medals (90.9%), #1 on leaderboard.
Grading report: openai/mle-bench#144
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Leaderboard Submission Disclosure
Hello MLE-Bench team,
We would like to submit evaluation results for Orze, our open-source autonomous ML engineering framework (pip install orze, source).
We understand submissions are currently paused (04-24-2026 note). We are opening this PR now so the results are on record, and are happy to wait for your updated process before merging.
Summary
Per-Competition Results (Lite-22)
(NYC taxi is excluded from Lite-22; the grader returned no medal for that competition because of a score inversion in the grader.)
PR Contents
README.md: new leaderboard row at the top (score without ± SEM since single seed)
runs/README.md: hardware description
runs/orze_lite_group1/grading_report_group_1.json: official grading report (via git-lfs)
runs/run_group_experiments.csv: entry Orze,orze_lite_group1 appended
Evaluation Integrity
mlebench grade CLI.
Note on Score Coincidence with Disarray
We notice Disarray also reports 90.91% in the Additional table (disqualified for test-set feedback). Our identical score is a coincidence of arithmetic: 20/22 = 90.909...%. Our submission is a clean run with no test-set access.
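The arithmetic behind the coincidence can be checked with a minimal sketch. The helper name `any_medal_rate` is illustrative and is not part of the mlebench package; the only inputs taken from this PR are the 20 medals and the 22 Lite competitions.

```python
def any_medal_rate(medals: int, competitions: int) -> float:
    """Return the any-medal rate as a percentage rounded to two decimals.

    Illustrative helper (an assumption, not an mlebench API).
    """
    if competitions <= 0:
        raise ValueError("competitions must be positive")
    if not 0 <= medals <= competitions:
        raise ValueError("medals must be between 0 and competitions")
    return round(100 * medals / competitions, 2)


# 20 medals out of 22 competitions, as reported in this PR.
rate = any_medal_rate(20, 22)
print(f"{rate:.2f}%")

# With 22 competitions the rate moves in steps of 1/22 (about 4.55 points),
# so any two runs that medal in exactly 20 competitions report the same figure.
neighbours = {m: any_medal_rate(m, 22) for m in (19, 20, 21)}
print(neighbours)
```

Because only 23 distinct scores are possible on a 22-competition split, identical headline numbers across unrelated submissions are expected whenever the medal counts match.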
Thank you for building and maintaining MLE-Bench!
— The Orze Team