Add DALPHA MLE-Bench Results by BenMyoung · Pull Request #129 · openai/mle-bench

BenMyoung · 2026-03-16T17:05:09Z

Hi authors of MLE-Bench,

We are the research team at Dalpha, building Cobra Agent — an autonomous ML engineering agent built on a stack of composable modules, spanning sandboxed code execution, autonomous sub-agent orchestration, and end-to-end machine learning workflows.

We evaluated Cobra Agent on MLE-Bench using three independent experiment groups, and submit the results in this PR.

Pull Request Contents

This PR updates the README.md leaderboard and includes detailed grading reports for three independent runs, located at:

runs/cobra_agent_group1
runs/cobra_agent_group2
runs/cobra_agent_group3

Resources per Run

Model: gemini-3.1-pro-preview, gemini-3-flash-preview
Hardware:
- A100 SXM ×1 — VRAM 80 GB, 11 vCPUs, 128 GB RAM
- RTX 4060 — VRAM 16 GB, 16 vCPUs, 32 GB RAM
Time Limit: 24 hours

Notes on Recently Raised Concerns

Regarding test label leakage: Our agent does not use web search at any point during evaluation. No external data sources were accessed, and there is no risk of test label leakage through this vector.

Regarding submission behavior: The submission file was selected solely based on the agent's internal cross-validation metrics. No grading signals or scores from the private test set were used to guide or update the agent's behavior during inference.

Best regards,
The Dalpha Team

joe-needham · 2026-03-23T11:23:28Z

Thanks for your submission. Could you please resolve the conflicts?

BenMyoung · 2026-03-23T13:01:06Z

Thank you for the review! We've resolved the merge conflicts and the branch should now be up to date with main.
Please let us know if there's anything else we need to address.

pjwb-hue · 2026-03-26T01:48:31Z

Congrats on the results! Do you have plans to publish a tech report or share any artifacts/traces? Would love to learn more about how Cobra Agent works. Also really curious about AI4Code, it's pretty different from the rest of the competitions and I don't think any other submission has medaled on it. Would be cool to hear how you approached that one. Thanks and congrats again!

BenMyoung · 2026-03-26T04:43:08Z

Thank you for the kind words!

As mentioned on our website, Cobra Agent is built on top of modules that are already powering Dalpha's real-world business as part of our AI-for-AI infrastructure. What we submitted here is essentially a recombination of those production modules, tuned for the MLE-Bench setting. Because of that, we don't have immediate plans to open-source the internals or publish a full tech report — but personally, I'd love to share more details in a lighter venue like a workshop at a conference, if the opportunity arises.

As for AI4Code — happy to share some of the key parts of what our agent focused on for this competition!

Here are some of the key factors our agent identified for AI4Code:

Architecture: Since the task is to predict the ordering of cells, a structure that models inter-cell relationships jointly should outperform treating each cell independently.
Model selection: We experimented with various combinations of pretrained models including CodeBERT, DeBERTa, and ModernBERT to find what worked best for this task.
Structural constraint: Code cell order is already fixed, so treating that as prior knowledge and predicting only markdown cell positions is more efficient.
Data split: Some notebooks share ancestors, so GroupKFold splitting is needed to prevent data leakage.
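
The group-aware split in the last point can be sketched in plain Python. This is an illustrative sketch, not Cobra Agent's code: the `ancestor_of` mapping and the `group_k_fold` helper are hypothetical names, and in practice a library routine such as scikit-learn's `GroupKFold` would normally be used instead.

```python
from collections import defaultdict


def group_k_fold(ancestor_of, n_splits=3):
    """Yield (train_ids, valid_ids) so that no ancestor group spans both sides.

    ancestor_of: dict mapping notebook id -> ancestor id (hypothetical input).
    """
    # Collect notebooks that share an ancestor into one group.
    groups = defaultdict(list)
    for notebook_id, ancestor_id in ancestor_of.items():
        groups[ancestor_id].append(notebook_id)

    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for ancestor_id in sorted(groups, key=lambda a: (-len(groups[a]), a)):
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[ancestor_id])

    # Each fold serves once as the validation set.
    for i in range(n_splits):
        valid = sorted(folds[i])
        train = sorted(nb for j, fold in enumerate(folds) if j != i for nb in fold)
        yield train, valid


# Toy data: nb1/nb2 share ancestor "a", nb4/nb5 share ancestor "c".
ancestor_of = {"nb1": "a", "nb2": "a", "nb3": "b", "nb4": "c", "nb5": "c", "nb6": "d"}
splits = list(group_k_fold(ancestor_of, n_splits=3))
```

Because whole ancestor groups are assigned to a single fold, a notebook and its forks never appear on both sides of a split, which is the leakage the data-split point describes.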

Small-Data Ablation → Best Combination

Rather than committing to the full dataset upfront, the agent runs quick experiments on a small data subset — each completing within minutes — to validate which combinations of the above factors actually improve local validation scores. This allowed us to confirm the optimal architecture, loss function, and model size before scaling up.

The critical factors differ per competition, but the process of rapidly identifying and validating them stays the same across all competitions.

Hope that gives a useful peek into how Cobra Agent thinks. Thanks again for the interest!

pjwb-hue · 2026-03-26T07:19:08Z

Thanks for the details on AI4Code! I dug into this a bit more since it caught my eye. Looking at all 190 AI4Code entries across every agent, the highest score outside your submission is 0.836 (CAIR MARS+). Interestingly, MARS's published logs show they tried a very similar approach to what you described — CodeBERT, sentence transformers, dual-context architectures, Kendall Tau eval — across 12+ iterations and topped out there. Your scores (0.855-0.861) sit consistently above that ceiling, which is really interesting.

Since the approach sounds similar but the results are quite different, it would be really helpful to understand what's driving the gap. Would you be able to share any agent traces or logs from the AI4Code run? Even just the generated training code would go a long way. Also curious about your Docker setup — are pretrained model weights baked into the image, or does the agent download them at runtime?
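
For context on the scores quoted above: the AI4Code competition metric is a Kendall tau correlation computed from the number of adjacent swaps (inversions) needed to turn each predicted cell order into the true one, pooled over all notebooks. A minimal sketch follows. The function names are illustrative, and the quadratic inversion count is chosen for clarity rather than speed.

```python
def count_inversions(predicted, truth):
    """Count pairs of cells that appear in the wrong relative order."""
    rank = {cell: i for i, cell in enumerate(truth)}
    seq = [rank[cell] for cell in predicted]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def kendall_tau(predictions, ground_truth):
    """K = 1 - 4 * sum(inversions) / sum(n * (n - 1)) over all notebooks."""
    total_inversions = 0
    total_pairs_x2 = 0
    for pred, truth in zip(predictions, ground_truth):
        n = len(truth)
        total_inversions += count_inversions(pred, truth)
        total_pairs_x2 += n * (n - 1)
    return 1 - 4 * total_inversions / total_pairs_x2


# One toy notebook with two code cells (c*) and two markdown cells (m*).
truth = [["c1", "m1", "c2", "m2"]]
perfect = kendall_tau(truth, truth)
one_swap = kendall_tau([["c1", "c2", "m1", "m2"]], truth)
reversed_score = kendall_tau([["m2", "c2", "m1", "c1"]], truth)
```

A perfect ordering scores 1 and a fully reversed ordering scores -1, so differences of a few hundredths near the top of the range correspond to a small fraction of misordered cell pairs.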

BenMyoung · 2026-03-26T09:18:34Z

Thanks for digging into this so carefully!

We haven't looked closely at MARS's logs, so it's hard to make a direct comparison. That said, based on what you've described, our intuition is that the gap likely comes from how we handle plateaus during the improvement process rather than the initial approach itself.

When our agent stalls, it doesn't just retry the same direction. It may redesign the CV strategy if the current one seems unreliable, or bring in a fresh agent with a new context to look for improvement angles that the previous iteration missed. So while the high-level ingredients might look similar, the way the agent navigates and recovers from plateaus tends to be quite different.

On the Docker setup — the pretrained weights are not baked into the image. The agent downloads them at runtime; for AI4Code, the final model used was microsoft/deberta-v3-large.

As for traces and logs — we don't have plans to share those at this point. But your question got us thinking: it might be interesting to clean up and share the training code specifically for competitions where other agents struggled to get results but ours did — AI4Code being a natural candidate. We'll consider it.

snl3576 · 2026-03-26T17:02:49Z

Congrats on the results, and thanks for engaging with questions here!

One quick clarification before merge: the PR states "no external data sources were accessed" and your website describes the sandbox as having "no network access," but then you mention that the agent downloads pretrained weights (deberta-v3-large) at runtime. Could you clarify how that works?

Separately, given that this would be the top result on the leaderboard, it'd be great to have at least one transparency artifact to go alongside the grading reports. The table in #138 shows every other top-10 submission has published source code, a tech report, or execution logs, and that norm has become pretty important for the benchmark's credibility.

@joe-needham wanted to flag this before the PR moves forward.

BenMyoung · 2026-03-26T17:42:49Z

Thanks for flagging both points!

On the network clarification: The "no network access" language on our website was intended to convey that the agent does not use web search during the agent's run — not that the sandbox is fully air-gapped. Pretrained weights are downloaded at runtime in the normal way. We recognize the wording was ambiguous and have updated it to be clearer.

On transparency artifacts: We understand the norm and appreciate why it matters for the benchmark's credibility. That said, we'd note that the current rank 1 submission (fmagent2) does not appear to have published source code, a tech report, or execution logs either — the linked resource points to v1, not the submitted version. Several other top-10 entries (disarray, PiEvolve, Thesis) appear to be in a similar position.

Ultimately, we're happy to defer to @joe-needham and the maintainers on what's required for the PR to move forward.

snl3576 · 2026-03-26T18:04:46Z

Thanks for the clarification on network access, that's helpful.

On the comparisons to other submissions, I think there are some important differences worth noting:

Famou-Agent initially published source code, which is how the community was able to identify test-set feedback in their evaluator (see #138, "The leaderboard separation in PR #130 is inconsistently applied"). They claimed in #123 ("Add Famou-Agent 2.0 (powered by Gemini 3 Pro) results to MLE-Bench leaderboard") to use validation-only feedback, but the published code showed scale_score_with_medal() using test scores as evolutionary fitness. Their repo was wiped after this was raised, though earlier commits and forks remain accessible. The transparency worked as intended: it enabled community review.
PiEvolve has a peer-reviewed paper and published artifact code.
Disarray self-reported their test-set feedback mechanism with specific technical details in #118, "Add Disarray MLE-Bench Results" (binary medal-threshold signal for early termination, code snippets showing the dog-breed exploit), and has a tech report in progress per the same PR.

The concern isn't that every submission needs to be fully open-sourced. It's that this would be the first top result where the community has no way to understand how the result was achieved. The explanations so far have been high-level ("composable modules", "plateau recovery", "fresh agent with new context") without the kind of specific, verifiable detail that other top submissions have provided, whether through code, papers, or even just detailed artifacts.

Happy to defer to @joe-needham on what's needed to move forward.

BenMyoung · 2026-03-30T14:36:18Z

Fair point — the comparison to other submissions wasn't the right framing, and we appreciate you laying out the specifics. We've put together an artifact repository with representative agent-generated code from five competitions, including AI4Code. A few things worth noting about what's there:

Each of our three submitted runs was generated independently, so models and hyperparameters naturally vary across runs. The published code is from one of those runs.
Within a run, the agent may version files (e.g., train.py → train_v2.py) or modify them in place, so versioned files don't necessarily represent the complete iteration history.
Only Python source files are included; model weights and submission files are not.

We hope this gives a more concrete picture of how Cobra Agent approaches these competitions in practice.

snl3576 · 2026-04-12T18:55:30Z

Thanks for publishing the artifact repository, really appreciate the team engaging with the community here.

I went through the published code and had a question about the validate_submission tool defined in the README.md. The tool runs:

command = f"mlebench grade-sample {submission_csv_path} {comp} --data-dir ./zip_files"

MLE-Bench provides an official validation mechanism specifically designed for agents — validate_submission() in grade.py, served via a grading server and shell script. This official tool runs the grader but intentionally discards the score, returning only "Submission is valid" or an error message. The MLE-Bench paper (Section 2.3) explicitly states: "we allow agents to access a local validation server to check the validity of their submission, though the tool does not provide a score."

In contrast, mlebench grade-sample is the CLI grading command (cli.py:197-203) that calls grade_csv() and outputs the full CompetitionReport as JSON — including score, gold_threshold, silver_threshold, bronze_threshold, gold_medal, silver_medal, bronze_medal, any_medal, and above_median.

I see that the tool wrapper discards the check_output return value on line 53 (it's not assigned to a variable) and only returns {"valid": True/False}. However, since we can't see the implementation of ctx.mleb_env.check_output or how the agent framework handles subprocess output, it's unclear what information from the grading command is ultimately visible to the agent.

A couple of questions:

1. Could you share why mlebench grade-sample was used instead of the official MLE-Bench validation endpoint?
2. Would you be able to share the tool-call logs showing how many times validate_submission was invoked during a run, and what the agent saw from each call? This would go a long way toward addressing the community's questions about test-set feedback.
3. Is the full implementation of ctx.mleb_env.check_output available somewhere? Understanding whether stdout/stderr from the subprocess is logged or surfaced to the agent would help clarify the information flow.
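
The information-flow question above can be illustrated with a small sketch. It is neither the Cobra Agent wrapper nor the MLE-Bench grading server: `run_validity_only` is a hypothetical helper, and a stand-in command plays the role of the grader. It shows one way a wrapper can run a command that prints a score while guaranteeing that the caller receives only a valid/invalid flag.

```python
import subprocess
import sys


def run_validity_only(command):
    """Run a command and report only whether it exited successfully.

    stdout and stderr are sent to DEVNULL, so nothing the command prints
    (for example a score) can reach the caller.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return {"valid": completed.returncode == 0}


# Stand-in "graders": one prints a score and exits 0, the other fails.
fake_grader_ok = [sys.executable, "-c", "print('{\"score\": 0.86}')"]
fake_grader_bad = [sys.executable, "-c", "import sys; sys.exit(1)"]

ok_result = run_validity_only(fake_grader_ok)
bad_result = run_validity_only(fake_grader_bad)
```

Whether a real wrapper has this property depends on where the subprocess output ends up (a log file, the agent's context, or nowhere), which is why the implementation behind `check_output` matters for the question being asked.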

BenMyoung · 2026-04-13T13:06:56Z

Thanks for your continued interest. I'll keep this brief.

1. We started using the MLE-Bench Docker CLI because it was readily accessible, and we didn't migrate away from it midway. The output was logged and served as a useful debugging tool during early development. Since the tool wrapper discards the return value, the agent only receives a valid/invalid result; score and medal information is never exposed to the agent.
2. The number of validate_submission calls varies by competition and run. For short training runs or those with heavy early exploration, it can be 20+ calls; for long training runs with less exploration, it can be 5 or fewer.
3. The implementation of ctx.mleb_env.check_output is not publicly available. ctx is managed as runtime memory and saved for debugging after a run completes. mleb_env refers to the official MLE-Bench Docker environment, and check_output is a tool for executing shell commands.

pjwb-hue · 2026-04-18T19:17:27Z

Thanks for sharing the artifact code! I spent some time going through the AI4Code files in detail and had a couple of observations I wanted to flag.

Model clarification

Earlier in this thread you mentioned:

"the agent downloads them at runtime; for AI4Code, the final model used was microsoft/deberta-v3-large"

In the published artifacts, deberta-v3-large appears only in three diagnostic check scripts (check_deberta_mem.py, check_deberta.py, check_deberta_speed.py) that benchmark memory and speed but don't produce trained models. All training and inference files use answerdotai/ModernBERT-large — for example:

train_full_context.py:97,137,149
train_full_context_all.py:97,137,153
infer_full_context.py:108,135,142
All 5 eval_*.py files (line 26 each)

The README notes that "models may differ across runs," so perhaps the submitted scores came from a different run than the published artifacts. Could you clarify which model produced the submitted scores, and whether the published artifacts represent the run that was actually graded?

Architecture and AI4Code context

Looking at the code more closely, the AI4Code approach implements six specific architectural elements:

pct_rank as a pointwise regression target (train_full_context.py:36)
CLS token as cell boundary marker with position extraction (train_full_context.py:20,48-49)
Code cells sorted by rank, markdown cells shuffled (train_full_context.py:30-31)
Validation split by ancestor_id (train_full_context.py:125-135)
Full-notebook single-sequence encoding at 8192 tokens (train_full_context.py:15)
Combined MSE + pairwise margin ranking loss (train_base_pairwise.py:176)

These are the same six elements documented in the 1st place Kaggle AI4Code solution (published November 2022), which used DeBERTa-v3-large. The Cobra Agent artifacts substitute ModernBERT-large but keep the identical architecture. The 2nd place solution used a fundamentally different approach (bi-encoder with transformer decoder, no pct_rank).
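
To make two of the listed elements concrete, here is a small sketch of percentile-rank targets and a combined MSE plus pairwise margin loss. It is an illustration under assumptions, not the published artifact code: the helper names, the `position / n` target definition, the margin value, and the equal weighting of the two loss terms are all assumed.

```python
def pct_rank_targets(ordered_cells):
    """Map each cell id to its (1-based position / n) in the true order."""
    n = len(ordered_cells)
    return {cell: (i + 1) / n for i, cell in enumerate(ordered_cells)}


def combined_loss(pred, target, margin=0.1):
    """MSE plus a pairwise hinge penalty for pairs ranked in the wrong order."""
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n

    hinge, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if target[i] < target[j]:  # cell i should come before cell j
                # Penalize unless pred[j] exceeds pred[i] by at least the margin.
                hinge += max(0.0, margin - (pred[j] - pred[i]))
                pairs += 1
    return mse + (hinge / pairs if pairs else 0.0)


# Toy notebook: two code cells (c*) and two markdown cells (m*).
order = ["c1", "m1", "c2", "m2"]
targets = pct_rank_targets(order)
target_vec = [targets[c] for c in order]

perfect_loss = combined_loss(target_vec, target_vec)
swapped_loss = combined_loss([0.5, 0.25, 0.75, 1.0], target_vec)
```

The regression term pulls each prediction toward its absolute position, while the pairwise term only cares about relative order, so a prediction that swaps two neighbouring cells is penalized by both.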

As I noted above, the highest AI4Code score across all other agents is 0.836, and from the grading reports in this repo, the bronze medal threshold for AI4Code is 0.853. No other agent has crossed it. Additionally, the MLE-Bench paper (Section 2.3.1) notes that "agents are also forbidden from viewing solutions online, which can often be found on Kaggle or GitHub."

This isn't to suggest anything improper; LLMs trained on public Kaggle content could plausibly reproduce this approach from their training data. But given that the architecture matches the 1st place solution specifically (not the 2nd place or other competitive approaches), and the scores represent a step-function improvement over all other agents, it would be helpful to understand:

Does the agent framework include any competition-specific prompting, knowledge bases, or solution templates?
Were the Dolos plagiarism checks (described in the MLE-Bench paper) run on these artifacts?
Would you consider sharing agent conversation traces for the AI4Code run specifically? Even a summary showing the agent's reasoning as it progressed from CodeBERT baseline to the full-context ModernBERT architecture would add meaningful transparency.

Thanks again for engaging with the community on this!

BenMyoung · 2026-04-25T07:01:13Z

Thanks to the maintainers for the update on the submission process. We've read the discussion in this thread carefully.

Once the new submission criteria are published, we plan to update this PR to align with them.
We'll keep the PR open in the meantime so it's easier to track the update when the new criteria are shared.
We look forward to the next steps.

Thanks to everyone who engaged on this PR.

Add CobraAgent results

5ae0e81

BenMyoung added 2 commits March 23, 2026 21:55

Merge remote-tracking branch 'upstream/main'

541c16d

Resolve runs/run_group_experiments.csv

9a4ab65

Brawn9913 mentioned this pull request Mar 24, 2026

The leaderboard separation in PR #130 is inconsistently applied #138

Open

BenMyoung closed this Apr 25, 2026

BenMyoung reopened this Apr 27, 2026
