Add DALPHA MLE-Bench Results #129
|
Thanks for your submission, please could you resolve the conflicts? |
Thank you for the review! We've resolved the merge conflicts and the branch should now be up to date with main. |
|
Congrats on the results! Do you have plans to publish a tech report or share any artifacts/traces? Would love to learn more about how Cobra Agent works. Also really curious about AI4Code, it's pretty different from the rest of the competitions and I don't think any other submission has medaled on it. Would be cool to hear how you approached that one. Thanks and congrats again! |
Thank you for the kind words! As mentioned on our website, Cobra Agent is built on top of modules that are already powering Dalpha's real-world business as part of our AI-for-AI infrastructure. What we submitted here is essentially a recombination of those production modules, tuned for the MLE-Bench setting. Because of that, we don't have immediate plans to open-source the internals or publish a full tech report — but personally, I'd love to share more details in a lighter venue like a workshop at a conference, if the opportunity arises. As for AI4Code — happy to share some of the key parts of what our agent focused on for this competition! Here are some of the key factors our agent identified for AI4Code:
Small-Data Ablation → Best Combination
Rather than committing to the full dataset upfront, the agent runs quick experiments on a small data subset (each completing within minutes) to validate which combinations of the above factors actually improve local validation scores. This allowed us to confirm the optimal architecture, loss function, and model size before scaling up. The critical factors differ per competition, but the process of rapidly identifying and validating them stays the same across all competitions. Hope that gives a useful peek into how Cobra Agent thinks. Thanks again for the interest! |
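The small-data ablation loop described above can be sketched roughly as follows. This is an illustrative sketch only, not Cobra Agent's code: the function name `small_data_ablation`, the `factors` dictionary, and the `evaluate` callback are all hypothetical, and in practice `evaluate` would train a model on the subset and return a local validation score.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, str]


def small_data_ablation(
    data: Sequence,
    factors: Dict[str, List[str]],
    evaluate: Callable[[Sequence, Config], float],
    subset_size: int,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Score every factor combination on one fixed random subset.

    Hypothetical helper: `evaluate` stands in for "train briefly on the
    subset and return a local validation score" (higher is better).
    """
    rng = random.Random(seed)
    # Use the same subset for every configuration so scores are comparable.
    subset = rng.sample(list(data), min(subset_size, len(data)))

    best_config: Config = {}
    best_score = float("-inf")
    names = sorted(factors)
    for values in itertools.product(*(factors[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(subset, config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    # Toy stand-in for a real training run; the factor names are made up.
    def toy_evaluate(subset: Sequence, config: Config) -> float:
        return 0.5 + 0.1 * (config["loss"] == "l1") + 0.2 * (config["model_size"] == "large")

    grid = {"loss": ["mse", "l1"], "model_size": ["base", "large"]}
    print(small_data_ablation(range(1000), grid, toy_evaluate, subset_size=50))
```

Only the winning configuration would then be retrained on the full dataset, which is where most of the compute budget goes.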
|
Thanks for the details on AI4Code! I dug into this a bit more since it caught my eye. Looking at all 190 AI4Code entries across every agent, the highest score outside your submission is 0.836 (CAIR MARS+). Interestingly, MARS's published logs show they tried a very similar approach to what you described — CodeBERT, sentence transformers, dual-context architectures, Kendall Tau eval — across 12+ iterations and topped out there. Your scores (0.855-0.861) sit consistently above that ceiling, which is really interesting. Since the approach sounds similar but the results are quite different, it would be really helpful to understand what's driving the gap. Would you be able to share any agent traces or logs from the AI4Code run? Even just the generated training code would go a long way. Also curious about your Docker setup — are pretrained model weights baked into the image, or does the agent download them at runtime? |
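For context on the scores being compared here, below is a minimal sketch of the Kendall tau variant that, to my understanding, the Kaggle AI4Code competition uses: K = 1 - 4 * (total inversions) / (sum over notebooks of n(n - 1)), where n is the number of cells in a notebook. The formula is stated as an assumption rather than taken from this thread, and the function names are hypothetical.

```python
from bisect import bisect_right, insort
from typing import Hashable, List, Sequence


def count_inversions(ranks: Sequence[int]) -> int:
    """Count pairs (i, j) with i < j and ranks[i] > ranks[j]."""
    inversions = 0
    seen: List[int] = []  # sorted list of ranks visited so far
    for rank in ranks:
        # Earlier elements strictly greater than `rank` are inversions.
        inversions += len(seen) - bisect_right(seen, rank)
        insort(seen, rank)
    return inversions


def kendall_tau(
    ground_truth: Sequence[Sequence[Hashable]],
    predictions: Sequence[Sequence[Hashable]],
) -> float:
    """Pooled Kendall tau over notebooks (assumed AI4Code definition).

    Each element of `ground_truth` / `predictions` is one notebook's
    cell ids in true / predicted order.
    """
    total_inversions = 0
    total_pairs_x2 = 0
    for truth, pred in zip(ground_truth, predictions):
        position = {cell: i for i, cell in enumerate(truth)}
        ranks = [position[cell] for cell in pred]
        total_inversions += count_inversions(ranks)
        n = len(truth)
        total_pairs_x2 += n * (n - 1)
    return 1 - 4 * total_inversions / total_pairs_x2


if __name__ == "__main__":
    truth = [["a", "b", "c", "d"]]
    print(kendall_tau(truth, [["a", "c", "b", "d"]]))
```

Under this definition a perfect ordering scores 1 and a fully reversed one scores -1, so a gap of about 0.02 between agents means a noticeably smaller share of misordered cell pairs.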
Thanks for digging into this so carefully! We haven't looked closely at MARS's logs, so it's hard to make a direct comparison. That said, based on what you've described, our intuition is that the gap likely comes from how we handle plateaus during the improvement process rather than the initial approach itself. When our agent stalls, it doesn't just retry the same direction. It may redesign the CV strategy if the current one seems unreliable, or bring in a fresh agent with a new context to look for improvement angles that the previous iteration missed. So while the high-level ingredients might look similar, the way the agent navigates and recovers from plateaus tends to be quite different.
On the Docker setup: the pretrained weights are not baked into the image. The agent downloads them at runtime; for AI4Code, the final model used was
As for traces and logs: we don't have plans to share those at this point. But your question got us thinking: it might be interesting to clean up and share the training code specifically for competitions where other agents struggled to get results but ours did, with AI4Code being a natural candidate. We'll consider it. |
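To make the plateau-handling description above more concrete, here is one way such logic could look. This is a purely hypothetical sketch and not Cobra Agent's implementation: the `patience` and `min_delta` thresholds, the `cv_trusted` flag, and the action names are all assumptions introduced for illustration.

```python
from typing import List


def has_plateaued(scores: List[float], patience: int = 3, min_delta: float = 1e-3) -> bool:
    """Return True if the last `patience` scores did not beat the
    earlier best by at least `min_delta` (higher scores are better)."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_delta


def next_action(scores: List[float], cv_trusted: bool, patience: int = 3) -> str:
    """Pick a hypothetical recovery action when progress stalls."""
    if not has_plateaued(scores, patience=patience):
        return "continue"
    # If local CV looks unreliable, fix the measurement first;
    # otherwise restart the search with a fresh context.
    return "fresh_agent" if cv_trusted else "redesign_cv"


if __name__ == "__main__":
    history = [0.80, 0.82, 0.83, 0.8301, 0.8302, 0.8300]
    print(next_action(history, cv_trusted=True))
```

The point of the sketch is only that a stall triggers a change of strategy rather than another retry in the same direction.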
|
Congrats on the results, and thanks for engaging with questions here! One quick clarification before merge: the PR states "no external data sources were accessed" and your website describes the sandbox as having "no network access," but then you mention that the agent downloads pretrained weights (deberta-v3-large) at runtime. Could you clarify how that works?
Separately, given that this would be the top result on the leaderboard, it'd be great to have at least one transparency artifact to go alongside the grading reports. The table in #138 shows every other top-10 submission has published source code, a tech report, or execution logs, and that norm has become pretty important for the benchmark's credibility. @joe-needham wanted to flag this before the PR moves forward. |
Thanks for flagging both points! On the network clarification: The "no network access" language on our website was intended to convey that the agent does not use web search during the agent's run — not that the sandbox is fully air-gapped. Pretrained weights are downloaded at runtime in the normal way. We recognize the wording was ambiguous and have updated it to be clearer. On transparency artifacts: We understand the norm and appreciate why it matters for the benchmark's credibility. That said, we'd note that the current rank 1 submission (fmagent2) does not appear to have published source code, a tech report, or execution logs either — the linked resource points to v1, not the submitted version. Several other top-10 entries (disarray, PiEvolve, Thesis) appear to be in a similar position. Ultimately, we're happy to defer to @joe-needham and the maintainers on what's required for the PR to move forward. |
|
Thanks for the clarification on network access, that's helpful. On the comparisons to other submissions, I think there are some important differences worth noting:
The concern isn't that every submission needs to be fully open-sourced. It's that this would be the first top result where the community has no way to understand how the result was achieved. The explanations so far have been high-level ("composable modules", "plateau recovery", "fresh agent with new context") without the kind of specific, verifiable detail that other top submissions have provided, whether through code, papers, or even just detailed artifacts. Happy to defer to @joe-needham on what's needed to move forward. |
|
Fair point — the comparison to other submissions wasn't the right framing, and we appreciate you laying out the specifics. We've put together an artifact repository with representative agent-generated code from five competitions, including AI4Code. A few things worth noting about what's there:
We hope this gives a more concrete picture of how Cobra Agent approaches these competitions in practice. |
|
Thanks for publishing the artifact repository, really appreciate the team engaging with the community here. I went through the published code and had a question about the
command = f"mlebench grade-sample {submission_csv_path} {comp} --data-dir ./zip_files"
MLE-Bench provides an official validation mechanism specifically designed for agents — In contrast, I see that the tool wrapper discards the
A couple of questions:
|
Thanks for your continued interest. I'll keep this brief.
|
|
Thanks for sharing the artifact code! I spent some time going through the AI4Code files in detail and had a couple of observations I wanted to flag.
Model clarification
Earlier in this thread you mentioned:
In the published artifacts,
The README notes that "models may differ across runs," so perhaps the submitted scores came from a different run than the published artifacts. Could you clarify which model produced the submitted scores, and whether the published artifacts represent the run that was actually graded?
Architecture and AI4Code context
Looking at the code more closely, the AI4Code approach implements six specific architectural elements:
These are the same six elements documented in the 1st place Kaggle AI4Code solution (published November 2022), which used DeBERTa-v3-large. The Cobra Agent artifacts substitute ModernBERT-large but keep the identical architecture. The 2nd place solution used a fundamentally different approach (bi-encoder with transformer decoder, no pct_rank). As I noted above, the highest AI4Code score across all other agents is 0.836, and from the grading reports in this repo, the bronze medal threshold for AI4Code is 0.853. No other agent has crossed it. Additionally, the MLE-Bench paper (Section 2.3.1) notes that "agents are also forbidden from viewing solutions online, which can often be found on Kaggle or GitHub." This isn't to suggest anything improper; LLMs trained on public Kaggle content could plausibly reproduce this approach from their training data. But given that the architecture matches the 1st place solution specifically (not the 2nd place or other competitive approaches), and the scores represent a step-function improvement over all other agents, it would be helpful to understand:
Thanks again for engaging with the community on this! |
|
Thanks to the maintainers for the update on the submission process. We've read the discussion in this thread carefully. Once the new submission criteria are published, we plan to update this PR to align with them. Thanks to everyone who engaged on this PR. |


Hi authors of MLE-Bench,
We are the research team at Dalpha, building Cobra Agent — an autonomous ML engineering agent built on a stack of composable modules, spanning sandboxed code execution, autonomous sub-agent orchestration, and end-to-end machine learning workflows.
We evaluated Cobra Agent on MLE-Bench using three independent experiment groups, and submit the results in this PR.
Pull Request Contents
This PR updates the README.md leaderboard and includes detailed grading reports for three independent runs, located at:
runs/cobra_agent_group1
runs/cobra_agent_group2
runs/cobra_agent_group3
Resources per Run
Notes on Recently Raised Concerns
Regarding test label leakage: Our agent does not use web search at any point during evaluation. No external data sources were accessed, and there is no risk of test label leakage through this vector.
Regarding submission behavior: The submission file was selected solely based on the agent's internal cross-validation metrics. No grading signals or scores from the private test set were used to guide or update the agent's behavior during inference.
Best regards,
The Dalpha Team