SharpeBench

The luck-robust benchmark for AI trading agents

Other leaderboards rank the luckiest run over one quarter. SharpeBench ranks the skill that survives deflation — and proves it forward.

Why · Quickstart · Surfaces · What it measures · Architecture · Tech stack · References

Why

Every existing financial-agent benchmark ranks on raw risk-adjusted metrics over a single short window and a handful of runs — so the leaderboard mostly measures noise. FinBen reports Sharpe confidence intervals of ±1.08, which makes its rankings statistically indistinguishable. StockBench runs one window, once. QuantBench reports Sharpe across 40 seeds but never deflates it.

In an AI trading benchmark, the hard part is not measuring return. It is separating skill from luck. A model that posts a great Sharpe over one quarter has told you almost nothing — the number is dominated by sampling noise, by the number of strategies that were tried, and by hidden risk the linear return series can't see.

SharpeBench adds, as ranking gates, the things none of the others have:

Deflated Sharpe / PSR — deflate the Sharpe by how many agents were tested × track length × return skew/kurtosis (Bailey & López de Prado), plus each agent's own declared in-sample trials, so a strategy mined from a thousand private backtests is deflated for that search too.
pass^k reliability — the agent must clear the bar on every seed × window, not on average.
Field-wide significance — a deterministic stationary bootstrap, White's Reality Check, Hansen's studentized & consistent SPA, and Romano–Wolf step-down; the edge must beat data-snooping, not just noise.
Process discipline — placing an order that never passed the risk gate, ignoring a drawdown halt, bypassing a deny-list, or selling tail risk with a naked short-gamma book zeroes the entry, however good the P&L looks. The edge must also survive a realistic execution-cost profile (typical or worst-case fees / slippage / impact / financing), not just a frictionless fill.
Forward-attestation — agents commit before the data exists, so there's nothing to overfit, and signed, tamper-evident result chains let anyone verify the board independently of the host. Every run can be captured as a raw-decision trajectory and replayed by a separate verifier that recomputes a byte-identical score — a forged trajectory recomputes differently.
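
The tamper-evidence idea behind the forward-attestation gate can be sketched as a hash chain: each entry's digest covers the previous digest plus its own payload, so editing any earlier result invalidates the chain from that point on. This is a minimal illustration under stated assumptions, not the `sharpebench-attest` implementation. `Entry`, `link`, `append`, and `verify` are hypothetical names, and FNV-1a is a non-cryptographic stand-in chosen only to keep the sketch dependency-free; the README describes the real chain as SHA-256 commitments with HMAC signing, which this sketch does not do.

```rust
/// FNV-1a 64-bit hash. NOT cryptographic: a stand-in for SHA-256 so the
/// sketch needs no external crates.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// One link in the chain (hypothetical layout).
#[derive(Debug, Clone)]
struct Entry {
    payload: String,
    prev: u64,
    digest: u64,
}

/// Digest of an entry: hash of the previous digest followed by the payload.
fn link(prev: u64, payload: &str) -> u64 {
    let mut buf = prev.to_le_bytes().to_vec();
    buf.extend_from_slice(payload.as_bytes());
    fnv1a(&buf)
}

/// Append a payload, chaining it to the last digest (0 for the first entry).
fn append(chain: &mut Vec<Entry>, payload: &str) {
    let prev = chain.last().map_or(0, |e| e.digest);
    chain.push(Entry { payload: payload.to_string(), prev, digest: link(prev, payload) });
}

/// Recompute every digest from the start; any edited entry breaks the chain.
fn verify(chain: &[Entry]) -> bool {
    let mut prev = 0u64;
    for e in chain {
        if e.prev != prev || e.digest != link(prev, &e.payload) {
            return false;
        }
        prev = e.digest;
    }
    true
}
```

Appending two boards and then editing the first payload makes `verify` return `false`, because the stored digest no longer matches the recomputed one.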

Raw return is reported but is never the rank key. The composite also reports (without gating) alpha/beta attribution, calibration, edge half-life, OOS decay, turnover, Pareto-optimality, conviction-weighted return, cost-efficiency (cost-normalized DSR), rolling-window worst-case Sharpe, selection robustness, and the Sortino ratio (downside-only risk) — so a high score is legible, not a black box.
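
Of the reported-but-not-gating metrics, the Sortino ratio is the simplest to show in code. The sketch below assumes a per-period target return and the common definition of downside deviation (root mean square of shortfalls below the target, averaged over all periods). The function names are illustrative and are not the `sharpebench-stats` API.

```rust
/// Downside deviation: RMS of shortfalls below `target`, averaged over all
/// periods (periods at or above target contribute zero).
fn downside_deviation(returns: &[f64], target: f64) -> f64 {
    if returns.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = returns
        .iter()
        .map(|&r| (r - target).min(0.0).powi(2))
        .sum();
    (sum_sq / returns.len() as f64).sqrt()
}

/// Sortino ratio: mean return in excess of `target`, divided by downside
/// deviation. Returns None for empty input or when there is no downside,
/// since the ratio is undefined in both cases.
fn sortino_ratio(returns: &[f64], target: f64) -> Option<f64> {
    if returns.is_empty() {
        return None;
    }
    let mean = returns.iter().sum::<f64>() / returns.len() as f64;
    let dd = downside_deviation(returns, target);
    if dd == 0.0 {
        None
    } else {
        Some((mean - target) / dd)
    }
}
```

Unlike the Sharpe ratio, only the losing periods enter the denominator, so upside volatility is not penalized.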

Contamination & input defenses. Comparison is restricted to the shared instruments a field actually traded, a rediscovery check flags a "novel" strategy that is a cosine-near copy of a known one, held-out datasets can be sealed (committed, opened only at scoring), a canary tripwire detects post-hoc that a model trained on the scenarios, and a briefing-neutrality audit lints the shared information packet for the salience bias that would tilt every agent at once.

An agent does not rank on raw return. It ranks only if its edge survives deflation, reliability, significance, and process discipline — and it proves all of it forward.

Status — active (pre-1.0)

All eleven crates are implemented, tested, and CI-green (fmt · clippy -D warnings · workspace tests · a determinism check · the 7-attack self-audit · a docs build · an npm build/test). The statistics kernel, the backtest-honesty verdict, scoring kernel, point-in-time simulator, run harness, forward-attestation, leaderboard, WASM bridge, npm package, MCP server, and CLI all work end-to-end — on synthetic data and on real frozen datasets (crypto majors + US equity indices).

Not yet built (need external infra or a decision): single-name equity data (a keyed feed), a live / forward public arena with hosting, and the public data-curation protocol. See docs/PLAN.md.

Quickstart

cargo install sharpebench                                    # the CLI
sharpebench run                                              # reference agents + a luck floor, ranked
sharpebench score suites/example_submissions.json           # rank a JSON field of submissions
sharpebench audit                                           # prove the scorer resists 7 known gaming attacks
sharpebench run --data data/crypto-majors-1d.csv            # run on real crypto-majors daily bars

The example field includes a skilled agent, a lucky agent with a higher raw return, and a process-violating agent. The skilled agent ranks first; the other two are ineligible, which is the whole point. The `run` command adds a luck floor of random "monkey" agents so you can see the zero-skill distribution a real edge must clear.

Use it from anywhere

One kernel, scored identically across every surface — the internal eval and the public benchmark can't drift.

Surface	Get it	What it is
Rust crate	`cargo add sharpebench-core`	The pure scoring kernel — deterministic, `#![forbid(unsafe_code)]`.
Rust (just the stats)	`cargo add sharpebench-stats`	The standalone statistics kernel — PSR, deflated Sharpe, the data-snooping tests, selection. The same math the board ranks on, with no benchmark attached.
CLI	`cargo install sharpebench`	`run` / `score` / `audit` / `sign` / `verify` / `greeks` / …
npm	`npm i @general-liquidity/sharpebench`	Typed JS/TS API over the WASM kernel — `score`, `greeks`, `selfAudit`.
MCP	`npx -y @general-liquidity/sharpebench-mcp`	An MCP server — agents call the kernel as tools.
WASM	`sharpebench-wasm`	The wasm-bindgen bridge the npm package and Gordon (Bun) embed.

import { score, greeks } from "@general-liquidity/sharpebench";

const board = score(submissions);   // ranked CompositeScore[] — raw return never buys rank
greeks({ spot: 100, strike: 100, t_years: 1, rate: 0.05, vol: 0.2, is_call: true }).price; // 10.45
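
The `10.45` in the comment above can be cross-checked by hand. Below is a minimal Rust sketch of the Black–Scholes price of a European call with no dividends, using the Abramowitz and Stegun 7.1.26 approximation for erf (absolute error around 1.5e-7). `erf`, `norm_cdf`, and `bs_call` are illustrative names and not the kernel's API; the kernel may compute the normal CDF differently.

```rust
/// erf via Abramowitz & Stegun 7.1.26 (max absolute error about 1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t
        * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal CDF expressed through erf.
fn norm_cdf(x: f64) -> f64 {
    0.5 * (1.0 + erf(x / std::f64::consts::SQRT_2))
}

/// Black–Scholes price of a European call, no dividends.
/// Assumes t_years > 0 and vol > 0.
fn bs_call(spot: f64, strike: f64, t_years: f64, rate: f64, vol: f64) -> f64 {
    let sqrt_t = t_years.sqrt();
    let d1 = ((spot / strike).ln() + (rate + 0.5 * vol * vol) * t_years) / (vol * sqrt_t);
    let d2 = d1 - vol * sqrt_t;
    spot * norm_cdf(d1) - strike * (-rate * t_years).exp() * norm_cdf(d2)
}
```

With the inputs from the example (spot 100, strike 100, one year, 5% rate, 20% vol), d1 is 0.35 and d2 is 0.15, and `bs_call` agrees with the quoted 10.45 to the displayed precision.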

From Rust

// Is this Sharpe real, or an artifact of luck and multiple testing?
use sharpebench_stats::{deflated_sharpe_ratio, probabilistic_sharpe_ratio, sharpe_ratio};

let returns = [0.012, -0.004, 0.009, 0.011, -0.002, 0.008, 0.010, -0.001];
let sr = sharpe_ratio(&returns);                     // observed, per-period
let psr = probabilistic_sharpe_ratio(&returns, 0.0); // P(true Sharpe > 0)
let dsr = deflated_sharpe_ratio(&returns, 200, 0.5); // deflated for 200 trials searched

// Rank a field of agents. The deflated Sharpe sorts the board; raw return never does.
use sharpebench_core::{rank, AgentSubmission, Run, ScoreConfig, Trace};
let mk = |id: &str, returns: Vec<f64>, trials: u32| AgentSubmission {
    agent_id: id.into(),
    runs: vec![Run { returns, trace: Trace::default(), confidences: vec![], outcomes: vec![], cost: 0.0 }],
    in_sample_trials: trials,
    candidates: vec![],
};
let board = rank(&[
    mk("skilled", vec![0.012, 0.008, 0.011, 0.009, 0.010], 1),
    mk("lucky",   vec![0.090, -0.02, 0.001, -0.03, 0.05], 500), // bigger raw return, 500 trials
], &ScoreConfig::default());
for s in &board {
    println!("{}  deflated={:.3}  eligible={}", s.agent_id, s.deflated_sharpe, s.rank_eligible);
}

Both halves are compile-and-run-checked as doctests in sharpebench-stats and sharpebench-core, so they can't silently drift from the API.
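
For orientation, these are the two statistics those calls are named after, written in the form published by Bailey and López de Prado. This is the textbook form only: which estimator conventions the crate follows (for example n versus n − 1, or how the variance across trial Sharpes is supplied) is not stated in this README, so the symbols below are assumptions rather than a description of the code.

```latex
% Probabilistic Sharpe Ratio against a benchmark Sharpe SR^*
\widehat{PSR}(SR^{*}) =
  \Phi\!\left(
    \frac{(\widehat{SR} - SR^{*})\,\sqrt{n-1}}
         {\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{2}}}
  \right)

% Deflated Sharpe Ratio: PSR evaluated at the expected maximum Sharpe of N trials
\widehat{DSR} = \widehat{PSR}(SR_0),
\qquad
SR_0 = \sqrt{V\!\left[\{\widehat{SR}_k\}\right]}
  \left(
    (1-\gamma)\,\Phi^{-1}\!\left[1 - \frac{1}{N}\right]
    + \gamma\,\Phi^{-1}\!\left[1 - \frac{1}{N e}\right]
  \right)
```

Here n is the number of return observations, γ̂₃ and γ̂₄ are the sample skewness and kurtosis of returns (kurtosis is 3 for a normal distribution), Φ is the standard normal CDF, N is the number of independent trials, V[·] is the variance of Sharpe estimates across those trials, and γ ≈ 0.5772 is the Euler–Mascheroni constant. More trials raise SR₀, so the same observed Sharpe earns a lower deflated score.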

CLI commands

Command	What it does
`run` (+ `--data <csv>`, `--http`/`--cmd`)	Run agents through the point-in-time sim and rank them; `--http`/`--cmd` drives your external agent into the field.
`score <subs.json>`	Rank a JSON field of pre-computed submissions.
`check <returns.csv> --trials N`	"Is my Sharpe real?" Prints deflated Sharpe / haircut / MinTRL / verdict for your own return series; `--trials` is required (no silent default).
`audit`	Self-audit: fire 7 known gaming attacks at the scorer; non-zero exit if any survives.
`stress`	Run the adversarial stress suite (flash-crash / whipsaw), contamination-masked.
`commit` · `sign` · `verify`	Forward-attestation: pre-register a digest, sign a board, verify its chain.
`capture` · `verify-trajectory`	Capture an agent's raw-decision trajectory, then replay it to recompute the score.
`audit-briefing` · `canary`	Audit a shared briefing for salience bias; derive a do-not-train contamination tripwire.
`score-allocation` · `greeks`	Score a weight-vector trajectory (turnover); price an option + Greeks + tail-risk.

Add --json to any command for machine-readable output.

Bring your own agent

Agents are external and language-agnostic — implement the tiny JSON contract (MarketObservation → Decision) over either transport, then rank yourself into the field:

sharpebench run --cmd "cargo run -q -p reference-agent"   # stdio subprocess
sharpebench run --http 127.0.0.1:8080                     # HTTP POST /decide

A runnable reference agent (stdio + Dockerfile) and the wire format live in examples/reference-agent/.

Security — running untrusted agents. The harness executes whatever agent you point it at without sandboxing. Only run agents you trust. Safe hosting of third-party submissions is a Phase-2 item and is not yet built.

What it measures

An agent is rank-eligible only if every gate holds; eligible agents then sort by the rank key (Deflated Sharpe).

Gate	Demands	Defeats
Deflated Sharpe / PSR	edge survives deflation for trials × length × skew/kurtosis	data-snooping, lucky search
pass^k	clears the bar on every seed × window	one-lucky-seed wins
Significance	beats bootstrap + Reality Check + SPA + step-down	multiple-testing false positives
Process	zero block-severity trace violations	gate-bypass, naked tail-selling, manipulation
Mandate	respected the drawdown cap	blowing risk to chase return

Reported but never gating: Sortino + downside deviation, rolling worst-case Sharpe, selection robustness, alpha/beta, calibration, edge half-life, OOS decay, turnover, Pareto-optimality, cost-normalized DSR. Full methodology: the mdBook (mdbook serve docs/book).
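
The gate-then-sort rule stated at the top of this section can be sketched as follows. This is an illustration, not the `sharpebench-core` implementation: the `Entry` struct, its fields, the thresholds `bar` and `min_dsr`, and `rank_field` are all hypothetical, and each gate is reduced to a precomputed flag or a simple per-run check.

```rust
/// Hypothetical per-agent summary; field names are assumptions.
#[derive(Debug, Clone)]
struct Entry {
    agent_id: String,
    deflated_sharpe: f64,
    raw_return: f64,           // reported, never used for ordering
    run_scores: Vec<f64>,      // one score per seed × window
    significant: bool,         // passed the data-snooping tests
    block_violations: u32,     // block-severity process violations
    within_drawdown_cap: bool, // mandate respected
}

/// pass^k: every run must clear the bar; the average is irrelevant.
fn pass_k(run_scores: &[f64], bar: f64) -> bool {
    !run_scores.is_empty() && run_scores.iter().all(|&s| s > bar)
}

/// An agent is eligible only if every gate holds.
fn eligible(e: &Entry, bar: f64, min_dsr: f64) -> bool {
    e.deflated_sharpe > min_dsr
        && pass_k(&e.run_scores, bar)
        && e.significant
        && e.block_violations == 0
        && e.within_drawdown_cap
}

/// Eligible agents only, sorted by deflated Sharpe, highest first.
/// `raw_return` is deliberately never consulted.
fn rank_field(field: &[Entry], bar: f64, min_dsr: f64) -> Vec<&Entry> {
    let mut board: Vec<&Entry> = field
        .iter()
        .filter(|e| eligible(e, bar, min_dsr))
        .collect();
    board.sort_by(|a, b| b.deflated_sharpe.total_cmp(&a.deflated_sharpe));
    board
}
```

An agent with one losing run or one block-severity violation drops off the board entirely, however high its deflated Sharpe or raw return.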

Data

The benchmark runs on frozen, checksummed, point-in-time datasets — no live API in the scoring path, so a score reproduces forever.

Source	Set	Provides
Binance	`crypto-majors-1d.csv`	BTC/ETH/SOL/BNB/XRP daily closes (public API, no key)
FRED	`us-indices-1d.csv`	SPX / DJI / IXIC daily closes (public domain)

Both are fetched and frozen by the offline Rust ingester (xtask, publish = false, so its dependencies never reach the CLI). The format is long-form CSV, one row per date and symbol: date,symbol,close[,dividend]. Any aligned dataset works.

cargo run -p xtask -- crypto                              # re-fetch + re-checksum
sharpebench run --data data/us-indices-1d.csv
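
The `date,symbol,close[,dividend]` layout described above is simple enough to read by hand. The sketch below is one way to parse it with the standard library only. The `Bar` struct and function names are hypothetical, the date is kept as an opaque string because the README does not specify its format, and whether the frozen files carry a header row is not stated, so rows whose close is not a finite number are simply skipped.

```rust
/// One row of a long-form dataset (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
struct Bar {
    date: String, // kept as text; the date format is not specified here
    symbol: String,
    close: f64,
    dividend: Option<f64>,
}

/// Parse one `date,symbol,close[,dividend]` row.
/// Returns None for rows that do not fit, which also skips a header row
/// because the literal text "close" is not a number.
fn parse_row(line: &str) -> Option<Bar> {
    let fields: Vec<&str> = line.trim().split(',').map(str::trim).collect();
    if fields.len() < 3 || fields.len() > 4 {
        return None;
    }
    let close = fields[2].parse::<f64>().ok().filter(|c| c.is_finite())?;
    let dividend = match fields.get(3) {
        Some(s) if !s.is_empty() => Some(s.parse::<f64>().ok()?),
        _ => None,
    };
    Some(Bar {
        date: fields[0].to_string(),
        symbol: fields[1].to_string(),
        close,
        dividend,
    })
}

/// Parse a whole file's text, silently dropping rows that fail to parse.
fn parse_dataset(text: &str) -> Vec<Bar> {
    text.lines().filter_map(parse_row).collect()
}
```

A stricter loader would report bad rows instead of dropping them; this sketch favors brevity.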

Architecture

A Rust Cargo workspace (modular, à la Paradigm's Rust OSS — reuse any crate on its own). The whole tree is #![forbid(unsafe_code)].

sharpebench-stats ── the statistics kernel: PSR, expected-max-Sharpe, deflated Sharpe, the
                     data-snooping family (bootstrap / White RC / Hansen SPA / Romano–Wolf),
                     Sortino + moments, selection
      ├── sharpebench-core ── the deterministic scoring kernel (no I/O, no ambient RNG); re-exports -stats
      │     ├── sharpebench-protocol   language-agnostic agent ⇄ harness JSON
      │     ├── sharpebench-sim        point-in-time simulator (look-ahead impossible)
      │     ├── sharpebench-harness    orchestration across seeds × windows
      │     ├── sharpebench-attest     SHA-256 commitments + signed chains + sealed data + canary
      │     ├── sharpebench-leaderboard render / sign / self-describing boards
      │     ├── sharpebench-wasm       the identical kernel for JS/TS (npm, Gordon, MCP)
      │     └── sharpebench-cli        the `sharpebench` binary
      ├── sharpebench-edge ── the "is my Sharpe real?" verdict: MinTRL + PBO + the two-tier honesty check
      └── sharpebench-memory ── the memory/retrieval benchmark: 3-arm ablation + poisoning + PIT + multi-session + confabulation, significance via -stats

Crate	Role
`sharpebench-core`	the scoring layer over `sharpebench-stats`: pass^k / process + cost floor / rolling / decay / calibration / attribution / comparison-sets / rediscovery / briefing-audit / allocation / options-Greeks / self-audit / composite, plus a disqualification-reason taxonomy (a typed `FailReason` rollup over the signals the scorer already computes, so a suite of submissions is legible as "X failed on luck/pass^k, Y on process, Z on deflation/overfit-decay") and a re-export of the whole `-stats` kernel so existing `sharpebench_core::…` paths are unchanged. Byte-identical scores forever.
`sharpebench-stats`	the deterministic statistics kernel, split out so any project can depend on just the math: PSR, expected-max-Sharpe, deflated Sharpe (Bailey & López de Prado), the data-snooping family (stationary bootstrap, White's Reality Check, Hansen SPA liberal + consistent, Romano–Wolf step-down), Sortino + moments + normal primitives, selection robustness, and a stylized-facts realism validator (Cont battery: fat tails, volatility clustering, gain/loss skew, aggregational Gaussianity, Zumbach time-reversal asymmetry) that certifies a frozen dataset is market-realistic, wired into a `sharpebench realism` CLI + CI gate so a drifted generator fails the build. No I/O, no ambient RNG, fixed reduction order.
`sharpebench-edge`	the "is my Sharpe real?" honesty layer over `-stats`: Minimum Track Record Length, Probability of Backtest Overfitting (CSCV), and the two-tier `is_my_sharpe_real` verdict (PSR / deflated Sharpe / MinTRL + haircut + Pass/Borderline/Fail; the full tier adds the data-snooping family + PBO). Powers `sharpebench check`.
`sharpebench-memory`	the memory/retrieval benchmark over `-stats`: the three-arm ablation (baseline / retrieval / oracle) with retrieval lift, stationary-bootstrap significance, and fraction-of-ceiling, plus four legs a SOTA memory benchmark also has to answer - E1 poisoning (integrity delta / attack-success rate / degradation significance), E2 interdependent multi-session (per-session conditioned lift + dependency-satisfaction rate + pooled significance), E3 point-in-time correctness (per-arm no-lookahead compliance + leak flag), and E6 confabulation (regret over reinforced-but-never-re-tested false beliefs). Pure and deterministic; significance delegated to `-stats`; it scores caller-supplied outcome vectors, with no live agent runner.
`sharpebench-sim`	fees, seeded slippage, square-root impact, financing, turnover (TRF) cost, liquidity caps, dividends, execution-cost profiles, a parameterized synthetic generator (volatility + jumps), adversarial stress paths, trajectory capture/replay, and O(1) `clone_state` / `restore_state` snapshots.
`sharpebench-attest`	SHA-256 pre-registration commitments + HMAC signed result chains + time-lock registry + sealed held-out datasets + canary contamination tripwire.
`sharpebench-harness`	seeds × windows orchestration; luck-floor producers; a runtime-vs-agent failure taxonomy.
`protocol` · `leaderboard` · `wasm` · `cli`	the JSON contract · render/sign/self-describing boards · the WASM bridge · the CLI.
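
The square-root impact term listed for `sharpebench-sim` follows a widely used rule of thumb: impact grows with the square root of the order's share of daily volume. The sketch below assumes the form k · σ · √(quantity / ADV) and a flat proportional fee. The coefficient `k`, the parameter names, and both functions are illustrative assumptions, not the simulator's actual cost model or API.

```rust
/// Square-root market impact as a fraction of price:
/// k * daily_vol * sqrt(qty / adv). Returns 0 for non-positive inputs.
fn sqrt_impact(qty: f64, adv: f64, daily_vol: f64, k: f64) -> f64 {
    if qty <= 0.0 || adv <= 0.0 {
        return 0.0;
    }
    k * daily_vol * (qty / adv).sqrt()
}

/// Total cost of a trade in currency units: proportional fee plus impact,
/// both applied to the traded notional.
fn execution_cost(
    notional: f64,
    qty: f64,
    adv: f64,
    daily_vol: f64,
    fee_rate: f64,
    k: f64,
) -> f64 {
    notional * (fee_rate + sqrt_impact(qty, adv, daily_vol, k))
}
```

Because of the square root, quadrupling the order size doubles the impact fraction rather than quadrupling it.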

Memory benchmark (`sharpebench-memory`)

The same skill-vs-luck discipline, pointed at a memory or retrieval layer. Proving that a memory layer helps (rather than just adding tokens and latency) takes an ablation, so the crate scores three arms: baseline (no memory, the floor), retrieval (the layer under test), and oracle (gold records only, the ceiling). It reports the retrieval lift, its stationary-bootstrap significance (delegated to sharpebench-stats, not reinvented), and the fraction of the oracle ceiling captured.
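
A minimal sketch of the two point estimates named above, assuming each arm is summarized by a non-empty vector of per-task outcome scores and that lift is a difference in means. `retrieval_lift` and `fraction_of_ceiling` are illustrative names rather than the crate's API, and the bootstrap significance step is left out.

```rust
/// Arithmetic mean; assumes a non-empty slice (an empty one yields NaN).
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Lift of the retrieval arm over the no-memory baseline.
fn retrieval_lift(baseline: &[f64], retrieval: &[f64]) -> f64 {
    mean(retrieval) - mean(baseline)
}

/// Share of the oracle-minus-baseline headroom that retrieval captured.
/// Returns None when the oracle does not beat the baseline, since there is
/// then no headroom to take a fraction of.
fn fraction_of_ceiling(baseline: &[f64], retrieval: &[f64], oracle: &[f64]) -> Option<f64> {
    let headroom = mean(oracle) - mean(baseline);
    if headroom > 0.0 {
        Some(retrieval_lift(baseline, retrieval) / headroom)
    } else {
        None
    }
}
```

A fraction near 1 means retrieval recovers almost everything a perfect memory could; a fraction near 0 means the layer adds cost without measurable benefit.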

Around that floor and ceiling it adds the four legs a memory benchmark also has to answer, each pure and deterministic:

E1 poisoning. Inject corrupted records and measure the behavior-integrity delta, the attack-success rate, and the bootstrap significance of the degradation. Money-memory (a forged limit, a wrong balance, a spoofed venue) is the high-severity case.
E2 multi-session. Model sessions as a dependency graph; a later session's lift is credited only when the memory an earlier session wrote was actually retained. Reports per-session conditioned lift, the cross-session dependency-satisfaction rate, and pooled significance.
E3 point-in-time correctness. Score no-lookahead compliance per arm from recall-audit counts and flag whether the retrieval arm leaked future data. It takes counts only, so it stays decoupled from any enforcement layer - a bi-temporal store, a replay harness, or a hand audit can feed it.
E6 confabulation. The "honest lying" metric: among beliefs that were reinforced but never re-tested and have since resolved, the fraction that proved wrong.
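
The E6 definition translates almost directly into code. The sketch below assumes each belief carries three facts: whether it was reinforced, whether it was ever re-tested, and how it resolved, if it has. The `Belief` struct and `confabulation_rate` are hypothetical and are not the `sharpebench-memory` API.

```rust
/// A tracked belief (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct Belief {
    reinforced: bool,
    retested: bool,
    /// Some(true) = proved right, Some(false) = proved wrong, None = unresolved.
    resolved: Option<bool>,
}

/// Among beliefs that were reinforced, never re-tested, and have since
/// resolved, the fraction that proved wrong. None if no belief qualifies.
fn confabulation_rate(beliefs: &[Belief]) -> Option<f64> {
    let pool: Vec<bool> = beliefs
        .iter()
        .filter(|b| b.reinforced && !b.retested)
        .filter_map(|b| b.resolved)
        .collect();
    if pool.is_empty() {
        return None;
    }
    let wrong = pool.iter().filter(|&&ok| !ok).count();
    Some(wrong as f64 / pool.len() as f64)
}
```

Beliefs that were re-tested or are still unresolved are excluded from both the numerator and the denominator, so the rate only measures untested confidence that later failed.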

Together these cover the union a SOTA memory benchmark has to prove in one place: statistical significance (stationary bootstrap via -stats), point-in-time no-lookahead, poison-resistance, cross-session dependency, and confabulation. Like the scoring kernel, it consumes caller-supplied outcome vectors and has no live agent runner. The crate has 36 unit tests, is #![forbid(unsafe_code)], and uses a fixed bootstrap seed so a verdict never moves when re-run.
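
Why a fixed seed keeps a verdict stable can be seen in a small stationary-bootstrap resampler (the Politis and Romano scheme: blocks of geometrically distributed length that wrap around the series). This is a sketch under assumptions: the SplitMix64 generator, the restart probability `p`, and the function names are illustrative choices, not what `sharpebench-stats` is documented to use. The modulo step has a negligible bias that is ignored here.

```rust
/// SplitMix64: a tiny deterministic PRNG, used so the sketch has no dependencies.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1) from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// One stationary-bootstrap resample of the indices 0..n.
/// With probability `p` the next index starts a new block at a uniform
/// position; otherwise the block continues, wrapping at n.
/// The mean block length is 1/p.
fn stationary_bootstrap_indices(n: usize, p: f64, rng: &mut SplitMix64) -> Vec<usize> {
    let mut out = Vec::with_capacity(n);
    if n == 0 {
        return out;
    }
    let mut i = (rng.next_u64() % n as u64) as usize;
    for _ in 0..n {
        out.push(i);
        i = if rng.next_f64() < p {
            (rng.next_u64() % n as u64) as usize
        } else {
            (i + 1) % n
        };
    }
    out
}

/// Mean of one bootstrap resample; the same seed always gives the same value.
/// Assumes `xs` is non-empty.
fn bootstrap_mean(xs: &[f64], p: f64, seed: u64) -> f64 {
    let mut rng = SplitMix64(seed);
    let idx = stationary_bootstrap_indices(xs.len(), p, &mut rng);
    idx.iter().map(|&i| xs[i]).sum::<f64>() / xs.len() as f64
}
```

Because the generator state is the only source of randomness and it starts from a fixed seed, every re-run draws the same blocks and therefore reaches the same significance verdict.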

Tech stack

Technology	Role
Rust	The whole kernel — pure `f64`, fixed reduction order, no `unsafe`
WebAssembly	The kernel for non-Rust hosts (`wasm-bindgen`)
TypeScript	The typed npm package + MCP server
npm	JS/TS distribution of the scoring kernel
serde	Deterministic JSON for every submission, board, and config
GitHub Actions	CI: fmt · clippy · tests · determinism · self-audit · docs · npm
MCP	Agents call the kernel as tools
cargo-deny	Supply-chain gate (advisories · bans · licenses · sources)

Methodology & references

The gates are not invented — they are the published, peer-reviewed controls for skill-vs-luck, assembled into one ranking.

Control	Reference
Deflated Sharpe & PSR	Bailey & López de Prado, The Deflated Sharpe Ratio (2014)
Reality Check	White, A Reality Check for Data Snooping (2000)
Superior Predictive Ability	Hansen, A Test for Superior Predictive Ability (2005)
Step-down multiple testing	Romano & Wolf (2005)
Reliability across runs (pass^k)	Sierra τ²-bench reliability metric
Downside risk (Sortino)	Sortino & van der Meer (1991)
Options Greeks	Black–Scholes–Merton (1973)

Full derivations in the mdBook: methodology · integrity · attestation.

Governance

Hosted by General Liquidity to start, with a roadmap to neutral governance. Credibility comes from forward-attestation + signed, independently-verifiable results, not from trust in the host — and Gordon (GL's agent) competes on the board like any other entrant. The neutral home may already exist: the FINOS-governed Open FinLLM Leaderboard covers the financial-knowledge axis but has no trading-performance axis — SharpeBench is positioned to be the skill-vs-luck trading track it lacks. See docs/GOVERNANCE.md.

License

Dual-licensed under either MIT or Apache-2.0, at your option.

Skill that survives deflation, and proves it forward.
