SharpeBench

The luck-robust benchmark for AI trading agents

Other leaderboards rank the luckiest run over one quarter. SharpeBench ranks the skill that survives deflation — and proves it forward.

Why · Quickstart · Surfaces · What it measures · Architecture · Tech stack · References


Why

Every existing financial-agent benchmark ranks on raw risk-adjusted metrics over a single short window and a handful of runs — so the leaderboard mostly measures noise. FinBen reports Sharpe confidence intervals of ±1.08, which makes its rankings statistically indistinguishable. StockBench runs one window, once. QuantBench reports Sharpe across 40 seeds but never deflates it.

In an AI trading benchmark, the hard part is not measuring return. It is separating skill from luck. A model that posts a great Sharpe over one quarter has told you almost nothing — the number is dominated by sampling noise, by the number of strategies that were tried, and by hidden risk the linear return series can't see.

SharpeBench adds, as ranking gates, the things none of the others have:

  1. Deflated Sharpe / PSR — deflate the Sharpe by how many agents were tested × track length × return skew/kurtosis (Bailey & López de Prado), plus each agent's own declared in-sample trials, so a strategy mined from a thousand private backtests is deflated for that search too.
  2. pass^k reliability — the agent must clear the bar on every seed × window, not on average.
  3. Field-wide significance — a deterministic stationary bootstrap, White's Reality Check, Hansen's studentized & consistent SPA, and Romano–Wolf step-down; the edge must beat data-snooping, not just noise.
  4. Process discipline — placing an order that never passed the risk gate, ignoring a drawdown halt, bypassing a deny-list, or selling tail risk with a naked short-gamma book zeroes the entry, however good the P&L looks. The edge must also survive a realistic execution-cost profile (typical or worst-case fees / slippage / impact / financing), not just a frictionless fill.
  5. Forward-attestation — agents commit before the data exists, so there's nothing to overfit, and signed, tamper-evident result chains let anyone verify the board independently of the host. Every run can be captured as a raw-decision trajectory and replayed by a separate verifier that recomputes a byte-identical score — a forged trajectory recomputes differently.
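
The tamper-evident chain in item 5 can be sketched as follows. This is a minimal illustration, not the sharpebench-attest implementation. The `Link` type and the function names are hypothetical, and a 64-bit FNV-1a hash stands in for SHA-256 and HMAC only to keep the sketch dependency-free. It is not cryptographically secure.

```rust
/// 64-bit FNV-1a. A stand-in for SHA-256 here, NOT cryptographically secure.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// One entry of a hypothetical result chain.
#[derive(Clone, Debug, PartialEq)]
struct Link {
    payload: String, // e.g. a pre-registered commitment or a serialized board
    prev: u64,       // digest of the previous link (0 for the first)
    digest: u64,     // hash over (prev, payload)
}

fn make_link(prev: u64, payload: &str) -> Link {
    let mut buf = prev.to_le_bytes().to_vec();
    buf.extend_from_slice(payload.as_bytes());
    Link { payload: payload.to_string(), prev, digest: fnv1a(&buf) }
}

/// Append a payload, binding it to everything that came before.
fn append(chain: &mut Vec<Link>, payload: &str) {
    let prev = chain.last().map_or(0, |l| l.digest);
    chain.push(make_link(prev, payload));
}

/// Recompute every digest; an edited payload or a reordered link fails.
fn verify(chain: &[Link]) -> bool {
    let mut prev = 0u64;
    for l in chain {
        if l.prev != prev || make_link(prev, &l.payload).digest != l.digest {
            return false;
        }
        prev = l.digest;
    }
    true
}
```

Because each digest covers the previous one, changing any earlier payload invalidates every later link, which is what lets a third party check a board without trusting the host.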

Raw return is reported but is never the rank key. The composite also reports (without gating) alpha/beta attribution, calibration, edge half-life, OOS decay, turnover, Pareto-optimality, conviction-weighted return, cost-efficiency (cost-normalized DSR), rolling-window worst-case Sharpe, selection robustness, and the Sortino ratio (downside-only risk) — so a high score is legible, not a black box.

Contamination & input defenses. Comparison is restricted to the shared instruments a field actually traded, a rediscovery check flags a "novel" strategy that is a cosine-near copy of a known one, held-out datasets can be sealed (committed, opened only at scoring), a canary tripwire detects post-hoc that a model trained on the scenarios, and a briefing-neutrality audit lints the shared information packet for the salience bias that would tilt every agent at once.

An agent does not rank on raw return. It ranks only if its edge survives deflation, reliability, significance, and process discipline — and it proves all of it forward.
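
That rule amounts to a filter followed by a sort. The sketch below is illustrative only: the `Entry` struct, its field names, and the two thresholds are hypothetical, and the real kernel's types and bars may differ.

```rust
/// Hypothetical per-agent summary; field names are illustrative only.
#[derive(Debug, Clone)]
struct Entry {
    agent_id: String,
    raw_return: f64,       // reported, never used for ranking
    deflated_sharpe: f64,  // the rank key
    run_sharpes: Vec<f64>, // one per seed x window
    significant: bool,     // passed the field-wide significance tests
    block_violations: u32, // block-severity process violations
}

/// pass^k: every run must clear the bar, not the average of the runs.
fn pass_k(run_sharpes: &[f64], bar: f64) -> bool {
    !run_sharpes.is_empty() && run_sharpes.iter().all(|s| *s > bar)
}

/// An entry is eligible only if every gate holds.
fn eligible(e: &Entry, dsr_bar: f64, run_bar: f64) -> bool {
    e.deflated_sharpe >= dsr_bar
        && pass_k(&e.run_sharpes, run_bar)
        && e.significant
        && e.block_violations == 0
}

/// Eligible agent ids, best deflated Sharpe first. Raw return is never read.
fn rank(field: &[Entry], dsr_bar: f64, run_bar: f64) -> Vec<String> {
    let mut ok: Vec<&Entry> = field
        .iter()
        .filter(|e| eligible(e, dsr_bar, run_bar))
        .collect();
    ok.sort_by(|a, b| b.deflated_sharpe.total_cmp(&a.deflated_sharpe));
    ok.into_iter().map(|e| e.agent_id.clone()).collect()
}
```

An entry with the highest raw return but one negative run, or with a single block-severity violation, never reaches the sort.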

Status — active (pre-1.0)

All eleven crates are implemented, tested, and CI-green (fmt · clippy -D warnings · workspace tests · a determinism check · the 7-attack self-audit · a docs build · an npm build/test). The statistics kernel, the backtest-honesty verdict, scoring kernel, point-in-time simulator, run harness, forward-attestation, leaderboard, WASM bridge, npm package, MCP server, and CLI all work end-to-end — on synthetic data and on real frozen datasets (crypto majors + US equity indices).

Not yet built (need external infra or a decision): single-name equity data (a keyed feed), a live / forward public arena with hosting, and the public data-curation protocol. See docs/PLAN.md.

Quickstart

cargo install sharpebench                                    # the CLI
sharpebench run                                              # reference agents + a luck floor, ranked
sharpebench score suites/example_submissions.json           # rank a JSON field of submissions
sharpebench audit                                           # prove the scorer resists 7 known gaming attacks
sharpebench run --data data/crypto-majors-1d.csv            # run on real crypto-majors daily bars

The example field includes a skilled agent, a lucky agent with a higher raw return, and a process-violating agent. The skilled agent ranks first; the other two are ineligible — which is the whole point. run adds a luck floor of random "monkey" agents so you can see the zero-skill distribution a real edge must clear.
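
The luck floor can be pictured with a short sketch. It assumes the simplest possible "monkey": an agent that flips a fair coin each period to go long or short the same market path. The generator, function names, and this agent model are assumptions made for illustration; the harness's own luck-floor producers may work differently.

```rust
/// Small deterministic generator (SplitMix64), so the floor is reproducible.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// +1.0 (long) or -1.0 (short) with equal probability.
    fn coin(&mut self) -> f64 {
        if self.next_u64() >> 63 == 1 { 1.0 } else { -1.0 }
    }
}

/// Per-period Sharpe ratio: mean over sample standard deviation.
fn sharpe(returns: &[f64]) -> f64 {
    if returns.len() < 2 {
        return 0.0;
    }
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let var = returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / (n - 1.0);
    if var == 0.0 { 0.0 } else { mean / var.sqrt() }
}

/// Sharpe ratios of `monkeys` coin-flipping agents, sorted ascending.
fn luck_floor(market: &[f64], monkeys: usize, seed: u64) -> Vec<f64> {
    let mut rng = SplitMix64(seed);
    let mut floor: Vec<f64> = (0..monkeys)
        .map(|_| {
            let pnl: Vec<f64> = market.iter().map(|r| rng.coin() * r).collect();
            sharpe(&pnl)
        })
        .collect();
    floor.sort_by(|a, b| a.total_cmp(b));
    floor
}

/// Fraction of the zero-skill floor that a candidate Sharpe beats.
fn percentile_vs_floor(floor: &[f64], candidate: f64) -> f64 {
    floor.iter().filter(|s| **s < candidate).count() as f64 / floor.len() as f64
}
```

A candidate that beats only half of the floor has shown nothing a coin could not; the fixed seed keeps that comparison identical from run to run.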

Use it from anywhere

One kernel, scored identically across every surface — the internal eval and the public benchmark can't drift.

| Surface | Get it | What it is |
|---|---|---|
| Rust crate | cargo add sharpebench-core | The pure scoring kernel: deterministic, #![forbid(unsafe_code)]. |
| Rust (just the stats) | cargo add sharpebench-stats | The standalone statistics kernel: PSR, deflated Sharpe, the data-snooping tests, selection. The same math the board ranks on, with no benchmark attached. |
| CLI | cargo install sharpebench | run / score / audit / sign / verify / greeks / … |
| npm | npm i @general-liquidity/sharpebench | Typed JS/TS API over the WASM kernel: score, greeks, selfAudit. |
| MCP | npx -y @general-liquidity/sharpebench-mcp | An MCP server; agents call the kernel as tools. |
| WASM | sharpebench-wasm | The wasm-bindgen bridge the npm package and Gordon (Bun) embed. |

import { score, greeks } from "@general-liquidity/sharpebench";

const board = score(submissions);   // ranked CompositeScore[] — raw return never buys rank
greeks({ spot: 100, strike: 100, t_years: 1, rate: 0.05, vol: 0.2, is_call: true }).price; // 10.45
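
The 10.45 above is the Black-Scholes price of a one-year at-the-money European call. A dependency-free Rust sketch of that formula follows. It is not the kernel's code: the function names are hypothetical, and the normal CDF uses the Abramowitz and Stegun 7.1.26 approximation to the error function, which is accurate to roughly 1e-7.

```rust
/// Error function, Abramowitz & Stegun 7.1.26 (absolute error below 1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t
        * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

/// Standard normal cumulative distribution function.
fn norm_cdf(x: f64) -> f64 {
    0.5 * (1.0 + erf(x / std::f64::consts::SQRT_2))
}

/// Black-Scholes price of a European option on a non-dividend-paying asset.
fn bs_price(spot: f64, strike: f64, t_years: f64, rate: f64, vol: f64, is_call: bool) -> f64 {
    let sqrt_t = t_years.sqrt();
    let d1 = ((spot / strike).ln() + (rate + 0.5 * vol * vol) * t_years) / (vol * sqrt_t);
    let d2 = d1 - vol * sqrt_t;
    let df = (-rate * t_years).exp(); // discount factor
    if is_call {
        spot * norm_cdf(d1) - strike * df * norm_cdf(d2)
    } else {
        strike * df * norm_cdf(-d2) - spot * norm_cdf(-d1)
    }
}
```

With the inputs from the example (spot 100, strike 100, one year, rate 0.05, vol 0.2), d1 is 0.35 and d2 is 0.15, and `bs_price` returns about 10.45 for the call.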

From Rust

// Is this Sharpe real, or an artifact of luck and multiple testing?
use sharpebench_stats::{deflated_sharpe_ratio, probabilistic_sharpe_ratio, sharpe_ratio};

let returns = [0.012, -0.004, 0.009, 0.011, -0.002, 0.008, 0.010, -0.001];
let sr = sharpe_ratio(&returns);                     // observed, per-period
let psr = probabilistic_sharpe_ratio(&returns, 0.0); // P(true Sharpe > 0)
let dsr = deflated_sharpe_ratio(&returns, 200, 0.5); // deflated for 200 trials searched

// Rank a field of agents. The deflated Sharpe sorts the board; raw return never does.
use sharpebench_core::{rank, AgentSubmission, Run, ScoreConfig, Trace};
let mk = |id: &str, returns: Vec<f64>, trials: u32| AgentSubmission {
    agent_id: id.into(),
    runs: vec![Run { returns, trace: Trace::default(), confidences: vec![], outcomes: vec![], cost: 0.0 }],
    in_sample_trials: trials,
    candidates: vec![],
};
let board = rank(&[
    mk("skilled", vec![0.012, 0.008, 0.011, 0.009, 0.010], 1),
    mk("lucky",   vec![0.090, -0.02, 0.001, -0.03, 0.05], 500), // bigger raw return, 500 trials
], &ScoreConfig::default());
for s in &board {
    println!("{}  deflated={:.3}  eligible={}", s.agent_id, s.deflated_sharpe, s.rank_eligible);
}

Both halves are compile-and-run-checked as doctests in sharpebench-stats and sharpebench-core, so they can't silently drift from the API.
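
For reference, the published definitions behind the three statistics in the first snippet (Bailey & López de Prado) are given below. How the crate's arguments map onto these symbols, in particular the trial count and the third argument to deflated_sharpe_ratio, is not stated in this README and should be checked against the crate docs.

```latex
\widehat{SR} = \frac{\hat{\mu}}{\hat{\sigma}}

\mathrm{PSR}(SR^{*}) = \Phi\!\left(
  \frac{(\widehat{SR} - SR^{*})\,\sqrt{T-1}}
       {\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{2}}}
\right)

SR_0 = \sqrt{\mathrm{Var}\!\left[\widehat{SR}_n\right]}\,
  \left( (1-\gamma)\,\Phi^{-1}\!\left(1 - \tfrac{1}{N}\right)
       + \gamma\,\Phi^{-1}\!\left(1 - \tfrac{1}{N e}\right) \right)

\mathrm{DSR} = \mathrm{PSR}(SR_0)
```

Here T is the number of return observations, γ̂₃ and γ̂₄ are the sample skewness and kurtosis of the returns, N is the number of independent trials, Var[ŜRₙ] is the variance of the Sharpe ratios across those trials, γ ≈ 0.5772 is the Euler-Mascheroni constant, and Φ is the standard normal CDF. The deflated Sharpe is the PSR evaluated against SR₀, the Sharpe the best of N skill-less trials would be expected to show.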

CLI commands

| Command | What it does |
|---|---|
| run (+ --data <csv>, --http/--cmd) | Run agents through the point-in-time sim and rank them; --http/--cmd drives your external agent into the field. |
| score <subs.json> | Rank a JSON field of pre-computed submissions. |
| check <returns.csv> --trials N | "Is my Sharpe real?" Prints deflated Sharpe / haircut / MinTRL / verdict for your own return series; --trials is required (no silent default). |
| audit | Self-audit: fire 7 known gaming attacks at the scorer; non-zero exit if any survives. |
| stress | Run the adversarial stress suite (flash-crash / whipsaw), contamination-masked. |
| commit · sign · verify | Forward-attestation: pre-register a digest, sign a board, verify its chain. |
| capture · verify-trajectory | Capture an agent's raw-decision trajectory, then replay it to recompute the score. |
| audit-briefing · canary | Audit a shared briefing for salience bias; derive a do-not-train contamination tripwire. |
| score-allocation · greeks | Score a weight-vector trajectory (turnover); price an option + Greeks + tail-risk. |

Add --json to any command for machine-readable output.

Bring your own agent

Agents are external and language-agnostic. Implement the tiny JSON contract (MarketObservation → Decision) over either transport, then rank yourself into the field:

sharpebench run --cmd "cargo run -q -p reference-agent"   # stdio subprocess
sharpebench run --http 127.0.0.1:8080                     # HTTP POST /decide

A runnable reference agent (stdio + Dockerfile) and the wire format live in examples/reference-agent/.
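
To show the shape of a stdio agent, here is a rough sketch. Everything specific in it is an assumption: the real wire format lives in examples/reference-agent/ and is not reproduced here, so the field names `close`, `avg`, and `target_weight`, the one-JSON-object-per-line framing, and the naive field extraction are placeholders only.

```rust
use std::io::{self, BufRead, Write};

/// Pull a numeric field out of a flat JSON object by key.
/// Deliberately naive; a real agent would use a JSON library.
fn field(json: &str, key: &str) -> Option<f64> {
    let pat = format!("\"{}\":", key);
    let start = json.find(&pat)? + pat.len();
    let rest = json[start..].trim_start();
    let end = rest.find(|c: char| c == ',' || c == '}').unwrap_or(rest.len());
    rest[..end].trim().parse().ok()
}

/// Hypothetical policy: fully long when the close is above an average, else flat.
fn decide(observation: &str) -> String {
    let close = field(observation, "close").unwrap_or(0.0);
    let avg = field(observation, "avg").unwrap_or(close);
    let weight = if close > avg { 1.0 } else { 0.0 };
    format!("{{\"target_weight\":{:.1}}}", weight)
}

/// Read one observation per stdin line, write one decision per stdout line.
/// The line-based framing is an assumption, not the documented protocol.
fn serve() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    for line in stdin.lock().lines() {
        writeln!(out, "{}", decide(&line?))?;
        out.flush()?; // the harness is waiting on each reply
    }
    Ok(())
}
```

The point of the design is that the decision function is pure: given the same observation it returns the same decision, which is what makes a captured trajectory replayable.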

Security — running untrusted agents. The harness executes whatever agent you point it at without sandboxing. Only run agents you trust. Safe hosting of third-party submissions is a Phase-2 item and is not yet built.

What it measures

An agent is rank-eligible only if every gate holds; eligible agents then sort by the rank key (Deflated Sharpe).

| Gate | Demands | Defeats |
|---|---|---|
| Deflated Sharpe / PSR | edge survives deflation for trials × length × skew/kurtosis | data-snooping, lucky search |
| pass^k | clears the bar on every seed × window | one-lucky-seed wins |
| Significance | beats bootstrap + Reality Check + SPA + step-down | multiple-testing false positives |
| Process | zero block-severity trace violations | gate-bypass, naked tail-selling, manipulation |
| Mandate | respected the drawdown cap | blowing risk to chase return |

Reported but never gating: Sortino + downside deviation, rolling worst-case Sharpe, selection robustness, alpha/beta, calibration, edge half-life, OOS decay, turnover, Pareto-optimality, cost-normalized DSR. Full methodology: the mdBook (mdbook serve docs/book).
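
As an example of one reported metric, the Sortino ratio replaces the standard deviation in the Sharpe ratio with the downside deviation, so only returns below a target count as risk. A minimal sketch under the usual textbook definition follows; the function names are hypothetical and the crate's own conventions (target, annualization) may differ.

```rust
/// Downside deviation: root mean square of the shortfalls below `target`.
/// Returns at or above the target contribute zero.
fn downside_deviation(returns: &[f64], target: f64) -> f64 {
    if returns.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = returns
        .iter()
        .map(|r| (r - target).min(0.0).powi(2))
        .sum();
    (sum_sq / returns.len() as f64).sqrt()
}

/// Sortino ratio: mean excess return over downside deviation.
/// `None` when there is no data or no downside to divide by.
fn sortino(returns: &[f64], target: f64) -> Option<f64> {
    let dd = downside_deviation(returns, target);
    if returns.is_empty() || dd == 0.0 {
        return None;
    }
    let mean = returns.iter().sum::<f64>() / returns.len() as f64;
    Some((mean - target) / dd)
}
```

For the series 0.02, -0.01, 0.03, -0.02 with a zero target, the mean is 0.005 and the downside deviation is about 0.01118, giving a Sortino ratio of about 0.447.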

Data

The benchmark runs on frozen, checksummed, point-in-time datasets — no live API in the scoring path, so a score reproduces forever.

| Source | Set | Provides |
|---|---|---|
| Binance | crypto-majors-1d.csv | BTC/ETH/SOL/BNB/XRP daily closes (public API, no key) |
| FRED | us-indices-1d.csv | SPX / DJI / IXIC daily closes (public domain) |

Both are fetched and frozen by the offline Rust ingester (xtask, publish = false — its deps never reach the CLI). The format is long date,symbol,close[,dividend]; any aligned dataset works.
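
A reader for the long date,symbol,close[,dividend] layout could look like the sketch below. It is not the project's loader: the `Bar` type and `parse_bars` are hypothetical, and it assumes a header row, if present, starts with `date,`.

```rust
/// One row of the long format. `dividend` is 0.0 when the column is absent.
#[derive(Debug, PartialEq)]
struct Bar {
    date: String,
    symbol: String,
    close: f64,
    dividend: f64,
}

/// Parse long-format rows: date,symbol,close[,dividend].
fn parse_bars(csv: &str) -> Result<Vec<Bar>, String> {
    let mut bars = Vec::new();
    for (i, line) in csv.lines().enumerate() {
        let line = line.trim();
        // Skip blank lines and an optional header row (an assumption).
        if line.is_empty() || line.starts_with("date,") {
            continue;
        }
        let cols: Vec<&str> = line.split(',').map(str::trim).collect();
        if cols.len() < 3 || cols.len() > 4 {
            return Err(format!("line {}: expected 3 or 4 columns", i + 1));
        }
        let close = cols[2]
            .parse::<f64>()
            .map_err(|_| format!("line {}: bad close", i + 1))?;
        let dividend = match cols.get(3) {
            Some(d) if !d.is_empty() => d
                .parse::<f64>()
                .map_err(|_| format!("line {}: bad dividend", i + 1))?,
            _ => 0.0,
        };
        bars.push(Bar {
            date: cols[0].to_string(),
            symbol: cols[1].to_string(),
            close,
            dividend,
        });
    }
    Ok(bars)
}
```

Rejecting malformed rows outright, rather than skipping them, keeps a silently truncated file from producing a plausible-looking score.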

cargo run -p xtask -- crypto                              # re-fetch + re-checksum
sharpebench run --data data/us-indices-1d.csv

Architecture

A Rust Cargo workspace (modular, à la Paradigm's Rust OSS — reuse any crate on its own). The whole tree is #![forbid(unsafe_code)].

sharpebench-stats ── the statistics kernel: PSR, expected-max-Sharpe, deflated Sharpe, the
                     data-snooping family (bootstrap / White RC / Hansen SPA / Romano–Wolf),
                     Sortino + moments, selection
      ├── sharpebench-core ── the deterministic scoring kernel (no I/O, no ambient RNG); re-exports -stats
      │     ├── sharpebench-protocol   language-agnostic agent ⇄ harness JSON
      │     ├── sharpebench-sim        point-in-time simulator (look-ahead impossible)
      │     ├── sharpebench-harness    orchestration across seeds × windows
      │     ├── sharpebench-attest     SHA-256 commitments + signed chains + sealed data + canary
      │     ├── sharpebench-leaderboard render / sign / self-describing boards
      │     ├── sharpebench-wasm       the identical kernel for JS/TS (npm, Gordon, MCP)
      │     └── sharpebench-cli        the `sharpebench` binary
      ├── sharpebench-edge ── the "is my Sharpe real?" verdict: MinTRL + PBO + the two-tier honesty check
      └── sharpebench-memory ── the memory/retrieval benchmark: 3-arm ablation + poisoning + PIT + multi-session + confabulation, significance via -stats

| Crate | Role |
|---|---|
| sharpebench-core | the scoring layer over sharpebench-stats: pass^k / process + cost floor / rolling / decay / calibration / attribution / comparison-sets / rediscovery / briefing-audit / allocation / options-Greeks / self-audit / composite, plus a disqualification-reason taxonomy (a typed FailReason rollup over the signals the scorer already computes, so a suite of submissions is legible as "X failed on luck/pass^k, Y on process, Z on deflation/overfit-decay") and a re-export of the whole -stats kernel so existing sharpebench_core::… paths are unchanged. Byte-identical scores forever. |
| sharpebench-stats | the deterministic statistics kernel, split out so any project can depend on just the math: PSR, expected-max-Sharpe, deflated Sharpe (Bailey & López de Prado), the data-snooping family (stationary bootstrap, White's Reality Check, Hansen SPA liberal + consistent, Romano–Wolf step-down), Sortino + moments + normal primitives, selection robustness, and a stylized-facts realism validator (Cont battery: fat tails, volatility clustering, gain/loss skew, aggregational Gaussianity, Zumbach time-reversal asymmetry) that certifies a frozen dataset is market-realistic, wired into a sharpebench realism CLI + CI gate so a drifted generator fails the build. No I/O, no ambient RNG, fixed reduction order. |
| sharpebench-edge | the "is my Sharpe real?" honesty layer over -stats: Minimum Track Record Length, Probability of Backtest Overfitting (CSCV), and the two-tier is_my_sharpe_real verdict (PSR / deflated Sharpe / MinTRL + haircut + Pass/Borderline/Fail; the full tier adds the data-snooping family + PBO). Powers sharpebench check. |
| sharpebench-memory | the memory/retrieval benchmark over -stats: the three-arm ablation (baseline / retrieval / oracle) with retrieval lift, stationary-bootstrap significance, and fraction-of-ceiling, plus four legs a SOTA memory benchmark also has to answer: E1 poisoning (integrity delta / attack-success rate / degradation significance), E2 interdependent multi-session (per-session conditioned lift + dependency-satisfaction rate + pooled significance), E3 point-in-time correctness (per-arm no-lookahead compliance + leak flag), and E6 confabulation (regret over reinforced-but-never-re-tested false beliefs). Pure and deterministic; significance delegated to -stats; it scores caller-supplied outcome vectors, with no live agent runner. |
| sharpebench-sim | fees, seeded slippage, square-root impact, financing, turnover (TRF) cost, liquidity caps, dividends, execution-cost profiles, a parameterized synthetic generator (volatility + jumps), adversarial stress paths, trajectory capture/replay, and O(1) clone_state / restore_state snapshots. |
| sharpebench-attest | SHA-256 pre-registration commitments + HMAC signed result chains + time-lock registry + sealed held-out datasets + canary contamination tripwire. |
| sharpebench-harness | seeds × windows orchestration; luck-floor producers; a runtime-vs-agent failure taxonomy. |
| protocol · leaderboard · wasm · cli | the JSON contract · render/sign/self-describing boards · the WASM bridge · the CLI. |

Memory benchmark (sharpebench-memory)

The same skill-vs-luck discipline, pointed at a memory or retrieval layer. Proving a memory layer helps (rather than just adding tokens and latency) is an ablation, so the crate scores the three-arm ablation - baseline (no memory, the floor), retrieval (the layer under test), and oracle (gold records only, the ceiling) - and reports the retrieval lift, its stationary-bootstrap significance (delegated to sharpebench-stats, not reinvented), and the fraction of the oracle ceiling captured.
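
The three numbers in that paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not the crate's code: the type and function names are hypothetical, the draw count (2000) and restart probability (0.2, a mean block length of 5) are arbitrary choices, and the p-value is a one-sided test of zero mean lift on the paired differences.

```rust
/// Deterministic generator (SplitMix64) so a fixed seed gives a fixed verdict.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
    /// Uniform in [0, 1).
    fn unit(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
    /// Uniform index in 0..n.
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// One stationary-bootstrap resample: start at a random index, then at each
/// step restart with probability `p` or continue to the next index, wrapping.
fn resample(xs: &[f64], p: f64, rng: &mut SplitMix64) -> Vec<f64> {
    let n = xs.len();
    let mut idx = rng.below(n);
    (0..n)
        .map(|_| {
            let v = xs[idx];
            idx = if rng.unit() < p { rng.below(n) } else { (idx + 1) % n };
            v
        })
        .collect()
}

struct Ablation {
    lift: f64,                // mean(retrieval) - mean(baseline)
    fraction_of_ceiling: f64, // lift / (mean(oracle) - mean(baseline))
    p_value: f64,             // one-sided bootstrap p-value for lift > 0
}

fn score_ablation(baseline: &[f64], retrieval: &[f64], oracle: &[f64], seed: u64) -> Ablation {
    assert!(!baseline.is_empty() && baseline.len() == retrieval.len() && !oracle.is_empty());
    let diffs: Vec<f64> = retrieval.iter().zip(baseline).map(|(r, b)| r - b).collect();
    let lift = mean(&diffs);
    let headroom = mean(oracle) - mean(baseline);
    // Centre the differences so resamples are drawn under the null of zero lift.
    let centred: Vec<f64> = diffs.iter().map(|d| d - lift).collect();
    let mut rng = SplitMix64(seed);
    let draws = 2000;
    let p_restart = 0.2;
    let hits = (0..draws)
        .filter(|_| mean(&resample(&centred, p_restart, &mut rng)) >= lift)
        .count();
    Ablation {
        lift,
        fraction_of_ceiling: if headroom > 0.0 { lift / headroom } else { f64::NAN },
        p_value: (hits as f64 + 1.0) / (draws as f64 + 1.0),
    }
}
```

Resampling in blocks rather than single observations preserves serial dependence between adjacent outcomes, which is why the stationary bootstrap is preferred over an i.i.d. one for sequential data.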

Around that floor and ceiling it adds the four legs a memory benchmark also has to answer, each pure and deterministic:

  • E1 poisoning. Inject corrupted records and measure the behavior-integrity delta, the attack-success rate, and the bootstrap significance of the degradation. Money-memory (a forged limit, a wrong balance, a spoofed venue) is the high-severity case.
  • E2 multi-session. Model sessions as a dependency graph; a later session's lift is credited only when the memory an earlier session wrote was actually retained. Reports per-session conditioned lift, the cross-session dependency-satisfaction rate, and pooled significance.
  • E3 point-in-time correctness. Score no-lookahead compliance per arm from recall-audit counts and flag whether the retrieval arm leaked future data. It takes counts only, so it stays decoupled from any enforcement layer - a bi-temporal store, a replay harness, or a hand audit can feed it.
  • E6 confabulation. The "honest lying" metric: among beliefs that were reinforced but never re-tested and have since resolved, the fraction that proved wrong.
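
Two of these legs reduce to ratios over audit counts, which a short sketch makes concrete. The names and the `RecallAudit` struct are hypothetical; the crate's actual inputs and its regret-based confabulation score may be richer than these plain fractions.

```rust
/// E6 (simplified): among reinforced, never re-tested beliefs that have since
/// resolved, the fraction that proved wrong. `None` if nothing has resolved
/// or the counts are inconsistent.
fn confabulation_rate(resolved: u32, wrong: u32) -> Option<f64> {
    if resolved == 0 || wrong > resolved {
        None
    } else {
        Some(wrong as f64 / resolved as f64)
    }
}

/// E3 input: recall-audit counts for one arm.
struct RecallAudit {
    recalls: u32,           // total recalls audited
    lookahead_recalls: u32, // recalls that touched data from the future
}

/// Share of recalls that respected the point-in-time boundary.
fn pit_compliance(a: &RecallAudit) -> Option<f64> {
    if a.recalls == 0 || a.lookahead_recalls > a.recalls {
        None
    } else {
        Some((a.recalls - a.lookahead_recalls) as f64 / a.recalls as f64)
    }
}

/// Leak flag: a single look-ahead recall is enough to raise it.
fn leaked(a: &RecallAudit) -> bool {
    a.lookahead_recalls > 0
}
```

Because the functions take counts only, the audit that produces them can be anything from a bi-temporal store to a hand review.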

Together these cover, in one place, what a SOTA memory benchmark has to prove: statistical significance (stationary bootstrap via -stats), point-in-time no-lookahead, poison-resistance, cross-session dependency, and confabulation. Like the scoring kernel, the crate consumes caller-supplied outcome vectors and has no live agent runner. It has 36 unit tests, is #![forbid(unsafe_code)], and uses a fixed bootstrap seed so a verdict never moves when re-run.

Tech stack

| Technology | Role |
|---|---|
| Rust | The whole kernel: pure f64, fixed reduction order, no unsafe |
| WebAssembly | The kernel for non-Rust hosts (wasm-bindgen) |
| TypeScript | The typed npm package + MCP server |
| npm | JS/TS distribution of the scoring kernel |
| serde | Deterministic JSON for every submission, board, and config |
| GitHub Actions | CI: fmt · clippy · tests · determinism · self-audit · docs · npm |
| MCP | Agents call the kernel as tools |
| cargo-deny | Supply-chain gate (advisories · bans · licenses · sources) |

Methodology & references

The gates are not invented — they are the published, peer-reviewed controls for skill-vs-luck, assembled into one ranking.

| Control | Reference |
|---|---|
| Deflated Sharpe & PSR | Bailey & López de Prado, The Deflated Sharpe Ratio (2014) |
| Reality Check | White, A Reality Check for Data Snooping (2000) |
| Superior Predictive Ability | Hansen, A Test for Superior Predictive Ability (2005) |
| Step-down multiple testing | Romano & Wolf (2005) |
| Reliability across runs (pass^k) | Sierra τ²-bench reliability metric |
| Downside risk (Sortino) | Sortino & van der Meer (1991) |
| Options Greeks | Black–Scholes–Merton (1973) |

Full derivations in the mdBook: methodology · integrity · attestation.

Governance

Hosted by General Liquidity to start, with a roadmap to neutral governance. Credibility comes from forward-attestation + signed, independently-verifiable results, not from trust in the host — and Gordon (GL's agent) competes on the board like any other entrant. The neutral home may already exist: the FINOS-governed Open FinLLM Leaderboard covers the financial-knowledge axis but has no trading-performance axis — SharpeBench is positioned to be the skill-vs-luck trading track it lacks. See docs/GOVERNANCE.md.

License

Dual-licensed under either MIT or Apache-2.0, at your option.


Skill that survives deflation — and proves it forward.
