Scoring and Rewards

PRISM scores a single thing: a model's ability to learn from scratch, measured as online compression. The primary metric is a prequential bits-per-byte (bpb) score that the challenge computes itself from a forced-init re-execution. A held-out delta-over-random-init breaks near-ties, and an anti-memorization gap penalizes overfitting. Lower bits-per-byte is better.

Primary Metric: Prequential Bits-Per-Byte

During the forced-init re-execution, the challenge feeds the model fresh, single-pass batches from the locked train split and records the model's loss on each new batch before the optimizer updates on it. Because the data is single-pass, this online (predict-then-train) loss is the prequential code-length by construction.

The challenge integrates that code-length over the whole run and normalizes it by the raw UTF-8 bytes of text covered:

bpb = (sum over consumed tokens of -log2 p(token)) / total_bytes_covered

Because the denominator is bytes, the metric is tokenizer-agnostic: a miner can bring any tokenizer and the score still compares like for like. Because it integrates the whole loss curve, a single good checkpoint cannot game it. Because each token is scored before being trained on, there is no held-out leakage by construction. And because the validator forces random init, smuggled pretrained weights are inert.

final_score is a documented monotone-decreasing transform of bpb, so a lower bpb yields a better (higher) final_score, and the leaderboard's ORDER BY final_score DESC ranks better learners first:

final_score = 1 / (1 + bpb)        # before tie-break, penalty, and anti-cheat multiplier

Compute Normalization, Not Wall-Clock

The score is compute-normalized: it is reported and normalized by tokens consumed (and, optionally, estimated FLOPs), never by wall-clock time. A faster GPU or more GPUs cannot buy a better score; wall-clock is only a safety cap on the run. This keeps scores fair across the 1-to-8 GPU range even though the scored run uses one physical GPU.

Tie-Breaker: Held-Out Delta Over Random Init

When two submissions are near-equal on bpb, the challenge breaks the tie with the held-out delta on the secret val split:

heldout_delta = bpb(random-init twin on val) - bpb(trained model on val)

A larger improvement over the random-init twin is better. The held-out delta is folded into final_score as a bounded tie-break term: it can only reorder submissions whose bpb is within a small epsilon of each other, so a strictly lower bpb is never ranked worse on the primary axis. When no secret val split is scored for a run, the run is graded on bpb alone with no tie-break.

Anti-Memorization Gap

The challenge also measures the train-vs-held-out gap (the converged train bpb against the held-out val bpb on the same byte basis). An excessive gap flags memorization and multiplies a penalty into final_score, so a memorizer ranks below an equivalent non-memorizing learner. The gap comparison is basis-consistent so a benign learner is not falsely flagged, while the final bpb denominator stays UTF-8 bytes.

Anomaly Zeroing

A step-0 / smuggled-weights anomaly (an impossibly low initial loss under forced random init) drives the anti-cheat multiplier to zero, so an anomalously good bpb is flagged and zeroed rather than rewarded. A degenerate run (zero coverage, non-finite, or out-of-band bpb) is failed rather than scored, so it never collapses into a fabricated score that ranks.

Leaderboard And Tie-Break Ordering

The leaderboard ranks by final_score (so by bpb and the folded-in held-out delta). When two submissions are still equal, the final deterministic tie-break is earliest-commit-wins, then submission id, producing a total, reproducible order. Each hotkey appears at most once: the best submission per hotkey survives, so a worse same-hotkey submission never supersedes a better one.

Weights

get_weights converts completed scores into normalized BASE weights: one weight per hotkey, taken from that hotkey's best final_score, normalized to sum to 1.0. Weights are always dry-run and are never written on-chain.

The Challenge Is The Source Of Truth

Every number above is recomputed by the challenge from the challenge-authored prism_run_manifest.v2.json. Miner-reported metrics and miner-written manifests are ignored. The legacy raw-loss term and the v1-NAS architecture/training ownership pools are retired from the score.

Reference Studies

Area	Study	PRISM implication
Prequential / online coding	Dawid, 1984, Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach	Score the integrated online (predict-then-train) loss, not a final checkpoint.
Minimum description length	Rissanen, 1978, Modeling by Shortest Data Description	Treat compression (code-length) as the learning signal.
Scaling laws	Kaplan et al., 2020, Scaling Laws for Neural Language Models	Compare loss trajectories under matched compute, not raw final loss.
Compute-optimal scaling	Hoffmann et al., 2022, Training Compute-Optimal Large Language Models	Normalize by tokens/compute so under- or over-trained regimes do not skew ranking.
Dataset provenance	Penedo et al., 2024, The FineWeb Datasets	Freeze the data revision and shards for reproducible official runs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scoring and Rewards

Primary Metric: Prequential Bits-Per-Byte

Compute Normalization, Not Wall-Clock

Tie-Breaker: Held-Out Delta Over Random Init

Anti-Memorization Gap

Anomaly Zeroing

Leaderboard And Tie-Break Ordering

Weights

The Challenge Is The Source Of Truth

Reference Studies

Uh oh!

FilesExpand file tree

scoring.md

Latest commit

History

scoring.md

File metadata and controls

Scoring and Rewards

Primary Metric: Prequential Bits-Per-Byte

Compute Normalization, Not Wall-Clock

Tie-Breaker: Held-Out Delta Over Random Init

Anti-Memorization Gap

Anomaly Zeroing

Leaderboard And Tie-Break Ordering

Weights

The Challenge Is The Source Of Truth

Reference Studies