Unofficial: Interactive dashboard visualizing all 352 submissions #747

dentity007 · 2026-03-25T17:18:43Z

dentity007
Mar 25, 2026

⚠️ DISCLAIMER: This is NOT an official OpenAI resource. This dashboard is an independent, unofficial project by one participant. It is not affiliated with, endorsed by, or associated with OpenAI or the Parameter Golf organizers. All data is sourced from publicly available submission.json files in pull requests.

I built an interactive dashboard that visualizes data from all 352 submissions with BPB scores:

Live Dashboard → https://nathanmaine.github.io/parameter-golf-experiment-lab/

What you can do:

Search for your PR by author name or number
Filter by status (Open / Closed / Merged) to see the realistic leaderboard
Filter by size compliance (under 16MB)
Sort by any column (BPB, size, date, etc.)
See which techniques the community is using

Data includes:

352 submissions ranked by BPB
275 open (likely valid), 61 closed (likely invalid TTT), 16 merged
263 submissions under 16MB
PR links for every submission
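
For readers who want to reproduce the ranking offline, here is a minimal sketch of the search, filter, and sort behavior described above. It is not the dashboard's actual code: the `Submission` fields, the `leaderboard` helper, and the reading of "16MB" as 16,000,000 bytes are all assumptions made for illustration, and the demo rows are invented.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: "16MB" is read as 16,000,000 bytes; adjust if the rules differ.
SIZE_LIMIT_BYTES = 16_000_000


@dataclass
class Submission:
    # Hypothetical record layout, not the real submission.json schema.
    pr: int
    author: str
    bpb: float                 # bits per byte, lower is better
    size_bytes: Optional[int]  # None when the size is unknown
    status: str                # "open", "closed", or "merged"


def leaderboard(subs, status=None, under_limit_only=False, query=None):
    """Return submissions ranked by BPB (ascending) after optional filters."""
    rows = list(subs)
    if status is not None:
        rows = [s for s in rows if s.status == status]
    if under_limit_only:
        rows = [s for s in rows
                if s.size_bytes is not None and s.size_bytes < SIZE_LIMIT_BYTES]
    if query:
        q = query.lower()
        # Match on author substring or exact PR number.
        rows = [s for s in rows if q in s.author.lower() or q == str(s.pr)]
    return sorted(rows, key=lambda s: s.bpb)


# Invented demo rows, for illustration only.
demo = [
    Submission(1, "alice", 1.20, 15_000_000, "open"),
    Submission(2, "bob", 1.10, 17_000_000, "open"),
    Submission(3, "carol", 1.15, None, "closed"),
]
ranked = leaderboard(demo, status="open", under_limit_only=True)
```

Treating an unknown size as "not under the limit" is a conservative choice: a submission only counts as compliant when its size is actually known.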

Also includes a technique effectiveness matrix showing what worked and what didn't across 46+ experiments, plus cost analysis for anyone budgeting their RunPod spend.

Source: github.com/NathanMaine/parameter-golf-experiment-lab

If your submission data looks wrong, let me know — happy to fix it. The data was pulled from submission.json files as of March 24.

Good luck everyone! 🏌️

dentity007 · 2026-04-01T15:33:03Z

dentity007
Apr 1, 2026
Author

Dashboard Update — v4 (March 31, 2026)

Major update to the dashboard:

Data:

Now tracking 1,171+ submissions (up from 352)
85+ personal experiments logged (up from 60+)
11 pods used, ~$330 total spend

New sections:

7 OpenAI research directions implemented — Text Diffusion (MDLM), H-Net learned tokenization, Universal Transformer, LLM-JEPA, Mamba SSM Hybrid, Triton Megakernels, Random Linear Map Adapters
11 new technique entries (what worked, what didn't)
7 new key discoveries (Days 10-11)
3 new pod cost cards (RTX 5090 × 2, H200 SXM)

Chart fixes:

Full Y-axis range — no more cut-off data points
All charts taller (800px)
Leaderboard table shows all entries without scroll truncation

Live dashboard: https://nathanmaine.github.io/parameter-golf-experiment-lab/

PRs for all 7 research directions: #1191, #1192, #1193, #1194, #1195, #1196, #1197

2 replies

Ribin545 Apr 2, 2026

Thanks @dentity007! This was super useful for me. I was using a Universal Transformer and should have found this sooner: I used the same Universal Transformer with Depth Recurrence mentioned in PR #363 and wasted a week on it. I achieved 1.8403 BPB on an RTX 3090, which I thought was good until I reduced the depth recurrence to basically none.

dentity007 Apr 2, 2026
Author

Glad it helped! Yeah depth recurrence was a trap for me too — looked promising in theory but the compute cost per step killed any BPB gains. RTX 3090 hitting 1.84 is solid though, what are you getting now without it?

dentity007 · 2026-04-02T15:11:23Z

dentity007
Apr 2, 2026
Author

@Ribin545 Glad the dashboard helped! I have needed it many times myself!

Hitting 1.84 on an RTX 3090 is solid. What are you getting now without depth recurrence?

1 reply

Ribin545 Apr 3, 2026

@dentity007 Thanks! I was initially hitting only ~2 BPB with Depth Recurrence, and the quantized val was also pretty bad, like the PR you mentioned in your live dashboard. I worked around it using LoRA, and so far I have been able to reach ~1.81. You can check it out here if interested: PR1300. I have referenced your live dashboard tool. But I guess this is as far as I can go; I don't want to burn my friend's machine lol.

dentity007 · 2026-04-02T19:50:37Z

dentity007
Apr 2, 2026
Author

Dashboard Update - v5 (April 2, 2026)

Major update focused on data completeness and TTT legality filtering.

Data:

Now tracking 753 scored submissions with BPB data (up from 567)
PR coverage expanded through #1258 "11L INT7 + MuonWD + SWA (preliminary)" (previously through #990 "ClownCar: Frugendorff compression baseline + canonical DeltaNet integration")
186 new entries added from PRs #991 "Record: 33.6M Int5 GPTQ + Score-First TTT (val_bpb=1.1145, 3-seed)" through #1258 "11L INT7 + MuonWD + SWA (preliminary)"
All 12 of my submissions now highlighted in the leaderboard

New: TTT Legality Filtering

Following the TTT legality discussion in issue #402 and the rulings from @0hq and @valerio-oai, the leaderboard now includes TTT compliance classification:

Legal (green): Standard submissions, score-first TTT, per-document independent TTT. No cross-document leakage.
Illegal (red): Multi-epoch TTT, score-every-epoch keeping min NLL, pre-eval adaptation on val data.
Suspect (amber): Closed by organizers, or BPB < 0.5 (likely n-gram cache exploits that accumulate statistics across the full eval token stream).
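
The three buckets above can be sketched as a small rule function. This is a sketch under stated assumptions, not the dashboard's real classifier: the `ttt_mode` labels, the `classify` name, and the rule order (illegal checks first, then suspect) are all hypothetical choices made for illustration.

```python
# Hypothetical labels for the TTT variants named in the list above.
ILLEGAL_TTT_MODES = {
    "multi_epoch",                # multi-epoch TTT
    "score_every_epoch_min_nll",  # score every epoch, keep the min NLL
    "pre_eval_adaptation",        # adapt on val data before evaluation
}

SUSPECT_BPB_THRESHOLD = 0.5  # from the post: BPB < 0.5 is treated as suspect


def classify(status, bpb, ttt_mode=None):
    """Return "legal", "illegal", or "suspect" for one submission.

    ttt_mode is None for standard submissions, or a label such as
    "score_first" or "per_document" for the variants treated as legal.
    """
    # Assumption: an explicitly illegal TTT mode outranks the suspect rules.
    if ttt_mode in ILLEGAL_TTT_MODES:
        return "illegal"
    if status == "closed" or bpb < SUSPECT_BPB_THRESHOLD:
        return "suspect"
    return "legal"
```

For example, `classify("open", 1.11, "score_first")` returns `"legal"`, while `classify("open", 0.3)` returns `"suspect"` because of the BPB threshold.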

The leaderboard now defaults to "Legal Only" so the realistic competition state is visible immediately. All submissions are still accessible via the filter dropdown or search bar.

A disclaimer block under the leaderboard heading explains that this classification is our best interpretation of the current rules and may not be 100% accurate. @0hq @valerio-oai - if any of these classifications are off, happy to adjust. If you think your submission is miscategorized, open an issue on the dashboard repo or let me know here.

Why this matters: Without filtering, the top ~33 submissions are dominated by n-gram cache approaches scoring below 0.5 BPB. Many of these have been closed by organizers. The "Legal Only" view shows the actual state of the neural modeling competition, where the real innovation is happening in the 1.05-1.12 BPB range.

Other updates:

KPIs updated: 613 Open, 123 Closed, 17 Merged, 437 Under 16MB
New filter options: "Legal Only", "Suspect/Illegal"
TTT rules reference card with direct link to issue #402 ("Invalid submissions due to information leakage during TTT")

Live dashboard: https://nathanmaine.github.io/parameter-golf-experiment-lab/

2 replies

samquiring Apr 3, 2026

Hey thanks for making this! It would be great to see other disqualifiers like not having 3 training logs, GPTQ on training data, and over 16MB to properly capture the valid leaderboards. Also my submission #1268 is under the 16MB limit

dentity007 Apr 3, 2026
Author

Thanks @samquiring! Good suggestions. Just pushed an update:

New filters added:

"Record Eligible" - combines Legal + Open + Under 16MB. This is the closest to a "valid leaderboard" we can automate
"Over 16MB / Unknown Size" - catches submissions that may exceed the size limit or have unverified sizes
Fixed: PR #1268 now correctly shows under 16MB.

On the 3-seed logs and GPTQ-on-training-data checks - those require reading individual PR files which is harder to automate across 793 entries. I've added a note in the help text pointing people to the dashboard repo to report corrections. If you spot specific PRs that should be flagged, feel free to open an issue at https://github.com/NathanMaine/parameter-golf-experiment-lab/issues and I'll update them.

dentity007 · 2026-04-03T23:03:48Z

dentity007
Apr 3, 2026
Author

Dashboard Update - v8 (April 3, 2026)

This is NOT an official OpenAI resource. This dashboard is an independent, unofficial project by one participant. All classifications (Legal/Illegal/Suspect) are our best interpretation of the current rules based on issue #402 and the illegal submissions megathread #677. They may not be 100% accurate.

Data:

Now tracking 793 scored submissions (up from 753)
PR coverage expanded through #1315 "[non-record track] BankLinear: cross-layer shared weight bank"
All data sourced from public submission.json files and PR titles

New filters (per community feedback from @samquiring):

"Record Eligible" - Legal + Open + Under 16MB. The closest automated approximation to valid record submissions
"Over 16MB / Unknown Size" - catches submissions that exceed or may exceed the size limit
Existing filters: Legal Only (default), All Submissions, Open/Closed/Merged, Suspect/Illegal
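
As a rough illustration of how the two new filters combine the existing fields, here is a minimal sketch. The function names and the `size_bytes` field are hypothetical, and reading "16MB" as 16,000,000 bytes is an assumption rather than something confirmed in this thread.

```python
from typing import Optional

# Assumption: "16MB" is read as 16,000,000 bytes.
SIZE_LIMIT_BYTES = 16_000_000


def is_record_eligible(legality: str, status: str,
                       size_bytes: Optional[int]) -> bool:
    """Legal + Open + Under 16MB, as described for "Record Eligible"."""
    return (
        legality == "legal"
        and status == "open"
        and size_bytes is not None
        and size_bytes < SIZE_LIMIT_BYTES
    )


def is_over_or_unknown_size(size_bytes: Optional[int]) -> bool:
    """Matches "Over 16MB / Unknown Size": too large, or size not known."""
    return size_bytes is None or size_bytes >= SIZE_LIMIT_BYTES
```

With these definitions every submission falls into exactly one size bucket, so a record-eligible row can never also appear under "Over 16MB / Unknown Size".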

Updated sections:

Experiment log: 95+ personal runs across 13 pods
New techniques documented: SLOT (per-batch delta optimization), Vocab 4096 + MLP 4.0x, Brotli-11 compression
Pod comparison: 2 new pods including Iceland (802 TFLOPS, best observed)
Cost tracking: $360 total across all experiments
Key discoveries from Days 14-15

New record attempts submitted:

PR #1287: Vocab4096 + MLP4.0x + WD0.085 - val_bpb 1.1048 (3-seed mean, no TTT/SLOT)
PR #1291: Same base + SLOT - val_bpb 1.0925 (3-seed mean)

Both are pure neural submissions with no n-gram cache, no multi-epoch TTT. Full details and reproduction commands in the PRs.

Corrections welcome: If your submission's size, BPB, or legality status is showing incorrectly, open an issue at https://github.com/NathanMaine/parameter-golf-experiment-lab/issues or comment here.

Live dashboard: https://nathanmaine.github.io/parameter-golf-experiment-lab/


dentity007 · 2026-04-08T16:48:48Z

dentity007
Apr 8, 2026
Author

Dashboard Update - v9 (April 6, 2026)

This is NOT an official OpenAI resource. Independent, unofficial project by one participant.

Live dashboard: https://nathanmaine.github.io/parameter-golf-experiment-lab/

Data

Now tracking 944 scored submissions (up from 793)
PR coverage expanded through #1425 "Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10" (previously through #1315 "[non-record track] BankLinear: cross-layer shared weight bank")
257 new PRs added (#1160 "Non-record: Distributed 8xH100 Polar STE + QJL KV-cache baseline" through #1425 "Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10")
All data verified: KPI stats, leaderboard counts, and per-PR BPB values audited against source

Bug Fixes

Fixed BPB values for PRs #1287 (corrected to 1.1048) and #1291 (corrected to 1.0925); these were showing wrong values from a data merge error
Fixed inconsistent submission counts across KPI bar, section headers, and actual data (all now 944)
Updated KPI stats to match real data: 764 open, 163 closed, 569 under 16MB, 24 days left
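
One way to keep the KPI bar, section headers, and table from drifting apart is to derive every count from the same list of rows. The sketch below assumes each row is a dict with hypothetical `status` and `size_bytes` keys, and reads "16MB" as 16,000,000 bytes; it is an illustration, not the dashboard's actual code.

```python
from collections import Counter

# Assumption: "16MB" is read as 16,000,000 bytes.
SIZE_LIMIT_BYTES = 16_000_000


def compute_kpis(rows):
    """Derive all headline counts from a single list of submission rows."""
    status_counts = Counter(r["status"] for r in rows)
    under_limit = sum(
        1 for r in rows
        if r.get("size_bytes") is not None
        and r["size_bytes"] < SIZE_LIMIT_BYTES
    )
    return {
        "total": len(rows),
        "open": status_counts["open"],      # Counter returns 0 if absent
        "closed": status_counts["closed"],
        "merged": status_counts["merged"],
        "under_16mb": under_limit,
    }
```

Because `total` is `len(rows)` and the status counts come from the same pass, the three status counts always sum to the total as long as every row carries one of the three statuses.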

New: DGX Spark PROTEUS Ablation Data

10 overnight ablation runs testing parallel residuals, mixed INT5/INT6 quantization, and SLOT
Key finding: Parallel residuals (PARALLEL_START_LAYER=6) is the dominant feature at -0.0175 BPB with 2.3x throughput improvement on GB10
6 new experiment log rows added (April 3-6 runs)

New: SLOT Legality Context

PR #1240 ("Non-record: Does SLOT violate causal dependence? (empirical test + question)") empirically proved standard SLOT violates causal dependence (100% violation rate)
Issue #1336 ("Legality question: Is context-only (causal) SLOT legal?") pending ruling on context-only (causal) SLOT variant
Dashboard now notes SLOT legality concerns in technique cards and chart descriptions
Our PR #1291 (1.0925 with SLOT) flagged as at-risk; safe fallback PR #1287 (1.1048 without SLOT) highlighted

All Sections Updated

Section	What Changed
0. Leaderboard	944 entries, corrected KPIs, verified counts
1a. All Submissions	Updated with 257 new data points
1b. Top 20 Verified	Refreshed with latest legal+open entries
1c. Open vs Closed	Added SLOT legality context to TTT cliff narrative
1d. By Technique	Added parallel residuals and depth recurrence as top techniques
2. My Score Timeline	Added #1287 data point and DGX Spark results
3. My Experiments	6 new rows (April 3-6: Iceland pod + DGX Spark)
4. Techniques	New parallel residuals card, SLOT card updated with legality warning
5. Cost	DGX Spark $0 card (free local compute, 10 ablation runs)
6. Pod Comparison	DGX Spark GB10 row (free, always-on, 2.3x parallel speedup)
7. Key Discoveries	5 new entries (see below)

New Key Discoveries (Section 7)

PROTEUS integration: 4 features ported in one session - Parallel residuals, mixed INT5/INT6, score-first TTT, CPU test suite
Parallel residuals: the clear winner - -0.0175 BPB, 2.3x throughput on GB10
SLOT legality crisis - PR #1240 ("Non-record: Does SLOT violate causal dependence? (empirical test + question)") proved 100% causal violation for standard SLOT. Issue #1336 ("Legality question: Is context-only (causal) SLOT legal?") pending.
sp4096 not officially approved - Used by ~8 PRs but no maintainer ruling
PR #1334 (SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0, 3-seed mean) shows legal ceiling at 1.0897 - Track A (no eval tricks), credits our PR #1287

Competition Landscape

PR #1334 (@aryanbhosale) achieves 1.0897 BPB (3-seed mean) with zero eval-time tricks using depth recurrence + parallel residuals + MuonEq-R. Credits our PR #1287.
Only 2 PRs have ever been merged on the entire repo: #549 (LeakyReLU² + Legal Score-First TTT + Parallel Muon, val_bpb 1.1194, 3-seed mean) and #1019 (AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112, val_bpb 1.11473, 3-seed mean)
The real competition is now between Track A (fixed predictor) and Track B (adaptive) approaches

Feedback welcome - open an issue on the dashboard repo.

0 replies
