
Updates to excluded benchmark list #485

Open
KartikP wants to merge 6 commits into exluded_benchmark_list from kp/excluded_benchmark_list_fixed_agg

KartikP commented Dec 9, 2025

This PR adds the following to #482:

  1. Updates the global score based on the new default benchmarks
  2. Re-ranks models based on the new global score
  3. Introduces color-utils.js, which replicates the representative color calculation from mv.sql in the frontend
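
Steps 1 and 2 can be sketched roughly as follows. This is a hypothetical illustration, not the code in this PR: it assumes the global score is a plain mean over the non-excluded benchmarks, that missing scores count as 0, and that tied models share a rank. The names `recomputeGlobalScores` and `rankModels` and the shape of the model objects are invented for the example.

```javascript
// Hypothetical sketch: recompute a global score from the benchmarks that
// remain after exclusion, then re-rank models by that score.
// Assumption: global score = mean of the included benchmark scores,
// with missing (null/undefined) scores counted as 0.
function recomputeGlobalScores(models, excludedBenchmarks) {
  const excluded = new Set(excludedBenchmarks);
  return models.map((model) => {
    const included = Object.entries(model.scores).filter(
      ([benchmark]) => !excluded.has(benchmark)
    );
    const total = included.reduce((sum, [, score]) => sum + (score ?? 0), 0);
    const globalScore = included.length > 0 ? total / included.length : 0;
    return { ...model, globalScore };
  });
}

// Assumption: competition ranking (1, 1, 3, ...), highest score first.
function rankModels(models) {
  const sorted = [...models].sort((a, b) => b.globalScore - a.globalScore);
  let rank = 0;
  let previousScore = null;
  return sorted.map((model, index) => {
    if (model.globalScore !== previousScore) {
      rank = index + 1;
      previousScore = model.globalScore;
    }
    return { ...model, rank };
  });
}
```

Calling `rankModels(recomputeGlobalScores(models, excluded))` after the excluded list changes keeps score and rank in step with each other, which is the point of doing both in the same place.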

All of this is necessary because much of the leaderboard presentation is determined in mv.sql but then modified in the backend when we define the default benchmarks via excluded_benchmark_list. For example, because representative color is normalized per benchmark, parent benchmark colors will be off (e.g., two models can both have a score of 0.45 but different colors if the removed benchmark was impactful).
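
To illustrate why recomputing colors from the currently displayed scores avoids this mismatch, here is a minimal sketch of per-benchmark normalization. It is not the contents of color-utils.js: the min-max normalization, the red-to-green ramp, and the function names are all assumptions, and the actual formula in mv.sql may differ.

```javascript
// Hypothetical sketch of per-benchmark color normalization.
// Assumption: min-max normalization over the scores currently shown for one
// benchmark; the real formula in mv.sql may differ.
function normalizeScores(scores) {
  const isValid = (s) => typeof s === "number" && !Number.isNaN(s);
  const valid = scores.filter(isValid);
  if (valid.length === 0) return scores.map(() => null);
  const min = Math.min(...valid);
  const max = Math.max(...valid);
  const range = max - min;
  return scores.map((s) => {
    if (!isValid(s)) return null;
    // All scores equal: place every model at the midpoint of the scale.
    return range === 0 ? 0.5 : (s - min) / range;
  });
}

// Map a normalized value in [0, 1] to a CSS color. The red-to-green ramp
// and the grey for missing scores are assumptions.
function toColor(normalized) {
  if (normalized === null) return "rgb(128, 128, 128)";
  const red = Math.round(255 * (1 - normalized));
  const green = Math.round(255 * normalized);
  return `rgb(${red}, ${green}, 0)`;
}

// Recompute the colors for one benchmark column from its current scores.
function recolorBenchmark(scores) {
  return normalizeScores(scores).map(toColor);
}
```

Because the colors are derived only from the scores passed in, two models with the same recomputed parent score always receive the same color.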

Shifting this computation from the DB to the frontend is messy but, unfortunately, the best way to go about this. Future work should clean up mv.sql to remove the re-computed defaults.



Screenshot of localhost (left) and staging (right) with same settings to check scores.
image

KartikP closed this Dec 10, 2025
KartikP reopened this Dec 10, 2025
KartikP added a commit that referenced this pull request Dec 18, 2025
* Wayback Slider Working Version

* Fixed Calendar and Unit Test Added

* Fixed Unit test

* Headless_True

* Remove Sleep Time

* Small change for date box

* Wayback Filter Added to Export

* Addressed code-review changes

* Try testing without ranks

* clean up

* reset slider handle after reset button press

* Move wayback timestamp slider to first column and adjust input box width

* Disable start_timestamp. Make it conditional so we can re-enable if necessary

* Fix input box overflow in col container

* re-introduce whitespace

* fix blank lines

* Update tests after web_test update

* Merge timestamp fields from kp/add-timestamp-to-scores into mv.sql

Adds start_timestamp and end_timestamp fields to materialized views:
- Added to mv_base_scores and mv_final_benchmark_context
- Added to final_agg_scores table structure
- Included in INSERT statements for leaf and parent scores
- For parent nodes, uses MIN(start_timestamp) and MAX(end_timestamp) from children
- Added to final model/benchmark context SELECT statements
- Included in score_json JSON aggregation
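
The parent-node rule above is implemented in SQL in mv.sql. Purely as an illustration, the same rule can be sketched in JavaScript. The function name is hypothetical, and the sketch assumes timestamps are ISO 8601 strings or null, with nulls ignored as SQL MIN/MAX would ignore NULLs.

```javascript
// Hypothetical sketch: derive a parent's time range from its children,
// mirroring MIN(start_timestamp) / MAX(end_timestamp).
// Assumption: timestamps are ISO 8601 strings or null; nulls are skipped.
function aggregateTimestamps(children) {
  const toMillis = (field) =>
    children
      .map((child) => child[field])
      .filter((value) => value != null)
      .map((value) => Date.parse(value));

  const starts = toMillis("start_timestamp");
  const ends = toMillis("end_timestamp");

  return {
    start_timestamp:
      starts.length > 0 ? new Date(Math.min(...starts)).toISOString() : null,
    end_timestamp:
      ends.length > 0 ? new Date(Math.max(...ends)).toISOString() : null,
  };
}
```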

* upper bound calendar input with checks

* parseURLFilters() and setRangerValues() correctly

* Update benchmark counts

* use color-utils from #485 to rescale colors after wayback

* move depth calculation outside of model row loop

* fix color and aggregation calculation

* Optimization improvement:
1. Cache wayback filtering results: build Set of hidden benchmarks once, reuse for O(1) lookups instead of iterating grid nodes

2. Cache root parent lookups: build Map of benchmark -> root parent once, reuse for color recalculation instead of traversing hierarchy for each benchmark
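
A minimal sketch of the two caches described above, under assumptions: the hierarchy is given as a child-to-parent `Map` in which root benchmarks have no entry, and `buildHiddenSet` and `buildRootParentMap` are hypothetical names rather than the functions in this commit.

```javascript
// Hypothetical sketch of the two caches.
// Assumption: `parentOf` is a Map of benchmark -> direct parent, and root
// benchmarks have no entry in it.

// Build the Set of hidden benchmarks once; later membership checks are O(1).
function buildHiddenSet(benchmarks, isHidden) {
  return new Set(benchmarks.filter(isHidden));
}

// Build a Map of benchmark -> root parent once. Results are memoized, so
// each chain in the hierarchy is walked at most once.
function buildRootParentMap(benchmarks, parentOf) {
  const rootOf = new Map();
  const findRoot = (benchmark) => {
    if (rootOf.has(benchmark)) return rootOf.get(benchmark);
    const parent = parentOf.get(benchmark);
    const root = parent == null ? benchmark : findRoot(parent);
    rootOf.set(benchmark, root);
    return root;
  };
  benchmarks.forEach(findRoot);
  return rootOf;
}
```

Color recalculation can then call `rootOf.get(benchmark)` and `hidden.has(benchmark)` inside its loops instead of re-traversing the grid or the hierarchy each time.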

* fix export to exclude hidden benchmark and sort by fscore

* Only exclude benchmarks hidden by wayback filtering, not where all models failed
- Problem: Wayback filtering was hiding benchmarks when all values were X. However, when a model property filter left only models that had all failed, it was also hiding the benchmark column and disrupting aggregation.
- Solution: Introduced logic that hides columns only when wayback filtering is active and all values in a column are X. When wayback filtering is inactive, columns are not hidden based on X values alone.
- Updated the wayback test to sort by score instead of rank.
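
The column-hiding rule from the Solution item can be sketched as a single predicate. This is a hypothetical illustration: it assumes a failed score is represented by the string "X", that `columnValues` holds one value per visible model, and that an empty column is not hidden.

```javascript
// Hypothetical sketch of the column-hiding rule.
// Assumption: a failed score is the string "X"; `columnValues` holds one
// value per currently visible model for a single benchmark column.
function shouldHideColumn(columnValues, waybackActive) {
  // Without wayback filtering, never hide a column based on X values alone.
  if (!waybackActive) return false;
  // With wayback filtering, hide only when every visible value is X.
  return columnValues.length > 0 && columnValues.every((v) => v === "X");
}
```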

---------

Co-authored-by: Kartik Pradeepan <[email protected]>