Updates to excluded benchmark list #485
Open
KartikP wants to merge 6 commits into exluded_benchmark_list from
KartikP added a commit that referenced this pull request Dec 17, 2025
KartikP added a commit that referenced this pull request Dec 18, 2025
* Wayback Slider Working Version
* Fixed Calendar and Unit Test Added
* Fixed Unit test
* Headless_True
* Remove Sleep Time
* Small change for date box
* Wayback Filter Added to Export
* Addressed code-review changes
* Try testing without ranks
* clean up
* reset slider handle after reset button press
* Move wayback timestamp slider to first column and adjust input box width
* Disable start_timestamp. Make it conditional so we can re-enable if necessary
* Fix input box overflow in col container
* re-introduce whitespace
* fix blank lines
* Update tests after web_test update
* Merge timestamp fields from kp/add-timestamp-to-scores into mv.sql

  Adds start_timestamp and end_timestamp fields to materialized views:
  - Added to mv_base_scores and mv_final_benchmark_context
  - Added to final_agg_scores table structure
  - Included in INSERT statements for leaf and parent scores
  - For parent nodes, uses MIN(start_timestamp) and MAX(end_timestamp) from children
  - Added to final model/benchmark context SELECT statements
  - Included in score_json JSON aggregation
* upper bound calendar input with checks
* parseURLFilters() and setRangerValues() correctly
* Update benchmark counts
* use color-utils from #485 to rescale colors after wayback
* move depth calculation outside of model row loop
* fix color and aggregation calculation
* Optimization improvement:
  1. Cache wayback filtering results: build Set of hidden benchmarks once, reuse for O(1) lookups instead of iterating grid nodes
  2. Cache root parent lookups: build Map of benchmark -> root parent once, reuse for color recalculation instead of traversing hierarchy for each benchmark
* fix export to exclude hidden benchmark and sort by fscore
* Only exclude benchmarks hidden by wayback filtering, not where all models failed
  - Problem: Wayback filtering was hiding benchmarks when all values were X. However, when a model property filter was applied and every visible model had only failures, it also hid the benchmark column and disrupted aggregation.
  - Solution: Introduced logic that only hides columns when wayback filtering is active and all values in a column are X. When wayback filtering is inactive, don't hide columns based on X values alone.
  - Update wayback test to sort by score instead of rank.

---------

Co-authored-by: Kartik Pradeepan <[email protected]>
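The two caches described in the optimization commit, together with the wayback-only hiding rule, could look roughly like the sketch below. This is not the PR's code: the function names (`buildRootParentMap`, `buildHiddenSet`), the `parentOf` object, and the `row.scores` shape are hypothetical and chosen only for illustration.

```javascript
// Hypothetical shape: parentOf maps each benchmark id to its direct parent id
// (null or undefined for a root benchmark).
function buildRootParentMap(parentOf) {
  const rootOf = new Map();
  const findRoot = (id) => {
    if (rootOf.has(id)) return rootOf.get(id); // reuse an earlier traversal
    const parent = parentOf[id];
    const root = parent == null ? id : findRoot(parent);
    rootOf.set(id, root);
    return root;
  };
  for (const id of Object.keys(parentOf)) findRoot(id);
  return rootOf;
}

// Hypothetical shape: each row is { scores: { [benchmarkId]: number | "X" } }.
// A column is hidden only when wayback filtering is active and every visible
// model has "X" for that benchmark; without wayback filtering nothing is hidden.
function buildHiddenSet(rows, benchmarkIds, waybackActive) {
  const hidden = new Set();
  if (!waybackActive || rows.length === 0) return hidden;
  for (const id of benchmarkIds) {
    if (rows.every((row) => row.scores[id] === "X")) hidden.add(id);
  }
  return hidden;
}
```

Both structures are built once per filter change; later code can then call `hidden.has(id)` and `rootOf.get(id)` in constant time instead of walking the grid or the hierarchy for each benchmark.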
This PR adds the following to #482:
* color-utils.js, which replicates the representative color calculation from mv.sql in the frontend.

All this is necessary because a lot of information about leaderboard presentation is determined in mv.sql but then modified in the backend when we define the default benchmarks via excluded_benchmark_list. For example, because representative color is normalized per benchmark, parent benchmark colors will be off after an exclusion (e.g., two models can both have a score of 0.45 but different colors if the removed benchmark was impactful).

Shifting this computation from the DB to the frontend is messy but unfortunately the ideal way to go about this. Future work should clean up mv.sql to remove re-computed defaults.

Screenshot of localhost (left) and staging (right) with same settings to check scores.
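To illustrate why parent colors have to be recomputed once a benchmark is excluded, here is a minimal sketch. It does not reproduce color-utils.js or mv.sql: it assumes a parent score is the mean of its included children and that color intensity is a min-max normalization across models within one benchmark column. All function names and data shapes are hypothetical.

```javascript
// Assumption: parent score = mean of the numeric scores of non-excluded children.
function parentScore(childScores, excluded) {
  const vals = Object.entries(childScores)
    .filter(([id, v]) => !excluded.has(id) && typeof v === "number")
    .map(([, v]) => v);
  return vals.length ? vals.reduce((a, b) => a + b, 0) / vals.length : null;
}

// Assumption: intensity = min-max normalization across models in one column.
// Missing scores stay null; a column with a single distinct value maps to 0.5.
function normalizeColumn(scores) {
  const nums = scores.filter((v) => v !== null);
  const min = Math.min(...nums);
  const max = Math.max(...nums);
  return scores.map((v) => {
    if (v === null) return null;
    return max === min ? 0.5 : (v - min) / (max - min);
  });
}

// Recompute the parent column after exclusions, then renormalize it so the
// colors match the scores that are actually displayed.
function recomputeParentColors(models, excluded) {
  const scores = models.map((m) => parentScore(m.children, excluded));
  return { scores, intensities: normalizeColumn(scores) };
}
```

Under these assumptions, calling `recomputeParentColors(models, new Set(["c"]))` yields parent scores and intensities based only on the remaining children, so two models that end up with the same parent score also get the same color.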
