Merge pull request #1275 from rylev/remove-dodginess

rylev · web-flow · commit 2b236f5753ae · 2022-04-05T12:59:27.000+02:00
Remove the concept of dodginess
diff --git a/docs/comparison-analysis.md b/docs/comparison-analysis.md
@@ -68,7 +68,3 @@ The actual algorithm for determining relevance of a comparison summary may chang
 * High relevance: any number of very large or large changes, a small amount of medium changes, or a large number of small or very small changes.
 * Medium relevance: any number of very large or large changes, any medium change, or smaller but still substantial number of small or very small changes.
 * Low relevance: if it doesn't fit into the above two categories, it ends in this category.
-
-### "Dodgy" Test Cases
-
-"Dodgy" test cases are test cases that tend to produce unreliable results (i.e., noise). A test case is considered "dodgy" if its significance threshold is sufficiently far enough away from 0.
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -38,7 +38,6 @@ The following is a glossary of domain specific terminology. Although benchmarks
 * **significant test result comparison**: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
 * **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a factor of the test result comparison's significance and magnitude. Comparisons are considered relevant if they are significant and have at least a small magnitude .
 * **test result comparison magnitude**: how "large" the delta is between the two test result's under comparison. This is determined by the average of two factors: the absolute size of the change (i.e., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (i.e., a change that is 5x the significance threshold is larger than a change 1.5x the significance threshold).
-* **dodgy test case**: a test case for which the significance threshold is significantly large indicating a high amount of variability in the test and thus making it necessary to be somewhat skeptical of any results too close to the significance threshold.
 
 ## Other 
 
diff --git a/site/src/api.rs b/site/src/api.rs
@@ -191,7 +191,6 @@ pub mod comparison {
         pub scenario: String,
         pub is_significant: bool,
         pub significance_factor: Option<f64>,
-        pub is_dodgy: bool,
         pub magnitude: String,
         pub statistics: (f64, f64),
     }
diff --git a/site/src/comparison.rs b/site/src/comparison.rs
@@ -117,7 +117,6 @@ pub async fn handle_compare(
             benchmark: comparison.benchmark.to_string(),
             profile: comparison.profile.to_string(),
             scenario: comparison.scenario.to_string(),
-            is_dodgy: comparison.is_dodgy(),
             is_significant: comparison.is_significant(),
             significance_factor: comparison.significance_factor(),
             magnitude: comparison.magnitude().display().to_owned(),
@@ -953,13 +952,6 @@ impl HistoricalData {
             .windows(2)
             .map(|window| (window[0] - window[1]).abs())
     }
-
-    /// Whether we can trust this benchmark or not
-    fn is_dodgy(&self) -> bool {
-        // If changes are judged significant only exceeding 0.2%, then the
-        // benchmark as a whole is dodgy.
-        self.significance_threshold() * 100.0 > 0.2
-    }
 }
 
 /// Gets the previous commit
@@ -1096,13 +1088,6 @@ impl TestResultComparison {
         from_u8((as_u8(over_threshold) + as_u8(absolute_magnitude)) / 2)
     }
 
-    fn is_dodgy(&self) -> bool {
-        self.historical_data
-            .as_ref()
-            .map(|v| v.is_dodgy())
-            .unwrap_or(false)
-    }
-
     fn relative_change(&self) -> f64 {
         let (a, b) = self.results;
         (b - a) / a
diff --git a/site/static/compare.html b/site/static/compare.html
@@ -810,7 +810,6 @@ <h2>Comparing <span id="stat-header">{{stat}}</span> between <span id="before">{
                                     magnitude: c.magnitude,
                                     isSignificant: c.is_significant,
                                     significanceFactor: c.significance_factor,
-                                    isDodgy: c.is_dodgy,
                                     datumA,
                                     datumB,
                                     percent,
@@ -1049,7 +1048,7 @@ <h2>Comparing <span id="stat-header">{{stat}}</span> between <span id="before">{
                 <td>
                     <a v-bind:href="percentLink(commitB, commitA, testCase)">
                         <span v-bind:class="percentClass(testCase.percent)">
-                            {{ testCase.percent.toFixed(2) }}%{{testCase.isDodgy ? "?" : ""}}
+                            {{ testCase.percent.toFixed(2) }}%
                         </span>
                     </a>
                 </td>

Original file line number	Diff line number	Diff line change
`@@ -191,7 +191,6 @@ pub mod comparison {`
`191`	`191`	`pub scenario: String,`
`192`	`192`	`pub is_significant: bool,`
`193`	`193`	`pub significance_factor: Option<f64>,`
`194`		`- pub is_dodgy: bool,`
`195`	`194`	`pub magnitude: String,`
`196`	`195`	`pub statistics: (f64, f64),`
`197`	`196`	`}`