Address PR feedback

rylev · rylev · commit d9aedf5df093 · 2022-04-05T10:13:15.000+02:00
diff --git a/docs/comparison-analysis.md b/docs/comparison-analysis.md
@@ -33,7 +33,7 @@ How many _significant_ test results indicate performance changes and what is the
   * given 20 changes of different kinds all of low magnitude, the result is mixed unless only 2 or fewer of the changes are of one kind.
   * given 5 changes of different kinds all of low magnitude, the result is always mixed.
 
-Whether we actually _report_ an analysis or not depends on the context and how relevant we find the summary of the results over all (see below for an explanation of how summary relevant is determined). For example, in pull request performance "try" runs, we report a performance change if results are "somewhat relevant", while for the triage report, we only report if the we are confident the results are "definitely relevant".
+Whether we actually _report_ an analysis or not depends on the context and how relevant we find the overall summary of the results (see below for an explanation of how the relevance of a summary is determined). For example, in pull request performance "try" runs, we report a performance change if results are "somewhat relevant", while for the triage report, we only report if we are confident the results are "definitely relevant".
 
 ### What makes a test result significant?
 
@@ -60,14 +60,14 @@ Magnitude is a combination of two factors:
 * how large a change is regardless of the direction of the change
 * how much that change went over the significance threshold
 
-If a large change only happens to go over the significance threshold by a small factor, then the over magnitude of the change is considered small.
+As an example, if a change that is large in absolute terms only exceeds the significance threshold by a small factor, then the overall magnitude of the change is considered small.
 
 #### Relevance algorithm
 
 The actual algorithm for determining relevance of a comparison summary may change, but in general the following rules apply:
-* High relevance: any number of very large or large changes, a small amount of medium changes, or a large amount of small or very small changes.
-* Medium relevance: any number of very large or large changes, any medium change, or smaller but still substantial amount of small or very small changes.
-* Small relevance: if it doesn't fit into the above two categories, it ends in this category.
+* High relevance: any number of very large or large changes, a small number of medium changes, or a large number of small or very small changes.
+* Medium relevance: any number of very large or large changes, any medium change, or a smaller but still substantial number of small or very small changes.
+* Low relevance: if a summary doesn't fit into either of the above two categories, it falls into this one.
 
 ### "Dodgy" Test Cases
 
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -36,7 +36,7 @@ The following is a glossary of domain specific terminology. Although benchmarks
 * **test result comparison**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result comparisons as percentages between two runs.  
 * **significance threshold**: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). You can see how this is calculated [here](https://github.com/rust-lang/rustc-perf/blob/master/docs/comparison-analysis.md#what-makes-a-test-result-significant).
 * **significant test result comparison**: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
-* **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a factor of the test result comparison's *magnitude*. Comparisons are considered relevant if they have a small magnitude or more. This term is often used to mean "significant *and* relevant" since relevant changes are necessarily also significant.
+* **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a function of the test result comparison's significance and magnitude. Comparisons are considered relevant if they are significant and have at least a small magnitude.
 * **test result comparison magnitude**: how "large" the delta is between the two test result's under comparison. This is determined by the average of two factors: the absolute size of the change (i.e., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (i.e., a change that is 5x the significance threshold is larger than a change 1.5x the significance threshold).
 * **dodgy test case**: a test case for which the significance threshold is significantly large indicating a high amount of variability in the test and thus making it necessary to be somewhat skeptical of any results too close to the significance threshold.
 
diff --git a/site/src/comparison.rs b/site/src/comparison.rs
@@ -184,7 +184,6 @@ impl ComparisonSummary {
         let mut comparisons = comparison
             .statistics
             .iter()
-            .filter(|c| c.is_significant())
             .filter(|c| c.is_relevant())
             .inspect(|c| {
                 if c.is_improvement() {
@@ -1037,9 +1036,11 @@ impl TestResultComparison {
         Some(change.abs() / threshold)
     }
 
-    /// Whether the comparison is relevant or not
+    /// Whether the comparison is relevant or not.
+    ///
+    /// Relevance is a function of significance and magnitude.
     fn is_relevant(&self) -> bool {
-        self.magnitude().is_small_or_above()
+        self.is_significant() && self.magnitude().is_small_or_above()
     }
 
     /// The magnitude of the change.
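
The `site/src/comparison.rs` change above folds the significance check into `is_relevant`, which is why the separate `.filter(|c| c.is_significant())` call can be dropped. The following is a minimal, self-contained sketch of that idea, not the actual rustc-perf code: the `Magnitude` enum, the `Comparison` struct with its stored `significant` flag, and the `count_old`, `count_new`, and `sample` helpers are all hypothetical stand-ins introduced for illustration. Only the names `is_significant`, `is_relevant`, and `is_small_or_above` come from the diff.

```rust
/// Hypothetical stand-in for the magnitude buckets named in the docs.
/// Variant order matters: the derived `PartialOrd` follows declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Magnitude {
    VerySmall,
    Small,
    Medium,
    Large,
    VeryLarge,
}

impl Magnitude {
    fn is_small_or_above(self) -> bool {
        self >= Magnitude::Small
    }
}

/// Simplified, hypothetical comparison. The real type computes these
/// values from benchmark data; here they are stored directly.
#[derive(Debug, Clone, Copy)]
struct Comparison {
    significant: bool,
    magnitude: Magnitude,
}

impl Comparison {
    fn is_significant(&self) -> bool {
        self.significant
    }

    /// Relevance is a function of significance and magnitude.
    fn is_relevant(&self) -> bool {
        self.is_significant() && self.magnitude.is_small_or_above()
    }
}

/// Filter chain as it looked before the change: two separate filters.
fn count_old(comparisons: &[Comparison]) -> usize {
    comparisons
        .iter()
        .filter(|c| c.is_significant())
        .filter(|c| c.is_relevant())
        .count()
}

/// Filter chain after the change: `is_relevant` already implies significance.
fn count_new(comparisons: &[Comparison]) -> usize {
    comparisons.iter().filter(|c| c.is_relevant()).count()
}

/// Made-up sample data covering each combination of interest.
fn sample() -> Vec<Comparison> {
    vec![
        Comparison { significant: true, magnitude: Magnitude::Large },
        Comparison { significant: true, magnitude: Magnitude::VerySmall },
        Comparison { significant: false, magnitude: Magnitude::Medium },
        Comparison { significant: true, magnitude: Magnitude::Small },
    ]
}
```

Under these assumptions, both filter chains select the same comparisons for any input, so removing the explicit significance filter does not change what a summary contains. It also means any other caller of `is_relevant` gets the significance check without having to remember it.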