Skip to content

Commit 2dcf2a9

Browse files
committed
Clarify relevance
1 parent c5270d2 commit 2dcf2a9

File tree

3 files changed

+47
-36
lines changed

3 files changed

+47
-36
lines changed

docs/glossary.md

+6-6
Original file line numberDiff line numberDiff line change
@@ -25,20 +25,20 @@ The following is a glossary of domain specific terminology. Although benchmarks
2525
* **test case**: a combination of a benchmark, a profile, and a scenario.
2626
* **test**: the act of running an artifact under a test case. Each test result is composed of many iterations.
2727
* **test iteration**: a single iteration that makes up a test. Note: we currently normally run 2 test iterations for each test.
28-
* **test result**: the result of the collection of all statistics from running a test. Currently the minimum of the statistics.
28+
* **test result**: the result of the collection of all statistics from running a test. Currently, the minimum value of a statistic from all the test iterations is used.
2929
* **statistic**: a single value of a metric in a test result
3030
* **statistic description**: the combination of a metric and a test case which describes a statistic.
3131
* **statistic series**: statistics for the same statistic description over time.
3232
* **run**: a collection of test results for all currently available test cases run on a given artifact.
33-
* **test result delta**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result deltas as percentages comparing two runs.
3433

3534
## Analysis
3635

37-
* **test result delta**: the difference between two test results for the same metric and test case.
38-
* **significance threshold**: the threshold at which a test result delta is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
39-
* **significant test result delta**: a test result delta above the significance threshold. Significant test result deltas can be thought of as "statistically significant".
36+
* **test result comparison**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result comparisons as percentages between two runs.
37+
* **significance threshold**: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
38+
* **significant test result comparison**: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
39+
* **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a factor of the test result comparison's *magnitude*. Comparisons are considered relevant if they have a small magnitude or more. This term is often used to mean "significant *and* relevant" since relevant changes are necessarily also significant.
40+
* **test result comparison magnitude**: how "large" the delta is between the two test result's under comparison. This is determined by the average of two factors: the absolute size of the change (i.e., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (i.e., a change that is 5x the significance threshold is larger than a change 1.5x the significance threshold).
4041
* **dodgy test case**: a test case for which the significance threshold is significantly large indicating a high amount of variability in the test and thus making it necessary to be somewhat skeptical of any results too close to the significance threshold.
41-
* **relevant test result delta**: a synonym for *significant test result delta* in situations where the term "significant" might be ambiguous and readers may potentially interpret *significant* as "large" or "statistically significant". For example, in try run results, we use the term relevant instead of significant.
4242

4343
## Other
4444

site/src/comparison.rs

+36-25
Original file line numberDiff line numberDiff line change
@@ -148,11 +148,11 @@ async fn populate_report(
148148
report: &mut HashMap<Option<Direction>, Vec<String>>,
149149
) {
150150
if let Some(summary) = ComparisonSummary::summarize_comparison(comparison) {
151-
let confidence = summary.confidence();
152-
if confidence.is_atleast_probably_relevant() {
151+
let relevance = summary.relevance_level();
152+
if relevance.at_least_somewhat_relevant() {
153153
if let Some(direction) = summary.direction() {
154154
let entry = report
155-
.entry(confidence.is_definitely_relevant().then(|| direction))
155+
.entry(relevance.very_relevant().then(|| direction))
156156
.or_default();
157157

158158
entry.push(write_triage_summary(ctxt, comparison).await)
@@ -163,7 +163,7 @@ async fn populate_report(
163163

164164
/// A summary of a given comparison
165165
///
166-
/// This summary only includes changes that are significant and relevant (as determined by a changes magnitude).
166+
/// This summary only includes changes that are significant and relevant (as determined by a change's magnitude).
167167
pub struct ComparisonSummary {
168168
/// Significant comparisons of magnitude small and above
169169
/// and ordered by magnitude from largest to smallest
@@ -185,7 +185,7 @@ impl ComparisonSummary {
185185
.statistics
186186
.iter()
187187
.filter(|c| c.is_significant())
188-
.filter(|c| c.magnitude().is_small_or_above())
188+
.filter(|c| c.is_relevant())
189189
.inspect(|c| {
190190
if c.is_improvement() {
191191
num_improvements += 1;
@@ -354,24 +354,25 @@ impl ComparisonSummary {
354354
self.comparisons.iter().find(|s| s.is_regression())
355355
}
356356

357-
pub fn confidence(&self) -> ComparisonConfidence {
357+
/// The relevance level of the entire comparison
358+
pub fn relevance_level(&self) -> RelevanceLevel {
358359
let mut num_small_changes = 0;
359360
let mut num_medium_changes = 0;
360361
for c in self.comparisons.iter() {
361362
match c.magnitude() {
362363
Magnitude::Small => num_small_changes += 1,
363364
Magnitude::Medium => num_medium_changes += 1,
364-
Magnitude::Large => return ComparisonConfidence::DefinitelyRelevant,
365-
Magnitude::VeryLarge => return ComparisonConfidence::DefinitelyRelevant,
365+
Magnitude::Large => return RelevanceLevel::High,
366+
Magnitude::VeryLarge => return RelevanceLevel::High,
366367
Magnitude::VerySmall => unreachable!(),
367368
}
368369
}
369370

370371
match (num_small_changes, num_medium_changes) {
371-
(_, m) if m > 1 => ComparisonConfidence::DefinitelyRelevant,
372-
(_, 1) => ComparisonConfidence::ProbablyRelevant,
373-
(s, 0) if s > 10 => ComparisonConfidence::ProbablyRelevant,
374-
_ => ComparisonConfidence::MaybeRelevant,
372+
(_, m) if m > 1 => RelevanceLevel::High,
373+
(_, 1) => RelevanceLevel::Medium,
374+
(s, 0) if s > 10 => RelevanceLevel::Medium,
375+
_ => RelevanceLevel::Low,
375376
}
376377
}
377378
}
@@ -515,22 +516,21 @@ pub fn write_summary_table(
515516
}
516517
}
517518

518-
/// The amount of confidence we have that a comparison actually represents a real
519-
/// change in the performance characteristics.
519+
/// How relevant a set of test result comparisons are.
520520
#[derive(Clone, Copy, Debug)]
521-
pub enum ComparisonConfidence {
522-
MaybeRelevant,
523-
ProbablyRelevant,
524-
DefinitelyRelevant,
521+
pub enum RelevanceLevel {
522+
Low,
523+
Medium,
524+
High,
525525
}
526526

527-
impl ComparisonConfidence {
528-
pub fn is_definitely_relevant(self) -> bool {
529-
matches!(self, Self::DefinitelyRelevant)
527+
impl RelevanceLevel {
528+
pub fn very_relevant(self) -> bool {
529+
matches!(self, Self::High)
530530
}
531531

532-
pub fn is_atleast_probably_relevant(self) -> bool {
533-
matches!(self, Self::DefinitelyRelevant | Self::ProbablyRelevant)
532+
pub fn at_least_somewhat_relevant(self) -> bool {
533+
matches!(self, Self::High | Self::Medium)
534534
}
535535
}
536536

@@ -1037,6 +1037,15 @@ impl TestResultComparison {
10371037
Some(change.abs() / threshold)
10381038
}
10391039

1040+
/// Whether the comparison is relevant or not
1041+
fn is_relevant(&self) -> bool {
1042+
self.magnitude().is_small_or_above()
1043+
}
1044+
1045+
/// The magnitude of the change.
1046+
///
1047+
/// This is the average of the absolute magnitude of the change
1048+
/// and the amount above the significance threshold.
10401049
fn magnitude(&self) -> Magnitude {
10411050
let change = self.relative_change().abs();
10421051
let threshold = self.significance_threshold();
@@ -1051,7 +1060,7 @@ impl TestResultComparison {
10511060
} else {
10521061
Magnitude::VeryLarge
10531062
};
1054-
let change_magnitude = if change < 0.002 {
1063+
let absolute_magnitude = if change < 0.002 {
10551064
Magnitude::VerySmall
10561065
} else if change < 0.01 {
10571066
Magnitude::Small
@@ -1081,7 +1090,9 @@ impl TestResultComparison {
10811090
}
10821091
}
10831092

1084-
from_u8((as_u8(over_threshold) + as_u8(change_magnitude)) / 2)
1093+
// Take the average of the absolute magnitude and the magnitude
1094+
// above the significance threshold.
1095+
from_u8((as_u8(over_threshold) + as_u8(absolute_magnitude)) / 2)
10851096
}
10861097

10871098
fn is_dodgy(&self) -> bool {

site/src/github.rs

+5-5
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
use crate::api::github::Issue;
2-
use crate::comparison::{write_summary_table, ComparisonConfidence, ComparisonSummary, Direction};
2+
use crate::comparison::{write_summary_table, ComparisonSummary, Direction, RelevanceLevel};
33
use crate::load::{Config, SiteCtxt, TryCommit};
44

55
use anyhow::Context as _;
@@ -741,7 +741,7 @@ fn generate_short_summary(
741741

742742
match summary {
743743
Some(summary) => {
744-
if comparison_is_relevant(summary.confidence(), is_master_commit) {
744+
if comparison_is_relevant(summary.relevance_level(), is_master_commit) {
745745
let direction = summary.direction().unwrap();
746746
let num_improvements = summary.number_of_improvements();
747747
let num_regressions = summary.number_of_regressions();
@@ -773,12 +773,12 @@ fn generate_short_summary(
773773
}
774774

775775
/// Whether a comparison is relevant enough to show
776-
fn comparison_is_relevant(confidence: ComparisonConfidence, is_master_commit: bool) -> bool {
776+
fn comparison_is_relevant(relevance: RelevanceLevel, is_master_commit: bool) -> bool {
777777
if is_master_commit {
778-
confidence.is_definitely_relevant()
778+
relevance.very_relevant()
779779
} else {
780780
// is try run
781-
confidence.is_atleast_probably_relevant()
781+
relevance.at_least_somewhat_relevant()
782782
}
783783
}
784784

0 commit comments

Comments
 (0)