Skip to content

Commit 55dbcb9

Browse files
authored
Merge pull request #1271 from rylev/clarify-terms
Clarify relevance
2 parents c5270d2 + d9aedf5 commit 55dbcb9

File tree

4 files changed

+60
-48
lines changed

4 files changed

+60
-48
lines changed

docs/comparison-analysis.md

+11-11
Original file line numberDiff line numberDiff line change
@@ -33,11 +33,11 @@ How many _significant_ test results indicate performance changes and what is the
3333
* given 20 changes of different kinds all of low magnitude, the result is mixed unless only 2 or fewer of the changes are of one kind.
3434
* given 5 changes of different kinds all of low magnitude, the result is always mixed.
3535

36-
Whether we actually _report_ an analysis or not depends on the context and how _confident_ we are in the summary of the results (see below for an explanation of how confidence is derived). For example, in pull request performance "try" runs, we report a performance change if we are at least confident that the results are "probably relevant", while for the triage report, we only report if the we are confident the results are "definitely relevant".
36+
Whether we actually _report_ an analysis or not depends on the context and how relevant we find the summary of the results over all (see below for an explanation of how the relevance of a summary is determined). For example, in pull request performance "try" runs, we report a performance change if results are "somewhat relevant", while for the triage report, we only report if the we are confident the results are "definitely relevant".
3737

3838
### What makes a test result significant?
3939

40-
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range "fencing" (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):
40+
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range ["fencing"](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):
4141

4242
```
4343
interquartile_range = Q3 - Q1
@@ -48,26 +48,26 @@ result > Q3 + (interquartile_range * 3)
4848

4949
We ignore the lower fence, because result data is bounded by 0.
5050

51-
This upper fence is often called the "significance threshold".
51+
This upper fence is called the "significance threshold".
5252

53-
### How is confidence in whether a test analysis is "relevant" determined?
53+
### How is relevance of a test run summary determined?
5454

55-
The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude.
55+
The relevance test run summary is determined by the number of significant and relevant test results and their magnitude.
5656

5757
#### Magnitude
5858

5959
Magnitude is a combination of two factors:
6060
* how large a change is regardless of the direction of the change
6161
* how much that change went over the significance threshold
6262

63-
If a large change only happens to go over the significance threshold by a small factor, then the over magnitude of the change is considered small.
63+
As an example, if a change that is large in absolute terms only exceeds the significance threshold by a small factor, then the overall magnitude of the change is considered small.
6464

65-
#### Confidence algorithm
65+
#### Relevance algorithm
6666

67-
The actual algorithm for determining confidence may change, but in general the following rules apply:
68-
* Definitely relevant: any number of very large or large changes, a small amount of medium changes, or a large amount of small or very small changes.
69-
* Probably relevant: any number of very large or large changes, any medium change, or smaller but still substantial amount of small or very small changes.
70-
* Maybe relevant: if it doesn't fit into the above two categories, it ends in this category.
67+
The actual algorithm for determining relevance of a comparison summary may change, but in general the following rules apply:
68+
* High relevance: any number of very large or large changes, a small amount of medium changes, or a large number of small or very small changes.
69+
* Medium relevance: any number of very large or large changes, any medium change, or smaller but still substantial number of small or very small changes.
70+
* Low relevance: if it doesn't fit into the above two categories, it ends in this category.
7171

7272
### "Dodgy" Test Cases
7373

docs/glossary.md

+6-6
Original file line numberDiff line numberDiff line change
@@ -25,20 +25,20 @@ The following is a glossary of domain specific terminology. Although benchmarks
2525
* **test case**: a combination of a benchmark, a profile, and a scenario.
2626
* **test**: the act of running an artifact under a test case. Each test result is composed of many iterations.
2727
* **test iteration**: a single iteration that makes up a test. Note: we currently normally run 2 test iterations for each test.
28-
* **test result**: the result of the collection of all statistics from running a test. Currently the minimum of the statistics.
28+
* **test result**: the result of the collection of all statistics from running a test. Currently, the minimum value of a statistic from all the test iterations is used.
2929
* **statistic**: a single value of a metric in a test result
3030
* **statistic description**: the combination of a metric and a test case which describes a statistic.
3131
* **statistic series**: statistics for the same statistic description over time.
3232
* **run**: a collection of test results for all currently available test cases run on a given artifact.
33-
* **test result delta**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result deltas as percentages comparing two runs.
3433

3534
## Analysis
3635

37-
* **test result delta**: the difference between two test results for the same metric and test case.
38-
* **significance threshold**: the threshold at which a test result delta is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
39-
* **significant test result delta**: a test result delta above the significance threshold. Significant test result deltas can be thought of as "statistically significant".
36+
* **test result comparison**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result comparisons as percentages between two runs.
37+
* **significance threshold**: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). You can see how this is calculated [here](https://github.com/rust-lang/rustc-perf/blob/master/docs/comparison-analysis.md#what-makes-a-test-result-significant).
38+
* **significant test result comparison**: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
39+
* **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a factor of the test result comparison's significance and magnitude. Comparisons are considered relevant if they are significant and have at least a small magnitude .
40+
* **test result comparison magnitude**: how "large" the delta is between the two test result's under comparison. This is determined by the average of two factors: the absolute size of the change (i.e., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (i.e., a change that is 5x the significance threshold is larger than a change 1.5x the significance threshold).
4041
* **dodgy test case**: a test case for which the significance threshold is significantly large indicating a high amount of variability in the test and thus making it necessary to be somewhat skeptical of any results too close to the significance threshold.
41-
* **relevant test result delta**: a synonym for *significant test result delta* in situations where the term "significant" might be ambiguous and readers may potentially interpret *significant* as "large" or "statistically significant". For example, in try run results, we use the term relevant instead of significant.
4242

4343
## Other
4444

site/src/comparison.rs

+38-26
Original file line numberDiff line numberDiff line change
@@ -148,11 +148,11 @@ async fn populate_report(
148148
report: &mut HashMap<Option<Direction>, Vec<String>>,
149149
) {
150150
if let Some(summary) = ComparisonSummary::summarize_comparison(comparison) {
151-
let confidence = summary.confidence();
152-
if confidence.is_atleast_probably_relevant() {
151+
let relevance = summary.relevance_level();
152+
if relevance.at_least_somewhat_relevant() {
153153
if let Some(direction) = summary.direction() {
154154
let entry = report
155-
.entry(confidence.is_definitely_relevant().then(|| direction))
155+
.entry(relevance.very_relevant().then(|| direction))
156156
.or_default();
157157

158158
entry.push(write_triage_summary(ctxt, comparison).await)
@@ -163,7 +163,7 @@ async fn populate_report(
163163

164164
/// A summary of a given comparison
165165
///
166-
/// This summary only includes changes that are significant and relevant (as determined by a changes magnitude).
166+
/// This summary only includes changes that are significant and relevant (as determined by a change's magnitude).
167167
pub struct ComparisonSummary {
168168
/// Significant comparisons of magnitude small and above
169169
/// and ordered by magnitude from largest to smallest
@@ -184,8 +184,7 @@ impl ComparisonSummary {
184184
let mut comparisons = comparison
185185
.statistics
186186
.iter()
187-
.filter(|c| c.is_significant())
188-
.filter(|c| c.magnitude().is_small_or_above())
187+
.filter(|c| c.is_relevant())
189188
.inspect(|c| {
190189
if c.is_improvement() {
191190
num_improvements += 1;
@@ -354,24 +353,25 @@ impl ComparisonSummary {
354353
self.comparisons.iter().find(|s| s.is_regression())
355354
}
356355

357-
pub fn confidence(&self) -> ComparisonConfidence {
356+
/// The relevance level of the entire comparison
357+
pub fn relevance_level(&self) -> RelevanceLevel {
358358
let mut num_small_changes = 0;
359359
let mut num_medium_changes = 0;
360360
for c in self.comparisons.iter() {
361361
match c.magnitude() {
362362
Magnitude::Small => num_small_changes += 1,
363363
Magnitude::Medium => num_medium_changes += 1,
364-
Magnitude::Large => return ComparisonConfidence::DefinitelyRelevant,
365-
Magnitude::VeryLarge => return ComparisonConfidence::DefinitelyRelevant,
364+
Magnitude::Large => return RelevanceLevel::High,
365+
Magnitude::VeryLarge => return RelevanceLevel::High,
366366
Magnitude::VerySmall => unreachable!(),
367367
}
368368
}
369369

370370
match (num_small_changes, num_medium_changes) {
371-
(_, m) if m > 1 => ComparisonConfidence::DefinitelyRelevant,
372-
(_, 1) => ComparisonConfidence::ProbablyRelevant,
373-
(s, 0) if s > 10 => ComparisonConfidence::ProbablyRelevant,
374-
_ => ComparisonConfidence::MaybeRelevant,
371+
(_, m) if m > 1 => RelevanceLevel::High,
372+
(_, 1) => RelevanceLevel::Medium,
373+
(s, 0) if s > 10 => RelevanceLevel::Medium,
374+
_ => RelevanceLevel::Low,
375375
}
376376
}
377377
}
@@ -515,22 +515,21 @@ pub fn write_summary_table(
515515
}
516516
}
517517

518-
/// The amount of confidence we have that a comparison actually represents a real
519-
/// change in the performance characteristics.
518+
/// How relevant a set of test result comparisons are.
520519
#[derive(Clone, Copy, Debug)]
521-
pub enum ComparisonConfidence {
522-
MaybeRelevant,
523-
ProbablyRelevant,
524-
DefinitelyRelevant,
520+
pub enum RelevanceLevel {
521+
Low,
522+
Medium,
523+
High,
525524
}
526525

527-
impl ComparisonConfidence {
528-
pub fn is_definitely_relevant(self) -> bool {
529-
matches!(self, Self::DefinitelyRelevant)
526+
impl RelevanceLevel {
527+
pub fn very_relevant(self) -> bool {
528+
matches!(self, Self::High)
530529
}
531530

532-
pub fn is_atleast_probably_relevant(self) -> bool {
533-
matches!(self, Self::DefinitelyRelevant | Self::ProbablyRelevant)
531+
pub fn at_least_somewhat_relevant(self) -> bool {
532+
matches!(self, Self::High | Self::Medium)
534533
}
535534
}
536535

@@ -1037,6 +1036,17 @@ impl TestResultComparison {
10371036
Some(change.abs() / threshold)
10381037
}
10391038

1039+
/// Whether the comparison is relevant or not.
1040+
///
1041+
/// Relevance is a function of significance and magnitude.
1042+
fn is_relevant(&self) -> bool {
1043+
self.is_significant() && self.magnitude().is_small_or_above()
1044+
}
1045+
1046+
/// The magnitude of the change.
1047+
///
1048+
/// This is the average of the absolute magnitude of the change
1049+
/// and the amount above the significance threshold.
10401050
fn magnitude(&self) -> Magnitude {
10411051
let change = self.relative_change().abs();
10421052
let threshold = self.significance_threshold();
@@ -1051,7 +1061,7 @@ impl TestResultComparison {
10511061
} else {
10521062
Magnitude::VeryLarge
10531063
};
1054-
let change_magnitude = if change < 0.002 {
1064+
let absolute_magnitude = if change < 0.002 {
10551065
Magnitude::VerySmall
10561066
} else if change < 0.01 {
10571067
Magnitude::Small
@@ -1081,7 +1091,9 @@ impl TestResultComparison {
10811091
}
10821092
}
10831093

1084-
from_u8((as_u8(over_threshold) + as_u8(change_magnitude)) / 2)
1094+
// Take the average of the absolute magnitude and the magnitude
1095+
// above the significance threshold.
1096+
from_u8((as_u8(over_threshold) + as_u8(absolute_magnitude)) / 2)
10851097
}
10861098

10871099
fn is_dodgy(&self) -> bool {

site/src/github.rs

+5-5
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
use crate::api::github::Issue;
2-
use crate::comparison::{write_summary_table, ComparisonConfidence, ComparisonSummary, Direction};
2+
use crate::comparison::{write_summary_table, ComparisonSummary, Direction, RelevanceLevel};
33
use crate::load::{Config, SiteCtxt, TryCommit};
44

55
use anyhow::Context as _;
@@ -741,7 +741,7 @@ fn generate_short_summary(
741741

742742
match summary {
743743
Some(summary) => {
744-
if comparison_is_relevant(summary.confidence(), is_master_commit) {
744+
if comparison_is_relevant(summary.relevance_level(), is_master_commit) {
745745
let direction = summary.direction().unwrap();
746746
let num_improvements = summary.number_of_improvements();
747747
let num_regressions = summary.number_of_regressions();
@@ -773,12 +773,12 @@ fn generate_short_summary(
773773
}
774774

775775
/// Whether a comparison is relevant enough to show
776-
fn comparison_is_relevant(confidence: ComparisonConfidence, is_master_commit: bool) -> bool {
776+
fn comparison_is_relevant(relevance: RelevanceLevel, is_master_commit: bool) -> bool {
777777
if is_master_commit {
778-
confidence.is_definitely_relevant()
778+
relevance.very_relevant()
779779
} else {
780780
// is try run
781-
confidence.is_atleast_probably_relevant()
781+
relevance.at_least_somewhat_relevant()
782782
}
783783
}
784784

0 commit comments

Comments
 (0)