Commit 7a258c8

Fold in the comparison-analysis page
1 parent 2dcf2a9 commit 7a258c8

2 files changed

+11
-11
lines changed

docs/comparison-analysis.md

+10-10
@@ -33,11 +33,11 @@ How many _significant_ test results indicate performance changes and what is the
3333
* given 20 changes of different kinds all of low magnitude, the result is mixed unless only 2 or fewer of the changes are of one kind.
3434
* given 5 changes of different kinds all of low magnitude, the result is always mixed.
3535

36-
Whether we actually _report_ an analysis or not depends on the context and how _confident_ we are in the summary of the results (see below for an explanation of how confidence is derived). For example, in pull request performance "try" runs, we report a performance change if we are at least confident that the results are "probably relevant", while for the triage report, we only report if the we are confident the results are "definitely relevant".
36+
Whether we actually _report_ an analysis or not depends on the context and how relevant we find the summary of the results overall (see below for an explanation of how summary relevance is determined). For example, in pull request performance "try" runs, we report a performance change if results are "somewhat relevant", while for the triage report, we only report if we are confident the results are "definitely relevant".
3737

3838
### What makes a test result significant?
3939

40-
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range "fencing" (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):
40+
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range ["fencing"](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):
4141

4242
```
4343
interquartile_range = Q3 - Q1
@@ -48,11 +48,11 @@ result > Q3 + (interquartile_range * 3)
4848

4949
We ignore the lower fence, because result data is bounded by 0.
5050

51-
This upper fence is often called the "significance threshold".
51+
This upper fence is called the "significance threshold".
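
The fencing rule described above can be sketched in Rust as follows. This is a minimal illustration, not the code used by the site: the function names, the linear-interpolation quartile method, and the sample data are all assumptions. The multiplier is left as a parameter because the prose mentions 1.5 while the visible formula fragment uses 3.

```rust
/// Linear-interpolation quantile of an already sorted, non-empty slice.
/// The interpolation method is an assumption; other quartile conventions exist.
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Upper IQR fence: Q3 + multiplier * (Q3 - Q1).
/// `history` holds past relative changes for one test case (hypothetical input).
fn upper_fence(history: &[f64], multiplier: f64) -> f64 {
    let mut sorted = history.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile(&sorted, 0.25);
    let q3 = quantile(&sorted, 0.75);
    q3 + multiplier * (q3 - q1)
}

/// A result is significant when it exceeds the upper fence.
/// The lower fence is ignored because result data is bounded by 0.
fn is_significant(result: f64, history: &[f64], multiplier: f64) -> bool {
    result > upper_fence(history, multiplier)
}
```

With a history of `[0.1, 0.2, 0.3, 0.4, 0.5]` and a multiplier of 1.5, this sketch places Q1 at 0.2 and Q3 at 0.4, so the fence sits at roughly 0.7: a result of 0.8 counts as significant and a result of 0.6 does not.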
5252

53-
### How is confidence in whether a test analysis is "relevant" determined?
53+
### How is relevance of a test run summary determined?
5454

55-
The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude.
55+
The relevance of a test run summary is determined by the number of significant and relevant test results and their magnitude.
5656

5757
#### Magnitude
5858

@@ -62,12 +62,12 @@ Magnitude is a combination of two factors:
6262

6363
If a large change only happens to go over the significance threshold by a small factor, then the overall magnitude of the change is considered small.
6464

65-
#### Confidence algorithm
65+
#### Relevance algorithm
6666

67-
The actual algorithm for determining confidence may change, but in general the following rules apply:
68-
* Definitely relevant: any number of very large or large changes, a small amount of medium changes, or a large amount of small or very small changes.
69-
* Probably relevant: any number of very large or large changes, any medium change, or smaller but still substantial amount of small or very small changes.
70-
* Maybe relevant: if it doesn't fit into the above two categories, it ends in this category.
67+
The actual algorithm for determining relevance of a comparison summary may change, but in general the following rules apply:
68+
* High relevance: any number of very large or large changes, a small amount of medium changes, or a large amount of small or very small changes.
69+
* Medium relevance: any number of very large or large changes, any medium change, or a smaller but still substantial amount of small or very small changes.
70+
* Small relevance: if it doesn't fit into the above two categories, it ends in this category.
7171

7272
### "Dodgy" Test Cases
7373

docs/glossary.md

+1-1
@@ -34,7 +34,7 @@ The following is a glossary of domain specific terminology. Although benchmarks
3434
## Analysis
3535

3636
* **test result comparison**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result comparisons as percentages between two runs.
37-
* **significance threshold**: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
37+
* **significance threshold**: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). You can see how this is calculated [here](https://github.com/rust-lang/rustc-perf/blob/master/docs/comparison-analysis.md#what-makes-a-test-result-significant).
3838
* **significant test result comparison**: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
3939
* **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a factor of the test result comparison's *magnitude*. Comparisons are considered relevant if they have a small magnitude or more. This term is often used to mean "significant *and* relevant" since relevant changes are necessarily also significant.
4040
* **test result comparison magnitude**: how "large" the delta is between the two test results under comparison. This is determined by the average of two factors: the absolute size of the change (i.e., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (i.e., a change that is 5x the significance threshold is larger than a change 1.5x the significance threshold).
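
The "average of two factors" idea in the magnitude entry can be sketched in Rust as below. This is only an illustration under stated assumptions: the cutoff values, the `bucket` helper, and the choice to round the average down are all hypothetical and are not taken from the rustc-perf source. Only the five magnitude names and the two factors come from the documentation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Magnitude {
    VerySmall,
    Small,
    Medium,
    Large,
    VeryLarge,
}

/// Rank a value from 0 to 4 by counting how many cutoffs it reaches.
fn bucket(value: f64, cutoffs: [f64; 4]) -> u8 {
    cutoffs.iter().filter(|&&c| value >= c).count() as u8
}

fn from_rank(rank: u8) -> Magnitude {
    match rank {
        0 => Magnitude::VerySmall,
        1 => Magnitude::Small,
        2 => Magnitude::Medium,
        3 => Magnitude::Large,
        _ => Magnitude::VeryLarge,
    }
}

/// Combine the absolute change (in percent) with how far it exceeds
/// the significance threshold (also in percent).
fn magnitude(change_pct: f64, threshold_pct: f64) -> Magnitude {
    // Hypothetical cutoffs, chosen only to make the example concrete.
    const ABS_CUTOFFS: [f64; 4] = [0.2, 1.0, 2.0, 5.0];
    const RATIO_CUTOFFS: [f64; 4] = [1.5, 2.0, 3.0, 5.0];

    let abs_rank = bucket(change_pct.abs(), ABS_CUTOFFS);
    let ratio_rank = bucket(change_pct.abs() / threshold_pct, RATIO_CUTOFFS);
    // Integer division rounds the average down (an assumption).
    from_rank((abs_rank + ratio_rank) / 2)
}
```

Under these assumed cutoffs, a 2.5% change against a 2% threshold is large in absolute terms but only 1.25x the threshold, so the combined magnitude comes out as `Small`. A 6% change against a 1% threshold ranks highest on both factors and comes out as `VeryLarge`.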
