Description
Recently I experimented with the torchvision tests on #6992 and found the results quite promising. I am creating this issue to summarize the experiment, propose separating the model tests (`test_models.py` and `test_backbone_utils.py`) from the other tests, and ask for feedback.
Highlights
By separating the model tests (`test_models.py` and `test_backbone_utils.py`) we get:
- Faster waiting time for developers: the wait for all tests to finish drops from 56 minutes to 38 minutes (~32% improvement).
- The sum of all test times (the machine time) drops from 617 minutes to 415 minutes (~33% reduction); moreover, the non-model tests can run on a machine type that is 2x cheaper.
- Some tests have too many input variations, which can be reduced for a further speedup (for example `test_augmix`).
Action plan:
After talking with @osalpekar, I will separate the model tests for the workflows that already use GHA, breaking the work down by OS.
- Separate model tests for linux-cpu in GHA [In progress]
- Separate model tests for macos-cpu in GHA [Waiting for the workflow to move to GHA]
- Separate model tests for windows-cpu in GHA [Waiting for the workflow to move to GHA]
- Discuss which tests we may be able to speed up by reducing the number of input variations
- Create a PR to speed up `test_augmix` by reducing input variations
Experiments
Analyze the test running time
The first thing I did was run the linux tests with Python 3.8 (both CPU and GPU) using `--durations 5000`, so pytest prints the duration of each test (here is the data on google sheet). In the raw data, each row corresponds to a test with specific arguments or inputs, for instance `test_classification_model` for the `resnet50` model. From this, I aggregated by test function and by test file to get a higher-level view.
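For reference, here is a minimal sketch of how such an aggregation can be reproduced. The `durations.csv` layout (columns `test_file`, `test_name`, `duration_sec`) is a hypothetical export format for illustration; the actual data is in the sheet linked above:

```python
# Aggregate pytest `--durations` output (exported to a hypothetical
# durations.csv) per test file, mirroring the sheet's aggregation.
import csv
from collections import defaultdict

per_file = defaultdict(float)

with open("durations.csv") as f:
    for row in csv.DictReader(f):
        # Each row is one parametrized test invocation, e.g.
        # test_classification_model[resnet50]; sum over all of them.
        per_file[row["test_file"]] += float(row["duration_sec"])

total = sum(per_file.values())
for path, dur in sorted(per_file.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path}: {dur:.1f}s ({dur / total:.1%} of total)")
```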
Here are the interesting findings on this data:
- The two slowest test files on CPU are `test_models.py` and `test_backbone_utils.py`, which take 34.36% and 28.46% of the total test duration respectively. On GPU, `test_models.py` takes 61.08% of the total test duration.
- `test_transforms.py::test_augmix` took 145.6 seconds, whereas all of `test_transforms.py` took 172.7 seconds (~84.3% of the total `test_transforms.py` duration comes from `test_augmix` alone).
Experiment on separating the model and backbone_utils tests
Both `test_models.py` and `test_backbone_utils.py` are similar in the sense that they test models. Big models require high memory usage and can be quite slow. Since these files test models, the APIs under test are fairly high level, so we might not need to run them on every Python version (the low-level operators should already be tested on all Python versions).
From this reasoning, I think it would be beneficial to separate the model tests (I will refer to both `test_models.py` and `test_backbone_utils.py` as the model tests) and run them on one Python version only (we could consider running them on all Python versions only on the main and nightly branches, like we do for the GPU tests).
I experimented with separating them using a pytest marker. Here are the main changes (see the sketch after this list):
- Add the global variable `pytestmark = pytest.mark.slow` to `test_models.py` and `test_backbone_utils.py` to mark every test inside them as `pytest.mark.slow`
- Modify the `pytest.ini` addopts to include `-m "not slow"`, so by default the tests marked as slow are skipped
- Modify `.circleci/regenerate.py` and `.circleci/config.yml.in` to accept `pytest_additional_args` as a parameter, which we pass to the `run_test.sh` script to add `-m "slow"` (run only the model tests) or `-m "slow or not slow"` (run all tests)

(for more detail see #6992)
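As a minimal sketch of the marker mechanics (the actual changes are in #6992):

```python
# Top of test_models.py (and likewise test_backbone_utils.py).
import pytest

# A module-level `pytestmark` applies the marker to every test collected
# from this file, so the whole file can be selected or deselected with -m.
pytestmark = pytest.mark.slow
```

With `-m "not slow"` in the `pytest.ini` addopts, a plain pytest run skips the model tests, while the CI jobs that should run them override the selection with `-m "slow"` or `-m "slow or not slow"`. (If `--strict-markers` is enabled, the `slow` marker also needs to be registered under `markers` in `pytest.ini`.)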
After separating the model tests, I noticed that on linux GPU we have a relatively high overhead for installing the environment. Because of this overhead, we decided it is not worth separating the tests on GPU, so we modified `.circleci/regenerate.py` to only separate the model tests on CPU.
Experiment on reducing the machine type for non-model tests
After separating the model tests, we had the idea of reducing the machine type for the non-model tests. The intuition is that we previously needed a larger machine type because the model tests require a lot of memory; once the model tests are separated, the non-model tests can run on a lower-spec machine.
Initially we ran the tests on a `2xlarge+` machine for linux and a `large` machine for macos. We found that for the non-model tests we can use an `xlarge` machine for linux and a `medium` machine for macos instead, and since the model tests are the slowest part, this does not really affect the overall waiting time for all tests to finish! This is potentially a good cost saving.
For more detail on the experiment, see this doc.
Experiment on optimizing test_augmix
Once the model tests are separated, `test_augmix` becomes very significant among the non-model tests; to be precise, it occupies around 26% of the non-model test duration. Hence it would be good if we could speed up this test.
Looking at our data, `test_augmix` is called 97 times (96 times with different inputs, plus 1 setup call). The first thing we want to try is reducing the number of input variations for this test.
Looking at the code, the test currently has the following input parameters:
@pytest.mark.parametrize("fill", [None, 85, (128, 128, 128)])
@pytest.mark.parametrize("severity", [1, 10])
@pytest.mark.parametrize("mixture_width", [1, 2])
@pytest.mark.parametrize("chain_depth", [-1, 2])
@pytest.mark.parametrize("all_ops", [True, False])
@pytest.mark.parametrize("grayscale", [True, False])
Although each parameter only has 2 or 3 different values, because stacked `parametrize` decorators take the Cartesian product, the total number of different inputs quickly grows to `3 * 2 * 2 * 2 * 2 * 2 = 96`. The idea for reducing the total input variation is to sample from these 96 combinations such that each value of each parameter is chosen at least once. This should be okay, since I don't think we really need to test all 96 combinations. As done in #6992, here is the replacement for the input parameters:
```python
@pytest.mark.parametrize(
    "fill,severity,mixture_width,chain_depth,all_ops,grayscale",
    [
        (None, 1, 1, -1, True, True),
        (85, 10, 2, 2, False, False),
        ((128, 128, 128), 1, 2, -1, False, True),
        (None, 10, 1, 2, True, False),
        (85, 1, 1, -1, False, False),
        ((128, 128, 128), 10, 2, -1, True, False),
    ],
)
```
Now we only have 6 variations (and each parameter value still appears in at least one of them), which should speed up `test_augmix` by 16x (96 / 6).
Note that we can increase the number of samples if needed; this is up to the test owner to decide. I think this method is also applicable to any test that uses a large number of input parameter combinations.
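As a sketch of how that sampling could be automated, here is a hypothetical helper (not part of torchvision or #6992) that builds a reduced grid in which every value of every parameter appears at least once:

```python
import random
from typing import Optional


def covering_sample(params: dict, n_samples: Optional[int] = None, seed: int = 0):
    """`params` maps a parameter name to its list of values. Returns a list
    of tuples (in the key order of `params`) usable with parametrize.

    Each column repeats its values enough times to fill `n_samples` rows, so
    every value appears at least once whenever `n_samples` is at least the
    length of the longest value list; shuffling varies the pairings.
    """
    rng = random.Random(seed)
    longest = max(len(values) for values in params.values())
    n = max(n_samples or longest, longest)
    columns = []
    for values in params.values():
        col = (values * -(-n // len(values)))[:n]  # ceil-repeat, then trim
        rng.shuffle(col)
        columns.append(col)
    return list(zip(*columns))


# For example, a 6-row grid for test_augmix:
reduced = covering_sample(
    {
        "fill": [None, 85, (128, 128, 128)],
        "severity": [1, 10],
        "mixture_width": [1, 2],
        "chain_depth": [-1, 2],
        "all_ops": [True, False],
        "grayscale": [True, False],
    },
    n_samples=6,
)
```

The result can then be passed directly as the second argument of `@pytest.mark.parametrize("fill,severity,mixture_width,chain_depth,all_ops,grayscale", reduced)`.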