Update the upload benchmark workflow to work with the BenchmarkLauncher #584
Description
Problem Description
Once #570 is resolved, we will have all the functionality needed to launch, manage, and terminate benchmarks. The goal of this issue is to update the logic for uploading internal benchmark results.
One of the main problems we want to address is the case where some instances run into out-of-memory (OOM) errors. Right now, when that happens, we have to manually add results for the missing jobs for the workflow to succeed. We want to remove this manual step and make sure benchmark results are uploaded after a defined amount of time, even if some instances did not finish or failed unexpectedly.
Expected behavior
- When launching our internal benchmarks, we should save the BenchmarkLauncher instance used to launch the benchmark.
- Update the logic used by the upload benchmark workflow. The idea is:
  - Load the saved BenchmarkConfig daily.
  - Check the status of each instance (Running / Completed / Stopped).
  - If all instances have completed, upload the results.
  - Otherwise, if some instances are still running but we have reached the deadline for uploading the results (for example, the timeout plus 1 extra day), stop all remaining instances and update the results file accordingly for the missing jobs.
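The daily check described above could be sketched roughly as follows. This is only an illustration of the decision logic, not the BenchmarkLauncher API: the `Status` and `Action` enums, the `decide_action` and `instances_to_stop` helpers, and the plain dict of instance statuses are all hypothetical names introduced here. It also assumes that a run with stopped (but no longer running) instances keeps waiting until the deadline, which the issue does not specify.

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    # Instance states named in the issue.
    RUNNING = "Running"
    COMPLETED = "Completed"
    STOPPED = "Stopped"


class Action(Enum):
    # Hypothetical outcomes of one daily check.
    UPLOAD = "upload"
    WAIT = "wait"
    STOP_AND_UPLOAD = "stop_and_upload"


def decide_action(statuses, launched_at, timeout, now, grace=timedelta(days=1)):
    """Decide what one daily check should do.

    statuses maps an instance name to its Status.
    The upload deadline is launched_at + timeout + grace.
    """
    # Everything finished: the results can be uploaded as they are.
    if all(status is Status.COMPLETED for status in statuses.values()):
        return Action.UPLOAD
    # Deadline reached: stop what is left and fill in the missing jobs.
    if now >= launched_at + timeout + grace:
        return Action.STOP_AND_UPLOAD
    # Otherwise check again on the next daily run.
    return Action.WAIT


def instances_to_stop(statuses):
    """Return the names of instances that are still running."""
    return sorted(name for name, status in statuses.items()
                  if status is Status.RUNNING)
```

When `decide_action` returns `Action.STOP_AND_UPLOAD`, the workflow would stop the instances returned by `instances_to_stop` and record placeholder results for every instance that is not `Completed` before uploading.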
Additional context
In this issue, we should also update the structure of the metainfo.yaml files that are currently saved during the benchmark. In particular, we should add the following key/value:
- Instance Name