[CI/Build] Remote registry cache for docker build #174
Conversation
I have succeeded in getting the build to read from and write to the remote registry cache. However, I have not yet demonstrated build-time savings.

With a completely empty remote cache, the build time was 46 min - about twice as long as the current build time, due to the time needed to push all of the layers to the remote cache. This is not necessarily bad or unexpected: https://buildkite.com/vllm/ci/builds/32642#019986ef-3aa4-4865-aa56-5844310e119c

However, the subsequent build - in which only a one-line modification to the vLLM source had been made - still took about 33 min, i.e. on the order of the typical build time: https://buildkite.com/vllm/ci/builds/32648#01998725-e681-4b77-961b-563059c0413a

We would hope that, since only the vLLM source was modified, we would get cache hits for all image layers not associated with the vLLM source. Instead, here is a breakdown of which layers were and were not able to exploit the cache in the "subsequent" image build linked above, broken down by build stage (along with the time needed to build layers which had cache misses):

[base 1/11] - pull
[build 1/8] - CACHED
[vllm-base 1/21] - performed FROM
[test 1/7] - performed ADD 0.7s

Note that test 5/7 - test 7/7 move the precompiled vLLM package into the image's python install and then copy in the source. In principle these are the only layers which should have had cache misses; figuring out why that is not the case is a TODO. Together the steps above take about 12 min.

Additionally, the following docker image build steps took a significant amount of time:
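One way to reproduce this kind of per-step cache breakdown locally is roughly the following sketch: build against the same remote cache with plain progress output and filter for CACHED steps. The cache ref is the CI test-repo cache used by this PR; the `test` target, the build context, and the Dockerfile location are assumptions, and a docker-container builder must already be active.

```bash
# Rough sketch: build against the remote registry cache and print per-step status.
# Run from a vLLM checkout; add -f <path-to-Dockerfile> if the Dockerfile is not
# at the repo root. Requires pull access to the cache ref.
docker buildx build \
  --target test \
  --cache-from=type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache \
  --progress=plain \
  . 2>&1 | grep -E '^#[0-9]+ (CACHED|\[)'
```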
See the updated PR description - this PR will probably not immediately speed up docker builds, but it will support other optimizations which could improve performance. I think it is a good time to get this PR reviewed. @khluu it would be helpful to get the OK from you that this PR will not break anything, especially regarding the process of pruning old cache images (as described in the PR description).
```
    {% endif %}
    --tag {{ docker_image }}
    --cache-from=type=registry,ref={{ docker_image_cache }}
    --cache-to=type=registry,ref={{ docker_image_cache }},mode=max
```
qq: would this override the existing cache image on the ECR repo every time? does that mean we don't need to care about cleaning up old cache images?
Thanks for doing this! Can you also post builds on Buildkite that run with this CI branch to verify (guide here: https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo)? I saw some builds from late September, so I guess you've already run new ones now?
This PR attempts to optimize the docker build process for the `docker image build` task during PR and `main` regression test runs. Specifically, this PR exploits layer caching in a remote registry cache (there is already caching at image granularity); see vllm-project/vllm#25004.

Although this PR implements an "optimization" in the sense of skipping unnecessary image build steps, I expect this PR to result in an essentially unchanged runtime for `docker image build` (i.e. no impact on vllm-project/vllm#23588). This is because additional optimizations will be required in order to see a runtime improvement with layer caching, such as Dockerfile optimization to exploit the cache (vllm-project/vllm#27585 and vllm-project/vllm#26099) and improvements to the sccache configuration in order to speed up vLLM wheel builds. So in summary, this PR is a step that will support future optimizations.

Furthermore, I am hopeful that by increasing layer reuse between builds, this PR will indirectly increase the utilization of worker-local layer caches during unit-test `docker pull` operations (since a given worker should see a lot of repeated layer pulls across consecutive unit tests), thereby lowering individual unit test startup times as described in vllm-project/vllm#24779.

Key changes:
- Use `docker buildx build` instead of `docker build` (as only buildx supports caching from and to a remote registry)
- Create a `docker buildx` builder instance which uses the `docker-container` backend (this is a prerequisite for using buildx with a remote registry cache)
- Pass `--cache-from` and `--cache-to` for remote registry caching as shown below, with `mode=max` to ensure that layers from intermediate build stages are cached
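For illustration, a minimal sketch of the resulting invocation is below. The `--cache-from`/`--cache-to` flags match the diff in this PR; the builder name, target image tag, and `--push` usage are placeholders (in CI these values are rendered from the Jinja pipeline template and pushed with CI credentials).

```bash
# One-time setup: a docker-container builder is required for registry cache export.
docker buildx create --name ci-builder --driver docker-container --use

# Build with the remote registry cache; mode=max also exports layers from
# intermediate build stages, not just the final image.
docker buildx build \
  --tag public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT \
  --cache-from=type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache \
  --cache-to=type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache,mode=max \
  --push \
  .
```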
The caches are docker images in ECR:

- For regression test runs against PRs, the cache is `public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache`
- For regression test runs against `main`, the cache is `public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:cache`

Regarding the process of pruning or managing these caches: buildx does not implement its own cache eviction policy, so we have to ensure on our end that the caches do not grow without bound. To my knowledge we do not currently have a lifecycle policy on these docker images, and we rely on an infrastructure maintainer to delete images manually when they are old (please confirm @khluu). So in the context of remote registry caching, I expect that our current process for pruning old docker images would extend to periodically deleting old cache images.
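For reference, a hypothetical manual pruning step might look like the sketch below, assuming these are ECR Public repositories, the caller has delete permissions, and the ECR Public API is addressed via us-east-1; this is not part of this PR.

```bash
# Hypothetical: drop a stale cache image so the next build starts from an
# empty cache. Repository name and tag follow the cache refs listed above.
aws ecr-public batch-delete-image \
  --region us-east-1 \
  --repository-name vllm-ci-test-repo \
  --image-ids imageTag=cache
```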
Note that this PR only covers `docker image build`; it does not cover the CPU or AMD scenarios, which will need to be follow-up tasks.