Conversation

afeldman-nm (Collaborator) commented Sep 26, 2025

This PR attempts to optimize the Docker build process for the docker image build task during PR and main regression test runs. Specifically, it adds layer caching backed by a remote registry cache (caching already exists at image granularity); see vllm-project/vllm#25004.

Although this PR implements an "optimization" in the sense of skipping unnecessary image build steps, I expect it to leave the docker image build runtime essentially unchanged (i.e., no impact on vllm-project/vllm#23588). Additional optimizations will be required before layer caching yields a runtime improvement, such as Dockerfile changes that better exploit the cache (vllm-project/vllm#27585 and vllm-project/vllm#26099) and improvements to the sccache configuration to speed up vLLM wheel builds. In summary, this PR is a step that will support future optimizations.

Furthermore, I am hopeful that by increasing layer reuse between builds, this PR will indirectly increase the hit rate of worker-local layer caches during unit-test docker pull operations (since a given worker should see many repeated layer pulls across consecutive unit tests), thereby lowering individual unit test startup times as described in vllm-project/vllm#24779.

Key changes:

  • Use docker buildx build instead of docker build (only buildx supports caching from and to a remote registry)
  • Create a docker buildx builder instance that uses the docker-container driver (a prerequisite for using buildx with a remote registry cache)
  • Use --cache-from and --cache-to for remote registry caching as shown below, with mode=max to ensure that layers from intermediate build stages are also cached:
--cache-from=type=registry,ref={{ docker_image_cache }}
--cache-to=type=registry,ref={{ docker_image_cache }},mode=max
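For concreteness, here is a rough sketch of what the builder setup and build invocation could look like once the Jinja variables are resolved. The builder name, Dockerfile path, and image tag below are illustrative placeholders rather than the exact values used in the CI templates:

# One-time (per agent) setup: create a buildx builder backed by the
# docker-container driver, since the default "docker" driver cannot
# export a layer cache to a remote registry.
docker buildx create --name ci-builder --driver docker-container --use

# Build the CI image, reading the layer cache from the remote registry and
# writing it back with mode=max so that layers from intermediate build
# stages (not just the final stage) are cached.
docker buildx build \
  --file Dockerfile \
  --tag public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT \
  --cache-from=type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache \
  --cache-to=type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache,mode=max \
  --push \
  .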

The caches are stored as Docker images in the public ECR registry.

For regression test runs against PRs, the cache is public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:cache

For regression test runs against main, the cache is public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:cache

Regarding the process of pruning or managing these caches: buildx does not implement its own cache eviction policy, so we have to ensure on our end that the caches do not grow without bound. To my knowledge we do not currently have a lifecycle policy on CI Docker images in ECR, and we rely on an infrastructure maintainer to delete old images manually (please confirm @khluu). So in the context of remote registry caching, I expect our current process for pruning old Docker images to extend to periodically deleting old cache images.
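As a point of reference (not something this PR adds), manual pruning of stale cache data could look roughly like the sketch below. It assumes the caches live in an ECR Public repository, that the AWS CLI and jq are available, and that stale cache data shows up as untagged images once the cache tag is overwritten; the repository name is illustrative:

# ECR Public is served from us-east-1.
# List images in the cache repository and keep only untagged ones,
# i.e. old cache manifests whose tag has since been overwritten.
aws ecr-public describe-images \
  --region us-east-1 \
  --repository-name vllm-ci-test-repo \
  --output json \
| jq -r '.imageDetails[] | select(.imageTags == null) | .imageDigest' \
| while read -r digest; do
    # Delete each stale, untagged image by digest.
    aws ecr-public batch-delete-image \
      --region us-east-1 \
      --repository-name vllm-ci-test-repo \
      --image-ids imageDigest="$digest"
  done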

Note that this PR only covers docker image build; it does not cover the CPU or AMD scenarios, which will be handled in follow-up tasks.

Signed-off-by: Andrew Feldman <[email protected]>
afeldman-nm (Collaborator, Author) commented Sep 26, 2025

I have succeeded in getting docker buildx build to use the remote registry layer cache.

However, I have not yet demonstrated build-time savings.

With a completely empty remote cache, the build time was 46 min, about twice as long as the current build time, due to the time needed to push all of the layers to the remote cache. This is not necessarily bad or unexpected:

https://buildkite.com/vllm/ci/builds/32642#019986ef-3aa4-4865-aa56-5844310e119c

However, the subsequent build, in which a one-command modification to the vLLM source had been made, still took about 33 min, which is on the order of the typical build time.

https://buildkite.com/vllm/ci/builds/32648#01998725-e681-4b77-961b-563059c0413a

We would hope that since only the vLLM source was modified, we would get cache hits for all image layers not associated with the vLLM source. Instead, here is a per-build-stage breakdown of which layers did and did not hit the cache in the "subsequent" image build linked above, along with the time needed to build the layers that missed:

[base 1/11] - pull
[base 2/11] - CACHED
[base 3/11] - CACHED
[base 4/11] - CACHED
[base 5/11] - CACHED
[base 6/11] - CACHED
[base 7/11] - CACHED
[base 8/11] - CACHED
[base 9/11] - CACHED
[base 10/11] - CACHED
[base 11/11] - CACHED

[build 1/8] - CACHED
[build 2/8] - CACHED
[build 3/8] - performed COPY 48.2s
[build 4/8] - performed RUN 0.2s
[build 5/8] - performed RUN (compile) 78.3s
[build 6/8] - performed RUN 0.2s
[build 7/8] - performed COPY 0.0s
[build 8/8] - performed RUN 0.3s

[vllm-base 1/21] - performed FROM
[vllm-base 2/21] - CACHED
[vllm-base 3/21] - CACHED
[vllm-base 4/21] - CACHED
[vllm-base 5/21] - CACHED
[vllm-base 6/21] - CACHED
[vllm-base 7/21] - CACHED
[vllm-base 8/21] - performed RUN (dist/*.whl) 193.3s
[vllm-base 9/21] - performed RUN (/vllm-workspace?) 7.8s
[vllm-base 10/21] - performed COPY 0.0s 0.1s
[vllm-base 11/21] - performed COPY 0.0s
[vllm-base 12/21] - performed COPY 0.0s
[vllm-base 13/21] - 0.1s
[vllm-base 14/21] - 0.0s
[vllm-base 15/21] - 2.4s
[vllm-base 16/21] - 0.0s
[vllm-base 17/21] - 43.7s
[vllm-base 18/21] - 0.0s
[vllm-base 19/21] - 3.0s
[vllm-base 20/21] - 0.0s
[vllm-base 21/21] - 334s

[test 1/7] - performed ADD 0.7s
[test 2/7] - performed RUN 64.2s
[test 3/7] - performed RUN 1.1s
[test 4/7] - performed RUN 1.5s
[test 5/7] - performed COPY 0.0s
[test 6/7] - performed RUN 0.1s
[test 7/7] - performed RUN 0.1s

Note that test 5/7 through test 7/7 move the precompiled vLLM package into the image's Python install and then copy in the source. In principle these should have been the only layers with cache misses; it is a TODO to figure out why that was not the case.

Together, the steps above took about 12 min.

Additionally, the following docker image build steps took a significant amount of time:

  • Exporting layers: 273.2s
  • Pushing layers: 313.3s
  • Writing cache: 206.8s
  • Total: 793.3s (13.22 min)

Andrew Feldman and others added 7 commits October 20, 2025 21:04
@afeldman-nm afeldman-nm marked this pull request as ready for review October 28, 2025 20:04
afeldman-nm (Collaborator, Author) commented:
See the updated PR description: this PR will probably not immediately speed up Docker builds, but it will support other optimizations that could improve performance.

I think it is a good time to get this PR reviewed. @khluu it would be helpful to get the OK from you that this PR will not break anything, especially regarding the process of pruning old cache images (as described in the PR description).

@afeldman-nm afeldman-nm changed the title Remote registry cache for docker build [CI/Build] Remote registry cache for docker build Oct 28, 2025
{% endif %}
--tag {{ docker_image }}
--cache-from=type=registry,ref={{ docker_image_cache }}
--cache-to=type=registry,ref={{ docker_image_cache }},mode=max
A collaborator commented on the diff above:

qq: would this override the existing cache image on the ECR repo every time? does that mean we don't need to care about cleaning up old cache images?

khluu (Collaborator) commented Oct 31, 2025

Thanks for doing this! Can you also post builds on Buildkite that run with this CI branch to verify (guide here: https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo)? I saw some builds from late September, so I guess you've already run new ones now?
