Conversation

@rzabarazesh (Collaborator) commented Oct 28, 2025

@rzabarazesh force-pushed the rzabarazesh/docker-buildkit branch from fe85e32 to 4a06948 on October 28, 2025 01:22
- |
#!/bin/bash
# Create buildx builder with docker-container driver if it doesn't exist
if ! docker buildx inspect vllm-builder > /dev/null 2>&1; then


In recent versions of Docker, docker build is an alias for docker buildx build, so these setup steps should not be required.

https://docs.docker.com/reference/cli/docker/buildx/build/
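
For reference, a minimal sketch of the setup step under discussion. The excerpt above is truncated, so the body of the if is an illustrative guess and may not match the PR exactly; the builder name vllm-builder comes from the excerpt, and a recent Docker with the buildx plugin is assumed:

    #!/bin/bash
    # Illustrative sketch; the actual PR body may differ.
    # Create a buildx builder backed by the docker-container driver only if it
    # does not already exist, then select it for subsequent builds.
    if ! docker buildx inspect vllm-builder > /dev/null 2>&1; then
      docker buildx create --name vllm-builder --driver docker-container --use
    else
      docker buildx use vllm-builder
    fi

Even where docker build resolves to docker buildx build, a docker-container builder is typically still what enables the registry/S3 cache exporters discussed later in this thread.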

Collaborator

@amrmahdi I will say that I was not able to get remote registry caching working with docker build only
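
For illustration (a hedged sketch, not from the thread; all image and cache refs are placeholders): the default docker driver typically does not support --cache-to type=registry, which may be why remote registry caching only worked with a dedicated builder.

    # Remote registry caching via an explicit docker-container builder.
    # Placeholder refs; adjust to the real registry.
    docker buildx build \
      --builder vllm-builder \
      --cache-from type=registry,ref=example.registry/vllm:buildcache \
      --cache-to type=registry,ref=example.registry/vllm:buildcache,mode=max \
      --tag example.registry/vllm:test \
      --file docker/Dockerfile .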

--build-arg USE_SCCACHE=1
--build-arg TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0 10.0"
--build-arg FI_TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0a 10.0a"
--cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-{% if branch == "main" %}postmerge{% else %}test{% endif %}-repo:buildcache


in a busy CI environment, concurrent --cache-to pushes to the same cache tag can actually reduce cache effectiveness due to race conditions and cache overwrites
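
One hedged way to mitigate this (a sketch, not from the thread): scope the cache ref per branch so concurrent builds do not clobber a shared tag. The repository ref is a placeholder, BUILDKITE_BRANCH is assumed to be set by the CI, and branch names are sanitized into valid tag characters.

    # Read the branch-scoped cache first, fall back to the main cache,
    # and write only to the branch-scoped ref. CACHE_REPO is a placeholder.
    CACHE_REPO=example.registry/vllm-buildcache
    SCOPE=$(echo "${BUILDKITE_BRANCH:-main}" | tr '/' '-')
    docker buildx build \
      --cache-from type=registry,ref=${CACHE_REPO}:${SCOPE} \
      --cache-from type=registry,ref=${CACHE_REPO}:main \
      --cache-to type=registry,ref=${CACHE_REPO}:${SCOPE},mode=max \
      --tag example.registry/vllm:test \
      --file docker/Dockerfile .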


Another alternative to consider is remote caching. BuildKit has built-in support for GitHub Actions cache, but since this is Buildkite and the CI uses Amazon infrastructure for container registries, S3 is a good option: https://github.com/moby/buildkit?tab=readme-ov-file#s3-cache-experimental

The way this would be set up is similar to how GitHub Actions cache scopes are structured to maximize cache efficiency: https://docs.github.com/en/actions/reference/workflows-and-actions/dependency-caching#restrictions-for-accessing-a-cache
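
A hedged sketch of the S3 backend being suggested (marked experimental by BuildKit; bucket, region, and cache name are placeholders, and AWS credentials are assumed to come from the environment or instance profile):

    # S3 cache backend: cache blobs and manifests go to the bucket instead of
    # a container registry tag. Bucket/region/name values are placeholders.
    docker buildx build \
      --cache-from type=s3,region=us-west-2,bucket=vllm-ci-buildcache,name=vllm \
      --cache-to type=s3,region=us-west-2,bucket=vllm-ci-buildcache,name=vllm,mode=max \
      --tag example.registry/vllm:test \
      --file docker/Dockerfile .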

Collaborator

@amrmahdi is this PR (below) doing remote caching in the way that you mean?

#174

@rzabarazesh force-pushed the rzabarazesh/docker-buildkit branch from 223ce2e to 4a06948 on October 28, 2025 19:21
@afeldman-nm (Collaborator)

Hi @rzabarazesh, just FYI, I already have a PR up for this:

#174

--build-arg USE_SCCACHE=1
--build-arg TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0 10.0"
--build-arg FI_TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0a 10.0a"
--cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-{% if branch == "main" %}postmerge{% else %}test{% endif %}-repo:buildcache
Collaborator

Hi @rzabarazesh, I see that you currently have a separate buildcache that you both cache from and cache to. However, in your PR description you show an example run that caches from the :latest image and has no --cache-to. I am just wondering why you chose to change this? My PR ( #174 ) also has a separate build cache and uses --cache-from and --cache-to, but I am seeing a lot of overhead from --cache-to, so I am wondering if your original approach might be better.

Collaborator Author

The "example ci" build is outdated. I was trying another approach with BUILDKIT_INLINE_CACHE.
Basically this:

        docker build --file docker/Dockerfile
        --build-arg max_jobs=16
        --build-arg buildkite_commit=$BUILDKITE_COMMIT
        --build-arg USE_SCCACHE=1
        --build-arg BUILDKIT_INLINE_CACHE=1
        --build-arg TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0 10.0"
        --build-arg FI_TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0a 10.0a"
        --cache-from public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:latest
        {% if branch != "main" %}
        --build-arg VLLM_USE_PRECOMPILED={{ vllm_use_precompiled | default("0") }}
        --build-arg VLLM_DOCKER_BUILD_CONTEXT=1{% if vllm_use_precompiled is defined and vllm_use_precompiled == "1" %}
        --build-arg USE_FLASHINFER_PREBUILT_WHEEL=true{% endif %}
        {% endif %}
        --tag {{ docker_image }}
        --target test
        --progress plain .
      - "docker push {{ docker_image }}"
      {% if branch == "main" %}
      - "docker tag {{ docker_image }} {{ docker_image_latest }}"
      - "docker push {{ docker_image_latest }}"
      {% endif %}
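
For context on the tradeoff discussed in this thread (a hedged aside, not from the original comments): BUILDKIT_INLINE_CACHE=1 embeds min-mode cache metadata directly in the pushed image, so there is no separate cache artifact to push, whereas --cache-to type=registry defaults to mode=min and only becomes heavier with mode=max, which also exports intermediate stages. A sketch of the registry variant with placeholder refs:

    # mode=min (default): smaller, faster cache push; hits mostly on the final
    # stage. mode=max: larger push, better reuse across intermediate stages.
    docker buildx build \
      --cache-from type=registry,ref=example.registry/vllm:buildcache \
      --cache-to type=registry,ref=example.registry/vllm:buildcache,mode=max \
      --tag example.registry/vllm:test \
      --file docker/Dockerfile .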

@rzabarazesh (Collaborator Author)

Closing. We will land #174 instead

@rzabarazesh closed this Nov 4, 2025
