-
Notifications
You must be signed in to change notification settings - Fork 44
[WIP] Docker buildkit integration for faster caches #198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fe85e32 to
4a06948
Compare
| - | | ||
| #!/bin/bash | ||
| # Create buildx builder with docker-container driver if it doesn't exist | ||
| if ! docker buildx inspect vllm-builder > /dev/null 2>&1; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in recent versions of docker docker build is an alias to docker buildx build, so these setup steps should not be required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amrmahdi I will say that I was not able to get remote registry caching working with docker build only
| --build-arg USE_SCCACHE=1 | ||
| --build-arg TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0 10.0" | ||
| --build-arg FI_TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0a 10.0a" | ||
| --cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-{% if branch == "main" %}postmerge{% else %}test{% endif %}-repo:buildcache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in a busy CI environment, concurrent --cache-to pushes to the same cache tag can actually reduce cache effectiveness due to race conditions and cache overwrites
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative to consider is using remote caching, buildkit has built in support for github actions, but since this buildkite, and the ci is using amazon infra for container registries, s3 is a good option https://github.com/moby/buildkit?tab=readme-ov-file#s3-cache-experimental
The this would be setup is similar to how github actions cache scopes is structured to maximize cache efficiency https://docs.github.com/en/actions/reference/workflows-and-actions/dependency-caching#restrictions-for-accessing-a-cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
223ce2e to
4a06948
Compare
|
Hi @rzabarazesh JFYI I already have a PR up for this |
| --build-arg USE_SCCACHE=1 | ||
| --build-arg TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0 10.0" | ||
| --build-arg FI_TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0a 10.0a" | ||
| --cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-ci-{% if branch == "main" %}postmerge{% else %}test{% endif %}-repo:buildcache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rzabarazesh , I see that currently you have a separate buildcache which you cache from and cache to, however in your PR description you have an example run where you cache from the :latest image and then you do not have a --cache-to. I am just wondering why you chose to change this? My PR ( #174 ) also has a separate build cache and utilizes --cache-from and --cache-to, however I am also seeing a lot of overhead from --cache-to so I am wondering if your original approach might be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "example ci" build is outdated. I was trying another approach with BUILDKIT_INLINE_CACHE.
Basically this:
docker build --file docker/Dockerfile
--build-arg max_jobs=16
--build-arg buildkite_commit=$BUILDKITE_COMMIT
--build-arg USE_SCCACHE=1
--build-arg BUILDKIT_INLINE_CACHE=1
--build-arg TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0 10.0"
--build-arg FI_TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0a 10.0a"
--cache-from public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:latest
{% if branch != "main" %}
--build-arg VLLM_USE_PRECOMPILED={{ vllm_use_precompiled | default("0") }}
--build-arg VLLM_DOCKER_BUILD_CONTEXT=1{% if vllm_use_precompiled is defined and vllm_use_precompiled == "1" %}
--build-arg USE_FLASHINFER_PREBUILT_WHEEL=true{% endif %}
{% endif %}
--tag {{ docker_image }}
--target test
--progress plain .
- "docker push {{ docker_image }}"
{% if branch == "main" %}
- "docker tag {{ docker_image }} {{ docker_image_latest }}"
- "docker push {{ docker_image_latest }}"
{% endif %}
|
Closing. We will land #174 instead |
Example CI:
https://buildkite.com/vllm/ci/builds/36506/steps/canvas