ci: OPS-724: Move to ARC runners #2904
Conversation
Signed-off-by: Dillon Cullinan <[email protected]>
Walkthrough
The CI workflow is updated to use the gpu-l40-amd64 runner, remove a debug step, pass AWS credentials to the Docker build, and run tests with the NVIDIA runtime and host networking. The Dockerfiles now declare AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY build args across stages, with no additional usage changes.
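To make the test-step change concrete, here is a minimal sketch of what the updated run step could look like. The step name and the dynamo:ci image tag are assumptions for illustration; only the flags (--runtime=nvidia, --network host) and the ${{ env.CONTAINER_ID }} naming come from this review:

```yaml
# Hypothetical workflow step; image tag "dynamo:ci" is invented for illustration.
- name: Run pytest in the built container
  run: |
    docker run --runtime=nvidia --rm --gpus all -w /workspace \
      --network host \
      --name ${{ env.CONTAINER_ID }}_pytest \
      dynamo:ci pytest
```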
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as GitHub Actions
    participant R as gpu-l40-amd64 Runner
    participant D as Docker
    participant C as Container (pytest)
    Dev->>R: Trigger build-test job
    R->>D: docker build (ARG AWS_ACCESS_KEY_ID/SECRET)
    Note right of D: Build args passed into build stages
    R->>D: docker run --runtime=nvidia --network host
    D->>C: Start container
    C-->>R: Execute pytest and return results
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 6
🧹 Nitpick comments (1)
.github/workflows/container-validation-backends.yml (1)

Lines 64-66: GPU container runtime flags: consider adding shared memory and deterministic networking.

- Add `--ipc=host` to avoid CUDA/PyTorch OOMs caused by a small /dev/shm.
- Keep `--network host` if the ARC runner permits it; otherwise fall back to bridge networking plus explicit ports.

```diff
-          docker run --runtime=nvidia --rm --gpus all -w /workspace \
-            --network host \
+          docker run --runtime=nvidia --rm --gpus all -w /workspace \
+            --ipc=host \
+            --network host \
             --name ${{ env.CONTAINER_ID }}_pytest \
```

Confirm ARC runner pods allow Docker `--network host` and `--ipc=host`; if not, we’ll adjust to K8s-native services soon.
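One lightweight way to confirm whether the ARC runner pods actually permit these flags is a throwaway diagnostic step. This is a sketch, not part of the PR; the busybox image choice is arbitrary:

```yaml
- name: Check host network and shared memory   # hypothetical diagnostic step
  run: |
    docker run --rm --network host --ipc=host busybox \
      sh -c 'df -h /dev/shm && ip addr show'
```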
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/container-validation-backends.yml (3 hunks)
- container/Dockerfile (3 hunks)
- container/Dockerfile.vllm (2 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/container-validation-backends.yml
14-14: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
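If gpu-l40-amd64 is an intentional ARC label, this actionlint warning can be resolved by declaring the label in an actionlint.yaml at the repository root. A minimal sketch, with the label list assumed from this PR:

```yaml
# actionlint.yaml — register custom self-hosted runner labels
self-hosted-runner:
  labels:
    - gpu-l40-amd64
```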
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/container-validation-backends.yml (1)

Line 14: Add canonical self-hosted runner labels to satisfy actionlint.
Replace the single label with an array including `self-hosted`, `linux`, `x64`, and your specific runner label:

```diff
-    runs-on: gpu-l40-amd64
+    runs-on:
+      - self-hosted
+      - linux
+      - x64
+      - gpu-l40-amd64
```

Verify that `gpu-l40-amd64` exactly matches the labels used by ARC’s RunnerDeployment and adjust spelling if necessary.
Signed-off-by: Dillon Cullinan <[email protected]>
Not an expert on the specifics of .github here, but this is a much-needed improvement.
Do we have 2-GPU nodes for 2-GPU tests? If a test requires two GPUs or specific GPUs (say, H100s), how are those requirements passed to the cluster?
My expectation is that as we move to managed K8s, we should be able to do better things with the Docker cache and decrease build times even further. Is that true?
Currently, our only node group is for single-GPU instance types, though this is very easily expandable. It's still in a PR for now while I tidy things up, but here is the node group mapping: https://github.com/ai-dynamo/velonix/blob/a082e71d337223f2866ec468f311aa37e8a33a1d/terraform/aws/arc-eks/staging.values.tfvars#L145-L176 We can very easily add node groups in the Terraform and apply them. After that we can add a … I've made sure to make it very easy to add runners and instance types to the cluster for all of our CI/CD.
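For readers unfamiliar with ARC, the runner label referenced by `runs-on` is typically declared on a RunnerDeployment that is pinned to a node group. This is a hypothetical sketch; the metadata name, organization, and instance type are assumptions, not taken from the linked velonix repo:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-l40-amd64-runners        # hypothetical name
spec:
  template:
    spec:
      organization: ai-dynamo        # assumed org
      labels:
        - gpu-l40-amd64              # label matched by runs-on
      nodeSelector:                  # pin runner pods to the GPU node group
        node.kubernetes.io/instance-type: g6e.xlarge   # assumed instance type
```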
It can; it mostly depends on the use case. We can have a read-only mounted … In terms of caching, it's a bit more challenging, as all the nodes are ephemeral: once a node is de-provisioned, that host cache is gone. Right now we are planning on separating the builds from Docker, an alternative approach where we just rely on build artifacts.
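As an illustration of the read-only mount idea, a test step could attach a shared cache to the container without allowing writes. The host path, target path, and image tag here are invented:

```yaml
- name: Run pytest with a read-only cache mount   # hypothetical step
  run: |
    docker run --runtime=nvidia --rm --gpus all \
      -v /opt/shared-cache:/root/.cache:ro \
      --name ${{ env.CONTAINER_ID }}_pytest \
      dynamo:ci pytest
```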
Signed-off-by: Dillon Cullinan <[email protected]> Signed-off-by: Indrajit Bhosale <[email protected]>
Signed-off-by: Dillon Cullinan <[email protected]>
Signed-off-by: Dillon Cullinan <[email protected]> Signed-off-by: hongkuanz <[email protected]>
Overview:

What's Changing?

What to Expect?

Direct Code Changes:
- Update the `runs-on` label to match the runner label given to the `RunnerDeployment`
- Get `sccache` working
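Given that the Dockerfiles now declare AWS credential build args, one plausible wiring for "get sccache working" is an S3-backed cache fed through those args. A sketch under assumptions: the bucket, region, and SCCACHE_BUCKET build arg are invented, not confirmed by this PR:

```yaml
- name: Build with S3-backed sccache           # hypothetical step
  env:
    SCCACHE_BUCKET: dynamo-ci-sccache          # assumed bucket name
    SCCACHE_REGION: us-east-1                  # assumed region
  run: |
    docker build \
      --build-arg AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} \
      --build-arg AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} \
      --build-arg SCCACHE_BUCKET="$SCCACHE_BUCKET" \
      -t dynamo:ci -f container/Dockerfile .
```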