
Conversation

@dillon-cullinan (Contributor) commented Sep 5, 2025

Overview:

What's Changing?

  • To host public self-hosted GitHub runners, the Dynamo Ops team has deployed a K8s cluster in EKS
  • The cluster uses Actions Runner Controller (ARC) to dynamically provision nodes based on workflow demand
  • Node groups and runners are fully customizable; the codebase is being developed in https://github.com/ai-dynamo/velonix

What to Expect?

  • Queue times for GPU runners are ~2 minutes when a new node needs to be provisioned
    • If a node already exists, queue time should be near zero (nodes stay alive for a ~10-minute grace period after a job finishes)
  • Everything should just work. If something seems broken and it is infra-related, please contact the team on Slack:
    • Long queue times?
    • Hanging jobs?
    • Bad or no network connectivity?
    • Permission issues?
    • Report them to the ops team to be addressed
  • More jobs will be moved to runners hosted on this cluster in the very near future

Direct Code Changes:

  • Updates the runs-on label to match the runner label assigned to the RunnerDeployment
  • Removes some debugging code; some of the tools it relied on are no longer available on the runner images
  • Due to IAM compatibility issues, we are temporarily passing AWS credentials into the Docker builds (as build args) to keep sccache working; see the sketch below
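For illustration, here is a minimal sketch of what these workflow changes amount to; the job, step, secret, and image names are assumptions for the sketch, not the exact contents of container-validation-backends.yml:

    jobs:
      build-test:
        runs-on: gpu-l40-amd64                           # label exposed by the ARC RunnerDeployment
        env:
          CONTAINER_ID: dynamo-ci-${{ github.run_id }}   # assumed naming for the sketch
        steps:
          - uses: actions/checkout@v4
          - name: Build image
            env:
              # Temporary workaround: expose AWS credentials so sccache can reach S3
              # until IAM roles are sorted out on the ARC runners.
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
            run: |
              # --build-arg with no value picks the value up from the environment;
              # the Dockerfiles declare matching ARGs (top-level and per stage).
              docker build \
                --build-arg AWS_ACCESS_KEY_ID \
                --build-arg AWS_SECRET_ACCESS_KEY \
                -f container/Dockerfile -t dynamo:ci .
          - name: Run pytest
            run: |
              docker run --runtime=nvidia --rm --gpus all -w /workspace \
                --network host \
                --name ${{ env.CONTAINER_ID }}_pytest \
                dynamo:ci pytest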

Summary by CodeRabbit

  • Chores
    • Updated CI to run GPU validation on a new runner for improved compatibility and stability.
    • Enabled NVIDIA runtime and host networking during test execution to ensure proper GPU access.
    • Removed verbose GPU debug output from the pipeline for cleaner logs.
    • Added support for passing AWS credentials as build-time arguments in container images to streamline artifact access during builds.
    • Propagated the new build arguments across relevant build stages to maintain consistency.

Signed-off-by: Dillon Cullinan <[email protected]>
github-actions bot added the ci label (Issues/PRs that reference CI build/test) on Sep 5, 2025
coderabbitai bot (Contributor) commented Sep 5, 2025

Walkthrough

CI workflow updated to use gpu-l40-amd64, remove a debug step, pass AWS credentials to the Docker build, and run tests with NVIDIA runtime and host networking. Dockerfiles now declare AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY build args across stages without additional usage changes.

Changes

CI workflow updates (.github/workflows/container-validation-backends.yml):
Switched runs-on to gpu-l40-amd64; removed the Debug step; added AWS_ACCESS_KEY_ID/SECRET to the Build image env; updated the pytest docker run with --runtime=nvidia and --network host.

Docker build args, AWS creds (container/Dockerfile, container/Dockerfile.vllm):
Added AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as ARGs (top-level and re-declared in relevant stages); no ENV or functional usage changes shown.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Dev as GitHub Actions
    participant R as gpu-l40-amd64 Runner
    participant D as Docker
    participant C as Container (pytest)

    Dev->>R: Trigger build-test job
    R->>D: docker build (ARG AWS_ACCESS_KEY_ID/SECRET)
    Note right of D: Build args passed into build stages
    R->>D: docker run --runtime=nvidia --network host
    D->>C: Start container
    C-->>R: Execute pytest and return results

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Poem

A rabbit taps keys with a gentle cheer,
New runners hum, the GPUs near;
Credentials whisper to the build’s bright flame,
Containers sprint in a host-net game.
With vLLM stew and pytest lights,
Hop-hop—green checks through the nights. 🐇✨




coderabbitai bot (Contributor) left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
.github/workflows/container-validation-backends.yml (1)

64-66: GPU container runtime flags: consider adding shared memory and deterministic networking.

  • Add --ipc=host to avoid CUDA/PyTorch OOMs due to small /dev/shm.
  • Keep --network host if ARC runner permits it; otherwise fall back to bridge plus explicit ports.
-          docker run --runtime=nvidia --rm --gpus all -w /workspace \
-            --network host \
+          docker run --runtime=nvidia --rm --gpus all -w /workspace \
+            --ipc=host \
+            --network host \
             --name ${{ env.CONTAINER_ID }}_pytest \

Confirm ARC runner pods allow Docker --network host and --ipc=host; if not, we’ll adjust to K8s-native services soon.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between abf4caf and ccdcbb4.

📒 Files selected for processing (3)
  • .github/workflows/container-validation-backends.yml (3 hunks)
  • container/Dockerfile (3 hunks)
  • container/Dockerfile.vllm (2 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/container-validation-backends.yml

14-14: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/container-validation-backends.yml (1)

14-14: Add canonical self-hosted runner labels to satisfy actionlint
Replace the single label with an array including self-hosted, linux, x64, and your specific runner label:

-    runs-on: gpu-l40-amd64
+    runs-on:
+      - self-hosted
+      - linux
+      - x64
+      - gpu-l40-amd64

Verify that gpu-l40-amd64 exactly matches the labels used by ARC’s RunnerDeployment and adjust spelling if necessary.
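Relatedly, the actionlint warning above can also be silenced by declaring the custom label in the linter config. A minimal sketch, assuming the repo keeps an actionlint.yaml (the .github/actionlint.yaml path is the tool's default location, not something confirmed in this PR):

    # .github/actionlint.yaml
    self-hosted-runner:
      # custom labels exposed by the ARC RunnerDeployment(s)
      labels:
        - gpu-l40-amd64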

Signed-off-by: Dillon Cullinan <[email protected]>
@alec-flowers (Contributor) left a comment


Not an expert on the specifics of .github here, but this is a much-needed improvement.

Do we have 2-GPU nodes for 2-GPU tests? If a test requires 2 GPUs or specific GPUs (say, H100s), how are these requirements passed to the cluster?

My expectation is that, as we move to managed K8s, we should be able to do better things with the Docker cache and decrease build times even more. Is that true?

dillon-cullinan merged commit 41dacce into main on Sep 8, 2025 (12 of 13 checks passed)
dillon-cullinan deleted the feat/ops-724-use-arc-runners2 branch on September 8, 2025 at 20:19
@dillon-cullinan (Contributor, Author) commented Sep 8, 2025

Do we have 2-GPU nodes for 2-GPU tests? If a test requires 2 GPUs or specific GPUs (say, H100s), how are these requirements passed to the cluster?

Currently, our only node group is for single-GPU instance types. This is easily expandable, though.

This is still in a PR for now, working on tidying things up... but here is the node group mapping: https://github.com/ai-dynamo/velonix/blob/a082e71d337223f2866ec468f311aa37e8a33a1d/terraform/aws/arc-eks/staging.values.tfvars#L145-L176

We can very easily add node groups in the terraform and apply them. After that we can add a RunnerDeployment to K8s which requests the required number of GPUs: https://github.com/ai-dynamo/velonix/blob/97c2f9ad877aa32c957a5fac5f632cd899af03a3/kubernetes/eks-dynamo-ci/apps/staging/github-runners/gpu-l40-amd64-g6-2xlarge.yaml#L33

I've made sure to make it very easy to add runners and instance types to the cluster for all of our CI/CD.
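As a rough illustration of the RunnerDeployment linked above, a hypothetical multi-GPU pool might look like the sketch below (summerwind ARC API; the pool name, labels, node group, organization, and GPU count are made up for the example, not what velonix ships):

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
      name: gpu-2x-amd64                          # hypothetical two-GPU runner pool
    spec:
      template:
        spec:
          organization: ai-dynamo                 # assumed org-level runners
          labels:
            - gpu-2x-amd64                        # the value jobs would put in runs-on
          nodeSelector:
            eks.amazonaws.com/nodegroup: gpu-2x   # assumed Terraform-managed node group
          resources:
            limits:
              nvidia.com/gpu: 2                   # number of GPUs each runner pod requests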

My expectation is that, as we move to managed K8s, we should be able to do better things with the Docker cache and decrease build times even more. Is that true?

It can; it mostly depends on the use case. We could have a read-only PersistentVolume mounted that contains our commonly used public images, which could theoretically reduce pull times significantly. Or we could use a pull-through cache in ECR for a more local connection, which probably also gives better speeds.

In terms of build caching, it's a bit more challenging, as the nodes are ephemeral: once a node is de-provisioned, that host's cache is gone. Right now we are planning on separating the builds from Docker, an alternative approach where we rely on build artifacts instead.
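For completeness: one common way to keep a Docker layer cache useful across ephemeral nodes is to back it with a registry such as ECR via buildx. A hedged sketch of the idea, not something this PR sets up or that the team has committed to:

    - name: Build with a registry-backed layer cache
      run: |
        # <ecr-registry> is a placeholder, e.g. <account>.dkr.ecr.<region>.amazonaws.com
        docker buildx build \
          --cache-from type=registry,ref=<ecr-registry>/dynamo-ci:buildcache \
          --cache-to type=registry,ref=<ecr-registry>/dynamo-ci:buildcache,mode=max \
          -f container/Dockerfile -t dynamo:ci .

The cache then survives node de-provisioning, at the cost of pushing and pulling cache layers over the network.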

indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Dillon Cullinan <[email protected]>
Signed-off-by: Indrajit Bhosale <[email protected]>
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
tedzhouhk pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Dillon Cullinan <[email protected]>
Signed-off-by: hongkuanz <[email protected]>

Labels: ci (Issues/PRs that reference CI build/test), size/S
