
Conversation

@dillon-cullinan (Contributor) commented Sep 5, 2025

Overview:

What's Changing?

  • To host public self-hosted GitHub runners, the Dynamo Ops team has deployed a K8s cluster in EKS
  • The cluster uses Actions Runner Controller (ARC) to dynamically provision nodes based on workflow demand
  • Node groups and runners are fully customizable; the codebase is being developed in https://github.com/ai-dynamo/velonix

What to Expect?

  • Queue times for GPU runners are ~2 minutes when a new node needs to be provisioned
    • If a node already exists, queue time should be near zero (nodes stay alive for a ~10-minute grace period after a job finishes)
  • Everything should just work. If something seems broken and it is infra-related, please contact the team on Slack:
    • Long queue times?
    • Hanging jobs?
    • Bad or no network connectivity?
    • Permission issues?
    • Report them to the ops team to be addressed
  • More jobs will be moved to runners hosted on this cluster in the very near future

Direct Code Changes:

  • Updates the runs-on label to match the runner label assigned to the RunnerDeployment
  • Removes some debugging code; some of the tools it relied on are no longer available on the runner images
  • Due to IAM compatibility issues, we are temporarily passing AWS credentials into the Docker builds (as build args) to keep sccache working; see the sketch below
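For illustration, here is a minimal sketch of what these workflow changes amount to; the job, step, secret, and image names are assumptions for the sketch, not the exact contents of container-validation-backends.yml:

    jobs:
      build-test:
        runs-on: gpu-l40-amd64                           # label exposed by the ARC RunnerDeployment
        env:
          CONTAINER_ID: dynamo-ci-${{ github.run_id }}   # assumed naming for the sketch
        steps:
          - uses: actions/checkout@v4
          - name: Build image
            env:
              # Temporary workaround: expose AWS credentials so sccache can reach S3
              # until IAM roles are sorted out on the ARC runners.
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
            run: |
              # --build-arg with no value picks the value up from the environment;
              # the Dockerfiles declare matching ARGs (top-level and per stage).
              docker build \
                --build-arg AWS_ACCESS_KEY_ID \
                --build-arg AWS_SECRET_ACCESS_KEY \
                -f container/Dockerfile -t dynamo:ci .
          - name: Run pytest
            run: |
              docker run --runtime=nvidia --rm --gpus all -w /workspace \
                --network host \
                --name ${{ env.CONTAINER_ID }}_pytest \
                dynamo:ci pytest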

Summary by CodeRabbit

  • Chores
    • Updated CI to run GPU validation on a new runner for improved compatibility and stability.
    • Enabled NVIDIA runtime and host networking during test execution to ensure proper GPU access.
    • Removed verbose GPU debug output from the pipeline for cleaner logs.
    • Added support for passing AWS credentials as build-time arguments in container images to streamline artifact access during builds.
    • Propagated the new build arguments across relevant build stages to maintain consistency.

Signed-off-by: Dillon Cullinan <[email protected]>
github-actions bot added the ci label (Issues/PRs that reference CI build/test) on Sep 5, 2025
coderabbitai bot (Contributor) commented Sep 5, 2025

Walkthrough

CI workflow updated to use gpu-l40-amd64, remove a debug step, pass AWS credentials to the Docker build, and run tests with NVIDIA runtime and host networking. Dockerfiles now declare AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY build args across stages without additional usage changes.

Changes

CI workflow updates (.github/workflows/container-validation-backends.yml):
Switched runs-on to gpu-l40-amd64; removed the Debug step; added AWS_ACCESS_KEY_ID/SECRET to the Build image env; updated the pytest docker run with --runtime=nvidia and --network host.

Docker build args, AWS creds (container/Dockerfile, container/Dockerfile.vllm):
Added AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as ARGs (top-level and re-declared in relevant stages); no ENV or functional usage changes shown.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Dev as GitHub Actions
    participant R as gpu-l40-amd64 Runner
    participant D as Docker
    participant C as Container (pytest)

    Dev->>R: Trigger build-test job
    R->>D: docker build (ARG AWS_ACCESS_KEY_ID/SECRET)
    Note right of D: Build args passed into build stages
    R->>D: docker run --runtime=nvidia --network host
    D->>C: Start container
    C-->>R: Execute pytest and return results

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Poem

A rabbit taps keys with a gentle cheer,
New runners hum, the GPUs near;
Credentials whisper to the build’s bright flame,
Containers sprint in a host-net game.
With vLLM stew and pytest lights,
Hop-hop—green checks through the nights. 🐇✨




coderabbitai bot (Contributor) left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
.github/workflows/container-validation-backends.yml (1)

64-66: GPU container runtime flags: consider adding shared memory and deterministic networking.

  • Add --ipc=host to avoid CUDA/PyTorch OOMs due to small /dev/shm.
  • Keep --network host if ARC runner permits it; otherwise fall back to bridge plus explicit ports.
-          docker run --runtime=nvidia --rm --gpus all -w /workspace \
-            --network host \
+          docker run --runtime=nvidia --rm --gpus all -w /workspace \
+            --ipc=host \
+            --network host \
             --name ${{ env.CONTAINER_ID }}_pytest \

Confirm ARC runner pods allow Docker --network host and --ipc=host; if not, we’ll adjust to K8s-native services soon.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between abf4caf and ccdcbb4.

📒 Files selected for processing (3)
  • .github/workflows/container-validation-backends.yml (3 hunks)
  • container/Dockerfile (3 hunks)
  • container/Dockerfile.vllm (2 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/container-validation-backends.yml

14-14: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/container-validation-backends.yml (1)

14-14: Add canonical self-hosted runner labels to satisfy actionlint
Replace the single label with an array including self-hosted, linux, x64, and your specific runner label:

-    runs-on: gpu-l40-amd64
+    runs-on:
+      - self-hosted
+      - linux
+      - x64
+      - gpu-l40-amd64

Verify that gpu-l40-amd64 exactly matches the labels used by ARC’s RunnerDeployment and adjust spelling if necessary.
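Relatedly, the actionlint warning above can also be silenced by declaring the custom label in the linter config. A minimal sketch, assuming the repo keeps an actionlint.yaml (the .github/actionlint.yaml path is the tool's default location, not something confirmed in this PR):

    # .github/actionlint.yaml
    self-hosted-runner:
      # custom labels exposed by the ARC RunnerDeployment(s)
      labels:
        - gpu-l40-amd64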

Signed-off-by: Dillon Cullinan <[email protected]>
@alec-flowers (Contributor) left a comment


Not an expert on the specifics of .github here, but this is a much-needed improvement.

Do we have 2-GPU nodes for 2-GPU tests? If a test requires 2 GPUs or specific GPUs (say, H100s), how are these requirements passed to the cluster?

My expectation is that, as we move to managed K8s, we should be able to do better things with the Docker cache and decrease build times even more. Is that true?

dillon-cullinan merged commit 41dacce into main on Sep 8, 2025 (12 of 13 checks passed)
dillon-cullinan deleted the feat/ops-724-use-arc-runners2 branch on September 8, 2025 at 20:19
@dillon-cullinan (Contributor, Author) commented Sep 8, 2025

Do we have 2-GPU nodes for 2-GPU tests? If a test requires 2 GPUs or specific GPUs (say, H100s), how are these requirements passed to the cluster?

Currently, our only node group is for single-GPU instance types. This is easily expandable, though.

This is still in a PR for now, working on tidying things up... but here is the node group mapping: https://github.com/ai-dynamo/velonix/blob/a082e71d337223f2866ec468f311aa37e8a33a1d/terraform/aws/arc-eks/staging.values.tfvars#L145-L176

We can very easily add node groups in the terraform and apply them. After that we can add a RunnerDeployment to K8s which requests the required number of GPUs: https://github.com/ai-dynamo/velonix/blob/97c2f9ad877aa32c957a5fac5f632cd899af03a3/kubernetes/eks-dynamo-ci/apps/staging/github-runners/gpu-l40-amd64-g6-2xlarge.yaml#L33

I've made sure to make it very easy to add runners and instance types to the cluster for all of our CI/CD.
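As a rough illustration of the RunnerDeployment linked above, a hypothetical multi-GPU pool might look like the sketch below (summerwind ARC API; the pool name, labels, node group, organization, and GPU count are made up for the example, not what velonix ships):

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
      name: gpu-2x-amd64                          # hypothetical two-GPU runner pool
    spec:
      template:
        spec:
          organization: ai-dynamo                 # assumed org-level runners
          labels:
            - gpu-2x-amd64                        # the value jobs would put in runs-on
          nodeSelector:
            eks.amazonaws.com/nodegroup: gpu-2x   # assumed Terraform-managed node group
          resources:
            limits:
              nvidia.com/gpu: 2                   # number of GPUs each runner pod requests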

My expectation is that, as we move to managed K8s, we should be able to do better things with the Docker cache and decrease build times even more. Is that true?

It can; it mostly depends on the use case. We could have a read-only PersistentVolume mounted that contains our commonly used public images, which could theoretically reduce pull times significantly. Or we could use a pull-through cache in ECR for a more local connection, which probably also gives better speeds.

In terms of build caching, it's a bit more challenging, as the nodes are ephemeral: once a node is de-provisioned, that host's cache is gone. Right now we are planning on separating the builds from Docker, an alternative approach where we rely on build artifacts instead.
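For completeness: one common way to keep a Docker layer cache useful across ephemeral nodes is to back it with a registry such as ECR via buildx. A hedged sketch of the idea, not something this PR sets up or that the team has committed to:

    - name: Build with a registry-backed layer cache
      run: |
        # <ecr-registry> is a placeholder, e.g. <account>.dkr.ecr.<region>.amazonaws.com
        docker buildx build \
          --cache-from type=registry,ref=<ecr-registry>/dynamo-ci:buildcache \
          --cache-to type=registry,ref=<ecr-registry>/dynamo-ci:buildcache,mode=max \
          -f container/Dockerfile -t dynamo:ci .

The cache then survives node de-provisioning, at the cost of pushing and pulling cache layers over the network.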

indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
Signed-off-by: Dillon Cullinan <[email protected]>
Signed-off-by: Indrajit Bhosale <[email protected]>
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025
tedzhouhk pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Dillon Cullinan <[email protected]>
Signed-off-by: hongkuanz <[email protected]>

Labels: ci (Issues/PRs that reference CI build/test), size/S
