ci: test gpu on self-hosted runners #108

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 6 commits, Aug 20, 2025
Changes from all commits:
39 changes: 31 additions & 8 deletions .github/workflows/ci.yml
@@ -1,8 +1,12 @@
 name: CI

-on: [pull_request, push]
+on:
+  pull_request:
+  push:
+    branches:
+      - master

-# Cancel a job if there's a new on on the same branch started.
+# Cancel a job if there's a new one on the same branch started.
 # Based on https://stackoverflow.com/questions/58895283/stop-already-running-workflow-job-in-github-actions/67223051#67223051
 concurrency:
   group: ${{ github.ref }}
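(The diff view collapses the rest of this hunk. As an aside, the usual shape of such a block pairs the group key with cancel-in-progress; the sketch below shows that general pattern only, not the collapsed lines of this file:)

```yaml
# General pattern only (the actual lines are collapsed in this diff):
# cancel an in-flight run when a new one starts on the same ref.
concurrency:
  group: ${{ github.ref }}
  cancel-in-progress: true
```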
@@ -14,8 +18,7 @@ env:
   # Faster crates.io index checkout.
   CARGO_REGISTRIES_CRATES_IO_PROTOCOL: sparse
   RUST_LOG: debug
-  # Build the kernel only for the single architecture . This should reduce
-  # the overall compile-time significantly.
+  # Build the kernel only for the single architecture. This should reduce the overall compile-time significantly.
   EC_GPU_CUDA_NVCC_ARGS: --fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75
   BELLMAN_CUDA_NVCC_ARGS: --fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75
   NEPTUNE_CUDA_NVCC_ARGS: --fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75
@@ -27,7 +30,9 @@ jobs:
     steps:
       - uses: actions/checkout@v4
       - name: Install required packages
-        run: sudo apt install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
+        run: |
+          sudo apt-get update
+          sudo apt-get install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
       - name: Install cargo clippy
         run: rustup component add clippy
       - name: Run cargo clippy
@@ -44,13 +49,31 @@
         run: cargo fmt --all -- --check

   test:
-    runs-on: ubuntu-24.04
+    runs-on: ['self-hosted', 'linux', 'x64', '2xlarge+gpu']
     name: Test
     steps:
       - uses: actions/checkout@v4
+      # TODO: Move the driver installation to the AMI.
+      # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
+      # https://www.nvidia.com/en-us/drivers/
+      - name: Install CUDA drivers
+        run: |
+          curl -L --fail -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
+          echo "26188e02a028874c653a6072666fd267d597a3fd3db67cdfb66b1398626a512f" nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb | sha256sum --check
+          sudo dpkg -i nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
+          sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.148.08/nvidia-driver-local-*-keyring.gpg /usr/share/keyrings/
+          sudo apt-get update
+          sudo apt-get install --no-install-recommends --yes cuda-drivers
+          rm nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
       - name: Install required packages
-        run: sudo apt install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
+        # In case no GPUs are available, it's using the CPU fallback.
+        run: |
+          sudo apt-get update
+          sudo apt-get install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
+      # TODO: Remove this and other rust installation directives from jobs running
+      # on self-hosted runners once rust is available on these machines by default
+      - uses: dtolnay/rust-toolchain@21dc36fb71dd22e3317045c0c31a3f4249868b17
+        with:
+          toolchain: 1.83
Comment on lines +74 to +76

Member:

this kind of sucks that we can't just use the rust-toolchain file for versioning, but note from https://github.com/dtolnay/rust-toolchain?tab=readme-ov-file#inputs about versioning:

> Rustup toolchain specifier e.g. stable, nightly, 1.42.0, nightly-2022-01-01. Important: the default is to match the @rev as described above. When passing an explicit toolchain as an input instead of @rev, you'll want to use "dtolnay/rust-toolchain@master" as the revision of the action.

i.e. it wants you to use dtolnay/[email protected] instead.

(I also notice other people are annoyed by this gap.)
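For illustration, a sketch of the two forms the quoted readme describes; neither is what this PR merged (the PR pins a full commit sha instead), and the 1.83 value is copied from the diff above:

```yaml
# Illustrative sketch of the readme's two documented forms.

# Form 1: encode the toolchain in the action's rev.
- uses: dtolnay/[email protected]

# Form 2: pass the toolchain as an explicit input; the readme then
# wants @master as the action's revision.
- uses: dtolnay/rust-toolchain@master
  with:
    toolchain: "1.83"
```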

Contributor Author:

I'd stick with using a pinned sha of rust-toolchain despite the suggestion from the action authors, as this is a more secure alternative.

I do like the idea of using the version from the toml config file! I'll check it out and update where applicable if it looks solid.
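For reference, a minimal sketch of that toml approach, assuming a hypothetical rust-toolchain.toml at the repository root; rustup reads this file automatically, so cargo invocations on the runner would pick up the pinned version (the components list is illustrative):

```toml
# Hypothetical rust-toolchain.toml (not part of this PR); rustup reads
# it automatically for any cargo invocation inside the repository.
[toolchain]
channel = "1.83"
components = ["clippy", "rustfmt"]
```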

       - name: Test
         run: cargo test --verbose
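As an aside, not part of this PR: on a GPU runner like this, a hypothetical sanity-check step before the tests could confirm the driver installation using nvidia-smi (which ships with the cuda-drivers package installed above):

```yaml
# Hypothetical extra step (not in this PR): fail fast if the driver
# cannot see a GPU on the self-hosted runner.
- name: Check GPU visibility
  run: nvidia-smi
```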
