Skip to content

Conversation

nv-anants
Copy link
Contributor

@nv-anants nv-anants commented Sep 19, 2025

Overview:

Update GitHub workflow for backend validation to run based on source code changes, along with reusable composite actions and centralized path filtering.

Right now we are skipping all backend checks for doc only and other top level config file changes
TODO: properly define filters for each backend and use that to trigger relevant workflow
Files changes that will trigger all backend-checks

ci: &ci
  - '.github/workflows/**'
  - '.github/filters.yaml'
  - '.github/actions/**'

has_code_changes:
  - *ci
  - 'benchmarks/**'
  - 'components/**'
  - 'container/**'
  - 'deploy/**'
  - 'examples/**'
  - 'launch/**'
  - 'lib/**'
  - 'recipes/**'
  - 'tests/**'
  - '*.toml'
  - '*.lock'
  - '*.py'
  - '*.rs'

Tested skip flow here - https://github.com/ai-dynamo/dynamo/actions/runs/17862125429

Post merge, backend-status-check will be a required check on github PR, instead of vllm/framework specific check

Key Changes

  • Update .github/workflows/container-validation-backends.yml - conditional backend testing
  • Reusable Docker build and pytest composite actions
  • github/filters.yaml for path-based conditional execution
  • Update trigger_ci.yml to use centralized filters
  • Only runs when non-docs changes detected

closes: OPS-862

Summary by CodeRabbit

  • New Features

    • Added reusable Docker image build action for backend containers.
    • Added reusable Pytest action to run GPU tests in containers with JUnit reports.
  • Chores

    • Revamped CI into a “Docker Build and Test” workflow with backend-specific jobs (vLLM, SGLang, TRT-LLM).
    • Introduced change detection to run only relevant build/test jobs.
    • Added concurrency controls to cancel redundant runs.
    • Centralized path filters for CI triggers to a single config.
    • Added a status check job to summarize backend results.

Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
@github-actions github-actions bot added the ci Issues/PRs that reference CI build/test label Sep 19, 2025
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
@nv-anants nv-anants changed the title ci: conditional backend github workflows ci: conditional backend github workflow Sep 19, 2025
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
@nv-anants nv-anants marked this pull request as ready for review September 23, 2025 16:42
@nv-anants nv-anants requested a review from a team as a code owner September 23, 2025 16:42
Copy link
Contributor

coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds two composite GitHub Actions (Docker build and Pytest), introduces centralized path filters, and refactors the container validation workflow to a multi-stage, change-detected execution with per-backend jobs (vllm, sglang, trtllm), concurrency control, and a status aggregation job. Updates trigger_ci to use the centralized filters.

Changes

Cohort / File(s) Summary
Composite actions: Docker build
.github/actions/docker-build-composite-action/action.yml
New composite action to build images via Buildx; optional nvcr.io login; computes/export IMAGE_TAG; invokes ./container/build.sh with framework/target and SCCache/AWS/CI tokens; exposes image_tag output.
Composite actions: Pytest
.github/actions/pytest-composite-action/action.yml
New composite action to run pytest inside a provided image with NVIDIA runtime and host networking; supports marks input; emits JUnit XML.
Workflow refactor: Container validation
.github/workflows/container-validation-backends.yml
Replaces matrix with change-gated, per-backend jobs (vllm/sglang/trtllm); adds changed-files detection using filters, top-level concurrency, and a backend status check aggregator; integrates new composite actions.
Centralized path filters
.github/filters.yaml
New filters file defining code/doc/CI and backend-specific path globs (vllm/sglang/trtllm/sdk) for change detection.
Trigger workflow update
.github/workflows/trigger_ci.yml
Switches to use .github/filters.yaml for source-change detection.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant GH as GitHub Workflow
  participant CF as changed-files (paths-filter)
  participant V as job: vllm
  participant S as job: sglang
  participant T as job: trtllm
  participant BSC as backend-status-check

  GH->>CF: Detect code changes via .github/filters.yaml
  CF-->>GH: outputs.has_code_changes (true/false)

  alt has_code_changes == true
    par Backend jobs (conditional)
      GH->>V: Build & Test (uses composite actions)
      GH->>S: Build & Test (uses composite actions)
      GH->>T: Build & Test (uses composite actions)
    end
  else No code changes
    Note over GH: Skip backend jobs
  end

  GH->>BSC: Summarize results (always runs)
  BSC-->>GH: Final status
Loading
sequenceDiagram
  autonumber
  participant WB as Workflow Job
  participant DBA as docker-build-composite-action
  participant BX as Docker Buildx
  participant NGC as nvcr.io Registry
  participant BS as ./container/build.sh

  WB->>DBA: inputs (framework, target, image_tag, tokens, AWS/SCCache)
  DBA->>BX: Setup Buildx
  opt NGC login (conditional on event/repo)
    DBA->>NGC: docker login with ngc_ci_access_token
    NGC-->>DBA: login success/fail
  end
  DBA->>DBA: Cleanup (pre)
  DBA->>DBA: Compute IMAGE_TAG (custom or framework:latest)
  DBA->>BS: Build with tag/target/framework, env (CI/AWS/SCCache)
  BS-->>DBA: Build complete
  DBA-->>WB: outputs.image_tag
  DBA->>DBA: Cleanup (post)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

Hop hop, CI’s in flight,
Composite gears click through the night.
Filters whisper, “only when changed,”
Backends line up, neatly arranged.
Docker hums, pytest sings—
Carrots for green checkmarks and things.
Thump-thump: my paws approve these wings!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "ci: conditional backend github workflow" is concise, specific, and accurately summarizes the primary change of making backend CI workflows conditional, so it reflects the main intent of the changeset.
Description Check ✅ Passed The PR description follows the repository template by providing an Overview and Details, enumerating key changes, including a tested skip flow link, and referencing the related issue (closes OPS-862), so it satisfies most required sections; however it is missing the explicit "Where should the reviewer start?" heading and specific file-level callouts for reviewers requested by the template.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (10)
.github/actions/docker-build-composite-action/action.yml (2)

60-60: Fix YAML lint: remove extra space after colon.

Conforms to yamllint and avoids noisy warnings.

-        SCCACHE_S3_BUCKET:  ${{ inputs.sccache_s3_bucket }}
+        SCCACHE_S3_BUCKET: ${{ inputs.sccache_s3_bucket }}

49-54: Avoid aggressive docker prune before build; move a safer prune after build.

docker system prune -af on shared/self‑hosted runners can nuke caches/images used by parallel jobs and kills BuildKit caching.

-    - name: Cleanup
-      if: always()
-      shell: bash
-      run: |
-        docker system prune -af
+    # (moved after Build image step)

Add this step after the Build image step:

+    - name: Cleanup (prune dangling images only)
+      if: always()
+      shell: bash
+      run: |
+        docker image prune -af
.github/filters.yaml (2)

27-33: Consider expanding backend filters to include shared test/utils if applicable.

If vllm logic lives beyond the single test file, widen globs now to avoid under‑triggering.


1-10: Unused anchors (docs, vllm, sglang, trtllm).

Since aliases aren’t used (and sequence merge isn’t supported), either remove anchors or reference them as full lists (not as items).

.github/actions/pytest-composite-action/action.yml (1)

23-27: Mount workspace and pass env into container.

Without -v, tests rely on code baked into the image. If that’s not guaranteed, mount the repo. Also export HF_HOME if needed.

-        docker run --runtime=nvidia --rm --gpus all -w /workspace \
-          --network host \
+        docker run --runtime=nvidia --rm --gpus all -w /workspace \
+          --network host \
+          --shm-size 1g \
+          -e HF_HOME=${{ env.HF_HOME }} \
+          -v "$GITHUB_WORKSPACE:/workspace" \
           --name ${{ env.CONTAINER_ID }}_pytest \
           ${{ inputs.image_tag }} \
           bash -c "pytest -xsv --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m \"${{ inputs.pytest_marks }}\""

Confirm whether images already contain the repo; if yes, the volume mount is optional.

.github/workflows/container-validation-backends.yml (5)

17-28: Expose per-backend filter outputs for targeted execution.

Currently only has_code_changes is exported; all backends run on any code change. Export per-backend outputs to gate jobs precisely.

   changed-files:
     runs-on: ubuntu-latest
     outputs:
-      has_code_changes: ${{ steps.filter.outputs.has_code_changes }}
+      has_code_changes: ${{ steps.filter.outputs.has_code_changes }}
+      vllm: ${{ steps.filter.outputs.vllm }}
+      sglang: ${{ steps.filter.outputs.sglang }}
+      trtllm: ${{ steps.filter.outputs.trtllm }}
     steps:

39-43: Gate vLLM job on vLLM changes only.

Avoids unnecessary builds/tests.

-    if: needs.changed-files.outputs.has_code_changes == 'true'
+    if: needs.changed-files.outputs.vllm == 'true'

65-68: Gate SGLang job on SGLang changes only.

-    if: needs.changed-files.outputs.has_code_changes == 'true'
+    if: needs.changed-files.outputs.sglang == 'true'

90-93: Gate TRT‑LLM job on TRT‑LLM changes only.

-    if: needs.changed-files.outputs.has_code_changes == 'true'
+    if: needs.changed-files.outputs.trtllm == 'true'

4-4: Optional: tighten default permissions.

Add minimal permissions at workflow root for better security posture.

 name: Docker Build and Test
+permissions:
+  contents: read
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c433447 and 2b2e788.

📒 Files selected for processing (5)
  • .github/actions/docker-build-composite-action/action.yml (1 hunks)
  • .github/actions/pytest-composite-action/action.yml (1 hunks)
  • .github/filters.yaml (1 hunks)
  • .github/workflows/container-validation-backends.yml (1 hunks)
  • .github/workflows/trigger_ci.yml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
.github/actions/docker-build-composite-action/action.yml

[warning] 60-60: too many spaces after colon

(colons)

🪛 actionlint (1.7.7)
.github/workflows/container-validation-backends.yml

40-40: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


65-65: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


90-90: label "gpu-l40-amd64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🔇 Additional comments (6)
.github/actions/docker-build-composite-action/action.yml (1)

72-77: Double‑check secret propagation and masking in build script.

Ensure ./container/build.sh doesn't echo AWS creds/CI token. If it prints, mask or pass via build args/credentials file.

.github/filters.yaml (1)

41-48: Consistency: include TRT‑LLM deps files analogous to vLLM if needed.

Double‑check coverage for wheel/requirements changes beyond current globs.

.github/workflows/container-validation-backends.yml (3)

12-15: Concurrency group/cancel policy looks good.

Reasonable cancellation strategy for non-main branches.


30-38: Status check logic is fine; ensure jq is available.

Ubuntu images usually have jq, but if self-hosted, install it or use node -e equivalent.


40-40: Verify custom runner label presence.

gpu-l40-amd64 is fine for self-hosted; ensure runners are registered with this label.

.github/workflows/trigger_ci.yml (1)

54-54: LGTM — centralized filters reference verified

All downstream filter keys (vllm, sglang, trtllm, sdk, has_code_changes) are present in .github/filters.yaml.

Signed-off-by: Anant Sharma <[email protected]>
Copy link
Contributor

@dillon-cullinan dillon-cullinan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any specific workflow links to show the testing?

  • A workflow where you changed a filtered file, and the expected jobs are skipped
  • A workflow where you change a target file, and the expected jobs are triggered

If this hasn't been done, you can comment out the filters temporarily and add some test filters just to ensure its working properly.

@nv-anants
Copy link
Contributor Author

Do you have any specific workflow links to show the testing?

  • A workflow where you changed a filtered file, and the expected jobs are skipped
  • A workflow where you change a target file, and the expected jobs are triggered

If this hasn't been done, you can comment out the filters temporarily and add some test filters just to ensure its working properly.

First one, you can check here - https://github.com/ai-dynamo/dynamo/actions/runs/17862125429

Second is just this PR (as we trigger on ci changes rn) so any latest commit can verify that

Signed-off-by: Anant Sharma <[email protected]>
Copy link
Contributor

@dmitry-tokarev-nv dmitry-tokarev-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Thank you Anant! 👏

@saturley-hall saturley-hall merged commit 65a1e1b into main Sep 25, 2025
17 of 18 checks passed
@saturley-hall saturley-hall deleted the anants/reorg-workflows branch September 25, 2025 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Issues/PRs that reference CI build/test size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants