feat: add AI workload policies for GPU governance #737
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Pull request overview
Adds a set of Gatekeeper policies (Rego + CEL) aimed at governing Kubernetes AI/GPU workloads, and wires them into the library catalog and CI.
Changes:
- Introduces 5 new GPU governance policies with dual-engine implementations and accompanying unit/integration test assets.
- Adds an ai-workload bundle plus individual policy entries to catalog.yaml.
- Updates CI Gatekeeper versions used for integration/verify matrices.
Reviewed changes
Copilot reviewed 65 out of 65 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/general/requiredgputoleration/src.rego | Rego implementation enforcing GPU pods tolerate a specified taint key |
| src/general/requiredgputoleration/src.cel | CEL implementation for required GPU toleration |
| src/general/requiredgputoleration/src_test.rego | OPA unit tests for required GPU toleration |
| src/general/requiredgputoleration/constraint.tmpl | ConstraintTemplate for required GPU toleration (CEL + Rego targets) |
| src/general/requiredgpuruntimeclass/src.rego | Rego implementation enforcing allowed runtimeClassName for GPU pods |
| src/general/requiredgpuruntimeclass/src.cel | CEL implementation for required GPU runtime class |
| src/general/requiredgpuruntimeclass/src_test.rego | OPA unit tests for required GPU runtime class |
| src/general/requiredgpuruntimeclass/constraint.tmpl | ConstraintTemplate for required GPU runtime class (CEL + Rego targets) |
| src/general/gpusharedmemory/src.rego | Rego implementation enforcing memory-backed /dev/shm mount for GPU containers |
| src/general/gpusharedmemory/src.cel | CEL implementation for GPU shared memory enforcement |
| src/general/gpusharedmemory/src_test.rego | OPA unit tests for GPU shared memory enforcement |
| src/general/gpusharedmemory/constraint.tmpl | ConstraintTemplate for GPU shared memory (CEL + Rego targets) |
| src/general/gpuresourcelimits/src.rego | Rego implementation enforcing max GPU per container |
| src/general/gpuresourcelimits/src.cel | CEL implementation for GPU resource limits |
| src/general/gpuresourcelimits/src_test.rego | OPA unit tests for GPU resource limits |
| src/general/gpuresourcelimits/constraint.tmpl | ConstraintTemplate for GPU resource limits (CEL + Rego targets) |
| src/general/gpuactivedeadline/src.rego | Rego implementation requiring/enforcing activeDeadlineSeconds for GPU pods |
| src/general/gpuactivedeadline/src.cel | CEL implementation for GPU active deadline enforcement |
| src/general/gpuactivedeadline/src_test.rego | OPA unit tests for GPU active deadline enforcement |
| src/general/gpuactivedeadline/constraint.tmpl | ConstraintTemplate for GPU active deadline (CEL + Rego targets) |
| library/general/requiredgputoleration/template.yaml | Rendered library ConstraintTemplate for required GPU toleration |
| library/general/requiredgputoleration/suite.yaml | Gator suite for required GPU toleration |
| library/general/requiredgputoleration/kustomization.yaml | Kustomize entry for required GPU toleration template |
| library/general/requiredgputoleration/samples/no-gpu/example_allowed.yaml | Sample allowed Pod without GPU (toleration policy) |
| library/general/requiredgputoleration/samples/no-gpu/constraint.yaml | Sample constraint for no-gpu case (toleration policy) |
| library/general/requiredgputoleration/samples/gpu-without-toleration/example_disallowed.yaml | Sample disallowed GPU Pod missing toleration |
| library/general/requiredgputoleration/samples/gpu-without-toleration/constraint.yaml | Sample constraint for missing-toleration case |
| library/general/requiredgputoleration/samples/gpu-with-toleration/example_allowed.yaml | Sample allowed GPU Pod with toleration |
| library/general/requiredgputoleration/samples/gpu-with-toleration/constraint.yaml | Sample constraint for with-toleration case |
| library/general/requiredgpuruntimeclass/template.yaml | Rendered library ConstraintTemplate for required GPU runtime class |
| library/general/requiredgpuruntimeclass/suite.yaml | Gator suite for required GPU runtime class |
| library/general/requiredgpuruntimeclass/kustomization.yaml | Kustomize entry for required GPU runtime class template |
| library/general/requiredgpuruntimeclass/samples/no-gpu/example_allowed.yaml | Sample allowed Pod without GPU (runtimeclass policy) |
| library/general/requiredgpuruntimeclass/samples/no-gpu/constraint.yaml | Sample constraint for no-gpu case (runtimeclass policy) |
| library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/example_disallowed.yaml | Sample disallowed GPU Pod missing runtimeClassName |
| library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/constraint.yaml | Sample constraint for missing-runtimeclass case |
| library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/example_allowed.yaml | Sample allowed GPU Pod with runtimeClassName |
| library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/constraint.yaml | Sample constraint for allowed runtimeClassName case |
| library/general/gpusharedmemory/template.yaml | Rendered library ConstraintTemplate for GPU shared memory |
| library/general/gpusharedmemory/suite.yaml | Gator suite for GPU shared memory |
| library/general/gpusharedmemory/kustomization.yaml | Kustomize entry for GPU shared memory template |
| library/general/gpusharedmemory/samples/no-gpu/example_allowed.yaml | Sample allowed Pod without GPU (shm policy) |
| library/general/gpusharedmemory/samples/no-gpu/constraint.yaml | Sample constraint for no-gpu case (shm policy) |
| library/general/gpusharedmemory/samples/gpu-without-shm/example_disallowed.yaml | Sample disallowed GPU Pod missing shm mount |
| library/general/gpusharedmemory/samples/gpu-without-shm/constraint.yaml | Sample constraint for missing-shm case |
| library/general/gpusharedmemory/samples/gpu-with-shm/example_allowed.yaml | Sample allowed GPU Pod with memory-backed /dev/shm |
| library/general/gpusharedmemory/samples/gpu-with-shm/constraint.yaml | Sample constraint for with-shm case |
| library/general/gpuresourcelimits/template.yaml | Rendered library ConstraintTemplate for GPU resource limits |
| library/general/gpuresourcelimits/suite.yaml | Gator suite for GPU resource limits |
| library/general/gpuresourcelimits/kustomization.yaml | Kustomize entry for GPU resource limits template |
| library/general/gpuresourcelimits/samples/gpu-within-limit/example_allowed.yaml | Sample allowed GPU Pod within limit |
| library/general/gpuresourcelimits/samples/gpu-within-limit/constraint.yaml | Sample constraint for within-limit case |
| library/general/gpuresourcelimits/samples/gpu-exceeds-limit/example_disallowed.yaml | Sample disallowed GPU Pod exceeding limit |
| library/general/gpuresourcelimits/samples/gpu-exceeds-limit/constraint.yaml | Sample constraint for exceeds-limit case |
| library/general/gpuactivedeadline/template.yaml | Rendered library ConstraintTemplate for GPU active deadline |
| library/general/gpuactivedeadline/suite.yaml | Gator suite for GPU active deadline |
| library/general/gpuactivedeadline/kustomization.yaml | Kustomize entry for GPU active deadline template |
| library/general/gpuactivedeadline/samples/non-gpu-job/example_allowed.yaml | Sample allowed Pod without GPU (deadline policy) |
| library/general/gpuactivedeadline/samples/non-gpu-job/constraint.yaml | Sample constraint for non-gpu case (deadline policy) |
| library/general/gpuactivedeadline/samples/gpu-job-without-deadline/example_disallowed.yaml | Sample disallowed GPU Pod missing deadline |
| library/general/gpuactivedeadline/samples/gpu-job-without-deadline/constraint.yaml | Sample constraint for missing-deadline case |
| library/general/gpuactivedeadline/samples/gpu-job-with-deadline/example_allowed.yaml | Sample allowed GPU Pod with deadline |
| library/general/gpuactivedeadline/samples/gpu-job-with-deadline/constraint.yaml | Sample constraint enforcing max deadline |
| catalog.yaml | Adds ai-workload bundle and new policy catalog entries |
| .github/workflows/workflow.yaml | Bumps Gatekeeper versions used in CI matrices |
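
The gpusharedmemory rows above describe enforcing a memory-backed /dev/shm mount for GPU containers. A rough Python model of that check is sketched below; it is not the PR's Rego or CEL. It assumes GPUs are requested through the `nvidia.com/gpu` limit and that "memory-backed" means an `emptyDir` volume with `medium: Memory`. The function names are hypothetical, and only `containers` (not init containers) are inspected.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumption: the resource name is not stated in the PR text


def requests_gpu(container: dict) -> bool:
    """Return True when the container sets a positive GPU limit."""
    limits = (container.get("resources") or {}).get("limits") or {}
    return int(limits.get(GPU_RESOURCE, 0)) > 0


def shm_violations(pod_spec: dict) -> list:
    """Names of GPU containers that lack a memory-backed /dev/shm mount."""
    # Volumes that are emptyDir with medium "Memory" (tmpfs-backed).
    memory_volumes = {
        v["name"]
        for v in pod_spec.get("volumes", [])
        if (v.get("emptyDir") or {}).get("medium") == "Memory"
    }
    violations = []
    for container in pod_spec.get("containers", []):
        if not requests_gpu(container):
            continue  # non-GPU containers are out of scope
        mounted = any(
            m.get("mountPath") == "/dev/shm" and m.get("name") in memory_volumes
            for m in container.get("volumeMounts", [])
        )
        if not mounted:
            violations.append(container["name"])
    return violations
```

A pod whose GPU container mounts a disk-backed `emptyDir` at /dev/shm is still reported by this sketch, because the volume's medium is checked as well as the mount path.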

    is_exempt(container) {
        exempt_images := object.get(input, ["parameters", "exemptImages"], [])
        img := container.image
        exemption := exempt_images[_]
        _matches_exemption(img, exemption)
    }

    _matches_exemption(img, exemption) {
        not endswith(exemption, "*")
        exemption == img
    }

    _matches_exemption(img, exemption) {
        endswith(exemption, "*")
        prefix := trim_suffix(exemption, "*")

We intentionally followed the same pattern as the existing nounsupportedgpu policy (the only other GPU policy in the library), which also uses inline is_exempt/_matches_exemption rather than the shared lib_exempt_container. This keeps the GPU policies self-contained and consistent with each other. Happy to migrate all GPU policies to the shared lib in a follow-up if preferred.
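
For readers less familiar with Rego, the inline is_exempt/_matches_exemption pattern discussed here can be modelled in Python roughly as follows. The quoted snippet is cut off after computing the prefix, so the prefix comparison below is an assumption about how that rule continues, not a copy of the PR's code. The Python function names are illustrative.

```python
def matches_exemption(image: str, exemption: str) -> bool:
    """Exact match, or prefix match when the exemption ends with '*'."""
    if exemption.endswith("*"):
        # Assumption: the truncated rule compares the image against the
        # exemption with its trailing '*' removed.
        return image.startswith(exemption[:-1])
    return image == exemption


def is_exempt(container: dict, parameters: dict) -> bool:
    """True when the container image matches any entry in exemptImages."""
    image = container.get("image")
    if image is None:
        return False  # mirrors Rego: an undefined image never matches
    exempt_images = parameters.get("exemptImages", [])
    return any(matches_exemption(image, e) for e in exempt_images)
```

Without a trailing `*`, an entry must equal the full image string, so in this model `busybox` does not exempt `busybox:1.36`.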
9fbaef2 to abc7702
Add 5 new policies for AI/ML workload governance on Kubernetes:
- k8sgpuresourcelimits: Enforce max GPU count per container
- k8srequiredgputoleration: Require GPU pods to tolerate GPU node taints
- k8sgpuactivedeadline: Require GPU pods to set activeDeadlineSeconds
- k8sgpusharedmemory: Require GPU containers to mount memory-backed /dev/shm
- k8srequiredgpuruntimeclass: Require GPU pods to use an allowed runtimeClassName

Each policy includes:
- Dual-engine implementation (Rego + CEL/K8sNativeValidation)
- OPA unit tests (21 tests total, all passing)
- Gator integration tests (suite.yaml with sample constraints and resources)
- exemptImages parameter support

Also adds an 'ai-workload' bundle to catalog.yaml that groups these policies with the existing k8snounsupportedgpu policy.

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
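
To make two of the rules in this commit message concrete, here is a hedged Python sketch of the per-container GPU cap and the activeDeadlineSeconds requirement. It models the described behaviour and is not the PR's Rego or CEL. The `nvidia.com/gpu` resource name is an assumption, the function names are hypothetical, and init containers and exemptImages handling are left out for brevity. The parameter names mirror maxGpuPerContainer and maxActiveDeadlineSeconds from the sample constraints.

```python
from typing import Optional

GPU_RESOURCE = "nvidia.com/gpu"  # assumption: not stated in the commit message


def gpu_count(container: dict) -> int:
    """GPU limit of a container, or 0 when none is set."""
    limits = (container.get("resources") or {}).get("limits") or {}
    return int(limits.get(GPU_RESOURCE, 0))


def gpu_limit_violations(pod_spec: dict, max_gpu_per_container: int) -> list:
    """Names of containers whose GPU limit exceeds the configured cap."""
    return [
        c["name"]
        for c in pod_spec.get("containers", [])
        if gpu_count(c) > max_gpu_per_container
    ]


def deadline_violation(pod_spec: dict,
                       max_active_deadline_seconds: Optional[int] = None) -> Optional[str]:
    """Return a message when a GPU pod lacks, or exceeds, the deadline."""
    if not any(gpu_count(c) > 0 for c in pod_spec.get("containers", [])):
        return None  # the rule only applies to pods that request GPUs
    deadline = pod_spec.get("activeDeadlineSeconds")
    if deadline is None:
        return "GPU pod must set activeDeadlineSeconds"
    if max_active_deadline_seconds is not None and deadline > max_active_deadline_seconds:
        return "activeDeadlineSeconds exceeds the configured maximum"
    return None
```

Both checks skip pods without GPU limits, which matches the "no-gpu" and "non-gpu-job" allowed samples listed in the file summary.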
abc7702 to 67a3edd
- Add nounsupportedgpu source and library files (needed for ai-workload bundle)
- Fix catalog templatePath URLs to point to this branch instead of master
- Add k8snounsupportedgpu policy entry to catalog

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
41c285e to d270c94
Pull request overview
This PR expands the Gatekeeper Library’s AI/GPU workload governance by adding new GPU-focused validation policies (with both Rego and CEL implementations), publishing the generated library + ArtifactHub assets for those policies, and updating the website sidebar/catalog presentation and CI Gatekeeper versions.
Changes:
- Added GPU governance policies (Rego + CEL) with accompanying unit tests, ConstraintTemplates, suites, and sample resources.
- Added “AI Workload Policies” website navigation (profiles for GPU Safety / Training / Inference) driven from bundle metadata.
- Updated CI Gatekeeper test matrix versions and regenerated/published generated artifacts (library + ArtifactHub).
Reviewed changes
Copilot reviewed 214 out of 215 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| website/sidebars.js | Adds “AI Workload Policies” navigation categories and links to GPU policy docs. |
| scripts/website/sidebars-template.js | Introduces sidebar template placeholders for AI workload policy groupings. |
| scripts/website/generate.go | Generates AI workload sidebar sections from bundle metadata and filters general policies accordingly. |
| scripts/website/go.mod | Updates Go version for website generator module and adds indirect dependency. |
| scripts/website/go.sum | Adds go.sum entries for the new indirect dependency. |
| go.work | Updates Go workspace version used across scripts modules. |
| .github/workflows/workflow.yaml | Bumps Gatekeeper versions tested in CI matrices. |
| src/general/nounsupportedgpu/src.rego | New Rego policy: require NVIDIA_VISIBLE_DEVICES when requesting GPUs. |
| src/general/nounsupportedgpu/src.cel | New CEL policy equivalent for nounsupportedgpu. |
| src/general/nounsupportedgpu/src_test.rego | Unit tests for nounsupportedgpu Rego policy behavior. |
| src/general/nounsupportedgpu/lib_exempt_container.rego | Exemption helper used by nounsupportedgpu. |
| src/general/nounsupportedgpu/constraint.tmpl | ConstraintTemplate source for nounsupportedgpu (Rego + CEL + libs). |
| src/general/gpuresourcelimits/src.rego | New Rego policy to cap GPUs per container. |
| src/general/gpuresourcelimits/src.cel | New CEL policy equivalent for gpuresourcelimits. |
| src/general/gpuresourcelimits/src_test.rego | Unit tests for gpuresourcelimits. |
| src/general/gpuresourcelimits/lib_exempt_container.rego | Exemption helper used by gpuresourcelimits. |
| src/general/gpuresourcelimits/constraint.tmpl | ConstraintTemplate source for gpuresourcelimits. |
| src/general/requiredgputoleration/src.rego | New Rego policy requiring a GPU taint toleration for GPU pods. |
| src/general/requiredgputoleration/src.cel | New CEL policy equivalent for requiredgputoleration. |
| src/general/requiredgputoleration/src_test.rego | Unit tests for requiredgputoleration. |
| src/general/requiredgputoleration/lib_exempt_container.rego | Exemption helper used by requiredgputoleration. |
| src/general/requiredgputoleration/constraint.tmpl | ConstraintTemplate source for requiredgputoleration. |
| src/general/requiredgpuruntimeclass/src.rego | New Rego policy requiring an allowed runtimeClassName for GPU pods. |
| src/general/requiredgpuruntimeclass/src.cel | New CEL policy equivalent for requiredgpuruntimeclass. |
| src/general/requiredgpuruntimeclass/src_test.rego | Unit tests for requiredgpuruntimeclass. |
| src/general/requiredgpuruntimeclass/lib_exempt_container.rego | Exemption helper used by requiredgpuruntimeclass. |
| src/general/requiredgpuruntimeclass/constraint.tmpl | ConstraintTemplate source for requiredgpuruntimeclass. |
| src/general/gpuactivedeadline/src.rego | New Rego policy requiring activeDeadlineSeconds for GPU pods (with optional max). |
| src/general/gpuactivedeadline/src.cel | New CEL policy equivalent for gpuactivedeadline. |
| src/general/gpuactivedeadline/src_test.rego | Unit tests for gpuactivedeadline. |
| src/general/gpuactivedeadline/lib_exempt_container.rego | Exemption helper used by gpuactivedeadline. |
| src/general/gpuactivedeadline/constraint.tmpl | ConstraintTemplate source for gpuactivedeadline. |
| src/general/gpusharedmemory/src.rego | New Rego policy requiring memory-backed /dev/shm for GPU containers. |
| src/general/gpusharedmemory/src.cel | New CEL policy equivalent for gpusharedmemory. |
| src/general/gpusharedmemory/src_test.rego | Unit tests for gpusharedmemory. |
| src/general/gpusharedmemory/lib_exempt_container.rego | Exemption helper used by gpusharedmemory. |
| src/general/gpusharedmemory/constraint.tmpl | ConstraintTemplate source for gpusharedmemory. |
| src/general/gpunodetargeting/src.rego | New Rego policy requiring GPU node targeting via nodeSelector or required node affinity. |
| src/general/gpunodetargeting/src.cel | New CEL policy equivalent for gpunodetargeting. |
| src/general/gpunodetargeting/src_test.rego | Unit tests for gpunodetargeting. |
| src/general/gpunodetargeting/lib_exempt_container.rego | Exemption helper used by gpunodetargeting. |
| src/general/gpunodetargeting/constraint.tmpl | ConstraintTemplate source for gpunodetargeting. |
| src/general/gpuworkloadresources/src.rego | New Rego policy enforcing GPU request=limit, memory request=limit, and CPU requests for GPU pods. |
| src/general/gpuworkloadresources/src.cel | New CEL policy equivalent for gpuworkloadresources. |
| src/general/gpuworkloadresources/src_test.rego | Unit tests for gpuworkloadresources. |
| src/general/gpuworkloadresources/lib_exempt_container.rego | Exemption helper used by gpuworkloadresources. |
| src/general/gpuworkloadresources/constraint.tmpl | ConstraintTemplate source for gpuworkloadresources. |
| library/general/nounsupportedgpu/template.yaml | Generated ConstraintTemplate for library distribution. |
| library/general/nounsupportedgpu/suite.yaml | Gatekeeper test suite for nounsupportedgpu library artifact. |
| library/general/nounsupportedgpu/samples/no-gpu-requested/example_allowed.yaml | Sample allowed resource (non-GPU). |
| library/general/nounsupportedgpu/samples/no-gpu-requested/constraint.yaml | Sample constraint for nounsupportedgpu. |
| library/general/nounsupportedgpu/samples/gpu-with-env-var/example_allowed.yaml | Sample allowed GPU resource with env var. |
| library/general/nounsupportedgpu/samples/gpu-with-env-var/constraint.yaml | Sample constraint for gpu-with-env-var. |
| library/general/nounsupportedgpu/samples/gpu-without-env-var/example_disallowed.yaml | Sample disallowed GPU resource without env var. |
| library/general/nounsupportedgpu/samples/gpu-without-env-var/example_allowed_exempt.yaml | Sample allowed exempted image. |
| library/general/nounsupportedgpu/samples/gpu-without-env-var/constraint.yaml | Sample constraint with exemptImages. |
| library/general/nounsupportedgpu/kustomization.yaml | Kustomize entry for nounsupportedgpu library package. |
| library/general/gpuresourcelimits/template.yaml | Generated ConstraintTemplate for gpuresourcelimits. |
| library/general/gpuresourcelimits/suite.yaml | Gatekeeper test suite for gpuresourcelimits. |
| library/general/gpuresourcelimits/samples/gpu-within-limit/example_allowed.yaml | Sample allowed GPU within max. |
| library/general/gpuresourcelimits/samples/gpu-within-limit/constraint.yaml | Sample constraint defining maxGpuPerContainer. |
| library/general/gpuresourcelimits/samples/gpu-exceeds-limit/example_disallowed.yaml | Sample disallowed GPU exceeding max. |
| library/general/gpuresourcelimits/samples/gpu-exceeds-limit/constraint.yaml | Sample constraint for exceeds-limit. |
| library/general/gpuresourcelimits/kustomization.yaml | Kustomize entry for gpuresourcelimits. |
| library/general/requiredgputoleration/template.yaml | Generated ConstraintTemplate for requiredgputoleration. |
| library/general/requiredgputoleration/suite.yaml | Gatekeeper test suite for requiredgputoleration. |
| library/general/requiredgputoleration/samples/no-gpu/example_allowed.yaml | Sample allowed non-GPU pod. |
| library/general/requiredgputoleration/samples/no-gpu/constraint.yaml | Sample constraint requiring tolerationKey. |
| library/general/requiredgputoleration/samples/gpu-with-toleration/example_allowed.yaml | Sample allowed GPU pod with toleration. |
| library/general/requiredgputoleration/samples/gpu-with-toleration/constraint.yaml | Sample constraint for gpu-with-toleration. |
| library/general/requiredgputoleration/samples/gpu-without-toleration/example_disallowed.yaml | Sample disallowed GPU pod missing toleration. |
| library/general/requiredgputoleration/samples/gpu-without-toleration/constraint.yaml | Sample constraint for gpu-without-toleration. |
| library/general/requiredgputoleration/kustomization.yaml | Kustomize entry for requiredgputoleration. |
| library/general/requiredgpuruntimeclass/template.yaml | Generated ConstraintTemplate for requiredgpuruntimeclass. |
| library/general/requiredgpuruntimeclass/suite.yaml | Gatekeeper test suite for requiredgpuruntimeclass. |
| library/general/requiredgpuruntimeclass/samples/no-gpu/example_allowed.yaml | Sample allowed non-GPU pod. |
| library/general/requiredgpuruntimeclass/samples/no-gpu/constraint.yaml | Sample constraint defining allowed runtime classes. |
| library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/example_allowed.yaml | Sample allowed GPU pod with runtimeClassName. |
| library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/constraint.yaml | Sample constraint for gpu-with-runtimeclass. |
| library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/example_disallowed.yaml | Sample disallowed GPU pod missing runtimeClassName. |
| library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/constraint.yaml | Sample constraint for gpu-without-runtimeclass. |
| library/general/requiredgpuruntimeclass/kustomization.yaml | Kustomize entry for requiredgpuruntimeclass. |
| library/general/gpuactivedeadline/template.yaml | Generated ConstraintTemplate for gpuactivedeadline. |
| library/general/gpuactivedeadline/suite.yaml | Gatekeeper test suite for gpuactivedeadline. |
| library/general/gpuactivedeadline/samples/non-gpu-job/example_allowed.yaml | Sample allowed non-GPU pod. |
| library/general/gpuactivedeadline/samples/non-gpu-job/constraint.yaml | Sample constraint for gpuactivedeadline. |
| library/general/gpuactivedeadline/samples/gpu-job-with-deadline/example_allowed.yaml | Sample allowed GPU pod with activeDeadlineSeconds. |
| library/general/gpuactivedeadline/samples/gpu-job-with-deadline/constraint.yaml | Sample constraint enforcing maxActiveDeadlineSeconds. |
| library/general/gpuactivedeadline/samples/gpu-job-without-deadline/example_disallowed.yaml | Sample disallowed GPU pod missing deadline. |
| library/general/gpuactivedeadline/samples/gpu-job-without-deadline/constraint.yaml | Sample constraint for missing deadline case. |
| library/general/gpuactivedeadline/kustomization.yaml | Kustomize entry for gpuactivedeadline. |
| library/general/gpusharedmemory/template.yaml | Generated ConstraintTemplate for gpusharedmemory. |
| library/general/gpusharedmemory/suite.yaml | Gatekeeper test suite for gpusharedmemory. |
| library/general/gpusharedmemory/samples/no-gpu/example_allowed.yaml | Sample allowed non-GPU pod. |
| library/general/gpusharedmemory/samples/no-gpu/constraint.yaml | Sample constraint for gpusharedmemory. |
| library/general/gpusharedmemory/samples/gpu-with-shm/example_allowed.yaml | Sample allowed GPU pod with /dev/shm memory-backed volume. |
| library/general/gpusharedmemory/samples/gpu-with-shm/constraint.yaml | Sample constraint for shm requirement. |
| library/general/gpusharedmemory/samples/gpu-without-shm/example_disallowed.yaml | Sample disallowed GPU pod missing shm mount. |
| library/general/gpusharedmemory/samples/gpu-without-shm/constraint.yaml | Sample constraint for missing shm mount case. |
| library/general/gpusharedmemory/kustomization.yaml | Kustomize entry for gpusharedmemory. |
| library/general/gpuworkloadresources/suite.yaml | Gatekeeper test suite for gpuworkloadresources. |
| library/general/gpuworkloadresources/samples/non-gpu-pod/example_allowed.yaml | Sample allowed non-GPU pod. |
| library/general/gpuworkloadresources/samples/non-gpu-pod/constraint.yaml | Sample constraint for gpuworkloadresources. |
| library/general/gpuworkloadresources/samples/gpu-pod-compliant/example_allowed.yaml | Sample allowed GPU pod meeting resource rules. |
| library/general/gpuworkloadresources/samples/gpu-pod-compliant/constraint.yaml | Sample constraint for compliant case. |
| library/general/gpuworkloadresources/samples/gpu-pod-memory-mismatch/example_disallowed.yaml | Sample disallowed GPU pod memory request/limit mismatch. |
| library/general/gpuworkloadresources/samples/gpu-pod-memory-mismatch/constraint.yaml | Sample constraint for memory mismatch case. |
| library/general/gpuworkloadresources/samples/gpu-pod-cpu-request-missing/example_disallowed.yaml | Sample disallowed GPU pod missing CPU request. |
| library/general/gpuworkloadresources/samples/gpu-pod-cpu-request-missing/constraint.yaml | Sample constraint for missing CPU request case. |
| library/general/gpuworkloadresources/kustomization.yaml | Kustomize entry for gpuworkloadresources. |
| library/general/gpunodetargeting/suite.yaml | Gatekeeper test suite for gpunodetargeting. |
| library/general/gpunodetargeting/samples/non-gpu-pod/example_allowed.yaml | Sample allowed non-GPU pod. |
| library/general/gpunodetargeting/samples/non-gpu-pod/constraint.yaml | Sample constraint for gpunodetargeting. |
| library/general/gpunodetargeting/samples/gpu-pod-with-node-selector/example_allowed.yaml | Sample allowed GPU pod using nodeSelector. |
| library/general/gpunodetargeting/samples/gpu-pod-with-node-selector/constraint.yaml | Sample constraint for nodeSelector path. |
| library/general/gpunodetargeting/samples/gpu-pod-with-node-affinity/example_allowed.yaml | Sample allowed GPU pod using required node affinity. |
| library/general/gpunodetargeting/samples/gpu-pod-with-node-affinity/constraint.yaml | Sample constraint for affinity path. |
| library/general/gpunodetargeting/samples/gpu-pod-without-targeting/example_disallowed.yaml | Sample disallowed GPU pod missing targeting. |
| library/general/gpunodetargeting/samples/gpu-pod-without-targeting/constraint.yaml | Sample constraint for missing targeting case. |
| library/general/gpunodetargeting/kustomization.yaml | Kustomize entry for gpunodetargeting. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/template.yaml | Published ArtifactHub template for nounsupportedgpu. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/suite.yaml | ArtifactHub suite for nounsupportedgpu. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/no-gpu-requested/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/no-gpu-requested/constraint.yaml | ArtifactHub sample constraint: no-gpu-requested. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-with-env-var/example_allowed.yaml | ArtifactHub sample: allowed GPU with env var. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-with-env-var/constraint.yaml | ArtifactHub sample constraint: gpu-with-env-var. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/example_disallowed.yaml | ArtifactHub sample: disallowed without env var. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/example_allowed_exempt.yaml | ArtifactHub sample: allowed via exempt image. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/constraint.yaml | ArtifactHub sample constraint: exemptImages. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/kustomization.yaml | ArtifactHub kustomization for nounsupportedgpu. |
| artifacthub/library/general/nounsupportedgpu/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for nounsupportedgpu. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/template.yaml | Published ArtifactHub template for gpuresourcelimits. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/suite.yaml | ArtifactHub suite for gpuresourcelimits. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-within-limit/example_allowed.yaml | ArtifactHub sample: within limit. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-within-limit/constraint.yaml | ArtifactHub sample constraint: maxGpuPerContainer. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-exceeds-limit/example_disallowed.yaml | ArtifactHub sample: exceeds limit. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-exceeds-limit/constraint.yaml | ArtifactHub sample constraint: exceeds limit. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/kustomization.yaml | ArtifactHub kustomization for gpuresourcelimits. |
| artifacthub/library/general/gpuresourcelimits/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for gpuresourcelimits. |
| artifacthub/library/general/requiredgputoleration/1.0.0/template.yaml | Published ArtifactHub template for requiredgputoleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/suite.yaml | ArtifactHub suite for requiredgputoleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/samples/no-gpu/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/requiredgputoleration/1.0.0/samples/no-gpu/constraint.yaml | ArtifactHub sample constraint: tolerationKey. |
| artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-with-toleration/example_allowed.yaml | ArtifactHub sample: allowed with toleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-with-toleration/constraint.yaml | ArtifactHub sample constraint: gpu-with-toleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-without-toleration/example_disallowed.yaml | ArtifactHub sample: disallowed missing toleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-without-toleration/constraint.yaml | ArtifactHub sample constraint: gpu-without-toleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/kustomization.yaml | ArtifactHub kustomization for requiredgputoleration. |
| artifacthub/library/general/requiredgputoleration/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for requiredgputoleration. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/template.yaml | Published ArtifactHub template for requiredgpuruntimeclass. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/suite.yaml | ArtifactHub suite for requiredgpuruntimeclass. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/no-gpu/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/no-gpu/constraint.yaml | ArtifactHub sample constraint: allowedRuntimeClassNames. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-with-runtimeclass/example_allowed.yaml | ArtifactHub sample: allowed with runtimeClassName. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-with-runtimeclass/constraint.yaml | ArtifactHub sample constraint: gpu-with-runtimeclass. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-without-runtimeclass/example_disallowed.yaml | ArtifactHub sample: disallowed missing runtimeClassName. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-without-runtimeclass/constraint.yaml | ArtifactHub sample constraint: gpu-without-runtimeclass. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/kustomization.yaml | ArtifactHub kustomization for requiredgpuruntimeclass. |
| artifacthub/library/general/requiredgpuruntimeclass/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for requiredgpuruntimeclass. |
| artifacthub/library/general/gpusharedmemory/1.0.0/template.yaml | Published ArtifactHub template for gpusharedmemory. |
| artifacthub/library/general/gpusharedmemory/1.0.0/suite.yaml | ArtifactHub suite for gpusharedmemory. |
| artifacthub/library/general/gpusharedmemory/1.0.0/samples/no-gpu/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/gpusharedmemory/1.0.0/samples/no-gpu/constraint.yaml | ArtifactHub sample constraint: no-gpu. |
| artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-with-shm/example_allowed.yaml | ArtifactHub sample: allowed with shm volume/mount. |
| artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-with-shm/constraint.yaml | ArtifactHub sample constraint: gpu-with-shm. |
| artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-without-shm/example_disallowed.yaml | ArtifactHub sample: disallowed missing shm. |
| artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-without-shm/constraint.yaml | ArtifactHub sample constraint: gpu-without-shm. |
| artifacthub/library/general/gpusharedmemory/1.0.0/kustomization.yaml | ArtifactHub kustomization for gpusharedmemory. |
| artifacthub/library/general/gpusharedmemory/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for gpusharedmemory. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/template.yaml | Published ArtifactHub template for gpuactivedeadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/suite.yaml | ArtifactHub suite for gpuactivedeadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/samples/non-gpu-job/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/samples/non-gpu-job/constraint.yaml | ArtifactHub sample constraint: non-gpu-job. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-with-deadline/example_allowed.yaml | ArtifactHub sample: allowed with deadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-with-deadline/constraint.yaml | ArtifactHub sample constraint: max deadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-without-deadline/example_disallowed.yaml | ArtifactHub sample: disallowed missing deadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-without-deadline/constraint.yaml | ArtifactHub sample constraint: missing deadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/kustomization.yaml | ArtifactHub kustomization for gpuactivedeadline. |
| artifacthub/library/general/gpuactivedeadline/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for gpuactivedeadline. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/suite.yaml | ArtifactHub suite for gpuworkloadresources. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/non-gpu-pod/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/non-gpu-pod/constraint.yaml | ArtifactHub sample constraint: non-gpu-pod. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-compliant/example_allowed.yaml | ArtifactHub sample: compliant GPU pod. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-compliant/constraint.yaml | ArtifactHub sample constraint: gpu-pod-compliant. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-memory-mismatch/example_disallowed.yaml | ArtifactHub sample: memory mismatch. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-memory-mismatch/constraint.yaml | ArtifactHub sample constraint: memory mismatch. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-cpu-request-missing/example_disallowed.yaml | ArtifactHub sample: missing CPU request. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-cpu-request-missing/constraint.yaml | ArtifactHub sample constraint: missing CPU request. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/kustomization.yaml | ArtifactHub kustomization for gpuworkloadresources. |
| artifacthub/library/general/gpuworkloadresources/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for gpuworkloadresources. |
| artifacthub/library/general/gpunodetargeting/1.0.0/suite.yaml | ArtifactHub suite for gpunodetargeting. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/non-gpu-pod/example_allowed.yaml | ArtifactHub sample: allowed non-GPU pod. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/non-gpu-pod/constraint.yaml | ArtifactHub sample constraint: non-gpu-pod. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-without-targeting/example_disallowed.yaml | ArtifactHub sample: missing targeting. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-without-targeting/constraint.yaml | ArtifactHub sample constraint: missing targeting. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-selector/example_allowed.yaml | ArtifactHub sample: nodeSelector targeting. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-selector/constraint.yaml | ArtifactHub sample constraint: nodeSelector. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-affinity/example_allowed.yaml | ArtifactHub sample: affinity targeting. |
| artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-affinity/constraint.yaml | ArtifactHub sample constraint: affinity. |
| artifacthub/library/general/gpunodetargeting/1.0.0/kustomization.yaml | ArtifactHub kustomization for gpunodetargeting. |
| artifacthub/library/general/gpunodetargeting/1.0.0/artifacthub-pkg.yml | ArtifactHub package metadata for gpunodetargeting. |
    has_matching_node_affinity(label_key) {
        term := input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[_]
        expr := term.matchExpressions[_]
        expr.key == label_key
        label_values := object.get(input.parameters, "nodeLabelValues", [])
        count(label_values) > 0
        expr.operator == "In"
        expr.values[_] == label_values[_]
In has_matching_node_affinity (the count(label_values) > 0 case), the rule succeeds if any affinity value overlaps with nodeLabelValues (expr.values[_] == label_values[_]). That allows an affinity like values: ["true", "false"] when only ["true"] is allowed, which would still permit scheduling onto disallowed nodes. Tighten the check so that all expr.values are within the allowed nodeLabelValues (i.e., require expr.values to be a subset of label_values).
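The subset requirement described above could be expressed in Rego by comparing sets. This is only a sketch in the same rule syntax as the snippet under review: the package name is a placeholder, and the non-empty `expr.values` guard is an added assumption so that an empty `In` list cannot pass vacuously.

```rego
package k8sgpunodetargeting # placeholder package name

has_matching_node_affinity(label_key) {
	term := input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[_]
	expr := term.matchExpressions[_]
	expr.key == label_key
	label_values := object.get(input.parameters, "nodeLabelValues", [])
	count(label_values) > 0
	expr.operator == "In"

	# Assumption: an empty values list should not be accepted.
	count(expr.values) > 0

	# Subset check: no requested value may fall outside the allowed set.
	allowed := {v | v := label_values[_]}
	requested := {v | v := expr.values[_]}
	count(requested - allowed) == 0
}
```

With `nodeLabelValues: ["true"]`, an expression with `values: ["true", "false"]` leaves `"false"` in the set difference, so the rule no longer matches. An expression with `values: ["true"]` still does.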
    variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.exists(term,
      has(term.matchExpressions) &&
      term.matchExpressions.exists(expr,
        expr.key == variables.nodeLabelKey &&
        (
          size(variables.nodeLabelValues) == 0 ?
            expr.operator == "Exists" :
            expr.operator == "In" &&
            has(expr.values) &&
            variables.nodeLabelValues.exists(value, expr.values.exists(exprValue, exprValue == value))
        )
      )
hasMatchingNodeAffinity currently treats it as a match when there is any overlap between expr.values and nodeLabelValues (the nested exists checks). This allows affinities that include both allowed and disallowed values to pass. Update the logic to require that all expr.values are contained in nodeLabelValues when nodeLabelValues is non-empty (subset check).
Add generated samples, suites, and docs for AI workload GPU policy edge cases, including exemptions, disabled parameters, init and ephemeral containers, request-only GPU usage, and invalid targeting/runtime/toleration configurations.
Tighten GPU node targeting CEL logic so key-only nodeSelector matching requires a non-empty value, matching the Rego implementation.
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Scenarios
| Policy | Classification |
|---|---|
| K8sGpuActiveDeadline | Training / batch / temporary GPU workloads |
| K8sGpuResourceLimits | Training + inference / multi-tenant GPU fairness |
| K8sGpuWorkloadResources | Training + inference / resource hygiene |
| K8sRequiredGpuToleration | Training + inference / GPU node scheduling infrastructure |
| K8sGpuNodeTargeting | Training + inference / GPU placement and accelerator selection |
| K8sGpuSharedMemory | Training / distributed and multiprocessing-heavy GPU workloads |
How they fit together
| Policy | User problem it solves |
|---|---|
| K8sGpuActiveDeadline | “My training job hung and held a GPU forever.” |
| K8sGpuResourceLimits | “One user accidentally requested all GPUs on the node.” |
| K8sGpuWorkloadResources | “GPU pods are poorly requested and cause bad scheduling or idle GPUs.” |
| K8sRequiredGpuToleration | “GPU pods forget the toleration and get stuck pending.” |
| K8sGpuNodeTargeting | “GPU pods should explicitly target the right GPU node pool/type.” |
| K8sGpuSharedMemory | “Training jobs crash or hang because /dev/shm is too small.” |
    !has(variables.anyObject.spec.affinity.nodeAffinity) ||
    !has(variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution) ||
    !has(variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms) ? false :
    variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.exists(term,
nodeSelectorTerms are ORed by Kubernetes, so checking that any term has the GPU label is not sufficient. A pod could include one valid GPU-targeting term and another broad/non-GPU term, pass this policy, and still be schedulable through the broader term.
Can we require every nodeSelectorTerm to contain an acceptable GPU label constraint instead? The Rego implementation has the same issue because term := ...nodeSelectorTerms[_] is also existential.
Suggested behavior:
- one valid GPU term + one broad/non-GPU term => deny
- all required terms include acceptable GPU label constraints => allow
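On the Rego side, the "every term" behavior suggested above could be sketched by collecting the terms that lack a GPU label constraint and requiring that list to be empty. The package name and both rule names are placeholders, and the per-term helper only covers the key-only `Exists` case to keep the example short.

```rego
package k8sgpunodetargeting # placeholder package name

# True only when every required nodeSelectorTerm constrains the GPU label.
all_terms_target_gpu(label_key) {
	terms := input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms
	count(terms) > 0

	# Terms are ORed by the scheduler, so a single unconstrained term is enough to deny.
	unconstrained := [term | term := terms[_]; not term_targets_gpu(term, label_key)]
	count(unconstrained) == 0
}

# Hypothetical per-term helper; only the key-only `Exists` case is shown.
term_targets_gpu(term, label_key) {
	expr := term.matchExpressions[_]
	expr.key == label_key
	expr.operator == "Exists"
}
```

A term that has no `matchExpressions` at all makes `term_targets_gpu` undefined, so it is counted as unconstrained and the pod is denied.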
    (
      size(variables.nodeLabelValues) == 0 ?
        expr.operator == "Exists" :
        expr.operator == "In" &&
In key-only mode (nodeLabelValues omitted), this currently accepts only operator: Exists. But operator: In with non-empty values also guarantees the configured label key is present, and is actually more specific than Exists.
For example, this should satisfy key-only mode:
    - key: nvidia.com/gpu.product
      operator: In
      values:
        - A100
Can we allow both of these when nodeLabelValues is empty?
- operator: Exists
- operator: In with non-empty values
Operators like DoesNotExist and NotIn should still be rejected because they do not reliably require the label key to be present. The Rego implementation has the same restriction in the count(label_values) == 0 affinity rule.
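For the Rego rule, the relaxed key-only check could look roughly like the sketch below. The package name and the `operator_requires_key` helper are placeholders. The sketch keeps the existing any-term matching and changes only which operators are accepted.

```rego
package k8sgpunodetargeting # placeholder package name

# Key-only mode: nodeLabelValues is omitted or empty.
has_matching_node_affinity(label_key) {
	count(object.get(input.parameters, "nodeLabelValues", [])) == 0
	term := input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[_]
	expr := term.matchExpressions[_]
	expr.key == label_key
	operator_requires_key(expr)
}

# `Exists` requires the label key to be present on the node.
operator_requires_key(expr) {
	expr.operator == "Exists"
}

# `In` with at least one value also requires the key to be present.
operator_requires_key(expr) {
	expr.operator == "In"
	count(expr.values) > 0
}
```

Any other operator, including `DoesNotExist` and `NotIn`, matches neither helper body and is still rejected.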
    import data.lib.exempt_container.is_exempt

    violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
This only evaluates regular containers. Kubernetes also allows GPU limits on initContainers, so a GPU-requesting init container can currently bypass the /dev/shm memory-backed mount requirement.
Can we evaluate both regular containers and init containers here, and make the same change in the CEL implementation where variables.containers is used for exemptImages and badContainers?
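One common way to cover both container lists in Rego is a partial set that gathers them before the violation rule iterates. The sketch below is illustrative only. It assumes GPUs are requested through the `nvidia.com/gpu` extended resource and that "memory-backed" means an `emptyDir` volume with `medium: Memory`. The package name, helper names, and message text are placeholders, and the `is_exempt` check is left out for brevity.

```rego
package k8sgpusharedmemory # placeholder package name

# Gather regular and init containers so both are evaluated.
gpu_candidate_containers[container] {
	container := input.review.object.spec.containers[_]
}

gpu_candidate_containers[container] {
	container := input.review.object.spec.initContainers[_]
}

violation[{"msg": msg}] {
	container := gpu_candidate_containers[_]
	requests_gpu(container)
	not has_memory_backed_shm(container)
	msg := sprintf("GPU container <%v> must mount a memory-backed /dev/shm volume", [container.name])
}

# Assumption: GPU usage is declared as a limit on nvidia.com/gpu.
requests_gpu(container) {
	to_number(container.resources.limits["nvidia.com/gpu"]) > 0
}

# Assumption: memory-backed means emptyDir with medium "Memory".
has_memory_backed_shm(container) {
	mount := container.volumeMounts[_]
	mount.mountPath == "/dev/shm"
	volume := input.review.object.spec.volumes[_]
	volume.name == mount.name
	volume.emptyDir.medium == "Memory"
}
```

With this shape, a pod whose only GPU consumer is an init container without the mount produces a violation instead of slipping through.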
Summary
Expand AI/GPU workload governance support in gatekeeper-library by:
Policies Added
Catalog and Bundle Changes
Implementation Details
Validation
```bash
make generate
make generate-website-docs
make generate-artifacthub-artifacts
./test.sh
make verify-gator-dockerized POLICY_ENGINE=rego
make verify-gator-dockerized POLICY_ENGINE=cel
```