feat: add AI workload policies for GPU governance #737

Open
JaydipGabani wants to merge 12 commits into open-policy-agent:master from JaydipGabani:ai-workload-policies

JaydipGabani (Contributor) commented Mar 16, 2026

Summary

Expand AI/GPU workload governance support in gatekeeper-library by:

  • adding eight GPU-focused validation policies for training and inference workloads
  • adding three intent-specific catalog bundles
  • regenerating library manifests, ArtifactHub assets, website docs, and catalog.yaml
  • bumping Gatekeeper test versions in CI to 3.21.1 and 3.22.0

Policies Added

| Policy | ConstraintTemplate | What It Enforces |
| --- | --- | --- |
| No Unsupported GPU | K8sNoUnsupportedGpu | GPU-requesting containers must declare NVIDIA_VISIBLE_DEVICES so the image is actually GPU-capable |
| GPU Resource Limits | K8sGpuResourceLimits | Caps the number of GPUs a container may request |
| Required GPU Toleration | K8sRequiredGpuToleration | GPU pods must tolerate the GPU node taint |
| GPU Active Deadline | K8sGpuActiveDeadline | GPU jobs must set activeDeadlineSeconds to avoid runaway training workloads |
| GPU Shared Memory | K8sGpuSharedMemory | GPU workloads must mount memory-backed /dev/shm for common training frameworks |
| Required GPU Runtime Class | K8sRequiredGpuRuntimeClass | GPU pods must use an allowed runtimeClassName |
| GPU Node Targeting | K8sGpuNodeTargeting | GPU pods must target GPU-labeled nodes via nodeSelector or required node affinity |
| GPU Workload Resources | K8sGpuWorkloadResources | GPU pods must use matching GPU request/limit, memory request=limit, and set a CPU request |
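
To make the last row concrete, here is a rough Python model of the three checks K8sGpuWorkloadResources is described as enforcing. It is only a sketch, not the Rego or CEL shipped in this PR: the resource name `nvidia.com/gpu`, the function name, and the message wording are assumptions, and quantities are compared as raw strings.

```python
# Hypothetical model of the K8sGpuWorkloadResources rule; not the library's code.
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name; the PR text does not state it


def gpu_workload_violations(container):
    """Return violation messages for one container spec given as a plain dict."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # Containers that do not ask for a GPU are out of scope for this rule.
    if GPU_RESOURCE not in requests and GPU_RESOURCE not in limits:
        return []

    name = container.get("name", "<unnamed>")
    violations = []
    # Simplification: values are compared as written, so "1Gi" and "1024Mi"
    # count as different, and a request left unset counts as a mismatch.
    if requests.get(GPU_RESOURCE) != limits.get(GPU_RESOURCE):
        violations.append(f"{name}: GPU request must equal GPU limit")
    if "memory" not in requests or requests.get("memory") != limits.get("memory"):
        violations.append(f"{name}: memory request must equal memory limit")
    if "cpu" not in requests:
        violations.append(f"{name}: CPU request is required")
    return violations
```

A container with matching GPU and memory values plus a CPU request yields an empty list; a non-GPU container is skipped entirely.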

Catalog and Bundle Changes

  • Adds three intent-specific bundles:
    • gatekeeper-gpu-safety-policies: k8sgpuresourcelimits, k8sgpuworkloadresources, k8sgpunodetargeting, k8srequiredgputoleration
    • gatekeeper-ai-training-policies: the GPU safety policies plus k8sgpuactivedeadline and k8sgpusharedmemory
    • gatekeeper-ai-inference-policies: the GPU safety policies for inference-focused installs
  • Keeps k8snounsupportedgpu and k8srequiredgpuruntimeclass available as standalone catalog policies instead of forcing them into the default bundles
  • Regenerates catalog bundle metadata, per-policy bundle membership, and bundle constraint mappings
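
The bundle membership described above can be written out as plain sets. The sketch below only restates the lists in this description; the dictionary layout and the `bundles_for` helper are illustrative and do not mirror the catalog.yaml schema.

```python
# Bundle membership as listed in the PR description (layout is illustrative only).
GPU_SAFETY = {
    "k8sgpuresourcelimits",
    "k8sgpuworkloadresources",
    "k8sgpunodetargeting",
    "k8srequiredgputoleration",
}

BUNDLES = {
    "gatekeeper-gpu-safety-policies": set(GPU_SAFETY),
    # Training = GPU safety plus the deadline and shared-memory policies.
    "gatekeeper-ai-training-policies": GPU_SAFETY | {"k8sgpuactivedeadline", "k8sgpusharedmemory"},
    # Inference = the GPU safety policies only.
    "gatekeeper-ai-inference-policies": set(GPU_SAFETY),
}

# Available in the catalog, but not part of any of the three bundles.
STANDALONE = {"k8snounsupportedgpu", "k8srequiredgpuruntimeclass"}


def bundles_for(policy):
    """Hypothetical helper: names of the bundles that include the given policy."""
    return sorted(name for name, members in BUNDLES.items() if policy in members)
```

For example, `bundles_for("k8sgpusharedmemory")` returns only the training bundle, while both standalone policies map to an empty list.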

Implementation Details

  • Every new policy includes both Rego and CEL (K8sNativeValidation) implementations
  • Each policy adds generated library manifests, suite.yaml coverage, sample constraints/resources, ArtifactHub assets, and website docs
  • CI Gatekeeper test versions are bumped to 3.21.1 and 3.22.0

Validation

```bash
make generate
make generate-website-docs
make generate-artifacthub-artifacts
./test.sh
make verify-gator-dockerized POLICY_ENGINE=rego
make verify-gator-dockerized POLICY_ENGINE=cel
```

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
@JaydipGabani JaydipGabani requested a review from a team as a code owner March 16, 2026 23:26
Copilot AI review requested due to automatic review settings March 16, 2026 23:26
Copilot AI left a comment

Pull request overview

Adds a set of Gatekeeper policies (Rego + CEL) aimed at governing Kubernetes AI/GPU workloads, and wires them into the library catalog and CI.

Changes:

  • Introduces 5 new GPU governance policies with dual-engine implementations and accompanying unit/integration test assets.
  • Adds an ai-workload bundle plus individual policy entries to catalog.yaml.
  • Updates CI Gatekeeper versions used for integration/verify matrices.

Reviewed changes

Copilot reviewed 65 out of 65 changed files in this pull request and generated 11 comments.

File Description
src/general/requiredgputoleration/src.rego Rego implementation enforcing GPU pods tolerate a specified taint key
src/general/requiredgputoleration/src.cel CEL implementation for required GPU toleration
src/general/requiredgputoleration/src_test.rego OPA unit tests for required GPU toleration
src/general/requiredgputoleration/constraint.tmpl ConstraintTemplate for required GPU toleration (CEL + Rego targets)
src/general/requiredgpuruntimeclass/src.rego Rego implementation enforcing allowed runtimeClassName for GPU pods
src/general/requiredgpuruntimeclass/src.cel CEL implementation for required GPU runtime class
src/general/requiredgpuruntimeclass/src_test.rego OPA unit tests for required GPU runtime class
src/general/requiredgpuruntimeclass/constraint.tmpl ConstraintTemplate for required GPU runtime class (CEL + Rego targets)
src/general/gpusharedmemory/src.rego Rego implementation enforcing memory-backed /dev/shm mount for GPU containers
src/general/gpusharedmemory/src.cel CEL implementation for GPU shared memory enforcement
src/general/gpusharedmemory/src_test.rego OPA unit tests for GPU shared memory enforcement
src/general/gpusharedmemory/constraint.tmpl ConstraintTemplate for GPU shared memory (CEL + Rego targets)
src/general/gpuresourcelimits/src.rego Rego implementation enforcing max GPU per container
src/general/gpuresourcelimits/src.cel CEL implementation for GPU resource limits
src/general/gpuresourcelimits/src_test.rego OPA unit tests for GPU resource limits
src/general/gpuresourcelimits/constraint.tmpl ConstraintTemplate for GPU resource limits (CEL + Rego targets)
src/general/gpuactivedeadline/src.rego Rego implementation requiring/enforcing activeDeadlineSeconds for GPU pods
src/general/gpuactivedeadline/src.cel CEL implementation for GPU active deadline enforcement
src/general/gpuactivedeadline/src_test.rego OPA unit tests for GPU active deadline enforcement
src/general/gpuactivedeadline/constraint.tmpl ConstraintTemplate for GPU active deadline (CEL + Rego targets)
library/general/requiredgputoleration/template.yaml Rendered library ConstraintTemplate for required GPU toleration
library/general/requiredgputoleration/suite.yaml Gator suite for required GPU toleration
library/general/requiredgputoleration/kustomization.yaml Kustomize entry for required GPU toleration template
library/general/requiredgputoleration/samples/no-gpu/example_allowed.yaml Sample allowed Pod without GPU (toleration policy)
library/general/requiredgputoleration/samples/no-gpu/constraint.yaml Sample constraint for no-gpu case (toleration policy)
library/general/requiredgputoleration/samples/gpu-without-toleration/example_disallowed.yaml Sample disallowed GPU Pod missing toleration
library/general/requiredgputoleration/samples/gpu-without-toleration/constraint.yaml Sample constraint for missing-toleration case
library/general/requiredgputoleration/samples/gpu-with-toleration/example_allowed.yaml Sample allowed GPU Pod with toleration
library/general/requiredgputoleration/samples/gpu-with-toleration/constraint.yaml Sample constraint for with-toleration case
library/general/requiredgpuruntimeclass/template.yaml Rendered library ConstraintTemplate for required GPU runtime class
library/general/requiredgpuruntimeclass/suite.yaml Gator suite for required GPU runtime class
library/general/requiredgpuruntimeclass/kustomization.yaml Kustomize entry for required GPU runtime class template
library/general/requiredgpuruntimeclass/samples/no-gpu/example_allowed.yaml Sample allowed Pod without GPU (runtimeclass policy)
library/general/requiredgpuruntimeclass/samples/no-gpu/constraint.yaml Sample constraint for no-gpu case (runtimeclass policy)
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/example_disallowed.yaml Sample disallowed GPU Pod missing runtimeClassName
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/constraint.yaml Sample constraint for missing-runtimeclass case
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/example_allowed.yaml Sample allowed GPU Pod with runtimeClassName
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/constraint.yaml Sample constraint for allowed runtimeClassName case
library/general/gpusharedmemory/template.yaml Rendered library ConstraintTemplate for GPU shared memory
library/general/gpusharedmemory/suite.yaml Gator suite for GPU shared memory
library/general/gpusharedmemory/kustomization.yaml Kustomize entry for GPU shared memory template
library/general/gpusharedmemory/samples/no-gpu/example_allowed.yaml Sample allowed Pod without GPU (shm policy)
library/general/gpusharedmemory/samples/no-gpu/constraint.yaml Sample constraint for no-gpu case (shm policy)
library/general/gpusharedmemory/samples/gpu-without-shm/example_disallowed.yaml Sample disallowed GPU Pod missing shm mount
library/general/gpusharedmemory/samples/gpu-without-shm/constraint.yaml Sample constraint for missing-shm case
library/general/gpusharedmemory/samples/gpu-with-shm/example_allowed.yaml Sample allowed GPU Pod with memory-backed /dev/shm
library/general/gpusharedmemory/samples/gpu-with-shm/constraint.yaml Sample constraint for with-shm case
library/general/gpuresourcelimits/template.yaml Rendered library ConstraintTemplate for GPU resource limits
library/general/gpuresourcelimits/suite.yaml Gator suite for GPU resource limits
library/general/gpuresourcelimits/kustomization.yaml Kustomize entry for GPU resource limits template
library/general/gpuresourcelimits/samples/gpu-within-limit/example_allowed.yaml Sample allowed GPU Pod within limit
library/general/gpuresourcelimits/samples/gpu-within-limit/constraint.yaml Sample constraint for within-limit case
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/example_disallowed.yaml Sample disallowed GPU Pod exceeding limit
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/constraint.yaml Sample constraint for exceeds-limit case
library/general/gpuactivedeadline/template.yaml Rendered library ConstraintTemplate for GPU active deadline
library/general/gpuactivedeadline/suite.yaml Gator suite for GPU active deadline
library/general/gpuactivedeadline/kustomization.yaml Kustomize entry for GPU active deadline template
library/general/gpuactivedeadline/samples/non-gpu-job/example_allowed.yaml Sample allowed Pod without GPU (deadline policy)
library/general/gpuactivedeadline/samples/non-gpu-job/constraint.yaml Sample constraint for non-gpu case (deadline policy)
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/example_disallowed.yaml Sample disallowed GPU Pod missing deadline
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/constraint.yaml Sample constraint for missing-deadline case
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/example_allowed.yaml Sample allowed GPU Pod with deadline
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/constraint.yaml Sample constraint enforcing max deadline
catalog.yaml Adds ai-workload bundle and new policy catalog entries
.github/workflows/workflow.yaml Bumps Gatekeeper versions used in CI matrices

Comment thread src/general/gpuactivedeadline/src.cel
Comment thread src/general/gpusharedmemory/src.cel
Comment thread src/general/gpuresourcelimits/src.rego Outdated
Comment on lines +26 to +40
```rego
is_exempt(container) {
    exempt_images := object.get(input, ["parameters", "exemptImages"], [])
    img := container.image
    exemption := exempt_images[_]
    _matches_exemption(img, exemption)
}

_matches_exemption(img, exemption) {
    not endswith(exemption, "*")
    exemption == img
}

_matches_exemption(img, exemption) {
    endswith(exemption, "*")
    prefix := trim_suffix(exemption, "*")
```
Contributor Author

We intentionally followed the same pattern as the existing nounsupportedgpu policy (the only other GPU policy in the library), which also uses inline is_exempt/_matches_exemption rather than the shared lib_exempt_container. This keeps the GPU policies self-contained and consistent with each other. Happy to migrate all GPU policies to the shared lib in a follow-up if preferred.
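
For readers less familiar with Rego, the inline exemption pattern quoted in this thread behaves roughly like the Python sketch below. The quoted hunk stops before the last line of the wildcard rule, so the prefix comparison here is an assumption about how that rule ends, and the Python function names are made up for illustration.

```python
def matches_exemption(image, exemption):
    """Exact match, or prefix match when the exemption ends with '*'."""
    if exemption.endswith("*"):
        # Mirrors trim_suffix(exemption, "*"). The startswith comparison is an
        # assumption, because the quoted Rego is cut off before that line.
        return image.startswith(exemption[:-1])
    return image == exemption


def is_exempt(container, parameters):
    """True if the container image matches any entry in exemptImages."""
    # A container without an image can never match, as in Rego where
    # container.image would be undefined and the rule would not fire.
    if "image" not in container:
        return False
    exempt_images = parameters.get("exemptImages", [])
    return any(matches_exemption(container["image"], e) for e in exempt_images)
```

With `exemptImages` set to `["nvcr.io/nvidia/*"]`, any image under that registry path would be exempt, while an entry without a trailing `*` only matches the identical image string.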

Comment thread src/general/requiredgpuruntimeclass/src.rego
Comment thread src/general/requiredgpuruntimeclass/src.cel
Comment thread src/general/gpuactivedeadline/src.rego Outdated
Comment thread src/general/gpusharedmemory/src.rego Outdated
Comment thread src/general/requiredgputoleration/src.rego Outdated
Comment thread src/general/requiredgpuruntimeclass/src.rego Outdated
Comment thread src/general/gpuactivedeadline/src.rego
@JaydipGabani JaydipGabani force-pushed the ai-workload-policies branch 2 times, most recently from 9fbaef2 to abc7702 Compare March 17, 2026 00:40
Add 5 new policies for AI/ML workload governance on Kubernetes:

- k8sgpuresourcelimits: Enforce max GPU count per container
- k8srequiredgputoleration: Require GPU pods to tolerate GPU node taints
- k8sgpuactivedeadline: Require GPU pods to set activeDeadlineSeconds
- k8sgpusharedmemory: Require GPU containers to mount memory-backed /dev/shm
- k8srequiredgpuruntimeclass: Require GPU pods to use an allowed runtimeClassName

Each policy includes:
- Dual-engine implementation (Rego + CEL/K8sNativeValidation)
- OPA unit tests (21 tests total, all passing)
- Gator integration tests (suite.yaml with sample constraints and resources)
- exemptImages parameter support

Also adds an 'ai-workload' bundle to catalog.yaml that groups these policies
with the existing k8snounsupportedgpu policy.

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
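
As an illustration of one policy from this commit, the k8sgpuactivedeadline rule (require activeDeadlineSeconds on GPU pods, with an optional maximum) could be modeled as below. This is a sketch under assumptions: the resource name `nvidia.com/gpu`, the helper names, and the message text are hypothetical, and the `max_seconds` argument stands in for the constraint's maximum-deadline parameter.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name; not stated in the commit


def requests_gpu(pod_spec):
    """True if any container in the pod spec asks for a GPU."""
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        if GPU_RESOURCE in resources.get("limits", {}):
            return True
        if GPU_RESOURCE in resources.get("requests", {}):
            return True
    return False


def deadline_violation(pod_spec, max_seconds=None):
    """Return a violation message, or None when the pod spec is acceptable."""
    if not requests_gpu(pod_spec):
        return None  # non-GPU pods are out of scope
    deadline = pod_spec.get("activeDeadlineSeconds")
    if deadline is None:
        return "GPU pod must set activeDeadlineSeconds"
    # The maximum is optional; it is only checked when a value is supplied.
    if max_seconds is not None and deadline > max_seconds:
        return f"activeDeadlineSeconds {deadline} exceeds maximum {max_seconds}"
    return None
```

Calling `deadline_violation(spec, max_seconds=7200)` on a GPU pod spec without a deadline returns the first message, and a non-GPU pod spec always returns `None`.
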
@JaydipGabani JaydipGabani force-pushed the ai-workload-policies branch from abc7702 to 67a3edd Compare March 17, 2026 00:47
- Add nounsupportedgpu source and library files (needed for ai-workload bundle)
- Fix catalog templatePath URLs to point to this branch instead of master
- Add k8snounsupportedgpu policy entry to catalog

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
@JaydipGabani JaydipGabani force-pushed the ai-workload-policies branch from 41c285e to d270c94 Compare March 18, 2026 00:43
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
Copilot AI left a comment

Pull request overview

This PR expands the Gatekeeper Library’s AI/GPU workload governance by adding new GPU-focused validation policies (with both Rego and CEL implementations), publishing the generated library + ArtifactHub assets for those policies, and updating the website sidebar/catalog presentation and CI Gatekeeper versions.

Changes:

  • Added GPU governance policies (Rego + CEL) with accompanying unit tests, ConstraintTemplates, suites, and sample resources.
  • Added “AI Workload Policies” website navigation (profiles for GPU Safety / Training / Inference) driven from bundle metadata.
  • Updated CI Gatekeeper test matrix versions and regenerated/published generated artifacts (library + ArtifactHub).

Reviewed changes

Copilot reviewed 214 out of 215 changed files in this pull request and generated 3 comments.

File Description
website/sidebars.js Adds “AI Workload Policies” navigation categories and links to GPU policy docs.
scripts/website/sidebars-template.js Introduces sidebar template placeholders for AI workload policy groupings.
scripts/website/generate.go Generates AI workload sidebar sections from bundle metadata and filters general policies accordingly.
scripts/website/go.mod Updates Go version for website generator module and adds indirect dependency.
scripts/website/go.sum Adds go.sum entries for the new indirect dependency.
go.work Updates Go workspace version used across scripts modules.
.github/workflows/workflow.yaml Bumps Gatekeeper versions tested in CI matrices.
src/general/nounsupportedgpu/src.rego New Rego policy: require NVIDIA_VISIBLE_DEVICES when requesting GPUs.
src/general/nounsupportedgpu/src.cel New CEL policy equivalent for nounsupportedgpu.
src/general/nounsupportedgpu/src_test.rego Unit tests for nounsupportedgpu Rego policy behavior.
src/general/nounsupportedgpu/lib_exempt_container.rego Exemption helper used by nounsupportedgpu.
src/general/nounsupportedgpu/constraint.tmpl ConstraintTemplate source for nounsupportedgpu (Rego + CEL + libs).
src/general/gpuresourcelimits/src.rego New Rego policy to cap GPUs per container.
src/general/gpuresourcelimits/src.cel New CEL policy equivalent for gpuresourcelimits.
src/general/gpuresourcelimits/src_test.rego Unit tests for gpuresourcelimits.
src/general/gpuresourcelimits/lib_exempt_container.rego Exemption helper used by gpuresourcelimits.
src/general/gpuresourcelimits/constraint.tmpl ConstraintTemplate source for gpuresourcelimits.
src/general/requiredgputoleration/src.rego New Rego policy requiring a GPU taint toleration for GPU pods.
src/general/requiredgputoleration/src.cel New CEL policy equivalent for requiredgputoleration.
src/general/requiredgputoleration/src_test.rego Unit tests for requiredgputoleration.
src/general/requiredgputoleration/lib_exempt_container.rego Exemption helper used by requiredgputoleration.
src/general/requiredgputoleration/constraint.tmpl ConstraintTemplate source for requiredgputoleration.
src/general/requiredgpuruntimeclass/src.rego New Rego policy requiring an allowed runtimeClassName for GPU pods.
src/general/requiredgpuruntimeclass/src.cel New CEL policy equivalent for requiredgpuruntimeclass.
src/general/requiredgpuruntimeclass/src_test.rego Unit tests for requiredgpuruntimeclass.
src/general/requiredgpuruntimeclass/lib_exempt_container.rego Exemption helper used by requiredgpuruntimeclass.
src/general/requiredgpuruntimeclass/constraint.tmpl ConstraintTemplate source for requiredgpuruntimeclass.
src/general/gpuactivedeadline/src.rego New Rego policy requiring activeDeadlineSeconds for GPU pods (with optional max).
src/general/gpuactivedeadline/src.cel New CEL policy equivalent for gpuactivedeadline.
src/general/gpuactivedeadline/src_test.rego Unit tests for gpuactivedeadline.
src/general/gpuactivedeadline/lib_exempt_container.rego Exemption helper used by gpuactivedeadline.
src/general/gpuactivedeadline/constraint.tmpl ConstraintTemplate source for gpuactivedeadline.
src/general/gpusharedmemory/src.rego New Rego policy requiring memory-backed /dev/shm for GPU containers.
src/general/gpusharedmemory/src.cel New CEL policy equivalent for gpusharedmemory.
src/general/gpusharedmemory/src_test.rego Unit tests for gpusharedmemory.
src/general/gpusharedmemory/lib_exempt_container.rego Exemption helper used by gpusharedmemory.
src/general/gpusharedmemory/constraint.tmpl ConstraintTemplate source for gpusharedmemory.
src/general/gpunodetargeting/src.rego New Rego policy requiring GPU node targeting via nodeSelector or required node affinity.
src/general/gpunodetargeting/src.cel New CEL policy equivalent for gpunodetargeting.
src/general/gpunodetargeting/src_test.rego Unit tests for gpunodetargeting.
src/general/gpunodetargeting/lib_exempt_container.rego Exemption helper used by gpunodetargeting.
src/general/gpunodetargeting/constraint.tmpl ConstraintTemplate source for gpunodetargeting.
src/general/gpuworkloadresources/src.rego New Rego policy enforcing GPU request=limit, memory request=limit, and CPU requests for GPU pods.
src/general/gpuworkloadresources/src.cel New CEL policy equivalent for gpuworkloadresources.
src/general/gpuworkloadresources/src_test.rego Unit tests for gpuworkloadresources.
src/general/gpuworkloadresources/lib_exempt_container.rego Exemption helper used by gpuworkloadresources.
src/general/gpuworkloadresources/constraint.tmpl ConstraintTemplate source for gpuworkloadresources.
library/general/nounsupportedgpu/template.yaml Generated ConstraintTemplate for library distribution.
library/general/nounsupportedgpu/suite.yaml Gatekeeper test suite for nounsupportedgpu library artifact.
library/general/nounsupportedgpu/samples/no-gpu-requested/example_allowed.yaml Sample allowed resource (non-GPU).
library/general/nounsupportedgpu/samples/no-gpu-requested/constraint.yaml Sample constraint for nounsupportedgpu.
library/general/nounsupportedgpu/samples/gpu-with-env-var/example_allowed.yaml Sample allowed GPU resource with env var.
library/general/nounsupportedgpu/samples/gpu-with-env-var/constraint.yaml Sample constraint for gpu-with-env-var.
library/general/nounsupportedgpu/samples/gpu-without-env-var/example_disallowed.yaml Sample disallowed GPU resource without env var.
library/general/nounsupportedgpu/samples/gpu-without-env-var/example_allowed_exempt.yaml Sample allowed exempted image.
library/general/nounsupportedgpu/samples/gpu-without-env-var/constraint.yaml Sample constraint with exemptImages.
library/general/nounsupportedgpu/kustomization.yaml Kustomize entry for nounsupportedgpu library package.
library/general/gpuresourcelimits/template.yaml Generated ConstraintTemplate for gpuresourcelimits.
library/general/gpuresourcelimits/suite.yaml Gatekeeper test suite for gpuresourcelimits.
library/general/gpuresourcelimits/samples/gpu-within-limit/example_allowed.yaml Sample allowed GPU within max.
library/general/gpuresourcelimits/samples/gpu-within-limit/constraint.yaml Sample constraint defining maxGpuPerContainer.
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/example_disallowed.yaml Sample disallowed GPU exceeding max.
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/constraint.yaml Sample constraint for exceeds-limit.
library/general/gpuresourcelimits/kustomization.yaml Kustomize entry for gpuresourcelimits.
library/general/requiredgputoleration/template.yaml Generated ConstraintTemplate for requiredgputoleration.
library/general/requiredgputoleration/suite.yaml Gatekeeper test suite for requiredgputoleration.
library/general/requiredgputoleration/samples/no-gpu/example_allowed.yaml Sample allowed non-GPU pod.
library/general/requiredgputoleration/samples/no-gpu/constraint.yaml Sample constraint requiring tolerationKey.
library/general/requiredgputoleration/samples/gpu-with-toleration/example_allowed.yaml Sample allowed GPU pod with toleration.
library/general/requiredgputoleration/samples/gpu-with-toleration/constraint.yaml Sample constraint for gpu-with-toleration.
library/general/requiredgputoleration/samples/gpu-without-toleration/example_disallowed.yaml Sample disallowed GPU pod missing toleration.
library/general/requiredgputoleration/samples/gpu-without-toleration/constraint.yaml Sample constraint for gpu-without-toleration.
library/general/requiredgputoleration/kustomization.yaml Kustomize entry for requiredgputoleration.
library/general/requiredgpuruntimeclass/template.yaml Generated ConstraintTemplate for requiredgpuruntimeclass.
library/general/requiredgpuruntimeclass/suite.yaml Gatekeeper test suite for requiredgpuruntimeclass.
library/general/requiredgpuruntimeclass/samples/no-gpu/example_allowed.yaml Sample allowed non-GPU pod.
library/general/requiredgpuruntimeclass/samples/no-gpu/constraint.yaml Sample constraint defining allowed runtime classes.
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/example_allowed.yaml Sample allowed GPU pod with runtimeClassName.
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/constraint.yaml Sample constraint for gpu-with-runtimeclass.
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/example_disallowed.yaml Sample disallowed GPU pod missing runtimeClassName.
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/constraint.yaml Sample constraint for gpu-without-runtimeclass.
library/general/requiredgpuruntimeclass/kustomization.yaml Kustomize entry for requiredgpuruntimeclass.
library/general/gpuactivedeadline/template.yaml Generated ConstraintTemplate for gpuactivedeadline.
library/general/gpuactivedeadline/suite.yaml Gatekeeper test suite for gpuactivedeadline.
library/general/gpuactivedeadline/samples/non-gpu-job/example_allowed.yaml Sample allowed non-GPU pod.
library/general/gpuactivedeadline/samples/non-gpu-job/constraint.yaml Sample constraint for gpuactivedeadline.
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/example_allowed.yaml Sample allowed GPU pod with activeDeadlineSeconds.
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/constraint.yaml Sample constraint enforcing maxActiveDeadlineSeconds.
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/example_disallowed.yaml Sample disallowed GPU pod missing deadline.
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/constraint.yaml Sample constraint for missing deadline case.
library/general/gpuactivedeadline/kustomization.yaml Kustomize entry for gpuactivedeadline.
library/general/gpusharedmemory/template.yaml Generated ConstraintTemplate for gpusharedmemory.
library/general/gpusharedmemory/suite.yaml Gatekeeper test suite for gpusharedmemory.
library/general/gpusharedmemory/samples/no-gpu/example_allowed.yaml Sample allowed non-GPU pod.
library/general/gpusharedmemory/samples/no-gpu/constraint.yaml Sample constraint for gpusharedmemory.
library/general/gpusharedmemory/samples/gpu-with-shm/example_allowed.yaml Sample allowed GPU pod with /dev/shm memory-backed volume.
library/general/gpusharedmemory/samples/gpu-with-shm/constraint.yaml Sample constraint for shm requirement.
library/general/gpusharedmemory/samples/gpu-without-shm/example_disallowed.yaml Sample disallowed GPU pod missing shm mount.
library/general/gpusharedmemory/samples/gpu-without-shm/constraint.yaml Sample constraint for missing shm mount case.
library/general/gpusharedmemory/kustomization.yaml Kustomize entry for gpusharedmemory.
library/general/gpuworkloadresources/suite.yaml Gatekeeper test suite for gpuworkloadresources.
library/general/gpuworkloadresources/samples/non-gpu-pod/example_allowed.yaml Sample allowed non-GPU pod.
library/general/gpuworkloadresources/samples/non-gpu-pod/constraint.yaml Sample constraint for gpuworkloadresources.
library/general/gpuworkloadresources/samples/gpu-pod-compliant/example_allowed.yaml Sample allowed GPU pod meeting resource rules.
library/general/gpuworkloadresources/samples/gpu-pod-compliant/constraint.yaml Sample constraint for compliant case.
library/general/gpuworkloadresources/samples/gpu-pod-memory-mismatch/example_disallowed.yaml Sample disallowed GPU pod memory request/limit mismatch.
library/general/gpuworkloadresources/samples/gpu-pod-memory-mismatch/constraint.yaml Sample constraint for memory mismatch case.
library/general/gpuworkloadresources/samples/gpu-pod-cpu-request-missing/example_disallowed.yaml Sample disallowed GPU pod missing CPU request.
library/general/gpuworkloadresources/samples/gpu-pod-cpu-request-missing/constraint.yaml Sample constraint for missing CPU request case.
library/general/gpuworkloadresources/kustomization.yaml Kustomize entry for gpuworkloadresources.
library/general/gpunodetargeting/suite.yaml Gatekeeper test suite for gpunodetargeting.
library/general/gpunodetargeting/samples/non-gpu-pod/example_allowed.yaml Sample allowed non-GPU pod.
library/general/gpunodetargeting/samples/non-gpu-pod/constraint.yaml Sample constraint for gpunodetargeting.
library/general/gpunodetargeting/samples/gpu-pod-with-node-selector/example_allowed.yaml Sample allowed GPU pod using nodeSelector.
library/general/gpunodetargeting/samples/gpu-pod-with-node-selector/constraint.yaml Sample constraint for nodeSelector path.
library/general/gpunodetargeting/samples/gpu-pod-with-node-affinity/example_allowed.yaml Sample allowed GPU pod using required node affinity.
library/general/gpunodetargeting/samples/gpu-pod-with-node-affinity/constraint.yaml Sample constraint for affinity path.
library/general/gpunodetargeting/samples/gpu-pod-without-targeting/example_disallowed.yaml Sample disallowed GPU pod missing targeting.
library/general/gpunodetargeting/samples/gpu-pod-without-targeting/constraint.yaml Sample constraint for missing targeting case.
library/general/gpunodetargeting/kustomization.yaml Kustomize entry for gpunodetargeting.
artifacthub/library/general/nounsupportedgpu/1.0.0/template.yaml Published ArtifactHub template for nounsupportedgpu.
artifacthub/library/general/nounsupportedgpu/1.0.0/suite.yaml ArtifactHub suite for nounsupportedgpu.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/no-gpu-requested/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/no-gpu-requested/constraint.yaml ArtifactHub sample constraint: no-gpu-requested.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-with-env-var/example_allowed.yaml ArtifactHub sample: allowed GPU with env var.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-with-env-var/constraint.yaml ArtifactHub sample constraint: gpu-with-env-var.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/example_disallowed.yaml ArtifactHub sample: disallowed without env var.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/example_allowed_exempt.yaml ArtifactHub sample: allowed via exempt image.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/constraint.yaml ArtifactHub sample constraint: exemptImages.
artifacthub/library/general/nounsupportedgpu/1.0.0/kustomization.yaml ArtifactHub kustomization for nounsupportedgpu.
artifacthub/library/general/nounsupportedgpu/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for nounsupportedgpu.
artifacthub/library/general/gpuresourcelimits/1.0.0/template.yaml Published ArtifactHub template for gpuresourcelimits.
artifacthub/library/general/gpuresourcelimits/1.0.0/suite.yaml ArtifactHub suite for gpuresourcelimits.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-within-limit/example_allowed.yaml ArtifactHub sample: within limit.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-within-limit/constraint.yaml ArtifactHub sample constraint: maxGpuPerContainer.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-exceeds-limit/example_disallowed.yaml ArtifactHub sample: exceeds limit.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-exceeds-limit/constraint.yaml ArtifactHub sample constraint: exceeds limit.
artifacthub/library/general/gpuresourcelimits/1.0.0/kustomization.yaml ArtifactHub kustomization for gpuresourcelimits.
artifacthub/library/general/gpuresourcelimits/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for gpuresourcelimits.
artifacthub/library/general/requiredgputoleration/1.0.0/template.yaml Published ArtifactHub template for requiredgputoleration.
artifacthub/library/general/requiredgputoleration/1.0.0/suite.yaml ArtifactHub suite for requiredgputoleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/no-gpu/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/no-gpu/constraint.yaml ArtifactHub sample constraint: tolerationKey.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-with-toleration/example_allowed.yaml ArtifactHub sample: allowed with toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-with-toleration/constraint.yaml ArtifactHub sample constraint: gpu-with-toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-without-toleration/example_disallowed.yaml ArtifactHub sample: disallowed missing toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-without-toleration/constraint.yaml ArtifactHub sample constraint: gpu-without-toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/kustomization.yaml ArtifactHub kustomization for requiredgputoleration.
artifacthub/library/general/requiredgputoleration/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for requiredgputoleration.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/template.yaml Published ArtifactHub template for requiredgpuruntimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/suite.yaml ArtifactHub suite for requiredgpuruntimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/no-gpu/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/no-gpu/constraint.yaml ArtifactHub sample constraint: allowedRuntimeClassNames.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-with-runtimeclass/example_allowed.yaml ArtifactHub sample: allowed with runtimeClassName.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-with-runtimeclass/constraint.yaml ArtifactHub sample constraint: gpu-with-runtimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-without-runtimeclass/example_disallowed.yaml ArtifactHub sample: disallowed missing runtimeClassName.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-without-runtimeclass/constraint.yaml ArtifactHub sample constraint: gpu-without-runtimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/kustomization.yaml ArtifactHub kustomization for requiredgpuruntimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for requiredgpuruntimeclass.
artifacthub/library/general/gpusharedmemory/1.0.0/template.yaml Published ArtifactHub template for gpusharedmemory.
artifacthub/library/general/gpusharedmemory/1.0.0/suite.yaml ArtifactHub suite for gpusharedmemory.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/no-gpu/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/no-gpu/constraint.yaml ArtifactHub sample constraint: no-gpu.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-with-shm/example_allowed.yaml ArtifactHub sample: allowed with shm volume/mount.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-with-shm/constraint.yaml ArtifactHub sample constraint: gpu-with-shm.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-without-shm/example_disallowed.yaml ArtifactHub sample: disallowed missing shm.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-without-shm/constraint.yaml ArtifactHub sample constraint: gpu-without-shm.
artifacthub/library/general/gpusharedmemory/1.0.0/kustomization.yaml ArtifactHub kustomization for gpusharedmemory.
artifacthub/library/general/gpusharedmemory/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for gpusharedmemory.
artifacthub/library/general/gpuactivedeadline/1.0.0/template.yaml Published ArtifactHub template for gpuactivedeadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/suite.yaml ArtifactHub suite for gpuactivedeadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/non-gpu-job/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/non-gpu-job/constraint.yaml ArtifactHub sample constraint: non-gpu-job.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-with-deadline/example_allowed.yaml ArtifactHub sample: allowed with deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-with-deadline/constraint.yaml ArtifactHub sample constraint: max deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-without-deadline/example_disallowed.yaml ArtifactHub sample: disallowed missing deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-without-deadline/constraint.yaml ArtifactHub sample constraint: missing deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/kustomization.yaml ArtifactHub kustomization for gpuactivedeadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for gpuactivedeadline.
artifacthub/library/general/gpuworkloadresources/1.0.0/suite.yaml ArtifactHub suite for gpuworkloadresources.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/non-gpu-pod/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/non-gpu-pod/constraint.yaml ArtifactHub sample constraint: non-gpu-pod.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-compliant/example_allowed.yaml ArtifactHub sample: compliant GPU pod.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-compliant/constraint.yaml ArtifactHub sample constraint: gpu-pod-compliant.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-memory-mismatch/example_disallowed.yaml ArtifactHub sample: memory mismatch.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-memory-mismatch/constraint.yaml ArtifactHub sample constraint: memory mismatch.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-cpu-request-missing/example_disallowed.yaml ArtifactHub sample: missing CPU request.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-cpu-request-missing/constraint.yaml ArtifactHub sample constraint: missing CPU request.
artifacthub/library/general/gpuworkloadresources/1.0.0/kustomization.yaml ArtifactHub kustomization for gpuworkloadresources.
artifacthub/library/general/gpuworkloadresources/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for gpuworkloadresources.
artifacthub/library/general/gpunodetargeting/1.0.0/suite.yaml ArtifactHub suite for gpunodetargeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/non-gpu-pod/example_allowed.yaml ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/non-gpu-pod/constraint.yaml ArtifactHub sample constraint: non-gpu-pod.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-without-targeting/example_disallowed.yaml ArtifactHub sample: missing targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-without-targeting/constraint.yaml ArtifactHub sample constraint: missing targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-selector/example_allowed.yaml ArtifactHub sample: nodeSelector targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-selector/constraint.yaml ArtifactHub sample constraint: nodeSelector.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-affinity/example_allowed.yaml ArtifactHub sample: affinity targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-affinity/constraint.yaml ArtifactHub sample constraint: affinity.
artifacthub/library/general/gpunodetargeting/1.0.0/kustomization.yaml ArtifactHub kustomization for gpunodetargeting.
artifacthub/library/general/gpunodetargeting/1.0.0/artifacthub-pkg.yml ArtifactHub package metadata for gpunodetargeting.


Comment thread go.work Outdated
Comment thread src/general/gpunodetargeting/src.rego Outdated
Comment on lines +79 to +86
has_matching_node_affinity(label_key) {
    term := input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[_]
    expr := term.matchExpressions[_]
    expr.key == label_key
    label_values := object.get(input.parameters, "nodeLabelValues", [])
    count(label_values) > 0
    expr.operator == "In"
    expr.values[_] == label_values[_]

Copilot AI Apr 27, 2026

In has_matching_node_affinity (the count(label_values) > 0 case), the rule succeeds if any affinity value overlaps with nodeLabelValues (expr.values[_] == label_values[_]). That allows an affinity like values: ["true", "false"] when only ["true"] is allowed, which would still permit scheduling onto disallowed nodes. Tighten the check so that all expr.values are within the allowed nodeLabelValues (i.e., require expr.values to be a subset of label_values).
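
The gap between the overlap check and the suggested subset check can be modeled outside of Rego. The Python sketch below is only an illustration of the two behaviors; the function names `overlaps` and `is_subset` are hypothetical and do not appear in the PR.

```python
def overlaps(expr_values, allowed_values):
    """Existential check: passes if ANY affinity value is allowed."""
    return any(value in allowed_values for value in expr_values)


def is_subset(expr_values, allowed_values):
    """Stricter check: passes only if EVERY affinity value is allowed."""
    return len(expr_values) > 0 and all(
        value in allowed_values for value in expr_values
    )


allowed = ["true"]           # stands in for nodeLabelValues
mixed = ["true", "false"]    # affinity values from the comment's example

print(overlaps(mixed, allowed))   # True: the mixed list slips through
print(is_subset(mixed, allowed))  # False: the subset check rejects it
```

With the subset form, an affinity that lists even one disallowed value is rejected, which is the behavior the comment asks for.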

Comment on lines +52 to +63
variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.exists(term,
  has(term.matchExpressions) &&
  term.matchExpressions.exists(expr,
    expr.key == variables.nodeLabelKey &&
    (
      size(variables.nodeLabelValues) == 0 ?
        expr.operator == "Exists" :
        expr.operator == "In" &&
        has(expr.values) &&
        variables.nodeLabelValues.exists(value, expr.values.exists(exprValue, exprValue == value))
    )
  )

Copilot AI Apr 27, 2026

hasMatchingNodeAffinity currently treats it as a match when there is any overlap between expr.values and nodeLabelValues (the nested exists checks). This allows affinities that include both allowed and disallowed values to pass. Update the logic to require that all expr.values are contained in nodeLabelValues when nodeLabelValues is non-empty (subset check).

JaydipGabani and others added 5 commits April 28, 2026 19:17
Add generated samples, suites, and docs for AI workload GPU policy edge
cases, including exemptions, disabled parameters, init and ephemeral
containers, request-only GPU usage, and invalid targeting/runtime/toleration
configurations.

Tighten GPU node targeting CEL logic so key-only nodeSelector matching
requires a non-empty value, matching the Rego implementation.

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
@sozercan
Member

Scenarios

1. K8sGpuActiveDeadline

Classification:
Training / batch / temporary GPU workloads

User scenario:
A platform team runs a shared GPU training cluster. Researchers submit experiments, notebooks, or batch training jobs. Sometimes a job hangs, gets forgotten, or has a bad training loop and keeps holding an A100/H100 for days.

Why someone wants it:
GPUs are expensive. A runaway job can block other users and burn cloud spend quickly. Requiring activeDeadlineSeconds forces GPU workloads to have a maximum runtime.

Example:
An ML engineer launches a training pod expected to finish in 12 hours. Due to a bug, the dataloader hangs forever. Without this policy, the pod might occupy a GPU indefinitely. With this policy, the pod must declare a deadline, such as:

activeDeadlineSeconds: 43200
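
The value above is simply 12 hours expressed in seconds. As a rough model of the check, the following Python sketch rejects a GPU pod spec with no deadline, and optionally one whose deadline exceeds a cap. The function name and the `max_seconds` parameter are hypothetical; the actual constraint parameters are defined by the template in this PR.

```python
def deadline_violation(pod_spec, max_seconds=None):
    """Return a violation message for a GPU pod spec, or None if it passes."""
    deadline = pod_spec.get("activeDeadlineSeconds")
    if deadline is None:
        return "GPU pod must set activeDeadlineSeconds"
    # Optional cap; `max_seconds` is an assumed parameter for illustration.
    if max_seconds is not None and deadline > max_seconds:
        return f"activeDeadlineSeconds {deadline} exceeds maximum {max_seconds}"
    return None


twelve_hours = 12 * 60 * 60  # 43200 seconds

print(deadline_violation({}))
print(deadline_violation({"activeDeadlineSeconds": twelve_hours}))
```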

Useful for:

  • Batch training jobs
  • Hyperparameter searches
  • Research experiments
  • CI jobs using GPUs
  • Temporary fine-tuning workloads
  • Scheduled GPU jobs
  • Ephemeral notebook/session environments with enforced timeout policies

Less useful / risky for:

  • Long-running inference services
  • GPU model servers
  • Persistent notebook environments, unless the platform intentionally times them out

2. K8sGpuResourceLimits

Classification:
Training + inference / multi-tenant GPU fairness / quota safety

User scenario:
A shared GPU cluster has nodes with 4 or 8 GPUs. The platform team wants to prevent one container from accidentally or intentionally reserving too many GPUs.

Why someone wants it:
It protects GPU fairness and prevents typos or oversized requests from monopolizing scarce hardware.

Example:
A user intends to request 1 GPU but accidentally submits:

nvidia.com/gpu: 8

On a shared cluster, that could block an entire node. The policy can cap each container to, say, 4 GPUs:

maxGpuPerContainer: 4
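
A minimal Python sketch of the cap, assuming GPU counts are whole numbers that may arrive as integers or strings. The helper names are hypothetical, and the real template may read requests and limits differently.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_count(container):
    """GPUs asked for by a container: the limit if set, else the request."""
    resources = container.get("resources", {})
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    return int(limits.get(GPU_RESOURCE, requests.get(GPU_RESOURCE, 0)))


def exceeds_cap(container, max_gpu_per_container):
    """True when the container asks for more GPUs than the cap allows."""
    return gpu_count(container) > max_gpu_per_container


typo = {"resources": {"limits": {GPU_RESOURCE: 8}}}
print(exceeds_cap(typo, 4))  # True: 8 GPUs against a cap of 4
```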

Useful for:

  • Multi-tenant GPU clusters
  • Research clusters with per-user fairness
  • Cost-controlled cloud GPU pools
  • Preventing accidental over-allocation
  • Separating small inference/training jobs from large distributed jobs
  • Shared inference platforms where each service should consume only a limited number of GPUs

Less useful / should be tuned for:

  • Dedicated large-scale training namespaces where 8-GPU jobs are expected
  • Distributed training teams that legitimately need full-node GPU access
  • Specialized workloads requiring many GPUs per container

In those cases, the platform can raise the limit, scope the constraint by namespace, or use exemptions.


3. K8sGpuWorkloadResources

Classification:
Training + inference / resource hygiene / scheduling reliability

User scenario:
GPU workloads are expensive, but they also need enough CPU and memory to keep the GPU fed. If CPU or memory is under-requested, the pod may schedule onto an overloaded node, perform poorly, get OOM-killed, or leave GPUs idle.

Why someone wants it:
This policy pushes users to declare resources accurately for GPU workloads.

It enforces three main ideas:

  1. GPU request must equal GPU limit.
  2. Memory request must equal memory limit.
  3. CPU request must be set.

Example:
A training container sets GPU and memory limits but declares no requests and no CPU at all:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 64Gi

Kubernetes defaults the missing GPU and memory requests to the limits, but with no CPU request the scheduler still has an incomplete picture of the workload. The pod may land on a node that cannot reliably support it. This policy requires something more explicit:

resources:
  requests:
    cpu: "8"
    memory: 64Gi
    nvidia.com/gpu: 1
  limits:
    memory: 64Gi
    nvidia.com/gpu: 1
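
The three rules can be modeled in a few lines of Python. This is a sketch that works on the manifest as written and compares quantities as plain strings, so `64Gi` and `65536Mi` would count as different; the function name is hypothetical and the real template may normalize quantities.

```python
GPU = "nvidia.com/gpu"


def workload_resource_violations(container):
    """Return the rule violations for one GPU container."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    problems = []
    # Rule 1: GPU request must equal GPU limit.
    if requests.get(GPU) is None or str(requests.get(GPU)) != str(limits.get(GPU)):
        problems.append("GPU request must equal GPU limit")
    # Rule 2: memory request must equal memory limit (string comparison only).
    if requests.get("memory") is None or requests.get("memory") != limits.get("memory"):
        problems.append("memory request must equal memory limit")
    # Rule 3: CPU request must be set.
    if requests.get("cpu") is None:
        problems.append("CPU request must be set")
    return problems


limits_only = {"resources": {"limits": {GPU: 1, "memory": "64Gi"}}}
explicit = {
    "resources": {
        "requests": {"cpu": "8", "memory": "64Gi", GPU: 1},
        "limits": {"memory": "64Gi", GPU: 1},
    }
}

print(workload_resource_violations(limits_only))
print(workload_resource_violations(explicit))
```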

Useful for:

  • Better bin-packing
  • Avoiding GPU idling due to CPU starvation
  • Avoiding memory pressure and evictions
  • Improving cluster capacity planning
  • Making GPU workload cost attribution more accurate
  • Enforcing predictable resource declarations for expensive workloads
  • Training jobs with heavy CPU preprocessing
  • Inference services where predictable scheduling and capacity planning matter

Potential concern:
This policy can be strict because it applies memory/CPU expectations to containers in a GPU pod. If a pod has sidecars, log agents, proxies, or helper containers, they may also need resource declarations or image exemptions.

Platforms should pay particular attention to:

  • Sidecars
  • Init containers
  • Service mesh proxies
  • Logging/monitoring agents
  • Notebook helper containers

4. K8sRequiredGpuToleration

Classification:
Training + inference / GPU node scheduling infrastructure

User scenario:
The cluster has dedicated GPU nodes tainted like this:

nvidia.com/gpu=true:NoSchedule

This keeps normal CPU-only workloads off expensive GPU nodes. GPU workloads need a matching toleration so they can schedule there.

Why someone wants it:
It prevents GPU pods from getting stuck in Pending because they forgot the toleration required for the GPU node pool.

Example:
A user submits a GPU pod:

resources:
  limits:
    nvidia.com/gpu: 1

But forgets:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

The pod requests a GPU but cannot tolerate the GPU node taint, so it never schedules. The policy catches that at admission time and gives the user a clear error.
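
A simplified Python model of the admission-time check follows. It only looks for a toleration whose key matches the taint key and ignores effects, values, and wildcard tolerations, which the actual template may handle differently. The function name and the `taint_key` default are assumptions for illustration.

```python
def tolerates_gpu_taint(pod_spec, taint_key="nvidia.com/gpu"):
    """True if any toleration in the pod spec names the GPU taint key."""
    return any(
        toleration.get("key") == taint_key
        for toleration in pod_spec.get("tolerations", [])
    )


with_toleration = {
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ]
}

print(tolerates_gpu_taint(with_toleration))  # True
print(tolerates_gpu_taint({}))               # False: would be denied at admission
```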

Useful for:

  • Dedicated GPU node pools
  • Clusters using taints to reserve GPU nodes
  • Preventing support tickets caused by unschedulable GPU pods
  • Separating CPU and GPU workloads
  • Training clusters
  • Inference clusters
  • Mixed CPU/GPU clusters

Important nuance:
A toleration does not force a pod onto GPU nodes. It only allows the pod to schedule onto tainted GPU nodes. This is usually paired with K8sGpuNodeTargeting.


5. K8sGpuNodeTargeting

Classification:
Training + inference / GPU placement / accelerator class selection

User scenario:
The platform has multiple node pools: CPU nodes, L4 nodes, A100 nodes, H100 nodes, spot GPU nodes, reserved GPU nodes, training GPU nodes, inference GPU nodes, and so on. GPU workloads should explicitly target the correct GPU node class.

Why someone wants it:
It prevents ambiguous scheduling and helps ensure GPU workloads land on the intended hardware.

Example:
A training job needs A100 nodes. The platform labels nodes like:

nvidia.com/gpu.product: A100

The pod should include either:

nodeSelector:
  nvidia.com/gpu.product: A100

or required node affinity targeting that label.

Useful for:

  • Ensuring training jobs land on training GPU pools
  • Ensuring inference jobs land on inference GPU pools
  • Selecting specific GPU types, for example A100 vs L4
  • Selecting GPU nodes managed by a specific autoscaler/node pool
  • Avoiding accidental use of expensive or specialized GPU nodes
  • Supporting chargeback/showback by node class
  • Separating on-demand and spot GPU pools
  • Separating reserved and shared GPU pools

Why this policy still matters even though GPU requests exist:
An nvidia.com/gpu resource request generally ensures the scheduler places the pod only on a node with available GPU capacity. Labels and affinity are still valuable for selecting which GPU pool, product, cost class, or workload class is acceptable.

Important nuance from the review:
This policy needs to handle Kubernetes node affinity OR semantics correctly. If a pod has multiple nodeSelectorTerms, every schedulable OR branch needs to preserve the GPU-node targeting requirement. Otherwise, a user can include one valid term and one broad term, and the pod may still schedule through the broad term.


6. K8sGpuSharedMemory

Classification:
Training / distributed GPU workloads / multiprocessing-heavy workloads

User scenario:
A PyTorch, TensorFlow, NCCL, Ray, or distributed training workload uses multiple workers, dataloaders, or inter-process communication. It needs more shared memory than the default container /dev/shm.

Why someone wants it:
Without a memory-backed /dev/shm, GPU training jobs can fail, hang, crash with strange multiprocessing errors, or perform poorly.

Example:
A PyTorch training job uses multiple dataloader workers:

DataLoader(dataset, num_workers=8)

Inside a container, the default /dev/shm may be too small. The workload may fail with shared-memory or bus errors. The recommended Kubernetes pattern is usually:

volumes:
- name: dshm
  emptyDir:
    medium: Memory

containers:
- name: train
  volumeMounts:
  - name: dshm
    mountPath: /dev/shm

This policy requires GPU containers to mount a memory-backed emptyDir at /dev/shm.
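
As a rough Python model of that requirement: a container passes only if it mounts, at /dev/shm, a volume that the pod defines as an emptyDir with medium Memory. The helper names are hypothetical.

```python
def memory_backed_volumes(pod_spec):
    """Names of pod volumes defined as emptyDir with medium: Memory."""
    return {
        volume["name"]
        for volume in pod_spec.get("volumes", [])
        if (volume.get("emptyDir") or {}).get("medium") == "Memory"
    }


def mounts_memory_shm(container, pod_spec):
    """True if the container mounts a memory-backed volume at /dev/shm."""
    names = memory_backed_volumes(pod_spec)
    return any(
        mount.get("mountPath") == "/dev/shm" and mount.get("name") in names
        for mount in container.get("volumeMounts", [])
    )


train = {"name": "train", "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}]}
pod_spec = {
    "volumes": [{"name": "dshm", "emptyDir": {"medium": "Memory"}}],
    "containers": [train],
}

print(mounts_memory_shm(train, pod_spec))  # True
```

A mount at /dev/shm backed by a disk-based emptyDir would not pass this check, since the volume name would not appear in the memory-backed set.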

Useful for:

  • PyTorch training
  • Multi-GPU training
  • NCCL-heavy workloads
  • Ray workers
  • Distributed data loading
  • Large model fine-tuning jobs
  • Training frameworks that rely on multiprocessing/shared memory
  • Batch jobs with multiple CPU workers feeding GPUs

Less useful for:

  • Simple GPU inference containers
  • Small single-process GPU jobs
  • Workloads that do not use shared memory heavily

Classification summary

| Policy | Classification |
| --- | --- |
| K8sGpuActiveDeadline | Training / batch / temporary GPU workloads |
| K8sGpuResourceLimits | Training + inference / multi-tenant GPU fairness |
| K8sGpuWorkloadResources | Training + inference / resource hygiene |
| K8sRequiredGpuToleration | Training + inference / GPU node scheduling infrastructure |
| K8sGpuNodeTargeting | Training + inference / GPU placement and accelerator selection |
| K8sGpuSharedMemory | Training / distributed and multiprocessing-heavy GPU workloads |

How they fit together

| Policy | User problem it solves |
| --- | --- |
| K8sGpuActiveDeadline | “My training job hung and held a GPU forever.” |
| K8sGpuResourceLimits | “One user accidentally requested all GPUs on the node.” |
| K8sGpuWorkloadResources | “GPU pods are poorly requested and cause bad scheduling or idle GPUs.” |
| K8sRequiredGpuToleration | “GPU pods forget the toleration and get stuck pending.” |
| K8sGpuNodeTargeting | “GPU pods should explicitly target the right GPU node pool/type.” |
| K8sGpuSharedMemory | “Training jobs crash or hang because /dev/shm is too small.” |

!has(variables.anyObject.spec.affinity.nodeAffinity) ||
!has(variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution) ||
!has(variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms) ? false :
variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.exists(term,
Member

nodeSelectorTerms are ORed by Kubernetes, so checking that any term has the GPU label is not sufficient. A pod could include one valid GPU-targeting term and another broad/non-GPU term, pass this policy, and still be schedulable through the broader term.

Can we require every nodeSelectorTerm to contain an acceptable GPU label constraint instead? The Rego implementation has the same issue because term := ...nodeSelectorTerms[_] is also existential.

Suggested behavior:

  • one valid GPU term + one broad/non-GPU term => deny
  • all required terms include acceptable GPU label constraints => allow
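
The suggested behavior amounts to replacing an "any term" check with an "every term" check. A small Python sketch of that difference, using hypothetical helper names and a simplified notion of an acceptable expression (matching key with operator In or Exists):

```python
def term_targets_label(term, label_key):
    """True if the term has an expression requiring the given label key."""
    return any(
        expr.get("key") == label_key and expr.get("operator") in ("In", "Exists")
        for expr in term.get("matchExpressions", [])
    )


def all_terms_target_label(terms, label_key):
    """Terms are ORed by the scheduler, so every term must carry the constraint."""
    return len(terms) > 0 and all(
        term_targets_label(term, label_key) for term in terms
    )


KEY = "nvidia.com/gpu.product"
gpu_term = {"matchExpressions": [{"key": KEY, "operator": "In", "values": ["A100"]}]}
broad_term = {"matchExpressions": [{"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}]}

print(all_terms_target_label([gpu_term, broad_term], KEY))  # False: deny
print(all_terms_target_label([gpu_term, gpu_term], KEY))    # True: allow
```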

(
size(variables.nodeLabelValues) == 0 ?
expr.operator == "Exists" :
expr.operator == "In" &&
Member

In key-only mode (nodeLabelValues omitted), this currently accepts only operator: Exists. But operator: In with non-empty values also guarantees the configured label key is present, and is actually more specific than Exists.

For example, this should satisfy key-only mode:

- key: nvidia.com/gpu.product
  operator: In
  values:
  - A100

Can we allow both of these when nodeLabelValues is empty?

  • operator: Exists
  • operator: In with non-empty values

Operators like DoesNotExist and NotIn should still be rejected because they do not reliably require the label key to be present. The Rego implementation has the same restriction in the count(label_values) == 0 affinity rule.
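
The proposed key-only rule can be written as a small predicate. This Python sketch is only a model of the requested behavior, with a hypothetical function name; it accepts Exists, accepts In with at least one value, and rejects everything else.

```python
def expr_requires_key(expr, label_key):
    """True if the match expression guarantees the label key is present."""
    if expr.get("key") != label_key:
        return False
    operator = expr.get("operator")
    if operator == "Exists":
        return True
    if operator == "In":
        # In only guarantees presence when it lists at least one value.
        return len(expr.get("values") or []) > 0
    # NotIn, DoesNotExist, Gt, Lt and anything else are rejected here.
    return False


KEY = "nvidia.com/gpu.product"
print(expr_requires_key({"key": KEY, "operator": "In", "values": ["A100"]}, KEY))  # True
print(expr_requires_key({"key": KEY, "operator": "NotIn", "values": ["L4"]}, KEY))  # False
```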

import data.lib.exempt_container.is_exempt

violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
Member

This only evaluates regular containers. Kubernetes also allows GPU limits on initContainers, so a GPU-requesting init container can currently bypass the /dev/shm memory-backed mount requirement.

Can we evaluate both regular containers and init containers here, and make the same change in the CEL implementation where variables.containers is used for exemptImages and badContainers?
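
The bypass is easy to see in a small model. In this Python sketch (hypothetical helper names, GPU detection simplified to the presence of an nvidia.com/gpu entry in requests or limits), iterating only over `containers` misses a GPU-requesting init container, while iterating over both lists finds it.

```python
GPU = "nvidia.com/gpu"


def requests_gpu(container):
    """True if the container lists the GPU resource in requests or limits."""
    resources = container.get("resources", {})
    return GPU in resources.get("limits", {}) or GPU in resources.get("requests", {})


def gpu_container_names(pod_spec, include_init=True):
    """Names of GPU-requesting containers, optionally including init containers."""
    containers = list(pod_spec.get("containers", []))
    if include_init:
        containers += pod_spec.get("initContainers", [])
    return [c["name"] for c in containers if requests_gpu(c)]


pod_spec = {
    "initContainers": [{"name": "warmup", "resources": {"limits": {GPU: 1}}}],
    "containers": [{"name": "app"}],
}

print(gpu_container_names(pod_spec, include_init=False))  # []: init container is missed
print(gpu_container_names(pod_spec))                      # ['warmup']
```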

3 participants