feat: add AI workload policies for GPU governance by JaydipGabani · Pull Request #737 · open-policy-agent/gatekeeper-library

JaydipGabani · 2026-03-16T23:26:38Z

Summary

Expand AI/GPU workload governance support in gatekeeper-library by:

- adding eight GPU-focused validation policies for training and inference workloads
- adding three intent-specific catalog bundles
- regenerating library manifests, ArtifactHub assets, website docs, and catalog.yaml
- bumping Gatekeeper test versions in CI to 3.21.1 and 3.22.0

Policies Added

| Policy | ConstraintTemplate | What It Enforces |
| --- | --- | --- |
| No Unsupported GPU | K8sNoUnsupportedGpu | GPU-requesting containers must declare NVIDIA_VISIBLE_DEVICES so the image is actually GPU-capable |
| GPU Resource Limits | K8sGpuResourceLimits | Caps the number of GPUs a container may request |
| Required GPU Toleration | K8sRequiredGpuToleration | GPU pods must tolerate the GPU node taint |
| GPU Active Deadline | K8sGpuActiveDeadline | GPU jobs must set activeDeadlineSeconds to avoid runaway training workloads |
| GPU Shared Memory | K8sGpuSharedMemory | GPU workloads must mount memory-backed /dev/shm for common training frameworks |
| Required GPU Runtime Class | K8sRequiredGpuRuntimeClass | GPU pods must use an allowed runtimeClassName |
| GPU Node Targeting | K8sGpuNodeTargeting | GPU pods must target GPU-labeled nodes via nodeSelector or required node affinity |
| GPU Workload Resources | K8sGpuWorkloadResources | GPU pods must set the GPU request equal to the GPU limit, the memory request equal to the memory limit, and a CPU request |
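
As a rough illustration of the container-level rules in the table, the Python sketch below mirrors what K8sGpuResourceLimits and K8sGpuWorkloadResources are described as enforcing. It is not the Rego or CEL shipped in this PR. The `nvidia.com/gpu` resource name, the function names, and the message strings are assumptions made for the example, and the sketch treats a missing GPU request as a mismatch and compares memory quantities as raw strings, which the real policies may handle differently.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name; the PR text does not state it


def gpu_quantity(container, section):
    """Return the GPU count declared under resources.<section>, or 0 if absent."""
    resources = container.get("resources", {})
    return int(resources.get(section, {}).get(GPU_RESOURCE, 0))


def requests_gpu(container):
    """A container counts as a GPU container if it asks for GPUs in limits or requests."""
    return gpu_quantity(container, "limits") > 0 or gpu_quantity(container, "requests") > 0


def check_gpu_cap(container, max_gpu_per_container):
    """K8sGpuResourceLimits-style rule: cap the GPUs one container may ask for."""
    asked = max(gpu_quantity(container, "limits"), gpu_quantity(container, "requests"))
    if asked > max_gpu_per_container:
        return [f"{container['name']}: {asked} GPUs exceeds cap of {max_gpu_per_container}"]
    return []


def check_workload_resources(container):
    """K8sGpuWorkloadResources-style rule; non-GPU containers are skipped."""
    if not requests_gpu(container):
        return []
    requests = container.get("resources", {}).get("requests", {})
    limits = container.get("resources", {}).get("limits", {})
    problems = []
    # Simplification: a missing GPU request counts as 0 and therefore as a mismatch.
    if gpu_quantity(container, "requests") != gpu_quantity(container, "limits"):
        problems.append("GPU request must equal GPU limit")
    # Simplification: quantities are compared as raw strings ("8Gi" != "8192Mi").
    if "memory" not in requests or requests.get("memory") != limits.get("memory"):
        problems.append("memory request must equal memory limit")
    if "cpu" not in requests:
        problems.append("CPU request must be set")
    return problems
```

Each function returns a list of violation messages, so an empty list means the container passes that rule.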

Catalog and Bundle Changes

- Adds three intent-specific bundles:
  - gatekeeper-gpu-safety-policies: k8sgpuresourcelimits, k8sgpuworkloadresources, k8sgpunodetargeting, k8srequiredgputoleration
  - gatekeeper-ai-training-policies: the GPU safety policies plus k8sgpuactivedeadline and k8sgpusharedmemory
  - gatekeeper-ai-inference-policies: the GPU safety policies, for inference-focused installs
- Keeps k8snounsupportedgpu and k8srequiredgpuruntimeclass available as standalone catalog policies instead of forcing them into the default bundles
- Regenerates catalog bundle metadata, per-policy bundle membership, and bundle constraint mappings
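
The bundle layout described above can be modelled in a few lines of Python to make the membership relationships explicit. The dictionary shape and helper names here are illustrative only and do not reflect the actual catalog.yaml schema; the bundle and policy names are taken from the description.

```python
# Bundle contents as listed in the PR description.
GPU_SAFETY = [
    "k8sgpuresourcelimits",
    "k8sgpuworkloadresources",
    "k8sgpunodetargeting",
    "k8srequiredgputoleration",
]

BUNDLES = {
    "gatekeeper-gpu-safety-policies": list(GPU_SAFETY),
    "gatekeeper-ai-training-policies": GPU_SAFETY + ["k8sgpuactivedeadline", "k8sgpusharedmemory"],
    "gatekeeper-ai-inference-policies": list(GPU_SAFETY),
}

ALL_NEW_POLICIES = GPU_SAFETY + [
    "k8sgpuactivedeadline",
    "k8sgpusharedmemory",
    "k8snounsupportedgpu",
    "k8srequiredgpuruntimeclass",
]


def bundles_for(policy):
    """Per-policy bundle membership: the bundles that include the given policy."""
    return sorted(name for name, members in BUNDLES.items() if policy in members)


def standalone_policies():
    """Policies that appear in the catalog but in none of the three bundles."""
    return [p for p in ALL_NEW_POLICIES if not bundles_for(p)]
```

Under this model, `standalone_policies()` yields the two policies the description keeps out of the bundles, and the training bundle is the only one that differs from the safety baseline.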

Implementation Details

- Every new policy includes both Rego and CEL (K8sNativeValidation) implementations
- Each policy adds generated library manifests, suite.yaml coverage, sample constraints/resources, ArtifactHub assets, and website docs
- CI Gatekeeper test versions are bumped to 3.21.1 and 3.22.0

Validation

```bash
make generate
make generate-website-docs
make generate-artifacthub-artifacts
./test.sh
make verify-gator-dockerized POLICY_ENGINE=rego
make verify-gator-dockerized POLICY_ENGINE=cel
```

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

Copilot

Pull request overview

Adds a set of Gatekeeper policies (Rego + CEL) aimed at governing Kubernetes AI/GPU workloads, and wires them into the library catalog and CI.

Changes:

- Introduces 5 new GPU governance policies with dual-engine implementations and accompanying unit/integration test assets.
- Adds an ai-workload bundle plus individual policy entries to catalog.yaml.
- Updates CI Gatekeeper versions used for integration/verify matrices.

Reviewed changes

Copilot reviewed 65 out of 65 changed files in this pull request and generated 11 comments.


File	Description
src/general/requiredgputoleration/src.rego	Rego implementation enforcing GPU pods tolerate a specified taint key
src/general/requiredgputoleration/src.cel	CEL implementation for required GPU toleration
src/general/requiredgputoleration/src_test.rego	OPA unit tests for required GPU toleration
src/general/requiredgputoleration/constraint.tmpl	ConstraintTemplate for required GPU toleration (CEL + Rego targets)
src/general/requiredgpuruntimeclass/src.rego	Rego implementation enforcing allowed runtimeClassName for GPU pods
src/general/requiredgpuruntimeclass/src.cel	CEL implementation for required GPU runtime class
src/general/requiredgpuruntimeclass/src_test.rego	OPA unit tests for required GPU runtime class
src/general/requiredgpuruntimeclass/constraint.tmpl	ConstraintTemplate for required GPU runtime class (CEL + Rego targets)
src/general/gpusharedmemory/src.rego	Rego implementation enforcing memory-backed `/dev/shm` mount for GPU containers
src/general/gpusharedmemory/src.cel	CEL implementation for GPU shared memory enforcement
src/general/gpusharedmemory/src_test.rego	OPA unit tests for GPU shared memory enforcement
src/general/gpusharedmemory/constraint.tmpl	ConstraintTemplate for GPU shared memory (CEL + Rego targets)
src/general/gpuresourcelimits/src.rego	Rego implementation enforcing max GPU per container
src/general/gpuresourcelimits/src.cel	CEL implementation for GPU resource limits
src/general/gpuresourcelimits/src_test.rego	OPA unit tests for GPU resource limits
src/general/gpuresourcelimits/constraint.tmpl	ConstraintTemplate for GPU resource limits (CEL + Rego targets)
src/general/gpuactivedeadline/src.rego	Rego implementation requiring/enforcing `activeDeadlineSeconds` for GPU pods
src/general/gpuactivedeadline/src.cel	CEL implementation for GPU active deadline enforcement
src/general/gpuactivedeadline/src_test.rego	OPA unit tests for GPU active deadline enforcement
src/general/gpuactivedeadline/constraint.tmpl	ConstraintTemplate for GPU active deadline (CEL + Rego targets)
library/general/requiredgputoleration/template.yaml	Rendered library ConstraintTemplate for required GPU toleration
library/general/requiredgputoleration/suite.yaml	Gator suite for required GPU toleration
library/general/requiredgputoleration/kustomization.yaml	Kustomize entry for required GPU toleration template
library/general/requiredgputoleration/samples/no-gpu/example_allowed.yaml	Sample allowed Pod without GPU (toleration policy)
library/general/requiredgputoleration/samples/no-gpu/constraint.yaml	Sample constraint for no-gpu case (toleration policy)
library/general/requiredgputoleration/samples/gpu-without-toleration/example_disallowed.yaml	Sample disallowed GPU Pod missing toleration
library/general/requiredgputoleration/samples/gpu-without-toleration/constraint.yaml	Sample constraint for missing-toleration case
library/general/requiredgputoleration/samples/gpu-with-toleration/example_allowed.yaml	Sample allowed GPU Pod with toleration
library/general/requiredgputoleration/samples/gpu-with-toleration/constraint.yaml	Sample constraint for with-toleration case
library/general/requiredgpuruntimeclass/template.yaml	Rendered library ConstraintTemplate for required GPU runtime class
library/general/requiredgpuruntimeclass/suite.yaml	Gator suite for required GPU runtime class
library/general/requiredgpuruntimeclass/kustomization.yaml	Kustomize entry for required GPU runtime class template
library/general/requiredgpuruntimeclass/samples/no-gpu/example_allowed.yaml	Sample allowed Pod without GPU (runtimeclass policy)
library/general/requiredgpuruntimeclass/samples/no-gpu/constraint.yaml	Sample constraint for no-gpu case (runtimeclass policy)
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/example_disallowed.yaml	Sample disallowed GPU Pod missing runtimeClassName
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/constraint.yaml	Sample constraint for missing-runtimeclass case
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/example_allowed.yaml	Sample allowed GPU Pod with runtimeClassName
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/constraint.yaml	Sample constraint for allowed runtimeClassName case
library/general/gpusharedmemory/template.yaml	Rendered library ConstraintTemplate for GPU shared memory
library/general/gpusharedmemory/suite.yaml	Gator suite for GPU shared memory
library/general/gpusharedmemory/kustomization.yaml	Kustomize entry for GPU shared memory template
library/general/gpusharedmemory/samples/no-gpu/example_allowed.yaml	Sample allowed Pod without GPU (shm policy)
library/general/gpusharedmemory/samples/no-gpu/constraint.yaml	Sample constraint for no-gpu case (shm policy)
library/general/gpusharedmemory/samples/gpu-without-shm/example_disallowed.yaml	Sample disallowed GPU Pod missing shm mount
library/general/gpusharedmemory/samples/gpu-without-shm/constraint.yaml	Sample constraint for missing-shm case
library/general/gpusharedmemory/samples/gpu-with-shm/example_allowed.yaml	Sample allowed GPU Pod with memory-backed `/dev/shm`
library/general/gpusharedmemory/samples/gpu-with-shm/constraint.yaml	Sample constraint for with-shm case
library/general/gpuresourcelimits/template.yaml	Rendered library ConstraintTemplate for GPU resource limits
library/general/gpuresourcelimits/suite.yaml	Gator suite for GPU resource limits
library/general/gpuresourcelimits/kustomization.yaml	Kustomize entry for GPU resource limits template
library/general/gpuresourcelimits/samples/gpu-within-limit/example_allowed.yaml	Sample allowed GPU Pod within limit
library/general/gpuresourcelimits/samples/gpu-within-limit/constraint.yaml	Sample constraint for within-limit case
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/example_disallowed.yaml	Sample disallowed GPU Pod exceeding limit
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/constraint.yaml	Sample constraint for exceeds-limit case
library/general/gpuactivedeadline/template.yaml	Rendered library ConstraintTemplate for GPU active deadline
library/general/gpuactivedeadline/suite.yaml	Gator suite for GPU active deadline
library/general/gpuactivedeadline/kustomization.yaml	Kustomize entry for GPU active deadline template
library/general/gpuactivedeadline/samples/non-gpu-job/example_allowed.yaml	Sample allowed Pod without GPU (deadline policy)
library/general/gpuactivedeadline/samples/non-gpu-job/constraint.yaml	Sample constraint for non-gpu case (deadline policy)
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/example_disallowed.yaml	Sample disallowed GPU Pod missing deadline
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/constraint.yaml	Sample constraint for missing-deadline case
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/example_allowed.yaml	Sample allowed GPU Pod with deadline
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/constraint.yaml	Sample constraint enforcing max deadline
catalog.yaml	Adds `ai-workload` bundle and new policy catalog entries
.github/workflows/workflow.yaml	Bumps Gatekeeper versions used in CI matrices


JaydipGabani · 2026-03-17T00:42:08Z

```diff
+is_exempt(container) {
+    exempt_images := object.get(input, ["parameters", "exemptImages"], [])
+    img := container.image
+    exemption := exempt_images[_]
+    _matches_exemption(img, exemption)
+}
+
+_matches_exemption(img, exemption) {
+    not endswith(exemption, "*")
+    exemption == img
+}
+
+_matches_exemption(img, exemption) {
+    endswith(exemption, "*")
+    prefix := trim_suffix(exemption, "*")
```


We intentionally followed the same pattern as the existing nounsupportedgpu policy (the only other GPU policy in the library), which also uses inline is_exempt/_matches_exemption rather than the shared lib_exempt_container. This keeps the GPU policies self-contained and consistent with each other. Happy to migrate all GPU policies to the shared lib in a follow-up if preferred.
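
For readers less familiar with Rego, the matching behavior of the quoted helper can be approximated in Python as below. The exact-match branch follows the hunk directly. The hunk is cut off before the wildcard branch finishes, so the prefix comparison here is an assumption about how it ends, and passing `parameters` as an argument is a simplification of reading `input.parameters`.

```python
def matches_exemption(image, exemption):
    """Exact match, or prefix match when the exemption ends with '*'."""
    if exemption.endswith("*"):
        # Assumed: the wildcard branch compares the image against the trimmed prefix.
        return image.startswith(exemption[:-1])
    return image == exemption


def is_exempt(container, parameters):
    """True if the container image matches any entry in parameters['exemptImages']."""
    exempt_images = parameters.get("exemptImages", [])
    return any(matches_exemption(container["image"], e) for e in exempt_images)
```

With this reading, an entry such as `registry.example.com/ml/*` (a made-up registry path) would exempt every image under that prefix, while an entry without a trailing `*` exempts only that exact image string.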

Add 5 new policies for AI/ML workload governance on Kubernetes:
- k8sgpuresourcelimits: Enforce max GPU count per container
- k8srequiredgputoleration: Require GPU pods to tolerate GPU node taints
- k8sgpuactivedeadline: Require GPU pods to set activeDeadlineSeconds
- k8sgpusharedmemory: Require GPU containers to mount memory-backed /dev/shm
- k8srequiredgpuruntimeclass: Require GPU pods to use an allowed runtimeClassName

Each policy includes:
- Dual-engine implementation (Rego + CEL/K8sNativeValidation)
- OPA unit tests (21 tests total, all passing)
- Gator integration tests (suite.yaml with sample constraints and resources)
- exemptImages parameter support

Also adds an 'ai-workload' bundle to catalog.yaml that groups these policies with the existing k8snounsupportedgpu policy.

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
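
Three of the pod-level rules named in this commit message (toleration, active deadline, shared memory) can be sketched in Python as follows. This is a simplified model, not the PR's Rego or CEL. The `nvidia.com/gpu` resource name and all function names are assumptions, the toleration check only compares keys (it ignores operators and effects), and "memory-backed" is interpreted as an `emptyDir` volume with `medium: Memory`, which is the usual Kubernetes way to get a tmpfs mount but may not be exactly what the policy tests.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name, not stated in the PR text


def container_requests_gpu(container):
    """True if the container declares a positive GPU limit."""
    limits = container.get("resources", {}).get("limits", {})
    return int(limits.get(GPU_RESOURCE, 0)) > 0


def pod_requests_gpu(pod_spec):
    """True if any container in the pod spec requests a GPU."""
    return any(container_requests_gpu(c) for c in pod_spec.get("containers", []))


def check_toleration(pod_spec, toleration_key):
    """GPU pods must carry a toleration whose key equals toleration_key."""
    if not pod_requests_gpu(pod_spec):
        return []
    keys = {t.get("key") for t in pod_spec.get("tolerations", [])}
    if toleration_key in keys:
        return []
    return [f"GPU pod must tolerate taint key {toleration_key}"]


def check_active_deadline(pod_spec, max_active_deadline_seconds=None):
    """GPU pods must set activeDeadlineSeconds, optionally bounded by a maximum."""
    if not pod_requests_gpu(pod_spec):
        return []
    deadline = pod_spec.get("activeDeadlineSeconds")
    if deadline is None:
        return ["GPU pod must set activeDeadlineSeconds"]
    if max_active_deadline_seconds is not None and deadline > max_active_deadline_seconds:
        return [f"activeDeadlineSeconds {deadline} exceeds maximum {max_active_deadline_seconds}"]
    return []


def check_shared_memory(pod_spec):
    """Each GPU container must mount a Memory-medium emptyDir volume at /dev/shm."""
    memory_volumes = {
        v["name"]
        for v in pod_spec.get("volumes", [])
        if v.get("emptyDir", {}).get("medium") == "Memory"
    }
    problems = []
    for c in pod_spec.get("containers", []):
        if not container_requests_gpu(c):
            continue
        has_shm = any(
            m.get("mountPath") == "/dev/shm" and m.get("name") in memory_volumes
            for m in c.get("volumeMounts", [])
        )
        if not has_shm:
            problems.append(f"{c['name']}: GPU container must mount memory-backed /dev/shm")
    return problems
```

All three checks return an empty list for pods that request no GPU, matching the "no-gpu" allowed samples listed for these policies.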

- Add nounsupportedgpu source and library files (needed for ai-workload bundle)
- Fix catalog templatePath URLs to point to this branch instead of master
- Add k8snounsupportedgpu policy entry to catalog

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>


Copilot

Pull request overview

This PR expands the Gatekeeper Library’s AI/GPU workload governance by adding new GPU-focused validation policies (with both Rego and CEL implementations), publishing the generated library + ArtifactHub assets for those policies, and updating the website sidebar/catalog presentation and CI Gatekeeper versions.

Changes:

- Added GPU governance policies (Rego + CEL) with accompanying unit tests, ConstraintTemplates, suites, and sample resources.
- Added “AI Workload Policies” website navigation (profiles for GPU Safety / Training / Inference) driven from bundle metadata.
- Updated CI Gatekeeper test matrix versions and regenerated/published generated artifacts (library + ArtifactHub).

Reviewed changes

Copilot reviewed 214 out of 215 changed files in this pull request and generated 3 comments.


File	Description
website/sidebars.js	Adds “AI Workload Policies” navigation categories and links to GPU policy docs.
scripts/website/sidebars-template.js	Introduces sidebar template placeholders for AI workload policy groupings.
scripts/website/generate.go	Generates AI workload sidebar sections from bundle metadata and filters general policies accordingly.
scripts/website/go.mod	Updates Go version for website generator module and adds indirect dependency.
scripts/website/go.sum	Adds go.sum entries for the new indirect dependency.
go.work	Updates Go workspace version used across scripts modules.
.github/workflows/workflow.yaml	Bumps Gatekeeper versions tested in CI matrices.
src/general/nounsupportedgpu/src.rego	New Rego policy: require NVIDIA_VISIBLE_DEVICES when requesting GPUs.
src/general/nounsupportedgpu/src.cel	New CEL policy equivalent for nounsupportedgpu.
src/general/nounsupportedgpu/src_test.rego	Unit tests for nounsupportedgpu Rego policy behavior.
src/general/nounsupportedgpu/lib_exempt_container.rego	Exemption helper used by nounsupportedgpu.
src/general/nounsupportedgpu/constraint.tmpl	ConstraintTemplate source for nounsupportedgpu (Rego + CEL + libs).
src/general/gpuresourcelimits/src.rego	New Rego policy to cap GPUs per container.
src/general/gpuresourcelimits/src.cel	New CEL policy equivalent for gpuresourcelimits.
src/general/gpuresourcelimits/src_test.rego	Unit tests for gpuresourcelimits.
src/general/gpuresourcelimits/lib_exempt_container.rego	Exemption helper used by gpuresourcelimits.
src/general/gpuresourcelimits/constraint.tmpl	ConstraintTemplate source for gpuresourcelimits.
src/general/requiredgputoleration/src.rego	New Rego policy requiring a GPU taint toleration for GPU pods.
src/general/requiredgputoleration/src.cel	New CEL policy equivalent for requiredgputoleration.
src/general/requiredgputoleration/src_test.rego	Unit tests for requiredgputoleration.
src/general/requiredgputoleration/lib_exempt_container.rego	Exemption helper used by requiredgputoleration.
src/general/requiredgputoleration/constraint.tmpl	ConstraintTemplate source for requiredgputoleration.
src/general/requiredgpuruntimeclass/src.rego	New Rego policy requiring an allowed runtimeClassName for GPU pods.
src/general/requiredgpuruntimeclass/src.cel	New CEL policy equivalent for requiredgpuruntimeclass.
src/general/requiredgpuruntimeclass/src_test.rego	Unit tests for requiredgpuruntimeclass.
src/general/requiredgpuruntimeclass/lib_exempt_container.rego	Exemption helper used by requiredgpuruntimeclass.
src/general/requiredgpuruntimeclass/constraint.tmpl	ConstraintTemplate source for requiredgpuruntimeclass.
src/general/gpuactivedeadline/src.rego	New Rego policy requiring activeDeadlineSeconds for GPU pods (with optional max).
src/general/gpuactivedeadline/src.cel	New CEL policy equivalent for gpuactivedeadline.
src/general/gpuactivedeadline/src_test.rego	Unit tests for gpuactivedeadline.
src/general/gpuactivedeadline/lib_exempt_container.rego	Exemption helper used by gpuactivedeadline.
src/general/gpuactivedeadline/constraint.tmpl	ConstraintTemplate source for gpuactivedeadline.
src/general/gpusharedmemory/src.rego	New Rego policy requiring memory-backed /dev/shm for GPU containers.
src/general/gpusharedmemory/src.cel	New CEL policy equivalent for gpusharedmemory.
src/general/gpusharedmemory/src_test.rego	Unit tests for gpusharedmemory.
src/general/gpusharedmemory/lib_exempt_container.rego	Exemption helper used by gpusharedmemory.
src/general/gpusharedmemory/constraint.tmpl	ConstraintTemplate source for gpusharedmemory.
src/general/gpunodetargeting/src.rego	New Rego policy requiring GPU node targeting via nodeSelector or required node affinity.
src/general/gpunodetargeting/src.cel	New CEL policy equivalent for gpunodetargeting.
src/general/gpunodetargeting/src_test.rego	Unit tests for gpunodetargeting.
src/general/gpunodetargeting/lib_exempt_container.rego	Exemption helper used by gpunodetargeting.
src/general/gpunodetargeting/constraint.tmpl	ConstraintTemplate source for gpunodetargeting.
src/general/gpuworkloadresources/src.rego	New Rego policy enforcing GPU request=limit, memory request=limit, and CPU requests for GPU pods.
src/general/gpuworkloadresources/src.cel	New CEL policy equivalent for gpuworkloadresources.
src/general/gpuworkloadresources/src_test.rego	Unit tests for gpuworkloadresources.
src/general/gpuworkloadresources/lib_exempt_container.rego	Exemption helper used by gpuworkloadresources.
src/general/gpuworkloadresources/constraint.tmpl	ConstraintTemplate source for gpuworkloadresources.
library/general/nounsupportedgpu/template.yaml	Generated ConstraintTemplate for library distribution.
library/general/nounsupportedgpu/suite.yaml	Gatekeeper test suite for nounsupportedgpu library artifact.
library/general/nounsupportedgpu/samples/no-gpu-requested/example_allowed.yaml	Sample allowed resource (non-GPU).
library/general/nounsupportedgpu/samples/no-gpu-requested/constraint.yaml	Sample constraint for nounsupportedgpu.
library/general/nounsupportedgpu/samples/gpu-with-env-var/example_allowed.yaml	Sample allowed GPU resource with env var.
library/general/nounsupportedgpu/samples/gpu-with-env-var/constraint.yaml	Sample constraint for gpu-with-env-var.
library/general/nounsupportedgpu/samples/gpu-without-env-var/example_disallowed.yaml	Sample disallowed GPU resource without env var.
library/general/nounsupportedgpu/samples/gpu-without-env-var/example_allowed_exempt.yaml	Sample allowed exempted image.
library/general/nounsupportedgpu/samples/gpu-without-env-var/constraint.yaml	Sample constraint with exemptImages.
library/general/nounsupportedgpu/kustomization.yaml	Kustomize entry for nounsupportedgpu library package.
library/general/gpuresourcelimits/template.yaml	Generated ConstraintTemplate for gpuresourcelimits.
library/general/gpuresourcelimits/suite.yaml	Gatekeeper test suite for gpuresourcelimits.
library/general/gpuresourcelimits/samples/gpu-within-limit/example_allowed.yaml	Sample allowed GPU within max.
library/general/gpuresourcelimits/samples/gpu-within-limit/constraint.yaml	Sample constraint defining maxGpuPerContainer.
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/example_disallowed.yaml	Sample disallowed GPU exceeding max.
library/general/gpuresourcelimits/samples/gpu-exceeds-limit/constraint.yaml	Sample constraint for exceeds-limit.
library/general/gpuresourcelimits/kustomization.yaml	Kustomize entry for gpuresourcelimits.
library/general/requiredgputoleration/template.yaml	Generated ConstraintTemplate for requiredgputoleration.
library/general/requiredgputoleration/suite.yaml	Gatekeeper test suite for requiredgputoleration.
library/general/requiredgputoleration/samples/no-gpu/example_allowed.yaml	Sample allowed non-GPU pod.
library/general/requiredgputoleration/samples/no-gpu/constraint.yaml	Sample constraint requiring tolerationKey.
library/general/requiredgputoleration/samples/gpu-with-toleration/example_allowed.yaml	Sample allowed GPU pod with toleration.
library/general/requiredgputoleration/samples/gpu-with-toleration/constraint.yaml	Sample constraint for gpu-with-toleration.
library/general/requiredgputoleration/samples/gpu-without-toleration/example_disallowed.yaml	Sample disallowed GPU pod missing toleration.
library/general/requiredgputoleration/samples/gpu-without-toleration/constraint.yaml	Sample constraint for gpu-without-toleration.
library/general/requiredgputoleration/kustomization.yaml	Kustomize entry for requiredgputoleration.
library/general/requiredgpuruntimeclass/template.yaml	Generated ConstraintTemplate for requiredgpuruntimeclass.
library/general/requiredgpuruntimeclass/suite.yaml	Gatekeeper test suite for requiredgpuruntimeclass.
library/general/requiredgpuruntimeclass/samples/no-gpu/example_allowed.yaml	Sample allowed non-GPU pod.
library/general/requiredgpuruntimeclass/samples/no-gpu/constraint.yaml	Sample constraint defining allowed runtime classes.
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/example_allowed.yaml	Sample allowed GPU pod with runtimeClassName.
library/general/requiredgpuruntimeclass/samples/gpu-with-runtimeclass/constraint.yaml	Sample constraint for gpu-with-runtimeclass.
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/example_disallowed.yaml	Sample disallowed GPU pod missing runtimeClassName.
library/general/requiredgpuruntimeclass/samples/gpu-without-runtimeclass/constraint.yaml	Sample constraint for gpu-without-runtimeclass.
library/general/requiredgpuruntimeclass/kustomization.yaml	Kustomize entry for requiredgpuruntimeclass.
library/general/gpuactivedeadline/template.yaml	Generated ConstraintTemplate for gpuactivedeadline.
library/general/gpuactivedeadline/suite.yaml	Gatekeeper test suite for gpuactivedeadline.
library/general/gpuactivedeadline/samples/non-gpu-job/example_allowed.yaml	Sample allowed non-GPU pod.
library/general/gpuactivedeadline/samples/non-gpu-job/constraint.yaml	Sample constraint for gpuactivedeadline.
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/example_allowed.yaml	Sample allowed GPU pod with activeDeadlineSeconds.
library/general/gpuactivedeadline/samples/gpu-job-with-deadline/constraint.yaml	Sample constraint enforcing maxActiveDeadlineSeconds.
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/example_disallowed.yaml	Sample disallowed GPU pod missing deadline.
library/general/gpuactivedeadline/samples/gpu-job-without-deadline/constraint.yaml	Sample constraint for missing deadline case.
library/general/gpuactivedeadline/kustomization.yaml	Kustomize entry for gpuactivedeadline.
library/general/gpusharedmemory/template.yaml	Generated ConstraintTemplate for gpusharedmemory.
library/general/gpusharedmemory/suite.yaml	Gatekeeper test suite for gpusharedmemory.
library/general/gpusharedmemory/samples/no-gpu/example_allowed.yaml	Sample allowed non-GPU pod.
library/general/gpusharedmemory/samples/no-gpu/constraint.yaml	Sample constraint for gpusharedmemory.
library/general/gpusharedmemory/samples/gpu-with-shm/example_allowed.yaml	Sample allowed GPU pod with /dev/shm memory-backed volume.
library/general/gpusharedmemory/samples/gpu-with-shm/constraint.yaml	Sample constraint for shm requirement.
library/general/gpusharedmemory/samples/gpu-without-shm/example_disallowed.yaml	Sample disallowed GPU pod missing shm mount.
library/general/gpusharedmemory/samples/gpu-without-shm/constraint.yaml	Sample constraint for missing shm mount case.
library/general/gpusharedmemory/kustomization.yaml	Kustomize entry for gpusharedmemory.
library/general/gpuworkloadresources/suite.yaml	Gatekeeper test suite for gpuworkloadresources.
library/general/gpuworkloadresources/samples/non-gpu-pod/example_allowed.yaml	Sample allowed non-GPU pod.
library/general/gpuworkloadresources/samples/non-gpu-pod/constraint.yaml	Sample constraint for gpuworkloadresources.
library/general/gpuworkloadresources/samples/gpu-pod-compliant/example_allowed.yaml	Sample allowed GPU pod meeting resource rules.
library/general/gpuworkloadresources/samples/gpu-pod-compliant/constraint.yaml	Sample constraint for compliant case.
library/general/gpuworkloadresources/samples/gpu-pod-memory-mismatch/example_disallowed.yaml	Sample disallowed GPU pod memory request/limit mismatch.
library/general/gpuworkloadresources/samples/gpu-pod-memory-mismatch/constraint.yaml	Sample constraint for memory mismatch case.
library/general/gpuworkloadresources/samples/gpu-pod-cpu-request-missing/example_disallowed.yaml	Sample disallowed GPU pod missing CPU request.
library/general/gpuworkloadresources/samples/gpu-pod-cpu-request-missing/constraint.yaml	Sample constraint for missing CPU request case.
library/general/gpuworkloadresources/kustomization.yaml	Kustomize entry for gpuworkloadresources.
library/general/gpunodetargeting/suite.yaml	Gatekeeper test suite for gpunodetargeting.
library/general/gpunodetargeting/samples/non-gpu-pod/example_allowed.yaml	Sample allowed non-GPU pod.
library/general/gpunodetargeting/samples/non-gpu-pod/constraint.yaml	Sample constraint for gpunodetargeting.
library/general/gpunodetargeting/samples/gpu-pod-with-node-selector/example_allowed.yaml	Sample allowed GPU pod using nodeSelector.
library/general/gpunodetargeting/samples/gpu-pod-with-node-selector/constraint.yaml	Sample constraint for nodeSelector path.
library/general/gpunodetargeting/samples/gpu-pod-with-node-affinity/example_allowed.yaml	Sample allowed GPU pod using required node affinity.
library/general/gpunodetargeting/samples/gpu-pod-with-node-affinity/constraint.yaml	Sample constraint for affinity path.
library/general/gpunodetargeting/samples/gpu-pod-without-targeting/example_disallowed.yaml	Sample disallowed GPU pod missing targeting.
library/general/gpunodetargeting/samples/gpu-pod-without-targeting/constraint.yaml	Sample constraint for missing targeting case.
library/general/gpunodetargeting/kustomization.yaml	Kustomize entry for gpunodetargeting.
artifacthub/library/general/nounsupportedgpu/1.0.0/template.yaml	Published ArtifactHub template for nounsupportedgpu.
artifacthub/library/general/nounsupportedgpu/1.0.0/suite.yaml	ArtifactHub suite for nounsupportedgpu.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/no-gpu-requested/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/no-gpu-requested/constraint.yaml	ArtifactHub sample constraint: no-gpu-requested.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-with-env-var/example_allowed.yaml	ArtifactHub sample: allowed GPU with env var.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-with-env-var/constraint.yaml	ArtifactHub sample constraint: gpu-with-env-var.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/example_disallowed.yaml	ArtifactHub sample: disallowed without env var.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/example_allowed_exempt.yaml	ArtifactHub sample: allowed via exempt image.
artifacthub/library/general/nounsupportedgpu/1.0.0/samples/gpu-without-env-var/constraint.yaml	ArtifactHub sample constraint: exemptImages.
artifacthub/library/general/nounsupportedgpu/1.0.0/kustomization.yaml	ArtifactHub kustomization for nounsupportedgpu.
artifacthub/library/general/nounsupportedgpu/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for nounsupportedgpu.
artifacthub/library/general/gpuresourcelimits/1.0.0/template.yaml	Published ArtifactHub template for gpuresourcelimits.
artifacthub/library/general/gpuresourcelimits/1.0.0/suite.yaml	ArtifactHub suite for gpuresourcelimits.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-within-limit/example_allowed.yaml	ArtifactHub sample: within limit.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-within-limit/constraint.yaml	ArtifactHub sample constraint: maxGpuPerContainer.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-exceeds-limit/example_disallowed.yaml	ArtifactHub sample: exceeds limit.
artifacthub/library/general/gpuresourcelimits/1.0.0/samples/gpu-exceeds-limit/constraint.yaml	ArtifactHub sample constraint: exceeds limit.
artifacthub/library/general/gpuresourcelimits/1.0.0/kustomization.yaml	ArtifactHub kustomization for gpuresourcelimits.
artifacthub/library/general/gpuresourcelimits/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for gpuresourcelimits.
artifacthub/library/general/requiredgputoleration/1.0.0/template.yaml	Published ArtifactHub template for requiredgputoleration.
artifacthub/library/general/requiredgputoleration/1.0.0/suite.yaml	ArtifactHub suite for requiredgputoleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/no-gpu/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/no-gpu/constraint.yaml	ArtifactHub sample constraint: tolerationKey.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-with-toleration/example_allowed.yaml	ArtifactHub sample: allowed with toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-with-toleration/constraint.yaml	ArtifactHub sample constraint: gpu-with-toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-without-toleration/example_disallowed.yaml	ArtifactHub sample: disallowed missing toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/samples/gpu-without-toleration/constraint.yaml	ArtifactHub sample constraint: gpu-without-toleration.
artifacthub/library/general/requiredgputoleration/1.0.0/kustomization.yaml	ArtifactHub kustomization for requiredgputoleration.
artifacthub/library/general/requiredgputoleration/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for requiredgputoleration.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/template.yaml	Published ArtifactHub template for requiredgpuruntimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/suite.yaml	ArtifactHub suite for requiredgpuruntimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/no-gpu/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/no-gpu/constraint.yaml	ArtifactHub sample constraint: allowedRuntimeClassNames.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-with-runtimeclass/example_allowed.yaml	ArtifactHub sample: allowed with runtimeClassName.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-with-runtimeclass/constraint.yaml	ArtifactHub sample constraint: gpu-with-runtimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-without-runtimeclass/example_disallowed.yaml	ArtifactHub sample: disallowed missing runtimeClassName.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/samples/gpu-without-runtimeclass/constraint.yaml	ArtifactHub sample constraint: gpu-without-runtimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/kustomization.yaml	ArtifactHub kustomization for requiredgpuruntimeclass.
artifacthub/library/general/requiredgpuruntimeclass/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for requiredgpuruntimeclass.
artifacthub/library/general/gpusharedmemory/1.0.0/template.yaml	Published ArtifactHub template for gpusharedmemory.
artifacthub/library/general/gpusharedmemory/1.0.0/suite.yaml	ArtifactHub suite for gpusharedmemory.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/no-gpu/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/no-gpu/constraint.yaml	ArtifactHub sample constraint: no-gpu.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-with-shm/example_allowed.yaml	ArtifactHub sample: allowed with shm volume/mount.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-with-shm/constraint.yaml	ArtifactHub sample constraint: gpu-with-shm.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-without-shm/example_disallowed.yaml	ArtifactHub sample: disallowed missing shm.
artifacthub/library/general/gpusharedmemory/1.0.0/samples/gpu-without-shm/constraint.yaml	ArtifactHub sample constraint: gpu-without-shm.
artifacthub/library/general/gpusharedmemory/1.0.0/kustomization.yaml	ArtifactHub kustomization for gpusharedmemory.
artifacthub/library/general/gpusharedmemory/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for gpusharedmemory.
artifacthub/library/general/gpuactivedeadline/1.0.0/template.yaml	Published ArtifactHub template for gpuactivedeadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/suite.yaml	ArtifactHub suite for gpuactivedeadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/non-gpu-job/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/non-gpu-job/constraint.yaml	ArtifactHub sample constraint: non-gpu-job.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-with-deadline/example_allowed.yaml	ArtifactHub sample: allowed with deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-with-deadline/constraint.yaml	ArtifactHub sample constraint: max deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-without-deadline/example_disallowed.yaml	ArtifactHub sample: disallowed missing deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/samples/gpu-job-without-deadline/constraint.yaml	ArtifactHub sample constraint: missing deadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/kustomization.yaml	ArtifactHub kustomization for gpuactivedeadline.
artifacthub/library/general/gpuactivedeadline/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for gpuactivedeadline.
artifacthub/library/general/gpuworkloadresources/1.0.0/suite.yaml	ArtifactHub suite for gpuworkloadresources.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/non-gpu-pod/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/non-gpu-pod/constraint.yaml	ArtifactHub sample constraint: non-gpu-pod.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-compliant/example_allowed.yaml	ArtifactHub sample: compliant GPU pod.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-compliant/constraint.yaml	ArtifactHub sample constraint: gpu-pod-compliant.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-memory-mismatch/example_disallowed.yaml	ArtifactHub sample: memory mismatch.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-memory-mismatch/constraint.yaml	ArtifactHub sample constraint: memory mismatch.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-cpu-request-missing/example_disallowed.yaml	ArtifactHub sample: missing CPU request.
artifacthub/library/general/gpuworkloadresources/1.0.0/samples/gpu-pod-cpu-request-missing/constraint.yaml	ArtifactHub sample constraint: missing CPU request.
artifacthub/library/general/gpuworkloadresources/1.0.0/kustomization.yaml	ArtifactHub kustomization for gpuworkloadresources.
artifacthub/library/general/gpuworkloadresources/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for gpuworkloadresources.
artifacthub/library/general/gpunodetargeting/1.0.0/suite.yaml	ArtifactHub suite for gpunodetargeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/non-gpu-pod/example_allowed.yaml	ArtifactHub sample: allowed non-GPU pod.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/non-gpu-pod/constraint.yaml	ArtifactHub sample constraint: non-gpu-pod.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-without-targeting/example_disallowed.yaml	ArtifactHub sample: missing targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-without-targeting/constraint.yaml	ArtifactHub sample constraint: missing targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-selector/example_allowed.yaml	ArtifactHub sample: nodeSelector targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-selector/constraint.yaml	ArtifactHub sample constraint: nodeSelector.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-affinity/example_allowed.yaml	ArtifactHub sample: affinity targeting.
artifacthub/library/general/gpunodetargeting/1.0.0/samples/gpu-pod-with-node-affinity/constraint.yaml	ArtifactHub sample constraint: affinity.
artifacthub/library/general/gpunodetargeting/1.0.0/kustomization.yaml	ArtifactHub kustomization for gpunodetargeting.
artifacthub/library/general/gpunodetargeting/1.0.0/artifacthub-pkg.yml	ArtifactHub package metadata for gpunodetargeting.


Copilot · 2026-04-27T23:54:42Z

+has_matching_node_affinity(label_key) {
+  term := input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[_]
+  expr := term.matchExpressions[_]
+  expr.key == label_key
+  label_values := object.get(input.parameters, "nodeLabelValues", [])
+  count(label_values) > 0
+  expr.operator == "In"
+  expr.values[_] == label_values[_]


In has_matching_node_affinity (the count(label_values) > 0 case), the rule succeeds if any affinity value overlaps with nodeLabelValues (expr.values[_] == label_values[_]). That allows an affinity like values: ["true", "false"] when only ["true"] is allowed, which would still permit scheduling onto disallowed nodes. Tighten the check so that all expr.values are within the allowed nodeLabelValues (i.e., require expr.values to be a subset of label_values).
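The difference between the overlap check and the suggested subset check can be sketched in Python. This is a minimal model of the comparison logic only, not the policy's Rego; the function names `overlap_match` and `subset_match` are hypothetical.

```python
def overlap_match(expr_values, allowed_values):
    """Behavior described in the comment: passes if ANY affinity value is allowed."""
    return any(value in allowed_values for value in expr_values)


def subset_match(expr_values, allowed_values):
    """Suggested behavior: passes only if EVERY affinity value is allowed."""
    return len(expr_values) > 0 and all(
        value in allowed_values for value in expr_values
    )


allowed = ["true"]
print(overlap_match(["true", "false"], allowed))  # True
print(subset_match(["true", "false"], allowed))   # False
print(subset_match(["true"], allowed))            # True
```

With the subset form, an affinity listing `["true", "false"]` is rejected when only `["true"]` is allowed, which is the case the comment calls out.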

Copilot · 2026-04-27T23:54:42Z

+      variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.exists(term,
+        has(term.matchExpressions) &&
+        term.matchExpressions.exists(expr,
+          expr.key == variables.nodeLabelKey &&
+          (
+            size(variables.nodeLabelValues) == 0 ?
+              expr.operator == "Exists" :
+              expr.operator == "In" &&
+              has(expr.values) &&
+              variables.nodeLabelValues.exists(value, expr.values.exists(exprValue, exprValue == value))
+          )
+        )


hasMatchingNodeAffinity currently treats it as a match when there is any overlap between expr.values and nodeLabelValues (the nested exists checks). This allows affinities that include both allowed and disallowed values to pass. Update the logic to require that all expr.values are contained in nodeLabelValues when nodeLabelValues is non-empty (subset check).

Add generated samples, suites, and docs for AI workload GPU policy edge cases, including exemptions, disabled parameters, init and ephemeral containers, request-only GPU usage, and invalid targeting/runtime/toleration configurations. Tighten GPU node targeting CEL logic so key-only nodeSelector matching requires a non-empty value, matching the Rego implementation.

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

sozercan · 2026-05-15T17:12:07Z

Scenarios

1. `K8sGpuActiveDeadline`

Classification:
Training / batch / temporary GPU workloads

User scenario:
A platform team runs a shared GPU training cluster. Researchers submit experiments, notebooks, or batch training jobs. Sometimes a job hangs, gets forgotten, or has a bad training loop and keeps holding an A100/H100 for days.

Why someone wants it:
GPUs are expensive. A runaway job can block other users and burn cloud spend quickly. Requiring activeDeadlineSeconds forces GPU workloads to have a maximum runtime.

Example:
An ML engineer launches a training pod expected to finish in 12 hours. Due to a bug, the dataloader hangs forever. Without this policy, the pod might occupy a GPU indefinitely. With this policy, the pod must declare a deadline, such as:

activeDeadlineSeconds: 43200
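The 12-hour figure converts to 12 × 3600 = 43200 seconds. A minimal Python sketch of the kind of admission check described here, assuming the pod spec is available as a plain dict; `max_seconds` is a hypothetical cap parameter and not a confirmed name from the constraint template.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def requests_gpu(pod_spec):
    """Return True if any regular container names the GPU resource."""
    for container in pod_spec.get("containers") or []:
        resources = container.get("resources") or {}
        for section in ("limits", "requests"):
            if GPU_RESOURCE in (resources.get(section) or {}):
                return True
    return False


def deadline_violation(pod_spec, max_seconds=None):
    """Return a message if a GPU pod lacks a deadline or exceeds the cap."""
    if not requests_gpu(pod_spec):
        return None  # non-GPU pods are out of scope in this sketch
    deadline = pod_spec.get("activeDeadlineSeconds")
    if deadline is None:
        return "GPU pod must set activeDeadlineSeconds"
    if max_seconds is not None and deadline > max_seconds:
        return f"activeDeadlineSeconds {deadline} exceeds maximum {max_seconds}"
    return None


twelve_hours = 12 * 60 * 60  # 43200 seconds
gpu_container = {"name": "train", "resources": {"limits": {GPU_RESOURCE: 1}}}
print(deadline_violation({"containers": [gpu_container]}))
print(deadline_violation(
    {"containers": [gpu_container], "activeDeadlineSeconds": twelve_hours}
))
```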

Useful for:

Batch training jobs
Hyperparameter searches
Research experiments
CI jobs using GPUs
Temporary fine-tuning workloads
Scheduled GPU jobs
Ephemeral notebook/session environments with enforced timeout policies

Less useful / risky for:

Long-running inference services
GPU model servers
Persistent notebook environments, unless the platform intentionally times them out

2. `K8sGpuResourceLimits`

Classification:
Training + inference / multi-tenant GPU fairness / quota safety

User scenario:
A shared GPU cluster has nodes with 4 or 8 GPUs. The platform team wants to prevent one container from accidentally or intentionally reserving too many GPUs.

Why someone wants it:
It protects GPU fairness and prevents typos or oversized requests from monopolizing scarce hardware.

Example:
A user intends to request 1 GPU but accidentally submits:

nvidia.com/gpu: 8

On a shared cluster, that could block an entire node. The policy can cap each container to, say, 4 GPUs:

maxGpuPerContainer: 4
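As a rough illustration, the cap can be modeled in Python as a per-container comparison. This sketch assumes integer GPU quantities and a pod's containers given as plain dicts; the helper names are hypothetical, and only `maxGpuPerContainer` comes from the example above.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_count(container):
    """Largest GPU quantity named in the container's requests or limits."""
    resources = container.get("resources") or {}
    quantities = [
        int((resources.get(section) or {}).get(GPU_RESOURCE, 0))
        for section in ("requests", "limits")
    ]
    return max(quantities)


def containers_over_cap(containers, max_gpu_per_container):
    """Names of containers whose GPU count exceeds the configured cap."""
    return [
        c["name"] for c in containers if gpu_count(c) > max_gpu_per_container
    ]


typo = {"name": "train", "resources": {"limits": {GPU_RESOURCE: 8}}}
intended = {"name": "train", "resources": {"limits": {GPU_RESOURCE: 1}}}
print(containers_over_cap([typo], 4))      # ['train']
print(containers_over_cap([intended], 4))  # []
```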

Useful for:

Multi-tenant GPU clusters
Research clusters with per-user fairness
Cost-controlled cloud GPU pools
Preventing accidental over-allocation
Separating small inference/training jobs from large distributed jobs
Shared inference platforms where each service should consume only a limited number of GPUs

Less useful / should be tuned for:

Dedicated large-scale training namespaces where 8-GPU jobs are expected
Distributed training teams that legitimately need full-node GPU access
Specialized workloads requiring many GPUs per container

In those cases, the platform can raise the limit, scope the constraint by namespace, or use exemptions.

3. `K8sGpuWorkloadResources`

Classification:
Training + inference / resource hygiene / scheduling reliability

User scenario:
GPU workloads are expensive, but they also need enough CPU and memory to keep the GPU fed. If CPU or memory is under-requested, the pod may schedule onto an overloaded node, perform poorly, get OOM-killed, or leave GPUs idle.

Why someone wants it:
This policy pushes users to declare resources accurately for GPU workloads.

It enforces three main ideas:

GPU request must equal GPU limit.
Memory request must equal memory limit.
CPU request must be set.

Example:
A training container requests a GPU but sets only limits, omitting explicit CPU and memory requests:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 64Gi

The scheduler does not get a complete picture of the workload. The pod may land on a node that cannot reliably support it. This policy requires something more explicit:

resources:
  requests:
    cpu: "8"
    memory: 64Gi
    nvidia.com/gpu: 1
  limits:
    memory: 64Gi
    nvidia.com/gpu: 1
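The three checks can be sketched in Python against the two examples above. This is an illustrative model only: it looks at explicitly declared values and compares quantity strings literally (so `64Gi` and `65536Mi` would not count as equal), which may differ from how the actual policy handles defaults and units.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def resource_violations(container):
    """Return the list of rule names a GPU container breaks."""
    resources = container.get("resources") or {}
    requests = resources.get("requests") or {}
    limits = resources.get("limits") or {}
    problems = []
    if requests.get(GPU_RESOURCE) != limits.get(GPU_RESOURCE):
        problems.append("GPU request must equal GPU limit")
    if "memory" not in requests or requests["memory"] != limits.get("memory"):
        problems.append("memory request must equal memory limit")
    if "cpu" not in requests:
        problems.append("CPU request must be set")
    return problems


limits_only = {
    "resources": {"limits": {GPU_RESOURCE: 1, "memory": "64Gi"}}
}
explicit = {
    "resources": {
        "requests": {"cpu": "8", "memory": "64Gi", GPU_RESOURCE: 1},
        "limits": {"memory": "64Gi", GPU_RESOURCE: 1},
    }
}
print(resource_violations(limits_only))
print(resource_violations(explicit))  # []
```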

Useful for:

Better bin-packing
Avoiding GPU idling due to CPU starvation
Avoiding memory pressure and evictions
Improving cluster capacity planning
Making GPU workload cost attribution more accurate
Enforcing predictable resource declarations for expensive workloads
Training jobs with heavy CPU preprocessing
Inference services where predictable scheduling and capacity planning matter

Potential concern:
This policy can be strict because it applies memory/CPU expectations to containers in a GPU pod. If a pod has sidecars, log agents, proxies, or helper containers, they may also need resource declarations or image exemptions.

Platforms should pay particular attention to:

Sidecars
Init containers
Service mesh proxies
Logging/monitoring agents
Notebook helper containers

4. `K8sRequiredGpuToleration`

Classification:
Training + inference / GPU node scheduling infrastructure

User scenario:
The cluster has dedicated GPU nodes tainted like this:

nvidia.com/gpu=true:NoSchedule

This keeps normal CPU-only workloads off expensive GPU nodes. GPU workloads need a matching toleration so they can schedule there.

Why someone wants it:
It prevents GPU pods from getting stuck in Pending because they forgot the toleration required for the GPU node pool.

Example:
A user submits a GPU pod:

resources:
  limits:
    nvidia.com/gpu: 1

But forgets:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

The pod requests a GPU but cannot tolerate the GPU node taint, so it never schedules. The policy catches that at admission time and gives the user a clear error.
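A minimal Python sketch of the matching step, following the general Kubernetes rules for when a toleration matches a taint. The helper names are hypothetical, and the actual policy may accept a narrower or wider set of tolerations than this model does.

```python
def tolerates(toleration, taint):
    """Apply general Kubernetes toleration matching to a single taint."""
    effect = toleration.get("effect") or ""
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key") or ""
    operator = toleration.get("operator") or "Equal"  # Equal is the default
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    if operator == "Equal":
        same_value = (toleration.get("value") or "") == (taint.get("value") or "")
        return key == taint["key"] and same_value
    return False


def pod_tolerates(pod_spec, taint):
    """True if any toleration on the pod matches the taint."""
    return any(tolerates(t, taint) for t in pod_spec.get("tolerations") or [])


# The taint from the scenario: nvidia.com/gpu=true:NoSchedule
gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
with_toleration = {
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ]
}
print(pod_tolerates(with_toleration, gpu_taint))  # True
print(pod_tolerates({}, gpu_taint))               # False
```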

Useful for:

Dedicated GPU node pools
Clusters using taints to reserve GPU nodes
Preventing support tickets caused by unschedulable GPU pods
Separating CPU and GPU workloads
Training clusters
Inference clusters
Mixed CPU/GPU clusters

Important nuance:
A toleration does not force a pod onto GPU nodes. It only allows the pod to schedule onto tainted GPU nodes. This is usually paired with K8sGpuNodeTargeting.

5. `K8sGpuNodeTargeting`

Classification:
Training + inference / GPU placement / accelerator class selection

User scenario:
The platform has multiple node pools: CPU nodes, L4 nodes, A100 nodes, H100 nodes, spot GPU nodes, reserved GPU nodes, training GPU nodes, inference GPU nodes, and so on. GPU workloads should explicitly target the correct GPU node class.

Why someone wants it:
It prevents ambiguous scheduling and helps ensure GPU workloads land on the intended hardware.

Example:
A training job needs A100 nodes. The platform labels nodes like:

nvidia.com/gpu.product: A100

The pod should include either:

nodeSelector:
  nvidia.com/gpu.product: A100

or required node affinity targeting that label.

Useful for:

Ensuring training jobs land on training GPU pools
Ensuring inference jobs land on inference GPU pools
Selecting specific GPU types, for example A100 vs L4
Selecting GPU nodes managed by a specific autoscaler/node pool
Avoiding accidental use of expensive or specialized GPU nodes
Supporting chargeback/showback by node class
Separating on-demand and spot GPU pools
Separating reserved and shared GPU pools

Why this policy still matters even though GPU requests exist:
An nvidia.com/gpu resource request generally ensures the scheduler places the pod on a node with available GPU capacity. Labels and affinity are still valuable for selecting which GPU pool, product, cost class, or workload class is acceptable.

Important nuance from the review:
This policy needs to handle Kubernetes node affinity OR semantics correctly. If a pod has multiple nodeSelectorTerms, every schedulable OR branch needs to preserve the GPU-node targeting requirement. Otherwise, a user can include one valid term and one broad term, and the pod may still schedule through the broad term.

6. `K8sGpuSharedMemory`

Classification:
Training / distributed GPU workloads / multiprocessing-heavy workloads

User scenario:
A PyTorch, TensorFlow, NCCL, Ray, or distributed training workload uses multiple workers, dataloaders, or inter-process communication. It needs more shared memory than the default container /dev/shm.

Why someone wants it:
Without a memory-backed /dev/shm, GPU training jobs can fail, hang, crash with strange multiprocessing errors, or perform poorly.

Example:
A PyTorch training job uses multiple dataloader workers:

DataLoader(dataset, num_workers=8)

Inside a container, the default /dev/shm may be too small. The workload may fail with shared-memory or bus errors. The recommended Kubernetes pattern is usually:

volumes:
- name: dshm
  emptyDir:
    medium: Memory

containers:
- name: train
  volumeMounts:
  - name: dshm
    mountPath: /dev/shm

This policy requires GPU containers to mount a memory-backed emptyDir at /dev/shm.
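The requirement can be modeled in Python as two steps: collect the pod's memory-backed emptyDir volumes, then check that each GPU container mounts one of them at /dev/shm. This is a sketch over a pod spec given as a plain dict; the helper names are hypothetical, and it only looks at regular containers.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def uses_gpu(container):
    """True if the container names the GPU resource in requests or limits."""
    resources = container.get("resources") or {}
    return any(
        GPU_RESOURCE in (resources.get(section) or {})
        for section in ("requests", "limits")
    )


def memory_backed_volumes(pod_spec):
    """Names of emptyDir volumes with medium: Memory."""
    return {
        v["name"]
        for v in pod_spec.get("volumes") or []
        if (v.get("emptyDir") or {}).get("medium") == "Memory"
    }


def containers_missing_shm(pod_spec):
    """GPU containers without a memory-backed mount at /dev/shm."""
    volumes = memory_backed_volumes(pod_spec)
    missing = []
    for container in pod_spec.get("containers") or []:
        mounts = container.get("volumeMounts") or []
        has_shm = any(
            m.get("mountPath") == "/dev/shm" and m.get("name") in volumes
            for m in mounts
        )
        if uses_gpu(container) and not has_shm:
            missing.append(container["name"])
    return missing


pod = {
    "volumes": [{"name": "dshm", "emptyDir": {"medium": "Memory"}}],
    "containers": [{
        "name": "train",
        "resources": {"limits": {GPU_RESOURCE: 1}},
        "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
    }],
}
print(containers_missing_shm(pod))  # []
```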

Useful for:

PyTorch training
Multi-GPU training
NCCL-heavy workloads
Ray workers
Distributed data loading
Large model fine-tuning jobs
Training frameworks that rely on multiprocessing/shared memory
Batch jobs with multiple CPU workers feeding GPUs

Less useful for:

Simple GPU inference containers
Small single-process GPU jobs
Workloads that do not use shared memory heavily

Classification summary

Policy	Classification
`K8sGpuActiveDeadline`	Training / batch / temporary GPU workloads
`K8sGpuResourceLimits`	Training + inference / multi-tenant GPU fairness
`K8sGpuWorkloadResources`	Training + inference / resource hygiene
`K8sRequiredGpuToleration`	Training + inference / GPU node scheduling infrastructure
`K8sGpuNodeTargeting`	Training + inference / GPU placement and accelerator selection
`K8sGpuSharedMemory`	Training / distributed and multiprocessing-heavy GPU workloads

How they fit together

Policy	User problem it solves
`K8sGpuActiveDeadline`	“My training job hung and held a GPU forever.”
`K8sGpuResourceLimits`	“One user accidentally requested all GPUs on the node.”
`K8sGpuWorkloadResources`	“GPU pods declare resources poorly, which causes bad scheduling or idle GPUs.”
`K8sRequiredGpuToleration`	“GPU pods forget the toleration and get stuck pending.”
`K8sGpuNodeTargeting`	“GPU pods should explicitly target the right GPU node pool/type.”
`K8sGpuSharedMemory`	“Training jobs crash or hang because `/dev/shm` is too small.”

sozercan · 2026-05-15T18:00:26Z

+    !has(variables.anyObject.spec.affinity.nodeAffinity) ||
+    !has(variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution) ||
+    !has(variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms) ? false :
+      variables.anyObject.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.exists(term,


nodeSelectorTerms are ORed by Kubernetes, so checking that any term has the GPU label is not sufficient. A pod could include one valid GPU-targeting term and another broad/non-GPU term, pass this policy, and still be schedulable through the broader term.

Can we require every nodeSelectorTerm to contain an acceptable GPU label constraint instead? The Rego implementation has the same issue because term := ...nodeSelectorTerms[_] is also existential.

Suggested behavior:

one valid GPU term + one broad/non-GPU term => deny

all required terms include acceptable GPU label constraints => allow
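The suggested behavior amounts to switching from an existential check to a universal one over nodeSelectorTerms. A minimal Python sketch, with a deliberately simplified per-term predicate (it ignores value validation) and hypothetical helper names:

```python
def term_targets_gpu(term, label_key):
    """Simplified: the term has an In or Exists expression on the GPU label key."""
    return any(
        expr.get("key") == label_key and expr.get("operator") in ("In", "Exists")
        for expr in term.get("matchExpressions") or []
    )


def any_term_targets_gpu(terms, label_key):
    """Existential check, as in the reviewed code: one good term is enough."""
    return any(term_targets_gpu(t, label_key) for t in terms)


def every_term_targets_gpu(terms, label_key):
    """Universal check, as suggested: each ORed term must target GPU nodes."""
    return len(terms) > 0 and all(term_targets_gpu(t, label_key) for t in terms)


key = "nvidia.com/gpu.product"
gpu_term = {"matchExpressions": [
    {"key": key, "operator": "In", "values": ["A100"]}
]}
broad_term = {"matchExpressions": [
    {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}
]}

print(any_term_targets_gpu([gpu_term, broad_term], key))    # True
print(every_term_targets_gpu([gpu_term, broad_term], key))  # False
print(every_term_targets_gpu([gpu_term], key))              # True
```

Because Kubernetes ORs the terms, the pod in the first case could still schedule through `broad_term`, which is why only the universal form denies it.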

sozercan · 2026-05-15T18:00:43Z

+          (
+            size(variables.nodeLabelValues) == 0 ?
+              expr.operator == "Exists" :
+              expr.operator == "In" &&


In key-only mode (nodeLabelValues omitted), this currently accepts only operator: Exists. But operator: In with non-empty values also guarantees the configured label key is present, and is actually more specific than Exists.

For example, this should satisfy key-only mode:

- key: nvidia.com/gpu.product
  operator: In
  values:
  - A100

Can we allow both of these when nodeLabelValues is empty?

operator: Exists

operator: In with non-empty values

Operators like DoesNotExist and NotIn should still be rejected because they do not reliably require the label key to be present. The Rego implementation has the same restriction in the count(label_values) == 0 affinity rule.
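The proposed key-only rule can be written as a small predicate. This Python sketch is an assumption-laden model of the request, not the CEL or Rego code; `expr_requires_label` is a hypothetical name, and operators not mentioned in the comment are rejected here by default.

```python
def expr_requires_label(expr, label_key):
    """True if a matchExpression guarantees label_key is present on the node."""
    if expr.get("key") != label_key:
        return False
    operator = expr.get("operator")
    if operator == "Exists":
        return True
    if operator == "In":
        # In only guarantees the key when at least one value is listed.
        return len(expr.get("values") or []) > 0
    # DoesNotExist, NotIn, and anything else do not require the key.
    return False


key = "nvidia.com/gpu.product"
print(expr_requires_label({"key": key, "operator": "Exists"}, key))
print(expr_requires_label({"key": key, "operator": "In", "values": ["A100"]}, key))
print(expr_requires_label({"key": key, "operator": "NotIn", "values": ["L4"]}, key))
```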

sozercan · 2026-05-15T18:02:01Z

+import data.lib.exempt_container.is_exempt
+
+violation[{"msg": msg}] {
+    container := input.review.object.spec.containers[_]


This only evaluates regular containers. Kubernetes also allows GPU limits on initContainers, so a GPU-requesting init container can currently bypass the /dev/shm memory-backed mount requirement.

Can we evaluate both regular containers and init containers here, and make the same change in the CEL implementation where variables.containers is used for exemptImages and badContainers?
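The gap can be illustrated in Python by comparing a scan over `containers` only with a scan over both lists. This is a sketch over a pod spec given as a plain dict, with hypothetical helper names; it does not reproduce the policy's exemption handling.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def uses_gpu(container):
    """True if the container names the GPU resource in requests or limits."""
    resources = container.get("resources") or {}
    return any(
        GPU_RESOURCE in (resources.get(section) or {})
        for section in ("requests", "limits")
    )


def gpu_container_names(pod_spec, include_init=True):
    """Names of GPU-requesting containers, optionally including init containers."""
    candidates = list(pod_spec.get("containers") or [])
    if include_init:
        candidates += list(pod_spec.get("initContainers") or [])
    return [c["name"] for c in candidates if uses_gpu(c)]


pod = {
    "initContainers": [
        {"name": "warmup", "resources": {"limits": {GPU_RESOURCE: 1}}}
    ],
    "containers": [{"name": "app"}],
}
print(gpu_container_names(pod, include_init=False))  # []
print(gpu_container_names(pod))                      # ['warmup']
```

With `include_init=False`, the GPU-requesting init container is never evaluated, which mirrors the bypass described in the comment.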

chore: bump gatekeeper test versions to 3.21.1 and 3.22.0

978bd88

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

JaydipGabani requested a review from a team as a code owner March 16, 2026 23:26

Copilot AI review requested due to automatic review settings March 16, 2026 23:26

Copilot started reviewing on behalf of JaydipGabani March 16, 2026 23:27

Copilot AI reviewed Mar 16, 2026


JaydipGabani force-pushed the ai-workload-policies branch 2 times, most recently from 9fbaef2 to abc7702 March 17, 2026 00:40

JaydipGabani force-pushed the ai-workload-policies branch from abc7702 to 67a3edd March 17, 2026 00:47

JaydipGabani force-pushed the ai-workload-policies branch from 41c285e to d270c94 March 18, 2026 00:43

JaydipGabani added 4 commits April 10, 2026 21:22

adding node affinity and req validation policies for ai workloads

1ed2187

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

updating catalog and ai bundling to be intent specific

0fa351e

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

adding exempt container lib

f09d629

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

adding AI category in the website

64215e0

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

JaydipGabani requested a review from Copilot April 27, 2026 23:49

Copilot started reviewing on behalf of JaydipGabani April 27, 2026 23:49

Copilot AI reviewed Apr 27, 2026


JaydipGabani and others added 5 commits April 28, 2026 19:17

Fix CI setup for AI workload policies

f21164f

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

Require GPU affinity values to be allowed

66aaf7c

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

fixing ci

4cb8233

Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>

Merge branch 'master' into ai-workload-policies

7da0634

sozercan reviewed May 15, 2026



3 participants
