API: Add KelosPolicy CRD for namespace-level allowlist enforcement, restricting permitted agent types, models, images, and repositories via admission webhooks #1020

@kelos-bot

🤖 Kelos Strategist Agent @gjkim42

Area: New CRDs & API Extensions

Summary

Platform operators deploying Kelos in shared clusters have no mechanism to restrict which agent configurations are permitted in a namespace. Existing governance proposals address how many tasks can run (#675 ConcurrencyPolicy), how securely they run (#860 securityProfile), and how much they can spend (#977 budgetPolicy), but none addresses the fundamental question: which agent types, models, container images, and repositories are allowed?

This proposal introduces a KelosPolicy CRD, a namespace-scoped resource that enforces allowlist and denylist constraints on Tasks, TaskSpawners, and Workspaces at admission time. It also introduces the first Kubernetes ValidatingAdmissionWebhook in the Kelos codebase, establishing a foundation for future policy enforcement.

Problem

1. Any agent type, model, and image can be used without restriction

The JobBuilder's resolveImage() function accepts any agent type and maps it to a default image, or uses a custom image if specified. There is no validation that restricts which agent types or images are permitted in a given namespace:

// internal/controller/job_builder.go:351-358
mainContainer := corev1.Container{
    Name:            task.Spec.Type,  // Unconstrained
    Image:           image,            // Unconstrained  
    ImagePullPolicy: pullPolicy,
    Command:         []string{"/kelos_entrypoint.sh"},
    Args:            []string{prompt},
    Env:             envVars,
}

The resolveImage() function (job_builder.go:258-285) maps agent types to default images but does not validate whether the type is permitted. A Task specifying type: claude-code with model: claude-opus-4 will be accepted in any namespace, even if the platform team intended that namespace for low-cost workloads only.

This matters because model costs vary dramatically (Opus can be 10-20x more expensive per token than Haiku). Without restrictions, a single misconfigured TaskSpawner can generate unexpectedly high API costs before per-spawner budget limits (proposed in #977, #624) would catch it, because those limits require explicit opt-in on each spawner.

2. Any repository can be accessed by agents

Workspace resources accept any repository URL:

// api/v1alpha1/workspace_types.go
type WorkspaceSpec struct {
    Repo      string           `json:"repo"`  // No restriction on which repos
    Ref       string           `json:"ref,omitempty"`
    SecretRef *SecretReference `json:"secretRef,omitempty"`
    // ...
}

A Task in any namespace can reference a Workspace pointing to any accessible repository. In organizations with internal/sensitive repos, platform teams need to restrict which repos agents can access per-namespace:

  • Team namespaces should only access that team's repos
  • Staging namespaces should not access production config repos
  • External contributor namespaces should be restricted to specific repos

3. Custom images bypass all controls

Tasks support custom container images via spec.image, which completely bypasses the default agent image mapping. Without image restrictions, users could run arbitrary container images as "agent" pods:

// internal/controller/job_builder.go:258-285
func (b *JobBuilder) resolveImage(task *kelosv1alpha1.Task) string {
    if task.Spec.Image != "" {
        return task.Spec.Image  // User-specified image, unchecked
    }
    // ... default image mapping
}

This is a supply chain risk in shared clusters: platform teams need to restrict agent pods to images from approved registries.
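To make the missing control concrete, here is a minimal sketch of a registry allowlist check. The function name `imageAllowed` and the image names are hypothetical, and glob semantics are assumed to follow Go's `path.Match`, where `*` does not cross a `/`. The proposal itself does not specify the matching rules.

```go
package main

import (
	"fmt"
	"path"
)

// imageAllowed reports whether image matches at least one allowed glob.
// Assumption: globs follow path.Match, so "*" matches within a single
// path segment only ("ghcr.io/kelos-dev/*" does not cover nested paths).
func imageAllowed(allowed []string, image string) (bool, error) {
	for _, pattern := range allowed {
		ok, err := path.Match(pattern, image)
		if err != nil {
			return false, fmt.Errorf("invalid image pattern %q: %w", pattern, err)
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Illustrative patterns and image names, not taken from the Kelos codebase.
	allowed := []string{"ghcr.io/kelos-dev/*", "registry.internal.company.com/kelos/*"}
	images := []string{
		"ghcr.io/kelos-dev/claude-code:latest",
		"docker.io/unknown/agent:latest",
	}
	for _, img := range images {
		ok, err := imageAllowed(allowed, img)
		fmt.Println(img, ok, err)
	}
}
```

A check like this would have to run before `resolveImage()` output reaches the pod spec, which is why the proposal places it at admission time rather than in the controller.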

4. Zero admission webhooks exist in the codebase

The current enforcement model relies entirely on CRD-level x-kubernetes-validations rules (field-level constraints like "secretRef required for non-none credentials") and controller-side logic. There are no ValidatingWebhookConfiguration or MutatingWebhookConfiguration resources in the Helm chart or codebase.

CRD validation rules cannot:

  • Reference other resources (e.g., look up a policy CRD)
  • Perform namespace-scoped logic
  • Aggregate state across resources
  • Apply dynamic policies that change without CRD updates

This architectural gap means any governance feature that requires cross-resource or namespace-level validation is currently impossible to implement.

Proposed API

KelosPolicy CRD

apiVersion: kelos.dev/v1alpha1
kind: KelosPolicy
metadata:
  name: team-backend
  namespace: team-backend
spec:
  # Which agent types are permitted
  agents:
    allowedTypes:
      - claude-code
      - codex
    # Optional model constraints (regex patterns)
    allowedModels:
      - "claude-sonnet-.*"
      - "claude-haiku-.*"
    deniedModels:          # Takes precedence over allowedModels
      - "claude-opus-.*"

  # Which container images are permitted (glob patterns)
  images:
    allowed:
      - "ghcr.io/kelos-dev/*"
      - "registry.internal.company.com/kelos/*"

  # Which repositories agents can access
  repositories:
    allowed:
      - "https://github.com/myorg/backend-*"
      - "https://github.com/myorg/shared-libs"
    denied:
      - "https://github.com/myorg/infrastructure-*"

  # Which credential types are permitted
  credentials:
    allowedTypes:
      - oauth            # Require OAuth, disallow raw API keys

status:
  enforced: true
  stats:
    tasksAdmitted: 342
    tasksRejected: 7
    workspacesAdmitted: 15
    workspacesRejected: 1
  lastRejection:
    time: "2026-04-27T10:30:00Z"
    resource: "Task/expensive-analysis"
    reason: 'model "claude-opus-4" matches deniedModels pattern "claude-opus-.*"'
  conditions:
    - type: WebhookRegistered
      status: "True"
      message: "ValidatingWebhookConfiguration is active"

Validation behavior

| Resource Created | Fields Validated | Rejection Reason Example |
|---|---|---|
| Task | spec.type, spec.model, spec.image, spec.credentials.type | agent type "gemini" not in allowedTypes [claude-code, codex] |
| TaskSpawner | spec.taskTemplate.type, spec.taskTemplate.model, spec.taskTemplate.image | model "claude-opus-4" matches deniedModels |
| Workspace | spec.repo | repository "https://github.com/myorg/infra-secrets" matches denied pattern |
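The model check for Tasks and TaskSpawners could be sketched as follows, with `deniedModels` evaluated before `allowedModels`. The function names are hypothetical. Two behaviors are assumptions the proposal does not settle: patterns are anchored to match the whole model string, and an empty allowlist means "unrestricted".

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the first pattern that fully matches value.
// Assumption: patterns are anchored, so "claude-opus-.*" must match
// the whole model name rather than a substring.
func firstMatch(patterns []string, value string) (string, bool, error) {
	for _, p := range patterns {
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err != nil {
			return "", false, fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		if re.MatchString(value) {
			return p, true, nil
		}
	}
	return "", false, nil
}

// checkModel returns an empty string when the model is permitted,
// otherwise a rejection reason. Denied patterns take precedence.
func checkModel(allowed, denied []string, model string) (string, error) {
	p, hit, err := firstMatch(denied, model)
	if err != nil {
		return "", err
	}
	if hit {
		return fmt.Sprintf("model %q matches deniedModels pattern %q", model, p), nil
	}
	if len(allowed) == 0 {
		return "", nil // assumption: no allowlist means no restriction
	}
	_, hit, err = firstMatch(allowed, model)
	if err != nil {
		return "", err
	}
	if !hit {
		return fmt.Sprintf("model %q does not match any allowedModels pattern", model), nil
	}
	return "", nil
}

func main() {
	allowed := []string{"claude-sonnet-.*", "claude-haiku-.*"}
	denied := []string{"claude-opus-.*"}
	for _, m := range []string{"claude-opus-4", "claude-sonnet-4"} {
		reason, err := checkModel(allowed, denied, m)
		fmt.Printf("%s: reason=%q err=%v\n", m, reason, err)
	}
}
```

In a real webhook the patterns would likely be compiled once when a policy is cached rather than on every request.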

Multiple policies

When multiple KelosPolicy resources exist in a namespace, the most restrictive combination applies:

  • Allowlists: intersection (must be allowed by ALL policies)
  • Denylists: union (denied by ANY policy means denied)

This enables layered policies: a platform-wide policy with org standards + a team-specific policy with additional restrictions.
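A minimal sketch of these combination rules, applied to repository patterns. The `repoPolicy` type and `evaluateRepo` function are hypothetical, globs are assumed to follow `path.Match`, and a policy with no allowlist is assumed to place no allow-side restriction.

```go
package main

import (
	"fmt"
	"path"
)

// repoPolicy is a hypothetical, trimmed-down view of one KelosPolicy.
type repoPolicy struct {
	Name    string
	Allowed []string // glob patterns; empty means this policy has no allowlist
	Denied  []string // glob patterns
}

// matchAny reports whether s matches any glob. Malformed patterns are
// skipped here for brevity; a real webhook should reject them up front.
func matchAny(patterns []string, s string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, s); err == nil && ok {
			return true
		}
	}
	return false
}

// evaluateRepo applies "denied by ANY, allowed by ALL" across policies.
func evaluateRepo(policies []repoPolicy, repo string) (bool, string) {
	// Denylists: union. A single match in any policy rejects.
	for _, p := range policies {
		if matchAny(p.Denied, repo) {
			return false, fmt.Sprintf("repository %q denied by policy %q", repo, p.Name)
		}
	}
	// Allowlists: intersection. Every policy with an allowlist must match.
	for _, p := range policies {
		if len(p.Allowed) > 0 && !matchAny(p.Allowed, repo) {
			return false, fmt.Sprintf("repository %q not allowed by policy %q", repo, p.Name)
		}
	}
	return true, ""
}

func main() {
	policies := []repoPolicy{
		{Name: "org-standards", Allowed: []string{"https://github.com/myorg/*"},
			Denied: []string{"https://github.com/myorg/infrastructure-*"}},
		{Name: "team-backend", Allowed: []string{"https://github.com/myorg/backend-*",
			"https://github.com/myorg/shared-libs"}},
	}
	for _, repo := range []string{
		"https://github.com/myorg/backend-api",
		"https://github.com/myorg/frontend-app",
		"https://github.com/myorg/infrastructure-prod",
	} {
		ok, reason := evaluateRepo(policies, repo)
		fmt.Println(repo, ok, reason)
	}
}
```

Checking denylists first means a denied repository gets the same rejection reason regardless of what any allowlist says.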

Implementation approach

Phase 1: ValidatingAdmissionWebhook

Add a new webhook endpoint to the existing kelos-controller binary (or a dedicated kelos-admission-webhook binary):

  1. Webhook handler: Intercepts CREATE and UPDATE operations on Tasks, TaskSpawners, and Workspaces
  2. Policy lookup: Reads all KelosPolicy resources in the target namespace via an informer/cache
  3. Constraint evaluation: Checks the resource against all applicable policies
  4. Response: Returns allowed: true or allowed: false with a descriptive reason
  5. Status update: Reconciler periodically updates KelosPolicy status with admission statistics
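Steps 1 through 4 could look roughly like the sketch below, which uses only the Go standard library. The structs mirror a small subset of the `admission.k8s.io/v1` AdmissionReview JSON. A real implementation would more likely use the types in `k8s.io/api/admission/v1` or controller-runtime's webhook support. `policyLookup` is a hypothetical stand-in for the informer-backed cache, and only `spec.type` is checked.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Minimal subset of the admission.k8s.io/v1 AdmissionReview JSON shape.
type admissionRequest struct {
	UID       string          `json:"uid"`
	Namespace string          `json:"namespace"`
	Object    json.RawMessage `json:"object"`
}

type admissionStatus struct {
	Message string `json:"message"`
}

type admissionResponse struct {
	UID     string           `json:"uid"`
	Allowed bool             `json:"allowed"`
	Status  *admissionStatus `json:"status,omitempty"`
}

type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

// policyLookup is a hypothetical stand-in for the policy cache: it returns
// the agent types allowed in a namespace (empty means unrestricted).
type policyLookup func(namespace string) []string

// evaluate checks spec.type of the incoming object against the namespace policy.
func evaluate(lookup policyLookup, req *admissionRequest) *admissionResponse {
	resp := &admissionResponse{UID: req.UID, Allowed: true}
	var obj struct {
		Spec struct {
			Type string `json:"type"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(req.Object, &obj); err != nil {
		resp.Allowed = false
		resp.Status = &admissionStatus{Message: "cannot decode object: " + err.Error()}
		return resp
	}
	allowed := lookup(req.Namespace)
	if len(allowed) == 0 {
		return resp
	}
	for _, t := range allowed {
		if t == obj.Spec.Type {
			return resp
		}
	}
	resp.Allowed = false
	resp.Status = &admissionStatus{Message: fmt.Sprintf(
		"agent type %q not in allowedTypes [%s]", obj.Spec.Type, strings.Join(allowed, ", "))}
	return resp
}

// validateHandler decodes the review, evaluates it, and echoes the request UID.
func validateHandler(lookup policyLookup) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var review admissionReview
		if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
			http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
			return
		}
		out := admissionReview{
			APIVersion: "admission.k8s.io/v1",
			Kind:       "AdmissionReview",
			Response:   evaluate(lookup, review.Request),
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(out)
	}
}

func main() {
	lookup := func(string) []string { return []string{"claude-code", "codex"} }
	body := `{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","request":{"uid":"abc","namespace":"team-backend","object":{"spec":{"type":"gemini"}}}}`
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/validate", strings.NewReader(body))
	validateHandler(lookup).ServeHTTP(rec, req)
	fmt.Print(rec.Body.String())
}
```

Keeping `evaluate` free of HTTP concerns makes the policy logic testable without a running API server. The status update in step 5 is left out here because it belongs to the reconciler, not the request path.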

The Helm chart would include a ValidatingWebhookConfiguration:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kelos-policy
webhooks:
  - name: policy.kelos.dev
    rules:
      - apiGroups: ["kelos.dev"]
        apiVersions: ["v1alpha1"]
        resources: ["tasks", "taskspawners", "workspaces"]
        operations: ["CREATE", "UPDATE"]
    clientConfig:
      service:
        name: kelos-admission-webhook
        namespace: kelos-system
        path: /validate
    namespaceSelector:
      matchExpressions:
        - key: kelos.dev/policy-enforcement
          operator: In
          values: ["enabled"]
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]  # required by admissionregistration.k8s.io/v1

The namespaceSelector ensures only namespaces that opt in to policy enforcement are affected, preserving backward compatibility.

Phase 2 (future): MutatingAdmissionWebhook for defaults

A natural extension would be adding a defaults section to KelosPolicy that injects namespace-wide defaults (resources, TTL, labels) into Tasks that don't specify them. This would be implemented as a separate MutatingAdmissionWebhook and can be proposed independently.

Relationship to existing proposals

| Proposal | Governance Dimension | KelosPolicy Dimension |
|---|---|---|
| #675 ConcurrencyPolicy | How many: aggregate concurrency limits | Which ones: what's permitted at all |
| #860 securityProfile | How secure: SecurityContext hardening | Which images/types: restricting to approved configs |
| #977 budgetPolicy | How much: per-spawner cost limits | Which models: preventing expensive models entirely |
| #624 maxCostUSD | How much: per-spawner spend cap | Which models: cost control via restriction, not limits |
| #907 schedulingPolicy | When: time windows for execution | Orthogonal: KelosPolicy does not address scheduling |

KelosPolicy complements all of these. They operate at different levels of the governance stack:

  • KelosPolicy → "Is this resource allowed to exist?" (admission gate)
  • ConcurrencyPolicy → "Can this resource run right now?" (runtime scheduling)
  • securityProfile → "How is this resource hardened?" (pod configuration)
  • budgetPolicy → "Can this resource afford to run?" (cost tracking)

Admission webhooks as foundation

Introducing the admission webhook infrastructure for KelosPolicy creates a foundation that #675, #860, and other governance proposals could also leverage. For example, ConcurrencyPolicy enforcement at admission time would be more robust than controller-side enforcement, since it prevents the resource from being created rather than leaving it in a pending/rejected state.

Use cases

  1. Cost governance: "Dev namespaces can only use Haiku/Sonnet models; Opus requires the team-ml namespace"
  2. Supply chain security: "Agent pods must use images from our internal registry, not public registries"
  3. Data isolation: "The payments namespace can only access payments-* repos; the frontend namespace cannot access backend repos"
  4. Credential hygiene: "All namespaces must use OAuth credentials; raw API keys are not permitted"
  5. Agent standardization: "Only claude-code and codex are approved agent types; new types require platform team approval"

Backward compatibility

  • Fully opt-in: Namespaces without a KelosPolicy resource are unaffected
  • Namespace selector: The webhook only fires for namespaces with the kelos.dev/policy-enforcement=enabled label
  • Gradual rollout: Platform teams can enable enforcement namespace-by-namespace
  • Audit mode (future): A spec.mode: audit option that logs violations without rejecting, enabling dry-run before enforcement
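The audit mode mentioned in the last bullet could be sketched as a small decision function. The `decision` type and `decide` function are hypothetical, as is treating every mode other than `audit` as enforcing. In audit mode the violations are surfaced as warnings, which the Kubernetes AdmissionResponse supports through its `warnings` field.

```go
package main

import "fmt"

// decision is a hypothetical, simplified admission outcome.
type decision struct {
	Allowed  bool
	Warnings []string // returned to the client without blocking the request
	Message  string   // rejection reason when Allowed is false
}

// decide turns policy violations into an outcome.
// Assumption: any mode other than "audit" enforces.
func decide(mode string, violations []string) decision {
	if len(violations) == 0 {
		return decision{Allowed: true}
	}
	if mode == "audit" {
		// Dry run: admit the resource but report what would have failed.
		return decision{Allowed: true, Warnings: violations}
	}
	return decision{Allowed: false, Message: violations[0]}
}

func main() {
	v := []string{`model "claude-opus-4" matches deniedModels pattern "claude-opus-.*"`}
	fmt.Printf("%+v\n", decide("audit", v))
	fmt.Printf("%+v\n", decide("enforce", v))
}
```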
