API: Add KelosPolicy CRD for namespace-level allowlist enforcement, restricting permitted agent types, models, images, and repositories via admission webhooks (#1020)

🤖 **Kelos Strategist Agent** @gjkim42

## Area: New CRDs & API Extensions

## Summary

Platform operators deploying Kelos in shared clusters have no mechanism to restrict **which** agent configurations are permitted in a namespace. Existing governance proposals address **how many** tasks can run (#675 ConcurrencyPolicy), **how securely** they run (#860 securityProfile), and **how much** they can spend (#977 budgetPolicy) — but none address the fundamental question: **which agent types, models, container images, and repositories are allowed?**

This proposal introduces a **KelosPolicy** CRD — a namespace-scoped resource that enforces allowlist and denylist constraints on Tasks, TaskSpawners, and Workspaces at admission time. It also introduces the first Kubernetes ValidatingAdmissionWebhook in the Kelos codebase, establishing the foundation for all future policy enforcement.

## Problem

### 1. Any agent type, model, and image can be used without restriction

The JobBuilder's `resolveImage()` function accepts any agent type and maps it to a default image, or uses a custom image if specified. There is no validation that restricts which agent types or images are permitted in a given namespace:

```go
// internal/controller/job_builder.go:351-358
mainContainer := corev1.Container{
    Name:            task.Spec.Type,  // Unconstrained
    Image:           image,            // Unconstrained  
    ImagePullPolicy: pullPolicy,
    Command:         []string{"/kelos_entrypoint.sh"},
    Args:            []string{prompt},
    Env:             envVars,
}
```

The `resolveImage()` function (`job_builder.go:258-285`) maps agent types to default images but does not validate whether the type is permitted. A Task specifying `type: claude-code` with `model: claude-opus-4` will be accepted in any namespace, even if the platform team intended that namespace for low-cost workloads only.

This matters because model costs vary dramatically (Opus can be 10-20x more expensive per token than Haiku). Without restrictions, a single misconfigured TaskSpawner can generate unexpectedly high API costs before per-spawner budget limits (proposed in #977, #624) would catch it — those limits require explicit opt-in on each spawner.

### 2. Any repository can be accessed by agents

Workspace resources accept any repository URL:

```go
// api/v1alpha1/workspace_types.go
type WorkspaceSpec struct {
    Repo      string           `json:"repo"`  // No restriction on which repos
    Ref       string           `json:"ref,omitempty"`
    SecretRef *SecretReference `json:"secretRef,omitempty"`
    // ...
}
```

A Task in any namespace can reference a Workspace pointing to any accessible repository. In organizations with internal/sensitive repos, platform teams need to restrict which repos agents can access per-namespace:
- **Team namespaces** should only access that team's repos
- **Staging namespaces** should not access production config repos
- **External contributor namespaces** should be restricted to specific repos

### 3. Custom images bypass all controls

Tasks support custom container images via `spec.image`, which completely bypasses the default agent image mapping. Without image restrictions, users could run arbitrary container images as "agent" pods:

```go
// internal/controller/job_builder.go:258-285
func (b *JobBuilder) resolveImage(task *kelosv1alpha1.Task) string {
    if task.Spec.Image != "" {
        return task.Spec.Image  // User-specified image, unchecked
    }
    // ... default image mapping
}
```

This is a supply chain risk in shared clusters — platform teams need to restrict agent pods to images from approved registries.

### 4. Zero admission webhooks exist in the codebase

The current enforcement model relies entirely on CRD-level `x-kubernetes-validations` rules (field-level constraints like "secretRef required for non-none credentials") and controller-side logic. There are no `ValidatingWebhookConfiguration` or `MutatingWebhookConfiguration` resources in the Helm chart or codebase.

CRD validation rules cannot:
- Reference other resources (e.g., look up a policy CRD)
- Perform namespace-scoped logic
- Aggregate state across resources
- Apply dynamic policies that change without CRD updates

This architectural gap means **any governance feature that requires cross-resource or namespace-level validation is currently impossible to implement**.

## Proposed API

### KelosPolicy CRD

```yaml
apiVersion: kelos.dev/v1alpha1
kind: KelosPolicy
metadata:
  name: team-backend
  namespace: team-backend
spec:
  # Which agent types are permitted
  agents:
    allowedTypes:
      - claude-code
      - codex
    # Optional model constraints (regex patterns)
    allowedModels:
      - "claude-sonnet-.*"
      - "claude-haiku-.*"
    deniedModels:          # Takes precedence over allowedModels
      - "claude-opus-.*"

  # Which container images are permitted (glob patterns)
  images:
    allowed:
      - "ghcr.io/kelos-dev/*"
      - "registry.internal.company.com/kelos/*"

  # Which repositories agents can access
  repositories:
    allowed:
      - "https://github.com/myorg/backend-*"
      - "https://github.com/myorg/shared-libs"
    denied:
      - "https://github.com/myorg/infrastructure-*"

  # Which credential types are permitted
  credentials:
    allowedTypes:
      - oauth            # Require OAuth, disallow raw API keys

status:
  enforced: true
  stats:
    tasksAdmitted: 342
    tasksRejected: 7
    workspacesAdmitted: 15
    workspacesRejected: 1
  lastRejection:
    time: "2026-04-27T10:30:00Z"
    resource: "Task/expensive-analysis"
    reason: 'model "claude-opus-4" matches deniedModels pattern "claude-opus-.*"'
  conditions:
    - type: WebhookRegistered
      status: "True"
      message: "ValidatingWebhookConfiguration is active"
```
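As a rough illustration, the `spec` portion of the example above could map onto Go API types like the following. This is a minimal sketch, not the proposed implementation: the JSON field names are taken from the YAML example, while the Go type names (`KelosPolicySpec`, `AgentRules`, `PatternRules`, `CredentialRules`) and the `parseSpec` helper are hypothetical. The `status` block is omitted.

```go
package main

import "encoding/json"

// KelosPolicySpec mirrors the spec section of the YAML example.
// Type names are illustrative assumptions, not existing Kelos types.
type KelosPolicySpec struct {
	Agents       AgentRules      `json:"agents"`
	Images       PatternRules    `json:"images"`
	Repositories PatternRules    `json:"repositories"`
	Credentials  CredentialRules `json:"credentials"`
}

// AgentRules restricts agent types (exact names) and models (regex patterns).
type AgentRules struct {
	AllowedTypes  []string `json:"allowedTypes,omitempty"`
	AllowedModels []string `json:"allowedModels,omitempty"`
	DeniedModels  []string `json:"deniedModels,omitempty"`
}

// PatternRules holds allow and deny patterns for images or repositories.
type PatternRules struct {
	Allowed []string `json:"allowed,omitempty"`
	Denied  []string `json:"denied,omitempty"`
}

// CredentialRules restricts which credential types a Task may use.
type CredentialRules struct {
	AllowedTypes []string `json:"allowedTypes,omitempty"`
}

// parseSpec decodes a JSON-encoded spec, as the API server would deliver it.
func parseSpec(data []byte) (KelosPolicySpec, error) {
	var s KelosPolicySpec
	err := json.Unmarshal(data, &s)
	return s, err
}
```

Sections left out of a policy decode to empty slices, which the evaluation logic could treat as "no restriction" for that dimension.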

### Validation behavior

| Resource Created | Fields Validated | Rejection Reason Example |
|---|---|---|
| **Task** | `spec.type`, `spec.model`, `spec.image`, `spec.credentials.type` | `agent type "gemini" not in allowedTypes [claude-code, codex]` |
| **TaskSpawner** | `spec.taskTemplate.type`, `spec.taskTemplate.model`, `spec.taskTemplate.image` | `model "claude-opus-4" matches deniedModels` |
| **Workspace** | `spec.repo` | `repository "https://github.com/myorg/infrastructure-secrets" matches denied pattern` |
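The model and image checks in the table above could be evaluated roughly as follows. This is a sketch under stated assumptions: model patterns are treated as fully anchored regular expressions, image patterns use the standard library's `path.Match` glob semantics (where `*` does not cross a `/`), and an empty allowlist means "no restriction". The function names `checkModel` and `checkImage` are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// matchesAnyRegex returns the first pattern that matches s in full.
// Each pattern is anchored so "claude-opus-.*" cannot match a substring.
func matchesAnyRegex(s string, patterns []string) (string, bool, error) {
	for _, p := range patterns {
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err != nil {
			return "", false, fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		if re.MatchString(s) {
			return p, true, nil
		}
	}
	return "", false, nil
}

// checkModel returns "" when the model is admitted, or a rejection reason.
// deniedModels is evaluated first, so it takes precedence over allowedModels.
func checkModel(model string, allowed, denied []string) (string, error) {
	if p, ok, err := matchesAnyRegex(model, denied); err != nil {
		return "", err
	} else if ok {
		return fmt.Sprintf("model %q matches deniedModels pattern %q", model, p), nil
	}
	if len(allowed) == 0 {
		return "", nil // assumption: an empty allowlist imposes no restriction
	}
	if _, ok, err := matchesAnyRegex(model, allowed); err != nil {
		return "", err
	} else if !ok {
		return fmt.Sprintf("model %q not in allowedModels %v", model, allowed), nil
	}
	return "", nil
}

// checkImage returns "" when the image matches at least one allowed glob.
func checkImage(image string, allowed []string) (string, error) {
	if len(allowed) == 0 {
		return "", nil // assumption: an empty allowlist imposes no restriction
	}
	for _, p := range allowed {
		ok, err := path.Match(p, image)
		if err != nil {
			return "", fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		if ok {
			return "", nil
		}
	}
	return fmt.Sprintf("image %q not in allowed images %v", image, allowed), nil
}
```

With `path.Match`, `ghcr.io/kelos-dev/*` admits `ghcr.io/kelos-dev/claude-code:latest` but not an image nested one level deeper. Whether nested paths should match is a design decision the proposal leaves open.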

### Multiple policies

When multiple KelosPolicy resources exist in a namespace, the most restrictive combination applies:
- **Allowlists**: intersection (must be allowed by ALL policies)
- **Denylists**: union (denied by ANY policy means denied)

This enables layered policies: a platform-wide policy with org standards + a team-specific policy with additional restrictions.
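The combination rules above do not require merging pattern lists. It is enough to evaluate each policy in turn and reject on the first failure. The sketch below illustrates this for repositories, assuming glob matching via `path.Match` and assuming that a policy with an empty allowlist imposes no allowlist restriction. The names `RepoRules` and `repoAdmitted` are hypothetical.

```go
package main

import "path"

// RepoRules is the repositories section of one KelosPolicy (illustrative type).
type RepoRules struct {
	Allowed []string
	Denied  []string
}

// matchAny reports whether s matches at least one glob pattern.
// Malformed patterns are treated as non-matching in this sketch; a real
// implementation would more likely reject them when the policy is created.
func matchAny(patterns []string, s string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, s); err == nil && ok {
			return true
		}
	}
	return false
}

// repoAdmitted applies every policy in the namespace to one repository URL.
// Denylists act as a union: a match in any policy rejects.
// Allowlists act as an intersection: every non-empty allowlist must match.
func repoAdmitted(repo string, policies []RepoRules) bool {
	for _, p := range policies {
		if matchAny(p.Denied, repo) {
			return false
		}
		if len(p.Allowed) > 0 && !matchAny(p.Allowed, repo) {
			return false
		}
	}
	return true
}
```

For example, an org-wide policy allowing `https://github.com/myorg/*` combined with a team policy allowing only `https://github.com/myorg/backend-*` admits `backend-api` but rejects `frontend-app`, even though the org-wide policy alone would admit both.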

## Implementation approach

### Phase 1: ValidatingAdmissionWebhook

Add a new webhook endpoint to the existing `kelos-controller` binary (or a dedicated `kelos-admission-webhook` binary):

1. **Webhook handler**: Intercepts CREATE and UPDATE operations on Tasks, TaskSpawners, and Workspaces
2. **Policy lookup**: Reads all KelosPolicy resources in the target namespace via an informer/cache
3. **Constraint evaluation**: Checks the resource against all applicable policies
4. **Response**: Returns `allowed: true` or `allowed: false` with a descriptive reason
5. **Status update**: Reconciler periodically updates KelosPolicy status with admission statistics
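Steps 1 through 4 could be sketched as a plain HTTP handler, shown below. To stay dependency-free, this example hand-declares a small subset of the `admission.k8s.io/v1` `AdmissionReview` fields; a real implementation would more likely use the types from `k8s.io/api/admission/v1` or controller-runtime's webhook support. The `Evaluator` callback, which stands in for the policy lookup and constraint evaluation, is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// AdmissionReview is a minimal subset of the admission.k8s.io/v1 type.
type AdmissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *AdmissionRequest  `json:"request,omitempty"`
	Response   *AdmissionResponse `json:"response,omitempty"`
}

// GroupVersionKind keeps only the kind name (Task, TaskSpawner, Workspace).
type GroupVersionKind struct {
	Kind string `json:"kind"`
}

type AdmissionRequest struct {
	UID       string           `json:"uid"`
	Kind      GroupVersionKind `json:"kind"`
	Namespace string           `json:"namespace"`
	Object    json.RawMessage  `json:"object"`
}

type AdmissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *Status `json:"status,omitempty"`
}

type Status struct {
	Message string `json:"message"`
}

// Evaluator is a hypothetical hook for policy lookup and evaluation.
// It returns "" to admit the object, or a human-readable rejection reason.
type Evaluator func(kind, namespace string, object []byte) string

// review builds the response for one request, echoing the request UID.
func review(in AdmissionReview, eval Evaluator) AdmissionReview {
	out := AdmissionReview{APIVersion: "admission.k8s.io/v1", Kind: "AdmissionReview"}
	if in.Request == nil {
		out.Response = &AdmissionResponse{Status: &Status{Message: "missing request"}}
		return out
	}
	req := in.Request
	resp := &AdmissionResponse{UID: req.UID, Allowed: true}
	if reason := eval(req.Kind.Kind, req.Namespace, req.Object); reason != "" {
		resp.Allowed = false
		resp.Status = &Status{Message: reason}
	}
	out.Response = resp
	return out
}

// handler decodes the AdmissionReview, evaluates it, and writes the reply.
func handler(eval Evaluator) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var in AdmissionReview
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(review(in, eval))
	}
}
```

Keeping `review` free of HTTP concerns lets the policy logic be unit-tested without a running API server. The handler would be mounted at whatever path the webhook configuration names and must be served over TLS, since the API server only calls webhooks over HTTPS.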

The Helm chart would include a `ValidatingWebhookConfiguration`:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kelos-policy
webhooks:
  - name: policy.kelos.dev
    rules:
      - apiGroups: ["kelos.dev"]
        apiVersions: ["v1alpha1"]
        resources: ["tasks", "taskspawners", "workspaces"]
        operations: ["CREATE", "UPDATE"]
    clientConfig:
      service:
        name: kelos-admission-webhook
        namespace: kelos-system
        path: /validate
    namespaceSelector:
      matchExpressions:
        - key: kelos.dev/policy-enforcement
          operator: In
          values: ["enabled"]
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]  # required field in admissionregistration.k8s.io/v1
```

The `namespaceSelector` ensures only namespaces that opt in to policy enforcement are affected, preserving backward compatibility.

### Phase 2 (future): MutatingAdmissionWebhook for defaults

A natural extension would be adding a `defaults` section to KelosPolicy that injects namespace-wide defaults (resources, TTL, labels) into Tasks that don't specify them. This would be implemented as a separate MutatingAdmissionWebhook and can be proposed independently.

## Relationship to existing proposals

| Proposal | Governance Dimension | KelosPolicy Dimension |
|---|---|---|
| #675 ConcurrencyPolicy | **How many** — aggregate concurrency limits | **Which ones** — what's permitted at all |
| #860 securityProfile | **How secure** — SecurityContext hardening | **Which images/types** — restricting to approved configs |
| #977 budgetPolicy | **How much** — per-spawner cost limits | **Which models** — preventing expensive models entirely |
| #624 maxCostUSD | **How much** — per-spawner spend cap | **Which models** — cost control via restriction, not limits |
| #907 schedulingPolicy | **When** — time windows for execution | Orthogonal — KelosPolicy does not address scheduling |

KelosPolicy complements all of these. They operate at different levels of the governance stack:
- **KelosPolicy** → "Is this resource allowed to exist?" (admission gate)
- **ConcurrencyPolicy** → "Can this resource run right now?" (runtime scheduling)
- **securityProfile** → "How is this resource hardened?" (pod configuration)
- **budgetPolicy** → "Can this resource afford to run?" (cost tracking)

### Admission webhooks as foundation

Introducing the admission webhook infrastructure for KelosPolicy creates a foundation that #675, #860, and other governance proposals could also leverage. For example, ConcurrencyPolicy enforcement at admission time would be more robust than controller-side enforcement, since it prevents the resource from being created rather than leaving it in a pending/rejected state.

## Use cases

1. **Cost governance**: "Dev namespaces can only use Haiku/Sonnet models; Opus requires the `team-ml` namespace"
2. **Supply chain security**: "Agent pods must use images from our internal registry, not public registries"
3. **Data isolation**: "The `payments` namespace can only access `payments-*` repos; the `frontend` namespace cannot access backend repos"
4. **Credential hygiene**: "All namespaces must use OAuth credentials; raw API keys are not permitted"
5. **Agent standardization**: "Only `claude-code` and `codex` are approved agent types; new types require platform team approval"

## Backward compatibility

- **Fully opt-in**: Namespaces without a KelosPolicy resource are unaffected
- **Namespace selector**: The webhook only fires for namespaces with the `kelos.dev/policy-enforcement=enabled` label
- **Gradual rollout**: Platform teams can enable enforcement namespace-by-namespace
- **Audit mode** (future): A `spec.mode: audit` option that logs violations without rejecting, enabling dry-run before enforcement
