Skip to content

feat: add onCompletion webhook hooks to TaskSpawner#141

Open
j-bennet wants to merge 14 commits into
prodfrom
feat/on-completion-hook
Open

feat: add onCompletion webhook hooks to TaskSpawner#141
j-bennet wants to merge 14 commits into
prodfrom
feat/on-completion-hook

Conversation

@j-bennet
Copy link
Copy Markdown

@j-bennet j-bennet commented May 20, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a new spec.onCompletion field to TaskSpawnerSpec that configures outbound HTTP webhook notifications when spawned Tasks reach terminal phases (Succeeded or Failed). This enables push-based alerting to external systems (Slack incoming webhooks, PagerDuty, custom dashboards) without requiring users to poll the Kubernetes API or set up custom Prometheus alerting rules.

Key design decisions:

  • Hooks config is serialized into a Task annotation (kelos.dev/on-completion) at spawn time so the reporting loop can dispatch without looking up the TaskSpawner
  • Webhook dispatch runs in the existing reporting loop (not inline in the task controller) to keep reconciliation fast
  • Phase filtering defaults to both Succeeded and Failed when not specified
  • Idempotency via kelos.dev/webhook-report-phase annotation prevents duplicate deliveries
  • Custom headers from a referenced Secret — all keys in the Secret are sent as HTTP headers (e.g. Authorization, X-Api-Key)
  • SSRF protection: HTTPS-only CRD validation, runtime IP blocklist (IPv4/IPv6 private ranges), DNS resolution of hostnames, and redirect guard on all HTTP clients

Which issue(s) this PR is related to:

Fixes kelos-dev#749

Special notes for your reviewer:

This is Phase 1 of the incremental adoption path described in kelos-dev#749. Future phases will add built-in Slack notification formatting and consolidate GitHubReporting into the same framework.

The webhook payload includes: task name, namespace, spawner name, phase, message, agent type, model, start/completion times, outputs, and results (cost/tokens).

The secretRef field references a Secret whose keys are used as HTTP request headers. For example, a Secret with keys Authorization and X-Api-Key will set both headers on the webhook request.

Does this PR introduce a user-facing change?

Add `spec.onCompletion` field to TaskSpawner for configuring outbound webhook notifications when tasks reach terminal phases (Succeeded/Failed). Supports phase filtering and custom HTTP headers via secretRef (all Secret keys are sent as request headers).

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR adds an onCompletion field to TaskSpawnerSpec that enables push-based webhook notifications when spawned Tasks reach terminal phases (Succeeded or Failed). The hooks config is serialized into a Task annotation at spawn time and consumed by a new runWebhookReportingCycle that runs alongside the existing reporting loop on every reconcile tick.

  • New WebhookReporter (internal/reporting/webhook.go): dispatches HTTP POST notifications with SSRF mitigations (HTTPS-only CRD pattern + runtime DNS resolution check + redirect guard applied to all injected HTTP clients), idempotency via kelos.dev/webhook-report-phase annotation persisted through retry.RetryOnConflict + merge-patch, and optional Authorization header sourced from a Kubernetes Secret.
  • CRD schema (taskspawner_types.go, YAML files): adds TerminalTaskPhase type alias with correct +kubebuilder:validation:Enum at the type level (enum constraint confirmed in generated YAML), NotificationHook, WebhookNotification, and OnCompletion; max 8 hooks enforced by schema.
  • Test suite (webhook_test.go): exercises the full ReportWebhooks path via sigs.k8s.io/controller-runtime/pkg/client/fake, covering phase gating, idempotency, auth headers, phase filtering, redirect-guard cloning, and URL validation.

Confidence Score: 5/5

Safe to merge. All core paths — idempotency, SSRF protection, secret resolution, and annotation persistence — are implemented correctly and covered by tests.

The three previously flagged implementation gaps (missing retry loop in persistAnnotationRetry, missing response-body drain, SSRF bypass via injected HTTP client and domain names) are all addressed correctly in this revision. The remaining gap is test coverage for non-2xx webhook responses — the annotation-not-persisted guarantee on delivery failure is not exercised by any table case — but the production code path itself is correct.

internal/reporting/webhook_test.go — missing test case for server error responses

Sequence Diagram

sequenceDiagram
    participant TS as TaskSpawner Reconciler
    participant K8s as Kubernetes API
    participant WR as WebhookReporter
    participant DNS as DNS Resolver
    participant EXT as External Webhook

    TS->>K8s: runOnce() — fetch TaskSpawner
    TS->>TS: runCycleWithSourceCore()
    note over TS: onCompletionAnnotation() serialises hooks to JSON
    TS->>K8s: Create/Update Task (with kelos.dev/on-completion annotation)

    TS->>K8s: runWebhookReportingCycle() — List Tasks (label selector)
    K8s-->>TS: TaskList

    loop for each Task
        TS->>WR: ReportWebhooks(ctx, task)
        alt no annotation / non-terminal phase / already reported
            WR-->>TS: nil (no-op)
        else terminal phase, not yet reported
            WR->>WR: json.Unmarshal hooks annotation
            WR->>WR: buildWebhookPayload(task)
            loop for each matching hook
                WR->>WR: validateWebhookURL (HTTPS check)
                WR->>DNS: net.LookupIP(host)
                DNS-->>WR: []IP (private IPs rejected)
                WR->>K8s: ReadSecret (if secretRef configured)
                K8s-->>WR: Authorization header value
                WR->>EXT: HTTP POST /webhook (JSON payload)
                EXT-->>WR: 200 OK
            end
            WR->>K8s: Patch Task (MergeFrom, with retry on conflict)
            note over WR: sets kelos.dev/webhook-report-phase annotation
        end
    end
Loading

Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
internal/reporting/webhook_test.go:35-170
**Missing test coverage for non-2xx server responses**

The `TestWebhookReporter_ReportWebhooks` table always uses `serverStatus: 200`. There is no case where the server returns a 4xx or 5xx status, so the path at `webhook.go:160–161` that returns `fmt.Errorf("webhook %q returned status %d", ...)` is never exercised end-to-end. More importantly, the critical guarantee that `AnnotationWebhookReportPhase` is **not** persisted when delivery fails (allowing retry on the next reconcile cycle) is untested: a test with `serverStatus: 500`, `wantRequests: 1`, and `wantAnnotation: ""` would cover that invariant.

Reviews (8): Last reviewed commit: "chore: regenerate CRD manifests for upda..." | Re-trigger Greptile

Comment thread internal/reporting/webhook.go Outdated
Comment thread internal/reporting/webhook.go
Comment thread internal/reporting/webhook_test.go Outdated
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

1 similar comment
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

Comment thread internal/reporting/webhook.go
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

Comment thread internal/reporting/webhook.go
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

1 similar comment
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

Comment thread internal/reporting/webhook.go
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

1 similar comment
@j-bennet
Copy link
Copy Markdown
Author

@greptile Review.

j-bennet and others added 11 commits May 29, 2026 12:08
Add a new spec.onCompletion field to TaskSpawnerSpec that configures
outbound HTTP webhook notifications when spawned Tasks reach terminal
phases (Succeeded or Failed). This enables push-based alerting to
external systems (Slack, PagerDuty, custom dashboards) without polling.

Each hook supports phase filtering and optional Authorization header
via secretRef. The hooks config is propagated to Tasks as an annotation
and dispatched by the existing reporting loop for idempotency.

Closes kelos-dev#749

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add retry-on-conflict loop to persistAnnotationRetry using
  retry.RetryOnConflict to prevent duplicate webhook deliveries
- Drain response body before close for HTTP connection reuse
- Use TerminalTaskPhase type with kubebuilder enum validation so
  the CRD rejects invalid phase values at admission time
- Rewrite tests to use fake client exercising the full ReportWebhooks
  flow including idempotency and annotation persistence

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The persistent execution mode test involves multiple reconcile cycles
(session controller + task controller) and 10s is too tight under CI
resource contention. Increase to 30s to match the complexity of the
test's multi-step lifecycle verification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Persist idempotency annotation even when all hooks are filtered by
  phase, preventing infinite re-evaluation on every reporting cycle
- Remove spec.onCompletion gate so in-flight tasks get webhooks even if
  the spawner spec is later modified
- Redact webhook URL from error messages to avoid leaking URL-embedded
  tokens (use hook name instead)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Restrict CRD URL pattern to HTTPS-only (^https://.+)
- Add runtime host validation rejecting private, loopback, and
  link-local IP ranges before dispatching webhooks
- Apply CheckRedirect policy on the default HTTP client to block
  redirects to internal addresses
- Initialize a shared default HTTP client instead of allocating per-call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CheckRedirect policy on defaultWebhookHTTPClient was unreachable
when runWebhookReportingCycle injects cfg.HTTPClient. Now httpClient()
shallow-clones the injected client and applies ssrfCheckRedirect when
no CheckRedirect is already configured, closing the redirect-based
SSRF bypass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add IPv6 loopback (::1/128), link-local (fe80::/10), and unique-local
(fc00::/7) ranges to isPrivateIP so webhook URLs targeting IPv6
internal addresses are rejected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
validateWebhookURL now performs DNS resolution for non-IP hostnames and
checks all resulting addresses against the private IP blocklist. This
prevents SSRF via internal Kubernetes service FQDNs or attacker domains
with private-IP DNS records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sure

- Switch persistAnnotationRetry from full-object Update to MergeFrom
  Patch, preventing accidental clobber of concurrent changes from other
  controllers
- Document that webhook URLs are stored in task annotations and should
  not contain embedded tokens; use secretRef instead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add test case verifying that when the webhook endpoint returns a 5xx
status, ReportWebhooks returns an error and does not persist the
idempotency annotation, allowing retry on the next reconcile cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
j-bennet and others added 3 commits May 29, 2026 12:08
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change secretRef behavior so all keys in the referenced Secret are sent
as HTTP headers on the webhook request (not just Authorization). This
allows users to configure any headers their webhook endpoint requires
(e.g. X-Api-Key, X-Webhook-Secret) by adding keys to the Secret.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These intermediate CRD/RBAC files are generated by controller-gen and
processed by hack/update-install-manifest.sh into the final manifests.
They should not be checked in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@j-bennet j-bennet force-pushed the feat/on-completion-hook branch from a3aa19b to a8e5b67 Compare May 29, 2026 19:08
@j-bennet j-bennet requested a review from knechtionscoding May 29, 2026 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API: Add onCompletion notification hooks to TaskSpawner for outbound event delivery on task terminal phases

2 participants