Skip to content

Merge Main to Prod#39

Closed
knechtionscoding wants to merge 18 commits into
prodfrom
main
Closed

Merge Main to Prod#39
knechtionscoding wants to merge 18 commits into
prodfrom
main

Conversation

@knechtionscoding
Copy link
Copy Markdown
Collaborator

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


gjkim42 and others added 18 commits March 30, 2026 15:55
The README YAML snippets used the deprecated top-level
spec.pollInterval field. Move pollInterval inside the
githubIssues source block in both the self-development
section and the TaskSpawner quick-start example, aligning
them with the reference docs, examples, and the actual
kelos-workers.yaml manifest.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
docs: use per-source pollInterval in README snippets
Shallow clone with depth 1 prevents git from reporting parent SHAs
of the merge commit, causing the head SHA validation to always fail.
fix: use fetch-depth 2 for fork e2e merge ref checkout
* feat: add github webhook events

* fix: build images

* Fixed Issues
1. P1 - MaxConcurrencyError HTTP Response ✅
Fixed ServeHTTP to properly handle MaxConcurrencyError with HTTP 503 and Retry-After header
Previously returned generic HTTP 500, now correctly returns 503 for rate limiting
2. P1 - Empty APIVersion/Kind in Owner References ✅
Fixed owner reference creation to use client.Scheme().ObjectKinds() to get proper GVK
Added missing Controller and BlockOwnerDeletion fields for proper garbage collection
Previously had empty strings, now has correct API version and kind
3. P1 - TOCTOU Race Condition in Idempotency ✅
Replaced separate IsProcessed()/MarkProcessed() with atomic CheckAndMark()
Eliminated race window where two requests with same delivery ID could both pass the check
Now uses single lock operation for check-and-set
4. P2 - Invalid Kubernetes Names from Truncation ✅
Fixed task name generation to handle nil/empty ID values safely
Added strings.TrimRight(taskName[:63], "-.") to ensure names don't end with invalid characters
Prevents server-side validation failures
5. P2 - BodyContains Filter Ignored for IssuesEvent ✅
Fixed GitHub filter to check issue body for BodyContains on IssuesEvent
Previously only checked comment body on IssueCommentEvent
Now properly filters by issue body content
All tests pass and the implementation builds successfully. The webhook system now properly handles rate limiting, ensures atomic idempotency checks, creates valid owner references for garbage collection, generates valid Kubernetes resource names, and correctly applies all filter conditions.

* chore: generate
* fix: normalize the tasknames
Replace the centralized static ghproxy with per-workspace instances
managed by a new WorkspaceReconciler. Each workspace ghproxy owns
its GitHub auth directly via token-refresher sidecar, fixing the
ETag-per-token cache invalidation cycle caused by multiple spawner
pods using different tokens through a shared proxy.

Also logs cache misses with the cache key for observability.
…ar-webhooks-upstream

feat: Add support for real-time GithubEvents and webhooks taskSpawners
feat: scope ghproxy by workspace with own auth
…026.03.30-a5d3e17

Update cursor image to 2026.03.30-a5d3e17
PR kelos-dev#851 introduced taskbuilder.BuildTask with SpawnerRef but the
spawner call site only set Name, leaving APIVersion, Kind, and UID
empty. Kubernetes rejects owner references with empty required fields,
causing Task creation to fail.

Also remove BlockOwnerDeletion since neither the spawner nor webhook
RBAC role has the permissions Kubernetes requires for that field.
fix: populate SpawnerRef fields for Task owner references
…age-2.1.88

Update claude-code image to 2.1.88
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Mar 31, 2026

Greptile Summary

This PR merges a substantial set of features from main into prod, including: GitHub webhook-driven task spawning, per-workspace ghproxy Deployments, a shared taskbuilder package, and a refactored ghproxy that uses a fixed single upstream with its own token resolver instead of a client-supplied allowlist.

Key changes:

  • New GitHubWebhook source type in the TaskSpawner API — a new kelos-webhook-server binary listens for GitHub webhooks, filters events against all TaskSpawner resources with when.githubWebhook, and creates Task objects. Signature validation (HMAC-SHA256), idempotency via an in-memory delivery cache, and comprehensive event-type filters are all present.
  • Per-workspace ghproxy — the WorkspaceReconciler now provisions a dedicated ghproxy Deployment + Service per Workspace.
  • ghproxy refactor — simplified to a single fixed upstream with hot-reloading token file support.
  • taskbuilder package — task construction extracted into a shared package used by both the spawner and webhook handler.
  • Owner references on spawned Tasks — enabling accurate activeTasks counting for webhook-based concurrency control.

One P1 issue found: the README documents HTTP 503 when maxConcurrency is exceeded, but the code silently drops the event and returns 200, causing permanent event loss under load.

Confidence Score: 4/5

Safe to merge after resolving the max-concurrency 200-vs-503 discrepancy.

One P1 finding: when max concurrency is exceeded the handler returns 200 OK instead of 503, so GitHub considers the delivery successful and will not retry — causing silent event loss. All other findings are P2 style/cleanup. The rest of the PR is well-structured with good test coverage.

internal/webhook/handler.go — max concurrency path returns 200 instead of 503.

Important Files Changed

Filename Overview
internal/webhook/handler.go New webhook handler: max concurrency check silently drops events with 200 instead of 503 as documented; idempotency, signature validation, and task creation paths are otherwise correct.
internal/webhook/github_filter.go New GitHub webhook event parser and filter evaluator; comprehensive coverage of issue/PR/push event types, branch glob matching, label filtering.
internal/webhook/signature.go New HMAC-SHA256 GitHub signature validation; uses constant-time comparison via hmac.Equal — correct and secure.
cmd/kelos-webhook-server/main.go New webhook server binary; proper timeouts, graceful shutdown, health/readiness probes, and leader-election wired up correctly.
cmd/ghproxy/main.go Refactored from allowlist-based multi-upstream to single fixed upstream with per-instance token resolver; security surface reduced, cache key simplified, token file hot-reloading added.
internal/controller/workspace_ghproxy_builder.go New workspace-scoped ghproxy builder; uses SHA1 for name truncation (minor style issue); otherwise correctly wires token refresher sidecar and deployment/service resources.
internal/controller/workspace_controller.go New Workspace reconciler that manages per-workspace ghproxy Deployment and Service; controller references, ownership, and requeue logic are correct.
internal/controller/taskspawner_controller.go Added webhook-based TaskSpawner reconciliation path, Task watch for active-count updates, and safety guard in deleteStaleResource to check ownership before deletion.
internal/taskbuilder/builder.go New shared task construction helper; consolidates template rendering, label/annotation rendering, and owner reference into one place.
api/v1alpha1/taskspawner_types.go Added GitHubWebhook and GitHubWebhookFilter API types, updated CEL validation rule.
cmd/ghproxy/main_test.go Test suite updated for new single-upstream proxy; one test still sets the now-ignored UpstreamBaseURLHeader.
cmd/kelos-spawner/main.go Spawner now accepts --gh-proxy-url flag; task construction delegated to shared taskbuilder; owner references added to spawned Tasks.

Sequence Diagram

sequenceDiagram
    participant GH as GitHub
    participant WS as kelos-webhook-server
    participant K8s as Kubernetes API
    participant Ctrl as kelos-controller
    participant Proxy as workspace-ghproxy
    participant Spawner as kelos-spawner

    GH->>WS: POST /webhook (X-Hub-Signature-256)
    WS->>WS: ValidateGitHubSignature (HMAC-SHA256)
    WS->>WS: CheckAndMark deliveryID (idempotency)
    WS->>K8s: List TaskSpawners with githubWebhook
    K8s-->>WS: matching spawners
    WS->>WS: MatchesGitHubEvent (filter evaluation)
    WS->>K8s: Create Task (owner ref to TaskSpawner)
    K8s-->>WS: 201 Created
    WS-->>GH: 200 OK

    K8s-->>Ctrl: Task status change
    Ctrl->>K8s: Update TaskSpawner.status.activeTasks

    Spawner->>Proxy: GET /repos/.../issues (no token)
    Proxy->>Proxy: tokenResolver() reads token file
    Proxy->>GH: GET api.github.com (Authorization: token)
    GH-->>Proxy: 200
    Proxy-->>Spawner: 200 (cached response)
Loading

Reviews (1): Last reviewed commit: "Merge pull request #854 from kelos-dev/u..." | Re-trigger Greptile

Comment on lines +239 to +250
// Check max concurrency
// Note: For webhook TaskSpawners, activeTasks is updated by the kelos-controller
// when Tasks change status. This provides eventually consistent rate limiting.
if spawner.Spec.MaxConcurrency != nil && *spawner.Spec.MaxConcurrency > 0 {
activeTasks := spawner.Status.ActiveTasks
if int32(activeTasks) >= *spawner.Spec.MaxConcurrency {
spawnerLog.Info("Max concurrency reached, dropping webhook event",
"activeTasks", activeTasks,
"maxConcurrency", *spawner.Spec.MaxConcurrency,
"reason", "Webhook accepted but task creation skipped due to concurrency limits")
continue // Skip this spawner, continue with others
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Max concurrency silently drops events instead of returning 503

The README.md for the webhook example explicitly states: "When exceeded, returns HTTP 503 with Retry-After header — GitHub will automatically retry failed webhook deliveries."

However, the code silently continues and returns HTTP 200 (success) when all spawners are over the concurrency limit (processWebhook returns (false, nil)). GitHub will interpret a 200 as successful delivery and will not retry the event, causing permanent, silent event loss.

At minimum, update the README to accurately reflect the current 200 behaviour, or update the code to match the documented 503 behaviour.

Comment thread cmd/ghproxy/main_test.go

doGET := func() {
req, _ := http.NewRequest("GET", proxyServer.URL+"/repos/owner/repo/issues", nil)
req.Header.Set(source.UpstreamBaseURLHeader, upstream.URL)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Leftover UpstreamBaseURLHeader has no effect after proxy refactor

The proxy was refactored to use a single fixed upstreamBaseURL configured at construction time; the X-Upstream-Base-URL header is now completely ignored in ServeHTTP. This header set here is dead code and can mislead future readers into thinking it still controls routing.

Suggested change
req.Header.Set(source.UpstreamBaseURLHeader, upstream.URL)
req, _ := http.NewRequest("GET", proxyServer.URL+"/repos/owner/repo/issues", nil)

Comment on lines +80 to +89
return &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: WorkspaceGHProxyName(workspace.Name),
Namespace: workspace.Namespace,
Labels: labels,
},
Spec: corev1.ServiceSpec{
Selector: labels,
Ports: []corev1.ServicePort{
{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 SHA1 used for name truncation hashing

SHA1 is cryptographically broken. While the usage here is purely for generating deterministic short suffixes (not security-sensitive), the rest of the codebase uses SHA256 (e.g., delivery-ID hashing in handler.go). Prefer sha256 for consistency:

sum := sha256.Sum256([]byte(name))
suffix := hex.EncodeToString(sum[:])[:8]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants