Merge Main to Prod #39
The README YAML snippets used the deprecated top-level spec.pollInterval field. Move pollInterval inside the githubIssues source block in both the self-development section and the TaskSpawner quick-start example, aligning them with the reference docs, examples, and the actual kelos-workers.yaml manifest. Co-Authored-By: Claude Opus 4.6 <[email protected]>
docs: use per-source pollInterval in README snippets
feat: log cache misses in ghproxy
Shallow clone with depth 1 prevents git from reporting parent SHAs of the merge commit, causing the head SHA validation to always fail.
fix: use fetch-depth 2 for fork e2e merge ref checkout
* feat: add github webhook events
* fix: build images
* Fixed Issues
  1. P1 - MaxConcurrencyError HTTP Response ✅
     Fixed ServeHTTP to properly handle MaxConcurrencyError with HTTP 503 and Retry-After header
     Previously returned generic HTTP 500, now correctly returns 503 for rate limiting
  2. P1 - Empty APIVersion/Kind in Owner References ✅
     Fixed owner reference creation to use client.Scheme().ObjectKinds() to get proper GVK
     Added missing Controller and BlockOwnerDeletion fields for proper garbage collection
     Previously had empty strings, now has correct API version and kind
  3. P1 - TOCTOU Race Condition in Idempotency ✅
     Replaced separate IsProcessed()/MarkProcessed() with atomic CheckAndMark()
     Eliminated race window where two requests with same delivery ID could both pass the check
     Now uses single lock operation for check-and-set
  4. P2 - Invalid Kubernetes Names from Truncation ✅
     Fixed task name generation to handle nil/empty ID values safely
     Added strings.TrimRight(taskName[:63], "-.") to ensure names don't end with invalid characters
     Prevents server-side validation failures
  5. P2 - BodyContains Filter Ignored for IssuesEvent ✅
     Fixed GitHub filter to check issue body for BodyContains on IssuesEvent
     Previously only checked comment body on IssueCommentEvent
     Now properly filters by issue body content
  All tests pass and the implementation builds successfully. The webhook system now properly handles rate limiting, ensures atomic idempotency checks, creates valid owner references for garbage collection, generates valid Kubernetes resource names, and correctly applies all filter conditions.
* chore: generate
* fix: normalize the tasknames
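The TOCTOU fix described in item 3 can be illustrated with a minimal sketch. The PR's actual implementation is not shown here, so the `deliveryCache` type, its constructor, and the `countWinners` helper are hypothetical names; only the `CheckAndMark` name comes from the commit message. The point is that the lookup and the insert happen under one lock, so two concurrent requests with the same delivery ID cannot both be treated as new.

```go
package main

import (
	"fmt"
	"sync"
)

// deliveryCache is a hypothetical in-memory record of processed delivery IDs.
type deliveryCache struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newDeliveryCache() *deliveryCache {
	return &deliveryCache{seen: make(map[string]struct{})}
}

// CheckAndMark reports whether id was already recorded, and records it if not.
// Both steps run while holding the same lock, so there is no window between
// the check and the mark.
func (c *deliveryCache) CheckAndMark(id string) (alreadyProcessed bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[id]; ok {
		return true
	}
	c.seen[id] = struct{}{}
	return false
}

// countWinners issues n concurrent calls with the same ID and returns how
// many of them observed the ID as new.
func countWinners(c *deliveryCache, id string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if !c.CheckAndMark(id) {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	c := newDeliveryCache()
	// Exactly one of the 50 concurrent callers sees the ID as new.
	fmt.Println(countWinners(c, "delivery-1", 50)) // prints 1
}
```

With separate IsProcessed and MarkProcessed calls, several goroutines could pass the check before any of them marks the ID. A real cache would also need eviction, which this sketch leaves out.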
Replace the centralized static ghproxy with per-workspace instances managed by a new WorkspaceReconciler. Each workspace ghproxy owns its GitHub auth directly via token-refresher sidecar, fixing the ETag-per-token cache invalidation cycle caused by multiple spawner pods using different tokens through a shared proxy. Also logs cache misses with the cache key for observability.
…ar-webhooks-upstream feat: Add support for real-time GithubEvents and webhooks taskSpawners
feat: scope ghproxy by workspace with own auth
…026.03.30-a5d3e17 Update cursor image to 2026.03.30-a5d3e17
PR kelos-dev#851 introduced taskbuilder.BuildTask with SpawnerRef but the spawner call site only set Name, leaving APIVersion, Kind, and UID empty. Kubernetes rejects owner references with empty required fields, causing Task creation to fail. Also remove BlockOwnerDeletion since neither the spawner nor webhook RBAC role has the permissions Kubernetes requires for that field.
fix: populate SpawnerRef fields for Task owner references
…-1.3.6 Update opencode image to 1.3.9
…age-2.1.88 Update claude-code image to 2.1.88
Greptile Summary
This PR merges a substantial set of features from
Key changes:
One P1 issue found: the README documents HTTP 503 when
Confidence Score: 4/5
Safe to merge after resolving the max-concurrency 200-vs-503 discrepancy.
One P1 finding: when max concurrency is exceeded the handler returns 200 OK instead of 503, so GitHub considers the delivery successful and will not retry, causing silent event loss. All other findings are P2 style/cleanup. The rest of the PR is well-structured with good test coverage.
internal/webhook/handler.go: max concurrency path returns 200 instead of 503.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant GH as GitHub
participant WS as kelos-webhook-server
participant K8s as Kubernetes API
participant Ctrl as kelos-controller
participant Proxy as workspace-ghproxy
participant Spawner as kelos-spawner
GH->>WS: POST /webhook (X-Hub-Signature-256)
WS->>WS: ValidateGitHubSignature (HMAC-SHA256)
WS->>WS: CheckAndMark deliveryID (idempotency)
WS->>K8s: List TaskSpawners with githubWebhook
K8s-->>WS: matching spawners
WS->>WS: MatchesGitHubEvent (filter evaluation)
WS->>K8s: Create Task (owner ref to TaskSpawner)
K8s-->>WS: 201 Created
WS-->>GH: 200 OK
K8s-->>Ctrl: Task status change
Ctrl->>K8s: Update TaskSpawner.status.activeTasks
Spawner->>Proxy: GET /repos/.../issues (no token)
Proxy->>Proxy: tokenResolver() reads token file
Proxy->>GH: GET api.github.com (Authorization: token)
GH-->>Proxy: 200
Proxy-->>Spawner: 200 (cached response)
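The signature step in the diagram (HMAC-SHA256 over the request body, carried in the `X-Hub-Signature-256` header as `sha256=<hex>`) can be sketched as follows. This is not the PR's `ValidateGitHubSignature`; the function names `signBody` and `validSignature` and the secret are assumptions made for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

const sigPrefix = "sha256="

// signBody computes the header value a sender would attach for body.
func signBody(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return sigPrefix + hex.EncodeToString(mac.Sum(nil))
}

// validSignature recomputes the HMAC over body and compares it with the
// header value using a constant-time comparison.
func validSignature(secret, body []byte, header string) bool {
	if !strings.HasPrefix(header, sigPrefix) {
		return false
	}
	got, err := hex.DecodeString(strings.TrimPrefix(header, sigPrefix))
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // hypothetical webhook secret
	body := []byte(`{"action":"opened"}`)
	header := signBody(secret, body)

	fmt.Println(validSignature(secret, body, header))               // true
	fmt.Println(validSignature(secret, []byte("tampered"), header)) // false
}
```

The comparison uses `hmac.Equal` rather than `==` so that the check does not leak timing information about how many leading bytes matched.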
Reviews (1): Last reviewed commit: "Merge pull request #854 from kelos-dev/u..."
// Check max concurrency
// Note: For webhook TaskSpawners, activeTasks is updated by the kelos-controller
// when Tasks change status. This provides eventually consistent rate limiting.
if spawner.Spec.MaxConcurrency != nil && *spawner.Spec.MaxConcurrency > 0 {
	activeTasks := spawner.Status.ActiveTasks
	if int32(activeTasks) >= *spawner.Spec.MaxConcurrency {
		spawnerLog.Info("Max concurrency reached, dropping webhook event",
			"activeTasks", activeTasks,
			"maxConcurrency", *spawner.Spec.MaxConcurrency,
			"reason", "Webhook accepted but task creation skipped due to concurrency limits")
		continue // Skip this spawner, continue with others
	}
Max concurrency silently drops events instead of returning 503
The README.md for the webhook example explicitly states: "When exceeded, returns HTTP 503 with Retry-After header — GitHub will automatically retry failed webhook deliveries."
However, the code silently continues and returns HTTP 200 (success) when all spawners are over the concurrency limit (processWebhook returns (false, nil)). GitHub will interpret a 200 as successful delivery and will not retry the event, causing permanent, silent event loss.
At minimum, update the README to accurately reflect the current 200 behaviour, or update the code to match the documented 503 behaviour.
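If the code is changed to match the README, the handler path might look like the sketch below. Everything here is an assumption for illustration: the `maxConcurrencyError` type, the `process` stand-in, the 30-second retry hint, and the `probe` helper are hypothetical, and the sketch does not depend on how or whether the sender redelivers after a 503.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// maxConcurrencyError is a hypothetical error type signalling that every
// matching spawner is at its concurrency limit.
type maxConcurrencyError struct{ retryAfter time.Duration }

func (e *maxConcurrencyError) Error() string { return "max concurrency reached" }

// process stands in for webhook processing; it fails when active >= limit.
func process(active, limit int32) error {
	if limit > 0 && active >= limit {
		return &maxConcurrencyError{retryAfter: 30 * time.Second}
	}
	return nil
}

// handler maps the concurrency error to 503 with a Retry-After header
// instead of reporting success.
func handler(active, limit int32) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		err := process(active, limit)
		var mce *maxConcurrencyError
		switch {
		case errors.As(err, &mce):
			w.Header().Set("Retry-After", strconv.Itoa(int(mce.retryAfter.Seconds())))
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
		case err != nil:
			http.Error(w, err.Error(), http.StatusInternalServerError)
		default:
			w.WriteHeader(http.StatusOK)
		}
	}
}

// probe sends one request through the handler and returns the status code
// and the Retry-After header value.
func probe(active, limit int32) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/webhook", nil)
	handler(active, limit)(rec, req)
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	fmt.Println(probe(1, 5)) // below the limit
	fmt.Println(probe(5, 5)) // at the limit
}
```

Returning a typed error from the processing step keeps the status-code decision in one place, rather than logging and continuing inside the per-spawner loop.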
doGET := func() {
	req, _ := http.NewRequest("GET", proxyServer.URL+"/repos/owner/repo/issues", nil)
	req.Header.Set(source.UpstreamBaseURLHeader, upstream.URL)
Leftover UpstreamBaseURLHeader has no effect after proxy refactor
The proxy was refactored to use a single fixed upstreamBaseURL configured at construction time; the X-Upstream-Base-URL header is now completely ignored in ServeHTTP. This header set here is dead code and can mislead future readers into thinking it still controls routing.
req.Header.Set(source.UpstreamBaseURLHeader, upstream.URL)
req, _ := http.NewRequest("GET", proxyServer.URL+"/repos/owner/repo/issues", nil)
return &corev1.Service{
	ObjectMeta: metav1.ObjectMeta{
		Name:      WorkspaceGHProxyName(workspace.Name),
		Namespace: workspace.Namespace,
		Labels:    labels,
	},
	Spec: corev1.ServiceSpec{
		Selector: labels,
		Ports: []corev1.ServicePort{
			{
SHA1 used for name truncation hashing
SHA1 is cryptographically broken. While the usage here is purely for generating deterministic short suffixes (not security-sensitive), the rest of the codebase uses SHA256 (e.g., delivery-ID hashing in handler.go). Prefer sha256 for consistency:
sum := sha256.Sum256([]byte(name))
suffix := hex.EncodeToString(sum[:])[:8]