fix: capture agent outputs in session runner for persistent mode#126
The session runner was not capturing agent stdout to extract output markers (---KELOS_OUTPUTS_START/END---). This meant Task status never received outputs/results, so Slack reporting had nothing to post back to the thread. Now runAgent uses io.MultiWriter to tee stdout into a buffer, parses the output markers after the agent exits, and writes outputs/results to the Task status via the Kelos API. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Greptile Summary
This PR fixes persistent-mode task execution by capturing agent stdout through a bounded
Confidence Score: 5/5
Safe to merge: all previously raised concerns are addressed and no new defects were found.
The three structural fixes (bounded ring buffer, deferred status write with a fresh context, conflict-retrying UpdateStatus) are all present and correct. The tailWriter ring buffer logic was verified to handle small writes, exact-fill, overflow, and multi-write wrap-around correctly. The deferred function captures outputs and results by closure reference, so their final post-agent values are always used. All edge cases called out in earlier review rounds are resolved, and the new unit and integration tests cover the critical paths.
No files require special attention.
Tests the full SessionReconciler lifecycle through envtest:
- Task with execution-mode label transitions to Queued (not Job)
- SessionReconciler assigns task to available session pod
- Annotation-based protocol: running → succeeded/failed transitions
- Task outputs/results written by session runner are preserved
- Requeue behavior when no session pod is available
Also registers SessionReconciler in the integration test suite.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add retry-on-conflict to updateTaskStatus (renamed from updateTaskOutputs) to avoid silently dropping outputs when the SessionReconciler writes concurrently
- Always set CompletionTime/StartTime regardless of whether output markers are present, so downstream consumers (TTL, Slack reporter) don't see the task as perpetually in-progress
- Replace unbounded bytes.Buffer with a 256KB ring buffer (tailWriter) to cap memory usage from verbose agents
Co-Authored-By: Claude Opus 4.6 <[email protected]>
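To make the ring-buffer idea concrete, here is a minimal sketch of a writer that keeps only the last N bytes. It is not the PR's tailWriter: the field names, `newTailWriter`, and `Bytes` are assumptions for illustration, and the demo uses a tiny limit instead of 256KB.

```go
package main

import "fmt"

// tailWriter is an io.Writer that retains only the most recent len(buf) bytes.
// Field and method names here are illustrative, not taken from the PR.
type tailWriter struct {
	buf  []byte // fixed-size storage
	next int    // index where the next byte will be written
	full bool   // true once buf has been filled at least once
}

func newTailWriter(limit int) *tailWriter {
	return &tailWriter{buf: make([]byte, limit)}
}

// Write always reports len(p) bytes written so the producer never sees a short write.
func (t *tailWriter) Write(p []byte) (int, error) {
	n := len(p)
	if n == 0 || len(t.buf) == 0 {
		return n, nil
	}
	// A write at least as large as the buffer replaces everything.
	if n >= len(t.buf) {
		copy(t.buf, p[n-len(t.buf):])
		t.next, t.full = 0, true
		return n, nil
	}
	// Copy up to the end of the buffer, then wrap the remainder to the front.
	first := copy(t.buf[t.next:], p)
	copy(t.buf, p[first:])
	if t.next+n >= len(t.buf) {
		t.full = true
	}
	t.next = (t.next + n) % len(t.buf)
	return n, nil
}

// Bytes returns the retained tail in the order it was written.
func (t *tailWriter) Bytes() []byte {
	if !t.full {
		return append([]byte(nil), t.buf[:t.next]...)
	}
	out := make([]byte, 0, len(t.buf))
	out = append(out, t.buf[t.next:]...)
	return append(out, t.buf[:t.next]...)
}

func main() {
	tw := newTailWriter(8) // the PR uses 256KB; 8 bytes keeps the demo readable
	fmt.Fprint(tw, "verbose log line\n")
	fmt.Fprint(tw, "DONE")
	fmt.Printf("%q\n", tw.Bytes())
}
```

Memory stays constant no matter how much the agent prints, at the cost of losing anything that scrolls past the limit before the markers are emitted.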
@greptile Review and update Greptile Summary.
- Accept either Queued or Pending when verifying initial phase, since the SessionReconciler can assign the task before the first poll
- Use crypto/rand for namespace suffixes to avoid collisions when tests run within the same second
Co-Authored-By: Claude Opus 4.6 <[email protected]>
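A random namespace suffix of the kind mentioned above might look like the following. This is a sketch under assumptions: `randomSuffix` and the `test-session-` prefix are hypothetical, and hex encoding is chosen here only because it yields lowercase characters that are valid in Kubernetes namespace names.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomSuffix returns 2*n lowercase hex characters drawn from crypto/rand.
// Unlike a timestamp-based suffix, two calls within the same second
// are overwhelmingly unlikely to collide.
func randomSuffix(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	suffix, err := randomSuffix(4)
	if err != nil {
		panic(err)
	}
	// "test-session-" is a made-up prefix for illustration.
	fmt.Println("test-session-" + suffix)
}
```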
@greptile Review and update Greptile Summary.
Move ParseOutputs and ResultsFromOutputs into internal/capture as the single source of truth. The controller delegates to capture, and the session runner imports directly instead of duplicating the logic. Co-Authored-By: Claude Opus 4.6 <[email protected]>
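For readers unfamiliar with the marker protocol, the shared parsing step could be approximated as below. The marker strings come from this PR; everything else is an assumption: `extractOutputs` is a hypothetical stand-in (the real ParseOutputs signature is not shown in this conversation), and the `key: value` payload in the demo is invented sample data.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	outputsStart = "---KELOS_OUTPUTS_START---"
	outputsEnd   = "---KELOS_OUTPUTS_END---"
)

// extractOutputs returns the non-empty lines between the last start marker
// and the end marker that follows it, or nil when either marker is missing.
// It is an illustrative helper, not the internal/capture implementation.
func extractOutputs(stdout string) []string {
	i := strings.LastIndex(stdout, outputsStart)
	if i < 0 {
		return nil
	}
	body := stdout[i+len(outputsStart):]
	j := strings.Index(body, outputsEnd)
	if j < 0 {
		return nil
	}
	var lines []string
	for _, line := range strings.Split(body[:j], "\n") {
		if line = strings.TrimSpace(line); line != "" {
			lines = append(lines, line)
		}
	}
	return lines
}

func main() {
	// Sample stdout; the "key: value" line is made up for the demo.
	stdout := "agent log line\n" + outputsStart + "\nkey: value\n" + outputsEnd + "\n"
	fmt.Println(extractOutputs(stdout))
}
```

Keeping one such function in a single package means the controller and the session runner cannot drift apart on how markers are interpreted.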
Capture the start timestamp at the top of processTask (before runAgent) and pass it to updateTaskStatus as a fallback. Previously metav1.Now() was captured only at completion time and shared for both StartTime and CompletionTime, producing zero-duration metrics for fast-completing tasks. Co-Authored-By: Claude Opus 4.6 <[email protected]>
@greptile Review and update Greptile Summary.
- Remove unnecessary markerStart/markerEnd unexported aliases in internal/capture; use the exported constants directly
- Use defer in processTask to guarantee updateTaskStatus is called (and CompletionTime written) even when workspace reset fails early
Co-Authored-By: Claude Opus 4.6 <[email protected]>
No external package references them directly; keep the API surface minimal. Co-Authored-By: Claude Opus 4.6 <[email protected]>
@greptile Review and update Greptile Summary.
Use context.Background with a 10s timeout in the deferred updateTaskStatus call so that CompletionTime is written even when the parent context is cancelled (e.g. SIGTERM during agent execution). Co-Authored-By: Claude Opus 4.6 <[email protected]>
@greptile Review and update Greptile Summary.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes session runner output capture for persistent execution mode. Previously, when tasks ran inside session pods, the agent's stdout was not captured, so the output markers (---KELOS_OUTPUTS_START---/---KELOS_OUTPUTS_END---) were never parsed. This meant Task.status.outputs and Task.status.results were always empty, and downstream reporters (e.g. Slack) had nothing to post back.
Changes:
- Capture agent stdout in a bounded ring buffer (tailWriter) and parse output markers after each task completes
- Set CompletionTime on task completion regardless of whether outputs were captured
- Move output parsing into internal/capture (single source of truth for both controller and session runner)
Which issue(s) this PR is related to:
Fixes kelos-dev#911
Special notes for your reviewer:
The internal/controller/output_parser.go now delegates to internal/capture.ParseOutputs/ResultsFromOutputs rather than duplicating the logic. The session runner imports directly from internal/capture.
The tailWriter ring buffer caps memory at 256KB per task run; only the tail of stdout is retained since kelos-capture always emits markers at the end.
Does this PR introduce a user-facing change?