fix: capture agent outputs in session runner for persistent mode by j-bennet · Pull Request #126 · datagravity-ai/kelos

j-bennet · 2026-05-08T02:06:04Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes session runner output capture for persistent execution mode. Previously, when tasks ran inside session pods, the agent's stdout was not captured, so the output markers (---KELOS_OUTPUTS_START--- / ---KELOS_OUTPUTS_END---) were never parsed. This meant Task.status.outputs and Task.status.results were always empty, and downstream reporters (e.g. Slack) had nothing to post back.

Changes:

Capture agent stdout via a bounded ring buffer (tailWriter) and parse output markers after each task completes
Write parsed outputs and results to the Task status with retry-on-conflict
Always set CompletionTime on task completion regardless of whether outputs were captured
Consolidate duplicated output parsing logic into internal/capture (single source of truth for both controller and session runner)
Add integration tests covering the full persistent-mode task lifecycle (Queued → Pending → Running → Succeeded/Failed), output preservation, and requeue behavior

Which issue(s) this PR is related to:

Fixes kelos-dev#911

Special notes for your reviewer:

The internal/controller/output_parser.go now delegates to internal/capture.ParseOutputs / ResultsFromOutputs rather than duplicating the logic. The session runner imports directly from internal/capture.

The tailWriter ring buffer caps memory at 256KB per task run — only the tail of stdout is retained since kelos-capture always emits markers at the end.

Does this PR introduce a user-facing change?

Fix persistent execution mode tasks not reporting outputs (branch, commit, PR URLs, cost, response) back to Task status

The session runner was not capturing agent stdout to extract output markers (---KELOS_OUTPUTS_START/END---). This meant Task status never received outputs/results, so Slack reporting had nothing to post back to the thread. Now runAgent uses io.MultiWriter to tee stdout into a buffer, parses the output markers after the agent exits, and writes outputs/results to the Task status via the Kelos API. Co-Authored-By: Claude Opus 4.6 <[email protected]>

greptile-apps · 2026-05-08T02:08:52Z

Greptile Summary

This PR fixes persistent-mode task execution by capturing agent stdout through a bounded tailWriter ring buffer (256 KB), parsing the ---KELOS_OUTPUTS_START---/---KELOS_OUTPUTS_END--- markers after each run, and writing CompletionTime, Outputs, and Results to the Task status via a deferred call that uses a fresh context.Background() timeout — resolving all the output-capture, cancellation, and timestamp concerns raised on prior iterations.

Output capture: runAgent now returns the tail of stdout; processTask parses markers and stores results; deferred updateTaskStatus always writes CompletionTime regardless of whether markers were found.
Memory safety: tailWriter is a fixed-size ring buffer that evicts old data, capping memory at 256 KB per run while still retaining the marker block that kelos-capture always emits at the end.
Deduplication: ParseOutputs / ResultsFromOutputs are promoted to internal/capture and the controller's output_parser.go becomes thin delegation wrappers, eliminating the duplicated logic.

Confidence Score: 5/5

Safe to merge — all previously raised concerns are addressed and no new defects were found.

The three structural fixes (bounded ring buffer, deferred status write with a fresh context, conflict-retrying UpdateStatus) are all present and correct. The tailWriter ring buffer logic was verified to handle small writes, exact-fill, overflow, and multi-write wrap-around correctly. The deferred function captures outputs and results by closure reference, so their final post-agent values are always used. All edge cases called out in earlier review rounds are resolved, and the new unit and integration tests cover the critical paths.

No files require special attention.

Sequence Diagram

sequenceDiagram
    participant SR as SessionReconciler
    participant R as Runner.processTask
    participant WS as workspace.Reset
    participant A as Agent (kelos_entrypoint.sh)
    participant TW as tailWriter (256KB ring)
    participant K8s as Kubernetes API

    R->>K8s: Get Task
    R->>SR: setTaskStatus("running") via annotation
    Note over R: startTime = metav1.Now()
    Note over R: defer updateTaskStatus(context.Background()+10s)
    R->>WS: Reset(ctx, branch)
    WS-->>R: ok / error
    R->>A: exec.CommandContext(ctx, entrypoint, prompt)
    A->>TW: stdout writes (tee via io.MultiWriter)
    A->>TW: ... (ring buffer evicts old data, keeps last 256KB)
    A-->>R: exit (success or error)
    R->>TW: String() → agentOutput
    R->>R: capture.ParseOutputs(agentOutput)
    R->>R: capture.ResultsFromOutputs(outputs)
    Note over R: defer fires (even on early return)
    R->>K8s: Get Task (retry loop, up to 3x on 409)
    R->>K8s: UpdateStatus(StartTime, CompletionTime, Outputs, Results)
    R->>SR: setTaskStatus("succeeded"/"failed") via annotation
    SR->>K8s: Update Task.Status.Phase

_{Reviews (6): Last reviewed commit: "fix: use independent context for deferre..." | Re-trigger Greptile}

Tests the full SessionReconciler lifecycle through envtest: - Task with execution-mode label transitions to Queued (not Job) - SessionReconciler assigns task to available session pod - Annotation-based protocol: running → succeeded/failed transitions - Task outputs/results written by session runner are preserved - Requeue behavior when no session pod is available Also registers SessionReconciler in the integration test suite. Co-Authored-By: Claude Opus 4.6 <[email protected]>

- Add retry-on-conflict to updateTaskStatus (renamed from updateTaskOutputs) to avoid silently dropping outputs when the SessionReconciler writes concurrently - Always set CompletionTime/StartTime regardless of whether output markers are present, so downstream consumers (TTL, Slack reporter) don't see the task as perpetually in-progress - Replace unbounded bytes.Buffer with a 256KB ring buffer (tailWriter) to cap memory usage from verbose agents Co-Authored-By: Claude Opus 4.6 <[email protected]>

j-bennet · 2026-05-08T02:27:06Z