Skip to content

fix: capture agent outputs in session runner for persistent mode#126

Merged
j-bennet merged 9 commits intoprodfrom
fix-session-runner-output-capture
May 8, 2026
Merged

fix: capture agent outputs in session runner for persistent mode#126
j-bennet merged 9 commits intoprodfrom
fix-session-runner-output-capture

Conversation

@j-bennet
Copy link
Copy Markdown

@j-bennet j-bennet commented May 8, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes session runner output capture for persistent execution mode. Previously, when tasks ran inside session pods, the agent's stdout was not captured, so the output markers (---KELOS_OUTPUTS_START--- / ---KELOS_OUTPUTS_END---) were never parsed. This meant Task.status.outputs and Task.status.results were always empty, and downstream reporters (e.g. Slack) had nothing to post back.

Changes:

  • Capture agent stdout via a bounded ring buffer (tailWriter) and parse output markers after each task completes
  • Write parsed outputs and results to the Task status with retry-on-conflict
  • Always set CompletionTime on task completion regardless of whether outputs were captured
  • Consolidate duplicated output parsing logic into internal/capture (single source of truth for both controller and session runner)
  • Add integration tests covering the full persistent-mode task lifecycle (Queued → Pending → Running → Succeeded/Failed), output preservation, and requeue behavior

Which issue(s) this PR is related to:

Fixes kelos-dev#911

Special notes for your reviewer:

The internal/controller/output_parser.go now delegates to internal/capture.ParseOutputs / ResultsFromOutputs rather than duplicating the logic. The session runner imports directly from internal/capture.

The tailWriter ring buffer caps memory at 256KB per task run — only the tail of stdout is retained since kelos-capture always emits markers at the end.

Does this PR introduce a user-facing change?

Fix persistent execution mode tasks not reporting outputs (branch, commit, PR URLs, cost, response) back to Task status

The session runner was not capturing agent stdout to extract output
markers (---KELOS_OUTPUTS_START/END---). This meant Task status never
received outputs/results, so Slack reporting had nothing to post back
to the thread.

Now runAgent uses io.MultiWriter to tee stdout into a buffer, parses
the output markers after the agent exits, and writes outputs/results
to the Task status via the Kelos API.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR fixes persistent-mode task execution by capturing agent stdout through a bounded tailWriter ring buffer (256 KB), parsing the ---KELOS_OUTPUTS_START---/---KELOS_OUTPUTS_END--- markers after each run, and writing CompletionTime, Outputs, and Results to the Task status via a deferred call that uses a fresh context.Background() timeout — resolving all the output-capture, cancellation, and timestamp concerns raised on prior iterations.

  • Output capture: runAgent now returns the tail of stdout; processTask parses markers and stores results; deferred updateTaskStatus always writes CompletionTime regardless of whether markers were found.
  • Memory safety: tailWriter is a fixed-size ring buffer that evicts old data, capping memory at 256 KB per run while still retaining the marker block that kelos-capture always emits at the end.
  • Deduplication: ParseOutputs / ResultsFromOutputs are promoted to internal/capture and the controller's output_parser.go becomes thin delegation wrappers, eliminating the duplicated logic.

Confidence Score: 5/5

Safe to merge — all previously raised concerns are addressed and no new defects were found.

The three structural fixes (bounded ring buffer, deferred status write with a fresh context, conflict-retrying UpdateStatus) are all present and correct. The tailWriter ring buffer logic was verified to handle small writes, exact-fill, overflow, and multi-write wrap-around correctly. The deferred function captures outputs and results by closure reference, so their final post-agent values are always used. All edge cases called out in earlier review rounds are resolved, and the new unit and integration tests cover the critical paths.

No files require special attention.

Sequence Diagram

sequenceDiagram
    participant SR as SessionReconciler
    participant R as Runner.processTask
    participant WS as workspace.Reset
    participant A as Agent (kelos_entrypoint.sh)
    participant TW as tailWriter (256KB ring)
    participant K8s as Kubernetes API

    R->>K8s: Get Task
    R->>SR: setTaskStatus("running") via annotation
    Note over R: startTime = metav1.Now()
    Note over R: defer updateTaskStatus(context.Background()+10s)
    R->>WS: Reset(ctx, branch)
    WS-->>R: ok / error
    R->>A: exec.CommandContext(ctx, entrypoint, prompt)
    A->>TW: stdout writes (tee via io.MultiWriter)
    A->>TW: ... (ring buffer evicts old data, keeps last 256KB)
    A-->>R: exit (success or error)
    R->>TW: String() → agentOutput
    R->>R: capture.ParseOutputs(agentOutput)
    R->>R: capture.ResultsFromOutputs(outputs)
    Note over R: defer fires (even on early return)
    R->>K8s: Get Task (retry loop, up to 3x on 409)
    R->>K8s: UpdateStatus(StartTime, CompletionTime, Outputs, Results)
    R->>SR: setTaskStatus("succeeded"/"failed") via annotation
    SR->>K8s: Update Task.Status.Phase
Loading

Reviews (6): Last reviewed commit: "fix: use independent context for deferre..." | Re-trigger Greptile

Comment thread internal/sessionrunner/runner.go Outdated
Comment thread internal/sessionrunner/runner.go Outdated
Comment thread internal/sessionrunner/runner.go Outdated
j-bennet and others added 2 commits May 7, 2026 21:17
Tests the full SessionReconciler lifecycle through envtest:
- Task with execution-mode label transitions to Queued (not Job)
- SessionReconciler assigns task to available session pod
- Annotation-based protocol: running → succeeded/failed transitions
- Task outputs/results written by session runner are preserved
- Requeue behavior when no session pod is available

Also registers SessionReconciler in the integration test suite.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add retry-on-conflict to updateTaskStatus (renamed from
  updateTaskOutputs) to avoid silently dropping outputs when the
  SessionReconciler writes concurrently
- Always set CompletionTime/StartTime regardless of whether output
  markers are present, so downstream consumers (TTL, Slack reporter)
  don't see the task as perpetually in-progress
- Replace unbounded bytes.Buffer with a 256KB ring buffer (tailWriter)
  to cap memory usage from verbose agents

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@j-bennet
Copy link
Copy Markdown
Author

j-bennet commented May 8, 2026

@greptile Review and update Greptile Summary.

- Accept either Queued or Pending when verifying initial phase, since
  the SessionReconciler can assign the task before the first poll
- Use crypto/rand for namespace suffixes to avoid collisions when
  tests run within the same second

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@j-bennet
Copy link
Copy Markdown
Author

j-bennet commented May 8, 2026

@greptile Review and update Greptile Summary.

Comment thread internal/sessionrunner/runner.go Outdated
Move ParseOutputs and ResultsFromOutputs into internal/capture as the
single source of truth. The controller delegates to capture, and the
session runner imports directly instead of duplicating the logic.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Capture the start timestamp at the top of processTask (before runAgent)
and pass it to updateTaskStatus as a fallback. Previously metav1.Now()
was captured only at completion time and shared for both StartTime and
CompletionTime, producing zero-duration metrics for fast-completing tasks.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@j-bennet
Copy link
Copy Markdown
Author

j-bennet commented May 8, 2026

@greptile Review and update Greptile Summary.

Comment thread internal/capture/capture.go Outdated
Comment thread internal/sessionrunner/runner.go
Comment thread internal/sessionrunner/runner.go Outdated
j-bennet and others added 2 commits May 7, 2026 22:17
- Remove unnecessary markerStart/markerEnd unexported aliases in
  internal/capture; use the exported constants directly
- Use defer in processTask to guarantee updateTaskStatus is called
  (and CompletionTime written) even when workspace reset fails early

Co-Authored-By: Claude Opus 4.6 <[email protected]>
No external package references them directly; keep the API surface
minimal.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@j-bennet
Copy link
Copy Markdown
Author

j-bennet commented May 8, 2026

@greptile Review and update Greptile Summary.

Comment thread internal/sessionrunner/runner.go
Use context.Background with a 10s timeout in the deferred
updateTaskStatus call so that CompletionTime is written even when the
parent context is cancelled (e.g. SIGTERM during agent execution).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@j-bennet
Copy link
Copy Markdown
Author

j-bennet commented May 8, 2026

@greptile Review and update Greptile Summary.

@j-bennet j-bennet merged commit 9f46760 into prod May 8, 2026
11 checks passed
@j-bennet j-bennet deleted the fix-session-runner-output-capture branch May 8, 2026 03:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API: Add persistent execution mode to TaskSpawner for warm agent sessions that eliminate per-task cold-start overhead

2 participants