
Agent to Agent Handoff Data Structure #1026

@knechtionscoding

Description

Area: New API Extension

Summary

Kelos agents today pass only flat metadata between dependent tasks — branch names, PR URLs, token counts (TaskStatus.Results as map[string]string). When a Linear investigation agent concludes, there is no efficient way to transfer its rich findings (root cause analysis, relevant code paths, reproduction steps) to a downstream solver agent. The solver either re-derives everything from scratch (burning tokens and time) or operates blind.

This proposal adds a standard structured handoff payload to TaskStatus — a versioned JSON format that agents write during execution and downstream agents consume via prompt templates. It is deliberately scoped to the data format and plumbing for inter-task context transfer, not control flow, pipeline management, or dynamic spawning.

Problem

1. Results are metadata, not context

The existing TaskStatus.Results (map[string]string) is designed for machine-readable execution metadata — branch names, commit SHAs, PR URLs, cost, token counts. These are outputs of kelos-capture, not the agent itself. There is no channel for the agent to pass substantive context to a downstream agent: investigation findings, analysis summaries, Slack thread context, or decision rationale.

The prompt template system (internal/controller/task_controller.go:849-886) injects dependency results into downstream prompts:

{{index .Deps "investigate" "Results" "branch"}}

But there is no Results key for "here's what I found and why." Agents would have to stuff freeform text into a flat string map, which has no schema, no size discipline, and no semantic separation between metadata and content.

2. Downstream agents waste tokens re-deriving context

In a pipeline like investigate → fix → open-pr, the fix agent currently receives only coordinates (which branch, which PR). It must re-read the issue, re-analyze the codebase, and re-identify the root cause — duplicating work the investigation agent already completed. This is the primary cost multiplier in multi-stage agent workflows.

With a structured handoff, the investigation agent's summary flows directly into the fix agent's prompt, eliminating redundant analysis.

3. No audit trail for inter-agent communication

When debugging why an agent made a particular decision, operators can inspect TaskStatus.Results for metadata but cannot see what context the agent received from its upstream dependency. A structured handoff stored in TaskStatus provides a persistent, kubectl-inspectable record of exactly what context was transferred between agents.

4. Existing proposals assume Results is sufficient

Several open proposals (#747 conditional dependencies, #792 historyContext, #829 parent-child tasks, #983 TaskPipeline CRD) reference dependency results as the primary inter-task data channel. All of them would benefit from a richer, standardized handoff format — but none of them define one.

Proposed Design

The Handoff Type

Add a new type to api/v1alpha1/task_types.go:

// TaskHandoff is the standard structured payload for agent-to-agent
// context transfer. Versioned for forward compatibility.
type TaskHandoff struct {
    // Version of the handoff schema.
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:default=1
    Version int `json:"version"`

    // Summary is a concise description of what was accomplished or found.
    // Intended to be token-efficient when injected into downstream prompts.
    // +kubebuilder:validation:MaxLength=4096
    Summary string `json:"summary"`

    // Detail contains rich context: findings, analysis, reasoning.
    // Downstream tasks can selectively include this when depth is needed.
    // +optional
    // +kubebuilder:validation:MaxLength=65536
    Detail string `json:"detail,omitempty"`

    // Data contains structured key-value pairs for machine-readable fields.
    // Supports templating in downstream prompts and CEL evaluation in
    // conditional dependencies (#747).
    // +optional
    Data map[string]string `json:"data,omitempty"`
}

Add the field to TaskStatus:

type TaskStatus struct {
    // ... existing fields ...

    // Handoff contains the structured context payload produced by the agent
    // for consumption by downstream dependent tasks.
    // +optional
    Handoff *TaskHandoff `json:"handoff,omitempty"`
}

Why separate from Results

|          | Results | Handoff |
| -------- | ------- | ------- |
| Purpose  | Execution metadata | Agent-to-agent context |
| Producer | kelos-capture (post-run binary) | The agent itself (during execution) |
| Content  | Branch, PR URL, commit SHA, cost, tokens | Investigation findings, analysis, summaries |
| Format   | Flat map[string]string | Versioned struct with summary/detail/data |
| Consumer | Controller (metrics, branch lock), prompt templates | Downstream agent prompts |

Keeping them separate means each can evolve independently. Results is part of the agent image interface; Handoff is part of the agent's output to other agents.

How agents produce handoffs

1. Well-known file path. The agent writes JSON to a path specified by the KELOS_HANDOFF_PATH environment variable (default: /tmp/kelos-handoff.json).

2. kelos-capture emits it. After the agent exits, kelos-capture reads the handoff file (if it exists), validates the schema, and emits it between markers in stdout:

---KELOS_HANDOFF_START---
{"version":1,"summary":"...","detail":"...","data":{"key":"value"}}
---KELOS_HANDOFF_END---

3. Controller parses it. The controller extracts the handoff JSON from pod logs (same mechanism as KELOS_OUTPUTS_START/END) and stores it in TaskStatus.Handoff.

4. No handoff = no error. If the agent doesn't write a handoff file, nothing happens. The field remains nil. This makes handoffs fully opt-in with zero impact on existing tasks.
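The marker extraction in step 3 can be sketched as follows. This is illustrative rather than actual controller code: `parseHandoff` is a hypothetical name, and the `TaskHandoff` mirror here only assumes the fields and marker strings shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

const (
	handoffStart = "---KELOS_HANDOFF_START---"
	handoffEnd   = "---KELOS_HANDOFF_END---"
)

// TaskHandoff mirrors the proposed API type.
type TaskHandoff struct {
	Version int               `json:"version"`
	Summary string            `json:"summary"`
	Detail  string            `json:"detail,omitempty"`
	Data    map[string]string `json:"data,omitempty"`
}

// parseHandoff scans pod logs for the marker pair and unmarshals the
// JSON between them. It returns (nil, nil) when no handoff was emitted,
// matching the opt-in behavior described in step 4.
func parseHandoff(logs string) (*TaskHandoff, error) {
	start := strings.Index(logs, handoffStart)
	if start < 0 {
		return nil, nil // agent wrote no handoff: field stays nil
	}
	rest := logs[start+len(handoffStart):]
	end := strings.Index(rest, handoffEnd)
	if end < 0 {
		return nil, fmt.Errorf("handoff start marker without end marker")
	}
	var h TaskHandoff
	if err := json.Unmarshal([]byte(strings.TrimSpace(rest[:end])), &h); err != nil {
		return nil, fmt.Errorf("malformed handoff JSON: %w", err)
	}
	return &h, nil
}

func main() {
	logs := "agent output...\n" + handoffStart + "\n" +
		`{"version":1,"summary":"root cause found","data":{"severity":"critical"}}` +
		"\n" + handoffEnd + "\n"
	h, err := parseHandoff(logs)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Summary, h.Data["severity"])
}
```

A missing start marker is deliberately not an error here, only a missing end marker or malformed JSON is, which keeps the "no handoff = no error" contract.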

How downstream tasks consume handoffs

The existing .Deps template context is extended to include Handoff:

deps[depName] = map[string]interface{}{
    "Outputs": depTask.Status.Outputs,
    "Results": depTask.Status.Results,
    "Handoff": depTask.Status.Handoff,   // NEW
    "Name":    depName,
}

Downstream prompts access handoff fields via Go templates:

prompt: |
  ## Investigation Summary
  {{index .Deps "investigate" "Handoff" "Summary"}}

  ## Detailed Findings
  {{index .Deps "investigate" "Handoff" "Detail"}}

  ## Specific Data
  Root cause file: {{index .Deps "investigate" "Handoff" "Data" "root_cause_file"}}

  Fix the issue described above on branch {{index .Deps "investigate" "Results" "branch"}}.

The task author controls exactly what gets injected — Summary for token efficiency, Detail when depth is needed, specific Data values for targeted references.
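The resolution mechanics can be illustrated with a self-contained text/template sketch. Note one assumption: the handoff is modeled here as a plain map, because Go's `index` builtin traverses maps but not struct fields, so the controller would presumably expose the handoff in a map-compatible form when building the `.Deps` context. `renderPrompt` is a hypothetical helper, not the controller's actual code.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderPrompt resolves a downstream prompt against a stand-in .Deps
// context. The handoff is represented as a map so the chained `index`
// calls from the examples above resolve.
func renderPrompt() (string, error) {
	ctx := map[string]interface{}{
		"Deps": map[string]map[string]interface{}{
			"investigate": {
				"Handoff": map[string]interface{}{
					"Summary": "Root cause: nil session in auth handler.",
					"Data":    map[string]string{"root_cause_file": "pkg/auth/auth.go"},
				},
				"Results": map[string]string{"branch": "fix/lin-423"},
			},
		},
	}
	prompt := `## Investigation Summary
{{index .Deps "investigate" "Handoff" "Summary"}}
Root cause file: {{index .Deps "investigate" "Handoff" "Data" "root_cause_file"}}
Fix the issue on branch {{index .Deps "investigate" "Results" "branch"}}.`

	tmpl, err := template.New("prompt").Parse(prompt)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, ctx); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderPrompt()
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```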

Size limits

  • Summary: max 4 KB — forces conciseness, keeps downstream prompts lean
  • Detail: max 64 KB — room for rich context without blowing up etcd (object size limit ~1.5 MB)
  • Data: no dedicated per-field cap; the serialized handoff as a whole is held to the same 64 KB bound as Detail
  • kelos-capture validates and rejects oversized handoffs with a warning log
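The validation kelos-capture might perform can be sketched like this. `validateHandoff` is a hypothetical name, and the byte-length checks simply mirror the kubebuilder MaxLength bounds above, assuming they are enforced as byte counts:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const (
	maxSummaryBytes = 4 * 1024  // mirrors the Summary MaxLength above
	maxDetailBytes  = 64 * 1024 // mirrors the Detail MaxLength above
)

// TaskHandoff mirrors the proposed API type.
type TaskHandoff struct {
	Version int               `json:"version"`
	Summary string            `json:"summary"`
	Detail  string            `json:"detail,omitempty"`
	Data    map[string]string `json:"data,omitempty"`
}

// validateHandoff enforces the proposed size limits before the payload
// is emitted; oversized handoffs are rejected rather than truncated.
func validateHandoff(h *TaskHandoff) error {
	if h.Version < 1 {
		return fmt.Errorf("version must be >= 1, got %d", h.Version)
	}
	if len(h.Summary) == 0 {
		return fmt.Errorf("summary is required")
	}
	if len(h.Summary) > maxSummaryBytes {
		return fmt.Errorf("summary exceeds %d bytes", maxSummaryBytes)
	}
	if len(h.Detail) > maxDetailBytes {
		return fmt.Errorf("detail exceeds %d bytes", maxDetailBytes)
	}
	return nil
}

func main() {
	raw := []byte(`{"version":1,"summary":"ok"}`)
	var h TaskHandoff
	if err := json.Unmarshal(raw, &h); err != nil {
		panic(err)
	}
	fmt.Println("valid:", validateHandoff(&h) == nil)
}
```

Rejecting rather than truncating keeps the contract simple: the agent either produced a well-formed handoff or the field stays nil, with a warning in the logs either way.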

Agent-side experience

The agent writes a JSON file during execution. No new tools, no MCP server, no special SDK — any agent that can write a file can produce a handoff:

# During agent execution, the agent writes:
cat > "$KELOS_HANDOFF_PATH" <<'EOF'
{
  "version": 1,
  "summary": "Root cause: null pointer in auth.go:142 when session cookie is expired. The middleware skips validation but the handler assumes non-nil session.",
  "detail": "Full stack trace:\n  auth.go:142 → session.Validate()\n  middleware.go:89 → next.ServeHTTP()\n\nReproduction:\n  curl -H 'Cookie: session=expired' https://api.example.com/protected\n\nThe fix requires a nil check before accessing session fields in the auth handler.",
  "data": {
    "root_cause_file": "pkg/auth/auth.go",
    "root_cause_line": "142",
    "severity": "critical",
    "issue_key": "LIN-423"
  }
}
EOF

For AI coding agents (Claude Code, Codex, etc.), the prompt can instruct the agent to write this file. The instruction can be part of the task prompt or injected via AgentConfig agentsMD.

Concrete Examples

Example 1: Investigation → Fix pipeline

# Stage 1: Investigate the issue
apiVersion: kelos.dev/v1alpha1
kind: Task
metadata:
  name: investigate
spec:
  type: claude-code
  credentials:
    type: oauth
    secretRef:
      name: claude-credentials
  workspaceRef:
    name: my-workspace
  prompt: |
    Investigate Linear issue LIN-423: "Users getting 500 errors on login."

    Analyze the codebase, identify the root cause, and find relevant code paths.
    Do NOT fix the issue — only investigate.

    When done, write your findings to $KELOS_HANDOFF_PATH as JSON:
    {
      "version": 1,
      "summary": "<concise root cause and location>",
      "detail": "<full analysis with code paths, stack traces, reproduction steps>",
      "data": {"root_cause_file": "<path>", "severity": "<low|medium|high|critical>"}
    }
---
# Stage 2: Fix the issue (receives investigation context)
apiVersion: kelos.dev/v1alpha1
kind: Task
metadata:
  name: fix
spec:
  type: claude-code
  credentials:
    type: oauth
    secretRef:
      name: claude-credentials
  workspaceRef:
    name: my-workspace
  branch: fix/lin-423
  dependsOn: [investigate]
  prompt: |
    ## Investigation Summary
    {{index .Deps "investigate" "Handoff" "Summary"}}

    ## Detailed Findings
    {{index .Deps "investigate" "Handoff" "Detail"}}

    Fix the issue described above. Write tests that cover the failure case.
    Commit and push your changes.
---
# Stage 3: Open PR
apiVersion: kelos.dev/v1alpha1
kind: Task
metadata:
  name: open-pr
spec:
  type: claude-code
  credentials:
    type: oauth
    secretRef:
      name: claude-credentials
  workspaceRef:
    name: my-workspace
  branch: fix/lin-423
  dependsOn: [fix]
  prompt: |
    The fix for LIN-423 is ready on branch {{index .Deps "fix" "Results" "branch"}}.

    Investigation context:
    {{index .Deps "investigate" "Handoff" "Summary"}}

    Review the diff and open a pull request with `gh pr create`.
    Reference LIN-423 in the PR description.

The fix agent starts with full context immediately — no re-investigation, no wasted tokens. The PR agent also references the investigation summary for a well-written PR description.

Example 2: Slack-triggered triage with context preservation

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: slack-triage
spec:
  when:
    genericWebhook:
      path: /slack-escalation
  taskTemplate:
    type: claude-code
    credentials:
      type: oauth
      secretRef:
        name: claude-credentials
    workspaceRef:
      name: my-workspace
    promptTemplate: |
      A Slack escalation was received: {{.Body}}

      Triage this issue:
      1. Identify the affected service and severity
      2. Check recent deployments and error logs
      3. Determine if this is a known issue

      Write your triage findings to $KELOS_HANDOFF_PATH so the resolver
      agent has full context without re-reading the Slack thread.

The downstream resolver task (via dependsOn or a future TaskPipeline) gets the triage context without needing to re-fetch and re-parse the Slack thread.

Relationship to Existing Proposals

| Issue | What it does | How handoffs interact |
| ----- | ------------ | --------------------- |
| #792 (historyContext) | Injects prior task outcomes from the same spawner into prompts | Handoff summaries from historical tasks could be included via includeKeys: [handoff_summary]. Different axes: #792 is temporal (across runs), handoffs are spatial (across pipeline stages). |
| #747 (conditional deps) | CEL-based routing on dependency results | Handoff Data fields become available for CEL evaluation: handoff.data["severity"] == "critical". Richer routing signals than flat Results alone. |
| #829 (parent-child tasks) | Agent-initiated dynamic task spawning via MCP | Parent agent reads child task handoffs via get_task_status. Child tasks inherit the parent's handoff context. The handoff format standardizes what flows in both directions. |
| #983 (TaskPipeline CRD) | First-class pipeline with stages, matrix fan-out | Handoff becomes the inter-stage data format. Pipeline stages access upstream handoffs via {{.Stages}} templates. Matrix stages could aggregate handoffs from parallel tasks. |

This proposal does not depend on any of these issues and does not block them. It is a standalone, additive primitive that each of them benefits from.

Files to Change

| File | Change |
| ---- | ------ |
| api/v1alpha1/task_types.go | Add TaskHandoff struct, add Handoff *TaskHandoff to TaskStatus |
| internal/controller/output_parser.go | Add ParseHandoff() for KELOS_HANDOFF_START/END markers |
| internal/controller/output_parser_test.go | Tests for handoff parsing, size validation, malformed JSON |
| internal/controller/task_controller.go | Parse handoff from pod logs, store in status; inject into .Deps template context; set KELOS_HANDOFF_PATH env var |
| internal/controller/task_controller_test.go | Tests for handoff in template resolution, env var injection |
| cmd/kelos-capture/main.go | Read /tmp/kelos-handoff.json, validate, emit between markers |
| docs/agent-image-interface.md | Document handoff file contract, env var, markers |
| examples/07-task-pipeline/pipeline.yaml | Update example to demonstrate handoff usage |

Estimated: ~60 lines of types + ~50 lines of parser + ~30 lines of controller wiring + tests.

Backward Compatibility

  • Purely additive: New optional field on TaskStatus, no changes to existing behavior
  • Zero config for existing users: Tasks that don't write a handoff file are completely unaffected
  • Existing Results/Outputs unchanged: kelos-capture continues emitting KELOS_OUTPUTS_START/END as before; handoff markers are separate
  • Safe template fallback: If {{.Deps.X.Handoff.Summary}} is referenced but no handoff exists, the template renders empty (Go template zero-value behavior)
  • No new CRDs: Extends existing TaskStatus within the Task resource
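The safe template fallback can be demonstrated with a minimal text/template program, assuming (as in the consumption examples above) that handoff fields surface as map lookups in the template context: a map read with a missing key yields the element type's zero value, which for strings renders as nothing. `renderWith` is a hypothetical helper for the demo.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderWith executes a small prompt fragment against a handoff map,
// returning what a downstream prompt would contain. Brackets make the
// empty fallback visible.
func renderWith(handoff map[string]string) string {
	tmpl := template.Must(template.New("p").Parse(
		`Summary: [{{index . "Summary"}}]`))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, handoff); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Println(renderWith(map[string]string{"Summary": "found it"}))
	// A dependency that produced no handoff: the key is absent and the
	// template renders the string zero value, i.e. empty.
	fmt.Println(renderWith(map[string]string{}))
}
```

One caveat worth noting: a nil *TaskHandoff pointer does not get this fallback for free with chained index calls, so the controller would likely need to guard with {{with ...}} or normalize a nil handoff to an empty map to preserve the safe-fallback property.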

/kind feature
