Merged
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/).

## [Unreleased]

### Fixed

- **Sub-agent empty final response no longer surfaces `"(no output)"` to the parent** (#1369). When a sub-agent's final LLM response in `execute_sub_agent` came back with `tool_calls.is_empty() && content.is_none()` (or `Some("")`), the dispatcher used to return the literal sentinel `"(no output)"` as the sub-agent's answer. Pre-#1366 (bg-spawn era) this was largely invisible because the sentinel landed in the auto-drained completion mail; post-#1366 phase 1 (sync dispatch) the sentinel surfaces directly as `ToolCallResult.output` on the parent's `InvokeAgent` call, and the parent (a) treats it as the agent's actual answer, (b) re-invokes the same sub-agent hoping for a real one, or (c) hits the same edge in its own loop and emits the generic "Model produced an empty response after tool use" warning. The user-visible symptom from the #1366 phase-1 validation was the parent calling `InvokeAgent explore` three times in a row with each returning `(no output)` before the parent itself bailed empty. NOT a Gemini-specific bug — every provider's response normalization (`providers/{anthropic,gemini,openai_compat}.rs`) reaches `LlmResponse { content: None, tool_calls: vec![] }` for any model that emits a bare stop signal with no text payload (Gemini hits it more often after long tool chains; Claude and GPT can produce it too); the parent inference loop already had an empty-response Warn handler at `inference.rs:1184-1192` but the sub-agent dispatch path didn't, which is the asymmetry that became the bug. Fix mirrors `claude_code_src/src/tools/AgentTool/agentToolUtils.ts::finalizeAgentTool` (~line 297, vendored at `peers/claude_code_src/`): if the final assistant turn is empty, walk backward through the sub-agent's assistant message history for the most recent non-empty trimmed text and surface THAT as the answer (the model said something useful at some point in the run; let the parent see it). 
Falls through to a structured marker (`"[sub-agent '{name}' finished after {iter} turn(s) without producing any text response. The model may have hit a provider-specific stop condition (e.g. Gemini's bare-stop after long tool chains). Try rephrasing the prompt, simplifying the task, or switching models.]"`) only when no recoverable text exists anywhere in the run — which means something is genuinely wrong (e.g. every assistant turn was a pure tool-call), and the marker is actionable rather than the bare sentinel. The recovery scan is implemented as a free function `recover_last_assistant_text(messages: &[ChatMessage]) -> Option<String>` next to `workspace_provision_failure_marker` so the contract is testable without invoking the 1300-line dispatcher loop. Considered alternatives: (a) gemini-cli's required `complete_task` tool + grace turn pattern (`packages/core/src/tools/complete-task.ts` + `agents/local-executor.ts:361-381`) — stricter and more elegant but requires updating every built-in sub-agent prompt to mandate the call, plus per-agent designer effort, deferred to #1370; (b) a synthetic re-prompt grace turn ("Please summarize your findings.") — costs one extra LLM call per occurrence and only helps when the model has more to say, scan-back is strictly cheaper and more robust for the common case. Tests: 7 new unit tests in `sub_agent_dispatch::recover_last_assistant_text_tests` pin the helper's contract directly (happy path, content-less skip, whitespace-only filtering, trim guarantee, user/system filtering, empty-input edge, no-text-anywhere None return); 1 new e2e in `e2e_agent_test.rs::sub_agent_empty_final_response_falls_back_to_recovered_text` pins the dispatcher integration end-to-end including the regression guard that the literal `"(no output)"` string MUST NOT reach the parent. 
Verified: 1401 koda-core lib tests pass (+7 new), all integration suites pass (+1 new), `cargo clippy --workspace --lib --tests --all-features` clean, `cargo fmt --check` clean.
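The recovery order described above can be sketched as a standalone function (a hedged sketch: `ChatMessage` is simplified to two fields, and `final_answer` is an illustrative name, not the actual dispatcher API):

```rust
// Simplified stand-in for the real koda-core message type (assumption:
// the actual ChatMessage carries tool_calls, tool_call_id, images, etc.).
struct ChatMessage {
    role: &'static str,
    content: Option<String>,
}

// Priority: (1) the final response's own non-blank text, (2) the most
// recent non-blank assistant text in history, (3) a structured,
// actionable marker -- never the bare "(no output)" sentinel.
fn final_answer(
    final_content: Option<&str>,
    history: &[ChatMessage],
    agent: &str,
    iter: usize,
) -> String {
    match final_content.map(str::trim) {
        Some(t) if !t.is_empty() => t.to_string(),
        _ => history
            .iter()
            .rev()
            .filter(|m| m.role == "assistant")
            .find_map(|m| {
                m.content
                    .as_deref()
                    .map(str::trim)
                    .filter(|t| !t.is_empty())
                    .map(String::from)
            })
            .unwrap_or_else(|| {
                format!(
                    "[sub-agent '{agent}' finished after {iter} turn(s) \
                     without producing any text response.]"
                )
            }),
    }
}
```

The scan-back only fires when the final turn is blank, so the common path (model summarizes at the end) pays nothing.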

### Changed (BREAKING for sub-agent dispatch shape)

- **`InvokeAgent` / `SpawnAgent` are now synchronous tool delegations — bg-spawn dispatch path retired** (#1366 phase 1). Dispatch now drives the sub-agent's inference loop inline on the parent's task and returns the sub-agent's final answer as the `ToolCallResult.output`, the same shape as every other tool. Pre-#1366 (i.e., post-#1163) every InvokeAgent call was `tokio::spawn`'d into a `run_bg_agent` task that returned a `task_id` immediately; results reached the model via auto-drain on a future iteration or via an explicit `WaitForMail` call against the mailbox bridge from #1336. That whole async surface is gone from the public dispatch path: the `inline_only=true` recursion-guard branch in `execute_sub_agent` is now the only public path; the `inline_only=false` bg-spawn block (and `run_bg_agent`, `BgRegistry`, `ChildAgentActivity` event family, mailbox infrastructure, activity-pill UI, child-activity overlay, `WaitForMail` tool) becomes dead from the public surface and will be deleted across phases 2–5 of #1366. Architectural rationale (DESIGN.md "Execution Model" section, post-#1366): the LLM API caps throughput, not concurrency, so async dispatch buys little while costing a lot of surface area; sub-agents' real value is context isolation + tool restriction + specialization, all of which work fine with sync delegation. Codex's ExecCell and gemini-cli's `SubagentGroupDisplay` patterns inspired the swap. Deliberately NO runtime config knob — per DESIGN.md P1 ("compile decisions in, don't configure them at runtime"), the sync swap is the only behavior; a previous draft that gated it behind `KODA_SUBAGENT_SYNC` was rejected as a principle violation. The `tool_dispatch.rs` change is two lines: `inline_only=false` → `true` at the InvokeAgent and SpawnAgent callsites.
The InvokeAgent tool description and module-level rustdoc are rewritten to reflect sync semantics: drops the "Spawn a sub-agent in the background" / "Returns IMMEDIATELY with a task_id" framing in favor of "Delegate a task and get its final answer back as the tool result. The call BLOCKS until the sub-agent finishes." The `[ERROR: sub-agent` marker convention is preserved AND extended to recognize the dispatcher's `Err` arm format (`"Error invoking sub-agent: {chain}"`) as the second structural-failure shape. Schema unchanged — `agent_name` + `prompt` required, no `background` param (the regression guard from #1163 still passes). Tests: 6 e2e cases in `e2e_agent_test.rs` migrated from `collect_bg_events_after(...).find_map(AgentStatus::Completed { summary }/...)` to direct `events.iter().find_map(EngineEvent::ToolCallResult { name == "InvokeAgent", output })` — the pre-#1163 wait-free assertion shape returns, now without the `tokio::spawn` race window. Two e2e cases in `e2e_sub_agent_trust_test.rs` lose their `#1321/#1323/#1327` deflake waits because the race they guarded against (parent-loop termination vs bg-agent completion) does not exist on the sync path. The `bg_agent_iter_counter_advances_via_status_channel` test is deleted entirely — it pinned the `Running { iter: n }` heartbeat from the bg path, which is now unreachable from the public dispatcher; phase 2 of #1366 will delete the underlying code, at which point the test would have been dead anyway. 
The 4 description-pinning tests in `tools/agent.rs` are rewritten: `test_invoke_agent_description_documents_spawn_only_model` becomes `test_invoke_agent_description_documents_sync_dispatch_model` and asserts on `synchronously` / `BLOCKS` / `final answer` while explicitly forbidding `task_id` / `WaitForMail` / `auto-drain` / `in the background` (a regression that re-introduces them would mislead the model about dispatch shape); `test_invoke_agent_top_level_documents_drain_semantics` and `test_invoke_agent_description_discourages_immediate_wait` are deleted (drain semantics and the WaitForMail anti-pattern only existed because of bg-spawn). Wire format unchanged — ACP / headless clients are unaffected; sub-agent `EngineEvent::SubAgentStart` still fires at dispatch time. Verified: 1394 koda-core lib tests pass (−3: env-var test never created + 1 description test deleted + 1 description test deleted; −1 from the deleted bg_agent_iter_counter test that lived in the integration suite, not lib), 656 koda-cli lib tests pass, all `koda-core` integration suites pass, `cargo clippy --workspace --lib --tests --all-features` clean, `cargo fmt --check` clean.
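The assertion-shape migration described above can be sketched with a pared-down event enum (a hypothetical stand-in — the real `EngineEvent` carries more variants and richer fields):

```rust
// Hypothetical, pared-down stand-in for koda-core's EngineEvent --
// only the shape needed to show the sync-dispatch assertion.
#[derive(Debug)]
enum EngineEvent {
    ToolCallResult { name: String, output: String },
    Other,
}

// Post-#1366: the sub-agent's answer is just a tool result on the
// parent's event stream -- no bg-event collection, no wait loop,
// no tokio::spawn race window.
fn sub_agent_answer(events: &[EngineEvent]) -> Option<&str> {
    events.iter().find_map(|e| match e {
        EngineEvent::ToolCallResult { name, output } if name == "InvokeAgent" => {
            Some(output.as_str())
        }
        _ => None,
    })
}
```

Compared with the retired `collect_bg_events_after(...)` shape, the assertion no longer needs to wait for a completion event before inspecting the answer.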
243 changes: 240 additions & 3 deletions koda-core/src/sub_agent_dispatch.rs
@@ -1723,9 +1723,43 @@ pub(crate) fn execute_sub_agent<'a>(
}

if response.tool_calls.is_empty() {
// **#1369**: pre-fix this branch did
// `response.content.unwrap_or_else(|| "(no output)".to_string())`
// and returned. That worked when the model produced a
// final summary, but a sub-agent loop can also exit
// here with `content == None` (or `Some("")`) when the
// model emits a bare stop signal after a long tool
// chain — a documented Gemini behavior, and reachable
// from any provider per `providers/{anthropic,gemini,
// openai_compat}.rs` which all normalize empty text
// to `None`. The literal sentinel `"(no output)"`
// then surfaced as the sub-agent's "answer", and the
// parent (especially after #1366 phase 1's sync
// dispatch) treated that string as the agent's
// conclusion — either re-invoking in a loop or acting
// on garbage. Fix mirrors `claude_code_src/src/tools/
// AgentTool/agentToolUtils.ts::finalizeAgentTool`
// (~line 297, also vendored under `peers/`): if the
// final assistant turn is empty, walk backward through
// history for the most recent assistant text. Falls
// through to a structured marker only if the entire
// run produced no assistant text at all (which means
// something is genuinely wrong, not just a quiet last
// turn). #1370 will layer a stricter `complete_task`
// contract on top of this; the scan-back stays as the
// robust fallback.
let result = match response.content.as_deref() {
Some(text) if !text.trim().is_empty() => text.to_string(),
_ => recover_last_assistant_text(&messages).unwrap_or_else(|| {
format!(
"[sub-agent '{agent_name}' finished after {iter} turn(s) without \
producing any text response. The model may have hit a \
provider-specific stop condition (e.g. Gemini's bare-stop after \
long tool chains). Try rephrasing the prompt, simplifying the \
task, or switching models.]"
)
}),
};
// Cache the result for future identical calls
sub_agent_cache.put(agent_name, prompt, &result);
// Release workspace; surface branch hint if agent left changes.
Expand Down Expand Up @@ -2045,6 +2079,209 @@ fn workspace_provision_failure_marker(agent_name: &str, reason: &str) -> String
)
}

/// Walk `messages` backward and return the most recent assistant
/// message's non-empty trimmed text content, if any.
///
/// **#1369**: used as a fallback when the final LLM response in a
/// sub-agent loop comes back with neither tool calls nor text
/// content (a documented Gemini behavior, but reachable from any
/// provider per `providers/{anthropic,gemini,openai_compat}.rs`
/// which all normalize empty content to `None`). Pre-fix the
/// dispatch loop returned the literal sentinel `"(no output)"`
/// here, which the parent then mis-treated as the sub-agent's
/// actual answer.
///
/// Mirrors the pattern from `claude_code_src/src/tools/AgentTool/
/// agentToolUtils.ts::finalizeAgentTool` (~line 297, also vendored
/// at `peers/claude_code_src/`):
///
/// > *"If the final assistant message is a pure tool_use block
/// > (loop exited mid-turn), fall back to the most recent
/// > assistant message that has text content."*
///
/// Filters out empty / whitespace-only content so a `Some("")`
/// row from the DB doesn't shadow earlier real text. Stops at
/// the first hit (most recent, which is what the model is most
/// likely treating as its conclusion).
///
/// Returns `None` only if the entire run produced no assistant
/// text — which means something is genuinely wrong (e.g. the
/// model never said anything, only called tools). Caller surfaces
/// a structured error in that case rather than an empty string.
fn recover_last_assistant_text(messages: &[ChatMessage]) -> Option<String> {
messages
.iter()
.rev()
.filter(|m| m.role == "assistant")
.find_map(|m| {
m.content
.as_deref()
.map(str::trim)
.filter(|t| !t.is_empty())
.map(|t| t.to_string())
})
}

#[cfg(test)]
mod recover_last_assistant_text_tests {
//! **#1369** regression tests for [`super::recover_last_assistant_text`].
//!
//! The helper guards the post-#1366-phase-1 sub-agent dispatch
//! path: when the final LLM response in a sub-agent loop comes
//! back with neither tool calls nor text content, the dispatcher
//! falls back to scanning prior assistant turns for the most
//! recent useful text. Pre-#1369 this branch returned the bare
//! sentinel `"(no output)"`, which the parent (post-sync-dispatch)
//! treated as the sub-agent's actual answer — producing the
//! parent-loops-on-junk behavior reported in the #1366 phase-1
//! validation. Pinning the helper directly is much cheaper and
//! more legible than an end-to-end dispatch test, since the
//! actual integration sits inside the 1300-line
//! `execute_sub_agent` loop and would need a mock provider that
//! returns `LlmResponse { content: None, tool_calls: vec![] }`.

use super::recover_last_assistant_text;
use crate::providers::ChatMessage;

fn assistant(text: &str) -> ChatMessage {
ChatMessage::text("assistant", text)
}

fn user(text: &str) -> ChatMessage {
ChatMessage::text("user", text)
}

fn system(text: &str) -> ChatMessage {
ChatMessage::text("system", text)
}

fn assistant_no_content() -> ChatMessage {
// Mirrors what the DB-load path produces for a tool-call-only
// assistant turn: role is "assistant" but `content` is `None`.
ChatMessage {
role: "assistant".to_string(),
content: None,
tool_calls: None,
tool_call_id: None,
images: None,
}
}

/// Happy path: most recent assistant turn has text — use it.
/// This shouldn't actually be hit by the dispatcher (the dispatcher
/// only calls the helper when the LATEST response is empty), but
/// pinning it keeps the helper's contract self-consistent: "return
/// the most recent assistant text" with no special edge for
/// position.
#[test]
fn returns_most_recent_assistant_text_when_present() {
let messages = vec![
system("you are an explorer agent"),
user("explore the repo"),
assistant("I'll start by listing files."),
assistant("I found three Rust crates."),
];
assert_eq!(
recover_last_assistant_text(&messages),
Some("I found three Rust crates.".to_string())
);
}

/// The bug we're actually fixing: latest assistant turn is
/// content-less (a pure tool-call row from the DB load path),
/// so we walk further back and surface the prior turn's text.
/// Mirrors claude_code_src's framing: "If the final assistant
/// message is a pure tool_use block (loop exited mid-turn),
/// fall back to the most recent assistant message that has
/// text content."
#[test]
fn skips_content_less_assistant_turns_and_finds_earlier_text() {
let messages = vec![
system("you are an explorer agent"),
user("explore the repo"),
assistant("I see three crates: koda-core, koda-cli, koda-sandbox."),
assistant_no_content(), // pure tool-call turn
assistant_no_content(), // another pure tool-call turn
];
assert_eq!(
recover_last_assistant_text(&messages),
Some("I see three crates: koda-core, koda-cli, koda-sandbox.".to_string())
);
}

/// Whitespace-only content must NOT shadow earlier real text.
/// Without the `.trim().is_empty()` guard, a `Some(" \n")` row
/// from a quirky provider would short-circuit the scan and
/// surface a useless string — same UX failure as the original
/// `(no output)` bug, just in disguise.
#[test]
fn whitespace_only_content_does_not_shadow_earlier_real_text() {
let messages = vec![
user("explore"),
assistant("Found Cargo.toml and src/lib.rs."),
assistant(" \n \t "), // whitespace-only — must be skipped
];
assert_eq!(
recover_last_assistant_text(&messages),
Some("Found Cargo.toml and src/lib.rs.".to_string())
);
}

/// Returned text must come back trimmed. The dispatcher uses the
/// return value verbatim as the sub-agent's tool-result string;
/// leading/trailing whitespace would render oddly in the parent's
/// transcript and burn tokens for nothing.
#[test]
fn returned_text_is_trimmed() {
let messages = vec![
user(""),
assistant(" final answer with surrounding whitespace. "),
];
assert_eq!(
recover_last_assistant_text(&messages),
Some("final answer with surrounding whitespace.".to_string())
);
}

/// User and system turns with text are NOT recovery candidates.
/// Surfacing the user's prompt back to the parent as the sub-agent's
/// "answer" would be actively misleading — the parent might think
/// the sub-agent agreed with the prompt verbatim. System turns
/// (including the grace-turn warning at #1135) are scaffolding,
/// not content. Both must be filtered.
#[test]
fn ignores_user_and_system_messages() {
let messages = vec![
system("You are explorer. You MUST summarize at the end."),
user("please explore the repo and tell me what you find"),
// No assistant text at all.
];
assert_eq!(recover_last_assistant_text(&messages), None);
}

/// No assistant text anywhere → caller surfaces a structured
/// marker. The helper signals this case with `None`, NOT with an
/// empty `Some("")` (which would re-introduce the bug).
#[test]
fn returns_none_when_no_assistant_text_anywhere() {
let messages = vec![
system("explorer prompt"),
user("explore"),
assistant_no_content(),
assistant_no_content(),
];
assert_eq!(recover_last_assistant_text(&messages), None);
}

/// Empty input slice → `None`. Pin the obvious edge so a future
/// refactor that swaps `iter().rev()` for some indexed access
/// doesn't panic in the wild.
#[test]
fn returns_none_for_empty_messages() {
assert_eq!(recover_last_assistant_text(&[]), None);
}
}

#[cfg(test)]
mod child_agent_path_tests {
//! **#1325 Phase 4 (commit 2)** regression tests for