Skip to content

Add ToolOutputManagement capability#185

Open
DouweM wants to merge 4 commits intomainfrom
capability/tool-output-management
Open

Add ToolOutputManagement capability#185
DouweM wants to merge 4 commits intomainfrom
capability/tool-output-management

Conversation

@DouweM
Copy link
Copy Markdown
Contributor

@DouweM DouweM commented Apr 10, 2026

Summary

  • Adds ToolOutputManagement capability that intercepts tool return values via after_tool_execute and truncates/summarizes large outputs to prevent context window blowup
  • Supports three truncation strategies (head, tail, head_tail), per-tool limits and strategies, and an optional summarize_fn with truncation as safety net
  • Exports ToolOutputManagement and TruncationStrategy from pydantic_harness

Closes #82

Design decisions

Per maintainer feedback, this is implemented as a configurable capability using the existing after_tool_execute hook, rather than as framework infrastructure baked into the agent loop.

The capability deliberately does not implement spill-to-file, token-based limits, or model-aware scaling (see PLAN.md for rationale on each).

Test plan

  • Unit tests for _stringify, _truncate, _head_tail_default_split helpers
  • Integration tests for after_tool_execute covering all strategies, per-tool overrides, sync/async summarize_fn, safety net truncation, boundary conditions
  • Export tests
  • 27 tests passing, ruff clean, pyright strict clean

AI generated code

  • Check this box to confirm this PR contains AI-generated code

🤖 Generated with Claude Code

DouweM and others added 4 commits April 2, 2026 05:28
Implements a capability that intercepts tool return values via the
after_tool_execute hook and truncates or summarizes large outputs to
prevent context window blowup.

Supports head, tail, and head+tail truncation strategies, per-tool
limits and strategies, and an optional summarize_fn with truncation
as a safety net.

Closes #82

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…putManagement

Addresses maintainer audit findings from PR #131:

- Line-count limits: `max_output_lines` and `per_tool_line_limits` parameters
  truncate by line count. When both char and line limits are set, whichever
  fires first wins.
- Spill-to-file: `spill_to_file` and `spill_dir` parameters. When enabled,
  oversized output is written to a temp file and the model receives a pointer
  plus a truncated preview.
- Binary detection: `bytes`, `bytearray`, and `memoryview` results are
  immediately replaced with `[Binary data, N bytes]` instead of being
  stringified and truncated.

48 new tests (75 total for this capability, 81 in the full suite).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ally-uncalled fn

- Add test for the case where both line and char limits are exceeded,
  triggering double truncation (line 209)
- Add `# pragma: no cover` to summarize_fn that's intentionally never
  called in the under-limit test

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.

Open in Devin Review

"""Return True if *text* exceeds either the char or line limit."""
if len(text) > char_limit:
return True
if line_limit is not None and text.count('\n') + 1 > line_limit:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Line counting inconsistency between _exceeds_limits and _truncate_by_lines causes ineffective line truncation

_exceeds_limits at line 246 counts lines as text.count('\n') + 1, but _truncate_by_lines at line 107 counts lines as len(text.splitlines(keepends=True)). These methods disagree for any text with a trailing newline (extremely common for command/tool output). For example, "line1\nline2\nline3\n" gives 4 by count+1 but 3 by splitlines. This means _exceeds_limits can trigger the truncation path (thinking the text has more lines than allowed), but _truncate_by_lines then returns the text unchanged (since by its counting, the limit isn't exceeded). The same inconsistency exists in _apply_truncation at line 255 which also uses count('\n') + 1.

Concrete example showing the bug

With max_output_lines=3 and text "line1\nline2\nline3\n" (3 content lines + trailing newline):

  1. _exceeds_limits: count('\n') + 1 = 4 > 3 → True, enters truncation
  2. _truncate_by_lines(text, 3, ...): splitlines gives 3 elements → 3 <= 3 → returns text unchanged
  3. Text passes through un-truncated despite _exceeds_limits flagging it
Prompt for agents
The line counting method in _exceeds_limits (and _apply_truncation at line 255) uses text.count('\n') + 1, while _truncate_by_lines uses len(text.splitlines(keepends=True)). These give different results for text with trailing newlines: count+1 produces a higher number. To fix the inconsistency, both should use the same method. The simplest fix is to change _exceeds_limits and _apply_truncation to also use len(text.splitlines(keepends=True)), or extract a shared _count_lines(text) helper used by all three locations (lines 246, 255, and 107). Alternatively, if the intent is that a trailing newline should count as an extra line, then _truncate_by_lines should also use count+1 based counting.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

if not self._exceeds_limits(text, char_limit, line_limit):
# If we stripped ANSI, return the cleaned text so the model
# never sees escape codes. Otherwise return the original value.
return text if stripped else result
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 strip_ansi=True (default) converts non-string results to strings even when under the limit

When strip_ansi is True (the default) and the result is a non-string type (e.g., a dict or list) that's under the size limit, after_tool_execute returns text (the stringified version) at line 309 rather than the original result. This means {'key': 'value'} becomes the string "{'key': 'value'}". When strip_ansi=False, the original object is preserved (return result). This type coercion is arguably correct since the model always receives strings, but it's a subtle behavioral difference: other after_tool_execute hooks downstream would see a string instead of the original type. The framework's own default implementation returns the original result unchanged, so this is a semantic change for non-string results even when no ANSI codes are present.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@DouweM
Copy link
Copy Markdown
Contributor Author

DouweM commented Apr 10, 2026

Originally posted by @DouweM in #131 comment (PR was recreated)

Audit vs prior art: ToolOutputManagement

Worth adding now:

  • max_output_lines alongside max_output_chars (Claude Code truncates by lines)
  • Spill-to-file strategy: write full output to temp file, return path + summary
  • Binary detection: skip truncation for non-text results

Follow-up opportunities:

  • Token-based limits when context window tracking exists
  • Model-aware scaling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tool Output Management (truncation, spill-to-file, per-tool summarization)

2 participants