feat(code_mode): resolve deferred/approval-required tool calls via HandleDeferredToolCalls#220

Merged
DouweM merged 6 commits into main from handle-deferred-in-code-mode
Apr 25, 2026

@DouweM DouweM commented Apr 24, 2026

Summary

  • Remove the td.defer exclusion so external/approval-required tools stay in the CodeMode sandbox instead of bouncing out as native tools
  • Point the sandbox UserError at the HandleDeferredToolCalls capability so users know how to resolve deferrals inline
  • Drop the now-unused native_fallbacks return value and deferred-tool warning
  • Update the deferred-tools tests: the old test expected promotion to native; the new one asserts that the tool is sandboxed and that the error message matches the updated text

Background

Depends on pydantic/pydantic-ai#5142, which adds the HandleDeferredToolCalls capability. Before that PR, any tool that raised ApprovalRequired/CallDeferred inside CodeMode had to bubble out — there was no inline resolver, so the sandbox intentionally hid those tools and surfaced them as native tools instead. With a handler capability, the inline flow works, so the hide-and-promote workaround is no longer needed.

The inline positive test was removed from this PR because the types it references don't exist in any released pydantic-ai-slim yet, which would break pyright. Once a version with HandleDeferredToolCalls ships, we should:

  1. Bump the pydantic-ai-slim lower bound (>=) to that version in pyproject.toml
  2. Add back an inline-resolution test (an approval-required tool sandboxed inside CodeMode, resolved by a HandleDeferredToolCalls handler, returning the tool's value to the sandbox)

CI on this PR will go red until that release lands — opening now so we don't forget the follow-up.

Test plan

  • CI passes once the pydantic-ai-slim release ships
  • Manually verify an agent with capabilities=[CodeMode[None](), HandleDeferredToolCalls(handler=...)] lets a tool that raises ApprovalRequired resolve inside the sandbox (no native fallback, handler approves, tool returns value to the model)

🤖 Generated with Claude Code

Tools with `kind='external'` or `'unapproved'` (and tools that raise
ApprovalRequired/CallDeferred at runtime) are no longer excluded from the
sandbox and promoted back to native tools. They now take the normal sandboxed
path, and a HandleDeferredToolCalls capability on the agent can resolve them
inline — so the model sees the resolved return value instead of having the
deferral bounce out as a separate native tool call.

- Remove the td.defer filter in _partition_callable_tools (no more native
  fallback for deferred tools).
- Drop the native_fallbacks return value and the corresponding deferred-tool
  warning.
- Update the sandbox UserError message when no handler is configured to point
  users at HandleDeferredToolCalls.
- Update the deferred_execution test to assert sandbox inclusion and the
  approval-retry test to match the new error message.

Depends on pydantic/pydantic-ai#5142 landing and being released; once it does,
bump the pydantic-ai-slim lower bound.
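
To make the partitioning change concrete, here is a minimal sketch of the before/after behaviour. `ToolDef`, `partition_old`, and `partition_new` are hypothetical stand-ins written for illustration; they are not the real `_partition_callable_tools` or the library's tool-definition type, and only the `defer` check described above is modelled.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ToolDef:
    """Hypothetical stand-in for a tool definition (only the fields used here)."""

    name: str
    kind: Literal['function', 'external', 'unapproved'] = 'function'

    @property
    def defer(self) -> bool:
        # Models the `td.defer` check: external or approval-required tools.
        return self.kind in ('external', 'unapproved')


def partition_old(tools: list[ToolDef]) -> tuple[list[ToolDef], list[ToolDef]]:
    """Pre-PR sketch: deferred tools are hidden from the sandbox and promoted to native."""
    sandboxed = [td for td in tools if not td.defer]
    native_fallbacks = [td for td in tools if td.defer]
    return sandboxed, native_fallbacks


def partition_new(tools: list[ToolDef]) -> list[ToolDef]:
    """Post-PR sketch: no `defer` filter, so every callable tool stays sandboxed."""
    return list(tools)


tools = [ToolDef('search'), ToolDef('delete_file', kind='unapproved')]
old_sandboxed, old_native = partition_old(tools)
new_sandboxed = partition_new(tools)
```

With the filter gone there is no second return value to thread through callers, which is why `native_fallbacks` and its warning can be dropped together.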

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

@devin-ai-integration devin-ai-integration Bot Apr 24, 2026

🚩 Class docstring now describes behavior the PR removed

The CodeModeToolset class docstring at pydantic_ai_harness/code_mode/_toolset.py:170-171 still says "Tools that require deferred execution (kind external/unapproved) cannot be called from inside the sandbox and are dropped with a one-time UserWarning." This is now factually incorrect — the entire point of this PR is to sandbox those tools instead. The docstring was not updated because it falls in unchanged context lines, but it will be misleading to anyone reading the class documentation.

…er denials

When the `HandleDeferredToolCalls` handler denies a tool call, `handle_call` now raises
`ToolDeniedError` (on pydantic-ai-slim once released) instead of returning the denial
message as a plain string. CodeMode catches it, records a
`ToolReturnPart(outcome='denied')` in `nested_returns` so message history reflects the
denial correctly, and re-raises so the sandbox surfaces the denial as an exception
rather than as what would look like a successful tool return.

The `ToolDeniedError` import is gated behind a compat shim so this module still loads
against the currently released pydantic-ai-slim (which lacks the exception); the shim
resolves to a placeholder class that never matches a real exception, leaving the except
clause inert until a release ships `ToolDeniedError`.

Depends on pydantic/pydantic-ai#5142.
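
A compat shim of the kind described in this commit might look roughly like the sketch below. The import path is an assumption (the commit message does not say where the exception lives), and `call_tool` with its dict-based `nested_returns` is a hypothetical simplification of the real dispatch code.

```python
try:
    # Assumed import path; the commit message does not name the module.
    from pydantic_ai.exceptions import ToolDeniedError
except ImportError:

    class ToolDeniedError(Exception):  # type: ignore[no-redef]
        """Placeholder that no library code raises, so the except clause below stays inert."""


def call_tool(run, nested_returns: list):
    """Run a tool callable; on denial, record it in the history and re-raise."""
    try:
        return run()
    except ToolDeniedError as exc:
        # Record the denial so message history reflects it, then let the
        # sandbox see an exception rather than a successful-looking return.
        nested_returns.append({'outcome': 'denied', 'content': str(exc)})
        raise
```

Because the placeholder is defined locally and never raised by the library, unrelated exceptions pass through untouched and nothing is recorded for them.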

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 3 new potential issues.

@devin-ai-integration devin-ai-integration Bot Apr 24, 2026

🟡 Stale docstring claims deferred-execution tools are excluded from sandbox

The CodeModeToolset class docstring at lines 188-189 states: "Tools that require deferred execution (kind external/unapproved) cannot be called from inside the sandbox and are dropped with a one-time UserWarning." This PR specifically removes that behavior — deferred-execution tools are now sandboxed like any other tool. The docstring was not updated to match, leaving incorrect documentation that contradicts the implementation.

Comment thread pydantic_ai_harness/code_mode/_toolset.py Outdated
Comment thread pydantic_ai_harness/code_mode/_toolset.py Outdated
`ToolManager.handle_call` no longer raises a (now-removed) `ToolDeniedError`
on handler denial — it returns the `ToolDenied` value the handler produced.

Drop the compat shim, import `ToolDenied` directly, and switch the dispatch
to inspect the return value: record the denial as `outcome='denied'` on the
nested `ToolReturnPart` and raise a `RuntimeError` inside the sandbox so the
script can't mistake the denial message for a regular string return.
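
The return-value dispatch described in this commit can be sketched as follows. `ToolDenied` and `ToolReturnPart` here are reduced, hypothetical stand-ins for the library types, `dispatch` is an illustrative name, and the `'success'` default for `outcome` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToolDenied:
    """Hypothetical stand-in for the value a handler returns when it denies a call."""

    message: str


@dataclass
class ToolReturnPart:
    """Hypothetical stand-in, reduced to the fields this sketch needs."""

    tool_name: str
    content: object
    outcome: str = 'success'  # assumed default value


def dispatch(tool_name: str, result: object, nested_returns: list) -> object:
    """Inspect the handler result instead of catching an exception."""
    if isinstance(result, ToolDenied):
        nested_returns.append(ToolReturnPart(tool_name, result.message, outcome='denied'))
        # Raise inside the sandbox so the script cannot treat the denial
        # message as an ordinary string return value.
        raise RuntimeError(result.message)
    nested_returns.append(ToolReturnPart(tool_name, result))
    return result
```

Checking the return type keeps the denial path explicit and removes the need for the import shim, at the cost of requiring a slim version that exports `ToolDenied`.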
devin-ai-integration[bot]

This comment was marked as resolved.

Now that the slim PR has merged to main, refresh the lockfile to pick up
the `HandleDeferredToolCalls` capability and `handle_call`'s `ToolDenied`
return value. Add a denial test that asserts the denied-call flow
surfaces as `ModelRetry` with the original denial message preserved in
the trace.

Notes on the test:
- The handler returns `ToolDenied('nope')`; the harness records
  `outcome='denied'` on the nested `ToolReturnPart` and raises
  `RuntimeError` inside the sandbox.
- The script doesn't catch the RuntimeError, so Monty surfaces it as
  `MontyRuntimeError`, which the harness converts back to `ModelRetry`.
  The retry message preserves the denial message so the model knows
  what went wrong.
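
The exception chain in these notes can be illustrated with a small self-contained sketch. `MontyRuntimeError` and `ModelRetry` are local stand-in classes here, and `run_in_sandbox` / `run_code_tool` are hypothetical helpers; the exact message format the harness produces is an assumption.

```python
class MontyRuntimeError(Exception):
    """Stand-in for the error the sandbox raises when a script fails."""


class ModelRetry(Exception):
    """Stand-in for the exception that asks the model to try again."""


def run_in_sandbox(script):
    """Run a script; any uncaught error surfaces as a sandbox runtime error."""
    try:
        return script()
    except Exception as exc:
        raise MontyRuntimeError(f'{type(exc).__name__}: {exc}') from exc


def run_code_tool(script):
    """Convert sandbox failures into a retry that keeps the original message."""
    try:
        return run_in_sandbox(script)
    except MontyRuntimeError as exc:
        raise ModelRetry(str(exc)) from exc


def denied_script():
    # Models the harness raising RuntimeError after the handler returned a denial.
    raise RuntimeError('nope')
```

A test along the lines described above would then assert that calling `run_code_tool(denied_script)` raises `ModelRetry` and that the retry message still contains the denial text.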

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 new potential issues.

The default `test` matrix uses the `[tool.uv.sources]` override pinning
slim to its main branch, so it never exercises the published-PyPI install
path. Add a `test-floor` job that overrides slim to the lowest version
declared in `pyproject.toml` (>=1.80.0) and runs the test suite, so we
catch any accidental dependency on unreleased slim features in code paths
that should be backward-compatible.

Gate the new HandleDeferredToolCalls denial test with `pytest.skip` when
the capability isn't importable — currently the only test that requires a
post-1.80.0 slim, but the pattern can be reused if more land later.
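
The skip-on-missing-import pattern could be sketched as below. The import path for `HandleDeferredToolCalls` is an assumption, and `load_capability` and the test name are hypothetical; the real test body is elided.

```python
def load_capability():
    """Return the capability class, or None when the installed slim predates it."""
    try:
        # Assumed import path; adjust to wherever the capability is exported.
        from pydantic_ai.capabilities import HandleDeferredToolCalls
    except ImportError:
        return None
    return HandleDeferredToolCalls


def test_denied_call_surfaces_as_retry():
    import pytest  # imported lazily so the helper above has no pytest dependency

    capability = load_capability()
    if capability is None:
        pytest.skip('HandleDeferredToolCalls requires a newer pydantic-ai-slim')
    ...  # the real assertions on the denial flow would go here
```

Keeping the import inside a helper means the module still collects on the floor version, and any later test that needs a newer slim can reuse the same check.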

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 new potential issues.

@DouweM DouweM changed the title from "feat(code_mode): resolve deferred tool calls via HandleDeferredToolCalls" to "feat(code_mode): resolve deferred/approval-required tool calls via HandleDeferredToolCalls" Apr 25, 2026
The `except ImportError → pytest.skip` branch only fires when running
against the slim floor (1.80.0) where `HandleDeferredToolCalls` doesn't
exist yet. The default test matrix runs against slim main, so coverage
counted those two lines as uncovered.

Mark the branch `# pragma: no cover`: it's an explicit skip path that only the
floor-slim CI job exercises, and that job isn't included in the coverage
report (it doesn't gate on coverage by design).
@DouweM DouweM merged commit fe9a587 into main Apr 25, 2026
19 of 20 checks passed

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 new potential issues.
