…acefully
Catches unhandled tool execution errors and applies configurable recovery strategies (inform, retry, fallback) per tool, preventing agent run crashes and enabling the model to self-correct.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r to ToolErrorRecovery
- retry() now accepts retry_delay (base seconds for 2^attempt backoff) and retryable_exceptions (tuple of exception types eligible for retry)
- ToolErrorRecovery gains max_total_errors: after N total errors across all tools, recovery stops and errors propagate as-is
- Per-run state (_total_errors) is reset by for_run() alongside _retry_counts
- Full test coverage for all new features including backoff timing verification, exception subclass matching, cross-tool budget exhaustion, and validation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
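As a rough illustration of the backoff and exception filter described in this commit message, the sketch below retries a callable with a delay of `retry_delay * 2**attempt`. It is not the code from this PR: `backoff_delay`, `call_with_retry`, the `sleep` parameter, and the choice of a 0-based `attempt` are all assumptions made for the example, and unlike the real capability it re-raises on exhaustion rather than returning an inform message.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(retry_delay: float, attempt: int) -> float:
    """Delay before the next try, assuming attempt is 0-based (hypothetical helper)."""
    return retry_delay * (2 ** attempt)


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int,
    retry_delay: float = 0.0,
    retryable_exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call fn, retrying up to max_retries times on retryable exceptions.

    Exceptions outside retryable_exceptions are not caught and propagate
    immediately. Subclasses of a listed type match, as with isinstance().
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable_exceptions:
            if attempt == max_retries:
                raise  # retries exhausted
            sleep(backoff_delay(retry_delay, attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is one way to verify backoff timing in tests without waiting in real time.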
    # If the exception isn't retryable, stop immediately.
    if not isinstance(exc, retryable_exceptions):
        return _format_error(call.tool_name, exc, include_traceback=self.include_traceback)
🟡 Non-retryable exceptions bypass the max_total_errors budget check in wrap_tool_execute
In wrap_tool_execute, the non-retryable exception check at line 305 returns an inform message before the budget exhaustion check at line 309. This means when a tool configured with a retry strategy and a custom retryable_exceptions filter encounters a non-retryable exception, it will always be "recovered" (returned as an inform string) even if max_total_errors budget is already exhausted. This contradicts the documented contract of max_total_errors (src/pydantic_harness/tool_error_recovery.py:222-228): "Once the budget is exhausted, subsequent errors propagate as-is instead of being recovered."
Concrete scenario triggering the bug
With ToolErrorRecovery(tool_strategies={'t': retry(3, retryable_exceptions=(ConnectionError,))}, max_total_errors=0), raising a ValueError will increment _total_errors to 1, then hit the non-retryable check and return an inform message — even though _budget_exhausted() would return True (1 > 0). The budget check on line 309 is never reached.
Suggested change:
-   # If the exception isn't retryable, stop immediately.
-   if not isinstance(exc, retryable_exceptions):
-       return _format_error(call.tool_name, exc, include_traceback=self.include_traceback)
+   # If the error budget is exhausted, let the error propagate.
+   if self._budget_exhausted():
+       raise
+   # If the exception isn't retryable, stop immediately.
+   if not isinstance(exc, retryable_exceptions):
+       return _format_error(call.tool_name, exc, include_traceback=self.include_traceback)
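The ordering problem raised in this comment can be reproduced with a small standalone model. The class below is a toy stand-in, not the real `ToolErrorRecovery`: `RecoveryModel`, `handle`, and the `budget_first` flag are hypothetical names, and the `total_errors > max_total_errors` comparison is taken from the "1 > 0" remark in the scenario above.

```python
from typing import Optional, Tuple, Type


class RecoveryModel:
    """Toy model of the per-run error budget (hypothetical, for illustration)."""

    def __init__(self, max_total_errors: Optional[int]) -> None:
        self.max_total_errors = max_total_errors
        self.total_errors = 0

    def budget_exhausted(self) -> bool:
        # Assumed comparison: exhausted once the count exceeds the budget.
        return (
            self.max_total_errors is not None
            and self.total_errors > self.max_total_errors
        )

    def handle(
        self,
        exc: Exception,
        retryable: Tuple[Type[BaseException], ...],
        budget_first: bool,
    ) -> str:
        """Decide what to do with exc; budget_first selects the check order."""
        self.total_errors += 1
        if budget_first and self.budget_exhausted():
            raise exc  # proposed order: budget check runs first
        if not isinstance(exc, retryable):
            return f"inform: {type(exc).__name__}: {exc}"
        if self.budget_exhausted():
            raise exc  # current order: only reached for retryable errors
        return "retry"


# Current order: the ValueError is "recovered" despite a budget of 0.
current = RecoveryModel(max_total_errors=0)
message = current.handle(ValueError("bad"), (ConnectionError,), budget_first=False)

# Proposed order: the same error propagates.
proposed = RecoveryModel(max_total_errors=0)
try:
    proposed.handle(ValueError("bad"), (ConnectionError,), budget_first=True)
    propagated = False
except ValueError:
    propagated = True
```

A regression test along these lines, written against the real capability, would pin down the documented `max_total_errors` contract.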
    # If the error budget is exhausted, let the error propagate.
    if self._budget_exhausted():
        raise
🚩 Potential double-counting of _total_errors if framework calls on_tool_execute_error after wrap_tool_execute raises
When the retry strategy's budget is exhausted, wrap_tool_execute re-raises the exception at line 310 (after already incrementing _total_errors at line 301). If the PydanticAI framework then calls on_tool_execute_error for this propagated exception, _total_errors would be incremented again at src/pydantic_harness/tool_error_recovery.py:337. This depends on the framework's hook dispatch behavior — specifically whether on_tool_execute_error fires for exceptions that escape wrap_tool_execute. Without access to the pydantic-ai AbstractCapability source, I can't confirm whether this happens. If it does, the error count would be inflated, though in practice it wouldn't change behavior since the budget is already exhausted at that point.
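To make the concern concrete, here is a toy model of the two hooks. It does not reflect pydantic-ai's actual dispatch, which this comment leaves unconfirmed: `CountingModel`, `run`, and the `framework_calls_error_hook` flag are hypothetical, and the method signatures are simplified for the example.

```python
from typing import Callable, Optional


class CountingModel:
    """Toy capability where both hooks increment the same counter (hypothetical)."""

    def __init__(self) -> None:
        self.total_errors = 0

    def wrap_tool_execute(self, fn: Callable[[], str]) -> str:
        try:
            return fn()
        except Exception:
            self.total_errors += 1  # first increment
            raise  # budget exhausted: let the error propagate

    def on_tool_execute_error(self, exc: Exception) -> None:
        self.total_errors += 1  # second increment if the framework calls this


def run(
    model: CountingModel,
    fn: Callable[[], str],
    framework_calls_error_hook: bool,
) -> Optional[str]:
    """Simulate a framework that may or may not fire the error hook."""
    try:
        return model.wrap_tool_execute(fn)
    except Exception as exc:
        if framework_calls_error_hook:
            model.on_tool_execute_error(exc)
        return None


def failing_tool() -> str:
    raise RuntimeError("tool failed")


with_hook = CountingModel()
run(with_hook, failing_tool, framework_calls_error_hook=True)

without_hook = CountingModel()
run(without_hook, failing_tool, framework_calls_error_hook=False)
```

If the framework does fire the error hook for escaped exceptions, one error is counted twice in this model; a flag set in `wrap_tool_execute` and checked in the error hook would be one way to avoid that.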
Audit vs prior art: ToolErrorRecovery
Worth adding now:
Follow-up opportunities:
Summary
- ToolErrorRecovery capability that catches unhandled tool execution errors and recovers gracefully, preventing agent run crashes
- 'inform' (default, returns error message to model), ('retry', N) (retries up to N times then informs), ('fallback', value) (returns static value)
- tool_strategies dict, with default_strategy for unconfigured tools
- for_run() for retry count tracking
- retry() and fallback() for readable strategy definitions

Test plan
- … (retry, fallback) including validation
- … (_validate_strategy) covering all valid and invalid forms
- for_run() isolation: fresh instance with reset retry counts
- inform strategy: error message returned to model (with/without traceback)
- fallback strategy: static value returned (None, string, dict)
- retry strategy: success on first attempt, success after failures, exhaustion falls back to inform
- … on_tool_execute_error (wrap_tool_execute re-raises)
- … pydantic_harness

Closes #61
🤖 Generated with Claude Code