Skip to content

Improve Hermes tool-call recovery#206

Open
samuelfaj wants to merge 7 commits into
raullenchai:mainfrom
samuelfaj:hermes-pr204-tool-recovery
Open

Improve Hermes tool-call recovery#206
samuelfaj wants to merge 7 commits into
raullenchai:mainfrom
samuelfaj:hermes-pr204-tool-recovery

Conversation

@samuelfaj
Copy link
Copy Markdown
Contributor

Based on PR #204.

This keeps the PR #204 tool-call fixes and adds the smallest extra recovery needed for Hermes/Qwen agent runs to complete autonomous file-writing tasks.

Changes:

  • Strengthen tool-use system suffix so assistant promises to create/edit/run/verify become tool calls instead of plain text.
  • Coerce object/list arguments to strings when a tool schema expects string parameters, matching OpenAI tool-call validation.
  • Make raw JSON tool-call extraction robust when file contents contain braces inside strings.
  • Retry non-streaming chat outputs that look like deferred tool use but returned no parsed tool calls.

Validation:

  • uv run pytest tests/test_chat_tool_retry.py tests/test_tool_calling.py tests/test_tool_parsers.py tests/test_upstream_regression.py tests/test_postprocessor.py
  • uv run ruff check vllm_mlx/routes/chat.py vllm_mlx/api/tool_calling.py vllm_mlx/service/helpers.py tests/test_tool_calling.py tests/test_chat_tool_retry.py
  • Hermes command against local Qwen generated a REST API using Express, Bun, TypeScript, and sequelize-typescript with vertical slices, models, migrations, seeders, and service unit tests.
  • Generated project verification: bun test passed with 19 tests.

@saikot3014s-cell
Copy link
Copy Markdown

Imporve Hermes took call recovery#206

@michaelasper
Copy link
Copy Markdown
Contributor

michaelasper commented May 6, 2026

We had some free compute available for another deeper pass here, especially since this is stacked on #204 and would help a lot with the Hermes/Qwen autonomous coding path. These are suggestions from a multi-pass review, not a formal requested-changes review. 🙂

I trimmed this to the pieces that still look specific to #206 after #204 is merged/rebased.

Suggestions to check before merge

  1. Preserve explicit string args before JSON-looking decode

vllm_mlx/api/tool_calling.py now fixes object/list values for string schema params by serializing them, which is the right direction. The sharp edge is that _coerce_schema_value() still calls _decode_json_like() before it checks whether the schema type is string.

That means a real string value containing formatted JSON file content is parsed and reserialized, losing whitespace/trailing newline:

content = "{\n  \"compilerOptions\": {\n    \"strict\": true\n  }\n}\n"
text = "Calling tool: write_file({\"path\":\"/tmp/tsconfig.json\", \"content\": " + json.dumps(content) + "})"
_, calls = parse_tool_calls(text, request)
args = json.loads(calls[0].function.arguments)

# current:
args["content"] == "{\"compilerOptions\": {\"strict\": true}}"

# expected:
args["content"] == content

Suggestion: infer the schema type first. If the schema explicitly says string and the value is already a string, preserve it exactly. Only json.dumps() when the original value is actually an object/list but the schema expects a string.

  1. The non-streaming retry trigger is very broad

_looks_like_deferred_tool_use() currently returns true for normal explanatory answers containing words like test, write, create, run, or I will, and for any text containing "path". With tools present, that can add up to two hidden generations and replace a valid plain-text answer with a forced tool retry.

Examples that currently return True:

"The best way to test this API is with pytest."
"The article explains how to write plugins for this API."
"Path is \"path\" in JSON schema."
"I will summarize the API."

Suggestion: narrow the retry to stronger deferred-tool signals, maybe exact assistant promise patterns plus raw partial tool-call tails, and consider respecting tool_choice="none". If this remains heuristic-based, adding negative tests for normal explanatory answers would help keep it from expanding accidentally.

  1. Retry accounting / API semantics may need a small pass

If the hidden retry loop stays, a few details are worth tightening:

  • It appears to reuse the full request timeout for each retry, so a timeout=60 request can take roughly initial call + two more 60s waits.
  • If the client disconnects during a retry, the loop breaks and returns the earlier output rather than preserving the existing non-streaming 499 behavior.
  • usage reports only the final retry GenerationOutput, so hidden attempts are not counted.
  • logprobs collected from the first output can remain attached after output is replaced by a retry result.
  1. Public raw-JSON cleanup still needs coverage

The balanced raw JSON extraction handles braces inside string/file content now, which is good. The added test covers _parse_raw_json_tool_calls() directly, but parse_tool_calls() still uses the older brace regex for cleanup after raw JSON fallback. In a mixed preamble + raw JSON tool call, it can leave malformed raw JSON in cleaned_text even though it found a tool call.

A public parse_tool_calls() regression test with file content containing braces would catch this.

Validated locally on #206 head:

  • GitHub checks are green.
  • uv run --extra dev pytest tests/test_chat_tool_retry.py tests/test_tool_calling.py tests/test_tool_parsers.py tests/test_upstream_regression.py tests/test_postprocessor.py -> 327 passed
  • uv run --extra dev ruff check vllm_mlx/routes/chat.py vllm_mlx/api/tool_calling.py vllm_mlx/service/helpers.py vllm_mlx/service/postprocessor.py vllm_mlx/tool_parsers/qwen3coder_tool_parser.py tests/test_tool_calling.py tests/test_chat_tool_retry.py tests/test_postprocessor.py -> all checks passed

@raullenchai
Copy link
Copy Markdown
Owner

SOP §0–§2 review. Holding — depends on #204 outcome + has its own scope concerns.

§0 Necessity: ✅ partially — the JSON-content-in-string fix is needed (it actually addresses P1#1 from my #204 review). The deferred-tool-use retry is more speculative.

Relationship to #204:

I left review on #204 requesting a fix for _coerce_schema_value() corrupting string args with JSON-like content. Good news: this PR already has the fix:

if schema_type in ("string", "str", "text", "varchar", "char", "enum"):
    if isinstance(value, str):
        return value
    return json.dumps(value, ensure_ascii=False)

That's exactly the right fix.

Suggested path: backport this _coerce_schema_value change (and the _iter_calling_tool_calls advance-past-malformed-marker fix from P1#3) to #204 and ship #204 first as the focused parser-correctness PR. Then re-target #206 to be the smaller "Hermes recovery + retry logic" PR on top.

§2 — issues with #206-specific scope (the retry logic):

In vllm_mlx/routes/chat.py _looks_like_deferred_tool_use:

  1. _TOOL_INTENT_RE matches too broadly. The regex hits let me, i'll, i will, starting with, create, write, edit, run, test, verify, fix, repair — common English verbs that show up in non-tool responses all the time. A user asking "explain what chmod does" with tools enabled (because they're using an agent) gets a normal explanation containing "let me explain" → retry triggered → synthetic user message injected → surprise output.

  2. '\"path\"' in lowered is also very broad. Any response mentioning quoted path (talking about filesystem paths, environment paths, URL paths, chat templates) triggers retry.

  3. 2 retries × full engine.chat() = up to 3x latency on any false positive. For a typical agentic workflow, false-positive retries are a real cost.

  4. Synthetic user message injection. retry_messages + [{role:assistant,content:...}, {role:user,content:'Call the appropriate tool now...'}] is invisible to the client. If the retry succeeds, the client sees a tool call that was prompted by a hidden message it never sent. If it fails, the client sees the post-retry output — possibly unrelated to the original turn. This needs to be at minimum logged at INFO so users can debug "why did my agent suddenly call a different tool than I asked for?"

Suggested fixes for the retry path:

  • Tighten the trigger: require BOTH a tool-intent verb AND no tool call AND a tool that matches the verb (e.g., "write_file" tool exists in the schema for the "write" verb). Or scope to specific known-pathological agents (OpenCode/Hermes) via a CLI/config flag.
  • Cap at 1 retry, not 2.
  • Log the synthetic message injection at INFO with [deferred-tool-retry] prefix so it's grep-able.
  • Add a --no-deferred-tool-retry CLI flag for users who want to opt out.

§3 + §4: I'll run targeted tests after #204 lands and you've rebased.

To summarize the path I recommend:

  1. Backport _coerce_schema_value string fix + _iter_calling_tool_calls advance-past-malformed fix from this PR to Fix Qwen tool-call OpenAI translation #204.
  2. Land Fix Qwen tool-call OpenAI translation #204 (focused parser-correctness fixes — high value, low risk).
  3. Rebase this PR on the new main.
  4. Tighten _looks_like_deferred_tool_use per above.
  5. Re-review the retry logic with a smaller diff.

Thanks for the work — the parser fixes are excellent. The retry logic just needs a tighter trigger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants