feat: FSM-based tool call constrained decoding via outlines-core #132

Draft

raullenchai wants to merge 8 commits into main from feat/fsm-tool-parser

Conversation

@raullenchai
Owner

Summary

Replace fragile string-based tool call parsing with FSM (finite state machine) constrained decoding using outlines-core. When a model generates a tool call, the FSM guarantees the JSON output is structurally valid by masking invalid tokens during generation.

Key changes:

  • New vllm_mlx/api/fsm_tool_call.py: FSM cache, Guide, two-mode logits processor
  • Works on both SimpleEngine and BatchedEngine (same FSM code path)
  • Per-parser trigger patterns for all 18 tool call formats
  • Graceful fallback: XML/Nemotron formats bypass FSM (handled by existing parsers)
  • outlines-core added to [guided] extra in pyproject.toml

Performance:

  • Per-token FSM overhead: 0.9 µs (0.004% of 20ms decode step)
  • FSM compile: ~2.3s (one-time per tool schema, cached by hash)
  • Doctor check: 0 regression across all 13 metrics
  • Benchmark: qwopus-27b 22.4 tok/s — PASS

How it works:

  1. The model generates freely (text, reasoning)
  2. When the output ends with a tool call trigger (e.g., <tool_call>\n), the FSM arms a one-token check
  3. The next token is inspected: if { → FSM activates (JSON mode)
  4. The FSM masks invalid tokens → guaranteed valid JSON
  5. When the FSM reaches a terminal state → back to free mode
  6. If the next token is not { (XML format) → the FSM skips; existing parsers handle it
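The steps above can be sketched as a toy two-mode processor. All names here (`TwoModeProcessor`, `ToyGuide`) are hypothetical illustrations: the real `FSMToolCallProcessor` masks logits over the full vocabulary via a compiled outlines-core index, whereas this sketch checks single characters against a fixed string to keep the control flow visible.

```python
class ToyGuide:
    """Stand-in for a compiled FSM guide: accepts only one fixed JSON string."""

    def __init__(self, target: str):
        self.target = target
        self.pos = 0

    def allowed(self) -> set[str]:
        # The real FSM would return the set of valid next tokens for this state.
        return {self.target[self.pos]} if self.pos < len(self.target) else set()

    def advance(self, tok: str) -> None:
        assert tok in self.allowed()
        self.pos += 1

    @property
    def finished(self) -> bool:
        return self.pos == len(self.target)


class TwoModeProcessor:
    """Free mode until the trigger appears; JSON mode only if '{' follows."""

    def __init__(self, guide: ToyGuide, trigger: str = "<tool_call>\n"):
        self.guide = guide
        self.trigger = trigger
        self.buf = ""
        self.pending = False      # trigger seen, waiting for the next token
        self.constrained = False

    def step(self, tok: str) -> bool:
        """Feed one generated token; return True if it is allowed."""
        if self.constrained:
            if tok not in self.guide.allowed():
                return False      # the real processor would mask this token out
            self.guide.advance(tok)
            if self.guide.finished:
                self.constrained = False   # terminal state -> back to free mode
            return True
        if self.pending:
            self.pending = False
            if tok == "{":
                self.constrained = True    # JSON tool call: activate the FSM
                self.guide.advance(tok)
            # else: XML-style call; FSM skips, existing parsers handle it
            return True
        self.buf += tok
        if self.buf.endswith(self.trigger):
            self.pending = True            # arm the one-token check
        return True
```

A run through the happy path: free text, trigger, `{`, constrained JSON, terminal state, back to free mode.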

Direction

This PR is step 1 toward unifying SimpleEngine and BatchedEngine. Both engines now share the same FSM logits processor interface. Next: deprecate SimpleEngine as a separate code path (see #131).

Test plan

  • 13 FSM unit tests (cache, processor, factory, performance, XML skip)
  • Doctor smoke: ruff ✅, 2086/2087 pytest pass
  • Doctor check: 0 regression, all metrics ±5%
  • Doctor benchmark: qwopus-27b PASS
  • E2E verified on both engines (FSM trigger → constrained → complete)
  • Codex review: 0 issues on FSM code

🤖 Generated with Claude Code

Your Name and others added 8 commits April 16, 2026 14:58
Core infrastructure for replacing 18 string-based tool parsers with
a finite state machine that guarantees valid JSON tool calls.

Architecture:
- FSMToolCallCache: compiles tool schemas → outlines Index, cached
  by schema hash. Precompile at server startup (2-8s one-time cost).
- FSMToolCallProcessor: two-mode logits processor — free mode (all
  tokens allowed) → constrained mode (only FSM-valid tokens) when
  model outputs a tool call trigger (e.g., <tool_call>\n).
- TOOL_CALL_TRIGGERS: per-parser trigger/closing patterns for all
  18 parser formats.

Performance:
- Per-token FSM overhead: 0.9 µs (0.004% of 20ms decode step)
- Cache hit: instant (same tools → same compiled FSM)
- Compile: ~2.3s for generic schema, cached permanently

12 tests covering cache, processor, factory, and performance.
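The schema-hash caching idea can be sketched as follows. Names are hypothetical; in the real cache the `compile_fn` would be the expensive outlines-core index build (the 2–8 s one-time cost being amortized), and the canonical-JSON hashing shown here is one plausible way to make the key order-insensitive.

```python
import hashlib
import json


class FSMToolCallCache:
    """Compile once per distinct tool schema; reuse the compiled artifact."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._cache = {}
        self.compiles = 0  # instrumentation: count actual compiles

    @staticmethod
    def _key(schema: dict) -> str:
        # Canonical JSON so dict key order doesn't change the hash.
        blob = json.dumps(schema, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get(self, schema: dict):
        key = self._key(schema)
        if key not in self._cache:
            self.compiles += 1                  # slow path: compile and store
            self._cache[key] = self._compile(schema)
        return self._cache[key]                 # fast path: cache hit
```

Two requests with the same tool set (even with keys in a different order) hit the same compiled entry, matching the "cache hit: instant" behavior above.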

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire FSMToolCallProcessor into the server startup:
- _setup_fsm_tool_calls(): initializes FSM cache + vocabulary at
  startup, creates processor factory for BatchedEngine
- _deferred_fsm_setup(): completes init after engine.start() for
  BatchedEngine (tokenizer available late), propagates factory to
  Scheduler
- Falls back to legacy bias-based processor if outlines-core not
  installed
- Vocabulary resolution handles both HF model IDs and local snapshot
  paths (extracts org/repo from cache layout)

Add outlines-core to [guided] extra in pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trigger FSM grammar compilation on the first request that includes
tools. Subsequent requests with the same tool set hit the cache
(instant). The compile happens in the request handler background,
never blocking the response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Scheduler calls factory() without a tools argument. The factory now
falls back to a generic schema (any string name + any object arguments)
instead of returning None. Also fixed _build_tool_call_schema to skip
the __generic__ sentinel.
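The generic fallback could look like the following JSON Schema fragment. This is a sketch of the "any string name + any object arguments" shape described above; the exact schema built in the codebase may differ.

```python
# Hypothetical generic fallback schema: constrains the output to *some*
# well-formed tool call when no concrete tool list is available, rather
# than disabling the FSM entirely.
GENERIC_TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},       # any string tool name
        "arguments": {"type": "object"},  # any JSON object of arguments
    },
    "required": ["name", "arguments"],
}
```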

Verified E2E: FSM trigger detection and constrained mode work
correctly with BatchedEngine + hermes parser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The registration code accidentally ended up inside the legacy tool
logits function when the FSM setup was refactored. Extracted to
_register_model() and called at the end of load_model().

Doctor check confirms: 0 TPS regression, tool calls work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Engine

SimpleEngine integration:
- MLXLanguageModel gets _logits_processor_factory attribute
- stream_generate() passes FSM processor via logits_processors kwarg
  to mlx_lm.generate_step (which already supports it)
- Server sets factory on _engine._model for SimpleEngine path

Verified E2E on both engines:
- SimpleEngine: FSM trigger detected, constrained mode activated,
  valid tool call returned. Doctor check: 0 regression.
- BatchedEngine: Same (verified earlier).

Promoted FSM trigger/complete logs from DEBUG to INFO for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing bug exposed when registration code was extracted to
its own function. Smoke tier ruff check now passes.

Doctor results after FSM integration:
- smoke: 4/5 pass (1 pre-existing pytest failure)
- check: 0 regression, all 13 metrics within ±5%
- benchmark: qwopus-27b 22.9 tok/s, 505ms TTFT — PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model uses Nemotron XML format (<function=...>) instead of
JSON ({"name": ...}), the FSM must not activate — XML is handled by
existing parsers. Now the trigger sets a pending flag, and activation
only happens on the next token if it starts with '{'.
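The activation decision reduces to a one-token lookahead after the trigger. A minimal sketch (the helper name is hypothetical):

```python
def should_activate_fsm(first_token: str) -> bool:
    """Decide on the token after the trigger: JSON tool calls open with '{';
    Nemotron-style XML (<function=...>) must fall through to the existing
    string parsers."""
    return first_token.lstrip().startswith("{")
```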

Also: doctor check/full/benchmark tiers now pass --enable-auto-tool-choice
--tool-call-parser hermes --enable-tool-logits-bias so tool calling +
FSM are exercised in all tiers.

13 tests pass including new XML-skip test.

Benchmark: qwopus-27b-8bit 22.4 tok/s, 508ms TTFT, PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raullenchai
Owner Author

Status: research-only, not for merge.

Per knowledge/fsm_parser_decision.md (internal) — the value of FSM parsing is correctness (guaranteed valid JSON), NOT speed. Industry context:

  • vLLM: xgrammar for response_format, regex parsers for tools (separated)
  • SGLang: xgrammar for both
  • Ollama: GBNF for response_format, templates for tools
  • All are opt-in, never default

This PR sketches the migration of all 18 string-based tool parsers to FSM-constrained decoding via outlines-core. Marking as draft / research — the architectural direction is right but:

  • Lint failing
  • Branch is 22 days behind main
  • 18-parser migration needs careful per-parser verification with real model output, not just unit tests
  • Need a flag-protected rollout (start opt-in via --tool-call-fsm, only flip default once we have N weeks of production telemetry showing parity)
  • Should NOT merge until each migrated parser has been A/B tested against the existing parser on real model output (pre-existing test suite is necessary but not sufficient)

Keeping open as a tracking PR for the architectural direction. Will revisit when the surrounding work (per-alias profile of which parser to use, telemetry on parser failures) is in place. If the user wants to revive: rebase + add --tool-call-fsm flag + start with one parser only (qwen3coder is highest-value).

Marking as draft.

@raullenchai raullenchai marked this pull request as draft May 9, 2026 15:54