feat: FSM-based tool call constrained decoding via outlines-core #132

Draft

raullenchai wants to merge 8 commits into main from feat/fsm-tool-parser

Conversation

@raullenchai
Owner

Summary

Replace fragile string-based tool call parsing with FSM (finite state machine) constrained decoding using outlines-core. When a model generates a tool call, the FSM guarantees the JSON output is structurally valid by masking invalid tokens during generation.

Key changes:

  • New vllm_mlx/api/fsm_tool_call.py: FSM cache, Guide, two-mode logits processor
  • Works on both SimpleEngine and BatchedEngine (same FSM code path)
  • Per-parser trigger patterns for all 18 tool call formats
  • Graceful fallback: XML/Nemotron formats bypass FSM (handled by existing parsers)
  • outlines-core added to [guided] extra in pyproject.toml

Performance:

  • Per-token FSM overhead: 0.9 µs (0.004% of 20ms decode step)
  • FSM compile: ~2.3s (one-time per tool schema, cached by hash)
  • Doctor check: 0 regression across all 13 metrics
  • Benchmark: qwopus-27b 22.4 tok/s — PASS

How it works:

  1. The model generates freely (text, reasoning)
  2. When the output ends with a tool call trigger (e.g., <tool_call>\n), the FSM arms a one-token check
  3. The next token is inspected: if { → FSM activates (JSON mode)
  4. The FSM masks invalid tokens → guaranteed valid JSON
  5. When the FSM reaches a terminal state → back to free mode
  6. If the next token is not { (XML format) → the FSM skips; existing parsers handle it
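The steps above can be sketched as a toy two-mode processor. All names here (`TwoModeProcessor`, `ToyGuide`) are hypothetical illustrations: the real `FSMToolCallProcessor` masks logits over the full vocabulary via a compiled outlines-core index, whereas this sketch checks single characters against a fixed string to keep the control flow visible.

```python
class ToyGuide:
    """Stand-in for a compiled FSM guide: accepts only one fixed JSON string."""

    def __init__(self, target: str):
        self.target = target
        self.pos = 0

    def allowed(self) -> set[str]:
        # The real FSM would return the set of valid next tokens for this state.
        return {self.target[self.pos]} if self.pos < len(self.target) else set()

    def advance(self, tok: str) -> None:
        assert tok in self.allowed()
        self.pos += 1

    @property
    def finished(self) -> bool:
        return self.pos == len(self.target)


class TwoModeProcessor:
    """Free mode until the trigger appears; JSON mode only if '{' follows."""

    def __init__(self, guide: ToyGuide, trigger: str = "<tool_call>\n"):
        self.guide = guide
        self.trigger = trigger
        self.buf = ""
        self.pending = False      # trigger seen, waiting for the next token
        self.constrained = False

    def step(self, tok: str) -> bool:
        """Feed one generated token; return True if it is allowed."""
        if self.constrained:
            if tok not in self.guide.allowed():
                return False      # the real processor would mask this token out
            self.guide.advance(tok)
            if self.guide.finished:
                self.constrained = False   # terminal state -> back to free mode
            return True
        if self.pending:
            self.pending = False
            if tok == "{":
                self.constrained = True    # JSON tool call: activate the FSM
                self.guide.advance(tok)
            # else: XML-style call; FSM skips, existing parsers handle it
            return True
        self.buf += tok
        if self.buf.endswith(self.trigger):
            self.pending = True            # arm the one-token check
        return True
```

A run through the happy path: free text, trigger, `{`, constrained JSON, terminal state, back to free mode.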

Direction

This PR is step 1 toward unifying SimpleEngine and BatchedEngine. Both engines now share the same FSM logits processor interface. Next: deprecate SimpleEngine as a separate code path (see #131).

Test plan

  • 13 FSM unit tests (cache, processor, factory, performance, XML skip)
  • Doctor smoke: ruff ✅, 2086/2087 pytest pass
  • Doctor check: 0 regression, all metrics ±5%
  • Doctor benchmark: qwopus-27b PASS
  • E2E verified on both engines (FSM trigger → constrained → complete)
  • Codex review: 0 issues on FSM code

🤖 Generated with Claude Code

Your Name and others added 8 commits April 16, 2026 14:58
Core infrastructure for replacing 18 string-based tool parsers with
a finite state machine that guarantees valid JSON tool calls.

Architecture:
- FSMToolCallCache: compiles tool schemas → outlines Index, cached
  by schema hash. Precompile at server startup (2-8s one-time cost).
- FSMToolCallProcessor: two-mode logits processor — free mode (all
  tokens allowed) → constrained mode (only FSM-valid tokens) when
  model outputs a tool call trigger (e.g., <tool_call>\n).
- TOOL_CALL_TRIGGERS: per-parser trigger/closing patterns for all
  18 parser formats.

Performance:
- Per-token FSM overhead: 0.9 µs (0.004% of 20ms decode step)
- Cache hit: instant (same tools → same compiled FSM)
- Compile: ~2.3s for generic schema, cached permanently

12 tests covering cache, processor, factory, and performance.
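The schema-hash caching idea can be sketched as follows. Names are hypothetical; in the real cache the `compile_fn` would be the expensive outlines-core index build (the 2–8 s one-time cost being amortized), and the canonical-JSON hashing shown here is one plausible way to make the key order-insensitive.

```python
import hashlib
import json


class FSMToolCallCache:
    """Compile once per distinct tool schema; reuse the compiled artifact."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._cache = {}
        self.compiles = 0  # instrumentation: count actual compiles

    @staticmethod
    def _key(schema: dict) -> str:
        # Canonical JSON so dict key order doesn't change the hash.
        blob = json.dumps(schema, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get(self, schema: dict):
        key = self._key(schema)
        if key not in self._cache:
            self.compiles += 1                  # slow path: compile and store
            self._cache[key] = self._compile(schema)
        return self._cache[key]                 # fast path: cache hit
```

Two requests with the same tool set (even with keys in a different order) hit the same compiled entry, matching the "cache hit: instant" behavior above.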

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire FSMToolCallProcessor into the server startup:
- _setup_fsm_tool_calls(): initializes FSM cache + vocabulary at
  startup, creates processor factory for BatchedEngine
- _deferred_fsm_setup(): completes init after engine.start() for
  BatchedEngine (tokenizer available late), propagates factory to
  Scheduler
- Falls back to legacy bias-based processor if outlines-core not
  installed
- Vocabulary resolution handles both HF model IDs and local snapshot
  paths (extracts org/repo from cache layout)

Add outlines-core to [guided] extra in pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trigger FSM grammar compilation on the first request that includes
tools. Subsequent requests with the same tool set hit the cache
(instant). The compile happens in the request handler background,
never blocking the response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Scheduler calls factory() without a tools argument. The factory now
falls back to a generic schema (any string name + any object arguments)
instead of returning None. Also fixed _build_tool_call_schema to skip
the __generic__ sentinel.
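The generic fallback could look like the following JSON Schema fragment. This is a sketch of the "any string name + any object arguments" shape described above; the exact schema built in the codebase may differ.

```python
# Hypothetical generic fallback schema: constrains the output to *some*
# well-formed tool call when no concrete tool list is available, rather
# than disabling the FSM entirely.
GENERIC_TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},       # any string tool name
        "arguments": {"type": "object"},  # any JSON object of arguments
    },
    "required": ["name", "arguments"],
}
```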

Verified E2E: FSM trigger detection and constrained mode work
correctly with BatchedEngine + hermes parser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The registration code accidentally ended up inside the legacy tool
logits function when the FSM setup was refactored. Extracted to
_register_model() and called at the end of load_model().

Doctor check confirms: 0 TPS regression, tool calls work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Engine

SimpleEngine integration:
- MLXLanguageModel gets _logits_processor_factory attribute
- stream_generate() passes FSM processor via logits_processors kwarg
  to mlx_lm.generate_step (which already supports it)
- Server sets factory on _engine._model for SimpleEngine path

Verified E2E on both engines:
- SimpleEngine: FSM trigger detected, constrained mode activated,
  valid tool call returned. Doctor check: 0 regression.
- BatchedEngine: Same (verified earlier).

Promoted FSM trigger/complete logs from DEBUG to INFO for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing bug exposed when registration code was extracted to
its own function. Smoke tier ruff check now passes.

Doctor results after FSM integration:
- smoke: 4/5 pass (1 pre-existing pytest failure)
- check: 0 regression, all 13 metrics within ±5%
- benchmark: qwopus-27b 22.9 tok/s, 505ms TTFT — PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model uses Nemotron XML format (<function=...>) instead of
JSON ({"name": ...}), the FSM must not activate — XML is handled by
existing parsers. Now the trigger sets a pending flag, and activation
only happens on the next token if it starts with '{'.
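The activation decision reduces to a one-token lookahead after the trigger. A minimal sketch (the helper name is hypothetical):

```python
def should_activate_fsm(first_token: str) -> bool:
    """Decide on the token after the trigger: JSON tool calls open with '{';
    Nemotron-style XML (<function=...>) must fall through to the existing
    string parsers."""
    return first_token.lstrip().startswith("{")
```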

Also: doctor check/full/benchmark tiers now pass --enable-auto-tool-choice
--tool-call-parser hermes --enable-tool-logits-bias so tool calling +
FSM are exercised in all tiers.

13 tests pass including new XML-skip test.

Benchmark: qwopus-27b-8bit 22.4 tok/s, 508ms TTFT, PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raullenchai
Owner Author

Status: research-only, not for merge.

Per knowledge/fsm_parser_decision.md (internal) — the value of FSM parsing is correctness (guaranteed valid JSON), NOT speed. Industry context:

  • vLLM: xgrammar for response_format, regex parsers for tools (separated)
  • SGLang: xgrammar for both
  • Ollama: GBNF for response_format, templates for tools
  • All are opt-in, never default

This PR sketches the migration of all 18 string-based tool parsers to FSM-constrained decoding via outlines-core. Marking as draft / research — the architectural direction is right but:

  • Lint failing
  • Branch is 22 days behind main
  • 18-parser migration needs careful per-parser verification with real model output, not just unit tests
  • Need a flag-protected rollout (start opt-in via --tool-call-fsm, only flip default once we have N weeks of production telemetry showing parity)
  • Should NOT merge until each migrated parser has been A/B tested against the existing parser on real model output (pre-existing test suite is necessary but not sufficient)

Keeping open as a tracking PR for the architectural direction. Will revisit when the surrounding work (per-alias profile of which parser to use, telemetry on parser failures) is in place. If the user wants to revive: rebase + add --tool-call-fsm flag + start with one parser only (qwen3coder is highest-value).

Marking as draft.

@raullenchai raullenchai marked this pull request as draft May 9, 2026 15:54