feat: FSM-based tool call constrained decoding via outlines-core#132
Draft
raullenchai wants to merge 8 commits into
Core infrastructure for replacing 18 string-based tool parsers with a finite state machine that guarantees valid JSON tool calls.
Architecture:
- FSMToolCallCache: compiles tool schemas → outlines Index, cached by schema hash. Precompile at server startup (2-8s one-time cost).
- FSMToolCallProcessor: two-mode logits processor. It starts in free mode (all tokens allowed) and switches to constrained mode (only FSM-valid tokens) when the model outputs a tool call trigger (e.g., <tool_call>\n).
- TOOL_CALL_TRIGGERS: per-parser trigger/closing patterns for all 18 parser formats.
Performance:
- Per-token FSM overhead: 0.9 µs (about 0.0045% of a 20 ms decode step)
- Cache hit: instant (same tools → same compiled FSM)
- Compile: ~2.3s for generic schema, cached permanently
12 tests covering cache, processor, factory, and performance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
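The "cached by schema hash" idea above can be sketched in a few lines. This is a minimal illustration, not the PR's FSMToolCallCache: the names `schema_key` and `CompiledSchemaCache` are hypothetical, and the expensive compile step is passed in as a plain callable so that no outlines-core API is assumed.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def schema_key(tools: List[Dict[str, Any]]) -> str:
    """Return a stable hash for a list of tool schemas."""
    # sort_keys makes the serialization independent of dict key order.
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CompiledSchemaCache:
    """Memoize an expensive compile step, keyed by schema hash (hypothetical)."""

    def __init__(self, compile_fn: Callable[[List[Dict[str, Any]]], Any]):
        self._compile_fn = compile_fn  # stand-in for the real FSM compiler
        self._entries: Dict[str, Any] = {}
        self.compile_count = 0

    def get(self, tools: List[Dict[str, Any]]) -> Any:
        key = schema_key(tools)
        if key not in self._entries:
            # Cache miss: pay the compile cost once for this tool set.
            self.compile_count += 1
            self._entries[key] = self._compile_fn(tools)
        return self._entries[key]
```

Because the key is computed from a canonical serialization, two requests whose tool dicts differ only in key order map to the same entry. Reordering the tools list itself would still produce a different key in this sketch.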
Wire FSMToolCallProcessor into the server startup:
- _setup_fsm_tool_calls(): initializes FSM cache + vocabulary at startup, creates processor factory for BatchedEngine
- _deferred_fsm_setup(): completes init after engine.start() for BatchedEngine (tokenizer available late), propagates factory to Scheduler
- Falls back to legacy bias-based processor if outlines-core not installed
- Vocabulary resolution handles both HF model IDs and local snapshot paths (extracts org/repo from cache layout)
Add outlines-core to [guided] extra in pyproject.toml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Trigger FSM grammar compilation on the first request that includes tools. Subsequent requests with the same tool set hit the cache (instant). The compile runs in the background of the request handler and never blocks the response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Scheduler calls factory() without a tools argument. The factory now falls back to a generic schema (any string name + any object arguments) instead of returning None. Also fixed _build_tool_call_schema to skip the __generic__ sentinel.
Verified E2E: FSM trigger detection and constrained mode work correctly with BatchedEngine + hermes parser.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
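A generic schema of the kind described here ("any string name + any object arguments") could look like the sketch below. The function name `tool_call_schema`, the constant name, and the OpenAI-style `{"function": {...}}` wrapper handling are assumptions for illustration. This is not the PR's _build_tool_call_schema, and it does not model the __generic__ sentinel.

```python
from typing import Any, Dict, List, Optional

# Accepts any string name and any JSON object as arguments.
GENERIC_TOOL_CALL_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}


def tool_call_schema(tools: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build a JSON Schema for one tool call; fall back to the generic schema."""
    if not tools:
        # No tools supplied (e.g. the caller passed nothing): use the fallback
        # rather than returning None.
        return GENERIC_TOOL_CALL_SCHEMA
    variants = []
    for tool in tools:
        # Assumed input shape: either a bare function dict or {"function": {...}}.
        fn = tool.get("function", tool)
        variants.append({
            "type": "object",
            "properties": {
                "name": {"const": fn["name"]},
                "arguments": fn.get("parameters", {"type": "object"}),
            },
            "required": ["name", "arguments"],
        })
    return variants[0] if len(variants) == 1 else {"anyOf": variants}
```

The fallback is looser than a tool-specific schema: it still guarantees well-formed JSON with the two expected keys, but it cannot check that the name refers to a real tool or that the arguments match that tool's parameters.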
The registration code accidentally ended up inside the legacy tool logits function when the FSM setup was refactored. Extracted to _register_model() and called at the end of load_model().
Doctor check confirms: 0 TPS regression, tool calls work.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Engine
SimpleEngine integration:
- MLXLanguageModel gets _logits_processor_factory attribute
- stream_generate() passes FSM processor via logits_processors kwarg to mlx_lm.generate_step (which already supports it)
- Server sets factory on _engine._model for SimpleEngine path
Verified E2E on both engines:
- SimpleEngine: FSM trigger detected, constrained mode activated, valid tool call returned. Doctor check: 0 regression.
- BatchedEngine: Same (verified earlier).
Promoted FSM trigger/complete logs from DEBUG to INFO for visibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing bug exposed when registration code was extracted to its own function. Smoke tier ruff check now passes.
Doctor results after FSM integration:
- smoke: 4/5 pass (1 pre-existing pytest failure)
- check: 0 regression, all 13 metrics within ±5%
- benchmark: qwopus-27b 22.9 tok/s, 505ms TTFT — PASS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model uses Nemotron XML format (<function=...>) instead of
JSON ({"name": ...}), the FSM must not activate — XML is handled by
existing parsers. Now the trigger sets a pending flag, and activation
only happens on the next token if it starts with '{'.
Also: doctor check/full/benchmark tiers now pass --enable-auto-tool-choice
--tool-call-parser hermes --enable-tool-logits-bias so tool calling +
FSM are exercised in all tiers.
13 tests pass including new XML-skip test.
Benchmark: qwopus-27b-8bit 22.4 tok/s, 508ms TTFT, PASS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
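The pending-flag behaviour described in this commit can be pictured as a three-state gate. The sketch below is an illustration under simplifying assumptions, not the PR's FSMToolCallProcessor: `TriggerGate` and `Mode` are hypothetical names, it works on decoded token text rather than token ids, and it assumes the trigger ends exactly on a token boundary.

```python
from enum import Enum


class Mode(Enum):
    FREE = "free"                # all tokens allowed
    PENDING = "pending"          # trigger seen, waiting to inspect the next token
    CONSTRAINED = "constrained"  # only FSM-valid tokens allowed


class TriggerGate:
    """Decide when constrained decoding should switch on (hypothetical)."""

    def __init__(self, trigger: str = "<tool_call>\n"):
        self.trigger = trigger
        self.mode = Mode.FREE
        self._tail = ""  # last len(trigger) characters of generated text

    def observe(self, token_text: str) -> Mode:
        """Feed the text of one newly generated token and return the new mode."""
        if self.mode is Mode.CONSTRAINED:
            return self.mode
        if self.mode is Mode.PENDING:
            if token_text.startswith("{"):
                # JSON tool call: hand control to the FSM.
                self.mode = Mode.CONSTRAINED
            else:
                # e.g. XML "<function=...>": leave it to the existing parsers.
                self.mode = Mode.FREE
                self._tail = ""
            return self.mode
        # FREE mode: keep a rolling tail so a trigger split across tokens is found.
        self._tail = (self._tail + token_text)[-len(self.trigger):]
        if self._tail == self.trigger:
            self.mode = Mode.PENDING
        return self.mode
```

Deferring the decision by one token is what lets the same trigger serve both output formats: the gate commits to constrained mode only after it has seen evidence that the model is emitting JSON.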
Status: research-only, not for merge. Per knowledge/fsm_parser_decision.md (internal) — the value of FSM parsing is correctness (guaranteed valid JSON), NOT speed. Industry context:
This PR sketches the migration of all 18 string-based tool parsers to FSM-constrained decoding via outlines-core. Marking as draft / research — the architectural direction is right but:
Keeping open as a tracking PR for the architectural direction. Will revisit when the surrounding work (per-alias profile of which parser to use, telemetry on parser failures) is in place. If the user wants to revive: rebase + add Marking as draft.
Summary
Replace fragile string-based tool call parsing with FSM (finite state machine) constrained decoding using outlines-core. When a model generates a tool call, the FSM guarantees the JSON output is structurally valid by masking invalid tokens during generation.
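Masking invalid tokens during generation amounts to setting their logits to negative infinity before sampling, so they can never be chosen. The sketch below shows only that step on a plain Python list. The helper names are hypothetical, and in the real processor the allowed ids would come from the FSM's current state and the logits would be an array rather than a list.

```python
import math
from typing import Iterable, List


def mask_logits(logits: List[float], allowed_ids: Iterable[int]) -> List[float]:
    """Keep logits for allowed token ids; set every other logit to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary of four tokens; assume the FSM currently allows ids 0 and 3.
logits = [2.0, 5.0, 1.0, 3.0]
masked = mask_logits(logits, allowed_ids=[0, 3])
unconstrained_choice = greedy_pick(logits)  # id 1, the raw favourite
constrained_choice = greedy_pick(masked)    # id 3, the best allowed token
```

The model's preferences among the allowed tokens are left untouched. The mask only removes options that would make the output structurally invalid.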
Key changes:
- vllm_mlx/api/fsm_tool_call.py: FSM cache, Guide, two-mode logits processor
- outlines-core added to [guided] extra in pyproject.toml
Performance:
How it works:
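- Model outputs a tool call trigger (e.g., <tool_call>\n)
- Next token starts with { → FSM activates (JSON mode)
- Next token does not start with { (XML format) → FSM skips, existing parsers handle it
Direction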
This PR is step 1 toward unifying SimpleEngine and BatchedEngine. Both engines now share the same FSM logits processor interface. Next: deprecate SimpleEngine as a separate code path (see #131).
Test plan
🤖 Generated with Claude Code