feat: day-0 DeepSeek-V4-Flash support (vendored from mlx-lm #1192)#168
feat: day-0 DeepSeek-V4-Flash support (vendored from mlx-lm #1192)#168raullenchai wants to merge 5 commits intomainfrom
Conversation
Vendor Prince Canuma's pending mlx-lm PR #1192 architecture into vllm_mlx/models/deepseek_v4.py and register it under sys.modules['mlx_lm.models.deepseek_v4'] so mlx-lm's importlib lookup finds it transparently. Also fall back to chat_template.jinja when the template isn't embedded in tokenizer_config.json (DeepSeek V4 ships it as a separate file). Adds aliases: deepseek-v4-flash (4bit), -2bit, -8bit. Removable once mlx-lm 0.32+ ships native deepseek_v4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- chat_template.jinja: read with encoding='utf-8-sig' so a UTF-8 BOM (if the file was saved with one) doesn't end up in the rendered template - vendored arch registration: use sys.modules.setdefault for explicit atomicity under the GIL Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mlx-lm's load() internally uses AutoTokenizer → PreTrainedConfig, which trips on deepseek_v4 (transformers 5.6.0.dev0 has no deepseek_v4 entry in CONFIG_MAPPING and the generic PreTrainedConfig fallback then crashes on RoPE standardization with AttributeError, not the ValueError our existing predicate catches). Added _is_vendored_arch_model() — reads config.json upfront and routes any model_type in the vendored set straight to the lower-level load_model() + raw tokenizer.json path, skipping the brittle load()/AutoConfig retry-on-error flow entirely. Verified end-to-end on mlx-community/DeepSeek-V4-Flash-2bit-DQ (158B-A13B): 32.3 tok/s decode on M3 Ultra, coherent generation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
✅ End-to-end verified on real model
Decode: 32.3 tok/s for 158B-A13B at 2-bit (~13B active params, expected band on M3 Ultra). Quality is the usual 2-bit-DQ degradation — production users should pick the Bug found + fixed during smoke testmlx-lm's high-level Final fix (commit Cache deleted post-test (disk is constrained), 90 GB reclaimed. Ready to merge. |
The 4-bit variant on mlx-community is ~489 GB on disk (per HF API usedStorage). Even on M3 Ultra Max (512 GB RAM), that leaves no room for KV cache. Point the default alias to the 8-bit variant (155 GB on disk, ~136 GB RAM) which actually fits 192 GB+ Macs. Tested on M3 Ultra 256 GB: - 8-bit: 31.2 tok/s decode, 136 GB peak RAM, 7/8 stress scenarios pass - 2-bit DQ: 55.8 tok/s decode, 91 GB peak RAM (separate alias for speed/footprint demos) Tool calling currently 0/30 on the comprehensive eval — root cause is that mlx-community's chat_template.jinja is chat-only (no tools/ tool_call rendering). Plain chat works perfectly. Tool support requires a follow-up template fix; tracking upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📊 Comprehensive test results — both 2-bit and 8-bit variantsBench (Mac Studio M3 Ultra 256 GB)
Stress test (8 scenarios) — both variants identical
7/8 PASS on both variants. Tool-calling eval (
|
| Category | Pass rate | Category | Pass rate | |
|---|---|---|---|---|
| single_tool (4) | 0/4 | sequential (4) | 0/4 | |
| function_selection (5) | 0/5 | irrelevance (3) | 0/3 | |
| complex_args (4) | 0/4 | missing_params (2) | 0/2 | |
| parallel (4) | 0/4 | error_recovery (2) | 0/2 | |
| nested_dependent (2) | 0/2 |
Every scenario logs tool_detected: False — model is not emitting anything that parses as a tool call. Root cause is upstream, not our code:
```
$ cat ~/.cache/huggingface/hub/models--mlx-community--DeepSeek-V4-Flash-8bit/.../chat_template.jinja
{%- set mode = thinking_mode|default('chat') -%}
<|begin▁of▁sentence|>
{%- for message in messages -%}
{%- if message['role'] == 'system' -%} ... {%- elif message['role'] == 'user' -%} ...
{%- elif message['role'] == 'assistant' -%} ...
{%- endif -%}
{%- endfor -%}
```
The mlx-community chat template only handles `system`/`user`/`assistant` — no `tool` role rendering, no `tools` array iteration, no `<tool_call>` markers. Tools passed via the OpenAI API are silently dropped before the model ever sees them. Same root cause we documented earlier for Mistral / Gemma in MEMORY.md.
Agent integration (Hermes / OpenClaude profiles, API-level)
| Test | 2-bit Hermes | 8-bit Hermes | 2-bit OpenClaude | 8-bit OpenClaude |
|---|---|---|---|---|
| plain_chat | ✅ | 💥 (parse error in profile) | 💥 | 💥 |
| no_tool_leak | ✅ | ✅ | ✅ | ✅ |
| stress_no_leak | ✅ | ✅ | ✅ | ✅ |
| tag_suppression (Hermes only) | ✅ | ✅ | — | — |
| All tool-call tests | ❌ × 7 | ❌ × 7 | ❌ × 7 | ❌ × 7 |
| Total | 4/12 | 3/12 | 2/10 | 2/10 |
Same story: anything not requiring `tool_calls` in the response works.
Summary + recommendations
✅ Vendored architecture works. mlx-lm load → vendored `deepseek_v4` → real weights → coherent generation, both quants.
✅ Performance is solid for size: 31 tok/s for 158B-A13B at 8-bit on M3 Ultra is on-curve with bandwidth expectations (~13B active params, ~800 GB/s mem bw).
✅ Stress + concurrency stable: 7/8 stress scenarios pass on both quants.
❌ Tool calling is broken at the chat-template level (upstream) — needs a follow-up. Recommendations:
- Ship V4 day-0 with chat-only support; document agent limitation.
- File issue with mlx-community / Blaizzy about adding tool template (similar to what mlx-community/Qwen3.6-* does).
- For agentic use today, recommend Qwen3.6-35B (100% tool calling per scorecard).
Alias change (commit `8ca9f3d`)
Repointed default `deepseek-v4-flash` from 4-bit (489 GB on disk per HF API — won't fit any single Mac) to 8-bit (155 GB → fits 192 GB+ Macs). Added explicit `deepseek-v4-flash-4bit` alias for users on distributed setups.
| Alias | Backing model | Disk | RAM | Mac tier |
|---|---|---|---|---|
| `deepseek-v4-flash` | DeepSeek-V4-Flash-8bit | 155 GB | 136 GB | 192+ GB |
| `deepseek-v4-flash-2bit` | DeepSeek-V4-Flash-2bit-DQ | 90 GB | 91 GB | 128+ GB |
| `deepseek-v4-flash-8bit` | (same as default) | |||
| `deepseek-v4-flash-4bit` | DeepSeek-V4-Flash-4bit | ~489 GB | — | distributed only |
All 8-bit + 2-bit cache deleted post-test (180 GB reclaimed). Ready to merge with the documented limitations.
Hero table, Mac fit table, and Benchmarks comparison table now include DeepSeek V4 Flash 158B-A13B results from M3 Ultra 256GB testing: - 2-bit DQ: 56 tok/s decode, 91 GB peak RAM, 128 GB+ Mac tier - 8-bit: 31 tok/s decode, 136 GB peak RAM, 192 GB+ Mac tier Both tagged 'chat only' since the mlx-community chat template ships without tool/function rendering — see PR #168 for details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds day-0 support for DeepSeek-V4-Flash (158B-A13B, 1M context, dropped 2026-04-24).
mlx-lm 0.31.x doesn't yet ship
deepseek_v4— five PRs are racing to add it (#1189, #1192, #1195, #1201, #1216). To not block users, we vendor the cleanest one (Prince Canuma / Blaizzy, mlx-vlm maintainer) intovllm_mlx/models/deepseek_v4.pyand hook it into mlx-lm's loader.How the hook works
mlx-lm's
_get_classes()resolvesmodel_typeviaimportlib.import_module(f"mlx_lm.models.{model_type}"). We pre-populatesys.modules['mlx_lm.models.deepseek_v4']with our vendored module insideload_model_with_fallback()— no monkey-patch of mlx-lm's source.Imports in the vendored file are rewritten from relative (
from .base ...) to absolute (from mlx_lm.models.base ...) so it pulls the rest of the architecture (KV cache, MLA, switch_layers, distributed) from the installed mlx-lm 0.31.2.Tokenizer
DeepSeek V4 uses
tokenizer_class: "TokenizersBackend"which mlx-lm 0.31.2 doesn't recognize. The existing_load_with_tokenizer_fallback()already catches that error pattern and loadstokenizer.jsondirectly — but it only readchat_templatefromtokenizer_config.json, missing V4's separatechat_template.jinjafile. Patched to fall through to the jinja file (read withutf-8-sigto strip any BOM) when the embedded template is absent.Aliases (affine-quantized variants only)
rapid-mlx serve deepseek-v4-flash→mlx-community/DeepSeek-V4-Flash-4bit(~151 GB)rapid-mlx serve deepseek-v4-flash-2bit→mlx-community/DeepSeek-V4-Flash-2bit-DQ(~96 GB)rapid-mlx serve deepseek-v4-flash-8bit→mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bitOut of scope (known gaps)
-mxfp4/-mxfp8/-nvfp4/-bf16variants intentionally not aliased. PR #1192 also patchesmlx_lm/utils.pywith a_load_safetensorswrapper that handles theF8_E8M0safetensors dtype used by mxfp8 scale tensors, plus aquant_method == "fp8"branch inload_model(). We didn't vendor those because: (a) they require monkey-patching mlx-lm'sload_model()itself (not just an additive sys.modules hook), (b) they only matter for the fp8-flavored quantization variants, and (c) the affine-quantized 4bit/8bit/2bit-DQ aliases above cover the common case. A user tryingmlx-community/DeepSeek-V4-Flash-mxfp8directly will hitRuntimeErroronmx.load(). Follow-up.model_type=deepseek_v4, so the vendored arch supports it, but Pro 4-bit ≈ 430 GB — needsmlx-distributedsetup, not a single-Mac config.bench_command/bench_detok_commandin cli.py callfrom mlx_lm import loaddirectly and skip our wrapper. Benchingdeepseek_v4via those wouldModuleNotFoundError. Acceptable — benchmarks aren't day-0 surface.Test plan
BatchRotatingKVCache,MultiLinear,PipelineMixin,SwitchGLU,shard_inplace)_register_vendored_archs()makes mlx-lm'simportlib.import_module("mlx_lm.models.deepseek_v4")resolve to our modulerapid-mlx serve deepseek-v4-flash-2bit+ chat completion before mergePlan to remove
Once mlx-lm 0.32+ merges any of the five competing V4 PRs, delete:
vllm_mlx/models/deepseek_v4.py_register_vendored_archs()callpyproject.tomlThe
chat_template.jinjafallback is a general improvement worth keeping.Risks
MERGEABLEwith a credible author (mlx-vlm maintainer, 12.4k+ downloads of the 4-bit weights confirm people are running this branch successfully), but it hasn't gone through mlx-lm's review yet.🤖 Generated with Claude Code