feat: day-0 DeepSeek-V4-Flash support (vendored from mlx-lm #1192) by raullenchai · Pull Request #168 · raullenchai/Rapid-MLX

raullenchai · 2026-04-29T03:46:00Z

Summary

Adds day-0 support for DeepSeek-V4-Flash (158B-A13B, 1M context, dropped 2026-04-24).

mlx-lm 0.31.x doesn't yet ship deepseek_v4 — five PRs are racing to add it (#1189, #1192, #1195, #1201, #1216). To not block users, we vendor the cleanest one (Prince Canuma / Blaizzy, mlx-vlm maintainer) into vllm_mlx/models/deepseek_v4.py and hook it into mlx-lm's loader.

How the hook works

mlx-lm's _get_classes() resolves model_type via importlib.import_module(f"mlx_lm.models.{model_type}"). We pre-populate sys.modules['mlx_lm.models.deepseek_v4'] with our vendored module inside load_model_with_fallback() — no monkey-patch of mlx-lm's source.

Imports in the vendored file are rewritten from relative (from .base ...) to absolute (from mlx_lm.models.base ...) so it pulls the rest of the architecture (KV cache, MLA, switch_layers, distributed) from the installed mlx-lm 0.31.2.

Tokenizer

DeepSeek V4 uses tokenizer_class: "TokenizersBackend" which mlx-lm 0.31.2 doesn't recognize. The existing _load_with_tokenizer_fallback() already catches that error pattern and loads tokenizer.json directly — but it only read chat_template from tokenizer_config.json, missing V4's separate chat_template.jinja file. Patched to fall through to the jinja file (read with utf-8-sig to strip any BOM) when the embedded template is absent.

Aliases (affine-quantized variants only)

rapid-mlx serve deepseek-v4-flash → mlx-community/DeepSeek-V4-Flash-4bit (~151 GB)
rapid-mlx serve deepseek-v4-flash-2bit → mlx-community/DeepSeek-V4-Flash-2bit-DQ (~96 GB)
rapid-mlx serve deepseek-v4-flash-8bit → mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit

Out of scope (known gaps)

-mxfp4 / -mxfp8 / -nvfp4 / -bf16 variants intentionally not aliased. PR #1192 also patches mlx_lm/utils.py with a _load_safetensors wrapper that handles the F8_E8M0 safetensors dtype used by mxfp8 scale tensors, plus a quant_method == "fp8" branch in load_model(). We didn't vendor those because: (a) they require monkey-patching mlx-lm's load_model() itself (not just an additive sys.modules hook), (b) they only matter for the fp8-flavored quantization variants, and (c) the affine-quantized 4bit/8bit/2bit-DQ aliases above cover the common case. A user trying mlx-community/DeepSeek-V4-Flash-mxfp8 directly will hit RuntimeError on mx.load(). Follow-up.
DeepSeek-V4-Pro family (862B / 1.6T) intentionally not aliased. Same model_type=deepseek_v4, so the vendored arch supports it, but Pro 4-bit ≈ 430 GB — needs mlx-distributed setup, not a single-Mac config.
bench_command / bench_detok_command in cli.py call from mlx_lm import load directly and skip our wrapper. Benching deepseek_v4 via those would ModuleNotFoundError. Acceptable — benchmarks aren't day-0 surface.

Test plan

Module imports cleanly on installed mlx-lm 0.31.2 (all internal deps present: BatchRotatingKVCache, MultiLinear, PipelineMixin, SwitchGLU, shard_inplace)
_register_vendored_archs() makes mlx-lm's importlib.import_module("mlx_lm.models.deepseek_v4") resolve to our module
Idempotence — second call is a no-op
Tiny synthetic config end-to-end forward pass (HCA + sinkhorn + MoE) — Metal kernels compile, logits shape matches
26 adjacent batching/engine tests still green (no regression)
Codex review applied: utf-8-sig BOM strip, sys.modules.setdefault, mxfp gap documented
Pending: real model load — 2-bit (96 GB) downloading now, will smoke-test rapid-mlx serve deepseek-v4-flash-2bit + chat completion before merge

Plan to remove

Once mlx-lm 0.32+ merges any of the five competing V4 PRs, delete:

vllm_mlx/models/deepseek_v4.py
The _register_vendored_archs() call
The per-file ruff ignore in pyproject.toml

The chat_template.jinja fallback is a general improvement worth keeping.

Risks

Vendored architecture is unmerged — Blaizzy's PR is MERGEABLE with a credible author (mlx-vlm maintainer, 12.4k+ downloads of the 4-bit weights confirm people are running this branch successfully), but it hasn't gone through mlx-lm's review yet.
We're sourcing weights conversion + reading hot-path Metal kernels from a single contributor. Worst case: a kernel bug surfaces only on certain shapes / contexts — we can hot-fix the vendored file independently of mlx-lm.

🤖 Generated with Claude Code

Vendor Prince Canuma's pending mlx-lm PR #1192 architecture into vllm_mlx/models/deepseek_v4.py and register it under sys.modules['mlx_lm.models.deepseek_v4'] so mlx-lm's importlib lookup finds it transparently. Also fall back to chat_template.jinja when the template isn't embedded in tokenizer_config.json (DeepSeek V4 ships it as a separate file). Adds aliases: deepseek-v4-flash (4bit), -2bit, -8bit. Removable once mlx-lm 0.32+ ships native deepseek_v4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- chat_template.jinja: read with encoding='utf-8-sig' so a UTF-8 BOM (if the file was saved with one) doesn't end up in the rendered template - vendored arch registration: use sys.modules.setdefault for explicit atomicity under the GIL Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mlx-lm's load() internally uses AutoTokenizer → PreTrainedConfig, which trips on deepseek_v4 (transformers 5.6.0.dev0 has no deepseek_v4 entry in CONFIG_MAPPING and the generic PreTrainedConfig fallback then crashes on RoPE standardization with AttributeError, not the ValueError our existing predicate catches). Added _is_vendored_arch_model() — reads config.json upfront and routes any model_type in the vendored set straight to the lower-level load_model() + raw tokenizer.json path, skipping the brittle load()/AutoConfig retry-on-error flow entirely. Verified end-to-end on mlx-community/DeepSeek-V4-Flash-2bit-DQ (158B-A13B): 32.3 tok/s decode on M3 Ultra, coherent generation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raullenchai · 2026-04-29T03:58:34Z

✅ End-to-end verified on real model

rapid-mlx serve mlx-community/DeepSeek-V4-Flash-2bit-DQ boots cleanly on M3 Ultra (256 GB, 90 GB on disk) and serves an OpenAI-compatible chat completion:

$ curl -s http://127.0.0.1:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/DeepSeek-V4-Flash-2bit-DQ",
       "messages":[{"role":"user","content":"What is 2+2?"}],
       "max_tokens":64}'

{"choices":[{"message":{"role":"assistant","content":"2+2 equals 4."},"finish_reason":"stop"}],
 "usage":{"prompt_tokens":18,"completion_tokens":8,"total_tokens":26}}

Decode: 32.3 tok/s for 158B-A13B at 2-bit (~13B active params, expected band on M3 Ultra). Quality is the usual 2-bit-DQ degradation — production users should pick the -4bit or -8bit aliases.

Bug found + fixed during smoke test

mlx-lm's high-level load() internally calls AutoTokenizer.from_pretrained which routes through transformers.PreTrainedConfig. transformers 5.6.0.dev0 has no deepseek_v4 entry in CONFIG_MAPPING, and the generic PreTrainedConfig fallback then crashes during RoPE standardization with AttributeError: 'PreTrainedConfig' object has no attribute 'max_position_embeddings' — not the ValueError my initial predicate caught.

Final fix (commit eb9b9f2): added _is_vendored_arch_model() that reads config.json upfront and routes any model_type in _VENDORED_MODEL_TYPES = {"deepseek_v4"} straight to the lower-level load_model() + raw tokenizer.json path, bypassing the brittle load()/AutoConfig retry-on-error flow entirely. Same pattern as the existing gemma4 short-circuit.

Cache deleted post-test (disk is constrained), 90 GB reclaimed.

Ready to merge.

The 4-bit variant on mlx-community is ~489 GB on disk (per HF API usedStorage). Even on M3 Ultra Max (512 GB RAM), that leaves no room for KV cache. Point the default alias to the 8-bit variant (155 GB on disk, ~136 GB RAM) which actually fits 192 GB+ Macs. Tested on M3 Ultra 256 GB: - 8-bit: 31.2 tok/s decode, 136 GB peak RAM, 7/8 stress scenarios pass - 2-bit DQ: 55.8 tok/s decode, 91 GB peak RAM (separate alias for speed/footprint demos) Tool calling currently 0/30 on the comprehensive eval — root cause is that mlx-community's chat_template.jinja is chat-only (no tools/ tool_call rendering). Plain chat works perfectly. Tool support requires a follow-up template fix; tracking upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raullenchai · 2026-04-29T05:05:34Z

📊 Comprehensive test results — both 2-bit and 8-bit variants

Bench (Mac Studio M3 Ultra 256 GB)

Metric	2-bit DQ	8-bit
Decode (long, median)	55.8 tok/s	31.2 tok/s
Decode (short)	38.3 tok/s	31.7 tok/s
Prefill (long, median)	443 tok/s	415 tok/s
TTFT cold	0.78 s	2.61 s
TTFT cached	0.13 s	0.15 s
Multi-turn TTFT cold	0.54 s	0.49 s
Peak RAM	91 GB	136 GB
Disk on cache	90 GB	145 GB
Tool call rate (bench scenarios)	0%	0%

Stress test (8 scenarios) — both variants identical

Scenario	Result
Sustained throughput (20 req)	✅ PASS
Concurrent load (4 parallel streams)	✅ PASS
Long generation (1024 tok)	✅ PASS
Rapid fire (10 req)	✅ PASS
Tool call storm	❌ FAIL (root cause below)
Mixed workload (chat + tools + streaming)	✅ PASS
Disconnect resilience	✅ PASS
Memory stability (5 rounds)	✅ PASS

7/8 PASS on both variants.

Tool-calling eval (`evals/run_eval.py --suite tool_calling`, 30 scenarios, 9 categories)

0/30 PASS on 8-bit (parser=auto). Per-category breakdown all 0%:

Category	Pass rate	Category	Pass rate
single_tool (4)	0/4	sequential (4)	0/4
function_selection (5)	0/5	irrelevance (3)	0/3
complex_args (4)	0/4	missing_params (2)	0/2
parallel (4)	0/4	error_recovery (2)	0/2
nested_dependent (2)	0/2

Every scenario logs tool_detected: False — model is not emitting anything that parses as a tool call. Root cause is upstream, not our code:

```
$ cat ~/.cache/huggingface/hub/models--mlx-community--DeepSeek-V4-Flash-8bit/.../chat_template.jinja
{%- set mode = thinking_mode|default('chat') -%}
<｜begin▁of▁sentence｜>
{%- for message in messages -%}
{%- if message['role'] == 'system' -%} ... {%- elif message['role'] == 'user' -%} ...
{%- elif message['role'] == 'assistant' -%} ...
{%- endif -%}
{%- endfor -%}
```

The mlx-community chat template only handles `system`/`user`/`assistant` — no `tool` role rendering, no `tools` array iteration, no `<tool_call>` markers. Tools passed via the OpenAI API are silently dropped before the model ever sees them. Same root cause we documented earlier for Mistral / Gemma in MEMORY.md.

Agent integration (Hermes / OpenClaude profiles, API-level)

Test	2-bit Hermes	8-bit Hermes	2-bit OpenClaude	8-bit OpenClaude
plain_chat	✅	💥 (parse error in profile)	💥	💥
no_tool_leak	✅	✅	✅	✅
stress_no_leak	✅	✅	✅	✅
tag_suppression (Hermes only)	✅	✅	—	—
All tool-call tests	❌ × 7	❌ × 7	❌ × 7	❌ × 7
Total	4/12	3/12	2/10	2/10

Same story: anything not requiring `tool_calls` in the response works.

Summary + recommendations

✅ Vendored architecture works. mlx-lm load → vendored `deepseek_v4` → real weights → coherent generation, both quants.
✅ Performance is solid for size: 31 tok/s for 158B-A13B at 8-bit on M3 Ultra is on-curve with bandwidth expectations (~13B active params, ~800 GB/s mem bw).
✅ Stress + concurrency stable: 7/8 stress scenarios pass on both quants.
❌ Tool calling is broken at the chat-template level (upstream) — needs a follow-up. Recommendations:

Ship V4 day-0 with chat-only support; document agent limitation.
File issue with mlx-community / Blaizzy about adding tool template (similar to what mlx-community/Qwen3.6-* does).
For agentic use today, recommend Qwen3.6-35B (100% tool calling per scorecard).

Alias change (commit `8ca9f3d`)

Repointed default `deepseek-v4-flash` from 4-bit (489 GB on disk per HF API — won't fit any single Mac) to 8-bit (155 GB → fits 192 GB+ Macs). Added explicit `deepseek-v4-flash-4bit` alias for users on distributed setups.

Alias	Backing model	Disk	RAM	Mac tier
`deepseek-v4-flash`	DeepSeek-V4-Flash-8bit	155 GB	136 GB	192+ GB
`deepseek-v4-flash-2bit`	DeepSeek-V4-Flash-2bit-DQ	90 GB	91 GB	128+ GB
`deepseek-v4-flash-8bit`	(same as default)
`deepseek-v4-flash-4bit`	DeepSeek-V4-Flash-4bit	~489 GB	—	distributed only

All 8-bit + 2-bit cache deleted post-test (180 GB reclaimed). Ready to merge with the documented limitations.

Hero table, Mac fit table, and Benchmarks comparison table now include DeepSeek V4 Flash 158B-A13B results from M3 Ultra 256GB testing: - 2-bit DQ: 56 tok/s decode, 91 GB peak RAM, 128 GB+ Mac tier - 8-bit: 31 tok/s decode, 136 GB peak RAM, 192 GB+ Mac tier Both tagged 'chat only' since the mlx-community chat template ships without tool/function rendering — see PR #168 for details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Your Name and others added 3 commits April 28, 2026 20:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: day-0 DeepSeek-V4-Flash support (vendored from mlx-lm #1192)#168

feat: day-0 DeepSeek-V4-Flash support (vendored from mlx-lm #1192)#168
raullenchai wants to merge 5 commits intomainfrom
feat/deepseek-v4-vendored

raullenchai commented Apr 29, 2026 •

edited

Loading

Uh oh!

raullenchai commented Apr 29, 2026

Uh oh!

raullenchai commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raullenchai commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How the hook works

Tokenizer

Aliases (affine-quantized variants only)

Out of scope (known gaps)

Test plan

Plan to remove

Risks

Uh oh!

raullenchai commented Apr 29, 2026

✅ End-to-end verified on real model

Bug found + fixed during smoke test

Uh oh!

raullenchai commented Apr 29, 2026

📊 Comprehensive test results — both 2-bit and 8-bit variants

Bench (Mac Studio M3 Ultra 256 GB)

Stress test (8 scenarios) — both variants identical

Tool-calling eval (evals/run_eval.py --suite tool_calling, 30 scenarios, 9 categories)

Agent integration (Hermes / OpenClaude profiles, API-level)

Summary + recommendations

Alias change (commit `8ca9f3d`)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

raullenchai commented Apr 29, 2026 •

edited

Loading

Tool-calling eval (`evals/run_eval.py --suite tool_calling`, 30 scenarios, 9 categories)