Add DeepSeek-v4 (Flash/Pro) #1192
|
You can now run it on a 256GB Mac by keeping the experts in 4bit! We could do 5bit since it's much better than 4bit right now. I'm open to opinions @angeloskath
|
|
Hey @Blaizzy — just flagging some technical notes since we're both working on V4 support and PR #1189 landed ~10 hours earlier with significant overlap:

- Compressed attention mask direction (lines 770-773)
- Sinkhorn normalization
- sqrt-softplus numerical stability

Happy to coordinate if the maintainers want to consolidate into one PR. Our implementation has live generation validation at 21.86 tok/s on M3 Ultra (DeepSeek-V4-Flash-4bit, 160GB peak).
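On the third point above (sqrt-softplus numerical stability), a minimal illustration of the concern; this is a generic stable formulation, not necessarily the code either PR ships:

```python
import mlx.core as mx

# Stable sqrt-softplus sketch: naive log(1 + exp(x)) overflows for large x,
# so compute softplus as max(x, 0) + log1p(exp(-|x|)) before taking the sqrt.
def sqrt_softplus(x):
    softplus = mx.maximum(x, 0.0) + mx.log1p(mx.exp(-mx.abs(x)))
    return mx.sqrt(softplus)
```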
|
Hey @machiabeli, thanks! Yes, same person who left the earlier feedback, good to connect properly. I've been poking at this in parallel and landed on something close to the source numerically with minimal changes, but there's definitely room to combine approaches. A PR from you on the compressed attention mask, Sinkhorn norm, and sqrt-softplus would be really welcome, happy to review and merge what works best. Or I can cherry-pick and add you as a co-author.
```python
if (
    config.get("quantization", None) is None
    and getattr(model_args, "quantization", None) is not None
    and any(k.endswith(".scales") for k in weights)
):
    config["quantization"] = model_args.quantization
```

```python
def _quantize(quantization):
    def class_predicate(p, m):
        if not hasattr(m, "to_quantized"):
            return False
        if f"{p}.scales" not in weights:
            return False
        # Handle custom per layer quantizations
        if p in config["quantization"]:
            return config["quantization"][p]
        return True
```
The goal here is to preserve the mxfp4 expert quant since MLX supports it. So I made the quantize_config key in the config class default to that, and these changes help prequantized models load properly.
It could be done via the predicate, but I couldn't find an elegant way of doing it.
Note: it doesn't affect any existing model.
An alternative is to dequant -> requant, similar to how we do with FP8.
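For illustration, this is the kind of per-layer quantization config the predicate above consumes; the layer names, bit widths, and the mxfp4 mode string are made up for the example, not the repo's actual config:

```python
# Hypothetical per-layer quantization config (names and values illustrative).
config = {
    "quantization": {
        "group_size": 64,
        "bits": 8,
        # Per-layer overrides: keep routed experts in mxfp4, skip the router.
        "model.layers.0.mlp.switch_mlp": {"group_size": 32, "bits": 4, "mode": "mxfp4"},
        "model.layers.0.mlp.gate": False,
    }
}

# class_predicate(p, m) would then return:
#   * the override dict for "model.layers.0.mlp.switch_mlp" (quantize with those params)
#   * False for "model.layers.0.mlp.gate" (leave unquantized)
#   * True for any other quantizable module whose ".scales" are present in the weights
```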
…ponding unit tests for HyperConnection
… with matmul for improved performance
|
Fixed, could you try again @adurham?
|
trying now
Updates mlx_lm/models/deepseek_v4.py from Blaizzy's PR ml-explore#1192 head (ddeffe3). File-only — same fork strategy as 2688312, does NOT pull the rest of the PR (would revert 77ed380 quant SDPA fast path and delete mlx_lm/models/minimax_trace.py). Key fix: 8e8571a "Fix DeepSeek V4 sparse pooled prefill memory" — keeps pooled top-k attention grouped per query during prefill instead of flattening into an L*top_k dense KV sequence, avoiding oversized SDPA score buffers on long prompts. Addresses the (B, n_heads, L, L*k) cubic blowup we reported for compress_ratio==4 layers. Imports unchanged vs prior snapshot apart from dropping unused _gather_sort import. ModelArgs / Model construct cleanly against mlx-community/DeepSeek-V4-Flash-4bit config at runtime. Co-Authored-By: Blaizzy <prince.canuma@hotmail.com>
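To make that blow-up concrete, a back-of-the-envelope sketch; the head count, head dim, and top-k below are placeholders, not values taken from the model config:

```python
# Rough size of the prefill score buffer before vs. after the fix described
# above (all dimensions illustrative).
B, H = 1, 64            # batch, attention heads
L, k = 4096, 512        # prompt length, pooled top-k per query
bytes_per = 2           # bf16

flattened = B * H * L * (L * k) * bytes_per   # (B, H, L, L*k): grows quadratically with L
grouped   = B * H * L * k * bytes_per         # (B, H, L, k): grows linearly with L

print(f"flattened: {flattened / 1e12:.1f} TB vs grouped: {grouped / 1e9:.2f} GB")
```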
Decode regression on ddeffe33

| mlx_lm/models/deepseek_v4.py rev | Prefill (15-tok warmup) | Decode (30-tok, 836 ctx) |
|---|---|---|
| 15de79d8 (snapshot before 26f49f5) | 65 tok/s | 34.8 tok/s |
| ddeffe33 (current PR head) | 1.9 tok/s | 0.07 tok/s |
| ddeffe33 + fused HC kernel disabled | 39 tok/s | 40.4 tok/s |
Decode rate computed from per-token ChunkGenerated master-log timestamps; 3 independent runs (cold prefill, prefix-cache hit, fresh shorter prompt) all land within 1% of 40 tok/s, so the third row isn't a fluke.
The fix is a one-liner — force _hc_sinkhorn_collapse_kernel = None so HyperConnection.collapse takes the unfused split-sinkhorn + _hc_collapse_op path. The fused kernel from 26f49f5 is invoked twice per block (attn_hc + ffn_hc), so ~86 dispatches per decoded token on DSv4-Flash (43 layers). With it off, the rest of the squash lands fine — actually +16% over the prior baseline.
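For anyone who wants to reproduce the third row, this is roughly what the disable looks like from a test harness; the module path and global name are taken from the description above, so adjust if they've moved:

```python
# Sketch of the one-line workaround: null out the fused kernel so
# HyperConnection.collapse falls back to the unfused split-sinkhorn path.
import mlx_lm.models.deepseek_v4 as deepseek_v4

deepseek_v4._hc_sinkhorn_collapse_kernel = None
```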
Reproduction
Bisect branch: adurham/mlx-lm@dsv4-perf-bisect (= ddeffe33 file content + the one-line kernel disable). On stock ddeffe33 decode is reproducibly ≤ 0.1 tok/s.
Workload: mlx-community/DeepSeek-V4-Flash-4bit, 2-rank TP via RDMA across 2× M4 Max, ~836-token prompt + 30 generated tokens.
MLX context — possibly a fork interaction
I'm on adurham/mlx (= upstream e64e280d + reverts of #3412 jaccl refactor and #3418 jaccl init bug, both comm-backend only, no GPU code touched). It's plausible _hc_sinkhorn_collapse_kernel only mis-behaves under that combination — happy to retest on stock upstream MLX if it'd help nail down whether this is fork-specific. Flagging it first since the symptom is so consistent on my side.
Suggestion
Gate _hc_sinkhorn_collapse_kernel behind an opt-in env (default off) until the root cause is clear. cc @0xClandestine since you wrote/optimized this kernel — let me know what telemetry would help (per-call latency, kernel-source variants, profiling output, etc.).
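A sketch of what that opt-in gate could look like at module level; the env var name and the builder function are made up for illustration:

```python
import os

# Fused Sinkhorn-collapse kernel stays opt-in until the dispatch-overhead
# regression is understood; the default path is the unfused one.
if os.environ.get("MLX_LM_FUSED_HC_KERNEL", "0") == "1":
    _hc_sinkhorn_collapse_kernel = _build_hc_sinkhorn_collapse_kernel()  # hypothetical builder
else:
    _hc_sinkhorn_collapse_kernel = None
```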
Token-dropping regression on ddeffe33
|
| Expected | Got | |
|---|---|---|
| こんにちは | こんこん | (drop + duplicate) |
| ありがとう | ありあり | (drop + duplicate) |
| アップグレード | アップグップ | (mid-word collapse) |
| 進化 | 進進 | (single-char repeat) |
| (long string) | normal | "おっっっっっ" infinite loop |
A second prompt also showed token dropping ("メモ帯域" instead of "メモリ帯域", "推速度" instead of "推論速度", "Metalフ" cut off, etc.).
Bisect:

| Commit | Quality |
|---|---|
| 910b120f (perf-optimize-ds4 merged) | ✅ clean |
| ddeffe33 (current head) | ❌ token drop + repetition |
Workaround for now: pinned to 910b120fa24cecb804b795b05183cbf0037f4ba6 which is rock solid (28-30 tok/s on mxfp8, no quality issues).
Could be related to one of:
- 8e8571a4 Fix DeepSeek V4 sparse pooled prefill memory
- 2591b51f Refactor kv tensor reshaping in V4Attention
- 5a4aaa41 Refactor output projection
- 22da01a1 Remove redundant cache type check
Happy to bisect further commit-by-commit if useful — let me know which axis would be most valuable to narrow down. cc @Blaizzy
|
Thanks @Shinka-Man! On it 👌🏽
Restore the transpose in HyperConnection expand so the Sinkhorn combination matrix is applied in the same orientation as the original implementation. This fixes token dropping and repetition regressions seen in Japanese generation. Reported-by: https://github.com/Shinka-Man
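A toy illustration of why the orientation matters: applying the combination matrix transposed mixes the residual streams differently without raising any error (shapes and names below are made up, not the PR's actual code):

```python
import mlx.core as mx

# Two orientations of the same combination matrix give different stream mixes;
# the wrong one corrupts hidden states silently rather than crashing.
streams = mx.random.normal((4, 8))   # (n_streams, d_model), illustrative
combine = mx.random.normal((4, 4))   # stand-in for the Sinkhorn combination matrix

expanded_a = combine @ streams       # one orientation
expanded_b = combine.T @ streams     # flipped orientation
print(mx.abs(expanded_a - expanded_b).max())  # nonzero: the outputs differ
```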
|
Fixed @Shinka-Man, the culprit was the HyperConnection expand orientation change.
|
Thanks @adurham, looking into it 👌🏽
|
@Blaizzy Confirmed fixed on the latest commit. Same prompt that broke before:
Output now (mxfp8, M3 Ultra 512GB):
Clean Japanese, no token drops, no repetition loops, perfect context understanding. 30.2 tok/s (slightly faster than the pre-regression baseline). That orientation flip was indeed the culprit.
|
My pleasure @Shinka-Man, thanks for catching it! ❤️ I added a test to avoid regressions.
The element-wise (q * pooled).sum() path broadcasts a (B,H,L,1,D) tensor against (B,1,L,topk,D), creating a (B,H,L,topk,D) intermediate. At 4k context with H=64, topk=512, D=512 this is ~137 GB per operation (x2). Replace with equivalent matmul: (B,L,H,D) @ (B,L,D,topk) which produces the (B,L,H,topk) result directly with ~0.25 GB peak memory.
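A shape-level sketch of that rewrite; the tensor names and layouts are assumptions based on the description above, not the file's actual code:

```python
import mlx.core as mx

# q:      (B, H, L, D)    query heads
# pooled: (B, L, topk, D) pooled top-k keys per query position

def scores_elementwise(q, pooled):
    # Broadcasts to a (B, H, L, topk, D) intermediate before summing out D;
    # at 4k context with H=64, topk=512, D=512 that is ~137 GB in bf16.
    return (q[:, :, :, None, :] * pooled[:, None]).sum(axis=-1)  # (B, H, L, topk)

def scores_matmul(q, pooled):
    # Same result via (B, L, H, D) @ (B, L, D, topk), producing the
    # (B, L, H, topk) scores directly with ~0.25 GB peak memory.
    s = mx.matmul(q.transpose(0, 2, 1, 3), pooled.transpose(0, 1, 3, 2))
    return s.transpose(0, 2, 1, 3)  # back to (B, H, L, topk)
```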
|
PR from @0xClandestine fixes the 4K context issue! Tested here up to 64K! 🔥
|
|
I've found a regression (not in the @0xClandestine PR): quantization is failing for 4-bit, 8-bit, and all combos. @Blaizzy

[INFO] Using dtype: bfloat16
|
Hey @Blaizzy — quick request from a downstream user. I just ran the validation harness against this PR at HEAD
So the runtime path is healthy on this branch. Would it be feasible to publish a true 8-bit weight-only quant (no per-layer 4-bit override on the FFN/MoE expert weights) alongside the existing mixed quant? My intended use is as a single-tenant Alfred backend on this box — at q8 the model would be ~284 GB, which fits comfortably on a 512 GB Mac Studio without the production-coexist constraint your mixed quant was sized for. Also happy to self-quantize from the source checkpoint.
|
Yes, I can do that. The reason why the 8bit has experts in 4bit is that the main model comes with experts in MXFP4.
My pleasure! Let me merge that for you :) Update: I don't see a PR to that repo.
|
Following up on #issuecomment-4329720377 with more specific findings. I attempted to build a true 8-bit conversion myself by streaming through the source weights, and hit an architectural wall on the routed-expert dimensions that I can't reconcile from public artifacts.

Source (deepseek-ai/DeepSeek-V4-Flash) per-expert shapes:

Your switch_mlp per-expert shapes:

Both repos' configs say:

Param accounting confirms it: the source's routed experts at (2048, 2048) sum to ~138B params (consistent with the 148 GB I8 file size), while your switch_mlp shapes imply ~280B in routed experts alone, which lines up with mxfp4's compressed 155 GB file size for a fuller param count. The shared_experts block in source is sized correctly.

Two questions, in order of usefulness to me:
If (1) has a clean answer I can finish the converter myself for (2). If not, only you have the conversion path that produces the right shapes. Either way, thanks for the work.
Performance update — Apple optimizations live ✨

After pulling the latest with @angeloskath's three commits (RoPE kernel native path, GLU cast simplification, original-checkpoint loader):

Setup: Mac Studio M3 Ultra 512GB, mxfp8, single-machine, no TP
The thing that jumps out: TPS is now context-flat (33.8 vs 33.6 vs 33.6 across very different workloads). Previous builds had visible decay on longer generations. This is the signature of the cast/RoPE overhead going away. Cumulative trajectory on this hardware:
~+50% from baseline in 4 days of community iteration. Quality remains pristine (no token drops, no repetition loops, perfect Japanese & code). Welcome to the PR @angeloskath — the kernel-native RoPE path is doing real work here on Apple Silicon. 🙌
…'s fp32-cast patches

Pulls in PR ml-explore#1192 from upstream including:
- Apple-team commits from Angelos Katharopoulos:
  - 3cf5282 "Start simplifying and speeding up the attention" (2026-04-29)
  - 4951496 "Fix RoPE to use the kernel by scaling freqs" (2026-04-28)
  - 81a8c57 "Simplify GLU and gate remove intermediate castings" (2026-04-28)
- Blaizzy refactor stack (output projection, KV reshape, BatchRotatingKVCache, scoring/RoPE compile, the matmul rewrite from 0xClandestine that we already had a copy of, etc.)

Conflict resolution: took theirs for mlx_lm/models/deepseek_v4.py wholesale.

Our previous fork patches that are now superseded:
- f4dd9e7 / 2a1dcf6 "drop fp32 casts in Indexer / MoEGate / Compressor" — Angelos' 81a8c57 covers the same ground more cleanly.

Fork patches that need re-applying separately on top:
- mlx_lm/profiler.py span/finalize hooks scattered across deepseek_v4.py (attn_q_lora / attn_kv_proj / attn_compressor / attn_indexer / attn_sdpa_sparse / attn_sdpa_dense / moe_* spans).
- 1d78d62 Indexer wq_b/weights_proj sharding for TP.



Note: Please install this transformers PR from source to avoid tokenizer bugs.
```
pip install git+https://github.com/huggingface/transformers.git@refs/pull/45643/head
```

Weights here:
https://huggingface.co/collections/mlx-community/deepseek-v4