## Summary

Cold-start prefill in `dflash-serve` is 1.1-2.9x slower than `llama-server` (Q8_0 GGUF, build 9010) on the same model and prompt, on the same machine, and the gap widens with prompt length. Combined with the multi-turn cache miss (#19), this means dflash's TTFT in any realistic chat workflow is dominated by prefill cost; the speculative decode win on later tokens is often not enough to recover it.
## Reproduction
- Mac Studio M3 Ultra 512GB, macOS 25.2.0
- `dflash-mlx==0.1.4.1` vs llama.cpp build 9010 (d05fe1d7d)
- `mlx-community/Qwen3.6-27B-Instruct-8bit` for dflash
- `Qwen3.6-27B-Instruct-Q8_0.gguf` for llama.cpp (`--flash-attn on --cache-ram 131072`)
- draft: `z-lab/Qwen3.6-27B-DFlash`
- temperature=0, single stream, identical prompts
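For anyone reproducing: a minimal TTFT harness along these lines should do. It assumes both servers expose an OpenAI-compatible streaming `/v1/chat/completions` endpoint (`llama-server` does; the `dflash-serve` port and payload here are assumptions, adjust to its actual API):

```python
import json
import time

import requests

def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    """Stream one completion and return seconds until the first content token."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(f"{base_url}/v1/chat/completions", json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            # SSE frames look like: data: {...json...}
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            if delta.get("content"):  # first visible token = TTFT
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any content token")

# Single stream, identical prompt, cold caches. Ports and model name are
# placeholders; point these at your llama-server / dflash-serve instances.
prompt = "word " * 2000
for name, url in [("llama.cpp", "http://localhost:8080"), ("dflash", "http://localhost:8000")]:
    print(name, round(measure_ttft(url, "default", prompt), 2), "s")
```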
| prompt size (tokens) | engine    | TTFT (s) | prefill tok/s |
|---------------------:|-----------|---------:|--------------:|
| 2000                 | llama.cpp | 4.38     | 617           |
| 2000                 | dflash    | 4.73     | 571           |
| 4000                 | llama.cpp | 5.52     | 965           |
| 4000                 | dflash    | 9.51     | 559           |
| 8000                 | llama.cpp | 9.87     | 1083          |
| 8000                 | dflash    | 20.17    | 530           |
| 16000                | llama.cpp | 19.19    | 1117          |
| 16000                | dflash    | 44.67    | 480           |
| 32000                | llama.cpp | 39.99    | 1070          |
| 32000                | dflash    | 114.45   | 374           |
llama.cpp's prefill time scales roughly linearly with prompt size (throughput holds near ~1000 tok/s); dflash's prefill throughput degrades as the prompt grows (617 → 374 tok/s), so its TTFT grows faster than linearly. At a 32K prompt the cold gap is ~2.9x; with cache hits on the llama.cpp side the gap explodes (32K cache-hit TTFT = 0.18s vs dflash 114s, >600x).
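A quick sanity check on the shape of that curve (my arithmetic on the table above, nothing from dflash internals): fitting the dflash cold TTFTs to a + b·n + c·n² explains the data well and yields a sizable quadratic term, which is what you would expect from an unfused O(n²) attention prefill rather than a memory-bound linear one.

```python
# Back-of-envelope: fit dflash cold TTFT to a quadratic in prompt length.
# Data copied from the table above; the fit itself is my arithmetic.
import numpy as np

n = np.array([2000, 4000, 8000, 16000, 32000], dtype=float)
t = np.array([4.73, 9.51, 20.17, 44.67, 114.45])

c, b, a = np.polyfit(n, t, 2)  # t ≈ a + b*n + c*n**2
print(f"a={a:.2f}s  b={b * 1e3:.2f}ms/tok  c={c:.2e}s/tok^2")
# How much of the 32K TTFT the quadratic term accounts for:
print(f"quadratic term at 32K: {c * 32000**2:.1f}s of {t[-1]}s total")
```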
## Related work
- z-lab/dflash#71 reports the same direction on Qwen3-Coder-Next bf16: vanilla MLX prompt = 95.31 tok/s vs DFlash = 20.93 tok/s (~4.5x prefill regression). Open, unanswered.
- z-lab/dflash#70 reports M1 Pro 16GB, Qwen3.5-4B-MLX-4bit: vanilla MLX 48.33 tok/s → dflash 22.75 tok/s. Open since Apr 17, confirmed by a second reporter.
- r/LocalLLM "MLX with DFlash: Surprising results" reports dflash-mlx losing across four real workloads.

So this is not a 27B-specific or M3 Ultra-specific issue; the pattern is consistent across multiple model sizes, hardware, and quantization formats.
## Hypothesis / asks

A few directions worth investigating:

- Prefill chunk size: `prefill_step_size = 2048` (twice). On Apple Silicon larger chunks (e.g., 4K-8K) typically give better arithmetic intensity. (See also #18, "CLI flags --num-draft-tokens / --decode-concurrency / --prompt-concurrency / --prefill-step-size are silently no-ops": this knob is exposed on the CLI but never read.) A step-size sweep sketch follows this list.
- Flash-attention kernel: dflash's MLX prefill path appears not to use a fused FA-style kernel at long context, while llama.cpp on Apple Silicon ships an FA-on path that is decisive at 16K+ prompts. See the micro-benchmark after this list.
- Prefill is not part of the speculative critical path; drafters don't help here. Could the prefill of the target model run through the same `mlx_lm.generate` prefill code path as vanilla mlx-lm, instead of dflash's custom loop?
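To make the first item concrete, here is a minimal sketch of a chunked-prefill sweep built on plain mlx-lm primitives (standard `mlx_lm` usage on recent releases, not dflash's actual loop; the model repo is the one from the repro above). It measures cold prefill throughput as a function of step size:

```python
import time

import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Qwen3.6-27B-Instruct-8bit")
tokens = mx.array(tokenizer.encode("word " * 32000))

for step in (512, 2048, 4096, 8192):
    cache = make_prompt_cache(model)  # fresh KV cache per run = cold prefill
    start = time.perf_counter()
    for i in range(0, tokens.size, step):
        # Feed one chunk through the model; the KV cache accumulates state.
        model(tokens[i : i + step][None], cache=cache)
        mx.eval([c.state for c in cache])  # force evaluation of this chunk
    dt = time.perf_counter() - start
    print(f"step={step:5d}  prefill {tokens.size / dt:7.1f} tok/s")
```

And for the flash-attention point, a micro-benchmark comparing MLX's fused `mx.fast.scaled_dot_product_attention` against a naive implementation that materializes the full score matrix. Shapes are illustrative, not dflash's actual configuration; the naive path allocates ~L² scores per head, so keep L modest on smaller machines:

```python
import time

import mlx.core as mx

def naive_sdpa(q, k, v, scale):
    # Materializes the full (L, L) score matrix per head: O(L^2) memory traffic.
    w = mx.softmax((q @ k.transpose(0, 1, 3, 2)) * scale, axis=-1)
    return w @ v

B, H, D = 1, 32, 128  # illustrative head count / head dim
for L in (2048, 4096, 8192):
    q = mx.random.normal((B, H, L, D))
    k = mx.random.normal((B, H, L, D))
    v = mx.random.normal((B, H, L, D))
    mx.eval(q, k, v)
    for name, fn in [
        ("naive", lambda: naive_sdpa(q, k, v, D**-0.5)),
        ("fused", lambda: mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5)),
    ]:
        start = time.perf_counter()
        mx.eval(fn())
        print(f"L={L:5d} {name}: {time.perf_counter() - start:.3f}s")
```

If dflash's prefill sits on the naive side of that comparison at long context, that alone could explain most of the throughput decay in the table.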
For now my recommendation to users (which I'd be happy to upstream into the README) is: route requests by prompt length — vanilla mlx-lm or llama.cpp for prompts > 8K, dflash only for short prompt + short reply (where it does win 1.4-1.65x decode tok/s).
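To put numbers on that cutoff: the decode-side speedup has to repay the cold-prefill deficit from the table. A worked sketch; the vanilla decode rate below is an assumed illustrative figure, not a measurement from this setup:

```python
# Break-even output length: how many decode tokens dflash must generate
# before its 1.4-1.65x decode speedup repays the cold-prefill deficit.
# base_decode_tps is an ASSUMED illustrative number, not a measurement.
base_decode_tps = 25.0  # hypothetical vanilla decode rate, tok/s

# Cold TTFT deficits (dflash - llama.cpp) from the table above.
deficits = {2000: 4.73 - 4.38, 8000: 20.17 - 9.87, 32000: 114.45 - 39.99}
for n_prompt, deficit in deficits.items():
    for speedup in (1.4, 1.65):
        # vanilla decode time n/r, dflash n/(s*r); savings = n*(s-1)/(s*r)
        n_breakeven = deficit * base_decode_tps * speedup / (speedup - 1)
        print(f"prompt={n_prompt:5d} speedup={speedup}: "
              f"break-even at {n_breakeven:6.0f} output tokens")
```

With those assumptions the break-even is a few dozen output tokens at a 2K prompt but thousands at 32K, which is exactly why routing by prompt length makes sense.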
Would also be very interested in any plans to integrate a PFlash-style prefill skip path, even as an optional dependency.