
Cold-start prefill is 6-15x slower than llama.cpp Q8_0 GGUF on the same model (Apple Silicon) #21


Summary

Cold-start prefill in dflash-serve is 6-15x slower than llama-server (Q8_0 GGUF, build 9010) on the same model, prompt, and machine. Combined with the multi-turn cache miss (#19), this means dflash's TTFT in any realistic chat workflow is dominated by prefill cost; the speculative-decode win on later tokens is often not enough to recover the difference.

Reproduction

  • Mac Studio M3 Ultra 512GB, macOS 25.2.0
  • dflash-mlx==0.1.4.1 vs llama.cpp build 9010 (d05fe1d7d)
  • mlx-community/Qwen3.6-27B-Instruct-8bit for dflash
  • Qwen3.6-27B-Instruct-Q8_0.gguf for llama.cpp (--flash-attn on --cache-ram 131072)
  • draft: z-lab/Qwen3.6-27B-DFlash
  • temperature=0, single stream, identical prompts
| prompt size (tokens) | engine | TTFT (s) | prefill tok/s |
|---:|---|---:|---:|
| 2000 | llama.cpp | 4.38 | 617 |
| 2000 | dflash | 4.73 | 571 |
| 4000 | llama.cpp | 5.52 | 965 |
| 4000 | dflash | 9.51 | 559 |
| 8000 | llama.cpp | 9.87 | 1083 |
| 8000 | dflash | 20.17 | 530 |
| 16000 | llama.cpp | 19.19 | 1117 |
| 16000 | dflash | 44.67 | 480 |
| 32000 | llama.cpp | 39.99 | 1070 |
| 32000 | dflash | 114.45 | 374 |

llama.cpp's prefill time scales roughly linearly with prompt size (throughput holds roughly constant around ~1000 tok/s); dflash's prefill throughput degrades as the prompt grows (617 → 374 tok/s). At a 32K prompt the cold-start gap is ~3x; with a cache hit on the llama.cpp side the gap explodes (32K cache-hit TTFT = 0.18 s vs 114 s for dflash, >600x).
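For reference, the TTFT numbers above are wall-clock time to the first streamed token, measured with a small client like the sketch below. It assumes both servers expose an OpenAI-compatible streaming /v1/chat/completions endpoint (true for llama-server; assumed here for dflash-serve); the URLs and model name are placeholders.

```python
import json
import time

import requests  # pip install requests


def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    """Return seconds from request start to the first streamed content chunk."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 32,
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(f"{base_url}/v1/chat/completions", json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any content token")


# e.g. compare the two backends on the same prompt (ports are examples)
# print(measure_ttft("http://localhost:8080", "qwen3.6-27b", long_prompt))  # llama-server
# print(measure_ttft("http://localhost:8000", "qwen3.6-27b", long_prompt))  # dflash-serve
```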

Related work

  • z-lab/dflash#71 reports the same direction on Qwen3-Coder-Next bf16: vanilla MLX prompt = 95.31 tok/s vs DFlash = 20.93 tok/s (~4.5x prefill regression). Open, unanswered.
  • z-lab/dflash#70 reports the same on an M1 Pro 16GB with Qwen3.5-4B-MLX-4bit: vanilla MLX 48.33 tok/s → dflash 22.75 tok/s (~2.1x prefill regression). Open since Apr 17, confirmed by a second user.
  • r/LocalLLM "MLX with DFlash: Surprising results" reports dflash-mlx losing across four real workloads.

So this is not a 27B-specific or M3 Ultra-specific issue — the pattern is consistent across multiple model sizes / hardware / quantization formats.

Hypothesis / asks

A few directions worth investigating:

  1. Chunked-prefill step size: runtime.py hardcodes prefill_step_size = 2048 (in two places). On Apple Silicon, larger chunks (e.g., 4K-8K) typically give better arithmetic intensity. (See also #18: the CLI flags --num-draft-tokens / --decode-concurrency / --prompt-concurrency / --prefill-step-size are silently no-ops, so this knob is exposed on the CLI but never read.) A minimal prefill-loop sketch follows this list.
  2. Flash-attention kernel: dflash's MLX prefill path appears not to use a fused flash-attention-style kernel at long context. llama.cpp on Apple Silicon ships an FA-on path (--flash-attn) that is decisive at 16K+ prompts.
  3. Prefill is not part of the speculative critical path, so the drafter doesn't help here. Could the target model's prefill run through the same mlx_lm.generate prefill code path as vanilla mlx-lm, instead of dflash's custom loop?
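
To illustrate item 1, here is a minimal sketch of what a chunked prefill loop with a configurable step size looks like in mlx-lm terms. This is not dflash's actual runtime.py loop; it just uses the standard mlx-lm load / make_prompt_cache APIs, with prefill_step_size as the knob in question.

```python
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache


def chunked_prefill(model, prompt_tokens: mx.array, prefill_step_size: int = 2048):
    """Fill the KV cache by running the prompt through the model in chunks.

    A larger prefill_step_size means fewer, bigger matmuls per chunk (better
    arithmetic intensity on Apple Silicon) at the cost of higher peak memory.
    """
    cache = make_prompt_cache(model)
    n = prompt_tokens.shape[0]
    for start in range(0, n, prefill_step_size):
        chunk = prompt_tokens[start : start + prefill_step_size][None]  # add batch dim
        logits = model(chunk, cache=cache)
        mx.eval(logits)  # force this chunk to execute before queuing the next one
    return cache  # decode can resume from this cache; the last logits give the first token


# usage sketch: compare step sizes on the same long prompt
# model, tokenizer = load("mlx-community/Qwen3.6-27B-Instruct-8bit")
# tokens = mx.array(tokenizer.encode(long_prompt))
# for step in (2048, 4096, 8192):
#     cache = chunked_prefill(model, tokens, prefill_step_size=step)
```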

For now my recommendation to users (which I'd be happy to upstream into the README) is to route requests by prompt length: vanilla mlx-lm or llama.cpp for prompts > 8K, dflash only for short prompts with short replies (where it does win 1.4-1.65x on decode tok/s). A minimal routing sketch is below.
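
A sketch of that routing rule, assuming both backends sit behind OpenAI-compatible endpoints and using a rough character-based token estimate (URLs and the threshold are placeholders taken from the benchmarks above):

```python
DFLASH_URL = "http://localhost:8000"    # dflash-serve (placeholder)
FALLBACK_URL = "http://localhost:8080"  # llama-server or vanilla mlx-lm server (placeholder)
PROMPT_TOKEN_THRESHOLD = 8000           # crossover point suggested by the table above


def estimate_tokens(messages: list[dict]) -> int:
    # crude heuristic: ~4 characters per token; a real router should use the tokenizer
    chars = sum(len(m.get("content", "")) for m in messages)
    return chars // 4


def pick_backend(messages: list[dict]) -> str:
    """Send long prompts to the fast-prefill backend, short ones to dflash."""
    if estimate_tokens(messages) > PROMPT_TOKEN_THRESHOLD:
        return FALLBACK_URL
    return DFLASH_URL
```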

Would also be very interested in any plans to integrate a PFlash-style prefill skip path, even as an optional dependency.
