
Cold-start prefill is 6-15x slower than llama.cpp Q8_0 GGUF on the same model (Apple Silicon) #21


Summary

Cold-start prefill in dflash-serve is 6-15x slower than llama-server (Q8_0 GGUF, build 9010) on the same model, prompt, and machine. Combined with the multi-turn cache miss (#19), this means dflash's TTFT in any realistic chat workflow is dominated by prefill cost; the speculative-decode win on later tokens is often not enough to recover the difference.

Reproduction

  • Mac Studio M3 Ultra 512GB, macOS 25.2.0
  • dflash-mlx==0.1.4.1 vs llama.cpp build 9010 (d05fe1d7d)
  • mlx-community/Qwen3.6-27B-Instruct-8bit for dflash
  • Qwen3.6-27B-Instruct-Q8_0.gguf for llama.cpp (--flash-attn on --cache-ram 131072)
  • draft: z-lab/Qwen3.6-27B-DFlash
  • temperature=0, single stream, identical prompts
| prompt size (tokens) | engine | TTFT (s) | prefill tok/s |
|---:|---|---:|---:|
| 2000 | llama.cpp | 4.38 | 617 |
| 2000 | dflash | 4.73 | 571 |
| 4000 | llama.cpp | 5.52 | 965 |
| 4000 | dflash | 9.51 | 559 |
| 8000 | llama.cpp | 9.87 | 1083 |
| 8000 | dflash | 20.17 | 530 |
| 16000 | llama.cpp | 19.19 | 1117 |
| 16000 | dflash | 44.67 | 480 |
| 32000 | llama.cpp | 39.99 | 1070 |
| 32000 | dflash | 114.45 | 374 |

llama.cpp's prefill time scales roughly linearly with prompt size (throughput holds roughly constant around ~1000 tok/s); dflash's prefill throughput degrades as the prompt grows (617 → 374 tok/s). At a 32K prompt the cold-start gap is ~3x; with a cache hit on the llama.cpp side the gap explodes (32K cache-hit TTFT = 0.18 s vs 114 s for dflash, >600x).
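For reference, the TTFT numbers above are wall-clock time to the first streamed token, measured with a small client like the sketch below. It assumes both servers expose an OpenAI-compatible streaming /v1/chat/completions endpoint (true for llama-server; assumed here for dflash-serve); the URLs and model name are placeholders.

```python
import json
import time

import requests  # pip install requests


def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    """Return seconds from request start to the first streamed content chunk."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 32,
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(f"{base_url}/v1/chat/completions", json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any content token")


# e.g. compare the two backends on the same prompt (ports are examples)
# print(measure_ttft("http://localhost:8080", "qwen3.6-27b", long_prompt))  # llama-server
# print(measure_ttft("http://localhost:8000", "qwen3.6-27b", long_prompt))  # dflash-serve
```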

Related work

  • z-lab/dflash#71 reports the same direction on Qwen3-Coder-Next bf16: vanilla MLX prompt = 95.31 tok/s vs DFlash = 20.93 tok/s (~4.5x prefill regression). Open, unanswered.
  • z-lab/dflash#70 reports the same on an M1 Pro 16GB with Qwen3.5-4B-MLX-4bit: vanilla MLX 48.33 tok/s → dflash 22.75 tok/s (~2.1x prefill regression). Open since Apr 17, confirmed by a second user.
  • r/LocalLLM "MLX with DFlash: Surprising results" reports dflash-mlx losing across four real workloads.

So this is not a 27B-specific or M3 Ultra-specific issue — the pattern is consistent across multiple model sizes / hardware / quantization formats.

Hypothesis / asks

A few directions worth investigating:

  1. Chunked-prefill step size: runtime.py hardcodes prefill_step_size = 2048 (in two places). On Apple Silicon, larger chunks (e.g., 4K-8K) typically give better arithmetic intensity. (See also #18: the CLI flags --num-draft-tokens / --decode-concurrency / --prompt-concurrency / --prefill-step-size are silently no-ops, so this knob is exposed on the CLI but never read.) A minimal prefill-loop sketch follows this list.
  2. Flash-attention kernel: dflash's MLX prefill path appears not to use a fused flash-attention-style kernel at long context. llama.cpp on Apple Silicon ships an FA-on path (--flash-attn) that is decisive at 16K+ prompts.
  3. Prefill is not part of the speculative critical path, so the drafter doesn't help here. Could the target model's prefill run through the same mlx_lm.generate prefill code path as vanilla mlx-lm, instead of dflash's custom loop?
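
To illustrate item 1, here is a minimal sketch of what a chunked prefill loop with a configurable step size looks like in mlx-lm terms. This is not dflash's actual runtime.py loop; it just uses the standard mlx-lm load / make_prompt_cache APIs, with prefill_step_size as the knob in question.

```python
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache


def chunked_prefill(model, prompt_tokens: mx.array, prefill_step_size: int = 2048):
    """Fill the KV cache by running the prompt through the model in chunks.

    A larger prefill_step_size means fewer, bigger matmuls per chunk (better
    arithmetic intensity on Apple Silicon) at the cost of higher peak memory.
    """
    cache = make_prompt_cache(model)
    n = prompt_tokens.shape[0]
    for start in range(0, n, prefill_step_size):
        chunk = prompt_tokens[start : start + prefill_step_size][None]  # add batch dim
        logits = model(chunk, cache=cache)
        mx.eval(logits)  # force this chunk to execute before queuing the next one
    return cache  # decode can resume from this cache; the last logits give the first token


# usage sketch: compare step sizes on the same long prompt
# model, tokenizer = load("mlx-community/Qwen3.6-27B-Instruct-8bit")
# tokens = mx.array(tokenizer.encode(long_prompt))
# for step in (2048, 4096, 8192):
#     cache = chunked_prefill(model, tokens, prefill_step_size=step)
```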

For now my recommendation to users (which I'd be happy to upstream into the README) is to route requests by prompt length: vanilla mlx-lm or llama.cpp for prompts > 8K, dflash only for short prompts with short replies (where it does win 1.4-1.65x on decode tok/s). A minimal routing sketch is below.
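
A sketch of that routing rule, assuming both backends sit behind OpenAI-compatible endpoints and using a rough character-based token estimate (URLs and the threshold are placeholders taken from the benchmarks above):

```python
DFLASH_URL = "http://localhost:8000"    # dflash-serve (placeholder)
FALLBACK_URL = "http://localhost:8080"  # llama-server or vanilla mlx-lm server (placeholder)
PROMPT_TOKEN_THRESHOLD = 8000           # crossover point suggested by the table above


def estimate_tokens(messages: list[dict]) -> int:
    # crude heuristic: ~4 characters per token; a real router should use the tokenizer
    chars = sum(len(m.get("content", "")) for m in messages)
    return chars // 4


def pick_backend(messages: list[dict]) -> str:
    """Send long prompts to the fast-prefill backend, short ones to dflash."""
    if estimate_tokens(messages) > PROMPT_TOKEN_THRESHOLD:
        return FALLBACK_URL
    return DFLASH_URL
```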

Would also be very interested in any plans to integrate a PFlash-style prefill skip path, even as an optional dependency.
