Summary
Four CLI flags exposed by `dflash-serve` in v0.1.4.1 are accepted by `argparse` but never read by the runtime. They are no-ops:

- `--num-draft-tokens`
- `--decode-concurrency`
- `--prompt-concurrency`
- `--prefill-step-size`

This means `--prefill-step-size` is silently hardcoded to 2048 (twice, in `runtime.py`), and the other three have no effect at all in the current codebase. Users tuning these knobs see no change because the values never reach the inference loop.
Reproduction
```shell
# All four runs produce identical wall time, identical decode tok/s,
# and the EXACT same generated token sequence (temperature=0, single stream):
for d in 3 6 12 16; do
  dflash-serve --num-draft-tokens $d ...
  curl ... '{"max_tokens":2048,"temperature":0,...}'
done

for c in 32 64 128; do
  dflash-serve --decode-concurrency $c ...
  ...
done

for p in 512 1024 2048 4096; do
  dflash-serve --prefill-step-size $p ...
  ...
done
```
In my benchmark (Mac Studio M3 Ultra, Qwen3.6-27B-Instruct-8bit + z-lab/Qwen3.6-27B-DFlash, 4K prompt, `max_tokens=2048`):

| config | decode tok/s | wall (s) | out_tok |
| --- | --- | --- | --- |
| baseline (`--num-draft-tokens 3`) | 28.3 | 102.9 | 2645 |
| `--num-draft-tokens 6` | 28.2 | 103.3 | 2645 |
| `--num-draft-tokens 12` | 28.0 | 103.7 | 2645 |
| `--num-draft-tokens 16` | 28.2 | 103.1 | 2645 |
| `--decode-concurrency 64` | 28.1 | 103.4 | 2645 |
| `--decode-concurrency 128` | 28.3 | 103.0 | 2645 |
The exact same 2645 output tokens appear every run: the generation trajectory is identical regardless of flag value.
Source-level confirmation
```shell
$ grep -n "num_draft_tokens\|decode_concurrency\|prefill_step_size\|prompt_concurrency" \
    site-packages/dflash_mlx/serve.py
# (no matches)

$ grep -rn "num_draft_tokens\|decode_concurrency\|prefill_step_size\|prompt_concurrency" \
    site-packages/dflash_mlx/
runtime.py:1221: prefill_step_size = 2048
runtime.py:1224: for chunk_start in range(0, prompt_len, prefill_step_size):
runtime.py:1225: chunk_end = min(chunk_start + prefill_step_size, prompt_len)
runtime.py:1616: prefill_step_size = 2048
runtime.py:1619: for chunk_start in range(0, prompt_len, prefill_step_size):
runtime.py:1620: chunk_end = min(chunk_start + prefill_step_size, prompt_len)
```
- `prefill_step_size` is hardcoded to 2048 inside `runtime.py` (twice), not read from `cli_args`.
- `num_draft_tokens`, `decode_concurrency`, and `prompt_concurrency` are not referenced anywhere in `dflash_mlx/` outside the `argparse` declaration.
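For reference, the chunking logic the grep turns up can be sketched as a standalone helper, with the literal 2048 lifted to a parameter. This is an illustration of the loop shape shown above, not the project's actual code:

```python
def prefill_chunks(prompt_len: int, prefill_step_size: int = 2048):
    """Yield (chunk_start, chunk_end) ranges covering the prompt,
    mirroring the loop hardcoded in runtime.py but parameterized."""
    for chunk_start in range(0, prompt_len, prefill_step_size):
        chunk_end = min(chunk_start + prefill_step_size, prompt_len)
        yield chunk_start, chunk_end


# e.g. a 5000-token prompt with the default step:
# list(prefill_chunks(5000)) -> [(0, 2048), (2048, 4096), (4096, 5000)]
```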
Request
One of:

1. Wire the flags through to `runtime.py` (`prefill_step_size` is the lowest-hanging fruit: just replace the literal `2048` with `getattr(cli_args, "prefill_step_size", 2048)`).
2. Document them as no-ops in `--help` instead of `argparse.SUPPRESS`, so users don't waste time benchmarking them.
3. At minimum, log a one-time warning at startup like `[dflash] --num-draft-tokens=12 has no effect in this build` so the silent fallback is discoverable.
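A minimal sketch of the startup-warning option. The names `NOOP_FLAGS` and `warn_noop_flags` are hypothetical, purely illustrative of where such a check could hook in after argument parsing:

```python
import argparse
import warnings

# Hypothetical list of the four dead flags (attribute names, not CLI spellings).
NOOP_FLAGS = ("num_draft_tokens", "decode_concurrency",
              "prompt_concurrency", "prefill_step_size")


def warn_noop_flags(args: argparse.Namespace) -> None:
    """Emit one warning per no-op flag the user actually set,
    so the silent fallback is visible at startup."""
    for name in NOOP_FLAGS:
        value = getattr(args, name, None)
        if value is not None:
            flag = "--" + name.replace("_", "-")
            warnings.warn(
                f"[dflash] {flag}={value} has no effect in this build",
                stacklevel=2,
            )
```

With `default=None` on the argparse declarations, this only fires for flags the user explicitly passed.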
The vLLM-side DFlash docs explicitly recommend tuning `--num-speculative-tokens`, so users coming from that recipe naturally try the same here and get nothing.

Happy to PR option (1) for `--prefill-step-size` if that's the preferred direction.