CLI flags --num-draft-tokens / --decode-concurrency / --prompt-concurrency / --prefill-step-size are silently no-ops #18

@hanxiao

Description

Summary

Four CLI flags exposed by dflash-serve in v0.1.4.1 are accepted by argparse but never read by the runtime. They are no-ops:

  • --num-draft-tokens
  • --decode-concurrency
  • --prompt-concurrency
  • --prefill-step-size

This means --prefill-step-size is silently overridden by a hardcoded 2048 (in two places in runtime.py), and the other three flags have no effect at all in the current codebase. Users tuning these knobs see no change because the values never reach the inference loop.

Reproduction

# All four runs produce identical wall time, identical decode tok/s,
# and the EXACT same generated token sequence (temperature=0, single stream):
for d in 3 6 12 16; do
  dflash-serve --num-draft-tokens $d ...
  curl ... '{"max_tokens":2048,"temperature":0,...}'
done

for c in 32 64 128; do
  dflash-serve --decode-concurrency $c ...
  ...
done

for p in 512 1024 2048 4096; do
  dflash-serve --prefill-step-size $p ...
  ...
done

In my benchmark (Mac Studio M3 Ultra, Qwen3.6-27B-Instruct-8bit + z-lab/Qwen3.6-27B-DFlash, 4K prompt, max_tokens=2048):

| config | decode tok/s | wall (s) | out_tok |
| --- | --- | --- | --- |
| baseline (`--num-draft-tokens 3`) | 28.3 | 102.9 | 2645 |
| `--num-draft-tokens 6` | 28.2 | 103.3 | 2645 |
| `--num-draft-tokens 12` | 28.0 | 103.7 | 2645 |
| `--num-draft-tokens 16` | 28.2 | 103.1 | 2645 |
| `--decode-concurrency 64` | 28.1 | 103.4 | 2645 |
| `--decode-concurrency 128` | 28.3 | 103.0 | 2645 |

The exact same 2645 output tokens every run → an identical generation trajectory regardless of flag value.
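For anyone reproducing this, a quick way to confirm the "identical trajectory" claim is to hash the completion text from each run and check that every run collapses to one digest. This is a hypothetical helper, not part of dflash-serve:

```python
import hashlib

def trajectory_digest(text: str) -> str:
    """Hash a generated completion so runs can be compared byte-for-byte."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def all_identical(completions) -> bool:
    """True iff every completion hashes to the same digest."""
    return len({trajectory_digest(t) for t in completions}) == 1

# Toy sanity check; in practice, feed it the text of each benchmark response.
print(all_identical(["same output", "same output"]))  # True
print(all_identical(["same output", "different"]))    # False
```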

Source-level confirmation

$ grep -n "num_draft_tokens\|decode_concurrency\|prefill_step_size\|prompt_concurrency" \
    site-packages/dflash_mlx/serve.py
# (no matches)

$ grep -rn "num_draft_tokens\|decode_concurrency\|prefill_step_size\|prompt_concurrency" \
    site-packages/dflash_mlx/
runtime.py:1221:        prefill_step_size = 2048
runtime.py:1224:        for chunk_start in range(0, prompt_len, prefill_step_size):
runtime.py:1225:            chunk_end = min(chunk_start + prefill_step_size, prompt_len)
runtime.py:1616:        prefill_step_size = 2048
runtime.py:1619:        for chunk_start in range(0, prompt_len, prefill_step_size):
runtime.py:1620:            chunk_end = min(chunk_start + prefill_step_size, prompt_len)
  • prefill_step_size is hardcoded to 2048 inside runtime.py (twice), not read from cli_args.
  • num_draft_tokens, decode_concurrency, prompt_concurrency are not referenced anywhere in dflash_mlx/ outside the argparse declaration.
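To illustrate what the fix for the prefill flag would look like, here is a minimal sketch of the chunked-prefill loop with the step size threaded through from the parsed CLI args instead of the literal 2048. The function and the `cli_args` parameter are invented for illustration; only the loop shape and the names `prefill_step_size`/`prompt_len` come from the grep output above:

```python
def prefill_chunks(prompt_len: int, cli_args=None):
    """Yield (chunk_start, chunk_end) prefill windows.

    Falls back to 2048 (the current hardcoded value) when no CLI
    override is present, mirroring the getattr fix proposed below.
    """
    prefill_step_size = getattr(cli_args, "prefill_step_size", None) or 2048
    for chunk_start in range(0, prompt_len, prefill_step_size):
        chunk_end = min(chunk_start + prefill_step_size, prompt_len)
        yield chunk_start, chunk_end

# A 5000-token prompt with the default step splits into three chunks:
print(list(prefill_chunks(5000)))  # [(0, 2048), (2048, 4096), (4096, 5000)]
```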

Request

One of:

  1. Wire the flags through to runtime.py (prefill_step_size is the lowest-hanging fruit — just replace the literal 2048 with getattr(cli_args, "prefill_step_size", 2048)).
  2. Document them as no-ops in --help (or hide them via argparse.SUPPRESS), so users don't waste time benchmarking them.
  3. At minimum, log a one-time warning at startup, e.g. `[dflash] --num-draft-tokens=12 has no effect in this build`, so the silent fallback is discoverable.
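Option (3) is a few lines. A sketch of what such a startup check could look like, with the flag names taken from this issue and the function itself hypothetical:

```python
import warnings

# Flags that argparse accepts but the runtime never reads (per this issue).
_NOOP_FLAGS = (
    "num_draft_tokens",
    "decode_concurrency",
    "prompt_concurrency",
    "prefill_step_size",
)

def warn_noop_flags(cli_args) -> None:
    """Emit one warning per no-op flag the user explicitly set."""
    for name in _NOOP_FLAGS:
        value = getattr(cli_args, name, None)
        if value is not None:
            flag = "--" + name.replace("_", "-")
            warnings.warn(
                f"[dflash] {flag}={value} has no effect in this build",
                stacklevel=2,
            )
```

Calling this once right after `parse_args()` would make the silent fallback visible without changing any runtime behavior.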

The vLLM-side DFlash docs explicitly recommend tuning --num-speculative-tokens, so users coming from that recipe naturally try the same here and get nothing.

Happy to PR option (1) for --prefill-step-size if that's the preferred direction.
