CLI flags --num-draft-tokens / --decode-concurrency / --prompt-concurrency / --prefill-step-size are silently no-ops #18

@hanxiao

Description

Summary

Four CLI flags exposed by dflash-serve in v0.1.4.1 are accepted by argparse but never read by the runtime. They are no-ops:

  • --num-draft-tokens
  • --decode-concurrency
  • --prompt-concurrency
  • --prefill-step-size

This means --prefill-step-size is silently overridden by a hardcoded 2048 (in two places in runtime.py), and the other three flags have no effect at all in the current codebase. Users tuning these knobs see no change because the values never reach the inference loop.

Reproduction

# All four runs produce identical wall time, identical decode tok/s,
# and the EXACT same generated token sequence (temperature=0, single stream):
for d in 3 6 12 16; do
  dflash-serve --num-draft-tokens $d ...
  curl ... '{"max_tokens":2048,"temperature":0,...}'
done

for c in 32 64 128; do
  dflash-serve --decode-concurrency $c ...
  ...
done

for p in 512 1024 2048 4096; do
  dflash-serve --prefill-step-size $p ...
  ...
done

In my benchmark (Mac Studio M3 Ultra, Qwen3.6-27B-Instruct-8bit + z-lab/Qwen3.6-27B-DFlash, 4K prompt, max_tokens=2048):

| config | decode tok/s | wall (s) | out_tok |
| --- | --- | --- | --- |
| baseline (`--num-draft-tokens 3`) | 28.3 | 102.9 | 2645 |
| `--num-draft-tokens 6` | 28.2 | 103.3 | 2645 |
| `--num-draft-tokens 12` | 28.0 | 103.7 | 2645 |
| `--num-draft-tokens 16` | 28.2 | 103.1 | 2645 |
| `--decode-concurrency 64` | 28.1 | 103.4 | 2645 |
| `--decode-concurrency 128` | 28.3 | 103.0 | 2645 |

The exact same 2645 output tokens every run → an identical generation trajectory regardless of flag value.
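For anyone reproducing this, a quick way to confirm the "identical trajectory" claim is to hash the completion text from each run and check that every run collapses to one digest. This is a hypothetical helper, not part of dflash-serve:

```python
import hashlib

def trajectory_digest(text: str) -> str:
    """Hash a generated completion so runs can be compared byte-for-byte."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def all_identical(completions) -> bool:
    """True iff every completion hashes to the same digest."""
    return len({trajectory_digest(t) for t in completions}) == 1

# Toy sanity check; in practice, feed it the text of each benchmark response.
print(all_identical(["same output", "same output"]))  # True
print(all_identical(["same output", "different"]))    # False
```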

Source-level confirmation

$ grep -n "num_draft_tokens\|decode_concurrency\|prefill_step_size\|prompt_concurrency" \
    site-packages/dflash_mlx/serve.py
# (no matches)

$ grep -rn "num_draft_tokens\|decode_concurrency\|prefill_step_size\|prompt_concurrency" \
    site-packages/dflash_mlx/
runtime.py:1221:        prefill_step_size = 2048
runtime.py:1224:        for chunk_start in range(0, prompt_len, prefill_step_size):
runtime.py:1225:            chunk_end = min(chunk_start + prefill_step_size, prompt_len)
runtime.py:1616:        prefill_step_size = 2048
runtime.py:1619:        for chunk_start in range(0, prompt_len, prefill_step_size):
runtime.py:1620:            chunk_end = min(chunk_start + prefill_step_size, prompt_len)
  • prefill_step_size is hardcoded to 2048 inside runtime.py (twice), not read from cli_args.
  • num_draft_tokens, decode_concurrency, prompt_concurrency are not referenced anywhere in dflash_mlx/ outside the argparse declaration.
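To illustrate what the fix for the prefill flag would look like, here is a minimal sketch of the chunked-prefill loop with the step size threaded through from the parsed CLI args instead of the literal 2048. The function and the `cli_args` parameter are invented for illustration; only the loop shape and the names `prefill_step_size`/`prompt_len` come from the grep output above:

```python
def prefill_chunks(prompt_len: int, cli_args=None):
    """Yield (chunk_start, chunk_end) prefill windows.

    Falls back to 2048 (the current hardcoded value) when no CLI
    override is present, mirroring the getattr fix proposed below.
    """
    prefill_step_size = getattr(cli_args, "prefill_step_size", None) or 2048
    for chunk_start in range(0, prompt_len, prefill_step_size):
        chunk_end = min(chunk_start + prefill_step_size, prompt_len)
        yield chunk_start, chunk_end

# A 5000-token prompt with the default step splits into three chunks:
print(list(prefill_chunks(5000)))  # [(0, 2048), (2048, 4096), (4096, 5000)]
```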

Request

One of:

  1. Wire the flags through to runtime.py (prefill_step_size is the lowest-hanging fruit — just replace the literal 2048 with getattr(cli_args, "prefill_step_size", 2048)).
  2. Document them as no-ops in --help (or hide them via argparse.SUPPRESS), so users don't waste time benchmarking them.
  3. At minimum, log a one-time warning at startup, e.g. `[dflash] --num-draft-tokens=12 has no effect in this build`, so the silent fallback is discoverable.
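Option (3) is a few lines. A sketch of what such a startup check could look like, with the flag names taken from this issue and the function itself hypothetical:

```python
import warnings

# Flags that argparse accepts but the runtime never reads (per this issue).
_NOOP_FLAGS = (
    "num_draft_tokens",
    "decode_concurrency",
    "prompt_concurrency",
    "prefill_step_size",
)

def warn_noop_flags(cli_args) -> None:
    """Emit one warning per no-op flag the user explicitly set."""
    for name in _NOOP_FLAGS:
        value = getattr(cli_args, name, None)
        if value is not None:
            flag = "--" + name.replace("_", "-")
            warnings.warn(
                f"[dflash] {flag}={value} has no effect in this build",
                stacklevel=2,
            )
```

Calling this once right after `parse_args()` would make the silent fallback visible without changing any runtime behavior.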

The vLLM-side DFlash docs explicitly recommend tuning --num-speculative-tokens, so users coming from that recipe naturally try the same here and get nothing.

Happy to PR option (1) for --prefill-step-size if that's the preferred direction.
