[experimental chore/agentx-v0.2-aiperf-testing branch] agentic build_replay_cmd: missing --tokenizer-trust-remote-code breaks Kimi-K2.5 (and other custom-code tokenizer models) #1369

@andyluo7

Summary

benchmark_lib.sh::build_replay_cmd does not pass --tokenizer-trust-remote-code to aiperf profile. This makes the launcher unusable for any model whose tokenizer requires executing custom Python code from the HuggingFace repo — for example, Kimi-K2.5-MXFP4 (and likely several other Qwen/Kimi/MiniMax variants).

aiperf aborts as soon as it starts the first trace replay request:

╭─ Tokenizer Error ────────────────────────────────────────────────────────────╮
│  Failed to load tokenizer 'amd/Kimi-K2.5-MXFP4'                              │
│                                                                              │
│  AIPerf needs a tokenizer for accurate client-side token counting and        │
│  synthetic prompt generation.                                                │
│                                                                              │
│  Possible Causes:                                                            │
│    • Tokenizer requires executing custom Python code from HuggingFace        │
│                                                                              │
│  Suggested Fixes:                                                            │
│    • Add: --tokenizer-trust-remote-code                                      │
│    • Skip tokenizer (non-synthetic data only): --use-server-token-count      │
╰──────────────────────────────────────────────────────────────────────────────╯

Notably, build_replay_cmd already passes --use-server-token-count, but aiperf still loads the tokenizer for trace dataset preprocessing (not just synthetic generation), so that skip flag alone is not sufficient.

Repro

Run any agentic-coding sweep against a model with a custom-code tokenizer:

podman run ... -e MODEL=amd/Kimi-K2.5-MXFP4 -e TP=4 -e CONC=1 -e OFFLOADING=none \
  -e RESULT_DIR=... \
  ... vllm/vllm-openai-rocm:v0.19.1 \
  /workspace/benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh

vLLM server starts cleanly (loads model, /health 200). aiperf then errors immediately on the first request with the tokenizer load failure above. Every CONC point in the sweep produces an empty trace_replay/ directory with no actual benchmark data — sweep_summary.csv has only headers.
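Because the failure mode is silent (a header-only sweep_summary.csv rather than a crashed sweep), a fail-fast check after the first CONC point would have caught this early. The following is a minimal sketch, not code from the repo: summary_has_data is a hypothetical helper, and it assumes the first non-empty line of the CSV is the header.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of benchmark_lib.sh).
# Returns 0 if the CSV has at least one data row after the header,
# 1 otherwise. A missing file counts as "no data".
summary_has_data() {
    local csv="$1"
    [ -f "$csv" ] || return 1
    local rows
    # Count non-empty lines; grep exits non-zero on zero matches,
    # so "|| true" keeps this safe under "set -e".
    rows=$(grep -c . "$csv" || true)
    [ "$rows" -gt 1 ]
}

# Possible usage (the CSV location under RESULT_DIR is an assumption):
#   summary_has_data "$RESULT_DIR/sweep_summary.csv" \
#       || { echo "sweep produced no data rows" >&2; exit 1; }
```

Running this after the first concurrency point, rather than at the end of the sweep, would turn a multi-hour empty run into a failure within minutes.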

Suggested fix

 REPLAY_CMD+=" --output-artifact-dir $result_dir/trace_replay"
+# Required for models whose tokenizer ships custom Python code (Kimi-K2.5,
+# some Qwen/MiniMax variants). aiperf's trace-dataset preprocessing calls
+# AutoTokenizer.from_pretrained even when --use-server-token-count is set,
+# so the trust flag is not optional.
+REPLAY_CMD+=" --tokenizer-trust-remote-code"
 if [ "$duration" -lt 900 ] || [ "${AIPERF_UNSAFE_OVERRIDE:-false}" = "true" ]; then
     REPLAY_CMD+=" --unsafe-override"
 fi

This is safe because the launcher already uses --trust-remote-code for the vLLM server side; aiperf needs the matching flag.
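To keep the flag from being dropped again, a small regression check could assert that the assembled command always carries it. The sketch below is self-contained and illustrative only: build_replay_cmd_sketch is a hypothetical stand-in that reproduces just the lines quoted in this issue (the real build_replay_cmd sets more options), and replay_cmd_trusts_tokenizer is likewise a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the tail of build_replay_cmd; the real function
# lives in benchmark_lib.sh and sets more options than shown here.
build_replay_cmd_sketch() {
    local result_dir="$1" duration="$2"
    REPLAY_CMD="aiperf profile"
    REPLAY_CMD+=" --use-server-token-count"
    REPLAY_CMD+=" --output-artifact-dir $result_dir/trace_replay"
    # The one-line fix proposed in this issue.
    REPLAY_CMD+=" --tokenizer-trust-remote-code"
    if [ "$duration" -lt 900 ] || [ "${AIPERF_UNSAFE_OVERRIDE:-false}" = "true" ]; then
        REPLAY_CMD+=" --unsafe-override"
    fi
}

# Returns 0 only if the assembled command contains the trust flag
# as a whole word (padding with spaces avoids partial matches).
replay_cmd_trusts_tokenizer() {
    case " $REPLAY_CMD " in
        *" --tokenizer-trust-remote-code "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

The check is independent of the duration branch, so it holds for both short and long replays.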

Why MiniMax-M2.5 worked but Kimi-K2.5 didn't

MiniMaxAI/MiniMax-M2.5 uses a standard HF tokenizer (no custom code). amd/Kimi-K2.5-MXFP4 ships with custom tokenizer files (tool_declaration_ts.py, media_utils.py — visible in the v0.19.1 vLLM startup log: "A new version of the following files was downloaded from huggingface.co/amd/Kimi-K2.5-MXFP4"). The current build_replay_cmd only happens to work for the subset of models with stock tokenizers.
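One way to tell the two cases apart ahead of time is to look at the model's tokenizer_config.json: Hugging Face repos that ship a custom tokenizer class normally register it under an "auto_map" key there. The following is a rough heuristic under that assumption, not something aiperf or the launcher does today; tokenizer_needs_remote_code is a hypothetical helper, and it expects a local directory that already contains the downloaded model files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (heuristic only).
# Returns 0 if tokenizer_config.json in the given local model directory
# declares an "auto_map" entry, which is how a repo points AutoTokenizer
# at custom Python classes. Returns 1 if the key or the file is absent.
tokenizer_needs_remote_code() {
    local model_dir="$1"
    local cfg="$model_dir/tokenizer_config.json"
    [ -f "$cfg" ] || return 1
    grep -q '"auto_map"' "$cfg"
}

# Possible usage (the path is a placeholder):
#   if tokenizer_needs_remote_code /path/to/local/model; then
#       echo "tokenizer ships custom code"
#   fi
```

Since the launcher already trusts remote code on the vLLM side, passing the flag unconditionally as suggested above is simpler; a check like this is mainly useful for explaining why one model works and another does not.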

Impact

Every Kimi-family model in v0.2 (kimik2.5-fp4-mi355x-vllm, kimik2.5-fp4-b200-vllm, kimik2.5-int4-b200-vllm, etc.) currently cannot be benchmarked through the agentic-coding scenario. We ran a 4-hour kimik2.5 FP4 MI355X sweep that produced zero data (all 7 concurrency points failed identically) before noticing this. The fix is a single added line.

Environment

  • Branch: chore/agentx-v0.2-aiperf-testing (tip c8dfb585)
  • aiperf submodule: 70fecb2e
  • Reproducer image: vllm/vllm-openai-rocm:v0.19.1
  • Reproducer model: amd/Kimi-K2.5-MXFP4
  • Hardware: AAC1 MI355X (gfx950), TP=4

Companion to issues #1358, #1359, #1360, #1363 (other v0.2 issues found in the same campaign).
