Description
Prompting Qwen3 instruct models (e.g. Qwen3-4B-Instruct-2507-FP8 or Qwen3-30B-A3B-Instruct-2507) served via vllm-openai to produce structured output seems to make the model generate tokens endlessly until it hits the context-length limit, even for very simple prompts like "What is 4+4?":
LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=4081, prompt_tokens=15, total_tokens=4096, completion_tokens_details=None, prompt_tokens_details=None)
This was observed on a server running Qwen3-30B-A3B-Instruct-2507 on an H100 GPU and could also be reproduced locally with Qwen3-4B-Instruct-2507-FP8 on an RTX 2000.
The issue could not be reproduced with the regular Qwen3 models (e.g. Qwen3-0.6B), so it appears to be specific to the instruct models.
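To confirm that generation only stops at the token limit, a small cap on max_tokens makes the failure visible quickly. This is a minimal diagnostic sketch, not part of the original report; the json_schema payload is hand-written to mirror the CalculationResult model used in the reproduction below:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="...")

response = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "What is 4+4?"}],
    max_tokens=256,  # small cap so the request returns quickly instead of filling the context
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "CalculationResult",
            "schema": {
                "type": "object",
                "properties": {"answer": {"type": "integer"}},
                "required": ["answer"],
            },
        },
    },
)
choice = response.choices[0]
print(choice.finish_reason)  # "length" when the runaway generation triggers
print(response.usage)        # completion_tokens should sit at the cap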
Reproduction
Run a Qwen3 instruct model in the vllm-openai Docker container:
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Qwen/Qwen3-4B-Instruct-2507-FP8 --port=8000 --served-model-name=qwen --max-model-len=4096 --gpu-memory-utilization=0.8 --max-num-seqs=1
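For reproducing without Docker, an equivalent invocation with a pip-installed vLLM should look roughly like this (same flags as the Docker command above; an assumed alternative, not taken from the original report):
vllm serve Qwen/Qwen3-4B-Instruct-2507-FP8 \
    --port 8000 \
    --served-model-name qwen \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.8 \
    --max-num-seqs 1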
Prompt the model to produce structured output, e.g. via curl or openai-python:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(base_url="http://localhost:8000/v1", api_key="...")
model = "qwen"

messages = [
    {
        "role": "user",
        "content": "What is 4+4?",
    }
]

# normal completion works fine
response = client.chat.completions.create(
    model=model,
    messages=messages,
)
print(response.choices[0].message.content)

# structured output errors
class CalculationResult(BaseModel):
    answer: int

response = client.beta.chat.completions.parse(
    model=model,
    messages=messages,
    response_format=CalculationResult,
)
print(response.choices[0].message.content)
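For completeness, a rough curl equivalent of the parse() call above; the json_schema body is hand-written from the CalculationResult model and may differ slightly from the exact payload openai-python sends:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "What is 4+4?"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "CalculationResult",
        "schema": {
          "type": "object",
          "properties": {"answer": {"type": "integer"}},
          "required": ["answer"]
        }
      }
    }
  }'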
Logs
INFO 10-20 02:49:53 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 10-20 02:49:58 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 10-20 02:49:58 [utils.py:233] non-default args: {'model': 'Qwen/Qwen3-4B-Instruct-2507-FP8', 'max_model_len': 4096, 'served_model_name': ['qwen'], 'gpu_memory_utilization': 0.8, 'max_num_seqs': 1}
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 10-20 02:50:06 [model.py:547] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=1) INFO 10-20 02:50:06 [model.py:1510] Using max model len 4096
(APIServer pid=1) INFO 10-20 02:50:07 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 10-20 02:50:10 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=92) INFO 10-20 02:50:12 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=92) INFO 10-20 02:50:12 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-4B-Instruct-2507-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-4B-Instruct-2507-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":4,"local_cache_dir":null}
(EngineCore_DP0 pid=92) W1020 02:50:13.172000 92 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_DP0 pid=92) W1020 02:50:13.172000 92 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=92) INFO 10-20 02:50:14 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=92) INFO 10-20 02:50:14 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=92) INFO 10-20 02:50:14 [gpu_model_runner.py:2602] Starting to load model Qwen/Qwen3-4B-Instruct-2507-FP8...
(EngineCore_DP0 pid=92) INFO 10-20 02:50:14 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=92) INFO 10-20 02:50:14 [cuda.py:366] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=92) INFO 10-20 02:50:15 [weight_utils.py:392] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=92) INFO 10-20 02:50:15 [weight_utils.py:450] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.01it/s]
(EngineCore_DP0 pid=92)
(EngineCore_DP0 pid=92) INFO 10-20 02:50:16 [default_loader.py:267] Loading weights took 1.03 seconds
(EngineCore_DP0 pid=92) INFO 10-20 02:50:17 [gpu_model_runner.py:2653] Model loading took 4.2299 GiB and 2.154120 seconds
(EngineCore_DP0 pid=92) INFO 10-20 02:50:22 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/6e7e71e247/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=92) INFO 10-20 02:50:22 [backends.py:559] Dynamo bytecode transform time: 4.76 s
(EngineCore_DP0 pid=92) INFO 10-20 02:50:25 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=92) INFO 10-20 02:50:40 [backends.py:218] Compiling a graph for dynamic shape takes 17.97 s
(EngineCore_DP0 pid=92) WARNING 10-20 02:50:41 [fp8_utils.py:576] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=2560,device_name=NVIDIA_RTX_2000_Ada_Generation_Laptop_GPU,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=92) WARNING 10-20 02:50:42 [fp8_utils.py:576] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=2560,K=4096,device_name=NVIDIA_RTX_2000_Ada_Generation_Laptop_GPU,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=92) WARNING 10-20 02:50:42 [fp8_utils.py:576] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=19456,K=2560,device_name=NVIDIA_RTX_2000_Ada_Generation_Laptop_GPU,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=92) WARNING 10-20 02:50:42 [fp8_utils.py:576] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=2560,K=9728,device_name=NVIDIA_RTX_2000_Ada_Generation_Laptop_GPU,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=92) INFO 10-20 02:51:08 [monitor.py:34] torch.compile takes 22.73 s in total
(EngineCore_DP0 pid=92) INFO 10-20 02:51:09 [gpu_worker.py:298] Available KV cache memory: 1.72 GiB
(EngineCore_DP0 pid=92) INFO 10-20 02:51:09 [kv_cache_utils.py:1087] GPU KV cache size: 12,544 tokens
(EngineCore_DP0 pid=92) INFO 10-20 02:51:09 [kv_cache_utils.py:1091] Maximum concurrency for 4,096 tokens per request: 3.06x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 3/3 [00:01<00:00, 2.00it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 1/1 [00:00<00:00, 19.97it/s]
(EngineCore_DP0 pid=92) INFO 10-20 02:51:11 [gpu_model_runner.py:3480] Graph capturing finished in 2 secs, took 0.04 GiB
(EngineCore_DP0 pid=92) INFO 10-20 02:51:11 [core.py:210] init engine (profile, create kv cache, warmup model) took 54.83 seconds
(APIServer pid=1) INFO 10-20 02:51:12 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 784
(APIServer pid=1) INFO 10-20 02:51:12 [api_server.py:1634] Supported_tasks: ['generate']
(APIServer pid=1) WARNING 10-20 02:51:13 [model.py:1389] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 10-20 02:51:13 [serving_responses.py:137] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 10-20 02:51:13 [serving_chat.py:139] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 10-20 02:51:13 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 10-20 02:51:13 [api_server.py:1912] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:34] Available routes are:
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /health, Methods: GET
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /load, Methods: GET
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /ping, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /ping, Methods: GET
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /version, Methods: GET
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /classify, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /score, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 10-20 02:51:13 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO 10-20 02:51:20 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 10-20 03:31:03 [loggers.py:127] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 0.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:31:13 [loggers.py:127] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 11.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:31:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:31:33 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:31:43 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:31:53 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:32:03 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:32:13 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:32:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:32:33 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:32:43 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:32:53 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:33:03 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 20.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:33:13 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:33:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.9%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:33:33 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:33:43 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:33:53 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:34:03 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-20 03:34:13 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 32.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:44742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Environment Information
Local:
- GPU: RTX 2000
- CUDA Version: 12.4
- Driver Version: 550.163.01
- vllm: tested with 0.9.0.1 and 0.11.0 (latest)
- model: Qwen3-4B-Instruct-2507-FP8
- openai-python: 1.68.2
Server:
- GPU: H100
- CUDA Version: 12.8
- Driver Version: 570.148.08
- vllm: 0.9.0.1
- model: Qwen3-30B-A3B-Instruct-2507
Structured output parameters for vllm were left at default values:
structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='')
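It may also help to exercise vLLM's guided decoding directly via the guided_json extra parameter, bypassing the OpenAI response_format/json_schema route. A sketch reusing client, model, messages, and CalculationResult from the reproduction above; whether this path triggers the same behavior was not tested here:
# Sketch: send the same schema through vLLM's guided_json extra parameter
response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"guided_json": CalculationResult.model_json_schema()},
)
print(response.choices[0].message.content)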
Known Issue
- The issue hasn't already been addressed in Documentation, Issues, or Discussions.