Describe the Bug
When using the trtllm backend, the /v1/completions API endpoint does not behave like a standard completions endpoint. Instead of treating the prompt field as raw text for the model to complete directly, the server wraps the input prompt in a chat-style message, {"role": "user", "content": "<prompt>"}, before sending it to the underlying language model.
This behavior is incorrect and breaks compatibility with the OpenAI Completions API, as well as with client libraries and applications that expect a standard completions endpoint.
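For reference, this is roughly how a standard client library would exercise the endpoint. A minimal sketch using the official openai Python package against the local frontend from the curl repro below; the base URL, model name, and placeholder API key are taken from or assumed alongside that example:

from openai import OpenAI

# Minimal sketch: assumes the openai Python package and the local frontend
# from the curl repro below; the API key is a placeholder (assumed not to be
# required by the local server).
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.completions.create(
    model="Qwen3-32B-block-FP8",
    prompt="San Francisco is a",
    max_tokens=32,
)

# A spec-compliant /v1/completions endpoint passes the prompt to the model
# verbatim; no chat template should be applied.
print(resp.choices[0].text)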
Steps to Reproduce
curl http://127.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-32B-block-FP8",
"prompt": "San Francisco is a",
"stream":false,
"max_tokens": 32
}'
Expected Behavior
The LLM engine should receive the raw prompt San Francisco is a as input and generate a completion directly from it.
Actual Behavior
The server rewrites the request as a chat message before passing it to the model. The LLM engine actually receives <|im_start|>user\nSan Francisco is a<|im_end|>\n<|im_start|>assistant\n as input:
{"role": "user", "content": "San Francisco is a"}
# after applying the chat template
<|im_start|>user\nSan Francisco is a<|im_end|>\n<|im_start|>assistant\n
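For illustration, the wrapped message appears to be rendered with the model's chat template. A minimal sketch of reproducing that rendering with Hugging Face transformers; the Qwen/Qwen3-32B tokenizer id is an assumption standing in for the local Qwen3-32B-block-FP8 checkpoint, and the exact output may vary with the template version:

from transformers import AutoTokenizer

# Assumed tokenizer id standing in for the local Qwen3-32B-block-FP8 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = [{"role": "user", "content": "San Francisco is a"}]

# Rendering the chat-wrapped message the way the server appears to do it
# yields the <|im_start|>... string shown above.
rendered = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(repr(rendered))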
Environment
ai-dynamo 0.4.1
ai-dynamo-runtime 0.4.1
Additional Context
No response
Screenshots
No response