
[BUG]: /v1/completions Endpoint Incorrectly Wraps Input in Chat Format #3012

@Nekofish-L

Description

Describe the Bug

When using the trtllm backend, the /v1/completions API endpoint does not behave according to the OpenAI Completions API. Instead of treating the prompt field as raw text for the model to complete directly, the server wraps the input prompt in a chat-style message, {"role": "user", "content": "..."}, before sending it to the underlying language model.

This behavior is incorrect and breaks compatibility with the OpenAI Completions API standard, as well as with client libraries and applications that expect a standard completions endpoint.

Steps to Reproduce

curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-32B-block-FP8",
    "prompt": "San Francisco is a",
    "stream": false,
    "max_tokens": 32
  }'
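
For reference, here is an equivalent reproduction using the OpenAI Python client pointed at the same local server (a sketch, not part of the original report; the dummy API key is a placeholder since the server does not require one):

# Equivalent reproduction with the OpenAI Python client (sketch).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="dummy")

resp = client.completions.create(
    model="Qwen3-32B-block-FP8",
    prompt="San Francisco is a",
    max_tokens=32,
    stream=False,
)
print(resp.choices[0].text)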

Expected Behavior

The LLM engine receives the raw prompt San Francisco is a as input and generates a completion directly from it.

Actual Behavior

The server modifies the request and presents a chat message to the model. The LLM engine actually receives <|im_start|>user\nSan Francisco is a<|im_end|>\n<|im_start|>assistant\n as input.

{"role": "user", "content": "San Francisco is a"}
# after apply chat template
<|im_start|>user\nSan Francisco is a<|im_end|>\n<|im_start|>assistant\n
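
For illustration, the rendered string above is what you get when the prompt is wrapped as a user message and passed through the model's chat template, e.g. with the Hugging Face tokenizer (a sketch, not the actual frontend code; the model id and the exact rendered output depend on the served checkpoint):

# Sketch of how the chat-wrapped input arises (assumed Qwen3 tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")  # assumed model id

prompt = "San Francisco is a"

# What a standard completions endpoint should tokenize: the raw prompt.
raw_ids = tokenizer(prompt).input_ids

# What the server currently renders: the prompt wrapped in a user message,
# with the chat template applied and a generation prompt appended.
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)
# e.g. <|im_start|>user\nSan Francisco is a<|im_end|>\n<|im_start|>assistant\n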

Environment

ai-dynamo 0.4.1
ai-dynamo-runtime 0.4.1

Additional Context

No response

Screenshots

No response

Metadata

Labels

bug (Something isn't working)
frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`)
