fix: Do not apply chat template to completions #2718
Conversation
Walkthrough

Introduces a builder-based preprocessing flow in OpenAIPreprocessor, adding builder/apply_template/gather_tokens methods and refactoring request handling to produce PreprocessedRequest via PreprocessedRequestBuilder. Centralizes EOS/stop handling and backend_instance_id extraction. Removes debug logging in BasicReasoningParser and adjusts logging string interpolation in reasoning::mod without changing behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Caller
    participant OAI as OpenAIPreprocessor
    participant B as PreprocessedRequestBuilder
    participant T as Tokenizer/Template
    Caller->>OAI: preprocess (chat/completion request)
    OAI->>OAI: builder(request)
    OAI->>T: apply_template(request)
    T-->>OAI: formatted_prompt?
    OAI->>B: gather_tokens(request, builder, formatted_prompt)
    Note over OAI,B: Centralized EOS/stop handling and nvext extraction
    OAI->>B: build()
    B-->>OAI: PreprocessedRequest (+annotations)
    OAI-->>Caller: PreprocessedRequest
```
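To make the diagram concrete, here is a minimal sketch of the chat-side call order it describes. The method names (builder, apply_template, gather_tokens, build) come from the walkthrough above; the exact signatures, error type, and return shape are assumptions, not the actual dynamo API.

```rust
use std::collections::HashMap;

// Sketch only: call order of the new builder-based flow, with assumed signatures.
fn preprocess_chat_sketch(
    oai: &OpenAIPreprocessor,
    request: &NvCreateChatCompletionRequest,
) -> anyhow::Result<(PreprocessedRequest, HashMap<String, String>)> {
    // Seed the builder with EOS/stop tokens, sampling options and nvext fields.
    let mut builder = oai.builder(request)?;
    // The chat path renders the chat template; the completions path skips this (None).
    let formatted_prompt = oai.apply_template(request)?;
    // Tokenize and collect any requested annotations (formatted prompt, token ids).
    let annotations = oai.gather_tokens(request, &mut builder, formatted_prompt)?;
    Ok((builder.build()?, annotations))
}
```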
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45–75 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/preprocessor.rs (1)
247-320: Bug: gather_tokens panics for text-single completions (formatted_prompt is None)

In NvCreateCompletionRequest.generate you call:
```rust
let annotations = self.gather_tokens(&request, &mut builder, None)?;
```
But in TextInput::Single you do:
```rust
let formatted_prompt = formatted_prompt.expect("...unreachable");
```
That makes completions with a single text prompt panic. Use the provided formatted_prompt when present (chat), else the raw text (completions).
Apply this focused refactor:
```diff
-TextInput::Single(_) => {
-    let formatted_prompt = formatted_prompt.expect("Could not find a prompt. The paired match statements earlier should make this unreachable");
-    let encoding = self.tokenizer.encode(&formatted_prompt)?;
-
-    if request.has_annotation(ANNOTATION_FORMATTED_PROMPT) {
-        annotations.insert(
-            ANNOTATION_FORMATTED_PROMPT.to_string(),
-            formatted_prompt,
-        );
-    }
+TextInput::Single(text) => {
+    // Use chat-formatted prompt when provided; otherwise use the raw text
+    let prompt_for_tokenization = formatted_prompt.unwrap_or(text);
+    let encoding = self.tokenizer.encode(&prompt_for_tokenization)?;
+
+    if request.has_annotation(ANNOTATION_FORMATTED_PROMPT) {
+        annotations.insert(
+            ANNOTATION_FORMATTED_PROMPT.to_string(),
+            prompt_for_tokenization.clone(),
+        );
+    }
     if request.has_annotation(ANNOTATION_TOKEN_IDS) {
         annotations.insert(
             ANNOTATION_TOKEN_IDS.to_string(),
             serde_json::to_string(encoding.token_ids())?,
         );
     }
     builder.token_ids(encoding.token_ids().to_vec());
 }
```

Optional: For batch text inputs, consider using encode_batch to avoid per-item overhead and potential thread-safety concerns with parallel encode calls:
```diff
-TextInput::Batch(texts) => {
-    let token_batches: Vec<Vec<u32>> = texts
-        .par_iter()
-        .map(|text| {
-            self.tokenizer
-                .encode(text)
-                .map(|encoded| encoded.token_ids().to_vec())
-        })
-        .collect::<Result<Vec<_>>>()?;
-    builder.batch_token_ids(Some(token_batches));
-    builder.token_ids(vec![]);
-}
+TextInput::Batch(texts) => {
+    // Synchronous path; preprocess_request is not async, so blocking is OK here.
+    let encodings = self
+        .tokenizer
+        .encode_batch(&texts.iter().map(|s| s.as_str()).collect::<Vec<_>>())?;
+    let token_batches = encodings
+        .into_iter()
+        .map(|e| e.token_ids().to_vec())
+        .collect::<Vec<_>>();
+    builder.batch_token_ids(Some(token_batches));
+    builder.token_ids(vec![]);
+}
```

Note: If encode_batch is not available or has a different signature, ignore this optional diff.
🧹 Nitpick comments (4)
lib/parsers/src/reasoning/mod.rs (1)
127-128: Minor: consider structured fields for better observability

Using structured fields helps with log filtering and avoids string formatting costs. Optional change:
```diff
-    tracing::warn!(
-        "Unknown reasoning parser type '{name}', falling back to Basic Reasoning Parser",
-    );
+    tracing::warn!(
+        parser_name = %name,
+        "Unknown reasoning parser type, falling back to Basic Reasoning Parser",
+    );
```

Similarly, the debug above could be:
```diff
-    tracing::debug!("Selected reasoning parser: {name}");
+    tracing::debug!(parser_name = %name, "Selected reasoning parser");
```

lib/llm/src/preprocessor.rs (3)
566-573: Annotations emission pathway looks good

Converting HashMap<String, String> into Annotated events before chaining works. Consider adding token_ids for batch text/tokens when requested, mirroring the embeddings path. Optional.
Also applies to: 623-627
335-355: Consistency: embeddings path uses encode_batch via spawn_blocking; consider aligning

Embeddings use spawn_blocking + encode_batch. For large chat/completion batches, using encode_batch (synchronously) will likely be faster and simpler than rayon-parallel per-item encodes. See the optional diff in the gather_tokens comment.
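If the two paths were aligned, batch tokenization could be pushed onto a blocking thread the same way the embeddings path does. A rough sketch only: the Tokenizer::encode_batch signature and the token_ids() accessor are assumptions, and the real code may differ.

```rust
use std::sync::Arc;

// Sketch: offload batch tokenization to a blocking thread, mirroring embeddings.
async fn tokenize_batch_sketch(
    tokenizer: Arc<Tokenizer>,
    texts: Vec<String>,
) -> anyhow::Result<Vec<Vec<u32>>> {
    tokio::task::spawn_blocking(move || {
        let refs: Vec<&str> = texts.iter().map(String::as_str).collect();
        // encode_batch is assumed to encode all inputs in one call.
        let encodings = tokenizer.encode_batch(&refs)?;
        Ok(encodings
            .into_iter()
            .map(|e| e.token_ids().to_vec())
            .collect())
    })
    .await?
}
```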
56-64: Naming nit: ANNOTATION_* constants are fine; unify with other modules if any variations exist

If other modules already define or consume these keys, ensure we’re not drifting on naming (e.g., "formatted_prompt" vs "formattedPrompt"). If drift exists, add a small adapter or constants in a shared place.
Would you like me to scan the repo for other ANNOTATION_* keys and flag inconsistencies?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
- lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
- lib/llm/src/preprocessor.rs (5 hunks)
- lib/parsers/src/reasoning/base_parser.rs (0 hunks)
- lib/parsers/src/reasoning/mod.rs (1 hunks)
💤 Files with no reviewable changes (1)
- lib/parsers/src/reasoning/base_parser.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
PR: ai-dynamo/dynamo#1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
lib/llm/src/preprocessor.rs
📚 Learning: 2025-08-25T22:04:45.179Z
Learnt from: nachiketb-nvidia
PR: ai-dynamo/dynamo#2700
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:19-28
Timestamp: 2025-08-25T22:04:45.179Z
Learning: The response_generator() method exists on multiple request types in the codebase: NvCreateChatCompletionRequest (for chat completions) and NvCreateCompletionRequest (for text completions). When making signature changes, it's important to distinguish between these different object types as they have separate implementations and call sites.
Applied to files:
lib/llm/src/preprocessor.rs
🧬 Code graph analysis (1)
lib/llm/src/preprocessor.rs (3)
- lib/llm/src/protocols/common/preprocessor.rs (2): builder (71-73), builder (108-110)
- lib/llm/src/protocols/openai/completions.rs (5): builder (223-225), annotations (102-106), nvext (86-88), nvext (134-136), nvext (195-197)
- lib/llm/src/protocols/openai.rs (2): nvext (43-43), nvext (53-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (8)
lib/parsers/src/reasoning/mod.rs (1)
120-120: LGTM: switched to captured identifier formatting in tracing macro

Using "Selected reasoning parser: {name}" is fine and keeps the message concise.
lib/llm/src/preprocessor.rs (7)
28-28: Import looks correct

Importing PreprocessedRequestBuilder from common::preprocessor aligns with the new builder flow.
160-165: Good refactor: centralized preprocess path via builder/apply_template/gather_tokens

This makes the flow explicit and easier to test. Returning both PreprocessedRequest and annotations is clear.
181-201: EOS/stop handling is sensible but verify desired semantics with ignore_eos=true
- You ensure model EOS IDs are included in hidden stop tokens.
- You only set builder.eos_token_ids when ignore_eos is false.
Double-check downstream behavior expects no eos_token_ids at all when ignore_eos=true (as opposed to passing them and letting the engine ignore). If the engine relies on absence to skip an early stop, this is correct; otherwise you may need to pass them and let the engine branch.
Do you want me to scan downstream consumers for how they interpret eos_token_ids and ignore_eos?
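For reference, the semantics in question reduce to a branch like this sketch; the builder method and surrounding plumbing are simplified and assumed, not the actual code.

```rust
// Sketch: model EOS ids always join the hidden stop-token set, but are only
// surfaced via builder.eos_token_ids(...) when the request does not set ignore_eos.
fn apply_eos_sketch(
    builder: &mut PreprocessedRequestBuilder,
    model_eos_ids: &[u32],
    hidden_stop_ids: &mut Vec<u32>,
    ignore_eos: bool,
) {
    for id in model_eos_ids {
        if !hidden_stop_ids.contains(id) {
            hidden_stop_ids.push(*id);
        }
    }
    if !ignore_eos {
        builder.eos_token_ids(model_eos_ids.to_vec());
    }
}
```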
213-245: Template gating matches PR goal (no chat template for completions)

apply_template only renders a template for single-text inputs and is invoked from the chat path. When nvext.use_raw_prompt is true but raw_prompt is missing, you warn and fall back to render, which is reasonable.
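A condensed view of that gating, as described above. The nvext accessors (use_raw_prompt as an Option<bool>, a raw_prompt() getter) and the render callback are assumptions about the shape of the code, not exact signatures.

```rust
// Sketch: use_raw_prompt short-circuits template rendering when a raw prompt
// is actually present; otherwise warn and fall back to rendering.
fn apply_template_sketch(
    request: &NvCreateChatCompletionRequest,
    render: impl Fn(&NvCreateChatCompletionRequest) -> anyhow::Result<String>,
) -> anyhow::Result<Option<String>> {
    let use_raw = request
        .nvext()
        .and_then(|ext| ext.use_raw_prompt) // assumed Option<bool> field
        .unwrap_or(false);
    if use_raw {
        if let Some(raw) = request.raw_prompt() {
            // assumed getter
            return Ok(Some(raw));
        }
        tracing::warn!("use_raw_prompt set but no raw prompt found; rendering template");
    }
    Ok(Some(render(request)?))
}
```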
612-615: Completion path now correctly bypasses templates; depends on gather_tokens fix

builder(...); gather_tokens(..., None) achieves “no chat template for completions.” After applying the gather_tokens patch above, single-text completions will tokenize the raw prompt safely.
456-468: Nice touch: attach LLM metrics as an annotation without clobbering existing events

The conditional set of event/comment only when empty avoids overriding error events.
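In sketch form, the guard amounts to the following; the event/comment slots on the annotated response and the annotation key are assumptions based on the description, not the real field names.

```rust
// Sketch: only attach the LLM-metrics annotation when the event/comment slots
// are empty, so a previously recorded error event is never overwritten.
fn attach_metrics_sketch(
    event: &mut Option<String>,
    comment: &mut Option<Vec<String>>,
    metrics_json: String,
) {
    if event.is_none() && comment.is_none() {
        *event = Some("llm_metrics".to_string()); // hypothetical annotation key
        *comment = Some(vec![metrics_json]);
    }
}
```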
612-618: Potential ISL undercount for batched inputs

response_generator.update_isl(common_request.token_ids.len() as u32) will be zero for batched requests (since token_ids is empty and batch_token_ids is set). If ISL matters for batching, you may need to sum the batch_token_ids lengths, or leave it at zero by design. Confirm the expected behavior.
I can add a small helper to compute input token count across both single and batch forms if needed.
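If that helper is wanted, something along these lines would cover both shapes. A sketch only: the token_ids / batch_token_ids field shapes are assumed from the comments above.

```rust
// Sketch: compute input sequence length for both single and batched requests.
// Assumes `token_ids: Vec<u32>` and `batch_token_ids: Option<Vec<Vec<u32>>>`.
fn input_token_count(token_ids: &[u32], batch_token_ids: Option<&Vec<Vec<u32>>>) -> u32 {
    match batch_token_ids {
        Some(batches) => batches.iter().map(|b| b.len() as u32).sum(),
        None => token_ids.len() as u32,
    }
}
```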
Force-pushed from 7a970be to f20d1e3
Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
Because we are no longer applying the prompt template. Signed-off-by: Graham King <[email protected]>
Force-pushed from f20d1e3 to 2eaf6fe
/ok to test 2eaf6fe
Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]> Signed-off-by: nnshah1 <[email protected]>
Summary by CodeRabbit
New Features
Refactor
Chores