Fix prompt tokens causing empty transcription output #428
sborisov88 wants to merge 1 commit into argmaxinc:main
Conversation
When `promptTokens` are provided in `DecodingOptions`, the prefill cache is disabled (known limitation). This causes the decoding loop to start at `tokenIndex = 0`, where `startOfPreviousToken` is fed to the model. The model then predicts EOT or produces a low-confidence prediction, triggering the early termination checks (`sampleResult.completed` or `firstTokenLogProbThreshold`) and breaking the loop immediately, resulting in empty transcription text.

Two fixes:
1. `isFirstToken` now points to the first actually decoded token after the prompt (`max(prefilledIndex, initialPromptIndex)`) instead of `tokenIndex == 0` during prompt prefill.
2. `sampleResult.completed` (EOT) is ignored during the prefill phase, since the model is being force-fed prompt tokens and its predictions are not meaningful for early stopping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
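The failure mode described above can be sketched as a standalone index walk. This is a hedged sketch, not WhisperKit source: the names mirror `TextDecoder.decodeText`, but the values (`initialPromptIndex = 4`) and the simplified checks are illustrative assumptions.

```swift
// Hedged sketch (not WhisperKit source): simplified index bookkeeping showing
// why prompt tokens without a prefill cache trip the early-exit checks.
let prefilledIndex = 0       // prefill KV cache disabled when promptTokens are set
let initialPromptIndex = 4   // assumed: startOfTranscript + 3 prompt tokens

for tokenIndex in prefilledIndex..<6 {
    let isPrefill = tokenIndex < initialPromptIndex - 1  // predictions discarded here
    let isFirstTokenOld = tokenIndex == prefilledIndex   // pre-fix: fires at index 0
    if isFirstTokenOld && isPrefill {
        // The model is being force-fed a prompt token, so its prediction is
        // typically low-confidence (or EOT) -> loop breaks with no decoded text.
        print("early exit possible at tokenIndex \(tokenIndex) during prefill")
    }
}
```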
Pull request overview
Fixes an edge case in TextDecoder.decodeText where providing promptTokens can cause decoding to terminate during prompt prefill, producing an empty transcription.
Changes:
- Adjusts how “first token” detection is computed in the main decoding loop when prompts are present.
- Skips early termination on EOT while processing prompt tokens (prefill) to avoid stopping based on meaningless predictions during forced-token prefill.
```diff
 let isLastPrefillToken = tokenIndex == initialPromptIndex - 1
-let isFirstToken = tokenIndex == prefilledIndex
+let isInPrefillPhase = isPrefill || isLastPrefillToken // tokenIndex < initialPromptIndex
+let isFirstToken = tokenIndex == max(prefilledIndex, initialPromptIndex) // First actually decoded token (after prompt)
```
isFirstToken looks off by one relative to how nextTokenLogProb is computed. In this loop, the first decoded token after the prompt is the token predicted when tokenIndex == initialPromptIndex - 1 (see the debug log using tokenIndex + 1). With the current tokenIndex == max(prefilledIndex, initialPromptIndex), the firstTokenLogProbThreshold check will fire one iteration late (and won’t fire at all when initialPromptIndex == 1, breaking the existing firstTokenLogProbThreshold fallback behavior). Consider basing isFirstToken on max(prefilledIndex, initialPromptIndex - 1) (or equivalently tokenIndex + 1 == max(prefilledIndex + 1, initialPromptIndex)) so it aligns with the first sampled token after the prompt.
Suggested change:
```diff
-let isFirstToken = tokenIndex == max(prefilledIndex, initialPromptIndex) // First actually decoded token (after prompt)
+let isFirstToken = tokenIndex == max(prefilledIndex, initialPromptIndex - 1) // First actually decoded token (after prompt)
```
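The off-by-one the review describes can be checked with a small comparison of the two formulas. This is an illustrative sketch under the review's framing (the token sampled at iteration `tokenIndex` lands at position `tokenIndex + 1`); the values of `initialPromptIndex` are assumptions.

```swift
// Illustrative comparison: at which tokenIndex does each "first token" check fire?
// Per the review, the first post-prompt token is sampled at initialPromptIndex - 1.
let prefilledIndex = 0
for initialPromptIndex in [1, 4] {  // 1 = no prompt; 4 = 3 prompt tokens (assumed)
    let suggested = max(prefilledIndex, initialPromptIndex - 1)  // reviewer's proposal
    let asWritten = max(prefilledIndex, initialPromptIndex)      // PR as written
    print("initialPromptIndex=\(initialPromptIndex): sample at \(suggested), PR checks at \(asWritten)")
    // The PR's check runs one iteration late, and per the review it does not
    // fire at all in the no-prompt case (initialPromptIndex == 1).
}
```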
```diff
 let isSegmentCompleted =
-    sampleResult.completed ||
+    (!isInPrefillPhase && sampleResult.completed) ||
     currentTokens.count >= Constants.maxTokenContext - 1 ||
     isFirstTokenLogProbTooLow
```
Skipping sampleResult.completed when isInPrefillPhase also skips EOT termination on the last prompt token iteration (tokenIndex == initialPromptIndex - 1), which is exactly when the first real decoded token is sampled and appended. That can change normal decoding behavior (including the no-prompt case where initialPromptIndex == 1) by continuing past an EOT and potentially producing extra/garbage tokens. It should be enough to skip the EOT check only while the model’s predictions are being ignored (i.e., when isPrefill is true / tokenIndex < initialPromptIndex - 1), and allow EOT termination again for the boundary iteration that produces the first decoded token.
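The narrower condition the review suggests can be sketched with stubbed loop state. This is a hedged sketch of the proposal, not the actual `TextDecoder` code: the stand-in values for `isPrefill`, the sampled EOT flag, and the other checks are assumptions chosen to exercise the boundary iteration.

```swift
// Hedged sketch of the reviewer's suggestion: suppress the EOT check only while
// predictions are discarded (isPrefill == true), so EOT can still terminate
// decoding on the boundary iteration that samples the first real token.
// Stubbed values stand in for the real loop state.
let isPrefill = false                 // boundary: tokenIndex == initialPromptIndex - 1
let sampleCompleted = true            // model predicted EOT
let maxTokenContextReached = false
let isFirstTokenLogProbTooLow = false

let isSegmentCompleted =
    (!isPrefill && sampleCompleted) ||  // EOT honored once prefill proper ends
    maxTokenContextReached ||
    isFirstTokenLogProbTooLow
print(isSegmentCompleted)  // true: normal EOT termination is preserved
```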
@sborisov88 Could you please add a non-trivial test case that gets fixed by this PR? (e.g. a short audio with a keyterm that the model gets wrong even with prompting but gets right after this fix)
Summary
When `promptTokens` are provided in `DecodingOptions` (e.g., via the `prompt` parameter in the OpenAI-compatible server API), transcription returns empty text.

Root cause: When `promptTokens` are set, the prefill KV cache is disabled (known TODO on line 354). This causes the decoding loop to start at `tokenIndex = 0` with `startOfPreviousToken` as input. The model then either:
- predicts EOT (`sampleResult.completed = true`) → loop breaks immediately
- produces a low-confidence first token, so `firstTokenLogProbThreshold` (-1.5 default) triggers → loop breaks

Both paths result in empty transcription output.
Reproduction
Fix
Two changes in `TextDecoder.swift`:

1. `isFirstToken` now correctly points to the first actually decoded token after the prompt (`max(prefilledIndex, initialPromptIndex)`) instead of `tokenIndex == prefilledIndex`, which fires at the first prompt token during prefill.
2. `sampleResult.completed` (EOT check) is skipped during the prefill phase. Since the model is being force-fed prompt tokens, its predictions during prefill are not meaningful for early stopping decisions.

Test plan
- `whisper-large-v3-turbo` model via local server API