Fix prompt tokens causing empty transcription output#428

Open
sborisov88 wants to merge 1 commit into argmaxinc:main from sborisov88:fix/prompt-tokens-empty-transcription

Conversation

@sborisov88

Summary

When `promptTokens` are provided in `DecodingOptions` (e.g., via the `prompt` parameter in the OpenAI-compatible server API), transcription returns empty text.

Root cause: when `promptTokens` are set, the prefill KV cache is disabled (a known TODO on line 354). The decoding loop therefore starts at `tokenIndex = 0` with `startOfPreviousToken` as input. The model then either:

  1. predicts EOT → `sampleResult.completed = true` → the loop breaks immediately, or
  2. produces a low-confidence prediction → `firstTokenLogProbThreshold` (default -1.5) triggers → the loop breaks.

Both paths result in empty transcription output.
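A minimal sketch of the two failure paths (illustrative only; the names mirror TextDecoder.swift but the loop is heavily simplified, and `predictedEOT`/`firstLogProb` are hypothetical stand-ins for real model outputs):

```swift
// Simplified sketch, not the actual WhisperKit decoding loop.
func decodeSketch(predictedEOT: Bool, firstLogProb: Float) -> [Int] {
    let firstTokenLogProbThreshold: Float = -1.5
    var tokens: [Int] = []
    for tokenIndex in 0..<224 {
        if predictedEOT {
            break // path 1: sampleResult.completed on the very first step
        }
        if tokenIndex == 0 && firstLogProb < firstTokenLogProbThreshold {
            break // path 2: low-confidence first token trips the threshold
        }
        tokens.append(tokenIndex) // placeholder for the sampled token
    }
    return tokens
}
// Either path breaks before appending anything => [] => empty transcription text.
```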

Reproduction

```sh
# Start WhisperKit server with any model
whisperkit-cli serve --model-path <path> --port 8000

# Without prompt — works fine
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@test.wav" -F "model=<model>" -F "language=ru"
# → {"text": "Привет, это тестовая запись..."}

# With prompt — returns empty text
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@test.wav" -F "model=<model>" -F "language=ru" \
  -F "prompt=Kubernetes, Docker, RabbitMQ"
# → {"text": ""}
```

Fix

Two changes in TextDecoder.swift:

  1. `isFirstToken` now points to the first actually decoded token after the prompt (`max(prefilledIndex, initialPromptIndex)`) instead of `tokenIndex == prefilledIndex`, which fires on the first prompt token during prefill.

  2. `sampleResult.completed` (the EOT check) is skipped during the prefill phase. Since the model is being force-fed prompt tokens, its predictions during prefill are not meaningful for early-stopping decisions.
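A rough illustration of the index change, with hypothetical values (`prefilledIndex`/`initialPromptIndex` mirror the names in TextDecoder.swift; the functions here are illustrative, not the PR's code):

```swift
// Hypothetical state: prefill cache disabled, 3-token prompt.
let prefilledIndex = 0
let initialPromptIndex = 3 // prompt occupies tokenIndex 0..<3

// Old check: true at tokenIndex 0, i.e. while still force-feeding the prompt.
func oldIsFirstToken(_ tokenIndex: Int) -> Bool {
    tokenIndex == prefilledIndex
}

// New check: true at the first position after the prompt.
func newIsFirstToken(_ tokenIndex: Int) -> Bool {
    tokenIndex == max(prefilledIndex, initialPromptIndex)
}

print(oldIsFirstToken(0)) // true — fires on a prompt token
print(newIsFirstToken(3)) // true — fires after the prompt
```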

Test plan

  • Transcription without prompt still works correctly
  • Transcription with prompt now returns correct text (previously returned empty)
  • Tested with whisper-large-v3-turbo model via local server API

When promptTokens are provided in DecodingOptions, the prefill cache is
disabled (known limitation). This causes the decoding loop to start at
tokenIndex=0, where startOfPreviousToken is fed to the model. The model
then predicts EOT or produces a low-confidence prediction, triggering
early termination checks (sampleResult.completed or
firstTokenLogProbThreshold) and breaking the loop immediately — resulting
in empty transcription text.

Two fixes:
1. isFirstToken now points to the first actually decoded token after
   the prompt (max(prefilledIndex, initialPromptIndex)) instead of
   tokenIndex 0 during prompt prefill.
2. sampleResult.completed (EOT) is ignored during the prefill phase,
   since the model is being force-fed prompt tokens and its predictions
   are not meaningful for early stopping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Fixes an edge case in TextDecoder.decodeText where providing promptTokens can cause decoding to terminate during prompt prefill, producing an empty transcription.

Changes:

  • Adjusts how “first token” detection is computed in the main decoding loop when prompts are present.
  • Skips early termination on EOT while processing prompt tokens (prefill) to avoid stopping based on meaningless predictions during forced-token prefill.


```diff
  let isLastPrefillToken = tokenIndex == initialPromptIndex - 1
- let isFirstToken = tokenIndex == prefilledIndex
+ let isInPrefillPhase = isPrefill || isLastPrefillToken // tokenIndex < initialPromptIndex
+ let isFirstToken = tokenIndex == max(prefilledIndex, initialPromptIndex) // First actually decoded token (after prompt)
```

Copilot AI Feb 22, 2026

`isFirstToken` looks off by one relative to how `nextTokenLogProb` is computed. In this loop, the first decoded token after the prompt is the token predicted when `tokenIndex == initialPromptIndex - 1` (see the debug log using `tokenIndex + 1`). With the current `tokenIndex == max(prefilledIndex, initialPromptIndex)`, the `firstTokenLogProbThreshold` check will fire one iteration late (and won't fire at all when `initialPromptIndex == 1`, breaking the existing `firstTokenLogProbThreshold` fallback behavior). Consider basing `isFirstToken` on `max(prefilledIndex, initialPromptIndex - 1)` (or equivalently `tokenIndex + 1 == max(prefilledIndex + 1, initialPromptIndex)`) so it aligns with the first sampled token after the prompt.

Suggested change

```diff
- let isFirstToken = tokenIndex == max(prefilledIndex, initialPromptIndex) // First actually decoded token (after prompt)
+ let isFirstToken = tokenIndex == max(prefilledIndex, initialPromptIndex - 1) // First actually decoded token (after prompt)
```
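The off-by-one this comment describes can be checked with concrete numbers (hypothetical values; the real indices live in TextDecoder.swift):

```swift
// Hypothetical 3-token prompt, prefill cache disabled.
let prefilledIndex = 0
let initialPromptIndex = 3

// The token sampled during iteration `tokenIndex` lands at position
// tokenIndex + 1, so the first post-prompt token is sampled when
// tokenIndex == initialPromptIndex - 1.
let firstSampledAt = initialPromptIndex - 1                          // 2
let prCondition = max(prefilledIndex, initialPromptIndex)            // 3 — one iteration late
let suggestedCondition = max(prefilledIndex, initialPromptIndex - 1) // 2 — aligned

print(firstSampledAt == suggestedCondition) // true
print(firstSampledAt == prCondition)        // false
```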

Comment on lines 861 to 864
```diff
  let isSegmentCompleted =
-     sampleResult.completed ||
+     (!isInPrefillPhase && sampleResult.completed) ||
      currentTokens.count >= Constants.maxTokenContext - 1 ||
      isFirstTokenLogProbTooLow
```

Copilot AI Feb 22, 2026


Skipping `sampleResult.completed` when `isInPrefillPhase` also skips EOT termination on the last prompt token iteration (`tokenIndex == initialPromptIndex - 1`), which is exactly when the first real decoded token is sampled and appended. That can change normal decoding behavior (including the no-prompt case where `initialPromptIndex == 1`) by continuing past an EOT and potentially producing extra/garbage tokens. It should be enough to skip the EOT check only while the model's predictions are being ignored (i.e., when `isPrefill` is true / `tokenIndex < initialPromptIndex - 1`), and allow EOT termination again for the boundary iteration that produces the first decoded token.
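A sketch of the narrower skip this comment proposes (the inputs below are placeholders standing in for the real loop state; names mirror TextDecoder.swift):

```swift
// Placeholder inputs, not actual decoder state.
let isPrefill = false               // true only while predictions are discarded
let sampleResultCompleted = true    // stands in for sampleResult.completed
let atTokenLimit = false            // currentTokens.count >= maxTokenContext - 1
let isFirstTokenLogProbTooLow = false

// Skip EOT only strictly inside the prompt (isPrefill == true); on the
// boundary iteration that samples the first real token, EOT terminates again.
let isSegmentCompleted =
    (!isPrefill && sampleResultCompleted) ||
    atTokenLimit ||
    isFirstTokenLogProbTooLow

print(isSegmentCompleted) // true — EOT on the first decoded token still ends the segment
```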

@atiorh
Contributor

atiorh commented Feb 23, 2026

@sborisov88 Could you please add a non-trivial test case that gets fixed by this PR? (e.g. a short audio with a keyterm that the model gets wrong even with prompting but gets right after this fix)
