Feature: Allow list of tokens/integers as initial_prompt input #426
-
When you write "Whisper", do you everywhere mean vanilla Whisper, or is it short for faster-whisper? Can you provide examples of your prompt with timestamp tokens? I think you don't need to worry about long segments, just use ...
-
Thanks for the reply! I meant both of them - I put your reprompt pull request from the openAI repo into mainline faster_whisper too, in order to test reprompting with timestamp tokens of varying "segment length". All of them should try to match the "segment length" between the timestamps that they find in the prompt.
An extreme example would be:
<|startofprev|><|0.00|> This is an<|1.00|><|1.00|> extreme example<|2.00|><|2.00|> and I will<|3.00|><|3.00|> most likely<|4.00|><|4.00|> start producing<|5.00|><|5.00|> very short segments<|6.00|><|startoftranscript|><|en|><|transcribe|>
which for faster_whisper would require as initial_prompt:
[50364,639,307,364,50414,50414,8084,1365,50464,50464,293,286,486,50514,50514,881,3700,50564,50564,722,10501,50614,50614,588,2099,19904,50664]
Plug this into any Whisper and it will most likely build very short segments. Now this example is quite useless and buggy, but it is the most likely to display the effect - results can be inconsistent across files. Sometimes the effect is huge, sometimes nothing happens. But it's better than nothing.
You can try to steer it towards the kind of line-breaking you prefer by "engineering" prompts or by reusing some Whisper output you found especially pleasing in regards to line breaks. The model can't obey exact character length limits due to its tokenized nature, but it definitely has picked up on some subtleties present in its training data taken from broadcasting platforms, where there are often strict and complicated rules for the semantic word-chunk boundaries at which a line break can occur.
I'm not sure what postprocessing happens with --sentence, but from reading the output I suppose it counts chars and breaks when a max line width is reached, as anything else would be a massive undertaking, especially for so many languages. The only ways I can think of to mimic that learned behaviour are to either build MANY rules based on part-of-speech and dependency trees via SpaCy/NLTK etc., or to train another model on the task of "line breaks that flow well for the reader".
I'm still just getting my feet wet with testing these prompts myself, but I'd love to use all the other features of XXL instead of mainline faster/openai/cpp. :D Cheers
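For reference, here is a rough sketch of how such a token list could be generated from a timestamped prompt string. It assumes openai-whisper's tokenizer API (get_tokenizer / timestamp_begin); only the text and timestamp tokens between <|startofprev|> and <|startoftranscript|> go into the list, since the decoding loop adds those wrappers itself:

```python
# Sketch only: turn a prompt string containing <|t.tt|> timestamp markers into
# the token ids faster_whisper would expect as initial_prompt.
# Assumes openai-whisper's multilingual tokenizer; a timestamp token id is
# timestamp_begin + seconds / 0.02.
import re

from whisper.tokenizer import get_tokenizer

TS = re.compile(r"<\|(\d+\.\d+)\|>")


def prompt_to_tokens(prompt: str) -> list[int]:
    tok = get_tokenizer(multilingual=True)
    ids: list[int] = []
    # re.split with a capture group alternates text pieces and captured timestamps.
    for i, piece in enumerate(TS.split(prompt)):
        if i % 2:  # odd indices are the captured timestamps
            ids.append(tok.timestamp_begin + round(float(piece) / 0.02))
        elif piece:
            ids.extend(tok.encode(piece))
    return ids


tokens = prompt_to_tokens(
    "<|0.00|> This is an<|1.00|><|1.00|> extreme example<|2.00|>"
)
```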
-
For reference, this is what I hacked into my main.py argparse that calls faster_whisper's transcribe.py, but it's probably more complicated with all the features you provide.
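A rough sketch of what such a hack could look like (names are illustrative; it assumes faster_whisper's WhisperModel.transcribe accepts an iterable of ints for initial_prompt, which is what the hack relies on):

```python
# Sketch only: accept --initial_prompt either as plain text or as a
# comma-separated list of token ids, and pass the result straight through
# to WhisperModel.transcribe.
import argparse

from faster_whisper import WhisperModel


def prompt_or_tokens(value: str):
    """Return a list of ints if the value looks like '50364,1,1,...', else the raw string."""
    parts = [p.strip() for p in value.split(",")]
    if parts and all(p.isdigit() for p in parts):
        return [int(p) for p in parts]
    return value


parser = argparse.ArgumentParser()
parser.add_argument("audio")
parser.add_argument("--initial_prompt", type=prompt_or_tokens, default=None)
args = parser.parse_args()

model = WhisperModel("large-v2")
segments, info = model.transcribe(args.audio, initial_prompt=args.initial_prompt)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```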
-
Hi Purfview, thank you for your great work. At least someone is focusing on using Whisper in the real world, while the baseline repositories, after all those years, don't even offer flags to filter out surefire hallucinated segments like the amara.org credits.
It would be cool if we could provide a list of integers as initial_prompt from the command line, such as:
--initial_prompt 50364,1,1,1,1,1,51264
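(For reference, timestamp token ids are just an offset from <|0.00|>; a quick check, assuming the multilingual vocabulary where <|0.00|> is 50364 and one step is 0.02 s:)

```python
timestamp_begin = 50364                      # <|0.00|> in the multilingual vocab
seconds = (51264 - timestamp_begin) * 0.02   # = 18.0, i.e. 51264 is <|18.00|>
```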
I assume the reprompt works in the same way as it was merged into the openAI repo, where the tokens are prepended to the previous_prompt. When generating with without_timestamps = False and no timestamp tokens inside initial_prompt, this leads to Whisper interpreting the entire initial/reprompt as a segment within a single timestamp pair, usually nudging it to output segments of roughly that length.
So you mostly end up with very long segments, even though Whisper is quite capable of breaking them up into short semantic chunks, much like on television (not reliably, but okay-ish). If I provide the openAI version with timestamp tokens inside the reprompt, I can usually nudge it in the direction I want.
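For what it's worth, this is roughly how that prepending ends up in the decoder context (see DecodingTask._get_initial_tokens in openai-whisper; faster-whisper appears to assemble the same shape when it builds its prompt). A minimal sketch using openai-whisper's tokenizer:

```python
from whisper.tokenizer import get_tokenizer

tok = get_tokenizer(multilingual=True, language="en", task="transcribe")

prompt_tokens = [50364, 1, 1, 1, 1, 1, 51264]   # e.g. the token list from above
n_ctx = 448                                      # decoder context size

# <|startofprev|> + (truncated) prompt + <|startoftranscript|><|en|><|transcribe|>
context = (
    [tok.sot_prev]
    + prompt_tokens[-(n_ctx // 2 - 1):]
    + list(tok.sot_sequence)
)
print(context)
```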
If you run Whisper without any initial_prompt at all, you are at the mercy of randomness: if Whisper decides to create long segments for the first inference, it will carry on that way; if it doesn't, peachy - until the next prompt reset.
I can't figure out how to get faster-whisper-XXL to accept tokens as input, since it only accepts strings, and afaik tokenizer.encode() will not convert strings like <0.00> into timestamp tokens, only into madness. Maybe just allowing lists of ints as input would cause problems with --ignore_dupe_prompt? Not sure.
Cheers!