Feature: Allow list of tokens/integers as initial_prompt input #426
-
When you write "Whisper", do you everywhere mean vanilla Whisper, or is it short for faster-whisper? Can you provide examples of your prompt with timestamp tokens? I think you don't need to worry about long segments, just use ...
-
Thanks for the reply! I meant both of them - I put your reprompt pull request from the openAI repo into mainline faster_whisper too, in order to test reprompting with timestamp tokens of varying "segment length". All of them should try to match the "segment length" between the timestamps that they find in the prompt.
An extreme example would be:
<|startofprev|><|0.00|> This is an<|1.00|><|1.00|> extreme example<|2.00|><|2.00|> and I will<|3.00|><|3.00|> most likely<|4.00|><|4.00|> start producing<|5.00|><|5.00|> very short segments<|6.00|><|startoftranscript|><|en|><|transcribe|>
which for faster_whisper would require as initial_prompt:
[50364,639,307,364,50414,50414,8084,1365,50464,50464,293,286,486,50514,50514,881,3700,50564,50564,722,10501,50614,50614,588,2099,19904,50664]
Plug this into any Whisper and it will most likely build very short segments. Now this example is quite useless and buggy, but it is the most likely to display the effect - results can be inconsistent across files. Sometimes the effect is huge, sometimes nothing happens. But it's better than nothing.
You can try to steer it towards the kind of line-breaking you prefer by "engineering" prompts or by reusing some Whisper output you found especially pleasing in regards to line breaks. The model can't obey exact character length limits due to its tokenized nature, but it definitely has picked up on some subtleties present in its training data taken from broadcasting platforms, where there are often strict and complicated rules for the semantic word-chunk boundaries at which a line break can occur.
I'm not sure what postprocessing happens with --sentence, but from reading the output I suppose it counts chars and breaks when a max line width is reached, as anything else would be a massive undertaking, especially for so many languages. The only ways I can think of to mimic that learned behaviour are to either build MANY rules based on part-of-speech and dependency trees via SpaCy/NLTK etc., or to train another model on the task of "line breaks that flow well for the reader".
I'm still just getting my feet wet with testing these prompts myself, but I'd love to use all the other features of XXL instead of mainline faster/openai/cpp. :D Cheers
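For reference, here is a rough sketch of how such a token list could be generated from a timestamped prompt string. It assumes openai-whisper's tokenizer API (get_tokenizer / timestamp_begin); only the text and timestamp tokens between <|startofprev|> and <|startoftranscript|> go into the list, since the decoding loop adds those wrappers itself:

```python
# Sketch only: turn a prompt string containing <|t.tt|> timestamp markers into
# the token ids faster_whisper would expect as initial_prompt.
# Assumes openai-whisper's multilingual tokenizer; a timestamp token id is
# timestamp_begin + seconds / 0.02.
import re

from whisper.tokenizer import get_tokenizer

TS = re.compile(r"<\|(\d+\.\d+)\|>")


def prompt_to_tokens(prompt: str) -> list[int]:
    tok = get_tokenizer(multilingual=True)
    ids: list[int] = []
    # re.split with a capture group alternates text pieces and captured timestamps.
    for i, piece in enumerate(TS.split(prompt)):
        if i % 2:  # odd indices are the captured timestamps
            ids.append(tok.timestamp_begin + round(float(piece) / 0.02))
        elif piece:
            ids.extend(tok.encode(piece))
    return ids


tokens = prompt_to_tokens(
    "<|0.00|> This is an<|1.00|><|1.00|> extreme example<|2.00|>"
)
```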
-
For reference, this is what I hacked into my main.py argparse that calls faster_whisper's transcribe.py, but it's probably more complicated with all the features you provide.
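A rough sketch of what such a hack could look like (names are illustrative; it assumes faster_whisper's WhisperModel.transcribe accepts an iterable of ints for initial_prompt, which is what the hack relies on):

```python
# Sketch only: accept --initial_prompt either as plain text or as a
# comma-separated list of token ids, and pass the result straight through
# to WhisperModel.transcribe.
import argparse

from faster_whisper import WhisperModel


def prompt_or_tokens(value: str):
    """Return a list of ints if the value looks like '50364,1,1,...', else the raw string."""
    parts = [p.strip() for p in value.split(",")]
    if parts and all(p.isdigit() for p in parts):
        return [int(p) for p in parts]
    return value


parser = argparse.ArgumentParser()
parser.add_argument("audio")
parser.add_argument("--initial_prompt", type=prompt_or_tokens, default=None)
args = parser.parse_args()

model = WhisperModel("large-v2")
segments, info = model.transcribe(args.audio, initial_prompt=args.initial_prompt)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```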
-
Hi Purfview, thank you for your great work. At least someone is focusing on using Whisper in the real world, while the baseline repositories, after all those years, don't even offer flags to filter out surefire hallucinated segments like the amara.org credits.
It would be cool if we could provide a list of integers as initial_prompt from the command line, such as:
--initial_prompt 50364,1,1,1,1,1,51264
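(For reference, timestamp token ids are just an offset from <|0.00|>; a quick check, assuming the multilingual vocabulary where <|0.00|> is 50364 and one step is 0.02 s:)

```python
timestamp_begin = 50364                      # <|0.00|> in the multilingual vocab
seconds = (51264 - timestamp_begin) * 0.02   # = 18.0, i.e. 51264 is <|18.00|>
```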
I assume the reprompt works in the same way as it was merged into the openAI repo, where the tokens are prepended to the previous_prompt. When generating with without_timestamps = False and no timestamp tokens inside initial_prompt, this leads to Whisper interpreting the entire initial/reprompt as a segment within a single timestamp pair, usually nudging it to output segments of roughly that length.
So you mostly end up with very long segments, even though Whisper is quite capable of breaking them up into short semantic chunks, much like on television (not reliably, but okay-ish). If I provide the openAI version with timestamp tokens inside the reprompt, I can usually nudge it in the direction I want.
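For what it's worth, this is roughly how that prepending ends up in the decoder context (see DecodingTask._get_initial_tokens in openai-whisper; faster-whisper appears to assemble the same shape when it builds its prompt). A minimal sketch using openai-whisper's tokenizer:

```python
from whisper.tokenizer import get_tokenizer

tok = get_tokenizer(multilingual=True, language="en", task="transcribe")

prompt_tokens = [50364, 1, 1, 1, 1, 1, 51264]   # e.g. the token list from above
n_ctx = 448                                      # decoder context size

# <|startofprev|> + (truncated) prompt + <|startoftranscript|><|en|><|transcribe|>
context = (
    [tok.sot_prev]
    + prompt_tokens[-(n_ctx // 2 - 1):]
    + list(tok.sot_sequence)
)
print(context)
```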
If you run Whisper without any initial_prompt at all, you are at the mercy of randomness: if Whisper decides to create long segments for the first inference, it will carry on that way; if it doesn't, peachy - until the next prompt reset.
I can't figure out how to get faster-whisper-XXL to accept tokens as input, since it only accepts strings, and afaik tokenizer.encode() will not convert strings like <0.00> into timestamp tokens, only into madness. Maybe just allowing lists of ints as input would cause problems with --ignore_dupe_prompt? Not sure.
Cheers!