How does the model decide which tokens to drop for a question like “Describe this video in detail”?
Also, what sort of impact does an instruction like “Select the best option among (A,B,C,D)”, or any other instruction sent along with the question, have on this part?
(i) The query-dependent token compression is designed for VQA tasks, so it is not effective for open-ended questions such as "Describe this video in detail". Even so, irrelevant frames are not removed; they are only reduced to low-resolution tokens.
(ii) The impact of the candidate options is minimal, but you can also add a special token as a placeholder to identify the question-only tokens and remove the rest of the instruction.
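
To illustrate the placeholder idea in (ii), here is a minimal sketch, not code from this repository. The marker strings `<question>` and `</question>` and the helper names `wrap_question`, `extract_question`, and `strip_placeholders` are all hypothetical. It assumes the question-only text is what gets passed to the query-dependent compression, while the full prompt (markers removed) still goes to the language model.

```python
import re

# Hypothetical placeholder tokens; these are NOT defined by the repository.
Q_START = "<question>"
Q_END = "</question>"


def wrap_question(question: str, instruction: str = "") -> str:
    """Build a prompt whose question part is delimited by placeholders."""
    prompt = f"{Q_START}{question}{Q_END}"
    if instruction:
        prompt = f"{prompt} {instruction}"
    return prompt


def extract_question(prompt: str) -> str:
    """Return only the text between the placeholders.

    Falls back to the whole prompt when no placeholders are present.
    """
    pattern = re.escape(Q_START) + r"(.*?)" + re.escape(Q_END)
    match = re.search(pattern, prompt, flags=re.DOTALL)
    return match.group(1).strip() if match else prompt.strip()


def strip_placeholders(prompt: str) -> str:
    """Remove the placeholders so the full prompt can be sent to the LLM."""
    return prompt.replace(Q_START, "").replace(Q_END, "")


prompt = wrap_question(
    "What color is the car?",
    "Select the best option among (A,B,C,D).",
)
query_for_compression = extract_question(prompt)  # question only
prompt_for_llm = strip_placeholders(prompt)       # question + instruction
```

With this split, the option list and answer-format instruction never reach the frame-relevance step, so they cannot influence which frames are reduced to low-resolution tokens.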