Selective Feature reduction via cross modal query #25

insafim · 2024-11-30T14:26:04Z

How does the model decide which tokens to discard for a question like “Describe this video in detail”?

Also, what impact does an instruction like “Select the best option among (A,B,C,D)”, or any other instruction sent along with the question, have on this part?

xiaoqian-shen · 2024-12-05T11:02:33Z

Hi, @insafim

Thanks for raising this concern.

(i) The query-dependent token compression is designed for VQA tasks, so it is not effective for questions such as "Describe this video in detail". However, irrelevant frames are not removed; they are reduced to low-resolution tokens.
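The reply does not say how frame relevance is scored. As a rough illustration of the idea in (i), here is a minimal pure-Python sketch. It assumes each frame is a list of token feature vectors and the question is a single embedding vector, and it approximates relevance (an assumption, not necessarily what the model does) by cosine similarity between the question embedding and each frame's mean token. The `keep_high` most relevant frames keep all their tokens and the rest are average-pooled by `pool_factor`, so no frame is dropped. All function names and parameters are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def avg_pool(tokens: List[Vector], factor: int) -> List[Vector]:
    """Average consecutive groups of `factor` tokens (a 1-D stand-in for spatial pooling)."""
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def select_frame_resolutions(
    frames: List[List[Vector]], query: Vector, keep_high: int, pool_factor: int
) -> List[List[Vector]]:
    """Hypothetical sketch: keep the most query-relevant frames at full resolution,
    pool the others to fewer tokens. Every frame stays in the output."""
    scores = []
    for frame in frames:
        # Assumed relevance score: similarity of the frame's mean token to the query.
        mean = [sum(t[d] for t in frame) / len(frame) for d in range(len(frame[0]))]
        scores.append(cosine(mean, query))
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    high = set(ranked[:keep_high])
    # Relevant frames are returned unchanged; the rest become low-resolution.
    return [frame if i in high else avg_pool(frame, pool_factor) for i, frame in enumerate(frames)]
```

With a generic prompt such as "Describe this video in detail", the query embedding may not favor any particular frames, so the ranking carries little information; that is one way to read why this kind of compression is aimed at VQA-style questions.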

(ii) The impact of the candidate options is minimal, but you can also add a special token as a placeholder to identify the question-only tokens and remove the other instructions.
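A minimal sketch of the placeholder idea in (ii), assuming hypothetical marker tokens `<question>` and `</question>` wrap the question in the tokenized prompt. Under this assumption, only the tokens between the markers would be used to build the cross-modal query for token compression, while the markers themselves are stripped from the prompt that is passed on. The marker names, the function, and this split of responsibilities are all illustrative assumptions, not the model's actual interface.

```python
from typing import List, Tuple

Q_START = "<question>"   # hypothetical placeholder token marking the start of the question
Q_END = "</question>"    # hypothetical placeholder token marking the end of the question


def split_question(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Return (question_tokens, prompt_tokens_without_markers).

    question_tokens: only the tokens between the markers, e.g. for building
    the query used to score frames (assumed use).
    prompt_tokens: the full prompt with the marker tokens removed.
    """
    clean = [t for t in tokens if t not in (Q_START, Q_END)]
    try:
        start = tokens.index(Q_START)
        end = tokens.index(Q_END, start + 1)
    except ValueError:
        # No complete marker pair: fall back to using the whole prompt as the query.
        return clean, clean
    return tokens[start + 1:end], clean
```

For a multiple-choice prompt, the answer-format instruction then stays in the prompt but does not influence which frames are kept at full resolution.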
