I noticed that the code uses the X-CLIP structure to obtain video features, which is inconsistent with the average pooling described in the paper. Could you explain the reason for this difference?
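To make the question concrete, here is a minimal sketch of what I understand by average pooling: per-frame features are averaged over the frame axis to give one video-level feature. The function name `mean_pool_frames` and the toy input are hypothetical, for illustration only, and are not taken from the repository or the paper.

```python
from typing import List


def mean_pool_frames(frame_features: List[List[float]]) -> List[float]:
    """Average T per-frame feature vectors (each of length D) into one vector of length D.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if not frame_features:
        raise ValueError("expected at least one frame")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    if any(len(frame) != dim for frame in frame_features):
        raise ValueError("all frames must have the same feature dimension")
    # Mean over the frame (temporal) axis, computed independently per feature dimension.
    return [sum(frame[d] for frame in frame_features) / num_frames for d in range(dim)]


# Toy example: 3 frames, 2-dimensional features.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool_frames(frames))  # [3.0, 4.0]
```

Pooling of this kind has no learnable parameters and no interaction between frames, which is why the X-CLIP structure in the code looks like a different design to me.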