A few bad samples in the dataset: still-frame vids, muted audio, short vids (< 10 sec) #6

Open
v-iashin opened this issue Aug 22, 2020 · 10 comments

Comments

@v-iashin

v-iashin commented Aug 22, 2020

I played with the dataset a little and found some flawed examples (please correct me if this is expected).

My aim is not to criticize the paper but to share my findings with others who might want to use the dataset for their applications 🤗. The problem is not that significant given the size of the dataset: the flawed examples make up fewer than 5% of it, and the sets of flawed examples overlap. Still, knowing about them might save you from strange errors when working with the dataset.
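
For anyone who wants to screen their own copy, here is a minimal sketch of how the three kinds of flaws named in the title could be flagged. It is not the procedure used to build the lists in this issue. The thresholds `SILENCE_PEAK` and `STILL_FRAME_DIFF`, the function names, and the input formats (audio as floats in [-1, 1], frames as flattened grayscale values in [0, 1]) are all assumptions made for illustration; only the 10-second limit comes from the title.

```python
from typing import List, Sequence

MIN_DURATION_SEC = 10.0   # from the issue title: clips shorter than 10 s
SILENCE_PEAK = 1e-4       # assumed: peak amplitude at or below this counts as muted
STILL_FRAME_DIFF = 1e-3   # assumed: mean per-pixel change at or below this counts as still


def is_short(duration_sec: float, min_duration: float = MIN_DURATION_SEC) -> bool:
    """True if the clip is shorter than the minimum duration."""
    return duration_sec < min_duration


def is_muted(samples: Sequence[float], peak_threshold: float = SILENCE_PEAK) -> bool:
    """True if no audio sample exceeds the peak threshold (an empty track counts as muted)."""
    return all(abs(s) <= peak_threshold for s in samples)


def is_still(frames: Sequence[Sequence[float]],
             diff_threshold: float = STILL_FRAME_DIFF) -> bool:
    """True if no pair of consecutive frames differs by more than the threshold."""
    for prev, cur in zip(frames, frames[1:]):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / max(len(cur), 1)
        if mean_abs_diff > diff_threshold:
            return False
    return True  # also reached when fewer than two frames were given


def flag_clip(duration_sec: float,
              samples: Sequence[float],
              frames: Sequence[Sequence[float]]) -> List[str]:
    """Return the names of all flaws detected for one clip."""
    flags = []
    if is_short(duration_sec):
        flags.append("short")
    if is_muted(samples):
        flags.append("muted")
    if is_still(frames):
        flags.append("still")
    return flags
```

A clip can receive more than one flag, which is why the per-flaw lists overlap. Decoding the audio and frames from the video files is left out here and would depend on the tooling you use.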

@WeidiXie
Collaborator

Hi,

Thank you very much for pointing out these noisy samples and errors.

The dataset was collected mainly with an automatic pipeline, so some noise is inevitable, but these lists are super helpful. We will update the meta information accordingly.

Best,
Weidi
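
Because the flaw lists in the first post overlap, applying them to the metadata amounts to taking their union before filtering, so that a clip listed under two flaws is counted and removed only once. A minimal sketch of that step follows; the function names and the use of plain string clip IDs are assumptions, not the maintainers' actual update script.

```python
from typing import Dict, Iterable, List, Set


def merge_flag_lists(flag_lists: Dict[str, Set[str]]) -> Set[str]:
    """Union of all per-flaw ID sets; overlapping IDs appear once."""
    merged: Set[str] = set()
    for clip_ids in flag_lists.values():
        merged |= clip_ids
    return merged


def drop_flagged(clip_ids: Iterable[str], flagged: Set[str]) -> List[str]:
    """Keep only the clips that are not flagged, preserving order."""
    return [clip_id for clip_id in clip_ids if clip_id not in flagged]
```

Summing the lengths of the individual lists would overestimate the number of flawed clips; the size of the merged set is the figure to compare against the dataset size.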

@v-iashin
Author

Thanks for the prompt reply.

Could I ask you to hold off on the update for a week or so? I am still in the process of finding more such examples, and I will update the post if I find anything else.

@WeidiXie
Collaborator

That would be amazing, thanks a lot for your help.

Best,
Weidi

@v-iashin
Author

@WeidiXie
Hey, I updated the post. Check it out.

@WeidiXie
Collaborator

@v-iashin

Thank you so much for this, we are looking into it.

Best,
Weidi

@v-iashin
Author

v-iashin commented Aug 28, 2020

Plus, of course, some of the videos are missing because they are no longer available on YouTube (~10k). I can provide a list that reflects the state of a month ago.

@WeidiXie
Collaborator

@v-iashin

Oh, I see. That's OK, we expected this to happen; Kinetics has the same problem. Unless we release the downloaded data, the dataset will be dynamic.

Best,
Weidi

@daisukelab

> Plus, of course, some of the videos are missing because they are no longer available on YouTube (~10k). I can provide a list of a month-old state.

Hi @v-iashin, is it possible to share your list of the missing ones, please?
I'm trying to download the dataset, but I could get only 178k samples so far.
It seems that 20k+ samples have been lost already...

@v-iashin
Author

v-iashin commented Aug 4, 2021

Hi @daisukelab

Here is the list of videos that were available at the time I downloaded the dataset:
available_clips.txt

@daisukelab

Hi @v-iashin and all,
I'd like to share my list of missing videos.
https://drive.google.com/file/d/13g_3d-7btA48qu1DsBfyLqZbFo6iYZXp/view?usp=sharing

  • From Japan.
  • 199,176 items are listed in vggsound.csv and 181,683 could be downloaded, so 17,493 items are missing.
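
A list like this can be rebuilt for your own download by comparing the metadata file with what you actually have on disk. The sketch below is one way to do it, not the script used for the list above. It assumes that each row of vggsound.csv starts with a video ID followed by a start time in seconds and that the file has no header row; the function names and the IDs in the demo are hypothetical.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

ClipKey = Tuple[str, int]  # (video ID, start time in seconds)


def read_clip_keys(csv_file: TextIO) -> List[ClipKey]:
    """Read (video_id, start_seconds) pairs; assumes no header row."""
    keys = []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        keys.append((row[0], int(row[1])))
    return keys


def find_missing(listed: Iterable[ClipKey],
                 downloaded: Iterable[ClipKey]) -> List[ClipKey]:
    """Return listed clips that are absent from the downloaded set, in listed order."""
    downloaded_set = set(downloaded)
    return [key for key in listed if key not in downloaded_set]


if __name__ == "__main__":
    # Hypothetical rows standing in for the real metadata file.
    metadata = io.StringIO(
        "vidAAAAAAAA,30,dog barking,train\n"
        "vidBBBBBBBB,10,\"people clapping, cheering\",test\n"
    )
    listed = read_clip_keys(metadata)
    missing = find_missing(listed, [("vidAAAAAAAA", 30)])
    print(f"{len(listed)} listed, {len(missing)} missing")
```

With the counts reported above, the same subtraction gives 199,176 - 181,683 = 17,493 missing items. Availability can differ by region and over time, so two people running this on different dates or from different countries should expect different lists.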
