I recently tried to download the English part of ROOTS.
According to the paper, it contains 484,953,009,124 bytes of English data.
However, after filtering for and downloading all ROOTS-related datasets on Hugging Face, I found only about 43.8 GB of data.
How can this difference be explained?
Are the Hugging Face datasets only a subset of ROOTS?
Or are they a processed version of ROOTS, so that the size shrinks from about 480 GB to 43.8 GB?
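For reference, here is a quick sanity check of the arithmetic behind the two figures. It is only a sketch: the variable names are illustrative, and since it is unclear whether the 43.8 figure is in decimal GB or binary GiB, both readings are computed.

```python
# Byte count for English data as quoted from the paper.
PAPER_BYTES = 484_953_009_124

# Size of the downloaded data; the unit (GB vs GiB) is an assumption,
# so both interpretations are evaluated below.
DOWNLOADED = 43.8

GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte

paper_gb = PAPER_BYTES / GB
paper_gib = PAPER_BYTES / GIB

# Fraction of the paper's figure that was actually downloaded.
ratio_if_gb = (DOWNLOADED * GB) / PAPER_BYTES
ratio_if_gib = (DOWNLOADED * GIB) / PAPER_BYTES

print(f"Paper figure: {paper_gb:.1f} GB ({paper_gib:.1f} GiB)")
print(f"Downloaded share if 43.8 is GB:  {ratio_if_gb:.1%}")
print(f"Downloaded share if 43.8 is GiB: {ratio_if_gib:.1%}")
```

Under either reading, the downloaded data is under 10% of the figure in the paper, so the gap is not a unit-conversion artifact.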
cll-mtk changed the title from "Mismatch of the available data quantity on Huggingface" to "Mismatch of the Available Data Quantity on Huggingface" on May 2, 2023