More multimodal datasets docs #1641
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Had one question and some small suggestions, but looks good :)
docs/hub/datasets-image.md
Outdated
@@ -213,4 +218,4 @@ dataset_info:
dtype: string
```

Alternatively, Parquet files with Image data can be created using the `datasets` library by setting the column type to `Image()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load).
Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files).
Just to clarify, does the guidance here say that parquet makes sense only if you have small images AND small row groups? I suspect more people will default to original image files so maybe we also need to point to the repository limits page too in that case?
yup correct, I'll add a link to the repo limits
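For readers following this thread, the size rule of thumb from the diff could be sketched as a quick local check before choosing a format. This is only an illustration: `suggest_format` and `image_sizes` are hypothetical helpers (not part of `datasets` or the Hub), reading "1MB" as 1,000,000 bytes is an assumption, and so is the choice to let a single large image rule out Parquet.

```python
from pathlib import Path

# Threshold taken from the guidance in the diff: Parquet suits images under ~1 MB.
# Interpreting "1MB" as 1,000,000 bytes is an assumption of this sketch.
SMALL_IMAGE_MAX_BYTES = 1_000_000


def suggest_format(image_sizes_bytes):
    """Suggest a storage format from image sizes in bytes (hypothetical helper)."""
    sizes = list(image_sizes_bytes)
    if not sizes:
        raise ValueError("need at least one image size")
    # Assumption: one image at or above the threshold is enough to avoid Parquet.
    if max(sizes) < SMALL_IMAGE_MAX_BYTES:
        return "parquet"
    return "webdataset-or-original-files"


def image_sizes(folder, suffixes=(".jpg", ".jpeg", ".png")):
    """Collect the on-disk sizes of image files found under `folder`."""
    return [
        p.stat().st_size
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    ]
```

For example, `suggest_format(image_sizes("my_images/"))` would report which of the two options the rule of thumb points to for a local folder (the folder name is a placeholder). The row-group size is not checked here, since that is set when the Parquet file is written.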
Co-authored-by: Daniel van Strien <[email protected]>
more docs for image/audio/webdataset