More multimodal datasets docs #1641

Merged: lhoestq merged 6 commits into main from more-multimodal-dataset-docs on Mar 28, 2025

Conversation

lhoestq (Member) commented Mar 17, 2025

more docs for image/audio/webdataset

  • added docs for video

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq requested a review from davanstrien on March 18, 2025 14:35

davanstrien (Member) left a comment

Had one question and a few small suggestions, but looks good :)

@@ -213,4 +218,4 @@ dataset_info:
dtype: string
```

Alternatively, Parquet files with Image data can be created using the `datasets` library by setting the column type to `Image()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load).
Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files).
davanstrien (Member):

Just to clarify, does the guidance here say that parquet makes sense only if you have small images AND small row groups? I suspect more people will default to original image files so maybe we also need to point to the repository limits page too in that case?

lhoestq (Member, Author):

yup correct, I'll add a link to the repo limits
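
For reference, the rule confirmed in this thread (Parquet when images are small and row groups are small, otherwise WebDataset or the original image files) can be written as a small check. This is only a rough sketch and not part of the PR: `suggest_format` and `row_group_bytes` are hypothetical helpers, the thresholds are the ones quoted in the diff above, and reading "1MB" as 1,000,000 bytes is an assumption.

```python
# Thresholds quoted in the docs text under review.
MAX_IMAGE_BYTES = 1_000_000  # "<1MB per image"; decimal megabytes is an assumption
MAX_ROWS_PER_ROW_GROUP = 100  # row group size the docs say `datasets` uses for images


def row_group_bytes(avg_image_bytes: int, rows_per_row_group: int = MAX_ROWS_PER_ROW_GROUP) -> int:
    """Rough size of one row group, counting image bytes only (no metadata)."""
    return avg_image_bytes * rows_per_row_group


def suggest_format(avg_image_bytes: int, rows_per_row_group: int = MAX_ROWS_PER_ROW_GROUP) -> str:
    """Hypothetical helper: Parquet only if BOTH conditions from the thread hold."""
    small_images = avg_image_bytes < MAX_IMAGE_BYTES
    small_row_groups = rows_per_row_group <= MAX_ROWS_PER_ROW_GROUP
    if small_images and small_row_groups:
        return "parquet"
    return "webdataset or original image files"


if __name__ == "__main__":
    for size in (200_000, 5_000_000):
        print(size, suggest_format(size), row_group_bytes(size))
```

With these thresholds, a row group of images under the limit stays below roughly 100 MB of image data. When the check falls through to original image files, the repository limits mentioned in the reply above become the relevant constraint.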

lhoestq merged commit 2dda5fb into main on Mar 28, 2025
2 checks passed
lhoestq deleted the more-multimodal-dataset-docs branch on March 28, 2025 16:50

4 participants