From dff97d67e50db2e83d159e6ec8db5880973e27e1 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 17 Mar 2025 19:33:24 +0100 Subject: [PATCH 1/6] more multimodal datasets docs --- docs/hub/datasets-audio.md | 15 ++- docs/hub/datasets-image.md | 15 ++- docs/hub/datasets-video.md | 194 ++++++++++++++++++++++++++++++++ docs/hub/datasets-webdataset.md | 15 +++ 4 files changed, 229 insertions(+), 10 deletions(-) create mode 100644 docs/hub/datasets-video.md diff --git a/docs/hub/datasets-audio.md b/docs/hub/datasets-audio.md index 80c519629..af96a6edb 100644 --- a/docs/hub/datasets-audio.md +++ b/docs/hub/datasets-audio.md @@ -6,7 +6,7 @@ A dataset with a supported structure and [file formats](./datasets-adding#file-f --- -Additional information about your audio files - such as transcriptions - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`). +Additional information about your audio files - such as transcriptions - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`). Alternatively, audio files can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format. @@ -90,6 +90,8 @@ You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`: {"file_name": "4.wav","text": "dog"} ``` +And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`. + ## Relative paths Metadata file must be located either in the same directory with the audio files it is linked to, or in any parent directory, like in this example: @@ -115,7 +117,9 @@ audio/3.wav,dog audio/4.wav,dog ``` -Metadata file cannot be put in subdirectories of a directory with the audio files. +Metadata files cannot be put in subdirectories of a directory with the audio files. 
+ +More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the audio files. In this example, the `test` directory is used to setup the name of the training split. See [File names and splits](./datasets-file-names-and-splits) for more information. @@ -203,8 +207,9 @@ my_dataset_repository/ └── train.parquet ``` -Audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the image file name or path. -You should specify the feature types of the columns directly in YAML in the README header, for example: +Parquet files with audio data can be created using `pandas` or the `datasets` library. To create Parquet files with audio data in `pandas`, you can use [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Audio()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](/docs/datasets/audio_load). + +Alternatively you can manually set the audio type of Parquet created using other tools. First, make sure your audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the audio file name or path. Then you should specify the feature types of the columns directly in YAML in the README header, for example: ```yaml dataset_info: @@ -215,4 +220,4 @@ dataset_info: dtype: string ``` -Alternatively, Parquet files with Audio data can be created using the `datasets` library by setting the column type to `Audio()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](../datasets/audio_load). +Note that Parquet is recommended for small audio files (<1MB per audio file) and small row groups (100 rows per row group, which is what `datasets` uses for audio). 
For larger audio files it is recommended to use the WebDataset format, or to share the original audio files (optionally with metadata files). diff --git a/docs/hub/datasets-image.md b/docs/hub/datasets-image.md index be2874439..8e048a90e 100644 --- a/docs/hub/datasets-image.md +++ b/docs/hub/datasets-image.md @@ -4,7 +4,7 @@ This guide will show you how to configure your dataset repository with image fil A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. -Additional information about your images - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`). +Additional information about your images - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`). Alternatively, images can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format. @@ -90,6 +90,8 @@ You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`: {"file_name": "4.jpg","text": "a cartoon ball with a smile on it's face"} ``` +And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`. + ## Relative paths Metadata file must be located either in the same directory with the images it is linked to, or in any parent directory, like in this example: @@ -115,7 +117,9 @@ images/3.jpg,a red and white ball with an angry look on its face images/4.jpg,a cartoon ball with a smile on it's face ``` -Metadata file cannot be put in subdirectories of a directory with the images. +Metadata files cannot be put in subdirectories of a directory with the images. 
+ +More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the images. ## Image classification @@ -201,8 +205,9 @@ my_dataset_repository/ └── train.parquet ``` -Image columns are of type _struct_, with a binary field `"bytes"` for the image data and a string field `"path"` for the image file name or path. -You should specify the feature types of the columns directly in YAML in the README header, for example: +Parquet files with image data can be created using `pandas` or the `datasets` library. To create Parquet files with image data in `pandas`, you can use [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Image()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load). + +Alternatively you can manually set the image type of Parquet created using other tools. First, make sure your image columns are of type _struct_, with a binary field `"bytes"` for the image data and a string field `"path"` for the image file name or path. Then you should specify the feature types of the columns directly in YAML in the README header, for example: ```yaml dataset_info: @@ -213,4 +218,4 @@ dataset_info: dtype: string ``` -Alternatively, Parquet files with Image data can be created using the `datasets` library by setting the column type to `Image()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load). +Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files). 
diff --git a/docs/hub/datasets-video.md b/docs/hub/datasets-video.md new file mode 100644 index 000000000..aac12982c --- /dev/null +++ b/docs/hub/datasets-video.md @@ -0,0 +1,194 @@ +# Video Dataset + +This guide will show you how to configure your dataset repository with video files. You can find accompanying examples of repositories in this [Video datasets examples collection](https://huggingface.co/collections/datasets-examples/video-dataset-6568e7cf28639db76eb92d65). + +A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. + +Additional information about your videos - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`). + +Alternatively, videos can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format. + + +## Only videos + +If your dataset only consists of one column with videos, you can simply store your video files at the root: + +``` +my_dataset_repository/ +├── 1.mp4 +├── 2.mp4 +├── 3.mp4 +└── 4.mp4 +``` + +or in a subdirectory: + +``` +my_dataset_repository/ +└── videos + ├── 1.mp4 + ├── 2.mp4 + ├── 3.mp4 + └── 4.mp4 +``` + +Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP. + +``` +my_dataset_repository/ +└── videos + ├── 1.mp4 + ├── 2.png + ├── 3.tiff + └── 4.webp +``` + +If you have several splits, you can put your videos into directories named accordingly: + +``` +my_dataset_repository/ +├── train +│   ├── 1.mp4 +│   └── 2.mp4 +└── test + ├── 3.mp4 + └── 4.mp4 +``` + +See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits. 
+ +## Additional columns + +If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [video generation](https://huggingface.co/tasks/text-to-video) or [object detection](https://huggingface.co/tasks/object-detection). + +``` +my_dataset_repository/ +└── train + ├── 1.mp4 + ├── 2.mp4 + ├── 3.mp4 + ├── 4.mp4 + └── metadata.csv +``` + +Your `metadata.csv` file must have a `file_name` column which links video files with their metadata: + +```csv +file_name,text +1.mp4,an animation of a green pokemon with red eyes +2.mp4,a short video of a green and yellow toy with a red nose +3.mp4,a red and white ball shows an angry look on its face +4.mp4,a cartoon ball is smiling +``` + +You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`: + +```jsonl +{"file_name": "1.mp4","text": "an animation of a green pokemon with red eyes"} +{"file_name": "2.mp4","text": "a short video of a green and yellow toy with a red nose"} +{"file_name": "3.mp4","text": "a red and white ball shows an angry look on its face"} +{"file_name": "4.mp4","text": "a cartoon ball is smiling"} +``` + +And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`. 
+ +## Relative paths + +Metadata file must be located either in the same directory with the videos it is linked to, or in any parent directory, like in this example: + +``` +my_dataset_repository/ +└── train + ├── videos + │   ├── 1.mp4 + │   ├── 2.mp4 + │   ├── 3.mp4 + │   └── 4.mp4 + └── metadata.csv +``` + +In this case, the `file_name` column must be a full relative path to the videos, not just the filename: + +```csv +file_name,text +videos/1.mp4,an animation of a green pokemon with red eyes +videos/2.mp4,a short video of a green and yellow toy with a red nose +videos/3.mp4,a red and white ball shows an angry look on its face +videos/4.mp4,a cartoon ball is smiling +``` + +Metadata files cannot be put in subdirectories of a directory with the videos. + +More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the videos. + +## Video classification + +For video classification datasets, you can also use a simple setup: use directories to name the video classes. Store your video files in a directory structure like: + +``` +my_dataset_repository/ +├── green +│   ├── 1.mp4 +│   └── 2.mp4 +└── red + ├── 3.mp4 + └── 4.mp4 +``` + +The dataset created with this structure contains two columns: `video` and `label` (with values `green` and `red`). + +You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information): + +``` +my_dataset_repository/ +├── test +│   ├── green +│   │   └── 2.mp4 +│   └── red +│   └── 4.mp4 +└── train + ├── green + │   └── 1.mp4 + └── red + └── 3.mp4 +``` + +You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header: + +```yaml +configs: + - config_name: default # Name of the dataset subset, if applicable. 
+ drop_labels: true +``` + +## Large scale datasets + +### WebDataset format + +The [WebDataset](./datasets-webdataset) format is well suited for large scale video datasets. +It consists of TAR archives containing videos and their metadata and is optimized for streaming. It is useful if you have a large number of videos and want streaming data loaders for large scale training. + +``` +my_dataset_repository/ +├── train-0000.tar +├── train-0001.tar +├── ... +└── train-1023.tar +``` + +To make a WebDataset TAR archive, create a directory containing the videos and metadata files to be archived, then build the TAR archive using e.g. the `tar` command. +Each archive is generally around 1GB in size. +Make sure each video and metadata pair shares the same file prefix, for example: + +``` +train-0000/ +├── 000.mp4 +├── 000.json +├── 001.mp4 +├── 001.json +├── ... +├── 999.mp4 +└── 999.json +``` + +Note that for user convenience and to enable the [Dataset Viewer](./datasets-viewer), every dataset hosted on the Hub is automatically converted to Parquet format up to 5GB. Since videos can be quite large, the URLs to the videos are stored in the converted Parquet data without the video bytes themselves. Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation. diff --git a/docs/hub/datasets-webdataset.md b/docs/hub/datasets-webdataset.md index 12a53b362..564954688 100644 --- a/docs/hub/datasets-webdataset.md +++ b/docs/hub/datasets-webdataset.md @@ -18,6 +18,21 @@ Labels and metadata can be in a `.json` file, in a `.txt` (for a caption, a desc A large scale WebDataset is made of many files called shards, where each shard is a TAR archive. Each shard is often ~1GB but the full dataset can be multiple terabytes! +## Multimodal support + +WebDataset is designed for multimodal datasets, i.e. for image, audio and/or video datasets.
+ +Indeed since media files tend to be quite big, the sequential I/O of WebDataset enables large reads and buffering. This results in obtaining the best data loading speed. + +Here is a non-exhaustive list of supported data formats: + +- image: jpeg, png, tiff +- audio: mp3, m4a, wav, flac +- video: mp4, mov, avi +- other: npy, npz + +The full list evolves over time and depends on the implementation. For examoke you can can find which formats the `webdataset` package supports in the source code [here](https://github.com/webdataset/webdataset/blob/main/webdataset/autodecode.py). + ## Streaming Streaming TAR archives is fast because it reads contiguous chunks of data. From 4023acbd24df132f094ec91054be4a9867932d3f Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 17 Mar 2025 19:34:59 +0100 Subject: [PATCH 2/6] minor --- docs/hub/datasets-video.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/hub/datasets-video.md b/docs/hub/datasets-video.md index aac12982c..2914f7a2a 100644 --- a/docs/hub/datasets-video.md +++ b/docs/hub/datasets-video.md @@ -1,6 +1,6 @@ # Video Dataset -This guide will show you how to configure your dataset repository with video files. You can find accompanying examples of repositories in this [Video datasets examples collection](https://huggingface.co/collections/datasets-examples/video-dataset-6568e7cf28639db76eb92d65). +This guide will show you how to configure your dataset repository with video files. A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. @@ -32,15 +32,14 @@ my_dataset_repository/ └── 4.mp4 ``` -Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP. +Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including MP4, MOV and AVI. 
``` my_dataset_repository/ └── videos ├── 1.mp4 - ├── 2.png - ├── 3.tiff - └── 4.webp + ├── 2.mov + └── 3.avi ``` If you have several splits, you can put your videos into directories named accordingly: From 6088d26c3be5136cb34831f45a4fef79b91d828f Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 17 Mar 2025 19:35:30 +0100 Subject: [PATCH 3/6] add to toc --- docs/hub/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 477863b99..e83652e88 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -229,6 +229,8 @@ title: Audio Dataset - local: datasets-image title: Image Dataset + - local: datasets-video + title: Video Dataset - local: spaces title: Spaces isExpanded: true From 7473e53f00021c0757aeeda8de817692e40a96ec Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 17 Mar 2025 19:36:42 +0100 Subject: [PATCH 4/6] mention in data files page --- docs/hub/datasets-data-files-configuration.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md index bf2064279..7bc5554c6 100644 --- a/docs/hub/datasets-data-files-configuration.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -48,12 +48,13 @@ See the documentation on [Manual configuration](./datasets-manual-configuration) See the [File formats](./datasets-adding#file-formats) doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the [example datasets](https://huggingface.co/collections/datasets-examples/format-csv-and-tsv-655f681cb9673a4249cccb3d). -## Image and Audio datasets +## Image, Audio and Video datasets -For image and audio classification datasets, you can also use directories to name the image and audio classes. -And if your images/audio files have metadata (e.g. 
captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. +For image/audio.video classification datasets, you can also use directories to name the image/audio/video classes. +And if your images/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. We provide two guides that you can check out: - [How to create an image dataset](./datasets-image) ([example datasets](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65)) - [How to create an audio dataset](./datasets-audio) ([example datasets](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607)) +- [How to create a video dataset](./datasets-video) From 5561d3f9e7f9422994738c8e1c50e90f29d259bb Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Fri, 28 Mar 2025 17:43:59 +0100 Subject: [PATCH 5/6] Apply suggestions from code review Co-authored-by: Daniel van Strien --- docs/hub/datasets-data-files-configuration.md | 2 +- docs/hub/datasets-webdataset.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md index 7bc5554c6..f781c5208 100644 --- a/docs/hub/datasets-data-files-configuration.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -50,7 +50,7 @@ See the [File formats](./datasets-adding#file-formats) doc page to find the list ## Image, Audio and Video datasets -For image/audio.video classification datasets, you can also use directories to name the image/audio/video classes. +For image/audio/video classification datasets, you can also use directories to name the image/audio/video classes. And if your images/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. 
We provide two guides that you can check out: diff --git a/docs/hub/datasets-webdataset.md b/docs/hub/datasets-webdataset.md index 564954688..3f5adae0d 100644 --- a/docs/hub/datasets-webdataset.md +++ b/docs/hub/datasets-webdataset.md @@ -22,7 +22,7 @@ Each shard is often ~1GB but the full dataset can be multiple terabytes! WebDataset is designed for multimodal datasets, i.e. for image, audio and/or video datasets. -Indeed since media files tend to be quite big, the sequential I/O of WebDataset enables large reads and buffering. This results in obtaining the best data loading speed. +Indeed, since media files tend to be quite big, WebDataset's sequential I/O enables large reads and buffering, resulting in the best data loading speed. Here is a non-exhaustive list of supported data formats: @@ -31,7 +31,7 @@ Here is a non-exhaustive list of supported data formats: - video: mp4, mov, avi - other: npy, npz -The full list evolves over time and depends on the implementation. For examoke you can can find which formats the `webdataset` package supports in the source code [here](https://github.com/webdataset/webdataset/blob/main/webdataset/autodecode.py). +The full list evolves over time and depends on the implementation. For example, you can find which formats the `webdataset` package supports in the source code [here](https://github.com/webdataset/webdataset/blob/main/webdataset/autodecode.py). 
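A WebDataset shard can also be packed with Python's standard `tarfile` module instead of the `tar` command — a minimal sketch with placeholder file names and contents:

```python
# Hypothetical sketch: pack one WebDataset shard from a directory of
# paired files (a stand-in video plus its JSON metadata). Sorting by
# name keeps each media/metadata pair adjacent in the archive, which
# streaming loaders rely on.
import json
import os
import tarfile

os.makedirs("train-0000", exist_ok=True)
open("train-0000/000.mp4", "wb").close()  # placeholder for a real video file
with open("train-0000/000.json", "w") as f:
    json.dump({"text": "a cartoon ball is smiling"}, f)

with tarfile.open("train-0000.tar", "w") as tar:
    for name in sorted(os.listdir("train-0000")):
        tar.add(os.path.join("train-0000", name), arcname=name)
```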
## Streaming From 901f31122f257163e44d726aa49d49a74f77f25d Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Fri, 28 Mar 2025 17:50:30 +0100 Subject: [PATCH 6/6] link to storage recommendations and limits for the image files cases --- docs/hub/datasets-image.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-image.md b/docs/hub/datasets-image.md index 8e048a90e..939ab6014 100644 --- a/docs/hub/datasets-image.md +++ b/docs/hub/datasets-image.md @@ -218,4 +218,4 @@ dataset_info: dtype: string ``` -Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files). +Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files, and following the [repositories recommendations and limits](https://huggingface.co/docs/hub/en/storage-limits) for storage and number of files).