Data files Configuration

There are no constraints on how to structure dataset repositories.

However, if you want the Dataset Viewer to show certain data files, or to separate your dataset into train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files after their splits, e.g. train.csv and test.csv.

What are splits and subsets?

Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of splits (e.g. train and test) that are used during different stages of training and evaluating a model. A subset (also called a configuration) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets, where there may be a different subset for each language. If you're interested in learning more about splits and subsets, check out the Splits and subsets guide!


Automatic splits detection

Splits are automatically detected based on file and directory names. For example, this is a dataset with train, test, and validation splits:

my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv

To structure your dataset by naming your data files or directories according to their split names, see the File names and splits documentation and the companion collection of example datasets.
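As a rough illustration of the naming convention above, the sketch below groups file paths by split, using the file stem or a parent directory name. It is a simplified stand-in, not the Hub's actual detection logic: the `SPLIT_NAMES` set and the helpers `guess_split` and `group_by_split` are hypothetical names introduced for this example, and the real matching rules are the ones described in the File names and splits documentation.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Optional

# Assumption for this sketch: only these three exact names count as splits.
# The Hub's real rules are broader; see the File names and splits documentation.
SPLIT_NAMES = {"train", "test", "validation"}


def guess_split(path: str) -> Optional[str]:
    """Return the split whose name equals the file stem or a parent directory name."""
    p = PurePosixPath(path)
    # Check the file name without its extension first, then each parent directory.
    for candidate in (p.stem, *p.parts[:-1]):
        if candidate in SPLIT_NAMES:
            return candidate
    return None


def group_by_split(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by guessed split, skipping files that match no split."""
    groups: Dict[str, List[str]] = {}
    for path in paths:
        split = guess_split(path)
        if split is not None:
            groups.setdefault(split, []).append(path)
    return groups


repository_files = ["README.md", "train.csv", "test.csv", "validation.csv"]
splits = group_by_split(repository_files)
```

With the repository layout shown above, `splits` maps each of train, test, and validation to its single CSV file, and README.md is left out because it matches no split name.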

Manual splits and subsets configuration

You can use YAML to choose which data files the Dataset Viewer shows for your dataset. This is useful when you want to specify manually which file goes into which split.

You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).

Here is an example of a configuration defining a subset called "benchmark" with a test split.

configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv

See the documentation on Manual configuration for more information. See also the example datasets.
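To make the mapping in the YAML example concrete, here is a minimal sketch that mirrors the same configuration as a plain Python structure and looks up the files for a given subset and split. The helper `files_for` is a hypothetical name used only for illustration; it does not reproduce how the Hub or the `datasets` library resolves configurations.

```python
from typing import Dict, List

# The "benchmark" example from above, written as plain Python data.
configs: List[Dict] = [
    {
        "config_name": "benchmark",
        "data_files": [
            {"split": "test", "path": "benchmark.csv"},
        ],
    },
]


def files_for(configs: List[Dict], config_name: str, split: str) -> List[str]:
    """Return the paths declared for one split of one subset (hypothetical helper)."""
    for config in configs:
        if config["config_name"] == config_name:
            return [
                entry["path"]
                for entry in config["data_files"]
                if entry["split"] == split
            ]
    raise KeyError(f"unknown subset: {config_name!r}")


test_files = files_for(configs, "benchmark", "test")
```

Here `test_files` contains only benchmark.csv. Asking for a split the subset does not declare returns an empty list, and asking for an unknown subset raises `KeyError`.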

Supported file formats

See the File formats doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the example datasets.

Image, Audio and Video datasets

For image/audio/video classification datasets, you can also use directories to name the image/audio/video classes. If your image/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can place metadata files next to them.
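The directory-per-class layout described above can be sketched as follows: each file's class is taken from the name of the directory that directly contains it. The file paths and the helpers `label_from_path` and `labels_by_file` are hypothetical, chosen only to illustrate the idea; the guides for each modality describe the layouts that are actually supported.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def label_from_path(path: str) -> str:
    """Use the immediate parent directory name as the class label."""
    return PurePosixPath(path).parent.name


def labels_by_file(paths: Iterable[str]) -> Dict[str, str]:
    """Map every file path to the class label implied by its directory."""
    return {path: label_from_path(path) for path in paths}


# Hypothetical layout: split directory, then class directory, then the files.
example_files = [
    "train/cat/001.jpg",
    "train/cat/002.jpg",
    "train/dog/001.jpg",
]
labels = labels_by_file(example_files)
class_names = sorted(set(labels.values()))
```

In this example `class_names` is the sorted list of directory names found, here cat and dog.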

We provide two guides that you can check out: