[feat] Add metadata to datasets #123

Closed
@jlamypoirier

🧐 Problem Description

Our prepared datasets consist of loose index (.idx) and binary (.bin) files, which are hard to track and make reproducibility difficult. We want to fix that with a bit of structure.

💡 Proposed Solution

Add a metadata file to our prepared datasets, containing an index of the dataset files and other useful information. As suggested in #120, we should consider the Croissant format, which seems natural since the preparator takes Hugging Face datasets as inputs and many of them already use Croissant.
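
To make the idea concrete, here is a minimal sketch of what such a metadata file could record. It is plain JSON, not Croissant, and everything in it is an assumption for illustration: the `metadata.json` name, the `source`/`files` layout, and the helper names are hypothetical and not part of the preparator.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large .bin files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(dataset_dir: Path, source: str) -> dict:
    """Index every .idx/.bin file in the directory (hypothetical layout)."""
    files = []
    for path in sorted(dataset_dir.iterdir()):
        if path.suffix in (".idx", ".bin"):
            files.append(
                {
                    "name": path.name,
                    "size": path.stat().st_size,
                    "sha256": sha256_of(path),
                }
            )
    return {"source": source, "files": files}


def write_metadata(dataset_dir: Path, source: str, name: str = "metadata.json") -> Path:
    """Write the index next to the dataset files and return its path."""
    out = dataset_dir / name
    out.write_text(json.dumps(build_metadata(dataset_dir, source), indent=2))
    return out


def verify_metadata(metadata_path: Path) -> list:
    """Return a list of problems: files that are missing or whose hash changed."""
    metadata = json.loads(metadata_path.read_text())
    problems = []
    for entry in metadata["files"]:
        path = metadata_path.parent / entry["name"]
        if not path.is_file():
            problems.append(f"missing: {entry['name']}")
        elif sha256_of(path) != entry["sha256"]:
            problems.append(f"modified: {entry['name']}")
    return problems
```

Recording a checksum per file is what would let a loader detect accidental modifications, and the `source` field is one place where traceability back to the input dataset could live.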

🔄 Alternatives Considered

We can determine which files to auto-concatenate (#120) in other ways. For example:

  1. Parse a directory at runtime for .idx and .bin files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
  2. Use an index file to list all files to load. (Like the existing json dataset, but simpler.) Safer, but doesn't address traceability.
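
The two alternatives above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the index file is assumed to be plain text with one file prefix per line, which is not the format of the existing json dataset.

```python
from pathlib import Path


def discover_prefixes(dataset_dir: Path) -> list:
    """Alternative 1: scan the directory and keep prefixes that have both files.

    Anything added to or removed from the directory silently changes the result.
    """
    idx = {p.with_suffix("") for p in dataset_dir.glob("*.idx")}
    bins = {p.with_suffix("") for p in dataset_dir.glob("*.bin")}
    return sorted(idx & bins)


def load_prefixes_from_index(index_path: Path) -> list:
    """Alternative 2: load only the prefixes listed in an index file.

    Assumed format: one prefix per line, relative to the index file's directory.
    A listed prefix with a missing .idx or .bin file raises FileNotFoundError.
    """
    prefixes = []
    for line in index_path.read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        prefix = index_path.parent / name
        for suffix in (".idx", ".bin"):
            path = prefix.with_name(prefix.name + suffix)
            if not path.is_file():
                raise FileNotFoundError(path)
        prefixes.append(prefix)
    return prefixes
```

With the directory scan, a stray or deleted shard changes the dataset without any error. With the index file, extra files are ignored and missing ones fail loudly, but nothing records where the data came from or whether file contents changed.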

📈 Potential Benefits

Reproducibility, traceability, safety, etc.
