[feat] Add metadata to datasets #123

# 🧐 Problem Description

Our prepared datasets are stored as loose index (`.idx`) and binary (`.bin`) files, which makes them hard to track for reproducibility and traceability. We want to fix that by adding some structure.

# 💡 Proposed Solution

Add a metadata file to each prepared dataset, containing an index of the dataset files and other useful information. As suggested in #120, we should consider the [Croissant format](https://github.com/mlcommons/croissant?tab=readme-ov-file), which seems natural since the preparator takes Hugging Face datasets as input, and many of them already use the Croissant format.
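As a rough illustration of what such a metadata file could record, here is a minimal sketch that writes a plain JSON manifest next to the prepared files. It is not a Croissant document: the `metadata.json` file name, the `source` and `files` fields, and the `write_metadata` helper are all hypothetical and only meant to make the idea concrete.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large .bin files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(dataset_dir: Path, source: str) -> Path:
    """Write a hypothetical metadata.json listing every .idx/.bin file.

    The schema below is an assumption for illustration, not Croissant.
    """
    files = [
        {"name": p.name, "size": p.stat().st_size, "sha256": sha256_of(p)}
        for p in sorted(dataset_dir.iterdir())
        if p.suffix in {".idx", ".bin"}
    ]
    metadata = {"source": source, "files": files}
    out = dataset_dir / "metadata.json"
    out.write_text(json.dumps(metadata, indent=2))
    return out
```

Recording a size and checksum per file is what would let a loader detect accidental modifications, and the `source` field is one possible place to keep the originating dataset identifier for traceability.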

# 🔄 Alternatives Considered

We could determine which files to auto-concatenate (#120) in other ways. For example:
1. Parse a directory at runtime for `.idx` and `.bin` files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
2. Use an index file to list all files to load (like the existing JSON dataset, but simpler). Safer, but doesn't address traceability.
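To make the trade-off between the two alternatives concrete, here is a minimal sketch of both. The function names and the one-prefix-per-line index layout are assumptions for illustration, not the existing loader's behavior.

```python
from pathlib import Path


def discover_shards(dataset_dir: Path) -> list[str]:
    """Alternative 1: scan the directory for matching .idx/.bin pairs."""
    idx = {p.stem for p in dataset_dir.glob("*.idx")}
    bins = {p.stem for p in dataset_dir.glob("*.bin")}
    unpaired = idx ^ bins
    if unpaired:
        raise ValueError(f"Unpaired shard files: {sorted(unpaired)}")
    # A shard whose .idx and .bin were both deleted is simply not seen here.
    return sorted(idx)


def read_index(index_file: Path) -> list[str]:
    """Alternative 2: read shard prefixes from an explicit index file.

    Assumes a hypothetical plain-text index with one shard prefix per line.
    """
    base = index_file.parent
    lines = index_file.read_text().splitlines()
    prefixes = [line.strip() for line in lines if line.strip()]
    missing = [
        prefix
        for prefix in prefixes
        if not (base / f"{prefix}.idx").exists()
        or not (base / f"{prefix}.bin").exists()
    ]
    if missing:
        raise FileNotFoundError(f"Shards listed but not found: {missing}")
    return prefixes
```

The directory scan only catches half-deleted pairs, while the explicit index also catches a shard that disappeared entirely. Neither records where the data came from, which is the gap the metadata file is meant to fill.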

# 📈 Potential Benefits

Reproducibility, traceability, safety, etc.
