Closed
🧐 Problem Description
Our prepared datasets are stored as loose index and binary files, which are hard to track and make reproducibility difficult. We want to fix that by adding a bit of structure.
💡 Proposed Solution
Add a metadata file to our prepared datasets, containing an index of the dataset files and other useful information. As suggested in #120, we should consider the Croissant format, which seems a natural fit since the preparator takes Hugging Face datasets as input and many of them already use the Croissant format.
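To make the idea concrete, here is a minimal sketch of a metadata file that indexes the dataset files and records enough information to detect changes. It uses a plain JSON layout, not the Croissant schema; the file name `metadata.json`, the `source` field, and the function names are assumptions made up for illustration and are not part of the preparator.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = "metadata.json"  # hypothetical file name, not an existing convention


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large .bin files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: Path, source: str) -> dict:
    """List every .idx/.bin file in the directory with its size and checksum."""
    files = []
    for path in sorted(directory.iterdir()):
        if path.suffix in (".idx", ".bin"):
            files.append(
                {
                    "name": path.name,
                    "size": path.stat().st_size,
                    "sha256": sha256_of(path),
                }
            )
    # "source" is a free-form label for the input dataset (assumed field).
    return {"source": source, "files": files}


def write_manifest(directory: Path, source: str) -> Path:
    """Write the manifest next to the dataset files and return its path."""
    out = directory / MANIFEST_NAME
    out.write_text(json.dumps(build_manifest(directory, source), indent=2))
    return out


def verify_manifest(directory: Path) -> list[str]:
    """Return a list of problems: files that are missing or whose content changed."""
    manifest = json.loads((directory / MANIFEST_NAME).read_text())
    problems = []
    for entry in manifest["files"]:
        path = directory / entry["name"]
        if not path.exists():
            problems.append(f"missing: {entry['name']}")
        elif sha256_of(path) != entry["sha256"]:
            problems.append(f"modified: {entry['name']}")
    return problems
```

Under these assumptions, the preparator would call `write_manifest` once after producing the files, and a loader would call `verify_manifest` before concatenating them. A real Croissant file would carry the same kind of file list and checksums, plus richer provenance fields.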
🔄 Alternatives Considered
There are other ways to determine which files to auto-concatenate (#120). For example:
- Parse a directory at runtime for `.idx` and `.bin` files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
- Use an index file to list all files to load. (Like the existing json dataset, but simpler.) Safer, but doesn't address traceability.
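The two alternatives can be sketched roughly as follows. This is an illustration only: the function names, the one-prefix-per-line index layout, and the assumption that each shard is an `.idx`/`.bin` pair sharing a file name prefix are hypothetical, not the existing loader's behavior.

```python
from pathlib import Path


def discover_shards(directory: Path) -> list[str]:
    """Alternative 1: scan the directory and pair up .idx and .bin files by prefix."""
    idx = {p.stem for p in directory.glob("*.idx")}
    bins = {p.stem for p in directory.glob("*.bin")}
    unpaired = idx ^ bins  # prefixes that have only one of the two files
    if unpaired:
        raise ValueError(f"unpaired shard files: {sorted(unpaired)}")
    # Any complete pair is accepted, including ones added by accident.
    return sorted(idx)


def read_index_file(index_path: Path) -> list[str]:
    """Alternative 2: load only the prefixes listed in an index file, one per line."""
    directory = index_path.parent
    prefixes = [
        line.strip() for line in index_path.read_text().splitlines() if line.strip()
    ]
    for prefix in prefixes:
        for suffix in (".idx", ".bin"):
            if not (directory / (prefix + suffix)).exists():
                raise FileNotFoundError(f"listed but missing: {prefix}{suffix}")
    return prefixes
```

The scan picks up whatever complete pairs happen to be in the directory, which is why stray files go unnoticed. The index file restricts loading to an explicit list, but it records nothing about where the files came from or whether their contents changed.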
📈 Potential Benefits
Reproducibility, traceability, safety, etc.