[feat] Add metadata to datasets #123
Comments
We can start with the
The next step would be to do the same for data mixes for specific training experiments. This should also include which files or parts of files are included from the source datasets—this time, using the already processed datasets. The full history would look like this:
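A hypothetical sketch of what such a lineage record might contain; every field name here is illustrative, not an established Fast-LLM schema:

```python
# Hypothetical mix-lineage record; all field names are illustrative,
# not an established Fast-LLM schema.
import json

mix_metadata = {
    "mix_name": "experiment_42_mix",
    "sources": [
        {
            "dataset": "prepared/wikipedia_en",  # an already processed dataset
            "files": ["shard_0.bin", "shard_1.bin"],  # whole files included
            "weight": 0.7,
        },
        {
            "dataset": "prepared/the_stack_python",
            "files": ["shard_0.bin"],
            "row_ranges": [[0, 500000]],  # only part of the file included
            "weight": 0.3,
        },
    ],
}

with open("mix_metadata.json", "w") as f:
    json.dump(mix_metadata, f, indent=2)
```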
Do we also want to track, for example, which row in the final mix comes from which dataset, file, and row in the mix, and, in turn, from which original HF dataset, file, and row? In this case, we could assign a UID to each row in the input dataset and pass it along through all the transformations. Additionally, if an aggregate operation is performed, such as near deduplication, the resulting ID could be a list of all the IDs that formed the group. I've done something similar for GitHub repository files in BigCode. It essentially mimics a relational database, but for a limited purpose.
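A minimal sketch of the UID idea, assuming rows are plain dicts and that near-deduplication hands back groups of rows (both are assumptions for illustration, not Fast-LLM internals):

```python
# Sketch of per-row UIDs that survive transformations; rows-as-dicts and
# the dedup grouping are assumptions for illustration only.
import uuid

def assign_uids(rows):
    """Attach a UID list to every input row; a list, so that aggregate
    operations can merge lineages later."""
    for row in rows:
        row["_uid"] = [str(uuid.uuid4())]
    return rows

def merge_group(group):
    """Collapse a near-duplicate group into one row whose _uid holds the
    UIDs of every row that formed the group."""
    merged = dict(group[0])
    merged["_uid"] = [uid for row in group for uid in row["_uid"]]
    return merged

rows = assign_uids([{"text": "a"}, {"text": "a "}, {"text": "b"}])
# Pretend near-dedup grouped the first two rows:
deduped = [merge_group(rows[:2]), rows[2]]
```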
Let's not overthink this. The prepare command can take the Croissant metadata of the prepared dataset from the HF Hub and put it in the output folder.
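For illustration, a sketch of that step; it assumes the Hub's `/api/datasets/<id>/croissant` endpoint and uses `requests` directly rather than any existing Fast-LLM code:

```python
# Sketch only: fetch a dataset's Croissant record from the Hub and drop
# it next to the prepared output. Assumes the Hub exposes
# /api/datasets/<id>/croissant for the given repo.
import json
from pathlib import Path

import requests

def save_croissant(dataset_id: str, output_dir: str) -> None:
    url = f"https://huggingface.co/api/datasets/{dataset_id}/croissant"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "croissant.json").write_text(json.dumps(response.json(), indent=2))

save_croissant("wikitext", "output/")
```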
Furthermore, the only job of Fast-LLM's prepare command is currently to split and tokenize an HF dataset.
Agreed with @tscholak. I don't think we need a separate command, and dataset mixes already fit well in fast-llm configurations, so we only need to deal with single datasets, at least for now. Agreed that copying existing Croissant metadata is a good first step. Then we can start thinking about cases where the metadata is incomplete, absent, or in another format. The preparator should still work in those cases, so we can't enforce mandatory fields, though we could add a warning. Other possibilities:
I have checked 68 datasets on HF; 61 of them had the following fields:

38 had

However, for some datasets, even big and well-known ones, there is no Croissant data at all, such as:

I am guessing it is because their structure is not easily mappable to the format. So, I will first concentrate on whether I can fill at least some of the fields for those datasets via the HF API. But I do not think we can get fields like
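A rough sketch of what that backfilling could look like with `huggingface_hub`; the attribute names should be verified against the installed version:

```python
# Rough sketch: pull a few Hub-provided fields for a dataset to backfill
# missing metadata. Attribute names follow huggingface_hub's DatasetInfo
# and card data; verify against the installed version.
from huggingface_hub import HfApi

def backfill_fields(dataset_id: str) -> dict:
    info = HfApi().dataset_info(dataset_id)
    card = info.card_data  # parsed dataset card, may be None
    return {
        "name": info.id,
        "license": card.get("license") if card is not None else None,
        "tags": info.tags,
    }

print(backfill_fields("wikitext"))
```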
Hey @bigximik, thanks for checking those datasets. I'm worried that we are overcomplicating things again. The goal right now is just to take advantage of existing metadata and wire it through. Looking into how to backfill missing metadata or make datasets fully compliant is a separate issue and should be treated as such. Let's not expand the scope at this point. Instead, let's implement the straightforward part first, and then revisit edge cases if and when they actually matter. Hope to see your first Fast-LLM PR soon!
🧐 Problem Description
Our prepared datasets come as loose index and binary files, which are hard to track for reproducibility, etc. We want to fix that with a bit of structure.
💡 Proposed Solution
Add a metadata file to our prepared datasets that contains an index of the dataset files and other useful information. As suggested in #120, we should consider the Croissant format, which seems natural since the preparator takes Hugging Face datasets as inputs, and lots of them use the Croissant format already.
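An unvalidated sketch of what the emitted file could contain, abridged from the Croissant JSON-LD shape; the authoritative field list is the Croissant spec, and the dataset name is a placeholder:

```python
# Abridged, unvalidated sketch of a Croissant-style metadata file that
# indexes the prepared shards; consult the Croissant spec for the full
# set of required fields.
import hashlib
import json
from pathlib import Path

def file_entry(path: Path) -> dict:
    return {
        "@type": "cr:FileObject",
        "@id": path.name,
        "contentUrl": path.name,
        "encodingFormat": "application/octet-stream",
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }

def write_metadata(output_dir: str) -> None:
    out = Path(output_dir)
    shards = sorted(out.glob("*.idx")) + sorted(out.glob("*.bin"))
    metadata = {
        "@context": {"cr": "http://mlcommons.org/croissant/"},
        "@type": "sc:Dataset",
        "name": "prepared-dataset",  # placeholder
        "distribution": [file_entry(p) for p in shards],
    }
    (out / "croissant.json").write_text(json.dumps(metadata, indent=2))
```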
🔄 Alternatives Considered
We can guess the files to auto-concatenate (#120) in other ways. For example:

- Scan the directory for `.idx` and `.bin` files to generate the full dataset. Simple, but risky, because accidental modifications in the directory would go unnoticed (see the sketch below).
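For concreteness, the scanning alternative in a few lines; `discover_shards` is a hypothetical helper, not existing Fast-LLM code:

```python
# The scanning alternative; pairing by file stem is exactly what makes
# stray or renamed files in the directory dangerous here.
from pathlib import Path

def discover_shards(directory: str) -> list[tuple[Path, Path]]:
    pairs = []
    for idx in sorted(Path(directory).glob("*.idx")):
        bin_file = idx.with_suffix(".bin")
        if bin_file.exists():
            pairs.append((idx, bin_file))
    return pairs
```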
📈 Potential Benefits
Reproducibility, traceability, safety, etc.