
[feat] Add metadata to datasets #123

Closed
jlamypoirier opened this issue Jan 21, 2025 · 8 comments · Fixed by #142
Assignees
Labels
enhancement New feature or request Priority

Comments

@jlamypoirier
Collaborator

🧐 Problem Description

Our prepared datasets come in loose index and binary files which are hard to track for reproducibility, etc. We want to fix that with a bit of structure.

💡 Proposed Solution

Add a metadata file to our prepared datasets, containing an index of dataset files and other useful information. As suggested in #120, we should consider the Croissant format, which seems natural since the preparator takes Hugging Face datasets as inputs, and many of them already use the Croissant format.

🔄 Alternatives Considered

There are other ways to infer which files to auto-concatenate (#120). For example:

  1. Parse a directory at runtime for .idx and .bin files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
  2. Use an index file to list all files to load. (Like the existing json dataset, but simpler.) Safer, but doesn't address traceability.
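A minimal sketch of alternative 1, scanning a directory for matched `.idx`/`.bin` pairs. The function name and the "fail on unpaired files" policy are illustrative, not the actual fast-llm layout:

```python
# Hypothetical sketch of alternative 1: discover prepared dataset shards by
# scanning a directory for matching .idx/.bin pairs. Names are illustrative.
from pathlib import Path


def discover_shards(directory):
    """Return sorted stems that have both an .idx and a .bin file."""
    root = Path(directory)
    idx_stems = {p.stem for p in root.glob("*.idx")}
    bin_stems = {p.stem for p in root.glob("*.bin")}
    orphans = idx_stems ^ bin_stems  # files missing their counterpart
    if orphans:
        # an accidental deletion or stray file would otherwise go unnoticed
        raise ValueError(f"unpaired shard files: {sorted(orphans)}")
    return sorted(idx_stems)
```

Failing loudly on unpaired files mitigates part of the "accidental modifications go unnoticed" risk, but it still cannot detect a consistent pair that was silently replaced, which is what the metadata file addresses.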

📈 Potential Benefits

Reproducibility, traceability, safety, etc.

@jlamypoirier jlamypoirier added the enhancement New feature or request label Jan 21, 2025
@bigximik
Contributor

We can start with the fast-llm data tool reading metadata from HF repositories and creating its own metadata for processed data based on the Croissant format:

  1. Define Required Metadata

    • Decide which minimal metadata we consider mandatory, such as, for example:
      • name
      • license
      • URL
      • description
      • distribution
      • record schema
  2. Read Metadata from Input Repositories

    • Read metadata from input repositories.
    • If required data is missing, ensure it is provided in the input configuration file.
  3. Create Croissant Metadata

    • After processing is complete, generate the Croissant file.
    • Still need to determine whether transformation history and source datasets can:
      • Be included as part of the description only,
      • Or follow a specific metadata format.
  4. Save Metadata

    • Specify a path or URL where the metadata file will be saved.
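The four steps above can be sketched as follows. The mandatory-field list mirrors the proposal in step 1; the helper names and the override mechanism (values supplied via the input configuration file, step 2) are hypothetical:

```python
# Illustrative sketch of steps 1-4: assemble a minimal Croissant (JSON-LD)
# record for a prepared dataset and flag any missing mandatory fields.
# Field names follow the Croissant vocabulary; helper names are hypothetical.
import json

MANDATORY = ["name", "license", "url", "description", "distribution", "recordSet"]


def build_croissant(source_meta, overrides=None):
    """Merge metadata read from the input repository with config overrides."""
    meta = {
        "@context": "https://schema.org/",
        "@type": "sc:Dataset",
        **source_meta,
        **(overrides or {}),  # step 2: fill gaps from the input config file
    }
    missing = [k for k in MANDATORY if k not in meta]  # step 1: required set
    return meta, missing


def save_croissant(meta, path):
    """Step 4: write the Croissant file next to the prepared dataset."""
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
```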

@bigximik
Contributor

The next step would be to do the same for data mixes for specific training experiments. This should also include which files or parts of files are included from the source datasets—this time, using the already processed datasets.

The full history would look like this:

  1. A set of HF datasets.
  2. A subset of (1) HF datasets transformed into the fast-llm format.
  3. A subset combining several sets from (2).

@bigximik
Contributor

Do we also want to track, for example, which row in the final mix comes from which dataset, file, and row, and, in turn, from which original HF dataset, file, and row?

In this case, we could assign a UID to each row in the input dataset and then pass it along through all the transformations. However, the current fast-llm format does not support this. To implement it, we would likely need to either extend the index file or add a separate IDs file.

Additionally, if an aggregate operation is performed, such as near-deduplication, the resulting ID could be a list of all the IDs that formed the group. I've done something similar for GitHub repository files in BigCode. It essentially mimics a relational database, but for a limited purpose.
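The UID-plus-aggregation idea above could look roughly like this; both helpers are hypothetical and the dedup key is simplified to exact text match:

```python
# Hypothetical sketch of per-row provenance: each row carries a UID, and an
# aggregate step (here, exact dedup standing in for near-dedup) collapses a
# group of rows into one row whose "uid" is the list of merged UIDs.
import uuid


def assign_uids(texts):
    """Attach a fresh UID to every input row."""
    return [{"uid": str(uuid.uuid4()), "text": t} for t in texts]


def dedup(rows, key=lambda r: r["text"]):
    """Merge duplicate rows, remembering all UIDs that formed each group."""
    groups = {}
    for r in rows:
        groups.setdefault(key(r), []).append(r["uid"])
    # one representative per group; "uid" becomes the list of merged IDs
    return [{"uid": uids, "text": text} for text, uids in groups.items()]
```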

@tscholak
Collaborator

Let's not overthink this. The prepare command can take the Croissant metadata of the prepared dataset from the HF Hub and put it in the output folder.
When we then use this dataset, or an amalgamation of many such datasets, during training, the metadata of the full training dataset will be the (flattened) concatenation of the individual datasets involved. No need to track individual examples.
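The "flattened concatenation" could be sketched like this; the `parts` key and the structure are illustrative, not a fixed schema:

```python
# Hedged sketch: a training mix's metadata is the flattened list of its
# components' Croissant records. If a component is itself a mix, its parts
# are hoisted up one level, so nesting never accumulates.
def combine_metadata(mix_name, component_metas):
    parts = []
    for meta in component_metas:
        # a plain dataset contributes itself; a mix contributes its parts
        parts.extend(meta.get("parts", [meta]))
    return {"@type": "sc:Dataset", "name": mix_name, "parts": parts}
```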

@tscholak
Collaborator

Furthermore, the only job of Fast-LLM's prepare command is currently to split and tokenize an HF dataset.

@jlamypoirier
Collaborator Author

Agreed with @tscholak. I don't think we need a separate command, and dataset mixes already fit well in fast-llm configurations, so we only need to deal with single datasets, at least for now.

Agreed, copying existing Croissant metadata is a good first step. Then we can start thinking about cases where the metadata is incomplete, absent, or in another format. The preparator should still work in those cases, so we can't enforce mandatory fields, though we could add a warning. Other possibilities:

  • Have an option in prepare to add/modify config fields.
  • Try to derive some metadata in other ways (e.g. from the Hugging Face API: name, org, commit, etc.)
  • If some other format is encountered frequently, consider supporting it.
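The API-fallback option could look roughly like this. The `info` dict below only mimics the shape of a Hub dataset-info response (`id`, `sha`, `tags` with `license:` prefixes); treat the whole mapping as an assumption:

```python
# Hypothetical fallback: when a dataset has no Croissant record, derive what
# we can from a Hugging Face Hub dataset-info response. The input shape is
# an assumption modeled loosely on the Hub API; the mapping is illustrative.
def derive_metadata(info):
    meta = {}
    if info.get("id"):
        meta["name"] = info["id"]
        meta["url"] = f"https://huggingface.co/datasets/{info['id']}"
    if info.get("sha"):
        meta["version"] = info["sha"]  # pin the exact commit for reproducibility
    for tag in info.get("tags", []):
        if tag.startswith("license:"):
            meta["license"] = tag.split(":", 1)[1]
    return meta
```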

@bigximik
Contributor

bigximik commented Feb 3, 2025

I have checked 68 datasets on HF; 61 of them had the following fields:

'@context': 61,
'@type': 61,
'distribution': 61,
'recordSet': 61,
'conformsTo': 61,
'name': 61,
'description': 61,
'alternateName': 61,
'creator': 61,
'keywords': 61,
'url': 61,

38 had the license field, and 10 had sameAs. So it seems that for the majority of datasets all is good.
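A minimal version of the check described above, counting how often each top-level key appears across a collection of Croissant records:

```python
# Count top-level field occurrences across a list of Croissant (JSON-LD)
# records, as in the 68-dataset survey above.
from collections import Counter


def field_coverage(records):
    counts = Counter()
    for rec in records:
        counts.update(rec.keys())
    return counts
```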

However, some datasets, even big and well-known ones, have no Croissant data at all, for example:

allenai/peS2o 
pg19 
yhavinga/ccmatrix 
atom-in-the-universe/fanfics-10k-50k 
EleutherAI/proof-pile-2 
big_patent 
allenai/dolma 

I am guessing this is because their structure is not easily mappable to the format.

So I will first concentrate on whether I can fill at least some of the fields for those datasets via the HF API. But I do not think we can get fields like distribution and recordSet automatically; we will probably need to let people fill them in manually in the data processor, or leave such cases only partially compliant with the format and non-actionable, as those fields are crucial for automatic loading of the datasets.

@tscholak
Collaborator

tscholak commented Feb 4, 2025

Hey @bigximik, thanks for checking those datasets. I'm worried that we are overcomplicating things again. The goal right now is just to take advantage of existing metadata, wire it through prepare, and have it in the output folder. If a dataset already has Croissant metadata, we use it. If not, we don't block anything. We just proceed without it.

Looking into how to backfill missing metadata or make datasets fully compliant is a separate issue and should be treated as such. Let's not expand the scope at this point. Instead, let's implement the straightforward part first, and then revisit edge cases if and when they actually matter. Hope to see your first Fast-LLM PR soon!

@bigximik bigximik linked a pull request Feb 6, 2025 that will close this issue