Saving of croissant metadata files for HF datasets #142
Conversation
Looks mostly good, but after thinking about it a bit more, I doubt that adding hf_hub as a dependency is worth it; see comments.
So, ok, I will extract …
@tscholak I have placed …
token = hf_auth_get_token()
try:
    # Retrieve the dataset metadata in croissant format
    url = f"https://huggingface.co/api/datasets/{self._config.dataset.path}/croissant"
Now that I think about this again, why do we need to go to hf.co and get the Croissant metadata from there? Why can't we use the hf datasets API?
It is according to their how-to: https://huggingface.co/docs/dataset-viewer/en/croissant
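For reference, a minimal sketch of what the full fetch could look like, assuming the requests library and using huggingface_hub.get_token in place of the PR's extracted hf_auth_get_token helper (the function name and error handling here are illustrative, not the PR's actual code):

```python
import requests
from huggingface_hub import get_token  # assumes huggingface_hub is installed

def fetch_croissant_metadata(dataset_path: str) -> dict:
    """Hypothetical helper: fetch Croissant metadata from the documented endpoint."""
    url = f"https://huggingface.co/api/datasets/{dataset_path}/croissant"
    token = get_token()  # None if no HF credentials are cached locally
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # gated or missing datasets surface as HTTP errors
    return response.json()
```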
It seems that, at the moment at least, they do not have anything for it in the datasets library: https://github.com/search?q=repo%3Ahuggingface%2Fdatasets%20croissant&type=code
OK, fair enough. Looks like the original Croissant metadata can only be retrieved from that endpoint then.
Btw, have you checked this approach:
import datasets
builder = datasets.load_dataset_builder('dataset_name')
dataset_info = builder.info
dataset_info should be of type datasets.info.DatasetInfo and contain:
- description: A textual description of the dataset.
- features: The schema of the dataset (column names and types).
- splits: Information about available splits (e.g., train, test, validation).
- size_in_bytes: The dataset size.
- citation: The citation reference.
- license: The dataset's license.
This looks useful and could be all we need (cc @chrish42). We could convert it into Croissant and save it to disk. The benefit here is that this can be used with any hf dataset on disk as well, including previously downloaded and cached ones.
Furthermore, irrespective of whether we use Croissant or dataset_info, I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see #113). I think the features field should be modified based on that...
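To make that concrete, here is a sketch of what narrowing the recorded schema might look like; the column names are hypothetical, not taken from the PR:

```python
import datasets

# Hypothetical source schema; fast-llm prepare keeps only the main text column.
original_features = datasets.Features({
    "text": datasets.Value("string"),
    "url": datasets.Value("string"),
    "timestamp": datasets.Value("string"),
})
kept_columns = ["text"]

# Narrow the recorded schema to the columns that survive preparation.
prepared_features = datasets.Features(
    {name: original_features[name] for name in kept_columns}
)
print(prepared_features)  # {'text': Value(dtype='string', id=None)}
```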
I looked for dataset info via the URL, as described here: Hugging Face Dataset Viewer - Info, instead of using the builder. However, it did not provide any additional information compared to the Croissant format and failed on the same datasets.
I have now tested it using the builder:
- It seems to be slower, as it downloads scripts and at least partially executes them.
- On the plus side, it was able to read 6 out of the 7 repositories that neither the Croissant format nor the dataset info URL could provide.
However, if switching to the builder, I don't see a reason to convert that information to Croissant. The main purpose of Croissant metadata is to be actionable: its recordSet and distribution fields allow actual data loading. So, if using dataset info, it would make more sense to simply save it to a YAML file, for example.
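A rough sketch of that idea; the field selection and file name are assumptions, and Features is not directly YAML-serializable, so it is stringified here:

```python
import yaml
import datasets

builder = datasets.load_dataset_builder("dataset_name")  # placeholder repo id
info = builder.info

# Keep only the plainly serializable parts of DatasetInfo.
info_dict = {
    "description": info.description,
    "citation": info.citation,
    "license": info.license,
    "size_in_bytes": info.size_in_bytes,
    "features": str(info.features),  # would need a proper encoding in practice
}

with open("dataset_info.yaml", "w") as f:
    yaml.safe_dump(info_dict, f, default_flow_style=False)
```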
Also, since datasets on HF are actually Git repos, as @sebpaquet proposed, we can save their URL and commit_sha for lineage.
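For instance, huggingface_hub's HfApi exposes the resolved revision's commit hash (a sketch, with a placeholder repo id):

```python
from huggingface_hub import HfApi

api = HfApi()
repo = api.dataset_info("dataset_name")  # placeholder repo id
lineage = {
    "url": f"https://huggingface.co/datasets/{repo.id}",
    "commit_sha": repo.sha,  # commit hash of the revision that was resolved
}
```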
What this PR currently saves (or dataset info, if we decide to save that instead) is information about the processing source.
If we were to also record the current transformation operation, as proposed by @tscholak:
"I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see PR #113). I think the features field should be modified based on that..."
I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.
Additionally, the command can provide a human-readable description of what it does, which can be included in the info.
This way, we will have an exact record of how this dataset was produced, ensuring proper lineage tracking.
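A sketch of what such a record could contain (the helper and field names are hypothetical):

```python
import subprocess
import sys

def collect_lineage_record(config_path: str) -> dict:
    """Hypothetical helper: record exactly how this dataset was produced."""
    def _git(*args: str) -> str:
        return subprocess.check_output(["git", *args], text=True).strip()

    return {
        "command": " ".join(sys.argv),  # exact CLI invocation
        "config": config_path,  # path to the saved configuration copy
        "command_git_url": _git("remote", "get-url", "origin"),
        "commit_sha": _git("rev-parse", "HEAD"),
    }
```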
I have now tested it using the builder:
- It seems to be slower, as it downloads scripts and at least partially executes them.
- On the plus side, it was able to read 6 out of the 7 repositories that neither the Croissant format nor the dataset info URL could provide.
How much slower? Shouldn't it be small compared to the whole of prepare?
I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.
I don't think this is enough; we need some backup in case the source somehow disappears.
Thanks @bigximik.
Is this still the case for Fast-LLM's GPT mem-mapped dataset format? Are these fields still useful? My current understanding is that they would not be.
This assumes that we will always and for all times only use hf hub datasets.
Yes, we should store the command config in the output folder. What I'd like to achieve here is for us to have lineage information available while running …
fast_llm/ext_utils/hf_auth.py (Outdated)
@@ -0,0 +1,74 @@
# Copyright 2023 The HuggingFace Team. All rights reserved. |
How big of an addition would an hf_hub dependency be? We already have most of the HF tools as dependencies anyway, and this shouldn't add many sub-dependencies...
OK as-is, but I think we'll need to revisit this later. I'll run the tests, then merge.
✨ Description
This PR adds basic metadata saving for datasets from the HF hub, as well as local datasets that have metadata in Croissant format, during gpt_memmap data processing.
Closes #123
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
- index.txt to be saved only on rank 0.
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact (not applicable)
📊 Performance Impact Details
Should not impact performance.