
Saving of croissant metadata files for HF datasets #142

Merged

8 commits merged into main on Feb 14, 2025
Conversation

bigximik
Contributor

@bigximik bigximik commented Feb 6, 2025

✨ Description

This PR adds basic metadata saving during gpt_memmap data processing, for datasets from the HF Hub and for local datasets that have metadata in the Croissant format.

Closes #123

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. For datasets on the HF Hub, checks whether a Croissant metadata file exists and, if so, downloads it and saves it in the output folder.
  2. For local datasets, checks whether a Croissant metadata file exists and, if so, saves a copy in the output folder.
  3. Fixes index.txt to be saved only on rank 0.
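The rank-0 guard in change 3 can be sketched as a plain function. This is a minimal sketch; the function name and signature are illustrative, not the PR's actual code, and the rank would normally come from the distributed runtime.

```python
import pathlib


def save_index(output_dir: pathlib.Path, file_names: list[str], rank: int) -> None:
    # Only rank 0 writes index.txt, so parallel workers do not clobber it.
    if rank != 0:
        return
    (output_dir / "index.txt").write_text("\n".join(file_names) + "\n")
```

All non-zero ranks return immediately, so the file is written exactly once regardless of world size.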

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed. (not applicable)
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes. (should not impact)

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes. (only new)
  • 🚦 I have tested these changes on GPUs and verified training stability. (not applicable)
  • 🏋️ I have tested the changes on realistic training workloads, if applicable. (not applicable)

Performance Impact (not applicable)

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

Should not impact performance

@bigximik bigximik linked an issue Feb 6, 2025 that may be closed by this pull request
@bigximik bigximik changed the title from "Saving of croissant metadata files fro HF datasets" to "Saving of croissant metadata files for HF datasets" on Feb 6, 2025
Collaborator

@tscholak tscholak left a comment


Looks mostly good, but after thinking about it a bit more I doubt that adding huggingface_hub as a dependency is worth it; see comments.

@bigximik
Contributor Author

bigximik commented Feb 6, 2025

get_token also supports the legacy token path, which we can argue is not relevant to us, as people will most probably be using the latest env variables with the library: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/utils/_hf_folder.py#L62

So, OK, I will extract get_token into something like hf_util.py.

@bigximik
Contributor Author

bigximik commented Feb 7, 2025

@tscholak I have placed get_token in fast_llm/ext_utils/hf_auth.py and renamed it to hf_auth_get_token. This ensures clarity, as internal imports use the from statement, making it clear that this token is specifically for Hugging Face authentication.

@bigximik bigximik requested a review from tscholak February 7, 2025 09:00
token = hf_auth_get_token()
try:
# Retrieve the dataset metadata in croissant format
url = f"https://huggingface.co/api/datasets/{self._config.dataset.path}/croissant"
Collaborator

now that I think about this again, why do we need to go to hf.co and get the croissant metadata from there? why can't we use the hf datasets api?

Contributor Author


It seems that, at the moment at least, the datasets library does not have anything related to Croissant: https://github.com/search?q=repo%3Ahuggingface%2Fdatasets%20croissant&type=code

Collaborator

OK, fair enough. Looks like Croissant metadata can only be retrieved from that endpoint then.

Btw, have you checked this approach:

import datasets

builder = datasets.load_dataset_builder('dataset_name')
dataset_info = builder.info

dataset_info should be of type datasets.info.DatasetInfo and contain:

  • description: A textual description of the dataset.
  • features: The schema of the dataset (column names and types).
  • splits: Information about available splits (e.g., train, test, validation).
  • size_in_bytes: The dataset size.
  • citation: The citation reference.
  • license: The dataset's license.

This looks useful and could be all we need (cc @chrish42). We could convert it into Croissant and save it to disk. The benefit here is that this can be used with any hf dataset on disk as well, including previously downloaded and cached ones.
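If we went the dataset_info route, flattening the fields listed above into a plain dict before writing them to disk could look roughly like this. A sketch only: `info_to_metadata` is a hypothetical helper name, and `info` is assumed to be a datasets.DatasetInfo-like object.

```python
def info_to_metadata(info) -> dict:
    # Flatten the DatasetInfo fields discussed above into a plain dict
    # that can be dumped to YAML or JSON in the output folder.
    splits = getattr(info, "splits", None) or {}
    return {
        "description": getattr(info, "description", None),
        "features": str(getattr(info, "features", None)),
        "splits": {name: split.num_examples for name, split in splits.items()},
        "size_in_bytes": getattr(info, "size_in_bytes", None),
        "citation": getattr(info, "citation", None),
        "license": getattr(info, "license", None),
    }
```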

Furthermore, irrespective of whether we use Croissant or dataset_info, I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see #113). I think the features field should be modified based on that...

Contributor Author

I looked for dataset info via the URL, as described here: Hugging Face Dataset Viewer - Info, instead of using the builder. However, it did not provide any additional information compared to the Croissant format and failed on the same datasets.

I have now tested it using the builder:

  • It seems to be slower, as it downloads dataset scripts and at least partially executes them.
  • On the plus side, it was able to read 6 out of the 7 repositories for which neither the Croissant endpoint nor the dataset info URL worked.

However, if switching to the builder, I don’t see a reason to convert that information to Croissant. The main purpose of Croissant metadata is to be actionable — its recordSet and distribution fields allow actual data loading. So, if using dataset info, it would make more sense to simply save it to a YAML file, for example.

Contributor Author

@bigximik bigximik Feb 7, 2025

Also, since datasets on HF are actually git repos, as @sebpaquet proposed, we can save their url and commit_sha for lineage.

Contributor Author

What this PR currently saves (or dataset info, if we decide to save that instead) is information about the processing source.

If we were to also record the current transformation operation, as proposed by @tscholak:

I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see PR #113). I think the features field should be modified based on that...

I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.

Additionally, the command can provide a human-readable description of what it does, which can be included in the info.

This way, we will have an exact record of how this dataset was produced, ensuring proper lineage tracking.

Collaborator

I have now tested it using the builder:

  • It seems to be slower, as it downloads dataset scripts and at least partially executes them.
  • On the plus side, it was able to read 6 out of the 7 repositories for which neither the Croissant endpoint nor the dataset info URL worked.

How much slower? Shouldn't it be small compared to the whole of prepare?

I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.

I don't think this is enough; we need some backup in case the source somehow disappears.

@tscholak
Collaborator

tscholak commented Feb 7, 2025

Thanks @bigximik.

I have now tested it using the builder: It seems to be slower, as it downloads scripts and at least partially executes them.

fast-llm prepare accesses the dataset already. If it is already downloaded to the cache, it will use that. Otherwise it will download the whole thing. You should be able to tap into that, and there will be no additional overhead.

The main purpose of Croissant metadata is to be actionable — its recordSet and distribution fields allow actual data loading.

Is this still the case for Fast-LLM's GPT mem-mapped dataset format, are these fields still useful? My current understanding is that they would not be.

Also as datasets on HF are actually git repos, as @sebpaquet proposed we can save their url and commit_sha for lineage.

This assumes that we will always and for all times only use hf hub datasets. fast-llm prepare can be pointed at any folder on disk that datasets can interpret as a dataset. We want lineage also in this case. I feel it would help at this point to ask @chrish42 how he defines lineage.

I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation. Additionally, the command can provide a human-readable description of what it does, which can be included in the info.

Yes, we should store the command config in the output folder. fast-llm train does it; fast-llm prepare should do it too. However, I think this doesn't address the lineage problem. As far as I understand @chrish42's goals for lineage right now, lineage is not about full reproducibility of all steps leading from the original input data to the final data we ultimately train on. It is about tracking and preserving dataset names, owners, and licenses.

What I'd like to achieve here is for us to have lineage information available while running fast-llm train. Since fast-llm train doesn't access the datasets directly but only their "prepared" version, we need to preserve and store that information when calling fast-llm prepare.

@@ -0,0 +1,74 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
Collaborator

How big of an addition would a huggingface_hub dependency be? We already have most of the HF tools as dependencies anyway, and this shouldn't add many sub-dependencies...

Collaborator

@jlamypoirier jlamypoirier left a comment


OK as-is, but I think we'll need to revisit later. I'll run tests, then merge.

@jlamypoirier jlamypoirier merged commit de7b2d8 into main Feb 14, 2025
2 checks passed
@jlamypoirier jlamypoirier deleted the croissant branch February 14, 2025 21:54
Linked issue: [feat] Add metadata to datasets