
Saving of croissant metadata files for HF datasets #142

Merged

8 commits merged into main on Feb 14, 2025
Conversation

bigximik
Contributor

@bigximik bigximik commented Feb 6, 2025

✨ Description

This PR adds basic metadata saving during gpt_memmap data processing, for datasets from the HF Hub and for local datasets that have metadata in the Croissant format.

Closes #123

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. For datasets on the HF Hub, checks whether a Croissant metadata file exists and, if so, downloads it and saves it in the output folder.
  2. For local datasets, checks whether a Croissant metadata file exists and, if so, saves a copy in the output folder.
  3. Fixes index.txt to be saved only on rank 0.
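The rank-0 guard in change 3 can be sketched as a plain function. This is a minimal sketch; the function name and signature are illustrative, not the PR's actual code, and the rank would normally come from the distributed runtime.

```python
import pathlib


def save_index(output_dir: pathlib.Path, file_names: list[str], rank: int) -> None:
    # Only rank 0 writes index.txt, so parallel workers do not clobber it.
    if rank != 0:
        return
    (output_dir / "index.txt").write_text("\n".join(file_names) + "\n")
```

All non-zero ranks return immediately, so the file is written exactly once regardless of world size.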

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed. (not applicable)
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes. (should not impact)

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes. (only new)
  • 🚦 I have tested these changes on GPUs and verified training stability. (not applicable)
  • 🏋️ I have tested the changes on realistic training workloads, if applicable. (not applicable)

Performance Impact (not applicable)

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

Should not impact performance

@bigximik bigximik linked an issue Feb 6, 2025 that may be closed by this pull request
@bigximik bigximik changed the title from "Saving of croissant metadata files fro HF datasets" to "Saving of croissant metadata files for HF datasets" on Feb 6, 2025
Collaborator

@tscholak tscholak left a comment


Looks mostly good, but after thinking about it a bit more I doubt that adding huggingface_hub as a dependency is worth it; see comments.

@bigximik
Contributor Author

bigximik commented Feb 6, 2025

get_token also supports the legacy token path, which we can argue is not relevant to us, as people will most probably be using the latest env variables with the library: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/utils/_hf_folder.py#L62

So, OK, I will extract get_token into something like hf_util.py.

@bigximik
Contributor Author

bigximik commented Feb 7, 2025

@tscholak I have placed get_token in fast_llm/ext_utils/hf_auth.py and renamed it to hf_auth_get_token. This ensures clarity, as internal imports use the from statement, making it clear that this token is specifically for Hugging Face authentication.

@bigximik bigximik requested a review from tscholak February 7, 2025 09:00
token = hf_auth_get_token()
try:
# Retrieve the dataset metadata in croissant format
url = f"https://huggingface.co/api/datasets/{self._config.dataset.path}/croissant"
Collaborator

now that I think about this again, why do we need to go to hf.co and get the croissant metadata from there? why can't we use the hf datasets api?

Contributor Author


It seems that, at the moment at least, the datasets library does not have anything related to Croissant: https://github.com/search?q=repo%3Ahuggingface%2Fdatasets%20croissant&type=code

Collaborator

OK, fair enough. Looks like Croissant metadata can only be retrieved from that endpoint then.

Btw, have you checked this approach:

import datasets

builder = datasets.load_dataset_builder('dataset_name')
dataset_info = builder.info

dataset_info should be of type datasets.info.DatasetInfo and contain:

  • description: A textual description of the dataset.
  • features: The schema of the dataset (column names and types).
  • splits: Information about available splits (e.g., train, test, validation).
  • size_in_bytes: The dataset size.
  • citation: The citation reference.
  • license: The dataset's license.

This looks useful and could be all we need (cc @chrish42). We could convert it into Croissant and save it to disk. The benefit here is that this can be used with any hf dataset on disk as well, including previously downloaded and cached ones.
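If we went the dataset_info route, flattening the fields listed above into a plain dict before writing them to disk could look roughly like this. A sketch only: `info_to_metadata` is a hypothetical helper name, and `info` is assumed to be a datasets.DatasetInfo-like object.

```python
def info_to_metadata(info) -> dict:
    # Flatten the DatasetInfo fields discussed above into a plain dict
    # that can be dumped to YAML or JSON in the output folder.
    splits = getattr(info, "splits", None) or {}
    return {
        "description": getattr(info, "description", None),
        "features": str(getattr(info, "features", None)),
        "splits": {name: split.num_examples for name, split in splits.items()},
        "size_in_bytes": getattr(info, "size_in_bytes", None),
        "citation": getattr(info, "citation", None),
        "license": getattr(info, "license", None),
    }
```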

Furthermore, irrespective of whether we use Croissant or dataset_info, I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see #113). I think the features field should be modified based on that...

Contributor Author

I looked for dataset info via the URL, as described here: Hugging Face Dataset Viewer - Info, instead of using the builder. However, it did not provide any additional information compared to the Croissant format and failed on the same datasets.

I have now tested it using the builder:

  • It seems to be slower, as it downloads dataset scripts and at least partially executes them.
  • On the plus side, it was able to read 6 out of the 7 repositories for which neither the Croissant endpoint nor the dataset info URL worked.

However, if switching to the builder, I don’t see a reason to convert that information to Croissant. The main purpose of Croissant metadata is to be actionable — its recordSet and distribution fields allow actual data loading. So, if using dataset info, it would make more sense to simply save it to a YAML file, for example.

Contributor Author

@bigximik bigximik Feb 7, 2025

Also, since datasets on HF are actually git repos, as @sebpaquet proposed, we can save their url and commit_sha for lineage.

Contributor Author

What this PR currently saves (or dataset info, if we decide to save that instead) is information about the processing source.

If we were to also record the current transformation operation, as proposed by @tscholak:

I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see PR #113). I think the features field should be modified based on that...

I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.

Additionally, the command can provide a human-readable description of what it does, which can be included in the info.

This way, we will have an exact record of how this dataset was produced, ensuring proper lineage tracking.

Collaborator

I have now tested it using the builder:

  • It seems to be slower, as it downloads dataset scripts and at least partially executes them.
  • On the plus side, it was able to read 6 out of the 7 repositories for which neither the Croissant endpoint nor the dataset info URL worked.

How much slower? Shouldn't it be small compared to the whole of prepare?

I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.

I don't think this is enough; we need some backup in case the source somehow disappears.

@tscholak
Collaborator

tscholak commented Feb 7, 2025

Thanks @bigximik.

I have now tested it using the builder: It seems to be slower, as it downloads scripts and at least partially executes them.

fast-llm prepare accesses the dataset already. If it is already downloaded to the cache, it will use that. Otherwise it will download the whole thing. You should be able to tap into that, and there will be no additional overhead.

The main purpose of Croissant metadata is to be actionable — its recordSet and distribution fields allow actual data loading.

Is this still the case for Fast-LLM's GPT mem-mapped dataset format, are these fields still useful? My current understanding is that they would not be.

Also as datasets on HF are actually git repos, as @sebpaquet proposed we can save their url and commit_sha for lineage.

This assumes that we will always and for all times only use hf hub datasets. fast-llm prepare can be pointed at any folder on disk that datasets can interpret as a dataset. We want lineage also in this case. I feel it would help at this point to ask @chrish42 how he defines lineage.

I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation. Additionally, the command can provide a human-readable description of what it does, which can be included in the info.

Yes, we should store the command config in the output folder. fast-llm train does it; fast-llm prepare should do it too. However, I think this doesn't address the lineage problem. As far as I understand @chrish42's goals for lineage right now, lineage is not about full reproducibility of all steps leading from the original input data to the final data we ultimately train on. It is about tracking and preserving dataset names, owners, and licenses.

What I'd like to achieve here is for us to have lineage information available while running fast-llm train. Since fast-llm train doesn't access the datasets directly but only their "prepared" version, we need to preserve and store that information when calling fast-llm prepare.

@@ -0,0 +1,74 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
Collaborator

How big of an addition would a huggingface_hub dependency be? We already have most of the HF tools as dependencies anyway, and this shouldn't add many sub-dependencies...

Collaborator

@jlamypoirier jlamypoirier left a comment


OK as-is, but I think we'll need to revisit later. I'll run tests, then merge.

@jlamypoirier jlamypoirier merged commit de7b2d8 into main Feb 14, 2025
2 checks passed
@jlamypoirier jlamypoirier deleted the croissant branch February 14, 2025 21:54
Linked issue: [feat] Add metadata to datasets