Saving of croissant metadata files for HF datasets #142
Merged
Commits (8)
65b9f70
basic implementation of saving of croissant metadata for datasets in p…
bigximik a513e3f
fix run write under rank 0 only
bigximik db3fcb0
added comments
bigximik 5e1ae9d
removed dependence on huggingface_hub by implementing get_token in th…
bigximik 7b00042
Merge branch 'main' into croissant
jlamypoirier b2a5e04
Use hf hub
jlamypoirier 6e13771
misc
jlamypoirier c1013f0
Improve tests
jlamypoirier
@@ -0,0 +1,73 @@
import json
import pathlib
import tempfile

import numpy as np
import pytest

from fast_llm.data.dataset.gpt.memmap import GPTMemmapDataset
from fast_llm.data.dataset.gpt.sampled import GPTSample
from fast_llm.data.preparator.gpt_memmap.config import MEMMAP_DTYPES, GPTMemmapDatasetPreparatorConfig
from fast_llm.data.preparator.gpt_memmap.prepare import GPTMemmapDatasetPreparator


def get_preparator(output_path: str, dataset_path_name: str) -> GPTMemmapDatasetPreparator:
    config = GPTMemmapDatasetPreparatorConfig.from_dict(
        {
            "output_path": output_path,
            "dataset": {"path": dataset_path_name},
            "tokenizer": {"path": "no_tokenizer"},
        },
        {},
    )
    return config.get_dataset_preparator_class()(config=config)


@pytest.mark.parametrize("dtype", MEMMAP_DTYPES.values())
def test_write_memmap_dataset(dtype):
    documents = [GPTSample(np.random.randint(1000, size=np.random.randint(1, 100)).astype(dtype)) for _ in range(100)]
    with tempfile.TemporaryDirectory() as temp_dir:
        prefix = pathlib.Path(temp_dir)
        GPTMemmapDataset.write_dataset(prefix=prefix, documents=documents)
        dataset = GPTMemmapDataset(name="foo", prefix=prefix)
        for i, document in enumerate(documents):
            assert np.array_equal(
                dataset.get(i).token_ids, document.token_ids, equal_nan=True
            ), f"Mismatch for document {i}: {document} != {dataset.get(i)}."


def test_load_metadata_from_hub():
    with tempfile.TemporaryDirectory(suffix="test") as local_folder:
        get_preparator(local_folder, "lhoestq/demo1")._save_croissant_metadata()
        croissant_path = pathlib.Path(local_folder) / "croissant.json"
        assert croissant_path.is_file()
        metadata = json.load(croissant_path.open("r"))
        assert metadata["url"] == "https://huggingface.co/datasets/lhoestq/demo1"


def test_absent_metadata_from_hub():
    with tempfile.TemporaryDirectory(suffix="test") as local_folder:
        get_preparator(local_folder, "allenai/dolma")._save_croissant_metadata()
        assert not (pathlib.Path(local_folder) / "croissant.json").is_file()


def test_load_metadata_local():
    with (
        tempfile.TemporaryDirectory(suffix="dataset") as dataset_folder,
        tempfile.TemporaryDirectory(suffix="test") as local_folder,
    ):
        metadata = {"name": "test"}
        json.dump(metadata, (pathlib.Path(dataset_folder) / "croissant.json").open("w"))
        get_preparator(local_folder, dataset_folder)._save_croissant_metadata()
        croissant_path = pathlib.Path(local_folder) / "croissant.json"
        assert croissant_path.is_file()
        assert json.loads(croissant_path.open("r").read()) == metadata


def test_absent_metadata_local():
    with (
        tempfile.TemporaryDirectory(suffix="dataset") as dataset_folder,
        tempfile.TemporaryDirectory(suffix="test") as local_folder,
    ):
        get_preparator(local_folder, dataset_folder)._save_croissant_metadata()
        assert not (pathlib.Path(local_folder) / "croissant.json").is_file()
now that I think about this again, why do we need to go to hf.co and get the croissant metadata from there? why can't we use the hf datasets api?
it is according to their how-to: https://huggingface.co/docs/dataset-viewer/en/croissant
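For context, that how-to boils down to a single GET request per dataset. A minimal sketch of it (the dataset id is just an example, error handling is elided, and the token is only needed for gated or private repos):

import json
import urllib.request

def fetch_croissant(dataset_id: str, token: str | None = None) -> dict:
    # Endpoint documented at https://huggingface.co/docs/dataset-viewer/en/croissant
    request = urllib.request.Request(f"https://huggingface.co/api/datasets/{dataset_id}/croissant")
    if token is not None:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)

metadata = fetch_croissant("lhoestq/demo1")
print(metadata["url"])  # https://huggingface.co/datasets/lhoestq/demo1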
seems, at the moment at least, they do not have anything for it in the datasets library: https://github.com/search?q=repo%3Ahuggingface%2Fdatasets%20croissant&type=code
ok, fair enough. Looks like the original Croissant metadata can only be retrieved from that endpoint then.
Btw, have you checked this approach: dataset_info should be of type datasets.info.DatasetInfo and contain:
- description: A textual description of the dataset.
- features: The schema of the dataset (column names and types).
- splits: Information about available splits (e.g., train, test, validation).
- size_in_bytes: The dataset size.
- citation: The citation reference.
- license: The dataset's license.
This looks useful and could be all we need (cc @chrish42). We could convert it into Croissant and save it to disk. The benefit here is that this can be used with any hf dataset on disk as well, including previously downloaded and cached ones.
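For concreteness, a sketch of that approach with the standard datasets API (the dataset id is illustrative; only metadata is fetched, no data files):

from datasets import load_dataset_builder

# The builder exposes a datasets.info.DatasetInfo object without downloading the data.
info = load_dataset_builder("lhoestq/demo1").info

print(info.description)    # textual description
print(info.features)       # column names and types
print(info.splits)         # e.g. train / test / validation
print(info.size_in_bytes)  # dataset size
print(info.citation)       # citation reference
print(info.license)        # license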
Furthermore, irrespective of whether we use Croissant or dataset_info, I'm wondering how we want to handle the features field. fast-llm prepare only keeps the main text field (and optionally a list of character spans that should be excluded from the loss, see #113). I think the features field should be modified based on that...
I looked for dataset info via the URL, as described here: Hugging Face Dataset Viewer - Info, instead of using the builder. However, it did not provide any additional information compared to the Croissant format and failed on the same datasets.
I have now tested it using the builder.
However, if switching to the builder, I don't see a reason to convert that information to Croissant. The main purpose of Croissant metadata is to be actionable: its recordSet and distribution fields allow actual data loading. So, if using dataset info, it would make more sense to simply save it to a YAML file, for example.
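A minimal sketch of that YAML alternative (dataset id, field selection, and file name are all illustrative; some fields can be None depending on the dataset):

import yaml
from datasets import load_dataset_builder

info = load_dataset_builder("lhoestq/demo1").info

# DatasetInfo is not directly YAML-serializable, so pick out the fields of interest.
with open("dataset_info.yaml", "w") as f:
    yaml.safe_dump(
        {
            "description": info.description,
            "citation": info.citation,
            "license": info.license,
            "size_in_bytes": info.size_in_bytes,
            "features": {name: str(feature) for name, feature in (info.features or {}).items()},
        },
        f,
    )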
Also, since datasets on HF are actually git repos, we can save their URL and commit_sha for lineage, as @sebpaquet proposed.
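A possible sketch of that with huggingface_hub (the repo id is illustrative):

from huggingface_hub import HfApi

repo_id = "lhoestq/demo1"
# .sha is the commit the dataset repo currently points to; storing it pins the exact revision.
lineage = {
    "url": f"https://huggingface.co/datasets/{repo_id}",
    "commit_sha": HfApi().dataset_info(repo_id).sha,
}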
What this PR currently saves (or dataset info, if we decide to save that instead) is information about the processing source.
If we were to also record the transformation performed by the current operation, as proposed by @tscholak, I propose that instead of trying to define a format to describe the transformation, we simply store the command, configuration, command Git URL, and commit SHA used for the data transformation.
Additionally, the command can provide a human-readable description of what it does, which can be included in the info.
This way, we will have an exact record of how this dataset was produced, ensuring proper lineage tracking.
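A hypothetical shape for such a record; every field name, path, and URL below is invented for illustration:

import json
import subprocess
import sys

record = {
    "command": " ".join(sys.argv),  # the exact command line that was run
    "config": "prepare_config.yaml",  # path to the configuration used
    "repo_url": "https://github.com/ServiceNow/Fast-LLM",  # assumed command Git URL
    "commit_sha": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
    "description": "Tokenized the main text field into the GPT memmap format.",  # human-readable summary
}

with open("lineage.json", "w") as f:
    json.dump(record, f, indent=2)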
How much slower? Shouldn't it be small compared to the whole of prepare?
I don't think this is enough; we need some backup in case the source somehow disappears.