[feat] Generate concatenated datasets automatically #120

jlamypoirier · 2025-01-17T19:11:06Z

🧐 Problem Description

Concatenated datasets (see #104) can involve hundreds of dataset files, and we don't want to write them all in the config file.
See also #25

💡 Proposed Solution

We want to define a new config class that turns into a concatenation of many memmap datasets at runtime, based on the content of a directory. I'm considering three options for how to do it in practice:

1. Parse a directory at runtime for .idx and .bin files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
2. Use an index file to list all files to load. (Like the existing json dataset, but simpler.) A bit more complicated, but safer. Could add an (optional?) length entry for extra safety.
3. A mix of the two, where the config accepts either a directory or an index file.
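
A minimal sketch of what option 3 could look like, under stated assumptions. The function names (`find_prefixes`, `resolve_datasets`) and the index file layout (a JSON list of objects with a `prefix` key and an optional `length` key) are hypothetical and only illustrate the idea; they are not the Fast-LLM API or its file format.

```python
import json
import pathlib


def find_prefixes(directory):
    """Return sorted dataset prefixes that have both a .idx and a .bin file."""
    directory = pathlib.Path(directory)
    idx = {p.with_suffix("") for p in directory.rglob("*.idx")}
    bin_ = {p.with_suffix("") for p in directory.rglob("*.bin")}
    # A prefix with only one of the two files is most likely a mistake.
    unpaired = idx ^ bin_
    if unpaired:
        raise ValueError(f"Unpaired .idx/.bin files: {sorted(map(str, unpaired))}")
    return sorted(idx)


def resolve_datasets(path):
    """Accept either a directory (option 1) or an index file (option 2)."""
    path = pathlib.Path(path)
    if path.is_dir():
        # Directory mode: no expected length is known, so leave it unset.
        return [{"prefix": str(p), "length": None} for p in find_prefixes(path)]
    # Index mode (assumed layout): prefixes are relative to the index file.
    entries = json.loads(path.read_text())
    return [
        {"prefix": str(path.parent / e["prefix"]), "length": e.get("length")}
        for e in entries
    ]
```

In this sketch the optional `length` is only carried through; a real loader could compare it against the length read from each memmap dataset and fail on a mismatch.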

🔄 Alternatives Considered

Write them all in the config file. This isn't great...
Keep the current dataset blending and/or json dataset format. Also not great because it doesn't reflect the underlying reality that we have one big dataset, not many small ones.

📈 Potential Benefits

Greatly simplified and safer dataset definition


tscholak · 2025-01-17T20:23:48Z

This is great, and I agree that this is the natural next step after merging #104.
To simplify our approach without the burden of maintaining extensive configuration files, have you considered adopting the Croissant metadata format? Croissant is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It's designed to enhance dataset discoverability and interoperability across ML frameworks. HuggingFace Datasets supports Croissant, see for instance here: https://huggingface.co/datasets/ServiceNow/CORD-1k-Docintel/blob/main/README.md. Croissant could streamline our dataset management and align with existing tools. Could you explore if Croissant would fit our needs here?

jlamypoirier · 2025-01-19T21:04:22Z

That's a promising idea, though it's a lot more involved than what I'm trying to achieve here, and I'm not too familiar with Croissant (it would be a good opportunity to involve other team members). An important question is whether we would like to adopt it before and/or after dataset preparation. How about we discuss it in our next team meeting?

In the meantime I'll prepare a quick POC (solution 3) so we have something to work with.

bigximik · 2025-01-21T21:14:17Z

To eliminate or at least reduce the risk of folder changes going unnoticed, the directory parsing step could write the generated file list to a file instead of keeping it only in memory. Alternatively, another type of signature could be created for the folder. When the folder is scanned a second time, an exception is raised if the files differ. The dataset can only be created after the change has been examined and approved.
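
A rough sketch of the signature idea, assuming the signature is a SHA-256 hash over the relative paths and sizes of the .idx and .bin files (hashing sizes rather than contents is an assumption made to keep the scan cheap). The names `folder_signature` and `check_folder` and the manifest layout are hypothetical.

```python
import hashlib
import json
import pathlib


def folder_signature(directory):
    """Hash the relative path and size of every .idx/.bin file in the folder."""
    directory = pathlib.Path(directory)
    digest = hashlib.sha256()
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.suffix in (".idx", ".bin"):
            relative = path.relative_to(directory).as_posix()
            digest.update(f"{relative}:{path.stat().st_size}\n".encode())
    return digest.hexdigest()


def check_folder(directory, manifest_path):
    """Record the signature on the first scan; raise if a later scan differs."""
    manifest_path = pathlib.Path(manifest_path)
    current = folder_signature(directory)
    if not manifest_path.exists():
        # First scan: store the signature as the approved state.
        manifest_path.write_text(json.dumps({"signature": current}))
        return current
    approved = json.loads(manifest_path.read_text())["signature"]
    if approved != current:
        raise RuntimeError(
            f"Contents of {directory} changed since {manifest_path} was written"
        )
    return current
```

Under this sketch, approving a change amounts to reviewing the folder and then deleting or regenerating the manifest file.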

tscholak · 2025-01-21T21:35:06Z

@jlamypoirier, which PRs need to be merged first before @bigximik can work on this?

jlamypoirier · 2025-01-21T21:51:25Z

For the implementation, it goes after #104 and #121 (only need to get the new tests to pass), and I'd like to do the POC I mentioned above first. But it should already be possible to plan and start, especially since much of the work will be in the preparator, which these PRs don't really touch.

Also, this issue is strictly about auto concatenation; I'll make a separate one for metadata.

jlamypoirier added the enhancement label Jan 17, 2025

tscholak assigned bigximik Jan 21, 2025

jlamypoirier assigned jlamypoirier and unassigned bigximik Jan 21, 2025

jlamypoirier mentioned this issue Jan 21, 2025

[feat] Add metadata to datasets #123

Open

jlamypoirier added the Priority label Jan 22, 2025

jlamypoirier mentioned this issue Jan 22, 2025

Auto dataset concatenation prototype #128

Merged

8 tasks

jlamypoirier closed this as completed in #128 Jan 27, 2025

jlamypoirier mentioned this issue Jan 28, 2025

[feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25

Closed

