Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Generate concatenated datasets automatically #120

Closed
jlamypoirier opened this issue Jan 17, 2025 · 5 comments · Fixed by #128
Closed

[feat] Generate concatenated datasets automatically #120

jlamypoirier opened this issue Jan 17, 2025 · 5 comments · Fixed by #128
Assignees
Labels
enhancement New feature or request Priority

Comments

@jlamypoirier
Copy link
Collaborator

jlamypoirier commented Jan 17, 2025

🧐 Problem Description

Concatenated datasets (see #104) can involve hundreds of dataset files, and we don't want to write them all in the config file.
See also #25

💡 Proposed Solution

We want to define a new config class that turns into a concatenation of many memmap datasets at runtime, based on the content of a directory. Considering 3 options for how to do it in practice:

  1. Parse a directory at runtime for .idx and .bin files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
  2. Use an index file to list all files to load. (Like the existing json dataset, but simpler.) A bit more complicated, but safer. Could add an (optional?) length entry for extra safety.
  3. A mix of the two, where the config accepts either a directory or an index file.

🔄 Alternatives Considered

Write them all in the config file. This isn't great...
Keep the current dataset blending an/or json dataset format. Also not great because it doesn't reflect the underlying reality that we have one big dataset, not many small ones.

📈 Potential Benefits

Greatly simplified and safer dataset definition

@jlamypoirier jlamypoirier added the enhancement New feature or request label Jan 17, 2025
@tscholak
Copy link
Collaborator

This is great, and I agree that this is the natural next step after merging #104.
To simplify our approach without the burden of maintaining extensive configuration files, have you considered adopting the Croissant metadata format? Croissant is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It's designed to enhance dataset discoverability and interoperability across ML frameworks. HuggingFace Datasets supports Croissant, see for instance here: https://huggingface.co/datasets/ServiceNow/CORD-1k-Docintel/blob/main/README.md. Croissant could streamline our dataset management and align with existing tools. Could you explore if Croissant would fit our needs here?

@jlamypoirier
Copy link
Collaborator Author

That's a promising idea, though it's a lot more involved than what I'm trying to achieve here, and I'm not too familiar with Croissant (it would be a good opportunity to involve other team members). An important question is whether we would like to adopt it before and/or after dataset preparation. How about we discuss it in our next team meeting?

In the meantime I'll prepare a quick POC (solution 3) so we have something to work with.

@bigximik
Copy link

To eliminate or at least reduce the risk of folder changes going unnoticed, the directory parsing process can generate a list not just in memory but also write it to a file, for example. Alternatively, another type of signature can be created for the folder. When the folder is scanned a second time, an exception is raised if the files differ. Only after examination and approval will it allow the creation of the dataset.

@tscholak
Copy link
Collaborator

@jlamypoirier, which PRs need to be merged first before @bigximik can work on this?

@jlamypoirier
Copy link
Collaborator Author

For the implementation, it goes after #104 and #121 (only need to get the new tests to pass), and I'd like to do the POC I mentioned above first. But it should already be possible to plan and start, especially since much of the work will be in the preparator which there PRs don't really touch.

Also this issue is strictly about auto concatenation, I'll make a separate one for metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants