[feat] Generate concatenated datasets automatically #120
Comments
This is great, and I agree that this is the natural next step after merging #104.
That's a promising idea, though it's a lot more involved than what I'm trying to achieve here, and I'm not too familiar with Croissant (it would be a good opportunity to involve other team members). An important question is whether we would like to adopt it before and/or after dataset preparation. How about we discuss it in our next team meeting? In the meantime I'll prepare a quick POC (solution 3) so we have something to work with.
To eliminate, or at least reduce, the risk of folder changes going unnoticed, the directory parsing step could write the file list it generates to a file instead of keeping it only in memory. Alternatively, some other kind of signature could be created for the folder. When the folder is scanned a second time, an exception is raised if the files differ, and the dataset can only be created after the change has been examined and approved.
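A minimal sketch of the manifest idea above, not an actual implementation from this project. The names `scan_directory`, `load_verified_file_list`, `DatasetDirectoryChanged` and the `manifest.json` file name are all hypothetical, and the signature here is just file names plus sizes (an assumption; content hashes would be stricter but slow on large `.bin` files).

```python
import json
from pathlib import Path


class DatasetDirectoryChanged(Exception):
    """Raised when a dataset directory no longer matches its saved manifest."""


def scan_directory(root: Path) -> dict[str, int]:
    """Map each .idx/.bin file (path relative to root) to its size in bytes."""
    return {
        path.relative_to(root).as_posix(): path.stat().st_size
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix in {".idx", ".bin"}
    }


def load_verified_file_list(root: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return the file list, raising if it differs from the approved manifest."""
    manifest_path = root / manifest_name
    current = scan_directory(root)
    if not manifest_path.exists():
        # First scan: record the listing as the approved state.
        manifest_path.write_text(json.dumps(current, indent=2, sort_keys=True))
        return list(current)
    approved = json.loads(manifest_path.read_text())
    if approved != current:
        # Entries that were added, removed, or changed size since approval.
        changed = {name for name, _ in set(approved.items()) ^ set(current.items())}
        raise DatasetDirectoryChanged(
            f"{root} differs from {manifest_name}: {sorted(changed)}"
        )
    return list(current)
```

In this sketch, approving a change simply means deleting or regenerating the manifest after reviewing the reported differences; a real workflow would probably want a more explicit approval step.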
@jlamypoirier, which PRs need to be merged first before @bigximik can work on this?
For the implementation, it goes after #104 and #121 (only need to get the new tests to pass), and I'd like to do the POC I mentioned above first. But it should already be possible to plan and start, especially since much of the work will be in the preparator, which these PRs don't really touch. Also, this issue is strictly about auto concatenation; I'll make a separate one for metadata.
🧐 Problem Description
Concatenated datasets (see #104) can involve hundreds of dataset files, and we don't want to write them all in the config file.
See also #25
💡 Proposed Solution
We want to define a new config class that turns into a concatenation of many memmap datasets at runtime, based on the content of a directory. We are considering 3 options for how to do it in practice:
.idx and .bin files to generate the full dataset. Simple, but risky because accidental modifications in the directory would go unnoticed.
🔄 Alternatives Considered
Write them all in the config file. This isn't great...
Keep the current dataset blending and/or JSON dataset format. Also not great because it doesn't reflect the underlying reality that we have one big dataset, not many small ones.
📈 Potential Benefits
Greatly simplified and safer dataset definition