[feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25
Comments
That seems like something we want, but I'd like to clarify what the problem is exactly. We already support Megatron-style blended datasets; it's what we started with and are still using. The problem came when we started using really big datasets that need to be split into multiple files. We don't really support that, so as a hack we decided to treat these as separate datasets. But that means we end up with hundreds of "datasets" in a hierarchy where there are actual "datasets" with meaningful probabilities, and actual files with probability = dataset_prob * (file_tokens / dataset_tokens). The json format, concatenate_datasets.py and mix_dataset.py are all hacks to help us with that. So what we really want is to allow for datasets that span multiple files. My suggestion would be to make a concatenation wrapper for the dataset class. Then there are the details of how multi-file datasets are configured. Providing a directory in the config is an option, but I think it would be safer to keep some index file, e.g. a yaml file containing a list of data paths.
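A concatenation wrapper along those lines might look roughly like the sketch below. This is only an illustration of the idea: the class name `ConcatenatedDataset` and the `__len__` / `__getitem__` interface are assumptions, not Fast-LLM's actual dataset API.

```python
import numpy as np

class ConcatenatedDataset:
    """Illustrative sketch: expose several file-level datasets as one logical
    dataset, so blending probabilities can be defined per dataset, not per file."""

    def __init__(self, datasets):
        self._datasets = datasets
        # Cumulative sample counts, used to map a global index to (file, local index).
        sizes = [len(d) for d in datasets]
        self._offsets = np.concatenate([[0], np.cumsum(sizes)])

    def __len__(self):
        return int(self._offsets[-1])

    def __getitem__(self, index):
        # Find which underlying file-level dataset holds this global index.
        file_idx = int(np.searchsorted(self._offsets, index, side="right")) - 1
        return self._datasets[file_idx][index - self._offsets[file_idx]]
```

With a wrapper like this, the blending layer only ever sees dataset-level probabilities, and the per-file bookkeeping stays internal.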
I understand where you're coming from, but this also creates more work for the user. It would be great if this didn't require any special tooling, because right now concatenate_dataset.py depends on …
Yes, that's what I'm suggesting. Once datasets are concatenated instead of blended, there is no more need for probabilities. Token counts and other metadata could be useful as an extra safety check, but we can leave that out if it's too much trouble.
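As a concrete illustration of that index-file idea, the per-dataset index could be as small as a list of paths with optional token counts for validation. This is a hypothetical layout, not an existing Fast-LLM format:

```yaml
# Hypothetical index file for one multi-file dataset.
dataset: my_dataset
files:
  - path: shard_000.bin
    tokens: 1200000000   # optional, only used as a sanity check
  - path: shard_001.bin
    tokens: 1180000000
```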
We still need the ability to define target proportions for individual datasets (which are themselves split into many mmap'ed bin files), though, so I think Fast-LLM's config classes should be changed to allow for this:
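For illustration, the config classes could take a shape along these lines. This is a hypothetical sketch; the `weight` and `path` fields are invented names, not the actual proposal:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetConfig:
    # One logical dataset that may span many memory-mapped bin files.
    name: str
    weight: float  # target proportion of this dataset in the blend
    path: str      # directory, or an index file listing the dataset's bin files

@dataclass
class DataConfig:
    # The blend is defined over logical datasets; per-file probabilities are
    # derived internally from token counts instead of being written by hand.
    datasets: list[DatasetConfig] = field(default_factory=list)
```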
This would then eliminate the need for mix_datasets.py as well.
Yes, that's a good plan. Right now it's taking a …
This one, right? (Fast-LLM/fast_llm/data/config.py, line 150 in f9880e2)
I suppose that can work. I'm also not sure about Fast-LLM/fast_llm/data/config.py, lines 14 to 24 in f9880e2: what's …?
I think list is the default megatron-like format. We should be able to come up with some better way to define dataset formats, maybe something modular and model-dependent like I did with checkpoints? That would have the added benefit of making things a lot easier for custom models (#5, #20).
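One possible reading of that "modular and model-dependent" idea is a small registry of dataset formats that models or plugins can extend. The sketch below is purely hypothetical and is not based on Fast-LLM's checkpoint machinery:

```python
from abc import ABC, abstractmethod

# Hypothetical registry of dataset formats, keyed by name, so custom models
# could register their own loaders without touching the core config code.
_DATASET_FORMATS: dict[str, type["DatasetFormat"]] = {}

class DatasetFormat(ABC):
    name: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        _DATASET_FORMATS[cls.name] = cls

    @abstractmethod
    def build(self, paths: list[str]):
        """Construct a dataset object from the given data files."""

class ListFormat(DatasetFormat):
    # Stand-in for the current megatron-like default ("list") format.
    name = "list"

    def build(self, paths: list[str]):
        # A real implementation would open the memory-mapped files;
        # returning the paths keeps the sketch self-contained.
        return list(paths)

def get_format(name: str) -> "DatasetFormat":
    return _DATASET_FORMATS[name]()
```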
🧐 Problem Description
Currently, creating a training dataset with Fast-LLM involves a multi-step, cumbersome process: tokenizing the data into memory-mapped bin files, running concatenate_dataset.py to build a json index with per-file token counts, and running mix_dataset.py to turn dataset-level weights into the per-file probabilities the training config expects.
This workflow is inefficient, error-prone (e.g., issue #71), and less user-friendly than other LLM training frameworks that offer simpler, more integrated data-loading mechanisms, for example a Megatron-style BlendedDataset that supports combining datasets with different weights directly (sketched below). The additional steps required by Fast-LLM add complexity and reduce competitiveness in terms of data handling and preparation.
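For context, the blending pattern referenced above is conceptually simple: each sample is drawn from one of the member datasets with probability proportional to its weight. The following is a generic sketch of that idea, not any particular framework's implementation:

```python
import numpy as np

class WeightedBlend:
    """Generic sketch of weighted dataset blending."""

    def __init__(self, datasets, weights, num_samples, seed=0):
        assert len(datasets) == len(weights)
        self.datasets = datasets
        probs = np.array(weights, dtype=np.float64)
        probs /= probs.sum()
        rng = np.random.default_rng(seed)
        # For every global sample index, pick a source dataset according to the weights.
        self.dataset_index = rng.choice(len(datasets), size=num_samples, p=probs)
        # Walk through each chosen dataset in order, wrapping around if it is exhausted.
        self.sample_index = np.zeros(num_samples, dtype=np.int64)
        counters = [0] * len(datasets)
        for i, d in enumerate(self.dataset_index):
            self.sample_index[i] = counters[d] % len(datasets[d])
            counters[d] += 1

    def __len__(self):
        return len(self.dataset_index)

    def __getitem__(self, idx):
        return self.datasets[self.dataset_index[idx]][self.sample_index[idx]]
```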
💡 Proposed Solution
Integrate the Preprocessing Step into Fast-LLM: handle the concatenation and re-weighting of multi-file datasets inside Fast-LLM at data-loading time, instead of requiring concatenate_dataset.py and mix_dataset.py to be run beforehand.
Revamp the Dataset Configuration Format: let the config list datasets, each possibly spanning many files or a whole directory, together with their target proportions.
Example Configuration
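Something along these lines, for example. The layout and key names (`datasets`, `weight`, `path`) are illustrative assumptions rather than an existing Fast-LLM schema:

```yaml
data:
  datasets:
    - name: web
      weight: 0.7
      path: /datasets/web/       # directory or index file with many .bin shards
    - name: code
      weight: 0.2
      path: /datasets/code/
    - name: books
      weight: 0.1
      path: /datasets/books/
```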
With this setup, Fast-LLM would automatically distribute each dataset's proportion across the files found under the specified paths.
🔄 Alternatives Considered
Keep the Existing Script-Based Workflow: continue relying on concatenate_dataset.py and mix_dataset.py, which leaves the current usability and error-proneness issues unaddressed.
Provide a Standalone Utility for Merging and Weighting: a single external tool would reduce the number of steps, but dataset preparation would still live outside Fast-LLM's configuration.
📈 Potential Benefits
Improved Usability: the entire data mixture is described in a single configuration, with no intermediate scripts or generated index files.
Enhanced Competitiveness: data handling on par with other LLM training frameworks that already offer integrated, weighted data loading.
Streamlined Workflow: fewer manual steps and fewer opportunities for error.
📝 Additional Context
Integrating preprocessing directly into Fast-LLM would bring it closer to modern LLM frameworks that offer unified dataset preparation. It would also facilitate future support for custom dataset implementations, such as streaming Parquet files from cloud storage (e.g., S3). For reference, frameworks like Mosaic's Composer already provide flexible data-loading options, making integration smoother.