Add data cleaning in fast-llm prepare, concept #210
Demonstration and Discussion of Concept
This code serves as a demonstration and discussion of the proposed concept.
Key Decisions:
Usage:
To prepare a dataset, simply call:
    dataset = self._config.processors.apply(dataset)
The config would be something like this:
    processors:
      steps:
        - type: length_filter
          field: text
          min_length_chars: 100
          max_length_chars: 100000
        - ...
@jlamypoirier, @tscholak What do you think?
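To make the config above concrete, here is a minimal sketch of what a `length_filter` step might do when it is applied. It is an illustration under assumptions, not the code from this PR: `LengthFilterStep` and `apply_steps` are hypothetical names, rows are plain dicts rather than a Hugging Face dataset, and treating both length bounds as inclusive is a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LengthFilterStep:
    """Hypothetical stand-in for one `length_filter` entry under `steps`."""

    field: str = "text"
    min_length_chars: int = 0
    max_length_chars: Optional[int] = None

    def keep(self, row: dict) -> bool:
        # Assumption: both bounds are inclusive.
        length = len(row[self.field])
        if length < self.min_length_chars:
            return False
        return self.max_length_chars is None or length <= self.max_length_chars


def apply_steps(rows: list, steps: list) -> list:
    """Run each step in order, keeping only the rows every step accepts."""
    for step in steps:
        rows = [row for row in rows if step.keep(row)]
    return rows


if __name__ == "__main__":
    step = LengthFilterStep(field="text", min_length_chars=100, max_length_chars=100000)
    sample = [{"text": "x" * 50}, {"text": "x" * 500}]
    print(len(apply_steps(sample, [step])))
```

Running steps strictly in the listed order mirrors the `steps:` list in the config; the real `processors.apply` may work differently.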
Hi @bigximik, thanks for putting this together. I appreciate the careful thinking you've put in here!
However, let's simplify significantly. The goal isn't to design a general, modular pipeline system. It's just about adding these very specific cleaning filters. We already know exactly what filters we want and in what order.
Here's what I'd suggest:
We can always refactor if more complexity is actually required down the line, but let's get this feature shipped quickly and cleanly first. Can you please move forward by just implementing the concrete filters directly?
I tend to mostly agree with @tscholak on this one. I think it's a good idea to make processors into modular Config/Configurable pairs since it's relatively simple and non-controversial, but anything more than that requires a bit more thinking and is probably a bit premature at this stage.
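One way to picture a Config/Configurable pair for a single processor is sketched below. This is only an illustration of the pattern: `ProcessorConfig`, `Processor`, `LengthFilterConfig`, and `LengthFilter` are hypothetical names written with plain dataclasses, and they are not Fast-LLM's actual base classes or signatures.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar


@dataclass
class ProcessorConfig:
    """Hypothetical config half: plain, serializable settings only."""

    field: str = "text"


ConfigT = TypeVar("ConfigT", bound=ProcessorConfig)


class Processor(Generic[ConfigT]):
    """Hypothetical configurable half: runtime behaviour driven by a config."""

    def __init__(self, config: ConfigT):
        self._config = config

    def keep(self, row: dict) -> bool:
        raise NotImplementedError


@dataclass
class LengthFilterConfig(ProcessorConfig):
    min_length_chars: int = 0
    max_length_chars: int = 100000


class LengthFilter(Processor[LengthFilterConfig]):
    def keep(self, row: dict) -> bool:
        # Assumption: both bounds are inclusive.
        length = len(row[self._config.field])
        return self._config.min_length_chars <= length <= self._config.max_length_chars


if __name__ == "__main__":
    length_filter = LengthFilter(LengthFilterConfig(min_length_chars=3, max_length_chars=5))
    print(length_filter.keep({"text": "abcd"}))
```

Keeping the settings in the config class and the behaviour in the paired class means each filter can be validated and serialized on its own, without committing to a general pipeline abstraction.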
fast_llm/data/preparator/hf_processors/implementations/agregator.py
…ept for clamav, integration not tested
Created basic implementation based on feedback.
Next Steps
✨ Description
part of #112
Closes #
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.