🎯 Goal (What & Why)
Add a comprehensive data cleaning stage to the `fast-llm prepare` command.
`fast-llm prepare` currently downloads and tokenizes HuggingFace datasets into Fast-LLM's `.bin`/`.idx` format using a distributed `torchrun` setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.
This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.
🚀 Execution Plan
Step 1: What is the smallest working version?
- Extend `fast-llm prepare` to apply a modular, configurable cleaning pipeline during preprocessing.
- All cleaning steps must be integrated into the existing `torchrun` CPU-only distributed setup, preserving parallelism.
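The modular pipeline from Step 1 could be sketched as follows. This is a minimal illustration only: `DocumentFilter` and `CleaningPipeline` are hypothetical names, not existing Fast-LLM API, and the per-filter counters anticipate the logging requirement below.

```python
from dataclasses import dataclass, field
from typing import Callable

# A filter maps a document's text to a keep/drop decision (True = keep).
DocumentFilter = Callable[[str], bool]

@dataclass
class CleaningPipeline:
    """Applies named filters in order; counts removals per filter for logging."""
    filters: dict[str, DocumentFilter]
    removed: dict[str, int] = field(default_factory=dict)

    def __call__(self, text: str) -> bool:
        for name, keep in self.filters.items():
            if not keep(text):
                self.removed[name] = self.removed.get(name, 0) + 1
                return False
        return True

# Example: a max-length filter (in characters), per the first required filter.
pipeline = CleaningPipeline(filters={"max_length": lambda t: len(t) <= 100_000})
docs = ["short doc", "x" * 200_000]
kept = [d for d in docs if pipeline(d)]
```

Each torchrun rank could run its own pipeline instance over its shard, with counters reduced across ranks at the end; that keeps the existing parallelism intact.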
Step 2: Required cleaning filters (all must be implemented):
- Length filtering
- Remove documents exceeding a configurable max length (in characters or tokens).
- n-gram repetition
- Remove documents with ≥N repeated n-grams (default: 32), as in OLMo-2.
- Frequency-based filtering
- Remove documents where:
- The most frequent word exceeds X% of total tokens (default: 30%).
- The top-2 most frequent words together exceed Y% of total tokens (default: 50%).
- Binary content filtering
- Remove documents that contain mostly binary data.
- Numerical content filtering
- Remove documents with a high percentage of numeric tokens (configurable threshold; default: 50%).
- PII redaction
- Integrate Microsoft Presidio to detect sensitive personal information and either redact it or remove the affected documents.
- Malware removal
- Integrate ClamAV to scan documents and remove any that trigger detections.
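Two of the filters above, frequency-based and n-gram repetition, might look like this minimal sketch using the stated defaults (30% / 50% / 32). Whitespace tokenization is a simplification of whatever tokenizer `fast-llm prepare` actually applies, and the function names are illustrative.

```python
from collections import Counter

def passes_frequency_filter(text: str, top1: float = 0.30, top2: float = 0.50) -> bool:
    """Keep the document unless its most frequent word exceeds `top1` of all
    tokens, or its two most frequent words together exceed `top2`."""
    tokens = text.split()
    if not tokens:
        return True
    counts = Counter(tokens).most_common(2)
    if counts[0][1] / len(tokens) > top1:
        return False
    if len(counts) > 1 and (counts[0][1] + counts[1][1]) / len(tokens) > top2:
        return False
    return True

def passes_ngram_filter(text: str, n: int = 3, max_repeats: int = 32) -> bool:
    """Drop the document if any n-gram repeats `max_repeats` or more times."""
    tokens = text.split()
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return not any(count >= max_repeats for count in ngrams.values())
```

Both run in O(document length) per document, so they should fit the distributed CPU-only loop without becoming a bottleneck.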
All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
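A hypothetical YAML fragment showing how these thresholds could be exposed (all key names are illustrative, not an actual Fast-LLM config schema):

```yaml
cleaning:
  max_length: 100000            # characters or tokens, per the length filter
  ngram_repetition:
    n: 3
    max_repeats: 32             # OLMo-2-style default
  frequency:
    top1_max_fraction: 0.30
    top2_max_fraction: 0.50
  numeric_max_fraction: 0.50
  binary_filter: true
  pii:
    backend: presidio
    action: redact              # or "remove"
  malware:
    backend: clamav
```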
📌 Acceptance Criteria
- All listed filters are implemented and integrated into `fast-llm prepare`.
- Cleaning is fully configurable via both the CLI and YAML config files.
- The implementation works with the existing distributed CPU setup (torchrun + Gloo).
- Performance remains acceptable.
- Logs report how many documents are removed by each filter.
- Code is tested and documented.
- PR includes a performance/impact summary and example CLI usage.
🛠️ Project Management
- Assign the project to the Fast-LLM project.
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).
- Assign an owner when opening the issue.