[feat] Add data cleaning in fast-llm prepare #112

🎯 Goal (What & Why)

Add a comprehensive data cleaning stage to the fast-llm prepare command.

fast-llm prepare currently downloads and tokenizes HuggingFace datasets into Fast-LLM's .bin / .idx format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.

This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Extend fast-llm prepare to apply a modular and configurable cleaning pipeline during preprocessing (a possible filter interface is sketched after this list).
  • All cleaning steps must be integrated into the existing torchrun CPU-only distributed setup, preserving parallelism.
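
One way to keep the pipeline modular is a small per-document filter interface that each torchrun worker applies to its own shard, so cleaning parallelizes with the existing setup. A minimal sketch; `CleaningFilter` and `clean_shard` are hypothetical names for illustration, not existing Fast-LLM APIs:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class CleaningFilter(ABC):
    """One configurable cleaning step (hypothetical interface)."""

    name: str  # used for per-filter removal counters and logging

    @abstractmethod
    def keep(self, text: str) -> bool:
        """Return True if the document should be kept."""


def clean_shard(docs: Iterable[str], filters: list[CleaningFilter]) -> Iterator[str]:
    """Apply every filter to one worker's shard of documents.

    Each rank cleans its own shard independently, so no cross-rank
    communication is needed and torchrun parallelism is preserved.
    """
    for doc in docs:
        if all(f.keep(doc) for f in filters):
            yield doc
```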

Step 2: Required cleaning filters (all must be implemented; illustrative sketches for several filters follow the list):

  • Length filtering
    • Remove documents exceeding a configurable max length (in characters or tokens).
  • n-gram repetition
    • Remove documents containing any n-gram repeated at least N times (default: N = 32), following OLMo-2.
  • Frequency-based filtering
    • Remove documents where:
      • The most frequent word accounts for more than X% of all tokens (default: 30%).
      • The two most frequent words together account for more than Y% of all tokens (default: 50%).
  • Binary content filtering
    • Remove documents that contain mostly binary data.
  • Numerical content filtering
    • Remove documents where the fraction of numeric tokens exceeds a configurable threshold (default: 50%).
  • PII redaction
    • Integrate Microsoft Presidio to detect PII and either redact it in place or remove the affected documents.
  • Malware removal
    • Integrate ClamAV to scan documents and remove any that trigger detections.
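
For concreteness, below are minimal illustrative implementations of the n-gram repetition, word-frequency, and numeric-content heuristics, plus PII redaction through Presidio's analyzer/anonymizer API. The whitespace tokenization, helper names, and default n-gram size are assumptions made for this sketch, not existing Fast-LLM code; thresholds would come from the config described below.

```python
import re
from collections import Counter

# Real Microsoft Presidio packages: presidio-analyzer, presidio-anonymizer.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

_TOKEN_RE = re.compile(r"\S+")


def has_repeated_ngram(text: str, n: int = 3, max_repeats: int = 32) -> bool:
    """True if any n-gram of whitespace tokens occurs at least max_repeats times."""
    tokens = _TOKEN_RE.findall(text)
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return bool(counts) and max(counts.values()) >= max_repeats


def frequent_word_fractions(text: str) -> tuple[float, float]:
    """Fractions of all tokens taken by the top-1 and top-2 most frequent words."""
    tokens = [t.lower() for t in _TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0, 0.0
    top = Counter(tokens).most_common(2)
    top1 = top[0][1] / len(tokens)
    top2 = sum(count for _, count in top) / len(tokens)
    return top1, top2


def numeric_fraction(text: str) -> float:
    """Fraction of tokens that are plain integers or decimals."""
    tokens = _TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    return sum(t.replace(".", "", 1).isdigit() for t in tokens) / len(tokens)


_analyzer = AnalyzerEngine()
_anonymizer = AnonymizerEngine()


def redact_pii(text: str) -> str:
    """Replace detected PII spans with placeholders such as <PERSON>."""
    results = _analyzer.analyze(text=text, language="en")
    return _anonymizer.anonymize(text=text, analyzer_results=results).text
```

For the ClamAV step, scanning through a long-lived clamd daemon rather than spawning a scanner process per document is the usual way to keep per-document overhead low in a many-worker setup.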

All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
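
A minimal shape for those counters, assuming the hypothetical `CleaningFilter` interface sketched earlier; each rank tracks its own counts and logs them at the end of the run (or they could be summed across ranks):

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


class FilterStats:
    """Per-filter removal counters for one rank (hypothetical helper)."""

    def __init__(self) -> None:
        self.removed: Counter[str] = Counter()
        self.total = 0

    def keep(self, doc: str, filters: Iterable[CleaningFilter]) -> bool:
        """Run one document through all filters, charging the first rejection."""
        self.total += 1
        for f in filters:
            if not f.keep(doc):
                self.removed[f.name] += 1
                return False
        return True

    def report(self) -> str:
        """One log line per filter, e.g. 'ngram_repetition: removed 120/10000'."""
        return "\n".join(
            f"{name}: removed {count}/{self.total}"
            for name, count in self.removed.most_common()
        )
```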

📌 Acceptance Criteria

  • All listed filters are implemented and integrated into fast-llm prepare.
  • Cleaning is fully configurable, both via CLI and YAML config files.
  • The implementation works with the existing distributed CPU setup (torchrun + Gloo).
  • Preprocessing throughput remains acceptable with all filters enabled.
  • Logs report how many documents are removed by each filter.
  • Code is tested and documented.
  • PR includes a performance/impact summary and example CLI usage.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
