[feat] Add data cleaning in fast-llm prepare #112

🎯 Goal (What & Why)

Add a comprehensive data cleaning stage to the fast-llm prepare command.

fast-llm prepare currently downloads and tokenizes HuggingFace datasets into Fast-LLM's .bin / .idx format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.

This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Extend fast-llm prepare to apply a modular and configurable cleaning pipeline during preprocessing (a possible filter interface is sketched after this list).
  • All cleaning steps must be integrated into the existing torchrun CPU-only distributed setup, preserving parallelism.
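
One way to keep the pipeline modular is a small per-document filter interface that each torchrun worker applies to its own shard, so cleaning parallelizes with the existing setup. A minimal sketch; `CleaningFilter` and `clean_shard` are hypothetical names for illustration, not existing Fast-LLM APIs:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class CleaningFilter(ABC):
    """One configurable cleaning step (hypothetical interface)."""

    name: str  # used for per-filter removal counters and logging

    @abstractmethod
    def keep(self, text: str) -> bool:
        """Return True if the document should be kept."""


def clean_shard(docs: Iterable[str], filters: list[CleaningFilter]) -> Iterator[str]:
    """Apply every filter to one worker's shard of documents.

    Each rank cleans its own shard independently, so no cross-rank
    communication is needed and torchrun parallelism is preserved.
    """
    for doc in docs:
        if all(f.keep(doc) for f in filters):
            yield doc
```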

Step 2: Required cleaning filters (all must be implemented; illustrative sketches for several filters follow the list):

  • Length filtering
    • Remove documents exceeding a configurable max length (in characters or tokens).
  • n-gram repetition
    • Remove documents containing any n-gram repeated at least N times (default: N = 32), following OLMo-2.
  • Frequency-based filtering
    • Remove documents where:
      • The most frequent word accounts for more than X% of all tokens (default: 30%).
      • The two most frequent words together account for more than Y% of all tokens (default: 50%).
  • Binary content filtering
    • Remove documents that contain mostly binary data.
  • Numerical content filtering
    • Remove documents where the fraction of numeric tokens exceeds a configurable threshold (default: 50%).
  • PII redaction
    • Integrate Microsoft Presidio to detect PII and either redact it in place or remove the affected documents.
  • Malware removal
    • Integrate ClamAV to scan documents and remove any that trigger detections.
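
For concreteness, below are minimal illustrative implementations of the n-gram repetition, word-frequency, and numeric-content heuristics, plus PII redaction through Presidio's analyzer/anonymizer API. The whitespace tokenization, helper names, and default n-gram size are assumptions made for this sketch, not existing Fast-LLM code; thresholds would come from the config described below.

```python
import re
from collections import Counter

# Real Microsoft Presidio packages: presidio-analyzer, presidio-anonymizer.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

_TOKEN_RE = re.compile(r"\S+")


def has_repeated_ngram(text: str, n: int = 3, max_repeats: int = 32) -> bool:
    """True if any n-gram of whitespace tokens occurs at least max_repeats times."""
    tokens = _TOKEN_RE.findall(text)
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return bool(counts) and max(counts.values()) >= max_repeats


def frequent_word_fractions(text: str) -> tuple[float, float]:
    """Fractions of all tokens taken by the top-1 and top-2 most frequent words."""
    tokens = [t.lower() for t in _TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0, 0.0
    top = Counter(tokens).most_common(2)
    top1 = top[0][1] / len(tokens)
    top2 = sum(count for _, count in top) / len(tokens)
    return top1, top2


def numeric_fraction(text: str) -> float:
    """Fraction of tokens that are plain integers or decimals."""
    tokens = _TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    return sum(t.replace(".", "", 1).isdigit() for t in tokens) / len(tokens)


_analyzer = AnalyzerEngine()
_anonymizer = AnonymizerEngine()


def redact_pii(text: str) -> str:
    """Replace detected PII spans with placeholders such as <PERSON>."""
    results = _analyzer.analyze(text=text, language="en")
    return _anonymizer.anonymize(text=text, analyzer_results=results).text
```

For the ClamAV step, scanning through a long-lived clamd daemon rather than spawning a scanner process per document is the usual way to keep per-document overhead low in a many-worker setup.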

All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
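
A minimal shape for those counters, assuming the hypothetical `CleaningFilter` interface sketched earlier; each rank tracks its own counts and logs them at the end of the run (or they could be summed across ranks):

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


class FilterStats:
    """Per-filter removal counters for one rank (hypothetical helper)."""

    def __init__(self) -> None:
        self.removed: Counter[str] = Counter()
        self.total = 0

    def keep(self, doc: str, filters: Iterable[CleaningFilter]) -> bool:
        """Run one document through all filters, charging the first rejection."""
        self.total += 1
        for f in filters:
            if not f.keep(doc):
                self.removed[f.name] += 1
                return False
        return True

    def report(self) -> str:
        """One log line per filter, e.g. 'ngram_repetition: removed 120/10000'."""
        return "\n".join(
            f"{name}: removed {count}/{self.total}"
            for name, count in self.removed.most_common()
        )
```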

📌 Acceptance Criteria

  • All listed filters are implemented and integrated into fast-llm prepare.
  • Cleaning is fully configurable, both via CLI and YAML config files.
  • The implementation works with the existing distributed CPU setup (torchrun + Gloo).
  • Preprocessing throughput remains acceptable with all filters enabled.
  • Logs report how many documents are removed by each filter.
  • Code is tested and documented.
  • PR includes a performance/impact summary and example CLI usage.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
