
Add data cleaning in fast-llm prepare, concept #210


Draft: wants to merge 2 commits into base: main

Conversation

bigximik (Contributor)

✨ Description

Part of #112

Closes #

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

bigximik (Contributor Author) commented Mar 26, 2025

Demonstration and Discussion of Concept

This code serves as a demonstration and discussion of the proposed concept.

Key Decisions:

  • Avoiding runnable: I skip using runnable for the config and configurable for the processor. All steps operate directly on HF datasets.Dataset.
  • Introducing Apply Interface: A new interface, Apply, will be implemented. It is HF-centric and allows applying any required processing to each dataset shard, be it filter, map, or any other step supported by datasets.
  • Step Specification Approach: The approach used in datasets is followed: step types are specified in the config file by a type field.
  • Aggregator Class: An aggregator class sequentially invokes the processing steps; in the preparator config class it is called processors.
  • Hollow Doc Length Filter: An empty doc length filter step is implemented to demonstrate what steps could look like (a sketch follows after the config example below).

Usage:

To prepare a dataset, simply call:

```python
dataset = self._config.processors.apply(dataset)
```

The config would look something like this:

```yaml
processors:
  steps:
    - type: length_filter
      field: text
      min_length_chars: 100
      max_length_chars: 100000
    - ...
```
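For illustration, here is a minimal sketch of what the length_filter step could look like under this concept; the class name, defaults, and the exact apply signature are assumptions, not the actual implementation:

```python
from dataclasses import dataclass

import datasets


@dataclass
class LengthFilterStep:
    """Hypothetical step: drops documents whose text field falls outside a character-length range."""

    field: str = "text"
    min_length_chars: int = 100
    max_length_chars: int = 100_000

    def apply(self, dataset: datasets.Dataset) -> datasets.Dataset:
        # Dataset.filter keeps rows for which the predicate returns True.
        return dataset.filter(
            lambda example: self.min_length_chars
            <= len(example[self.field])
            <= self.max_length_chars
        )
```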

@jlamypoirier, @tscholak What do you think?

tscholak (Collaborator)

Hi @bigximik, thanks for putting this together. I appreciate the careful thinking you've put in here!

However, let's simplify significantly. The goal isn't to design a general, modular pipeline system. It's just about adding these very specific cleaning filters. We already know exactly what filters we want and in what order.

Here's what I'd suggest:

  • Just write a single function clean_dataset(dataset, config) that explicitly calls each filter step in sequence (a rough sketch of this shape follows after this list).
  • Each filter can be toggled with a simple boolean in the config. No need for abstract registries or complicated type-based dispatch.
  • Each filter is just a simple, testable function that uses .filter or .map directly on the HuggingFace dataset.
  • Let's keep the configuration as simple as possible (just booleans and thresholds).
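To make the suggestion concrete, here is a rough sketch of that shape, assuming filters operate on a HuggingFace datasets.Dataset; the filter names and config fields are illustrative, not from this PR:

```python
import datasets


def remove_empty(dataset: datasets.Dataset) -> datasets.Dataset:
    # A filter is just a plain function that calls Dataset.filter directly.
    return dataset.filter(lambda ex: len(ex["text"].strip()) > 0)


def filter_by_length(dataset: datasets.Dataset, min_chars: int, max_chars: int) -> datasets.Dataset:
    return dataset.filter(lambda ex: min_chars <= len(ex["text"]) <= max_chars)


def clean_dataset(dataset: datasets.Dataset, config) -> datasets.Dataset:
    # Explicitly call each filter in a fixed sequence; booleans in the config toggle them.
    if config.remove_empty:
        dataset = remove_empty(dataset)
    if config.filter_by_length:
        dataset = filter_by_length(dataset, config.min_length_chars, config.max_length_chars)
    return dataset
```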

We can always refactor if more complexity is actually required down the line, but let's get this feature shipped quickly and cleanly first.

Can you please move forward by just implementing the concrete filters directly?

@jlamypoirier (Collaborator) left a comment

I tend to mostly agree with @tscholak on this one. I think it's a good idea to make processors into modular Config/Configurable pairs, since that is relatively simple and non-controversial, but anything more than that requires a bit more thinking and is probably premature at this stage.

bigximik (Contributor Author) commented Apr 1, 2025

Created a basic implementation based on the feedback.

  • Word splitting and number detection are implemented using regex.
  • Binary data detection is based on non-printable characters (a sketch of both heuristics follows after this list).
  • Presidio Model Management: Needs proper handling of model downloading and loading. Currently, it relies on spaCy’s internal package installation, which won’t work in a frozen Docker image.
  • ClamAV Integration: Needs a proper implementation. The ChatGPT-suggested approach does not work; a proper implementation requires database download, loading, and file scanning, either via an outdated Python wrapper or a direct approach.
  • Unit tests now cover all processors except ClamAV.
  • Need to implement test skipping for PII and malware detection processors.
  • Integration testing with prepare is pending.
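A minimal sketch of the two heuristics mentioned above; the exact patterns and the 5% non-printable threshold are assumptions for illustration:

```python
import re

# Split text into "words" on whitespace; treat purely numeric tokens as numbers.
WORD_RE = re.compile(r"\S+")
NUMBER_RE = re.compile(r"^\d+([.,]\d+)?$")


def numeric_word_fraction(text: str) -> float:
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(1 for word in words if NUMBER_RE.match(word)) / len(words)


def looks_binary(text: str, threshold: float = 0.05) -> bool:
    # Flag a document as binary-like if too many characters are non-printable
    # (tabs, newlines, and carriage returns are allowed).
    if not text:
        return False
    non_printable = sum(1 for ch in text if not ch.isprintable() and ch not in "\t\n\r")
    return non_printable / len(text) > threshold
```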

Next Steps

  • Implement model download and loading management for Presidio.
  • Implement multiprocessing for Presidio engine.
  • Implement ClamAV database management and scanning.
  • Add test skipping for specific processors (see the pytest sketch after this list).
  • Verify integration with prepare.
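One possible way to implement the skipping with stock pytest mechanisms; the module layout and test name are hypothetical:

```python
import shutil

import pytest

# Skips every test in this module when Presidio is not installed.
presidio_analyzer = pytest.importorskip("presidio_analyzer")


@pytest.mark.skipif(shutil.which("clamscan") is None, reason="ClamAV is not installed")
def test_malware_detection_processor():
    ...
```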
