
Add data cleaning in fast-llm prepare, concept #210


Draft: wants to merge 2 commits into base: main

Conversation

bigximik (Contributor)

✨ Description

Part of #112

Closes #

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

bigximik (Contributor Author) commented Mar 26, 2025

Demonstration and Discussion of Concept

This code serves as a demonstration and discussion of the proposed concept.

Key Decisions:

  • Avoiding runnable: I skip using runnable for the config and configurable for the processor. All steps operate directly on HF datasets.Dataset.
  • Introducing Apply Interface: A new interface, Apply, will be implemented. It is HF-centric and allows applying any required processing to each dataset shard, be it filter, map, or any other step supported by datasets.
  • Step Specification Approach: The approach used in datasets is followed: step types are specified in the config file by a type field.
  • Aggregator Class: An aggregator class sequentially invokes the processing steps; in the preparator config class it is called processors.
  • Hollow Doc Length Filter: An empty doc length filter step is implemented to demonstrate what steps could look like (a sketch follows after the config example below).

Usage:

To prepare a dataset, simply call:

```python
dataset = self._config.processors.apply(dataset)
```

The config would look something like this:

```yaml
processors:
  steps:
    - type: length_filter
      field: text
      min_length_chars: 100
      max_length_chars: 100000
    - ...
```
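For illustration, here is a minimal sketch of what the length_filter step could look like under this concept; the class name, defaults, and the exact apply signature are assumptions, not the actual implementation:

```python
from dataclasses import dataclass

import datasets


@dataclass
class LengthFilterStep:
    """Hypothetical step: drops documents whose text field falls outside a character-length range."""

    field: str = "text"
    min_length_chars: int = 100
    max_length_chars: int = 100_000

    def apply(self, dataset: datasets.Dataset) -> datasets.Dataset:
        # Dataset.filter keeps rows for which the predicate returns True.
        return dataset.filter(
            lambda example: self.min_length_chars
            <= len(example[self.field])
            <= self.max_length_chars
        )
```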

@jlamypoirier, @tscholak What do you think?

tscholak (Collaborator)

Hi @bigximik, thanks for putting this together. I appreciate the careful thinking you've put in here!

However, let's simplify significantly. The goal isn't to design a general, modular pipeline system. It's just about adding these very specific cleaning filters. We already know exactly what filters we want and in what order.

Here's what I'd suggest:

  • Just write a single function clean_dataset(dataset, config) that explicitly calls each filter step in sequence (a rough sketch of this shape follows after this list).
  • Each filter can be toggled with a simple boolean in the config. No need for abstract registries or complicated type-based dispatch.
  • Each filter is just a simple, testable function that uses .filter or .map directly on the HuggingFace dataset.
  • Let's keep the configuration as simple as possible (just booleans and thresholds).
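To make the suggestion concrete, here is a rough sketch of that shape, assuming filters operate on a HuggingFace datasets.Dataset; the filter names and config fields are illustrative, not from this PR:

```python
import datasets


def remove_empty(dataset: datasets.Dataset) -> datasets.Dataset:
    # A filter is just a plain function that calls Dataset.filter directly.
    return dataset.filter(lambda ex: len(ex["text"].strip()) > 0)


def filter_by_length(dataset: datasets.Dataset, min_chars: int, max_chars: int) -> datasets.Dataset:
    return dataset.filter(lambda ex: min_chars <= len(ex["text"]) <= max_chars)


def clean_dataset(dataset: datasets.Dataset, config) -> datasets.Dataset:
    # Explicitly call each filter in a fixed sequence; booleans in the config toggle them.
    if config.remove_empty:
        dataset = remove_empty(dataset)
    if config.filter_by_length:
        dataset = filter_by_length(dataset, config.min_length_chars, config.max_length_chars)
    return dataset
```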

We can always refactor if more complexity is actually required down the line, but let's get this feature shipped quickly and cleanly first.

Can you please move forward by just implementing the concrete filters directly?

@jlamypoirier (Collaborator) left a comment

I tend to mostly agree with @tscholak on this one. I think it's a good idea to make processors into modular Config/Configurable pairs, since that is relatively simple and non-controversial, but anything more than that requires a bit more thinking and is probably premature at this stage.

bigximik (Contributor Author) commented Apr 1, 2025

Created a basic implementation based on the feedback.

  • Word splitting and number detection are implemented using regex.
  • Binary data detection is based on non-printable characters (a sketch of both heuristics follows after this list).
  • Presidio Model Management: Needs proper handling of model downloading and loading. Currently, it relies on spaCy’s internal package installation, which won’t work in a frozen Docker image.
  • ClamAV Integration: Needs a proper implementation. The ChatGPT-suggested approach does not work; a proper implementation requires database download, loading, and file scanning, either via an outdated Python wrapper or a direct approach.
  • Unit tests now cover all processors except ClamAV.
  • Need to implement test skipping for PII and malware detection processors.
  • Integration testing with prepare is pending.
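A minimal sketch of the two heuristics mentioned above; the exact patterns and the 5% non-printable threshold are assumptions for illustration:

```python
import re

# Split text into "words" on whitespace; treat purely numeric tokens as numbers.
WORD_RE = re.compile(r"\S+")
NUMBER_RE = re.compile(r"^\d+([.,]\d+)?$")


def numeric_word_fraction(text: str) -> float:
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(1 for word in words if NUMBER_RE.match(word)) / len(words)


def looks_binary(text: str, threshold: float = 0.05) -> bool:
    # Flag a document as binary-like if too many characters are non-printable
    # (tabs, newlines, and carriage returns are allowed).
    if not text:
        return False
    non_printable = sum(1 for ch in text if not ch.isprintable() and ch not in "\t\n\r")
    return non_printable / len(text) > threshold
```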

Next Steps

  • Implement model download and loading management for Presidio.
  • Implement multiprocessing for Presidio engine.
  • Implement ClamAV database management and scanning.
  • Add test skipping for specific processors (see the pytest sketch after this list).
  • Verify integration with prepare.
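One possible way to implement the skipping with stock pytest mechanisms; the module layout and test name are hypothetical:

```python
import shutil

import pytest

# Skips every test in this module when Presidio is not installed.
presidio_analyzer = pytest.importorskip("presidio_analyzer")


@pytest.mark.skipif(shutil.which("clamscan") is None, reason="ClamAV is not installed")
def test_malware_detection_processor():
    ...
```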
