DCLM Style Deduplications #214

Status: Open, 11 commits to be merged into base: main

revbucket

General updates to the dedupe command so that it can deduplicate using a joint paragraph/document flow, in the same way that DCLM does.

Nuanced update list:
Bloom Filter updates:

  • Used a better binary search to find the optimal Bloom filter (BF) size
  • Initialized Bloom filters with multicore parallelism in mind
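
For readers unfamiliar with the sizing step, here is a minimal sketch of a binary search for a Bloom filter size. It is not the code from this PR: the function names, the fixed number of hash functions, and the use of the standard false-positive approximation (1 - e^(-kn/m))^k are all assumptions made for illustration.

```python
import math


def false_positive_rate(num_bits: int, num_items: int, num_hashes: int) -> float:
    """Standard approximation of the Bloom filter false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def smallest_size_in_bits(num_items: int, num_hashes: int, target_fp: float) -> int:
    """Return the smallest bit count whose approximate FP rate is <= target_fp.

    Hypothetical helper: the search relies on the FP rate decreasing
    monotonically as the number of bits grows.
    """
    if not 0.0 < target_fp < 1.0:
        raise ValueError("target_fp must be strictly between 0 and 1")
    lo, hi = 1, 1
    # Double the upper bound until it satisfies the target.
    while false_positive_rate(hi, num_items, num_hashes) > target_fp:
        lo = hi
        hi *= 2
    # Narrow [lo, hi] down to the first size that satisfies the target.
    while lo < hi:
        mid = (lo + hi) // 2
        if false_positive_rate(mid, num_items, num_hashes) <= target_fp:
            hi = mid
        else:
            lo = mid + 1
    return hi
```

Doubling first keeps the search range proportional to the answer, so the caller does not have to guess an upper bound in advance.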

Deduper updates:

  • Switched out threadpool for rayon (cleaner, with equivalent performance)
  • Added optional read/write of the Bloom filter file (usually it's not necessary to save this, right?)
  • Made the main Rust function more modular, so it's easier to add different types of dedupe methods
  • Logged some post-dedupe stats: {sparsity, removal rate}
  • Added DCLM-style deduplication
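
As a rough illustration of what a joint paragraph/document flow can look like, here is a small sketch. It is one plausible reading, not the implementation in this PR or in DCLM: a Python set stands in for the Bloom filter, and the n-gram size, the threshold, the newline paragraph split, and the choice to record only n-grams of kept paragraphs are all assumptions.

```python
def ngrams(tokens, n):
    """Return word n-grams; a paragraph shorter than n yields one short gram."""
    if not tokens:
        return []
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dedupe_document(text, seen, ngram_size=3, threshold=0.8):
    """Drop the whole document, or only its duplicated paragraphs.

    `seen` is a plain set used as a stand-in for a Bloom filter (assumption).
    """
    scored = []
    total = duplicated = 0
    for paragraph in text.split("\n"):
        grams = ngrams(paragraph.split(), ngram_size)
        hits = sum(1 for gram in grams if gram in seen)
        scored.append((paragraph, grams, hits))
        total += len(grams)
        duplicated += hits

    # Document-level check: mostly-seen documents are removed entirely.
    if total > 0 and duplicated / total >= threshold:
        return ""

    # Paragraph-level check: remove only the paragraphs that are mostly seen.
    kept = []
    for paragraph, grams, hits in scored:
        if grams and hits / len(grams) >= threshold:
            continue
        kept.append(paragraph)
        seen.update(grams)  # only kept paragraphs are recorded in this sketch
    return "\n".join(kept)
```

Because hits are counted before anything from the current document is recorded, this sketch does not catch a paragraph that repeats within the same document.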

Other stuff:

  • Modified the dedupe config to expect a "dedupe.dedupe_method" attribute that specifies which type of deduplication to run: {documents, paragraphs, dclm}
  • Updated the tutorial etc. to include this modified config
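
To make the config change concrete, here is a small sketch of how the new attribute might be validated. Only the "dedupe.dedupe_method" path and its three values come from the description above; the nested-dict layout and the helper name are hypothetical and not taken from the dolma codebase.

```python
# The three values named in the PR description.
ALLOWED_DEDUPE_METHODS = ("documents", "paragraphs", "dclm")


def resolve_dedupe_method(config: dict) -> str:
    """Read and validate dedupe.dedupe_method from a nested config dict.

    Hypothetical helper for illustration only.
    """
    try:
        method = config["dedupe"]["dedupe_method"]
    except KeyError:
        raise ValueError("config is missing dedupe.dedupe_method") from None
    if method not in ALLOWED_DEDUPE_METHODS:
        raise ValueError(
            f"dedupe_method must be one of {ALLOWED_DEDUPE_METHODS}, got {method!r}"
        )
    return method


# Example: select the DCLM-style flow.
example_config = {"dedupe": {"dedupe_method": "dclm"}}
selected = resolve_dedupe_method(example_config)
```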

Contributor

@Whattabatt Whattabatt left a comment

Run make style to pass the linter and style check. Also, please add tests in https://github.com/allenai/dolma/blob/main/tests/python/test_deduper.py

@@ -108,7 +127,7 @@ class DeduperConfig:
     dedupe: DedupeConfig = field(help="Deduplication configuration. Required.")
     bloom_filter: BloomFilterConfig = field(help="Bloom filter configuration. Required.")
     processes: int = field(
-        default=1, help="Number of processes to use for deduplication. If 1, no multiprocessing will be used."
+        default=0, help="Number of processes to use for deduplication. If 1, no multiprocessing will be used."
Contributor

Why this change?

Author

0 means we use maximum parallelism (processes becomes the number of cores available). I just assumed that we want this behavior almost all of the time.

This might not actually play nicely with Beaker nodes and how CPUs get allocated here. I'll fall back on ai2-best-practices here.
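
A minimal sketch of the "0 means use every core" convention, written with the allocation concern in mind. The helper name is hypothetical and this is not the PR's code. On Linux, os.sched_getaffinity(0) reports the CPUs the current process is allowed to run on, which can be fewer than the machine total; it may still overcount when the limit is a cgroup CPU quota rather than a CPU set.

```python
import os


def resolve_num_processes(processes: int) -> int:
    """Map the config value to a concrete worker count (hypothetical helper).

    A positive value is used as given; 0 means "use every CPU available
    to this process".
    """
    if processes < 0:
        raise ValueError("processes must be >= 0")
    if processes > 0:
        return processes
    if hasattr(os, "sched_getaffinity"):
        # Linux: count only the CPUs this process may be scheduled on.
        return max(1, len(os.sched_getaffinity(0)))
    # Other platforms: fall back to the machine-wide CPU count.
    return os.cpu_count() or 1
```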

Contributor

Update the help string to reflect this, since it's non-obvious.
