Pipeline for language-distribution based sampling of tokenized datasets #239

ajude2s · 2025-09-26T10:06:04Z

This PR introduces a pipeline for sampling and filtering tokenized datasets based on a specified language distribution, with validation to ensure correctness.

Key features:

- Language-distribution-based sampling: Samples a target number of documents per language from annotated JSONL files, ensuring the dataset reflects the desired language proportions.
- Document ID tracking: Maintains consistent mapping between annotated JSONL files and tokenized .pbin files using the document_id = <file_hash>_<row_index> convention.
- Tokenized dataset filtering: Filters .pbin files to include only the sampled documents while preserving ordering.
- Validation of filtered data: Ensures that the filtered tokenized .pbin data matches the sampled raw data by tokenizing the raw documents and comparing tokens for correctness.
- Folder structure preservation: Maintains the original folder structure for input and output datasets (raw_data, annotated, tokenized) and writes filtered .pbin files to a separate output folder.
- Hash mapping integration: Uses a CSV-based hash mapping to locate JSONL files based on document hashes.
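
As an illustration of the document ID convention and the per-language sampling described above, here is a minimal sketch. It is not the code from this PR: the function names, the `language` and `document_id` record fields, and the choice to split the ID on its last underscore are all assumptions made for the example.

```python
import random
from collections import defaultdict


def make_document_id(file_hash: str, row_index: int) -> str:
    # Convention from the PR description: <file_hash>_<row_index>.
    return f"{file_hash}_{row_index}"


def parse_document_id(document_id: str) -> tuple[str, int]:
    # Assumption: split on the LAST underscore so a hash that itself
    # contains underscores still parses.
    file_hash, _, row = document_id.rpartition("_")
    if not file_hash or not row.isdigit():
        raise ValueError(f"malformed document_id: {document_id!r}")
    return file_hash, int(row)


def sample_by_language(records, distribution, total, seed=0):
    """Sample round(total * share) document IDs per language.

    records: iterable of dicts with hypothetical keys 'document_id' and 'language'.
    distribution: mapping language -> share of the total (e.g. {"en": 0.5}).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_lang = defaultdict(list)
    for rec in records:
        by_lang[rec["language"]].append(rec["document_id"])

    sampled = {}
    for lang, share in distribution.items():
        target = round(total * share)
        pool = sorted(by_lang.get(lang, []))  # sort so input order does not matter
        if target > len(pool):
            raise ValueError(f"not enough documents for {lang}: {len(pool)} < {target}")
        # Return IDs ordered by (file_hash, row_index) for stable downstream filtering.
        sampled[lang] = sorted(rng.sample(pool, target), key=parse_document_id)
    return sampled
```

Seeding a local `random.Random` instead of the global generator is one way to keep repeated runs identical without affecting other code that uses `random`.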

This pipeline guarantees reproducible, language-balanced sampling, produces correctly filtered tokenized datasets, and validates that the tokenized output corresponds exactly to the sampled documents.
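
The filter-then-validate idea can be sketched on in-memory data. This is only an illustration under stated assumptions: tokenized documents are modelled as a plain list of token lists indexed by row (reading and writing real .pbin files is not shown), the helper names are hypothetical, and `toy_tokenize` is a byte-level stand-in for whatever tokenizer the pipeline actually uses.

```python
from typing import Callable, Iterable, Sequence


def filter_tokenized(docs: Sequence[Sequence[int]], keep_rows: Iterable[int]) -> list:
    # Keep only the sampled rows, in their original order.
    keep = set(keep_rows)
    return [tokens for row, tokens in enumerate(docs) if row in keep]


def validate_filtered(
    raw_texts: Sequence[str],
    keep_rows: Iterable[int],
    filtered: Sequence[Sequence[int]],
    tokenize: Callable[[str], Sequence[int]],
) -> bool:
    # Re-tokenize each sampled raw document and compare it with the
    # corresponding filtered entry, token by token.
    expected_rows = sorted(set(keep_rows))
    if len(expected_rows) != len(filtered):
        return False
    return all(
        list(tokenize(raw_texts[row])) == list(tokens)
        for row, tokens in zip(expected_rows, filtered)
    )


def toy_tokenize(text: str) -> list:
    # Stand-in tokenizer for the example: one token per UTF-8 byte.
    return list(text.encode("utf-8"))


raw = ["hello world", "guten Tag", "bonjour"]
tokenized = [toy_tokenize(t) for t in raw]
sampled_rows = [2, 0]  # order of the sample does not affect output order
filtered = filter_tokenized(tokenized, sampled_rows)
is_valid = validate_filtered(raw, sampled_rows, filtered, toy_tokenize)
```

Comparing against a fresh tokenization of the raw text catches both wrong-document and wrong-order errors, at the cost of tokenizing the sample a second time.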

…ized data using scores.
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.

Co-authored-by: Copilot <[email protected]>

BlueCrescent and others added 9 commits July 25, 2025 10:38

chore(filtering): More robust doc id parsing.

81aafa8

Co-authored-by: Copilot <[email protected]>

fix(filtering): Removed duplicate file exists check.

b1d1a46

fix(filtering): fixed docstring

af89182

Co-authored-by: Copilot <[email protected]>

feat(pipeline): add language sampling and filtering pipeline with configuration and job submission scripts

bb65bad

feat(config): update filter and sampling pipeline configurations with new paths and parameters

c057ecd

fix(imports): update import path for sampling_utils to simplify module reference

0389d2b

refactor(pipeline): remove score-based filtering files and configurations

4c82fdf

fix(dependencies): remove unused 'datatrove' dependency from project

88435ff

ajude2s closed this Sep 26, 2025

ajude2s deleted the sampling_pipeline branch September 26, 2025 10:41

ajude2s restored the sampling_pipeline branch September 26, 2025 11:03

ajude2s reopened this Sep 26, 2025

ajude2s marked this pull request as ready for review September 26, 2025 11:53

ajude2s requested a review from fromm-m October 16, 2025 08:46
