ajude2s (Collaborator) commented Sep 26, 2025

This PR introduces a pipeline for sampling and filtering tokenized datasets based on a specified language distribution, with validation to ensure correctness.

Key features:

  • Language-distribution-based sampling: Samples a target number of documents per language from annotated JSONL files, ensuring the dataset reflects the desired language proportions.
  • Document ID tracking: Maintains consistent mapping between annotated JSONL files and tokenized .pbin files using the document_id = <file_hash>_<row_index> convention.
  • Tokenized dataset filtering: Filters .pbin files to include only the sampled documents while preserving ordering.
  • Validation of filtered data: Re-tokenizes the sampled raw documents and compares the resulting tokens against the filtered .pbin contents, so any mismatch between the filtered tokenized data and the sampled raw data is detected.
  • Folder structure preservation: Maintains the original folder structure for input and output datasets (raw_data, annotated, tokenized) and writes filtered .pbin files to a separate output folder.
  • Hash mapping integration: Uses a CSV-based hash mapping to locate JSONL files based on document hashes.
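The sampling step and the document ID convention described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the helper names (`make_document_id`, `split_document_id`, `sample_by_language`), the `(document_id, language)` record shape, and the per-language `targets` dictionary are assumptions made for the example.

```python
import random
from collections import defaultdict


def make_document_id(file_hash: str, row_index: int) -> str:
    """Build an ID following the <file_hash>_<row_index> convention."""
    return f"{file_hash}_{row_index}"


def split_document_id(document_id: str) -> tuple[str, int]:
    """Split on the last underscore so hashes containing '_' still parse."""
    file_hash, row_index = document_id.rsplit("_", 1)
    return file_hash, int(row_index)


def sample_by_language(records, targets, seed=0):
    """Sample up to targets[language] document IDs per language.

    records: iterable of (document_id, language) pairs (assumed shape).
    targets: dict mapping language -> desired number of documents.
    A fixed seed keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)
    by_language = defaultdict(list)
    for document_id, language in records:
        by_language[language].append(document_id)

    sampled = {}
    for language, target in targets.items():
        # Sort candidates first so the result does not depend on input order.
        candidates = sorted(by_language.get(language, []))
        k = min(target, len(candidates))
        picked = rng.sample(candidates, k)
        # Return IDs ordered by (file_hash, row_index) to ease later filtering.
        sampled[language] = sorted(picked, key=split_document_id)
    return sampled
```

If a language has fewer documents than its target, this sketch simply returns all of them; the actual pipeline may handle that case differently.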

The pipeline is designed to give reproducible, language-balanced samples, to produce filtered tokenized datasets that contain only the sampled documents, and to validate that the tokenized output corresponds exactly to those documents.
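As a rough illustration of the filter-then-validate idea, the sketch below works on in-memory lists rather than real .pbin files. The function names, the row-aligned `tokenized_docs` list, and the `tokenize` callable are all hypothetical stand-ins; the actual .pbin reading and writing code in this PR is not shown here.

```python
def filter_tokenized(tokenized_docs, sampled_rows):
    """Keep only the sampled rows, preserving the original document order.

    tokenized_docs: list of token sequences, index-aligned with the raw rows.
    sampled_rows: iterable of row indices selected by the sampling step.
    Returns a list of (row_index, tokens) pairs.
    """
    keep = set(sampled_rows)
    return [(row, tokens) for row, tokens in enumerate(tokenized_docs) if row in keep]


def validate_filtered(filtered, raw_texts, tokenize):
    """Re-tokenize each sampled raw document and compare with the stored tokens.

    Returns the row indices whose tokens do not match (empty list means valid).
    """
    mismatches = []
    for row, tokens in filtered:
        if list(tokenize(raw_texts[row])) != list(tokens):
            mismatches.append(row)
    return mismatches


# Toy usage with a stand-in tokenizer (one token per character code).
def toy_tokenize(text):
    return [ord(ch) for ch in text]


raw_texts = ["ab", "cd", "ef"]
tokenized_docs = [toy_tokenize(text) for text in raw_texts]
filtered = filter_tokenized(tokenized_docs, sampled_rows=[2, 0])
mismatches = validate_filtered(filtered, raw_texts, toy_tokenize)
```

Returning the mismatching row indices, rather than a single boolean, makes it easier to trace a failure back to a specific document ID.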

ajude2s closed this Sep 26, 2025
ajude2s deleted the sampling_pipeline branch September 26, 2025 10:41
ajude2s restored the sampling_pipeline branch September 26, 2025 11:03
ajude2s reopened this Sep 26, 2025
ajude2s marked this pull request as ready for review September 26, 2025 11:53
ajude2s requested a review from fromm-m October 16, 2025 08:46
