[Bug] Pretrain data prep duplicates every document N× when a single input file is split across N shards (row_modulus/row_remainder not applied) #260

@hvnguyenNV

Summary

When nemotron nano3 data prep pretrain (binidx / run_pretrain_pipeline) splits one input parquet file across multiple shards, the per-shard row partition recorded in plan.json (row_modulus / row_remainder) is not applied during tokenization. Every shard re-reads and re-tokenizes the entire file instead of only its row_index % row_modulus == row_remainder subset.
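For reference, the partition rule that plan.json records can be sketched as below. This is a minimal illustration of the intended behavior, not the pipeline's actual code: the function name select_shard_rows and the toy data are hypothetical, and only the predicate row_index % row_modulus == row_remainder comes from the plan.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def select_shard_rows(rows: Iterable[T], row_modulus: int, row_remainder: int) -> Iterator[T]:
    """Yield only rows whose index satisfies row_index % row_modulus == row_remainder."""
    if row_modulus < 1 or not 0 <= row_remainder < row_modulus:
        raise ValueError("need row_modulus >= 1 and 0 <= row_remainder < row_modulus")
    for row_index, row in enumerate(rows):
        if row_index % row_modulus == row_remainder:
            yield row


# Toy file of 10 documents split across 4 shards (hypothetical data).
rows = [f"doc-{i}" for i in range(10)]
shards = [list(select_shard_rows(rows, 4, r)) for r in range(4)]

# With the filter applied, every document lands in exactly one shard.
written = [doc for shard in shards for doc in shard]
print(len(written), sorted(written) == sorted(rows))  # 10 True
```

Without the filter, each of the 4 shards would emit all 10 documents, giving 40 written documents, which is the same row_modulus-fold duplication described here.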

Result: each document is written row_modulus times. In my run this turned a 5.5 GB input dataset into a 497 GiB bin/idx output (exactly 32× document duplication) and affected every dataset whose files were split across shards.

Environment

  • Command: uv run nemotron nano3 data prep pretrain --run <CLUSTER>
  • Stage 0 pretrain, output format binidx, dtype=int32, num_shards=128
  • Tokenizer: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 (vocab 131072), add_eos=true
  • From plan.json > determinism_constraints: ray==2.53.0, transformers==4.57.6, tokenizers==0.22.2

What I observed

Dataset Nemotron-CC-Math-v1-3:

  • Raw input: 4 parquet files, 5.5 GB total, 2,892,030 rows (unique documents).
  • Processed bin/idx output: 497 GiB (533,813,187,456 bytes .bin ÷ 4 bytes/int32 = 133.45 B tokens).
  • The plan assigns each of the 4 files to 32 shards via row_modulus: 32, row_remainder: 0..31.

The receipts show that all 32 shards for the same file report identical counts:

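| shard | row_modulus | row_remainder | num_input_rows | num_documents | total_tokens |
|-------|-------------|---------------|----------------|---------------|---------------|
| 0 | 32 | 0 | 725,078 | 725,078 | 1,045,205,332 |
| 1 | 32 | 1 | 725,078 | 725,078 | 1,045,205,332 |
| 2 | 32 | 2 | 725,078 | 725,078 | 1,045,205,332 |
| … | 32 | … | 725,078 | 725,078 | 1,045,205,332 |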

725,078 is the full row count of part_000007.parquet. With row_modulus=32, each shard should have processed ≈ 22,659 rows, not the whole file.
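A quick way to flag affected shards is to compare each receipt's num_input_rows against the row count its partition should have produced. The sketch below assumes receipts are available as dictionaries using the field names from the table; the "shard" key, the dictionary layout, and both function names are hypothetical.

```python
def expected_shard_rows(file_rows: int, row_modulus: int, row_remainder: int) -> int:
    """Count indices i in [0, file_rows) with i % row_modulus == row_remainder."""
    return (file_rows - row_remainder + row_modulus - 1) // row_modulus


def unpartitioned_shards(receipts: list[dict], file_rows: int) -> list[int]:
    """Return shard ids whose num_input_rows does not match their expected subset."""
    return [
        r["shard"]
        for r in receipts
        if r["num_input_rows"]
        != expected_shard_rows(file_rows, r["row_modulus"], r["row_remainder"])
    ]


FILE_ROWS = 725_078  # full row count of part_000007.parquet, from the report
receipts = [
    {"shard": s, "row_modulus": 32, "row_remainder": s, "num_input_rows": 725_078}
    for s in range(3)  # the three shards listed explicitly in the table
]

print(expected_shard_rows(FILE_ROWS, 32, 0))      # 22659
print(unpartitioned_shards(receipts, FILE_ROWS))  # [0, 1, 2]
```

Because 725,078 = 32 × 22,658 + 22, remainders 0 to 21 should see 22,659 rows and remainders 22 to 31 should see 22,658.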

  • Documents written across 128 shards: 92,544,960 vs. 2,892,030 unique rows → 32.0×.
  • Expected output size: ≈ 4.17 B tokens ≈ 16.7 GB (15.5 GiB), i.e. 32× smaller.
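The figures above can be reproduced from the reported byte and row counts. The snippet below is only arithmetic on numbers quoted in this report; the variable names are illustrative, and it assumes the output shrinks in proportion to the document duplication factor.

```python
BYTES_PER_TOKEN = 4                  # dtype=int32
bin_bytes = 533_813_187_456          # total .bin size reported above
unique_rows = 2_892_030              # rows in the 4 input parquet files
documents_written = 92_544_960       # documents across all 128 shards

total_tokens = bin_bytes // BYTES_PER_TOKEN
duplication, leftover = divmod(documents_written, unique_rows)
expected_tokens = total_tokens // duplication
expected_bytes = expected_tokens * BYTES_PER_TOKEN

print(f"observed: {bin_bytes / 2**30:.1f} GiB, {total_tokens / 1e9:.2f} B tokens")
print(f"duplication: {duplication}x (remainder {leftover})")
print(f"expected: {expected_tokens / 1e9:.2f} B tokens, "
      f"{expected_bytes / 1e9:.1f} GB ({expected_bytes / 2**30:.1f} GiB)")
```

The duplication factor divides evenly (remainder 0), which is consistent with every row being written exactly row_modulus = 32 times.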
