[Bug] Pretrain data prep duplicates every document N× when a single input file is split across N shards (row_modulus/row_remainder not applied)

## Summary

When `nemotron nano3 data prep pretrain` (binidx / `run_pretrain_pipeline`) splits **one input parquet file across multiple shards**, the per-shard row partition recorded in `plan.json` (`row_modulus` / `row_remainder`) is **not applied** during tokenization. Every shard re-reads and re-tokenizes the **entire** file instead of only its `row_index % row_modulus == row_remainder` subset.
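For reference, the partition I expected each shard to apply can be sketched as below. This is only an illustration of the `row_index % row_modulus == row_remainder` rule; `rows_for_shard` is a hypothetical helper of mine, not a function from the pipeline.

```python
def rows_for_shard(num_rows: int, row_modulus: int, row_remainder: int) -> range:
    """Row indices one shard should tokenize (hypothetical helper).

    Selects exactly the rows where row_index % row_modulus == row_remainder.
    """
    if row_modulus <= 0 or not 0 <= row_remainder < row_modulus:
        raise ValueError("need 0 <= row_remainder < row_modulus")
    # Start at the remainder and step by the modulus.
    return range(row_remainder, num_rows, row_modulus)


# Toy file with 10 rows split across 3 shards.
print(list(rows_for_shard(10, 3, 1)))  # [1, 4, 7]

# Across all remainders, every row is covered exactly once.
covered = sorted(i for r in range(3) for i in rows_for_shard(10, 3, r))
print(covered == list(range(10)))  # True
```

With this rule the shards of one file are disjoint and together cover the file once, which is the property the current output violates.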

Result: each document is written **`row_modulus` times**. In my run this turned a 5.5 GB input dataset into **497 GiB** of bin/idx output (exactly **32× duplication**), and it affected every dataset whose files were split across shards.

## Environment

- Command: `uv run nemotron nano3 data prep pretrain --run <CLUSTER>`
- Stage 0 pretrain, output format `binidx`, `dtype=int32`, `num_shards=128`
- Tokenizer: `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` (vocab 131072), `add_eos=true`
- From `plan.json > determinism_constraints`: `ray==2.53.0`, `transformers==4.57.6`, `tokenizers==0.22.2`

## What I observed

Dataset `Nemotron-CC-Math-v1-3`:
- Raw input: 4 parquet files, 5.5 GB total, **2,892,030** rows (unique documents).
- Processed bin/idx output: **497 GiB** (`533,813,187,456` bytes `.bin` ÷ 4 bytes/int32 = **133.45 B tokens**).
- The plan assigns each of the 4 files to 32 shards via `row_modulus: 32`, `row_remainder: 0..31`.

The receipts show that all 32 shards for the same file report identical counts:

| shard | row_modulus | row_remainder | num_input_rows | num_documents | total_tokens |
|------:|------:|------:|------:|------:|------:|
| 0 | 32 | 0 | 725,078 | 725,078 | 1,045,205,332 |
| 1 | 32 | 1 | 725,078 | 725,078 | 1,045,205,332 |
| 2 | 32 | 2 | 725,078 | 725,078 | 1,045,205,332 |
| … | 32 | … | 725,078 | 725,078 | 1,045,205,332 |

`725,078` is the **full** row count of `part_000007.parquet`. With `row_modulus=32`, each shard should have processed ≈ `22,659` rows, not the whole file.
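A quick way to spot affected shards is to compare each receipt's `num_input_rows` with the row count its partition should yield. The sketch below assumes receipts can be loaded as dicts keyed by the column names in the table above; the `shard` key and both helper functions are hypothetical, and the real receipt schema may differ.

```python
def expected_rows(file_rows: int, row_modulus: int, row_remainder: int) -> int:
    """Number of rows with row_index % row_modulus == row_remainder."""
    return len(range(row_remainder, file_rows, row_modulus))


def flag_overread(receipts: list[dict], file_rows: int) -> list[int]:
    """Return shard ids whose num_input_rows does not match their partition."""
    return [
        r["shard"]
        for r in receipts
        if r["num_input_rows"]
        != expected_rows(file_rows, r["row_modulus"], r["row_remainder"])
    ]


# First three rows of the table above (assumed dict layout).
receipts = [
    {"shard": 0, "row_modulus": 32, "row_remainder": 0, "num_input_rows": 725_078},
    {"shard": 1, "row_modulus": 32, "row_remainder": 1, "num_input_rows": 725_078},
    {"shard": 2, "row_modulus": 32, "row_remainder": 2, "num_input_rows": 725_078},
]

print(expected_rows(725_078, 32, 0))     # 22659
print(flag_overread(receipts, 725_078))  # [0, 1, 2]
```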

- Documents written across 128 shards: **92,544,960** vs. **2,892,030** unique rows → **32.0×**.
- Expected output size: ≈ 4.17 B tokens ≈ **16.7 GB**, i.e. ~32× smaller.
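The arithmetic behind these figures can be reproduced with a few lines. The inputs are the numbers reported above; the variable names are mine, and the only assumption is that the `.bin` file holds nothing but 4-byte int32 token ids.

```python
bin_bytes = 533_813_187_456   # size of the .bin output in bytes
bytes_per_token = 4           # dtype=int32
unique_rows = 2_892_030       # rows in the raw parquet files
docs_written = 92_544_960     # documents summed over all 128 shards

total_tokens = bin_bytes // bytes_per_token
duplication = docs_written / unique_rows
expected_tokens = total_tokens / duplication
expected_bytes = expected_tokens * bytes_per_token

print(f"tokens written:   {total_tokens / 1e9:.2f} B")     # 133.45 B
print(f"duplication:      {duplication:.1f}x")             # 32.0x
print(f"expected tokens:  {expected_tokens / 1e9:.2f} B")  # 4.17 B
print(f"expected size:    {expected_bytes / 1e9:.2f} GB")  # 16.68 GB
print(f"actual size:      {bin_bytes / 2**30:.0f} GiB")    # 497 GiB
```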

