Summary
When nemotron nano3 data prep pretrain (binidx / run_pretrain_pipeline) splits one input parquet file across multiple shards, the per-shard row partition recorded in plan.json (row_modulus / row_remainder) is not applied during tokenization. Every shard re-reads and re-tokenizes the entire file instead of only its row_index % row_modulus == row_remainder subset.
Result: each document is written row_modulus times. In my run this turned a 5.5 GB input dataset into 497 GiB of bin/idx output (exactly 32× duplication), and it affected every dataset whose files were split across shards.
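For reference, here is a minimal sketch of the partition I would expect each shard to apply. It is an illustration of the row_index % row_modulus == row_remainder rule only, not the pipeline's actual code; the helper names select_rows and shard_sizes are hypothetical.

```python
def select_rows(num_rows: int, row_modulus: int, row_remainder: int) -> range:
    """Row indices one shard should tokenize under the plan's modulo partition.

    Hypothetical helper: selects every index i in [0, num_rows) with
    i % row_modulus == row_remainder.
    """
    if row_modulus < 1 or not 0 <= row_remainder < row_modulus:
        raise ValueError("need row_modulus >= 1 and 0 <= row_remainder < row_modulus")
    return range(row_remainder, num_rows, row_modulus)


def shard_sizes(num_rows: int, row_modulus: int) -> list[int]:
    """Number of rows each of the row_modulus shards would process."""
    return [len(select_rows(num_rows, row_modulus, r)) for r in range(row_modulus)]


# Example with the row count of part_000007.parquet from this report.
sizes = shard_sizes(725_078, 32)
print(min(sizes), max(sizes), sum(sizes))
```

With a partition like this, the shards are disjoint and their row counts sum to the file's row count, so no document is written more than once.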
Environment
- Command: uv run nemotron nano3 data prep pretrain --run <CLUSTER>
- Stage 0 pretrain, output format binidx, dtype=int32, num_shards=128
- Tokenizer: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 (vocab 131072), add_eos=true
- From plan.json > determinism_constraints: ray==2.53.0, transformers==4.57.6, tokenizers==0.22.2
What I observed
Dataset Nemotron-CC-Math-v1-3:
- Raw input: 4 parquet files, 5.5 GB total, 2,892,030 rows (unique documents).
- Processed bin/idx output: 497 GiB (533,813,187,456 bytes .bin ÷ 4 bytes/int32 = 133.45 B tokens).
- The plan assigns each of the 4 files to 32 shards via row_modulus: 32, row_remainder: 0..31.
The receipts show that all 32 shards for the same file report identical counts:
| shard | row_modulus | row_remainder | num_input_rows | num_documents | total_tokens |
|---|---|---|---|---|---|
| 0 | 32 | 0 | 725,078 | 725,078 | 1,045,205,332 |
| 1 | 32 | 1 | 725,078 | 725,078 | 1,045,205,332 |
| 2 | 32 | 2 | 725,078 | 725,078 | 1,045,205,332 |
| … | 32 | … | 725,078 | 725,078 | 1,045,205,332 |
725,078 is the full row count of part_000007.parquet. With row_modulus=32, each shard should have processed ≈ 22,659 rows, not the whole file.
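A check along these lines could flag the problem from the receipts alone. This is only a sketch: the field names row_modulus, row_remainder and num_input_rows come from the receipts shown in this report, but the "file" key, the in-memory list-of-dicts layout, and the function name are assumptions of mine, not the real receipt schema.

```python
from collections import defaultdict


def find_unpartitioned_files(receipts, file_row_counts):
    """Return {file: read_factor} for files whose shards read more rows than exist.

    receipts: iterable of dicts with a (hypothetical) "file" key and "num_input_rows".
    file_row_counts: dict mapping file name to its true row count.
    """
    rows_read = defaultdict(int)
    for receipt in receipts:
        rows_read[receipt["file"]] += receipt["num_input_rows"]
    return {
        name: rows_read[name] / total
        for name, total in file_row_counts.items()
        if rows_read[name] > total
    }


# Receipts shaped like the ones in this report: 32 shards, each reading the full file.
receipts = [
    {"file": "part_000007.parquet", "row_modulus": 32, "row_remainder": r,
     "num_input_rows": 725_078}
    for r in range(32)
]
print(find_unpartitioned_files(receipts, {"part_000007.parquet": 725_078}))
```

If the partition were applied, the per-shard num_input_rows would sum to the file's row count and the result would be empty.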
- Documents written across 128 shards: 92,544,960 vs. 2,892,030 unique rows → 32.0×.
- Expected output size: ≈ 4.17 B tokens ≈ 16.7 GB (15.5 GiB), i.e. 32× smaller.
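The arithmetic behind these figures can be reproduced as follows. All inputs are the numbers quoted in this report; the only assumption is that every file in the dataset was split 32 ways, as the plan indicates.

```python
INT32_BYTES = 4

bin_bytes = 533_813_187_456   # size of the .bin output
unique_rows = 2_892_030       # rows across the 4 input parquet files
docs_written = 92_544_960     # documents written across 128 shards
row_modulus = 32              # shards per input file in plan.json

total_tokens = bin_bytes // INT32_BYTES
dup_factor = docs_written / unique_rows

# Assumes every file was duplicated row_modulus times.
expected_tokens = total_tokens // row_modulus
expected_bytes = expected_tokens * INT32_BYTES

print(f"tokens written:     {total_tokens / 1e9:.2f} B")
print(f"duplication factor: {dup_factor:.1f}x")
print(f"expected tokens:    {expected_tokens / 1e9:.2f} B")
print(f"expected size:      {expected_bytes / 1e9:.1f} GB ({expected_bytes / 2**30:.1f} GiB)")
```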