I was looking into the phase 1 data weights and noticed the oddities below. I'd like to understand whether these are the actual weights used.
- Some tiny datasets have very high weights. E.g. Nemotron-Pretraining-Specialized-v1.1/Nemotron-Pretraining-Formal-Logic is assigned a weight of 2.1. Does this mean 2.1% of the data mix, or something different? By my calculation the dataset has about 130M tokens, so a run of 25T training tokens would see about 25T x 0.021 = 525B tokens of formal logic, or about 4000 epochs. Has an error been made somewhere?
- The Nemotron-CC-v2 weights are all lower than the Nemotron-CC-v2.1 weights, e.g. the high-quality split has weight 2.2 (v2) versus 4.3 (v2.1). But the disk sizes are 1.1TB (v2) and 44GB (v2.1). My impression was that both used the same filtering criteria, just on different CC snapshots. So why would the smaller dataset get more weight? As with the issue above, this implies about 20 epochs over the v2.1 high-quality data.
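
For reference, here is the back-of-the-envelope epoch calculation from the first point as a minimal sketch. It assumes the weights are percentages of the total mix and that the run uses 25T tokens; both are my assumptions, and `implied_epochs` is just a helper name I made up, not anything from the training code.

```python
def implied_epochs(weight_pct: float, dataset_tokens: float, total_tokens: float) -> float:
    """Number of passes over a dataset if weight_pct percent of
    total_tokens is sampled from it (assumes weights are percentages)."""
    tokens_drawn = total_tokens * weight_pct / 100.0
    return tokens_drawn / dataset_tokens


TOTAL_TOKENS = 25e12          # assumed training budget (25T tokens)
FORMAL_LOGIC_TOKENS = 130e6   # my own estimate of the dataset size

formal_logic_epochs = implied_epochs(2.1, FORMAL_LOGIC_TOKENS, TOTAL_TOKENS)
print(f"Formal-Logic: about {formal_logic_epochs:,.0f} epochs")
```

If the weights are normalized differently (for example, relative to a group of datasets rather than the whole mix), only the `weight_pct` argument changes and the rest of the calculation stays the same.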