Are Phase 1 weights reliable?

I was looking into the [phase 1 data weights](https://github.com/NVIDIA-NeMo/Nemotron/blob/main/src/nemotron/recipes/super3/stage0_pretrain/config/data_prep/data_blend_raw_phase1.json) and noticed a couple of oddities, listed below. I'd like to understand whether these are the actual weights used.

- Some tiny datasets have very high weights. For example, Nemotron-Pretraining-Specialized-v1.1/Nemotron-Pretraining-Formal-Logic is assigned a weight of 2.1. Does this mean 2.1% of the data mix, or something else? By my calculation the dataset has about 130M tokens, so over a 25T-token training run this would be 25T x 0.021 = 525B tokens of formal logic, or about 4,000 epochs. Has an error been made somewhere?

- The Nemotron-CC-v2 weights are all lower than the Nemotron-CC-v2.1 weights. For the high-quality split, for example, the weights are 2.2 (v2) and 4.3 (v2.1), but the sizes on disk are 1.1TB (v2) and 44GB (v2.1). My impression was that these used the same filtering criteria, just on different CC snapshots, so why would the smaller dataset get more weight? As with the issue above, this implies repeating the v2.1 high-quality data for about 20 epochs.
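
For reference, the arithmetic behind both points can be reproduced with the small Python sketch below. It assumes the weights are percentages of the total token budget, which is exactly the part I'm unsure about. The 130M token count and the 25T budget are my own estimates, the function names are just illustrative, and disk size is only a rough proxy for token count.

```python
def epochs_over_dataset(weight_pct, total_training_tokens, dataset_tokens):
    """Number of passes over a dataset, ASSUMING weight_pct is a percentage
    of the total training token budget (unconfirmed interpretation)."""
    sampled_tokens = total_training_tokens * weight_pct / 100.0
    return sampled_tokens / dataset_tokens


def weight_per_tb(weight, size_tb):
    """Blend weight per terabyte on disk (a rough proxy for sampling density)."""
    return weight / size_tb


# Formal-Logic: weight 2.1, ~130M tokens (my estimate), 25T training tokens (assumed).
formal_logic_epochs = epochs_over_dataset(2.1, 25e12, 130e6)

# High-quality split: v2 is 1.1TB with weight 2.2, v2.1 is 44GB with weight 4.3.
density_ratio = weight_per_tb(4.3, 0.044) / weight_per_tb(2.2, 1.1)

print(f"Formal-Logic epochs: {formal_logic_epochs:,.0f}")
print(f"v2.1 vs v2 weight per TB: {density_ratio:.1f}x")
```

Under these assumptions, Formal-Logic comes out at roughly 4,000 epochs, and each byte of the v2.1 high-quality split is sampled roughly 49 times as often as a byte of the v2 split.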

Are Phase 1 weights reliable? #248