Are Phase 1 weights reliable? #248

@TeaPearce

I was looking into the Phase 1 data weights and noticed the oddities below. I'd like to understand whether these are the actual weights used.

  • Some tiny datasets have very high weights. E.g. Nemotron-Pretraining-Specialized-v1.1/Nemotron-Pretraining-Formal-Logic is assigned a weight of 2.1. Does this mean 2.1% of the data mix, or something different? By my calculations the dataset has about 130M tokens, so a 25T-token training run would draw 25T x 0.021 = 525B tokens of formal logic, or about 4,000 epochs. It seems like an error has been made somewhere?

  • The Nemotron-CC-v2 weights are all lower than the Nemotron-CC-v2.1 weights, e.g. the high-quality split has weight 2.2 (v2) versus 4.3 (v2.1). But the disk sizes are 1.1TB (v2) and 44GB (v2.1). My impression was that these used the same filtering criteria, just on different CC snapshots. So why would the smaller dataset get more weight? As with the issue above, this leads to about 20 epochs over the v2.1 high-quality data.
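For reference, the arithmetic behind both points can be reproduced with a short sketch. It assumes the weights are percentages of the total token budget (which is exactly what I'm asking about), uses my own estimate of 130M tokens for Formal-Logic, and treats 1.1TB as 1100GB. The helper names `epochs_implied` and `weight_per_gb` are made up for illustration and are not from the repo. Disk size is only a rough proxy for token count, so the second figure is indicative at best.

```python
def epochs_implied(weight_percent, total_training_tokens, dataset_tokens):
    """Epochs over a dataset, ASSUMING the weight is a percentage of all training tokens."""
    tokens_drawn = total_training_tokens * weight_percent / 100.0
    return tokens_drawn / dataset_tokens


def weight_per_gb(weight_percent, disk_gb):
    """Mix weight per GB on disk (disk size is only a rough proxy for token count)."""
    return weight_percent / disk_gb


# Formal-Logic: weight 2.1, ~130M tokens (my estimate), 25T-token run.
formal_logic_epochs = epochs_implied(2.1, 25e12, 130e6)
print(f"Formal-Logic epochs: {formal_logic_epochs:,.0f}")

# High-quality splits: weight 2.2 on 1.1TB (v2) vs weight 4.3 on 44GB (v2.1).
v2_density = weight_per_gb(2.2, 1100.0)
v21_density = weight_per_gb(4.3, 44.0)
density_ratio = v21_density / v2_density
print(f"v2.1 gets {density_ratio:.1f}x more weight per GB than v2")
```

If the weights are not percentages of the token budget (e.g. if they are normalized some other way), these numbers change accordingly.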

Labels: bug
