
Why does the mismatch KL explode as training progresses? #10

@ZhichenRen

Description


Hi authors, thank you for the great paper and code!

I understand that the training-inference mismatch stems from BF16 rounding errors and kernel discrepancies. However, I am trying to understand why this error grows so significantly during training. As shown in Fig. 3 of the paper, the KL mismatch explodes from a negligible $\approx 10^{-3}$ to a massive $\approx 10^{0}$.
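For context on the per-token scale of this effect, here is a minimal numpy sketch (not the repo's code) that simulates bf16 rounding of float32 logits by rounding away the low 16 bits, then measures the resulting per-token KL. All numbers are synthetic; the perturbation model is an assumption, not the paper's measurement:

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to bfloat16 precision: keep the sign, the 8-bit
    exponent, and the top 7 mantissa bits (round-half-up on the dropped bits)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = (bits + np.uint32(0x8000)) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

def log_softmax(logits):
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(0)
logits = rng.normal(0.0, 4.0, size=1000).astype(np.float32)

logp_train = log_softmax(logits)           # "training" pass, full float32 logits
logp_infer = log_softmax(to_bf16(logits))  # "inference" pass, bf16-rounded logits

# Per-token KL(p_train || p_infer) from the rounding alone: small but nonzero.
kl = float(np.sum(np.exp(logp_train) * (logp_train - logp_infer)))
print(f"per-token KL from bf16 rounding alone: {kl:.1e}")
```

This only models the precision part of the mismatch, not kernel-level discrepancies, but it shows that a single token position contributes a tiny, strictly positive KL.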

I checked the paper and didn't find a section that discusses this phenomenon. Could you clarify whether this increase is primarily driven by:

1. **Sequence length:** the model learning to generate longer chains, allowing more errors to accumulate per sequence?

2. **Peakiness:** the policy becoming more confident (lower entropy) later in training, making the KL divergence hypersensitive to small rounding errors?

3. Or other reasons?

Any insights on this dynamic would be very helpful. Thanks!
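One way to make the sequence-length hypothesis concrete: for a token-factorized policy, the sequence-level KL is the sum of the per-token KLs, so even a roughly constant per-token mismatch grows linearly with generation length. A minimal numpy sketch with synthetic numbers (`eps` is an arbitrary stand-in for the per-token perturbation, not a measured value):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(logits):
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def token_kl(eps=0.03, vocab=100):
    """KL(p || q) at a single position, where q perturbs p's logits by
    Gaussian noise of scale `eps` (a stand-in for precision/kernel mismatch)."""
    logits = rng.normal(size=vocab)
    logp = log_softmax(logits)
    logq = log_softmax(logits + rng.normal(scale=eps, size=vocab))
    return float(np.sum(np.exp(logp) * (logp - logq)))

# Sequence KL is additive over positions, so a fixed per-token mismatch
# scales linearly with generation length.
results = {}
for length in [10, 100, 1000, 10000]:
    results[length] = sum(token_kl() for _ in range(length))
    print(f"length={length:6d}  sequence KL ~ {results[length]:.1e}")
```

Under this toy model, a 1000x growth in the sequence-level KL can come from length alone, without any change in the per-token error; it does not rule out peakiness as an additional driver.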
