Fix bugs related to loss mask, meta info, and response length #21

Open
xiaobo-yang wants to merge 1 commit into main

Conversation

@xiaobo-yang commented Mar 14, 2025

  1. Construct the loss mask immediately after the observation is obtained, so that converting the transformed text back to tokens cannot misalign the mask against the token ids.
  2. Exclude information (observation) tokens when computing the critic and the KL: they are not samples generated by the policy, and including them can cause a severe negative-KL explosion. (A sketch of fixes 1 and 2 follows this list.)
  3. Propagate meta info so that the test batch actually applies its `do_sample` setting (see the second sketch after this list).
  4. Exclude info tokens when recording the response length.

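A minimal sketch of what fixes 1 and 2 might look like, assuming a Hugging Face-style tokenizer and 1-D token-id/mask tensors; the function names and the simple `logprob - ref_logprob` KL estimator are illustrative, not the PR's actual code:

```python
import torch

def append_observation(tokenizer, response_ids, loss_mask, obs_text):
    """Append environment-observation tokens and extend the loss mask now,
    while token boundaries are still known exactly. Re-encoding the merged
    *text* later can fuse tokens across the response/observation boundary
    and misalign the mask (fix 1)."""
    obs_ids = torch.tensor(
        tokenizer.encode(obs_text, add_special_tokens=False),
        dtype=response_ids.dtype,
    )
    response_ids = torch.cat([response_ids, obs_ids])
    # Observation tokens get mask 0: they were injected by the environment,
    # not sampled from the policy.
    loss_mask = torch.cat(
        [loss_mask, torch.zeros(len(obs_ids), dtype=loss_mask.dtype)]
    )
    return response_ids, loss_mask

def masked_kl(logprob, ref_logprob, loss_mask):
    """Per-token KL term, zeroed on observation tokens (fix 2). Without the
    mask, observation tokens that are unlikely under the policy can drive
    logprob - ref_logprob strongly negative and blow up the KL penalty."""
    return (logprob - ref_logprob) * loss_mask
```

The same mask also yields a response-length count that ignores injected info tokens, `loss_mask.sum()`, which appears to be the statistic fix 4 corrects.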
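Fix 3 reads as a flag being dropped in transit; here is a minimal sketch under that assumption, with a verl-style batch that carries a `meta_info` dict alongside its tensors (the `Batch` class and `with_sampling_flag` are hypothetical names):

```python
from dataclasses import dataclass, field

@dataclass
class Batch:
    tensors: dict = field(default_factory=dict)
    meta_info: dict = field(default_factory=dict)

def with_sampling_flag(batch: Batch, do_sample: bool) -> Batch:
    # Re-attach the flag whenever the batch is rebuilt or sliced; if
    # meta_info is lost along the way, generation silently falls back to
    # its default and the test batch never applies its do_sample setting.
    batch.meta_info["do_sample"] = do_sample
    return batch

# e.g. make test-time generation actually sample:
# test_batch = with_sampling_flag(test_batch, do_sample=True)
```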
After fixing these bugs, RL training remains stable for a much longer duration:
[image: training curves]
