Fix bugs related to loss mask, meta info, and response length #21

Open
xiaobo-yang wants to merge 1 commit into main

Conversation

@xiaobo-yang commented Mar 14, 2025

  1. Construct the loss mask immediately after the observation is obtained, so that converting the transformed text back to tokens cannot misalign the mask against the token ids.
  2. Exclude information (observation) tokens when computing the critic and the KL: they are not samples generated by the policy, and including them can cause a severe negative-KL explosion. (A sketch of fixes 1 and 2 follows this list.)
  3. Propagate meta info so that the test batch actually applies its `do_sample` setting (see the second sketch after this list).
  4. Exclude info tokens when recording the response length.

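A minimal sketch of what fixes 1 and 2 might look like, assuming a Hugging Face-style tokenizer and 1-D token-id/mask tensors; the function names and the simple `logprob - ref_logprob` KL estimator are illustrative, not the PR's actual code:

```python
import torch

def append_observation(tokenizer, response_ids, loss_mask, obs_text):
    """Append environment-observation tokens and extend the loss mask now,
    while token boundaries are still known exactly. Re-encoding the merged
    *text* later can fuse tokens across the response/observation boundary
    and misalign the mask (fix 1)."""
    obs_ids = torch.tensor(
        tokenizer.encode(obs_text, add_special_tokens=False),
        dtype=response_ids.dtype,
    )
    response_ids = torch.cat([response_ids, obs_ids])
    # Observation tokens get mask 0: they were injected by the environment,
    # not sampled from the policy.
    loss_mask = torch.cat(
        [loss_mask, torch.zeros(len(obs_ids), dtype=loss_mask.dtype)]
    )
    return response_ids, loss_mask

def masked_kl(logprob, ref_logprob, loss_mask):
    """Per-token KL term, zeroed on observation tokens (fix 2). Without the
    mask, observation tokens that are unlikely under the policy can drive
    logprob - ref_logprob strongly negative and blow up the KL penalty."""
    return (logprob - ref_logprob) * loss_mask
```

The same mask also yields a response-length count that ignores injected info tokens, `loss_mask.sum()`, which appears to be the statistic fix 4 corrects.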
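Fix 3 reads as a flag being dropped in transit; here is a minimal sketch under that assumption, with a verl-style batch that carries a `meta_info` dict alongside its tensors (the `Batch` class and `with_sampling_flag` are hypothetical names):

```python
from dataclasses import dataclass, field

@dataclass
class Batch:
    tensors: dict = field(default_factory=dict)
    meta_info: dict = field(default_factory=dict)

def with_sampling_flag(batch: Batch, do_sample: bool) -> Batch:
    # Re-attach the flag whenever the batch is rebuilt or sliced; if
    # meta_info is lost along the way, generation silently falls back to
    # its default and the test batch never applies its do_sample setting.
    batch.meta_info["do_sample"] = do_sample
    return batch

# e.g. make test-time generation actually sample:
# test_batch = with_sampling_flag(test_batch, do_sample=True)
```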
After fixing these bugs, RL training remains stable for a much longer duration:
[image: training curves]
