@slimtune2023
I trained this model with the best off-policy hyperparameters from my earlier experiments: a learning rate of 3e-5, train batch size = rollout batch size = 256, the GRPO clip loss, and no mean or standard deviation normalization of the advantages. This configuration gave the best results in my shorter runs, so I used it for the final model as well.
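For readers who want the configuration spelled out, here is a minimal pure-Python sketch of what these settings could look like. It is an illustration, not the training code from this PR. The config keys, the function names `group_advantages` and `grpo_clip_loss`, and the clip range `clip_eps=0.2` are all assumptions; only the learning rate, the batch sizes, the loss type, and the two normalization switches come from the description above.

```python
import math
from statistics import mean, pstdev

# Hypothetical config keys; the values are the ones stated in the comment.
CONFIG = {
    "learning_rate": 3e-5,
    "train_batch_size": 256,
    "rollout_batch_size": 256,
    "loss_type": "grpo_clip",
    "subtract_mean": False,   # no mean normalization
    "divide_by_std": False,   # no standard deviation normalization
}

def group_advantages(rewards, subtract_mean=False, divide_by_std=False, eps=1e-6):
    """Advantages for one group of rollouts sampled from the same prompt.

    With both flags off, the advantages are simply the raw rewards.
    """
    mu = mean(rewards) if subtract_mean else 0.0
    adv = [r - mu for r in rewards]
    if divide_by_std:
        sigma = pstdev(rewards)
        adv = [a / (sigma + eps) for a in adv]
    return adv

def grpo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token clipped surrogate loss (clip_eps is an assumed value)."""
    ratio = math.exp(logp_new - logp_old)            # policy probability ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the unclipped and clipped objectives,
    # then negate so that minimizing the loss maximizes the objective.
    return -min(ratio * advantage, clipped * advantage)

# Example: three rollouts for one prompt, rewards used as-is.
advs = group_advantages(
    [1.0, 0.0, 1.0],
    subtract_mean=CONFIG["subtract_mean"],
    divide_by_std=CONFIG["divide_by_std"],
)
loss_on_policy = grpo_clip_loss(logp_new=-1.0, logp_old=-1.0, advantage=advs[0])
```

With both normalization flags disabled, a rollout with zero reward contributes zero gradient through this loss, and every rewarded rollout is pushed up with the same weight regardless of how the rest of its group performed.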
