Some questions about Deepseek R1 #26
Thank you for your detailed questions and feedback on DeepSeek-R1. Below, I've addressed each of your points systematically:

1. Query Source of Cold-Start Data and Distillation Reasoning Data
2. RL Training Process (8,000 Steps)
3. "Aha Moment" vs. Slow-Thinking Paradigms
4. Repeated Characters and Sentences During RL Training
5. Performance of PRM and MCTS (Unsuccessful Attempts)
6. Plans to Open Up RL Training Frameworks or Hyperparameters
7. Feedback on PRM and MCTS
8. Typos in the Paper

Next Steps

If you'd like to contribute further, feel free to:

Thank you again for your thoughtful questions and insights! Your engagement is highly appreciated.
Wait, in your report, is PRM a process reward model or a paired reward model?
Is there any way to modify the reward function to speed up the appearance of the "Aha Moment"?
Hello, I am wondering whether the team would be open to sharing the breakdown of time taken and resources needed in each phase of training DeepSeek-R1: unsupervised pre-training, cold start, RL for reasoning, SFT, and RL for all scenarios. Also, thank you for the contribution to OSS!
This is very nice work, and it is the first to provide us with some details of an o1-like implementation. I have some questions and viewpoints on this work:
Some personal viewpoints: I agree that strategies such as PRM and MCTS can restrict RL training, but we noticed that DeepSeek-R1 also considered performance on general tasks and the readability problem. PRM and MCTS may be suitable only for tasks with very clear step granularity. Therefore, PRM and MCTS are still relatively important, even though there is no perfect way to solve the reward-hacking problem.
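To make the step-granularity point concrete, here is a minimal toy sketch (not DeepSeek's implementation) contrasting an outcome reward, which scores only the final answer, with a process reward, which scores every intermediate step and therefore requires the task to decompose into clearly delimited steps. The step scorer here is a hypothetical placeholder.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold: str) -> float:
    """Single scalar reward for the whole trajectory, based only on the final answer."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

def process_reward(steps: List[str],
                   step_scorer: Callable[[str], float]) -> List[float]:
    """One reward per reasoning step; only meaningful when steps are cleanly separable."""
    return [step_scorer(s) for s in steps]

# Toy example with a trivial scorer that rewards steps containing an '=' sign.
steps = ["2 + 3 = 5", "5 * 4 = 20", "therefore the answer is 20"]
print(outcome_reward("20", "20"))                                 # 1.0
print(process_reward(steps, lambda s: 1.0 if "=" in s else 0.0))  # [1.0, 1.0, 0.0]
```

A free-form chain of thought often lacks such clean step boundaries, which is one reason step-level reward assignment (and tree search over steps) is hard to apply in general, and easier to reward-hack.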
Additionally, I found some typos in the paper.