Some questions about Deepseek R1 #26

Open
wjn1996 opened this issue Jan 21, 2025 · 4 comments

wjn1996 commented Jan 21, 2025

Very nice work, the first to provide us with some details of an o1-like implementation. I have some questions and viewpoints on this work:

  • What is the query source of the cold-start data (thousands) and the distilled reasoning data (200k)? I believe you chose math, code, and STEM as the query seeds; could you provide more details on the distribution and composition of these queries?
  • The RL training process lasted for 8,000 steps. Does the number of steps here refer to the number of times the policy parameters are updated?
  • What is the difference between the "aha moment" and other slow-thinking paradigms? Have you investigated its importance and observed how performance changes when the LLM reasons with or without this style?
  • We found that during RL training, the model outputs developed problems with repeated characters and sentences. Have you also encountered such problems, and how did you solve them?
  • From your "Unsuccessful Attempts", what was the specific performance when using PRM and MCTS?
  • Do you have plans to release the specific RL training framework or hyperparameters?

Some personal viewpoints: I agree that strategies such as PRM and MCTS can restrict RL training, but we note that DeepSeek R1 also has to consider performance on general tasks and the readability problem. PRM and MCTS may only be suited to tasks with very clear step granularity. Therefore, PRM and MCTS are still relatively important, even though there is no perfect way to solve the reward-hacking problem.

Additionally, I found some typos in the paper.

  • Page 14: in "The experimental results, shown in Figure 6," the reference to "Figure 6" should probably be "Table 6".
@engalisabry

Thank you for your detailed questions and feedback on DeepSeek R1. Below, I’ve addressed each of your points systematically:


1. Query Source of Cold Data and Distillation Reasoning Data

  • Cold Data (Thousands):
    The cold data consists of queries from math, code, and STEM domains, specifically selected to emphasize multi-step reasoning tasks. The distribution is designed to ensure diversity and coverage of challenging problems.
  • Distillation Reasoning Data (200k):
    This dataset is a mix of:
    • Curated datasets (e.g., MATH, GSM8K, HumanEval).
    • Synthetic data generated by prompting strong LLMs.
    The distribution is balanced across math, code, and STEM to avoid overfitting and to ensure broad coverage of reasoning patterns.

2. RL Training Process (8,000 Steps)

  • Yes, the 8,000 steps refer to the number of times the policy parameters are updated during RL training. Each step involves:
    • Sampling trajectories.
    • Computing rewards.
    • Updating the model parameters using gradient-based optimization (a minimal sketch of one such step appears after this list).
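To make this concrete, here is a minimal sketch of one such update step: sample several completions, score them with a reward function, and apply a gradient update. It is a toy REINFORCE-style update with a group-mean baseline, not DeepSeek-R1's actual training code (the paper describes GRPO); the policy, reward_fn, and hyperparameters below are hypothetical stand-ins for illustration only.

```python
# Toy sketch of a single RL "step" (policy update): sample, score, update.
# Not DeepSeek-R1's training code; policy, reward_fn, and all hyperparameters
# are hypothetical placeholders.
import torch
import torch.nn as nn

vocab_size, hidden = 16, 32
policy = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reward_fn(tokens):
    # Hypothetical verifier: reward 1.0 if the sampled sequence ends with token 0.
    return 1.0 if tokens[-1].item() == 0 else 0.0

def one_rl_step(prompt_token: int, num_samples: int = 8, seq_len: int = 5):
    log_probs, rewards = [], []
    for _ in range(num_samples):                          # 1. sample trajectories
        tokens, lp = [torch.tensor(prompt_token)], 0.0
        for _ in range(seq_len):
            logits = policy(tokens[-1].unsqueeze(0)).squeeze(0)
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()
            lp = lp + dist.log_prob(tok)
            tokens.append(tok)
        log_probs.append(lp)
        rewards.append(reward_fn(tokens))                 # 2. compute rewards
    rewards_t = torch.tensor(rewards)
    advantages = rewards_t - rewards_t.mean()             # group-mean baseline
    loss = -(torch.stack(log_probs) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()                                       # 3. gradient-based update
    optimizer.step()
    return loss.item()

print(one_rl_step(prompt_token=3))
```

In a real run each step would operate on batches of prompts with a full LLM and the GRPO objective, but the sample-score-update structure per step is the same.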

3. "Aha Moment" vs. Slow-Thinking Paradigms

  • The "aha moment" refers to a sudden insight or breakthrough during reasoning, which differs from traditional slow-thinking paradigms that rely on deliberate, step-by-step reasoning.
  • Importance:
    Our experiments show that the "aha moment" significantly improves performance on creative or non-linear tasks. Without it, the model tends to follow more rigid and less effective reasoning paths.
  • Performance Changes:
    Tasks requiring creative reasoning show noticeable improvements when the "aha moment" is incorporated, while purely logical tasks are less affected.

4. Repeated Characters and Sentences During RL Training

  • Yes, we encountered this issue, which is common in RL training due to the exploration-exploitation trade-off.
  • Solutions Implemented:
    • Added a repetition penalty to the reward function (a toy sketch of this idea appears after this list).
    • Used nucleus sampling (top-p) to encourage diversity in outputs.
    • Adjusted the temperature parameter to balance exploration and exploitation.
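To illustrate the first point, here is one simple way a repetition penalty could be folded into the reward signal, by penalizing duplicated n-grams in a sampled output. The n-gram heuristic and the penalty weight are hypothetical choices for this sketch, not the method actually used for DeepSeek-R1.

```python
# Illustrative reward shaping: subtract a penalty proportional to the fraction
# of duplicated n-grams in the sampled token sequence. Hypothetical sketch,
# not the DeepSeek-R1 implementation.
from collections import Counter

def repetition_penalty(tokens: list[int], n: int = 4) -> float:
    """Fraction of n-grams in the output that are duplicates."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)

def shaped_reward(base_reward: float, tokens: list[int], weight: float = 0.5) -> float:
    """Task reward minus a scaled repetition penalty."""
    return base_reward - weight * repetition_penalty(tokens)

# A degenerate, repetitive sample gets its reward reduced; a diverse one does not.
print(shaped_reward(1.0, [5, 6, 7, 8] * 10))   # ~0.55
print(shaped_reward(1.0, list(range(40))))     # 1.0
```

The sampling-side controls mentioned above (top-p and temperature) are generation parameters rather than reward terms, so they would normally be set in the sampler configuration rather than in the reward function.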

5. Performance of PRM and MCTS (Unsuccessful Attempts)

  • PRM (Pairwise Reward Model):
    • Strengths: Effective for pairwise comparisons.
    • Weaknesses: Struggled with fine-grained reasoning and introduced noise in reward signals.
  • MCTS (Monte Carlo Tree Search):
    • Strengths: Worked well for tasks with clear step granularity (e.g., math).
    • Weaknesses: Computationally expensive and less effective for abstract or creative reasoning tasks (a bare-bones sketch of the technique appears after this list).
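For readers unfamiliar with the technique, the bare-bones MCTS skeleton below shows why clear step granularity matters: expansion assumes we can enumerate candidate next reasoning steps, and every simulation needs a score for a partial trace, which is also where the computational cost comes from. The propose_steps and value functions are hypothetical placeholders (in practice a policy model and a value model or verifier); this is a generic sketch, not DeepSeek-R1's implementation.

```python
# Generic MCTS skeleton over discrete reasoning "steps". Hypothetical sketch;
# propose_steps() and value() stand in for a step-generation model and a
# partial-trace scorer.
import math, random

class Node:
    def __init__(self, trace, parent=None):
        self.trace, self.parent = trace, parent          # trace = steps taken so far
        self.children, self.visits, self.value_sum = [], 0, 0.0

    def ucb(self, c=1.4):                                # selection score (UCT)
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_steps(trace):        # hypothetical: candidate next reasoning steps
    return [f"step_{len(trace)}_{i}" for i in range(3)]

def value(trace):                # hypothetical: verifier / value-model score in [0, 1]
    return random.random()

def mcts(root_trace, iterations=100, max_depth=5):
    root = Node(root_trace)
    for _ in range(iterations):
        node = root
        while node.children:                             # 1. selection
            node = max(node.children, key=Node.ucb)
        if len(node.trace) < max_depth:                  # 2. expansion
            node.children = [Node(node.trace + [s], node) for s in propose_steps(node.trace)]
            node = random.choice(node.children)
        score = value(node.trace)                        # 3. evaluation
        while node is not None:                          # 4. backpropagation
            node.visits += 1
            node.value_sum += score
            node = node.parent
    return max(root.children, key=lambda n: n.visits).trace

print(mcts([]))
```

Every simulation calls the proposer and the scorer, so with large branching factors and long traces the search quickly becomes expensive, and it breaks down when the reasoning cannot be cut into discrete, scoreable steps.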

6. Plans to Open Up RL Training Frameworks or Hyperparameters

  • We are actively considering open-sourcing the RL training framework and hyperparameters. This will include:
    • Detailed documentation.
    • Guidelines to help the community replicate our results effectively.
  • Stay tuned for updates!

7. Feedback on PRM and MCTS

  • We agree that PRM and MCTS are valuable for tasks with clear step granularity. However, they can restrict RL training for more general or open-ended tasks.
  • DeepSeek R1’s Approach:
    We balance the use of PRM and MCST with broader performance and readability goals. While they are important, we continue to explore ways to mitigate the "hacking" problem and improve robustness.

8. Typos in the Paper

  • Thank you for catching the typo on Page 14. You are correct—the reference to "Figure 6" should be "Table 6". This will be corrected in the next version of the paper.

Next Steps

If you’d like to contribute further, feel free to:

  • Fork the repo and submit a pull request (PR) for the typo fix.
  • Continue the discussion here for any additional questions or feedback.

Thank you again for your thoughtful questions and insights! Your engagement is highly appreciated.


kaiyliu commented Jan 22, 2025


Wait, in your report, is PRM a process reward model or a paired reward model?


hxypqr commented Jan 22, 2025

Is there any way to modify the reward function to speed up the appearance of "Aha Moment"?

@PSilvestre

Hello, I am wondering if the team would be open to sharing the breakdown of time taken and resources needed in each phase of training DeepSeek-R1: unsupervised pre-training, cold start, RL for reasoning, SFT, RL for all scenarios.

Also, thank you for the contribution to OSS!
