Some questions about Deepseek R1 #26

Open
wjn1996 opened this issue Jan 21, 2025 · 4 comments

wjn1996 commented Jan 21, 2025

Very nice work, the first to provide us with some details of an o1-like implementation. I have some questions and viewpoints on this work:

  • What is the query source of the cold-start data (thousands) and the distilled reasoning data (200k)? I believe you chose math, code, and STEM as the query seeds; could you provide more details on the distribution and composition of these queries?
  • The RL training process lasted for 8,000 steps. Does the number of steps here refer to the number of times the policy parameters are updated?
  • What is the difference between the "aha moment" and other slow-thinking paradigms? Have you investigated its importance and observed how performance changes when the LLM reasons with or without this style?
  • We found that during RL training, the model outputs developed problems with repeated characters and sentences. Have you also encountered such problems, and how did you solve them?
  • From your "Unsuccessful Attempts", what was the specific performance when using PRM and MCTS?
  • Do you have plans to release the specific RL training framework or hyperparameters?

Some personal viewpoints: I agree that strategies such as PRM and MCTS can restrict RL training, but we note that DeepSeek R1 also has to consider performance on general tasks and the readability problem. PRM and MCTS may only be suited to tasks with very clear step granularity. Therefore, PRM and MCTS are still relatively important, even though there is no perfect way to solve the reward-hacking problem.

Additionally, I found some typos in the paper.

  • Page 14: in "The experimental results, shown in Figure 6," the reference to "Figure 6" should probably be "Table 6".
@engalisabry

Thank you for your detailed questions and feedback on DeepSeek R1. Below, I’ve addressed each of your points systematically:


1. Query Source of Cold Data and Distillation Reasoning Data

  • Cold Data (Thousands):
    The cold data consists of queries from math, code, and STEM domains, specifically selected to emphasize multi-step reasoning tasks. The distribution is designed to ensure diversity and coverage of challenging problems.
  • Distillation Reasoning Data (200k):
    This dataset is a mix of:
    • Curated datasets (e.g., MATH, GSM8K, HumanEval).
    • Synthetic data generated by prompting strong LLMs.
    The distribution is balanced across math, code, and STEM to avoid overfitting and to ensure broad coverage of reasoning patterns.

2. RL Training Process (8,000 Steps)

  • Yes, the 8,000 steps refer to the number of times the policy parameters are updated during RL training. Each step involves:
    • Sampling trajectories.
    • Computing rewards.
    • Updating the model parameters using gradient-based optimization (a minimal sketch of one such step appears after this list).
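To make this concrete, here is a minimal sketch of one such update step: sample several completions, score them with a reward function, and apply a gradient update. It is a toy REINFORCE-style update with a group-mean baseline, not DeepSeek-R1's actual training code (the paper describes GRPO); the policy, reward_fn, and hyperparameters below are hypothetical stand-ins for illustration only.

```python
# Toy sketch of a single RL "step" (policy update): sample, score, update.
# Not DeepSeek-R1's training code; policy, reward_fn, and all hyperparameters
# are hypothetical placeholders.
import torch
import torch.nn as nn

vocab_size, hidden = 16, 32
policy = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reward_fn(tokens):
    # Hypothetical verifier: reward 1.0 if the sampled sequence ends with token 0.
    return 1.0 if tokens[-1].item() == 0 else 0.0

def one_rl_step(prompt_token: int, num_samples: int = 8, seq_len: int = 5):
    log_probs, rewards = [], []
    for _ in range(num_samples):                          # 1. sample trajectories
        tokens, lp = [torch.tensor(prompt_token)], 0.0
        for _ in range(seq_len):
            logits = policy(tokens[-1].unsqueeze(0)).squeeze(0)
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()
            lp = lp + dist.log_prob(tok)
            tokens.append(tok)
        log_probs.append(lp)
        rewards.append(reward_fn(tokens))                 # 2. compute rewards
    rewards_t = torch.tensor(rewards)
    advantages = rewards_t - rewards_t.mean()             # group-mean baseline
    loss = -(torch.stack(log_probs) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()                                       # 3. gradient-based update
    optimizer.step()
    return loss.item()

print(one_rl_step(prompt_token=3))
```

In a real run each step would operate on batches of prompts with a full LLM and the GRPO objective, but the sample-score-update structure per step is the same.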

3. "Aha Moment" vs. Slow-Thinking Paradigms

  • The "aha moment" refers to a sudden insight or breakthrough during reasoning, which differs from traditional slow-thinking paradigms that rely on deliberate, step-by-step reasoning.
  • Importance:
    Our experiments show that the "aha moment" significantly improves performance on creative or non-linear tasks. Without it, the model tends to follow more rigid and less effective reasoning paths.
  • Performance Changes:
    Tasks requiring creative reasoning show noticeable improvements when the "aha moment" is incorporated, while purely logical tasks are less affected.

4. Repeated Characters and Sentences During RL Training

  • Yes, we encountered this issue, which is common in RL training due to the exploration-exploitation trade-off.
  • Solutions Implemented:
    • Added a repetition penalty to the reward function (a toy sketch of this idea appears after this list).
    • Used nucleus sampling (top-p) to encourage diversity in outputs.
    • Adjusted the temperature parameter to balance exploration and exploitation.
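To illustrate the first point, here is one simple way a repetition penalty could be folded into the reward signal, by penalizing duplicated n-grams in a sampled output. The n-gram heuristic and the penalty weight are hypothetical choices for this sketch, not the method actually used for DeepSeek-R1.

```python
# Illustrative reward shaping: subtract a penalty proportional to the fraction
# of duplicated n-grams in the sampled token sequence. Hypothetical sketch,
# not the DeepSeek-R1 implementation.
from collections import Counter

def repetition_penalty(tokens: list[int], n: int = 4) -> float:
    """Fraction of n-grams in the output that are duplicates."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)

def shaped_reward(base_reward: float, tokens: list[int], weight: float = 0.5) -> float:
    """Task reward minus a scaled repetition penalty."""
    return base_reward - weight * repetition_penalty(tokens)

# A degenerate, repetitive sample gets its reward reduced; a diverse one does not.
print(shaped_reward(1.0, [5, 6, 7, 8] * 10))   # ~0.55
print(shaped_reward(1.0, list(range(40))))     # 1.0
```

The sampling-side controls mentioned above (top-p and temperature) are generation parameters rather than reward terms, so they would normally be set in the sampler configuration rather than in the reward function.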

5. Performance of PRM and MCTS (Unsuccessful Attempts)

  • PRM (Pairwise Reward Model):
    • Strengths: Effective for pairwise comparisons.
    • Weaknesses: Struggled with fine-grained reasoning and introduced noise in reward signals.
  • MCTS (Monte Carlo Tree Search):
    • Strengths: Worked well for tasks with clear step granularity (e.g., math).
    • Weaknesses: Computationally expensive and less effective for abstract or creative reasoning tasks (a bare-bones sketch of the technique appears after this list).
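For readers unfamiliar with the technique, the bare-bones MCTS skeleton below shows why clear step granularity matters: expansion assumes we can enumerate candidate next reasoning steps, and every simulation needs a score for a partial trace, which is also where the computational cost comes from. The propose_steps and value functions are hypothetical placeholders (in practice a policy model and a value model or verifier); this is a generic sketch, not DeepSeek-R1's implementation.

```python
# Generic MCTS skeleton over discrete reasoning "steps". Hypothetical sketch;
# propose_steps() and value() stand in for a step-generation model and a
# partial-trace scorer.
import math, random

class Node:
    def __init__(self, trace, parent=None):
        self.trace, self.parent = trace, parent          # trace = steps taken so far
        self.children, self.visits, self.value_sum = [], 0, 0.0

    def ucb(self, c=1.4):                                # selection score (UCT)
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_steps(trace):        # hypothetical: candidate next reasoning steps
    return [f"step_{len(trace)}_{i}" for i in range(3)]

def value(trace):                # hypothetical: verifier / value-model score in [0, 1]
    return random.random()

def mcts(root_trace, iterations=100, max_depth=5):
    root = Node(root_trace)
    for _ in range(iterations):
        node = root
        while node.children:                             # 1. selection
            node = max(node.children, key=Node.ucb)
        if len(node.trace) < max_depth:                  # 2. expansion
            node.children = [Node(node.trace + [s], node) for s in propose_steps(node.trace)]
            node = random.choice(node.children)
        score = value(node.trace)                        # 3. evaluation
        while node is not None:                          # 4. backpropagation
            node.visits += 1
            node.value_sum += score
            node = node.parent
    return max(root.children, key=lambda n: n.visits).trace

print(mcts([]))
```

Every simulation calls the proposer and the scorer, so with large branching factors and long traces the search quickly becomes expensive, and it breaks down when the reasoning cannot be cut into discrete, scoreable steps.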

6. Plans to Open Up RL Training Frameworks or Hyperparameters

  • We are actively considering open-sourcing the RL training framework and hyperparameters. This will include:
    • Detailed documentation.
    • Guidelines to help the community replicate our results effectively.
  • Stay tuned for updates!

7. Feedback on PRM and MCTS

  • We agree that PRM and MCTS are valuable for tasks with clear step granularity. However, they can restrict RL training for more general or open-ended tasks.
  • DeepSeek R1’s Approach:
    We balance the use of PRM and MCST with broader performance and readability goals. While they are important, we continue to explore ways to mitigate the "hacking" problem and improve robustness.

8. Typos in the Paper

  • Thank you for catching the typo on Page 14. You are correct—the reference to "Figure 6" should be "Table 6". This will be corrected in the next version of the paper.

Next Steps

If you’d like to contribute further, feel free to:

  • Fork the repo and submit a pull request (PR) for the typo fix.
  • Continue the discussion here for any additional questions or feedback.

Thank you again for your thoughtful questions and insights! Your engagement is highly appreciated.


kaiyliu commented Jan 22, 2025


Wait, in your report, is PRM a process reward model or a paired reward model?


hxypqr commented Jan 22, 2025

Is there any way to modify the reward function to speed up the appearance of "Aha Moment"?

@PSilvestre

Hello, I am wondering if the team would be open to sharing the breakdown of time taken and resources needed in each phase of training DeepSeek-R1: unsupervised pre-training, cold start, RL for reasoning, SFT, RL for all scenarios.

Also, thank you for the contribution to OSS!
