Can we use a generative reward model in GRPOTrainer? #3033
aabbccddwasd started this conversation in Ideas
As we know, in the GRPO framework the process begins with generating several completions, followed by computing their advantages. Typically, rule-based reward functions are used for this computation. However, I believe that in certain tasks we could use the model itself as the reward model, by prompting it to evaluate the completions against an explicit rubric. For example, a score of 0 could be assigned if the output is entirely incomprehensible, 1 if it is comprehensible, and 2 if it is both comprehensible and consistent with natural human phrasing.
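To make the idea concrete, here is a rough sketch of what I have in mind (my own illustration, not tested against the trainer): a reward function that asks a separate judge model to apply the 0/1/2 rubric above. The judge model name, the prompt wording, and the `(prompts, completions, **kwargs)` signature are assumptions based on how custom reward functions are usually written for TRL, and it assumes plain-text completions.

```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_NAME = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical judge; any instruct model could work
judge_tok = AutoTokenizer.from_pretrained(JUDGE_NAME)
judge = AutoModelForCausalLM.from_pretrained(
    JUDGE_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

RUBRIC = (
    "Rate the following answer on a 0-2 scale:\n"
    "0 = entirely incomprehensible, 1 = comprehensible, "
    "2 = comprehensible and consistent with natural human phrasing.\n"
    "Reply with a single digit.\n\nAnswer:\n{completion}"
)


def llm_judge_reward(prompts, completions, **kwargs):
    """Return one scalar reward per completion, scored by the judge model."""
    rewards = []
    for completion in completions:
        messages = [{"role": "user", "content": RUBRIC.format(completion=completion)}]
        input_ids = judge_tok.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(judge.device)
        with torch.no_grad():
            out = judge.generate(input_ids, max_new_tokens=4, do_sample=False)
        reply = judge_tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
        match = re.search(r"[012]", reply)  # fall back to 0 if the judge misbehaves
        rewards.append(float(match.group()) if match else 0.0)
    return rewards
```

If I understand the interface correctly, this callable could then be passed as `reward_funcs=llm_judge_reward` when constructing `GRPOTrainer`.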
I conceived this idea while trying to fine-tune a reasoning model to extract and output only the key points of its reasoning, rather than presenting the entire reasoning text to users. Initially, I attempted to prompt the model to generate such data, but the quality was unsatisfactory: without the full reasoning context, the outputs appeared illogical. From the model's perspective, however, these outputs might seem reasonable, because it still has the reasoning context available. This suggests that prompt engineering alone may not adequately address the issue. Yet when we evaluate these outputs without the reasoning text, it becomes evident that they lack logic, a shortcoming the model itself can also recognize when it judges without that context.
To implement this idea, I propose temporarily unloading the LoRA adapter after the completions are generated, and then calling the model itself inside the reward function passed to the trainer. I apologize for presenting this plan before thoroughly reviewing the entire codebase, but based on the code snippets I've seen (on Zhihu), the trainer does not use vLLM after generation, so this approach should be feasible.
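As a rough sketch of that part (assuming the policy is a PEFT `PeftModel`, whose `disable_adapter()` context manager temporarily routes generation through the frozen base weights; the judging prompt and parsing here are placeholders):

```python
import re

import torch


def make_self_judge_reward(model, tokenizer):
    """Build a reward function that scores completions with the adapter-disabled base model."""

    def reward_fn(prompts, completions, **kwargs):
        rewards = []
        # disable_adapter() temporarily bypasses the LoRA weights, so the frozen
        # base model acts as the judge while the adapter stays loaded for training.
        with model.disable_adapter(), torch.no_grad():
            for completion in completions:
                query = (
                    "Score this answer 0, 1, or 2 for comprehensibility and "
                    "naturalness. Reply with a single digit.\n\n" + completion
                )
                ids = tokenizer(query, return_tensors="pt").input_ids.to(model.device)
                out = model.generate(ids, max_new_tokens=4, do_sample=False)
                reply = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
                match = re.search(r"[012]", reply)
                rewards.append(float(match.group()) if match else 0.0)
        return rewards

    return reward_fn
```

Whether it is safe to run these extra generations at that point in the training step is exactly the part I have not verified in the codebase.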
(I'm just a high school student in China and my English is not very good, so after writing the original version of this post I rewrote it with Qwen2.5. It seems much better now, but if anything reads strangely, just tell me...)