Thanks for your great work! From the code, it appears that during training the answer is extracted directly with the regex `r"<answer>(.*?)</answer>"` and the reward is computed purely from accuracy. The format reward (e.g., enforcing the `<think></think><answer></answer>` structure) does not seem to be incorporated into the reward function.

If this is the case, would directly applying RL to Qwen/Qwen2.5-7B without an explicit format reward lead to lower training efficiency or stability? Thanks!
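For concreteness, here is a minimal sketch of what I mean (the function names and the exact matching logic are my assumptions, not code from this repo): the accuracy-only reward as I read it from the code, next to the kind of format reward I am asking about.

```python
import re

# Accuracy-only reward, as I understand the current setup: extract the
# answer span and compare it to the ground truth. (A sketch, not the
# repo's actual implementation.)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def accuracy_reward(completion: str, ground_truth: str) -> float:
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Hypothetical format reward this question is about: a bonus for emitting
# the full <think></think><answer></answer> structure.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_RE.fullmatch(completion.strip()) else 0.0
```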