base: main
add prm #399
Conversation
Summary of Changes
Hello @Lu-Yang666, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the reinforcement learning framework by integrating support for Process Reward Models (PRM). It introduces new configuration options, a specialized workflow for combining PRM-derived rewards with standard rewards, and a standalone service for efficient PRM inference. The changes are exemplified by a new training script for the GSM8K dataset, aiming to leverage the granular feedback of PRMs to improve model performance on complex reasoning tasks.
Code Review
This pull request introduces support for Process-based Reward Models (PRM) by adding new configurations, a dedicated workflow, an example training script, and a PRM scoring service. The changes are a good step towards incorporating PRM into the training loop.
My review focuses on improving code quality, maintainability, and portability. I've identified several critical issues, including a bug in the reward calculation and syntax errors in shell scripts. I've also pointed out multiple instances of hardcoded, user-specific paths and other values that should be made configurable to make the code more portable and easier to use in different environments. Additionally, there's some dead code and debug prints that should be cleaned up.
Please address the critical and high-severity comments to ensure the new functionality is robust and correct.
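For context, here is a minimal sketch of what the standalone scoring service behind `http://localhost:8001/score` might look like, assuming a Qwen-style PRM that marks reasoning steps with `<extra_0>`. The scoring logic mirrors the commented-out code in the workflow further down; the model path, host, port, and module name are placeholders, not values from this PR.

```python
# prm_service.py -- hypothetical sketch of the PRM scoring service.
import torch
import torch.nn.functional as F
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModel, AutoTokenizer

PRM_PATH = "/path/to/prm"  # assumed; e.g. the value of config.prm_path

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(PRM_PATH, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(PRM_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True)
    .to(device)
    .eval()
)

app = FastAPI()


class ScoreRequest(BaseModel):
    text: str


@app.post("/score")
def score(req: ScoreRequest):
    input_ids = tokenizer.encode(req.text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids)
    # Probability of the "correct step" class at each <extra_0> separator,
    # following the commented-out scoring code in the workflow.
    step_sep_id = tokenizer.encode("<extra_0>")[0]
    token_masks = input_ids == step_sep_id
    probabilities = F.softmax(outputs[0], dim=-1) * token_masks.unsqueeze(-1)
    sample = probabilities[0]
    step_scores = sample[sample != 0].view(-1, 2)[:, 1]
    # The workflow's commented-out code takes the first step's score; averaging
    # over all steps would be another option.
    return {"reward": step_scores[0].item()}
```

It could be launched with something like `uvicorn prm_service:app --port 8001` (module name assumed).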
```python
# probabilities = F.softmax(prm_outputs[0], dim=-1) * token_masks.unsqueeze(-1)
# sample = probabilities[0]
# prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item()
resp = requests.post("http://localhost:8001/score", json={"text": conversation_str})
```
The URL for the PRM scoring service is hardcoded. This makes the example script inflexible and difficult to run if the service is on a different host or port. It would be better to make this configurable, for example, by reading it from an environment variable or from the experiment configuration. You will need to add import os for the suggestion to work.
Suggested change:
```diff
-resp = requests.post("http://localhost:8001/score", json={"text": conversation_str})
+resp = requests.post(os.getenv("PRM_SERVICE_URL", "http://localhost:8001/score"), json={"text": conversation_str})
```
```diff
 logger = logging.getLogger("Launcher Utils")

-LOCAL_CACHE_DIR = "/tmp/areal"
+LOCAL_CACHE_DIR = "/data/yl/AReaL/tmp/areal"
```
The LOCAL_CACHE_DIR is hardcoded to a user-specific path. This makes the code non-portable and will likely cause it to fail on other developers' machines. It's better to use a more standard temporary directory or allow this path to be configured via an environment variable.
Suggested change:
```diff
-LOCAL_CACHE_DIR = "/data/yl/AReaL/tmp/areal"
+LOCAL_CACHE_DIR = os.environ.get("AREAL_CACHE_DIR", "/tmp/areal")
```
```python
from areal.api.io_struct import FinetuneSpec, StepInfo, WeightUpdateMeta
from areal.dataset import get_custom_dataset
from areal.engine.ppo.actor import FSDPPPOActor
from areal.engine.ppo.prm import FSDPPPOPrm
```
| print(f"conversation str: {conversation_str}") | ||
| # prm_input_ids = prm_tokenizer.encode( | ||
| # conversation_str, | ||
| # return_tensors="pt", | ||
| # ).to(prm_model.device) | ||
| # prm_outputs = prm_model(input_ids=prm_input_ids) | ||
| # step_sep_id = prm_tokenizer.encode("<extra_0>")[0] | ||
| # token_masks = (prm_input_ids == step_sep_id) | ||
| # probabilities = F.softmax(prm_outputs[0], dim=-1)* token_masks.unsqueeze(-1) | ||
| # sample = probabilities[0] | ||
| # prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item() | ||
| resp = requests.post("http://localhost:8001/score", json={"text": conversation_str}) | ||
| prm_reward = resp.json()["reward"] | ||
| print(f"prm_reward: {prm_reward}") |
These print statements appear to be for debugging. They should be removed or replaced with proper logging using the logging module to avoid cluttering the output.
| print(f"conversation str: {conversation_str}") | |
| # prm_input_ids = prm_tokenizer.encode( | |
| # conversation_str, | |
| # return_tensors="pt", | |
| # ).to(prm_model.device) | |
| # prm_outputs = prm_model(input_ids=prm_input_ids) | |
| # step_sep_id = prm_tokenizer.encode("<extra_0>")[0] | |
| # token_masks = (prm_input_ids == step_sep_id) | |
| # probabilities = F.softmax(prm_outputs[0], dim=-1)* token_masks.unsqueeze(-1) | |
| # sample = probabilities[0] | |
| # prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item() | |
| resp = requests.post("http://localhost:8001/score", json={"text": conversation_str}) | |
| prm_reward = resp.json()["reward"] | |
| print(f"prm_reward: {prm_reward}") | |
| # prm_input_ids = prm_tokenizer.encode( | |
| # conversation_str, | |
| # return_tensors="pt", | |
| # ).to(prm_model.device) | |
| # prm_outputs = prm_model(input_ids=prm_input_ids) | |
| # step_sep_id = prm_tokenizer.encode("<extra_0>")[0] | |
| # token_masks = (prm_input_ids == step_sep_id) | |
| # probabilities = F.softmax(prm_outputs[0], dim=-1)* token_masks.unsqueeze(-1) | |
| # sample = probabilities[0] | |
| # prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item() | |
| resp = requests.post("http://localhost:8001/score", json={"text": conversation_str}) | |
| prm_reward = resp.json()["reward"] |
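If these values are still useful during development, one alternative is to route them through the `logging` module at DEBUG level. A sketch below; `query_prm_service` is a hypothetical helper for illustration, not part of this PR:

```python
import logging

import requests

logger = logging.getLogger(__name__)


def query_prm_service(conversation_str: str, url: str = "http://localhost:8001/score") -> float:
    # DEBUG-level logging replaces the ad-hoc prints; enable it during
    # development with logging.basicConfig(level=logging.DEBUG).
    logger.debug("conversation str: %s", conversation_str)
    resp = requests.post(url, json={"text": conversation_str})
    prm_reward = resp.json()["reward"]
    logger.debug("prm_reward: %s", prm_reward)
    return prm_reward
```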
```python
# prm_tokenizer = AutoTokenizer.from_pretrained(config.prm_path, local_files_only=True, trust_remote_code=True)
# prm_model = AutoModel.from_pretrained(
#     config.prm_path,
#     torch_dtype=torch.bfloat16,
#     local_files_only=True,
#     trust_remote_code=True,
# ).eval()
```
```python
# prm_model: PreTrainedModel,
# prm_tokenizer: PreTrainedTokenizerFast,
```
Hi @Lu-Yang666, thanks for the great contribution! The feature looks great, but it may not be ready to be merged in its current form.
Please:
- Clean code: remove unused comments and prints for debugging
- Format files according to the contribution guide
- Follow or respond to Gemini's suggestions
```python
logger = logging.getLogger("Launcher Utils")

LOCAL_CACHE_DIR = "/tmp/areal"
```
Should revert.
You can keep these scripts for internal usage. :)
We instead recommend creating a README under the examples/prm folder to show the usage of the PRM example.
```python
gconfig: GenerationHyperparameters = field(
    default_factory=GenerationHyperparameters
)
prmconfig: PRMRewardHyperparameters = field(
```
It looks like we could just inherit GRPOConfig and add two new fields, prm_path and reward_shaping_alpha. BTW, if you are referring to reward scaling, you can use actor.reward_scaling rather than creating a new field.
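A minimal sketch of what that could look like; the GRPOConfig import path is assumed, the field names follow the suggestion above, and the defaults are placeholders:

```python
from dataclasses import dataclass

from areal.api.cli_args import GRPOConfig  # import path assumed


@dataclass
class PRMGRPOConfig(GRPOConfig):
    # Local path or identifier of the process reward model.
    prm_path: str = ""
    # Weight for mixing the PRM reward into the outcome reward
    # (hypothetical; only needed if it is not plain reward scaling).
    reward_shaping_alpha: float = 0.5
```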
areal/workflow/rlvr_prm.py (outdated)
```python
# clip mechanism
avg_prm_reward = sum(prm_rewards) / len(prm_rewards)
for i, val in enumerate(prm_rewards):
    if val > avg_prm_reward:
        rewards[i] = 0
for res, r in zip(results, rewards):
    res["rewards"] = torch.tensor([float(r)])
```
Can we add some comments or configurations to control this behavior?
This workflow still uses an outcome-based reward. How's the PRM actually used?
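For illustration, here is a sketch of one way to make the behavior explicit and configurable, and to actually blend the PRM score into the training signal. `prm_clip_above_mean` and `reward_shaping_alpha` are hypothetical knobs, not existing fields; `rewards` are the outcome (rule-based) rewards and `prm_rewards` the per-sample PRM scores from the workflow above:

```python
import torch


def shape_rewards(rewards, prm_rewards, reward_shaping_alpha=0.5, prm_clip_above_mean=True):
    avg_prm_reward = sum(prm_rewards) / len(prm_rewards)
    shaped = []
    for outcome, prm in zip(rewards, prm_rewards):
        if prm_clip_above_mean and prm > avg_prm_reward:
            # Current clip mechanism: zero out samples whose PRM score
            # exceeds the batch mean (kept behind a flag).
            shaped.append(0.0)
        else:
            # Blend the outcome reward with the PRM reward so the PRM
            # contributes to the reward rather than only gating it.
            shaped.append(
                (1 - reward_shaping_alpha) * float(outcome)
                + reward_shaping_alpha * float(prm)
            )
    return [torch.tensor([r]) for r in shaped]
```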
```python
# probabilities = F.softmax(prm_outputs[0], dim=-1) * token_masks.unsqueeze(-1)
# sample = probabilities[0]
# prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item()
resp = requests.post("http://localhost:8001/score", json={"text": conversation_str})
```
We can add this URL in the config.
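A possible shape for that, as a sketch; `prm_service_url` is a hypothetical config field added next to prm_path:

```python
import requests


def score_with_prm(conversation_str: str, config) -> float:
    # Fall back to the current hardcoded address if the field is absent.
    url = getattr(config, "prm_service_url", "http://localhost:8001/score")
    resp = requests.post(url, json={"text": conversation_str})
    return resp.json()["reward"]
```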
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
No description provided.