base: main
add prm #399
Conversation
Summary of Changes
Hello @Lu-Yang666, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the reinforcement learning framework by integrating support for Process Reward Models (PRM). It introduces new configuration options, a specialized workflow for combining PRM-derived rewards with standard rewards, and a standalone service for efficient PRM inference. The changes are exemplified by a new training script for the GSM8K dataset, aiming to leverage the granular feedback of PRMs to improve model performance on complex reasoning tasks.
Code Review
This pull request introduces support for Process-based Reward Models (PRM) by adding new configurations, a dedicated workflow, an example training script, and a PRM scoring service. The changes are a good step towards incorporating PRM into the training loop.
My review focuses on improving code quality, maintainability, and portability. I've identified several critical issues, including a bug in the reward calculation and syntax errors in shell scripts. I've also pointed out multiple instances of hardcoded, user-specific paths and other values that should be made configurable to make the code more portable and easier to use in different environments. Additionally, there's some dead code and debug prints that should be cleaned up.
Please address the critical and high-severity comments to ensure the new functionality is robust and correct.
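For context, here is a minimal sketch of what the standalone scoring service behind `http://localhost:8001/score` might look like, assuming a Qwen-style PRM that marks reasoning steps with `<extra_0>`. The scoring logic mirrors the commented-out code in the workflow further down; the model path, host, port, and module name are placeholders, not values from this PR.

```python
# prm_service.py -- hypothetical sketch of the PRM scoring service.
import torch
import torch.nn.functional as F
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModel, AutoTokenizer

PRM_PATH = "/path/to/prm"  # assumed; e.g. the value of config.prm_path

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(PRM_PATH, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(PRM_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True)
    .to(device)
    .eval()
)

app = FastAPI()


class ScoreRequest(BaseModel):
    text: str


@app.post("/score")
def score(req: ScoreRequest):
    input_ids = tokenizer.encode(req.text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids)
    # Probability of the "correct step" class at each <extra_0> separator,
    # following the commented-out scoring code in the workflow.
    step_sep_id = tokenizer.encode("<extra_0>")[0]
    token_masks = input_ids == step_sep_id
    probabilities = F.softmax(outputs[0], dim=-1) * token_masks.unsqueeze(-1)
    sample = probabilities[0]
    step_scores = sample[sample != 0].view(-1, 2)[:, 1]
    # The workflow's commented-out code takes the first step's score; averaging
    # over all steps would be another option.
    return {"reward": step_scores[0].item()}
```

It could be launched with something like `uvicorn prm_service:app --port 8001` (module name assumed).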
```python
# probabilities = F.softmax(prm_outputs[0], dim=-1) * token_masks.unsqueeze(-1)
# sample = probabilities[0]
# prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item()
resp = requests.post("http://localhost:8001/score", json={"text": conversation_str})
```
The URL for the PRM scoring service is hardcoded. This makes the example script inflexible and difficult to run if the service is on a different host or port. It would be better to make this configurable, for example, by reading it from an environment variable or from the experiment configuration. You will need to add import os for the suggestion to work.
Suggested change:
```diff
-resp = requests.post("http://localhost:8001/score", json={"text": conversation_str})
+resp = requests.post(os.getenv("PRM_SERVICE_URL", "http://localhost:8001/score"), json={"text": conversation_str})
```
```diff
 logger = logging.getLogger("Launcher Utils")

-LOCAL_CACHE_DIR = "/tmp/areal"
+LOCAL_CACHE_DIR = "/data/yl/AReaL/tmp/areal"
```
The LOCAL_CACHE_DIR is hardcoded to a user-specific path. This makes the code non-portable and will likely cause it to fail on other developers' machines. It's better to use a more standard temporary directory or allow this path to be configured via an environment variable.
Suggested change:
```diff
-LOCAL_CACHE_DIR = "/data/yl/AReaL/tmp/areal"
+LOCAL_CACHE_DIR = os.environ.get("AREAL_CACHE_DIR", "/tmp/areal")
```
```python
from areal.api.io_struct import FinetuneSpec, StepInfo, WeightUpdateMeta
from areal.dataset import get_custom_dataset
from areal.engine.ppo.actor import FSDPPPOActor
from areal.engine.ppo.prm import FSDPPPOPrm
```
| print(f"conversation str: {conversation_str}") | ||
| # prm_input_ids = prm_tokenizer.encode( | ||
| # conversation_str, | ||
| # return_tensors="pt", | ||
| # ).to(prm_model.device) | ||
| # prm_outputs = prm_model(input_ids=prm_input_ids) | ||
| # step_sep_id = prm_tokenizer.encode("<extra_0>")[0] | ||
| # token_masks = (prm_input_ids == step_sep_id) | ||
| # probabilities = F.softmax(prm_outputs[0], dim=-1)* token_masks.unsqueeze(-1) | ||
| # sample = probabilities[0] | ||
| # prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item() | ||
| resp = requests.post("http://localhost:8001/score", json={"text": conversation_str}) | ||
| prm_reward = resp.json()["reward"] | ||
| print(f"prm_reward: {prm_reward}") |
These print statements appear to be for debugging. They should be removed or replaced with proper logging using the logging module to avoid cluttering the output.
| print(f"conversation str: {conversation_str}") | |
| # prm_input_ids = prm_tokenizer.encode( | |
| # conversation_str, | |
| # return_tensors="pt", | |
| # ).to(prm_model.device) | |
| # prm_outputs = prm_model(input_ids=prm_input_ids) | |
| # step_sep_id = prm_tokenizer.encode("<extra_0>")[0] | |
| # token_masks = (prm_input_ids == step_sep_id) | |
| # probabilities = F.softmax(prm_outputs[0], dim=-1)* token_masks.unsqueeze(-1) | |
| # sample = probabilities[0] | |
| # prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item() | |
| resp = requests.post("http://localhost:8001/score", json={"text": conversation_str}) | |
| prm_reward = resp.json()["reward"] | |
| print(f"prm_reward: {prm_reward}") | |
| # prm_input_ids = prm_tokenizer.encode( | |
| # conversation_str, | |
| # return_tensors="pt", | |
| # ).to(prm_model.device) | |
| # prm_outputs = prm_model(input_ids=prm_input_ids) | |
| # step_sep_id = prm_tokenizer.encode("<extra_0>")[0] | |
| # token_masks = (prm_input_ids == step_sep_id) | |
| # probabilities = F.softmax(prm_outputs[0], dim=-1)* token_masks.unsqueeze(-1) | |
| # sample = probabilities[0] | |
| # prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item() | |
| resp = requests.post("http://localhost:8001/score", json={"text": conversation_str}) | |
| prm_reward = resp.json()["reward"] |
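If these values are still useful during development, one alternative is to route them through the `logging` module at DEBUG level. A sketch below; `query_prm_service` is a hypothetical helper for illustration, not part of this PR:

```python
import logging

import requests

logger = logging.getLogger(__name__)


def query_prm_service(conversation_str: str, url: str = "http://localhost:8001/score") -> float:
    # DEBUG-level logging replaces the ad-hoc prints; enable it during
    # development with logging.basicConfig(level=logging.DEBUG).
    logger.debug("conversation str: %s", conversation_str)
    resp = requests.post(url, json={"text": conversation_str})
    prm_reward = resp.json()["reward"]
    logger.debug("prm_reward: %s", prm_reward)
    return prm_reward
```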
```python
# prm_tokenizer = AutoTokenizer.from_pretrained(config.prm_path, local_files_only=True, trust_remote_code=True)
# prm_model = AutoModel.from_pretrained(
#     config.prm_path,
#     torch_dtype=torch.bfloat16,
#     local_files_only=True,
#     trust_remote_code=True,
# ).eval()
```
```python
# prm_model: PreTrainedModel,
# prm_tokenizer: PreTrainedTokenizerFast,
```
Hi @Lu-Yang666, thanks for the great contribution! The feature looks great, but it may not be ready to be merged in its current form.
Please:
- Clean code: remove unused comments and prints for debugging
- Format files according to the contribution guide
- Follow or respond to Gemini's suggestions
```python
logger = logging.getLogger("Launcher Utils")

LOCAL_CACHE_DIR = "/tmp/areal"
```
Should revert.
You can keep these scripts for internal usage. :)
We instead recommend creating a README under the examples/prm folder to show the usage of the PRM example.
```python
gconfig: GenerationHyperparameters = field(
    default_factory=GenerationHyperparameters
)
prmconfig: PRMRewardHyperparameters = field(
```
It looks like we could just inherit GRPOConfig and add two new fields, prm_path and reward_shaping_alpha. BTW, if you are referring to reward scaling, you can use actor.reward_scaling rather than creating a new field.
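A minimal sketch of what that could look like; the GRPOConfig import path is assumed, the field names follow the suggestion above, and the defaults are placeholders:

```python
from dataclasses import dataclass

from areal.api.cli_args import GRPOConfig  # import path assumed


@dataclass
class PRMGRPOConfig(GRPOConfig):
    # Local path or identifier of the process reward model.
    prm_path: str = ""
    # Weight for mixing the PRM reward into the outcome reward
    # (hypothetical; only needed if it is not plain reward scaling).
    reward_shaping_alpha: float = 0.5
```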
areal/workflow/rlvr_prm.py (outdated)
```python
# clip mechanism
avg_prm_reward = sum(prm_rewards) / len(prm_rewards)
for i, val in enumerate(prm_rewards):
    if val > avg_prm_reward:
        rewards[i] = 0
for res, r in zip(results, rewards):
    res["rewards"] = torch.tensor([float(r)])
```
Can we add some comments or configurations to control this behavior?
This workflow still uses an outcome-based reward. How's the PRM actually used?
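For illustration, here is a sketch of one way to make the behavior explicit and configurable, and to actually blend the PRM score into the training signal. `prm_clip_above_mean` and `reward_shaping_alpha` are hypothetical knobs, not existing fields; `rewards` are the outcome (rule-based) rewards and `prm_rewards` the per-sample PRM scores from the workflow above:

```python
import torch


def shape_rewards(rewards, prm_rewards, reward_shaping_alpha=0.5, prm_clip_above_mean=True):
    avg_prm_reward = sum(prm_rewards) / len(prm_rewards)
    shaped = []
    for outcome, prm in zip(rewards, prm_rewards):
        if prm_clip_above_mean and prm > avg_prm_reward:
            # Current clip mechanism: zero out samples whose PRM score
            # exceeds the batch mean (kept behind a flag).
            shaped.append(0.0)
        else:
            # Blend the outcome reward with the PRM reward so the PRM
            # contributes to the reward rather than only gating it.
            shaped.append(
                (1 - reward_shaping_alpha) * float(outcome)
                + reward_shaping_alpha * float(prm)
            )
    return [torch.tensor([r]) for r in shaped]
```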
```python
# probabilities = F.softmax(prm_outputs[0], dim=-1) * token_masks.unsqueeze(-1)
# sample = probabilities[0]
# prm_reward = sample[sample != 0].view(-1, 2)[:, 1][0].item()
resp = requests.post("http://localhost:8001/score", json={"text": conversation_str})
```
We can add this URL in the config.
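A possible shape for that, as a sketch; `prm_service_url` is a hypothetical config field added next to prm_path:

```python
import requests


def score_with_prm(conversation_str: str, config) -> float:
    # Fall back to the current hardcoded address if the field is absent.
    url = getattr(config, "prm_service_url", "http://localhost:8001/score")
    resp = requests.post(url, json={"text": conversation_str})
    return resp.json()["reward"]
```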
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
No description provided.