Conversation

@yulangz yulangz commented Oct 27, 2025

Description

Draft: support agents via API_URL_BASE.
Supports RL for any agent framework by setting API_URL_BASE.

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist
Contributor

Summary of Changes

Hello @yulangz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and flexible architecture for integrating external agent frameworks with the areal inference engine, with a primary focus on enabling Reinforcement Learning. By providing an OpenAI-compatible API proxy, it abstracts the underlying inference engine, allowing agents to interact through a familiar interface. The changes include dynamic configuration injection during agent runs, comprehensive reward management, and utility features like web page caching, all demonstrated through a suite of new agent examples covering search and multi-turn problem-solving.

Highlights

  • OpenAI API Proxy for Agent Frameworks: Introduced a new ProxyServer that emulates the OpenAI API, allowing external agent frameworks to seamlessly interact with the areal inference engine. This enables a flexible integration point for various agent designs.
  • Dynamic RunConfig Patching: Implemented AReaLOpenAIClientContext, an asynchronous context manager that dynamically patches the OpenAIRunner.run method. This allows for the merging of RunConfig settings, providing fine-grained control over generation parameters (like stop sequences or max tokens) during agent execution within an RL loop.
  • Reinforcement Learning Integration: Enhanced the system to support Reinforcement Learning for agents by adding mechanisms to set and retrieve reward signals (set_final_reward, get_final_reward) and manage trajectories within the ProxyServer and ArealOpenAI client. This includes applying reward discounting across turns.
  • Comprehensive Agent Examples: Added several new examples demonstrating the application of these features, including an ASearcher agent for complex search and web access tasks, and agents for single-turn and multi-turn math problem-solving. These examples showcase how to leverage the new proxy and patching for diverse agent behaviors.
  • Web Page Caching: Integrated a WebPageCache with thread-safe operations and file-based persistence to improve the efficiency of web access tools by caching frequently retrieved web page content.
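The per-turn reward discounting mentioned above can be sketched roughly as follows. This is an illustrative sketch only: the helper name, `gamma`, and the turn ordering are assumptions, not the actual ProxyServer/ArealOpenAI API.

```python
# Illustrative sketch (assumed names): the final reward is credited fully to
# the last turn, and earlier turns receive copies discounted by gamma per
# step of distance from the end.
def discount_final_reward(
    final_reward: float, num_turns: int, gamma: float = 0.9
) -> list[float]:
    """Return one reward per turn, discounted backwards from the last turn."""
    return [final_reward * gamma ** (num_turns - 1 - t) for t in range(num_turns)]
```

For example, with `final_reward=1.0`, three turns, and `gamma=0.5`, this yields `[0.25, 0.5, 1.0]`.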

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Supported commands:

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant feature to support agent-based reinforcement learning by proxying an OpenAI-compatible API. This is a powerful abstraction that enables integration with various agent frameworks. The implementation includes a FastAPI-based proxy server, monkey-patching of the agents library, and several new example training scripts for math and search agents.

The overall approach is sound, but as this is a work-in-progress, there are several areas that need attention. I've identified some potential bugs, resource management issues, and opportunities for code improvement. My detailed comments below address these points, focusing on correctness, maintainability, and robustness.

Comment on lines 41 to 44
if self.config.wandb.wandb_base_url:
os.environ["WANDB_API_KEY"] = self.config.wandb.wandb_api_key
if self.config.wandb.wandb_api_key:
os.environ["WANDB_BASE_URL"] = self.config.wandb.wandb_base_url

critical

The logic for setting the WANDB_API_KEY and WANDB_BASE_URL environment variables appears to be swapped. The API key is set based on the presence of the base URL, and vice versa. This will cause wandb.login() to fail or connect to the wrong endpoint if only one of the two is configured.

Suggested change:
- if self.config.wandb.wandb_base_url:
-     os.environ["WANDB_API_KEY"] = self.config.wandb.wandb_api_key
- if self.config.wandb.wandb_api_key:
-     os.environ["WANDB_BASE_URL"] = self.config.wandb.wandb_base_url
+ if self.config.wandb.wandb_api_key:
+     os.environ["WANDB_API_KEY"] = self.config.wandb.wandb_api_key
+ if self.config.wandb.wandb_base_url:
+     os.environ["WANDB_BASE_URL"] = self.config.wandb.wandb_base_url

completion_str = resp.final_output

# agent extracts tool callings from the llm response
tool_calls = agent.consume_llm_response(resp, completion_str)

critical

The method agent.consume_llm_response is called with two arguments, resp and completion_str. However, its definition in examples/openai-agents/asearcher/agent/search_agent.py only accepts one argument, completion_text. This will cause a TypeError at runtime.

Suggested change:
- tool_calls = agent.consume_llm_response(resp, completion_str)
+ tool_calls = agent.consume_llm_response(completion_str)

# call tool and compute reward
if tool_calls is not None and len(tool_calls) > 0:
tool_call = tool_calls[0]
res = (await self.toolbox.step((qid, [tool_call])))[0]

critical

The run_agent method calls self.toolbox.step, but self.toolbox is not initialized anywhere in the ASearcherAgent class. This will result in an AttributeError at runtime. The SearchToolBox should be initialized in ASearcherAgent.__init__.

Comment on lines +303 to +307
if self.search_only
else SEARCH_ACCESS_PROMPT_TEMPLATE
)
prompt = prompt_template.format(question=data["question"])
valid_inst: bool = np.random.uniform(0, 1) <= self.valid_inst_ratio

critical

The arun_episode method in ASearcherWorkflow uses self.search_only and self.valid_inst_ratio, but these attributes are not defined in the class. This will cause an AttributeError. These should be added as parameters to the __init__ method and likely configured in ASearcherRLConfig.

Comment on lines +456 to +463
def __del__(self):
"""Ensure socket is closed on deletion."""
if self.server:
self.server.should_exit = True
self.server = None
if self.sock:
self.sock.close()
self.sock = None

high

Using __del__ to close sockets and shut down the uvicorn server is unreliable. The __del__ method is not guaranteed to be called when an object's reference count drops to zero, which can lead to resource leaks (e.g., open sockets, running server threads).

It would be more robust to provide an explicit shutdown method that can be called deterministically by the owner of the ProxyServer instance to ensure resources are properly released.

Suggested change:
- def __del__(self):
-     """Ensure socket is closed on deletion."""
-     if self.server:
-         self.server.should_exit = True
-         self.server = None
-     if self.sock:
-         self.sock.close()
-         self.sock = None
+ def shutdown(self):
+     """Shut down the server and release the socket."""
+     if self.server:
+         self.server.should_exit = True
+         self.server = None
+     if self.sock:
+         self.sock.close()
+         self.sock = None
+
+ def __del__(self):
+     self.shutdown()

Comment on lines +87 to +91
except Exception as e:
# 如果出现问题,确保关闭socket
if "sock" in locals() and sock:
sock.close()
raise e

high

In the except block, if socket.socket() fails, the sock variable will not be defined, leading to a NameError when if "sock" in locals() and sock: is executed. It's safer to initialize sock to None before the try block to prevent this crash.

Additionally, the comment on line 88 is in Chinese, which is inconsistent with the English comments in the rest of the file. It should be translated for consistency.

Suggested change:
- except Exception as e:
-     # 如果出现问题,确保关闭socket
-     if "sock" in locals() and sock:
-         sock.close()
-     raise e
+ except Exception as e:
+     # Ensure the socket is closed if an issue occurs.
+     if "sock" in locals() and sock:
+         sock.close()
+     raise e

Comment on lines +147 to +196
def process_webpage(self, content):
keys = [
("title", "title"),
("p", "p"),
("li", "li", lambda c: "\n" not in c),
("td", "td"),
("tr", "tr"),
]
content_list = []
init_length = len(content)
while any([f"<{k[0]}" in content and f"</{k[1]}>" in content for k in keys]):
klr = []
for k in keys:
start = 0
# print(k)
while True:
ls = [content[start:].find(f"<{k[0]}{c}") for c in [">", " "]]
ls = [l for l in ls if l != -1]
l = -1 if len(ls) == 0 else min(ls)
# print(ls)
if l == -1:
break
l += start
r = content[l:].find(f"</{k[1]}>")
if r == -1:
break
if (len(k) <= 2) or (len(k) >= 3 and k[2](content[l : l + r])):
# print(k, l, l+r)
klr.append((k, l, l + r))
break
start = l + r

if len(klr) == 0:
break
klr = sorted(klr, key=lambda x: x[1])
k, l, r = klr[0]
content_list.append(content[l : r + len(f"</{k[1]}>")])
# print(content_list[-1])
# input("stop...")
if k[0] == "p":
content_list[-1] += "\n\n"
elif k[0] == "li":
content_list[-1] += "\n"
content = content[r:]
content = "".join(content_list)
final_length = len(content)
logger.info(
f"process the webpage: {init_length} -> {final_length}. {content[:100]}"
)
return content

high

The process_webpage method uses string searching (find) and manual slicing to parse HTML content. This approach is very brittle and can easily break with small variations in HTML structure. Using a dedicated HTML parsing library like BeautifulSoup would be far more robust and maintainable.

Here is an example of how you could implement this with BeautifulSoup:

from bs4 import BeautifulSoup

def process_webpage(self, content):
    soup = BeautifulSoup(content, 'html.parser')
    
    # Extract text from relevant tags
    texts = []
    for tag in soup.find_all(['title', 'p', 'li', 'td', 'tr']):
        texts.append(tag.get_text(separator=' ', strip=True))
    
    processed_content = "\n\n".join(texts)
    
    logger.info(
        f"process the webpage: {len(content)} -> {len(processed_content)}. {processed_content[:100]}"
    )
    return processed_content

Comment on lines +174 to +178
proxy_thread = threading.Thread(
target=self.proxy_server.run, args=(sock,), daemon=True
)
logger.info(f"[wht debug] Starting proxy server on port {port}")
proxy_thread.start()

high

The ProxyServer is started in a daemon thread, but there is no corresponding mechanism to explicitly shut it down when the training finishes or an error occurs. Relying on __del__ for cleanup is not reliable and can lead to leaked resources like sockets and threads. The MultiturnRLVRAgentWorkflow should manage the lifecycle of the ProxyServer and ensure its shutdown method (which should be implemented) is called.
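One way to make the lifecycle deterministic is a small context manager around the server. This is a sketch under assumptions: `start()` and `shutdown()` are the methods the review proposes, not existing areal API.

```python
import contextlib


# Sketch of deterministic ProxyServer lifecycle management; `start` and
# `shutdown` are assumed methods per the review, not the current areal API.
@contextlib.contextmanager
def proxy_server_lifecycle(server):
    """Guarantee server.shutdown() runs even if the workflow raises."""
    server.start()  # assumed to spawn the daemon serving thread internally
    try:
        yield server
    finally:
        server.shutdown()  # explicit, deterministic cleanup
```

The workflow would then wrap its rollout loop in `with proxy_server_lifecycle(self.proxy_server): ...` instead of relying on `__del__`.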

Comment on lines +228 to +229
# Many of this code are copied from areal/experimental/openai/client.py
# I only add lock for thread safety

medium

The comment on line 228 highlights that a significant amount of code for reward and completion management (set_reward, apply_reward_discount, export_completions, etc.) is duplicated from areal/experimental/openai/client.py. This duplication makes the code harder to maintain, as changes will need to be made in two places.

Consider refactoring this logic into a shared CompletionCacheManager class that both ArealOpenAI and ProxyServer can use. ProxyServer could then manage a dictionary of these managers, keyed by task_id, to handle concurrent tasks.
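As a rough sketch of that refactor: the class and method names below follow the review's suggestion, and the reward-only state is a deliberate simplification of the real completion records.

```python
import threading


# Sketch of the proposed shared manager (assumed shape): both ArealOpenAI
# and ProxyServer could hold one instance per task_id instead of
# duplicating reward/completion bookkeeping.
class CompletionCacheManager:
    """Thread-safe per-completion reward store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rewards: dict[str, float] = {}  # completion_id -> reward

    def set_reward(self, completion_id: str, reward: float) -> None:
        with self._lock:
            self._rewards[completion_id] = reward

    def apply_reward_discount(self, gamma: float) -> None:
        # Discount earlier completions relative to the most recent one,
        # relying on dict insertion order for turn order.
        with self._lock:
            ids = list(self._rewards)
            for i, cid in enumerate(ids):
                self._rewards[cid] *= gamma ** (len(ids) - 1 - i)

    def export_completions(self) -> dict[str, float]:
        with self._lock:
            return dict(self._rewards)
```

ProxyServer could then keep `dict[task_id, CompletionCacheManager]` and dispatch by task, so concurrent tasks never share a lock or reward table.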
