Reinforcement Learning for Agents
Orign makes it simple to train and deploy robust AI agents that can learn from human feedback and exploration. It further provides mechanisms for agents to learn interactively and autonomously.
Built on the Nebulous runtime, Orign components can be run on any cloud and can easily connect across clouds and regions.
Ships as a single binary, performant and lightweight via Rust 🦀
It takes a team to align models, we connect them globally 🌎
Warning
Orign is in alpha, things may break.
Python
pip install orign
CLI
curl -fsSL -H "Cache-Control: no-cache" https://storage.googleapis.com/orign/releases/install.sh | bash
Start an orign server
orign serve --docker
Or optionally run on Kubernetes with our helm chart
Let's use reinforcement learning to train an agent to use a Playwright MCP server.
from orign import Human, QwenVL2_5, Message, Feedback, processor
from mcp_use import MCPClient
# Create an online LLM that can both learn and act
llm = QwenVL2_5(name="playwright-actor")
config = {
    "mcpServers": {
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
            "env": {"DISPLAY": ":1"},
        }
    }
}
# Create a Playwright MCP server
client = MCPClient.from_dict(config)
@processor(image="python:3.11-slim", platform="ec2")
def on_feedback(message: Message[Feedback]):
    # Load our actor LLM
    llm = QwenVL2_5.load("playwright-actor")
    # Parse the feedback from the message
    feedback = message.content
    if feedback.approved:
        # Send to the LLM to learn
        llm.learn(feedback.messages)
# Create a human that can review for us.
# When they do, the on_feedback function will be called.
human = Human(
    name="playwright-reviewer",
    medium="ui",
    callback=on_feedback
)
# Run an agent
task = "find a flight to Germany from London in December"
max_steps = 30
ctx = f"""You are operating a web browser helping accomplish tasks.
Please help complete the task '{task}' with the tools: {client.tools()}
Given the current screenshot of the browser, please select your next action.
If you are done, simply return the `end` action.
"""
for i in range(max_steps):
    # Take screenshot
    before_screenshot = ""
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": ctx},
            {
                "type": "image_url",
                "image_url": {
                    "url": before_screenshot,
                },
            },
        ]}
    ]
    resp = llm.chat(messages)
    # Parse the chosen action out of `resp` (placeholder; the parsing is left to you)
    action = ...
    if action == "end":
        break
    # Take mcp action
    # ...
    # screenshot = ""
    # append response
    # Ask a human for feedback, waiting to continue loop until approved
    human.feedback(messages=messages, wait=True)
# Now let's use all the feedback we collected to fine-tune the LLM!
llm.train()
After training, you can run it again and see the improvement in the agent. To make the agent robust, simply train it on numerous tasks within its domain.
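The gating step in the feedback callback (only approved episodes reach the model) can be sketched in plain Python without any Orign dependency. This is a minimal illustration, not Orign's implementation: `ReviewedEpisode` and `select_training_examples` are hypothetical names that stand in for the `Feedback` payload and the `if feedback.approved` check.

```python
from dataclasses import dataclass


@dataclass
class ReviewedEpisode:
    """Hypothetical stand-in for a feedback record: a transcript plus a verdict."""
    messages: list
    approved: bool


def select_training_examples(episodes):
    """Keep only the transcripts that a reviewer approved."""
    return [episode.messages for episode in episodes if episode.approved]


episodes = [
    ReviewedEpisode([{"role": "user", "content": "open the search page"}], approved=True),
    ReviewedEpisode([{"role": "user", "content": "click a random ad"}], approved=False),
]

# Only the approved transcript survives and would be passed on for learning.
training_examples = select_training_examples(episodes)
print(len(training_examples))
```

Rejected episodes are simply dropped here; whether to keep them as negative examples is a separate design choice that this sketch does not cover.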
Tip
See examples for more
Replay buffers provide a means to store agent experiences and sample from them. They are the cornerstone of RL and Online Learning.
from orign import ReplayBuffer
buffer = ReplayBuffer(
    name="sql-adapter",
)
Send data to the replay buffer
data = [{"role": "user", "content": ...}]
buffer.send(data)
Sample data from the buffer
buffer.sample(n=50, strategy="Random")
This sample data can be used to train an agent.
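To make the store-and-sample idea concrete, here is a small in-memory sketch. It is not how Orign's `ReplayBuffer` is implemented; `ToyReplayBuffer`, its `capacity` limit, the oldest-first eviction, and the `seed` argument are all assumptions made for illustration.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Hypothetical in-memory buffer with a fixed capacity and uniform sampling."""

    def __init__(self, capacity=1000, seed=None):
        # A bounded deque silently evicts the oldest items once it is full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def send(self, examples):
        """Append a batch of examples to the buffer."""
        self._items.extend(examples)

    def sample(self, n):
        """Return up to n distinct examples chosen uniformly at random."""
        k = min(n, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


toy_buffer = ToyReplayBuffer(capacity=3, seed=0)
toy_buffer.send([{"role": "user", "content": f"question {i}"} for i in range(5)])

# Five examples were sent, but only the three most recent fit in the buffer.
batch = toy_buffer.sample(n=2)
print(len(toy_buffer), len(batch))
```

Uniform sampling corresponds loosely to the `"Random"` strategy named above; other strategies would replace the body of `sample`.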
Tip
See Replay Buffer examples for more
Processors are autoscaling stream workers that allow us to easily build production grade ML models and agents.
In this example, we will create a processor that trains a model using TRL on runpod with 1 A100 GPU.
from orign import Message, processor
from pydantic import BaseModel
from trl import SFTTrainer
from datasets import load_dataset
class TrainingRequest(BaseModel):
    dataset: str
    model: str
setup_script = "pip install trl datasets pydantic"
@processor(
    image="pytorch/pytorch:latest",
    platform="runpod",
    accelerators=["1:A100"],
    setup_script=setup_script
)
def sft(message: Message[TrainingRequest]):
    request = message.content
    dataset = load_dataset(request.dataset, split="train")
    trainer = SFTTrainer(
        model=request.model,
        train_dataset=dataset,
    )
    trainer.train()
We can then simply call it like a regular function, and it will handle spinning it up on runpod with the right GPU and scaling it down when it's finished.
request = TrainingRequest(dataset="trl-lib/Capybara", model="Qwen/Qwen2.5-0.5B")
sft(request)
If called multiple times, the requests will be placed in a queue and processed asynchronously. If the queue is backed up, the processor will scale to meet demand.
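The queue-then-scale behaviour can be pictured with a local sketch built on the standard library. This is only an analogy under stated assumptions: Orign's actual scheduling is not described here, and `desired_workers`, its `per_worker` and `max_workers` parameters, and `run_jobs` are hypothetical names with a made-up scaling rule.

```python
import queue
import threading


def desired_workers(queue_depth, per_worker=2, max_workers=4):
    """Assumed policy: one worker per `per_worker` queued jobs, clamped to [1, max_workers]."""
    needed = -(-queue_depth // per_worker)  # ceiling division
    return max(1, min(max_workers, needed))


def run_jobs(jobs, handler):
    """Queue every job, then drain the queue with a worker pool sized by the backlog."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker exits ("scales down")
            output = handler(job)
            with lock:
                results.append(output)

    threads = [
        threading.Thread(target=worker)
        for _ in range(desired_workers(pending.qsize()))
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Six queued jobs with per_worker=2 start three workers; completion order may vary.
finished = run_jobs(range(6), lambda n: n * n)
print(sorted(finished))
```

Because the workers run concurrently, results arrive in no guaranteed order, which mirrors the asynchronous processing described above.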
Let's import our processor and use it with a buffer to issue two consecutive trainings.
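# TrainingRequest is assumed to be defined alongside sft in train.py
from .train import sft, TrainingRequest
dataset = buffer.sample(n=200, link=True)
request = TrainingRequest(dataset=dataset, model="Qwen/Qwen2.5-0.5B")
sft(request)
sft(request)
Tip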
See Processor examples for more
Online LLMs are capable of both training and inference. They learn in realtime as the data comes in.
In this example, we create an online LLM using the buffer and train function we previously created, as well as a vLLM processor we provide for inference.
from orign import OnlineLLM, vllm
actor = OnlineLLM(
    name="sql-actor",
    buffer=buffer,
    trainer=sft,
    server=vllm,
)
The trainer and server can be any processor. Feel free to create your own or explore our zoo.
For simplicity, Orign supplies pre-built Online LLMs for popular models.
from orign import Gemma3
llm = Gemma3("sql-actor")
Use the LLM to generate responses.
messages = [
{"role": "user", "content": "Write a SQL query to find all users who joined after January 1, 2023."},
]
response = llm.generate(messages)
Send the LLM training examples.
messages = [
{"role": "user", "content": "Write a SQL query to find all users who joined after January 1, 2023."},
{"role": "assistant", "content": "sql\nSELECT * FROM users WHERE join_date > '2023-01-01';\n"},
]
llm.learn(messages)
Launch a training job based on the data collected.
llm.train()
Tip
See Online LLM examples for more
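The learn-then-train cycle can be illustrated with a toy model that has no neural network at all. This is a sketch of the control flow only, not of Orign's `OnlineLLM`: `ToyOnlineModel` is a hypothetical class whose "weights" are a lookup table, and the method names merely mirror the calls shown above.

```python
class ToyOnlineModel:
    """Hypothetical model: buffers examples on learn(), absorbs them on train()."""

    def __init__(self):
        self._pending = []  # examples received but not yet trained on
        self._known = {}    # prompt -> response, standing in for model weights

    def learn(self, messages):
        """Queue one conversation as a training example."""
        self._pending.append(messages)

    def train(self):
        """Absorb every queued example and return how many were used."""
        trained = 0
        for messages in self._pending:
            prompt = next(m["content"] for m in messages if m["role"] == "user")
            answer = next(m["content"] for m in messages if m["role"] == "assistant")
            self._known[prompt] = answer
            trained += 1
        self._pending.clear()
        return trained

    def generate(self, messages):
        """Answer the last message, or return an empty string if it is unknown."""
        return self._known.get(messages[-1]["content"], "")


toy_model = ToyOnlineModel()
question = [{"role": "user", "content": "How many users joined in 2023?"}]
example = question + [{"role": "assistant", "content": "SELECT COUNT(*) FROM users;"}]

before = toy_model.generate(question)  # nothing learned yet
toy_model.learn(example)
trained_count = toy_model.train()
after = toy_model.generate(question)   # now served from the trained table
print(repr(before), trained_count, after)
```

The point of the split is that `learn` is cheap and can be called as data arrives, while `train` is the expensive step that is launched when enough data has accumulated.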
Connect to a human who can provide feedback to the agent.
In this example, we collect feedback from humans in a Slack channel. When the human provides feedback, the on_feedback processor will be called.
from orign import Human, processor
@processor(image="python:3.10")
def on_feedback(feedback):
    print(feedback)
human = Human(
    name="sql-adapter-annotator",
    medium="slack",
    channel="#agent-training",
    callback=on_feedback,
)
Use the human to provide feedback to the agent.
messages = [
{"role": "user", "content": "Write a SQL query to find all users who joined after January 1, 2023."},
{"role": "assistant", "content": "sql\nSELECT * FROM users WHERE join_date > '2023-01-01';\n"},
]
human.feedback(messages)
As a more complex example, use the feedback to train both the agent and a verifier, enabling autonomous learning.
In this example, we create a verifier using our pre-made Gemma3 online LLM. We also define a callback function which takes the feedback and teaches the actor and verifier.
from orign import Gemma3
verifier = Gemma3(
    name="sql-adapter-verifier",
    model="google/gemma-3-4b-pt",
    platform="ec2",
    accelerators=["1:H100_SXM"],
)
@processor(image="agentsea/orign-py:latest")
def on_feedback(feedback):
    # Load the actor and verifier LLMs we previously created.
    actor = Gemma3.load(name="sql-adapter-actor")
    verifier = Gemma3.load(name="sql-adapter-verifier")
    # Teach the verifier to judge whether the assistant's response is correct.
    verifier_messages = [
        {"role": "user", "content": f"Given the conversation {feedback.messages}, please judge whether the assistant's response is correct."},
        {"role": "assistant", "content": feedback.correct},
    ]
    verifier.learn(verifier_messages)
    # If the assistant's response is correct, train the actor.
    if feedback.correct:
        actor.learn(feedback.messages)
Using the previous example, once the verifier is trained, we can use it to train the actor autonomously.
while True:
    # implement this function however makes sense for you
    task = next_task()
    response = actor.generate(task)
    # implement this function to format the chat history for the verifier
    verifier_messages = get_verifier_messages(task, response)
    feedback = verifier.generate(verifier_messages)
    if feedback.correct:
        actor.learn(feedback.messages)
Tip
See Human examples for more
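The autonomous loop can be tried end to end with stand-ins for both models. The sketch below swaps the trained verifier for a rule-based check and the actor for a deliberately flawed function, so it shows only the filtering logic; `toy_actor`, `toy_verifier`, and `collect_verified` are hypothetical names, not Orign APIs.

```python
def toy_actor(task):
    """Hypothetical actor: answers 'a+b' prompts, but is off by one when a is odd."""
    a, b = (int(part) for part in task.split("+"))
    return str(a + b if a % 2 == 0 else a + b + 1)


def toy_verifier(task, response):
    """Rule-based stand-in for a trained verifier: recompute the sum and compare."""
    a, b = (int(part) for part in task.split("+"))
    return response == str(a + b)


def collect_verified(tasks, actor, verifier):
    """Run the actor on each task and keep only the responses the verifier accepts."""
    accepted = []
    for task in tasks:
        response = actor(task)
        if verifier(task, response):
            accepted.append([
                {"role": "user", "content": task},
                {"role": "assistant", "content": response},
            ])
    return accepted


# "3+4" is answered incorrectly by the toy actor, so it is filtered out.
accepted = collect_verified(["2+2", "3+4", "10+5"], toy_actor, toy_verifier)
print(len(accepted))
```

Each accepted conversation is in the same chat format used elsewhere in this document, so it could be handed to the actor's learning step. The quality of this loop depends entirely on the verifier, which is why the verifier is trained on human feedback first.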
See Processors
Agents can easily be made with processors. No need for silly agent frameworks
from typing import Any, Dict, Optional
from orign import Gemma3, Message, processor
from pydantic import BaseModel
from mcp_use import MCPClient
class Task(BaseModel):
    description: str
    max_steps: int
    mcp_config: Dict[str, Any]
    result: Optional[str] = None
setup = "pip install mcp-use pydantic"
@processor(image="python:3.11-slim", platform="gce", setup_script=setup)
def agent(message: Message[Task]) -> Task:
    task = message.content
    # Create an online LLM that can both learn and act
    llm = Gemma3(name="agent-actor")
    # Create MCPClient from configuration dictionary
    client = MCPClient.from_dict(task.mcp_config)
    for i in range(task.max_steps):
        # Your agent logic
        ...
    return task
Then call it to launch the agent on GCE and run the task.
mcp_config = {
    "mcpServers": {
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
            "env": {
                "DISPLAY": ":1"
            }
        }
    }
}
# Pydantic models take keyword arguments, not positional ones
task = Task(
    description="find a flight from Denver to LA in August",
    max_steps=30,
    mcp_config=mcp_config,
)
result = agent(task, wait=True)
Tip
See Agent examples for more
- Task management
- More human backends
- More pre-baked models
Please open an issue or submit a PR.
- OpenRLHF
- AlignAnything
- TRL
- Nebulous
