GitHub - agentsea/orign: Reinforcement Learning for Agents

Reinforcement Learning for Agents

Orign makes it simple to train and deploy robust AI agents that can learn from human feedback and exploration. It further provides mechanisms for agents to learn interactively and autonomously.

Built on the Nebulous runtime, Orign components can be run on any cloud and can easily connect across clouds and regions.

Ships as a single binary, performant and lightweight via Rust 🦀

It takes a team to align models; we connect them globally 🌎

Warning

Orign is in alpha; things may break.

Installation

Python

pip install orign

CLI

curl -fsSL -H "Cache-Control: no-cache" https://storage.googleapis.com/orign/releases/install.sh | bash

Start an orign server

orign serve --docker

Or optionally run on Kubernetes with our helm chart

Quick Start

Let's use reinforcement learning to train an agent to use a Playwright MCP server.

from orign import Human, QwenVL2_5, Message, Feedback, processor
from mcp_use import MCPClient

# Create an online LLM that can both learn and act
llm = QwenVL2_5(name="playwright-actor")

config = {
    "mcpServers": {
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
            "env": {"DISPLAY": ":1"},
        }
    }
}

# Create a Playwright MCP server
client = MCPClient.from_dict(config)

@processor(image="python:3.11-slim", platform="ec2")
def on_feedback(message: Message[Feedback]):
    # Load our actor LLM
    llm = QwenVL2_5.load("playwright-actor")

    # Parse the feedback from the message
    feedback = message.content

    if feedback.approved:
        # Send to the LLM to learn
        llm.learn(feedback.messages)

# Create a human that can review for us.
# When they do, the on_feedback function will be called.
human = Human(
    name="playwright-reviewer",
    medium="ui",
    callback=on_feedback
)

# Run an agent
task = "find a flight to Germany from London in December"
max_steps = 30

ctx = f"""You are operating a web browser helping accomplish tasks.
Please help complete the task '{task}' with the tools: {client.tools()}
Given the current screenshot of the browser, please select your next action.
If you are done, simply return the `end` action.
"""

for i in range(max_steps):
    # Take screenshot
    before_screenshot = ""

    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": ctx},
            {
                "type": "image_url",
                "image_url": {
                    "url": before_screenshot,
                },
            },
        ]}
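    ]
    resp = llm.chat(messages)

    # Extract the chosen action from the response.
    # `parse_action` is a hypothetical helper; implement it to match your response format.
    action = parse_action(resp)

    if action == "end":
        break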

    # Take mcp action
    # ...

    # screenshot = ""

    # append response

    # Ask a human for feedback, waiting to continue loop until approved
    human.feedback(messages=messages, wait=True)

# Now let's use all the feedback we collected to fine-tune the LLM!
llm.train()

After training, you can run it again and see the improvement in the agent. To make the agent robust, simply train it on numerous tasks within its domain.
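The loop in the quick start leaves `before_screenshot` as an empty string. One common way to fill an `image_url` field is a base64 data URL. The sketch below assumes the screenshot is available as PNG bytes; `to_data_url` is a hypothetical helper, not part of Orign, and whether a given model server accepts data URLs is an assumption to verify.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in bytes for illustration; in practice these would come from a real screenshot.
png_header = b"\x89PNG\r\n\x1a\n"
before_screenshot = to_data_url(png_header)
```

The resulting string can then be placed in the `url` field of the `image_url` message part.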

Tip

See examples for more

Usage

Replay Buffers

Replay buffers provide a means to store agent experiences and sample from them. They are the cornerstone of RL and Online Learning.

from orign import ReplayBuffer

buffer = ReplayBuffer(
    name="sql-adapter",
)

Send data to the replay buffer

data = [{"role": "user", "content": ...}]
buffer.send(data)

Sample data from the buffer

buffer.sample(n=50, strategy="Random")

This sample data can be used to train an agent.
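To make the store-and-sample idea concrete, here is a minimal in-memory sketch. `LocalReplayBuffer` is a hypothetical stand-in written for illustration; it is not Orign's `ReplayBuffer`, and the real sampling strategies may differ.

```python
import random
from typing import Any, Dict, List


class LocalReplayBuffer:
    """In-memory stand-in for illustration only; not Orign's ReplayBuffer."""

    def __init__(self, name: str, seed: int = 0) -> None:
        self.name = name
        self._items: List[Dict[str, Any]] = []
        self._rng = random.Random(seed)

    def send(self, data: List[Dict[str, Any]]) -> None:
        # Append each experience to the store.
        self._items.extend(data)

    def sample(self, n: int) -> List[Dict[str, Any]]:
        # Uniform random sampling without replacement, capped at the buffer size.
        k = min(n, len(self._items))
        return self._rng.sample(self._items, k)


local_buffer = LocalReplayBuffer(name="sql-adapter")
local_buffer.send([{"role": "user", "content": f"example {i}"} for i in range(10)])
batch = local_buffer.sample(n=4)
```

Sampling at random rather than replaying experiences in arrival order helps decorrelate consecutive training examples.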

Tip

See Replay Buffer examples for more

Processors

Processors are autoscaling stream workers that allow us to easily build production grade ML models and agents.

In this example, we will create a processor that trains a model using TRL on RunPod with one A100 GPU.

from orign import Message, processor
from pydantic import BaseModel
from trl import SFTTrainer
from datasets import load_dataset

class TrainingRequest(BaseModel):
    dataset: str
    model: str

setup_script = "pip install trl datasets pydantic"

@processor(
    image="pytorch/pytorch:latest", 
    platform="runpod", 
    accelerators=["1:A100"], 
    setup_script=setup_script
)
def sft(message: Message[TrainingRequest]):
    request = message.content

    dataset = load_dataset(request.dataset, split="train")

    trainer = SFTTrainer(
        model=request.model,
        train_dataset=dataset,
    )
    trainer.train()

We can then simply call it like a regular function, and it will handle spinning it up on RunPod with the right GPU and scaling it down when it's finished.

request = TrainingRequest(dataset="trl-lib/Capybara", model="Qwen/Qwen2.5-0.5B")

sft(request)

If called multiple times, the requests will be placed in a queue and processed asynchronously. If the queue is backed up, the processor will scale to meet demand.
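The queueing behavior can be pictured with a toy, local sketch using Python's standard `queue` and `threading` modules. This is not how Orign implements processors, and the scaling rule (one worker per three queued items) is an assumption chosen purely for illustration.

```python
import queue
import threading
from typing import List

pending: "queue.Queue[str]" = queue.Queue()
results: List[str] = []
results_lock = threading.Lock()


def worker() -> None:
    # Each worker pulls requests until the queue is drained.
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            return
        with results_lock:
            results.append(f"processed {item}")
        pending.task_done()


# Callers enqueue work and return immediately.
for i in range(6):
    pending.put(f"request-{i}")

# Toy scaling rule (an assumption): start one worker per three queued items.
num_workers = max(1, pending.qsize() // 3)
threads = [threading.Thread(target=worker) for _ in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because workers run concurrently, the order of `results` is not guaranteed; only the set of processed requests is.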

Let's import our processor and use it with a buffer to issue two consecutive trainings.

from .train import sft, TrainingRequest

dataset = buffer.sample(n=200, link=True)
request = TrainingRequest(dataset=dataset, model="Qwen/Qwen2.5-0.5B")

sft(request)
sft(request)

Tip

See Processor examples for more

Online LLMs

Online LLMs are capable of both training and inference. They learn in real time as the data comes in.

In this example, we create an online LLM using the buffer and train function we previously created, as well as a vLLM processor we provide for inference.

from orign import OnlineLLM, vllm

actor = OnlineLLM(
    name="sql-actor",
    buffer=buffer,
    trainer=sft,
    server=vllm,
)

The trainer and server can be any processor. Feel free to create your own or explore our zoo.
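As a rough mental model of how a buffer, a trainer, and a server fit together, consider the toy sketch below. `ToyOnlineModel` is hypothetical and runs entirely in memory; it is not Orign's `OnlineLLM`, and details such as clearing the buffer after training are assumptions.

```python
from typing import Callable, Dict, List

Messages = List[Dict[str, str]]


class ToyOnlineModel:
    """Toy illustration of the buffer + trainer + server pattern; not Orign's OnlineLLM."""

    def __init__(
        self,
        trainer: Callable[[List[Messages]], None],
        server: Callable[[Messages], str],
    ) -> None:
        self._buffer: List[Messages] = []
        self._trainer = trainer
        self._server = server

    def learn(self, messages: Messages) -> None:
        # Store the example; training happens later, on demand.
        self._buffer.append(messages)

    def train(self) -> int:
        # Hand everything collected so far to the trainer, then clear the buffer.
        examples, self._buffer = self._buffer, []
        self._trainer(examples)
        return len(examples)

    def generate(self, messages: Messages) -> str:
        # Delegate inference to the server callable.
        return self._server(messages)


trained_batches: List[List[Messages]] = []
toy = ToyOnlineModel(
    trainer=trained_batches.append,
    server=lambda messages: f"echo: {messages[-1]['content']}",
)
toy.learn([{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}])
num_trained = toy.train()
reply = toy.generate([{"role": "user", "content": "ping"}])
```

Passing the trainer and server in as callables mirrors the idea that either can be swapped for any processor.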

For simplicity, Orign supplies pre-built Online LLMs for popular models.

from orign import Gemma3

llm = Gemma3("sql-actor")

Use the LLM to generate responses.

messages = [
    {"role": "user", "content": "Write a SQL query to find all users who joined after January 1, 2023."},
]
response = llm.generate(messages)

Send the LLM training examples.

messages = [
    {"role": "user", "content": "Write a SQL query to find all users who joined after January 1, 2023."},
    {"role": "assistant", "content": "sql\nSELECT * FROM users WHERE join_date > '2023-01-01';\n"},
]
llm.learn(messages)

Launch a training job based on the data collected.

llm.train()

Tip

See Online LLM examples for more

Humans

Connect to a human who is capable of providing feedback to the agent.

In this example, we collect feedback from humans in a Slack channel. When the human provides feedback, the on_feedback processor will be called.

from orign import Human, processor 

@processor(image="python:3.10")
def on_feedback(feedback):
    print(feedback)

human = Human(
    name="sql-adapter-annotator",
    medium="slack",
    channel="#agent-training",
    callback=on_feedback,
)

Use the human to provide feedback to the agent.

messages = [
    {"role": "user", "content": "Write a SQL query to find all users who joined after January 1, 2023."},
    {"role": "assistant", "content": "sql\nSELECT * FROM users WHERE join_date > '2023-01-01';\n"},
]
human.feedback(messages)

Verifiers and Autonomous Learning

As a more complex example, use the feedback to train both the agent and a verifier, enabling autonomous learning.

In this example, we create a verifier using our pre-made Gemma3 online LLM. We also define a callback function which takes the feedback and teaches the actor and verifier.

from orign import Gemma3, processor

verifier = Gemma3(
    name="sql-adapter-verifier",
    model="google/gemma-3-4b-pt",
    platform="ec2",
    accelerators=["1:H100_SXM"],
)

@processor(image="agentsea/orign-py:latest")
def on_feedback(feedback):
    # Load the actor and verifier LLMs we previously created.
    actor = Gemma3.load(name="sql-adapter-actor")
    verifier = Gemma3.load(name="sql-adapter-verifier")

    # Teach the verifier to judge whether the assistant's response is correct.
    verifier_messages = [
        {"role": "user", "content": f"Given the conversation {feedback.messages}, please judge whether the assistant's response is correct."},
        {"role": "assistant", "content": feedback.correct},
    ]    
    verifier.learn(verifier_messages)

    # If the assistant's response is correct, train the actor.
    if feedback.correct:
        actor.learn(feedback.messages)

Using the previous example, once the verifier is trained, we can use it to train the actor autonomously.

while True:
    # implement this function however makes sense for you
    task = next_task()
    response = actor.generate(task)

    # implement this function to format the chat history for the verifier
    verifier_messages = get_verifier_messages(task, response)
    feedback = verifier.generate(verifier_messages)

    if feedback.correct:
        actor.learn(feedback.messages)
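The loop leaves `get_verifier_messages` to the reader. One possible sketch is shown below; it assumes the task is a list of chat messages and the actor's response is a plain string, and it mirrors the judging prompt used in the `on_feedback` callback. Both the signature and the data shapes are assumptions, not part of Orign.

```python
from typing import Dict, List

Messages = List[Dict[str, str]]


def get_verifier_messages(task: Messages, response: str) -> Messages:
    """Hypothetical helper: wrap the conversation in a judging prompt."""
    # Append the actor's answer so the verifier sees the full exchange.
    conversation = task + [{"role": "assistant", "content": response}]
    prompt = (
        f"Given the conversation {conversation}, "
        "please judge whether the assistant's response is correct."
    )
    return [{"role": "user", "content": prompt}]


task = [{"role": "user", "content": "Write a SQL query that counts all users."}]
verifier_messages = get_verifier_messages(task, "SELECT COUNT(*) FROM users;")
```

Keeping the prompt identical to the one the verifier was trained on avoids a mismatch between training and autonomous use.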

Tip

See Human examples for more

Agents

See Processors

Agents can easily be made with processors. No need for silly agent frameworks.

from typing import Any, Dict, Optional

from orign import Gemma3, Message, processor
from pydantic import BaseModel
from mcp_use import MCPClient

class Task(BaseModel):
    description: str
    max_steps: int
    mcp_config: Dict[str, Any]
    # Filled in by the agent once the task completes
    result: Optional[str] = None

setup = "pip install mcp-use pydantic"

@processor(image="python:3.11-slim", platform="gce", setup_script=setup)
def agent(message: Message[Task]) -> Task:
    task = message.content

    # Create an online LLM that can both learn and act
    llm = Gemma3(name="agent-actor")

    # Create MCPClient from configuration dictionary
    client = MCPClient.from_dict(task.mcp_config)

    for i in range(task.max_steps):
        # Your agent logic
        ...

    return task

Then call it to launch the agent on GCE and run the task.

mcp_config = {
    "mcpServers": {
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
            "env": {
                "DISPLAY": ":1"
            }
        }
    }
}
# Pydantic models take keyword arguments
task = Task(
    description="find a flight from Denver to LA in August",
    max_steps=30,
    mcp_config=mcp_config,
)

result = agent(task, wait=True)

Tip

See Agent examples for more

Roadmap

Task management
More human backends
More pre-baked models

Contributing

Please open an issue or submit a PR.

Inspiration

OpenRLHF
AlignAnything
TRL
Nebulous
