An extensible, modular framework for reinforcement learning with LLM agents that operate in Docker environments to simulate realistic workflows. Currently, it targets software development workflows.
- Modular Architecture: Plugin-based system for extensible functionality
- Docker Environments: Isolated execution environments for each rollout
- LLM Integration: Native OpenAI tool calling with OpenRouter support
- Advanced Tool Calling: Native function calling instead of text parsing
- Parallel Execution: Support for running multiple rollouts simultaneously
- Trajectory Tracking: Comprehensive logging and analysis of agent behavior
- State Persistence: File system snapshots for environment state management
- Unit Test Evaluation: Automated testing and scoring of agent outputs
- Dynamic Rollout Management: Plugins can modify rollout behavior based on performance
- Docker: This framework requires Docker to be installed and running on your system. Download and install Docker from https://www.docker.com/get-started
- Python 3.8+: Required for running the framework
- Clone the repository:
git clone https://github.com/ritser-labs/real-work
cd real-work
- Install dependencies:
pip install uv
uv sync
- Verify Docker is installed and running:
docker --version
docker run hello-world
- Set up your OpenRouter API key:
export OPENROUTER_API_KEY="your_openrouter_api_key_here"
Or set it directly in the configuration file.
- Set up your configuration file (see examples/calculator_api/config.json):
{
"environments": [
{
"id": "my_env",
"docker_image": "python:3.9-slim",
"init_commands": ["pip install --user flask pytest"],
"unit_tests": ["python -m pytest tests/"],
"prompt": "Your task description here...",
"working_directory": "/tmp",
"environment_variables": {
"PYTHONPATH": "/tmp",
"HOME": "/tmp",
"PATH": "/tmp/.local/bin:/usr/local/bin:/usr/bin:/bin"
},
"copy_folders": [],
"max_retries": 3
}
],
"rollout_config": {
"max_parallel_rollouts": 4,
"trajectory_output_path": "trajectories",
"enable_plugins": true,
"save_trajectory_interval": 10,
"state_persistence_enabled": true
},
"episode_control_config": {
"max_episodes": 1,
"max_episodes_per_env": null,
"stop_on_success": false,
"min_success_rate": null,
"safety_limit": 100
},
"timeout_config": {
"global_timeout": 1800,
"command_timeout": 300,
"test_timeout": 120
},
"template_prompt": "You are an expert software engineer working in a Docker container environment. You have access to shell commands, can read and write files, and should implement the requested functionality step by step.",
"plugins": []
}
- Run the framework:
uv run python main.py examples/calculator_api/config.json \
--llm-api-key your-api-key-here \
--llm-model anthropic/claude-sonnet-4 \
--llm-base-url https://openrouter.ai/api/v1
LLM Configuration Options:
- --llm-model: LLM model to use (default: anthropic/claude-sonnet-4)
- --llm-api-key: API key for the LLM service (required)
- --llm-base-url: Base URL for the LLM API (default: https://openrouter.ai/api/v1)
- --llm-temperature: Temperature for sampling (default: 0.7)
- --llm-max-tokens: Maximum tokens for response (default: 4096)
- --llm-timeout: Timeout for API calls in seconds (default: 60)
- View results: Results are saved as JSON files in the specified output directory. The framework provides detailed trajectory information and performance statistics.
- RolloutManager: Orchestrates the entire process, manages parallel execution
- Environment: Handles Docker containers and command execution
- LLMAgent: Manages LLM API calls and action parsing
- TrajectoryManager: Tracks and persists episode trajectories
- UnitTestRunner: Executes tests and collects results
- PluginManager: Manages plugin lifecycle and execution
The framework supports several types of plugins:
- RolloutManagerPlugin: Controls rollout execution strategy
- TrajectoryAnalyzerPlugin: Analyzes trajectories for insights
- ActionFilterPlugin: Filters or modifies actions before execution
- EnvironmentModifierPlugin: Modifies environment configuration
- EpisodeEvaluatorPlugin: Provides custom episode evaluation
- PerformanceBasedRolloutPlugin: Stops rollouts when a performance threshold is met
- TrajectoryAnalysisPlugin: Provides detailed trajectory analysis
- SafetyFilterPlugin: Blocks potentially dangerous commands
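For example, the ActionFilterPlugin type listed above is a natural place for safety checks like those in the built-in SafetyFilterPlugin. The following is a minimal, illustrative sketch only: the filter_action method, the PRE_ACTION hook value, and the return-None-to-drop convention are assumptions, so consult src/plugins/base.py for the actual interface.

from typing import List

from src.plugins.base import ActionFilterPlugin, PluginHookType

class BlockDestructiveCommandsPlugin(ActionFilterPlugin):
    # Hypothetical plugin that drops a few destructive shell commands.
    BLOCKED_FRAGMENTS = ["rm -rf /", "mkfs", "shutdown"]

    @property
    def name(self) -> str:
        return "block_destructive_commands"

    @property
    def version(self) -> str:
        return "1.0.0"

    @property
    def supported_hooks(self) -> List[PluginHookType]:
        # Hook name is assumed; check PluginHookType for the real values.
        return [PluginHookType.PRE_ACTION]

    async def filter_action(self, action):
        # Assumed convention: returning None drops the action before execution.
        if any(fragment in str(action.content) for fragment in self.BLOCKED_FRAGMENTS):
            return None
        return action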
Required Fields: The following fields are required in your configuration file:
- environments - List of environment configurations
- rollout_config - Rollout execution settings
- episode_control_config - Episode control settings
- timeout_config - Global timeout settings
- template_prompt - Base prompt template
{
"id": "unique_environment_id",
"docker_image": "python:3.9-slim",
"init_commands": ["pip install -r requirements.txt"],
"unit_tests": ["python -m pytest tests/"],
"prompt": "Detailed task description for the LLM",
"working_directory": "/workspace",
"environment_variables": {
"PYTHONPATH": "/workspace"
},
"copy_folders": [],
"max_retries": 3
}
Note: LLM configuration is typically provided via command-line arguments rather than in the JSON file. The framework supports the following CLI options:
--llm-model anthropic/claude-sonnet-4
--llm-api-key your-api-key-here
--llm-base-url https://openrouter.ai/api/v1
--llm-temperature 0.7
--llm-max-tokens 4096
--llm-timeout 60
For advanced LLM configuration, you can also specify:
{
"model": "anthropic/claude-sonnet-4",
"api_key": "your-api-key",
"base_url": "https://openrouter.ai/api/v1",
"temperature": 0.7,
"max_tokens": 4096,
"timeout": 60,
"enable_caching": true,
"cache_size": 100,
"max_context_messages": 50,
"max_output_length": 2000,
"track_token_usage": true,
"warn_high_usage": true
}
{
"max_parallel_rollouts": 4,
"trajectory_output_path": "trajectories",
"enable_plugins": true,
"save_trajectory_interval": 10,
"state_persistence_enabled": true
}
{
"max_episodes": 1,
"max_episodes_per_env": null,
"stop_on_success": false,
"min_success_rate": null,
"safety_limit": 100
}
{
"global_timeout": 1800,
"command_timeout": 300,
"test_timeout": 120
}
The framework uses OpenAI's native tool calling functionality. The LLM has access to the following tools:
- execute_command - Execute shell commands
  - command: The shell command to execute
  - timeout: Optional timeout in seconds
  - working_directory: Optional working directory
- write_file - Write content to files
  - filepath: Path to the file to write
  - content: Content to write to the file
- read_file - Read content from files
  - filepath: Path to the file to read
- mark_done - Mark task as completed
  - message: Optional completion message
The LLM automatically calls these tools as needed, and the framework handles the execution and returns results.
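For reference, the execute_command tool corresponds roughly to the following OpenAI-style function definition. This is an illustrative sketch assembled from the parameter list above, not the framework's exact schema:

execute_command_tool = {
    "type": "function",
    "function": {
        "name": "execute_command",
        "description": "Execute a shell command in the Docker environment",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The shell command to execute"},
                "timeout": {"type": "integer", "description": "Optional timeout in seconds"},
                "working_directory": {"type": "string", "description": "Optional working directory"},
            },
            "required": ["command"],
        },
    },
}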
The framework includes a complete example that demonstrates building a calculator API:
# Run the calculator API example
uv run python main.py examples/calculator_api/config.json --output calculator_results.json
# View results
cat calculator_results.json
cat calculator_results.stats.json
from typing import Any, Dict, List

from src.plugins.base import TrajectoryAnalyzerPlugin, PluginHookType

class CustomAnalyzerPlugin(TrajectoryAnalyzerPlugin):
    @property
    def name(self) -> str:
        return "custom_analyzer"

    @property
    def version(self) -> str:
        return "1.0.0"

    @property
    def supported_hooks(self) -> List[PluginHookType]:
        return [PluginHookType.POST_EPISODE]

    async def analyze_trajectory(self, trajectory: Trajectory) -> Dict[str, Any]:
        # Your custom analysis logic here
        return {"custom_metric": 42}

    async def suggest_improvements(self, trajectory: Trajectory) -> List[str]:
        return ["Custom suggestion based on analysis"]
Add your plugin module to the configuration:
{
"plugins": ["my_custom_plugins.analyzer"]
}
{
"trajectory_id": "uuid",
"environment_id": "env_id",
"start_time": "2024-01-01T00:00:00",
"end_time": "2024-01-01T01:00:00",
"steps": [
{
"step_id": "uuid",
"timestamp": "2024-01-01T00:00:00",
"action": {
"type": "command",
"content": "ls -la",
"timeout": 300
},
"result": {
"success": true,
"output": "file listing...",
"duration": 1.5
}
}
],
"test_results": [
{
"command": "python -m pytest",
"success": true,
"output": "test output...",
"duration": 5.2
}
],
"episode_result": {
"success": true,
"final_score": 0.95,
"terminated_reason": "completed"
}
}
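Because trajectories are plain JSON, they are easy to post-process. A minimal sketch that summarizes one trajectory file using the fields shown above (the file name is hypothetical; use a file from your trajectory_output_path):

import json

with open("trajectories/example_trajectory.json") as f:
    trajectory = json.load(f)

steps = trajectory["steps"]
tests = trajectory.get("test_results", [])

print(f"Trajectory {trajectory['trajectory_id']} ({trajectory['environment_id']})")
print(f"Steps executed: {len(steps)}")
print(f"Successful steps: {sum(1 for s in steps if s['result']['success'])}")
print(f"Tests passed: {sum(1 for t in tests if t['success'])}/{len(tests)}")
print(f"Final score: {trajectory['episode_result']['final_score']}")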
The framework provides comprehensive performance monitoring:
- Success rates per environment
- Average episode duration
- Test pass rates
- Action efficiency metrics
- Plugin performance statistics
To add new action types:
- Extend the ActionType enum in src/core/config.py
- Add parsing logic in src/agents/llm_agent.py
- Implement execution logic in src/environments/environment.py
To add custom environments:
- Extend the Environment class
- Implement environment-specific initialization
- Add custom test runners if needed
To add custom evaluation:
- Create an EpisodeEvaluatorPlugin (see the sketch after this list)
- Implement custom scoring logic
- Register the plugin in your configuration
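A minimal sketch of such an evaluator, modeled on the custom plugin example above. The evaluate_episode method name and the shape of its return value are assumptions for illustration; the real EpisodeEvaluatorPlugin interface is defined in src/plugins/base.py.

from typing import Any, Dict, List

from src.plugins.base import EpisodeEvaluatorPlugin, PluginHookType

class TestWeightedEvaluatorPlugin(EpisodeEvaluatorPlugin):
    @property
    def name(self) -> str:
        return "test_weighted_evaluator"

    @property
    def version(self) -> str:
        return "1.0.0"

    @property
    def supported_hooks(self) -> List[PluginHookType]:
        return [PluginHookType.POST_EPISODE]

    async def evaluate_episode(self, trajectory) -> Dict[str, Any]:
        # Assumed method: score an episode by the fraction of passing unit tests.
        tests = getattr(trajectory, "test_results", []) or []
        passed = sum(1 for t in tests if t.success)
        score = passed / len(tests) if tests else 0.0
        return {"final_score": score, "success": score >= 0.8}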
Validate your configuration without running:
uv run python main.py config.json --dry-run
Note: Some fields like llm_model, max_episodes, stop_on_success, max_parallel_rollouts, global_timeout, output_path, and plugins_enabled at the root level are deprecated and should be moved to their respective configuration sections (llm_config, episode_control_config, rollout_config, timeout_config).
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Support for additional LLM providers
- Web-based dashboard for monitoring
- Integration with popular RL frameworks
- Support for multi-agent environments
- Advanced trajectory analysis tools
- Containerized deployment options
The LLM can execute code in the Docker container, so be wary of advanced prompt injection attacks if you are using custom environments or models.