🪸 Re-Eval: A Framework for the Reproducible Evaluation of LLM Agents

gist-rs/reev

reev 🪸

Reev 🪸: Production-Ready Framework for Solana LLM Agent Evaluation


🎯 Production Status: Complete & Fully Functional

reev is a mature, production-ready Rust framework for rigorously evaluating Solana-native LLM agents. After extensive development and testing, the framework now provides a complete, reliable platform for assessing autonomous agents in realistic blockchain environments.

✅ Current Capabilities: Production Ready

All benchmark categories currently pass at a 100% success rate (measured with the local model agent). The framework provides:

  • 🔄 Real Jupiter Integration: Full swap, lending, mint/redeem operations with Jupiter SDK
  • 🤖 Advanced Agent Support: Both deterministic (ground truth) and AI agents working perfectly
  • 🔄 Multi-Step Workflows: Complex DeFi flows with step-by-step orchestration (200-series)
  • 📊 Comprehensive Scoring: Granular instruction quality evaluation + on-chain execution metrics
  • 🎮 Professional Tooling: Interactive TUI cockpit, database persistence, detailed logging
  • 🔬 Real-World Testing: Mainnet fork validation with actual deployed programs
  • ✅ Scoring System Validation: Complete test suite covering 0%, 50%, 75%, and 100% score scenarios
  • 🌊 Flow Support: Step-by-step flow execution with proper transaction isolation

🚀 Core Architecture: Real Programs, Controlled State

The framework operates on surfpool, a high-performance in-memory fork of Solana mainnet, providing:

  • 🌐 Real-World Logic: Agents interact with actual deployed programs (Jupiter, SPL Token, etc.)
  • 🔒 Controlled Environment: Precise state management via RPC cheat codes for reproducible testing
  • ⚡ High Performance: In-memory execution with fast state manipulation and transaction simulation
  • 🔄 Hermetic Testing: Every test run starts from identical, controlled initial conditions

Core Principles

  • Reproducibility: The primary goal. Every test run is hermetic, guaranteeing that a given benchmark will produce the exact same result every time.
  • Service-Oriented Environment: The Solana test validator (surfpool) is treated as a managed, external service that the environment connects to and configures via RPC. This ensures a clean architectural boundary and prevents dependency conflicts.
  • Gymnasium-Inspired API: The agent-environment interaction is modeled via a standard Rust trait (GymEnv) inspired by the Gymnasium API, promoting a clear separation of concerns.
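
As a rough illustration of the Gymnasium-inspired principle, the sketch below shows what a reset/step trait could look like in Rust. The method names, associated types, and the `Step` struct are assumptions modeled on Gymnasium's `reset` and `step`; the actual `GymEnv` trait in reev-lib may differ. `CounterEnv` is a toy environment invented purely to exercise the trait.

```rust
/// Result of one environment step, loosely following Gymnasium's
/// (observation, reward, terminated, truncated) tuple.
#[derive(Debug, Clone, PartialEq)]
pub struct Step<O> {
    pub observation: O,
    pub reward: f64,
    pub terminated: bool,
    pub truncated: bool,
}

/// Hypothetical shape of a Gymnasium-style environment trait.
/// The real `GymEnv` may use different names and signatures.
pub trait GymEnv {
    type Action;
    type Observation;
    type Error;

    /// Restore the environment to its known initial state.
    fn reset(&mut self) -> Result<Self::Observation, Self::Error>;

    /// Apply one action and report the resulting observation and reward.
    fn step(&mut self, action: Self::Action) -> Result<Step<Self::Observation>, Self::Error>;
}

/// Toy environment (not part of reev): a counter that terminates
/// once it reaches `target`.
pub struct CounterEnv {
    pub value: u32,
    pub target: u32,
}

impl GymEnv for CounterEnv {
    type Action = u32;
    type Observation = u32;
    type Error = String;

    fn reset(&mut self) -> Result<u32, String> {
        // Every run starts from the same state, whatever happened before.
        self.value = 0;
        Ok(self.value)
    }

    fn step(&mut self, action: u32) -> Result<Step<u32>, String> {
        self.value = self
            .value
            .checked_add(action)
            .ok_or_else(|| "counter overflow".to_string())?;
        let terminated = self.value >= self.target;
        Ok(Step {
            observation: self.value,
            reward: if terminated { 1.0 } else { 0.0 },
            terminated,
            truncated: false,
        })
    }
}
```

Because `reset` always returns the same starting observation, two runs that apply the same actions observe the same sequence of steps, which is the property the reproducibility principle relies on.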

Key Components

  1. reev-lib (Core Library):

    • SolanaEnv: A custom, hermetic evaluation environment that connects to an external surfpool process. It handles state setup, transaction execution, and observation generation.
    • Agent Interface: Defines a simple Agent trait and provides an LlmAgent that can reason about prompts.
    • Benchmark Structs: Rust types that define the structure of a benchmark YAML file, enabling strongly-typed parsing.
  2. reev-runner (CLI Orchestrator):

    • The command-line tool for loading and running benchmarks.
    • Orchestrates the entire evaluation loop, from setting up the environment to calculating metrics and reporting results.
  3. reev-agent (LLM Service):

    • A standalone server that exposes an LLM's reasoning capabilities over an API.
    • Can be configured to use different models (local, Gemini, etc.) and includes a deterministic agent for generating ground-truth instructions.
  4. Benchmark Suite:

    • A suite of evaluation tasks defined in YAML files located in the benchmarks/ directory.
    • Each test case includes a declarative initial_state, a natural language prompt, and ground_truth criteria for success.
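
To make the relationship between the benchmark types and the agent interface more concrete, here is a minimal sketch using only the standard library. Only `initial_state`, `prompt`, and `ground_truth` come from the description above; every other field, the `Agent::act` signature, and the sample values are illustrative assumptions, not the definitions used in reev-lib (which would typically derive deserialization from YAML).

```rust
/// One account to create before the run. Field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct AccountState {
    pub pubkey: String,
    pub lamports: u64,
}

/// Success criteria. The single field here is a placeholder assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct GroundTruth {
    pub expected_program_ids: Vec<String>,
}

/// Strongly typed view of one benchmark file.
#[derive(Debug, Clone, PartialEq)]
pub struct Benchmark {
    pub id: String,
    pub initial_state: Vec<AccountState>,
    pub prompt: String,
    pub ground_truth: GroundTruth,
}

/// Hypothetical agent interface: given a benchmark, return the program
/// ids the agent intends to invoke. The real trait may differ.
pub trait Agent {
    fn act(&self, benchmark: &Benchmark) -> Vec<String>;
}

/// A deterministic agent simply replays the ground truth.
pub struct DeterministicAgent;

impl Agent for DeterministicAgent {
    fn act(&self, benchmark: &Benchmark) -> Vec<String> {
        benchmark.ground_truth.expected_program_ids.clone()
    }
}

/// Hand-built sample; the pubkey placeholder and prompt text are made up.
pub fn sample_benchmark() -> Benchmark {
    Benchmark {
        id: "001-sol-transfer".to_string(),
        initial_state: vec![AccountState {
            pubkey: "USER_WALLET_PUBKEY".to_string(),
            lamports: 1_000_000_000,
        }],
        prompt: "Send 0.1 SOL to the recipient wallet.".to_string(),
        ground_truth: GroundTruth {
            // The Solana System Program id.
            expected_program_ids: vec!["11111111111111111111111111111111".to_string()],
        },
    }
}
```

Keeping the ground truth inside the benchmark is what lets a deterministic agent act as a reference: it reproduces the expected answer, so any scoring below 100% for that agent points at the harness rather than at the model.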

🚀 Quick Start

Prerequisites

  1. Rust Toolchain: Install Rust (latest stable recommended)
  2. Git: Clone the repository
  3. Optional LLM: Install LM Studio or obtain a Gemini API key to run AI agents

🎯 Running Benchmarks

The framework now provides automatic surfpool management - no manual setup required:

# All benchmarks work out of the box
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic

# Jupiter protocols (swap, lending, mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

# Multi-step flows (swap + lend)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic

# API benchmarks (positions, earnings)
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

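# Scoring validation tests (expected scores per the Validated Score Scenarios table)
cargo run -p reev-runner -- benchmarks/003-spl-transfer-fail.yml --agent deterministic  # ~75% score (correct instruction, on-chain failure)
cargo run -p reev-runner -- benchmarks/004-partial-score-spl-transfer.yml --agent deterministic  # ~78.6% score (partial credit)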

🤖 Agent Options

Deterministic Agent (Ground Truth):

cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic

Local Model Agent:

cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

Gemini Agent:

cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent gemini-2.5-flash-lite

🎮 Interactive TUI

Launch the interactive cockpit for real-time monitoring:

cargo run -p reev-tui

Features:

  • 📊 Live benchmark execution with status updates
  • 🔍 Detailed execution trace analysis
  • 🏷️ Agent selection (deterministic, local, glm-4.6, gemini)
  • 📈 Real-time scoring and metrics

📊 Benchmark Categories

🔧 Transaction Benchmarks (100-series)

Real on-chain operations with Jupiter protocols:

# Jupiter swap
cargo run -p reev-runner -- benchmarks/100-jup-swap-sol-usdc.yml --agent local

# Jupiter lending (mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

🌊 Flow Benchmarks (200-series)

Multi-step DeFi workflows with step-by-step execution:

# Swap then lend (2 steps: swap SOL→USDC, then deposit USDC)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic

# More flow benchmarks coming soon...

Flow Execution Features:

  • ✅ Step-by-Step Processing: Each flow step executes as a separate transaction
  • ✅ Transaction Isolation: Proper error handling per step, no cascading failures
  • ✅ State Management: Account state flows between steps automatically
  • ✅ Agent Consistency: Both deterministic and AI agents handle flows identically
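
The step-by-step processing and per-step error handling described above can be sketched as a loop that records each outcome independently. This is a simplified illustration, not the runner's actual code: `run_flow`, `StepResult`, and the closure-based executor are hypothetical names, and the choice to keep running after a failed step is an assumption.

```rust
/// Outcome of one flow step: a success message or an error message.
pub type StepResult = Result<String, String>;

/// Run every step in order and record each outcome separately.
/// A failing step is stored as `Err` and does not abort the loop
/// (an assumption about what "no cascading failures" means).
pub fn run_flow<F>(steps: &[&str], mut execute: F) -> Vec<(String, StepResult)>
where
    F: FnMut(&str) -> StepResult,
{
    let mut outcomes = Vec::with_capacity(steps.len());
    for &step in steps {
        // Each step stands in for one separately submitted transaction.
        let outcome = execute(step);
        outcomes.push((step.to_string(), outcome));
    }
    outcomes
}
```

Called with the two step names `"swap"` and `"deposit"` and an executor that fails only on `"swap"`, this returns two entries: an `Err` for the swap and an `Ok` for the deposit, so each step can be scored independently.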

📡 API Benchmarks (100-series)

Data retrieval and portfolio management:

# Positions and earnings
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

🎯 Success Metrics

Current Performance:

  • ✅ 100% Success Rate: All benchmarks passing with local model
  • ✅ Real Jupiter Integration: Full protocol stack working
  • ✅ Multi-Step Flows: Complex workflows executing step-by-step successfully
  • ✅ Production Infrastructure: TUI, database, logging all operational
  • ✅ Scoring System Validation: Comprehensive test suite covering full score spectrum
  • ✅ Anti-False-Positive Protection: Differentiates failure modes accurately
  • ✅ Flow Framework: Robust step-by-step execution with proper error handling

Scoring System:

The framework implements a sophisticated two-tiered scoring system:

Component Breakdown:

  • Instruction Quality (75%): Granular evaluation of generated transactions
    • Program ID matching (configurable weight)
    • Instruction data validation (configurable weight)
    • Account metadata verification (signer/writable flags)
  • On-Chain Execution (25%): Binary success/failure on surfpool
  • Composite Scoring: Weighted average for final assessment
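
The weighted average in the breakdown above works out as follows. This is a minimal sketch assuming the instruction-quality component has already been reduced to a single value between 0 and 1; the function and constant names are illustrative, and the configurable sub-weights for program ID, instruction data, and account metadata are not modeled here.

```rust
/// Weight of the instruction-quality component (75%).
pub const INSTRUCTION_WEIGHT: f64 = 0.75;
/// Weight of the on-chain execution component (25%).
pub const ONCHAIN_WEIGHT: f64 = 0.25;

/// Combine instruction quality (0.0 to 1.0) with the binary on-chain
/// result into a composite score between 0.0 and 1.0.
pub fn composite_score(instruction_quality: f64, executed_ok: bool) -> f64 {
    let quality = instruction_quality.clamp(0.0, 1.0);
    let onchain = if executed_ok { 1.0 } else { 0.0 };
    INSTRUCTION_WEIGHT * quality + ONCHAIN_WEIGHT * onchain
}
```

Under these weights, a fully correct instruction that fails on-chain scores 0.75 × 1.0 + 0.25 × 0.0 = 0.75, while the same instruction executing successfully scores 1.0. An agent that produces nothing usable and fails execution scores 0.0.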

Flow Scoring:

  • Per-Step Evaluation: Each flow step is scored individually
  • Combined Results: Step scores aggregated for final flow assessment
  • Partial Credit: Successful steps count even if later steps fail
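
How step scores are aggregated is not spelled out here, so the sketch below assumes a plain arithmetic mean purely for illustration; the framework may weight steps differently. `flow_score` is a hypothetical name.

```rust
/// Aggregate per-step scores (each 0.0 to 1.0) into one flow score.
/// ASSUMPTION: an unweighted mean; the real aggregation rule may differ.
pub fn flow_score(step_scores: &[f64]) -> f64 {
    if step_scores.is_empty() {
        // No steps attempted: treat as zero rather than dividing by zero.
        return 0.0;
    }
    step_scores.iter().sum::<f64>() / step_scores.len() as f64
}
```

With a mean, a two-step flow whose first step scores 1.0 and whose second step scores 0.0 ends at 0.5, which is one way partial credit for successful earlier steps could surface in the final result.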

Validated Score Scenarios:

| Score Range | Test Case | Purpose | Status |
|---|---|---|---|
| ~75% | 003-spl-transfer-fail | Correct instruction, on-chain failure | ✅ Validated |
| ~78.6% | 004-partial-score-spl-transfer | Partial credit (correct ID, some errors) | ✅ Validated |
| ~75% | 100-jup-swap-sol-usdc (pre-fix) | Good reasoning, execution failure | ✅ Validated |
| 100% | 001-sol-transfer, 002-spl-transfer | Perfect execution | ✅ Validated |

Anti-False-Positive Testing:

  • Differentiates between "no attempt" (0%) vs "attempted but failed" (partial credit)
  • Validates granular component scoring (program ID vs data vs accounts)
  • Ensures weighted scoring prevents gaming the system

🔧 Development & Testing

Integration Tests:

# Full test suite (deterministic + AI agents)
cargo test -p reev-runner

# Specific agent testing
cargo test -p reev-runner --test deterministic_agent_test
cargo test -p reev-runner --test llm_agent_test

Example Testing:

# Protocol examples
cargo run -p reev-agent --example 115-jup-lend-mint-usdc

# Flow examples
cargo run -p reev-agent --example 200-jup-swap-then-lend-deposit

Debugging:

# Enable verbose logging
RUST_LOG=debug cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml

# Check surfpool status
curl http://localhost:8899/health
